# 015 — Monitoring: Metrics, SLOs, and Alerts for BZZZ/UCXI/DHT/SLURP

- Area: instrumentation across services, `infrastructure/monitoring/*`
- Priority: Medium

## Background
Prometheus/Grafana/Alertmanager are provisioned, but service metrics and SLO-based alerting for critical paths are incomplete. Operators need actionable dashboards and alerts.

## Scope / Deliverables
- Instrumentation:
  - Expose Prometheus metrics in BZZZ core (peer count, pubsub messages), UCXI (request count, latency, and errors by code), DHT (put/get latency, cache hits), and SLURP (Issue 012 stats).
- Dashboards:
  - Grafana dashboards per component with golden signals (latency, error rate, saturation, traffic) and health.
- SLOs & Alerts:
  - Define SLOs (e.g., UCXI success rate ≥ 99%, DHT p95 get ≤ 300ms, peer count ≥ N) and add alert rules.
  - Alerts for election churn, breaker open (SLURP), DLQ backlog growth, sandbox failures.
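
The example SLOs above can be checked against a window of observations. The following is a minimal sketch, not the project's implementation: the function names, the `SloResult` type, and the `min_peers` parameter (standing in for the unspecified N) are all hypothetical. In a real deployment the p95 would more likely come from a Prometheus histogram than from raw samples.

```python
import math
from dataclasses import dataclass


@dataclass
class SloResult:
    name: str
    value: float
    target: float
    ok: bool


def success_rate(total: int, errors: int) -> float:
    """Fraction of successful requests; an idle window counts as fully successful."""
    if total == 0:
        return 1.0
    return (total - errors) / total


def percentile(samples, q: int):
    """Nearest-rank percentile (integer q in 1..100) of a non-empty sample list."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_slos(ucxi_total, ucxi_errors, dht_get_ms, peer_count, min_peers):
    """Evaluate the three example SLOs for one observation window."""
    rate = success_rate(ucxi_total, ucxi_errors)
    p95 = percentile(dht_get_ms, 95)
    return [
        # UCXI success rate >= 99%
        SloResult("ucxi_success_rate", rate, 0.99, rate >= 0.99),
        # DHT p95 get latency <= 300 ms
        SloResult("dht_get_p95_ms", p95, 300.0, p95 <= 300.0),
        # Peer count >= N (min_peers is an assumed stand-in for N)
        SloResult("peer_count", peer_count, min_peers, peer_count >= min_peers),
    ]
```

Each `SloResult` with `ok == False` corresponds to a condition that an alert rule would be expected to catch.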

## Acceptance Criteria / Tests
- A `curl` request to each component's `/metrics` endpoint returns that component's metrics; Prometheus scrapes without errors.
- Grafana dashboards render with data; alert rules fire under simulated faults (recording rules ok).
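
The `/metrics` check above could be automated by parsing the scraped text and comparing it with a list of expected metric names. This is a rough sketch that handles only the common `name{labels} value` and `name value` sample lines of the Prometheus text exposition format; the metric names in the usage example are hypothetical, since this issue does not fix a naming scheme.

```python
def metric_names(exposition: str) -> set:
    """Return the metric names found in Prometheus text exposition output."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        # Skip blank lines and `# HELP` / `# TYPE` comment lines.
        if not line or line.startswith("#"):
            continue
        # The name ends at the first `{` (labels) or whitespace (value).
        name = line.split("{", 1)[0].split(None, 1)[0]
        names.add(name)
    return names


def missing_metrics(exposition: str, expected) -> list:
    """Return the expected metric names that the output does not contain."""
    return sorted(set(expected) - metric_names(exposition))


# Hypothetical output, e.g. captured with `curl` from a component.
SAMPLE = """\
# HELP bzzz_peer_count Connected peers.
# TYPE bzzz_peer_count gauge
bzzz_peer_count 7
ucxi_requests_total{code="200"} 42
"""

EXPECTED = ["bzzz_peer_count", "ucxi_requests_total", "dht_get_latency_seconds_count"]
print(missing_metrics(SAMPLE, EXPECTED))
```

A non-empty result means a component is not yet exposing everything the dashboards and alert rules depend on.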

## Notes
- Keep scrape configs least-privileged; avoid secret leakage in labels.
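
One way to guard against secret leakage in labels is a lint pass over scraped output that flags label names which look like credentials. The sketch below is a heuristic under stated assumptions: the pattern list and function name are hypothetical, it inspects label names only (not values), and it does not replace a review of what each service exports.

```python
import re

# Assumed list of suspicious label-name fragments; extend as needed.
SUSPECT_LABEL_NAMES = re.compile(
    r"(token|secret|password|passwd|api_?key|authorization)", re.IGNORECASE
)
# Matches label pairs such as code="200" inside `{...}`.
LABEL_PAIR = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def suspicious_labels(exposition: str) -> list:
    """Return (metric, label) pairs whose label name looks like a secret."""
    findings = []
    for line in exposition.splitlines():
        # Comment lines and samples without labels cannot leak via labels.
        if line.startswith("#") or "{" not in line:
            continue
        metric = line.split("{", 1)[0]
        for label, _value in LABEL_PAIR.findall(line):
            if SUSPECT_LABEL_NAMES.search(label):
                findings.append((metric, label))
    return findings
```

Running this against each component's `/metrics` output in CI would surface offending labels before they reach Prometheus.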