# 015 - Monitoring: Metrics, SLOs, and Alerts for BZZZ/UCXI/DHT/SLURP

- Area: instrumentation across services, `infrastructure/monitoring/*`
- Priority: Medium

## Background

Prometheus/Grafana/Alertmanager are provisioned, but service metrics and SLO-based alerting for critical paths are incomplete. Operators need actionable dashboards and alerts.

## Scope / Deliverables

- Instrumentation:
  - Expose Prometheus metrics in BZZZ core (peer count, pubsub msgs), UCXI (req count/latency/errors by code), DHT (put/get latency, cache hits), SLURP (Issue 012 stats).
- Dashboards:
  - Grafana dashboards per component with golden signals (latency, error rate, saturation, traffic) and health.
- SLOs & Alerts:
  - Define SLOs (e.g., UCXI success rate ≥ 99%, DHT p95 get ≤ 300 ms, peer count ≥ N) and add alert rules.
  - Alerts for election churn, breaker open (SLURP), DLQ backlog growth, sandbox failures.

## Acceptance Criteria / Tests

- `curl /metrics` endpoints show component metrics; Prometheus scrapes without errors.
- Grafana dashboards render with data; alert rules fire in simulated faults (recording rules ok).

## Notes

- Keep scrape configs least-privileged; avoid secret leakage in labels.
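
## Example: evaluating the sample SLOs

The example SLOs under Scope / Deliverables (UCXI success rate ≥ 99%, DHT p95 get ≤ 300 ms, peer count ≥ N) can be sanity-checked offline before they are encoded as alert rules. The following is a minimal sketch in plain Python, not the project's implementation. The function names, the `SloResult` type, and the `min_peers` parameter (standing in for the unspecified N) are all hypothetical. The nearest-rank percentile is an assumption chosen for simplicity; Prometheus computes quantiles from histogram buckets, so results may differ slightly.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class SloResult:
    """Outcome of one SLO check (hypothetical type, for illustration only)."""
    name: str
    value: float
    target: float
    ok: bool


def success_rate(total: int, errors: int) -> float:
    """Fraction of successful requests; zero traffic counts as fully successful."""
    if total == 0:
        return 1.0
    return (total - errors) / total


def percentile(samples: Sequence[float], q: int) -> float:
    """Nearest-rank percentile for an integer q in 1..100 over a non-empty sample."""
    if not samples:
        raise ValueError("percentile() needs at least one sample")
    ordered = sorted(samples)
    # Rank is 1-based: the smallest position covering q percent of the samples.
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_slos(
    ucxi_total: int,
    ucxi_errors: int,
    dht_get_latencies_ms: Sequence[float],
    peer_count: int,
    min_peers: int,  # assumption: stands in for the unspecified "N"
) -> List[SloResult]:
    """Check the three example SLOs from the issue against observed values."""
    rate = success_rate(ucxi_total, ucxi_errors)
    p95 = percentile(dht_get_latencies_ms, 95)
    return [
        SloResult("ucxi_success_rate", rate, 0.99, rate >= 0.99),
        SloResult("dht_get_p95_ms", p95, 300.0, p95 <= 300.0),
        SloResult("bzzz_peer_count", float(peer_count), float(min_peers),
                  peer_count >= min_peers),
    ]


if __name__ == "__main__":
    # Simulated fault: two slow DHT gets out of twenty push p95 over the target.
    for result in evaluate_slos(1000, 5, [100.0] * 18 + [500.0] * 2, 4, 3):
        status = "OK" if result.ok else "VIOLATED"
        print(f"{result.name}: value={result.value} target={result.target} {status}")
```

Feeding deliberately bad inputs into a check like this is one cheap way to confirm that the thresholds behave as intended before running the simulated-fault tests named in the acceptance criteria.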