# CHORUS Metrics Package

## Overview

The `pkg/metrics` package provides comprehensive Prometheus-based metrics collection for the CHORUS distributed system. It exposes detailed operational metrics across all system components including P2P networking, DHT operations, PubSub messaging, elections, task management, and resource utilization.

## Architecture

### Core Components

- **CHORUSMetrics**: Central metrics collector managing all Prometheus metrics
- **Prometheus Registry**: Custom registry for metric collection
- **HTTP Server**: Exposes metrics endpoint for scraping
- **Background Collectors**: Periodic system and resource metric collection

### Metric Types

The package uses three Prometheus metric types:

1. **Counter**: Monotonically increasing values (e.g., total messages sent)
2. **Gauge**: Values that can go up or down (e.g., connected peers)
3. **Histogram**: Distribution of values with configurable buckets (e.g., latency measurements)
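For orientation, the sketch below shows how these three metric types are declared with the Prometheus Go client against a custom registry. The metric names come from the catalog below; the variable names and wiring are illustrative, not the package's actual internals.

```go
import "github.com/prometheus/client_golang/prometheus"

// Illustrative sketch only: how counter, gauge, and histogram metrics
// are typically declared on a custom registry with the Prometheus Go client.
var (
    registry = prometheus.NewRegistry() // custom registry, not the global default

    // Counter: monotonically increasing.
    messagesSent = prometheus.NewCounterVec(prometheus.CounterOpts{
        Name: "chorus_p2p_messages_sent_total",
        Help: "Total number of P2P messages sent",
    }, []string{"message_type", "peer_id"})

    // Gauge: can move up or down.
    connectedPeers = prometheus.NewGauge(prometheus.GaugeOpts{
        Name: "chorus_p2p_connected_peers",
        Help: "Number of connected P2P peers",
    })

    // Histogram: observations distributed into buckets.
    messageLatency = prometheus.NewHistogramVec(prometheus.HistogramOpts{
        Name:    "chorus_p2p_message_latency_seconds",
        Help:    "P2P message round-trip latency",
        Buckets: []float64{0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0},
    }, []string{"message_type"})
)

func init() {
    registry.MustRegister(messagesSent, connectedPeers, messageLatency)
}
```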
## Configuration

### MetricsConfig

```go
type MetricsConfig struct {
    // HTTP server configuration
    ListenAddr  string // Default: ":9090"
    MetricsPath string // Default: "/metrics"

    // Histogram buckets
    LatencyBuckets []float64 // Default: 0.001s to 10s
    SizeBuckets    []float64 // Default: 64B to 16MB

    // Node identification labels
    NodeID      string // Unique node identifier
    Version     string // CHORUS version
    Environment string // Deployment environment (dev/staging/prod)
    Cluster     string // Cluster identifier

    // Collection intervals
    SystemMetricsInterval   time.Duration // Default: 30s
    ResourceMetricsInterval time.Duration // Default: 15s
}
```

### Default Configuration

```go
config := metrics.DefaultMetricsConfig()
// Returns:
// - ListenAddr: ":9090"
// - MetricsPath: "/metrics"
// - LatencyBuckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
// - SizeBuckets: [64, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216]
// - SystemMetricsInterval: 30s
// - ResourceMetricsInterval: 15s
```

## Metrics Catalog

### System Metrics

#### chorus_system_info
**Type**: Gauge
**Description**: System information with version labels
**Labels**: `node_id`, `version`, `go_version`, `cluster`, `environment`
**Value**: Always 1 when present

#### chorus_uptime_seconds
**Type**: Gauge
**Description**: System uptime in seconds since start
**Value**: Current uptime in seconds

### P2P Network Metrics

#### chorus_p2p_connected_peers
**Type**: Gauge
**Description**: Number of currently connected P2P peers
**Value**: Current peer count

#### chorus_p2p_messages_sent_total
**Type**: Counter
**Description**: Total number of P2P messages sent
**Labels**: `message_type`, `peer_id`
**Usage**: Track outbound message volume per type and destination

#### chorus_p2p_messages_received_total
**Type**: Counter
**Description**: Total number of P2P messages received
**Labels**: `message_type`, `peer_id`
**Usage**: Track inbound message volume per type and source

#### chorus_p2p_message_latency_seconds
**Type**: Histogram
**Description**: P2P message round-trip latency distribution
**Labels**: `message_type`
**Buckets**: Configurable latency buckets (default: 1ms to 10s)

#### chorus_p2p_connection_duration_seconds
**Type**: Histogram
**Description**: Duration of P2P connections
**Labels**: `peer_id`
**Usage**: Track connection stability

#### chorus_p2p_peer_score
**Type**: Gauge
**Description**: Peer quality score
**Labels**: `peer_id`
**Value**: Score between 0.0 (poor) and 1.0 (excellent)

### DHT (Distributed Hash Table) Metrics

#### chorus_dht_put_operations_total
**Type**: Counter
**Description**: Total number of DHT put operations
**Labels**: `status` (success/failure)
**Usage**: Track DHT write operations

#### chorus_dht_get_operations_total
**Type**: Counter
**Description**: Total number of DHT get operations
**Labels**: `status` (success/failure)
**Usage**: Track DHT read operations

#### chorus_dht_operation_latency_seconds
**Type**: Histogram
**Description**: DHT operation latency distribution
**Labels**: `operation` (put/get), `status` (success/failure)
**Usage**: Monitor DHT performance

#### chorus_dht_provider_records
**Type**: Gauge
**Description**: Number of provider records stored in DHT
**Value**: Current provider record count

#### chorus_dht_content_keys
**Type**: Gauge
**Description**: Number of content keys stored in DHT
**Value**: Current content key count

#### chorus_dht_replication_factor
**Type**: Gauge
**Description**: Replication factor for DHT keys
**Labels**: `key_hash`
**Value**: Number of replicas for specific keys

#### chorus_dht_cache_hits_total
**Type**: Counter
**Description**: DHT cache hit count
**Labels**: `cache_type`
**Usage**: Monitor DHT caching effectiveness

#### chorus_dht_cache_misses_total
**Type**: Counter
**Description**: DHT cache miss count
**Labels**: `cache_type`
**Usage**: Monitor DHT caching effectiveness

### PubSub Messaging Metrics

#### chorus_pubsub_topics
**Type**: Gauge
**Description**: Number of active PubSub topics
**Value**: Current topic count

#### chorus_pubsub_subscribers
**Type**: Gauge
**Description**: Number of subscribers per topic
**Labels**: `topic`
**Value**: Subscriber count for each topic

#### chorus_pubsub_messages_total
**Type**: Counter
**Description**: Total PubSub messages
**Labels**: `topic`, `direction` (sent/received), `message_type`
**Usage**: Track message volume per topic

#### chorus_pubsub_message_latency_seconds
**Type**: Histogram
**Description**: PubSub message delivery latency
**Labels**: `topic`
**Usage**: Monitor message propagation performance

#### chorus_pubsub_message_size_bytes
**Type**: Histogram
**Description**: PubSub message size distribution
**Labels**: `topic`
**Buckets**: Configurable size buckets (default: 64B to 16MB)

### Election System Metrics

#### chorus_election_term
**Type**: Gauge
**Description**: Current election term number
**Value**: Monotonically increasing term number

#### chorus_election_state
**Type**: Gauge
**Description**: Current election state (1 for active state, 0 for others)
**Labels**: `state` (idle/discovering/electing/reconstructing/complete)
**Usage**: Only one state should have value 1 at any time

#### chorus_heartbeats_sent_total
**Type**: Counter
**Description**: Total number of heartbeats sent by this node
**Usage**: Monitor leader heartbeat activity

#### chorus_heartbeats_received_total
**Type**: Counter
**Description**: Total number of heartbeats received from leader
**Usage**: Monitor follower connectivity to leader

#### chorus_leadership_changes_total
**Type**: Counter
**Description**: Total number of leadership changes
**Usage**: Monitor election stability (lower is better)

#### chorus_leader_uptime_seconds
**Type**: Gauge
**Description**: Current leader's tenure duration
**Value**: Seconds since current leader was elected

#### chorus_election_latency_seconds
**Type**: Histogram
**Description**: Time taken to complete election process
**Usage**: Monitor election efficiency
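Because `chorus_election_state` is a one-hot gauge, an update must clear the previous state as it sets the new one. A minimal sketch of that pattern with the Prometheus Go client follows; the helper and its GaugeVec argument are illustrative (the package exposes `SetElectionState` for this), but the invariant is the one described above.

```go
import "github.com/prometheus/client_golang/prometheus"

// electionStates mirrors the label values of chorus_election_state.
var electionStates = []string{"idle", "discovering", "electing", "reconstructing", "complete"}

// setElectionState zeroes every state series and then sets the active one
// to 1, preserving the one-hot invariant. Illustrative helper only.
func setElectionState(stateGauge *prometheus.GaugeVec, active string) {
    for _, s := range electionStates {
        stateGauge.WithLabelValues(s).Set(0)
    }
    stateGauge.WithLabelValues(active).Set(1)
}
```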
### Health Monitoring Metrics

#### chorus_health_checks_passed_total
**Type**: Counter
**Description**: Total number of health checks passed
**Labels**: `check_name`
**Usage**: Track health check success rate

#### chorus_health_checks_failed_total
**Type**: Counter
**Description**: Total number of health checks failed
**Labels**: `check_name`, `reason`
**Usage**: Track health check failures and reasons

#### chorus_health_check_duration_seconds
**Type**: Histogram
**Description**: Health check execution duration
**Labels**: `check_name`
**Usage**: Monitor health check performance

#### chorus_system_health_score
**Type**: Gauge
**Description**: Overall system health score
**Value**: 0.0 (unhealthy) to 1.0 (healthy)
**Usage**: Monitor overall system health

#### chorus_component_health_score
**Type**: Gauge
**Description**: Component-specific health score
**Labels**: `component`
**Value**: 0.0 (unhealthy) to 1.0 (healthy)
**Usage**: Track individual component health

### Task Management Metrics

#### chorus_tasks_active
**Type**: Gauge
**Description**: Number of currently active tasks
**Value**: Current active task count

#### chorus_tasks_queued
**Type**: Gauge
**Description**: Number of queued tasks awaiting execution
**Value**: Current queue depth

#### chorus_tasks_completed_total
**Type**: Counter
**Description**: Total number of completed tasks
**Labels**: `status` (success/failure), `task_type`
**Usage**: Track task completion and success rate

#### chorus_task_duration_seconds
**Type**: Histogram
**Description**: Task execution duration distribution
**Labels**: `task_type`, `status`
**Usage**: Monitor task performance

#### chorus_task_queue_wait_time_seconds
**Type**: Histogram
**Description**: Time tasks spend in the queue before execution
**Usage**: Monitor task scheduling efficiency

### SLURP (Context Generation) Metrics

#### chorus_slurp_contexts_generated_total
**Type**: Counter
**Description**: Total number of SLURP contexts generated
**Labels**: `role`, `status` (success/failure)
**Usage**: Track context generation volume

#### chorus_slurp_generation_time_seconds
**Type**: Histogram
**Description**: Time taken to generate SLURP contexts
**Buckets**: [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0]
**Usage**: Monitor context generation performance

#### chorus_slurp_queue_length
**Type**: Gauge
**Description**: Length of the SLURP generation queue
**Value**: Current queue depth

#### chorus_slurp_active_jobs
**Type**: Gauge
**Description**: Number of active SLURP generation jobs
**Value**: Currently running generation jobs

#### chorus_slurp_leadership_events_total
**Type**: Counter
**Description**: SLURP-related leadership events
**Usage**: Track leader-initiated context generation

### SHHH (Secret Sentinel) Metrics

#### chorus_shhh_findings_total
**Type**: Counter
**Description**: Total number of SHHH redaction findings
**Labels**: `rule`, `severity` (low/medium/high/critical)
**Usage**: Monitor secret detection effectiveness

### UCXI (Protocol Resolution) Metrics

#### chorus_ucxi_requests_total
**Type**: Counter
**Description**: Total number of UCXI protocol requests
**Labels**: `method`, `status` (success/failure)
**Usage**: Track UCXI usage and success rate

#### chorus_ucxi_resolution_latency_seconds
**Type**: Histogram
**Description**: UCXI address resolution latency
**Usage**: Monitor resolution performance

#### chorus_ucxi_cache_hits_total
**Type**: Counter
**Description**: UCXI cache hit count
**Usage**: Monitor caching effectiveness

#### chorus_ucxi_cache_misses_total
**Type**: Counter
**Description**: UCXI cache miss count
**Usage**: Monitor caching effectiveness

#### chorus_ucxi_content_size_bytes
**Type**: Histogram
**Description**: Size of resolved UCXI content
**Usage**: Monitor content distribution

### Resource Utilization Metrics

#### chorus_cpu_usage_ratio
**Type**: Gauge
**Description**: CPU usage ratio
**Value**: 0.0 (idle) to 1.0 (fully utilized)

#### chorus_memory_usage_bytes
**Type**: Gauge
**Description**: Memory usage in bytes
**Value**: Current memory consumption

#### chorus_disk_usage_ratio
**Type**: Gauge
**Description**: Disk usage ratio
**Labels**: `mount_point`
**Value**: 0.0 (empty) to 1.0 (full)

#### chorus_network_bytes_in_total
**Type**: Counter
**Description**: Total bytes received from the network
**Usage**: Track inbound network traffic

#### chorus_network_bytes_out_total
**Type**: Counter
**Description**: Total bytes sent to the network
**Usage**: Track outbound network traffic

#### chorus_goroutines
**Type**: Gauge
**Description**: Number of active goroutines
**Value**: Current goroutine count

### Error Metrics

#### chorus_errors_total
**Type**: Counter
**Description**: Total number of errors
**Labels**: `component`, `error_type`
**Usage**: Track error frequency by component and type

#### chorus_panics_total
**Type**: Counter
**Description**: Total number of panics recovered
**Usage**: Monitor system stability
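As one way to consume the error counters, the following PromQL (an illustrative query, not shipped with the package) surfaces the noisiest components over the last five minutes:

```promql
# Top 5 components by error rate over the last 5 minutes
topk(5, sum by (component) (rate(chorus_errors_total[5m])))
```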
## Usage Examples

### Basic Initialization

```go
import "chorus/pkg/metrics"

// Create metrics collector with default config
config := metrics.DefaultMetricsConfig()
config.NodeID = "chorus-node-01"
config.Version = "v1.0.0"
config.Environment = "production"
config.Cluster = "cluster-01"

metricsCollector := metrics.NewCHORUSMetrics(config)

// Start metrics HTTP server
if err := metricsCollector.StartServer(config); err != nil {
    log.Fatalf("Failed to start metrics server: %v", err)
}

// Start background metric collection
metricsCollector.CollectMetrics(config)
```

### Recording P2P Metrics

```go
// Update peer count
metricsCollector.SetConnectedPeers(5)

// Record message sent
metricsCollector.IncrementMessagesSent("task_assignment", "peer-abc123")

// Record message received
metricsCollector.IncrementMessagesReceived("task_result", "peer-def456")

// Record message latency
startTime := time.Now()
// ... send message and wait for response ...
latency := time.Since(startTime)
metricsCollector.ObserveMessageLatency("task_assignment", latency)
```

### Recording DHT Metrics

```go
// Record DHT put operation
startTime := time.Now()
err := dht.Put(key, value)
latency := time.Since(startTime)

if err != nil {
    metricsCollector.IncrementDHTPutOperations("failure")
    metricsCollector.ObserveDHTOperationLatency("put", "failure", latency)
} else {
    metricsCollector.IncrementDHTPutOperations("success")
    metricsCollector.ObserveDHTOperationLatency("put", "success", latency)
}

// Update DHT statistics
metricsCollector.SetDHTProviderRecords(150)
metricsCollector.SetDHTContentKeys(450)
metricsCollector.SetDHTReplicationFactor("key-hash-123", 3.0)
```
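The examples above and below repeat the same `time.Now`/`time.Since` pattern. Where that boilerplate gets noisy, a small wrapper can centralize it; `timed` below is a hypothetical convenience helper, not part of the `pkg/metrics` API.

```go
import "time"

// timed runs op, derives a success/failure status from its error, and
// reports the measured duration through the observe callback.
// Hypothetical helper, not part of pkg/metrics.
func timed(op func() error, observe func(status string, d time.Duration)) error {
    start := time.Now()
    err := op()
    status := "success"
    if err != nil {
        status = "failure"
    }
    observe(status, time.Since(start))
    return err
}

// Usage with the DHT helpers shown above:
err := timed(
    func() error { return dht.Put(key, value) },
    func(status string, d time.Duration) {
        metricsCollector.IncrementDHTPutOperations(status)
        metricsCollector.ObserveDHTOperationLatency("put", status, d)
    },
)
```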
### Recording PubSub Metrics

```go
// Update topic count
metricsCollector.SetPubSubTopics(10)

// Record message published
metricsCollector.IncrementPubSubMessages("CHORUS/tasks/v1", "sent", "task_created")

// Record message received
metricsCollector.IncrementPubSubMessages("CHORUS/tasks/v1", "received", "task_completed")

// Record message latency
startTime := time.Now()
// ... publish message and wait for delivery confirmation ...
latency := time.Since(startTime)
metricsCollector.ObservePubSubMessageLatency("CHORUS/tasks/v1", latency)
```

### Recording Election Metrics

```go
// Update election state
metricsCollector.SetElectionTerm(42)
metricsCollector.SetElectionState("idle")

// Record heartbeat sent (leader)
metricsCollector.IncrementHeartbeatsSent()

// Record heartbeat received (follower)
metricsCollector.IncrementHeartbeatsReceived()

// Record leadership change
metricsCollector.IncrementLeadershipChanges()
```

### Recording Health Metrics

```go
// Record health check success
metricsCollector.IncrementHealthCheckPassed("database-connectivity")

// Record health check failure
metricsCollector.IncrementHealthCheckFailed("p2p-connectivity", "no_peers")

// Update health scores
metricsCollector.SetSystemHealthScore(0.95)
metricsCollector.SetComponentHealthScore("dht", 0.98)
metricsCollector.SetComponentHealthScore("pubsub", 0.92)
```

### Recording Task Metrics

```go
// Update task counts
metricsCollector.SetActiveTasks(5)
metricsCollector.SetQueuedTasks(12)

// Record task completion
startTime := time.Now()
// ... execute task ...
duration := time.Since(startTime)

metricsCollector.IncrementTasksCompleted("success", "data_processing")
metricsCollector.ObserveTaskDuration("data_processing", "success", duration)
```

### Recording SLURP Metrics

```go
// Record context generation
startTime := time.Now()
// ... generate SLURP context ...
duration := time.Since(startTime)

metricsCollector.IncrementSLURPGenerated("admin", "success")
metricsCollector.ObserveSLURPGenerationTime(duration)

// Update queue length
metricsCollector.SetSLURPQueueLength(3)
```

### Recording SHHH Metrics

```go
// Record secret findings
findings := scanForSecrets(content)
for _, finding := range findings {
    metricsCollector.IncrementSHHHFindings(finding.Rule, finding.Severity, 1)
}
```

### Recording Resource Metrics

```go
import "runtime"

// Get runtime stats
var memStats runtime.MemStats
runtime.ReadMemStats(&memStats)

metricsCollector.SetMemoryUsage(float64(memStats.Alloc))
metricsCollector.SetGoroutines(runtime.NumGoroutine())

// Record system resource usage
metricsCollector.SetCPUUsage(0.45)                     // 45% CPU usage
metricsCollector.SetDiskUsage("/var/lib/CHORUS", 0.73) // 73% disk usage
```

### Recording Errors

```go
// Record error occurrence
if err != nil {
    metricsCollector.IncrementErrors("dht", "timeout")
}

// Record recovered panic
defer func() {
    if r := recover(); r != nil {
        metricsCollector.IncrementPanics()
        // Handle panic...
    }
}()
```
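The resource example above uses hardcoded CPU and disk values. One possible way to obtain real numbers is the third-party gopsutil library; the sketch below assumes gopsutil v3 and is not something `pkg/metrics` itself depends on.

```go
import (
    "time"

    "github.com/shirou/gopsutil/v3/cpu"
    "github.com/shirou/gopsutil/v3/disk"
)

// Sample real utilization and feed it into the collector.
// gopsutil is an assumed dependency; CHORUS may gather these differently.
percents, err := cpu.Percent(time.Second, false) // aggregate CPU over a 1s window
if err == nil && len(percents) > 0 {
    metricsCollector.SetCPUUsage(percents[0] / 100.0) // convert percentage to 0.0-1.0 ratio
}

usage, err := disk.Usage("/var/lib/CHORUS")
if err == nil {
    metricsCollector.SetDiskUsage("/var/lib/CHORUS", usage.UsedPercent/100.0)
}
```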
## Prometheus Integration

### Scrape Configuration

Add the following to your `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'chorus-nodes'
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'chorus-node-01:9090'
          - 'chorus-node-02:9090'
          - 'chorus-node-03:9090'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        target_label: node
        replacement: '${1}'
```

### Example Queries

#### P2P Network Health

```promql
# Average connected peers across cluster
avg(chorus_p2p_connected_peers)

# Message rate per second
rate(chorus_p2p_messages_sent_total[5m])

# 95th percentile message latency
histogram_quantile(0.95, rate(chorus_p2p_message_latency_seconds_bucket[5m]))
```

#### DHT Performance

```promql
# DHT operation success rate
rate(chorus_dht_get_operations_total{status="success"}[5m]) /
rate(chorus_dht_get_operations_total[5m])

# Average DHT operation latency
rate(chorus_dht_operation_latency_seconds_sum[5m]) /
rate(chorus_dht_operation_latency_seconds_count[5m])

# DHT cache hit rate
rate(chorus_dht_cache_hits_total[5m]) /
(rate(chorus_dht_cache_hits_total[5m]) + rate(chorus_dht_cache_misses_total[5m]))
```

#### Election Stability

```promql
# Leadership changes per hour
rate(chorus_leadership_changes_total[1h]) * 3600

# Nodes by election state
sum by (state) (chorus_election_state)

# Heartbeat rate
rate(chorus_heartbeats_sent_total[5m])
```

#### Task Management

```promql
# Task success rate
rate(chorus_tasks_completed_total{status="success"}[5m]) /
rate(chorus_tasks_completed_total[5m])

# Median task duration
histogram_quantile(0.50, rate(chorus_task_duration_seconds_bucket[5m]))

# Task queue depth
chorus_tasks_queued
```

#### Resource Utilization

```promql
# CPU usage by node
chorus_cpu_usage_ratio

# Memory usage by node
chorus_memory_usage_bytes / (1024 * 1024 * 1024) # Convert to GB

# Disk usage alert (>90%)
chorus_disk_usage_ratio > 0.9
```

#### System Health

```promql
# Overall system health score
chorus_system_health_score

# Component health scores
chorus_component_health_score

# Health check failure rate
rate(chorus_health_checks_failed_total[5m])
```
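Dashboards that reuse the ratio queries above can precompute them with Prometheus recording rules. The rule group below is a suggested configuration, not something the package ships:

```yaml
groups:
  - name: chorus_recording_rules
    interval: 30s
    rules:
      # Precomputed DHT get success ratio
      - record: chorus:dht_get_success_ratio:rate5m
        expr: |
          rate(chorus_dht_get_operations_total{status="success"}[5m])
          /
          rate(chorus_dht_get_operations_total[5m])
      # Precomputed task success ratio
      - record: chorus:task_success_ratio:rate5m
        expr: |
          rate(chorus_tasks_completed_total{status="success"}[5m])
          /
          rate(chorus_tasks_completed_total[5m])
```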
### Alerting Rules

Example Prometheus alerting rules for CHORUS:

```yaml
groups:
  - name: chorus_alerts
    interval: 30s
    rules:
      # P2P connectivity alerts
      - alert: LowPeerCount
        expr: chorus_p2p_connected_peers < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low P2P peer count on {{ $labels.instance }}"
          description: "Node has {{ $value }} peers (minimum: 2)"

      # DHT performance alerts
      - alert: HighDHTFailureRate
        expr: |
          rate(chorus_dht_get_operations_total{status="failure"}[5m])
          /
          rate(chorus_dht_get_operations_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High DHT failure rate on {{ $labels.instance }}"
          description: "DHT failure rate: {{ $value | humanizePercentage }}"

      # Election stability alerts
      - alert: FrequentLeadershipChanges
        expr: rate(chorus_leadership_changes_total[1h]) * 3600 > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Frequent leadership changes"
          description: "{{ $value }} leadership changes per hour"

      # Task management alerts
      - alert: HighTaskQueueDepth
        expr: chorus_tasks_queued > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High task queue depth on {{ $labels.instance }}"
          description: "{{ $value }} tasks queued"

      # Resource alerts
      - alert: HighMemoryUsage
        expr: chorus_memory_usage_bytes > 8 * 1024 * 1024 * 1024 # 8GB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage: {{ $value | humanize1024 }}B"

      - alert: HighDiskUsage
        expr: chorus_disk_usage_ratio > 0.9
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Disk usage: {{ $value | humanizePercentage }}"

      # Health monitoring alerts
      - alert: LowSystemHealth
        expr: chorus_system_health_score < 0.75
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low system health score on {{ $labels.instance }}"
          description: "Health score: {{ $value }}"

      - alert: ComponentUnhealthy
        expr: chorus_component_health_score < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Component {{ $labels.component }} unhealthy"
          description: "Health score: {{ $value }}"
```

## HTTP Endpoints

### Metrics Endpoint

**URL**: `/metrics`
**Method**: GET
**Description**: Prometheus metrics in text exposition format

**Response Format**:
```
# HELP chorus_p2p_connected_peers Number of connected P2P peers
# TYPE chorus_p2p_connected_peers gauge
chorus_p2p_connected_peers 5

# HELP chorus_dht_put_operations_total Total number of DHT put operations
# TYPE chorus_dht_put_operations_total counter
chorus_dht_put_operations_total{status="success"} 1523
chorus_dht_put_operations_total{status="failure"} 12

# HELP chorus_task_duration_seconds Task execution duration
# TYPE chorus_task_duration_seconds histogram
chorus_task_duration_seconds_bucket{task_type="data_processing",status="success",le="0.001"} 0
chorus_task_duration_seconds_bucket{task_type="data_processing",status="success",le="0.005"} 12
chorus_task_duration_seconds_bucket{task_type="data_processing",status="success",le="0.01"} 45
...
```

### Health Endpoint

**URL**: `/health`
**Method**: GET
**Description**: Basic health check for metrics server
**Response**: `200 OK` with body `OK`

## Best Practices

### Metric Naming

- Use descriptive metric names with the `chorus_` prefix
- Follow Prometheus naming conventions: `component_metric_unit`
- Use the `_total` suffix for counters
- Use the `_seconds` suffix for time measurements
- Use the `_bytes` suffix for size measurements

### Label Usage

- Keep label cardinality low (avoid high-cardinality labels like request IDs)
- Use consistent label names across metrics
- Document label meanings and expected values
- Avoid labels that change frequently

### Performance Considerations

- Metrics collection is lock-free for read operations
- Histogram observations are optimized for high throughput
- Background collectors run on separate goroutines
- Custom registry prevents pollution of the default registry

### Error Handling

- Metrics collection should never panic
- Failed metric updates should be logged but not block operations
- Use nil checks before accessing metrics collectors, as shown in the sketch below
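A minimal sketch of such a nil check, assuming the `*metrics.CHORUSMetrics` type from the usage examples; the wrapper itself is illustrative, not part of the package:

```go
// safeIncrementErrors tolerates a nil collector so instrumented code paths
// never panic when metrics are disabled. Illustrative, not part of pkg/metrics.
func safeIncrementErrors(m *metrics.CHORUSMetrics, component, errorType string) {
    if m == nil {
        return
    }
    m.IncrementErrors(component, errorType)
}
```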
### Testing

```go
func TestMetrics(t *testing.T) {
    config := metrics.DefaultMetricsConfig()
    config.NodeID = "test-node"

    m := metrics.NewCHORUSMetrics(config)

    // Test metric updates
    m.SetConnectedPeers(5)
    m.IncrementMessagesSent("test", "peer1")

    // Verify metrics are collected
    // (Use prometheus testutil for verification)
}
```

## Troubleshooting

### Metrics Not Appearing

1. Verify metrics server is running: `curl http://localhost:9090/metrics`
2. Check configuration: ensure correct `ListenAddr` and `MetricsPath`
3. Verify Prometheus scrape configuration
4. Check for errors in application logs

### High Memory Usage

1. Review label cardinality (check for unbounded label values)
2. Adjust histogram buckets if too granular
3. Reduce metric collection frequency
4. Consider metric retention policies in Prometheus

### Missing Metrics

1. Ensure the metric is being updated by application code
2. Verify metric registration in `initializeMetrics()`
3. Check for race conditions in metric access
4. Review metric type compatibility (Counter vs Gauge vs Histogram)

## Migration Guide

### From Default Prometheus Registry

```go
// Old approach
prometheus.MustRegister(myCounter)

// New approach
config := metrics.DefaultMetricsConfig()
m := metrics.NewCHORUSMetrics(config)
// Use m.IncrementErrors(...) instead of direct counter access
```

### Adding New Metrics

1. Add the metric field to the `CHORUSMetrics` struct
2. Initialize the metric in the `initializeMetrics()` method
3. Add helper methods for updating the metric
4. Document the metric in this file
5. Add Prometheus queries and alerts as needed

## Related Documentation

- [Health Package Documentation](./health.md)
- [Shutdown Package Documentation](./shutdown.md)
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Prometheus Best Practices](https://prometheus.io/docs/practices/naming/)