| # CHORUS Metrics Package
 | |
| 
 | |
| ## Overview
 | |
| 
 | |
| The `pkg/metrics` package provides comprehensive Prometheus-based metrics collection for the CHORUS distributed system. It exposes detailed operational metrics across all system components including P2P networking, DHT operations, PubSub messaging, elections, task management, and resource utilization.
 | |
| 
 | |
| ## Architecture
 | |
| 
 | |
| ### Core Components
 | |
| 
 | |
| - **CHORUSMetrics**: Central metrics collector managing all Prometheus metrics
 | |
| - **Prometheus Registry**: Custom registry for metric collection
 | |
| - **HTTP Server**: Exposes metrics endpoint for scraping
 | |
| - **Background Collectors**: Periodic system and resource metric collection
 | |
| 
 | |
| ### Metric Types
 | |
| 
 | |
| The package uses three Prometheus metric types:
 | |
| 
 | |
| 1. **Counter**: Monotonically increasing values (e.g., total messages sent)
 | |
| 2. **Gauge**: Values that can go up or down (e.g., connected peers)
 | |
| 3. **Histogram**: Distribution of values with configurable buckets (e.g., latency measurements)
 | |
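To make the histogram type more concrete, the following stdlib-only sketch mimics how observations are reported in cumulative `le` buckets in the Prometheus exposition format. The `histogram` type and its methods are illustrative stand-ins, not the types used inside `pkg/metrics`.

```go
package main

import "fmt"

// histogram is a stdlib-only stand-in for a Prometheus histogram.
// Counts are cumulative: counts[i] covers all observations <= bounds[i].
type histogram struct {
	bounds []float64 // upper bounds, ascending
	counts []uint64  // cumulative count per upper bound
	sum    float64   // sum of all observed values (the _sum series)
	total  uint64    // number of observations (the +Inf bucket and _count)
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds))}
}

// observe counts v in every bucket whose upper bound is >= v.
func (h *histogram) observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
	h.sum += v
	h.total++
}

func main() {
	h := newHistogram([]float64{0.001, 0.01, 0.1, 1})
	for _, v := range []float64{0.0005, 0.02, 0.5, 3} {
		h.observe(v)
	}
	fmt.Println(h.counts, h.total) // prints "[1 1 2 3] 4"
}
```

The value `3` exceeds every finite bound, so it is only visible in the total, which corresponds to the implicit `+Inf` bucket.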
| 
 | |
| ## Configuration
 | |
| 
 | |
| ### MetricsConfig
 | |
| 
 | |
| ```go
 | |
| type MetricsConfig struct {
 | |
|     // HTTP server configuration
 | |
|     ListenAddr  string        // Default: ":9090"
 | |
|     MetricsPath string        // Default: "/metrics"
 | |
| 
 | |
|     // Histogram buckets
 | |
|     LatencyBuckets []float64  // Default: 0.001s to 10s
 | |
|     SizeBuckets    []float64  // Default: 64B to 16MB
 | |
| 
 | |
|     // Node identification labels
 | |
|     NodeID      string        // Unique node identifier
 | |
|     Version     string        // CHORUS version
 | |
|     Environment string        // deployment environment (dev/staging/prod)
 | |
|     Cluster     string        // cluster identifier
 | |
| 
 | |
|     // Collection intervals
 | |
|     SystemMetricsInterval   time.Duration  // Default: 30s
 | |
|     ResourceMetricsInterval time.Duration  // Default: 15s
 | |
| }
 | |
| ```
 | |
| 
 | |
| ### Default Configuration
 | |
| 
 | |
| ```go
 | |
| config := metrics.DefaultMetricsConfig()
 | |
| // Returns:
 | |
| // - ListenAddr: ":9090"
 | |
| // - MetricsPath: "/metrics"
 | |
| // - LatencyBuckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
 | |
| // - SizeBuckets: [64, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216]
 | |
| // - SystemMetricsInterval: 30s
 | |
| // - ResourceMetricsInterval: 15s
 | |
| ```
 | |
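The default size buckets listed above are powers of four from 64 B up to 16 MiB. As a sketch, they can be reproduced with a small helper; `exponentialBuckets` here is a local, hypothetical function that mirrors what `prometheus.ExponentialBuckets` in `client_golang` does. Whether `pkg/metrics` builds its defaults this way is an assumption.

```go
package main

import "fmt"

// exponentialBuckets returns count upper bounds, starting at start and
// multiplying by factor at each step.
func exponentialBuckets(start, factor float64, count int) []float64 {
	buckets := make([]float64, 0, count)
	v := start
	for i := 0; i < count; i++ {
		buckets = append(buckets, v)
		v *= factor
	}
	return buckets
}

func main() {
	sizes := exponentialBuckets(64, 4, 10)
	// First and last bounds: 64 bytes and 16777216 bytes (16 MiB).
	fmt.Printf("%.0f %.0f\n", sizes[0], sizes[len(sizes)-1])
}
```

The latency defaults do not follow a single factor, so they are better written out as an explicit slice when overriding `LatencyBuckets`.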
| 
 | |
| ## Metrics Catalog
 | |
| 
 | |
| ### System Metrics
 | |
| 
 | |
| #### chorus_system_info
 | |
| **Type**: Gauge
 | |
| **Description**: System information with version labels
 | |
| **Labels**: `node_id`, `version`, `go_version`, `cluster`, `environment`
 | |
| **Value**: Always 1 when present
 | |
| 
 | |
| #### chorus_uptime_seconds
 | |
| **Type**: Gauge
 | |
| **Description**: System uptime in seconds since start
 | |
| **Value**: Current uptime in seconds
 | |
| 
 | |
| ### P2P Network Metrics
 | |
| 
 | |
| #### chorus_p2p_connected_peers
 | |
| **Type**: Gauge
 | |
| **Description**: Number of currently connected P2P peers
 | |
| **Value**: Current peer count
 | |
| 
 | |
| #### chorus_p2p_messages_sent_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of P2P messages sent
 | |
| **Labels**: `message_type`, `peer_id`
 | |
| **Usage**: Track outbound message volume per type and destination
 | |
| 
 | |
| #### chorus_p2p_messages_received_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of P2P messages received
 | |
| **Labels**: `message_type`, `peer_id`
 | |
| **Usage**: Track inbound message volume per type and source
 | |
| 
 | |
| #### chorus_p2p_message_latency_seconds
 | |
| **Type**: Histogram
 | |
| **Description**: P2P message round-trip latency distribution
 | |
| **Labels**: `message_type`
 | |
| **Buckets**: Configurable latency buckets (default: 1ms to 10s)
 | |
| 
 | |
| #### chorus_p2p_connection_duration_seconds
 | |
| **Type**: Histogram
 | |
| **Description**: Duration of P2P connections
 | |
| **Labels**: `peer_id`
 | |
| **Usage**: Track connection stability
 | |
| 
 | |
| #### chorus_p2p_peer_score
 | |
| **Type**: Gauge
 | |
| **Description**: Peer quality score
 | |
| **Labels**: `peer_id`
 | |
| **Value**: Score between 0.0 (poor) and 1.0 (excellent)
 | |
| 
 | |
| ### DHT (Distributed Hash Table) Metrics
 | |
| 
 | |
| #### chorus_dht_put_operations_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of DHT put operations
 | |
| **Labels**: `status` (success/failure)
 | |
| **Usage**: Track DHT write operations
 | |
| 
 | |
| #### chorus_dht_get_operations_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of DHT get operations
 | |
| **Labels**: `status` (success/failure)
 | |
| **Usage**: Track DHT read operations
 | |
| 
 | |
| #### chorus_dht_operation_latency_seconds
 | |
| **Type**: Histogram
 | |
| **Description**: DHT operation latency distribution
 | |
| **Labels**: `operation` (put/get), `status` (success/failure)
 | |
| **Usage**: Monitor DHT performance
 | |
| 
 | |
| #### chorus_dht_provider_records
 | |
| **Type**: Gauge
 | |
| **Description**: Number of provider records stored in DHT
 | |
| **Value**: Current provider record count
 | |
| 
 | |
| #### chorus_dht_content_keys
 | |
| **Type**: Gauge
 | |
| **Description**: Number of content keys stored in DHT
 | |
| **Value**: Current content key count
 | |
| 
 | |
| #### chorus_dht_replication_factor
 | |
| **Type**: Gauge
 | |
| **Description**: Replication factor for DHT keys
 | |
| **Labels**: `key_hash`
 | |
| **Value**: Number of replicas for specific keys
 | |
| 
 | |
| #### chorus_dht_cache_hits_total
 | |
| **Type**: Counter
 | |
| **Description**: DHT cache hit count
 | |
| **Labels**: `cache_type`
 | |
| **Usage**: Monitor DHT caching effectiveness
 | |
| 
 | |
| #### chorus_dht_cache_misses_total
 | |
| **Type**: Counter
 | |
| **Description**: DHT cache miss count
 | |
| **Labels**: `cache_type`
 | |
| **Usage**: Monitor DHT caching effectiveness
 | |
| 
 | |
| ### PubSub Messaging Metrics
 | |
| 
 | |
| #### chorus_pubsub_topics
 | |
| **Type**: Gauge
 | |
| **Description**: Number of active PubSub topics
 | |
| **Value**: Current topic count
 | |
| 
 | |
| #### chorus_pubsub_subscribers
 | |
| **Type**: Gauge
 | |
| **Description**: Number of subscribers per topic
 | |
| **Labels**: `topic`
 | |
| **Value**: Subscriber count for each topic
 | |
| 
 | |
| #### chorus_pubsub_messages_total
 | |
| **Type**: Counter
 | |
| **Description**: Total PubSub messages
 | |
| **Labels**: `topic`, `direction` (sent/received), `message_type`
 | |
| **Usage**: Track message volume per topic
 | |
| 
 | |
| #### chorus_pubsub_message_latency_seconds
 | |
| **Type**: Histogram
 | |
| **Description**: PubSub message delivery latency
 | |
| **Labels**: `topic`
 | |
| **Usage**: Monitor message propagation performance
 | |
| 
 | |
| #### chorus_pubsub_message_size_bytes
 | |
| **Type**: Histogram
 | |
| **Description**: PubSub message size distribution
 | |
| **Labels**: `topic`
 | |
| **Buckets**: Configurable size buckets (default: 64B to 16MB)
 | |
| 
 | |
| ### Election System Metrics
 | |
| 
 | |
| #### chorus_election_term
 | |
| **Type**: Gauge
 | |
| **Description**: Current election term number
 | |
| **Value**: Monotonically increasing term number
 | |
| 
 | |
| #### chorus_election_state
 | |
| **Type**: Gauge
 | |
| **Description**: Current election state (1 for active state, 0 for others)
 | |
| **Labels**: `state` (idle/discovering/electing/reconstructing/complete)
 | |
| **Usage**: Only one state should have value 1 at any time
 | |
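The "only one state has value 1" rule is a one-hot encoding over the `state` label. A minimal sketch of that encoding, assuming the five label values listed above and a hypothetical helper named `oneHot`:

```go
package main

import "fmt"

// electionStates lists the label values documented for chorus_election_state.
var electionStates = []string{"idle", "discovering", "electing", "reconstructing", "complete"}

// oneHot returns the gauge value for every state label: 1 for the active
// state and 0 for all others. An unknown state yields all zeros.
func oneHot(active string) map[string]float64 {
	values := make(map[string]float64, len(electionStates))
	for _, s := range electionStates {
		if s == active {
			values[s] = 1
		} else {
			values[s] = 0
		}
	}
	return values
}

func main() {
	// Map iteration order is random, so the lines print in no fixed order.
	for state, v := range oneHot("electing") {
		fmt.Printf("chorus_election_state{state=%q} %g\n", state, v)
	}
}
```

Writing an explicit 0 for the inactive states matters: if only the active state were set, a previously active state would keep its stale value of 1.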
| 
 | |
| #### chorus_heartbeats_sent_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of heartbeats sent by this node
 | |
| **Usage**: Monitor leader heartbeat activity
 | |
| 
 | |
| #### chorus_heartbeats_received_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of heartbeats received from leader
 | |
| **Usage**: Monitor follower connectivity to leader
 | |
| 
 | |
| #### chorus_leadership_changes_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of leadership changes
 | |
| **Usage**: Monitor election stability (lower is better)
 | |
| 
 | |
| #### chorus_leader_uptime_seconds
 | |
| **Type**: Gauge
 | |
| **Description**: Current leader's tenure duration
 | |
| **Value**: Seconds since current leader was elected
 | |
| 
 | |
| #### chorus_election_latency_seconds
 | |
| **Type**: Histogram
 | |
| **Description**: Time taken to complete election process
 | |
| **Usage**: Monitor election efficiency
 | |
| 
 | |
| ### Health Monitoring Metrics
 | |
| 
 | |
| #### chorus_health_checks_passed_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of health checks passed
 | |
| **Labels**: `check_name`
 | |
| **Usage**: Track health check success rate
 | |
| 
 | |
| #### chorus_health_checks_failed_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of health checks failed
 | |
| **Labels**: `check_name`, `reason`
 | |
| **Usage**: Track health check failures and reasons
 | |
| 
 | |
| #### chorus_health_check_duration_seconds
 | |
| **Type**: Histogram
 | |
| **Description**: Health check execution duration
 | |
| **Labels**: `check_name`
 | |
| **Usage**: Monitor health check performance
 | |
| 
 | |
| #### chorus_system_health_score
 | |
| **Type**: Gauge
 | |
| **Description**: Overall system health score
 | |
| **Value**: 0.0 (unhealthy) to 1.0 (healthy)
 | |
| **Usage**: Monitor overall system health
 | |
| 
 | |
| #### chorus_component_health_score
 | |
| **Type**: Gauge
 | |
| **Description**: Component-specific health score
 | |
| **Labels**: `component`
 | |
| **Value**: 0.0 (unhealthy) to 1.0 (healthy)
 | |
| **Usage**: Track individual component health
 | |
| 
 | |
| ### Task Management Metrics
 | |
| 
 | |
| #### chorus_tasks_active
 | |
| **Type**: Gauge
 | |
| **Description**: Number of currently active tasks
 | |
| **Value**: Current active task count
 | |
| 
 | |
| #### chorus_tasks_queued
 | |
| **Type**: Gauge
 | |
| **Description**: Number of queued tasks waiting execution
 | |
| **Value**: Current queue depth
 | |
| 
 | |
| #### chorus_tasks_completed_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of completed tasks
 | |
| **Labels**: `status` (success/failure), `task_type`
 | |
| **Usage**: Track task completion and success rate
 | |
| 
 | |
| #### chorus_task_duration_seconds
 | |
| **Type**: Histogram
 | |
| **Description**: Task execution duration distribution
 | |
| **Labels**: `task_type`, `status`
 | |
| **Usage**: Monitor task performance
 | |
| 
 | |
| #### chorus_task_queue_wait_time_seconds
 | |
| **Type**: Histogram
 | |
| **Description**: Time tasks spend in queue before execution
 | |
| **Usage**: Monitor task scheduling efficiency
 | |
| 
 | |
| ### SLURP (Context Generation) Metrics
 | |
| 
 | |
| #### chorus_slurp_contexts_generated_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of SLURP contexts generated
 | |
| **Labels**: `role`, `status` (success/failure)
 | |
| **Usage**: Track context generation volume
 | |
| 
 | |
| #### chorus_slurp_generation_time_seconds
 | |
| **Type**: Histogram
 | |
| **Description**: Time taken to generate SLURP contexts
 | |
| **Buckets**: [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0]
 | |
| **Usage**: Monitor context generation performance
 | |
| 
 | |
| #### chorus_slurp_queue_length
 | |
| **Type**: Gauge
 | |
| **Description**: Length of SLURP generation queue
 | |
| **Value**: Current queue depth
 | |
| 
 | |
| #### chorus_slurp_active_jobs
 | |
| **Type**: Gauge
 | |
| **Description**: Number of active SLURP generation jobs
 | |
| **Value**: Currently running generation jobs
 | |
| 
 | |
| #### chorus_slurp_leadership_events_total
 | |
| **Type**: Counter
 | |
| **Description**: SLURP-related leadership events
 | |
| **Usage**: Track leader-initiated context generation
 | |
| 
 | |
| ### SHHH (Secret Sentinel) Metrics
 | |
| 
 | |
| #### chorus_shhh_findings_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of SHHH redaction findings
 | |
| **Labels**: `rule`, `severity` (low/medium/high/critical)
 | |
| **Usage**: Monitor secret detection effectiveness
 | |
| 
 | |
| ### UCXI (Protocol Resolution) Metrics
 | |
| 
 | |
| #### chorus_ucxi_requests_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of UCXI protocol requests
 | |
| **Labels**: `method`, `status` (success/failure)
 | |
| **Usage**: Track UCXI usage and success rate
 | |
| 
 | |
| #### chorus_ucxi_resolution_latency_seconds
 | |
| **Type**: Histogram
 | |
| **Description**: UCXI address resolution latency
 | |
| **Usage**: Monitor resolution performance
 | |
| 
 | |
| #### chorus_ucxi_cache_hits_total
 | |
| **Type**: Counter
 | |
| **Description**: UCXI cache hit count
 | |
| **Usage**: Monitor caching effectiveness
 | |
| 
 | |
| #### chorus_ucxi_cache_misses_total
 | |
| **Type**: Counter
 | |
| **Description**: UCXI cache miss count
 | |
| **Usage**: Monitor caching effectiveness
 | |
| 
 | |
| #### chorus_ucxi_content_size_bytes
 | |
| **Type**: Histogram
 | |
| **Description**: Size of resolved UCXI content
 | |
| **Usage**: Monitor content distribution
 | |
| 
 | |
| ### Resource Utilization Metrics
 | |
| 
 | |
| #### chorus_cpu_usage_ratio
 | |
| **Type**: Gauge
 | |
| **Description**: CPU usage ratio
 | |
| **Value**: 0.0 (idle) to 1.0 (fully utilized)
 | |
| 
 | |
| #### chorus_memory_usage_bytes
 | |
| **Type**: Gauge
 | |
| **Description**: Memory usage in bytes
 | |
| **Value**: Current memory consumption
 | |
| 
 | |
| #### chorus_disk_usage_ratio
 | |
| **Type**: Gauge
 | |
| **Description**: Disk usage ratio
 | |
| **Labels**: `mount_point`
 | |
| **Value**: 0.0 (empty) to 1.0 (full)
 | |
| 
 | |
| #### chorus_network_bytes_in_total
 | |
| **Type**: Counter
 | |
| **Description**: Total bytes received from network
 | |
| **Usage**: Track inbound network traffic
 | |
| 
 | |
| #### chorus_network_bytes_out_total
 | |
| **Type**: Counter
 | |
| **Description**: Total bytes sent to network
 | |
| **Usage**: Track outbound network traffic
 | |
| 
 | |
| #### chorus_goroutines
 | |
| **Type**: Gauge
 | |
| **Description**: Number of active goroutines
 | |
| **Value**: Current goroutine count
 | |
| 
 | |
| ### Error Metrics
 | |
| 
 | |
| #### chorus_errors_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of errors
 | |
| **Labels**: `component`, `error_type`
 | |
| **Usage**: Track error frequency by component and type
 | |
| 
 | |
| #### chorus_panics_total
 | |
| **Type**: Counter
 | |
| **Description**: Total number of panics recovered
 | |
| **Usage**: Monitor system stability
 | |
| 
 | |
| ## Usage Examples
 | |
| 
 | |
| ### Basic Initialization
 | |
| 
 | |
| ```go
 | |
| import (
 | |
|     "log"
 | |
| 
 | |
|     "chorus/pkg/metrics"
 | |
| )
 | |
| 
 | |
| // Create metrics collector with default config
 | |
| config := metrics.DefaultMetricsConfig()
 | |
| config.NodeID = "chorus-node-01"
 | |
| config.Version = "v1.0.0"
 | |
| config.Environment = "production"
 | |
| config.Cluster = "cluster-01"
 | |
| 
 | |
| metricsCollector := metrics.NewCHORUSMetrics(config)
 | |
| 
 | |
| // Start metrics HTTP server
 | |
| if err := metricsCollector.StartServer(config); err != nil {
 | |
|     log.Fatalf("Failed to start metrics server: %v", err)
 | |
| }
 | |
| 
 | |
| // Start background metric collection
 | |
| metricsCollector.CollectMetrics(config)
 | |
| ```
 | |
| 
 | |
| ### Recording P2P Metrics
 | |
| 
 | |
| ```go
 | |
| // Update peer count
 | |
| metricsCollector.SetConnectedPeers(5)
 | |
| 
 | |
| // Record message sent
 | |
| metricsCollector.IncrementMessagesSent("task_assignment", "peer-abc123")
 | |
| 
 | |
| // Record message received
 | |
| metricsCollector.IncrementMessagesReceived("task_result", "peer-def456")
 | |
| 
 | |
| // Record message latency
 | |
| startTime := time.Now()
 | |
| // ... send message and wait for response ...
 | |
| latency := time.Since(startTime)
 | |
| metricsCollector.ObserveMessageLatency("task_assignment", latency)
 | |
| ```
 | |
| 
 | |
| ### Recording DHT Metrics
 | |
| 
 | |
| ```go
 | |
| // Record DHT put operation
 | |
| startTime := time.Now()
 | |
| err := dht.Put(key, value)
 | |
| latency := time.Since(startTime)
 | |
| 
 | |
| if err != nil {
 | |
|     metricsCollector.IncrementDHTPutOperations("failure")
 | |
|     metricsCollector.ObserveDHTOperationLatency("put", "failure", latency)
 | |
| } else {
 | |
|     metricsCollector.IncrementDHTPutOperations("success")
 | |
|     metricsCollector.ObserveDHTOperationLatency("put", "success", latency)
 | |
| }
 | |
| 
 | |
| // Update DHT statistics
 | |
| metricsCollector.SetDHTProviderRecords(150)
 | |
| metricsCollector.SetDHTContentKeys(450)
 | |
| metricsCollector.SetDHTReplicationFactor("key-hash-123", 3.0)
 | |
| ```
 | |
| 
 | |
| ### Recording PubSub Metrics
 | |
| 
 | |
| ```go
 | |
| // Update topic count
 | |
| metricsCollector.SetPubSubTopics(10)
 | |
| 
 | |
| // Record message published
 | |
| metricsCollector.IncrementPubSubMessages("CHORUS/tasks/v1", "sent", "task_created")
 | |
| 
 | |
| // Record message received
 | |
| metricsCollector.IncrementPubSubMessages("CHORUS/tasks/v1", "received", "task_completed")
 | |
| 
 | |
| // Record message latency
 | |
| startTime := time.Now()
 | |
| // ... publish message and wait for delivery confirmation ...
 | |
| latency := time.Since(startTime)
 | |
| metricsCollector.ObservePubSubMessageLatency("CHORUS/tasks/v1", latency)
 | |
| ```
 | |
| 
 | |
| ### Recording Election Metrics
 | |
| 
 | |
| ```go
 | |
| // Update election state
 | |
| metricsCollector.SetElectionTerm(42)
 | |
| metricsCollector.SetElectionState("idle")
 | |
| 
 | |
| // Record heartbeat sent (leader)
 | |
| metricsCollector.IncrementHeartbeatsSent()
 | |
| 
 | |
| // Record heartbeat received (follower)
 | |
| metricsCollector.IncrementHeartbeatsReceived()
 | |
| 
 | |
| // Record leadership change
 | |
| metricsCollector.IncrementLeadershipChanges()
 | |
| ```
 | |
| 
 | |
| ### Recording Health Metrics
 | |
| 
 | |
| ```go
 | |
| // Record health check success
 | |
| metricsCollector.IncrementHealthCheckPassed("database-connectivity")
 | |
| 
 | |
| // Record health check failure
 | |
| metricsCollector.IncrementHealthCheckFailed("p2p-connectivity", "no_peers")
 | |
| 
 | |
| // Update health scores
 | |
| metricsCollector.SetSystemHealthScore(0.95)
 | |
| metricsCollector.SetComponentHealthScore("dht", 0.98)
 | |
| metricsCollector.SetComponentHealthScore("pubsub", 0.92)
 | |
| ```
 | |
| 
 | |
| ### Recording Task Metrics
 | |
| 
 | |
| ```go
 | |
| // Update task counts
 | |
| metricsCollector.SetActiveTasks(5)
 | |
| metricsCollector.SetQueuedTasks(12)
 | |
| 
 | |
| // Record task completion
 | |
| startTime := time.Now()
 | |
| // ... execute task ...
 | |
| duration := time.Since(startTime)
 | |
| 
 | |
| metricsCollector.IncrementTasksCompleted("success", "data_processing")
 | |
| metricsCollector.ObserveTaskDuration("data_processing", "success", duration)
 | |
| ```
 | |
| 
 | |
| ### Recording SLURP Metrics
 | |
| 
 | |
| ```go
 | |
| // Record context generation
 | |
| startTime := time.Now()
 | |
| // ... generate SLURP context ...
 | |
| duration := time.Since(startTime)
 | |
| 
 | |
| metricsCollector.IncrementSLURPGenerated("admin", "success")
 | |
| metricsCollector.ObserveSLURPGenerationTime(duration)
 | |
| 
 | |
| // Update queue length
 | |
| metricsCollector.SetSLURPQueueLength(3)
 | |
| ```
 | |
| 
 | |
| ### Recording SHHH Metrics
 | |
| 
 | |
| ```go
 | |
| // Record secret findings
 | |
| findings := scanForSecrets(content)
 | |
| for _, finding := range findings {
 | |
|     metricsCollector.IncrementSHHHFindings(finding.Rule, finding.Severity, 1)
 | |
| }
 | |
| ```
 | |
| 
 | |
| ### Recording Resource Metrics
 | |
| 
 | |
| ```go
 | |
| import "runtime"
 | |
| 
 | |
| // Get runtime stats
 | |
| var memStats runtime.MemStats
 | |
| runtime.ReadMemStats(&memStats)
 | |
| 
 | |
| metricsCollector.SetMemoryUsage(float64(memStats.Alloc))
 | |
| metricsCollector.SetGoroutines(runtime.NumGoroutine())
 | |
| 
 | |
| // Record system resource usage
 | |
| metricsCollector.SetCPUUsage(0.45)  // 45% CPU usage
 | |
| metricsCollector.SetDiskUsage("/var/lib/CHORUS", 0.73)  // 73% disk usage
 | |
| ```
 | |
| 
 | |
| ### Recording Errors
 | |
| 
 | |
| ```go
 | |
| // Record error occurrence
 | |
| if err != nil {
 | |
|     metricsCollector.IncrementErrors("dht", "timeout")
 | |
| }
 | |
| 
 | |
| // Record recovered panic
 | |
| defer func() {
 | |
|     if r := recover(); r != nil {
 | |
|         metricsCollector.IncrementPanics()
 | |
|         // Handle panic...
 | |
|     }
 | |
| }()
 | |
| ```
 | |
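The recover pattern above can be wrapped once and reused for every goroutine entry point. The sketch below uses a plain callback in place of `metricsCollector.IncrementPanics`; `runRecovered` is a hypothetical helper name.

```go
package main

import "fmt"

// runRecovered runs fn and reports whether it panicked. onPanic is called
// once per recovered panic; with the real collector it would be
// metricsCollector.IncrementPanics.
func runRecovered(fn func(), onPanic func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
			onPanic()
		}
	}()
	fn()
	return false
}

func main() {
	panics := 0
	count := func() { panics++ }

	runRecovered(func() { panic("boom") }, count)
	runRecovered(func() {}, count)
	fmt.Println(panics) // prints "1"
}
```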
| 
 | |
| ## Prometheus Integration
 | |
| 
 | |
| ### Scrape Configuration
 | |
| 
 | |
| Add the following to your `prometheus.yml`:
 | |
| 
 | |
| ```yaml
 | |
| scrape_configs:
 | |
|   - job_name: 'chorus-nodes'
 | |
|     scrape_interval: 15s
 | |
|     scrape_timeout: 10s
 | |
|     metrics_path: '/metrics'
 | |
|     static_configs:
 | |
|       - targets:
 | |
|           - 'chorus-node-01:9090'
 | |
|           - 'chorus-node-02:9090'
 | |
|           - 'chorus-node-03:9090'
 | |
|     relabel_configs:
 | |
|       - source_labels: [__address__]
 | |
|         target_label: instance
 | |
|       - source_labels: [__address__]
 | |
|         regex: '([^:]+):.*'
 | |
|         target_label: node
 | |
|         replacement: '${1}'
 | |
| ```
 | |
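To check what the second relabel rule produces, the regex can be tried out in Go. Prometheus anchors relabel regexes at both ends, which the sketch imitates by wrapping the pattern in `^(?:...)$`; `nodeLabel` is a hypothetical helper for illustration only.

```go
package main

import (
	"fmt"
	"regexp"
)

// hostFromAddress is the regex from the scrape config, fully anchored the
// way Prometheus anchors relabel regexes.
var hostFromAddress = regexp.MustCompile(`^(?:([^:]+):.*)$`)

// nodeLabel returns the value the relabel rule would write to the node
// label, or "" when the address does not match (no label is written).
func nodeLabel(address string) string {
	if !hostFromAddress.MatchString(address) {
		return ""
	}
	return hostFromAddress.ReplaceAllString(address, "${1}")
}

func main() {
	fmt.Println(nodeLabel("chorus-node-01:9090")) // prints "chorus-node-01"
}
```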
| 
 | |
| ### Example Queries
 | |
| 
 | |
| #### P2P Network Health
 | |
| ```promql
 | |
| # Average connected peers across cluster
 | |
| avg(chorus_p2p_connected_peers)
 | |
| 
 | |
| # Message rate per second
 | |
| rate(chorus_p2p_messages_sent_total[5m])
 | |
| 
 | |
| # 95th percentile message latency
 | |
| histogram_quantile(0.95, rate(chorus_p2p_message_latency_seconds_bucket[5m]))
 | |
| ```
 | |
| 
 | |
| #### DHT Performance
 | |
| ```promql
 | |
| # DHT operation success rate
 | |
| sum(rate(chorus_dht_get_operations_total{status="success"}[5m])) /
 | |
| sum(rate(chorus_dht_get_operations_total[5m]))
 | |
| 
 | |
| # Average DHT operation latency
 | |
| rate(chorus_dht_operation_latency_seconds_sum[5m]) /
 | |
| rate(chorus_dht_operation_latency_seconds_count[5m])
 | |
| 
 | |
| # DHT cache hit rate
 | |
| rate(chorus_dht_cache_hits_total[5m]) /
 | |
| (rate(chorus_dht_cache_hits_total[5m]) + rate(chorus_dht_cache_misses_total[5m]))
 | |
| ```
 | |
| 
 | |
| #### Election Stability
 | |
| ```promql
 | |
| # Leadership changes per hour
 | |
| rate(chorus_leadership_changes_total[1h]) * 3600
 | |
| 
 | |
| # Nodes by election state
 | |
| sum by (state) (chorus_election_state)
 | |
| 
 | |
| # Heartbeat rate
 | |
| rate(chorus_heartbeats_sent_total[5m])
 | |
| ```
 | |
| 
 | |
| #### Task Management
 | |
| ```promql
 | |
| # Task success rate
 | |
| sum(rate(chorus_tasks_completed_total{status="success"}[5m])) /
 | |
| sum(rate(chorus_tasks_completed_total[5m]))
 | |
| 
 | |
| # Median (50th percentile) task duration
 | |
| histogram_quantile(0.50, rate(chorus_task_duration_seconds_bucket[5m]))
 | |
| 
 | |
| # Task queue depth
 | |
| chorus_tasks_queued
 | |
| ```
 | |
| 
 | |
| #### Resource Utilization
 | |
| ```promql
 | |
| # CPU usage by node
 | |
| chorus_cpu_usage_ratio
 | |
| 
 | |
| # Memory usage by node
 | |
| chorus_memory_usage_bytes / (1024 * 1024 * 1024)  # Convert to GiB
 | |
| 
 | |
| # Disk usage alert (>90%)
 | |
| chorus_disk_usage_ratio > 0.9
 | |
| ```
 | |
| 
 | |
| #### System Health
 | |
| ```promql
 | |
| # Overall system health score
 | |
| chorus_system_health_score
 | |
| 
 | |
| # Component health scores
 | |
| chorus_component_health_score
 | |
| 
 | |
| # Health check failure rate
 | |
| rate(chorus_health_checks_failed_total[5m])
 | |
| ```
 | |
| 
 | |
| ### Alerting Rules
 | |
| 
 | |
| Example Prometheus alerting rules for CHORUS:
 | |
| 
 | |
| ```yaml
 | |
| groups:
 | |
|   - name: chorus_alerts
 | |
|     interval: 30s
 | |
|     rules:
 | |
|       # P2P connectivity alerts
 | |
|       - alert: LowPeerCount
 | |
|         expr: chorus_p2p_connected_peers < 2
 | |
|         for: 5m
 | |
|         labels:
 | |
|           severity: warning
 | |
|         annotations:
 | |
|           summary: "Low P2P peer count on {{ $labels.instance }}"
 | |
|           description: "Node has {{ $value }} peers (minimum: 2)"
 | |
| 
 | |
|       # DHT performance alerts
 | |
|       - alert: HighDHTFailureRate
 | |
|         expr: |
 | |
|           sum by (instance) (rate(chorus_dht_get_operations_total{status="failure"}[5m])) /
 | |
|           sum by (instance) (rate(chorus_dht_get_operations_total[5m])) > 0.1
 | |
|         for: 10m
 | |
|         labels:
 | |
|           severity: warning
 | |
|         annotations:
 | |
|           summary: "High DHT failure rate on {{ $labels.instance }}"
 | |
|           description: "DHT failure rate: {{ $value | humanizePercentage }}"
 | |
| 
 | |
|       # Election stability alerts
 | |
|       - alert: FrequentLeadershipChanges
 | |
|         expr: rate(chorus_leadership_changes_total[1h]) * 3600 > 5
 | |
|         for: 15m
 | |
|         labels:
 | |
|           severity: warning
 | |
|         annotations:
 | |
|           summary: "Frequent leadership changes"
 | |
|           description: "{{ $value }} leadership changes per hour"
 | |
| 
 | |
|       # Task management alerts
 | |
|       - alert: HighTaskQueueDepth
 | |
|         expr: chorus_tasks_queued > 100
 | |
|         for: 10m
 | |
|         labels:
 | |
|           severity: warning
 | |
|         annotations:
 | |
|           summary: "High task queue depth on {{ $labels.instance }}"
 | |
|           description: "{{ $value }} tasks queued"
 | |
| 
 | |
|       # Resource alerts
 | |
|       - alert: HighMemoryUsage
 | |
|         expr: chorus_memory_usage_bytes > 8 * 1024 * 1024 * 1024  # 8GB
 | |
|         for: 5m
 | |
|         labels:
 | |
|           severity: warning
 | |
|         annotations:
 | |
|           summary: "High memory usage on {{ $labels.instance }}"
 | |
|           description: "Memory usage: {{ $value | humanize1024 }}B"
 | |
| 
 | |
|       - alert: HighDiskUsage
 | |
|         expr: chorus_disk_usage_ratio > 0.9
 | |
|         for: 10m
 | |
|         labels:
 | |
|           severity: critical
 | |
|         annotations:
 | |
|           summary: "High disk usage on {{ $labels.instance }}"
 | |
|           description: "Disk usage: {{ $value | humanizePercentage }}"
 | |
| 
 | |
|       # Health monitoring alerts
 | |
|       - alert: LowSystemHealth
 | |
|         expr: chorus_system_health_score < 0.75
 | |
|         for: 5m
 | |
|         labels:
 | |
|           severity: warning
 | |
|         annotations:
 | |
|           summary: "Low system health score on {{ $labels.instance }}"
 | |
|           description: "Health score: {{ $value }}"
 | |
| 
 | |
|       - alert: ComponentUnhealthy
 | |
|         expr: chorus_component_health_score < 0.5
 | |
|         for: 10m
 | |
|         labels:
 | |
|           severity: warning
 | |
|         annotations:
 | |
|           summary: "Component {{ $labels.component }} unhealthy"
 | |
|           description: "Health score: {{ $value }}"
 | |
| ```
 | |
| 
 | |
| ## HTTP Endpoints
 | |
| 
 | |
| ### Metrics Endpoint
 | |
| 
 | |
| **URL**: `/metrics`
 | |
| **Method**: GET
 | |
| **Description**: Prometheus metrics in text exposition format
 | |
| 
 | |
| **Response Format**:
 | |
| ```
 | |
| # HELP chorus_p2p_connected_peers Number of connected P2P peers
 | |
| # TYPE chorus_p2p_connected_peers gauge
 | |
| chorus_p2p_connected_peers 5
 | |
| 
 | |
| # HELP chorus_dht_put_operations_total Total number of DHT put operations
 | |
| # TYPE chorus_dht_put_operations_total counter
 | |
| chorus_dht_put_operations_total{status="success"} 1523
 | |
| chorus_dht_put_operations_total{status="failure"} 12
 | |
| 
 | |
| # HELP chorus_task_duration_seconds Task execution duration
 | |
| # TYPE chorus_task_duration_seconds histogram
 | |
| chorus_task_duration_seconds_bucket{task_type="data_processing",status="success",le="0.001"} 0
 | |
| chorus_task_duration_seconds_bucket{task_type="data_processing",status="success",le="0.005"} 12
 | |
| chorus_task_duration_seconds_bucket{task_type="data_processing",status="success",le="0.01"} 45
 | |
| ...
 | |
| ```
 | |
| 
 | |
| ### Health Endpoint
 | |
| 
 | |
| **URL**: `/health`
 | |
| **Method**: GET
 | |
| **Description**: Basic health check for metrics server
 | |
| 
 | |
| **Response**: `200 OK` with body `OK`
 | |
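A handler with the behavior described above fits in a few lines. The sketch below is an assumption about the shape of such a handler, not the code in `pkg/metrics`; it uses `net/http/httptest` to exercise the handler in memory without opening a port.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// healthHandler replies 200 OK with the body "OK", as described for /health.
func healthHandler(w http.ResponseWriter, _ *http.Request) {
	w.WriteHeader(http.StatusOK)
	fmt.Fprint(w, "OK")
}

// probe sends an in-memory GET request to the handler and returns the
// status code and body.
func probe(h http.HandlerFunc, path string) (int, string) {
	req := httptest.NewRequest(http.MethodGet, path, nil)
	rec := httptest.NewRecorder()
	h(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	code, body := probe(healthHandler, "/health")
	fmt.Println(code, body) // prints "200 OK"
}
```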
| 
 | |
| ## Best Practices
 | |
| 
 | |
| ### Metric Naming
 | |
| - Use descriptive metric names with `chorus_` prefix
 | |
| - Follow Prometheus naming conventions: `component_metric_unit`
 | |
| - Use `_total` suffix for counters
 | |
| - Use `_seconds` suffix for time measurements
 | |
| - Use `_bytes` suffix for size measurements
 | |
| 
 | |
| ### Label Usage
 | |
| - Keep label cardinality low (avoid high-cardinality labels like request IDs)
 | |
| - Use consistent label names across metrics
 | |
| - Document label meanings and expected values
 | |
| - Avoid labels that change frequently
 | |
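Label cardinality multiplies: the worst-case number of time series for a metric is the product of the distinct values of each label. The sketch below uses made-up numbers (30 message types, 200 peers) to show why a per-peer label such as `peer_id` deserves attention in larger clusters.

```go
package main

import "fmt"

// seriesCount returns the worst-case number of time series for one metric:
// the product of the number of distinct values of each label.
func seriesCount(labelValues map[string]int) int {
	total := 1
	for _, n := range labelValues {
		total *= n
	}
	return total
}

func main() {
	// Hypothetical cluster: 30 message types exchanged with 200 peers.
	perPeer := seriesCount(map[string]int{"message_type": 30, "peer_id": 200})
	byType := seriesCount(map[string]int{"message_type": 30})
	fmt.Println(perPeer, byType) // prints "6000 30"
}
```

For histograms the count grows further, because every label combination produces one series per bucket plus the `_sum` and `_count` series.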
| 
 | |
| ### Performance Considerations
 | |
| - Metrics collection is lock-free for read operations
 | |
| - Histogram observations are optimized for high throughput
 | |
| - Background collectors run on separate goroutines
 | |
| - Custom registry prevents pollution of default registry
 | |
| 
 | |
| ### Error Handling
 | |
| - Metrics collection should never panic
 | |
| - Failed metric updates should be logged but not block operations
 | |
| - Use nil checks before accessing metrics collectors
 | |
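One way to centralize the nil checks is to make the collector's methods tolerate a nil receiver, so call sites stay unconditional. The `recorder` type below is a hypothetical stand-in used only to show the pattern; whether `CHORUSMetrics` methods guard against nil is not stated in this document.

```go
package main

import "fmt"

// recorder is a hypothetical stand-in for a metrics collector.
type recorder struct {
	errors map[string]int
}

// IncrementErrors is safe to call on a nil *recorder: it becomes a no-op
// instead of panicking, so instrumented code needs no nil check of its own.
func (r *recorder) IncrementErrors(component, errorType string) {
	if r == nil {
		return
	}
	r.errors[component+"/"+errorType]++
}

func main() {
	var disabled *recorder // metrics not configured
	disabled.IncrementErrors("dht", "timeout") // no panic

	enabled := &recorder{errors: map[string]int{}}
	enabled.IncrementErrors("dht", "timeout")
	fmt.Println(enabled.errors["dht/timeout"]) // prints "1"
}
```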
| 
 | |
| ### Testing
 | |
| ```go
 | |
| func TestMetrics(t *testing.T) {
 | |
|     config := metrics.DefaultMetricsConfig()
 | |
|     config.NodeID = "test-node"
 | |
| 
 | |
|     m := metrics.NewCHORUSMetrics(config)
 | |
| 
 | |
|     // Test metric updates
 | |
|     m.SetConnectedPeers(5)
 | |
|     m.IncrementMessagesSent("test", "peer1")
 | |
| 
 | |
|     // Verify metrics are collected
 | |
|     // (Use prometheus testutil for verification)
 | |
| }
 | |
| ```
 | |
| 
 | |
| ## Troubleshooting
 | |
| 
 | |
| ### Metrics Not Appearing
 | |
| 1. Verify metrics server is running: `curl http://localhost:9090/metrics`
 | |
| 2. Check configuration: ensure correct `ListenAddr` and `MetricsPath`
 | |
| 3. Verify Prometheus scrape configuration
 | |
| 4. Check for errors in application logs
 | |
| 
 | |
| ### High Memory Usage
 | |
| 1. Review label cardinality (check for unbounded label values)
 | |
| 2. Adjust histogram buckets if too granular
 | |
| 3. Reduce metric collection frequency
 | |
| 4. Consider metric retention policies in Prometheus
 | |
| 
 | |
| ### Missing Metrics
 | |
| 1. Ensure metric is being updated by application code
 | |
| 2. Verify metric registration in `initializeMetrics()`
 | |
| 3. Check for race conditions in metric access
 | |
| 4. Review metric type compatibility (Counter vs Gauge vs Histogram)
 | |
| 
 | |
| ## Migration Guide
 | |
| 
 | |
| ### From Default Prometheus Registry
 | |
| ```go
 | |
| // Old approach
 | |
| prometheus.MustRegister(myCounter)
 | |
| 
 | |
| // New approach
 | |
| config := metrics.DefaultMetricsConfig()
 | |
| m := metrics.NewCHORUSMetrics(config)
 | |
| // Use m.IncrementErrors(...) instead of direct counter access
 | |
| ```
 | |
| 
 | |
| ### Adding New Metrics
 | |
| 1. Add metric field to `CHORUSMetrics` struct
 | |
| 2. Initialize metric in `initializeMetrics()` method
 | |
| 3. Add helper methods for updating the metric
 | |
| 4. Document the metric in this file
 | |
| 5. Add Prometheus queries and alerts as needed
 | |
| 
 | |
| ## Related Documentation
 | |
| 
 | |
| - [Health Package Documentation](./health.md)
 | |
| - [Shutdown Package Documentation](./shutdown.md)
 | |
| - [Prometheus Documentation](https://prometheus.io/docs/)
 | |
| - [Prometheus Best Practices](https://prometheus.io/docs/practices/naming/) |