# CHORUS Metrics Package
## Overview

The `pkg/metrics` package provides comprehensive Prometheus-based metrics collection for the CHORUS distributed system. It exposes detailed operational metrics across all system components, including P2P networking, DHT operations, PubSub messaging, elections, task management, and resource utilization.

## Architecture

### Core Components

- **CHORUSMetrics**: Central metrics collector managing all Prometheus metrics
- **Prometheus Registry**: Custom registry for metric collection
- **HTTP Server**: Exposes the metrics endpoint for scraping
- **Background Collectors**: Periodic system and resource metric collection (a minimal sketch of such a loop follows)
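The collector loop itself is internal to the package, but the pattern is a plain ticker-driven goroutine. A minimal sketch, assuming only the documented `SetMemoryUsage` and `SetGoroutines` helpers; the function name, stop channel, and loop shape are illustrative, not the package's actual internals:

```go
package telemetry

import (
	"runtime"
	"time"

	"chorus/pkg/metrics"
)

// runResourceCollector sketches the background-collector pattern: a
// ticker-driven loop that periodically samples runtime statistics and
// feeds them into the documented gauge helpers.
func runResourceCollector(m *metrics.CHORUSMetrics, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			var memStats runtime.MemStats
			runtime.ReadMemStats(&memStats)
			m.SetMemoryUsage(float64(memStats.Alloc))
			m.SetGoroutines(runtime.NumGoroutine())
		case <-stop:
			return
		}
	}
}
```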
### Metric Types

The package uses three Prometheus metric types (a declaration sketch follows the list):

1. **Counter**: Monotonically increasing values (e.g., total messages sent)
2. **Gauge**: Values that can go up or down (e.g., connected peers)
3. **Histogram**: Distribution of values with configurable buckets (e.g., latency measurements)
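For readers new to these types, the sketch below shows how each kind is declared and updated with the standard `prometheus` client library. The metric names are illustrative examples, not metrics exported by this package:

```go
package example

import "github.com/prometheus/client_golang/prometheus"

var (
	// Counter: only ever goes up; resets only on process restart.
	messagesSent = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "example_messages_sent_total",
		Help: "Total messages sent.",
	})

	// Gauge: a point-in-time value that can rise and fall.
	connectedPeers = prometheus.NewGauge(prometheus.GaugeOpts{
		Name: "example_connected_peers",
		Help: "Currently connected peers.",
	})

	// Histogram: counts observations into configurable buckets.
	messageLatency = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name:    "example_message_latency_seconds",
		Help:    "Message round-trip latency.",
		Buckets: []float64{0.001, 0.01, 0.1, 1, 10},
	})
)

// record shows one update call per metric type.
// (Registration against a registry is omitted for brevity.)
func record() {
	messagesSent.Inc()
	connectedPeers.Set(5)
	messageLatency.Observe(0.042)
}
```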
## Configuration

### MetricsConfig

```go
type MetricsConfig struct {
	// HTTP server configuration
	ListenAddr  string // Default: ":9090"
	MetricsPath string // Default: "/metrics"

	// Histogram buckets
	LatencyBuckets []float64 // Default: 0.001s to 10s
	SizeBuckets    []float64 // Default: 64B to 16MB

	// Node identification labels
	NodeID      string // Unique node identifier
	Version     string // CHORUS version
	Environment string // Deployment environment (dev/staging/prod)
	Cluster     string // Cluster identifier

	// Collection intervals
	SystemMetricsInterval   time.Duration // Default: 30s
	ResourceMetricsInterval time.Duration // Default: 15s
}
```
### Default Configuration

```go
config := metrics.DefaultMetricsConfig()
// Returns:
// - ListenAddr: ":9090"
// - MetricsPath: "/metrics"
// - LatencyBuckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
// - SizeBuckets: [64, 256, 1024, 4096, 16384, 65536, 262144, 1048576, 4194304, 16777216]
// - SystemMetricsInterval: 30s
// - ResourceMetricsInterval: 15s
```
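Defaults can be overridden before constructing the collector. A small sketch, assuming you want coarser latency buckets and megabyte-scale size buckets; the specific values are illustrative:

```go
config := metrics.DefaultMetricsConfig()

// Coarser latency buckets: fewer series per histogram, less resolution.
// Values are in seconds, matching the *_seconds convention used throughout.
config.LatencyBuckets = []float64{0.01, 0.1, 0.5, 1, 5, 10}

// Larger size buckets if most payloads are megabyte-scale (values in bytes).
config.SizeBuckets = []float64{1 << 20, 4 << 20, 16 << 20}

m := metrics.NewCHORUSMetrics(config)
```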
## Metrics Catalog

### System Metrics

#### chorus_system_info

- **Type**: Gauge
- **Description**: System information with version labels
- **Labels**: `node_id`, `version`, `go_version`, `cluster`, `environment`
- **Value**: Always 1 when present

#### chorus_uptime_seconds

- **Type**: Gauge
- **Description**: System uptime in seconds since start
- **Value**: Current uptime in seconds

### P2P Network Metrics

#### chorus_p2p_connected_peers

- **Type**: Gauge
- **Description**: Number of currently connected P2P peers
- **Value**: Current peer count

#### chorus_p2p_messages_sent_total

- **Type**: Counter
- **Description**: Total number of P2P messages sent
- **Labels**: `message_type`, `peer_id`
- **Usage**: Track outbound message volume per type and destination

#### chorus_p2p_messages_received_total

- **Type**: Counter
- **Description**: Total number of P2P messages received
- **Labels**: `message_type`, `peer_id`
- **Usage**: Track inbound message volume per type and source

#### chorus_p2p_message_latency_seconds

- **Type**: Histogram
- **Description**: P2P message round-trip latency distribution
- **Labels**: `message_type`
- **Buckets**: Configurable latency buckets (default: 1ms to 10s)

#### chorus_p2p_connection_duration_seconds

- **Type**: Histogram
- **Description**: Duration of P2P connections
- **Labels**: `peer_id`
- **Usage**: Track connection stability

#### chorus_p2p_peer_score

- **Type**: Gauge
- **Description**: Peer quality score
- **Labels**: `peer_id`
- **Value**: Score between 0.0 (poor) and 1.0 (excellent)

### DHT (Distributed Hash Table) Metrics

#### chorus_dht_put_operations_total

- **Type**: Counter
- **Description**: Total number of DHT put operations
- **Labels**: `status` (success/failure)
- **Usage**: Track DHT write operations

#### chorus_dht_get_operations_total

- **Type**: Counter
- **Description**: Total number of DHT get operations
- **Labels**: `status` (success/failure)
- **Usage**: Track DHT read operations

#### chorus_dht_operation_latency_seconds

- **Type**: Histogram
- **Description**: DHT operation latency distribution
- **Labels**: `operation` (put/get), `status` (success/failure)
- **Usage**: Monitor DHT performance

#### chorus_dht_provider_records

- **Type**: Gauge
- **Description**: Number of provider records stored in the DHT
- **Value**: Current provider record count

#### chorus_dht_content_keys

- **Type**: Gauge
- **Description**: Number of content keys stored in the DHT
- **Value**: Current content key count

#### chorus_dht_replication_factor

- **Type**: Gauge
- **Description**: Replication factor for DHT keys
- **Labels**: `key_hash`
- **Value**: Number of replicas for specific keys

#### chorus_dht_cache_hits_total

- **Type**: Counter
- **Description**: DHT cache hit count
- **Labels**: `cache_type`
- **Usage**: Monitor DHT caching effectiveness

#### chorus_dht_cache_misses_total

- **Type**: Counter
- **Description**: DHT cache miss count
- **Labels**: `cache_type`
- **Usage**: Monitor DHT caching effectiveness

### PubSub Messaging Metrics

#### chorus_pubsub_topics

- **Type**: Gauge
- **Description**: Number of active PubSub topics
- **Value**: Current topic count

#### chorus_pubsub_subscribers

- **Type**: Gauge
- **Description**: Number of subscribers per topic
- **Labels**: `topic`
- **Value**: Subscriber count for each topic

#### chorus_pubsub_messages_total

- **Type**: Counter
- **Description**: Total PubSub messages
- **Labels**: `topic`, `direction` (sent/received), `message_type`
- **Usage**: Track message volume per topic

#### chorus_pubsub_message_latency_seconds

- **Type**: Histogram
- **Description**: PubSub message delivery latency
- **Labels**: `topic`
- **Usage**: Monitor message propagation performance

#### chorus_pubsub_message_size_bytes

- **Type**: Histogram
- **Description**: PubSub message size distribution
- **Labels**: `topic`
- **Buckets**: Configurable size buckets (default: 64B to 16MB)

### Election System Metrics

#### chorus_election_term

- **Type**: Gauge
- **Description**: Current election term number
- **Value**: Monotonically increasing term number

#### chorus_election_state

- **Type**: Gauge
- **Description**: Current election state (1 for the active state, 0 for others)
- **Labels**: `state` (idle/discovering/electing/reconstructing/complete)
- **Usage**: Only one state should have value 1 at any time (a sketch of this update pattern follows)
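A common way to maintain a one-hot state gauge like this is to zero every known state and then set the active one. The sketch below uses the raw client library to show the pattern; it is illustrative, not the package's actual implementation:

```go
package election

import "github.com/prometheus/client_golang/prometheus"

var electionState = prometheus.NewGaugeVec(
	prometheus.GaugeOpts{
		Name: "chorus_election_state",
		Help: "Current election state (1 for the active state, 0 for others).",
	},
	[]string{"state"},
) // (registration against the registry omitted for brevity)

var allStates = []string{"idle", "discovering", "electing", "reconstructing", "complete"}

// setElectionState enforces the one-hot invariant: exactly one state
// label carries the value 1 at any time.
func setElectionState(active string) {
	for _, s := range allStates {
		v := 0.0
		if s == active {
			v = 1.0
		}
		electionState.WithLabelValues(s).Set(v)
	}
}
```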
#### chorus_heartbeats_sent_total

- **Type**: Counter
- **Description**: Total number of heartbeats sent by this node
- **Usage**: Monitor leader heartbeat activity

#### chorus_heartbeats_received_total

- **Type**: Counter
- **Description**: Total number of heartbeats received from the leader
- **Usage**: Monitor follower connectivity to the leader

#### chorus_leadership_changes_total

- **Type**: Counter
- **Description**: Total number of leadership changes
- **Usage**: Monitor election stability (lower is better)

#### chorus_leader_uptime_seconds

- **Type**: Gauge
- **Description**: Current leader's tenure duration
- **Value**: Seconds since the current leader was elected

#### chorus_election_latency_seconds

- **Type**: Histogram
- **Description**: Time taken to complete the election process
- **Usage**: Monitor election efficiency

### Health Monitoring Metrics

#### chorus_health_checks_passed_total

- **Type**: Counter
- **Description**: Total number of health checks passed
- **Labels**: `check_name`
- **Usage**: Track health check success rate

#### chorus_health_checks_failed_total

- **Type**: Counter
- **Description**: Total number of health checks failed
- **Labels**: `check_name`, `reason`
- **Usage**: Track health check failures and reasons

#### chorus_health_check_duration_seconds

- **Type**: Histogram
- **Description**: Health check execution duration
- **Labels**: `check_name`
- **Usage**: Monitor health check performance

#### chorus_system_health_score

- **Type**: Gauge
- **Description**: Overall system health score
- **Value**: 0.0 (unhealthy) to 1.0 (healthy)
- **Usage**: Monitor overall system health

#### chorus_component_health_score

- **Type**: Gauge
- **Description**: Component-specific health score
- **Labels**: `component`
- **Value**: 0.0 (unhealthy) to 1.0 (healthy)
- **Usage**: Track individual component health

### Task Management Metrics

#### chorus_tasks_active

- **Type**: Gauge
- **Description**: Number of currently active tasks
- **Value**: Current active task count

#### chorus_tasks_queued

- **Type**: Gauge
- **Description**: Number of queued tasks awaiting execution
- **Value**: Current queue depth

#### chorus_tasks_completed_total

- **Type**: Counter
- **Description**: Total number of completed tasks
- **Labels**: `status` (success/failure), `task_type`
- **Usage**: Track task completion and success rate

#### chorus_task_duration_seconds

- **Type**: Histogram
- **Description**: Task execution duration distribution
- **Labels**: `task_type`, `status`
- **Usage**: Monitor task performance

#### chorus_task_queue_wait_time_seconds

- **Type**: Histogram
- **Description**: Time tasks spend in the queue before execution
- **Usage**: Monitor task scheduling efficiency (a measurement sketch follows)
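Queue wait time is typically measured by stamping each task at enqueue and observing the elapsed time when execution starts. A minimal sketch of that pattern; the `Task` type is hypothetical, and the histogram observation is passed as a callback rather than assuming a specific `CHORUSMetrics` helper:

```go
package taskqueue

import "time"

// Task carries its enqueue timestamp so the scheduler can compute wait time.
type Task struct {
	Type       string
	EnqueuedAt time.Time
}

// Enqueue stamps the task as it enters the queue.
func Enqueue(queue chan<- Task, taskType string) {
	queue <- Task{Type: taskType, EnqueuedAt: time.Now()}
}

// Dequeue observes the wait time (enqueue to start of execution) via the
// supplied callback, then hands the task to the caller.
func Dequeue(queue <-chan Task, observe func(time.Duration)) Task {
	task := <-queue
	observe(time.Since(task.EnqueuedAt))
	return task
}
```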
### SLURP (Context Generation) Metrics

#### chorus_slurp_contexts_generated_total

- **Type**: Counter
- **Description**: Total number of SLURP contexts generated
- **Labels**: `role`, `status` (success/failure)
- **Usage**: Track context generation volume

#### chorus_slurp_generation_time_seconds

- **Type**: Histogram
- **Description**: Time taken to generate SLURP contexts
- **Buckets**: [0.1, 0.5, 1.0, 2.0, 5.0, 10.0, 30.0, 60.0, 120.0]
- **Usage**: Monitor context generation performance

#### chorus_slurp_queue_length

- **Type**: Gauge
- **Description**: Length of the SLURP generation queue
- **Value**: Current queue depth

#### chorus_slurp_active_jobs

- **Type**: Gauge
- **Description**: Number of active SLURP generation jobs
- **Value**: Currently running generation jobs

#### chorus_slurp_leadership_events_total

- **Type**: Counter
- **Description**: SLURP-related leadership events
- **Usage**: Track leader-initiated context generation

### SHHH (Secret Sentinel) Metrics

#### chorus_shhh_findings_total

- **Type**: Counter
- **Description**: Total number of SHHH redaction findings
- **Labels**: `rule`, `severity` (low/medium/high/critical)
- **Usage**: Monitor secret detection effectiveness

### UCXI (Protocol Resolution) Metrics

#### chorus_ucxi_requests_total

- **Type**: Counter
- **Description**: Total number of UCXI protocol requests
- **Labels**: `method`, `status` (success/failure)
- **Usage**: Track UCXI usage and success rate

#### chorus_ucxi_resolution_latency_seconds

- **Type**: Histogram
- **Description**: UCXI address resolution latency
- **Usage**: Monitor resolution performance

#### chorus_ucxi_cache_hits_total

- **Type**: Counter
- **Description**: UCXI cache hit count
- **Usage**: Monitor caching effectiveness

#### chorus_ucxi_cache_misses_total

- **Type**: Counter
- **Description**: UCXI cache miss count
- **Usage**: Monitor caching effectiveness

#### chorus_ucxi_content_size_bytes

- **Type**: Histogram
- **Description**: Size of resolved UCXI content
- **Usage**: Monitor content size distribution

### Resource Utilization Metrics

#### chorus_cpu_usage_ratio

- **Type**: Gauge
- **Description**: CPU usage ratio
- **Value**: 0.0 (idle) to 1.0 (fully utilized)

#### chorus_memory_usage_bytes

- **Type**: Gauge
- **Description**: Memory usage in bytes
- **Value**: Current memory consumption

#### chorus_disk_usage_ratio

- **Type**: Gauge
- **Description**: Disk usage ratio
- **Labels**: `mount_point`
- **Value**: 0.0 (empty) to 1.0 (full)

#### chorus_network_bytes_in_total

- **Type**: Counter
- **Description**: Total bytes received from the network
- **Usage**: Track inbound network traffic

#### chorus_network_bytes_out_total

- **Type**: Counter
- **Description**: Total bytes sent to the network
- **Usage**: Track outbound network traffic

#### chorus_goroutines

- **Type**: Gauge
- **Description**: Number of active goroutines
- **Value**: Current goroutine count

### Error Metrics

#### chorus_errors_total

- **Type**: Counter
- **Description**: Total number of errors
- **Labels**: `component`, `error_type`
- **Usage**: Track error frequency by component and type

#### chorus_panics_total

- **Type**: Counter
- **Description**: Total number of recovered panics
- **Usage**: Monitor system stability
## Usage Examples

### Basic Initialization

```go
import "chorus/pkg/metrics"

// Create a metrics collector with the default config
config := metrics.DefaultMetricsConfig()
config.NodeID = "chorus-node-01"
config.Version = "v1.0.0"
config.Environment = "production"
config.Cluster = "cluster-01"

metricsCollector := metrics.NewCHORUSMetrics(config)

// Start the metrics HTTP server
if err := metricsCollector.StartServer(config); err != nil {
	log.Fatalf("Failed to start metrics server: %v", err)
}

// Start background metric collection
metricsCollector.CollectMetrics(config)
```

### Recording P2P Metrics

```go
// Update peer count
metricsCollector.SetConnectedPeers(5)

// Record a message sent
metricsCollector.IncrementMessagesSent("task_assignment", "peer-abc123")

// Record a message received
metricsCollector.IncrementMessagesReceived("task_result", "peer-def456")

// Record message latency
startTime := time.Now()
// ... send message and wait for response ...
latency := time.Since(startTime)
metricsCollector.ObserveMessageLatency("task_assignment", latency)
```

### Recording DHT Metrics

```go
// Record a DHT put operation
startTime := time.Now()
err := dht.Put(key, value)
latency := time.Since(startTime)

if err != nil {
	metricsCollector.IncrementDHTPutOperations("failure")
	metricsCollector.ObserveDHTOperationLatency("put", "failure", latency)
} else {
	metricsCollector.IncrementDHTPutOperations("success")
	metricsCollector.ObserveDHTOperationLatency("put", "success", latency)
}

// Update DHT statistics
metricsCollector.SetDHTProviderRecords(150)
metricsCollector.SetDHTContentKeys(450)
metricsCollector.SetDHTReplicationFactor("key-hash-123", 3.0)
```

### Recording PubSub Metrics

```go
// Update topic count
metricsCollector.SetPubSubTopics(10)

// Record a message published
metricsCollector.IncrementPubSubMessages("CHORUS/tasks/v1", "sent", "task_created")

// Record a message received
metricsCollector.IncrementPubSubMessages("CHORUS/tasks/v1", "received", "task_completed")

// Record message latency
startTime := time.Now()
// ... publish message and wait for delivery confirmation ...
latency := time.Since(startTime)
metricsCollector.ObservePubSubMessageLatency("CHORUS/tasks/v1", latency)
```

### Recording Election Metrics

```go
// Update election state
metricsCollector.SetElectionTerm(42)
metricsCollector.SetElectionState("idle")

// Record a heartbeat sent (leader)
metricsCollector.IncrementHeartbeatsSent()

// Record a heartbeat received (follower)
metricsCollector.IncrementHeartbeatsReceived()

// Record a leadership change
metricsCollector.IncrementLeadershipChanges()
```

### Recording Health Metrics

```go
// Record a health check success
metricsCollector.IncrementHealthCheckPassed("database-connectivity")

// Record a health check failure
metricsCollector.IncrementHealthCheckFailed("p2p-connectivity", "no_peers")

// Update health scores
metricsCollector.SetSystemHealthScore(0.95)
metricsCollector.SetComponentHealthScore("dht", 0.98)
metricsCollector.SetComponentHealthScore("pubsub", 0.92)
```

### Recording Task Metrics

```go
// Update task counts
metricsCollector.SetActiveTasks(5)
metricsCollector.SetQueuedTasks(12)

// Record a task completion
startTime := time.Now()
// ... execute task ...
duration := time.Since(startTime)

metricsCollector.IncrementTasksCompleted("success", "data_processing")
metricsCollector.ObserveTaskDuration("data_processing", "success", duration)
```

### Recording SLURP Metrics

```go
// Record context generation
startTime := time.Now()
// ... generate SLURP context ...
duration := time.Since(startTime)

metricsCollector.IncrementSLURPGenerated("admin", "success")
metricsCollector.ObserveSLURPGenerationTime(duration)

// Update queue length
metricsCollector.SetSLURPQueueLength(3)
```

### Recording SHHH Metrics

```go
// Record secret findings
findings := scanForSecrets(content)
for _, finding := range findings {
	metricsCollector.IncrementSHHHFindings(finding.Rule, finding.Severity, 1)
}
```

### Recording Resource Metrics

```go
import "runtime"

// Get runtime stats
var memStats runtime.MemStats
runtime.ReadMemStats(&memStats)

metricsCollector.SetMemoryUsage(float64(memStats.Alloc))
metricsCollector.SetGoroutines(runtime.NumGoroutine())

// Record system resource usage
metricsCollector.SetCPUUsage(0.45)                     // 45% CPU usage
metricsCollector.SetDiskUsage("/var/lib/CHORUS", 0.73) // 73% disk usage
```

### Recording Errors

```go
// Record an error occurrence
if err != nil {
	metricsCollector.IncrementErrors("dht", "timeout")
}

// Record a recovered panic
defer func() {
	if r := recover(); r != nil {
		metricsCollector.IncrementPanics()
		// Handle panic...
	}
}()
```
## Prometheus Integration

### Scrape Configuration

Add the following to your `prometheus.yml`:

```yaml
scrape_configs:
  - job_name: 'chorus-nodes'
    scrape_interval: 15s
    scrape_timeout: 10s
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'chorus-node-01:9090'
          - 'chorus-node-02:9090'
          - 'chorus-node-03:9090'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
      - source_labels: [__address__]
        regex: '([^:]+):.*'
        target_label: node
        replacement: '${1}'
```

### Example Queries

#### P2P Network Health

```promql
# Average connected peers across the cluster
avg(chorus_p2p_connected_peers)

# Message rate per second
rate(chorus_p2p_messages_sent_total[5m])

# 95th percentile message latency
histogram_quantile(0.95, rate(chorus_p2p_message_latency_seconds_bucket[5m]))
```

#### DHT Performance

```promql
# DHT operation success rate
rate(chorus_dht_get_operations_total{status="success"}[5m]) /
rate(chorus_dht_get_operations_total[5m])

# Average DHT operation latency
rate(chorus_dht_operation_latency_seconds_sum[5m]) /
rate(chorus_dht_operation_latency_seconds_count[5m])

# DHT cache hit rate
rate(chorus_dht_cache_hits_total[5m]) /
(rate(chorus_dht_cache_hits_total[5m]) + rate(chorus_dht_cache_misses_total[5m]))
```

#### Election Stability

```promql
# Leadership changes per hour
rate(chorus_leadership_changes_total[1h]) * 3600

# Nodes by election state
sum by (state) (chorus_election_state)

# Heartbeat rate
rate(chorus_heartbeats_sent_total[5m])
```

#### Task Management

```promql
# Task success rate
rate(chorus_tasks_completed_total{status="success"}[5m]) /
rate(chorus_tasks_completed_total[5m])

# Median task duration (50th percentile)
histogram_quantile(0.50, rate(chorus_task_duration_seconds_bucket[5m]))

# Task queue depth
chorus_tasks_queued
```

#### Resource Utilization

```promql
# CPU usage by node
chorus_cpu_usage_ratio

# Memory usage by node, converted to GB
chorus_memory_usage_bytes / (1024 * 1024 * 1024)

# Disk usage alert condition (>90%)
chorus_disk_usage_ratio > 0.9
```

#### System Health

```promql
# Overall system health score
chorus_system_health_score

# Component health scores
chorus_component_health_score

# Health check failure rate
rate(chorus_health_checks_failed_total[5m])
```

### Alerting Rules

Example Prometheus alerting rules for CHORUS:

```yaml
groups:
  - name: chorus_alerts
    interval: 30s
    rules:
      # P2P connectivity alerts
      - alert: LowPeerCount
        expr: chorus_p2p_connected_peers < 2
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low P2P peer count on {{ $labels.instance }}"
          description: "Node has {{ $value }} peers (minimum: 2)"

      # DHT performance alerts
      - alert: HighDHTFailureRate
        expr: |
          rate(chorus_dht_get_operations_total{status="failure"}[5m]) /
          rate(chorus_dht_get_operations_total[5m]) > 0.1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High DHT failure rate on {{ $labels.instance }}"
          description: "DHT failure rate: {{ $value | humanizePercentage }}"

      # Election stability alerts
      - alert: FrequentLeadershipChanges
        expr: rate(chorus_leadership_changes_total[1h]) * 3600 > 5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Frequent leadership changes"
          description: "{{ $value }} leadership changes per hour"

      # Task management alerts
      - alert: HighTaskQueueDepth
        expr: chorus_tasks_queued > 100
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High task queue depth on {{ $labels.instance }}"
          description: "{{ $value }} tasks queued"

      # Resource alerts
      - alert: HighMemoryUsage
        expr: chorus_memory_usage_bytes > 8 * 1024 * 1024 * 1024  # 8GB
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage: {{ $value | humanize1024 }}B"

      - alert: HighDiskUsage
        expr: chorus_disk_usage_ratio > 0.9
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "High disk usage on {{ $labels.instance }}"
          description: "Disk usage: {{ $value | humanizePercentage }}"

      # Health monitoring alerts
      - alert: LowSystemHealth
        expr: chorus_system_health_score < 0.75
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low system health score on {{ $labels.instance }}"
          description: "Health score: {{ $value }}"

      - alert: ComponentUnhealthy
        expr: chorus_component_health_score < 0.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Component {{ $labels.component }} unhealthy"
          description: "Health score: {{ $value }}"
```
## HTTP Endpoints

### Metrics Endpoint

- **URL**: `/metrics`
- **Method**: GET
- **Description**: Prometheus metrics in the text exposition format

**Response format**:

```
# HELP chorus_p2p_connected_peers Number of connected P2P peers
# TYPE chorus_p2p_connected_peers gauge
chorus_p2p_connected_peers 5

# HELP chorus_dht_put_operations_total Total number of DHT put operations
# TYPE chorus_dht_put_operations_total counter
chorus_dht_put_operations_total{status="success"} 1523
chorus_dht_put_operations_total{status="failure"} 12

# HELP chorus_task_duration_seconds Task execution duration
# TYPE chorus_task_duration_seconds histogram
chorus_task_duration_seconds_bucket{task_type="data_processing",status="success",le="0.001"} 0
chorus_task_duration_seconds_bucket{task_type="data_processing",status="success",le="0.005"} 12
chorus_task_duration_seconds_bucket{task_type="data_processing",status="success",le="0.01"} 45
...
```

### Health Endpoint

- **URL**: `/health`
- **Method**: GET
- **Description**: Basic health check for the metrics server
- **Response**: `200 OK` with body `OK` (a quick verification snippet follows)
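Both endpoints are easy to smoke-test from Go. A small sketch, assuming the server is listening on the default `:9090`:

```go
package main

import (
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Probe the health endpoint, then confirm the metrics endpoint
	// returns a non-empty exposition payload.
	for _, url := range []string{
		"http://localhost:9090/health",
		"http://localhost:9090/metrics",
	} {
		resp, err := http.Get(url)
		if err != nil {
			fmt.Printf("%s: request failed: %v\n", url, err)
			continue
		}
		body, _ := io.ReadAll(resp.Body)
		resp.Body.Close()
		fmt.Printf("%s: %d (%d bytes)\n", url, resp.StatusCode, len(body))
	}
}
```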
## Best Practices

### Metric Naming

- Use descriptive metric names with the `chorus_` prefix
- Follow Prometheus naming conventions: `component_metric_unit`
- Use the `_total` suffix for counters
- Use the `_seconds` suffix for time measurements
- Use the `_bytes` suffix for size measurements

### Label Usage

- Keep label cardinality low (avoid high-cardinality labels such as request IDs)
- Use consistent label names across metrics
- Document label meanings and expected values
- Avoid labels whose values change frequently

### Performance Considerations

- Metrics collection is lock-free for read operations
- Histogram observations are optimized for high throughput
- Background collectors run on separate goroutines
- A custom registry prevents pollution of the default registry

### Error Handling

- Metrics collection should never panic
- Failed metric updates should be logged but must not block operations
- Use nil checks before accessing metrics collectors (see the sketch below)
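One way to honor the never-panic and nil-check guidance is to route updates through a thin nil-safe wrapper of your own. A sketch; `SafeMetrics` is hypothetical, not a type exported by this package:

```go
package telemetry

import "chorus/pkg/metrics"

// SafeMetrics wraps a possibly-nil collector so call sites never have to
// guard against nil themselves. A nil receiver or nil collector turns
// every update into a no-op instead of a panic.
type SafeMetrics struct {
	m *metrics.CHORUSMetrics
}

func (s *SafeMetrics) IncrementErrors(component, errorType string) {
	if s == nil || s.m == nil {
		return
	}
	s.m.IncrementErrors(component, errorType)
}

func (s *SafeMetrics) SetConnectedPeers(count int) {
	if s == nil || s.m == nil {
		return
	}
	s.m.SetConnectedPeers(count)
}
```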
### Testing

```go
func TestMetrics(t *testing.T) {
	config := metrics.DefaultMetricsConfig()
	config.NodeID = "test-node"

	m := metrics.NewCHORUSMetrics(config)

	// Test metric updates
	m.SetConnectedPeers(5)
	m.IncrementMessagesSent("test", "peer1")

	// Verify metrics are collected
	// (use the prometheus testutil package for verification)
}
```
## Troubleshooting

### Metrics Not Appearing

1. Verify the metrics server is running: `curl http://localhost:9090/metrics`
2. Check the configuration: ensure `ListenAddr` and `MetricsPath` are correct
3. Verify the Prometheus scrape configuration
4. Check for errors in the application logs

### High Memory Usage

1. Review label cardinality (check for unbounded label values)
2. Adjust histogram buckets if they are too granular
3. Reduce the metric collection frequency
4. Consider metric retention policies in Prometheus

### Missing Metrics

1. Ensure the metric is being updated by application code
2. Verify the metric is registered in `initializeMetrics()`
3. Check for race conditions in metric access
4. Review metric type compatibility (Counter vs. Gauge vs. Histogram)

## Migration Guide

### From the Default Prometheus Registry

```go
// Old approach
prometheus.MustRegister(myCounter)

// New approach
config := metrics.DefaultMetricsConfig()
m := metrics.NewCHORUSMetrics(config)
// Use m.IncrementErrors(...) instead of direct counter access
```

### Adding New Metrics

1. Add the metric field to the `CHORUSMetrics` struct
2. Initialize the metric in the `initializeMetrics()` method
3. Add helper methods for updating the metric (see the sketch below)
4. Document the metric in this file
5. Add Prometheus queries and alerts as needed
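A sketch of steps 1–3 for a hypothetical new counter. The field, metric, and helper names are illustrative, and the sketch assumes the collector keeps its custom registry in a `registry` field; the real `initializeMetrics()` may differ:

```go
// Step 1: add the field to the CHORUSMetrics struct.
type CHORUSMetrics struct {
	// ... existing fields ...
	reconnectsTotal *prometheus.CounterVec // hypothetical new metric
}

// Step 2: create and register the metric in initializeMetrics(),
// against the package's custom registry.
func (m *CHORUSMetrics) initializeMetrics() {
	m.reconnectsTotal = prometheus.NewCounterVec(
		prometheus.CounterOpts{
			Name: "chorus_p2p_reconnects_total",
			Help: "Total number of P2P reconnect attempts.",
		},
		[]string{"peer_id", "status"},
	)
	m.registry.MustRegister(m.reconnectsTotal)
}

// Step 3: expose a helper so callers never touch the raw metric.
func (m *CHORUSMetrics) IncrementReconnects(peerID, status string) {
	m.reconnectsTotal.WithLabelValues(peerID, status).Inc()
}
```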
## Related Documentation

- [Health Package Documentation](./health.md)
- [Shutdown Package Documentation](./shutdown.md)
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Prometheus Best Practices](https://prometheus.io/docs/practices/naming/)