# CHORUS Health Package

## Overview

The `pkg/health` package provides comprehensive health monitoring and readiness/liveness probe capabilities for the CHORUS distributed system. It orchestrates health checks across all system components, integrates with graceful shutdown, and exposes Kubernetes-compatible health endpoints.

## Architecture

### Core Components

1. **Manager**: Central health check orchestration and HTTP endpoint management
2. **HealthCheck**: Individual health check definitions with configurable intervals
3. **EnhancedHealthChecks**: Advanced health monitoring with metrics and history
4. **Adapters**: Integration layer for PubSub, DHT, and other subsystems
5. **SystemStatus**: Aggregated health status representation

### Health Check Types

- **Critical Checks**: Failures trigger graceful shutdown
- **Non-Critical Checks**: Failures degrade health status but don't trigger shutdown
- **Active Probes**: Synthetic tests that verify end-to-end functionality
- **Passive Checks**: Monitor existing system state without creating load

## Core Types

### HealthCheck

```go
type HealthCheck struct {
    Name        string                                // Unique check identifier
    Description string                                // Human-readable description
    Checker     func(ctx context.Context) CheckResult // Check execution function
    Interval    time.Duration                         // Check frequency (default: 30s)
    Timeout     time.Duration                         // Check timeout (default: 10s)
    Enabled     bool                                  // Enable/disable check
    Critical    bool                                  // If true, failure triggers shutdown
    LastRun     time.Time                             // Timestamp of last execution
    LastResult  *CheckResult                          // Most recent check result
}
```

### CheckResult

```go
type CheckResult struct {
    Healthy   bool                   // Check passed/failed
    Message   string                 // Human-readable result message
    Details   map[string]interface{} // Additional structured information
    Latency   time.Duration          // Check execution time
    Timestamp time.Time              // Result timestamp
    Error     error                  // Error details if check failed
}
```

### SystemStatus

```go
type SystemStatus struct {
    Status     Status                  // Overall status enum
    Message    string                  // Status description
    Checks     map[string]*CheckResult // All check results
    Uptime     time.Duration           // System uptime
    StartTime  time.Time               // System start timestamp
    LastUpdate time.Time               // Last status update
    Version    string                  // CHORUS version
    NodeID     string                  // Node identifier
}
```

### Status Levels

```go
const (
    StatusHealthy   Status = "healthy"   // All checks passing
    StatusDegraded  Status = "degraded"  // Some non-critical checks failing
    StatusUnhealthy Status = "unhealthy" // Critical checks failing
    StatusStarting  Status = "starting"  // System initializing
    StatusStopping  Status = "stopping"  // Graceful shutdown in progress
)
```

## Manager

### Initialization

```go
import (
    "time"

    "chorus/pkg/health"
    "chorus/pkg/shutdown"
)

// Create health manager
logger := yourLogger // Implements health.Logger interface
manager := health.NewManager("node-123", "v1.0.0", logger)

// Connect to shutdown manager for critical failures
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
manager.SetShutdownManager(shutdownMgr)
```

### Registration System

```go
// Register a health check
check := &health.HealthCheck{
    Name:        "database-connectivity",
    Description: "PostgreSQL database connectivity check",
    Enabled:     true,
    Critical:    true, // Failure triggers shutdown
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        // Perform health check
        err := db.PingContext(ctx)
        if err != nil {
            return health.CheckResult{
                Healthy:   false,
                Message:   fmt.Sprintf("Database ping failed: %v", err),
                Error:     err,
                Timestamp: time.Now(),
            }
        }
        return health.CheckResult{
            Healthy:   true,
            Message:   "Database connectivity OK",
            Timestamp: time.Now(),
        }
    },
}
manager.RegisterCheck(check)

// Unregister when no longer needed
manager.UnregisterCheck("database-connectivity")
```

### Lifecycle Management

```go
// Start health monitoring
if err := manager.Start(); err != nil {
    log.Fatalf("Failed to start health manager: %v", err)
}

// Start HTTP server for health endpoints
if err := manager.StartHTTPServer(8081); err != nil {
    log.Fatalf("Failed to start health HTTP server: %v", err)
}

// ... application runs ...

// Stop health monitoring during shutdown
if err := manager.Stop(); err != nil {
    log.Printf("Error stopping health manager: %v", err)
}
```

## HTTP Endpoints

### /health - Overall Health Status

**Method**: GET

**Description**: Returns comprehensive system health status

**Response Codes**:

- `200 OK`: System is healthy or degraded
- `503 Service Unavailable`: System is unhealthy, starting, or stopping

**Response Schema**:

```json
{
  "status": "healthy",
  "message": "All health checks passing",
  "checks": {
    "database-connectivity": {
      "healthy": true,
      "message": "Database connectivity OK",
      "latency": 15000000,
      "timestamp": "2025-09-30T10:30:00Z"
    },
    "p2p-connectivity": {
      "healthy": true,
      "message": "5 peers connected",
      "details": {
        "connected_peers": 5,
        "min_peers": 3
      },
      "latency": 8000000,
      "timestamp": "2025-09-30T10:30:05Z"
    }
  },
  "uptime": 86400000000000,
  "start_time": "2025-09-29T10:30:00Z",
  "last_update": "2025-09-30T10:30:05Z",
  "version": "v1.0.0",
  "node_id": "node-123"
}
```

### /health/ready - Readiness Probe

**Method**: GET

**Description**: Kubernetes readiness probe - indicates whether the node can handle requests

**Response Codes**:

- `200 OK`: Node is ready (healthy or degraded)
- `503 Service Unavailable`: Node is not ready

**Response Schema**:

```json
{
  "ready": true,
  "status": "healthy",
  "message": "All health checks passing"
}
```

**Usage**: Use for Kubernetes readiness probes to control traffic routing

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```

### /health/live - Liveness Probe

**Method**: GET

**Description**: Kubernetes liveness probe - indicates whether the node is alive

**Response Codes**:

- `200 OK`: Process is alive (not stopping)
- `503 Service Unavailable`: Process is stopping

**Response Schema**:

```json
{
  "live": true,
  "status": "healthy",
  "uptime": "24h0m0s"
}
```

**Usage**: Use for Kubernetes liveness probes to restart unhealthy pods

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8081
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

### /health/checks - Detailed Check Results

**Method**: GET

**Description**: Returns detailed results for all registered health checks

**Response Schema**:

```json
{
  "checks": {
    "database-connectivity": {
      "healthy": true,
      "message": "Database connectivity OK",
      "latency": 15000000,
      "timestamp": "2025-09-30T10:30:00Z"
    },
    "p2p-connectivity": {
      "healthy": true,
      "message": "5 peers connected",
      "details": {
        "connected_peers": 5,
        "min_peers": 3
      },
      "latency": 8000000,
      "timestamp": "2025-09-30T10:30:05Z"
    }
  },
  "total": 2,
  "timestamp": "2025-09-30T10:30:10Z"
}
```

## Built-in Health Checks

### Database Connectivity Check

```go
check := health.CreateDatabaseCheck("primary-db", func() error {
    return db.Ping()
})
manager.RegisterCheck(check)
```

**Properties**:

- Critical: Yes
- Interval: 30 seconds
- Timeout: 10 seconds
- Checks: Database ping/connectivity

### Disk Space Check

```go
check := health.CreateDiskSpaceCheck("/var/lib/CHORUS", 0.90) // Alert at 90%
manager.RegisterCheck(check)
```

**Properties**:

- Critical: No (warning only)
- Interval: 60 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 90%)

### Memory Usage Check

```go
check := health.CreateMemoryCheck(0.85) // Alert at 85%
manager.RegisterCheck(check)
```

**Properties**:

- Critical: No (warning only)
- Interval: 30 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 85%)

### Active PubSub Check

```go
adapter := health.NewPubSubAdapter(pubsubInstance)
check := health.CreateActivePubSubCheck(adapter)
manager.RegisterCheck(check)
```

**Properties**:

- Critical: No
- Interval: 60 seconds
- Timeout: 15 seconds
- Test: Publish/subscribe loopback with unique message
- Validates: End-to-end PubSub functionality

**Test Flow**:

1. Subscribe to test topic `CHORUS/health-test/v1`
2. Publish unique test message with timestamp
3. Wait for message receipt (max 10 seconds)
4. Verify message integrity
5. Report success or timeout

### Active DHT Check

```go
adapter := health.NewDHTAdapter(dhtInstance)
check := health.CreateActiveDHTCheck(adapter)
manager.RegisterCheck(check)
```

**Properties**:

- Critical: No
- Interval: 90 seconds
- Timeout: 20 seconds
- Test: Put/get operation with unique key
- Validates: DHT storage and retrieval integrity

**Test Flow**:

1. Generate unique test key and value
2. Perform DHT put operation
3. Wait for propagation (100ms)
4. Perform DHT get operation
5. Verify retrieved value matches original
6. Report success, failure, or integrity violation

## Enhanced Health Checks

### Overview

The `EnhancedHealthChecks` system provides advanced monitoring with metrics tracking, historical data, and comprehensive component health scoring.

### Initialization

```go
import "chorus/pkg/health"

// Create enhanced health monitoring
enhanced := health.NewEnhancedHealthChecks(
    manager,        // Health manager
    electionMgr,    // Election manager
    dhtInstance,    // DHT instance
    pubsubInstance, // PubSub instance
    replicationMgr, // Replication manager
    logger,         // Logger
)

// Enhanced checks are automatically registered
```

### Configuration

```go
type HealthConfig struct {
    // Active probe intervals
    PubSubProbeInterval   time.Duration // Default: 30s
    DHTProbeInterval      time.Duration // Default: 60s
    ElectionProbeInterval time.Duration // Default: 15s

    // Probe timeouts
    PubSubProbeTimeout   time.Duration // Default: 10s
    DHTProbeTimeout      time.Duration // Default: 20s
    ElectionProbeTimeout time.Duration // Default: 5s

    // Thresholds
    MaxFailedProbes   int     // Default: 3
    HealthyThreshold  float64 // Default: 0.95
    DegradedThreshold float64 // Default: 0.75

    // History retention
    MaxHistoryEntries      int           // Default: 1000
    HistoryCleanupInterval time.Duration // Default: 1h

    // Enable/disable specific checks
    EnablePubSubProbes      bool // Default: true
    EnableDHTProbes         bool // Default: true
    EnableElectionProbes    bool // Default: true
    EnableReplicationProbes bool // Default: true
}

// Use custom configuration
config := health.DefaultHealthConfig()
config.PubSubProbeInterval = 45 * time.Second
config.HealthyThreshold = 0.98
// Note: config is an unexported field, so this assignment only
// compiles from code inside package health.
enhanced.config = config
```

### Enhanced Health Checks Registered

#### 1. Enhanced PubSub Check

- **Name**: `pubsub-enhanced`
- **Critical**: Yes
- **Interval**: Configurable (default: 30s)
- **Features**:
  - Loopback message testing
  - Success rate tracking
  - Consecutive failure counting
  - Latency measurement
  - Health score calculation

#### 2. Enhanced DHT Check

- **Name**: `dht-enhanced`
- **Critical**: Yes
- **Interval**: Configurable (default: 60s)
- **Features**:
  - Put/get operation testing
  - Data integrity verification
  - Replication health monitoring
  - Success rate tracking
  - Latency measurement

#### 3. Election Health Check

- **Name**: `election-health`
- **Critical**: No
- **Interval**: Configurable (default: 15s)
- **Features**:
  - Election state monitoring
  - Heartbeat status tracking
  - Leadership stability calculation
  - Admin uptime tracking

#### 4. Replication Health Check

- **Name**: `replication-health`
- **Critical**: No
- **Interval**: 120 seconds
- **Features**:
  - Replication metrics monitoring
  - Failure rate tracking
  - Average replication factor
  - Provider record counting

#### 5. P2P Connectivity Check

- **Name**: `p2p-connectivity`
- **Critical**: Yes
- **Interval**: 30 seconds
- **Features**:
  - Connected peer counting
  - Minimum peer threshold validation
  - Connectivity score calculation

#### 6. Resource Health Check

- **Name**: `resource-health`
- **Critical**: No
- **Interval**: 60 seconds
- **Features**:
  - CPU usage monitoring
  - Memory usage monitoring
  - Disk usage monitoring
  - Threshold-based alerting

#### 7. Task Manager Check

- **Name**: `task-manager`
- **Critical**: No
- **Interval**: 30 seconds
- **Features**:
  - Active task counting
  - Queue depth monitoring
  - Task success rate tracking
  - Capacity monitoring

### Health Metrics

```go
type HealthMetrics struct {
    // Overall system health
    SystemHealthScore   float64 // 0.0-1.0
    LastFullHealthCheck time.Time
    TotalHealthChecks   int64
    FailedHealthChecks  int64

    // PubSub metrics
    PubSubHealthScore      float64
    PubSubProbeLatency     time.Duration
    PubSubSuccessRate      float64
    PubSubLastSuccess      time.Time
    PubSubConsecutiveFails int

    // DHT metrics
    DHTHealthScore       float64
    DHTProbeLatency      time.Duration
    DHTSuccessRate       float64
    DHTLastSuccess       time.Time
    DHTConsecutiveFails  int
    DHTReplicationStatus map[string]*ReplicationStatus

    // Election metrics
    ElectionHealthScore  float64
    ElectionStability    float64
    HeartbeatLatency     time.Duration
    LeadershipChanges    int64
    LastLeadershipChange time.Time
    AdminUptime          time.Duration

    // Network metrics
    P2PConnectedPeers    int
    P2PConnectivityScore float64
    NetworkLatency       time.Duration

    // Resource metrics
    CPUUsage    float64
    MemoryUsage float64
    DiskUsage   float64

    // Service metrics
    ActiveTasks     int
    QueuedTasks     int
    TaskSuccessRate float64
}

// Access metrics
metrics := enhanced.GetHealthMetrics()
fmt.Printf("System Health Score: %.2f\n", metrics.SystemHealthScore)
fmt.Printf("PubSub Health: %.2f\n", metrics.PubSubHealthScore)
fmt.Printf("DHT Health: %.2f\n", metrics.DHTHealthScore)
```

### Health Summary

```go
summary := enhanced.GetHealthSummary()

// Returns:
// {
//   "status": "healthy",
//   "overall_score": 0.96,
//   "last_check": "2025-09-30T10:30:00Z",
//   "total_checks": 1523,
//   "component_scores": {
//     "pubsub": 0.98,
//     "dht": 0.95,
//     "election": 0.92,
//     "p2p": 1.0
//   },
//   "key_metrics": {
//     "connected_peers": 5,
//     "active_tasks": 3,
//     "admin_uptime": "2h30m15s",
//     "leadership_changes": 2,
//     "resource_utilization": {
//       "cpu": 0.45,
//       "memory": 0.62,
//       "disk": 0.73
//     }
//   }
// }
```

## Adapter System

### PubSub Adapter
Adapts the CHORUS PubSub system to the health check interface:

```go
type PubSubInterface interface {
    SubscribeToTopic(topic string, handler func([]byte)) error
    PublishToTopic(topic string, data interface{}) error
}

// Create adapter
adapter := health.NewPubSubAdapter(pubsubInstance)

// Use in health checks
check := health.CreateActivePubSubCheck(adapter)
```

### DHT Adapter

Adapts various DHT implementations to the health check interface:

```go
type DHTInterface interface {
    PutValue(ctx context.Context, key string, value []byte) error
    GetValue(ctx context.Context, key string) ([]byte, error)
}

// Create adapter (supports multiple DHT types)
adapter := health.NewDHTAdapter(dhtInstance)

// Use in health checks
check := health.CreateActiveDHTCheck(adapter)
```

**Supported DHT Types**:

- `*dht.LibP2PDHT`
- `*dht.MockDHTInterface`
- `*dht.EncryptedDHTStorage`

### Mock Adapters

For testing without real infrastructure:

```go
// Mock PubSub
mockPubSub := health.NewMockPubSubAdapter()
check := health.CreateActivePubSubCheck(mockPubSub)

// Mock DHT
mockDHT := health.NewMockDHTAdapter()
check := health.CreateActiveDHTCheck(mockDHT)
```

## Integration with Graceful Shutdown

### Critical Health Check Failures

When a critical health check fails, the health manager can trigger graceful shutdown:

```go
// Connect managers
healthMgr.SetShutdownManager(shutdownMgr)

// Register critical check
criticalCheck := &health.HealthCheck{
    Name:     "database-connectivity",
    Critical: true, // Failure triggers shutdown
    Checker: func(ctx context.Context) health.CheckResult {
        // Check logic
    },
}
healthMgr.RegisterCheck(criticalCheck)

// If check fails, shutdown is automatically initiated
```

### Shutdown Integration Example

```go
import (
    "chorus/pkg/health"
    "chorus/pkg/shutdown"
)

// Create managers
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
healthMgr := health.NewManager("node-123", "v1.0.0", logger)
healthMgr.SetShutdownManager(shutdownMgr)

// Register health manager for shutdown
healthComponent := shutdown.NewGenericComponent("health-manager", 10, true).
    SetShutdownFunc(func(ctx context.Context) error {
        return healthMgr.Stop()
    })
shutdownMgr.Register(healthComponent)

// Add pre-shutdown hook to update health status
shutdownMgr.AddHook(shutdown.PhasePreShutdown, func(ctx context.Context) error {
    status := healthMgr.GetStatus()
    status.Status = health.StatusStopping
    status.Message = "System is shutting down"
    return nil
})

// Start systems
healthMgr.Start()
healthMgr.StartHTTPServer(8081)
shutdownMgr.Start()

// Wait for shutdown
shutdownMgr.Wait()
```

## Health Check Best Practices

### Design Principles

1. **Fast Execution**: Health checks should complete quickly (< 5 seconds typical)
2. **Idempotent**: Checks should be safe to run repeatedly without side effects
3. **Isolated**: Checks should not depend on other checks
4. **Meaningful**: Checks should validate actual functionality, not just existence
5. **Critical vs Warning**: Reserve critical status for failures that prevent core functionality

### Check Intervals

```go
// Critical infrastructure: Check frequently
databaseCheck.Interval = 15 * time.Second

// Expensive operations: Check less frequently
replicationCheck.Interval = 120 * time.Second

// Active probes: Balance thoroughness with overhead
pubsubProbe.Interval = 60 * time.Second
```

### Timeout Configuration

```go
// Fast checks: Short timeout
connectivityCheck.Timeout = 5 * time.Second

// Network operations: Longer timeout
dhtProbe.Timeout = 20 * time.Second

// Complex operations: Generous timeout
systemCheck.Timeout = 30 * time.Second
```

### Critical Check Guidelines

Mark a check as critical when:

- Failure prevents core system functionality
- Continued operation would cause data corruption
- User-facing services become unavailable
- System cannot safely recover automatically

Do NOT mark as critical when:

- Failure is temporary or transient
- System can operate in degraded mode
- Alternative mechanisms exist
- Recovery is possible without restart

### Error Handling

```go
Checker: func(ctx context.Context) health.CheckResult {
    // Handle context cancellation
    select {
    case <-ctx.Done():
        return health.CheckResult{
            Healthy:   false,
            Message:   "Check cancelled",
            Error:     ctx.Err(),
            Timestamp: time.Now(),
        }
    default:
    }

    // Perform check with timeout
    result := make(chan error, 1)
    go func() {
        result <- performCheck()
    }()

    select {
    case err := <-result:
        if err != nil {
            return health.CheckResult{
                Healthy:   false,
                Message:   fmt.Sprintf("Check failed: %v", err),
                Error:     err,
                Timestamp: time.Now(),
            }
        }
        return health.CheckResult{
            Healthy:   true,
            Message:   "Check passed",
            Timestamp: time.Now(),
        }
    case <-ctx.Done():
        return health.CheckResult{
            Healthy:   false,
            Message:   "Check timeout",
            Error:     ctx.Err(),
            Timestamp: time.Now(),
        }
    }
}
```

### Detailed Results

Provide structured details for debugging:

```go
return health.CheckResult{
    Healthy: true,
    Message: "P2P network healthy",
    Details: map[string]interface{}{
        "connected_peers": 5,
        "min_peers":       3,
        "max_peers":       20,
        "current_usage":   "25%",
        "peer_quality":    0.85,
        "network_latency": "50ms",
    },
    Timestamp: time.Now(),
}
```

## Custom Health Checks

### Simple Health Check

```go
simpleCheck := &health.HealthCheck{
    Name:        "my-service",
    Description: "Custom service health check",
    Enabled:     true,
    Critical:    false,
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        // Your check logic
        healthy := checkMyService()
        return health.CheckResult{
            Healthy:   healthy,
            Message:   "Service status",
            Timestamp: time.Now(),
        }
    },
}
manager.RegisterCheck(simpleCheck)
```

### Health Check with Metrics

```go
type ServiceHealthCheck struct {
    service          MyService
    metricsCollector *metrics.CHORUSMetrics
}

func (s *ServiceHealthCheck) Check(ctx context.Context) health.CheckResult {
    start := time.Now()
    err := s.service.Ping(ctx)
    latency := time.Since(start)

    if err != nil {
        s.metricsCollector.IncrementHealthCheckFailed("my-service", err.Error())
        return health.CheckResult{
            Healthy:   false,
            Message:   fmt.Sprintf("Service unavailable: %v", err),
            Error:     err,
            Latency:   latency,
            Timestamp: time.Now(),
        }
    }

    s.metricsCollector.IncrementHealthCheckPassed("my-service")
    return health.CheckResult{
        Healthy:   true,
        Message:   "Service available",
        Latency:   latency,
        Timestamp: time.Now(),
        Details: map[string]interface{}{
            "latency_ms": latency.Milliseconds(),
            "version":    s.service.Version(),
        },
    }
}
```

### Stateful Health Check

```go
type StatefulHealthCheck struct {
    consecutiveFailures int
    maxFailures         int
}

func (s *StatefulHealthCheck) Check(ctx context.Context) health.CheckResult {
    healthy := performCheck()
    if !healthy {
        s.consecutiveFailures++
    } else {
        s.consecutiveFailures = 0
    }

    // Only report unhealthy after multiple failures
    if s.consecutiveFailures >= s.maxFailures {
        return health.CheckResult{
            Healthy: false,
            Message: fmt.Sprintf("Failed %d consecutive checks", s.consecutiveFailures),
            Details: map[string]interface{}{
                "consecutive_failures": s.consecutiveFailures,
                "threshold":            s.maxFailures,
            },
            Timestamp: time.Now(),
        }
    }

    return health.CheckResult{
        Healthy:   true,
        Message:   "Check passed",
        Timestamp: time.Now(),
    }
}
```

## Monitoring and Alerting

### Prometheus Integration

Health check results are automatically exposed as metrics:

```promql
# Health check success rate
rate(chorus_health_checks_passed_total[5m]) /
  (rate(chorus_health_checks_passed_total[5m]) + rate(chorus_health_checks_failed_total[5m]))

# System health score
chorus_system_health_score

# Component health
chorus_component_health_score{component="dht"}
```

### Alert Rules

```yaml
groups:
  - name: health_alerts
    interval: 30s
    rules:
      - alert: HealthCheckFailing
        # increase() is used because a raw counter stays above 0
        # forever after its first recorded failure.
        expr: increase(chorus_health_checks_failed_total{check_name="database-connectivity"}[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical health check failing"
          description: "{{ $labels.check_name }} has been failing for 5 minutes"

      - alert: LowSystemHealth
        expr: chorus_system_health_score < 0.75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low system health score"
          description: "System health score: {{ $value }}"

      - alert: ComponentDegraded
        expr: chorus_component_health_score < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Component {{ $labels.component }} degraded"
          description: "Health score: {{ $value }}"
```

## Troubleshooting

### Health Check Not Running

```bash
# Check if health manager is started
curl http://localhost:8081/health/checks

# Verify check is registered and enabled
# Look for "enabled": true in response

# Check application logs for errors
grep "health check" /var/log/chorus/chorus.log
```

### Health Check Timeouts

```go
// Increase timeout for slow operations
check.Timeout = 30 * time.Second

// Add timeout monitoring
Checker: func(ctx context.Context) health.CheckResult {
    deadline, ok := ctx.Deadline()
    if ok {
        log.Printf("Check deadline: %v (%.2fs remaining)",
            deadline, time.Until(deadline).Seconds())
    }
    // ... check logic
}
```

### False Positives

```go
// Add retry logic
attempts := 3
for i := 0; i < attempts; i++ {
    if checkPasses() {
        return health.CheckResult{Healthy: true, ...}
    }
    if i < attempts-1 {
        time.Sleep(100 * time.Millisecond)
    }
}
return health.CheckResult{Healthy: false, ...}
```

### High Memory Usage

```go
// Limit check history
config := health.DefaultHealthConfig()
config.MaxHistoryEntries = 500                   // Reduce from default 1000
config.HistoryCleanupInterval = 30 * time.Minute // More frequent cleanup
```

## Testing

### Unit Testing Health Checks

```go
func TestHealthCheck(t *testing.T) {
    // Create check
    check := &health.HealthCheck{
        Name:    "test-check",
        Enabled: true,
        Timeout: 5 * time.Second,
        Checker: func(ctx context.Context) health.CheckResult {
            // Test logic
            return health.CheckResult{
                Healthy:   true,
                Message:   "Test passed",
                Timestamp: time.Now(),
            }
        },
    }

    // Execute check
    ctx := context.Background()
    result := check.Checker(ctx)

    // Verify result
    assert.True(t, result.Healthy)
    assert.Equal(t, "Test passed", result.Message)
}
```

### Integration Testing with Mocks

```go
func TestHealthManager(t *testing.T) {
    logger := &testLogger{}
    manager := health.NewManager("test-node", "v1.0.0", logger)

    // Register mock check
    mockCheck := &health.HealthCheck{
        Name:     "mock-check",
        Enabled:  true,
        Interval: 1 * time.Second,
        Timeout:  500 * time.Millisecond,
        Checker: func(ctx context.Context) health.CheckResult {
            return health.CheckResult{
                Healthy:   true,
                Message:   "Mock check passed",
                Timestamp: time.Now(),
            }
        },
    }
    manager.RegisterCheck(mockCheck)

    // Start manager
    err := manager.Start()
    assert.NoError(t, err)

    // Wait for check execution
    time.Sleep(2 * time.Second)

    // Verify status
    status := manager.GetStatus()
    assert.Equal(t, health.StatusHealthy, status.Status)
    assert.Contains(t, status.Checks, "mock-check")

    // Stop manager
    err = manager.Stop()
    assert.NoError(t, err)
}
```

## Related Documentation

- [Metrics Package Documentation](./metrics.md)
- [Shutdown Package Documentation](./shutdown.md)
- [Kubernetes Health Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)
- [CHORUS Election System](../modules/election.md)
- [CHORUS DHT System](../modules/dht.md)
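
## Appendix: Status Aggregation Sketch

The Status Levels and HTTP Endpoints sections describe how individual check results roll up into an overall status and a response code. The standalone sketch below illustrates that mapping as the text states it. It is not the `Manager` implementation: `checkOutcome`, `aggregate`, `healthCode`, and `liveCode` are hypothetical names introduced here for illustration, and the real package may weigh additional state such as startup, history, or health scores.

```go
package main

import (
	"fmt"
	"net/http"
)

// Status mirrors the status levels listed in this document.
type Status string

const (
	StatusHealthy   Status = "healthy"
	StatusDegraded  Status = "degraded"
	StatusUnhealthy Status = "unhealthy"
	StatusStarting  Status = "starting"
	StatusStopping  Status = "stopping"
)

// checkOutcome is a hypothetical, reduced view of one check: only the
// two fields that matter for aggregation.
type checkOutcome struct {
	Healthy  bool
	Critical bool
}

// aggregate derives an overall status: any failing critical check gives
// unhealthy, any failing non-critical check gives degraded, else healthy.
func aggregate(outcomes map[string]checkOutcome) Status {
	status := StatusHealthy
	for _, o := range outcomes {
		if o.Healthy {
			continue
		}
		if o.Critical {
			return StatusUnhealthy
		}
		status = StatusDegraded
	}
	return status
}

// healthCode maps a status to the documented /health and /health/ready
// codes: healthy and degraded yield 200, everything else 503.
func healthCode(s Status) int {
	switch s {
	case StatusHealthy, StatusDegraded:
		return http.StatusOK
	default:
		return http.StatusServiceUnavailable
	}
}

// liveCode maps a status to the documented /health/live code:
// 503 only while the process is stopping.
func liveCode(s Status) int {
	if s == StatusStopping {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

func main() {
	outcomes := map[string]checkOutcome{
		"database-connectivity": {Healthy: true, Critical: true},
		"disk-space":            {Healthy: false, Critical: false},
	}
	s := aggregate(outcomes)
	fmt.Println(s, healthCode(s), liveCode(s)) // prints "degraded 200 200"
}
```

Under these assumptions a node with a failing non-critical check keeps receiving traffic, because readiness still returns 200, while a failing critical check removes it from rotation and, per the shutdown integration described above, may also start a graceful shutdown.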