# CHORUS Health Package

## Overview

The `pkg/health` package provides comprehensive health monitoring and readiness/liveness probe capabilities for the CHORUS distributed system. It orchestrates health checks across all system components, integrates with graceful shutdown, and exposes Kubernetes-compatible health endpoints.

## Architecture

### Core Components

1. **Manager**: Central health check orchestration and HTTP endpoint management
2. **HealthCheck**: Individual health check definitions with configurable intervals
3. **EnhancedHealthChecks**: Advanced health monitoring with metrics and history
4. **Adapters**: Integration layer for PubSub, DHT, and other subsystems
5. **SystemStatus**: Aggregated health status representation

### Health Check Types

- **Critical Checks**: Failures trigger graceful shutdown
- **Non-Critical Checks**: Failures degrade health status but don't trigger shutdown
- **Active Probes**: Synthetic tests that verify end-to-end functionality
- **Passive Checks**: Monitor existing system state without creating load

## Core Types

### HealthCheck

```go
type HealthCheck struct {
    Name        string                                // Unique check identifier
    Description string                                // Human-readable description
    Checker     func(ctx context.Context) CheckResult // Check execution function
    Interval    time.Duration                         // Check frequency (default: 30s)
    Timeout     time.Duration                         // Check timeout (default: 10s)
    Enabled     bool                                  // Enable/disable check
    Critical    bool                                  // If true, failure triggers shutdown
    LastRun     time.Time                             // Timestamp of last execution
    LastResult  *CheckResult                          // Most recent check result
}
```

### CheckResult

```go
type CheckResult struct {
    Healthy   bool                   // Check passed/failed
    Message   string                 // Human-readable result message
    Details   map[string]interface{} // Additional structured information
    Latency   time.Duration          // Check execution time
    Timestamp time.Time              // Result timestamp
    Error     error                  // Error details if check failed
}
```

### SystemStatus

```go
type SystemStatus struct {
    Status     Status                  // Overall status enum
    Message    string                  // Status description
    Checks     map[string]*CheckResult // All check results
    Uptime     time.Duration           // System uptime
    StartTime  time.Time               // System start timestamp
    LastUpdate time.Time               // Last status update
    Version    string                  // CHORUS version
    NodeID     string                  // Node identifier
}
```

### Status Levels

```go
const (
    StatusHealthy   Status = "healthy"   // All checks passing
    StatusDegraded  Status = "degraded"  // Some non-critical checks failing
    StatusUnhealthy Status = "unhealthy" // Critical checks failing
    StatusStarting  Status = "starting"  // System initializing
    StatusStopping  Status = "stopping"  // Graceful shutdown in progress
)
```

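How the overall `Status` is derived from individual results is an internal detail of the Manager; a plausible aggregation, shown here only as an illustrative sketch (not the Manager's exact logic), is that any failing critical check makes the system unhealthy while any other failure merely degrades it:

```go
// deriveStatus is an illustrative aggregation, not the actual Manager code:
// a failing critical check yields StatusUnhealthy, any other failure yields
// StatusDegraded, otherwise the system is StatusHealthy.
func deriveStatus(checks map[string]*health.CheckResult, critical map[string]bool) health.Status {
    status := health.StatusHealthy
    for name, result := range checks {
        if result == nil || result.Healthy {
            continue
        }
        if critical[name] {
            return health.StatusUnhealthy
        }
        status = health.StatusDegraded
    }
    return status
}
```
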
## Manager

### Initialization

```go
import "chorus/pkg/health"

// Create health manager
logger := yourLogger // Implements health.Logger interface
manager := health.NewManager("node-123", "v1.0.0", logger)

// Connect to shutdown manager for critical failures
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
manager.SetShutdownManager(shutdownMgr)
```

### Registration System

```go
// Register a health check
check := &health.HealthCheck{
    Name:        "database-connectivity",
    Description: "PostgreSQL database connectivity check",
    Enabled:     true,
    Critical:    true, // Failure triggers shutdown
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        // Perform health check
        err := db.PingContext(ctx)
        if err != nil {
            return health.CheckResult{
                Healthy:   false,
                Message:   fmt.Sprintf("Database ping failed: %v", err),
                Error:     err,
                Timestamp: time.Now(),
            }
        }
        return health.CheckResult{
            Healthy:   true,
            Message:   "Database connectivity OK",
            Timestamp: time.Now(),
        }
    },
}

manager.RegisterCheck(check)

// Unregister when no longer needed
manager.UnregisterCheck("database-connectivity")
```

### Lifecycle Management

```go
// Start health monitoring
if err := manager.Start(); err != nil {
    log.Fatalf("Failed to start health manager: %v", err)
}

// Start HTTP server for health endpoints
if err := manager.StartHTTPServer(8081); err != nil {
    log.Fatalf("Failed to start health HTTP server: %v", err)
}

// ... application runs ...

// Stop health monitoring during shutdown
if err := manager.Stop(); err != nil {
    log.Printf("Error stopping health manager: %v", err)
}
```

## HTTP Endpoints

### /health - Overall Health Status

**Method**: GET
**Description**: Returns comprehensive system health status

**Response Codes**:
- `200 OK`: System is healthy or degraded
- `503 Service Unavailable`: System is unhealthy, starting, or stopping

**Response Schema**:
```json
{
    "status": "healthy",
    "message": "All health checks passing",
    "checks": {
        "database-connectivity": {
            "healthy": true,
            "message": "Database connectivity OK",
            "latency": 15000000,
            "timestamp": "2025-09-30T10:30:00Z"
        },
        "p2p-connectivity": {
            "healthy": true,
            "message": "5 peers connected",
            "details": {
                "connected_peers": 5,
                "min_peers": 3
            },
            "latency": 8000000,
            "timestamp": "2025-09-30T10:30:05Z"
        }
    },
    "uptime": 86400000000000,
    "start_time": "2025-09-29T10:30:00Z",
    "last_update": "2025-09-30T10:30:05Z",
    "version": "v1.0.0",
    "node_id": "node-123"
}
```

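A client can poll this endpoint and decode the JSON above. The following sketch uses only the Go standard library; the struct fields mirror the schema shown (error handling kept minimal):

```go
// Query /health and print a one-line summary.
// A 503 response still carries a JSON body describing why the node is not healthy.
resp, err := http.Get("http://localhost:8081/health")
if err != nil {
    log.Fatalf("health request failed: %v", err)
}
defer resp.Body.Close()

var status struct {
    Status  string `json:"status"`
    Message string `json:"message"`
    NodeID  string `json:"node_id"`
}
if err := json.NewDecoder(resp.Body).Decode(&status); err != nil {
    log.Fatalf("decoding health response failed: %v", err)
}
fmt.Printf("node %s: %s (%s) [HTTP %d]\n", status.NodeID, status.Status, status.Message, resp.StatusCode)
```
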
### /health/ready - Readiness Probe

**Method**: GET
**Description**: Kubernetes readiness probe - indicates if node can handle requests

**Response Codes**:
- `200 OK`: Node is ready (healthy or degraded)
- `503 Service Unavailable`: Node is not ready

**Response Schema**:
```json
{
    "ready": true,
    "status": "healthy",
    "message": "All health checks passing"
}
```

**Usage**: Use for Kubernetes readiness probes to control traffic routing

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```

### /health/live - Liveness Probe

**Method**: GET
**Description**: Kubernetes liveness probe - indicates if node is alive

**Response Codes**:
- `200 OK`: Process is alive (not stopping)
- `503 Service Unavailable`: Process is stopping

**Response Schema**:
```json
{
    "live": true,
    "status": "healthy",
    "uptime": "24h0m0s"
}
```

**Usage**: Use for Kubernetes liveness probes to restart unhealthy pods

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8081
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

### /health/checks - Detailed Check Results

**Method**: GET
**Description**: Returns detailed results for all registered health checks

**Response Schema**:
```json
{
    "checks": {
        "database-connectivity": {
            "healthy": true,
            "message": "Database connectivity OK",
            "latency": 15000000,
            "timestamp": "2025-09-30T10:30:00Z"
        },
        "p2p-connectivity": {
            "healthy": true,
            "message": "5 peers connected",
            "details": {
                "connected_peers": 5,
                "min_peers": 3
            },
            "latency": 8000000,
            "timestamp": "2025-09-30T10:30:05Z"
        }
    },
    "total": 2,
    "timestamp": "2025-09-30T10:30:10Z"
}
```

## Built-in Health Checks

### Database Connectivity Check

```go
check := health.CreateDatabaseCheck("primary-db", func() error {
    return db.Ping()
})
manager.RegisterCheck(check)
```

**Properties**:
- Critical: Yes
- Interval: 30 seconds
- Timeout: 10 seconds
- Checks: Database ping/connectivity

### Disk Space Check

```go
check := health.CreateDiskSpaceCheck("/var/lib/CHORUS", 0.90) // Alert at 90%
manager.RegisterCheck(check)
```

**Properties**:
- Critical: No (warning only)
- Interval: 60 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 90%)

### Memory Usage Check

```go
check := health.CreateMemoryCheck(0.85) // Alert at 85%
manager.RegisterCheck(check)
```

**Properties**:
- Critical: No (warning only)
- Interval: 30 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 85%)

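If the built-in helper does not fit your needs, a custom memory check can follow the same shape. The sketch below uses Go runtime heap statistics as a rough proxy for memory pressure; this is an assumption made for illustration, and the built-in `CreateMemoryCheck` may measure system memory differently:

```go
// Custom memory pressure check (illustrative). Uses runtime.MemStats,
// so it reflects the Go heap rather than OS-level memory usage.
memCheck := &health.HealthCheck{
    Name:        "memory-usage-custom",
    Description: "Warns when Go heap usage exceeds 85% of reserved memory",
    Enabled:     true,
    Critical:    false,
    Interval:    30 * time.Second,
    Timeout:     5 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        usage := float64(m.HeapAlloc) / float64(m.Sys) // fraction of memory obtained from the OS
        return health.CheckResult{
            Healthy: usage < 0.85,
            Message: fmt.Sprintf("heap usage at %.0f%% of reserved memory", usage*100),
            Details: map[string]interface{}{
                "heap_alloc_bytes": m.HeapAlloc,
                "sys_bytes":        m.Sys,
            },
            Timestamp: time.Now(),
        }
    },
}
manager.RegisterCheck(memCheck)
```
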
### Active PubSub Check

```go
adapter := health.NewPubSubAdapter(pubsubInstance)
check := health.CreateActivePubSubCheck(adapter)
manager.RegisterCheck(check)
```

**Properties**:
- Critical: No
- Interval: 60 seconds
- Timeout: 15 seconds
- Test: Publish/subscribe loopback with unique message
- Validates: End-to-end PubSub functionality

**Test Flow** (sketched in code below):
1. Subscribe to test topic `CHORUS/health-test/v1`
2. Publish unique test message with timestamp
3. Wait for message receipt (max 10 seconds)
4. Verify message integrity
5. Report success or timeout

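The flow above can be approximated with the `PubSubInterface` adapter described later in this document. This is an illustrative sketch, not the exact built-in implementation; the token matching and payload shape are assumptions:

```go
// probePubSubLoopback publishes a unique token and waits for it to come back
// on the health-test topic, mirroring the built-in active PubSub probe.
func probePubSubLoopback(ctx context.Context, ps health.PubSubInterface) health.CheckResult {
    const topic = "CHORUS/health-test/v1"
    token := fmt.Sprintf("health-probe-%d", time.Now().UnixNano()) // unique test message

    received := make(chan struct{}, 1)
    if err := ps.SubscribeToTopic(topic, func(data []byte) {
        if strings.Contains(string(data), token) {
            select {
            case received <- struct{}{}:
            default:
            }
        }
    }); err != nil {
        return health.CheckResult{Healthy: false, Message: "subscribe failed", Error: err, Timestamp: time.Now()}
    }

    start := time.Now()
    if err := ps.PublishToTopic(topic, map[string]string{"token": token}); err != nil {
        return health.CheckResult{Healthy: false, Message: "publish failed", Error: err, Timestamp: time.Now()}
    }

    select {
    case <-received:
        return health.CheckResult{Healthy: true, Message: "PubSub loopback OK", Latency: time.Since(start), Timestamp: time.Now()}
    case <-time.After(10 * time.Second): // max wait per the flow above
        return health.CheckResult{Healthy: false, Message: "PubSub loopback timed out", Timestamp: time.Now()}
    case <-ctx.Done():
        return health.CheckResult{Healthy: false, Message: "probe cancelled", Error: ctx.Err(), Timestamp: time.Now()}
    }
}
```
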
### Active DHT Check

```go
adapter := health.NewDHTAdapter(dhtInstance)
check := health.CreateActiveDHTCheck(adapter)
manager.RegisterCheck(check)
```

**Properties**:
- Critical: No
- Interval: 90 seconds
- Timeout: 20 seconds
- Test: Put/get operation with unique key
- Validates: DHT storage and retrieval integrity

**Test Flow** (sketched in code below):
1. Generate unique test key and value
2. Perform DHT put operation
3. Wait for propagation (100ms)
4. Perform DHT get operation
5. Verify retrieved value matches original
6. Report success, failure, or integrity violation

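A minimal sketch of this round trip using the `DHTInterface` adapter described later in this document (the key format is an assumption made for illustration):

```go
// probeDHTRoundTrip writes a unique value, reads it back, and verifies integrity.
func probeDHTRoundTrip(ctx context.Context, d health.DHTInterface) health.CheckResult {
    key := fmt.Sprintf("/chorus/health-test/%d", time.Now().UnixNano()) // unique, illustrative key format
    want := []byte(fmt.Sprintf("probe-%d", time.Now().UnixNano()))

    start := time.Now()
    if err := d.PutValue(ctx, key, want); err != nil {
        return health.CheckResult{Healthy: false, Message: "DHT put failed", Error: err, Timestamp: time.Now()}
    }

    time.Sleep(100 * time.Millisecond) // allow propagation, per the flow above

    got, err := d.GetValue(ctx, key)
    if err != nil {
        return health.CheckResult{Healthy: false, Message: "DHT get failed", Error: err, Timestamp: time.Now()}
    }
    if !bytes.Equal(got, want) {
        return health.CheckResult{Healthy: false, Message: "DHT integrity violation: retrieved value differs", Timestamp: time.Now()}
    }
    return health.CheckResult{Healthy: true, Message: "DHT put/get round trip OK", Latency: time.Since(start), Timestamp: time.Now()}
}
```
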
## Enhanced Health Checks

### Overview

The `EnhancedHealthChecks` system provides advanced monitoring with metrics tracking, historical data, and comprehensive component health scoring.

### Initialization

```go
import "chorus/pkg/health"

// Create enhanced health monitoring
enhanced := health.NewEnhancedHealthChecks(
    manager,        // Health manager
    electionMgr,    // Election manager
    dhtInstance,    // DHT instance
    pubsubInstance, // PubSub instance
    replicationMgr, // Replication manager
    logger,         // Logger
)

// Enhanced checks are automatically registered
```

### Configuration

```go
type HealthConfig struct {
    // Active probe intervals
    PubSubProbeInterval   time.Duration // Default: 30s
    DHTProbeInterval      time.Duration // Default: 60s
    ElectionProbeInterval time.Duration // Default: 15s

    // Probe timeouts
    PubSubProbeTimeout   time.Duration // Default: 10s
    DHTProbeTimeout      time.Duration // Default: 20s
    ElectionProbeTimeout time.Duration // Default: 5s

    // Thresholds
    MaxFailedProbes   int     // Default: 3
    HealthyThreshold  float64 // Default: 0.95
    DegradedThreshold float64 // Default: 0.75

    // History retention
    MaxHistoryEntries      int           // Default: 1000
    HistoryCleanupInterval time.Duration // Default: 1h

    // Enable/disable specific checks
    EnablePubSubProbes      bool // Default: true
    EnableDHTProbes         bool // Default: true
    EnableElectionProbes    bool // Default: true
    EnableReplicationProbes bool // Default: true
}

// Use custom configuration
config := health.DefaultHealthConfig()
config.PubSubProbeInterval = 45 * time.Second
config.HealthyThreshold = 0.98
enhanced.config = config
```

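`HealthyThreshold` and `DegradedThreshold` map a component's 0.0-1.0 health score onto a status level. An illustrative mapping (the enhanced checks may compute and apply scores differently) looks like this:

```go
// statusFromScore is an illustrative translation of a health score into a
// status level using the configured thresholds.
func statusFromScore(score float64, cfg *health.HealthConfig) health.Status {
    switch {
    case score >= cfg.HealthyThreshold: // default: 0.95
        return health.StatusHealthy
    case score >= cfg.DegradedThreshold: // default: 0.75
        return health.StatusDegraded
    default:
        return health.StatusUnhealthy
    }
}
```
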
### Enhanced Health Checks Registered

#### 1. Enhanced PubSub Check
- **Name**: `pubsub-enhanced`
- **Critical**: Yes
- **Interval**: Configurable (default: 30s)
- **Features**:
  - Loopback message testing
  - Success rate tracking
  - Consecutive failure counting
  - Latency measurement
  - Health score calculation

#### 2. Enhanced DHT Check
- **Name**: `dht-enhanced`
- **Critical**: Yes
- **Interval**: Configurable (default: 60s)
- **Features**:
  - Put/get operation testing
  - Data integrity verification
  - Replication health monitoring
  - Success rate tracking
  - Latency measurement

#### 3. Election Health Check
- **Name**: `election-health`
- **Critical**: No
- **Interval**: Configurable (default: 15s)
- **Features**:
  - Election state monitoring
  - Heartbeat status tracking
  - Leadership stability calculation
  - Admin uptime tracking

#### 4. Replication Health Check
- **Name**: `replication-health`
- **Critical**: No
- **Interval**: 120 seconds
- **Features**:
  - Replication metrics monitoring
  - Failure rate tracking
  - Average replication factor
  - Provider record counting

#### 5. P2P Connectivity Check
- **Name**: `p2p-connectivity`
- **Critical**: Yes
- **Interval**: 30 seconds
- **Features**:
  - Connected peer counting
  - Minimum peer threshold validation
  - Connectivity score calculation

#### 6. Resource Health Check
- **Name**: `resource-health`
- **Critical**: No
- **Interval**: 60 seconds
- **Features**:
  - CPU usage monitoring
  - Memory usage monitoring
  - Disk usage monitoring
  - Threshold-based alerting

#### 7. Task Manager Check
- **Name**: `task-manager`
- **Critical**: No
- **Interval**: 30 seconds
- **Features**:
  - Active task counting
  - Queue depth monitoring
  - Task success rate tracking
  - Capacity monitoring

### Health Metrics

```go
type HealthMetrics struct {
    // Overall system health
    SystemHealthScore   float64 // 0.0-1.0
    LastFullHealthCheck time.Time
    TotalHealthChecks   int64
    FailedHealthChecks  int64

    // PubSub metrics
    PubSubHealthScore      float64
    PubSubProbeLatency     time.Duration
    PubSubSuccessRate      float64
    PubSubLastSuccess      time.Time
    PubSubConsecutiveFails int

    // DHT metrics
    DHTHealthScore       float64
    DHTProbeLatency      time.Duration
    DHTSuccessRate       float64
    DHTLastSuccess       time.Time
    DHTConsecutiveFails  int
    DHTReplicationStatus map[string]*ReplicationStatus

    // Election metrics
    ElectionHealthScore  float64
    ElectionStability    float64
    HeartbeatLatency     time.Duration
    LeadershipChanges    int64
    LastLeadershipChange time.Time
    AdminUptime          time.Duration

    // Network metrics
    P2PConnectedPeers    int
    P2PConnectivityScore float64
    NetworkLatency       time.Duration

    // Resource metrics
    CPUUsage    float64
    MemoryUsage float64
    DiskUsage   float64

    // Service metrics
    ActiveTasks     int
    QueuedTasks     int
    TaskSuccessRate float64
}

// Access metrics
metrics := enhanced.GetHealthMetrics()
fmt.Printf("System Health Score: %.2f\n", metrics.SystemHealthScore)
fmt.Printf("PubSub Health: %.2f\n", metrics.PubSubHealthScore)
fmt.Printf("DHT Health: %.2f\n", metrics.DHTHealthScore)
```

### Health Summary

```go
summary := enhanced.GetHealthSummary()

// Returns:
// {
//     "status": "healthy",
//     "overall_score": 0.96,
//     "last_check": "2025-09-30T10:30:00Z",
//     "total_checks": 1523,
//     "component_scores": {
//         "pubsub": 0.98,
//         "dht": 0.95,
//         "election": 0.92,
//         "p2p": 1.0
//     },
//     "key_metrics": {
//         "connected_peers": 5,
//         "active_tasks": 3,
//         "admin_uptime": "2h30m15s",
//         "leadership_changes": 2,
//         "resource_utilization": {
//             "cpu": 0.45,
//             "memory": 0.62,
//             "disk": 0.73
//         }
//     }
// }
```

## Adapter System

### PubSub Adapter

Adapts the CHORUS PubSub system to the health check interface:

```go
type PubSubInterface interface {
    SubscribeToTopic(topic string, handler func([]byte)) error
    PublishToTopic(topic string, data interface{}) error
}

// Create adapter
adapter := health.NewPubSubAdapter(pubsubInstance)

// Use in health checks
check := health.CreateActivePubSubCheck(adapter)
```

### DHT Adapter

Adapts various DHT implementations to the health check interface:

```go
type DHTInterface interface {
    PutValue(ctx context.Context, key string, value []byte) error
    GetValue(ctx context.Context, key string) ([]byte, error)
}

// Create adapter (supports multiple DHT types)
adapter := health.NewDHTAdapter(dhtInstance)

// Use in health checks
check := health.CreateActiveDHTCheck(adapter)
```

**Supported DHT Types**:
- `*dht.LibP2PDHT`
- `*dht.MockDHTInterface`
- `*dht.EncryptedDHTStorage`

### Mock Adapters

For testing without real infrastructure:

```go
// Mock PubSub
mockPubSub := health.NewMockPubSubAdapter()
check := health.CreateActivePubSubCheck(mockPubSub)

// Mock DHT
mockDHT := health.NewMockDHTAdapter()
check := health.CreateActiveDHTCheck(mockDHT)
```

## Integration with Graceful Shutdown

### Critical Health Check Failures

When a critical health check fails, the health manager can trigger graceful shutdown:

```go
// Connect managers
healthMgr.SetShutdownManager(shutdownMgr)

// Register critical check
criticalCheck := &health.HealthCheck{
    Name:     "database-connectivity",
    Critical: true, // Failure triggers shutdown
    Checker: func(ctx context.Context) health.CheckResult {
        // Check logic
    },
}
healthMgr.RegisterCheck(criticalCheck)

// If the check fails, shutdown is automatically initiated
```

### Shutdown Integration Example

```go
import (
    "chorus/pkg/health"
    "chorus/pkg/shutdown"
)

// Create managers
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
healthMgr := health.NewManager("node-123", "v1.0.0", logger)
healthMgr.SetShutdownManager(shutdownMgr)

// Register health manager for shutdown
healthComponent := shutdown.NewGenericComponent("health-manager", 10, true).
    SetShutdownFunc(func(ctx context.Context) error {
        return healthMgr.Stop()
    })
shutdownMgr.Register(healthComponent)

// Add pre-shutdown hook to update health status
shutdownMgr.AddHook(shutdown.PhasePreShutdown, func(ctx context.Context) error {
    status := healthMgr.GetStatus()
    status.Status = health.StatusStopping
    status.Message = "System is shutting down"
    return nil
})

// Start systems
healthMgr.Start()
healthMgr.StartHTTPServer(8081)
shutdownMgr.Start()

// Wait for shutdown
shutdownMgr.Wait()
```

## Health Check Best Practices

### Design Principles

1. **Fast Execution**: Health checks should complete quickly (< 5 seconds typical)
2. **Idempotent**: Checks should be safe to run repeatedly without side effects
3. **Isolated**: Checks should not depend on other checks
4. **Meaningful**: Checks should validate actual functionality, not just existence
5. **Critical vs Warning**: Reserve critical status for failures that prevent core functionality

### Check Intervals

```go
// Critical infrastructure: Check frequently
databaseCheck.Interval = 15 * time.Second

// Expensive operations: Check less frequently
replicationCheck.Interval = 120 * time.Second

// Active probes: Balance thoroughness with overhead
pubsubProbe.Interval = 60 * time.Second
```

### Timeout Configuration

```go
// Fast checks: Short timeout
connectivityCheck.Timeout = 5 * time.Second

// Network operations: Longer timeout
dhtProbe.Timeout = 20 * time.Second

// Complex operations: Generous timeout
systemCheck.Timeout = 30 * time.Second
```

### Critical Check Guidelines

Mark a check as critical when:
- Failure prevents core system functionality
- Continued operation would cause data corruption
- User-facing services become unavailable
- The system cannot safely recover automatically

Do NOT mark a check as critical when:
- Failure is temporary or transient
- The system can operate in degraded mode
- Alternative mechanisms exist
- Recovery is possible without restart

### Error Handling

```go
Checker: func(ctx context.Context) health.CheckResult {
    // Handle context cancellation
    select {
    case <-ctx.Done():
        return health.CheckResult{
            Healthy:   false,
            Message:   "Check cancelled",
            Error:     ctx.Err(),
            Timestamp: time.Now(),
        }
    default:
    }

    // Perform check with timeout
    result := make(chan error, 1)
    go func() {
        result <- performCheck()
    }()

    select {
    case err := <-result:
        if err != nil {
            return health.CheckResult{
                Healthy:   false,
                Message:   fmt.Sprintf("Check failed: %v", err),
                Error:     err,
                Timestamp: time.Now(),
            }
        }
        return health.CheckResult{
            Healthy:   true,
            Message:   "Check passed",
            Timestamp: time.Now(),
        }
    case <-ctx.Done():
        return health.CheckResult{
            Healthy:   false,
            Message:   "Check timeout",
            Error:     ctx.Err(),
            Timestamp: time.Now(),
        }
    }
}
```

### Detailed Results

Provide structured details for debugging:

```go
return health.CheckResult{
    Healthy: true,
    Message: "P2P network healthy",
    Details: map[string]interface{}{
        "connected_peers": 5,
        "min_peers":       3,
        "max_peers":       20,
        "current_usage":   "25%",
        "peer_quality":    0.85,
        "network_latency": "50ms",
    },
    Timestamp: time.Now(),
}
```

## Custom Health Checks

### Simple Health Check

```go
simpleCheck := &health.HealthCheck{
    Name:        "my-service",
    Description: "Custom service health check",
    Enabled:     true,
    Critical:    false,
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        // Your check logic
        healthy := checkMyService()

        return health.CheckResult{
            Healthy:   healthy,
            Message:   "Service status",
            Timestamp: time.Now(),
        }
    },
}

manager.RegisterCheck(simpleCheck)
```

### Health Check with Metrics

```go
type ServiceHealthCheck struct {
    service          MyService
    metricsCollector *metrics.CHORUSMetrics
}

func (s *ServiceHealthCheck) Check(ctx context.Context) health.CheckResult {
    start := time.Now()

    err := s.service.Ping(ctx)
    latency := time.Since(start)

    if err != nil {
        s.metricsCollector.IncrementHealthCheckFailed("my-service", err.Error())
        return health.CheckResult{
            Healthy:   false,
            Message:   fmt.Sprintf("Service unavailable: %v", err),
            Error:     err,
            Latency:   latency,
            Timestamp: time.Now(),
        }
    }

    s.metricsCollector.IncrementHealthCheckPassed("my-service")
    return health.CheckResult{
        Healthy:   true,
        Message:   "Service available",
        Latency:   latency,
        Timestamp: time.Now(),
        Details: map[string]interface{}{
            "latency_ms": latency.Milliseconds(),
            "version":    s.service.Version(),
        },
    }
}
```

### Stateful Health Check

```go
type StatefulHealthCheck struct {
    consecutiveFailures int
    maxFailures         int
}

func (s *StatefulHealthCheck) Check(ctx context.Context) health.CheckResult {
    healthy := performCheck()

    if !healthy {
        s.consecutiveFailures++
    } else {
        s.consecutiveFailures = 0
    }

    // Only report unhealthy after multiple failures
    if s.consecutiveFailures >= s.maxFailures {
        return health.CheckResult{
            Healthy: false,
            Message: fmt.Sprintf("Failed %d consecutive checks", s.consecutiveFailures),
            Details: map[string]interface{}{
                "consecutive_failures": s.consecutiveFailures,
                "threshold":            s.maxFailures,
            },
            Timestamp: time.Now(),
        }
    }

    return health.CheckResult{
        Healthy:   true,
        Message:   "Check passed",
        Timestamp: time.Now(),
    }
}
```

## Monitoring and Alerting

### Prometheus Integration

Health check results are automatically exposed as metrics:

```promql
# Health check success rate
rate(chorus_health_checks_passed_total[5m]) /
(rate(chorus_health_checks_passed_total[5m]) + rate(chorus_health_checks_failed_total[5m]))

# System health score
chorus_system_health_score

# Component health
chorus_component_health_score{component="dht"}
```

### Alert Rules

```yaml
groups:
  - name: health_alerts
    interval: 30s
    rules:
      - alert: HealthCheckFailing
        expr: chorus_health_checks_failed_total{check_name="database-connectivity"} > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical health check failing"
          description: "{{ $labels.check_name }} has been failing for 5 minutes"

      - alert: LowSystemHealth
        expr: chorus_system_health_score < 0.75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low system health score"
          description: "System health score: {{ $value }}"

      - alert: ComponentDegraded
        expr: chorus_component_health_score < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Component {{ $labels.component }} degraded"
          description: "Health score: {{ $value }}"
```

## Troubleshooting

### Health Check Not Running

```bash
# Check if the health manager is started
curl http://localhost:8081/health/checks

# Verify the check is registered and enabled
# Look for "enabled": true in the response

# Check application logs for errors
grep "health check" /var/log/chorus/chorus.log
```

### Health Check Timeouts

```go
// Increase timeout for slow operations
check.Timeout = 30 * time.Second

// Add timeout monitoring
Checker: func(ctx context.Context) health.CheckResult {
    deadline, ok := ctx.Deadline()
    if ok {
        log.Printf("Check deadline: %v (%.2fs remaining)",
            deadline, time.Until(deadline).Seconds())
    }
    // ... check logic
}
```

### False Positives

```go
// Add retry logic inside the checker to smooth over transient failures
attempts := 3
for i := 0; i < attempts; i++ {
    if checkPasses() {
        return health.CheckResult{
            Healthy:   true,
            Message:   "Check passed",
            Timestamp: time.Now(),
        }
    }
    if i < attempts-1 {
        time.Sleep(100 * time.Millisecond)
    }
}
return health.CheckResult{
    Healthy:   false,
    Message:   fmt.Sprintf("Check failed after %d attempts", attempts),
    Timestamp: time.Now(),
}
```

### High Memory Usage

```go
// Limit check history
config := health.DefaultHealthConfig()
config.MaxHistoryEntries = 500                    // Reduce from default 1000
config.HistoryCleanupInterval = 30 * time.Minute  // More frequent cleanup
```

## Testing

### Unit Testing Health Checks

```go
func TestHealthCheck(t *testing.T) {
    // Create check
    check := &health.HealthCheck{
        Name:    "test-check",
        Enabled: true,
        Timeout: 5 * time.Second,
        Checker: func(ctx context.Context) health.CheckResult {
            // Test logic
            return health.CheckResult{
                Healthy:   true,
                Message:   "Test passed",
                Timestamp: time.Now(),
            }
        },
    }

    // Execute check
    ctx := context.Background()
    result := check.Checker(ctx)

    // Verify result
    assert.True(t, result.Healthy)
    assert.Equal(t, "Test passed", result.Message)
}
```

### Integration Testing with Mocks

```go
func TestHealthManager(t *testing.T) {
    logger := &testLogger{}
    manager := health.NewManager("test-node", "v1.0.0", logger)

    // Register mock check
    mockCheck := &health.HealthCheck{
        Name:     "mock-check",
        Enabled:  true,
        Interval: 1 * time.Second,
        Timeout:  500 * time.Millisecond,
        Checker: func(ctx context.Context) health.CheckResult {
            return health.CheckResult{
                Healthy:   true,
                Message:   "Mock check passed",
                Timestamp: time.Now(),
            }
        },
    }
    manager.RegisterCheck(mockCheck)

    // Start manager
    err := manager.Start()
    assert.NoError(t, err)

    // Wait for check execution
    time.Sleep(2 * time.Second)

    // Verify status
    status := manager.GetStatus()
    assert.Equal(t, health.StatusHealthy, status.Status)
    assert.Contains(t, status.Checks, "mock-check")

    // Stop manager
    err = manager.Stop()
    assert.NoError(t, err)
}
```

## Related Documentation

- [Metrics Package Documentation](./metrics.md)
- [Shutdown Package Documentation](./shutdown.md)
- [Kubernetes Health Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)
- [CHORUS Election System](../modules/election.md)
- [CHORUS DHT System](../modules/dht.md)