	CHORUS Health Package
Overview
The pkg/health package provides comprehensive health monitoring and readiness/liveness probe capabilities for the CHORUS distributed system. It orchestrates health checks across all system components, integrates with graceful shutdown, and exposes Kubernetes-compatible health endpoints.
Architecture
Core Components
- Manager: Central health check orchestration and HTTP endpoint management
- HealthCheck: Individual health check definitions with configurable intervals
- EnhancedHealthChecks: Advanced health monitoring with metrics and history
- Adapters: Integration layer for PubSub, DHT, and other subsystems
- SystemStatus: Aggregated health status representation
Health Check Types
- Critical Checks: Failures trigger graceful shutdown
- Non-Critical Checks: Failures degrade health status but don't trigger shutdown
- Active Probes: Synthetic tests that verify end-to-end functionality
- Passive Checks: Monitor existing system state without creating load
Core Types
HealthCheck
type HealthCheck struct {
    Name        string                                // Unique check identifier
    Description string                                // Human-readable description
    Checker     func(ctx context.Context) CheckResult // Check execution function
    Interval    time.Duration                         // Check frequency (default: 30s)
    Timeout     time.Duration                         // Check timeout (default: 10s)
    Enabled     bool                                  // Enable/disable check
    Critical    bool                                  // If true, failure triggers shutdown
    LastRun     time.Time                             // Timestamp of last execution
    LastResult  *CheckResult                          // Most recent check result
}
CheckResult
type CheckResult struct {
    Healthy    bool                   // Check passed/failed
    Message    string                 // Human-readable result message
    Details    map[string]interface{} // Additional structured information
    Latency    time.Duration          // Check execution time
    Timestamp  time.Time              // Result timestamp
    Error      error                  // Error details if check failed
}
SystemStatus
type SystemStatus struct {
    Status     Status                     // Overall status enum
    Message    string                     // Status description
    Checks     map[string]*CheckResult    // All check results
    Uptime     time.Duration              // System uptime
    StartTime  time.Time                  // System start timestamp
    LastUpdate time.Time                  // Last status update
    Version    string                     // CHORUS version
    NodeID     string                     // Node identifier
}
Status Levels
const (
    StatusHealthy   Status = "healthy"    // All checks passing
    StatusDegraded  Status = "degraded"   // Some non-critical checks failing
    StatusUnhealthy Status = "unhealthy"  // Critical checks failing
    StatusStarting  Status = "starting"   // System initializing
    StatusStopping  Status = "stopping"   // Graceful shutdown in progress
)
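The status levels combine with the critical/non-critical distinction described earlier. The sketch below shows one way an overall status could be derived from individual results. It is an illustration under assumptions, not the package's actual aggregation logic: `aggregate` and the trimmed-down `check` type are hypothetical names, and the starting/stopping states are left out.

```go
package main

import "fmt"

// Status mirrors the string enum documented above.
type Status string

const (
	StatusHealthy   Status = "healthy"
	StatusDegraded  Status = "degraded"
	StatusUnhealthy Status = "unhealthy"
)

// check is a hypothetical stand-in for a registered check plus its
// most recent result.
type check struct {
	Critical bool
	Healthy  bool
}

// aggregate derives an overall status: any failing critical check yields
// unhealthy, any failing non-critical check yields degraded, and
// otherwise the system is healthy.
func aggregate(checks map[string]check) Status {
	status := StatusHealthy
	for _, c := range checks {
		if c.Healthy {
			continue
		}
		if c.Critical {
			return StatusUnhealthy
		}
		status = StatusDegraded
	}
	return status
}

func main() {
	checks := map[string]check{
		"database-connectivity": {Critical: true, Healthy: true},
		"disk-space":            {Critical: false, Healthy: false},
	}
	fmt.Println(aggregate(checks))
}
```

In this sketch a single failing critical check dominates everything else, which matches the documented meaning of StatusUnhealthy.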
Manager
Initialization
import (
    "time"

    "chorus/pkg/health"
    "chorus/pkg/shutdown"
)
// Create health manager
logger := yourLogger // Implements health.Logger interface
manager := health.NewManager("node-123", "v1.0.0", logger)
// Connect to shutdown manager for critical failures
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
manager.SetShutdownManager(shutdownMgr)
Registration System
// Register a health check
check := &health.HealthCheck{
    Name:        "database-connectivity",
    Description: "PostgreSQL database connectivity check",
    Enabled:     true,
    Critical:    true,  // Failure triggers shutdown
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        // Perform health check
        err := db.PingContext(ctx)
        if err != nil {
            return health.CheckResult{
                Healthy:   false,
                Message:   fmt.Sprintf("Database ping failed: %v", err),
                Error:     err,
                Timestamp: time.Now(),
            }
        }
        return health.CheckResult{
            Healthy:   true,
            Message:   "Database connectivity OK",
            Timestamp: time.Now(),
        }
    },
}
manager.RegisterCheck(check)
// Unregister when no longer needed
manager.UnregisterCheck("database-connectivity")
Lifecycle Management
// Start health monitoring
if err := manager.Start(); err != nil {
    log.Fatalf("Failed to start health manager: %v", err)
}
// Start HTTP server for health endpoints
if err := manager.StartHTTPServer(8081); err != nil {
    log.Fatalf("Failed to start health HTTP server: %v", err)
}
// ... application runs ...
// Stop health monitoring during shutdown
if err := manager.Stop(); err != nil {
    log.Printf("Error stopping health manager: %v", err)
}
HTTP Endpoints
/health - Overall Health Status
Method: GET
Description: Returns comprehensive system health status
Response Codes:
- 200 OK: System is healthy or degraded
- 503 Service Unavailable: System is unhealthy, starting, or stopping
Response Schema:
{
    "status": "healthy",
    "message": "All health checks passing",
    "checks": {
        "database-connectivity": {
            "healthy": true,
            "message": "Database connectivity OK",
            "latency": 15000000,
            "timestamp": "2025-09-30T10:30:00Z"
        },
        "p2p-connectivity": {
            "healthy": true,
            "message": "5 peers connected",
            "details": {
                "connected_peers": 5,
                "min_peers": 3
            },
            "latency": 8000000,
            "timestamp": "2025-09-30T10:30:05Z"
        }
    },
    "uptime": 86400000000000,
    "start_time": "2025-09-29T10:30:00Z",
    "last_update": "2025-09-30T10:30:05Z",
    "version": "v1.0.0",
    "node_id": "node-123"
}
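A client can decode this response with plain encoding/json. The sketch below is a minimal example, assuming the JSON field names shown in the sample; the local `checkResult` and `systemStatus` types are hypothetical mirrors of the schema, not the package's exported types. It also shows that `latency` and `uptime` read naturally as `time.Duration` values expressed in nanoseconds.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// checkResult and systemStatus mirror the sample response above; the JSON
// tags are inferred from that sample.
type checkResult struct {
	Healthy   bool                   `json:"healthy"`
	Message   string                 `json:"message"`
	Details   map[string]interface{} `json:"details,omitempty"`
	Latency   time.Duration          `json:"latency"` // nanoseconds
	Timestamp time.Time              `json:"timestamp"`
}

type systemStatus struct {
	Status  string                 `json:"status"`
	Message string                 `json:"message"`
	Checks  map[string]checkResult `json:"checks"`
	Uptime  time.Duration          `json:"uptime"` // nanoseconds
	Version string                 `json:"version"`
	NodeID  string                 `json:"node_id"`
}

// decodeStatus parses a /health response body.
func decodeStatus(body []byte) (systemStatus, error) {
	var s systemStatus
	err := json.Unmarshal(body, &s)
	return s, err
}

// sample is a shortened copy of the response documented above.
const sample = `{
    "status": "healthy",
    "message": "All health checks passing",
    "checks": {
        "database-connectivity": {
            "healthy": true,
            "message": "Database connectivity OK",
            "latency": 15000000,
            "timestamp": "2025-09-30T10:30:00Z"
        }
    },
    "uptime": 86400000000000,
    "version": "v1.0.0",
    "node_id": "node-123"
}`

func main() {
	s, err := decodeStatus([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Println(s.Status, s.Uptime)
	for name, c := range s.Checks {
		fmt.Printf("%s healthy=%t latency=%s\n", name, c.Healthy, c.Latency)
	}
}
```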
/health/ready - Readiness Probe
Method: GET
Description: Kubernetes readiness probe - indicates if node can handle requests
Response Codes:
- 200 OK: Node is ready (healthy or degraded)
- 503 Service Unavailable: Node is not ready
Response Schema:
{
    "ready": true,
    "status": "healthy",
    "message": "All health checks passing"
}
Usage: Use for Kubernetes readiness probes to control traffic routing
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
/health/live - Liveness Probe
Method: GET
Description: Kubernetes liveness probe - indicates if node is alive
Response Codes:
- 200 OK: Process is alive (not stopping)
- 503 Service Unavailable: Process is stopping
Response Schema:
{
    "live": true,
    "status": "healthy",
    "uptime": "24h0m0s"
}
Usage: Use for Kubernetes liveness probes to restart unhealthy pods
livenessProbe:
  httpGet:
    path: /health/live
    port: 8081
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
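The difference between the two probes comes down to which statuses map to 200 and which to 503. The sketch below restates the documented response codes as two small functions. `readyCode` and `liveCode` are hypothetical helper names used only for illustration, and the real handlers may consider more than the overall status.

```go
package main

import (
	"fmt"
	"net/http"
)

// Status mirrors the string enum documented in Status Levels.
type Status string

const (
	StatusHealthy   Status = "healthy"
	StatusDegraded  Status = "degraded"
	StatusUnhealthy Status = "unhealthy"
	StatusStarting  Status = "starting"
	StatusStopping  Status = "stopping"
)

// readyCode follows the readiness rules: a node is ready when it is
// healthy or degraded, and not ready in every other state.
func readyCode(s Status) int {
	if s == StatusHealthy || s == StatusDegraded {
		return http.StatusOK
	}
	return http.StatusServiceUnavailable
}

// liveCode follows the liveness rules: the process counts as alive
// unless it is stopping.
func liveCode(s Status) int {
	if s == StatusStopping {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

func main() {
	for _, s := range []Status{StatusHealthy, StatusDegraded, StatusUnhealthy, StatusStarting, StatusStopping} {
		fmt.Printf("%-10s ready=%d live=%d\n", s, readyCode(s), liveCode(s))
	}
}
```

Under these rules an unhealthy node stops receiving traffic but is not reported as dead by the liveness probe.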
/health/checks - Detailed Check Results
Method: GET
Description: Returns detailed results for all registered health checks
Response Schema:
{
    "checks": {
        "database-connectivity": {
            "healthy": true,
            "message": "Database connectivity OK",
            "latency": 15000000,
            "timestamp": "2025-09-30T10:30:00Z"
        },
        "p2p-connectivity": {
            "healthy": true,
            "message": "5 peers connected",
            "details": {
                "connected_peers": 5,
                "min_peers": 3
            },
            "latency": 8000000,
            "timestamp": "2025-09-30T10:30:05Z"
        }
    },
    "total": 2,
    "timestamp": "2025-09-30T10:30:10Z"
}
Built-in Health Checks
Database Connectivity Check
check := health.CreateDatabaseCheck("primary-db", func() error {
    return db.Ping()
})
manager.RegisterCheck(check)
Properties:
- Critical: Yes
- Interval: 30 seconds
- Timeout: 10 seconds
- Checks: Database ping/connectivity
Disk Space Check
check := health.CreateDiskSpaceCheck("/var/lib/CHORUS", 0.90)  // Alert at 90%
manager.RegisterCheck(check)
Properties:
- Critical: No (warning only)
- Interval: 60 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 90%)
Memory Usage Check
check := health.CreateMemoryCheck(0.85)  // Alert at 85%
manager.RegisterCheck(check)
Properties:
- Critical: No (warning only)
- Interval: 30 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 85%)
Active PubSub Check
adapter := health.NewPubSubAdapter(pubsubInstance)
check := health.CreateActivePubSubCheck(adapter)
manager.RegisterCheck(check)
Properties:
- Critical: No
- Interval: 60 seconds
- Timeout: 15 seconds
- Test: Publish/subscribe loopback with unique message
- Validates: End-to-end PubSub functionality
Test Flow:
- Subscribe to test topic CHORUS/health-test/v1
- Publish unique test message with timestamp
- Wait for message receipt (max 10 seconds)
- Verify message integrity
- Report success or timeout
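The test flow above can be sketched against the PubSubInterface that this package defines for its adapters. This is an illustrative sketch, not the body of CreateActivePubSubCheck: `memPubSub`, `loopbackProbe`, and `runDemo` are hypothetical names, the in-memory bus stands in for the real PubSub system, and JSON encoding of the payload is an assumption.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"sync"
	"time"
)

// PubSubInterface matches the adapter interface documented in this package.
type PubSubInterface interface {
	SubscribeToTopic(topic string, handler func([]byte)) error
	PublishToTopic(topic string, data interface{}) error
}

// memPubSub is a hypothetical in-memory bus used only for this sketch.
type memPubSub struct {
	mu       sync.Mutex
	handlers map[string][]func([]byte)
}

func newMemPubSub() *memPubSub {
	return &memPubSub{handlers: make(map[string][]func([]byte))}
}

func (m *memPubSub) SubscribeToTopic(topic string, h func([]byte)) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.handlers[topic] = append(m.handlers[topic], h)
	return nil
}

func (m *memPubSub) PublishToTopic(topic string, data interface{}) error {
	payload, err := json.Marshal(data) // assumption: payloads are JSON-encoded
	if err != nil {
		return err
	}
	m.mu.Lock()
	hs := append([]func([]byte){}, m.handlers[topic]...)
	m.mu.Unlock()
	for _, h := range hs {
		go h(payload) // deliver asynchronously, like a real network
	}
	return nil
}

// loopbackProbe subscribes, publishes a unique message, waits for it to
// come back, and verifies that it arrived unchanged.
func loopbackProbe(ctx context.Context, ps PubSubInterface, topic string) error {
	want := fmt.Sprintf("probe-%d", time.Now().UnixNano())
	got := make(chan string, 1)
	err := ps.SubscribeToTopic(topic, func(b []byte) {
		var s string
		if json.Unmarshal(b, &s) == nil {
			select {
			case got <- s:
			default: // a result is already queued; drop extras
			}
		}
	})
	if err != nil {
		return err
	}
	if err := ps.PublishToTopic(topic, want); err != nil {
		return err
	}
	select {
	case s := <-got:
		if s != want {
			return errors.New("loopback message mismatch")
		}
		return nil
	case <-ctx.Done():
		return ctx.Err() // timeout or cancellation
	}
}

// runDemo runs one probe with the 10-second receipt window from the flow above.
func runDemo() error {
	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()
	return loopbackProbe(ctx, newMemPubSub(), "CHORUS/health-test/v1")
}

func main() {
	if err := runDemo(); err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("probe ok")
}
```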
Active DHT Check
adapter := health.NewDHTAdapter(dhtInstance)
check := health.CreateActiveDHTCheck(adapter)
manager.RegisterCheck(check)
Properties:
- Critical: No
- Interval: 90 seconds
- Timeout: 20 seconds
- Test: Put/get operation with unique key
- Validates: DHT storage and retrieval integrity
Test Flow:
- Generate unique test key and value
- Perform DHT put operation
- Wait for propagation (100ms)
- Perform DHT get operation
- Verify retrieved value matches original
- Report success, failure, or integrity violation
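The put/get flow above can be sketched against the DHTInterface that this package defines for its adapters. This is an illustration under assumptions rather than the body of CreateActiveDHTCheck: `memDHT`, `corruptDHT`, `dhtProbe`, `runDHTProbe`, and the key format are all hypothetical.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// DHTInterface matches the adapter interface documented in this package.
type DHTInterface interface {
	PutValue(ctx context.Context, key string, value []byte) error
	GetValue(ctx context.Context, key string) ([]byte, error)
}

var errIntegrity = errors.New("integrity violation: retrieved value differs")

// memDHT is a hypothetical in-memory store used only for this sketch.
type memDHT struct {
	mu   sync.Mutex
	data map[string][]byte
}

func newMemDHT() *memDHT { return &memDHT{data: make(map[string][]byte)} }

func (m *memDHT) PutValue(_ context.Context, key string, value []byte) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.data[key] = append([]byte(nil), value...)
	return nil
}

func (m *memDHT) GetValue(_ context.Context, key string) ([]byte, error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	v, ok := m.data[key]
	if !ok {
		return nil, errors.New("key not found")
	}
	return v, nil
}

// corruptDHT tampers with reads to exercise the integrity check.
type corruptDHT struct{ *memDHT }

func (c corruptDHT) GetValue(ctx context.Context, key string) ([]byte, error) {
	v, err := c.memDHT.GetValue(ctx, key)
	if err != nil {
		return nil, err
	}
	return append([]byte("x"), v...), nil
}

// dhtProbe stores a unique value, waits briefly for propagation, reads it
// back, and compares the two.
func dhtProbe(ctx context.Context, d DHTInterface) error {
	key := fmt.Sprintf("/health-probe/%d", time.Now().UnixNano()) // hypothetical key format
	want := []byte(key + "-value")
	if err := d.PutValue(ctx, key, want); err != nil {
		return fmt.Errorf("put: %w", err)
	}
	select {
	case <-time.After(100 * time.Millisecond): // propagation wait from the flow above
	case <-ctx.Done():
		return ctx.Err()
	}
	got, err := d.GetValue(ctx, key)
	if err != nil {
		return fmt.Errorf("get: %w", err)
	}
	if !bytes.Equal(got, want) {
		return errIntegrity
	}
	return nil
}

// runDHTProbe applies the documented 20-second check timeout.
func runDHTProbe(d DHTInterface) error {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	return dhtProbe(ctx, d)
}

func main() {
	fmt.Println("in-memory:", runDHTProbe(newMemDHT()))
	fmt.Println("corrupting:", runDHTProbe(corruptDHT{newMemDHT()}))
}
```

Keeping the integrity failure as a distinct error lets a caller tell a tampered or stale value apart from an unreachable DHT.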
Enhanced Health Checks
Overview
The EnhancedHealthChecks system provides advanced monitoring with metrics tracking, historical data, and comprehensive component health scoring.
Initialization
import "chorus/pkg/health"
// Create enhanced health monitoring
enhanced := health.NewEnhancedHealthChecks(
    manager,         // Health manager
    electionMgr,     // Election manager
    dhtInstance,     // DHT instance
    pubsubInstance,  // PubSub instance
    replicationMgr,  // Replication manager
    logger,          // Logger
)
// Enhanced checks are automatically registered
Configuration
type HealthConfig struct {
    // Active probe intervals
    PubSubProbeInterval    time.Duration  // Default: 30s
    DHTProbeInterval       time.Duration  // Default: 60s
    ElectionProbeInterval  time.Duration  // Default: 15s
    // Probe timeouts
    PubSubProbeTimeout     time.Duration  // Default: 10s
    DHTProbeTimeout        time.Duration  // Default: 20s
    ElectionProbeTimeout   time.Duration  // Default: 5s
    // Thresholds
    MaxFailedProbes        int            // Default: 3
    HealthyThreshold       float64        // Default: 0.95
    DegradedThreshold      float64        // Default: 0.75
    // History retention
    MaxHistoryEntries      int            // Default: 1000
    HistoryCleanupInterval time.Duration  // Default: 1h
    // Enable/disable specific checks
    EnablePubSubProbes     bool           // Default: true
    EnableDHTProbes        bool           // Default: true
    EnableElectionProbes   bool           // Default: true
    EnableReplicationProbes bool          // Default: true
}
// Use custom configuration
config := health.DefaultHealthConfig()
config.PubSubProbeInterval = 45 * time.Second
config.HealthyThreshold = 0.98
// Note: config is an unexported field, so this assignment compiles only inside package health
enhanced.config = config
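The two thresholds suggest a simple banding of the 0.0-1.0 health score. How the package actually applies HealthyThreshold and DegradedThreshold is not stated here, so the sketch below is an assumption about the most natural reading; `classify` is a hypothetical helper.

```go
package main

import "fmt"

// classify maps a 0.0-1.0 health score onto a status name. Assumed rule:
// scores at or above healthyThreshold are healthy, scores at or above
// degradedThreshold are degraded, and anything lower is unhealthy.
func classify(score, healthyThreshold, degradedThreshold float64) string {
	switch {
	case score >= healthyThreshold:
		return "healthy"
	case score >= degradedThreshold:
		return "degraded"
	default:
		return "unhealthy"
	}
}

func main() {
	// Thresholds below are the documented defaults (0.95 and 0.75).
	for _, s := range []float64{0.99, 0.80, 0.40} {
		fmt.Printf("%.2f -> %s\n", s, classify(s, 0.95, 0.75))
	}
}
```

Under this reading, raising HealthyThreshold to 0.98 as in the configuration example moves a score of 0.96 from healthy to degraded.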
Enhanced Health Checks Registered
1. Enhanced PubSub Check
- Name: pubsub-enhanced
- Critical: Yes
- Interval: Configurable (default: 30s)
- Features:
  - Loopback message testing
  - Success rate tracking
  - Consecutive failure counting
  - Latency measurement
  - Health score calculation
2. Enhanced DHT Check
- Name: dht-enhanced
- Critical: Yes
- Interval: Configurable (default: 60s)
- Features:
  - Put/get operation testing
  - Data integrity verification
  - Replication health monitoring
  - Success rate tracking
  - Latency measurement
3. Election Health Check
- Name: election-health
- Critical: No
- Interval: Configurable (default: 15s)
- Features:
  - Election state monitoring
  - Heartbeat status tracking
  - Leadership stability calculation
  - Admin uptime tracking
4. Replication Health Check
- Name: replication-health
- Critical: No
- Interval: 120 seconds
- Features:
  - Replication metrics monitoring
  - Failure rate tracking
  - Average replication factor
  - Provider record counting
5. P2P Connectivity Check
- Name: p2p-connectivity
- Critical: Yes
- Interval: 30 seconds
- Features:
  - Connected peer counting
  - Minimum peer threshold validation
  - Connectivity score calculation
6. Resource Health Check
- Name: resource-health
- Critical: No
- Interval: 60 seconds
- Features:
  - CPU usage monitoring
  - Memory usage monitoring
  - Disk usage monitoring
  - Threshold-based alerting
7. Task Manager Check
- Name: task-manager
- Critical: No
- Interval: 30 seconds
- Features:
  - Active task counting
  - Queue depth monitoring
  - Task success rate tracking
  - Capacity monitoring
Health Metrics
type HealthMetrics struct {
    // Overall system health
    SystemHealthScore     float64    // 0.0-1.0
    LastFullHealthCheck   time.Time
    TotalHealthChecks     int64
    FailedHealthChecks    int64
    // PubSub metrics
    PubSubHealthScore     float64
    PubSubProbeLatency    time.Duration
    PubSubSuccessRate     float64
    PubSubLastSuccess     time.Time
    PubSubConsecutiveFails int
    // DHT metrics
    DHTHealthScore        float64
    DHTProbeLatency       time.Duration
    DHTSuccessRate        float64
    DHTLastSuccess        time.Time
    DHTConsecutiveFails   int
    DHTReplicationStatus  map[string]*ReplicationStatus
    // Election metrics
    ElectionHealthScore   float64
    ElectionStability     float64
    HeartbeatLatency      time.Duration
    LeadershipChanges     int64
    LastLeadershipChange  time.Time
    AdminUptime           time.Duration
    // Network metrics
    P2PConnectedPeers     int
    P2PConnectivityScore  float64
    NetworkLatency        time.Duration
    // Resource metrics
    CPUUsage             float64
    MemoryUsage          float64
    DiskUsage            float64
    // Service metrics
    ActiveTasks          int
    QueuedTasks          int
    TaskSuccessRate      float64
}
// Access metrics
metrics := enhanced.GetHealthMetrics()
fmt.Printf("System Health Score: %.2f\n", metrics.SystemHealthScore)
fmt.Printf("PubSub Health: %.2f\n", metrics.PubSubHealthScore)
fmt.Printf("DHT Health: %.2f\n", metrics.DHTHealthScore)
Health Summary
summary := enhanced.GetHealthSummary()
// Returns:
// {
//     "status": "healthy",
//     "overall_score": 0.96,
//     "last_check": "2025-09-30T10:30:00Z",
//     "total_checks": 1523,
//     "component_scores": {
//         "pubsub": 0.98,
//         "dht": 0.95,
//         "election": 0.92,
//         "p2p": 1.0
//     },
//     "key_metrics": {
//         "connected_peers": 5,
//         "active_tasks": 3,
//         "admin_uptime": "2h30m15s",
//         "leadership_changes": 2,
//         "resource_utilization": {
//             "cpu": 0.45,
//             "memory": 0.62,
//             "disk": 0.73
//         }
//     }
// }
Adapter System
PubSub Adapter
Adapts the CHORUS PubSub system to the health check interface:
type PubSubInterface interface {
    SubscribeToTopic(topic string, handler func([]byte)) error
    PublishToTopic(topic string, data interface{}) error
}
// Create adapter
adapter := health.NewPubSubAdapter(pubsubInstance)
// Use in health checks
check := health.CreateActivePubSubCheck(adapter)
DHT Adapter
Adapts the supported DHT implementations to the health check interface:
type DHTInterface interface {
    PutValue(ctx context.Context, key string, value []byte) error
    GetValue(ctx context.Context, key string) ([]byte, error)
}
// Create adapter (supports multiple DHT types)
adapter := health.NewDHTAdapter(dhtInstance)
// Use in health checks
check := health.CreateActiveDHTCheck(adapter)
Supported DHT Types:
- *dht.LibP2PDHT
- *dht.MockDHTInterface
- *dht.EncryptedDHTStorage
Mock Adapters
For testing without real infrastructure:
// Mock PubSub
mockPubSub := health.NewMockPubSubAdapter()
check := health.CreateActivePubSubCheck(mockPubSub)
// Mock DHT
mockDHT := health.NewMockDHTAdapter()
check := health.CreateActiveDHTCheck(mockDHT)
Integration with Graceful Shutdown
Critical Health Check Failures
When a critical health check fails, the health manager can trigger graceful shutdown:
// Connect managers
healthMgr.SetShutdownManager(shutdownMgr)
// Register critical check
criticalCheck := &health.HealthCheck{
    Name:     "database-connectivity",
    Critical: true,  // Failure triggers shutdown
    Checker:  func(ctx context.Context) health.CheckResult {
        // Check logic
    },
}
healthMgr.RegisterCheck(criticalCheck)
// If check fails, shutdown is automatically initiated
Shutdown Integration Example
import (
    "chorus/pkg/health"
    "chorus/pkg/shutdown"
)
// Create managers
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
healthMgr := health.NewManager("node-123", "v1.0.0", logger)
healthMgr.SetShutdownManager(shutdownMgr)
// Register health manager for shutdown
healthComponent := shutdown.NewGenericComponent("health-manager", 10, true).
    SetShutdownFunc(func(ctx context.Context) error {
        return healthMgr.Stop()
    })
shutdownMgr.Register(healthComponent)
// Add pre-shutdown hook to update health status
shutdownMgr.AddHook(shutdown.PhasePreShutdown, func(ctx context.Context) error {
    status := healthMgr.GetStatus()
    status.Status = health.StatusStopping
    status.Message = "System is shutting down"
    return nil
})
// Start systems
healthMgr.Start()
healthMgr.StartHTTPServer(8081)
shutdownMgr.Start()
// Wait for shutdown
shutdownMgr.Wait()
Health Check Best Practices
Design Principles
- Fast Execution: Health checks should complete quickly (< 5 seconds typical)
- Idempotent: Checks should be safe to run repeatedly without side effects
- Isolated: Checks should not depend on other checks
- Meaningful: Checks should validate actual functionality, not just existence
- Critical vs Warning: Reserve critical status for failures that prevent core functionality
Check Intervals
// Critical infrastructure: Check frequently
databaseCheck.Interval = 15 * time.Second
// Expensive operations: Check less frequently
replicationCheck.Interval = 120 * time.Second
// Active probes: Balance thoroughness with overhead
pubsubProbe.Interval = 60 * time.Second
Timeout Configuration
// Fast checks: Short timeout
connectivityCheck.Timeout = 5 * time.Second
// Network operations: Longer timeout
dhtProbe.Timeout = 20 * time.Second
// Complex operations: Generous timeout
systemCheck.Timeout = 30 * time.Second
Critical Check Guidelines
Mark a check as critical when:
- Failure prevents core system functionality
- Continued operation would cause data corruption
- User-facing services become unavailable
- System cannot safely recover automatically
Do NOT mark as critical when:
- Failure is temporary or transient
- System can operate in degraded mode
- Alternative mechanisms exist
- Recovery is possible without restart
Error Handling
Checker: func(ctx context.Context) health.CheckResult {
    // Handle context cancellation
    select {
    case <-ctx.Done():
        return health.CheckResult{
            Healthy:   false,
            Message:   "Check cancelled",
            Error:     ctx.Err(),
            Timestamp: time.Now(),
        }
    default:
    }
    // Perform check with timeout
    result := make(chan error, 1)
    go func() {
        result <- performCheck()
    }()
    select {
    case err := <-result:
        if err != nil {
            return health.CheckResult{
                Healthy:   false,
                Message:   fmt.Sprintf("Check failed: %v", err),
                Error:     err,
                Timestamp: time.Now(),
            }
        }
        return health.CheckResult{
            Healthy:   true,
            Message:   "Check passed",
            Timestamp: time.Now(),
        }
    case <-ctx.Done():
        return health.CheckResult{
            Healthy:   false,
            Message:   "Check timeout",
            Error:     ctx.Err(),
            Timestamp: time.Now(),
        }
    }
}
Detailed Results
Provide structured details for debugging:
return health.CheckResult{
    Healthy: true,
    Message: "P2P network healthy",
    Details: map[string]interface{}{
        "connected_peers":  5,
        "min_peers":       3,
        "max_peers":       20,
        "current_usage":   "25%",
        "peer_quality":    0.85,
        "network_latency": "50ms",
    },
    Timestamp: time.Now(),
}
Custom Health Checks
Simple Health Check
simpleCheck := &health.HealthCheck{
    Name:        "my-service",
    Description: "Custom service health check",
    Enabled:     true,
    Critical:    false,
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        // Your check logic
        healthy := checkMyService()
        return health.CheckResult{
            Healthy:   healthy,
            Message:   "Service status",
            Timestamp: time.Now(),
        }
    },
}
manager.RegisterCheck(simpleCheck)
Health Check with Metrics
type ServiceHealthCheck struct {
    service          MyService
    metricsCollector *metrics.CHORUSMetrics
}
func (s *ServiceHealthCheck) Check(ctx context.Context) health.CheckResult {
    start := time.Now()
    err := s.service.Ping(ctx)
    latency := time.Since(start)
    if err != nil {
        s.metricsCollector.IncrementHealthCheckFailed("my-service", err.Error())
        return health.CheckResult{
            Healthy:   false,
            Message:   fmt.Sprintf("Service unavailable: %v", err),
            Error:     err,
            Latency:   latency,
            Timestamp: time.Now(),
        }
    }
    s.metricsCollector.IncrementHealthCheckPassed("my-service")
    return health.CheckResult{
        Healthy:   true,
        Message:   "Service available",
        Latency:   latency,
        Timestamp: time.Now(),
        Details: map[string]interface{}{
            "latency_ms": latency.Milliseconds(),
            "version":    s.service.Version(),
        },
    }
}
Stateful Health Check
type StatefulHealthCheck struct {
    consecutiveFailures int
    maxFailures         int
}
func (s *StatefulHealthCheck) Check(ctx context.Context) health.CheckResult {
    healthy := performCheck()
    if !healthy {
        s.consecutiveFailures++
    } else {
        s.consecutiveFailures = 0
    }
    // Only report unhealthy after multiple failures
    if s.consecutiveFailures >= s.maxFailures {
        return health.CheckResult{
            Healthy: false,
            Message: fmt.Sprintf("Failed %d consecutive checks", s.consecutiveFailures),
            Details: map[string]interface{}{
                "consecutive_failures": s.consecutiveFailures,
                "threshold":           s.maxFailures,
            },
            Timestamp: time.Now(),
        }
    }
    return health.CheckResult{
        Healthy:   true,
        Message:   "Check passed",
        Timestamp: time.Now(),
    }
}
Monitoring and Alerting
Prometheus Integration
Health check results are automatically exposed as metrics:
# Health check success rate
rate(chorus_health_checks_passed_total[5m]) /
(rate(chorus_health_checks_passed_total[5m]) + rate(chorus_health_checks_failed_total[5m]))
# System health score
chorus_system_health_score
# Component health
chorus_component_health_score{component="dht"}
Alert Rules
groups:
  - name: health_alerts
    interval: 30s
    rules:
      - alert: HealthCheckFailing
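        # A counter never decreases, so compare its recent increase rather than its absolute value
        expr: increase(chorus_health_checks_failed_total{check_name="database-connectivity"}[5m]) > 0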
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical health check failing"
          description: "{{ $labels.check_name }} has been failing for 5 minutes"
      - alert: LowSystemHealth
        expr: chorus_system_health_score < 0.75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low system health score"
          description: "System health score: {{ $value }}"
      - alert: ComponentDegraded
        expr: chorus_component_health_score < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Component {{ $labels.component }} degraded"
          description: "Health score: {{ $value }}"
Troubleshooting
Health Check Not Running
# Check if health manager is started
curl http://localhost:8081/health/checks
# Verify check is registered and enabled
# Look for "enabled": true in response
# Check application logs for errors
grep "health check" /var/log/chorus/chorus.log
Health Check Timeouts
// Increase timeout for slow operations
check.Timeout = 30 * time.Second
// Add timeout monitoring
Checker: func(ctx context.Context) health.CheckResult {
    deadline, ok := ctx.Deadline()
    if ok {
        log.Printf("Check deadline: %v (%.2fs remaining)",
            deadline, time.Until(deadline).Seconds())
    }
    // ... check logic
}
False Positives
// Add retry logic
attempts := 3
for i := 0; i < attempts; i++ {
    if checkPasses() {
        return health.CheckResult{Healthy: true, ...}
    }
    if i < attempts-1 {
        time.Sleep(100 * time.Millisecond)
    }
}
return health.CheckResult{Healthy: false, ...}
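The retry idea can be factored into a small helper that also respects the check's context, so a cancelled or timed-out check does not keep sleeping between attempts. The sketch below is one possible shape, not part of the package: `retry`, `failFirst`, and `runRetry` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// retry calls check up to attempts times, waiting delay between tries.
// It returns whether any try passed and how many tries were made, and it
// gives up early if ctx is cancelled while waiting.
func retry(ctx context.Context, attempts int, delay time.Duration, check func() bool) (bool, int) {
	for i := 1; i <= attempts; i++ {
		if check() {
			return true, i
		}
		if i == attempts {
			break // no point sleeping after the final try
		}
		select {
		case <-time.After(delay):
		case <-ctx.Done():
			return false, i
		}
	}
	return false, attempts
}

// failFirst returns a check that fails its first n calls and then passes.
// It simulates a transient fault for the demo.
func failFirst(n int) func() bool {
	calls := 0
	return func() bool {
		calls++
		return calls > n
	}
}

// runRetry wires the two together with a short delay.
func runRetry(failures, attempts int) (bool, int) {
	return retry(context.Background(), attempts, 10*time.Millisecond, failFirst(failures))
}

func main() {
	ok, tries := runRetry(2, 3)
	fmt.Printf("transient fault: ok=%t tries=%d\n", ok, tries)
	ok, tries = runRetry(3, 3)
	fmt.Printf("persistent fault: ok=%t tries=%d\n", ok, tries)
}
```

Inside a Checker, the boolean result would feed the Healthy field of the returned CheckResult, and the try count could go into Details for debugging.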
High Memory Usage
// Limit check history
config := health.DefaultHealthConfig()
config.MaxHistoryEntries = 500  // Reduce from default 1000
config.HistoryCleanupInterval = 30 * time.Minute  // More frequent cleanup
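Capping the number of retained entries bounds memory because old results are discarded as new ones arrive. The sketch below illustrates that idea with a bounded history that also yields a success rate and a consecutive-failure count. It is a hypothetical structure and makes no claim about how the package stores its history internally.

```go
package main

import "fmt"

// history keeps at most max recent probe outcomes (hypothetical type).
type history struct {
	max     int
	results []bool
}

// add records an outcome and drops the oldest entries beyond max.
func (h *history) add(ok bool) {
	h.results = append(h.results, ok)
	if len(h.results) > h.max {
		// Shift the newest max entries to the front and truncate.
		n := copy(h.results, h.results[len(h.results)-h.max:])
		h.results = h.results[:n]
	}
}

// successRate is the fraction of retained outcomes that passed.
// With no data it returns 0.
func (h *history) successRate() float64 {
	if len(h.results) == 0 {
		return 0
	}
	passed := 0
	for _, ok := range h.results {
		if ok {
			passed++
		}
	}
	return float64(passed) / float64(len(h.results))
}

// consecutiveFails counts failures at the tail of the history.
func (h *history) consecutiveFails() int {
	n := 0
	for i := len(h.results) - 1; i >= 0 && !h.results[i]; i-- {
		n++
	}
	return n
}

func main() {
	h := &history{max: 3}
	for _, ok := range []bool{true, true, false, false} {
		h.add(ok)
	}
	fmt.Printf("entries=%d rate=%.2f consecutive_fails=%d\n",
		len(h.results), h.successRate(), h.consecutiveFails())
}
```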
Testing
Unit Testing Health Checks
func TestHealthCheck(t *testing.T) {
    // Create check
    check := &health.HealthCheck{
        Name:    "test-check",
        Enabled: true,
        Timeout: 5 * time.Second,
        Checker: func(ctx context.Context) health.CheckResult {
            // Test logic
            return health.CheckResult{
                Healthy:   true,
                Message:   "Test passed",
                Timestamp: time.Now(),
            }
        },
    }
    // Execute check
    ctx := context.Background()
    result := check.Checker(ctx)
    // Verify result
    assert.True(t, result.Healthy)
    assert.Equal(t, "Test passed", result.Message)
}
Integration Testing with Mocks
func TestHealthManager(t *testing.T) {
    logger := &testLogger{}
    manager := health.NewManager("test-node", "v1.0.0", logger)
    // Register mock check
    mockCheck := &health.HealthCheck{
        Name:     "mock-check",
        Enabled:  true,
        Interval: 1 * time.Second,
        Timeout:  500 * time.Millisecond,
        Checker: func(ctx context.Context) health.CheckResult {
            return health.CheckResult{
                Healthy:   true,
                Message:   "Mock check passed",
                Timestamp: time.Now(),
            }
        },
    }
    manager.RegisterCheck(mockCheck)
    // Start manager
    err := manager.Start()
    assert.NoError(t, err)
    // Wait for check execution
    time.Sleep(2 * time.Second)
    // Verify status
    status := manager.GetStatus()
    assert.Equal(t, health.StatusHealthy, status.Status)
    assert.Contains(t, status.Checks, "mock-check")
    // Stop manager
    err = manager.Stop()
    assert.NoError(t, err)
}