CHORUS Health Package

Overview

The pkg/health package provides comprehensive health monitoring and readiness/liveness probe capabilities for the CHORUS distributed system. It orchestrates health checks across all system components, integrates with graceful shutdown, and exposes Kubernetes-compatible health endpoints.

Architecture

Core Components

  1. Manager: Central health check orchestration and HTTP endpoint management
  2. HealthCheck: Individual health check definitions with configurable intervals
  3. EnhancedHealthChecks: Advanced health monitoring with metrics and history
  4. Adapters: Integration layer for PubSub, DHT, and other subsystems
  5. SystemStatus: Aggregated health status representation

Health Check Types

  • Critical Checks: Failures trigger graceful shutdown
  • Non-Critical Checks: Failures degrade health status but don't trigger shutdown
  • Active Probes: Synthetic tests that verify end-to-end functionality
  • Passive Checks: Monitor existing system state without creating load (see the sketch below)
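
For instance, the p2p-connectivity check described later monitors the existing peer count rather than generating traffic. A minimal passive-check sketch, assuming illustrative node.Peers() and minPeers values:

peerCheck := &health.HealthCheck{
    Name:        "p2p-connectivity",
    Description: "Passive check of connected peer count",
    Enabled:     true,
    Critical:    true,
    Interval:    30 * time.Second,
    Timeout:     5 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        connected := len(node.Peers()) // node.Peers() and minPeers are illustrative
        return health.CheckResult{
            Healthy: connected >= minPeers,
            Message: fmt.Sprintf("%d peers connected", connected),
            Details: map[string]interface{}{
                "connected_peers": connected,
                "min_peers":       minPeers,
            },
            Timestamp: time.Now(),
        }
    },
}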

Core Types

HealthCheck

type HealthCheck struct {
    Name        string                                // Unique check identifier
    Description string                                // Human-readable description
    Checker     func(ctx context.Context) CheckResult // Check execution function
    Interval    time.Duration                         // Check frequency (default: 30s)
    Timeout     time.Duration                         // Check timeout (default: 10s)
    Enabled     bool                                  // Enable/disable check
    Critical    bool                                  // If true, failure triggers shutdown
    LastRun     time.Time                             // Timestamp of last execution
    LastResult  *CheckResult                          // Most recent check result
}

CheckResult

type CheckResult struct {
    Healthy    bool                   // Check passed/failed
    Message    string                 // Human-readable result message
    Details    map[string]interface{} // Additional structured information
    Latency    time.Duration          // Check execution time
    Timestamp  time.Time              // Result timestamp
    Error      error                  // Error details if check failed
}

SystemStatus

type SystemStatus struct {
    Status     Status                     // Overall status enum
    Message    string                     // Status description
    Checks     map[string]*CheckResult    // All check results
    Uptime     time.Duration              // System uptime
    StartTime  time.Time                  // System start timestamp
    LastUpdate time.Time                  // Last status update
    Version    string                     // CHORUS version
    NodeID     string                     // Node identifier
}

Status Levels

const (
    StatusHealthy   Status = "healthy"    // All checks passing
    StatusDegraded  Status = "degraded"   // Some non-critical checks failing
    StatusUnhealthy Status = "unhealthy"  // Critical checks failing
    StatusStarting  Status = "starting"   // System initializing
    StatusStopping  Status = "stopping"   // Graceful shutdown in progress
)
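
A failing critical check drives the overall status to unhealthy, while failing non-critical checks only degrade it. A minimal sketch of that aggregation logic (illustrative; not the package's exact implementation):

// aggregateStatus derives the overall level from individual results:
// a failing critical check is unhealthy; any other failure degrades.
func aggregateStatus(checks map[string]*health.HealthCheck) health.Status {
    status := health.StatusHealthy
    for _, check := range checks {
        if check.LastResult == nil || check.LastResult.Healthy {
            continue // not yet run, or passing
        }
        if check.Critical {
            return health.StatusUnhealthy // critical failure dominates
        }
        status = health.StatusDegraded // non-critical failure degrades
    }
    return status
}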

Manager

Initialization

import "chorus/pkg/health"

// Create health manager
logger := yourLogger // Implements health.Logger interface
manager := health.NewManager("node-123", "v1.0.0", logger)

// Connect to shutdown manager for critical failures
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
manager.SetShutdownManager(shutdownMgr)

Registration System

// Register a health check
check := &health.HealthCheck{
    Name:        "database-connectivity",
    Description: "PostgreSQL database connectivity check",
    Enabled:     true,
    Critical:    true,  // Failure triggers shutdown
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        // Perform health check
        err := db.PingContext(ctx)
        if err != nil {
            return health.CheckResult{
                Healthy:   false,
                Message:   fmt.Sprintf("Database ping failed: %v", err),
                Error:     err,
                Timestamp: time.Now(),
            }
        }
        return health.CheckResult{
            Healthy:   true,
            Message:   "Database connectivity OK",
            Timestamp: time.Now(),
        }
    },
}

manager.RegisterCheck(check)

// Unregister when no longer needed
manager.UnregisterCheck("database-connectivity")

Lifecycle Management

// Start health monitoring
if err := manager.Start(); err != nil {
    log.Fatalf("Failed to start health manager: %v", err)
}

// Start HTTP server for health endpoints
if err := manager.StartHTTPServer(8081); err != nil {
    log.Fatalf("Failed to start health HTTP server: %v", err)
}

// ... application runs ...

// Stop health monitoring during shutdown
if err := manager.Stop(); err != nil {
    log.Printf("Error stopping health manager: %v", err)
}

HTTP Endpoints

/health - Overall Health Status

Method: GET
Description: Returns comprehensive system health status

Response Codes:

  • 200 OK: System is healthy or degraded
  • 503 Service Unavailable: System is unhealthy, starting, or stopping

Response Schema:

{
    "status": "healthy",
    "message": "All health checks passing",
    "checks": {
        "database-connectivity": {
            "healthy": true,
            "message": "Database connectivity OK",
            "latency": 15000000,
            "timestamp": "2025-09-30T10:30:00Z"
        },
        "p2p-connectivity": {
            "healthy": true,
            "message": "5 peers connected",
            "details": {
                "connected_peers": 5,
                "min_peers": 3
            },
            "latency": 8000000,
            "timestamp": "2025-09-30T10:30:05Z"
        }
    },
    "uptime": 86400000000000,
    "start_time": "2025-09-29T10:30:00Z",
    "last_update": "2025-09-30T10:30:05Z",
    "version": "v1.0.0",
    "node_id": "node-123"
}
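
For example, querying the endpoint on the port passed to StartHTTPServer (8081 above; jq is assumed to be installed for the second command):

# -f makes curl exit non-zero on 503, useful in scripts
curl -f http://localhost:8081/health

# Extract just the status field
curl -s http://localhost:8081/health | jq -r .status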

/health/ready - Readiness Probe

Method: GET
Description: Kubernetes readiness probe - indicates whether the node can handle requests

Response Codes:

  • 200 OK: Node is ready (healthy or degraded)
  • 503 Service Unavailable: Node is not ready

Response Schema:

{
    "ready": true,
    "status": "healthy",
    "message": "All health checks passing"
}

Usage: Configure as a Kubernetes readiness probe to control traffic routing:

readinessProbe:
  httpGet:
    path: /health/ready
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3

/health/live - Liveness Probe

Method: GET
Description: Kubernetes liveness probe - indicates whether the node is alive

Response Codes:

  • 200 OK: Process is alive (not stopping)
  • 503 Service Unavailable: Process is stopping

Response Schema:

{
    "live": true,
    "status": "healthy",
    "uptime": "24h0m0s"
}

Usage: Configure as a Kubernetes liveness probe so that unhealthy pods are restarted:

livenessProbe:
  httpGet:
    path: /health/live
    port: 8081
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3

/health/checks - Detailed Check Results

Method: GET
Description: Returns detailed results for all registered health checks

Response Schema:

{
    "checks": {
        "database-connectivity": {
            "healthy": true,
            "message": "Database connectivity OK",
            "latency": 15000000,
            "timestamp": "2025-09-30T10:30:00Z"
        },
        "p2p-connectivity": {
            "healthy": true,
            "message": "5 peers connected",
            "details": {
                "connected_peers": 5,
                "min_peers": 3
            },
            "latency": 8000000,
            "timestamp": "2025-09-30T10:30:05Z"
        }
    },
    "total": 2,
    "timestamp": "2025-09-30T10:30:10Z"
}

Built-in Health Checks

Database Connectivity Check

check := health.CreateDatabaseCheck("primary-db", func() error {
    return db.Ping()
})
manager.RegisterCheck(check)

Properties:

  • Critical: Yes
  • Interval: 30 seconds
  • Timeout: 10 seconds
  • Checks: Database ping/connectivity

Disk Space Check

check := health.CreateDiskSpaceCheck("/var/lib/CHORUS", 0.90)  // Alert at 90%
manager.RegisterCheck(check)

Properties:

  • Critical: No (warning only)
  • Interval: 60 seconds
  • Timeout: 5 seconds
  • Threshold: Configurable (e.g., 90%)

Memory Usage Check

check := health.CreateMemoryCheck(0.85)  // Alert at 85%
manager.RegisterCheck(check)

Properties:

  • Critical: No (warning only)
  • Interval: 30 seconds
  • Timeout: 5 seconds
  • Threshold: Configurable (e.g., 85%)

Active PubSub Check

adapter := health.NewPubSubAdapter(pubsubInstance)
check := health.CreateActivePubSubCheck(adapter)
manager.RegisterCheck(check)

Properties:

  • Critical: No
  • Interval: 60 seconds
  • Timeout: 15 seconds
  • Test: Publish/subscribe loopback with unique message
  • Validates: End-to-end PubSub functionality

Test Flow:

  1. Subscribe to test topic CHORUS/health-test/v1
  2. Publish unique test message with timestamp
  3. Wait for message receipt (max 10 seconds)
  4. Verify message integrity
  5. Report success or timeout (a simplified sketch follows)
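
A simplified version of this flow, assuming the PubSubInterface shown in the Adapter System section (the built-in check additionally verifies message integrity; unsubscribe/cleanup is elided here):

func pubsubLoopbackProbe(ctx context.Context, ps health.PubSubInterface) health.CheckResult {
    const topic = "CHORUS/health-test/v1"
    start := time.Now()
    received := make(chan struct{}, 1)

    // Steps 1-2: subscribe, then publish a unique test message
    if err := ps.SubscribeToTopic(topic, func(data []byte) {
        select {
        case received <- struct{}{}:
        default:
        }
    }); err != nil {
        return health.CheckResult{Healthy: false, Message: "subscribe failed", Error: err, Timestamp: time.Now()}
    }
    msg := fmt.Sprintf("health-probe-%d", time.Now().UnixNano())
    if err := ps.PublishToTopic(topic, msg); err != nil {
        return health.CheckResult{Healthy: false, Message: "publish failed", Error: err, Timestamp: time.Now()}
    }

    // Steps 3-5: wait up to 10 seconds for receipt
    select {
    case <-received:
        return health.CheckResult{
            Healthy:   true,
            Message:   "PubSub loopback OK",
            Latency:   time.Since(start),
            Timestamp: time.Now(),
        }
    case <-time.After(10 * time.Second):
        return health.CheckResult{Healthy: false, Message: "loopback timeout", Timestamp: time.Now()}
    case <-ctx.Done():
        return health.CheckResult{Healthy: false, Message: "check cancelled", Error: ctx.Err(), Timestamp: time.Now()}
    }
}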

Active DHT Check

adapter := health.NewDHTAdapter(dhtInstance)
check := health.CreateActiveDHTCheck(adapter)
manager.RegisterCheck(check)

Properties:

  • Critical: No
  • Interval: 90 seconds
  • Timeout: 20 seconds
  • Test: Put/get operation with unique key
  • Validates: DHT storage and retrieval integrity

Test Flow:

  1. Generate unique test key and value
  2. Perform DHT put operation
  3. Wait for propagation (100ms)
  4. Perform DHT get operation
  5. Verify retrieved value matches original
  6. Report success, failure, or integrity violation (a simplified sketch follows)
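
A simplified version of this flow, assuming the DHTInterface shown in the Adapter System section (uses the bytes, fmt, and time packages):

func dhtRoundTripProbe(ctx context.Context, d health.DHTInterface) health.CheckResult {
    // Step 1: generate a unique test key and value
    key := fmt.Sprintf("/health-test/%d", time.Now().UnixNano())
    value := []byte("health-probe")

    // Step 2: put
    if err := d.PutValue(ctx, key, value); err != nil {
        return health.CheckResult{Healthy: false, Message: "DHT put failed", Error: err, Timestamp: time.Now()}
    }

    // Step 3: allow propagation
    time.Sleep(100 * time.Millisecond)

    // Steps 4-5: get and verify
    got, err := d.GetValue(ctx, key)
    if err != nil {
        return health.CheckResult{Healthy: false, Message: "DHT get failed", Error: err, Timestamp: time.Now()}
    }
    if !bytes.Equal(got, value) {
        return health.CheckResult{Healthy: false, Message: "DHT integrity violation: value mismatch", Timestamp: time.Now()}
    }
    return health.CheckResult{Healthy: true, Message: "DHT round-trip OK", Timestamp: time.Now()}
}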

Enhanced Health Checks

Overview

The EnhancedHealthChecks system provides advanced monitoring with metrics tracking, historical data, and comprehensive component health scoring.

Initialization

import "chorus/pkg/health"

// Create enhanced health monitoring
enhanced := health.NewEnhancedHealthChecks(
    manager,         // Health manager
    electionMgr,     // Election manager
    dhtInstance,     // DHT instance
    pubsubInstance,  // PubSub instance
    replicationMgr,  // Replication manager
    logger,          // Logger
)

// Enhanced checks are automatically registered

Configuration

type HealthConfig struct {
    // Active probe intervals
    PubSubProbeInterval    time.Duration  // Default: 30s
    DHTProbeInterval       time.Duration  // Default: 60s
    ElectionProbeInterval  time.Duration  // Default: 15s

    // Probe timeouts
    PubSubProbeTimeout     time.Duration  // Default: 10s
    DHTProbeTimeout        time.Duration  // Default: 20s
    ElectionProbeTimeout   time.Duration  // Default: 5s

    // Thresholds
    MaxFailedProbes        int            // Default: 3
    HealthyThreshold       float64        // Default: 0.95
    DegradedThreshold      float64        // Default: 0.75

    // History retention
    MaxHistoryEntries      int            // Default: 1000
    HistoryCleanupInterval time.Duration  // Default: 1h

    // Enable/disable specific checks
    EnablePubSubProbes     bool           // Default: true
    EnableDHTProbes        bool           // Default: true
    EnableElectionProbes   bool           // Default: true
    EnableReplicationProbes bool          // Default: true
}

// Use custom configuration (apply before the enhanced checks start;
// direct field access is shown here for illustration)
config := health.DefaultHealthConfig()
config.PubSubProbeInterval = 45 * time.Second
config.HealthyThreshold = 0.98
enhanced.config = config

Enhanced Health Checks Registered

1. Enhanced PubSub Check

  • Name: pubsub-enhanced
  • Critical: Yes
  • Interval: Configurable (default: 30s)
  • Features:
    • Loopback message testing
    • Success rate tracking
    • Consecutive failure counting
    • Latency measurement
    • Health score calculation

2. Enhanced DHT Check

  • Name: dht-enhanced
  • Critical: Yes
  • Interval: Configurable (default: 60s)
  • Features:
    • Put/get operation testing
    • Data integrity verification
    • Replication health monitoring
    • Success rate tracking
    • Latency measurement

3. Election Health Check

  • Name: election-health
  • Critical: No
  • Interval: Configurable (default: 15s)
  • Features:
    • Election state monitoring
    • Heartbeat status tracking
    • Leadership stability calculation
    • Admin uptime tracking

4. Replication Health Check

  • Name: replication-health
  • Critical: No
  • Interval: 120 seconds
  • Features:
    • Replication metrics monitoring
    • Failure rate tracking
    • Average replication factor
    • Provider record counting

5. P2P Connectivity Check

  • Name: p2p-connectivity
  • Critical: Yes
  • Interval: 30 seconds
  • Features:
    • Connected peer counting
    • Minimum peer threshold validation
    • Connectivity score calculation

6. Resource Health Check

  • Name: resource-health
  • Critical: No
  • Interval: 60 seconds
  • Features:
    • CPU usage monitoring
    • Memory usage monitoring
    • Disk usage monitoring
    • Threshold-based alerting

7. Task Manager Check

  • Name: task-manager
  • Critical: No
  • Interval: 30 seconds
  • Features:
    • Active task counting
    • Queue depth monitoring
    • Task success rate tracking
    • Capacity monitoring

Health Metrics

type HealthMetrics struct {
    // Overall system health
    SystemHealthScore     float64    // 0.0-1.0
    LastFullHealthCheck   time.Time
    TotalHealthChecks     int64
    FailedHealthChecks    int64

    // PubSub metrics
    PubSubHealthScore      float64
    PubSubProbeLatency     time.Duration
    PubSubSuccessRate      float64
    PubSubLastSuccess      time.Time
    PubSubConsecutiveFails int

    // DHT metrics
    DHTHealthScore        float64
    DHTProbeLatency       time.Duration
    DHTSuccessRate        float64
    DHTLastSuccess        time.Time
    DHTConsecutiveFails   int
    DHTReplicationStatus  map[string]*ReplicationStatus

    // Election metrics
    ElectionHealthScore   float64
    ElectionStability     float64
    HeartbeatLatency      time.Duration
    LeadershipChanges     int64
    LastLeadershipChange  time.Time
    AdminUptime           time.Duration

    // Network metrics
    P2PConnectedPeers     int
    P2PConnectivityScore  float64
    NetworkLatency        time.Duration

    // Resource metrics
    CPUUsage             float64
    MemoryUsage          float64
    DiskUsage            float64

    // Service metrics
    ActiveTasks          int
    QueuedTasks          int
    TaskSuccessRate      float64
}

// Access metrics
metrics := enhanced.GetHealthMetrics()
fmt.Printf("System Health Score: %.2f\n", metrics.SystemHealthScore)
fmt.Printf("PubSub Health: %.2f\n", metrics.PubSubHealthScore)
fmt.Printf("DHT Health: %.2f\n", metrics.DHTHealthScore)

Health Summary

summary := enhanced.GetHealthSummary()

// Returns:
// {
//     "status": "healthy",
//     "overall_score": 0.96,
//     "last_check": "2025-09-30T10:30:00Z",
//     "total_checks": 1523,
//     "component_scores": {
//         "pubsub": 0.98,
//         "dht": 0.95,
//         "election": 0.92,
//         "p2p": 1.0
//     },
//     "key_metrics": {
//         "connected_peers": 5,
//         "active_tasks": 3,
//         "admin_uptime": "2h30m15s",
//         "leadership_changes": 2,
//         "resource_utilization": {
//             "cpu": 0.45,
//             "memory": 0.62,
//             "disk": 0.73
//         }
//     }
// }

Adapter System

PubSub Adapter

Adapts CHORUS PubSub system to health check interface:

type PubSubInterface interface {
    SubscribeToTopic(topic string, handler func([]byte)) error
    PublishToTopic(topic string, data interface{}) error
}

// Create adapter
adapter := health.NewPubSubAdapter(pubsubInstance)

// Use in health checks
check := health.CreateActivePubSubCheck(adapter)

DHT Adapter

Adapts various DHT implementations to the health check interface:

type DHTInterface interface {
    PutValue(ctx context.Context, key string, value []byte) error
    GetValue(ctx context.Context, key string) ([]byte, error)
}

// Create adapter (supports multiple DHT types)
adapter := health.NewDHTAdapter(dhtInstance)

// Use in health checks
check := health.CreateActiveDHTCheck(adapter)

Supported DHT Types:

  • *dht.LibP2PDHT
  • *dht.MockDHTInterface
  • *dht.EncryptedDHTStorage

Mock Adapters

For testing without real infrastructure:

// Mock PubSub
mockPubSub := health.NewMockPubSubAdapter()
check := health.CreateActivePubSubCheck(mockPubSub)

// Mock DHT
mockDHT := health.NewMockDHTAdapter()
check := health.CreateActiveDHTCheck(mockDHT)

Integration with Graceful Shutdown

Critical Health Check Failures

When a critical health check fails, the health manager can trigger graceful shutdown:

// Connect managers
healthMgr.SetShutdownManager(shutdownMgr)

// Register critical check
criticalCheck := &health.HealthCheck{
    Name:     "database-connectivity",
    Critical: true,  // Failure triggers shutdown
    Checker: func(ctx context.Context) health.CheckResult {
        if err := db.PingContext(ctx); err != nil {
            return health.CheckResult{Healthy: false, Message: err.Error(), Error: err, Timestamp: time.Now()}
        }
        return health.CheckResult{Healthy: true, Message: "Database connectivity OK", Timestamp: time.Now()}
    },
}
healthMgr.RegisterCheck(criticalCheck)

// If check fails, shutdown is automatically initiated

Shutdown Integration Example

import (
    "chorus/pkg/health"
    "chorus/pkg/shutdown"
)

// Create managers
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
healthMgr := health.NewManager("node-123", "v1.0.0", logger)
healthMgr.SetShutdownManager(shutdownMgr)

// Register health manager for shutdown
healthComponent := shutdown.NewGenericComponent("health-manager", 10, true).
    SetShutdownFunc(func(ctx context.Context) error {
        return healthMgr.Stop()
    })
shutdownMgr.Register(healthComponent)

// Add pre-shutdown hook to update health status
shutdownMgr.AddHook(shutdown.PhasePreShutdown, func(ctx context.Context) error {
    status := healthMgr.GetStatus()
    status.Status = health.StatusStopping
    status.Message = "System is shutting down"
    return nil
})

// Start systems
healthMgr.Start()
healthMgr.StartHTTPServer(8081)
shutdownMgr.Start()

// Wait for shutdown
shutdownMgr.Wait()

Health Check Best Practices

Design Principles

  1. Fast Execution: Health checks should complete quickly (< 5 seconds typical)
  2. Idempotent: Checks should be safe to run repeatedly without side effects
  3. Isolated: Checks should not depend on other checks
  4. Meaningful: Checks should validate actual functionality, not just existence
  5. Critical vs Warning: Reserve critical status for failures that prevent core functionality

Check Intervals

// Critical infrastructure: Check frequently
databaseCheck.Interval = 15 * time.Second

// Expensive operations: Check less frequently
replicationCheck.Interval = 120 * time.Second

// Active probes: Balance thoroughness with overhead
pubsubProbe.Interval = 60 * time.Second

Timeout Configuration

// Fast checks: Short timeout
connectivityCheck.Timeout = 5 * time.Second

// Network operations: Longer timeout
dhtProbe.Timeout = 20 * time.Second

// Complex operations: Generous timeout
systemCheck.Timeout = 30 * time.Second

Critical Check Guidelines

Mark a check as critical when:

  • Failure prevents core system functionality
  • Continued operation would cause data corruption
  • User-facing services become unavailable
  • System cannot safely recover automatically

Do NOT mark as critical when:

  • Failure is temporary or transient
  • System can operate in degraded mode
  • Alternative mechanisms exist
  • Recovery is possible without restart

Error Handling

Checker: func(ctx context.Context) health.CheckResult {
    // Handle context cancellation
    select {
    case <-ctx.Done():
        return health.CheckResult{
            Healthy:   false,
            Message:   "Check cancelled",
            Error:     ctx.Err(),
            Timestamp: time.Now(),
        }
    default:
    }

    // Perform check with timeout
    result := make(chan error, 1)
    go func() {
        result <- performCheck()
    }()

    select {
    case err := <-result:
        if err != nil {
            return health.CheckResult{
                Healthy:   false,
                Message:   fmt.Sprintf("Check failed: %v", err),
                Error:     err,
                Timestamp: time.Now(),
            }
        }
        return health.CheckResult{
            Healthy:   true,
            Message:   "Check passed",
            Timestamp: time.Now(),
        }
    case <-ctx.Done():
        return health.CheckResult{
            Healthy:   false,
            Message:   "Check timeout",
            Error:     ctx.Err(),
            Timestamp: time.Now(),
        }
    }
}

Detailed Results

Provide structured details for debugging:

return health.CheckResult{
    Healthy: true,
    Message: "P2P network healthy",
    Details: map[string]interface{}{
        "connected_peers":  5,
        "min_peers":       3,
        "max_peers":       20,
        "current_usage":   "25%",
        "peer_quality":    0.85,
        "network_latency": "50ms",
    },
    Timestamp: time.Now(),
}

Custom Health Checks

Simple Health Check

simpleCheck := &health.HealthCheck{
    Name:        "my-service",
    Description: "Custom service health check",
    Enabled:     true,
    Critical:    false,
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        // Your check logic
        healthy := checkMyService()

        return health.CheckResult{
            Healthy:   healthy,
            Message:   "Service status",
            Timestamp: time.Now(),
        }
    },
}

manager.RegisterCheck(simpleCheck)

Health Check with Metrics

type ServiceHealthCheck struct {
    service          MyService
    metricsCollector *metrics.CHORUSMetrics
}

func (s *ServiceHealthCheck) Check(ctx context.Context) health.CheckResult {
    start := time.Now()

    err := s.service.Ping(ctx)
    latency := time.Since(start)

    if err != nil {
        s.metricsCollector.IncrementHealthCheckFailed("my-service", err.Error())
        return health.CheckResult{
            Healthy:   false,
            Message:   fmt.Sprintf("Service unavailable: %v", err),
            Error:     err,
            Latency:   latency,
            Timestamp: time.Now(),
        }
    }

    s.metricsCollector.IncrementHealthCheckPassed("my-service")
    return health.CheckResult{
        Healthy:   true,
        Message:   "Service available",
        Latency:   latency,
        Timestamp: time.Now(),
        Details: map[string]interface{}{
            "latency_ms": latency.Milliseconds(),
            "version":    s.service.Version(),
        },
    }
}

Stateful Health Check

type StatefulHealthCheck struct {
    consecutiveFailures int
    maxFailures         int
}

func (s *StatefulHealthCheck) Check(ctx context.Context) health.CheckResult {
    healthy := performCheck()

    if !healthy {
        s.consecutiveFailures++
    } else {
        s.consecutiveFailures = 0
    }

    // Only report unhealthy after multiple failures
    if s.consecutiveFailures >= s.maxFailures {
        return health.CheckResult{
            Healthy: false,
            Message: fmt.Sprintf("Failed %d consecutive checks", s.consecutiveFailures),
            Details: map[string]interface{}{
                "consecutive_failures": s.consecutiveFailures,
                "threshold":           s.maxFailures,
            },
            Timestamp: time.Now(),
        }
    }

    return health.CheckResult{
        Healthy:   true,
        Message:   "Check passed",
        Timestamp: time.Now(),
    }
}
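
Method-based checkers such as the two above plug into the manager by passing the method value as the Checker field (a sketch; the check name and intervals are illustrative):

stateful := &StatefulHealthCheck{maxFailures: 3}

manager.RegisterCheck(&health.HealthCheck{
    Name:        "flaky-dependency",
    Description: "Tolerates transient failures via consecutive-failure counting",
    Enabled:     true,
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    Checker:     stateful.Check, // method value matches func(context.Context) health.CheckResult
})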

Monitoring and Alerting

Prometheus Integration

Health check results are automatically exposed as metrics:

# Health check success rate
rate(chorus_health_checks_passed_total[5m]) /
(rate(chorus_health_checks_passed_total[5m]) + rate(chorus_health_checks_failed_total[5m]))

# System health score
chorus_system_health_score

# Component health
chorus_component_health_score{component="dht"}

Alert Rules

groups:
  - name: health_alerts
    interval: 30s
    rules:
      - alert: HealthCheckFailing
        expr: rate(chorus_health_checks_failed_total{check_name="database-connectivity"}[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical health check failing"
          description: "{{ $labels.check_name }} has been failing for 5 minutes"

      - alert: LowSystemHealth
        expr: chorus_system_health_score < 0.75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low system health score"
          description: "System health score: {{ $value }}"

      - alert: ComponentDegraded
        expr: chorus_component_health_score < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Component {{ $labels.component }} degraded"
          description: "Health score: {{ $value }}"

Troubleshooting

Health Check Not Running

# Check if health manager is started
curl http://localhost:8081/health/checks

# Verify check is registered and enabled
# Look for "enabled": true in response

# Check application logs for errors
grep "health check" /var/log/chorus/chorus.log

Health Check Timeouts

// Increase timeout for slow operations
check.Timeout = 30 * time.Second

// Add timeout monitoring
Checker: func(ctx context.Context) health.CheckResult {
    deadline, ok := ctx.Deadline()
    if ok {
        log.Printf("Check deadline: %v (%.2fs remaining)",
            deadline, time.Until(deadline).Seconds())
    }
    // ... check logic
}

False Positives

// Add retry logic
attempts := 3
for i := 0; i < attempts; i++ {
    if checkPasses() {
        return health.CheckResult{Healthy: true, Message: "Check passed", Timestamp: time.Now()}
    }
    if i < attempts-1 {
        time.Sleep(100 * time.Millisecond)
    }
}
return health.CheckResult{Healthy: false, Message: "Check failed after retries", Timestamp: time.Now()}

High Memory Usage

// Limit check history
config := health.DefaultHealthConfig()
config.MaxHistoryEntries = 500  // Reduce from default 1000
config.HistoryCleanupInterval = 30 * time.Minute  // More frequent cleanup

Testing

Unit Testing Health Checks

func TestHealthCheck(t *testing.T) {
    // Create check
    check := &health.HealthCheck{
        Name:    "test-check",
        Enabled: true,
        Timeout: 5 * time.Second,
        Checker: func(ctx context.Context) health.CheckResult {
            // Test logic
            return health.CheckResult{
                Healthy:   true,
                Message:   "Test passed",
                Timestamp: time.Now(),
            }
        },
    }

    // Execute check
    ctx := context.Background()
    result := check.Checker(ctx)

    // Verify result
    assert.True(t, result.Healthy)
    assert.Equal(t, "Test passed", result.Message)
}

Integration Testing with Mocks

func TestHealthManager(t *testing.T) {
    logger := &testLogger{}
    manager := health.NewManager("test-node", "v1.0.0", logger)

    // Register mock check
    mockCheck := &health.HealthCheck{
        Name:     "mock-check",
        Enabled:  true,
        Interval: 1 * time.Second,
        Timeout:  500 * time.Millisecond,
        Checker: func(ctx context.Context) health.CheckResult {
            return health.CheckResult{
                Healthy:   true,
                Message:   "Mock check passed",
                Timestamp: time.Now(),
            }
        },
    }
    manager.RegisterCheck(mockCheck)

    // Start manager
    err := manager.Start()
    assert.NoError(t, err)

    // Wait for check execution
    time.Sleep(2 * time.Second)

    // Verify status
    status := manager.GetStatus()
    assert.Equal(t, health.StatusHealthy, status.Status)
    assert.Contains(t, status.Checks, "mock-check")

    // Stop manager
    err = manager.Stop()
    assert.NoError(t, err)
}