
# CHORUS Health Package

## Overview

The `pkg/health` package provides health monitoring and readiness/liveness probes for the CHORUS distributed system. It orchestrates health checks across system components, integrates with graceful shutdown, and exposes Kubernetes-compatible health endpoints.

## Architecture

### Core Components

1. **Manager**: Central health check orchestration and HTTP endpoint management
2. **HealthCheck**: Individual health check definitions with configurable intervals
3. **EnhancedHealthChecks**: Advanced health monitoring with metrics and history
4. **Adapters**: Integration layer for PubSub, DHT, and other subsystems
5. **SystemStatus**: Aggregated health status representation

### Health Check Types

- **Critical Checks**: Failures trigger graceful shutdown
- **Non-Critical Checks**: Failures degrade health status but don't trigger shutdown
- **Active Probes**: Synthetic tests that verify end-to-end functionality
- **Passive Checks**: Monitor existing system state without creating load

## Core Types

### HealthCheck

```go
type HealthCheck struct {
    Name        string                                // Unique check identifier
    Description string                                // Human-readable description
    Checker     func(ctx context.Context) CheckResult // Check execution function
    Interval    time.Duration                         // Check frequency (default: 30s)
    Timeout     time.Duration                         // Check timeout (default: 10s)
    Enabled     bool                                  // Enable/disable check
    Critical    bool                                  // If true, failure triggers shutdown
    LastRun     time.Time                             // Timestamp of last execution
    LastResult  *CheckResult                          // Most recent check result
}
```

### CheckResult

```go
type CheckResult struct {
    Healthy    bool                   // Check passed/failed
    Message    string                 // Human-readable result message
    Details    map[string]interface{} // Additional structured information
    Latency    time.Duration          // Check execution time
    Timestamp  time.Time              // Result timestamp
    Error      error                  // Error details if check failed
}
```

### SystemStatus

```go
type SystemStatus struct {
    Status     Status                     // Overall status enum
    Message    string                     // Status description
    Checks     map[string]*CheckResult    // All check results
    Uptime     time.Duration              // System uptime
    StartTime  time.Time                  // System start timestamp
    LastUpdate time.Time                  // Last status update
    Version    string                     // CHORUS version
    NodeID     string                     // Node identifier
}
```

### Status Levels

```go
const (
    StatusHealthy   Status = "healthy"    // All checks passing
    StatusDegraded  Status = "degraded"   // Some non-critical checks failing
    StatusUnhealthy Status = "unhealthy"  // Critical checks failing
    StatusStarting  Status = "starting"   // System initializing
    StatusStopping  Status = "stopping"   // Graceful shutdown in progress
)
```
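
Read together with the critical/non-critical distinction, the status levels suggest a simple aggregation rule. The sketch below is an assumption about that rule, not the package's confirmed implementation; `checkState` and `aggregate` are hypothetical names, and `Status` is redeclared locally so the example compiles on its own.

```go
package main

import "fmt"

// Status mirrors the status levels listed above (local stand-in).
type Status string

const (
    StatusHealthy   Status = "healthy"
    StatusDegraded  Status = "degraded"
    StatusUnhealthy Status = "unhealthy"
)

// checkState is a hypothetical pairing of a check's criticality with its
// most recent outcome.
type checkState struct {
    Critical bool
    Healthy  bool
}

// aggregate derives an overall status: any failing critical check yields
// unhealthy, any failing non-critical check yields degraded, otherwise healthy.
func aggregate(checks map[string]checkState) Status {
    status := StatusHealthy
    for _, c := range checks {
        if c.Healthy {
            continue
        }
        if c.Critical {
            return StatusUnhealthy
        }
        status = StatusDegraded
    }
    return status
}

func main() {
    checks := map[string]checkState{
        "database-connectivity": {Critical: true, Healthy: true},
        "disk-space":            {Critical: false, Healthy: false},
    }
    fmt.Println(aggregate(checks)) // prints "degraded"
}
```

`StatusStarting` and `StatusStopping` are lifecycle states rather than check outcomes, so this sketch leaves them out.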

## Manager

### Initialization

```go
import "chorus/pkg/health"

// Create health manager
logger := yourLogger // Implements health.Logger interface
manager := health.NewManager("node-123", "v1.0.0", logger)

// Connect to shutdown manager for critical failures
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
manager.SetShutdownManager(shutdownMgr)
```

### Registration System

```go
// Register a health check
check := &health.HealthCheck{
    Name:        "database-connectivity",
    Description: "PostgreSQL database connectivity check",
    Enabled:     true,
    Critical:    true,  // Failure triggers shutdown
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        // Perform health check
        err := db.PingContext(ctx)
        if err != nil {
            return health.CheckResult{
                Healthy:   false,
                Message:   fmt.Sprintf("Database ping failed: %v", err),
                Error:     err,
                Timestamp: time.Now(),
            }
        }
        return health.CheckResult{
            Healthy:   true,
            Message:   "Database connectivity OK",
            Timestamp: time.Now(),
        }
    },
}

manager.RegisterCheck(check)

// Unregister when no longer needed
manager.UnregisterCheck("database-connectivity")
```

### Lifecycle Management

```go
// Start health monitoring
if err := manager.Start(); err != nil {
    log.Fatalf("Failed to start health manager: %v", err)
}

// Start HTTP server for health endpoints
if err := manager.StartHTTPServer(8081); err != nil {
    log.Fatalf("Failed to start health HTTP server: %v", err)
}

// ... application runs ...

// Stop health monitoring during shutdown
if err := manager.Stop(); err != nil {
    log.Printf("Error stopping health manager: %v", err)
}
```

## HTTP Endpoints

### /health - Overall Health Status

**Method**: GET
**Description**: Returns comprehensive system health status

**Response Codes**:
- `200 OK`: System is healthy or degraded
- `503 Service Unavailable`: System is unhealthy, starting, or stopping

**Response Schema**:
```json
{
    "status": "healthy",
    "message": "All health checks passing",
    "checks": {
        "database-connectivity": {
            "healthy": true,
            "message": "Database connectivity OK",
            "latency": 15000000,
            "timestamp": "2025-09-30T10:30:00Z"
        },
        "p2p-connectivity": {
            "healthy": true,
            "message": "5 peers connected",
            "details": {
                "connected_peers": 5,
                "min_peers": 3
            },
            "latency": 8000000,
            "timestamp": "2025-09-30T10:30:05Z"
        }
    },
    "uptime": 86400000000000,
    "start_time": "2025-09-29T10:30:00Z",
    "last_update": "2025-09-30T10:30:05Z",
    "version": "v1.0.0",
    "node_id": "node-123"
}
```
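
The `latency` and `uptime` values in the sample are consistent with durations serialized as integer nanoseconds, which is how Go's `encoding/json` handles `time.Duration` by default. Assuming that holds for this endpoint, a client could decode them directly. The `statusJSON` and `checkJSON` types and `decodeStatus` are hypothetical client-side helpers, not part of the package.

```go
package main

import (
    "encoding/json"
    "fmt"
    "time"
)

// checkJSON and statusJSON are hypothetical client-side views of the
// /health payload; only the fields used here are declared.
type checkJSON struct {
    Healthy bool          `json:"healthy"`
    Latency time.Duration `json:"latency"` // assumed to be nanoseconds
}

type statusJSON struct {
    Status string               `json:"status"`
    Checks map[string]checkJSON `json:"checks"`
    Uptime time.Duration        `json:"uptime"` // assumed to be nanoseconds
}

// decodeStatus parses a /health response body.
func decodeStatus(body []byte) (statusJSON, error) {
    var s statusJSON
    err := json.Unmarshal(body, &s)
    return s, err
}

func main() {
    body := []byte(`{
        "status": "healthy",
        "checks": {"database-connectivity": {"healthy": true, "latency": 15000000}},
        "uptime": 86400000000000
    }`)
    s, err := decodeStatus(body)
    if err != nil {
        panic(err)
    }
    fmt.Println(s.Checks["database-connectivity"].Latency) // 15ms
    fmt.Println(s.Uptime)                                  // 24h0m0s
}
```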

### /health/ready - Readiness Probe

**Method**: GET
**Description**: Kubernetes readiness probe - indicates if node can handle requests

**Response Codes**:
- `200 OK`: Node is ready (healthy or degraded)
- `503 Service Unavailable`: Node is not ready

**Response Schema**:
```json
{
    "ready": true,
    "status": "healthy",
    "message": "All health checks passing"
}
```

**Usage**: Use for Kubernetes readiness probes to control traffic routing

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```

### /health/live - Liveness Probe

**Method**: GET
**Description**: Kubernetes liveness probe - indicates if node is alive

**Response Codes**:
- `200 OK`: Process is alive (not stopping)
- `503 Service Unavailable`: Process is stopping

**Response Schema**:
```json
{
    "live": true,
    "status": "healthy",
    "uptime": "24h0m0s"
}
```

**Usage**: Use for Kubernetes liveness probes to restart unhealthy pods

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8081
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```
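
The response-code rules for `/health`, `/health/ready`, and `/health/live` can be summarized in one function. This is a sketch of the documented mapping only; `codeFor` is a hypothetical helper, and `Status` is redeclared locally so the example is self-contained.

```go
package main

import (
    "fmt"
    "net/http"
)

// Status mirrors the package's status levels (local stand-in).
type Status string

const (
    StatusHealthy   Status = "healthy"
    StatusDegraded  Status = "degraded"
    StatusUnhealthy Status = "unhealthy"
    StatusStarting  Status = "starting"
    StatusStopping  Status = "stopping"
)

// codeFor returns the HTTP status code documented for each endpoint.
func codeFor(path string, s Status) int {
    switch path {
    case "/health", "/health/ready":
        // Healthy and degraded nodes still serve traffic.
        if s == StatusHealthy || s == StatusDegraded {
            return http.StatusOK
        }
        return http.StatusServiceUnavailable
    case "/health/live":
        // Liveness only fails once the process is stopping.
        if s == StatusStopping {
            return http.StatusServiceUnavailable
        }
        return http.StatusOK
    default:
        return http.StatusNotFound
    }
}

func main() {
    fmt.Println(codeFor("/health/ready", StatusUnhealthy)) // 503
    fmt.Println(codeFor("/health/live", StatusUnhealthy))  // 200
}
```

The example highlights the difference between the two probes: an unhealthy node is removed from traffic by the readiness probe, while the liveness probe keeps returning 200 until the node starts stopping.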

### /health/checks - Detailed Check Results

**Method**: GET
**Description**: Returns detailed results for all registered health checks

**Response Schema**:
```json
{
    "checks": {
        "database-connectivity": {
            "healthy": true,
            "message": "Database connectivity OK",
            "latency": 15000000,
            "timestamp": "2025-09-30T10:30:00Z"
        },
        "p2p-connectivity": {
            "healthy": true,
            "message": "5 peers connected",
            "details": {
                "connected_peers": 5,
                "min_peers": 3
            },
            "latency": 8000000,
            "timestamp": "2025-09-30T10:30:05Z"
        }
    },
    "total": 2,
    "timestamp": "2025-09-30T10:30:10Z"
}
```

## Built-in Health Checks

### Database Connectivity Check

```go
check := health.CreateDatabaseCheck("primary-db", func() error {
    return db.Ping()
})
manager.RegisterCheck(check)
```

**Properties**:
- Critical: Yes
- Interval: 30 seconds
- Timeout: 10 seconds
- Checks: Database ping/connectivity

### Disk Space Check

```go
check := health.CreateDiskSpaceCheck("/var/lib/CHORUS", 0.90)  // Alert at 90%
manager.RegisterCheck(check)
```

**Properties**:
- Critical: No (warning only)
- Interval: 60 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 90%)

### Memory Usage Check

```go
check := health.CreateMemoryCheck(0.85)  // Alert at 85%
manager.RegisterCheck(check)
```

**Properties**:
- Critical: No (warning only)
- Interval: 30 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 85%)

### Active PubSub Check

```go
adapter := health.NewPubSubAdapter(pubsubInstance)
check := health.CreateActivePubSubCheck(adapter)
manager.RegisterCheck(check)
```

**Properties**:
- Critical: No
- Interval: 60 seconds
- Timeout: 15 seconds
- Test: Publish/subscribe loopback with unique message
- Validates: End-to-end PubSub functionality

**Test Flow**:
1. Subscribe to test topic `CHORUS/health-test/v1`
2. Publish unique test message with timestamp
3. Wait for message receipt (max 10 seconds)
4. Verify message integrity
5. Report success or timeout
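
The test flow above can be sketched against the two-method PubSub interface this package documents. Everything else here is hypothetical: `loopbackBus` and `silentBus` are in-memory stand-ins, and `probePubSub` illustrates the flow rather than reproducing the body of `CreateActivePubSubCheck`.

```go
package main

import (
    "context"
    "errors"
    "fmt"
    "sync"
    "time"
)

// PubSubInterface matches the adapter interface documented for this package.
type PubSubInterface interface {
    SubscribeToTopic(topic string, handler func([]byte)) error
    PublishToTopic(topic string, data interface{}) error
}

// loopbackBus is a hypothetical in-memory stand-in that delivers each
// published message synchronously to the topic's local subscribers.
type loopbackBus struct {
    mu       sync.Mutex
    handlers map[string][]func([]byte)
}

func (b *loopbackBus) SubscribeToTopic(topic string, h func([]byte)) error {
    b.mu.Lock()
    defer b.mu.Unlock()
    if b.handlers == nil {
        b.handlers = make(map[string][]func([]byte))
    }
    b.handlers[topic] = append(b.handlers[topic], h)
    return nil
}

func (b *loopbackBus) PublishToTopic(topic string, data interface{}) error {
    b.mu.Lock()
    hs := append([]func([]byte){}, b.handlers[topic]...)
    b.mu.Unlock()
    for _, h := range hs {
        h([]byte(fmt.Sprint(data)))
    }
    return nil
}

// silentBus is a hypothetical stand-in that accepts calls but never delivers.
type silentBus struct{}

func (silentBus) SubscribeToTopic(string, func([]byte)) error { return nil }
func (silentBus) PublishToTopic(string, interface{}) error    { return nil }

var errProbeTimeout = errors.New("pubsub probe timed out")

// probePubSub subscribes, publishes a unique message, and waits up to wait
// for that exact message to come back.
func probePubSub(ctx context.Context, ps PubSubInterface, topic string, wait time.Duration) error {
    want := fmt.Sprintf("health-probe-%d", time.Now().UnixNano())
    matched := make(chan struct{}, 1)
    err := ps.SubscribeToTopic(topic, func(msg []byte) {
        if string(msg) != want {
            return // ignore messages that are not this probe's payload
        }
        select {
        case matched <- struct{}{}:
        default:
        }
    })
    if err != nil {
        return fmt.Errorf("subscribe: %w", err)
    }
    if err := ps.PublishToTopic(topic, want); err != nil {
        return fmt.Errorf("publish: %w", err)
    }
    timer := time.NewTimer(wait)
    defer timer.Stop()
    select {
    case <-matched:
        return nil
    case <-timer.C:
        return errProbeTimeout
    case <-ctx.Done():
        return ctx.Err()
    }
}

// runDemo probes a working bus and a bus that drops every message.
func runDemo() (loopbackErr, silentErr error) {
    ctx := context.Background()
    const topic = "CHORUS/health-test/v1"
    loopbackErr = probePubSub(ctx, &loopbackBus{}, topic, time.Second)
    silentErr = probePubSub(ctx, silentBus{}, topic, 50*time.Millisecond)
    return loopbackErr, silentErr
}

func main() {
    ok, dropped := runDemo()
    fmt.Println(ok, dropped)
}
```

Matching on the exact payload inside the handler keeps the probe from being satisfied by a stale message left over from an earlier run on the same topic.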

### Active DHT Check

```go
adapter := health.NewDHTAdapter(dhtInstance)
check := health.CreateActiveDHTCheck(adapter)
manager.RegisterCheck(check)
```

**Properties**:
- Critical: No
- Interval: 90 seconds
- Timeout: 20 seconds
- Test: Put/get operation with unique key
- Validates: DHT storage and retrieval integrity

**Test Flow**:
1. Generate unique test key and value
2. Perform DHT put operation
3. Wait for propagation (100ms)
4. Perform DHT get operation
5. Verify retrieved value matches original
6. Report success, failure, or integrity violation
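
A minimal sketch of this flow, written against the two-method DHT interface this package documents. The `memDHT` type, its `corrupt` switch, the key format, and `probeDHT` itself are hypothetical; the real check created by `CreateActiveDHTCheck` may differ in detail.

```go
package main

import (
    "bytes"
    "context"
    "errors"
    "fmt"
    "sync"
    "time"
)

// DHTInterface matches the adapter interface documented for this package.
type DHTInterface interface {
    PutValue(ctx context.Context, key string, value []byte) error
    GetValue(ctx context.Context, key string) ([]byte, error)
}

// memDHT is a hypothetical in-memory stand-in; corrupt simulates a store
// that returns damaged data.
type memDHT struct {
    mu      sync.Mutex
    data    map[string][]byte
    corrupt bool
}

func (d *memDHT) PutValue(_ context.Context, key string, value []byte) error {
    d.mu.Lock()
    defer d.mu.Unlock()
    if d.data == nil {
        d.data = make(map[string][]byte)
    }
    d.data[key] = append([]byte(nil), value...)
    return nil
}

func (d *memDHT) GetValue(_ context.Context, key string) ([]byte, error) {
    d.mu.Lock()
    defer d.mu.Unlock()
    v, ok := d.data[key]
    if !ok {
        return nil, errors.New("key not found")
    }
    out := append([]byte(nil), v...)
    if d.corrupt && len(out) > 0 {
        out[0] ^= 0xFF // flip bits to simulate corruption
    }
    return out, nil
}

var errIntegrity = errors.New("dht probe: retrieved value does not match")

// probeDHT performs put, a short propagation wait, get, and comparison.
func probeDHT(ctx context.Context, d DHTInterface) error {
    key := fmt.Sprintf("health-probe-%d", time.Now().UnixNano())
    want := []byte("probe-value-" + key)
    if err := d.PutValue(ctx, key, want); err != nil {
        return fmt.Errorf("put: %w", err)
    }
    select {
    case <-time.After(100 * time.Millisecond): // propagation wait
    case <-ctx.Done():
        return ctx.Err()
    }
    got, err := d.GetValue(ctx, key)
    if err != nil {
        return fmt.Errorf("get: %w", err)
    }
    if !bytes.Equal(got, want) {
        return errIntegrity
    }
    return nil
}

// runDemo probes a well-behaved store and a corrupting one.
func runDemo() (okErr, corruptErr error) {
    ctx := context.Background()
    return probeDHT(ctx, &memDHT{}), probeDHT(ctx, &memDHT{corrupt: true})
}

func main() {
    ok, bad := runDemo()
    fmt.Println(ok, bad)
}
```

Returning a distinct `errIntegrity` lets the caller separate "the DHT is unreachable" from "the DHT answered with the wrong data", which matches the three outcomes in step 6.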

## Enhanced Health Checks

### Overview

The `EnhancedHealthChecks` system provides advanced monitoring with metrics tracking, historical data, and comprehensive component health scoring.

### Initialization

```go
import "chorus/pkg/health"

// Create enhanced health monitoring
enhanced := health.NewEnhancedHealthChecks(
    manager,         // Health manager
    electionMgr,     // Election manager
    dhtInstance,     // DHT instance
    pubsubInstance,  // PubSub instance
    replicationMgr,  // Replication manager
    logger,          // Logger
)

// Enhanced checks are automatically registered
```

### Configuration

```go
type HealthConfig struct {
    // Active probe intervals
    PubSubProbeInterval    time.Duration  // Default: 30s
    DHTProbeInterval       time.Duration  // Default: 60s
    ElectionProbeInterval  time.Duration  // Default: 15s

    // Probe timeouts
    PubSubProbeTimeout     time.Duration  // Default: 10s
    DHTProbeTimeout        time.Duration  // Default: 20s
    ElectionProbeTimeout   time.Duration  // Default: 5s

    // Thresholds
    MaxFailedProbes        int            // Default: 3
    HealthyThreshold       float64        // Default: 0.95
    DegradedThreshold      float64        // Default: 0.75

    // History retention
    MaxHistoryEntries      int            // Default: 1000
    HistoryCleanupInterval time.Duration  // Default: 1h

    // Enable/disable specific checks
    EnablePubSubProbes     bool           // Default: true
    EnableDHTProbes        bool           // Default: true
    EnableElectionProbes   bool           // Default: true
    EnableReplicationProbes bool          // Default: true
}

// Use custom configuration
config := health.DefaultHealthConfig()
config.PubSubProbeInterval = 45 * time.Second
config.HealthyThreshold = 0.98
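// Note: config is an unexported field, so this assignment compiles only
// inside package health; code in other packages cannot set it this way.
enhanced.config = config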
```
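
The two thresholds read naturally as cut-offs on a 0.0-1.0 health score. The sketch below assumes that interpretation (at or above `HealthyThreshold` is healthy, at or above `DegradedThreshold` is degraded, anything lower is unhealthy); the source does not confirm the exact comparison, and `thresholds` and `classify` are hypothetical names.

```go
package main

import "fmt"

// Status mirrors the package's status levels (local stand-in).
type Status string

const (
    StatusHealthy   Status = "healthy"
    StatusDegraded  Status = "degraded"
    StatusUnhealthy Status = "unhealthy"
)

// thresholds holds the two score cut-offs from HealthConfig.
type thresholds struct {
    Healthy  float64
    Degraded float64
}

// classify is an ASSUMED mapping from a health score to a status level.
func classify(score float64, t thresholds) Status {
    switch {
    case score >= t.Healthy:
        return StatusHealthy
    case score >= t.Degraded:
        return StatusDegraded
    default:
        return StatusUnhealthy
    }
}

func main() {
    defaults := thresholds{Healthy: 0.95, Degraded: 0.75}
    for _, score := range []float64{0.96, 0.80, 0.50} {
        fmt.Printf("%.2f -> %s\n", score, classify(score, defaults))
    }
}
```

Under this reading, raising `HealthyThreshold` to 0.98 as in the configuration example would move a score of 0.96 from healthy to degraded.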

### Enhanced Health Checks Registered

#### 1. Enhanced PubSub Check
- **Name**: `pubsub-enhanced`
- **Critical**: Yes
- **Interval**: Configurable (default: 30s)
- **Features**:
  - Loopback message testing
  - Success rate tracking
  - Consecutive failure counting
  - Latency measurement
  - Health score calculation

#### 2. Enhanced DHT Check
- **Name**: `dht-enhanced`
- **Critical**: Yes
- **Interval**: Configurable (default: 60s)
- **Features**:
  - Put/get operation testing
  - Data integrity verification
  - Replication health monitoring
  - Success rate tracking
  - Latency measurement

#### 3. Election Health Check
- **Name**: `election-health`
- **Critical**: No
- **Interval**: Configurable (default: 15s)
- **Features**:
  - Election state monitoring
  - Heartbeat status tracking
  - Leadership stability calculation
  - Admin uptime tracking

#### 4. Replication Health Check
- **Name**: `replication-health`
- **Critical**: No
- **Interval**: 120 seconds
- **Features**:
  - Replication metrics monitoring
  - Failure rate tracking
  - Average replication factor
  - Provider record counting

#### 5. P2P Connectivity Check
- **Name**: `p2p-connectivity`
- **Critical**: Yes
- **Interval**: 30 seconds
- **Features**:
  - Connected peer counting
  - Minimum peer threshold validation
  - Connectivity score calculation

#### 6. Resource Health Check
- **Name**: `resource-health`
- **Critical**: No
- **Interval**: 60 seconds
- **Features**:
  - CPU usage monitoring
  - Memory usage monitoring
  - Disk usage monitoring
  - Threshold-based alerting

#### 7. Task Manager Check
- **Name**: `task-manager`
- **Critical**: No
- **Interval**: 30 seconds
- **Features**:
  - Active task counting
  - Queue depth monitoring
  - Task success rate tracking
  - Capacity monitoring

### Health Metrics

```go
type HealthMetrics struct {
    // Overall system health
    SystemHealthScore     float64    // 0.0-1.0
    LastFullHealthCheck   time.Time
    TotalHealthChecks     int64
    FailedHealthChecks    int64

    // PubSub metrics
    PubSubHealthScore     float64
    PubSubProbeLatency    time.Duration
    PubSubSuccessRate     float64
    PubSubLastSuccess     time.Time
    PubSubConsecutiveFails int

    // DHT metrics
    DHTHealthScore        float64
    DHTProbeLatency       time.Duration
    DHTSuccessRate        float64
    DHTLastSuccess        time.Time
    DHTConsecutiveFails   int
    DHTReplicationStatus  map[string]*ReplicationStatus

    // Election metrics
    ElectionHealthScore   float64
    ElectionStability     float64
    HeartbeatLatency      time.Duration
    LeadershipChanges     int64
    LastLeadershipChange  time.Time
    AdminUptime           time.Duration

    // Network metrics
    P2PConnectedPeers     int
    P2PConnectivityScore  float64
    NetworkLatency        time.Duration

    // Resource metrics
    CPUUsage             float64
    MemoryUsage          float64
    DiskUsage            float64

    // Service metrics
    ActiveTasks          int
    QueuedTasks          int
    TaskSuccessRate      float64
}

// Access metrics
metrics := enhanced.GetHealthMetrics()
fmt.Printf("System Health Score: %.2f\n", metrics.SystemHealthScore)
fmt.Printf("PubSub Health: %.2f\n", metrics.PubSubHealthScore)
fmt.Printf("DHT Health: %.2f\n", metrics.DHTHealthScore)
```
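
The per-component success rate and consecutive-failure counters can be maintained with a small tracker like the one below. This is an illustrative sketch: `probeStats` is a hypothetical type whose fields echo the `PubSub*` and `DHT*` members of `HealthMetrics`, and the package may compute these values differently.

```go
package main

import (
    "fmt"
    "time"
)

// probeStats is a hypothetical per-component tracker.
type probeStats struct {
    Total            int64
    Failed           int64
    ConsecutiveFails int
    LastSuccess      time.Time
}

// record updates the counters after one probe run.
func (p *probeStats) record(ok bool) {
    p.Total++
    if ok {
        p.ConsecutiveFails = 0
        p.LastSuccess = time.Now()
        return
    }
    p.Failed++
    p.ConsecutiveFails++
}

// successRate returns the fraction of successful probes, or 0 before any
// probe has run.
func (p *probeStats) successRate() float64 {
    if p.Total == 0 {
        return 0
    }
    return float64(p.Total-p.Failed) / float64(p.Total)
}

func main() {
    var s probeStats
    for _, ok := range []bool{true, true, true, false} {
        s.record(ok)
    }
    // prints "rate=0.75 consecutive=1"
    fmt.Printf("rate=%.2f consecutive=%d\n", s.successRate(), s.ConsecutiveFails)
}
```

A consecutive-failure counter of this kind is what a limit such as `MaxFailedProbes` would plausibly be compared against, although the source does not spell out that comparison.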

### Health Summary

```go
summary := enhanced.GetHealthSummary()

// Returns:
// {
//     "status": "healthy",
//     "overall_score": 0.96,
//     "last_check": "2025-09-30T10:30:00Z",
//     "total_checks": 1523,
//     "component_scores": {
//         "pubsub": 0.98,
//         "dht": 0.95,
//         "election": 0.92,
//         "p2p": 1.0
//     },
//     "key_metrics": {
//         "connected_peers": 5,
//         "active_tasks": 3,
//         "admin_uptime": "2h30m15s",
//         "leadership_changes": 2,
//         "resource_utilization": {
//             "cpu": 0.45,
//             "memory": 0.62,
//             "disk": 0.73
//         }
//     }
// }
```

## Adapter System

### PubSub Adapter

Adapts the CHORUS PubSub system to the health check interface:

```go
type PubSubInterface interface {
    SubscribeToTopic(topic string, handler func([]byte)) error
    PublishToTopic(topic string, data interface{}) error
}

// Create adapter
adapter := health.NewPubSubAdapter(pubsubInstance)

// Use in health checks
check := health.CreateActivePubSubCheck(adapter)
```

### DHT Adapter

Adapts various DHT implementations to the health check interface:

```go
type DHTInterface interface {
    PutValue(ctx context.Context, key string, value []byte) error
    GetValue(ctx context.Context, key string) ([]byte, error)
}

// Create adapter (supports multiple DHT types)
adapter := health.NewDHTAdapter(dhtInstance)

// Use in health checks
check := health.CreateActiveDHTCheck(adapter)
```

**Supported DHT Types**:
- `*dht.LibP2PDHT`
- `*dht.MockDHTInterface`
- `*dht.EncryptedDHTStorage`

### Mock Adapters

For testing without real infrastructure:

```go
// Mock PubSub
mockPubSub := health.NewMockPubSubAdapter()
pubsubCheck := health.CreateActivePubSubCheck(mockPubSub)

// Mock DHT (a distinct variable name avoids redeclaring `check` in the same scope)
mockDHT := health.NewMockDHTAdapter()
dhtCheck := health.CreateActiveDHTCheck(mockDHT)
```

## Integration with Graceful Shutdown

### Critical Health Check Failures

When a critical health check fails, the health manager can trigger graceful shutdown:

```go
// Connect managers
healthMgr.SetShutdownManager(shutdownMgr)

// Register critical check
criticalCheck := &health.HealthCheck{
    Name:     "database-connectivity",
    Critical: true,  // Failure triggers shutdown
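    Checker:  func(ctx context.Context) health.CheckResult {
        // Check logic goes here; a Checker must always return a result
        return health.CheckResult{Healthy: true, Timestamp: time.Now()}
    },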
}
healthMgr.RegisterCheck(criticalCheck)

// If check fails, shutdown is automatically initiated
```
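
To make the failure path concrete, the sketch below runs a check under its timeout and requests shutdown only when a critical check fails. It uses local stand-ins for `HealthCheck` and `CheckResult` with just the fields it needs; `runOnce` and the `requestShutdown` callback are hypothetical and stand in for whatever the manager and shutdown manager do internally.

```go
package main

import (
    "context"
    "fmt"
    "time"
)

// Local stand-ins carrying only the fields this sketch needs.
type CheckResult struct {
    Healthy bool
    Message string
    Latency time.Duration
}

type HealthCheck struct {
    Name     string
    Critical bool
    Timeout  time.Duration
    Checker  func(ctx context.Context) CheckResult
}

// runOnce executes a check under its timeout, records latency, and calls
// requestShutdown (hypothetical callback) when a critical check fails.
func runOnce(parent context.Context, c *HealthCheck, requestShutdown func(reason string)) CheckResult {
    timeout := c.Timeout
    if timeout <= 0 {
        timeout = 10 * time.Second // documented default timeout
    }
    ctx, cancel := context.WithTimeout(parent, timeout)
    defer cancel()

    start := time.Now()
    res := c.Checker(ctx)
    res.Latency = time.Since(start)

    if !res.Healthy && c.Critical {
        requestShutdown(fmt.Sprintf("critical check %q failed: %s", c.Name, res.Message))
    }
    return res
}

// demo runs one critical and one non-critical failing check and returns the
// shutdown reasons that were raised.
func demo() []string {
    var reasons []string
    record := func(reason string) { reasons = append(reasons, reason) }
    failing := func(ctx context.Context) CheckResult {
        return CheckResult{Healthy: false, Message: "ping refused"}
    }
    checks := []*HealthCheck{
        {Name: "database-connectivity", Critical: true, Timeout: time.Second, Checker: failing},
        {Name: "disk-space", Critical: false, Timeout: time.Second, Checker: failing},
    }
    for _, c := range checks {
        runOnce(context.Background(), c, record)
    }
    return reasons
}

func main() {
    for _, reason := range demo() {
        fmt.Println(reason)
    }
}
```

Only the critical check produces a shutdown reason; the non-critical failure would instead surface as a degraded status.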

### Shutdown Integration Example

```go
import (
    "chorus/pkg/health"
    "chorus/pkg/shutdown"
)

// Create managers
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
healthMgr := health.NewManager("node-123", "v1.0.0", logger)
healthMgr.SetShutdownManager(shutdownMgr)

// Register health manager for shutdown
healthComponent := shutdown.NewGenericComponent("health-manager", 10, true).
    SetShutdownFunc(func(ctx context.Context) error {
        return healthMgr.Stop()
    })
shutdownMgr.Register(healthComponent)

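// Add pre-shutdown hook to update health status.
// This assumes GetStatus returns a pointer to the manager's live status;
// if it returns a copy, these assignments are not visible to the endpoints.
shutdownMgr.AddHook(shutdown.PhasePreShutdown, func(ctx context.Context) error {
    status := healthMgr.GetStatus()
    status.Status = health.StatusStopping
    status.Message = "System is shutting down"
    return nil
})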

// Start systems
healthMgr.Start()
healthMgr.StartHTTPServer(8081)
shutdownMgr.Start()

// Wait for shutdown
shutdownMgr.Wait()
```

## Health Check Best Practices

### Design Principles

1. **Fast Execution**: Health checks should complete quickly (< 5 seconds typical)
2. **Idempotent**: Checks should be safe to run repeatedly without side effects
3. **Isolated**: Checks should not depend on other checks
4. **Meaningful**: Checks should validate actual functionality, not just existence
5. **Critical vs Warning**: Reserve critical status for failures that prevent core functionality

### Check Intervals

```go
// Critical infrastructure: Check frequently
databaseCheck.Interval = 15 * time.Second

// Expensive operations: Check less frequently
replicationCheck.Interval = 120 * time.Second

// Active probes: Balance thoroughness with overhead
pubsubProbe.Interval = 60 * time.Second
```

### Timeout Configuration

```go
// Fast checks: Short timeout
connectivityCheck.Timeout = 5 * time.Second

// Network operations: Longer timeout
dhtProbe.Timeout = 20 * time.Second

// Complex operations: Generous timeout
systemCheck.Timeout = 30 * time.Second
```

### Critical Check Guidelines

Mark a check as critical when:
- Failure prevents core system functionality
- Continued operation would cause data corruption
- User-facing services become unavailable
- System cannot safely recover automatically

Do NOT mark as critical when:
- Failure is temporary or transient
- System can operate in degraded mode
- Alternative mechanisms exist
- Recovery is possible without restart

### Error Handling

```go
Checker: func(ctx context.Context) health.CheckResult {
    // Handle context cancellation
    select {
    case <-ctx.Done():
        return health.CheckResult{
            Healthy:   false,
            Message:   "Check cancelled",
            Error:     ctx.Err(),
            Timestamp: time.Now(),
        }
    default:
    }

    // Perform check with timeout
    result := make(chan error, 1)
    go func() {
        result <- performCheck()
    }()

    select {
    case err := <-result:
        if err != nil {
            return health.CheckResult{
                Healthy:   false,
                Message:   fmt.Sprintf("Check failed: %v", err),
                Error:     err,
                Timestamp: time.Now(),
            }
        }
        return health.CheckResult{
            Healthy:   true,
            Message:   "Check passed",
            Timestamp: time.Now(),
        }
    case <-ctx.Done():
        return health.CheckResult{
            Healthy:   false,
            Message:   "Check timeout",
            Error:     ctx.Err(),
            Timestamp: time.Now(),
        }
    }
}
```

### Detailed Results

Provide structured details for debugging:

```go
return health.CheckResult{
    Healthy: true,
    Message: "P2P network healthy",
    Details: map[string]interface{}{
        "connected_peers": 5,
        "min_peers":       3,
        "max_peers":       20,
        "current_usage":   "25%",
        "peer_quality":    0.85,
        "network_latency": "50ms",
    },
    Timestamp: time.Now(),
}
```

## Custom Health Checks

### Simple Health Check

```go
simpleCheck := &health.HealthCheck{
    Name:        "my-service",
    Description: "Custom service health check",
    Enabled:     true,
    Critical:    false,
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        // Your check logic
        healthy := checkMyService()

        return health.CheckResult{
            Healthy:   healthy,
            Message:   "Service status",
            Timestamp: time.Now(),
        }
    },
}

manager.RegisterCheck(simpleCheck)
```

### Health Check with Metrics

```go
type ServiceHealthCheck struct {
    service          MyService
    metricsCollector *metrics.CHORUSMetrics
}

func (s *ServiceHealthCheck) Check(ctx context.Context) health.CheckResult {
    start := time.Now()

    err := s.service.Ping(ctx)
    latency := time.Since(start)

    if err != nil {
        s.metricsCollector.IncrementHealthCheckFailed("my-service", err.Error())
        return health.CheckResult{
            Healthy:   false,
            Message:   fmt.Sprintf("Service unavailable: %v", err),
            Error:     err,
            Latency:   latency,
            Timestamp: time.Now(),
        }
    }

    s.metricsCollector.IncrementHealthCheckPassed("my-service")
    return health.CheckResult{
        Healthy:   true,
        Message:   "Service available",
        Latency:   latency,
        Timestamp: time.Now(),
        Details: map[string]interface{}{
            "latency_ms": latency.Milliseconds(),
            "version":    s.service.Version(),
        },
    }
}
```

### Stateful Health Check

```go
type StatefulHealthCheck struct {
    consecutiveFailures int
    maxFailures         int
}

func (s *StatefulHealthCheck) Check(ctx context.Context) health.CheckResult {
    healthy := performCheck()

    if !healthy {
        s.consecutiveFailures++
    } else {
        s.consecutiveFailures = 0
    }

    // Only report unhealthy after multiple failures
    if s.consecutiveFailures >= s.maxFailures {
        return health.CheckResult{
            Healthy: false,
            Message: fmt.Sprintf("Failed %d consecutive checks", s.consecutiveFailures),
            Details: map[string]interface{}{
                "consecutive_failures": s.consecutiveFailures,
                "threshold":            s.maxFailures,
            },
            Timestamp: time.Now(),
        }
    }

    return health.CheckResult{
        Healthy:   true,
        Message:   "Check passed",
        Timestamp: time.Now(),
    }
}
```

## Monitoring and Alerting

### Prometheus Integration

Health check results are automatically exposed as metrics:

```promql
# Health check success rate
rate(chorus_health_checks_passed_total[5m]) /
(rate(chorus_health_checks_passed_total[5m]) + rate(chorus_health_checks_failed_total[5m]))

# System health score
chorus_system_health_score

# Component health
chorus_component_health_score{component="dht"}
```

### Alert Rules

```yaml
groups:
  - name: health_alerts
    interval: 30s
    rules:
      - alert: HealthCheckFailing
        expr: increase(chorus_health_checks_failed_total{check_name="database-connectivity"}[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical health check failing"
          description: "{{ $labels.check_name }} has been failing for 5 minutes"

      - alert: LowSystemHealth
        expr: chorus_system_health_score < 0.75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low system health score"
          description: "System health score: {{ $value }}"

      - alert: ComponentDegraded
        expr: chorus_component_health_score < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Component {{ $labels.component }} degraded"
          description: "Health score: {{ $value }}"
```

## Troubleshooting

### Health Check Not Running

```bash
# Check if health manager is started
curl http://localhost:8081/health/checks

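# Verify the check is registered: its name should appear under "checks" in the response.
# The documented response has no "enabled" field, so confirm Enabled: true where the check is registered.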

# Check application logs for errors
grep "health check" /var/log/chorus/chorus.log
```

### Health Check Timeouts

```go
// Increase timeout for slow operations
check.Timeout = 30 * time.Second

// Add timeout monitoring
Checker: func(ctx context.Context) health.CheckResult {
    deadline, ok := ctx.Deadline()
    if ok {
        log.Printf("Check deadline: %v (%.2fs remaining)",
            deadline, time.Until(deadline).Seconds())
    }
    // ... check logic
}
```

### False Positives

```go
// Add retry logic
attempts := 3
for i := 0; i < attempts; i++ {
    if checkPasses() {
        return health.CheckResult{Healthy: true, ...}
    }
    if i < attempts-1 {
        time.Sleep(100 * time.Millisecond)
    }
}
return health.CheckResult{Healthy: false, ...}
```
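
A fuller version of the retry idea, written so that it stops waiting when the check's context is cancelled. This is a sketch under stated assumptions: `checkWithRetry` is a hypothetical helper, `CheckResult` is a trimmed local stand-in, and `demo` simulates a flaky dependency.

```go
package main

import (
    "context"
    "fmt"
    "time"
)

// CheckResult is a trimmed local stand-in for health.CheckResult.
type CheckResult struct {
    Healthy bool
    Message string
}

// checkWithRetry calls probe up to attempts times, pausing delay between
// tries, and gives up early if ctx is cancelled during a pause.
func checkWithRetry(ctx context.Context, attempts int, delay time.Duration, probe func() bool) CheckResult {
    for i := 1; i <= attempts; i++ {
        if probe() {
            return CheckResult{Healthy: true, Message: fmt.Sprintf("passed on attempt %d", i)}
        }
        if i == attempts {
            break // no pause after the final attempt
        }
        select {
        case <-time.After(delay):
        case <-ctx.Done():
            return CheckResult{Healthy: false, Message: "cancelled while retrying: " + ctx.Err().Error()}
        }
    }
    return CheckResult{Healthy: false, Message: fmt.Sprintf("failed after %d attempts", attempts)}
}

// demo simulates a probe that starts passing on its passAfter-th call and
// allows three attempts in total.
func demo(passAfter int) CheckResult {
    calls := 0
    flaky := func() bool {
        calls++
        return calls >= passAfter
    }
    return checkWithRetry(context.Background(), 3, 10*time.Millisecond, flaky)
}

func main() {
    fmt.Println(demo(3).Message) // passed on attempt 3
    fmt.Println(demo(4).Message) // failed after 3 attempts
}
```

Keep the total retry budget (attempts multiplied by delay, plus probe time) well inside the check's `Timeout`, otherwise the retries themselves cause the timeout they were meant to avoid.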

### High Memory Usage

```go
// Limit check history
config := health.DefaultHealthConfig()
config.MaxHistoryEntries = 500  // Reduce from default 1000
config.HistoryCleanupInterval = 30 * time.Minute  // More frequent cleanup
```

## Testing

### Unit Testing Health Checks

```go
func TestHealthCheck(t *testing.T) {
    // Create check
    check := &health.HealthCheck{
        Name:    "test-check",
        Enabled: true,
        Timeout: 5 * time.Second,
        Checker: func(ctx context.Context) health.CheckResult {
            // Test logic
            return health.CheckResult{
                Healthy:   true,
                Message:   "Test passed",
                Timestamp: time.Now(),
            }
        },
    }

    // Execute check
    ctx := context.Background()
    result := check.Checker(ctx)

    // Verify result
    assert.True(t, result.Healthy)
    assert.Equal(t, "Test passed", result.Message)
}
```

### Integration Testing with Mocks

```go
func TestHealthManager(t *testing.T) {
    logger := &testLogger{}
    manager := health.NewManager("test-node", "v1.0.0", logger)

    // Register mock check
    mockCheck := &health.HealthCheck{
        Name:     "mock-check",
        Enabled:  true,
        Interval: 1 * time.Second,
        Timeout:  500 * time.Millisecond,
        Checker: func(ctx context.Context) health.CheckResult {
            return health.CheckResult{
                Healthy:   true,
                Message:   "Mock check passed",
                Timestamp: time.Now(),
            }
        },
    }
    manager.RegisterCheck(mockCheck)

    // Start manager
    err := manager.Start()
    assert.NoError(t, err)

    // Wait for check execution
    time.Sleep(2 * time.Second)

    // Verify status
    status := manager.GetStatus()
    assert.Equal(t, health.StatusHealthy, status.Status)
    assert.Contains(t, status.Checks, "mock-check")

    // Stop manager
    err = manager.Stop()
    assert.NoError(t, err)
}
```

## Related Documentation

- [Metrics Package Documentation](./metrics.md)
- [Shutdown Package Documentation](./shutdown.md)
- [Kubernetes Health Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)
- [CHORUS Election System](../modules/election.md)
- [CHORUS DHT System](../modules/dht.md)