CHORUS/docs/comprehensive/packages/health.md
anthonyrawlins c5b7311a8b docs: Add Phase 3 coordination and infrastructure documentation
2025-09-30 18:27:39 +10:00


# CHORUS Health Package
## Overview
The `pkg/health` package provides comprehensive health monitoring and readiness/liveness probe capabilities for the CHORUS distributed system. It orchestrates health checks across all system components, integrates with graceful shutdown, and exposes Kubernetes-compatible health endpoints.
## Architecture
### Core Components
1. **Manager**: Central health check orchestration and HTTP endpoint management
2. **HealthCheck**: Individual health check definitions with configurable intervals
3. **EnhancedHealthChecks**: Advanced health monitoring with metrics and history
4. **Adapters**: Integration layer for PubSub, DHT, and other subsystems
5. **SystemStatus**: Aggregated health status representation
### Health Check Types
- **Critical Checks**: Failures trigger graceful shutdown
- **Non-Critical Checks**: Failures degrade health status but don't trigger shutdown
- **Active Probes**: Synthetic tests that verify end-to-end functionality
- **Passive Checks**: Monitor existing system state without creating load
## Core Types
### HealthCheck
```go
type HealthCheck struct {
    Name        string                                // Unique check identifier
    Description string                                // Human-readable description
    Checker     func(ctx context.Context) CheckResult // Check execution function
    Interval    time.Duration                         // Check frequency (default: 30s)
    Timeout     time.Duration                         // Check timeout (default: 10s)
    Enabled     bool                                  // Enable/disable check
    Critical    bool                                  // If true, failure triggers shutdown
    LastRun     time.Time                             // Timestamp of last execution
    LastResult  *CheckResult                          // Most recent check result
}
```
### CheckResult
```go
type CheckResult struct {
    Healthy   bool                   // Check passed/failed
    Message   string                 // Human-readable result message
    Details   map[string]interface{} // Additional structured information
    Latency   time.Duration          // Check execution time
    Timestamp time.Time              // Result timestamp
    Error     error                  // Error details if check failed
}
```
### SystemStatus
```go
type SystemStatus struct {
    Status     Status                  // Overall status enum
    Message    string                  // Status description
    Checks     map[string]*CheckResult // All check results
    Uptime     time.Duration           // System uptime
    StartTime  time.Time               // System start timestamp
    LastUpdate time.Time               // Last status update
    Version    string                  // CHORUS version
    NodeID     string                  // Node identifier
}
```
### Status Levels
```go
const (
    StatusHealthy   Status = "healthy"   // All checks passing
    StatusDegraded  Status = "degraded"  // Some non-critical checks failing
    StatusUnhealthy Status = "unhealthy" // Critical checks failing
    StatusStarting  Status = "starting"  // System initializing
    StatusStopping  Status = "stopping"  // Graceful shutdown in progress
)
```
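The relationship between check criticality and the status levels above can be illustrated with a small sketch. This is only an assumption about how aggregation might work, based on the comments in the constants; the `outcome` type and `aggregate` function are hypothetical and not part of `pkg/health`.

```go
package main

import "fmt"

// Status mirrors the string-typed enum shown above.
type Status string

const (
    StatusHealthy   Status = "healthy"
    StatusDegraded  Status = "degraded"
    StatusUnhealthy Status = "unhealthy"
)

// outcome is a hypothetical pairing of a check's criticality with its last result.
type outcome struct {
    Critical bool
    Healthy  bool
}

// aggregate derives an overall status: any failing critical check makes the
// system unhealthy; otherwise any failing check makes it degraded.
func aggregate(outcomes map[string]outcome) Status {
    status := StatusHealthy
    for _, o := range outcomes {
        if o.Healthy {
            continue
        }
        if o.Critical {
            return StatusUnhealthy
        }
        status = StatusDegraded
    }
    return status
}

func main() {
    fmt.Println(aggregate(map[string]outcome{
        "database-connectivity": {Critical: true, Healthy: true},
        "disk-space":            {Critical: false, Healthy: false},
    })) // prints "degraded"
}
```

`StatusStarting` and `StatusStopping` are left out because they describe lifecycle phases rather than check outcomes.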
## Manager
### Initialization
```go
import (
    "time"

    "chorus/pkg/health"
    "chorus/pkg/shutdown"
)

// Create health manager
logger := yourLogger // Implements health.Logger interface
manager := health.NewManager("node-123", "v1.0.0", logger)

// Connect to shutdown manager for critical failures
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
manager.SetShutdownManager(shutdownMgr)
```
### Registration System
```go
// Register a health check
check := &health.HealthCheck{
    Name:        "database-connectivity",
    Description: "PostgreSQL database connectivity check",
    Enabled:     true,
    Critical:    true, // Failure triggers shutdown
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        // Perform health check
        err := db.PingContext(ctx)
        if err != nil {
            return health.CheckResult{
                Healthy:   false,
                Message:   fmt.Sprintf("Database ping failed: %v", err),
                Error:     err,
                Timestamp: time.Now(),
            }
        }
        return health.CheckResult{
            Healthy:   true,
            Message:   "Database connectivity OK",
            Timestamp: time.Now(),
        }
    },
}
manager.RegisterCheck(check)

// Unregister when no longer needed
manager.UnregisterCheck("database-connectivity")
```
### Lifecycle Management
```go
// Start health monitoring
if err := manager.Start(); err != nil {
    log.Fatalf("Failed to start health manager: %v", err)
}

// Start HTTP server for health endpoints
if err := manager.StartHTTPServer(8081); err != nil {
    log.Fatalf("Failed to start health HTTP server: %v", err)
}

// ... application runs ...

// Stop health monitoring during shutdown
if err := manager.Stop(); err != nil {
    log.Printf("Error stopping health manager: %v", err)
}
```
## HTTP Endpoints
### /health - Overall Health Status
**Method**: GET
**Description**: Returns comprehensive system health status
**Response Codes**:
- `200 OK`: System is healthy or degraded
- `503 Service Unavailable`: System is unhealthy, starting, or stopping
**Response Schema**:
```json
{
  "status": "healthy",
  "message": "All health checks passing",
  "checks": {
    "database-connectivity": {
      "healthy": true,
      "message": "Database connectivity OK",
      "latency": 15000000,
      "timestamp": "2025-09-30T10:30:00Z"
    },
    "p2p-connectivity": {
      "healthy": true,
      "message": "5 peers connected",
      "details": {
        "connected_peers": 5,
        "min_peers": 3
      },
      "latency": 8000000,
      "timestamp": "2025-09-30T10:30:05Z"
    }
  },
  "uptime": 86400000000000,
  "start_time": "2025-09-29T10:30:00Z",
  "last_update": "2025-09-30T10:30:05Z",
  "version": "v1.0.0",
  "node_id": "node-123"
}
```
### /health/ready - Readiness Probe
**Method**: GET
**Description**: Kubernetes readiness probe - indicates if node can handle requests
**Response Codes**:
- `200 OK`: Node is ready (healthy or degraded)
- `503 Service Unavailable`: Node is not ready
**Response Schema**:
```json
{
  "ready": true,
  "status": "healthy",
  "message": "All health checks passing"
}
```
**Usage**: Use for Kubernetes readiness probes to control traffic routing
```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```
### /health/live - Liveness Probe
**Method**: GET
**Description**: Kubernetes liveness probe - indicates if node is alive
**Response Codes**:
- `200 OK`: Process is alive (not stopping)
- `503 Service Unavailable`: Process is stopping
**Response Schema**:
```json
{
  "live": true,
  "status": "healthy",
  "uptime": "24h0m0s"
}
```
**Usage**: Use for Kubernetes liveness probes to restart unhealthy pods
```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8081
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```
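The difference between the two probes comes down to which statuses map to `200 OK`. The sketch below encodes the response-code rules listed for `/health/ready` and `/health/live`; `readyCode` and `liveCode` are hypothetical helper names for illustration, and the actual handlers in the package may be structured differently.

```go
package main

import (
    "fmt"
    "net/http"
)

// Status mirrors the status enum documented in this package.
type Status string

const (
    StatusHealthy   Status = "healthy"
    StatusDegraded  Status = "degraded"
    StatusUnhealthy Status = "unhealthy"
    StatusStarting  Status = "starting"
    StatusStopping  Status = "stopping"
)

// readyCode: the node accepts traffic only when healthy or degraded.
func readyCode(s Status) int {
    if s == StatusHealthy || s == StatusDegraded {
        return http.StatusOK
    }
    return http.StatusServiceUnavailable
}

// liveCode: the process counts as alive unless it is stopping.
func liveCode(s Status) int {
    if s == StatusStopping {
        return http.StatusServiceUnavailable
    }
    return http.StatusOK
}

func main() {
    all := []Status{StatusHealthy, StatusDegraded, StatusUnhealthy, StatusStarting, StatusStopping}
    for _, s := range all {
        fmt.Printf("%-9s ready=%d live=%d\n", s, readyCode(s), liveCode(s))
    }
}
```

Under these rules an unhealthy or starting node is taken out of rotation by the readiness probe but is not restarted by the liveness probe.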
### /health/checks - Detailed Check Results
**Method**: GET
**Description**: Returns detailed results for all registered health checks
**Response Schema**:
```json
{
  "checks": {
    "database-connectivity": {
      "healthy": true,
      "message": "Database connectivity OK",
      "latency": 15000000,
      "timestamp": "2025-09-30T10:30:00Z"
    },
    "p2p-connectivity": {
      "healthy": true,
      "message": "5 peers connected",
      "details": {
        "connected_peers": 5,
        "min_peers": 3
      },
      "latency": 8000000,
      "timestamp": "2025-09-30T10:30:05Z"
    }
  },
  "total": 2,
  "timestamp": "2025-09-30T10:30:10Z"
}
```
## Built-in Health Checks
### Database Connectivity Check
```go
check := health.CreateDatabaseCheck("primary-db", func() error {
    return db.Ping()
})
manager.RegisterCheck(check)
```
**Properties**:
- Critical: Yes
- Interval: 30 seconds
- Timeout: 10 seconds
- Checks: Database ping/connectivity
### Disk Space Check
```go
check := health.CreateDiskSpaceCheck("/var/lib/CHORUS", 0.90) // Alert at 90%
manager.RegisterCheck(check)
```
**Properties**:
- Critical: No (warning only)
- Interval: 60 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 90%)
### Memory Usage Check
```go
check := health.CreateMemoryCheck(0.85) // Alert at 85%
manager.RegisterCheck(check)
```
**Properties**:
- Critical: No (warning only)
- Interval: 30 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 85%)
### Active PubSub Check
```go
adapter := health.NewPubSubAdapter(pubsubInstance)
check := health.CreateActivePubSubCheck(adapter)
manager.RegisterCheck(check)
```
**Properties**:
- Critical: No
- Interval: 60 seconds
- Timeout: 15 seconds
- Test: Publish/subscribe loopback with unique message
- Validates: End-to-end PubSub functionality
**Test Flow**:
1. Subscribe to test topic `CHORUS/health-test/v1`
2. Publish unique test message with timestamp
3. Wait for message receipt (max 10 seconds)
4. Verify message integrity
5. Report success or timeout
### Active DHT Check
```go
adapter := health.NewDHTAdapter(dhtInstance)
check := health.CreateActiveDHTCheck(adapter)
manager.RegisterCheck(check)
```
**Properties**:
- Critical: No
- Interval: 90 seconds
- Timeout: 20 seconds
- Test: Put/get operation with unique key
- Validates: DHT storage and retrieval integrity
**Test Flow**:
1. Generate unique test key and value
2. Perform DHT put operation
3. Wait for propagation (100ms)
4. Perform DHT get operation
5. Verify retrieved value matches original
6. Report success, failure, or integrity violation
## Enhanced Health Checks
### Overview
The `EnhancedHealthChecks` system provides advanced monitoring with metrics tracking, historical data, and comprehensive component health scoring.
### Initialization
```go
import "chorus/pkg/health"
// Create enhanced health monitoring
enhanced := health.NewEnhancedHealthChecks(
    manager,        // Health manager
    electionMgr,    // Election manager
    dhtInstance,    // DHT instance
    pubsubInstance, // PubSub instance
    replicationMgr, // Replication manager
    logger,         // Logger
)
// Enhanced checks are automatically registered
```
### Configuration
```go
type HealthConfig struct {
    // Active probe intervals
    PubSubProbeInterval   time.Duration // Default: 30s
    DHTProbeInterval      time.Duration // Default: 60s
    ElectionProbeInterval time.Duration // Default: 15s

    // Probe timeouts
    PubSubProbeTimeout   time.Duration // Default: 10s
    DHTProbeTimeout      time.Duration // Default: 20s
    ElectionProbeTimeout time.Duration // Default: 5s

    // Thresholds
    MaxFailedProbes   int     // Default: 3
    HealthyThreshold  float64 // Default: 0.95
    DegradedThreshold float64 // Default: 0.75

    // History retention
    MaxHistoryEntries      int           // Default: 1000
    HistoryCleanupInterval time.Duration // Default: 1h

    // Enable/disable specific checks
    EnablePubSubProbes      bool // Default: true
    EnableDHTProbes         bool // Default: true
    EnableElectionProbes    bool // Default: true
    EnableReplicationProbes bool // Default: true
}

// Use custom configuration
config := health.DefaultHealthConfig()
config.PubSubProbeInterval = 45 * time.Second
config.HealthyThreshold = 0.98
// Note: config is an unexported field, so this assignment only
// compiles from code inside package health.
enhanced.config = config
```
### Enhanced Health Checks Registered
#### 1. Enhanced PubSub Check
- **Name**: `pubsub-enhanced`
- **Critical**: Yes
- **Interval**: Configurable (default: 30s)
- **Features**:
  - Loopback message testing
  - Success rate tracking
  - Consecutive failure counting
  - Latency measurement
  - Health score calculation
#### 2. Enhanced DHT Check
- **Name**: `dht-enhanced`
- **Critical**: Yes
- **Interval**: Configurable (default: 60s)
- **Features**:
  - Put/get operation testing
  - Data integrity verification
  - Replication health monitoring
  - Success rate tracking
  - Latency measurement
#### 3. Election Health Check
- **Name**: `election-health`
- **Critical**: No
- **Interval**: Configurable (default: 15s)
- **Features**:
  - Election state monitoring
  - Heartbeat status tracking
  - Leadership stability calculation
  - Admin uptime tracking
#### 4. Replication Health Check
- **Name**: `replication-health`
- **Critical**: No
- **Interval**: 120 seconds
- **Features**:
  - Replication metrics monitoring
  - Failure rate tracking
  - Average replication factor
  - Provider record counting
#### 5. P2P Connectivity Check
- **Name**: `p2p-connectivity`
- **Critical**: Yes
- **Interval**: 30 seconds
- **Features**:
  - Connected peer counting
  - Minimum peer threshold validation
  - Connectivity score calculation
#### 6. Resource Health Check
- **Name**: `resource-health`
- **Critical**: No
- **Interval**: 60 seconds
- **Features**:
  - CPU usage monitoring
  - Memory usage monitoring
  - Disk usage monitoring
  - Threshold-based alerting
#### 7. Task Manager Check
- **Name**: `task-manager`
- **Critical**: No
- **Interval**: 30 seconds
- **Features**:
  - Active task counting
  - Queue depth monitoring
  - Task success rate tracking
  - Capacity monitoring
### Health Metrics
```go
type HealthMetrics struct {
    // Overall system health
    SystemHealthScore   float64 // 0.0-1.0
    LastFullHealthCheck time.Time
    TotalHealthChecks   int64
    FailedHealthChecks  int64

    // PubSub metrics
    PubSubHealthScore      float64
    PubSubProbeLatency     time.Duration
    PubSubSuccessRate      float64
    PubSubLastSuccess      time.Time
    PubSubConsecutiveFails int

    // DHT metrics
    DHTHealthScore       float64
    DHTProbeLatency      time.Duration
    DHTSuccessRate       float64
    DHTLastSuccess       time.Time
    DHTConsecutiveFails  int
    DHTReplicationStatus map[string]*ReplicationStatus

    // Election metrics
    ElectionHealthScore  float64
    ElectionStability    float64
    HeartbeatLatency     time.Duration
    LeadershipChanges    int64
    LastLeadershipChange time.Time
    AdminUptime          time.Duration

    // Network metrics
    P2PConnectedPeers    int
    P2PConnectivityScore float64
    NetworkLatency       time.Duration

    // Resource metrics
    CPUUsage    float64
    MemoryUsage float64
    DiskUsage   float64

    // Service metrics
    ActiveTasks     int
    QueuedTasks     int
    TaskSuccessRate float64
}

// Access metrics
metrics := enhanced.GetHealthMetrics()
fmt.Printf("System Health Score: %.2f\n", metrics.SystemHealthScore)
fmt.Printf("PubSub Health: %.2f\n", metrics.PubSubHealthScore)
fmt.Printf("DHT Health: %.2f\n", metrics.DHTHealthScore)
```
### Health Summary
```go
summary := enhanced.GetHealthSummary()
// Returns:
// {
//   "status": "healthy",
//   "overall_score": 0.96,
//   "last_check": "2025-09-30T10:30:00Z",
//   "total_checks": 1523,
//   "component_scores": {
//     "pubsub": 0.98,
//     "dht": 0.95,
//     "election": 0.92,
//     "p2p": 1.0
//   },
//   "key_metrics": {
//     "connected_peers": 5,
//     "active_tasks": 3,
//     "admin_uptime": "2h30m15s",
//     "leadership_changes": 2,
//     "resource_utilization": {
//       "cpu": 0.45,
//       "memory": 0.62,
//       "disk": 0.73
//     }
//   }
// }
```
## Adapter System
### PubSub Adapter
Adapts the CHORUS PubSub system to the health check interface:
```go
type PubSubInterface interface {
    SubscribeToTopic(topic string, handler func([]byte)) error
    PublishToTopic(topic string, data interface{}) error
}
// Create adapter
adapter := health.NewPubSubAdapter(pubsubInstance)
// Use in health checks
check := health.CreateActivePubSubCheck(adapter)
```
### DHT Adapter
Adapts various DHT implementations to the health check interface:
```go
type DHTInterface interface {
    PutValue(ctx context.Context, key string, value []byte) error
    GetValue(ctx context.Context, key string) ([]byte, error)
}
// Create adapter (supports multiple DHT types)
adapter := health.NewDHTAdapter(dhtInstance)
// Use in health checks
check := health.CreateActiveDHTCheck(adapter)
```
**Supported DHT Types**:
- `*dht.LibP2PDHT`
- `*dht.MockDHTInterface`
- `*dht.EncryptedDHTStorage`
### Mock Adapters
For testing without real infrastructure:
```go
// Mock PubSub
mockPubSub := health.NewMockPubSubAdapter()
pubsubCheck := health.CreateActivePubSubCheck(mockPubSub)

// Mock DHT (a distinct variable name avoids redeclaring with :=)
mockDHT := health.NewMockDHTAdapter()
dhtCheck := health.CreateActiveDHTCheck(mockDHT)
```
## Integration with Graceful Shutdown
### Critical Health Check Failures
When a critical health check fails, the health manager can trigger graceful shutdown:
```go
// Connect managers
healthMgr.SetShutdownManager(shutdownMgr)

// Register critical check
criticalCheck := &health.HealthCheck{
    Name:     "database-connectivity",
    Critical: true, // Failure triggers shutdown
    Checker: func(ctx context.Context) health.CheckResult {
        // Check logic goes here; this placeholder always reports healthy
        return health.CheckResult{Healthy: true, Timestamp: time.Now()}
    },
}
healthMgr.RegisterCheck(criticalCheck)

// If check fails, shutdown is automatically initiated
```
### Shutdown Integration Example
```go
import (
    "context"
    "time"

    "chorus/pkg/health"
    "chorus/pkg/shutdown"
)

// Create managers
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
healthMgr := health.NewManager("node-123", "v1.0.0", logger)
healthMgr.SetShutdownManager(shutdownMgr)

// Register health manager for shutdown
healthComponent := shutdown.NewGenericComponent("health-manager", 10, true).
    SetShutdownFunc(func(ctx context.Context) error {
        return healthMgr.Stop()
    })
shutdownMgr.Register(healthComponent)

// Add pre-shutdown hook to update health status
shutdownMgr.AddHook(shutdown.PhasePreShutdown, func(ctx context.Context) error {
    status := healthMgr.GetStatus()
    status.Status = health.StatusStopping
    status.Message = "System is shutting down"
    return nil
})

// Start systems
healthMgr.Start()
healthMgr.StartHTTPServer(8081)
shutdownMgr.Start()

// Wait for shutdown
shutdownMgr.Wait()
```
## Health Check Best Practices
### Design Principles
1. **Fast Execution**: Health checks should complete quickly (< 5 seconds typical)
2. **Idempotent**: Checks should be safe to run repeatedly without side effects
3. **Isolated**: Checks should not depend on other checks
4. **Meaningful**: Checks should validate actual functionality, not just existence
5. **Critical vs Warning**: Reserve critical status for failures that prevent core functionality
### Check Intervals
```go
// Critical infrastructure: Check frequently
databaseCheck.Interval = 15 * time.Second
// Expensive operations: Check less frequently
replicationCheck.Interval = 120 * time.Second
// Active probes: Balance thoroughness with overhead
pubsubProbe.Interval = 60 * time.Second
```
### Timeout Configuration
```go
// Fast checks: Short timeout
connectivityCheck.Timeout = 5 * time.Second
// Network operations: Longer timeout
dhtProbe.Timeout = 20 * time.Second
// Complex operations: Generous timeout
systemCheck.Timeout = 30 * time.Second
```
### Critical Check Guidelines
Mark a check as critical when:
- Failure prevents core system functionality
- Continued operation would cause data corruption
- User-facing services become unavailable
- System cannot safely recover automatically
Do NOT mark as critical when:
- Failure is temporary or transient
- System can operate in degraded mode
- Alternative mechanisms exist
- Recovery is possible without restart
### Error Handling
```go
Checker: func(ctx context.Context) health.CheckResult {
    // Handle context cancellation
    select {
    case <-ctx.Done():
        return health.CheckResult{
            Healthy:   false,
            Message:   "Check cancelled",
            Error:     ctx.Err(),
            Timestamp: time.Now(),
        }
    default:
    }

    // Perform check with timeout
    result := make(chan error, 1)
    go func() {
        result <- performCheck()
    }()

    select {
    case err := <-result:
        if err != nil {
            return health.CheckResult{
                Healthy:   false,
                Message:   fmt.Sprintf("Check failed: %v", err),
                Error:     err,
                Timestamp: time.Now(),
            }
        }
        return health.CheckResult{
            Healthy:   true,
            Message:   "Check passed",
            Timestamp: time.Now(),
        }
    case <-ctx.Done():
        return health.CheckResult{
            Healthy:   false,
            Message:   "Check timeout",
            Error:     ctx.Err(),
            Timestamp: time.Now(),
        }
    }
}
```
### Detailed Results
Provide structured details for debugging:
```go
return health.CheckResult{
    Healthy: true,
    Message: "P2P network healthy",
    Details: map[string]interface{}{
        "connected_peers": 5,
        "min_peers":       3,
        "max_peers":       20,
        "current_usage":   "25%",
        "peer_quality":    0.85,
        "network_latency": "50ms",
    },
    Timestamp: time.Now(),
}
```
## Custom Health Checks
### Simple Health Check
```go
simpleCheck := &health.HealthCheck{
    Name:        "my-service",
    Description: "Custom service health check",
    Enabled:     true,
    Critical:    false,
    Interval:    30 * time.Second,
    Timeout:     10 * time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        // Your check logic
        healthy := checkMyService()
        return health.CheckResult{
            Healthy:   healthy,
            Message:   "Service status",
            Timestamp: time.Now(),
        }
    },
}
manager.RegisterCheck(simpleCheck)
```
### Health Check with Metrics
```go
type ServiceHealthCheck struct {
    service          MyService
    metricsCollector *metrics.CHORUSMetrics
}

func (s *ServiceHealthCheck) Check(ctx context.Context) health.CheckResult {
    start := time.Now()
    err := s.service.Ping(ctx)
    latency := time.Since(start)

    if err != nil {
        s.metricsCollector.IncrementHealthCheckFailed("my-service", err.Error())
        return health.CheckResult{
            Healthy:   false,
            Message:   fmt.Sprintf("Service unavailable: %v", err),
            Error:     err,
            Latency:   latency,
            Timestamp: time.Now(),
        }
    }

    s.metricsCollector.IncrementHealthCheckPassed("my-service")
    return health.CheckResult{
        Healthy:   true,
        Message:   "Service available",
        Latency:   latency,
        Timestamp: time.Now(),
        Details: map[string]interface{}{
            "latency_ms": latency.Milliseconds(),
            "version":    s.service.Version(),
        },
    }
}
```
### Stateful Health Check
```go
type StatefulHealthCheck struct {
    consecutiveFailures int
    maxFailures         int
}

func (s *StatefulHealthCheck) Check(ctx context.Context) health.CheckResult {
    healthy := performCheck()
    if !healthy {
        s.consecutiveFailures++
    } else {
        s.consecutiveFailures = 0
    }

    // Only report unhealthy after multiple failures
    if s.consecutiveFailures >= s.maxFailures {
        return health.CheckResult{
            Healthy: false,
            Message: fmt.Sprintf("Failed %d consecutive checks", s.consecutiveFailures),
            Details: map[string]interface{}{
                "consecutive_failures": s.consecutiveFailures,
                "threshold":            s.maxFailures,
            },
            Timestamp: time.Now(),
        }
    }

    return health.CheckResult{
        Healthy:   true,
        Message:   "Check passed",
        Timestamp: time.Now(),
    }
}
```
## Monitoring and Alerting
### Prometheus Integration
Health check results are automatically exposed as metrics:
```promql
# Health check success rate
rate(chorus_health_checks_passed_total[5m]) /
(rate(chorus_health_checks_passed_total[5m]) + rate(chorus_health_checks_failed_total[5m]))
# System health score
chorus_system_health_score
# Component health
chorus_component_health_score{component="dht"}
```
### Alert Rules
```yaml
groups:
  - name: health_alerts
    interval: 30s
    rules:
      # A raw counter never decreases, so compare its recent increase instead
      - alert: HealthCheckFailing
        expr: increase(chorus_health_checks_failed_total{check_name="database-connectivity"}[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical health check failing"
          description: "{{ $labels.check_name }} has been failing for 5 minutes"
      - alert: LowSystemHealth
        expr: chorus_system_health_score < 0.75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low system health score"
          description: "System health score: {{ $value }}"
      - alert: ComponentDegraded
        expr: chorus_component_health_score < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Component {{ $labels.component }} degraded"
          description: "Health score: {{ $value }}"
```
## Troubleshooting
### Health Check Not Running
```bash
# Check if health manager is started
curl http://localhost:8081/health/checks
# Verify check is registered and enabled
# Look for "enabled": true in response
# Check application logs for errors
grep "health check" /var/log/chorus/chorus.log
```
### Health Check Timeouts
```go
// Increase timeout for slow operations
check.Timeout = 30 * time.Second

// Add timeout monitoring
Checker: func(ctx context.Context) health.CheckResult {
    deadline, ok := ctx.Deadline()
    if ok {
        log.Printf("Check deadline: %v (%.2fs remaining)",
            deadline, time.Until(deadline).Seconds())
    }
    // ... check logic
}
```
### False Positives
```go
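// Add retry logic so a single transient failure is not reported
Checker: func(ctx context.Context) health.CheckResult {
    const attempts = 3
    for i := 0; i < attempts; i++ {
        if checkPasses() {
            return health.CheckResult{Healthy: true, Message: "Check passed", Timestamp: time.Now()}
        }
        if i < attempts-1 {
            time.Sleep(100 * time.Millisecond) // Brief pause before retrying
        }
    }
    return health.CheckResult{
        Healthy:   false,
        Message:   fmt.Sprintf("Check failed after %d attempts", attempts),
        Timestamp: time.Now(),
    }
}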
```
### High Memory Usage
```go
// Limit check history
config := health.DefaultHealthConfig()
config.MaxHistoryEntries = 500 // Reduce from default 1000
config.HistoryCleanupInterval = 30 * time.Minute // More frequent cleanup
```
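The effect of `MaxHistoryEntries` on memory can be pictured as a bounded buffer that discards the oldest entries once the cap is reached. The `history` type below is a hypothetical illustration of that idea, not the package's actual retention code.

```go
package main

import "fmt"

// history keeps at most max entries, dropping the oldest first.
type history struct {
    max     int
    entries []string
}

// add appends an entry and trims the slice back to the cap.
func (h *history) add(e string) {
    h.entries = append(h.entries, e)
    if over := len(h.entries) - h.max; over > 0 {
        // Copy into a fresh slice so dropped entries can be garbage collected.
        h.entries = append([]string(nil), h.entries[over:]...)
    }
}

func main() {
    h := &history{max: 3}
    for i := 1; i <= 5; i++ {
        h.add(fmt.Sprintf("r%d", i))
    }
    fmt.Println(len(h.entries), h.entries) // prints "3 [r3 r4 r5]"
}
```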
## Testing
### Unit Testing Health Checks
```go
func TestHealthCheck(t *testing.T) {
    // Create check
    check := &health.HealthCheck{
        Name:    "test-check",
        Enabled: true,
        Timeout: 5 * time.Second,
        Checker: func(ctx context.Context) health.CheckResult {
            // Test logic
            return health.CheckResult{
                Healthy:   true,
                Message:   "Test passed",
                Timestamp: time.Now(),
            }
        },
    }

    // Execute check
    ctx := context.Background()
    result := check.Checker(ctx)

    // Verify result
    assert.True(t, result.Healthy)
    assert.Equal(t, "Test passed", result.Message)
}
```
### Integration Testing with Mocks
```go
func TestHealthManager(t *testing.T) {
    logger := &testLogger{}
    manager := health.NewManager("test-node", "v1.0.0", logger)

    // Register mock check
    mockCheck := &health.HealthCheck{
        Name:     "mock-check",
        Enabled:  true,
        Interval: 1 * time.Second,
        Timeout:  500 * time.Millisecond,
        Checker: func(ctx context.Context) health.CheckResult {
            return health.CheckResult{
                Healthy:   true,
                Message:   "Mock check passed",
                Timestamp: time.Now(),
            }
        },
    }
    manager.RegisterCheck(mockCheck)

    // Start manager
    err := manager.Start()
    assert.NoError(t, err)

    // Wait for check execution
    time.Sleep(2 * time.Second)

    // Verify status
    status := manager.GetStatus()
    assert.Equal(t, health.StatusHealthy, status.Status)
    assert.Contains(t, status.Checks, "mock-check")

    // Stop manager
    err = manager.Stop()
    assert.NoError(t, err)
}
```
## Related Documentation
- [Metrics Package Documentation](./metrics.md)
- [Shutdown Package Documentation](./shutdown.md)
- [Kubernetes Health Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)
- [CHORUS Election System](../modules/election.md)
- [CHORUS DHT System](../modules/dht.md)