# CHORUS Health Package

## Overview

The `pkg/health` package provides comprehensive health monitoring and readiness/liveness probe capabilities for the CHORUS distributed system. It orchestrates health checks across all system components, integrates with graceful shutdown, and exposes Kubernetes-compatible health endpoints.

## Architecture

### Core Components

1. **Manager**: Central health check orchestration and HTTP endpoint management
2. **HealthCheck**: Individual health check definitions with configurable intervals
3. **EnhancedHealthChecks**: Advanced health monitoring with metrics and history
4. **Adapters**: Integration layer for PubSub, DHT, and other subsystems
5. **SystemStatus**: Aggregated health status representation

### Health Check Types

- **Critical Checks**: Failures trigger graceful shutdown
- **Non-Critical Checks**: Failures degrade health status but don't trigger shutdown
- **Active Probes**: Synthetic tests that verify end-to-end functionality
- **Passive Checks**: Monitor existing system state without creating load

## Core Types

### HealthCheck

```go
type HealthCheck struct {
	Name        string                                // Unique check identifier
	Description string                                // Human-readable description
	Checker     func(ctx context.Context) CheckResult // Check execution function
	Interval    time.Duration                         // Check frequency (default: 30s)
	Timeout     time.Duration                         // Check timeout (default: 10s)
	Enabled     bool                                  // Enable/disable check
	Critical    bool                                  // If true, failure triggers shutdown
	LastRun     time.Time                             // Timestamp of last execution
	LastResult  *CheckResult                          // Most recent check result
}
```

### CheckResult

```go
type CheckResult struct {
	Healthy   bool                   // Check passed/failed
	Message   string                 // Human-readable result message
	Details   map[string]interface{} // Additional structured information
	Latency   time.Duration          // Check execution time
	Timestamp time.Time              // Result timestamp
	Error     error                  // Error details if check failed
}
```

### SystemStatus

```go
type SystemStatus struct {
	Status     Status                  // Overall status enum
	Message    string                  // Status description
	Checks     map[string]*CheckResult // All check results
	Uptime     time.Duration           // System uptime
	StartTime  time.Time               // System start timestamp
	LastUpdate time.Time               // Last status update
	Version    string                  // CHORUS version
	NodeID     string                  // Node identifier
}
```

### Status Levels

```go
const (
	StatusHealthy   Status = "healthy"   // All checks passing
	StatusDegraded  Status = "degraded"  // Some non-critical checks failing
	StatusUnhealthy Status = "unhealthy" // Critical checks failing
	StatusStarting  Status = "starting"  // System initializing
	StatusStopping  Status = "stopping"  // Graceful shutdown in progress
)
```

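The comments on the first three status levels suggest a simple roll-up rule from individual check results to an overall status. The following is a minimal sketch of that rule, not the package's actual aggregation code; `checkOutcome` and `aggregate` are hypothetical names introduced only for illustration.

```go
package main

import "fmt"

type Status string

const (
	StatusHealthy   Status = "healthy"
	StatusDegraded  Status = "degraded"
	StatusUnhealthy Status = "unhealthy"
)

// checkOutcome is a hypothetical, trimmed-down view of one check's latest result.
type checkOutcome struct {
	Healthy  bool
	Critical bool
}

// aggregate applies the rule implied by the status comments: any failing
// critical check means unhealthy; otherwise any failing check means degraded.
func aggregate(outcomes map[string]checkOutcome) Status {
	status := StatusHealthy
	for _, o := range outcomes {
		if o.Healthy {
			continue
		}
		if o.Critical {
			return StatusUnhealthy
		}
		status = StatusDegraded
	}
	return status
}

func main() {
	fmt.Println(aggregate(map[string]checkOutcome{
		"database-connectivity": {Healthy: true, Critical: true},
		"disk-space":            {Healthy: false, Critical: false},
	})) // prints "degraded"
}
```

The `starting` and `stopping` levels describe lifecycle state rather than check results, so they are left out of this sketch.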
## Manager

### Initialization

```go
import (
	"time"

	"chorus/pkg/health"
	"chorus/pkg/shutdown"
)

// Create health manager
logger := yourLogger // Implements health.Logger interface
manager := health.NewManager("node-123", "v1.0.0", logger)

// Connect to shutdown manager for critical failures
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
manager.SetShutdownManager(shutdownMgr)
```

### Registration System

```go
// Register a health check
check := &health.HealthCheck{
	Name:        "database-connectivity",
	Description: "PostgreSQL database connectivity check",
	Enabled:     true,
	Critical:    true, // Failure triggers shutdown
	Interval:    30 * time.Second,
	Timeout:     10 * time.Second,
	Checker: func(ctx context.Context) health.CheckResult {
		// Perform health check
		err := db.PingContext(ctx)
		if err != nil {
			return health.CheckResult{
				Healthy:   false,
				Message:   fmt.Sprintf("Database ping failed: %v", err),
				Error:     err,
				Timestamp: time.Now(),
			}
		}
		return health.CheckResult{
			Healthy:   true,
			Message:   "Database connectivity OK",
			Timestamp: time.Now(),
		}
	},
}

manager.RegisterCheck(check)

// Unregister when no longer needed
manager.UnregisterCheck("database-connectivity")
```

### Lifecycle Management

```go
// Start health monitoring
if err := manager.Start(); err != nil {
	log.Fatalf("Failed to start health manager: %v", err)
}

// Start HTTP server for health endpoints
if err := manager.StartHTTPServer(8081); err != nil {
	log.Fatalf("Failed to start health HTTP server: %v", err)
}

// ... application runs ...

// Stop health monitoring during shutdown
if err := manager.Stop(); err != nil {
	log.Printf("Error stopping health manager: %v", err)
}
```

## HTTP Endpoints

### /health - Overall Health Status

**Method**: GET
**Description**: Returns comprehensive system health status

**Response Codes**:
- `200 OK`: System is healthy or degraded
- `503 Service Unavailable`: System is unhealthy, starting, or stopping

**Response Schema**:
```json
{
  "status": "healthy",
  "message": "All health checks passing",
  "checks": {
    "database-connectivity": {
      "healthy": true,
      "message": "Database connectivity OK",
      "latency": 15000000,
      "timestamp": "2025-09-30T10:30:00Z"
    },
    "p2p-connectivity": {
      "healthy": true,
      "message": "5 peers connected",
      "details": {
        "connected_peers": 5,
        "min_peers": 3
      },
      "latency": 8000000,
      "timestamp": "2025-09-30T10:30:05Z"
    }
  },
  "uptime": 86400000000000,
  "start_time": "2025-09-29T10:30:00Z",
  "last_update": "2025-09-30T10:30:05Z",
  "version": "v1.0.0",
  "node_id": "node-123"
}
```

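The integer `latency` and `uptime` values in the sample response appear to be nanoseconds (`86400000000000` ns is 24 hours, matching the one-day gap between `start_time` and `last_update`), which is how Go's `encoding/json` serializes a `time.Duration`. Under that assumption, a client could decode the response as sketched below. The `statusJSON` and `checkJSON` types and the `parseStatus` helper are hypothetical and mirror only a subset of the fields shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// checkJSON mirrors part of one entry under "checks" in the sample response.
type checkJSON struct {
	Healthy bool          `json:"healthy"`
	Message string        `json:"message"`
	Latency time.Duration `json:"latency"` // assumed to be integer nanoseconds
}

// statusJSON mirrors part of the top-level /health response.
type statusJSON struct {
	Status string               `json:"status"`
	Checks map[string]checkJSON `json:"checks"`
	Uptime time.Duration        `json:"uptime"` // assumed to be integer nanoseconds
	NodeID string               `json:"node_id"`
}

// sampleBody is a shortened copy of the sample response above.
const sampleBody = `{
  "status": "healthy",
  "checks": {
    "database-connectivity": {
      "healthy": true,
      "message": "Database connectivity OK",
      "latency": 15000000
    }
  },
  "uptime": 86400000000000,
  "node_id": "node-123"
}`

// parseStatus decodes a /health response body.
func parseStatus(body []byte) (statusJSON, error) {
	var s statusJSON
	err := json.Unmarshal(body, &s)
	return s, err
}

func main() {
	s, err := parseStatus([]byte(sampleBody))
	if err != nil {
		panic(err)
	}
	// prints "healthy 15ms 24h0m0s"
	fmt.Println(s.Status, s.Checks["database-connectivity"].Latency, s.Uptime)
}
```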
### /health/ready - Readiness Probe

**Method**: GET
**Description**: Kubernetes readiness probe - indicates if node can handle requests

**Response Codes**:
- `200 OK`: Node is ready (healthy or degraded)
- `503 Service Unavailable`: Node is not ready

**Response Schema**:
```json
{
  "ready": true,
  "status": "healthy",
  "message": "All health checks passing"
}
```

**Usage**: Use for Kubernetes readiness probes to control traffic routing

```yaml
readinessProbe:
  httpGet:
    path: /health/ready
    port: 8081
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
```

### /health/live - Liveness Probe

**Method**: GET
**Description**: Kubernetes liveness probe - indicates if node is alive

**Response Codes**:
- `200 OK`: Process is alive (not stopping)
- `503 Service Unavailable`: Process is stopping

**Response Schema**:
```json
{
  "live": true,
  "status": "healthy",
  "uptime": "24h0m0s"
}
```

**Usage**: Use for Kubernetes liveness probes to restart unhealthy pods

```yaml
livenessProbe:
  httpGet:
    path: /health/live
    port: 8081
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
```

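The readiness and liveness response codes differ only in which statuses map to `503`. The sketch below restates the documented response codes as two small functions; `readyCode` and `liveCode` are hypothetical names, and the real handlers may be structured differently.

```go
package main

import (
	"fmt"
	"net/http"
)

type Status string

const (
	StatusHealthy   Status = "healthy"
	StatusDegraded  Status = "degraded"
	StatusUnhealthy Status = "unhealthy"
	StatusStarting  Status = "starting"
	StatusStopping  Status = "stopping"
)

// readyCode follows the readiness rule: ready when healthy or degraded.
func readyCode(s Status) int {
	switch s {
	case StatusHealthy, StatusDegraded:
		return http.StatusOK
	default:
		return http.StatusServiceUnavailable
	}
}

// liveCode follows the liveness rule: alive unless the process is stopping.
func liveCode(s Status) int {
	if s == StatusStopping {
		return http.StatusServiceUnavailable
	}
	return http.StatusOK
}

func main() {
	all := []Status{StatusHealthy, StatusDegraded, StatusUnhealthy, StatusStarting, StatusStopping}
	for _, s := range all {
		fmt.Printf("%-9s ready=%d live=%d\n", s, readyCode(s), liveCode(s))
	}
}
```

Read this way, an `unhealthy` node is removed from traffic by the readiness probe but is still reported as live.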
### /health/checks - Detailed Check Results

**Method**: GET
**Description**: Returns detailed results for all registered health checks

**Response Schema**:
```json
{
  "checks": {
    "database-connectivity": {
      "healthy": true,
      "message": "Database connectivity OK",
      "latency": 15000000,
      "timestamp": "2025-09-30T10:30:00Z"
    },
    "p2p-connectivity": {
      "healthy": true,
      "message": "5 peers connected",
      "details": {
        "connected_peers": 5,
        "min_peers": 3
      },
      "latency": 8000000,
      "timestamp": "2025-09-30T10:30:05Z"
    }
  },
  "total": 2,
  "timestamp": "2025-09-30T10:30:10Z"
}
```

## Built-in Health Checks

### Database Connectivity Check

```go
check := health.CreateDatabaseCheck("primary-db", func() error {
	return db.Ping()
})
manager.RegisterCheck(check)
```

**Properties**:
- Critical: Yes
- Interval: 30 seconds
- Timeout: 10 seconds
- Checks: Database ping/connectivity

### Disk Space Check

```go
check := health.CreateDiskSpaceCheck("/var/lib/CHORUS", 0.90) // Alert at 90%
manager.RegisterCheck(check)
```

**Properties**:
- Critical: No (warning only)
- Interval: 60 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 90%)

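The threshold is passed as a fraction (`0.90` for 90%). As an illustration of the comparison such a check presumably performs, here is a small sketch with a hypothetical `diskUsageOK` helper; how the real check obtains byte counts, and whether it alerts at exactly the threshold, are not specified here.

```go
package main

import "fmt"

// diskUsageOK reports whether the used fraction of a volume is below the
// threshold, and returns that fraction. threshold is a fraction in 0.0-1.0.
func diskUsageOK(usedBytes, totalBytes uint64, threshold float64) (bool, float64) {
	if totalBytes == 0 {
		return false, 0 // treat an unreadable or empty volume as a failure
	}
	used := float64(usedBytes) / float64(totalBytes)
	return used < threshold, used
}

func main() {
	// 460 GiB used of 500 GiB with a 90% threshold.
	ok, used := diskUsageOK(460<<30, 500<<30, 0.90)
	fmt.Printf("used=%.0f%% ok=%v\n", used*100, ok) // prints "used=92% ok=false"
}
```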
### Memory Usage Check

```go
check := health.CreateMemoryCheck(0.85) // Alert at 85%
manager.RegisterCheck(check)
```

**Properties**:
- Critical: No (warning only)
- Interval: 30 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 85%)

### Active PubSub Check

```go
adapter := health.NewPubSubAdapter(pubsubInstance)
check := health.CreateActivePubSubCheck(adapter)
manager.RegisterCheck(check)
```

**Properties**:
- Critical: No
- Interval: 60 seconds
- Timeout: 15 seconds
- Test: Publish/subscribe loopback with unique message
- Validates: End-to-end PubSub functionality

**Test Flow**:
1. Subscribe to test topic `CHORUS/health-test/v1`
2. Publish unique test message with timestamp
3. Wait for message receipt (max 10 seconds)
4. Verify message integrity
5. Report success or timeout

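The five-step flow above can be sketched against an in-memory stand-in. This is an illustrative sketch only: `memPubSub` and `loopbackProbe` are hypothetical, and the real probe's message format and delivery path will differ. The stand-in uses the same method shapes as the adapter interface (`SubscribeToTopic`, `PublishToTopic`).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// memPubSub is a hypothetical in-process bus; handlers run synchronously on publish.
type memPubSub struct {
	mu       sync.Mutex
	handlers map[string][]func([]byte)
}

func (m *memPubSub) SubscribeToTopic(topic string, handler func([]byte)) error {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.handlers == nil {
		m.handlers = make(map[string][]func([]byte))
	}
	m.handlers[topic] = append(m.handlers[topic], handler)
	return nil
}

func (m *memPubSub) PublishToTopic(topic string, data interface{}) error {
	m.mu.Lock()
	hs := append([]func([]byte){}, m.handlers[topic]...)
	m.mu.Unlock()
	payload := []byte(fmt.Sprint(data))
	for _, h := range hs {
		h(payload)
	}
	return nil
}

// loopbackProbe subscribes, publishes a unique message, waits for it to come
// back, and verifies the payload.
func loopbackProbe(ps *memPubSub, wait time.Duration) error {
	const topic = "CHORUS/health-test/v1"
	want := fmt.Sprintf("health-probe-%d", time.Now().UnixNano())
	got := make(chan string, 1)

	err := ps.SubscribeToTopic(topic, func(b []byte) {
		select {
		case got <- string(b):
		default: // drop extra deliveries; one is enough
		}
	})
	if err != nil {
		return err
	}
	if err := ps.PublishToTopic(topic, want); err != nil {
		return err
	}
	select {
	case msg := <-got:
		if msg != want {
			return errors.New("loopback payload mismatch")
		}
		return nil
	case <-time.After(wait):
		return errors.New("timed out waiting for loopback message")
	}
}

func main() {
	if err := loopbackProbe(&memPubSub{}, 10*time.Second); err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("probe passed")
}
```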
### Active DHT Check

```go
adapter := health.NewDHTAdapter(dhtInstance)
check := health.CreateActiveDHTCheck(adapter)
manager.RegisterCheck(check)
```

**Properties**:
- Critical: No
- Interval: 90 seconds
- Timeout: 20 seconds
- Test: Put/get operation with unique key
- Validates: DHT storage and retrieval integrity

**Test Flow**:
1. Generate unique test key and value
2. Perform DHT put operation
3. Wait for propagation (100ms)
4. Perform DHT get operation
5. Verify retrieved value matches original
6. Report success, failure, or integrity violation

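The put/get flow above can be illustrated with an in-memory store that has the same method shapes as the DHT adapter interface (`PutValue`, `GetValue`). This is a sketch under stated assumptions: `memDHT`, `dhtProbe`, `runProbe`, and the key format are hypothetical and only show the order of operations.

```go
package main

import (
	"bytes"
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// memDHT is a hypothetical in-memory key/value store standing in for a DHT.
type memDHT struct {
	mu   sync.Mutex
	data map[string][]byte
}

func (d *memDHT) PutValue(ctx context.Context, key string, value []byte) error {
	d.mu.Lock()
	defer d.mu.Unlock()
	if d.data == nil {
		d.data = make(map[string][]byte)
	}
	d.data[key] = append([]byte(nil), value...)
	return nil
}

func (d *memDHT) GetValue(ctx context.Context, key string) ([]byte, error) {
	d.mu.Lock()
	defer d.mu.Unlock()
	v, ok := d.data[key]
	if !ok {
		return nil, errors.New("key not found")
	}
	return v, nil
}

var errIntegrity = errors.New("retrieved value does not match stored value")

// dhtProbe stores a unique value, waits briefly, reads it back, and compares.
func dhtProbe(ctx context.Context, d *memDHT) error {
	key := fmt.Sprintf("health-probe-%d", time.Now().UnixNano())
	want := []byte("value-for-" + key)

	if err := d.PutValue(ctx, key, want); err != nil {
		return fmt.Errorf("put: %w", err)
	}
	// Propagation wait that still honors cancellation.
	select {
	case <-time.After(100 * time.Millisecond):
	case <-ctx.Done():
		return ctx.Err()
	}
	got, err := d.GetValue(ctx, key)
	if err != nil {
		return fmt.Errorf("get: %w", err)
	}
	if !bytes.Equal(got, want) {
		return errIntegrity
	}
	return nil
}

// runProbe runs the probe with the 20-second timeout listed above.
func runProbe() error {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	return dhtProbe(ctx, &memDHT{})
}

func main() {
	if err := runProbe(); err != nil {
		fmt.Println("probe failed:", err)
		return
	}
	fmt.Println("probe passed")
}
```

Returning a distinct `errIntegrity` lets a caller tell an integrity violation apart from an ordinary put or get failure, matching the three outcomes in step 6.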
## Enhanced Health Checks

### Overview

The `EnhancedHealthChecks` system provides advanced monitoring with metrics tracking, historical data, and comprehensive component health scoring.

### Initialization

```go
import "chorus/pkg/health"

// Create enhanced health monitoring
enhanced := health.NewEnhancedHealthChecks(
	manager,        // Health manager
	electionMgr,    // Election manager
	dhtInstance,    // DHT instance
	pubsubInstance, // PubSub instance
	replicationMgr, // Replication manager
	logger,         // Logger
)

// Enhanced checks are automatically registered
```

### Configuration

```go
type HealthConfig struct {
	// Active probe intervals
	PubSubProbeInterval   time.Duration // Default: 30s
	DHTProbeInterval      time.Duration // Default: 60s
	ElectionProbeInterval time.Duration // Default: 15s

	// Probe timeouts
	PubSubProbeTimeout   time.Duration // Default: 10s
	DHTProbeTimeout      time.Duration // Default: 20s
	ElectionProbeTimeout time.Duration // Default: 5s

	// Thresholds
	MaxFailedProbes   int     // Default: 3
	HealthyThreshold  float64 // Default: 0.95
	DegradedThreshold float64 // Default: 0.75

	// History retention
	MaxHistoryEntries      int           // Default: 1000
	HistoryCleanupInterval time.Duration // Default: 1h

	// Enable/disable specific checks
	EnablePubSubProbes      bool // Default: true
	EnableDHTProbes         bool // Default: true
	EnableElectionProbes    bool // Default: true
	EnableReplicationProbes bool // Default: true
}

// Use custom configuration
config := health.DefaultHealthConfig()
config.PubSubProbeInterval = 45 * time.Second
config.HealthyThreshold = 0.98
// Note: config is an unexported field, so this assignment only compiles
// from code inside package health.
enhanced.config = config
```

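`HealthyThreshold` and `DegradedThreshold` split the 0.0-1.0 health score into three bands. The sketch below shows one plausible reading of those thresholds; the `thresholds` type and `classify` function are hypothetical, and whether the package compares with `>=` or `>` is an assumption.

```go
package main

import "fmt"

// thresholds holds just the two score cut-offs from HealthConfig.
type thresholds struct {
	Healthy  float64
	Degraded float64
}

// classify maps a health score to a status label using the two cut-offs.
func classify(score float64, t thresholds) string {
	switch {
	case score >= t.Healthy:
		return "healthy"
	case score >= t.Degraded:
		return "degraded"
	default:
		return "unhealthy"
	}
}

func main() {
	defaults := thresholds{Healthy: 0.95, Degraded: 0.75}
	for _, s := range []float64{0.98, 0.80, 0.40} {
		fmt.Printf("%.2f -> %s\n", s, classify(s, defaults))
	}
}
```

Raising `HealthyThreshold` to `0.98`, as in the custom configuration above, narrows the healthy band so that a score of `0.96` would be reported as degraded under this reading.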
### Enhanced Health Checks Registered

#### 1. Enhanced PubSub Check
- **Name**: `pubsub-enhanced`
- **Critical**: Yes
- **Interval**: Configurable (default: 30s)
- **Features**:
  - Loopback message testing
  - Success rate tracking
  - Consecutive failure counting
  - Latency measurement
  - Health score calculation

#### 2. Enhanced DHT Check
- **Name**: `dht-enhanced`
- **Critical**: Yes
- **Interval**: Configurable (default: 60s)
- **Features**:
  - Put/get operation testing
  - Data integrity verification
  - Replication health monitoring
  - Success rate tracking
  - Latency measurement

#### 3. Election Health Check
- **Name**: `election-health`
- **Critical**: No
- **Interval**: Configurable (default: 15s)
- **Features**:
  - Election state monitoring
  - Heartbeat status tracking
  - Leadership stability calculation
  - Admin uptime tracking

#### 4. Replication Health Check
- **Name**: `replication-health`
- **Critical**: No
- **Interval**: 120 seconds
- **Features**:
  - Replication metrics monitoring
  - Failure rate tracking
  - Average replication factor
  - Provider record counting

#### 5. P2P Connectivity Check
- **Name**: `p2p-connectivity`
- **Critical**: Yes
- **Interval**: 30 seconds
- **Features**:
  - Connected peer counting
  - Minimum peer threshold validation
  - Connectivity score calculation

#### 6. Resource Health Check
- **Name**: `resource-health`
- **Critical**: No
- **Interval**: 60 seconds
- **Features**:
  - CPU usage monitoring
  - Memory usage monitoring
  - Disk usage monitoring
  - Threshold-based alerting

#### 7. Task Manager Check
- **Name**: `task-manager`
- **Critical**: No
- **Interval**: 30 seconds
- **Features**:
  - Active task counting
  - Queue depth monitoring
  - Task success rate tracking
  - Capacity monitoring

### Health Metrics

```go
type HealthMetrics struct {
	// Overall system health
	SystemHealthScore   float64 // 0.0-1.0
	LastFullHealthCheck time.Time
	TotalHealthChecks   int64
	FailedHealthChecks  int64

	// PubSub metrics
	PubSubHealthScore      float64
	PubSubProbeLatency     time.Duration
	PubSubSuccessRate      float64
	PubSubLastSuccess      time.Time
	PubSubConsecutiveFails int

	// DHT metrics
	DHTHealthScore       float64
	DHTProbeLatency      time.Duration
	DHTSuccessRate       float64
	DHTLastSuccess       time.Time
	DHTConsecutiveFails  int
	DHTReplicationStatus map[string]*ReplicationStatus

	// Election metrics
	ElectionHealthScore  float64
	ElectionStability    float64
	HeartbeatLatency     time.Duration
	LeadershipChanges    int64
	LastLeadershipChange time.Time
	AdminUptime          time.Duration

	// Network metrics
	P2PConnectedPeers    int
	P2PConnectivityScore float64
	NetworkLatency       time.Duration

	// Resource metrics
	CPUUsage    float64
	MemoryUsage float64
	DiskUsage   float64

	// Service metrics
	ActiveTasks     int
	QueuedTasks     int
	TaskSuccessRate float64
}

// Access metrics
metrics := enhanced.GetHealthMetrics()
fmt.Printf("System Health Score: %.2f\n", metrics.SystemHealthScore)
fmt.Printf("PubSub Health: %.2f\n", metrics.PubSubHealthScore)
fmt.Printf("DHT Health: %.2f\n", metrics.DHTHealthScore)
```

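Fields such as `PubSubSuccessRate` and `PubSubConsecutiveFails` imply per-component bookkeeping across probe runs. The following sketch shows one way such counters could be maintained; `probeStats` is a hypothetical type, and treating "no probes yet" as a success rate of 1.0 is an assumption, not documented behavior.

```go
package main

import "fmt"

// probeStats is a hypothetical tracker for one component's probe history.
type probeStats struct {
	total            int64
	succeeded        int64
	consecutiveFails int
}

// record updates the counters after a single probe run.
func (p *probeStats) record(ok bool) {
	p.total++
	if ok {
		p.succeeded++
		p.consecutiveFails = 0 // any success resets the failure streak
		return
	}
	p.consecutiveFails++
}

// successRate returns the fraction of successful probes so far.
func (p *probeStats) successRate() float64 {
	if p.total == 0 {
		return 1.0 // assumption: no data is treated as healthy
	}
	return float64(p.succeeded) / float64(p.total)
}

func main() {
	var p probeStats
	for _, ok := range []bool{true, true, false, true, false, false} {
		p.record(ok)
	}
	// prints "rate=0.50 consecutive_fails=2"
	fmt.Printf("rate=%.2f consecutive_fails=%d\n", p.successRate(), p.consecutiveFails)
}
```

A consecutive-failure counter like this is what a limit such as `MaxFailedProbes` would naturally be compared against.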
### Health Summary

```go
summary := enhanced.GetHealthSummary()

// Returns:
// {
//   "status": "healthy",
//   "overall_score": 0.96,
//   "last_check": "2025-09-30T10:30:00Z",
//   "total_checks": 1523,
//   "component_scores": {
//     "pubsub": 0.98,
//     "dht": 0.95,
//     "election": 0.92,
//     "p2p": 1.0
//   },
//   "key_metrics": {
//     "connected_peers": 5,
//     "active_tasks": 3,
//     "admin_uptime": "2h30m15s",
//     "leadership_changes": 2,
//     "resource_utilization": {
//       "cpu": 0.45,
//       "memory": 0.62,
//       "disk": 0.73
//     }
//   }
// }
```

## Adapter System

### PubSub Adapter

Adapts the CHORUS PubSub system to the health check interface:

```go
type PubSubInterface interface {
	SubscribeToTopic(topic string, handler func([]byte)) error
	PublishToTopic(topic string, data interface{}) error
}

// Create adapter
adapter := health.NewPubSubAdapter(pubsubInstance)

// Use in health checks
check := health.CreateActivePubSubCheck(adapter)
```

### DHT Adapter

Adapts the supported DHT implementations to the health check interface:

```go
type DHTInterface interface {
	PutValue(ctx context.Context, key string, value []byte) error
	GetValue(ctx context.Context, key string) ([]byte, error)
}

// Create adapter (supports multiple DHT types)
adapter := health.NewDHTAdapter(dhtInstance)

// Use in health checks
check := health.CreateActiveDHTCheck(adapter)
```

**Supported DHT Types**:
- `*dht.LibP2PDHT`
- `*dht.MockDHTInterface`
- `*dht.EncryptedDHTStorage`

### Mock Adapters

For testing without real infrastructure:

```go
// Mock PubSub
mockPubSub := health.NewMockPubSubAdapter()
pubsubCheck := health.CreateActivePubSubCheck(mockPubSub)

// Mock DHT
mockDHT := health.NewMockDHTAdapter()
dhtCheck := health.CreateActiveDHTCheck(mockDHT)
```

## Integration with Graceful Shutdown

### Critical Health Check Failures

When a critical health check fails, the health manager can trigger graceful shutdown:

```go
// Connect managers
healthMgr.SetShutdownManager(shutdownMgr)

// Register critical check
criticalCheck := &health.HealthCheck{
	Name:     "database-connectivity",
	Critical: true, // Failure triggers shutdown
	Checker: func(ctx context.Context) health.CheckResult {
		// Check logic goes here; this placeholder always reports healthy
		return health.CheckResult{Healthy: true, Timestamp: time.Now()}
	},
}
healthMgr.RegisterCheck(criticalCheck)

// If check fails, shutdown is automatically initiated
```

### Shutdown Integration Example

```go
import (
	"chorus/pkg/health"
	"chorus/pkg/shutdown"
)

// Create managers
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
healthMgr := health.NewManager("node-123", "v1.0.0", logger)
healthMgr.SetShutdownManager(shutdownMgr)

// Register health manager for shutdown
healthComponent := shutdown.NewGenericComponent("health-manager", 10, true).
	SetShutdownFunc(func(ctx context.Context) error {
		return healthMgr.Stop()
	})
shutdownMgr.Register(healthComponent)

// Add pre-shutdown hook to update health status
shutdownMgr.AddHook(shutdown.PhasePreShutdown, func(ctx context.Context) error {
	status := healthMgr.GetStatus()
	status.Status = health.StatusStopping
	status.Message = "System is shutting down"
	return nil
})

// Start systems
healthMgr.Start()
healthMgr.StartHTTPServer(8081)
shutdownMgr.Start()

// Wait for shutdown
shutdownMgr.Wait()
```

## Health Check Best Practices

### Design Principles

1. **Fast Execution**: Health checks should complete quickly (< 5 seconds typical)
2. **Idempotent**: Checks should be safe to run repeatedly without side effects
3. **Isolated**: Checks should not depend on other checks
4. **Meaningful**: Checks should validate actual functionality, not just existence
5. **Critical vs Warning**: Reserve critical status for failures that prevent core functionality

### Check Intervals

```go
// Critical infrastructure: Check frequently
databaseCheck.Interval = 15 * time.Second

// Expensive operations: Check less frequently
replicationCheck.Interval = 120 * time.Second

// Active probes: Balance thoroughness with overhead
pubsubProbe.Interval = 60 * time.Second
```

### Timeout Configuration

```go
// Fast checks: Short timeout
connectivityCheck.Timeout = 5 * time.Second

// Network operations: Longer timeout
dhtProbe.Timeout = 20 * time.Second

// Complex operations: Generous timeout
systemCheck.Timeout = 30 * time.Second
```

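To make the effect of a timeout concrete, the sketch below wraps a check function in `context.WithTimeout` and records how long it ran. It is an illustration of the general pattern, not the manager's scheduler; `result`, `runWithTimeout`, `slowCheck`, and `demo` are hypothetical names.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// result is a trimmed-down stand-in for health.CheckResult.
type result struct {
	Healthy bool
	Message string
	Latency time.Duration
}

// runWithTimeout gives fn at most timeout to finish and measures its latency.
func runWithTimeout(parent context.Context, timeout time.Duration, fn func(context.Context) error) result {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	start := time.Now()
	done := make(chan error, 1) // buffered so the goroutine can always finish
	go func() { done <- fn(ctx) }()

	var err error
	select {
	case err = <-done:
	case <-ctx.Done():
		err = ctx.Err()
	}

	r := result{Healthy: err == nil, Message: "ok", Latency: time.Since(start)}
	if err != nil {
		r.Message = err.Error()
	}
	return r
}

// slowCheck simulates a 200ms operation that respects cancellation.
func slowCheck(ctx context.Context) error {
	select {
	case <-time.After(200 * time.Millisecond):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// demo runs slowCheck with the given timeout in milliseconds.
func demo(timeoutMs int) result {
	return runWithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond, slowCheck)
}

func main() {
	r := demo(50)
	fmt.Println(r.Healthy, r.Message) // prints "false context deadline exceeded"
}
```

With a 50ms timeout the 200ms check is reported as failed; with a timeout comfortably above 200ms the same check passes, which is why slow network probes are given longer timeouts.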
### Critical Check Guidelines

Mark a check as critical when:
- Failure prevents core system functionality
- Continued operation would cause data corruption
- User-facing services become unavailable
- System cannot safely recover automatically

Do NOT mark as critical when:
- Failure is temporary or transient
- System can operate in degraded mode
- Alternative mechanisms exist
- Recovery is possible without restart

### Error Handling

```go
Checker: func(ctx context.Context) health.CheckResult {
	// Handle context cancellation
	select {
	case <-ctx.Done():
		return health.CheckResult{
			Healthy:   false,
			Message:   "Check cancelled",
			Error:     ctx.Err(),
			Timestamp: time.Now(),
		}
	default:
	}

	// Perform check with timeout
	result := make(chan error, 1)
	go func() {
		result <- performCheck()
	}()

	select {
	case err := <-result:
		if err != nil {
			return health.CheckResult{
				Healthy:   false,
				Message:   fmt.Sprintf("Check failed: %v", err),
				Error:     err,
				Timestamp: time.Now(),
			}
		}
		return health.CheckResult{
			Healthy:   true,
			Message:   "Check passed",
			Timestamp: time.Now(),
		}
	case <-ctx.Done():
		return health.CheckResult{
			Healthy:   false,
			Message:   "Check timeout",
			Error:     ctx.Err(),
			Timestamp: time.Now(),
		}
	}
}
```

### Detailed Results

Provide structured details for debugging:

```go
return health.CheckResult{
	Healthy: true,
	Message: "P2P network healthy",
	Details: map[string]interface{}{
		"connected_peers": 5,
		"min_peers":       3,
		"max_peers":       20,
		"current_usage":   "25%",
		"peer_quality":    0.85,
		"network_latency": "50ms",
	},
	Timestamp: time.Now(),
}
```

## Custom Health Checks

### Simple Health Check

```go
simpleCheck := &health.HealthCheck{
	Name:        "my-service",
	Description: "Custom service health check",
	Enabled:     true,
	Critical:    false,
	Interval:    30 * time.Second,
	Timeout:     10 * time.Second,
	Checker: func(ctx context.Context) health.CheckResult {
		// Your check logic
		healthy := checkMyService()

		return health.CheckResult{
			Healthy:   healthy,
			Message:   "Service status",
			Timestamp: time.Now(),
		}
	},
}

manager.RegisterCheck(simpleCheck)
```

### Health Check with Metrics

```go
type ServiceHealthCheck struct {
	service          MyService
	metricsCollector *metrics.CHORUSMetrics
}

func (s *ServiceHealthCheck) Check(ctx context.Context) health.CheckResult {
	start := time.Now()

	err := s.service.Ping(ctx)
	latency := time.Since(start)

	if err != nil {
		s.metricsCollector.IncrementHealthCheckFailed("my-service", err.Error())
		return health.CheckResult{
			Healthy:   false,
			Message:   fmt.Sprintf("Service unavailable: %v", err),
			Error:     err,
			Latency:   latency,
			Timestamp: time.Now(),
		}
	}

	s.metricsCollector.IncrementHealthCheckPassed("my-service")
	return health.CheckResult{
		Healthy:   true,
		Message:   "Service available",
		Latency:   latency,
		Timestamp: time.Now(),
		Details: map[string]interface{}{
			"latency_ms": latency.Milliseconds(),
			"version":    s.service.Version(),
		},
	}
}
```

### Stateful Health Check

```go
type StatefulHealthCheck struct {
	consecutiveFailures int
	maxFailures         int
}

func (s *StatefulHealthCheck) Check(ctx context.Context) health.CheckResult {
	healthy := performCheck()

	if !healthy {
		s.consecutiveFailures++
	} else {
		s.consecutiveFailures = 0
	}

	// Only report unhealthy after multiple failures
	if s.consecutiveFailures >= s.maxFailures {
		return health.CheckResult{
			Healthy: false,
			Message: fmt.Sprintf("Failed %d consecutive checks", s.consecutiveFailures),
			Details: map[string]interface{}{
				"consecutive_failures": s.consecutiveFailures,
				"threshold":            s.maxFailures,
			},
			Timestamp: time.Now(),
		}
	}

	return health.CheckResult{
		Healthy:   true,
		Message:   "Check passed",
		Timestamp: time.Now(),
	}
}
```

## Monitoring and Alerting

### Prometheus Integration

Health check results are automatically exposed as metrics:

```promql
# Health check success rate
rate(chorus_health_checks_passed_total[5m]) /
(rate(chorus_health_checks_passed_total[5m]) + rate(chorus_health_checks_failed_total[5m]))

# System health score
chorus_system_health_score

# Component health
chorus_component_health_score{component="dht"}
```

### Alert Rules

```yaml
groups:
  - name: health_alerts
    interval: 30s
    rules:
      - alert: HealthCheckFailing
        # increase() is used because a raw counter stays above 0 forever after the first failure
        expr: increase(chorus_health_checks_failed_total{check_name="database-connectivity"}[5m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Critical health check failing"
          description: "{{ $labels.check_name }} has been failing for 5 minutes"

      - alert: LowSystemHealth
        expr: chorus_system_health_score < 0.75
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low system health score"
          description: "System health score: {{ $value }}"

      - alert: ComponentDegraded
        expr: chorus_component_health_score < 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Component {{ $labels.component }} degraded"
          description: "Health score: {{ $value }}"
```

## Troubleshooting

### Health Check Not Running

```bash
# Check if health manager is started
curl http://localhost:8081/health/checks

# Verify check is registered and enabled:
# its name should appear under "checks" with a recent "timestamp"

# Check application logs for errors
grep "health check" /var/log/chorus/chorus.log
```

### Health Check Timeouts

```go
// Increase timeout for slow operations
check.Timeout = 30 * time.Second

// Add timeout monitoring
Checker: func(ctx context.Context) health.CheckResult {
	deadline, ok := ctx.Deadline()
	if ok {
		log.Printf("Check deadline: %v (%.2fs remaining)",
			deadline, time.Until(deadline).Seconds())
	}
	// ... check logic
}
```

### False Positives

```go
// Add retry logic inside the Checker so a single transient failure
// does not flip the result (checkPasses is a placeholder for your check)
attempts := 3
for i := 0; i < attempts; i++ {
	if checkPasses() {
		return health.CheckResult{Healthy: true, Message: "Check passed", Timestamp: time.Now()}
	}
	if i < attempts-1 {
		time.Sleep(100 * time.Millisecond)
	}
}
return health.CheckResult{Healthy: false, Message: "Check failed after retries", Timestamp: time.Now()}
```

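The retry loop above sleeps unconditionally between attempts, so it keeps running even after the check's context has been cancelled. A context-aware variant is sketched below as one possible refinement; `retry`, `flaky`, and `demoRetry` are hypothetical helpers, not part of `pkg/health`.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// retry calls check up to attempts times, pausing between tries, and gives up
// early if ctx is cancelled during a pause.
func retry(ctx context.Context, attempts int, pause time.Duration, check func() bool) bool {
	for i := 0; i < attempts; i++ {
		if check() {
			return true
		}
		if i == attempts-1 {
			break // no pause after the final attempt
		}
		select {
		case <-time.After(pause):
		case <-ctx.Done():
			return false
		}
	}
	return false
}

// flaky returns a check that fails its first n calls and then succeeds.
func flaky(n int) func() bool {
	calls := 0
	return func() bool {
		calls++
		return calls > n
	}
}

// demoRetry runs three attempts against a check that fails `failures` times.
func demoRetry(failures int) bool {
	return retry(context.Background(), 3, 10*time.Millisecond, flaky(failures))
}

func main() {
	fmt.Println(demoRetry(2)) // prints "true": the third attempt succeeds
	fmt.Println(demoRetry(3)) // prints "false": all three attempts fail
}
```

Keep the total retry time (attempts multiplied by the pause, plus the check's own duration) below the check's `Timeout`, otherwise the retries themselves can cause a timeout failure.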
### High Memory Usage

```go
// Limit check history
config := health.DefaultHealthConfig()
config.MaxHistoryEntries = 500                   // Reduce from default 1000
config.HistoryCleanupInterval = 30 * time.Minute // More frequent cleanup
```

## Testing

### Unit Testing Health Checks

```go
import (
	"context"
	"testing"
	"time"

	"chorus/pkg/health"

	// assert is assumed to be testify's assert package
	"github.com/stretchr/testify/assert"
)

func TestHealthCheck(t *testing.T) {
	// Create check
	check := &health.HealthCheck{
		Name:    "test-check",
		Enabled: true,
		Timeout: 5 * time.Second,
		Checker: func(ctx context.Context) health.CheckResult {
			// Test logic
			return health.CheckResult{
				Healthy:   true,
				Message:   "Test passed",
				Timestamp: time.Now(),
			}
		},
	}

	// Execute check
	ctx := context.Background()
	result := check.Checker(ctx)

	// Verify result
	assert.True(t, result.Healthy)
	assert.Equal(t, "Test passed", result.Message)
}
```

### Integration Testing with Mocks

```go
func TestHealthManager(t *testing.T) {
	// testLogger is a test double that implements health.Logger (not shown)
	logger := &testLogger{}
	manager := health.NewManager("test-node", "v1.0.0", logger)

	// Register mock check
	mockCheck := &health.HealthCheck{
		Name:     "mock-check",
		Enabled:  true,
		Interval: 1 * time.Second,
		Timeout:  500 * time.Millisecond,
		Checker: func(ctx context.Context) health.CheckResult {
			return health.CheckResult{
				Healthy:   true,
				Message:   "Mock check passed",
				Timestamp: time.Now(),
			}
		},
	}
	manager.RegisterCheck(mockCheck)

	// Start manager
	err := manager.Start()
	assert.NoError(t, err)

	// Wait for check execution (two intervals)
	time.Sleep(2 * time.Second)

	// Verify status
	status := manager.GetStatus()
	assert.Equal(t, health.StatusHealthy, status.Status)
	assert.Contains(t, status.Checks, "mock-check")

	// Stop manager
	err = manager.Stop()
	assert.NoError(t, err)
}
```

## Related Documentation

- [Metrics Package Documentation](./metrics.md)
- [Shutdown Package Documentation](./shutdown.md)
- [Kubernetes Health Probes](https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/)
- [CHORUS Election System](../modules/election.md)
- [CHORUS DHT System](../modules/dht.md)