CHORUS Health Package
Overview
The pkg/health package provides comprehensive health monitoring and readiness/liveness probe capabilities for the CHORUS distributed system. It orchestrates health checks across all system components, integrates with graceful shutdown, and exposes Kubernetes-compatible health endpoints.
Architecture
Core Components
- Manager: Central health check orchestration and HTTP endpoint management
- HealthCheck: Individual health check definitions with configurable intervals
- EnhancedHealthChecks: Advanced health monitoring with metrics and history
- Adapters: Integration layer for PubSub, DHT, and other subsystems
- SystemStatus: Aggregated health status representation
Health Check Types
- Critical Checks: Failures trigger graceful shutdown
- Non-Critical Checks: Failures degrade health status but don't trigger shutdown
- Active Probes: Synthetic tests that verify end-to-end functionality
- Passive Checks: Monitor existing system state without creating load
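To make the active/passive distinction concrete, here is a minimal sketch of a passive, non-critical check that only reads in-process state and therefore adds no network or disk load. The goroutine count is an illustrative stand-in for any existing gauge, and the threshold is hypothetical, not a CHORUS default; registration uses the manager described below and assumes the context, fmt, runtime, and time imports used elsewhere on this page.
goroutineCheck := &health.HealthCheck{
    Name:        "goroutine-count",
    Description: "Passive check on runtime goroutine count",
    Enabled:     true,
    Critical:    false, // failure only degrades status, never triggers shutdown
    Interval:    30 * time.Second,
    Timeout:     time.Second,
    Checker: func(ctx context.Context) health.CheckResult {
        n := runtime.NumGoroutine() // read existing state; no load generated
        return health.CheckResult{
            Healthy:   n < 10000, // illustrative threshold
            Message:   fmt.Sprintf("%d goroutines running", n),
            Details:   map[string]interface{}{"goroutines": n},
            Timestamp: time.Now(),
        }
    },
}
manager.RegisterCheck(goroutineCheck)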
Core Types
HealthCheck
type HealthCheck struct {
Name string // Unique check identifier
Description string // Human-readable description
Checker func(ctx context.Context) CheckResult // Check execution function
Interval time.Duration // Check frequency (default: 30s)
Timeout time.Duration // Check timeout (default: 10s)
Enabled bool // Enable/disable check
Critical bool // If true, failure triggers shutdown
LastRun time.Time // Timestamp of last execution
LastResult *CheckResult // Most recent check result
}
CheckResult
type CheckResult struct {
Healthy bool // Check passed/failed
Message string // Human-readable result message
Details map[string]interface{} // Additional structured information
Latency time.Duration // Check execution time
Timestamp time.Time // Result timestamp
Error error // Error details if check failed
}
SystemStatus
type SystemStatus struct {
Status Status // Overall status enum
Message string // Status description
Checks map[string]*CheckResult // All check results
Uptime time.Duration // System uptime
StartTime time.Time // System start timestamp
LastUpdate time.Time // Last status update
Version string // CHORUS version
NodeID string // Node identifier
}
Status Levels
const (
StatusHealthy Status = "healthy" // All checks passing
StatusDegraded Status = "degraded" // Some non-critical checks failing
StatusUnhealthy Status = "unhealthy" // Critical checks failing
StatusStarting Status = "starting" // System initializing
StatusStopping Status = "stopping" // Graceful shutdown in progress
)
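The endpoint sections below map these levels to HTTP response codes. A hypothetical helper capturing the mapping used by /health and /health/ready (healthy and degraded serve 200, everything else 503) might look like this; note the liveness endpoint is more permissive, failing only while stopping.
// Hypothetical helper mirroring the /health and /health/ready behavior
// documented below; not part of pkg/health itself.
func httpStatusFor(s health.Status) int {
    switch s {
    case health.StatusHealthy, health.StatusDegraded:
        return http.StatusOK // still serving traffic, possibly degraded
    default: // unhealthy, starting, stopping
        return http.StatusServiceUnavailable
    }
}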
Manager
Initialization
import "chorus/pkg/health"
// Create health manager
logger := yourLogger // Implements health.Logger interface
manager := health.NewManager("node-123", "v1.0.0", logger)
// Connect to shutdown manager for critical failures
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
manager.SetShutdownManager(shutdownMgr)
Registration System
// Register a health check
check := &health.HealthCheck{
Name: "database-connectivity",
Description: "PostgreSQL database connectivity check",
Enabled: true,
Critical: true, // Failure triggers shutdown
Interval: 30 * time.Second,
Timeout: 10 * time.Second,
Checker: func(ctx context.Context) health.CheckResult {
// Perform health check
err := db.PingContext(ctx)
if err != nil {
return health.CheckResult{
Healthy: false,
Message: fmt.Sprintf("Database ping failed: %v", err),
Error: err,
Timestamp: time.Now(),
}
}
return health.CheckResult{
Healthy: true,
Message: "Database connectivity OK",
Timestamp: time.Now(),
}
},
}
manager.RegisterCheck(check)
// Unregister when no longer needed
manager.UnregisterCheck("database-connectivity")
Lifecycle Management
// Start health monitoring
if err := manager.Start(); err != nil {
log.Fatalf("Failed to start health manager: %v", err)
}
// Start HTTP server for health endpoints
if err := manager.StartHTTPServer(8081); err != nil {
log.Fatalf("Failed to start health HTTP server: %v", err)
}
// ... application runs ...
// Stop health monitoring during shutdown
if err := manager.Stop(); err != nil {
log.Printf("Error stopping health manager: %v", err)
}
HTTP Endpoints
/health - Overall Health Status
Method: GET
Description: Returns comprehensive system health status
Response Codes:
- 200 OK: System is healthy or degraded
- 503 Service Unavailable: System is unhealthy, starting, or stopping
Response Schema:
{
"status": "healthy",
"message": "All health checks passing",
"checks": {
"database-connectivity": {
"healthy": true,
"message": "Database connectivity OK",
"latency": 15000000,
"timestamp": "2025-09-30T10:30:00Z"
},
"p2p-connectivity": {
"healthy": true,
"message": "5 peers connected",
"details": {
"connected_peers": 5,
"min_peers": 3
},
"latency": 8000000,
"timestamp": "2025-09-30T10:30:05Z"
}
},
"uptime": 86400000000000,
"start_time": "2025-09-29T10:30:00Z",
"last_update": "2025-09-30T10:30:05Z",
"version": "v1.0.0",
"node_id": "node-123"
}
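A minimal Go client sketch for polling this endpoint; the struct below is a local convenience type mirroring a few fields of the JSON above, not something exported by pkg/health.
// Local convenience type; decode only the fields you need.
type healthResponse struct {
    Status  string `json:"status"`
    Message string `json:"message"`
    NodeID  string `json:"node_id"`
}

func fetchHealth(url string) (*healthResponse, error) {
    resp, err := http.Get(url)
    if err != nil {
        return nil, err
    }
    defer resp.Body.Close()
    // Note: a 503 still carries a JSON body describing the unhealthy state.
    var hr healthResponse
    if err := json.NewDecoder(resp.Body).Decode(&hr); err != nil {
        return nil, err
    }
    return &hr, nil
}

// Usage:
// status, err := fetchHealth("http://localhost:8081/health")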
/health/ready - Readiness Probe
Method: GET
Description: Kubernetes readiness probe - indicates if node can handle requests
Response Codes:
- 200 OK: Node is ready (healthy or degraded)
- 503 Service Unavailable: Node is not ready
Response Schema:
{
"ready": true,
"status": "healthy",
"message": "All health checks passing"
}
Usage: Configure as a Kubernetes readiness probe to control traffic routing:
readinessProbe:
httpGet:
path: /health/ready
port: 8081
initialDelaySeconds: 10
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 3
/health/live - Liveness Probe
Method: GET
Description: Kubernetes liveness probe - indicates if node is alive
Response Codes:
- 200 OK: Process is alive (not stopping)
- 503 Service Unavailable: Process is stopping
Response Schema:
{
"live": true,
"status": "healthy",
"uptime": "24h0m0s"
}
Usage: Configure as a Kubernetes liveness probe so unhealthy pods are restarted:
livenessProbe:
httpGet:
path: /health/live
port: 8081
initialDelaySeconds: 30
periodSeconds: 10
timeoutSeconds: 5
failureThreshold: 3
/health/checks - Detailed Check Results
Method: GET
Description: Returns detailed results for all registered health checks
Response Schema:
{
"checks": {
"database-connectivity": {
"healthy": true,
"message": "Database connectivity OK",
"latency": 15000000,
"timestamp": "2025-09-30T10:30:00Z"
},
"p2p-connectivity": {
"healthy": true,
"message": "5 peers connected",
"details": {
"connected_peers": 5,
"min_peers": 3
},
"latency": 8000000,
"timestamp": "2025-09-30T10:30:05Z"
}
},
"total": 2,
"timestamp": "2025-09-30T10:30:10Z"
}
Built-in Health Checks
Database Connectivity Check
check := health.CreateDatabaseCheck("primary-db", func() error {
return db.Ping()
})
manager.RegisterCheck(check)
Properties:
- Critical: Yes
- Interval: 30 seconds
- Timeout: 10 seconds
- Checks: Database ping/connectivity
Disk Space Check
check := health.CreateDiskSpaceCheck("/var/lib/CHORUS", 0.90) // Alert at 90%
manager.RegisterCheck(check)
Properties:
- Critical: No (warning only)
- Interval: 60 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 90%)
Memory Usage Check
check := health.CreateMemoryCheck(0.85) // Alert at 85%
manager.RegisterCheck(check)
Properties:
- Critical: No (warning only)
- Interval: 30 seconds
- Timeout: 5 seconds
- Threshold: Configurable (e.g., 85%)
Active PubSub Check
adapter := health.NewPubSubAdapter(pubsubInstance)
check := health.CreateActivePubSubCheck(adapter)
manager.RegisterCheck(check)
Properties:
- Critical: No
- Interval: 60 seconds
- Timeout: 15 seconds
- Test: Publish/subscribe loopback with unique message
- Validates: End-to-end PubSub functionality
Test Flow:
- Subscribe to test topic CHORUS/health-test/v1
- Publish unique test message with timestamp
- Wait for message receipt (max 10 seconds)
- Verify message integrity
- Report success or timeout
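A sketch approximating this flow against the PubSubInterface described in the Adapter System section below; it is illustrative, not the package's internal implementation, and assumes payloads round-trip as raw bytes.
func loopbackProbe(ctx context.Context, ps health.PubSubInterface) health.CheckResult {
    start := time.Now()
    token := fmt.Sprintf("health-%d", time.Now().UnixNano()) // unique per probe
    received := make(chan struct{}, 1)

    err := ps.SubscribeToTopic("CHORUS/health-test/v1", func(data []byte) {
        if string(data) == token { // verify message integrity
            select {
            case received <- struct{}{}:
            default:
            }
        }
    })
    if err != nil {
        return health.CheckResult{Healthy: false, Message: "subscribe failed", Error: err, Timestamp: time.Now()}
    }
    if err := ps.PublishToTopic("CHORUS/health-test/v1", []byte(token)); err != nil {
        return health.CheckResult{Healthy: false, Message: "publish failed", Error: err, Timestamp: time.Now()}
    }
    select {
    case <-received:
        return health.CheckResult{Healthy: true, Message: "loopback OK", Latency: time.Since(start), Timestamp: time.Now()}
    case <-time.After(10 * time.Second): // max receipt wait from the flow above
        return health.CheckResult{Healthy: false, Message: "loopback timeout", Timestamp: time.Now()}
    case <-ctx.Done():
        return health.CheckResult{Healthy: false, Message: "probe cancelled", Error: ctx.Err(), Timestamp: time.Now()}
    }
}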
Active DHT Check
adapter := health.NewDHTAdapter(dhtInstance)
check := health.CreateActiveDHTCheck(adapter)
manager.RegisterCheck(check)
Properties:
- Critical: No
- Interval: 90 seconds
- Timeout: 20 seconds
- Test: Put/get operation with unique key
- Validates: DHT storage and retrieval integrity
Test Flow:
- Generate unique test key and value
- Perform DHT put operation
- Wait for propagation (100ms)
- Perform DHT get operation
- Verify retrieved value matches original
- Report success, failure, or integrity violation
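The same flow can be sketched against the DHTInterface shown in the Adapter System section below; again illustrative rather than the built-in check's exact internals, and assuming a bytes import alongside the usual ones.
func dhtProbe(ctx context.Context, d health.DHTInterface) health.CheckResult {
    start := time.Now()
    key := fmt.Sprintf("health-test-%d", time.Now().UnixNano()) // unique test key
    want := []byte("chorus-dht-probe")

    if err := d.PutValue(ctx, key, want); err != nil {
        return health.CheckResult{Healthy: false, Message: "DHT put failed", Error: err, Timestamp: time.Now()}
    }
    time.Sleep(100 * time.Millisecond) // propagation delay from the flow above

    got, err := d.GetValue(ctx, key)
    if err != nil {
        return health.CheckResult{Healthy: false, Message: "DHT get failed", Error: err, Timestamp: time.Now()}
    }
    if !bytes.Equal(got, want) { // integrity verification
        return health.CheckResult{Healthy: false, Message: "DHT integrity violation", Timestamp: time.Now()}
    }
    return health.CheckResult{Healthy: true, Message: "DHT put/get OK", Latency: time.Since(start), Timestamp: time.Now()}
}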
Enhanced Health Checks
Overview
The EnhancedHealthChecks system provides advanced monitoring with metrics tracking, historical data, and comprehensive component health scoring.
Initialization
import "chorus/pkg/health"
// Create enhanced health monitoring
enhanced := health.NewEnhancedHealthChecks(
manager, // Health manager
electionMgr, // Election manager
dhtInstance, // DHT instance
pubsubInstance, // PubSub instance
replicationMgr, // Replication manager
logger, // Logger
)
// Enhanced checks are automatically registered
Configuration
type HealthConfig struct {
// Active probe intervals
PubSubProbeInterval time.Duration // Default: 30s
DHTProbeInterval time.Duration // Default: 60s
ElectionProbeInterval time.Duration // Default: 15s
// Probe timeouts
PubSubProbeTimeout time.Duration // Default: 10s
DHTProbeTimeout time.Duration // Default: 20s
ElectionProbeTimeout time.Duration // Default: 5s
// Thresholds
MaxFailedProbes int // Default: 3
HealthyThreshold float64 // Default: 0.95
DegradedThreshold float64 // Default: 0.75
// History retention
MaxHistoryEntries int // Default: 1000
HistoryCleanupInterval time.Duration // Default: 1h
// Enable/disable specific checks
EnablePubSubProbes bool // Default: true
EnableDHTProbes bool // Default: true
EnableElectionProbes bool // Default: true
EnableReplicationProbes bool // Default: true
}
// Use custom configuration
config := health.DefaultHealthConfig()
config.PubSubProbeInterval = 45 * time.Second
config.HealthyThreshold = 0.98
enhanced.config = config
Enhanced Health Checks Registered
1. Enhanced PubSub Check
- Name: pubsub-enhanced
- Critical: Yes
- Interval: Configurable (default: 30s)
- Features:
- Loopback message testing
- Success rate tracking
- Consecutive failure counting
- Latency measurement
- Health score calculation
2. Enhanced DHT Check
- Name: dht-enhanced
- Critical: Yes
- Interval: Configurable (default: 60s)
- Features:
- Put/get operation testing
- Data integrity verification
- Replication health monitoring
- Success rate tracking
- Latency measurement
3. Election Health Check
- Name: election-health
- Critical: No
- Interval: Configurable (default: 15s)
- Features:
- Election state monitoring
- Heartbeat status tracking
- Leadership stability calculation
- Admin uptime tracking
4. Replication Health Check
- Name: replication-health
- Critical: No
- Interval: 120 seconds
- Features:
- Replication metrics monitoring
- Failure rate tracking
- Average replication factor
- Provider record counting
5. P2P Connectivity Check
- Name: p2p-connectivity
- Critical: Yes
- Interval: 30 seconds
- Features:
- Connected peer counting
- Minimum peer threshold validation
- Connectivity score calculation
6. Resource Health Check
- Name: resource-health
- Critical: No
- Interval: 60 seconds
- Features:
- CPU usage monitoring
- Memory usage monitoring
- Disk usage monitoring
- Threshold-based alerting
7. Task Manager Check
- Name: task-manager
- Critical: No
- Interval: 30 seconds
- Features:
- Active task counting
- Queue depth monitoring
- Task success rate tracking
- Capacity monitoring
Health Metrics
type HealthMetrics struct {
// Overall system health
SystemHealthScore float64 // 0.0-1.0
LastFullHealthCheck time.Time
TotalHealthChecks int64
FailedHealthChecks int64
// PubSub metrics
PubSubHealthScore float64
PubSubProbeLatency time.Duration
PubSubSuccessRate float64
PubSubLastSuccess time.Time
PubSubConsecutiveFails int
// DHT metrics
DHTHealthScore float64
DHTProbeLatency time.Duration
DHTSuccessRate float64
DHTLastSuccess time.Time
DHTConsecutiveFails int
DHTReplicationStatus map[string]*ReplicationStatus
// Election metrics
ElectionHealthScore float64
ElectionStability float64
HeartbeatLatency time.Duration
LeadershipChanges int64
LastLeadershipChange time.Time
AdminUptime time.Duration
// Network metrics
P2PConnectedPeers int
P2PConnectivityScore float64
NetworkLatency time.Duration
// Resource metrics
CPUUsage float64
MemoryUsage float64
DiskUsage float64
// Service metrics
ActiveTasks int
QueuedTasks int
TaskSuccessRate float64
}
// Access metrics
metrics := enhanced.GetHealthMetrics()
fmt.Printf("System Health Score: %.2f\n", metrics.SystemHealthScore)
fmt.Printf("PubSub Health: %.2f\n", metrics.PubSubHealthScore)
fmt.Printf("DHT Health: %.2f\n", metrics.DHTHealthScore)
Health Summary
summary := enhanced.GetHealthSummary()
// Returns:
// {
// "status": "healthy",
// "overall_score": 0.96,
// "last_check": "2025-09-30T10:30:00Z",
// "total_checks": 1523,
// "component_scores": {
// "pubsub": 0.98,
// "dht": 0.95,
// "election": 0.92,
// "p2p": 1.0
// },
// "key_metrics": {
// "connected_peers": 5,
// "active_tasks": 3,
// "admin_uptime": "2h30m15s",
// "leadership_changes": 2,
// "resource_utilization": {
// "cpu": 0.45,
// "memory": 0.62,
// "disk": 0.73
// }
// }
// }
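Since the summary is plain JSON-shaped data, it can be exposed on an internal endpoint for dashboards or debugging; a brief sketch, assuming the returned value is JSON-serializable as the shape above suggests, and that the /debug path is your own choice rather than a CHORUS convention.
// Illustrative: expose the enhanced summary on an internal endpoint.
http.HandleFunc("/debug/health-summary", func(w http.ResponseWriter, r *http.Request) {
    w.Header().Set("Content-Type", "application/json")
    _ = json.NewEncoder(w).Encode(enhanced.GetHealthSummary())
})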
Adapter System
PubSub Adapter
Adapts the CHORUS PubSub system to the health check interface:
type PubSubInterface interface {
SubscribeToTopic(topic string, handler func([]byte)) error
PublishToTopic(topic string, data interface{}) error
}
// Create adapter
adapter := health.NewPubSubAdapter(pubsubInstance)
// Use in health checks
check := health.CreateActivePubSubCheck(adapter)
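Because the interface is only two methods, any transport can back the active check. Below is a hypothetical in-process adapter over a handler map; the type, its locking, and its payload normalization are illustrative, not part of pkg/health.
// Hypothetical in-process adapter satisfying PubSubInterface.
type chanBusAdapter struct {
    mu       sync.Mutex
    handlers map[string][]func([]byte)
}

func newChanBusAdapter() *chanBusAdapter {
    return &chanBusAdapter{handlers: make(map[string][]func([]byte))}
}

func (a *chanBusAdapter) SubscribeToTopic(topic string, handler func([]byte)) error {
    a.mu.Lock()
    defer a.mu.Unlock()
    a.handlers[topic] = append(a.handlers[topic], handler)
    return nil
}

func (a *chanBusAdapter) PublishToTopic(topic string, data interface{}) error {
    payload, ok := data.([]byte)
    if !ok {
        b, err := json.Marshal(data) // normalize non-byte payloads
        if err != nil {
            return err
        }
        payload = b
    }
    a.mu.Lock()
    defer a.mu.Unlock()
    for _, h := range a.handlers[topic] {
        go h(payload) // deliver asynchronously, like a real bus
    }
    return nil
}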
DHT Adapter
Adapts various DHT implementations to the health check interface:
type DHTInterface interface {
PutValue(ctx context.Context, key string, value []byte) error
GetValue(ctx context.Context, key string) ([]byte, error)
}
// Create adapter (supports multiple DHT types)
adapter := health.NewDHTAdapter(dhtInstance)
// Use in health checks
check := health.CreateActiveDHTCheck(adapter)
Supported DHT Types:
- *dht.LibP2PDHT
- *dht.MockDHTInterface
- *dht.EncryptedDHTStorage
Mock Adapters
For testing without real infrastructure:
// Mock PubSub
mockPubSub := health.NewMockPubSubAdapter()
check := health.CreateActivePubSubCheck(mockPubSub)
// Mock DHT
mockDHT := health.NewMockDHTAdapter()
check := health.CreateActiveDHTCheck(mockDHT)
Integration with Graceful Shutdown
Critical Health Check Failures
When a critical health check fails, the health manager can trigger graceful shutdown:
// Connect managers
healthMgr.SetShutdownManager(shutdownMgr)
// Register critical check
criticalCheck := &health.HealthCheck{
Name: "database-connectivity",
Critical: true, // Failure triggers shutdown
Checker: func(ctx context.Context) health.CheckResult {
// Check logic
},
}
healthMgr.RegisterCheck(criticalCheck)
// If check fails, shutdown is automatically initiated
Shutdown Integration Example
import (
"chorus/pkg/health"
"chorus/pkg/shutdown"
)
// Create managers
shutdownMgr := shutdown.NewManager(30*time.Second, logger)
healthMgr := health.NewManager("node-123", "v1.0.0", logger)
healthMgr.SetShutdownManager(shutdownMgr)
// Register health manager for shutdown
healthComponent := shutdown.NewGenericComponent("health-manager", 10, true).
SetShutdownFunc(func(ctx context.Context) error {
return healthMgr.Stop()
})
shutdownMgr.Register(healthComponent)
// Add pre-shutdown hook to update health status
shutdownMgr.AddHook(shutdown.PhasePreShutdown, func(ctx context.Context) error {
status := healthMgr.GetStatus()
status.Status = health.StatusStopping
status.Message = "System is shutting down"
return nil
})
// Start systems
healthMgr.Start()
healthMgr.StartHTTPServer(8081)
shutdownMgr.Start()
// Wait for shutdown
shutdownMgr.Wait()
Health Check Best Practices
Design Principles
- Fast Execution: Health checks should complete quickly (< 5 seconds typical)
- Idempotent: Checks should be safe to run repeatedly without side effects
- Isolated: Checks should not depend on other checks
- Meaningful: Checks should validate actual functionality, not just existence
- Critical vs Warning: Reserve critical status for failures that prevent core functionality
Check Intervals
// Critical infrastructure: Check frequently
databaseCheck.Interval = 15 * time.Second
// Expensive operations: Check less frequently
replicationCheck.Interval = 120 * time.Second
// Active probes: Balance thoroughness with overhead
pubsubProbe.Interval = 60 * time.Second
Timeout Configuration
// Fast checks: Short timeout
connectivityCheck.Timeout = 5 * time.Second
// Network operations: Longer timeout
dhtProbe.Timeout = 20 * time.Second
// Complex operations: Generous timeout
systemCheck.Timeout = 30 * time.Second
Critical Check Guidelines
Mark a check as critical when:
- Failure prevents core system functionality
- Continued operation would cause data corruption
- User-facing services become unavailable
- System cannot safely recover automatically
Do NOT mark as critical when:
- Failure is temporary or transient
- System can operate in degraded mode
- Alternative mechanisms exist
- Recovery is possible without restart
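Applied to the guidelines above, the same probe logic can be registered either way depending on deployment; a brief sketch, where node.HasFallbackStore and checkDHTReachability are hypothetical names.
// Criticality depends on deployment: with a fallback store available,
// a DHT outage is degraded mode rather than grounds for shutdown.
dhtReachability := &health.HealthCheck{
    Name:     "dht-reachability",
    Enabled:  true,
    Critical: !node.HasFallbackStore, // hypothetical deployment flag
    Interval: 30 * time.Second,
    Timeout:  10 * time.Second,
    Checker:  checkDHTReachability, // func(ctx context.Context) health.CheckResult
}
manager.RegisterCheck(dhtReachability)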
Error Handling
Checker: func(ctx context.Context) health.CheckResult {
// Handle context cancellation
select {
case <-ctx.Done():
return health.CheckResult{
Healthy: false,
Message: "Check cancelled",
Error: ctx.Err(),
Timestamp: time.Now(),
}
default:
}
// Perform check with timeout
result := make(chan error, 1)
go func() {
result <- performCheck()
}()
select {
case err := <-result:
if err != nil {
return health.CheckResult{
Healthy: false,
Message: fmt.Sprintf("Check failed: %v", err),
Error: err,
Timestamp: time.Now(),
}
}
return health.CheckResult{
Healthy: true,
Message: "Check passed",
Timestamp: time.Now(),
}
case <-ctx.Done():
return health.CheckResult{
Healthy: false,
Message: "Check timeout",
Error: ctx.Err(),
Timestamp: time.Now(),
}
}
}
Detailed Results
Provide structured details for debugging:
return health.CheckResult{
Healthy: true,
Message: "P2P network healthy",
Details: map[string]interface{}{
"connected_peers": 5,
"min_peers": 3,
"max_peers": 20,
"current_usage": "25%",
"peer_quality": 0.85,
"network_latency": "50ms",
},
Timestamp: time.Now(),
}
Custom Health Checks
Simple Health Check
simpleCheck := &health.HealthCheck{
Name: "my-service",
Description: "Custom service health check",
Enabled: true,
Critical: false,
Interval: 30 * time.Second,
Timeout: 10 * time.Second,
Checker: func(ctx context.Context) health.CheckResult {
// Your check logic
healthy := checkMyService()
return health.CheckResult{
Healthy: healthy,
Message: "Service status",
Timestamp: time.Now(),
}
},
}
manager.RegisterCheck(simpleCheck)
Health Check with Metrics
type ServiceHealthCheck struct {
service MyService
metricsCollector *metrics.CHORUSMetrics
}
func (s *ServiceHealthCheck) Check(ctx context.Context) health.CheckResult {
start := time.Now()
err := s.service.Ping(ctx)
latency := time.Since(start)
if err != nil {
s.metricsCollector.IncrementHealthCheckFailed("my-service", err.Error())
return health.CheckResult{
Healthy: false,
Message: fmt.Sprintf("Service unavailable: %v", err),
Error: err,
Latency: latency,
Timestamp: time.Now(),
}
}
s.metricsCollector.IncrementHealthCheckPassed("my-service")
return health.CheckResult{
Healthy: true,
Message: "Service available",
Latency: latency,
Timestamp: time.Now(),
Details: map[string]interface{}{
"latency_ms": latency.Milliseconds(),
"version": s.service.Version(),
},
}
}
Stateful Health Check
type StatefulHealthCheck struct {
consecutiveFailures int
maxFailures int
}
func (s *StatefulHealthCheck) Check(ctx context.Context) health.CheckResult {
healthy := performCheck()
if !healthy {
s.consecutiveFailures++
} else {
s.consecutiveFailures = 0
}
// Only report unhealthy after multiple failures
if s.consecutiveFailures >= s.maxFailures {
return health.CheckResult{
Healthy: false,
Message: fmt.Sprintf("Failed %d consecutive checks", s.consecutiveFailures),
Details: map[string]interface{}{
"consecutive_failures": s.consecutiveFailures,
"threshold": s.maxFailures,
},
Timestamp: time.Now(),
}
}
return health.CheckResult{
Healthy: true,
Message: "Check passed",
Timestamp: time.Now(),
}
}
Monitoring and Alerting
Prometheus Integration
Health check results are automatically exposed as metrics:
# Health check success rate
rate(chorus_health_checks_passed_total[5m]) /
(rate(chorus_health_checks_passed_total[5m]) + rate(chorus_health_checks_failed_total[5m]))
# System health score
chorus_system_health_score
# Component health
chorus_component_health_score{component="dht"}
Alert Rules
groups:
- name: health_alerts
interval: 30s
rules:
- alert: HealthCheckFailing
expr: chorus_health_checks_failed_total{check_name="database-connectivity"} > 0
for: 5m
labels:
severity: critical
annotations:
summary: "Critical health check failing"
description: "{{ $labels.check_name }} has been failing for 5 minutes"
- alert: LowSystemHealth
expr: chorus_system_health_score < 0.75
for: 10m
labels:
severity: warning
annotations:
summary: "Low system health score"
description: "System health score: {{ $value }}"
- alert: ComponentDegraded
expr: chorus_component_health_score < 0.5
for: 15m
labels:
severity: warning
annotations:
summary: "Component {{ $labels.component }} degraded"
description: "Health score: {{ $value }}"
Troubleshooting
Health Check Not Running
# Check if health manager is started
curl http://localhost:8081/health/checks
# Verify check is registered and enabled
# Look for "enabled": true in response
# Check application logs for errors
grep "health check" /var/log/chorus/chorus.log
Health Check Timeouts
// Increase timeout for slow operations
check.Timeout = 30 * time.Second
// Add timeout monitoring
Checker: func(ctx context.Context) health.CheckResult {
deadline, ok := ctx.Deadline()
if ok {
log.Printf("Check deadline: %v (%.2fs remaining)",
deadline, time.Until(deadline).Seconds())
}
// ... check logic
}
False Positives
// Add retry logic
attempts := 3
for i := 0; i < attempts; i++ {
if checkPasses() {
return health.CheckResult{Healthy: true, ...}
}
if i < attempts-1 {
time.Sleep(100 * time.Millisecond)
}
}
return health.CheckResult{Healthy: false, ...}
High Memory Usage
// Limit check history
config := health.DefaultHealthConfig()
config.MaxHistoryEntries = 500 // Reduce from default 1000
config.HistoryCleanupInterval = 30 * time.Minute // More frequent cleanup
Testing
Unit Testing Health Checks
func TestHealthCheck(t *testing.T) {
// Create check
check := &health.HealthCheck{
Name: "test-check",
Enabled: true,
Timeout: 5 * time.Second,
Checker: func(ctx context.Context) health.CheckResult {
// Test logic
return health.CheckResult{
Healthy: true,
Message: "Test passed",
Timestamp: time.Now(),
}
},
}
// Execute check
ctx := context.Background()
result := check.Checker(ctx)
// Verify result
assert.True(t, result.Healthy)
assert.Equal(t, "Test passed", result.Message)
}
Integration Testing with Mocks
func TestHealthManager(t *testing.T) {
logger := &testLogger{}
manager := health.NewManager("test-node", "v1.0.0", logger)
// Register mock check
mockCheck := &health.HealthCheck{
Name: "mock-check",
Enabled: true,
Interval: 1 * time.Second,
Timeout: 500 * time.Millisecond,
Checker: func(ctx context.Context) health.CheckResult {
return health.CheckResult{
Healthy: true,
Message: "Mock check passed",
Timestamp: time.Now(),
}
},
}
manager.RegisterCheck(mockCheck)
// Start manager
err := manager.Start()
assert.NoError(t, err)
// Wait for check execution
time.Sleep(2 * time.Second)
// Verify status
status := manager.GetStatus()
assert.Equal(t, health.StatusHealthy, status.Status)
assert.Contains(t, status.Checks, "mock-check")
// Stop manager
err = manager.Stop()
assert.NoError(t, err)
}