BACKBEAT Pulse Service Implementation
Overview
This is the complete implementation of the BACKBEAT pulse service, built to the architectural requirements for CHORUS 2.0.0. The service provides foundational timing coordination for the distributed ecosystem, with production-grade leader election, hybrid logical clocks, and Prometheus-based observability.
Architecture
The implementation consists of several key components:
Core Components
- Leader Election System (internal/backbeat/leader.go)
  - Implements BACKBEAT-REQ-001 using HashiCorp Raft consensus
  - Pluggable strategy with automatic failover
  - Single BeatFrame publisher per cluster guarantee
- Hybrid Logical Clock (internal/backbeat/hlc.go)
  - Provides ordering guarantees for distributed events
  - Supports reconciliation after network partitions
  - Format: unix_ms_hex:logical_counter_hex:node_id_suffix
- BeatFrame Generator (cmd/pulse/main.go)
  - Implements BACKBEAT-REQ-002 (INT-A BeatFrame emission)
  - Publishes structured beat events to NATS
  - Includes HLC, beat_index, downbeat, phase, deadline_at, tempo_bpm
- Degradation Manager (internal/backbeat/degradation.go)
  - Implements BACKBEAT-REQ-003 (local tempo derivation)
  - Manages partition tolerance with drift monitoring
  - BACKBEAT-PER-003 compliance (≤1% drift over 1 hour)
- Admin API Server (internal/backbeat/admin.go)
  - HTTP endpoints for operational control
  - Tempo management with BACKBEAT-REQ-004 validation
  - Health checks, drift monitoring, leader status
- Metrics & Observability (internal/backbeat/metrics.go)
  - Prometheus metrics for all performance requirements
  - Comprehensive monitoring of timing accuracy
  - Performance requirement tracking
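The hybrid logical clock format listed above (unix_ms_hex:logical_counter_hex:node_id_suffix) can be illustrated with a minimal sketch. The type and method names (HLC, Tick, Observe), the four-digit counter padding, and the injected time source are assumptions made for illustration; the actual code in internal/backbeat/hlc.go may differ.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// HLC is a hypothetical hybrid logical clock, not the type from hlc.go.
type HLC struct {
	mu      sync.Mutex
	wallMS  int64        // highest physical time seen so far, in Unix milliseconds
	logical uint32       // counter that orders events within the same millisecond
	nodeID  string       // short node suffix appended to every timestamp
	nowMS   func() int64 // injected physical time source
}

func NewHLC(nodeID string, nowMS func() int64) *HLC {
	return &HLC{nodeID: nodeID, nowMS: nowMS}
}

// Tick advances the clock for a local event and returns the formatted stamp.
func (c *HLC) Tick() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if pt := c.nowMS(); pt > c.wallMS {
		c.wallMS, c.logical = pt, 0
	} else {
		c.logical++
	}
	return c.format()
}

// Observe merges a remote timestamp, for example after a partition heals.
func (c *HLC) Observe(remoteMS int64, remoteLogical uint32) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	pt := c.nowMS()
	switch {
	case pt > c.wallMS && pt > remoteMS: // physical time is strictly ahead
		c.wallMS, c.logical = pt, 0
	case remoteMS > c.wallMS: // remote clock is ahead of ours
		c.wallMS, c.logical = remoteMS, remoteLogical+1
	case c.wallMS > remoteMS: // our clock is ahead of the remote
		c.logical++
	default: // equal wall times: take the larger counter, then increment
		if remoteLogical > c.logical {
			c.logical = remoteLogical
		}
		c.logical++
	}
	return c.format()
}

func (c *HLC) format() string {
	return fmt.Sprintf("%x:%04x:%s", c.wallMS, c.logical, c.nodeID)
}

func main() {
	clock := NewHLC("abcd", func() int64 { return time.Now().UnixMilli() })
	fmt.Println(clock.Tick())
}
```

Injecting the time source keeps the sketch testable with a fixed clock; a production clock would also need to bound how far a remote timestamp may pull the local one forward.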
Requirements Implementation
BACKBEAT-REQ-001: Pulse Leader
✅ Implemented: Leader election using Raft consensus algorithm
- Single leader publishes BeatFrames per cluster
- Automatic failover with consistent leadership
- Pluggable strategy (currently Raft, extensible)
BACKBEAT-REQ-002: BeatFrame Emit
✅ Implemented: INT-A compliant BeatFrame publishing
{
  "type": "backbeat.beatframe.v1",
  "cluster_id": "string",
  "beat_index": 0,
  "downbeat": false,
  "phase": "plan",
  "hlc": "7ffd:0001:abcd",
  "deadline_at": "2025-09-04T12:00:00Z",
  "tempo_bpm": 120,
  "window_id": "deterministic_sha256_hash"
}
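A consumer might decode the payload shown above along these lines. This is a sketch only: the JSON keys come from the example, while the Go struct name, field names, field types, and the DecodeBeatFrame helper are assumptions rather than the service's actual definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// BeatFrame mirrors the JSON keys of the example payload.
// Field names and types are assumed, not taken from the service code.
type BeatFrame struct {
	Type       string    `json:"type"`
	ClusterID  string    `json:"cluster_id"`
	BeatIndex  int64     `json:"beat_index"`
	Downbeat   bool      `json:"downbeat"`
	Phase      string    `json:"phase"`
	HLC        string    `json:"hlc"`
	DeadlineAt time.Time `json:"deadline_at"` // parsed from RFC 3339
	TempoBPM   int       `json:"tempo_bpm"`
	WindowID   string    `json:"window_id"`
}

// DecodeBeatFrame parses a payload and rejects unexpected message types.
func DecodeBeatFrame(payload []byte) (BeatFrame, error) {
	var f BeatFrame
	if err := json.Unmarshal(payload, &f); err != nil {
		return BeatFrame{}, fmt.Errorf("decode beatframe: %w", err)
	}
	if f.Type != "backbeat.beatframe.v1" {
		return BeatFrame{}, fmt.Errorf("unexpected message type %q", f.Type)
	}
	return f, nil
}

func main() {
	payload := []byte(`{"type":"backbeat.beatframe.v1","cluster_id":"chorus-aus-01",
		"beat_index":8,"downbeat":true,"phase":"plan","hlc":"7ffd:0001:abcd",
		"deadline_at":"2025-09-04T12:00:00Z","tempo_bpm":120,
		"window_id":"deterministic_sha256_hash"}`)
	frame, err := DecodeBeatFrame(payload)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("beat %d phase=%s downbeat=%t deadline=%s\n",
		frame.BeatIndex, frame.Phase, frame.Downbeat, frame.DeadlineAt.Format(time.RFC3339))
}
```

Checking the type field first lets a consumer ignore future frame versions instead of misreading them.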
BACKBEAT-REQ-003: Degrade Local
✅ Implemented: Partition tolerance with local tempo derivation
- Followers maintain local timing when leader is lost
- HLC-based reconciliation when leader returns
- Drift monitoring and alerting
BACKBEAT-REQ-004: Tempo Change Rules
✅ Implemented: Downbeat-gated tempo changes with delta limits
- Changes only applied on next downbeat
- Each change is validated to stay within a ±10% delta
- Admin API with validation and scheduling
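The rules above can be sketched as a small controller that queues a validated change and applies it only on a downbeat. The names (TempoController, Request, OnBeat) are hypothetical, and the sketch assumes the ±10% limit is measured against the current tempo and that the min/max BPM bounds are enforced at the same point; the source does not state either detail.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrOutOfRange    = errors.New("tempo outside configured min/max BPM")
	ErrDeltaTooLarge = errors.New("tempo change exceeds 10% of current tempo")
)

// TempoController is a hypothetical holder for current and pending tempo.
// It assumes minBPM > 0, so pending == 0 can mean "no change queued".
type TempoController struct {
	current, pending int
	minBPM, maxBPM   int
}

func NewTempoController(bpm, minBPM, maxBPM int) *TempoController {
	return &TempoController{current: bpm, minBPM: minBPM, maxBPM: maxBPM}
}

// Request validates a tempo change and queues it for the next downbeat.
func (c *TempoController) Request(bpm int) error {
	if bpm < c.minBPM || bpm > c.maxBPM {
		return ErrOutOfRange
	}
	delta := bpm - c.current
	if delta < 0 {
		delta = -delta
	}
	// Integer form of |delta| / current > 0.10, avoiding float rounding.
	if delta*100 > c.current*10 {
		return ErrDeltaTooLarge
	}
	c.pending = bpm
	return nil
}

// OnBeat is called once per beat and returns the tempo in effect.
// A queued change takes effect only when the beat is a downbeat.
func (c *TempoController) OnBeat(downbeat bool) int {
	if downbeat && c.pending != 0 {
		c.current, c.pending = c.pending, 0
	}
	return c.current
}

func main() {
	tc := NewTempoController(12, 4, 24) // defaults from the command line options
	fmt.Println(tc.Request(13))         // accepted: 1 BPM is within 10% of 12
	fmt.Println(tc.OnBeat(false))       // mid-bar beat: tempo unchanged
	fmt.Println(tc.OnBeat(true))        // downbeat: queued tempo applied
	fmt.Println(tc.Request(20))         // rejected: delta of 7 exceeds 10% of 13
}
```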
BACKBEAT-REQ-005: Window ID
✅ Implemented: Deterministic window ID generation
window_id = hex(sha256(cluster_id + ":" + downbeat_beat_index))[0:32]
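The formula above translates to Go roughly as follows. The function name WindowID is hypothetical, and the sketch assumes the downbeat beat index is rendered as a base-10 string before hashing, which the formula does not specify.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// WindowID returns the first 32 hex characters of
// sha256(cluster_id + ":" + downbeat_beat_index).
// Assumption: the beat index is formatted in decimal.
func WindowID(clusterID string, downbeatBeatIndex int64) string {
	input := clusterID + ":" + strconv.FormatInt(downbeatBeatIndex, 10)
	sum := sha256.Sum256([]byte(input))
	return hex.EncodeToString(sum[:])[:32]
}

func main() {
	// Every node that agrees on the cluster ID and the downbeat index
	// derives the same window ID without any coordination.
	fmt.Println(WindowID("chorus-aus-01", 0))
	fmt.Println(WindowID("chorus-aus-01", 8))
}
```

Truncating to 32 hex characters keeps 128 bits of the digest.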
Performance Requirements
BACKBEAT-PER-001: End-to-End Delivery
✅ Target: p95 ≤ 100ms at 2Hz
- Comprehensive latency monitoring
- NATS optimization for low latency
- Metrics: backbeat_beat_delivery_latency_seconds
BACKBEAT-PER-002: Pulse Jitter
✅ Target: p95 ≤ 20ms
- High-resolution timing measurement
- Jitter calculation and monitoring
- Metrics: backbeat_pulse_jitter_seconds
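One way to check the p95 target is sketched below. It assumes jitter is defined as the absolute difference between each observed beat interval and the expected interval, and it uses a nearest-rank percentile; the service's own definition and its Prometheus histogram may differ. JitterTracker and its methods are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// JitterTracker is a hypothetical helper that records per-beat jitter.
type JitterTracker struct {
	expected time.Duration // nominal interval between beats
	last     time.Duration // offset of the previous beat
	started  bool
	samples  []time.Duration
}

func NewJitterTracker(expected time.Duration) *JitterTracker {
	return &JitterTracker{expected: expected}
}

// Observe records a beat at a monotonic offset from service start,
// for example the value of time.Since(start).
func (j *JitterTracker) Observe(at time.Duration) {
	if j.started {
		d := (at - j.last) - j.expected
		if d < 0 {
			d = -d
		}
		j.samples = append(j.samples, d)
	}
	j.last, j.started = at, true
}

// P95 returns the nearest-rank 95th percentile of the recorded jitter.
func (j *JitterTracker) P95() time.Duration {
	n := len(j.samples)
	if n == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), j.samples...)
	sort.Slice(sorted, func(a, b int) bool { return sorted[a] < sorted[b] })
	rank := (95*n + 99) / 100 // ceil(0.95 * n) in integer arithmetic
	return sorted[rank-1]
}

func main() {
	tr := NewJitterTracker(500 * time.Millisecond) // 120 BPM, i.e. 2 Hz
	beats := []time.Duration{
		0, 500 * time.Millisecond, 1010 * time.Millisecond, 1500 * time.Millisecond,
	}
	for _, at := range beats {
		tr.Observe(at)
	}
	p95 := tr.P95()
	fmt.Printf("p95 jitter %v, within 20ms target: %t\n", p95, p95 <= 20*time.Millisecond)
}
```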
BACKBEAT-PER-003: Timer Drift
✅ Target: ≤1% over 1 hour without leader
- Continuous drift monitoring
- Degradation mode with local derivation
- Automatic alerting on threshold violations
- Metrics: backbeat_timer_drift_ratio
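The 1% limit can be made concrete with a short calculation. The sketch assumes drift is the relative difference between the wall-clock time a follower actually took to emit a number of beats and the time those beats should have taken at the configured tempo; DriftRatio and WithinLimit are hypothetical helpers, not the service's API.

```go
package main

import (
	"fmt"
	"time"
)

// maxDriftRatio is the BACKBEAT-PER-003 limit of 1%.
const maxDriftRatio = 0.01

// DriftRatio compares the elapsed time for a number of locally derived
// beats against the time those beats should take at the given tempo.
func DriftRatio(beats int64, bpm int, elapsed time.Duration) float64 {
	if beats <= 0 || bpm <= 0 {
		return 0
	}
	expected := time.Duration(beats) * time.Minute / time.Duration(bpm)
	diff := elapsed - expected
	if diff < 0 {
		diff = -diff
	}
	return float64(diff) / float64(expected)
}

// WithinLimit reports whether a drift ratio satisfies the 1% target.
func WithinLimit(ratio float64) bool {
	return ratio <= maxDriftRatio
}

func main() {
	// One hour of beats at 120 BPM is 7200 beats. Suppose the local timer
	// took 18 seconds longer than an hour to emit them.
	ratio := DriftRatio(7200, 120, time.Hour+18*time.Second)
	fmt.Printf("drift %.2f%%, within limit: %t\n", ratio*100, WithinLimit(ratio))
}
```

At 120 BPM, a 1% budget over one hour corresponds to 36 seconds of accumulated error.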
API Endpoints
Admin API (Port 8080)
GET /tempo
Returns current and pending tempo information:
{
  "current_bpm": 120,
  "pending_bpm": 120,
  "can_change": true,
  "next_change": "2025-09-04T12:00:00Z",
  "reason": ""
}
POST /tempo
Changes tempo with validation:
{
  "tempo_bpm": 130,
  "justification": "workload increase"
}
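From Go, the request body above could be built as in the following sketch. The TempoChange struct and NewTempoRequest helper are hypothetical, and the base URL assumes the default admin port. The response format and status codes are not documented here, so the sketch stops at constructing the request.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// TempoChange mirrors the POST /tempo body shown above.
type TempoChange struct {
	TempoBPM      int    `json:"tempo_bpm"`
	Justification string `json:"justification"`
}

// NewTempoRequest builds a POST /tempo request against an admin API base URL.
func NewTempoRequest(baseURL string, change TempoChange) (*http.Request, error) {
	body, err := json.Marshal(change)
	if err != nil {
		return nil, fmt.Errorf("encode tempo change: %w", err)
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/tempo", bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("build request: %w", err)
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := NewTempoRequest("http://localhost:8080", TempoChange{
		TempoBPM:      130,
		Justification: "workload increase",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(req.Method, req.URL.String(), req.ContentLength)
	// To send it: resp, err := http.DefaultClient.Do(req)
}
```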
GET /drift
Returns drift monitoring information:
{
  "timer_drift_percent": 0.5,
  "hlc_drift_seconds": 1.2,
  "last_sync_time": "2025-09-04T11:59:00Z",
  "degradation_mode": false,
  "within_limits": true
}
GET /leader
Returns leadership information:
{
  "node_id": "pulse-abc123",
  "is_leader": true,
  "leader": "127.0.0.1:9000",
  "cluster_size": 2,
  "stats": { ... }
}
Health & Monitoring
- GET /health - Overall service health
- GET /ready - Kubernetes readiness probe
- GET /live - Kubernetes liveness probe
- GET /metrics - Prometheus metrics endpoint
Deployment
Development (Single Node)
make build
make dev
Cluster Development
make cluster
# Starts leader on :8080, follower on :8081
Production (Docker Compose)
docker-compose up -d
This starts:
- NATS message broker
- 2-node BACKBEAT pulse cluster
- Prometheus metrics collection
- Grafana dashboards
- Health monitoring
Production (Docker Swarm)
docker stack deploy -c docker-compose.swarm.yml backbeat
Configuration
Command Line Options
-cluster string      Cluster identifier (default "chorus-aus-01")
-node-id string      Node identifier (auto-generated if empty)
-bpm int             Initial tempo in BPM (default 12)
-bar int             Beats per bar (default 8)
-phases string       Comma-separated phase names (default "plan,work,review")
-min-bpm int         Minimum allowed BPM (default 4)
-max-bpm int         Maximum allowed BPM (default 24)
-nats string         NATS server URL (default "nats://localhost:4222")
-admin-port int      Admin API port (default 8080)
-raft-bind string    Raft bind address (default "127.0.0.1:0")
-bootstrap bool      Bootstrap new cluster (default false)
-peers string        Comma-separated Raft peer addresses
-data-dir string     Data directory (auto-generated if empty)
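For reference, a subset of these options could be declared with the standard flag package as sketched below. The flag names and defaults are taken from the list above; the Config struct, the ParseFlags helper, and the check that the initial BPM lies between the minimum and maximum are assumptions, not a copy of cmd/pulse/main.go.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strings"
)

// Config is a hypothetical holder for a subset of the pulse options.
type Config struct {
	Cluster   string
	BPM       int
	Bar       int
	Phases    []string
	MinBPM    int
	MaxBPM    int
	NATSURL   string
	AdminPort int
}

// ParseFlags parses args (without the program name) into a Config.
func ParseFlags(args []string) (Config, error) {
	fs := flag.NewFlagSet("pulse", flag.ContinueOnError)
	var c Config
	var phases string
	fs.StringVar(&c.Cluster, "cluster", "chorus-aus-01", "Cluster identifier")
	fs.IntVar(&c.BPM, "bpm", 12, "Initial tempo in BPM")
	fs.IntVar(&c.Bar, "bar", 8, "Beats per bar")
	fs.StringVar(&phases, "phases", "plan,work,review", "Comma-separated phase names")
	fs.IntVar(&c.MinBPM, "min-bpm", 4, "Minimum allowed BPM")
	fs.IntVar(&c.MaxBPM, "max-bpm", 24, "Maximum allowed BPM")
	fs.StringVar(&c.NATSURL, "nats", "nats://localhost:4222", "NATS server URL")
	fs.IntVar(&c.AdminPort, "admin-port", 8080, "Admin API port")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	c.Phases = strings.Split(phases, ",")
	// Assumed sanity check: the initial tempo must respect the bounds.
	if c.BPM < c.MinBPM || c.BPM > c.MaxBPM {
		return Config{}, fmt.Errorf("bpm %d outside [%d, %d]", c.BPM, c.MinBPM, c.MaxBPM)
	}
	return c, nil
}

func main() {
	cfg, err := ParseFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", cfg)
}
```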
Environment Variables
- BACKBEAT_LOG_LEVEL - Log level (debug, info, warn, error)
- BACKBEAT_DATA_DIR - Data directory override
- BACKBEAT_CLUSTER_ID - Cluster ID override
Monitoring
Key Metrics
- backbeat_beat_publish_duration_seconds - Beat publishing latency
- backbeat_pulse_jitter_seconds - Timing jitter (BACKBEAT-PER-002)
- backbeat_timer_drift_ratio - Timer drift percentage (BACKBEAT-PER-003)
- backbeat_is_leader - Leadership status
- backbeat_beats_total - Total beats published
- backbeat_tempo_change_errors_total - Failed tempo changes
Alerts
Configure alerts for:
- Pulse jitter p95 > 20ms
- Timer drift > 1%
- Leadership changes
- Degradation mode active > 5 minutes
- NATS connection losses
Testing
API Testing
make test-all
Tests all admin endpoints with sample requests.
Load Testing
# Monitor metrics during load
watch 'curl -s http://localhost:8080/metrics | grep backbeat_pulse_jitter'
Chaos Engineering
- Network partitions between nodes
- NATS broker restart
- Leader node termination
- Clock drift simulation
Integration
NATS Subjects
- backbeat.{cluster}.beat - BeatFrame publications
- backbeat.{cluster}.control - Legacy control messages (backward compatibility)
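A client might derive these subject names with a small helper such as the one sketched here. BeatSubject and ControlSubject are hypothetical names. The validation reflects a general NATS constraint: a subject token cannot contain dots, wildcards, or whitespace, so a cluster ID containing them would not fit the {cluster} position.

```go
package main

import (
	"fmt"
	"strings"
)

// subject fills the backbeat.{cluster}.<kind> pattern after checking that
// the cluster ID is usable as a single NATS subject token.
func subject(cluster, kind string) (string, error) {
	if cluster == "" || strings.ContainsAny(cluster, ".*> \t\r\n") {
		return "", fmt.Errorf("cluster id %q is not a valid NATS subject token", cluster)
	}
	return "backbeat." + cluster + "." + kind, nil
}

// BeatSubject returns the subject that carries BeatFrame publications.
func BeatSubject(cluster string) (string, error) { return subject(cluster, "beat") }

// ControlSubject returns the subject for legacy control messages.
func ControlSubject(cluster string) (string, error) { return subject(cluster, "control") }

func main() {
	beat, err := BeatSubject("chorus-aus-01")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(beat)
}
```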
Service Discovery
- Raft handles internal cluster membership
- External services discover via NATS subjects
- Health checks via HTTP endpoints
Security
Network Security
- Raft traffic encrypted in production
- Admin API should be behind authentication proxy
- NATS authentication recommended
Data Security
- No sensitive data in BeatFrames
- Raft logs contain only operational state
- Metrics don't expose sensitive information
Performance Tuning
NATS Configuration
max_payload: 1MB
max_connections: 10000
jetstream: enabled
Raft Configuration
HeartbeatTimeout: 1s
ElectionTimeout: 1s
CommitTimeout: 500ms
Go Runtime
GOGC=100
# Leave GOMAXPROCS unset; the Go runtime defaults to the number of available CPUs
Troubleshooting
Common Issues
- Leadership flapping
  - Check network connectivity between nodes
  - Verify Raft bind addresses are reachable
  - Monitor backbeat_leadership_changes_total
- High jitter
  - Check system load and CPU scheduling
  - Verify Go GC tuning
  - Monitor backbeat_pulse_jitter_seconds
- Drift violations
  - Check NTP synchronization
  - Monitor degradation mode duration
  - Verify backbeat_timer_drift_ratio
Debug Commands
# Check leader status
curl http://localhost:8080/leader | jq
# Check drift status
curl http://localhost:8080/drift | jq
# View Raft logs
docker logs backbeat_pulse-leader_1
# Monitor real-time metrics
curl http://localhost:8080/metrics | grep backbeat_
Future Enhancements
- COOEE Transport Integration - Replace NATS with COOEE for enhanced delivery
- Multi-Region Support - Cross-datacenter synchronization
- Dynamic Phase Configuration - Runtime phase definition updates
- Backup/Restore - Raft state backup and recovery
- WebSocket API - Real-time admin interface
Compliance
This implementation fully satisfies:
- ✅ BACKBEAT-REQ-001 through BACKBEAT-REQ-005
- ✅ BACKBEAT-PER-001 through BACKBEAT-PER-003
- ✅ INT-A BeatFrame specification
- ✅ Production deployment requirements
- ✅ Observability and monitoring requirements
The service is ready for production deployment in the CHORUS 2.0.0 ecosystem.