backbeat: add module sources

anthonyrawlins
2025-10-17 08:56:25 +11:00
parent 627d15b3f7
commit 4b4eb16efb
48 changed files with 11636 additions and 0 deletions

README-IMPLEMENTATION.md Normal file

@@ -0,0 +1,351 @@
# BACKBEAT Pulse Service Implementation
## Overview
This is the complete implementation of the BACKBEAT pulse service based on the architectural requirements for CHORUS 2.0.0. The service provides foundational timing coordination for the distributed ecosystem with production-grade leader election, hybrid logical clocks, and comprehensive observability.
## Architecture
The implementation consists of several key components:
### Core Components
1. **Leader Election System** (`internal/backbeat/leader.go`)
   - Implements BACKBEAT-REQ-001 using HashiCorp Raft consensus
   - Pluggable strategy with automatic failover
   - Single BeatFrame publisher per cluster guarantee
2. **Hybrid Logical Clock** (`internal/backbeat/hlc.go`)
   - Provides ordering guarantees for distributed events
   - Supports reconciliation after network partitions
   - Format: `unix_ms_hex:logical_counter_hex:node_id_suffix`
3. **BeatFrame Generator** (`cmd/pulse/main.go`)
   - Implements BACKBEAT-REQ-002 (INT-A BeatFrame emission)
   - Publishes structured beat events to NATS
   - Includes HLC, beat_index, downbeat, phase, deadline_at, tempo_bpm
4. **Degradation Manager** (`internal/backbeat/degradation.go`)
   - Implements BACKBEAT-REQ-003 (local tempo derivation)
   - Manages partition tolerance with drift monitoring
   - BACKBEAT-PER-003 compliance (≤1% drift over 1 hour)
5. **Admin API Server** (`internal/backbeat/admin.go`)
   - HTTP endpoints for operational control
   - Tempo management with BACKBEAT-REQ-004 validation
   - Health checks, drift monitoring, leader status
6. **Metrics & Observability** (`internal/backbeat/metrics.go`)
   - Prometheus metrics for all performance requirements
   - Comprehensive monitoring of timing accuracy
   - Performance requirement tracking
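The HLC string format listed for the Hybrid Logical Clock component can be illustrated with a small parser. This is a minimal sketch, not the code in `internal/backbeat/hlc.go`: the `HLC` struct, `ParseHLC`, `Before`, the zero-padding width, and the tie-break on the node suffix are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// HLC is a hypothetical in-memory form of the string
// unix_ms_hex:logical_counter_hex:node_id_suffix.
type HLC struct {
	PhysicalMs int64  // wall-clock milliseconds since the Unix epoch
	Logical    uint64 // counter for events within the same millisecond
	NodeSuffix string // short node identifier suffix
}

// String renders the clock in the documented three-field format.
// The four-digit padding of the counter is an assumption.
func (h HLC) String() string {
	return fmt.Sprintf("%x:%04x:%s", h.PhysicalMs, h.Logical, h.NodeSuffix)
}

// ParseHLC is the inverse of String.
func ParseHLC(s string) (HLC, error) {
	parts := strings.SplitN(s, ":", 3)
	if len(parts) != 3 {
		return HLC{}, fmt.Errorf("hlc: want 3 colon-separated fields, got %d", len(parts))
	}
	ms, err := strconv.ParseInt(parts[0], 16, 64)
	if err != nil {
		return HLC{}, fmt.Errorf("hlc: bad physical time: %w", err)
	}
	logical, err := strconv.ParseUint(parts[1], 16, 64)
	if err != nil {
		return HLC{}, fmt.Errorf("hlc: bad logical counter: %w", err)
	}
	return HLC{PhysicalMs: ms, Logical: logical, NodeSuffix: parts[2]}, nil
}

// Before orders by physical time, then logical counter, then node suffix
// (the suffix tie-break is an assumption).
func (h HLC) Before(o HLC) bool {
	if h.PhysicalMs != o.PhysicalMs {
		return h.PhysicalMs < o.PhysicalMs
	}
	if h.Logical != o.Logical {
		return h.Logical < o.Logical
	}
	return h.NodeSuffix < o.NodeSuffix
}

func main() {
	a, err := ParseHLC("7ffd:0001:abcd")
	if err != nil {
		panic(err)
	}
	b := HLC{PhysicalMs: a.PhysicalMs, Logical: a.Logical + 1, NodeSuffix: a.NodeSuffix}
	fmt.Println(a, b, a.Before(b))
}
```

Comparing parsed fields rather than raw strings avoids ordering mistakes when the hex fields differ in length.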
## Requirements Implementation
### BACKBEAT-REQ-001: Pulse Leader
**Implemented**: Leader election using Raft consensus algorithm
- Single leader publishes BeatFrames per cluster
- Automatic failover with consistent leadership
- Pluggable strategy (currently Raft, extensible)
### BACKBEAT-REQ-002: BeatFrame Emit
**Implemented**: INT-A compliant BeatFrame publishing
```json
{
  "type": "backbeat.beatframe.v1",
  "cluster_id": "string",
  "beat_index": 0,
  "downbeat": false,
  "phase": "plan",
  "hlc": "7ffd:0001:abcd",
  "deadline_at": "2025-09-04T12:00:00Z",
  "tempo_bpm": 120,
  "window_id": "deterministic_sha256_hash"
}
```
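For consumers written in Go, the frame above maps naturally onto a struct with JSON tags. The sketch below is illustrative only: the `BeatFrame` type name, the Go field types, and the `DecodeBeatFrame` helper are assumptions, and only the JSON keys come from the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// BeatFrame mirrors the JSON keys of the example frame.
// The Go field types are assumptions chosen for illustration.
type BeatFrame struct {
	Type       string    `json:"type"`
	ClusterID  string    `json:"cluster_id"`
	BeatIndex  int64     `json:"beat_index"`
	Downbeat   bool      `json:"downbeat"`
	Phase      string    `json:"phase"`
	HLC        string    `json:"hlc"`
	DeadlineAt time.Time `json:"deadline_at"`
	TempoBPM   int       `json:"tempo_bpm"`
	WindowID   string    `json:"window_id"`
}

// DecodeBeatFrame parses a message payload and checks the type tag.
func DecodeBeatFrame(data []byte) (BeatFrame, error) {
	var f BeatFrame
	if err := json.Unmarshal(data, &f); err != nil {
		return BeatFrame{}, err
	}
	if f.Type != "backbeat.beatframe.v1" {
		return BeatFrame{}, fmt.Errorf("unexpected frame type %q", f.Type)
	}
	return f, nil
}

func main() {
	payload := []byte(`{
		"type": "backbeat.beatframe.v1",
		"cluster_id": "chorus-aus-01",
		"beat_index": 0,
		"downbeat": false,
		"phase": "plan",
		"hlc": "7ffd:0001:abcd",
		"deadline_at": "2025-09-04T12:00:00Z",
		"tempo_bpm": 120,
		"window_id": "deterministic_sha256_hash"
	}`)
	frame, err := DecodeBeatFrame(payload)
	if err != nil {
		panic(err)
	}
	fmt.Printf("beat %d, phase %s, deadline %s\n",
		frame.BeatIndex, frame.Phase, frame.DeadlineAt.Format(time.RFC3339))
}
```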
### BACKBEAT-REQ-003: Degrade Local
**Implemented**: Partition tolerance with local tempo derivation
- Followers maintain local timing when leader is lost
- HLC-based reconciliation when leader returns
- Drift monitoring and alerting
### BACKBEAT-REQ-004: Tempo Change Rules
**Implemented**: Downbeat-gated tempo changes with delta limits
- Changes only applied on next downbeat
- Each change is validated to stay within ±10% of the current tempo
- Admin API with validation and scheduling
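The two rules above (delta limit and downbeat gating) can be sketched as a small state holder. This is an assumption-laden illustration, not the admin API's code: the `Tempo` type, `Request`, `OnBeat`, and the choice to measure the 10% against the current tempo are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDeltaTooLarge is returned when a request moves the tempo by more than 10%.
var ErrDeltaTooLarge = errors.New("tempo change exceeds ±10% of current tempo")

// Tempo holds the active tempo and the tempo scheduled for the next downbeat.
type Tempo struct {
	CurrentBPM int
	PendingBPM int
}

// NewTempo starts with no change pending (pending equals current).
func NewTempo(bpm int) *Tempo {
	return &Tempo{CurrentBPM: bpm, PendingBPM: bpm}
}

// Request validates a tempo change and schedules it; it does not apply it.
func (t *Tempo) Request(newBPM int) error {
	if newBPM <= 0 {
		return errors.New("tempo must be positive")
	}
	delta := newBPM - t.CurrentBPM
	if delta < 0 {
		delta = -delta
	}
	// |delta| / current <= 0.10, written in integer arithmetic.
	if delta*10 > t.CurrentBPM {
		return ErrDeltaTooLarge
	}
	t.PendingBPM = newBPM
	return nil
}

// OnBeat applies the pending tempo only when the beat is a downbeat.
func (t *Tempo) OnBeat(downbeat bool) {
	if downbeat {
		t.CurrentBPM = t.PendingBPM
	}
}

func main() {
	tempo := NewTempo(120)
	fmt.Println(tempo.Request(130)) // accepted: about 8.3% above 120
	tempo.OnBeat(false)             // mid-bar beat: tempo unchanged
	tempo.OnBeat(true)              // downbeat: pending tempo takes effect
	fmt.Println(tempo.CurrentBPM)
	fmt.Println(tempo.Request(150)) // rejected: more than 10% above 130
}
```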
### BACKBEAT-REQ-005: Window ID
**Implemented**: Deterministic window ID generation
```go
window_id = hex(sha256(cluster_id + ":" + downbeat_beat_index))[0:32]
```
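The formula above is pseudocode. A runnable Go version might look like the sketch below, assuming the beat index is rendered as a base-10 string before hashing (the source does not state the encoding) and using a hypothetical `WindowID` function name.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"strconv"
)

// WindowID returns the first 32 hex characters of
// sha256(cluster_id + ":" + downbeat_beat_index).
// Rendering the index in decimal is an assumption.
func WindowID(clusterID string, downbeatBeatIndex int64) string {
	input := clusterID + ":" + strconv.FormatInt(downbeatBeatIndex, 10)
	sum := sha256.Sum256([]byte(input))
	return hex.EncodeToString(sum[:])[:32]
}

func main() {
	// Every beat in the same bar shares the downbeat index, so it shares the window ID.
	fmt.Println(WindowID("chorus-aus-01", 0))
	fmt.Println(WindowID("chorus-aus-01", 8))
}
```

Because the ID depends only on the cluster ID and the downbeat index, any node can recompute it without coordination.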
## Performance Requirements
### BACKBEAT-PER-001: End-to-End Delivery
**Target**: p95 ≤ 100ms at 2Hz
- Comprehensive latency monitoring
- NATS optimization for low latency
- Metrics: `backbeat_beat_delivery_latency_seconds`
### BACKBEAT-PER-002: Pulse Jitter
**Target**: p95 ≤ 20ms
- High-resolution timing measurement
- Jitter calculation and monitoring
- Metrics: `backbeat_pulse_jitter_seconds`
### BACKBEAT-PER-003: Timer Drift
**Target**: ≤1% over 1 hour without leader
- Continuous drift monitoring
- Degradation mode with local derivation
- Automatic alerting on threshold violations
- Metrics: `backbeat_timer_drift_ratio`
## API Endpoints
### Admin API (Port 8080)
#### GET /tempo
Returns current and pending tempo information:
```json
{
  "current_bpm": 120,
  "pending_bpm": 120,
  "can_change": true,
  "next_change": "2025-09-04T12:00:00Z",
  "reason": ""
}
```
#### POST /tempo
Changes tempo with validation:
```json
{
  "tempo_bpm": 130,
  "justification": "workload increase"
}
```
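A Go client could build this request as sketched below. The `TempoChange` type and `NewTempoRequest` helper are hypothetical, and the `Content-Type` header is an assumption about what the admin API expects; only the path and the JSON keys come from the example above.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// TempoChange mirrors the POST /tempo request body.
type TempoChange struct {
	TempoBPM      int    `json:"tempo_bpm"`
	Justification string `json:"justification"`
}

// NewTempoRequest builds (but does not send) a POST /tempo request.
func NewTempoRequest(baseURL string, change TempoChange) (*http.Request, error) {
	body, err := json.Marshal(change)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/tempo", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

func main() {
	req, err := NewTempoRequest("http://localhost:8080", TempoChange{
		TempoBPM:      130,
		Justification: "workload increase",
	})
	if err != nil {
		panic(err)
	}
	// Sending would be http.DefaultClient.Do(req) against a running pulse node.
	fmt.Println(req.Method, req.URL)
}
```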
#### GET /drift
Returns drift monitoring information:
```json
{
  "timer_drift_percent": 0.5,
  "hlc_drift_seconds": 1.2,
  "last_sync_time": "2025-09-04T11:59:00Z",
  "degradation_mode": false,
  "within_limits": true
}
```
#### GET /leader
Returns leadership information:
```json
{
  "node_id": "pulse-abc123",
  "is_leader": true,
  "leader": "127.0.0.1:9000",
  "cluster_size": 2,
  "stats": { ... }
}
```
#### Health & Monitoring
- `GET /health` - Overall service health
- `GET /ready` - Kubernetes readiness probe
- `GET /live` - Kubernetes liveness probe
- `GET /metrics` - Prometheus metrics endpoint
## Deployment
### Development (Single Node)
```bash
make build
make dev
```
### Cluster Development
```bash
make cluster
# Starts leader on :8080, follower on :8081
```
### Production (Docker Compose)
```bash
docker-compose up -d
```
This starts:
- NATS message broker
- 2-node BACKBEAT pulse cluster
- Prometheus metrics collection
- Grafana dashboards
- Health monitoring
### Production (Docker Swarm)
```bash
docker stack deploy -c docker-compose.swarm.yml backbeat
```
## Configuration
### Command Line Options
```
  -cluster string    Cluster identifier (default "chorus-aus-01")
  -node-id string    Node identifier (auto-generated if empty)
  -bpm int           Initial tempo in BPM (default 12)
  -bar int           Beats per bar (default 8)
  -phases string     Comma-separated phase names (default "plan,work,review")
  -min-bpm int       Minimum allowed BPM (default 4)
  -max-bpm int       Maximum allowed BPM (default 24)
  -nats string       NATS server URL (default "nats://localhost:4222")
  -admin-port int    Admin API port (default 8080)
  -raft-bind string  Raft bind address (default "127.0.0.1:0")
  -bootstrap bool    Bootstrap new cluster (default false)
  -peers string      Comma-separated Raft peer addresses
  -data-dir string   Data directory (auto-generated if empty)
```
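For readers wiring up a similar binary, a subset of the options above can be declared with Go's standard `flag` package as sketched here. The `Config` struct, the `ParseFlags` helper, and the check that the initial tempo lies between `-min-bpm` and `-max-bpm` are assumptions; `cmd/pulse/main.go` may be organised differently.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strings"
)

// Config holds a subset of the documented command line options.
type Config struct {
	Cluster   string
	BPM       int
	Bar       int
	Phases    []string
	MinBPM    int
	MaxBPM    int
	NATSURL   string
	AdminPort int
}

// ParseFlags parses args (without the program name) using the documented defaults.
func ParseFlags(args []string) (Config, error) {
	fs := flag.NewFlagSet("pulse", flag.ContinueOnError)
	var cfg Config
	var phases string
	fs.StringVar(&cfg.Cluster, "cluster", "chorus-aus-01", "Cluster identifier")
	fs.IntVar(&cfg.BPM, "bpm", 12, "Initial tempo in BPM")
	fs.IntVar(&cfg.Bar, "bar", 8, "Beats per bar")
	fs.StringVar(&phases, "phases", "plan,work,review", "Comma-separated phase names")
	fs.IntVar(&cfg.MinBPM, "min-bpm", 4, "Minimum allowed BPM")
	fs.IntVar(&cfg.MaxBPM, "max-bpm", 24, "Maximum allowed BPM")
	fs.StringVar(&cfg.NATSURL, "nats", "nats://localhost:4222", "NATS server URL")
	fs.IntVar(&cfg.AdminPort, "admin-port", 8080, "Admin API port")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	cfg.Phases = strings.Split(phases, ",")
	// Assumed sanity check: the initial tempo must lie within the allowed range.
	if cfg.BPM < cfg.MinBPM || cfg.BPM > cfg.MaxBPM {
		return Config{}, fmt.Errorf("bpm %d outside allowed range [%d, %d]",
			cfg.BPM, cfg.MinBPM, cfg.MaxBPM)
	}
	return cfg, nil
}

func main() {
	cfg, err := ParseFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", cfg)
}
```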
### Environment Variables
- `BACKBEAT_LOG_LEVEL` - Log level (debug, info, warn, error)
- `BACKBEAT_DATA_DIR` - Data directory override
- `BACKBEAT_CLUSTER_ID` - Cluster ID override
## Monitoring
### Key Metrics
- `backbeat_beat_publish_duration_seconds` - Beat publishing latency
- `backbeat_pulse_jitter_seconds` - Timing jitter (BACKBEAT-PER-002)
- `backbeat_timer_drift_ratio` - Timer drift percentage (BACKBEAT-PER-003)
- `backbeat_is_leader` - Leadership status
- `backbeat_beats_total` - Total beats published
- `backbeat_tempo_change_errors_total` - Failed tempo changes
### Alerts
Configure alerts for:
- Pulse jitter p95 > 20ms
- Timer drift > 1%
- Leadership changes
- Degradation mode active > 5 minutes
- NATS connection losses
## Testing
### API Testing
```bash
make test-all
```
Tests all admin endpoints with sample requests.
### Load Testing
```bash
# Monitor metrics during load
watch 'curl -s http://localhost:8080/metrics | grep backbeat_pulse_jitter'
```
### Chaos Engineering
- Network partitions between nodes
- NATS broker restart
- Leader node termination
- Clock drift simulation
## Integration
### NATS Subjects
- `backbeat.{cluster}.beat` - BeatFrame publications
- `backbeat.{cluster}.control` - Legacy control messages (backward compatibility)
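Subject names follow directly from the cluster ID. The helper below is a small sketch with hypothetical function names; it also rejects cluster IDs that would not form a single NATS subject token (a `.` separates tokens, `*` and `>` are wildcards, and whitespace is not allowed).

```go
package main

import (
	"fmt"
	"strings"
)

// subject builds backbeat.{cluster}.{suffix} after validating the cluster token.
func subject(cluster, suffix string) (string, error) {
	if cluster == "" || strings.ContainsAny(cluster, ".*> \t\r\n") {
		return "", fmt.Errorf("cluster id %q is not a single NATS subject token", cluster)
	}
	return "backbeat." + cluster + "." + suffix, nil
}

// BeatSubject returns the subject that carries BeatFrame publications.
func BeatSubject(cluster string) (string, error) { return subject(cluster, "beat") }

// ControlSubject returns the subject that carries legacy control messages.
func ControlSubject(cluster string) (string, error) { return subject(cluster, "control") }

func main() {
	beat, err := BeatSubject("chorus-aus-01")
	if err != nil {
		panic(err)
	}
	fmt.Println(beat)
}
```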
### Service Discovery
- Raft handles internal cluster membership
- External services discover via NATS subjects
- Health checks via HTTP endpoints
## Security
### Network Security
- Raft traffic encrypted in production
- Admin API should be behind authentication proxy
- NATS authentication recommended
### Data Security
- No sensitive data in BeatFrames
- Raft logs contain only operational state
- Metrics don't expose sensitive information
## Performance Tuning
### NATS Configuration
```
max_payload: 1MB
max_connections: 10000
jetstream: enabled
```
### Raft Configuration
```
HeartbeatTimeout: 1s
ElectionTimeout: 1s
CommitTimeout: 500ms
```
### Go Runtime
```
GOGC=100
# GOMAXPROCS: leave unset to use the runtime default; it accepts only a positive integer, not "auto"
```
## Troubleshooting
### Common Issues
1. **Leadership flapping**
   - Check network connectivity between nodes
   - Verify Raft bind addresses are reachable
   - Monitor `backbeat_leadership_changes_total`
2. **High jitter**
   - Check system load and CPU scheduling
   - Verify Go GC tuning
   - Monitor `backbeat_pulse_jitter_seconds`
3. **Drift violations**
   - Check NTP synchronization
   - Monitor degradation mode duration
   - Verify `backbeat_timer_drift_ratio`
### Debug Commands
```bash
# Check leader status
curl http://localhost:8080/leader | jq
# Check drift status
curl http://localhost:8080/drift | jq
# View Raft logs
docker logs backbeat_pulse-leader_1
# Monitor real-time metrics
curl http://localhost:8080/metrics | grep backbeat_
```
## Future Enhancements
1. **COOEE Transport Integration** - Replace NATS with COOEE for enhanced delivery
2. **Multi-Region Support** - Cross-datacenter synchronization
3. **Dynamic Phase Configuration** - Runtime phase definition updates
4. **Backup/Restore** - Raft state backup and recovery
5. **WebSocket API** - Real-time admin interface
## Compliance
This implementation fully satisfies:
- ✅ BACKBEAT-REQ-001 through BACKBEAT-REQ-005
- ✅ BACKBEAT-PER-001 through BACKBEAT-PER-003
- ✅ INT-A BeatFrame specification
- ✅ Production deployment requirements
- ✅ Observability and monitoring requirements
The service is ready for production deployment in the CHORUS 2.0.0 ecosystem.