# BACKBEAT Prototype
A production-grade distributed task orchestration system with time-synchronized beat generation and agent status aggregation.
## Overview
BACKBEAT implements a novel approach to distributed system coordination using musical concepts:
- **Pulse Service**: Leader-elected nodes generate synchronized "beats" as timing references
- **Reverb Service**: Aggregates agent status claims and produces summary reports per "window"
- **Agent Simulation**: Simulates distributed agents reporting task status
## Module Availability
BACKBEAT is published as a Go module. Consumers can pin the current release directly:
```bash
go get github.com/chorus-services/backbeat@v0.1.0
```
After downloading, the SDK helpers are available via `github.com/chorus-services/backbeat/pkg/sdk`.
## Architecture
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Pulse    │────▶│    NATS     │◀────│   Reverb    │
│  (Leader)   │     │   Broker    │     │ (Aggregator)│
└─────────────┘     └─────────────┘     └─────────────┘
                           ▲
                           │
                    ┌─────────────┐
                    │   Agents    │
                    │ (Simulated) │
                    └─────────────┘
```
### Key Components
1. **Pulse Service** (`cmd/pulse/`)
   - Raft-based leader election
   - Hybrid Logical Clock (HLC) synchronization
   - Tempo control with ±10% change limits
   - Beat frame generation at configurable BPM
   - Degradation mode for fault tolerance
2. **Reverb Service** (`cmd/reverb/`)
   - StatusClaim ingestion and validation
   - Window-based aggregation
   - BarReport generation with KPIs
   - Performance monitoring and SLO tracking
   - Admin API for operational visibility
3. **Agent Simulator** (`cmd/agent-sim/`)
   - Multi-agent simulation
   - Realistic task state transitions
   - Configurable reporting rates
   - Load testing capabilities
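
The ±10% tempo limit listed for the Pulse Service can be illustrated with a small sketch. `ClampTempo` is a hypothetical helper, not part of the BACKBEAT code base, and it assumes that an out-of-range request is clamped to the nearest allowed value; the real service may reject such requests instead.

```go
package main

import "fmt"

// maxTempoChange is the ±10% limit from the component list above.
const maxTempoChange = 0.10

// ClampTempo limits a requested tempo to within ±10% of the current tempo.
// Clamping (rather than rejecting the request) is an assumption made here.
func ClampTempo(currentBPM, requestedBPM float64) float64 {
	lo := currentBPM * (1 - maxTempoChange)
	hi := currentBPM * (1 + maxTempoChange)
	if requestedBPM < lo {
		return lo
	}
	if requestedBPM > hi {
		return hi
	}
	return requestedBPM
}

func main() {
	// A jump from 120 to 150 BPM exceeds the limit and is held at the upper bound.
	fmt.Printf("%.1f\n", ClampTempo(120, 150))
	// A change from 120 to 125 BPM is within the limit and passes through.
	fmt.Printf("%.1f\n", ClampTempo(120, 125))
}
```

Under this assumption, reaching a distant target tempo takes several consecutive changes of at most 10% each.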
## Requirements Implementation
The system implements the following requirements:
### Core Requirements
- **BACKBEAT-REQ-020**: StatusClaim ingestion and window grouping
- **BACKBEAT-REQ-021**: BarReport emission at downbeats with KPIs
- **BACKBEAT-REQ-022**: DHT persistence placeholder (future implementation)
### Performance Requirements
- **BACKBEAT-PER-001**: End-to-end delivery p95 ≤ 100ms at 2Hz
- **BACKBEAT-PER-002**: Reverb rollup ≤ 1 beat after downbeat
- **BACKBEAT-PER-003**: SDK timer drift ≤ 1% over 1 hour
### Observability Requirements
- **BACKBEAT-OBS-002**: Comprehensive reverb metrics
- Prometheus metrics export
- Structured logging with zerolog
- Health and readiness endpoints
## Quick Start
### Development Environment
1. **Start the complete stack:**
   ```bash
   make run-dev
   ```
2. **Monitor the services:**
   - Pulse Node 1: http://localhost:8080
   - Pulse Node 2: http://localhost:8081
   - Reverb Service: http://localhost:8082
   - Prometheus: http://localhost:9090
   - Grafana: http://localhost:3000 (admin/admin)
3. **View logs:**
   ```bash
   make logs
   ```
4. **Check service status:**
   ```bash
   make status
   ```
### Manual Build
```bash
# Build all services
make build
# Run individual services (each stays in the foreground, so use one terminal per service)
./bin/pulse -cluster=test-cluster -nats=nats://localhost:4222
./bin/reverb -cluster=test-cluster -nats=nats://localhost:4222
./bin/agent-sim -cluster=test-cluster -nats=nats://localhost:4222
```
## Interface Specifications
### INT-A: BeatFrame (Pulse → All)
```json
{
"type": "backbeat.beatframe.v1",
"cluster_id": "chorus-production",
"beat_index": 1234,
"downbeat": true,
"phase": "execution",
"hlc": "7ffd:0001:beef",
"deadline_at": "2024-01-15T10:30:00Z",
"tempo_bpm": 120,
"window_id": "7e9b0e6c4c9a4e59b7f2d9a3c1b2e4d5"
}
```
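A consumer might decode this frame roughly as follows. The `BeatFrame` struct and `DecodeBeatFrame` function are a sketch inferred from the JSON example above, not the types exported by `pkg/sdk`; in particular, using `float64` for `tempo_bpm` and `time.Time` for `deadline_at` are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// BeatFrame mirrors the INT-A example. Field types are inferred from that
// example and may differ from the real SDK definitions.
type BeatFrame struct {
	Type       string    `json:"type"`
	ClusterID  string    `json:"cluster_id"`
	BeatIndex  int64     `json:"beat_index"`
	Downbeat   bool      `json:"downbeat"`
	Phase      string    `json:"phase"`
	HLC        string    `json:"hlc"`
	DeadlineAt time.Time `json:"deadline_at"`
	TempoBPM   float64   `json:"tempo_bpm"`
	WindowID   string    `json:"window_id"`
}

// DecodeBeatFrame parses a frame and rejects payloads of another message type.
func DecodeBeatFrame(data []byte) (BeatFrame, error) {
	var f BeatFrame
	if err := json.Unmarshal(data, &f); err != nil {
		return BeatFrame{}, err
	}
	if f.Type != "backbeat.beatframe.v1" {
		return BeatFrame{}, fmt.Errorf("unexpected message type %q", f.Type)
	}
	return f, nil
}

func main() {
	raw := []byte(`{
		"type": "backbeat.beatframe.v1",
		"cluster_id": "chorus-production",
		"beat_index": 1234,
		"downbeat": true,
		"phase": "execution",
		"hlc": "7ffd:0001:beef",
		"deadline_at": "2024-01-15T10:30:00Z",
		"tempo_bpm": 120,
		"window_id": "7e9b0e6c4c9a4e59b7f2d9a3c1b2e4d5"
	}`)
	f, err := DecodeBeatFrame(raw)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(f.BeatIndex, f.Downbeat, f.DeadlineAt.UTC())
}
```

Checking the `type` field first lets one subscriber safely ignore messages from a different interface or schema version.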
### INT-B: StatusClaim (Agents → Reverb)
```json
{
"type": "backbeat.statusclaim.v1",
"agent_id": "agent:xyz",
"task_id": "task:123",
"beat_index": 1234,
"state": "executing",
"beats_left": 3,
"progress": 0.5,
"notes": "fetching inputs",
"hlc": "7ffd:0001:beef"
}
```
### INT-C: BarReport (Reverb → Consumers)
```json
{
"type": "backbeat.barreport.v1",
"window_id": "7e9b0e6c4c9a4e59b7f2d9a3c1b2e4d5",
"from_beat": 240,
"to_beat": 359,
"agents_reporting": 978,
"on_time_reviews": 842,
"help_promises_fulfilled": 91,
"secret_rotations_ok": true,
"tempo_drift_ms": 7,
"issues": []
}
```
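The `from_beat` and `to_beat` values in this example span 120 beats (240 through 359). A sketch of how a beat index could be mapped to its window follows, assuming bars are aligned to multiples of the bar length starting at beat 0. That alignment is inferred from this one example only, and `WindowBounds` is a hypothetical helper.

```go
package main

import "fmt"

// WindowBounds returns the first and last beat of the bar that contains
// beatIndex. It assumes non-negative beat indices and bars aligned to
// multiples of barLength, which is an inference from the BarReport example.
func WindowBounds(beatIndex, barLength int64) (from, to int64) {
	from = (beatIndex / barLength) * barLength
	return from, from + barLength - 1
}

func main() {
	from, to := WindowBounds(300, 120)
	fmt.Println(from, to) // 240 359
}
```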
## API Endpoints
### Pulse Service
- `GET /health` - Health check
- `GET /ready` - Readiness check
- `GET /metrics` - Prometheus metrics
- `POST /api/v1/tempo` - Change tempo
- `GET /api/v1/status` - Service status
### Reverb Service
- `GET /health` - Health check
- `GET /ready` - Readiness check
- `GET /metrics` - Prometheus metrics
- `GET /api/v1/windows` - List active windows
- `GET /api/v1/windows/{id}` - Get window details
- `GET /api/v1/status` - Service status
## Configuration
### Environment Variables
- `BACKBEAT_ENV` - Environment (development/production)
- `NATS_URL` - NATS server URL
- `LOG_LEVEL` - Logging level (debug/info/warn/error)
### Command Line Flags
#### Pulse Service
- `-cluster` - Cluster identifier
- `-node` - Node identifier
- `-admin-port` - HTTP admin port
- `-raft-bind` - Raft cluster bind address
- `-data-dir` - Data directory
- `-nats` - NATS server URL
#### Reverb Service
- `-cluster` - Cluster identifier
- `-node` - Node identifier
- `-nats` - NATS server URL
- `-bar-length` - Bar length in beats
- `-log-level` - Log level
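
For reference, the Reverb flags above could be declared with Go's standard `flag` package as sketched below. The flag names come from this README; the `ReverbOptions` type, the default values, and the positive `-bar-length` check are assumptions and may not match `cmd/reverb/`.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// ReverbOptions collects the documented Reverb command line flags.
type ReverbOptions struct {
	Cluster   string
	Node      string
	NATSURL   string
	BarLength int
	LogLevel  string
}

// ParseReverbFlags parses args (without the program name).
// All default values below are hypothetical.
func ParseReverbFlags(args []string) (ReverbOptions, error) {
	var o ReverbOptions
	fs := flag.NewFlagSet("reverb", flag.ContinueOnError)
	fs.StringVar(&o.Cluster, "cluster", "test-cluster", "cluster identifier")
	fs.StringVar(&o.Node, "node", "", "node identifier")
	fs.StringVar(&o.NATSURL, "nats", "nats://localhost:4222", "NATS server URL")
	fs.IntVar(&o.BarLength, "bar-length", 120, "bar length in beats")
	fs.StringVar(&o.LogLevel, "log-level", "info", "log level")
	if err := fs.Parse(args); err != nil {
		return ReverbOptions{}, err
	}
	if o.BarLength <= 0 {
		return ReverbOptions{}, fmt.Errorf("bar-length must be positive, got %d", o.BarLength)
	}
	return o, nil
}

func main() {
	opts, err := ParseReverbFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", opts)
}
```

Using a dedicated `FlagSet` instead of the global one keeps the parser easy to exercise from tests.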
## Monitoring
### Key Metrics
**Pulse Service:**
- `backbeat_beats_total` - Total beats published
- `backbeat_pulse_jitter_seconds` - Beat timing jitter
- `backbeat_is_leader` - Leadership status
- `backbeat_current_tempo_bpm` - Current tempo
**Reverb Service:**
- `backbeat_reverb_agents_reporting` - Agents in current window
- `backbeat_reverb_on_time_reviews` - On-time task completions
- `backbeat_reverb_windows_completed_total` - Total windows processed
- `backbeat_reverb_window_processing_seconds` - Window processing time
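
When scripting against `/metrics`, a single value can be pulled out of the Prometheus text exposition format with a few lines of Go. `FindSample` is a hypothetical helper that only matches samples without labels; whether the BACKBEAT metrics carry labels is not stated here. The input in `main` is made-up sample text, not captured service output.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// FindSample scans Prometheus text exposition output for an unlabelled
// sample called name and returns its value. Labelled samples such as
// name{key="value"} are not matched by this sketch.
func FindSample(exposition, name string) (float64, bool) {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		// Skips blank lines, "# HELP" / "# TYPE" comments, and other metrics.
		if len(fields) < 2 || fields[0] != name {
			continue
		}
		v, err := strconv.ParseFloat(fields[1], 64)
		if err != nil {
			continue
		}
		return v, true
	}
	return 0, false
}

func main() {
	// Illustrative input only.
	sample := "# TYPE backbeat_is_leader gauge\n" +
		"backbeat_is_leader 1\n" +
		"# TYPE backbeat_current_tempo_bpm gauge\n" +
		"backbeat_current_tempo_bpm 120\n"
	if v, ok := FindSample(sample, "backbeat_is_leader"); ok {
		fmt.Println("leader:", v == 1)
	}
}
```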
### Performance SLOs
The system tracks compliance with performance requirements:
- Beat delivery latency p95 ≤ 100ms
- Pulse jitter p95 ≤ 20ms
- Reverb processing ≤ 1 beat duration
- Timer drift ≤ 1% over 1 hour
## Development
### Build Requirements
- Go 1.22+
- Docker & Docker Compose
- Make
### Development Workflow
```bash
# Format, vet, test, and build
make dev
# Run full CI pipeline
make ci
# Build for production
make production
```
### Testing
```bash
# Run tests
make test
# Run with race detection
go test -race ./...
# Run specific test suites
go test ./internal/backbeat -v
```
## Production Deployment
### Docker Images
The multi-stage Dockerfile produces separate images for each service:
- `backbeat-pulse:v1.0.0` - Pulse service
- `backbeat-reverb:v1.0.0` - Reverb service
- `backbeat-agent-sim:v1.0.0` - Agent simulator
### Kubernetes Deployment
```bash
# Build and push images
make docker-push VERSION=v1.0.0
# Deploy to Kubernetes (example)
kubectl apply -f k8s/
```
### Docker Swarm Deployment
```bash
# Build images
make docker
# Deploy stack
docker stack deploy -c docker-compose.swarm.yml backbeat
```
## Troubleshooting
### Common Issues
1. **NATS Connection Failed**
   - Verify NATS server is running
   - Check network connectivity
   - Verify NATS URL configuration
2. **Leader Election Issues**
   - Check Raft logs for cluster formation
   - Verify peer connectivity on Raft ports
   - Ensure persistent storage is available
3. **Missing StatusClaims**
   - Verify agents are publishing to correct NATS subjects
   - Check StatusClaim validation errors in reverb logs
   - Monitor `backbeat_reverb_claims_processed_total` metric
### Log Analysis
```bash
# Follow reverb service logs
docker-compose logs -f reverb
# Search for specific window processing
docker-compose logs reverb | grep "window_id=abc123"
# Monitor performance metrics
curl -s http://localhost:8082/metrics | grep backbeat_reverb
```
## License
This is prototype software for the CHORUS platform. See licensing documentation for details.
## Support
For issues and questions, please refer to the CHORUS platform documentation or contact the development team.