# BACKBEAT Prototype

A production-grade distributed task orchestration system with time-synchronized beat generation and agent status aggregation.

## Overview

BACKBEAT implements a novel approach to distributed system coordination using musical concepts:

- **Pulse Service**: Leader-elected nodes generate synchronized "beats" as timing references
- **Reverb Service**: Aggregates agent status claims and produces summary reports per "window"
- **Agent Simulation**: Simulates distributed agents reporting task status

## Module Availability

BACKBEAT is published as a Go module. Consumers can pin the current release directly:

```bash
go get github.com/chorus-services/backbeat@v0.1.0
```

After downloading, the SDK helpers are available via `github.com/chorus-services/backbeat/pkg/sdk`.

## Architecture

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│    Pulse    │────▶│    NATS     │◀────│   Reverb    │
│  (Leader)   │     │   Broker    │     │ (Aggregator)│
└─────────────┘     └─────────────┘     └─────────────┘
                           │
                           ▼
                    ┌─────────────┐
                    │   Agents    │
                    │ (Simulated) │
                    └─────────────┘
```

### Key Components

1. **Pulse Service** (`cmd/pulse/`)
   - Raft-based leader election
   - Hybrid Logical Clock (HLC) synchronization
   - Tempo control with ±10% change limits
   - Beat frame generation at configurable BPM
   - Degradation mode for fault tolerance

2. **Reverb Service** (`cmd/reverb/`)
   - StatusClaim ingestion and validation
   - Window-based aggregation
   - BarReport generation with KPIs
   - Performance monitoring and SLO tracking
   - Admin API for operational visibility

3. **Agent Simulator** (`cmd/agent-sim/`)
   - Multi-agent simulation
   - Realistic task state transitions
   - Configurable reporting rates
   - Load testing capabilities

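To make the tempo items in the Pulse list concrete, here is a minimal sketch of how a ±10% change limit and a BPM-to-interval conversion could look. The helper names `clampTempo` and `beatPeriod` are hypothetical and not taken from the BACKBEAT code; in particular, the real Pulse service may reject an out-of-range request instead of clamping it.

```go
package main

import (
	"fmt"
	"time"
)

// clampTempo limits a requested tempo to within ±10% of the current tempo.
// Hypothetical helper: integer BPM values and clamping (rather than
// rejecting) are assumptions made for this sketch.
func clampTempo(currentBPM, requestedBPM int) int {
	lo := (currentBPM*90 + 99) / 100 // ceil(0.9 * current)
	hi := currentBPM * 110 / 100     // floor(1.1 * current)
	if requestedBPM < lo {
		return lo
	}
	if requestedBPM > hi {
		return hi
	}
	return requestedBPM
}

// beatPeriod converts beats per minute into the interval between beats.
// It returns 0 for a non-positive tempo.
func beatPeriod(bpm int) time.Duration {
	if bpm <= 0 {
		return 0
	}
	return time.Minute / time.Duration(bpm)
}

func main() {
	fmt.Println(beatPeriod(120))      // 500ms
	fmt.Println(clampTempo(120, 200)) // 132
	fmt.Println(clampTempo(120, 125)) // 125
}
```

At 120 BPM the period is 500ms, which corresponds to the 2Hz beat rate referenced in the performance requirements.
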
## Requirements Implementation

The system implements the following requirements:

### Core Requirements
- **BACKBEAT-REQ-020**: StatusClaim ingestion and window grouping
- **BACKBEAT-REQ-021**: BarReport emission at downbeats with KPIs
- **BACKBEAT-REQ-022**: DHT persistence placeholder (future implementation)

### Performance Requirements
- **BACKBEAT-PER-001**: End-to-end delivery p95 ≤ 100ms at 2Hz
- **BACKBEAT-PER-002**: Reverb rollup ≤ 1 beat after downbeat
- **BACKBEAT-PER-003**: SDK timer drift ≤ 1% over 1 hour

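As a worked example of the drift budget in BACKBEAT-PER-003, the sketch below compares the number of beats a local timer produced against the number expected for the elapsed time. Measuring drift as a beat-count mismatch is an assumption made here for illustration, and `expectedBeats` and `withinDriftBudget` are hypothetical names, not SDK functions.

```go
package main

import (
	"fmt"
	"time"
)

// expectedBeats returns how many beats should occur in elapsed at the given tempo.
func expectedBeats(bpm int, elapsed time.Duration) int64 {
	return int64(elapsed) * int64(bpm) / int64(time.Minute)
}

// withinDriftBudget reports whether the observed beat count deviates from the
// expected count by at most maxPercent percent. Integer arithmetic avoids
// floating-point rounding at the boundary.
func withinDriftBudget(expected, observed, maxPercent int64) bool {
	diff := observed - expected
	if diff < 0 {
		diff = -diff
	}
	return diff*100 <= expected*maxPercent
}

func main() {
	want := expectedBeats(120, time.Hour)
	fmt.Println(want)                             // 7200
	fmt.Println(withinDriftBudget(want, 7128, 1)) // true: 72 beats is exactly 1%
	fmt.Println(withinDriftBudget(want, 7127, 1)) // false: 73 beats exceeds 1%
}
```

At 2Hz, a 1% budget over one hour therefore allows the timer to be off by at most 72 beats, or 36 seconds.
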
### Observability Requirements
- **BACKBEAT-OBS-002**: Comprehensive reverb metrics
- Prometheus metrics export
- Structured logging with zerolog
- Health and readiness endpoints

## Quick Start

### Development Environment

1. **Start the complete stack:**
   ```bash
   make run-dev
   ```

2. **Monitor the services:**
   - Pulse Node 1: http://localhost:8080
   - Pulse Node 2: http://localhost:8081
   - Reverb Service: http://localhost:8082
   - Prometheus: http://localhost:9090
   - Grafana: http://localhost:3000 (admin/admin)

3. **View logs:**
   ```bash
   make logs
   ```

4. **Check service status:**
   ```bash
   make status
   ```

### Manual Build

```bash
# Build all services
make build

# Run individual services
./bin/pulse -cluster=test-cluster -nats=nats://localhost:4222
./bin/reverb -cluster=test-cluster -nats=nats://localhost:4222
./bin/agent-sim -cluster=test-cluster -nats=nats://localhost:4222
```

## Interface Specifications

### INT-A: BeatFrame (Pulse → All)
```json
{
  "type": "backbeat.beatframe.v1",
  "cluster_id": "chorus-production",
  "beat_index": 1234,
  "downbeat": true,
  "phase": "execution",
  "hlc": "7ffd:0001:beef",
  "deadline_at": "2024-01-15T10:30:00Z",
  "tempo_bpm": 120,
  "window_id": "7e9b0e6c4c9a4e59b7f2d9a3c1b2e4d5"
}
```

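A consumer could decode this message along the following lines. This is a sketch only: the `BeatFrame` struct and `decodeBeatFrame` function are hypothetical, with field names derived from the JSON keys above and Go types chosen as assumptions (for example, `tempo_bpm` as an integer). The types shipped in `pkg/sdk` may differ.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// BeatFrame mirrors the INT-A example. Hypothetical type for illustration.
type BeatFrame struct {
	Type       string    `json:"type"`
	ClusterID  string    `json:"cluster_id"`
	BeatIndex  int64     `json:"beat_index"`
	Downbeat   bool      `json:"downbeat"`
	Phase      string    `json:"phase"`
	HLC        string    `json:"hlc"`
	DeadlineAt time.Time `json:"deadline_at"` // parsed as RFC 3339
	TempoBPM   int       `json:"tempo_bpm"`   // assumption: whole BPM values
	WindowID   string    `json:"window_id"`
}

const beatFrameType = "backbeat.beatframe.v1"

// decodeBeatFrame parses a frame and rejects messages of any other type.
func decodeBeatFrame(data []byte) (BeatFrame, error) {
	var f BeatFrame
	if err := json.Unmarshal(data, &f); err != nil {
		return BeatFrame{}, fmt.Errorf("decode beat frame: %w", err)
	}
	if f.Type != beatFrameType {
		return BeatFrame{}, fmt.Errorf("unexpected message type %q", f.Type)
	}
	return f, nil
}

func main() {
	raw := []byte(`{
		"type": "backbeat.beatframe.v1",
		"cluster_id": "chorus-production",
		"beat_index": 1234,
		"downbeat": true,
		"phase": "execution",
		"hlc": "7ffd:0001:beef",
		"deadline_at": "2024-01-15T10:30:00Z",
		"tempo_bpm": 120,
		"window_id": "7e9b0e6c4c9a4e59b7f2d9a3c1b2e4d5"
	}`)
	f, err := decodeBeatFrame(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(f.BeatIndex, f.Downbeat, f.TempoBPM) // 1234 true 120
}
```

Checking the `type` field before using the payload lets a subscriber ignore message versions it does not understand.
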
### INT-B: StatusClaim (Agents → Reverb)
```json
{
  "type": "backbeat.statusclaim.v1",
  "agent_id": "agent:xyz",
  "task_id": "task:123",
  "beat_index": 1234,
  "state": "executing",
  "beats_left": 3,
  "progress": 0.5,
  "notes": "fetching inputs",
  "hlc": "7ffd:0001:beef"
}
```

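The Reverb service is described as validating StatusClaims on ingestion. The sketch below shows what such a check might look like; the `StatusClaim` struct and every rule in `Validate` (required identifiers, non-negative counters, `progress` in the range 0 to 1) are assumptions inferred from the example payload, not the service's documented rules.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// StatusClaim mirrors the INT-B example. Hypothetical type for illustration.
type StatusClaim struct {
	Type      string  `json:"type"`
	AgentID   string  `json:"agent_id"`
	TaskID    string  `json:"task_id"`
	BeatIndex int64   `json:"beat_index"`
	State     string  `json:"state"`
	BeatsLeft int     `json:"beats_left"`
	Progress  float64 `json:"progress"`
	Notes     string  `json:"notes"`
	HLC       string  `json:"hlc"`
}

// Validate applies assumed sanity checks; the real validation may differ.
func (c StatusClaim) Validate() error {
	switch {
	case c.Type != "backbeat.statusclaim.v1":
		return fmt.Errorf("unexpected message type %q", c.Type)
	case c.AgentID == "":
		return errors.New("agent_id is required")
	case c.TaskID == "":
		return errors.New("task_id is required")
	case c.BeatIndex < 0:
		return errors.New("beat_index must not be negative")
	case c.BeatsLeft < 0:
		return errors.New("beats_left must not be negative")
	case c.Progress < 0 || c.Progress > 1:
		return errors.New("progress must be between 0 and 1")
	}
	return nil
}

// parseStatusClaim decodes and validates a claim in one step.
func parseStatusClaim(data []byte) (StatusClaim, error) {
	var c StatusClaim
	if err := json.Unmarshal(data, &c); err != nil {
		return StatusClaim{}, fmt.Errorf("decode status claim: %w", err)
	}
	return c, c.Validate()
}

func main() {
	raw := []byte(`{"type":"backbeat.statusclaim.v1","agent_id":"agent:xyz",
		"task_id":"task:123","beat_index":1234,"state":"executing",
		"beats_left":3,"progress":0.5,"notes":"fetching inputs","hlc":"7ffd:0001:beef"}`)
	c, err := parseStatusClaim(raw)
	fmt.Println(c.AgentID, c.State, err) // agent:xyz executing <nil>
}
```
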
### INT-C: BarReport (Reverb → Consumers)
```json
{
  "type": "backbeat.barreport.v1",
  "window_id": "7e9b0e6c4c9a4e59b7f2d9a3c1b2e4d5",
  "from_beat": 240,
  "to_beat": 359,
  "agents_reporting": 978,
  "on_time_reviews": 842,
  "help_promises_fulfilled": 91,
  "secret_rotations_ok": true,
  "tempo_drift_ms": 7,
  "issues": []
}
```

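The `from_beat` and `to_beat` values in this example span 120 beats. One way such a window could be derived from a beat index, and how `agents_reporting` might be counted, is sketched below. It assumes zero-based beat indices and bars aligned to multiples of the bar length; `windowBounds`, `agentsReporting`, and the `claim` type are hypothetical and not taken from the Reverb implementation.

```go
package main

import "fmt"

// windowBounds returns the first and last beat of the bar containing beatIndex.
// Assumes zero-based beat indices; barLength must be positive.
func windowBounds(beatIndex, barLength int64) (from, to int64) {
	from = (beatIndex / barLength) * barLength
	return from, from + barLength - 1
}

// claim holds only the fields needed for this aggregation sketch.
type claim struct {
	AgentID   string
	BeatIndex int64
}

// agentsReporting counts distinct agents with at least one claim in [from, to].
func agentsReporting(claims []claim, from, to int64) int {
	seen := make(map[string]struct{})
	for _, c := range claims {
		if c.BeatIndex >= from && c.BeatIndex <= to {
			seen[c.AgentID] = struct{}{}
		}
	}
	return len(seen)
}

func main() {
	from, to := windowBounds(300, 120)
	fmt.Println(from, to) // 240 359

	claims := []claim{
		{"agent:a", 240},
		{"agent:a", 250}, // same agent, counted once
		{"agent:b", 359},
		{"agent:c", 360}, // belongs to the next window
	}
	fmt.Println(agentsReporting(claims, from, to)) // 2
}
```
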
## API Endpoints

### Pulse Service
- `GET /health` - Health check
- `GET /ready` - Readiness check
- `GET /metrics` - Prometheus metrics
- `POST /api/v1/tempo` - Change tempo
- `GET /api/v1/status` - Service status

### Reverb Service
- `GET /health` - Health check
- `GET /ready` - Readiness check
- `GET /metrics` - Prometheus metrics
- `GET /api/v1/windows` - List active windows
- `GET /api/v1/windows/{id}` - Get window details
- `GET /api/v1/status` - Service status

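The health and readiness endpoints can be polled from Go with the standard library. The sketch below assumes that both endpoints signal success with HTTP 200, which this document does not state explicitly. The `probe` and `newStubServer` helpers are hypothetical; the stub stands in for a running service so the example works without the stack, and in practice the base URL would be something like `http://localhost:8082` for Reverb.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// client uses a short timeout so a hung service does not block the caller.
var client = &http.Client{Timeout: 2 * time.Second}

// probe issues a GET and reports whether the endpoint answered 200 OK.
// Assumption: /health and /ready return 200 when the service is healthy.
func probe(baseURL, path string) (bool, error) {
	resp, err := client.Get(baseURL + path)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

// newStubServer returns an in-process stand-in that serves only the two
// probe endpoints; any other path yields 404.
func newStubServer() *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		switch r.URL.Path {
		case "/health", "/ready":
			w.WriteHeader(http.StatusOK)
		default:
			http.NotFound(w, r)
		}
	}))
}

func main() {
	stub := newStubServer()
	defer stub.Close()

	for _, path := range []string{"/health", "/ready"} {
		ok, err := probe(stub.URL, path)
		fmt.Println(path, ok, err)
	}
}
```
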
## Configuration

### Environment Variables
- `BACKBEAT_ENV` - Environment (development/production)
- `NATS_URL` - NATS server URL
- `LOG_LEVEL` - Logging level (debug/info/warn/error)

### Command Line Flags

#### Pulse Service
- `-cluster` - Cluster identifier
- `-node` - Node identifier
- `-admin-port` - HTTP admin port
- `-raft-bind` - Raft cluster bind address
- `-data-dir` - Data directory
- `-nats` - NATS server URL

#### Reverb Service
- `-cluster` - Cluster identifier
- `-node` - Node identifier
- `-nats` - NATS server URL
- `-bar-length` - Bar length in beats
- `-log-level` - Log level

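For readers wiring up a similar service, the sketch below shows one way the Reverb flags and environment variables listed above could be combined using Go's standard `flag` package. The precedence (flag over environment variable over built-in default), the default values, and the rule that `-cluster` is required are all assumptions made for this example; `reverbConfig`, `envOr`, and `parseReverbFlags` are hypothetical names.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
	"os"
)

// reverbConfig collects the Reverb flags listed above.
type reverbConfig struct {
	Cluster   string
	Node      string
	NATSURL   string
	BarLength int
	LogLevel  string
}

// envOr returns the value of key, or fallback when it is unset or empty.
func envOr(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return fallback
}

// parseReverbFlags parses args; environment variables supply flag defaults.
// The default values used here are assumptions, not the service's real defaults.
func parseReverbFlags(args []string) (reverbConfig, error) {
	var cfg reverbConfig
	fs := flag.NewFlagSet("reverb", flag.ContinueOnError)
	fs.StringVar(&cfg.Cluster, "cluster", "", "cluster identifier")
	fs.StringVar(&cfg.Node, "node", "", "node identifier")
	fs.StringVar(&cfg.NATSURL, "nats", envOr("NATS_URL", "nats://localhost:4222"), "NATS server URL")
	fs.IntVar(&cfg.BarLength, "bar-length", 120, "bar length in beats")
	fs.StringVar(&cfg.LogLevel, "log-level", envOr("LOG_LEVEL", "info"), "log level")
	if err := fs.Parse(args); err != nil {
		return reverbConfig{}, err
	}
	if cfg.Cluster == "" {
		return reverbConfig{}, errors.New("-cluster is required")
	}
	if cfg.BarLength <= 0 {
		return reverbConfig{}, errors.New("-bar-length must be positive")
	}
	return cfg, nil
}

func main() {
	cfg, err := parseReverbFlags([]string{"-cluster=test-cluster", "-nats=nats://localhost:4222"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", cfg)
}
```
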
## Monitoring

### Key Metrics

**Pulse Service:**
- `backbeat_beats_total` - Total beats published
- `backbeat_pulse_jitter_seconds` - Beat timing jitter
- `backbeat_is_leader` - Leadership status
- `backbeat_current_tempo_bpm` - Current tempo

**Reverb Service:**
- `backbeat_reverb_agents_reporting` - Agents in current window
- `backbeat_reverb_on_time_reviews` - On-time task completions
- `backbeat_reverb_windows_completed_total` - Total windows processed
- `backbeat_reverb_window_processing_seconds` - Window processing time

### Performance SLOs

The system tracks compliance with performance requirements:
- Beat delivery latency p95 ≤ 100ms
- Pulse jitter p95 ≤ 20ms
- Reverb processing ≤ 1 beat duration
- Timer drift ≤ 1% over 1 hour

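To illustrate what a p95 target means in practice, the sketch below computes a nearest-rank percentile over a set of latency samples and compares it with the 100ms delivery limit. This is an offline illustration using made-up sample values, not how the services compute their SLO metrics; `percentile` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (1 <= p <= 100) of
// samples, or 0 when there are no samples. The input slice is not modified.
func percentile(samples []int, p int) int {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]int(nil), samples...)
	sort.Ints(sorted)
	rank := (p*len(sorted) + 99) / 100 // ceil(p/100 * n)
	if rank < 1 {
		rank = 1
	}
	return sorted[rank-1]
}

func main() {
	// Illustrative delivery latencies in milliseconds: 5, 10, ..., 100.
	var latenciesMs []int
	for i := 1; i <= 20; i++ {
		latenciesMs = append(latenciesMs, i*5)
	}
	p95 := percentile(latenciesMs, 95)
	fmt.Println(p95)        // 95
	fmt.Println(p95 <= 100) // true: within the 100ms delivery target
}
```

A p95 target tolerates the slowest 5% of samples, so a single 100ms outlier among 20 samples does not by itself violate the limit.
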
## Development

### Build Requirements
- Go 1.22+
- Docker & Docker Compose
- Make

### Development Workflow
```bash
# Format, vet, test, and build
make dev

# Run full CI pipeline
make ci

# Build for production
make production
```

### Testing
```bash
# Run tests
make test

# Run with race detection
go test -race ./...

# Run specific test suites
go test ./internal/backbeat -v
```

## Production Deployment

### Docker Images
The multi-stage Dockerfile produces separate images for each service:
- `backbeat-pulse:v1.0.0` - Pulse service
- `backbeat-reverb:v1.0.0` - Reverb service
- `backbeat-agent-sim:v1.0.0` - Agent simulator

### Kubernetes Deployment
```bash
# Build and push images
make docker-push VERSION=v1.0.0

# Deploy to Kubernetes (example)
kubectl apply -f k8s/
```

### Docker Swarm Deployment
```bash
# Build images
make docker

# Deploy stack
docker stack deploy -c docker-compose.swarm.yml backbeat
```

## Troubleshooting

### Common Issues

1. **NATS Connection Failed**
   - Verify NATS server is running
   - Check network connectivity
   - Verify NATS URL configuration

2. **Leader Election Issues**
   - Check Raft logs for cluster formation
   - Verify peer connectivity on Raft ports
   - Ensure persistent storage is available

3. **Missing StatusClaims**
   - Verify agents are publishing to correct NATS subjects
   - Check StatusClaim validation errors in reverb logs
   - Monitor `backbeat_reverb_claims_processed_total` metric

### Log Analysis
```bash
# Follow reverb service logs
docker-compose logs -f reverb

# Search for specific window processing
docker-compose logs reverb | grep "window_id=abc123"

# Monitor performance metrics
curl http://localhost:8082/metrics | grep backbeat_reverb
```

## License

This is prototype software for the CHORUS platform. See licensing documentation for details.

## Support

For issues and questions, please refer to the CHORUS platform documentation or contact the development team.