BACKBEAT/README-IMPLEMENTATION.md
2025-10-17 08:56:25 +11:00

BACKBEAT Pulse Service Implementation

Overview

This is the complete implementation of the BACKBEAT pulse service based on the architectural requirements for CHORUS 2.0.0. The service provides foundational timing coordination for the distributed ecosystem with production-grade leader election, hybrid logical clocks, and comprehensive observability.

Architecture

The implementation consists of several key components:

Core Components

  1. Leader Election System (internal/backbeat/leader.go)

    • Implements BACKBEAT-REQ-001 using HashiCorp Raft consensus
    • Pluggable strategy with automatic failover
    • Single BeatFrame publisher per cluster guarantee
  2. Hybrid Logical Clock (internal/backbeat/hlc.go)

    • Provides ordering guarantees for distributed events
    • Supports reconciliation after network partitions
    • Format: unix_ms_hex:logical_counter_hex:node_id_suffix
  3. BeatFrame Generator (cmd/pulse/main.go)

    • Implements BACKBEAT-REQ-002 (INT-A BeatFrame emission)
    • Publishes structured beat events to NATS
    • Includes HLC, beat_index, downbeat, phase, deadline_at, tempo_bpm
  4. Degradation Manager (internal/backbeat/degradation.go)

    • Implements BACKBEAT-REQ-003 (local tempo derivation)
    • Manages partition tolerance with drift monitoring
    • BACKBEAT-PER-003 compliance (≤1% drift over 1 hour)
  5. Admin API Server (internal/backbeat/admin.go)

    • HTTP endpoints for operational control
    • Tempo management with BACKBEAT-REQ-004 validation
    • Health checks, drift monitoring, leader status
  6. Metrics & Observability (internal/backbeat/metrics.go)

    • Prometheus metrics for all performance requirements
    • Comprehensive monitoring of timing accuracy
    • Performance requirement tracking
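
To make the hybrid logical clock format concrete, the following is a minimal sketch of an HLC that emits unix_ms_hex:logical_counter_hex:node_id_suffix strings. It is not the code in internal/backbeat/hlc.go: the type and method names, the four-digit counter padding, and the choice of the last four characters of the node ID as the suffix are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// HLC is a hypothetical hybrid logical clock; names here are illustrative
// and are not taken from internal/backbeat/hlc.go.
type HLC struct {
	mu       sync.Mutex
	physical int64        // highest wall-clock time seen, in Unix milliseconds
	logical  uint32       // orders events that share the same physical value
	nodeID   string       // used only for the trailing suffix
	now      func() int64 // injectable wall clock, Unix milliseconds
}

func NewHLC(nodeID string) *HLC {
	return &HLC{nodeID: nodeID, now: func() int64 { return time.Now().UnixMilli() }}
}

// Next stamps a local event.
func (h *HLC) Next() string {
	h.mu.Lock()
	defer h.mu.Unlock()
	if wall := h.now(); wall > h.physical {
		h.physical, h.logical = wall, 0
	} else {
		h.logical++ // wall clock stalled or stepped back: order by counter
	}
	return h.format()
}

// Observe merges a timestamp received from another node, which is what
// allows ordering to be reconciled once a partition heals.
func (h *HLC) Observe(remotePhysical int64, remoteLogical uint32) string {
	h.mu.Lock()
	defer h.mu.Unlock()
	wall := h.now()
	switch {
	case wall > h.physical && wall > remotePhysical:
		h.physical, h.logical = wall, 0
	case remotePhysical > h.physical:
		h.physical, h.logical = remotePhysical, remoteLogical+1
	case remotePhysical == h.physical:
		if remoteLogical > h.logical {
			h.logical = remoteLogical
		}
		h.logical++
	default:
		h.logical++
	}
	return h.format()
}

// format renders unix_ms_hex:logical_counter_hex:node_id_suffix.
// Assumption: the suffix is the last four characters of the node ID.
func (h *HLC) format() string {
	suffix := h.nodeID
	if len(suffix) > 4 {
		suffix = suffix[len(suffix)-4:]
	}
	return fmt.Sprintf("%x:%04x:%s", h.physical, h.logical, suffix)
}

func main() {
	clock := NewHLC("pulse-abc123")
	fmt.Println(clock.Next())
	fmt.Println(clock.Observe(time.Now().UnixMilli()+5, 7))
}
```

The injectable now function exists only so the ordering logic can be exercised with a fixed clock.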

Requirements Implementation

BACKBEAT-REQ-001: Pulse Leader

Implemented: Leader election using Raft consensus algorithm

  • Single leader publishes BeatFrames per cluster
  • Automatic failover with consistent leadership
  • Pluggable strategy (currently Raft, extensible)

BACKBEAT-REQ-002: BeatFrame Emit

Implemented: INT-A compliant BeatFrame publishing

{
  "type": "backbeat.beatframe.v1",
  "cluster_id": "string", 
  "beat_index": 0,
  "downbeat": false,
  "phase": "plan",
  "hlc": "7ffd:0001:abcd",
  "deadline_at": "2025-09-04T12:00:00Z", 
  "tempo_bpm": 120,
  "window_id": "deterministic_sha256_hash"
}
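
A consumer might decode this payload along the following lines. This is a sketch only: the JSON field names come from the example above, but the Go struct, its field types, and the DecodeBeatFrame helper are assumptions rather than the types used in cmd/pulse/main.go.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// BeatFrame mirrors the JSON fields shown above. The Go types are assumptions.
type BeatFrame struct {
	Type       string    `json:"type"`
	ClusterID  string    `json:"cluster_id"`
	BeatIndex  int64     `json:"beat_index"`
	Downbeat   bool      `json:"downbeat"`
	Phase      string    `json:"phase"`
	HLC        string    `json:"hlc"`
	DeadlineAt time.Time `json:"deadline_at"` // RFC 3339 timestamp
	TempoBPM   int       `json:"tempo_bpm"`
	WindowID   string    `json:"window_id"`
}

const beatFrameType = "backbeat.beatframe.v1"

// DecodeBeatFrame (hypothetical helper) parses a frame and rejects payloads
// that carry a different message type.
func DecodeBeatFrame(data []byte) (BeatFrame, error) {
	var f BeatFrame
	if err := json.Unmarshal(data, &f); err != nil {
		return BeatFrame{}, fmt.Errorf("decode beatframe: %w", err)
	}
	if f.Type != beatFrameType {
		return BeatFrame{}, fmt.Errorf("unexpected message type %q", f.Type)
	}
	return f, nil
}

func main() {
	raw := []byte(`{
	  "type": "backbeat.beatframe.v1",
	  "cluster_id": "chorus-aus-01",
	  "beat_index": 0,
	  "downbeat": false,
	  "phase": "plan",
	  "hlc": "7ffd:0001:abcd",
	  "deadline_at": "2025-09-04T12:00:00Z",
	  "tempo_bpm": 120,
	  "window_id": "deterministic_sha256_hash"
	}`)
	frame, err := DecodeBeatFrame(raw)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("beat %d phase=%s deadline=%s\n",
		frame.BeatIndex, frame.Phase, frame.DeadlineAt.Format(time.RFC3339))
}
```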

BACKBEAT-REQ-003: Degrade Local

Implemented: Partition tolerance with local tempo derivation

  • Followers maintain local timing when leader is lost
  • HLC-based reconciliation when leader returns
  • Drift monitoring and alerting

BACKBEAT-REQ-004: Tempo Change Rules

Implemented: Downbeat-gated tempo changes with delta limits

  • Changes only applied on next downbeat
  • ≤±10% delta validation
  • Admin API with validation and scheduling
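
The two rules above can be sketched as a small state holder: a request is validated immediately, but only takes effect on the next downbeat. This is an illustration, not the admin.go implementation. It assumes the 10% limit is measured against the current tempo, and the Tempo type and its methods are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	ErrOutOfRange    = errors.New("tempo outside configured min/max BPM")
	ErrDeltaTooLarge = errors.New("tempo change exceeds 10% of current tempo")
)

// Tempo tracks the active tempo and the value waiting for the next downbeat.
type Tempo struct {
	Current  int
	Pending  int // equals Current when no change is scheduled
	Min, Max int
}

// Request validates a tempo change and schedules it; it never changes Current.
func (tp *Tempo) Request(bpm int) error {
	if bpm < tp.Min || bpm > tp.Max {
		return ErrOutOfRange
	}
	delta := bpm - tp.Current
	if delta < 0 {
		delta = -delta
	}
	// |delta| / Current <= 0.10, written in integer arithmetic.
	if delta*10 > tp.Current {
		return ErrDeltaTooLarge
	}
	tp.Pending = bpm
	return nil
}

// OnBeat applies the pending tempo only when the beat is a downbeat and
// returns the tempo in effect for that beat.
func (tp *Tempo) OnBeat(downbeat bool) int {
	if downbeat {
		tp.Current = tp.Pending
	}
	return tp.Current
}

func main() {
	tempo := &Tempo{Current: 120, Pending: 120, Min: 60, Max: 240}
	fmt.Println("request 130:", tempo.Request(130))
	fmt.Println("mid-bar beat:", tempo.OnBeat(false))
	fmt.Println("downbeat:", tempo.OnBeat(true))
	fmt.Println("request 150:", tempo.Request(150))
}
```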

BACKBEAT-REQ-005: Window ID

Implemented: Deterministic window ID generation

window_id = hex(sha256(cluster_id + ":" + downbeat_beat_index))[0:32]
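
In Go, the formula above could look like the sketch below. The function name is hypothetical, and rendering the downbeat beat index as a decimal string before hashing is an assumption; the formula does not say how the index is serialized.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// WindowID derives a deterministic 32-character identifier from the cluster
// ID and the beat index of the window's downbeat.
// Assumption: the index is formatted in base 10 before hashing.
func WindowID(clusterID string, downbeatBeatIndex int64) string {
	input := fmt.Sprintf("%s:%d", clusterID, downbeatBeatIndex)
	sum := sha256.Sum256([]byte(input))
	return hex.EncodeToString(sum[:])[:32] // first 32 hex chars = 16 bytes
}

func main() {
	fmt.Println(WindowID("chorus-aus-01", 0))
	fmt.Println(WindowID("chorus-aus-01", 8))
}
```

Because the inputs are the same on every node, any node can recompute the window ID for a given downbeat without coordination.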

Performance Requirements

BACKBEAT-PER-001: End-to-End Delivery

Target: p95 ≤ 100ms at 2Hz

  • Comprehensive latency monitoring
  • NATS optimization for low latency
  • Metrics: backbeat_beat_delivery_latency_seconds

BACKBEAT-PER-002: Pulse Jitter

Target: p95 ≤ 20ms

  • High-resolution timing measurement
  • Jitter calculation and monitoring
  • Metrics: backbeat_pulse_jitter_seconds
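
As an illustration of what the jitter target measures, the sketch below treats jitter as the absolute difference between each observed beat interval and the interval implied by the tempo, then takes a nearest-rank p95. The service itself reports this through the Prometheus metric above; the function names and sample timestamps here are made up for the example.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// BeatInterval converts a tempo to the expected gap between beats
// (120 BPM gives 500ms, i.e. 2Hz).
func BeatInterval(bpm int) time.Duration {
	if bpm <= 0 {
		return 0
	}
	return time.Minute / time.Duration(bpm)
}

// Jitter returns |observed interval - expected interval| per consecutive pair.
func Jitter(beats []time.Time, expected time.Duration) []time.Duration {
	var out []time.Duration
	for i := 1; i < len(beats); i++ {
		d := beats[i].Sub(beats[i-1]) - expected
		if d < 0 {
			d = -d
		}
		out = append(out, d)
	}
	return out
}

// P95 returns the nearest-rank 95th percentile of the samples.
func P95(samples []time.Duration) time.Duration {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	rank := (95*len(sorted) + 99) / 100 // ceil(0.95 * n)
	return sorted[rank-1]
}

// sampleBeats returns illustrative publish times for a 120 BPM pulse.
func sampleBeats() []time.Time {
	start := time.Unix(0, 0)
	offsets := []time.Duration{0, 500 * time.Millisecond, 1010 * time.Millisecond, 1500 * time.Millisecond}
	beats := make([]time.Time, len(offsets))
	for i, off := range offsets {
		beats[i] = start.Add(off)
	}
	return beats
}

func main() {
	p95 := P95(Jitter(sampleBeats(), BeatInterval(120)))
	fmt.Println("p95 jitter:", p95, "within target:", p95 <= 20*time.Millisecond)
}
```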

BACKBEAT-PER-003: Timer Drift

Target: ≤1% over 1 hour without leader

  • Continuous drift monitoring
  • Degradation mode with local derivation
  • Automatic alerting on threshold violations
  • Metrics: backbeat_timer_drift_ratio
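
One way to read the 1% target is as a ratio between elapsed time on the local timer and elapsed time on a reference clock. The sketch below assumes that definition; the function names are hypothetical and the degradation manager may compute drift differently.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// maxDriftRatio is the BACKBEAT-PER-003 limit: at most 1% drift.
const maxDriftRatio = 0.01

// DriftRatio compares time elapsed on the local timer with time elapsed on a
// reference clock and returns the relative difference (0.01 means 1%).
func DriftRatio(reference, local time.Duration) float64 {
	if reference <= 0 {
		return 0
	}
	return math.Abs(float64(local-reference)) / float64(reference)
}

// WithinLimits reports whether the drift stays inside the 1% budget.
func WithinLimits(reference, local time.Duration) bool {
	return DriftRatio(reference, local) <= maxDriftRatio
}

func main() {
	reference := time.Hour
	local := time.Hour + 18*time.Second // local timer ran 18s long
	ratio := DriftRatio(reference, local)
	fmt.Printf("drift %.2f%% within limits: %t\n", ratio*100, WithinLimits(reference, local))
}
```

With these inputs, 18 seconds of drift over one hour is 0.5%, so the budget of 1% allows up to 36 seconds per hour.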

API Endpoints

Admin API (Port 8080)

GET /tempo

Returns current and pending tempo information:

{
  "current_bpm": 120,
  "pending_bpm": 120,
  "can_change": true,
  "next_change": "2025-09-04T12:00:00Z",
  "reason": ""
}

POST /tempo

Changes tempo with validation:

{
  "tempo_bpm": 130,
  "justification": "workload increase"
}
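
A Go client could submit this request roughly as follows. This is a sketch under assumptions: the struct and helper names are hypothetical, the Content-Type header is assumed to be expected by the server, and the response body is not decoded because its format is not documented here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

// TempoChange mirrors the POST /tempo request body shown above.
type TempoChange struct {
	TempoBPM      int    `json:"tempo_bpm"`
	Justification string `json:"justification"`
}

// NewTempoRequest (hypothetical helper) builds the POST /tempo request.
func NewTempoRequest(baseURL string, change TempoChange) (*http.Request, error) {
	body, err := json.Marshal(change)
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, baseURL+"/tempo", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json") // assumed requirement
	return req, nil
}

func main() {
	req, err := NewTempoRequest("http://localhost:8080",
		TempoChange{TempoBPM: 130, Justification: "workload increase"})
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	defer resp.Body.Close()
	fmt.Println("status:", resp.Status)
}
```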

GET /drift

Returns drift monitoring information:

{
  "timer_drift_percent": 0.5,
  "hlc_drift_seconds": 1.2,
  "last_sync_time": "2025-09-04T11:59:00Z",
  "degradation_mode": false,
  "within_limits": true
}

GET /leader

Returns leadership information:

{
  "node_id": "pulse-abc123",
  "is_leader": true,
  "leader": "127.0.0.1:9000",
  "cluster_size": 2,
  "stats": { ... }
}

Health & Monitoring

  • GET /health - Overall service health
  • GET /ready - Kubernetes readiness probe
  • GET /live - Kubernetes liveness probe
  • GET /metrics - Prometheus metrics endpoint

Deployment

Development (Single Node)

make build
make dev

Cluster Development

make cluster
# Starts leader on :8080, follower on :8081

Production (Docker Compose)

docker-compose up -d

This starts:

  • NATS message broker
  • 2-node BACKBEAT pulse cluster
  • Prometheus metrics collection
  • Grafana dashboards
  • Health monitoring

Production (Docker Swarm)

docker stack deploy -c docker-compose.swarm.yml backbeat

Configuration

Command Line Options

-cluster string      Cluster identifier (default "chorus-aus-01")
-node-id string      Node identifier (auto-generated if empty)
-bpm int             Initial tempo in BPM (default 12)
-bar int             Beats per bar (default 8)
-phases string       Comma-separated phase names (default "plan,work,review")
-min-bpm int         Minimum allowed BPM (default 4)
-max-bpm int         Maximum allowed BPM (default 24)
-nats string         NATS server URL (default "nats://localhost:4222")
-admin-port int      Admin API port (default 8080)
-raft-bind string    Raft bind address (default "127.0.0.1:0")
-bootstrap bool      Bootstrap new cluster (default false)
-peers string        Comma-separated Raft peer addresses
-data-dir string     Data directory (auto-generated if empty)
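
For readers wiring up a similar binary, a subset of these options could be declared with the standard flag package as sketched below. The flag names and defaults are taken from the list above; the Config struct, the ParseFlags helper, and the range check on the initial tempo are assumptions, not the code in cmd/pulse/main.go.

```go
package main

import (
	"flag"
	"fmt"
	"os"
	"strings"
)

// Config holds a subset of the pulse options (hypothetical struct).
type Config struct {
	Cluster                  string
	BPM, Bar, MinBPM, MaxBPM int
	Phases                   []string
	NATSURL                  string
	AdminPort                int
}

// ParseFlags parses args (without the program name) into a Config.
func ParseFlags(args []string) (Config, error) {
	var cfg Config
	var phases string
	fs := flag.NewFlagSet("pulse", flag.ContinueOnError)
	fs.StringVar(&cfg.Cluster, "cluster", "chorus-aus-01", "Cluster identifier")
	fs.IntVar(&cfg.BPM, "bpm", 12, "Initial tempo in BPM")
	fs.IntVar(&cfg.Bar, "bar", 8, "Beats per bar")
	fs.StringVar(&phases, "phases", "plan,work,review", "Comma-separated phase names")
	fs.IntVar(&cfg.MinBPM, "min-bpm", 4, "Minimum allowed BPM")
	fs.IntVar(&cfg.MaxBPM, "max-bpm", 24, "Maximum allowed BPM")
	fs.StringVar(&cfg.NATSURL, "nats", "nats://localhost:4222", "NATS server URL")
	fs.IntVar(&cfg.AdminPort, "admin-port", 8080, "Admin API port")
	if err := fs.Parse(args); err != nil {
		return Config{}, err
	}
	cfg.Phases = strings.Split(phases, ",")
	// Assumed sanity check: the initial tempo must sit inside the allowed range.
	if cfg.MinBPM > cfg.MaxBPM || cfg.BPM < cfg.MinBPM || cfg.BPM > cfg.MaxBPM {
		return Config{}, fmt.Errorf("bpm %d outside allowed range [%d, %d]",
			cfg.BPM, cfg.MinBPM, cfg.MaxBPM)
	}
	return cfg, nil
}

func main() {
	cfg, err := ParseFlags(os.Args[1:])
	if err != nil {
		fmt.Fprintln(os.Stderr, "config error:", err)
		os.Exit(2)
	}
	fmt.Printf("%+v\n", cfg)
}
```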

Environment Variables

  • BACKBEAT_LOG_LEVEL - Log level (debug, info, warn, error)
  • BACKBEAT_DATA_DIR - Data directory override
  • BACKBEAT_CLUSTER_ID - Cluster ID override

Monitoring

Key Metrics

  • backbeat_beat_publish_duration_seconds - Beat publishing latency
  • backbeat_pulse_jitter_seconds - Timing jitter (BACKBEAT-PER-002)
  • backbeat_timer_drift_ratio - Timer drift as a ratio (BACKBEAT-PER-003)
  • backbeat_is_leader - Leadership status
  • backbeat_beats_total - Total beats published
  • backbeat_tempo_change_errors_total - Failed tempo changes

Alerts

Configure alerts for:

  • Pulse jitter p95 > 20ms
  • Timer drift > 1%
  • Leadership changes
  • Degradation mode active > 5 minutes
  • NATS connection losses

Testing

API Testing

make test-all

Tests all admin endpoints with sample requests.

Load Testing

# Monitor metrics during load
watch 'curl -s http://localhost:8080/metrics | grep backbeat_pulse_jitter'

Chaos Engineering

  • Network partitions between nodes
  • NATS broker restart
  • Leader node termination
  • Clock drift simulation

Integration

NATS Subjects

  • backbeat.{cluster}.beat - BeatFrame publications
  • backbeat.{cluster}.control - Legacy control messages (backward compatibility)
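
Subscribers need to substitute the cluster ID into these subject patterns. The helper below is a small sketch of that step using only the standard library; the function name is hypothetical. It rejects cluster IDs that would not form a single NATS subject token, since "." separates tokens and "*" and ">" are wildcards.

```go
package main

import (
	"fmt"
	"strings"
)

// Subject builds "backbeat.{cluster}.{kind}", where kind is "beat" or
// "control". It refuses cluster IDs that would change the subject's shape.
func Subject(cluster, kind string) (string, error) {
	if cluster == "" || strings.ContainsAny(cluster, ".*> \t\r\n") {
		return "", fmt.Errorf("cluster ID %q is not a single NATS subject token", cluster)
	}
	if kind != "beat" && kind != "control" {
		return "", fmt.Errorf("unknown subject kind %q", kind)
	}
	return "backbeat." + cluster + "." + kind, nil
}

func main() {
	for _, kind := range []string{"beat", "control"} {
		subject, err := Subject("chorus-aus-01", kind)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(subject)
	}
}
```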

Service Discovery

  • Raft handles internal cluster membership
  • External services discover via NATS subjects
  • Health checks via HTTP endpoints

Security

Network Security

  • Raft traffic encrypted in production
  • Admin API should be behind authentication proxy
  • NATS authentication recommended

Data Security

  • No sensitive data in BeatFrames
  • Raft logs contain only operational state
  • Metrics don't expose sensitive information

Performance Tuning

NATS Configuration

max_payload: 1MB
max_connections: 10000
jetstream: enabled

Raft Configuration

HeartbeatTimeout: 1s
ElectionTimeout: 1s  
CommitTimeout: 500ms

Go Runtime

GOGC=100
# Leave GOMAXPROCS unset so the Go runtime uses all available CPUs

Troubleshooting

Common Issues

  1. Leadership flapping

    • Check network connectivity between nodes
    • Verify Raft bind addresses are reachable
    • Monitor backbeat_leadership_changes_total
  2. High jitter

    • Check system load and CPU scheduling
    • Verify Go GC tuning
    • Monitor backbeat_pulse_jitter_seconds
  3. Drift violations

    • Check NTP synchronization
    • Monitor degradation mode duration
    • Verify backbeat_timer_drift_ratio

Debug Commands

# Check leader status
curl http://localhost:8080/leader | jq

# Check drift status  
curl http://localhost:8080/drift | jq

# View Raft logs
docker logs backbeat_pulse-leader_1

# Monitor real-time metrics
curl http://localhost:8080/metrics | grep backbeat_

Future Enhancements

  1. COOEE Transport Integration - Replace NATS with COOEE for enhanced delivery
  2. Multi-Region Support - Cross-datacenter synchronization
  3. Dynamic Phase Configuration - Runtime phase definition updates
  4. Backup/Restore - Raft state backup and recovery
  5. WebSocket API - Real-time admin interface

Compliance

This implementation fully satisfies:

  • BACKBEAT-REQ-001 through BACKBEAT-REQ-005
  • BACKBEAT-PER-001 through BACKBEAT-PER-003
  • INT-A BeatFrame specification
  • Production deployment requirements
  • Observability and monitoring requirements

The service is ready for production deployment in the CHORUS 2.0.0 ecosystem.