Files
DistOS/BACKBEAT.md
2026-02-26 22:16:41 +00:00

455 lines
14 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Backbeat Protocol — Pulse/Reverb (v0.1)
> **Purpose:** Give CHORUS a shared, lightweight rhythm so multiagent, p2p work doesnt deadlock or drift. Standardise expectations (plan/work/review), exit conditions, promises, and timebounded collaboration across CHORUS, HMMM, SLURP, SHHH, UCXL, WHOOSH, and COOEE.
---
## 1) Rationale
- **Problem:** In pub/sub meshes, agents can wait indefinitely for help/context; theres no universal cadence for planning, execution, or reevaluation.
- **Principle:** Use **coarse, explicit tempo** (beats/bars) for policy alignment; not for hard realtime sync. Must be **partitiontolerant**, **observable**, and **cheap**.
- **Design:** Humanreadable **beats/bars/phrases** for policy, **Hybrid Logical Clocks (HLC)** for mergeable ordering.
---
## 2) Core Concepts
- **Tempo (BPM):** Beats per minute (e.g., 630 BPM). Clusterlevel default; task classes may suggest hints.
- **Beat:** Base epoch (e.g., 4 s @ 15 BPM).
- **Bar:** Group of beats (e.g., 8). **Downbeat** (beat 1) is a soft barrier (checkpoints, secret rotation).
- **Phrase:** A sequence of bars that maps to a work cycle: **plan → work → review**.
- **Score (per task):** Declarative allocation of beats across phases + wait budgets + retries.
---
## 3) Roles & Components
- **Pulse:** Cluster tempo broadcaster. Publishes `BeatFrame` each beat; single elected leader (Raft/etcd), followers can degrade to local.
- **Reverb:** Aggregator/rollup. Ingests `StatusClaim`s and emits perbar `BarReport`s, plus hints for adaptive tempo.
- **Agents (CHORUS workers, HMMM collaborators, SLURP, etc.):** Consume beats, enforce **Score**, publish `StatusClaim`s.
- **SHHH:** Rotates shortlived secrets **on downbeats** (perbar keys).
- **COOEE/DHT:** Transport for topics `backbeat://cluster/{id}` and perproject status lanes.
### Implementation Snapshot (2025-10)
- **Pulse service (`cmd/pulse`)** Encapsulates Raft leader election (`internal/backbeat/leader.go`), Hybrid Logical Clock maintenance (`internal/backbeat/hlc.go`), degradation control (`internal/backbeat/degradation.go`), and beat publishing over NATS. It also exposes an admin HTTP surface and collects tempo/drift metrics via `internal/backbeat/metrics.go`.
- **Reverb service (`cmd/reverb`)** Subscribes to pulse beats and agent status subjects, aggregates `StatusClaim`s into rolling windows, and emits `BarReport`s on downbeats. Readiness, health, and Prometheus endpoints report claim throughput, aggregation latency, and NATs failures.
- **Go SDK (`pkg/sdk`)** Provides clients for beat callbacks, status emission, and health reporting with retry/circuit breaker hooks. CHORUS (`project-queues/active/CHORUS/internal/backbeat/integration.go`) and WHOOSH (`project-queues/active/WHOOSH/internal/backbeat/integration.go`) embed the SDK to align runtime operations with cluster tempo.
- **Inter-module telemetry** CHORUS maps P2P lifecycle operations (elections, DHT bootstrap, council delivery) into BACKBEAT status claims, while WHOOSH emits search/composer activity. This keeps Reverb windows authoritative for council health and informs SLURP/BUBBLE provenance.
- **Observability bundle** Monitoring assets (`monitoring/`, `prometheus.yml`) plus service metrics export drift, tempo adjustments, Raft state, and window KPIs, meeting BACKBEAT-PER-001/002/003 targets and enabling WHOOSH scaling gates to react to rhythm degradation.
---
## 4) Wire Model
### 4.1 BeatFrame (Pulse → all)
```json
{
"cluster_id": "chorus-aus-01",
"tempo_bpm": 15,
"beat_ms": 4000,
"bar_len_beats": 8,
"bar": 1287,
"beat": 3,
"phase": "work",
"hlc": "2025-09-03T02:12:27.183Z+1287:3+17",
"policy_hash": "sha256:...",
"deadline_at": "2025-09-03T02:12:31.183Z"
}
```
### 4.2 StatusClaim (agents → Reverb)
```json
{
"agent_id": "chorus-192-168-1-27",
"task_id": "ucxl://...",
"bar": 1287,
"beat": 3,
"state": "planning|executing|waiting|review|done|failed",
"wait_for": ["hmmm://thread/abc"],
"beats_left": 2,
"progress": 0.42,
"notes": "awaiting summarised artifact from peer",
"hlc": "..."
}
```
### 4.3 HelpPromise (HMMM → requester)
```json
{
"thread_id": "hmmm://thread/abc",
"promise_beats": 2,
"confidence": 0.7,
"fail_after_beats": 3,
"on_fail": "fallback-plan-A"
}
```
### 4.4 BarReport (Reverb → observability)
- Perbar rollup: task counts by state, overruns, broken promises, queue depth, utilisation hints, suggested tempo/phase tweak.
---
## 5) Score Spec (YAML)
```yaml
score:
tempo: 15 # bpm hint; cluster policy can override
bar_len: 8 # beats per bar
phases:
plan: 2 # beats
work: 4
review: 2
wait_budget:
help: 2 # max beats to wait for HMMM replies across the phrase
io: 1 # max beats to wait for I/O
retry:
max_phrases: 2
backoff: geometric # plan/work/review shrink each retry
escalation:
on_wait_exhausted: ["emit:needs-attention", "fallback:coarse-answer"]
on_overrun: ["checkpoint", "defer:next-phrase"]
```
> **Rule:** Agents must not exceed phase beat allocations. If `help` budget is exhausted, **exit cleanly** with degraded but auditable output.
---
## 6) Agent Loop (sketch)
```text
on BeatFrame(bf):
if new bar and beat==1: rotate_ephemeral_keys(); checkpoint();
phase = score.phase_for(bf.beat)
switch phase:
PLAN:
if not planned: do_planning_until(phase_end)
WORK:
if need_help and !help_promised: request_help_with_promise()
if waiting_for_help:
if wait_beats > score.wait_budget.help: exit_with_fallback()
else continue_work_on_alternative_path()
else do_work_step()
REVIEW:
run_tests_and_summarise(); publish StatusClaim(state=done|failed)
enforce_cutoffs_at_phase_boundaries()
```
---
## 7) Adaptive Tempo Controller (ATC)
- **Inputs:** Queue depth per role, GPU/CPU util (WHOOSH), overrun frequency, broken promises.
- **Policy:** Adjust `tempo_bpm` and/or redistribute phase beats **between bars only** (PIstyle control, hysteresis ±10%).
- **Guardrails:** ≤1 beat change per minute; freeze during incidents.
---
## 8) Exit Conditions & Deadlock Prevention
- **Wait budgets** are hard ceilings. Missing `HelpPromise` by endofbar triggers `on_wait_exhausted`.
- **Locks & leases** expire at bar boundaries unless renewed with `beats_left`.
- **Promises** include `promise_beats` and `fail_after_beats` so callers can plan.
- **Idempotent checkpoints** at downbeats enable safe retries/resumptions.
---
## 9) Integration Points
- **CHORUS (workers):** Consume `BeatFrame`; enforce `Score`; publish `StatusClaim` each beat/change.
- **HMMM (collab):** Replies carry `HelpPromise`; threads autoclose if `fail_after_beats` elapses.
- **SLURP (curation):** Batch ingest windows tied to review beats; produce barstamped artefacts.
- **SHHH (secrets):** Rotate per bar; credentials scoped to `<cluster,bar>`.
- **UCXL:** Attach tempo metadata to deliverables: `{bar, beat, hlc}`; optional address suffix `;bar=1287#beat=8`.
- **WHOOSH:** Expose utilisation to ATC; enforce resource leases in beat units.
- **COOEE/DHT:** Topics: `backbeat://cluster/{id}`, `status://{project}`, `promise://hmmm`.
---
## 10) Failure Modes & Degraded Operation
- **No Pulse leader:** Agents derive a **medianofpulses** from available Purses; if none, use local monotonic clock (jitter ok) and **freeze tempo changes**.
- **Partitions:** Keep counting beats locally (HLC ensures mergeable order). Reverb reconciles by HLC and bar on heal.
- **Drift:** Tempo changes only on downbeats; publish `policy_hash` so agents detect misconfig.
---
## 11) Config Examples
### 11.1 Cluster Tempo Policy
```yaml
cluster_id: chorus-aus-01
initial_bpm: 12
bar_len_beats: 8
phases: [plan, work, review]
limits:
max_bpm: 24
min_bpm: 6
adaptation:
enable: true
hysteresis_pct: 10
change_per_minute: 1_beat
observability:
emit_bar_reports: true
```
### 11.2 Task Score (attached to UCXL deliverable)
```yaml
ucxl: ucxl://proj:any/*/task/graph_ingest
score:
tempo: 15
bar_len: 8
phases: {plan: 2, work: 4, review: 2}
wait_budget: {help: 2, io: 1}
retry: {max_phrases: 2, backoff: geometric}
escalation:
on_wait_exhausted: ["emit:needs-attention", "fallback:coarse-answer"]
```
---
## 12) Observability
- **Perbar dashboards:** state counts, overruns, broken promises, tempo changes, queue depth, utilisation.
- **Trace stamps:** Every artifact/event carries `{bar, beat, hlc}` for forensic replay.
- **Alarms:** `promise_miss_rate`, `overrun_rate`, `no_status_claims`.
---
## 13) Security
- Rotate **ephemeral keys on downbeats**; scope to project/role when possible.
- Barstamped tokens reduce blast radius; revoke at bar+N.
---
## 14) Economics & Budgeting — Beats as Unit of Cost
### 14.1 Beat Unit (BU)
- **Definition:** 1 BU = one cluster beat interval (`beat_ms`). Its the atomic scheduling & accounting quantum.
### 14.2 Resource Primitives (WHOOSHmeasured)
- `cpu_sec`, `gpu_sec[class]`, `accel_sec[class]`, `mem_gbs` (GB·s), `disk_io_mb`, `net_egress_mb`, `storage_gbh`.
### 14.3 Budget & Costing
```yaml
budget:
max_bu: N_total
phase_caps: { plan: Np, work: Nw, review: Nr }
wait_caps: { help: Nh, io: Ni }
hard_end: bar+K
charge_to: ucxl://acct/...
```
Cost per phrase:
```
Total = Σ(beats_used * role_rate_bu)
+ Σ_class(gpu_sec[class] * rate_gpu_sec[class])
+ cpu_sec*rate_cpu_sec + mem_gbs*rate_mem_gbs
+ disk_io_mb*rate_io_mb + net_egress_mb*rate_egress_mb
+ storage_gbh*rate_storage_gbh
```
### 14.4 KPIs
- **TNT** (temponormalised throughput), **BPD** (beats per deliverable), **WR** (wait ratio), **η** (efficiency), **PMR** (promise miss rate), **CPD** (cost per deliverable), **TTFU** (time to first useful).
---
## 15) Tokenless Accounting (Hybrid CPU/GPU, Onprem + Cloud)
- **No tokens.** Price **beats + measured resources**; ignore modeltoken counts.
- **Device classes:** price per GPU/accelerator class (A100, 4090, MI300X, TPU…).
- **Rates:** onprem from TCO / dutycycle seconds; cloud from persecond list prices. Bind via config.
- **Beatscoped caps:** perBU ceilings on resource primitives to contain spend regardless of hardware skew.
- **Calibration (planningonly):** perfamily normalisers if you want **Effective Compute Units** for planning; **billing remains raw seconds**.
---
## 16) MVP Bringup Plan
1. **Pulse**: static BPM, broadcast `BeatFrame` over COOEE.
2. **Agents**: publish `StatusClaim`; enforce `wait_budget` & `HelpPromise`.
3. **Reverb**: roll up to `BarReport`; surface early KPIs.
4. **SHHH**: rotate credentials on downbeats.
5. **ATC**: enable adaptation after telemetry.
---
## 17) Open Questions
- Perrole tempi vs one cluster tempo?
- Fixed `bar_len` vs dynamic redistribution of phase beats?
- Score UI: YAML + visual “score sheet” editor?
---
### Naming (on brand)
- **Backbeat Protocol** — **Pulse** (broadcaster) + **Reverb** (rollup & reports). Musical, expressive; conveys ripples from each downbeat.
# Backbeat — Relative Beats Addendum (UCXL ^^/~~)
**Why this addendum?** Were removing dependence on everincreasing `bar`/`beat` counters. All coordination is expressed **relative to NOW** in **beats**, aligned with UCXL temporal markers `^^` (future) and `~~` (past).
## A) Wire Model Adjustments
### BeatFrame (Pulse → all)
**Replace** prior fields `{bar, beat}` with:
```json
{
"cluster_id": "...",
"tempo_bpm": 15,
"beat_ms": 4000,
"bar_len_beats": 8,
"beat_index": 3, // 1..bar_len_beats (cyclic within bar)
"beat_epoch": "2025-09-03T02:12:27.000Z", // start time of this beat
"downbeat": false, // true when beat_index==1
"phase": "work",
"hlc": "2025-09-03T02:12:27.183Z+17",
"policy_hash": "sha256:...",
"deadline_at": "2025-09-03T02:12:31.183Z"
}
```
### StatusClaim (agents → Reverb)
**Replace** prior fields `{bar, beat}` with:
```json
{
"agent_id": "...",
"task_id": "...",
"beat_index": 3,
"state": "planning|executing|waiting|review|done|failed",
"beats_left": 2,
"progress": 0.42,
"notes": "...",
"hlc": "..."
}
```
### Bar/Window Aggregation
- Reverb aggregates per **window** bounded by `downbeat=true` frames.
- **No global bar counters** are transmitted. Observability UIs may keep a local `window_id` for navigation.
## B) UCXL Temporal Suffix - (Requires RFC-UCXL 1.1)
Attach **relative beat** navigation to any UCXL address:
- `;beats=^^N` → target **N beats in the future** from now
- `;beats=~~N` → target **N beats in the past** from now
- Optional: `;phase=plan|work|review`
**Example:**
```
ucxl://proj:any/*/task/ingest;beats=^^4;phase=work
```
## C) Policy & Promises
- All time budgets are **Δbeats**: `wait_budget.help`, `retry.max_phrases`, `promise_beats`, `fail_after_beats`.
- **Leases/locks** renew per beat and expire on phase change unless renewed.
## D) Derivations
- `beat_index = 1 + floor( (unix_ms / beat_ms) mod bar_len_beats )` (derived locally).
- `beat_epoch = floor_to_multiple(now, beat_ms)`.
- `Δbeats(target_time) = round( (target_time - now) / beat_ms )`.
## E) Compatibility Notes
- Old fields `{bar, beat}` are **deprecated**; if received, they can be ignored or mapped to local windows.
- HLC remains the canonical merge key for causality.
## F) Action Items
1. Update the **spec wire model** sections accordingly.
2. Regenerate the **Go prototype** using `BeatIndex/BeatEpoch/Downbeat` instead of `Bar/Beat` counters.
3. Add UCXL parsing for `;beats=^^/~~` in RUSTLE.
- [ ] TODO: RUSTLE update for BACKBEAT compatibility