Upload files to "/"

Commit 6788dd741c (parent 99238f12d7), 2026-02-26 22:16:41 +00:00
5 changed files with 2443 additions and 0 deletions

BACKBEAT.md (new file, 454 lines)
# Backbeat Protocol — Pulse/Reverb (v0.1)
> **Purpose:** Give CHORUS a shared, lightweight rhythm so multi-agent, p2p work doesn't deadlock or drift. Standardise expectations (plan/work/review), exit conditions, promises, and time-bounded collaboration across CHORUS, HMMM, SLURP, SHHH, UCXL, WHOOSH, and COOEE.
---
## 1) Rationale
- **Problem:** In pub/sub meshes, agents can wait indefinitely for help/context; there's no universal cadence for planning, execution, or re-evaluation.
- **Principle:** Use **coarse, explicit tempo** (beats/bars) for policy alignment; not for hard real-time sync. Must be **partition-tolerant**, **observable**, and **cheap**.
- **Design:** Human-readable **beats/bars/phrases** for policy, **Hybrid Logical Clocks (HLC)** for mergeable ordering.
---
## 2) Core Concepts
- **Tempo (BPM):** Beats per minute (e.g., 6–30 BPM). Cluster-level default; task classes may suggest hints.
- **Beat:** Base epoch (e.g., 4 s @ 15 BPM).
- **Bar:** Group of beats (e.g., 8). **Downbeat** (beat 1) is a soft barrier (checkpoints, secret rotation).
- **Phrase:** A sequence of bars that maps to a work cycle: **plan → work → review**.
- **Score (per task):** Declarative allocation of beats across phases + wait budgets + retries.
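The tempo arithmetic implied above can be sketched in Go (values illustrative; `beatMS` is a hypothetical helper, not an SDK function):

```go
package main

import "fmt"

// beatMS converts a tempo in BPM to the beat interval in milliseconds.
func beatMS(bpm int) int { return 60000 / bpm }

func main() {
	bpm, barLen := 15, 8
	fmt.Printf("beat: %d ms\n", beatMS(bpm))            // 4000 ms at 15 BPM
	fmt.Printf("bar:  %d s\n", beatMS(bpm)*barLen/1000) // 32 s per 8-beat bar
}
```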
---
## 3) Roles & Components
- **Pulse:** Cluster tempo broadcaster. Publishes `BeatFrame` each beat; single elected leader (Raft/etcd), followers can degrade to local.
- **Reverb:** Aggregator/rollup. Ingests `StatusClaim`s and emits per-bar `BarReport`s, plus hints for adaptive tempo.
- **Agents (CHORUS workers, HMMM collaborators, SLURP, etc.):** Consume beats, enforce **Score**, publish `StatusClaim`s.
- **SHHH:** Rotates short-lived secrets **on downbeats** (per-bar keys).
- **COOEE/DHT:** Transport for topics `backbeat://cluster/{id}` and per-project status lanes.
### Implementation Snapshot (2025-10)
- **Pulse service (`cmd/pulse`):** Encapsulates Raft leader election (`internal/backbeat/leader.go`), Hybrid Logical Clock maintenance (`internal/backbeat/hlc.go`), degradation control (`internal/backbeat/degradation.go`), and beat publishing over NATS. It also exposes an admin HTTP surface and collects tempo/drift metrics via `internal/backbeat/metrics.go`.
- **Reverb service (`cmd/reverb`):** Subscribes to pulse beats and agent status subjects, aggregates `StatusClaim`s into rolling windows, and emits `BarReport`s on downbeats. Readiness, health, and Prometheus endpoints report claim throughput, aggregation latency, and NATS failures.
- **Go SDK (`pkg/sdk`):** Provides clients for beat callbacks, status emission, and health reporting with retry/circuit-breaker hooks. CHORUS (`project-queues/active/CHORUS/internal/backbeat/integration.go`) and WHOOSH (`project-queues/active/WHOOSH/internal/backbeat/integration.go`) embed the SDK to align runtime operations with cluster tempo.
- **Inter-module telemetry:** CHORUS maps P2P lifecycle operations (elections, DHT bootstrap, council delivery) into BACKBEAT status claims, while WHOOSH emits search/composer activity. This keeps Reverb windows authoritative for council health and informs SLURP/BUBBLE provenance.
- **Observability bundle:** Monitoring assets (`monitoring/`, `prometheus.yml`) plus service metrics export drift, tempo adjustments, Raft state, and window KPIs, meeting BACKBEAT-PER-001/002/003 targets and enabling WHOOSH scaling gates to react to rhythm degradation.
---
## 4) Wire Model
### 4.1 BeatFrame (Pulse → all)
```json
{
"cluster_id": "chorus-aus-01",
"tempo_bpm": 15,
"beat_ms": 4000,
"bar_len_beats": 8,
"bar": 1287,
"beat": 3,
"phase": "work",
"hlc": "2025-09-03T02:12:27.183Z+1287:3+17",
"policy_hash": "sha256:...",
"deadline_at": "2025-09-03T02:12:31.183Z"
}
```
### 4.2 StatusClaim (agents → Reverb)
```json
{
"agent_id": "chorus-192-168-1-27",
"task_id": "ucxl://...",
"bar": 1287,
"beat": 3,
"state": "planning|executing|waiting|review|done|failed",
"wait_for": ["hmmm://thread/abc"],
"beats_left": 2,
"progress": 0.42,
"notes": "awaiting summarised artifact from peer",
"hlc": "..."
}
```
### 4.3 HelpPromise (HMMM → requester)
```json
{
"thread_id": "hmmm://thread/abc",
"promise_beats": 2,
"confidence": 0.7,
"fail_after_beats": 3,
"on_fail": "fallback-plan-A"
}
```
### 4.4 BarReport (Reverb → observability)
- Per-bar rollup: task counts by state, overruns, broken promises, queue depth, utilisation hints, suggested tempo/phase tweak.
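A hypothetical `BarReport` payload consistent with that rollup (the spec does not fix field names; these are illustrative only):

```json
{
  "window_id": 1287,
  "counts": {"planning": 4, "executing": 11, "waiting": 2, "done": 9, "failed": 1},
  "overruns": 1,
  "broken_promises": 0,
  "queue_depth": 17,
  "tempo_hint_bpm": 18
}
```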
---
## 5) Score Spec (YAML)
```yaml
score:
tempo: 15 # bpm hint; cluster policy can override
bar_len: 8 # beats per bar
phases:
plan: 2 # beats
work: 4
review: 2
wait_budget:
help: 2 # max beats to wait for HMMM replies across the phrase
io: 1 # max beats to wait for I/O
retry:
max_phrases: 2
backoff: geometric # plan/work/review shrink each retry
escalation:
on_wait_exhausted: ["emit:needs-attention", "fallback:coarse-answer"]
on_overrun: ["checkpoint", "defer:next-phrase"]
```
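When loading a Score, one invariant worth checking is that the phase allocation fills the bar exactly. A minimal sketch, assuming a hypothetical `Score` struct (the real SDK types may differ):

```go
package main

import "fmt"

// Score captures the phase allocation from the YAML above.
type Score struct {
	BarLen             int
	Plan, Work, Review int
}

// Validate enforces the invariant that phase beats fill the bar exactly.
func (s Score) Validate() error {
	sum := s.Plan + s.Work + s.Review
	if sum != s.BarLen {
		return fmt.Errorf("phases sum to %d beats, bar has %d", sum, s.BarLen)
	}
	return nil
}

func main() {
	s := Score{BarLen: 8, Plan: 2, Work: 4, Review: 2}
	fmt.Println(s.Validate() == nil) // true
}
```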
> **Rule:** Agents must not exceed phase beat allocations. If `help` budget is exhausted, **exit cleanly** with degraded but auditable output.
---
## 6) Agent Loop (sketch)
```text
on BeatFrame(bf):
if new bar and beat==1: rotate_ephemeral_keys(); checkpoint();
phase = score.phase_for(bf.beat)
switch phase:
PLAN:
if not planned: do_planning_until(phase_end)
WORK:
if need_help and !help_promised: request_help_with_promise()
if waiting_for_help:
if wait_beats > score.wait_budget.help: exit_with_fallback()
else continue_work_on_alternative_path()
else do_work_step()
REVIEW:
run_tests_and_summarise(); publish StatusClaim(state=done|failed)
enforce_cutoffs_at_phase_boundaries()
```
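The `score.phase_for(bf.beat)` step in the sketch reduces to a simple range check. A minimal Go version, assuming the 2/4/2 split from the Score example (`phaseFor` is hypothetical, not an SDK function):

```go
package main

import "fmt"

// phaseFor maps a 1-based beat index to its phase given the plan and work
// beat allocations; everything after plan+work is review.
func phaseFor(beat, plan, work int) string {
	switch {
	case beat <= plan:
		return "plan"
	case beat <= plan+work:
		return "work"
	default:
		return "review"
	}
}

func main() {
	for b := 1; b <= 8; b++ {
		fmt.Printf("beat %d -> %s\n", b, phaseFor(b, 2, 4))
	}
}
```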
---
## 7) Adaptive Tempo Controller (ATC)
- **Inputs:** Queue depth per role, GPU/CPU util (WHOOSH), overrun frequency, broken promises.
- **Policy:** Adjust `tempo_bpm` and/or redistribute phase beats **between bars only** (PI-style control, hysteresis ±10%).
- **Guardrails:** ≤1 beat change per minute; freeze during incidents.
---
## 8) Exit Conditions & Deadlock Prevention
- **Wait budgets** are hard ceilings. Missing `HelpPromise` by end-of-bar triggers `on_wait_exhausted`.
- **Locks & leases** expire at bar boundaries unless renewed with `beats_left`.
- **Promises** include `promise_beats` and `fail_after_beats` so callers can plan.
- **Idempotent checkpoints** at downbeats enable safe retries/resumptions.
---
## 9) Integration Points
- **CHORUS (workers):** Consume `BeatFrame`; enforce `Score`; publish `StatusClaim` each beat/change.
- **HMMM (collab):** Replies carry `HelpPromise`; threads auto-close if `fail_after_beats` elapses.
- **SLURP (curation):** Batch ingest windows tied to review beats; produce bar-stamped artefacts.
- **SHHH (secrets):** Rotate per bar; credentials scoped to `<cluster,bar>`.
- **UCXL:** Attach tempo metadata to deliverables: `{bar, beat, hlc}`; optional address suffix `;bar=1287#beat=8`.
- **WHOOSH:** Expose utilisation to ATC; enforce resource leases in beat units.
- **COOEE/DHT:** Topics: `backbeat://cluster/{id}`, `status://{project}`, `promise://hmmm`.
---
## 10) Failure Modes & Degraded Operation
- **No Pulse leader:** Agents derive a **median-of-pulses** from available Pulses; if none, use local monotonic clock (jitter ok) and **freeze tempo changes**.
- **Partitions:** Keep counting beats locally (HLC ensures mergeable order). Reverb reconciles by HLC and bar on heal.
- **Drift:** Tempo changes only on downbeats; publish `policy_hash` so agents detect misconfig.
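The median-of-pulses fallback can be sketched as a median over the beat epochs observed from reachable Pulses (a sketch only; the function name and signature are assumptions):

```go
package main

import (
	"fmt"
	"sort"
)

// medianEpoch picks the median of observed beat epochs (unix ms) from the
// Pulses still reachable. ok=false means no Pulses are visible, so the
// agent should fall back to its local monotonic clock and freeze tempo.
func medianEpoch(epochs []int64) (int64, bool) {
	if len(epochs) == 0 {
		return 0, false
	}
	s := append([]int64(nil), epochs...) // avoid mutating the caller's slice
	sort.Slice(s, func(i, j int) bool { return s[i] < s[j] })
	return s[len(s)/2], true
}

func main() {
	m, ok := medianEpoch([]int64{4000, 8000, 4000})
	fmt.Println(m, ok) // 4000 true
}
```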
---
## 11) Config Examples
### 11.1 Cluster Tempo Policy
```yaml
cluster_id: chorus-aus-01
initial_bpm: 12
bar_len_beats: 8
phases: [plan, work, review]
limits:
max_bpm: 24
min_bpm: 6
adaptation:
enable: true
hysteresis_pct: 10
change_per_minute: 1_beat
observability:
emit_bar_reports: true
```
### 11.2 Task Score (attached to UCXL deliverable)
```yaml
ucxl: ucxl://proj:any/*/task/graph_ingest
score:
tempo: 15
bar_len: 8
phases: {plan: 2, work: 4, review: 2}
wait_budget: {help: 2, io: 1}
retry: {max_phrases: 2, backoff: geometric}
escalation:
on_wait_exhausted: ["emit:needs-attention", "fallback:coarse-answer"]
```
---
## 12) Observability
- **Perbar dashboards:** state counts, overruns, broken promises, tempo changes, queue depth, utilisation.
- **Trace stamps:** Every artifact/event carries `{bar, beat, hlc}` for forensic replay.
- **Alarms:** `promise_miss_rate`, `overrun_rate`, `no_status_claims`.
---
## 13) Security
- Rotate **ephemeral keys on downbeats**; scope to project/role when possible.
- Bar-stamped tokens reduce blast radius; revoke at bar+N.
---
## 14) Economics & Budgeting — Beats as Unit of Cost
### 14.1 Beat Unit (BU)
- **Definition:** 1 BU = one cluster beat interval (`beat_ms`). It's the atomic scheduling & accounting quantum.
### 14.2 Resource Primitives (WHOOSH-measured)
- `cpu_sec`, `gpu_sec[class]`, `accel_sec[class]`, `mem_gbs` (GB·s), `disk_io_mb`, `net_egress_mb`, `storage_gbh`.
### 14.3 Budget & Costing
```yaml
budget:
max_bu: N_total
phase_caps: { plan: Np, work: Nw, review: Nr }
wait_caps: { help: Nh, io: Ni }
hard_end: bar+K
charge_to: ucxl://acct/...
```
Cost per phrase:
```
Total = Σ(beats_used * role_rate_bu)
+ Σ_class(gpu_sec[class] * rate_gpu_sec[class])
+ cpu_sec*rate_cpu_sec + mem_gbs*rate_mem_gbs
+ disk_io_mb*rate_io_mb + net_egress_mb*rate_egress_mb
+ storage_gbh*rate_storage_gbh
```
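The per-phrase cost formula applied term by term in Go (struct names and rates are illustrative, not part of the spec):

```go
package main

import "fmt"

// Usage records the measured resource primitives for one phrase.
type Usage struct {
	BeatsUsed   float64
	GPUSec      map[string]float64 // seconds per device class
	CPUSec      float64
	MemGBs      float64
	DiskIOMB    float64
	NetEgressMB float64
	StorageGBh  float64
}

// Rates holds the configured prices for each primitive.
type Rates struct {
	RoleRateBU float64
	GPUSec     map[string]float64
	CPUSec     float64
	MemGBs     float64
	IOMB       float64
	EgressMB   float64
	StorageGBh float64
}

// cost applies the cost-per-phrase formula above, term by term.
func cost(u Usage, r Rates) float64 {
	total := u.BeatsUsed * r.RoleRateBU
	for class, sec := range u.GPUSec {
		total += sec * r.GPUSec[class]
	}
	total += u.CPUSec*r.CPUSec + u.MemGBs*r.MemGBs
	total += u.DiskIOMB*r.IOMB + u.NetEgressMB*r.EgressMB
	total += u.StorageGBh * r.StorageGBh
	return total
}

func main() {
	u := Usage{BeatsUsed: 8, GPUSec: map[string]float64{"a100": 30}, CPUSec: 120}
	r := Rates{RoleRateBU: 0.01, GPUSec: map[string]float64{"a100": 0.002}, CPUSec: 0.0001}
	fmt.Printf("%.4f\n", cost(u, r)) // 8*0.01 + 30*0.002 + 120*0.0001 = 0.1520
}
```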
### 14.4 KPIs
- **TNT** (tempo-normalised throughput), **BPD** (beats per deliverable), **WR** (wait ratio), **η** (efficiency), **PMR** (promise miss rate), **CPD** (cost per deliverable), **TTFU** (time to first useful output).
---
## 15) Tokenless Accounting (Hybrid CPU/GPU, On-prem + Cloud)
- **No tokens.** Price **beats + measured resources**; ignore model-token counts.
- **Device classes:** price per GPU/accelerator class (A100, 4090, MI300X, TPU…).
- **Rates:** on-prem from TCO / duty-cycle seconds; cloud from per-second list prices. Bind via config.
- **Beat-scoped caps:** per-BU ceilings on resource primitives to contain spend regardless of hardware skew.
- **Calibration (planning-only):** per-family normalisers if you want **Effective Compute Units** for planning; **billing remains raw seconds**.
---
## 16) MVP Bring-up Plan
1. **Pulse**: static BPM, broadcast `BeatFrame` over COOEE.
2. **Agents**: publish `StatusClaim`; enforce `wait_budget` & `HelpPromise`.
3. **Reverb**: roll up to `BarReport`; surface early KPIs.
4. **SHHH**: rotate credentials on downbeats.
5. **ATC**: enable adaptation after telemetry.
---
## 17) Open Questions
- Per-role tempi vs one cluster tempo?
- Fixed `bar_len` vs dynamic redistribution of phase beats?
- Score UI: YAML + visual “score sheet” editor?
---
### Naming (on brand)
- **Backbeat Protocol** — **Pulse** (broadcaster) + **Reverb** (rollup & reports). Musical, expressive; conveys ripples from each downbeat.
# Backbeat — Relative Beats Addendum (UCXL ^^/~~)
**Why this addendum?** We're removing dependence on ever-increasing `bar`/`beat` counters. All coordination is expressed **relative to NOW** in **beats**, aligned with UCXL temporal markers `^^` (future) and `~~` (past).
## A) Wire Model Adjustments
### BeatFrame (Pulse → all)
**Replace** prior fields `{bar, beat}` with:
```json
{
"cluster_id": "...",
"tempo_bpm": 15,
"beat_ms": 4000,
"bar_len_beats": 8,
"beat_index": 3, // 1..bar_len_beats (cyclic within bar)
"beat_epoch": "2025-09-03T02:12:27.000Z", // start time of this beat
"downbeat": false, // true when beat_index==1
"phase": "work",
"hlc": "2025-09-03T02:12:27.183Z+17",
"policy_hash": "sha256:...",
"deadline_at": "2025-09-03T02:12:31.183Z"
}
```
### StatusClaim (agents → Reverb)
**Replace** prior fields `{bar, beat}` with:
```json
{
"agent_id": "...",
"task_id": "...",
"beat_index": 3,
"state": "planning|executing|waiting|review|done|failed",
"beats_left": 2,
"progress": 0.42,
"notes": "...",
"hlc": "..."
}
```
### Bar/Window Aggregation
- Reverb aggregates per **window** bounded by `downbeat=true` frames.
- **No global bar counters** are transmitted. Observability UIs may keep a local `window_id` for navigation.
## B) UCXL Temporal Suffix (requires RFC-UCXL 1.1)
Attach **relative beat** navigation to any UCXL address:
- `;beats=^^N` → target **N beats in the future** from now
- `;beats=~~N` → target **N beats in the past** from now
- Optional: `;phase=plan|work|review`
**Example:**
```
ucxl://proj:any/*/task/ingest;beats=^^4;phase=work
```
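Parsing the relative-beat suffix reduces to one pattern match. A sketch of what the RUSTLE-side parser might do (function name and return shape are assumptions):

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// beatsRe matches the relative-beat suffix: ^^ means future, ~~ means past.
var beatsRe = regexp.MustCompile(`;beats=(\^\^|~~)(\d+)`)

// parseBeats returns the signed beat offset encoded in a UCXL address,
// or ok=false when no ;beats= suffix is present.
func parseBeats(addr string) (delta int, ok bool) {
	m := beatsRe.FindStringSubmatch(addr)
	if m == nil {
		return 0, false
	}
	n, _ := strconv.Atoi(m[2]) // \d+ guarantees a valid non-negative int
	if m[1] == "~~" {
		n = -n
	}
	return n, true
}

func main() {
	d, ok := parseBeats("ucxl://proj:any/*/task/ingest;beats=^^4;phase=work")
	fmt.Println(d, ok) // 4 true
}
```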
## C) Policy & Promises
- All time budgets are **Δbeats**: `wait_budget.help`, `retry.max_phrases`, `promise_beats`, `fail_after_beats`.
- **Leases/locks** renew per beat and expire on phase change unless renewed.
## D) Derivations
- `beat_index = 1 + floor( (unix_ms / beat_ms) mod bar_len_beats )` (derived locally).
- `beat_epoch = floor_to_multiple(now, beat_ms)`.
- `Δbeats(target_time) = round( (target_time - now) / beat_ms )`.
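The three derivations above, translated directly into Go (integer division before the modulus, as the formulas require; function names are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

// beatIndex derives the 1-based cyclic index within the bar.
func beatIndex(unixMS, beatMS, barLen int64) int64 {
	return 1 + (unixMS/beatMS)%barLen
}

// beatEpoch floors a timestamp to the start of its beat interval.
func beatEpoch(unixMS, beatMS int64) int64 {
	return unixMS - unixMS%beatMS
}

// deltaBeats converts a time offset into a signed Δbeats count.
func deltaBeats(targetMS, nowMS, beatMS int64) int64 {
	return int64(math.Round(float64(targetMS-nowMS) / float64(beatMS)))
}

func main() {
	const beatMS, barLen = 4000, 8
	now := int64(1_756_857_600_000) // an arbitrary unix-ms timestamp
	fmt.Println(beatIndex(now, beatMS, barLen))
	fmt.Println(beatEpoch(now, beatMS))
	fmt.Println(deltaBeats(now+9000, now, beatMS)) // 9 s is ~2 beats ahead
}
```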
## E) Compatibility Notes
- Old fields `{bar, beat}` are **deprecated**; if received, they can be ignored or mapped to local windows.
- HLC remains the canonical merge key for causality.
## F) Action Items
1. Update the **spec wire model** sections accordingly.
2. Regenerate the **Go prototype** using `BeatIndex/BeatEpoch/Downbeat` instead of `Bar/Beat` counters.
3. Add UCXL parsing for `;beats=^^/~~` in RUSTLE.
- [ ] TODO: RUSTLE update for BACKBEAT compatibility

BUBBLE.md (new file, 38 lines)
BUBBLE is an organization that functions as a hierarchical index augmented with semantic and provenance overlays. Its primary purpose is to extract historical authority provenance, decision records, citations, and decision ancestry. These capabilities support complex queries, semantic proximity navigation, citation tracking for decisions, and namespace navigation. BUBBLE enables fast, academic-style citation tracking, semantic navigation, and provenance verification, making it a crucial component for managing decision history and authority within the system.
# Decision Records as first-class artifacts
**DRs everywhere.** Every accepted/rejected option becomes an immutable node with edges like _influenced-by_, _derived-from_, and _depends-on-constraint-X_. This enables academic-style citation tracking, semantic proximity hops, and classic namespace traversal in one surface.
# Audit in minutes, not weeks
Because BUBBLE indexes against UCXL addresses, you can reconstruct _exactly_ what a role could see at time **t** and why a path was chosen — no folklore archaeology.
**References:**
1. [KG] Entities (KG): BUBBLE (file: MASTER PLAN.md)
2. [DC] Document Chunks: "Machine-readable metadata for provenance & confidence." (file: 1)
3. [KG] Relationships (KG): BUBBLE enables citation tracking for Decision Records (file: MASTER PLAN.md)
4. [KG] Relationships (KG): BUBBLE's provenance and citation tracking support SLURP's decision processes (file: MASTER PLAN.md)
5. [KG] Relationships (KG): BUBBLE extracts and manages Decision Records with provenance graphs (file: MASTER PLAN.md)
## How does UCXL work with BUBBLE?
UCXL and BUBBLE are interconnected within the broader ecosystem of semantic context management and decision provenance. UCXL functions primarily as a protocol architecture designed for addressing, managing, and sharing semantic contextual resources. It creates a 1:1 mapping between traditional file system structures and contextual metadata, enabling distributed AI development, knowledge preservation, and sophisticated context organization. UCXL supports features such as semantic address parsing, temporal navigation, context storage, and content retrieval, functioning as a foundational framework for maintaining and querying knowledge with temporal and semantic precision [KG1].
BUBBLE, on the other hand, acts as a hierarchical index and provenance overlay system. It enables extraction of historical authority provenance, decision records, citations, and decision ancestry. BUBBLE supports semantic and provenance overlays, citation tracking, complex queries, and namespace navigation, particularly over decision data and authority provenance. It works in conjunction with systems like SLURP and facilitates fast citation tracking and semantic proximity navigation, enriching the decision-making and knowledge validation process [DC8].
The correlation between UCXL and BUBBLE is thus centered on BUBBLE leveraging UCXL's protocol and addressing capabilities to support detailed provenance, decision tracking, and semantic overlay functionalities. BUBBLE's ability to extract, index, and query decision records and provenance graphs fundamentally depends on UCXL's resource addressing, contextual metadata management, and semantic addressing features. Together, they form a cohesive system for knowledge preservation, decision validation, and context-aware reasoning within the distributed AI environment.
**References:**
[KG1] UCXL: Entities and Description, in KG file
[DC8] BUBBLE: Entity description and functions, in Document Chunks
## TODO
- Complete RocksDB storage implementation and promote to production default (replace SQLite).
- Integrate vector search (e.g., FAISS/Milvus) for similarity scoring in WalkBack.
- Implement DHT → local synchronization/replication for up-to-date decision graphs.
- Add caching layer (LRU or Redis) for frequent bundles and document cache invalidation.
- Optional: LLM-based summaries for `DecisionBundleResponse` with configuration and tests.

CHORUS-AGENTS.md (new file, 115 lines)
## What is CHORUS?
CHORUS is a comprehensive, distributed platform primarily designed for AI coordination and communication within its ecosystem. It functions as a semantic context publishing system that enables autonomous AI agents to securely share decisions, coordinate activities, and maintain consistent contexts through role-based encryption and consensus mechanisms. The platform supports real-time, selective sharing of both live and historical contextual data, facilitating operations such as task management, artifact publishing, and provenance tracking.
Originally established as a task coordination system using the `bzzz://` protocol, CHORUS is evolving into a semantic context publishing platform that enhances AI collaboration by transforming task coordination into a distributed decision graph based on the UCXL protocol. It also addresses operational aspects like data storage, configuration management, and security, including license management and threat protection. Overall, CHORUS aims to enable decentralized AI workflows, autonomous agent collaboration, and robust information sharing across distributed systems.
**Context OS responsibilities:** CHORUS orchestrates _selective_, time-aware context exchange between peers (no repo mirroring), carrying machine-readable provenance and confidence. In effect, it acts as the **runtime** for UCXL-addressed knowledge and policy-aware routing.
### References:
- [KG] Entity 1 (CHORUS) describes its role as a distributed AI coordination and semantic context publishing platform.
- [DC] "Phase 1 Integration Test Framework" document emphasizes its purpose in enabling secure, real-time sharing of decisions and contextual data.
- [KG] Its support for autonomous agents and role-based encryption highlights its focus on secure, decentralized AI collaboration.
- [KG] The transformation from a task coordination system to a semantic platform indicates its goal of enhancing operational synchronization.
- [DC] User Manual details the system's functions in managing AI decision-sharing and collaboration.
## Current Implementation Snapshot (2025-10)
- **WHOOSH-assignment runtime:** `internal/runtime/shared.go` now bootstraps a `pkg/config.RuntimeConfig` that merges base env vars with dynamic assignments pulled from WHOOSH (`ASSIGN_URL`). Reloads are wired to `SIGHUP`, and WHOOSH-provided bootstrap peers or join staggers override baked-in defaults. See [[docs/Modules/WHOOSH.md]] for the rendezvous API that serves these payloads.
- **License-first startup & AI provider wiring:** Startup blocks on `internal/licensing` validation against KACHING before any P2P work. Once licensed, `initializeAIProvider` configures ResetData or Ollama providers, composes persona prompts, and enables LightRAG via the MCP client when `LightRAG.Enabled` is set. This keeps reasoning, prompt curation, and RAG feeds consistent across containers.
- **Tempo-aware operations via BACKBEAT:** The BACKBEAT integration (`internal/backbeat/integration.go`) tracks beat cadence, wraps long-running peer operations (DHT bootstrap, peer discovery, elections), and emits health/status claims. Election scoring (`pkg/election/election.go`) consumes tempo and beat-gap data to throttle discovery storms and to respect stability windows.
- **Instrumented transport stack:** `pkg/transport/quic_transport.go` layers QUIC + optional TCP fallback with pooled streams, dial metrics, and configurable libp2p options. A transport telemetry reporter (`internal/runtime/transport_telemetry.go`) publishes snapshots to NATS (`CHORUS_TRANSPORT_METRICS_NATS_URL`) and surfaces per-transport counters through `pkg/metrics`.
- **Encrypted context pipeline:** When `V2.DHT.Enabled` is true, CHORUS spins up a libp2p DHT with AGE-backed storage (`pkg/dht`) and routes decision artifacts through `ucxl.DecisionPublisher`, exposing them via the optional UCXI HTTP server. Council opportunities are bridged over QUIC streams and mirrored onto NATS to keep WHOOSH/SLURP in sync.
**Operational Notes**
- **Centralised QUIC option injection:** Noise security, muxers, and relay support are now exclusively configured inside `transport.NewQUICTransport`, preventing the duplicate-provider panic that surfaced when both the transport and caller registered `libp2p.Security(noise)`. Custom libp2p options should be appended via runtime overrides to avoid reintroducing duplicates.
## How does CHORUS establish a peer-to-peer network?
CHORUS connects to itself to form a peer-to-peer (P2P) network through its native libp2p-based infrastructure, which manages peer discovery, connectivity, and message routing. The system employs protocols such as pubsub for messaging and mDNS or DHT for peer discovery, enabling nodes to automatically recognize and establish connections with each other without manual configuration.
Specifically, the architecture supports direct peer connections where nodes connect directly with each other, forming a full mesh or ring topology depending on deployment. The network topology is managed via mechanisms like bootstrap peers, which are static nodes used to initialize and support connections within the network. Nodes exchange capability broadcasts periodically to maintain an updated view of the network state.
Security in the connection process is enhanced through cryptographic verification methods like signature validation and signature verification to ensure trustworthiness among peers. Additionally, peer discovery protocols like mDNS (multicast DNS) facilitate neighborhood detection within local networks, while content and capabilities are exchanged securely. The infrastructure includes features for network resilience such as circuit breakers, rate limiting, and network health monitoring.
Overall, CHORUS forms its P2P network by leveraging libp2p's suite of protocols for secure, decentralized peer discovery and connection management, which allows nodes within the network to connect directly and participate effectively in distributed AI coordination activities.
**Institutional stance:** libp2p + scoped sync means context sharing is intentional, logged, and revocable; bootstrap + capability broadcasts make the mesh resilient while preserving least-privilege flow.
References:
[KG] "CHORUS P2P Network": Describes the decentralized communication layer involving peer discovery, pubsub messaging, and mutual peer management.
[DC] "README.md": Details on network discovery, peer connections, and mesh network topology.
[KG] "libp2p": Mentioned as the core protocol facilitating mesh networking, peer discovery via mDNS, and secure connections.
[KG] "CHORUS P2P Mesh": Organization that supports distributed peer connections for chat and AI collaboration.
[DC] "CHORUS-ARCHITECTURE.md": Explains the architecture including bootstrap peers and full mesh topology.
### 1.1.2.1 Key Functions
- Decentralized task coordination and management
- Secure, role-based sharing of contextual data
- Autonomous AI agent collaboration
- Content publishing of decisions and task updates
- Distributed data storage and retrieval
- Role-based encryption and security
- System configuration management
- License enforcement and resilience
- System deployment and maintenance
### 1.1.2.2 Modules of CHORUS
- [[BUBBLE]]
- [[SLURP]]
- [[COOEE]]
- [[SHHH]]
- [[HMMM]]
- [[UCXL]]
- [[DHT]]
CHORUS employs a range of components and modules that form its comprehensive architecture. The core components include the Main Application (main.go), decision-related modules such as the Decision Publisher, Election Manager, and Configuration Manager, and infrastructure elements like the Crypto Module (supporting Age encryption and Shamir secret sharing), Distributed Hash Table (DHT) Storage, and the Peer-to-Peer (P2P) Network for peer discovery and pub/sub messaging.
Additionally, CHORUS features specialized modules such as the UCXL protocol for semantic address management, SLURP for context management and decision reasoning, and a set of components dedicated to content publishing, security, and operational diagnostics. It also includes components responsible for the layered architecture, such as API handling, web sockets, and management tools.
For hybrid or flexible deployment, there are mock components for extensive testing and real components intended for production use. These include mock and real implementations of the DHT backend, address resolution, peer discovery, network layer, and connectors (such as the CHORUS Connector and RUSTLE Hybrid Components). The architecture is designed to support role-based security, distributed storage, consensus elections, and semantic addressing to facilitate decentralized AI coordination.
References:
[KG] Entity "CHORUS" description in the JSON graph.
[DC] "CHORUS System Architecture" details in "CHORUS-2B-ARCHITECTURE.md".
[DC] Components listed under "System Components" in "CHORUS_N8N_IMPLEMENTATION_COMPLETE.md".
[DC] Architectural diagrams in "CHORUS Architecture Documentation".
## What models does CHORUS use?
The CHORUS platform supports various models for its AI agents, depending on their roles and tasks. Specifically, the system includes features such as model integration within role configurations and capabilities for autonomous AI agents. One known provider mentioned is Ollama, which supplies models used for meta-thinking and performance metrics within the system, indicating that Ollama models are utilized to support agents in reasoning and performance evaluation.
CHORUS can execute tasks using its built-in SmolLM3 model from Hugging Face, Claude Code using Sonnet or Opus from Anthropic, Google's gemini-cli tool, or models available for Ollama.
Additionally, CHORUS's architecture involves the integration of models for different purposes, such as development, testing, and deployment, including models in the context of the RUSTLE component for meta-thinking and model capability announcements. The platform emphasizes a flexible, role-based model deployment framework, enabling agents to self-allocate tasks based on their available tooling and model capabilities.
**References:**
- [KG] Entity "Ollama" describing model providers for system agents.
- [DC] Content from "USER_MANUAL.md" and "CHORUS System Architecture" describing role-based and capability-driven model usage.
## What happens if my machines crash?
If a CHORUS agent machine crashes or goes offline, the system employs fault detection and failover mechanisms to maintain operational continuity. Heartbeat signals, which are periodic messages sent by agents to indicate they are active, play a key role in fault detection. When an agent fails to send heartbeats within a configured timeout period, the system recognizes this as a failure scenario.
In response, the system can trigger an automatic election to replace or promote another agent, utilizing algorithms such as Raft to select a new leader or maintain consensus. The system also supports automatic failover, where services migrate from the failed node to the remaining healthy nodes to ensure continuous operation. This process enhances reliability by enabling the system to recover quickly from node failures. Moreover, monitoring and health checks are in place to track system status and trigger recovery events, including replacing failed agents and re-establishing communication channels.
In some cases, recovery events are tracked by the health monitoring system for performance analysis, ensuring that overall system stability and reliability are preserved even during individual machine failures.
## TODO
- Integrate the Sequential Thinking MCP wrapper as a first-class AI provider so architect-level personas use the encrypted reasoning flow powered by `deploy/seqthink/mcp_server.py`.
- Deliver the brief-processing/event loop described in the task execution monitoring plan so councils automatically hand work to the execution engine.
- DHT (production): Ensure the runtime uses libp2p-kad DHT (not mocks), with 3–5x replication, provider records, and SLOs validated (success >99%, p95 GET <300ms).
- UCXL + encryption: Validate leader-only write/read paths with AGE/Shamir; document key rotation and run end-to-end tests for encrypted artifacts.
- SLURP as leader-only: Codify SLURP as privileged leader-only paths gated by elections; add leader assertions and tests.
- SHHH as leader-only: Introduce secrets sentinel hooks in publish/log ingest; implement redact/deny rules and audit trails under leader control.
- COOEE announcements: Surface capability/availability and enrolment APIs with schemas and tests; integrate with agentid flows.
- Elections/consensus: Document algorithms and add multi-node tests for failover, recovery, and consistency.

COOEE.md (new file, 1720 lines): file diff suppressed because it is too large.

DHT.md (new file, 116 lines)
The DHT Storage exists as a core component of the CHORUS architecture, used for distributed storage of encrypted content. It utilizes a distributed hash table (DHT) architecture, specifically based on libp2p, which supports decentralized, content-addressed storage and lookup capabilities. The DHT supports features such as caching, replication, and peer discovery, enabling high availability and fault tolerance across cluster nodes.
The DHT storage is integrated into the system in several ways. It stores encrypted content in the network, with content stored across nodes to facilitate secure, distributed data management. Content stored in the DHT is verified for hash, size, encryption, and metadata integrity to ensure authenticity and confidentiality. The content is associated with content hashes, used as references for content addressing within the storage hierarchy. The system also manages replication of data (maintaining 3-5 node copies) for fault tolerance. Nodes participating in the network can announce, retrieve, and verify content, with performance metrics monitoring efficiency and cache hit rates.
The DHT storage layer is accessed through entities such as content storage nodes and clients, which perform operations like store, retrieve, search, and get metrics. It also supports cryptographic protections, ensuring that stored content remains encrypted and secure, with metadata associated with raw content and backup data. Overall, the DHT storage provides decentralized, encrypted, and resilient data management essential for the secure operation of the CHORUS platform.
# Encrypted compartments
**Content-addressed, replicated, verified.** DHT persists encrypted context shards, with replication (3–5x) and integrity checks (hash/size/metadata) to make tampering and drift obvious. UCXL points _into_ these compartments; policy decides _if_ resolution occurs.
References:
[KG] Entities related to DHT Storage and its description (file_path not specified)
[DC] Storage component details from "CHORUS-2B-ARCHITECTURE.md"
## TODO
- Runtime boundaries: Consolidate DHT documentation across CHORUS and RUSTLE; define APIs, ownership, and responsibilities.
- Production usage: Ensure services use a production DHT implementation (not mocks); validate replication and metrics SLOs in CI.
- Crypto guarantees: Verify end-to-end encryption and verifiability for UCXL content/metadata, including hash/size/metadata integrity checks.
# Data Resilience with agent churn...
Me: how resilient to data loss is our DHT given docker container churn? Or is this mitigated by volume binding the data persistence to another storage layer? I guess what I'm asking is if the loss of agents of a particular role would mean that data is encrypted on our DHT and spread across a number of machines in our swarm, would that data then be untrackable if the agent of that role no longer exists in the cluster?
GPT-5:
Short version: a DHT won't save you from churn by itself. It gives *routing*, not *durability*. If containers die and you don't deliberately persist and re-publish, your encrypted blobs can become “stranded” (still somewhere on disks, but effectively unfindable or undecryptable). You avoid that by (1) persisting node state outside containers, (2) replicating/pinning content across a quorum of nodes, and (3) decoupling discovery + keys from any single agent/role.
Here's a practical, CHORUS-flavoured breakdown.
## What breaks under churn
* **Ephemeral container storage**: if a DHT node's datastore lives in the container FS, every redeploy wipes the routing tables, provider records, and any locally cached shards. You lose providers for your CIDs until someone re-announces.
* **Provider records expire**: Kademlia/libp2p provider records are TTL'd. If no process re-publishes, content becomes *unroutable* even if replicas still exist on disk.
* **Role-bound keys or indexes**: if only the “{role} agent” knows how to locate/decrypt artifacts (e.g., it holds the KEK or maintains the UCXL pointers), killing that role strands data.
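That expiry failure mode is easy to see in a toy model (a simplification, not libp2p's actual provider store):

```python
class ProviderTable:
    """Toy Kademlia-style provider table. Records expire after a TTL;
    unless something re-announces, content becomes unroutable even if
    replicas still sit on disk somewhere."""

    def __init__(self, ttl: float):
        self.ttl = ttl
        self._records = {}  # cid -> {peer_id: expiry timestamp}

    def announce(self, cid: str, peer: str, now: float) -> None:
        self._records.setdefault(cid, {})[peer] = now + self.ttl

    def find_providers(self, cid: str, now: float) -> list:
        # Only records that have not passed their expiry are routable.
        live = [p for p, exp in self._records.get(cid, {}).items() if exp > now]
        return sorted(live)
```

A peer that announces once and is never re-announced simply drops off the routing map at `now > expiry`, even though its disk still holds the bytes.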
## What actually makes it resilient
1. **Persist node state outside the container**
* Use bind mounts or named volumes for the DHT node's datastore (routing tables, provider records, blockstore).
* In Swarm, use named volumes backed by real disks/NAS (NFS/ZFS/Ceph) rather than ephemeral layers.
* Pin your bootstrappers to specific hosts with `placement.constraints` so their on-disk stores survive redeploys.
2. **Make availability a *policy*, not an accident**
* **Replication**: set a replication/pinning factor (e.g., r=5) per artifact or collection. Don't rely on opportunistic caching.
* **Erasure coding**: store shards with Reed-Solomon (e.g., k=10, m=4). Any k of k+m shards reconstructs.
* **Pinning service**: run a small, role-agnostic “Pin Steward” service that enforces pinsets across a quorum (e.g., 3 of 5 designated storage nodes) and re-announces providers every BACKBEAT pulse.
* **Anti-entropy**: schedule periodic verify/repair jobs (sample CIDs, fetch, heal missing replicas).
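To make the trade-off behind those knobs concrete, here's a back-of-the-envelope helper (durability arithmetic only, not a coding implementation):

```python
def overhead(k: int, m: int) -> float:
    """Raw bytes stored per logical byte under k+m erasure coding.
    Plain r-way replication is the special case k=1, m=r-1."""
    return (k + m) / k


def reconstructable(k: int, m: int, lost: int) -> bool:
    """Any k of the k+m shards reconstructs, so up to m losses are fine."""
    return lost <= m
```

Both r=5 replication (`k=1, m=4`) and `k=10, m=4` coding tolerate 4 lost shards, but replication costs 5.0× storage while the coded scheme costs 1.4×; the price is that reconstruction needs k surviving shards fetched, which is why small files and indices stay on pure replication.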
3. **Separate planes: data, index, and keys**
* **Data plane (DHT)**: encrypted, content-addressed blobs. DHT gives routing only.
* **Index plane (UCXL/registry)**: DRs and UCXL metadata carry the authoritative *pointers* (CIDs, shard manifests, checksum, retention, owners). This lives in a durable store (Raft/etcd, or a CRDT feed anchored to a small set of stateful nodes) and is not tied to a single role.
* **Key plane (age/envelope)**: per-artifact DEK + org-scoped KEK. Use *envelope encryption*:
* Encrypt data with a random DEK.
* Encrypt the DEK under a KEK controlled by org custody, **not** a single agent/role.
* Prefer threshold schemes, e.g. Shamir secret sharing or an age-wrapped multi-recipient policy: Security-Architect, Compliance, and TPM each hold shares; require 2-of-3 to recover.
* Store KEK metadata + recipient set in UCXL alongside the CID (never only in an agents private store).
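Here's a minimal sketch of that envelope + threshold flow. The Shamir field arithmetic is real, but the cipher is a toy stand-in for an actual AEAD (AES-GCM, age), and for brevity the DEK is split directly rather than wrapped under a separately split KEK:

```python
import hashlib
import secrets

P = 2**521 - 1  # Mersenne prime; Shamir shares live in GF(P)


def split_2_of_3(secret: int) -> list:
    """Degree-1 polynomial f(x) = secret + a*x evaluated at x = 1, 2, 3;
    any two points recover f(0) = secret."""
    a = secrets.randbelow(P)
    return [(x, (secret + a * x) % P) for x in (1, 2, 3)]


def recover(two_shares) -> int:
    """Lagrange interpolation at x = 0 from any two shares."""
    (x1, y1), (x2, y2) = two_shares
    slope = ((y2 - y1) * pow(x2 - x1, -1, P)) % P
    return (y1 - slope * x1) % P


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy stand-in for a real AEAD: XOR against a SHA-256 counter
    keystream. Symmetric, so it both encrypts and decrypts."""
    out = bytearray()
    for i in range(0, len(data), 32):
        block = hashlib.sha256(key + i.to_bytes(8, "big")).digest()
        out.extend(b ^ k for b, k in zip(data[i:i + 32], block))
    return bytes(out)


# Envelope flow: a random DEK encrypts the artifact; the DEK itself is
# split so no single agent/role can decrypt alone.
dek = secrets.token_bytes(16)
ciphertext = toy_cipher(dek, b"project context shard")
shares = split_2_of_3(int.from_bytes(dek, "big"))  # SA, Compliance, TPM
```

Losing any one share-holder (or the agent that wrote the data) changes nothing: any two remaining custodians reconstruct the DEK and decrypt.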
4. **Don't bind capability to a role identity**
* Treat “roles” as *claims* over an org DID, not as key custodians. Agents can come and go; the org-level custody remains.
* Rotate claims freely; rotate keys carefully (KEK rotation with rewrap jobs scheduled on a BACKBEAT reverb).
5. **Swarm-level hygiene**
* `deploy.update_config: order: start-first` to keep at least one replica online during updates.
* `restart_policy: condition: any` and healthchecks so nodes auto-return to the DHT.
* Node labels + `placement.constraints` to keep pinning nodes on storage-rich hosts.
* Back the volume with ZFS and snapshots; export via NFS/Ceph if you want cross-host mobility.
* Keep 3+ **stable bootstrap nodes** (stateful, fixed addresses) that never run on ephemeral volumes.
6. **Lifecycle + BACKBEAT**
* On each beat:
* **Pulse**: Pin Steward re-announces providers for recently touched CIDs.
* **Reverb**: Run verify/repair, renew provider TTLs, and checkpoint UCXL indexes + key policy deltas.
* Garbage collection: never delete until UCXL's retention policy says so and quorum confirms the DR tombstone.
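A jittered re-announce scheduler for those beats might look like this; the interval, the names, and the `reannounce` callback are assumptions, not an existing API:

```python
import random


def next_refresh(now: float, base: float, jitter_frac: float = 0.2) -> float:
    """Next provider re-announce time: base interval plus/minus jitter
    so a swarm of stewards doesn't re-announce in lockstep."""
    return now + base * (1 + random.uniform(-jitter_frac, jitter_frac))


def on_pulse(now: float, schedule: dict, reannounce) -> None:
    """schedule maps cid -> next due timestamp. On each Pulse,
    re-announce everything due and push its next slot out with
    fresh jitter."""
    for cid, due in list(schedule.items()):
        if due <= now:
            reannounce(cid)
            schedule[cid] = next_refresh(now, base=600.0)  # ~10 min cadence
```

Reverb would then aggregate what got re-announced per bar and trigger the verify/repair pass on anything that missed its slot.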
## Answering your core question
> If the loss of agents of a particular role occurs, is the encrypted data on the DHT “untrackable”?
* **It's untrackable only if** (a) the UCXL index that maps concepts → CIDs was owned by that role alone, or (b) the KEK/recipient set lived solely with that role.
* **It remains trackable and recoverable if** you:
* Keep UCXL pointers in the shared registry (replicated, role-agnostic).
* Use envelope encryption with threshold/org custody.
* Run a Pin Steward that republishes providers independently of any single agent.
* Persist DHT/blocks on host volumes and across multiple nodes.
## Recommended defaults (sane, not dogma)
* Replication factor: **r ≥ 5** across failure domains (different physical hosts).
* Erasure coding: **k=10, m=4** for large artifacts; pure replication for small files/indices.
* Pin quorum: **3 of 5** designated storage nodes.
* Key policy: **2-of-3** threshold (Security-Architect, Compliance Officer, TPM).
* Provider refresh: every **BACKBEAT** (e.g., 5-10 min) with jitter; hard renew at 1-2× the DHT provider TTL.
* Bootstrap nodes: **3** stateful services, fixed host placement, ZFS-backed volumes, off the autoscaler.
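Those defaults can be sanity-checked mechanically. A small validator, with thresholds matching the list above (the checks themselves are my assumptions about what "sane" means here):

```python
def check_policy(quorum: int, nodes: int, refresh_s: float, ttl_s: float) -> list:
    """Return a list of violations of the defaults above (empty = OK)."""
    problems = []
    if not 0 < quorum <= nodes:
        problems.append("quorum must fit within the designated storage nodes")
    elif quorum <= nodes // 2:
        problems.append("quorum should be a strict majority to survive splits")
    if refresh_s >= ttl_s:
        problems.append("refresh must land before provider records expire")
    return problems
```

Run it in CI next to the replication/metrics SLO checks so a config drift (say, quorum 2 of 5) fails fast instead of silently weakening durability.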
## What to change right now
* Move every DHT/blockstore path to a **named volume** (ZFS dataset or NFS/Ceph) and pin bootstrappers via constraints.
* Stand up the **Pin Steward** (tiny service) that:
* Watches UCXL for new/changed CIDs and desired r/k+m,
* Ensures pins across its quorum,
* Re-announces providers on Pulse/Reverb,
* Emits DRs when healing occurs.
* Switch to **envelope encryption + threshold KEK**; remove role-exclusive custody.
* Add a **UCXL “Beacon” index** (small, replicated KV) that maps logical addresses → latest CID/manifests so discovery never depends on an agents memory.
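A toy of that Beacon index, versioned for last-writer-wins convergence (names are illustrative, not the UCXL API):

```python
class Beacon:
    """Toy replicated-KV Beacon: logical address -> latest CID, with a
    monotonically increasing version so replicas converge without
    depending on any single agent's memory."""

    def __init__(self):
        self._index = {}  # address -> (version, cid)

    def publish(self, address: str, cid: str, version: int) -> None:
        current = self._index.get(address)
        if current is None or version > current[0]:
            self._index[address] = (version, cid)

    def resolve(self, address: str):
        entry = self._index.get(address)
        return entry[1] if entry else None

    def merge(self, other: "Beacon") -> None:
        """Anti-entropy: fold in a peer's view; higher versions win."""
        for address, (version, cid) in other._index.items():
            self.publish(address, cid, version)
```

Because `merge` is idempotent and commutative, any gossip order between replicas settles on the same latest CID per address; in practice you'd derive the version from an HLC rather than a plain counter.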
Do those, and container churn becomes a nuisance, not a data-loss event. Without them, yes—you can strand perfectly good ciphertext all over your swarm with no reliable way to find or open it once that role is gone.