The DHT Storage layer is a core component of the CHORUS architecture, providing distributed storage of encrypted content. It is built on a libp2p-based distributed hash table (DHT), which supports decentralized, content-addressed storage and lookup. The DHT provides caching, replication, and peer discovery, enabling high availability and fault tolerance across cluster nodes.

The layer is integrated into the system in several ways. Encrypted content is stored across nodes for secure, distributed data management. Content stored in the DHT is verified for hash, size, encryption, and metadata integrity to ensure authenticity and confidentiality. Each item is addressed by its content hash, which serves as its reference within the storage hierarchy. The system maintains 3–5 replicas per item for fault tolerance. Nodes participating in the network can announce, retrieve, and verify content, with performance metrics tracking efficiency and cache hit rates.

The DHT storage layer is accessed by content storage nodes and clients, which perform operations such as store, retrieve, search, and get metrics. Cryptographic protections ensure that stored content remains encrypted and secure, with metadata associated with both raw content and backup data. Overall, the DHT storage layer provides the decentralized, encrypted, resilient data management essential to the secure operation of the CHORUS platform.

# Encrypted compartments

**Content-addressed, replicated, verified.** The DHT persists encrypted context shards, with replication (3–5x) and integrity checks (hash/size/metadata) that make tampering and drift obvious. UCXL points _into_ these compartments; policy decides _whether_ resolution occurs.
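As a minimal sketch of those integrity checks (the record layout and field names here are illustrative assumptions, not the actual CHORUS schema), a retrieved ciphertext can be validated against its recorded hash, size, and metadata digest before being accepted:

```python
import hashlib
import json


def verify_record(ciphertext: bytes, record: dict) -> list:
    """Return the list of integrity failures for a retrieved DHT record.

    Assumed (hypothetical) record layout:
      hash          - SHA-256 hex digest of the ciphertext
      size          - ciphertext length in bytes
      metadata      - dict of associated metadata
      metadata_hash - SHA-256 hex digest of the canonical-JSON metadata
    An empty return list means the record passed all checks.
    """
    failures = []
    if hashlib.sha256(ciphertext).hexdigest() != record.get("hash"):
        failures.append("hash")      # tampered or corrupted content
    if len(ciphertext) != record.get("size"):
        failures.append("size")      # truncated or padded blob
    canonical = json.dumps(record.get("metadata", {}), sort_keys=True).encode()
    if hashlib.sha256(canonical).hexdigest() != record.get("metadata_hash"):
        failures.append("metadata")  # metadata drifted from what was recorded
    return failures
```

Checking size separately from the hash is deliberate: a size mismatch distinguishes cheap truncation from content substitution without recomputing anything.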
References:
- [KG] Entities related to DHT Storage and its description (file_path not specified)
- [DC] Storage component details from "CHORUS-2B-ARCHITECTURE.md"

## TODO

- Runtime boundaries: consolidate DHT documentation across CHORUS and RUSTLE; define APIs, ownership, and responsibilities.
- Production usage: ensure services use a production DHT implementation (not mocks); validate replication and metrics SLOs in CI.
- Crypto guarantees: verify end-to-end encryption and verifiability for UCXL content/metadata, including hash/size/metadata integrity checks.

# Data resilience with agent churn

Me: How resilient to data loss is our DHT given Docker container churn? Or is this mitigated by volume-binding the data persistence to another storage layer? I guess what I'm asking is: if data is encrypted on our DHT and spread across a number of machines in our swarm, would that data become untrackable if the agents of a particular role no longer exist in the cluster?

GPT-5: Short version: a DHT won't save you from churn by itself. It gives *routing*, not *durability*. If containers die and you don't deliberately persist and re-publish, your encrypted blobs can become "stranded" (still somewhere on disks, but effectively unfindable or undecryptable). You avoid that by (1) persisting node state outside containers, (2) replicating/pinning content across a quorum of nodes, and (3) decoupling discovery and keys from any single agent/role. Here's a practical, CHORUS-flavoured breakdown.

## What breaks under churn

* **Ephemeral container storage**: if a DHT node's datastore lives in the container FS, every redeploy wipes the routing tables, provider records, and any locally cached shards. You lose providers for your CIDs until someone re-announces.
* **Provider records expire**: Kademlia/libp2p provider records are TTL'd. If no process re-publishes them, content becomes *unroutable* even if replicas still exist on disk.
* **Role-bound keys or indexes**: if only the "{role} agent" knows how to locate or decrypt artifacts (e.g., it holds the KEK or maintains the UCXL pointers), killing that role strands the data.

## What actually makes it resilient

1. **Persist node state outside the container**
   * Bind-mount or use named volumes for the DHT node's datastore (routing tables, provider records, blockstore).
   * In Swarm, use named volumes backed by real disks/NAS (NFS/ZFS/Ceph) rather than ephemeral layers.
   * Pin your bootstrappers to specific hosts with `placement.constraints` so their on-disk stores survive redeploys.
2. **Make availability a *policy*, not an accident**
   * **Replication**: set a replication/pinning factor (e.g., r=5) per artifact or collection. Don't rely on opportunistic caching.
   * **Erasure coding**: store shards with Reed–Solomon (e.g., k=10, m=4). Any k of the k+m shards reconstructs the artifact.
   * **Pinning service**: run a small, role-agnostic "Pin Steward" service that enforces pinsets across a quorum (e.g., 3 of 5 designated storage nodes) and re-announces providers every BACKBEAT pulse.
   * **Anti-entropy**: schedule periodic verify/repair jobs (sample CIDs, fetch them, heal missing replicas).
3. **Separate planes: data, index, and keys**
   * **Data plane (DHT)**: encrypted, content-addressed blobs. The DHT gives routing only.
   * **Index plane (UCXL/registry)**: DRs and UCXL metadata carry the authoritative *pointers* (CIDs, shard manifests, checksums, retention, owners). This lives in a durable store (Raft/etcd, or a CRDT feed anchored to a small set of stateful nodes) and is not tied to a single role.
   * **Key plane (age/envelope)**: per-artifact DEK + org-scoped KEK. Use *envelope encryption*:
     * Encrypt the data with a random DEK.
     * Encrypt the DEK under a KEK controlled by org custody, **not** a single agent/role.
     * Prefer threshold schemes, e.g., Shamir or an age-wrapped multi-recipient policy: Security-Architect, Compliance, and TPM each hold shares; require 2-of-3 to recover.
     * Store the KEK metadata + recipient set in UCXL alongside the CID (never only in an agent's private store).
4. **Don't bind capability to a role identity**
   * Treat "roles" as *claims* over an org DID, not as key custodians. Agents can come and go; org-level custody remains.
   * Rotate claims freely; rotate keys carefully (KEK rotation with rewrap jobs scheduled on a BACKBEAT reverb).
5. **Swarm-level hygiene**
   * `deploy.update_config: order: start-first` to keep at least one replica online during updates.
   * `restart_policy: condition: any` and healthchecks so nodes auto-return to the DHT.
   * Node labels + `placement.constraints` to keep pinning nodes on storage-rich hosts.
   * Back the volume with ZFS and snapshots; export via NFS/Ceph if you want cross-host mobility.
   * Keep 3+ **stable bootstrap nodes** (stateful, fixed addresses) that never run on ephemeral volumes.
6. **Lifecycle + BACKBEAT**
   * On each beat:
     * **Pulse**: the Pin Steward re-announces providers for recently touched CIDs.
     * **Reverb**: run verify/repair, renew provider TTLs, and checkpoint UCXL indexes + key-policy deltas.
   * Garbage collection: never delete until UCXL's retention policy says so and the quorum confirms the DR tombstone.

## Answering your core question

> If the agents of a particular role are lost, is the encrypted data on the DHT "untrackable"?

* **It's untrackable only if** (a) the UCXL index that maps concepts → CIDs was owned by that role alone, or (b) the KEK/recipient set lived solely with that role.
* **It remains trackable and recoverable if** you:
  * Keep UCXL pointers in the shared registry (replicated, role-agnostic).
  * Use envelope encryption with threshold/org custody.
  * Run a Pin Steward that republishes providers independently of any single agent.
  * Persist the DHT/blockstore on host volumes and across multiple nodes.

## Recommended defaults (sane, not dogma)

* Replication factor: **r ≥ 5** across failure domains (different physical hosts).
* Erasure coding: **k=10, m=4** for large artifacts; pure replication for small files/indices.
* Pin quorum: **3 of 5** designated storage nodes.
* Key policy: **2-of-3** threshold (Security-Architect, Compliance Officer, TPM).
* Provider refresh: every **BACKBEAT** (e.g., 5–10 min) with jitter; hard renew at 1–2× the DHT provider TTL.
* Bootstrap nodes: **3** stateful services, fixed host placement, ZFS-backed volumes, off the autoscaler.

## What to change right now

* Move every DHT/blockstore path to a **named volume** (ZFS dataset or NFS/Ceph) and pin bootstrappers via constraints.
* Stand up the **Pin Steward** (a tiny service) that:
  * watches UCXL for new/changed CIDs and the desired r/k+m,
  * ensures pins across its quorum,
  * re-announces providers on Pulse/Reverb,
  * emits DRs when healing occurs.
* Switch to **envelope encryption + threshold KEK**; remove role-exclusive custody.
* Add a **UCXL "Beacon" index** (a small, replicated KV) that maps logical addresses → latest CID/manifests, so discovery never depends on an agent's memory.

Do those, and container churn becomes a nuisance, not a data-loss event. Without them, yes: you can strand perfectly good ciphertext all over your swarm with no reliable way to find or open it once that role is gone.
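The Pin Steward's anti-entropy pass can be sketched as follows. Everything here is hypothetical scaffolding: `StorageNode` is an in-memory stand-in for a storage host, and a real implementation would copy shards between libp2p peers and re-announce provider records to the DHT rather than mutate dicts.

```python
import hashlib


class StorageNode:
    """Hypothetical in-memory stand-in for one storage host: cid -> blob."""

    def __init__(self, name):
        self.name = name
        self.blobs = {}

    def has(self, cid):
        return cid in self.blobs

    def put(self, cid, data):
        self.blobs[cid] = data

    def get(self, cid):
        return self.blobs[cid]


def cid_of(data: bytes) -> str:
    """Content address: here simply a SHA-256 hex digest."""
    return hashlib.sha256(data).hexdigest()


class PinSteward:
    """Role-agnostic service enforcing a pinset across a node quorum.

    `pinset` maps cid -> desired replication factor r. On each BACKBEAT
    pulse we count live replicas and heal shortfalls by copying from any
    surviving replica; a real steward would then re-announce providers.
    """

    def __init__(self, nodes, pinset):
        self.nodes = nodes
        self.pinset = pinset

    def replicas(self, cid):
        return [n for n in self.nodes if n.has(cid)]

    def pulse(self):
        """One anti-entropy pass: returns a list of (cid, healed_count).

        healed_count is -1 when no replica survives (unrecoverable loss).
        """
        report = []
        for cid, r in self.pinset.items():
            have = self.replicas(cid)
            if not have:
                report.append((cid, -1))  # nothing left to heal from
                continue
            missing = [n for n in self.nodes if not n.has(cid)]
            healed = 0
            while len(have) + healed < r and missing:
                target = missing.pop()
                target.put(cid, have[0].get(cid))
                healed += 1
            report.append((cid, healed))
        return report
```

With five nodes, a pinset of r=5, and one replica wiped by container churn, a single pulse re-copies the shard from the surviving replicas until the target factor is met, which is exactly why churn stays a nuisance rather than a loss event as long as at least one replica outlives each churn window.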