CHORUS/docs/distos/councils/03-network-stack.md
# Council Design Brief: Network Stack
**Council ID:** `council-net`
**Mission:** Design the complete network stack for DistOS, encompassing RDMA transport (InfiniBand/RoCE) for GPU-to-GPU communication, the overlay network and control plane (libp2p), transport protocol selection (QUIC/TCP/UCX), network topology discovery, adaptive routing, congestion control at 1024-node scale, NVLink domain bridging, multi-rail networking, service mesh for agent communication, and multicast/broadcast for council pub-sub channels.
**UCXL Base Address:** `ucxl://council-net:*@DistOS:networking/*`
**Agent Count:** 60
**Status:** Constitution Phase — awaiting WHOOSH formation trigger
**Created:** 2026-02-24
---
## 1. Scope and Responsibilities
`council-net` owns the complete specification of the DistOS network subsystem. Scope boundaries are defined as follows.
**In scope:**
- RDMA transport layer: InfiniBand verbs and RoCE v2 for GPU-to-GPU data plane; Queue Pair (QP) lifecycle, completion queue management, and memory region registration interface
- Overlay network design: logical topology over the physical InfiniBand/Ethernet fabric; addressing, routing, and peer discovery for the control plane
- libp2p integration for the DistOS control plane: peer discovery (mDNS, DHT-based), multiplexed streams, NAT traversal, and protocol negotiation
- Transport protocol selection: QUIC for agent control plane communication, UCX (Unified Communication X) for GPU collective and point-to-point data plane, TCP/IP for compatibility and management traffic
- Network topology discovery: automatic cluster topology mapping (fat-tree, dragonfly, or NVLink Switch topology), NVLink domain membership, InfiniBand subnet manager interface
- Adaptive routing: traffic-aware routing, ECMP (Equal-Cost Multi-Path), InfiniBand OpenSM AR (Adaptive Routing), and Dragonfly-specific routing algorithms
- Congestion control at 1024-node scale: ECN (Explicit Congestion Notification), PFC (Priority Flow Control), and DCQCN for RoCE; InfiniBand credit-based flow control; interaction with NCCL collectives
- NVLink domain bridging: translating between NVLink-domain (intra-node or NVSwitch-connected GPUs) and InfiniBand-domain (inter-node) communication
- Multi-rail networking: multiple IB HCAs per node, rail selection policy, failover across rails
- Service mesh for agent (CHORUS/WHOOSH) communication: mTLS, sidecar-less or sidecar-based, service discovery, load balancing, and circuit-breaking
- Multicast and broadcast for council pub-sub: efficient dissemination of research summaries, decision records, and vote notifications to council members
- Formal specification of the DistOS network interface — the API surface exposed to the memory subsystem (RDMA registration), the scheduler (network-aware placement), and user workloads (NCCL-compatible collective API)
**Out of scope (delegated):**
- Physical layer hardware (HCA firmware, cable selection, switch configuration) — these are hardware dependencies, not OS design decisions
- GPU memory allocation and RDMA buffer management (delegated to `council-mem`; RDMA registration interface consumed)
- Workload scheduling and GPU assignment (delegated to `council-sched`; topology hints provided as outputs)
- Encrypted transport key management and certificate lifecycle (delegated to `council-sec`; TLS/mTLS integration points defined)
- Resource metering for network bandwidth usage (delegated to `council-telemetry`; metering event API published)
---
## 2. Research Domains
### 2.1 RDMA: InfiniBand Verbs and RoCE
InfiniBand provides the primary high-bandwidth, low-latency transport for GPU-to-GPU data movement on this cluster. Understand the verbs API (`ibverbs`), Queue Pair state machine (RESET → INIT → RTR → RTS → ERROR), completion queues, and memory region registration. Survey RoCE v2 as the Ethernet-encapsulated RDMA alternative and understand its congestion control requirements.
Key materials:
- Mellanox/NVIDIA InfiniBand Architecture Specification — QP model, RDMA read/write/atomic operations, immediate data
- Kalia et al., "Using RDMA Efficiently for Key-Value Services" (SIGCOMM 2014) — RDMA design patterns, one-sided vs. two-sided operations
- Kalia et al., "Design Guidelines for High Performance RDMA Systems" (USENIX ATC 2016) — QP scaling, SEND vs. RDMA READ trade-offs, doorbell batching
- Mitchell et al., "Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store" (USENIX ATC 2013)
- Rosenblum et al., "The Design and Implementation of a Log-Structured File System" — not RDMA-specific but directly relevant to RDMA-backed storage patterns
- Pfefferle et al., "A Hybrid I/O Virtualization Framework for RDMA-capable Network Interfaces" (VEE 2015)
- RoCE v2 specification (InfiniBand Trade Association) — UDP encapsulation, RoCEv2 header, ECN marking
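The QP lifecycle can be modelled as a small state machine. The following Python sketch is illustrative only; the real `ibverbs` API is C and drives these transitions through `ibv_modify_qp` calls with per-state attribute masks:

```python
from enum import Enum, auto

class QPState(Enum):
    RESET = auto()
    INIT = auto()
    RTR = auto()    # Ready to Receive
    RTS = auto()    # Ready to Send
    ERROR = auto()

# Legal forward transitions in the normal bring-up path; any state may
# drop to ERROR on a fault (completion error, retry exhaustion, etc.).
_TRANSITIONS = {
    QPState.RESET: {QPState.INIT},
    QPState.INIT: {QPState.RTR},
    QPState.RTR: {QPState.RTS},
    QPState.RTS: set(),
}

class QueuePair:
    """Toy model of the verbs QP state machine (not the ibverbs API)."""
    def __init__(self):
        self.state = QPState.RESET

    def modify(self, target: QPState) -> None:
        if target is QPState.ERROR:
            self.state = QPState.ERROR
            return
        if target not in _TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {target}")
        self.state = target

qp = QueuePair()
for s in (QPState.INIT, QPState.RTR, QPState.RTS):
    qp.modify(s)
assert qp.state is QPState.RTS
```

Real bring-up additionally exchanges QP numbers, LIDs/GIDs, and PSNs out of band before the RTR transition; the model above captures only the ordering constraint.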
### 2.2 UCX: Unified Communication X
UCX is the primary communication framework for DistOS data plane operations. It provides a hardware-agnostic API over InfiniBand, RoCE, CUDA IPC, CMA (Cross-Memory Attach), and TCP. NCCL uses UCX as an optional backend.
Key materials:
- Shamis et al., "UCX: An Open Source Framework for HPC Network APIs and Beyond" (HOTI 2015) — UCX design and API overview
- UCX documentation (openucx.org) — UCP (the high-level protocols API) context, endpoints, tagged messages, Active Messages, RDMA operations
- UCX GPU-Direct RDMA integration guide — `ucp_mem_map` with `UCS_MEMORY_TYPE_CUDA`, `ucx_perftest` GPU benchmarks
- OSU Micro-Benchmarks (OMB, Network-Based Computing Lab, Ohio State University) — latency/bandwidth benchmarks for UCX transport selection validation
- NVIDIA NCCL-UCX plugin documentation — `NCCL_UCX_*` environment variables, QP provisioning, registration cache
### 2.3 NCCL: NVIDIA Collective Communication Library
NCCL implements the collective communication patterns used by distributed training (allreduce, broadcast, reduce-scatter, all-gather). On Hopper clusters, NCCL uses NVLink for intra-node collectives and InfiniBand for inter-node. Understanding NCCL topology files and ring/tree algorithm selection is essential for network-aware scheduling.
Key materials:
- NCCL documentation and source (github.com/NVIDIA/nccl) — `ncclCommInitRank`, topology detection, algorithm selection (ring, tree, collnet)
- Patarasuk and Yuan, "Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations" (J. Parallel Distrib. Comput. 2009) — ring-allreduce bandwidth analysis
- Fedus et al., "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity" (JMLR 2022) — appendix on communication cost; practical allreduce scaling analysis
- NCCL topology XML format — specifying NVLink domains, IB rail affinity, and switch hierarchy for NCCL algorithm tuning
- NVIDIA Collective Communication Library Performance Notes — algorithm selection heuristics, tree vs. ring cross-over point
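The ring-allreduce bandwidth analysis in Patarasuk and Yuan reduces to a simple alpha-beta cost model, sketched below. The bandwidth and latency figures in the usage line are illustrative assumptions, not measured Hopper numbers:

```python
def ring_allreduce_cost(n_gpus: int, msg_bytes: int,
                        bw_bytes_per_s: float, latency_s: float) -> float:
    """Alpha-beta cost of a ring allreduce on n_gpus ranks.

    Each rank performs 2*(n-1) steps (reduce-scatter, then all-gather),
    sending msg_bytes/n per step, so total bytes sent per rank approach
    2*msg_bytes as n grows -- the bandwidth-optimal bound from the paper.
    """
    steps = 2 * (n_gpus - 1)
    bytes_per_rank = steps * msg_bytes / n_gpus
    return steps * latency_s + bytes_per_rank / bw_bytes_per_s

# Illustrative only: a 1 GiB allreduce over 8 GPUs at an assumed
# 400 GB/s effective per-GPU bandwidth and 5 us per-step latency.
t = ring_allreduce_cost(8, 1 << 30, 400e9, 5e-6)
```

The latency term scaling with 2(n-1) is why NCCL switches from ring to tree algorithms for small messages on large rings.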
### 2.4 libp2p for the Control Plane
The DistOS control plane (agent coordination, WHOOSH council formation, SLURP DHT, UCXL resolution) runs over libp2p. Survey the libp2p protocol suite: peer identity (ed25519/secp256k1 keypairs), peer discovery (mDNS, Kademlia DHT), stream multiplexing (yamux, mplex), transport (QUIC, TCP), and NAT traversal.
Key materials:
- libp2p specification (github.com/libp2p/specs) — multiaddress format, transport protocols, peer routing
- Maymounkov and Mazières, "Kademlia: A Peer-to-Peer Information System Based on the XOR Metric" (IPTPS 2002) — foundational DHT for libp2p routing
- libp2p QUIC transport specification — 0-RTT handshake, connection migration, multiplexed streams without HoL blocking
- IPFS documentation on libp2p — practical deployment patterns, bootstrap peer configuration, ambient peer discovery
- Baumgart and Mies, "S/Kademlia: A Practicable Approach Towards Secure Key-Based Routing" (P2P 2007) — security hardening for Kademlia relevant to the CHORUS mesh
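The XOR metric at the heart of Kademlia is compact enough to state in a few lines. This is a minimal sketch of the distance and k-bucket index computation; real libp2p derives node IDs from peer public keys and layers on the S/Kademlia hardening cited above:

```python
def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia distance: node IDs interpreted as big-endian integers, XORed."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

def bucket_index(self_id: bytes, peer_id: bytes) -> int:
    """k-bucket index: position of the most significant differing bit."""
    d = xor_distance(self_id, peer_id)
    if d == 0:
        raise ValueError("a node does not bucket itself")
    return d.bit_length() - 1

# The metric is symmetric and unidirectional (for a given target, every
# lookup converges along the same path), which is what lets iterative
# lookups halve the remaining distance at each hop.
```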
### 2.5 QUIC Protocol
QUIC provides the transport for DistOS agent control-plane communication (WHOOSH formation, SLURP DHT queries, UCXL resolution, BUBBLE decision records). QUIC's multiplexed streams, 0-RTT connection establishment, and connection migration over multiple network paths make it well-suited to the heterogeneous cluster environment.
Key materials:
- Langley et al., "The QUIC Transport Protocol: Design and Internet-Scale Deployment" (SIGCOMM 2017) — QUIC design rationale and performance analysis
- RFC 9000 — QUIC: A UDP-Based Multiplexed and Secure Transport
- RFC 9001 — Using TLS to Secure QUIC
- Zhang et al., "QUIC is not Quick Enough over Fast Internet" (WWW 2024) — QUIC performance limitations at high bandwidth relevant to cluster use
- Cui et al., "QUIC is not Enough: Towards Wireless QUIC" — multipath extensions relevant to multi-rail networking
### 2.6 Network Topology Discovery
At 1024 nodes, manual topology configuration is not feasible. The network stack must automatically discover the cluster topology: fat-tree vs. dragonfly vs. NVLink Switch topology, rail assignments per node, and NVSwitch domain membership per GPU.
Key materials:
- Al-Fares et al., "A Scalable, Commodity Data Center Network Architecture" (SIGCOMM 2008) — fat-tree topology analysis
- Kim et al., "Technology-Driven, Highly-Scalable Dragonfly Topology" (ISCA 2008) — dragonfly topology and routing
- InfiniBand OpenSM subnet manager documentation — `smpquery`, `ibnetdiscover`, topology file format, AR (Adaptive Routing) configuration
- NVIDIA NVTOPO documentation — GPU topology detection, `nvidia-smi topo -m` output format
- `ibstat`, `ibstatus`, `perfquery` — InfiniBand diagnostic tools relevant to topology verification
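One plausible aggregation step, in which each node self-reports its NVLink domain membership and the discovery service merges the reports into a cluster-wide map, can be sketched as follows (the report field names are hypothetical, not a committed DistOS schema):

```python
from collections import defaultdict

def merge_topology_reports(reports):
    """Merge per-node topology self-reports into an NVLink domain map.

    Each report is a dict such as (illustrative fields):
      {"node": "n0017", "nvlink_domain": "nvd-3",
       "gpus": ["GPU-0", "GPU-1"], "rails": ["mlx5_0", "mlx5_1"]}
    Returns {domain_id: sorted list of (node, gpu) members}.
    """
    domains = defaultdict(list)
    for r in reports:
        for gpu in r["gpus"]:
            domains[r["nvlink_domain"]].append((r["node"], gpu))
    return {d: sorted(members) for d, members in domains.items()}
```

A real implementation would also cross-check the self-reports against subnet manager queries to catch stale or inconsistent data.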
### 2.7 Adaptive Routing and Congestion Control
At 1024-node scale, static routing leads to hot-spots. Adaptive routing dynamically distributes traffic across equal-cost paths. Congestion control prevents PFC pause storms, which can cascade across the fabric and in the worst case produce PFC deadlocks and incast-style congestion collapse.
Key materials:
- Valadarsky et al., "Xpander: Towards Optimal-Performance Datacenters" (CoNEXT 2016) — adaptive topology design
- Zhu et al., "Congestion Control for Large-Scale RDMA Deployments" (SIGCOMM 2015) — DCQCN algorithm for RoCE congestion control
- NVIDIA Quantum InfiniBand switch AR documentation — per-packet vs. per-flow adaptive routing
- Pfaff et al., "The Design and Implementation of Open vSwitch" (NSDI 2015) — SDN-based adaptive routing reference
- Mittal et al., "TIMELY: RTT-based Congestion Control for the Datacenter" (SIGCOMM 2015) — RTT-based CC for RDMA
- Hyperscale network stack design patterns: Firestone et al., "Azure Accelerated Networking: SmartNICs in the Public Cloud" (NSDI 2018); Singh et al., "Jupiter Rising" (SIGCOMM 2015) — Google datacenter network architecture
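The DCQCN sender behaviour from Zhu et al. can be summarised in a few lines. The sketch below models only the CNP-driven multiplicative decrease and the congestion estimator alpha; the paper's fast-recovery, additive-increase, and hyper-increase stages are omitted for brevity:

```python
class DCQCNRate:
    """Simplified DCQCN sender state (after Zhu et al., SIGCOMM 2015).

    rate is the current sending rate RC, target the target rate RT,
    alpha the exponentially weighted estimate of congestion severity.
    """
    def __init__(self, line_rate_gbps: float, g: float = 1 / 256):
        self.rate = line_rate_gbps
        self.target = line_rate_gbps
        self.alpha = 1.0
        self.g = g

    def on_cnp(self) -> None:
        # Congestion Notification Packet received from the receiver NIC:
        # remember where we were, cut the rate by alpha/2, raise alpha.
        self.target = self.rate
        self.rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        # Alpha-update timer fired with no CNP seen: decay the estimate,
        # so subsequent cuts become gentler as congestion clears.
        self.alpha = (1 - self.g) * self.alpha
```

Even this truncated model shows the key property: repeated CNPs drive rate down multiplicatively, while quiet periods shrink alpha so the next cut is smaller.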
### 2.8 Multi-Rail Networking
Each node in the cluster has multiple InfiniBand HCAs for aggregate bandwidth. The network stack must implement rail selection policy (hash-based, round-robin, least-loaded), per-flow rail affinity for ordered delivery, and HCA failover.
Key materials:
- NVIDIA Multi-Rail documentation — `NCCL_IB_HCA` environment variable, rail selection in NCCL
- Dragojević et al., "FaRM: Fast Remote Memory" (NSDI 2014) — multi-rail RDMA design patterns
- Bai et al., "PIAS: Practical Information-Agnostic Flow Scheduling for Commodity Data Centers" (NSDI 2015) — flow scheduling relevant to multi-rail policy
- InfiniBand bonding and link aggregation — `ipoib` bonding, `rdma cm` multi-path
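A hash-based rail selection policy with failover, the baseline suggested by the materials above, can be sketched as follows (the rail names are illustrative; a deterministic hash keeps a flow pinned to one rail, preserving per-flow ordering):

```python
import zlib

def select_rail(flow_id: str, rails: list[str], healthy: set[str]) -> str:
    """Pick a rail for a flow: hash to a home rail, fail over if down.

    zlib.crc32 is used rather than hash() because it is stable across
    processes, so every node maps the same flow to the same rail.
    """
    if not any(r in healthy for r in rails):
        raise RuntimeError("no healthy rails")
    start = zlib.crc32(flow_id.encode()) % len(rails)
    for i in range(len(rails)):
        rail = rails[(start + i) % len(rails)]
        if rail in healthy:
            return rail
```

Least-loaded selection would replace the hash with a load query but gives up the stable per-flow affinity that RC ordering relies on.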
### 2.9 Service Mesh for Agent Communication
The CHORUS/WHOOSH agent mesh requires a lightweight service mesh for mTLS, service discovery, load balancing, and circuit-breaking. The service mesh must function without a centralised control plane to avoid a single point of failure.
Key materials:
- Burns et al., "Borg, Omega, and Kubernetes" (ACM Queue 2016) — service mesh design considerations in large-scale systems
- Istio architecture documentation — sidecar proxy (Envoy), control plane (Istiod), mTLS design
- Linkerd2 documentation — lightweight Rust-based micro-proxy (`linkerd2-proxy`) sidecar model
- Envoy proxy documentation — xDS API, circuit-breaking, outlier detection
- Cilium documentation — eBPF-based service mesh without sidecar proxies; relevant to kernel-bypass agent communication
### 2.10 Multicast and Pub-Sub for Council Communication
Council pub-sub (HMMM protocol message distribution, vote broadcasts, research summary publication) requires an efficient multicast or pub-sub mechanism. With 60 agents per council, naive unicast produces O(n) messages per broadcast. Survey IP multicast, InfiniBand multicast, and application-layer pub-sub.
Key materials:
- Deering, "Multicast Routing in Datagram Internetworks and Extended LANs" (1991) — foundational IP multicast design
- InfiniBand multicast documentation — `ibmcast`, unreliable datagram multicast groups, LID-based group addressing
- ZeroMQ documentation — PUB/SUB pattern, EPGM (Encapsulated PGM multicast), NORM protocol
- NATS documentation — subject-based pub-sub, JetStream persistence, clustering — directly relevant as BACKBEAT uses NATS JetStream
- Eugster et al., "The Many Faces of Publish/Subscribe" (ACM Computing Surveys 2003) — pub-sub system classification
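Rough per-broadcast message counts make the trade-off between these mechanisms concrete. The figures below are order-of-magnitude approximations (GossipSub in particular is modelled crudely as mesh degree times subscribers, ignoring gossip metadata and deduplication):

```python
def broadcast_message_count(n_subscribers: int, mechanism: str,
                            gossip_degree: int = 6) -> int:
    """Approximate transmissions needed to deliver one broadcast.

    - "unicast": the broadcaster sends one copy per subscriber
    - "ib_multicast": one send, replicated inside the switch fabric
    - "gossipsub": each peer forwards to ~D mesh peers, so roughly
      D * n transmissions (upper-bound sketch, not the exact protocol)
    """
    if mechanism == "unicast":
        return n_subscribers
    if mechanism == "ib_multicast":
        return 1
    if mechanism == "gossipsub":
        return gossip_degree * n_subscribers
    raise ValueError(f"unknown mechanism: {mechanism}")
```

NATS JetStream sits between the extremes: fan-out happens at the NATS servers rather than the publisher, and delivery counts depend on cluster layout.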
---
## 3. Agent Roles
Total agents: **60**
| Role | Count | Responsibilities |
|------|-------|-----------------|
| Lead Architect | 2 | Network stack architecture decisions, cross-subsystem network interface ownership, topology model ownership |
| RDMA/InfiniBand Researchers | 7 | InfiniBand verbs, RoCE v2, QP model, memory region registration survey |
| UCX/NCCL Specialists | 6 | UCX API, NCCL topology and algorithm selection, GPU collective communication integration |
| libp2p/Control Plane Researchers | 5 | libp2p protocol suite, DHT-based peer discovery, stream multiplexing |
| QUIC Transport Specialists | 4 | QUIC protocol design, 0-RTT semantics, multipath/multi-rail QUIC |
| Topology Discovery Researchers | 4 | Automated cluster topology mapping, NVLink domain detection, IB subnet manager integration |
| Adaptive Routing and CC Specialists | 4 | DCQCN, ECMP, OpenSM AR, congestion control at scale |
| Multi-Rail Specialists | 3 | HCA failover, rail selection policy, per-flow affinity |
| Service Mesh Researchers | 4 | mTLS, eBPF-based service mesh, sidecar-less design, circuit-breaking |
| Pub-Sub / Multicast Researchers | 3 | IB multicast, NATS JetStream, council broadcast design |
| Formal Specification Authors | 6 | TLA+ specification of network protocol state machines, RDMA safety, routing convergence |
| Architects (sub-component) | 5 | Concrete architecture proposals for each network subsystem component |
| Internal Reviewers | 5 | Review proposals; green/yellow/red votes with rationale |
| Integration Liaisons | 4 | Interface with `council-mem`, `council-sched`, `council-sec`, `council-telemetry` |
| Decision Record Authors | 3 | Author DRs for all decision points; maintain UCXL provenance chain |
| Adversarial Critics | 3 | Surface congestion collapse scenarios, RDMA registration exhaustion, DHT partition risks |
**Role distribution rationale:** RDMA/InfiniBand and UCX/NCCL together receive 13 researchers because the data plane is the most technically complex and least tractable domain in this council. The formal specification group (6 agents) is proportionally smaller than those of `council-sched` and `council-mem`, reflecting that network protocol safety properties are narrower in scope (liveness and congestion-freedom rather than memory safety).
---
## 4. Key Deliverables
### 4.1 Research Summaries
```
ucxl://council-net:researcher@DistOS:networking/*^/research/rdma-infiniband-roce.md
ucxl://council-net:researcher@DistOS:networking/*^/research/ucx-nccl-data-plane.md
ucxl://council-net:researcher@DistOS:networking/*^/research/libp2p-control-plane.md
ucxl://council-net:researcher@DistOS:networking/*^/research/quic-transport.md
ucxl://council-net:researcher@DistOS:networking/*^/research/topology-discovery.md
ucxl://council-net:researcher@DistOS:networking/*^/research/adaptive-routing-congestion.md
ucxl://council-net:researcher@DistOS:networking/*^/research/multi-rail-networking.md
ucxl://council-net:researcher@DistOS:networking/*^/research/service-mesh-agent-comms.md
ucxl://council-net:researcher@DistOS:networking/*^/research/multicast-pubsub-council.md
```
### 4.2 Architecture Proposals
```
ucxl://council-net:architect@DistOS:networking/*^/architecture/network-stack-overview.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/rdma-transport-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/ucx-data-plane-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/control-plane-libp2p.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/topology-model.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/adaptive-routing-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/multi-rail-policy.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/service-mesh-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/council-pubsub-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/bandwidth-reservation.md
```
### 4.3 Decision Records
```
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-001-transport-selection.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-002-rdma-qp-model.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-003-control-plane-protocol.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-004-topology-discovery-mechanism.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-005-congestion-control-algorithm.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-006-nvlink-ib-domain-bridging.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-007-service-mesh-architecture.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-008-pubsub-mechanism.md
```
### 4.4 Formal Specifications
```
ucxl://council-net:verifier@DistOS:networking/*^/specs/RDMATransportProtocol.tla
ucxl://council-net:verifier@DistOS:networking/*^/specs/RoutingConvergence.tla
ucxl://council-net:verifier@DistOS:networking/*^/specs/CongestionControlSafety.tla
```
### 4.5 Interface Contracts
```
ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-mem-contract.md
ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-sched-contract.md
ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-sec-contract.md
ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-telemetry-contract.md
```
### 4.6 Topology Model (shared asset)
```
ucxl://council-net:architect@DistOS:networking/*^/topology/cluster-topology-model.json
ucxl://council-net:architect@DistOS:networking/*^/topology/nvlink-domain-map.json
ucxl://council-net:architect@DistOS:networking/*^/topology/ib-rail-assignments.json
```
These three artifacts are the primary outputs consumed by `council-sched` for topology-aware placement. They must be published before the end of Day 3.
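A plausible shape for one of these artifacts is sketched below; the field names are placeholders for illustration, not the schema the council will ratify:

```python
import json

# Hypothetical shape for nvlink-domain-map.json. Every key here is an
# illustrative assumption; the ratified schema comes out of DP-NET-004.
nvlink_domain_map = {
    "version": 1,
    "generated_by": "council-net",
    "domains": [
        {
            "domain_id": "nvd-0",
            "nodes": ["n0001"],
            "gpus": ["GPU-a1b2", "GPU-c3d4"],
            "nvswitch_connected": True,
        }
    ],
}
doc = json.dumps(nvlink_domain_map, indent=2)
```

Whatever schema is chosen, it should carry an explicit version field so `council-sched` can detect incompatible revisions.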
---
## 5. Decision Points
### DP-NET-001: Primary Transport Protocol Selection
**Question:** Should the DistOS network stack use a single unified transport (UCX, which provides a hardware-agnostic API over IB, RoCE, CUDA IPC, and TCP) or a layered model where different transports serve different roles (RDMA verbs for data plane, QUIC for control plane, TCP/IP for management)?
**Factors:** UCX simplicity vs. flexibility of purpose-specific transports, operational complexity (one configuration surface vs. three), UCX overhead at small message sizes relevant to control plane traffic, QUIC advantages (stream multiplexing, 0-RTT, connection migration) for agent communication that UCX does not provide.
**Recommendation bias from research:** A layered model is likely: UCX/RDMA for data plane (GPU collective, RDMA reads/writes), QUIC/libp2p for control plane (agent mesh, WHOOSH, SLURP), TCP/IP for management and compatibility. This requires explicit definition of which traffic classes use which transport.
### DP-NET-002: RDMA Queue Pair Model
**Question:** Should DistOS use Reliable Connected (RC) QPs (one QP per peer, guaranteed delivery, ordered), Unreliable Datagram (UD) QPs (one QP to all peers, scalable, no ordering), or Dynamically Connected (DC/DCT) QPs (a Mellanox/NVIDIA transport extension that establishes connections on demand, keeping per-node QP state roughly proportional to the number of concurrently active peers rather than to cluster size)?
**Factors:** RC QP count scales as O(n^2) for all-to-all communication at 1024 nodes (1M QPs is infeasible); UD requires application-level reliability; DC (NVIDIA-proprietary) scales to large clusters but limits portability.
**Dependency:** RDMA registration model (DP-MEM-008 in `council-mem`) must align with the QP model selected here.
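The O(n²) QP-count argument is easy to make concrete:

```python
def rc_qp_counts(n_nodes: int, qps_per_peer: int = 1) -> tuple[int, int]:
    """QP cost of a full-mesh Reliable Connected deployment.

    Each node holds one QP (or more) per remote peer, so per-node state
    grows as O(n) and cluster-wide state as O(n^2).
    """
    per_node = (n_nodes - 1) * qps_per_peer
    cluster_total = n_nodes * per_node
    return per_node, cluster_total

per_node, total = rc_qp_counts(1024)
# At 1024 nodes the cluster carries just over a million QPs of state,
# before multiplying by QPs-per-peer for multi-rail or per-GPU QPs.
```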
### DP-NET-003: Control Plane Protocol
**Question:** Should the DistOS control plane use libp2p (QUIC + Kademlia DHT + mDNS) as used by CHORUS/WHOOSH, a purpose-built cluster control plane (similar to etcd/Raft), or a hybrid where cluster-local discovery uses mDNS and global routing uses Kademlia DHT?
**Factors:** libp2p operational familiarity (CHORUS already uses it), DHT scalability to 1024 nodes, etcd reliability guarantees but centralisation risk, QUIC connection establishment overhead per agent peer, interaction with BACKBEAT (NATS JetStream) for pub-sub.
**Dependency:** This decision affects `council-synth` and `council-docs` — the control plane protocol is a foundational assumption for all inter-council communication.
### DP-NET-004: Topology Discovery Mechanism
**Question:** Should DistOS implement automatic topology discovery via (a) querying the InfiniBand subnet manager (OpenSM) for fabric topology, (b) active probing with `ibnetdiscover`-style sweeps, (c) passive LLDP/CDP-based discovery, or (d) a hybrid agent-reported topology where each node self-reports its NVLink domain membership, HCA rail assignments, and switch connectivity?
**Factors:** Convergence time (how long before a newly joined node's topology is reflected in placement decisions), accuracy of IB SM queries vs. active probing, interaction with WHOOSH agent discovery (agents could report their own topology as part of CHORUS join handshake).
**Output dependency:** The topology model artifacts (cluster-topology-model.json, nvlink-domain-map.json, ib-rail-assignments.json) must be published by end of Day 3 for `council-sched` to begin placement algorithm design.
### DP-NET-005: Congestion Control Algorithm
**Question:** For the RDMA data plane over RoCE, which congestion control algorithm should DistOS mandate? Options: (a) DCQCN (a DCTCP/QCN hybrid from Microsoft and Mellanox, built on RoCE v2's ECN/CNP mechanism), (b) TIMELY (RTT-based, by Google), (c) HPCC (High Precision Congestion Control, by Alibaba, requires INT telemetry), (d) Swift (Google, delay-based with ECN fallback).
**Factors:** ECN hardware support requirements, telemetry infrastructure dependencies (HPCC requires INT which requires `council-telemetry` infrastructure), performance at 1024-node allreduce traffic patterns, PFC interaction (PFC-free designs preferred to avoid head-of-line blocking cascades).
### DP-NET-006: NVLink-to-InfiniBand Domain Bridging
**Question:** GPU-to-GPU communication within an NVLink domain uses NVLink (900 GB/s bidirectional). Communication crossing an NVLink domain boundary must use InfiniBand (200 Gbps HDR per port). How should DistOS present this heterogeneous fabric to workloads? Options: (a) expose domain boundaries explicitly in the topology API; workloads are topology-aware, (b) provide a uniform address space with the OS transparently routing NVLink vs. IB, (c) NCCL-style: the collective library handles topology, the OS provides a topology file.
**Factors:** Programmer complexity, NCCL compatibility (c is the status quo for GPU training), overhead of transparent routing (b), latency impact of crossing domain boundaries.
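A back-of-envelope comparison using the figures quoted above shows how steep the bandwidth cliff at a domain boundary is (the 8-rail count is an assumed node configuration, not a stated cluster spec):

```python
# NVLink: 900 GB/s bidirectional per GPU, converted to Gbit/s.
NVLINK_GBPS = 900 * 8
# InfiniBand HDR: 200 Gbit/s per port.
HDR_PORT_GBPS = 200

def domain_crossing_ratio(ib_rails: int) -> float:
    """Factor by which intra-domain (NVLink) bandwidth exceeds the
    inter-domain (InfiniBand) bandwidth available through ib_rails ports."""
    return NVLINK_GBPS / (HDR_PORT_GBPS * ib_rails)

# A single HDR port is 36x slower than NVLink; even an assumed 8-rail
# node still pays a 4.5x penalty when traffic leaves the NVLink domain.
```

This asymmetry is the core argument for option (a) or (c): hiding the boundary behind a uniform address space (option b) hides a 4.5x-36x performance cliff.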
### DP-NET-007: Service Mesh Architecture
**Question:** For CHORUS/WHOOSH agent-to-agent communication (the service mesh layer), should DistOS use (a) a sidecar proxy model (Envoy/Linkerd), (b) an eBPF-based kernel-bypass service mesh (Cilium), or (c) the existing CHORUS P2P mesh (libp2p with HMMM protocol channels) without an additional service mesh layer?
**Factors:** Latency overhead of sidecar proxies vs. eBPF bypass, mTLS complexity, operational tooling maturity, overlap with existing CHORUS mesh implementation, interaction with `council-sec`'s mTLS certificate model.
### DP-NET-008: Council Pub-Sub Mechanism
**Question:** How should council-wide broadcasts (research summary publications, vote notifications, DR announcements) be delivered to all 60 agents in a council? Options: (a) NATS JetStream subjects with wildcard subscriptions (already used by BACKBEAT), (b) InfiniBand multicast groups (IB UD multicast, very low latency but complex group management), (c) libp2p GossipSub (probabilistic epidemic broadcast), (d) application-layer unicast fan-out from a designated broadcaster.
**Factors:** Delivery guarantee (at-least-once vs. at-most-once), ordering requirements (vote notifications must be totally ordered), latency, operational complexity, interaction with BACKBEAT NATS JetStream infrastructure.
**Recommendation bias from research:** NATS JetStream (option a) is the natural choice given BACKBEAT already provides it. The question is whether IB multicast is justified for latency-critical broadcasts.
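If option (a) is chosen, council broadcasts map naturally onto NATS subject hierarchies. The sketch below implements the documented NATS subject-matching semantics (`*` matches exactly one token, `>` matches one or more trailing tokens); the subject names in the test are hypothetical, not a committed naming scheme:

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """NATS-style subject matching over '.'-separated tokens."""
    p, s = pattern.split("."), subject.split(".")
    for i, tok in enumerate(p):
        if tok == ">":
            return len(s) > i  # '>' must match at least one token
        if i >= len(s):
            return False
        if tok != "*" and tok != s[i]:
            return False
    return len(p) == len(s)
```

A council could then publish votes under a per-council prefix and let agents subscribe with a single wildcard, leaving total ordering of vote notifications to JetStream's per-stream sequencing.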
---
## 6. Dependencies
### 6.1 What council-net Needs from Other Councils
| Dependency | Source Council | Artifact | Purpose |
|------------|---------------|---------|---------|
| RDMA memory registration interface | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-net-contract.md` | RDMA data plane requires agreed memory registration semantics (DP-MEM-008); QP design depends on this |
| Memory buffer alignment and pinning requirements | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/architecture/weka-gds-integration-design.md` | GPUDirect Storage requires specific buffer alignment and Weka GDS integration constraints |
| Placement decisions (GPU assignment map) | `council-sched` | `ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-net-contract.md` | Network-aware placement requires that `council-sched` feed placement events to the network stack for QP provisioning |
| Network-aware scheduling requirements | `council-sched` | `ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/placement-scoring-algorithm.md` | Topology model format must be compatible with placement scoring algorithm input requirements |
| mTLS certificate model and PKI design | `council-sec` | `ucxl://council-sec:architect@DistOS:security/*^/interfaces/sec-to-net-contract.md` | Service mesh mTLS and encrypted QUIC transport require PKI integration |
| Network bandwidth metering contract | `council-telemetry` | `ucxl://council-telemetry:architect@DistOS:telemetry/*^/interfaces/telemetry-to-net-contract.md` | Congestion control algorithm selection (DP-NET-005) may depend on INT telemetry infrastructure availability |
### 6.2 What Other Councils Need from council-net
| Consumer Council | Artifact Required | Purpose |
|-----------------|------------------|---------|
| `council-sched` | Topology model (cluster-topology, NVLink domain map, IB rail assignments) | Topology-aware placement scoring; must be published by end of Day 3 |
| `council-sched` | Network bandwidth reservation API | Gang scheduling for allreduce jobs requires co-reserving network bandwidth |
| `council-mem` | RDMA registration model decision (DP-NET-002 outcome) | Memory subsystem designs RDMA buffer registration around the QP model |
| `council-mem` | Inter-node bandwidth model and congestion control parameters | Page migration bandwidth budgeting must account for IB congestion control behaviour |
| `council-sec` | Transport security integration points | Security subsystem specifies mTLS enforcement points on the service mesh and QUIC transport |
| `council-telemetry` | Network event stream API | Metering of IB bandwidth usage, QP errors, and congestion events |
| `council-verify` | TLA+ RDMA transport protocol and routing convergence specs | Formal verification of RDMA safety (no message corruption) and routing liveness (convergence after topology change) |
| `council-synth` | Control plane protocol decision (DP-NET-003 outcome) | Inter-council synthesis depends on the control plane protocol as the communication substrate |
| All councils | Council pub-sub mechanism (DP-NET-008 outcome and configuration) | All councils use the pub-sub mechanism for broadcast communication within their member agents |
---
## 7. WHOOSH Configuration
```yaml
# WHOOSH council formation configuration for council-net
council_id: council-net
project: DistOS
subsystem: networking
gitea_label: chorus-entrypoint
gitea_repo: distos/networking
formation:
  target_agents: 60
  min_agents: 45
  wave:
    max_per_wave: 10
    min_per_wave: 5
    period_sec: 30
  placement:
    max_replicas_per_node: 2
    join_stagger_ms: 2000
    bootstrap_peers_min: 4
roles:
  - role: lead-architect
    count: 2
    model: claude-opus-4-6
    priority: high
  - role: researcher
    count: 32
    model: qwen2.5-coder:32b
    priority: normal
    subgroups:
      - tag: rdma-infiniband
        count: 7
      - tag: ucx-nccl
        count: 6
      - tag: libp2p-control-plane
        count: 5
      - tag: quic-transport
        count: 4
      - tag: topology-discovery
        count: 4
      - tag: routing-congestion
        count: 4
      - tag: multi-rail
        count: 3
      - tag: service-mesh
        count: 4
      - tag: pubsub-multicast
        count: 3
    # Note: researcher subgroup counts sum to 40 intentionally to allow
    # some agents to cover two adjacent domains (e.g., RDMA + UCX agents
    # may overlap). WHOOSH assigns 32 researchers but permits dual-tagging.
  - role: architect
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: verifier
    count: 6
    model: deepseek-coder-v2
    priority: normal
  - role: reviewer
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: integration-liaison
    count: 4
    model: qwen2.5-coder:32b
    priority: normal
  - role: decision-record-author
    count: 3
    model: claude-opus-4-6
    priority: normal
  - role: adversarial-critic
    count: 3
    model: claude-opus-4-6
    priority: normal
subchannels:
  - name: net-research
    description: "Network stack research and literature synthesis"
    participants: [researcher, lead-architect]
    pubsub: true
  - name: net-data-plane
    description: "RDMA, UCX, NCCL data plane design — high priority"
    participants: [researcher-rdma-infiniband, researcher-ucx-nccl, architect, lead-architect]
    pubsub: false
  - name: net-control-plane
    description: "libp2p, QUIC, DHT control plane design"
    participants: [researcher-libp2p-control-plane, researcher-quic-transport, architect, lead-architect]
    pubsub: false
  - name: net-topology
    description: "Topology discovery and model publication — time-critical (Day 3 deadline)"
    participants: [researcher-topology-discovery, architect, lead-architect, integration-liaison]
    pubsub: false
  - name: net-architecture
    description: "Architecture proposal discussion and voting"
    participants: [architect, lead-architect, reviewer, adversarial-critic]
    pubsub: true
  - name: net-formal-spec
    description: "TLA+ specification authoring and review"
    participants: [verifier, lead-architect, reviewer]
    pubsub: false
  - name: net-integration
    description: "Cross-council interface negotiation"
    participants: [integration-liaison, lead-architect]
    pubsub: false
  - name: net-decisions
    description: "Decision record authoring and consensus"
    participants: [decision-record-author, lead-architect, reviewer]
    pubsub: true
quorum:
  architecture_changes:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    beat_minutes: 20
    timeout_beats: 6
  # Topology model publication uses expedited quorum (Day 3 deadline)
  topology_model_publication:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 3
    expedited: true
    deadline_day: 3
  research_summaries:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 4
  formal_specs:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    require_verifier: true
    beat_minutes: 25
    timeout_beats: 8
  interface_contracts:
    policy: unanimous
    roles: [lead-architect, integration-liaison]
    beat_minutes: 30
    timeout_beats: 4
gates:
  kaching:
    p95_latency_ms: 250
    max_error_rate: 0.01
  backbeat:
    max_stream_lag: 200
  bootstrap:
    min_healthy_peers: 4
  join:
    min_success_rate: 0.80
review:
  beat_minutes: 20
  quorum:
    total_min: 3
    require_domain_role: true
    require_quality_role: true
  timeout_beats: 6
  no_self_approval: true
```
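The quorum settings above translate into concrete vote counts and time windows. A minimal sketch, assuming the threshold is applied to the count of eligible voters rounded up, and that the timeout window is simply `beat_minutes × timeout_beats` — both plausible but unconfirmed readings of the config:

```python
import math

def quorum_window(beat_minutes: int, timeout_beats: int) -> int:
    """Minutes before a vote times out: beat length times timeout beats."""
    return beat_minutes * timeout_beats

def votes_needed(eligible: int, threshold: float) -> int:
    """Approvals required to pass, assuming the threshold is a fraction of
    eligible voters rounded up (an assumption; the brief does not pin down
    the rounding rule)."""
    return math.ceil(eligible * threshold)

# architecture_changes: supermajority among, say, 13 eligible voters
assert votes_needed(13, 0.667) == 9       # ceil(8.671) -> 9 approvals
assert quorum_window(20, 6) == 120        # two-hour window

# topology_model_publication: expedited simple majority
assert votes_needed(13, 0.5) == 7         # ceil(6.5) -> 7 approvals
assert quorum_window(15, 3) == 45         # 45-minute expedited window
```

Note the asymmetry this produces: the expedited topology-model path resolves in at most 45 minutes, versus two hours for ordinary architecture changes, which is what makes the Day 3 deadline workable.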
---
## 8. Success Criteria
1. **Research completeness:** All 9 research domain summaries published to DHT with at least 5 primary references each, approved by council simple majority.
2. **Topology model published by Day 3:** The three topology model artifacts (cluster-topology-model.json, nvlink-domain-map.json, ib-rail-assignments.json) are published to DHT and their UCXL addresses communicated to `council-sched` by the end of Day 3. This is a hard dependency for scheduling placement design.
3. **Architecture coverage:** Architectural proposals exist for all 10 network stack components. The data plane (RDMA + UCX + NCCL) and control plane (libp2p + QUIC) proposals are the most critical and must be approved before Phase 3 begins.
4. **Decision records resolved:** All 8 decision points (DP-NET-001 through DP-NET-008) have corresponding Decision Records with at least 3 alternatives considered, evidence citations, and a chosen option ratified by council supermajority.
5. **Formal specifications:** TLA+ specifications are authored for the RDMA transport protocol (QP safety: no message duplication or loss under reliable connected mode), routing convergence (finite time to a stable routing table after a topology change), and congestion control safety (freedom from PFC pause storm deadlock). At least 2 of these must have model-checked invariants verified by `council-verify`.
6. **Interface contracts ratified:** All 4 interface contracts (to `council-mem`, `council-sched`, `council-sec`, `council-telemetry`) are co-signed by integration liaisons from both councils. The RDMA interface contract with `council-mem` must be co-signed before the end of Phase 2 (Day 6).
7. **Control plane protocol decision communicated:** DP-NET-003 outcome is communicated to `council-synth` and all other councils before Day 5, as the control plane protocol is a foundational assumption for inter-council communication.
8. **UCXL navigability:** Any Decision Record can be traced to the research summary motivating it within 5 UCXL hops.
9. **Adversarial review pass:** Each major architecture proposal has a documented adversarial critique and resolution. The congestion control design must specifically address the PFC pause storm scenario at 1024-node allreduce scale, and the QP model must address the O(n^2) QP count scalability problem with a documented mitigation (DC transport, UD with app-level reliability, or QP sharing).
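The QP-count concern in criterion 9 can be made concrete with back-of-envelope arithmetic. The sketch below is illustrative only: the DC initiator pool size is an assumed figure, not a design decision of this council.

```python
def rc_full_mesh_qps(nodes: int, qps_per_peer: int = 1) -> int:
    """RC transport: every node holds a QP per peer, so the cluster-wide
    count grows as n*(n-1) -- the O(n^2) problem named in criterion 9."""
    return nodes * (nodes - 1) * qps_per_peer

def dc_qps(nodes: int, dc_initiators_per_node: int = 16) -> int:
    """DC transport: a small pool of dynamically-connected initiators per
    node, independent of peer count (pool size here is an assumption)."""
    return nodes * dc_initiators_per_node

n = 1024
assert rc_full_mesh_qps(n) == 1_047_552   # ~1M QPs cluster-wide with RC
assert dc_qps(n) == 16_384                # ~16K with a DC initiator pool
```

At roughly a million reliable-connected QPs cluster-wide, per-QP state alone motivates the mitigations the criterion lists (DC transport, UD with application-level reliability, or QP sharing).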
---
## 9. Timeline
### Phase 1: Research and Survey (Days 1-3)
**Day 1:**
- WHOOSH forms council; 60 agents join via wave deployment
- Researchers self-assign to domain subgroups via `net-research` subchannel
- RDMA/IB, UCX/NCCL, and topology discovery research begin immediately — these are on the critical path for the Day 3 topology model publication deadline
- libp2p and QUIC control plane research begins in parallel
- Integration liaisons contact `council-sched` to understand topology model format requirements and `council-mem` to begin RDMA registration interface discussion
**Day 2:**
- Remaining domains surveyed: adaptive routing/congestion control, multi-rail, service mesh, pub-sub/multicast
- Research summaries drafted across all subgroups
- Topology discovery researchers draft a preliminary topology model in parallel with their research summaries (early, unapproved draft)
- Internal review cycle begins; green/yellow/red votes cast
**Day 3 (hard deadline: topology model artifacts published):**
- Research summaries revised; all 9 published to DHT with simple majority approval
- Topology model artifacts (cluster-topology-model.json, nvlink-domain-map.json, ib-rail-assignments.json) approved via expedited quorum (beat_minutes: 15, timeout_beats: 3) and UCXL addresses communicated to `council-sched`
- Adversarial critics challenge topology discovery completeness and RDMA scalability assumptions
- Research phase gate: all summaries approved; topology model published
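For illustration, a `cluster-topology-model.json` artifact might look like the sketch below. The field names and schema are assumptions — the actual format is negotiated with `council-sched` per the Day 1 liaison contact — but it shows the kind of information the artifact is expected to carry: topology class, per-node NVLink domain and rail membership, and switch tiers.

```json
{
  "schema_version": "0.1-draft",
  "topology_class": "fat-tree",
  "nodes": [
    {
      "node_id": "node-0000",
      "nvlink_domain": "nvd-00",
      "ib_rails": ["rail-0", "rail-1"],
      "leaf_switch": "leaf-00"
    }
  ],
  "switches": [
    {
      "switch_id": "leaf-00",
      "tier": "leaf",
      "uplinks": ["spine-00", "spine-01"]
    }
  ]
}
```

The companion artifacts (`nvlink-domain-map.json`, `ib-rail-assignments.json`) would factor the domain and rail views out of this model for direct consumption by `council-sched` placement scoring.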
### Phase 2: Architecture and Trade-offs (Days 3-6)
**Day 3-4:**
- DP-NET-001 (transport selection) proposed and voted — this is the highest-impact architectural decision
- DP-NET-002 (RDMA QP model) — joint session with `council-mem` RDMA liaison; DP-MEM-008 alignment confirmed
- DP-NET-003 (control plane protocol) proposed — result communicated to all councils as soon as resolved (target: Day 4)
**Day 4-5:**
- DP-NET-004 (topology discovery mechanism) resolved
- DP-NET-005 (congestion control) resolved — `council-telemetry` consulted on INT telemetry infrastructure availability
- Adaptive routing and multi-rail architecture proposals authored
**Day 5-6:**
- DP-NET-006 (NVLink/IB domain bridging) resolved — `council-sched` consulted
- DP-NET-007 (service mesh architecture) resolved — `council-sec` consulted on mTLS model
- DP-NET-008 (pub-sub mechanism) resolved — BACKBEAT team consulted on NATS JetStream configuration
- All 8 Decision Records drafted and ratified by council supermajority
- Architecture overview assembled; architecture phase gate passed
### Phase 3: Formal Specification (Days 6-10)
**Day 6-7:**
- TLA+ specification of RDMA transport protocol begun (QP state machine, message delivery safety)
- Routing convergence spec begun (Kademlia or IB SM topology distribution, convergence after link failure)
- `council-verify` given early access to spec drafts
**Day 7-8:**
- Congestion control safety spec authored (PFC storm freedom, DCQCN convergence)
- RDMA transport safety invariants: no duplicate delivery in RC mode, guaranteed delivery or error reporting
**Day 8-10:**
- Model checking submitted to `council-verify`
- Any counterexamples trigger architecture revision with updated DRs
- Formal spec versions pinned in DHT; UCXL addresses published
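Before the full TLA+ spec lands, the PFC "pause storm freedom" invariant can be prototyped cheaply: model the pause-dependency relation between ports as a directed graph and check it for cycles, since a cycle of mutually paused ports is the classic PFC deadlock. A minimal sketch — the port names and the dependency graphs are invented for illustration:

```python
def has_pause_cycle(pause_deps: dict[str, set[str]]) -> bool:
    """pause_deps[p] = ports whose pause frames can block port p.
    A cycle means a set of ports can all be waiting on each other:
    the PFC pause storm deadlock the TLA+ spec must rule out."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {p: WHITE for p in pause_deps}

    def visit(p: str) -> bool:
        colour[p] = GREY
        for q in pause_deps.get(p, ()):
            if colour.get(q, WHITE) == GREY:
                return True            # back edge: cycle found
            if colour.get(q, WHITE) == WHITE and q in pause_deps and visit(q):
                return True
        colour[p] = BLACK
        return False

    return any(colour[p] == WHITE and visit(p) for p in pause_deps)

# acyclic pause graph: safe
assert not has_pause_cycle({"a": {"b"}, "b": {"c"}, "c": set()})
# a -> b -> c -> a: deadlock-prone configuration
assert has_pause_cycle({"a": {"b"}, "b": {"c"}, "c": {"a"}})
```

This is only a sanity harness, not a substitute for model checking: the TLA+ spec must additionally cover dynamic dependency graphs that change as DCQCN throttles flows mid-allreduce.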
### Phase 4: Integration and Review (Days 10-12)
**Day 10-11:**
- Interface contracts with `council-mem`, `council-sched`, `council-sec`, `council-telemetry` finalised and co-signed
- Cross-council integration session: QP model validated with `council-mem` RDMA registration design
- Cross-council integration session: topology model consumed by `council-sched` placement scoring — confirm format compatibility
- `council-synth` engaged for control plane protocol conflicts with other councils
**Day 11-12:**
- Final council review of complete network specification
- Adversarial critics run PFC storm scenario, DHT partition scenario, and O(n^2) QP scale scenario
- All yellow votes addressed with documented mitigations
- Integration review gate: all interface contracts co-signed
### Phase 5: Documentation and Narrative (Days 12-14)
**Day 12-13:**
- Decision record authors produce narrative summaries of network design evolution
- `council-docs` receives complete network specification for standardised formatting
- UCXL navigability audit: spot-check 10 random decision paths
**Day 14:**
- Final specification published
- `council-arch` generates human-readable narrative of network stack design evolution
- Council dissolved; agents released back to WHOOSH pool