CHORUS/docs/distos/councils/03-network-stack.md

Council Design Brief: Network Stack

Council ID: council-net
Mission: Design the complete network stack for DistOS, encompassing RDMA transport (InfiniBand/RoCE) for GPU-to-GPU communication, the overlay network and control plane (libp2p), transport protocol selection (QUIC/TCP/UCX), network topology discovery, adaptive routing, congestion control at 1024-node scale, NVLink domain bridging, multi-rail networking, service mesh for agent communication, and multicast/broadcast for council pub-sub channels.
UCXL Base Address: ucxl://council-net:*@DistOS:networking/*
Agent Count: 60
Status: Constitution Phase — awaiting WHOOSH formation trigger
Created: 2026-02-24


1. Scope and Responsibilities

council-net owns the complete specification of the DistOS network subsystem. Scope boundaries are defined as follows.

In scope:

  • RDMA transport layer: InfiniBand verbs and RoCE v2 for GPU-to-GPU data plane; Queue Pair (QP) lifecycle, completion queue management, and memory region registration interface
  • Overlay network design: logical topology over the physical InfiniBand/Ethernet fabric; addressing, routing, and peer discovery for the control plane
  • libp2p integration for the DistOS control plane: peer discovery (mDNS, DHT-based), multiplexed streams, NAT traversal, and protocol negotiation
  • Transport protocol selection: QUIC for agent control plane communication, UCX (Unified Communication X) for GPU collective and point-to-point data plane, TCP/IP for compatibility and management traffic
  • Network topology discovery: automatic cluster topology mapping (fat-tree, dragonfly, or NVLink Switch topology), NVLink domain membership, InfiniBand subnet manager interface
  • Adaptive routing: traffic-aware routing, ECMP (Equal-Cost Multi-Path), InfiniBand OpenSM AR (Adaptive Routing), and Dragonfly-specific routing algorithms
  • Congestion control at 1024-node scale: ECN (Explicit Congestion Notification), PFC (Priority Flow Control), and DCQCN for RoCE; InfiniBand credit-based flow control; interaction with NCCL collectives
  • NVLink domain bridging: translating between NVLink-domain (intra-node or NVSwitch-connected GPUs) and InfiniBand-domain (inter-node) communication
  • Multi-rail networking: multiple IB HCAs per node, rail selection policy, failover across rails
  • Service mesh for agent (CHORUS/WHOOSH) communication: mTLS, sidecar-less or sidecar-based, service discovery, load balancing, and circuit-breaking
  • Multicast and broadcast for council pub-sub: efficient dissemination of research summaries, decision records, and vote notifications to council members
  • Formal specification of the DistOS network interface — the API surface exposed to the memory subsystem (RDMA registration), the scheduler (network-aware placement), and user workloads (NCCL-compatible collective API)

Out of scope (delegated):

  • Physical layer hardware (HCA firmware, cable selection, switch configuration) — these are hardware dependencies, not OS design decisions
  • GPU memory allocation and RDMA buffer management (delegated to council-mem; RDMA registration interface consumed)
  • Workload scheduling and GPU assignment (delegated to council-sched; topology hints provided as outputs)
  • Encrypted transport key management and certificate lifecycle (delegated to council-sec; TLS/mTLS integration points defined)
  • Resource metering for network bandwidth usage (delegated to council-telemetry; metering event API published)

2. Research Domains

2.1 RDMA: InfiniBand Verbs and RoCE

InfiniBand provides the primary high-bandwidth, low-latency transport for GPU-to-GPU data movement on this cluster. Understand the verbs API (ibverbs), Queue Pair state machine (RESET → INIT → RTR → RTS → ERROR), completion queues, and memory region registration. Survey RoCE v2 as the Ethernet-encapsulated RDMA alternative and understand its congestion control requirements.
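The QP state machine described above can be modeled as a small transition table. A minimal sketch (a toy model, not the ibverbs API — a real verbs program drives these transitions through `ibv_modify_qp`):

```python
from enum import Enum

class QPState(Enum):
    RESET = 0
    INIT = 1
    RTR = 2    # Ready To Receive
    RTS = 3    # Ready To Send
    ERROR = 4

# Legal transitions: a QP advances RESET -> INIT -> RTR -> RTS, any
# state may drop to ERROR, and ERROR recovers only via RESET.
LEGAL = {
    QPState.RESET: {QPState.INIT, QPState.ERROR},
    QPState.INIT:  {QPState.RTR, QPState.ERROR},
    QPState.RTR:   {QPState.RTS, QPState.ERROR},
    QPState.RTS:   {QPState.ERROR},
    QPState.ERROR: {QPState.RESET},
}

def modify_qp(current: QPState, target: QPState) -> QPState:
    """Reject any transition the verbs state machine would not allow."""
    if target not in LEGAL[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Connection setup is exactly the walk `RESET → INIT → RTR → RTS`; skipping a stage (e.g. `INIT → RTS`) is rejected, which mirrors how `ibv_modify_qp` fails on out-of-order attribute masks.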

Key materials:

  • Mellanox/NVIDIA InfiniBand Architecture Specification — QP model, RDMA read/write/atomic operations, immediate data
  • Kalia et al., "Using RDMA Efficiently for Key-Value Services" (SIGCOMM 2014) — RDMA design patterns, one-sided vs. two-sided operations
  • Kalia et al., "Design Guidelines for High Performance RDMA Systems" (USENIX ATC 2016) — QP scaling, SEND vs. RDMA READ trade-offs, doorbell batching
  • Mitchell et al., "Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store" (USENIX ATC 2013)
  • Rosenblum and Ousterhout, "The Design and Implementation of a Log-Structured File System" (SOSP 1991) — not RDMA-specific but directly relevant to RDMA-backed storage patterns
  • Pfefferle et al., "A Hybrid I/O Virtualization Framework for RDMA-capable Network Interfaces" (VEE 2015)
  • RoCE v2 specification (InfiniBand Trade Association) — UDP encapsulation, RoCEv2 header, ECN marking

2.2 UCX: Unified Communication X

UCX is the primary communication framework for DistOS data plane operations. It provides a hardware-agnostic API over InfiniBand, RoCE, CUDA IPC, CMA (Cross-Memory Attach), and TCP. NCCL uses UCX as an optional backend.

Key materials:

  • Shamis et al., "UCX: An Open Source Framework for HPC Network APIs and Beyond" (HOTI 2015) — UCX design and API overview
  • UCX documentation (openucx.org) — UCP (the high-level protocols API) context, endpoints, tagged messages, Active Messages, RDMA operations
  • UCX GPU-Direct RDMA integration guide — ucp_mem_map with UCS_MEMORY_TYPE_CUDA, ucx_perftest GPU benchmarks
  • Venkata et al., "OSU Micro Benchmarks" (OMB) — latency/bandwidth benchmarks for UCX transport selection validation
  • NVIDIA NCCL-UCX plugin documentation — NCCL_UCX_* environment variables, QP provisioning, registration cache

2.3 NCCL: NVIDIA Collective Communication Library

NCCL implements the collective communication patterns used by distributed training (allreduce, broadcast, reduce-scatter, all-gather). On Hopper clusters, NCCL uses NVLink for intra-node collectives and InfiniBand for inter-node. Understanding NCCL topology files and ring/tree algorithm selection is essential for network-aware scheduling.
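The bandwidth-optimality of ring allreduce reduces to a per-GPU traffic formula; a quick sketch of the arithmetic:

```python
def ring_allreduce_bytes_per_gpu(n_gpus: int, payload_bytes: int) -> float:
    """Bytes each GPU transmits in a ring allreduce of `payload_bytes`.

    Reduce-scatter sends (n-1)/n of the payload, all-gather sends
    another (n-1)/n, for 2(n-1)/n total -- approaching 2x the payload
    as n grows, independent of ring length (bandwidth-optimal).
    """
    return 2 * (n_gpus - 1) / n_gpus * payload_bytes

# 8 GPUs reducing 1 GiB of gradients: each GPU sends 1.75 GiB,
# whether the ring hops cross NVLink or InfiniBand.
print(ring_allreduce_bytes_per_gpu(8, 2**30) / 2**30)  # -> 1.75
```

The constant per-GPU volume is why ring wins at large message sizes, while tree algorithms win at small sizes where the 2(n-1) latency terms of the ring dominate.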

Key materials:

  • NCCL documentation and source (github.com/NVIDIA/nccl) — ncclCommInitRank, topology detection, algorithm selection (ring, tree, collnet)
  • Patarasuk and Yuan, "Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations" (J. Parallel Distrib. Comput. 2009) — ring-allreduce bandwidth analysis
  • Fedus et al., "Switch Transformers" (JMLR 2022) appendix on communication cost — practical allreduce scaling analysis
  • NCCL topology XML format — specifying NVLink domains, IB rail affinity, and switch hierarchy for NCCL algorithm tuning
  • NVIDIA Collective Communication Library Performance Notes — algorithm selection heuristics, tree vs. ring cross-over point

2.4 libp2p for the Control Plane

The DistOS control plane (agent coordination, WHOOSH council formation, SLURP DHT, UCXL resolution) runs over libp2p. Survey the libp2p protocol suite: peer identity (ed25519/secp256k1 keypairs), peer discovery (mDNS, Kademlia DHT), stream multiplexing (yamux, mplex), transport (QUIC, TCP), and NAT traversal.
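Kademlia routing rests on the XOR distance metric: a node ranks candidate peers by the integer value of `peer_id XOR target_key`. A minimal sketch:

```python
def xor_distance(a: bytes, b: bytes) -> int:
    """Kademlia distance: bitwise XOR of two IDs, read as an integer."""
    return int.from_bytes(a, "big") ^ int.from_bytes(b, "big")

def k_closest(target: bytes, peers: list, k: int = 3) -> list:
    """Peers sorted by XOR distance to `target` -- the core of a DHT lookup step."""
    return sorted(peers, key=lambda p: xor_distance(p, target))[:k]

# Toy 1-byte IDs for illustration (libp2p peer IDs are multihashes).
peers = [bytes([i]) for i in (0x10, 0x11, 0x40, 0xF0)]
print(k_closest(bytes([0x12]), peers, k=2))  # the two IDs sharing the 0x1* prefix
```

Because XOR distance is symmetric and unidirectional, each lookup step at least halves the distance to the target, giving the O(log n) hop count that makes Kademlia plausible at 1024 nodes.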

Key materials:

  • libp2p specification (github.com/libp2p/specs) — multiaddress format, transport protocols, peer routing
  • Maymounkov and Mazières, "Kademlia: A Peer-to-Peer Information System Based on the XOR Metric" (IPTPS 2002) — foundational DHT for libp2p routing
  • libp2p QUIC transport specification — 0-RTT handshake, connection migration, multiplexed streams without HoL blocking
  • IPFS documentation on libp2p — practical deployment patterns, bootstrap peer configuration, ambient peer discovery
  • Baumgart and Mies, "S/Kademlia: A Practicable Approach Towards Secure Key-Based Routing" (P2P 2007) — security hardening for Kademlia relevant to the CHORUS mesh

2.5 QUIC Protocol

QUIC provides the transport for DistOS agent control-plane communication (WHOOSH formation, SLURP DHT queries, UCXL resolution, BUBBLE decision records). QUIC's multiplexed streams, 0-RTT connection establishment, and connection migration over multiple network paths make it well-suited to the heterogeneous cluster environment.

Key materials:

  • Langley et al., "The QUIC Transport Protocol: Design and Internet-Scale Deployment" (SIGCOMM 2017) — QUIC design rationale and performance analysis
  • RFC 9000 — QUIC: A UDP-Based Multiplexed and Secure Transport
  • RFC 9001 — Using TLS to Secure QUIC
  • Zhang et al., "QUIC is not Quick Enough over Fast Internet" (WWW 2024) — QUIC performance limitations at high bandwidth relevant to cluster use
  • Cui et al., "QUIC is not Enough: Towards Wireless QUIC" — multipath extensions relevant to multi-rail networking

2.6 Network Topology Discovery

At 1024 nodes, manual topology configuration is not feasible. The network stack must automatically discover the cluster topology: fat-tree vs. dragonfly vs. NVLink Switch topology, rail assignments per node, and NVSwitch domain membership per GPU.
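One candidate mechanism is agent self-reporting (option d of DP-NET-004, below): each node reports its own NVLink domain and HCA rails at join time, and the control plane folds the reports into a cluster map. A hedged sketch — the report fields here are illustrative only, not a committed DistOS schema:

```python
def aggregate_topology(reports: list) -> dict:
    """Fold per-node self-reports into a cluster-wide topology map.

    Each report is assumed (hypothetically) to look like:
      {"node": "node-000", "nvlink_domain": "nvsw-0", "hcas": ["mlx5_0", "mlx5_1"]}
    """
    topo = {"nvlink_domains": {}, "rail_assignments": {}}
    for r in reports:
        # Group nodes sharing an NVSwitch domain; record each node's IB rails.
        topo["nvlink_domains"].setdefault(r["nvlink_domain"], []).append(r["node"])
        topo["rail_assignments"][r["node"]] = r["hcas"]
    return topo

reports = [
    {"node": "node-000", "nvlink_domain": "nvsw-0", "hcas": ["mlx5_0", "mlx5_1"]},
    {"node": "node-001", "nvlink_domain": "nvsw-0", "hcas": ["mlx5_0", "mlx5_1"]},
]
topo = aggregate_topology(reports)
```

Self-reports would still need cross-checking against the IB subnet manager's fabric view, since a node cannot observe switch-level connectivity on its own.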

Key materials:

  • Al-Fares et al., "A Scalable, Commodity Data Center Network Architecture" (SIGCOMM 2008) — fat-tree topology analysis
  • Kim et al., "Technology-Driven, Highly-Scalable Dragonfly Topology" (ISCA 2008) — dragonfly topology and routing
  • InfiniBand OpenSM subnet manager documentation — smpquery, ibnetdiscover, topology file format, AR (Adaptive Routing) configuration
  • NVIDIA GPU topology documentation — nvidia-smi topo -m output format, NVML topology queries
  • ibstat, ibstatus, perfquery — InfiniBand diagnostic tools relevant to topology verification

2.7 Adaptive Routing and Congestion Control

At 1024-node scale, static routing leads to hot-spots. Adaptive routing dynamically distributes traffic across equal-cost paths. Congestion control must prevent PFC pause storms, which can cascade into incast-style congestion collapse and circular buffer-dependency deadlocks.
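ECMP's core mechanism is deterministic flow hashing: every packet of a flow hashes to the same equal-cost path (preserving ordering), while distinct flows spread across paths. A minimal sketch:

```python
import hashlib

def ecmp_select(flow: tuple, paths: list) -> str:
    """Pick one equal-cost path by hashing the flow 5-tuple.

    Deterministic per flow (ordered delivery), roughly uniform across
    flows. Adaptive routing differs precisely here: it also weighs live
    congestion state instead of hashing blindly.
    """
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return paths[int.from_bytes(digest[:8], "big") % len(paths)]

paths = ["spine-0", "spine-1", "spine-2", "spine-3"]
flow = ("10.0.0.1", 4791, "10.0.1.9", 4791, "UDP")  # RoCE v2 rides UDP/4791
assert ecmp_select(flow, paths) == ecmp_select(flow, paths)  # stable per flow
```

The known ECMP failure mode follows directly from the sketch: a few elephant flows (e.g. allreduce rings) can hash onto the same path and saturate it, which is the hot-spot scenario adaptive routing exists to fix.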

Key materials:

  • Valadarsky et al., "Xpander: Towards Optimal-Performance Datacenters" (CoNEXT 2016) — adaptive topology design
  • Zhu et al., "Congestion Control for Large-Scale RDMA Deployments" (SIGCOMM 2015) — DCQCN algorithm for RoCE congestion control
  • NVIDIA Quantum InfiniBand switch AR documentation — per-packet vs. per-flow adaptive routing
  • Pfaff et al., "The Design and Implementation of Open vSwitch" (NSDI 2015) — SDN-based adaptive routing reference
  • Mittal et al., "TIMELY: RTT-based Congestion Control for the Datacenter" (SIGCOMM 2015) — RTT-based CC for RDMA
  • Firestone et al., "Azure Accelerated Networking" (NSDI 2018) — hyperscale network stack design patterns; Singh et al., "Jupiter Rising" (SIGCOMM 2015) — Google datacenter network architecture

2.8 Multi-Rail Networking

Each node in the cluster has multiple InfiniBand HCAs for aggregate bandwidth. The network stack must implement rail selection policy (hash-based, round-robin, least-loaded), per-flow rail affinity for ordered delivery, and HCA failover.
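The three rail-selection policies named above can be sketched side by side (illustrative pseudologic; real selection happens inside the transport layer):

```python
import itertools

def hash_rail(rails: list, flow: tuple) -> str:
    """Hash-based: per-flow affinity, so a flow's packets stay ordered on one rail."""
    return rails[hash(flow) % len(rails)]

def round_robin_rail(rails: list):
    """Round-robin: a selector that cycles rails; spreads load but breaks affinity."""
    counter = itertools.count()
    return lambda: rails[next(counter) % len(rails)]

def least_loaded_rail(rails: list, queued_bytes: dict) -> str:
    """Least-loaded: pick the rail with the smallest send queue right now."""
    return min(rails, key=lambda r: queued_bytes[r])

rails = ["mlx5_0", "mlx5_1"]
assert hash_rail(rails, ("src", "dst")) == hash_rail(rails, ("src", "dst"))
print(least_loaded_rail(rails, {"mlx5_0": 4096, "mlx5_1": 128}))  # -> mlx5_1
```

Failover composes with any of these: on HCA error, remove the rail from `rails` and re-run selection, accepting a one-time reordering for flows that had affinity to the failed rail.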

Key materials:

  • NVIDIA Multi-Rail documentation — NCCL_IB_HCA environment variable, rail selection in NCCL
  • Dragojević et al., "FaRM: Fast Remote Memory" (NSDI 2014) — multi-rail RDMA design patterns
  • Bai et al., "PIAS: Practical Information-Agnostic Flow Scheduling for Commodity Data Centers" (NSDI 2015) — flow scheduling relevant to multi-rail policy
  • InfiniBand bonding and link aggregation — ipoib bonding, rdma cm multi-path

2.9 Service Mesh for Agent Communication

The CHORUS/WHOOSH agent mesh requires a lightweight service mesh for mTLS, service discovery, load balancing, and circuit-breaking. The service mesh must function without a centralised control plane to avoid a single point of failure.

Key materials:

  • Burns et al., "Borg, Omega, and Kubernetes" (ACM Queue 2016) — service mesh design considerations in large-scale systems
  • Istio architecture documentation — sidecar proxy (Envoy), control plane (Istiod), mTLS design
  • Linkerd2 documentation — ultra-lightweight Rust sidecar proxy (linkerd2-proxy)
  • Envoy proxy documentation — xDS API, circuit-breaking, outlier detection
  • Cilium documentation — eBPF-based service mesh without sidecar proxies; relevant to kernel-bypass agent communication

2.10 Multicast and Pub-Sub for Council Communication

Council pub-sub (HMMM protocol message distribution, vote broadcasts, research summary publication) requires an efficient multicast or pub-sub mechanism. At 60 agents per council, naive unicast produces O(n) messages per broadcast. Survey IP multicast, InfiniBand multicast, and application-layer pub-sub.
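The message-count comparison behind that O(n) concern can be made concrete (counts only; delivery guarantees and latency differ sharply per mechanism):

```python
import math

def unicast_fanout_msgs(n_agents: int) -> int:
    """Naive unicast: the broadcaster sends one copy per other agent."""
    return n_agents - 1

def gossip_rounds(n_agents: int, fanout: int) -> int:
    """GossipSub-style epidemic broadcast reaches n agents in ~log_f(n) rounds."""
    return math.ceil(math.log(n_agents, fanout))

def multicast_msgs(n_agents: int) -> int:
    """IB/IP multicast: one message on the wire, replicated by the fabric."""
    return 1

# A 60-agent council: 59 unicast sends, ~4 gossip rounds at fanout 3,
# or a single multicast frame.
print(unicast_fanout_msgs(60), gossip_rounds(60, 3), multicast_msgs(60))
```

The counts explain the design space: multicast minimizes wire traffic but pushes group-membership complexity into the fabric, while gossip trades a few extra rounds for no infrastructure at all.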

Key materials:

  • Deering and Cheriton, "Multicast Routing in Datagram Internetworks and Extended LANs" (ACM TOCS 1990) — foundational IP multicast design
  • InfiniBand multicast documentation — ibmcast, unreliable datagram multicast groups, LID-based group addressing
  • ZeroMQ documentation — PUB/SUB pattern, EPGM (Encapsulated PGM multicast), NORM protocol
  • NATS documentation — subject-based pub-sub, JetStream persistence, clustering — directly relevant as BACKBEAT uses NATS JetStream
  • Eugster et al., "The Many Faces of Publish/Subscribe" (ACM Computing Surveys 2003) — pub-sub system classification

3. Agent Roles

Total agents: 60. Researcher rows below sum to 40 because some researchers are dual-tagged across adjacent domains; the WHOOSH configuration in Section 7 assigns 32 distinct researcher agents.

| Role | Count | Responsibilities |
| --- | --- | --- |
| Lead Architect | 2 | Network stack architecture decisions, cross-subsystem network interface ownership, topology model ownership |
| RDMA/InfiniBand Researchers | 7 | InfiniBand verbs, RoCE v2, QP model, memory region registration survey |
| UCX/NCCL Specialists | 6 | UCX API, NCCL topology and algorithm selection, GPU collective communication integration |
| libp2p/Control Plane Researchers | 5 | libp2p protocol suite, DHT-based peer discovery, stream multiplexing |
| QUIC Transport Specialists | 4 | QUIC protocol design, 0-RTT semantics, multipath/multi-rail QUIC |
| Topology Discovery Researchers | 4 | Automated cluster topology mapping, NVLink domain detection, IB subnet manager integration |
| Adaptive Routing and CC Specialists | 4 | DCQCN, ECMP, OpenSM AR, congestion control at scale |
| Multi-Rail Specialists | 3 | HCA failover, rail selection policy, per-flow affinity |
| Service Mesh Researchers | 4 | mTLS, eBPF-based service mesh, sidecar-less design, circuit-breaking |
| Pub-Sub / Multicast Researchers | 3 | IB multicast, NATS JetStream, council broadcast design |
| Formal Specification Authors | 6 | TLA+ specification of network protocol state machines, RDMA safety, routing convergence |
| Architects (sub-component) | 5 | Concrete architecture proposals for each network subsystem component |
| Internal Reviewers | 5 | Review proposals; green/yellow/red votes with rationale |
| Integration Liaisons | 4 | Interface with council-mem, council-sched, council-sec, council-telemetry |
| Decision Record Authors | 3 | Author DRs for all decision points; maintain UCXL provenance chain |
| Adversarial Critics | 3 | Surface congestion collapse scenarios, RDMA registration exhaustion, DHT partition risks |

Role distribution rationale: RDMA/InfiniBand and UCX/NCCL together receive 13 researchers because the data plane is the most technically complex and least tractable domain in this council. The formal specification group (6 agents) is proportionally smaller than its counterparts in council-sched and council-mem, reflecting that network protocol safety properties are narrower in scope (liveness and congestion-freedom rather than memory safety).


4. Key Deliverables

4.1 Research Summaries

ucxl://council-net:researcher@DistOS:networking/*^/research/rdma-infiniband-roce.md
ucxl://council-net:researcher@DistOS:networking/*^/research/ucx-nccl-data-plane.md
ucxl://council-net:researcher@DistOS:networking/*^/research/libp2p-control-plane.md
ucxl://council-net:researcher@DistOS:networking/*^/research/quic-transport.md
ucxl://council-net:researcher@DistOS:networking/*^/research/topology-discovery.md
ucxl://council-net:researcher@DistOS:networking/*^/research/adaptive-routing-congestion.md
ucxl://council-net:researcher@DistOS:networking/*^/research/multi-rail-networking.md
ucxl://council-net:researcher@DistOS:networking/*^/research/service-mesh-agent-comms.md
ucxl://council-net:researcher@DistOS:networking/*^/research/multicast-pubsub-council.md

4.2 Architecture Proposals

ucxl://council-net:architect@DistOS:networking/*^/architecture/network-stack-overview.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/rdma-transport-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/ucx-data-plane-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/control-plane-libp2p.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/topology-model.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/adaptive-routing-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/multi-rail-policy.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/service-mesh-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/council-pubsub-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/bandwidth-reservation.md

4.3 Decision Records

ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-001-transport-selection.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-002-rdma-qp-model.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-003-control-plane-protocol.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-004-topology-discovery-mechanism.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-005-congestion-control-algorithm.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-006-nvlink-ib-domain-bridging.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-007-service-mesh-architecture.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-008-pubsub-mechanism.md

4.4 Formal Specifications

ucxl://council-net:verifier@DistOS:networking/*^/specs/RDMATransportProtocol.tla
ucxl://council-net:verifier@DistOS:networking/*^/specs/RoutingConvergence.tla
ucxl://council-net:verifier@DistOS:networking/*^/specs/CongestionControlSafety.tla

4.5 Interface Contracts

ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-mem-contract.md
ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-sched-contract.md
ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-sec-contract.md
ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-telemetry-contract.md

4.6 Topology Model (shared asset)

ucxl://council-net:architect@DistOS:networking/*^/topology/cluster-topology-model.json
ucxl://council-net:architect@DistOS:networking/*^/topology/nvlink-domain-map.json
ucxl://council-net:architect@DistOS:networking/*^/topology/ib-rail-assignments.json

These three artifacts are the primary outputs consumed by council-sched for topology-aware placement. They must be published before the end of Day 3.
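An illustrative shape for cluster-topology-model.json — the actual schema is itself a council deliverable, so every field name below is hypothetical:

```python
import json

# Hypothetical artifact shape: a topology class, a generation counter so
# council-sched can detect stale models, and per-node membership data
# that cross-references the other two artifacts.
example = {
    "topology_class": "fat-tree",          # or "dragonfly", "nvlink-switch"
    "generation": 17,                      # bumped on every discovery sweep
    "nodes": [
        {
            "id": "node-0000",
            "nvlink_domain": "nvsw-00",    # cf. nvlink-domain-map.json
            "hcas": ["mlx5_0", "mlx5_1"],  # cf. ib-rail-assignments.json
            "leaf_switch": "leaf-00",
        }
    ],
}
print(json.dumps(example, indent=2))
```

Whatever schema the council ratifies, it must round-trip through JSON unchanged, since council-sched consumes these artifacts verbatim from the DHT.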


5. Decision Points

DP-NET-001: Primary Transport Protocol Selection

Question: Should the DistOS network stack use a single unified transport (UCX, which provides a hardware-agnostic API over IB, RoCE, CUDA IPC, and TCP) or a layered model where different transports serve different roles (RDMA verbs for data plane, QUIC for control plane, TCP/IP for management)?

Factors: UCX simplicity vs. flexibility of purpose-specific transports, operational complexity (one configuration surface vs. three), UCX overhead at small message sizes relevant to control plane traffic, QUIC advantages (stream multiplexing, 0-RTT, connection migration) for agent communication that UCX does not provide.

Recommendation bias from research: A layered model is likely: UCX/RDMA for data plane (GPU collective, RDMA reads/writes), QUIC/libp2p for control plane (agent mesh, WHOOSH, SLURP), TCP/IP for management and compatibility. This requires explicit definition of which traffic classes use which transport.

DP-NET-002: RDMA Queue Pair Model

Question: Should DistOS use Reliable Connected (RC) QPs (one QP per peer, guaranteed delivery, ordered), Unreliable Datagram (UD) QPs (one QP to all peers, scalable, no ordering), or Dynamically Connected (DC) transport (Mellanox/NVIDIA-proprietary; decouples QP count from peer count)?

Factors: cluster-wide RC QP count scales as O(n^2) for all-to-all communication (roughly 1M QPs at 1024 nodes, which is infeasible); UD requires application-level reliability; DC scales to large clusters but limits portability.
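The O(n²) arithmetic behind that factor, worked out for this cluster:

```python
def rc_qp_count(n_nodes: int, qps_per_peer_pair: int = 1) -> int:
    """Cluster-wide RC QP count for full all-to-all connectivity.

    An RC connection consumes one QP at each end, so every node holds
    n-1 QPs and the cluster-wide total is n * (n - 1).
    """
    return n_nodes * (n_nodes - 1) * qps_per_peer_pair

print(rc_qp_count(1024))  # 1,047,552 -- the ~1M figure cited above
```

Multiplying by QPs per peer pair (e.g. one per rail or per traffic class) makes the count worse, which is why DC or UD-with-application-reliability dominates at this scale.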

Dependency: RDMA registration model (DP-MEM-008 in council-mem) must align with the QP model selected here.

DP-NET-003: Control Plane Protocol

Question: Should the DistOS control plane use libp2p (QUIC + Kademlia DHT + mDNS) as used by CHORUS/WHOOSH, a purpose-built cluster control plane (similar to etcd/Raft), or a hybrid where cluster-local discovery uses mDNS and global routing uses Kademlia DHT?

Factors: libp2p operational familiarity (CHORUS already uses it), DHT scalability to 1024 nodes, etcd reliability guarantees but centralisation risk, QUIC connection establishment overhead per agent peer, interaction with BACKBEAT (NATS JetStream) for pub-sub.

Dependency: This decision affects council-synth and council-docs — the control plane protocol is a foundational assumption for all inter-council communication.

DP-NET-004: Topology Discovery Mechanism

Question: Should DistOS implement automatic topology discovery via (a) querying the InfiniBand subnet manager (OpenSM) for fabric topology, (b) active probing with ibnetdiscover-style sweeps, (c) passive LLDP/CDP-based discovery, or (d) a hybrid agent-reported topology where each node self-reports its NVLink domain membership, HCA rail assignments, and switch connectivity?

Factors: Convergence time (how long before a newly joined node's topology is reflected in placement decisions), accuracy of IB SM queries vs. active probing, interaction with WHOOSH agent discovery (agents could report their own topology as part of CHORUS join handshake).

Output dependency: The topology model artifacts (cluster-topology-model.json, nvlink-domain-map.json, ib-rail-assignments.json) must be published by end of Day 3 for council-sched to begin placement algorithm design.

DP-NET-005: Congestion Control Algorithm

Question: For the RDMA data plane over RoCE, which congestion control algorithm should DistOS mandate? Options: (a) DCQCN (a DCTCP/QCN hybrid developed by Microsoft and Mellanox; the de facto congestion control for RoCE v2), (b) TIMELY (RTT-based, by Google), (c) HPCC (High Precision Congestion Control, by Alibaba, requires INT telemetry), (d) Swift (Google, delay-based).

Factors: ECN hardware support requirements, telemetry infrastructure dependencies (HPCC requires INT which requires council-telemetry infrastructure), performance at 1024-node allreduce traffic patterns, PFC interaction (PFC-free designs preferred to avoid head-of-line blocking cascades).

DP-NET-006: NVLink/InfiniBand Domain Bridging

Question: GPU-to-GPU communication within an NVLink domain uses NVLink (900 GB/s bidirectional). Communication crossing an NVLink domain boundary must use InfiniBand (200 Gbps HDR per port). How should DistOS present this heterogeneous fabric to workloads? Options: (a) expose domain boundaries explicitly in the topology API; workloads are topology-aware, (b) provide a uniform address space with the OS transparently routing NVLink vs. IB, (c) NCCL-style: the collective library handles topology, the OS provides a topology file.

Factors: Programmer complexity, NCCL compatibility (c is the status quo for GPU training), overhead of transparent routing (b), latency impact of crossing domain boundaries.
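Under option (a), path selection reduces to a domain-membership lookup; a minimal sketch, where the domain map stands in for the nvlink-domain-map.json artifact and all names are illustrative:

```python
def select_fabric(gpu_a: str, gpu_b: str, nvlink_domain: dict) -> str:
    """Route within an NVSwitch domain over NVLink, across domains over IB."""
    if nvlink_domain[gpu_a] == nvlink_domain[gpu_b]:
        return "nvlink"      # ~900 GB/s bidirectional within the domain
    return "infiniband"      # 200 Gbps HDR per port, per rail

domains = {"gpu0": "nvsw-0", "gpu1": "nvsw-0", "gpu8": "nvsw-1"}
assert select_fabric("gpu0", "gpu1", domains) == "nvlink"
assert select_fabric("gpu0", "gpu8", domains) == "infiniband"
```

Options (b) and (c) move this same lookup elsewhere: into the OS routing layer (b) or into NCCL's topology file consumption (c); the decision is about who owns the branch, not whether it exists.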

DP-NET-007: Service Mesh Architecture

Question: For CHORUS/WHOOSH agent-to-agent communication (the service mesh layer), should DistOS use (a) a sidecar proxy model (Envoy/Linkerd), (b) an eBPF-based kernel-bypass service mesh (Cilium), or (c) the existing CHORUS P2P mesh (libp2p with HMMM protocol channels) without an additional service mesh layer?

Factors: Latency overhead of sidecar proxies vs. eBPF bypass, mTLS complexity, operational tooling maturity, overlap with existing CHORUS mesh implementation, interaction with council-sec's mTLS certificate model.

DP-NET-008: Council Pub-Sub Mechanism

Question: How should council-wide broadcasts (research summary publications, vote notifications, DR announcements) be delivered to all ~60 agents in a council? Options: (a) NATS JetStream subjects with wildcard subscriptions (already used by BACKBEAT), (b) InfiniBand multicast groups (IB UD multicast, very low latency but complex group management), (c) libp2p GossipSub (probabilistic epidemic broadcast), (d) application-layer unicast fan-out from a designated broadcaster.

Factors: Delivery guarantee (at-least-once vs. at-most-once), ordering requirements (vote notifications must be totally ordered), latency, operational complexity, interaction with BACKBEAT NATS JetStream infrastructure.

Recommendation bias from research: NATS JetStream (option a) is the natural choice given BACKBEAT already provides it. The question is whether IB multicast is justified for latency-critical broadcasts.


6. Dependencies

6.1 What council-net Needs from Other Councils

| Dependency | Source Council | Artifact | Purpose |
| --- | --- | --- | --- |
| RDMA memory registration interface | council-mem | ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-net-contract.md | RDMA data plane requires agreed memory registration semantics (DP-MEM-008); QP design depends on this |
| Memory buffer alignment and pinning requirements | council-mem | ucxl://council-mem:architect@DistOS:memory/*^/architecture/weka-gds-integration-design.md | GPUDirect Storage requires specific buffer alignment and Weka GDS integration constraints |
| Placement decisions (GPU assignment map) | council-sched | ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-net-contract.md | Network-aware placement requires that council-sched feed placement events to the network stack for QP provisioning |
| Network-aware scheduling requirements | council-sched | ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/placement-scoring-algorithm.md | Topology model format must be compatible with placement scoring algorithm input requirements |
| mTLS certificate model and PKI design | council-sec | ucxl://council-sec:architect@DistOS:security/*^/interfaces/sec-to-net-contract.md | Service mesh mTLS and encrypted QUIC transport require PKI integration |
| Network bandwidth metering contract | council-telemetry | ucxl://council-telemetry:architect@DistOS:telemetry/*^/interfaces/telemetry-to-net-contract.md | Congestion control algorithm selection (DP-NET-005) may depend on INT telemetry infrastructure availability |

6.2 What Other Councils Need from council-net

| Consumer Council | Artifact Required | Purpose |
| --- | --- | --- |
| council-sched | Topology model (cluster-topology, NVLink domain map, IB rail assignments) | Topology-aware placement scoring; must be published by end of Day 3 |
| council-sched | Network bandwidth reservation API | Gang scheduling for allreduce jobs requires co-reserving network bandwidth |
| council-mem | RDMA registration model decision (DP-NET-002 outcome) | Memory subsystem designs RDMA buffer registration around the QP model |
| council-mem | Inter-node bandwidth model and congestion control parameters | Page migration bandwidth budgeting must account for IB congestion control behaviour |
| council-sec | Transport security integration points | Security subsystem specifies mTLS enforcement points on the service mesh and QUIC transport |
| council-telemetry | Network event stream API | Metering of IB bandwidth usage, QP errors, and congestion events |
| council-verify | TLA+ RDMA transport protocol and routing convergence specs | Formal verification of RDMA safety (no message corruption) and routing liveness (convergence after topology change) |
| council-synth | Control plane protocol decision (DP-NET-003 outcome) | Inter-council synthesis depends on the control plane protocol as the communication substrate |
| All councils | Council pub-sub mechanism (DP-NET-008 outcome and configuration) | All councils use the pub-sub mechanism for broadcast communication within their member agents |

7. WHOOSH Configuration

# WHOOSH council formation configuration for council-net
council_id: council-net
project: DistOS
subsystem: networking
gitea_label: chorus-entrypoint
gitea_repo: distos/networking

formation:
  target_agents: 60
  min_agents: 45
  wave:
    max_per_wave: 10
    min_per_wave: 5
    period_sec: 30
  placement:
    max_replicas_per_node: 2
  join_stagger_ms: 2000
  bootstrap_peers_min: 4

roles:
  - role: lead-architect
    count: 2
    model: claude-opus-4-6
    priority: high
  - role: researcher
    count: 32
    model: qwen2.5-coder:32b
    priority: normal
    subgroups:
      - tag: rdma-infiniband
        count: 7
      - tag: ucx-nccl
        count: 6
      - tag: libp2p-control-plane
        count: 5
      - tag: quic-transport
        count: 4
      - tag: topology-discovery
        count: 4
      - tag: routing-congestion
        count: 4
      - tag: multi-rail
        count: 3
      - tag: service-mesh
        count: 4
      - tag: pubsub-multicast
        count: 3
  # Note: researcher subgroup counts sum to 40 intentionally to allow
  # some agents to cover two adjacent domains (e.g., RDMA + UCX agents
  # may overlap). WHOOSH assigns 32 researchers but permits dual-tagging.
  - role: architect
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: verifier
    count: 6
    model: deepseek-coder-v2
    priority: normal
  - role: reviewer
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: integration-liaison
    count: 4
    model: qwen2.5-coder:32b
    priority: normal
  - role: decision-record-author
    count: 3
    model: claude-opus-4-6
    priority: normal
  - role: adversarial-critic
    count: 3
    model: claude-opus-4-6
    priority: normal

subchannels:
  - name: net-research
    description: "Network stack research and literature synthesis"
    participants: [researcher, lead-architect]
    pubsub: true
  - name: net-data-plane
    description: "RDMA, UCX, NCCL data plane design — high priority"
    participants: [researcher-rdma-infiniband, researcher-ucx-nccl, architect, lead-architect]
    pubsub: false
  - name: net-control-plane
    description: "libp2p, QUIC, DHT control plane design"
    participants: [researcher-libp2p-control-plane, researcher-quic-transport, architect, lead-architect]
    pubsub: false
  - name: net-topology
    description: "Topology discovery and model publication — time-critical (Day 3 deadline)"
    participants: [researcher-topology-discovery, architect, lead-architect, integration-liaison]
    pubsub: false
  - name: net-architecture
    description: "Architecture proposal discussion and voting"
    participants: [architect, lead-architect, reviewer, adversarial-critic]
    pubsub: true
  - name: net-formal-spec
    description: "TLA+ specification authoring and review"
    participants: [verifier, lead-architect, reviewer]
    pubsub: false
  - name: net-integration
    description: "Cross-council interface negotiation"
    participants: [integration-liaison, lead-architect]
    pubsub: false
  - name: net-decisions
    description: "Decision record authoring and consensus"
    participants: [decision-record-author, lead-architect, reviewer]
    pubsub: true

quorum:
  architecture_changes:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    beat_minutes: 20
    timeout_beats: 6
  # Topology model publication uses expedited quorum (Day 3 deadline)
  topology_model_publication:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 3
    expedited: true
    deadline_day: 3
  research_summaries:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 4
  formal_specs:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    require_verifier: true
    beat_minutes: 25
    timeout_beats: 8
  interface_contracts:
    policy: unanimous
    roles: [lead-architect, integration-liaison]
    beat_minutes: 30
    timeout_beats: 4
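A sketch of how these quorum policies might be evaluated over a set of cast votes. The vote record shape and the role-class sets (which roles count as "domain" vs "quality") are illustrative assumptions; the thresholds mirror the config above:

```python
from dataclasses import dataclass

@dataclass
class Vote:
    agent: str
    role: str       # e.g. "architect", "reviewer", "verifier"
    approve: bool

# Which roles satisfy the require_* flags is an assumption for illustration.
DOMAIN_ROLES = {"architect", "lead-architect"}
QUALITY_ROLES = {"reviewer", "adversarial-critic"}

def quorum_passes(votes, threshold, require_domain_role=False,
                  require_quality_role=False, require_verifier=False):
    """Evaluate one quorum policy over the votes cast before timeout.
    The approval ratio must reach `threshold` (exact tie-breaking at the
    threshold is an assumption), and every required role class must
    appear among the approvers."""
    if not votes:
        return False
    approvers = [v for v in votes if v.approve]
    if len(approvers) / len(votes) < threshold:
        return False
    roles = {v.role for v in approvers}
    if require_domain_role and not roles & DOMAIN_ROLES:
        return False
    if require_quality_role and not roles & QUALITY_ROLES:
        return False
    if require_verifier and "verifier" not in roles:
        return False
    return True
```

Under this reading, the `formal_specs` policy additionally demands a verifier among the approvers, so a spec approved only by architects and reviewers is still rejected.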

gates:
  kaching:
    p95_latency_ms: 250
    max_error_rate: 0.01
  backbeat:
    max_stream_lag: 200
  bootstrap:
    min_healthy_peers: 4
  join:
    min_success_rate: 0.80
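The gate thresholds above can be checked mechanically against observed metrics. A sketch; whether each threshold is a ceiling or a floor is inferred from the metric name, which is an assumption:

```python
# Gate thresholds mirrored from the config above. Ceiling vs floor is
# inferred from each metric name (an assumption for illustration).
GATES = {
    "kaching":   {"p95_latency_ms": (250, "ceiling"), "max_error_rate": (0.01, "ceiling")},
    "backbeat":  {"max_stream_lag": (200, "ceiling")},
    "bootstrap": {"min_healthy_peers": (4, "floor")},
    "join":      {"min_success_rate": (0.80, "floor")},
}

def gate_passes(gate: str, observed: dict) -> bool:
    """True iff every observed metric satisfies its threshold."""
    for metric, (limit, kind) in GATES[gate].items():
        value = observed[metric]
        if kind == "ceiling" and value > limit:
            return False
        if kind == "floor" and value < limit:
            return False
    return True
```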

review:
  beat_minutes: 20
  quorum:
    total_min: 3
    require_domain_role: true
    require_quality_role: true
  timeout_beats: 6
  no_self_approval: true
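The review policy can be expressed the same way. A sketch, with role classes assumed as before; note how `no_self_approval` strips the author's own vote before the quorum count, and how `beat_minutes × timeout_beats` bounds the review window:

```python
def review_approved(author, green_votes, total_min=3,
                    beat_minutes=20, timeout_beats=6):
    """green_votes: (agent, role) pairs that approved.
    Returns (passed, window_minutes). Which roles count as domain vs
    quality is an assumption for illustration."""
    window_minutes = beat_minutes * timeout_beats   # 20 * 6 = 120 min
    # no_self_approval: the author's own green vote does not count.
    voters = [(a, r) for a, r in green_votes if a != author]
    roles = {r for _, r in voters}
    passed = (len(voters) >= total_min
              and bool(roles & {"architect", "lead-architect"})       # domain role
              and bool(roles & {"reviewer", "adversarial-critic"}))   # quality role
    return passed, window_minutes
```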

8. Success Criteria

  1. Research completeness: All 9 research domain summaries published to DHT with at least 5 primary references each, approved by council simple majority.

  2. Topology model published by Day 3: The three topology model artifacts (cluster-topology-model.json, nvlink-domain-map.json, ib-rail-assignments.json) are published to DHT and their UCXL addresses communicated to council-sched by the end of Day 3. This is a hard dependency for scheduling placement design.

  3. Architecture coverage: Architectural proposals exist for all 10 network stack components. The data plane (RDMA + UCX + NCCL) and control plane (libp2p + QUIC) proposals are the most critical and must be approved before Phase 3 begins.

  4. Decision records resolved: All 8 decision points (DP-NET-001 through DP-NET-008) have corresponding Decision Records with at least 3 alternatives considered, evidence citations, and a chosen option ratified by council supermajority.

  5. Formal specifications: TLA+ specifications exist for the RDMA transport protocol (QP safety: no message duplication or loss under reliable connected mode), routing convergence (finite time to a stable routing table after a topology change), and congestion control safety (freedom from PFC pause-storm deadlock). At least 2 of these must have model-checked invariants verified by council-verify.

  6. Interface contracts ratified: All 4 interface contracts (to council-mem, council-sched, council-sec, council-telemetry) are co-signed by integration liaisons from both councils. The RDMA interface contract with council-mem must be co-signed before the end of Phase 2 (Day 6).

  7. Control plane protocol decision communicated: DP-NET-003 outcome is communicated to council-synth and all other councils before Day 5, as the control plane protocol is a foundational assumption for inter-council communication.

  8. UCXL navigability: Any Decision Record can be traced to the research summary motivating it within 5 UCXL hops.

  9. Adversarial review pass: Each major architecture proposal has a documented adversarial critique and resolution. The congestion control design must specifically address the PFC pause storm scenario at 1024-node allreduce scale, and the QP model must address the O(n^2) QP count scalability problem with a documented mitigation (DC transport, UD with app-level reliability, or QP sharing).
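Criterion 9's O(n^2) concern is easy to make concrete. With Reliable Connected (RC) transport every endpoint pair needs its own QP, while Dynamically Connected (DC) transport replaces that with a small per-node initiator pool. A sketch of the arithmetic (the DC pool size of 16 is an illustrative assumption, not a recommendation):

```python
def rc_qps_per_node(n_nodes: int) -> int:
    # RC full mesh: one QP per remote peer, per node.
    return n_nodes - 1

def rc_qps_cluster(n_nodes: int) -> int:
    # Cluster-wide: n * (n - 1) QPs — the O(n^2) blow-up.
    return n_nodes * (n_nodes - 1)

def dc_qps_per_node(pool_size: int = 16) -> int:
    # DC: a fixed pool of DC initiators per node, independent of
    # cluster size (pool size is an illustrative assumption).
    return pool_size
```

At 1024 nodes, a full RC mesh needs 1023 QPs per node and over a million cluster-wide, whereas a DC pool stays constant as the cluster grows — which is why criterion 9 demands a documented mitigation.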


9. Timeline

Phase 1: Research and Survey (Days 1-3)

Day 1:

  • WHOOSH forms council; 60 agents join via wave deployment
  • Researchers self-assign to domain subgroups via net-research subchannel
  • RDMA/IB, UCX/NCCL, and topology discovery research begin immediately — these are on the critical path for the Day 3 topology model publication deadline
  • libp2p and QUIC control plane research begins in parallel
  • Integration liaisons contact council-sched to understand topology model format requirements and council-mem to begin RDMA registration interface discussion

Day 2:

  • Remaining domains surveyed: adaptive routing/congestion control, multi-rail, service mesh, pub-sub/multicast
  • Research summaries drafted across all subgroups
  • Topology discovery researchers draft preliminary topology model in parallel with research (early draft, not yet approved)
  • Internal review cycle begins; green/yellow/red votes cast

Day 3 (hard deadline: topology model artifacts published):

  • Research summaries revised; all 9 published to DHT with simple majority approval
  • Topology model artifacts (cluster-topology-model.json, nvlink-domain-map.json, ib-rail-assignments.json) approved via expedited quorum (beat_minutes: 15, timeout_beats: 3) and UCXL addresses communicated to council-sched
  • Adversarial critics challenge topology discovery completeness and RDMA scalability assumptions
  • Research phase gate: all summaries approved; topology model published

Phase 2: Architecture and Trade-offs (Days 3-6)

Day 3-4:

  • DP-NET-001 (transport selection) proposed and voted — this is the highest-impact architectural decision
  • DP-NET-002 (RDMA QP model) — joint session with council-mem RDMA liaison; DP-MEM-008 alignment confirmed
  • DP-NET-003 (control plane protocol) proposed — result communicated to all councils as soon as resolved (target: Day 4)

Day 4-5:

  • DP-NET-004 (topology discovery mechanism) resolved
  • DP-NET-005 (congestion control) resolved — council-telemetry consulted on INT telemetry infrastructure availability
  • Adaptive routing and multi-rail architecture proposals authored

Day 5-6:

  • DP-NET-006 (NVLink/IB domain bridging) resolved — council-sched consulted
  • DP-NET-007 (service mesh architecture) resolved — council-sec consulted on mTLS model
  • DP-NET-008 (pub-sub mechanism) resolved — BACKBEAT team consulted on NATS JetStream configuration
  • All 8 Decision Records drafted and ratified by council supermajority
  • Architecture overview assembled; architecture phase gate passed

Phase 3: Formal Specification (Days 6-10)

Day 6-7:

  • TLA+ specification of RDMA transport protocol begun (QP state machine, message delivery safety)
  • Routing convergence spec begun (Kademlia or IB SM topology distribution, convergence after link failure)
  • council-verify given early access to spec drafts

Day 7-8:

  • Congestion control safety spec authored (PFC storm freedom, DCQCN convergence)
  • RDMA transport safety invariants: no duplicate delivery in RC mode, guaranteed delivery or error reporting
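The "no duplicate delivery in RC mode" invariant can be previewed as an executable property check before the TLA+ version lands. A minimal sketch, with PSN handling simplified: a real RC responder also NAKs out-of-order packets and surfaces unrecoverable errors rather than silently dropping:

```python
def deliver_stream(psns):
    """Model an RC responder consuming packet sequence numbers (PSNs).
    Each PSN is delivered at most once, even if the wire redelivers it —
    the invariant the TLA+ spec must preserve."""
    expected = 0
    delivered = []
    for psn in psns:
        if psn == expected:
            delivered.append(psn)   # in-order, new packet: deliver
            expected += 1
        # psn < expected: duplicate retransmission — ACK silently, no delivery
        # psn > expected: out-of-order — a real responder NAKs; no delivery
    return delivered
```

Feeding a retransmitted burst through `deliver_stream` and asserting the output is duplicate-free gives a cheap regression check alongside model checking.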

Day 8-10:

  • Model checking submitted to council-verify
  • Any counterexamples trigger architecture revision with updated DRs
  • Formal spec versions pinned in DHT; UCXL addresses published

Phase 4: Integration and Review (Days 10-12)

Day 10-11:

  • Interface contracts with council-mem, council-sched, council-sec, council-telemetry finalised and co-signed
  • Cross-council integration session: QP model validated with council-mem RDMA registration design
  • Cross-council integration session: topology model consumed by council-sched placement scoring — confirm format compatibility
  • council-synth engaged for control plane protocol conflicts with other councils

Day 11-12:

  • Final council review of complete network specification
  • Adversarial critics run the PFC storm, DHT partition, and O(n^2) QP scale scenarios
  • All yellow votes addressed with documented mitigations
  • Integration review gate: all interface contracts co-signed

Phase 5: Documentation and Narrative (Days 12-14)

Day 12-13:

  • Decision record authors produce narrative summaries of network design evolution
  • council-docs receives complete network specification for standardised formatting
  • UCXL navigability audit: spot-check 10 random decision paths

Day 14:

  • Final specification published
  • council-arch generates a human-readable narrative of the network stack design evolution
  • Council dissolved; agents released back to WHOOSH pool