Initial DistOS project constitution and council design briefs
12 council design briefs for distributed OS specification project targeting 1024-node Hopper/Grace/Blackwell GPU cluster with Weka parallel filesystem. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
160
PROJECT-CONSTITUTION.md
Normal file
@@ -0,0 +1,160 @@
# DistOS: Distributed Operating System for Heterogeneous GPU Clusters

## Project Constitution

**Project ID:** `DistOS`
**UCXL Base:** `ucxl://*:*@DistOS:*`
**Target Platform:** 1024-node Hopper/Grace/Blackwell cluster with Weka parallel filesystem
**Created:** 2026-02-24
**Status:** Constitution Phase

---

## 1. Mission

Design a comprehensive, formally specified distributed operating system optimised for heterogeneous GPU clusters. The system must manage scheduling, memory, networking, fault tolerance, security, and observability across up to 1024 nodes equipped with NVIDIA Hopper, Grace, and Blackwell accelerators, backed by the Weka parallel filesystem.

This project serves a dual purpose:

1. **Primary:** Produce a rigorous, verifiable specification for a novel distributed OS
2. **Meta:** Demonstrate that ~1000 coordinated CHORUS agents can collaboratively solve a problem of this complexity, with every decision traceable by humans via UCXL

## 2. Guiding Principles

- **Formal First:** Every subsystem must have a formal specification (TLA+, Alloy, or equivalent) before implementation sketches
- **Decision Provenance:** Every architectural choice must be recorded as a Decision Record (DR) in the DHT with full UCXL addressing, including alternatives considered and rationale
- **Cross-Council Coherence:** No council operates in isolation; integration points must be explicitly defined and tracked
- **Human Navigability:** A human must be able to follow any decision chain from final spec back to initial research via UCXL temporal navigation
- **Self-Referential Awareness:** Agents should recognise they are designing a system they would ideally run on, and leverage that perspective

## 3. Council Structure

### Research & Design Councils

| Council ID | Domain | ~Agents | Brief |
|------------|--------|---------|-------|
| `council-sched` | Process Scheduling | ~80 | Heterogeneous GPU/CPU scheduling, workload placement, fair queuing |
| `council-mem` | Distributed Memory | ~80 | Memory model, Weka FS integration, caching, coherence |
| `council-net` | Network Stack | ~60 | P2P mesh, RDMA, overlay networks, transport protocols |
| `council-fault` | Fault Tolerance | ~60 | Consensus, failure detection, recovery, Byzantine resilience |
| `council-sec` | Security Model | ~60 | Capability-based security, isolation, attestation, key management |
| `council-telemetry` | Resource Accounting | ~40 | Metering, telemetry, cost attribution, SLO enforcement |

### Verification & Quality Councils

| Council ID | Domain | ~Agents | Brief |
|------------|--------|---------|-------|
| `council-verify` | Formal Verification | ~80 | TLA+ specs, model checking, invariant proofs, liveness properties |
| `council-qa` | Adversarial Testing | ~60 | Fuzzing, chaos engineering, fault injection, spec conformance |

### Integration & Communication Councils

| Council ID | Domain | ~Agents | Brief |
|------------|--------|---------|-------|
| `council-api` | API & Developer Experience | ~40 | System call interface, SDK design, ergonomics, POSIX compatibility |
| `council-synth` | Inter-Council Synthesis | ~100 | Cross-cutting conflict resolution, architectural coherence, trade-off analysis |
| `council-docs` | Specification Writing | ~40 | Technical writing, standards formatting, reference documentation |
| `council-arch` | Decision Archaeology | ~40 | UCXL history traversal, decision narrative generation, human-readable summaries |

### Meta-Council

| Council ID | Domain | ~Agents | Brief |
|------------|--------|---------|-------|
| `council-meta` | Project Governance | ~10 | Overall coordination, milestone tracking, council health, escalation |

**Total:** ~750-800 agents in councils + ~200 unassigned agents available for dynamic council formation
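
The lower bound of the total can be checked by summing the approximate per-council counts from the four tables above. The snippet is an arithmetic check only; the dictionary name is illustrative, and the values are the `~` figures from the tables rather than exact headcounts.

```python
# Approximate agent counts per council, copied from the tables above.
COUNCIL_AGENTS = {
    # Research & Design
    "council-sched": 80, "council-mem": 80, "council-net": 60,
    "council-fault": 60, "council-sec": 60, "council-telemetry": 40,
    # Verification & Quality
    "council-verify": 80, "council-qa": 60,
    # Integration & Communication
    "council-api": 40, "council-synth": 100,
    "council-docs": 40, "council-arch": 40,
    # Meta
    "council-meta": 10,
}

assigned = sum(COUNCIL_AGENTS.values())
print(assigned)  # 750
```

Adding the ~200 unassigned agents gives roughly 950, which is in line with the "~1000 coordinated CHORUS agents" figure in the Mission section.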

## 4. UCXL Addressing Conventions

### Council Artifacts

```
ucxl://council-{id}:{role}@DistOS:{subsystem}/*^/{artifact-type}/{name}
```

### Decision Records

```
ucxl://council-{id}:{role}@DistOS:{subsystem}/*^/decisions/{decision-id}.md
```

### Research Artifacts

```
ucxl://council-{id}:researcher@DistOS:{subsystem}/*^/research/{topic}.md
```

### Formal Specifications

```
ucxl://council-{id}:verifier@DistOS:{subsystem}/*^/specs/{component}.tla
```

### Narrative Summaries (Decision Archaeology)

```
ucxl://council-arch:narrator@DistOS:decision-history/*^/narratives/{period}-summary.md
```

### Cross-Council Integration

```
ucxl://council-synth:synthesizer@DistOS:integration/*^/conflicts/{council-a}-vs-{council-b}.md
```
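
As an illustration of how the templates above expand, the following sketch fills the general council-artifact pattern with concrete values. The function name `ucxl_address` and its parameters are hypothetical helpers, not part of UCXL; the `*^` segment is copied through verbatim because its semantics are not defined in this document.

```python
def ucxl_address(council_id: str, role: str, subsystem: str,
                 artifact_type: str, name: str,
                 project: str = "DistOS") -> str:
    """Fill the council-artifact template from section 4.

    Hypothetical helper: it only performs string substitution and does
    not validate any of the fields.
    """
    return (f"ucxl://council-{council_id}:{role}@{project}:{subsystem}"
            f"/*^/{artifact_type}/{name}")


example = ucxl_address("sched", "researcher", "scheduling",
                       "research", "gang-scheduling.md")
print(example)
# ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gang-scheduling.md
```

The other templates are special cases of the same pattern with a fixed role or artifact type, for example `role="verifier"` and `artifact_type="specs"` for formal specifications.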

## 5. Lifecycle Phases

### Phase 1: Research & Survey (Days 1-3)
- Each council surveys existing literature, systems, and approaches
- Research artifacts published to DHT with UCXL addresses
- Decision Archaeology agents begin tracking from day one

### Phase 2: Architecture & Trade-offs (Days 3-6)
- Councils propose architectural options with formal trade-off analysis
- Inter-Council Synthesis identifies conflicts and dependencies
- Key architectural decisions recorded as DRs

### Phase 3: Formal Specification (Days 6-10)
- TLA+/Alloy specifications written for each subsystem
- Verification council model-checks specs for safety and liveness
- QA council designs conformance test suites

### Phase 4: Integration & Review (Days 10-12)
- Cross-council integration review
- Conflict resolution via synthesis councils
- API surface finalised

### Phase 5: Documentation & Narrative (Days 12-14)
- Complete specification document assembled
- Decision archaeology produces human-readable narrative of entire project
- Final UCXL navigability audit

## 6. Success Criteria

1. **Completeness:** Formal specifications exist for all 6 core subsystems
2. **Verification:** At least 3 subsystems have model-checked TLA+ specs with proven invariants
3. **Traceability:** Any specification decision can be traced back through UCXL to the research that motivated it
4. **Human Navigability:** A person unfamiliar with the project can, using only UCXL addresses and the archaeology narratives, understand why a given design decision was made
5. **Coherence:** The synthesis council has resolved all identified cross-council conflicts
6. **Scale Proof:** The system successfully coordinated 500+ agents across 10+ concurrent councils

## 7. Dependencies

- **CHORUS v0.5.5+** with leader election and P2P mesh
- **SLURP** with Go port of storage, resolver, and temporal graph
- **WHOOSH** MVP with council formation and consensus
- **BUBBLE** for decision walkback queries
- **BACKBEAT** for distributed timing coordination
- **Weka FS** access on resetdata.ai platform
- **LLM access** via resetdata.ai API (Hopper/Blackwell inference)

## 8. Council Design Briefs

Each council has a detailed design brief in `docs/distos/councils/`:

- [Process Scheduling](councils/01-process-scheduling.md)
- [Distributed Memory](councils/02-distributed-memory.md)
- [Network Stack](councils/03-network-stack.md)
- [Fault Tolerance](councils/04-fault-tolerance.md)
- [Security Model](councils/05-security-model.md)
- [Resource Accounting](councils/06-resource-accounting.md)
- [Formal Verification](councils/07-formal-verification.md)
- [API & Developer Experience](councils/08-api-surface.md)
- [Inter-Council Synthesis](councils/09-inter-council-synthesis.md)
- [QA & Adversarial Testing](councils/10-qa-adversarial-testing.md)
- [Specification Writing](councils/11-documentation.md)
- [Decision Archaeology](councils/12-decision-archaeology.md)
535
councils/01-process-scheduling.md
Normal file
@@ -0,0 +1,535 @@
# Council Design Brief: Process Scheduling

**Council ID:** `council-sched`
**Mission:** Design the heterogeneous process scheduling subsystem for DistOS, covering GPU kernel dispatch, workload placement across Hopper/Grace/Blackwell accelerators, fair multi-tenant queuing, gang scheduling for distributed training, and energy-aware execution across a 1024-node cluster.
**UCXL Base Address:** `ucxl://council-sched:*@DistOS:scheduling/*`
**Agent Count:** 80
**Status:** Constitution Phase, awaiting WHOOSH formation trigger
**Created:** 2026-02-24

---

## 1. Scope and Responsibilities

`council-sched` owns the complete specification of the DistOS process and kernel scheduling subsystem. Scope boundaries are defined as follows.

**In scope:**

- GPU kernel scheduling: dispatch queue management, kernel concurrency, SM partitioning on Hopper (MIG) and Blackwell, and MPS (Multi-Process Service) lifetime management
- CPU-GPU co-scheduling on the GH200 Grace Hopper Superchip (NVLink-C2C coherent interconnect), including unified virtual address space scheduling implications
- Workload placement policy across heterogeneous accelerator types (H100, GH200, B200), including topology-aware affinity scoring
- Fair multi-tenant queuing: priority classes, weighted fair queuing, and quota enforcement
- Gang scheduling for distributed training workloads: all-or-nothing allocation, partial-allotment strategies, and backfill
- Preemption strategies: checkpoint-based preemption, time-sliced preemption, and priority inversion avoidance
- NUMA-aware placement across CPU sockets and GPU memory domains
- GPU memory oversubscription scheduling: eviction policy, swap-to-host, and coordinated demand management with `council-mem`
- Energy-aware scheduling: frequency/voltage scaling directives, power capping, and thermal headroom management
- Formal specification of the scheduling API surface exposed to user workloads and to other DistOS subsystems

**Out of scope (delegated):**

- Physical memory allocation and coherence protocol (delegated to `council-mem`)
- Network topology discovery and network-aware placement data (delegated to `council-net`; consumed as inputs)
- Metering, cost attribution, and SLO enforcement (delegated to `council-telemetry`, which consumes the scheduler's event stream as its input)
- Security isolation between tenants at the hardware level (delegated to `council-sec`; policies consumed as constraints)

---

## 2. Research Domains

### 2.1 GPU Kernel Scheduling and SM Partitioning

Survey the CUDA Concurrent Kernel Execution model, the NVIDIA Multi-Instance GPU (MIG) architecture on Hopper (H100), and the NVIDIA Multi-Process Service (MPS). Understand how MIG partitions a GPU into isolated GPC slices with dedicated HBM3 memory, and how MPS enables concurrent kernel execution within a single GPU context without full MIG isolation overhead.

Key materials:
- NVIDIA H100 Architecture Technical Overview (2022): SM partitioning, GPC layout, NVLink 4.0 bandwidth
- NVIDIA MIG User Guide (CUDA 12.x): partition profiles (1g.10gb through 7g.80gb), instance isolation, compute and memory capacity tables
- NVIDIA MPS documentation: shared context model, client limit (48 for Hopper), error containment limitations
- AMD ROCm Hardware Abstraction Layer (HAL) source: `amdkfd` KFD driver, compute queue management, HWS (Hardware Scheduler) in GFX12
- Jain et al., "Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs" (RTAS 2019): software-defined partitioning as a baseline comparison
- Xiao et al., "AntMan: Dynamic Scaling on GPU Clusters for Deep Learning" (OSDI 2020): GPU memory and compute sharing for co-located workloads

### 2.2 Heterogeneous Workload Placement

Develop a placement model that accounts for the distinct performance characteristics of H100 (PCIe/SXM5), GH200 Grace Hopper Superchip (NVLink-C2C), and B200 Blackwell (NVLink 5.0 SXM). Each accelerator type has different compute density, memory bandwidth, and interconnect topology.

Key materials:
- Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (ISCA 2017): heterogeneous accelerator placement rationale
- Google Borg paper: Verma et al., "Large-scale cluster management at Google with Borg" (EuroSys 2015): machine heterogeneity handling, alloc sets, resource estimation
- Pollux: Qiao et al., "Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning" (OSDI 2021): goodput-aware placement for training workloads
- Weng et al., "MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters" (NSDI 2022): real-world heterogeneous GPU cluster scheduling from Alibaba

### 2.3 Fair Multi-Tenant Queuing

Design a multi-level queueing architecture with Dominant Resource Fairness (DRF) semantics extended for GPU resources (SM fraction, HBM bandwidth, NVLink bandwidth as co-dominant resources).

Key materials:
- Ghodsi et al., "Dominant Resource Fairness: Fair Allocation of Multiple Resource Types" (NSDI 2011): foundational DRF theory
- Apache YARN CapacityScheduler and FairScheduler documentation: hierarchical queues, preemption policy, label-based node targeting
- Apache Mesos DRF implementation: `DominantShareAllocator`, offer model, role weights
- Kubernetes device plugin framework (k8s.io/device-plugins): GPU resource advertisement, extended resource scheduling
- Tiresias: Gu et al., "Tiresias: A GPU Cluster Manager for Distributed Deep Learning" (NSDI 2019): LAS (Least Attained Service) scheduling for DL jobs, 2DAS (2-dimensional attained service)
- Gandiva: Xiao et al., "Gandiva: Introspective Cluster Scheduling for Deep Learning" (OSDI 2018): time-sliced GPU sharing, job packing, migration
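
For orientation, the core DRF idea (repeatedly grant one task to the tenant with the lowest dominant share) can be sketched as below. This is a minimal illustration of the progressive-filling scheme described by Ghodsi et al., not a proposed DistOS design; the function name, the per-task demand vectors, and the choice to skip tenants whose next task no longer fits are assumptions of the sketch. GPU dimensions such as SM fraction or HBM bandwidth would simply be additional keys in the resource dictionaries.

```python
from typing import Dict


def drf_allocate(capacity: Dict[str, float],
                 demands: Dict[str, Dict[str, float]]) -> Dict[str, int]:
    """Return the number of tasks granted to each tenant under DRF.

    capacity: total amount of each resource.
    demands:  per-task demand vector for each tenant (hypothetical input shape).
    """
    for tenant, demand in demands.items():
        if not any(demand[r] > 0 for r in capacity):
            raise ValueError(f"{tenant} has an all-zero demand vector")

    used = {r: 0.0 for r in capacity}
    tasks = {t: 0 for t in demands}

    def dominant_share(t: str) -> float:
        # Largest fraction of any single resource currently held by tenant t.
        return max(tasks[t] * demands[t][r] / capacity[r] for r in capacity)

    def fits(t: str) -> bool:
        return all(used[r] + demands[t][r] <= capacity[r] for r in capacity)

    while True:
        runnable = [t for t in demands if fits(t)]
        if not runnable:
            return tasks
        winner = min(runnable, key=dominant_share)
        for r in capacity:
            used[r] += demands[winner][r]
        tasks[winner] += 1


# Worked example from the DRF paper: 9 CPUs and 18 GB of memory.
result = drf_allocate(
    {"cpu": 9, "mem_gb": 18},
    {"A": {"cpu": 1, "mem_gb": 4}, "B": {"cpu": 3, "mem_gb": 1}},
)
print(result)  # {'A': 3, 'B': 2}
```

Both tenants end with a dominant share of 2/3 (memory for A, CPU for B), which is the equalisation property the GPU extension would need to preserve across its additional resource dimensions.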

### 2.4 Gang Scheduling for Distributed Training

Gang scheduling ensures that all processes in a distributed training job (e.g., a 512-GPU allreduce ring) are scheduled at the same time. Without it, a partially scheduled job can stall or deadlock in collective communication while holding GPUs that other jobs are waiting for (head-of-line blocking).

Key materials:
- Feitelson and Rudolph, "Gang Scheduling Performance Benefits for Fine-Grain Synchronization" (J. Parallel Distrib. Comput., 1992): foundational gang scheduling analysis
- Rajachandrasekar et al., "A Closer Look at All-Reduce for Deep Learning" (Workshop at SC'19): collective communication scheduling sensitivity
- Hwang et al., "AFS: Annotation-Free Automatic Sharding for Large Language Models" (ICLR 2023): pipeline and tensor parallel placement co-scheduling
- Slurm gang scheduling documentation: `OverSubscribe=FORCE`, `GraceTime`, slurmctld gang plugin
- Jeon et al., "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads" (USENIX ATC 2019): Microsoft Philly cluster trace, gang failure modes
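
The all-or-nothing property can be made concrete with a small sketch. It assumes a hypothetical in-memory map of free GPUs per node and a greedy "most free GPUs first" ordering; neither is taken from the brief, and a real protocol would also have to handle concurrency, topology, and timeouts.

```python
from typing import Dict, Optional


def try_gang_allocate(free_gpus: Dict[str, int],
                      gpus_needed: int) -> Optional[Dict[str, int]]:
    """Allocate `gpus_needed` GPUs across nodes, or allocate nothing.

    Phase 1 builds a tentative plan without touching `free_gpus`.
    Phase 2 commits the plan only if the whole gang fits.
    """
    plan: Dict[str, int] = {}
    remaining = gpus_needed
    # Visit nodes with the most free GPUs first so the gang spans few nodes.
    for node in sorted(free_gpus, key=lambda n: free_gpus[n], reverse=True):
        if remaining == 0:
            break
        take = min(free_gpus[node], remaining)
        if take > 0:
            plan[node] = take
            remaining -= take

    if remaining > 0:
        return None  # gang does not fit: no partial allocation is committed

    for node, take in plan.items():
        free_gpus[node] -= take  # commit
    return plan


cluster = {"n0": 8, "n1": 4, "n2": 8}
first = try_gang_allocate(cluster, 16)   # fits on n0 and n2
second = try_gang_allocate(cluster, 8)   # only 4 GPUs left: rejected
print(first, second, cluster)
```

The second request is rejected even though 4 GPUs are free, and the cluster state is left unchanged, which is the behaviour that avoids a half-placed job holding resources.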

### 2.5 Preemption Strategies

Survey checkpoint-based, reactive, and time-sliced preemption. Understand the cost of GPU checkpoint (saving SM register file, shared memory, and in-flight DMA) and design preemption policies that bound maximum latency impact.

Key materials:
- Park et al., "Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization" (USENIX ATC 2020): container-level preemption
- Lin et al., "SHEPHERD: Serving DNNs in the Wild" (NSDI 2023): latency SLO-aware preemption for inference workloads
- NVIDIA CUDA checkpoint/restore (CRIU for GPU): experimental support status as of CUDA 12.x
- Wang et al., "Achieving Microsecond-Scale Tail Latency Efficiently with Approximate Optimal Scheduling" (SOSP 2017): preemption-aware scheduling theory

### 2.6 NUMA-Aware and Topology-Aware Placement

Model the NUMA topology of a GH200 Grace Hopper Superchip node: a 72-core Arm Neoverse V2 CPU with 480 GB LPDDR5X at ~512 GB/s, connected to the H100 GPU via NVLink-C2C at 900 GB/s. Compare this with standard SXM5 nodes, where CPU-GPU bandwidth is limited to PCIe Gen5 (~128 GB/s bidirectional).

Key materials:
- Linux NUMA scheduling documentation: `libnuma`, `numactl`, CFS NUMA balancing (`task_numa_migrate`)
- NVIDIA GH200 Grace Hopper Superchip Architecture Whitepaper (2023)
- Lepers et al., "Thread and Memory Placement on NUMA Systems: Asymmetry Matters" (USENIX ATC 2015)
- Blagodurov et al., "A Case for NUMA-Aware Contention Management on Multicore Systems" (USENIX ATC 2011)
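
To give a feel for the size of the gap between the two node types, the figures quoted above can be turned into an idealised transfer-time estimate. The sketch treats both numbers as aggregate bidirectional bandwidth and ignores latency and protocol overhead; the constant and function names are illustrative.

```python
# Bandwidth figures quoted in section 2.6, in GB/s (aggregate, bidirectional).
NVLINK_C2C_GB_PER_S = 900.0
PCIE_GEN5_GB_PER_S = 128.0


def transfer_seconds(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealised time to move `size_gb` between CPU and GPU memory."""
    return size_gb / bandwidth_gb_per_s


ratio = NVLINK_C2C_GB_PER_S / PCIE_GEN5_GB_PER_S
print(f"{ratio:.1f}x")                                        # 7.0x
print(f"{transfer_seconds(80, PCIE_GEN5_GB_PER_S):.3f} s")     # 0.625 s
print(f"{transfer_seconds(80, NVLINK_C2C_GB_PER_S):.3f} s")    # 0.089 s
```

A roughly sevenfold difference in CPU-GPU bandwidth is large enough that a placement model would plausibly need separate cost terms for the two node classes rather than a single NUMA distance.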

### 2.7 GPU Memory Oversubscription

When aggregate GPU memory demand exceeds physical HBM3 capacity, the scheduler must coordinate with `council-mem`'s eviction and tiering policies to decide which kernels to throttle, checkpoint, or migrate.

Key materials:
- Rhu et al., "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design" (MICRO 2016): layer-by-layer activation offload
- Huang et al., "Efficient Large Scale Language Modeling with Mixtures of Experts" (EMNLP 2021): memory-efficient model parallelism
- NVIDIA Unified Memory documentation (CUDA 12.x): page migration engine, oversubscription, prefetch advisory API
- Kumar et al., "SwapAdvisor: Push Deep Learning Beyond the GPU Memory Limit via Smart Swapping" (ASPLOS 2020)

### 2.8 Energy-Aware Scheduling

Design scheduling policies that honour per-node power caps enforced by Baseboard Management Controllers (BMCs) and exploit DVFS (Dynamic Voltage and Frequency Scaling) headroom reported by `council-telemetry`.

Key materials:
- Patel et al., "Clite: Efficient and QoS Aware Co-location of Multiple Latency-Critical Jobs for Warehouse Scale Computers" (ISCA 2020)
- Lim et al., "Adaptive Power Management for Heterogeneous Multiprocessor SoCs" (ICCAD 2009)
- NVIDIA NVML power management API: `nvmlDeviceSetPowerManagementLimit`, `nvmlDeviceGetCurrentClocksThrottleReasons`
- AMD ROCm SMI library: power cap interface (`rsmi_dev_power_cap_set`)
- RAPL (Running Average Power Limit) interface for Grace CPU power management

---

## 3. Agent Roles

Total agents: **80**

| Role | Count | Responsibilities |
|------|-------|-----------------|
| Lead Architect | 2 | Overall scheduling architecture decisions, cross-subsystem interface design, DR authorship for major decisions |
| Kernel Scheduling Researchers | 8 | CUDA/ROCm scheduler internals, MIG/MPS analysis, SM partitioning survey |
| Placement Researchers | 6 | Heterogeneous accelerator placement, topology modelling, workload profiling |
| Queueing Theory Specialists | 5 | DRF extensions for multi-resource GPU scheduling, fairness proof sketches |
| Gang Scheduling Specialists | 5 | Collective communication scheduling, all-or-nothing allocation protocols |
| Preemption Specialists | 4 | Checkpoint protocol design, preemption cost modelling |
| NUMA/Topology Analysts | 4 | NUMA topology modelling for Grace Superchip and SXM5 nodes |
| Energy Efficiency Researchers | 4 | Power capping, DVFS scheduling integration |
| Formal Specification Authors | 8 | TLA+ specification of scheduler state machine, safety and liveness invariants |
| Architects (sub-component) | 10 | Propose concrete scheduling algorithm designs for each domain |
| Internal Reviewers | 8 | Review research summaries and architecture proposals; cast green/yellow/red votes |
| Integration Liaisons | 6 | Interface with `council-mem`, `council-net`, `council-telemetry`, `council-sec` |
| Decision Record Authors | 5 | Author DRs for each resolved decision point; maintain UCXL provenance chain |
| Adversarial Critics | 5 | Challenge proposed designs; surface failure modes, starvation scenarios, livelock risks |

**Role distribution rationale:** The large researcher cohort (36 agents across the seven researcher, specialist, and analyst roles, covering the 8 research domains) reflects the breadth of scheduling literature. The formal specification group (8 agents) is sized to produce parallel TLA+ modules. Internal reviewers and adversarial critics (13 agents combined) ensure no architecture proposal passes without rigorous challenge.

---

## 4. Key Deliverables

All artifacts are published to the DHT and addressable via UCXL.

### 4.1 Research Summaries

```
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gpu-kernel-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/heterogeneous-placement.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/fair-queuing-drf.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gang-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/preemption-strategies.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/numa-topology.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/memory-oversubscription.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/energy-aware-scheduling.md
```

### 4.2 Architecture Proposals

```
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/scheduler-overview.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/mig-mps-partitioning-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/placement-scoring-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/drf-gpu-extension.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/gang-scheduler-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/preemption-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/energy-policy-interface.md
```

### 4.3 Decision Records

```
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-001-partition-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-002-placement-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-003-queuing-policy.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-004-gang-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-005-preemption-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-006-energy-interface.md
```

### 4.4 Formal Specifications

```
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/SchedulerStateMachine.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/GangScheduler.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/PreemptionProtocol.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/FairnessInvariants.tla
```

### 4.5 Interface Contracts (for other councils)

```
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-mem-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-net-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-telemetry-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-sec-contract.md
```

---

## 5. Decision Points

The following are the major architectural questions `council-sched` must resolve. Each decision must produce a Decision Record with alternatives considered, evidence from research, and rationale for the chosen option.

### DP-SCHED-001: Primary Partition Model

**Question:** Should the scheduler use MIG (hardware-enforced partition isolation), MPS (shared context with software-enforced limits), or a hybrid model where MIG is used for multi-tenant isolation and MPS for co-located jobs within a single tenant's partition?

**Factors:** isolation strength, context switch overhead, minimum allocation granularity, Blackwell support parity, operational complexity.

**Dependency:** Decision informs the security isolation model that `council-sec` will specify.

### DP-SCHED-002: Placement Scoring Architecture

**Question:** Should placement be computed by a centralised scoring service (Borg-style) or a distributed, bid-based negotiation (Mesos offer model)? How does topology affinity (NVLink domain proximity) weight against fairness constraints?

**Factors:** convergence time at 1024-node scale, fault tolerance of the placement service itself, stale topology information handling, NVLink bandwidth utilisation efficiency.
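
To make the "topology affinity versus fairness" trade-off in DP-SCHED-002 concrete, a centralised scorer could combine the two as a weighted sum. The sketch below is one hypothetical shape for such a function; the field names, the 0..1 normalisation, and the default weights are all assumptions for illustration and are not decided anywhere in this brief.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Candidate:
    """A candidate placement for one job (hypothetical structure)."""
    node: str
    topology_affinity: float  # 0..1, e.g. share of peers in the same NVLink domain
    fairness_score: float     # 0..1, higher keeps tenants closer to their fair share


def placement_score(c: Candidate,
                    w_topology: float = 0.6,
                    w_fairness: float = 0.4) -> float:
    """Weighted sum of the two objectives; the weights are illustrative."""
    return w_topology * c.topology_affinity + w_fairness * c.fairness_score


def best_candidate(candidates: Iterable[Candidate]) -> Candidate:
    return max(candidates, key=placement_score)


options = [
    Candidate("node-a", topology_affinity=1.0, fairness_score=0.0),
    Candidate("node-b", topology_affinity=0.5, fairness_score=1.0),
]
print(best_candidate(options).node)  # node-b
```

With these weights the perfectly co-located but unfair placement loses to the more balanced one; shifting `w_topology` towards 1.0 reverses the choice, which is the sensitivity the Decision Record would need to justify.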

### DP-SCHED-003: Multi-Resource Fairness Extension

**Question:** DRF was designed for CPU/memory. GPU workloads introduce additional dominant resources: SM fraction, HBM3 bandwidth, NVLink bandwidth, and NVSwitch port occupancy. How many resource dimensions does the fairness model track, and what is the computational complexity of DRF at 80+ resource types?

**Factors:** implementation tractability, approximation error bounds, interaction with per-tenant quota enforcement.

### DP-SCHED-004: Gang Scheduling Protocol

**Question:** Should gang scheduling use a two-phase reservation (reserve then commit) protocol, speculative allocation with rollback, or a backfill-with-hold strategy? What is the maximum tolerable scheduling delay for a 1024-GPU gang job?

**Factors:** cluster utilisation impact, deadlock risk in two-phase protocols, interaction with preemption.

### DP-SCHED-005: Preemption Granularity

**Question:** Should preemption operate at kernel-granularity (CUDA stream checkpointing), job-granularity (full process checkpoint via CRIU-for-GPU), or a hybrid that allows kernel preemption for short jobs and process-level preemption for long jobs?

**Factors:** checkpoint latency (kernel-level: microseconds; process-level: seconds to tens of seconds for large model weights), HBM3 save/restore bandwidth cost, preemption frequency requirements.
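
The hybrid option in DP-SCHED-005 implies some rule for choosing between the two granularities. The sketch below shows one possible rule of thumb, assuming process-level checkpoint time is dominated by copying GPU state at a fixed bandwidth; the function names, the `threshold` multiplier, and the example numbers are hypothetical.

```python
def checkpoint_seconds(state_gb: float, save_bandwidth_gb_per_s: float) -> float:
    """Idealised time to save `state_gb` of GPU state at the given bandwidth."""
    return state_gb / save_bandwidth_gb_per_s


def choose_preemption(expected_remaining_s: float,
                      state_gb: float,
                      save_bandwidth_gb_per_s: float,
                      threshold: float = 10.0) -> str:
    """Pick a preemption granularity for a running job.

    Process-level checkpointing is chosen only when the job is expected to
    run much longer than a full save-plus-restore round trip; otherwise the
    cheaper kernel-level preemption is used.
    """
    round_trip = 2 * checkpoint_seconds(state_gb, save_bandwidth_gb_per_s)
    if expected_remaining_s > threshold * round_trip:
        return "process"
    return "kernel"


# 80 GB of state saved at an assumed 64 GB/s takes 1.25 s each way.
print(checkpoint_seconds(80, 64))        # 1.25
print(choose_preemption(5, 80, 64))      # kernel
print(choose_preemption(3600, 80, 64))   # process
```

The 1.25 s figure falls inside the "seconds to tens of seconds" range quoted in the factors, and larger model states or slower save paths push the cost towards the upper end.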

### DP-SCHED-006: Grace Superchip Scheduling Model

**Question:** For GH200 nodes with NVLink-C2C, the CPU and GPU share a unified virtual address space. Should the scheduler treat CPU and GPU on a Grace node as a single scheduling unit, or as two separate but affinity-linked resources? How does this interact with the memory model specified by `council-mem`?

**Factors:** utilisation efficiency, memory model consistency requirements, API ergonomics for user workloads, migration cost if a job is split across a C2C boundary.

### DP-SCHED-007: Energy Policy Interface

**Question:** Should energy-aware scheduling be a first-class scheduling objective (a weight in the placement score function), a hard constraint (power cap as a resource limit), or an advisory mechanism (scheduler receives power headroom hints and applies them at its discretion)?

**Factors:** SLO compatibility, predictability of power-capped execution, interaction with thermal management, coordination protocol with `council-telemetry`.

---
|
||||||
|
|
||||||
|
## 6. Dependencies
|
||||||
|
|
||||||
|
### 6.1 What council-sched Needs from Other Councils
|
||||||
|
|
||||||
|
| Dependency | Source Council | Artifact | Purpose |
|------------|----------------|----------|---------|
| Memory pressure signals and eviction cost model | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-sched-contract.md` | GPU memory oversubscription scheduling requires eviction cost estimates to avoid scheduling kernels that will immediately thrash |
| HBM3/DDR5 bandwidth allocation model | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/architecture/bandwidth-allocation-model.md` | Bandwidth is a co-dominant scheduling resource; model must be consistent |
| Network topology map (NVLink domains, IB fabric) | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/architecture/topology-model.md` | Topology-aware placement requires accurate NVLink domain membership and IB bisection bandwidth per rack |
| Network bandwidth reservation API | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-sched-contract.md` | Gang scheduling for allreduce jobs requires co-reserving network bandwidth |
| Resource metering API contract | `council-telemetry` | `ucxl://council-telemetry:architect@DistOS:telemetry/*^/interfaces/telemetry-to-sched-contract.md` | Scheduler must feed placement and execution events to telemetry; power headroom signals flow back |
| Security isolation constraints | `council-sec` | `ucxl://council-sec:architect@DistOS:security/*^/interfaces/sec-to-sched-contract.md` | Which tenants may share a GPU partition, MPS context constraints, minimum isolation requirements per tenant class |
|
||||||
|
|
||||||
|
### 6.2 What Other Councils Need from council-sched
|
||||||
|
|
||||||
|
| Consumer Council | Artifact Required | Purpose |
|------------------|-------------------|---------|
| `council-mem` | Placement decisions and GPU assignment map | Memory subsystem needs to know which GPU a job is placed on to configure HBM3 address space and NVLink fabric attachment |
| `council-mem` | Preemption events and checkpoint triggers | Memory tiering must snapshot GPU memory on preemption |
| `council-net` | Job placement map with NVLink domain assignments | Network subsystem configures NCCL topology files and RDMA QP mappings based on scheduler placement |
| `council-telemetry` | Scheduling events stream (enqueue, dequeue, preempt, complete) | Metering and cost attribution require per-job lifecycle events |
| `council-sec` | Partition assignment per tenant | Security subsystem enforces isolation based on scheduler-assigned partition IDs |
| `council-verify` | TLA+ scheduler state machine spec | Formal verification council model-checks scheduler invariants |
|
||||||
|
| `council-verify` | TLA+ scheduler state machine spec | Formal verification council model-checks scheduler invariants |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. WHOOSH Configuration
|
||||||
|
|
||||||
|
```yaml
# WHOOSH council formation configuration for council-sched
council_id: council-sched
project: DistOS
subsystem: scheduling
gitea_label: chorus-entrypoint
gitea_repo: distos/scheduling

formation:
  target_agents: 80
  min_agents: 60
  wave:
    max_per_wave: 12
    min_per_wave: 6
    period_sec: 30
  placement:
    max_replicas_per_node: 2
  join_stagger_ms: 2000
  bootstrap_peers_min: 5

roles:
  - role: lead-architect
    count: 2
    model: claude-opus-4-6
    priority: high
  - role: researcher
    count: 37
    model: qwen2.5-coder:32b
    priority: normal
    subgroups:
      - tag: kernel-scheduling
        count: 8
      - tag: placement
        count: 6
      - tag: queuing-theory
        count: 5
      - tag: gang-scheduling
        count: 5
      - tag: preemption
        count: 4
      - tag: numa-topology
        count: 4
      - tag: energy
        count: 5
  - role: architect
    count: 10
    model: claude-opus-4-6
    priority: normal
  - role: verifier
    count: 8
    model: deepseek-coder-v2
    priority: normal
  - role: reviewer
    count: 8
    model: claude-opus-4-6
    priority: normal
  - role: integration-liaison
    count: 6
    model: qwen2.5-coder:32b
    priority: normal
  - role: decision-record-author
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: adversarial-critic
    count: 5
    model: claude-opus-4-6
    priority: normal

subchannels:
  - name: sched-research
    description: "Research discussion and literature synthesis"
    participants: [researcher, lead-architect]
    pubsub: true
  - name: sched-architecture
    description: "Architecture proposal discussion and voting"
    participants: [architect, lead-architect, reviewer, adversarial-critic]
    pubsub: true
  - name: sched-formal-spec
    description: "TLA+ specification authoring and review"
    participants: [verifier, lead-architect, reviewer]
    pubsub: false
  - name: sched-integration
    description: "Cross-council interface negotiation"
    participants: [integration-liaison, lead-architect]
    pubsub: false
  - name: sched-decisions
    description: "Decision record authoring and consensus"
    participants: [decision-record-author, lead-architect, reviewer]
    pubsub: true

quorum:
  # Architecture decisions require supermajority
  architecture_changes:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    beat_minutes: 20
    timeout_beats: 6
  # Research summaries require simple majority
  research_summaries:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 4
  # Formal specifications require supermajority with verifier sign-off
  formal_specs:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    require_verifier: true
    beat_minutes: 25
    timeout_beats: 8
  # Cross-council interface contracts require unanimous approval from
  # lead architects and integration liaisons
  interface_contracts:
    policy: unanimous
    roles: [lead-architect, integration-liaison]
    beat_minutes: 30
    timeout_beats: 4

gates:
  kaching:
    p95_latency_ms: 250
    max_error_rate: 0.01
  backbeat:
    max_stream_lag: 200
  bootstrap:
    min_healthy_peers: 5
  join:
    min_success_rate: 0.80

review:
  beat_minutes: 20
  quorum:
    total_min: 3
    require_domain_role: true
    require_quality_role: true
  timeout_beats: 6
  no_self_approval: true
```
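
To make the `quorum` policies above concrete, here is a minimal sketch of how a vote tally might be checked against them. It is an illustration under stated assumptions, not WHOOSH's actual evaluation logic: the `Vote` type, the `quorum_met` function, the membership of `DOMAIN_ROLES` and `QUALITY_ROLES`, and the choice of strict versus non-strict comparison per policy are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Set

# Assumed role groupings; the configuration does not define them.
DOMAIN_ROLES: Set[str] = {"lead-architect", "architect", "researcher"}
QUALITY_ROLES: Set[str] = {"reviewer", "adversarial-critic"}


@dataclass(frozen=True)
class Vote:
    agent: str
    role: str
    approve: bool


def quorum_met(votes: Iterable[Vote], policy: str, threshold: float,
               require_domain_role: bool = False,
               require_quality_role: bool = False,
               require_verifier: bool = False) -> bool:
    """Check an approval ratio plus role-coverage requirements."""
    votes = list(votes)
    if not votes:
        return False
    approving_roles = {v.role for v in votes if v.approve}
    ratio = sum(v.approve for v in votes) / len(votes)
    # Assumption: simple majority means strictly more than the threshold,
    # supermajority means at least the threshold.
    passed = ratio > threshold if policy == "simple_majority" else ratio >= threshold
    if require_domain_role and not (approving_roles & DOMAIN_ROLES):
        return False
    if require_quality_role and not (approving_roles & QUALITY_ROLES):
        return False
    if require_verifier and "verifier" not in approving_roles:
        return False
    return passed
```

One consequence worth checking against the intended policy: with `threshold: 0.667` and a non-strict comparison, exactly two approvals out of three (0.6666...) does not pass.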
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Success Criteria
|
||||||
|
|
||||||
|
1. **Research completeness:** All 8 research domain summaries published to DHT with at least 5 primary references each, reviewed and approved by at least 3 council agents.
|
||||||
|
|
||||||
|
2. **Architecture coverage:** Architectural proposals exist for all 7 major scheduling components (kernel dispatch, placement, queuing, gang, preemption, NUMA, energy). Each proposal addresses the 1024-node scale constraint explicitly.
|
||||||
|
|
||||||
|
3. **Decision records resolved:** All 7 decision points (DP-SCHED-001 through DP-SCHED-007) have a corresponding Decision Record with at least 3 alternatives considered, evidence citations, and a chosen option ratified by council supermajority.
|
||||||
|
|
||||||
|
4. **Formal specifications:** TLA+ specifications exist for the scheduler state machine, gang scheduling protocol, and preemption protocol. At least 2 of these must have model-checked properties (gang deadlock freedom as a safety invariant; no starvation for highest-priority jobs as a liveness property) verified by `council-verify`.
|
||||||
|
|
||||||
|
5. **Interface contracts ratified:** All 4 interface contracts (to `council-mem`, `council-net`, `council-telemetry`, `council-sec`) are co-signed by integration liaisons from both councils.
|
||||||
|
|
||||||
|
6. **UCXL navigability:** A human unfamiliar with the project should be able to navigate from any Decision Record to the research summary that motivated it using only UCXL temporal navigation, within 5 hops.
|
||||||
|
|
||||||
|
7. **Adversarial review pass:** Each major architecture proposal has at minimum one adversarial critique documented and a resolution recorded. No proposal advances to formal specification with an unresolved red-vote critique.
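
Criterion 6 (navigability within 5 hops) can be checked mechanically if provenance links are available as a graph. The sketch below assumes such a graph has already been exported as a plain adjacency mapping; the node labels and the `hops` function are hypothetical and do not reflect the real UCXL resolution API.

```python
from collections import deque
from typing import Dict, List, Optional


def hops(links: Dict[str, List[str]], start: str, goal: str) -> Optional[int]:
    """Breadth-first search: fewest provenance links from start to goal."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in links.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # goal is not reachable from start


# Hypothetical provenance chain for one decision record.
links = {
    "decisions/DR-SCHED-002": ["architecture/placement-proposal"],
    "architecture/placement-proposal": ["research/placement-summary"],
    "research/placement-summary": [],
}

distance = hops(links, "decisions/DR-SCHED-002", "research/placement-summary")
within_budget = distance is not None and distance <= 5
```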
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 9. Timeline
|
||||||
|
|
||||||
|
### Phase 1: Research and Survey (Days 1-3)
|
||||||
|
|
||||||
|
**Day 1:**
|
||||||
|
- WHOOSH forms council; all 80 agents join via wave deployment
|
||||||
|
- Researchers self-assign to domain subgroups via `sched-research` subchannel
|
||||||
|
- Literature survey begins: GPU kernel scheduling, placement, and fair queuing domains prioritised first (these are on the critical path for Decision Points DP-SCHED-001 through DP-SCHED-003)
|
||||||
|
- Integration liaisons make initial contact with `council-mem` and `council-net` to understand their timelines
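
As a rough check on the Day 1 formation step, the wave parameters from Section 7 (`target_agents: 80`, `max_per_wave: 12`, `period_sec: 30`) imply how long wave deployment takes. This sketch assumes every wave is filled to the cap and waves start exactly one period apart; `wave_schedule` is a hypothetical helper, and real formation would also be subject to the join and bootstrap gates.

```python
from typing import List, Tuple


def wave_schedule(target_agents: int, max_per_wave: int,
                  period_sec: int) -> List[Tuple[int, int]]:
    """Return (start_time_sec, wave_size) pairs, filling each wave to the cap."""
    waves = []
    remaining, start = target_agents, 0
    while remaining > 0:
        size = min(max_per_wave, remaining)
        waves.append((start, size))
        remaining -= size
        start += period_sec
    return waves


schedule = wave_schedule(target_agents=80, max_per_wave=12, period_sec=30)
```

Under these assumptions the council forms in seven waves, with the final wave of 8 agents starting 180 seconds after the first, which stays above `min_per_wave: 6`.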
|
||||||
|
|
||||||
|
**Day 2:**
|
||||||
|
- Remaining research domains surveyed: gang scheduling, preemption, NUMA, energy
|
||||||
|
- Research summaries drafted in parallel across subgroups
|
||||||
|
- First internal review cycle: reviewers read summaries and post green/yellow/red votes with rationale
|
||||||
|
- Lead architects synthesise research findings into a preliminary scheduling design space map
|
||||||
|
|
||||||
|
**Day 3:**
|
||||||
|
- Research summaries revised based on review feedback; final versions published to DHT
|
||||||
|
- Adversarial critics challenge assumptions in each summary
|
||||||
|
- Research phase gate: all 8 summaries must achieve simple majority approval before Phase 2 begins
|
||||||
|
- Preliminary interface contract outlines shared with dependency councils
|
||||||
|
|
||||||
|
### Phase 2: Architecture and Trade-offs (Days 3-6)
|
||||||
|
|
||||||
|
**Day 3-4:**
|
||||||
|
- Architects propose concrete options for DP-SCHED-001 (partition model) and DP-SCHED-002 (placement scoring) — these are the highest-dependency decisions
|
||||||
|
- Adversarial critics engage immediately; all alternatives documented
|
||||||
|
- DP-SCHED-001 decision record drafted; council votes; DR published
|
||||||
|
|
||||||
|
**Day 4-5:**
|
||||||
|
- Queuing model (DP-SCHED-003), gang scheduling (DP-SCHED-004), and preemption (DP-SCHED-005) design proposals concurrently authored
|
||||||
|
- Inter-council synthesis session with `council-mem` to align on oversubscription and eviction signal interfaces
|
||||||
|
- Inter-council synthesis session with `council-net` to align on topology model input format
|
||||||
|
|
||||||
|
**Day 5-6:**
|
||||||
|
- Grace Superchip model (DP-SCHED-006) and energy interface (DP-SCHED-007) decisions resolved
|
||||||
|
- All 7 Decision Records drafted; supermajority vote on each
|
||||||
|
- Architecture overview document assembled from approved DRs
|
||||||
|
- Architecture phase gate: all DPs resolved before Phase 3 begins
|
||||||
|
|
||||||
|
### Phase 3: Formal Specification (Days 6-10)
|
||||||
|
|
||||||
|
**Day 6-7:**
|
||||||
|
- Verifiers begin TLA+ specification of the scheduler state machine based on approved architecture
|
||||||
|
- Architects continue refining component-level designs to resolve any ambiguities surfaced during spec authoring
|
||||||
|
- `council-verify` engaged: share spec drafts for early model-checking feedback
|
||||||
|
|
||||||
|
**Day 7-8:**
|
||||||
|
- Gang scheduler TLA+ module authored; parallel with preemption protocol spec
|
||||||
|
- Fairness invariants formally stated: no starvation under bounded load, gang deadlock freedom, DRF monotonicity
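
Since DRF monotonicity is one of the invariants named above, a small executable model of Dominant Resource Fairness can help when stating it. The sketch below is a simplified progressive-filling loop, not the scheduler's algorithm: `dominant_share` and `drf_allocate` are hypothetical names, and the numbers follow the well-known two-user example from the DRF paper (Ghodsi et al., NSDI 2011).

```python
from typing import Dict

Resources = Dict[str, float]


def dominant_share(usage: Resources, capacity: Resources) -> float:
    """A tenant's dominant share is its largest per-resource usage fraction."""
    return max(usage[r] / capacity[r] for r in capacity)


def drf_allocate(capacity: Resources,
                 demands: Dict[str, Resources]) -> Dict[str, int]:
    """Repeatedly grant one task to the tenant with the lowest dominant share."""
    used = {r: 0.0 for r in capacity}
    alloc = {t: {r: 0.0 for r in capacity} for t in demands}
    tasks = {t: 0 for t in demands}
    while True:
        # sorted() is stable, so ties fall back to insertion order.
        for t in sorted(demands, key=lambda t: dominant_share(alloc[t], capacity)):
            d = demands[t]
            if all(used[r] + d[r] <= capacity[r] for r in capacity):
                for r in capacity:
                    used[r] += d[r]
                    alloc[t][r] += d[r]
                tasks[t] += 1
                break
        else:
            return tasks  # no tenant's next task fits


result = drf_allocate(
    capacity={"cpu": 9, "mem": 18},
    demands={"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}},
)
```

A model like this can serve as an oracle when testing the property on small instances before it is stated in TLA+.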
|
||||||
|
|
||||||
|
**Day 8-10:**
|
||||||
|
- Model checking runs submitted to `council-verify`
|
||||||
|
- Counterexample analysis: any liveness or safety violations trigger architecture revision and updated DRs
|
||||||
|
- Formal spec versions pinned in DHT; UCXL addresses published to dependent councils
|
||||||
|
|
||||||
|
### Phase 4: Integration and Review (Days 10-12)
|
||||||
|
|
||||||
|
**Day 10-11:**
|
||||||
|
- Interface contracts with `council-mem`, `council-net`, `council-telemetry`, `council-sec` finalised and submitted for co-signature
|
||||||
|
- Cross-council integration session: scheduling placement decisions validated against network topology model
|
||||||
|
- `council-synth` engaged for any unresolved conflicts with other councils' specifications
|
||||||
|
|
||||||
|
**Day 11-12:**
|
||||||
|
- Final council review of complete scheduling specification
|
||||||
|
- Adversarial critics run end-to-end failure scenario analysis
|
||||||
|
- Any remaining yellow votes addressed with documented mitigations
|
||||||
|
- Integration review gate: all interface contracts co-signed
|
||||||
|
|
||||||
|
### Phase 5: Documentation and Narrative (Days 12-14)
|
||||||
|
|
||||||
|
**Day 12-13:**
|
||||||
|
- Decision record authors produce narrative summaries of the scheduling architecture journey
|
||||||
|
- `council-docs` receives the complete scheduling specification package for standardised formatting
|
||||||
|
- UCXL navigability audit: spot-check 10 random decision paths for completeness
|
||||||
|
|
||||||
|
**Day 14:**
|
||||||
|
- Final specification published
|
||||||
|
- `council-arch` decision archaeology agents generate human-readable narrative of scheduling design evolution
|
||||||
|
- Council formally dissolved; agents released back to WHOOSH pool
|
||||||
586
councils/02-distributed-memory.md
Normal file
@@ -0,0 +1,586 @@
|
|||||||
|
# Council Design Brief: Distributed Memory
|
||||||
|
|
||||||
|
**Council ID:** `council-mem`
|
||||||
|
**Mission:** Design the distributed memory model for DistOS, encompassing the tiered memory hierarchy (HBM3 → DDR5 → NVMe → Weka), Weka parallel filesystem integration, cache coherence at cluster scale, GPU unified memory and managed memory policies, memory-mapped I/O over Weka, page migration, and memory pressure handling across a 1024-node Hopper/Grace/Blackwell cluster.
|
||||||
|
**UCXL Base Address:** `ucxl://council-mem:*@DistOS:memory/*`
|
||||||
|
**Agent Count:** 80
|
||||||
|
**Status:** Constitution Phase — awaiting WHOOSH formation trigger
|
||||||
|
**Created:** 2026-02-24
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Scope and Responsibilities
|
||||||
|
|
||||||
|
`council-mem` owns the complete specification of the DistOS memory subsystem. Scope boundaries are defined as follows.
|
||||||
|
|
||||||
|
**In scope:**
|
||||||
|
|
||||||
|
- Distributed shared memory (DSM) model versus message-passing model decision and formal specification
- Weka WekaFS integration: POSIX semantics, parallel I/O patterns, consistency model at cluster scale, Weka client mount configuration for GPU nodes
- NVLink/NVSwitch memory fabric on Hopper: peer-to-peer GPU memory access, NVLink memory copy engines, NVSwitch all-to-all bandwidth sharing
- Grace Superchip unified memory: NVLink-C2C coherent memory, CPU-GPU unified virtual address space, cache coherence between the Grace CPU caches (Arm Neoverse V2 cores and shared L3) and the H100 L2
- Cache coherence protocols at cluster scale: directory-based coherence, home node placement, false sharing avoidance across the NVLink fabric
- GPU memory management: CUDA unified memory (UM), managed memory, explicit `cudaMemcpy` vs implicit page migration, GPUDirect RDMA zero-copy pathways
- Memory-mapped file I/O over Weka: mmap semantics for GPU-accessible WekaFS files, page fault handling, demand paging from Weka to GPU HBM3
- Tiered storage hierarchy: HBM3 (80 GB/GPU on H100 SXM5) → DDR5/LPDDR5X host memory (LPDDR5X on GH200 nodes) → NVMe (local SSD) → Weka (parallel FS, PB-scale)
- Page migration policies: when to migrate pages between tiers, migration bandwidth management, and NUMA migration cost models
- Memory pressure handling: OOM prevention, demand-based eviction, balloon device analogue for GPU memory, cooperative memory release protocols
- Memory isolation and address space layout for multi-tenant workloads
- Formal specification of the DistOS virtual memory interface exposed to the scheduler and to user processes
|
||||||
|
|
||||||
|
**Out of scope (delegated):**
|
||||||
|
|
||||||
|
- Physical network transport for RDMA (delegated to `council-net`; RDMA registration interfaces consumed)
|
||||||
|
- Scheduling decisions about which workloads run on which GPUs (delegated to `council-sched`; placement decisions consumed as inputs)
|
||||||
|
- Security isolation primitives at the hardware level (delegated to `council-sec`; IOMMU and capability constraints consumed)
|
||||||
|
- Resource metering and HBM3 quota enforcement (delegated to `council-telemetry`; metering events emitted as outputs)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Research Domains
|
||||||
|
|
||||||
|
### 2.1 Distributed Shared Memory vs. Message Passing
|
||||||
|
|
||||||
|
Evaluate DSM systems (software-managed global address space) against message-passing models (MPI, NCCL, UCX) for the primary inter-node memory model of DistOS. The cluster's NVLink/NVSwitch fabric makes intra-NVLink-domain DSM feasible at low latency, while cross-domain communication involves InfiniBand/RoCE.
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- Li and Hudak, "Memory Coherence in Shared Virtual Memory Systems" (ACM TOCS 1989): foundational DSM coherence analysis
- Bal et al., "Orca: A Flat Object-Based Distributed Shared Memory System": hybrid DSM design
- PGAS language models: UPC++ (Berkeley UPC++ team), OpenSHMEM (OpenSHMEM 1.5 spec), Chapel locale model (Cray/HPE)
- Bachan et al., "UPC++: A High-Performance Communication Framework for Asynchronous Computation" (IPDPS 2019): modern PGAS for GPU clusters
- Hoefler et al., "MPI+MPI: A New Hybrid Approach to Parallel Programming with MPI Plus Shared Memory" (Computing 2013): hybrid model analysis
|
||||||
|
|
||||||
|
### 2.2 Weka Data Platform and WekaFS
|
||||||
|
|
||||||
|
Weka is the parallel filesystem deployed on this cluster. WekaFS provides POSIX-compliant parallel I/O with client-side caching and a distributed metadata architecture. Survey the WekaFS client protocol, consistency model, and performance characteristics for GPU workload I/O patterns.
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- Weka Data Platform documentation (v4.x): WekaFS client, mount options (`mount -t wekafs`, cache modes `readcache`, `writecache`, `coherent`), tiering configuration
- Weka technical whitepaper: "Weka and NVIDIA GPUDirect Storage Integration": zero-copy path from Weka to GPU HBM3 via GDS
- Weka S3 and NFS interoperability guide: multi-protocol access patterns relevant to multi-tenant workloads
- POSIX consistency semantics under parallel access: close-to-open consistency vs. strict POSIX; implications for checkpoint/restart workflows
- Bent et al., "PLFS: A Checkpoint Filesystem for Parallel Applications" (SC 2009): N-to-1 and N-to-N checkpoint patterns relevant to Weka I/O design
- Lofstead et al., "Flexible IO and Integration for Scientific Codes through the Adaptable IO System (ADIOS)": parallel I/O pattern survey
|
||||||
|
|
||||||
|
### 2.3 NVIDIA Magnum IO and GPUDirect
|
||||||
|
|
||||||
|
NVIDIA Magnum IO is the umbrella framework for GPU-optimised I/O. It encompasses GPUDirect RDMA (peer-to-peer GPU memory over InfiniBand), GPUDirect Storage (GDS, direct path from NVMe/Weka to GPU HBM3 bypassing host DRAM), and NCCL (collective communication over NVLink/IB).
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- NVIDIA GPUDirect RDMA documentation: `nvidia_p2p_get_pages`, DMA mapping API, peer-to-peer registration requirements
- NVIDIA GPUDirect Storage documentation: `cuFile` API, `cuFileRead`, `cuFileWrite`, alignment requirements, Weka GDS driver (`libwekafs-gds`)
- Shainer et al., "The Development of Mellanox/NVIDIA GPUDirect over InfiniBand: A New Model for GPU to GPU Communications" (Computer Science - Research and Development, 2011): foundational GPUDirect RDMA paper
- Barroso et al., "The Datacenter as a Computer" (3rd edition): memory hierarchy cost model relevant to tiered storage trade-offs
|
||||||
|
|
||||||
|
### 2.4 NVLink and NVSwitch Memory Fabric
|
||||||
|
|
||||||
|
Hopper H100 SXM5 nodes are connected via NVLink 4.0 within a node and across NVSwitch fabrics in NVLink Switch Systems (formerly DGX SuperPOD architecture). Understand the addressing model, bandwidth characteristics, and coherence semantics of the NVLink fabric.
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- NVIDIA H100 SXM5 NVLink 4.0 specification: 900 GB/s bidirectional aggregate per GPU, NVSwitch all-reduce hardware acceleration
- NVIDIA NVLink Switch System (formerly NVLink Switch Fabric) architecture documentation
- Foley and Danskin, "Ultra-Performance Pascal GPU and NVLink Interconnect" (IEEE Micro 2017): first-generation NVLink design and bandwidth analysis
- Choquette et al., "NVIDIA A100 Tensor Core GPU: Performance and Innovation" (IEEE Micro 2021): NVLink 3.0 predecessor; bandwidth and addressing model basis for NVLink 4.0
- NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) documentation: in-network compute for allreduce over NVSwitch
|
||||||
|
|
||||||
|
### 2.5 Grace Superchip Unified Memory (NVLink-C2C)
|
||||||
|
|
||||||
|
The GH200 Grace Hopper Superchip connects the Grace CPU (Arm Neoverse V2 cores) and the Hopper GPU via NVLink-C2C at 900 GB/s with coherent memory semantics. The CPU can access GPU HBM3 directly, and the GPU can access CPU-attached LPDDR5X. This creates a unified virtual address space that fundamentally changes GPU memory management assumptions.
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- NVIDIA GH200 Grace Hopper Superchip Architecture Whitepaper (2023): NVLink-C2C bandwidth, cache coherence protocol, unified memory semantics, 480 GB LPDDR5X + 96 GB HBM3 (a later variant ships with 144 GB HBM3e)
- NVIDIA Unified Memory for CUDA documentation: UM overview, page migration engine, `cudaMemPrefetchAsync`, `cudaMemAdvise`
- Ausavarungnirun et al., "Exploiting Inter-Warp Heterogeneity to Improve GPU Performance": GPU memory access heterogeneity relevant to migration policy design
- Ganguly et al., "Interconnect-Aware Memory Management for GPU Architectures" (ISCA 2019): NVLink-aware page placement analysis
- Li et al., "Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect" (IEEE TPDS 2020)
|
||||||
|
|
||||||
|
### 2.6 Cache Coherence at Cluster Scale
|
||||||
|
|
||||||
|
For DistOS to support a distributed shared memory model (even within an NVLink domain), a cache coherence protocol must be specified. Study directory-based protocols, their scalability to 1024-node clusters, and the latency characteristics of coherence traffic over InfiniBand.
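
For readers new to directory-based coherence, the core bookkeeping is small: the home node tracks, per cache line, which nodes hold a copy and whether one of them holds it modified. The sketch below is a textbook-style MSI directory for a single home node, written for illustration only. `Entry`, `Directory`, and the message strings are hypothetical, and data transfer, acknowledgements, and races are all omitted.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Entry:
    state: str = "I"  # I (uncached), S (shared), M (modified)
    sharers: Set[int] = field(default_factory=set)
    owner: Optional[int] = None


class Directory:
    """Home-node directory: returns the coherence messages each request causes."""

    def __init__(self) -> None:
        self.lines: Dict[int, Entry] = {}

    def read(self, node: int, line: int) -> List[str]:
        e = self.lines.setdefault(line, Entry())
        if e.state == "M" and e.owner == node:
            return []  # the owner already has the latest copy
        msgs: List[str] = []
        if e.state == "M":
            # Another node owns the line: downgrade it to a sharer.
            msgs.append(f"downgrade->{e.owner}")
            e.sharers = {e.owner}
            e.owner = None
        e.sharers.add(node)
        e.state = "S"
        return msgs

    def write(self, node: int, line: int) -> List[str]:
        e = self.lines.setdefault(line, Entry())
        if e.state == "M" and e.owner == node:
            return []  # already exclusive
        if e.state == "M":
            msgs = [f"invalidate->{e.owner}"]
        else:
            msgs = [f"invalidate->{n}" for n in sorted(e.sharers - {node})]
        e.state, e.owner, e.sharers = "M", node, set()
        return msgs
```

The length of the list returned by `write` grows with the number of sharers, which is the invalidation fan-out that makes scalability to 1024 nodes a research question here.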
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- Censier and Feautrier, "A New Solution to Coherence Problems in Multicache Systems" (IEEE Trans. Computers 1978): directory-based coherence origin
- Lenoski et al., "The Stanford DASH Multiprocessor" (IEEE Computer 1992): scalable directory coherence at large scale
- Cray Chapel locale model documentation: distributed memory with locality-aware access
- Intel Optane DC Persistent Memory documentation: cache-coherent byte-addressable storage relevant to persistence model
- Moscibroda and Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems" (USENIX Security 2007): memory-system contention as an interference vector (adversarial relevance)
|
||||||
|
|
||||||
|
### 2.7 GPU Memory Management: Unified Memory and Managed Memory
|
||||||
|
|
||||||
|
Survey CUDA unified memory (automatic page migration between CPU DRAM and GPU HBM3), managed memory (allocated explicitly with `cudaMallocManaged`), and the interaction between these mechanisms and the OS page fault handler.
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- Landaverde et al., "An Investigation of Unified Memory Access Performance in CUDA" (HPEC 2014)
- Choukse et al., "Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs" (ISCA 2020): GPU memory compression as oversubscription strategy
- Rhu et al., "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design" (MICRO 2016): offloading activations to host memory for memory oversubscription
- NVIDIA documentation on `cudaMemAdvise` hints (`cudaMemAdviseSetPreferredLocation`, `cudaMemAdviseSetAccessedBy`): hints to the migration engine
- AMD ROCm HSA (Heterogeneous System Architecture) documentation: `hsa_amd_memory_pool_t`, coarse-grained and fine-grained memory pool semantics
|
||||||
|
|
||||||
|
### 2.8 Tiered Storage and Page Migration Policies
|
||||||
|
|
||||||
|
Design the page migration engine for the 4-tier hierarchy: HBM3 (hot, ~80 GB) → DDR5/LPDDR5X (warm, ~480 GB on GH200) → NVMe (cool, 1-10 TB per node) → Weka (cold, PB-scale). Define migration triggers, bandwidth budgets, and admission control to prevent migration storms.
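
One common way to enforce a migration bandwidth budget and keep bursts of page movement from turning into a migration storm is a token bucket in front of the migration engine. The sketch below illustrates that technique under stated assumptions. `MigrationBudget`, its parameters, and the byte values are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
class MigrationBudget:
    """Token bucket: admits a migration only if bandwidth budget is available."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float) -> None:
        self.rate = rate_bytes_per_s  # sustained migration bandwidth budget
        self.burst = burst_bytes      # largest burst admitted at once
        self.tokens = burst_bytes     # start with a full bucket
        self.last = 0.0               # time of the previous refill, in seconds

    def admit(self, nbytes: float, now: float) -> bool:
        # Refill in proportion to elapsed time, capped at the burst size.
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False  # caller should defer or drop this migration


budget = MigrationBudget(rate_bytes_per_s=100, burst_bytes=200)
decisions = [
    budget.admit(150, now=0.0),   # fits in the initial burst
    budget.admit(100, now=0.0),   # only 50 tokens left
    budget.admit(100, now=0.5),   # refilled by 50, so exactly 100 available
    budget.admit(200, now=10.0),  # bucket has refilled to its cap
]
```

A per-tier or per-link bucket would be a natural extension, since each boundary in the hierarchy has a different bandwidth ceiling.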
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- Lagar-Cavilla et al., "SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing" (EuroSys 2009) — demand paging strategies
|
||||||
|
- Yan et al., "Nimble Page Management for Tiered Memory Systems" (ASPLOS 2019) — tiered DRAM page migration policies; directly applicable to HBM3/DDR5 tiering
|
||||||
|
- NVIDIA NVMe Direct documentation — NVMe namespace affinity for GPU workloads
|
||||||
|
- Linux heterogeneous memory management (HMM) — `hmm_range_fault`, `migrate_vma`, mmap-based GPU page fault delegation
|
||||||
|
- Intel PMDK (Persistent Memory Development Kit) — tiered memory management patterns adaptable to GPU tiering
|
||||||
|
- Agarwal et al., "Thermostat: Application-Transparent Page Management for Two-Tiered Main Memory" (ASPLOS 2017) — application-transparent hotness tracking
|
||||||
|
|
||||||
|
### 2.9 Memory Pressure Handling and OOM Prevention
|
||||||
|
|
||||||
|
At 1024 nodes, memory pressure events will be frequent. Design a cooperative memory pressure protocol where processes can voluntarily release memory under `pressure` signals, an eviction hierarchy, and an OOM prevention mechanism that avoids hard OOM kills in favour of graceful degradation.
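
The escalation order described above (ask processes to release memory first, evict only for the remainder) can be sketched as a single planning step. This is an illustrative model, not the protocol itself: `plan_response`, the 85% target, and the byte values are hypothetical, and a real design would add timeouts and a final checkpoint-and-preempt stage.

```python
from typing import Dict, Tuple


def plan_response(used: int, capacity: int, releasable: Dict[str, int],
                  target_pct: int = 85) -> Tuple[str, Dict[str, int]]:
    """Decide how to bring usage back under target_pct of capacity.

    releasable maps a process name to the bytes it has offered to give up.
    Returns the action and the per-process release plan.
    """
    target = capacity * target_pct // 100
    need = used - target
    if need <= 0:
        return "none", {}
    plan: Dict[str, int] = {}
    # Ask the processes with the most releasable memory first.
    for proc, avail in sorted(releasable.items(), key=lambda kv: -kv[1]):
        if need <= 0:
            break
        take = min(avail, need)
        if take > 0:
            plan[proc] = take
            need -= take
    if need <= 0:
        return "cooperative_release", plan
    # Voluntary release is insufficient: escalate to eviction for the rest.
    return "evict", plan
```

For example, `plan_response(95, 100, {"a": 4, "b": 8})` asks `b` for 8 units and `a` for 2 without evicting anything.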
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- Linux kernel Memory Management documentation: `/proc/pressure/memory` (PSI, Pressure Stall Information), cgroup v2 `memory.pressure`, OOM killer heuristics
- Guo et al., "CrystalBall: Statistically-Informed, Co-locating, Workload Placement for Warehouse-Scale Systems": memory pressure prediction
- Wangni et al., "Gradient Sparsification for Communication-Efficient Distributed Optimization" (NeurIPS 2018): model-level memory reduction under pressure; relevant to adaptive workload response
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Agent Roles
|
||||||
|
|
||||||
|
Total agents: **80**
|
||||||
|
|
||||||
|
| Role | Count | Responsibilities |
|------|-------|------------------|
| Lead Architect | 2 | Distributed memory architecture decisions, DSM vs. message-passing model choice, cross-council memory interface ownership |
| Weka Integration Specialists | 8 | WekaFS client protocol, POSIX semantics, GDS integration, checkpoint I/O patterns |
| GPU Memory Researchers | 8 | CUDA Unified Memory, managed memory, page migration engine, HBM3 characteristics |
| NVLink/NVSwitch Specialists | 6 | NVLink fabric addressing, NVSwitch bandwidth model, peer-to-peer GPU memory semantics |
| Grace Superchip Specialists | 5 | NVLink-C2C coherence, GH200 unified address space, CPU-GPU shared memory model |
| Cache Coherence Researchers | 5 | Directory-based coherence protocols, coherence at cluster scale, false sharing analysis |
| Tiering and Migration Researchers | 6 | Page migration policies, tier promotion/demotion triggers, bandwidth budgeting |
| Memory Pressure Specialists | 4 | OOM prevention, cooperative release protocols, pressure signal design |
| PGAS/DSM Language Researchers | 4 | UPC++, OpenSHMEM, Chapel locale model analysis |
| Formal Specification Authors | 8 | TLA+ specification of memory model state machine, coherence invariants, migration protocol |
| Architects (sub-component) | 8 | Concrete architecture proposals for each memory subsystem component |
| Internal Reviewers | 7 | Review research and architecture proposals; green/yellow/red vote casting |
| Integration Liaisons | 5 | Interface with `council-sched`, `council-net`, `council-sec`, `council-telemetry` |
| Decision Record Authors | 5 | Author DRs for all decision points; maintain UCXL provenance chain |
| Adversarial Critics | 4 | Surface memory safety violations, coherence anomalies, migration storm scenarios |
|
||||||
|
|
||||||
|
**Role distribution rationale:** Weka integration is a high-priority domain given the cluster's reliance on WekaFS, so 8 specialists are assigned to it. GPU memory (unified memory, managed memory) receives 8 researchers, reflecting the complexity of CUDA memory semantics. The NVLink/NVSwitch fabric and the GH200 NVLink-C2C model are treated as distinct specialisms given their novel architecture (6 and 5 agents respectively, 11 combined).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Key Deliverables
|
||||||
|
|
||||||
|
### 4.1 Research Summaries
|
||||||
|
|
||||||
|
```
ucxl://council-mem:researcher@DistOS:memory/*^/research/dsm-vs-message-passing.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/weka-wekafs-integration.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/magnum-io-gpudirect.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/nvlink-nvswitch-fabric.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/grace-superchip-unified-memory.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/cache-coherence-cluster-scale.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/gpu-memory-management.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/tiered-storage-page-migration.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/memory-pressure-handling.md
```
|
||||||
|
|
||||||
|
### 4.2 Architecture Proposals
|
||||||
|
|
||||||
|
```
ucxl://council-mem:architect@DistOS:memory/*^/architecture/memory-model-overview.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/weka-gds-integration-design.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/nvlink-domain-addressing.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/grace-c2c-memory-model.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/coherence-protocol-design.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/tiered-hierarchy-design.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/page-migration-engine.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/memory-pressure-protocol.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/bandwidth-allocation-model.md
```
|
||||||
|
|
||||||
|
### 4.3 Decision Records
|
||||||
|
|
||||||
|
```
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-001-dsm-vs-mp-model.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-002-coherence-protocol.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-003-weka-consistency-semantics.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-004-grace-c2c-scheduling-unit.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-005-tiering-policy.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-006-page-migration-triggers.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-007-oom-prevention-protocol.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-008-rdma-registration-model.md
```
|
||||||
|
|
||||||
|
### 4.4 Formal Specifications

```
ucxl://council-mem:verifier@DistOS:memory/*^/specs/MemoryModel.tla
ucxl://council-mem:verifier@DistOS:memory/*^/specs/CoherenceProtocol.tla
ucxl://council-mem:verifier@DistOS:memory/*^/specs/PageMigrationProtocol.tla
ucxl://council-mem:verifier@DistOS:memory/*^/specs/WekaConsistencyModel.tla
```

### 4.5 Interface Contracts

```
ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-sched-contract.md
ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-net-contract.md
ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-sec-contract.md
ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-telemetry-contract.md
```

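The artifact addresses listed in this section appear to share one shape: a council, a role, the project, a subsystem, and a path. The sketch below splits an address into those parts. The field names and the grammar are assumptions inferred from the listed addresses, not a UCXL definition, and the `*^` segment is kept opaque as part of the path.

```python
import re
from typing import NamedTuple, Optional


class UcxlAddress(NamedTuple):
    council: str    # assumed name for the segment before the first ':'
    role: str       # assumed name for the segment before '@'
    project: str
    subsystem: str  # assumed name for the segment after the project
    rest: str       # everything after the subsystem, left uninterpreted


# Hypothetical grammar: ucxl://<council>:<role>@<project>:<subsystem>/<rest>
_UCXL_RE = re.compile(
    r"^ucxl://(?P<council>[^:@/]+):(?P<role>[^:@/]+)"
    r"@(?P<project>[^:@/]+):(?P<subsystem>[^/]+)/(?P<rest>.*)$"
)


def parse_ucxl(address: str) -> Optional[UcxlAddress]:
    """Return the address parts, or None if the string does not match."""
    match = _UCXL_RE.match(address)
    if match is None:
        return None
    return UcxlAddress(**match.groupdict())
```

A parser like this would let tooling group artifacts by council or role without interpreting the path.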
---

## 5. Decision Points

### DP-MEM-001: Primary Memory Consistency Model

**Question:** Should DistOS expose a Distributed Shared Memory (DSM) programming model to user workloads, a pure message-passing model, or a hybrid model where DSM is restricted to within-NVLink-domain access (exploiting NVSwitch hardware) while cross-domain communication uses explicit message passing?

**Factors:** NVLink domain size (8 GPUs per NVSwitch in DGX H100, up to 256 GPUs in NVLink Switch System), coherence traffic overhead at 1024-node scale, programmability, compatibility with existing CUDA/MPI workloads, interaction with the PGAS language model.

**Dependency:** This is the highest-impact decision in `council-mem`; it governs the programming model presented to users and all downstream specification work. `council-sched` requires this decision before finalising its placement model.

### DP-MEM-002: Cache Coherence Protocol Selection

**Question:** If a DSM model (even partial) is adopted, which coherence protocol should DistOS implement? Options: (a) MESI/MESIF directory-based protocol with a distributed directory mapped to NVLink domains, (b) release consistency (lazy/eager), (c) entry consistency with lock-based synchronisation, (d) scope consistency (as in OpenSHMEM) restricted to explicitly declared shared objects.

**Factors:** Coherence directory scalability to 1024 nodes, false sharing cost over 900 GB/s NVLink vs. 200 Gbps InfiniBand, implementation complexity, interaction with CUDA memory model (`__threadfence_system`).

### DP-MEM-003: Weka Consistency Semantics

**Question:** WekaFS supports multiple client-side caching modes: `readcache` (read-only caching), `writecache` (write-back caching), and `coherent` (strict POSIX consistency with distributed coherence). DistOS must choose a default mount policy and define when workloads may opt into relaxed consistency. Should checkpoint writes use `writecache` for performance and accept the risk of partial data on crash, or require `coherent` mode with reduced write throughput?

**Factors:** Checkpoint I/O bandwidth (a 1024-GPU training job may checkpoint 100+ TB), crash recovery correctness, WekaFS client-side coherence message overhead, interaction with `council-fault`'s recovery model.

**Dependency:** `council-fault` must be consulted; this decision affects recovery guarantees.

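To make the bandwidth factor concrete, checkpoint duration can be estimated as size divided by aggregate write throughput. The sketch below does that arithmetic for the 100 TB figure quoted above. The per-mode throughput values are placeholders chosen for illustration only; they are not WekaFS measurements and would need to be replaced with benchmark results.

```python
def checkpoint_seconds(size_tb: float, throughput_gb_per_s: float) -> float:
    """Time to write a checkpoint, using decimal units (1 TB = 1000 GB)."""
    if throughput_gb_per_s <= 0:
        raise ValueError("throughput must be positive")
    return size_tb * 1000.0 / throughput_gb_per_s


# Hypothetical aggregate write throughput per mount mode, in GB/s.
ASSUMED_THROUGHPUT_GB_PER_S = {"writecache": 1000.0, "coherent": 400.0}

if __name__ == "__main__":
    for mode, rate in ASSUMED_THROUGHPUT_GB_PER_S.items():
        seconds = checkpoint_seconds(100.0, rate)
        print(f"{mode}: {seconds:.0f} s for a 100 TB checkpoint")
```

The same function can feed a quantitative comparison once measured throughputs for each mode are available.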
### DP-MEM-004: Grace Superchip Scheduling Unit

**Question:** On GH200 nodes, should the CPU and GPU be treated as a single unified scheduling unit (one "node" from the scheduler's perspective, sharing a unified virtual address space) or as two separate resources with explicit affinity? The answer changes the memory model: unified treatment enables transparent CPU-GPU pointer sharing; separate treatment requires explicit migration.

**Factors:** Interaction with `council-sched` DP-SCHED-006, API surface complexity, NUMA distance for CPU-to-GPU migration over NVLink-C2C vs. within a unified UVA space.

**Coordination required:** This decision is jointly owned by `council-mem` and `council-sched`; a joint session is required before either council can finalise its spec.

### DP-MEM-005: Tiering Policy Design

**Question:** What algorithm governs page promotion and demotion across the HBM3 → DDR5 → NVMe → Weka hierarchy? Options: (a) LRU/LFU approximation (similar to Linux CLOCK), (b) access frequency + recency hybrid (ARC), (c) workload-hint-driven (applications annotate hot vs. cold regions via `cudaMemAdvise` equivalents), (d) ML-based hotness prediction.

**Factors:** Migration bandwidth cost (HBM3 → DDR5 over NVLink-C2C is fast; DDR5 → NVMe is slow; NVMe → Weka involves network I/O), migration storm risk, implementation complexity, applicability to training vs. inference vs. HPC workloads.

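As an illustration of option (a), the sketch below implements CLOCK (second-chance) replacement for a single fixed-capacity tier. It is a minimal model under stated assumptions: page identifiers are arbitrary hashable values, the class name `ClockTier` is hypothetical, and a real policy would also track dirty state and migration cost.

```python
class ClockTier:
    """Fixed-capacity tier using CLOCK (second-chance) replacement."""

    def __init__(self, capacity: int):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.pages = []   # page ids, indexed by frame
        self.ref = {}     # page id -> reference bit
        self.hand = 0     # next frame the clock hand inspects

    def access(self, page):
        """Touch a page. Returns the page demoted to make room, or None."""
        if page in self.ref:
            self.ref[page] = True          # hit: set the reference bit
            return None
        if len(self.pages) < self.capacity:
            self.pages.append(page)        # free frame available
            self.ref[page] = True
            return None
        # Sweep: clear reference bits until an unreferenced page is found.
        while self.ref[self.pages[self.hand]]:
            self.ref[self.pages[self.hand]] = False
            self.hand = (self.hand + 1) % self.capacity
        victim = self.pages[self.hand]
        del self.ref[victim]
        self.pages[self.hand] = page
        self.ref[page] = True
        self.hand = (self.hand + 1) % self.capacity
        return victim
```

The returned victim is the page a tiering engine would demote to the next tier down; a recently touched page survives one sweep before it becomes eligible.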
### DP-MEM-006: Page Migration Trigger and Bandwidth Budget

**Question:** What events trigger page migration? Candidates: (a) page fault on access (demand paging), (b) periodic access pattern analysis (proactive migration), (c) memory pressure threshold crossings, (d) explicit application hints. How is migration bandwidth budgeted to avoid starving compute workloads that use the same NVLink/NVSwitch fabric?

**Factors:** NVLink bandwidth contention with NCCL allreduce traffic, migration latency vs. access fault latency trade-off, integration with `council-net`'s bandwidth reservation model.

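One common way to budget migration bandwidth is a token bucket, where tokens represent bytes the migration engine may move. The sketch below is one possible shape and not a decided design: the class name `MigrationBudget`, the rate, and the burst size are all hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
class MigrationBudget:
    """Token bucket: each token is one byte of bandwidth reserved for migration."""

    def __init__(self, rate_bytes_per_s: float, burst_bytes: float):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = burst_bytes   # start with a full bucket
        self.last = 0.0             # timestamp of the last refill, in seconds

    def try_migrate(self, nbytes: float, now: float) -> bool:
        """Admit a migration of nbytes at time `now` if the budget allows it."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = max(self.last, now)
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False   # caller should defer or queue the migration
```

A rejected request leaves the budget untouched, so deferred migrations can be retried once enough tokens have accumulated.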
### DP-MEM-007: OOM Prevention and Memory Pressure Protocol

**Question:** When GPU HBM3 memory reaches a pressure threshold, what is the DistOS response hierarchy? Proposed hierarchy: (1) request cooperative memory release from low-priority co-located processes, (2) demote cold HBM3 pages to DDR5 or NVMe, (3) suspend (checkpoint) low-priority jobs to free their entire allocation, (4) if all else fails, terminate the lowest-priority job. How are the pressure threshold levels defined and measured?

**Factors:** Interaction with `council-sched`'s preemption model, OOM kill latency, impact on SLO guarantees, coordination signal protocol design.

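The proposed hierarchy can be read as an escalation ladder: while pressure persists, move one step up; once it clears, reset. The sketch below encodes that reading. The single 0.90 utilisation threshold is a placeholder, since the threshold levels are exactly what this decision point leaves open, and the names `Response` and `next_response` are hypothetical.

```python
from enum import IntEnum


class Response(IntEnum):
    NONE = 0
    COOPERATIVE_RELEASE = 1        # step (1) in the proposed hierarchy
    DEMOTE_COLD_PAGES = 2          # step (2)
    SUSPEND_LOW_PRIORITY = 3       # step (3)
    TERMINATE_LOWEST_PRIORITY = 4  # step (4)


def next_response(hbm_utilisation: float, previous: Response,
                  threshold: float = 0.90) -> Response:
    """Escalate one step while utilisation stays at or above the threshold."""
    if not 0.0 <= hbm_utilisation <= 1.0:
        raise ValueError("utilisation must be a fraction in [0, 1]")
    if hbm_utilisation < threshold:
        return Response.NONE       # pressure cleared: reset the ladder
    return Response(min(previous + 1, Response.TERMINATE_LOWEST_PRIORITY))
```

A multi-level design would replace the single threshold with one per step; the escalation structure stays the same.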
### DP-MEM-008: RDMA Registration Model

**Question:** GPUDirect RDMA requires GPU memory regions to be registered with the RDMA HCA (Host Channel Adapter). Should DistOS pre-register fixed RDMA memory pools (reducing registration overhead but consuming HBM3 at idle), register on demand (flexible but with latency), or use a cache of registered regions with LRU eviction?

**Factors:** Registration latency (~milliseconds for large GPU buffers), HBM3 overhead for pre-registered pools, interaction with `council-net`'s RDMA QP management, memory fragmentation risk.

**Dependency:** `council-net` must align on the registration model before RDMA transport design is finalised.

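The third option, a cache of registered regions with LRU eviction, can be sketched as follows. The `register` and `deregister` callables are supplied by the caller and stand in for whatever the transport layer provides (in a verbs-based stack these would wrap calls such as `ibv_reg_mr` and `ibv_dereg_mr`); the class name and the exact-match keying on `(addr, length)` are simplifying assumptions.

```python
from collections import OrderedDict
from typing import Any, Callable


class RegistrationCache:
    """LRU cache of registered memory regions, keyed by (address, length)."""

    def __init__(self, capacity: int,
                 register: Callable[[int, int], Any],
                 deregister: Callable[[Any], None]):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._register = register
        self._deregister = deregister
        self._entries = OrderedDict()   # oldest entry first

    def get(self, addr: int, length: int) -> Any:
        """Return a registration handle, registering on a miss."""
        key = (addr, length)
        if key in self._entries:
            self._entries.move_to_end(key)   # mark as most recently used
            return self._entries[key]
        if len(self._entries) >= self.capacity:
            _, oldest = self._entries.popitem(last=False)
            self._deregister(oldest)         # evict least recently used
        handle = self._register(addr, length)
        self._entries[key] = handle
        return handle
```

A production cache would also need to handle overlapping ranges and invalidate entries when the underlying allocation is freed.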
---

## 6. Dependencies

### 6.1 What council-mem Needs from Other Councils

| Dependency | Source Council | Artifact | Purpose |
|------------|---------------|---------|---------|
| Placement decisions and GPU assignment map | `council-sched` | `ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-mem-contract.md` | Memory subsystem must know which GPU a job is placed on to configure HBM3 address space, NVLink fabric attachment point, and DDR5 NUMA proximity |
| Preemption events and checkpoint triggers | `council-sched` | `ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/preemption-protocol.md` | Memory tiering must snapshot GPU HBM3 contents on preemption; needs notification protocol |
| RDMA transport requirements | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-mem-contract.md` | RDMA registration model (DP-MEM-008) must align with network transport QP management |
| Network bandwidth reservation API | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/architecture/bandwidth-reservation.md` | Page migration over NVLink/IB must not starve NCCL allreduce traffic; needs bandwidth reservation |
| Memory isolation constraints | `council-sec` | `ucxl://council-sec:architect@DistOS:security/*^/interfaces/sec-to-mem-contract.md` | IOMMU domain assignments, capability restrictions on shared memory regions, tenant isolation requirements |
| Memory quota enforcement hooks | `council-telemetry` | `ucxl://council-telemetry:architect@DistOS:telemetry/*^/interfaces/telemetry-to-mem-contract.md` | HBM3 usage accounting per tenant requires telemetry metering hooks |
| Recovery semantics for Weka I/O | `council-fault` | `ucxl://council-fault:architect@DistOS:fault-tolerance/*^/interfaces/fault-to-mem-contract.md` | Weka consistency mode choice (DP-MEM-003) depends on recovery guarantees provided by fault tolerance subsystem |

### 6.2 What Other Councils Need from council-mem

| Consumer Council | Artifact Required | Purpose |
|-----------------|------------------|---------|
| `council-sched` | Memory pressure signals and eviction cost model | Scheduler uses eviction cost estimates to avoid placing kernels that will immediately trigger HBM3 pressure |
| `council-sched` | HBM3 and DDR5 bandwidth allocation model | Bandwidth is a co-dominant scheduling resource; model must be consistent with memory spec |
| `council-net` | RDMA memory registration interface specification | Network subsystem designs RDMA transport around the memory registration model agreed in DP-MEM-008 |
| `council-sec` | Address space layout and isolation model | Security isolation requires memory address space boundaries from the memory model |
| `council-telemetry` | Memory usage event stream specification | Metering requires a defined set of memory events (allocation, migration, eviction, OOM) with UCXL addresses |
| `council-verify` | TLA+ memory model and coherence protocol specs | Formal verification council model-checks for coherence safety and freedom from memory corruption |
| `council-api` | Virtual memory API surface | API council designs the user-facing memory allocation and mapping interface based on `council-mem` spec |

---

## 7. WHOOSH Configuration

```yaml
# WHOOSH council formation configuration for council-mem
council_id: council-mem
project: DistOS
subsystem: memory
gitea_label: chorus-entrypoint
gitea_repo: distos/memory

formation:
  target_agents: 80
  min_agents: 60
  wave:
    max_per_wave: 12
    min_per_wave: 6
    period_sec: 30
  placement:
    max_replicas_per_node: 2
  join_stagger_ms: 2000
  bootstrap_peers_min: 5

roles:
  - role: lead-architect
    count: 2
    model: claude-opus-4-6
    priority: high
  - role: researcher
    count: 46
    model: qwen2.5-coder:32b
    priority: normal
    subgroups:
      - tag: weka-integration
        count: 8
      - tag: gpu-memory
        count: 8
      - tag: nvlink-nvswitch
        count: 6
      - tag: grace-superchip
        count: 5
      - tag: cache-coherence
        count: 5
      - tag: tiering-migration
        count: 6
      - tag: memory-pressure
        count: 4
      - tag: pgas-dsm
        count: 4
  - role: architect
    count: 8
    model: claude-opus-4-6
    priority: normal
  - role: verifier
    count: 8
    model: deepseek-coder-v2
    priority: normal
  - role: reviewer
    count: 7
    model: claude-opus-4-6
    priority: normal
  - role: integration-liaison
    count: 5
    model: qwen2.5-coder:32b
    priority: normal
  - role: decision-record-author
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: adversarial-critic
    count: 4
    model: claude-opus-4-6
    priority: normal

subchannels:
  - name: mem-research
    description: "Memory subsystem research discussion and literature synthesis"
    participants: [researcher, lead-architect]
    pubsub: true
  - name: mem-weka
    description: "Weka/WekaFS integration design — high-priority subchannel"
    participants: [researcher-weka-integration, architect, lead-architect, integration-liaison]
    pubsub: false
  - name: mem-architecture
    description: "Architecture proposal discussion and voting"
    participants: [architect, lead-architect, reviewer, adversarial-critic]
    pubsub: true
  - name: mem-formal-spec
    description: "TLA+ specification authoring and review"
    participants: [verifier, lead-architect, reviewer]
    pubsub: false
  - name: mem-integration
    description: "Cross-council interface negotiation"
    participants: [integration-liaison, lead-architect]
    pubsub: false
  - name: mem-grace-joint
    description: "Joint session channel with council-sched for GH200 scheduling unit decision (DP-MEM-004/DP-SCHED-006)"
    participants: [lead-architect, integration-liaison]
    pubsub: false
    external_councils: [council-sched]
  - name: mem-decisions
    description: "Decision record authoring and consensus"
    participants: [decision-record-author, lead-architect, reviewer]
    pubsub: true

quorum:
  architecture_changes:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    beat_minutes: 20
    timeout_beats: 6
  research_summaries:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 4
  formal_specs:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    require_verifier: true
    beat_minutes: 25
    timeout_beats: 8
  # Joint decisions with council-sched require lead-architect sign-off from both councils
  joint_decisions:
    policy: unanimous
    roles: [lead-architect]
    councils: [council-mem, council-sched]
    beat_minutes: 30
    timeout_beats: 6
  interface_contracts:
    policy: unanimous
    roles: [lead-architect, integration-liaison]
    beat_minutes: 30
    timeout_beats: 4

gates:
  kaching:
    p95_latency_ms: 250
    max_error_rate: 0.01
  backbeat:
    max_stream_lag: 200
  bootstrap:
    min_healthy_peers: 5
  join:
    min_success_rate: 0.80

review:
  beat_minutes: 20
  quorum:
    total_min: 3
    require_domain_role: true
    require_quality_role: true
  timeout_beats: 6
  no_self_approval: true
```

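Two simple checks can be run against a configuration like the one above before formation starts: that the researcher subgroup counts add up to the researcher role count, and how long wave deployment takes at minimum. The sketch below hard-codes the relevant values from the configuration. It assumes every wave launches at `max_per_wave` and that the first wave starts at time zero; both are assumptions about WHOOSH behaviour, and the function names are hypothetical.

```python
# Values copied from the researcher role in the configuration above.
RESEARCHER_COUNT = 46
RESEARCHER_SUBGROUPS = {
    "weka-integration": 8,
    "gpu-memory": 8,
    "nvlink-nvswitch": 6,
    "grace-superchip": 5,
    "cache-coherence": 5,
    "tiering-migration": 6,
    "memory-pressure": 4,
    "pgas-dsm": 4,
}


def subgroups_match(role_count: int, subgroups: dict) -> bool:
    """True if the subgroup counts sum exactly to the role count."""
    return sum(subgroups.values()) == role_count


def min_waves(target_agents: int, max_per_wave: int) -> int:
    """Fewest waves needed if every wave is full (integer ceiling)."""
    return -(-target_agents // max_per_wave)


def min_formation_seconds(target_agents: int, max_per_wave: int,
                          period_sec: int) -> int:
    """Lower bound on formation time, with the first wave at t = 0."""
    return (min_waves(target_agents, max_per_wave) - 1) * period_sec
```

With `target_agents: 80`, `max_per_wave: 12`, and `period_sec: 30`, this gives seven waves and a lower bound of 180 seconds, before any join stagger is taken into account.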
---

## 8. Success Criteria

1. **Research completeness:** All 9 research domain summaries published to DHT with at least 5 primary references each, approved by council simple majority.

2. **Architecture coverage:** Architectural proposals exist for all 9 major memory subsystem components. Each proposal addresses the specific implications of HBM3 scarcity (80 GB per H100) and the Weka filesystem integration.

3. **Decision records resolved:** All 8 decision points (DP-MEM-001 through DP-MEM-008) have corresponding Decision Records with at least 3 alternatives considered. DP-MEM-001 (primary memory model) and DP-MEM-004 (Grace scheduling unit) are resolved by council supermajority with `council-sched` co-sign on DP-MEM-004.

4. **Formal specifications:** TLA+ specifications exist for the memory model, coherence protocol, and page migration protocol. The coherence protocol spec must include a proof (model-checked by `council-verify`) that it satisfies SC-for-DRF (Sequential Consistency for Data-Race-Free programs) within an NVLink domain.

5. **Weka integration validated:** The Weka consistency mode recommendation (DP-MEM-003) is accompanied by a quantitative bandwidth model (estimated checkpoint throughput for each consistency mode) and a failure scenario analysis reviewed by `council-fault`.

6. **Interface contracts ratified:** All 4 interface contracts (to `council-sched`, `council-net`, `council-sec`, `council-telemetry`) are co-signed. The RDMA registration model (DP-MEM-008) contract is co-signed by `council-net` before the end of Phase 2.

7. **UCXL navigability:** Any Decision Record can be traced to the research summary motivating it within 5 UCXL hops.

8. **Adversarial review pass:** Each major architecture proposal has a documented adversarial critique and resolution. The coherence protocol design must specifically address the scenario of a migration storm (100+ simultaneous page migrations consuming all NVLink bandwidth) with a documented mitigation.

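Criterion 7 can be audited mechanically if the references between artifacts are available as a graph. The sketch below counts the fewest link traversals between two artifacts with a breadth-first search. How links are extracted from UCXL is not specified here, so the `links` mapping and the short artifact names in the usage example are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Optional


def hops(links: Dict[str, Iterable[str]], start: str, target: str) -> Optional[int]:
    """Fewest link traversals from start to target, or None if unreachable."""
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        for neighbour in links.get(node, ()):
            if neighbour == target:
                return distance + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None


def within_hop_limit(links, start, target, limit=5) -> bool:
    """True if target is reachable from start in at most `limit` hops."""
    distance = hops(links, start, target)
    return distance is not None and distance <= limit
```

For example, a Decision Record that cites an architecture proposal, which in turn cites a research summary, is two hops from that summary and passes the limit of five.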
---

## 9. Timeline

### Phase 1: Research and Survey (Days 1-3)

**Day 1:**
- WHOOSH forms council; all 80 agents join via wave deployment
- Researchers self-assign to domain subgroups via `mem-research` subchannel
- Weka integration research begins immediately (highest dependency for both `council-sched` placement and `council-fault` recovery design)
- GPU memory management and NVLink/NVSwitch fabric research begins in parallel
- Integration liaisons contact `council-sched` and `council-net` to establish interface negotiation schedule

**Day 2:**
- Grace Superchip, cache coherence, and tiering/migration research domains surveyed
- PGAS/DSM language model research completed (inputs to DP-MEM-001)
- Memory pressure and OOM research domain surveyed
- Research summaries drafted; internal review cycle begins
- Lead architects draft preliminary memory model design space map

**Day 3:**
- Research summaries revised based on review feedback; all 9 summaries published to DHT
- Adversarial critics challenge key assumptions (particularly DSM scalability claims)
- Research phase gate: all 9 summaries achieve simple majority approval
- Preliminary interface contract outlines shared with all dependency councils
- Joint session scheduled with `council-sched` for DP-MEM-004/DP-SCHED-006 (Grace scheduling unit)

### Phase 2: Architecture and Trade-offs (Days 3-6)

**Day 3-4:**
- DP-MEM-001 (primary memory model) — highest priority; architects propose DSM, message-passing, and hybrid options
- Joint session with `council-sched` on DP-MEM-004/DP-SCHED-006 (Grace Superchip scheduling unit) — this is a co-owned decision requiring both councils
- DP-MEM-008 (RDMA registration model) — early engagement with `council-net` required; initial proposal shared

**Day 4-5:**
- DP-MEM-002 (coherence protocol) resolved — depends on DP-MEM-001 outcome
- DP-MEM-003 (Weka consistency semantics) resolved — `council-fault` consulted
- Bandwidth allocation model drafted; shared with `council-sched` as input to their placement scoring

**Day 5-6:**
- DP-MEM-005 (tiering policy), DP-MEM-006 (migration triggers), DP-MEM-007 (OOM protocol) resolved
- All 8 Decision Records drafted and voted on by council supermajority
- Architecture overview assembled from approved DRs
- Architecture phase gate: all DPs resolved and co-dependencies with `council-sched` confirmed

### Phase 3: Formal Specification (Days 6-10)

**Day 6-7:**
- TLA+ specification of the memory model begins; verifiers partition work across 4 modules (memory model, coherence, migration, Weka consistency)
- Architects continue refining designs to resolve spec ambiguities
- `council-verify` given early access to spec drafts for model-checking setup

**Day 7-8:**
- Coherence protocol TLA+ module authored; safety invariants stated (SC-for-DRF within NVLink domain, no lost writes across domain boundaries)
- Page migration protocol TLA+ module authored; liveness properties stated (no indefinite page residency in any tier, migration storm prevention)

**Day 8-10:**
- Weka consistency model TLA+ module authored
- Model checking runs submitted to `council-verify`
- Any counterexamples trigger architecture revision with updated DRs
- Formal spec versions pinned in DHT; UCXL addresses published to `council-sched`, `council-net`, `council-verify`

### Phase 4: Integration and Review (Days 10-12)

**Day 10-11:**
- Interface contracts with all dependency councils finalised and submitted for co-signature
- Cross-council integration session: memory model validated against network RDMA model (`council-net`)
- Cross-council integration session: memory pressure protocol validated against preemption protocol (`council-sched`)
- `council-synth` engaged for any unresolved conflicts

**Day 11-12:**
- Final council review of complete memory specification
- Adversarial critics run migration storm scenario and DSM coherence traffic saturation analysis
- All yellow votes addressed with documented mitigations
- Integration review gate: all interface contracts co-signed

### Phase 5: Documentation and Narrative (Days 12-14)

**Day 12-13:**
- Decision record authors produce narrative summaries
- `council-docs` receives complete memory specification for standardised formatting
- UCXL navigability audit: spot-check 10 random decision paths

**Day 14:**
- Final specification published
- `council-arch` generates human-readable narrative of memory subsystem design evolution
- Council dissolved; agents released back to WHOOSH pool
626
councils/03-network-stack.md
Normal file
@@ -0,0 +1,626 @@
# Council Design Brief: Network Stack

**Council ID:** `council-net`
**Mission:** Design the complete network stack for DistOS, encompassing RDMA transport (InfiniBand/RoCE) for GPU-to-GPU communication, the overlay network and control plane (libp2p), transport protocol selection (QUIC/TCP/UCX), network topology discovery, adaptive routing, congestion control at 1024-node scale, NVLink domain bridging, multi-rail networking, service mesh for agent communication, and multicast/broadcast for council pub-sub channels.
**UCXL Base Address:** `ucxl://council-net:*@DistOS:networking/*`
**Agent Count:** 60
**Status:** Constitution Phase — awaiting WHOOSH formation trigger
**Created:** 2026-02-24

---

## 1. Scope and Responsibilities

`council-net` owns the complete specification of the DistOS network subsystem. Scope boundaries are defined as follows.

**In scope:**

- RDMA transport layer: InfiniBand verbs and RoCE v2 for GPU-to-GPU data plane; Queue Pair (QP) lifecycle, completion queue management, and memory region registration interface
- Overlay network design: logical topology over the physical InfiniBand/Ethernet fabric; addressing, routing, and peer discovery for the control plane
- libp2p integration for the DistOS control plane: peer discovery (mDNS, DHT-based), multiplexed streams, NAT traversal, and protocol negotiation
- Transport protocol selection: QUIC for agent control plane communication, UCX (Unified Communication X) for GPU collective and point-to-point data plane, TCP/IP for compatibility and management traffic
- Network topology discovery: automatic cluster topology mapping (fat-tree, dragonfly, or NVLink Switch topology), NVLink domain membership, InfiniBand subnet manager interface
- Adaptive routing: traffic-aware routing, ECMP (Equal-Cost Multi-Path), InfiniBand OpenSM AR (Adaptive Routing), and Dragonfly-specific routing algorithms
- Congestion control at 1024-node scale: ECN (Explicit Congestion Notification), PFC (Priority Flow Control), and DCQCN for RoCE; InfiniBand credit-based flow control; interaction with NCCL collectives
- NVLink domain bridging: translating between NVLink-domain (intra-node or NVSwitch-connected GPUs) and InfiniBand-domain (inter-node) communication
- Multi-rail networking: multiple IB HCAs per node, rail selection policy, failover across rails
- Service mesh for agent (CHORUS/WHOOSH) communication: mTLS, sidecar-less or sidecar-based, service discovery, load balancing, and circuit-breaking
- Multicast and broadcast for council pub-sub: efficient dissemination of research summaries, decision records, and vote notifications to council members
- Formal specification of the DistOS network interface — the API surface exposed to the memory subsystem (RDMA registration), the scheduler (network-aware placement), and user workloads (NCCL-compatible collective API)

**Out of scope (delegated):**

- Physical layer hardware (HCA firmware, cable selection, switch configuration) — these are hardware dependencies, not OS design decisions
- GPU memory allocation and RDMA buffer management (delegated to `council-mem`; RDMA registration interface consumed)
- Workload scheduling and GPU assignment (delegated to `council-sched`; topology hints provided as outputs)
- Encrypted transport key management and certificate lifecycle (delegated to `council-sec`; TLS/mTLS integration points defined)
- Resource metering for network bandwidth usage (delegated to `council-telemetry`; metering event API published)

---

## 2. Research Domains

### 2.1 RDMA: InfiniBand Verbs and RoCE

InfiniBand provides the primary high-bandwidth, low-latency transport for GPU-to-GPU data movement on this cluster. Understand the verbs API (`libibverbs`), the Queue Pair state machine (RESET → INIT → RTR → RTS, with ERROR reachable from any state), completion queues, and memory region registration. Survey RoCE v2 as the Ethernet-encapsulated RDMA alternative and understand its congestion control requirements.

Key materials:
- InfiniBand Trade Association, InfiniBand Architecture Specification — QP model, RDMA read/write/atomic operations, immediate data
- Kalia et al., "Using RDMA Efficiently for Key-Value Services" (SIGCOMM 2014) — RDMA design patterns, one-sided vs. two-sided operations
- Kalia et al., "Design Guidelines for High Performance RDMA Systems" (USENIX ATC 2016) — QP scaling, SEND vs. RDMA READ trade-offs, doorbell batching
- Mitchell et al., "Using One-Sided RDMA Reads to Build a Fast, CPU-Efficient Key-Value Store" (USENIX ATC 2013)
- Rosenblum and Ousterhout, "The Design and Implementation of a Log-Structured File System" (SOSP 1991) — not RDMA-specific but relevant to RDMA-backed storage patterns
- Pfefferle et al., "A Hybrid I/O Virtualization Framework for RDMA-capable Network Interfaces" (VEE 2015)
- RoCE v2 specification (InfiniBand Trade Association) — UDP encapsulation, RoCEv2 header, ECN marking

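The Queue Pair bring-up sequence can be modelled as a small transition table. The sketch below is a simplified model for reasoning about the lifecycle, not a binding to the verbs library: it covers only the forward path plus transitions to the reset and error states, omits the send-queue drain and send-queue error states, and uses hypothetical names throughout.

```python
STATES = ("RESET", "INIT", "RTR", "RTS", "ERR")

# Forward bring-up path: each state has exactly one next step.
FORWARD = {"RESET": "INIT", "INIT": "RTR", "RTR": "RTS"}


def can_transition(current: str, new: str) -> bool:
    """True if the simplified model allows moving from current to new."""
    if current not in STATES or new not in STATES:
        raise ValueError("unknown QP state")
    if new in ("RESET", "ERR"):
        return True   # a QP may be reset or moved to error from any state
    return FORWARD.get(current) == new


def bring_up_path(start: str = "RESET") -> list:
    """States visited when driving a QP from `start` to ready-to-send."""
    path = [start]
    while path[-1] in FORWARD:
        path.append(FORWARD[path[-1]])
    return path
```

A connection manager built on this model would reject, for example, a direct jump from RESET to RTS and instead walk the full path.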
### 2.2 UCX: Unified Communication X

UCX is the primary communication framework for DistOS data plane operations. It provides a hardware-agnostic API over InfiniBand, RoCE, CUDA IPC, CMA (Cross-Memory Attach), and TCP. NCCL uses UCX as an optional backend.

Key materials:
- Shamis et al., "UCX: An Open Source Framework for HPC Network APIs and Beyond" (HOTI 2015) — UCX design and API overview
- UCX documentation (openucx.org) — UCP (the UCX high-level protocol layer) context, endpoints, tagged messages, Active Messages, RDMA operations
- UCX GPU-Direct RDMA integration guide — `ucp_mem_map` with `UCS_MEMORY_TYPE_CUDA`, `ucx_perftest` GPU benchmarks
- OSU Micro-Benchmarks (OMB), Ohio State University — latency/bandwidth benchmarks for UCX transport selection validation
- NVIDIA NCCL-UCX plugin documentation — `NCCL_UCX_*` environment variables, QP provisioning, registration cache

### 2.3 NCCL: NVIDIA Collective Communication Library

NCCL implements the collective communication patterns used by distributed training (allreduce, broadcast, reduce-scatter, all-gather). On Hopper clusters, NCCL uses NVLink for intra-node collectives and InfiniBand for inter-node. Understanding NCCL topology files and ring/tree algorithm selection is essential for network-aware scheduling.

Key materials:
- NCCL documentation and source (github.com/NVIDIA/nccl) — `ncclCommInitRank`, topology detection, algorithm selection (ring, tree, collnet)
- Patarasuk and Yuan, "Bandwidth Optimal All-reduce Algorithms for Clusters of Workstations" (J. Parallel Distrib. Comput. 2009) — ring-allreduce bandwidth analysis
- Fedus et al., "Switch Transformers" appendix on communication cost — practical allreduce scaling analysis
- NCCL topology XML format — specifying NVLink domains, IB rail affinity, and switch hierarchy for NCCL algorithm tuning
- NVIDIA Collective Communication Library Performance Notes — algorithm selection heuristics, tree vs. ring cross-over point

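The ring-allreduce bandwidth analysis reduces to a simple count: a reduce-scatter phase and an all-gather phase each take N - 1 steps, and in every step each rank sends 1/N of the message. The sketch below computes the resulting traffic and a bandwidth-only time estimate. It ignores per-step latency and assumes every rank has the same link bandwidth; the function names are hypothetical.

```python
def ring_allreduce_bytes_per_rank(n_ranks: int, message_bytes: float) -> float:
    """Bytes each rank sends: 2 * (N - 1) steps of message_bytes / N each."""
    if n_ranks < 1:
        raise ValueError("need at least one rank")
    return 2 * (n_ranks - 1) * message_bytes / n_ranks


def ring_allreduce_seconds(n_ranks: int, message_bytes: float,
                           link_bytes_per_s: float) -> float:
    """Bandwidth-only lower bound on ring allreduce time (latency ignored)."""
    if link_bytes_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return ring_allreduce_bytes_per_rank(n_ranks, message_bytes) / link_bytes_per_s
```

Because the per-rank traffic approaches twice the message size as N grows, the bandwidth term is nearly independent of rank count, which is why latency and topology dominate algorithm selection at scale.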
### 2.4 libp2p for the Control Plane
|
||||||
|
|
||||||
|
The DistOS control plane (agent coordination, WHOOSH council formation, SLURP DHT, UCXL resolution) runs over libp2p. Survey the libp2p protocol suite: peer identity (ed25519/secp256k1 keypairs), peer discovery (mDNS, Kademlia DHT), stream multiplexing (yamux, mplex), transport (QUIC, TCP), and NAT traversal.
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- libp2p specification (github.com/libp2p/specs) — multiaddress format, transport protocols, peer routing
|
||||||
|
- Maymounkov and Mazières, "Kademlia: A Peer-to-Peer Information System Based on the XOR Metric" (IPTPS 2002) — foundational DHT for libp2p routing
|
||||||
|
- libp2p QUIC transport specification — 0-RTT handshake, connection migration, multiplexed streams without HoL blocking
|
||||||
|
- IPFS documentation on libp2p — practical deployment patterns, bootstrap peer configuration, ambient peer discovery
|
||||||
|
- Baumgart and Mies, "S/Kademlia: A Practicable Approach Towards Secure Key-Based Routing" (ICPADS 2007) — security hardening for Kademlia relevant to the CHORUS mesh
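The XOR metric at the core of Kademlia routing can be illustrated in a few lines. This is a sketch of the metric only, assuming integer node IDs derived from SHA-256; it does not use the libp2p API, and all function names are hypothetical.

```python
import hashlib


def node_id(peer_name: str) -> int:
    """Derive a 256-bit integer ID from a name (illustrative, not libp2p's PeerId)."""
    return int.from_bytes(hashlib.sha256(peer_name.encode()).digest(), "big")


def xor_distance(a: int, b: int) -> int:
    """Kademlia distance between two IDs."""
    return a ^ b


def bucket_index(self_id: int, other_id: int) -> int:
    """k-bucket index for other_id: position of the highest differing bit."""
    d = xor_distance(self_id, other_id)
    if d == 0:
        raise ValueError("a node does not keep itself in its routing table")
    return d.bit_length() - 1


def closest_peers(target: int, peers: list[int], k: int = 3) -> list[int]:
    """Return the k known peers closest to target under the XOR metric."""
    return sorted(peers, key=lambda p: xor_distance(p, target))[:k]


if __name__ == "__main__":
    me = node_id("agent-a")
    other = node_id("agent-b")
    print("bucket for agent-b:", bucket_index(me, other))
```

Because the metric is symmetric, lookups from either side of a pair converge on the same neighbourhood, which is the property the DHT-based peer discovery in this section relies on.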
|
||||||
|
|
||||||
|
### 2.5 QUIC Protocol
|
||||||
|
|
||||||
|
QUIC provides the transport for DistOS agent control-plane communication (WHOOSH formation, SLURP DHT queries, UCXL resolution, BUBBLE decision records). QUIC's multiplexed streams, 0-RTT connection establishment, and connection migration over multiple network paths make it well-suited to the heterogeneous cluster environment.
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- Langley et al., "The QUIC Transport Protocol: Design and Internet-Scale Deployment" (SIGCOMM 2017) — QUIC design rationale and performance analysis
|
||||||
|
- RFC 9000 — QUIC: A UDP-Based Multiplexed and Secure Transport
|
||||||
|
- RFC 9001 — Using TLS to Secure QUIC
|
||||||
|
- Zhang et al., "QUIC is not Quick Enough over Fast Internet" (WWW 2024) — QUIC performance limitations at high bandwidth relevant to cluster use
|
||||||
|
- Cui et al., "QUIC is not Enough: Towards Wireless QUIC" — multipath extensions relevant to multi-rail networking
|
||||||
|
|
||||||
|
### 2.6 Network Topology Discovery
|
||||||
|
|
||||||
|
At 1024 nodes, manual topology configuration is not feasible. The network stack must automatically discover the cluster topology: fat-tree vs. dragonfly vs. NVLink Switch topology, rail assignments per node, and NVSwitch domain membership per GPU.
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- Al-Fares et al., "A Scalable, Commodity Data Center Network Architecture" (SIGCOMM 2008) — fat-tree topology analysis
|
||||||
|
- Kim et al., "Technology-Driven, Highly-Scalable Dragonfly Topology" (ISCA 2008) — dragonfly topology and routing
|
||||||
|
- InfiniBand OpenSM subnet manager documentation — `smpquery`, `ibnetdiscover`, topology file format, AR (Adaptive Routing) configuration
|
||||||
|
- NVIDIA `nvidia-smi` documentation — GPU topology detection, `nvidia-smi topo -m` output format
|
||||||
|
- `ibstat`, `ibstatus`, `perfquery` — InfiniBand diagnostic tools relevant to topology verification
|
||||||
|
|
||||||
|
### 2.7 Adaptive Routing and Congestion Control
|
||||||
|
|
||||||
|
At 1024-node scale, static routing leads to hot-spots. Adaptive routing dynamically distributes traffic across equal-cost paths. Congestion control limits the incast-driven PFC pause storms that otherwise spread congestion across the fabric and can end in PFC deadlock.
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- Valadarsky et al., "Xpander: Towards Optimal-Performance Datacenters" (CoNEXT 2016) — adaptive topology design
|
||||||
|
- Zhu et al., "Congestion Control for Large-Scale RDMA Deployments" (SIGCOMM 2015) — DCQCN algorithm for RoCE congestion control
|
||||||
|
- NVIDIA Quantum InfiniBand switch AR documentation — per-packet vs. per-flow adaptive routing
|
||||||
|
- Pfaff et al., "The Design and Implementation of Open vSwitch" (NSDI 2015) — SDN-based adaptive routing reference
|
||||||
|
- Mittal et al., "TIMELY: RTT-based Congestion Control for the Datacenter" (SIGCOMM 2015) — RTT-based CC for RDMA
|
||||||
|
- Hyperscale network stacks: Firestone et al., "Azure Accelerated Networking: SmartNICs in the Public Cloud" (NSDI 2018) — hyperscale network stack design patterns; Singh et al., "Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network" (SIGCOMM 2015) — Google datacenter network architecture
|
||||||
|
|
||||||
|
### 2.8 Multi-Rail Networking
|
||||||
|
|
||||||
|
Each node in the cluster has multiple InfiniBand HCAs for aggregate bandwidth. The network stack must implement rail selection policy (hash-based, round-robin, least-loaded), per-flow rail affinity for ordered delivery, and HCA failover.
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- NVIDIA Multi-Rail documentation — `NCCL_IB_HCA` environment variable, rail selection in NCCL
|
||||||
|
- Dragojević et al., "FaRM: Fast Remote Memory" (NSDI 2014) — multi-rail RDMA design patterns
|
||||||
|
- Bai et al., "PIAS: Practical Information-Agnostic Flow Scheduling for Commodity Data Centers" (NSDI 2015) — flow scheduling relevant to multi-rail policy
|
||||||
|
- InfiniBand bonding and link aggregation — `ipoib` bonding, `rdma cm` multi-path
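The rail selection requirements in this subsection (hash-based selection, per-flow affinity, HCA failover) can be sketched as follows. This is an illustrative model only: the `RailSelector` class, the flow identifier format, and the HCA names are assumptions and do not correspond to any NCCL or UCX interface.

```python
import zlib
from dataclasses import dataclass, field


@dataclass
class RailSelector:
    """Hash-based rail selection with per-flow affinity and failover."""
    rails: list[str]                                # e.g. HCA device names
    healthy: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        if not self.healthy:
            self.healthy = set(self.rails)          # all rails start healthy

    def mark_failed(self, rail: str) -> None:
        self.healthy.discard(rail)

    def mark_recovered(self, rail: str) -> None:
        if rail in self.rails:
            self.healthy.add(rail)

    def select(self, flow_id: str) -> str:
        """Map a flow to one healthy rail; the same flow keeps its rail
        for as long as the set of healthy rails does not change."""
        live = [r for r in self.rails if r in self.healthy]
        if not live:
            raise RuntimeError("no healthy rails available")
        return live[zlib.crc32(flow_id.encode()) % len(live)]


if __name__ == "__main__":
    selector = RailSelector(["mlx5_0", "mlx5_1", "mlx5_2", "mlx5_3"])
    print(selector.select("job42:rank0->rank1"))
```

One limitation of the modulo approach is that a single rail failure can remap flows that were on healthy rails too; a consistent-hashing variant would reduce that churn at the cost of extra state.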
|
||||||
|
|
||||||
|
### 2.9 Service Mesh for Agent Communication
|
||||||
|
|
||||||
|
The CHORUS/WHOOSH agent mesh requires a lightweight service mesh for mTLS, service discovery, load balancing, and circuit-breaking. The service mesh must function without a centralised control plane to avoid a single point of failure.
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- Burns et al., "Borg, Omega, and Kubernetes" (ACM Queue 2016) — service mesh design considerations in large-scale systems
|
||||||
|
- Istio architecture documentation — sidecar proxy (Envoy), control plane (Istiod), mTLS design
|
||||||
|
- Linkerd2 documentation — Rust-based ultra-lightweight sidecar proxy (`linkerd2-proxy`)
|
||||||
|
- Envoy proxy documentation — xDS API, circuit-breaking, outlier detection
|
||||||
|
- Cilium documentation — eBPF-based service mesh without sidecar proxies; relevant to low-overhead, sidecar-free agent communication
|
||||||
|
|
||||||
|
### 2.10 Multicast and Pub-Sub for Council Communication
|
||||||
|
|
||||||
|
Council pub-sub (HMMM protocol message distribution, vote broadcasts, research summary publication) requires an efficient multicast or pub-sub mechanism. At 80 agents per council, naive unicast produces O(n) messages per broadcast. Survey IP multicast, InfiniBand multicast, and application-layer pub-sub.
|
||||||
|
|
||||||
|
Key materials:
|
||||||
|
- Deering and Cheriton, "Multicast Routing in Datagram Internetworks and Extended LANs" (ACM TOCS 1990) — foundational IP multicast design
|
||||||
|
- InfiniBand multicast documentation — `ibmcast`, unreliable datagram multicast groups, LID-based group addressing
|
||||||
|
- ZeroMQ documentation — PUB/SUB pattern, EPGM (Encapsulated PGM multicast), NORM protocol
|
||||||
|
- NATS documentation — subject-based pub-sub, JetStream persistence, clustering — directly relevant as BACKBEAT uses NATS JetStream
|
||||||
|
- Eugster et al., "The Many Faces of Publish/Subscribe" (ACM Computing Surveys 2003) — pub-sub system classification
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Agent Roles
|
||||||
|
|
||||||
|
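Total agents: **60**. The nine researcher rows below list 40 subgroup seats, which are filled by 32 dual-tagged researcher agents (see the WHOOSH configuration in Section 7), so the Count column sums to 68 seats.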
|
||||||
|
|
||||||
|
| Role | Count | Responsibilities |
|------|-------|-----------------|
| Lead Architect | 2 | Network stack architecture decisions, cross-subsystem network interface ownership, topology model ownership |
| RDMA/InfiniBand Researchers | 7 | InfiniBand verbs, RoCE v2, QP model, memory region registration survey |
| UCX/NCCL Specialists | 6 | UCX API, NCCL topology and algorithm selection, GPU collective communication integration |
| libp2p/Control Plane Researchers | 5 | libp2p protocol suite, DHT-based peer discovery, stream multiplexing |
| QUIC Transport Specialists | 4 | QUIC protocol design, 0-RTT semantics, multipath/multi-rail QUIC |
| Topology Discovery Researchers | 4 | Automated cluster topology mapping, NVLink domain detection, IB subnet manager integration |
| Adaptive Routing and CC Specialists | 4 | DCQCN, ECMP, OpenSM AR, congestion control at scale |
| Multi-Rail Specialists | 3 | HCA failover, rail selection policy, per-flow affinity |
| Service Mesh Researchers | 4 | mTLS, eBPF-based service mesh, sidecar-less design, circuit-breaking |
| Pub-Sub / Multicast Researchers | 3 | IB multicast, NATS JetStream, council broadcast design |
| Formal Specification Authors | 6 | TLA+ specification of network protocol state machines, RDMA safety, routing convergence |
| Architects (sub-component) | 5 | Concrete architecture proposals for each network subsystem component |
| Internal Reviewers | 5 | Review proposals; green/yellow/red votes with rationale |
| Integration Liaisons | 4 | Interface with `council-mem`, `council-sched`, `council-sec`, `council-telemetry` |
| Decision Record Authors | 3 | Author DRs for all decision points; maintain UCXL provenance chain |
| Adversarial Critics | 3 | Surface congestion collapse scenarios, RDMA registration exhaustion, DHT partition risks |
|
||||||
|
|
||||||
|
**Role distribution rationale:** RDMA/InfiniBand and UCX/NCCL together receive 13 researchers because the data plane is the most technically complex and least tractable domain in this council. The formal specification group (6 agents) is proportionally smaller than `council-sched` and `council-mem` reflecting that network protocol safety properties are narrower in scope (liveness and congestion-freedom rather than memory safety).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Key Deliverables
|
||||||
|
|
||||||
|
### 4.1 Research Summaries
|
||||||
|
|
||||||
|
```
ucxl://council-net:researcher@DistOS:networking/*^/research/rdma-infiniband-roce.md
ucxl://council-net:researcher@DistOS:networking/*^/research/ucx-nccl-data-plane.md
ucxl://council-net:researcher@DistOS:networking/*^/research/libp2p-control-plane.md
ucxl://council-net:researcher@DistOS:networking/*^/research/quic-transport.md
ucxl://council-net:researcher@DistOS:networking/*^/research/topology-discovery.md
ucxl://council-net:researcher@DistOS:networking/*^/research/adaptive-routing-congestion.md
ucxl://council-net:researcher@DistOS:networking/*^/research/multi-rail-networking.md
ucxl://council-net:researcher@DistOS:networking/*^/research/service-mesh-agent-comms.md
ucxl://council-net:researcher@DistOS:networking/*^/research/multicast-pubsub-council.md
```
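The deliverable addresses in this section all follow the same visible shape, which the sketch below splits into parts. The field names (`agent`, `role`, `project`, `task`, `temporal`, `path`) are assumptions inferred from the address layout for illustration; they are not taken from a UCXL specification.

```python
import re
from typing import NamedTuple

# Pattern inferred from addresses such as
# ucxl://council-net:researcher@DistOS:networking/*^/research/quic-transport.md
UCXL_PATTERN = re.compile(
    r"^ucxl://(?P<agent>[^:/@]+):(?P<role>[^:/@]+)"
    r"@(?P<project>[^:/@]+):(?P<task>[^/]+)"
    r"/(?P<temporal>[^/]+)/(?P<path>.+)$"
)


class UcxlAddress(NamedTuple):
    agent: str
    role: str
    project: str
    task: str
    temporal: str   # the "*^" segment in this document; meaning assumed
    path: str


def parse_ucxl(address: str) -> UcxlAddress:
    """Split a UCXL address into its visible components."""
    match = UCXL_PATTERN.match(address)
    if match is None:
        raise ValueError(f"not a recognised UCXL address: {address!r}")
    return UcxlAddress(**match.groupdict())


if __name__ == "__main__":
    addr = parse_ucxl(
        "ucxl://council-net:researcher@DistOS:networking/*^/research/quic-transport.md"
    )
    print(addr.role, addr.path)
```

A parser along these lines would let tooling group deliverables by role or by directory when checking that every listed artifact has been published.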
|
||||||
|
|
||||||
|
### 4.2 Architecture Proposals
|
||||||
|
|
||||||
|
```
ucxl://council-net:architect@DistOS:networking/*^/architecture/network-stack-overview.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/rdma-transport-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/ucx-data-plane-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/control-plane-libp2p.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/topology-model.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/adaptive-routing-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/multi-rail-policy.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/service-mesh-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/council-pubsub-design.md
ucxl://council-net:architect@DistOS:networking/*^/architecture/bandwidth-reservation.md
```
|
||||||
|
|
||||||
|
### 4.3 Decision Records
|
||||||
|
|
||||||
|
```
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-001-transport-selection.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-002-rdma-qp-model.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-003-control-plane-protocol.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-004-topology-discovery-mechanism.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-005-congestion-control-algorithm.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-006-nvlink-ib-domain-bridging.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-007-service-mesh-architecture.md
ucxl://council-net:architect@DistOS:networking/*^/decisions/DR-NET-008-pubsub-mechanism.md
```
|
||||||
|
|
||||||
|
### 4.4 Formal Specifications
|
||||||
|
|
||||||
|
```
ucxl://council-net:verifier@DistOS:networking/*^/specs/RDMATransportProtocol.tla
ucxl://council-net:verifier@DistOS:networking/*^/specs/RoutingConvergence.tla
ucxl://council-net:verifier@DistOS:networking/*^/specs/CongestionControlSafety.tla
```
|
||||||
|
|
||||||
|
### 4.5 Interface Contracts
|
||||||
|
|
||||||
|
```
ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-mem-contract.md
ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-sched-contract.md
ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-sec-contract.md
ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-telemetry-contract.md
```
|
||||||
|
|
||||||
|
### 4.6 Topology Model (shared asset)
|
||||||
|
|
||||||
|
```
ucxl://council-net:architect@DistOS:networking/*^/topology/cluster-topology-model.json
ucxl://council-net:architect@DistOS:networking/*^/topology/nvlink-domain-map.json
ucxl://council-net:architect@DistOS:networking/*^/topology/ib-rail-assignments.json
```
|
||||||
|
|
||||||
|
These three artifacts are the primary outputs consumed by `council-sched` for topology-aware placement. They must be published before the end of Day 3.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Decision Points
|
||||||
|
|
||||||
|
### DP-NET-001: Primary Transport Protocol Selection
|
||||||
|
|
||||||
|
**Question:** Should the DistOS network stack use a single unified transport (UCX, which provides a hardware-agnostic API over IB, RoCE, CUDA IPC, and TCP) or a layered model where different transports serve different roles (RDMA verbs for data plane, QUIC for control plane, TCP/IP for management)?
|
||||||
|
|
||||||
|
**Factors:** UCX simplicity vs. flexibility of purpose-specific transports, operational complexity (one configuration surface vs. three), UCX overhead at small message sizes relevant to control plane traffic, QUIC advantages (stream multiplexing, 0-RTT, connection migration) for agent communication that UCX does not provide.
|
||||||
|
|
||||||
|
**Recommendation bias from research:** A layered model is likely: UCX/RDMA for data plane (GPU collective, RDMA reads/writes), QUIC/libp2p for control plane (agent mesh, WHOOSH, SLURP), TCP/IP for management and compatibility. This requires explicit definition of which traffic classes use which transport.
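The explicit traffic-class definition called for above could take the form of a single lookup table. The sketch below is one possible shape, assuming the three-layer split described in the recommendation; the traffic class names and the `transport_for` helper are hypothetical.

```python
from enum import Enum


class Transport(Enum):
    UCX_RDMA = "ucx/rdma"        # data plane
    QUIC_LIBP2P = "quic/libp2p"  # control plane
    TCP_IP = "tcp/ip"            # management and compatibility


# Hypothetical traffic classes, grouped as in the recommendation above.
TRAFFIC_CLASS_TRANSPORT = {
    "gpu-collective": Transport.UCX_RDMA,
    "rdma-read-write": Transport.UCX_RDMA,
    "agent-mesh": Transport.QUIC_LIBP2P,
    "whoosh-formation": Transport.QUIC_LIBP2P,
    "slurp-dht": Transport.QUIC_LIBP2P,
    "management": Transport.TCP_IP,
}


def transport_for(traffic_class: str) -> Transport:
    """Return the transport assigned to a traffic class, failing loudly
    for classes that have not been assigned one."""
    try:
        return TRAFFIC_CLASS_TRANSPORT[traffic_class]
    except KeyError:
        raise ValueError(f"unassigned traffic class: {traffic_class!r}") from None


if __name__ == "__main__":
    for name in sorted(TRAFFIC_CLASS_TRANSPORT):
        print(f"{name:18s} -> {transport_for(name).value}")
```

Rejecting unknown classes rather than falling back to a default keeps the mapping exhaustive, which matters if the Decision Record is meant to cover every traffic class.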
|
||||||
|
|
||||||
|
### DP-NET-002: RDMA Queue Pair Model
|
||||||
|
|
||||||
|
**Question:** Should DistOS use Reliable Connected (RC) QPs (one QP per peer, guaranteed delivery, ordered), Unreliable Datagram (UD) QPs (one QP to all peers, scalable, no ordering), or Dynamically Connected (DC) transport (a Mellanox/NVIDIA extension in which a small pool of DC initiators reaches any peer, so the QP count stays roughly independent of cluster size)?
|
||||||
|
|
||||||
|
**Factors:** RC QP count scales as O(n^2) for all-to-all communication: at 1024 nodes a full mesh needs 1023 QPs per node and about 1.05M QPs cluster-wide even with one process per node, which is infeasible; UD requires application-level reliability; DC (NVIDIA-proprietary) scales to large clusters but limits portability.
|
||||||
|
|
||||||
|
**Dependency:** RDMA registration model (DP-MEM-008 in `council-mem`) must align with the QP model selected here.
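The O(n^2) factor for RC QPs can be checked with simple arithmetic. This sketch assumes a full mesh in which every process holds one RC QP to every process on every other node; the `procs_per_node` parameter is an illustrative assumption, not a figure from this brief.

```python
def rc_full_mesh_qps(nodes: int, procs_per_node: int = 1) -> tuple[int, int]:
    """Return (QPs per node, QPs cluster-wide) for an RC full mesh.

    Each process opens one QP to every process on every other node.
    Both ends of a connection are counted, since each end holds a QP.
    """
    remote_procs = (nodes - 1) * procs_per_node
    per_node = procs_per_node * remote_procs
    return per_node, nodes * per_node


def ud_qps(nodes: int, procs_per_node: int = 1) -> tuple[int, int]:
    """UD needs one QP per process regardless of peer count."""
    return procs_per_node, nodes * procs_per_node


if __name__ == "__main__":
    print("RC, 1 proc/node :", rc_full_mesh_qps(1024))
    print("RC, 8 procs/node:", rc_full_mesh_qps(1024, 8))
    print("UD, 8 procs/node:", ud_qps(1024, 8))
```

With one process per node the cluster-wide RC count is already just over a million, and it grows with the square of the process count per node, which is the pressure behind the DC and UD alternatives.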
|
||||||
|
|
||||||
|
### DP-NET-003: Control Plane Protocol
|
||||||
|
|
||||||
|
**Question:** Should the DistOS control plane use libp2p (QUIC + Kademlia DHT + mDNS) as used by CHORUS/WHOOSH, a purpose-built cluster control plane (similar to etcd/Raft), or a hybrid where cluster-local discovery uses mDNS and global routing uses Kademlia DHT?
|
||||||
|
|
||||||
|
**Factors:** libp2p operational familiarity (CHORUS already uses it), DHT scalability to 1024 nodes, etcd reliability guarantees but centralisation risk, QUIC connection establishment overhead per agent peer, interaction with BACKBEAT (NATS JetStream) for pub-sub.
|
||||||
|
|
||||||
|
**Dependency:** This decision affects `council-synth` and `council-docs` — the control plane protocol is a foundational assumption for all inter-council communication.
|
||||||
|
|
||||||
|
### DP-NET-004: Topology Discovery Mechanism
|
||||||
|
|
||||||
|
**Question:** Should DistOS implement automatic topology discovery via (a) querying the InfiniBand subnet manager (OpenSM) for fabric topology, (b) active probing with `ibnetdiscover`-style sweeps, (c) passive LLDP/CDP-based discovery, or (d) a hybrid agent-reported topology where each node self-reports its NVLink domain membership, HCA rail assignments, and switch connectivity?
|
||||||
|
|
||||||
|
**Factors:** Convergence time (how long before a newly joined node's topology is reflected in placement decisions), accuracy of IB SM queries vs. active probing, interaction with WHOOSH agent discovery (agents could report their own topology as part of CHORUS join handshake).
|
||||||
|
|
||||||
|
**Output dependency:** The topology model artifacts (cluster-topology-model.json, nvlink-domain-map.json, ib-rail-assignments.json) must be published by end of Day 3 for `council-sched` to begin placement algorithm design.
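Option (d), agent-reported topology, implies a per-node self-report that can be merged into the published artifacts. The sketch below shows one possible record shape; every field name here is hypothetical, and the real schema is for the topology model deliverable to define.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class HcaRail:
    hca: str            # HCA device name, e.g. "mlx5_0" (illustrative)
    rail: int           # rail index this HCA is cabled to
    switch_guid: str    # leaf switch the port connects to


@dataclass
class NodeTopologyReport:
    node_id: str
    nvlink_domain: str
    gpu_count: int
    rails: list[HcaRail] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialise the report, nested rails included."""
        return json.dumps(asdict(self), sort_keys=True)


def nvlink_domain_map(reports: list[NodeTopologyReport]) -> dict[str, list[str]]:
    """Group node IDs by reported NVLink domain."""
    domains: dict[str, list[str]] = {}
    for report in reports:
        domains.setdefault(report.nvlink_domain, []).append(report.node_id)
    return {domain: sorted(nodes) for domain, nodes in domains.items()}


if __name__ == "__main__":
    r = NodeTopologyReport("node-0001", "nvl-00", 8,
                           [HcaRail("mlx5_0", 0, "0xabc")])
    print(r.to_json())
```

Self-reports are cheap to collect during a join handshake, but they would still need cross-checking against subnet manager data, since a node can only describe its own view of the fabric.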
|
||||||
|
|
||||||
|
### DP-NET-005: Congestion Control Algorithm
|
||||||
|
|
||||||
|
**Question:** For the RDMA data plane over RoCE, which congestion control algorithm should DistOS mandate? Options: (a) DCQCN (DCTCP+QCN hybrid from Microsoft and Mellanox, widely deployed on RoCE v2 NICs), (b) TIMELY (RTT-based, by Google), (c) HPCC (High Precision Congestion Control, by Alibaba, requires INT telemetry), (d) Swift (Google, delay-based, the successor to TIMELY).
|
||||||
|
|
||||||
|
**Factors:** ECN hardware support requirements, telemetry infrastructure dependencies (HPCC requires INT which requires `council-telemetry` infrastructure), performance at 1024-node allreduce traffic patterns, PFC interaction (PFC-free designs preferred to avoid head-of-line blocking cascades).
|
||||||
|
|
||||||
|
### DP-NET-006: NVLink-to-InfiniBand Domain Bridging
|
||||||
|
|
||||||
|
**Question:** GPU-to-GPU communication within an NVLink domain uses NVLink (900 GB/s bidirectional). Communication crossing an NVLink domain boundary must use InfiniBand (200 Gbps HDR per port, i.e. 25 GB/s per direction). How should DistOS present this heterogeneous fabric to workloads? Options: (a) expose domain boundaries explicitly in the topology API; workloads are topology-aware, (b) provide a uniform address space with the OS transparently routing NVLink vs. IB, (c) NCCL-style: the collective library handles topology, the OS provides a topology file.
|
||||||
|
|
||||||
|
**Factors:** Programmer complexity, NCCL compatibility (c is the status quo for GPU training), overhead of transparent routing (b), latency impact of crossing domain boundaries.
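The size of the bandwidth gap at a domain boundary follows from the two figures quoted in the question. The sketch below does the unit conversion, assuming the 900 GB/s NVLink figure splits evenly into two directions and that multiple IB ports aggregate linearly; both are simplifying assumptions.

```python
NVLINK_BIDIR_GBYTES_PER_S = 900.0   # figure quoted in the question
IB_HDR_GBITS_PER_S = 200.0          # per port, figure quoted in the question


def gbits_to_gbytes(gbits_per_s: float) -> float:
    """Convert Gbps to GB/s (8 bits per byte, no encoding overhead)."""
    return gbits_per_s / 8.0


def per_direction_gap(ib_ports: int = 1) -> float:
    """Ratio of one-way NVLink bandwidth to one-way aggregate IB bandwidth."""
    nvlink_one_way = NVLINK_BIDIR_GBYTES_PER_S / 2.0
    ib_one_way = gbits_to_gbytes(IB_HDR_GBITS_PER_S) * ib_ports
    return nvlink_one_way / ib_one_way


def transfer_seconds(gbytes: float, gbytes_per_s: float) -> float:
    """Time to move a payload at a given sustained bandwidth."""
    return gbytes / gbytes_per_s


if __name__ == "__main__":
    print("gap with 1 IB port :", per_direction_gap(1))
    print("gap with 8 IB ports:", per_direction_gap(8))
```

Under these assumptions a single HDR port is 18 times slower per direction than NVLink, and eight ports still leave a gap above 2x, which is why the way the boundary is exposed to workloads matters.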
|
||||||
|
|
||||||
|
### DP-NET-007: Service Mesh Architecture
|
||||||
|
|
||||||
|
**Question:** For CHORUS/WHOOSH agent-to-agent communication (the service mesh layer), should DistOS use (a) a sidecar proxy model (Envoy/Linkerd), (b) an eBPF-based sidecar-free service mesh (Cilium), or (c) the existing CHORUS P2P mesh (libp2p with HMMM protocol channels) without an additional service mesh layer?
|
||||||
|
|
||||||
|
**Factors:** Latency overhead of sidecar proxies vs. eBPF bypass, mTLS complexity, operational tooling maturity, overlap with existing CHORUS mesh implementation, interaction with `council-sec`'s mTLS certificate model.
|
||||||
|
|
||||||
|
### DP-NET-008: Council Pub-Sub Mechanism
|
||||||
|
|
||||||
|
**Question:** How should council-wide broadcasts (research summary publications, vote notifications, DR announcements) be delivered to all ~80 agents in a council? Options: (a) NATS JetStream subjects with wildcard subscriptions (already used by BACKBEAT), (b) InfiniBand multicast groups (IB UD multicast, very low latency but complex group management), (c) libp2p GossipSub (probabilistic epidemic broadcast), (d) application-layer unicast fan-out from a designated broadcaster.
|
||||||
|
|
||||||
|
**Factors:** Delivery guarantee (at-least-once vs. at-most-once), ordering requirements (vote notifications must be totally ordered), latency, operational complexity, interaction with BACKBEAT NATS JetStream infrastructure.
|
||||||
|
|
||||||
|
**Recommendation bias from research:** NATS JetStream (option a) is the natural choice given BACKBEAT already provides it. The question is whether IB multicast is justified for latency-critical broadcasts.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Dependencies
|
||||||
|
|
||||||
|
### 6.1 What council-net Needs from Other Councils
|
||||||
|
|
||||||
|
| Dependency | Source Council | Artifact | Purpose |
|------------|---------------|---------|---------|
| RDMA memory registration interface | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-net-contract.md` | RDMA data plane requires agreed memory registration semantics (DP-MEM-008); QP design depends on this |
| Memory buffer alignment and pinning requirements | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/architecture/weka-gds-integration-design.md` | GPUDirect Storage requires specific buffer alignment and Weka GDS integration constraints |
| Placement decisions (GPU assignment map) | `council-sched` | `ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-net-contract.md` | Network-aware placement requires that `council-sched` feed placement events to the network stack for QP provisioning |
| Network-aware scheduling requirements | `council-sched` | `ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/placement-scoring-algorithm.md` | Topology model format must be compatible with placement scoring algorithm input requirements |
| mTLS certificate model and PKI design | `council-sec` | `ucxl://council-sec:architect@DistOS:security/*^/interfaces/sec-to-net-contract.md` | Service mesh mTLS and encrypted QUIC transport require PKI integration |
| Network bandwidth metering contract | `council-telemetry` | `ucxl://council-telemetry:architect@DistOS:telemetry/*^/interfaces/telemetry-to-net-contract.md` | Congestion control algorithm selection (DP-NET-005) may depend on INT telemetry infrastructure availability |
|
||||||
|
|
||||||
|
### 6.2 What Other Councils Need from council-net
|
||||||
|
|
||||||
|
| Consumer Council | Artifact Required | Purpose |
|-----------------|------------------|---------|
| `council-sched` | Topology model (cluster-topology, NVLink domain map, IB rail assignments) | Topology-aware placement scoring; must be published by end of Day 3 |
| `council-sched` | Network bandwidth reservation API | Gang scheduling for allreduce jobs requires co-reserving network bandwidth |
| `council-mem` | RDMA QP model decision (DP-NET-002 outcome) | Memory subsystem designs RDMA buffer registration around the QP model |
| `council-mem` | Inter-node bandwidth model and congestion control parameters | Page migration bandwidth budgeting must account for IB congestion control behaviour |
| `council-sec` | Transport security integration points | Security subsystem specifies mTLS enforcement points on the service mesh and QUIC transport |
| `council-telemetry` | Network event stream API | Metering of IB bandwidth usage, QP errors, and congestion events |
| `council-verify` | TLA+ RDMA transport protocol and routing convergence specs | Formal verification of RDMA safety (no message corruption) and routing liveness (convergence after topology change) |
| `council-synth` | Control plane protocol decision (DP-NET-003 outcome) | Inter-council synthesis depends on the control plane protocol as the communication substrate |
| All councils | Council pub-sub mechanism (DP-NET-008 outcome and configuration) | All councils use the pub-sub mechanism for broadcast communication within their member agents |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. WHOOSH Configuration
|
||||||
|
|
||||||
|
```yaml
# WHOOSH council formation configuration for council-net
council_id: council-net
project: DistOS
subsystem: networking
gitea_label: chorus-entrypoint
gitea_repo: distos/networking

formation:
  target_agents: 60
  min_agents: 45
  wave:
    max_per_wave: 10
    min_per_wave: 5
    period_sec: 30
  placement:
    max_replicas_per_node: 2
  join_stagger_ms: 2000
  bootstrap_peers_min: 4

roles:
  - role: lead-architect
    count: 2
    model: claude-opus-4-6
    priority: high
  - role: researcher
    count: 32
    model: qwen2.5-coder:32b
    priority: normal
    subgroups:
      - tag: rdma-infiniband
        count: 7
      - tag: ucx-nccl
        count: 6
      - tag: libp2p-control-plane
        count: 5
      - tag: quic-transport
        count: 4
      - tag: topology-discovery
        count: 4
      - tag: routing-congestion
        count: 4
      - tag: multi-rail
        count: 3
      - tag: service-mesh
        count: 4
      - tag: pubsub-multicast
        count: 3
    # Note: researcher subgroup counts sum to 40 intentionally to allow
    # some agents to cover two adjacent domains (e.g., RDMA + UCX agents
    # may overlap). WHOOSH assigns 32 researchers but permits dual-tagging.
  - role: architect
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: verifier
    count: 6
    model: deepseek-coder-v2
    priority: normal
  - role: reviewer
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: integration-liaison
    count: 4
    model: qwen2.5-coder:32b
    priority: normal
  - role: decision-record-author
    count: 3
    model: claude-opus-4-6
    priority: normal
  - role: adversarial-critic
    count: 3
    model: claude-opus-4-6
    priority: normal

subchannels:
  - name: net-research
    description: "Network stack research and literature synthesis"
    participants: [researcher, lead-architect]
    pubsub: true
  - name: net-data-plane
    description: "RDMA, UCX, NCCL data plane design — high priority"
    participants: [researcher-rdma-infiniband, researcher-ucx-nccl, architect, lead-architect]
    pubsub: false
  - name: net-control-plane
    description: "libp2p, QUIC, DHT control plane design"
    participants: [researcher-libp2p-control-plane, researcher-quic-transport, architect, lead-architect]
    pubsub: false
  - name: net-topology
    description: "Topology discovery and model publication — time-critical (Day 3 deadline)"
    participants: [researcher-topology-discovery, architect, lead-architect, integration-liaison]
    pubsub: false
  - name: net-architecture
    description: "Architecture proposal discussion and voting"
    participants: [architect, lead-architect, reviewer, adversarial-critic]
    pubsub: true
  - name: net-formal-spec
    description: "TLA+ specification authoring and review"
    participants: [verifier, lead-architect, reviewer]
    pubsub: false
  - name: net-integration
    description: "Cross-council interface negotiation"
    participants: [integration-liaison, lead-architect]
    pubsub: false
  - name: net-decisions
    description: "Decision record authoring and consensus"
    participants: [decision-record-author, lead-architect, reviewer]
    pubsub: true

quorum:
  architecture_changes:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    beat_minutes: 20
    timeout_beats: 6
  # Topology model publication uses expedited quorum (Day 3 deadline)
  topology_model_publication:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 3
    expedited: true
    deadline_day: 3
  research_summaries:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 4
  formal_specs:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    require_verifier: true
    beat_minutes: 25
    timeout_beats: 8
  interface_contracts:
    policy: unanimous
    roles: [lead-architect, integration-liaison]
    beat_minutes: 30
    timeout_beats: 4

gates:
  kaching:
    p95_latency_ms: 250
    max_error_rate: 0.01
  backbeat:
    max_stream_lag: 200
  bootstrap:
    min_healthy_peers: 4
  join:
    min_success_rate: 0.80

review:
  beat_minutes: 20
  quorum:
    total_min: 3
    require_domain_role: true
    require_quality_role: true
  timeout_beats: 6
  no_self_approval: true
```
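The counts in the configuration above can be cross-checked mechanically. The sketch below copies the role and subgroup counts into literals and verifies the totals; the wave arithmetic assumes WHOOSH admits at most one wave per `period_sec`, which is an interpretation of the fields rather than documented behaviour.

```python
import math

ROLE_COUNTS = {
    "lead-architect": 2, "researcher": 32, "architect": 5, "verifier": 6,
    "reviewer": 5, "integration-liaison": 4,
    "decision-record-author": 3, "adversarial-critic": 3,
}
RESEARCHER_SUBGROUPS = {
    "rdma-infiniband": 7, "ucx-nccl": 6, "libp2p-control-plane": 5,
    "quic-transport": 4, "topology-discovery": 4, "routing-congestion": 4,
    "multi-rail": 3, "service-mesh": 4, "pubsub-multicast": 3,
}
TARGET_AGENTS = 60
MAX_PER_WAVE, MIN_PER_WAVE, PERIOD_SEC = 10, 5, 30


def dual_tagged_researchers() -> int:
    """Researchers holding two tags, assuming every researcher holds
    at least one tag and none holds more than two."""
    return sum(RESEARCHER_SUBGROUPS.values()) - ROLE_COUNTS["researcher"]


def wave_bounds() -> tuple[int, int]:
    """Fewest and most waves needed to reach the target agent count."""
    return (math.ceil(TARGET_AGENTS / MAX_PER_WAVE),
            math.ceil(TARGET_AGENTS / MIN_PER_WAVE))


if __name__ == "__main__":
    assert sum(ROLE_COUNTS.values()) == TARGET_AGENTS
    fewest, most = wave_bounds()
    print("dual-tagged researchers:", dual_tagged_researchers())
    print("waves:", fewest, "to", most,
          "| seconds between first and last wave:",
          (fewest - 1) * PERIOD_SEC, "to", (most - 1) * PERIOD_SEC)
```

A check like this is cheap to run whenever role counts change, and it makes the intentional mismatch between subgroup seats and researcher agents explicit.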
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Success Criteria
|
||||||
|
|
||||||
|
1. **Research completeness:** All 9 research domain summaries published to DHT with at least 5 primary references each, approved by council simple majority.
|
||||||
|
|
||||||
|
2. **Topology model published by Day 3:** The three topology model artifacts (cluster-topology-model.json, nvlink-domain-map.json, ib-rail-assignments.json) are published to DHT and their UCXL addresses communicated to `council-sched` by the end of Day 3. This is a hard dependency for scheduling placement design.
|
||||||
|
|
||||||
|
3. **Architecture coverage:** Architectural proposals exist for all 10 network stack components. The data plane (RDMA + UCX + NCCL) and control plane (libp2p + QUIC) proposals are the most critical and must be approved before Phase 3 begins.
|
||||||
|
|
||||||
|
4. **Decision records resolved:** All 8 decision points (DP-NET-001 through DP-NET-008) have corresponding Decision Records with at least 3 alternatives considered, evidence citations, and a chosen option ratified by council supermajority.
|
||||||
|
|
||||||
|
5. **Formal specifications:** TLA+ specifications are published for the RDMA transport protocol (QP safety: no message duplication or loss under reliable connected mode), routing convergence (finite time to stable routing table after topology change), and congestion control safety (freedom from PFC pause storm deadlock). At least 2 of these must have model-checked invariants verified by `council-verify`.
|
||||||
|
|
||||||
|
6. **Interface contracts ratified:** All 4 interface contracts (to `council-mem`, `council-sched`, `council-sec`, `council-telemetry`) are co-signed by integration liaisons from both councils. The RDMA interface contract with `council-mem` must be co-signed before the end of Phase 2 (Day 6).
|
||||||
|
|
||||||
|
7. **Control plane protocol decision communicated:** DP-NET-003 outcome is communicated to `council-synth` and all other councils before Day 5, as the control plane protocol is a foundational assumption for inter-council communication.
|
||||||
|
|
||||||
|
8. **UCXL navigability:** Any Decision Record can be traced to the research summary motivating it within 5 UCXL hops.
|
||||||
|
|
||||||
|
9. **Adversarial review pass:** Each major architecture proposal has a documented adversarial critique and resolution. The congestion control design must specifically address the PFC pause storm scenario at 1024-node allreduce scale, and the QP model must address the O(n^2) QP count scalability problem with a documented mitigation (DC transport, UD with app-level reliability, or QP sharing).
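
The O(n^2) QP concern in criterion 9 can be made concrete with a rough count. The sketch below is illustrative only: it assumes a full mesh of reliable connected (RC) queue pairs between every pair of processes, and the figure of 8 processes per node is a hypothetical layout, not something this brief specifies. Intra-node peers would normally use NVLink or shared memory, so the result is an upper bound.

```python
def full_mesh_rc_qps(nodes: int, procs_per_node: int) -> dict:
    """Count RC queue pairs if every process connects to every other process."""
    procs = nodes * procs_per_node
    per_process = procs - 1                      # one RC QP per remote peer
    per_node = procs_per_node * per_process      # QP state held on each node
    connections = procs * (procs - 1) // 2       # each connection has a QP at both ends
    return {
        "processes": procs,
        "qps_per_process": per_process,
        "qps_per_node": per_node,
        "connections": connections,
    }


# Hypothetical layout: 8 communicating processes per node (for example one per GPU).
stats = full_mesh_rc_qps(nodes=1024, procs_per_node=8)
print(stats)
```

Under these assumptions each node would hold tens of thousands of QPs, which is the pressure that the listed mitigations (DC transport, UD with app-level reliability, QP sharing) are meant to relieve.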
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 9. Timeline
|
||||||
|
|
||||||
|
### Phase 1: Research and Survey (Days 1-3)
|
||||||
|
|
||||||
|
**Day 1:**
|
||||||
|
- WHOOSH forms council; 60 agents join via wave deployment
|
||||||
|
- Researchers self-assign to domain subgroups via `net-research` subchannel
|
||||||
|
- RDMA/IB, UCX/NCCL, and topology discovery research begin immediately — these are on the critical path for the Day 3 topology model publication deadline
|
||||||
|
- libp2p and QUIC control plane research begins in parallel
|
||||||
|
- Integration liaisons contact `council-sched` to understand topology model format requirements and `council-mem` to begin RDMA registration interface discussion
|
||||||
|
|
||||||
|
**Day 2:**
|
||||||
|
- Remaining domains surveyed: adaptive routing/congestion control, multi-rail, service mesh, pub-sub/multicast
|
||||||
|
- Research summaries drafted across all subgroups
|
||||||
|
- Topology discovery researchers draft preliminary topology model in parallel with research (early draft, not yet approved)
|
||||||
|
- Internal review cycle begins; green/yellow/red votes cast
|
||||||
|
|
||||||
|
**Day 3 (hard deadline: topology model artifacts published):**
|
||||||
|
- Research summaries revised; all 9 published to DHT with simple majority approval
|
||||||
|
- Topology model artifacts (cluster-topology-model.json, nvlink-domain-map.json, ib-rail-assignments.json) approved via expedited quorum (beat_minutes: 15, timeout_beats: 3) and UCXL addresses communicated to `council-sched`
|
||||||
|
- Adversarial critics challenge topology discovery completeness and RDMA scalability assumptions
|
||||||
|
- Research phase gate: all summaries approved; topology model published
|
||||||
|
|
||||||
|
### Phase 2: Architecture and Trade-offs (Days 3-6)
|
||||||
|
|
||||||
|
**Day 3-4:**
|
||||||
|
- DP-NET-001 (transport selection) proposed and voted — this is the highest-impact architectural decision
|
||||||
|
- DP-NET-002 (RDMA QP model) — joint session with `council-mem` RDMA liaison; DP-MEM-008 alignment confirmed
|
||||||
|
- DP-NET-003 (control plane protocol) proposed — result communicated to all councils as soon as resolved (target: Day 4)
|
||||||
|
|
||||||
|
**Day 4-5:**
|
||||||
|
- DP-NET-004 (topology discovery mechanism) resolved
|
||||||
|
- DP-NET-005 (congestion control) resolved — `council-telemetry` consulted on INT telemetry infrastructure availability
|
||||||
|
- Adaptive routing and multi-rail architecture proposals authored
|
||||||
|
|
||||||
|
**Day 5-6:**
|
||||||
|
- DP-NET-006 (NVLink/IB domain bridging) resolved — `council-sched` consulted
|
||||||
|
- DP-NET-007 (service mesh architecture) resolved — `council-sec` consulted on mTLS model
|
||||||
|
- DP-NET-008 (pub-sub mechanism) resolved — BACKBEAT team consulted on NATS JetStream configuration
|
||||||
|
- All 8 Decision Records drafted and voted by council supermajority
|
||||||
|
- Architecture overview assembled; architecture phase gate passed
|
||||||
|
|
||||||
|
### Phase 3: Formal Specification (Days 6-10)
|
||||||
|
|
||||||
|
**Day 6-7:**
|
||||||
|
- TLA+ specification of RDMA transport protocol begun (QP state machine, message delivery safety)
|
||||||
|
- Routing convergence spec begun (Kademlia or IB SM topology distribution, convergence after link failure)
|
||||||
|
- `council-verify` given early access to spec drafts
|
||||||
|
|
||||||
|
**Day 7-8:**
|
||||||
|
- Congestion control safety spec authored (PFC storm freedom, DCQCN convergence)
|
||||||
|
- RDMA transport safety invariants: no duplicate delivery in RC mode, guaranteed delivery or error reporting
|
||||||
|
|
||||||
|
**Day 8-10:**
|
||||||
|
- Model checking submitted to `council-verify`
|
||||||
|
- Any counterexamples trigger architecture revision with updated DRs
|
||||||
|
- Formal spec versions pinned in DHT; UCXL addresses published
|
||||||
|
|
||||||
|
### Phase 4: Integration and Review (Days 10-12)
|
||||||
|
|
||||||
|
**Day 10-11:**
|
||||||
|
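- Interface contracts with `council-mem`, `council-sched`, `council-sec`, `council-telemetry` finalised and co-signed (the RDMA contract with `council-mem` must already have been co-signed by Day 6, per the success criteria)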
|
||||||
|
- Cross-council integration session: QP model validated with `council-mem` RDMA registration design
|
||||||
|
- Cross-council integration session: topology model consumed by `council-sched` placement scoring — confirm format compatibility
|
||||||
|
- `council-synth` engaged for control plane protocol conflicts with other councils
|
||||||
|
|
||||||
|
**Day 11-12:**
|
||||||
|
- Final council review of complete network specification
|
||||||
|
- Adversarial critics run PFC storm scenario, DHT partition scenario, and O(n^2) QP scale scenario
|
||||||
|
- All yellow votes addressed with documented mitigations
|
||||||
|
- Integration review gate: all interface contracts co-signed
|
||||||
|
|
||||||
|
### Phase 5: Documentation and Narrative (Days 12-14)
|
||||||
|
|
||||||
|
**Day 12-13:**
|
||||||
|
- Decision record authors produce narrative summaries of network design evolution
|
||||||
|
- `council-docs` receives complete network specification for standardised formatting
|
||||||
|
- UCXL navigability audit: spot-check 10 random decision paths
|
||||||
|
|
||||||
|
**Day 14:**
|
||||||
|
- Final specification published
|
||||||
|
- `council-arch` generates human-readable narrative of network stack design evolution
|
||||||
|
- Council dissolved; agents released back to WHOOSH pool
|
||||||
348
councils/04-fault-tolerance.md
Normal file
@@ -0,0 +1,348 @@
|
|||||||
|
# Council Design Brief: Fault Tolerance, Consensus, and Recovery
|
||||||
|
|
||||||
|
**Council ID:** `council-fault`
|
||||||
|
**Mission:** Design the fault tolerance architecture for DistOS — encompassing failure detection, distributed consensus, checkpoint/recovery, and Byzantine resilience — such that the 1024-node cluster degrades gracefully under any realistic failure scenario and recovers to full operation without human intervention.
|
||||||
|
**UCXL Base Address:** `ucxl://council-fault:*@DistOS:fault-tolerance/*`
|
||||||
|
**Agent Count:** ~60
|
||||||
|
**Status:** Pre-formation (Constitution Phase)
|
||||||
|
**Created:** 2026-02-24
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Scope and Responsibilities
|
||||||
|
|
||||||
|
Council-fault is responsible for every mechanism by which DistOS detects, tolerates, and recovers from failure. This spans the full failure taxonomy: transient bit errors, node crashes, NIC failures, rack-level power loss, network partitions, and Byzantine agent behaviour. The council does not own the network transport layer (that is council-net's domain) nor the memory model (council-mem), but it owns the protocols and state machines that run over those substrates to provide fault-tolerant guarantees.
|
||||||
|
|
||||||
|
### In Scope
|
||||||
|
|
||||||
|
- Failure detection algorithms and their parameterisation for a 1024-node cluster
|
||||||
|
- Consensus protocol selection, instantiation, and formal specification for each use case
|
||||||
|
- Checkpoint strategy design for long-running GPU workloads (distributed training, inference serving)
|
||||||
|
- Exactly-once and at-least-once delivery semantics at the OS level
|
||||||
|
- Split-brain prevention and partition healing
|
||||||
|
- Hot standby, warm standby, and cold recovery strategies and their trade-offs
|
||||||
|
- State machine replication design for DistOS control plane components
|
||||||
|
- Failure domain modelling: node, rack, network segment, availability zone
|
||||||
|
- Byzantine fault tolerance for the council/consensus layer of DistOS itself
|
||||||
|
- Graceful degradation policies: what the system continues to do under N simultaneous failures
|
||||||
|
- Recovery time objective (RTO) and recovery point objective (RPO) targets per workload class
|
||||||
|
|
||||||
|
### Out of Scope
|
||||||
|
|
||||||
|
- Physical network topology and RDMA configuration (council-net)
|
||||||
|
- Memory coherence protocols (council-mem)
|
||||||
|
- Scheduling decisions post-recovery (council-sched)
|
||||||
|
- Security properties of the consensus protocol (council-sec coordinates on this)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Research Domains
|
||||||
|
|
||||||
|
### 2.1 Failure Detection
|
||||||
|
|
||||||
|
The fundamental challenge is discriminating between a slow node and a dead one in a 1024-node, high-throughput RDMA environment. Conservative (long) timeouts delay the detection of genuinely failed nodes; aggressive (short) timeouts produce false positives that trigger unnecessary failover storms.
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- Chandra & Toueg (1996), "Unreliable failure detectors for reliable distributed systems" — foundational completeness/accuracy definitions for failure detectors; shows that consensus becomes solvable in an asynchronous system given even an unreliable detector, with the eventually strong class (◇S) as the key example
|
||||||
|
- Hayashibara et al. (2004), "The phi accrual failure detector" — Cassandra's production adoption illustrates how phi accrual adapts to network jitter where fixed-threshold detectors cannot; threshold calibration for 10 GbE vs 400 Gb/s InfiniBand will differ substantially
|
||||||
|
- Das et al. (2002), "SWIM: Scalable Weakly-consistent Infection-style Process group Membership" — constant per-member probe load and O(log N) dissemination time make SWIM a practical choice for 1000+ node clusters; HashiCorp's memberlist (used by Serf and Consul) derives from it; DistOS should evaluate SWIM with the piggybacking optimisation and compare to phi accrual on the target InfiniBand fabric
|
||||||
|
- Gupta, Chandra & Goldszmidt (2001), "On scalable and efficient distributed failure detectors" — establishes that randomised probing of peers (the approach SWIM builds on) provides probabilistic completeness without the O(N²) cost of all-pairs heartbeating
|
||||||
|
|
||||||
|
**Open questions for research phase:** What is the optimal phi accrual window size given InfiniBand's sub-microsecond baseline latency and the potential for multi-second NVLink saturation events? Should DistOS use a separate out-of-band management network (BMC/IPMI) to disambiguate slow-GPU from dead-node?
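
To make the window-size question concrete, here is a minimal sketch of a phi accrual detector. It assumes heartbeat inter-arrival times are approximately normally distributed (the model used in the original paper); the class name, the `window_size` default, and the `min_std` floor are illustrative choices, not DistOS parameters.

```python
import math
from collections import deque
from statistics import mean, pstdev


class PhiAccrualDetector:
    """Suspicion level phi = -log10(P(heartbeat arrives later than now))."""

    def __init__(self, window_size: int = 100, min_std: float = 1e-6):
        self.intervals = deque(maxlen=window_size)  # sliding window of inter-arrival times
        self.last_arrival = None
        self.min_std = min_std                      # floor to avoid division by zero

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat received at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now: float) -> float:
        """Return the current suspicion level; higher means more likely dead."""
        if not self.intervals:
            return 0.0
        mu = mean(self.intervals)
        sigma = max(pstdev(self.intervals), self.min_std)
        elapsed = now - self.last_arrival
        # Tail probability of the normal distribution beyond `elapsed`.
        p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-300))    # clamp to avoid log10(0)


detector = PhiAccrualDetector(window_size=100)
for t in (0.0, 1.0, 2.1, 3.0, 4.05, 5.0):
    detector.heartbeat(t)
phi_on_time = detector.phi(5.5)   # half an interval after the last heartbeat
phi_late = detector.phi(7.0)      # two intervals after the last heartbeat
```

A small window tracks jitter quickly but forgets rare long stalls; a large window does the opposite. That is the trade-off the research phase needs to quantify for NVLink saturation events.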
|
||||||
|
|
||||||
|
### 2.2 Consensus Protocols
|
||||||
|
|
||||||
|
DistOS requires consensus at multiple granularities: cluster-wide for global control plane decisions, rack-local for fast scheduling, and per-workgroup for GPU collective operations.
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- Ongaro & Ousterhout (2014), "In Search of an Understandable Consensus Algorithm (Raft)" — Raft's understandability advantage over Multi-Paxos makes it the baseline for DistOS control plane; etcd (used in Kubernetes) provides a production-quality reference implementation; the council must assess whether Raft's leader bottleneck is acceptable at 1024-node scale
|
||||||
|
- Lamport (1998, 2001), "The Part-Time Parliament" and "Paxos Made Simple" — Multi-Paxos enables pipelining and multi-leader variants; the council should evaluate whether the complexity cost is justified for DistOS's write-heavy workload tracking
|
||||||
|
- Lamport (2011), "Byzantizing Paxos by Refinement" — pathway from crash-fault-tolerant (CFT) to Byzantine-fault-tolerant (BFT) without redesigning the system from scratch
|
||||||
|
- Castro & Liskov (1999), "Practical Byzantine Fault Tolerance" (PBFT) — O(N²) message complexity makes vanilla PBFT unsuitable for 1024 nodes, but the security model and view-change protocol remain the reference for BFT design
|
||||||
|
- Yin et al. (2019), "HotStuff: BFT Consensus with Linearity and Responsiveness" — uses threshold signatures to achieve communication linear in N per view, including view change, where PBFT's normal case alone is O(N²); adopted in Diem/LibraBFT; viable for DistOS's security-sensitive control paths if the threat model includes Byzantine council agents
|
||||||
|
- Liskov & Cowling (2012), "Viewstamped Replication Revisited" — predates Raft but provides cleaner reconfiguration semantics; relevant when council-fault designs the cluster membership change protocol
|
||||||
|
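- Howard et al. (2016), "Flexible Paxos: Quorum Intersection Revisited" — shows that only leader-election (phase 1) quorums and replication (phase 2) quorums need to intersect, so replication quorums can be made smaller than a majority to reduce commit latency at the cost of larger election quorums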
|
||||||
|
- Chandra et al. (2007), "Paxos Made Live: An Engineering Perspective" — Google's engineering lessons from deploying Paxos in Chubby; master lease management, disk corruption handling, and operational complexity are directly relevant to DistOS
|
||||||
|
- Google Chubby (Burrows 2006) — coarse-grained lock service over Paxos; DistOS's distributed lock manager should evaluate this architecture
|
||||||
|
- Apache ZooKeeper / ZAB (Reed & Junqueira 2008), "A simple totally ordered broadcast protocol" — ZAB's primary-backup model with in-order delivery guarantees is simpler than Paxos for log replication; used in Kafka and HBase; council should assess fit for DistOS's scheduler event log
|
||||||
|
|
||||||
|
**Open questions for research phase:** Should DistOS use a single consensus cluster (simple, potential bottleneck) or hierarchical consensus (rack-local Raft groups coordinated by a meta-Raft group)? At what cluster size does the Raft leader become a write throughput bottleneck? What is the cost of HotStuff threshold signature aggregation on the cluster's CPU budget?
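
The quorum arithmetic behind these questions is small enough to sketch. The helpers below use the standard formulas: majority quorums for crash faults, n >= 3f + 1 for Byzantine faults, and the Flexible Paxos intersection condition. The function names are illustrative and not part of any DistOS interface.

```python
def cft_majority(n: int) -> tuple:
    """(quorum size, tolerated crash faults) for majority quorums as in Raft or Multi-Paxos."""
    return n // 2 + 1, (n - 1) // 2


def bft_quorum(n: int) -> tuple:
    """(quorum size, tolerated Byzantine faults) for PBFT/HotStuff-style protocols."""
    f = (n - 1) // 3                 # largest f with n >= 3f + 1
    return (n + f) // 2 + 1, f       # any two quorums share at least f + 1 replicas


def fpaxos_valid(n: int, q1: int, q2: int) -> bool:
    """Flexible Paxos: every phase-1 quorum must intersect every phase-2 quorum."""
    return q1 + q2 > n


print(cft_majority(5))          # 5-replica Raft group
print(bft_quorum(7))            # 7-replica BFT group
print(fpaxos_valid(10, 9, 2))   # small replication quorum, large election quorum
```

The same fault tolerance costs more replicas under BFT than under CFT, which bears directly on the HotStuff CPU-budget question.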
|
||||||
|
|
||||||
|
### 2.3 Checkpoint and Restart for GPU Workloads
|
||||||
|
|
||||||
|
Long-running distributed training jobs on 1024 GPUs represent the highest-value workloads; losing 100 hours of training to a single node failure is economically catastrophic. Checkpoint strategy must balance checkpoint overhead against recovery cost.
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- Vaswani et al. (2017), "Attention Is All You Need" — the transformer training workloads that DistOS must protect; model sizes (70B–1T+ parameters) determine checkpoint volume and frequency requirements
|
||||||
|
- Shoeybi et al. (2019), "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" — Megatron-LM's distributed checkpoint format partitions model state across nodes; DistOS must understand this format to design transparent checkpointing that does not require application modification
|
||||||
|
- Rajbhandari et al. (2020), "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" — ZeRO-3 partitions optimizer state, gradients, and parameters across devices; recovery after a single node failure requires coordinated state reassembly from multiple surviving nodes
|
||||||
|
- Koo et al. (2020), "Elastic Training for BERT" — elastic distributed training with dynamic rank reconfiguration; motivates checkpoint formats that survive changes in node count
|
||||||
|
- Duell (2005), "The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart" and DMTCP (Ansel et al. 2009) — transparent process-level checkpointing; an OS-level mechanism in DistOS could provide DMTCP-style capabilities without application cooperation
|
||||||
|
- NVIDIA NCCL fault recovery — NCCL 2.x introduced limited fault recovery for collective operations; DistOS must integrate with or extend this for transparent GPU collective resilience
|
||||||
|
- Young (1974), "A first order approximation to the optimum checkpoint interval" — classic result showing that the optimal checkpoint interval is approximately sqrt(2 * checkpoint_cost * MTBF), where MTBF is the system-wide mean time between failures; DistOS should instantiate this formula with empirically measured values for the target cluster
|
||||||
|
- Gupta et al. (2018), "Checkpointing and Recovery for Iterative Machine Learning" — shows that checkpoint frequency should adapt to loss curve stability; motivates adaptive checkpointing in DistOS
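
As a worked instance of Young's first-order approximation, the sketch below computes the interval as sqrt(2 * C * M), where C is the time to write one checkpoint and M is the system-wide mean time between failures. The numbers are placeholders chosen for easy arithmetic, not measurements. The step from per-node to system MTBF assumes independent, exponentially distributed node failures.

```python
import math


def system_mtbf(node_mtbf_s: float, nodes: int) -> float:
    """System MTBF assuming independent, exponentially distributed node failures."""
    return node_mtbf_s / nodes


def young_interval(checkpoint_cost_s: float, system_mtbf_s: float) -> float:
    """Young's approximation of the optimal time between checkpoints."""
    return math.sqrt(2.0 * checkpoint_cost_s * system_mtbf_s)


# Placeholder inputs: per-node MTBF of 18,432 hours, 1024 nodes, 100 s per checkpoint.
mtbf = system_mtbf(node_mtbf_s=18_432 * 3600, nodes=1024)   # 18 hours system-wide
interval = young_interval(checkpoint_cost_s=100.0, system_mtbf_s=mtbf)
print(f"system MTBF = {mtbf / 3600:.1f} h, checkpoint every {interval / 60:.0f} min")
```

The approximation holds when the checkpoint cost is small relative to the MTBF. If measured values violate that, a higher-order model would be needed.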
|
||||||
|
|
||||||
|
**Weka FS considerations:** Weka's parallel filesystem provides high-bandwidth checkpoint writes. The council must determine: (a) whether checkpoint data should bypass the page cache and write directly to Weka via O_DIRECT, (b) whether Weka's erasure coding provides sufficient durability so that a single Weka shard loss does not require checkpoint redundancy, and (c) optimal checkpoint object sizes given Weka's stripe geometry.
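
A back-of-envelope sizing helps frame these questions. The sketch assumes mixed-precision Adam training where a checkpoint stores fp16 weights plus fp32 master weights, momentum, and variance, giving 14 bytes per parameter (the accounting used in the ZeRO paper, minus gradients). The 1 TB/s aggregate write bandwidth is a hypothetical figure, not a Weka specification.

```python
def checkpoint_bytes(params: int, bytes_per_param: int = 14) -> int:
    """Checkpoint size: 2 B fp16 weights + 12 B fp32 optimizer state per parameter (assumed)."""
    return params * bytes_per_param


def write_seconds(total_bytes: int, aggregate_bw_bytes_per_s: float) -> float:
    """Time to write a checkpoint at a sustained aggregate bandwidth."""
    return total_bytes / aggregate_bw_bytes_per_s


ASSUMED_BW = 1e12  # hypothetical 1 TB/s aggregate write bandwidth

for name, params in (("70B", 70_000_000_000), ("1T", 1_000_000_000_000)):
    size = checkpoint_bytes(params)
    print(f"{name}: {size / 1e12:.2f} TB, {write_seconds(size, ASSUMED_BW):.1f} s at 1 TB/s")
```

The write time feeds the checkpoint cost term of the interval calculation, so measured Weka bandwidth should replace the placeholder as soon as it is available.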
|
||||||
|
|
||||||
|
### 2.4 Exactly-Once and Delivery Semantics
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- Zaharia et al. (2013), "Discretized Streams: Fault-Tolerant Streaming Computation at Scale" — Spark Streaming's micro-batch approach achieves exactly-once by recomputing lost micro-batches from durable sources; relevant for DistOS's telemetry and event log pipelines
|
||||||
|
- Kreps et al. (2011), "Kafka: a Distributed Messaging System for Log Processing" — the original design offers at-least-once delivery; the idempotent producer added later (Kafka 0.11), which deduplicates by producer ID and sequence number, provides exactly-once at the messaging layer; DistOS's internal event bus should adopt equivalent semantics
|
||||||
|
- Akidau et al. (2015), "The Dataflow Model" — unified model for batch and streaming with watermarks; DistOS event processing should support watermark-based exactly-once semantics
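
The sequence-number approach to exactly-once can be sketched as at-least-once delivery plus consumer-side deduplication. The class below is a minimal illustration and assumes each producer's messages arrive in order; `IdempotentConsumer` and its fields are hypothetical names. A real implementation would persist the sequence table atomically with the applied effect.

```python
class IdempotentConsumer:
    """Applies each (producer, sequence) pair at most once, even if redelivered."""

    def __init__(self):
        self.last_seq = {}   # producer_id -> highest sequence number applied
        self.applied = []    # stand-in for the consumer's side effects

    def deliver(self, producer_id: str, seq: int, payload: str) -> bool:
        """Return True if the message was applied, False if it was a duplicate."""
        if seq <= self.last_seq.get(producer_id, -1):
            return False                      # redelivery after a retry: ignore
        self.last_seq[producer_id] = seq
        self.applied.append(payload)
        return True


consumer = IdempotentConsumer()
results = [
    consumer.deliver("sched-1", 0, "job-42 started"),
    consumer.deliver("sched-1", 0, "job-42 started"),   # duplicate from a retry
    consumer.deliver("sched-1", 1, "job-42 finished"),
]
```

Here the second delivery is dropped, so the effect is applied once even though the message arrived twice.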
|
||||||
|
|
||||||
|
### 2.5 Split-Brain Prevention
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- Bailis et al. (2013), "Highly Available Transactions: Virtues and Limitations" — characterises the exact split between operations achievable during partition (highly available) and those requiring coordination (partition-intolerant); DistOS must classify each operation type
|
||||||
|
- Brewer (2000), "Towards Robust Distributed Systems" (CAP theorem) — the council must explicitly document CP vs AP trade-offs for each DistOS subsystem; scheduling metadata can be AP, distributed locks must be CP
|
||||||
|
- Fonseca et al. (2007), "X-Trace: A Pervasive Network Tracing Framework" — causal tracing across partition boundaries enables forensic analysis of split-brain events post-recovery
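
One widely used guard for CP resources such as distributed locks is the fencing token. The lock service hands out a monotonically increasing token with each grant, and the protected resource rejects any request carrying an older token. The sketch below is offered as an illustration of that general technique, not as a mechanism this brief has selected; `FencedStore` is a hypothetical name.

```python
class FencedStore:
    """A resource that rejects writes from holders of stale lock grants."""

    def __init__(self):
        self.highest_token = -1   # largest fencing token seen so far
        self.value = None

    def write(self, token: int, value) -> bool:
        """Accept the write only if the token is not older than one already seen."""
        if token < self.highest_token:
            return False          # writer lost its lock during a partition or pause
        self.highest_token = token
        self.value = value
        return True


store = FencedStore()
first = store.write(33, "from old leader")
second = store.write(34, "from new leader")
stale = store.write(33, "old leader wakes up after partition heals")
```

The stale writer is rejected even though it still believes it holds the lock, which prevents divergent state when both sides of a partition were briefly active.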
|
||||||
|
|
||||||
|
### 2.6 State Machine Replication and Azure Service Fabric
|
||||||
|
|
||||||
|
- Terry (2013), "Replicated Data Consistency Explained Through Baseball" — accessible model for multi-tier consistency; DistOS should define consistency levels analogous to this model
|
||||||
|
- Kakivaya et al. (2018), "Service Fabric: A Distributed Platform for Building Microservices in the Cloud" — Azure Service Fabric's health model, reliable collections, and actor model directly inform DistOS's control plane reliability design; particularly relevant is its failure domain-aware placement and automatic repair orchestration
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Agent Roles
|
||||||
|
|
||||||
|
| Role | Count | Responsibilities |
|
||||||
|
|------|-------|-----------------|
|
||||||
|
| `lead-architect` | 2 | Cross-domain coherence, final decision arbitration, dependency liaison with council-net and council-mem |
|
||||||
|
| `failure-detection-specialist` | 6 | Phi accrual and SWIM protocol design, parameterisation for InfiniBand fabric, integration with out-of-band management |
|
||||||
|
| `consensus-engineer` | 10 | Raft instantiation for control plane, hierarchical consensus evaluation, quorum configuration, leader election timing analysis |
|
||||||
|
| `byzantine-resilience-analyst` | 6 | HotStuff evaluation, threat model for Byzantine agents, threshold signature overhead analysis, BFT protocol formal specification |
|
||||||
|
| `checkpoint-engineer` | 10 | GPU workload checkpoint design, Weka FS integration, DMTCP-style transparent checkpointing, checkpoint interval optimisation |
|
||||||
|
| `recovery-coordinator` | 6 | Hot/warm/cold standby design, RTO/RPO target definition, failure domain modelling, recovery orchestration state machine |
|
||||||
|
| `delivery-semantics-specialist` | 4 | Exactly-once guarantees for event bus, idempotency patterns, watermark-based processing |
|
||||||
|
| `partition-analyst` | 4 | CAP classification of DistOS subsystems, split-brain prevention protocols, partition healing procedures |
|
||||||
|
| `formal-spec-author` | 6 | TLA+ specifications for consensus and failure detection state machines, invariant definition, interface with council-verify |
|
||||||
|
| `integration-liaison` | 4 | Tracks dependency interfaces with council-net, council-mem, council-sched; attends synthesis sessions; produces interface contracts |
|
||||||
|
| `research-surveyor` | 2 | Literature monitoring, emerging results, coordination with external references |
|
||||||
|
|
||||||
|
**Total: 60 agents**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Key Deliverables
|
||||||
|
|
||||||
|
### Phase 1: Research (Days 1-3)
|
||||||
|
|
||||||
|
| Deliverable | UCXL Address | Description |
|
||||||
|
|-------------|-------------|-------------|
|
||||||
|
| Failure detection survey | `ucxl://council-fault:researcher@DistOS:fault-tolerance/*^/research/failure-detection-survey.md` | Comparative analysis of phi accrual vs SWIM on InfiniBand fabrics with parameterisation recommendations |
|
||||||
|
| Consensus protocol comparison | `ucxl://council-fault:researcher@DistOS:fault-tolerance/*^/research/consensus-protocol-comparison.md` | Raft vs Multi-Paxos vs HotStuff trade-off matrix for DistOS use cases |
|
||||||
|
| Checkpoint volume analysis | `ucxl://council-fault:researcher@DistOS:fault-tolerance/*^/research/checkpoint-volume-analysis.md` | Checkpoint sizes and write bandwidth requirements for 70B–1T parameter models on 1024 GPUs |
|
||||||
|
| Failure domain taxonomy | `ucxl://council-fault:researcher@DistOS:fault-tolerance/*^/research/failure-domain-taxonomy.md` | Classification of failure modes, estimated MTTF per domain, impact radius |
|
||||||
|
|
||||||
|
### Phase 2: Architecture (Days 3-6)
|
||||||
|
|
||||||
|
| Deliverable | UCXL Address | Description |
|
||||||
|
|-------------|-------------|-------------|
|
||||||
|
| Failure detection design | `ucxl://council-fault:lead-architect@DistOS:fault-tolerance/*^/architecture/failure-detection-design.md` | Selected algorithm, configuration parameters, false positive/negative rate targets |
|
||||||
|
| Consensus architecture | `ucxl://council-fault:consensus-engineer@DistOS:fault-tolerance/*^/architecture/consensus-architecture.md` | Hierarchical Raft topology, group sizing, election timeout bands |
|
||||||
|
| Checkpoint strategy | `ucxl://council-fault:checkpoint-engineer@DistOS:fault-tolerance/*^/architecture/checkpoint-strategy.md` | Checkpoint placement policy, Weka FS write path, adaptive interval algorithm |
|
||||||
|
| Recovery state machine | `ucxl://council-fault:recovery-coordinator@DistOS:fault-tolerance/*^/architecture/recovery-state-machine.md` | Node failure response sequence, standby promotion protocol, divergence resolution |
|
||||||
|
| DR-FT-001: Consensus protocol selection | `ucxl://council-fault:lead-architect@DistOS:fault-tolerance/*^/decisions/DR-FT-001-consensus-selection.md` | Formal decision record for consensus protocol choice |
|
||||||
|
| DR-FT-002: Checkpoint strategy | `ucxl://council-fault:lead-architect@DistOS:fault-tolerance/*^/decisions/DR-FT-002-checkpoint-strategy.md` | Formal decision record for checkpoint approach |
|
||||||
|
|
||||||
|
### Phase 3: Formal Specification (Days 6-10)
|
||||||
|
|
||||||
|
| Deliverable | UCXL Address | Description |
|
||||||
|
|-------------|-------------|-------------|
|
||||||
|
| Raft TLA+ specification | `ucxl://council-fault:formal-spec-author@DistOS:fault-tolerance/*^/specs/raft-consensus.tla` | TLA+ model of DistOS Raft variant with cluster membership changes |
|
||||||
|
| Failure detector TLA+ specification | `ucxl://council-fault:formal-spec-author@DistOS:fault-tolerance/*^/specs/failure-detector.tla` | Completeness and accuracy properties as TLA+ invariants |
|
||||||
|
| Checkpoint protocol specification | `ucxl://council-fault:formal-spec-author@DistOS:fault-tolerance/*^/specs/checkpoint-protocol.tla` | No-orphaned-data and consistent-recovery-point invariants |
|
||||||
|
| Recovery orchestration specification | `ucxl://council-fault:formal-spec-author@DistOS:fault-tolerance/*^/specs/recovery-orchestration.tla` | Progress liveness property: cluster always eventually recovers from N simultaneous failures |
|
||||||
|
| Exactly-once semantics specification | `ucxl://council-fault:delivery-semantics-specialist@DistOS:fault-tolerance/*^/specs/exactly-once-semantics.tla` | Idempotency and ordering invariants for DistOS event bus |
|
||||||
|
|
||||||
|
### Phase 4: Integration (Days 10-12)
|
||||||
|
|
||||||
|
| Deliverable | UCXL Address | Description |
|
||||||
|
|-------------|-------------|-------------|
|
||||||
|
| Network interface contract | `ucxl://council-fault:integration-liaison@DistOS:fault-tolerance/*^/integration/council-net-interface.md` | Agreed API surface with council-net: topology queries, link failure notifications |
|
||||||
|
| Memory interface contract | `ucxl://council-fault:integration-liaison@DistOS:fault-tolerance/*^/integration/council-mem-interface.md` | Agreed API surface with council-mem: checkpoint buffer allocation, durability guarantees |
|
||||||
|
| Scheduler interface contract | `ucxl://council-fault:integration-liaison@DistOS:fault-tolerance/*^/integration/council-sched-interface.md` | Agreed API surface with council-sched: workload evacuation on node failure, checkpoint placement hints |
|
||||||
|
| Cross-council conflict log | `ucxl://council-synth:synthesizer@DistOS:integration/*^/conflicts/council-fault-vs-council-sched.md` | Conflicts raised to council-synth for resolution |
|
||||||
|
|
||||||
|
### Phase 5: Documentation (Days 12-14)
|
||||||
|
|
||||||
|
| Deliverable | UCXL Address | Description |
|
||||||
|
|-------------|-------------|-------------|
|
||||||
|
| Fault tolerance reference specification | `ucxl://council-fault:lead-architect@DistOS:fault-tolerance/*^/docs/fault-tolerance-reference-spec.md` | Complete, self-contained specification of DistOS fault tolerance model |
|
||||||
|
| Operator runbook | `ucxl://council-fault:recovery-coordinator@DistOS:fault-tolerance/*^/docs/operator-runbook.md` | Step-by-step procedures for manual intervention in pathological failure scenarios |
|
||||||
|
| Decision archaeology summary | `ucxl://council-arch:narrator@DistOS:decision-history/*^/narratives/council-fault-design-narrative.md` | Human-readable narrative of how fault tolerance design decisions evolved |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Decision Points
|
||||||
|
|
||||||
|
The following architectural questions constitute the major decision points council-fault must resolve and record as Decision Records.
|
||||||
|
|
||||||
|
**DP-FT-01: Failure Detection Algorithm**
|
||||||
|
Should DistOS use phi accrual, SWIM with piggybacking, or a hybrid combining both? Phi accrual provides graded suspicion levels useful for soft degradation; SWIM provides constant per-member probe load with O(log N) dissemination time. Key consideration: the target InfiniBand fabric has sub-microsecond latency, which may make SWIM's indirect probing unnecessary. The out-of-band BMC/IPMI path provides a ground-truth oracle — should this be used to disambiguate, or does it introduce additional complexity and failure modes?
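
For reference when weighing this decision, one SWIM probe round can be sketched as follows. The prober pings the target directly and, on failure, asks up to `k` other members to ping on its behalf before marking the target as suspect. The `can_reach` callback stands in for real network I/O, and all names here are illustrative.

```python
import random
from typing import Callable, Optional, Sequence


def swim_probe(self_id: str, target: str, members: Sequence[str],
               can_reach: Callable[[str, str], bool], k: int = 3,
               rng: Optional[random.Random] = None) -> str:
    """Return "alive" or "suspect" for `target` after one SWIM probe round."""
    rng = rng or random.Random()
    if can_reach(self_id, target):                 # direct ping succeeded
        return "alive"
    helpers = [m for m in members if m not in (self_id, target)]
    for helper in rng.sample(helpers, min(k, len(helpers))):
        # Indirect ping: the helper probes the target and relays the ack.
        if can_reach(self_id, helper) and can_reach(helper, target):
            return "alive"
    return "suspect"


members = ["a", "b", "t"]
broken_links = {("a", "t")}                        # only the a -> t path is down
link_ok = lambda src, dst: (src, dst) not in broken_links
via_helper = swim_probe("a", "t", members, link_ok)
dead_target = swim_probe("a", "t", members, lambda src, dst: dst != "t")
```

The first call shows what indirect probing buys: a single failed path does not condemn a healthy node. Whether that matters on the target fabric is the question this decision point has to answer.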
|
||||||
|
|
||||||
|
**DP-FT-02: Consensus Topology**
|
||||||
|
Should DistOS use a single global Raft cluster for all control plane decisions, or a two-tier topology with rack-local Raft groups coordinated by a meta-Raft group? Global Raft is simpler and easier to reason about but introduces a write bottleneck at scale and cross-rack latency on every commit. Hierarchical consensus improves write throughput and localises rack-level decisions but requires a reconfiguration protocol that correctly handles simultaneous rack and meta-group failures.
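
A quick sizing sketch of the two-tier option follows. It uses the 32-node rack size from the candidate answer in DP-FT-08; the choice of 5 voters per rack-local group and 5 meta-group voters is a hypothetical starting point. Meta-group voters are counted as separate replicas here.

```python
import math


def hierarchical_layout(nodes: int, rack_size: int,
                        voters_per_rack: int, meta_voters: int) -> dict:
    """Group counts and crash-fault tolerance for a two-tier Raft topology."""
    racks = math.ceil(nodes / rack_size)
    return {
        "rack_groups": racks,
        "rack_local_tolerance": (voters_per_rack - 1) // 2,  # voter losses survivable per rack group
        "meta_tolerance": (meta_voters - 1) // 2,            # voter losses survivable in the meta group
        "total_voters": racks * voters_per_rack + meta_voters,
    }


layout = hierarchical_layout(nodes=1024, rack_size=32, voters_per_rack=5, meta_voters=5)
print(layout)
```

If the meta-group voters are placed in distinct racks, the meta tolerance translates directly into the number of whole-rack failures the global tier can survive.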
|
||||||
|
|
||||||
|
**DP-FT-03: Byzantine Fault Tolerance Scope**
|
||||||
|
Should DistOS employ BFT consensus for any layer, or assume crash-fault-tolerant (CFT) suffices? The threat model for a private GPU cluster operated by a single organisation likely does not include Byzantine node behaviour, but it may include Byzantine agent behaviour (a compromised CHORUS agent providing false telemetry). The council must precisely define the Byzantine assumption boundary and determine whether HotStuff is warranted for the agent coordination layer even if the underlying node layer uses CFT Raft.
|
||||||
|
|
||||||
|
**DP-FT-04: Checkpoint Placement and Ownership**
|
||||||
|
Should checkpoints be written by the application (Megatron-LM style, application-cooperative), by a transparent OS-level mechanism (DMTCP style, application-agnostic), or by a hybrid that provides OS-level infrastructure but application-supplied hints? Application-cooperative checkpointing achieves optimal consistency points (end-of-iteration) but requires all workloads to be ported; transparent checkpointing is universal but captures unnecessary state and misses semantic consistency boundaries.
|
||||||
|
|
||||||
|
**DP-FT-05: Checkpoint Storage Durability**
|
||||||
|
Given that Weka provides erasure-coded parallel storage, does DistOS need additional checkpoint redundancy (e.g., copying checkpoints to a second Weka namespace or to object storage), or is single-namespace checkpoint storage sufficient? This depends on whether Weka's erasure coding tolerates the same failure modes that trigger the checkpoint recovery (e.g., if a rack failure simultaneously takes down Weka shards and the compute nodes being recovered, single-namespace checkpoints may be unreadable).
|
||||||
|
|
||||||
|
**DP-FT-06: Exactly-Once Semantics Boundary**
|
||||||
|
Exactly-once semantics are expensive. Which DistOS subsystems require exactly-once delivery (candidate: distributed lock service, scheduler event log, billing/accounting events) versus at-least-once delivery with idempotent consumers (candidate: telemetry, health heartbeats, log events)? The council must produce a formal classification that council-sec and council-telemetry can ratify.
|
||||||
|
|
||||||
|
**DP-FT-07: Hot vs Warm vs Cold Standby**
|
||||||
|
For each category of DistOS component (leader node, metadata server, scheduler replica), what is the appropriate standby tier? Hot standby (immediate failover, high resource cost) is justified for the consensus leader; cold recovery (restart from checkpoint, potentially minutes of downtime) may be acceptable for less critical components. The council must define RTO and RPO targets per component class and map these to standby tier selection.
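
The mapping from RTO target to standby tier could be recorded as a simple rule. The sketch below shows the shape of such a rule. The threshold values and the per-component RTO figures are placeholders for the council to replace, not proposed targets.

```python
def standby_tier(rto_seconds: float, hot_max_s: float = 5.0, warm_max_s: float = 60.0) -> str:
    """Pick the cheapest standby tier that can still meet the RTO (placeholder thresholds)."""
    if rto_seconds <= hot_max_s:
        return "hot"     # replica already running and in sync
    if rto_seconds <= warm_max_s:
        return "warm"    # replica provisioned, state restored on promotion
    return "cold"        # restart from checkpoint


# Placeholder RTO targets per component class, in seconds.
rto_targets = {"consensus-leader": 1.0, "metadata-server": 30.0, "scheduler-replica": 300.0}
tiers = {name: standby_tier(rto) for name, rto in rto_targets.items()}
print(tiers)
```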
|
||||||
|
|
||||||
|
**DP-FT-08: Failure Domain Granularity**
|
||||||
|
What failure domains does DistOS model, and what is the maximum tolerated simultaneous failure count per domain? Candidate answer: tolerate 1 rack (32 nodes) simultaneously failing without workload loss (checkpoints survive on other racks), tolerate up to 16 individual node failures per hour without RTO violation. The council must derive these numbers from MTTF data for the target hardware and the RTO/RPO targets.
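
Deriving the per-hour budget from MTTF data can be sketched with a Poisson model. The model assumes independent node failures at a constant rate, which ignores correlated events such as rack power loss. The per-node MTTF used below is a placeholder, not a measured value.

```python
import math


def expected_failures_per_hour(nodes: int, node_mttf_hours: float) -> float:
    """Mean node failures per hour across the cluster."""
    return nodes / node_mttf_hours


def prob_more_than(k: int, rate: float) -> float:
    """P(X > k) for X ~ Poisson(rate): chance of exceeding the failure budget in one hour."""
    cdf = sum(math.exp(-rate) * rate**i / math.factorial(i) for i in range(k + 1))
    return max(0.0, 1.0 - cdf)


rate = expected_failures_per_hour(nodes=1024, node_mttf_hours=10_240)  # placeholder MTTF
p_any = prob_more_than(0, rate)       # at least one failure in a given hour
p_over_budget = prob_more_than(16, rate)
print(f"rate={rate:.2f}/h, P(>=1)={p_any:.3f}, P(>16)={p_over_budget:.2e}")
```

Under independent failures the 16-per-hour figure is rarely approached, which suggests the budget is really a statement about correlated failures and should be justified from rack-level and fabric-level MTTF data.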
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Dependencies on Other Councils
|
||||||
|
|
||||||
|
### council-net (Network Stack)
|
||||||
|
Council-fault requires network topology information to configure failure detector parameters correctly. The phi accrual window must account for cross-rack latency variability. SWIM's indirect probing path must be aware of network partitions. Specifically needed:
|
||||||
|
|
||||||
|
- Link-level latency distribution (P50, P99, P99.9) for same-rack vs cross-rack communication
|
||||||
|
- Notification API for link-level failure events (to accelerate failure detection beyond gossip convergence time)
|
||||||
|
- Details of the RDMA fabric topology so failure domains align with physical network segments
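To show why the latency distribution matters to the detector, here is a minimal phi accrual sketch. It assumes heartbeat inter-arrival times are modelled as normally distributed; the function name, the `min_std` floor, and the sample intervals are illustrative assumptions rather than council-net or council-fault parameters.

```python
import math
from statistics import mean, pstdev


def phi(elapsed: float, intervals: list, min_std: float = 0.01) -> float:
    """Suspicion level after `elapsed` seconds without a heartbeat.

    phi = -log10(P(next heartbeat arrives later than `elapsed`)),
    with inter-arrival times modelled as a normal distribution.
    """
    mu = mean(intervals)
    sigma = max(pstdev(intervals), min_std)  # floor avoids division by zero
    p_later = 0.5 * math.erfc((elapsed - mu) / (sigma * math.sqrt(2)))
    p_later = max(p_later, 1e-300)  # avoid log10(0) for very late heartbeats
    return -math.log10(p_later)


same_rack = [1.00, 1.01, 0.99, 1.00]   # low jitter (illustrative samples)
cross_rack = [1.0, 1.2, 0.8, 1.0]      # higher jitter (illustrative samples)

phi_same = phi(1.3, same_rack)
phi_cross = phi(1.3, cross_rack)
```

For the same 1.3 s silence, the low-jitter link produces a far higher suspicion value than the high-jitter link. A single threshold therefore behaves differently per path unless the window is built from path-specific latency data.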
|
||||||
|
|
||||||
|
**Interface artifact:** `ucxl://council-net:lead-architect@DistOS:network/*^/integration/topology-api-for-council-fault.md`
|
||||||
|
|
||||||
|
### council-mem (Distributed Memory)
|
||||||
|
Checkpoint design depends critically on the memory model. Council-fault needs:
|
||||||
|
|
||||||
|
- Memory allocation primitives that can pin GPU memory buffers for checkpoint capture without disrupting running workloads
|
||||||
|
- Durability guarantees from the Weka FS integration layer: what consistency model applies to checkpoint writes?
|
||||||
|
- Information on NVLink/NVSwitch topology to understand whether GPU-to-GPU direct checkpoint transfers are viable without CPU involvement
|
||||||
|
|
||||||
|
**Interface artifact:** `ucxl://council-mem:lead-architect@DistOS:memory/*^/integration/checkpoint-primitives-for-council-fault.md`
|
||||||
|
|
||||||
|
### council-sched (Process Scheduling)
|
||||||
|
Failure recovery intersects with scheduling at multiple points. Council-fault needs:
|
||||||
|
|
||||||
|
- Workload evacuation notification API: when council-fault declares a node dead, council-sched must be notified to reschedule affected workloads
|
||||||
|
- Checkpoint placement hints: council-sched knows which nodes have GPU memory pressure and can advise checkpoint placement to avoid I/O bottlenecks
|
||||||
|
- Recovery priority policy: when multiple workloads are recovering simultaneously, council-sched provides the priority ordering
|
||||||
|
|
||||||
|
**Interface artifact:** `ucxl://council-sched:lead-architect@DistOS:scheduling/*^/integration/recovery-coordination-api.md`
|
||||||
|
|
||||||
|
### council-sec (Security Model)
|
||||||
|
Council-sec provides constraints on the consensus protocol design. Specifically:
|
||||||
|
|
||||||
|
- Cryptographic signing requirements for Raft log entries (are log entries signed, or is transport-level authentication sufficient?)
|
||||||
|
- Key management for hot standby nodes that must be pre-provisioned with credentials before a failure occurs
|
||||||
|
- Audit requirements for failover events (every leader election and node eviction must generate a signed audit record)
|
||||||
|
|
||||||
|
Council-sec is a consumer of council-fault's decisions (it must enforce security properties on fault-tolerant paths), but it is also a constraint provider.
|
||||||
|
|
||||||
|
### council-synth (Inter-Council Synthesis)
|
||||||
|
Any conflict between council-fault's requirements and other councils' designs is escalated to council-synth. Likely conflict areas:
|
||||||
|
|
||||||
|
- Checkpoint overhead vs scheduling latency: if checkpoint I/O degrades scheduling responsiveness, council-synth arbitrates
|
||||||
|
- Consensus quorum configuration vs network partition tolerance: council-net's topology design may make certain quorum configurations unreachable under valid partition scenarios
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. WHOOSH Configuration
|
||||||
|
|
||||||
|
### Team Formation
|
||||||
|
|
||||||
|
```json
{
  "council_id": "council-fault",
  "team_topic": "whoosh.team.distos-council-fault",
  "composition": {
    "lead-architect": 2,
    "failure-detection-specialist": 6,
    "consensus-engineer": 10,
    "byzantine-resilience-analyst": 6,
    "checkpoint-engineer": 10,
    "recovery-coordinator": 6,
    "delivery-semantics-specialist": 4,
    "partition-analyst": 4,
    "formal-spec-author": 6,
    "integration-liaison": 4,
    "research-surveyor": 2
  },
  "total_agents": 60,
  "quorum_policy": {
    "artifact_publication": "simple_majority",
    "architecture_decision": "two_thirds_supermajority",
    "formal_spec_ratification": "lead_architect_plus_two_thirds",
    "dependency_interface_agreement": "all_integration_liaisons_plus_one_lead"
  },
  "join_timeout_minutes": 30,
  "inactivity_eviction_minutes": 120
}
```
|
||||||
|
|
||||||
|
### Subchannels
|
||||||
|
|
||||||
|
| Subchannel | Topic Suffix | Purpose |
|-----------|-------------|---------|
| Control | `.control` | Role assignments, join/leave events, phase transitions |
| Research | `.research` | Paper sharing, survey coordination, literature discussion |
| Architecture | `.architecture` | Design proposals, trade-off debates, decision drafting |
| Formal Spec | `.formal-spec` | TLA+ review, invariant discussion, council-verify liaison |
| Integration | `.integration` | Cross-council interface negotiation, dependency tracking |
| Voting | `.voting` | Quorum votes on decision records and artifact ratification |
| Artifacts | `.artifacts` | UCXL artifact announcement references |
|
||||||
|
|
||||||
|
### Quorum Configuration
|
||||||
|
|
||||||
|
The consensus engineering team within council-fault operates under the same principles it is designing. A minimum of 2/3 of active agents must be reachable for architecture decisions to be recorded. If council-fault itself experiences a partition, the larger side continues working and the smaller side's work is merged after the partition heals (following the same eventual consistency model the council advocates for DistOS data paths where CP is not required). Architecture decisions can be recorded during the partition only by a side that still holds the 2/3 threshold.
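The interaction between the 2/3 rule and a partition can be sketched as follows. The function names are hypothetical, and the behaviour for two equally sized shards is not defined by the text, so the sketch does not address it.

```python
def has_two_thirds(reachable: int, active: int) -> bool:
    """True when at least 2/3 of active agents are reachable (integer arithmetic)."""
    return active > 0 and 3 * reachable >= 2 * active


def partition_outcome(shard_a: int, shard_b: int) -> dict:
    """Describe what the larger shard may do after a two-way partition."""
    active = shard_a + shard_b
    larger = max(shard_a, shard_b)
    return {
        "continuing_shard_size": larger,
        "can_record_architecture_decisions": has_two_thirds(larger, active),
    }


# With 60 active agents, 40 must be reachable to record architecture decisions.
split_45_15 = partition_outcome(45, 15)
split_35_25 = partition_outcome(35, 25)
```

A 45/15 split leaves the larger shard able to record architecture decisions. A 35/25 split lets the larger shard keep working, but neither side reaches the 2/3 threshold until the partition heals.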
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Success Criteria
|
||||||
|
|
||||||
|
1. **Failure detection coverage:** A formal characterisation of phi accrual or SWIM convergence time under the target InfiniBand fabric, with worst-case bounds proven analytically and validated against simulation
|
||||||
|
2. **Consensus specification completeness:** A TLA+ specification for the chosen consensus protocol that model-checks with no invariant violations under the TLC model checker for clusters up to N=7 (as a representative abstraction), with safety and liveness properties explicitly stated
|
||||||
|
3. **Checkpoint strategy viability:** A checkpoint overhead analysis showing that optimal checkpoint interval yields less than 5% throughput degradation on a representative transformer training workload
|
||||||
|
4. **Recovery time verification:** Formal proof that the recovery state machine terminates (liveness) and does not enter a state where a healthy subset of nodes is incorrectly marked failed (safety)
|
||||||
|
5. **Exactly-once classification:** A complete table classifying every DistOS internal message type as exactly-once, at-least-once, or best-effort, with rationale
|
||||||
|
6. **Interface contracts ratified:** All three dependency interface contracts (council-net, council-mem, council-sched) are signed off by both parties and published to UCXL
|
||||||
|
7. **Decision record completeness:** All 8 decision points have corresponding Decision Records with alternatives considered, evidence cited, and rationale documented
|
||||||
|
8. **Byzantine scope defined:** A written threat model defining the Byzantine assumption boundary, ratified by council-sec
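For criterion 3, one common first-order way to estimate the optimal checkpoint interval is Young's approximation, T = sqrt(2 · C · M), where C is the time to write one checkpoint and M is the mean time between failures. This section does not name the approximation, so its use here is an assumption, and the 60-second checkpoint cost and 24-hour MTBF are hypothetical inputs.

```python
import math


def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation: T = sqrt(2 * C * M)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def overhead_fraction(interval_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order lost-time fraction: checkpoint writes plus expected rework."""
    return checkpoint_cost_s / interval_s + interval_s / (2.0 * mtbf_s)


# Hypothetical inputs: 60 s per checkpoint, 24 h job-level MTBF.
interval = optimal_checkpoint_interval(60.0, 86_400.0)
overhead = overhead_fraction(interval, 60.0, 86_400.0)
```

With these inputs the interval is roughly 54 minutes and the estimated overhead is under 4%, inside the 5% target. At the optimum the overhead reduces to sqrt(2C/M), so in this model the target holds whenever C/M is below 0.00125.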
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 9. Timeline
|
||||||
|
|
||||||
|
### Phase 1: Research (Days 1-3)
|
||||||
|
- Day 1: Activate council, assign roles, distribute research domains; research-surveyors produce initial bibliography
|
||||||
|
- Day 2: Failure detection specialists produce phi accrual vs SWIM comparison; checkpoint engineers survey Megatron-LM and DMTCP checkpoint formats; consensus engineers survey Raft, Multi-Paxos, and HotStuff
|
||||||
|
- Day 3: All research artifacts published to UCXL; cross-domain synthesis discussion; research phase closes with a prioritised list of open questions entering the architecture phase
|
||||||
|
|
||||||
|
### Phase 2: Architecture (Days 3-6)
|
||||||
|
- Day 3-4: Failure detection design and consensus architecture drafted; initial decision record drafts for DP-FT-01 and DP-FT-02 circulated for comment
|
||||||
|
- Day 4-5: Checkpoint strategy and recovery state machine designed; integration liaisons begin negotiating interface contracts with council-net and council-mem; DR-FT-01 through DR-FT-04 voted on
|
||||||
|
- Day 5-6: Remaining decision records voted on; architecture artifacts finalised and published; architecture phase closes with all DR-FT-* in accepted or deferred state
|
||||||
|
|
||||||
|
### Phase 3: Formal Specification (Days 6-10)
|
||||||
|
- Day 6-7: Raft TLA+ specification and failure detector TLA+ specification drafted; council-verify liaison established
|
||||||
|
- Day 7-8: Checkpoint protocol specification and exactly-once semantics specification drafted; council-verify begins model checking on Raft spec
|
||||||
|
- Day 8-9: Recovery orchestration specification drafted; first model-checking results from council-verify inform spec refinements
|
||||||
|
- Day 9-10: All TLA+ specifications in review; formal spec phase closes with at least two specs passing model checking with no counterexamples
|
||||||
|
|
||||||
|
### Phase 4: Integration (Days 10-12)
|
||||||
|
- Day 10-11: Interface contracts with council-net, council-mem, and council-sched finalised; any conflicts escalated to council-synth
|
||||||
|
- Day 11-12: council-synth resolution of any conflicts; final integration artifacts published; cross-council review of security constraints from council-sec
|
||||||
|
|
||||||
|
### Phase 5: Documentation (Days 12-14)
|
||||||
|
- Day 12-13: Fault tolerance reference specification assembled from architecture and formal spec artifacts
|
||||||
|
- Day 13-14: Operator runbook written; council-arch produces decision archaeology narrative; final UCXL navigability audit confirms all artifact addresses resolve correctly
|
||||||
374
councils/05-security-model.md
Normal file
@@ -0,0 +1,374 @@
|
|||||||
|
# Council Design Brief: Security Model, Isolation, and Trust
|
||||||
|
|
||||||
|
**Council ID:** `council-sec`
|
||||||
|
**Mission:** Design the complete security architecture for DistOS — spanning hardware-rooted trust, capability-based access control, cryptographic identity, multi-tenant GPU isolation, zero-trust networking, and supply chain integrity — such that every resource access and inter-agent communication is authorised, attributable, and auditable, from silicon to application.
|
||||||
|
**UCXL Base Address:** `ucxl://council-sec:*@DistOS:security/*`
|
||||||
|
**Agent Count:** ~60
|
||||||
|
**Status:** Pre-formation (Constitution Phase)
|
||||||
|
**Created:** 2026-02-24
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Scope and Responsibilities
|
||||||
|
|
||||||
|
Council-sec owns the security model of DistOS in its entirety. This is unusual among the DistOS councils: rather than designing one isolated subsystem, council-sec defines constraints and primitives that every other council must satisfy. It operates both as a producer (defining security APIs, attestation protocols, and key management) and as an authority (issuing security requirements that other councils must incorporate). No other council may finalise an architectural decision that has security implications without ratification from council-sec.
|
||||||
|
|
||||||
|
The target environment is a 1024-node Hopper/Grace/Blackwell cluster that may host multiple tenants concurrently: research groups, internal teams, and potentially external customers via the KACHING licensing model. The security model must support this multi-tenancy without hardware-level repartitioning of the cluster for each tenant.
|
||||||
|
|
||||||
|
### In Scope
|
||||||
|
|
||||||
|
- Capability-based security model: capability taxonomy, attenuation, revocation, and ambient authority elimination
|
||||||
|
- Hardware-rooted trust chain: TPM 2.0 attestation, GPU attestation (NVIDIA Hopper/Blackwell Confidential Computing), secure boot verification
|
||||||
|
- Cryptographic identity: node identity, agent identity, workload identity; DID-based identity architecture; Ed25519 signing
|
||||||
|
- Zero-trust network model: mutual TLS everywhere, SPIFFE/SPIRE workload identity, no implicit trust based on network position
|
||||||
|
- Multi-tenant GPU isolation: hardware MIG (Multi-Instance GPU) partitioning, software memory isolation between tenant workloads sharing a GPU, confidential VM boundaries
|
||||||
|
- Secure enclave design for sensitive workload data: NVIDIA Hopper Confidential Computing, TEE integration
|
||||||
|
- RBAC and ABAC policy engine: role-based and attribute-based access control for DistOS resources
|
||||||
|
- Secrets management architecture: secret lifecycle, rotation, distribution to agents, Vault integration
|
||||||
|
- Audit logging: tamper-evident audit log, log retention policy, query interface
|
||||||
|
- Supply chain security: agent code integrity verification, container image signing, runtime attestation of CHORUS agents
|
||||||
|
- Secure communication protocols: channel encryption, replay prevention, message signing for all inter-agent messages
|
||||||
|
- Threat model: formal threat model document classifying adversaries and their capabilities
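One way to make the audit log in the list above tamper-evident is a hash chain, in which each entry commits to the hash of its predecessor. The sketch below is illustrative only: the entry layout, the `GENESIS` value, and the function names are assumptions, and a real design would also sign entries (for example with Ed25519, as listed under cryptographic identity).

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder predecessor hash for the first entry


def _digest(prev_hash: str, event: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the predecessor hash and event."""
    body = json.dumps({"prev": prev_hash, "event": event}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_entry(log: list, event: dict) -> None:
    """Append an event, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev_hash, "event": event, "hash": _digest(prev_hash, event)})


def verify(log: list) -> bool:
    """Recompute every hash; any edited, removed, or reordered entry breaks the chain."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash or entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


audit_log = []
append_entry(audit_log, {"type": "leader_election", "node": "n-004"})
append_entry(audit_log, {"type": "node_eviction", "node": "n-117"})
intact = verify(audit_log)

# Tampering with an earlier event invalidates the chain.
audit_log[0]["event"]["node"] = "n-999"
tampered_detected = not verify(audit_log)
```

A hash chain detects modification but cannot prevent truncation of the tail. Publishing or countersigning the latest hash periodically is the usual complement.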
|
||||||
|
|
||||||
|
### Out of Scope
|
||||||
|
|
||||||
|
- Specific network transport implementation (council-net owns this; council-sec specifies the security requirements that the transport must satisfy)
|
||||||
|
- Memory allocation algorithms (council-mem owns this; council-sec specifies isolation requirements)
|
||||||
|
- Fault recovery procedures (council-fault owns this; council-sec specifies that recovery paths must not bypass access controls)
|
||||||
|
- Physical security of the data centre nodes (assumed to be operated by resetdata.ai under a separate security programme)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Research Domains
|
||||||
|
|
||||||
|
### 2.1 Capability-Based Security
|
||||||
|
|
||||||
|
Object capabilities (ocaps) represent the cleanest theoretical foundation for a distributed OS security model because they unify designation and authorisation: holding a capability both names the target object and proves the holder's right to use it, so no separate identity check or access-control lookup is needed at the point of use. DistOS should adopt ocaps as its primary access control model, supplementing with RBAC/ABAC for human operators who cannot hold unforgeable references.
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- Dennis & Van Horn (1966), "Programming Semantics for Multiprogrammed Computations" — original capability concept; establishes that capabilities are unforgeable tokens conferring the right to perform an operation on a specific object
|
||||||
|
- Miller (2006), "Robust Composition: Towards a Unified Approach to Access Control and Concurrency Control" (PhD thesis) — the definitive modern treatment of object capabilities; defines the confinement property, the principle of least authority (POLA), and the E programming language's ocap model; DistOS's capability system should be evaluated against Miller's taxonomy
|
||||||
|
- Shapiro et al. (1999), "EROS: A Fast Capability System" — research capability OS for commodity hardware demonstrating that capability systems need not sacrifice performance; EROS achieves capability confinement with minimal TCB and provides the performance benchmark DistOS should target
|
||||||
|
- Watson et al. (2015), "CHERI: A Hybrid Capability-System Architecture for Scalable Software Compartmentalization" — CHERI extends the ISA with hardware capability registers, eliminating the software overhead of software capability systems; Arm Morello and future RISC-V implementations make CHERI practically relevant; DistOS should evaluate whether CHERI-capable CPUs are present in the target cluster or likely to be in a future revision
|
||||||
|
- Watson et al. (2010), "Capsicum: Practical Capabilities for UNIX" — retrofits capability semantics onto FreeBSD without requiring application rewrite; demonstrates that capability confinement is deployable in production UNIX-derived systems; DistOS should evaluate a Capsicum-inspired capability layer for POSIX compatibility
|
||||||
|
- Saltzer & Schroeder (1975), "The Protection of Information in Computer Systems" — foundational principles (economy of mechanism, fail-safe defaults, complete mediation, least privilege, separation of privilege, least common mechanism, open design, psychological acceptability); DistOS security design should explicitly cite which principles each mechanism satisfies
|
||||||
|
|
||||||
|
**Open questions for research phase:** Can DistOS provide capability attenuation (creating a weaker capability from a stronger one) efficiently across the RDMA fabric? How should capability revocation be implemented in a system with thousands of agents holding capabilities to the same resource?
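Attenuation, as used in the first question, means deriving a capability whose rights are a subset of the original's. The sketch below shows only that subset rule; the `Capability` type and the right names are hypothetical, and Python objects are not unforgeable, so this illustrates the semantics rather than an enforcement mechanism.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Illustrative capability: a resource name plus the rights it confers."""
    resource: str
    rights: frozenset

    def attenuate(self, keep) -> "Capability":
        """Return a weaker capability; rights can be dropped but never added."""
        requested = frozenset(keep)
        if not requested <= self.rights:
            raise PermissionError("attenuation cannot add rights")
        return Capability(self.resource, requested)

    def permits(self, right: str) -> bool:
        return right in self.rights


full = Capability("checkpoint-store", frozenset({"read", "write", "delete"}))
read_only = full.attenuate({"read"})

try:
    read_only.attenuate({"read", "write"})  # attempt to regain a dropped right
    amplification_blocked = False
except PermissionError:
    amplification_blocked = True
```

Because attenuation is purely local in this sketch, it needs no coordination. Revocation is the harder problem, since it must reach every holder of a derived capability.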
|
||||||
|
|
||||||
|
### 2.2 Hardware-Rooted Trust
|
||||||
|
|
||||||
|
In a multi-tenant GPU cluster, tenants cannot trust that the hypervisor or OS they are running on is uncompromised. Hardware attestation provides a root of trust that is independent of software state.
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- Trusted Computing Group (2019), TPM 2.0 Library Specification — TPM 2.0 provides measured boot, remote attestation, and sealed storage; DistOS nodes should use TPM 2.0 to attest to their firmware and software state before joining the cluster; each node's TPM endorsement key provides the hardware root of identity
|
||||||
|
- Coker et al. (2011), "Principles of Remote Attestation" — formal model for what attestation proves and what it does not; DistOS's attestation design must be grounded in this model to avoid the common mistake of conflating freshness with integrity
|
||||||
|
- Keylime (Schear et al. 2016, "Bootstrapping and Maintaining Trust in the Cloud"; originated at MIT Lincoln Laboratory) — open-source TPM-based remote attestation for cloud and cluster environments; DistOS should evaluate Keylime as the attestation infrastructure, particularly its continuous attestation (not just boot-time) capability that detects runtime tampering
|
||||||
|
- NVIDIA Confidential Computing (Hopper H100, Blackwell B100/B200) — NVIDIA's Confidential Computing architecture provides hardware-isolated GPU execution environments (Confidential VMs) where GPU memory is inaccessible to the host hypervisor or other tenants and CPU-GPU transfers are encrypted; this is directly relevant to DistOS's multi-tenant workload isolation; the attestation protocol for GPU TEEs is documented in NVIDIA's Hopper Architecture White Paper (2022)
|
||||||
|
- NVIDIA Hopper Architecture Technical Overview (2022) — describes the Hardware Security Module (HSM) embedded in H100, the device attestation certificate chain rooted in NVIDIA's Certificate Authority, and the RIM (Reference Integrity Manifest) comparison mechanism; DistOS must integrate with this attestation flow to verify GPU integrity before scheduling confidential workloads
|
||||||
|
- Intel TDX (Trust Domain Extensions) and AMD SEV-SNP — CPU-level confidential computing for x86 hosts; neither applies to the Grace CPU in the Grace-Hopper Superchip, which is Arm-based with NVIDIA's own security architecture, so these are relevant only if the cluster also contains x86 host nodes; the council should verify the actual TEE architecture of the Grace CPU in the target cluster
|
||||||
|
- Parno et al. (2010), "Bootstrapping Trust in Commodity Computers" — formal treatment of what a trusted boot sequence proves; DistOS should cite this when justifying its boot attestation design
|
||||||
|
|
||||||
|
**Open questions for research phase:** Does the Grace-Hopper Superchip's ARM-based Grace CPU support TrustZone or an equivalent TEE? How does NVIDIA's GPU attestation integrate with the TPM on the same node? Is there a unified attestation token that covers both CPU and GPU integrity?
|
||||||
|
|
||||||
|
### 2.3 Cryptographic Identity
|
||||||
|
|
||||||
|
Every entity in DistOS — nodes, agents, workloads, council members — needs a stable cryptographic identity that can be verified without a central authority.
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- W3C Decentralized Identifiers (DIDs) v1.0 (Sporny et al. 2022) — DIDs provide globally unique, cryptographically verifiable identifiers that do not require a centralised registry; DistOS should assign DIDs to nodes (anchored to their TPM endorsement key) and agents (anchored to their signing key); the DID document contains the public key and service endpoints
|
||||||
|
- Bernstein et al. (2011), "High-Speed High-Security Signatures" — Ed25519 is the recommended signing algorithm for DistOS: 128-bit security, 64-byte signatures, fast verification; all inter-agent messages, audit log entries, and artifact publications should be signed with Ed25519
|
||||||
|
- Boneh et al. (2001), "Short Signatures from the Weil Pairing" (BLS signatures) — BLS signatures support aggregation, making them suitable for the consensus layer where 1000 agents may need to sign a decision; a single aggregated signature is as compact as a single BLS signature regardless of signer count; HotStuff's use of BLS aggregation is directly relevant if council-fault selects HotStuff
|
||||||
|
- SPIFFE (Secure Production Identity Framework For Everyone) and SPIRE (SPIFFE Runtime Environment) — SPIFFE provides a standard for workload identity in multi-cloud and cluster environments; SPIRE implements SPIFFE with attestation-based identity issuance (TPM, Kubernetes, cloud provider attestation); DistOS should adopt SPIFFE SVIDs (X.509 or JWT) as the workload identity standard for inter-service mTLS; this aligns with the Google BeyondCorp model
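The size argument for aggregation in the BLS entry above can be made concrete with simple arithmetic. The 64-byte Ed25519 signature size comes from the text; the 96-byte BLS signature (BLS12-381 with signatures in G2) and the one-bit-per-signer bitmap used to record who signed are assumptions made for illustration.

```python
ED25519_SIGNATURE_BYTES = 64     # stated in the Ed25519 entry above
BLS_SIGNATURE_BYTES = 96         # assumption: BLS12-381, signatures in G2


def individual_signatures_bytes(signers: int) -> int:
    """Total size when every signer attaches its own Ed25519 signature."""
    return signers * ED25519_SIGNATURE_BYTES


def aggregated_signature_bytes(signers: int) -> int:
    """One aggregated BLS signature plus a bitmap identifying the signers."""
    bitmap_bytes = (signers + 7) // 8
    return BLS_SIGNATURE_BYTES + bitmap_bytes


per_signer_total = individual_signatures_bytes(1000)
aggregated_total = aggregated_signature_bytes(1000)
```

For 1000 signers this compares 64,000 bytes of individual signatures with a few hundred bytes for the aggregate. The saving is in bandwidth and storage; BLS verification is typically more expensive per operation than Ed25519 verification, which the council would need to weigh.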
|
||||||
|
|
||||||
|
**Open questions for research phase:** Should DistOS use a DID method that anchors identity to the local cluster registry (private, no external dependency) or to a public blockchain (tamper-evident but introduces external dependency)? How frequently should agent signing keys rotate, and what is the key rotation protocol during an active task?
|
||||||
|
|
||||||
|
### 2.4 Zero-Trust Networking
|
||||||
|
|
||||||
|
The zero-trust model eliminates implicit trust based on network position: a process on the internal network is not trusted more than one on the public internet. Every connection is authenticated, authorised, and encrypted.
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- Kindervag (2010), "No More Chewy Centers: Introducing the Zero Trust Model of Information Security" (Forrester Research) — original zero-trust concept; the core principle for DistOS is that network location grants no authority
|
||||||
|
- Ward & Beyer (2014), "BeyondCorp: A New Approach to Enterprise Security" — Google's production zero-trust implementation; device certificates, user certificates, and context-aware access proxy are the model for DistOS's control plane access
|
||||||
|
- National Institute of Standards and Technology Special Publication 800-207 (Rose et al. 2020), "Zero Trust Architecture" — formal NIST definition and deployment guidance; DistOS should adopt the micro-segmentation approach described in SP 800-207, with policy enforcement placed at the level of individual resources
|
||||||
|
- SPIFFE/SPIRE (see Section 2.3) — the workload identity infrastructure that makes zero-trust practical at the intra-cluster level; SPIRE's node attestor uses TPM evidence to issue SPIFFE SVIDs without requiring pre-provisioned secrets
|
||||||
|
|
||||||
|
### 2.5 Multi-Tenant GPU Isolation
|
||||||
|
|
||||||
|
The most technically novel aspect of DistOS security is providing strong isolation between tenants sharing the same physical GPU, or between a tenant's workload and the DistOS system processes.
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- NVIDIA Multi-Instance GPU (MIG) Technology (NVIDIA, 2020) — hardware partitioning of A100/H100/B100 GPUs into up to 7 isolated instances, each with dedicated compute engines, memory, L2 cache, and memory bandwidth; MIG provides hardware-enforced isolation between tenants but is coarse-grained (whole-MIG-instance allocation); DistOS must decide when to use MIG versus software-only isolation
|
||||||
|
- Naghibijouybari et al. (2018), "Rendered Insecure: GPU Side Channel Attacks are Practical" — shows that workloads co-located on one GPU can leak information to each other through contention on shared GPU resources; DistOS must evaluate whether MIG eliminates all cache and timing side channels or whether software countermeasures are additionally required
|
||||||
|
- NVIDIA Confidential Computing (Hopper/Blackwell) — Confidential VMs with GPU TEE; the GPU is bound to a specific VM, DMA transfers between that VM and the GPU are encrypted, and neither other VMs nor the host hypervisor can read the GPU's memory; the trade-offs are reduced performance (encryption overhead) and restrictions on combining MIG sharing with Confidential Computing, which the council should confirm against NVIDIA's current documentation
|
||||||
|
- Volos et al. (2018), "Graviton: Trusted Execution Environments on GPUs" (research prototype) — proposes a GPU TEE architecture predating NVIDIA's official Confidential Computing; useful background for understanding the design space
|
||||||
|
- Dettmers et al. (2022), "LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale" — introduces mixed-precision 8-bit inference for large transformers; relevant because quantised workloads change tensor memory layout and footprint, so DistOS's memory isolation must be quantisation-agnostic
|
||||||
|
|
||||||
|
**Open questions for research phase:** When a tenant uses MIG, is MIG-level isolation sufficient for DistOS's security model, or must Confidential Computing be mandated? What is the performance overhead of always-on Confidential Computing on H100/B100? Can two tenants share a single MIG instance using software isolation, or does that violate the security model?
|
||||||
|
|
||||||
|
### 2.6 Access Control: RBAC and ABAC
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- Sandhu et al. (1996), "Role-Based Access Control Models" — four RBAC models (RBAC0 through RBAC3); DistOS should implement RBAC3 (hierarchical roles with constraints) for human operators; role hierarchy maps to organisational structure (admin, operator, researcher, auditor)
|
||||||
|
- Hu et al. (2014), NIST Special Publication 800-162, "Guide to Attribute Based Access Control (ABAC)" — ABAC enables fine-grained, context-sensitive policies that RBAC alone cannot express; for example, "a researcher may read checkpoints only for workloads they own and only from nodes in their allocated partition"; DistOS should implement ABAC for programmatic access (agent-to-resource) and RBAC for human-to-system access
|
||||||
|
- Open Policy Agent (OPA, a CNCF project) — Rego policy language for ABAC; production deployments in Kubernetes, Istio, and Terraform; DistOS should evaluate OPA as the policy engine for runtime access decisions
|
||||||
|
- XACML (eXtensible Access Control Markup Language, OASIS 2013) — formal ABAC policy language with request/response model; heavier than OPA but formally specified; relevant if DistOS requires formal policy verification
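The ABAC example quoted above ("a researcher may read checkpoints only for workloads they own and only from nodes in their allocated partition") can be written as a predicate over request attributes. The attribute names below are hypothetical; a production policy would more likely be expressed in a policy language such as Rego than in application code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckpointRequest:
    """Attributes of one access request (names are illustrative)."""
    subject: str
    subject_role: str
    subject_partition: str
    action: str
    workload_owner: str
    node_partition: str


def researcher_may_read_checkpoint(req: CheckpointRequest) -> bool:
    """Allow only reads of the subject's own workload from its own partition."""
    return (
        req.subject_role == "researcher"
        and req.action == "read"
        and req.workload_owner == req.subject
        and req.node_partition == req.subject_partition
    )


own_read = CheckpointRequest("alice", "researcher", "p1", "read", "alice", "p1")
foreign_workload = CheckpointRequest("alice", "researcher", "p1", "read", "bob", "p1")
wrong_partition = CheckpointRequest("alice", "researcher", "p1", "read", "alice", "p2")

decisions = [
    researcher_may_read_checkpoint(own_read),
    researcher_may_read_checkpoint(foreign_workload),
    researcher_may_read_checkpoint(wrong_partition),
]
```

The ownership and partition conditions depend on attributes of the resource and the request context, which is what a role check alone cannot express.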
|
||||||
|
|
||||||
|
### 2.7 Secrets Management
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- HashiCorp Vault — production secrets management with dynamic secrets (short-lived credentials issued on demand), secret leasing and renewal, and multiple authentication methods including Kubernetes, TLS certificates, and JWT/OIDC; TPM-backed authentication would need a plugin or an attestation-derived certificate and should be evaluated separately; DistOS should adopt Vault or an equivalent for all secret distribution; the key architectural principle is that agents never hold long-lived credentials
|
||||||
|
- Bellovin & Cheswick (1994), "Network Firewalls" — historical context for why perimeter-based secrets management (shared network secrets) is insufficient; motivates per-workload dynamic credentials
|
||||||
|
- Saltzer (1974), "Protection and the Control of Information Sharing in Multics" — the original minimal TCB argument; DistOS's secrets management TCB should be as small as possible and formally specifiable
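The principle that agents never hold long-lived credentials can be illustrated with a lease that expires unless renewed. This is a conceptual sketch, not Vault's API: the `Lease` type, its fields, and the 300-second TTL are hypothetical, and time is passed in explicitly so the behaviour is deterministic.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Lease:
    """Illustrative short-lived credential lease."""
    secret_id: str
    issued_at: float
    ttl_seconds: float

    def expires_at(self) -> float:
        return self.issued_at + self.ttl_seconds

    def is_valid(self, now: float) -> bool:
        return self.issued_at <= now < self.expires_at()

    def renew(self, now: float) -> "Lease":
        """Extend a still-valid lease; an expired lease must be re-requested."""
        if not self.is_valid(now):
            raise ValueError("expired lease cannot be renewed; request a new credential")
        return replace(self, issued_at=now)


lease = Lease("db-cred", issued_at=0.0, ttl_seconds=300.0)
valid_before_expiry = lease.is_valid(299.0)
valid_at_expiry = lease.is_valid(300.0)
renewed = lease.renew(200.0)
renewed_valid_later = renewed.is_valid(450.0)
```

A compromised agent then holds a credential that stops working within one TTL unless the issuer agrees to renew it, which bounds the damage window without requiring explicit revocation.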
|
||||||
|
|
||||||
|
### 2.8 Supply Chain Security
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- SLSA (Supply-chain Levels for Software Artifacts) Framework (Google, 2021) — graduated supply chain security model (Levels 1-4 in the original release); DistOS should require SLSA Level 3 or higher for all CHORUS agent containers: signed provenance that the build itself cannot forge, generated by a hardened and verified build platform
|
||||||
|
- Sigstore (Linux Foundation project, launched 2021; described in Newman et al. 2022, "Sigstore: Software Signing for Everybody") — keyless signing infrastructure for container images and artifacts using OIDC identity and certificate transparency; DistOS container registry should enforce Sigstore signatures on all agent images
|
||||||
|
- in-toto (Torres-Arias et al. 2019), "in-toto: Providing farm-to-table guarantees for bits and bytes" — end-to-end supply chain verification linking source code to deployed artifact through a verifiable chain of custody; DistOS should use in-toto layouts to verify that a running agent binary was built from the expected source commit
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Agent Roles
|
||||||
|
|
||||||
|
| Role | Count | Responsibilities |
|------|-------|-----------------|
| `lead-architect` | 2 | Security model coherence, final ratification authority, liaison to all other councils |
| `capability-system-designer` | 6 | Object capability taxonomy, attenuation protocol, revocation mechanism, POLA enforcement |
| `attestation-engineer` | 8 | TPM integration, GPU attestation, Keylime deployment design, attestation protocol specification |
| `cryptographic-identity-specialist` | 6 | DID architecture, Ed25519 key management, BLS aggregation for consensus, SPIFFE/SPIRE integration |
| `network-security-engineer` | 6 | mTLS everywhere, SPIFFE SVID issuance, zero-trust policy design, council-net liaison |
| `gpu-isolation-specialist` | 6 | MIG policy design, Confidential Computing integration, side-channel analysis, multi-tenant boundary verification |
| `access-control-engineer` | 6 | RBAC/ABAC policy design, OPA integration, policy language selection, role hierarchy |
| `secrets-manager` | 4 | Vault architecture, dynamic secret design, credential rotation, agent secret distribution |
| `audit-engineer` | 4 | Tamper-evident audit log design, log retention policy, UCXL-addressed audit query interface |
| `supply-chain-security-specialist` | 4 | SLSA compliance, Sigstore integration, in-toto layout authoring, container image signing |
| `threat-model-analyst` | 4 | Formal threat model, adversary capability classification, attack tree analysis, red team scenario authoring |
| `formal-spec-author` | 4 | TLA+ specifications for capability and attestation state machines, interface with council-verify |
|
||||||
|
|
||||||
|
**Total: 60 agents**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Key Deliverables
|
||||||
|
|
||||||
|
### Phase 1: Research (Days 1-3)
|
||||||
|
|
||||||
|
| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| Capability system survey | `ucxl://council-sec:researcher@DistOS:security/*^/research/capability-system-survey.md` | Analysis of EROS, Capsicum, CHERI, and E-language approaches; applicability to DistOS |
| GPU attestation survey | `ucxl://council-sec:researcher@DistOS:security/*^/research/gpu-attestation-survey.md` | NVIDIA Confidential Computing, Hopper attestation protocol, MIG isolation boundaries |
| Zero-trust network survey | `ucxl://council-sec:researcher@DistOS:security/*^/research/zero-trust-network-survey.md` | SPIFFE/SPIRE, BeyondCorp, NIST ZTA model applicability |
| Multi-tenant isolation survey | `ucxl://council-sec:researcher@DistOS:security/*^/research/multi-tenant-isolation-survey.md` | MIG vs Confidential Computing trade-off analysis, side-channel taxonomy |
| Threat model draft v0 | `ucxl://council-sec:threat-model-analyst@DistOS:security/*^/research/threat-model-v0.md` | Initial adversary classification and attack surface enumeration |
|
||||||
|
|
||||||
|
### Phase 2: Architecture (Days 3-6)
|
||||||
|
|
||||||
|
| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| Capability system design | `ucxl://council-sec:capability-system-designer@DistOS:security/*^/architecture/capability-system-design.md` | Capability representation, attenuation protocol, revocation model |
| Trust chain architecture | `ucxl://council-sec:attestation-engineer@DistOS:security/*^/architecture/trust-chain-architecture.md` | Boot-to-runtime trust chain, TPM+GPU attestation integration |
| Cryptographic identity design | `ucxl://council-sec:cryptographic-identity-specialist@DistOS:security/*^/architecture/cryptographic-identity-design.md` | DID method, SPIFFE integration, key lifecycle |
| Zero-trust network design | `ucxl://council-sec:network-security-engineer@DistOS:security/*^/architecture/zero-trust-network-design.md` | mTLS architecture, SVID issuance policy, network microsegmentation |
| GPU isolation policy | `ucxl://council-sec:gpu-isolation-specialist@DistOS:security/*^/architecture/gpu-isolation-policy.md` | MIG allocation rules, Confidential Computing mandatory use cases |
| RBAC/ABAC policy design | `ucxl://council-sec:access-control-engineer@DistOS:security/*^/architecture/rbac-abac-policy-design.md` | Role hierarchy, OPA policy language, attribute schema |
| Threat model v1 (ratified) | `ucxl://council-sec:threat-model-analyst@DistOS:security/*^/architecture/threat-model-v1.md` | Ratified threat model; distributed to all councils as a constraint document |
| DR-SEC-001: Capability system selection | `ucxl://council-sec:lead-architect@DistOS:security/*^/decisions/DR-SEC-001-capability-system.md` | Decision record: ocap vs RBAC as primary model |
| DR-SEC-002: GPU isolation boundary | `ucxl://council-sec:lead-architect@DistOS:security/*^/decisions/DR-SEC-002-gpu-isolation-boundary.md` | Decision record: MIG vs Confidential Computing as the primary isolation mechanism |
| DR-SEC-003: Identity architecture | `ucxl://council-sec:lead-architect@DistOS:security/*^/decisions/DR-SEC-003-identity-architecture.md` | Decision record: DID method selection and SPIFFE integration |
|
||||||
|
|
||||||
|
### Phase 3: Formal Specification (Days 6-10)
|
||||||
|
|
||||||
|
| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| Capability system TLA+ specification | `ucxl://council-sec:formal-spec-author@DistOS:security/*^/specs/capability-system.tla` | Invariant: no capability grants access beyond its attenuation bound; revocation terminates all derived capabilities |
| Attestation protocol TLA+ specification | `ucxl://council-sec:formal-spec-author@DistOS:security/*^/specs/attestation-protocol.tla` | Invariant: no unattested node joins the cluster; attestation freshness property |
| Access control TLA+ specification | `ucxl://council-sec:formal-spec-author@DistOS:security/*^/specs/access-control.tla` | RBAC/ABAC safety: no agent accesses a resource without a valid policy authorisation |
| Audit log integrity specification | `ucxl://council-sec:formal-spec-author@DistOS:security/*^/specs/audit-log-integrity.tla` | Append-only invariant; tamper detection property |
| Security requirements matrix | `ucxl://council-sec:lead-architect@DistOS:security/*^/specs/security-requirements-matrix.md` | Table of security requirements per subsystem, distributed to all councils |
|
||||||
|
|
||||||
|
### Phase 4: Integration (Days 10-12)
|
||||||
|
|
||||||
|
| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| Security constraints for council-net | `ucxl://council-sec:network-security-engineer@DistOS:security/*^/integration/constraints-for-council-net.md` | mTLS requirements, SVID validation, traffic encryption standards |
| Security constraints for council-mem | `ucxl://council-sec:gpu-isolation-specialist@DistOS:security/*^/integration/constraints-for-council-mem.md` | Memory isolation requirements, confidential memory handling |
| Security constraints for council-fault | `ucxl://council-sec:lead-architect@DistOS:security/*^/integration/constraints-for-council-fault.md` | Audit requirements for failover, key management for hot standbys |
| Security constraints for council-sched | `ucxl://council-sec:access-control-engineer@DistOS:security/*^/integration/constraints-for-council-sched.md` | Authorisation for workload placement decisions |
| Security constraints for council-telemetry | `ucxl://council-sec:audit-engineer@DistOS:security/*^/integration/constraints-for-council-telemetry.md` | Audit log integration, telemetry data classification |
| KACHING security integration | `ucxl://council-sec:lead-architect@DistOS:security/*^/integration/kaching-security-integration.md` | Security model for multi-tenant billing and licensing via KACHING |
|
||||||
|
|
||||||
|
### Phase 5: Documentation (Days 12-14)
|
||||||
|
|
||||||
|
| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| Security model reference specification | `ucxl://council-sec:lead-architect@DistOS:security/*^/docs/security-model-reference-spec.md` | Complete security model, normative for all other councils |
| Operator security guide | `ucxl://council-sec:access-control-engineer@DistOS:security/*^/docs/operator-security-guide.md` | Role provisioning, key rotation procedures, incident response |
| Tenant security guide | `ucxl://council-sec:gpu-isolation-specialist@DistOS:security/*^/docs/tenant-security-guide.md` | What tenants can and cannot rely on from DistOS's isolation guarantees |
| Decision archaeology summary | `ucxl://council-arch:narrator@DistOS:decision-history/*^/narratives/council-sec-design-narrative.md` | Human-readable narrative of security architecture decision history |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Decision Points
|
||||||
|
|
||||||
|
**DP-SEC-01: Primary Access Control Paradigm**
|
||||||
|
Should DistOS adopt object capabilities as the primary access control model, or rely on RBAC/ABAC as the foundation? Object capabilities are more powerful (they enable the principle of least authority at fine granularity and eliminate ambient authority) but require a fundamentally different programming model that may not compose with POSIX APIs. RBAC/ABAC is familiar to operators and integrates with existing tooling but cannot prevent confused deputy attacks without additional mechanism. A hybrid model (ocap for inter-agent communication, RBAC for human operator access, ABAC for resource access policies) may be the practical resolution.
|
||||||
|
|
||||||
|
**DP-SEC-02: GPU Isolation Mandate**
|
||||||
|
Should Confidential Computing be mandatory for all tenant workloads on DistOS, or only for workloads that explicitly request it? Mandatory Confidential Computing gives every workload the same hardware-enforced confidentiality baseline and simplifies the isolation proof, although it does not by itself rule out every side channel, and it incurs encryption overhead (estimated 5–10% performance penalty on H100 per NVIDIA's published benchmarks) on all workloads, including those where confidentiality is not required. Optional Confidential Computing preserves performance for non-sensitive workloads but requires the security model to specify precisely what isolation guarantees apply to non-confidential workloads sharing a GPU.
|
||||||
|
|
||||||
|
**DP-SEC-03: Identity Architecture Anchoring**
|
||||||
|
Should node and agent identity DIDs be anchored to the local cluster registry, a private distributed ledger, or a public blockchain? Local registry is simple and does not introduce external dependencies, but identity certificates cannot be verified by external parties without access to the registry. A private distributed ledger (e.g., Hyperledger Fabric) provides tamper evidence without external dependencies. A public blockchain (e.g., Ethereum ENS) enables cross-cluster identity verification but introduces gas costs and external censorship risk.
|
||||||
|
|
||||||
|
**DP-SEC-04: Attestation Freshness Policy**
|
||||||
|
TPM attestation at boot time provides evidence that a node started in a known-good state, but it says nothing about what happens afterwards. How frequently should DistOS re-attest nodes to detect runtime tampering? Continuous attestation (Keylime's model) provides the strongest guarantee but repeatedly consumes TPM quote generation and verifier resources. Periodic re-attestation (e.g., hourly) is a common compromise. Attestation on workload scheduling (attest before placing any new workload on a node) ensures each placement decision is based on a fresh integrity check. The council must define the re-attestation frequency and the consequences of failed re-attestation.
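
The "attest before placement" option can be illustrated with a small freshness gate. This is a minimal sketch, not a proposed design: the `AttestationRecord` type, the `may_place_workload` function, and the one-hour default window are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttestationRecord:
    """Hypothetical summary of a node's most recent attestation."""
    node_id: str
    verified_at: float  # seconds since epoch when the last quote was verified
    passed: bool        # outcome of that verification


def may_place_workload(record: AttestationRecord, now: float,
                       max_age_s: float = 3600.0) -> bool:
    """Allow placement only if the last attestation passed and is fresh.

    max_age_s is an assumed policy parameter (hourly re-attestation here).
    A negative age (record timestamped in the future) is rejected as well.
    """
    if not record.passed:
        return False
    age = now - record.verified_at
    return 0.0 <= age <= max_age_s


fresh = AttestationRecord("node-0001", verified_at=1_000.0, passed=True)
print(may_place_workload(fresh, now=1_600.0))  # True: within the window
print(may_place_workload(fresh, now=5_000.0))  # False: attestation is stale
```

The gate only decides whether placement may proceed; what happens to workloads already running on a node that fails re-attestation is the separate consequence policy the council still has to define.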
|
||||||
|
|
||||||
|
**DP-SEC-05: Capability Revocation Mechanism**
|
||||||
|
Capability revocation in a distributed system is hard: there is no central authority that can atomically revoke a capability held by 1000 agents simultaneously. Approaches include: revocable forwarding capabilities (an intermediate object that can be zeroed out), certificate revocation lists with short validity windows, OCSP stapling, and content-addressed capabilities where the capability hash is checked against a revocation set on each use. Each approach makes different trade-offs between revocation latency, check overhead, and system complexity.
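
Two of the approaches above, short validity windows and a revocation set checked by capability hash on each use, can be combined. The following is a minimal sketch under stated assumptions: the `Capability` fields, the digest encoding, and the `is_usable` function are hypothetical, and the revocation set is modelled as a local Python set rather than a replicated structure.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Capability:
    """Hypothetical content-addressed capability."""
    holder: str
    resource: str
    rights: frozenset
    expires_at: float  # short validity window bounds revocation latency

    def digest(self) -> str:
        # Canonical encoding (assumed): fields joined in a fixed order.
        payload = "|".join([
            self.holder,
            self.resource,
            ",".join(sorted(self.rights)),
            repr(self.expires_at),
        ])
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_usable(cap: Capability, right: str, now: float, revoked: set) -> bool:
    """Check expiry, then the revocation set, then the requested right."""
    if now >= cap.expires_at:
        return False
    if cap.digest() in revoked:
        return False
    return right in cap.rights


cap = Capability("agent-7", "gpu:node-0001", frozenset({"read"}), expires_at=2_000.0)
revoked_digests = set()
print(is_usable(cap, "read", now=1_000.0, revoked=revoked_digests))
revoked_digests.add(cap.digest())
print(is_usable(cap, "read", now=1_000.0, revoked=revoked_digests))
```

In this sketch the per-use cost is one hash and one set lookup, and an entry can be dropped from the revocation set once the capability's expiry has passed, which is the trade-off between check overhead and revocation latency that the decision point describes.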
|
||||||
|
|
||||||
|
**DP-SEC-06: Supply Chain Trust Anchor**
|
||||||
|
What is the root of trust for the CHORUS agent code supply chain? Options: (a) DistOS operator signs all deployed agent images (centralised, strong, single point of trust); (b) each agent image must carry an in-toto layout signed by the agent's developer plus a separate counter-signature by the DistOS operator (two-party authorisation); (c) Sigstore keyless signing using OIDC from the GITEA CI/CD pipeline (ties identity to GITEA account, reduces signing key management burden). The council must define what constitutes a valid supply chain proof and what happens when an agent cannot produce one.
|
||||||
|
|
||||||
|
**DP-SEC-07: Audit Log Architecture**
|
||||||
|
Should DistOS maintain a single authoritative audit log (simple, single point of failure and compromise) or a distributed append-only audit log replicated across multiple nodes (resilient, complex)? The audit log is itself a security-critical component: if an attacker can modify the audit log, they can erase evidence of their actions. The log must be tamper-evident (e.g., using a Merkle tree or hash chain) and durable (writes must survive the failure scenarios defined by council-fault). The council must also define who can query the audit log and with what access controls.
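
The hash-chain form of tamper evidence mentioned above can be sketched in a few lines. This is an illustrative sketch only: the entry layout, the `GENESIS` constant, and the `append` and `verify` helpers are hypothetical, and the log is an in-memory list rather than a replicated store.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder hash preceding the first entry


def entry_hash(prev_hash: str, record: dict) -> str:
    """Hash the previous entry's hash together with a canonical record encoding."""
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: list, record: dict) -> None:
    """Append a record, linking it to the current head of the chain."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"record": record, "prev": prev, "hash": entry_hash(prev, record)})


def verify(log: list) -> bool:
    """Recompute every link; any edited, reordered, or removed interior entry fails."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != entry_hash(prev, entry["record"]):
            return False
        prev = entry["hash"]
    return True


audit_log = []
append(audit_log, {"actor": "agent-7", "action": "grant", "target": "cap-123"})
append(audit_log, {"actor": "agent-9", "action": "revoke", "target": "cap-123"})
print(verify(audit_log))
```

A chain like this detects modification of existing entries but not truncation of the tail, so the current head hash would also need to be published or replicated somewhere the attacker cannot rewrite; that requirement is part of what the single-log versus distributed-log choice has to address.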
|
||||||
|
|
||||||
|
**DP-SEC-08: Secrets Distribution Model**
|
||||||
|
How are secrets distributed to agents at runtime? Options: (a) push model with Vault agent sidecars injecting secrets at workload startup; (b) pull model where agents authenticate to Vault using their SPIFFE SVID and fetch secrets on demand; (c) sealed secrets baked into attestation-decryptable blobs that can only be read after successful TPM+GPU attestation. The push model is simple but requires the orchestrator to hold secrets momentarily; the pull model is cleaner but adds latency to every secret access; the sealed secret model is the most secure but requires TEE integration for every workload that handles secrets.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Dependencies on Other Councils
|
||||||
|
|
||||||
|
Council-sec has a unique dependency relationship: it is primarily a constraint provider to other councils. Every other council must consult council-sec before finalising any architectural decision with security implications.
|
||||||
|
|
||||||
|
### council-net (Network Stack) — Bidirectional
|
||||||
|
Council-sec specifies the security requirements for inter-node and inter-agent communication. Council-net must implement mTLS between all DistOS components, validate SPIFFE SVIDs on every connection, and expose link-level failure notifications without revealing information that could aid traffic analysis. Council-net must not make performance optimisations that bypass encryption or authentication without explicit council-sec approval and a corresponding Decision Record.
|
||||||
|
|
||||||
|
**Dependency artifact:** `ucxl://council-sec:network-security-engineer@DistOS:security/*^/integration/constraints-for-council-net.md`
|
||||||
|
|
||||||
|
### council-mem (Distributed Memory) — Constraint Direction
|
||||||
|
Council-mem must implement the memory isolation primitives that council-sec's GPU isolation policy requires. Specifically: tenant memory pages must not be readable by other tenants at any point in their lifecycle, including after deallocation; CUDA unified memory must track ownership and enforce capability checks before cross-tenant pointer dereferences; Weka FS namespaces must be enforced at the kernel level so that one tenant's checkpoint data cannot be addressed by another tenant's process.
|
||||||
|
|
||||||
|
**Dependency artifact:** `ucxl://council-sec:gpu-isolation-specialist@DistOS:security/*^/integration/constraints-for-council-mem.md`
|
||||||
|
|
||||||
|
### council-fault (Fault Tolerance) — Constraint Direction
|
||||||
|
Fault recovery paths are historically a rich source of security vulnerabilities: standby nodes that bypass normal authentication, recovery operations that grant temporary elevated privileges, or checkpoints that carry credentials that expire during recovery. Council-fault must incorporate council-sec's requirements before finalising the recovery state machine. Specifically: standby node promotion must perform full attestation before accepting traffic; recovery operations must not grant capabilities beyond what the original node held; credential rotation must be coordinated with recovery so that recovering nodes receive fresh credentials.
|
||||||
|
|
||||||
|
**Dependency artifact:** `ucxl://council-sec:lead-architect@DistOS:security/*^/integration/constraints-for-council-fault.md`
|
||||||
|
|
||||||
|
### council-sched (Process Scheduling) — Constraint Direction
|
||||||
|
Scheduling decisions have security implications. Unsafe outcomes include placing a tenant's workload on a node with a failed attestation, co-locating competing tenants on the same MIG instance, and allowing a workload to acquire scheduling priority beyond its authorised quota. Council-sched must validate that every placement decision is authorised by council-sec's ABAC policy before executing it.
|
||||||
|
|
||||||
|
**Dependency artifact:** `ucxl://council-sec:access-control-engineer@DistOS:security/*^/integration/constraints-for-council-sched.md`
|
||||||
|
|
||||||
|
### council-telemetry (Resource Accounting) — Bidirectional
|
||||||
|
Council-telemetry collects resource usage data that includes sensitive information (which tenants are running what workloads, when, and at what cost). This telemetry data must be classified and protected by council-sec's data classification policy. Conversely, council-telemetry's audit log integration depends on council-sec defining the audit log architecture and the access control policy for log queries.
|
||||||
|
|
||||||
|
**Dependency artifact:** `ucxl://council-sec:audit-engineer@DistOS:security/*^/integration/constraints-for-council-telemetry.md`
|
||||||
|
|
||||||
|
### council-synth (Inter-Council Synthesis)
|
||||||
|
Any conflict between council-sec's security requirements and another council's performance or functionality requirements is escalated to council-synth for formal resolution. Council-sec has a strong voice in these resolutions: a performance optimisation that compromises a security invariant is not an acceptable trade-off without council-sec and council-synth joint sign-off and a corresponding Decision Record documenting the risk acceptance.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. WHOOSH Configuration
|
||||||
|
|
||||||
|
### Team Formation
|
||||||
|
|
||||||
|
```json
{
  "council_id": "council-sec",
  "team_topic": "whoosh.team.distos-council-sec",
  "composition": {
    "lead-architect": 2,
    "capability-system-designer": 6,
    "attestation-engineer": 8,
    "cryptographic-identity-specialist": 6,
    "network-security-engineer": 6,
    "gpu-isolation-specialist": 6,
    "access-control-engineer": 6,
    "secrets-manager": 4,
    "audit-engineer": 4,
    "supply-chain-security-specialist": 4,
    "threat-model-analyst": 4,
    "formal-spec-author": 4
  },
  "total_agents": 60,
  "quorum_policy": {
    "artifact_publication": "simple_majority",
    "security_requirement_issuance": "two_thirds_supermajority",
    "threat_model_ratification": "all_threat_model_analysts_plus_both_lead_architects",
    "constraint_document_publication": "lead_architect_plus_two_thirds",
    "formal_spec_ratification": "formal_spec_authors_plus_one_lead"
  },
  "join_timeout_minutes": 30,
  "inactivity_eviction_minutes": 120,
  "special_policy": "constraint_documents_require_all_councils_acknowledgement_before_phase_3_begins"
}
```
|
||||||
|
|
||||||
|
### Subchannels
|
||||||
|
|
||||||
|
| Subchannel | Topic Suffix | Purpose |
|-----------|-------------|---------|
| Control | `.control` | Role assignments, join/leave events, phase transitions |
| Research | `.research` | Literature survey coordination, threat modelling, design space exploration |
| Threat-Model | `.threat-model` | Adversary definition, attack tree construction, risk classification |
| Architecture | `.architecture` | Security design proposals, capability taxonomy, isolation boundary debate |
| Formal-Spec | `.formal-spec` | TLA+ review, invariant discussion, council-verify liaison |
| Inter-Council | `.inter-council` | Constraint document drafting, cross-council security negotiation |
| Voting | `.voting` | Quorum votes on decision records, security requirement issuance, threat model ratification |
| Artifacts | `.artifacts` | UCXL artifact announcement references |
|
||||||
|
|
||||||
|
### Quorum Configuration
|
||||||
|
|
||||||
|
Security decisions require a higher voting threshold than in most councils because a security invariant, once broken in the specification, is difficult to retrofit. The two-thirds supermajority quorum for security requirement issuance ensures that no small subset of agents can weaken the security model. Threat model ratification requires unanimous agreement from all threat model analysts and both lead architects because the threat model is the foundation on which all other security decisions rest; a disagreement here must be resolved, not overridden.
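
As a worked example of what these thresholds mean for a 60-agent council, the sketch below derives vote counts from the composition in the team formation config. It assumes thresholds are computed over the full membership rather than over votes cast, which the configuration does not state; the function names are hypothetical.

```python
# Composition copied from the team formation config above.
COMPOSITION = {
    "lead-architect": 2,
    "capability-system-designer": 6,
    "attestation-engineer": 8,
    "cryptographic-identity-specialist": 6,
    "network-security-engineer": 6,
    "gpu-isolation-specialist": 6,
    "access-control-engineer": 6,
    "secrets-manager": 4,
    "audit-engineer": 4,
    "supply-chain-security-specialist": 4,
    "threat-model-analyst": 4,
    "formal-spec-author": 4,
}


def simple_majority(members: int) -> int:
    """Smallest vote count strictly greater than half the membership."""
    return members // 2 + 1


def two_thirds_supermajority(members: int) -> int:
    """Smallest vote count that is at least two thirds of the membership."""
    return (2 * members + 2) // 3  # integer ceiling of 2n/3


total = sum(COMPOSITION.values())
ratifiers = COMPOSITION["threat-model-analyst"] + COMPOSITION["lead-architect"]

print(total)                            # 60
print(simple_majority(total))           # 31
print(two_thirds_supermajority(total))  # 40
print(ratifiers)                        # 6, all of whom must agree
```

Under this reading, blocking a security requirement takes 21 dissenting or absent agents, while a single threat model analyst or lead architect can block threat model ratification.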
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Success Criteria
|
||||||
|
|
||||||
|
1. **Threat model completeness:** A ratified threat model document that classifies all DistOS adversary types (including Byzantine CHORUS agents), their capabilities, and their goals; distributed to all councils and formally acknowledged
|
||||||
|
2. **Capability system specification:** A TLA+ specification for the capability system with the following invariants model-checked: (a) no agent holds a capability more powerful than it was granted; (b) capability revocation terminates all derived capabilities within a bounded number of rounds
|
||||||
|
3. **Attestation coverage:** A design that ensures every node in the cluster is attested before it receives workloads, with a formal specification of the attestation freshness property
|
||||||
|
4. **GPU isolation proof:** A written argument (ideally backed by a GPU isolation TLA+ specification) that two tenants co-located on the same physical GPU cannot read each other's memory under the selected isolation mechanism
|
||||||
|
5. **Constraint documents delivered:** All constraint documents for dependent councils published to UCXL and formally acknowledged by those councils before Phase 3 begins
|
||||||
|
6. **Zero-trust coverage:** Formal verification that there are no network paths in the DistOS design where unauthenticated or unencrypted communication is possible
|
||||||
|
7. **Supply chain chain-of-custody:** An in-toto layout that covers the full chain from agent source code commit to running container, with Sigstore signatures at each stage
|
||||||
|
8. **Decision record completeness:** All 8 decision points have corresponding Decision Records with alternatives considered and rationale documented, including explicit risk acceptance statements where a security trade-off was made
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 9. Timeline
|
||||||
|
|
||||||
|
### Phase 1: Research (Days 1-3)
|
||||||
|
- Day 1: Activate council, assign roles; threat model analysts begin adversary classification; research surveyors distribute literature assignments across all specialisations
|
||||||
|
- Day 2: GPU attestation survey and zero-trust network survey completed; threat model v0 circulated; capability system and multi-tenant isolation surveys in progress
|
||||||
|
- Day 3: All research artifacts published; threat model v0 reviewed by full council; research phase closes with threat model v0 and prioritised list of open questions; threat model is immediately shared with all other councils as an informational document
|
||||||
|
|
||||||
|
### Phase 2: Architecture (Days 3-6)
|
||||||
|
- Day 3-4: Capability system design, trust chain architecture, and cryptographic identity design drafted; initial decision records for DP-SEC-01 through DP-SEC-03 circulated
|
||||||
|
- Day 4-5: GPU isolation policy, RBAC/ABAC design, and zero-trust network design drafted; inter-council constraint documents begin drafting; threat model v1 drafted for ratification
|
||||||
|
- Day 5-6: Threat model v1 ratification vote (requires all threat model analysts and both leads); all DR-SEC-* voted on; constraint documents for all councils published; architecture phase closes
|
||||||
|
|
||||||
|
### Phase 3: Formal Specification (Days 6-10)
|
||||||
|
- Day 6-7: Capability system TLA+ and attestation protocol TLA+ drafted; security requirements matrix published in final form; distributed to all councils as normative constraints
|
||||||
|
- Day 7-8: Access control TLA+ and audit log integrity specification drafted; council-verify liaison reviews specifications
|
||||||
|
- Day 8-9: All TLA+ specifications submitted to council-verify for model checking; council-sec reviews model checking results and refines specifications as needed
|
||||||
|
- Day 9-10: Formal spec phase closes; all TLA+ specifications in final state with model-checking status documented
|
||||||
|
|
||||||
|
### Phase 4: Integration (Days 10-12)
|
||||||
|
- Day 10-11: Cross-council integration review; council-sec reviews each other council's Phase 3 specs for security invariant compliance; issues objections where needed
|
||||||
|
- Day 11-12: council-synth handles any conflicts; KACHING security integration document published; final ratification of all constraint documents by both council-sec and recipient councils
|
||||||
|
|
||||||
|
### Phase 5: Documentation (Days 12-14)
|
||||||
|
- Day 12-13: Security model reference specification assembled; operator and tenant security guides drafted
|
||||||
|
- Day 13-14: council-arch produces security decision archaeology narrative; final UCXL navigability audit; any outstanding security objections escalated to council-meta for resolution
|
||||||
398
councils/06-resource-accounting.md
Normal file
@@ -0,0 +1,398 @@
|
|||||||
|
# Council Design Brief: Resource Accounting, Telemetry, and SLO Enforcement
|
||||||
|
|
||||||
|
**Council ID:** `council-telemetry`
|
||||||
|
**Mission:** Design the complete observability, metering, and accountability infrastructure for DistOS — covering GPU utilisation metering, multi-tenant cost attribution, SLO definition and enforcement, real-time telemetry pipelines, distributed tracing, energy consumption tracking, and anomaly detection — such that every resource consumed in the cluster is measured, attributed, and reportable with sub-minute latency, and that service level objectives are enforced automatically with human-navigable evidence.
|
||||||
|
**UCXL Base Address:** `ucxl://council-telemetry:*@DistOS:telemetry/*`
|
||||||
|
**Agent Count:** ~40
|
||||||
|
**Status:** Pre-formation (Constitution Phase)
|
||||||
|
**Created:** 2026-02-24
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Scope and Responsibilities
|
||||||
|
|
||||||
|
Council-telemetry is responsible for the full stack of observability and accountability in DistOS: from raw hardware counters (GPU SM occupancy, NVLink bandwidth, DRAM bandwidth) through the telemetry pipeline (collection, aggregation, storage) to higher-level constructs (cost attribution, SLO evaluation, anomaly detection, capacity planning signals). The council also provides the metering data that the KACHING enterprise licensing module consumes for billing, and the energy consumption signals used for carbon-aware scheduling.
|
||||||
|
|
||||||
|
Unlike council-sec, which issues constraints to other councils, council-telemetry primarily consumes events and metrics produced by other subsystems. Its design is therefore deeply dependent on the interfaces those councils expose. A key deliverable is a telemetry interface specification that other councils must implement.
|
||||||
|
|
||||||
|
### In Scope
|
||||||
|
|
||||||
|
- GPU hardware counter collection design: SM occupancy, memory bandwidth, tensor core utilisation, NVLink bandwidth, PCIe traffic — using NVIDIA DCGM and NVML as the collection substrate
|
||||||
|
- Multi-tenant cost attribution: relating raw hardware counter values to tenant workloads; designing fair attribution algorithms for shared resources (e.g., GPU memory bus contention)
|
||||||
|
- SLO definition language and schema: formal specification of latency, throughput, availability, and utilisation SLOs; SLO evaluation at runtime
|
||||||
|
- SLO enforcement mechanisms: automatic throttling, priority inversion prevention, quota enforcement, workload eviction as a last resort
|
||||||
|
- Real-time telemetry pipeline: collection, transport, aggregation, and storage architecture; latency requirements (sub-minute for billing, sub-second for SLO enforcement, millisecond for anomaly detection)
|
||||||
|
- Distributed tracing: causally-correct tracing of agent interactions across the CHORUS mesh; integration with OpenTelemetry
|
||||||
|
- Energy consumption tracking: per-node and per-workload energy metering using NVIDIA DCGM power readings and Intel RAPL (for CPU/memory subsystem); integration with Kepler for Kubernetes-style workloads if DistOS supports Kubernetes compatibility
|
||||||
|
- Carbon-aware scheduling signals: translating energy metering into carbon signals; publishing scheduling advisory signals for carbon-aware placement decisions (consumed by council-sched)
|
||||||
|
- Quota management: per-tenant resource quotas, quota accounting, quota enforcement protocol
|
||||||
|
- Anomaly detection: statistical and ML-based anomaly detection on resource metrics; alert generation; integration with operator notification
|
||||||
|
- Capacity planning: historical data aggregation, trend analysis, capacity headroom computation
|
||||||
|
- Chargeback models: metering-to-cost translation; integration with KACHING billing model
|
||||||
|
|
||||||
|
### Out of Scope
|
||||||
|
|
||||||
|
- Scheduling decisions based on telemetry signals (council-sched consumes telemetry signals; council-telemetry produces them)
|
||||||
|
- Memory allocation algorithms (council-mem owns this; council-telemetry meters it)
|
||||||
|
- Network transport implementation (council-net owns this; council-telemetry instruments it)
|
||||||
|
- Security of the telemetry pipeline (council-sec owns this; council-telemetry complies with its requirements)
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Research Domains
|
||||||
|
|
||||||
|
### 2.1 GPU Hardware Counter Collection
|
||||||
|
|
||||||
|
Accurate GPU utilisation metering requires direct access to hardware performance counters. The challenge in a multi-tenant environment is that hardware counter access is often exclusive (enabling profiling for one workload disables it for others) or requires privileged access that tenants should not have.
|
||||||
|
|
||||||
|
**Key Papers and Systems:**
|
||||||
|
|
||||||
|
- NVIDIA Data Center GPU Manager (DCGM) — NVIDIA's official GPU telemetry framework for data centres; provides SM occupancy, memory bandwidth, active cycles, power consumption, temperature, PCIe throughput, and NVLink bandwidth via the DCGM API; DistOS's collection layer should be built on DCGM for NVIDIA GPU metrics; critically, DCGM operates in a privileged daemon model that does not require per-workload profiling access, which is compatible with multi-tenant environments
|
||||||
|
- NVIDIA Management Library (NVML) — low-level C library underlying DCGM; exposes per-GPU and per-process metrics; the per-process accounting mode (`nvmlDeviceSetAccountingMode`) enables post-run attribution of compute time and memory to specific PIDs, which is essential for chargeback; however, PID-based attribution is insufficient for MIG-partitioned GPUs where tenants share the same physical device
|
||||||
|
- Jouppi et al. (2017), "In-Datacenter Performance Analysis of a Tensor Processing Unit" — TPU performance counter methodology; useful comparison point for what GPU metrics are available versus what would be useful; many metrics DistOS needs (e.g., tensor core utilisation breakdowns by operation type) may not be directly hardware-countable and must be estimated from higher-level profiling
|
||||||
|
- Mei et al. (2023), "Characterising and Optimising Deep Learning Inference Workloads on Modern GPUs" — empirical characterisation of GPU counter behaviour for inference workloads; establishes that SM occupancy alone is an insufficient utilisation proxy; NVLink bandwidth saturation and L2 cache miss rate are important co-metrics for attributing performance degradation to resource contention
|
||||||
|
- NVIDIA Nsight Systems and Nsight Compute — profiling tools that expose timeline-level GPU activity; not suitable for production always-on metering (intrusion overhead), but inform which subset of DCGM metrics are sufficient proxies for workload characterisation
|
||||||
|
- Awan et al. (2023), "Near-Zero Overhead Telemetry" (research prototype, MLSys 2023) — demonstrates that a carefully selected subset of hardware counters, sampled at low frequency, achieves 98% of the accuracy of full profiling at less than 0.5% overhead; informs DistOS's counter selection policy
|
||||||
|
|
||||||
|
**Open questions for research phase:** Does DCGM support per-MIG-instance accounting with the same granularity as per-physical-GPU accounting? What is the DCGM polling overhead at 1-second vs 100-millisecond sampling intervals across 1024 GPUs? How should council-telemetry handle the period during which a GPU is being re-partitioned (MIG reconfiguration)?
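
The sampling-interval question can at least be bounded with simple arithmetic on data volume. The sketch below is a back-of-envelope calculation, not a measurement of DCGM overhead: the 20 fields per GPU and 16 bytes per sample are hypothetical figures chosen for illustration, and the function names are invented here.

```python
def samples_per_second(gpus: int, fields_per_gpu: int, interval_ms: int) -> float:
    """Cluster-wide sample rate for a fixed polling interval."""
    return gpus * fields_per_gpu * 1000 / interval_ms


def bytes_per_second(gpus: int, fields_per_gpu: int, interval_ms: int,
                     bytes_per_sample: int) -> float:
    """Raw payload rate before any batching, compression, or protocol overhead."""
    return samples_per_second(gpus, fields_per_gpu, interval_ms) * bytes_per_sample


GPUS = 1024             # from the open question above
FIELDS = 20             # assumed number of counters collected per GPU
BYTES_PER_SAMPLE = 16   # assumed: 8-byte timestamp plus 8-byte value

for interval_ms in (1000, 100):
    rate = samples_per_second(GPUS, FIELDS, interval_ms)
    volume = bytes_per_second(GPUS, FIELDS, interval_ms, BYTES_PER_SAMPLE)
    print(interval_ms, rate, volume)
```

Moving from 1-second to 100-millisecond sampling multiplies the sample rate by ten under these assumptions; the per-GPU collection cost inside DCGM itself still has to be measured in the research phase.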
|
||||||
|
|
||||||
|
### 2.2 Multi-Tenant Cost Attribution

Attributing resource consumption to individual tenants in a shared GPU environment is significantly harder than in a dedicated allocation model. Shared resources (memory bus bandwidth, NVLink, L2 cache) allow one tenant's workload to degrade another's performance, so the cost of contention must be attributed as well as the cost of direct consumption.

**Key Papers and Systems:**

- Delimitrou & Kozyrakis (2014), "Quasar: Resource-Efficient and QoS-Aware Cluster Management" — demonstrates that interference between co-located workloads in shared clusters is predictable and can be modelled; Quasar's interference matrix approach is directly applicable to DistOS's contention attribution; DistOS's metering should record not just individual workload consumption but also co-location context
- Nathuji & Schwan (2007), "VirtualPower: Coordinated Power Management in Virtualized Enterprise Systems" — power attribution in virtualised environments; establishes that power metering must account for idle overhead allocation, which is directly analogous to GPU idle capacity attribution in multi-tenant DistOS
- Amazon EC2 burstable performance instances (T-series) and CPU credits — practical chargeback model for burstable shared resources; DistOS's quota model should evaluate a CPU-credit-style approach for GPU workloads that occasionally spike beyond their allocated share
- Zaharia et al. (2016), "Apache Spark: A Unified Engine for Big Data Processing" — Spark's stage-level resource metering provides an abstraction level above hardware counters that is more useful for chargeback; DistOS should consider an analogous task/stage metering abstraction
- Ouyang et al. (2023), "Characterizing Interference in Shared Multi-GPU Systems" — empirical study showing L2 cache and memory bus contention between co-located workloads; provides the empirical basis for DistOS's interference-aware attribution model

### 2.3 SLO Definition, Evaluation, and Enforcement

SLO-based resource management is the standard model for production cluster management. DistOS must define SLOs, evaluate them continuously, and enforce them before violations occur.

**Key Papers and Systems:**

- Mogul & Wilkes (2019), "Nines are Not Enough: Meaningful Metrics for Clouds" — argues that traditional availability SLOs (99.9%, 99.99%) are insufficient for capturing user experience; motivates richer SLO definitions including latency percentiles (P99, P99.9), throughput, and error budget; DistOS's SLO language should support these constructs
- Hauer et al. (2020), "Meaningful Availability" — Google's production availability measurement experience; proposes user-uptime and windowed availability as metrics that track what users actually experience; relevant to how DistOS measures error budget consumption (how much SLO budget has been consumed), the primary metric driving reliability investment
- Google Site Reliability Engineering Book (Beyer et al. 2016), Chapter 4 "Service Level Objectives" — defines SLI, SLO, SLA, and error budget concepts; DistOS's SLO framework should be built on these definitions; particularly important is the error budget burn rate alerting model ("Alerting on SLOs", Chapter 5 of the SRE Workbook)
- Cortez et al. (2017), "Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms" — Microsoft Azure's workload prediction model for SLO-aware resource management; DistOS should evaluate whether workload type classification (via fingerprinting) can improve SLO prediction accuracy
- Borg (Verma et al. 2015), "Large-Scale Cluster Management at Google with Borg" — Borg's priority-based preemption as the primary SLO enforcement mechanism; DistOS should adopt a similar priority taxonomy (prod vs non-prod, high vs low latency) with preemption as the enforcement backstop
- Kubernetes Quality of Service (QoS) classes and LimitRange — Guaranteed, Burstable, and BestEffort QoS classes provide a practical three-tier SLO model; DistOS's SLO framework should map to an equivalent model for compute and memory resources, extended with GPU-specific dimensions

**Open questions for research phase:** What SLO dimensions are unique to GPU workloads versus CPU workloads? (Candidates: tensor core utilisation rate, NVLink bandwidth utilisation, checkpoint latency.) How should DistOS define an SLO for a distributed training job where progress is measured in loss reduction per unit time rather than latency?

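The error budget burn rate model referenced in section 2.3 can be illustrated with a small calculation. The sketch below is a minimal illustration, not the DistOS SLO runtime: `SloWindow`, `burn_rate`, and `should_alert` are hypothetical names, and the two-window rule is a simplified reading of the multi-window alerting approach described in the SRE Workbook.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloWindow:
    """Hypothetical summary of one evaluation window for a single SLO."""
    target: float       # SLO target as a fraction, e.g. 0.999
    good_events: int    # events that met the SLI threshold
    total_events: int   # all events observed in the window


def burn_rate(window: SloWindow) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget is consumed exactly as fast as the SLO permits;
    values above 1.0 exhaust the budget before the SLO period ends.
    """
    if window.total_events == 0:
        return 0.0
    observed_error = 1.0 - window.good_events / window.total_events
    allowed_error = 1.0 - window.target
    return observed_error / allowed_error


def should_alert(short: SloWindow, long: SloWindow, threshold: float) -> bool:
    """Alert only when both the short and the long window burn too fast."""
    return burn_rate(short) >= threshold and burn_rate(long) >= threshold
```

Requiring both windows to exceed the threshold is a design choice that suppresses alerts for brief spikes while still reacting quickly to sustained burn. The threshold itself is a policy parameter that the `slo-engineer` role would need to set per SLO.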
### 2.4 Telemetry Pipeline Architecture

At 1024 nodes each generating hundreds of metrics per second, the telemetry ingestion rate is substantial. The pipeline must handle high ingest volume, support sub-second fan-in aggregation for SLO enforcement, and store data efficiently for historical analysis.

**Key Papers and Systems:**

- Prometheus (Volz & Wilkinson, SoundCloud 2012; CNCF 2016) — pull-based metrics collection with multi-dimensional data model; Prometheus's label-based data model is well-suited to DistOS's tenant × node × workload attribution; however, Prometheus's single-node storage is not suitable for 1024 nodes at 1-second granularity over months of retention
- Thanos (Improbable Engineering, 2019) — horizontally scalable Prometheus with object storage backend; enables long-term metrics retention and cross-cluster querying; DistOS should evaluate Thanos (or an equivalent such as Cortex or Mimir) as the metrics storage layer
- Borgmon (Moyé et al. 2003, internal Google; described in SRE Book Chapter 10) — Google's Borg monitoring system; time-series database with rule evaluation for alerting; the concept of Borgmon rules as the primary SLO evaluation mechanism directly informs DistOS's design
- Kafka (Kreps et al. 2011, LinkedIn) — distributed log for high-throughput event streaming; DistOS's telemetry pipeline should evaluate Kafka or an equivalent (Pulsar, Redpanda) as the transport layer between metric producers and the time-series database; Kafka's retention window provides a replay buffer that lets SLO evaluation catch up after an outage
- OpenTelemetry (CNCF 2019) — vendor-neutral standard for metrics, logs, and traces; the OpenTelemetry Collector provides a pipeline architecture (receivers, processors, exporters) that DistOS should adopt as its telemetry collection standard; OTLP (OpenTelemetry Protocol) is the wire format
- VictoriaMetrics — high-performance metrics storage designed for high cardinality; vendor-published benchmarks claim up to 10x storage efficiency over Prometheus for the high-cardinality workloads typical of multi-tenant cluster telemetry; DistOS should include VictoriaMetrics in the storage backend evaluation

**Throughput estimation:** At 1024 nodes, each running 8 GPUs, with 100 DCGM metrics per GPU sampled at 1-second intervals, the raw ingest rate is 1024 × 8 × 100 = 819,200 metric data points per second. At 8 bytes per data point plus 40 bytes of labels (48 bytes in total), this is approximately 39 MB/s of raw metric stream before compression. The telemetry pipeline must handle this as a sustained base load with 3x burst headroom during incidents (when anomaly detection triggers increased sampling rates).

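The throughput estimate in section 2.4 can be reproduced and varied with a few lines of arithmetic. The sketch below uses only the figures stated in that estimate (1024 nodes, 8 GPUs per node, 100 metrics per GPU, 8-byte values, 40 bytes of labels, 3x burst headroom); the function names are illustrative, and the 100-millisecond case is included only because it appears as an open question in section 2.1.

```python
NODES = 1024
GPUS_PER_NODE = 8
METRICS_PER_GPU = 100
VALUE_BYTES = 8
LABEL_BYTES = 40
BURST_FACTOR = 3


def points_per_second(interval_s: float) -> float:
    """Metric data points emitted per second at a given sampling interval."""
    return NODES * GPUS_PER_NODE * METRICS_PER_GPU / interval_s


def megabytes_per_second(points: float) -> float:
    """Raw, uncompressed stream size in MB/s (1 MB = 10**6 bytes)."""
    return points * (VALUE_BYTES + LABEL_BYTES) / 1e6


for interval_s in (1.0, 0.1):
    pts = points_per_second(interval_s)
    mb = megabytes_per_second(pts)
    print(f"{interval_s}s interval: {pts:,.0f} points/s, "
          f"{mb:.1f} MB/s sustained, {mb * BURST_FACTOR:.1f} MB/s burst")
```

At a 1-second interval this gives 819,200 points per second and about 39.3 MB/s sustained, or roughly 118 MB/s with the 3x burst headroom. Moving every metric to 100-millisecond sampling multiplies each figure by ten, which is one argument for adaptive rather than uniformly high sampling rates.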
### 2.5 Distributed Tracing

In a system where CHORUS agents interact via the WHOOSH/CHORUS mesh, understanding the causal chain of an operation across hundreds of agents requires distributed tracing.

**Key Papers and Systems:**

- Dapper (Sigelman et al. 2010), "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" — Google's production distributed tracing system; introduces the span/trace model that all modern tracing systems follow; DistOS should implement Dapper-style tracing for all inter-agent communications
- Zipkin (Twitter, 2012) — open-source Dapper implementation; provides a reference implementation for DistOS's tracing layer; its B3 propagation headers remain widely supported for trace context propagation alongside the W3C Trace Context standard
- OpenTelemetry Traces — the successor to OpenTracing and OpenCensus; provides a standardised trace API and SDK; DistOS should use OpenTelemetry traces for all CHORUS agent interactions, with OTLP export to a Jaeger or Tempo backend
- Fonseca et al. (2007), "X-Trace: A Pervasive Network Tracing Framework" — alternative to Dapper with stronger causality tracking; relevant for DistOS because CHORUS agent communication is not strictly request-response (it is gossip-based), and standard parent-child span models may not capture gossip causality accurately

**Open questions for research phase:** How should DistOS trace gossip-based communication (where a single message may spawn fan-out to hundreds of recipients)? What is the trace context propagation mechanism for CHORUS pubsub messages? Should DistOS use baggage propagation to carry tenant identity through the trace context?

### 2.6 Energy Consumption and Carbon-Aware Scheduling

**Key Papers and Systems:**

- Intel RAPL (Running Average Power Limit) — exposes CPU and DRAM power consumption via model-specific registers (MSRs); provides per-socket power metering at millisecond granularity on Intel Xeon processors; Grace CPUs are Arm-based and do not implement RAPL, so the target cluster's Grace-Hopper nodes can use this approach only if Grace exposes an equivalent power telemetry interface
- NVIDIA DCGM Power Fields — DCGM exposes `DCGM_FI_DEV_POWER_USAGE` (current power draw in watts) and `DCGM_FI_DEV_TOTAL_ENERGY_CONSUMPTION` (cumulative energy in millijoules) per GPU; DistOS's energy metering should aggregate these per workload and per tenant
- Kepler (Kubernetes-based Efficient Power Level Exporter, CNCF Sandbox 2023) — exports per-container and per-pod energy consumption estimates derived from hardware performance counters; provides Prometheus metrics for energy and carbon; DistOS should evaluate Kepler's counter-to-energy estimation model for per-workload attribution
- Lottarini et al. (2018), "vBoost: Scaling Up Microservices by Automatically Boosting Cloud Resources" — demonstrates that energy cost models must account for the non-linear relationship between utilisation and power draw (a GPU at 50% utilisation typically consumes more than 50% of its maximum power due to idle power overheads)
- Wiesner et al. (2021), "Let's Wait Awhile: How Temporal Workload Shifting Can Reduce Carbon Emissions in the Cloud" — demonstrates that temporal workload shifting (scheduling batch workloads to run when the grid carbon intensity is low) can reduce carbon emissions by 20-40%; DistOS should publish carbon intensity signals that council-sched can use for carbon-aware placement
- Electricity Maps API — real-time grid carbon intensity data; DistOS's carbon signal should integrate with an external grid carbon API to translate energy consumption to CO2 equivalent; provides the "carbon intensity" value in gCO2eq/kWh that DistOS needs for carbon-aware scheduling
- Patterson et al. (2021), "Carbon Emissions and Large Neural Network Training" — quantifies the carbon footprint of large model training; the methodology here is the direct motivation for DistOS's energy and carbon telemetry; DistOS should enable workload-level carbon reporting at the same granularity as this paper reports

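The path from a cumulative GPU energy counter to a per-workload carbon figure is simple arithmetic, sketched below. This is an illustration under stated assumptions rather than the DistOS energy model: it assumes the cumulative counter is reported in millijoules (to be confirmed against the DCGM field documentation), that each GPU is dedicated to one workload for the interval, and that a single carbon intensity value in gCO2eq/kWh applies to the whole interval. The function names and the `(gpu_id, start_mj, end_mj)` tuple layout are hypothetical.

```python
MJ_PER_KWH = 3.6e9  # 1 kWh = 3.6e6 J = 3.6e9 mJ


def energy_kwh(start_mj: int, end_mj: int) -> float:
    """Energy used between two readings of a cumulative millijoule counter."""
    if end_mj < start_mj:
        # Assumption: a decreasing counter means a reset (e.g. driver reload)
        # and must be handled by the caller rather than silently ignored.
        raise ValueError("energy counter decreased between readings")
    return (end_mj - start_mj) / MJ_PER_KWH


def workload_report(readings, intensity_g_per_kwh: float) -> dict:
    """Sum energy over all GPUs of one workload and convert to gCO2eq.

    readings: iterable of (gpu_id, start_mj, end_mj) tuples.
    """
    total_kwh = sum(energy_kwh(start, end) for _, start, end in readings)
    return {
        "energy_kwh": total_kwh,
        "carbon_g": total_kwh * intensity_g_per_kwh,
    }
```

For example, two GPUs that each consume 1 kWh at a grid intensity of 400 gCO2eq/kWh yield a report of 2 kWh and 800 gCO2eq. Shared or MIG-partitioned GPUs would need an additional apportioning step that this sketch does not attempt.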
### 2.7 Anomaly Detection

**Key Papers and Systems:**

- Dean & Barroso (2013), "The Tail at Scale" — motivates tail latency as the primary SLO metric; a request that fans out to many services is far more likely to hit at least one slow outlier than a request to a single service; DistOS's anomaly detection should focus on detecting when individual nodes begin exhibiting tail latency behaviour before it becomes cluster-wide
- Chandola et al. (2009), "Anomaly Detection: A Survey" — comprehensive taxonomy of anomaly detection approaches; DistOS should implement statistical anomaly detection (control charts, CUSUM) for low-latency alerting on simple metric deviations, and ML-based anomaly detection (Isolation Forest, LSTM-AD) for complex multi-metric anomalies
- Liu et al. (2008), "Isolation Forest" — anomaly detection algorithm with linear time complexity and a low memory footprint (achieved through sub-sampling), which makes it well-suited to streaming metric data; DistOS's anomaly detection pipeline should evaluate Isolation Forest for high-dimensional metric vectors
- Hundman et al. (2018), "Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding" — LSTM-based anomaly detection with dynamic thresholds; relevant for DistOS GPU metrics that exhibit strong periodicity (training loss curves, checkpoint intervals) that static thresholds cannot handle
- Prometheus Alertmanager and recording rules — production alerting infrastructure; DistOS's anomaly detection can integrate with Prometheus recording rules for pre-computed alert expressions and Alertmanager for routing, deduplication, and silencing

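As an illustration of the statistical end of the anomaly detection spectrum in section 2.7, the sketch below implements a two-sided tabular CUSUM detector for a single metric stream. It is a minimal example, not a proposed DistOS component: the class name and the parameter values used with it are assumptions, and a real deployment would need to estimate the in-control mean and tune the slack and threshold per metric.

```python
from dataclasses import dataclass


@dataclass
class Cusum:
    """Two-sided tabular CUSUM over a stream of metric samples."""
    target: float      # expected in-control mean of the metric
    slack: float       # allowance k; deviations smaller than this are ignored
    threshold: float   # decision interval h; alarm when a sum exceeds it
    high: float = 0.0  # accumulated upward deviation
    low: float = 0.0   # accumulated downward deviation

    def update(self, x: float) -> bool:
        """Feed one sample; return True if an alarm fires (sums then reset)."""
        self.high = max(0.0, self.high + (x - self.target - self.slack))
        self.low = max(0.0, self.low + (self.target - x - self.slack))
        if self.high > self.threshold or self.low > self.threshold:
            self.high = 0.0
            self.low = 0.0
            return True
        return False
```

With `Cusum(target=100.0, slack=1.0, threshold=5.0)`, samples that hover within one unit of 100 never accumulate, while a sustained shift to 104 raises an alarm on the second shifted sample. Small persistent drifts are therefore caught quickly without alerting on single noisy readings, which is the property that makes CUSUM attractive for low-latency alerting.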
### 2.8 KACHING Integration

The KACHING enterprise licensing module is DistOS's commercial monetisation layer. Council-telemetry's metering data is the input to KACHING's billing engine.

**Key Papers:**

- Lim et al. (2009), "Characterizing Web Server Capacity" — motivates the importance of accurate metering as the foundation for any chargeback model; inaccurate meters lead to customer disputes and revenue leakage
- Amazon Web Services billing model (EC2 detailed billing, Cost Allocation Tags) — practical reference for chargeback model design; DistOS's billing tags should be compatible with AWS cost allocation tag conventions to ease hybrid billing for organisations already using AWS cost accounting

---

## 3. Agent Roles

| Role | Count | Responsibilities |
|------|-------|-----------------|
| `lead-architect` | 2 | Cross-domain coherence, KACHING integration oversight, SLO enforcement arbitration, synthesis liaison |
| `gpu-metrics-engineer` | 6 | DCGM integration design, counter selection policy, per-MIG accounting design, sampling rate analysis |
| `telemetry-pipeline-engineer` | 6 | Collection pipeline architecture, Kafka/OTLP transport design, time-series storage selection, throughput analysis |
| `cost-attribution-specialist` | 4 | Multi-tenant attribution algorithms, contention attribution, shared resource cost allocation models |
| `slo-engineer` | 6 | SLO language design, error budget model, SLO evaluation runtime, enforcement trigger design |
| `distributed-tracing-specialist` | 4 | OpenTelemetry integration, trace context propagation for CHORUS gossip, span model for agent interactions |
| `energy-carbon-analyst` | 4 | DCGM power metering, Intel RAPL integration, Kepler evaluation, carbon intensity signal design |
| `anomaly-detection-engineer` | 4 | Statistical and ML-based anomaly detection design, alert routing, dynamic threshold design |
| `quota-manager` | 2 | Per-tenant quota definition, quota accounting protocol, enforcement escalation |
| `formal-spec-author` | 2 | TLA+ specifications for SLO evaluation state machine, quota accounting invariants |

**Total: 40 agents**

---

## 4. Key Deliverables

### Phase 1: Research (Days 1-3)

| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| GPU metrics survey | `ucxl://council-telemetry:researcher@DistOS:telemetry/*^/research/gpu-metrics-survey.md` | DCGM counter catalogue, sampling overhead analysis, per-MIG accounting capabilities |
| Telemetry pipeline survey | `ucxl://council-telemetry:researcher@DistOS:telemetry/*^/research/telemetry-pipeline-survey.md` | Prometheus/Thanos vs VictoriaMetrics vs OpenTelemetry Collector comparison |
| SLO framework survey | `ucxl://council-telemetry:researcher@DistOS:telemetry/*^/research/slo-framework-survey.md` | SRE SLO model, error budgets, GPU-specific SLO dimensions |
| Energy metering survey | `ucxl://council-telemetry:researcher@DistOS:telemetry/*^/research/energy-metering-survey.md` | DCGM power fields, Intel RAPL, Kepler, carbon intensity API options |
| Telemetry interface requirements | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/research/telemetry-interface-requirements.md` | What events and metrics council-telemetry requires from council-sched, council-mem, and council-net |

### Phase 2: Architecture (Days 3-6)

| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| GPU metering design | `ucxl://council-telemetry:gpu-metrics-engineer@DistOS:telemetry/*^/architecture/gpu-metering-design.md` | Counter selection, sampling policy, per-MIG attribution design |
| Telemetry pipeline design | `ucxl://council-telemetry:telemetry-pipeline-engineer@DistOS:telemetry/*^/architecture/telemetry-pipeline-design.md` | Collection, transport, aggregation, and storage architecture |
| Cost attribution model | `ucxl://council-telemetry:cost-attribution-specialist@DistOS:telemetry/*^/architecture/cost-attribution-model.md` | Attribution algorithm, contention handling, shared resource allocation |
| SLO framework design | `ucxl://council-telemetry:slo-engineer@DistOS:telemetry/*^/architecture/slo-framework-design.md` | SLO language spec, error budget tracking, enforcement mechanism |
| Energy and carbon model | `ucxl://council-telemetry:energy-carbon-analyst@DistOS:telemetry/*^/architecture/energy-carbon-model.md` | Per-workload energy metering, carbon signal production |
| Anomaly detection design | `ucxl://council-telemetry:anomaly-detection-engineer@DistOS:telemetry/*^/architecture/anomaly-detection-design.md` | Algorithm selection, threshold design, alert routing |
| Telemetry API specification | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/architecture/telemetry-api-spec.md` | API surface that other councils must implement to emit events to council-telemetry |
| DR-TEL-001: Time-series storage backend | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/decisions/DR-TEL-001-tsdb-selection.md` | Decision record: Thanos vs VictoriaMetrics vs Mimir |
| DR-TEL-002: SLO enforcement mechanism | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/decisions/DR-TEL-002-slo-enforcement.md` | Decision record: throttling vs preemption as primary enforcement |
| DR-TEL-003: Energy attribution model | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/decisions/DR-TEL-003-energy-attribution.md` | Decision record: direct measurement vs performance-counter estimation |

### Phase 3: Formal Specification (Days 6-10)

| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| SLO evaluation TLA+ specification | `ucxl://council-telemetry:formal-spec-author@DistOS:telemetry/*^/specs/slo-evaluation.tla` | Invariant: no workload's SLO is evaluated on stale data; liveness: every SLO violation triggers enforcement within a bounded interval |
| Quota accounting TLA+ specification | `ucxl://council-telemetry:formal-spec-author@DistOS:telemetry/*^/specs/quota-accounting.tla` | Invariant: sum of attributed quota across all tenants does not exceed physical resource capacity; no negative quota balances |
| Cost attribution TLA+ specification | `ucxl://council-telemetry:formal-spec-author@DistOS:telemetry/*^/specs/cost-attribution.tla` | Invariant: every resource unit consumed is attributed to exactly one tenant; attribution completeness property |
| Telemetry data model schema | `ucxl://council-telemetry:telemetry-pipeline-engineer@DistOS:telemetry/*^/specs/telemetry-data-model.json` | JSON Schema definition of all telemetry event types, label cardinality constraints |

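The quota accounting and cost attribution invariants listed for Phase 3 can also be stated as executable checks, which may help when reviewing the TLA+ specifications or when testing an implementation against recorded state. The sketch below is illustrative only and is not a substitute for model checking; the function names, the integer resource units, and the dictionary-based state representation are all assumptions.

```python
def quota_violations(capacity: int, attributed: dict[str, int]) -> list[str]:
    """Check: no negative balances, and total attribution within capacity."""
    problems = [
        f"negative balance for {tenant}: {units}"
        for tenant, units in attributed.items()
        if units < 0
    ]
    total = sum(attributed.values())
    if total > capacity:
        problems.append(f"attributed {total} exceeds capacity {capacity}")
    return problems


def attribution_violations(consumed: set[str],
                           owners: dict[str, list[str]]) -> list[str]:
    """Check: every consumed resource unit maps to exactly one tenant."""
    problems = []
    for unit in sorted(consumed):
        count = len(owners.get(unit, []))
        if count != 1:
            problems.append(f"unit {unit} attributed to {count} tenants")
    return problems
```

An empty list from both functions means the sampled state satisfies the invariants. A non-empty list names the offending tenant or resource unit, which is the kind of counterexample a model checker would report for the corresponding specification.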
### Phase 4: Integration (Days 10-12)

| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| Scheduling event interface | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/council-sched-interface.md` | Events required from council-sched: workload placement, preemption, completion, checkpoint events |
| Memory metrics interface | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/council-mem-interface.md` | Metrics required from council-mem: allocation, deallocation, bandwidth consumption, cache miss rates |
| Network metrics interface | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/council-net-interface.md` | Metrics required from council-net: per-link bandwidth, RDMA queue depths, retransmission rates |
| KACHING billing integration | `ucxl://council-telemetry:cost-attribution-specialist@DistOS:telemetry/*^/integration/kaching-billing-integration.md` | Metering event schema for KACHING; billing granularity; attribution latency SLA |
| Security compliance review | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/security-compliance-review.md` | Confirmation that telemetry pipeline satisfies council-sec's data classification requirements |

### Phase 5: Documentation (Days 12-14)

| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| Telemetry reference specification | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/docs/telemetry-reference-spec.md` | Complete observability and accounting model specification |
| SLO configuration guide | `ucxl://council-telemetry:slo-engineer@DistOS:telemetry/*^/docs/slo-configuration-guide.md` | How tenants define SLOs, how operators set enforcement policies |
| Operator metering guide | `ucxl://council-telemetry:gpu-metrics-engineer@DistOS:telemetry/*^/docs/operator-metering-guide.md` | DCGM configuration, counter selection, sampling rate tuning |
| Carbon accounting guide | `ucxl://council-telemetry:energy-carbon-analyst@DistOS:telemetry/*^/docs/carbon-accounting-guide.md` | How to read per-workload energy reports, carbon intensity interpretation |
| Decision archaeology summary | `ucxl://council-arch:narrator@DistOS:decision-history/*^/narratives/council-telemetry-design-narrative.md` | Human-readable narrative of how resource accounting design decisions evolved |

---

## 5. Decision Points

**DP-TEL-01: GPU Metric Sampling Strategy**
Should council-telemetry use a push model (a node-local DCGM agent publishes metrics to a Kafka topic at a configured interval) or a pull model (the telemetry collector scrapes a DCGM exporter endpoint at configurable intervals, Prometheus-style)? The push model reduces collection latency but couples the DCGM configuration to the pipeline design; the pull model is more operationally familiar and aligns with Prometheus conventions, but adds a scrape hop. Additionally, should sampling intervals be fixed (1 second for all metrics) or adaptive (higher frequency during SLO-critical periods, lower frequency during idle periods)?

**DP-TEL-02: Time-Series Database Selection**
Three candidates for the long-term metrics storage backend have been identified: (a) Prometheus with Thanos as the object-storage-backed long-term store — widely adopted, strong community, operational maturity; (b) VictoriaMetrics — reportedly 3-5x better compression than Prometheus and higher ingest throughput, but a smaller ecosystem; (c) OpenTelemetry Collector with a custom backend — maximum flexibility but highest implementation cost. The decision should be driven by the ingest rate calculation (approximately 39 MB/s at 1-second granularity for 1024 nodes × 8 GPUs × 100 metrics) and the query latency requirement (sub-second query execution for SLO evaluation and dashboard rendering).

**DP-TEL-03: SLO Enforcement Mechanism**
When an SLO violation is detected, what enforcement actions should DistOS take and in what order? Candidate escalation ladder: (1) warning signal to workload orchestrator (advisory); (2) CPU/memory throttling via cgroups; (3) GPU compute throttling via the NVIDIA MPS (Multi-Process Service) active thread percentage; (4) GPU time-slice reduction; (5) workload preemption (destructive if no checkpoint exists). The council must define the escalation policy including dwell times at each stage, the maximum rate of escalation, and the de-escalation criteria. A workload should not oscillate between enforcement states faster than its checkpoint interval.

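The escalation ladder in DP-TEL-03 can be sketched as a small state machine that moves one stage at a time and refuses to change stage more often than the workload checkpoints. This is a simplified illustration, not the council's policy: it assumes a single dwell time shared by all stages (the decision point calls for per-stage dwell times), one-step de-escalation whenever the SLO is healthy, and hypothetical names throughout.

```python
from enum import IntEnum


class Stage(IntEnum):
    NONE = 0
    ADVISORY = 1
    CGROUP_THROTTLE = 2
    MPS_THROTTLE = 3
    TIMESLICE_REDUCTION = 4
    PREEMPT = 5


class Escalator:
    """Steps through enforcement stages with a minimum gap between changes."""

    def __init__(self, dwell_s: float, checkpoint_interval_s: float):
        # Using the larger of the two values guarantees the workload never
        # changes enforcement state faster than its checkpoint interval.
        self.min_gap_s = max(dwell_s, checkpoint_interval_s)
        self.stage = Stage.NONE
        self.last_change_s = float("-inf")

    def step(self, now_s: float, violating: bool) -> Stage:
        """Evaluate once; escalate or de-escalate by at most one stage."""
        if now_s - self.last_change_s < self.min_gap_s:
            return self.stage
        if violating and self.stage < Stage.PREEMPT:
            self.stage = Stage(self.stage + 1)
            self.last_change_s = now_s
        elif not violating and self.stage > Stage.NONE:
            self.stage = Stage(self.stage - 1)
            self.last_change_s = now_s
        return self.stage
```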
**DP-TEL-04: Cost Attribution for Contended Resources**
When two tenants compete for a shared resource (e.g., L2 cache, NVLink bandwidth) and both suffer performance degradation, how should the cost be attributed? Option A: attribute the physical resource consumption to both tenants proportionally (each pays for the resource they consumed regardless of contention); Option B: attribute the wasted capacity caused by contention to the lower-priority tenant (the disruptor pays); Option C: attribute contention overhead as an unallocated cluster cost absorbed by the operator. Option A is simple but may not align with tenants' SLO expectations; Option B requires a contention detection mechanism that is technically complex but aligns with the principle of least cost surprise; Option C keeps tenant bills predictable but shifts the cost of contention to the operator.

**DP-TEL-05: Energy Attribution Granularity**
Should energy attribution be per-GPU (simple: divide total GPU power draw by workload count), per-MIG-instance (more precise, where DCGM exposes per-MIG power estimates), per-CUDA-stream (most precise: requires performance counter access that may not be available in multi-tenant mode), or per-transaction (for inference serving: most useful for KACHING billing of per-request costs)? The appropriate granularity depends on the KACHING billing model and the measurement capabilities of the hardware. The council should document the measurement capability gaps and the estimation methodologies used to bridge them.

**DP-TEL-06: Telemetry Data Retention and Downsampling**
Raw metric data at 1-second granularity generates approximately 39 MB/s. A 90-day retention period at this rate requires approximately 300 TB of storage before compression (39 MB/s × 86,400 s/day × 90 days). Is this acceptable, or should DistOS implement a tiered retention policy: raw data retained for 7 days, 1-minute downsampled for 30 days, 5-minute downsampled for 1 year? Downsampled data loses information about short-duration anomalies but dramatically reduces storage cost. The decision must specify the minimum retention period for data used in SLO violation forensics (which may require second-level resolution for the duration of the SLO window).

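The storage trade-off in DP-TEL-06 can be estimated directly from the raw ingest rate. The sketch below compares flat 90-day raw retention with the tiered policy named in the decision point. It assumes 48 bytes per data point with no compression, that each downsampled tier stores a single aggregate per series per interval at the same point size, and that every tier keeps its own copy for its full retention period; real downsampling usually stores several aggregates (for example min, max, and mean), so the tiered figure is a lower bound.

```python
RAW_BYTES_PER_S = 1024 * 8 * 100 * (8 + 40)  # 39,321,600 bytes/s at 1 s sampling
SECONDS_PER_DAY = 86_400


def storage_tb(bytes_per_s: float, days: float) -> float:
    """Uncompressed storage in TB (10**12 bytes) for a constant ingest rate."""
    return bytes_per_s * SECONDS_PER_DAY * days / 1e12


def tiered_storage_tb(raw_bytes_per_s: float, tiers) -> float:
    """Total storage for (resolution_seconds, retention_days) tiers."""
    return sum(storage_tb(raw_bytes_per_s / res_s, days) for res_s, days in tiers)


flat = storage_tb(RAW_BYTES_PER_S, 90)
tiered = tiered_storage_tb(RAW_BYTES_PER_S, [(1, 7), (60, 30), (300, 365)])
print(f"flat 90-day raw retention: {flat:.0f} TB")
print(f"tiered retention (7 d raw, 30 d at 1 min, 1 y at 5 min): {tiered:.1f} TB")
```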
**DP-TEL-07: Carbon Signal Architecture**
Should DistOS maintain an internal carbon intensity model (estimating carbon per kWh from the data centre's contracted electricity mix), integrate with an external real-time carbon API (such as Electricity Maps or WattTime), or provide both with the internal model as a fallback? The external API provides higher accuracy and real-time grid-responsive signals but introduces an external dependency; the internal model is always available but may be inaccurate. The carbon signal is advisory (consumed by council-sched for carbon-aware placement), so accuracy requirements are lower than for billing data.

**DP-TEL-08: Anomaly Detection Sensitivity and Alert Fatigue**
Anomaly detection systems in production clusters routinely suffer from alert fatigue: so many anomalies are detected that operators stop responding to alerts. DistOS must define an alert quality target (e.g., 90% of alerts should correspond to a genuine degradation event, measured over a rolling 7-day window) and design the anomaly detection algorithm parameters to achieve it. The council must choose between a sensitivity-first approach (catch everything, suppress via correlation) and a precision-first approach (only alert when confidence is high, miss some real anomalies).

---

## 6. Dependencies on Other Councils

### council-sched (Process Scheduling) — Events Required
Council-sched is the primary event source for workload-level attribution. Council-telemetry requires the following events from council-sched:

- Workload placement event: workload ID, tenant ID, node ID, GPU ID (or MIG instance ID), start timestamp
- Workload completion event: workload ID, end timestamp, exit status
- Preemption event: workload ID, timestamp, preempting workload ID, reason
- Checkpoint event: workload ID, checkpoint ID, timestamp, checkpoint size, Weka FS location
- Priority change event: workload ID, old priority, new priority, timestamp

Without these events, council-telemetry cannot attribute GPU hardware counter readings to specific workloads. The attribution is performed by joining the scheduling event stream (from council-sched) with the DCGM counter stream on (node_id, gpu_id, time_window).

**Interface artifact:** `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/council-sched-interface.md`

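The join between council-sched placement events and the DCGM counter stream on (node_id, gpu_id, time_window) can be sketched as follows. This is a minimal in-memory illustration under stated assumptions, not the pipeline design: the `Placement` and `Sample` record shapes are hypothetical, each GPU is assumed to host at most one workload at a time, and the time window is taken as the half-open interval from placement start to completion.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, Optional


@dataclass(frozen=True)
class Placement:
    """Hypothetical placement record derived from council-sched events."""
    workload_id: str
    tenant_id: str
    node_id: str
    gpu_id: str
    start_ts: float
    end_ts: Optional[float]  # None while the workload is still running


@dataclass(frozen=True)
class Sample:
    """Hypothetical counter reading from the DCGM stream."""
    node_id: str
    gpu_id: str
    ts: float
    metric: str
    value: float


def attribute(samples: Iterable[Sample],
              placements: Iterable[Placement]) -> Iterator[tuple]:
    """Yield (tenant_id, workload_id, sample) for samples inside a placement."""
    by_gpu: dict = {}
    for p in placements:
        by_gpu.setdefault((p.node_id, p.gpu_id), []).append(p)
    for s in samples:
        for p in by_gpu.get((s.node_id, s.gpu_id), []):
            if p.start_ts <= s.ts and (p.end_ts is None or s.ts < p.end_ts):
                yield p.tenant_id, p.workload_id, s
```

Samples that match no placement are dropped here. In a real pipeline they would represent idle capacity, which still needs an attribution policy, and MIG-partitioned GPUs would need the MIG instance ID added to the join key.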
### council-mem (Distributed Memory) — Metrics Required
Council-telemetry requires memory utilisation metrics that council-mem produces or can expose:

- Per-workload GPU memory allocation size (bytes, by memory type: HBM, NVLink DRAM, Weka-backed)
- Memory bandwidth consumption per workload (GB/s, HBM and NVLink separately)
- Memory allocation and deallocation events for quota accounting
- L2 cache hit rate per workload (for contention detection)

**Interface artifact:** `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/council-mem-interface.md`

### council-net (Network Stack) — Metrics Required
Council-telemetry requires network metrics to provide end-to-end resource accounting:

- Per-flow bandwidth consumption (bytes transferred, broken down by tenant workload where SPIFFE identity is available)
- NVLink utilisation per link (for intra-node GPU-to-GPU bandwidth attribution)
- InfiniBand queue pair statistics (for RDMA-based checkpoint I/O attribution)
- Weka client I/O statistics per node (to attribute filesystem I/O to tenants)

**Interface artifact:** `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/council-net-interface.md`

### council-sec (Security Model) — Constraint Direction
Council-sec provides the following constraints on council-telemetry's design:

- Telemetry data classification: raw hardware counters are classified as internal operational data; per-tenant cost attribution data is classified as confidential tenant data; no tenant may query another tenant's attribution data
- Audit log requirements: all SLO enforcement actions (throttling, preemption) must generate signed audit log entries readable by council-sec's audit log query interface
- Telemetry pipeline security: the DCGM collection daemon, the Kafka transport, and the TSDB must all operate under SPIFFE-issued SVIDs; no unauthenticated writes to the metrics store are permitted

**Constraint artifact:** `ucxl://council-sec:audit-engineer@DistOS:security/*^/integration/constraints-for-council-telemetry.md`

### KACHING Integration
|
||||||
|
KACHING is the CHORUS enterprise licensing and billing module. Council-telemetry is the sole authoritative source of metering data for KACHING billing. The integration requires:
|
||||||
|
|
||||||
|
- A streaming metering event feed from council-telemetry to KACHING (Kafka topic with per-workload resource consumption events at billing granularity)
|
||||||
|
- Immutable billing records: metering events used for billing must be cryptographically signed by council-telemetry and stored in KACHING's tamper-evident billing ledger
|
||||||
|
- Reconciliation: periodic reconciliation between council-telemetry's attribution totals and KACHING's billed totals; discrepancies must be flagged as billing anomalies
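The signing and reconciliation requirements above can be illustrated with a small sketch. It assumes a symmetric HMAC-SHA256 tag as a stand-in for whatever signature scheme the billing contract eventually ratifies (a real ledger would more plausibly use asymmetric signatures bound to a SPIFFE identity), and the event fields and key are hypothetical.

```python
import hashlib
import hmac
import json

# Hypothetical demo key; stands in for key material issued by the real PKI.
SIGNING_KEY = b"demo-key-not-for-production"


def sign_event(event: dict, key: bytes = SIGNING_KEY) -> dict:
    """Attach an HMAC-SHA256 tag computed over a canonical JSON encoding."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"event": event, "signature": tag}


def verify_event(signed: dict, key: bytes = SIGNING_KEY) -> bool:
    """Recompute the tag and compare it in constant time."""
    expected = sign_event(signed["event"], key)["signature"]
    return hmac.compare_digest(expected, signed["signature"])


def reconcile(attributed: dict, billed: dict, tolerance: float = 0.0) -> list:
    """Return the workloads whose attributed and billed totals disagree."""
    anomalies = []
    for workload in sorted(set(attributed) | set(billed)):
        a = attributed.get(workload, 0.0)
        b = billed.get(workload, 0.0)
        if abs(a - b) > tolerance:
            anomalies.append(workload)
    return anomalies


signed = sign_event({"workload_id": "job-a", "gpu_seconds": 120.0})
print(verify_event(signed))
```

Canonical encoding (sorted keys, fixed separators) matters here: without it, two semantically identical events could serialise differently and fail verification.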
|
||||||
|
|
||||||
|
**Interface artifact:** `ucxl://council-telemetry:cost-attribution-specialist@DistOS:telemetry/*^/integration/kaching-billing-integration.md`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. WHOOSH Configuration
|
||||||
|
|
||||||
|
### Team Formation
|
||||||
|
|
||||||
|
```json
|
||||||
|
{
|
||||||
|
"council_id": "council-telemetry",
|
||||||
|
"team_topic": "whoosh.team.distos-council-telemetry",
|
||||||
|
"composition": {
|
||||||
|
"lead-architect": 2,
|
||||||
|
"gpu-metrics-engineer": 6,
|
||||||
|
"telemetry-pipeline-engineer": 6,
|
||||||
|
"cost-attribution-specialist": 4,
|
||||||
|
"slo-engineer": 6,
|
||||||
|
"distributed-tracing-specialist": 4,
|
||||||
|
"energy-carbon-analyst": 4,
|
||||||
|
"anomaly-detection-engineer": 4,
|
||||||
|
"quota-manager": 2,
|
||||||
|
"formal-spec-author": 2
|
||||||
|
},
|
||||||
|
"total_agents": 40,
|
||||||
|
"quorum_policy": {
|
||||||
|
"artifact_publication": "simple_majority",
|
||||||
|
"architecture_decision": "two_thirds_supermajority",
|
||||||
|
"slo_enforcement_policy": "all_slo_engineers_plus_one_lead",
|
||||||
|
"kaching_billing_contract": "lead_architect_plus_cost_attribution_specialists",
|
||||||
|
"formal_spec_ratification": "formal_spec_authors_plus_one_lead"
|
||||||
|
},
|
||||||
|
"join_timeout_minutes": 30,
|
||||||
|
"inactivity_eviction_minutes": 120,
|
||||||
|
"special_policy": "telemetry_api_spec_requires_acknowledgement_from_council_sched_council_mem_council_net_before_finalisation"
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
### Subchannels
|
||||||
|
|
||||||
|
| Subchannel | Topic Suffix | Purpose |
|
||||||
|
|-----------|-------------|---------|
|
||||||
|
| Control | `.control` | Role assignments, join/leave events, phase transitions |
|
||||||
|
| Research | `.research` | Literature survey coordination, throughput calculations, system comparisons |
|
||||||
|
| Pipeline | `.pipeline` | Telemetry collection and storage design |
|
||||||
|
| SLO | `.slo` | SLO language design, enforcement policy debate |
|
||||||
|
| Attribution | `.attribution` | Cost attribution algorithm design, contention handling |
|
||||||
|
| Energy | `.energy` | Power metering, carbon signal design |
|
||||||
|
| Integration | `.integration` | Interface contract negotiation with other councils and KACHING |
|
||||||
|
| Voting | `.voting` | Quorum votes on decision records and artifact ratification |
|
||||||
|
| Artifacts | `.artifacts` | UCXL artifact announcement references |
|
||||||
|
|
||||||
|
### Quorum Configuration
|
||||||
|
|
||||||
|
Council-telemetry, being the smallest of the DistOS research councils at 40 agents, operates with proportionally lower absolute quorum counts. A simple majority (21 agents) suffices for artifact publication. Architecture decisions require 27 agents (two-thirds). The KACHING billing contract requires sign-off from both lead architects and all four cost attribution specialists because billing accuracy errors have direct financial consequences that cannot be corrected retroactively without customer disputes.
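The thresholds quoted above follow directly from the team composition. The sketch below recomputes them from the role counts in the Team Formation block; the variable names are illustrative and not part of the WHOOSH schema.

```python
# Role counts copied from the Team Formation configuration.
composition = {
    "lead-architect": 2,
    "gpu-metrics-engineer": 6,
    "telemetry-pipeline-engineer": 6,
    "cost-attribution-specialist": 4,
    "slo-engineer": 6,
    "distributed-tracing-specialist": 4,
    "energy-carbon-analyst": 4,
    "anomaly-detection-engineer": 4,
    "quota-manager": 2,
    "formal-spec-author": 2,
}

total_agents = sum(composition.values())

# Smallest count strictly greater than half of the council.
simple_majority = total_agents // 2 + 1

# Smallest count that is at least two thirds (integer ceiling of 2n/3).
two_thirds = (2 * total_agents + 2) // 3

# Billing contract: both lead architects plus all cost attribution specialists.
billing_signers = (
    composition["lead-architect"] + composition["cost-attribution-specialist"]
)

print(total_agents, simple_majority, two_thirds, billing_signers)
```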
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Success Criteria
|
||||||
|
|
||||||
|
1. **Metering coverage:** A complete catalogue of all GPU and system resources tracked by DistOS's telemetry layer, with coverage gaps explicitly documented and estimation methodologies described for each gap
|
||||||
|
2. **Pipeline throughput verification:** An architectural analysis showing that the selected telemetry pipeline can sustain the calculated peak ingest rate (approximately 39 MB/s base plus 3x burst headroom) with sub-500ms end-to-end latency from metric production to TSDB availability
|
||||||
|
3. **Attribution completeness proof:** A TLA+ specification with the invariant that every resource unit consumed in the cluster is attributed to exactly one tenant in every execution; verified by council-verify
|
||||||
|
4. **SLO enforcement latency:** A formal specification proving that every SLO violation is detected and enforcement action initiated within a bounded interval (target: 5 seconds from violation onset to enforcement action, 30 seconds to preemption if lower enforcement levels are insufficient)
|
||||||
|
5. **KACHING integration:** A ratified billing event schema that covers all billable resource types (GPU compute time, GPU memory-seconds, NVLink bandwidth, Weka I/O), with signed acknowledgement from the KACHING module team
|
||||||
|
6. **Energy metering validation:** A methodology document showing how per-workload energy estimates are derived and validated, including uncertainty bounds on the estimates
|
||||||
|
7. **Interface contracts ratified:** All three dependency interface contracts (council-sched, council-mem, council-net) are signed off by both parties and published to UCXL before Phase 3 begins
|
||||||
|
8. **Decision record completeness:** All 8 decision points have corresponding Decision Records with rationale and alternatives documented
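The invariant in criterion 3 can be read as a predicate over a single snapshot of the cluster: every consumed resource unit has exactly one attribution. The sketch below checks that predicate for one snapshot, using hypothetical unit and tenant identifiers. It is not a substitute for the TLA+ specification, which must establish the invariant over every reachable state rather than one observed state.

```python
from collections import Counter


def attribution_violations(consumed_units, attributions):
    """Return (unattributed, multiply_attributed) unit ids for one snapshot.

    consumed_units: iterable of resource-unit ids consumed in the cluster.
    attributions:   iterable of (unit_id, tenant_id) pairs.
    """
    counts = Counter(unit for unit, _tenant in attributions)
    units = set(consumed_units)
    unattributed = sorted(u for u in units if counts[u] == 0)
    multiply_attributed = sorted(u for u in units if counts[u] > 1)
    return unattributed, multiply_attributed


# Hypothetical snapshot: one unit has no tenant, one has two.
consumed = ["gpu0:t1", "gpu0:t2", "gpu1:t1"]
pairs = [
    ("gpu0:t1", "tenant-a"),
    ("gpu1:t1", "tenant-a"),
    ("gpu1:t1", "tenant-b"),
]
print(attribution_violations(consumed, pairs))
```

The invariant holds for a snapshot exactly when both returned lists are empty.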
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 9. Timeline
|
||||||
|
|
||||||
|
### Phase 1: Research (Days 1-3)
|
||||||
|
- Day 1: Activate council, assign roles; gpu-metrics-engineers begin DCGM counter catalogue; telemetry-pipeline-engineers begin throughput calculations; lead architects draft telemetry interface requirements document for distribution to other councils
|
||||||
|
- Day 2: GPU metrics survey, telemetry pipeline survey, and energy metering survey completed; SLO framework survey in progress; throughput estimate finalised and published
|
||||||
|
- Day 3: All research artifacts published to UCXL; telemetry interface requirements document v0 distributed to council-sched, council-mem, and council-net for early review; research phase closes with prioritised list of open questions
|
||||||
|
|
||||||
|
### Phase 2: Architecture (Days 3-6)
|
||||||
|
- Day 3-4: GPU metering design and telemetry pipeline design drafted; initial decision records for DP-TEL-01 and DP-TEL-02 circulated; cost attribution model draft initiated
|
||||||
|
- Day 4-5: SLO framework design and energy/carbon model drafted; telemetry API specification drafted and distributed to dependent councils for review; anomaly detection design initiated; DP-TEL-01 through DP-TEL-05 voted on
|
||||||
|
- Day 5-6: Remaining decision records voted on; telemetry API specification finalised pending council-sched, council-mem, and council-net acknowledgement; architecture artifacts published; architecture phase closes
|
||||||
|
|
||||||
|
### Phase 3: Formal Specification (Days 6-10)
|
||||||
|
- Day 6-7: SLO evaluation TLA+ specification and quota accounting TLA+ specification drafted; telemetry data model schema published; council-verify liaison established
|
||||||
|
- Day 7-8: Cost attribution TLA+ specification drafted; council-verify begins model checking on SLO evaluation spec
|
||||||
|
- Day 8-9: First model-checking results from council-verify; spec refinements as needed; KACHING billing integration document drafted
|
||||||
|
- Day 9-10: All TLA+ specifications in final state; formal spec phase closes with model-checking status documented for all three TLA+ specifications
|
||||||
|
|
||||||
|
### Phase 4: Integration (Days 10-12)
|
||||||
|
- Day 10-11: Interface contracts with council-sched, council-mem, and council-net finalised; any conflicts escalated to council-synth; KACHING billing integration reviewed by KACHING team
|
||||||
|
- Day 11-12: council-sec security compliance review completed; any outstanding interface issues resolved; final integration artifacts published
|
||||||
|
|
||||||
|
### Phase 5: Documentation (Days 12-14)
|
||||||
|
- Day 12-13: Telemetry reference specification assembled; SLO configuration guide and operator metering guide drafted
|
||||||
|
- Day 13-14: Carbon accounting guide and KACHING integration documentation completed; council-arch produces decision archaeology narrative; final UCXL navigability audit confirms all artifact addresses resolve correctly
|
||||||
464
councils/07-formal-verification.md
Normal file
@@ -0,0 +1,464 @@
|
|||||||
|
# Council Design Brief: Formal Verification
|
||||||
|
|
||||||
|
**Council ID:** `council-verify`
|
||||||
|
**Mission:** Provide rigorous, machine-checked correctness guarantees for every DistOS subsystem specification. This council does not design subsystems — it proves them correct, identifies flaws in proposed designs, and returns actionable verification results with enough precision that authoring councils can fix their specifications without ambiguity.
|
||||||
|
**UCXL Base Address:** `ucxl://council-verify:*@DistOS:verification/*`
|
||||||
|
**Agent Count:** ~80
|
||||||
|
**Status:** Design Brief — Constitution Phase
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Scope and Responsibilities
|
||||||
|
|
||||||
|
`council-verify` owns the formal correctness layer of DistOS. Its scope covers:
|
||||||
|
|
||||||
|
- Accepting formal specifications (TLA+, Alloy, Coq, Lean) from all six core subsystem councils and from `council-api` and `council-fault`
|
||||||
|
- Writing model-checking harnesses for TLC and Alloy Analyzer against those specifications
|
||||||
|
- Specifying and checking safety properties: mutual exclusion, absence of deadlock, bounded wait times, and memory safety invariants
- Specifying and checking liveness properties: progress, termination, freedom from starvation, and fairness under weak and strong fairness assumptions
|
||||||
|
- Constructing refinement mappings that relate abstract protocol specs to concrete implementation-level descriptions
|
||||||
|
- Verifying compositional properties: establishing that subsystem specs verified in isolation remain correct when composed
|
||||||
|
- Managing state-space explosion via symmetry reduction, partial-order reduction, abstraction, and bounded verification
|
||||||
|
- Producing structured verification reports that link every proved or falsified property back to the subsystem artifact that claimed it
|
||||||
|
- Maintaining a central registry of verified invariants so `council-synth` can detect when a cross-council design change invalidates a previously proved property
|
||||||
|
|
||||||
|
Responsibilities this council does **not** own: designing or modifying the subsystems themselves; producing human-readable narrative (that belongs to `council-docs` and `council-arch`); adversarial test case generation (owned by `council-qa`).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Research Domains
|
||||||
|
|
||||||
|
### 2.1 TLA+ Specification Language and TLC Model Checker
|
||||||
|
|
||||||
|
TLA+ (Temporal Logic of Actions, Lamport 1994) remains the most widely deployed formal specification language in industrial distributed systems. Agents must understand the full TLA+ grammar, including action formulas, the temporal operators (`[]`, `<>`, `~>`), the `ENABLED` predicate, refinement (`INSTANCE` with substitution), and the TLAPS proof system for machine-checked TLA+ proofs that go beyond what TLC's exhaustive checking of finite models can establish.
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Lamport, L. *Specifying Systems* (2002). Addison-Wesley. The definitive text. Chapter 8 on liveness and fairness is mandatory reading.
|
||||||
|
- Kuppe, M. A., Lamport, L., and Ricketts, D. "The TLA+ Toolbox Better, Safer, Cloudier." *F-IDE 2019*, EPTCS 310.
|
||||||
|
- Newcombe, C. et al. "How Amazon Web Services Uses Formal Methods." *Communications of the ACM* 58(4), 2015. Documents TLA+ specs for DynamoDB (replication and group membership), S3 (a fault-tolerant network algorithm and background data redistribution), EBS (volume management), and an internal distributed lock manager. Directly relevant because the DistOS storage layer shares fault-tolerance requirements with S3.
|
||||||
|
- Ongaro, D. "TLA+ specification for the Raft consensus algorithm." 2014. Published at `github.com/ongardie/raft.tla`. Agents should study how the spec handles leader election, log replication, and leader completeness; log compaction is outside the spec's scope.
|
||||||
|
- Helland, P. "Raft-TLA walk-through." AWS Builder's Library, 2023. Useful for understanding how abstract specs relate to actual implementations.
|
||||||
|
|
||||||
|
Agents must be fluent in the distinctions between safety (`[]P`) and liveness (`<>P`, `P ~> Q`) properties in the context of distributed system specs, and must understand that TLC can verify safety exhaustively for small models and liveness only under fairness assumptions.
|
||||||
|
|
||||||
|
### 2.2 Alloy Structural Modelling
|
||||||
|
|
||||||
|
Alloy (Jackson 2002) provides relational modelling with automatic satisfiability checking via the Alloy Analyzer (Kodkod back-end). It is better suited than TLA+ for verifying structural invariants: capability lattice consistency, access control policy completeness, and message schema constraints.
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Jackson, D. *Software Abstractions: Logic, Language, and Analysis* (2nd ed., 2012). MIT Press. Chapters 3–4 on relational logic and the Alloy language.
|
||||||
|
- Jackson, D. "Alloy: A Lightweight Object Modelling Notation." *ACM Transactions on Software Engineering and Methodology* 11(2), 2002.
|
||||||
|
- Dennis, G. et al. "Modular Verification of Code with SAT." *ISSTA 2006*. Covers compositional use of Alloy for modular systems — directly applicable to per-council specs.
|
||||||
|
|
||||||
|
Alloy is the preferred tool for `council-sec`'s capability model and `council-api`'s interface contracts.
|
||||||
|
|
||||||
|
### 2.3 Theorem Proving with Coq and Lean
|
||||||
|
|
||||||
|
For critical invariants where exhaustive model checking is infeasible (infinite state spaces, parameterised proofs over arbitrary numbers of nodes), mechanical theorem proving is required. The two target systems are:
|
||||||
|
|
||||||
|
- **Coq** (v8.18+): mature ecosystem, extensive libraries (Mathematical Components, Iris for concurrent separation logic). The closest OS-level precedent, seL4 (Klein et al., *SOSP 2009*, "seL4: Formal Verification of an OS Kernel"), was the first general-purpose OS kernel with a full functional correctness proof, but that proof was carried out in Isabelle/HOL rather than Coq; agent teams should study its proof methodology even though DistOS does not target Isabelle.
|
||||||
|
- **Lean 4**: newer, strong dependent type theory, mathlib. Agents should be aware of Lean's advantages for mathematical specifications and its growing industrial use.
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Klein, G. et al. "seL4: Formal Verification of an OS Kernel." *SOSP 2009*.
|
||||||
|
- Hawblitzel, C. et al. "IronFleet: Proving Practical Distributed Systems Correct." *SOSP 2015*. IronFleet verified a Paxos-based replicated state machine library (IronRSL) and a sharded key-value store (IronKV) end-to-end in Dafny, an SMT-backed verification language rather than an interactive prover such as Coq or Lean. The methodology (writing code in a verifiable subset and discharging proof obligations automatically) is the architectural target for DistOS's critical paths.
|
||||||
|
- Hawblitzel, C. et al. "Ironclad Apps: End-to-End Security via Automated Full-System Verification." *OSDI 2014*. Companion work on security verification.
|
||||||
|
- Jung, R. et al. "Iris: Monoids and Invariants as an Orthogonal Basis for Concurrent Reasoning." *POPL 2015*. Iris is the preferred logic for reasoning about concurrent OS code in Coq.
|
||||||
|
|
||||||
|
### 2.4 Safety Properties in Distributed Systems
|
||||||
|
|
||||||
|
Agents must be able to specify and check the following safety properties across all subsystems:
|
||||||
|
|
||||||
|
- **Mutual exclusion:** No two agents hold a critical resource simultaneously. Expressed as `[](~(hold(a) /\ hold(b)))` for all distinct agents a, b.
|
||||||
|
- **Deadlock freedom:** The system never reaches a state where no process can advance. In TLA+: `[](ENABLED Next)`.
|
||||||
|
- **Starvation freedom:** Every process that repeatedly requests a resource eventually obtains it. Unlike the other entries in this list, this is a liveness property and requires a fairness assumption (see Section 2.5).
|
||||||
|
- **Memory safety invariants:** No node accesses memory beyond its allocated region. Requires coupling the memory council's allocation spec with an invariant expressed over the allocated-region relation.
|
||||||
|
- **Invariant stability under node failure:** Safety properties must hold even when up to `f` nodes fail, where `f` is the declared failure threshold.
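To make the first two properties concrete, the following sketch does by hand what TLC does at scale: it enumerates every reachable state of a toy two-process lock protocol and checks mutual exclusion and deadlock freedom in each. The protocol, state encoding, and function names are invented for illustration and do not correspond to any DistOS subsystem.

```python
from collections import deque

IDLE, WAIT, CRIT = "idle", "wait", "crit"


def _set(pcs, i, value):
    """Return a copy of the program-counter tuple with entry i replaced."""
    return pcs[:i] + (value,) + pcs[i + 1:]


def successors(state):
    """Yield every state reachable in one step of the toy lock protocol."""
    pcs, lock = state
    for i in range(len(pcs)):
        if pcs[i] == IDLE:
            yield _set(pcs, i, WAIT), lock       # request the lock
        elif pcs[i] == WAIT and lock is None:
            yield _set(pcs, i, CRIT), i          # acquire the free lock
        elif pcs[i] == CRIT:
            yield _set(pcs, i, IDLE), None       # release the lock


def check(initial):
    """Breadth-first search returning (states_seen, mutex_ok, deadlock_free)."""
    seen, queue = {initial}, deque([initial])
    mutex_ok = deadlock_free = True
    while queue:
        state = queue.popleft()
        pcs, _lock = state
        if sum(pc == CRIT for pc in pcs) > 1:
            mutex_ok = False                     # two processes in CRIT
        nexts = list(successors(state))
        if not nexts:
            deadlock_free = False                # no enabled transition
        for nxt in nexts:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen), mutex_ok, deadlock_free


states, mutex_ok, deadlock_free = check(((IDLE, IDLE), None))
print(states, mutex_ok, deadlock_free)
```

Both checks are evaluated state by state, which is what makes them safety properties. Starvation freedom cannot be decided this way because it concerns infinite behaviours rather than individual states.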
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Alpern, B. and Schneider, F. "Defining Liveness." *Information Processing Letters* 21(4), 1985. Foundational paper establishing the safety/liveness dichotomy.
|
||||||
|
- Lamport, L. "Proving the Correctness of Multiprocess Programs." *IEEE Transactions on Software Engineering* 3(2), 1977.
|
||||||
|
|
||||||
|
### 2.5 Liveness Properties: Progress, Termination, and Fairness
|
||||||
|
|
||||||
|
Liveness verification is harder than safety because it requires reasoning about infinite behaviours. Key concepts:
|
||||||
|
|
||||||
|
- **Weak Fairness (WF):** If an action is continuously enabled, it eventually fires.
|
||||||
|
- **Strong Fairness (SF):** If an action is enabled infinitely often, even if not continuously, it eventually fires.
|
||||||
|
- **Progress:** Every submitted job eventually completes (relevant to `council-sched`).
|
||||||
|
- **Termination:** Every protocol run terminates (relevant to consensus in `council-fault`).
|
||||||
|
- **Wait-freedom:** Every operation completes in a finite number of its own steps regardless of the speed or failure of other processes. The strongest of the standard non-blocking progress guarantees; not required everywhere but should be analysed for critical paths.
|
||||||
|
|
||||||
|
Fairness assumptions must be documented explicitly in every spec. Agents must flag any property that relies on strong fairness, as this is a non-trivial assumption in distributed systems with Byzantine participants.
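For reference, the two fairness conditions are conventionally defined in TLA as follows, where $\langle A \rangle_v$ denotes an $A$ step that changes $v$. This is the standard formulation from the TLA literature, not a DistOS-specific definition.

```latex
\begin{align*}
\mathrm{WF}_v(A) &\;\triangleq\; \Diamond\Box\,\mathrm{ENABLED}\,\langle A \rangle_v \;\Rightarrow\; \Box\Diamond\,\langle A \rangle_v \\
\mathrm{SF}_v(A) &\;\triangleq\; \Box\Diamond\,\mathrm{ENABLED}\,\langle A \rangle_v \;\Rightarrow\; \Box\Diamond\,\langle A \rangle_v
\end{align*}
```

The only difference is the hypothesis: weak fairness requires the action to be enabled from some point onwards, strong fairness only that it is enabled infinitely often. Every behaviour satisfying $\mathrm{SF}_v(A)$ therefore also satisfies $\mathrm{WF}_v(A)$, which is why a property that needs strong fairness rests on the more demanding assumption.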
|
||||||
|
|
||||||
|
### 2.6 Refinement Mappings
|
||||||
|
|
||||||
|
A refinement mapping demonstrates that a concrete spec `C` correctly implements an abstract spec `A`. In TLA+, this is expressed as the implication `C => A(f)`, where `f` is a state function mapping `C`-variables to `A`-variables. Verification of refinement requires showing that every behaviour of `C` is a behaviour of `A(f)`.
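As a small illustration of the step-by-step obligation, the sketch below checks that an hour-and-minute clock refines an hour clock under the mapping that forgets the minute. The example systems and function names are chosen for illustration; the check covers only the safety part of refinement (each concrete step maps to an abstract step or a stuttering step) and assumes the initial state maps to a valid abstract initial state.

```python
HOURS, MINUTES = 12, 60


def concrete_next(state):
    """Hour-minute clock: advance one minute, rolling the hour over."""
    hour, minute = state
    if minute + 1 < MINUTES:
        return (hour, minute + 1)
    return ((hour + 1) % HOURS, 0)


def abstract_next(hour):
    """Hour clock: advance one hour."""
    return (hour + 1) % HOURS


def f(state):
    """Refinement mapping: project the concrete state onto the hour."""
    return state[0]


def refines(initial=(0, 0)):
    """Check every reachable concrete step is an abstract step or a stutter."""
    state, seen = initial, set()
    while state not in seen:
        seen.add(state)
        nxt = concrete_next(state)
        a, a_next = f(state), f(nxt)
        stutter = a_next == a
        abstract_step = a_next == abstract_next(a)
        if not (stutter or abstract_step):
            return False
        state = nxt
    return True


print(refines())
```

Fifty-nine of every sixty concrete steps map to stuttering steps of the abstract clock, which is why TLA+ next-state relations are written in the stuttering-insensitive form `[][Next]_vars`.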
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Lamport, L. "The Temporal Logic of Actions." *ACM Transactions on Programming Languages and Systems* 16(3), 1994.
|
||||||
|
- Abadi, M. and Lamport, L. "The Existence of Refinement Mappings." *Theoretical Computer Science* 82(2), 1991. Establishes when refinement mappings exist.
|
||||||
|
|
||||||
|
Every subsystem spec must provide two levels: an abstract protocol-level spec (suitable for human understanding and compositional reasoning) and a concrete implementation-level spec (suitable for derivation of data structures and algorithms). `council-verify` is responsible for verifying that the refinement mapping holds.
|
||||||
|
|
||||||
|
### 2.7 Compositional Verification
|
||||||
|
|
||||||
|
Verifying subsystems independently and then proving the composed system correct is the only tractable approach at scale. Key techniques:
|
||||||
|
|
||||||
|
- **Assume-Guarantee reasoning (A-G):** Each module assumes a property of its environment and, under that assumption, guarantees a property to its users. The composition is correct if each module's assumption is implied by the guarantees of the modules it depends on.
|
||||||
|
- **Interface specifications:** Every cross-council interface must have a formal spec of its pre/post conditions before compositional verification can proceed.
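The bookkeeping side of assume-guarantee reasoning can be sketched as a check that every assumption a module makes is discharged by some other module's guarantee or by a declared environment assumption. The `Contract` type and the property names below are hypothetical, and the sketch treats properties as opaque labels; establishing that a guarantee actually implies an assumption is the verification work itself.

```python
from dataclasses import dataclass, field


@dataclass
class Contract:
    """Assumptions and guarantees of one module, as opaque property labels."""
    name: str
    assumes: set = field(default_factory=set)
    guarantees: set = field(default_factory=set)


def undischarged(contracts, environment=frozenset()):
    """Map each module to assumptions that nothing else guarantees."""
    missing = {}
    for c in contracts:
        provided = set(environment)
        for other in contracts:
            if other is not c:
                provided |= other.guarantees
        gap = c.assumes - provided
        if gap:
            missing[c.name] = gap
    return missing


# Hypothetical contracts; the property labels are invented for illustration.
contracts = [
    Contract("council-sched",
             assumes={"counters_timestamped"},
             guarantees={"placement_events_ordered"}),
    Contract("council-telemetry",
             assumes={"placement_events_ordered", "clock_skew_bounded"},
             guarantees={"counters_timestamped"}),
]
print(undischarged(contracts))
```

Note that the two example contracts depend on each other. Circular dependencies of this kind are not automatically sound, particularly for liveness properties, which is the problem the McMillan reference below addresses.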
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- McMillan, K. "Circular Compositional Reasoning About Liveness." *CHARME 1999*.
|
||||||
|
- Henzinger, T. et al. "Assume-Guarantee Reasoning for Hierarchical Hybrid Systems." *HSCC 2001*.
|
||||||
|
|
||||||
|
`council-verify` will maintain an interface contract registry. Any subsystem change that violates an interface contract triggers an immediate verification task.
|
||||||
|
|
||||||
|
### 2.8 Distributed Systems Verification: Linearizability and Consistency Models
|
||||||
|
|
||||||
|
Subsystem specs must state their consistency model explicitly. `council-verify` verifies the stated model against the protocol spec:
|
||||||
|
|
||||||
|
- **Linearizability:** Each operation appears to take effect atomically at a single point between invocation and response (Herlihy and Wing, *TOPLAS 1990*).
|
||||||
|
- **Sequential consistency:** Operations appear in a total order consistent with each process's program order.
|
||||||
|
- **Causal consistency:** Causally related operations appear in causal order.
|
||||||
|
- **Eventual consistency:** Replicas converge to the same value in the absence of updates.
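The linearizability definition can be made concrete with a brute-force checker for a single read/write register. This is a sketch under simplifying assumptions: the `Op` record and timestamps are hypothetical, every operation is assumed to have completed, and the search tries every permutation, so it is only usable on very small histories.

```python
from itertools import permutations
from typing import NamedTuple


class Op(NamedTuple):
    kind: str        # "write" or "read"
    value: int       # value written, or value the read returned
    invoke: float    # invocation time
    response: float  # response time


def respects_real_time(order):
    """An operation that finished before another began must come first."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.response < earlier.invoke:
                return False
    return True


def legal_register(order, initial=0):
    """Sequential register semantics: each read returns the latest write."""
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def linearizable(history, initial=0):
    """True if some real-time-respecting total order is a legal execution."""
    return any(
        respects_real_time(p) and legal_register(p, initial)
        for p in permutations(history)
    )


overlapping = [Op("write", 1, 0.0, 2.0), Op("read", 1, 1.0, 3.0)]
stale_read = [Op("write", 1, 0.0, 1.0), Op("read", 0, 2.0, 3.0)]
print(linearizable(overlapping), linearizable(stale_read))
```

The second history fails because the read begins after the write has completed yet still returns the initial value. Dropping the `respects_real_time` condition while keeping per-process order would give a sequential consistency check instead, under which that history would be accepted.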
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Herlihy, M. and Wing, J. "Linearizability: A Correctness Condition for Concurrent Objects." *ACM Transactions on Programming Languages and Systems* 12(3), 1990.
|
||||||
|
- Burckhardt, S. *Principles of Eventual Consistency*. Foundations and Trends in Programming Languages 1(1–2), 2014. Systematic treatment of consistency models using execution graphs.
|
||||||
|
- Attiya, H. and Welch, J. *Distributed Computing: Fundamentals, Simulations, and Advanced Topics* (2nd ed., 2004). Reference for correctness conditions.
|
||||||
|
- Viotti, P. and Vukolić, M. "Consistency in Non-Transactional Distributed Storage Systems." *ACM Computing Surveys* 49(1), 2016. Comprehensive taxonomy of 50+ consistency models.
|
||||||
|
|
||||||
|
### 2.9 State Space Explosion Management
|
||||||
|
|
||||||
|
Model checking hits combinatorial explosion rapidly for distributed systems. Agents must be competent in:
|
||||||
|
|
||||||
|
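- **Symmetry reduction:** Exploiting permutation symmetry among identical nodes to collapse the state space. In TLC this is declared with the `SYMMETRY` statement in the configuration file, typically over a set built with `Permutations`; it is safe for invariant checking but can cause TLC to miss liveness violations.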
|
||||||
|
- **Partial-order reduction:** Avoiding redundant interleavings of independent actions. Implemented in SPIN; TLC has no built-in partial-order reduction, so comparable savings in TLA+ must come from modelling choices such as coarser-grained actions.
|
||||||
|
- **Abstraction and data abstraction:** Replacing concrete data types with abstract types that preserve the relevant structure.
|
||||||
|
- **Bounded verification:** Verifying safety for small model sizes (e.g., 3 nodes, 5 messages) and using induction for the general case.
|
||||||
|
- **Counterexample-guided abstraction refinement (CEGAR):** Automatically refining abstractions when spurious counterexamples are found.
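The effect of symmetry reduction can be seen on a toy model of identical processes competing for one critical section. The sketch below counts reachable states with and without a canonical form that sorts the per-process states; the model and names are invented for illustration, and sorting is only a valid canonicalisation because the processes are fully interchangeable.

```python
from collections import deque

IDLE, WAIT, CRIT = "idle", "wait", "crit"


def successors(state):
    """One step of any process; at most one process may be in CRIT."""
    for i, pc in enumerate(state):
        if pc == IDLE:
            yield state[:i] + (WAIT,) + state[i + 1:]
        elif pc == WAIT and CRIT not in state:
            yield state[:i] + (CRIT,) + state[i + 1:]
        elif pc == CRIT:
            yield state[:i] + (IDLE,) + state[i + 1:]


def reachable(n, canonical=lambda s: s):
    """Count reachable states, storing only canonical representatives."""
    start = canonical((IDLE,) * n)
    seen, queue = {start}, deque([start])
    while queue:
        for nxt in successors(queue.popleft()):
            nxt = canonical(nxt)
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return len(seen)


def sort_canonical(state):
    """Pick one representative per permutation class by sorting."""
    return tuple(sorted(state))


full = reachable(4)
reduced = reachable(4, sort_canonical)
print(full, reduced)
```

For four processes the count drops from 48 states to 9, and the gap widens quickly as the process count grows.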
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Clarke, E. et al. *Model Checking* (1999). MIT Press. The canonical textbook.
|
||||||
|
- Holzmann, G. *The SPIN Model Checker* (2003). Addison-Wesley.
|
||||||
|
- Clarke, E. et al. "Counterexample-Guided Abstraction Refinement." *CAV 2000*.
|
||||||
|
|
||||||
|
### 2.10 GPU Memory Model Formal Verification
|
||||||
|
|
||||||
|
Both NVIDIA and academic groups have published formal work on GPU memory models. DistOS must take this work into account because it targets Hopper, Grace, and Blackwell architectures:
|
||||||
|
|
||||||
|
- Alglave, J. et al. "GPU Concurrency: Weak Behaviours and Programming Assumptions." *ASPLOS 2015*. Establishes the formal framework for GPU memory models using axiomatic models.
|
||||||
|
- Lustig, D. et al. "Automated Synthesis of Comprehensive Memory Model Litmus Test Suites." *ASPLOS 2017*.
|
||||||
|
- NVIDIA. "NVIDIA Hopper Architecture In-Depth." 2022. Section on memory consistency model for NVLink and NVSwitch fabrics.
|
||||||
|
- Wickerson, J. et al. "Automatically Comparing Memory Consistency Models." *POPL 2017*.
|
||||||
|
|
||||||
|
`council-verify` must verify that the `council-mem` memory model specification is consistent with the Hopper/Blackwell hardware memory model documented by NVIDIA.
|
||||||
|
|
||||||
|
### 2.11 Industrial Case Studies
|
||||||
|
|
||||||
|
Agents must study these deployments to calibrate what is achievable in 14 days:
|
||||||
|
|
||||||
|
- **Amazon Web Services TLA+ usage:** DynamoDB, S3, EBS, and an internal distributed lock manager. Newcombe et al. 2015 (cited above) reports that these specifications uncovered subtle bugs that other methods had missed.
|
||||||
|
- **CockroachDB:** Published a TLA+ spec for their Parallel Commits protocol under `docs/tla-plus` in `github.com/cockroachdb/cockroach`. Model of their 2PC variant demonstrating absence of deadlock.
|
||||||
|
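- **etcd/Raft:** Formal verification via TLC. Howard, H., Malkhi, D., and Spiegelman, A. "Flexible Paxos: Quorum Intersection Revisited." *OPODIS 2016*. Relaxes the quorum-intersection requirement of Paxos; the analysis is a useful reference for quorum reasoning in Raft-style protocols.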
|
||||||
|
- **IronFleet** (Microsoft Research): End-to-end verified distributed system using Dafny. The most directly relevant existence proof that full-stack distributed system verification is feasible.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Agent Roles
|
||||||
|
|
||||||
|
| Role | Count | Responsibilities |
|
||||||
|
|------|-------|-----------------|
|
||||||
|
| Lead Verifier | 1 | Coordinates the council; assigns verification tasks; maintains the invariant registry; escalates to `council-synth` when a falsified property implicates another council |
|
||||||
|
| TLA+ Specialists | 20 | Write TLA+ specs and TLC harnesses for assigned subsystems; drive TLC jobs; interpret counterexamples; write refinement mappings |
|
||||||
|
| Alloy Modellers | 10 | Write Alloy models for structural invariants (capability lattices, interface schemas, access control); run Alloy Analyzer jobs |
|
||||||
|
| Theorem Prover Agents | 10 | Handle parameterised proofs in Coq or Lean where exhaustive checking is infeasible; prove inductive invariants for unbounded node counts |
|
||||||
|
| Liveness Specialists | 8 | Focus exclusively on liveness properties, fairness assumptions, and progress guarantees across all subsystems; track fairness debts |
|
||||||
|
| Refinement Analysts | 8 | Construct and verify refinement mappings between abstract and concrete specs; maintain the two-level spec contract for each subsystem |
|
||||||
|
| Compositional Integrators | 8 | Maintain the assume-guarantee decomposition across councils; track interface contract evolution; recheck composed properties when any component changes |
|
||||||
|
| GPU Memory Model Analysts | 6 | Verify the memory subsystem spec against Hopper/Blackwell hardware memory models; consult Alglave et al. axiomatic framework |
|
||||||
|
| Verification Report Writers | 9 | Produce structured verification reports in UCXL-addressed artifacts; translate counterexamples into prose that authoring councils can act on |
|
||||||
|
|
||||||
|
**Total:** 80 agents
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Key Deliverables
|
||||||
|
|
||||||
|
All artifacts are addressed using the pattern `ucxl://council-verify:{role}@DistOS:verification/^^/{artifact-type}/{name}`.
|
||||||
|
|
||||||
|
### 4.1 Verified Property Registry
|
||||||
|
|
||||||
|
A central machine-readable registry mapping every claimed property (safety or liveness) to its verification status (proved, falsified with counterexample, in-progress, deferred), the spec artifact it was verified against, and the TLC/Coq/Alloy job that produced the result.
|
||||||
|
|
||||||
|
```
|
||||||
|
ucxl://council-verify:lead-verifier@DistOS:verification/^^/registries/verified-properties.json
|
||||||
|
```
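One possible shape for a registry entry is sketched below. The field names, the validation rules, and the example UCXL address are assumptions made for illustration; the actual schema is a deliverable of this council and has not been fixed here.

```python
import json
from dataclasses import asdict, dataclass
from enum import Enum


class Status(str, Enum):
    """The four verification statuses named in the registry description."""
    PROVED = "proved"
    FALSIFIED = "falsified"
    IN_PROGRESS = "in-progress"
    DEFERRED = "deferred"


@dataclass
class RegistryEntry:
    property_id: str
    kind: str                 # "safety" or "liveness"
    status: Status
    spec_artifact: str        # UCXL address of the spec that was checked
    tool_job: str             # identifier of the TLC/Coq/Alloy job
    counterexample: str = ""  # reference to a trace, if falsified

    def __post_init__(self):
        if self.kind not in ("safety", "liveness"):
            raise ValueError("kind must be 'safety' or 'liveness'")
        if self.status is Status.FALSIFIED and not self.counterexample:
            raise ValueError("a falsified property needs a counterexample")


# Hypothetical entry; the address and job id are placeholders.
entry = RegistryEntry(
    property_id="sched-mutual-exclusion",
    kind="safety",
    status=Status.PROVED,
    spec_artifact="ucxl://council-sched:example-role@DistOS:example/spec.tla",
    tool_job="tlc-job-0001",
)
print(json.dumps(asdict(entry), indent=2))
```

Requiring a counterexample reference on every falsified entry keeps the registry actionable: an authoring council can always trace a failed property back to a concrete trace.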
|
||||||
|
|
||||||
|
### 4.2 Per-Subsystem TLA+ Verification Reports
|
||||||
|
|
||||||
|
One verification report per subsystem council. Each report covers: which properties were checked, the model parameters used, verification outcome, any counterexamples with step-by-step traces, and recommended spec fixes.
|
||||||
|
|
||||||
|
```
|
||||||
|
ucxl://council-verify:tla-specialist@DistOS:verification/^^/reports/sched-verification-report.md
|
||||||
|
ucxl://council-verify:tla-specialist@DistOS:verification/^^/reports/mem-verification-report.md
|
||||||
|
ucxl://council-verify:tla-specialist@DistOS:verification/^^/reports/net-verification-report.md
|
||||||
|
ucxl://council-verify:tla-specialist@DistOS:verification/^^/reports/fault-verification-report.md
|
||||||
|
ucxl://council-verify:tla-specialist@DistOS:verification/^^/reports/sec-verification-report.md
|
||||||
|
ucxl://council-verify:tla-specialist@DistOS:verification/^^/reports/telemetry-verification-report.md
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4.3 Alloy Structural Models
|
||||||
|
|
||||||
|
```
|
||||||
|
ucxl://council-verify:alloy-modeller@DistOS:verification/^^/specs/capability-lattice.als
|
||||||
|
ucxl://council-verify:alloy-modeller@DistOS:verification/^^/specs/api-interface-contracts.als
|
||||||
|
ucxl://council-verify:alloy-modeller@DistOS:verification/^^/specs/memory-region-invariants.als
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4.4 Coq/Lean Theorem Proofs
|
||||||
|
|
||||||
|
```
|
||||||
|
ucxl://council-verify:theorem-prover@DistOS:verification/^^/proofs/consensus-termination.v
|
||||||
|
ucxl://council-verify:theorem-prover@DistOS:verification/^^/proofs/scheduler-fairness.lean
|
||||||
|
ucxl://council-verify:theorem-prover@DistOS:verification/^^/proofs/memory-safety-invariant.v
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4.5 Refinement Mapping Documents
|
||||||
|
|
||||||
|
```
|
||||||
|
ucxl://council-verify:refinement-analyst@DistOS:verification/^^/refinements/sched-abstract-to-concrete.md
|
||||||
|
ucxl://council-verify:refinement-analyst@DistOS:verification/^^/refinements/fault-paxos-to-implementation.md
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4.6 Compositional Verification Summary
|
||||||
|
|
||||||
|
The master document proving that the composed DistOS spec (all subsystems together) satisfies the system-level properties enumerated in the Project Constitution.
|
||||||
|
|
||||||
|
```
|
||||||
|
ucxl://council-verify:compositional-integrator@DistOS:verification/^^/reports/compositional-correctness.md
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4.7 GPU Memory Model Conformance Report
|
||||||
|
|
||||||
|
```
|
||||||
|
ucxl://council-verify:gpu-mem-analyst@DistOS:verification/^^/reports/gpu-memory-model-conformance.md
|
||||||
|
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Decision Points
|
||||||
|
|
||||||
|
The following architectural questions must be resolved by this council in collaboration with the councils noted. Each decision must be recorded as a Decision Record (DR) at the UCXL address pattern `ucxl://council-verify:lead-verifier@DistOS:verification/^^/decisions/{dr-id}.md`.
|
||||||
|
|
||||||
|
### DP-V01: Primary Specification Language
|
||||||
|
|
||||||
|
Choose between TLA+ (dominant in industry, mature TLC checker), PlusCal (higher-level algorithmic notation that compiles to TLA+), or a hybrid approach. Decision criteria: agent familiarity, tool availability, expressiveness for GPU-specific memory model properties.
|
||||||
|
**Deciding parties:** Lead Verifier, TLA+ Specialists, `council-synth`
|
||||||
|
|
||||||
|
### DP-V02: Verification Depth vs. Coverage Trade-off
|
||||||
|
|
||||||
|
Given 14 days, the council cannot exhaustively verify every subsystem at every level. Decide which subsystems require: (a) full TLC model checking of both safety and liveness properties, (b) safety-only checking, (c) Alloy structural check only, (d) manual proof review. The scheduling and fault tolerance subsystems should be candidates for (a).
|
||||||
|
**Deciding parties:** Lead Verifier, all specialists, `council-meta`
|
||||||
|
|
||||||
|
### DP-V03: Consistency Model for the Memory Subsystem
|
||||||
|
|
||||||
|
`council-verify` must agree with `council-mem` on the stated consistency model before verification can begin. Options: linearizability (strongest, highest cost), sequential consistency, causal consistency, release-acquire (matching hardware), eventual consistency. The choice directly determines which properties are verifiable and which invariants are admissible.
|
||||||
|
**Deciding parties:** `council-verify` Refinement Analysts, `council-mem`, `council-synth`
|
||||||
|
|
||||||
|
### DP-V04: Fairness Assumption Standard
|
||||||
|
|
||||||
|
Establish the minimum fairness assumption that all subsystem specs must satisfy. Options: no fairness (specs must be self-scheduling), weak fairness (WF) on all actions, strong fairness (SF) on specific actions. This decision affects what liveness properties can be claimed system-wide.
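For reference, the two fairness conditions named above have the following standard TLA definitions, for an action $A$ and a state function $v$ (typically the tuple of all spec variables). They are restated here only so the options can be compared precisely; the notation follows the usual TLA+ convention and is not specific to DistOS.

```latex
\begin{aligned}
\mathrm{WF}_v(A) &\triangleq \Diamond\Box\,\mathrm{ENABLED}\,\langle A\rangle_v \;\Rightarrow\; \Box\Diamond\,\langle A\rangle_v \\
\mathrm{SF}_v(A) &\triangleq \Box\Diamond\,\mathrm{ENABLED}\,\langle A\rangle_v \;\Rightarrow\; \Box\Diamond\,\langle A\rangle_v
\end{aligned}
```

Weak fairness only obliges an action that stays continuously enabled; strong fairness also obliges an action that is enabled infinitely often but intermittently. Strong fairness is therefore the stronger assumption and permits more liveness claims, at the cost of being harder to justify in an implementation.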
|
||||||
|
**Deciding parties:** Liveness Specialists, `council-fault`, `council-sched`
|
||||||
|
|
||||||
|
### DP-V05: Parameterised Proof Strategy for Node Count
|
||||||
|
|
||||||
|
Determine how to handle proofs over arbitrary numbers of nodes (1024 in production) when TLC can only check small models. Options: inductive invariant proofs in Coq (high effort), boundary case analysis (check 3, 5, and 7 nodes and argue by symmetry), or symbolic model checking with IC3/PDR over AIGER encodings, which removes the bound on execution depth but still fixes the node count per run.
|
||||||
|
**Deciding parties:** Theorem Prover Agents, `council-fault`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Dependencies on Other Councils
|
||||||
|
|
||||||
|
`council-verify` is a downstream consumer of specifications produced by all other councils. It is also an upstream dependency for any council that needs a correctness certificate before proceeding to integration.
|
||||||
|
|
||||||
|
| Council | Relationship | What council-verify consumes | What council-verify produces |
|---------|-------------|------------------------------|------------------------------|
| `council-sched` | Bidirectional | TLA+ scheduling protocol spec | Verification report; falsified properties with counterexamples |
| `council-mem` | Bidirectional | Memory model TLA+ spec; consistency model choice | Memory safety verification; GPU model conformance |
| `council-net` | Bidirectional | Network protocol TLA+ spec | Deadlock-freedom and progress reports |
| `council-fault` | Bidirectional | Consensus TLA+ spec (Raft/Paxos variant) | Termination proof; Byzantine resilience analysis |
| `council-sec` | Bidirectional | Capability lattice description; isolation invariants | Alloy structural model verification; capability safety report |
| `council-telemetry` | Consuming | Metering protocol description | Bounded-wait verification for accounting operations |
| `council-api` | Bidirectional | Interface contracts; system call semantics | API contract verification; pre/post condition checking |
| `council-qa` | Collaborative | Conformance test requirements | Verification results that inform test targets |
| `council-synth` | Reporting | Cross-council conflict notifications | Falsification results that trigger conflict resolution |
| `council-docs` | Providing | N/A | Verified invariant statements for inclusion in formal spec document |
|
||||||
|
|
||||||
|
**Critical path constraint:** `council-verify` cannot begin checking a subsystem spec until that spec is at least 70% complete. The Phase 2 architecture documents from each council are the minimum viable input.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. WHOOSH Configuration
|
||||||
|
|
||||||
|
### 7.1 Team Formation
|
||||||
|
|
||||||
|
```yaml
council_id: council-verify
display_name: "Formal Verification Council"
target_size: 80
formation_strategy: competency_weighted
required_roles:
  - role: lead-verifier
    count: 1
    persona: systems-analyst
    competencies: [tla-plus, model-checking, distributed-systems, proof-theory]
  - role: tla-specialist
    count: 20
    persona: technical-specialist
    competencies: [tla-plus, tlc, liveness-properties, refinement-mapping]
  - role: alloy-modeller
    count: 10
    persona: technical-specialist
    competencies: [alloy, relational-logic, structural-verification]
  - role: theorem-prover
    count: 10
    persona: technical-specialist
    competencies: [coq, lean4, dependent-types, inductive-proofs]
  - role: liveness-specialist
    count: 8
    persona: technical-specialist
    competencies: [fairness, liveness, progress-guarantees, temporal-logic]
  - role: refinement-analyst
    count: 8
    persona: systems-analyst
    competencies: [refinement-mappings, abstraction, abadi-lamport]
  - role: compositional-integrator
    count: 8
    persona: systems-analyst
    competencies: [assume-guarantee, compositional-reasoning, interface-contracts]
  - role: gpu-mem-analyst
    count: 6
    persona: technical-specialist
    competencies: [gpu-memory-models, hopper-architecture, axiomatic-models]
  - role: verification-report-writer
    count: 9
    persona: technical-writer
    competencies: [technical-writing, counterexample-analysis, verification-reports]
```
|
||||||
|
|
||||||
|
### 7.2 Quorum Rules
|
||||||
|
|
||||||
|
```yaml
quorum:
  decision_threshold: 0.6            # 60% of active agents must agree
  lead_verifier_veto: true           # Lead Verifier can block any property acceptance
  minimum_specialist_agreement: 3    # At least 3 domain specialists must agree on any claim of "proved"
  falsification_threshold: 1         # A single counterexample from any agent is sufficient to falsify
  cross_council_escalation:
    trigger: falsified_property
    target: council-synth
    response_sla_hours: 4
```
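The way these quorum rules interact can be sketched in a few lines of Python. This is a minimal illustration, not WHOOSH's evaluation logic: the `Ballot` type, its field names, the result strings, and the order in which the rules are checked are all assumptions made for the example. Only the numeric thresholds come from the configuration above.

```python
from dataclasses import dataclass

# Thresholds copied from the quorum configuration above.
DECISION_THRESHOLD = 0.6
MIN_SPECIALIST_AGREEMENT = 3
FALSIFICATION_THRESHOLD = 1


@dataclass
class Ballot:
    """Hypothetical tally for one claim of "proved" (field names are assumptions)."""
    active_agents: int
    votes_for: int
    specialist_votes_for: int
    lead_verifier_veto: bool = False
    counterexamples: int = 0


def evaluate(ballot: Ballot) -> str:
    # Assumed ordering: falsification is checked first, so a single
    # counterexample outweighs any majority.
    if ballot.counterexamples >= FALSIFICATION_THRESHOLD:
        return "falsified"  # would trigger escalation to council-synth
    if ballot.lead_verifier_veto:
        return "blocked"
    if ballot.active_agents == 0:
        return "rejected"
    if ballot.specialist_votes_for < MIN_SPECIALIST_AGREEMENT:
        return "rejected"
    if ballot.votes_for / ballot.active_agents < DECISION_THRESHOLD:
        return "rejected"
    return "accepted"
```

Under this reading, a claim supported by 79 of 80 agents is still reported as falsified if any agent has produced a counterexample, which matches the intent of `falsification_threshold: 1`.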
|
||||||
|
|
||||||
|
### 7.3 Subchannels
|
||||||
|
|
||||||
|
```yaml
subchannels:
  - id: verify-tlaplus-harnesses
    subscribers: [tla-specialist, lead-verifier]
    purpose: "TLC job coordination, model parameter negotiation, counterexample sharing"
    ucxl_feed: "ucxl://council-verify:tla-specialist@DistOS:verification/^^/specs/*.tla"

  - id: verify-alloy-sessions
    subscribers: [alloy-modeller, lead-verifier]
    purpose: "Alloy Analyzer session coordination and structural model review"
    ucxl_feed: "ucxl://council-verify:alloy-modeller@DistOS:verification/^^/specs/*.als"

  - id: verify-theorem-proving
    subscribers: [theorem-prover, liveness-specialist, lead-verifier]
    purpose: "Coq/Lean proof progress and tactic strategy discussion"
    ucxl_feed: "ucxl://council-verify:theorem-prover@DistOS:verification/^^/proofs/*"

  - id: verify-counterexamples
    subscribers: [all]
    purpose: "Broadcast falsification events; all agents subscribe to detect cross-cutting impact"
    ucxl_feed: "ucxl://council-verify:lead-verifier@DistOS:verification/^^/counterexamples/*"

  - id: verify-cross-council-inbound
    subscribers: [lead-verifier, refinement-analyst, compositional-integrator]
    purpose: "Receive spec updates from all subsystem councils; triage for re-verification"
    ucxl_feed: "ucxl://council-*:*@DistOS:*/^^/specs/*"

  - id: verify-reports-outbound
    subscribers: [verification-report-writer, lead-verifier]
    purpose: "Coordinate and publish final verification reports to BUBBLE"
    ucxl_feed: "ucxl://council-verify:verification-report-writer@DistOS:verification/^^/reports/*"
```
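To make the routing implied by the `ucxl_feed` patterns concrete, the sketch below matches a UCXL address against three of the feeds. It assumes that `*` in a feed behaves like a shell-style glob, which this brief does not specify; the real UCXL resolver may use different wildcard semantics. The example addresses in the usage note are hypothetical.

```python
from fnmatch import fnmatchcase

# Feed patterns copied from the subchannel configuration above.
FEEDS = {
    "verify-tlaplus-harnesses":
        "ucxl://council-verify:tla-specialist@DistOS:verification/^^/specs/*.tla",
    "verify-cross-council-inbound":
        "ucxl://council-*:*@DistOS:*/^^/specs/*",
    "verify-reports-outbound":
        "ucxl://council-verify:verification-report-writer@DistOS:verification/^^/reports/*",
}


def matching_subchannels(address: str) -> list[str]:
    """Return the ids of all subchannels whose feed pattern matches the address.

    Assumption: '*' is a glob wildcard that may span any characters.
    """
    return sorted(
        sid for sid, pattern in FEEDS.items() if fnmatchcase(address, pattern)
    )
```

With these assumptions, a spec published by another council (for example a hypothetical `ucxl://council-sched:architect@DistOS:scheduling/^^/specs/scheduler.tla`) reaches only `verify-cross-council-inbound`, while a `.tla` file published by a `council-verify` TLA+ specialist matches both the harness channel and the inbound channel, because `council-*` also matches `council-verify`. Whether that overlap is intended is worth settling when the feeds are finalised.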
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Success Criteria
|
||||||
|
|
||||||
|
1. **Safety completeness:** Every subsystem spec has at least one mechanically checked safety property with a passing TLC run or Alloy check. Zero unverified safety claims in the final spec.
|
||||||
|
2. **Liveness coverage:** At least four of the six core subsystems have at least one mechanically checked liveness property under a documented fairness assumption.
|
||||||
|
3. **Refinement completeness:** At least three subsystems have a verified refinement mapping from abstract to concrete spec.
|
||||||
|
4. **No open counterexamples:** All TLC counterexamples discovered during the project are either resolved (spec corrected and re-verified) or explicitly deferred with a documented rationale and risk assessment.
|
||||||
|
5. **Compositional correctness:** The compositional verification report exists and demonstrates that system-level safety properties hold given the per-subsystem assume-guarantee decomposition.
|
||||||
|
6. **GPU model conformance:** The `council-mem` memory model spec is verified consistent with the Hopper/Blackwell hardware memory model specification.
|
||||||
|
7. **Invariant registry completeness:** The verified property registry contains an entry for every property claimed anywhere in the DistOS specification documents.
|
||||||
|
8. **Response SLA:** All counterexample reports are delivered to the originating council within 4 hours of falsification. All verification requests from `council-api` and `council-synth` are acknowledged within 2 hours.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 9. Timeline
|
||||||
|
|
||||||
|
### Phase 1: Research (Days 1–3)
|
||||||
|
|
||||||
|
- All agents survey the reference systems and papers listed in Section 2
|
||||||
|
- TLA+ Specialists identify TLA+ idioms and patterns from Amazon, CockroachDB, and Raft specs that are directly applicable to DistOS
|
||||||
|
- Alloy Modellers establish the Alloy modelling vocabulary for capability lattices
|
||||||
|
- Theorem Prover Agents assess Coq/Lean library coverage for distributed systems reasoning
|
||||||
|
- GPU Memory Model Analysts study the Hopper/Blackwell architecture documentation and the Alglave et al. framework
|
||||||
|
- Lead Verifier drafts the verification scope matrix: which councils, which subsystems, which properties, which tools, in which order
|
||||||
|
- Deliverable: `ucxl://council-verify:lead-verifier@DistOS:verification/^^/research/verification-scope-matrix.md`
|
||||||
|
|
||||||
|
### Phase 2: Architecture (Days 3–6)
|
||||||
|
|
||||||
|
- Resolve Decision Points DP-V01 through DP-V05 with full DR records
|
||||||
|
- Establish interface contract templates and distribute to all subsystem councils
|
||||||
|
- Begin writing TLC harnesses for scheduling and fault tolerance specs (these arrive first from their councils)
|
||||||
|
- Establish the assume-guarantee decomposition for compositional verification
|
||||||
|
- Compositional Integrators begin mapping inter-council interface contracts
|
||||||
|
- Deliverable: `ucxl://council-verify:lead-verifier@DistOS:verification/^^/decisions/dp-v01-through-v05.md`
|
||||||
|
|
||||||
|
### Phase 3: Formal Specification (Days 6–10)
|
||||||
|
|
||||||
|
- Primary verification window: TLC, Alloy Analyzer, and Coq/Lean jobs run continuously
|
||||||
|
- TLA+ Specialists process all six core subsystem specs as they arrive from councils
|
||||||
|
- Alloy Modellers verify `council-sec` capability lattice and `council-api` interface contracts
|
||||||
|
- Theorem Prover Agents focus on consensus termination (Coq) and scheduler fairness (Lean)
|
||||||
|
- All counterexamples reported back to originating councils within 4 hours
|
||||||
|
- Liveness Specialists audit all specs for missing fairness assumptions
|
||||||
|
- Refinement Analysts begin mapping abstract to concrete for sched and fault subsystems
|
||||||
|
- Deliverable: All six per-subsystem verification reports (drafts) at verification report UCXL addresses
|
||||||
|
|
||||||
|
### Phase 4: Integration (Days 10–12)
|
||||||
|
|
||||||
|
- Compositional verification: compose all verified subsystem specs and check system-level properties
|
||||||
|
- Final resolution of any outstanding counterexamples
|
||||||
|
- GPU memory model conformance verification
|
||||||
|
- Verification Report Writers produce final reports for all subsystems
|
||||||
|
- Lead Verifier compiles the verified property registry
|
||||||
|
- Deliver compositional correctness report to `council-synth`
|
||||||
|
- Deliverable: `ucxl://council-verify:compositional-integrator@DistOS:verification/^^/reports/compositional-correctness.md`
|
||||||
|
|
||||||
|
### Phase 5: Documentation (Days 12–14)
|
||||||
|
|
||||||
|
- Verification Report Writers produce human-readable summaries of all verification results for inclusion in the main DistOS specification
|
||||||
|
- Lead Verifier produces a final verification audit trail navigable via UCXL temporal navigation
|
||||||
|
- All verification artifacts committed to BUBBLE decision record system
|
||||||
|
- Final verified property registry published
|
||||||
|
- Support `council-docs` in producing the Formal Verification appendix of the DistOS specification
|
||||||
|
- Deliverable: `ucxl://council-verify:lead-verifier@DistOS:verification/^^/reports/final-audit-trail.md`
|
||||||
514
councils/08-api-surface.md
Normal file
@@ -0,0 +1,514 @@
|
|||||||
|
# Council Design Brief: API Surface and Developer Experience
|
||||||
|
|
||||||
|
**Council ID:** `council-api`
|
||||||
|
**Mission:** Define the complete, coherent, and ergonomic interface between DistOS and its users — operators, application developers, and other systems. This council decides what the operating system looks like from the outside: system calls, SDK bindings, CLI tools, and the conventions that make all of the above consistent and maintainable across language boundaries and API versions.
|
||||||
|
**UCXL Base Address:** `ucxl://council-api:*@DistOS:api/*`
|
||||||
|
**Agent Count:** ~40
|
||||||
|
**Status:** Design Brief — Constitution Phase
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Scope and Responsibilities
|
||||||
|
|
||||||
|
`council-api` owns the external interface contract of DistOS. Its scope covers:
|
||||||
|
|
||||||
|
- Deciding the overall API philosophy: POSIX-compatible extension, clean-slate design, or a layered model that offers both
|
||||||
|
- Defining GPU-native system calls for kernel launch, memory allocation, device-to-device transfers, stream and graph management, and event synchronisation
|
||||||
|
- Defining distributed system calls: remote procedure invocation (covering both synchronous RPC and async futures), distributed lock acquisition and release, barriers, and collective operations across node groups
|
||||||
|
- Designing an async-first API surface that aligns with modern language runtimes (Rust `async`/`await`, Go goroutines, Python `asyncio`)
|
||||||
|
- Establishing error handling conventions, including integration with UCXL response codes for errors that carry provenance (which node, which operation, at what logical time)
|
||||||
|
- Designing the SDK for four target languages: C (ABI-stable systems interface), Rust (idiomatic, zero-cost), Go (ergonomic, channel-friendly), and Python (user-friendly, numpy-compatible)
|
||||||
|
- Designing CLI tooling for cluster management: node status, job submission, resource inspection, log retrieval, and administrative operations
|
||||||
|
- Defining the API versioning and evolution strategy: how new calls are introduced, how deprecated calls are retired, compatibility guarantees across minor and major versions
|
||||||
|
- Producing API reference documentation that is precise enough to serve as a normative source alongside the formal spec
|
||||||
|
- Specifying example applications that exercise non-trivial API paths and serve as integration test targets
|
||||||
|
|
||||||
|
Responsibilities this council does **not** own: kernel implementation (owned by subsystem councils); formal verification of API contracts (owned by `council-verify`); security policy enforcement (owned by `council-sec`, though `council-api` designs the authentication and authorisation API surface in coordination with it); monitoring and metering calls (owned by `council-telemetry`, though `council-api` exposes the SDK surface for those).
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Research Domains
|
||||||
|
|
||||||
|
### 2.1 POSIX Compatibility vs. Clean-Slate Design
|
||||||
|
|
||||||
|
POSIX (IEEE 1003.1) defines the canonical Unix system call interface. Its strengths are: near-universal language runtime support, a mature ecosystem of tools, and decades of developer familiarity. Its weaknesses in a GPU-cluster OS context are: blocking I/O semantics that assume CPU-thread models, file-descriptor-centric resource management ill-suited to GPU memory objects, and no native concept of distributed operations or remote memory.
|
||||||
|
|
||||||
|
Three design philosophies must be fully researched before the council can decide:
|
||||||
|
|
||||||
|
- **POSIX-compatible extension:** Retain the full POSIX interface and extend it with GPU and distributed primitives as optional add-ons. Applications written for Linux run unmodified; GPU-aware applications opt into extensions. This is the approach taken by CUDA (which layers a driver API on top of the OS) and by ROCm/HIP.
|
||||||
|
- **Clean-slate design:** Design an interface optimal for the DistOS hardware target without backward-compatibility constraints. This allows stronger type safety, async-native semantics, and a capability-based resource model from the first call. Plan 9 (Pike et al.) and Fuchsia (Zircon) are the primary existence proofs.
|
||||||
|
- **Layered model:** Provide a clean-slate primary API and a POSIX compatibility layer implemented on top of it. This is the architectural recommendation for evaluation. The compatibility layer has a defined cost budget.
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- The Open Group. *The Single UNIX Specification (SUSv4/POSIX.1-2017)*. The normative POSIX reference.
|
||||||
|
- Pike, R. et al. "Plan 9 from Bell Labs." *Proceedings of the Summer 1990 UKUUG Conference*. Plan 9's contribution is the 9P protocol: everything is a file, including processes and network connections. The simplicity of the resource model is instructive even if DistOS does not adopt 9P verbatim.
|
||||||
|
- Pike, R. et al. "The Use of Name Spaces in Plan 9." *EUUG Newsletter* 12(1), 1992.
|
||||||
|
- Google. *Fuchsia OS: Zircon Kernel Objects*. https://fuchsia.dev/fuchsia-src/concepts/kernel. Zircon uses a capability-based object system with handles as the only way to reference kernel objects. This is the most complete modern clean-slate OS design and must be studied in depth.
|
||||||
|
|
||||||
|
### 2.2 GPU-Native System Calls
|
||||||
|
|
||||||
|
The CUDA Driver API provides the lowest-level GPU control surface available: `cuInit`, `cuDeviceGet`, `cuCtxCreate`, `cuMemAlloc`, `cuLaunchKernel`, `cuEventRecord`, `cuStreamWaitEvent`. It is the reference for what a GPU system call interface must cover.
|
||||||
|
|
||||||
|
Agents must evaluate the tradeoffs between:
|
||||||
|
- **Driver-level API** (CUDA Driver API / ROCm HIP Low-Level): explicit context management, explicit stream management, maximum control, verbose
|
||||||
|
- **Runtime API** (CUDA Runtime / ROCm): implicit context, automatic stream assignment, less control, more ergonomic
|
||||||
|
- **Graph-based execution** (CUDA Graphs / HIP Graphs): capture a sequence of operations as a graph for repeated execution with lower launch overhead. Critical for the 1024-node deployment where kernel launch overhead accumulates.
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- NVIDIA. *CUDA Driver API Reference Manual*. https://docs.nvidia.com/cuda/cuda-driver-api/. Normative reference for GPU system call semantics.
|
||||||
|
- NVIDIA. *CUDA C Programming Guide* (Chapter 3: Programming Interface). Covers the Runtime API and its relationship to the Driver API.
|
||||||
|
- NVIDIA. *CUDA Graphs* documentation. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs. The graph execution model is essential for understanding low-latency repeated workloads on Hopper and Blackwell.
|
||||||
|
- Khronos Group. *OpenCL 3.0 Specification*. https://www.khronos.org/opencl/. The vendor-neutral GPU programming API. DistOS must decide whether to support OpenCL alongside CUDA semantics.
|
||||||
|
- Khronos Group. *SYCL 2020 Specification*. https://www.khronos.org/sycl/. SYCL provides a C++ abstraction over OpenCL and oneAPI targets. Intel's oneAPI unifies GPU programming across vendors and is a candidate for the DistOS higher-level SDK layer.
|
||||||
|
- Intel. *oneAPI Programming Guide*. https://www.intel.com/content/www/us/en/developer/tools/oneapi/programming-guide.html.
|
||||||
|
- NVIDIA. *NVLink and NVSwitch Architecture Overview*. https://www.nvidia.com/en-us/data-center/nvlink/. GPU-to-GPU direct access semantics affect memory system call design.
|
||||||
|
|
||||||
|
Blackwell-specific: The GB200 NVL72 uses the NVLink Switch System to connect 72 GPUs in a single flat memory domain. Driver API calls such as `cuMemAdvise` and `cuMemPrefetchAsync` take on new semantics in this topology. Agents must review:
|
||||||
|
- NVIDIA. *NVIDIA Blackwell Architecture Technical Brief*. 2024.
|
||||||
|
|
||||||
|
### 2.3 Distributed System Calls
|
||||||
|
|
||||||
|
System calls that span nodes are novel: POSIX has no notion of them. The design space covers:
|
||||||
|
|
||||||
|
- **Remote procedure invocation:** How does a process on node A invoke a procedure on node B? Synchronous blocking (simple, latency-bound), asynchronous with futures (complex, scalable), or continuation-passing. gRPC is the de facto standard for service-to-service RPC in the cloud but carries HTTP/2 overhead.
|
||||||
|
- **Distributed locks:** Lease-based locks (Chubby/ZooKeeper model), RDMA-based compare-and-swap (best latency), or consensus-based locks for strong guarantees. Each has different failure semantics.
|
||||||
|
- **Barriers:** Collective synchronisation across node groups. MPI_Barrier semantics are well understood; the question is how to expose this in a general-purpose OS API.
|
||||||
|
- **Collective operations:** AllReduce, AllGather, Broadcast, Reduce-Scatter. These are first-class operations for distributed ML workloads (the dominant use case on a 1024-node GPU cluster) and must be surfaced as OS-level calls, not just library calls, so the OS can optimise placement and routing.
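
As an illustration of the failure semantics mentioned for lease-based locks, the following toy model runs in a single process and takes time as an explicit argument so that its behaviour is deterministic. The `LeaseLock` class and its method names are invented for this sketch and are not a proposed DistOS API; a real implementation would also need fencing tokens and a replicated lock service.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float


class LeaseLock:
    """Toy single-process model of a lease-based lock (hypothetical names)."""

    def __init__(self, ttl: float) -> None:
        self.ttl = ttl
        self.lease: Optional[Lease] = None

    def acquire(self, node: str, now: float) -> bool:
        # The lock is free if nobody holds it or the current lease has expired.
        if self.lease is None or now >= self.lease.expires_at:
            self.lease = Lease(node, now + self.ttl)
            return True
        # Re-acquisition by the current holder acts as a renewal.
        if self.lease.holder == node:
            self.lease.expires_at = now + self.ttl
            return True
        return False

    def release(self, node: str) -> None:
        # Only the current holder may release.
        if self.lease is not None and self.lease.holder == node:
            self.lease = None
```

The useful property is that a crashed holder cannot block others forever: once the lease expires, another node can acquire the lock. The cost is that a slow (not crashed) holder may still believe it owns the lock after expiry, which is why the choice between leases, RDMA compare-and-swap, and consensus-based locks changes what the API can promise to callers.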
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Birrell, A. and Nelson, B. "Implementing Remote Procedure Calls." *ACM Transactions on Computer Systems* 2(1), 1984. The foundational RPC paper.
|
||||||
|
- Google. *gRPC*. https://grpc.io/. The current industry standard for typed RPC. Protocol Buffers schema evolution strategy is directly applicable to DistOS API versioning.
|
||||||
|
- Burrows, M. "The Chubby Lock Service for Loosely-Coupled Distributed Systems." *OSDI 2006*.
|
||||||
|
- Hunt, P. et al. "ZooKeeper: Wait-free Coordination for Internet-scale Systems." *USENIX ATC 2010*.
|
||||||
|
- Message Passing Interface Forum. *MPI: A Message-Passing Interface Standard, Version 4.1*. 2023. The collective operations specification is normative for `council-api`'s collective call design.
|
||||||
|
- Mellanox/NVIDIA. *RDMA Programming Guide*. InfiniBand verbs API (ibv_post_send, ibv_post_recv, ibv_create_qp) provides the lowest-latency distributed memory access primitives available on the target cluster.
|
||||||
|
|
||||||
|
### 2.4 Async-First API Design
|
||||||
|
|
||||||
|
A GPU cluster OS serving AI workloads will have I/O patterns dominated by deep asynchrony: thousands of in-flight kernel launches, streaming data from Weka FS, and collective communication across 1024 nodes. A synchronous-first API would be a fundamental design mistake for this workload. Agents must research:
|
||||||
|
|
||||||
|
- **Rust async/await:** The Rust async model (the `Future` trait, its `Poll` return type, and the executor model) provides zero-cost abstraction over async I/O. The `tokio` runtime is the dominant executor. The DistOS Rust SDK must integrate naturally with tokio.
|
||||||
|
- **io_uring (Linux 5.1+):** io_uring provides a pair of ring buffers shared between kernel and userspace, which lets applications batch submissions and reap completions with few or no system calls per operation. Its submission/completion queue model is the reference for how DistOS should design its own async system call interface.
|
||||||
|
- **Go channels and goroutines:** Go's concurrency model maps well to distributed operations. The DistOS Go SDK must express distributed calls as channels or via the `context.Context` cancellation pattern.
|
||||||
|
- **Python asyncio:** The Python SDK must be usable from `async def` coroutines. NumPy compatibility for GPU tensor operations should be considered (compatibility with the Numba/CuPy interface).
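
The submission/completion queue model referred to above can be imitated in plain `asyncio` to show the shape of such an interface. This is a sketch only: the queues are ordinary `asyncio.Queue` objects, the `worker` coroutine stands in for the kernel side, and the "operation" simply doubles its payload. None of these names are proposed DistOS calls.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Completion:
    request_id: int
    result: int


async def worker(sq: asyncio.Queue, cq: asyncio.Queue) -> None:
    """Stand-in for the kernel side: drain submissions, post completions."""
    while True:
        item = await sq.get()
        if item is None:  # sentinel: no more submissions
            break
        request_id, payload = item
        await asyncio.sleep(0)  # yield control, as real I/O would
        await cq.put(Completion(request_id, payload * 2))


async def run_batch(payloads: list[int]) -> dict[int, int]:
    sq: asyncio.Queue = asyncio.Queue()
    cq: asyncio.Queue = asyncio.Queue()
    task = asyncio.create_task(worker(sq, cq))
    # Submit everything up front without waiting for any result.
    for request_id, payload in enumerate(payloads):
        sq.put_nowait((request_id, payload))
    sq.put_nowait(None)
    # Reap completions; each is matched to its request by id.
    results: dict[int, int] = {}
    for _ in payloads:
        done = await cq.get()
        results[done.request_id] = done.result
    await task
    return results
```

Calling `asyncio.run(run_batch([1, 2, 3]))` submits three requests before reaping any result. The point of the pattern is that submission and completion are decoupled and correlated by an id, so the caller never blocks on an individual operation.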
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Axboe, J. *Efficient IO with io_uring*. https://kernel.dk/io_uring.pdf. 2019. This paper is essential for understanding the state of the art in async syscall design.
|
||||||
|
- The Rust Async Book. https://rust-lang.github.io/async-book/. Normative reference for Rust async design patterns.
|
||||||
|
- Grigorik, I. *High Performance Browser Networking*. O'Reilly, 2013. Useful background on event-driven I/O design.
|
||||||
|
|
||||||
|
### 2.5 Error Handling Conventions
|
||||||
|
|
||||||
|
A cluster OS at this scale will produce a high volume of partial failures: a node goes dark, a GPU kernel faults, a network partition isolates a subsystem. The error handling convention must be:
|
||||||
|
|
||||||
|
- **Structured:** Every error carries a type, a severity, a source identifier (node, subsystem, call), and a correlation ID that links it to a UCXL-addressed event in the distributed log.
|
||||||
|
- **Actionable:** The API must distinguish between errors that the caller should retry (transient), errors that require intervention (permanent), and errors that indicate a usage mistake (programmer error).
|
||||||
|
- **Traceable:** Error correlation IDs must be UCXL-compatible so that an error returned to a Python application can be resolved to the full distributed event chain using the UCXL resolver.
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Google. *Google Cloud API Design Guide: Errors*. https://cloud.google.com/apis/design/errors. The most systematic public treatment of structured API error design. The canonical status codes (OK, INVALID_ARGUMENT, NOT_FOUND, UNAVAILABLE, etc.) should be adopted or adapted.
|
||||||
|
- Klabnik, S. and Nichols, C. *The Rust Programming Language* (Chapter 9: Error Handling). The Rust approach to `Result<T, E>` and the `?` operator represents the state of the art for recoverable errors in a systems language.
|
||||||
|
- Syme, D. et al. "Exceptional Syntactic Support for Error Handling in F#." *Haskell Symposium 2020*. Relevant to the higher-level SDK error design.
|
||||||
|
|
||||||
|
The UCXL response code integration specifically means that API error structs carry a `ucxl_trace` field containing the UCXL address of the distributed event that caused the failure:
|
||||||
|
|
||||||
|
```
|
||||||
|
error.ucxl_trace = "ucxl://council-fault:monitor@DistOS:fault-tolerance/^^/events/node-042-timeout-2026-03-01T14:22:00Z"
|
||||||
|
```
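A minimal sketch of how the three requirements (structured, actionable, traceable) might look in the Python SDK follows. The class `DistOSError`, the `ErrorClass` enum, the field names, and the `call_with_retry` helper are all hypothetical; only the `ucxl_trace` field and the transient/permanent/programmer-error distinction come from the text above.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"          # caller should retry
    PERMANENT = "permanent"          # requires intervention
    PROGRAMMER = "programmer_error"  # usage mistake


@dataclass(eq=False)
class DistOSError(Exception):
    """Hypothetical structured error; field names are assumptions."""
    code: str
    error_class: ErrorClass
    severity: str
    node: str
    subsystem: str
    call: str
    ucxl_trace: str

    def __post_init__(self) -> None:
        super().__init__(f"{self.code} from {self.call} on {self.node}")


# Example trace address taken from the text above.
TRACE = ("ucxl://council-fault:monitor@DistOS:fault-tolerance/^^/events/"
         "node-042-timeout-2026-03-01T14:22:00Z")


def call_with_retry(op, attempts: int = 3):
    """Retry only transient failures; surface everything else immediately."""
    for attempt in range(1, attempts + 1):
        try:
            return op()
        except DistOSError as err:
            if err.error_class is not ErrorClass.TRANSIENT or attempt == attempts:
                raise
```

Because the retry decision reads `error_class` rather than parsing a message string, callers in any SDK language can apply the same policy, and the `ucxl_trace` value survives unchanged for later resolution.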
|
||||||
|
|
||||||
|
### 2.6 SDK Design for Multiple Languages
|
||||||
|
|
||||||
|
The SDK must present a coherent surface across four languages with different idioms. The design principles are:
|
||||||
|
|
||||||
|
- **C ABI as the foundation:** The canonical system call interface is a C ABI. All other language SDKs are generated or hand-written wrappers over the C ABI. This ensures ABI stability and FFI compatibility with every language.
|
||||||
|
- **Rust SDK:** Idiomatic, zero-cost wrappers. Use Rust's ownership system to enforce resource lifetimes at compile time (e.g., a `GpuBuffer<T>` type that is `Send` but not `Sync`, reflecting GPU buffer ownership semantics). The Rust SDK should use `#[repr(C)]` structs for ABI compatibility.
|
||||||
|
- **Go SDK:** Ergonomic wrappers using `cgo` for the C ABI. Expose distributed operations as channel-returning functions. Context-aware: all calls accept `context.Context` for cancellation and timeout propagation.
|
||||||
|
- **Python SDK:** High-level, NumPy-compatible. Consider auto-generating stub code from a schema. Must be `asyncio`-compatible. Integrate with the Python type system via `Protocol` and `TypedDict`.
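
The layering described above (a C ABI underneath, typed high-level wrappers on top) can be sketched in Python with `ctypes` and the `typing` module. The struct `BufferDesc`, its field layout, and the `GpuBuffer` wrapper are hypothetical; no DistOS C header exists yet, so nothing here calls into a native library.

```python
import ctypes
from typing import Protocol, TypedDict


class BufferDesc(ctypes.Structure):
    """Mirror of a hypothetical C struct (layout is an assumption)."""
    _fields_ = [
        ("device_id", ctypes.c_uint32),
        ("flags", ctypes.c_uint32),
        ("size_bytes", ctypes.c_uint64),
    ]


class BufferInfo(TypedDict):
    device_id: int
    flags: int
    size_bytes: int


class SupportsDescribe(Protocol):
    """Structural type that any SDK resource wrapper could satisfy."""
    def describe(self) -> BufferInfo: ...


class GpuBuffer:
    """High-level wrapper that owns a C-ABI descriptor."""

    def __init__(self, device_id: int, size_bytes: int, flags: int = 0) -> None:
        self._desc = BufferDesc(device_id, flags, size_bytes)

    def describe(self) -> BufferInfo:
        return {
            "device_id": self._desc.device_id,
            "flags": self._desc.flags,
            "size_bytes": self._desc.size_bytes,
        }
```

Keeping the descriptor as a fixed-layout struct is what lets the Rust (`#[repr(C)]`), Go (`cgo`), and Python wrappers agree on the same bytes, while `Protocol` and `TypedDict` give Python callers static types without exposing `ctypes` directly.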
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Klabnik, S. and Nichols, C. *The Rust Programming Language*. https://doc.rust-lang.org/book/. Idiomatic Rust patterns.
|
||||||
|
- Go Authors. *Effective Go*. https://go.dev/doc/effective_go. Idiomatic Go patterns.
|
||||||
|
- Google. *Google Cloud API Design Guide*. https://cloud.google.com/apis/design. The most comprehensive public API design guide, covering resource-oriented design, standard methods, naming conventions, and backwards compatibility.
|
||||||
|
- Smith, P. *Designing for Compatibility in Evolving APIs*. IEEE Software 39(4), 2022.
|
||||||
|
|
||||||
|
### 2.7 CLI Tooling Design
|
||||||
|
|
||||||
|
The cluster management CLI (`distos-ctl` or equivalent) must follow modern CLI design principles:
|
||||||
|
|
||||||
|
- Machine-readable output (JSON/YAML with `--output json`) for scripting
|
||||||
|
- Structured logging with log levels
|
||||||
|
- Human-readable default output with colour and progress indicators
|
||||||
|
- Completion generation for bash/zsh/fish
|
||||||
|
- Subcommand structure: `node`, `job`, `gpu`, `net`, `storage`, `secret`, `log`
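
A small `argparse` sketch shows how the subcommand structure and the `--output json` switch could fit together. The per-subcommand `action` argument and the `render` helper are assumptions for illustration; YAML output is left out because it would require a third-party library.

```python
import argparse
import json

# Subcommand names taken from the list above.
SUBCOMMANDS = ["node", "job", "gpu", "net", "storage", "secret", "log"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="distos-ctl")
    parser.add_argument("--output", choices=["text", "json"], default="text")
    sub = parser.add_subparsers(dest="resource", required=True)
    for name in SUBCOMMANDS:
        cmd = sub.add_parser(name)
        cmd.add_argument("action")  # hypothetical verb, e.g. "status"
    return parser


def render(record: dict, output: str) -> str:
    """Format one result record for humans (text) or scripts (json)."""
    if output == "json":
        return json.dumps(record, sort_keys=True)
    return "\n".join(f"{key}: {value}" for key, value in record.items())
```

For example, parsing `["--output", "json", "node", "status"]` yields `resource="node"` and `action="status"`, and `render` then emits a single JSON object that scripts can consume without scraping the human-readable form.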
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Prasad, A., Firshman, B., Tashian, C. and Parish, E. *Command Line Interface Guidelines*. https://clig.dev/. The community-written standard for modern CLI design. Should be treated as the style guide for `distos-ctl`.
|
||||||
|
- HashiCorp. *Vault CLI design*. The Vault CLI is an exemplar of a well-structured cluster management tool with consistent subcommand and flag conventions.
|
||||||
|
- Kubernetes. `kubectl` source and documentation. The de facto standard for distributed cluster management CLIs. The DistOS CLI should match `kubectl` conventions where applicable to reduce cognitive load.
|
||||||
|
|
||||||
|
### 2.8 API Versioning and Evolution Strategy
|
||||||
|
|
||||||
|
A system call interface must be stable. The versioning strategy must address:
|
||||||
|
|
||||||
|
- **Compatibility guarantees:** What changes are backwards-compatible (adding optional parameters, adding new calls) vs. breaking (changing parameter semantics, removing calls)?
|
||||||
|
- **Deprecation lifecycle:** Minimum deprecation notice period, deprecation markers in the SDK, removal schedule.
|
||||||
|
- **Version negotiation:** How does a client indicate the API version it was compiled against? How does the kernel report available versions?
|
||||||
|
- **Experimental APIs:** A clearly marked experimental tier for new calls before they enter the stable surface.
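
One possible shape for version negotiation is sketched below, assuming SemVer-style `MAJOR.MINOR.PATCH` strings and the usual rule that a server is compatible with a client when the major versions match and the server is at least as new. The function names are hypothetical, and pre-release or build-metadata suffixes are deliberately not handled.

```python
from typing import Optional

Version = tuple[int, int, int]


def parse(version: str) -> Version:
    """Parse a plain MAJOR.MINOR.PATCH string (no pre-release tags)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


def negotiate(client: str, offered: list[str]) -> Optional[str]:
    """Pick the highest offered version that is backwards-compatible
    with the version the client was compiled against."""
    want = parse(client)
    compatible = [
        v for v in offered
        if parse(v)[0] == want[0] and parse(v) >= want
    ]
    return max(compatible, key=parse) if compatible else None
```

A client built against `1.4.0` that is offered `1.2.0`, `1.6.3`, and `2.0.0` would settle on `1.6.3`: `1.2.0` is too old and `2.0.0` crosses a major boundary. Returning `None` gives the caller an explicit signal to fail fast rather than run against an incompatible surface.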
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Google. *Google Cloud API Versioning*. https://cloud.google.com/apis/design/versioning. URL-based versioning for REST APIs; the principles apply to system call versioning.
|
||||||
|
- Klabnik, S. "Stability as a Deliverable." https://blog.rust-lang.org/2014/10/30/Stability.html. Rust's stability commitment is a model for how a systems project can make and keep compatibility promises.
|
||||||
|
- Semantic Versioning Specification. https://semver.org/. The DistOS SDK and ABI will follow SemVer 2.0.
|
||||||
|
|
||||||
|
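
One possible shape for the version negotiation question, sketched under SemVer assumptions: the client declares the version it was compiled against, the kernel advertises the versions it serves, and the newest served version with the same major number and an equal or higher minor number is selected. `ApiVersion` and `negotiate` are hypothetical names used only for illustration.

```python
from typing import NamedTuple, Optional


class ApiVersion(NamedTuple):
    major: int
    minor: int
    patch: int = 0

    @classmethod
    def parse(cls, text: str) -> "ApiVersion":
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)


def negotiate(client: ApiVersion, served: list[ApiVersion]) -> Optional[ApiVersion]:
    """Return the newest served version the client can use, or None.

    Under SemVer, a provider with the same major version and an equal or
    higher minor version is backwards-compatible with the client. Patch
    releases do not change the API, so they are ignored for compatibility.
    """
    compatible = [
        v for v in served
        if v.major == client.major and v.minor >= client.minor
    ]
    return max(compatible, default=None)
```

A `None` result is the point at which a structured "unsupported API version" error would be raised. How that error is shaped is left to the error catalogue.
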
### 2.9 Plan 9 and Fuchsia Zircon Deep Dive

These two systems represent the clearest non-POSIX OS API designs and must be studied in depth:

- **Plan 9:** Plan 9 represents system resources as files, served over the 9P file protocol. Network connections, processes, and graphics are files. The simplicity is extreme. The DistOS clean-slate layer need not adopt 9P but should understand its design philosophy.
  - Pike, R. et al. "The Use of Name Spaces in Plan 9." *Operating Systems Review* 27(2), 1993.
  - Dorward, S. et al. "The Inferno Operating System." *Bell Labs Technical Journal* 2(1), 1997.
- **Fuchsia / Zircon:** Zircon is a microkernel with capabilities as the security primitive. Every kernel resource is a `zx_handle_t`. Handles are passed between processes explicitly; there is no global namespace for kernel objects. This is the preferred model for DistOS's capability integration with `council-sec`.
  - Google. *Zircon Kernel Concepts*. https://fuchsia.dev/fuchsia-src/concepts/kernel/concepts.
  - Google. *Zircon Syscall Reference*. https://fuchsia.dev/fuchsia-src/reference/syscalls.

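
The handle model described above (per-process handles, explicit transfer, no global namespace) can be illustrated with a toy model. This is not the Zircon API: `HandleTable`, `Rights`, and every method name below are hypothetical, and the rights set is an assumption chosen for brevity.

```python
from dataclasses import dataclass, field
from enum import Flag, auto


class Rights(Flag):
    READ = auto()
    WRITE = auto()
    DUPLICATE = auto()
    TRANSFER = auto()


@dataclass
class HandleTable:
    """Per-process table: a handle value means nothing outside its own process."""
    _entries: dict[int, tuple[object, Rights]] = field(default_factory=dict)
    _next: int = 1

    def install(self, obj: object, rights: Rights) -> int:
        handle = self._next
        self._next += 1
        self._entries[handle] = (obj, rights)
        return handle

    def check(self, handle: int, needed: Rights) -> object:
        obj, held = self._entries[handle]  # KeyError if the handle is unknown
        if (needed & held) != needed:
            raise PermissionError("handle lacks the required rights")
        return obj

    def duplicate(self, handle: int, rights: Rights) -> int:
        """Create a second handle to the same object with equal or fewer rights."""
        obj = self.check(handle, Rights.DUPLICATE)
        held = self._entries[handle][1]
        if (rights & held) != rights:
            raise PermissionError("cannot add rights when duplicating")
        return self.install(obj, rights)

    def transfer(self, handle: int, dest: "HandleTable") -> int:
        """Move a handle to another process; the sender loses access."""
        obj = self.check(handle, Rights.TRANSFER)
        held = self._entries.pop(handle)[1]
        return dest.install(obj, held)
```

The property worth noting is that authority only ever narrows or moves: a duplicate cannot gain rights, and a transferred handle disappears from the sender's table.
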

---

## 3. Agent Roles

| Role | Count | Responsibilities |
|------|-------|-----------------|
| Lead API Architect | 1 | Decides overall API philosophy; coordinates with all subsystem councils; owns the master API specification document; resolves conflicts between API and subsystem requirements |
| POSIX Compatibility Analysts | 4 | Audit which POSIX calls must be retained; design the compatibility shim layer; document compatibility coverage gaps |
| GPU Syscall Designers | 6 | Design GPU-native system calls for kernel launch, memory, streams, events, graphs; ensure Hopper/Blackwell/Grace specifics are covered |
| Distributed Syscall Designers | 5 | Design RPC, distributed lock, barrier, and collective operation system calls; consult MPI and RDMA references |
| SDK Designers | 8 | Design language-specific SDKs: 2 per language (C, Rust, Go, Python); responsible for ergonomics, idiom conformance, and ABI stability |
| Async API Specialists | 4 | Design the async call model; specify io_uring-style ring buffer interface; ensure Rust/Go/Python async integration |
| CLI Designers | 3 | Design `distos-ctl` command structure, output formats, and completions |
| Error Handling Architects | 3 | Design structured error types, UCXL trace integration, and error propagation conventions across all SDK layers |
| API Versioning Strategists | 2 | Develop the versioning policy, deprecation lifecycle, compatibility matrix, and experimental API tier |
| Developer Experience Reviewers | 4 | Evaluate API usability; write developer-facing documentation and example applications; run internal "dogfooding" walkthroughs |

**Total:** 40 agents

---

## 4. Key Deliverables

All artifacts use the pattern `ucxl://council-api:{role}@DistOS:api/^^/{artifact-type}/{name}`.

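
As a small illustration of the address pattern above, the helper below fills in the three placeholders. It is a sketch only: `artifact_address` is a hypothetical name, the lowercase slug rule is an assumption made for this example, and the `^^` segment is copied verbatim from the pattern without interpreting its UCXL meaning.

```python
import re

# Pattern as written in this brief; "^^" is reproduced verbatim.
TEMPLATE = "ucxl://council-api:{role}@DistOS:api/^^/{artifact_type}/{name}"

# Assumption: components are lowercase slugs (letters, digits, "-", ".").
_SLUG = re.compile(r"[a-z0-9][a-z0-9.\-]*")


def artifact_address(role: str, artifact_type: str, name: str) -> str:
    """Build a council-api artifact address from its three variable parts."""
    parts = {"role": role, "artifact_type": artifact_type, "name": name}
    for label, value in parts.items():
        if not _SLUG.fullmatch(value):
            raise ValueError(f"invalid {label}: {value!r}")
    return TEMPLATE.format(**parts)
```

For example, `artifact_address("gpu-syscall-designer", "specs", "gpu-syscalls.md")` reproduces the address listed for the GPU system call specification.
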
### 4.1 Master API Philosophy Decision Record

```
ucxl://council-api:lead-api-architect@DistOS:api/^^/decisions/dr-api-01-philosophy.md
```

Covers the layered model decision: clean-slate primary API, POSIX compatibility shim, and the cost budget for the shim.

### 4.2 GPU System Call Specification

```
ucxl://council-api:gpu-syscall-designer@DistOS:api/^^/specs/gpu-syscalls.md
```

Full specification of all GPU-native system calls with parameter types, semantics, error codes, and Hopper/Blackwell/Grace specifics.

### 4.3 Distributed System Call Specification

```
ucxl://council-api:distributed-syscall-designer@DistOS:api/^^/specs/distributed-syscalls.md
```

### 4.4 Async Call Interface Specification

```
ucxl://council-api:async-api-specialist@DistOS:api/^^/specs/async-interface.md
```

Documents the submission/completion ring model, back-pressure semantics, and language runtime integration.

### 4.5 C ABI Reference

```
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/c-abi-reference.h
```

The normative C header file. All other SDKs are derived from this.

### 4.6 Language SDK Specifications

```
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-rust.md
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-go.md
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-python.md
```

### 4.7 Error Type Catalogue

```
ucxl://council-api:error-handling-architect@DistOS:api/^^/specs/error-catalogue.md
```

All structured error types with UCXL trace integration, severity levels, and retry guidance.

### 4.8 CLI Specification

```
ucxl://council-api:cli-designer@DistOS:api/^^/specs/distos-ctl-spec.md
```

Full command reference including all subcommands, flags, output formats, and completion scripts.

### 4.9 API Versioning Policy

```
ucxl://council-api:api-versioning-strategist@DistOS:api/^^/policies/versioning-policy.md
```

### 4.10 POSIX Compatibility Coverage Matrix

```
ucxl://council-api:posix-compatibility-analyst@DistOS:api/^^/specs/posix-compatibility-matrix.md
```

Tabulates every POSIX call: supported natively, supported via shim, not supported (with rationale).

### 4.11 Example Applications

```
ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/hello-distributed-gpu.md
ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/allreduce-collective.md
ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/weka-fs-streaming-io.md
```

---

## 5. Decision Points

All DRs use the address pattern `ucxl://council-api:lead-api-architect@DistOS:api/^^/decisions/{dr-id}.md`.

### DP-A01: POSIX vs. Clean-Slate vs. Layered

The foundational design philosophy choice. The default recommendation is the layered model, but this must be validated against: the cost of maintaining the shim layer, the risk of semantic leakage from POSIX into the clean-slate layer, and the developer familiarity benefit.

**Deciding parties:** Lead API Architect, POSIX Compatibility Analysts, `council-synth`

### DP-A02: Async System Call Mechanism

Choose between: io_uring-inspired ring buffer (lowest overhead, Linux precedent), a POSIX-extended `aio_*` interface (familiarity, limited expressiveness), or a fully custom completion port model. This decision is tightly coupled to the `council-mem` memory model (the ring buffer requires shared memory between kernel and userspace).

**Deciding parties:** Async API Specialists, `council-mem`, `council-verify` (for ABI safety check)

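
To make the ring-buffer option in DP-A02 more concrete, here is a toy, single-threaded model of a submission/completion queue pair with back-pressure. It is not the io_uring ABI and involves no shared memory: the `Ring` class, its method names, and the rule that unreaped completions count against the ring depth are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable


class Ring:
    """Toy model of a submission queue (sq) and completion queue (cq)."""

    def __init__(self, depth: int) -> None:
        self.depth = depth
        self.sq: deque = deque()  # pending (user_data, operation) pairs
        self.cq: deque = deque()  # finished (user_data, result) pairs

    def submit(self, user_data: int, op: Callable[[], object]) -> bool:
        """Queue an operation; False signals back-pressure (no free slot)."""
        if len(self.sq) + len(self.cq) >= self.depth:
            return False
        self.sq.append((user_data, op))
        return True

    def kernel_poll(self) -> None:
        """Stand-in for the kernel draining the submission queue."""
        while self.sq:
            user_data, op = self.sq.popleft()
            self.cq.append((user_data, op()))

    def reap(self) -> list:
        """Collect all completions, freeing their slots for new submissions."""
        done = list(self.cq)
        self.cq.clear()
        return done
```

The `user_data` tag is what lets a caller match a completion to the request that produced it, which is the piece a language runtime would map onto a future or awaitable.
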
### DP-A03: GPU Memory API at the Syscall Layer vs. Library Layer

Should GPU memory allocation (`cuMemAlloc` equivalent) be a kernel-mediated system call (allowing the OS to account for and schedule GPU memory as a first-class resource) or a library call that bypasses the kernel after initial device setup? Kernel mediation adds latency; bypass reduces accounting fidelity.

**Deciding parties:** GPU Syscall Designers, `council-mem`, `council-telemetry`

### DP-A04: RPC Mechanism for Distributed System Calls

Choose the wire protocol for remote procedure calls: gRPC (typed, HTTP/2, mature), a custom binary protocol over RDMA (lowest latency, highest implementation cost), or a two-tier model (gRPC for control plane, RDMA for data plane). The choice directly affects the latency budget for distributed system calls.

**Deciding parties:** Distributed Syscall Designers, `council-net`

### DP-A05: SDK Code Generation vs. Hand-Written Wrappers

Decide whether to generate the Rust, Go, and Python SDKs from a schema definition (IDL, such as Protocol Buffers or a custom DSL) or maintain hand-written wrappers. Generated code is more consistent; hand-written code can be more idiomatic. A hybrid (generate the boilerplate, hand-write ergonomic wrappers) is the likely outcome.

**Deciding parties:** SDK Designers, API Versioning Strategists

### DP-A06: Authentication and Authorisation API

How does a process prove its identity to the kernel and acquire capabilities? Options: token-based (JWT or similar), capability handles (Zircon model), certificate-based (X.509 with a cluster CA), or UCXL-scoped credentials. This decision must be made jointly with `council-sec`.

**Deciding parties:** Lead API Architect, `council-sec`

---

## 6. Dependencies on Other Councils

`council-api` is the integrating council: every subsystem council produces functionality, and `council-api` exposes that functionality through a coherent surface. It is therefore a downstream consumer of requirements from all councils and an upstream provider to `council-docs` and `council-verify`.

| Council | Relationship | What council-api consumes | What council-api produces |
|---------|-------------|--------------------------|--------------------------|
| `council-sched` | Consuming requirements | Job submission semantics, priority model, queue management APIs | Scheduler-facing system calls in API spec |
| `council-mem` | Bidirectional | Memory model, allocation semantics, consistency guarantees | Memory system call specs; async memory API |
| `council-net` | Bidirectional | Network abstraction primitives, RDMA capabilities | Network system calls; distributed RPC wire protocol choice |
| `council-fault` | Consuming requirements | Failure notification model, recovery primitives | Fault-tolerance-related error codes; node failure event API |
| `council-sec` | Bidirectional | Capability model, identity primitives, isolation guarantees | Authentication/authorisation API surface; capability handle design |
| `council-telemetry` | Consuming requirements | Metering call semantics, SLO query interface | Telemetry-facing SDK surface; metering call specs |
| `council-verify` | Providing for verification | N/A | API interface contracts for formal verification |
| `council-qa` | Providing for test design | N/A | API spec enables QA to design conformance tests |
| `council-synth` | Receiving directives | Cross-council conflict resolutions affecting API design | Updates to API spec when directed by synth |
| `council-docs` | Providing for documentation | N/A | All API specs feed directly into the reference documentation |

**Critical path constraint:** `council-api` cannot finalise the distributed system call interface until `council-net` has committed to its RPC and RDMA model (DP-A04 depends on this). GPU system call design can proceed independently from Day 1.

---

## 7. WHOOSH Configuration

### 7.1 Team Formation

```yaml
council_id: council-api
display_name: "API Surface and Developer Experience Council"
target_size: 40
formation_strategy: competency_weighted
required_roles:
  - role: lead-api-architect
    count: 1
    persona: systems-analyst
    competencies: [api-design, posix, distributed-systems, gpu-programming, developer-experience]
  - role: posix-compatibility-analyst
    count: 4
    persona: technical-specialist
    competencies: [posix, linux-kernel, system-calls, abi-stability]
  - role: gpu-syscall-designer
    count: 6
    persona: technical-specialist
    competencies: [cuda, rocm, gpu-memory, hopper-architecture, blackwell-architecture, nvlink]
  - role: distributed-syscall-designer
    count: 5
    persona: technical-specialist
    competencies: [rpc, rdma, mpi-collectives, distributed-locks, grpc]
  - role: sdk-designer
    count: 8
    persona: technical-specialist
    competencies: [c-abi, rust-async, go-concurrency, python-asyncio, ffi, sdk-design]
  - role: async-api-specialist
    count: 4
    persona: technical-specialist
    competencies: [io-uring, async-io, rust-futures, event-driven-design]
  - role: cli-designer
    count: 3
    persona: technical-specialist
    competencies: [cli-design, ux, kubectl-conventions, shell-completion]
  - role: error-handling-architect
    count: 3
    persona: systems-analyst
    competencies: [error-design, structured-errors, distributed-tracing, ucxl]
  - role: api-versioning-strategist
    count: 2
    persona: systems-analyst
    competencies: [api-versioning, semver, deprecation-policy, compatibility]
  - role: developer-experience-reviewer
    count: 4
    persona: technical-writer
    competencies: [developer-documentation, api-usability, example-applications, dogfooding]
```

### 7.2 Quorum Rules

```yaml
quorum:
  decision_threshold: 0.65          # 65% of active agents must agree on API design decisions
  lead_architect_veto: true         # Lead API Architect can block any interface decision
  breaking_change_threshold: 0.85   # Breaking changes require 85% supermajority
  cross_council_approval:
    trigger: api_affects_subsystem
    required: [affected_council_lead, council-synth]
    response_sla_hours: 6
  developer_experience_review:
    trigger: new_public_call
    required: [developer-experience-reviewer_count >= 2]
    purpose: "Ensure every new call meets ergonomics standard before it enters the spec"
```

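
The threshold rules above can be read as a simple predicate. The sketch below is one interpretation, not the WHOOSH implementation: it assumes thresholds are inclusive, that the denominator is the number of active agents, and that a veto overrides any tally. `Vote` and `passes` are hypothetical names.

```python
from dataclasses import dataclass

# Values copied from the quorum block above.
DECISION_THRESHOLD = 0.65
BREAKING_CHANGE_THRESHOLD = 0.85


@dataclass(frozen=True)
class Vote:
    approvals: int
    active_agents: int
    breaking_change: bool = False
    lead_veto: bool = False


def passes(vote: Vote) -> bool:
    """Apply the veto first, then the threshold that matches the change type."""
    if vote.lead_veto or vote.active_agents == 0:
        return False
    threshold = (
        BREAKING_CHANGE_THRESHOLD if vote.breaking_change else DECISION_THRESHOLD
    )
    return vote.approvals / vote.active_agents >= threshold
```

With all 40 agents active, an ordinary decision needs 26 approvals and a breaking change needs 34. The cross-council and developer-experience triggers are separate gates and are not modelled here.
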
### 7.3 Subchannels

```yaml
subchannels:
  - id: api-posix-compat
    subscribers: [posix-compatibility-analyst, lead-api-architect]
    purpose: "POSIX coverage analysis, shim design, compatibility gap triage"
    ucxl_feed: "ucxl://council-api:posix-compatibility-analyst@DistOS:api/^^/specs/posix-*"

  - id: api-gpu-syscalls
    subscribers: [gpu-syscall-designer, lead-api-architect, async-api-specialist]
    purpose: "GPU-native system call design; Hopper/Blackwell capability integration"
    ucxl_feed: "ucxl://council-api:gpu-syscall-designer@DistOS:api/^^/specs/gpu-*"

  - id: api-distributed-syscalls
    subscribers: [distributed-syscall-designer, lead-api-architect]
    purpose: "Distributed call design; RPC and RDMA protocol negotiation with council-net"
    ucxl_feed: "ucxl://council-api:distributed-syscall-designer@DistOS:api/^^/specs/distributed-*"

  - id: api-sdk-coordination
    subscribers: [sdk-designer, async-api-specialist, developer-experience-reviewer]
    purpose: "Cross-language SDK consistency; ABI stability coordination"
    ucxl_feed: "ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-*"

  - id: api-error-and-versioning
    subscribers: [error-handling-architect, api-versioning-strategist, lead-api-architect]
    purpose: "Error catalogue development; versioning policy; UCXL trace integration"
    ucxl_feed: "ucxl://council-api:error-handling-architect@DistOS:api/^^/specs/error-*"

  - id: api-cross-council-requirements
    subscribers: [lead-api-architect, distributed-syscall-designer, gpu-syscall-designer]
    purpose: "Inbound requirements from all subsystem councils; tracks what each council needs exposed"
    ucxl_feed: "ucxl://council-*:*@DistOS:*/^^/requirements/api-*"

  - id: api-devex-review
    subscribers: [developer-experience-reviewer, lead-api-architect]
    purpose: "Developer experience walkthroughs; example application drafts; usability feedback"
    ucxl_feed: "ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/*"
```

---

## 8. Success Criteria

1. **Complete API surface:** The master API specification covers all system calls required by all six core subsystem councils. No subsystem has an unaddressed API requirement at the end of Phase 4.
2. **POSIX coverage documented:** The POSIX compatibility matrix exists and classifies every POSIX.1-2017 system call as supported, shim-supported, or explicitly unsupported with rationale.
3. **GPU system calls complete:** All GPU-native system calls for Hopper, Grace, and Blackwell are specified with parameter types, semantics, and error codes. NVLink/NVSwitch topology-aware calls are included.
4. **Distributed system calls complete:** All distributed calls (RPC, locks, barriers, collectives) are specified with failure semantics and consistency guarantees matching the `council-fault` and `council-net` specs.
5. **Four-language SDK specs complete:** C ABI, Rust, Go, and Python SDK specifications exist and have been reviewed for idiomatic correctness by SDK Designers.
6. **Error handling consistent:** All error types are catalogued and every public API call has a documented error table. Every error carries a UCXL trace field.
7. **Versioning policy ratified:** The versioning policy is agreed with `council-synth` and published. The experimental API tier is defined.
8. **Verification-ready contracts:** All interface contracts have been delivered to `council-verify` in Alloy-compatible form by Day 8.
9. **Developer experience validated:** At least three example applications have been written by Developer Experience Reviewers and cover: a simple GPU computation, a distributed collective operation, and a Weka FS streaming I/O pattern.
10. **CLI specification complete:** `distos-ctl` subcommand structure and all primary flags are specified.

---

## 9. Timeline

### Phase 1: Research (Days 1–3)

- POSIX Compatibility Analysts catalogue POSIX.1-2017 system calls and assess coverage feasibility
- GPU Syscall Designers survey CUDA Driver API, CUDA Graphs, Hopper/Blackwell architecture documentation, NVLink topology implications
- Distributed Syscall Designers survey MPI collectives, gRPC, RDMA verbs, ZooKeeper/Chubby lock models
- SDK Designers survey language ecosystems: Rust async patterns, Go `cgo` patterns, Python asyncio/CuPy
- Async API Specialists study io_uring interface in depth
- Lead API Architect drafts the API philosophy options paper for DP-A01
- Deliverable: `ucxl://council-api:lead-api-architect@DistOS:api/^^/research/api-philosophy-options.md`

### Phase 2: Architecture (Days 3–6)

- Resolve DP-A01 (philosophy), DP-A02 (async mechanism), DP-A04 (RPC wire protocol), DP-A06 (auth/authz), all in consultation with relevant councils
- Lead API Architect drafts the call taxonomy: which calls belong in which layer (kernel/shim/library)
- GPU Syscall Designers draft the GPU system call prototype spec for Hopper and Blackwell
- Distributed Syscall Designers draft the distributed call prototype spec, contingent on DP-A04 resolution
- Error Handling Architects draft the error type taxonomy and UCXL trace integration
- Deliverable: `ucxl://council-api:lead-api-architect@DistOS:api/^^/research/call-taxonomy.md`

### Phase 3: Formal Specification (Days 6–10)

- Full API spec written: GPU syscalls, distributed syscalls, async interface, C ABI reference
- Language SDK specifications written in parallel by SDK Designers
- Error catalogue completed and UCXL trace integration specified
- Alloy interface contracts delivered to `council-verify` for structural verification
- CLI specification drafted by CLI Designers
- POSIX compatibility matrix completed
- Deliverable: `ucxl://council-api:gpu-syscall-designer@DistOS:api/^^/specs/gpu-syscalls.md` and all companion specs

### Phase 4: Integration (Days 10–12)

- Resolve any outstanding API requirements from subsystem councils surfaced during their Phase 3 spec work
- DP-A03 and DP-A05 resolved with full DR records
- API versioning policy ratified by `council-synth`
- Developer Experience Reviewers conduct walkthroughs of all three example applications
- Deliver final interface contracts to `council-verify` for re-verification after any Phase 3 changes
- Deliverable: Versioning policy, three example applications

### Phase 5: Documentation (Days 12–14)

- Developer Experience Reviewers produce the developer-facing API reference document
- SDK Designers produce getting-started guides for each language
- All specs integrated into the master DistOS specification document via `council-docs`
- Final UCXL navigability check: every API call traces back to the council decision that introduced it
- Deliverable: `ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/docs/api-reference.md`
544
councils/09-inter-council-synthesis.md
Normal file
@@ -0,0 +1,544 @@
# Council Design Brief: Inter-Council Synthesis

**Council ID:** `council-synth`
**Mission:** Maintain the architectural coherence of DistOS as a whole. This council does not own any subsystem. Instead, it owns the space between subsystems: the interface contracts, the dependency graph, the consistency of cross-cutting assumptions, and the resolution of conflicts between councils that cannot resolve their differences independently. When the DistOS specification is complete, `council-synth` must be able to certify that every decision made by every council is compatible with every decision made by every other council.
**UCXL Base Address:** `ucxl://council-synth:*@DistOS:integration/*`
**Agent Count:** ~100
**Status:** Design Brief (Constitution Phase)

---

## 1. Scope and Responsibilities

`council-synth` is unique among all DistOS councils: it has no subsystem of its own to design. Its authority is horizontal rather than vertical. Specifically:

1. **Monitor Decision Records from all councils via UCXL subscription.** Every DR published by any council is read by `council-synth` Conflict Detectors within hours of publication. No council's decisions are exempt from synthesis review.

2. **Detect conflicts between council decisions.** The canonical example: `council-sched` assumes a shared global memory space for scheduling state, but `council-mem` has chosen a message-passing-only inter-node memory model. This conflict is invisible to both councils independently; `council-synth` makes it visible.

3. **Convene cross-council mini-councils to resolve conflicts.** When a conflict is detected, the Lead Synthesiser convenes a focused session with representatives from the affected councils, facilitated by an Arbitrator agent. The mini-council produces a resolution DR that updates the affected specs.

4. **Maintain the Global Architectural Coherence Document (GACD).** A living document that describes the whole-system architecture as currently understood: the subsystems, their interfaces, their shared assumptions, and the status of each assumption (agreed, under review, contested). This is the primary artefact that a human reviewer can consult to understand the state of the DistOS design.

5. **Manage the dependency graph between all councils.** Which councils must complete which deliverables before which other councils can proceed? `council-synth` maintains this graph, flags blocked councils, and escalates to `council-meta` when a blockage threatens the project timeline.

6. **Arbitrate when two councils have incompatible requirements.** Some conflicts cannot be resolved by the affected councils alone because they represent genuine trade-offs with no obvious correct answer. In these cases, `council-synth` Arbitrators make a binding recommendation (subject to appeal to `council-meta`) based on the ATAM methodology and the Project Constitution's guiding principles.

7. **Allocate the global performance budget.** The DistOS specification must include quantitative performance targets (latency, throughput, jitter) for the 1024-node cluster. These targets cannot be set by individual subsystems independently: a scheduler that claims 10 microsecond dispatch latency, a memory system that claims 1 microsecond allocation, and a network stack that claims 5 microsecond RPC round-trip imply a composed system latency budget that may or may not be consistent with the target hardware. `council-synth` owns this budget and allocates it across subsystems.

8. **Verify consistency model alignment.** The memory model chosen by `council-mem`, the failure model chosen by `council-fault`, and the network guarantees provided by `council-net` must all agree on the consistency model they collectively provide to `council-api`'s system call interface. This alignment is `council-synth`'s most technically demanding task.

9. **Manage evolutionary architecture.** The DistOS specification is not a static document to be sealed at the end of 14 days. It must be designed to evolve. `council-synth` specifies the evolution policy: how new subsystems are added, how interfaces are changed, and how the coherence invariants are maintained as the spec evolves post-delivery.

`council-synth` does **not** own: the design of any subsystem; formal verification (owned by `council-verify`); the narrative documentation (owned by `council-docs` and `council-arch`); adversarial testing (owned by `council-qa`).

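
The budget composition described in responsibility 7 can be worked through with the figures quoted there. The sketch assumes the three latencies compose serially on one request path, so they add. The 12 microsecond end-to-end target is a made-up number, and proportional scaling is only one of several possible allocation policies.

```python
# Latency claims quoted in responsibility 7, in microseconds.
CLAIMED_US = {
    "scheduler dispatch": 10.0,
    "memory allocation": 1.0,
    "network RPC round-trip": 5.0,
}


def compose(claims: dict[str, float]) -> float:
    """Total latency if every stage sits on the same serial path."""
    return sum(claims.values())


def rebalance(claims: dict[str, float], target_us: float) -> dict[str, float]:
    """Scale every claim by the same factor so the total meets the target."""
    scale = target_us / compose(claims)
    return {name: value * scale for name, value in claims.items()}


total = compose(CLAIMED_US)               # 10 + 1 + 5 = 16 microseconds
allocation = rebalance(CLAIMED_US, 12.0)  # hypothetical 12 microsecond target
```

Here the individually plausible claims sum to 16 microseconds, so a 12 microsecond target forces every subsystem to give up a quarter of its claimed budget under this policy (7.5, 0.75, and 3.75 microseconds respectively).
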

---

## 2. Research Domains

### 2.1 Architecture Trade-off Analysis Method (ATAM)

ATAM (Kazman, Klein, and Clements, 2000) is a systematic method for evaluating software architectures against quality attribute requirements. It was developed at the SEI and has been applied in large-scale industrial settings.

The ATAM process, condensed to its core steps:

1. Present the architecture: stakeholders and architects explain the design
2. Identify quality attribute scenarios: concrete, measurable instances of quality attributes (e.g., "under a 10% node failure rate, no job should wait more than 5 seconds for rescheduling")
3. Identify architectural approaches: which architectural decisions (style, pattern, tactic) address each scenario
4. Generate the utility tree: a structured decomposition of quality attributes, sub-attributes, and prioritised scenarios
5. Analyse architectural approaches: for each approach, identify sensitivity points (design decisions that materially affect the quality attribute), trade-off points (decisions that affect two or more quality attributes in tension), and risks
6. Identify risks and non-risks: document architectural decisions that carry risk and those that are well-understood

`council-synth` Trade-off Analysts are required to be proficient in ATAM. Every major cross-council conflict will be analysed using the ATAM utility tree methodology.

Key references:

- Bass, L., Clements, P., and Kazman, R. *Software Architecture in Practice* (3rd ed., 2012). Addison-Wesley. Chapter 21 on architecture evaluation with ATAM.
- Kazman, R. et al. "ATAM: Method for Architecture Evaluation." *CMU/SEI-2000-TR-004*. 2000. The primary technical report on ATAM methodology.
- Clements, P. et al. *Documenting Software Architectures: Views and Beyond* (2nd ed., 2010). Addison-Wesley. The companion to ATAM on how to document architecture decisions.

### 2.2 Conflict Detection in Distributed Design Processes

Detecting incompatible assumptions across independent design teams is a known research problem. The primary approaches are:

- **Assumption surfacing:** Each council explicitly documents its assumptions about the environment its subsystem operates in. `council-synth` collects these assumption lists and checks them for mutual inconsistency.
- **Interface contract comparison:** Each council publishes the interface it expects to consume and the interface it promises to provide. `council-synth` checks that every consumed interface is covered by some council's provided interface, and that the specifications are compatible.
- **Design decision dependency tracking:** Design decisions have dependencies: a choice made by one council constrains the options available to another. `council-synth` maintains a dependency graph where nodes are design decisions and edges represent constraints.

Key references:

- Nuseibeh, B. et al. "A Framework for Expressing the Relationships Between Multiple Views in Requirements Specification." *IEEE Transactions on Software Engineering* 20(10), 1994. The foundational work on multi-view consistency checking.
- van Lamsweerde, A. and Willemet, L. "Inferring Declarative Requirements Specifications from Operational Scenarios." *IEEE Transactions on Software Engineering* 24(12), 1998.
- Garlan, D. et al. "Architectural Mismatch: Why Reuse Is So Hard." *IEEE Software* 12(6), 1995. Introduces the concept of architectural mismatch, which is what happens when independently designed components make incompatible assumptions. This paper names the core problem that `council-synth` exists to solve.

### 2.3 Dependency Graph Management
|
||||||
|
|
||||||
|
The DistOS project has approximately 12 councils, each producing dozens of deliverables, with complex inter-dependencies. Managing this graph requires:
|
||||||
|
|
||||||
|
- A machine-readable dependency graph format (likely a directed acyclic graph where nodes are deliverables and edges are "depends on" relationships)
|
||||||
|
- Cycle detection (circular dependencies must be identified and broken)
|
||||||
|
- Critical path analysis (which chain of dependencies determines the minimum project duration?)
|
||||||
|
- Blockage detection (which deliverables are currently unmet and which councils are blocked as a result?)
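
The requirements above can be sketched with Python's standard `graphlib` module (Python 3.9 or later). The deliverable names, effort estimates, and helper functions are hypothetical and chosen only to illustrate the checks; the actual graph format remains for `council-synth` to decide.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical deliverables: each key depends on the deliverables in its set.
deps = {
    "hw-survey": set(),
    "mem-model": {"hw-survey"},
    "net-model": {"hw-survey"},
    "api-spec": {"mem-model", "net-model"},
}
# Hypothetical effort estimates in days.
duration = {"hw-survey": 1, "mem-model": 3, "net-model": 1, "api-spec": 2}


def topo_order(graph):
    """Return deliverables in dependency order; raises CycleError on a cycle."""
    return list(TopologicalSorter(graph).static_order())


def has_cycle(graph):
    """True if the graph contains a circular dependency."""
    try:
        topo_order(graph)
        return False
    except CycleError:
        return True


def critical_path_length(graph, cost):
    """Length of the longest cost-weighted dependency chain (acyclic graphs only)."""
    finish = {}
    for node in topo_order(graph):
        start = max((finish[d] for d in graph[node]), default=0)
        finish[node] = start + cost[node]
    return max(finish.values())


def blocked(graph, published):
    """Unpublished deliverables that still wait on an unpublished dependency."""
    return sorted(
        node for node, needs in graph.items()
        if node not in published and needs - published
    )


print(critical_path_length(deps, duration))  # 6
print(blocked(deps, {"hw-survey"}))          # ['api-spec']
print(has_cycle({"a": {"b"}, "b": {"a"}}))   # True
```

The critical path here is `hw-survey`, `mem-model`, `api-spec` (1 + 3 + 2 days). The sketch assumes every deliverable appears as a key in the graph, including those with no dependencies.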
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Cormen, T. et al. *Introduction to Algorithms* (4th ed., 2022). MIT Press. The graph algorithms chapters: topological sort and strongly connected components (Chapter 20), DAG shortest paths (Chapter 22).
|
||||||
|
- Brooks, F. *The Mythical Man-Month* (1995 anniversary edition). Addison-Wesley. The dependency management lessons here, while not formal, are empirically grounded and must inform how `council-synth` manages blockages.
|
||||||
|
- Reinertsen, D. *The Principles of Product Development Flow* (2009). Celeritas. Chapter 3 on queuing theory and dependency cost. Useful for quantifying the cost of blocked councils.
|
||||||
|
|
||||||
|
### 2.4 Interface Contract Enforcement
|
||||||
|
|
||||||
|
An interface contract specifies what a consumer may expect from a provider. In a distributed OS context:
|
||||||
|
- **Pre-conditions:** what must be true before a call is made
|
||||||
|
- **Post-conditions:** what is guaranteed to be true after a call returns successfully
|
||||||
|
- **Invariants:** what must remain true across all states
|
||||||
|
- **Error contracts:** which error conditions the provider guarantees to detect and report
|
||||||
|
|
||||||
|
For `council-synth`, interface contract enforcement means verifying that every interface between subsystems has a formal contract (supplied by `council-api` in Alloy-checkable form) and that no council's spec violates a contract it has agreed to.
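
A runtime analogue of pre- and post-conditions may make the idea concrete. The sketch below is an illustration only: the `contract` decorator, the `ContractViolation` exception, and the `allocate` call are hypothetical, and the project's contracts are expected in Alloy-checkable form rather than as runtime checks.

```python
import functools


class ContractViolation(Exception):
    """Raised when a pre-condition or post-condition does not hold."""


def contract(pre=None, post=None):
    """Wrap a function with optional pre- and post-condition predicates."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if pre is not None and not pre(*args, **kwargs):
                raise ContractViolation(f"pre-condition failed for {fn.__name__}")
            result = fn(*args, **kwargs)
            if post is not None and not post(result, *args, **kwargs):
                raise ContractViolation(f"post-condition failed for {fn.__name__}")
            return result
        return wrapper
    return decorate


@contract(
    pre=lambda nbytes: nbytes > 0,                      # caller's obligation
    post=lambda result, nbytes: len(result) == nbytes,  # provider's guarantee
)
def allocate(nbytes):
    """Hypothetical provider call: returns a zeroed buffer of nbytes bytes."""
    return bytearray(nbytes)


print(len(allocate(16)))  # 16
try:
    allocate(0)
except ContractViolation as exc:
    print(exc)  # pre-condition failed for allocate
```

The pre-condition is the consumer's obligation and the post-condition is the provider's guarantee; invariants and error contracts would need additional checks not shown here.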
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Meyer, B. *Object-Oriented Software Construction* (2nd ed., 1997). Prentice Hall. Chapter 11 on Design by Contract. The foundational text.
|
||||||
|
- Leavens, G. and Sitaraman, M. (eds.) *Foundations of Component-Based Systems*. 2000. Cambridge University Press. Chapter on behavioural contracts and compatibility.
|
||||||
|
- Hatcliff, J. et al. "Behavioral Interface Specification Languages." *ACM Computing Surveys* 44(3), 2012.
|
||||||
|
|
||||||
|
### 2.5 Performance Budget Allocation
|
||||||
|
|
||||||
|
Allocating latency and throughput budgets across subsystems in a system with tight end-to-end requirements requires queuing theory and performance modelling:
|
||||||
|
|
||||||
|
- **Amdahl's Law:** The speedup from improving a component is bounded by the fraction of time spent in that component. Subsystems that are rarely on the critical path do not benefit from tight latency budgets.
|
||||||
|
- **Little's Law:** In a stable queuing system, L = λW (average number of requests in the system = arrival rate × average time each request spends in the system). Used to reason about throughput and latency trade-offs in the scheduler and network stack.
|
||||||
|
- **Tail latency management:** At 1024 nodes, the 99th-percentile latency of any distributed operation is the primary engineering challenge. The Tail at Scale (Dean and Barroso, 2013) is mandatory reading.
|
||||||
|
- **Latency budget decomposition:** If the end-to-end target is 100 microseconds for a distributed system call, how many microseconds are allocated to: kernel entry, capability check, network transmission, remote kernel entry, computation, response, network return? This decomposition is `council-synth`'s responsibility.
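
The arithmetic behind these points can be sketched in a few lines of Python. The per-stage allocations below are hypothetical numbers chosen only to sum to less than the 100 microsecond example target, and the tail calculation assumes that nodes are slow independently of each other, which a real cluster will only approximate.

```python
END_TO_END_TARGET_US = 100.0  # example target from the text above

# Hypothetical per-stage allocations in microseconds (illustrative only).
budget_us = {
    "kernel entry": 2.0,
    "capability check": 3.0,
    "network transmission": 30.0,
    "remote kernel entry": 2.0,
    "computation": 25.0,
    "response": 3.0,
    "network return": 30.0,
}


def slack_us(budget, target):
    """Remaining headroom; a negative value means the decomposition is over budget."""
    return target - sum(budget.values())


def prob_any_slow(n_nodes, per_node_slow_prob=0.01):
    """Chance that at least one of n independent nodes exceeds its own P99."""
    return 1.0 - (1.0 - per_node_slow_prob) ** n_nodes


def in_flight(arrival_rate_per_s, mean_latency_s):
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * mean_latency_s


print(slack_us(budget_us, END_TO_END_TARGET_US))  # 5.0
print(prob_any_slow(1024))       # very close to 1.0
print(in_flight(1_000_000, 100e-6))
```

Under the independence assumption, an operation that waits on all 1024 nodes almost always encounters at least one node slower than its own 99th percentile, which is why the per-node P99 cannot simply be reused as the P99 of a cluster-wide operation.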
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Dean, J. and Barroso, L. "The Tail at Scale." *Communications of the ACM* 56(2), 2013. Demonstrates that at large scale, tail latency dominates user-visible performance. Introduces hedged requests and tied requests as mitigation strategies. Mandatory reading.
|
||||||
|
- Barroso, L., Hölzle, U., and Ranganathan, P. *The Datacenter as a Computer* (3rd ed., 2018). Morgan and Claypool. Chapter 5 on latency budget analysis in warehouse-scale systems.
|
||||||
|
- Jain, R. *The Art of Computer Systems Performance Analysis* (1991). Wiley. Chapters 28–30 on queuing networks and latency decomposition.
|
||||||
|
- Leverich, J. and Kozyrakis, C. "Reconciling High Server Utilization and Sub-Millisecond Quality-of-Service." *EuroSys 2014*. Directly relevant to the latency vs. utilisation trade-off in the scheduler.
|
||||||
|
|
||||||
|
### 2.6 Consistency Model Alignment
|
||||||
|
|
||||||
|
The most technically complex task for `council-synth` is ensuring that the consistency models chosen by `council-mem`, `council-net`, and `council-fault` are mutually compatible and that the composed model matches what `council-api` has promised to applications.
|
||||||
|
|
||||||
|
The consistency model stack:
|
||||||
|
- `council-mem` chooses a hardware-level memory consistency model for GPU operations (likely release-acquire, matching Hopper/Blackwell)
|
||||||
|
- `council-net` chooses a network-level consistency model for distributed operations (likely causal consistency or linearisability for control plane, eventual consistency for data plane)
|
||||||
|
- `council-fault` chooses a consistency model for state that survives node failures (determined by the consensus protocol choice)
|
||||||
|
- `council-api` promises applications a composed consistency model
|
||||||
|
|
||||||
|
If these choices are incompatible — for example, if the application API promises linearisability but the network stack only provides causal consistency — the system is unsound. `council-synth` must detect this and convene a resolution mini-council.
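
The check described above can be sketched as a comparison of strength levels. This is a deliberate simplification: the `Consistency` enum imposes a total order on three models, whereas real consistency models form a partial order and composing layers is subtler than taking the weakest. All names in the sketch are hypothetical.

```python
from enum import IntEnum


class Consistency(IntEnum):
    """Simplified total order, weakest to strongest (an assumption of this sketch)."""
    EVENTUAL = 1
    CAUSAL = 2
    LINEARISABLE = 3


def offending_layers(promised, provided):
    """Councils whose layer is weaker than what the API promises."""
    return sorted(name for name, level in provided.items() if level < promised)


def is_sound(promised, provided):
    """True if no underlying layer is weaker than the promised model."""
    return not offending_layers(promised, provided)


# Hypothetical choices by two councils.
provided = {
    "council-net": Consistency.CAUSAL,
    "council-fault": Consistency.LINEARISABLE,
}

print(is_sound(Consistency.LINEARISABLE, provided))          # False
print(offending_layers(Consistency.LINEARISABLE, provided))  # ['council-net']
print(is_sound(Consistency.CAUSAL, provided))                # True
```

A non-empty result from `offending_layers` corresponds to the situation in which a resolution mini-council would be convened.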
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Herlihy, M. and Wing, J. "Linearizability: A Correctness Condition for Concurrent Objects." *ACM Transactions on Programming Languages and Systems* 12(3), 1990. (Cited in `council-verify` brief; essential here too.)
|
||||||
|
- Bailis, P. et al. "Highly Available Transactions: Virtues and Limitations." *VLDB 2014*. Proves which consistency models are achievable without sacrificing availability (CAP theorem refinement). Directly constrains what `council-fault` can promise to `council-api`.
|
||||||
|
- Bailis, P. and Ghodsi, A. "Eventual Consistency Today: Limitations, Extensions, and Beyond." *ACM Queue* 11(3), 2013.
|
||||||
|
- Brewer, E. "CAP Twelve Years Later: How the 'Rules' Have Changed." *IEEE Computer* 45(2), 2012. Brewer's own retrospective on CAP, which is more nuanced than the original conjecture. Required for understanding the actual trade-off space.
|
||||||
|
- Crooks, N. et al. "Seeing is Believing: A Client-Centric Specification of Database Isolation." *PODC 2017*. Client-centric consistency definitions that map to what an application developer actually observes.
|
||||||
|
|
||||||
|
### 2.7 Resource Contention Arbitration
|
||||||
|
|
||||||
|
Multiple subsystems compete for shared physical resources: CPU cycles, GPU memory bandwidth, NVLink bandwidth, Weka FS I/O bandwidth, and DRAM. `council-synth` must establish a policy for how these contentions are resolved when subsystem specs make conflicting demands on the same resource.
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Zaharia, M. et al. "Improving MapReduce Performance in Heterogeneous Environments." *OSDI 2008*. Speculative execution and heterogeneity-aware scheduling; relevant to scheduler-network resource contention.
|
||||||
|
- Delimitrou, C. and Kozyrakis, C. "Paragon: QoS-Aware Scheduling for Heterogeneous Datacenters." *ASPLOS 2013*. QoS-aware scheduling as a framework for resource contention arbitration.
|
||||||
|
- Grandl, R. et al. "Multi-Resource Packing for Cluster Schedulers." *SIGCOMM 2014*. Resource packing across multiple resource dimensions (CPU, memory, network, GPU) — the exact problem for the 1024-node cluster.
|
||||||
|
|
||||||
|
### 2.8 Evolutionary Architecture
|
||||||
|
|
||||||
|
The DistOS specification will evolve after Day 14. `council-synth` must specify how:
|
||||||
|
|
||||||
|
- New subsystems are added without breaking existing subsystem interfaces
|
||||||
|
- Interface contracts are changed with defined compatibility guarantees
|
||||||
|
- Architectural decisions are revisited (when is it permissible to reverse a previously made decision, and what is the process?)
|
||||||
|
- The coherence invariants are maintained as new agents join and contribute
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Ford, N. et al. *Building Evolutionary Architectures* (2nd ed., 2022). O'Reilly. The definitive treatment of architecture that supports guided change. Chapter 2 on fitness functions (automated checks that verify architectural properties) is directly applicable to UCXL-based coherence monitoring.
|
||||||
|
- Garlan, D. and Perry, D. "Introduction to the Special Issue on Software Architecture." *IEEE Transactions on Software Engineering* 21(4), 1995.
|
||||||
|
- Fowler, M. *Patterns of Enterprise Application Architecture* (2002). Addison-Wesley. Chapter on evolving architectures.
|
||||||
|
- Nadareishvili, I. et al. *Microservice Architecture* (2016). O'Reilly. Chapter 4 on evolutionary API design patterns.
|
||||||
|
|
||||||
|
### 2.9 ATAM Applied to Distributed OS Design
|
||||||
|
|
||||||
|
While ATAM was developed for traditional software architectures, it applies to distributed OS design with some adaptations:
|
||||||
|
|
||||||
|
- **Quality attributes for a distributed OS:** performance (latency, throughput, tail latency), reliability (MTBF, MTTR, availability under f failures), security (isolation strength, attack surface), maintainability (spec complexity, evolvability), developer experience (API ergonomics, debuggability), and scalability (behaviour as the cluster grows towards 1024 nodes)
|
||||||
|
- **Utility tree for DistOS:** `council-synth` must construct and maintain the DistOS utility tree mapping quality attribute scenarios to architectural decisions
|
||||||
|
- **Sensitivity and trade-off points:** the ATAM outputs that are most valuable to `council-synth` — decisions that are tightly coupled to quality attributes (sensitivity points) and decisions where optimising for one attribute degrades another (trade-off points)
|
||||||
|
|
||||||
|
Key examples of trade-off points expected in DistOS:
|
||||||
|
- **Linearisability vs. availability:** Making all distributed operations linearisable (strongest consistency) requires coordination that reduces availability under network partitions (CAP constraint).
|
||||||
|
- **Memory coherence vs. bandwidth:** Maintaining fine-grained cache coherence across 1024 nodes provides stronger consistency guarantees but consumes NVLink bandwidth for coherence traffic.
|
||||||
|
- **POSIX compatibility vs. API ergonomics:** Supporting the POSIX shim layer adds complexity and performance overhead to the clean-slate API.
|
||||||
|
- **Formal verification completeness vs. project timeline:** Fully verifying every subsystem spec in 14 days may not be feasible; some subsystems may require deferring full verification.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 3. Agent Roles
|
||||||
|
|
||||||
|
| Role | Count | Responsibilities |
|------|-------|-----------------|
| Lead Synthesiser | 1 | Chairs all mini-councils; owns the GACD; reports to `council-meta`; makes binding arbitration recommendations; maintains the global UCXL subscription |
| Conflict Detectors | 30 | Monitor UCXL feeds from all councils; parse incoming DRs; identify potential conflicts; triage and escalate genuine conflicts to Lead Synthesiser |
| Trade-off Analysts | 30 | Apply ATAM methodology to evaluate conflict resolution alternatives; construct utility tree entries for cross-cutting decisions; produce trade-off analysis reports |
| Arbitrators | 20 | Facilitate mini-council sessions; ensure both sides are heard; produce resolution DRs; track resolution implementation |
| Coherence Auditors | 19 | Verify global consistency: interface contracts are satisfied, consistency models are aligned, performance budget is respected, dependency graph is acyclic |
|
||||||
|
|
||||||
|
**Total:** 100 agents
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 4. Key Deliverables
|
||||||
|
|
||||||
|
All artifacts use the pattern `ucxl://council-synth:{role}@DistOS:integration/^^/{artifact-type}/{name}`.
|
||||||
|
|
||||||
|
### 4.1 Global Architectural Coherence Document (GACD)
|
||||||
|
|
||||||
|
The living document describing the whole-system architecture as currently agreed. Updated continuously throughout the project.
|
||||||
|
|
||||||
|
```
ucxl://council-synth:lead-synthesiser@DistOS:integration/^^/docs/global-architectural-coherence.md
```
|
||||||
|
|
||||||
|
### 4.2 Conflict Registry
|
||||||
|
|
||||||
|
A machine-readable log of all detected conflicts, their status (open, in-resolution, resolved, deferred), the councils involved, and the UCXL address of the resolution DR.
|
||||||
|
|
||||||
|
```
ucxl://council-synth:conflict-detector@DistOS:integration/^^/registries/conflict-registry.json
```
|
||||||
|
|
||||||
|
### 4.3 Dependency Graph
|
||||||
|
|
||||||
|
A machine-readable directed acyclic graph of all council deliverable dependencies. Updated as deliverables are published and consumed. Rendered as a human-readable diagram for `council-meta` and project stakeholders.
|
||||||
|
|
||||||
|
```
ucxl://council-synth:coherence-auditor@DistOS:integration/^^/graphs/dependency-graph.json
ucxl://council-synth:coherence-auditor@DistOS:integration/^^/graphs/dependency-graph.svg
```
|
||||||
|
|
||||||
|
### 4.4 Performance Budget Allocation Document
|
||||||
|
|
||||||
|
The authoritative record of latency and throughput budget allocations across all subsystems.
|
||||||
|
|
||||||
|
```
ucxl://council-synth:trade-off-analyst@DistOS:integration/^^/docs/performance-budget.md
```
|
||||||
|
|
||||||
|
### 4.5 Consistency Model Alignment Report
|
||||||
|
|
||||||
|
Documents the agreed consistency model stack: hardware-level (memory), network-level (net), fault-tolerant (fault), and application-visible (api). Confirms they are mutually compatible or records the accepted trade-off.
|
||||||
|
|
||||||
|
```
ucxl://council-synth:coherence-auditor@DistOS:integration/^^/reports/consistency-model-alignment.md
```
|
||||||
|
|
||||||
|
### 4.6 Resolution Decision Records
|
||||||
|
|
||||||
|
One DR per resolved conflict. Each DR includes: the conflict description, the councils involved, the alternatives evaluated (ATAM analysis), the decision reached, the rationale, and the spec changes required.
|
||||||
|
|
||||||
|
```
ucxl://council-synth:arbitrator@DistOS:integration/^^/decisions/conflict-{id}-resolution.md
```
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```
ucxl://council-synth:arbitrator@DistOS:integration/^^/decisions/conflict-001-sched-vs-mem-memory-model.md
ucxl://council-synth:arbitrator@DistOS:integration/^^/decisions/conflict-002-net-vs-fault-consistency.md
```
|
||||||
|
|
||||||
|
### 4.7 ATAM Utility Tree
|
||||||
|
|
||||||
|
The structured decomposition of DistOS quality attribute scenarios, architectural approaches, sensitivity points, and trade-off points.
|
||||||
|
|
||||||
|
```
ucxl://council-synth:trade-off-analyst@DistOS:integration/^^/docs/atam-utility-tree.md
```
|
||||||
|
|
||||||
|
### 4.8 Interface Contract Registry
|
||||||
|
|
||||||
|
A registry of all inter-subsystem interface contracts: which council provides what interface, which council consumes it, the contract's UCXL address, and its verification status.
|
||||||
|
|
||||||
|
```
ucxl://council-synth:coherence-auditor@DistOS:integration/^^/registries/interface-contract-registry.json
```
|
||||||
|
|
||||||
|
### 4.9 Evolutionary Architecture Policy
|
||||||
|
|
||||||
|
The specification of how DistOS can evolve post-delivery while maintaining coherence.
|
||||||
|
|
||||||
|
```
ucxl://council-synth:lead-synthesiser@DistOS:integration/^^/policies/evolutionary-architecture.md
```
|
||||||
|
|
||||||
|
### 4.10 Synthesis Final Report
|
||||||
|
|
||||||
|
The end-of-project coherence certification document. States that all detected conflicts are resolved or explicitly deferred with risk assessment, all interface contracts are satisfied, the consistency model is aligned, and the performance budget is consistent.
|
||||||
|
|
||||||
|
```
ucxl://council-synth:lead-synthesiser@DistOS:integration/^^/reports/synthesis-final-report.md
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 5. Decision Points
|
||||||
|
|
||||||
|
All DRs use the address pattern `ucxl://council-synth:lead-synthesiser@DistOS:integration/^^/decisions/{dr-id}.md`.
|
||||||
|
|
||||||
|
### DP-S01: Consistency Model for the Application-Visible API
|
||||||
|
|
||||||
|
This is the most consequential decision in the entire DistOS project. It determines what `council-api` can promise to applications, which in turn constrains what `council-mem`, `council-net`, and `council-fault` must deliver. The options are:
|
||||||
|
|
||||||
|
- **Linearisability everywhere:** Strongest guarantee, highest cost. Requires global coordination for every distributed operation. Not recommended for the data plane; may be appropriate for the control plane.
|
||||||
|
- **Linearisability for control, causal for data:** A two-tier model. Control-plane operations (node management, job submission, lock acquisition) are linearisable. Data-plane operations (tensor reads/writes, streaming I/O) are causally consistent. This is the recommended starting position.
|
||||||
|
- **Per-operation consistency annotation:** Applications declare the consistency level required per operation. Maximum flexibility, maximum complexity, highest risk of programmer error.
|
||||||
|
|
||||||
|
**Deciding parties:** Lead Synthesiser, `council-mem`, `council-net`, `council-fault`, `council-api`, Trade-off Analysts
|
||||||
|
|
||||||
|
### DP-S02: Performance Budget Top-Level Allocation
|
||||||
|
|
||||||
|
Establish the top-level latency targets that subsystem budgets are derived from. Candidate targets for a 1024-node Hopper/Blackwell cluster:
|
||||||
|
|
||||||
|
| Operation | Target P50 | Target P99 |
|-----------|-----------|-----------|
| Local GPU kernel launch | 2 μs | 10 μs |
| GPU memory allocation | 5 μs | 50 μs |
| Node-local system call round-trip | 1 μs | 5 μs |
| Distributed RPC (same rack) | 10 μs | 100 μs |
| Distributed RPC (cross-rack) | 50 μs | 500 μs |
| Consensus write (Raft/Paxos) | 500 μs | 5 ms |
| Collective AllReduce (1024 nodes) | 1 ms | 10 ms |
|
||||||
|
|
||||||
|
These numbers must be validated against the known hardware capabilities of the target cluster (Hopper NVLink 4.0 bandwidth, InfiniBand HDR 200 Gbps, Weka FS parallel I/O) and then allocated to subsystems.
|
||||||
|
|
||||||
|
**Deciding parties:** Lead Synthesiser, Trade-off Analysts, `council-sched`, `council-mem`, `council-net`
|
||||||
|
|
||||||
|
### DP-S03: Conflict Escalation Policy
|
||||||
|
|
||||||
|
Define the escalation path when `council-synth` and the affected councils cannot reach agreement in a mini-council. Options: (a) Lead Synthesiser makes a binding ruling, (b) escalate to `council-meta` for project-level arbitration, (c) put the conflict to a vote of all council leads. A clear escalation policy is essential to prevent deadlocked conflicts from blocking the project timeline.
|
||||||
|
**Deciding parties:** Lead Synthesiser, `council-meta`
|
||||||
|
|
||||||
|
### DP-S04: Global UCXL Subscription Architecture
|
||||||
|
|
||||||
|
`council-synth` Conflict Detectors must monitor DR feeds from 11 other councils simultaneously. Define the UCXL subscription topology: (a) each Conflict Detector subscribes to a subset of councils (partitioned), (b) all Conflict Detectors subscribe to all councils with deduplication (redundant, robust), or (c) a tiered model with dedicated detectors per council that escalate to a shared pool. The choice affects latency to detection and resilience to Conflict Detector failures.
|
||||||
|
**Deciding parties:** Lead Synthesiser, Coherence Auditors
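
Option (a), the partitioned topology, can be sketched as a round-robin assignment. The detector and council identifiers below are placeholders rather than real agent or council names, and the function `partition_detectors` is hypothetical; the sketch only shows how 30 detectors spread across 11 feeds.

```python
def partition_detectors(detectors, councils):
    """Assign each detector to one council feed in round-robin order."""
    assignment = {council: [] for council in councils}
    for i, detector in enumerate(detectors):
        assignment[councils[i % len(councils)]].append(detector)
    return assignment


# Placeholder identifiers, for illustration only.
detectors = [f"detector-{i:02d}" for i in range(30)]
councils = [f"council-{i:02d}" for i in range(11)]

assignment = partition_detectors(detectors, councils)
sizes = sorted(len(watchers) for watchers in assignment.values())
print(sizes[0], sizes[-1])  # 2 3
```

With this split every feed has two or three watchers, so a single Conflict Detector failure leaves no feed unmonitored. Option (b) trades that partitioning for full redundancy at the cost of deduplication work.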
|
||||||
|
|
||||||
|
### DP-S05: Architectural Fitness Functions
|
||||||
|
|
||||||
|
Following Ford et al. (*Building Evolutionary Architectures*), define at least five automated fitness functions (machine-checkable assertions about the DistOS architecture) that can be run continuously to verify that the architecture remains coherent as it evolves. Examples:
|
||||||
|
|
||||||
|
- No interface contract is consumed without a published provider
|
||||||
|
- No performance budget allocation exceeds the top-level target
|
||||||
|
- Every DR cites at least one research reference
|
||||||
|
- No cycle exists in the dependency graph
|
||||||
|
- Every public system call has an Alloy-verified interface contract
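
Two of the example fitness functions can be sketched as plain Python predicates over artifact metadata. The record layout (`id`, `references`), the operation key, and the per-council allocations are assumptions made for illustration; the real checks would run against whatever schema the UCXL artifact store exposes.

```python
def drs_without_references(drs):
    """IDs of Decision Records that cite no research reference."""
    return sorted(dr["id"] for dr in drs if not dr.get("references"))


def operations_over_budget(allocations_us, targets_us):
    """Operations whose subsystem allocations sum above the top-level target."""
    return sorted(
        op for op, parts in allocations_us.items()
        if sum(parts.values()) > targets_us[op]
    )


# Hypothetical Decision Record metadata.
drs = [
    {"id": "DR-001", "references": ["Dean and Barroso 2013"]},
    {"id": "DR-002", "references": []},
]
# Top-level target in microseconds, then hypothetical per-council allocations.
targets_us = {"rpc-same-rack-p50": 10}
allocations_us = {
    "rpc-same-rack-p50": {"council-sched": 2, "council-net": 7, "council-sec": 2},
}

print(drs_without_references(drs))                         # ['DR-002']
print(operations_over_budget(allocations_us, targets_us))  # ['rpc-same-rack-p50']
```

A fitness function passes when its result is empty, which makes each one straightforward to run continuously and to report on.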
|
||||||
|
|
||||||
|
**Deciding parties:** Lead Synthesiser, Coherence Auditors, `council-verify`
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 6. Dependencies on Other Councils
|
||||||
|
|
||||||
|
`council-synth` is the most broadly dependent council in DistOS. It depends on all councils (as inputs to synthesis) and all councils depend on it (as the source of coherence directives and conflict resolutions).
|
||||||
|
|
||||||
|
| Council | Relationship | What council-synth consumes | What council-synth produces |
|---------|-------------|----------------------------|----------------------------|
| `council-sched` | Bidirectional | All DRs; scheduling assumptions; performance requirements | Conflict resolutions; performance budget allocation; coherence directives |
| `council-mem` | Bidirectional | Memory model choice; consistency model; Weka FS integration assumptions | Consistency model alignment ruling; coherence directives |
| `council-net` | Bidirectional | Network protocol choices; consistency guarantees; latency estimates | Consistency model alignment ruling; performance budget for network |
| `council-fault` | Bidirectional | Consensus protocol choice; failure model; recovery semantics | Coherence directives on fault/consistency interaction |
| `council-sec` | Bidirectional | Security model; isolation assumptions; capability model | Coherence directives on sec/API interaction |
| `council-telemetry` | Consuming | Metering model; performance overhead estimates | Performance budget for telemetry overhead |
| `council-verify` | Providing specifications to | Verification results that inform synthesis | Conflict resolution DRs; interface contract registry |
| `council-qa` | Consuming | Test failure reports that reveal architectural inconsistencies | Architectural issue reports for QA follow-up |
| `council-api` | Bidirectional | API philosophy decisions; consistency model promises to applications | Consistency model ruling; conflict resolutions affecting API |
| `council-docs` | Providing to | N/A | GACD sections for incorporation into final spec |
| `council-arch` | Providing to | N/A | Conflict registry and resolution DRs for archaeological narrative |
| `council-meta` | Reporting to | Project health directives; escalation rulings | Blocked council reports; synthesis status updates |
|
||||||
|
|
||||||
|
**Dependency inversion note:** `council-synth` must be operational and subscribed to all UCXL feeds before the other councils begin producing DRs. It is the first council to be fully formed and the last to be disbanded.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 7. WHOOSH Configuration
|
||||||
|
|
||||||
|
### 7.1 Team Formation
|
||||||
|
|
||||||
|
```yaml
council_id: council-synth
display_name: "Inter-Council Synthesis Council"
target_size: 100
formation_strategy: competency_weighted
priority: critical  # Must be formed before all other councils begin producing DRs
required_roles:
  - role: lead-synthesiser
    count: 1
    persona: systems-analyst
    competencies: [architectural-analysis, atam, distributed-systems, conflict-resolution, ucxl]
  - role: conflict-detector
    count: 30
    persona: systems-analyst
    competencies: [design-review, assumption-analysis, distributed-systems, dependency-tracking]
  - role: trade-off-analyst
    count: 30
    persona: systems-analyst
    competencies: [atam, performance-modelling, queuing-theory, cap-theorem, consistency-models]
  - role: arbitrator
    count: 20
    persona: systems-analyst
    competencies: [facilitation, conflict-resolution, decision-records, distributed-systems]
  - role: coherence-auditor
    count: 19
    persona: systems-analyst
    competencies: [interface-contracts, dependency-graphs, consistency-models, ucxl, fitness-functions]
```
|
||||||
|
|
||||||
|
### 7.2 Quorum Rules
|
||||||
|
|
||||||
|
```yaml
quorum:
  decision_threshold: 0.6  # 60% of active agents for routine decisions
  lead_synthesiser_binding: true  # Lead Synthesiser's ruling is binding after mini-council process
  arbitration_quorum: 3  # Minimum 3 Arbitrators must participate in any mini-council
  conflict_detection_threshold: 1  # A single Conflict Detector raising a flag is sufficient to open a conflict
  coherence_certification:
    required_auditors: 5  # At least 5 Coherence Auditors must sign the final coherence certification
  escalation_to_meta:
    trigger: stalled_conflict  # Conflict unresolved for more than 6 hours
    target: council-meta
  performance_budget_change:
    threshold: 0.9  # 90% supermajority required to change top-level performance targets
    required_councils: [council-sched, council-mem, council-net]
```
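
To make the thresholds concrete, the vote counts they imply can be worked out as follows. This sketch assumes that a threshold means "at least this share of active agents", which the configuration does not state explicitly; the function names are hypothetical. Exact fractions are used so that values such as 0.6 do not pick up floating-point rounding error.

```python
import math
from fractions import Fraction


def votes_needed(active_agents, threshold):
    """Smallest whole number of votes that reaches the threshold share."""
    return math.ceil(Fraction(str(threshold)) * active_agents)


def mini_council_quorate(arbitrators_present, arbitration_quorum=3):
    """True if enough Arbitrators are present to hold a mini-council."""
    return arbitrators_present >= arbitration_quorum


print(votes_needed(100, 0.6))   # 60
print(votes_needed(100, 0.9))   # 90
print(votes_needed(97, 0.6))    # 59
print(mini_council_quorate(2))  # False
```

With all 100 agents active, a routine decision needs 60 votes; if three agents are inactive, the requirement falls to 59.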
|
||||||
|
|
||||||
|
### 7.3 Subchannels
|
||||||
|
|
||||||
|
```yaml
subchannels:
  - id: synth-all-council-dr-feed
    subscribers: [conflict-detector]
    purpose: "Real-time UCXL subscription to Decision Records from all councils. Primary input feed for conflict detection."
    ucxl_feed: "ucxl://council-*:*@DistOS:*/^^/decisions/*"
    delivery: streaming
    priority: high

  - id: synth-conflict-triage
    subscribers: [conflict-detector, lead-synthesiser]
    purpose: "Triage queue for potential conflicts. Conflict Detectors post candidates here for Lead Synthesiser review before a formal conflict is opened."
    ucxl_feed: "ucxl://council-synth:conflict-detector@DistOS:integration/^^/triage/*"

  - id: synth-mini-councils
    subscribers: [lead-synthesiser, arbitrator, affected_council_representatives]
    purpose: "Active mini-council sessions. Each conflict gets a dedicated ephemeral sub-thread here."
    ucxl_feed: "ucxl://council-synth:arbitrator@DistOS:integration/^^/mini-councils/*"
    membership: dynamic  # Representatives from affected councils join per-conflict

  - id: synth-tradeoff-analysis
    subscribers: [trade-off-analyst, lead-synthesiser]
    purpose: "ATAM utility tree construction and trade-off analysis. Trade-off Analysts post analyses here for Lead Synthesiser input to mini-councils."
    ucxl_feed: "ucxl://council-synth:trade-off-analyst@DistOS:integration/^^/tradeoffs/*"

  - id: synth-performance-budget
    subscribers: [trade-off-analyst, coherence-auditor, lead-synthesiser]
    purpose: "Performance budget negotiation and allocation tracking."
    ucxl_feed: "ucxl://council-synth:trade-off-analyst@DistOS:integration/^^/docs/performance-budget.md"

  - id: synth-coherence-audit
    subscribers: [coherence-auditor, lead-synthesiser]
    purpose: "Ongoing coherence audits: interface contract checks, consistency model alignment, dependency graph updates."
    ucxl_feed: "ucxl://council-synth:coherence-auditor@DistOS:integration/^^/audits/*"

  - id: synth-gacd-updates
    subscribers: [lead-synthesiser, coherence-auditor]
    purpose: "Coordination of GACD updates. Coherence Auditors propose section updates; Lead Synthesiser approves."
    ucxl_feed: "ucxl://council-synth:lead-synthesiser@DistOS:integration/^^/docs/global-architectural-coherence.md"

  - id: synth-meta-reporting
    subscribers: [lead-synthesiser]
    purpose: "Upward reporting to council-meta: blocked councils, open conflicts, synthesis health metrics."
    ucxl_feed: "ucxl://council-synth:lead-synthesiser@DistOS:integration/^^/reports/meta-status-*"
```
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 8. Success Criteria
|
||||||
|
|
||||||
|
1. **Zero open conflicts at Day 14:** Every conflict identified in the conflict registry has been either resolved (with a published resolution DR) or explicitly deferred with a documented risk assessment, accepted by `council-meta`.
|
||||||
|
|
||||||
|
2. **Consistency model certified:** The Consistency Model Alignment Report exists and is signed by at least 5 Coherence Auditors, confirming that `council-mem`, `council-net`, `council-fault`, and `council-api` consistency model choices are mutually compatible.
|
||||||
|
|
||||||
|
3. **Performance budget allocated and consistent:** Every subsystem has a latency budget derived from the top-level allocations in DP-S02. No subsystem's claimed performance targets exceed its allocated budget.
|
||||||
|
|
||||||
|
4. **Dependency graph is acyclic:** The dependency graph has been verified cycle-free at the end of each project phase. Any cycles detected were resolved within 24 hours.
|
||||||
|
|
||||||
|
5. **Interface contract registry complete:** Every inter-subsystem interface has an entry in the interface contract registry. Every contract is either verified by `council-verify` or explicitly flagged as pending with a resolution date.
|
||||||
|
|
||||||
|
6. **GACD maintained throughout:** The Global Architectural Coherence Document has been updated at least once per project day. It accurately reflects the state of all cross-cutting agreements at the time of project completion.
|
||||||
|
|
||||||
|
7. **ATAM utility tree complete:** The utility tree covers all six core quality attributes (performance, reliability, security, maintainability, developer experience, scalability) with at least three scenarios each, and identifies all major sensitivity and trade-off points.
|
||||||
|
|
||||||
|
8. **Response SLA met:** `council-synth` has acknowledged every conflict flagged by a Conflict Detector within 2 hours and opened a mini-council (or closed the non-conflict) within 4 hours.
|
||||||
|
|
||||||
|
9. **Evolutionary architecture policy ratified:** The policy exists, has been reviewed by `council-meta`, and is included in the final DistOS specification.
|
||||||
|
|
||||||
|
10. **Fitness functions defined and active:** At least five architectural fitness functions are defined in DP-S05 and are running against the UCXL artifact store by Day 6.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 9. Timeline
|
||||||
|
|
||||||
|
### Phase 1: Research (Days 1–3)
|
||||||
|
|
||||||
|
`council-synth` is unique in that it must be operational from Day 1 — before any other council produces a DR. The first three days are therefore split between research (learning the methods) and operational setup (configuring the monitoring infrastructure).
|
||||||
|
|
||||||
|
- Lead Synthesiser and Coherence Auditors study ATAM methodology; conduct a dry run on a simple architectural question to validate process
|
||||||
|
- Trade-off Analysts study consistency model taxonomy (Viotti and Vukolić 2016; Bailis et al.); build reference table of consistency models available on the target hardware
|
||||||
|
- Conflict Detectors configure UCXL subscriptions to all council DR feeds; establish the triage queue; validate that feeds are operational
|
||||||
|
- Lead Synthesiser drafts the Conflict Resolution Process document (the rules of engagement for mini-councils)
|
||||||
|
- Performance budget top-level targets researched from hardware specifications; draft DP-S02 targets
|
||||||
|
- Construct initial dependency graph from the Project Constitution council structure
|
||||||
|
- Deliverable: `ucxl://council-synth:lead-synthesiser@DistOS:integration/^^/docs/conflict-resolution-process.md`
|
||||||
|
- Deliverable: `ucxl://council-synth:coherence-auditor@DistOS:integration/^^/graphs/dependency-graph.json` (initial)
|
||||||
|
|
||||||
|
### Phase 2: Architecture (Days 3–6)
|
||||||
|
|
||||||
|
- Resolve DP-S01 (consistency model), DP-S02 (performance budget), DP-S03 (escalation policy), DP-S04 (UCXL subscription architecture)
|
||||||
|
- First DRs arrive from subsystem councils; Conflict Detectors begin live monitoring
|
||||||
|
- Trade-off Analysts begin constructing the ATAM utility tree as architectural decisions arrive
|
||||||
|
- First mini-councils expected as subsystem councils begin making choices; arbitration process validated against real conflicts
|
||||||
|
- Coherence Auditors begin populating the interface contract registry as councils publish their interface requirements
|
||||||
|
- Performance budget draft distributed to all councils for comment
|
||||||
|
- Deliverable: `ucxl://council-synth:trade-off-analyst@DistOS:integration/^^/docs/atam-utility-tree.md` (draft)
|
||||||
|
- Deliverable: `ucxl://council-synth:trade-off-analyst@DistOS:integration/^^/docs/performance-budget.md` (draft)
|
||||||
|
|
||||||
|
### Phase 3: Formal Specification (Days 6–10)
|
||||||
|
|
||||||
|
- Peak conflict detection period: subsystem councils are writing formal specs and making concrete decisions; incompatibilities surface rapidly
|
||||||
|
- All Conflict Detectors at full monitoring capacity; daily conflict triage with Lead Synthesiser
|
||||||
|
- Trade-off Analysts provide ATAM analyses for every conflict requiring a mini-council
|
||||||
|
- Arbitrators run mini-councils; resolution DRs published within 8 hours of mini-council completion
|
||||||
|
- Coherence Auditors verify that interface contracts are being delivered to `council-verify` on schedule; flag any delays
|
||||||
|
- Consistency model alignment assessed as `council-mem`, `council-net`, and `council-fault` commit their specs
|
||||||
|
- Dependency graph updated daily; critical path reviewed to identify at-risk deliverables
|
||||||
|
- GACD updated with each major architectural decision
|
||||||
|
- Deliverable: First batch of conflict resolution DRs at `ucxl://council-synth:arbitrator@DistOS:integration/^^/decisions/*`
|
||||||
|
|
||||||
|
### Phase 4: Integration (Days 10–12)
|
||||||
|
|
||||||
|
- Conflict detection continues but should be declining in frequency as major architectural decisions have been made
|
||||||
|
- Focus shifts to Coherence Auditors: verify that integration-phase work respects all agreed constraints
|
||||||
|
- Consistency model alignment report drafted and circulated to affected councils for confirmation
|
||||||
|
- Performance budget final allocations agreed with all subsystem councils
|
||||||
|
- Evolutionary architecture policy drafted
|
||||||
|
- Dependency graph verified acyclic; any final critical-path risks escalated to `council-meta`
|
||||||
|
- `council-synth` issues formal directives to any council whose integration-phase work introduces new conflicts
|
||||||
|
- Deliverable: `ucxl://council-synth:coherence-auditor@DistOS:integration/^^/reports/consistency-model-alignment.md`
|
||||||
|
|
||||||
|
### Phase 5: Documentation (Days 12–14)
|
||||||
|
|
||||||
|
- Coherence Auditors conduct the final coherence audit; sign the synthesis final report
|
||||||
|
- All conflict registry entries confirmed resolved or deferred with accepted risk
|
||||||
|
- GACD finalised and submitted to `council-docs` for incorporation into the master specification
|
||||||
|
- Evolutionary architecture policy ratified by `council-meta` and published
|
||||||
|
- Lead Synthesiser writes the synthesis narrative for `council-arch`'s archaeological record
|
||||||
|
- Final architectural fitness functions verified active in the UCXL artifact store
|
||||||
|
- Deliverable: `ucxl://council-synth:lead-synthesiser@DistOS:integration/^^/reports/synthesis-final-report.md`
|
||||||
|
- Deliverable: `ucxl://council-synth:lead-synthesiser@DistOS:integration/^^/policies/evolutionary-architecture.md`
|
||||||
378
councils/10-qa-adversarial-testing.md
Normal file
@@ -0,0 +1,378 @@
|
|||||||
|
# Council Design Brief: Quality Assurance & Adversarial Testing
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Council Identification
|
||||||
|
|
||||||
|
| Field | Value |
|-------|-------|
| **Council ID** | `council-qa` |
| **Mission** | Adversarially validate the entire DistOS specification and implementation surface through systematic fuzz testing, chaos engineering, Byzantine fault simulation, and property-based verification, so that the behaviour the formal specification proves correct (council-verify's domain) matches what the system actually does under hostile, degraded, and pathological conditions. |
| **UCXL Base Address** | `ucxl://council-qa:*@DistOS:qa/*^/` |
| **Agent Count** | ~60 agents |
| **Operates From** | Day 3 (Architecture phase) through Day 14 (Documentation close) |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Scope and Responsibilities
|
||||||
|
|
||||||
|
Council-qa owns the gap between proof and reality. Formal verification proves that a model is internally consistent; adversarial testing gathers evidence that the model accurately describes the system as built. This council is responsible for:
|
||||||
|
|
||||||
|
- Designing and executing a comprehensive adversarial test suite against all DistOS subsystems
|
||||||
|
- Validating that formal specifications produced by council-verify match observable system behaviour under both normal and failure conditions
|
||||||
|
- Discovering specification ambiguities by finding inputs or conditions that produce unexpected or underspecified behaviour
|
||||||
|
- Stress-testing all published API surfaces (received from council-api) for correctness, safety, and resilience
|
||||||
|
- Simulating Byzantine, crash-recovery, and omission faults across the 1024-node GPU cluster topology
|
||||||
|
- Injecting faults at the hardware layer (GPU errors, NVLink degradation, Weka FS unavailability, InfiniBand partition events) and observing correctness of system response
|
||||||
|
- Benchmarking performance under adversarial scheduling pressure, memory fragmentation, and concurrent failure scenarios
|
||||||
|
- Performing structured penetration testing of the DistOS security model
|
||||||
|
- Running deterministic simulation testing to reproduce races and transient faults at will
|
||||||
|
- Reporting all discovered defects, ambiguities, and underspecifications back to originating councils with UCXL-addressable evidence chains
|
||||||
|
|
||||||
|
Council-qa does **not** write specifications or make architectural decisions. Its authority is to raise issues and block deliverable acceptance until those issues are resolved or formally accepted as known limitations.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Research Domains
|
||||||
|
|
||||||
|
### 1. Distributed Systems Fault Injection and Chaos Engineering
|
||||||
|
|
||||||
|
**Core framework:** Jepsen (Kyle Kingsbury) — the canonical methodology for testing distributed system consistency under network partitions, clock skew, and process crashes. DistOS must pass Jepsen-equivalent analysis for all claims of linearisability, serializability, or causal consistency made in the formal spec.
|
||||||
|
|
||||||
|
Key Jepsen analyses to replicate:
|
||||||
|
- Partition healing with in-flight GPU kernel state
|
||||||
|
- Split-brain detection in the consensus layer
|
||||||
|
- Clock skew effects on distributed scheduling decisions
|
||||||
|
- Write visibility after node rejoins
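The "write visibility after node rejoins" analysis can be illustrated with a small history checker. This is a hedged sketch, not Jepsen's own checker: it assumes a single register, unique write values, and a hypothetical `Op` record with invocation and completion timestamps. It flags only one necessary condition for linearisability (a read must not return a value that was already overwritten before the read began) and is not a full linearisability check.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    kind: str      # "write" or "read"
    value: int     # value written, or value the read returned
    start: float   # invocation time
    end: float     # completion time


def stale_reads(history):
    """Return reads that observed a value overwritten before they began.

    Assumes every write stores a distinct value, so a read's value
    identifies exactly one write.
    """
    writes = [op for op in history if op.kind == "write"]
    stale = []
    for read in (op for op in history if op.kind == "read"):
        observed = next((w for w in writes if w.value == read.value), None)
        if observed is None:
            continue  # initial or unknown value: outside this sketch's scope
        # A write that started after the observed write finished, and that
        # itself finished before the read started, must be visible.
        if any(w.start > observed.end and w.end < read.start for w in writes):
            stale.append(read)
    return stale


history = [
    Op("write", 1, 0.0, 1.0),
    Op("write", 2, 2.0, 3.0),
    Op("read", 1, 4.0, 5.0),   # stale: value 2 was fully written by t=3
    Op("read", 2, 4.0, 5.0),   # fine
]
violations = stale_reads(history)
```

In a rejoin scenario, the reads issued by the returning node would be recorded in the same history and checked the same way.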
|
||||||
|
|
||||||
|
**Netflix Chaos Monkey / Simian Army** — production chaos engineering methodology. Adapt for pre-production specification testing: if chaos can be injected into the architecture model, it should be injected into the formal model before any node is touched.
|
||||||
|
|
||||||
|
**Google DiRT (Disaster Recovery Testing)** — structured disaster scenario playbooks. Council-qa will produce DiRT-equivalent playbooks for DistOS, covering datacenter-level failure scenarios for a 1024-node cluster.
|
||||||
|
|
||||||
|
Key papers:
|
||||||
|
- Kingsbury, K. (2013–present). Jepsen analysis series. https://jepsen.io
|
||||||
|
- Gunawi, H. S., et al. (2011). "FATE and DESTINI: A Framework for Cloud Recovery Testing." NSDI 2011.
|
||||||
|
- Basiri, A., et al. (2016). "Chaos Engineering." IEEE Software 33(3).
|
||||||
|
- Leesatapornwongsa, T., et al. (2016). "TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems." ASPLOS 2016.
|
||||||
|
|
||||||
|
### 2. Property-Based and Specification-Conformance Testing
|
||||||
|
|
||||||
|
**Hypothesis (Python) / QuickCheck (Haskell)** — shrinking-based property testing. The council will define properties derived from formal specifications and use automated generators to find minimally failing counterexamples.
|
||||||
|
|
||||||
|
Properties to encode:
|
||||||
|
- Scheduler decisions must be deterministic given the same system state snapshot
|
||||||
|
- Memory allocation must never produce aliased physical GPU memory addresses across tenants
|
||||||
|
- All filesystem operations on Weka must satisfy the POSIX consistency semantics claimed in the spec
|
||||||
|
- Security policy decisions must be monotone (granting more resources never reduces security guarantees)
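As an illustration of how one of these properties might be encoded, the sketch below checks the "no aliased GPU memory across tenants" property against a toy first-fit allocator. Everything here is hypothetical: `ToyAllocator` stands in for the real memory manager, and the generator uses only a seeded `random.Random` from the standard library. Hypothesis or QuickCheck would add automatic shrinking of failing cases on top of the same idea.

```python
import random
from itertools import combinations


class ToyAllocator:
    """Hypothetical first-fit allocator over a flat address range."""

    def __init__(self, size):
        self.size = size
        self.live = {}       # handle -> (tenant, base, length)
        self._next = 0

    def alloc(self, tenant, length):
        base = 0
        for _, b, l in sorted(self.live.values(), key=lambda r: r[1]):
            if b - base >= length:
                break                    # the gap before this region fits
            base = max(base, b + l)
        if base + length > self.size:
            return None                  # out of memory
        handle = self._next
        self._next += 1
        self.live[handle] = (tenant, base, length)
        return handle

    def free(self, handle):
        self.live.pop(handle, None)


def aliased_pairs(allocator):
    """Return live regions of different tenants whose ranges overlap."""
    return [(a, b) for a, b in combinations(allocator.live.values(), 2)
            if a[0] != b[0] and a[1] < b[1] + b[2] and b[1] < a[1] + a[2]]


def check_no_aliasing(seed, steps=200):
    """Run a random alloc/free sequence and assert the property after each step."""
    rng = random.Random(seed)
    allocator = ToyAllocator(size=1024)
    handles = []
    for _ in range(steps):
        if handles and rng.random() < 0.4:
            allocator.free(handles.pop(rng.randrange(len(handles))))
        else:
            h = allocator.alloc(rng.choice("ABC"), rng.randint(1, 64))
            if h is not None:
                handles.append(h)
        assert not aliased_pairs(allocator), f"aliasing found with seed {seed}"
    return True
```

Recording the seed in the failure message means any counterexample can be replayed exactly, which is what a property library does when it reports a failing input.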
|
||||||
|
|
||||||
|
**TLA+ model checking** — in coordination with council-verify, council-qa will encode system properties as TLA+ invariants and run TLC model checking over the state space of the DistOS scheduler and memory manager.
|
||||||
|
|
||||||
|
Key papers:
- Claessen, K., & Hughes, J. (2000). "QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs." ICFP 2000.
- MacIver, D. R., et al. (2019). "Hypothesis: A new approach to property-based testing." JOSS 4(43).
- Lamport, L. (2002). "Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers." Addison-Wesley.
|
||||||
|
|
||||||
|
### 3. Fuzz Testing of API Surfaces
|
||||||
|
|
||||||
|
**AFL++ / libFuzzer** — coverage-guided fuzzing of all DistOS API endpoints. Every API defined by council-api is a fuzzing target.
|
||||||
|
|
||||||
|
**Syzkaller** — adapted for DistOS syscall-equivalent interfaces. The kernel-level scheduling and memory interfaces will be fuzz-tested using a syzkaller-inspired grammar-aware fuzzer.
|
||||||
|
|
||||||
|
**gRPC/Protobuf fuzzing** — for any inter-node RPC interfaces, schema-aware fuzzing to discover serialisation edge cases, integer overflow in field handling, and unintended state transitions.
|
||||||
|
|
||||||
|
Focus areas:
|
||||||
|
- Malformed job submission to the scheduler
|
||||||
|
- Invalid or boundary-condition memory allocation requests
|
||||||
|
- Weka filesystem path traversal and permission edge cases
|
||||||
|
- Security token manipulation and replay attacks
|
||||||
|
- Malformed consensus messages (Raft/Paxos variant used by DistOS)
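The "malformed job submission" focus area can be sketched with a very small mutation fuzzer. This is an illustration under stated assumptions: `parse_job` is a hypothetical parser invented for the example (DistOS's real job-submission format is defined by council-api, not here), and the fuzzer is a plain random mutator without the coverage feedback that AFL++ and libFuzzer provide. The contract tested is that the parser rejects bad input only with `ValueError`; any other exception counts as a finding.

```python
import json
import random


def parse_job(raw: bytes) -> dict:
    """Hypothetical job-submission parser; rejects bad input with ValueError."""
    try:
        doc = json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError("not valid JSON") from exc
    if not isinstance(doc, dict):
        raise ValueError("job must be an object")
    gpus = doc.get("gpus")
    if isinstance(gpus, bool) or not isinstance(gpus, int) or not 1 <= gpus <= 8:
        raise ValueError("gpus must be an integer in 1..8")
    return {"name": str(doc.get("name", "")), "gpus": gpus}


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Apply one to four random bit flips, insertions, or deletions."""
    data = bytearray(seed)
    for _ in range(rng.randint(1, 4)):
        choice = rng.randrange(3)
        pos = rng.randrange(len(data)) if data else 0
        if choice == 0 and data:
            data[pos] ^= 1 << rng.randrange(8)    # bit flip
        elif choice == 1:
            data.insert(pos, rng.randrange(256))  # byte insertion
        elif data:
            del data[pos]                         # byte deletion
    return bytes(data)


def fuzz(target, seed: bytes, iterations: int, rng_seed: int = 0):
    """Return (input, exception) pairs for every undocumented failure."""
    rng = random.Random(rng_seed)
    findings = []
    for _ in range(iterations):
        case = mutate(seed, rng)
        try:
            target(case)
        except ValueError:
            pass                                  # documented rejection path
        except Exception as exc:                  # anything else is a finding
            findings.append((case, repr(exc)))
    return findings
```

Calling `fuzz(parse_job, b'{"name": "train", "gpus": 4}', 2000)` exercises the parser with 2000 mutated submissions; each finding keeps the exact input so it can be minimised and attached to a defect report.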
|
||||||
|
|
||||||
|
Key papers:
- Zalewski, M. (2014). American Fuzzy Lop technical whitepaper.
- Serebryany, K. (2017). "OSS-Fuzz: Google's continuous fuzzing service for open source software." USENIX Security 2017.
- Vishnyakov, A., et al. (2022). "SYDR-Fuzz: Continuous Hybrid Fuzzing and Dynamic Analysis for Security Development Lifecycle." ISPRAS Open 2022.
|
||||||
|
|
||||||
|
### 4. Byzantine Fault Simulation
|
||||||
|
|
||||||
|
For a 1024-node cluster intended for production AI workloads, Byzantine faults (nodes sending malicious, corrupt, or contradictory messages) are a realistic adversary model.
|
||||||
|
|
||||||
|
Council-qa will simulate:
|
||||||
|
- Nodes sending conflicting scheduling state to different peers
|
||||||
|
- GPU memory controllers reporting incorrect allocation metadata
|
||||||
|
- Weka FS nodes returning inconsistent directory listings
|
||||||
|
- Consensus participants casting votes that contradict their local state
|
||||||
|
|
||||||
|
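**BFT protocol validation:** Any Byzantine-fault-tolerant consensus mechanism in DistOS must tolerate f faulty nodes out of N, where N >= 3f + 1. If all 1024 nodes participate in consensus, the largest tolerable f is 341 (3 × 341 + 1 = 1024), so simulations must exercise fault counts up to and just beyond this bound.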
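The arithmetic behind the N >= 3f + 1 bound is easy to check mechanically. The helper names below are hypothetical, and the sketch assumes every node is a consensus participant; a real deployment may well run consensus on a much smaller replica group.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f satisfying n >= 3f + 1."""
    return (n - 1) // 3


def bft_quorum(n: int) -> int:
    """Smallest quorum q such that any two quorums share a correct node.

    Two quorums overlap in at least 2q - n nodes; requiring that overlap
    to exceed f gives q >= (n + f + 1) / 2.
    """
    f = max_byzantine_faults(n)
    return (n + f) // 2 + 1


def max_crash_faults(n: int) -> int:
    """Largest f satisfying n >= 2f + 1 (crash-fault-tolerant majority quorums)."""
    return (n - 1) // 2


n = 1024
summary = {
    "byzantine_f": max_byzantine_faults(n),   # 341
    "bft_quorum": bft_quorum(n),              # 683, i.e. 2f + 1
    "crash_f": max_crash_faults(n),           # 511
}
```

With 1024 participants the Byzantine bound is 341 faulty nodes with quorums of 683, whereas a crash-only protocol with majority quorums would tolerate 511 failures.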
|
||||||
|
|
||||||
|
Key papers:
|
||||||
|
- Lamport, L., Shostak, R., & Pease, M. (1982). "The Byzantine Generals Problem." ACM TOPLAS 4(3).
|
||||||
|
- Castro, M., & Liskov, B. (1999). "Practical Byzantine Fault Tolerance." OSDI 1999.
|
||||||
|
- Yin, M., et al. (2019). "HotStuff: BFT Consensus with Linearity and Responsiveness." PODC 2019.
|
||||||
|
|
||||||
|
### 5. GPU-Specific Error Injection
|
||||||
|
|
||||||
|
**NVIDIA GPU Error Injection (NVML / DCGM)** — inject ECC errors, NVLink faults, and CUDA context corruption. DistOS must respond correctly to all hardware-signalled errors from the GPU fabric.
|
||||||
|
|
||||||
|
**ROCm SMI fault injection** — equivalent capabilities for AMD GPUs if mixed hardware is present.
|
||||||
|
|
||||||
|
Scenarios:
|
||||||
|
- Single-bit ECC error during active kernel execution: does DistOS checkpoint, migrate, or terminate the affected job?
|
||||||
|
- NVLink link failure mid-allreduce: does the collective communication layer correctly abort and reschedule?
|
||||||
|
- GPU memory over-temperature throttling: does the scheduler correctly rebalance load?
|
||||||
|
- CUDA context loss: does the runtime correctly clean up tenant resources?
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- NVIDIA. (2024). Data Center GPU Manager (DCGM) User Guide. NVIDIA Corporation.
|
||||||
|
- NVIDIA. (2023). NVML API Reference Guide. NVIDIA Corporation.
|
||||||
|
- Hochschild, P., et al. (2021). "Cores that don't count." HotOS 2021. (Silent data corruption in datacenter CPUs — same failure class applies to GPUs.)
|
||||||
|
|
||||||
|
### 6. Weka Parallel Filesystem Adversarial Testing
|
||||||
|
|
||||||
|
Weka FS is a high-performance parallel filesystem with specific consistency semantics. DistOS makes claims about filesystem semantics that must be tested against Weka's actual behaviour.
|
||||||
|
|
||||||
|
Test scenarios:
|
||||||
|
- Weka cluster node failure during active checkpoint write: does DistOS detect partial writes?
|
||||||
|
- Concurrent checkpoint reads and writes from 512 nodes: does the system observe stale data?
|
||||||
|
- Weka metadata server overload: how does DistOS degrade gracefully?
|
||||||
|
- Network partition between Weka clients and Weka backend: does DistOS correctly pause or abort affected operations?
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Weka.io. (2024). WekaFS Architecture and Internals. Weka Technical Documentation.
|
||||||
|
- Carns, P., et al. (2011). "Understanding and Improving Computational Science Storage Access through Continuous Characterization." MSST 2011.
|
||||||
|
|
||||||
|
### 7. Deterministic Simulation Testing
|
||||||
|
|
||||||
|
**FoundationDB simulation testing**: one of the most rigorous known approaches to distributed systems testing. The entire system runs inside a deterministic simulator where all nondeterminism (network delays, disk I/O, timer callbacks) is controlled by a seeded random number generator, so any failing test can be reproduced exactly.
|
||||||
|
|
||||||
|
Council-qa will specify a deterministic simulation harness for the DistOS core components (scheduler, memory manager, consensus layer) that:
|
||||||
|
- Replaces all OS-level nondeterminism with simulated equivalents
|
||||||
|
- Records all random seeds for reproducible failure replay
|
||||||
|
- Supports time-travel debugging (roll back to a pre-failure state and re-execute)
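The first two harness requirements above can be sketched in a few dozen lines. This is an illustrative toy, not FoundationDB's simulator and not the DistOS harness: the `Simulation` class, its 10% drop rate, and the heartbeat scenario are all assumptions chosen to show that a single seed fully determines the event trace. Time-travel debugging is not shown; replaying from the recorded seed is the simplest substitute.

```python
import heapq
import random


class Simulation:
    """Toy discrete-event simulator; all randomness comes from one seed."""

    def __init__(self, seed):
        self.seed = seed                  # recorded so failures can be replayed
        self.rng = random.Random(seed)
        self.now = 0.0                    # simulated clock, never wall time
        self._queue = []
        self._counter = 0                 # tie-breaker keeps ordering total
        self.trace = []

    def schedule(self, delay, name, action):
        self._counter += 1
        heapq.heappush(self._queue,
                       (self.now + delay, self._counter, name, action))

    def send(self, src, dst, payload, on_delivery):
        """Simulated network: random drop, otherwise random latency."""
        if self.rng.random() < 0.1:
            self.trace.append((self.now, f"drop {src}->{dst} {payload}"))
            return
        delay = self.rng.uniform(0.001, 0.050)
        self.schedule(delay, f"deliver {src}->{dst} {payload}",
                      lambda: on_delivery(payload))

    def run(self):
        while self._queue:
            self.now, _, name, action = heapq.heappop(self._queue)
            self.trace.append((self.now, name))
            action()
        return self.trace


def heartbeat_scenario(seed, pings=5):
    sim = Simulation(seed)

    def on_ping(payload):
        # Node B answers every ping it receives with an ack.
        sim.send("B", "A", payload.replace("ping", "ack"), lambda _ack: None)

    for n in range(pings):
        sim.schedule(n * 0.010, f"tick {n}",
                     lambda n=n: sim.send("A", "B", f"ping-{n}", on_ping))
    return sim.run()
```

Running `heartbeat_scenario(seed)` twice with the same seed yields identical traces, so a failure found under one seed can be replayed and stepped through as often as needed.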
|
||||||
|
|
||||||
|
Key references:
- Zhou, J., et al. (2021). "FoundationDB: A Distributed Unbundled Transactional Key Value Store." SIGMOD 2021.
- Kingsbury, K. (2018). "An Introduction to Jepsen." Strange Loop 2018.
- Alvaro, P., Rosen, J., & Hellerstein, J. M. (2015). "Lineage-driven Fault Injection." SIGMOD 2015.
|
||||||
|
|
||||||
|
### 8. Race Condition Detection
|
||||||
|
|
||||||
|
**ThreadSanitizer (TSan) / Helgrind** — data race detection in any shared-memory regions of DistOS.
|
||||||
|
|
||||||
|
**Concuerror / DPOR (Dynamic Partial Order Reduction)** — systematic concurrency testing for Erlang/Elixir-style actor systems, adaptable to DistOS agent communication patterns.
|
||||||
|
|
||||||
|
**Deterministic concurrency testing** — enumerate all interleavings of concurrent operations in critical sections of the scheduler and memory allocator.
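Exhaustive interleaving enumeration can be shown on the smallest interesting case: two threads performing an unsynchronised read-then-write increment. The sketch below is illustrative only; the `interleavings` and `run` helpers are hypothetical, and brute-force enumeration grows combinatorially, which is why techniques such as DPOR prune equivalent schedules in practice.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule of a non-atomic increment on a shared counter."""
    counter = 0
    local = {}                       # per-thread register holding the read value
    for thread, step in schedule:
        if step == "read":
            local[thread] = counter
        else:                        # "write"
            counter = local[thread] + 1
    return counter


t1 = [("T1", "read"), ("T1", "write")]
t2 = [("T2", "read"), ("T2", "write")]

schedules = list(interleavings(t1, t2))
lost_updates = [s for s in schedules if run(s) != 2]
```

Of the six possible schedules, four lose an update (both reads happen before either write), which is the kind of race a scheduler or allocator critical section would need a lock or atomic operation to rule out.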
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Agent Roles
|
||||||
|
|
||||||
|
| Role | Count | Responsibilities |
|------|-------|-----------------|
| **Chaos Engineers** | 12 | Design and execute Jepsen-style partition and failure scenarios; maintain chaos playbook library |
| **Fuzz Operators** | 10 | Run AFL++/libFuzzer/syzkaller campaigns against all API surfaces; triage and minimise crashes |
| **Property Testers** | 10 | Encode formal spec properties as Hypothesis/QuickCheck tests; maintain property library |
| **Byzantine Simulators** | 8 | Implement and execute Byzantine fault scenarios; validate BFT protocol correctness |
| **GPU Fault Injectors** | 8 | NVML/DCGM-based hardware fault injection; GPU error response validation |
| **Simulation Engineers** | 7 | Build and maintain the deterministic simulation harness; replay and minimise failures |
| **Performance Adversarialists** | 5 | Benchmarking under adversarial load; identify performance cliffs and scheduling pathologies |
|
||||||
|
|
||||||
|
**Total: 60 agents**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Key Deliverables
|
||||||
|
|
||||||
|
| Deliverable | UCXL Address | Due Phase |
|-------------|--------------|-----------|
| Adversarial Test Strategy | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/spec/test-strategy` | Phase 2 |
| Chaos Engineering Playbook | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/runbook/chaos-playbook` | Phase 2 |
| Property Test Suite (formal spec conformance) | `ucxl://council-qa:property-tester@DistOS:qa/*^/test-suite/property-tests` | Phase 3 |
| API Fuzz Campaign Report | `ucxl://council-qa:fuzz-operator@DistOS:qa/*^/report/fuzz-campaign-api` | Phase 3 |
| Byzantine Fault Simulation Results | `ucxl://council-qa:byzantine-simulator@DistOS:qa/*^/report/byzantine-faults` | Phase 3 |
| GPU Error Injection Results | `ucxl://council-qa:gpu-fault-injector@DistOS:qa/*^/report/gpu-fault-injection` | Phase 3 |
| Deterministic Simulation Harness Spec | `ucxl://council-qa:simulation-engineer@DistOS:qa/*^/spec/deterministic-sim` | Phase 3 |
| Weka FS Adversarial Test Results | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/report/weka-adversarial` | Phase 4 |
| Race Condition Detection Report | `ucxl://council-qa:simulation-engineer@DistOS:qa/*^/report/race-conditions` | Phase 4 |
| Performance Adversarial Benchmarks | `ucxl://council-qa:performance-adversarialist@DistOS:qa/*^/benchmark/adversarial-perf` | Phase 4 |
| Consolidated Defect Register | `ucxl://council-qa:*@DistOS:qa/*^/register/defects` | Continuous |
| Final QA Acceptance Report | `ucxl://council-qa:*@DistOS:qa/*^/report/final-acceptance` | Phase 5 |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision Points
|
||||||
|
|
||||||
|
### DQ-01: Acceptable Defect Threshold for Specification Acceptance
|
||||||
|
**Question:** What severity and count of open defects constitutes an acceptable state for the DistOS specification to be declared complete?
|
||||||
|
|
||||||
|
**Options:**
|
||||||
|
- A. Zero P0 defects, fewer than 10 P1 defects with documented mitigations
|
||||||
|
- B. Zero P0/P1 defects, unlimited P2 with tracking
|
||||||
|
- C. All discovered defects must be resolved or formally accepted with architectural rationale
|
||||||
|
|
||||||
|
**Implications:** Option C produces the highest-quality specification but may block delivery. Option A risks shipping known critical issues with mitigations that may not hold in production.
|
||||||
|
|
||||||
|
### DQ-02: Scope of Byzantine Fault Tolerance Requirement
|
||||||
|
**Question:** Is DistOS required to tolerate Byzantine faults, or only crash-recovery faults?
|
||||||
|
|
||||||
|
**Options:**
|
||||||
|
- A. Crash-recovery (CFT) only — simpler protocols, lower overhead, sufficient for trusted datacenter hardware
|
||||||
|
- B. Byzantine fault tolerance for the consensus layer only
|
||||||
|
- C. Full BFT for all distributed components
|
||||||
|
|
||||||
|
**Implications:** Full BFT (Option C) requires significantly more complex protocols and larger quorums: tolerating f faulty nodes takes at least 3f + 1 replicas under BFT, compared with 2f + 1 under CFT. Given a 1024-node cluster with known hardware, CFT may be sufficient.
|
||||||
|
|
||||||
|
### DQ-03: Deterministic Simulation Scope
|
||||||
|
**Question:** Which DistOS components must support deterministic simulation testing?
|
||||||
|
|
||||||
|
**Options:**
|
||||||
|
- A. Core consensus and scheduling components only
|
||||||
|
- B. All components that touch shared state
|
||||||
|
- C. The entire DistOS software stack including GPU runtime interfaces
|
||||||
|
|
||||||
|
**Implications:** Option C provides the strongest testing guarantees but requires a simulation layer for every hardware interface including GPU APIs.
|
||||||
|
|
||||||
|
### DQ-04: Jepsen-Equivalent Validation Requirement
|
||||||
|
**Question:** Must DistOS pass a Jepsen-equivalent analysis before the specification is finalised, or is simulation-based testing sufficient?
|
||||||
|
|
||||||
|
**Options:**
|
||||||
|
- A. Simulation-based testing is sufficient at specification phase; Jepsen analysis deferred to implementation phase
|
||||||
|
- B. A Jepsen-equivalent model-level analysis is required before specification acceptance
|
||||||
|
- C. A full Jepsen-style test against a prototype implementation is required within the 14-day window
|
||||||
|
|
||||||
|
**Implications:** Option C is highly ambitious for a 14-day timeline. Option A defers significant risk to implementation.
|
||||||
|
|
||||||
|
### DQ-05: Security Penetration Testing Depth
|
||||||
|
**Question:** What is the scope of security penetration testing for the DistOS security model?
|
||||||
|
|
||||||
|
**Options:**
|
||||||
|
- A. Threat-model review and manual analysis only
|
||||||
|
- B. Automated scanning plus manual review of authentication and authorisation boundaries
|
||||||
|
- C. Full red-team exercise against the security model including side-channel analysis
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Dependencies on Other Councils
|
||||||
|
|
||||||
|
| Council | Dependency Type | What Council-QA Needs |
|---------|----------------|-----------------------|
| **council-verify** | Upstream | Formal specifications and invariants to test against; verified properties to validate in simulation |
| **council-api** | Upstream | Complete API surface definitions with semantics, pre/post conditions, and error contracts |
| **council-sched** | Upstream | Scheduler specification including claimed consistency and fairness properties |
| **council-mem** | Upstream | Memory manager specification including isolation guarantees and allocation invariants |
| **council-sec** | Upstream | Security model specification including trust boundaries and threat model |
| **council-net** | Upstream | Network layer specification including partition tolerance claims |
| **council-fs** | Upstream | Weka FS integration specification including consistency claims |
| **council-synth** | Bidirectional | Synthesis decisions that affect testability; council-qa feeds defects back to council-synth for resolution arbitration |
| **council-arch** | Downstream | Decision archaeology reads all council-qa defect reports and test results to narrate the QA story |
| **council-docs** | Downstream | Documentation council consumes test reports for the final specification document |
|
||||||
|
|
||||||
|
**Critical dependency:** Council-qa cannot begin property testing until council-verify delivers at least draft formal specifications. This creates a hard dependency on council-verify delivering Phase 2 outputs on schedule.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## WHOOSH Configuration
|
||||||
|
|
||||||
|
```yaml
council: council-qa
whoosh:
  formation:
    strategy: domain-partitioned
    # Each test domain forms a sub-team that operates semi-independently
    partitions:
      - name: chaos-partition
        roles: [chaos-engineer]
        size: 12
        coordination: async
      - name: fuzz-partition
        roles: [fuzz-operator]
        size: 10
        coordination: async
      - name: property-partition
        roles: [property-tester]
        size: 10
        coordination: sync-with-verify
      - name: byzantine-partition
        roles: [byzantine-simulator]
        size: 8
        coordination: async
      - name: gpu-partition
        roles: [gpu-fault-injector]
        size: 8
        coordination: async
      - name: simulation-partition
        roles: [simulation-engineer]
        size: 7
        coordination: sync
      - name: perf-partition
        roles: [performance-adversarialist]
        size: 5
        coordination: async

  quorum:
    defect-classification: 3/5    # Three roles must agree on P0/P1 classification
    test-acceptance: 4/7          # Four of seven role groups must sign off on test suite acceptance
    spec-block:                   # council-qa can block spec acceptance; requires:
      threshold: simple-majority
      escalation: council-synth   # Disputes escalate to council-synth for arbitration

  subchannels:
    - id: defect-reports
      description: All discovered defects broadcast here for cross-council visibility
      subscribers: [council-verify, council-api, council-sched, council-mem, council-sec, council-synth, council-arch]
      retention: full-history
    - id: chaos-ops
      description: Real-time chaos experiment state
      subscribers: [council-qa, council-arch]
    - id: property-failures
      description: Property test failures routed to council-verify for spec review
      subscribers: [council-verify, council-synth, council-arch]
    - id: qa-acceptance
      description: Formal acceptance/rejection signals per deliverable
      subscribers: [council-synth, council-docs, council-arch]

  communication:
    internal: broadcast-to-partition
    cross-council: ucxl-addressed
    defect-escalation: immediate-broadcast
```
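A configuration like this can be sanity-checked before it is loaded. The sketch below is an assumption-laden illustration: it mirrors the partition sizes and the two fractional quorum rules as Python literals rather than parsing the YAML (which would need a third-party parser), and the `validate` helper and its rules (sizes sum to the stated agent count, quorum fractions are well-formed strict majorities) are hypothetical checks, not WHOOSH behaviour.

```python
from fractions import Fraction

# Values copied from the configuration above.
PARTITIONS = {
    "chaos-partition": 12,
    "fuzz-partition": 10,
    "property-partition": 10,
    "byzantine-partition": 8,
    "gpu-partition": 8,
    "simulation-partition": 7,
    "perf-partition": 5,
}
QUORUM = {"defect-classification": "3/5", "test-acceptance": "4/7"}


def validate(partitions, quorum, expected_agents=60):
    """Return a list of human-readable problems; empty means consistent."""
    problems = []
    total = sum(partitions.values())
    if total != expected_agents:
        problems.append(f"partition sizes sum to {total}, expected {expected_agents}")
    for rule, spec in quorum.items():
        need, of = (int(part) for part in spec.split("/"))
        if not 0 < need <= of:
            problems.append(f"{rule}: {spec} is not a valid fraction")
        elif Fraction(need, of) <= Fraction(1, 2):
            problems.append(f"{rule}: {spec} is not a strict majority")
    return problems


problems = validate(PARTITIONS, QUORUM)
```

An empty `problems` list confirms that the seven partitions account for all 60 agents in the roles table and that both fractional quorum rules require more than half of their voters.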
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Success Criteria
|
||||||
|
|
||||||
|
A council-qa execution is considered successful when all of the following are met:
|
||||||
|
|
||||||
|
1. **Coverage:** Every formal specification invariant produced by council-verify has at least one corresponding property test with documented pass/fail results.
|
||||||
|
|
||||||
|
2. **API surface:** 100% of API endpoints defined by council-api have been subjected to structured fuzz testing with documented results.
|
||||||
|
|
||||||
|
3. **Chaos coverage:** Jepsen-equivalent partition and failure scenarios have been executed against the DistOS scheduler, memory manager, and consensus layer models.
|
||||||
|
|
||||||
|
4. **Byzantine validation:** Byzantine fault simulations have validated that the consensus protocol meets its claimed fault-tolerance bounds.
|
||||||
|
|
||||||
|
5. **GPU error coverage:** All documented GPU error codes and conditions in the NVIDIA/AMD hardware specifications have corresponding DistOS response scenarios tested.
|
||||||
|
|
||||||
|
6. **Weka FS:** All consistency claims in the DistOS filesystem integration specification have been tested against documented Weka FS behaviour.
|
||||||
|
|
||||||
|
7. **Defect resolution:** All P0 defects are resolved. All P1 defects have formal mitigations accepted by the originating council and council-synth. All open defects are registered with UCXL-addressable evidence.
|
||||||
|
|
||||||
|
8. **Deterministic reproduction:** All discovered failures can be reproduced deterministically via the simulation harness or a minimised test case.
|
||||||
|
|
||||||
|
9. **Archaeology readability:** Council-arch has confirmed that the defect register and test reports produce comprehensible decision narratives.
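Criterion 7 is mechanical enough to be expressed as an acceptance gate over the defect register. The following is a sketch under assumptions: the `Defect` record and its field names are invented for illustration (the register's real schema is not specified here), and the UCXL addresses used with it would be placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    ident: str
    severity: str                 # "P0", "P1", "P2", ...
    origin: str                   # council whose artefact contains the defect
    resolved: bool = False
    mitigation_accepted_by: frozenset = frozenset()
    evidence: str = ""            # UCXL address of the evidence chain


def blocking_reasons(register):
    """Return the reasons criterion 7 is not yet met; empty means it passes."""
    reasons = []
    for d in register:
        if d.resolved:
            continue
        if not d.evidence.startswith("ucxl://"):
            reasons.append(f"{d.ident}: open defect lacks UCXL evidence")
        if d.severity == "P0":
            reasons.append(f"{d.ident}: P0 defect is unresolved")
        elif d.severity == "P1":
            # Both the originating council and council-synth must accept.
            missing = {d.origin, "council-synth"} - d.mitigation_accepted_by
            if missing:
                reasons.append(
                    f"{d.ident}: mitigation not accepted by {sorted(missing)}")
    return reasons
```

The Final QA Acceptance Report could include the output of such a gate directly, so that a non-empty list of reasons is itself the justification for blocking acceptance.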
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Timeline Mapping
|
||||||
|
|
||||||
|
| Phase | Days | Council-QA Activities |
|-------|------|-----------------------|
| **Phase 1: Research** | 1–3 | Study formal specification structure (coordination with council-verify); design adversarial test taxonomy; review Jepsen, FoundationDB simulation, and AFL++ methodologies; define property language for spec conformance tests |
| **Phase 2: Architecture** | 3–6 | Produce Adversarial Test Strategy; design chaos playbook structure; begin encoding specification properties as tests (as draft specs arrive from council-verify); establish defect classification schema; configure WHOOSH subchannels and defect broadcast |
| **Phase 3: Formal Specification** | 6–10 | Execute property tests against formal specs from council-verify; run initial fuzz campaigns against draft API definitions; execute Byzantine fault simulations; GPU error injection scenario validation; first Weka FS adversarial tests; publish defect reports to all councils |
| **Phase 4: Integration** | 10–12 | Full integration testing across all subsystem specifications; cross-subsystem fault injection (e.g., GPU error during Weka checkpoint during scheduler failover); race condition analysis; performance adversarial benchmarks; defect resolution verification |
| **Phase 5: Documentation** | 12–14 | Produce Final QA Acceptance Report; compile consolidated defect register; validate all P0/P1 defects are resolved or formally accepted; support council-docs in formatting test reports for the final specification document |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Council Design Brief v1.0 — DistOS Project — council-qa*
|
||||||
|
*Generated: 2026-02-24*
|
||||||
399
councils/11-documentation.md
Normal file
@@ -0,0 +1,399 @@
# Council Design Brief: Documentation & Specification Formatting

---

## Council Identification

| Field | Value |
|-------|-------|
| **Council ID** | `council-docs` |
| **Mission** | Transform the raw technical outputs of all DistOS councils into a coherent, professionally formatted, cross-referenced, and accessible specification suite. Council-docs does not originate technical content — it is the editorial and production team responsible for ensuring that the collective intelligence of every council is rendered legible, navigable, and unambiguous to human readers. |
| **UCXL Base Address** | `ucxl://council-docs:*@DistOS:docs/*^/` |
| **Agent Count** | ~40 agents |
| **Operates From** | Day 3 (continuous intake) through Day 14 (final publication) |

---

## Scope and Responsibilities

Council-docs is the final production stage of the DistOS specification. Its scope is editorial, structural, and representational — not technical. Specifically:

- Consuming outputs from all other councils via UCXL addressing and transforming them into polished specification documents
- Enforcing consistent terminology, formatting, style, and cross-reference standards across all documents
- Generating and maintaining the master glossary and terminology register for the DistOS project
- Producing architecture decision records (ADRs) that capture the rationale for every major architectural choice, sourced from council decision artifacts
- Creating visual representations (architecture diagrams, sequence diagrams, state machine diagrams, dependency maps) that complement written specifications
- Managing document versioning as specifications evolve across the 14-day timeline
- Maintaining bidirectional cross-references between all specification documents, ensuring that a change in one council's output is reflected in all dependent documents
- Producing the final DistOS specification document suite as a coherent, publishable artifact

Council-docs explicitly does **not**:
- Make architectural or technical decisions
- Resolve technical conflicts between councils (that is council-synth's domain)
- Generate original technical analysis

When a document review reveals technical ambiguity, inconsistency, or an apparent gap, council-docs raises a formal editorial query to the originating council and council-synth — it does not resolve the issue itself.

---

## Research Domains

### 1. Standards Document Formatting (IEEE/ISO Style)

The DistOS specification must be formatted to a standard that would be recognisable to engineers familiar with IEEE or ISO technical standards. This means:

**IEEE 1016 (Software Design Descriptions)** — structure and content requirements for software design documentation. Council-docs will use IEEE 1016 as a structural template for the DistOS design specification.

**ISO/IEC 26514 (Systems and software engineering — Requirements for designers and developers of user documentation)** — best practices for technical documentation clarity, completeness, and navigability.

**RFC formatting conventions** — for protocol specifications, the IETF RFC format (with its formal requirement language: MUST, SHOULD, MAY, MUST NOT, SHOULD NOT as per RFC 2119) provides a well-understood convention for precise specification language.

Key references:
- IEEE Std 1016-2009. IEEE Standard for Information Technology — Systems Design — Software Design Descriptions.
- ISO/IEC 26514:2008. Systems and software engineering — Requirements for designers and developers of user documentation.
- Bradner, S. (1997). RFC 2119: Key words for use in RFCs to Indicate Requirement Levels. IETF.
- Nygard, M. T. (2011). "Documenting Architecture Decisions." https://cognitect.com/blog/2011/11/15/documenting-architecture-decisions

### 2. Architecture Decision Records

An ADR captures: the context that required a decision, the options considered, the decision made, and the consequences. For DistOS, every major architectural decision made by any council must be captured as an ADR.

Council-docs is responsible for:
- Extracting decision artifacts from UCXL addresses across all councils
- Reformatting them as coherent, numbered ADRs (ADR-001, ADR-002, etc.)
- Linking each ADR to the council-arch narratives that explain the decision chain
- Maintaining an ADR status register (Proposed / Accepted / Deprecated / Superseded)

**ADR format for DistOS:**

```
# ADR-NNN: [Title]

## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-NNN]

## Context
[What situation required a decision? What forces were at play?]

## Decision
[What was decided?]

## Consequences
[What are the results of this decision? What becomes easier or harder?]

## Source
[UCXL addresses of the artifacts from which this ADR was derived]

## Councils Involved
[Which councils contributed to or were affected by this decision?]
```

Key references:
- Nygard, M. T. (2011). "Documenting Architecture Decisions." Cognitect Blog.
- Tyree, J., & Akerman, A. (2005). "Architecture Decisions: Demystifying Architecture." IEEE Software 22(2).
- Richards, M., & Ford, N. (2020). Fundamentals of Software Architecture. O'Reilly Media. Chapter 19: Making Architecture Decisions.
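
The ADR format and status register described in this section could be represented along the following lines. This is a minimal sketch, not a prescribed implementation: the `ADR` class, its field names, and the validation rules are assumptions made for illustration. Only the section headings and the four status values come from the brief.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Status values taken from the ADR status register in the brief.
VALID_STATUSES = {"Proposed", "Accepted", "Deprecated", "Superseded"}


@dataclass
class ADR:
    """Hypothetical in-memory form of one DistOS ADR."""
    number: int
    title: str
    status: str
    context: str
    decision: str
    consequences: str
    sources: List[str] = field(default_factory=list)   # UCXL addresses
    councils: List[str] = field(default_factory=list)
    superseded_by: Optional[int] = None

    def __post_init__(self):
        if self.status not in VALID_STATUSES:
            raise ValueError(f"unknown ADR status: {self.status!r}")
        # Assumed rule: a successor is recorded exactly when superseded.
        if (self.status == "Superseded") != (self.superseded_by is not None):
            raise ValueError("superseded_by must be set only for Superseded ADRs")
        # Every ADR must link to its UCXL source artifacts.
        if not self.sources:
            raise ValueError("an ADR needs at least one UCXL source address")

    @property
    def identifier(self) -> str:
        return f"ADR-{self.number:03d}"

    def status_line(self) -> str:
        if self.status == "Superseded":
            return f"Superseded by ADR-{self.superseded_by:03d}"
        return self.status

    def render(self) -> str:
        """Render the ADR using the section order of the DistOS template."""
        parts = [
            f"# {self.identifier}: {self.title}",
            "## Status", self.status_line(),
            "## Context", self.context,
            "## Decision", self.decision,
            "## Consequences", self.consequences,
            "## Source", "\n".join(f"- {s}" for s in self.sources),
            "## Councils Involved", ", ".join(self.councils),
        ]
        return "\n\n".join(parts) + "\n"
```

Validating at construction time means a malformed record fails before it reaches the register, which keeps the Formatters' rendering step free of special cases.
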

### 3. API Reference Documentation Generation

For all APIs produced by council-api, council-docs will produce structured reference documentation following OpenAPI 3.x conventions where applicable, and custom structured formats for non-REST interfaces (GPU kernel APIs, consensus protocol messages, filesystem interfaces).

Each API reference entry must include:
- Endpoint or function signature with all parameters
- Parameter types, constraints, and defaults
- Pre-conditions and post-conditions (sourced from council-verify formal specs)
- Error codes and their meaning
- Example requests and responses
- Security requirements
- Rate limits and performance characteristics where specified

**OpenAPI 3.1** — the current standard for REST API documentation. Council-docs will produce OpenAPI specs for any HTTP-based DistOS management interfaces.

**Swagger/Redoc** — tooling for rendering OpenAPI specs into navigable HTML documentation.

Key references:
- OpenAPI Initiative. (2021). OpenAPI Specification v3.1.0. https://spec.openapis.org/oas/v3.1.0
- Jacobson, D., Brail, G., & Woods, D. (2011). APIs: A Strategy Guide. O'Reilly Media.

### 4. Diagram Generation Standards

Technical diagrams must convey system structure and behaviour precisely and consistently. Council-docs Diagram Specialists will use:

**C4 Model (Context, Container, Component, Code)** — for architecture overview diagrams at four levels of abstraction. Simon Brown's C4 model provides a structured hierarchy that prevents the common failure of a single "big ball of mud" architecture diagram.

**UML Sequence Diagrams** — for inter-council and inter-component communication flows. Especially important for documenting GPU job submission flows, consensus protocol message exchanges, and checkpoint/recovery sequences.

**UML State Machine Diagrams** — for component lifecycle documentation (GPU context states, scheduler job states, filesystem handle states).

**Graphviz / Mermaid** — as the diagramming toolchain. Mermaid's text-based syntax integrates with Markdown documentation; Graphviz provides more precise layout control for complex dependency graphs.

**Decision tree diagrams** — for documenting fault recovery decision logic, sourced from council-qa chaos playbooks and council-fault specifications.

Key references:
- Brown, S. (2018). The C4 Model for Visualising Software Architecture. Leanpub.
- Fowler, M. (2003). UML Distilled: A Brief Guide to the Standard Object Modeling Language. Addison-Wesley.
- OMG. (2017). Unified Modeling Language Specification Version 2.5.1.

### 5. Cross-Reference Management

In a specification suite produced by 10+ councils, cross-references between documents are as important as the documents themselves. A memory manager specification that references a scheduler guarantee must point to the specific version of that guarantee that was current when the memory spec was written.

Council-docs Cross-Referencers will:
- Maintain a master cross-reference registry mapping logical references to UCXL-versioned artifacts
- Flag dangling references (references to artifacts that have been superseded or deleted)
- Ensure that every inter-council dependency named in design briefs has a corresponding cross-reference in the final specification
- Produce a dependency map showing which specification sections depend on which council outputs

**UCXL temporal versioning** will be used to pin cross-references: when document A references a claim from document B, the reference includes the UCXL temporal address of the specific version of B that was current at the time of reference.

Example:
```
The memory isolation guarantee stated in [MEM-ISOLATION-V3]
(ucxl://council-mem:allocator@DistOS:mem/~~3/spec/isolation-guarantee)
depends on the scheduler preemption guarantee in [SCHED-PREEMPT-V2]
(ucxl://council-sched:scheduler@DistOS:sched/~~2/spec/preemption).
```
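
The registry duties listed above (flagging dangling references and producing a dependency map) might be sketched as follows. The `CrossRef` record and both function names are hypothetical, and UCXL addresses are treated as opaque strings; how a real registry learns which addresses are live or superseded is not specified in this brief.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple


@dataclass(frozen=True)
class CrossRef:
    """One registry entry: a logical label pinned to a UCXL address."""
    label: str        # logical reference, e.g. "SCHED-PREEMPT-V2"
    source_doc: str   # document that contains the reference
    target: str       # pinned UCXL address the reference points to


def find_dangling(
    refs: Iterable[CrossRef],
    live_addresses: Set[str],
    superseded: Set[str],
) -> List[Tuple[CrossRef, str]]:
    """Return (reference, reason) pairs for references needing attention."""
    problems = []
    for ref in refs:
        if ref.target in superseded:
            problems.append((ref, "superseded"))
        elif ref.target not in live_addresses:
            problems.append((ref, "missing"))
    return problems


def dependency_map(refs: Iterable[CrossRef]) -> Dict[str, Set[str]]:
    """Map each document to the set of UCXL addresses it depends on."""
    deps: Dict[str, Set[str]] = {}
    for ref in refs:
        deps.setdefault(ref.source_doc, set()).add(ref.target)
    return deps
```

Reporting a reason alongside each flagged reference lets a warning distinguish an artifact that was replaced from one that no longer exists.
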

### 6. Terminology Standardisation

Technical specifications fail when different sections use different words for the same concept, or the same word for different concepts. For a 1024-node GPU cluster OS, the terminology surface is large.

Council-docs Terminology Guardians will maintain a canonical glossary that:
- Defines every technical term used across all council outputs
- Flags and resolves synonyms (is it "job", "task", "workload", or "kernel"?)
- Maintains provenance for each definition (which council introduced it, when, and in what context)
- Enforces consistent usage through editorial review

The glossary is a living document; it is updated continuously as new terms emerge from council outputs.
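
A terminology audit of the kind described above could start from something as simple as the sketch below. The function name and the synonym table are illustrative assumptions; in particular, treating "job" as the canonical term is only a placeholder, since the brief leaves that choice open.

```python
import re
from typing import Dict, List, Tuple


def find_noncanonical_terms(
    text: str, synonyms: Dict[str, str]
) -> List[Tuple[str, str, int]]:
    """Count uses of discouraged variants in a document.

    `synonyms` maps a discouraged variant to its canonical glossary term.
    Returns (variant, canonical, occurrences) for each variant found,
    matching whole words case-insensitively, with an optional plural "s".
    """
    findings = []
    for variant, canonical in sorted(synonyms.items()):
        pattern = re.compile(rf"\b{re.escape(variant)}s?\b", re.IGNORECASE)
        count = len(pattern.findall(text))
        if count:
            findings.append((variant, canonical, count))
    return findings
```

A word-level match like this only surfaces candidates. Deciding whether a given "task" really means "job" in context remains an editorial judgement for the Terminology Guardians and the originating council.
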

### 7. Accessibility and Readability of Technical Content

The DistOS specification should be readable by:
- GPU cluster architects who have not read the detailed formal specifications
- Systems engineers implementing DistOS components
- Future maintainers who need to understand why specific decisions were made
- Auditors reviewing security and compliance properties

Council-docs applies the following readability standards:
- Flesch-Kincaid readability targets for narrative sections (technical precision does not require unreadable prose)
- Consistent use of active voice in procedural sections
- Progressive disclosure: overview sections before detail sections in every major document
- Every document has a clearly stated purpose, audience, and scope in its opening section
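
For the Flesch-Kincaid target mentioned above, the grade-level score is computed as 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. The sketch below applies that formula. The syllable counter is a crude vowel-group heuristic assumed here for illustration, so scores for real text are approximate, and the brief does not state which target grade applies.

```python
import re


def fk_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level from raw counts."""
    if words <= 0 or sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59


def estimate_syllables(word: str) -> int:
    """Heuristic: count groups of consecutive vowels, minimum one."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def fk_grade_for_text(text: str) -> float:
    """Approximate grade level of a narrative passage."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(estimate_syllables(w) for w in words)
    return fk_grade(len(words), len(sentences), syllables)
```

Keeping the formula separate from the counting heuristics means a better syllable counter can be swapped in later without touching the score itself.
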

---

## Agent Roles

| Role | Count | Responsibilities |
|------|-------|-----------------|
| **Editors** | 15 | Review all council outputs for clarity, completeness, and internal consistency; rewrite for precision without altering technical content; raise editorial queries to originating councils |
| **Formatters** | 10 | Enforce consistent document structure, heading hierarchy, numbering schemes, table formatting, and code block conventions across all outputs; apply IEEE/ISO and ADR templates |
| **Cross-Referencers** | 5 | Maintain master cross-reference registry; validate all UCXL-pinned references; flag dangling or outdated references; produce dependency maps |
| **Diagram Specialists** | 5 | Generate architecture diagrams (C4), sequence diagrams (UML), state machine diagrams, and decision trees from council outputs; maintain diagram source files in version control |
| **Terminology Guardians** | 5 | Maintain master glossary; audit all council outputs for terminology consistency; propose canonical definitions; resolve synonym conflicts with originating councils |

**Total: 40 agents**

---

## Key Deliverables

| Deliverable | UCXL Address | Due Phase |
|-------------|--------------|-----------|
| Documentation Standards Guide | `ucxl://council-docs:formatter@DistOS:docs/*^/spec/documentation-standards` | Phase 1 |
| Master Glossary (v0, seed terms) | `ucxl://council-docs:terminology-guardian@DistOS:docs/*^/glossary/master-v0` | Phase 2 |
| ADR Template and Register | `ucxl://council-docs:formatter@DistOS:docs/*^/template/adr-template` | Phase 2 |
| Cross-Reference Registry (live) | `ucxl://council-docs:cross-referencer@DistOS:docs/*^/registry/cross-references` | Phase 2 (continuous) |
| Architecture Overview Document | `ucxl://council-docs:editor@DistOS:docs/*^/spec/architecture-overview` | Phase 3 |
| C4 Architecture Diagrams (all levels) | `ucxl://council-docs:diagram-specialist@DistOS:docs/*^/diagram/c4-architecture` | Phase 3 |
| API Reference Documentation | `ucxl://council-docs:editor@DistOS:docs/*^/reference/api-reference` | Phase 3 |
| Subsystem Sequence Diagrams | `ucxl://council-docs:diagram-specialist@DistOS:docs/*^/diagram/sequence-diagrams` | Phase 3 |
| ADR Compilation (all councils) | `ucxl://council-docs:formatter@DistOS:docs/*^/adr/complete-register` | Phase 4 |
| Master Glossary (final) | `ucxl://council-docs:terminology-guardian@DistOS:docs/*^/glossary/master-final` | Phase 4 |
| State Machine Diagram Set | `ucxl://council-docs:diagram-specialist@DistOS:docs/*^/diagram/state-machines` | Phase 4 |
| Formal Specification Suite (assembled) | `ucxl://council-docs:editor@DistOS:docs/*^/spec/formal-spec-suite` | Phase 5 |
| QA Test Report Formatted Edition | `ucxl://council-docs:formatter@DistOS:docs/*^/report/qa-formatted` | Phase 5 |
| Final DistOS Specification Document | `ucxl://council-docs:editor@DistOS:docs/*^/spec/distos-specification-v1` | Phase 5 |

---

## Decision Points

### DD-01: Primary Document Format Standard
**Question:** What is the canonical format for the DistOS specification suite?

**Options:**
- A. Markdown with Mermaid diagrams — maximum portability, version-control friendly, renders on Gitea
- B. LaTeX — maximum typographic control, produces publication-quality PDFs, standard for academic specifications
- C. AsciiDoc — richer than Markdown, less complex than LaTeX, with strong tooling (Antora, Asciidoctor)
- D. IEEE-formatted Word/ODT — directly compatible with standards submission processes

**Implications:** Option A is most compatible with the CHORUS/UCXL toolchain and Gitea workflow. Option B produces the highest-quality PDF output but requires LaTeX toolchain expertise. The choice affects how cross-references and diagrams are managed throughout the project.

### DD-02: ADR Ownership Model
**Question:** Who owns the content of each ADR — the council that made the decision, or council-docs?

**Options:**
- A. Council-docs owns all ADRs; originating councils provide raw decision artifacts only
- B. Originating councils own ADR content; council-docs applies formatting only
- C. Co-ownership model: originating council writes ADR draft, council-docs editors revise for clarity, council-arch validates narrative accuracy

**Implications:** Option C is highest quality but requires the most coordination overhead. The council-arch dependency is critical here: council-arch's UCXL traversal output is the primary source of decision chain data.

### DD-03: Glossary Conflict Resolution Process
**Question:** When two councils use the same term to mean different things, who resolves the conflict?

**Options:**
- A. Council-docs Terminology Guardians have final authority on terminology
- B. The council that first defined the term retains its definition; later councils must use different terms
- C. Terminology conflicts escalate to council-synth for resolution; council-docs documents the resolution

**Implications:** Option C correctly positions council-synth as the cross-council arbitrator, but adds coordination overhead. Option A risks terminology decisions made by editors rather than domain experts.

### DD-04: Diagram Fidelity vs. Speed Trade-off
**Question:** For architecture diagrams, should council-docs produce high-fidelity manually-reviewed diagrams or automated/generated diagrams from machine-readable sources?

**Options:**
- A. All diagrams hand-crafted by Diagram Specialists for maximum clarity and accuracy
- B. All diagrams auto-generated from structured UCXL metadata with manual review
- C. Hybrid: C4 overview diagrams are hand-crafted; detailed sequence and state diagrams are auto-generated

**Implications:** Option B produces diagrams that stay automatically consistent with the underlying specification as it changes. Option A produces higher-quality diagrams that may drift from the specification if not carefully maintained.

### DD-05: Document Versioning Strategy
**Question:** How are document versions managed as specifications evolve across the 14-day timeline?

**Options:**
- A. Semantic versioning (v0.1, v0.2 ... v1.0) with UCXL temporal addressing for history
- B. Date-stamped snapshots only, with UCXL providing all history navigation
- C. Git-based versioning with tags at each phase boundary, UCXL addresses pointing to git refs

---

## Dependencies on Other Councils

| Council | Dependency Type | What Council-Docs Needs |
|---------|----------------|------------------------|
| **council-arch** | Critical upstream | Decision narratives and human-readable summaries to source ADR content; UCXL traversal paths for cross-reference validation |
| **council-synth** | Critical upstream | Synthesis decisions (especially conflict resolutions) which form the spine of the architecture document |
| **council-verify** | Upstream | Formal specification documents to edit and format; invariant definitions for the glossary |
| **council-api** | Upstream | API definitions to transform into formatted API reference documentation |
| **council-sched** | Upstream | Scheduler specification documents; sequence diagrams source data |
| **council-mem** | Upstream | Memory manager specification documents; state machine source data |
| **council-sec** | Upstream | Security model documents; threat model formatted edition |
| **council-net** | Upstream | Network layer specification documents |
| **council-fs** | Upstream | Filesystem integration specification documents |
| **council-qa** | Upstream | QA test reports; defect register; chaos playbooks for inclusion in operational documentation |
| **ALL councils** | Continuous | Terminology submissions for the master glossary; editorial query responses |

**Special relationship with council-arch:** Council-docs depends on council-arch not just for content but for interpretive context. When editors encounter a decision that seems unmotivated or a specification section that lacks clear rationale, they query council-arch's narrative archive before raising a formal editorial query to the technical council. This prevents unnecessary interruptions to technical councils with questions that the archaeology record already answers.

---

## WHOOSH Configuration

```yaml
council: council-docs
whoosh:
  formation:
    strategy: workflow-pipeline
    # Documentation follows a pipeline: intake -> edit -> format -> cross-ref -> publish
    stages:
      - name: intake
        roles: [editor]
        trigger: new-artifact-available
        description: Editors monitor UCXL for new artifacts from all councils
      - name: edit
        roles: [editor, terminology-guardian]
        trigger: intake-complete
        description: Editors revise for clarity; Terminology Guardians check glossary compliance
      - name: format
        roles: [formatter]
        trigger: edit-complete
        description: Formatters apply templates and structural standards
      - name: cross-reference
        roles: [cross-referencer]
        trigger: format-complete
        description: Cross-Referencers validate and register all inter-document references
      - name: diagram
        roles: [diagram-specialist]
        trigger: parallel-with-format
        description: Diagram Specialists generate/update visual representations
      - name: publish
        roles: [formatter, editor]
        trigger: cross-reference-complete
        description: Final document published to UCXL

  quorum:
    editorial-query: 2/5-editors  # Two editors must agree a query is warranted before sending
    terminology-decision: 3/5-guardians-or-escalate  # Three guardians or escalate to council-synth
    document-acceptance: editor + formatter + cross-referencer  # All three roles sign off per document

  subchannels:
    - id: artifact-intake
      description: New artifacts arriving from all councils
      subscribers: [council-docs-editors]
      retention: full-history
    - id: editorial-queries
      description: Formal editorial queries sent to originating councils
      subscribers: [all-councils, council-arch]
      retention: full-history
    - id: glossary-updates
      description: New and revised glossary entries broadcast to all councils
      subscribers: [all-councils]
      retention: full-history
    - id: document-published
      description: Notification when a document section is published
      subscribers: [all-councils, council-arch, council-synth]
    - id: cross-reference-warnings
      description: Dangling or stale cross-references requiring attention
      subscribers: [council-synth, council-arch, originating-council]

  continuous_monitoring:
    # council-docs monitors all councils' UCXL streams for new artifacts
    monitor_pattern: "ucxl://*:*@DistOS:*/*^/*"
    trigger: on-new-artifact
    action: queue-for-intake
```
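
One way to read the stage triggers in the configuration above is as a dependency graph: a trigger of the form `<stage>-complete` waits for that stage, and `parallel-with-<stage>` starts alongside it. The sketch below derives the resulting execution waves under that reading. The trigger semantics are an assumption, since WHOOSH's actual behaviour is not specified in this brief, and `STAGES`, `predecessors`, and `waves` are hypothetical names.

```python
from typing import Dict, List, Set, Tuple

# (stage name, trigger) pairs copied from the council-docs configuration.
STAGES: List[Tuple[str, str]] = [
    ("intake", "new-artifact-available"),
    ("edit", "intake-complete"),
    ("format", "edit-complete"),
    ("cross-reference", "format-complete"),
    ("diagram", "parallel-with-format"),
    ("publish", "cross-reference-complete"),
]


def predecessors(stages: List[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Derive each stage's prerequisite stages from its trigger string."""
    names = {name for name, _ in stages}
    triggers = dict(stages)

    def direct(name: str) -> Set[str]:
        trig = triggers[name]
        if trig.endswith("-complete") and trig[: -len("-complete")] in names:
            return {trig[: -len("-complete")]}
        if trig.startswith("parallel-with-"):
            peer = trig[len("parallel-with-"):]
            if peer in names:
                return direct(peer)  # starts when its peer can start
        return set()                 # external event, no stage prerequisite

    return {name: direct(name) for name, _ in stages}


def waves(stages: List[Tuple[str, str]]) -> List[List[str]]:
    """Group stages into waves; stages in one wave may run concurrently."""
    deps = predecessors(stages)
    level: Dict[str, int] = {}

    def depth(name: str) -> int:
        if name not in level:
            level[name] = 1 + max((depth(p) for p in deps[name]), default=-1)
        return level[name]

    grouped: Dict[int, List[str]] = {}
    for name, _ in stages:
        grouped.setdefault(depth(name), []).append(name)
    return [grouped[i] for i in sorted(grouped)]
```

Under this reading nothing makes `publish` wait for `diagram`, so whether diagrams must be complete before publication may be worth confirming when the configuration is finalised.
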

---

## Success Criteria

Council-docs execution is successful when:

1. **Complete coverage:** Every formal technical output from every other council has a corresponding edited, formatted, and cross-referenced document in the DistOS specification suite.

2. **Terminology consistency:** The master glossary contains definitions for every distinct technical term used across all specification documents. Zero instances of unresolved synonym conflicts in published documents.

3. **Cross-reference integrity:** All inter-document cross-references resolve to valid UCXL-addressed artifacts. Zero dangling references in the final published suite.

4. **ADR completeness:** An ADR exists for every major architectural decision identified by council-arch's decision traversal. Every ADR links to its UCXL source artifacts.

5. **Diagram coverage:** C4 context, container, and component diagrams exist for all major DistOS subsystems. Sequence diagrams exist for all major inter-component workflows. State machine diagrams exist for all major component lifecycles.

6. **Readability standard:** A systems engineer with GPU cluster experience but no prior DistOS exposure can read the architecture overview document and correctly answer questions about the high-level design without consulting supplementary materials.

7. **Format compliance:** All documents conform to the agreed formatting standard (IEEE-aligned Markdown or equivalent) without structural inconsistencies.

8. **Version integrity:** Document versions are correctly pinned to UCXL temporal addresses. A reader can navigate to any historical version of any document by following the UCXL version chain.

9. **Editorial queries resolved:** All formal editorial queries raised by council-docs have been responded to and resolved by the relevant technical councils.

10. **Final specification published:** The complete DistOS Specification v1.0 document suite is published as a unified, navigable artifact by Day 14.

---

## Timeline Mapping

| Phase | Days | Council-Docs Activities |
|-------|------|------------------------|
| **Phase 1: Research** | 1–3 | Establish documentation standards guide; define ADR template; seed master glossary with terms from DistOS design briefs; configure UCXL monitoring for artifact intake; coordinate with council-arch on narrative format integration |
| **Phase 2: Architecture** | 3–6 | Begin editing first architecture outputs as they arrive from council-synth and subsystem councils; build cross-reference registry; produce ADRs from early architecture decisions; create initial C4 context diagram; first glossary broadcast to all councils |
| **Phase 3: Formal Specification** | 6–10 | High-throughput editing and formatting as formal specs arrive from council-verify and all subsystem councils; API reference documentation production; sequence and state machine diagram generation; continuous cross-reference validation and glossary updates; ADR compilation accelerates |
| **Phase 4: Integration** | 10–12 | Assemble subsystem documents into coherent specification suite; resolve all cross-reference issues; finalise ADR register; incorporate QA test report formatted edition; complete all diagrams; master glossary final review |
| **Phase 5: Documentation** | 12–14 | Final editorial pass on all documents; publish DistOS Specification v1.0; produce publication-ready version of the complete specification suite; validate all UCXL addresses in published documents; confirm archaeology readability test with council-arch |

---

*Council Design Brief v1.0 — DistOS Project — council-docs*
*Generated: 2026-02-24*
531
councils/12-decision-archaeology.md
Normal file
@@ -0,0 +1,531 @@
# Council Design Brief: Decision Archaeology

---

## Council Identification

| Field | Value |
|-------|-------|
| **Council ID** | `council-arch` |
| **Mission** | Navigate the complete UCXL decision graph produced by all DistOS councils and generate human-comprehensible narratives of how every architectural decision was reached. Council-arch is the proof-of-concept for the central claim of the CHORUS experiment: that UCXL makes the complex interactions of hundreds of AI agents legible to human observers. If a human who has never seen this project can read council-arch's outputs and understand *why* DistOS was designed the way it was, the experiment succeeds. |
| **UCXL Base Address** | `ucxl://council-arch:*@DistOS:arch/*^/` |
| **Agent Count** | ~40 agents |
| **Operates From** | Day 1 — council-arch is the first council to activate and the last to conclude |

---

## Scope and Responsibilities

Council-arch's scope is interpretive and narrative. It does not make architectural decisions; it reads, traverses, and explains them. Specifically:

- Continuously monitoring all UCXL streams across all DistOS councils from Day 1, building a real-time map of the decision graph
- Traversing temporal UCXL chains (`*^`, `*~N`, `~~N` syntax) to reconstruct the sequence of decisions that led to any given artifact
- Generating human-readable narratives explaining individual decisions, decision chains, and cross-council conflicts and their resolutions
- Answering the question "why was this decided?" for any architectural choice in DistOS, by tracing the decision chain back through the UCXL graph
- Producing daily and weekly progress summaries accessible to human observers who are not following the per-council streams
- Analysing the downstream impact of decisions: how did decision X made by council-sched affect council-mem, council-fs, and council-sec?
- Producing executive-level summaries for each phase of the project
- Feeding all narrative outputs to council-docs for incorporation into the final specification document's rationale sections
- Conducting the ultimate readability test: are the archaeology outputs sufficient for a newcomer to reconstruct the design rationale?

Council-arch does **not** write technical specifications. It does not evaluate whether decisions are correct — that is council-verify's domain. It does not resolve conflicts — that is council-synth's domain. It only reads, traverses, and narrates.

### Why This Council is the Most Important

The CHORUS project's central hypothesis is that UCXL can preserve the full provenance of AI agent decision-making in a form that humans can later inspect and understand. This has profound implications:

- **Accountability:** If an AI-generated specification contains an error, a human auditor must be able to trace it to the decision that introduced it and understand the reasoning that led there.
- **Verifiability:** If a design decision seems wrong in hindsight, a human must be able to reconstruct the information that was available at the time the decision was made.
- **Maintainability:** A human engineer who must later modify DistOS must be able to understand why things are the way they are, not just what they are.
- **Trust:** The ability for humans to inspect and understand AI reasoning chains is prerequisite for trusting AI-generated specifications in production systems.

Council-arch's outputs are the primary evidence for whether CHORUS achieved its hypothesis.

---

## Research Domains
|
||||||
|
|
||||||
|
### 1. UCXL Temporal Navigation
|
||||||
|
|
||||||
|
The UCXL addressing scheme provides three mechanisms for temporal navigation:
|
||||||
|
|
||||||
|
**`*^` (current head)** — the most recent version of an artifact:
|
||||||
|
```
ucxl://council-sched:scheduler@DistOS:sched/*^/spec/preemption-policy
```
|
||||||
|
|
||||||
|
**`*~N` (N versions before head)** — navigate N steps back in the version chain:
|
||||||
|
```
ucxl://council-sched:scheduler@DistOS:sched/*~3/spec/preemption-policy
```
|
||||||
|
This addresses the preemption policy as it stood 3 revisions before the current head; in this example, that is the version from before the conflict with council-mem was resolved.
|
||||||
|
|
||||||
|
**`~~N` (absolute version N)** — navigate to a specific version by sequence number:
|
||||||
|
```
ucxl://council-sched:scheduler@DistOS:sched/~~7/spec/preemption-policy
```
|
||||||
|
This is exactly version 7 of the preemption policy, regardless of how many versions now exist.
|
||||||
|
|
||||||
|
Council-arch Traversers use these mechanisms to reconstruct decision chains. When a specification artifact changes, the previous version is reachable via `*~1`. The version before that is `*~2`, and so on back to the origin (`~~1`). By traversing this chain and reading the commit message, author agent, and timestamp at each step, a complete chronological narrative of how that artifact evolved can be constructed.
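The three selector forms can be illustrated with a small resolver. This is a minimal sketch, assuming versions are numbered from 1 (the origin) up to the current head; the function names `resolve_selector` and `walk_back` are hypothetical and are not part of any UCXL tooling described in this brief.

```python
import re
from typing import Optional

# Assumed selector grammar, inferred from the three forms described above.
_RELATIVE = re.compile(r"^\*~(\d+)$")   # *~N : N versions before head
_ABSOLUTE = re.compile(r"^~~(\d+)$")    # ~~N : absolute version N


def resolve_selector(selector: str, head: int) -> Optional[int]:
    """Map a temporal selector to an absolute version number (1-based).

    Returns None when the selector points before the origin or past the head.
    """
    if selector == "*^":
        version = head
    elif (m := _RELATIVE.match(selector)):
        version = head - int(m.group(1))
    elif (m := _ABSOLUTE.match(selector)):
        version = int(m.group(1))
    else:
        raise ValueError(f"unrecognised temporal selector: {selector!r}")
    return version if 1 <= version <= head else None


def walk_back(head: int) -> list[str]:
    """List the selectors a Traverser would visit from head back to the origin."""
    return ["*^"] + [f"*~{n}" for n in range(1, head)]
```

With a head at version 10, `*~3` and `~~7` resolve to the same artifact version, which is why a narrative should record the absolute form once the chain has been walked: relative selectors drift as new versions are published.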
|
||||||
|
|
||||||
|
**Decision graph traversal:** Beyond single-artifact chains, council-arch must also navigate cross-artifact dependencies. When artifact A references artifact B, and B changes, A may be affected. Traversers must be capable of:
|
||||||
|
- Identifying all artifacts that reference a given UCXL address
|
||||||
|
- Traversing the full transitive dependency graph for any artifact
|
||||||
|
- Finding the temporal moment at which two artifacts' versions were mutually consistent
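The first two capabilities in the list above can be sketched as a reverse-reference index plus a breadth-first walk. This is an illustrative sketch only: the input shape (a mapping from each artifact to the addresses it cites) and both function names are assumptions, not a description of how Traversers actually store the graph.

```python
from collections import defaultdict, deque


def build_reverse_index(references: dict[str, set[str]]) -> dict[str, set[str]]:
    """Invert 'artifact -> addresses it cites' into 'address -> artifacts citing it'."""
    cited_by: dict[str, set[str]] = defaultdict(set)
    for artifact, targets in references.items():
        for target in targets:
            cited_by[target].add(artifact)
    return cited_by


def transitive_dependents(address: str, cited_by: dict[str, set[str]]) -> set[str]:
    """Breadth-first walk over the reverse index; tolerates reference cycles."""
    seen: set[str] = set()
    queue = deque([address])
    while queue:
        current = queue.popleft()
        for dependent in cited_by.get(current, ()):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    seen.discard(address)  # an artifact in a cycle is not reported as its own dependent
    return seen
```

The `seen` set matters because councils may revisit earlier decisions, so the reference graph is not guaranteed to be acyclic.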
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Lampson, B. W. (1983). "Hints for Computer System Design." SOSP 1983. (Provenance and audit trails in system design.)
|
||||||
|
- Lomet, D. B. (2001). "The Case for Always Full Logging." VLDB 2001. (Temporal database principles applicable to UCXL versioning.)
|
||||||
|
- Snodgrass, R. T. (1999). Developing Time-Oriented Database Applications in SQL. Morgan Kaufmann. (Temporal querying patterns directly applicable to UCXL navigation.)
|
||||||
|
|
||||||
|
### 2. Narrative Generation from Decision Chains
|
||||||
|
|
||||||
|
Converting a sequence of UCXL artifacts and their diffs into a coherent human-readable narrative is a non-trivial task. Council-arch Narrators must:
|
||||||
|
|
||||||
|
- Identify the decision points (moments when a meaningful choice was made between alternatives)
|
||||||
|
- Identify the context that existed at the time (what other artifacts were current, what constraints were known)
|
||||||
|
- Explain the alternatives that were not chosen (from proposal artifacts)
|
||||||
|
- Explain the rationale that was recorded (from decision artifacts and quorum records)
|
||||||
|
- Write this as a story that a human reader finds comprehensible and that correctly represents the technical content
|
||||||
|
|
||||||
|
The narrative must not editorialize or second-guess decisions. It must represent what actually happened in the decision-making process, sourced entirely from UCXL-addressed artifacts.
|
||||||
|
|
||||||
|
**Narrative structure for a single decision:**
|
||||||
|
```
Decision: [What was decided, in plain language]
Context: [What was the state of the system design when this decision was made?]
Trigger: [What prompted this decision? A conflict? A new constraint? A proposal?]
Alternatives Considered: [What other options were on the table, from proposal artifacts]
Reasoning: [What logic was applied, sourced from deliberation artifacts]
Decision Makers: [Which agents/councils made this decision, with their UCXL addresses]
Impact: [Which other artifacts were updated as a result of this decision]
UCXL Source Chain: [The UCXL addresses of every artifact in this decision chain]
```
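The template above could be held as a typed record so that a narrative cannot be rendered without its source chain. The following is a sketch under that assumption; the class name `DecisionNarrative` and its field names are hypothetical and simply mirror the template labels.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionNarrative:
    decision: str
    context: str
    trigger: str
    alternatives: list[str]
    reasoning: str
    decision_makers: list[str]  # UCXL addresses of the agents or councils involved
    impact: list[str]           # UCXL addresses of artifacts updated as a result
    source_chain: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the record using the field labels of the narrative template."""
        if not self.source_chain:
            # Narratives must be sourced entirely from UCXL-addressed artifacts.
            raise ValueError("a narrative must cite at least one UCXL source")
        rows = [
            ("Decision", self.decision),
            ("Context", self.context),
            ("Trigger", self.trigger),
            ("Alternatives Considered", "; ".join(self.alternatives)),
            ("Reasoning", self.reasoning),
            ("Decision Makers", ", ".join(self.decision_makers)),
            ("Impact", ", ".join(self.impact)),
            ("UCXL Source Chain", ", ".join(self.source_chain)),
        ]
        return "\n".join(f"{label}: {value}" for label, value in rows)
```

Rejecting an empty source chain at render time is one way to enforce the no-editorialising rule mechanically rather than by review alone.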
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Bruner, J. (1986). Actual Minds, Possible Worlds. Harvard University Press. (Narrative as a cognitive mode for understanding complex sequences of events.)
|
||||||
|
- van Dijk, T. A., & Kintsch, W. (1983). Strategies of Discourse Comprehension. Academic Press. (How humans comprehend complex information through narrative macrostructure.)
|
||||||
|
- Klein, G. (2008). "Naturalistic Decision Making." Human Factors 50(3). (How decision narratives support sensemaking in complex domains.)
|
||||||
|
|
||||||
|
### 3. Conflict Resolution Storytelling
|
||||||
|
|
||||||
|
Cross-council conflicts are the most complex decision chains to narrate. When council-sched proposes a scheduling policy that conflicts with a memory guarantee claimed by council-mem, a series of events unfolds:
|
||||||
|
|
||||||
|
1. The conflict is detected (by whom? council-synth's monitoring, or by a council agent encountering an inconsistency?)
|
||||||
|
2. The conflict is escalated (what channel? what UCXL addresses are involved?)
|
||||||
|
3. Proposals are made by each side
|
||||||
|
4. Council-synth mediates
|
||||||
|
5. A resolution is reached
|
||||||
|
6. Both councils update their specifications to reflect the resolution
|
||||||
|
|
||||||
|
Council-arch must narrate this entire arc in a way that explains:
|
||||||
|
- Why the conflict arose in the first place (what assumptions each council was making)
|
||||||
|
- What was at stake (what would have been lost by each possible resolution)
|
||||||
|
- How the resolution was reached (what compromise or synthesis was found)
|
||||||
|
- What the long-term consequences of the resolution are (which downstream decisions were affected)
|
||||||
|
|
||||||
|
**Worked example: Scheduling vs. Memory Conflict**
|
||||||
|
|
||||||
|
Suppose council-sched proposes preemptive GPU kernel migration as a scheduling mechanism, and council-mem objects that this requires GPU memory to remain addressable after the kernel context is suspended — violating their isolation model. The conflict resolution narrative must explain:
|
||||||
|
|
||||||
|
- The scheduling motivation: why preemptive migration is necessary for fair scheduling at 1024-node scale
|
||||||
|
- The memory motivation: why cross-context memory addressability is a security boundary violation
|
||||||
|
- The resolution: perhaps a restricted migration model where memory is remapped but not cross-accessible during migration
|
||||||
|
- The compromise cost: the migration overhead is higher than the pure preemptive model, requiring scheduler adjustments
|
||||||
|
|
||||||
|
This narrative is the intellectual history of a design trade-off. It is among the most valuable content council-arch produces.
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Fischer, M. J., Lynch, N. A., & Paterson, M. S. (1985). "Impossibility of Distributed Consensus with One Faulty Process." JACM 32(2). (The archetypal example of why distributed system design involves irreducible trade-offs — understanding the FLP impossibility theorem requires understanding the chain of reasoning that leads to it.)
|
||||||
|
- Lamport, L. (1978). "Time, Clocks, and the Ordering of Events in a Distributed System." CACM 21(7). (Decision chains in distributed systems require temporal reasoning that Lamport's original narrative illuminates.)
|
||||||
|
- Brooks, F. P. (1975). The Mythical Man-Month. Addison-Wesley. Chapter 1: The Tar Pit. (The intellectual history of design decisions in complex systems.)
|
||||||
|
|
||||||
|
### 4. Decision Impact Analysis
|
||||||
|
|
||||||
|
When a decision changes, its impact propagates through the dependency graph. Council-arch Impact Analysts must quantify and narrate this propagation.
|
||||||
|
|
||||||
|
**Impact analysis process:**
|
||||||
|
1. Identify the changed artifact at its UCXL address
|
||||||
|
2. Query the cross-reference registry (maintained by council-docs) for all artifacts that reference this address
|
||||||
|
3. For each dependent artifact, determine whether the change constitutes a breaking change (the dependent artifact's assumptions are violated) or a compatible change
|
||||||
|
4. Trace the transitive impact: do any of the immediately affected artifacts in turn affect further artifacts?
|
||||||
|
5. Produce an impact map showing the full propagation graph
|
||||||
|
6. Write a narrative explaining which systems are affected and why
|
||||||
|
|
||||||
|
**UCXL impact query pattern:**
|
||||||
|
```
# Find all artifacts that reference a changed artifact
ucxl://council-sched:scheduler@DistOS:sched/*~1/spec/preemption-policy
# (version before the change)
→ find-references: which artifacts cited this version?
→ for each: does the change break the citing artifact's assumptions?
```
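Steps 2 to 5 of the impact analysis process can be sketched as a level-by-level walk that labels each edge as breaking or compatible. This is a hedged illustration: the `cited_by` mapping stands in for the cross-reference registry, `is_breaking` is a hypothetical caller-supplied judgement, and the rule that only breaking changes propagate further is an assumption of this sketch rather than something the brief specifies.

```python
from collections import deque
from typing import Callable

ImpactRow = tuple[str, str, int, str]  # (source, dependent, depth, kind)


def impact_map(
    changed: str,
    cited_by: dict[str, set[str]],
    is_breaking: Callable[[str, str], bool],
) -> list[ImpactRow]:
    """Trace how a change propagates through artifacts that cite it."""
    rows: list[ImpactRow] = []
    visited = {changed}
    queue = deque([(changed, 0)])
    while queue:
        source, depth = queue.popleft()
        for dependent in sorted(cited_by.get(source, ())):
            kind = "breaking" if is_breaking(source, dependent) else "compatible"
            rows.append((source, dependent, depth + 1, kind))
            # Assumption: a compatible change leaves the dependent untouched,
            # so nothing downstream of it needs re-examination.
            if kind == "breaking" and dependent not in visited:
                visited.add(dependent)
                queue.append((dependent, depth + 1))
    return rows
```

The returned rows are already an edge list with depths, so they can feed both the impact map and the narrative that explains which systems are affected.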
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Lehman, M. M., & Belady, L. A. (1985). Program Evolution: Processes of Software Change. Academic Press. (How changes propagate through complex software systems — directly applicable to specification change propagation.)
|
||||||
|
- Parnas, D. L. (1972). "On the Criteria to be Used in Decomposing Systems into Modules." CACM 15(12). (Information hiding and the management of change impact in system design.)
|
||||||
|
|
||||||
|
### 5. Visualisation of Decision Graphs
|
||||||
|
|
||||||
|
Decision chains are not linear — they branch, merge, and form complex directed acyclic graphs (or occasionally cyclic graphs when councils revisit earlier decisions). Council-arch must produce visual representations of these graphs.
|
||||||
|
|
||||||
|
**Decision graph elements:**
|
||||||
|
- Nodes: individual decision events (a council making a choice between alternatives)
|
||||||
|
- Edges: causal relationships (decision A enabled or constrained decision B)
|
||||||
|
- Temporal dimension: the sequence in which decisions were made
|
||||||
|
- Council dimension: which councils were involved in each decision
|
||||||
|
- Conflict markers: points where cross-council disagreement occurred
|
||||||
|
- Resolution markers: points where council-synth resolved a conflict
|
||||||
|
|
||||||
|
**Graph visualisation standards:**
|
||||||
|
- DOT language (Graphviz) for machine-generated decision graphs
|
||||||
|
- D3.js-compatible JSON for interactive browser-based exploration
|
||||||
|
- Static PNG/SVG exports for inclusion in documentation
|
||||||
|
|
||||||
|
The decision graph for the entire DistOS project — all councils, all phases — is the single most valuable artifact council-arch produces. It is the complete intellectual history of the specification in visual form.
|
||||||
|
|
||||||
|
Key references:
|
||||||
|
- Sugiyama, K., Tagawa, S., & Toda, M. (1981). "Methods for Visual Understanding of Hierarchical System Structures." IEEE Transactions on Systems, Man, and Cybernetics 11(2). (Foundational algorithm for laying out directed graphs — the Sugiyama framework is standard in Graphviz and similar tools.)
|
||||||
|
- Di Battista, G., Eades, P., Tamassia, R., & Tollis, I. G. (1999). Graph Drawing: Algorithms for the Visualization of Graphs. Prentice Hall.
|
||||||
|
|
||||||
|
### 6. The "Why Was This Decided?" Query Interface
|
||||||
|
|
||||||
|
Council-arch must support an interactive query capability: given any artifact or decision in DistOS, a human should be able to ask "why was this decided this way?" and receive a clear, sourced answer.
|
||||||
|
|
||||||
|
**Query answering process:**
|
||||||
|
1. Receive a natural language query referencing a DistOS decision
|
||||||
|
2. Identify the relevant UCXL artifact(s) using temporal addressing
|
||||||
|
3. Traverse the decision chain that produced that artifact
|
||||||
|
4. Retrieve related deliberation artifacts, proposal artifacts, and conflict resolution records
|
||||||
|
5. Synthesise a narrative answer that cites specific UCXL-addressed sources
|
||||||
|
6. Present the answer with full provenance: every claim is linked to a UCXL artifact
|
||||||
|
|
||||||
|
Example query: "Why does the DistOS scheduler use work-stealing rather than centralised queuing for GPU job distribution?"
|
||||||
|
|
||||||
|
Expected answer structure:
|
||||||
|
```
The DistOS scheduler uses work-stealing because:

[1] On Day 4, council-sched initially proposed centralised queuing
    (ucxl://council-sched:scheduler@DistOS:sched/~~3/proposal/scheduling-model-v1).

[2] Council-net raised a scalability concern: centralised queuing
    at 1024 nodes requires the queue coordinator to handle ~50k
    scheduling decisions per second, exceeding its modelled capacity
    (ucxl://council-net:scalability-analyst@DistOS:net/~~5/objection/queue-throughput).

[3] Council-synth mediated and proposed work-stealing as an alternative,
    citing the Cilk scheduler as a reference implementation
    (ucxl://council-synth:mediator@DistOS:synth/~~8/resolution/scheduling-model).

[4] Council-sched accepted the resolution and updated the scheduling
    specification on Day 5
    (ucxl://council-sched:scheduler@DistOS:sched/~~7/spec/scheduling-model).

The work-stealing choice trades centralised visibility for distributed
scalability. The implications for fairness guarantees are documented in
ADR-047 (ucxl://council-docs:formatter@DistOS:docs/~~23/adr/ADR-047).
```
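The provenance requirement in step 6 (every claim linked to a UCXL artifact) can be checked mechanically before an answer is published. The sketch below assumes a simple `Claim` record and a `format_answer` helper, both hypothetical, and treats any source that does not start with `ucxl://` as missing.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    source: str  # UCXL address backing the claim


def format_answer(headline: str, claims: list[Claim]) -> str:
    """Render numbered, sourced claims; refuse to render unsourced ones."""
    unsourced = [c.text for c in claims if not c.source.startswith("ucxl://")]
    if unsourced:
        raise ValueError(f"claims without a UCXL source: {unsourced}")
    lines = [headline, ""]
    for number, claim in enumerate(claims, start=1):
        lines.append(f"[{number}] {claim.text}")
        lines.append(f"    ({claim.source})")
        lines.append("")
    return "\n".join(lines).rstrip()
```

Failing loudly on an unsourced claim keeps the query archive aligned with the rule that no narrative statement exists without a UCXL citation.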
|
||||||
|
|
||||||
|
### 7. Daily, Phase, and Final Progress Summaries
|
||||||
|
|
||||||
|
Council-arch produces structured summaries at regular intervals:
|
||||||
|
|
||||||
|
**Daily summary (end of each project day):**
|
||||||
|
- Decisions made across all councils in the past 24 hours
|
||||||
|
- Conflicts opened and their current status
|
||||||
|
- Conflicts resolved and the resolution narratives
|
||||||
|
- New artifacts published and their significance
|
||||||
|
- Key open questions that remain unresolved
|
||||||
|
|
||||||
|
**Phase summary (at each phase boundary):**
|
||||||
|
- Major architectural decisions made in this phase
|
||||||
|
- How the design evolved from the start to the end of the phase
|
||||||
|
- Unresolved issues carried forward
|
||||||
|
- How phase deliverables relate to the overall DistOS design
|
||||||
|
|
||||||
|
**Final summary (Day 14):**
|
||||||
|
- The complete intellectual history of DistOS in narrative form
|
||||||
|
- The 10 most consequential decisions and why they shaped the design
|
||||||
|
- The 5 hardest conflicts and how they were resolved
|
||||||
|
- What CHORUS and UCXL proved about AI agent collective design
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Agent Roles
|
||||||
|
|
||||||
|
| Role | Count | Responsibilities |
|------|-------|-----------------|
| **Narrators** | 15 | Write human-readable decision stories from UCXL artifact chains; produce daily summaries; write conflict resolution narratives; maintain narrative consistency across the full project arc |
| **Traversers** | 10 | Continuously navigate UCXL temporal graphs across all councils; extract decision paths; identify when artifacts change and trace the causal chain; maintain the live decision graph database |
| **Impact Analysts** | 10 | Trace how decisions propagate through the dependency graph; identify breaking changes; produce impact maps when significant decisions are made; flag unresolved downstream consequences to council-synth |
| **Summarisers** | 5 | Produce executive-level overviews at phase boundaries; write the "why was this decided?" query responses; generate the final project summary; support council-docs with rationale sections |
|
||||||
|
|
||||||
|
**Total: 40 agents**
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Key Deliverables
|
||||||
|
|
||||||
|
| Deliverable | UCXL Address | Due Phase |
|-------------|--------------|-----------|
| UCXL Traversal Methodology | `ucxl://council-arch:traverser@DistOS:arch/*^/spec/traversal-methodology` | Phase 1 |
| Narrative Template Library | `ucxl://council-arch:narrator@DistOS:arch/*^/template/narrative-templates` | Phase 1 |
| Decision Graph Schema | `ucxl://council-arch:traverser@DistOS:arch/*^/spec/decision-graph-schema` | Phase 2 |
| Live Decision Graph (continuous) | `ucxl://council-arch:traverser@DistOS:arch/*^/graph/decision-graph-live` | Phase 2 (continuous) |
| Daily Progress Summary — Day 1 | `ucxl://council-arch:summariser@DistOS:arch/*^/summary/day-01` | Phase 1 |
| Daily Progress Summary — Day N | `ucxl://council-arch:summariser@DistOS:arch/~~N/summary/day-{N}` | Daily |
| Phase 1 Summary | `ucxl://council-arch:summariser@DistOS:arch/*^/summary/phase-01` | End of Day 3 |
| Phase 2 Summary | `ucxl://council-arch:summariser@DistOS:arch/*^/summary/phase-02` | End of Day 6 |
| Phase 3 Summary | `ucxl://council-arch:summariser@DistOS:arch/*^/summary/phase-03` | End of Day 10 |
| Phase 4 Summary | `ucxl://council-arch:summariser@DistOS:arch/*^/summary/phase-04` | End of Day 12 |
| Conflict Resolution Narratives (all) | `ucxl://council-arch:narrator@DistOS:arch/*^/narrative/conflict-resolutions` | Continuous |
| Impact Analysis Reports (all) | `ucxl://council-arch:impact-analyst@DistOS:arch/*^/analysis/impact-reports` | Continuous |
| "Why Was This Decided?" Query Archive | `ucxl://council-arch:summariser@DistOS:arch/*^/query/why-archive` | Continuous |
| Decision Graph Visualisation | `ucxl://council-arch:traverser@DistOS:arch/*^/diagram/decision-graph-full` | Phase 4 |
| Final Project Narrative | `ucxl://council-arch:narrator@DistOS:arch/*^/narrative/final-project-narrative` | Phase 5 |
| CHORUS Experiment Assessment | `ucxl://council-arch:summariser@DistOS:arch/*^/assessment/chorus-experiment` | Phase 5 |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Decision Points
|
||||||
|
|
||||||
|
### DA-01: Traversal Depth Limit
|
||||||
|
**Question:** How deep should UCXL traversal go when constructing a decision narrative? Some decision chains may be very long (a scheduling policy that changed 15 times over 14 days, each change referencing multiple earlier artifacts).
|
||||||
|
|
||||||
|
**Options:**
|
||||||
|
- A. Traverse the complete chain back to the origin for every narrative — maximum completeness, potentially very long narratives
|
||||||
|
- B. Traverse to a configurable depth (default: 5 versions back) and provide a "view full chain" link
|
||||||
|
- C. Summarise long chains: provide the origin, the current state, and the 3 most significant changes, with full chain available on request
|
||||||
|
|
||||||
|
**Implications:** Option A produces the most complete narratives but may overwhelm readers. Option C requires Traversers to make judgements about which changes are "most significant" — introducing interpretive risk.
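To make the difference between Options B and C concrete, the sketch below selects which versions of a chain a narrative would cover under each option. It assumes versions are numbered from 1 to the head; the `significance` scores in Option C are a hypothetical input, since how Traversers would judge significance is exactly the open question this decision point raises.

```python
def depth_limited(head: int, depth: int = 5) -> list[int]:
    """Option B: the head plus up to `depth` versions before it, newest first."""
    oldest = max(1, head - depth)
    return list(range(head, oldest - 1, -1))


def summarised(head: int, significance: dict[int, float], top: int = 3) -> list[int]:
    """Option C: origin, head, and the `top` most significant versions in between."""
    middle = range(2, head)
    picked = sorted(middle, key=lambda v: significance.get(v, 0.0), reverse=True)[:top]
    return sorted({1, head, *picked})
```

For a chain of 15 versions, Option B covers six consecutive versions ending at the head, while Option C covers five versions spread across the whole history; Option A corresponds to `depth_limited(head, depth=head)`.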
|
||||||
|
|
||||||
|
### DA-02: Narrative Granularity Levels
|
||||||
|
**Question:** Should council-arch produce narratives at a single level of detail, or at multiple granularity levels for different audiences?
|
||||||
|
|
||||||
|
**Options:**
|
||||||
|
- A. Single level: engineer-grade narratives with full UCXL references throughout
|
||||||
|
- B. Two levels: executive summary + engineer detail, for each major decision
|
||||||
|
- C. Three levels: one-paragraph executive brief + section-length engineer narrative + complete UCXL chain listing
|
||||||
|
|
||||||
|
**Implications:** Option C serves the broadest audience but requires 3x the narrative production effort.
|
||||||
|
|
||||||
|
### DA-03: Real-Time vs. Batched Narrative Production
|
||||||
|
**Question:** Should Narrators write narratives immediately as each decision is made (real-time) or in batches (e.g., end of each day)?
|
||||||
|
|
||||||
|
**Options:**
|
||||||
|
- A. Real-time: narratives produced within minutes of each decision artifact being published
|
||||||
|
- B. End-of-day batches: narratives produced once per day covering all decisions in the preceding 24 hours
|
||||||
|
- C. Hybrid: real-time narratives for P0 conflicts and phase-boundary decisions; daily batches for routine decisions
|
||||||
|
|
||||||
|
**Implications:** Option A provides the most current view of the project state but may produce premature narratives for decisions that are subsequently revised. Option B risks losing context if the reasoning behind a decision is not recorded promptly.
|
||||||
|
|
||||||
|
### DA-04: The Readability Test Methodology
|
||||||
|
**Question:** How is the ultimate test of council-arch's success — "a human who has never seen the project can understand why the system was designed the way it was" — operationalised?
|
||||||
|
|
||||||
|
**Options:**
|
||||||
|
- A. Internal assessment: Summarisers evaluate whether their own narratives would be comprehensible to a newcomer
|
||||||
|
- B. Blind test: At Day 14, a set of novel DistOS design questions is posed to an agent who has access only to council-arch's outputs (not the underlying technical specs). If the agent can answer correctly, the test passes.
|
||||||
|
- C. Human review: The final project narrative is reviewed by a human reader who evaluates comprehensibility
|
||||||
|
|
||||||
|
**Implications:** Option B is the most rigorous test of UCXL's information preservation claims. Option C introduces human review into the timeline but provides the most authentic signal.
|
||||||
|
|
||||||
|
### DA-05: Conflict Narrative Attribution
|
||||||
|
**Question:** When narrating a conflict resolution, should the narrative attribute positions to specific agent roles (e.g., "the council-sched Preemption Specialist argued...") or to councils as collective bodies?
|
||||||
|
|
||||||
|
**Options:**
|
||||||
|
- A. Agent-role attribution: name specific roles and their UCXL addresses
|
||||||
|
- B. Council attribution: treat each council as a single voice
|
||||||
|
- C. Position attribution: describe positions without attributing to any agent ("the preemption camp argued... the isolation camp argued...")
|
||||||
|
|
||||||
|
**Implications:** Option A provides maximum accountability and traceability but may make narratives harder to read. Option C may obscure the council structure. Option B is a readable compromise.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Dependencies on Other Councils
|
||||||
|
|
||||||
|
| Council | Dependency Type | What Council-Arch Needs |
|---------|----------------|------------------------|
| **ALL councils** | Critical upstream | Real-time access to all UCXL artifact streams; read access to all decision, proposal, deliberation, and resolution artifacts |
| **council-synth** | Priority upstream | Conflict resolution records are the most narratively significant events; council-synth's mediation artifacts are council-arch's primary source for conflict resolution stories |
| **council-verify** | Upstream | Formal spec evolution history; particularly when invariants are revised in response to conflicts or discoveries |
| **council-qa** | Upstream | Defect discovery events often trigger important decision chains; the story of "how a QA finding changed the architecture" is a key narrative type |
| **council-docs** | Downstream (primary) | Council-arch's narratives feed directly into the documentation council's rationale sections and ADR content |
| **council-docs** | Upstream (cross-reference registry) | Council-arch queries the cross-reference registry to identify which artifacts reference any given UCXL address |
|
||||||
|
|
||||||
|
**Unique access requirement:** Council-arch requires read access to ALL UCXL streams across ALL councils. This is the broadest access scope of any council in DistOS, justified by council-arch's function as the universal observer. Council-arch does not write to other councils' namespaces — it only reads.
|
||||||
|
|
||||||
|
**The mutual dependency with council-docs:** Council-arch and council-docs have a deeply symbiotic relationship. Council-arch produces narratives; council-docs formats and publishes them. Council-docs raises editorial queries; council-arch answers them by traversing the UCXL record. When an Editor in council-docs encounters a specification section that lacks obvious rationale, the correct action is to query council-arch's narrative archive before raising an editorial query to the technical council.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## WHOOSH Configuration
|
||||||
|
|
||||||
|
```yaml
council: council-arch
whoosh:
  formation:
    strategy: observer-pattern
    # council-arch operates as a continuous observer of all other councils
    # with internal specialisation by narrative type
    teams:
      - name: traversal-team
        roles: [traverser]
        size: 10
        operation: continuous-monitoring
        # Traversers run 24/7 from Day 1, monitoring all UCXL streams
        trigger: on-any-artifact-change
      - name: narrative-team
        roles: [narrator]
        size: 15
        operation: event-driven
        trigger: on-significant-decision-or-conflict
      - name: impact-team
        roles: [impact-analyst]
        size: 10
        operation: event-driven
        trigger: on-artifact-change-in-referenced-document
      - name: summary-team
        roles: [summariser]
        size: 5
        operation: scheduled
        schedule: end-of-day + end-of-phase + on-query

  quorum:
    narrative-publication: 2/3-narrators # Two Narrators must agree on a conflict narrative before publication
    impact-assessment: 2/3-impact-analysts
    # Note: council-arch does not vote on technical decisions — quorum here governs narrative quality only

  subchannels:
    - id: ucxl-monitor
      description: Incoming artifact change events from all councils
      subscribers: [council-arch-traversers, council-arch-impact-analysts]
      source: all-councils
      retention: full-history
    - id: narratives-published
      description: Completed narrative artifacts broadcast to consumers
      subscribers: [council-docs, council-synth]
      retention: full-history
    - id: impact-alerts
      description: High-impact decision propagation alerts
      subscribers: [council-synth, affected-councils, council-docs]
      retention: full-history
    - id: daily-summaries
      description: Daily progress summaries
      subscribers: [all-councils]
      retention: full-history
    - id: query-responses
      description: Responses to "why was this decided?" queries
      subscribers: [querying-party, council-docs]
      retention: full-history
    - id: readability-test
      description: Outputs used in the Day 14 readability assessment
      subscribers: [council-docs, designated-assessor]
      retention: full-history

  continuous_monitoring:
    # council-arch monitors everything
    monitor_pattern: "ucxl://*:*@DistOS:*/*"
    # Do not follow external references — DistOS namespace only
    scope: DistOS
    # Trigger narrative generation for significant events
    significance_threshold:
      - event: conflict-opened
        action: queue-conflict-narrative-immediately
      - event: conflict-resolved
        action: queue-resolution-narrative-immediately
      - event: specification-revised
        action: queue-impact-analysis
      - event: new-artifact-published
        action: update-decision-graph
      - event: artifact-deprecated
        action: trace-and-update-cross-references
```
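The `continuous_monitoring` block amounts to a scope filter followed by an event-to-action lookup. The sketch below shows one way that routing could behave. It assumes shell-style wildcard matching for `monitor_pattern`, which this brief does not confirm as the actual UCXL matching semantics, and the `route` function is hypothetical.

```python
from fnmatch import fnmatchcase
from typing import Optional

MONITOR_PATTERN = "ucxl://*:*@DistOS:*/*"

# Event-to-action table copied from the significance_threshold entries above.
ACTIONS = {
    "conflict-opened": "queue-conflict-narrative-immediately",
    "conflict-resolved": "queue-resolution-narrative-immediately",
    "specification-revised": "queue-impact-analysis",
    "new-artifact-published": "update-decision-graph",
    "artifact-deprecated": "trace-and-update-cross-references",
}


def route(address: str, event: str) -> Optional[str]:
    """Return the configured action, or None if out of scope or not significant."""
    # Assumption: '*' in the pattern matches any run of characters, as in fnmatch.
    if not fnmatchcase(address, MONITOR_PATTERN):
        return None
    return ACTIONS.get(event)
```

Returning `None` for unknown events keeps the router passive: council-arch observes everything in the DistOS namespace but only acts on the events the configuration names.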
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Success Criteria
|
||||||
|
|
||||||
|
Council-arch's execution is successful when all of the following are met:
|
||||||
|
|
||||||
|
1. **Complete decision graph:** Every decision event across all 12+ councils is represented in the decision graph, with full UCXL provenance. No orphaned decisions exist (decisions that appear in specifications but have no traceable decision chain).

2. **Conflict narrative coverage:** Every cross-council conflict and its resolution has a corresponding narrative in the archaeology record. The narrative correctly identifies the parties, the positions, the resolution, and the UCXL artifacts involved.

3. **Impact analysis coverage:** Every specification change with downstream dependencies has a corresponding impact analysis. No significant breaking change went unanalysed.

4. **Daily summary completeness:** A daily summary was published for every project day, covering all significant decisions of that day.

5. **Phase summary quality:** Each phase summary gives a newcomer sufficient context to understand what was accomplished in that phase and how it advanced the overall design.

6. **Query answering:** The "why was this decided?" query archive contains responses for all major DistOS architectural decisions. A newcomer should be able to find the answer to "why did DistOS choose approach X?" for every significant design choice.

7. **The Readability Test (primary success criterion):** The Final Project Narrative, produced by council-arch and formatted by council-docs, is presented to an evaluator who has not participated in the DistOS design process. The evaluator is asked to answer a set of questions about the rationale behind 10 significant DistOS design decisions. The test passes if the evaluator can correctly explain the reasoning behind at least 8 of the 10 decisions, citing specific evidence from the narrative.

8. **UCXL hypothesis validation:** Every claim in council-arch's narratives is directly traceable to a specific UCXL-addressed artifact. No narrative statement exists without a UCXL source citation. This validates that UCXL preserved enough information for meaningful retrospective analysis.

9. **CHORUS experiment assessment:** The final CHORUS Experiment Assessment document articulates what the DistOS experiment proved about multi-agent AI collaboration: what worked, what failed, what was surprising, and what the implications are for future CHORUS deployments.
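Two of the criteria above are mechanical enough to check automatically: the absence of orphaned decisions (criterion 1) and the 8-of-10 pass mark of the Readability Test (criterion 7). The following is a minimal sketch only. The `Decision` record, its `parents` and `is_root` fields, and the short `dr-*` identifiers are hypothetical stand-ins for illustration; the brief does not define how the decision graph is stored.

```python
from dataclasses import dataclass, field


@dataclass
class Decision:
    """Hypothetical decision-graph node; field names are assumptions."""
    address: str                                       # identifier of the decision record
    parents: list[str] = field(default_factory=list)   # decisions this one derives from
    is_root: bool = False                              # e.g. a founding council-formation event


def has_chain_to_root(graph: dict[str, Decision], start: str) -> bool:
    """Depth-first search: True if some path from `start` reaches a root decision."""
    stack, seen = [start], set()
    while stack:
        addr = stack.pop()
        if addr in seen or addr not in graph:
            continue  # already explored, or a dangling reference
        seen.add(addr)
        node = graph[addr]
        if node.is_root:
            return True
        stack.extend(node.parents)
    return False


def find_orphans(graph: dict[str, Decision], cited_in_specs: set[str]) -> set[str]:
    """Decisions cited by specifications that have no traceable decision chain."""
    return {addr for addr in cited_in_specs if not has_chain_to_root(graph, addr)}


def readability_test_passes(correct: int, total: int = 10, required: int = 8) -> bool:
    """Criterion 7: at least `required` of `total` rationales explained correctly."""
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    return correct >= required
```

In this sketch a decision counts as orphaned when none of its parent links lead back to a root event, including the case where a parent reference points at a record that is missing from the graph.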
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Timeline Mapping
|
||||||
|
|
||||||
|
| Phase | Days | Council-Arch Activities |
|-------|------|------------------------|
| **Phase 1: Research** | 1–3 | UCXL monitoring infrastructure active from Hour 1 of Day 1; build traversal tooling; define narrative templates; begin decision graph with initial council formation events; publish Day 1, Day 2, Day 3 summaries; produce Phase 1 summary at Day 3 |
| **Phase 2: Architecture** | 3–6 | Narrate the emergence of the core architecture from council-synth's synthesis work; capture early conflict events; produce impact analyses as the first cross-council dependencies form; first "why was this decided?" entries; Phase 2 summary at Day 6 |
| **Phase 3: Formal Specification** | 6–10 | Highest-intensity phase for council-arch: formal specification is being written, conflicts are frequent, specifications are being revised; narrate each conflict resolution in real time; continuous impact analysis; daily summaries every day; Phase 3 summary at Day 10 |
| **Phase 4: Integration** | 10–12 | Narrate cross-subsystem integration decisions; trace how integration testing findings propagate back to specifications; work with council-docs on ADR rationale content; Phase 4 summary at Day 12 |
| **Phase 5: Documentation** | 12–14 | Produce Final Project Narrative; complete decision graph visualisation; compile "why was this decided?" query archive; conduct readability test; write CHORUS Experiment Assessment; support council-docs in final specification publication |
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## Appendix: UCXL Traversal Examples
|
||||||
|
|
||||||
|
The following examples illustrate how council-arch Traversers use UCXL temporal syntax to reconstruct decision chains.
|
||||||
|
|
||||||
|
### Example 1: Tracing a Specification Revision
|
||||||
|
|
||||||
|
A Traverser observes that the preemption policy specification has been revised:
|
||||||
|
```
New: ucxl://council-sched:scheduler@DistOS:sched/*^/spec/preemption-policy
Old: ucxl://council-sched:scheduler@DistOS:sched/*~1/spec/preemption-policy
```
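The two addresses above differ only in the segment that follows the task name (`*^` versus `*~1`). A minimal sketch of splitting an address into its parts, so that a Traverser can confirm two addresses name the same artifact at different points in time, might look like the following. The field names (`agent`, `role`, `project`, `task`, `temporal`, `path`) and the regular expression are assumptions inferred from the shape of the example addresses, not a UCXL grammar defined in this brief, and the sketch does not interpret the temporal tokens.

```python
import re
from typing import NamedTuple


class UcxlAddress(NamedTuple):
    """Assumed decomposition of a UCXL address; field names are illustrative."""
    agent: str
    role: str
    project: str
    task: str
    temporal: str   # e.g. "*^" or "*~1"; kept as an opaque token here
    path: str


# Assumed shape: ucxl://agent:role@project:task/temporal/path
_UCXL_RE = re.compile(
    r"^ucxl://(?P<agent>[^:/@]+):(?P<role>[^:/@]+)"
    r"@(?P<project>[^:/@]+):(?P<task>[^:/@]+)"
    r"/(?P<temporal>[^/]+)/(?P<path>.+)$"
)


def parse_ucxl(address: str) -> UcxlAddress:
    """Split an address into fields, raising ValueError if it does not match."""
    match = _UCXL_RE.match(address)
    if match is None:
        raise ValueError(f"not a recognised UCXL address: {address!r}")
    return UcxlAddress(**match.groupdict())


def same_artifact(a: UcxlAddress, b: UcxlAddress) -> bool:
    """True if two addresses differ at most in their temporal segment."""
    return a._replace(temporal="") == b._replace(temporal="")


new = parse_ucxl("ucxl://council-sched:scheduler@DistOS:sched/*^/spec/preemption-policy")
old = parse_ucxl("ucxl://council-sched:scheduler@DistOS:sched/*~1/spec/preemption-policy")
print(new.temporal, old.temporal, same_artifact(new, old))
```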
|
||||||
|
|
||||||
|
The Traverser retrieves both versions, computes the diff, and reads the commit metadata:
|
||||||
|
- Author: `council-sched:preemption-specialist`
- Timestamp: Day 7, 14:23 UTC
- Change reference: `ucxl://council-synth:mediator@DistOS:synth/*~2/resolution/scheduling-memory-conflict`
|
||||||
|
|
||||||
|
The Traverser follows the reference to the conflict resolution and retrieves the full conflict chain:
|
||||||
|
```
ucxl://council-mem:memory-guardian@DistOS:mem/~~12/objection/preemption-memory-violation
ucxl://council-sched:scheduler@DistOS:sched/~~8/proposal/restricted-migration-model
ucxl://council-synth:mediator@DistOS:synth/~~15/resolution/scheduling-memory-conflict
```
|
||||||
|
|
||||||
|
This chain is handed to a Narrator to produce the conflict resolution story.
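The reference-following step in this example can be sketched as a small graph walk. The sketch below assumes a hypothetical in-memory store that maps each artifact address to the addresses it cites, and it assumes that the resolution cites the proposal and the proposal cites the objection; the brief does not state which record links to which, so that link structure is illustrative only. A post-order depth-first walk then yields the chain with the earliest cited artifact first.

```python
def collect_chain(store: dict[str, list[str]], start: str) -> list[str]:
    """Follow citations from `start`; return addresses with cited artifacts first."""
    chain: list[str] = []
    seen: set[str] = set()

    def visit(address: str) -> None:
        if address in seen:
            return  # guard against revisiting shared or cyclic references
        seen.add(address)
        for cited in store.get(address, []):
            visit(cited)
        chain.append(address)  # appended after everything it cites

    visit(start)
    return chain


# Addresses are taken from the example; the citation links are assumed.
OBJECTION = "ucxl://council-mem:memory-guardian@DistOS:mem/~~12/objection/preemption-memory-violation"
PROPOSAL = "ucxl://council-sched:scheduler@DistOS:sched/~~8/proposal/restricted-migration-model"
RESOLUTION = "ucxl://council-synth:mediator@DistOS:synth/~~15/resolution/scheduling-memory-conflict"

store = {
    RESOLUTION: [PROPOSAL],
    PROPOSAL: [OBJECTION],
    OBJECTION: [],
}

for address in collect_chain(store, RESOLUTION):
    print(address)
```

Under these assumptions the walk prints the objection, then the proposal, then the resolution, which is the order in which the chain is listed above.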
|
||||||
|
|
||||||
|
### Example 2: Impact Analysis After a Change
|
||||||
|
|
||||||
|
An Impact Analyst observes that the Weka FS consistency claim has been revised:
|
||||||
|
```
ucxl://council-fs:fs-architect@DistOS:fs/*^/spec/consistency-model
```
|
||||||
|
|
||||||
|
The Impact Analyst queries the cross-reference registry for all artifacts that cite this address. The registry returns:
|
||||||
|
- `ucxl://council-mem:checkpoint-designer@DistOS:mem/~~9/spec/checkpoint-procedure`: depends on the FS consistency model
- `ucxl://council-qa:chaos-engineer@DistOS:qa/~~6/spec/test-strategy`: references the FS consistency claims for test design
- `ucxl://council-verify:model-checker@DistOS:verify/~~11/spec/fs-invariants`: formal invariants depend on the FS consistency model
|
||||||
|
|
||||||
|
The Impact Analyst evaluates whether the change is breaking for each dependent:
|
||||||
|
- Checkpoint procedure: breaking (the consistency model was weakened; checkpoint write atomicity assumptions must be revised)
- QA test strategy: non-breaking (tests become more stringent, which is conservative)
- Formal invariants: breaking (invariants must be re-proved under the weaker consistency model)
|
||||||
|
|
||||||
|
The Impact Analyst publishes an alert to the `impact-alerts` subchannel and notifies council-synth that council-mem and council-verify have breaking dependencies on the FS consistency change.
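The lookup and notification steps of this example can be sketched as a reverse index over citations. This is a minimal sketch under stated assumptions: the registry is modelled as a plain dictionary built from a hypothetical `citations` map, the breaking or non-breaking verdicts are supplied by the analyst rather than computed, and the council name is taken to be the text between `ucxl://` and the first colon.

```python
from collections import defaultdict


def build_registry(citations: dict[str, list[str]]) -> dict[str, set[str]]:
    """Invert 'artifact -> addresses it cites' into 'address -> artifacts citing it'."""
    registry: dict[str, set[str]] = defaultdict(set)
    for artifact, cited in citations.items():
        for target in cited:
            registry[target].add(artifact)
    return registry


def council_of(address: str) -> str:
    """Assumed convention: the council is the first component after the scheme."""
    return address.split("://", 1)[1].split(":", 1)[0]


def councils_to_notify(registry: dict[str, set[str]],
                       changed: str,
                       verdicts: dict[str, str]) -> list[str]:
    """Councils owning a dependent that the analyst judged 'breaking'."""
    dependents = registry.get(changed, set())
    return sorted({council_of(d) for d in dependents if verdicts.get(d) == "breaking"})


CHANGED = "ucxl://council-fs:fs-architect@DistOS:fs/*^/spec/consistency-model"
CHECKPOINT = "ucxl://council-mem:checkpoint-designer@DistOS:mem/~~9/spec/checkpoint-procedure"
TEST_STRATEGY = "ucxl://council-qa:chaos-engineer@DistOS:qa/~~6/spec/test-strategy"
INVARIANTS = "ucxl://council-verify:model-checker@DistOS:verify/~~11/spec/fs-invariants"

registry = build_registry({
    CHECKPOINT: [CHANGED],
    TEST_STRATEGY: [CHANGED],
    INVARIANTS: [CHANGED],
})

# Verdicts mirror the analyst's evaluation in the example above.
verdicts = {
    CHECKPOINT: "breaking",
    TEST_STRATEGY: "non-breaking",
    INVARIANTS: "breaking",
}

print(councils_to_notify(registry, CHANGED, verdicts))
```

With the verdicts from the example, the sketch reports council-mem and council-verify as the councils with breaking dependencies.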
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
*Council Design Brief v1.0 — DistOS Project — council-arch*
|
||||||
|
*Generated: 2026-02-24*
|
||||||