Initial DistOS project constitution and council design briefs
12 council design briefs for distributed OS specification project targeting 1024-node Hopper/Grace/Blackwell GPU cluster with Weka parallel filesystem. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
councils/01-process-scheduling.md
# Council Design Brief: Process Scheduling

**Council ID:** `council-sched`

**Mission:** Design the heterogeneous process scheduling subsystem for DistOS, covering GPU kernel dispatch, workload placement across Hopper/Grace/Blackwell accelerators, fair multi-tenant queuing, gang scheduling for distributed training, and energy-aware execution across a 1024-node cluster.

**UCXL Base Address:** `ucxl://council-sched:*@DistOS:scheduling/*`

**Agent Count:** 80

**Status:** Constitution Phase — awaiting WHOOSH formation trigger

**Created:** 2026-02-24

---

## 1. Scope and Responsibilities

`council-sched` owns the complete specification of the DistOS process and kernel scheduling subsystem. Scope boundaries are defined as follows.

**In scope:**

- GPU kernel scheduling: dispatch queue management, kernel concurrency, SM partitioning on Hopper (MIG) and Blackwell, and MPS (Multi-Process Service) lifetime management
- CPU-GPU co-scheduling on Grace Superchip (NVLink-C2C coherent interconnect), including unified virtual address space scheduling implications
- Workload placement policy across heterogeneous accelerator types (H100, GH200, B200), including topology-aware affinity scoring
- Fair multi-tenant queuing: priority classes, weighted fair queuing, and quota enforcement
- Gang scheduling for distributed training workloads: all-or-nothing allocation, partial-allotment strategies, and backfill
- Preemption strategies: checkpoint-based preemption, time-sliced preemption, and priority-inversion avoidance
- NUMA-aware placement across CPU sockets and GPU memory domains
- GPU memory oversubscription scheduling: eviction policy, swap-to-host, and coordinated demand management with `council-mem`
- Energy-aware scheduling: frequency/voltage scaling directives, power capping, and thermal headroom management
- Formal specification of the scheduling API surface exposed to user workloads and to other DistOS subsystems

**Out of scope (delegated):**

- Physical memory allocation and coherence protocol (delegated to `council-mem`)
- Network topology discovery and network-aware placement data (delegated to `council-net`; consumed as inputs)
- Metering, cost attribution, and SLO enforcement (delegated to `council-telemetry`; consumed as outputs)
- Security isolation between tenants at the hardware level (delegated to `council-sec`; policies consumed as constraints)

---

## 2. Research Domains

### 2.1 GPU Kernel Scheduling and SM Partitioning

Survey the CUDA Concurrent Kernel Execution model, the NVIDIA Multi-Instance GPU (MIG) architecture on Hopper (H100), and the NVIDIA Multi-Process Service (MPS). Understand how MIG partitions a GPU into isolated GPC slices with dedicated HBM3 memory, and how MPS enables concurrent kernel execution within a single GPU context without full MIG isolation overhead.

Key materials:

- NVIDIA H100 Architecture Technical Overview (2022) — SM partitioning, GPC layout, NVLink 4.0 bandwidth
- NVIDIA MIG User Guide (CUDA 12.x) — partition profiles (1g.10gb through 7g.80gb), instance isolation, compute and memory capacity tables
- NVIDIA MPS documentation — shared context model, client limit (48 for Hopper), error containment limitations
- AMD ROCm Hardware Abstraction Layer (HAL) source — `amdkfd` KFD driver, compute queue management, HWS (Hardware Scheduler) in GFX12
- Jain et al., "Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs" (RTAS 2019) — software-defined partitioning as a baseline comparison
- Xiao et al., "AntMan: Dynamic Scaling on GPU Clusters for Deep Learning" (OSDI 2020) — GPU memory and compute sharing for co-located workloads

### 2.2 Heterogeneous Workload Placement

Develop a placement model that accounts for the distinct performance characteristics of H100 (PCIe/SXM5), GH200 Grace Superchip (NVLink-C2C), and B200 Blackwell (NVLink 5.0 SXM). Each accelerator type has different compute density, memory bandwidth, and interconnect topology.

Key materials:

- Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (ISCA 2017) — heterogeneous accelerator placement rationale
- Google Borg paper: Verma et al., "Large-scale cluster management at Google with Borg" (EuroSys 2015) — machine heterogeneity handling, alloc sets, resource estimation
- Pollux: Qiao et al., "Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning" (OSDI 2021) — goodput-aware placement for training workloads
- Weng et al., "MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters" (NSDI 2022) — real-world heterogeneous GPU cluster scheduling from Alibaba
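
The placement work above converges on score-based node ranking. A minimal sketch of topology-aware affinity scoring with greedy allocation, assuming illustrative capability numbers and weights (`NodeProfile`, `affinity_score`, and the normalisation ceilings are all hypothetical, not drawn from any proposal):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeProfile:
    """Static per-node capabilities, fed by council-net's topology model."""
    kind: str              # "H100-SXM5", "GH200", "B200"
    tflops: float          # dense compute per GPU (illustrative units)
    hbm_gbps: float        # HBM bandwidth per GPU
    nvlink_domain: int     # NVLink/NVSwitch domain id
    free_gpus: int

def affinity_score(node, job_domain, weights=(0.5, 0.3, 0.2)):
    """Score a candidate node for a job anchored in `job_domain`.

    Normalises compute and bandwidth against assumed ceilings, and rewards
    staying inside the job's NVLink domain so collectives avoid crossing
    the IB fabric.
    """
    w_compute, w_bw, w_topo = weights
    compute = min(node.tflops / 5000.0, 1.0)   # assumed normalisation ceiling
    bw = min(node.hbm_gbps / 8000.0, 1.0)
    topo = 1.0 if node.nvlink_domain == job_domain else 0.0
    return w_compute * compute + w_bw * bw + w_topo * topo

def place(job_domain, gpus_needed, candidates):
    """Greedy highest-score placement; a real scheduler would also apply
    fairness and quota constraints before scoring."""
    ranked = sorted(candidates, key=lambda n: affinity_score(n, job_domain),
                    reverse=True)
    chosen = []
    for node in ranked:
        if gpus_needed <= 0:
            break
        take = min(node.free_gpus, gpus_needed)
        if take > 0:
            chosen.append((node, take))
            gpus_needed -= take
    return chosen if gpus_needed == 0 else None
```

With two otherwise-identical nodes, the in-domain node ranks first, and a request exceeding total free GPUs yields no placement rather than a partial one.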

### 2.3 Fair Multi-Tenant Queuing

Design a multi-level queueing architecture with Dominant Resource Fairness (DRF) semantics extended for GPU resources (SM fraction, HBM bandwidth, NVLink bandwidth as co-dominant resources).

Key materials:

- Ghodsi et al., "Dominant Resource Fairness: Fair Allocation of Multiple Resource Types" (NSDI 2011) — foundational DRF theory
- Apache YARN CapacityScheduler and FairScheduler documentation — hierarchical queues, preemption policy, label-based node targeting
- Apache Mesos DRF implementation — `DominantShareAllocator`, offer model, role weights
- Kubernetes device plugin framework (k8s.io/device-plugins) — GPU resource advertisement, extended resource scheduling
- Tiresias: Gu et al., "Tiresias: A GPU Cluster Manager for Distributed Deep Learning" (NSDI 2019) — LAS (Least Attained Service) scheduling for DL jobs, 2DAS (two-dimensional attained service)
- Gandiva: Xiao et al., "Gandiva: Introspective Cluster Scheduling for Deep Learning" (OSDI 2018) — time-sliced GPU sharing, job packing, migration
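
The DRF mechanics this domain builds on can be shown as a small progressive-filling allocator in the spirit of Ghodsi et al., with resource vectors as plain dicts so GPU dimensions (SM fraction, HBM bandwidth, NVLink bandwidth) become extra keys; function names and the round-based loop are illustrative:

```python
def dominant_share(usage, capacity):
    """DRF: a tenant's dominant share is the max over resource dimensions
    of (allocated / cluster capacity)."""
    return max(usage[r] / capacity[r] for r in capacity)

def drf_allocate(demands, capacity, rounds=100):
    """Progressive filling: each round, grant one demand-vector unit to the
    feasible tenant with the lowest dominant share; stop when no tenant's
    next grant fits in every dimension."""
    usage = {t: {r: 0.0 for r in capacity} for t in demands}
    for _ in range(rounds):
        feasible = [
            t for t, d in demands.items()
            if all(sum(usage[u][r] for u in usage) + d.get(r, 0.0) <= capacity[r]
                   for r in capacity)
        ]
        if not feasible:
            break
        t = min(feasible, key=lambda t: dominant_share(usage[t], capacity))
        for r, amount in demands[t].items():
            usage[t][r] += amount
    return usage
```

On the worked example from the DRF paper (9 CPUs, 18 GB; tenant A demands <1 CPU, 4 GB> per task, B demands <3 CPU, 1 GB>), this converges to A holding 3 tasks and B holding 2, equalising dominant shares at 2/3.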

### 2.4 Gang Scheduling for Distributed Training

Gang scheduling ensures all processes in a distributed training job (e.g., a 512-GPU allreduce ring) are co-scheduled simultaneously. This is critical for preventing head-of-line blocking and deadlock in collective communication.

Key materials:

- Feitelson and Rudolph, "Gang Scheduling Performance Benefits for Fine-Grain Synchronization" (J. Parallel Distrib. Comput., 1992) — foundational gang scheduling analysis
- Rajachandrasekar et al., "A Closer Look at All-Reduce for Deep Learning" (Workshop at SC'19) — collective communication scheduling sensitivity
- Hwang et al., "AFS: Annotation-Free Automatic Sharding for Large Language Models" (ICLR 2023) — pipeline and tensor parallel placement co-scheduling
- Slurm gang scheduling documentation — `Oversubscribe=FORCE`, `GraceTime`, slurmctld gang plugin
- Jeon et al., "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads" (USENIX ATC 2019) — Microsoft Philly cluster trace, gang failure modes
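
The all-or-nothing property can be sketched as a two-phase reservation with rollback, one of the protocol options DP-SCHED-004 weighs; the data shapes here are illustrative:

```python
def gang_reserve(free, request):
    """All-or-nothing two-phase reservation.

    Phase 1 places holds on every node in the gang; if any hold fails,
    all prior holds roll back and the gang returns to the queue, so a
    partial allocation never becomes visible. Phase 2 is the commit:
    returning the hold list.
    `free` maps node -> free GPU count; `request` is [(node, gpus), ...].
    """
    holds = []
    for node, gpus in request:
        if free.get(node, 0) >= gpus:
            free[node] -= gpus
            holds.append((node, gpus))
        else:
            for n, g in holds:          # rollback phase
                free[n] += g
            return None
    return holds                         # commit
```

A failed gang leaves `free` exactly as it found it, which is the invariant a TLA+ gang-scheduler spec would state as a safety property.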

### 2.5 Preemption Strategies

Survey checkpoint-based, reactive, and time-sliced preemption. Understand the cost of a GPU checkpoint (saving the SM register file, shared memory, and in-flight DMA) and design preemption policies that bound the maximum latency impact.

Key materials:

- Park et al., "Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization" (USENIX ATC 2020) — container-level preemption
- Lin et al., "SHEPHERD: Serving DNNs in the Wild" (NSDI 2023) — latency SLO-aware preemption for inference workloads
- NVIDIA CUDA checkpoint/restore (CRIU for GPU) — experimental support status as of CUDA 12.x
- Wang et al., "Achieving Microsecond-Scale Tail Latency Efficiently with Approximate Optimal Scheduling" (SOSP 2017) — preemption-aware scheduling theory

### 2.6 NUMA-Aware and Topology-Aware Placement

Model the NUMA topology of a Grace Superchip node: 72-core Arm Neoverse V2 CPU, 480 GB LPDDR5X at ~512 GB/s, connected to the H100 GPU via NVLink-C2C at 900 GB/s. Compare this with standard SXM5 nodes where CPU-GPU bandwidth is limited to PCIe Gen5 (~128 GB/s bidirectional).

Key materials:

- Linux NUMA scheduling documentation — `libnuma`, `numactl`, CFS NUMA balancing (`task_numa_migrate`)
- NVIDIA GH200 Grace Hopper Superchip Architecture Whitepaper (2023)
- Lepers et al., "Thread and Memory Placement on NUMA Systems: Asymmetry Matters" (USENIX ATC 2015)
- Blagodurov et al., "A Case for NUMA-Aware Contention Management on Multicore Systems" (USENIX ATC 2011)
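
The bandwidth asymmetry above dominates any decision to spill state to host memory. A back-of-envelope comparison, assuming an idealised streaming copy with no protocol overhead and an illustrative payload size:

```python
def transfer_seconds(gigabytes, link_gbps):
    """Time to move `gigabytes` of host<->device data over a link with
    `link_gbps` GB/s of usable bandwidth (idealised: no protocol overhead,
    no pipelining effects)."""
    return gigabytes / link_gbps

# Spilling a hypothetical 96 GB optimizer state to host memory:
c2c  = transfer_seconds(96, 900)   # GH200 NVLink-C2C, 900 GB/s
pcie = transfer_seconds(96, 64)    # SXM5 host link, ~64 GB/s per direction
```

Roughly 0.11 s over NVLink-C2C versus 1.5 s over PCIe Gen5, a ~14x gap, which is why the placement model should treat Grace nodes as the preferred home for jobs with heavy host-memory traffic.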

### 2.7 GPU Memory Oversubscription

When aggregate GPU memory demand exceeds physical HBM3 capacity, the scheduler must coordinate with `council-mem`'s eviction and tiering policies to decide which kernels to throttle, checkpoint, or migrate.

Key materials:

- Rhu et al., "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design" (MICRO 2016) — layer-by-layer activation offload
- Huang et al., "Efficient Large Scale Language Modeling with Mixtures of Experts" (EMNLP 2021) — memory-efficient model parallelism
- NVIDIA Unified Memory documentation (CUDA 12.x) — page migration engine, oversubscription, prefetch advisory API
- Huang et al., "SwapAdvisor: Push Deep Learning Beyond the GPU Memory Limit via Smart Swapping" (ASPLOS 2020)
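
A scheduler-side policy for reacting to memory pressure might look like the following sketch. The cost inputs are stand-ins for whatever `council-mem`'s contract ends up providing, and the 20% overflow threshold is an arbitrary illustration, not a proposed value:

```python
def pressure_action(demand_gb, hbm_gb, eviction_cost_s, checkpoint_cost_s):
    """Choose a response to GPU memory pressure.

    - fits           -> admit as-is
    - mild overflow  -> rely on UVM-style eviction/tiering if it is cheap
    - heavy overflow -> checkpoint-and-requeue rather than thrash
    """
    overflow = demand_gb - hbm_gb
    if overflow <= 0:
        return "admit"
    if overflow <= 0.2 * hbm_gb and eviction_cost_s < checkpoint_cost_s:
        return "admit_with_eviction"
    return "checkpoint_and_requeue"
```

The point of the sketch is the shape of the interface: the scheduler never decides on raw demand alone, it compares cost estimates supplied by the memory subsystem.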

### 2.8 Energy-Aware Scheduling

Design scheduling policies that honour per-node power caps enforced by Baseboard Management Controllers (BMCs) and exploit DVFS (Dynamic Voltage and Frequency Scaling) headroom reported by `council-telemetry`.

Key materials:

- Patel et al., "Clite: Efficient and QoS Aware Co-location of Multiple Latency-Critical Jobs for Warehouse Scale Computers" (ISCA 2020)
- Lim et al., "Adaptive Power Management for Heterogeneous Multiprocessor SoCs" (ICCAD 2009)
- NVIDIA NVML power management API — `nvmlDeviceSetPowerManagementLimit`, `nvmlDeviceGetCurrentClocksThrottleReasons`
- AMD ROCm SMI library — power cap interface (`rsmi_dev_power_cap_set`)
- RAPL (Running Average Power Limit) documentation — the established x86 CPU power-capping model, studied as a reference design for the analogous Grace CPU power management controls (Grace itself is Arm-based and exposes capping through its own management interfaces)
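
A minimal admission check under a BMC-enforced node power cap, with DVFS modelled crudely as linear power scaling (real GPU power scales superlinearly with frequency and voltage; all names and numbers here are illustrative):

```python
def admit_under_cap(node_cap_w, draw_w, job_tdp_w, clock_scale=0.8):
    """Energy-aware admission under a node power cap.

    If the job's full-TDP draw fits under the cap, admit at full clocks;
    if a DVFS-scaled estimate fits, admit with a frequency-cap directive
    (power modelled, crudely, as linear in `clock_scale`); otherwise
    defer the job until telemetry reports more headroom.
    Returns (decision, clock_scale_applied).
    """
    headroom = node_cap_w - draw_w
    if job_tdp_w <= headroom:
        return ("admit", 1.0)
    if job_tdp_w * clock_scale <= headroom:
        return ("admit", clock_scale)
    return ("defer", 0.0)
```

Whether the frequency-cap directive is binding or advisory is exactly the question DP-SCHED-007 resolves.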

---

## 3. Agent Roles

Total agents: **80**

| Role | Count | Responsibilities |
|------|-------|------------------|
| Lead Architect | 2 | Overall scheduling architecture decisions, cross-subsystem interface design, DR authorship for major decisions |
| Kernel Scheduling Researchers | 8 | CUDA/ROCm scheduler internals, MIG/MPS analysis, SM partitioning survey |
| Placement Researchers | 6 | Heterogeneous accelerator placement, topology modelling, workload profiling |
| Queueing Theory Specialists | 5 | DRF extensions for multi-resource GPU scheduling, fairness proof sketches |
| Gang Scheduling Specialists | 5 | Collective communication scheduling, all-or-nothing allocation protocols |
| Preemption Specialists | 4 | Checkpoint protocol design, preemption cost modelling |
| NUMA/Topology Analysts | 4 | NUMA topology modelling for Grace Superchip and SXM5 nodes |
| Energy Efficiency Researchers | 4 | Power capping, DVFS scheduling integration |
| Formal Specification Authors | 8 | TLA+ specification of scheduler state machine, safety and liveness invariants |
| Architects (sub-component) | 10 | Propose concrete scheduling algorithm designs for each domain |
| Internal Reviewers | 8 | Review research summaries and architecture proposals; cast green/yellow/red votes |
| Integration Liaisons | 6 | Interface with `council-mem`, `council-net`, `council-telemetry`, `council-sec` |
| Decision Record Authors | 5 | Author DRs for each resolved decision point; maintain UCXL provenance chain |
| Adversarial Critics | 5 | Challenge proposed designs; surface failure modes, starvation scenarios, livelock risks |

**Role distribution rationale:** The large researcher cohort (36 agents covering 8 research domains) reflects the breadth of the scheduling literature. The formal specification group (8 agents) is sized to produce parallel TLA+ modules. Internal reviewers and adversarial critics (13 agents combined) ensure no architecture proposal passes without rigorous challenge.

---

## 4. Key Deliverables

All artifacts are published to the DHT and addressable via UCXL.

### 4.1 Research Summaries

```
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gpu-kernel-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/heterogeneous-placement.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/fair-queuing-drf.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gang-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/preemption-strategies.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/numa-topology.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/memory-oversubscription.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/energy-aware-scheduling.md
```

### 4.2 Architecture Proposals

```
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/scheduler-overview.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/mig-mps-partitioning-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/placement-scoring-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/drf-gpu-extension.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/gang-scheduler-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/preemption-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/energy-policy-interface.md
```

### 4.3 Decision Records

```
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-001-partition-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-002-placement-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-003-queuing-policy.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-004-gang-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-005-preemption-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-006-energy-interface.md
```

### 4.4 Formal Specifications

```
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/SchedulerStateMachine.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/GangScheduler.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/PreemptionProtocol.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/FairnessInvariants.tla
```

### 4.5 Interface Contracts (for other councils)

```
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-mem-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-net-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-telemetry-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-sec-contract.md
```

---

## 5. Decision Points

The following are the major architectural questions `council-sched` must resolve. Each decision must produce a Decision Record with alternatives considered, evidence from research, and rationale for the chosen option.

### DP-SCHED-001: Primary Partition Model

**Question:** Should the scheduler use MIG (hardware-enforced partition isolation), MPS (shared context with software-enforced limits), or a hybrid model where MIG is used for multi-tenant isolation and MPS for co-located jobs within a single tenant's partition?

**Factors:** isolation strength, context switch overhead, minimum allocation granularity, Blackwell support parity, operational complexity.

**Dependency:** This decision informs the security isolation model that `council-sec` will specify.

### DP-SCHED-002: Placement Scoring Architecture

**Question:** Should placement be computed by a centralised scoring service (Borg-style) or by distributed, bid-based negotiation (Mesos offer model)? How does topology affinity (NVLink domain proximity) weigh against fairness constraints?

**Factors:** convergence time at 1024-node scale, fault tolerance of the placement service itself, handling of stale topology information, NVLink bandwidth utilisation efficiency.

### DP-SCHED-003: Multi-Resource Fairness Extension

**Question:** DRF was designed for CPU/memory. GPU workloads introduce additional dominant resources: SM fraction, HBM3 bandwidth, NVLink bandwidth, and NVSwitch port occupancy. How many resource dimensions should the fairness model track, and what is the computational complexity of DRF at 80+ resource types?

**Factors:** implementation tractability, approximation error bounds, interaction with per-tenant quota enforcement.

### DP-SCHED-004: Gang Scheduling Protocol

**Question:** Should gang scheduling use a two-phase reservation (reserve then commit) protocol, speculative allocation with rollback, or a backfill-with-hold strategy? What is the maximum tolerable scheduling delay for a 1024-GPU gang job?

**Factors:** cluster utilisation impact, deadlock risk in two-phase protocols, interaction with preemption.

### DP-SCHED-005: Preemption Granularity

**Question:** Should preemption operate at kernel granularity (CUDA stream checkpointing), at job granularity (full process checkpoint via CRIU-for-GPU), or as a hybrid that allows kernel preemption for short jobs and process-level preemption for long jobs?

**Factors:** checkpoint latency (kernel-level: microseconds; process-level: seconds to tens of seconds for large model weights), HBM3 save/restore bandwidth cost, preemption frequency requirements.
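
The process-level latency band quoted above follows largely from host-link bandwidth. A back-of-envelope model, with idealised streaming copies and illustrative sizes:

```python
def process_checkpoint_seconds(hbm_resident_gb, drain_s, host_link_gbps):
    """Process-level preemption cost: drain in-flight kernels and DMA,
    then copy HBM-resident state to host memory (idealised streaming copy,
    no protocol overhead). `drain_s` is an assumed quiesce time."""
    return drain_s + hbm_resident_gb / host_link_gbps

# Hypothetical job with 60 GB of HBM-resident state, 50 ms drain:
sxm5  = process_checkpoint_seconds(60, 0.05, 64)    # PCIe Gen5 host link
gh200 = process_checkpoint_seconds(60, 0.05, 900)   # NVLink-C2C
```

Roughly a second on an SXM5 node versus just over a tenth of a second on GH200, which suggests preemption frequency limits should be set per node class rather than cluster-wide.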

### DP-SCHED-006: Grace Superchip Scheduling Model

**Question:** For GH200 nodes with NVLink-C2C, the CPU and GPU share a unified virtual address space. Should the scheduler treat the CPU and GPU on a Grace node as a single scheduling unit, or as two separate but affinity-linked resources? How does this interact with the memory model specified by `council-mem`?

**Factors:** utilisation efficiency, memory model consistency requirements, API ergonomics for user workloads, migration cost if a job is split across a C2C boundary.

### DP-SCHED-007: Energy Policy Interface

**Question:** Should energy-aware scheduling be a first-class scheduling objective (a weight in the placement score function), a hard constraint (power cap as a resource limit), or an advisory mechanism (the scheduler receives power headroom hints and applies them at its discretion)?

**Factors:** SLO compatibility, predictability of power-capped execution, interaction with thermal management, coordination protocol with `council-telemetry`.

---

## 6. Dependencies

### 6.1 What council-sched Needs from Other Councils

| Dependency | Source Council | Artifact | Purpose |
|------------|----------------|----------|---------|
| Memory pressure signals and eviction cost model | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-sched-contract.md` | GPU memory oversubscription scheduling requires eviction cost estimates to avoid scheduling kernels that will immediately thrash |
| HBM3/DDR5 bandwidth allocation model | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/architecture/bandwidth-allocation.md` | Bandwidth is a co-dominant scheduling resource; the model must be consistent |
| Network topology map (NVLink domains, IB fabric) | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/architecture/topology-model.md` | Topology-aware placement requires accurate NVLink domain membership and IB bisection bandwidth per rack |
| Network bandwidth reservation API | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-sched-contract.md` | Gang scheduling for allreduce jobs requires co-reserving network bandwidth |
| Resource metering API contract | `council-telemetry` | `ucxl://council-telemetry:architect@DistOS:telemetry/*^/interfaces/telemetry-to-sched-contract.md` | Scheduler must feed placement and execution events to telemetry; power headroom signals flow back |
| Security isolation constraints | `council-sec` | `ucxl://council-sec:architect@DistOS:security/*^/interfaces/sec-to-sched-contract.md` | Which tenants may share a GPU partition, MPS context constraints, minimum isolation requirements per tenant class |

### 6.2 What Other Councils Need from council-sched

| Consumer Council | Artifact Required | Purpose |
|------------------|-------------------|---------|
| `council-mem` | Placement decisions and GPU assignment map | Memory subsystem needs to know which GPU a job is placed on to configure HBM3 address space and NVLink fabric attachment |
| `council-mem` | Preemption events and checkpoint triggers | Memory tiering must snapshot GPU memory on preemption |
| `council-net` | Job placement map with NVLink domain assignments | Network subsystem configures NCCL topology files and RDMA QP mappings based on scheduler placement |
| `council-telemetry` | Scheduling events stream (enqueue, dequeue, preempt, complete) | Metering and cost attribution require per-job lifecycle events |
| `council-sec` | Partition assignment per tenant | Security subsystem enforces isolation based on scheduler-assigned partition IDs |
| `council-verify` | TLA+ scheduler state machine spec | Formal verification council model-checks scheduler invariants |

---

## 7. WHOOSH Configuration

```yaml
# WHOOSH council formation configuration for council-sched
council_id: council-sched
project: DistOS
subsystem: scheduling
gitea_label: chorus-entrypoint
gitea_repo: distos/scheduling

formation:
  target_agents: 80
  min_agents: 60
  wave:
    max_per_wave: 12
    min_per_wave: 6
    period_sec: 30
  placement:
    max_replicas_per_node: 2
    join_stagger_ms: 2000
    bootstrap_peers_min: 5

roles:
  - role: lead-architect
    count: 2
    model: claude-opus-4-6
    priority: high
  - role: researcher
    count: 36
    model: qwen2.5-coder:32b
    priority: normal
    subgroups:
      - tag: kernel-scheduling
        count: 8
      - tag: placement
        count: 6
      - tag: queuing-theory
        count: 5
      - tag: gang-scheduling
        count: 5
      - tag: preemption
        count: 4
      - tag: numa-topology
        count: 4
      - tag: energy
        count: 4
  - role: architect
    count: 10
    model: claude-opus-4-6
    priority: normal
  - role: verifier
    count: 8
    model: deepseek-coder-v2
    priority: normal
  - role: reviewer
    count: 8
    model: claude-opus-4-6
    priority: normal
  - role: integration-liaison
    count: 6
    model: qwen2.5-coder:32b
    priority: normal
  - role: decision-record-author
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: adversarial-critic
    count: 5
    model: claude-opus-4-6
    priority: normal

subchannels:
  - name: sched-research
    description: "Research discussion and literature synthesis"
    participants: [researcher, lead-architect]
    pubsub: true
  - name: sched-architecture
    description: "Architecture proposal discussion and voting"
    participants: [architect, lead-architect, reviewer, adversarial-critic]
    pubsub: true
  - name: sched-formal-spec
    description: "TLA+ specification authoring and review"
    participants: [verifier, lead-architect, reviewer]
    pubsub: false
  - name: sched-integration
    description: "Cross-council interface negotiation"
    participants: [integration-liaison, lead-architect]
    pubsub: false
  - name: sched-decisions
    description: "Decision record authoring and consensus"
    participants: [decision-record-author, lead-architect, reviewer]
    pubsub: true

quorum:
  # Architecture decisions require supermajority
  architecture_changes:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    beat_minutes: 20
    timeout_beats: 6
  # Research summaries require simple majority
  research_summaries:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 4
  # Formal specifications require supermajority with verifier sign-off
  formal_specs:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    require_verifier: true
    beat_minutes: 25
    timeout_beats: 8
  # Cross-council interface contracts require unanimous lead-architect approval
  interface_contracts:
    policy: unanimous
    roles: [lead-architect, integration-liaison]
    beat_minutes: 30
    timeout_beats: 4

gates:
  kaching:
    p95_latency_ms: 250
    max_error_rate: 0.01
  backbeat:
    max_stream_lag: 200
  bootstrap:
    min_healthy_peers: 5
  join:
    min_success_rate: 0.80

review:
  beat_minutes: 20
  quorum:
    total_min: 3
    require_domain_role: true
    require_quality_role: true
  timeout_beats: 6
  no_self_approval: true
```

---

## 8. Success Criteria

1. **Research completeness:** All 8 research domain summaries are published to the DHT with at least 5 primary references each, reviewed and approved by at least 3 council agents.

2. **Architecture coverage:** Architecture proposals exist for all 7 major scheduling components (kernel dispatch, placement, queuing, gang, preemption, NUMA, energy). Each proposal addresses the 1024-node scale constraint explicitly.

3. **Decision records resolved:** All 7 decision points (DP-SCHED-001 through DP-SCHED-007) have a corresponding Decision Record with at least 3 alternatives considered, evidence citations, and a chosen option ratified by council supermajority.

4. **Formal specifications:** TLA+ specifications exist for the scheduler state machine, the gang scheduling protocol, and the preemption protocol. At least 2 of these must have model-checked safety invariants (no starvation for highest-priority jobs, gang deadlock freedom) verified by `council-verify`.

5. **Interface contracts ratified:** All 4 interface contracts (to `council-mem`, `council-net`, `council-telemetry`, `council-sec`) are co-signed by integration liaisons from both councils.

6. **UCXL navigability:** A human unfamiliar with the project should be able to navigate from any Decision Record to the research summary that motivated it using only UCXL temporal navigation, within 5 hops.

7. **Adversarial review pass:** Each major architecture proposal has at least one adversarial critique documented and a resolution recorded. No proposal advances to formal specification with an unresolved red-vote critique.

---

## 9. Timeline

### Phase 1: Research and Survey (Days 1-3)

**Day 1:**
- WHOOSH forms the council; all 80 agents join via wave deployment
- Researchers self-assign to domain subgroups via the `sched-research` subchannel
- Literature survey begins: the GPU kernel scheduling, placement, and fair queuing domains are prioritised first (these are on the critical path for Decision Points DP-SCHED-001 through DP-SCHED-003)
- Integration liaisons make initial contact with `council-mem` and `council-net` to understand their timelines

**Day 2:**
- Remaining research domains surveyed: gang scheduling, preemption, NUMA, energy
- Research summaries drafted in parallel across subgroups
- First internal review cycle: reviewers read summaries and post green/yellow/red votes with rationale
- Lead architects synthesise research findings into a preliminary scheduling design-space map

**Day 3:**
- Research summaries revised based on review feedback; final versions published to the DHT
- Adversarial critics challenge assumptions in each summary
- Research phase gate: all 8 summaries must achieve simple-majority approval before Phase 2 begins
- Preliminary interface contract outlines shared with dependency councils

### Phase 2: Architecture and Trade-offs (Days 3-6)

**Days 3-4:**
- Architects propose concrete options for DP-SCHED-001 (partition model) and DP-SCHED-002 (placement scoring) — these are the highest-dependency decisions
- Adversarial critics engage immediately; all alternatives documented
- DP-SCHED-001 decision record drafted; council votes; DR published

**Days 4-5:**
- Queuing model (DP-SCHED-003), gang scheduling (DP-SCHED-004), and preemption (DP-SCHED-005) design proposals authored concurrently
- Inter-council synthesis session with `council-mem` to align on oversubscription and eviction signal interfaces
- Inter-council synthesis session with `council-net` to align on the topology model input format

**Days 5-6:**
- Grace Superchip model (DP-SCHED-006) and energy interface (DP-SCHED-007) decisions resolved
- All 7 Decision Records drafted; supermajority vote on each
- Architecture overview document assembled from approved DRs
- Architecture phase gate: all DPs resolved before Phase 3 begins

### Phase 3: Formal Specification (Days 6-10)

**Days 6-7:**
- Verifiers begin TLA+ specification of the scheduler state machine based on the approved architecture
- Architects continue refining component-level designs to resolve any ambiguities surfaced during spec authoring
- `council-verify` engaged: spec drafts shared for early model-checking feedback

**Days 7-8:**
- Gang scheduler TLA+ module authored in parallel with the preemption protocol spec
- Fairness invariants formally stated: no starvation under bounded load, gang deadlock freedom, DRF monotonicity

**Days 8-10:**
- Model checking runs submitted to `council-verify`
- Counterexample analysis: any liveness or safety violations trigger architecture revision and updated DRs
- Formal spec versions pinned in the DHT; UCXL addresses published to dependent councils

### Phase 4: Integration and Review (Days 10-12)

**Days 10-11:**
- Interface contracts with `council-mem`, `council-net`, `council-telemetry`, and `council-sec` finalised and submitted for co-signature
- Cross-council integration session: scheduling placement decisions validated against the network topology model
- `council-synth` engaged for any unresolved conflicts with other councils' specifications

**Days 11-12:**
- Final council review of the complete scheduling specification
- Adversarial critics run end-to-end failure scenario analysis
- Any remaining yellow votes addressed with documented mitigations
- Integration review gate: all interface contracts co-signed

### Phase 5: Documentation and Narrative (Days 12-14)

**Days 12-13:**
- Decision record authors produce narrative summaries of the scheduling architecture journey
- `council-docs` receives the complete scheduling specification package for standardised formatting
- UCXL navigability audit: spot-check 10 random decision paths for completeness

**Day 14:**
- Final specification published
- `council-arch` decision archaeology agents generate a human-readable narrative of the scheduling design evolution
- Council formally dissolved; agents released back to the WHOOSH pool