DistOS/councils/01-process-scheduling.md
anthonyrawlins 7f56ca4d46 Initial DistOS project constitution and council design briefs
12 council design briefs for distributed OS specification project targeting
1024-node Hopper/Grace/Blackwell GPU cluster with Weka parallel filesystem.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 14:15:39 +11:00


# Council Design Brief: Process Scheduling
**Council ID:** `council-sched`
**Mission:** Design the heterogeneous process scheduling subsystem for DistOS, covering GPU kernel dispatch, workload placement across Hopper/Grace/Blackwell accelerators, fair multi-tenant queuing, gang scheduling for distributed training, and energy-aware execution across a 1024-node cluster.
**UCXL Base Address:** `ucxl://council-sched:*@DistOS:scheduling/*`
**Agent Count:** 80
**Status:** Constitution Phase — awaiting WHOOSH formation trigger
**Created:** 2026-02-24
---
## 1. Scope and Responsibilities
`council-sched` owns the complete specification of the DistOS process and kernel scheduling subsystem. Scope boundaries are defined as follows.
**In scope:**
- GPU kernel scheduling: dispatch queue management, kernel concurrency, SM partitioning on Hopper (MIG) and Blackwell, and MPS (Multi-Process Service) lifetime management
- CPU-GPU co-scheduling on Grace Superchip (NVLink-C2C coherent interconnect), including unified virtual address space scheduling implications
- Workload placement policy across heterogeneous accelerator types (H100, GH200, B200), including topology-aware affinity scoring
- Fair multi-tenant queuing: priority classes, weighted fair queuing, and quota enforcement
- Gang scheduling for distributed training workloads: all-or-nothing allocation, partial-allocation strategies, and backfill
- Preemption strategies: checkpoint-based preemption, time-sliced preemption, and priority inversion avoidance
- NUMA-aware placement across CPU sockets and GPU memory domains
- GPU memory oversubscription scheduling: eviction policy, swap-to-host, and coordinated demand management with `council-mem`
- Energy-aware scheduling: frequency/voltage scaling directives, power capping, and thermal headroom management
- Formal specification of the scheduling API surface exposed to user workloads and to other DistOS subsystems
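
The API surface named in the final bullet could look roughly like the following minimal sketch; every type, field, and limit below is a hypothetical illustration, not the specified contract:

```python
# Hypothetical shape of a user-facing submission request. All names, fields,
# and the cluster-size bound are assumptions for illustration only.
from dataclasses import dataclass
from typing import Optional

@dataclass
class GangRequest:
    tenant: str
    gpus: int
    gpu_type: str                     # e.g. "h100" | "gh200" | "b200"
    priority_class: str = "normal"
    preemptible: bool = True
    mig_profile: Optional[str] = None  # e.g. "3g.40gb"

def validate(req: GangRequest, cluster_gpus: int = 1024 * 8) -> bool:
    """Reject gangs larger than the whole cluster or with nonpositive size."""
    return 0 < req.gpus <= cluster_gpus
```

The real contract is produced during Phase 2 and ratified through DR authorship; this sketch only fixes the intuition that a gang request names a tenant, a size, an accelerator type, and a priority class.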
**Out of scope (delegated):**
- Physical memory allocation and coherence protocol (delegated to `council-mem`)
- Network topology discovery and network-aware placement data (delegated to `council-net`; consumed as inputs)
- Metering, cost attribution, and SLO enforcement (delegated to `council-telemetry`; consumed as outputs)
- Security isolation between tenants at the hardware level (delegated to `council-sec`; policies consumed as constraints)
---
## 2. Research Domains
### 2.1 GPU Kernel Scheduling and SM Partitioning
Survey the CUDA Concurrent Kernel Execution model, the NVIDIA Multi-Instance GPU (MIG) architecture on Hopper (H100), and the NVIDIA Multi-Process Service (MPS). Understand how MIG partitions a GPU into isolated GPC slices with dedicated HBM3 memory, and how MPS enables concurrent kernel execution within a single GPU context without full MIG isolation overhead.
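
As context for the partition survey, a toy selector over the published H100 80 GB MIG profiles illustrates the minimum-fit question the scheduler faces; the profile table follows the MIG User Guide, but the selection policy itself is a placeholder, not a design decision:

```python
# Illustrative: pick the smallest H100 MIG profile covering a request.
# (gpc_slices, memory_gb) per profile, per the NVIDIA MIG User Guide.
MIG_PROFILES = {
    "1g.10gb": (1, 10),
    "1g.20gb": (1, 20),
    "2g.20gb": (2, 20),
    "3g.40gb": (3, 40),
    "4g.40gb": (4, 40),
    "7g.80gb": (7, 80),
}

def smallest_fitting_profile(gpc_needed: int, mem_gb_needed: float):
    """Return the profile with the fewest GPC slices (then least memory)
    that covers both the compute and memory request, or None if nothing fits."""
    candidates = [
        (slices, mem, name)
        for name, (slices, mem) in MIG_PROFILES.items()
        if slices >= gpc_needed and mem >= mem_gb_needed
    ]
    return min(candidates)[2] if candidates else None
```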
Key materials:
- NVIDIA H100 Architecture Technical Overview (2022) — SM partitioning, GPC layout, NVLink 4.0 bandwidth
- NVIDIA MIG User Guide (CUDA 12.x) — partition profiles (1g.10gb through 7g.80gb), instance isolation, compute and memory capacity tables
- NVIDIA MPS documentation — shared context model, client limit (48 for Hopper), error containment limitations
- AMD ROCm Hardware Abstraction Layer (HAL) source — `amdkfd` KFD driver, compute queue management, HWS (Hardware Scheduler) in GFX12
- Jain et al., "Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs" (RTAS 2019) — software-defined partitioning as a baseline comparison
- Xiao et al., "AntMan: Dynamic Scaling on GPU Clusters for Deep Learning" (OSDI 2020) — GPU memory and compute sharing for co-located workloads
### 2.2 Heterogeneous Workload Placement
Develop a placement model that accounts for the distinct performance characteristics of H100 (PCIe/SXM5), GH200 Grace Superchip (NVLink-C2C), and B200 Blackwell (NVLink 5.0 SXM). Each accelerator type has different compute density, memory bandwidth, and interconnect topology.
Key materials:
- Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (ISCA 2017) — heterogeneous accelerator placement rationale
- Google Borg paper: Verma et al., "Large-scale cluster management at Google with Borg" (EuroSys 2015) — machine heterogeneity handling, alloc sets, resource estimation
- Qiao et al., "Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning" (OSDI 2021) — goodput-aware placement for training workloads
- Weng et al., "MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters" (NSDI 2022) — real-world heterogeneous GPU cluster scheduling from Alibaba
### 2.3 Fair Multi-Tenant Queuing
Design a multi-level queueing architecture with Dominant Resource Fairness (DRF) semantics extended for GPU resources (SM fraction, HBM bandwidth, NVLink bandwidth as co-dominant resources).
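
The core DRF mechanics extended with GPU dimensions can be sketched in a few lines; the capacities below are illustrative per-node aggregates (8 GPUs), not specified values:

```python
# Minimal DRF sketch with GPU resource dimensions (SM count, HBM bandwidth,
# NVLink bandwidth). Capacities are illustrative: 8 GPUs at 3350 GB/s HBM3
# and 900 GB/s NVLink each; nothing here is a DistOS-specified number.
CAPACITY = {"sm": 1024.0, "hbm_gbps": 26800.0, "nvlink_gbps": 7200.0}

def dominant_share(usage: dict) -> float:
    """A tenant's dominant share is its largest usage/capacity ratio
    across the resource dimensions it consumes."""
    return max(usage[r] / CAPACITY[r] for r in usage)

def next_tenant(usages: dict) -> str:
    """DRF serves the tenant with the smallest dominant share next."""
    return min(usages, key=lambda t: dominant_share(usages[t]))
```

The open design question (see DP-SCHED-003) is how this scales when the dimension count grows well beyond the three shown here.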
Key materials:
- Ghodsi et al., "Dominant Resource Fairness: Fair Allocation of Multiple Resource Types" (NSDI 2011) — foundational DRF theory
- Apache YARN CapacityScheduler and FairScheduler documentation — hierarchical queues, preemption policy, label-based node targeting
- Apache Mesos DRF implementation — `DominantShareAllocator`, offer model, role weights
- Kubernetes device plugin framework (k8s.io/device-plugins) — GPU resource advertisement, extended resource scheduling
- Tiresias: Gu et al., "Tiresias: A GPU Cluster Manager for Distributed Deep Learning" (NSDI 2019) — LAS (Least Attained Service) scheduling for DL jobs, 2DAS (2-dimensional attained service)
- Gandiva: Xiao et al., "Gandiva: Introspective Cluster Scheduling for Deep Learning" (OSDI 2018) — time-sliced GPU sharing, job packing, migration
### 2.4 Gang Scheduling for Distributed Training
Gang scheduling ensures all processes in a distributed training job (e.g., a 512-GPU allreduce ring) are co-scheduled simultaneously. This is critical for preventing head-of-line blocking and deadlock in collective communication.
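
One candidate shape for all-or-nothing allocation (weighed later in DP-SCHED-004) is a reserve-then-commit loop with rollback; the node and demand types here are hypothetical:

```python
# Sketch of all-or-nothing gang allocation as two-phase reserve/commit.
# `free_gpus` maps node id -> free GPU count; `demand` maps node id -> GPUs
# wanted there. Both shapes are assumptions for illustration.
def gang_allocate(free_gpus: dict, demand: dict) -> bool:
    """Reserve GPUs on every node in `demand`; commit only if all succeed,
    otherwise roll back so no partial allocation lingers."""
    reserved = []
    for node, n in demand.items():
        if free_gpus.get(node, 0) >= n:
            free_gpus[node] -= n          # phase 1: tentative reservation
            reserved.append(node)
        else:
            for r in reserved:            # rollback: release reservations
                free_gpus[r] += demand[r]
            return False
    return True                           # phase 2: commit (keep reservations)
```

Holding reservations open during phase 1 is exactly what creates the deadlock and utilisation questions the decision point must resolve.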
Key materials:
- Feitelson and Rudolph, "Gang Scheduling Performance Benefits for Fine-Grain Synchronization" (J. Parallel Distrib. Comput., 1992) — foundational gang scheduling analysis
- Rajachandrasekar et al., "A Closer Look at All-Reduce for Deep Learning" (Workshop at SC'19) — collective communication scheduling sensitivity
- Hwang et al., "AFS: Annotation-Free Automatic Sharding for Large Language Models" (ICLR 2023) — pipeline and tensor parallel placement co-scheduling
- Slurm gang scheduling documentation — `Oversubscribe=FORCE`, `GraceTime`, slurmctld gang plugin
- Jeon et al., "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads" (USENIX ATC 2019) — Microsoft Philly cluster trace, gang failure modes
### 2.5 Preemption Strategies
Survey checkpoint-based, reactive, and time-sliced preemption. Understand the cost of GPU checkpoint (saving SM register file, shared memory, and in-flight DMA) and design preemption policies that bound maximum latency impact.
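
A first-order cost model makes the trade-off concrete: checkpoint latency is dominated by draining GPU state over the host link. The bandwidth and SLO figures below are assumptions for illustration, not measured values:

```python
# Back-of-envelope preemption cost model. The policy split between
# process-level checkpointing and kernel-boundary time-slicing is
# one illustrative option, not the council's chosen design.
def checkpoint_seconds(state_gb: float, link_gbps: float) -> float:
    """Seconds to drain `state_gb` of GPU state at `link_gbps` GB/s."""
    return state_gb / link_gbps

def pick_preemption(state_gb: float, link_gbps: float, slo_s: float) -> str:
    """Prefer a full process checkpoint only when it fits the latency
    budget; otherwise fall back to kernel-boundary time-slicing."""
    if checkpoint_seconds(state_gb, link_gbps) <= slo_s:
        return "process-checkpoint"
    return "kernel-timeslice"
```

For example, 80 GB of HBM drained over a ~64 GB/s host link takes on the order of a second, which already rules out process-level preemption for sub-second latency budgets.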
Key materials:
- Park et al., "Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization" (USENIX ATC 2020) — container-level preemption
- Lin et al., "SHEPHERD: Serving DNNs in the Wild" (NSDI 2023) — latency SLO-aware preemption for inference workloads
- NVIDIA CUDA checkpoint/restore (CRIU for GPU) — experimental support status as of CUDA 12.x
- Wang et al., "Achieving Microsecond-Scale Tail Latency Efficiently with Approximate Optimal Scheduling" (SOSP 2017) — preemption-aware scheduling theory
### 2.6 NUMA-Aware and Topology-Aware Placement
Model the NUMA topology of a Grace Superchip node: 72-core Arm Neoverse V2 CPU, 480 GB LPDDR5X at ~512 GB/s, connected to H100 GPU via NVLink-C2C at 900 GB/s. Compare this with standard SXM5 nodes where CPU-GPU bandwidth is limited to PCIe Gen5 (~128 GB/s bidirectional).
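
The bandwidth asymmetry above suggests a simple admission test for host-traffic-heavy jobs; the node labels, function names, and budget semantics are illustrative only:

```python
# Toy placement test using the CPU-GPU link bandwidths quoted above
# (NVLink-C2C ~900 GB/s vs PCIe Gen5 ~128 GB/s). Everything else here
# is an assumption for illustration.
NODE_CPU_GPU_GBPS = {"gh200-c2c": 900.0, "sxm5-pcie5": 128.0}

def host_transfer_seconds(gb: float, node: str) -> float:
    """Time to move `gb` of data across the CPU-GPU link of `node`."""
    return gb / NODE_CPU_GPU_GBPS[node]

def prefers_grace(gb_per_step: float, budget_s: float) -> bool:
    """A job should be pinned to a Grace node when only the C2C link
    meets its per-step host-transfer budget."""
    return (host_transfer_seconds(gb_per_step, "sxm5-pcie5") > budget_s
            and host_transfer_seconds(gb_per_step, "gh200-c2c") <= budget_s)
```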
Key materials:
- Linux NUMA scheduling documentation — `libnuma`, `numactl`, CFS NUMA balancing (`task_numa_migrate`)
- NVIDIA GH200 Grace Hopper Superchip Architecture Whitepaper (2023)
- Lepers et al., "Thread and Memory Placement on NUMA Systems: Asymmetry Matters" (USENIX ATC 2015)
- Blagodurov et al., "A Case for NUMA-Aware Contention Management on Multicore Systems" (USENIX ATC 2011)
### 2.7 GPU Memory Oversubscription
When aggregate GPU memory demand exceeds physical HBM3 capacity, the scheduler must coordinate with `council-mem`'s eviction and tiering policies to decide which kernels to throttle, checkpoint, or migrate.
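
A minimal sketch of the resulting admission test, assuming `council-mem` exposes a per-GB eviction cost estimate (the signal name, units, and threshold are all assumptions pending the interface contract):

```python
# Illustrative admission control under HBM oversubscription: admit a job
# only if freeing enough HBM for it costs less than a thrash threshold.
def admit(job_hbm_gb: float, free_hbm_gb: float,
          eviction_cost_s_per_gb: float, max_evict_s: float) -> bool:
    """`eviction_cost_s_per_gb` stands in for the cost-model signal the
    mem-to-sched contract is expected to carry."""
    deficit = max(0.0, job_hbm_gb - free_hbm_gb)
    return deficit * eviction_cost_s_per_gb <= max_evict_s
```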
Key materials:
- Rhu et al., "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design" (MICRO 2016) — layer-by-layer activation offload
- Huang et al., "Efficient Large Scale Language Modeling with Mixtures of Experts" (EMNLP 2021) — memory-efficient model parallelism
- NVIDIA Unified Memory documentation (CUDA 12.x) — page migration engine, oversubscription, prefetch advisory API
- Kumar et al., "SwapAdvisor: Push Deep Learning Beyond the GPU Memory Limit via Smart Swapping" (ASPLOS 2020)
### 2.8 Energy-Aware Scheduling
Design scheduling policies that honour per-node power caps enforced by Baseboard Management Controllers (BMCs) and exploit DVFS (Dynamic Voltage and Frequency Scaling) headroom reported by `council-telemetry`.
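
Treating the power cap as a hard admission constraint (one of the options later weighed in DP-SCHED-007) might look like the following sketch; all wattages and the headroom mechanism are invented for illustration:

```python
# Illustrative power-capped admission: keep the node under its BMC cap by
# summing per-job power estimates, reserving any thermal headroom that the
# telemetry plane (council-telemetry, assumed signal) asks us to hold back.
def fits_power_budget(node_cap_w: float, running_w: list,
                      job_w: float, thermal_headroom_w: float = 0.0) -> bool:
    """True if the job's estimated draw fits under the node cap minus the
    requested thermal-derating headroom."""
    return sum(running_w) + job_w <= node_cap_w - thermal_headroom_w
```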
Key materials:
- Patel et al., "Clite: Efficient and QoS Aware Co-location of Multiple Latency-Critical Jobs for Warehouse Scale Computers" (ISCA 2020)
- Lim et al., "Adaptive Power Management for Heterogeneous Multiprocessor SoCs" (ICCAD 2009)
- NVIDIA NVML power management API — `nvmlDeviceSetPowerManagementLimit`, `nvmlDeviceGetCurrentClocksThrottleReasons`
- AMD ROCm SMI library — power cap interface (`rsmi_dev_power_cap_set`)
- RAPL (Running Average Power Limit) documentation — x86 power-capping reference model; Grace is Arm-based, so equivalent per-socket capping is exposed through its management interfaces rather than RAPL itself
---
## 3. Agent Roles
Total agents: **80**
| Role | Count | Responsibilities |
|------|-------|-----------------|
| Lead Architect | 2 | Overall scheduling architecture decisions, cross-subsystem interface design, DR authorship for major decisions |
| Kernel Scheduling Researchers | 8 | CUDA/ROCm scheduler internals, MIG/MPS analysis, SM partitioning survey |
| Placement Researchers | 6 | Heterogeneous accelerator placement, topology modelling, workload profiling |
| Queueing Theory Specialists | 5 | DRF extensions for multi-resource GPU scheduling, fairness proof sketches |
| Gang Scheduling Specialists | 5 | Collective communication scheduling, all-or-nothing allocation protocols |
| Preemption Specialists | 4 | Checkpoint protocol design, preemption cost modelling |
| NUMA/Topology Analysts | 4 | NUMA topology modelling for Grace Superchip and SXM5 nodes |
| Energy Efficiency Researchers | 4 | Power capping, DVFS scheduling integration |
| Formal Specification Authors | 8 | TLA+ specification of scheduler state machine, safety and liveness invariants |
| Architects (sub-component) | 10 | Propose concrete scheduling algorithm designs for each domain |
| Internal Reviewers | 8 | Review research summaries and architecture proposals; cast green/yellow/red votes |
| Integration Liaisons | 6 | Interface with `council-mem`, `council-net`, `council-telemetry`, `council-sec` |
| Decision Record Authors | 5 | Author DRs for each resolved decision point; maintain UCXL provenance chain |
| Adversarial Critics | 5 | Challenge proposed designs; surface failure modes, starvation scenarios, livelock risks |
**Role distribution rationale:** The large researcher cohort (36 agents covering 8 research domains) reflects the breadth of scheduling literature. The formal specification group (8 agents) is sized to produce parallel TLA+ modules. Internal reviewers and adversarial critics (13 agents combined) ensure no architecture proposal passes without rigorous challenge.
---
## 4. Key Deliverables
All artifacts are published to the DHT and addressable via UCXL.
### 4.1 Research Summaries
```
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gpu-kernel-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/heterogeneous-placement.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/fair-queuing-drf.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gang-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/preemption-strategies.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/numa-topology.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/memory-oversubscription.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/energy-aware-scheduling.md
```
### 4.2 Architecture Proposals
```
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/scheduler-overview.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/mig-mps-partitioning-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/placement-scoring-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/drf-gpu-extension.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/gang-scheduler-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/preemption-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/energy-policy-interface.md
```
### 4.3 Decision Records
```
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-001-partition-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-002-placement-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-003-queuing-policy.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-004-gang-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-005-preemption-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-006-energy-interface.md
```
### 4.4 Formal Specifications
```
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/SchedulerStateMachine.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/GangScheduler.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/PreemptionProtocol.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/FairnessInvariants.tla
```
### 4.5 Interface Contracts (for other councils)
```
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-mem-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-net-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-telemetry-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-sec-contract.md
```
---
## 5. Decision Points
The following are the major architectural questions `council-sched` must resolve. Each decision must produce a Decision Record with alternatives considered, evidence from research, and rationale for the chosen option.
### DP-SCHED-001: Primary Partition Model
**Question:** Should the scheduler use MIG (hardware-enforced partition isolation), MPS (shared context with software-enforced limits), or a hybrid model where MIG is used for multi-tenant isolation and MPS for co-located jobs within a single tenant's partition?
**Factors:** isolation strength, context switch overhead, minimum allocation granularity, Blackwell support parity, operational complexity.
**Dependency:** Decision informs the security isolation model that `council-sec` will specify.
### DP-SCHED-002: Placement Scoring Architecture
**Question:** Should placement be computed by a centralised scoring service (Borg-style) or a distributed, bid-based negotiation (Mesos offer model)? How does topology affinity (NVLink domain proximity) weight against fairness constraints?
**Factors:** convergence time at 1024-node scale, fault tolerance of the placement service itself, stale topology information handling, NVLink bandwidth utilisation efficiency.
### DP-SCHED-003: Multi-Resource Fairness Extension
**Question:** DRF was designed for CPU/memory. GPU workloads introduce additional dominant resources: SM fraction, HBM3 bandwidth, NVLink bandwidth, and NVSwitch port occupancy. How many resource dimensions does the fairness model track, and what is the computational complexity of DRF at 80+ resource types?
**Factors:** implementation tractability, approximation error bounds, interaction with per-tenant quota enforcement.
### DP-SCHED-004: Gang Scheduling Protocol
**Question:** Should gang scheduling use a two-phase reservation (reserve then commit) protocol, speculative allocation with rollback, or a backfill-with-hold strategy? What is the maximum tolerable scheduling delay for a 1024-GPU gang job?
**Factors:** cluster utilisation impact, deadlock risk in two-phase protocols, interaction with preemption.
### DP-SCHED-005: Preemption Granularity
**Question:** Should preemption operate at kernel-granularity (CUDA stream checkpointing), job-granularity (full process checkpoint via CRIU-for-GPU), or a hybrid that allows kernel preemption for short jobs and process-level preemption for long jobs?
**Factors:** checkpoint latency (kernel-level: microseconds; process-level: seconds to tens of seconds for large model weights), HBM3 save/restore bandwidth cost, preemption frequency requirements.
### DP-SCHED-006: Grace Superchip Scheduling Model
**Question:** For GH200 nodes with NVLink-C2C, the CPU and GPU share a unified virtual address space. Should the scheduler treat CPU and GPU on a Grace node as a single scheduling unit, or as two separate but affinity-linked resources? How does this interact with the memory model specified by `council-mem`?
**Factors:** utilisation efficiency, memory model consistency requirements, API ergonomics for user workloads, migration cost if job is split across a C2C boundary.
### DP-SCHED-007: Energy Policy Interface
**Question:** Should energy-aware scheduling be a first-class scheduling objective (a weight in the placement score function), a hard constraint (power cap as a resource limit), or an advisory mechanism (scheduler receives power headroom hints and applies them at its discretion)?
**Factors:** SLO compatibility, predictability of power-capped execution, interaction with thermal management, coordination protocol with `council-telemetry`.
---
## 6. Dependencies
### 6.1 What council-sched Needs from Other Councils
| Dependency | Source Council | Artifact | Purpose |
|------------|---------------|---------|---------|
| Memory pressure signals and eviction cost model | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-sched-contract.md` | GPU memory oversubscription scheduling requires eviction cost estimates to avoid scheduling kernels that will immediately thrash |
| HBM3/DDR5 bandwidth allocation model | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/architecture/bandwidth-allocation.md` | Bandwidth is a co-dominant scheduling resource; model must be consistent |
| Network topology map (NVLink domains, IB fabric) | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/architecture/topology-model.md` | Topology-aware placement requires accurate NVLink domain membership and IB bisection bandwidth per rack |
| Network bandwidth reservation API | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-sched-contract.md` | Gang scheduling for allreduce jobs requires co-reserving network bandwidth |
| Resource metering API contract | `council-telemetry` | `ucxl://council-telemetry:architect@DistOS:telemetry/*^/interfaces/telemetry-to-sched-contract.md` | Scheduler must feed placement and execution events to telemetry; power headroom signals flow back |
| Security isolation constraints | `council-sec` | `ucxl://council-sec:architect@DistOS:security/*^/interfaces/sec-to-sched-contract.md` | Which tenants may share a GPU partition, MPS context constraints, minimum isolation requirements per tenant class |
### 6.2 What Other Councils Need from council-sched
| Consumer Council | Artifact Required | Purpose |
|-----------------|------------------|---------|
| `council-mem` | Placement decisions and GPU assignment map | Memory subsystem needs to know which GPU a job is placed on to configure HBM3 address space and NVLink fabric attachment |
| `council-mem` | Preemption events and checkpoint triggers | Memory tiering must snapshot GPU memory on preemption |
| `council-net` | Job placement map with NVLink domain assignments | Network subsystem configures NCCL topology files and RDMA QP mappings based on scheduler placement |
| `council-telemetry` | Scheduling events stream (enqueue, dequeue, preempt, complete) | Metering and cost attribution require per-job lifecycle events |
| `council-sec` | Partition assignment per tenant | Security subsystem enforces isolation based on scheduler-assigned partition IDs |
| `council-verify` | TLA+ scheduler state machine spec | Formal verification council model-checks scheduler invariants |
---
## 7. WHOOSH Configuration
```yaml
# WHOOSH council formation configuration for council-sched
council_id: council-sched
project: DistOS
subsystem: scheduling
gitea_label: chorus-entrypoint
gitea_repo: distos/scheduling
formation:
  target_agents: 80
  min_agents: 60
  wave:
    max_per_wave: 12
    min_per_wave: 6
    period_sec: 30
  placement:
    max_replicas_per_node: 2
    join_stagger_ms: 2000
    bootstrap_peers_min: 5
roles:
  - role: lead-architect
    count: 2
    model: claude-opus-4-6
    priority: high
  - role: researcher
    count: 36
    model: qwen2.5-coder:32b
    priority: normal
    subgroups:
      - tag: kernel-scheduling
        count: 8
      - tag: placement
        count: 6
      - tag: queuing-theory
        count: 5
      - tag: gang-scheduling
        count: 5
      - tag: preemption
        count: 4
      - tag: numa-topology
        count: 4
      - tag: energy
        count: 4
  - role: architect
    count: 10
    model: claude-opus-4-6
    priority: normal
  - role: verifier
    count: 8
    model: deepseek-coder-v2
    priority: normal
  - role: reviewer
    count: 8
    model: claude-opus-4-6
    priority: normal
  - role: integration-liaison
    count: 6
    model: qwen2.5-coder:32b
    priority: normal
  - role: decision-record-author
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: adversarial-critic
    count: 5
    model: claude-opus-4-6
    priority: normal
subchannels:
  - name: sched-research
    description: "Research discussion and literature synthesis"
    participants: [researcher, lead-architect]
    pubsub: true
  - name: sched-architecture
    description: "Architecture proposal discussion and voting"
    participants: [architect, lead-architect, reviewer, adversarial-critic]
    pubsub: true
  - name: sched-formal-spec
    description: "TLA+ specification authoring and review"
    participants: [verifier, lead-architect, reviewer]
    pubsub: false
  - name: sched-integration
    description: "Cross-council interface negotiation"
    participants: [integration-liaison, lead-architect]
    pubsub: false
  - name: sched-decisions
    description: "Decision record authoring and consensus"
    participants: [decision-record-author, lead-architect, reviewer]
    pubsub: true
quorum:
  # Architecture decisions require supermajority
  architecture_changes:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    beat_minutes: 20
    timeout_beats: 6
  # Research summaries require simple majority
  research_summaries:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 4
  # Formal specifications require supermajority with verifier sign-off
  formal_specs:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    require_verifier: true
    beat_minutes: 25
    timeout_beats: 8
  # Cross-council interface contracts require unanimous lead-architect approval
  interface_contracts:
    policy: unanimous
    roles: [lead-architect, integration-liaison]
    beat_minutes: 30
    timeout_beats: 4
gates:
  kaching:
    p95_latency_ms: 250
    max_error_rate: 0.01
  backbeat:
    max_stream_lag: 200
  bootstrap:
    min_healthy_peers: 5
  join:
    min_success_rate: 0.80
review:
  beat_minutes: 20
  quorum:
    total_min: 3
    require_domain_role: true
    require_quality_role: true
  timeout_beats: 6
  no_self_approval: true
```
---
## 8. Success Criteria
1. **Research completeness:** All 8 research domain summaries published to DHT with at least 5 primary references each, reviewed and approved by at least 3 council agents.
2. **Architecture coverage:** Architectural proposals exist for all 7 major scheduling components (kernel dispatch, placement, queuing, gang, preemption, NUMA, energy). Each proposal addresses the 1024-node scale constraint explicitly.
3. **Decision records resolved:** All 7 decision points (DP-SCHED-001 through DP-SCHED-007) have a corresponding Decision Record with at least 3 alternatives considered, evidence citations, and a chosen option ratified by council supermajority.
4. **Formal specifications:** TLA+ specifications for the scheduler state machine, gang scheduling protocol, and preemption protocol. At least 2 of these must have model-checked safety invariants (no starvation for highest-priority jobs, gang deadlock freedom) verified by `council-verify`.
5. **Interface contracts ratified:** All 4 interface contracts (to `council-mem`, `council-net`, `council-telemetry`, `council-sec`) are co-signed by integration liaisons from both councils.
6. **UCXL navigability:** A human unfamiliar with the project should be able to navigate from any Decision Record to the research summary that motivated it using only UCXL temporal navigation, within 5 hops.
7. **Adversarial review pass:** Each major architecture proposal has at minimum one adversarial critique documented and a resolution recorded. No proposal advances to formal specification with an unresolved red-vote critique.
---
## 9. Timeline
### Phase 1: Research and Survey (Days 1-3)
**Day 1:**
- WHOOSH forms council; all 80 agents join via wave deployment
- Researchers self-assign to domain subgroups via `sched-research` subchannel
- Literature survey begins: GPU kernel scheduling, placement, and fair queuing domains prioritised first (these are on the critical path for Decision Points DP-SCHED-001 through DP-SCHED-003)
- Integration liaisons make initial contact with `council-mem` and `council-net` to understand their timelines
**Day 2:**
- Remaining research domains surveyed: gang scheduling, preemption, NUMA, energy
- Research summaries drafted in parallel across subgroups
- First internal review cycle: reviewers read summaries and post green/yellow/red votes with rationale
- Lead architects synthesise research findings into a preliminary scheduling design space map
**Day 3:**
- Research summaries revised based on review feedback; final versions published to DHT
- Adversarial critics challenge assumptions in each summary
- Research phase gate: all 8 summaries must achieve simple majority approval before Phase 2 begins
- Preliminary interface contract outlines shared with dependency councils
### Phase 2: Architecture and Trade-offs (Days 3-6)
**Day 3-4:**
- Architects propose concrete options for DP-SCHED-001 (partition model) and DP-SCHED-002 (placement scoring) — these are the highest-dependency decisions
- Adversarial critics engage immediately; all alternatives documented
- DP-SCHED-001 decision record drafted; council votes; DR published
**Day 4-5:**
- Queuing model (DP-SCHED-003), gang scheduling (DP-SCHED-004), and preemption (DP-SCHED-005) design proposals concurrently authored
- Inter-council synthesis session with `council-mem` to align on oversubscription and eviction signal interfaces
- Inter-council synthesis session with `council-net` to align on topology model input format
**Day 5-6:**
- Grace Superchip model (DP-SCHED-006) and energy interface (DP-SCHED-007) decisions resolved
- All 7 Decision Records drafted; supermajority vote on each
- Architecture overview document assembled from approved DRs
- Architecture phase gate: all DPs resolved before Phase 3 begins
### Phase 3: Formal Specification (Days 6-10)
**Day 6-7:**
- Verifiers begin TLA+ specification of the scheduler state machine based on approved architecture
- Architects continue refining component-level designs to resolve any ambiguities surfaced during spec authoring
- `council-verify` engaged: share spec drafts for early model-checking feedback
**Day 7-8:**
- Gang scheduler TLA+ module authored; parallel with preemption protocol spec
- Fairness invariants formally stated: no starvation under bounded load, gang deadlock freedom, DRF monotonicity
**Day 8-10:**
- Model checking runs submitted to `council-verify`
- Counterexample analysis: any liveness or safety violations trigger architecture revision and updated DRs
- Formal spec versions pinned in DHT; UCXL addresses published to dependent councils
### Phase 4: Integration and Review (Days 10-12)
**Day 10-11:**
- Interface contracts with `council-mem`, `council-net`, `council-telemetry`, `council-sec` finalised and submitted for co-signature
- Cross-council integration session: scheduling placement decisions validated against network topology model
- `council-synth` engaged for any unresolved conflicts with other councils' specifications
**Day 11-12:**
- Final council review of complete scheduling specification
- Adversarial critics run end-to-end failure scenario analysis
- Any remaining yellow votes addressed with documented mitigations
- Integration review gate: all interface contracts co-signed
### Phase 5: Documentation and Narrative (Days 12-14)
**Day 12-13:**
- Decision record authors produce narrative summaries of the scheduling architecture journey
- `council-docs` receives the complete scheduling specification package for standardised formatting
- UCXL navigability audit: spot-check 10 random decision paths for completeness
**Day 14:**
- Final specification published
- `council-arch` decision archaeology agents generate human-readable narrative of scheduling design evolution
- Council formally dissolved; agents released back to WHOOSH pool