# Council Design Brief: Process Scheduling

**Council ID:** `council-sched`
**Mission:** Design the heterogeneous process scheduling subsystem for DistOS, covering GPU kernel dispatch, workload placement across Hopper/Grace/Blackwell accelerators, fair multi-tenant queuing, gang scheduling for distributed training, and energy-aware execution across a 1024-node cluster.
**UCXL Base Address:** `ucxl://council-sched:*@DistOS:scheduling/*`
**Agent Count:** 80
**Status:** Constitution Phase — awaiting WHOOSH formation trigger
**Created:** 2026-02-24

---

## 1. Scope and Responsibilities

`council-sched` owns the complete specification of the DistOS process and kernel scheduling subsystem. Scope boundaries are defined as follows.

**In scope:**

- GPU kernel scheduling: dispatch queue management, kernel concurrency, SM partitioning on Hopper (MIG) and Blackwell, and MPS (Multi-Process Service) lifetime management
- CPU-GPU co-scheduling on Grace Superchip (NVLink-C2C coherent interconnect), including unified virtual address space scheduling implications
- Workload placement policy across heterogeneous accelerator types (H100, GH200, B200), including topology-aware affinity scoring
- Fair multi-tenant queuing: priority classes, weighted fair queuing, and quota enforcement
- Gang scheduling for distributed training workloads: all-or-nothing allocation, partial-allotment strategies, and backfill
- Preemption strategies: checkpoint-based preemption, time-sliced preemption, and priority inversion avoidance
- NUMA-aware placement across CPU sockets and GPU memory domains
- GPU memory oversubscription scheduling: eviction policy, swap-to-host, and coordinated demand management with `council-mem`
- Energy-aware scheduling: frequency/voltage scaling directives, power capping, and thermal headroom management
- Formal specification of the scheduling API surface exposed to user workloads and to other DistOS subsystems

**Out of scope (delegated):**

- Physical memory allocation and coherence protocol (delegated to `council-mem`)
- Network topology discovery and network-aware placement data (delegated to `council-net`; consumed as inputs)
- Metering, cost attribution, and SLO enforcement (delegated to `council-telemetry`; consumed as outputs)
- Security isolation between tenants at the hardware level (delegated to `council-sec`; policies consumed as constraints)

---

## 2. Research Domains

### 2.1 GPU Kernel Scheduling and SM Partitioning

Survey the CUDA Concurrent Kernel Execution model, the NVIDIA Multi-Instance GPU (MIG) architecture on Hopper (H100), and the NVIDIA Multi-Process Service (MPS). Understand how MIG partitions a GPU into isolated GPC slices with dedicated HBM3 memory, and how MPS enables concurrent kernel execution within a single GPU context without full MIG isolation overhead.

Key materials:

- NVIDIA H100 Architecture Technical Overview (2022) — SM partitioning, GPC layout, NVLink 4.0 bandwidth
- NVIDIA MIG User Guide (CUDA 12.x) — partition profiles (1g.10gb through 7g.80gb), instance isolation, compute and memory capacity tables
- NVIDIA MPS documentation — shared context model, client limit (48 for Hopper), error containment limitations
- AMD ROCm Hardware Abstraction Layer (HAL) source — `amdkfd` KFD driver, compute queue management, HWS (Hardware Scheduler) in GFX12
- Jain et al., "Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs" (RTAS 2019) — software-defined partitioning as a baseline comparison
- Xiao et al., "AntMan: Dynamic Scaling on GPU Clusters for Deep Learning" (OSDI 2020) — GPU memory and compute sharing for co-located workloads

### 2.2 Heterogeneous Workload Placement

Develop a placement model that accounts for the distinct performance characteristics of H100 (PCIe/SXM5), GH200 Grace Superchip (NVLink-C2C), and B200 Blackwell (NVLink 5.0 SXM). Each accelerator type has different compute density, memory bandwidth, and interconnect topology.
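As a toy illustration of topology-aware affinity scoring, the sketch below combines normalised compute, memory bandwidth, and NVLink-domain proximity into one score. All weights, normalisation constants, and per-accelerator figures are illustrative assumptions, not council decisions.

```python
from dataclasses import dataclass

@dataclass
class Accelerator:
    name: str
    tflops_bf16: float   # approximate dense BF16 throughput (illustrative)
    hbm_gbps: float      # approximate HBM bandwidth (illustrative)
    nvlink_domain: int   # NVLink domain ID used for proximity scoring

def affinity_score(job_peers_domain: int, acc: Accelerator,
                   w_compute: float = 0.5, w_bw: float = 0.3,
                   w_topo: float = 0.2) -> float:
    """Weighted affinity: higher is better. Weights are hypothetical."""
    compute = acc.tflops_bf16 / 2000.0   # normalise against a nominal peak
    bw = acc.hbm_gbps / 8000.0
    topo = 1.0 if acc.nvlink_domain == job_peers_domain else 0.0
    return w_compute * compute + w_bw * bw + w_topo * topo

# Illustrative (approximate) accelerator figures
h100 = Accelerator("H100-SXM5", 990, 3350, nvlink_domain=1)
b200 = Accelerator("B200", 2250, 8000, nvlink_domain=2)

print(affinity_score(1, h100), affinity_score(1, b200))
```

With these placeholder weights, raw Blackwell throughput can outweigh NVLink-domain locality for a job whose peers sit elsewhere; deciding how that trade-off is tuned is exactly what DP-SCHED-002 must resolve.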
Key materials:

- Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (ISCA 2017) — heterogeneous accelerator placement rationale
- Google Borg paper: Verma et al., "Large-scale cluster management at Google with Borg" (EuroSys 2015) — machine heterogeneity handling, alloc sets, resource estimation
- Qiao et al., "Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning" (OSDI 2021) — goodput-aware placement for training workloads
- Weng et al., "MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters" (NSDI 2022) — real-world heterogeneous GPU cluster scheduling from Alibaba

### 2.3 Fair Multi-Tenant Queuing

Design a multi-level queuing architecture with Dominant Resource Fairness (DRF) semantics extended for GPU resources (SM fraction, HBM bandwidth, NVLink bandwidth as co-dominant resources).

Key materials:

- Ghodsi et al., "Dominant Resource Fairness: Fair Allocation of Multiple Resource Types" (NSDI 2011) — foundational DRF theory
- Apache YARN CapacityScheduler and FairScheduler documentation — hierarchical queues, preemption policy, label-based node targeting
- Apache Mesos DRF implementation — `DominantShareAllocator`, offer model, role weights
- Kubernetes device plugin framework (k8s.io/device-plugins) — GPU resource advertisement, extended resource scheduling
- Tiresias: Gu et al., "Tiresias: A GPU Cluster Manager for Distributed Deep Learning" (NSDI 2019) — LAS (Least Attained Service) scheduling for DL jobs, 2DAS (2-dimensional attained service)
- Gandiva: Xiao et al., "Gandiva: Introspective Cluster Scheduling for Deep Learning" (OSDI 2018) — time-sliced GPU sharing, job packing, migration

### 2.4 Gang Scheduling for Distributed Training

Gang scheduling ensures all processes in a distributed training job (e.g., a 512-GPU allreduce ring) are co-scheduled simultaneously.
This is critical for preventing head-of-line blocking and deadlock in collective communication.

Key materials:

- Feitelson and Rudolph, "Gang Scheduling Performance Benefits for Fine-Grain Synchronization" (J. Parallel Distrib. Comput., 1992) — foundational gang scheduling analysis
- Rajachandrasekar et al., "A Closer Look at All-Reduce for Deep Learning" (Workshop at SC'19) — collective communication scheduling sensitivity
- Hwang et al., "AFS: Annotation-Free Automatic Sharding for Large Language Models" (ICLR 2023) — pipeline and tensor parallel placement co-scheduling
- Slurm gang scheduling documentation — `Oversubscribe=FORCE`, `GraceTime`, slurmctld gang plugin
- Jeon et al., "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads" (USENIX ATC 2019) — Microsoft Philly cluster trace, gang failure modes

### 2.5 Preemption Strategies

Survey checkpoint-based, reactive, and time-sliced preemption. Understand the cost of a GPU checkpoint (saving the SM register file, shared memory, and in-flight DMA) and design preemption policies that bound the maximum latency impact.

Key materials:

- Park et al., "Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization" (USENIX ATC 2020) — container-level preemption
- Zhang et al., "SHEPHERD: Serving DNNs in the Wild" (NSDI 2023) — latency SLO-aware preemption for inference workloads
- NVIDIA CUDA checkpoint/restore (CRIU for GPU) — experimental support status as of CUDA 12.x
- Wang et al., "Achieving Microsecond-Scale Tail Latency Efficiently with Approximate Optimal Scheduling" (SOSP 2017) — preemption-aware scheduling theory

### 2.6 NUMA-Aware and Topology-Aware Placement

Model the NUMA topology of a Grace Superchip node: 72-core Arm Neoverse V2 CPU, 480 GB LPDDR5X at ~512 GB/s, connected to the H100 GPU via NVLink-C2C at 900 GB/s. Compare this with standard SXM5 nodes where CPU-GPU bandwidth is limited to PCIe Gen5 (~128 GB/s bidirectional).
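Using the figures above, a back-of-envelope comparison of host-to-GPU staging time makes the asymmetry concrete. This is an idealised sketch: real transfers overlap compute and rarely sustain peak bandwidth, and the 100 GB payload is an arbitrary example.

```python
def transfer_seconds(bytes_to_move: float, link_gbps: float) -> float:
    """Idealised time to stage data across a CPU-GPU link at peak bandwidth."""
    return bytes_to_move / (link_gbps * 1e9)

payload = 100e9  # 100 GB of weights/activations to stage (arbitrary example)

nvlink_c2c = transfer_seconds(payload, 900)  # GH200: NVLink-C2C at 900 GB/s
pcie_gen5 = transfer_seconds(payload, 64)    # SXM5 host link: ~64 GB/s per direction

print(f"NVLink-C2C: {nvlink_c2c:.2f} s, PCIe Gen5: {pcie_gen5:.2f} s, "
      f"ratio: {pcie_gen5 / nvlink_c2c:.1f}x")
```

The ~14x gap is why the placement model must treat Grace and SXM5 nodes as qualitatively different targets for oversubscription-prone and swap-heavy jobs.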
Key materials:

- Linux NUMA scheduling documentation — `libnuma`, `numactl`, CFS NUMA balancing (`task_numa_migrate`)
- NVIDIA GH200 Grace Hopper Superchip Architecture Whitepaper (2023)
- Lepers et al., "Thread and Memory Placement on NUMA Systems: Asymmetry Matters" (USENIX ATC 2015)
- Blagodurov et al., "A Case for NUMA-Aware Contention Management on Multicore Systems" (USENIX ATC 2011)

### 2.7 GPU Memory Oversubscription

When aggregate GPU memory demand exceeds physical HBM3 capacity, the scheduler must coordinate with `council-mem`'s eviction and tiering policies to decide which kernels to throttle, checkpoint, or migrate.

Key materials:

- Rhu et al., "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design" (MICRO 2016) — layer-by-layer activation offload
- Huang et al., "Efficient Large Scale Language Modeling with Mixtures of Experts" (EMNLP 2021) — memory-efficient model parallelism
- NVIDIA Unified Memory documentation (CUDA 12.x) — page migration engine, oversubscription, prefetch advisory API
- Huang et al., "SwapAdvisor: Push Deep Learning Beyond the GPU Memory Limit via Smart Swapping" (ASPLOS 2020)

### 2.8 Energy-Aware Scheduling

Design scheduling policies that honour per-node power caps enforced by Baseboard Management Controllers (BMCs) and exploit DVFS (Dynamic Voltage and Frequency Scaling) headroom reported by `council-telemetry`.

Key materials:

- Patel et al., "Clite: Efficient and QoS Aware Co-location of Multiple Latency-Critical Jobs for Warehouse Scale Computers" (ISCA 2020)
- Lim et al., "Adaptive Power Management for Heterogeneous Multiprocessor SoCs" (ICCAD 2009)
- NVIDIA NVML power management API — `nvmlDeviceSetPowerManagementLimit`, `nvmlDeviceGetCurrentClocksThrottleReasons`
- AMD ROCm SMI library — power cap interface (`rsmi_dev_power_cap_set`)
- RAPL (Running Average Power Limit) interface for Grace CPU power management

---

## 3. Agent Roles

Total agents: **80**

| Role | Count | Responsibilities |
|------|-------|-----------------|
| Lead Architect | 2 | Overall scheduling architecture decisions, cross-subsystem interface design, DR authorship for major decisions |
| Kernel Scheduling Researchers | 8 | CUDA/ROCm scheduler internals, MIG/MPS analysis, SM partitioning survey |
| Placement Researchers | 6 | Heterogeneous accelerator placement, topology modelling, workload profiling |
| Queueing Theory Specialists | 5 | DRF extensions for multi-resource GPU scheduling, fairness proof sketches |
| Gang Scheduling Specialists | 5 | Collective communication scheduling, all-or-nothing allocation protocols |
| Preemption Specialists | 4 | Checkpoint protocol design, preemption cost modelling |
| NUMA/Topology Analysts | 4 | NUMA topology modelling for Grace Superchip and SXM5 nodes |
| Energy Efficiency Researchers | 4 | Power capping, DVFS scheduling integration |
| Formal Specification Authors | 8 | TLA+ specification of scheduler state machine, safety and liveness invariants |
| Architects (sub-component) | 10 | Propose concrete scheduling algorithm designs for each domain |
| Internal Reviewers | 8 | Review research summaries and architecture proposals; cast green/yellow/red votes |
| Integration Liaisons | 6 | Interface with `council-mem`, `council-net`, `council-telemetry`, `council-sec` |
| Decision Record Authors | 5 | Author DRs for each resolved decision point; maintain UCXL provenance chain |
| Adversarial Critics | 5 | Challenge proposed designs; surface failure modes, starvation scenarios, livelock risks |

**Role distribution rationale:** The large researcher cohort (36 agents covering 8 research domains) reflects the breadth of scheduling literature. The formal specification group (8 agents) is sized to produce parallel TLA+ modules. Internal reviewers and adversarial critics (13 agents combined) ensure no architecture proposal passes without rigorous challenge.

---

## 4. Key Deliverables

All artifacts are published to the DHT and addressable via UCXL.

### 4.1 Research Summaries

```
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gpu-kernel-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/heterogeneous-placement.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/fair-queuing-drf.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gang-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/preemption-strategies.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/numa-topology.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/memory-oversubscription.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/energy-aware-scheduling.md
```

### 4.2 Architecture Proposals

```
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/scheduler-overview.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/mig-mps-partitioning-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/placement-scoring-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/drf-gpu-extension.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/gang-scheduler-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/preemption-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/energy-policy-interface.md
```

### 4.3 Decision Records

```
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-001-partition-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-002-placement-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-003-queuing-policy.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-004-gang-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-005-preemption-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-006-energy-interface.md
```

### 4.4 Formal Specifications

```
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/SchedulerStateMachine.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/GangScheduler.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/PreemptionProtocol.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/FairnessInvariants.tla
```

### 4.5 Interface Contracts (for other councils)

```
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-mem-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-net-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-telemetry-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-sec-contract.md
```

---

## 5. Decision Points

The following are the major architectural questions `council-sched` must resolve. Each decision must produce a Decision Record with alternatives considered, evidence from research, and rationale for the chosen option.

### DP-SCHED-001: Primary Partition Model

**Question:** Should the scheduler use MIG (hardware-enforced partition isolation), MPS (shared context with software-enforced limits), or a hybrid model where MIG is used for multi-tenant isolation and MPS for co-located jobs within a single tenant's partition?

**Factors:** isolation strength, context switch overhead, minimum allocation granularity, Blackwell support parity, operational complexity.

**Dependency:** Decision informs the security isolation model that `council-sec` will specify.

### DP-SCHED-002: Placement Scoring Architecture

**Question:** Should placement be computed by a centralised scoring service (Borg-style) or a distributed, bid-based negotiation (Mesos offer model)? How does topology affinity (NVLink domain proximity) weight against fairness constraints?
**Factors:** convergence time at 1024-node scale, fault tolerance of the placement service itself, stale topology information handling, NVLink bandwidth utilisation efficiency.

### DP-SCHED-003: Multi-Resource Fairness Extension

**Question:** DRF was designed for CPU/memory. GPU workloads introduce additional dominant resources: SM fraction, HBM3 bandwidth, NVLink bandwidth, and NVSwitch port occupancy. How many resource dimensions does the fairness model track, and what is the computational complexity of DRF at 80+ resource types?

**Factors:** implementation tractability, approximation error bounds, interaction with per-tenant quota enforcement.

### DP-SCHED-004: Gang Scheduling Protocol

**Question:** Should gang scheduling use a two-phase reservation (reserve then commit) protocol, speculative allocation with rollback, or a backfill-with-hold strategy? What is the maximum tolerable scheduling delay for a 1024-GPU gang job?

**Factors:** cluster utilisation impact, deadlock risk in two-phase protocols, interaction with preemption.

### DP-SCHED-005: Preemption Granularity

**Question:** Should preemption operate at kernel granularity (CUDA stream checkpointing), job granularity (full process checkpoint via CRIU-for-GPU), or a hybrid that allows kernel preemption for short jobs and process-level preemption for long jobs?

**Factors:** checkpoint latency (kernel-level: microseconds; process-level: seconds to tens of seconds for large model weights), HBM3 save/restore bandwidth cost, preemption frequency requirements.

### DP-SCHED-006: Grace Superchip Scheduling Model

**Question:** For GH200 nodes with NVLink-C2C, the CPU and GPU share a unified virtual address space. Should the scheduler treat CPU and GPU on a Grace node as a single scheduling unit, or as two separate but affinity-linked resources? How does this interact with the memory model specified by `council-mem`?
**Factors:** utilisation efficiency, memory model consistency requirements, API ergonomics for user workloads, migration cost if a job is split across a C2C boundary.

### DP-SCHED-007: Energy Policy Interface

**Question:** Should energy-aware scheduling be a first-class scheduling objective (a weight in the placement score function), a hard constraint (power cap as a resource limit), or an advisory mechanism (scheduler receives power headroom hints and applies them at its discretion)?

**Factors:** SLO compatibility, predictability of power-capped execution, interaction with thermal management, coordination protocol with `council-telemetry`.

---

## 6. Dependencies

### 6.1 What council-sched Needs from Other Councils

| Dependency | Source Council | Artifact | Purpose |
|------------|---------------|---------|---------|
| Memory pressure signals and eviction cost model | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-sched-contract.md` | GPU memory oversubscription scheduling requires eviction cost estimates to avoid scheduling kernels that will immediately thrash |
| HBM3/DDR5 bandwidth allocation model | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/architecture/bandwidth-allocation.md` | Bandwidth is a co-dominant scheduling resource; model must be consistent |
| Network topology map (NVLink domains, IB fabric) | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/architecture/topology-model.md` | Topology-aware placement requires accurate NVLink domain membership and IB bisection bandwidth per rack |
| Network bandwidth reservation API | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-sched-contract.md` | Gang scheduling for allreduce jobs requires co-reserving network bandwidth |
| Resource metering API contract | `council-telemetry` | `ucxl://council-telemetry:architect@DistOS:telemetry/*^/interfaces/telemetry-to-sched-contract.md` | Scheduler must feed placement and execution events to telemetry; power headroom signals flow back |
| Security isolation constraints | `council-sec` | `ucxl://council-sec:architect@DistOS:security/*^/interfaces/sec-to-sched-contract.md` | Which tenants may share a GPU partition, MPS context constraints, minimum isolation requirements per tenant class |

### 6.2 What Other Councils Need from council-sched

| Consumer Council | Artifact Required | Purpose |
|-----------------|------------------|---------|
| `council-mem` | Placement decisions and GPU assignment map | Memory subsystem needs to know which GPU a job is placed on to configure HBM3 address space and NVLink fabric attachment |
| `council-mem` | Preemption events and checkpoint triggers | Memory tiering must snapshot GPU memory on preemption |
| `council-net` | Job placement map with NVLink domain assignments | Network subsystem configures NCCL topology files and RDMA QP mappings based on scheduler placement |
| `council-telemetry` | Scheduling events stream (enqueue, dequeue, preempt, complete) | Metering and cost attribution require per-job lifecycle events |
| `council-sec` | Partition assignment per tenant | Security subsystem enforces isolation based on scheduler-assigned partition IDs |
| `council-verify` | TLA+ scheduler state machine spec | Formal verification council model-checks scheduler invariants |

---

## 7. WHOOSH Configuration

```yaml
# WHOOSH council formation configuration for council-sched
council_id: council-sched
project: DistOS
subsystem: scheduling
gitea_label: chorus-entrypoint
gitea_repo: distos/scheduling
formation:
  target_agents: 80
  min_agents: 60
  wave:
    max_per_wave: 12
    min_per_wave: 6
    period_sec: 30
  placement:
    max_replicas_per_node: 2
    join_stagger_ms: 2000
    bootstrap_peers_min: 5
roles:
  - role: lead-architect
    count: 2
    model: claude-opus-4-6
    priority: high
  - role: researcher
    count: 36
    model: qwen2.5-coder:32b
    priority: normal
    subgroups:
      - tag: kernel-scheduling
        count: 8
      - tag: placement
        count: 6
      - tag: queuing-theory
        count: 5
      - tag: gang-scheduling
        count: 5
      - tag: preemption
        count: 4
      - tag: numa-topology
        count: 4
      - tag: energy
        count: 4
  - role: architect
    count: 10
    model: claude-opus-4-6
    priority: normal
  - role: verifier
    count: 8
    model: deepseek-coder-v2
    priority: normal
  - role: reviewer
    count: 8
    model: claude-opus-4-6
    priority: normal
  - role: integration-liaison
    count: 6
    model: qwen2.5-coder:32b
    priority: normal
  - role: decision-record-author
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: adversarial-critic
    count: 5
    model: claude-opus-4-6
    priority: normal
subchannels:
  - name: sched-research
    description: "Research discussion and literature synthesis"
    participants: [researcher, lead-architect]
    pubsub: true
  - name: sched-architecture
    description: "Architecture proposal discussion and voting"
    participants: [architect, lead-architect, reviewer, adversarial-critic]
    pubsub: true
  - name: sched-formal-spec
    description: "TLA+ specification authoring and review"
    participants: [verifier, lead-architect, reviewer]
    pubsub: false
  - name: sched-integration
    description: "Cross-council interface negotiation"
    participants: [integration-liaison, lead-architect]
    pubsub: false
  - name: sched-decisions
    description: "Decision record authoring and consensus"
    participants: [decision-record-author, lead-architect, reviewer]
    pubsub: true
quorum:
  # Architecture decisions require supermajority
  architecture_changes:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    beat_minutes: 20
    timeout_beats: 6
  # Research summaries require simple majority
  research_summaries:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 4
  # Formal specifications require supermajority with verifier sign-off
  formal_specs:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    require_verifier: true
    beat_minutes: 25
    timeout_beats: 8
  # Cross-council interface contracts require unanimous lead-architect approval
  interface_contracts:
    policy: unanimous
    roles: [lead-architect, integration-liaison]
    beat_minutes: 30
    timeout_beats: 4
gates:
  kaching:
    p95_latency_ms: 250
    max_error_rate: 0.01
  backbeat:
    max_stream_lag: 200
  bootstrap:
    min_healthy_peers: 5
  join:
    min_success_rate: 0.80
  review:
    beat_minutes: 20
    quorum:
      total_min: 3
      require_domain_role: true
      require_quality_role: true
    timeout_beats: 6
    no_self_approval: true
```

---

## 8. Success Criteria

1. **Research completeness:** All 8 research domain summaries published to the DHT with at least 5 primary references each, reviewed and approved by at least 3 council agents.
2. **Architecture coverage:** Architectural proposals exist for all 7 major scheduling components (kernel dispatch, placement, queuing, gang, preemption, NUMA, energy). Each proposal addresses the 1024-node scale constraint explicitly.
3. **Decision records resolved:** All 7 decision points (DP-SCHED-001 through DP-SCHED-007) have a corresponding Decision Record with at least 3 alternatives considered, evidence citations, and a chosen option ratified by council supermajority.
4. **Formal specifications:** TLA+ specifications for the scheduler state machine, gang scheduling protocol, and preemption protocol.
   At least 2 of these must have model-checked invariants (liveness: no starvation for highest-priority jobs; safety: gang deadlock freedom) verified by `council-verify`.
5. **Interface contracts ratified:** All 4 interface contracts (to `council-mem`, `council-net`, `council-telemetry`, `council-sec`) are co-signed by integration liaisons from both councils.
6. **UCXL navigability:** A human unfamiliar with the project should be able to navigate from any Decision Record to the research summary that motivated it using only UCXL temporal navigation, within 5 hops.
7. **Adversarial review pass:** Each major architecture proposal has at least one adversarial critique documented and a resolution recorded. No proposal advances to formal specification with an unresolved red-vote critique.

---

## 9. Timeline

### Phase 1: Research and Survey (Days 1-3)

**Day 1:**

- WHOOSH forms council; all 80 agents join via wave deployment
- Researchers self-assign to domain subgroups via the `sched-research` subchannel
- Literature survey begins: GPU kernel scheduling, placement, and fair queuing domains prioritised first (these are on the critical path for Decision Points DP-SCHED-001 through DP-SCHED-003)
- Integration liaisons make initial contact with `council-mem` and `council-net` to understand their timelines

**Day 2:**

- Remaining research domains surveyed: gang scheduling, preemption, NUMA, energy
- Research summaries drafted in parallel across subgroups
- First internal review cycle: reviewers read summaries and post green/yellow/red votes with rationale
- Lead architects synthesise research findings into a preliminary scheduling design space map

**Day 3:**

- Research summaries revised based on review feedback; final versions published to the DHT
- Adversarial critics challenge assumptions in each summary
- Research phase gate: all 8 summaries must achieve simple majority approval before Phase 2 begins
- Preliminary interface contract outlines shared with dependency councils

### Phase 2: Architecture and Trade-offs (Days 3-6)

**Days 3-4:**

- Architects propose concrete options for DP-SCHED-001 (partition model) and DP-SCHED-002 (placement scoring) — these are the highest-dependency decisions
- Adversarial critics engage immediately; all alternatives documented
- DP-SCHED-001 decision record drafted; council votes; DR published

**Days 4-5:**

- Queuing model (DP-SCHED-003), gang scheduling (DP-SCHED-004), and preemption (DP-SCHED-005) design proposals authored concurrently
- Inter-council synthesis session with `council-mem` to align on oversubscription and eviction signal interfaces
- Inter-council synthesis session with `council-net` to align on topology model input format

**Days 5-6:**

- Grace Superchip model (DP-SCHED-006) and energy interface (DP-SCHED-007) decisions resolved
- All 7 Decision Records drafted; supermajority vote on each
- Architecture overview document assembled from approved DRs
- Architecture phase gate: all DPs resolved before Phase 3 begins

### Phase 3: Formal Specification (Days 6-10)

**Days 6-7:**

- Verifiers begin TLA+ specification of the scheduler state machine based on the approved architecture
- Architects continue refining component-level designs to resolve any ambiguities surfaced during spec authoring
- `council-verify` engaged: share spec drafts for early model-checking feedback

**Days 7-8:**

- Gang scheduler TLA+ module authored, in parallel with the preemption protocol spec
- Fairness invariants formally stated: no starvation under bounded load, gang deadlock freedom, DRF monotonicity

**Days 8-10:**

- Model checking runs submitted to `council-verify`
- Counterexample analysis: any liveness or safety violations trigger architecture revision and updated DRs
- Formal spec versions pinned in the DHT; UCXL addresses published to dependent councils

### Phase 4: Integration and Review (Days 10-12)

**Days 10-11:**

- Interface contracts with `council-mem`, `council-net`, `council-telemetry`, and `council-sec` finalised and submitted for co-signature
- Cross-council integration session: scheduling placement decisions validated against the network topology model
- `council-synth` engaged for any unresolved conflicts with other councils' specifications

**Days 11-12:**

- Final council review of the complete scheduling specification
- Adversarial critics run end-to-end failure scenario analysis
- Any remaining yellow votes addressed with documented mitigations
- Integration review gate: all interface contracts co-signed

### Phase 5: Documentation and Narrative (Days 12-14)

**Days 12-13:**

- Decision record authors produce narrative summaries of the scheduling architecture journey
- `council-docs` receives the complete scheduling specification package for standardised formatting
- UCXL navigability audit: spot-check 10 random decision paths for completeness

**Day 14:**

- Final specification published
- `council-arch` decision archaeology agents generate a human-readable narrative of scheduling design evolution
- Council formally dissolved; agents released back to the WHOOSH pool