Council Design Brief: Process Scheduling
Council ID: council-sched
Mission: Design the heterogeneous process scheduling subsystem for DistOS, covering GPU kernel dispatch, workload placement across Hopper/Grace/Blackwell accelerators, fair multi-tenant queuing, gang scheduling for distributed training, and energy-aware execution across a 1024-node cluster.
UCXL Base Address: ucxl://council-sched:*@DistOS:scheduling/*
Agent Count: 80
Status: Constitution Phase — awaiting WHOOSH formation trigger
Created: 2026-02-24
1. Scope and Responsibilities
council-sched owns the complete specification of the DistOS process and kernel scheduling subsystem. Scope boundaries are defined as follows.
In scope:
- GPU kernel scheduling: dispatch queue management, kernel concurrency, SM partitioning on Hopper (MIG) and Blackwell, and MPS (Multi-Process Service) lifetime management
- CPU-GPU co-scheduling on Grace Superchip (NVLink-C2C coherent interconnect), including unified virtual address space scheduling implications
- Workload placement policy across heterogeneous accelerator types (H100, GH200, B200), including topology-aware affinity scoring
- Fair multi-tenant queuing: priority classes, weighted fair queuing, and quota enforcement
- Gang scheduling for distributed training workloads: all-or-nothing allocation, partial-allotment strategies, and backfill
- Preemption strategies: checkpoint-based preemption, time-sliced preemption, and priority inversion avoidance
- NUMA-aware placement across CPU sockets and GPU memory domains
- GPU memory oversubscription scheduling: eviction policy, swap-to-host, and coordinated demand management with council-mem
- Energy-aware scheduling: frequency/voltage scaling directives, power capping, and thermal headroom management
- Formal specification of the scheduling API surface exposed to user workloads and to other DistOS subsystems
Out of scope (delegated):
- Physical memory allocation and coherence protocol (delegated to council-mem)
- Network topology discovery and network-aware placement data (delegated to council-net; consumed as inputs)
- Metering, cost attribution, and SLO enforcement (delegated to council-telemetry; consumed as outputs)
- Security isolation between tenants at the hardware level (delegated to council-sec; policies consumed as constraints)
2. Research Domains
2.1 GPU Kernel Scheduling and SM Partitioning
Survey the CUDA Concurrent Kernel Execution model, the NVIDIA Multi-Instance GPU (MIG) architecture on Hopper (H100), and the NVIDIA Multi-Process Service (MPS). Understand how MIG partitions a GPU into isolated GPC slices with dedicated HBM3 memory, and how MPS enables concurrent kernel execution within a single GPU context without full MIG isolation overhead.
Key materials:
- NVIDIA H100 Architecture Technical Overview (2022) — SM partitioning, GPC layout, NVLink 4.0 bandwidth
- NVIDIA MIG User Guide (CUDA 12.x) — partition profiles (1g.10gb through 7g.80gb), instance isolation, compute and memory capacity tables
- NVIDIA MPS documentation — shared context model, client limit (48 for Hopper), error containment limitations
- AMD ROCm Hardware Abstraction Layer (HAL) source — amdkfd KFD driver, compute queue management, HWS (Hardware Scheduler) in GFX12
- Jain et al., "Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs" (RTAS 2019) — software-defined partitioning as a baseline comparison
- Xiao et al., "AntMan: Dynamic Scaling on GPU Clusters for Deep Learning" (OSDI 2020) — GPU memory and compute sharing for co-located workloads
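To make the MIG partitioning constraints concrete, the following sketch tracks compute and memory slice accounting for a single H100 using the per-profile footprints published in the MIG user guide (1g.10gb through 7g.80gb). The `H100MigState` class and its all-or-nothing admission check are illustrative only; real MIG placement additionally constrains which physical slice positions a profile may occupy.

```python
# Sketch: MIG slice accounting for one H100, assuming the
# compute/memory slice footprints from the NVIDIA MIG user guide.
# Profile names are real; the admission logic is illustrative.
from dataclasses import dataclass

# (compute slices of 7, memory slices of 8) per MIG profile
PROFILES = {
    "1g.10gb": (1, 1),
    "2g.20gb": (2, 2),
    "3g.40gb": (3, 4),
    "4g.40gb": (4, 4),
    "7g.80gb": (7, 8),
}

@dataclass
class H100MigState:
    compute_free: int = 7
    memory_free: int = 8

    def try_allocate(self, profile: str) -> bool:
        """All-or-nothing admission of one MIG instance."""
        c, m = PROFILES[profile]
        if c <= self.compute_free and m <= self.memory_free:
            self.compute_free -= c
            self.memory_free -= m
            return True
        return False

gpu = H100MigState()
assert gpu.try_allocate("3g.40gb")      # 4/7 compute, 4/8 memory left
assert gpu.try_allocate("3g.40gb")      # 1/7 compute, 0/8 memory left
assert not gpu.try_allocate("1g.10gb")  # memory slices exhausted
```

Note the asymmetry in 3g.40gb (3 compute slices but 4 memory slices): memory, not compute, can become the binding resource, which matters for the partition model decided in DP-SCHED-001.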
2.2 Heterogeneous Workload Placement
Develop a placement model that accounts for the distinct performance characteristics of H100 (PCIe/SXM5), GH200 Grace Superchip (NVLink-C2C), and B200 Blackwell (NVLink 5.0 SXM). Each accelerator type has different compute density, memory bandwidth, and interconnect topology.
Key materials:
- Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (ISCA 2017) — heterogeneous accelerator placement rationale
- Google Borg paper: Verma et al., "Large-scale cluster management at Google with Borg" (EuroSys 2015) — machine heterogeneity handling, alloc sets, resource estimation
- Qiao et al., "Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning" (OSDI 2021) — goodput-aware placement for training workloads
- Weng et al., "MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters" (NSDI 2022) — real-world heterogeneous GPU cluster scheduling from Alibaba
2.3 Fair Multi-Tenant Queuing
Design a multi-level queueing architecture with Dominant Resource Fairness (DRF) semantics extended for GPU resources (SM fraction, HBM bandwidth, NVLink bandwidth as co-dominant resources).
Key materials:
- Ghodsi et al., "Dominant Resource Fairness: Fair Allocation of Multiple Resource Types" (NSDI 2011) — foundational DRF theory
- Apache YARN CapacityScheduler and FairScheduler documentation — hierarchical queues, preemption policy, label-based node targeting
- Apache Mesos DRF implementation — DominantShareAllocator, offer model, role weights
- Kubernetes device plugin framework (k8s.io/device-plugins) — GPU resource advertisement, extended resource scheduling
- Tiresias: Gu et al., "Tiresias: A GPU Cluster Manager for Distributed Deep Learning" (NSDI 2019) — LAS (Least Attained Service) scheduling for DL jobs, 2DAS (2-dimensional attained service)
- Gandiva: Xiao et al., "Gandiva: Introspective Cluster Scheduling for Deep Learning" (OSDI 2018) — time-sliced GPU sharing, job packing, migration
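A minimal sketch of one DRF allocation step, extended with GPU dimensions (SM fraction, HBM bandwidth) as co-dominant resources alongside CPU and RAM, per the progressive-filling scheme of Ghodsi et al. The resource names, capacities, and demand vectors are illustrative placeholders, not a DistOS interface.

```python
# Minimal DRF step (Ghodsi et al., NSDI 2011) extended with GPU
# dimensions as co-dominant resources. All names and numbers are
# illustrative assumptions, not DistOS policy.

CAPACITY = {"cpu": 1024.0, "ram_gb": 8192.0, "sm_frac": 64.0, "hbm_gbps": 2048.0}

def dominant_share(usage):
    """A tenant's dominant share is its max usage fraction across resources."""
    return max(usage[r] / CAPACITY[r] for r in CAPACITY)

def drf_pick(tenants):
    """Progressive filling: serve the tenant with the lowest dominant share."""
    return min(tenants, key=lambda t: dominant_share(tenants[t]))

tenants = {
    "a": {"cpu": 100.0, "ram_gb": 200.0, "sm_frac": 32.0, "hbm_gbps": 100.0},
    "b": {"cpu": 400.0, "ram_gb": 100.0, "sm_frac": 4.0, "hbm_gbps": 100.0},
}
# Tenant a's dominant resource is SM fraction (32/64 = 0.5);
# tenant b's is CPU (400/1024 ≈ 0.39), so b is served next.
assert drf_pick(tenants) == "b"
```

The point of the extension is visible even in this toy: a GPU-heavy tenant and a CPU-heavy tenant are compared on different dominant resources, which is exactly the property DP-SCHED-003 must preserve as the dimension count grows.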
2.4 Gang Scheduling for Distributed Training
Gang scheduling ensures all processes in a distributed training job (e.g., a 512-GPU allreduce ring) are co-scheduled simultaneously. This is critical for preventing head-of-line blocking and deadlock in collective communication.
Key materials:
- Feitelson and Rudolph, "Gang Scheduling Performance Benefits for Fine-Grain Synchronization" (J. Parallel Distrib. Comput., 1992) — foundational gang scheduling analysis
- Rajachandrasekar et al., "A Closer Look at All-Reduce for Deep Learning" (Workshop at SC'19) — collective communication scheduling sensitivity
- Hwang et al., "AFS: Annotation-Free Automatic Sharding for Large Language Models" (ICLR 2023) — pipeline and tensor parallel placement co-scheduling
- Slurm gang scheduling documentation — Oversubscribe=FORCE, GraceTime, slurmctld gang plugin
- Jeon et al., "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads" (USENIX ATC 2019) — Microsoft Philly cluster trace, gang failure modes
2.5 Preemption Strategies
Survey checkpoint-based, reactive, and time-sliced preemption. Understand the cost of GPU checkpoint (saving SM register file, shared memory, and in-flight DMA) and design preemption policies that bound maximum latency impact.
Key materials:
- Park et al., "Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization" (USENIX ATC 2020) — container-level preemption
- Lin et al., "SHEPHERD: Serving DNNs in the Wild" (NSDI 2023) — latency SLO-aware preemption for inference workloads
- NVIDIA CUDA checkpoint/restore (CRIU for GPU) — experimental support status as of CUDA 12.x
- Wang et al., "Achieving Microsecond-Scale Tail Latency Efficiently with Approximate Optimal Scheduling" (SOSP 2017) — preemption-aware scheduling theory
2.6 NUMA-Aware and Topology-Aware Placement
Model the NUMA topology of a Grace Superchip node: 72-core Arm Neoverse V2 CPU, 480 GB LPDDR5X at ~512 GB/s, connected to H100 GPU via NVLink-C2C at 900 GB/s. Compare this with standard SXM5 nodes where CPU-GPU bandwidth is limited to PCIe Gen5 (~128 GB/s bidirectional).
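A back-of-envelope calculation using the peak bandwidths above shows why CPU-GPU staging cost differs so sharply by node type (real transfers achieve less than peak, so these are lower bounds):

```python
# Host-to-GPU staging time for 80 GB of model state, using the peak
# bandwidth figures cited above. Illustrative arithmetic only.
weights_gb = 80.0
t_c2c = weights_gb / 900.0   # NVLink-C2C on a Grace Superchip node
t_pcie = weights_gb / 128.0  # PCIe Gen5 bidirectional peak, SXM5 node
assert round(t_pcie / t_c2c, 1) == 7.0  # roughly a 7x gap
```

This factor-of-seven gap is why swap-to-host, checkpoint/restore, and migration policies (sections 2.5 and 2.7) cannot use a single node-agnostic cost model.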
Key materials:
- Linux NUMA scheduling documentation —
libnuma,numactl, CFS NUMA balancing (task_numa_migrate) - NVIDIA GH200 Grace Hopper Superchip Architecture Whitepaper (2023)
- Lepers et al., "Thread and Memory Placement on NUMA Systems: Asymmetry Matters" (USENIX ATC 2015)
- Blagodurov et al., "A Case for NUMA-Aware Contention Management on Multicore Systems" (USENIX ATC 2011)
2.7 GPU Memory Oversubscription
When aggregate GPU memory demand exceeds physical HBM3 capacity, the scheduler must coordinate with council-mem's eviction and tiering policies to decide which kernels to throttle, checkpoint, or migrate.
Key materials:
- Rhu et al., "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design" (MICRO 2016) — layer-by-layer activation offload
- Huang et al., "Efficient Large Scale Language Modeling with Mixtures of Experts" (EMNLP 2021) — memory-efficient model parallelism
- NVIDIA Unified Memory documentation (CUDA 12.x) — page migration engine, oversubscription, prefetch advisory API
- Kumar et al., "SwapAdvisor: Push Deep Learning Beyond the GPU Memory Limit via Smart Swapping" (ASPLOS 2020)
2.8 Energy-Aware Scheduling
Design scheduling policies that honour per-node power caps enforced by Baseboard Management Controllers (BMCs) and exploit DVFS (Dynamic Voltage and Frequency Scaling) headroom reported by council-telemetry.
Key materials:
- Patel et al., "Clite: Efficient and QoS Aware Co-location of Multiple Latency-Critical Jobs for Warehouse Scale Computers" (ISCA 2020)
- Lim et al., "Adaptive Power Management for Heterogeneous Multiprocessor SoCs" (ICCAD 2009)
- NVIDIA NVML power management API — nvmlDeviceSetPowerManagementLimit, nvmlDeviceGetCurrentClocksThrottleReasons
- AMD ROCm SMI library — power cap interface (rsmi_dev_power_cap_set)
- RAPL (Running Average Power Limit) interface for Grace CPU power management
3. Agent Roles
Total agents: 80
| Role | Count | Responsibilities |
|---|---|---|
| Lead Architect | 2 | Overall scheduling architecture decisions, cross-subsystem interface design, DR authorship for major decisions |
| Kernel Scheduling Researchers | 8 | CUDA/ROCm scheduler internals, MIG/MPS analysis, SM partitioning survey |
| Placement Researchers | 6 | Heterogeneous accelerator placement, topology modelling, workload profiling |
| Queueing Theory Specialists | 5 | DRF extensions for multi-resource GPU scheduling, fairness proof sketches |
| Gang Scheduling Specialists | 5 | Collective communication scheduling, all-or-nothing allocation protocols |
| Preemption Specialists | 4 | Checkpoint protocol design, preemption cost modelling |
| NUMA/Topology Analysts | 4 | NUMA topology modelling for Grace Superchip and SXM5 nodes |
| Energy Efficiency Researchers | 4 | Power capping, DVFS scheduling integration |
| Formal Specification Authors | 8 | TLA+ specification of scheduler state machine, safety and liveness invariants |
| Architects (sub-component) | 10 | Propose concrete scheduling algorithm designs for each domain |
| Internal Reviewers | 8 | Review research summaries and architecture proposals; cast green/yellow/red votes |
| Integration Liaisons | 6 | Interface with council-mem, council-net, council-telemetry, council-sec |
| Decision Record Authors | 5 | Author DRs for each resolved decision point; maintain UCXL provenance chain |
| Adversarial Critics | 5 | Challenge proposed designs; surface failure modes, starvation scenarios, livelock risks |
Role distribution rationale: The large researcher cohort (36 agents covering 8 research domains) reflects the breadth of scheduling literature. The formal specification group (8 agents) is sized to produce parallel TLA+ modules. Internal reviewers and adversarial critics (13 agents combined) ensure no architecture proposal passes without rigorous challenge.
4. Key Deliverables
All artifacts are published to the DHT and addressable via UCXL.
4.1 Research Summaries
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gpu-kernel-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/heterogeneous-placement.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/fair-queuing-drf.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gang-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/preemption-strategies.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/numa-topology.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/memory-oversubscription.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/energy-aware-scheduling.md
4.2 Architecture Proposals
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/scheduler-overview.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/mig-mps-partitioning-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/placement-scoring-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/drf-gpu-extension.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/gang-scheduler-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/preemption-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/energy-policy-interface.md
4.3 Decision Records
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-001-partition-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-002-placement-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-003-queuing-policy.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-004-gang-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-005-preemption-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-006-energy-interface.md
4.4 Formal Specifications
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/SchedulerStateMachine.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/GangScheduler.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/PreemptionProtocol.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/FairnessInvariants.tla
4.5 Interface Contracts (for other councils)
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-mem-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-net-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-telemetry-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-sec-contract.md
5. Decision Points
The following are the major architectural questions council-sched must resolve. Each decision must produce a Decision Record with alternatives considered, evidence from research, and rationale for the chosen option.
DP-SCHED-001: Primary Partition Model
Question: Should the scheduler use MIG (hardware-enforced partition isolation), MPS (shared context with software-enforced limits), or a hybrid model where MIG is used for multi-tenant isolation and MPS for co-located jobs within a single tenant's partition?
Factors: isolation strength, context switch overhead, minimum allocation granularity, Blackwell support parity, operational complexity.
Dependency: Decision informs the security isolation model that council-sec will specify.
DP-SCHED-002: Placement Scoring Architecture
Question: Should placement be computed by a centralised scoring service (Borg-style) or a distributed, bid-based negotiation (Mesos offer model)? How does topology affinity (NVLink domain proximity) weight against fairness constraints?
Factors: convergence time at 1024-node scale, fault tolerance of the placement service itself, stale topology information handling, NVLink bandwidth utilisation efficiency.
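Whichever architecture DP-SCHED-002 selects, the per-node score must blend topology affinity against fairness. A minimal sketch of such a score, with placeholder weights and features the council would need to calibrate (none of these names are a chosen design):

```python
# Illustrative placement score blending NVLink-domain affinity against
# tenant fairness headroom. Weights and field names are assumptions.
def placement_score(node, job, w_topo=0.6, w_fair=0.4):
    # Fraction of the job's peers already in this node's NVLink domain
    peers_local = len(job["peers"] & node["nvlink_domain_members"])
    topo = peers_local / max(len(job["peers"]), 1)
    # Favour tenants below their fair share (headroom in [0, 1])
    fair = max(0.0, 1.0 - job["tenant_dominant_share"])
    return w_topo * topo + w_fair * fair

node = {"nvlink_domain_members": {"j1", "j2", "j3"}}
job = {"peers": {"j1", "j2", "j4", "j5"}, "tenant_dominant_share": 0.25}
assert abs(placement_score(node, job) - (0.6 * 0.5 + 0.4 * 0.75)) < 1e-9
```

In a centralised Borg-style service this score is computed once per candidate node; in a Mesos-style offer model the same function would run inside each framework's bid logic, which is part of what DP-SCHED-002 must weigh.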
DP-SCHED-003: Multi-Resource Fairness Extension
Question: DRF was designed for CPU/memory. GPU workloads introduce additional dominant resources: SM fraction, HBM3 bandwidth, NVLink bandwidth, and NVSwitch port occupancy. How many resource dimensions does the fairness model track, and what is the computational complexity of DRF at 80+ resource types?
Factors: implementation tractability, approximation error bounds, interaction with per-tenant quota enforcement.
DP-SCHED-004: Gang Scheduling Protocol
Question: Should gang scheduling use a two-phase reservation (reserve then commit) protocol, speculative allocation with rollback, or a backfill-with-hold strategy? What is the maximum tolerable scheduling delay for a 1024-GPU gang job?
Factors: cluster utilisation impact, deadlock risk in two-phase protocols, interaction with preemption.
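The two-phase reservation option can be sketched as reserve-then-commit with rollback on partial failure. The node structures and in-process mutation below are hypothetical stand-ins for a distributed protocol; a real design also needs reservation leases so a crashed coordinator cannot strand held GPUs.

```python
# Sketch of two-phase (reserve/commit) gang allocation with rollback.
# All-or-nothing: a partial gang releases every tentative hold.
# Data structures are illustrative, not a DistOS interface.
def gang_schedule(nodes, gpus_needed):
    reserved = []  # (node, count) tentative holds
    need = gpus_needed
    for node in nodes:
        if need == 0:
            break
        take = min(node["free_gpus"], need)
        if take:
            node["free_gpus"] -= take       # phase 1: reserve
            reserved.append((node, take))
            need -= take
    if need > 0:                            # partial gang: roll back
        for node, take in reserved:
            node["free_gpus"] += take
        return None
    return [(n["name"], t) for n, t in reserved]  # phase 2: commit

cluster = [{"name": "n0", "free_gpus": 8}, {"name": "n1", "free_gpus": 8}]
assert gang_schedule(cluster, 12) == [("n0", 8), ("n1", 4)]
assert gang_schedule(cluster, 8) is None            # only 4 GPUs remain
assert [n["free_gpus"] for n in cluster] == [0, 4]  # rollback restored state
```

The deadlock risk named in the factors shows up even here: two coordinators greedily reserving from opposite ends of the node list can each hold half of what the other needs, which is why the protocol choice must specify an ordering or preemption rule for reservations.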
DP-SCHED-005: Preemption Granularity
Question: Should preemption operate at kernel-granularity (CUDA stream checkpointing), job-granularity (full process checkpoint via CRIU-for-GPU), or a hybrid that allows kernel preemption for short jobs and process-level preemption for long jobs?
Factors: checkpoint latency (kernel-level: microseconds; process-level: seconds to tens of seconds for large model weights), HBM3 save/restore bandwidth cost, preemption frequency requirements.
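The hybrid option reduces to a cost comparison: process-level checkpoint time is dominated by draining resident HBM state over the host link, so the choice can be made per-preemption against a latency budget. The constants below are illustrative assumptions (one-way link rates of ~64 GB/s for PCIe Gen5 and ~450 GB/s for NVLink-C2C), not measured figures.

```python
# Rough preemption-cost model for the hybrid option in DP-SCHED-005.
# Process-level cost ~ resident HBM state / host-link bandwidth.
# Link rates and the deadline policy are illustrative assumptions.
def process_preempt_ms(resident_gb, link_gbps):
    return resident_gb / link_gbps * 1000.0

def choose_preemption(resident_gb, link_gbps, deadline_ms):
    """Use a full process checkpoint only when its cost fits the
    preemption deadline; otherwise fall back to kernel time-slicing."""
    if process_preempt_ms(resident_gb, link_gbps) <= deadline_ms:
        return "process-checkpoint"
    return "kernel-timeslice"

# 60 GB of resident state over PCIe (~64 GB/s one-way) ≈ 937 ms,
# far beyond a 100 ms deadline; over C2C (~450 GB/s) ≈ 133 ms.
assert choose_preemption(60, 64, 100) == "kernel-timeslice"
assert choose_preemption(60, 450, 200) == "process-checkpoint"
```

Note the interaction with section 2.6: the same job is checkpointable within a 200 ms budget on a Grace node but not on an SXM5 node, so preemption policy cannot be specified independently of placement.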
DP-SCHED-006: Grace Superchip Scheduling Model
Question: For GH200 nodes with NVLink-C2C, the CPU and GPU share a unified virtual address space. Should the scheduler treat CPU and GPU on a Grace node as a single scheduling unit, or as two separate but affinity-linked resources? How does this interact with the memory model specified by council-mem?
Factors: utilisation efficiency, memory model consistency requirements, API ergonomics for user workloads, migration cost if job is split across a C2C boundary.
DP-SCHED-007: Energy Policy Interface
Question: Should energy-aware scheduling be a first-class scheduling objective (a weight in the placement score function), a hard constraint (power cap as a resource limit), or an advisory mechanism (scheduler receives power headroom hints and applies them at its discretion)?
Factors: SLO compatibility, predictability of power-capped execution, interaction with thermal management, coordination protocol with council-telemetry.
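Two of the three options above can be contrasted on a single example: power as a hard admission constraint versus an advisory penalty folded into the placement score. Wattages and the penalty shape are illustrative assumptions only.

```python
# Sketch contrasting two DP-SCHED-007 options. Numbers are placeholders.
def admits_hard_cap(node_draw_w, job_est_w, node_cap_w):
    """Hard constraint: never place a job that would exceed the cap."""
    return node_draw_w + job_est_w <= node_cap_w

def advisory_headroom(node_draw_w, job_est_w, node_cap_w):
    """Advisory: placement stays legal, but the score shrinks as
    projected draw approaches the cap (clamped at zero)."""
    headroom = 1.0 - (node_draw_w + job_est_w) / node_cap_w
    return max(headroom, 0.0)

assert not admits_hard_cap(5200, 900, 6000)        # 6100 W > 6 kW cap
assert admits_hard_cap(4800, 900, 6000)
assert advisory_headroom(5200, 900, 6000) == 0.0   # no headroom left
assert abs(advisory_headroom(4800, 900, 6000) - 0.05) < 1e-9
```

The example makes the trade-off concrete: the hard cap gives predictability (the factor named above) at the cost of rejecting placements the advisory policy would merely deprioritise.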
6. Dependencies
6.1 What council-sched Needs from Other Councils
| Dependency | Source Council | Artifact | Purpose |
|---|---|---|---|
| Memory pressure signals and eviction cost model | council-mem | ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-sched-contract.md | GPU memory oversubscription scheduling requires eviction cost estimates to avoid scheduling kernels that will immediately thrash |
| HBM3/DDR5 bandwidth allocation model | council-mem | ucxl://council-mem:architect@DistOS:memory/*^/architecture/bandwidth-allocation.md | Bandwidth is a co-dominant scheduling resource; model must be consistent |
| Network topology map (NVLink domains, IB fabric) | council-net | ucxl://council-net:architect@DistOS:networking/*^/architecture/topology-model.md | Topology-aware placement requires accurate NVLink domain membership and IB bisection bandwidth per rack |
| Network bandwidth reservation API | council-net | ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-sched-contract.md | Gang scheduling for allreduce jobs requires co-reserving network bandwidth |
| Resource metering API contract | council-telemetry | ucxl://council-telemetry:architect@DistOS:telemetry/*^/interfaces/telemetry-to-sched-contract.md | Scheduler must feed placement and execution events to telemetry; power headroom signals flow back |
| Security isolation constraints | council-sec | ucxl://council-sec:architect@DistOS:security/*^/interfaces/sec-to-sched-contract.md | Which tenants may share a GPU partition, MPS context constraints, minimum isolation requirements per tenant class |
6.2 What Other Councils Need from council-sched
| Consumer Council | Artifact Required | Purpose |
|---|---|---|
| council-mem | Placement decisions and GPU assignment map | Memory subsystem needs to know which GPU a job is placed on to configure HBM3 address space and NVLink fabric attachment |
| council-mem | Preemption events and checkpoint triggers | Memory tiering must snapshot GPU memory on preemption |
| council-net | Job placement map with NVLink domain assignments | Network subsystem configures NCCL topology files and RDMA QP mappings based on scheduler placement |
| council-telemetry | Scheduling events stream (enqueue, dequeue, preempt, complete) | Metering and cost attribution require per-job lifecycle events |
| council-sec | Partition assignment per tenant | Security subsystem enforces isolation based on scheduler-assigned partition IDs |
| council-verify | TLA+ scheduler state machine spec | Formal verification council model-checks scheduler invariants |
7. WHOOSH Configuration
# WHOOSH council formation configuration for council-sched
council_id: council-sched
project: DistOS
subsystem: scheduling
gitea_label: chorus-entrypoint
gitea_repo: distos/scheduling
formation:
target_agents: 80
min_agents: 60
wave:
max_per_wave: 12
min_per_wave: 6
period_sec: 30
placement:
max_replicas_per_node: 2
join_stagger_ms: 2000
bootstrap_peers_min: 5
roles:
- role: lead-architect
count: 2
model: claude-opus-4-6
priority: high
- role: researcher
count: 36
model: qwen2.5-coder:32b
priority: normal
subgroups:
- tag: kernel-scheduling
count: 8
- tag: placement
count: 6
- tag: queuing-theory
count: 5
- tag: gang-scheduling
count: 5
- tag: preemption
count: 4
- tag: numa-topology
count: 4
- tag: energy
count: 4
- role: architect
count: 10
model: claude-opus-4-6
priority: normal
- role: verifier
count: 8
model: deepseek-coder-v2
priority: normal
- role: reviewer
count: 8
model: claude-opus-4-6
priority: normal
- role: integration-liaison
count: 6
model: qwen2.5-coder:32b
priority: normal
- role: decision-record-author
count: 5
model: claude-opus-4-6
priority: normal
- role: adversarial-critic
count: 5
model: claude-opus-4-6
priority: normal
subchannels:
- name: sched-research
description: "Research discussion and literature synthesis"
participants: [researcher, lead-architect]
pubsub: true
- name: sched-architecture
description: "Architecture proposal discussion and voting"
participants: [architect, lead-architect, reviewer, adversarial-critic]
pubsub: true
- name: sched-formal-spec
description: "TLA+ specification authoring and review"
participants: [verifier, lead-architect, reviewer]
pubsub: false
- name: sched-integration
description: "Cross-council interface negotiation"
participants: [integration-liaison, lead-architect]
pubsub: false
- name: sched-decisions
description: "Decision record authoring and consensus"
participants: [decision-record-author, lead-architect, reviewer]
pubsub: true
quorum:
# Architecture decisions require supermajority
architecture_changes:
policy: supermajority
threshold: 0.667
require_domain_role: true
require_quality_role: true
beat_minutes: 20
timeout_beats: 6
# Research summaries require simple majority
research_summaries:
policy: simple_majority
threshold: 0.5
require_domain_role: true
require_quality_role: false
beat_minutes: 15
timeout_beats: 4
# Formal specifications require supermajority with verifier sign-off
formal_specs:
policy: supermajority
threshold: 0.667
require_domain_role: true
require_quality_role: true
require_verifier: true
beat_minutes: 25
timeout_beats: 8
# Cross-council interface contracts require unanimous lead-architect approval
interface_contracts:
policy: unanimous
roles: [lead-architect, integration-liaison]
beat_minutes: 30
timeout_beats: 4
gates:
kaching:
p95_latency_ms: 250
max_error_rate: 0.01
backbeat:
max_stream_lag: 200
bootstrap:
min_healthy_peers: 5
join:
min_success_rate: 0.80
review:
beat_minutes: 20
quorum:
total_min: 3
require_domain_role: true
require_quality_role: true
timeout_beats: 6
no_self_approval: true
8. Success Criteria
- Research completeness: All 8 research domain summaries published to DHT with at least 5 primary references each, reviewed and approved by at least 3 council agents.
- Architecture coverage: Architectural proposals exist for all 7 major scheduling components (kernel dispatch, placement, queuing, gang, preemption, NUMA, energy). Each proposal addresses the 1024-node scale constraint explicitly.
- Decision records resolved: All 7 decision points (DP-SCHED-001 through DP-SCHED-007) have a corresponding Decision Record with at least 3 alternatives considered, evidence citations, and a chosen option ratified by council supermajority.
- Formal specifications: TLA+ specifications for the scheduler state machine, gang scheduling protocol, and preemption protocol. At least 2 of these must have model-checked safety invariants (no starvation for highest-priority jobs, gang deadlock freedom) verified by council-verify.
- Interface contracts ratified: All 4 interface contracts (to council-mem, council-net, council-telemetry, council-sec) are co-signed by integration liaisons from both councils.
- UCXL navigability: A human unfamiliar with the project should be able to navigate from any Decision Record to the research summary that motivated it using only UCXL temporal navigation, within 5 hops.
- Adversarial review pass: Each major architecture proposal has at minimum one adversarial critique documented and a resolution recorded. No proposal advances to formal specification with an unresolved red-vote critique.
9. Timeline
Phase 1: Research and Survey (Days 1-3)
Day 1:
- WHOOSH forms council; all 80 agents join via wave deployment
- Researchers self-assign to domain subgroups via the sched-research subchannel
- Literature survey begins: GPU kernel scheduling, placement, and fair queuing domains prioritised first (these are on the critical path for Decision Points DP-SCHED-001 through DP-SCHED-003)
- Integration liaisons make initial contact with council-mem and council-net to understand their timelines
Day 2:
- Remaining research domains surveyed: gang scheduling, preemption, NUMA, energy
- Research summaries drafted in parallel across subgroups
- First internal review cycle: reviewers read summaries and post green/yellow/red votes with rationale
- Lead architects synthesise research findings into a preliminary scheduling design space map
Day 3:
- Research summaries revised based on review feedback; final versions published to DHT
- Adversarial critics challenge assumptions in each summary
- Research phase gate: all 8 summaries must achieve simple majority approval before Phase 2 begins
- Preliminary interface contract outlines shared with dependency councils
Phase 2: Architecture and Trade-offs (Days 3-6)
Day 3-4:
- Architects propose concrete options for DP-SCHED-001 (partition model) and DP-SCHED-002 (placement scoring) — these are the highest-dependency decisions
- Adversarial critics engage immediately; all alternatives documented
- DP-SCHED-001 decision record drafted; council votes; DR published
Day 4-5:
- Queuing model (DP-SCHED-003), gang scheduling (DP-SCHED-004), and preemption (DP-SCHED-005) design proposals concurrently authored
- Inter-council synthesis session with council-mem to align on oversubscription and eviction signal interfaces
- Inter-council synthesis session with council-net to align on topology model input format
Day 5-6:
- Grace Superchip model (DP-SCHED-006) and energy interface (DP-SCHED-007) decisions resolved
- All 7 Decision Records drafted; supermajority vote on each
- Architecture overview document assembled from approved DRs
- Architecture phase gate: all DPs resolved before Phase 3 begins
Phase 3: Formal Specification (Days 6-10)
Day 6-7:
- Verifiers begin TLA+ specification of the scheduler state machine based on approved architecture
- Architects continue refining component-level designs to resolve any ambiguities surfaced during spec authoring
- council-verify engaged: share spec drafts for early model-checking feedback
Day 7-8:
- Gang scheduler TLA+ module authored; parallel with preemption protocol spec
- Fairness invariants formally stated: no starvation under bounded load, gang deadlock freedom, DRF monotonicity
Day 8-10:
- Model checking runs submitted to council-verify
- Counterexample analysis: any liveness or safety violations trigger architecture revision and updated DRs
- Formal spec versions pinned in DHT; UCXL addresses published to dependent councils
Phase 4: Integration and Review (Days 10-12)
Day 10-11:
- Interface contracts with council-mem, council-net, council-telemetry, and council-sec finalised and submitted for co-signature
- Cross-council integration session: scheduling placement decisions validated against network topology model
- council-synth engaged for any unresolved conflicts with other councils' specifications
Day 11-12:
- Final council review of complete scheduling specification
- Adversarial critics run end-to-end failure scenario analysis
- Any remaining yellow votes addressed with documented mitigations
- Integration review gate: all interface contracts co-signed
Phase 5: Documentation and Narrative (Days 12-14)
Day 12-13:
- Decision record authors produce narrative summaries of the scheduling architecture journey
- council-docs receives the complete scheduling specification package for standardised formatting
- UCXL navigability audit: spot-check 10 random decision paths for completeness
Day 14:
- Final specification published
- council-arch decision archaeology agents generate human-readable narrative of scheduling design evolution
- Council formally dissolved; agents released back to WHOOSH pool