CHORUS/docs/distos/councils/01-process-scheduling.md

Council Design Brief: Process Scheduling

Council ID: council-sched
Mission: Design the heterogeneous process scheduling subsystem for DistOS, covering GPU kernel dispatch, workload placement across Hopper/Grace/Blackwell accelerators, fair multi-tenant queuing, gang scheduling for distributed training, and energy-aware execution across a 1024-node cluster.
UCXL Base Address: ucxl://council-sched:*@DistOS:scheduling/*
Agent Count: 80
Status: Constitution Phase — awaiting WHOOSH formation trigger
Created: 2026-02-24


1. Scope and Responsibilities

council-sched owns the complete specification of the DistOS process and kernel scheduling subsystem. Scope boundaries are defined as follows.

In scope:

  • GPU kernel scheduling: dispatch queue management, kernel concurrency, SM partitioning on Hopper (MIG) and Blackwell, and MPS (Multi-Process Service) lifetime management
  • CPU-GPU co-scheduling on Grace Superchip (NVLink-C2C coherent interconnect), including unified virtual address space scheduling implications
  • Workload placement policy across heterogeneous accelerator types (H100, GH200, B200), including topology-aware affinity scoring
  • Fair multi-tenant queuing: priority classes, weighted fair queuing, and quota enforcement
  • Gang scheduling for distributed training workloads: all-or-nothing allocation, partial-allocation strategies, and backfill
  • Preemption strategies: checkpoint-based preemption, time-sliced preemption, and priority inversion avoidance
  • NUMA-aware placement across CPU sockets and GPU memory domains
  • GPU memory oversubscription scheduling: eviction policy, swap-to-host, and coordinated demand management with council-mem
  • Energy-aware scheduling: frequency/voltage scaling directives, power capping, and thermal headroom management
  • Formal specification of the scheduling API surface exposed to user workloads and to other DistOS subsystems

Out of scope (delegated):

  • Physical memory allocation and coherence protocol (delegated to council-mem)
  • Network topology discovery and network-aware placement data (delegated to council-net; consumed as inputs)
  • Metering, cost attribution, and SLO enforcement (delegated to council-telemetry; consumed as outputs)
  • Security isolation between tenants at the hardware level (delegated to council-sec; policies consumed as constraints)

2. Research Domains

2.1 GPU Kernel Scheduling and SM Partitioning

Survey the CUDA Concurrent Kernel Execution model, the NVIDIA Multi-Instance GPU (MIG) architecture on Hopper (H100), and the NVIDIA Multi-Process Service (MPS). Understand how MIG partitions a GPU into isolated GPC slices with dedicated HBM3 memory, and how MPS enables concurrent kernel execution within a single GPU context without full MIG isolation overhead.

Key materials:

  • NVIDIA H100 Architecture Technical Overview (2022) — SM partitioning, GPC layout, NVLink 4.0 bandwidth
  • NVIDIA MIG User Guide (CUDA 12.x) — partition profiles (1g.10gb through 7g.80gb), instance isolation, compute and memory capacity tables
  • NVIDIA MPS documentation — shared context model, client limit (48 for Hopper), error containment limitations
  • AMD ROCm Hardware Abstraction Layer (HAL) source — amdkfd KFD driver, compute queue management, HWS (Hardware Scheduler) in GFX12
  • Jain et al., "Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs" (RTAS 2019) — software-defined partitioning as a baseline comparison
  • Xiao et al., "AntMan: Dynamic Scaling on GPU Clusters for Deep Learning" (OSDI 2020) — GPU memory and compute sharing for co-located workloads

2.2 Heterogeneous Workload Placement

Develop a placement model that accounts for the distinct performance characteristics of H100 (PCIe/SXM5), GH200 Grace Superchip (NVLink-C2C), and B200 Blackwell (NVLink 5.0 SXM). Each accelerator type has different compute density, memory bandwidth, and interconnect topology.

Key materials:

  • Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (ISCA 2017) — heterogeneous accelerator placement rationale
  • Google Borg paper: Verma et al., "Large-scale cluster management at Google with Borg" (EuroSys 2015) — machine heterogeneity handling, alloc sets, resource estimation
  • Qiao et al., "Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning" (OSDI 2021) — goodput-aware placement for training workloads
  • Weng et al., "MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters" (NSDI 2022) — real-world heterogeneous GPU cluster scheduling from Alibaba

2.3 Fair Multi-Tenant Queuing

Design a multi-level queueing architecture with Dominant Resource Fairness (DRF) semantics extended for GPU resources (SM fraction, HBM bandwidth, NVLink bandwidth as co-dominant resources).
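As a reference point for that extension, the core DRF rule carries over directly: a tenant's dominant share is its largest usage-to-capacity ratio across all tracked resources, and the next allocation goes to the tenant with the smallest dominant share. The per-GPU capacities below are illustrative placeholders:

```python
# Minimal DRF sketch extended with GPU dimensions. Resource names and
# capacity figures are illustrative, not the council's decided model.
CAPACITY = {"sm": 1.0, "hbm_bw": 3350.0, "nvlink_bw": 900.0}  # per-GPU units

def dominant_share(usage: dict[str, float]) -> float:
    """A tenant's dominant share is its largest usage/capacity ratio."""
    return max(usage[r] / CAPACITY[r] for r in CAPACITY)

def next_tenant(usages: dict[str, dict[str, float]]) -> str:
    """DRF grants the next allocation to the tenant with the smallest dominant share."""
    return min(usages, key=lambda t: dominant_share(usages[t]))
```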

Key materials:

  • Ghodsi et al., "Dominant Resource Fairness: Fair Allocation of Multiple Resource Types" (NSDI 2011) — foundational DRF theory
  • Apache YARN CapacityScheduler and FairScheduler documentation — hierarchical queues, preemption policy, label-based node targeting
  • Apache Mesos DRF implementation — DominantShareAllocator, offer model, role weights
  • Kubernetes device plugin framework (k8s.io/device-plugins) — GPU resource advertisement, extended resource scheduling
  • Tiresias: Gu et al., "Tiresias: A GPU Cluster Manager for Distributed Deep Learning" (NSDI 2019) — LAS (Least Attained Service) scheduling for DL jobs, 2DAS (2-dimensional attained service)
  • Gandiva: Xiao et al., "Gandiva: Introspective Cluster Scheduling for Deep Learning" (OSDI 2018) — time-sliced GPU sharing, job packing, migration

2.4 Gang Scheduling for Distributed Training

Gang scheduling ensures that all processes in a distributed training job (e.g., a 512-GPU allreduce ring) are scheduled and started together. This is critical for preventing head-of-line blocking and deadlock in collective communication.

Key materials:

  • Feitelson and Rudolph, "Gang Scheduling Performance Benefits for Fine-Grain Synchronization" (J. Parallel Distrib. Comput., 1992) — foundational gang scheduling analysis
  • Rajachandrasekar et al., "A Closer Look at All-Reduce for Deep Learning" (Workshop at SC'19) — collective communication scheduling sensitivity
  • Hwang et al., "AFS: Annotation-Free Automatic Sharding for Large Language Models" (ICLR 2023) — pipeline and tensor parallel placement co-scheduling
  • Slurm gang scheduling documentation — Oversubscribe=FORCE, GraceTime, slurmctld gang plugin
  • Jeon et al., "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads" (USENIX ATC 2019) — Microsoft Philly cluster trace, gang failure modes

2.5 Preemption Strategies

Survey checkpoint-based, reactive, and time-sliced preemption. Understand the cost of a GPU checkpoint (saving the SM register file, shared memory, and in-flight DMA state) and design preemption policies that bound the maximum latency impact.

Key materials:

  • Park et al., "Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization" (USENIX ATC 2020) — container-level preemption
  • Lin et al., "SHEPHERD: Serving DNNs in the Wild" (NSDI 2023) — latency SLO-aware preemption for inference workloads
  • NVIDIA CUDA checkpoint/restore (CRIU for GPU) — experimental support status as of CUDA 12.x
  • Wang et al., "Achieving Microsecond-Scale Tail Latency Efficiently with Approximate Optimal Scheduling" (SOSP 2017) — preemption-aware scheduling theory

2.6 NUMA-Aware and Topology-Aware Placement

Model the NUMA topology of a Grace Superchip node: 72-core Arm Neoverse V2 CPU, 480 GB LPDDR5X at ~512 GB/s, connected to H100 GPU via NVLink-C2C at 900 GB/s. Compare this with standard SXM5 nodes where CPU-GPU bandwidth is limited to PCIe Gen5 (~128 GB/s bidirectional).
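A back-of-envelope comparison using the bandwidth figures above shows why this asymmetry matters for placement (peak rates, ignoring protocol and setup overhead):

```python
# Illustrative host-to-GPU transfer times using the peak bandwidths quoted
# above; real transfers will see lower effective rates.
NVLINK_C2C_GBPS = 900.0  # GH200 NVLink-C2C
PCIE_GEN5_GBPS = 64.0    # PCIe Gen5 x16, one direction (~128 GB/s bidirectional)

def transfer_s(gb: float, bw_gbps: float) -> float:
    """Time in seconds to move `gb` gigabytes at `bw_gbps` GB/s."""
    return gb / bw_gbps

# A 100 GB working set: ~0.11 s over C2C vs ~1.6 s over PCIe, a gap of
# roughly 14x that the placement model must weigh for GH200 vs SXM5 nodes.
```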

Key materials:

  • Linux NUMA scheduling documentation — libnuma, numactl, CFS NUMA balancing (task_numa_migrate)
  • NVIDIA GH200 Grace Hopper Superchip Architecture Whitepaper (2023)
  • Lepers et al., "Thread and Memory Placement on NUMA Systems: Asymmetry Matters" (USENIX ATC 2015)
  • Blagodurov et al., "A Case for NUMA-Aware Contention Management on Multicore Systems" (USENIX ATC 2011)

2.7 GPU Memory Oversubscription

When aggregate GPU memory demand exceeds physical HBM3 capacity, the scheduler must coordinate with council-mem's eviction and tiering policies to decide which kernels to throttle, checkpoint, or migrate.
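One way to frame that coordination, purely as an illustration: given cost estimates supplied by council-mem (the parameter names here are hypothetical), the scheduler picks the cheapest relief action that fits the affected job's latency slack:

```python
# Hypothetical pressure-relief arbiter. The action set, cost inputs, and
# the throttle fallback are illustrative assumptions for discussion.
def relief_action(evict_cost_s: float, ckpt_cost_s: float,
                  migrate_cost_s: float, slack_s: float) -> str:
    """Pick the cheapest action whose cost fits the job's latency slack."""
    options = {"evict": evict_cost_s, "checkpoint": ckpt_cost_s,
               "migrate": migrate_cost_s}
    feasible = {a: c for a, c in options.items() if c <= slack_s}
    if not feasible:
        return "throttle"  # nothing fits: degrade the kernel's rate instead
    return min(feasible, key=feasible.get)
```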

Key materials:

  • Rhu et al., "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design" (MICRO 2016) — layer-by-layer activation offload
  • Huang et al., "Efficient Large Scale Language Modeling with Mixtures of Experts" (EMNLP 2021) — memory-efficient model parallelism
  • NVIDIA Unified Memory documentation (CUDA 12.x) — page migration engine, oversubscription, prefetch advisory API
  • Huang et al., "SwapAdvisor: Push Deep Learning Beyond the GPU Memory Limit via Smart Swapping" (ASPLOS 2020)

2.8 Energy-Aware Scheduling

Design scheduling policies that honour per-node power caps enforced by Baseboard Management Controllers (BMCs) and exploit DVFS (Dynamic Voltage and Frequency Scaling) headroom reported by council-telemetry.
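A minimal sketch of the admission side of such a policy, assuming council-telemetry supplies the node's current draw and a per-job power estimate (both assumed inputs; the safety margin is arbitrary):

```python
# Power-cap admission check sketch. The inputs and the 5% margin are
# illustrative assumptions, not a specified DistOS interface.
def admit_under_cap(node_cap_w: float, node_draw_w: float,
                    job_est_w: float, margin: float = 0.05) -> bool:
    """Admit a job only if estimated total draw stays under the BMC cap
    with a safety margin for estimation error."""
    return node_draw_w + job_est_w <= node_cap_w * (1 - margin)
```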

Key materials:

  • Patel et al., "Clite: Efficient and QoS Aware Co-location of Multiple Latency-Critical Jobs for Warehouse Scale Computers" (ISCA 2020)
  • Lim et al., "Adaptive Power Management for Heterogeneous Multiprocessor SoCs" (ICCAD 2009)
  • NVIDIA NVML power management API — nvmlDeviceSetPowerManagementLimit, nvmlDeviceGetCurrentClocksThrottleReasons
  • AMD ROCm SMI library — power cap interface (rsmi_dev_power_cap_set)
  • RAPL (Running Average Power Limit) — Intel's in-band power-capping interface, surveyed as a design reference for Grace CPU power management

3. Agent Roles

Total agents: 80

| Role | Count | Responsibilities |
| --- | --- | --- |
| Lead Architect | 2 | Overall scheduling architecture decisions, cross-subsystem interface design, DR authorship for major decisions |
| Kernel Scheduling Researchers | 8 | CUDA/ROCm scheduler internals, MIG/MPS analysis, SM partitioning survey |
| Placement Researchers | 6 | Heterogeneous accelerator placement, topology modelling, workload profiling |
| Queueing Theory Specialists | 5 | DRF extensions for multi-resource GPU scheduling, fairness proof sketches |
| Gang Scheduling Specialists | 5 | Collective communication scheduling, all-or-nothing allocation protocols |
| Preemption Specialists | 4 | Checkpoint protocol design, preemption cost modelling |
| NUMA/Topology Analysts | 4 | NUMA topology modelling for Grace Superchip and SXM5 nodes |
| Energy Efficiency Researchers | 4 | Power capping, DVFS scheduling integration |
| Formal Specification Authors | 8 | TLA+ specification of scheduler state machine, safety and liveness invariants |
| Architects (sub-component) | 10 | Propose concrete scheduling algorithm designs for each domain |
| Internal Reviewers | 8 | Review research summaries and architecture proposals; cast green/yellow/red votes |
| Integration Liaisons | 6 | Interface with council-mem, council-net, council-telemetry, council-sec |
| Decision Record Authors | 5 | Author DRs for each resolved decision point; maintain UCXL provenance chain |
| Adversarial Critics | 5 | Challenge proposed designs; surface failure modes, starvation scenarios, livelock risks |

Role distribution rationale: The large researcher cohort (36 agents covering 8 research domains) reflects the breadth of scheduling literature. The formal specification group (8 agents) is sized to produce parallel TLA+ modules. Internal reviewers and adversarial critics (13 agents combined) ensure no architecture proposal passes without rigorous challenge.


4. Key Deliverables

All artifacts are published to the DHT and addressable via UCXL.

4.1 Research Summaries

ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gpu-kernel-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/heterogeneous-placement.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/fair-queuing-drf.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gang-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/preemption-strategies.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/numa-topology.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/memory-oversubscription.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/energy-aware-scheduling.md

4.2 Architecture Proposals

ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/scheduler-overview.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/mig-mps-partitioning-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/placement-scoring-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/drf-gpu-extension.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/gang-scheduler-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/preemption-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/energy-policy-interface.md

4.3 Decision Records

ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-001-partition-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-002-placement-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-003-queuing-policy.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-004-gang-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-005-preemption-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-006-energy-interface.md

4.4 Formal Specifications

ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/SchedulerStateMachine.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/GangScheduler.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/PreemptionProtocol.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/FairnessInvariants.tla

4.5 Interface Contracts (for other councils)

ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-mem-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-net-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-telemetry-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-sec-contract.md

5. Decision Points

The following are the major architectural questions council-sched must resolve. Each decision must produce a Decision Record with alternatives considered, evidence from research, and rationale for the chosen option.

DP-SCHED-001: Primary Partition Model

Question: Should the scheduler use MIG (hardware-enforced partition isolation), MPS (shared context with software-enforced limits), or a hybrid model where MIG is used for multi-tenant isolation and MPS for co-located jobs within a single tenant's partition?

Factors: isolation strength, context switch overhead, minimum allocation granularity, Blackwell support parity, operational complexity.

Dependency: Decision informs the security isolation model that council-sec will specify.

DP-SCHED-002: Placement Scoring Architecture

Question: Should placement be computed by a centralised scoring service (Borg-style) or a distributed, bid-based negotiation (Mesos offer model)? How does topology affinity (NVLink domain proximity) weight against fairness constraints?

Factors: convergence time at 1024-node scale, fault tolerance of the placement service itself, stale topology information handling, NVLink bandwidth utilisation efficiency.

DP-SCHED-003: Multi-Resource Fairness Extension

Question: DRF was designed for CPU/memory. GPU workloads introduce additional dominant resources: SM fraction, HBM3 bandwidth, NVLink bandwidth, and NVSwitch port occupancy. How many resource dimensions does the fairness model track, and what is the computational complexity of DRF at 80+ resource types?

Factors: implementation tractability, approximation error bounds, interaction with per-tenant quota enforcement.

DP-SCHED-004: Gang Scheduling Protocol

Question: Should gang scheduling use a two-phase reservation (reserve then commit) protocol, speculative allocation with rollback, or a backfill-with-hold strategy? What is the maximum tolerable scheduling delay for a 1024-GPU gang job?

Factors: cluster utilisation impact, deadlock risk in two-phase protocols, interaction with preemption.
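To make the two-phase alternative concrete, here is a minimal reserve-then-commit sketch with all-or-nothing rollback. The class and method names are illustrative; a production protocol would also need hold timeouts and failure detection to bound the deadlock risk noted above:

```python
# Illustrative two-phase gang reservation. Names and structure are
# assumptions for discussion, not the protocol DP-SCHED-004 will decide.
class NodeAgent:
    def __init__(self) -> None:
        self.reserved_by: str | None = None

    def reserve(self, gang_id: str) -> bool:
        """Phase 1: tentatively hold this node for the gang."""
        if self.reserved_by is None:
            self.reserved_by = gang_id
            return True
        return False

    def release(self, gang_id: str) -> None:
        if self.reserved_by == gang_id:
            self.reserved_by = None

def gang_schedule(gang_id: str, nodes: list["NodeAgent"]) -> bool:
    """Reserve every node; on any failure, release all holds (no partial gangs)."""
    held = []
    for n in nodes:
        if n.reserve(gang_id):
            held.append(n)
        else:
            for h in held:  # rollback keeps the allocation all-or-nothing
                h.release(gang_id)
            return False
    return True  # phase 2 (commit) would launch the job here
```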

DP-SCHED-005: Preemption Granularity

Question: Should preemption operate at kernel-granularity (CUDA stream checkpointing), job-granularity (full process checkpoint via CRIU-for-GPU), or a hybrid that allows kernel preemption for short jobs and process-level preemption for long jobs?

Factors: checkpoint latency (kernel-level: microseconds; process-level: seconds to tens of seconds for large model weights), HBM3 save/restore bandwidth cost, preemption frequency requirements.
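The process-level figure follows from simple arithmetic: a full checkpoint must drain used HBM to host, so latency is bounded below by the host link bandwidth (illustrative peak rates; real checkpoints add CRIU metadata and DMA quiesce overhead):

```python
# Lower-bound estimate for process-level checkpoint latency: time to move
# the used HBM footprint over the CPU-GPU link. Figures are illustrative.
def process_ckpt_s(hbm_used_gb: float, host_link_gbps: float) -> float:
    return hbm_used_gb / host_link_gbps

# 70 GB of HBM over PCIe Gen5 (~64 GB/s one way): just over a second,
# versus ~0.08 s over 900 GB/s NVLink-C2C on a GH200 node.
```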

DP-SCHED-006: Grace Superchip Scheduling Model

Question: For GH200 nodes with NVLink-C2C, the CPU and GPU share a unified virtual address space. Should the scheduler treat CPU and GPU on a Grace node as a single scheduling unit, or as two separate but affinity-linked resources? How does this interact with the memory model specified by council-mem?

Factors: utilisation efficiency, memory model consistency requirements, API ergonomics for user workloads, migration cost if job is split across a C2C boundary.

DP-SCHED-007: Energy Policy Interface

Question: Should energy-aware scheduling be a first-class scheduling objective (a weight in the placement score function), a hard constraint (power cap as a resource limit), or an advisory mechanism (scheduler receives power headroom hints and applies them at its discretion)?

Factors: SLO compatibility, predictability of power-capped execution, interaction with thermal management, coordination protocol with council-telemetry.


6. Dependencies

6.1 What council-sched Needs from Other Councils

| Dependency | Source Council | Artifact | Purpose |
| --- | --- | --- | --- |
| Memory pressure signals and eviction cost model | council-mem | ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-sched-contract.md | GPU memory oversubscription scheduling requires eviction cost estimates to avoid scheduling kernels that will immediately thrash |
| HBM3/DDR5 bandwidth allocation model | council-mem | ucxl://council-mem:architect@DistOS:memory/*^/architecture/bandwidth-allocation.md | Bandwidth is a co-dominant scheduling resource; model must be consistent |
| Network topology map (NVLink domains, IB fabric) | council-net | ucxl://council-net:architect@DistOS:networking/*^/architecture/topology-model.md | Topology-aware placement requires accurate NVLink domain membership and IB bisection bandwidth per rack |
| Network bandwidth reservation API | council-net | ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-sched-contract.md | Gang scheduling for allreduce jobs requires co-reserving network bandwidth |
| Resource metering API contract | council-telemetry | ucxl://council-telemetry:architect@DistOS:telemetry/*^/interfaces/telemetry-to-sched-contract.md | Scheduler must feed placement and execution events to telemetry; power headroom signals flow back |
| Security isolation constraints | council-sec | ucxl://council-sec:architect@DistOS:security/*^/interfaces/sec-to-sched-contract.md | Which tenants may share a GPU partition, MPS context constraints, minimum isolation requirements per tenant class |

6.2 What Other Councils Need from council-sched

| Consumer Council | Artifact Required | Purpose |
| --- | --- | --- |
| council-mem | Placement decisions and GPU assignment map | Memory subsystem needs to know which GPU a job is placed on to configure HBM3 address space and NVLink fabric attachment |
| council-mem | Preemption events and checkpoint triggers | Memory tiering must snapshot GPU memory on preemption |
| council-net | Job placement map with NVLink domain assignments | Network subsystem configures NCCL topology files and RDMA QP mappings based on scheduler placement |
| council-telemetry | Scheduling events stream (enqueue, dequeue, preempt, complete) | Metering and cost attribution require per-job lifecycle events |
| council-sec | Partition assignment per tenant | Security subsystem enforces isolation based on scheduler-assigned partition IDs |
| council-verify | TLA+ scheduler state machine spec | Formal verification council model-checks scheduler invariants |

7. WHOOSH Configuration

# WHOOSH council formation configuration for council-sched
council_id: council-sched
project: DistOS
subsystem: scheduling
gitea_label: chorus-entrypoint
gitea_repo: distos/scheduling

formation:
  target_agents: 80
  min_agents: 60
  wave:
    max_per_wave: 12
    min_per_wave: 6
    period_sec: 30
  placement:
    max_replicas_per_node: 2
  join_stagger_ms: 2000
  bootstrap_peers_min: 5

roles:
  - role: lead-architect
    count: 2
    model: claude-opus-4-6
    priority: high
  - role: researcher
    count: 36
    model: qwen2.5-coder:32b
    priority: normal
    subgroups:
      - tag: kernel-scheduling
        count: 8
      - tag: placement
        count: 6
      - tag: queuing-theory
        count: 5
      - tag: gang-scheduling
        count: 5
      - tag: preemption
        count: 4
      - tag: numa-topology
        count: 4
      - tag: energy
        count: 4
  - role: architect
    count: 10
    model: claude-opus-4-6
    priority: normal
  - role: verifier
    count: 8
    model: deepseek-coder-v2
    priority: normal
  - role: reviewer
    count: 8
    model: claude-opus-4-6
    priority: normal
  - role: integration-liaison
    count: 6
    model: qwen2.5-coder:32b
    priority: normal
  - role: decision-record-author
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: adversarial-critic
    count: 5
    model: claude-opus-4-6
    priority: normal

subchannels:
  - name: sched-research
    description: "Research discussion and literature synthesis"
    participants: [researcher, lead-architect]
    pubsub: true
  - name: sched-architecture
    description: "Architecture proposal discussion and voting"
    participants: [architect, lead-architect, reviewer, adversarial-critic]
    pubsub: true
  - name: sched-formal-spec
    description: "TLA+ specification authoring and review"
    participants: [verifier, lead-architect, reviewer]
    pubsub: false
  - name: sched-integration
    description: "Cross-council interface negotiation"
    participants: [integration-liaison, lead-architect]
    pubsub: false
  - name: sched-decisions
    description: "Decision record authoring and consensus"
    participants: [decision-record-author, lead-architect, reviewer]
    pubsub: true

quorum:
  # Architecture decisions require supermajority
  architecture_changes:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    beat_minutes: 20
    timeout_beats: 6
  # Research summaries require simple majority
  research_summaries:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 4
  # Formal specifications require supermajority with verifier sign-off
  formal_specs:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    require_verifier: true
    beat_minutes: 25
    timeout_beats: 8
  # Cross-council interface contracts require unanimous lead-architect approval
  interface_contracts:
    policy: unanimous
    roles: [lead-architect, integration-liaison]
    beat_minutes: 30
    timeout_beats: 4

gates:
  kaching:
    p95_latency_ms: 250
    max_error_rate: 0.01
  backbeat:
    max_stream_lag: 200
  bootstrap:
    min_healthy_peers: 5
  join:
    min_success_rate: 0.80

review:
  beat_minutes: 20
  quorum:
    total_min: 3
    require_domain_role: true
    require_quality_role: true
  timeout_beats: 6
  no_self_approval: true

8. Success Criteria

  1. Research completeness: All 8 research domain summaries published to DHT with at least 5 primary references each, reviewed and approved by at least 3 council agents.

  2. Architecture coverage: Architectural proposals exist for all 7 major scheduling components (kernel dispatch, placement, queuing, gang, preemption, NUMA, energy). Each proposal addresses the 1024-node scale constraint explicitly.

  3. Decision records resolved: All 7 decision points (DP-SCHED-001 through DP-SCHED-007) have a corresponding Decision Record with at least 3 alternatives considered, evidence citations, and a chosen option ratified by council supermajority.

  4. Formal specifications: TLA+ specifications for the scheduler state machine, gang scheduling protocol, and preemption protocol. At least 2 of these must have model-checked safety invariants (no starvation for highest-priority jobs, gang deadlock freedom) verified by council-verify.

  5. Interface contracts ratified: All 4 interface contracts (to council-mem, council-net, council-telemetry, council-sec) are co-signed by integration liaisons from both councils.

  6. UCXL navigability: A human unfamiliar with the project should be able to navigate from any Decision Record to the research summary that motivated it using only UCXL temporal navigation, within 5 hops.

  7. Adversarial review pass: Each major architecture proposal has at minimum one adversarial critique documented and a resolution recorded. No proposal advances to formal specification with an unresolved red-vote critique.


9. Timeline

Phase 1: Research and Survey (Days 1-3)

Day 1:

  • WHOOSH forms council; all 80 agents join via wave deployment
  • Researchers self-assign to domain subgroups via sched-research subchannel
  • Literature survey begins: GPU kernel scheduling, placement, and fair queuing domains prioritised first (these are on the critical path for Decision Points DP-SCHED-001 through DP-SCHED-003)
  • Integration liaisons make initial contact with council-mem and council-net to understand their timelines

Day 2:

  • Remaining research domains surveyed: gang scheduling, preemption, NUMA, energy
  • Research summaries drafted in parallel across subgroups
  • First internal review cycle: reviewers read summaries and post green/yellow/red votes with rationale
  • Lead architects synthesise research findings into a preliminary scheduling design space map

Day 3:

  • Research summaries revised based on review feedback; final versions published to DHT
  • Adversarial critics challenge assumptions in each summary
  • Research phase gate: all 8 summaries must achieve simple majority approval before Phase 2 begins
  • Preliminary interface contract outlines shared with dependency councils

Phase 2: Architecture and Trade-offs (Days 3-6)

Day 3-4:

  • Architects propose concrete options for DP-SCHED-001 (partition model) and DP-SCHED-002 (placement scoring) — these are the highest-dependency decisions
  • Adversarial critics engage immediately; all alternatives documented
  • DP-SCHED-001 decision record drafted; council votes; DR published

Day 4-5:

  • Queuing model (DP-SCHED-003), gang scheduling (DP-SCHED-004), and preemption (DP-SCHED-005) design proposals concurrently authored
  • Inter-council synthesis session with council-mem to align on oversubscription and eviction signal interfaces
  • Inter-council synthesis session with council-net to align on topology model input format

Day 5-6:

  • Grace Superchip model (DP-SCHED-006) and energy interface (DP-SCHED-007) decisions resolved
  • All 7 Decision Records drafted; supermajority vote on each
  • Architecture overview document assembled from approved DRs
  • Architecture phase gate: all DPs resolved before Phase 3 begins

Phase 3: Formal Specification (Days 6-10)

Day 6-7:

  • Verifiers begin TLA+ specification of the scheduler state machine based on approved architecture
  • Architects continue refining component-level designs to resolve any ambiguities surfaced during spec authoring
  • council-verify engaged: share spec drafts for early model-checking feedback

Day 7-8:

  • Gang scheduler TLA+ module authored; parallel with preemption protocol spec
  • Fairness invariants formally stated: no starvation under bounded load, gang deadlock freedom, DRF monotonicity

Day 8-10:

  • Model checking runs submitted to council-verify
  • Counterexample analysis: any liveness or safety violations trigger architecture revision and updated DRs
  • Formal spec versions pinned in DHT; UCXL addresses published to dependent councils

Phase 4: Integration and Review (Days 10-12)

Day 10-11:

  • Interface contracts with council-mem, council-net, council-telemetry, council-sec finalised and submitted for co-signature
  • Cross-council integration session: scheduling placement decisions validated against network topology model
  • council-synth engaged for any unresolved conflicts with other councils' specifications

Day 11-12:

  • Final council review of complete scheduling specification
  • Adversarial critics run end-to-end failure scenario analysis
  • Any remaining yellow votes addressed with documented mitigations
  • Integration review gate: all interface contracts co-signed

Phase 5: Documentation and Narrative (Days 12-14)

Day 12-13:

  • Decision record authors produce narrative summaries of the scheduling architecture journey
  • council-docs receives the complete scheduling specification package for standardised formatting
  • UCXL navigability audit: spot-check 10 random decision paths for completeness

Day 14:

  • Final specification published
  • council-arch decision archaeology agents generate human-readable narrative of scheduling design evolution
  • Council formally dissolved; agents released back to WHOOSH pool