Initial DistOS project constitution and council design briefs
12 council design briefs for distributed OS specification project targeting 1024-node Hopper/Grace/Blackwell GPU cluster with Weka parallel filesystem. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
councils/01-process-scheduling.md
# Council Design Brief: Process Scheduling

**Council ID:** `council-sched`

**Mission:** Design the heterogeneous process scheduling subsystem for DistOS, covering GPU kernel dispatch, workload placement across Hopper/Grace/Blackwell accelerators, fair multi-tenant queuing, gang scheduling for distributed training, and energy-aware execution across a 1024-node cluster.

**UCXL Base Address:** `ucxl://council-sched:*@DistOS:scheduling/*`

**Agent Count:** 80

**Status:** Constitution Phase — awaiting WHOOSH formation trigger

**Created:** 2026-02-24

---

## 1. Scope and Responsibilities

`council-sched` owns the complete specification of the DistOS process and kernel scheduling subsystem. Scope boundaries are defined as follows.

**In scope:**

- GPU kernel scheduling: dispatch queue management, kernel concurrency, SM partitioning on Hopper (MIG) and Blackwell, and MPS (Multi-Process Service) lifetime management
- CPU-GPU co-scheduling on Grace Superchip (NVLink-C2C coherent interconnect), including unified virtual address space scheduling implications
- Workload placement policy across heterogeneous accelerator types (H100, GH200, B200), including topology-aware affinity scoring
- Fair multi-tenant queuing: priority classes, weighted fair queuing, and quota enforcement
- Gang scheduling for distributed training workloads: all-or-nothing allocation, partial-allotment strategies, and backfill
- Preemption strategies: checkpoint-based preemption, time-sliced preemption, and priority-inversion avoidance
- NUMA-aware placement across CPU sockets and GPU memory domains
- GPU memory oversubscription scheduling: eviction policy, swap-to-host, and coordinated demand management with `council-mem`
- Energy-aware scheduling: frequency/voltage scaling directives, power capping, and thermal headroom management
- Formal specification of the scheduling API surface exposed to user workloads and to other DistOS subsystems

**Out of scope (delegated):**

- Physical memory allocation and coherence protocol (delegated to `council-mem`)
- Network topology discovery and network-aware placement data (delegated to `council-net`; consumed as inputs)
- Metering, cost attribution, and SLO enforcement (delegated to `council-telemetry`; consumed as outputs)
- Security isolation between tenants at the hardware level (delegated to `council-sec`; policies consumed as constraints)

---

## 2. Research Domains

### 2.1 GPU Kernel Scheduling and SM Partitioning

Survey the CUDA Concurrent Kernel Execution model, the NVIDIA Multi-Instance GPU (MIG) architecture on Hopper (H100), and the NVIDIA Multi-Process Service (MPS). Understand how MIG partitions a GPU into isolated GPC slices with dedicated HBM3 memory, and how MPS enables concurrent kernel execution within a single GPU context without full MIG isolation overhead.

Key materials:

- NVIDIA H100 Architecture Technical Overview (2022) — SM partitioning, GPC layout, NVLink 4.0 bandwidth
- NVIDIA MIG User Guide (CUDA 12.x) — partition profiles (1g.10gb through 7g.80gb), instance isolation, compute and memory capacity tables
- NVIDIA MPS documentation — shared context model, client limit (48 for Hopper), error containment limitations
- AMD ROCm Hardware Abstraction Layer (HAL) source — `amdkfd` KFD driver, compute queue management, HWS (Hardware Scheduler) in GFX12
- Jain et al., "Fractional GPUs: Software-Based Compute and Memory Bandwidth Reservation for GPUs" (RTAS 2019) — software-defined partitioning as a baseline comparison
- Xiao et al., "AntMan: Dynamic Scaling on GPU Clusters for Deep Learning" (OSDI 2020) — GPU memory and compute sharing for co-located workloads

### 2.2 Heterogeneous Workload Placement

Develop a placement model that accounts for the distinct performance characteristics of H100 (PCIe/SXM5), GH200 Grace Superchip (NVLink-C2C), and B200 Blackwell (NVLink 5.0 SXM). Each accelerator type has different compute density, memory bandwidth, and interconnect topology.

Key materials:

- Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit" (ISCA 2017) — heterogeneous accelerator placement rationale
- Google Borg paper: Verma et al., "Large-scale cluster management at Google with Borg" (EuroSys 2015) — machine heterogeneity handling, alloc sets, resource estimation
- Pollux: Qiao et al., "Pollux: Co-adaptive Cluster Scheduling for Goodput-Optimized Deep Learning" (OSDI 2021) — goodput-aware placement for training workloads
- Weng et al., "MLaaS in the Wild: Workload Analysis and Scheduling in Large-Scale Heterogeneous GPU Clusters" (NSDI 2022) — real-world heterogeneous GPU cluster scheduling from Alibaba
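
The placement work above converges on score-based node ranking. A minimal sketch of topology-aware affinity scoring with greedy allocation, assuming illustrative capability numbers and weights (`NodeProfile`, `affinity_score`, and the normalisation ceilings are all hypothetical, not drawn from any proposal):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodeProfile:
    """Static per-node capabilities, fed by council-net's topology model."""
    kind: str              # "H100-SXM5", "GH200", "B200"
    tflops: float          # dense compute per GPU (illustrative units)
    hbm_gbps: float        # HBM bandwidth per GPU
    nvlink_domain: int     # NVLink/NVSwitch domain id
    free_gpus: int

def affinity_score(node, job_domain, weights=(0.5, 0.3, 0.2)):
    """Score a candidate node for a job anchored in `job_domain`.

    Normalises compute and bandwidth against assumed ceilings, and rewards
    staying inside the job's NVLink domain so collectives avoid crossing
    the IB fabric.
    """
    w_compute, w_bw, w_topo = weights
    compute = min(node.tflops / 5000.0, 1.0)   # assumed normalisation ceiling
    bw = min(node.hbm_gbps / 8000.0, 1.0)
    topo = 1.0 if node.nvlink_domain == job_domain else 0.0
    return w_compute * compute + w_bw * bw + w_topo * topo

def place(job_domain, gpus_needed, candidates):
    """Greedy highest-score placement; a real scheduler would also apply
    fairness and quota constraints before scoring."""
    ranked = sorted(candidates, key=lambda n: affinity_score(n, job_domain),
                    reverse=True)
    chosen = []
    for node in ranked:
        if gpus_needed <= 0:
            break
        take = min(node.free_gpus, gpus_needed)
        if take > 0:
            chosen.append((node, take))
            gpus_needed -= take
    return chosen if gpus_needed == 0 else None
```

With two otherwise-identical nodes, the in-domain node ranks first, and a request exceeding total free GPUs yields no placement rather than a partial one.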

### 2.3 Fair Multi-Tenant Queuing

Design a multi-level queueing architecture with Dominant Resource Fairness (DRF) semantics extended for GPU resources (SM fraction, HBM bandwidth, NVLink bandwidth as co-dominant resources).

Key materials:

- Ghodsi et al., "Dominant Resource Fairness: Fair Allocation of Multiple Resource Types" (NSDI 2011) — foundational DRF theory
- Apache YARN CapacityScheduler and FairScheduler documentation — hierarchical queues, preemption policy, label-based node targeting
- Apache Mesos DRF implementation — `DominantShareAllocator`, offer model, role weights
- Kubernetes device plugin framework (k8s.io/device-plugins) — GPU resource advertisement, extended resource scheduling
- Tiresias: Gu et al., "Tiresias: A GPU Cluster Manager for Distributed Deep Learning" (NSDI 2019) — LAS (Least Attained Service) scheduling for DL jobs, 2DAS (two-dimensional attained service)
- Gandiva: Xiao et al., "Gandiva: Introspective Cluster Scheduling for Deep Learning" (OSDI 2018) — time-sliced GPU sharing, job packing, migration
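
The DRF mechanics this domain builds on can be shown as a small progressive-filling allocator in the spirit of Ghodsi et al., with resource vectors as plain dicts so GPU dimensions (SM fraction, HBM bandwidth, NVLink bandwidth) become extra keys; function names and the round-based loop are illustrative:

```python
def dominant_share(usage, capacity):
    """DRF: a tenant's dominant share is the max over resource dimensions
    of (allocated / cluster capacity)."""
    return max(usage[r] / capacity[r] for r in capacity)

def drf_allocate(demands, capacity, rounds=100):
    """Progressive filling: each round, grant one demand-vector unit to the
    feasible tenant with the lowest dominant share; stop when no tenant's
    next grant fits in every dimension."""
    usage = {t: {r: 0.0 for r in capacity} for t in demands}
    for _ in range(rounds):
        feasible = [
            t for t, d in demands.items()
            if all(sum(usage[u][r] for u in usage) + d.get(r, 0.0) <= capacity[r]
                   for r in capacity)
        ]
        if not feasible:
            break
        t = min(feasible, key=lambda t: dominant_share(usage[t], capacity))
        for r, amount in demands[t].items():
            usage[t][r] += amount
    return usage
```

On the worked example from the DRF paper (9 CPUs, 18 GB; tenant A demands <1 CPU, 4 GB> per task, B demands <3 CPU, 1 GB>), this converges to A holding 3 tasks and B holding 2, equalising dominant shares at 2/3.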

### 2.4 Gang Scheduling for Distributed Training

Gang scheduling ensures all processes in a distributed training job (e.g., a 512-GPU allreduce ring) are co-scheduled simultaneously. This is critical for preventing head-of-line blocking and deadlock in collective communication.

Key materials:

- Feitelson and Rudolph, "Gang Scheduling Performance Benefits for Fine-Grain Synchronization" (J. Parallel Distrib. Comput., 1992) — foundational gang scheduling analysis
- Rajachandrasekar et al., "A Closer Look at All-Reduce for Deep Learning" (Workshop at SC'19) — collective communication scheduling sensitivity
- Hwang et al., "AFS: Annotation-Free Automatic Sharding for Large Language Models" (ICLR 2023) — pipeline and tensor parallel placement co-scheduling
- Slurm gang scheduling documentation — `Oversubscribe=FORCE`, `GraceTime`, slurmctld gang plugin
- Jeon et al., "Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads" (USENIX ATC 2019) — Microsoft Philly cluster trace, gang failure modes
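
The all-or-nothing property can be sketched as a two-phase reservation with rollback, one of the protocol options DP-SCHED-004 weighs; the data shapes here are illustrative:

```python
def gang_reserve(free, request):
    """All-or-nothing two-phase reservation.

    Phase 1 places holds on every node in the gang; if any hold fails,
    all prior holds roll back and the gang returns to the queue, so a
    partial allocation never becomes visible. Phase 2 is the commit:
    returning the hold list.
    `free` maps node -> free GPU count; `request` is [(node, gpus), ...].
    """
    holds = []
    for node, gpus in request:
        if free.get(node, 0) >= gpus:
            free[node] -= gpus
            holds.append((node, gpus))
        else:
            for n, g in holds:          # rollback phase
                free[n] += g
            return None
    return holds                         # commit
```

A failed gang leaves `free` exactly as it found it, which is the invariant a TLA+ gang-scheduler spec would state as a safety property.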

### 2.5 Preemption Strategies

Survey checkpoint-based, reactive, and time-sliced preemption. Understand the cost of a GPU checkpoint (saving the SM register file, shared memory, and in-flight DMA) and design preemption policies that bound the maximum latency impact.

Key materials:

- Park et al., "Preemptive, Low Latency Datacenter Scheduling via Lightweight Virtualization" (USENIX ATC 2020) — container-level preemption
- Lin et al., "SHEPHERD: Serving DNNs in the Wild" (NSDI 2023) — latency SLO-aware preemption for inference workloads
- NVIDIA CUDA checkpoint/restore (CRIU for GPU) — experimental support status as of CUDA 12.x
- Wang et al., "Achieving Microsecond-Scale Tail Latency Efficiently with Approximate Optimal Scheduling" (SOSP 2017) — preemption-aware scheduling theory

### 2.6 NUMA-Aware and Topology-Aware Placement

Model the NUMA topology of a Grace Superchip node: 72-core Arm Neoverse V2 CPU, 480 GB LPDDR5X at ~512 GB/s, connected to the H100 GPU via NVLink-C2C at 900 GB/s. Compare this with standard SXM5 nodes where CPU-GPU bandwidth is limited to PCIe Gen5 (~128 GB/s bidirectional).

Key materials:

- Linux NUMA scheduling documentation — `libnuma`, `numactl`, CFS NUMA balancing (`task_numa_migrate`)
- NVIDIA GH200 Grace Hopper Superchip Architecture Whitepaper (2023)
- Lepers et al., "Thread and Memory Placement on NUMA Systems: Asymmetry Matters" (USENIX ATC 2015)
- Blagodurov et al., "A Case for NUMA-Aware Contention Management on Multicore Systems" (USENIX ATC 2011)
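
The bandwidth asymmetry above dominates any decision to spill state to host memory. A back-of-envelope comparison, assuming an idealised streaming copy with no protocol overhead and an illustrative payload size:

```python
def transfer_seconds(gigabytes, link_gbps):
    """Time to move `gigabytes` of host<->device data over a link with
    `link_gbps` GB/s of usable bandwidth (idealised: no protocol overhead,
    no pipelining effects)."""
    return gigabytes / link_gbps

# Spilling a hypothetical 96 GB optimizer state to host memory:
c2c  = transfer_seconds(96, 900)   # GH200 NVLink-C2C, 900 GB/s
pcie = transfer_seconds(96, 64)    # SXM5 host link, ~64 GB/s per direction
```

Roughly 0.11 s over NVLink-C2C versus 1.5 s over PCIe Gen5, a ~14x gap, which is why the placement model should treat Grace nodes as the preferred home for jobs with heavy host-memory traffic.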

### 2.7 GPU Memory Oversubscription

When aggregate GPU memory demand exceeds physical HBM3 capacity, the scheduler must coordinate with `council-mem`'s eviction and tiering policies to decide which kernels to throttle, checkpoint, or migrate.

Key materials:

- Rhu et al., "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design" (MICRO 2016) — layer-by-layer activation offload
- Huang et al., "Efficient Large Scale Language Modeling with Mixtures of Experts" (EMNLP 2021) — memory-efficient model parallelism
- NVIDIA Unified Memory documentation (CUDA 12.x) — page migration engine, oversubscription, prefetch advisory API
- Huang et al., "SwapAdvisor: Push Deep Learning Beyond the GPU Memory Limit via Smart Swapping" (ASPLOS 2020)
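
A scheduler-side policy for reacting to memory pressure might look like the following sketch. The cost inputs are stand-ins for whatever `council-mem`'s contract ends up providing, and the 20% overflow threshold is an arbitrary illustration, not a proposed value:

```python
def pressure_action(demand_gb, hbm_gb, eviction_cost_s, checkpoint_cost_s):
    """Choose a response to GPU memory pressure.

    - fits           -> admit as-is
    - mild overflow  -> rely on UVM-style eviction/tiering if it is cheap
    - heavy overflow -> checkpoint-and-requeue rather than thrash
    """
    overflow = demand_gb - hbm_gb
    if overflow <= 0:
        return "admit"
    if overflow <= 0.2 * hbm_gb and eviction_cost_s < checkpoint_cost_s:
        return "admit_with_eviction"
    return "checkpoint_and_requeue"
```

The point of the sketch is the shape of the interface: the scheduler never decides on raw demand alone, it compares cost estimates supplied by the memory subsystem.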

### 2.8 Energy-Aware Scheduling

Design scheduling policies that honour per-node power caps enforced by Baseboard Management Controllers (BMCs) and exploit DVFS (Dynamic Voltage and Frequency Scaling) headroom reported by `council-telemetry`.

Key materials:

- Patel et al., "Clite: Efficient and QoS Aware Co-location of Multiple Latency-Critical Jobs for Warehouse Scale Computers" (ISCA 2020)
- Lim et al., "Adaptive Power Management for Heterogeneous Multiprocessor SoCs" (ICCAD 2009)
- NVIDIA NVML power management API — `nvmlDeviceSetPowerManagementLimit`, `nvmlDeviceGetCurrentClocksThrottleReasons`
- AMD ROCm SMI library — power cap interface (`rsmi_dev_power_cap_set`)
- RAPL (Running Average Power Limit) documentation — the established x86 CPU power-capping model, studied as a reference design for the analogous Grace CPU power management controls (Grace itself is Arm-based and exposes capping through its own management interfaces)
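
A minimal admission check under a BMC-enforced node power cap, with DVFS modelled crudely as linear power scaling (real GPU power scales superlinearly with frequency and voltage; all names and numbers here are illustrative):

```python
def admit_under_cap(node_cap_w, draw_w, job_tdp_w, clock_scale=0.8):
    """Energy-aware admission under a node power cap.

    If the job's full-TDP draw fits under the cap, admit at full clocks;
    if a DVFS-scaled estimate fits, admit with a frequency-cap directive
    (power modelled, crudely, as linear in `clock_scale`); otherwise
    defer the job until telemetry reports more headroom.
    Returns (decision, clock_scale_applied).
    """
    headroom = node_cap_w - draw_w
    if job_tdp_w <= headroom:
        return ("admit", 1.0)
    if job_tdp_w * clock_scale <= headroom:
        return ("admit", clock_scale)
    return ("defer", 0.0)
```

Whether the frequency-cap directive is binding or advisory is exactly the question DP-SCHED-007 resolves.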

---

## 3. Agent Roles

Total agents: **80**

| Role | Count | Responsibilities |
|------|-------|------------------|
| Lead Architect | 2 | Overall scheduling architecture decisions, cross-subsystem interface design, DR authorship for major decisions |
| Kernel Scheduling Researchers | 8 | CUDA/ROCm scheduler internals, MIG/MPS analysis, SM partitioning survey |
| Placement Researchers | 6 | Heterogeneous accelerator placement, topology modelling, workload profiling |
| Queueing Theory Specialists | 5 | DRF extensions for multi-resource GPU scheduling, fairness proof sketches |
| Gang Scheduling Specialists | 5 | Collective communication scheduling, all-or-nothing allocation protocols |
| Preemption Specialists | 4 | Checkpoint protocol design, preemption cost modelling |
| NUMA/Topology Analysts | 4 | NUMA topology modelling for Grace Superchip and SXM5 nodes |
| Energy Efficiency Researchers | 4 | Power capping, DVFS scheduling integration |
| Formal Specification Authors | 8 | TLA+ specification of scheduler state machine, safety and liveness invariants |
| Architects (sub-component) | 10 | Propose concrete scheduling algorithm designs for each domain |
| Internal Reviewers | 8 | Review research summaries and architecture proposals; cast green/yellow/red votes |
| Integration Liaisons | 6 | Interface with `council-mem`, `council-net`, `council-telemetry`, `council-sec` |
| Decision Record Authors | 5 | Author DRs for each resolved decision point; maintain UCXL provenance chain |
| Adversarial Critics | 5 | Challenge proposed designs; surface failure modes, starvation scenarios, livelock risks |

**Role distribution rationale:** The large researcher cohort (36 agents covering 8 research domains) reflects the breadth of the scheduling literature. The formal specification group (8 agents) is sized to produce parallel TLA+ modules. Internal reviewers and adversarial critics (13 agents combined) ensure no architecture proposal passes without rigorous challenge.

---

## 4. Key Deliverables

All artifacts are published to the DHT and addressable via UCXL.

### 4.1 Research Summaries

```
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gpu-kernel-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/heterogeneous-placement.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/fair-queuing-drf.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/gang-scheduling.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/preemption-strategies.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/numa-topology.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/memory-oversubscription.md
ucxl://council-sched:researcher@DistOS:scheduling/*^/research/energy-aware-scheduling.md
```

### 4.2 Architecture Proposals

```
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/scheduler-overview.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/mig-mps-partitioning-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/placement-scoring-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/drf-gpu-extension.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/gang-scheduler-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/preemption-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/energy-policy-interface.md
```

### 4.3 Decision Records

```
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-001-partition-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-002-placement-algorithm.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-003-queuing-policy.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-004-gang-protocol.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-005-preemption-model.md
ucxl://council-sched:architect@DistOS:scheduling/*^/decisions/DR-SCHED-006-energy-interface.md
```

### 4.4 Formal Specifications

```
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/SchedulerStateMachine.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/GangScheduler.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/PreemptionProtocol.tla
ucxl://council-sched:verifier@DistOS:scheduling/*^/specs/FairnessInvariants.tla
```

### 4.5 Interface Contracts (for other councils)

```
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-mem-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-net-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-telemetry-contract.md
ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-sec-contract.md
```

---

## 5. Decision Points

The following are the major architectural questions `council-sched` must resolve. Each decision must produce a Decision Record with alternatives considered, evidence from research, and rationale for the chosen option.

### DP-SCHED-001: Primary Partition Model

**Question:** Should the scheduler use MIG (hardware-enforced partition isolation), MPS (shared context with software-enforced limits), or a hybrid model where MIG is used for multi-tenant isolation and MPS for co-located jobs within a single tenant's partition?

**Factors:** isolation strength, context switch overhead, minimum allocation granularity, Blackwell support parity, operational complexity.

**Dependency:** This decision informs the security isolation model that `council-sec` will specify.

### DP-SCHED-002: Placement Scoring Architecture

**Question:** Should placement be computed by a centralised scoring service (Borg-style) or by distributed, bid-based negotiation (Mesos offer model)? How does topology affinity (NVLink domain proximity) weigh against fairness constraints?

**Factors:** convergence time at 1024-node scale, fault tolerance of the placement service itself, handling of stale topology information, NVLink bandwidth utilisation efficiency.

### DP-SCHED-003: Multi-Resource Fairness Extension

**Question:** DRF was designed for CPU/memory. GPU workloads introduce additional dominant resources: SM fraction, HBM3 bandwidth, NVLink bandwidth, and NVSwitch port occupancy. How many resource dimensions should the fairness model track, and what is the computational complexity of DRF at 80+ resource types?

**Factors:** implementation tractability, approximation error bounds, interaction with per-tenant quota enforcement.

### DP-SCHED-004: Gang Scheduling Protocol

**Question:** Should gang scheduling use a two-phase reservation (reserve then commit) protocol, speculative allocation with rollback, or a backfill-with-hold strategy? What is the maximum tolerable scheduling delay for a 1024-GPU gang job?

**Factors:** cluster utilisation impact, deadlock risk in two-phase protocols, interaction with preemption.

### DP-SCHED-005: Preemption Granularity

**Question:** Should preemption operate at kernel granularity (CUDA stream checkpointing), at job granularity (full process checkpoint via CRIU-for-GPU), or as a hybrid that allows kernel preemption for short jobs and process-level preemption for long jobs?

**Factors:** checkpoint latency (kernel-level: microseconds; process-level: seconds to tens of seconds for large model weights), HBM3 save/restore bandwidth cost, preemption frequency requirements.
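
The process-level latency band quoted above follows largely from host-link bandwidth. A back-of-envelope model, with idealised streaming copies and illustrative sizes:

```python
def process_checkpoint_seconds(hbm_resident_gb, drain_s, host_link_gbps):
    """Process-level preemption cost: drain in-flight kernels and DMA,
    then copy HBM-resident state to host memory (idealised streaming copy,
    no protocol overhead). `drain_s` is an assumed quiesce time."""
    return drain_s + hbm_resident_gb / host_link_gbps

# Hypothetical job with 60 GB of HBM-resident state, 50 ms drain:
sxm5  = process_checkpoint_seconds(60, 0.05, 64)    # PCIe Gen5 host link
gh200 = process_checkpoint_seconds(60, 0.05, 900)   # NVLink-C2C
```

Roughly a second on an SXM5 node versus just over a tenth of a second on GH200, which suggests preemption frequency limits should be set per node class rather than cluster-wide.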

### DP-SCHED-006: Grace Superchip Scheduling Model

**Question:** For GH200 nodes with NVLink-C2C, the CPU and GPU share a unified virtual address space. Should the scheduler treat the CPU and GPU on a Grace node as a single scheduling unit, or as two separate but affinity-linked resources? How does this interact with the memory model specified by `council-mem`?

**Factors:** utilisation efficiency, memory model consistency requirements, API ergonomics for user workloads, migration cost if a job is split across a C2C boundary.

### DP-SCHED-007: Energy Policy Interface

**Question:** Should energy-aware scheduling be a first-class scheduling objective (a weight in the placement score function), a hard constraint (power cap as a resource limit), or an advisory mechanism (the scheduler receives power headroom hints and applies them at its discretion)?

**Factors:** SLO compatibility, predictability of power-capped execution, interaction with thermal management, coordination protocol with `council-telemetry`.

---

## 6. Dependencies

### 6.1 What council-sched Needs from Other Councils

| Dependency | Source Council | Artifact | Purpose |
|------------|----------------|----------|---------|
| Memory pressure signals and eviction cost model | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-sched-contract.md` | GPU memory oversubscription scheduling requires eviction cost estimates to avoid scheduling kernels that will immediately thrash |
| HBM3/DDR5 bandwidth allocation model | `council-mem` | `ucxl://council-mem:architect@DistOS:memory/*^/architecture/bandwidth-allocation.md` | Bandwidth is a co-dominant scheduling resource; the model must be consistent |
| Network topology map (NVLink domains, IB fabric) | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/architecture/topology-model.md` | Topology-aware placement requires accurate NVLink domain membership and IB bisection bandwidth per rack |
| Network bandwidth reservation API | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-sched-contract.md` | Gang scheduling for allreduce jobs requires co-reserving network bandwidth |
| Resource metering API contract | `council-telemetry` | `ucxl://council-telemetry:architect@DistOS:telemetry/*^/interfaces/telemetry-to-sched-contract.md` | Scheduler must feed placement and execution events to telemetry; power headroom signals flow back |
| Security isolation constraints | `council-sec` | `ucxl://council-sec:architect@DistOS:security/*^/interfaces/sec-to-sched-contract.md` | Which tenants may share a GPU partition, MPS context constraints, minimum isolation requirements per tenant class |

### 6.2 What Other Councils Need from council-sched

| Consumer Council | Artifact Required | Purpose |
|------------------|-------------------|---------|
| `council-mem` | Placement decisions and GPU assignment map | Memory subsystem needs to know which GPU a job is placed on to configure HBM3 address space and NVLink fabric attachment |
| `council-mem` | Preemption events and checkpoint triggers | Memory tiering must snapshot GPU memory on preemption |
| `council-net` | Job placement map with NVLink domain assignments | Network subsystem configures NCCL topology files and RDMA QP mappings based on scheduler placement |
| `council-telemetry` | Scheduling events stream (enqueue, dequeue, preempt, complete) | Metering and cost attribution require per-job lifecycle events |
| `council-sec` | Partition assignment per tenant | Security subsystem enforces isolation based on scheduler-assigned partition IDs |
| `council-verify` | TLA+ scheduler state machine spec | Formal verification council model-checks scheduler invariants |

---

## 7. WHOOSH Configuration

```yaml
# WHOOSH council formation configuration for council-sched
council_id: council-sched
project: DistOS
subsystem: scheduling
gitea_label: chorus-entrypoint
gitea_repo: distos/scheduling

formation:
  target_agents: 80
  min_agents: 60
  wave:
    max_per_wave: 12
    min_per_wave: 6
    period_sec: 30
  placement:
    max_replicas_per_node: 2
    join_stagger_ms: 2000
    bootstrap_peers_min: 5

roles:
  - role: lead-architect
    count: 2
    model: claude-opus-4-6
    priority: high
  - role: researcher
    count: 36
    model: qwen2.5-coder:32b
    priority: normal
    subgroups:
      - tag: kernel-scheduling
        count: 8
      - tag: placement
        count: 6
      - tag: queuing-theory
        count: 5
      - tag: gang-scheduling
        count: 5
      - tag: preemption
        count: 4
      - tag: numa-topology
        count: 4
      - tag: energy
        count: 4
  - role: architect
    count: 10
    model: claude-opus-4-6
    priority: normal
  - role: verifier
    count: 8
    model: deepseek-coder-v2
    priority: normal
  - role: reviewer
    count: 8
    model: claude-opus-4-6
    priority: normal
  - role: integration-liaison
    count: 6
    model: qwen2.5-coder:32b
    priority: normal
  - role: decision-record-author
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: adversarial-critic
    count: 5
    model: claude-opus-4-6
    priority: normal

subchannels:
  - name: sched-research
    description: "Research discussion and literature synthesis"
    participants: [researcher, lead-architect]
    pubsub: true
  - name: sched-architecture
    description: "Architecture proposal discussion and voting"
    participants: [architect, lead-architect, reviewer, adversarial-critic]
    pubsub: true
  - name: sched-formal-spec
    description: "TLA+ specification authoring and review"
    participants: [verifier, lead-architect, reviewer]
    pubsub: false
  - name: sched-integration
    description: "Cross-council interface negotiation"
    participants: [integration-liaison, lead-architect]
    pubsub: false
  - name: sched-decisions
    description: "Decision record authoring and consensus"
    participants: [decision-record-author, lead-architect, reviewer]
    pubsub: true

quorum:
  # Architecture decisions require supermajority
  architecture_changes:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    beat_minutes: 20
    timeout_beats: 6
  # Research summaries require simple majority
  research_summaries:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 4
  # Formal specifications require supermajority with verifier sign-off
  formal_specs:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    require_verifier: true
    beat_minutes: 25
    timeout_beats: 8
  # Cross-council interface contracts require unanimous lead-architect approval
  interface_contracts:
    policy: unanimous
    roles: [lead-architect, integration-liaison]
    beat_minutes: 30
    timeout_beats: 4

gates:
  kaching:
    p95_latency_ms: 250
    max_error_rate: 0.01
  backbeat:
    max_stream_lag: 200
  bootstrap:
    min_healthy_peers: 5
  join:
    min_success_rate: 0.80

review:
  beat_minutes: 20
  quorum:
    total_min: 3
    require_domain_role: true
    require_quality_role: true
  timeout_beats: 6
  no_self_approval: true
```

---

## 8. Success Criteria

1. **Research completeness:** All 8 research domain summaries are published to the DHT with at least 5 primary references each, reviewed and approved by at least 3 council agents.

2. **Architecture coverage:** Architecture proposals exist for all 7 major scheduling components (kernel dispatch, placement, queuing, gang, preemption, NUMA, energy). Each proposal addresses the 1024-node scale constraint explicitly.

3. **Decision records resolved:** All 7 decision points (DP-SCHED-001 through DP-SCHED-007) have a corresponding Decision Record with at least 3 alternatives considered, evidence citations, and a chosen option ratified by council supermajority.

4. **Formal specifications:** TLA+ specifications exist for the scheduler state machine, the gang scheduling protocol, and the preemption protocol. At least 2 of these must have model-checked safety invariants (no starvation for highest-priority jobs, gang deadlock freedom) verified by `council-verify`.

5. **Interface contracts ratified:** All 4 interface contracts (to `council-mem`, `council-net`, `council-telemetry`, `council-sec`) are co-signed by integration liaisons from both councils.

6. **UCXL navigability:** A human unfamiliar with the project should be able to navigate from any Decision Record to the research summary that motivated it using only UCXL temporal navigation, within 5 hops.

7. **Adversarial review pass:** Each major architecture proposal has at least one adversarial critique documented and a resolution recorded. No proposal advances to formal specification with an unresolved red-vote critique.

---

## 9. Timeline

### Phase 1: Research and Survey (Days 1-3)

**Day 1:**
- WHOOSH forms the council; all 80 agents join via wave deployment
- Researchers self-assign to domain subgroups via the `sched-research` subchannel
- Literature survey begins: the GPU kernel scheduling, placement, and fair queuing domains are prioritised first (these are on the critical path for Decision Points DP-SCHED-001 through DP-SCHED-003)
- Integration liaisons make initial contact with `council-mem` and `council-net` to understand their timelines

**Day 2:**
- Remaining research domains surveyed: gang scheduling, preemption, NUMA, energy
- Research summaries drafted in parallel across subgroups
- First internal review cycle: reviewers read summaries and post green/yellow/red votes with rationale
- Lead architects synthesise research findings into a preliminary scheduling design-space map

**Day 3:**
- Research summaries revised based on review feedback; final versions published to the DHT
- Adversarial critics challenge assumptions in each summary
- Research phase gate: all 8 summaries must achieve simple-majority approval before Phase 2 begins
- Preliminary interface contract outlines shared with dependency councils

### Phase 2: Architecture and Trade-offs (Days 3-6)

**Days 3-4:**
- Architects propose concrete options for DP-SCHED-001 (partition model) and DP-SCHED-002 (placement scoring) — these are the highest-dependency decisions
- Adversarial critics engage immediately; all alternatives documented
- DP-SCHED-001 decision record drafted; council votes; DR published

**Days 4-5:**
- Queuing model (DP-SCHED-003), gang scheduling (DP-SCHED-004), and preemption (DP-SCHED-005) design proposals authored concurrently
- Inter-council synthesis session with `council-mem` to align on oversubscription and eviction signal interfaces
- Inter-council synthesis session with `council-net` to align on the topology model input format

**Days 5-6:**
- Grace Superchip model (DP-SCHED-006) and energy interface (DP-SCHED-007) decisions resolved
- All 7 Decision Records drafted; supermajority vote on each
- Architecture overview document assembled from approved DRs
- Architecture phase gate: all DPs resolved before Phase 3 begins

### Phase 3: Formal Specification (Days 6-10)

**Days 6-7:**
- Verifiers begin TLA+ specification of the scheduler state machine based on the approved architecture
- Architects continue refining component-level designs to resolve any ambiguities surfaced during spec authoring
- `council-verify` engaged: spec drafts shared for early model-checking feedback

**Days 7-8:**
- Gang scheduler TLA+ module authored in parallel with the preemption protocol spec
- Fairness invariants formally stated: no starvation under bounded load, gang deadlock freedom, DRF monotonicity

**Days 8-10:**
- Model checking runs submitted to `council-verify`
- Counterexample analysis: any liveness or safety violations trigger architecture revision and updated DRs
- Formal spec versions pinned in the DHT; UCXL addresses published to dependent councils

### Phase 4: Integration and Review (Days 10-12)

**Days 10-11:**
- Interface contracts with `council-mem`, `council-net`, `council-telemetry`, and `council-sec` finalised and submitted for co-signature
- Cross-council integration session: scheduling placement decisions validated against the network topology model
- `council-synth` engaged for any unresolved conflicts with other councils' specifications

**Days 11-12:**
- Final council review of the complete scheduling specification
- Adversarial critics run end-to-end failure scenario analysis
- Any remaining yellow votes addressed with documented mitigations
- Integration review gate: all interface contracts co-signed

### Phase 5: Documentation and Narrative (Days 12-14)

**Days 12-13:**
- Decision record authors produce narrative summaries of the scheduling architecture journey
- `council-docs` receives the complete scheduling specification package for standardised formatting
- UCXL navigability audit: spot-check 10 random decision paths for completeness

**Day 14:**
- Final specification published
- `council-arch` decision archaeology agents generate a human-readable narrative of the scheduling design evolution
- Council formally dissolved; agents released back to the WHOOSH pool