12 council design briefs for distributed OS specification project targeting 1024-node Hopper/Grace/Blackwell GPU cluster with Weka parallel filesystem. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
DistOS: Distributed Operating System for Heterogeneous GPU Clusters
Project Constitution
Project ID: DistOS
UCXL Base: ucxl://*:*@DistOS:*
Target Platform: 1024-node Hopper/Grace/Blackwell cluster with Weka parallel filesystem
Created: 2026-02-24
Status: Constitution Phase
1. Mission
Design a comprehensive, formally specified distributed operating system optimised for heterogeneous GPU clusters. The system must manage scheduling, memory, networking, fault tolerance, security, and observability across up to 1024 nodes equipped with NVIDIA Hopper, Grace, and Blackwell accelerators, backed by the Weka parallel filesystem.
This project serves a dual purpose:
- Primary: Produce a rigorous, verifiable specification for a novel distributed OS
- Meta: Demonstrate that ~1000 coordinated CHORUS agents can collaboratively solve a problem of this complexity, with every decision traceable by humans via UCXL
2. Guiding Principles
- Formal First: Every subsystem must have a formal specification (TLA+, Alloy, or equivalent) before implementation sketches
- Decision Provenance: Every architectural choice must be recorded as a Decision Record (DR) in the DHT with full UCXL addressing, including alternatives considered and rationale
- Cross-Council Coherence: No council operates in isolation; integration points must be explicitly defined and tracked
- Human Navigability: A human must be able to follow any decision chain from final spec back to initial research via UCXL temporal navigation
- Self-Referential Awareness: Agents should recognise they are designing a system they would ideally run on, and leverage that perspective
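To make the Decision Provenance principle concrete, here is a minimal sketch of what a Decision Record could look like in memory. The class and field names, the `architect` role, and the `scheduling` subsystem name are all assumptions made for illustration; only the address layout is taken from the Decision Records convention in section 4.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Alternative:
    """An option that was considered and not chosen (hypothetical shape)."""
    title: str
    reason_rejected: str


@dataclass(frozen=True)
class DecisionRecord:
    """Hypothetical in-memory shape of a DR; field names are assumptions."""
    decision_id: str
    council_id: str          # e.g. "sched" for council-sched
    role: str
    subsystem: str
    summary: str
    rationale: str
    alternatives: Tuple[Alternative, ...] = ()

    def ucxl_address(self) -> str:
        # Layout follows the Decision Records convention in section 4.
        return (
            f"ucxl://council-{self.council_id}:{self.role}"
            f"@DistOS:{self.subsystem}/*^/decisions/{self.decision_id}.md"
        )

    def is_complete(self) -> bool:
        # The principle asks for a rationale and the alternatives considered.
        return bool(self.rationale.strip()) and len(self.alternatives) > 0


record = DecisionRecord(
    decision_id="dr-001",
    council_id="sched",
    role="architect",
    subsystem="scheduling",
    summary="Use fair queuing across GPU classes",
    rationale="Keeps tail latency bounded under mixed workloads",
    alternatives=(Alternative("Strict priority", "Starves low-priority jobs"),),
)
print(record.ucxl_address())
print(record.is_complete())
```

A record with an empty rationale or no alternatives would fail `is_complete()`, which is one way a council could reject DRs that do not meet the principle before publishing them.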
3. Council Structure
Research & Design Councils
| Council ID | Domain | ~Agents | Brief |
|---|---|---|---|
| council-sched | Process Scheduling | ~80 | Heterogeneous GPU/CPU scheduling, workload placement, fair queuing |
| council-mem | Distributed Memory | ~80 | Memory model, Weka FS integration, caching, coherence |
| council-net | Network Stack | ~60 | P2P mesh, RDMA, overlay networks, transport protocols |
| council-fault | Fault Tolerance | ~60 | Consensus, failure detection, recovery, Byzantine resilience |
| council-sec | Security Model | ~60 | Capability-based security, isolation, attestation, key management |
| council-telemetry | Resource Accounting | ~40 | Metering, telemetry, cost attribution, SLO enforcement |
Verification & Quality Councils
| Council ID | Domain | ~Agents | Brief |
|---|---|---|---|
| council-verify | Formal Verification | ~80 | TLA+ specs, model checking, invariant proofs, liveness properties |
| council-qa | Adversarial Testing | ~60 | Fuzzing, chaos engineering, fault injection, spec conformance |
Integration & Communication Councils
| Council ID | Domain | ~Agents | Brief |
|---|---|---|---|
| council-api | API & Developer Experience | ~40 | System call interface, SDK design, ergonomics, POSIX compatibility |
| council-synth | Inter-Council Synthesis | ~100 | Cross-cutting conflict resolution, architectural coherence, trade-off analysis |
| council-docs | Specification Writing | ~40 | Technical writing, standards formatting, reference documentation |
| council-arch | Decision Archaeology | ~40 | UCXL history traversal, decision narrative generation, human-readable summaries |
Meta-Council
| Council ID | Domain | ~Agents | Brief |
|---|---|---|---|
| council-meta | Project Governance | ~10 | Overall coordination, milestone tracking, council health, escalation |
Total: ~750-800 agents in councils (the per-council figures above sum to ~750), plus ~200 unassigned agents available for dynamic council formation
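The headcount can be cross-checked by summing the per-council figures from the tables. The sketch below does that arithmetic; the sizes are approximate in the source, so the result is indicative only, and the group labels are shortened names chosen for this example.

```python
# Approximate agent counts per council, copied from the tables above.
COUNCILS = {
    "research_design": {
        "council-sched": 80,
        "council-mem": 80,
        "council-net": 60,
        "council-fault": 60,
        "council-sec": 60,
        "council-telemetry": 40,
    },
    "verification_quality": {
        "council-verify": 80,
        "council-qa": 60,
    },
    "integration_communication": {
        "council-api": 40,
        "council-synth": 100,
        "council-docs": 40,
        "council-arch": 40,
    },
    "meta": {
        "council-meta": 10,
    },
}

# Subtotal per group, then the overall total across all councils.
group_totals = {group: sum(sizes.values()) for group, sizes in COUNCILS.items()}
total_in_councils = sum(group_totals.values())

for group, subtotal in group_totals.items():
    print(f"{group}: ~{subtotal}")
print(f"total in councils: ~{total_in_councils}")
```

The groups come to roughly 380, 140, 220, and 10 agents, for about 750 in total, which is the lower end of the stated ~750-800 range.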
4. UCXL Addressing Conventions
Council Artifacts
ucxl://council-{id}:{role}@DistOS:{subsystem}/*^/{artifact-type}/{name}
Decision Records
ucxl://council-{id}:{role}@DistOS:{subsystem}/*^/decisions/{decision-id}.md
Research Artifacts
ucxl://council-{id}:researcher@DistOS:{subsystem}/*^/research/{topic}.md
Formal Specifications
ucxl://council-{id}:verifier@DistOS:{subsystem}/*^/specs/{component}.tla
Narrative Summaries (Decision Archaeology)
ucxl://council-arch:narrator@DistOS:decision-history/*^/narratives/{period}-summary.md
Cross-Council Integration
ucxl://council-synth:synthesizer@DistOS:integration/*^/conflicts/{council-a}-vs-{council-b}.md
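The conventions above all share the general council-artifact shape, so addresses can be built and checked mechanically. The following is a minimal sketch, not a UCXL reference implementation: the helper names, the allowed character set for each field, and the example values (`scheduling`, `fair-queue.tla`) are assumptions, and `*^` is simply treated as a literal segment copied from the templates.

```python
import re

# Pattern for the general council-artifact convention. The per-field character
# set is an assumption; "*^" is matched as a literal segment.
ADDRESS_RE = re.compile(
    r"^ucxl://council-(?P<council>[a-z0-9-]+):(?P<role>[a-z0-9-]+)"
    r"@DistOS:(?P<subsystem>[a-z0-9-]+)/\*\^/"
    r"(?P<artifact_type>[a-z0-9-]+)/(?P<name>.+)$"
)


def build_address(council, role, subsystem, artifact_type, name):
    """Fill in the council-artifact template with concrete values."""
    return (
        f"ucxl://council-{council}:{role}@DistOS:{subsystem}"
        f"/*^/{artifact_type}/{name}"
    )


def parse_address(address):
    """Return the address fields as a dict, or None if it does not match."""
    match = ADDRESS_RE.match(address)
    return match.groupdict() if match else None


# Example: a formal specification published by a verifier role.
spec = build_address("sched", "verifier", "scheduling", "specs", "fair-queue.tla")
print(spec)
print(parse_address(spec))
```

Parsing and rebuilding an address should round-trip to the same string, which makes a simple consistency check for artifacts before they are published.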
5. Lifecycle Phases
Phase 1: Research & Survey (Days 1-3)
- Each council surveys existing literature, systems, and approaches
- Research artifacts published to DHT with UCXL addresses
- Decision Archaeology agents begin tracking from day one
Phase 2: Architecture & Trade-offs (Days 3-6)
- Councils propose architectural options with formal trade-off analysis
- Inter-Council Synthesis identifies conflicts and dependencies
- Key architectural decisions recorded as DRs
Phase 3: Formal Specification (Days 6-10)
- TLA+/Alloy specifications written for each subsystem
- Verification council model-checks specs for safety and liveness
- QA council designs conformance test suites
Phase 4: Integration & Review (Days 10-12)
- Cross-council integration review
- Conflict resolution via the synthesis council (council-synth)
- API surface finalised
Phase 5: Documentation & Narrative (Days 12-14)
- Complete specification document assembled
- Decision archaeology produces a human-readable narrative of the entire project
- Final UCXL navigability audit
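The phase windows above share their boundary days (day 3 closes Phase 1 and opens Phase 2, and so on). A small lookup makes that explicit. This sketch assumes the day ranges are inclusive and that a shared boundary day belongs to both phases as a handover day; the source does not say which reading is intended.

```python
from typing import List, Tuple

# (phase name, first day, last day), copied from the lifecycle above.
# Treating both ends as inclusive is an assumption.
PHASES: List[Tuple[str, int, int]] = [
    ("Research & Survey", 1, 3),
    ("Architecture & Trade-offs", 3, 6),
    ("Formal Specification", 6, 10),
    ("Integration & Review", 10, 12),
    ("Documentation & Narrative", 12, 14),
]


def phases_on(day: int) -> List[str]:
    """Return every phase whose window includes the given project day."""
    return [name for name, start, end in PHASES if start <= day <= end]


for day in (1, 3, 8, 14, 15):
    print(day, phases_on(day))
```

Day 3 returns two phases under this reading, while day 15 returns none because it falls after the 14-day schedule.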
6. Success Criteria
- Completeness: Formal specifications exist for all 6 core subsystems (one per Research & Design council)
- Verification: At least 3 subsystems have model-checked TLA+ specs with proven invariants
- Traceability: Any specification decision can be traced back through UCXL to the research that motivated it
- Human Navigability: A person unfamiliar with the project can, using only UCXL addresses and the archaeology narratives, understand why a given design decision was made
- Coherence: The synthesis council has resolved all identified cross-council conflicts
- Scale Proof: The system successfully coordinated 500+ agents across 10+ concurrent councils
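The quantitative criteria above can be expressed as simple threshold checks. The sketch below is one possible encoding, with a hypothetical `ProjectStatus` record whose field names are assumptions. Traceability and Human Navigability are left out because they need human judgement and not a numeric test.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ProjectStatus:
    """Hypothetical end-of-project metrics; field names are assumptions."""
    subsystems_with_spec: int
    subsystems_model_checked: int
    open_conflicts: int
    peak_agents: int
    peak_concurrent_councils: int


def evaluate(status: ProjectStatus) -> Dict[str, bool]:
    """Map each measurable success criterion to pass or fail."""
    return {
        # Completeness: formal specs for all 6 core subsystems.
        "completeness": status.subsystems_with_spec >= 6,
        # Verification: at least 3 model-checked TLA+ specs.
        "verification": status.subsystems_model_checked >= 3,
        # Coherence: every identified cross-council conflict resolved.
        "coherence": status.open_conflicts == 0,
        # Scale Proof: 500+ agents across 10+ concurrent councils.
        "scale_proof": (
            status.peak_agents >= 500
            and status.peak_concurrent_councils >= 10
        ),
    }


status = ProjectStatus(
    subsystems_with_spec=6,
    subsystems_model_checked=2,
    open_conflicts=0,
    peak_agents=760,
    peak_concurrent_councils=13,
)
for criterion, passed in evaluate(status).items():
    print(f"{criterion}: {'pass' if passed else 'fail'}")
```

In this sample input only the verification criterion fails, since two model-checked subsystems fall short of the required three.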
7. Dependencies
- CHORUS v0.5.5+ with leader election and P2P mesh
- SLURP with Go port of storage, resolver, and temporal graph
- WHOOSH MVP with council formation and consensus
- BUBBLE for decision walkback queries
- BACKBEAT for distributed timing coordination
- Weka FS access on resetdata.ai platform
- LLM access via resetdata.ai API (Hopper/Blackwell inference)
8. Council Design Briefs
Each council has a detailed design brief in docs/distos/councils/: