DistOS/PROJECT-CONSTITUTION.md
anthonyrawlins 7f56ca4d46 Initial DistOS project constitution and council design briefs
12 council design briefs for distributed OS specification project targeting
1024-node Hopper/Grace/Blackwell GPU cluster with Weka parallel filesystem.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-26 14:15:39 +11:00


DistOS: Distributed Operating System for Heterogeneous GPU Clusters

Project Constitution

Project ID: DistOS
UCXL Base: ucxl://*:*@DistOS:*
Target Platform: 1024-node Hopper/Grace/Blackwell cluster with Weka parallel filesystem
Created: 2026-02-24
Status: Constitution Phase


1. Mission

Design a comprehensive, formally specified distributed operating system optimised for heterogeneous GPU clusters. The system must manage scheduling, memory, networking, fault tolerance, security, and observability across up to 1024 nodes equipped with NVIDIA Hopper, Grace, and Blackwell accelerators, backed by the Weka parallel filesystem.

This project serves a dual purpose:

  1. Primary: Produce a rigorous, verifiable specification for a novel distributed OS
  2. Meta: Demonstrate that ~1000 coordinated CHORUS agents can collaboratively solve a problem of this complexity, with every decision traceable by humans via UCXL

2. Guiding Principles

  • Formal First: Every subsystem must have a formal specification (TLA+, Alloy, or equivalent) before implementation sketches
  • Decision Provenance: Every architectural choice must be recorded as a Decision Record (DR) in the DHT with full UCXL addressing, including alternatives considered and rationale
  • Cross-Council Coherence: No council operates in isolation; integration points must be explicitly defined and tracked
  • Human Navigability: A human must be able to follow any decision chain from final spec back to initial research via UCXL temporal navigation
  • Self-Referential Awareness: Agents should recognise they are designing a system they would ideally run on, and leverage that perspective
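The Decision Provenance and Human Navigability principles can be illustrated with a minimal sketch of a Decision Record and a walk back through the records it builds on. Everything here is an assumption made for illustration: the `DecisionRecord` fields, the `derived_from` link, the in-memory `store` dictionary, and the `walk_back` helper are hypothetical, and this is not the API of BUBBLE, SLURP, or the DHT.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    """Hypothetical shape of a DR; field names are illustrative only."""
    address: str                      # UCXL address of this record
    rationale: str                    # why this option was chosen
    alternatives: list[str]           # options considered and rejected
    derived_from: list[str] = field(default_factory=list)  # earlier DRs or research


def walk_back(start: str, store: dict[str, DecisionRecord]) -> list[str]:
    """Follow derived_from links from a decision back to its sources.

    Returns addresses in visit order. Addresses with no entry in `store`
    (for example research artifacts) are treated as leaves.
    """
    seen: list[str] = []
    stack = [start]
    while stack:
        addr = stack.pop()
        if addr in seen:
            continue  # already visited; avoids loops in malformed data
        seen.append(addr)
        record = store.get(addr)
        if record is not None:
            # Reverse so the first listed source is visited first.
            stack.extend(reversed(record.derived_from))
    return seen
```

With a store in which a final decision cites an earlier decision, which in turn cites a research artifact, `walk_back` returns the three addresses in that order, which is the kind of chain a human reader would follow from spec back to research.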

3. Council Structure

Research & Design Councils

| Council ID | Domain | ~Agents | Brief |
|---|---|---|---|
| council-sched | Process Scheduling | ~80 | Heterogeneous GPU/CPU scheduling, workload placement, fair queuing |
| council-mem | Distributed Memory | ~80 | Memory model, Weka FS integration, caching, coherence |
| council-net | Network Stack | ~60 | P2P mesh, RDMA, overlay networks, transport protocols |
| council-fault | Fault Tolerance | ~60 | Consensus, failure detection, recovery, Byzantine resilience |
| council-sec | Security Model | ~60 | Capability-based security, isolation, attestation, key management |
| council-telemetry | Resource Accounting | ~40 | Metering, telemetry, cost attribution, SLO enforcement |

Verification & Quality Councils

| Council ID | Domain | ~Agents | Brief |
|---|---|---|---|
| council-verify | Formal Verification | ~80 | TLA+ specs, model checking, invariant proofs, liveness properties |
| council-qa | Adversarial Testing | ~60 | Fuzzing, chaos engineering, fault injection, spec conformance |

Integration & Communication Councils

| Council ID | Domain | ~Agents | Brief |
|---|---|---|---|
| council-api | API & Developer Experience | ~40 | System call interface, SDK design, ergonomics, POSIX compatibility |
| council-synth | Inter-Council Synthesis | ~100 | Cross-cutting conflict resolution, architectural coherence, trade-off analysis |
| council-docs | Specification Writing | ~40 | Technical writing, standards formatting, reference documentation |
| council-arch | Decision Archaeology | ~40 | UCXL history traversal, decision narrative generation, human-readable summaries |

Meta-Council

| Council ID | Domain | ~Agents | Brief |
|---|---|---|---|
| council-meta | Project Governance | ~10 | Overall coordination, milestone tracking, council health, escalation |

Total: ~750-800 agents in councils + ~200 unassigned agents available for dynamic council formation
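As a quick arithmetic check on that total, the per-council estimates can be summed directly. The dictionary below only transcribes the ~Agents column from the council tables; the names `COUNCIL_AGENTS` and `total_agents` are illustrative and not part of any DistOS tooling.

```python
# Approximate agent counts transcribed from the council tables.
COUNCIL_AGENTS = {
    "council-sched": 80,
    "council-mem": 80,
    "council-net": 60,
    "council-fault": 60,
    "council-sec": 60,
    "council-telemetry": 40,
    "council-verify": 80,
    "council-qa": 60,
    "council-api": 40,
    "council-synth": 100,
    "council-docs": 40,
    "council-arch": 40,
    "council-meta": 10,
}


def total_agents(counts: dict[str, int]) -> int:
    """Sum the nominal per-council agent estimates."""
    return sum(counts.values())


if __name__ == "__main__":
    print(len(COUNCIL_AGENTS), "councils,", total_agents(COUNCIL_AGENTS), "agents")
```

The nominal estimates sum to 750, the lower end of the stated ~750-800 range. Adding the ~200 unassigned agents brings the pool close to the ~1000 agents named in the mission.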

4. UCXL Addressing Conventions

Council Artifacts

ucxl://council-{id}:{role}@DistOS:{subsystem}/*^/{artifact-type}/{name}

Decision Records

ucxl://council-{id}:{role}@DistOS:{subsystem}/*^/decisions/{decision-id}.md

Research Artifacts

ucxl://council-{id}:researcher@DistOS:{subsystem}/*^/research/{topic}.md

Formal Specifications

ucxl://council-{id}:verifier@DistOS:{subsystem}/*^/specs/{component}.tla

Narrative Summaries (Decision Archaeology)

ucxl://council-arch:narrator@DistOS:decision-history/*^/narratives/{period}-summary.md

Cross-Council Integration

ucxl://council-synth:synthesizer@DistOS:integration/*^/conflicts/{council-a}-vs-{council-b}.md
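The templates above can be filled mechanically. The sketch below is one possible way to do that, under two assumptions that the constitution does not state: `{id}` is the part of the Council ID after the `council-` prefix (for example `sched`), and the `*^` segment is copied verbatim. The function names and all example values are hypothetical.

```python
def council_address(council_id: str, role: str, subsystem: str,
                    artifact_type: str, name: str) -> str:
    """Fill the general council-artifact template.

    Assumes council_id is the suffix after "council-"; the "*^" segment
    is reproduced exactly as it appears in the templates.
    """
    return (f"ucxl://council-{council_id}:{role}@DistOS:{subsystem}"
            f"/*^/{artifact_type}/{name}")


def decision_record_address(council_id: str, role: str, subsystem: str,
                            decision_id: str) -> str:
    """Decision Records live under decisions/ with an .md suffix."""
    return council_address(council_id, role, subsystem,
                           "decisions", f"{decision_id}.md")


def research_address(council_id: str, subsystem: str, topic: str) -> str:
    """Research artifacts use the fixed role 'researcher'."""
    return council_address(council_id, "researcher", subsystem,
                           "research", f"{topic}.md")


def spec_address(council_id: str, subsystem: str, component: str) -> str:
    """Formal specifications use the fixed role 'verifier' and a .tla suffix."""
    return council_address(council_id, "verifier", subsystem,
                           "specs", f"{component}.tla")


if __name__ == "__main__":
    # Hypothetical subsystem and component names, for illustration only.
    print(spec_address("fault", "fault-tolerance", "failure-detector"))
```

Routing every helper through `council_address` keeps the shared prefix in one place, so a change to the general template would propagate to the specialised forms.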

5. Lifecycle Phases

Phase 1: Research & Survey (Days 1-3)

  • Each council surveys existing literature, systems, and approaches
  • Research artifacts published to DHT with UCXL addresses
  • Decision Archaeology agents begin tracking from day one

Phase 2: Architecture & Trade-offs (Days 3-6)

  • Councils propose architectural options with formal trade-off analysis
  • Inter-Council Synthesis identifies conflicts and dependencies
  • Key architectural decisions recorded as DRs

Phase 3: Formal Specification (Days 6-10)

  • TLA+/Alloy specifications written for each subsystem
  • Verification council model-checks specs for safety and liveness
  • QA council designs conformance test suites

Phase 4: Integration & Review (Days 10-12)

  • Cross-council integration review
  • Conflict resolution via the synthesis council
  • API surface finalised

Phase 5: Documentation & Narrative (Days 12-14)

  • Complete specification document assembled
  • Decision archaeology produces human-readable narrative of entire project
  • Final UCXL navigability audit
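The five phases can be written down as data, which makes the schedule easy to query. This is a sketch under one assumption: the day ranges are treated as inclusive, so a boundary day such as day 3 belongs to both the phase that ends and the phase that begins. The constitution does not say how boundary days are assigned, and the names `Phase`, `PHASES`, and `phases_on` are illustrative.

```python
from typing import NamedTuple


class Phase(NamedTuple):
    number: int
    name: str
    first_day: int
    last_day: int


# Day ranges transcribed from the phase headings.
PHASES = [
    Phase(1, "Research & Survey", 1, 3),
    Phase(2, "Architecture & Trade-offs", 3, 6),
    Phase(3, "Formal Specification", 6, 10),
    Phase(4, "Integration & Review", 10, 12),
    Phase(5, "Documentation & Narrative", 12, 14),
]


def phases_on(day: int) -> list[Phase]:
    """Return every phase whose inclusive day range contains `day`."""
    return [p for p in PHASES if p.first_day <= day <= p.last_day]
```

Under this reading, `[p.number for p in phases_on(3)]` gives `[1, 2]`, while day 8 falls only in phase 3 and day 15 falls in no phase.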

6. Success Criteria

  1. Completeness: Formal specifications exist for all 6 core subsystems
  2. Verification: At least 3 subsystems have model-checked TLA+ specs with proven invariants
  3. Traceability: Any specification decision can be traced back through UCXL to the research that motivated it
  4. Human Navigability: A person unfamiliar with the project can, using only UCXL addresses and the archaeology narratives, understand why a given design decision was made
  5. Coherence: The synthesis council has resolved all identified cross-council conflicts
  6. Scale Proof: The system successfully coordinated 500+ agents across 10+ concurrent councils
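The quantitative criteria (1, 2, 5, and 6) reduce to simple threshold checks, sketched below. Criteria 3 and 4 are qualitative and are not modelled. The `ProjectOutcome` fields and the `quantitative_criteria` function are hypothetical names chosen for this example; only the thresholds come from the list above.

```python
from dataclasses import dataclass


@dataclass
class ProjectOutcome:
    """Hypothetical end-of-project tallies; field names are illustrative."""
    subsystems_with_formal_spec: int
    subsystems_model_checked: int
    open_cross_council_conflicts: int
    agents_coordinated: int
    concurrent_councils: int


def quantitative_criteria(o: ProjectOutcome) -> dict[str, bool]:
    """Evaluate the numeric success criteria against an outcome."""
    return {
        # Criterion 1: formal specs for all 6 core subsystems.
        "completeness": o.subsystems_with_formal_spec >= 6,
        # Criterion 2: at least 3 subsystems with model-checked specs.
        "verification": o.subsystems_model_checked >= 3,
        # Criterion 5: no identified cross-council conflict left unresolved.
        "coherence": o.open_cross_council_conflicts == 0,
        # Criterion 6: 500+ agents across 10+ concurrent councils.
        "scale_proof": o.agents_coordinated >= 500 and o.concurrent_councils >= 10,
    }
```

For instance, an outcome with six specified subsystems, four model-checked specs, no open conflicts, 750 agents, and 13 councils satisfies all four checks, while dropping to two model-checked specs fails only the verification check.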

7. Dependencies

  • CHORUS v0.5.5+ with leader election and P2P mesh
  • SLURP with Go port of storage, resolver, and temporal graph
  • WHOOSH MVP with council formation and consensus
  • BUBBLE for decision walkback queries
  • BACKBEAT for distributed timing coordination
  • Weka FS access on resetdata.ai platform
  • LLM access via resetdata.ai API (Hopper/Blackwell inference)

8. Council Design Briefs

Each council has a detailed design brief in docs/distos/councils/: