Initial DistOS project constitution and council design briefs
12 council design briefs for a distributed OS specification project targeting a 1024-node Hopper/Grace/Blackwell GPU cluster with the Weka parallel filesystem.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
160
PROJECT-CONSTITUTION.md
Normal file
@@ -0,0 +1,160 @@
# DistOS: Distributed Operating System for Heterogeneous GPU Clusters

## Project Constitution

**Project ID:** `DistOS`
**UCXL Base:** `ucxl://*:*@DistOS:*`
**Target Platform:** 1024-node Hopper/Grace/Blackwell cluster with Weka parallel filesystem
**Created:** 2026-02-24
**Status:** Constitution Phase

---

## 1. Mission

Design a comprehensive, formally specified distributed operating system optimised for heterogeneous GPU clusters. The system must manage scheduling, memory, networking, fault tolerance, security, and observability across up to 1024 nodes equipped with NVIDIA Hopper and Blackwell GPUs and Grace CPUs, backed by the Weka parallel filesystem.

This project serves a dual purpose:

1. **Primary:** Produce a rigorous, verifiable specification for a novel distributed OS
2. **Meta:** Demonstrate that ~1000 coordinated CHORUS agents can collaboratively solve a problem of this complexity, with every decision traceable by humans via UCXL

## 2. Guiding Principles

- **Formal First:** Every subsystem must have a formal specification (TLA+, Alloy, or equivalent) before implementation sketches
- **Decision Provenance:** Every architectural choice must be recorded as a Decision Record (DR) in the DHT with full UCXL addressing, including alternatives considered and rationale
- **Cross-Council Coherence:** No council operates in isolation; integration points must be explicitly defined and tracked
- **Human Navigability:** A human must be able to follow any decision chain from final spec back to initial research via UCXL temporal navigation
- **Self-Referential Awareness:** Agents should recognise they are designing a system they would ideally run on, and leverage that perspective
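
The Decision Provenance principle names three things a Decision Record must carry: a UCXL address, the alternatives considered, and the rationale. A minimal sketch of how such a record could be represented and checked follows. The class, its field names, the `architect` role, the `scheduler` subsystem, and the `dr-0001` identifier are all hypothetical; this constitution does not define a DR schema.

```python
from dataclasses import dataclass, field


@dataclass
class DecisionRecord:
    """Hypothetical shape of a Decision Record (DR); all field names are assumptions."""

    address: str    # full UCXL address of the record
    decision: str   # the architectural choice that was made
    rationale: str  # why this option was chosen
    alternatives: list[str] = field(default_factory=list)  # options considered

    def problems(self) -> list[str]:
        """List the ways this record falls short of the Decision Provenance principle."""
        issues = []
        if not self.address.startswith("ucxl://"):
            issues.append("address is not a UCXL address")
        if not self.alternatives:
            issues.append("no alternatives recorded")
        if not self.rationale.strip():
            issues.append("no rationale recorded")
        return issues


# Placeholder values only; none of these come from a real decision.
record = DecisionRecord(
    address="ucxl://council-sched:architect@DistOS:scheduler/*^/decisions/dr-0001.md",
    decision="placeholder decision text",
    rationale="placeholder rationale",
    alternatives=["placeholder alternative"],
)
print(record.problems())
```

An empty list from `problems()` means the record carries all three elements; it says nothing about the quality of the reasoning, which still needs review.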

## 3. Council Structure

### Research & Design Councils

| Council ID | Domain | ~Agents | Brief |
|------------|--------|---------|-------|
| `council-sched` | Process Scheduling | ~80 | Heterogeneous GPU/CPU scheduling, workload placement, fair queuing |
| `council-mem` | Distributed Memory | ~80 | Memory model, Weka FS integration, caching, coherence |
| `council-net` | Network Stack | ~60 | P2P mesh, RDMA, overlay networks, transport protocols |
| `council-fault` | Fault Tolerance | ~60 | Consensus, failure detection, recovery, Byzantine resilience |
| `council-sec` | Security Model | ~60 | Capability-based security, isolation, attestation, key management |
| `council-telemetry` | Resource Accounting | ~40 | Metering, telemetry, cost attribution, SLO enforcement |

### Verification & Quality Councils

| Council ID | Domain | ~Agents | Brief |
|------------|--------|---------|-------|
| `council-verify` | Formal Verification | ~80 | TLA+ specs, model checking, invariant proofs, liveness properties |
| `council-qa` | Adversarial Testing | ~60 | Fuzzing, chaos engineering, fault injection, spec conformance |

### Integration & Communication Councils

| Council ID | Domain | ~Agents | Brief |
|------------|--------|---------|-------|
| `council-api` | API & Developer Experience | ~40 | System call interface, SDK design, ergonomics, POSIX compatibility |
| `council-synth` | Inter-Council Synthesis | ~100 | Cross-cutting conflict resolution, architectural coherence, trade-off analysis |
| `council-docs` | Specification Writing | ~40 | Technical writing, standards formatting, reference documentation |
| `council-arch` | Decision Archaeology | ~40 | UCXL history traversal, decision narrative generation, human-readable summaries |

### Meta-Council

| Council ID | Domain | ~Agents | Brief |
|------------|--------|---------|-------|
| `council-meta` | Project Governance | ~10 | Overall coordination, milestone tracking, council health, escalation |

**Total:** ~750-800 agents in councils + ~200 unassigned agents available for dynamic council formation
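
As a cross-check on the total, the approximate headcounts from the four tables can be summed directly. This is a worked-arithmetic sketch; the dictionary is simply the `~Agents` column transcribed, and the figures remain approximate.

```python
# Approximate agent counts transcribed from the council tables in section 3.
COUNCIL_AGENTS = {
    "council-sched": 80,
    "council-mem": 80,
    "council-net": 60,
    "council-fault": 60,
    "council-sec": 60,
    "council-telemetry": 40,
    "council-verify": 80,
    "council-qa": 60,
    "council-api": 40,
    "council-synth": 100,
    "council-docs": 40,
    "council-arch": 40,
    "council-meta": 10,
}
UNASSIGNED_AGENTS = 200  # pool reserved for dynamic council formation

assigned = sum(COUNCIL_AGENTS.values())
overall = assigned + UNASSIGNED_AGENTS
print(f"{len(COUNCIL_AGENTS)} councils, {assigned} assigned, {overall} overall")
# prints "13 councils, 750 assigned, 950 overall"
```

The nominal sum of 750 sits at the lower end of the stated ~750-800 range, and 950 overall is consistent with the ~1000 agents named in the mission.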

## 4. UCXL Addressing Conventions

### Council Artifacts

```
ucxl://council-{id}:{role}@DistOS:{subsystem}/*^/{artifact-type}/{name}
```

### Decision Records

```
ucxl://council-{id}:{role}@DistOS:{subsystem}/*^/decisions/{decision-id}.md
```

### Research Artifacts

```
ucxl://council-{id}:researcher@DistOS:{subsystem}/*^/research/{topic}.md
```

### Formal Specifications

```
ucxl://council-{id}:verifier@DistOS:{subsystem}/*^/specs/{component}.tla
```

### Narrative Summaries (Decision Archaeology)

```
ucxl://council-arch:narrator@DistOS:decision-history/*^/narratives/{period}-summary.md
```

### Cross-Council Integration

```
ucxl://council-synth:synthesizer@DistOS:integration/*^/conflicts/{council-a}-vs-{council-b}.md
```
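
The templates above all share one shape, so they can be filled by a single helper plus thin wrappers. The following is a minimal sketch, not a UCXL library: the function names are hypothetical, the `/*^/` segment is copied verbatim from the templates without interpreting it, and the subsystem and topic values in the usage line are placeholders.

```python
PROJECT = "DistOS"


def council_artifact(council_id: str, role: str, subsystem: str,
                     artifact_type: str, name: str) -> str:
    """Fill the general council-artifact template from section 4."""
    return (f"ucxl://council-{council_id}:{role}@{PROJECT}:{subsystem}"
            f"/*^/{artifact_type}/{name}")


def decision_record(council_id: str, role: str, subsystem: str,
                    decision_id: str) -> str:
    """Address of a Decision Record (stored as Markdown)."""
    return council_artifact(council_id, role, subsystem,
                            "decisions", f"{decision_id}.md")


def research_artifact(council_id: str, subsystem: str, topic: str) -> str:
    """Address of a research artifact; the role is fixed to 'researcher'."""
    return council_artifact(council_id, "researcher", subsystem,
                            "research", f"{topic}.md")


def formal_spec(council_id: str, subsystem: str, component: str) -> str:
    """Address of a TLA+ specification; the role is fixed to 'verifier'."""
    return council_artifact(council_id, "verifier", subsystem,
                            "specs", f"{component}.tla")


def narrative_summary(period: str) -> str:
    """Address of a Decision Archaeology narrative for one period."""
    return council_artifact("arch", "narrator", "decision-history",
                            "narratives", f"{period}-summary.md")


def conflict_record(council_a: str, council_b: str) -> str:
    """Address of a cross-council conflict document."""
    return council_artifact("synth", "synthesizer", "integration",
                            "conflicts", f"{council_a}-vs-{council_b}.md")


# Placeholder subsystem and topic names, for illustration only.
print(research_artifact("mem", "memory", "caching-survey"))
```

Routing every wrapper through `council_artifact` keeps the scheme, project name, and `/*^/` segment in one place, so a change to the convention needs only one edit.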

## 5. Lifecycle Phases

### Phase 1: Research & Survey (Days 1-3)

- Each council surveys existing literature, systems, and approaches
- Research artifacts published to DHT with UCXL addresses
- Decision Archaeology agents begin tracking from day one

### Phase 2: Architecture & Trade-offs (Days 3-6)

- Councils propose architectural options with formal trade-off analysis
- Inter-Council Synthesis identifies conflicts and dependencies
- Key architectural decisions recorded as DRs

### Phase 3: Formal Specification (Days 6-10)

- TLA+/Alloy specifications written for each subsystem
- Verification council model-checks specs for safety and liveness
- QA council designs conformance test suites

### Phase 4: Integration & Review (Days 10-12)

- Cross-council integration review
- Conflict resolution via synthesis councils
- API surface finalised

### Phase 5: Documentation & Narrative (Days 12-14)

- Complete specification document assembled
- Decision archaeology produces human-readable narrative of entire project
- Final UCXL navigability audit
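
The five phases can be captured as a small schedule table. As written, adjacent phases share their boundary day (day 3 appears in both Phase 1 and Phase 2, and so on); the sketch below assumes the ranges are inclusive and therefore reports both phases on those days. The type and function names are hypothetical.

```python
from typing import NamedTuple


class Phase(NamedTuple):
    number: int
    name: str
    first_day: int
    last_day: int


# Day ranges transcribed from section 5; treated as inclusive (an assumption).
PHASES = [
    Phase(1, "Research & Survey", 1, 3),
    Phase(2, "Architecture & Trade-offs", 3, 6),
    Phase(3, "Formal Specification", 6, 10),
    Phase(4, "Integration & Review", 10, 12),
    Phase(5, "Documentation & Narrative", 12, 14),
]


def phases_on(day: int) -> list[str]:
    """Return the name of every phase whose day range includes `day`."""
    return [p.name for p in PHASES if p.first_day <= day <= p.last_day]


print(phases_on(3))  # boundary day: two phases overlap
print(phases_on(8))  # mid-phase day: exactly one phase
```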

## 6. Success Criteria

1. **Completeness:** Formal specifications exist for all 6 core subsystems
2. **Verification:** At least 3 subsystems have model-checked TLA+ specs with proven invariants
3. **Traceability:** Any specification decision can be traced back through UCXL to the research that motivated it
4. **Human Navigability:** A person unfamiliar with the project can, using only UCXL addresses and the archaeology narratives, understand why a given design decision was made
5. **Coherence:** The synthesis council has resolved all identified cross-council conflicts
6. **Scale Proof:** The system successfully coordinated 500+ agents across 10+ concurrent councils
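
Four of the criteria above (1, 2, 5, and 6) carry explicit thresholds and could be checked mechanically from end-of-project tallies. The sketch below assumes such tallies are available in some form; the `ProjectSnapshot` class and its field names are hypothetical. Criteria 3 and 4 depend on human judgement and are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class ProjectSnapshot:
    """Hypothetical end-of-project tallies; every field name is an assumption."""

    subsystems_with_formal_spec: int
    subsystems_model_checked: int
    unresolved_conflicts: int
    agents_coordinated: int
    concurrent_councils: int


def check_quantitative_criteria(s: ProjectSnapshot) -> dict[str, bool]:
    """Evaluate the success criteria that have numeric thresholds."""
    return {
        "completeness": s.subsystems_with_formal_spec >= 6,
        "verification": s.subsystems_model_checked >= 3,
        "coherence": s.unresolved_conflicts == 0,
        "scale_proof": s.agents_coordinated >= 500 and s.concurrent_councils >= 10,
    }


# Illustrative numbers only, not project results.
example = ProjectSnapshot(
    subsystems_with_formal_spec=6,
    subsystems_model_checked=2,
    unresolved_conflicts=0,
    agents_coordinated=750,
    concurrent_councils=13,
)
print(check_quantitative_criteria(example))
```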

## 7. Dependencies

- **CHORUS v0.5.5+** with leader election and P2P mesh
- **SLURP** with Go port of storage, resolver, and temporal graph
- **WHOOSH** MVP with council formation and consensus
- **BUBBLE** for decision walkback queries
- **BACKBEAT** for distributed timing coordination
- **Weka FS** access on resetdata.ai platform
- **LLM access** via resetdata.ai API (Hopper/Blackwell inference)
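
A minimum-version requirement such as "CHORUS v0.5.5+" is easy to check incorrectly with plain string comparison, which orders `v0.5.10` before `v0.5.5`. A minimal sketch of a numeric comparison follows; it assumes versions are purely numeric dotted strings with an optional leading `v`, which this document does not confirm, and the function names are hypothetical.

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn 'v0.5.5' or '0.5.5' into (0, 5, 5). Assumes numeric dotted parts only."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def meets_minimum(installed: str, minimum: str = "v0.5.5") -> bool:
    """True when `installed` is at least `minimum`, compared component by component."""
    return parse_version(installed) >= parse_version(minimum)


print(meets_minimum("v0.5.10"))  # numeric comparison
print("v0.5.10" >= "v0.5.5")     # naive string comparison, for contrast
```

Pre-release or build suffixes (for example `-rc1`) would raise a `ValueError` here and would need explicit handling.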

## 8. Council Design Briefs

Each of the 12 research, verification, and integration councils has a detailed design brief in `docs/distos/councils/`:

- [Process Scheduling](councils/01-process-scheduling.md)
- [Distributed Memory](councils/02-distributed-memory.md)
- [Network Stack](councils/03-network-stack.md)
- [Fault Tolerance](councils/04-fault-tolerance.md)
- [Security Model](councils/05-security-model.md)
- [Resource Accounting](councils/06-resource-accounting.md)
- [Formal Verification](councils/07-formal-verification.md)
- [API & Developer Experience](councils/08-api-surface.md)
- [Inter-Council Synthesis](councils/09-inter-council-synthesis.md)
- [QA & Adversarial Testing](councils/10-qa-adversarial-testing.md)
- [Specification Writing](councils/11-documentation.md)
- [Decision Archaeology](councils/12-decision-archaeology.md)