12 council design briefs for distributed OS specification project targeting 1024-node Hopper/Grace/Blackwell GPU cluster with Weka parallel filesystem. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
# Council Design Brief: Quality Assurance & Adversarial Testing

---

## Council Identification

| Field | Value |
|-------|-------|
| **Council ID** | `council-qa` |
| **Mission** | Adversarially validate the entire DistOS specification and implementation surface through systematic fuzz testing, chaos engineering, Byzantine fault simulation, and property-based verification, ensuring that what the formal specification says is provably correct (council-verify's domain) actually matches what the system does under hostile, degraded, and pathological conditions. |
| **UCXL Base Address** | `ucxl://council-qa:*@DistOS:qa/*^/` |
| **Agent Count** | ~60 agents |
| **Operates From** | Day 3 (Architecture phase) through Day 14 (Documentation close) |

---

## Scope and Responsibilities

Council-qa owns the gap between proof and reality. Formal verification proves that a model is internally consistent; adversarial testing checks whether that model accurately describes the system as built. This council is responsible for:

- Designing and executing a comprehensive adversarial test suite against all DistOS subsystems
- Validating that formal specifications produced by council-verify match observable system behaviour under both normal and failure conditions
- Discovering specification ambiguities by finding inputs or conditions that produce unexpected or underspecified behaviour
- Stress-testing all published API surfaces (received from council-api) for correctness, safety, and resilience
- Simulating Byzantine, crash-recovery, and omission faults across the 1024-node GPU cluster topology
- Injecting faults at the hardware layer (GPU errors, NVLink degradation, Weka FS unavailability, InfiniBand partition events) and observing correctness of system response
- Benchmarking performance under adversarial scheduling pressure, memory fragmentation, and concurrent failure scenarios
- Performing structured penetration testing of the DistOS security model
- Running deterministic simulation testing to reproduce races and transient faults at will
- Reporting all discovered defects, ambiguities, and underspecifications back to originating councils with UCXL-addressable evidence chains

Council-qa does **not** write specifications or make architectural decisions. Its authority is to raise issues and block deliverable acceptance until those issues are resolved or formally accepted as known limitations.

---

## Research Domains

### 1. Distributed Systems Fault Injection and Chaos Engineering

**Core framework:** Jepsen (Kyle Kingsbury), the canonical methodology for testing distributed system consistency under network partitions, clock skew, and process crashes. DistOS must pass Jepsen-equivalent analysis for all claims of linearisability, serialisability, or causal consistency made in the formal spec.

Key Jepsen analyses to replicate:
- Partition healing with in-flight GPU kernel state
- Split-brain detection in the consensus layer
- Clock skew effects on distributed scheduling decisions
- Write visibility after node rejoins

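As a rough illustration of the "write visibility after node rejoins" analysis, the sketch below scans a recorded operation history for stale reads. It assumes a single register and a history whose operations do not overlap in time, which is far simpler than a real Jepsen linearisability check; the history format and the name `stale_reads` are hypothetical.

```python
def stale_reads(history):
    """Return reads that did not observe the latest acknowledged write.

    Assumption: `history` is a list of (op, node, value) tuples in real-time
    completion order, with no two operations overlapping.
    """
    latest = None
    violations = []
    for index, (op, node, value) in enumerate(history):
        if op == "write":
            latest = value
        elif op == "read" and value != latest:
            violations.append((index, node, value, latest))
    return violations


history = [
    ("write", "n1", 1),
    ("read", "n2", 1),
    ("write", "n1", 2),   # n3 is partitioned away while this write is acknowledged
    ("read", "n3", 1),    # n3 rejoins and serves the old value
    ("read", "n2", 2),
]
violations = stale_reads(history)
print(violations)
```

Handling concurrent operations requires searching for a valid linearisation, which this sketch deliberately does not attempt.
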
**Netflix Chaos Monkey / Simian Army:** production chaos engineering methodology. Adapt for pre-production specification testing: if chaos can be injected into the architecture model, it should be injected into the formal model before any node is touched.

**Google DiRT (Disaster Recovery Testing):** structured disaster scenario playbooks. Council-qa will produce DiRT-equivalent playbooks for DistOS, covering datacenter-level failure scenarios for a 1024-node cluster.

Key papers:
- Kingsbury, K. (2013–present). Jepsen analysis series. https://jepsen.io
- Gunawi, H. S., et al. (2011). "FATE and DESTINI: A Framework for Cloud Recovery Testing." NSDI 2011.
- Basiri, A., et al. (2016). "Chaos Engineering." IEEE Software 33(3).
- Leesatapornwongsa, T., et al. (2016). "TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems." ASPLOS 2016.

### 2. Property-Based and Specification-Conformance Testing

**Hypothesis (Python) / QuickCheck (Haskell):** shrinking-based property testing. The council will define properties derived from formal specifications and use automated generators to find minimal failing counterexamples.

Properties to encode:
- Scheduler decisions must be deterministic given the same system state snapshot
- Memory allocation must never produce aliased physical GPU memory addresses across tenants
- All filesystem operations on Weka must satisfy the POSIX consistency semantics claimed in the spec
- Security policy decisions must be monotone (granting more resources never reduces security guarantees)

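To make the memory-aliasing property concrete, here is a minimal sketch that drives a toy allocator with seeded random operations and checks the invariant after every step. `ToyAllocator` is a hypothetical stand-in, not the DistOS memory manager, and the loop uses only the standard library; a real suite would express the same property with Hypothesis or QuickCheck to gain automatic shrinking.

```python
import random
from itertools import combinations


class ToyAllocator:
    """Hypothetical first-fit allocator over a flat address range."""

    def __init__(self, size):
        self.size = size
        self.live = {}          # handle -> (tenant, start, length)
        self.next_handle = 0

    def alloc(self, tenant, length):
        cursor = 0
        # Walk live ranges in address order and take the first gap that fits.
        for _, start, ln in sorted(self.live.values(), key=lambda r: r[1]):
            if start - cursor >= length:
                break
            cursor = max(cursor, start + ln)
        if cursor + length > self.size:
            return None         # out of memory
        handle = self.next_handle
        self.next_handle += 1
        self.live[handle] = (tenant, cursor, length)
        return handle

    def free(self, handle):
        self.live.pop(handle, None)


def no_cross_tenant_aliasing(ranges):
    """Property: no two live ranges owned by different tenants overlap."""
    for (t1, s1, l1), (t2, s2, l2) in combinations(ranges, 2):
        if t1 != t2 and s1 < s2 + l2 and s2 < s1 + l1:
            return False
    return True


def check_property(seed, steps=100):
    """Return (seed, step) for the first violation, or None if none is found."""
    rng = random.Random(seed)
    allocator = ToyAllocator(size=1024)
    handles = []
    for step in range(steps):
        if handles and rng.random() < 0.4:
            allocator.free(handles.pop(rng.randrange(len(handles))))
        else:
            handle = allocator.alloc(rng.randrange(4), rng.randint(1, 64))
            if handle is not None:
                handles.append(handle)
        if not no_cross_tenant_aliasing(list(allocator.live.values())):
            return (seed, step)
    return None


failures = [r for r in (check_property(seed) for seed in range(20)) if r]
print("counterexamples:", failures)
```

Because every run is keyed by its seed, any counterexample can be replayed by calling `check_property` with the reported seed.
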
**TLA+ model checking:** in coordination with council-verify, council-qa will encode system properties as TLA+ invariants and run TLC model checking over the state space of the DistOS scheduler and memory manager.

Key papers:
- Claessen, K., & Hughes, J. (2000). "QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs." ICFP 2000.
- MacIver, D. R., et al. (2019). "Hypothesis: A new approach to property-based testing." JOSS 4(43).
- Lamport, L. (2002). "Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers." Addison-Wesley.

### 3. Fuzz Testing of API Surfaces

**AFL++ / libFuzzer:** coverage-guided fuzzing of all DistOS API endpoints. Every API defined by council-api is a fuzzing target.

**Syzkaller:** adapted for DistOS syscall-equivalent interfaces. The kernel-level scheduling and memory interfaces will be fuzz-tested using a syzkaller-inspired grammar-aware fuzzer.

**gRPC/Protobuf fuzzing:** for any inter-node RPC interfaces, schema-aware fuzzing to discover serialisation edge cases, integer overflow in field handling, and unintended state transitions.

Focus areas:
- Malformed job submission to the scheduler
- Invalid or boundary-condition memory allocation requests
- Weka filesystem path traversal and permission edge cases
- Security token manipulation and replay attacks
- Malformed consensus messages (Raft/Paxos variant used by DistOS)

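The "malformed job submission" focus area can be sketched with a small mutation fuzzer. Everything here is illustrative: the wire format and `parse_job_submission` are invented for the example, and the fuzzer mutates blindly rather than using coverage feedback as AFL++ or libFuzzer would. The oracle is simple: a `ValueError` is the documented rejection path, and any other exception counts as a finding.

```python
import random
import struct


def parse_job_submission(data: bytes) -> dict:
    """Hypothetical wire format: u16 gpu_count | u8 name_len | name (UTF-8)."""
    if len(data) < 3:
        raise ValueError("truncated header")
    gpu_count, name_len = struct.unpack(">HB", data[:3])
    if not 1 <= gpu_count <= 8192:
        raise ValueError("gpu_count out of range")
    name = data[3:3 + name_len]
    if len(name) != name_len or len(data) != 3 + name_len:
        raise ValueError("length mismatch")
    try:
        return {"gpus": gpu_count, "name": name.decode("utf-8")}
    except UnicodeDecodeError as exc:
        raise ValueError("name is not valid UTF-8") from exc


def mutate(seed_input: bytes, rng: random.Random) -> bytes:
    """Apply one to four random bit flips, byte insertions, or byte deletions."""
    data = bytearray(seed_input)
    for _ in range(rng.randint(1, 4)):
        choice = rng.randrange(3)
        if choice == 0 and data:
            data[rng.randrange(len(data))] ^= 1 << rng.randrange(8)
        elif choice == 1:
            data.insert(rng.randint(0, len(data)), rng.randrange(256))
        elif data:
            del data[rng.randrange(len(data))]
    return bytes(data)


def fuzz(target, seed_input, iterations=2000, seed=0):
    """Return (input, exception) pairs for every undocumented failure."""
    rng = random.Random(seed)
    crashes = []
    for _ in range(iterations):
        case = mutate(seed_input, rng)
        try:
            target(case)
        except ValueError:
            pass                        # documented rejection path
        except Exception as exc:        # anything else is a finding
            crashes.append((case, repr(exc)))
    return crashes


valid = struct.pack(">HB", 8, 5) + b"train"
crashes = fuzz(parse_job_submission, valid)
print("findings:", len(crashes))
```
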
Key papers:
- Zalewski, M. (2014). American Fuzzy Lop technical whitepaper.
- Serebryany, K. (2017). "OSS-Fuzz - Google's continuous fuzzing service for open source software." USENIX Security 2017.
- Vishnyakov, A., et al. (2022). "SYDR-Fuzz: Continuous Hybrid Fuzzing and Dynamic Analysis for Security Development Lifecycle." ISPRAS Open 2022.

### 4. Byzantine Fault Simulation

For a 1024-node cluster intended for production AI workloads, Byzantine faults (nodes sending malicious, corrupt, or contradictory messages) are a realistic adversary model.

Council-qa will simulate:
- Nodes sending conflicting scheduling state to different peers
- GPU memory controllers reporting incorrect allocation metadata
- Weka FS nodes returning inconsistent directory listings
- Consensus participants casting votes that contradict their local state

**BFT protocol validation:** Any Byzantine-fault-tolerant consensus mechanism in DistOS must tolerate f faulty nodes where N >= 3f + 1. For a 1024-node cluster the bound is met with up to f = 341 faulty nodes (3 × 341 + 1 = 1024), and validating behaviour at that scale requires rigorous simulation.

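The arithmetic behind the N >= 3f + 1 bound can be checked directly. The sketch below computes the largest tolerable f for a given cluster size and the matching quorum size, using the standard rule that any two quorums must intersect in at least f + 1 nodes; the function names are illustrative and not part of any DistOS API.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f that satisfies n >= 3f + 1."""
    if n < 1:
        raise ValueError("cluster must have at least one node")
    return (n - 1) // 3


def bft_quorum(n: int) -> int:
    """Smallest quorum q such that two quorums share at least f + 1 nodes.

    From 2q - n >= f + 1 we get q = ceil((n + f + 1) / 2).
    When n = 3f + 1 this is the familiar 2f + 1.
    """
    f = max_byzantine_faults(n)
    return (n + f + 2) // 2


for n in (4, 7, 1024):
    print(n, max_byzantine_faults(n), bft_quorum(n))
```
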
Key papers:
- Lamport, L., Shostak, R., & Pease, M. (1982). "The Byzantine Generals Problem." ACM TOPLAS 4(3).
- Castro, M., & Liskov, B. (1999). "Practical Byzantine Fault Tolerance." OSDI 1999.
- Yin, M., et al. (2019). "HotStuff: BFT Consensus with Linearity and Responsiveness." PODC 2019.

### 5. GPU-Specific Error Injection

**NVIDIA GPU Error Injection (NVML / DCGM):** inject ECC errors, NVLink faults, and CUDA context corruption. DistOS must respond correctly to all hardware-signalled errors from the GPU fabric.

**ROCm SMI fault injection:** equivalent capabilities for AMD GPUs if mixed hardware is present.

Scenarios:
- Single-bit ECC error during active kernel execution: does DistOS checkpoint, migrate, or terminate the affected job?
- NVLink link failure mid-allreduce: does the collective communication layer correctly abort and reschedule?
- GPU memory over-temperature throttling: does the scheduler correctly rebalance load?
- CUDA context loss: does the runtime correctly clean up tenant resources?

Key references:
- NVIDIA. (2024). Data Center GPU Manager (DCGM) User Guide. NVIDIA Corporation.
- NVIDIA. (2023). NVML API Reference Guide. NVIDIA Corporation.
- Hochschild, P., et al. (2021). "Cores that don't count." HotOS 2021. (Silent data corruption in datacenter CPUs; the same failure class applies to GPUs.)

### 6. Weka Parallel Filesystem Adversarial Testing

Weka FS is a high-performance parallel filesystem with specific consistency semantics. DistOS makes claims about filesystem semantics that must be tested against Weka's actual behaviour.

Test scenarios:
- Weka cluster node failure during active checkpoint write: does DistOS detect partial writes?
- Concurrent checkpoint reads and writes from 512 nodes: does the system observe stale data?
- Weka metadata server overload: how does DistOS degrade gracefully?
- Network partition between Weka clients and Weka backend: does DistOS correctly pause or abort affected operations?

Key references:
- Weka.io. (2024). WekaFS Architecture and Internals. Weka Technical Documentation.
- Carns, P., et al. (2011). "Understanding and Improving Computational Science Storage Access through Continuous Characterization." MSST 2011.

### 7. Deterministic Simulation Testing

**FoundationDB simulation testing:** among the most rigorous known approaches to distributed systems testing. The entire system runs inside a deterministic simulator where all nondeterminism (network delays, disk I/O, timer callbacks) is controlled by a seeded random number generator, so any failing test can be reproduced exactly.

Council-qa will specify a deterministic simulation harness for the DistOS core components (scheduler, memory manager, consensus layer) that:
- Replaces all OS-level nondeterminism with simulated equivalents
- Records all random seeds for reproducible failure replay
- Supports time-travel debugging (roll back to a pre-failure state and re-execute)

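A minimal sketch of the idea, assuming a single-process event loop: message delays and drops are drawn from one seeded generator, so the same seed always yields the same trace. The `Simulation` class and the ping-pong protocol are hypothetical illustrations and are much simpler than the harness described above.

```python
import heapq
import random


class Simulation:
    """Event loop in which all nondeterminism comes from one seeded RNG."""

    def __init__(self, seed):
        self.seed = seed                  # recorded so failures can be replayed
        self.rng = random.Random(seed)
        self.now = 0.0                    # simulated clock, never wall time
        self.queue = []                   # (deliver_at, seq, dst, msg)
        self.seq = 0                      # tie-breaker for equal timestamps
        self.trace = []

    def send(self, dst, msg):
        if self.rng.random() < 0.1:       # simulated message loss
            self.trace.append((self.now, "drop", dst, msg))
            return
        delay = self.rng.uniform(0.001, 0.050)
        heapq.heappush(self.queue, (self.now + delay, self.seq, dst, msg))
        self.seq += 1

    def run(self, handler, max_events=1000):
        while self.queue and max_events > 0:
            self.now, _, dst, msg = heapq.heappop(self.queue)
            self.trace.append((self.now, "deliver", dst, msg))
            handler(self, dst, msg)
            max_events -= 1
        return self.trace


def ping_pong(sim, dst, msg):
    """Toy protocol: two nodes bounce a counter until it reaches 20."""
    if msg < 20:
        sim.send(1 - dst, msg + 1)


def run_once(seed):
    sim = Simulation(seed)
    sim.send(0, 0)
    return sim.run(ping_pong)


first = run_once(42)
replay = run_once(42)
print(len(first), first == replay)
```

Replaying a failure then amounts to rerunning with the recorded seed; rollback-style debugging could be layered on top by snapshotting simulator state, which this sketch does not attempt.
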
Key references:
- Zhou, J., et al. (2021). "FoundationDB: A Distributed Unbundled Transactional Key-Value Store." SIGMOD 2021.
- Kingsbury, K. (2018). "An Introduction to Jepsen." Strange Loop 2018.
- Alvaro, P., Rosen, J., & Hellerstein, J. M. (2015). "Lineage-driven Fault Injection." SIGMOD 2015.

### 8. Race Condition Detection

**ThreadSanitizer (TSan) / Helgrind:** data race detection in any shared-memory regions of DistOS.

**Concuerror / DPOR (Dynamic Partial Order Reduction):** systematic concurrency testing for Erlang/Elixir-style actor systems, adaptable to DistOS agent communication patterns.

**Deterministic concurrency testing:** enumerate all interleavings of concurrent operations in critical sections of the scheduler and memory allocator.

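Exhaustive interleaving enumeration can be illustrated on the classic lost-update race. The sketch below generates every order-preserving merge of two threads' operations on a shared counter and reports the schedules that break the expected result. It is a brute-force illustration with made-up thread programs; it applies no partial order reduction, so it only scales to very small critical sections.

```python
def interleavings(a, b):
    """Yield every merge of a and b that preserves each sequence's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute an unsynchronised read-then-increment on a shared counter."""
    shared = 0
    local = {}
    for thread, op in schedule:
        if op == "read":
            local[thread] = shared
        else:  # "write"
            shared = local[thread] + 1
    return shared


t1 = [("T1", "read"), ("T1", "write")]
t2 = [("T2", "read"), ("T2", "write")]

schedules = list(interleavings(t1, t2))
bad = [s for s in schedules if run(s) != 2]   # both increments should be visible
print(f"{len(bad)} of {len(schedules)} schedules lose an update")
```
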
---

## Agent Roles

| Role | Count | Responsibilities |
|------|-------|-----------------|
| **Chaos Engineers** | 12 | Design and execute Jepsen-style partition and failure scenarios; maintain chaos playbook library |
| **Fuzz Operators** | 10 | Run AFL++/libFuzzer/syzkaller campaigns against all API surfaces; triage and minimise crashes |
| **Property Testers** | 10 | Encode formal spec properties as Hypothesis/QuickCheck tests; maintain property library |
| **Byzantine Simulators** | 8 | Implement and execute Byzantine fault scenarios; validate BFT protocol correctness |
| **GPU Fault Injectors** | 8 | NVML/DCGM-based hardware fault injection; GPU error response validation |
| **Simulation Engineers** | 7 | Build and maintain the deterministic simulation harness; replay and minimise failures |
| **Performance Adversarialists** | 5 | Benchmarking under adversarial load; identify performance cliffs and scheduling pathologies |

**Total: 60 agents**

---

## Key Deliverables

| Deliverable | UCXL Address | Due Phase |
|-------------|--------------|-----------|
| Adversarial Test Strategy | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/spec/test-strategy` | Phase 2 |
| Chaos Engineering Playbook | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/runbook/chaos-playbook` | Phase 2 |
| Property Test Suite (formal spec conformance) | `ucxl://council-qa:property-tester@DistOS:qa/*^/test-suite/property-tests` | Phase 3 |
| API Fuzz Campaign Report | `ucxl://council-qa:fuzz-operator@DistOS:qa/*^/report/fuzz-campaign-api` | Phase 3 |
| Byzantine Fault Simulation Results | `ucxl://council-qa:byzantine-simulator@DistOS:qa/*^/report/byzantine-faults` | Phase 3 |
| GPU Error Injection Results | `ucxl://council-qa:gpu-fault-injector@DistOS:qa/*^/report/gpu-fault-injection` | Phase 3 |
| Deterministic Simulation Harness Spec | `ucxl://council-qa:simulation-engineer@DistOS:qa/*^/spec/deterministic-sim` | Phase 3 |
| Weka FS Adversarial Test Results | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/report/weka-adversarial` | Phase 4 |
| Race Condition Detection Report | `ucxl://council-qa:simulation-engineer@DistOS:qa/*^/report/race-conditions` | Phase 4 |
| Consolidated Defect Register | `ucxl://council-qa:*@DistOS:qa/*^/register/defects` | Continuous |
| Final QA Acceptance Report | `ucxl://council-qa:*@DistOS:qa/*^/report/final-acceptance` | Phase 5 |
| Performance Adversarial Benchmarks | `ucxl://council-qa:performance-adversarialist@DistOS:qa/*^/benchmark/adversarial-perf` | Phase 4 |

---

## Decision Points

### DQ-01: Acceptable Defect Threshold for Specification Acceptance
**Question:** What severity and count of open defects constitutes an acceptable state for the DistOS specification to be declared complete?

**Options:**
- A. Zero P0 defects, fewer than 10 P1 defects with documented mitigations
- B. Zero P0/P1 defects, unlimited P2 with tracking
- C. All discovered defects must be resolved or formally accepted with architectural rationale

**Implications:** Option C produces the highest-quality specification but may block delivery. Option A risks shipping known critical issues with mitigations that may not hold in production.

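The three DQ-01 options can be read as predicates over the open-defect register, which makes their differences easy to compare on a concrete register. The sketch below is one possible encoding; the record fields (`severity`, `mitigated`, `accepted`) and the function name are assumptions made for illustration, not a defined DistOS schema.

```python
def spec_acceptable(option: str, open_defects: list) -> bool:
    """Evaluate a DQ-01 option against the list of still-open defects.

    Assumption: resolved defects are not in the list; each entry is a dict
    with 'severity' ('P0', 'P1' or 'P2'), 'mitigated' and 'accepted' flags.
    """
    p0 = [d for d in open_defects if d["severity"] == "P0"]
    p1 = [d for d in open_defects if d["severity"] == "P1"]
    if option == "A":   # zero P0, fewer than 10 P1, each with a mitigation
        return not p0 and len(p1) < 10 and all(d["mitigated"] for d in p1)
    if option == "B":   # zero P0/P1, any number of tracked P2
        return not p0 and not p1
    if option == "C":   # everything still open must be formally accepted
        return all(d["accepted"] for d in open_defects)
    raise ValueError(f"unknown option: {option!r}")


register = [
    {"severity": "P1", "mitigated": True, "accepted": False},
    {"severity": "P2", "mitigated": False, "accepted": True},
]
results = {option: spec_acceptable(option, register) for option in "ABC"}
print(results)
```

On this sample register only Option A passes: the open P1 blocks Option B, and its missing formal acceptance blocks Option C.
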
### DQ-02: Scope of Byzantine Fault Tolerance Requirement
**Question:** Is DistOS required to tolerate Byzantine faults, or only crash-recovery faults?

**Options:**
- A. Crash-recovery (CFT) only: simpler protocols, lower overhead, sufficient for trusted datacenter hardware
- B. Byzantine fault tolerance for the consensus layer only
- C. Full BFT for all distributed components

**Implications:** Full BFT (Option C) requires significantly more complex protocols and larger replica groups: tolerating f faults takes N >= 3f + 1 nodes, versus 2f + 1 for crash fault tolerance. Given a 1024-node cluster with known hardware, CFT may be sufficient.

### DQ-03: Deterministic Simulation Scope
**Question:** Which DistOS components must support deterministic simulation testing?

**Options:**
- A. Core consensus and scheduling components only
- B. All components that touch shared state
- C. The entire DistOS software stack including GPU runtime interfaces

**Implications:** Option C provides the strongest testing guarantees but requires a simulation layer for every hardware interface including GPU APIs.

### DQ-04: Jepsen-Equivalent Validation Requirement
**Question:** Must DistOS pass a Jepsen-equivalent analysis before the specification is finalised, or is simulation-based testing sufficient?

**Options:**
- A. Simulation-based testing is sufficient at specification phase; Jepsen analysis deferred to implementation phase
- B. A Jepsen-equivalent model-level analysis is required before specification acceptance
- C. A full Jepsen-style test against a prototype implementation is required within the 14-day window

**Implications:** Option C is highly ambitious for a 14-day timeline. Option A defers significant risk to implementation.

### DQ-05: Security Penetration Testing Depth
**Question:** What is the scope of security penetration testing for the DistOS security model?

**Options:**
- A. Threat-model review and manual analysis only
- B. Automated scanning plus manual review of authentication and authorisation boundaries
- C. Full red-team exercise against the security model including side-channel analysis

---

## Dependencies on Other Councils

| Council | Dependency Type | What Council-QA Needs |
|---------|----------------|-----------------------|
| **council-verify** | Upstream | Formal specifications and invariants to test against; verified properties to validate in simulation |
| **council-api** | Upstream | Complete API surface definitions with semantics, pre/post conditions, and error contracts |
| **council-sched** | Upstream | Scheduler specification including claimed consistency and fairness properties |
| **council-mem** | Upstream | Memory manager specification including isolation guarantees and allocation invariants |
| **council-sec** | Upstream | Security model specification including trust boundaries and threat model |
| **council-net** | Upstream | Network layer specification including partition tolerance claims |
| **council-fs** | Upstream | Weka FS integration specification including consistency claims |
| **council-synth** | Bidirectional | Synthesis decisions that affect testability; council-qa feeds defects back to council-synth for resolution arbitration |
| **council-arch** | Downstream | Decision archaeology reads all council-qa defect reports and test results to narrate the QA story |
| **council-docs** | Downstream | Documentation council consumes test reports for the final specification document |

**Critical dependency:** Council-qa cannot begin property testing until council-verify delivers at least draft formal specifications. This creates a hard dependency on council-verify delivering Phase 2 outputs on schedule.

---

## WHOOSH Configuration

```yaml
council: council-qa
whoosh:
  formation:
    strategy: domain-partitioned
    # Each test domain forms a sub-team that operates semi-independently
    partitions:
      - name: chaos-partition
        roles: [chaos-engineer]
        size: 12
        coordination: async
      - name: fuzz-partition
        roles: [fuzz-operator]
        size: 10
        coordination: async
      - name: property-partition
        roles: [property-tester]
        size: 10
        coordination: sync-with-verify
      - name: byzantine-partition
        roles: [byzantine-simulator]
        size: 8
        coordination: async
      - name: gpu-partition
        roles: [gpu-fault-injector]
        size: 8
        coordination: async
      - name: simulation-partition
        roles: [simulation-engineer]
        size: 7
        coordination: sync
      - name: perf-partition
        roles: [performance-adversarialist]
        size: 5
        coordination: async

  quorum:
    defect-classification: 3/5   # Three roles must agree on P0/P1 classification
    test-acceptance: 4/7         # Four of seven role groups must sign off on test suite acceptance
    spec-block:                  # council-qa can block spec acceptance; requires:
      threshold: simple-majority
      escalation: council-synth  # Disputes escalate to council-synth for arbitration

  subchannels:
    - id: defect-reports
      description: All discovered defects broadcast here for cross-council visibility
      subscribers: [council-verify, council-api, council-sched, council-mem, council-sec, council-synth, council-arch]
      retention: full-history
    - id: chaos-ops
      description: Real-time chaos experiment state
      subscribers: [council-qa, council-arch]
    - id: property-failures
      description: Property test failures routed to council-verify for spec review
      subscribers: [council-verify, council-synth, council-arch]
    - id: qa-acceptance
      description: Formal acceptance/rejection signals per deliverable
      subscribers: [council-synth, council-docs, council-arch]

  communication:
    internal: broadcast-to-partition
    cross-council: ucxl-addressed
    defect-escalation: immediate-broadcast
```
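
As a sanity check on the configuration above, the sketch below reads the `k/n` quorum strings the way the inline comments describe them (at least k approvals out of n eligible voters) and confirms that the partition sizes add up to the 60 agents in the Agent Roles table. The values are copied by hand rather than parsed from YAML, and the helper names are hypothetical; WHOOSH's actual evaluation logic is not specified in this brief.

```python
PARTITION_SIZES = {
    "chaos-partition": 12,
    "fuzz-partition": 10,
    "property-partition": 10,
    "byzantine-partition": 8,
    "gpu-partition": 8,
    "simulation-partition": 7,
    "perf-partition": 5,
}

QUORUM_RULES = {"defect-classification": "3/5", "test-acceptance": "4/7"}


def quorum_met(rule: str, approvals: int) -> bool:
    """Interpret 'k/n' as: at least k approvals out of n eligible voters."""
    needed, eligible = (int(part) for part in QUORUM_RULES[rule].split("/"))
    if not 0 <= approvals <= eligible:
        raise ValueError("approvals must be between 0 and the eligible count")
    return approvals >= needed


def simple_majority(votes_for: int, voters: int) -> bool:
    """Assumed reading of 'simple-majority': strictly more than half."""
    return 2 * votes_for > voters


total_agents = sum(PARTITION_SIZES.values())
print(total_agents, quorum_met("test-acceptance", 4), simple_majority(30, 60))
```
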

---

## Success Criteria

A council-qa execution is considered successful when all of the following are met:

1. **Coverage:** Every formal specification invariant produced by council-verify has at least one corresponding property test with documented pass/fail results.

2. **API surface:** 100% of API endpoints defined by council-api have been subjected to structured fuzz testing with documented results.

3. **Chaos coverage:** Jepsen-equivalent partition and failure scenarios have been executed against the DistOS scheduler, memory manager, and consensus layer models.

4. **Byzantine validation:** Byzantine fault simulations have validated that the consensus protocol meets its claimed fault-tolerance bounds.

5. **GPU error coverage:** All documented GPU error codes and conditions in the NVIDIA/AMD hardware specifications have corresponding DistOS response scenarios tested.

6. **Weka FS:** All consistency claims in the DistOS filesystem integration specification have been tested against documented Weka FS behaviour.

7. **Defect resolution:** All P0 defects are resolved. All P1 defects have formal mitigations accepted by the originating council and council-synth. All open defects are registered with UCXL-addressable evidence.

8. **Deterministic reproduction:** All discovered failures can be reproduced deterministically via the simulation harness or a minimised test case.

9. **Archaeology readability:** Council-arch has confirmed that the defect register and test reports produce comprehensible decision narratives.

---

## Timeline Mapping

| Phase | Days | Council-QA Activities |
|-------|------|-----------------------|
| **Phase 1: Research** | 1–3 | Study formal specification structure (coordination with council-verify); design adversarial test taxonomy; review Jepsen, FoundationDB simulation, and AFL++ methodologies; define property language for spec conformance tests |
| **Phase 2: Architecture** | 3–6 | Produce Adversarial Test Strategy; design chaos playbook structure; begin encoding specification properties as tests (as draft specs arrive from council-verify); establish defect classification schema; configure WHOOSH subchannels and defect broadcast |
| **Phase 3: Formal Specification** | 6–10 | Execute property tests against formal specs from council-verify; run initial fuzz campaigns against draft API definitions; execute Byzantine fault simulations; GPU error injection scenario validation; first Weka FS adversarial tests; publish defect reports to all councils |
| **Phase 4: Integration** | 10–12 | Full integration testing across all subsystem specifications; cross-subsystem fault injection (e.g., GPU error during Weka checkpoint during scheduler failover); race condition analysis; performance adversarial benchmarks; defect resolution verification |
| **Phase 5: Documentation** | 12–14 | Produce Final QA Acceptance Report; compile consolidated defect register; validate all P0/P1 defects are resolved or formally accepted; support council-docs in formatting test reports for the final specification document |

---

*Council Design Brief v1.0, DistOS Project, council-qa*
*Generated: 2026-02-24*