12 council design briefs for distributed OS specification project targeting 1024-node Hopper/Grace/Blackwell GPU cluster with Weka parallel filesystem. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Council Design Brief: Quality Assurance & Adversarial Testing
Council Identification
| Field | Value |
|---|---|
| Council ID | council-qa |
| Mission | Adversarially validate the entire DistOS specification and implementation surface through systematic fuzz testing, chaos engineering, Byzantine fault simulation, and property-based verification, ensuring that the behaviour the formal specification proves correct (council-verify's domain) matches what the system actually does under hostile, degraded, and pathological conditions. |
| UCXL Base Address | ucxl://council-qa:*@DistOS:qa/*^/ |
| Agent Count | ~60 agents |
| Operates From | Day 3 (Architecture phase) through Day 14 (Documentation close) |
Scope and Responsibilities
Council-qa owns the gap between proof and reality. Formal verification proves that a model is internally consistent; adversarial testing gathers evidence that the model accurately describes the system as built. This council is responsible for:
- Designing and executing a comprehensive adversarial test suite against all DistOS subsystems
- Validating that formal specifications produced by council-verify match observable system behaviour under both normal and failure conditions
- Discovering specification ambiguities by finding inputs or conditions that produce unexpected or underspecified behaviour
- Stress-testing all published API surfaces (received from council-api) for correctness, safety, and resilience
- Simulating Byzantine, crash-recovery, and omission faults across the 1024-node GPU cluster topology
- Injecting faults at the hardware layer (GPU errors, NVLink degradation, Weka FS unavailability, InfiniBand partition events) and observing correctness of system response
- Benchmarking performance under adversarial scheduling pressure, memory fragmentation, and concurrent failure scenarios
- Performing structured penetration testing of the DistOS security model
- Running deterministic simulation testing to reproduce races and transient faults at will
- Reporting all discovered defects, ambiguities, and underspecifications back to originating councils with UCXL-addressable evidence chains
Council-qa does not write specifications or make architectural decisions. Its authority is to raise issues and block deliverable acceptance until those issues are resolved or formally accepted as known limitations.
Research Domains
1. Distributed Systems Fault Injection and Chaos Engineering
Core framework: Jepsen (Kyle Kingsbury), the canonical methodology for testing distributed system consistency under network partitions, clock skew, and process crashes. DistOS must pass Jepsen-equivalent analysis for all claims of linearisability, serialisability, or causal consistency made in the formal spec.
Key Jepsen analyses to replicate:
- Partition healing with in-flight GPU kernel state
- Split-brain detection in the consensus layer
- Clock skew effects on distributed scheduling decisions
- Write visibility after node rejoins
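As one illustration of the split-brain item above, a checker over a recorded election history could look like the following minimal sketch. It assumes a Raft-style notion of numbered terms with at most one leader per term; the event format `(term, node)` and the function name `split_brain_terms` are hypothetical and not part of any DistOS interface.

```python
from collections import defaultdict


def split_brain_terms(leader_claims):
    """Return {term: [nodes]} for every term in which more than one
    node claimed leadership. `leader_claims` is an iterable of
    (term, node_id) pairs taken from a test history (assumed format)."""
    leaders = defaultdict(set)
    for term, node in leader_claims:
        leaders[term].add(node)
    return {term: sorted(nodes)
            for term, nodes in leaders.items()
            if len(nodes) > 1}


# Hypothetical history: n7 and n12 both believe they lead term 4,
# for example on opposite sides of a partition.
history = [(3, "n1"), (4, "n7"), (4, "n12"), (5, "n7"), (4, "n7")]
violations = split_brain_terms(history)
```

An empty result is necessary but not sufficient for safety; it only rules out one class of violation in the observed history.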
Netflix Chaos Monkey / Simian Army — production chaos engineering methodology. Adapt for pre-production specification testing: if chaos can be injected into the architecture model, it should be injected into the formal model before any node is touched.
Google DiRT (Disaster Recovery Testing) — structured disaster scenario playbooks. Council-qa will produce DiRT-equivalent playbooks for DistOS, covering datacenter-level failure scenarios for a 1024-node cluster.
Key papers:
- Kingsbury, K. (2013–present). Jepsen analysis series. https://jepsen.io
- Gunawi, H. S., et al. (2011). "FATE and DESTINI: A Framework for Cloud Recovery Testing." NSDI 2011.
- Basiri, A., et al. (2016). "Chaos Engineering." IEEE Software 33(3).
- Leesatapornwongsa, T., et al. (2016). "TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems." ASPLOS 2016.
2. Property-Based and Specification-Conformance Testing
Hypothesis (Python) / QuickCheck (Haskell) — shrinking-based property testing. The council will define properties derived from formal specifications and use automated generators to find minimally failing counterexamples.
Properties to encode:
- Scheduler decisions must be deterministic given the same system state snapshot
- Memory allocation must never produce aliased physical GPU memory addresses across tenants
- All filesystem operations on Weka must satisfy the POSIX consistency semantics claimed in the spec
- Security policy decisions must be monotone (granting more resources never reduces security guarantees)
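The tenant-isolation property above can be sketched with a seeded random generator from the standard library. This is a simplified stand-in for Hypothesis or QuickCheck (it does no shrinking), and `BumpAllocator` is a hypothetical toy rather than the DistOS memory manager.

```python
import random


class BumpAllocator:
    """Toy allocator used only to exercise the property (hypothetical)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.next_free = 0
        self.live = []  # (tenant, start, end) half-open address ranges

    def alloc(self, tenant, size):
        # Reject non-positive sizes and requests that exceed capacity.
        if size <= 0 or self.next_free + size > self.capacity:
            return None
        start = self.next_free
        self.next_free += size
        self.live.append((tenant, start, start + size))
        return start


def no_cross_tenant_aliasing(ranges):
    """Property: no two ranges owned by different tenants overlap."""
    for i, (t1, s1, e1) in enumerate(ranges):
        for t2, s2, e2 in ranges[i + 1:]:
            if t1 != t2 and s1 < e2 and s2 < e1:
                return False
    return True


def run_property(seed, trials=200):
    """Generate random request sequences and check the property on each."""
    rng = random.Random(seed)
    for _ in range(trials):
        allocator = BumpAllocator(capacity=1 << 16)
        for _ in range(rng.randint(1, 50)):
            allocator.alloc(tenant=rng.randint(0, 3),
                            size=rng.randint(-4, 4096))
        if not no_cross_tenant_aliasing(allocator.live):
            return False
    return True
```

Recording the seed alongside any failure keeps the counterexample reproducible; a real campaign would rely on the framework's shrinker to minimise it.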
TLA+ model checking — in coordination with council-verify, council-qa will encode system properties as TLA+ invariants and run TLC model checking over the state space of the DistOS scheduler and memory manager.
Key papers:
- Claessen, K., & Hughes, J. (2000). "QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs." ICFP 2000.
- MacIver, D. R., et al. (2019). "Hypothesis: A new approach to property-based testing." JOSS 4(43).
- Lamport, L. (2002). "Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers." Addison-Wesley.
3. Fuzz Testing of API Surfaces
AFL++ / libFuzzer — coverage-guided fuzzing of all DistOS API endpoints. Every API defined by council-api is a fuzzing target.
Syzkaller — adapted for DistOS syscall-equivalent interfaces. The kernel-level scheduling and memory interfaces will be fuzz-tested using a syzkaller-inspired grammar-aware fuzzer.
gRPC/Protobuf fuzzing — for any inter-node RPC interfaces, schema-aware fuzzing to discover serialisation edge cases, integer overflow in field handling, and unintended state transitions.
Focus areas:
- Malformed job submission to the scheduler
- Invalid or boundary-condition memory allocation requests
- Weka filesystem path traversal and permission edge cases
- Security token manipulation and replay attacks
- Malformed consensus messages (Raft/Paxos variant used by DistOS)
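For the malformed job submission case, a mutation fuzzing loop might be sketched as below. This is plain random mutation, not coverage-guided fuzzing as AFL++ or libFuzzer perform, and both the wire format `b"gpus=4;mem=1024"` and the parser `parse_job` are invented for illustration. The oracle is that the parser either returns a result or raises its documented `ValueError`; any other exception counts as a finding.

```python
import random


def parse_job(raw: bytes) -> dict:
    """Hypothetical job-submission parser for b'gpus=4;mem=1024'."""
    try:
        text = raw.decode("ascii")
    except UnicodeDecodeError as exc:
        raise ValueError("non-ASCII input") from exc
    fields = {}
    for part in text.split(";"):
        key, sep, value = part.partition("=")
        if not sep or key not in ("gpus", "mem"):
            raise ValueError(f"bad field: {part!r}")
        number = int(value)  # int() raises ValueError on malformed digits
        if not 0 < number <= 1 << 20:
            raise ValueError(f"out of range: {key}")
        fields[key] = number
    return fields


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Apply one to four random bit flips, insertions, or deletions."""
    buf = bytearray(data)
    for _ in range(rng.randint(1, 4)):
        choice = rng.randrange(3)
        if choice == 0 and buf:
            buf[rng.randrange(len(buf))] ^= 1 << rng.randrange(8)
        elif choice == 1:
            buf.insert(rng.randint(0, len(buf)), rng.randrange(256))
        elif buf:
            del buf[rng.randrange(len(buf))]
    return bytes(buf)


def fuzz(seed_input: bytes, iterations: int, seed: int):
    """Return (input, exception) pairs that escaped the error contract."""
    rng = random.Random(seed)
    findings = []
    for _ in range(iterations):
        candidate = mutate(seed_input, rng)
        try:
            parse_job(candidate)
        except ValueError:
            pass  # documented rejection path
        except Exception as exc:  # anything else is a defect candidate
            findings.append((candidate, repr(exc)))
    return findings
```

Calling `fuzz(b"gpus=4;mem=1024", 2000, seed=1)` replays the same mutation sequence on every run, which keeps triage reproducible.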
Key papers:
- Zalewski, M. (2014). American Fuzzy Lop technical whitepaper.
- Serebryany, K. (2017). "OSS-Fuzz - Google's continuous fuzzing service for open source software." USENIX Security 2017.
- Vishnyakov, A., et al. (2022). "SYDR-Fuzz: Continuous Hybrid Fuzzing and Dynamic Analysis for Security Development Lifecycle." ISPRAS Open 2022.
4. Byzantine Fault Simulation
For a 1024-node cluster intended for production AI workloads, Byzantine faults (nodes sending malicious, corrupt, or contradictory messages) are a realistic adversary model.
Council-qa will simulate:
- Nodes sending conflicting scheduling state to different peers
- GPU memory controllers reporting incorrect allocation metadata
- Weka FS nodes returning inconsistent directory listings
- Consensus participants casting votes that contradict their local state
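BFT protocol validation: Any Byzantine-fault-tolerant consensus mechanism in DistOS can tolerate at most f faulty nodes, where N >= 3f + 1. For a 1024-node cluster this bound gives f = floor((1024 - 1) / 3) = 341, since 3 × 341 + 1 = 1024. Validating behaviour at and just beyond that bound requires rigorous simulation.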
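The bound can be checked with a few lines of arithmetic. The quorum size below assumes a PBFT-style protocol with N = 3f + 1 replicas, where a quorum is 2f + 1 votes; whether DistOS uses that quorum rule is not stated in this brief.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f satisfying n >= 3f + 1."""
    return (n - 1) // 3


def pbft_quorum(n: int) -> int:
    """Quorum size 2f + 1, assuming a PBFT-style protocol."""
    return 2 * max_byzantine_faults(n) + 1


f = max_byzantine_faults(1024)   # 341
quorum = pbft_quorum(1024)       # 683
```

Simulations would target both f = 341 (must stay safe) and f = 342 (allowed to fail, but the failure mode should be documented).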
Key papers:
- Lamport, L., Shostak, R., & Pease, M. (1982). "The Byzantine Generals Problem." ACM TOPLAS 4(3).
- Castro, M., & Liskov, B. (1999). "Practical Byzantine Fault Tolerance." OSDI 1999.
- Yin, M., et al. (2019). "HotStuff: BFT Consensus with Linearity and Responsiveness." PODC 2019.
5. GPU-Specific Error Injection
NVIDIA GPU Error Injection (NVML / DCGM) — inject ECC errors, NVLink faults, and CUDA context corruption. DistOS must respond correctly to all hardware-signalled errors from the GPU fabric.
ROCm SMI fault injection — equivalent capabilities for AMD GPUs if mixed hardware is present.
Scenarios:
- Single-bit ECC error during active kernel execution: does DistOS checkpoint, migrate, or terminate the affected job?
- NVLink link failure mid-allreduce: does the collective communication layer correctly abort and reschedule?
- GPU memory over-temperature throttling: does the scheduler correctly rebalance load?
- CUDA context loss: does the runtime correctly clean up tenant resources?
Key references:
- NVIDIA. (2024). Data Center GPU Manager (DCGM) User Guide. NVIDIA Corporation.
- NVIDIA. (2023). NVML API Reference Guide. NVIDIA Corporation.
- Hochschild, P., et al. (2021). "Cores that don't count." HotOS 2021. (Silent data corruption in datacenter CPUs — same failure class applies to GPUs.)
6. Weka Parallel Filesystem Adversarial Testing
Weka FS is a high-performance parallel filesystem with specific consistency semantics. DistOS makes claims about filesystem semantics that must be tested against Weka's actual behaviour.
Test scenarios:
- Weka cluster node failure during active checkpoint write: does DistOS detect partial writes?
- Concurrent checkpoint reads and writes from 512 nodes: does the system observe stale data?
- Weka metadata server overload: does DistOS degrade gracefully, and in what way?
- Network partition between Weka clients and Weka backend: does DistOS correctly pause or abort affected operations?
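For the partial-write scenario, a test needs an oracle that distinguishes a complete checkpoint from a truncated or corrupted one. The sketch below assumes a framing of magic bytes, an 8-byte length, the payload, and a trailing SHA-256 digest. This framing is an illustrative assumption for the test harness, not a documented DistOS or Weka format.

```python
import hashlib
import struct

MAGIC = b"CKPT"          # hypothetical marker
HEADER_LEN = 4 + 8       # magic + big-endian payload length
DIGEST_LEN = 32          # SHA-256 digest size


def encode_checkpoint(payload: bytes) -> bytes:
    """Frame a payload so truncation and corruption are detectable."""
    header = MAGIC + struct.pack(">Q", len(payload))
    return header + payload + hashlib.sha256(payload).digest()


def verify_checkpoint(blob: bytes) -> bool:
    """Return True only for a complete, uncorrupted checkpoint."""
    if len(blob) < HEADER_LEN + DIGEST_LEN or blob[:4] != MAGIC:
        return False
    (length,) = struct.unpack(">Q", blob[4:HEADER_LEN])
    if len(blob) != HEADER_LEN + length + DIGEST_LEN:
        return False  # truncated or padded write
    payload = blob[HEADER_LEN:HEADER_LEN + length]
    digest = blob[HEADER_LEN + length:]
    return hashlib.sha256(payload).digest() == digest


blob = encode_checkpoint(b"\x00" * 1000)
truncated = blob[:500]                                 # write cut short
corrupted = blob[:100] + b"\xff" + blob[101:]          # one byte damaged
```

A fault-injection run would kill the writer at random offsets and assert that every surviving file either verifies or is rejected, never silently accepted.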
Key references:
- Weka.io. (2024). WekaFS Architecture and Internals. Weka Technical Documentation.
- Carns, P., et al. (2011). "Understanding and Improving Computational Science Storage Access through Continuous Characterization." MSST 2011.
7. Deterministic Simulation Testing
FoundationDB simulation testing: one of the most rigorous known approaches to distributed systems testing. The entire system runs inside a deterministic simulator where all nondeterminism (network delays, disk I/O, timer callbacks) is controlled by a seeded random number generator, so any failing test can be reproduced exactly.
Council-qa will specify a deterministic simulation harness for the DistOS core components (scheduler, memory manager, consensus layer) that:
- Replaces all OS-level nondeterminism with simulated equivalents
- Records all random seeds for reproducible failure replay
- Supports time-travel debugging (roll back to a pre-failure state and re-execute)
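The seed-replay requirement can be illustrated with a very small discrete-event simulator. This is a sketch of the general technique only: the `Simulator` class, the 10% drop rate, and the delay range are arbitrary assumptions, and time-travel debugging is not covered.

```python
import heapq
import random


class Simulator:
    """Single-threaded event loop whose only randomness is a seeded RNG."""

    def __init__(self, seed):
        self.rng = random.Random(seed)
        self.now = 0.0
        self.queue = []    # heap of (time, sequence, label)
        self.sequence = 0  # unique tie-breaker keeps ordering stable
        self.trace = []

    def schedule(self, delay, label):
        heapq.heappush(self.queue, (self.now + delay, self.sequence, label))
        self.sequence += 1

    def send(self, src, dst, payload):
        """Simulated network: random drop or random delivery delay."""
        if self.rng.random() < 0.1:
            self.trace.append((self.now, f"drop {src}->{dst} {payload}"))
            return
        delay = self.rng.uniform(0.001, 0.050)
        self.schedule(delay, f"deliver {src}->{dst} {payload}")

    def run(self):
        """Process events in simulated-time order and return the trace."""
        while self.queue:
            self.now, _, label = heapq.heappop(self.queue)
            self.trace.append((self.now, label))
        return self.trace


def run_scenario(seed):
    sim = Simulator(seed)
    for i in range(20):
        sim.send("n0", f"n{1 + i % 3}", f"msg{i}")
    return sim.run()
```

Because wall-clock time and OS scheduling never enter the loop, `run_scenario(42)` yields an identical trace on every run, which is what makes a recorded seed sufficient for failure replay.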
Key references:
- Zhou, J., et al. (2021). "FoundationDB: A Distributed Unbundled Transactional Key Value Store." SIGMOD 2021.
- Kingsbury, K. (2018). "An Introduction to Jepsen." Strange Loop 2018.
- Alvaro, P., Rosen, J., & Hellerstein, J. M. (2015). "Lineage-driven Fault Injection." SIGMOD 2015.
8. Race Condition Detection
ThreadSanitizer (TSan) / Helgrind — data race detection in any shared-memory regions of DistOS.
Concuerror / DPOR (Dynamic Partial Order Reduction) — systematic concurrency testing for Erlang/Elixir-style actor systems, adaptable to DistOS agent communication patterns.
Deterministic concurrency testing — enumerate all interleavings of concurrent operations in critical sections of the scheduler and memory allocator.
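Exhaustive interleaving enumeration can be sketched for a two-thread unsynchronised increment. This is brute-force enumeration, not DPOR (which prunes equivalent schedules), and the read/write operation model is a deliberately simplified assumption.

```python
from itertools import combinations


def interleavings(a, b):
    """Yield every merge of a and b that preserves each sequence's order."""
    total = len(a) + len(b)
    for positions in combinations(range(total), len(a)):
        slots = set(positions)
        ia, ib = iter(a), iter(b)
        yield [next(ia) if i in slots else next(ib) for i in range(total)]


def execute(schedule):
    """Run read/write steps against a shared counter; return its final value."""
    shared = 0
    local = {}
    for thread, op in schedule:
        if op == "read":
            local[thread] = shared       # load shared value into a register
        else:
            shared = local[thread] + 1   # write back the incremented copy
    return shared


T1 = [("t1", "read"), ("t1", "write")]
T2 = [("t2", "read"), ("t2", "write")]

schedules = list(interleavings(T1, T2))
lost_updates = [s for s in schedules if execute(s) != 2]
```

Of the six possible schedules, only the two fully serial ones produce the expected value of 2; the other four lose an update. The number of schedules grows combinatorially with operation count, which is why partial order reduction matters for realistic critical sections.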
Agent Roles
| Role | Count | Responsibilities |
|---|---|---|
| Chaos Engineers | 12 | Design and execute Jepsen-style partition and failure scenarios; maintain chaos playbook library |
| Fuzz Operators | 10 | Run AFL++/libFuzzer/syzkaller campaigns against all API surfaces; triage and minimise crashes |
| Property Testers | 10 | Encode formal spec properties as Hypothesis/QuickCheck tests; maintain property library |
| Byzantine Simulators | 8 | Implement and execute Byzantine fault scenarios; validate BFT protocol correctness |
| GPU Fault Injectors | 8 | NVML/DCGM-based hardware fault injection; GPU error response validation |
| Simulation Engineers | 7 | Build and maintain the deterministic simulation harness; replay and minimise failures |
| Performance Adversarialists | 5 | Benchmarking under adversarial load; identify performance cliffs and scheduling pathologies |
Total: 60 agents
Key Deliverables
| Deliverable | UCXL Address | Due Phase |
|---|---|---|
| Adversarial Test Strategy | ucxl://council-qa:chaos-engineer@DistOS:qa/*^/spec/test-strategy | Phase 2 |
| Chaos Engineering Playbook | ucxl://council-qa:chaos-engineer@DistOS:qa/*^/runbook/chaos-playbook | Phase 2 |
| Property Test Suite (formal spec conformance) | ucxl://council-qa:property-tester@DistOS:qa/*^/test-suite/property-tests | Phase 3 |
| API Fuzz Campaign Report | ucxl://council-qa:fuzz-operator@DistOS:qa/*^/report/fuzz-campaign-api | Phase 3 |
| Byzantine Fault Simulation Results | ucxl://council-qa:byzantine-simulator@DistOS:qa/*^/report/byzantine-faults | Phase 3 |
| GPU Error Injection Results | ucxl://council-qa:gpu-fault-injector@DistOS:qa/*^/report/gpu-fault-injection | Phase 3 |
| Deterministic Simulation Harness Spec | ucxl://council-qa:simulation-engineer@DistOS:qa/*^/spec/deterministic-sim | Phase 3 |
| Weka FS Adversarial Test Results | ucxl://council-qa:chaos-engineer@DistOS:qa/*^/report/weka-adversarial | Phase 4 |
| Race Condition Detection Report | ucxl://council-qa:simulation-engineer@DistOS:qa/*^/report/race-conditions | Phase 4 |
| Consolidated Defect Register | ucxl://council-qa:*@DistOS:qa/*^/register/defects | Continuous |
| Final QA Acceptance Report | ucxl://council-qa:*@DistOS:qa/*^/report/final-acceptance | Phase 5 |
| Performance Adversarial Benchmarks | ucxl://council-qa:performance-adversarialist@DistOS:qa/*^/benchmark/adversarial-perf | Phase 4 |
Decision Points
DQ-01: Acceptable Defect Threshold for Specification Acceptance
Question: What severity and count of open defects constitutes an acceptable state for the DistOS specification to be declared complete?
Options:
- A. Zero P0 defects, fewer than 10 P1 defects with documented mitigations
- B. Zero P0/P1 defects, unlimited P2 with tracking
- C. All discovered defects must be resolved or formally accepted with architectural rationale
Implications: Option C produces the highest-quality specification but may block delivery. Option A risks shipping known critical issues with mitigations that may not hold in production.
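The three options can be compared as executable acceptance gates. The sketch below is one reading of the option text; the defect record fields (`severity`, `mitigated`, `accepted`) are hypothetical and would be replaced by whatever schema the defect register adopts.

```python
def spec_acceptable(option: str, open_defects: list) -> bool:
    """Evaluate DQ-01 options A, B, or C against the open defect list.

    Each defect is a dict with 'severity' in {'P0', 'P1', 'P2'} and
    optional boolean flags 'mitigated' and 'accepted' (assumed schema).
    """
    p0 = [d for d in open_defects if d["severity"] == "P0"]
    p1 = [d for d in open_defects if d["severity"] == "P1"]
    if option == "A":
        # Zero P0, fewer than 10 P1, and every open P1 has a mitigation.
        return (not p0 and len(p1) < 10
                and all(d.get("mitigated", False) for d in p1))
    if option == "B":
        # Zero P0 and zero P1; P2 count is unconstrained.
        return not p0 and not p1
    if option == "C":
        # Anything still open must be formally accepted.
        return all(d.get("accepted", False) for d in open_defects)
    raise ValueError(f"unknown option: {option}")


register = [
    {"severity": "P1", "mitigated": True},
    {"severity": "P2"},
]
```

With this register, option A accepts the specification while options B and C block it, which makes the practical difference between the options concrete.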
DQ-02: Scope of Byzantine Fault Tolerance Requirement
Question: Is DistOS required to tolerate Byzantine faults, or only crash-recovery faults?
Options:
- A. Crash-recovery (CFT) only — simpler protocols, lower overhead, sufficient for trusted datacenter hardware
- B. Byzantine fault tolerance for the consensus layer only
- C. Full BFT for all distributed components
Implications: Full BFT (Option C) requires significantly more complex protocols and more replicas for the same fault budget: tolerating f faulty nodes needs N >= 3f + 1 nodes under BFT, versus N >= 2f + 1 under CFT. Given a 1024-node cluster with known hardware, CFT may be sufficient.
DQ-03: Deterministic Simulation Scope
Question: Which DistOS components must support deterministic simulation testing?
Options:
- A. Core consensus and scheduling components only
- B. All components that touch shared state
- C. The entire DistOS software stack including GPU runtime interfaces
Implications: Option C provides the strongest testing guarantees but requires a simulation layer for every hardware interface including GPU APIs.
DQ-04: Jepsen-Equivalent Validation Requirement
Question: Must DistOS pass a Jepsen-equivalent analysis before the specification is finalised, or is simulation-based testing sufficient?
Options:
- A. Simulation-based testing is sufficient at specification phase; Jepsen analysis deferred to implementation phase
- B. A Jepsen-equivalent model-level analysis is required before specification acceptance
- C. A full Jepsen-style test against a prototype implementation is required within the 14-day window
Implications: Option C is highly ambitious for a 14-day timeline. Option A defers significant risk to implementation.
DQ-05: Security Penetration Testing Depth
Question: What is the scope of security penetration testing for the DistOS security model?
Options:
- A. Threat-model review and manual analysis only
- B. Automated scanning plus manual review of authentication and authorisation boundaries
- C. Full red-team exercise against the security model including side-channel analysis
Dependencies on Other Councils
| Council | Dependency Type | What Council-QA Needs |
|---|---|---|
| council-verify | Upstream | Formal specifications and invariants to test against; verified properties to validate in simulation |
| council-api | Upstream | Complete API surface definitions with semantics, pre/post conditions, and error contracts |
| council-sched | Upstream | Scheduler specification including claimed consistency and fairness properties |
| council-mem | Upstream | Memory manager specification including isolation guarantees and allocation invariants |
| council-sec | Upstream | Security model specification including trust boundaries and threat model |
| council-net | Upstream | Network layer specification including partition tolerance claims |
| council-fs | Upstream | Weka FS integration specification including consistency claims |
| council-synth | Bidirectional | Synthesis decisions that affect testability; council-qa feeds defects back to council-synth for resolution arbitration |
| council-arch | Downstream | Decision archaeology reads all council-qa defect reports and test results to narrate the QA story |
| council-docs | Downstream | Documentation council consumes test reports for the final specification document |
Critical dependency: Council-qa cannot begin property testing until council-verify delivers at least draft formal specifications. This creates a hard dependency on council-verify delivering Phase 2 outputs on schedule.
WHOOSH Configuration
    council: council-qa
    whoosh:
      formation:
        strategy: domain-partitioned
        # Each test domain forms a sub-team that operates semi-independently
        partitions:
          - name: chaos-partition
            roles: [chaos-engineer]
            size: 12
            coordination: async
          - name: fuzz-partition
            roles: [fuzz-operator]
            size: 10
            coordination: async
          - name: property-partition
            roles: [property-tester]
            size: 10
            coordination: sync-with-verify
          - name: byzantine-partition
            roles: [byzantine-simulator]
            size: 8
            coordination: async
          - name: gpu-partition
            roles: [gpu-fault-injector]
            size: 8
            coordination: async
          - name: simulation-partition
            roles: [simulation-engineer]
            size: 7
            coordination: sync
          - name: perf-partition
            roles: [performance-adversarialist]
            size: 5
            coordination: async
      quorum:
        defect-classification: 3/5  # Three roles must agree on P0/P1 classification
        test-acceptance: 4/7        # Four of seven role groups must sign off on test suite acceptance
        spec-block:                 # council-qa can block spec acceptance; requires:
          threshold: simple-majority
          escalation: council-synth # Disputes escalate to council-synth for arbitration
      subchannels:
        - id: defect-reports
          description: All discovered defects broadcast here for cross-council visibility
          subscribers: [council-verify, council-api, council-sched, council-mem, council-sec, council-synth, council-arch]
          retention: full-history
        - id: chaos-ops
          description: Real-time chaos experiment state
          subscribers: [council-qa, council-arch]
        - id: property-failures
          description: Property test failures routed to council-verify for spec review
          subscribers: [council-verify, council-synth, council-arch]
        - id: qa-acceptance
          description: Formal acceptance/rejection signals per deliverable
          subscribers: [council-synth, council-docs, council-arch]
      communication:
        internal: broadcast-to-partition
        cross-council: ucxl-addressed
        defect-escalation: immediate-broadcast
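The quorum values in the configuration can be read as "k of n" thresholds. The following sketch shows one way such strings might be evaluated; how WHOOSH actually parses `3/5` or `simple-majority` is not specified in this brief, so the helper names and semantics here are assumptions.

```python
def parse_quorum(spec: str) -> tuple:
    """Parse a 'k/n' string into (needed, total)."""
    needed, total = (int(part) for part in spec.split("/"))
    if not 0 < needed <= total:
        raise ValueError(f"invalid quorum: {spec}")
    return needed, total


def quorum_met(spec: str, approvals: int) -> bool:
    """True when the number of approvals reaches the 'k/n' threshold."""
    needed, total = parse_quorum(spec)
    if not 0 <= approvals <= total:
        raise ValueError("approvals out of range")
    return approvals >= needed


def simple_majority(approvals: int, voters: int) -> bool:
    """Strictly more than half of the voters approve."""
    return 2 * approvals > voters
```

Under this reading, `quorum_met("4/7", 4)` passes test-suite acceptance, while a 30-of-60 vote does not satisfy `simple_majority` for a spec block.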
Success Criteria
A council-qa execution is considered successful when all of the following are met:
- Coverage: Every formal specification invariant produced by council-verify has at least one corresponding property test with documented pass/fail results.
- API surface: 100% of API endpoints defined by council-api have been subjected to structured fuzz testing with documented results.
- Chaos coverage: Jepsen-equivalent partition and failure scenarios have been executed against the DistOS scheduler, memory manager, and consensus layer models.
- Byzantine validation: Byzantine fault simulations have validated that the consensus protocol meets its claimed fault-tolerance bounds.
- GPU error coverage: All documented GPU error codes and conditions in the NVIDIA/AMD hardware specifications have corresponding DistOS response scenarios tested.
- Weka FS: All consistency claims in the DistOS filesystem integration specification have been tested against documented Weka FS behaviour.
- Defect resolution: All P0 defects are resolved. All P1 defects have formal mitigations accepted by the originating council and council-synth. All open defects are registered with UCXL-addressable evidence.
- Deterministic reproduction: All discovered failures can be reproduced deterministically via the simulation harness or a minimised test case.
- Archaeology readability: Council-arch has confirmed that the defect register and test reports produce comprehensible decision narratives.
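The coverage criterion lends itself to a mechanical check. A minimal sketch, assuming invariants are identified by string IDs and each property test declares which invariants it covers (both the ID scheme and the `covers` field are hypothetical):

```python
def uncovered_invariants(invariants, tests):
    """Return the sorted invariant IDs that no property test covers."""
    covered = {inv for test in tests for inv in test["covers"]}
    return sorted(set(invariants) - covered)


# Hypothetical IDs for illustration only.
invariants = ["SCHED-DETERMINISM", "MEM-NO-ALIAS", "SEC-MONOTONE"]
tests = [
    {"name": "test_scheduler_replay", "covers": ["SCHED-DETERMINISM"]},
    {"name": "test_tenant_isolation", "covers": ["MEM-NO-ALIAS"]},
]
missing = uncovered_invariants(invariants, tests)
```

An empty result would satisfy the first criterion; here `SEC-MONOTONE` is reported as lacking a test.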
Timeline Mapping
| Phase | Days | Council-QA Activities |
|---|---|---|
| Phase 1: Research | 1–3 | Study formal specification structure (coordination with council-verify); design adversarial test taxonomy; review Jepsen, FoundationDB simulation, and AFL++ methodologies; define property language for spec conformance tests |
| Phase 2: Architecture | 3–6 | Produce Adversarial Test Strategy; design chaos playbook structure; begin encoding specification properties as tests (as draft specs arrive from council-verify); establish defect classification schema; configure WHOOSH subchannels and defect broadcast |
| Phase 3: Formal Specification | 6–10 | Execute property tests against formal specs from council-verify; run initial fuzz campaigns against draft API definitions; execute Byzantine fault simulations; GPU error injection scenario validation; first Weka FS adversarial tests; publish defect reports to all councils |
| Phase 4: Integration | 10–12 | Full integration testing across all subsystem specifications; cross-subsystem fault injection (e.g., GPU error during Weka checkpoint during scheduler failover); race condition analysis; performance adversarial benchmarks; defect resolution verification |
| Phase 5: Documentation | 12–14 | Produce Final QA Acceptance Report; compile consolidated defect register; validate all P0/P1 defects are resolved or formally accepted; support council-docs in formatting test reports for the final specification document |
Council Design Brief v1.0 — DistOS Project — council-qa Generated: 2026-02-24