# Council Design Brief: Quality Assurance & Adversarial Testing

---

## Council Identification

| Field | Value |
|-------|-------|
| **Council ID** | `council-qa` |
| **Mission** | Adversarially validate the entire DistOS specification and implementation surface through systematic fuzz testing, chaos engineering, Byzantine fault simulation, and property-based verification, ensuring that what the formal specification says is provably correct (council-verify's domain) actually matches what the system does under hostile, degraded, and pathological conditions. |
| **UCXL Base Address** | `ucxl://council-qa:*@DistOS:qa/*^/` |
| **Agent Count** | 60 agents |
| **Operates From** | Day 3 (Architecture phase) through Day 14 (Documentation close) |

---

## Scope and Responsibilities

Council-qa owns the gap between proof and reality. Formal verification proves that a model is internally consistent; adversarial testing checks that the model accurately describes the system as built. This council is responsible for:

- Designing and executing a comprehensive adversarial test suite against all DistOS subsystems
- Validating that formal specifications produced by council-verify match observable system behaviour under both normal and failure conditions
- Discovering specification ambiguities by finding inputs or conditions that produce unexpected or underspecified behaviour
- Stress-testing all published API surfaces (received from council-api) for correctness, safety, and resilience
- Simulating Byzantine, crash-recovery, and omission faults across the 1024-node GPU cluster topology
- Injecting faults at the hardware layer (GPU errors, NVLink degradation, Weka FS unavailability, InfiniBand partition events) and observing correctness of system response
- Benchmarking performance under adversarial scheduling pressure, memory fragmentation, and concurrent failure scenarios
- Performing structured penetration testing of the DistOS security model
- Running deterministic simulation testing to reproduce races and transient faults at will
- Reporting all discovered defects, ambiguities, and underspecifications back to originating councils with UCXL-addressable evidence chains

Council-qa does **not** write specifications or make architectural decisions. Its authority is to raise issues and block deliverable acceptance until those issues are resolved or formally accepted as known limitations.

---

## Research Domains

### 1. Distributed Systems Fault Injection and Chaos Engineering

**Core framework:** Jepsen (Kyle Kingsbury), the canonical methodology for testing distributed system consistency under network partitions, clock skew, and process crashes. DistOS must pass Jepsen-equivalent analysis for all claims of linearisability, serialisability, or causal consistency made in the formal spec.

Key Jepsen analyses to replicate:
- Partition healing with in-flight GPU kernel state
- Split-brain detection in the consensus layer
- Clock skew effects on distributed scheduling decisions
- Write visibility after node rejoins

**Netflix Chaos Monkey / Simian Army:** production chaos engineering methodology. Adapt for pre-production specification testing: if chaos can be injected into the architecture model, it should be injected into the formal model before any node is touched.

**Google DiRT (Disaster Recovery Testing):** structured disaster scenario playbooks. Council-qa will produce DiRT-equivalent playbooks for DistOS, covering datacenter-level failure scenarios for a 1024-node cluster.

Key papers:
- Kingsbury, K. (2013–present). Jepsen analysis series. https://jepsen.io
- Gunawi, H. S., et al. (2011). "FATE and DESTINI: A Framework for Cloud Recovery Testing." NSDI 2011.
- Basiri, A., et al. (2016). "Chaos Engineering." IEEE Software 33(3).
- Leesatapornwongsa, T., et al. (2016). "TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems." ASPLOS 2016.

### 2. Property-Based and Specification-Conformance Testing

**Hypothesis (Python) / QuickCheck (Haskell):** shrinking-based property testing. The council will define properties derived from formal specifications and use automated generators to find minimal failing counterexamples.

Properties to encode:
- Scheduler decisions must be deterministic given the same system state snapshot
- Memory allocation must never produce aliased physical GPU memory addresses across tenants
- All filesystem operations on Weka must satisfy the POSIX consistency semantics claimed in the spec
- Security policy decisions must be monotone (granting more resources never reduces security guarantees)

**TLA+ model checking:** in coordination with council-verify, council-qa will encode system properties as TLA+ invariants and run TLC model checking over the state space of the DistOS scheduler and memory manager.

Key papers:
- Claessen, K., & Hughes, J. (2000). "QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs." ICFP 2000.
- MacIver, D. R., et al. (2019). "Hypothesis: A new approach to property-based testing." JOSS 4(43).
- Lamport, L. (2002). "Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers." Addison-Wesley.

### 3. Fuzz Testing of API Surfaces

**AFL++ / libFuzzer:** coverage-guided fuzzing of all DistOS API endpoints. Every API defined by council-api is a fuzzing target.

**Syzkaller:** adapted for DistOS syscall-equivalent interfaces. The kernel-level scheduling and memory interfaces will be fuzz-tested using a syzkaller-inspired grammar-aware fuzzer.

**gRPC/Protobuf fuzzing:** for any inter-node RPC interfaces, schema-aware fuzzing to discover serialisation edge cases, integer overflow in field handling, and unintended state transitions.
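
The core loop behind these campaigns can be illustrated with a minimal sketch. It is a plain mutation-based fuzzer written with the Python standard library, not a coverage-guided one like AFL++ or libFuzzer, and everything it targets is hypothetical: `parse_job_request`, `ValidationError`, and the `gpus` field are stand-ins invented for illustration, since the real DistOS API is defined by council-api. The contract being checked is that malformed input is either accepted or rejected with the one documented error type, and never causes any other exception.

```python
import json
import random


class ValidationError(Exception):
    """The only error the hypothetical endpoint is allowed to raise."""


def parse_job_request(raw: bytes) -> dict:
    """Hypothetical stand-in for a job-submission endpoint (not a DistOS API)."""
    try:
        msg = json.loads(raw.decode("utf-8"))
    except ValueError as exc:  # covers UnicodeDecodeError and JSONDecodeError
        raise ValidationError("not valid JSON") from exc
    if not isinstance(msg, dict):
        raise ValidationError("top level must be an object")
    gpus = msg.get("gpus")
    if isinstance(gpus, bool) or not isinstance(gpus, int) or not 1 <= gpus <= 1024:
        raise ValidationError("gpus must be an integer in [1, 1024]")
    return {"gpus": gpus}


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Apply one to four random byte-level mutations to a valid seed input."""
    data = bytearray(seed)
    for _ in range(rng.randint(1, 4)):
        choice = rng.randrange(3)
        if choice == 0 and data:      # flip a single bit
            i = rng.randrange(len(data))
            data[i] ^= 1 << rng.randrange(8)
        elif choice == 1:             # insert a random byte
            data.insert(rng.randrange(len(data) + 1), rng.randrange(256))
        elif data:                    # delete a byte
            del data[rng.randrange(len(data))]
    return bytes(data)


def fuzz(seed: bytes, iterations: int, rng_seed: int) -> list[bytes]:
    """Return every mutated input that violated the error contract."""
    rng = random.Random(rng_seed)     # seeded so a campaign can be replayed
    findings = []
    for _ in range(iterations):
        case = mutate(seed, rng)
        try:
            parse_job_request(case)
        except ValidationError:
            pass                      # rejected cleanly: contract upheld
        except Exception:
            findings.append(case)     # any other exception is a defect
    return findings


findings = fuzz(b'{"gpus": 8}', iterations=2000, rng_seed=1)
```

Each entry in `findings` is a concrete input that can be attached to a defect report and replayed; a production campaign would add coverage feedback and input minimisation on top of this loop.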

Focus areas:
- Malformed job submission to the scheduler
- Invalid or boundary-condition memory allocation requests
- Weka filesystem path traversal and permission edge cases
- Security token manipulation and replay attacks
- Malformed consensus messages (Raft/Paxos variant used by DistOS)

Key papers:
- Zalewski, M. (2014). American Fuzzy Lop technical whitepaper.
- Serebryany, K. (2017). "OSS-Fuzz: Google's continuous fuzzing service for open source software." USENIX Security 2017.
- Vishnyakov, A., et al. (2022). "SYDR-Fuzz: Continuous Hybrid Fuzzing and Dynamic Analysis for Security Development Lifecycle." ISPRAS Open 2022.

### 4. Byzantine Fault Simulation

For a 1024-node cluster intended for production AI workloads, Byzantine faults (nodes sending malicious, corrupt, or contradictory messages) are a realistic adversary model. Council-qa will simulate:

- Nodes sending conflicting scheduling state to different peers
- GPU memory controllers reporting incorrect allocation metadata
- Weka FS nodes returning inconsistent directory listings
- Consensus participants casting votes that contradict their local state

**BFT protocol validation:** Any Byzantine-fault-tolerant consensus mechanism in DistOS must tolerate f faulty nodes where N >= 3f + 1. For a 1024-node cluster the bound is met with equality at f = 341 (3 × 341 + 1 = 1024), the largest number of faulty nodes the protocol can be expected to tolerate; behaviour at and near this limit requires rigorous simulation.

Key papers:
- Lamport, L., Shostak, R., & Pease, M. (1982). "The Byzantine Generals Problem." ACM TOPLAS 4(3).
- Castro, M., & Liskov, B. (1999). "Practical Byzantine Fault Tolerance." OSDI 1999.
- Yin, M., et al. (2019). "HotStuff: BFT Consensus with Linearity and Responsiveness." PODC 2019.

### 5. GPU-Specific Error Injection

**NVIDIA GPU Error Injection (NVML / DCGM):** inject ECC errors, NVLink faults, and CUDA context corruption. DistOS must respond correctly to all hardware-signalled errors from the GPU fabric.

**ROCm SMI fault injection:** equivalent capabilities for AMD GPUs if mixed hardware is present.

Scenarios:
- Single-bit ECC error during active kernel execution: does DistOS checkpoint, migrate, or terminate the affected job?
- NVLink link failure mid-allreduce: does the collective communication layer correctly abort and reschedule?
- GPU memory over-temperature throttling: does the scheduler correctly rebalance load?
- CUDA context loss: does the runtime correctly clean up tenant resources?

Key references:
- NVIDIA. (2024). Data Center GPU Manager (DCGM) User Guide. NVIDIA Corporation.
- NVIDIA. (2023). NVML API Reference Guide. NVIDIA Corporation.
- Hochschild, P., et al. (2021). "Cores that don't count." HotOS 2021. (Silent data corruption in datacenter CPUs; the same failure class applies to GPUs.)

### 6. Weka Parallel Filesystem Adversarial Testing

Weka FS is a high-performance parallel filesystem with specific consistency semantics. DistOS makes claims about filesystem semantics that must be tested against Weka's actual behaviour.

Test scenarios:
- Weka cluster node failure during active checkpoint write: does DistOS detect partial writes?
- Concurrent checkpoint reads and writes from 512 nodes: does the system observe stale data?
- Weka metadata server overload: does DistOS degrade gracefully?
- Network partition between Weka clients and Weka backend: does DistOS correctly pause or abort affected operations?

Key references:
- Weka.io. (2024). WekaFS Architecture and Internals. Weka Technical Documentation.
- Carns, P., et al. (2011). "Understanding and Improving Computational Science Storage Access through Continuous Characterization." MSST 2011.

### 7. Deterministic Simulation Testing

**FoundationDB simulation testing:** one of the most rigorous known approaches to distributed systems testing. The entire system runs inside a deterministic simulator where all nondeterminism (network delays, disk I/O, timer callbacks) is controlled by a seeded random number generator, so any failing test can be reproduced exactly from its seed.
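
The principle can be shown in a few lines. The sketch below is a toy, assuming a single sender and a made-up 20% drop rate; `simulate` and its parameters are hypothetical names, not part of FoundationDB or DistOS. What it demonstrates is the discipline itself: latency and message loss are drawn from one seeded generator, time is a simulated counter, and nothing reads the wall clock, so the same seed always yields the same event trace.

```python
import heapq
import random


def simulate(seed: int, messages: int = 5) -> list[tuple[int, str]]:
    """Deliver messages through a simulated network; return (time, payload) events."""
    rng = random.Random(seed)   # the only source of nondeterminism in the run
    clock = 0                   # simulated time; all messages are sent at time 0
    queue: list[tuple[int, int, str]] = []
    for seq in range(messages):
        delay = rng.randint(1, 100)   # simulated network latency
        if rng.random() < 0.2:        # simulated message drop (assumed 20% rate)
            continue
        # seq breaks ties so equal delivery times pop in a fixed order
        heapq.heappush(queue, (clock + delay, seq, f"msg-{seq}"))
    trace = []
    while queue:
        clock, _, payload = heapq.heappop(queue)  # advance simulated time
        trace.append((clock, payload))
    return trace


first_run = simulate(seed=42)
replay = simulate(seed=42)   # identical trace: the failure is reproducible
```

Recording the seed alongside a failing trace is therefore enough to replay the failure; sweeping many seeds explores many different delay and drop schedules.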

Council-qa will specify a deterministic simulation harness for the DistOS core components (scheduler, memory manager, consensus layer) that:

- Replaces all OS-level nondeterminism with simulated equivalents
- Records all random seeds for reproducible failure replay
- Supports time-travel debugging (roll back to a pre-failure state and re-execute)

Key references:
- Zhou, J., et al. (2021). "FoundationDB: A Distributed Unbundled Transactional Key-Value Store." SIGMOD 2021.
- Kingsbury, K. (2018). "An Introduction to Jepsen." Strange Loop 2018.
- Alvaro, P., Rosen, J., & Hellerstein, J. M. (2015). "Lineage-driven Fault Injection." SIGMOD 2015.

### 8. Race Condition Detection

**ThreadSanitizer (TSan) / Helgrind:** data race detection in any shared-memory regions of DistOS.

**Concuerror / DPOR (Dynamic Partial Order Reduction):** systematic concurrency testing for Erlang/Elixir-style actor systems, adaptable to DistOS agent communication patterns.

**Deterministic concurrency testing:** enumerate all interleavings of concurrent operations in critical sections of the scheduler and memory allocator.
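
Exhaustive interleaving enumeration can be sketched as follows. This is the naive version: it visits every order-preserving merge of two operation sequences, whereas DPOR prunes interleavings that are equivalent up to reordering of independent operations. The two "threads" and their non-atomic read-then-write increment are a textbook illustration chosen for the example, not DistOS code.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that keeps each one's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule of non-atomic increments on a shared counter."""
    counter = 0
    registers = {}                          # per-thread local copy of the counter
    for op, thread in schedule:
        if op == "read":
            registers[thread] = counter
        else:                               # "write": store local copy plus one
            counter = registers[thread] + 1
    return counter


THREAD_A = [("read", "A"), ("write", "A")]
THREAD_B = [("read", "B"), ("write", "B")]

schedules = list(interleavings(THREAD_A, THREAD_B))
# Two increments should leave the counter at 2; anything else is a lost update.
lost_updates = [s for s in schedules if run(s) != 2]
```

Of the six possible schedules, four lose an update: every schedule in which both reads happen before either write. The number of interleavings grows combinatorially with sequence length, which is why partial-order reduction matters for real critical sections.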

---

## Agent Roles

| Role | Count | Responsibilities |
|------|-------|-----------------|
| **Chaos Engineers** | 12 | Design and execute Jepsen-style partition and failure scenarios; maintain chaos playbook library |
| **Fuzz Operators** | 10 | Run AFL++/libFuzzer/syzkaller campaigns against all API surfaces; triage and minimise crashes |
| **Property Testers** | 10 | Encode formal spec properties as Hypothesis/QuickCheck tests; maintain property library |
| **Byzantine Simulators** | 8 | Implement and execute Byzantine fault scenarios; validate BFT protocol correctness |
| **GPU Fault Injectors** | 8 | NVML/DCGM-based hardware fault injection; GPU error response validation |
| **Simulation Engineers** | 7 | Build and maintain the deterministic simulation harness; replay and minimise failures |
| **Performance Adversarialists** | 5 | Benchmarking under adversarial load; identify performance cliffs and scheduling pathologies |

**Total: 60 agents**

---

## Key Deliverables

| Deliverable | UCXL Address | Due Phase |
|-------------|--------------|-----------|
| Adversarial Test Strategy | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/spec/test-strategy` | Phase 2 |
| Chaos Engineering Playbook | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/runbook/chaos-playbook` | Phase 2 |
| Property Test Suite (formal spec conformance) | `ucxl://council-qa:property-tester@DistOS:qa/*^/test-suite/property-tests` | Phase 3 |
| API Fuzz Campaign Report | `ucxl://council-qa:fuzz-operator@DistOS:qa/*^/report/fuzz-campaign-api` | Phase 3 |
| Byzantine Fault Simulation Results | `ucxl://council-qa:byzantine-simulator@DistOS:qa/*^/report/byzantine-faults` | Phase 3 |
| GPU Error Injection Results | `ucxl://council-qa:gpu-fault-injector@DistOS:qa/*^/report/gpu-fault-injection` | Phase 3 |
| Deterministic Simulation Harness Spec | `ucxl://council-qa:simulation-engineer@DistOS:qa/*^/spec/deterministic-sim` | Phase 3 |
| Weka FS Adversarial Test Results | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/report/weka-adversarial` | Phase 4 |
| Race Condition Detection Report | `ucxl://council-qa:simulation-engineer@DistOS:qa/*^/report/race-conditions` | Phase 4 |
| Performance Adversarial Benchmarks | `ucxl://council-qa:performance-adversarialist@DistOS:qa/*^/benchmark/adversarial-perf` | Phase 4 |
| Consolidated Defect Register | `ucxl://council-qa:*@DistOS:qa/*^/register/defects` | Continuous |
| Final QA Acceptance Report | `ucxl://council-qa:*@DistOS:qa/*^/report/final-acceptance` | Phase 5 |

---

## Decision Points

### DQ-01: Acceptable Defect Threshold for Specification Acceptance

**Question:** What severity and count of open defects constitutes an acceptable state for the DistOS specification to be declared complete?

**Options:**
- A. Zero P0 defects, fewer than 10 P1 defects with documented mitigations
- B. Zero P0/P1 defects, unlimited P2 with tracking
- C. All discovered defects must be resolved or formally accepted with architectural rationale

**Implications:** Option C produces the highest-quality specification but may block delivery. Option A risks shipping known P1 issues whose mitigations may not hold in production.

### DQ-02: Scope of Byzantine Fault Tolerance Requirement

**Question:** Is DistOS required to tolerate Byzantine faults, or only crash-recovery faults?

**Options:**
- A. Crash-recovery (CFT) only: simpler protocols, lower overhead, sufficient for trusted datacenter hardware
- B. Byzantine fault tolerance for the consensus layer only
- C. Full BFT for all distributed components

**Implications:** Full BFT (Option C) requires significantly more complex protocols and more replicas per tolerated fault: 3f + 1 nodes to tolerate f Byzantine faults, versus 2f + 1 nodes to tolerate f crash faults. Given a 1024-node cluster with known hardware, CFT may be sufficient.

### DQ-03: Deterministic Simulation Scope

**Question:** Which DistOS components must support deterministic simulation testing?

**Options:**
- A. Core consensus and scheduling components only
- B. All components that touch shared state
- C. The entire DistOS software stack including GPU runtime interfaces

**Implications:** Option C provides the strongest testing guarantees but requires a simulation layer for every hardware interface including GPU APIs.

### DQ-04: Jepsen-Equivalent Validation Requirement

**Question:** Must DistOS pass a Jepsen-equivalent analysis before the specification is finalised, or is simulation-based testing sufficient?

**Options:**
- A. Simulation-based testing is sufficient at specification phase; Jepsen analysis deferred to implementation phase
- B. A Jepsen-equivalent model-level analysis is required before specification acceptance
- C. A full Jepsen-style test against a prototype implementation is required within the 14-day window

**Implications:** Option C is highly ambitious for a 14-day timeline. Option A defers significant risk to implementation.

### DQ-05: Security Penetration Testing Depth

**Question:** What is the scope of security penetration testing for the DistOS security model?

**Options:**
- A. Threat-model review and manual analysis only
- B. Automated scanning plus manual review of authentication and authorisation boundaries
- C. Full red-team exercise against the security model including side-channel analysis

---

## Dependencies on Other Councils

| Council | Dependency Type | What Council-QA Needs |
|---------|----------------|-----------------------|
| **council-verify** | Upstream | Formal specifications and invariants to test against; verified properties to validate in simulation |
| **council-api** | Upstream | Complete API surface definitions with semantics, pre/post conditions, and error contracts |
| **council-sched** | Upstream | Scheduler specification including claimed consistency and fairness properties |
| **council-mem** | Upstream | Memory manager specification including isolation guarantees and allocation invariants |
| **council-sec** | Upstream | Security model specification including trust boundaries and threat model |
| **council-net** | Upstream | Network layer specification including partition tolerance claims |
| **council-fs** | Upstream | Weka FS integration specification including consistency claims |
| **council-synth** | Bidirectional | Synthesis decisions that affect testability; council-qa feeds defects back to council-synth for resolution arbitration |
| **council-arch** | Downstream | Decision archaeology reads all council-qa defect reports and test results to narrate the QA story |
| **council-docs** | Downstream | Documentation council consumes test reports for the final specification document |

**Critical dependency:** Council-qa cannot begin property testing until council-verify delivers at least draft formal specifications. This creates a hard dependency on council-verify delivering Phase 2 outputs on schedule.

---

## WHOOSH Configuration

```yaml
council: council-qa
whoosh:
  formation:
    strategy: domain-partitioned  # Each test domain forms a sub-team that operates semi-independently
    partitions:
      - name: chaos-partition
        roles: [chaos-engineer]
        size: 12
        coordination: async
      - name: fuzz-partition
        roles: [fuzz-operator]
        size: 10
        coordination: async
      - name: property-partition
        roles: [property-tester]
        size: 10
        coordination: sync-with-verify
      - name: byzantine-partition
        roles: [byzantine-simulator]
        size: 8
        coordination: async
      - name: gpu-partition
        roles: [gpu-fault-injector]
        size: 8
        coordination: async
      - name: simulation-partition
        roles: [simulation-engineer]
        size: 7
        coordination: sync
      - name: perf-partition
        roles: [performance-adversarialist]
        size: 5
        coordination: async

  quorum:
    defect-classification: 3/5  # Three roles must agree on P0/P1 classification
    test-acceptance: 4/7  # Four of seven role groups must sign off on test suite acceptance
    spec-block:  # council-qa can block spec acceptance; requires:
      threshold: simple-majority
      escalation: council-synth  # Disputes escalate to council-synth for arbitration

  subchannels:
    - id: defect-reports
      description: All discovered defects broadcast here for cross-council visibility
      subscribers: [council-verify, council-api, council-sched, council-mem, council-sec, council-synth, council-arch]
      retention: full-history
    - id: chaos-ops
      description: Real-time chaos experiment state
      subscribers: [council-qa, council-arch]
    - id: property-failures
      description: Property test failures routed to council-verify for spec review
      subscribers: [council-verify, council-synth, council-arch]
    - id: qa-acceptance
      description: Formal acceptance/rejection signals per deliverable
      subscribers: [council-synth, council-docs, council-arch]

  communication:
    internal: broadcast-to-partition
    cross-council: ucxl-addressed
    defect-escalation: immediate-broadcast
```

---

## Success Criteria

A council-qa execution is considered successful when all of the following are met:

1. **Coverage:** Every formal specification invariant produced by council-verify has at least one corresponding property test with documented pass/fail results.
2. **API surface:** 100% of API endpoints defined by council-api have been subjected to structured fuzz testing with documented results.
3. **Chaos coverage:** Jepsen-equivalent partition and failure scenarios have been executed against the DistOS scheduler, memory manager, and consensus layer models.
4. **Byzantine validation:** Byzantine fault simulations have validated that the consensus protocol meets its claimed fault-tolerance bounds.
5. **GPU error coverage:** All documented GPU error codes and conditions in the NVIDIA/AMD hardware specifications have corresponding DistOS response scenarios tested.
6. **Weka FS:** All consistency claims in the DistOS filesystem integration specification have been tested against documented Weka FS behaviour.
7. **Defect resolution:** All P0 defects are resolved. All P1 defects have formal mitigations accepted by the originating council and council-synth. All open defects are registered with UCXL-addressable evidence.
8. **Deterministic reproduction:** All discovered failures can be reproduced deterministically via the simulation harness or a minimised test case.
9. **Archaeology readability:** Council-arch has confirmed that the defect register and test reports produce comprehensible decision narratives.
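
Criterion 7 is mechanical enough to check automatically. The sketch below is one possible encoding, under stated assumptions: the `Defect` record, its field names, and the `blocking_reasons` helper are hypothetical and do not reflect any defined defect-register schema; only the rule itself and the example UCXL address come from this brief.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    """Hypothetical defect-register entry (field names are illustrative)."""
    severity: str                                   # "P0", "P1", "P2", ...
    origin: str                                     # originating council
    resolved: bool
    mitigation_accepted_by: frozenset = frozenset() # councils that signed off
    evidence: str = ""                              # UCXL address of evidence chain


def blocking_reasons(defects):
    """Return one message per violation of the defect-resolution criterion."""
    reasons = []
    for d in defects:
        if d.resolved:
            continue                                # only open defects can block
        if not d.evidence:
            reasons.append(f"open {d.severity} defect has no UCXL evidence")
        if d.severity == "P0":
            reasons.append("open P0 defect")
        elif d.severity == "P1":
            required = {d.origin, "council-synth"}  # both must accept the mitigation
            if not required <= d.mitigation_accepted_by:
                reasons.append("P1 mitigation not accepted by origin and council-synth")
    return reasons


EVIDENCE = "ucxl://council-qa:*@DistOS:qa/*^/register/defects"
register = [
    Defect("P0", "council-sched", resolved=True),
    Defect("P1", "council-mem", resolved=False,
           mitigation_accepted_by=frozenset({"council-mem", "council-synth"}),
           evidence=EVIDENCE),
    Defect("P2", "council-api", resolved=False, evidence=EVIDENCE),
]
acceptable = not blocking_reasons(register)
```

An empty result means the register satisfies the criterion; each returned message names one condition that would block the Final QA Acceptance Report.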

---

## Timeline Mapping

| Phase | Days | Council-QA Activities |
|-------|------|-----------------------|
| **Phase 1: Research** | 1–3 | Study formal specification structure (coordination with council-verify); design adversarial test taxonomy; review Jepsen, FoundationDB simulation, and AFL++ methodologies; define property language for spec conformance tests |
| **Phase 2: Architecture** | 3–6 | Produce Adversarial Test Strategy; design chaos playbook structure; begin encoding specification properties as tests (as draft specs arrive from council-verify); establish defect classification schema; configure WHOOSH subchannels and defect broadcast |
| **Phase 3: Formal Specification** | 6–10 | Execute property tests against formal specs from council-verify; run initial fuzz campaigns against draft API definitions; execute Byzantine fault simulations; GPU error injection scenario validation; first Weka FS adversarial tests; publish defect reports to all councils |
| **Phase 4: Integration** | 10–12 | Full integration testing across all subsystem specifications; cross-subsystem fault injection (e.g., GPU error during Weka checkpoint during scheduler failover); race condition analysis; performance adversarial benchmarks; defect resolution verification |
| **Phase 5: Documentation** | 12–14 | Produce Final QA Acceptance Report; compile consolidated defect register; validate all P0/P1 defects are resolved or formally accepted; support council-docs in formatting test reports for the final specification document |

---

*Council Design Brief v1.0, DistOS Project, council-qa*
*Generated: 2026-02-24*