DistOS/councils/10-qa-adversarial-testing.md

# Council Design Brief: Quality Assurance & Adversarial Testing

---

## Council Identification

| Field | Value |
|-------|-------|
| **Council ID** | `council-qa` |
| **Mission** | Adversarially validate the entire DistOS specification and implementation surface through systematic fuzz testing, chaos engineering, Byzantine fault simulation, and property-based verification, so that the behaviour council-verify proves correct in the formal specification matches what the system actually does under hostile, degraded, and pathological conditions. |
| **UCXL Base Address** | `ucxl://council-qa:*@DistOS:qa/*^/` |
| **Agent Count** | 60 agents |
| **Operates From** | Day 1 (Research phase) through Day 14 (Documentation close) |

---

## Scope and Responsibilities

Council-qa owns the gap between proof and reality. Formal verification proves that a model is internally consistent; adversarial testing checks whether that model accurately describes the system as built. This council is responsible for:

- Designing and executing a comprehensive adversarial test suite against all DistOS subsystems
- Validating that formal specifications produced by council-verify match observable system behaviour under both normal and failure conditions
- Discovering specification ambiguities by finding inputs or conditions that produce unexpected or underspecified behaviour
- Stress-testing all published API surfaces (received from council-api) for correctness, safety, and resilience
- Simulating Byzantine, crash-recovery, and omission faults across the 1024-node GPU cluster topology
- Injecting faults at the hardware layer (GPU errors, NVLink degradation, Weka FS unavailability, InfiniBand partition events) and observing correctness of system response
- Benchmarking performance under adversarial scheduling pressure, memory fragmentation, and concurrent failure scenarios
- Performing structured penetration testing of the DistOS security model
- Running deterministic simulation testing to reproduce races and transient faults at will
- Reporting all discovered defects, ambiguities, and underspecifications back to originating councils with UCXL-addressable evidence chains

Council-qa does **not** write specifications or make architectural decisions. Its authority is to raise issues and block deliverable acceptance until those issues are resolved or formally accepted as known limitations.

---

## Research Domains

### 1. Distributed Systems Fault Injection and Chaos Engineering

**Core framework:** Jepsen (Kyle Kingsbury), the canonical methodology for testing distributed system consistency under network partitions, clock skew, and process crashes. DistOS must pass Jepsen-equivalent analysis for all claims of linearisability, serialisability, or causal consistency made in the formal spec.

Key Jepsen analyses to replicate:
- Partition healing with in-flight GPU kernel state
- Split-brain detection in the consensus layer
- Clock skew effects on distributed scheduling decisions
- Write visibility after node rejoins
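
A Jepsen-style analysis ultimately reduces to checking a recorded history of operations against a consistency model. The sketch below is a brute-force linearisability check for a single register, small enough to illustrate the idea on a stale read after a rejoin. The `Op` record and its fields are hypothetical, and real checkers use far more efficient search than trying every permutation.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Optional


@dataclass(frozen=True)
class Op:
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned
    start: float           # invocation time
    end: float             # completion time


def respects_real_time(order) -> bool:
    """An operation that completed before another began must be ordered first."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.end < earlier.start:
                return False
    return True


def legal_register_run(order, initial=None) -> bool:
    """Replay the order sequentially; every read must see the latest write."""
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearisable(history, initial=None) -> bool:
    """True if some real-time-respecting total order is a legal register run."""
    return any(
        respects_real_time(order) and legal_register_run(order, initial)
        for order in permutations(history)
    )


# A read that returns 1 after the write of 2 has completed is a stale read.
stale = [Op("write", 1, 0, 1), Op("write", 2, 2, 3), Op("read", 1, 4, 5)]
# The same read is acceptable while the write of 2 is still in flight.
concurrent = [Op("write", 1, 0, 1), Op("write", 2, 2, 6), Op("read", 1, 3, 4)]
print(is_linearisable(stale), is_linearisable(concurrent))
```

The permutation search is factorial in the history length, so this form is only suitable for hand-sized histories used to pin down what a claim in the spec actually means.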

**Netflix Chaos Monkey / Simian Army** — production chaos engineering methodology. Adapt for pre-production specification testing: if chaos can be injected into the architecture model, it should be injected into the formal model before any node is touched.

**Google DiRT (Disaster Recovery Testing)** — structured disaster scenario playbooks. Council-qa will produce DiRT-equivalent playbooks for DistOS, covering datacenter-level failure scenarios for a 1024-node cluster.

Key papers:
- Kingsbury, K. (2013–present). Jepsen analysis series. https://jepsen.io
- Gunawi, H. S., et al. (2011). "FATE and DESTINI: A Framework for Cloud Recovery Testing." NSDI 2011.
- Basiri, A., et al. (2016). "Chaos Engineering." IEEE Software 33(3).
- Leesatapornwongsa, T., et al. (2016). "TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems." ASPLOS 2016.

### 2. Property-Based and Specification-Conformance Testing

**Hypothesis (Python) / QuickCheck (Haskell)** — shrinking-based property testing. The council will define properties derived from formal specifications and use automated generators to find minimally failing counterexamples.

Properties to encode:
- Scheduler decisions must be deterministic given the same system state snapshot
- Memory allocation must never produce aliased physical GPU memory addresses across tenants
- All filesystem operations on Weka must satisfy the POSIX consistency semantics claimed in the spec
- Security policy decisions must be monotone (granting more resources never reduces security guarantees)
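
As a rough illustration of how the first two properties could be encoded, the sketch below uses a hand-rolled seeded generator from the standard library in place of Hypothesis or QuickCheck, so it has no shrinking. `toy_schedule` and `toy_allocate` are hypothetical stand-ins for the real scheduler and allocator, not DistOS interfaces.

```python
import random


def toy_schedule(snapshot):
    """Hypothetical scheduler: each job goes to the least-loaded node, ties by id."""
    load = dict(snapshot["node_load"])
    placement = {}
    for job in sorted(snapshot["jobs"]):
        node = min(load, key=lambda n: (load[n], n))
        placement[job] = node
        load[node] += 1
    return placement


def toy_allocate(requests, pool_size):
    """Hypothetical bump allocator: {tenant: (start, end)} as half-open ranges."""
    ranges, cursor = {}, 0
    for tenant, size in requests:
        if cursor + size > pool_size:
            break  # pool exhausted; remaining requests are refused
        ranges[tenant] = (cursor, cursor + size)
        cursor += size
    return ranges


def ranges_overlap(a, b):
    """Two half-open ranges overlap if each starts before the other ends."""
    return a[0] < b[1] and b[0] < a[1]


def check_properties(trials=200, seed=0):
    rng = random.Random(seed)  # seeded so any failure can be replayed
    for _ in range(trials):
        snapshot = {
            "node_load": {f"n{i}": rng.randint(0, 5)
                          for i in range(rng.randint(1, 8))},
            "jobs": [f"j{i}" for i in range(rng.randint(0, 20))],
        }
        # Property: the same snapshot always yields the same decision.
        assert toy_schedule(snapshot) == toy_schedule(snapshot)

        requests = [(f"t{i}", rng.randint(1, 64))
                    for i in range(rng.randint(0, 10))]
        granted = list(toy_allocate(requests, pool_size=256).values())
        # Property: no two tenants are granted overlapping address ranges.
        for i, a in enumerate(granted):
            for b in granted[i + 1:]:
                assert not ranges_overlap(a, b)
    return trials


print(check_properties())
```

A property library built on Hypothesis or QuickCheck would express the same two assertions but add automatic shrinking to a minimal counterexample.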

**TLA+ model checking** — in coordination with council-verify, council-qa will encode system properties as TLA+ invariants and run TLC model checking over the state space of the DistOS scheduler and memory manager.

Key papers:
- Claessen, K., & Hughes, J. (2000). "QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs." ICFP 2000.
- MacIver, D. R., et al. (2019). "Hypothesis: A new approach to property-based testing." JOSS 4(43).
- Lamport, L. (2002). "Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers." Addison-Wesley.

### 3. Fuzz Testing of API Surfaces

**AFL++ / libFuzzer** — coverage-guided fuzzing of all DistOS API endpoints. Every API defined by council-api is a fuzzing target.

**Syzkaller** — adapted for DistOS syscall-equivalent interfaces. The kernel-level scheduling and memory interfaces will be fuzz-tested using a syzkaller-inspired grammar-aware fuzzer.

**gRPC/Protobuf fuzzing** — for any inter-node RPC interfaces, schema-aware fuzzing to discover serialisation edge cases, integer overflow in field handling, and unintended state transitions.

Focus areas:
- Malformed job submission to the scheduler
- Invalid or boundary-condition memory allocation requests
- Weka filesystem path traversal and permission edge cases
- Security token manipulation and replay attacks
- Malformed consensus messages (Raft/Paxos variant used by DistOS)
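
The first focus area can be illustrated with a minimal mutation fuzzer. This is a sketch only: `parse_job`, `RejectedJob`, and the JSON job format are hypothetical stand-ins for whatever council-api defines, and the loop is a plain random mutator rather than the coverage-guided search that AFL++ or libFuzzer perform. The contract being tested is that malformed input is always rejected cleanly and never escapes as an unexpected exception.

```python
import json
import random


class RejectedJob(ValueError):
    """The only error a well-behaved parser may raise on bad input."""


def parse_job(raw: bytes) -> dict:
    """Hypothetical job-submission parser used as the fuzz target."""
    try:
        doc = json.loads(raw.decode("utf-8"))
    except ValueError as exc:  # covers decode errors and invalid JSON
        raise RejectedJob("not valid JSON") from exc
    if not isinstance(doc, dict):
        raise RejectedJob("top level must be an object")
    gpus = doc.get("gpus")
    if isinstance(gpus, bool) or not isinstance(gpus, int) or not 1 <= gpus <= 1024:
        raise RejectedJob("gpus must be an integer in 1..1024")
    return {"name": str(doc.get("name", "")), "gpus": gpus}


def mutate(data: bytes, rng: random.Random) -> bytes:
    """Apply one to four random bit flips, byte inserts, or byte deletes."""
    buf = bytearray(data)
    for _ in range(rng.randint(1, 4)):
        op = rng.choice(("flip", "insert", "delete"))
        pos = rng.randrange(len(buf)) if buf else 0
        if op == "flip" and buf:
            buf[pos] ^= 1 << rng.randrange(8)
        elif op == "insert":
            buf.insert(pos, rng.randrange(256))
        elif op == "delete" and buf:
            del buf[pos]
    return bytes(buf)


def fuzz(target, seed_input: bytes, iterations=2000, seed=0):
    """Return every (input, error) pair where the target failed uncleanly."""
    rng = random.Random(seed)
    findings = []
    for _ in range(iterations):
        candidate = mutate(seed_input, rng)
        try:
            target(candidate)
        except RejectedJob:
            pass  # clean rejection is the expected outcome for malformed input
        except Exception as exc:  # anything else is a defect to triage
            findings.append((candidate, repr(exc)))
    return findings


print(len(fuzz(parse_job, b'{"name": "train", "gpus": 8}')))
```

Recording the seed alongside each finding keeps the campaign reproducible, which matters when crashes are later minimised and attached to a defect report.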

Key papers:
- Zalewski, M. (2014). American Fuzzy Lop technical whitepaper.
- Serebryany, K. (2017). "OSS-Fuzz - Google's Continuous Fuzzing Service for Open Source Software." USENIX Security 2017.
- Vishnyakov, A., et al. (2022). "SYDR-Fuzz: Continuous Hybrid Fuzzing and Dynamic Analysis for Security Development Lifecycle." ISPRAS Open 2022.

### 4. Byzantine Fault Simulation

For a 1024-node cluster intended for production AI workloads, Byzantine faults (nodes sending malicious, corrupt, or contradictory messages) are a realistic adversary model.

Council-qa will simulate:
- Nodes sending conflicting scheduling state to different peers
- GPU memory controllers reporting incorrect allocation metadata
- Weka FS nodes returning inconsistent directory listings
- Consensus participants casting votes that contradict their local state

**BFT protocol validation:** Any Byzantine-fault-tolerant consensus mechanism in DistOS must tolerate f faulty nodes where N >= 3f + 1. For a 1024-node cluster this bounds f at 341 (since 3 × 341 + 1 = 1024), a scale that requires rigorous simulation.
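
The bound can be checked with a few lines of arithmetic. The sketch below assumes the classic PBFT-style setting in which any two quorums must intersect in at least f + 1 nodes; the function names are illustrative only.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f satisfying n >= 3f + 1."""
    return (n - 1) // 3


def max_crash_faults(n: int) -> int:
    """Largest f satisfying n >= 2f + 1 (crash-fault-tolerant majority quorums)."""
    return (n - 1) // 2


def bft_quorum(n: int) -> int:
    """Smallest q such that two quorums overlap in at least f + 1 nodes.

    Two quorums of size q share at least 2q - n nodes, so q >= (n + f + 1) / 2.
    """
    f = max_byzantine_faults(n)
    return (n + f) // 2 + 1


n = 1024
print(max_byzantine_faults(n), bft_quorum(n), max_crash_faults(n))
```

For 1024 nodes this gives f = 341 with quorums of 683 under BFT, against 511 tolerable crash faults under a majority-quorum CFT protocol.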

Key papers:
- Lamport, L., Shostak, R., & Pease, M. (1982). "The Byzantine Generals Problem." ACM TOPLAS 4(3).
- Castro, M., & Liskov, B. (1999). "Practical Byzantine Fault Tolerance." OSDI 1999.
- Yin, M., et al. (2019). "HotStuff: BFT Consensus with Linearity and Responsiveness." PODC 2019.

### 5. GPU-Specific Error Injection

**NVIDIA GPU Error Injection (NVML / DCGM)** — inject ECC errors, NVLink faults, and CUDA context corruption. DistOS must respond correctly to all hardware-signalled errors from the GPU fabric.

**ROCm SMI fault injection** — equivalent capabilities for AMD GPUs if mixed hardware is present.

Scenarios:
- Single-bit ECC error during active kernel execution: does DistOS checkpoint, migrate, or terminate the affected job?
- NVLink link failure mid-allreduce: does the collective communication layer correctly abort and reschedule?
- GPU memory over-temperature throttling: does the scheduler correctly rebalance load?
- CUDA context loss: does the runtime correctly clean up tenant resources?

Key references:
- NVIDIA. (2024). Data Center GPU Manager (DCGM) User Guide. NVIDIA Corporation.
- NVIDIA. (2023). NVML API Reference Guide. NVIDIA Corporation.
- Hochschild, P., et al. (2021). "Cores that don't count." HotOS 2021. (Silent data corruption in datacenter CPUs — same failure class applies to GPUs.)

### 6. Weka Parallel Filesystem Adversarial Testing

Weka FS is a high-performance parallel filesystem with specific consistency semantics. DistOS makes claims about filesystem semantics that must be tested against Weka's actual behaviour.

Test scenarios:
- Weka cluster node failure during active checkpoint write: does DistOS detect partial writes?
- Concurrent checkpoint reads and writes from 512 nodes: does the system observe stale data?
- Weka metadata server overload: how does DistOS degrade gracefully?
- Network partition between Weka clients and Weka backend: does DistOS correctly pause or abort affected operations?

Key references:
- Weka.io. (2024). WekaFS Architecture and Internals. Weka Technical Documentation.
- Carns, P., et al. (2011). "Understanding and Improving Computational Science Storage Access through Continuous Characterization." MSST 2011.

### 7. Deterministic Simulation Testing

**FoundationDB simulation testing** is among the most rigorous known approaches to distributed systems testing: run the entire system inside a deterministic simulator where all nondeterminism (network delays, disk I/O, timer callbacks) is controlled by a seeded random number generator, so that any failing test can be reproduced exactly.

Council-qa will specify a deterministic simulation harness for the DistOS core components (scheduler, memory manager, consensus layer) that:
- Replaces all OS-level nondeterminism with simulated equivalents
- Records all random seeds for reproducible failure replay
- Supports time-travel debugging (roll back to a pre-failure state and re-execute)
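
A minimal sketch of the first two requirements follows, assuming a single-threaded event loop with a virtual clock. The `Simulation` class, the 10% drop rate, and the `gossip` handler are all hypothetical; the point is only that every delay and drop decision comes from one seeded generator, so a seed fully determines the trace. Time-travel debugging is not shown.

```python
import heapq
import random


class Simulation:
    def __init__(self, seed: int):
        self.seed = seed                 # recorded so a failure can be replayed
        self.rng = random.Random(seed)   # the only source of nondeterminism
        self.now = 0.0                   # virtual clock; wall time is never read
        self.queue = []                  # heap of (deliver_at, seq, dst, msg)
        self.seq = 0                     # tie-breaker for equal delivery times
        self.trace = []

    def send(self, dst, msg):
        """Schedule delivery after a simulated delay, or drop the message."""
        delay = self.rng.uniform(0.001, 0.050)
        dropped = self.rng.random() < 0.1
        if not dropped:
            heapq.heappush(self.queue, (self.now + delay, self.seq, dst, msg))
        self.seq += 1

    def run(self, handler, max_events=1000):
        """Deliver messages in virtual-time order, recording each delivery."""
        while self.queue and len(self.trace) < max_events:
            self.now, _, dst, msg = heapq.heappop(self.queue)
            self.trace.append((round(self.now, 6), dst, msg))
            handler(self, dst, msg)
        return self.trace


def gossip(sim, dst, hops):
    """Hypothetical node logic: forward to both peers until hops run out."""
    if hops > 0:
        for peer in ("a", "b", "c"):
            if peer != dst:
                sim.send(peer, hops - 1)


def run_once(seed):
    sim = Simulation(seed)
    sim.send("a", 4)
    return sim.run(gossip)


print(run_once(42) == run_once(42))
```

Because the handler receives the simulator rather than a real socket or timer, the same component code can in principle run against either, which is the property that makes replay from a seed meaningful.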

Key references:
- Zhou, J., et al. (2021). "FoundationDB: A Distributed Unbundled Transactional Key-Value Store." SIGMOD 2021.
- Kingsbury, K. (2018). "An Introduction to Jepsen." Strange Loop 2018.
- Alvaro, P., Rosen, J., & Hellerstein, J. M. (2015). "Lineage-driven Fault Injection." SIGMOD 2015.

### 8. Race Condition Detection

**ThreadSanitizer (TSan) / Helgrind** — data race detection in any shared-memory regions of DistOS.

**Concuerror / DPOR (Dynamic Partial Order Reduction)** — systematic concurrency testing for Erlang/Elixir-style actor systems, adaptable to DistOS agent communication patterns.

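**Deterministic concurrency testing**: systematically enumerate the interleavings of concurrent operations in critical sections of the scheduler and memory allocator, using DPOR or bounded exploration where the full interleaving space is too large to cover.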
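
To make the enumeration idea concrete, the sketch below exhaustively interleaves two threads that each perform an unsynchronised load-then-store increment on a shared counter. The step names and the two-thread program are illustrative assumptions, and no partial order reduction is applied.

```python
def interleavings(a, b):
    """Yield every merge of a and b that preserves each list's own order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule of (thread, step) pairs; return the final counter."""
    shared = 0
    local = {}
    for thread, step in schedule:
        if step == "load":
            local[thread] = shared       # read the counter into a register
        else:                            # "store"
            shared = local[thread] + 1   # write back the incremented value
    return shared


T1 = [("t1", "load"), ("t1", "store")]
T2 = [("t2", "load"), ("t2", "store")]

schedules = list(interleavings(T1, T2))
lost_updates = [s for s in schedules if run(s) != 2]
print(len(schedules), len(lost_updates))
```

Of the six possible schedules, four lose an update; only the two fully serial orders produce the expected count of 2. The number of schedules grows combinatorially with thread and step counts, which is why reduction techniques such as DPOR are needed beyond toy cases.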

---

## Agent Roles

| Role | Count | Responsibilities |
|------|-------|-----------------|
| **Chaos Engineers** | 12 | Design and execute Jepsen-style partition and failure scenarios; maintain chaos playbook library |
| **Fuzz Operators** | 10 | Run AFL++/libFuzzer/syzkaller campaigns against all API surfaces; triage and minimise crashes |
| **Property Testers** | 10 | Encode formal spec properties as Hypothesis/QuickCheck tests; maintain property library |
| **Byzantine Simulators** | 8 | Implement and execute Byzantine fault scenarios; validate BFT protocol correctness |
| **GPU Fault Injectors** | 8 | NVML/DCGM-based hardware fault injection; GPU error response validation |
| **Simulation Engineers** | 7 | Build and maintain the deterministic simulation harness; replay and minimise failures |
| **Performance Adversarialists** | 5 | Benchmarking under adversarial load; identify performance cliffs and scheduling pathologies |

**Total: 60 agents**

---

## Key Deliverables

| Deliverable | UCXL Address | Due Phase |
|-------------|--------------|-----------|
| Adversarial Test Strategy | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/spec/test-strategy` | Phase 2 |
| Chaos Engineering Playbook | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/runbook/chaos-playbook` | Phase 2 |
| Property Test Suite (formal spec conformance) | `ucxl://council-qa:property-tester@DistOS:qa/*^/test-suite/property-tests` | Phase 3 |
| API Fuzz Campaign Report | `ucxl://council-qa:fuzz-operator@DistOS:qa/*^/report/fuzz-campaign-api` | Phase 3 |
| Byzantine Fault Simulation Results | `ucxl://council-qa:byzantine-simulator@DistOS:qa/*^/report/byzantine-faults` | Phase 3 |
| GPU Error Injection Results | `ucxl://council-qa:gpu-fault-injector@DistOS:qa/*^/report/gpu-fault-injection` | Phase 3 |
| Deterministic Simulation Harness Spec | `ucxl://council-qa:simulation-engineer@DistOS:qa/*^/spec/deterministic-sim` | Phase 3 |
| Weka FS Adversarial Test Results | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/report/weka-adversarial` | Phase 4 |
| Race Condition Detection Report | `ucxl://council-qa:simulation-engineer@DistOS:qa/*^/report/race-conditions` | Phase 4 |
| Performance Adversarial Benchmarks | `ucxl://council-qa:performance-adversarialist@DistOS:qa/*^/benchmark/adversarial-perf` | Phase 4 |
| Final QA Acceptance Report | `ucxl://council-qa:*@DistOS:qa/*^/report/final-acceptance` | Phase 5 |
| Consolidated Defect Register | `ucxl://council-qa:*@DistOS:qa/*^/register/defects` | Continuous |

---

## Decision Points

### DQ-01: Acceptable Defect Threshold for Specification Acceptance
**Question:** What severity and count of open defects constitutes an acceptable state for the DistOS specification to be declared complete?

**Options:**
- A. Zero P0 defects, fewer than 10 P1 defects with documented mitigations
- B. Zero P0/P1 defects, unlimited P2 with tracking
- C. All discovered defects must be resolved or formally accepted with architectural rationale

**Implications:** Option C produces the highest-quality specification but may block delivery. Option A risks shipping known critical issues with mitigations that may not hold in production.
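
The three options can be read as predicates over the defect register, which makes their differences easy to compare on a sample register. The sketch below is one possible reading, not a decided policy: the `Defect` fields and the status values `open`, `resolved`, and `accepted` are assumptions introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Defect:
    severity: str            # "P0", "P1", or "P2"
    status: str              # "open", "resolved", or "accepted" (assumed values)
    mitigated: bool = False  # True if a documented mitigation exists


def open_defects(defects, severity):
    return [d for d in defects if d.severity == severity and d.status == "open"]


def gate_a(defects) -> bool:
    """Option A: zero open P0, fewer than 10 open P1, each P1 mitigated."""
    p1 = open_defects(defects, "P1")
    return (not open_defects(defects, "P0")
            and len(p1) < 10
            and all(d.mitigated for d in p1))


def gate_b(defects) -> bool:
    """Option B: zero open P0 or P1; open P2 defects are allowed."""
    return not open_defects(defects, "P0") and not open_defects(defects, "P1")


def gate_c(defects) -> bool:
    """Option C: every defect is resolved or formally accepted."""
    return all(d.status in ("resolved", "accepted") for d in defects)


register = [
    Defect("P0", "resolved"),
    Defect("P1", "open", mitigated=True),
    Defect("P2", "open"),
]
print(gate_a(register), gate_b(register), gate_c(register))
```

On this sample register only Option A passes: the open P1 blocks Option B, and the open P2 additionally blocks Option C.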

### DQ-02: Scope of Byzantine Fault Tolerance Requirement
**Question:** Is DistOS required to tolerate Byzantine faults, or only crash-recovery faults?

**Options:**
- A. Crash-recovery (CFT) only — simpler protocols, lower overhead, sufficient for trusted datacenter hardware
- B. Byzantine fault tolerance for the consensus layer only
- C. Full BFT for all distributed components

**Implications:** Full BFT (Option C) requires significantly more complex protocols and larger quorums: tolerating f faults takes N >= 3f + 1 nodes under BFT versus N >= 2f + 1 under CFT, roughly 50% more replicas for the same f. Given a 1024-node cluster with known hardware, CFT may be sufficient.

### DQ-03: Deterministic Simulation Scope
**Question:** Which DistOS components must support deterministic simulation testing?

**Options:**
- A. Core consensus and scheduling components only
- B. All components that touch shared state
- C. The entire DistOS software stack including GPU runtime interfaces

**Implications:** Option C provides the strongest testing guarantees but requires a simulation layer for every hardware interface including GPU APIs.

### DQ-04: Jepsen-Equivalent Validation Requirement
**Question:** Must DistOS pass a Jepsen-equivalent analysis before the specification is finalised, or is simulation-based testing sufficient?

**Options:**
- A. Simulation-based testing is sufficient at specification phase; Jepsen analysis deferred to implementation phase
- B. A Jepsen-equivalent model-level analysis is required before specification acceptance
- C. A full Jepsen-style test against a prototype implementation is required within the 14-day window

**Implications:** Option C is highly ambitious for a 14-day timeline. Option A defers significant risk to implementation.

### DQ-05: Security Penetration Testing Depth
**Question:** What is the scope of security penetration testing for the DistOS security model?

**Options:**
- A. Threat-model review and manual analysis only
- B. Automated scanning plus manual review of authentication and authorisation boundaries
- C. Full red-team exercise against the security model including side-channel analysis

---

## Dependencies on Other Councils

| Council | Dependency Type | What Council-QA Needs |
|---------|----------------|-----------------------|
| **council-verify** | Upstream | Formal specifications and invariants to test against; verified properties to validate in simulation |
| **council-api** | Upstream | Complete API surface definitions with semantics, pre/post conditions, and error contracts |
| **council-sched** | Upstream | Scheduler specification including claimed consistency and fairness properties |
| **council-mem** | Upstream | Memory manager specification including isolation guarantees and allocation invariants |
| **council-sec** | Upstream | Security model specification including trust boundaries and threat model |
| **council-net** | Upstream | Network layer specification including partition tolerance claims |
| **council-fs** | Upstream | Weka FS integration specification including consistency claims |
| **council-synth** | Bidirectional | Synthesis decisions that affect testability; council-qa feeds defects back to council-synth for resolution arbitration |
| **council-arch** | Downstream | Decision archaeology reads all council-qa defect reports and test results to narrate the QA story |
| **council-docs** | Downstream | Documentation council consumes test reports for the final specification document |

**Critical dependency:** Council-qa cannot begin property testing until council-verify delivers at least draft formal specifications. This creates a hard dependency on council-verify delivering Phase 2 outputs on schedule.

---

## WHOOSH Configuration

```yaml
council: council-qa
whoosh:
  formation:
    strategy: domain-partitioned
    # Each test domain forms a sub-team that operates semi-independently
    partitions:
      - name: chaos-partition
        roles: [chaos-engineer]
        size: 12
        coordination: async
      - name: fuzz-partition
        roles: [fuzz-operator]
        size: 10
        coordination: async
      - name: property-partition
        roles: [property-tester]
        size: 10
        coordination: sync-with-verify
      - name: byzantine-partition
        roles: [byzantine-simulator]
        size: 8
        coordination: async
      - name: gpu-partition
        roles: [gpu-fault-injector]
        size: 8
        coordination: async
      - name: simulation-partition
        roles: [simulation-engineer]
        size: 7
        coordination: sync
      - name: perf-partition
        roles: [performance-adversarialist]
        size: 5
        coordination: async

  quorum:
    defect-classification: 3/5  # Three roles must agree on P0/P1 classification
    test-acceptance: 4/7        # Four of seven role groups must sign off on test suite acceptance
    spec-block:                 # council-qa can block spec acceptance; requires:
      threshold: simple-majority
      escalation: council-synth # Disputes escalate to council-synth for arbitration

  subchannels:
    - id: defect-reports
      description: All discovered defects broadcast here for cross-council visibility
      subscribers: [council-verify, council-api, council-sched, council-mem, council-sec, council-net, council-fs, council-synth, council-arch]
      retention: full-history
    - id: chaos-ops
      description: Real-time chaos experiment state
      subscribers: [council-qa, council-arch]
    - id: property-failures
      description: Property test failures routed to council-verify for spec review
      subscribers: [council-verify, council-synth, council-arch]
    - id: qa-acceptance
      description: Formal acceptance/rejection signals per deliverable
      subscribers: [council-synth, council-docs, council-arch]

  communication:
    internal: broadcast-to-partition
    cross-council: ucxl-addressed
    defect-escalation: immediate-broadcast
```
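
The quorum rules above could be evaluated along the following lines. This is a sketch under an assumed reading of the notation, in which `k/n` means k approvals from a panel of exactly n voters and `simple-majority` means strictly more than half. The function names are hypothetical and do not describe an actual WHOOSH interface.

```python
def required_approvals(rule: str, voters: int) -> int:
    """Minimum approvals needed under the assumed reading of a quorum rule."""
    if rule == "simple-majority":
        return voters // 2 + 1  # strictly more than half
    needed, panel = (int(part) for part in rule.split("/"))
    if panel != voters or not 0 < needed <= panel:
        raise ValueError(f"rule {rule!r} does not fit a panel of {voters}")
    return needed


def quorum_met(rule: str, approvals: int, voters: int) -> bool:
    return approvals >= required_approvals(rule, voters)


print(quorum_met("3/5", 3, 5))               # defect-classification
print(quorum_met("4/7", 3, 7))               # test-acceptance, one vote short
print(required_approvals("simple-majority", 7))
```

Under this reading, `test-acceptance: 4/7` and a simple majority of the seven role groups require the same number of approvals.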

---

## Success Criteria

A council-qa execution is considered successful when all of the following are met:

1. **Coverage:** Every formal specification invariant produced by council-verify has at least one corresponding property test with documented pass/fail results.

2. **API surface:** 100% of API endpoints defined by council-api have been subjected to structured fuzz testing with documented results.

3. **Chaos coverage:** Jepsen-equivalent partition and failure scenarios have been executed against the DistOS scheduler, memory manager, and consensus layer models.

4. **Byzantine validation:** Byzantine fault simulations have validated that the consensus protocol meets its claimed fault-tolerance bounds.

5. **GPU error coverage:** All documented GPU error codes and conditions in the NVIDIA/AMD hardware specifications have corresponding DistOS response scenarios tested.

6. **Weka FS:** All consistency claims in the DistOS filesystem integration specification have been tested against documented Weka FS behaviour.

7. **Defect resolution:** All P0 defects are resolved. All P1 defects have formal mitigations accepted by the originating council and council-synth. All open defects are registered with UCXL-addressable evidence.

8. **Deterministic reproduction:** All discovered failures can be reproduced deterministically via the simulation harness or a minimised test case.

9. **Archaeology readability:** Council-arch has confirmed that the defect register and test reports produce comprehensible decision narratives.

---

## Timeline Mapping

| Phase | Days | Council-QA Activities |
|-------|------|-----------------------|
| **Phase 1: Research** | 1–3 | Study formal specification structure (coordination with council-verify); design adversarial test taxonomy; review Jepsen, FoundationDB simulation, and AFL++ methodologies; define property language for spec conformance tests |
| **Phase 2: Architecture** | 3–6 | Produce Adversarial Test Strategy; design chaos playbook structure; begin encoding specification properties as tests (as draft specs arrive from council-verify); establish defect classification schema; configure WHOOSH subchannels and defect broadcast |
| **Phase 3: Formal Specification** | 6–10 | Execute property tests against formal specs from council-verify; run initial fuzz campaigns against draft API definitions; execute Byzantine fault simulations; GPU error injection scenario validation; first Weka FS adversarial tests; publish defect reports to all councils |
| **Phase 4: Integration** | 10–12 | Full integration testing across all subsystem specifications; cross-subsystem fault injection (e.g., GPU error during Weka checkpoint during scheduler failover); race condition analysis; performance adversarial benchmarks; defect resolution verification |
| **Phase 5: Documentation** | 12–14 | Produce Final QA Acceptance Report; compile consolidated defect register; validate all P0/P1 defects are resolved or formally accepted; support council-docs in formatting test reports for the final specification document |

---

*Council Design Brief v1.0 — DistOS Project — council-qa*
*Generated: 2026-02-24*