Council Design Brief: Quality Assurance & Adversarial Testing


Council Identification

| Field | Value |
|---|---|
| Council ID | council-qa |
| Mission | Adversarially validate the entire DistOS specification and implementation surface through systematic fuzz testing, chaos engineering, Byzantine fault simulation, and property-based verification, ensuring that what the formal specification says is provably correct (council-verify's domain) actually matches what the system does under hostile, degraded, and pathological conditions. |
| UCXL Base Address | `ucxl://council-qa:*@DistOS:qa/*^/` |
| Agent Count | ~60 agents |
| Operates | From Day 3 (Architecture phase) through Day 14 (Documentation close) |

Scope and Responsibilities

Council-qa owns the gap between proof and reality. Formal verification proves that a model is internally consistent; adversarial testing checks whether that model accurately describes the system as built. This council is responsible for:

  • Designing and executing a comprehensive adversarial test suite against all DistOS subsystems
  • Validating that formal specifications produced by council-verify match observable system behaviour under both normal and failure conditions
  • Discovering specification ambiguities by finding inputs or conditions that produce unexpected or underspecified behaviour
  • Stress-testing all published API surfaces (received from council-api) for correctness, safety, and resilience
  • Simulating Byzantine, crash-recovery, and omission faults across the 1024-node GPU cluster topology
  • Injecting faults at the hardware layer (GPU errors, NVLink degradation, Weka FS unavailability, InfiniBand partition events) and observing correctness of system response
  • Benchmarking performance under adversarial scheduling pressure, memory fragmentation, and concurrent failure scenarios
  • Performing structured penetration testing of the DistOS security model
  • Running deterministic simulation testing to reproduce races and transient faults at will
  • Reporting all discovered defects, ambiguities, and underspecifications back to originating councils with UCXL-addressable evidence chains

Council-qa does not write specifications or make architectural decisions. Its authority is to raise issues and block deliverable acceptance until those issues are resolved or formally accepted as known limitations.


Research Domains

1. Distributed Systems Fault Injection and Chaos Engineering

Core framework: Jepsen (Kyle Kingsbury) — the canonical methodology for testing distributed system consistency under network partitions, clock skew, and process crashes. DistOS must pass Jepsen-equivalent analysis for all claims of linearisability, serializability, or causal consistency made in the formal spec.

Key Jepsen analyses to replicate:

  • Partition healing with in-flight GPU kernel state
  • Split-brain detection in the consensus layer
  • Clock skew effects on distributed scheduling decisions
  • Write visibility after node rejoins
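
As an illustration of the kind of check a Jepsen-style analysis applies to a recorded history, here is a minimal sketch of a brute-force linearisability checker for a single register. The `Op` record and `is_linearizable` function are hypothetical names for this example, not part of Jepsen or DistOS, and the search is exponential, so it only suits small histories.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Op:
    kind: str              # "write" or "read"
    value: Optional[int]   # value written, or value the read returned
    invoke: float          # when the client issued the operation
    complete: float        # when the client received the response


def is_linearizable(history, initial=None):
    """Brute-force search for a legal sequential order of a register history."""

    def search(remaining, state):
        if not remaining:
            return True
        # Real-time order: an operation cannot be placed next if some other
        # pending operation completed before it was even invoked.
        earliest_complete = min(op.complete for op in remaining)
        for i, op in enumerate(remaining):
            if op.invoke > earliest_complete:
                continue
            if op.kind == "read" and op.value != state:
                continue  # this read could not have observed the current state
            next_state = op.value if op.kind == "write" else state
            if search(remaining[:i] + remaining[i + 1:], next_state):
                return True
        return False

    return search(list(history), initial)


# Stale read: write 2 was acknowledged before the read began, yet the read saw 1.
stale = [Op("write", 1, 0, 1), Op("write", 2, 2, 3), Op("read", 1, 4, 5)]
# Legal: write 2 was still in flight during the read, so it may take effect later.
in_flight = [Op("write", 1, 0, 1), Op("write", 2, 2, 6), Op("read", 1, 4, 5)]
```

The `stale` history models the "write visibility after node rejoins" scenario: a rejoined node serving an old value after a newer write was acknowledged is exactly what the checker rejects.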

Netflix Chaos Monkey / Simian Army — production chaos engineering methodology. Adapt for pre-production specification testing: if chaos can be injected into the architecture model, it should be injected into the formal model before any node is touched.

Google DiRT (Disaster Recovery Testing) — structured disaster scenario playbooks. Council-qa will produce DiRT-equivalent playbooks for DistOS, covering datacenter-level failure scenarios for a 1024-node cluster.

Key papers:

  • Kingsbury, K. (2013–present). Jepsen analysis series. https://jepsen.io
  • Gunawi, H. S., et al. (2011). "FATE and DESTINI: A Framework for Cloud Recovery Testing." NSDI 2011.
  • Basiri, A., et al. (2016). "Chaos Engineering." IEEE Software 33(3).
  • Leesatapornwongsa, T., et al. (2016). "TaxDC: A Taxonomy of Non-Deterministic Concurrency Bugs in Datacenter Distributed Systems." ASPLOS 2016.

2. Property-Based and Specification-Conformance Testing

Hypothesis (Python) / QuickCheck (Haskell) — shrinking-based property testing. The council will define properties derived from formal specifications and use automated generators to find minimally failing counterexamples.

Properties to encode:

  • Scheduler decisions must be deterministic given the same system state snapshot
  • Memory allocation must never produce aliased physical GPU memory addresses across tenants
  • All filesystem operations on Weka must satisfy the POSIX consistency semantics claimed in the spec
  • Security policy decisions must be monotone (granting more resources never reduces security guarantees)
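
To make the memory-aliasing property concrete, the following is a minimal sketch of a randomized property check written with the standard library only. `ToyAllocator` is a hypothetical first-fit stand-in for the DistOS memory manager, and the seeded loop is a hand-rolled substitute for what Hypothesis or QuickCheck would provide (generators plus shrinking); none of these names come from the specification.

```python
import random
from itertools import combinations


class ToyAllocator:
    """Hypothetical first-fit allocator standing in for the GPU memory manager."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = {}        # handle -> (tenant, start, size)
        self.next_handle = 0

    def alloc(self, tenant, size):
        start = 0
        for _, s, n in sorted(self.blocks.values(), key=lambda b: b[1]):
            if start + size <= s:
                break                      # fits in the gap before this block
            start = max(start, s + n)      # otherwise skip past the block
        if start + size > self.capacity:
            return None                    # out of memory
        handle = self.next_handle
        self.next_handle += 1
        self.blocks[handle] = (tenant, start, size)
        return handle

    def free(self, handle):
        self.blocks.pop(handle, None)


def no_cross_tenant_aliasing(alloc):
    """Property: no two live blocks owned by different tenants overlap."""
    for (t1, s1, n1), (t2, s2, n2) in combinations(alloc.blocks.values(), 2):
        if t1 != t2 and s1 < s2 + n2 and s2 < s1 + n1:
            return False
    return True


def check_property(seed, steps=200):
    """Run one random alloc/free sequence; the seed makes any failure replayable."""
    rng = random.Random(seed)
    alloc = ToyAllocator(capacity=1024)
    handles = []
    for _ in range(steps):
        if handles and rng.random() < 0.4:
            alloc.free(handles.pop(rng.randrange(len(handles))))
        else:
            h = alloc.alloc(tenant=rng.choice("ABC"), size=rng.randint(1, 128))
            if h is not None:
                handles.append(h)
        if not no_cross_tenant_aliasing(alloc):
            return False
    return True


failing_seeds = [s for s in range(50) if not check_property(s)]
```

A failing seed is the counterexample to report; a real property library would additionally shrink the operation sequence to a minimal reproduction.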

TLA+ model checking — in coordination with council-verify, council-qa will encode system properties as TLA+ invariants and run TLC model checking over the state space of the DistOS scheduler and memory manager.

Key papers:

  • Claessen, K., & Hughes, J. (2000). "QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs." ICFP 2000.
  • MacIver, D. R., et al. (2019). "Hypothesis: A new approach to property-based testing." JOSS 4(43).
  • Lamport, L. (2002). "Specifying Systems: The TLA+ Language and Tools for Hardware and Software Engineers." Addison-Wesley.

3. Fuzz Testing of API Surfaces

AFL++ / libFuzzer — coverage-guided fuzzing of all DistOS API endpoints. Every API defined by council-api is a fuzzing target.

Syzkaller — adapted for DistOS syscall-equivalent interfaces. The kernel-level scheduling and memory interfaces will be fuzz-tested using a syzkaller-inspired grammar-aware fuzzer.

gRPC/Protobuf fuzzing — for any inter-node RPC interfaces, schema-aware fuzzing to discover serialisation edge cases, integer overflow in field handling, and unintended state transitions.

Focus areas:

  • Malformed job submission to the scheduler
  • Invalid or boundary-condition memory allocation requests
  • Weka filesystem path traversal and permission edge cases
  • Security token manipulation and replay attacks
  • Malformed consensus messages (Raft/Paxos variant used by DistOS)
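
The "malformed job submission" focus area can be sketched as follows. This is a blind mutation fuzzer, not a coverage-guided one like AFL++ or libFuzzer, and `parse_job` with its JSON job format is a hypothetical validator invented for the example. The oracle is that the parser may reject input only with `ValueError`; any other exception counts as a crash.

```python
import json
import random


def parse_job(raw: bytes) -> dict:
    """Hypothetical job-submission validator: rejects bad input with ValueError."""
    # UnicodeDecodeError and json.JSONDecodeError are both ValueError subclasses.
    job = json.loads(raw.decode("utf-8"))
    if not isinstance(job, dict):
        raise ValueError("job must be a JSON object")
    name, gpus = job.get("name"), job.get("gpus")
    if not isinstance(name, str) or not name:
        raise ValueError("name must be a non-empty string")
    if type(gpus) is not int or not 1 <= gpus <= 1024:
        raise ValueError("gpus must be an integer in [1, 1024]")
    return {"name": name, "gpus": gpus}


def mutate(seed: bytes, rng: random.Random) -> bytes:
    """Apply one to four random bit flips, deletions, or insertions."""
    data = bytearray(seed)
    for _ in range(rng.randint(1, 4)):
        choice = rng.randrange(3)
        if choice == 0 and data:
            data[rng.randrange(len(data))] ^= 1 << rng.randrange(8)
        elif choice == 1 and data:
            del data[rng.randrange(len(data))]
        else:
            data.insert(rng.randrange(len(data) + 1), rng.randrange(256))
    return bytes(data)


def fuzz(seed_input: bytes, iterations=2000, rng_seed=0):
    """Return every input that made the target fail outside its error contract."""
    rng = random.Random(rng_seed)
    crashes = []
    for _ in range(iterations):
        candidate = mutate(seed_input, rng)
        try:
            parse_job(candidate)
        except ValueError:
            pass                                  # documented rejection path
        except Exception as exc:                  # anything else is a finding
            crashes.append((candidate, repr(exc)))
    return crashes


SEED_JOB = b'{"name": "train", "gpus": 8}'
crashes = fuzz(SEED_JOB)
```

Fixing `rng_seed` keeps a campaign replayable, which matters when crashes are triaged and minimised later.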

Key papers:

  • Zalewski, M. (2014). American Fuzzy Lop technical whitepaper.
  • Serebryany, K. (2017). "OSS-Fuzz — Google's continuous fuzzing service for open source software." USENIX Security 2017.
  • Vishnyakov, A., et al. (2022). "SYDR-Fuzz: Continuous Hybrid Fuzzing and Dynamic Analysis for Security Development Lifecycle." ISPRAS Open 2022.

4. Byzantine Fault Simulation

For a 1024-node cluster intended for production AI workloads, Byzantine faults (nodes sending malicious, corrupt, or contradictory messages) are a realistic adversary model.

Council-qa will simulate:

  • Nodes sending conflicting scheduling state to different peers
  • GPU memory controllers reporting incorrect allocation metadata
  • Weka FS nodes returning inconsistent directory listings
  • Consensus participants casting votes that contradict their local state

BFT protocol validation: Any Byzantine-fault-tolerant consensus mechanism in DistOS must tolerate f faulty nodes where N >= 3f + 1. For a 1024-node cluster with f = 341, this requires rigorous simulation.
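
The arithmetic behind the f = 341 figure can be checked directly. The sketch below assumes the standard bounds (N >= 3f + 1 for Byzantine faults, N >= 2f + 1 for crash faults) and a quorum size chosen so that any two quorums share at least f + 1 nodes; the function names are illustrative only.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f satisfying n >= 3f + 1."""
    return (n - 1) // 3


def bft_quorum(n: int, f: int) -> int:
    """Smallest quorum q such that any two quorums overlap in at least f + 1
    nodes, i.e. 2q - n >= f + 1, which gives q = ceil((n + f + 1) / 2)."""
    return (n + f) // 2 + 1


def max_crash_faults(n: int) -> int:
    """Largest f satisfying n >= 2f + 1 (crash-fault-tolerant majority quorums)."""
    return (n - 1) // 2


n = 1024
f = max_byzantine_faults(n)      # 341
q = bft_quorum(n, f)             # 683, which equals 2f + 1 when n = 3f + 1
```

With all 341 faulty nodes silent, exactly 683 honest nodes remain, so a quorum is still reachable but with no slack; simulations at this boundary are the ones most likely to expose liveness bugs.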

Key papers:

  • Lamport, L., Shostak, R., & Pease, M. (1982). "The Byzantine Generals Problem." ACM TOPLAS 4(3).
  • Castro, M., & Liskov, B. (1999). "Practical Byzantine Fault Tolerance." OSDI 1999.
  • Yin, M., et al. (2019). "HotStuff: BFT Consensus with Linearity and Responsiveness." PODC 2019.

5. GPU-Specific Error Injection

NVIDIA GPU Error Injection (NVML / DCGM) — inject ECC errors, NVLink faults, and CUDA context corruption. DistOS must respond correctly to all hardware-signalled errors from the GPU fabric.

ROCm SMI fault injection — equivalent capabilities for AMD GPUs if mixed hardware is present.

Scenarios:

  • Single-bit ECC error during active kernel execution: does DistOS checkpoint, migrate, or terminate the affected job?
  • NVLink link failure mid-allreduce: does the collective communication layer correctly abort and reschedule?
  • GPU memory over-temperature throttling: does the scheduler correctly rebalance load?
  • CUDA context loss: does the runtime correctly clean up tenant resources?

Key references:

  • NVIDIA. (2024). Data Center GPU Manager (DCGM) User Guide. NVIDIA Corporation.
  • NVIDIA. (2023). NVML API Reference Guide. NVIDIA Corporation.
  • Hochschild, P., et al. (2021). "Cores that don't count." HotOS 2021. (Silent data corruption in datacenter CPUs — same failure class applies to GPUs.)

6. Weka Parallel Filesystem Adversarial Testing

Weka FS is a high-performance parallel filesystem with specific consistency semantics. DistOS makes claims about filesystem semantics that must be tested against Weka's actual behaviour.

Test scenarios:

  • Weka cluster node failure during active checkpoint write: does DistOS detect partial writes?
  • Concurrent checkpoint reads and writes from 512 nodes: does the system observe stale data?
  • Weka metadata server overload: how does DistOS degrade gracefully?
  • Network partition between Weka clients and Weka backend: does DistOS correctly pause or abort affected operations?
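
For the partial-write scenario, a test needs an oracle that distinguishes a complete checkpoint from a truncated or corrupted one. A minimal sketch, assuming a hypothetical framing of an 8-byte length and a SHA-256 digest ahead of the payload (this is not a Weka or DistOS format):

```python
import hashlib
import struct

# Hypothetical frame header: big-endian payload length, then SHA-256 digest.
HEADER = struct.Struct(">Q32s")


def frame_checkpoint(payload: bytes) -> bytes:
    """Prefix the payload with its length and digest."""
    return HEADER.pack(len(payload), hashlib.sha256(payload).digest()) + payload


def is_complete(blob: bytes) -> bool:
    """True only if the blob holds a full, uncorrupted frame."""
    if len(blob) < HEADER.size:
        return False
    length, digest = HEADER.unpack_from(blob)
    body = blob[HEADER.size:]
    return len(body) == length and hashlib.sha256(body).digest() == digest


framed = frame_checkpoint(b"model weights shard 0")
# Simulate a storage node failing at every possible byte offset mid-write.
truncations_detected = all(not is_complete(framed[:k]) for k in range(len(framed)))
```

An adversarial test would write `framed` through the filesystem, kill a backend node mid-write, and assert that whatever is read back either passes `is_complete` or is rejected by DistOS.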

Key references:

  • Weka.io. (2024). WekaFS Architecture and Internals. Weka Technical Documentation.
  • Carns, P., et al. (2011). "Understanding and Improving Computational Science Storage Access through Continuous Characterization." MSST 2011.

7. Deterministic Simulation Testing

FoundationDB simulation testing — the most rigorous approach to distributed systems testing known: run the entire system inside a deterministic simulator where all nondeterminism (network delays, disk I/O, timer callbacks) is controlled by a seeded random number generator. Any failing test can be reproduced exactly.

Council-qa will specify a deterministic simulation harness for the DistOS core components (scheduler, memory manager, consensus layer) that:

  • Replaces all OS-level nondeterminism with simulated equivalents
  • Records all random seeds for reproducible failure replay
  • Supports time-travel debugging (roll back to a pre-failure state and re-execute)
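
The core idea of the harness, that a single seed fixes every source of nondeterminism, can be sketched in a few lines. This is a toy discrete-event simulator with hypothetical names (`Simulator`, `run_scenario`), not the FoundationDB simulator or a DistOS component; it covers seeded replay only, not time-travel debugging.

```python
import heapq
import random


class Simulator:
    """Toy event loop: simulated clock, seeded delays and message drops."""

    def __init__(self, seed):
        self.rng = random.Random(seed)   # the only source of randomness
        self.now = 0.0                   # simulated time, never wall-clock time
        self.queue = []                  # (deliver_time, sequence, callback, args)
        self.seq = 0                     # tie-breaker keeps ordering deterministic
        self.trace = []

    def schedule(self, delay, callback, *args):
        heapq.heappush(self.queue, (self.now + delay, self.seq, callback, args))
        self.seq += 1

    def send(self, src, dst, msg, handler):
        if self.rng.random() < 0.1:      # injected fault: drop 10% of messages
            self.trace.append((self.now, "drop", src, dst, msg))
            return
        self.schedule(self.rng.uniform(1, 10), handler, src, dst, msg)

    def run(self):
        while self.queue:
            self.now, _, callback, args = heapq.heappop(self.queue)
            callback(*args)


def run_scenario(seed, rounds=20):
    """Three nodes bounce a counter around a ring until it reaches `rounds`."""
    sim = Simulator(seed)

    def on_message(src, dst, msg):
        sim.trace.append((sim.now, "recv", src, dst, msg))
        if msg < rounds:
            sim.send(dst, src, msg + 1, on_message)

    for node in range(3):
        sim.send(node, (node + 1) % 3, 0, on_message)
    sim.run()
    return sim.trace
```

Calling `run_scenario(42)` twice yields identical traces, so a failure found under seed 42 can be replayed exactly; recording the seed alongside each defect report is what makes the evidence chain reproducible.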

Key references:

  • Zhou, J., et al. (2021). "FoundationDB: A Distributed Unbundled Transactional Key-Value Store." SIGMOD 2021.
  • Kingsbury, K. (2018). "An Introduction to Jepsen." Strange Loop 2018.
  • Alvaro, P., Rosen, J., & Hellerstein, J. M. (2015). "Lineage-driven Fault Injection." SIGMOD 2015.

8. Race Condition Detection

ThreadSanitizer (TSan) / Helgrind — data race detection in any shared-memory regions of DistOS.

Concuerror / DPOR (Dynamic Partial Order Reduction) — systematic concurrency testing for Erlang/Elixir-style actor systems, adaptable to DistOS agent communication patterns.

Deterministic concurrency testing — enumerate all interleavings of concurrent operations in critical sections of the scheduler and memory allocator.
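
As a small worked example of enumerating interleavings, the sketch below explores every schedule of two threads that each perform an unsynchronised read-then-write increment on a shared counter. It is plain exhaustive enumeration under assumed names; DPOR-based tools reach the same verdict while pruning schedules that are equivalent.

```python
def interleavings(a, b):
    """Yield every merge of sequences a and b that preserves each one's order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def run(schedule):
    """Execute one schedule of read/write steps against a shared counter."""
    counter = 0
    local = {}                      # per-thread register holding the value read
    for thread, op in schedule:
        if op == "read":
            local[thread] = counter
        else:                       # "write": store the stale-or-fresh value + 1
            counter = local[thread] + 1
    return counter


t1 = [("T1", "read"), ("T1", "write")]
t2 = [("T2", "read"), ("T2", "write")]

schedules = list(interleavings(t1, t2))
lost_updates = [s for s in schedules if run(s) != 2]
```

Of the 6 possible schedules, 4 lose an update (final counter 1 instead of 2); only the two fully serial orders are correct. Each entry in `lost_updates` is a concrete schedule that can be attached to a race-condition report.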


Agent Roles

| Role | Count | Responsibilities |
|---|---|---|
| Chaos Engineers | 12 | Design and execute Jepsen-style partition and failure scenarios; maintain chaos playbook library |
| Fuzz Operators | 10 | Run AFL++/libFuzzer/syzkaller campaigns against all API surfaces; triage and minimise crashes |
| Property Testers | 10 | Encode formal spec properties as Hypothesis/QuickCheck tests; maintain property library |
| Byzantine Simulators | 8 | Implement and execute Byzantine fault scenarios; validate BFT protocol correctness |
| GPU Fault Injectors | 8 | NVML/DCGM-based hardware fault injection; GPU error response validation |
| Simulation Engineers | 7 | Build and maintain the deterministic simulation harness; replay and minimise failures |
| Performance Adversarialists | 5 | Benchmarking under adversarial load; identify performance cliffs and scheduling pathologies |

Total: 60 agents


Key Deliverables

| Deliverable | UCXL Address | Due Phase |
|---|---|---|
| Adversarial Test Strategy | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/spec/test-strategy` | Phase 2 |
| Chaos Engineering Playbook | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/runbook/chaos-playbook` | Phase 2 |
| Property Test Suite (formal spec conformance) | `ucxl://council-qa:property-tester@DistOS:qa/*^/test-suite/property-tests` | Phase 3 |
| API Fuzz Campaign Report | `ucxl://council-qa:fuzz-operator@DistOS:qa/*^/report/fuzz-campaign-api` | Phase 3 |
| Byzantine Fault Simulation Results | `ucxl://council-qa:byzantine-simulator@DistOS:qa/*^/report/byzantine-faults` | Phase 3 |
| GPU Error Injection Results | `ucxl://council-qa:gpu-fault-injector@DistOS:qa/*^/report/gpu-fault-injection` | Phase 3 |
| Deterministic Simulation Harness Spec | `ucxl://council-qa:simulation-engineer@DistOS:qa/*^/spec/deterministic-sim` | Phase 3 |
| Weka FS Adversarial Test Results | `ucxl://council-qa:chaos-engineer@DistOS:qa/*^/report/weka-adversarial` | Phase 4 |
| Race Condition Detection Report | `ucxl://council-qa:simulation-engineer@DistOS:qa/*^/report/race-conditions` | Phase 4 |
| Consolidated Defect Register | `ucxl://council-qa:*@DistOS:qa/*^/register/defects` | Continuous |
| Final QA Acceptance Report | `ucxl://council-qa:*@DistOS:qa/*^/report/final-acceptance` | Phase 5 |
| Performance Adversarial Benchmarks | `ucxl://council-qa:performance-adversarialist@DistOS:qa/*^/benchmark/adversarial-perf` | Phase 4 |

Decision Points

DQ-01: Acceptable Defect Threshold for Specification Acceptance

Question: What severity and count of open defects constitutes an acceptable state for the DistOS specification to be declared complete?

Options:

  • A. Zero P0 defects, fewer than 10 P1 defects with documented mitigations
  • B. Zero P0/P1 defects, unlimited P2 with tracking
  • C. All discovered defects must be resolved or formally accepted with architectural rationale

Implications: Option C produces the highest-quality specification but may block delivery. Option A risks shipping known critical issues with mitigations that may not hold in production.
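
To show how the three options differ on the same defect register, here is a minimal sketch of an acceptance gate. The record fields (`severity`, `mitigated`, `accepted`) and the function name are assumptions made for illustration; the actual defect classification schema is a Phase 2 output.

```python
def spec_acceptable(open_defects, option):
    """Evaluate DQ-01 options A, B, or C against a list of unresolved defects.

    Each defect is a dict with a 'severity' of 'P0', 'P1', or 'P2', plus
    optional booleans 'mitigated' and 'accepted' (hypothetical field names).
    """
    p0 = [d for d in open_defects if d["severity"] == "P0"]
    p1 = [d for d in open_defects if d["severity"] == "P1"]
    if option == "A":
        # Zero P0, fewer than 10 P1, and every P1 has a documented mitigation.
        return not p0 and len(p1) < 10 and all(d.get("mitigated") for d in p1)
    if option == "B":
        # Zero P0 and zero P1; any number of tracked P2 defects.
        return not p0 and not p1
    if option == "C":
        # Every unresolved defect must be formally accepted.
        return all(d.get("accepted") for d in open_defects)
    raise ValueError(f"unknown option: {option!r}")


register = [
    {"severity": "P1", "mitigated": True},
    {"severity": "P2"},
]
```

For `register`, option A accepts the specification while B and C block it, which makes the trade-off in the implications above concrete.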

DQ-02: Scope of Byzantine Fault Tolerance Requirement

Question: Is DistOS required to tolerate Byzantine faults, or only crash-recovery faults?

Options:

  • A. Crash-recovery (CFT) only — simpler protocols, lower overhead, sufficient for trusted datacenter hardware
  • B. Byzantine fault tolerance for the consensus layer only
  • C. Full BFT for all distributed components

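Implications: Full BFT (Option C) requires significantly more complex protocols and larger quorums: tolerating f faulty nodes takes N >= 3f + 1 nodes under BFT versus N >= 2f + 1 under CFT, so a cluster of fixed size tolerates fewer than one third faulty nodes rather than fewer than one half. Given a 1024-node cluster with known hardware, CFT may be sufficient.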

DQ-03: Deterministic Simulation Scope

Question: Which DistOS components must support deterministic simulation testing?

Options:

  • A. Core consensus and scheduling components only
  • B. All components that touch shared state
  • C. The entire DistOS software stack including GPU runtime interfaces

Implications: Option C provides the strongest testing guarantees but requires a simulation layer for every hardware interface including GPU APIs.

DQ-04: Jepsen-Equivalent Validation Requirement

Question: Must DistOS pass a Jepsen-equivalent analysis before the specification is finalised, or is simulation-based testing sufficient?

Options:

  • A. Simulation-based testing is sufficient at specification phase; Jepsen analysis deferred to implementation phase
  • B. A Jepsen-equivalent model-level analysis is required before specification acceptance
  • C. A full Jepsen-style test against a prototype implementation is required within the 14-day window

Implications: Option C is highly ambitious for a 14-day timeline. Option A defers significant risk to implementation.

DQ-05: Security Penetration Testing Depth

Question: What is the scope of security penetration testing for the DistOS security model?

Options:

  • A. Threat-model review and manual analysis only
  • B. Automated scanning plus manual review of authentication and authorisation boundaries
  • C. Full red-team exercise against the security model including side-channel analysis

Dependencies on Other Councils

| Council | Dependency Type | What Council-QA Needs |
|---|---|---|
| council-verify | Upstream | Formal specifications and invariants to test against; verified properties to validate in simulation |
| council-api | Upstream | Complete API surface definitions with semantics, pre/post conditions, and error contracts |
| council-sched | Upstream | Scheduler specification including claimed consistency and fairness properties |
| council-mem | Upstream | Memory manager specification including isolation guarantees and allocation invariants |
| council-sec | Upstream | Security model specification including trust boundaries and threat model |
| council-net | Upstream | Network layer specification including partition tolerance claims |
| council-fs | Upstream | Weka FS integration specification including consistency claims |
| council-synth | Bidirectional | Synthesis decisions that affect testability; council-qa feeds defects back to council-synth for resolution arbitration |
| council-arch | Downstream | Decision archaeology reads all council-qa defect reports and test results to narrate the QA story |
| council-docs | Downstream | Documentation council consumes test reports for the final specification document |

Critical dependency: Council-qa cannot begin property testing until council-verify delivers at least draft formal specifications. This creates a hard dependency on council-verify delivering Phase 2 outputs on schedule.


WHOOSH Configuration

council: council-qa
whoosh:
  formation:
    strategy: domain-partitioned
    # Each test domain forms a sub-team that operates semi-independently
    partitions:
      - name: chaos-partition
        roles: [chaos-engineer]
        size: 12
        coordination: async
      - name: fuzz-partition
        roles: [fuzz-operator]
        size: 10
        coordination: async
      - name: property-partition
        roles: [property-tester]
        size: 10
        coordination: sync-with-verify
      - name: byzantine-partition
        roles: [byzantine-simulator]
        size: 8
        coordination: async
      - name: gpu-partition
        roles: [gpu-fault-injector]
        size: 8
        coordination: async
      - name: simulation-partition
        roles: [simulation-engineer]
        size: 7
        coordination: sync
      - name: perf-partition
        roles: [performance-adversarialist]
        size: 5
        coordination: async

  quorum:
    defect-classification: 3/5  # Three roles must agree on P0/P1 classification
    test-acceptance: 4/7        # Four of seven role groups must sign off on test suite acceptance
    spec-block:                 # council-qa can block spec acceptance; requires:
      threshold: simple-majority
      escalation: council-synth # Disputes escalate to council-synth for arbitration

  subchannels:
    - id: defect-reports
      description: All discovered defects broadcast here for cross-council visibility
      subscribers: [council-verify, council-api, council-sched, council-mem, council-sec, council-synth, council-arch]
      retention: full-history
    - id: chaos-ops
      description: Real-time chaos experiment state
      subscribers: [council-qa, council-arch]
    - id: property-failures
      description: Property test failures routed to council-verify for spec review
      subscribers: [council-verify, council-synth, council-arch]
    - id: qa-acceptance
      description: Formal acceptance/rejection signals per deliverable
      subscribers: [council-synth, council-docs, council-arch]

  communication:
    internal: broadcast-to-partition
    cross-council: ucxl-addressed
    defect-escalation: immediate-broadcast

Success Criteria

A council-qa execution is considered successful when all of the following are met:

  1. Coverage: Every formal specification invariant produced by council-verify has at least one corresponding property test with documented pass/fail results.

  2. API surface: 100% of API endpoints defined by council-api have been subjected to structured fuzz testing with documented results.

  3. Chaos coverage: Jepsen-equivalent partition and failure scenarios have been executed against the DistOS scheduler, memory manager, and consensus layer models.

  4. Byzantine validation: Byzantine fault simulations have validated that the consensus protocol meets its claimed fault-tolerance bounds.

  5. GPU error coverage: All documented GPU error codes and conditions in the NVIDIA/AMD hardware specifications have corresponding DistOS response scenarios tested.

  6. Weka FS: All consistency claims in the DistOS filesystem integration specification have been tested against documented Weka FS behaviour.

  7. Defect resolution: All P0 defects are resolved. All P1 defects have formal mitigations accepted by the originating council and council-synth. All open defects are registered with UCXL-addressable evidence.

  8. Deterministic reproduction: All discovered failures can be reproduced deterministically via the simulation harness or a minimised test case.

  9. Archaeology readability: Council-arch has confirmed that the defect register and test reports produce comprehensible decision narratives.


Timeline Mapping

| Phase | Days | Council-QA Activities |
|---|---|---|
| Phase 1: Research | 1–3 | Study formal specification structure (coordination with council-verify); design adversarial test taxonomy; review Jepsen, FoundationDB simulation, and AFL++ methodologies; define property language for spec conformance tests |
| Phase 2: Architecture | 3–6 | Produce Adversarial Test Strategy; design chaos playbook structure; begin encoding specification properties as tests (as draft specs arrive from council-verify); establish defect classification schema; configure WHOOSH subchannels and defect broadcast |
| Phase 3: Formal Specification | 6–10 | Execute property tests against formal specs from council-verify; run initial fuzz campaigns against draft API definitions; execute Byzantine fault simulations; GPU error injection scenario validation; first Weka FS adversarial tests; publish defect reports to all councils |
| Phase 4: Integration | 10–12 | Full integration testing across all subsystem specifications; cross-subsystem fault injection (e.g., GPU error during Weka checkpoint during scheduler failover); race condition analysis; performance adversarial benchmarks; defect resolution verification |
| Phase 5: Documentation | 12–14 | Produce Final QA Acceptance Report; compile consolidated defect register; validate all P0/P1 defects are resolved or formally accepted; support council-docs in formatting test reports for the final specification document |

Council Design Brief v1.0 — DistOS Project — council-qa Generated: 2026-02-24