# Council Design Brief: Fault Tolerance, Consensus, and Recovery
**Council ID:** `council-fault`
**Mission:** Design the fault tolerance architecture for DistOS — encompassing failure detection, distributed consensus, checkpoint/recovery, and Byzantine resilience — such that the 1024-node cluster degrades gracefully under any realistic failure scenario and recovers to full operation without human intervention.
**UCXL Base Address:** `ucxl://council-fault:*@DistOS:fault-tolerance/*`
**Agent Count:** ~60
**Status:** Pre-formation (Constitution Phase)
**Created:** 2026-02-24
---
## 1. Scope and Responsibilities
Council-fault is responsible for every mechanism by which DistOS detects, tolerates, and recovers from failure. This spans the full failure taxonomy: transient bit errors, node crashes, NIC failures, rack-level power loss, network partitions, and Byzantine agent behaviour. The council does not own the network transport layer (that is council-net's domain) nor the memory model (council-mem), but it owns the protocols and state machines that run over those substrates to provide fault-tolerant guarantees.
### In Scope
- Failure detection algorithms and their parameterisation for a 1024-node cluster
- Consensus protocol selection, instantiation, and formal specification for each use case
- Checkpoint strategy design for long-running GPU workloads (distributed training, inference serving)
- Exactly-once and at-least-once delivery semantics at the OS level
- Split-brain prevention and partition healing
- Hot standby, warm standby, and cold recovery strategies and their trade-offs
- State machine replication design for DistOS control plane components
- Failure domain modelling: node, rack, network segment, availability zone
- Byzantine fault tolerance for the council/consensus layer of DistOS itself
- Graceful degradation policies: what the system continues to do under N simultaneous failures
- Recovery time objective (RTO) and recovery point objective (RPO) targets per workload class
### Out of Scope
- Physical network topology and RDMA configuration (council-net)
- Memory coherence protocols (council-mem)
- Scheduling decisions post-recovery (council-sched)
- Security properties of the consensus protocol (council-sec coordinates on this)
---
## 2. Research Domains
### 2.1 Failure Detection
The fundamental challenge is discriminating between a slow node and a dead one in a 1024-node, high-throughput RDMA environment. Conservative timeouts delay detection of genuinely failed nodes; aggressive timeouts cause false positives that trigger unnecessary failover storms.
**Key Papers and Systems:**
- Chandra & Toueg (1996), "Unreliable failure detectors for reliable distributed systems" — foundational impossibility and completeness/accuracy definitions; establishes the theoretical basis for eventual strong completeness
- Hayashibara et al. (2004), "The phi accrual failure detector" — Cassandra's production adoption shows phi accrual outperforms fixed-threshold detectors by adapting to network jitter; threshold calibration for 10 GbE Ethernet vs 400 Gb/s InfiniBand will differ substantially
- Das et al. (2002), "SWIM: Scalable Weakly-consistent Infection-style Process group Membership" — O(log N) dissemination cost makes SWIM the practical choice for 1000+ node clusters; memberlist (HashiCorp) and etcd both derive from this; DistOS should evaluate SWIM with the piggybacking optimisation and compare to phi accrual on the target InfiniBand fabric
- Gupta et al. (2001), "Scalable fault-tolerant management of inter-cluster gossip" — establishes that combining gossip with direct probing (as SWIM does) provides probabilistic completeness without the O(N²) cost of all-pairs heartbeating
**Open questions for research phase:** What is the optimal phi accrual window size given InfiniBand's sub-microsecond baseline latency and the potential for multi-second NVLink saturation events? Should DistOS use a separate out-of-band management network (BMC/IPMI) to disambiguate slow-GPU from dead-node?
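To make the phi accrual discussion concrete, the detector can be sketched in a few lines under a normal-distribution assumption on heartbeat inter-arrival times. The window size, variance floor, and any suspicion threshold below are placeholders the council would calibrate against measured InfiniBand jitter — this is not a DistOS API.

```python
import math
from collections import deque

class PhiAccrualDetector:
    """Minimal phi accrual sketch (Hayashibara et al. 2004).

    phi = -log10(P(no heartbeat for this long)), so phi = 1 means roughly a
    10% chance the node is alive, phi = 3 roughly 0.1%, and so on.
    """

    def __init__(self, window: int = 1000):
        self.intervals = deque(maxlen=window)  # recent inter-arrival times (s)
        self.last_heartbeat = None

    def heartbeat(self, now: float) -> None:
        """Record a heartbeat arrival at wall-clock time `now`."""
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now: float) -> float:
        """Graded suspicion level for the monitored node at time `now`."""
        if len(self.intervals) < 2 or self.last_heartbeat is None:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        var = sum((x - mean) ** 2 for x in self.intervals) / len(self.intervals)
        std = max(math.sqrt(var), 1e-9)  # variance floor is a placeholder
        elapsed = now - self.last_heartbeat
        # Tail probability under a normal model of inter-arrival times.
        p_later = 0.5 * math.erfc((elapsed - mean) / (std * math.sqrt(2)))
        return -math.log10(max(p_later, 1e-300))
```

The graded output is what makes phi accrual attractive for soft degradation: a scheduler can stop placing new work on a node at a low phi long before declaring it dead at a high one.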
### 2.2 Consensus Protocols
DistOS requires consensus at multiple granularities: cluster-wide for global control plane decisions, rack-local for fast scheduling, and per-workgroup for GPU collective operations.
**Key Papers and Systems:**
- Ongaro & Ousterhout (2014), "In Search of an Understandable Consensus Algorithm (Raft)" — Raft's understandability advantage over Multi-Paxos makes it the baseline for DistOS control plane; etcd (used in Kubernetes) provides a production-quality reference implementation; the council must assess whether Raft's leader bottleneck is acceptable at 1024-node scale
- Lamport (1998, 2001), "The Part-Time Parliament" and "Paxos Made Simple" — Multi-Paxos enables pipelining and multi-leader variants; the council should evaluate whether the complexity cost is justified for DistOS's write-heavy workload tracking
- Lamport et al. (2010), "Byzantizing Paxos by Refinement" — pathway from crash-fault-tolerant (CFT) to Byzantine-fault-tolerant (BFT) without redesigning the system from scratch
- Castro & Liskov (1999), "Practical Byzantine Fault Tolerance" (PBFT) — O(N²) message complexity makes vanilla PBFT unsuitable for 1024 nodes, but the security model and view-change protocol remain the reference for BFT design
- Yin et al. (2019), "HotStuff: BFT Consensus with Linear View Change" — reduces PBFT's O(N²) view-change cost to O(N) via threshold signatures; adopted in Diem/LibraBFT; viable for DistOS's security-sensitive control paths if the threat model includes Byzantine council agents
- Liskov & Cowling (2012), "Viewstamped Replication Revisited" — predates Raft but provides cleaner reconfiguration semantics; relevant when council-fault designs the cluster membership change protocol
- Howard et al. (2016), "Flexible Paxos: Quorum Intersection Revisited" — decouples write quorum from read quorum, enabling latency optimisation for read-heavy metadata workloads
- Chandra et al. (2007), "Paxos Made Live" — Google's engineering lessons deploying Paxos in Chubby; multi-master lease management, disk corruption handling, and operational complexity are directly relevant to DistOS
- Google Chubby (Burrows 2006) — coarse-grained lock service over Paxos; DistOS's distributed lock manager should evaluate this architecture
- Apache ZooKeeper / ZAB (Reed & Junqueira 2008), "A simple totally ordered broadcast protocol" — ZAB's primary-backup model with in-order delivery guarantees is simpler than Paxos for log replication; used in Kafka and HBase; council should assess fit for DistOS's scheduler event log
**Open questions for research phase:** Should DistOS use a single consensus cluster (simple, potential bottleneck) or hierarchical consensus (rack-local Raft groups coordinated by a meta-Raft group)? At what cluster size does the Raft leader become a write throughput bottleneck? What is the cost of HotStuff threshold signature aggregation on the cluster's CPU budget?
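Flexible Paxos's quorum-intersection condition is compact enough to state as code. This is a sketch of the safety check only, with hypothetical group sizes rather than a DistOS configuration:

```python
def quorums_intersect(n: int, q1_election: int, q2_replication: int) -> bool:
    """Flexible Paxos safety condition (Howard et al. 2016): every
    leader-election (phase-1) quorum must intersect every replication
    (phase-2) quorum, i.e. Q1 + Q2 > N. Classic majorities
    (Q1 = Q2 = N // 2 + 1) are just one valid point in this space."""
    return q1_election + q2_replication > n

# In a hypothetical 5-node metadata group, shrinking the replication
# quorum to 2 (cheaper steady-state commits) stays safe only if
# leader election contacts 4 nodes.
assert quorums_intersect(5, 4, 2)
assert not quorums_intersect(5, 3, 2)  # quorums could fail to intersect
```

The trade-off this exposes is exactly the one L-named in the bullet above: cheaper common-case writes are bought with rarer but more expensive leader elections, which is attractive when elections are infrequent.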
### 2.3 Checkpoint and Restart for GPU Workloads
Long-running distributed training jobs on 1024 GPUs represent the highest-value workloads; losing 100 hours of training to a single node failure is economically catastrophic. Checkpoint strategy must balance checkpoint overhead against recovery cost.
**Key Papers and Systems:**
- Vaswani et al. (2017), "Attention Is All You Need" — the transformer training workloads that DistOS must protect; model sizes (70B–1T+ parameters) determine checkpoint volume and frequency requirements
- Shoeybi et al. (2019), "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism" — Megatron-LM's distributed checkpoint format partitions model state across nodes; DistOS must understand this format to design transparent checkpointing that does not require application modification
- Rajbhandari et al. (2020), "ZeRO: Memory Optimizations Toward Training Trillion Parameter Models" — ZeRO-3 partitions optimizer state, gradients, and parameters across devices; recovery after a single node failure requires coordinated state reassembly from multiple surviving nodes
- Koo et al. (2020), "Elastic Training for BERT" — elastic distributed training with dynamic rank reconfiguration; motivates checkpoint formats that survive changes in node count
- Duell (2005), "The Design and Implementation of Berkeley Lab's Linux Checkpoint/Restart" and DMTCP (Ansel et al. 2009) — transparent process-level checkpointing; OS-level DistOS could provide DMTCP-style capabilities without application cooperation
- NVIDIA NCCL fault recovery — NCCL 2.x introduced limited fault recovery for collective operations; DistOS must integrate with or extend this for transparent GPU collective resilience
- Young (1974), "A first order approximation to the optimum checkpoint interval" — classic result giving the optimal checkpoint interval as sqrt(2 × checkpoint_cost × MTBF), where MTBF is the mean time between failures seen by the job; DistOS should instantiate this formula with empirically measured values for the target cluster
- Gupta et al. (2018), "Checkpointing and Recovery for Iterative Machine Learning" — shows that checkpoint frequency should adapt to loss curve stability; motivates adaptive checkpointing in DistOS
**Weka FS considerations:** Weka's parallel filesystem provides high-bandwidth checkpoint writes. The council must determine: (a) whether checkpoint data should bypass the page cache and write directly to Weka via O_DIRECT, (b) whether Weka's erasure coding provides sufficient durability so that a single Weka shard loss does not require checkpoint redundancy, and (c) optimal checkpoint object sizes given Weka's stripe geometry.
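Young's interval formula can be instantiated as a quick sizing sketch. Note that its second factor is the mean time between failures (MTBF) as observed by the whole job, not the repair time; all numbers below are illustrative, not measured cluster values.

```python
import math

def young_optimal_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young (1974) first-order optimum: T_opt = sqrt(2 * C * MTBF), where
    C is the time to write one checkpoint and MTBF is the mean time between
    failures for the job (per-node MTBF divided by node count)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

# Illustrative numbers: a 2-minute checkpoint write and a hypothetical
# 5-year per-node MTBF spread across 1024 nodes give a job-level MTBF of
# roughly 43 hours and an optimal interval of roughly 1.7 hours.
per_node_mtbf_s = 5 * 365 * 24 * 3600
job_mtbf_s = per_node_mtbf_s / 1024
interval_s = young_optimal_interval(120.0, job_mtbf_s)
```

A sketch like this is also how the adaptive-checkpointing idea from Gupta et al. (2018) would plug in: re-evaluate the formula as measured checkpoint cost and failure rate drift during a run.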
### 2.4 Exactly-Once and Delivery Semantics
**Key Papers and Systems:**
- Zaharia et al. (2013), "Discretized Streams: Fault-Tolerant Streaming Computation at Scale" — Spark Streaming's micro-batch approach achieves exactly-once by recomputing lost micro-batches from durable sources; relevant for DistOS's telemetry and event log pipelines
- Kreps et al. (2011), "Kafka: a Distributed Messaging System for Log Processing" — Kafka's log-centric broker design, later extended with idempotent producers that attach per-producer sequence numbers, provides exactly-once semantics at the messaging layer; DistOS's internal event bus should adopt equivalent semantics
- Akidau et al. (2015), "The Dataflow Model" — unified model for batch and streaming with watermarks; DistOS event processing should support watermark-based exactly-once semantics
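The idempotent-producer pattern referenced above reduces, on the consumer side, to a per-producer sequence-number dedup rule. The class and field names here are illustrative, not a DistOS event-bus API:

```python
class IdempotentConsumer:
    """Sketch: at-least-once delivery upgraded to effectively-once by
    suppressing redeliveries with per-producer sequence numbers."""

    def __init__(self):
        self.high_water = {}  # producer_id -> highest sequence applied
        self.applied = []     # stand-in for real side effects

    def deliver(self, producer_id: str, seq: int, payload) -> bool:
        """Apply a message once; return False for duplicate redeliveries."""
        if seq <= self.high_water.get(producer_id, -1):
            return False  # already applied: drop the retry
        self.high_water[producer_id] = seq
        self.applied.append(payload)
        return True
```

A production design would additionally reject sequence gaps to preserve ordering (as Kafka's idempotent producer does); this sketch only suppresses duplicates.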
### 2.5 Split-Brain Prevention
**Key Papers and Systems:**
- Bailis et al. (2013), "Highly Available Transactions: Virtues and Limitations" — characterises the exact split between operations achievable during partition (highly available) and those requiring coordination (partition-intolerant); DistOS must classify each operation type
- Brewer (2000), "Towards Robust Distributed Systems" (CAP theorem) — the council must explicitly document CP vs AP trade-offs for each DistOS subsystem; scheduling metadata can be AP, distributed locks must be CP
- Fonseca et al. (2007), "X-Trace: A Pervasive Network Tracing Framework" — causal tracing across partition boundaries enables forensic analysis of split-brain events post-recovery
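For the subsystems classified as CP, the standard split-brain guard is strict-majority fencing, which is a one-line rule (node counts below are illustrative):

```python
def may_serve_writes(visible_members: int, configured_members: int) -> bool:
    """A partition may keep serving CP operations (locks, leader leases)
    only if it sees a strict majority of the configured membership;
    minority partitions self-fence until the partition heals."""
    return 2 * visible_members > configured_members

# A 600/424 split of the 1024-node cluster: only the 600-node side continues.
assert may_serve_writes(600, 1024)
assert not may_serve_writes(424, 1024)
# A perfect 512/512 split fences BOTH sides — one argument for running
# odd-sized consensus groups layered over the even-sized cluster.
assert not may_serve_writes(512, 1024)
```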
### 2.6 State Machine Replication and Azure Service Fabric
- Terry et al. (2013), "Replicated Data Consistency Explained Through Baseball" — accessible model for multi-tier consistency; DistOS should define consistency levels analogous to this model
- Kakivaya et al. (2018), "Service Fabric: A Distributed Platform for Building Microservices in the Cloud" — Azure Service Fabric's health model, reliable collections, and actor model directly inform DistOS's control plane reliability design; particularly relevant is its failure domain-aware placement and automatic repair orchestration
---
## 3. Agent Roles
| Role | Count | Responsibilities |
|------|-------|-----------------|
| `lead-architect` | 2 | Cross-domain coherence, final decision arbitration, dependency liaison with council-net and council-mem |
| `failure-detection-specialist` | 6 | Phi accrual and SWIM protocol design, parameterisation for InfiniBand fabric, integration with out-of-band management |
| `consensus-engineer` | 10 | Raft instantiation for control plane, hierarchical consensus evaluation, quorum configuration, leader election timing analysis |
| `byzantine-resilience-analyst` | 6 | HotStuff evaluation, threat model for Byzantine agents, threshold signature overhead analysis, BFT protocol formal specification |
| `checkpoint-engineer` | 10 | GPU workload checkpoint design, Weka FS integration, DMTCP-style transparent checkpointing, checkpoint interval optimisation |
| `recovery-coordinator` | 6 | Hot/warm/cold standby design, RTO/RPO target definition, failure domain modelling, recovery orchestration state machine |
| `delivery-semantics-specialist` | 4 | Exactly-once guarantees for event bus, idempotency patterns, watermark-based processing |
| `partition-analyst` | 4 | CAP classification of DistOS subsystems, split-brain prevention protocols, partition healing procedures |
| `formal-spec-author` | 6 | TLA+ specifications for consensus and failure detection state machines, invariant definition, interface with council-verify |
| `integration-liaison` | 4 | Tracks dependency interfaces with council-net, council-mem, council-sched; attends synthesis sessions; produces interface contracts |
| `research-surveyor` | 2 | Literature monitoring, emerging results, coordination with external references |
**Total: 60 agents**
---
## 4. Key Deliverables
### Phase 1: Research (Days 1-3)
| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| Failure detection survey | `ucxl://council-fault:researcher@DistOS:fault-tolerance/*^/research/failure-detection-survey.md` | Comparative analysis of phi accrual vs SWIM on InfiniBand fabrics with parameterisation recommendations |
| Consensus protocol comparison | `ucxl://council-fault:researcher@DistOS:fault-tolerance/*^/research/consensus-protocol-comparison.md` | Raft vs Multi-Paxos vs HotStuff trade-off matrix for DistOS use cases |
| Checkpoint volume analysis | `ucxl://council-fault:researcher@DistOS:fault-tolerance/*^/research/checkpoint-volume-analysis.md` | Checkpoint sizes and write bandwidth requirements for 70B–1T parameter models on 1024 GPUs |
| Failure domain taxonomy | `ucxl://council-fault:researcher@DistOS:fault-tolerance/*^/research/failure-domain-taxonomy.md` | Classification of failure modes, estimated MTTF per domain, impact radius |
### Phase 2: Architecture (Days 3-6)
| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| Failure detection design | `ucxl://council-fault:lead-architect@DistOS:fault-tolerance/*^/architecture/failure-detection-design.md` | Selected algorithm, configuration parameters, false positive/negative rate targets |
| Consensus architecture | `ucxl://council-fault:consensus-engineer@DistOS:fault-tolerance/*^/architecture/consensus-architecture.md` | Hierarchical Raft topology, group sizing, election timeout bands |
| Checkpoint strategy | `ucxl://council-fault:checkpoint-engineer@DistOS:fault-tolerance/*^/architecture/checkpoint-strategy.md` | Checkpoint placement policy, Weka FS write path, adaptive interval algorithm |
| Recovery state machine | `ucxl://council-fault:recovery-coordinator@DistOS:fault-tolerance/*^/architecture/recovery-state-machine.md` | Node failure response sequence, standby promotion protocol, divergence resolution |
| DR-FT-001: Consensus protocol selection | `ucxl://council-fault:lead-architect@DistOS:fault-tolerance/*^/decisions/DR-FT-001-consensus-selection.md` | Formal decision record for consensus protocol choice |
| DR-FT-002: Checkpoint strategy | `ucxl://council-fault:lead-architect@DistOS:fault-tolerance/*^/decisions/DR-FT-002-checkpoint-strategy.md` | Formal decision record for checkpoint approach |
### Phase 3: Formal Specification (Days 6-10)
| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| Raft TLA+ specification | `ucxl://council-fault:formal-spec-author@DistOS:fault-tolerance/*^/specs/raft-consensus.tla` | TLA+ model of DistOS Raft variant with cluster membership changes |
| Failure detector TLA+ specification | `ucxl://council-fault:formal-spec-author@DistOS:fault-tolerance/*^/specs/failure-detector.tla` | Completeness and accuracy properties as TLA+ invariants |
| Checkpoint protocol specification | `ucxl://council-fault:formal-spec-author@DistOS:fault-tolerance/*^/specs/checkpoint-protocol.tla` | No-orphaned-data and consistent-recovery-point invariants |
| Recovery orchestration specification | `ucxl://council-fault:formal-spec-author@DistOS:fault-tolerance/*^/specs/recovery-orchestration.tla` | Progress liveness property: cluster always eventually recovers from N simultaneous failures |
| Exactly-once semantics specification | `ucxl://council-fault:delivery-semantics-specialist@DistOS:fault-tolerance/*^/specs/exactly-once-semantics.tla` | Idempotency and ordering invariants for DistOS event bus |
### Phase 4: Integration (Days 10-12)
| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| Network interface contract | `ucxl://council-fault:integration-liaison@DistOS:fault-tolerance/*^/integration/council-net-interface.md` | Agreed API surface with council-net: topology queries, link failure notifications |
| Memory interface contract | `ucxl://council-fault:integration-liaison@DistOS:fault-tolerance/*^/integration/council-mem-interface.md` | Agreed API surface with council-mem: checkpoint buffer allocation, durability guarantees |
| Scheduler interface contract | `ucxl://council-fault:integration-liaison@DistOS:fault-tolerance/*^/integration/council-sched-interface.md` | Agreed API surface with council-sched: workload evacuation on node failure, checkpoint placement hints |
| Cross-council conflict log | `ucxl://council-synth:synthesizer@DistOS:integration/*^/conflicts/council-fault-vs-council-sched.md` | Conflicts raised to council-synth for resolution |
### Phase 5: Documentation (Days 12-14)
| Deliverable | UCXL Address | Description |
|-------------|-------------|-------------|
| Fault tolerance reference specification | `ucxl://council-fault:lead-architect@DistOS:fault-tolerance/*^/docs/fault-tolerance-reference-spec.md` | Complete, self-contained specification of DistOS fault tolerance model |
| Operator runbook | `ucxl://council-fault:recovery-coordinator@DistOS:fault-tolerance/*^/docs/operator-runbook.md` | Step-by-step procedures for manual intervention in pathological failure scenarios |
| Decision archaeology summary | `ucxl://council-arch:narrator@DistOS:decision-history/*^/narratives/council-fault-design-narrative.md` | Human-readable narrative of how fault tolerance design decisions evolved |
---
## 5. Decision Points
The following architectural questions constitute the major decision points council-fault must resolve and record as Decision Records.
**DP-FT-01: Failure Detection Algorithm**
Should DistOS use phi accrual, SWIM with piggybacking, or a hybrid combining both? Phi accrual provides graded suspicion levels useful for soft degradation; SWIM provides O(log N) message complexity. Key consideration: the target InfiniBand fabric has sub-microsecond latency, which may make SWIM's indirect probing unnecessary. The out-of-band BMC/IPMI path provides a ground-truth oracle — should this be used to disambiguate, or does it introduce additional complexity and failure modes?
**DP-FT-02: Consensus Topology**
Should DistOS use a single global Raft cluster for all control plane decisions, or a two-tier topology with rack-local Raft groups coordinated by a meta-Raft group? Global Raft is simpler and easier to reason about but introduces a write bottleneck at scale and cross-rack latency on every commit. Hierarchical consensus improves write throughput and localises rack-level decisions but requires a reconfiguration protocol that correctly handles simultaneous rack and meta-group failures.
**DP-FT-03: Byzantine Fault Tolerance Scope**
Should DistOS employ BFT consensus for any layer, or assume crash-fault-tolerant (CFT) suffices? The threat model for a private GPU cluster operated by a single organisation likely does not include Byzantine node behaviour, but it may include Byzantine agent behaviour (a compromised CHORUS agent providing false telemetry). The council must precisely define the Byzantine assumption boundary and determine whether HotStuff is warranted for the agent coordination layer even if the underlying node layer uses CFT Raft.
**DP-FT-04: Checkpoint Placement and Ownership**
Should checkpoints be written by the application (Megatron-LM style, application-cooperative), by a transparent OS-level mechanism (DMTCP style, application-agnostic), or by a hybrid that provides OS-level infrastructure but application-supplied hints? Application-cooperative checkpointing achieves optimal consistency points (end-of-iteration) but requires all workloads to be ported; transparent checkpointing is universal but captures unnecessary state and misses semantic consistency boundaries.
**DP-FT-05: Checkpoint Storage Durability**
Given that Weka provides erasure-coded parallel storage, does DistOS need additional checkpoint redundancy (e.g., copying checkpoints to a second Weka namespace or to object storage), or is single-namespace checkpoint storage sufficient? This depends on whether Weka's erasure coding tolerates the same failure modes that trigger the checkpoint recovery (e.g., if a rack failure simultaneously takes down Weka shards and the compute nodes being recovered, single-namespace checkpoints may be unreadable).
**DP-FT-06: Exactly-Once Semantics Boundary**
Exactly-once semantics are expensive. Which DistOS subsystems require exactly-once delivery (candidate: distributed lock service, scheduler event log, billing/accounting events) versus at-least-once delivery with idempotent consumers (candidate: telemetry, health heartbeats, log events)? The council must produce a formal classification that council-sec and council-telemetry can ratify.
**DP-FT-07: Hot vs Warm vs Cold Standby**
For each category of DistOS component (leader node, metadata server, scheduler replica), what is the appropriate standby tier? Hot standby (immediate failover, high resource cost) is justified for the consensus leader; cold recovery (restart from checkpoint, potentially minutes of downtime) may be acceptable for less critical components. The council must define RTO and RPO targets per component class and map these to standby tier selection.
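One way to record the eventual outcome of DP-FT-07 is a direct mapping from per-component RTO targets to standby tiers. The cut-off values below are hypothetical placeholders for the council's decision, not figures taken from this brief:

```python
def standby_tier(rto_seconds: float) -> str:
    """Map an RTO target to a standby tier. The 5 s and 120 s boundaries
    are illustrative; the council defines the real ones per component class."""
    if rto_seconds <= 5:
        return "hot"   # live replica, immediate failover, highest cost
    if rto_seconds <= 120:
        return "warm"  # provisioned standby that must load state
    return "cold"      # restart from checkpoint, minutes of downtime

assert standby_tier(1) == "hot"     # e.g. consensus leader
assert standby_tier(600) == "cold"  # e.g. a less critical replica
```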
**DP-FT-08: Failure Domain Granularity**
What failure domains does DistOS model, and what is the maximum tolerated simultaneous failure count per domain? Candidate answer: tolerate 1 rack (32 nodes) simultaneously failing without workload loss (checkpoints survive on other racks), tolerate up to 16 individual node failures per hour without RTO violation. The council must derive these numbers from MTTF data for the target hardware and the RTO/RPO targets.
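The candidate answer for DP-FT-08 implies a placement invariant that is easy to state mechanically. Rack numbering and replica sets here are illustrative:

```python
def placement_ok(replica_racks: set) -> bool:
    """Candidate DP-FT-08 invariant: a checkpoint's replicas (or its
    erasure-coded reconstruction set) must span at least two racks, so
    losing any single 32-node rack leaves a readable copy."""
    return len(replica_racks) >= 2

def survives(replica_racks: set, failed_racks: set) -> bool:
    """The checkpoint stays readable iff some replica sits outside
    every failed rack."""
    return bool(replica_racks - failed_racks)

assert placement_ok({0, 5}) and survives({0, 5}, {0})
assert not survives({0, 5}, {0, 5})  # two-rack loss exceeds the stated budget
```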
---
## 6. Dependencies on Other Councils
### council-net (Network Stack)
Council-fault requires network topology information to configure failure detector parameters correctly. The phi accrual window must account for cross-rack latency variability. SWIM's indirect probing path must be aware of network partitions. Specifically needed:
- Link-level latency distribution (P50, P99, P99.9) for same-rack vs cross-rack communication
- Notification API for link-level failure events (to accelerate failure detection beyond gossip convergence time)
- Details of the RDMA fabric topology so failure domains align with physical network segments
**Interface artifact:** `ucxl://council-net:lead-architect@DistOS:network/*^/integration/topology-api-for-council-fault.md`
### council-mem (Distributed Memory)
Checkpoint design depends critically on the memory model. Council-fault needs:
- Memory allocation primitives that can pin GPU memory buffers for checkpoint capture without disrupting running workloads
- Durability guarantees from the Weka FS integration layer: what consistency model applies to checkpoint writes?
- Information on NVLink/NVSwitch topology to understand whether GPU-to-GPU direct checkpoint transfers are viable without CPU involvement
**Interface artifact:** `ucxl://council-mem:lead-architect@DistOS:memory/*^/integration/checkpoint-primitives-for-council-fault.md`
### council-sched (Process Scheduling)
Failure recovery intersects with scheduling at multiple points. Council-fault needs:
- Workload evacuation notification API: when council-fault declares a node dead, council-sched must be notified to reschedule affected workloads
- Checkpoint placement hints: council-sched knows which nodes have GPU memory pressure and can advise checkpoint placement to avoid I/O bottlenecks
- Recovery priority policy: when multiple workloads are recovering simultaneously, council-sched provides the priority ordering
**Interface artifact:** `ucxl://council-sched:lead-architect@DistOS:scheduling/*^/integration/recovery-coordination-api.md`
### council-sec (Security Model)
Council-sec provides constraints on the consensus protocol design. Specifically:
- Cryptographic signing requirements for Raft log entries (are log entries signed, or is transport-level authentication sufficient?)
- Key management for hot standby nodes that must be pre-provisioned with credentials before a failure occurs
- Audit requirements for failover events (every leader election and node eviction must generate a signed audit record)
Council-sec is a consumer of council-fault's decisions (it must enforce security properties on fault-tolerant paths), but it is also a constraint provider.
### council-synth (Inter-Council Synthesis)
Any conflict between council-fault's requirements and other councils' designs is escalated to council-synth. Likely conflict areas:
- Checkpoint overhead vs scheduling latency: if checkpoint I/O degrades scheduling responsiveness, council-synth arbitrates
- Consensus quorum configuration vs network partition tolerance: council-net's topology design may make certain quorum configurations unreachable under valid partition scenarios
---
## 7. WHOOSH Configuration
### Team Formation
```json
{
  "council_id": "council-fault",
  "team_topic": "whoosh.team.distos-council-fault",
  "composition": {
    "lead-architect": 2,
    "failure-detection-specialist": 6,
    "consensus-engineer": 10,
    "byzantine-resilience-analyst": 6,
    "checkpoint-engineer": 10,
    "recovery-coordinator": 6,
    "delivery-semantics-specialist": 4,
    "partition-analyst": 4,
    "formal-spec-author": 6,
    "integration-liaison": 4,
    "research-surveyor": 2
  },
  "total_agents": 60,
  "quorum_policy": {
    "artifact_publication": "simple_majority",
    "architecture_decision": "two_thirds_supermajority",
    "formal_spec_ratification": "lead_architect_plus_two_thirds",
    "dependency_interface_agreement": "all_integration_liaisons_plus_one_lead"
  },
  "join_timeout_minutes": 30,
  "inactivity_eviction_minutes": 120
}
```
### Subchannels
| Subchannel | Topic Suffix | Purpose |
|-----------|-------------|---------|
| Control | `.control` | Role assignments, join/leave events, phase transitions |
| Research | `.research` | Paper sharing, survey coordination, literature discussion |
| Architecture | `.architecture` | Design proposals, trade-off debates, decision drafting |
| Formal Spec | `.formal-spec` | TLA+ review, invariant discussion, council-verify liaison |
| Integration | `.integration` | Cross-council interface negotiation, dependency tracking |
| Voting | `.voting` | Quorum votes on decision records and artifact ratification |
| Artifacts | `.artifacts` | UCXL artifact announcement references |
### Quorum Configuration
The consensus engineering team within council-fault operates under the same principles it is designing. A minimum of 2/3 of active agents must be reachable for architecture decisions to be recorded. If council-fault itself experiences a partition, the larger partition shard continues and the smaller shard's work is merged post-healing (following the same eventual consistency model the council advocates for DistOS data paths where CP is not required).
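The quorum rules can be expressed as a vote-counting check against the policy names in the WHOOSH configuration. The ceiling-based rounding for the supermajority is an assumption, since the brief does not pin down rounding behaviour:

```python
import math

def quorum_met(votes_for: int, active_agents: int, policy: str) -> bool:
    """Evaluate a vote against the council's quorum_policy names.
    Rounding up for the 2/3 supermajority is an assumption."""
    if policy == "simple_majority":
        return 2 * votes_for > active_agents
    if policy == "two_thirds_supermajority":
        return votes_for >= math.ceil(2 * active_agents / 3)
    raise ValueError(f"unknown policy: {policy}")

# With all 60 agents active, an architecture decision needs 40 votes.
assert quorum_met(40, 60, "two_thirds_supermajority")
assert not quorum_met(39, 60, "two_thirds_supermajority")
```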
---
## 8. Success Criteria
1. **Failure detection coverage:** A formal characterisation of phi accrual or SWIM convergence time under the target InfiniBand fabric, with worst-case bounds proven analytically and validated against simulation
2. **Consensus specification completeness:** A TLA+ specification for the chosen consensus protocol that model-checks with no invariant violations under the TLC model checker for clusters up to N=7 (as a representative abstraction), with safety and liveness properties explicitly stated
3. **Checkpoint strategy viability:** A checkpoint overhead analysis showing that optimal checkpoint interval yields less than 5% throughput degradation on a representative transformer training workload
4. **Recovery time verification:** Formal proof that the recovery state machine terminates (liveness) and does not enter a state where a healthy subset of nodes is incorrectly marked failed (safety)
5. **Exactly-once classification:** A complete table classifying every DistOS internal message type as exactly-once, at-least-once, or best-effort, with rationale
6. **Interface contracts ratified:** All three dependency interface contracts (council-net, council-mem, council-sched) are signed off by both parties and published to UCXL
7. **Decision record completeness:** All 8 decision points have corresponding Decision Records with alternatives considered, evidence cited, and rationale documented
8. **Byzantine scope defined:** A written threat model defining the Byzantine assumption boundary, ratified by council-sec
---
## 9. Timeline
### Phase 1: Research (Days 1-3)
- Day 1: Activate council, assign roles, distribute research domains; research-surveyors produce initial bibliography
- Day 2: Failure detection specialists produce phi accrual vs SWIM comparison; checkpoint engineers survey Megatron-LM and DMTCP checkpoint formats; consensus engineers survey Raft, Multi-Paxos, and HotStuff
- Day 3: All research artifacts published to UCXL; cross-domain synthesis discussion; research phase closes with a prioritised list of open questions entering the architecture phase
### Phase 2: Architecture (Days 3-6)
- Day 3-4: Failure detection design and consensus architecture drafted; initial decision record drafts for DP-FT-01 and DP-FT-02 circulated for comment
- Day 4-5: Checkpoint strategy and recovery state machine designed; integration liaisons begin negotiating interface contracts with council-net and council-mem; DR-FT-01 through DR-FT-04 voted on
- Day 5-6: Remaining decision records voted on; architecture artifacts finalised and published; architecture phase closes with all DR-FT-* in accepted or deferred state
### Phase 3: Formal Specification (Days 6-10)
- Day 6-7: Raft TLA+ specification and failure detector TLA+ specification drafted; council-verify liaison established
- Day 7-8: Checkpoint protocol specification and exactly-once semantics specification drafted; council-verify begins model checking on Raft spec
- Day 8-9: Recovery orchestration specification drafted; first model-checking results from council-verify inform spec refinements
- Day 9-10: All TLA+ specifications in review; formal spec phase closes with at least two specs passing model checking with no counterexamples
### Phase 4: Integration (Days 10-12)
- Day 10-11: Interface contracts with council-net, council-mem, and council-sched finalised; any conflicts escalated to council-synth
- Day 11-12: council-synth resolution of any conflicts; final integration artifacts published; cross-council review of security constraints from council-sec
### Phase 5: Documentation (Days 12-14)
- Day 12-13: Fault tolerance reference specification assembled from architecture and formal spec artifacts
- Day 13-14: Operator runbook written; council-arch produces decision archaeology narrative; final UCXL navigability audit confirms all artifact addresses resolve correctly