# Council Design Brief: Distributed Memory
**Council ID:** `council-mem`

**Mission:** Design the distributed memory model for DistOS, encompassing the tiered memory hierarchy (HBM3 → DDR5 → NVMe → Weka), Weka parallel filesystem integration, cache coherence at cluster scale, GPU unified memory and managed memory policies, memory-mapped I/O over Weka, page migration, and memory pressure handling across a 1024-node Hopper/Grace/Blackwell cluster.

**UCXL Base Address:** `ucxl://council-mem:*@DistOS:memory/*`

**Agent Count:** 80

**Status:** Constitution Phase — awaiting WHOOSH formation trigger

**Created:** 2026-02-24

---
## 1. Scope and Responsibilities
`council-mem` owns the complete specification of the DistOS memory subsystem. Scope boundaries are defined as follows.

**In scope:**

- Distributed shared memory (DSM) model versus message-passing model decision and formal specification
- Weka WekaFS integration: POSIX semantics, parallel I/O patterns, consistency model at cluster scale, WEKA client mount configuration for GPU nodes
- NVLink/NVSwitch memory fabric on Hopper: peer-to-peer GPU memory access, NVLink memory copy engines, NVSwitch all-to-all bandwidth sharing
- Grace Superchip unified memory: NVLink-C2C coherent memory, CPU-GPU unified virtual address space, cache coherence between Arm Neoverse V2 and H100 L2/L3
- Cache coherence protocols at cluster scale: directory-based coherence, home node placement, false sharing avoidance across the NVLink fabric
- GPU memory management: CUDA unified memory (UM), managed memory, explicit cudaMemcpy vs implicit page migration, GPUDirect RDMA zero-copy pathways
- Memory-mapped file I/O over Weka: mmap semantics for GPU-accessible WekaFS files, page fault handling, demand paging from Weka to GPU HBM3
- Tiered storage hierarchy: HBM3 (80 GB/GPU on H100 SXM5) → DDR5 (GH200 node LPDDR5X) → NVMe (local SSD) → Weka (parallel FS, PB-scale)
- Page migration policies: when to migrate pages between tiers, migration bandwidth management, and NUMA migration cost models
- Memory pressure handling: OOM prevention, demand-based eviction, balloon device analogue for GPU memory, cooperative memory release protocols
- Memory isolation and address space layout for multi-tenant workloads
- Formal specification of the DistOS virtual memory interface exposed to the scheduler and to user processes

**Out of scope (delegated):**

- Physical network transport for RDMA (delegated to `council-net`; RDMA registration interfaces consumed)
- Scheduling decisions about which workloads run on which GPUs (delegated to `council-sched`; placement decisions consumed as inputs)
- Security isolation primitives at the hardware level (delegated to `council-sec`; IOMMU and capability constraints consumed)
- Resource metering and HBM3 quota enforcement (delegated to `council-telemetry`; metering events emitted as outputs)

---
## 2. Research Domains
### 2.1 Distributed Shared Memory vs. Message Passing

Evaluate DSM systems (software-managed global address space) against message-passing models (MPI, NCCL, UCX) for the primary inter-node memory model of DistOS. The cluster's NVLink/NVSwitch fabric makes intra-NVLink-domain DSM feasible at low latency, while cross-domain communication involves InfiniBand/RoCE.

Key materials:

- Li and Hudak, "Memory Coherence in Shared Virtual Memory Systems" (ACM TOCS 1989) — foundational DSM coherence analysis
- Bal et al., "Orca: A Flat Object-Based Distributed Shared Memory System" — hybrid DSM design
- PGAS language models: UPC++ (Berkeley UPC++ team), OpenSHMEM (OpenSHMEM 1.5 spec), Chapel locale model (Cray/HPE)
- Bachan et al., "UPC++: A High-Performance Communication Framework for Asynchronous Computation" (IPDPS 2019) — modern PGAS for GPU clusters
- Hoefler et al., "MPI+MPI: A New Hybrid Approach to Parallel Programming with MPI Plus Shared Memory" (Computing 2013) — hybrid model analysis
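As a first-order framing of this trade-off, a back-of-envelope transfer-time model compares page-fault-driven DSM access against a single explicit bulk message. All constants here (page granularity, latencies, bandwidth) are illustrative assumptions, not measurements:

```python
# Back-of-envelope model: DSM page-fault transfer vs. one explicit message.
# All constants are illustrative assumptions, not measured values.

PAGE_BYTES = 64 * 1024            # assumed DSM page granularity (64 KiB)

def dsm_time_s(total_bytes, fault_latency_s, bw_bytes_per_s):
    """Total time if every page is fetched via a remote page fault."""
    pages = -(-total_bytes // PAGE_BYTES)   # ceil division
    return pages * fault_latency_s + total_bytes / bw_bytes_per_s

def message_time_s(total_bytes, msg_latency_s, bw_bytes_per_s):
    """Total time for one explicit bulk transfer (e.g. MPI/NCCL send)."""
    return msg_latency_s + total_bytes / bw_bytes_per_s

# 1 GiB over an NVLink-class link (~450 GB/s per direction, assumed)
size = 1 << 30
dsm = dsm_time_s(size, fault_latency_s=5e-6, bw_bytes_per_s=450e9)
msg = message_time_s(size, msg_latency_s=10e-6, bw_bytes_per_s=450e9)
print(f"DSM: {dsm*1e3:.2f} ms, message: {msg*1e3:.2f} ms")
```

Under these assumed numbers the per-page fault latency dominates bulk transfers by more than an order of magnitude, which is one argument for the hybrid model: DSM for fine-grained sharing inside an NVLink domain, explicit messaging for bulk cross-domain movement.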
### 2.2 Weka Data Platform and WekaFS

Weka is the parallel filesystem deployed on this cluster. WekaFS provides POSIX-compliant parallel I/O with client-side caching and a distributed metadata architecture. Survey the WekaFS client protocol, consistency model, and performance characteristics for GPU workload I/O patterns.

Key materials:

- Weka Data Platform documentation (v4.x) — WekaFS client, mount options (`-o wekafs`, cache policy `readcache`, `writecache`, `coherent`), tiering configuration
- Weka technical whitepaper: "Weka and NVIDIA GPUDirect Storage Integration" — zero-copy path from Weka to GPU HBM3 via GDS
- Weka S3 and NFS interoperability guide — multi-protocol access patterns relevant to multi-tenant workloads
- POSIX consistency semantics under parallel access — close-to-open consistency vs. strict POSIX; implications for checkpoint/restart workflows
- Bent et al., "PLFS: A Checkpoint Filesystem for Parallel Applications" (SC 2009) — N-to-1 and N-to-N checkpoint patterns relevant to Weka I/O design
- Lofstead et al., "Flexible IO and Integration for Scientific Codes through the Adaptable IO System (ADIOS)" — parallel I/O pattern survey
### 2.3 NVIDIA Magnum IO and GPUDirect

NVIDIA Magnum IO is the umbrella framework for GPU-optimised I/O. It encompasses GPUDirect RDMA (peer-to-peer GPU memory over InfiniBand), GPUDirect Storage (GDS, a direct path from NVMe/Weka to GPU HBM3 bypassing host DRAM), and NCCL (collective communication over NVLink/IB).

Key materials:

- NVIDIA GPUDirect RDMA documentation — `nvidia_p2p_get_pages`, DMA mapping API, peer-to-peer registration requirements
- NVIDIA GPUDirect Storage documentation — `cuFile` API, `cuFileRead`, `cuFileWrite`, alignment requirements, Weka GDS driver (`libwekafs-gds`)
- Shainer et al., "The Development of Mellanox/NVIDIA GPUDirect over InfiniBand — A New Model for GPU to GPU Communications" (Computer Science – Research and Development, 2011) — foundational GPUDirect RDMA paper
- Barroso et al., "The Datacenter as a Computer" (3rd edition) — memory hierarchy cost model relevant to tiered storage trade-offs
### 2.4 NVLink and NVSwitch Memory Fabric

Hopper H100 SXM5 nodes are connected via NVLink 4.0 within a node and across NVSwitch fabrics in the NVLink Switch System (as deployed in DGX SuperPOD architectures). Understand the addressing model, bandwidth characteristics, and coherence semantics of the NVLink fabric.

Key materials:

- NVIDIA H100 SXM5 NVLink 4.0 specification — 900 GB/s bidirectional aggregate per GPU, NVSwitch all-reduce hardware acceleration
- NVIDIA NVLink Switch System architecture documentation
- Foley and Danskin, "Ultra-Performance Pascal GPU and NVLink Interconnect" (IEEE Micro 2017) — NVLink bandwidth utilisation analysis
- Choquette et al., "NVIDIA A100 Tensor Core GPU: Performance and Innovation" (IEEE Micro 2021) — NVLink 3.0 predecessor; bandwidth and addressing model basis for NVLink 4.0
- NVIDIA SHARP (Scalable Hierarchical Aggregation and Reduction Protocol) documentation — in-network compute for allreduce over NVSwitch
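These bandwidth figures feed directly into collective-latency estimates. A minimal ring-allreduce model illustrates the fabric's role; the per-direction link bandwidth used below is an assumed placeholder, and real NCCL performance depends on topology and protocol selection:

```python
# Ring-allreduce bandwidth model over an NVLink/NVSwitch domain.
# The bandwidth figure is an illustrative assumption, not a benchmark.

def ring_allreduce_time_s(n_gpus, payload_bytes, link_bw_bytes_per_s):
    """Classic ring allreduce moves 2*(N-1)/N of the payload per GPU."""
    traffic = 2 * (n_gpus - 1) / n_gpus * payload_bytes
    return traffic / link_bw_bytes_per_s

# 8 GPUs, 1 GiB gradient buffer, assumed 450 GB/s per-direction NVLink
t = ring_allreduce_time_s(8, 1 << 30, 450e9)
print(f"~{t*1e3:.2f} ms")
```

In-network reduction via SHARP on NVSwitch can reduce the traffic term below the ring bound, which is why the fabric's hardware allreduce matters for the bandwidth contention analysis in DP-MEM-006.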
### 2.5 Grace Superchip Unified Memory (NVLink-C2C)

The GH200 Grace Superchip connects the Arm Neoverse V2 CPU die and the H100 GPU die via NVLink-C2C at 900 GB/s with coherent memory semantics. The CPU can access GPU HBM3 directly, and vice versa. This creates a unified virtual address space that fundamentally changes GPU memory management assumptions.

Key materials:

- NVIDIA GH200 Grace Hopper Superchip Architecture Whitepaper (2023) — NVLink-C2C bandwidth, cache coherence protocol, unified memory semantics, 480 GB LPDDR5X + 96 GB HBM3e
- NVIDIA Unified Memory for CUDA documentation — UM overview, page migration engine, `cudaMemPrefetchAsync`, `cudaMemAdvise`
- Ausavarungnirun et al., "Exploiting Inter-Warp Heterogeneity to Improve GPU Performance" — GPU memory access heterogeneity relevant to migration policy design
- Ganguly et al., "Interconnect-Aware Memory Management for GPU Architectures" (ISCA 2019) — NVLink-aware page placement analysis
- Li et al., "Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect" (IEEE TPDS 2020)
### 2.6 Cache Coherence at Cluster Scale

For DistOS to support a distributed shared memory model (even within an NVLink domain), a cache coherence protocol must be specified. Study directory-based protocols, their scalability to 1024-node clusters, and the latency characteristics of coherence traffic over InfiniBand.

Key materials:

- Censier and Feautrier, "A New Solution to Coherence Problems in Multicache Systems" (IEEE Trans. Computers 1978) — directory-based coherence origin
- Lenoski et al., "The Stanford DASH Multiprocessor" (IEEE Computer 1992) — scalable directory coherence at large scale
- Cray Chapel locale model documentation — distributed memory with locality-aware access
- Intel Optane DC Persistent Memory documentation — cache-coherent byte-addressable storage relevant to persistence model
- Moscibroda and Mutlu, "Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems" (USENIX Security 2007) — coherence traffic as interference vector (adversarial relevance)
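To make the directory mechanism concrete, here is a toy single-line MSI directory in the spirit of the DASH-style protocols above. It models only the bookkeeping a home node would keep per cache line; the three-state protocol and node identifiers are illustrative simplifications:

```python
# Minimal directory-based MSI coherence sketch for one cache line.
# A toy model of home-node bookkeeping; states and IDs are illustrative.

class DirectoryLine:
    def __init__(self):
        self.state = "I"           # I (invalid), S (shared), M (modified)
        self.sharers = set()       # node IDs holding a copy
        self.owner = None          # exclusive owner when state == "M"

    def read(self, node):
        if self.state == "M":
            # Downgrade the owner: it must write back before sharing.
            self.sharers = {self.owner, node}
            self.owner = None
        else:
            self.sharers.add(node)
        self.state = "S"

    def write(self, node):
        # Invalidate all other copies, grant exclusive ownership.
        invalidations = self.sharers - {node}
        self.sharers = set()
        self.owner = node
        self.state = "M"
        return invalidations   # messages the home node must send

line = DirectoryLine()
line.read(1); line.read(2)
inv = line.write(3)
print(line.state, sorted(inv))   # M [1, 2]
```

The size of the returned invalidation set is exactly the coherence traffic a write generates, which is what limits directory scalability at 1024 nodes and motivates the false-sharing analysis.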
### 2.7 GPU Memory Management: Unified Memory and Managed Memory

Survey CUDA unified memory (automatic page migration between CPU DRAM and GPU HBM3), managed memory (allocated explicitly via `cudaMallocManaged`), and the interaction between these mechanisms and the OS page fault handler.

Key materials:

- Landaverde et al., "An Investigation of Unified Memory Access Performance in CUDA" (IEEE HPEC 2014)
- Choukse et al., "Buddy Compression: Enabling Larger Memory for Deep Learning and HPC Workloads on GPUs" (ISCA 2020) — GPU memory compression as oversubscription strategy
- Rhu et al., "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design" (MICRO 2016) — activation checkpointing for memory oversubscription
- NVIDIA documentation on `cudaMemAdvise` hints — `cudaMemAdviseSetPreferredLocation`, `cudaMemAdviseSetAccessedBy` — hints to the migration engine
- AMD ROCm HSA (Heterogeneous System Architecture) documentation — `hsa_amd_memory_pool_t`, coarse-grained and fine-grained memory pool semantics
### 2.8 Tiered Storage and Page Migration Policies

Design the page migration engine for the 4-tier hierarchy: HBM3 (hot, ~80 GB) → DDR5/LPDDR5X (warm, ~480 GB on GH200) → NVMe (cool, 1-10 TB per node) → Weka (cold, PB-scale). Define migration triggers, bandwidth budgets, and admission control to prevent migration storms.

Key materials:

- Lagar-Cavilla et al., "SnowFlock: Rapid Virtual Machine Cloning for Cloud Computing" (EuroSys 2009) — demand paging strategies
- Yan et al., "Nimble Page Management for Tiered Memory Systems" (ASPLOS 2019) — tiered DRAM page migration policies; directly applicable to HBM3/DDR5 tiering
- NVIDIA NVMe Direct documentation — NVMe namespace affinity for GPU workloads
- Linux heterogeneous memory management (HMM) — `hmm_range_fault`, `migrate_vma`, mmap-based GPU page fault delegation
- Intel PMDK (Persistent Memory Development Kit) — tiered memory management patterns adaptable to GPU tiering
- Agarwal et al., "Thermostat: Application-Transparent Page Management for Two-Tiered Main Memory" (ASPLOS 2017) — application-transparent hotness tracking
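A minimal sketch of the counter-based promotion/demotion idea from Nimble and Thermostat, reduced to the two hottest tiers. The thresholds and the per-epoch counter model are assumptions for illustration only:

```python
# Hotness-counter tiering sketch: promote pages whose per-epoch access
# count crosses a threshold, demote cold resident pages. Thresholds and
# the two-tier (HBM/DDR) reduction are illustrative assumptions.

PROMOTE_THRESHOLD = 4    # accesses per epoch to earn promotion to HBM
DEMOTE_THRESHOLD = 1     # pages at or below this are demoted to DDR

def retier(access_counts, tier_of):
    """Return {page: new_tier} for pages that should move this epoch."""
    moves = {}
    for page, count in access_counts.items():
        tier = tier_of[page]
        if tier == "DDR" and count >= PROMOTE_THRESHOLD:
            moves[page] = "HBM"
        elif tier == "HBM" and count <= DEMOTE_THRESHOLD:
            moves[page] = "DDR"
    return moves

counts = {"a": 9, "b": 0, "c": 2}
tiers = {"a": "DDR", "b": "HBM", "c": "DDR"}
print(retier(counts, tiers))   # {'a': 'HBM', 'b': 'DDR'}
```

The full engine would run this per epoch per tier pair, with the admission control and bandwidth budget (see DP-MEM-006) gating how many of the returned moves actually execute.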
### 2.9 Memory Pressure Handling and OOM Prevention

At 1024 nodes, memory pressure events will be frequent. Design a cooperative memory pressure protocol in which processes can voluntarily release memory in response to pressure signals, an eviction hierarchy, and an OOM prevention mechanism that avoids hard OOM kills in favour of graceful degradation.

Key materials:

- Linux kernel memory management documentation — `/proc/pressure/memory` (PSI, Pressure Stall Information), cgroup `memory.pressure`, OOM killer heuristics
- Guo et al., "CrystalBall: Statistically-Informed, Co-locating, Workload Placement for Warehouse-Scale Systems" — memory pressure prediction
- Wangni et al., "Gradient Sparsification for Communication-Efficient Distributed Optimization" — model-level memory reduction under pressure; relevant to adaptive workload response
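One way to prototype the escalation logic of such a protocol is a threshold ladder driven by a PSI-style utilisation reading. The thresholds and action names below are purely illustrative assumptions:

```python
# Sketch of a memory-pressure escalation ladder driven by a PSI-like
# reading of HBM utilisation. Thresholds and actions are assumptions.

ACTIONS = [
    (0.60, "request_cooperative_release"),
    (0.75, "demote_cold_pages"),
    (0.90, "checkpoint_low_priority_jobs"),
    (0.97, "terminate_lowest_priority_job"),
]

def pressure_response(hbm_used_fraction):
    """Return every action whose threshold the current pressure crosses."""
    return [name for threshold, name in ACTIONS
            if hbm_used_fraction >= threshold]

print(pressure_response(0.80))
# ['request_cooperative_release', 'demote_cold_pages']
```

A real protocol would also rate-limit escalation and track which processes honoured cooperative-release requests, so that the harsher rungs are reached only when cooperation fails.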
---

## 3. Agent Roles

Total agents: **80**

| Role | Count | Responsibilities |
|------|-------|-----------------|
| Lead Architect | 2 | Distributed memory architecture decisions, DSM vs. message-passing model choice, cross-council memory interface ownership |
| Weka Integration Specialists | 8 | WekaFS client protocol, POSIX semantics, GDS integration, checkpoint I/O patterns |
| GPU Memory Researchers | 8 | CUDA Unified Memory, managed memory, page migration engine, HBM3 characteristics |
| NVLink/NVSwitch Specialists | 6 | NVLink fabric addressing, NVSwitch bandwidth model, peer-to-peer GPU memory semantics |
| Grace Superchip Specialists | 5 | NVLink-C2C coherence, GH200 unified address space, CPU-GPU shared memory model |
| Cache Coherence Researchers | 5 | Directory-based coherence protocols, coherence at cluster scale, false sharing analysis |
| Tiering and Migration Researchers | 6 | Page migration policies, tier promotion/demotion triggers, bandwidth budgeting |
| Memory Pressure Specialists | 4 | OOM prevention, cooperative release protocols, pressure signal design |
| PGAS/DSM Language Researchers | 4 | UPC++, OpenSHMEM, Chapel locale model analysis |
| Formal Specification Authors | 8 | TLA+ specification of memory model state machine, coherence invariants, migration protocol |
| Architects (sub-component) | 8 | Concrete architecture proposals for each memory subsystem component |
| Internal Reviewers | 7 | Review research and architecture proposals; green/yellow/red vote casting |
| Integration Liaisons | 5 | Interface with `council-sched`, `council-net`, `council-sec`, `council-telemetry` |
| Decision Record Authors | 5 | Author DRs for all decision points; maintain UCXL provenance chain |
| Adversarial Critics | 4 | Surface memory safety violations, coherence anomalies, migration storm scenarios |

**Role distribution rationale:** Weka integration is a high-priority domain given the cluster's reliance on WekaFS; 8 specialists are assigned here. GPU memory (unified memory, managed memory) receives 8 researchers reflecting the complexity of CUDA memory semantics. NVLink-C2C and the GH200 model are treated as distinct specialisms given their novel architecture (11 agents combined).

---
## 4. Key Deliverables

### 4.1 Research Summaries

```
ucxl://council-mem:researcher@DistOS:memory/*^/research/dsm-vs-message-passing.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/weka-wekafs-integration.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/magnum-io-gpudirect.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/nvlink-nvswitch-fabric.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/grace-superchip-unified-memory.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/cache-coherence-cluster-scale.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/gpu-memory-management.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/tiered-storage-page-migration.md
ucxl://council-mem:researcher@DistOS:memory/*^/research/memory-pressure-handling.md
```

### 4.2 Architecture Proposals

```
ucxl://council-mem:architect@DistOS:memory/*^/architecture/memory-model-overview.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/weka-gds-integration-design.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/nvlink-domain-addressing.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/grace-c2c-memory-model.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/coherence-protocol-design.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/tiered-hierarchy-design.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/page-migration-engine.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/memory-pressure-protocol.md
ucxl://council-mem:architect@DistOS:memory/*^/architecture/bandwidth-allocation-model.md
```
### 4.3 Decision Records

```
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-001-dsm-vs-mp-model.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-002-coherence-protocol.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-003-weka-consistency-semantics.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-004-grace-c2c-scheduling-unit.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-005-tiering-policy.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-006-page-migration-triggers.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-007-oom-prevention-protocol.md
ucxl://council-mem:architect@DistOS:memory/*^/decisions/DR-MEM-008-rdma-registration-model.md
```

### 4.4 Formal Specifications

```
ucxl://council-mem:verifier@DistOS:memory/*^/specs/MemoryModel.tla
ucxl://council-mem:verifier@DistOS:memory/*^/specs/CoherenceProtocol.tla
ucxl://council-mem:verifier@DistOS:memory/*^/specs/PageMigrationProtocol.tla
ucxl://council-mem:verifier@DistOS:memory/*^/specs/WekaConsistencyModel.tla
```

### 4.5 Interface Contracts

```
ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-sched-contract.md
ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-net-contract.md
ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-sec-contract.md
ucxl://council-mem:architect@DistOS:memory/*^/interfaces/mem-to-telemetry-contract.md
```
---

## 5. Decision Points

### DP-MEM-001: Primary Memory Consistency Model

**Question:** Should DistOS expose a Distributed Shared Memory (DSM) programming model to user workloads, a pure message-passing model, or a hybrid model where DSM is restricted to within-NVLink-domain access (exploiting NVSwitch hardware) while cross-domain communication uses explicit message passing?

**Factors:** NVLink domain size (8 GPUs per NVLink domain in a DGX H100, up to 256 GPUs in an NVLink Switch System), coherence traffic overhead at 1024-node scale, programmability, compatibility with existing CUDA/MPI workloads, interaction with the PGAS language model.

**Dependency:** This is the highest-impact decision in `council-mem`; it governs the programming model presented to users and all downstream specification work. `council-sched` requires this decision before finalising its placement model.

### DP-MEM-002: Cache Coherence Protocol Selection

**Question:** If a DSM model (even partial) is adopted, which coherence protocol should DistOS implement? Options: (a) MESI/MESIF directory-based protocol with a distributed directory mapped to NVLink domains, (b) release consistency (lazy/eager), (c) entry consistency with lock-based synchronisation, (d) scope consistency (as in OpenSHMEM) restricted to explicitly declared shared objects.

**Factors:** Coherence directory scalability to 1024 nodes, false sharing cost over 900 GB/s NVLink vs. 200 Gbps InfiniBand, implementation complexity, interaction with CUDA memory model (`__threadfence_system`).

### DP-MEM-003: Weka Consistency Semantics

**Question:** WekaFS supports multiple client-side caching modes: `readcache` (read-only caching), `writecache` (write-back caching), and `coherent` (strict POSIX consistency with distributed coherence). DistOS must choose a default mount policy and define when workloads may opt into relaxed consistency. Should checkpoint writes use `writecache` for performance and accept the risk of partial data on crash, or require `coherent` mode with reduced write throughput?

**Factors:** Checkpoint I/O bandwidth (a 1024-GPU training job may checkpoint 100+ TB), crash recovery correctness, WekaFS client-side coherence message overhead, interaction with `council-fault`'s recovery model.
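The checkpoint-bandwidth factor can be grounded with a rough model. The per-mode aggregate write throughputs below are placeholder assumptions, not Weka benchmark results; the point is the shape of the trade-off, not the numbers:

```python
# Rough checkpoint-time model for DP-MEM-003. Per-mode aggregate write
# throughputs are placeholder assumptions pending Weka benchmarking.

def checkpoint_time_s(total_bytes, aggregate_bw_bytes_per_s):
    return total_bytes / aggregate_bw_bytes_per_s

CHECKPOINT_BYTES = 100e12          # 100 TB job-wide checkpoint
MODES = {                          # assumed cluster-wide write throughput
    "writecache": 2e12,            # 2 TB/s write-back, relaxed consistency
    "coherent":   0.5e12,          # 0.5 TB/s under strict POSIX coherence
}

for mode, bw in MODES.items():
    print(f"{mode}: {checkpoint_time_s(CHECKPOINT_BYTES, bw):.0f} s")
```

Under these assumptions the gap is 50 s versus 200 s per checkpoint, which compounds across checkpoint frequency and must be weighed against the partial-data risk analysed with `council-fault`.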
**Dependency:** `council-fault` must be consulted; this decision affects recovery guarantees.

### DP-MEM-004: Grace Superchip Scheduling Unit

**Question:** On GH200 nodes, should the CPU and GPU be treated as a single unified scheduling unit (one "node" from the scheduler's perspective, sharing a unified virtual address space) or as two separate resources with explicit affinity? The answer changes the memory model: unified treatment enables transparent CPU-GPU pointer sharing; separate treatment requires explicit migration.

**Factors:** Interaction with `council-sched` DP-SCHED-006, API surface complexity, NUMA distance for CPU-to-GPU migration over NVLink-C2C vs. within a unified UVA space.

**Coordination required:** This decision is jointly owned by `council-mem` and `council-sched`; a joint session is required before the two councils can finalise their respective specs.

### DP-MEM-005: Tiering Policy Design

**Question:** What algorithm governs page promotion and demotion across the HBM3 → DDR5 → NVMe → Weka hierarchy? Options: (a) LRU/LFU approximation (similar to Linux CLOCK), (b) access frequency + recency hybrid (ARC), (c) workload-hint-driven (applications annotate hot vs. cold regions via `cudaMemAdvise` equivalents), (d) ML-based hotness prediction.

**Factors:** Migration bandwidth cost (HBM3 → DDR5 over NVLink-C2C is fast; DDR5 → NVMe is slow; NVMe → Weka involves network I/O), migration storm risk, implementation complexity, applicability to training vs. inference vs. HPC workloads.

### DP-MEM-006: Page Migration Trigger and Bandwidth Budget

**Question:** What events trigger page migration? Candidates: (a) page fault on access (demand paging), (b) periodic access pattern analysis (proactive migration), (c) memory pressure threshold crossings, (d) explicit application hints. How is migration bandwidth budgeted to avoid starving compute workloads that use the same NVLink/NVSwitch fabric?

**Factors:** NVLink bandwidth contention with NCCL allreduce traffic, migration latency vs. access fault latency trade-off, integration with `council-net`'s bandwidth reservation model.
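One candidate shape for the bandwidth budget is a token bucket: migrations spend tokens that refill at the reserved migration rate, so bursts cannot starve collective traffic. The reserved rate and burst size below are assumptions to be replaced by `council-net`'s reservation parameters:

```python
# Token-bucket sketch for a migration bandwidth budget (DP-MEM-006).
# Rate and burst size are illustrative assumptions.

class MigrationBudget:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.capacity = burst_bytes
        self.tokens = burst_bytes

    def tick(self, dt_s):
        """Refill tokens for dt_s seconds of elapsed time."""
        self.tokens = min(self.capacity, self.tokens + self.rate * dt_s)

    def try_migrate(self, nbytes):
        """Admit the migration only if the budget covers it."""
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

# Reserve 50 GB/s for migration with a 1 GiB burst allowance (assumed).
budget = MigrationBudget(rate_bytes_per_s=50e9, burst_bytes=1 << 30)
print(budget.try_migrate(1 << 29))   # True: within burst
print(budget.try_migrate(1 << 30))   # False: budget exhausted
```

Rejected migrations would queue for a later epoch rather than fail, which is also the natural hook for migration-storm admission control.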
### DP-MEM-007: OOM Prevention and Memory Pressure Protocol

**Question:** When GPU HBM3 memory reaches a pressure threshold, what is the DistOS response hierarchy? Proposed hierarchy: (1) request cooperative memory release from low-priority co-located processes, (2) demote cold HBM3 pages to DDR5 or NVMe, (3) suspend (checkpoint) low-priority jobs to free their entire allocation, (4) if all else fails, terminate lowest-priority job. How are the pressure threshold levels defined and measured?

**Factors:** Interaction with `council-sched`'s preemption model, OOM kill latency, impact on SLO guarantees, coordination signal protocol design.

### DP-MEM-008: RDMA Registration Model

**Question:** GPUDirect RDMA requires GPU memory regions to be registered with the RDMA HCA (Host Channel Adapter). Should DistOS pre-register fixed RDMA memory pools (reducing registration overhead but consuming HBM3 at idle), register on demand (flexible but with latency), or use a cache of registered regions with LRU eviction?

**Factors:** Registration latency (~milliseconds for large GPU buffers), HBM3 overhead for pre-registered pools, interaction with `council-net`'s RDMA QP management, memory fragmentation risk.
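The third option can be sketched as an LRU cache keyed by buffer address. The real registration and deregistration calls (`ibv_reg_mr`/`ibv_dereg_mr` in the verbs API) are stubbed out here, and the capacity is an arbitrary assumption:

```python
# LRU cache of RDMA memory registrations (option three in DP-MEM-008).
# Registration calls are stubbed; capacity is an illustrative assumption.

from collections import OrderedDict

class RegistrationCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()   # buffer address -> registration handle

    def get(self, addr):
        """Return a registration handle, registering on miss and
        evicting (deregistering) the least recently used entry."""
        if addr in self.cache:
            self.cache.move_to_end(addr)       # mark as recently used
            return self.cache[addr]
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)
            # real code would call ibv_dereg_mr() on the evicted handle
        handle = f"mr-{addr:#x}"               # stub for ibv_reg_mr()
        self.cache[addr] = handle
        return handle

regs = RegistrationCache(capacity=2)
regs.get(0x1000); regs.get(0x2000)
regs.get(0x1000)          # touch 0x1000 so 0x2000 becomes LRU
regs.get(0x3000)          # evicts and would deregister 0x2000
print(sorted(regs.cache)) # [4096, 12288]
```

The cache amortises millisecond-scale registration latency across reuse while bounding idle HBM3 consumption, at the cost of an eviction policy that must cooperate with in-flight RDMA operations.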
**Dependency:** `council-net` must align on the registration model before RDMA transport design is finalised.

---

## 6. Dependencies

### 6.1 What council-mem Needs from Other Councils

| Dependency | Source Council | Artifact | Purpose |
|------------|---------------|---------|---------|
| Placement decisions and GPU assignment map | `council-sched` | `ucxl://council-sched:architect@DistOS:scheduling/*^/interfaces/sched-to-mem-contract.md` | Memory subsystem must know which GPU a job is placed on to configure HBM3 address space, NVLink fabric attachment point, and DDR5 NUMA proximity |
| Preemption events and checkpoint triggers | `council-sched` | `ucxl://council-sched:architect@DistOS:scheduling/*^/architecture/preemption-protocol.md` | Memory tiering must snapshot GPU HBM3 contents on preemption; needs notification protocol |
| RDMA transport requirements | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/interfaces/net-to-mem-contract.md` | RDMA registration model (DP-MEM-008) must align with network transport QP management |
| Network bandwidth reservation API | `council-net` | `ucxl://council-net:architect@DistOS:networking/*^/architecture/bandwidth-reservation.md` | Page migration over NVLink/IB must not starve NCCL allreduce traffic; needs bandwidth reservation |
| Memory isolation constraints | `council-sec` | `ucxl://council-sec:architect@DistOS:security/*^/interfaces/sec-to-mem-contract.md` | IOMMU domain assignments, capability restrictions on shared memory regions, tenant isolation requirements |
| Memory quota enforcement hooks | `council-telemetry` | `ucxl://council-telemetry:architect@DistOS:telemetry/*^/interfaces/telemetry-to-mem-contract.md` | HBM3 usage accounting per tenant requires telemetry metering hooks |
| Recovery semantics for Weka I/O | `council-fault` | `ucxl://council-fault:architect@DistOS:fault-tolerance/*^/interfaces/fault-to-mem-contract.md` | Weka consistency mode choice (DP-MEM-003) depends on recovery guarantees provided by fault tolerance subsystem |

### 6.2 What Other Councils Need from council-mem

| Consumer Council | Artifact Required | Purpose |
|-----------------|------------------|---------|
| `council-sched` | Memory pressure signals and eviction cost model | Scheduler uses eviction cost estimates to avoid placing kernels that will immediately trigger HBM3 pressure |
| `council-sched` | HBM3 and DDR5 bandwidth allocation model | Bandwidth is a co-dominant scheduling resource; model must be consistent with memory spec |
| `council-net` | RDMA memory registration interface specification | Network subsystem designs RDMA transport around the memory registration model agreed in DP-MEM-008 |
| `council-sec` | Address space layout and isolation model | Security isolation requires memory address space boundaries from the memory model |
| `council-telemetry` | Memory usage event stream specification | Metering requires a defined set of memory events (allocation, migration, eviction, OOM) with UCXL addresses |
| `council-verify` | TLA+ memory model and coherence protocol specs | Formal verification council model-checks for coherence safety and freedom from memory corruption |
| `council-api` | Virtual memory API surface | API council designs the user-facing memory allocation and mapping interface based on `council-mem` spec |

---
## 7. WHOOSH Configuration

```yaml
# WHOOSH council formation configuration for council-mem
council_id: council-mem
project: DistOS
subsystem: memory
gitea_label: chorus-entrypoint
gitea_repo: distos/memory

formation:
  target_agents: 80
  min_agents: 60
  wave:
    max_per_wave: 12
    min_per_wave: 6
    period_sec: 30
  placement:
    max_replicas_per_node: 2
    join_stagger_ms: 2000
    bootstrap_peers_min: 5

roles:
  - role: lead-architect
    count: 2
    model: claude-opus-4-6
    priority: high
  - role: researcher
    count: 46
    model: qwen2.5-coder:32b
    priority: normal
    subgroups:
      - tag: weka-integration
        count: 8
      - tag: gpu-memory
        count: 8
      - tag: nvlink-nvswitch
        count: 6
      - tag: grace-superchip
        count: 5
      - tag: cache-coherence
        count: 5
      - tag: tiering-migration
        count: 6
      - tag: memory-pressure
        count: 4
      - tag: pgas-dsm
        count: 4
  - role: architect
    count: 8
    model: claude-opus-4-6
    priority: normal
  - role: verifier
    count: 8
    model: deepseek-coder-v2
    priority: normal
  - role: reviewer
    count: 7
    model: claude-opus-4-6
    priority: normal
  - role: integration-liaison
    count: 5
    model: qwen2.5-coder:32b
    priority: normal
  - role: decision-record-author
    count: 5
    model: claude-opus-4-6
    priority: normal
  - role: adversarial-critic
    count: 4
    model: claude-opus-4-6
    priority: normal

subchannels:
  - name: mem-research
    description: "Memory subsystem research discussion and literature synthesis"
    participants: [researcher, lead-architect]
    pubsub: true
  - name: mem-weka
    description: "Weka/WekaFS integration design — high-priority subchannel"
    participants: [researcher-weka-integration, architect, lead-architect, integration-liaison]
    pubsub: false
  - name: mem-architecture
    description: "Architecture proposal discussion and voting"
    participants: [architect, lead-architect, reviewer, adversarial-critic]
    pubsub: true
  - name: mem-formal-spec
    description: "TLA+ specification authoring and review"
    participants: [verifier, lead-architect, reviewer]
    pubsub: false
  - name: mem-integration
    description: "Cross-council interface negotiation"
    participants: [integration-liaison, lead-architect]
    pubsub: false
  - name: mem-grace-joint
    description: "Joint session channel with council-sched for GH200 scheduling unit decision (DP-MEM-004/DP-SCHED-006)"
    participants: [lead-architect, integration-liaison]
    pubsub: false
    external_councils: [council-sched]
  - name: mem-decisions
    description: "Decision record authoring and consensus"
    participants: [decision-record-author, lead-architect, reviewer]
    pubsub: true

quorum:
  architecture_changes:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    beat_minutes: 20
    timeout_beats: 6
  research_summaries:
    policy: simple_majority
    threshold: 0.5
    require_domain_role: true
    require_quality_role: false
    beat_minutes: 15
    timeout_beats: 4
  formal_specs:
    policy: supermajority
    threshold: 0.667
    require_domain_role: true
    require_quality_role: true
    require_verifier: true
    beat_minutes: 25
    timeout_beats: 8
  # Joint decisions with council-sched require lead-architect sign-off from both councils
  joint_decisions:
    policy: unanimous
    roles: [lead-architect]
    councils: [council-mem, council-sched]
    beat_minutes: 30
    timeout_beats: 6
  interface_contracts:
    policy: unanimous
    roles: [lead-architect, integration-liaison]
    beat_minutes: 30
    timeout_beats: 4

gates:
  kaching:
    p95_latency_ms: 250
    max_error_rate: 0.01
  backbeat:
    max_stream_lag: 200
  bootstrap:
    min_healthy_peers: 5
  join:
    min_success_rate: 0.80

review:
  beat_minutes: 20
  quorum:
    total_min: 3
    require_domain_role: true
    require_quality_role: true
  timeout_beats: 6
  no_self_approval: true
```

---

## 8. Success Criteria

1. **Research completeness:** All 9 research domain summaries published to DHT with at least 5 primary references each, approved by council simple majority.

2. **Architecture coverage:** Architectural proposals exist for all 9 major memory subsystem components. Each proposal addresses the specific implications of HBM3 scarcity (80 GB per H100) and the Weka filesystem integration.

3. **Decision records resolved:** All 8 decision points (DP-MEM-001 through DP-MEM-008) have corresponding Decision Records with at least 3 alternatives considered. DP-MEM-001 (primary memory model) and DP-MEM-004 (Grace scheduling unit) are resolved by council supermajority, with `council-sched` co-signing DP-MEM-004.

4. **Formal specifications:** TLA+ specifications exist for the memory model, coherence protocol, and page migration protocol. The coherence protocol spec must include a proof (model-checked by `council-verify`) that it satisfies SC-for-DRF (Sequential Consistency for Data-Race-Free programs) within an NVLink domain.

5. **Weka integration validated:** The Weka consistency mode recommendation (DP-MEM-003) is accompanied by a quantitative bandwidth model (estimated checkpoint throughput for each consistency mode) and a failure scenario analysis reviewed by `council-fault`.
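The bandwidth model required here can be sketched as a back-of-envelope calculation. Every figure below is a placeholder assumption (per-client Weka write bandwidth, per-mode synchronization overheads), not a measurement; the real model would substitute benchmarked numbers:

```python
def checkpoint_time_s(checkpoint_gb: float, nodes: int,
                      per_client_gbps: float, sync_overhead: float) -> float:
    """Estimate wall-clock checkpoint time (illustrative back-of-envelope model).

    sync_overhead models the cost of a consistency mode: near zero for relaxed
    (close-to-open) semantics, higher for strict POSIX write visibility.
    """
    aggregate_gbps = nodes * per_client_gbps   # GB/s across all Weka clients
    transfer = checkpoint_gb / aggregate_gbps  # ideal fully parallel write time
    return transfer * (1.0 + sync_overhead)

# Placeholder assumptions, NOT measured figures: 80 GB of HBM3 state per GPU,
# 8 GPUs per node, 1024 nodes, 10 GB/s effective Weka write path per client.
state_gb = 80 * 8 * 1024
for mode, overhead in [("relaxed", 0.05), ("close-to-open", 0.15), ("strict POSIX", 0.60)]:
    t = checkpoint_time_s(state_gb, nodes=1024, per_client_gbps=10.0, sync_overhead=overhead)
    print(f"{mode}: ~{t:.0f} s")
```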

6. **Interface contracts ratified:** All 4 interface contracts (to `council-sched`, `council-net`, `council-sec`, `council-telemetry`) are co-signed. The RDMA registration model (DP-MEM-008) contract is co-signed by `council-net` before the end of Phase 2.

7. **UCXL navigability:** Any Decision Record can be traced to the research summary motivating it within 5 UCXL hops.

8. **Adversarial review pass:** Each major architecture proposal has a documented adversarial critique and resolution. The coherence protocol design must specifically address the scenario of a migration storm (100+ simultaneous page migrations consuming all NVLink bandwidth) with a documented mitigation.
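One possible shape for the migration storm mitigation demanded here is bandwidth-budgeted admission control. The sketch below is illustrative only; the class name, the 25% bandwidth share, and the FIFO queueing discipline are assumptions, not the council's chosen design:

```python
class MigrationThrottle:
    """Admission control for page migrations (illustrative mitigation sketch).

    Caps the NVLink bandwidth migrations may consume, so a storm of
    simultaneous requests degrades into a queue instead of saturating the fabric.
    """
    def __init__(self, link_gbps: float, migration_share: float):
        self.budget_gbps = link_gbps * migration_share  # e.g. 25% of the link
        self.in_flight_gbps = 0.0
        self.queue: list[float] = []

    def request(self, page_gbps: float) -> bool:
        """Admit a migration if the budget allows; otherwise queue it."""
        if self.in_flight_gbps + page_gbps <= self.budget_gbps:
            self.in_flight_gbps += page_gbps
            return True
        self.queue.append(page_gbps)
        return False

    def complete(self, page_gbps: float) -> None:
        """Release bandwidth and admit queued migrations that now fit."""
        self.in_flight_gbps -= page_gbps
        while self.queue and self.in_flight_gbps + self.queue[0] <= self.budget_gbps:
            self.in_flight_gbps += self.queue.pop(0)

# A 150-request storm: only the budget's worth proceed, the rest wait in the queue.
throttle = MigrationThrottle(link_gbps=900.0, migration_share=0.25)  # 900 GB/s NVLink
admitted = sum(throttle.request(2.0) for _ in range(150))
print(admitted, len(throttle.queue))  # prints "112 38"
```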

---

## 9. Timeline

### Phase 1: Research and Survey (Days 1-3)

**Day 1:**

- WHOOSH forms council; all 80 agents join via wave deployment
- Researchers self-assign to domain subgroups via `mem-research` subchannel
- Weka integration research begins immediately (highest dependency for both `council-sched` placement and `council-fault` recovery design)
- GPU memory management and NVLink/NVSwitch fabric research begins in parallel
- Integration liaisons contact `council-sched` and `council-net` to establish interface negotiation schedule

**Day 2:**

- Grace Superchip, cache coherence, and tiering/migration research domains surveyed
- PGAS/DSM language model research completed (inputs to DP-MEM-001)
- Memory pressure and OOM research domain surveyed
- Research summaries drafted; internal review cycle begins
- Lead architects draft preliminary memory model design space map

**Day 3:**

- Research summaries revised based on review feedback; all 9 summaries published to DHT
- Adversarial critics challenge key assumptions (particularly DSM scalability claims)
- Research phase gate: all 9 summaries achieve simple majority approval
- Preliminary interface contract outlines shared with all dependency councils
- Joint session scheduled with `council-sched` for DP-MEM-004/DP-SCHED-006 (Grace scheduling unit)

### Phase 2: Architecture and Trade-offs (Days 3-6)

**Days 3-4:**

- DP-MEM-001 (primary memory model) — highest priority; architects propose DSM, message-passing, and hybrid options
- Joint session with `council-sched` on DP-MEM-004/DP-SCHED-006 (Grace Superchip scheduling unit) — this is a co-owned decision requiring both councils
- DP-MEM-008 (RDMA registration model) — early engagement with `council-net` required; initial proposal shared

**Days 4-5:**

- DP-MEM-002 (coherence protocol) resolved — depends on DP-MEM-001 outcome
- DP-MEM-003 (Weka consistency semantics) resolved — `council-fault` consulted
- Bandwidth allocation model drafted; shared with `council-sched` as input to their placement scoring

**Days 5-6:**

- DP-MEM-005 (tiering policy), DP-MEM-006 (migration triggers), DP-MEM-007 (OOM protocol) resolved
- All 8 Decision Records drafted and voted on by council supermajority
- Architecture overview assembled from approved DRs
- Architecture phase gate: all DPs resolved and co-dependencies with `council-sched` confirmed

### Phase 3: Formal Specification (Days 6-10)

**Days 6-7:**

- TLA+ specification of the memory model begins; verifiers partition work across 4 modules (memory model, coherence, migration, Weka consistency)
- Architects continue refining designs to resolve spec ambiguities
- `council-verify` given early access to spec drafts for model-checking setup

**Days 7-8:**

- Coherence protocol TLA+ module authored; safety invariants stated (SC-for-DRF within NVLink domain, no lost writes across domain boundaries)
- Page migration protocol TLA+ module authored; liveness invariants stated (no indefinite page residency in any tier, migration storm prevention)

**Days 8-10:**

- Weka consistency model TLA+ module authored
- Model checking runs submitted to `council-verify`
- Any counterexamples trigger architecture revision with updated DRs
- Formal spec versions pinned in DHT; UCXL addresses published to `council-sched`, `council-net`, `council-verify`

### Phase 4: Integration and Review (Days 10-12)

**Days 10-11:**

- Interface contracts with all dependency councils finalised and submitted for co-signature
- Cross-council integration session: memory model validated against network RDMA model (`council-net`)
- Cross-council integration session: memory pressure protocol validated against preemption protocol (`council-sched`)
- `council-synth` engaged for any unresolved conflicts

**Days 11-12:**

- Final council review of complete memory specification
- Adversarial critics run migration storm scenario and DSM coherence traffic saturation analysis
- All yellow votes addressed with documented mitigations
- Integration review gate: all interface contracts co-signed

### Phase 5: Documentation and Narrative (Days 12-14)

**Days 12-13:**

- Decision record authors produce narrative summaries
- `council-docs` receives complete memory specification for standardised formatting
- UCXL navigability audit: spot-check 10 random decision paths
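The navigability audit could be automated as a breadth-first hop count over the UCXL link graph, checking each sampled Decision Record against the 5-hop budget. The addresses and graph shape below are hypothetical placeholders:

```python
import random
from collections import deque

def hops(links: dict[str, list[str]], src: str, dst: str) -> int:
    """Shortest hop count from src to dst over the link graph (BFS); -1 if unreachable."""
    seen, frontier = {src}, deque([(src, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == dst:
            return dist
        for nxt in links.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return -1

# Hypothetical link graph: Decision Record -> architecture proposal -> research summary.
research = "ucxl://council-mem:research/weka-integration"
links = {
    "ucxl://council-mem:dr/DP-MEM-003": ["ucxl://council-mem:arch/weka-consistency"],
    "ucxl://council-mem:arch/weka-consistency": [research],
}
for start in random.sample(list(links), k=min(10, len(links))):  # spot-check up to 10 paths
    assert 0 <= hops(links, start, research) <= 5
print("navigability spot-check passed")
```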

**Day 14:**

- Final specification published
- `council-arch` generates human-readable narrative of memory subsystem design evolution
- Council dissolved; agents released back to WHOOSH pool