<!-- Source: DistOS/councils/08-api-surface.md, commit 7f56ca4d46 by anthonyrawlins, 2026-02-26 — "Initial DistOS project constitution and council design briefs: 12 council design briefs for distributed OS specification project targeting 1024-node Hopper/Grace/Blackwell GPU cluster with Weka parallel filesystem." Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> -->
# Council Design Brief: API Surface and Developer Experience
**Council ID:** `council-api`
**Mission:** Define the complete, coherent, and ergonomic interface between DistOS and its users — operators, application developers, and other systems. This council decides what the operating system looks like from the outside: system calls, SDK bindings, CLI tools, and the conventions that make all of the above consistent and maintainable across language boundaries and API versions.
**UCXL Base Address:** `ucxl://council-api:*@DistOS:api/*`
**Agent Count:** ~40
**Status:** Design Brief — Constitution Phase
---
## 1. Scope and Responsibilities
`council-api` owns the external interface contract of DistOS. Its scope covers:
- Deciding the overall API philosophy: POSIX-compatible extension, clean-slate design, or a layered model that offers both
- Defining GPU-native system calls for kernel launch, memory allocation, device-to-device transfers, stream and graph management, and event synchronisation
- Defining distributed system calls: remote procedure invocation (covering both synchronous RPC and async futures), distributed lock acquisition and release, barriers, and collective operations across node groups
- Designing an async-first API surface that aligns with modern language runtimes (Rust `async`/`await`, Go goroutines, Python `asyncio`)
- Establishing error handling conventions, including integration with UCXL response codes for errors that carry provenance (which node, which operation, at what logical time)
- Designing the SDK for four target languages: C (ABI-stable systems interface), Rust (idiomatic, zero-cost), Go (ergonomic, channel-friendly), and Python (user-friendly, numpy-compatible)
- Designing CLI tooling for cluster management: node status, job submission, resource inspection, log retrieval, and administrative operations
- Defining the API versioning and evolution strategy: how new calls are introduced, how deprecated calls are retired, compatibility guarantees across minor and major versions
- Producing API reference documentation that is precise enough to serve as a normative source alongside the formal spec
- Specifying example applications that exercise non-trivial API paths and serve as integration test targets
Responsibilities this council does **not** own: kernel implementation (owned by subsystem councils); formal verification of API contracts (owned by `council-verify`); security policy enforcement (owned by `council-sec`, though `council-api` designs the authentication and authorisation API surface in coordination with it); monitoring and metering calls (owned by `council-telemetry`, though `council-api` exposes the SDK surface for those).
---
## 2. Research Domains
### 2.1 POSIX Compatibility vs. Clean-Slate Design
POSIX (IEEE 1003.1) defines the canonical Unix system call interface. Its strengths are: near-universal language runtime support, a mature ecosystem of tools, and decades of developer familiarity. Its weaknesses in a GPU-cluster OS context are: blocking I/O semantics that assume CPU-thread models, file-descriptor-centric resource management ill-suited to GPU memory objects, and no native concept of distributed operations or remote memory.
Two design philosophies must be fully researched before the council can decide:
- **POSIX-compatible extension:** Retain the full POSIX interface and extend it with GPU and distributed primitives as optional add-ons. Applications written for Linux run unmodified; GPU-aware applications opt into extensions. This is the approach taken by CUDA (which layers a driver API on top of the OS) and by ROCm/HIP.
- **Clean-slate design:** Design an interface optimal for the DistOS hardware target without backward-compatibility constraints. This allows stronger type safety, async-native semantics, and a capability-based resource model from the first call. Plan 9 (Pike et al.) and Fuchsia (Zircon) are the primary existence proofs.
- **Layered model:** Provide a clean-slate primary API and a POSIX compatibility layer implemented on top of it. This is the architectural recommendation for evaluation. The compatibility layer has a defined cost budget.
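As a rough illustration of the layered model, the sketch below shows a handle-based clean-slate layer with a POSIX-style file-descriptor shim built entirely on top of it. This is plain Python with invented names (`NativeApi`, `PosixShim`, `obj_create`); nothing here is DistOS-specified — the point is only that the shim's cost is exactly the bookkeeping visible in the second class.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class NativeApi:
    """Clean-slate layer: resources are opaque handles with no global
    meaning outside this API (cf. Zircon's zx_handle_t)."""
    _objects: Dict[int, bytes] = field(default_factory=dict)
    _next: int = 1

    def obj_create(self, data: bytes) -> int:
        handle, self._next = self._next, self._next + 1
        self._objects[handle] = data
        return handle

    def obj_read(self, handle: int) -> bytes:
        return self._objects[handle]

class PosixShim:
    """Compatibility layer: maps POSIX fd semantics onto native handles.
    The shim's cost budget is this translation table and nothing else."""
    def __init__(self, native: NativeApi):
        self.native = native
        self.fd_table: Dict[int, int] = {}  # fd -> native handle
        self._next_fd = 3                   # 0-2 reserved, as in POSIX

    def open_bytes(self, data: bytes) -> int:
        fd = self._next_fd
        self._next_fd += 1
        self.fd_table[fd] = self.native.obj_create(data)
        return fd

    def read(self, fd: int) -> bytes:
        return self.native.obj_read(self.fd_table[fd])
```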
Key references:
- The Open Group. *The Single UNIX Specification (SUSv4/POSIX.1-2017)*. The normative POSIX reference.
- Pike, R. et al. "Plan 9 from Bell Labs." *USENIX Summer 1990 Technical Conference*. Plan 9's contribution is the 9P protocol: everything is a file, including processes and network connections. The simplicity of the resource model is instructive even if DistOS does not adopt 9P verbatim.
- Pike, R. "The Use of Name Spaces in Plan 9." *EUUG Newsletter* 12(1), 1992.
- Google. *Fuchsia OS: Zircon Kernel Objects*. https://fuchsia.dev/fuchsia-src/concepts/kernel. Zircon uses a capability-based object system with handles as the only way to reference kernel objects. This is the most complete modern clean-slate OS design and must be studied in depth.
### 2.2 GPU-Native System Calls
The CUDA Driver API provides the lowest-level GPU control surface available: `cuInit`, `cuDeviceGet`, `cuCtxCreate`, `cuMemAlloc`, `cuLaunchKernel`, `cuEventRecord`, `cuStreamWaitEvent`. It is the reference for what a GPU system call interface must cover.
Agents must evaluate the tradeoffs between:
- **Driver-level API** (CUDA Driver API / ROCm HIP Low-Level): explicit context management, explicit stream management, maximum control, verbose
- **Runtime API** (CUDA Runtime / ROCm): implicit context, automatic stream assignment, less control, more ergonomic
- **Graph-based execution** (CUDA Graphs / HIP Graphs): capture a sequence of operations as a graph for repeated execution with lower launch overhead. Critical for the 1024-node deployment where kernel launch overhead accumulates.
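The graph-based model can be illustrated with a toy capture/replay loop: operations are recorded once, then the whole sequence is replayed with a single "launch". This is a conceptual sketch only (the `Graph` class and its methods are invented; no real GPU API is involved) — it shows the semantics that make graphs attractive when per-launch overhead accumulates across 1024 nodes.

```python
from typing import Callable, Dict, List

class Graph:
    """Toy model of graph capture and replay (CUDA Graphs style)."""

    def __init__(self) -> None:
        self.ops: List[Callable[[Dict[str, int]], None]] = []

    def capture(self, op: Callable[[Dict[str, int]], None]) -> None:
        # Recording only: nothing executes at capture time.
        self.ops.append(op)

    def launch(self, state: Dict[str, int]) -> None:
        # One replay stands in for one cheap graph launch that would
        # otherwise be N individual kernel launches.
        for op in self.ops:
            op(state)

g = Graph()
g.capture(lambda s: s.__setitem__("x", s["x"] * 2))
g.capture(lambda s: s.__setitem__("x", s["x"] + 1))

state = {"x": 3}
for _ in range(2):  # repeated execution of the captured sequence
    g.launch(state)
```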
Key references:
- NVIDIA. *CUDA Driver API Reference Manual*. https://docs.nvidia.com/cuda/cuda-driver-api/. Normative reference for GPU system call semantics.
- NVIDIA. *CUDA C Programming Guide* (Chapter 3: Programming Interface). Covers the Runtime API and its relationship to the Driver API.
- NVIDIA. *CUDA Graphs* documentation. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs. The graph execution model is essential for understanding low-latency repeated workloads on Hopper and Blackwell.
- Khronos Group. *OpenCL 3.0 Specification*. https://www.khronos.org/opencl/. The vendor-neutral GPU programming API. DistOS must decide whether to support OpenCL alongside CUDA semantics.
- Khronos Group. *SYCL 2020 Specification*. https://www.khronos.org/sycl/. SYCL provides a C++ abstraction over OpenCL and oneAPI targets. Intel's oneAPI unifies GPU programming across vendors and is a candidate for the DistOS higher-level SDK layer.
- Intel. *oneAPI Programming Guide*. https://www.intel.com/content/www/us/en/developer/tools/oneapi/programming-guide.html.
- NVIDIA. *NVLink and NVSwitch Architecture Overview*. https://www.nvidia.com/en-us/data-center/nvlink/. GPU-to-GPU direct access semantics affect memory system call design.
Blackwell-specific: The GB200 NVL72 introduces NVLink Switch System connecting 72 GPUs in a single flat memory domain. System calls for `cuMemAdvise` and `cuMemPrefetchAsync` take on new semantics in this topology. Agents must review:
- NVIDIA. *NVIDIA Blackwell Architecture Technical Brief*. 2024.
### 2.3 Distributed System Calls
System calls that span nodes are novel: POSIX has no notion of them. The design space covers:
- **Remote procedure invocation:** How does a process on node A invoke a procedure on node B? Synchronous blocking (simple, latency-bound), asynchronous with futures (complex, scalable), or continuation-passing. gRPC is the de facto standard for service-to-service RPC in the cloud but carries HTTP/2 overhead.
- **Distributed locks:** Lease-based locks (Chubby/Zookeeper model), RDMA-based compare-and-swap (best latency), or consensus-based locks for strong guarantees. Each has different failure semantics.
- **Barriers:** Collective synchronisation across node groups. MPI_Barrier semantics are well understood; the question is how to expose this in a general-purpose OS API.
- **Collective operations:** AllReduce, AllGather, Broadcast, Reduce-Scatter. These are first-class operations for distributed ML workloads (the dominant use case on a 1024-node GPU cluster) and must be surfaced as OS-level calls, not just library calls, so the OS can optimise placement and routing.
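The semantics of the collective calls above can be pinned down with a toy model over simulated nodes: after AllReduce, every node holds the global result. Real implementations use ring or tree algorithms over NVLink/RDMA; this sketch (function names invented) captures only the call's contract, not its performance.

```python
from typing import List

def all_reduce_sum(node_values: List[float]) -> List[float]:
    """Every node contributes one value; every node receives the sum."""
    total = sum(node_values)            # Reduce phase
    return [total] * len(node_values)   # Broadcast phase

def all_gather(node_values: List[float]) -> List[List[float]]:
    """Every node receives the full vector of all nodes' contributions."""
    return [list(node_values) for _ in node_values]
```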
Key references:
- Birrell, A. and Nelson, B. "Implementing Remote Procedure Calls." *ACM Transactions on Computer Systems* 2(1), 1984. The foundational RPC paper.
- Google. *gRPC*. https://grpc.io/. The current industry standard for typed RPC. Protocol Buffers schema evolution strategy is directly applicable to DistOS API versioning.
- Burrows, M. "The Chubby Lock Service for Loosely-Coupled Distributed Systems." *OSDI 2006*.
- Hunt, P. et al. "ZooKeeper: Wait-free Coordination for Internet-scale Systems." *USENIX ATC 2010*.
- Message Passing Interface Forum. *MPI: A Message-Passing Interface Standard, Version 4.1*. 2023. The collective operations specification is normative for `council-api`'s collective call design.
- Mellanox/NVIDIA. *RDMA Programming Guide*. InfiniBand verbs API (ibv_post_send, ibv_post_recv, ibv_create_qp) provides the lowest-latency distributed memory access primitives available on the target cluster.
### 2.4 Async-First API Design
A GPU cluster OS serving AI workloads will have I/O patterns dominated by deep asynchrony: thousands of in-flight kernel launches, streaming data from Weka FS, and collective communications across 1024 nodes. A synchronous-only API would be a fundamental design mistake. Agents must research:
- **Rust async/await:** The Rust async model (futures, the `Poll` trait, the executor model) provides zero-cost abstraction over async I/O. The `tokio` runtime is the dominant executor. The DistOS Rust SDK must integrate naturally with tokio.
- **io_uring (Linux 5.1+):** io_uring provides shared submission and completion rings between kernel and userspace, batching I/O requests and sharply reducing per-I/O syscall overhead (eliminating it entirely in polled mode). Its submission/completion queue model is the reference for how DistOS should design its own async system call interface.
- **Go channels and goroutines:** Go's concurrency model maps well to distributed operations. The DistOS Go SDK must express distributed calls as channels or via the `context.Context` cancellation pattern.
- **Python asyncio:** The Python SDK must be usable from `async def` coroutines. NumPy compatibility for GPU tensor operations should be considered (compatibility with the Numba/CuPy interface).
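The io_uring-style model referenced above can be sketched as a pair of fixed-size rings: userspace pushes submission entries, the kernel consumes them and posts completions, and a full submission ring is the natural back-pressure signal. All structure names (`Sqe`, `Cqe`, `Ring`) are illustrative, not the DistOS ABI.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Sqe:             # submission queue entry
    op: str
    user_data: int     # cookie echoed back in the completion

@dataclass
class Cqe:             # completion queue entry
    user_data: int
    result: int

class Ring:
    def __init__(self, depth: int = 8):
        self.sq: deque = deque(maxlen=depth)  # bounded: gives back-pressure
        self.cq: deque = deque()

    def submit(self, sqe: Sqe) -> bool:
        if len(self.sq) == self.sq.maxlen:
            return False        # ring full: caller must back off
        self.sq.append(sqe)
        return True

    def kernel_step(self) -> None:
        # Stand-in for kernel-side processing of pending submissions.
        while self.sq:
            sqe = self.sq.popleft()
            self.cq.append(Cqe(user_data=sqe.user_data, result=0))

    def reap(self):
        while self.cq:
            yield self.cq.popleft()
```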
Key references:
- Axboe, J. *io_uring and the new Linux async I/O API*. https://kernel.dk/io_uring.pdf. 2019. This paper is essential for understanding the state of the art in async syscall design.
- The Rust Async Book. https://rust-lang.github.io/async-book/. Normative reference for Rust async design patterns.
- Grigorik, I. *High Performance Browser Networking* (Chapter 2 on event loop and async I/O patterns). 2013. O'Reilly. Useful background on event-driven I/O design.
### 2.5 Error Handling Conventions
A cluster OS at this scale will produce a high volume of partial failures: a node goes dark, a GPU kernel faults, a network partition isolates a subsystem. The error handling convention must be:
- **Structured:** Every error carries a type, a severity, a source identifier (node, subsystem, call), and a correlation ID that links it to a UCXL-addressed event in the distributed log.
- **Actionable:** The API must distinguish between errors that the caller should retry (transient), errors that require intervention (permanent), and errors that indicate a usage mistake (programmer error).
- **Traceable:** Error correlation IDs must be UCXL-compatible so that an error returned to a Python application can be resolved to the full distributed event chain using the UCXL resolver.
Key references:
- Google. *Google Cloud API Design Guide: Errors*. https://cloud.google.com/apis/design/errors. The most systematic public treatment of structured API error design. The canonical status codes (OK, INVALID_ARGUMENT, NOT_FOUND, UNAVAILABLE, etc.) should be adopted or adapted.
- Klabnik, S. and Nichols, C. *The Rust Programming Language* (Chapter 9: Error Handling). The Rust approach to `Result<T, E>` and the `?` operator represents the state of the art for recoverable errors in a systems language.
- Syme, D. et al. "Exceptional Syntactic Support for Error Handling in F#." *Haskell Symposium 2020*. Relevant to the higher-level SDK error design.
The UCXL response code integration specifically means that API error structs carry a `ucxl_trace` field containing the UCXL address of the distributed event that caused the failure:
```
error.ucxl_trace = "ucxl://council-fault:monitor@DistOS:fault-tolerance/^^/events/node-042-timeout-2026-03-01T14:22:00Z"
```
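A minimal sketch of how the three conventions (structured, actionable, traceable) might combine in one error type follows. Field and enum names here are invented for illustration; the real struct is whatever the error catalogue deliverable defines.

```python
from dataclasses import dataclass
from enum import Enum

class ErrorClass(Enum):
    TRANSIENT = "transient"   # caller should retry with backoff
    PERMANENT = "permanent"   # requires operator intervention
    USAGE = "usage"           # programmer error; retrying cannot help

@dataclass
class DistError(Exception):
    code: str                 # canonical status code, e.g. UNAVAILABLE
    error_class: ErrorClass
    source_node: str          # provenance: which node produced the error
    ucxl_trace: str           # resolvable via the UCXL resolver

    def should_retry(self) -> bool:
        return self.error_class is ErrorClass.TRANSIENT

err = DistError(
    code="UNAVAILABLE",
    error_class=ErrorClass.TRANSIENT,
    source_node="node-042",
    ucxl_trace="ucxl://council-fault:monitor@DistOS:fault-tolerance"
               "/^^/events/node-042-timeout-2026-03-01T14:22:00Z",
)
```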
### 2.6 SDK Design for Multiple Languages
The SDK must present a coherent surface across four languages with different idioms. The design principles are:
- **C ABI as the foundation:** The canonical system call interface is a C ABI. All other language SDKs are generated or hand-written wrappers over the C ABI. This ensures ABI stability and FFI compatibility with every language.
- **Rust SDK:** Idiomatic, zero-cost wrappers. Use Rust's ownership system to enforce resource lifetimes at compile time (e.g., a `GpuBuffer<T>` type that is `Send` but not `Sync`, reflecting GPU buffer ownership semantics). The Rust SDK should use `#[repr(C)]` structs for ABI compatibility.
- **Go SDK:** Ergonomic wrappers using `cgo` for the C ABI. Expose distributed operations as channel-returning functions. Context-aware: all calls accept `context.Context` for cancellation and timeout propagation.
- **Python SDK:** High-level, NumPy-compatible. Consider auto-generating stub code from a schema. Must be `asyncio`-compatible. Integrate with the Python type system via `Protocol` and `TypedDict`.
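The "C ABI as the foundation" rule implies a recurring wrapper pattern at every FFI boundary: the native layer returns status codes, and each language SDK translates them into its idiomatic error form. The sketch below models that pattern in Python; `distos_native_alloc` is a stand-in for a real C ABI call (which would arrive via ctypes/cffi), and the status numbering is invented.

```python
STATUS_OK = 0
STATUS_UNAVAILABLE = 14   # illustrative numbering, not the DistOS ABI

def distos_native_alloc(size: int) -> tuple:
    """Stand-in for a C ABI call returning (status, handle)."""
    if size <= 0:
        return (STATUS_UNAVAILABLE, 0)
    return (STATUS_OK, 42)

class DistOsError(RuntimeError):
    """Idiomatic Python error carrying the native status code."""
    def __init__(self, status: int):
        super().__init__(f"native call failed with status {status}")
        self.status = status

def gpu_alloc(size: int) -> int:
    """Wrapper convention: status codes become exceptions, success
    returns the handle directly."""
    status, handle = distos_native_alloc(size)
    if status != STATUS_OK:
        raise DistOsError(status)
    return handle
```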
Key references:
- Klabnik, S. and Nichols, C. *The Rust Programming Language*. https://doc.rust-lang.org/book/. Idiomatic Rust patterns.
- Go Authors. *Effective Go*. https://go.dev/doc/effective_go. Idiomatic Go patterns.
- Google. *Google Cloud API Design Guide*. https://cloud.google.com/apis/design. The most comprehensive public API design guide, covering resource-oriented design, standard methods, naming conventions, and backwards compatibility.
- Smith, P. *Designing for Compatibility in Evolving APIs*. IEEE Software 39(4), 2022.
### 2.7 CLI Tooling Design
The cluster management CLI (`distos-ctl` or equivalent) must follow modern CLI design principles:
- Machine-readable output (JSON/YAML with `--output json`) for scripting
- Structured logging with log levels
- Human-readable default output with colour and progress indicators
- Completion generation for bash/zsh/fish
- Subcommand structure: `node`, `job`, `gpu`, `net`, `storage`, `secret`, `log`
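The subcommand-plus-`--output json` convention can be sketched with stdlib `argparse`. The command names come from the list above; the flag semantics and the payload shape are illustrative only.

```python
import argparse
import json

def build_parser() -> argparse.ArgumentParser:
    p = argparse.ArgumentParser(prog="distos-ctl")
    p.add_argument("--output", choices=["human", "json"], default="human")
    sub = p.add_subparsers(dest="command", required=True)
    node = sub.add_parser("node")                 # one of the subcommands above
    node.add_argument("action", choices=["status", "list"])
    return p

def run(argv):
    args = build_parser().parse_args(argv)
    # Hypothetical result payload; a real CLI would query the cluster.
    result = {"command": args.command, "action": args.action, "nodes_up": 1024}
    if args.output == "json":
        return json.dumps(result)                 # machine-readable for scripting
    return f"{result['nodes_up']} nodes up"       # human-readable default
```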
Key references:
- Sigurdsson, A. et al. *Command Line Interface Guidelines*. https://clig.dev/. The community-written standard for modern CLI design. Should be treated as the style guide for `distos-ctl`.
- Hashicorp. *Vault CLI design*. The Vault CLI is an exemplar of a well-structured cluster management tool with consistent subcommand and flag conventions.
- Kubernetes. `kubectl` source and documentation. The de facto standard for distributed cluster management CLIs. The DistOS CLI should match `kubectl` conventions where applicable to reduce cognitive load.
### 2.8 API Versioning and Evolution Strategy
A system call interface must be stable. The versioning strategy must address:
- **Compatibility guarantees:** What changes are backwards-compatible (adding optional parameters, adding new calls) vs. breaking (changing parameter semantics, removing calls)?
- **Deprecation lifecycle:** Minimum deprecation notice period, deprecation markers in the SDK, removal schedule.
- **Version negotiation:** How does a client indicate the API version it was compiled against? How does the kernel report available versions?
- **Experimental APIs:** A clearly marked experimental tier for new calls before they enter the stable surface.
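Version negotiation under SemVer-style rules can be sketched as follows: the client states the version it was built against, the kernel reports what it can serve, and a serving version is compatible when the majors match and the kernel's minor is at least the client's. The function names and the negotiation policy are illustrative, not the ratified policy.

```python
def parse(v: str):
    major, minor, patch = (int(x) for x in v.split("."))
    return major, minor, patch

def negotiate(client_built_against: str, kernel_supported: list):
    """Return the newest kernel version compatible with the client,
    or None if the major versions cannot be reconciled."""
    cmaj, cmin, _ = parse(client_built_against)
    candidates = [
        v for v in kernel_supported
        if parse(v)[0] == cmaj and parse(v)[1] >= cmin
    ]
    return max(candidates, key=parse) if candidates else None
```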
Key references:
- Google. *Google Cloud API Versioning*. https://cloud.google.com/apis/design/versioning. URL-based versioning for REST APIs; the principles apply to system call versioning.
- Klabnik, S. "Stability as a Deliverable." https://blog.rust-lang.org/2014/10/30/Stability.html. Rust's stability commitment is a model for how a systems project can make and keep compatibility promises.
- Semantic Versioning Specification. https://semver.org/. The DistOS SDK and ABI will follow SemVer 2.0.
### 2.9 Plan 9 and Fuchsia Zircon Deep Dive
These two systems represent the clearest non-POSIX OS API designs and must be studied in depth:
- **Plan 9:** The 9P protocol represents all system resources as files served over a file system protocol. Network connections, processes, and graphics are files. The simplicity is extreme. The DistOS clean-slate layer need not adopt 9P but should understand its design philosophy.
- Pike, R. et al. "The Use of Name Spaces in Plan 9." *EUUG Newsletter* 12(1), 1992.
- Dorward, S. et al. "The Inferno Operating System." *Bell Labs Technical Journal* 2(1), 1997.
- **Fuchsia / Zircon:** Zircon is a microkernel with capabilities as the security primitive. Every kernel resource is a `zx_handle_t`. Handles are passed between processes explicitly; there is no global namespace for kernel objects. This is the preferred model for DistOS's capability integration with `council-sec`.
- Google. *Zircon Kernel Concepts*. https://fuchsia.dev/fuchsia-src/concepts/kernel/concepts.
- Google. *Zircon Syscall Reference*. https://fuchsia.dev/fuchsia-src/reference/syscalls.
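The Zircon handle model above can be sketched in miniature: kernel objects are reachable only through a per-process handle table (no global namespace), and duplication can only narrow rights, never amplify them. The rights constants and method names are illustrative, not Zircon's actual ABI.

```python
from dataclasses import dataclass

RIGHT_READ, RIGHT_WRITE = 1, 2   # illustrative rights bits

@dataclass
class Handle:
    object_id: int
    rights: int

class Process:
    def __init__(self):
        self.handles = {}        # per-process table; no global namespace
        self._next = 1

    def install(self, object_id: int, rights: int) -> int:
        h = self._next
        self._next += 1
        self.handles[h] = Handle(object_id, rights)
        return h

    def duplicate(self, h: int, rights: int) -> int:
        """Duplicate a handle with equal or narrower rights."""
        src = self.handles[h]
        if rights & ~src.rights:
            raise PermissionError("cannot amplify rights")
        return self.install(src.object_id, rights)
```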
---
## 3. Agent Roles
| Role | Count | Responsibilities |
|------|-------|-----------------|
| Lead API Architect | 1 | Decides overall API philosophy; coordinates with all subsystem councils; owns the master API specification document; resolves conflicts between API and subsystem requirements |
| POSIX Compatibility Analysts | 4 | Audit which POSIX calls must be retained; design the compatibility shim layer; document compatibility coverage gaps |
| GPU Syscall Designers | 6 | Design GPU-native system calls for kernel launch, memory, streams, events, graphs; ensure Hopper/Blackwell/Grace specifics are covered |
| Distributed Syscall Designers | 5 | Design RPC, distributed lock, barrier, and collective operation system calls; consult MPI and RDMA references |
| SDK Designers | 8 | Design language-specific SDKs: 2 per language (C, Rust, Go, Python); responsible for ergonomics, idiom conformance, and ABI stability |
| Async API Specialists | 4 | Design the async call model; specify io_uring-style ring buffer interface; ensure Rust/Go/Python async integration |
| CLI Designers | 3 | Design `distos-ctl` command structure, output formats, and completions |
| Error Handling Architects | 3 | Design structured error types, UCXL trace integration, and error propagation conventions across all SDK layers |
| API Versioning Strategists | 2 | Develop the versioning policy, deprecation lifecycle, compatibility matrix, and experimental API tier |
| Developer Experience Reviewers | 4 | Evaluate API usability; write developer-facing documentation and example applications; run internal "dogfooding" walkthroughs |
**Total:** 40 agents
---
## 4. Key Deliverables
All artifacts use the pattern `ucxl://council-api:{role}@DistOS:api/^^/{artifact-type}/{name}`.
### 4.1 Master API Philosophy Decision Record
```
ucxl://council-api:lead-api-architect@DistOS:api/^^/decisions/dr-api-01-philosophy.md
```
Covers the layered model decision: clean-slate primary API, POSIX compatibility shim, and the cost budget for the shim.
### 4.2 GPU System Call Specification
```
ucxl://council-api:gpu-syscall-designer@DistOS:api/^^/specs/gpu-syscalls.md
```
Full specification of all GPU-native system calls with parameter types, semantics, error codes, and Hopper/Blackwell/Grace specifics.
### 4.3 Distributed System Call Specification
```
ucxl://council-api:distributed-syscall-designer@DistOS:api/^^/specs/distributed-syscalls.md
```
### 4.4 Async Call Interface Specification
```
ucxl://council-api:async-api-specialist@DistOS:api/^^/specs/async-interface.md
```
Documents the submission/completion ring model, back-pressure semantics, and language runtime integration.
### 4.5 C ABI Reference
```
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/c-abi-reference.h
```
The normative C header file. All other SDKs are derived from this.
### 4.6 Language SDK Specifications
```
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-rust.md
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-go.md
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-python.md
```
### 4.7 Error Type Catalogue
```
ucxl://council-api:error-handling-architect@DistOS:api/^^/specs/error-catalogue.md
```
All structured error types with UCXL trace integration, severity levels, and retry guidance.
### 4.8 CLI Specification
```
ucxl://council-api:cli-designer@DistOS:api/^^/specs/distos-ctl-spec.md
```
Full command reference including all subcommands, flags, output formats, and completion scripts.
### 4.9 API Versioning Policy
```
ucxl://council-api:api-versioning-strategist@DistOS:api/^^/policies/versioning-policy.md
```
### 4.10 POSIX Compatibility Coverage Matrix
```
ucxl://council-api:posix-compatibility-analyst@DistOS:api/^^/specs/posix-compatibility-matrix.md
```
Tabulates every POSIX call: supported natively, supported via shim, not supported (with rationale).
### 4.11 Example Applications
```
ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/hello-distributed-gpu.md
ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/allreduce-collective.md
ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/weka-fs-streaming-io.md
```
---
## 5. Decision Points
All DRs use the address pattern `ucxl://council-api:lead-api-architect@DistOS:api/^^/decisions/{dr-id}.md`.
### DP-A01: POSIX vs. Clean-Slate vs. Layered
The foundational design philosophy choice. The default recommendation is the layered model, but this must be validated against: the cost of maintaining the shim layer, the risk of semantic leakage from POSIX into the clean-slate layer, and the developer familiarity benefit.
**Deciding parties:** Lead API Architect, POSIX Compatibility Analysts, `council-synth`
### DP-A02: Async System Call Mechanism
Choose between: io_uring-inspired ring buffer (lowest overhead, Linux precedent), a POSIX-extended `aio_*` interface (familiarity, limited expressiveness), or a fully custom completion port model. This decision is tightly coupled to the `council-mem` memory model (the ring buffer requires shared memory between kernel and userspace).
**Deciding parties:** Async API Specialists, `council-mem`, `council-verify` (for ABI safety check)
### DP-A03: GPU Memory API at the Syscall Layer vs. Library Layer
Should GPU memory allocation (`cuMemAlloc` equivalent) be a kernel-mediated system call (allowing the OS to account for and schedule GPU memory as a first-class resource) or a library call that bypasses the kernel after initial device setup? Kernel mediation adds latency; bypass reduces accounting fidelity.
**Deciding parties:** GPU Syscall Designers, `council-mem`, `council-telemetry`
### DP-A04: RPC Mechanism for Distributed System Calls
Choose the wire protocol for remote procedure calls: gRPC (typed, HTTP/2, mature), a custom binary protocol over RDMA (lowest latency, highest implementation cost), or a two-tier model (gRPC for control plane, RDMA for data plane). The choice directly affects the latency budget for distributed system calls.
**Deciding parties:** Distributed Syscall Designers, `council-net`
### DP-A05: SDK Code Generation vs. Hand-Written Wrappers
Decide whether to generate the Rust, Go, and Python SDKs from a schema definition (IDL, such as Protocol Buffers or a custom DSL) or maintain hand-written wrappers. Generated code is more consistent; hand-written code can be more idiomatic. A hybrid (generate the boilerplate, hand-write ergonomic wrappers) is the likely outcome.
**Deciding parties:** SDK Designers, API Versioning Strategists
### DP-A06: Authentication and Authorisation API
How does a process prove its identity to the kernel and acquire capabilities? Options: token-based (JWT or similar), capability handles (Zircon model), certificate-based (X.509 with a cluster CA), or UCXL-scoped credentials. This decision must be made jointly with `council-sec`.
**Deciding parties:** Lead API Architect, `council-sec`
---
## 6. Dependencies on Other Councils
`council-api` is the integrating council: every subsystem council produces functionality, and `council-api` exposes that functionality through a coherent surface. It is therefore a downstream consumer of requirements from all councils and an upstream provider to `council-docs` and `council-verify`.
| Council | Relationship | What council-api consumes | What council-api produces |
|---------|-------------|--------------------------|--------------------------|
| `council-sched` | Consuming requirements | Job submission semantics, priority model, queue management APIs | Scheduler-facing system calls in API spec |
| `council-mem` | Bidirectional | Memory model, allocation semantics, consistency guarantees | Memory system call specs; async memory API |
| `council-net` | Bidirectional | Network abstraction primitives, RDMA capabilities | Network system calls; distributed RPC wire protocol choice |
| `council-fault` | Consuming requirements | Failure notification model, recovery primitives | Fault-tolerance-related error codes; node failure event API |
| `council-sec` | Bidirectional | Capability model, identity primitives, isolation guarantees | Authentication/authorisation API surface; capability handle design |
| `council-telemetry` | Consuming requirements | Metering call semantics, SLO query interface | Telemetry-facing SDK surface; metering call specs |
| `council-verify` | Providing for verification | N/A | API interface contracts for formal verification |
| `council-qa` | Providing for test design | N/A | API spec enables QA to design conformance tests |
| `council-synth` | Receiving directives | Cross-council conflict resolutions affecting API design | Updates to API spec when directed by synth |
| `council-docs` | Providing for documentation | N/A | All API specs feed directly into the reference documentation |
**Critical path constraint:** `council-api` cannot finalise the distributed system call interface until `council-net` has committed to its RPC and RDMA model (DP-A04 depends on this). GPU system call design can proceed independently from Day 1.
---
## 7. WHOOSH Configuration
### 7.1 Team Formation
```yaml
council_id: council-api
display_name: "API Surface and Developer Experience Council"
target_size: 40
formation_strategy: competency_weighted
required_roles:
  - role: lead-api-architect
    count: 1
    persona: systems-analyst
    competencies: [api-design, posix, distributed-systems, gpu-programming, developer-experience]
  - role: posix-compatibility-analyst
    count: 4
    persona: technical-specialist
    competencies: [posix, linux-kernel, system-calls, abi-stability]
  - role: gpu-syscall-designer
    count: 6
    persona: technical-specialist
    competencies: [cuda, rocm, gpu-memory, hopper-architecture, blackwell-architecture, nvlink]
  - role: distributed-syscall-designer
    count: 5
    persona: technical-specialist
    competencies: [rpc, rdma, mpi-collectives, distributed-locks, grpc]
  - role: sdk-designer
    count: 8
    persona: technical-specialist
    competencies: [c-abi, rust-async, go-concurrency, python-asyncio, ffi, sdk-design]
  - role: async-api-specialist
    count: 4
    persona: technical-specialist
    competencies: [io-uring, async-io, rust-futures, event-driven-design]
  - role: cli-designer
    count: 3
    persona: technical-specialist
    competencies: [cli-design, ux, kubectl-conventions, shell-completion]
  - role: error-handling-architect
    count: 3
    persona: systems-analyst
    competencies: [error-design, structured-errors, distributed-tracing, ucxl]
  - role: api-versioning-strategist
    count: 2
    persona: systems-analyst
    competencies: [api-versioning, semver, deprecation-policy, compatibility]
  - role: developer-experience-reviewer
    count: 4
    persona: technical-writer
    competencies: [developer-documentation, api-usability, example-applications, dogfooding]
```
### 7.2 Quorum Rules
```yaml
quorum:
  decision_threshold: 0.65        # 65% of active agents must agree on API design decisions
  lead_architect_veto: true       # Lead API Architect can block any interface decision
  breaking_change_threshold: 0.85 # Breaking changes require 85% supermajority
  cross_council_approval:
    trigger: api_affects_subsystem
    required: [affected_council_lead, council-synth]
    response_sla_hours: 6
  developer_experience_review:
    trigger: new_public_call
    required: [developer-experience-reviewer_count >= 2]
    purpose: "Ensure every new call meets ergonomics standard before it enters the spec"
```
### 7.3 Subchannels
```yaml
subchannels:
  - id: api-posix-compat
    subscribers: [posix-compatibility-analyst, lead-api-architect]
    purpose: "POSIX coverage analysis, shim design, compatibility gap triage"
    ucxl_feed: "ucxl://council-api:posix-compatibility-analyst@DistOS:api/^^/specs/posix-*"
  - id: api-gpu-syscalls
    subscribers: [gpu-syscall-designer, lead-api-architect, async-api-specialist]
    purpose: "GPU-native system call design; Hopper/Blackwell capability integration"
    ucxl_feed: "ucxl://council-api:gpu-syscall-designer@DistOS:api/^^/specs/gpu-*"
  - id: api-distributed-syscalls
    subscribers: [distributed-syscall-designer, lead-api-architect]
    purpose: "Distributed call design; RPC and RDMA protocol negotiation with council-net"
    ucxl_feed: "ucxl://council-api:distributed-syscall-designer@DistOS:api/^^/specs/distributed-*"
  - id: api-sdk-coordination
    subscribers: [sdk-designer, async-api-specialist, developer-experience-reviewer]
    purpose: "Cross-language SDK consistency; ABI stability coordination"
    ucxl_feed: "ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-*"
  - id: api-error-and-versioning
    subscribers: [error-handling-architect, api-versioning-strategist, lead-api-architect]
    purpose: "Error catalogue development; versioning policy; UCXL trace integration"
    ucxl_feed: "ucxl://council-api:error-handling-architect@DistOS:api/^^/specs/error-*"
  - id: api-cross-council-requirements
    subscribers: [lead-api-architect, distributed-syscall-designer, gpu-syscall-designer]
    purpose: "Inbound requirements from all subsystem councils; tracks what each council needs exposed"
    ucxl_feed: "ucxl://council-*:*@DistOS:*/^^/requirements/api-*"
  - id: api-devex-review
    subscribers: [developer-experience-reviewer, lead-api-architect]
    purpose: "Developer experience walkthroughs; example application drafts; usability feedback"
    ucxl_feed: "ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/*"
```
---
## 8. Success Criteria
1. **Complete API surface:** The master API specification covers all system calls required by all six core subsystem councils. No subsystem has an unaddressed API requirement at the end of Phase 4.
2. **POSIX coverage documented:** The POSIX compatibility matrix exists and classifies every POSIX.1-2017 system call as supported, shim-supported, or explicitly unsupported with rationale.
3. **GPU system calls complete:** All GPU-native system calls for Hopper, Grace, and Blackwell are specified with parameter types, semantics, and error codes. NVLink/NVSwitch topology-aware calls are included.
4. **Distributed system calls complete:** All distributed calls (RPC, locks, barriers, collectives) are specified with failure semantics and consistency guarantees matching the `council-fault` and `council-net` specs.
5. **Four-language SDK specs complete:** C ABI, Rust, Go, and Python SDK specifications exist and have been reviewed for idiomatic correctness by SDK Designers.
6. **Error handling consistent:** All error types are catalogued and every public API call has a documented error table. Every error carries a UCXL trace field.
7. **Versioning policy ratified:** The versioning policy is agreed with `council-synth` and published. The experimental API tier is defined.
8. **Verification-ready contracts:** All interface contracts have been delivered to `council-verify` in Alloy-compatible form by Day 8.
9. **Developer experience validated:** At least three example applications have been written by Developer Experience Reviewers and cover: a simple GPU computation, a distributed collective operation, and a Weka FS streaming I/O pattern.
10. **CLI specification complete:** `distos-ctl` subcommand structure and all primary flags are specified.
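Criterion 6 implies a uniform error shape across the SDKs. As a hedged illustration only — the `DistosError` name, field layout, and the specific code string are assumptions, not ratified spec — one way the Python SDK could back the per-call error tables while keeping a UCXL trace on every error:

```python
from dataclasses import dataclass


@dataclass
class DistosError(Exception):
    """Hypothetical base error for the Python SDK sketch.

    Every error carries a UCXL trace field pointing back to the
    catalogue entry that defines it (success criterion 6).
    """
    code: str        # stable symbolic code from the error catalogue
    message: str     # human-readable summary
    ucxl_trace: str  # catalogue address, e.g. under .../specs/error-*

    def __str__(self) -> str:
        return f"[{self.code}] {self.message} (trace: {self.ucxl_trace})"


# A catalogued error is raised with its trace attached:
err = DistosError(
    code="E-GPU-001",  # placeholder code, not from the real catalogue
    message="kernel launch rejected: stream not owned by caller",
    ucxl_trace="ucxl://council-api:error-handling-architect@DistOS:api/^^/specs/error-*",
)
```

The point of the sketch is the invariant, not the names: no error type without a catalogue code, and no code without a UCXL address a reviewer can follow.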
---
## 9. Timeline
### Phase 1: Research (Days 1-3)
- POSIX Compatibility Analysts catalogue POSIX.1-2017 system calls and assess coverage feasibility
- GPU Syscall Designers survey CUDA Driver API, CUDA Graphs, Hopper/Blackwell architecture documentation, NVLink topology implications
- Distributed Syscall Designers survey MPI collectives, gRPC, RDMA verbs, ZooKeeper/Chubby lock models
- SDK Designers survey language ecosystems: Rust async patterns, Go `cgo` patterns, Python asyncio/CuPy
- Async API Specialists study io_uring interface in depth
- Lead API Architect drafts the API philosophy options paper for DP-A01
- Deliverable: `ucxl://council-api:lead-api-architect@DistOS:api/^^/research/api-philosophy-options.md`
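io_uring's core idea — a submission queue the caller batches operations into and a completion queue the kernel drains results into, with one boundary crossing per batch rather than per operation — is the pattern the Async API Specialists are evaluating. A minimal pure-Python model of the two-queue handshake (the names and structure are illustrative, not a proposed ABI):

```python
from collections import deque


class RingModel:
    """Toy model of an io_uring-style submission/completion queue pair.

    The caller batches operations into the submission queue (SQ),
    submit() processes the batch (standing in for the kernel), and
    results appear on the completion queue (CQ) keyed by user_data.
    """

    def __init__(self):
        self.sq = deque()  # submission entries: (user_data, op)
        self.cq = deque()  # completion entries: (user_data, result)

    def prep(self, user_data, op):
        """Queue an operation without entering the 'kernel'."""
        self.sq.append((user_data, op))

    def submit(self):
        """One batched 'syscall': drain the SQ, push completions to the CQ."""
        n = len(self.sq)
        while self.sq:
            user_data, op = self.sq.popleft()
            self.cq.append((user_data, op()))
        return n

    def completions(self):
        """Reap all available completions."""
        while self.cq:
            yield self.cq.popleft()


ring = RingModel()
ring.prep(1, lambda: "read-done")
ring.prep(2, lambda: "write-done")
submitted = ring.submit()           # one boundary crossing for two ops
results = dict(ring.completions())  # {1: 'read-done', 2: 'write-done'}
```

What makes this pattern attractive for an async-first surface is that the `user_data` tag maps cleanly onto a Rust future, a Go channel send, or a Python `asyncio.Future` resolution on the SDK side.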
### Phase 2: Architecture (Days 3-6)

- Resolve DP-A01 (philosophy), DP-A02 (async mechanism), DP-A04 (RPC wire protocol), DP-A06 (auth/authz) — all in consultation with relevant councils
- Lead API Architect drafts the call taxonomy: which calls belong in which layer (kernel/shim/library)
- GPU Syscall Designers draft the GPU system call prototype spec for Hopper and Blackwell
- Distributed Syscall Designers draft the distributed call prototype spec, contingent on DP-A04 resolution
- Error Handling Architects draft the error type taxonomy and UCXL trace integration
- Deliverable: `ucxl://council-api:lead-api-architect@DistOS:api/^^/research/call-taxonomy.md`
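The call taxonomy assigns every public call to exactly one layer of the kernel/shim/library split. A sketch of how that classification might be tabulated and enforced — the specific call assignments here are placeholders for illustration, not decisions, which come out of DP-A01:

```python
from enum import Enum


class Layer(Enum):
    KERNEL = "kernel"      # true system call, stable ABI
    SHIM = "shim"          # POSIX compatibility layer over native calls
    LIBRARY = "library"    # userspace SDK convenience, no ABI guarantee


# Placeholder assignments for illustration only.
CALL_TAXONOMY = {
    "gpu_kernel_launch": Layer.KERNEL,
    "dist_barrier":      Layer.KERNEL,
    "posix_open":        Layer.SHIM,
    "stream_pipeline":   Layer.LIBRARY,
}


def layer_of(call: str) -> Layer:
    """Look up a call's layer; an unclassified call is a spec gap."""
    try:
        return CALL_TAXONOMY[call]
    except KeyError:
        raise KeyError(f"unclassified call: {call} (taxonomy incomplete)")
```

Keeping the taxonomy as a single machine-readable table makes the Phase 4 completeness check (no subsystem with an unaddressed API requirement) a lookup rather than a document audit.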
### Phase 3: Formal Specification (Days 6-10)
- Full API spec written: GPU syscalls, distributed syscalls, async interface, C ABI reference
- Language SDK specifications written in parallel by SDK Designers
- Error catalogue completed and UCXL trace integration specified
- Alloy interface contracts delivered to `council-verify` for structural verification
- CLI specification drafted by CLI Designers
- POSIX compatibility matrix completed
- Deliverable: `ucxl://council-api:gpu-syscall-designer@DistOS:api/^^/specs/gpu-syscalls.md` and all companion specs
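The POSIX compatibility matrix reduces to a three-way classification with a mandatory rationale on the unsupported rows. A hedged sketch of one possible row format — the call names and verdicts below are examples, not entries from the actual matrix:

```python
from dataclasses import dataclass
from enum import Enum


class PosixStatus(Enum):
    SUPPORTED = "supported"      # native kernel implementation
    SHIM = "shim-supported"      # emulated by the compatibility layer
    UNSUPPORTED = "unsupported"  # explicitly rejected, rationale required


@dataclass(frozen=True)
class MatrixRow:
    syscall: str
    status: PosixStatus
    rationale: str = ""

    def __post_init__(self):
        # Unsupported entries must carry a rationale (success criterion 2).
        if self.status is PosixStatus.UNSUPPORTED and not self.rationale:
            raise ValueError(f"{self.syscall}: unsupported without rationale")


# Example rows; the fork verdict is a hypothetical, not a council decision.
rows = [
    MatrixRow("read", PosixStatus.SUPPORTED),
    MatrixRow("fork", PosixStatus.UNSUPPORTED,
              "hypothetical: distributed task model, no fork semantics"),
]
```

Encoding the rationale requirement as a constructor check means an incomplete matrix fails loudly during spec tooling runs instead of surfacing as a blank cell in review.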
### Phase 4: Integration (Days 10-12)
- Resolve any outstanding API requirements from subsystem councils surfaced during their Phase 3 spec work
- DP-A03 and DP-A05 resolved with full DR records
- API versioning policy ratified by `council-synth`
- Developer Experience Reviewers conduct walkthroughs of all three example applications
- Deliver final interface contracts to `council-verify` for re-verification after any Phase 3 changes
- Deliverable: Versioning policy, three example applications
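One plausible shape for the ratified policy is that stable calls follow semantic versioning while the experimental tier opts out of compatibility guarantees entirely. That can be sketched as a single compatibility predicate — the `0.y.z` marker for the experimental tier and the rules below are assumptions pending the `council-synth` decision:

```python
def compatible(provided: str, required: str) -> bool:
    """Return True if an SDK built against `required` can run on `provided`.

    Assumed rules: experimental APIs ("0.y.z") require an exact version
    match; stable APIs require the same major version and a provided
    minor version >= the required minor version.
    """
    p_major, p_minor, _ = (int(x) for x in provided.split("."))
    r_major, r_minor, _ = (int(x) for x in required.split("."))
    if r_major == 0:  # experimental tier: no guarantees across versions
        return provided == required
    return p_major == r_major and p_minor >= r_minor


compatible("1.4.0", "1.2.7")  # → True: stable, minor additions are fine
compatible("0.3.0", "0.2.0")  # → False: experimental, exact match only
```

A predicate like this is what the SDK loaders and `distos-ctl` would evaluate at startup, so the policy document and the enforcement code cannot drift apart.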
### Phase 5: Documentation (Days 12-14)
- Developer Experience Reviewers produce the developer-facing API reference document
- SDK Designers produce getting-started guides for each language
- All specs integrated into the master DistOS specification document via `council-docs`
- Final UCXL navigability check: every API call traces back to the council decision that introduced it
- Deliverable: `ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/docs/api-reference.md`