# Council Design Brief: API Surface and Developer Experience

**Council ID:** `council-api`

**Mission:** Define the complete, coherent, and ergonomic interface between DistOS and its users (operators, application developers, and other systems). This council decides what the operating system looks like from the outside: system calls, SDK bindings, CLI tools, and the conventions that keep all of the above consistent and maintainable across language boundaries and API versions.

**UCXL Base Address:** `ucxl://council-api:*@DistOS:api/*`

**Agent Count:** ~40

**Status:** Design Brief (Constitution Phase)

---

## 1. Scope and Responsibilities

`council-api` owns the external interface contract of DistOS. Its scope covers:

- Deciding the overall API philosophy: POSIX-compatible extension, clean-slate design, or a layered model that offers both
- Defining GPU-native system calls for kernel launch, memory allocation, device-to-device transfers, stream and graph management, and event synchronisation
- Defining distributed system calls: remote procedure invocation (covering both synchronous RPC and async futures), distributed lock acquisition and release, barriers, and collective operations across node groups
- Designing an async-first API surface that aligns with modern language runtimes (Rust `async`/`await`, Go goroutines, Python `asyncio`)
- Establishing error handling conventions, including integration with UCXL response codes for errors that carry provenance (which node, which operation, at what logical time)
- Designing the SDK for four target languages: C (ABI-stable systems interface), Rust (idiomatic, zero-cost), Go (ergonomic, channel-friendly), and Python (user-friendly, numpy-compatible)
- Designing CLI tooling for cluster management: node status, job submission, resource inspection, log retrieval, and administrative operations
- Defining the API versioning and evolution strategy: how new calls are introduced, how deprecated calls are retired, compatibility
guarantees across minor and major versions
- Producing API reference documentation that is precise enough to serve as a normative source alongside the formal spec
- Specifying example applications that exercise non-trivial API paths and serve as integration test targets

Responsibilities this council does **not** own: kernel implementation (owned by subsystem councils); formal verification of API contracts (owned by `council-verify`); security policy enforcement (owned by `council-sec`, though `council-api` designs the authentication and authorisation API surface in coordination with it); monitoring and metering calls (owned by `council-telemetry`, though `council-api` exposes the SDK surface for those).

---

## 2. Research Domains

### 2.1 POSIX Compatibility vs. Clean-Slate Design

POSIX (IEEE 1003.1) defines the canonical Unix system call interface. Its strengths are near-universal language runtime support, a mature ecosystem of tools, and decades of developer familiarity. Its weaknesses in a GPU-cluster OS context are blocking I/O semantics that assume CPU-thread models, file-descriptor-centric resource management ill-suited to GPU memory objects, and no native concept of distributed operations or remote memory.

Three design philosophies must be fully researched before the council can decide:

- **POSIX-compatible extension:** Retain the full POSIX interface and extend it with GPU and distributed primitives as optional add-ons. Applications written for Linux run unmodified; GPU-aware applications opt into extensions. This is the approach taken by CUDA (which layers a driver API on top of the OS) and by ROCm/HIP.
- **Clean-slate design:** Design an interface optimal for the DistOS hardware target without backward-compatibility constraints. This allows stronger type safety, async-native semantics, and a capability-based resource model from the first call. Plan 9 (Pike et al.) and Fuchsia (Zircon) are the primary existence proofs.
- **Layered model:** Provide a clean-slate primary API and a POSIX compatibility layer implemented on top of it. This is the architectural recommendation for evaluation. The compatibility layer has a defined cost budget.

Key references:

- The Open Group. *The Single UNIX Specification (SUSv4/POSIX.1-2017)*. The normative POSIX reference.
- Pike, R. et al. "Plan 9 from Bell Labs." *Proceedings of the Summer 1990 UKUUG Conference*. Plan 9's contribution is the 9P protocol: everything is a file, including processes and network connections. The simplicity of the resource model is instructive even if DistOS does not adopt 9P verbatim.
- Pike, R. et al. "The Use of Name Spaces in Plan 9." *EUUG Newsletter* 12(1), 1992.
- Google. *Fuchsia OS: Zircon Kernel Objects*. https://fuchsia.dev/fuchsia-src/concepts/kernel. Zircon uses a capability-based object system with handles as the only way to reference kernel objects. This is the most complete modern clean-slate OS design and must be studied in depth.

### 2.2 GPU-Native System Calls

The CUDA Driver API provides the lowest-level GPU control surface available: `cuInit`, `cuDeviceGet`, `cuCtxCreate`, `cuMemAlloc`, `cuLaunchKernel`, `cuEventRecord`, `cuStreamWaitEvent`. It is the reference for what a GPU system call interface must cover. Agents must evaluate the tradeoffs between:

- **Driver-level API** (CUDA Driver API / ROCm HIP Low-Level): explicit context management, explicit stream management, maximum control, verbose
- **Runtime API** (CUDA Runtime / ROCm): implicit context, automatic stream assignment, less control, more ergonomic
- **Graph-based execution** (CUDA Graphs / HIP Graphs): capture a sequence of operations as a graph for repeated execution with lower launch overhead. Critical for the 1024-node deployment where kernel launch overhead accumulates.

Key references:

- NVIDIA. *CUDA Driver API Reference Manual*. https://docs.nvidia.com/cuda/cuda-driver-api/. Normative reference for GPU system call semantics.
- NVIDIA.
*CUDA C Programming Guide* (Chapter 3: Programming Interface). Covers the Runtime API and its relationship to the Driver API.
- NVIDIA. *CUDA Graphs* documentation. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs. The graph execution model is essential for understanding low-latency repeated workloads on Hopper and Blackwell.
- Khronos Group. *OpenCL 3.0 Specification*. https://www.khronos.org/opencl/. The vendor-neutral GPU programming API. DistOS must decide whether to support OpenCL alongside CUDA semantics.
- Khronos Group. *SYCL 2020 Specification*. https://www.khronos.org/sycl/. SYCL provides a C++ abstraction over OpenCL and oneAPI targets. Intel's oneAPI aims to unify GPU programming across vendors and is a candidate for the DistOS higher-level SDK layer.
- Intel. *oneAPI Programming Guide*. https://www.intel.com/content/www/us/en/developer/tools/oneapi/programming-guide.html.
- NVIDIA. *NVLink and NVSwitch Architecture Overview*. https://www.nvidia.com/en-us/data-center/nvlink/. GPU-to-GPU direct access semantics affect memory system call design.

Blackwell-specific: The GB200 NVL72 introduces the NVLink Switch System, connecting 72 GPUs in a single flat memory domain. System calls equivalent to `cuMemAdvise` and `cuMemPrefetchAsync` take on new semantics in this topology. Agents must review:

- NVIDIA. *NVIDIA Blackwell Architecture Technical Brief*. 2024.

### 2.3 Distributed System Calls

System calls that span nodes are novel: POSIX has no notion of them. The design space covers:

- **Remote procedure invocation:** How does a process on node A invoke a procedure on node B? Synchronous blocking (simple, latency-bound), asynchronous with futures (complex, scalable), or continuation-passing. gRPC is the de facto standard for service-to-service RPC in the cloud but carries HTTP/2 overhead.
- **Distributed locks:** Lease-based locks (Chubby/ZooKeeper model), RDMA-based compare-and-swap (best latency), or consensus-based locks for strong guarantees.
Each has different failure semantics.
- **Barriers:** Collective synchronisation across node groups. `MPI_Barrier` semantics are well understood; the question is how to expose this in a general-purpose OS API.
- **Collective operations:** AllReduce, AllGather, Broadcast, Reduce-Scatter. These are first-class operations for distributed ML workloads (the dominant use case on a 1024-node GPU cluster) and must be surfaced as OS-level calls, not just library calls, so the OS can optimise placement and routing.

Key references:

- Birrell, A. and Nelson, B. "Implementing Remote Procedure Calls." *ACM Transactions on Computer Systems* 2(1), 1984. The foundational RPC paper.
- Google. *gRPC*. https://grpc.io/. The current industry standard for typed RPC. Protocol Buffers schema evolution strategy is directly applicable to DistOS API versioning.
- Burrows, M. "The Chubby Lock Service for Loosely-Coupled Distributed Systems." *OSDI 2006*.
- Hunt, P. et al. "ZooKeeper: Wait-free Coordination for Internet-scale Systems." *USENIX ATC 2010*.
- Message Passing Interface Forum. *MPI: A Message-Passing Interface Standard, Version 4.1*. 2023. The collective operations specification is normative for `council-api`'s collective call design.
- Mellanox/NVIDIA. *RDMA Programming Guide*. The InfiniBand verbs API (`ibv_post_send`, `ibv_post_recv`, `ibv_create_qp`) provides the lowest-latency distributed memory access primitives available on the target cluster.

### 2.4 Async-First API Design

A GPU cluster OS serving AI workloads will have I/O patterns dominated by deep asynchrony: thousands of in-flight kernel launches, streaming data from Weka FS, collective comms across 1024 nodes. A synchronous-only API would be a fundamental design mistake. Agents must research:

- **Rust async/await:** The Rust async model (the `Future` trait, its `Poll` result type, and the executor model) provides zero-cost abstraction over async I/O. The `tokio` runtime is the dominant executor. The DistOS Rust SDK must integrate naturally with tokio.
- **io_uring (Linux 5.1+):** The io_uring interface provides shared ring buffers between kernel and userspace, so many I/O operations can be submitted and completed without a system call per operation. Its submission/completion queue model is the reference for how DistOS should design its own async system call interface.
- **Go channels and goroutines:** Go's concurrency model maps well to distributed operations. The DistOS Go SDK must express distributed calls as channels or via the `context.Context` cancellation pattern.
- **Python asyncio:** The Python SDK must be usable from `async def` coroutines. NumPy compatibility for GPU tensor operations should be considered (compatibility with the Numba/CuPy interface).

Key references:

- Axboe, J. *Efficient IO with io_uring*. https://kernel.dk/io_uring.pdf. 2019. This paper is essential for understanding the state of the art in async syscall design.
- The Rust Async Book. https://rust-lang.github.io/async-book/. Normative reference for Rust async design patterns.
- Grigorik, I. *High Performance Browser Networking* (Chapter 2 on event loop and async I/O patterns). 2013. O'Reilly. Useful background on event-driven I/O design.

### 2.5 Error Handling Conventions

A cluster OS at this scale will produce a high volume of partial failures: a node goes dark, a GPU kernel faults, a network partition isolates a subsystem. The error handling convention must be:

- **Structured:** Every error carries a type, a severity, a source identifier (node, subsystem, call), and a correlation ID that links it to a UCXL-addressed event in the distributed log.
- **Actionable:** The API must distinguish between errors that the caller should retry (transient), errors that require intervention (permanent), and errors that indicate a usage mistake (programmer error).
- **Traceable:** Error correlation IDs must be UCXL-compatible so that an error returned to a Python application can be resolved to the full distributed event chain using the UCXL resolver.
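
The three properties above can be sketched as a small data type plus a retry helper. This is only an illustration under stated assumptions: the names `ErrorInfo`, `Disposition`, and `call_with_retry`, the field layout, and the convention that an operation returns a `(value, error)` pair are all hypothetical and are not part of any agreed DistOS API.

```python
from dataclasses import dataclass
from enum import Enum


class Disposition(Enum):
    """How the caller is expected to react (the 'Actionable' property)."""
    TRANSIENT = "transient"                # caller may retry
    PERMANENT = "permanent"                # requires intervention
    PROGRAMMER_ERROR = "programmer_error"  # usage mistake at the call site


@dataclass(frozen=True)
class ErrorInfo:
    """Hypothetical structured error record (the 'Structured' property)."""
    code: str          # error type, e.g. "UNAVAILABLE"
    severity: str      # e.g. "warning", "error"
    node: str          # source identifier: node
    subsystem: str     # source identifier: subsystem
    call: str          # source identifier: call that failed
    disposition: Disposition
    ucxl_trace: str    # correlation ID as a UCXL address ('Traceable')

    def should_retry(self) -> bool:
        return self.disposition is Disposition.TRANSIENT


def call_with_retry(op, max_attempts=3):
    """Run op() up to max_attempts times.

    op is assumed to return (value, None) on success or (None, ErrorInfo)
    on failure. Only transient errors are retried.
    """
    last_error = None
    for _ in range(max_attempts):
        value, err = op()
        if err is None:
            return value, None
        last_error = err
        if not err.should_retry():
            break
    return None, last_error
```

A caller would wrap an SDK operation in `call_with_retry` and, on final failure, hand `ucxl_trace` to the UCXL resolver. Back-off between attempts is deliberately left out of the sketch.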

Key references:

- Google. *Google Cloud API Design Guide: Errors*. https://cloud.google.com/apis/design/errors. The most systematic public treatment of structured API error design. The canonical status codes (OK, INVALID_ARGUMENT, NOT_FOUND, UNAVAILABLE, etc.) should be adopted or adapted.
- Klabnik, S. and Nichols, C. *The Rust Programming Language* (Chapter 9: Error Handling). The Rust approach to `Result` and the `?` operator represents the state of the art for recoverable errors in a systems language.
- Syme, D. et al. "Exceptional Syntactic Support for Error Handling in F#." *Haskell Symposium 2020*. Relevant to the higher-level SDK error design.

The UCXL response code integration specifically means that API error structs carry a `ucxl_trace` field containing the UCXL address of the distributed event that caused the failure:

```
error.ucxl_trace = "ucxl://council-fault:monitor@DistOS:fault-tolerance/^^/events/node-042-timeout-2026-03-01T14:22:00Z"
```

### 2.6 SDK Design for Multiple Languages

The SDK must present a coherent surface across four languages with different idioms. The design principles are:

- **C ABI as the foundation:** The canonical system call interface is a C ABI. All other language SDKs are generated or hand-written wrappers over the C ABI. This ensures ABI stability and FFI compatibility with every language.
- **Rust SDK:** Idiomatic, zero-cost wrappers. Use Rust's ownership system to enforce resource lifetimes at compile time (e.g., a `GpuBuffer` type that is `Send` but not `Sync`, reflecting GPU buffer ownership semantics). The Rust SDK should use `#[repr(C)]` structs for ABI compatibility.
- **Go SDK:** Ergonomic wrappers using `cgo` for the C ABI. Expose distributed operations as channel-returning functions. Context-aware: all calls accept `context.Context` for cancellation and timeout propagation.
- **Python SDK:** High-level, NumPy-compatible. Consider auto-generating stub code from a schema. Must be `asyncio`-compatible.
Integrate with the Python type system via `Protocol` and `TypedDict`.

Key references:

- Klabnik, S. and Nichols, C. *The Rust Programming Language*. https://doc.rust-lang.org/book/. Idiomatic Rust patterns.
- Go Authors. *Effective Go*. https://go.dev/doc/effective_go. Idiomatic Go patterns.
- Google. *Google Cloud API Design Guide*. https://cloud.google.com/apis/design. The most comprehensive public API design guide, covering resource-oriented design, standard methods, naming conventions, and backwards compatibility.
- Smith, P. *Designing for Compatibility in Evolving APIs*. IEEE Software 39(4), 2022.

### 2.7 CLI Tooling Design

The cluster management CLI (`distos-ctl` or equivalent) must follow modern CLI design principles:

- Machine-readable output (JSON/YAML with `--output json`) for scripting
- Structured logging with log levels
- Human-readable default output with colour and progress indicators
- Completion generation for bash/zsh/fish
- Subcommand structure: `node`, `job`, `gpu`, `net`, `storage`, `secret`, `log`

Key references:

- Prasad, A. et al. *Command Line Interface Guidelines*. https://clig.dev/. The community-written standard for modern CLI design. Should be treated as the style guide for `distos-ctl`.
- HashiCorp. *Vault CLI design*. The Vault CLI is an exemplar of a well-structured cluster management tool with consistent subcommand and flag conventions.
- Kubernetes. `kubectl` source and documentation. The de facto standard for distributed cluster management CLIs. The DistOS CLI should match `kubectl` conventions where applicable to reduce cognitive load.

### 2.8 API Versioning and Evolution Strategy

A system call interface must be stable. The versioning strategy must address:

- **Compatibility guarantees:** What changes are backwards-compatible (adding optional parameters, adding new calls) vs. breaking (changing parameter semantics, removing calls)?
- **Deprecation lifecycle:** Minimum deprecation notice period, deprecation markers in the SDK, removal schedule.
- **Version negotiation:** How does a client indicate the API version it was compiled against? How does the kernel report available versions?
- **Experimental APIs:** A clearly marked experimental tier for new calls before they enter the stable surface.

Key references:

- Google. *Google Cloud API Versioning*. https://cloud.google.com/apis/design/versioning. URL-based versioning for REST APIs; the principles apply to system call versioning.
- Turon, A. and Matsakis, N. "Stability as a Deliverable." https://blog.rust-lang.org/2014/10/30/Stability.html. Rust's stability commitment is a model for how a systems project can make and keep compatibility promises.
- Semantic Versioning Specification. https://semver.org/. The DistOS SDK and ABI will follow SemVer 2.0.0.

### 2.9 Plan 9 and Fuchsia Zircon Deep Dive

These two systems represent the clearest non-POSIX OS API designs and must be studied in depth:

- **Plan 9:** The 9P protocol represents all system resources as files served over a file system protocol. Network connections, processes, and graphics are files. The simplicity is extreme. The DistOS clean-slate layer need not adopt 9P but should understand its design philosophy.
  - Pike, R. et al. "The Use of Name Spaces in Plan 9." *EUUG Newsletter* 12(1), 1992.
  - Dorward, S. et al. "The Inferno Operating System." *Bell Labs Technical Journal* 2(1), 1997.
- **Fuchsia / Zircon:** Zircon is a microkernel with capabilities as the security primitive. Every kernel resource is a `zx_handle_t`. Handles are passed between processes explicitly; there is no global namespace for kernel objects. This is the preferred model for DistOS's capability integration with `council-sec`.
  - Google. *Zircon Kernel Concepts*. https://fuchsia.dev/fuchsia-src/concepts/kernel/concepts.
  - Google. *Zircon Syscall Reference*. https://fuchsia.dev/fuchsia-src/reference/syscalls.

---

## 3. Agent Roles

| Role | Count | Responsibilities |
|------|-------|-----------------|
| Lead API Architect | 1 | Decides overall API philosophy; coordinates with all subsystem councils; owns the master API specification document; resolves conflicts between API and subsystem requirements |
| POSIX Compatibility Analysts | 4 | Audit which POSIX calls must be retained; design the compatibility shim layer; document compatibility coverage gaps |
| GPU Syscall Designers | 6 | Design GPU-native system calls for kernel launch, memory, streams, events, graphs; ensure Hopper/Blackwell/Grace specifics are covered |
| Distributed Syscall Designers | 5 | Design RPC, distributed lock, barrier, and collective operation system calls; consult MPI and RDMA references |
| SDK Designers | 8 | Design language-specific SDKs: 2 per language (C, Rust, Go, Python); responsible for ergonomics, idiom conformance, and ABI stability |
| Async API Specialists | 4 | Design the async call model; specify io_uring-style ring buffer interface; ensure Rust/Go/Python async integration |
| CLI Designers | 3 | Design `distos-ctl` command structure, output formats, and completions |
| Error Handling Architects | 3 | Design structured error types, UCXL trace integration, and error propagation conventions across all SDK layers |
| API Versioning Strategists | 2 | Develop the versioning policy, deprecation lifecycle, compatibility matrix, and experimental API tier |
| Developer Experience Reviewers | 4 | Evaluate API usability; write developer-facing documentation and example applications; run internal "dogfooding" walkthroughs |

**Total:** 40 agents

---

## 4. Key Deliverables

All artifacts use the pattern `ucxl://council-api:{role}@DistOS:api/^^/{artifact-type}/{name}`.
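
As a rough illustration of that pattern, the helper below fills in the three placeholders. The function name `artifact_address` and its validation rules (no empty segments, no `/`, no whitespace) are assumptions made for this sketch; the brief itself only fixes the address template.

```python
def artifact_address(role: str, artifact_type: str, name: str) -> str:
    """Build a council-api artifact address from the documented template.

    The validation below is a hypothetical safeguard, not a UCXL rule:
    each segment must be non-empty and contain no '/' or whitespace.
    """
    segments = (("role", role), ("artifact_type", artifact_type), ("name", name))
    for label, value in segments:
        if not value or "/" in value or any(ch.isspace() for ch in value):
            raise ValueError(f"invalid {label}: {value!r}")
    return f"ucxl://council-api:{role}@DistOS:api/^^/{artifact_type}/{name}"


if __name__ == "__main__":
    # Reproduces the GPU system call specification address from section 4.2.
    print(artifact_address("gpu-syscall-designer", "specs", "gpu-syscalls.md"))
```
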

### 4.1 Master API Philosophy Decision Record

```
ucxl://council-api:lead-api-architect@DistOS:api/^^/decisions/dr-api-01-philosophy.md
```

Covers the layered model decision: clean-slate primary API, POSIX compatibility shim, and the cost budget for the shim.

### 4.2 GPU System Call Specification

```
ucxl://council-api:gpu-syscall-designer@DistOS:api/^^/specs/gpu-syscalls.md
```

Full specification of all GPU-native system calls with parameter types, semantics, error codes, and Hopper/Blackwell/Grace specifics.

### 4.3 Distributed System Call Specification

```
ucxl://council-api:distributed-syscall-designer@DistOS:api/^^/specs/distributed-syscalls.md
```

### 4.4 Async Call Interface Specification

```
ucxl://council-api:async-api-specialist@DistOS:api/^^/specs/async-interface.md
```

Documents the submission/completion ring model, back-pressure semantics, and language runtime integration.

### 4.5 C ABI Reference

```
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/c-abi-reference.h
```

The normative C header file. All other SDKs are derived from this.

### 4.6 Language SDK Specifications

```
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-rust.md
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-go.md
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-python.md
```

### 4.7 Error Type Catalogue

```
ucxl://council-api:error-handling-architect@DistOS:api/^^/specs/error-catalogue.md
```

All structured error types with UCXL trace integration, severity levels, and retry guidance.

### 4.8 CLI Specification

```
ucxl://council-api:cli-designer@DistOS:api/^^/specs/distos-ctl-spec.md
```

Full command reference including all subcommands, flags, output formats, and completion scripts.
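
To make the shape of such a command reference concrete, here is a minimal sketch of the top-level `distos-ctl` parser using Python's standard `argparse`. The subcommand names and the `--output json` flag come from this brief; the positional `action` argument, the placeholder result dictionary, and the decision to put `--output` on the top-level parser are assumptions for illustration. YAML output is omitted because the standard library has no YAML serialiser.

```python
import argparse
import json

# Subcommand names as listed in the CLI tooling section of this brief.
SUBCOMMANDS = ["node", "job", "gpu", "net", "storage", "secret", "log"]


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="distos-ctl")
    # Global flag: must appear before the subcommand in this sketch.
    parser.add_argument("--output", choices=["text", "json"], default="text")
    sub = parser.add_subparsers(dest="resource", required=True)
    for name in SUBCOMMANDS:
        cmd = sub.add_parser(name)
        # Hypothetical verb such as "status" or "list"; not specified by the brief.
        cmd.add_argument("action")
    return parser


def render(result: dict, output: str) -> str:
    """Format a result for scripts (JSON) or for humans (key: value lines)."""
    if output == "json":
        return json.dumps(result, sort_keys=True)
    return "\n".join(f"{key}: {value}" for key, value in result.items())


def main(argv: list) -> str:
    args = build_parser().parse_args(argv)
    # Placeholder payload; a real tool would query the cluster here.
    result = {"resource": args.resource, "action": args.action, "status": "not-implemented"}
    return render(result, args.output)


if __name__ == "__main__":
    import sys
    print(main(sys.argv[1:]))
```

Invoked as `distos-ctl --output json node status`, the sketch emits a single JSON object that a script can parse; without the flag it falls back to human-readable lines.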

### 4.9 API Versioning Policy

```
ucxl://council-api:api-versioning-strategist@DistOS:api/^^/policies/versioning-policy.md
```

### 4.10 POSIX Compatibility Coverage Matrix

```
ucxl://council-api:posix-compatibility-analyst@DistOS:api/^^/specs/posix-compatibility-matrix.md
```

Tabulates every POSIX call: supported natively, supported via shim, not supported (with rationale).

### 4.11 Example Applications

```
ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/hello-distributed-gpu.md
ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/allreduce-collective.md
ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/weka-fs-streaming-io.md
```

---

## 5. Decision Points

All DRs use the address pattern `ucxl://council-api:lead-api-architect@DistOS:api/^^/decisions/{dr-id}.md`.

### DP-A01: POSIX vs. Clean-Slate vs. Layered

The foundational design philosophy choice. The default recommendation is the layered model, but this must be validated against: the cost of maintaining the shim layer, the risk of semantic leakage from POSIX into the clean-slate layer, and the developer familiarity benefit.

**Deciding parties:** Lead API Architect, POSIX Compatibility Analysts, `council-synth`

### DP-A02: Async System Call Mechanism

Choose between: io_uring-inspired ring buffer (lowest overhead, Linux precedent), a POSIX-extended `aio_*` interface (familiarity, limited expressiveness), or a fully custom completion port model. This decision is tightly coupled to the `council-mem` memory model (the ring buffer requires shared memory between kernel and userspace).

**Deciding parties:** Async API Specialists, `council-mem`, `council-verify` (for ABI safety check)

### DP-A03: GPU Memory API at the Syscall Layer vs. Library Layer

Should GPU memory allocation (`cuMemAlloc` equivalent) be a kernel-mediated system call (allowing the OS to account for and schedule GPU memory as a first-class resource) or a library call that bypasses the kernel after initial device setup? Kernel mediation adds latency; bypass reduces accounting fidelity.

**Deciding parties:** GPU Syscall Designers, `council-mem`, `council-telemetry`

### DP-A04: RPC Mechanism for Distributed System Calls

Choose the wire protocol for remote procedure calls: gRPC (typed, HTTP/2, mature), a custom binary protocol over RDMA (lowest latency, highest implementation cost), or a two-tier model (gRPC for control plane, RDMA for data plane). The choice directly affects the latency budget for distributed system calls.

**Deciding parties:** Distributed Syscall Designers, `council-net`

### DP-A05: SDK Code Generation vs. Hand-Written Wrappers

Decide whether to generate the Rust, Go, and Python SDKs from a schema definition (IDL, such as Protocol Buffers or a custom DSL) or maintain hand-written wrappers. Generated code is more consistent; hand-written code can be more idiomatic. A hybrid (generate the boilerplate, hand-write ergonomic wrappers) is the likely outcome.

**Deciding parties:** SDK Designers, API Versioning Strategists

### DP-A06: Authentication and Authorisation API

How does a process prove its identity to the kernel and acquire capabilities? Options: token-based (JWT or similar), capability handles (Zircon model), certificate-based (X.509 with a cluster CA), or UCXL-scoped credentials. This decision must be made jointly with `council-sec`.

**Deciding parties:** Lead API Architect, `council-sec`

---

## 6. Dependencies on Other Councils

`council-api` is the integrating council: every subsystem council produces functionality, and `council-api` exposes that functionality through a coherent surface.
It is therefore a downstream consumer of requirements from all councils and an upstream provider to `council-docs` and `council-verify`.

| Council | Relationship | What council-api consumes | What council-api produces |
|---------|-------------|--------------------------|--------------------------|
| `council-sched` | Consuming requirements | Job submission semantics, priority model, queue management APIs | Scheduler-facing system calls in API spec |
| `council-mem` | Bidirectional | Memory model, allocation semantics, consistency guarantees | Memory system call specs; async memory API |
| `council-net` | Bidirectional | Network abstraction primitives, RDMA capabilities | Network system calls; distributed RPC wire protocol choice |
| `council-fault` | Consuming requirements | Failure notification model, recovery primitives | Fault-tolerance-related error codes; node failure event API |
| `council-sec` | Bidirectional | Capability model, identity primitives, isolation guarantees | Authentication/authorisation API surface; capability handle design |
| `council-telemetry` | Consuming requirements | Metering call semantics, SLO query interface | Telemetry-facing SDK surface; metering call specs |
| `council-verify` | Providing for verification | N/A | API interface contracts for formal verification |
| `council-qa` | Providing for test design | N/A | API spec enables QA to design conformance tests |
| `council-synth` | Receiving directives | Cross-council conflict resolutions affecting API design | Updates to API spec when directed by synth |
| `council-docs` | Providing for documentation | N/A | All API specs feed directly into the reference documentation |

**Critical path constraint:** `council-api` cannot finalise the distributed system call interface until `council-net` has committed to its RPC and RDMA model (DP-A04 depends on this). GPU system call design can proceed independently from Day 1.

---

## 7. WHOOSH Configuration

### 7.1 Team Formation

```yaml
council_id: council-api
display_name: "API Surface and Developer Experience Council"
target_size: 40
formation_strategy: competency_weighted

required_roles:
  - role: lead-api-architect
    count: 1
    persona: systems-analyst
    competencies: [api-design, posix, distributed-systems, gpu-programming, developer-experience]
  - role: posix-compatibility-analyst
    count: 4
    persona: technical-specialist
    competencies: [posix, linux-kernel, system-calls, abi-stability]
  - role: gpu-syscall-designer
    count: 6
    persona: technical-specialist
    competencies: [cuda, rocm, gpu-memory, hopper-architecture, blackwell-architecture, nvlink]
  - role: distributed-syscall-designer
    count: 5
    persona: technical-specialist
    competencies: [rpc, rdma, mpi-collectives, distributed-locks, grpc]
  - role: sdk-designer
    count: 8
    persona: technical-specialist
    competencies: [c-abi, rust-async, go-concurrency, python-asyncio, ffi, sdk-design]
  - role: async-api-specialist
    count: 4
    persona: technical-specialist
    competencies: [io-uring, async-io, rust-futures, event-driven-design]
  - role: cli-designer
    count: 3
    persona: technical-specialist
    competencies: [cli-design, ux, kubectl-conventions, shell-completion]
  - role: error-handling-architect
    count: 3
    persona: systems-analyst
    competencies: [error-design, structured-errors, distributed-tracing, ucxl]
  - role: api-versioning-strategist
    count: 2
    persona: systems-analyst
    competencies: [api-versioning, semver, deprecation-policy, compatibility]
  - role: developer-experience-reviewer
    count: 4
    persona: technical-writer
    competencies: [developer-documentation, api-usability, example-applications, dogfooding]
```

### 7.2 Quorum Rules

```yaml
quorum:
  decision_threshold: 0.65  # 65% of active agents must agree on API design decisions
  lead_architect_veto: true  # Lead API Architect can block any interface decision
  breaking_change_threshold: 0.85  # Breaking changes require 85% supermajority
  cross_council_approval:
    trigger: api_affects_subsystem
    required: [affected_council_lead, council-synth]
    response_sla_hours: 6
  developer_experience_review:
    trigger: new_public_call
    required: [developer-experience-reviewer_count >= 2]
    purpose: "Ensure every new call meets ergonomics standard before it enters the spec"
```

### 7.3 Subchannels

```yaml
subchannels:
  - id: api-posix-compat
    subscribers: [posix-compatibility-analyst, lead-api-architect]
    purpose: "POSIX coverage analysis, shim design, compatibility gap triage"
    ucxl_feed: "ucxl://council-api:posix-compatibility-analyst@DistOS:api/^^/specs/posix-*"
  - id: api-gpu-syscalls
    subscribers: [gpu-syscall-designer, lead-api-architect, async-api-specialist]
    purpose: "GPU-native system call design; Hopper/Blackwell capability integration"
    ucxl_feed: "ucxl://council-api:gpu-syscall-designer@DistOS:api/^^/specs/gpu-*"
  - id: api-distributed-syscalls
    subscribers: [distributed-syscall-designer, lead-api-architect]
    purpose: "Distributed call design; RPC and RDMA protocol negotiation with council-net"
    ucxl_feed: "ucxl://council-api:distributed-syscall-designer@DistOS:api/^^/specs/distributed-*"
  - id: api-sdk-coordination
    subscribers: [sdk-designer, async-api-specialist, developer-experience-reviewer]
    purpose: "Cross-language SDK consistency; ABI stability coordination"
    ucxl_feed: "ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-*"
  - id: api-error-and-versioning
    subscribers: [error-handling-architect, api-versioning-strategist, lead-api-architect]
    purpose: "Error catalogue development; versioning policy; UCXL trace integration"
    ucxl_feed: "ucxl://council-api:error-handling-architect@DistOS:api/^^/specs/error-*"
  - id: api-cross-council-requirements
    subscribers: [lead-api-architect, distributed-syscall-designer, gpu-syscall-designer]
    purpose: "Inbound requirements from all subsystem councils; tracks what each council needs exposed"
    ucxl_feed: "ucxl://council-*:*@DistOS:*/^^/requirements/api-*"
  - id: api-devex-review
    subscribers: [developer-experience-reviewer, lead-api-architect]
    purpose: "Developer experience walkthroughs; example application drafts; usability feedback"
    ucxl_feed: "ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/*"
```

---

## 8. Success Criteria

1. **Complete API surface:** The master API specification covers all system calls required by all six core subsystem councils. No subsystem has an unaddressed API requirement at the end of Phase 4.
2. **POSIX coverage documented:** The POSIX compatibility matrix exists and classifies every POSIX.1-2017 system call as supported, shim-supported, or explicitly unsupported with rationale.
3. **GPU system calls complete:** All GPU-native system calls for Hopper, Grace, and Blackwell are specified with parameter types, semantics, and error codes. NVLink/NVSwitch topology-aware calls are included.
4. **Distributed system calls complete:** All distributed calls (RPC, locks, barriers, collectives) are specified with failure semantics and consistency guarantees matching the `council-fault` and `council-net` specs.
5. **Four-language SDK specs complete:** C ABI, Rust, Go, and Python SDK specifications exist and have been reviewed for idiomatic correctness by SDK Designers.
6. **Error handling consistent:** All error types are catalogued and every public API call has a documented error table. Every error carries a UCXL trace field.
7. **Versioning policy ratified:** The versioning policy is agreed with `council-synth` and published. The experimental API tier is defined.
8. **Verification-ready contracts:** All interface contracts have been delivered to `council-verify` in Alloy-compatible form by Day 8.
9. **Developer experience validated:** At least three example applications have been written by Developer Experience Reviewers and cover: a simple GPU computation, a distributed collective operation, and a Weka FS streaming I/O pattern.
10. **CLI specification complete:** `distos-ctl` subcommand structure and all primary flags are specified.

---

## 9. Timeline

### Phase 1: Research (Days 1–3)

- POSIX Compatibility Analysts catalogue POSIX.1-2017 system calls and assess coverage feasibility
- GPU Syscall Designers survey CUDA Driver API, CUDA Graphs, Hopper/Blackwell architecture documentation, NVLink topology implications
- Distributed Syscall Designers survey MPI collectives, gRPC, RDMA verbs, ZooKeeper/Chubby lock models
- SDK Designers survey language ecosystems: Rust async patterns, Go `cgo` patterns, Python asyncio/CuPy
- Async API Specialists study io_uring interface in depth
- Lead API Architect drafts the API philosophy options paper for DP-A01
- Deliverable: `ucxl://council-api:lead-api-architect@DistOS:api/^^/research/api-philosophy-options.md`

### Phase 2: Architecture (Days 3–6)

- Resolve DP-A01 (philosophy), DP-A02 (async mechanism), DP-A04 (RPC wire protocol), DP-A06 (auth/authz), all in consultation with relevant councils
- Lead API Architect drafts the call taxonomy: which calls belong in which layer (kernel/shim/library)
- GPU Syscall Designers draft the GPU system call prototype spec for Hopper and Blackwell
- Distributed Syscall Designers draft the distributed call prototype spec, contingent on DP-A04 resolution
- Error Handling Architects draft the error type taxonomy and UCXL trace integration
- Deliverable: `ucxl://council-api:lead-api-architect@DistOS:api/^^/research/call-taxonomy.md`

### Phase 3: Formal Specification (Days 6–10)

- Full API spec written: GPU syscalls, distributed syscalls, async interface, C ABI reference
- Language SDK specifications written in parallel by SDK Designers
- Error catalogue completed and UCXL trace integration specified
- Alloy interface contracts delivered to `council-verify` for structural verification
- CLI specification drafted by CLI Designers
- POSIX compatibility matrix completed
- Deliverable:
`ucxl://council-api:gpu-syscall-designer@DistOS:api/^^/specs/gpu-syscalls.md` and all companion specs

### Phase 4: Integration (Days 10–12)

- Resolve any outstanding API requirements from subsystem councils surfaced during their Phase 3 spec work
- DP-A03 and DP-A05 resolved with full DR records
- API versioning policy ratified by `council-synth`
- Developer Experience Reviewers conduct walkthroughs of all three example applications
- Deliver final interface contracts to `council-verify` for re-verification after any Phase 3 changes
- Deliverable: Versioning policy, three example applications

### Phase 5: Documentation (Days 12–14)

- Developer Experience Reviewers produce the developer-facing API reference document
- SDK Designers produce getting-started guides for each language
- All specs integrated into the master DistOS specification document via `council-docs`
- Final UCXL navigability check: every API call traces back to the council decision that introduced it
- Deliverable: `ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/docs/api-reference.md`