Council Design Brief: API Surface and Developer Experience
Council ID: council-api
Mission: Define the complete, coherent, and ergonomic interface between DistOS and its users — operators, application developers, and other systems. This council decides what the operating system looks like from the outside: system calls, SDK bindings, CLI tools, and the conventions that make all of the above consistent and maintainable across language boundaries and API versions.
UCXL Base Address: ucxl://council-api:*@DistOS:api/*
Agent Count: ~40
Status: Design Brief — Constitution Phase
1. Scope and Responsibilities
council-api owns the external interface contract of DistOS. Its scope covers:
- Deciding the overall API philosophy: POSIX-compatible extension, clean-slate design, or a layered model that offers both
- Defining GPU-native system calls for kernel launch, memory allocation, device-to-device transfers, stream and graph management, and event synchronisation
- Defining distributed system calls: remote procedure invocation (covering both synchronous RPC and async futures), distributed lock acquisition and release, barriers, and collective operations across node groups
- Designing an async-first API surface that aligns with modern language runtimes (Rust async/await, Go goroutines, Python asyncio)
- Establishing error handling conventions, including integration with UCXL response codes for errors that carry provenance (which node, which operation, at what logical time)
- Designing the SDK for four target languages: C (ABI-stable systems interface), Rust (idiomatic, zero-cost), Go (ergonomic, channel-friendly), and Python (user-friendly, numpy-compatible)
- Designing CLI tooling for cluster management: node status, job submission, resource inspection, log retrieval, and administrative operations
- Defining the API versioning and evolution strategy: how new calls are introduced, how deprecated calls are retired, compatibility guarantees across minor and major versions
- Producing API reference documentation that is precise enough to serve as a normative source alongside the formal spec
- Specifying example applications that exercise non-trivial API paths and serve as integration test targets
Responsibilities this council does not own: kernel implementation (owned by subsystem councils); formal verification of API contracts (owned by council-verify); security policy enforcement (owned by council-sec, though council-api designs the authentication and authorisation API surface in coordination with it); monitoring and metering calls (owned by council-telemetry, though council-api exposes the SDK surface for those).
2. Research Domains
2.1 POSIX Compatibility vs. Clean-Slate Design
POSIX (IEEE 1003.1) defines the canonical Unix system call interface. Its strengths are: near-universal language runtime support, a mature ecosystem of tools, and decades of developer familiarity. Its weaknesses in a GPU-cluster OS context are: blocking I/O semantics that assume CPU-thread models, file-descriptor-centric resource management ill-suited to GPU memory objects, and no native concept of distributed operations or remote memory.
Two design philosophies must be fully researched before the council can decide:
- POSIX-compatible extension: Retain the full POSIX interface and extend it with GPU and distributed primitives as optional add-ons. Applications written for Linux run unmodified; GPU-aware applications opt into extensions. This is the approach taken by CUDA (which layers a driver API on top of the OS) and by ROCm/HIP.
- Clean-slate design: Design an interface optimal for the DistOS hardware target without backward-compatibility constraints. This allows stronger type safety, async-native semantics, and a capability-based resource model from the first call. Plan 9 (Pike et al.) and Fuchsia (Zircon) are the primary existence proofs.
- Layered model: Provide a clean-slate primary API and a POSIX compatibility layer implemented on top of it. This is the architectural recommendation for evaluation. The compatibility layer has a defined cost budget.
Key references:
- The Open Group. The Single UNIX Specification (SUSv4/POSIX.1-2017). The normative POSIX reference.
- Pike, R. et al. "Plan 9 from Bell Labs." USENIX Summer 1990 Technical Conference. Plan 9's contribution is the 9P protocol: everything is a file, including processes and network connections. The simplicity of the resource model is instructive even if DistOS does not adopt 9P verbatim.
- Pike, R. "The Use of Name Spaces in Plan 9." EUUG Newsletter 12(1), 1992.
- Google. Fuchsia OS: Zircon Kernel Objects. https://fuchsia.dev/fuchsia-src/concepts/kernel. Zircon uses a capability-based object system with handles as the only way to reference kernel objects. This is the most complete modern clean-slate OS design and must be studied in depth.
2.2 GPU-Native System Calls
The CUDA Driver API provides the lowest-level GPU control surface available: cuInit, cuDeviceGet, cuCtxCreate, cuMemAlloc, cuLaunchKernel, cuEventRecord, cuStreamWaitEvent. It is the reference for what a GPU system call interface must cover.
Agents must evaluate the tradeoffs between:
- Driver-level API (CUDA Driver API / ROCm HIP Low-Level): explicit context management, explicit stream management, maximum control, verbose
- Runtime API (CUDA Runtime / ROCm): implicit context, automatic stream assignment, less control, more ergonomic
- Graph-based execution (CUDA Graphs / HIP Graphs): capture a sequence of operations as a graph for repeated execution with lower launch overhead. Critical for the 1024-node deployment where kernel launch overhead accumulates.
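The value of graph-based execution can be made concrete with a small capture/replay sketch. This is illustrative only, not the CUDA Graphs API: the `Stream` and `Graph` classes and their method names are hypothetical stand-ins showing why recording a sequence once and replaying it amortises per-launch overhead.

```python
# Hypothetical sketch of capture/replay semantics (not a real GPU API).
class Stream:
    """Records operations while capturing; launches eagerly otherwise."""
    def __init__(self):
        self._capturing = False
        self._ops = []

    def begin_capture(self):
        self._capturing = True
        self._ops = []

    def launch(self, fn, *args):
        if self._capturing:
            self._ops.append((fn, args))   # record, do not run
        else:
            fn(*args)                      # eager launch: per-call overhead

    def end_capture(self):
        self._capturing = False
        return Graph(list(self._ops))


class Graph:
    """An instantiated graph: the recorded op sequence, replayable cheaply."""
    def __init__(self, ops):
        self._ops = ops

    def launch(self):
        # One submission replays the whole sequence; no per-op setup cost.
        for fn, args in self._ops:
            fn(*args)


results = []
s = Stream()
s.begin_capture()
s.launch(results.append, "scale")
s.launch(results.append, "reduce")
g = s.end_capture()

for _ in range(3):          # replay the captured graph once per step
    g.launch()
# results now holds three repetitions of the captured sequence
```

On a 1024-node deployment the replay path is the hot path: the capture cost is paid once, while every training step pays only the single graph launch.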
Key references:
- NVIDIA. CUDA Driver API Reference Manual. https://docs.nvidia.com/cuda/cuda-driver-api/. Normative reference for GPU system call semantics.
- NVIDIA. CUDA C Programming Guide (Chapter 3: Programming Interface). Covers the Runtime API and its relationship to the Driver API.
- NVIDIA. CUDA Graphs documentation. https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-graphs. The graph execution model is essential for understanding low-latency repeated workloads on Hopper and Blackwell.
- Khronos Group. OpenCL 3.0 Specification. https://www.khronos.org/opencl/. The vendor-neutral GPU programming API. DistOS must decide whether to support OpenCL alongside CUDA semantics.
- Khronos Group. SYCL 2020 Specification. https://www.khronos.org/sycl/. SYCL provides a C++ abstraction over OpenCL and oneAPI targets. Intel's oneAPI unifies GPU programming across vendors and is a candidate for the DistOS higher-level SDK layer.
- Intel. oneAPI Programming Guide. https://www.intel.com/content/www/us/en/developer/tools/oneapi/programming-guide.html.
- NVIDIA. NVLink and NVSwitch Architecture Overview. https://www.nvidia.com/en-us/data-center/nvlink/. GPU-to-GPU direct access semantics affect memory system call design.
Blackwell-specific: The GB200 NVL72 introduces NVLink Switch System connecting 72 GPUs in a single flat memory domain. System calls for cuMemAdvise and cuMemPrefetchAsync take on new semantics in this topology. Agents must review:
- NVIDIA. NVIDIA Blackwell Architecture Technical Brief. 2024.
2.3 Distributed System Calls
System calls that span nodes are novel: POSIX has no notion of them. The design space covers:
- Remote procedure invocation: How does a process on node A invoke a procedure on node B? Synchronous blocking (simple, latency-bound), asynchronous with futures (complex, scalable), or continuation-passing. gRPC is the de facto standard for service-to-service RPC in the cloud but carries HTTP/2 overhead.
- Distributed locks: Lease-based locks (Chubby/Zookeeper model), RDMA-based compare-and-swap (best latency), or consensus-based locks for strong guarantees. Each has different failure semantics.
- Barriers: Collective synchronisation across node groups. MPI_Barrier semantics are well understood; the question is how to expose this in a general-purpose OS API.
- Collective operations: AllReduce, AllGather, Broadcast, Reduce-Scatter. These are first-class operations for distributed ML workloads (the dominant use case on a 1024-node GPU cluster) and must be surfaced as OS-level calls, not just library calls, so the OS can optimise placement and routing.
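The collective semantics above can be pinned down with a minimal in-process simulation: every node contributes a value and every node receives the reduction. The function name `distos_allreduce` is hypothetical; a real OS-level call would additionally choose topology-aware routing, which is exactly why the brief argues for surfacing it at the OS layer.

```python
# Hedged sketch: AllReduce semantics (reduce phase + broadcast phase),
# simulated in-process. distos_allreduce is an invented name.
import operator
from functools import reduce


def distos_allreduce(node_values, op):
    """Every node contributes one value; every node receives the reduction."""
    combined = reduce(op, node_values)      # reduce phase
    return [combined] * len(node_values)    # broadcast phase


gradients = [1.0, 2.0, 3.0, 4.0]            # one value per node
print(distos_allreduce(gradients, operator.add))  # [10.0, 10.0, 10.0, 10.0]
```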
Key references:
- Birrell, A. and Nelson, B. "Implementing Remote Procedure Calls." ACM Transactions on Computer Systems 2(1), 1984. The foundational RPC paper.
- Google. gRPC. https://grpc.io/. The current industry standard for typed RPC. Protocol Buffers schema evolution strategy is directly applicable to DistOS API versioning.
- Google. Chubby: A Lock Service for Loosely-Coupled Distributed Systems. Burrows, M. OSDI 2006.
- Hunt, P. et al. "ZooKeeper: Wait-free Coordination for Internet-scale Systems." USENIX ATC 2010.
- Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Version 4.1. 2023. The collective operations specification is normative for council-api's collective call design.
- Mellanox/NVIDIA. RDMA Programming Guide. InfiniBand verbs API (ibv_post_send, ibv_post_recv, ibv_create_qp) provides the lowest-latency distributed memory access primitives available on the target cluster.
2.4 Async-First API Design
A GPU cluster OS serving AI workloads will have I/O patterns dominated by deep asynchrony: thousands of in-flight kernel launches, streaming data from Weka FS, and collective communication across 1024 nodes. A synchronous-only API would be a fundamental design mistake. Agents must research:
- Rust async/await: The Rust async model (futures, the Poll trait, the executor model) provides zero-cost abstraction over async I/O. The tokio runtime is the dominant executor. The DistOS Rust SDK must integrate naturally with tokio.
- io_uring (Linux 5.1+): The io_uring interface provides a shared ring-buffer interface between kernel and userspace that amortises syscall overhead for I/O. Its submission/completion queue model is the reference for how DistOS should design its own async system call interface.
- Go channels and goroutines: Go's concurrency model maps well to distributed operations. The DistOS Go SDK must express distributed calls as channels or via the context.Context cancellation pattern.
- Python asyncio: The Python SDK must be usable from async def coroutines. NumPy compatibility for GPU tensor operations should be considered (compatibility with the Numba/CuPy interface).
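The submission/completion split can be sketched in asyncio terms. This is a toy model, not the DistOS interface: a real ring would be shared memory between kernel and userspace, and the class and operation names here are invented. The point it illustrates is that many operations stay in flight while the caller awaits a single completion point.

```python
# Toy io_uring-style submission/completion model (all names hypothetical).
import asyncio


class AsyncSyscallRing:
    def __init__(self):
        self._sq = asyncio.Queue()        # submission queue stand-in

    def submit(self, op, payload):
        """Enqueue an operation; return a future completed later."""
        fut = asyncio.get_running_loop().create_future()
        self._sq.put_nowait((op, payload, fut))
        return fut

    async def completion_worker(self):
        """Drains the SQ and posts completions (simulated kernel side)."""
        while True:
            op, payload, fut = await self._sq.get()
            fut.set_result((op, payload, 0))   # status 0 = OK


async def main():
    ring = AsyncSyscallRing()
    worker = asyncio.ensure_future(ring.completion_worker())
    # Many in-flight operations, a single await point:
    futs = [ring.submit("kernel_launch", i) for i in range(4)]
    done = await asyncio.gather(*futs)
    worker.cancel()
    return done


print(asyncio.run(main()))
```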
Key references:
- Axboe, J. io_uring and the new Linux async I/O API. https://kernel.dk/io_uring.pdf. 2019. This paper is essential for understanding the state of the art in async syscall design.
- The Rust Async Book. https://rust-lang.github.io/async-book/. Normative reference for Rust async design patterns.
- Grigorik, I. High Performance Browser Networking (Chapter 2 on event loop and async I/O patterns). 2013. O'Reilly. Useful background on event-driven I/O design.
2.5 Error Handling Conventions
A cluster OS at this scale will produce a high volume of partial failures: a node goes dark, a GPU kernel faults, a network partition isolates a subsystem. The error handling convention must be:
- Structured: Every error carries a type, a severity, a source identifier (node, subsystem, call), and a correlation ID that links it to a UCXL-addressed event in the distributed log.
- Actionable: The API must distinguish between errors that the caller should retry (transient), errors that require intervention (permanent), and errors that indicate a usage mistake (programmer error).
- Traceable: Error correlation IDs must be UCXL-compatible so that an error returned to a Python application can be resolved to the full distributed event chain using the UCXL resolver.
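The three requirements above imply an error shape along the following lines. This is a sketch of the structure, not the ratified catalogue: the field names, the class name `DistOsError`, and the three-way classification enum are illustrative.

```python
# Illustrative structured-error sketch; names are not the ratified catalogue.
from dataclasses import dataclass
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"     # caller should retry
    PERMANENT = "permanent"     # requires operator intervention
    USAGE = "usage"             # programmer error; do not retry


@dataclass
class DistOsError(Exception):
    code: str                   # e.g. "UNAVAILABLE"
    error_class: ErrorClass
    source: str                 # node / subsystem / call identifier
    ucxl_trace: str             # UCXL address of the causing event

    def retryable(self) -> bool:
        return self.error_class is ErrorClass.TRANSIENT


err = DistOsError(
    code="UNAVAILABLE",
    error_class=ErrorClass.TRANSIENT,
    source="node-042/net/rpc_call",
    ucxl_trace="ucxl://council-fault:monitor@DistOS:fault-tolerance/^^/events/node-042-timeout",
)
assert err.retryable()
```

The `ucxl_trace` field is what makes the error resolvable back to the distributed event chain via the UCXL resolver, per the Traceable requirement.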
Key references:
- Google. Google Cloud API Design Guide: Errors. https://cloud.google.com/apis/design/errors. The most systematic public treatment of structured API error design. The canonical status codes (OK, INVALID_ARGUMENT, NOT_FOUND, UNAVAILABLE, etc.) should be adopted or adapted.
- Klabnik, S. and Nichols, C. The Rust Programming Language (Chapter 9: Error Handling). The Rust approach to Result<T, E> and the ? operator represents the state of the art for recoverable errors in a systems language.
- Syme, D. et al. "Exceptional Syntactic Support for Error Handling in F#." Haskell Symposium 2020. Relevant to the higher-level SDK error design.
The UCXL response code integration specifically means that API error structs carry a ucxl_trace field containing the UCXL address of the distributed event that caused the failure:
error.ucxl_trace = "ucxl://council-fault:monitor@DistOS:fault-tolerance/^^/events/node-042-timeout-2026-03-01T14:22:00Z"
2.6 SDK Design for Multiple Languages
The SDK must present a coherent surface across four languages with different idioms. The design principles are:
- C ABI as the foundation: The canonical system call interface is a C ABI. All other language SDKs are generated or hand-written wrappers over the C ABI. This ensures ABI stability and FFI compatibility with every language.
- Rust SDK: Idiomatic, zero-cost wrappers. Use Rust's ownership system to enforce resource lifetimes at compile time (e.g., a GpuBuffer<T> type that is Send but not Sync, reflecting GPU buffer ownership semantics). The Rust SDK should use #[repr(C)] structs for ABI compatibility.
- Go SDK: Ergonomic wrappers using cgo for the C ABI. Expose distributed operations as channel-returning functions. Context-aware: all calls accept context.Context for cancellation and timeout propagation.
- Python SDK: High-level, NumPy-compatible. Consider auto-generating stub code from a schema. Must be asyncio-compatible. Integrate with the Python type system via Protocol and TypedDict.
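The "C ABI as the foundation" rule can be sketched from the Python side: the wrapper translates status codes into exceptions and opaque handles into Python values. Everything below is hypothetical — `distos_mem_alloc`, its status codes, and the stub that stands in for the real shared library are invented for illustration; a real binding would load `libdistos` via `ctypes.CDLL` and declare argument types from the normative C header.

```python
# Hypothetical Python-over-C-ABI layering sketch (no real library exists).
import ctypes


def _stub_mem_alloc(size, handle):
    """Stand-in for a C ABI call; a real SDK would call into libdistos."""
    if size <= 0:
        return 3                  # invented INVALID_ARGUMENT status code
    handle.value = 42             # pretend kernel-issued handle
    return 0                      # 0 = OK


def mem_alloc(size: int) -> int:
    """Pythonic wrapper: status codes become exceptions, handles become ints."""
    handle = ctypes.c_uint64(0)   # out-parameter, as a C ABI would expect
    status = _stub_mem_alloc(size, handle)
    if status != 0:
        raise OSError(status, "distos_mem_alloc failed")
    return handle.value


print(mem_alloc(4096))
```

The same translation discipline (status-to-error, handle-to-native-type) is what the Rust and Go SDKs apply in their own idioms, which is how one C ABI yields four consistent surfaces.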
Key references:
- Klabnik, S. and Nichols, C. The Rust Programming Language. https://doc.rust-lang.org/book/. Idiomatic Rust patterns.
- Go Authors. Effective Go. https://go.dev/doc/effective_go. Idiomatic Go patterns.
- Google. Google Cloud API Design Guide. https://cloud.google.com/apis/design. The most comprehensive public API design guide, covering resource-oriented design, standard methods, naming conventions, and backwards compatibility.
- Smith, P. Designing for Compatibility in Evolving APIs. IEEE Software 39(4), 2022.
2.7 CLI Tooling Design
The cluster management CLI (distos-ctl or equivalent) must follow modern CLI design principles:
- Machine-readable output (JSON/YAML with --output json) for scripting
- Structured logging with log levels
- Human-readable default output with colour and progress indicators
- Completion generation for bash/zsh/fish
- Subcommand structure: node, job, gpu, net, storage, secret, log
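A minimal skeleton makes the intended shape concrete: subcommands plus a global --output flag that switches between human and machine output. The command set shown and the placeholder result schema are illustrative, not the ratified distos-ctl specification.

```python
# Illustrative distos-ctl skeleton; command set and output schema invented.
import argparse
import json


def build_parser():
    p = argparse.ArgumentParser(prog="distos-ctl")
    p.add_argument("--output", choices=["human", "json", "yaml"], default="human")
    sub = p.add_subparsers(dest="command", required=True)
    node = sub.add_parser("node", help="node inspection")
    node.add_argument("action", choices=["status", "list"])
    sub.add_parser("job", help="job submission and control")
    return p


def run(argv):
    args = build_parser().parse_args(argv)
    result = {"command": args.command, "nodes_ready": 1024}   # placeholder data
    if args.output == "json":
        return json.dumps(result)                  # stable, scriptable output
    return f"{result['nodes_ready']}/1024 nodes ready"        # human default


print(run(["--output", "json", "node", "status"]))
```

The key property to preserve is that the JSON schema is part of the compatibility contract: scripts parse it, so it falls under the same versioning policy as the API surface.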
Key references:
- Sigurdsson, A. et al. Command Line Interface Guidelines. https://clig.dev/. The community-written standard for modern CLI design. Should be treated as the style guide for distos-ctl.
- HashiCorp. Vault CLI design. The Vault CLI is an exemplar of a well-structured cluster management tool with consistent subcommand and flag conventions.
- Kubernetes. kubectl source and documentation. The de facto standard for distributed cluster management CLIs. The DistOS CLI should match kubectl conventions where applicable to reduce cognitive load.
2.8 API Versioning and Evolution Strategy
A system call interface must be stable. The versioning strategy must address:
- Compatibility guarantees: What changes are backwards-compatible (adding optional parameters, adding new calls) vs. breaking (changing parameter semantics, removing calls)?
- Deprecation lifecycle: Minimum deprecation notice period, deprecation markers in the SDK, removal schedule.
- Version negotiation: How does a client indicate the API version it was compiled against? How does the kernel report available versions?
- Experimental APIs: A clearly marked experimental tier for new calls before they enter the stable surface.
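Version negotiation can be sketched under a SemVer-style compatibility rule. The policy encoded here (same major line, kernel minor at least the client's, because minor releases are additive-only) is an assumption for illustration, not the ratified policy.

```python
# Sketch of client/kernel version negotiation; the compatibility rule
# (same major, kernel minor >= client minor) is an assumed policy.
def negotiate(client, kernel_supported):
    """Pick the highest kernel version compatible with the client.

    client: (major, minor, patch) the client was compiled against.
    kernel_supported: list of (major, minor, patch) the kernel offers.
    """
    candidates = [
        v for v in kernel_supported
        if v[0] == client[0] and v[1] >= client[1]
    ]
    return max(candidates) if candidates else None


kernel = [(1, 4, 0), (1, 7, 2), (2, 0, 0)]
print(negotiate((1, 5, 0), kernel))   # (1, 7, 2): newest compatible 1.x
print(negotiate((3, 0, 0), kernel))   # None: no compatible major line
```

Whatever rule is ratified, it must be checkable on both sides: the client declares its compiled-against version at connection time, and the kernel either selects a compatible version or fails fast with a structured error.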
Key references:
- Google. Google Cloud API Versioning. https://cloud.google.com/apis/design/versioning. URL-based versioning for REST APIs; the principles apply to system call versioning.
- Klabnik, S. "Stability as a Deliverable." https://blog.rust-lang.org/2014/10/30/Stability.html. Rust's stability commitment is a model for how a systems project can make and keep compatibility promises.
- Semantic Versioning Specification. https://semver.org/. The DistOS SDK and ABI will follow SemVer 2.0.
2.9 Plan 9 and Fuchsia Zircon Deep Dive
These two systems represent the clearest non-POSIX OS API designs and must be studied in depth:
- Plan 9: The 9P protocol represents all system resources as files served over a file system protocol. Network connections, processes, and graphics are files. The simplicity is extreme. The DistOS clean-slate layer need not adopt 9P but should understand its design philosophy.
- Pike, R. et al. "The Use of Name Spaces in Plan 9." EUUG Newsletter 12(1), 1992.
- Dorward, S. et al. "The Inferno Operating System." Bell Labs Technical Journal 2(1), 1997.
- Fuchsia / Zircon: Zircon is a microkernel with capabilities as the security primitive. Every kernel resource is a zx_handle_t. Handles are passed between processes explicitly; there is no global namespace for kernel objects. This is the preferred model for DistOS's capability integration with council-sec.
- Google. Zircon Kernel Concepts. https://fuchsia.dev/fuchsia-src/concepts/kernel/concepts.
- Google. Zircon Syscall Reference. https://fuchsia.dev/fuchsia-src/reference/syscalls.
3. Agent Roles
| Role | Count | Responsibilities |
|---|---|---|
| Lead API Architect | 1 | Decides overall API philosophy; coordinates with all subsystem councils; owns the master API specification document; resolves conflicts between API and subsystem requirements |
| POSIX Compatibility Analysts | 4 | Audit which POSIX calls must be retained; design the compatibility shim layer; document compatibility coverage gaps |
| GPU Syscall Designers | 6 | Design GPU-native system calls for kernel launch, memory, streams, events, graphs; ensure Hopper/Blackwell/Grace specifics are covered |
| Distributed Syscall Designers | 5 | Design RPC, distributed lock, barrier, and collective operation system calls; consult MPI and RDMA references |
| SDK Designers | 8 | Design language-specific SDKs: 2 per language (C, Rust, Go, Python); responsible for ergonomics, idiom conformance, and ABI stability |
| Async API Specialists | 4 | Design the async call model; specify io_uring-style ring buffer interface; ensure Rust/Go/Python async integration |
| CLI Designers | 3 | Design distos-ctl command structure, output formats, and completions |
| Error Handling Architects | 3 | Design structured error types, UCXL trace integration, and error propagation conventions across all SDK layers |
| API Versioning Strategists | 2 | Develop the versioning policy, deprecation lifecycle, compatibility matrix, and experimental API tier |
| Developer Experience Reviewers | 4 | Evaluate API usability; write developer-facing documentation and example applications; run internal "dogfooding" walkthroughs |
Total: 40 agents
4. Key Deliverables
All artifacts use the pattern ucxl://council-api:{role}@DistOS:api/^^/{artifact-type}/{name}.
4.1 Master API Philosophy Decision Record
ucxl://council-api:lead-api-architect@DistOS:api/^^/decisions/dr-api-01-philosophy.md
Covers the layered model decision: clean-slate primary API, POSIX compatibility shim, and the cost budget for the shim.
4.2 GPU System Call Specification
ucxl://council-api:gpu-syscall-designer@DistOS:api/^^/specs/gpu-syscalls.md
Full specification of all GPU-native system calls with parameter types, semantics, error codes, and Hopper/Blackwell/Grace specifics.
4.3 Distributed System Call Specification
ucxl://council-api:distributed-syscall-designer@DistOS:api/^^/specs/distributed-syscalls.md
4.4 Async Call Interface Specification
ucxl://council-api:async-api-specialist@DistOS:api/^^/specs/async-interface.md
Documents the submission/completion ring model, back-pressure semantics, and language runtime integration.
4.5 C ABI Reference
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/c-abi-reference.h
The normative C header file. All other SDKs are derived from this.
4.6 Language SDK Specifications
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-rust.md
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-go.md
ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-python.md
4.7 Error Type Catalogue
ucxl://council-api:error-handling-architect@DistOS:api/^^/specs/error-catalogue.md
All structured error types with UCXL trace integration, severity levels, and retry guidance.
4.8 CLI Specification
ucxl://council-api:cli-designer@DistOS:api/^^/specs/distos-ctl-spec.md
Full command reference including all subcommands, flags, output formats, and completion scripts.
4.9 API Versioning Policy
ucxl://council-api:api-versioning-strategist@DistOS:api/^^/policies/versioning-policy.md
4.10 POSIX Compatibility Coverage Matrix
ucxl://council-api:posix-compatibility-analyst@DistOS:api/^^/specs/posix-compatibility-matrix.md
Tabulates every POSIX call: supported natively, supported via shim, not supported (with rationale).
4.11 Example Applications
ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/hello-distributed-gpu.md
ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/allreduce-collective.md
ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/weka-fs-streaming-io.md
5. Decision Points
All DRs use the address pattern ucxl://council-api:lead-api-architect@DistOS:api/^^/decisions/{dr-id}.md.
DP-A01: POSIX vs. Clean-Slate vs. Layered
The foundational design philosophy choice. The default recommendation is the layered model, but this must be validated against: the cost of maintaining the shim layer, the risk of semantic leakage from POSIX into the clean-slate layer, and the developer familiarity benefit.
Deciding parties: Lead API Architect, POSIX Compatibility Analysts, council-synth
DP-A02: Async System Call Mechanism
Choose between: io_uring-inspired ring buffer (lowest overhead, Linux precedent), a POSIX-extended aio_* interface (familiarity, limited expressiveness), or a fully custom completion port model. This decision is tightly coupled to the council-mem memory model (the ring buffer requires shared memory between kernel and userspace).
Deciding parties: Async API Specialists, council-mem, council-verify (for ABI safety check)
DP-A03: GPU Memory API at the Syscall Layer vs. Library Layer
Should GPU memory allocation (cuMemAlloc equivalent) be a kernel-mediated system call (allowing the OS to account for and schedule GPU memory as a first-class resource) or a library call that bypasses the kernel after initial device setup? Kernel mediation adds latency; bypass reduces accounting fidelity.
Deciding parties: GPU Syscall Designers, council-mem, council-telemetry
DP-A04: RPC Mechanism for Distributed System Calls
Choose the wire protocol for remote procedure calls: gRPC (typed, HTTP/2, mature), a custom binary protocol over RDMA (lowest latency, highest implementation cost), or a two-tier model (gRPC for control plane, RDMA for data plane). The choice directly affects the latency budget for distributed system calls.
Deciding parties: Distributed Syscall Designers, council-net
DP-A05: SDK Code Generation vs. Hand-Written Wrappers
Decide whether to generate the Rust, Go, and Python SDKs from a schema definition (IDL, such as Protocol Buffers or a custom DSL) or maintain hand-written wrappers. Generated code is more consistent; hand-written code can be more idiomatic. A hybrid (generate the boilerplate, hand-write ergonomic wrappers) is the likely outcome.
Deciding parties: SDK Designers, API Versioning Strategists
DP-A06: Authentication and Authorisation API
How does a process prove its identity to the kernel and acquire capabilities? Options: token-based (JWT or similar), capability handles (Zircon model), certificate-based (X.509 with a cluster CA), or UCXL-scoped credentials. This decision must be made jointly with council-sec.
Deciding parties: Lead API Architect, council-sec
6. Dependencies on Other Councils
council-api is the integrating council: every subsystem council produces functionality, and council-api exposes that functionality through a coherent surface. It is therefore a downstream consumer of requirements from all councils and an upstream provider to council-docs and council-verify.
| Council | Relationship | What council-api consumes | What council-api produces |
|---|---|---|---|
| council-sched | Consuming requirements | Job submission semantics, priority model, queue management APIs | Scheduler-facing system calls in API spec |
| council-mem | Bidirectional | Memory model, allocation semantics, consistency guarantees | Memory system call specs; async memory API |
| council-net | Bidirectional | Network abstraction primitives, RDMA capabilities | Network system calls; distributed RPC wire protocol choice |
| council-fault | Consuming requirements | Failure notification model, recovery primitives | Fault-tolerance-related error codes; node failure event API |
| council-sec | Bidirectional | Capability model, identity primitives, isolation guarantees | Authentication/authorisation API surface; capability handle design |
| council-telemetry | Consuming requirements | Metering call semantics, SLO query interface | Telemetry-facing SDK surface; metering call specs |
| council-verify | Providing for verification | N/A | API interface contracts for formal verification |
| council-qa | Providing for test design | N/A | API spec enables QA to design conformance tests |
| council-synth | Receiving directives | Cross-council conflict resolutions affecting API design | Updates to API spec when directed by synth |
| council-docs | Providing for documentation | N/A | All API specs feed directly into the reference documentation |
Critical path constraint: council-api cannot finalise the distributed system call interface until council-net has committed to its RPC and RDMA model (DP-A04 depends on this). GPU system call design can proceed independently from Day 1.
7. WHOOSH Configuration
7.1 Team Formation
council_id: council-api
display_name: "API Surface and Developer Experience Council"
target_size: 40
formation_strategy: competency_weighted
required_roles:
- role: lead-api-architect
count: 1
persona: systems-analyst
competencies: [api-design, posix, distributed-systems, gpu-programming, developer-experience]
- role: posix-compatibility-analyst
count: 4
persona: technical-specialist
competencies: [posix, linux-kernel, system-calls, abi-stability]
- role: gpu-syscall-designer
count: 6
persona: technical-specialist
competencies: [cuda, rocm, gpu-memory, hopper-architecture, blackwell-architecture, nvlink]
- role: distributed-syscall-designer
count: 5
persona: technical-specialist
competencies: [rpc, rdma, mpi-collectives, distributed-locks, grpc]
- role: sdk-designer
count: 8
persona: technical-specialist
competencies: [c-abi, rust-async, go-concurrency, python-asyncio, ffi, sdk-design]
- role: async-api-specialist
count: 4
persona: technical-specialist
competencies: [io-uring, async-io, rust-futures, event-driven-design]
- role: cli-designer
count: 3
persona: technical-specialist
competencies: [cli-design, ux, kubectl-conventions, shell-completion]
- role: error-handling-architect
count: 3
persona: systems-analyst
competencies: [error-design, structured-errors, distributed-tracing, ucxl]
- role: api-versioning-strategist
count: 2
persona: systems-analyst
competencies: [api-versioning, semver, deprecation-policy, compatibility]
- role: developer-experience-reviewer
count: 4
persona: technical-writer
competencies: [developer-documentation, api-usability, example-applications, dogfooding]
7.2 Quorum Rules
quorum:
decision_threshold: 0.65 # 65% of active agents must agree on API design decisions
lead_architect_veto: true # Lead API Architect can block any interface decision
breaking_change_threshold: 0.85 # Breaking changes require 85% supermajority
cross_council_approval:
trigger: api_affects_subsystem
required: [affected_council_lead, council-synth]
response_sla_hours: 6
developer_experience_review:
trigger: new_public_call
required: [developer-experience-reviewer_count >= 2]
purpose: "Ensure every new call meets ergonomics standard before it enters the spec"
7.3 Subchannels
subchannels:
- id: api-posix-compat
subscribers: [posix-compatibility-analyst, lead-api-architect]
purpose: "POSIX coverage analysis, shim design, compatibility gap triage"
ucxl_feed: "ucxl://council-api:posix-compatibility-analyst@DistOS:api/^^/specs/posix-*"
- id: api-gpu-syscalls
subscribers: [gpu-syscall-designer, lead-api-architect, async-api-specialist]
purpose: "GPU-native system call design; Hopper/Blackwell capability integration"
ucxl_feed: "ucxl://council-api:gpu-syscall-designer@DistOS:api/^^/specs/gpu-*"
- id: api-distributed-syscalls
subscribers: [distributed-syscall-designer, lead-api-architect]
purpose: "Distributed call design; RPC and RDMA protocol negotiation with council-net"
ucxl_feed: "ucxl://council-api:distributed-syscall-designer@DistOS:api/^^/specs/distributed-*"
- id: api-sdk-coordination
subscribers: [sdk-designer, async-api-specialist, developer-experience-reviewer]
purpose: "Cross-language SDK consistency; ABI stability coordination"
ucxl_feed: "ucxl://council-api:sdk-designer@DistOS:api/^^/specs/sdk-*"
- id: api-error-and-versioning
subscribers: [error-handling-architect, api-versioning-strategist, lead-api-architect]
purpose: "Error catalogue development; versioning policy; UCXL trace integration"
ucxl_feed: "ucxl://council-api:error-handling-architect@DistOS:api/^^/specs/error-*"
- id: api-cross-council-requirements
subscribers: [lead-api-architect, distributed-syscall-designer, gpu-syscall-designer]
purpose: "Inbound requirements from all subsystem councils; tracks what each council needs exposed"
ucxl_feed: "ucxl://council-*:*@DistOS:*/^^/requirements/api-*"
- id: api-devex-review
subscribers: [developer-experience-reviewer, lead-api-architect]
purpose: "Developer experience walkthroughs; example application drafts; usability feedback"
ucxl_feed: "ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/examples/*"
8. Success Criteria
- Complete API surface: The master API specification covers all system calls required by all six core subsystem councils. No subsystem has an unaddressed API requirement at the end of Phase 4.
- POSIX coverage documented: The POSIX compatibility matrix exists and classifies every POSIX.1-2017 system call as supported, shim-supported, or explicitly unsupported with rationale.
- GPU system calls complete: All GPU-native system calls for Hopper, Grace, and Blackwell are specified with parameter types, semantics, and error codes. NVLink/NVSwitch topology-aware calls are included.
- Distributed system calls complete: All distributed calls (RPC, locks, barriers, collectives) are specified with failure semantics and consistency guarantees matching the council-fault and council-net specs.
- Four-language SDK specs complete: C ABI, Rust, Go, and Python SDK specifications exist and have been reviewed for idiomatic correctness by SDK Designers.
- Error handling consistent: All error types are catalogued and every public API call has a documented error table. Every error carries a UCXL trace field.
- Versioning policy ratified: The versioning policy is agreed with council-synth and published. The experimental API tier is defined.
- Verification-ready contracts: All interface contracts have been delivered to council-verify in Alloy-compatible form by Day 8.
- Developer experience validated: At least three example applications have been written by Developer Experience Reviewers and cover: a simple GPU computation, a distributed collective operation, and a Weka FS streaming I/O pattern.
- CLI specification complete: distos-ctl subcommand structure and all primary flags are specified.
9. Timeline
Phase 1: Research (Days 1–3)
- POSIX Compatibility Analysts catalogue POSIX.1-2017 system calls and assess coverage feasibility
- GPU Syscall Designers survey CUDA Driver API, CUDA Graphs, Hopper/Blackwell architecture documentation, NVLink topology implications
- Distributed Syscall Designers survey MPI collectives, gRPC, RDMA verbs, ZooKeeper/Chubby lock models
- SDK Designers survey language ecosystems: Rust async patterns, Go cgo patterns, Python asyncio/CuPy
- Async API Specialists study io_uring interface in depth
- Lead API Architect drafts the API philosophy options paper for DP-A01
- Deliverable: ucxl://council-api:lead-api-architect@DistOS:api/^^/research/api-philosophy-options.md
Phase 2: Architecture (Days 3–6)
- Resolve DP-A01 (philosophy), DP-A02 (async mechanism), DP-A04 (RPC wire protocol), DP-A06 (auth/authz) — all in consultation with relevant councils
- Lead API Architect drafts the call taxonomy: which calls belong in which layer (kernel/shim/library)
- GPU Syscall Designers draft the GPU system call prototype spec for Hopper and Blackwell
- Distributed Syscall Designers draft the distributed call prototype spec, contingent on DP-A04 resolution
- Error Handling Architects draft the error type taxonomy and UCXL trace integration
- Deliverable: ucxl://council-api:lead-api-architect@DistOS:api/^^/research/call-taxonomy.md
Phase 3: Formal Specification (Days 6–10)
- Full API spec written: GPU syscalls, distributed syscalls, async interface, C ABI reference
- Language SDK specifications written in parallel by SDK Designers
- Error catalogue completed and UCXL trace integration specified
- Alloy interface contracts delivered to council-verify for structural verification
- CLI specification drafted by CLI Designers
- POSIX compatibility matrix completed
- Deliverable: ucxl://council-api:gpu-syscall-designer@DistOS:api/^^/specs/gpu-syscalls.md and all companion specs
Phase 4: Integration (Days 10–12)
- Resolve any outstanding API requirements from subsystem councils surfaced during their Phase 3 spec work
- DP-A03 and DP-A05 resolved with full DR records
- API versioning policy ratified by council-synth
- Developer Experience Reviewers conduct walkthroughs of all three example applications
- Deliver final interface contracts to council-verify for re-verification after any Phase 3 changes
- Deliverable: Versioning policy, three example applications
Phase 5: Documentation (Days 12–14)
- Developer Experience Reviewers produce the developer-facing API reference document
- SDK Designers produce getting-started guides for each language
- All specs integrated into the master DistOS specification document via council-docs
- Final UCXL navigability check: every API call traces back to the council decision that introduced it
- Deliverable: ucxl://council-api:developer-experience-reviewer@DistOS:api/^^/docs/api-reference.md