Council Design Brief: Resource Accounting, Telemetry, and SLO Enforcement

Council ID: council-telemetry
Mission: Design the complete observability, metering, and accountability infrastructure for DistOS — covering GPU utilisation metering, multi-tenant cost attribution, SLO definition and enforcement, real-time telemetry pipelines, distributed tracing, energy consumption tracking, and anomaly detection — such that every resource consumed in the cluster is measured, attributed, and reportable with sub-minute latency, and that service level objectives are enforced automatically with human-navigable evidence.
UCXL Base Address: ucxl://council-telemetry:*@DistOS:telemetry/*
Agent Count: ~40
Status: Pre-formation (Constitution Phase)
Created: 2026-02-24


1. Scope and Responsibilities

Council-telemetry is responsible for the full stack of observability and accountability in DistOS: from raw hardware counters (GPU SM occupancy, NVLink bandwidth, DRAM bandwidth) through the telemetry pipeline (collection, aggregation, storage) to higher-level constructs (cost attribution, SLO evaluation, anomaly detection, capacity planning signals). The council also provides the metering data that the KACHING enterprise licensing module consumes for billing, and the energy consumption signals used for carbon-aware scheduling.

Unlike council-sec, which issues constraints to other councils, council-telemetry primarily consumes events and metrics produced by other subsystems. Its design is therefore deeply dependent on the interfaces those councils expose. A key deliverable is a telemetry interface specification that other councils must implement.

In Scope

  • GPU hardware counter collection design: SM occupancy, memory bandwidth, tensor core utilisation, NVLink bandwidth, PCIe traffic — using NVIDIA DCGM and NVML as the collection substrate
  • Multi-tenant cost attribution: relating raw hardware counter values to tenant workloads; designing fair attribution algorithms for shared resources (e.g., GPU memory bus contention)
  • SLO definition language and schema: formal specification of latency, throughput, availability, and utilisation SLOs; SLO evaluation at runtime
  • SLO enforcement mechanisms: automatic throttling, priority inversion prevention, quota enforcement, workload eviction as a last resort
  • Real-time telemetry pipeline: collection, transport, aggregation, and storage architecture; latency requirements (sub-minute for billing, sub-second for SLO enforcement, millisecond for anomaly detection)
  • Distributed tracing: causally-correct tracing of agent interactions across the CHORUS mesh; integration with OpenTelemetry
  • Energy consumption tracking: per-node and per-workload energy metering using NVIDIA DCGM power readings and Intel RAPL (for CPU/memory subsystem); integration with Kepler for Kubernetes-style workloads if DistOS supports Kubernetes compatibility
  • Carbon-aware scheduling signals: translating energy metering into carbon signals; publishing scheduling advisory signals for carbon-aware placement decisions (consumed by council-sched)
  • Quota management: per-tenant resource quotas, quota accounting, quota enforcement protocol
  • Anomaly detection: statistical and ML-based anomaly detection on resource metrics; alert generation; integration with operator notification
  • Capacity planning: historical data aggregation, trend analysis, capacity headroom computation
  • Chargeback models: metering-to-cost translation; integration with KACHING billing model

Out of Scope

  • Scheduling decisions based on telemetry signals (council-sched consumes telemetry signals; council-telemetry produces them)
  • Memory allocation algorithms (council-mem owns this; council-telemetry meters it)
  • Network transport implementation (council-net owns this; council-telemetry instruments it)
  • Security of the telemetry pipeline (council-sec owns this; council-telemetry complies with its requirements)

2. Research Domains

2.1 GPU Hardware Counter Collection

Accurate GPU utilisation metering requires direct access to hardware performance counters. The challenge in a multi-tenant environment is that hardware counter access is often exclusive (enabling profiling for one workload disables it for others) or requires privileged access that tenants should not have.

Key Papers and Systems:

  • NVIDIA Data Center GPU Manager (DCGM) — NVIDIA's official GPU telemetry framework for data centres; provides SM occupancy, memory bandwidth, active cycles, power consumption, temperature, PCIe throughput, and NVLink bandwidth via the DCGM API; DistOS's collection layer should be built on DCGM for NVIDIA GPU metrics; critically, DCGM operates in a privileged daemon model that does not require per-workload profiling access, which is compatible with multi-tenant environments
  • NVIDIA Management Library (NVML) — low-level C library underlying DCGM; exposes per-GPU and per-process metrics; the per-process accounting mode (nvmlDeviceSetAccountingMode) enables post-run attribution of compute time and memory to specific PIDs, which is essential for chargeback; however, PID-based attribution is insufficient for MIG-partitioned GPUs where tenants share the same physical device
  • Jouppi et al. (2017), "In-Datacenter Performance Analysis of a Tensor Processing Unit" — TPU performance counter methodology; useful comparison point for what GPU metrics are available versus what would be useful; many metrics DistOS needs (e.g., tensor core utilisation breakdowns by operation type) may not be directly hardware-countable and must be estimated from higher-level profiling
  • Mei et al. (2023), "Characterising and Optimising Deep Learning Inference Workloads on Modern GPUs" — empirical characterisation of GPU counter behaviour for inference workloads; establishes that SM occupancy alone is an insufficient utilisation proxy; NVLink bandwidth saturation and L2 cache miss rate are important co-metrics for attributing performance degradation to resource contention
  • NVIDIA Nsight Systems and Nsight Compute — profiling tools that expose timeline-level GPU activity; not suitable for production always-on metering (their profiling overhead is too intrusive), but they inform which subset of DCGM metrics are sufficient proxies for workload characterisation
  • Awan et al. (2023), "Near-Zero Overhead Telemetry" (research prototype, MLSys 2023) — demonstrates that carefully selected subset of hardware counters with low sampling frequency achieves 98% accuracy of full profiling at less than 0.5% overhead; informs DistOS's counter selection policy

Open questions for research phase: Does DCGM support per-MIG-instance accounting with the same granularity as per-physical-GPU accounting? What is the DCGM polling overhead at 1-second vs 100-millisecond sampling intervals across 1024 GPUs? How should council-telemetry handle the period during which a GPU is being re-partitioned (MIG reconfiguration)?

2.2 Multi-Tenant Cost Attribution

Attributing resource consumption to individual tenants in a shared GPU environment is significantly harder than in a dedicated allocation model. Shared resources (memory bus bandwidth, NVLink, L2 cache) cause one tenant's workload to degrade another's performance, creating contention attribution problems.

Key Papers and Systems:

  • Delimitrou & Kozyrakis (2014), "Quasar: Resource-Efficient and QoS-Aware Cluster Management" — demonstrates that interference between co-located workloads in shared clusters is predictable and can be modelled; Quasar's interference matrix approach is directly applicable to DistOS's contention attribution; DistOS's metering should record not just individual workload consumption but also co-location context
  • Nathuji et al. (2010), "VirtualPower: Coordinated Power Management in Virtualized Enterprise Systems" — power attribution in virtualised environments; establishes that power metering must account for idle overhead allocation, which is directly analogous to GPU idle capacity attribution in multi-tenant DistOS
  • Amazon EC2 burstable (T-series) instances and CPU credits — practical chargeback model for burstable shared resources; DistOS's quota model should evaluate the T-series CPU-credit-style approach for GPU workloads that occasionally spike beyond their allocated share
  • Zaharia et al. (2016), "Apache Spark: A Unified Engine for Big Data Processing" (CACM) — Spark's stage-level resource metering provides an abstraction level above hardware counters that is more useful for chargeback; DistOS should consider an analogous task/stage metering abstraction
  • Ouyang et al. (2023), "Characterizing Interference in Shared Multi-GPU Systems" — empirical study showing L2 cache and memory bus contention between co-located workloads; provides the empirical basis for DistOS's interference-aware attribution model

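To make the attribution models concrete, here is a minimal sketch of proportional attribution for a contended resource, with the unallocated residual surfaced separately so that the contention-cost question can be settled by policy rather than hidden in per-tenant numbers. Workload names, tenants, and the bus capacity are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    workload: str
    tenant: str
    mem_bw_gbs: float  # measured memory-bus bandwidth attributed to this workload

def attribute_shared_bandwidth(samples, bus_capacity_gbs):
    """Proportional attribution of a shared memory bus.

    Each workload is charged its measured share of capacity; the residual
    (idle or contention-lost capacity) is reported separately so policy can
    decide who absorbs it.
    """
    shares = {s.workload: s.mem_bw_gbs / bus_capacity_gbs for s in samples}
    residual = max(0.0, 1.0 - sum(shares.values()))
    return shares, residual

samples = [Sample("train-a", "tenant-1", 1200.0), Sample("infer-b", "tenant-2", 400.0)]
shares, residual = attribute_shared_bandwidth(samples, bus_capacity_gbs=2000.0)
# train-a is charged 60% of the bus, infer-b 20%; roughly 20% remains unallocated
```

Recording the co-location context alongside the shares (as Quasar's interference matrix suggests) is what later allows the residual to be re-attributed under a "disruptor pays" policy.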
2.3 SLO Definition, Evaluation, and Enforcement

SLO-based resource management is the standard model for production cluster management. DistOS must define SLOs, evaluate them continuously, and enforce them before violations occur.

Key Papers and Systems:

  • Mogul & Wilkes (2019), "Nines are Not Enough: Meaningful Metrics for Clouds" — argues that traditional availability SLOs (99.9%, 99.99%) are insufficient for capturing user experience; motivates richer SLO definitions including latency percentiles (P99, P99.9), throughput, and error budget; DistOS's SLO language should support these constructs
  • Hauer et al. (2020), "Meaningful Availability" — Google's production availability-measurement experience; proposes windowed, user-uptime-based availability metrics and reinforces the importance of error budgets (how much SLO budget has been consumed) as the primary metric driving reliability investment
  • Google Site Reliability Engineering Book (Beyer et al. 2016), Chapter 4 "Service Level Objectives" — defines SLI, SLO, SLA, and error budget concepts; DistOS's SLO framework should be built on these definitions; particularly important is the error budget burn rate alerting model (Alerting on SLOs, Chapter 5 of the SRE Workbook)
  • Cortez et al. (2017), "Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms" — Microsoft Azure's workload prediction model for SLO-aware resource management; DistOS should evaluate whether workload type classification (via fingerprinting) can improve SLO prediction accuracy
  • Borg (Verma et al. 2015), "Large-Scale Cluster Management at Google with Borg" — Borg's priority-based preemption as the primary SLO enforcement mechanism; DistOS should adopt a similar priority taxonomy (prod vs non-prod, high vs low latency) with preemption as the enforcement backstop
  • Kubernetes Quality of Service (QoS) classes and LimitRange — Guaranteed, Burstable, and BestEffort QoS classes provide a practical three-tier SLO model; DistOS's SLO framework should map to an equivalent model for compute and memory resources, extended with GPU-specific dimensions

Open questions for research phase: What SLO dimensions are unique to GPU workloads versus CPU workloads? (Candidates: tensor core utilisation rate, NVLink bandwidth utilisation, checkpoint latency.) How should DistOS define an SLO for a distributed training job where progress is measured in loss reduction per unit time rather than latency?
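The error budget burn rate model cited above translates directly into code; the following is a minimal sketch, with the illustrative 14.4 threshold taken from the multi-window, multi-burn-rate alerting model.

```python
def error_budget_burn_rate(error_rate: float, slo_target: float) -> float:
    """Burn rate = observed error rate / error rate the SLO allows.

    A burn rate of 1.0 consumes the budget exactly over the SLO window;
    14.4 consumes a 30-day budget in about two days, a common paging
    threshold in multi-window burn-rate alerting.
    """
    return error_rate / (1.0 - slo_target)

# A 99.9% availability SLO with 1% of requests currently failing burns the
# 30-day error budget roughly 10x too fast, i.e. in about three days.
burn = error_budget_burn_rate(error_rate=0.01, slo_target=0.999)
```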

2.4 Telemetry Pipeline Architecture

At 1024 nodes each generating hundreds of metrics per second, the telemetry ingestion rate is substantial. The pipeline must handle high ingest volume, support sub-second fan-in aggregation for SLO enforcement, and store data efficiently for historical analysis.

Key Papers and Systems:

  • Prometheus (Proud & Volz, SoundCloud 2012; CNCF 2016) — pull-based metrics collection with multi-dimensional data model; Prometheus's label-based cardinality model is well-suited to DistOS's tenant × node × workload attribution; however, Prometheus's single-node storage is not suitable for 1024 nodes at 1-second granularity over months of retention
  • Thanos (Improbable Engineering, 2019) — horizontally scalable Prometheus with object storage backend; enables long-term metrics retention and cross-cluster querying; DistOS should evaluate Thanos (or its equivalent Cortex/Mimir) as the metrics storage layer
  • Borgmon (internal Google; described in SRE Book Chapter 10, "Practical Alerting from Time-Series Data") — Google's Borg monitoring system; time-series database with rule evaluation for alerting; the concept of borgmon rules as the primary SLO evaluation mechanism directly informs DistOS's design
  • Kafka (Kreps et al. 2011, LinkedIn) — distributed log for high-throughput event streaming; DistOS's telemetry pipeline should evaluate Kafka or an equivalent (Pulsar, Redpanda) as the transport layer between metric producers and the time-series database; Kafka's retention window provides a replay buffer for SLO evaluation catching up after an outage
  • OpenTelemetry (CNCF 2019) — vendor-neutral standard for metrics, logs, and traces; the OpenTelemetry Collector provides a pipeline architecture (receivers, processors, exporters) that DistOS should adopt as its telemetry collection standard; OTLP (OpenTelemetry Protocol) is the wire format
  • VictoriaMetrics — high-performance metrics storage designed for high cardinality; benchmarks show 10x storage efficiency over Prometheus for high-cardinality workloads typical of multi-tenant cluster telemetry; DistOS should include VictoriaMetrics in the storage backend evaluation

Throughput estimation: At 1024 nodes, each running 8 GPUs, at 100 DCGM metrics per GPU at 1-second sampling, the raw ingest rate is approximately 1024 × 8 × 100 = 819,200 metric data points per second. At 8 bytes per data point plus 40 bytes of labels, this is approximately 39 MB/s raw metric stream, before compression. The telemetry pipeline must handle this as a sustained base load with 3x burst headroom during incidents (when anomaly detection triggers increased sampling rates).
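The estimate above is worth keeping as an executable sanity check, so the figures are re-derived whenever node count, GPU count, or metric cardinality changes; constants mirror the assumptions stated in the paragraph.

```python
# Ingest-rate sanity check for the figures quoted in the text.
NODES, GPUS_PER_NODE, METRICS_PER_GPU, SAMPLE_HZ = 1024, 8, 100, 1
BYTES_PER_POINT = 8 + 40  # 8-byte value plus ~40 bytes of labels, uncompressed

points_per_s = NODES * GPUS_PER_NODE * METRICS_PER_GPU * SAMPLE_HZ
raw_mb_per_s = points_per_s * BYTES_PER_POINT / 1e6
burst_mb_per_s = 3 * raw_mb_per_s  # 3x headroom for incident-driven sampling

# points_per_s == 819_200; raw_mb_per_s ≈ 39.3 MB/s; burst ≈ 118 MB/s
```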

2.5 Distributed Tracing

In a system where CHORUS agents interact via the WHOOSH/CHORUS mesh, understanding the causal chain of an operation across hundreds of agents requires distributed tracing.

Key Papers and Systems:

  • Dapper (Sigelman et al. 2010), "Dapper, a Large-Scale Distributed Systems Tracing Infrastructure" — Google's production distributed tracing system; introduces the span/trace model that all modern tracing systems follow; DistOS should implement Dapper-style tracing for all inter-agent communications
  • Zipkin (Twitter, 2012) — open-source Dapper implementation; provides the reference implementation for DistOS's tracing layer; B3 propagation headers are the de facto standard for trace context propagation
  • OpenTelemetry Traces — formed from the merger of OpenTracing and OpenCensus; provides a standardised trace API and SDK; DistOS should use OpenTelemetry traces for all CHORUS agent interactions, with OTLP export to a Jaeger or Tempo backend
  • Fonseca et al. (2007), "X-Trace: A Pervasive Network Tracing Framework" — alternative to Dapper with stronger causality tracking; relevant for DistOS because CHORUS agent communication is not strictly request-response (it is gossip-based), and standard parent-child span models may not capture gossip causality accurately

Open questions for research phase: How should DistOS trace gossip-based communication (where a single message may spawn fan-out to hundreds of recipients)? What is the trace context propagation mechanism for CHORUS pubsub messages? Should DistOS use baggage propagation to carry tenant identity through the trace context?
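One answer to the fan-out question, sketched here with plain dictionaries rather than the real OpenTelemetry SDK: propagate the trace context in the pubsub envelope and model causality with span links rather than parent-child edges, so a one-to-many gossip hop does not force a tree shape onto the trace. The `publish`/`on_receive` names and envelope shape are hypothetical.

```python
import uuid

def new_span(trace_id=None, links=()):
    """Minimal span record; a real system would use the OpenTelemetry SDK."""
    return {
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex[:16],
        "links": list(links),  # span links capture gossip causality without a single parent
    }

def publish(span, payload):
    # The gossip envelope carries trace context alongside the payload.
    return {"traceparent": (span["trace_id"], span["span_id"]), "payload": payload}

def on_receive(msg):
    # Each of the N recipients starts its own span, linked (not parented)
    # to the publisher's span.
    trace_id, publisher_span = msg["traceparent"]
    return new_span(trace_id=trace_id, links=[publisher_span])

root = new_span()
msg = publish(root, {"event": "placement"})
child = on_receive(msg)
assert child["trace_id"] == root["trace_id"]
assert root["span_id"] in child["links"]
```

Baggage propagation for tenant identity would ride in the same envelope, next to `traceparent`.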

2.6 Energy Consumption and Carbon-Aware Scheduling

Key Papers and Systems:

  • Intel RAPL (Running Average Power Limit) — exposes CPU and DRAM power consumption via MSR registers on Intel Xeon hosts, at per-socket, millisecond granularity; note that the target cluster's Grace CPUs are Arm-based and do not implement RAPL, so the research phase must identify the equivalent per-socket power interface exposed on Grace nodes
  • NVIDIA DCGM Power Fields — DCGM exposes DCGM_FI_DEV_POWER_USAGE (current power draw in watts) and DCGM_FI_DEV_ENERGY_CONSUMPTION (cumulative joules) per GPU; DistOS's energy metering should aggregate these per workload and per tenant
  • Kepler (Kubernetes-based Efficient Power Level Exporter, CNCF Sandbox 2023) — exports per-container and per-pod energy consumption estimates derived from hardware performance counters; provides Prometheus metrics for energy and carbon; DistOS should evaluate Kepler's counter-to-energy estimation model for per-workload attribution
  • Lottarini et al. (2018), "vBoost: Scaling Up Microservices by Automatically Boosting Cloud Resources" — demonstrates that energy cost models must account for the non-linear relationship between utilisation and power draw (a GPU at 50% utilisation typically consumes more than 50% of its maximum power due to idle power overheads)
  • Wiesner et al. (2021), "Let's Wait Awhile: How Temporal Workload Shifting Can Reduce Carbon Emissions in the Cloud" — demonstrates that temporal workload shifting (scheduling batch workloads to run when the grid carbon intensity is low) can reduce carbon emissions by 20-40%; DistOS should publish carbon intensity signals that council-sched can use for carbon-aware placement
  • Electricity Maps API — real-time grid carbon intensity data; DistOS's carbon signal should integrate with an external grid carbon API to translate energy consumption to CO2 equivalent; provides the "carbon intensity" value in gCO2eq/kWh that DistOS needs for carbon-aware scheduling
  • Patterson et al. (2021), "Carbon Emissions and Large Neural Network Training" — quantifies the carbon footprint of large model training; the methodology here is the direct motivation for DistOS's energy and carbon telemetry; DistOS should enable workload-level carbon reporting at the same granularity as this paper reports
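The core of workload-level carbon reporting is a two-step unit conversion from metered joules to grams of CO2-equivalent; a minimal sketch, where the 700 W draw and 400 gCO2eq/kWh grid intensity are illustrative values.

```python
def workload_carbon_g(energy_joules: float, grid_gco2_per_kwh: float) -> float:
    """Convert metered energy (e.g. deltas of DCGM's cumulative energy
    field, in joules) to grams CO2-equivalent via a grid intensity signal."""
    kwh = energy_joules / 3.6e6  # 1 kWh = 3.6 MJ
    return kwh * grid_gco2_per_kwh

# 700 W sustained for one hour on a 400 gCO2eq/kWh grid: 0.7 kWh -> 280 g
grams = workload_carbon_g(energy_joules=700 * 3600, grid_gco2_per_kwh=400)
```

The grid intensity input is exactly the value an external carbon API (or the internal fallback model) would supply.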

2.7 Anomaly Detection

Key Papers and Systems:

  • Dean & Barroso (2013), "The Tail at Scale" — motivates tail latency as the primary SLO metric; a system composed of many services has a combinatorially higher probability of hitting slow outliers; DistOS's anomaly detection should focus on detecting when individual nodes begin exhibiting tail latency behaviour before it becomes cluster-wide
  • Chandola et al. (2009), "Anomaly Detection: A Survey" — comprehensive taxonomy of anomaly detection approaches; DistOS should implement statistical anomaly detection (control charts, CUSUM) for low-latency alerting on simple metric deviations, and ML-based anomaly detection (Isolation Forest, LSTM-AD) for complex multi-metric anomalies
  • Liu et al. (2008), "Isolation Forest" — O(n log n) anomaly detection algorithm well-suited to streaming metric data; DistOS's anomaly detection pipeline should evaluate Isolation Forest for high-dimensional metric vectors
  • Hundman et al. (2018), "Detecting Spacecraft Anomalies Using LSTMs and Nonparametric Dynamic Thresholding" — LSTM-based anomaly detection with dynamic thresholds; relevant for DistOS GPU metrics that exhibit strong periodicity (training loss curves, checkpoint intervals) that static thresholds cannot handle
  • Prometheus Alertmanager and recording rules — production alerting infrastructure; DistOS's anomaly detection can integrate with Prometheus recording rules for pre-computed alert expressions and Alertmanager for routing, deduplication, and silencing
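Of the statistical detectors listed above, CUSUM is compact enough to sketch directly; the slack `k` and threshold `h` below are illustrative, not tuned, and the latency series is invented for the example.

```python
def cusum(samples, target, k, h):
    """One-sided CUSUM control chart.

    Returns the first index at which the cumulative positive deviation from
    `target` (less slack `k`) exceeds threshold `h`, or None. Suited to
    low-latency detection of sustained drifts, e.g. a node's latency
    creeping upward before it becomes cluster-wide tail behaviour.
    """
    s = 0.0
    for i, x in enumerate(samples):
        s = max(0.0, s + (x - target - k))
        if s > h:
            return i
    return None

latencies = [10, 11, 10, 12, 18, 19, 20, 21]  # ms; drift begins at index 4
idx = cusum(latencies, target=10.5, k=1.0, h=10.0)
```

A control chart like this would feed Alertmanager-style routing, while heavier multi-metric detectors (Isolation Forest, LSTM-AD) run on a slower path.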

2.8 KACHING Integration

The KACHING enterprise licensing module is DistOS's commercial monetisation layer. Council-telemetry's metering data is the input to KACHING's billing engine.

Key Papers:

  • Lim et al. (2009), "Characterizing Web Server Capacity" — motivates the importance of accurate metering as the foundation for any chargeback model; inaccurate meters lead to customer disputes and revenue leakage
  • Amazon Web Services billing model (EC2 detailed billing, Cost Allocation Tags) — practical reference for chargeback model design; DistOS's billing tags should be compatible with AWS cost allocation tag conventions to ease hybrid billing for organisations already using AWS cost accounting

3. Agent Roles

| Role | Count | Responsibilities |
| --- | --- | --- |
| lead-architect | 2 | Cross-domain coherence, KACHING integration oversight, SLO enforcement arbitration, synthesis liaison |
| gpu-metrics-engineer | 6 | DCGM integration design, counter selection policy, per-MIG accounting design, sampling rate analysis |
| telemetry-pipeline-engineer | 6 | Collection pipeline architecture, Kafka/OTLP transport design, time-series storage selection, throughput analysis |
| cost-attribution-specialist | 4 | Multi-tenant attribution algorithms, contention attribution, shared resource cost allocation models |
| slo-engineer | 6 | SLO language design, error budget model, SLO evaluation runtime, enforcement trigger design |
| distributed-tracing-specialist | 4 | OpenTelemetry integration, trace context propagation for CHORUS gossip, span model for agent interactions |
| energy-carbon-analyst | 4 | DCGM power metering, Intel RAPL integration, Kepler evaluation, carbon intensity signal design |
| anomaly-detection-engineer | 4 | Statistical and ML-based anomaly detection design, alert routing, dynamic threshold design |
| quota-manager | 2 | Per-tenant quota definition, quota accounting protocol, enforcement escalation |
| formal-spec-author | 2 | TLA+ specifications for SLO evaluation state machine, quota accounting invariants |

Total: 40 agents


4. Key Deliverables

Phase 1: Research (Days 1-3)

| Deliverable | UCXL Address | Description |
| --- | --- | --- |
| GPU metrics survey | `ucxl://council-telemetry:researcher@DistOS:telemetry/*^/research/gpu-metrics-survey.md` | DCGM counter catalogue, sampling overhead analysis, per-MIG accounting capabilities |
| Telemetry pipeline survey | `ucxl://council-telemetry:researcher@DistOS:telemetry/*^/research/telemetry-pipeline-survey.md` | Prometheus/Thanos vs VictoriaMetrics vs OpenTelemetry Collector comparison |
| SLO framework survey | `ucxl://council-telemetry:researcher@DistOS:telemetry/*^/research/slo-framework-survey.md` | SRE SLO model, error budgets, GPU-specific SLO dimensions |
| Energy metering survey | `ucxl://council-telemetry:researcher@DistOS:telemetry/*^/research/energy-metering-survey.md` | DCGM power fields, Intel RAPL, Kepler, carbon intensity API options |
| Telemetry interface requirements | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/research/telemetry-interface-requirements.md` | What events and metrics council-telemetry requires from council-sched, council-mem, and council-net |

Phase 2: Architecture (Days 3-6)

| Deliverable | UCXL Address | Description |
| --- | --- | --- |
| GPU metering design | `ucxl://council-telemetry:gpu-metrics-engineer@DistOS:telemetry/*^/architecture/gpu-metering-design.md` | Counter selection, sampling policy, per-MIG attribution design |
| Telemetry pipeline design | `ucxl://council-telemetry:telemetry-pipeline-engineer@DistOS:telemetry/*^/architecture/telemetry-pipeline-design.md` | Collection, transport, aggregation, and storage architecture |
| Cost attribution model | `ucxl://council-telemetry:cost-attribution-specialist@DistOS:telemetry/*^/architecture/cost-attribution-model.md` | Attribution algorithm, contention handling, shared resource allocation |
| SLO framework design | `ucxl://council-telemetry:slo-engineer@DistOS:telemetry/*^/architecture/slo-framework-design.md` | SLO language spec, error budget tracking, enforcement mechanism |
| Energy and carbon model | `ucxl://council-telemetry:energy-carbon-analyst@DistOS:telemetry/*^/architecture/energy-carbon-model.md` | Per-workload energy metering, carbon signal production |
| Anomaly detection design | `ucxl://council-telemetry:anomaly-detection-engineer@DistOS:telemetry/*^/architecture/anomaly-detection-design.md` | Algorithm selection, threshold design, alert routing |
| Telemetry API specification | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/architecture/telemetry-api-spec.md` | API surface that other councils must implement to emit events to council-telemetry |
| DR-TEL-001: Time-series storage backend | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/decisions/DR-TEL-001-tsdb-selection.md` | Decision record: Thanos vs VictoriaMetrics vs Mimir |
| DR-TEL-002: SLO enforcement mechanism | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/decisions/DR-TEL-002-slo-enforcement.md` | Decision record: throttling vs preemption as primary enforcement |
| DR-TEL-003: Energy attribution model | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/decisions/DR-TEL-003-energy-attribution.md` | Decision record: direct measurement vs performance-counter estimation |

Phase 3: Formal Specification (Days 6-10)

| Deliverable | UCXL Address | Description |
| --- | --- | --- |
| SLO evaluation TLA+ specification | `ucxl://council-telemetry:formal-spec-author@DistOS:telemetry/*^/specs/slo-evaluation.tla` | Invariant: no workload's SLO is evaluated on stale data; liveness: every SLO violation triggers enforcement within a bounded interval |
| Quota accounting TLA+ specification | `ucxl://council-telemetry:formal-spec-author@DistOS:telemetry/*^/specs/quota-accounting.tla` | Invariant: sum of attributed quota across all tenants does not exceed physical resource capacity; no negative quota balances |
| Cost attribution TLA+ specification | `ucxl://council-telemetry:formal-spec-author@DistOS:telemetry/*^/specs/cost-attribution.tla` | Invariant: every resource unit consumed is attributed to exactly one tenant; attribution completeness property |
| Telemetry data model schema | `ucxl://council-telemetry:telemetry-pipeline-engineer@DistOS:telemetry/*^/specs/telemetry-data-model.json` | JSON Schema definition of all telemetry event types, label cardinality constraints |

Phase 4: Integration (Days 10-12)

| Deliverable | UCXL Address | Description |
| --- | --- | --- |
| Scheduling event interface | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/council-sched-interface.md` | Events required from council-sched: workload placement, preemption, completion, checkpoint events |
| Memory metrics interface | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/council-mem-interface.md` | Metrics required from council-mem: allocation, deallocation, bandwidth consumption, cache miss rates |
| Network metrics interface | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/council-net-interface.md` | Metrics required from council-net: per-link bandwidth, RDMA queue depths, retransmission rates |
| KACHING billing integration | `ucxl://council-telemetry:cost-attribution-specialist@DistOS:telemetry/*^/integration/kaching-billing-integration.md` | Metering event schema for KACHING; billing granularity; attribution latency SLA |
| Security compliance review | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/security-compliance-review.md` | Confirmation that telemetry pipeline satisfies council-sec's data classification requirements |

Phase 5: Documentation (Days 12-14)

| Deliverable | UCXL Address | Description |
| --- | --- | --- |
| Telemetry reference specification | `ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/docs/telemetry-reference-spec.md` | Complete observability and accounting model specification |
| SLO configuration guide | `ucxl://council-telemetry:slo-engineer@DistOS:telemetry/*^/docs/slo-configuration-guide.md` | How tenants define SLOs, how operators set enforcement policies |
| Operator metering guide | `ucxl://council-telemetry:gpu-metrics-engineer@DistOS:telemetry/*^/docs/operator-metering-guide.md` | DCGM configuration, counter selection, sampling rate tuning |
| Carbon accounting guide | `ucxl://council-telemetry:energy-carbon-analyst@DistOS:telemetry/*^/docs/carbon-accounting-guide.md` | How to read per-workload energy reports, carbon intensity interpretation |
| Decision archaeology summary | `ucxl://council-arch:narrator@DistOS:decision-history/*^/narratives/council-telemetry-design-narrative.md` | Human-readable narrative of how resource accounting design decisions evolved |

5. Decision Points

DP-TEL-01: GPU Metric Sampling Strategy Should council-telemetry use DCGM's push model (DCGM publishes metrics to a Kafka topic at a configured interval) or a pull model (the telemetry collector scrapes DCGM's gRPC API at configurable intervals, Prometheus-style)? The push model reduces collection latency but couples the DCGM configuration to the pipeline design; the pull model is more operationally familiar and aligns with Prometheus conventions, but adds a scrape hop. Additionally, should sampling intervals be fixed (1 second for all metrics) or adaptive (higher frequency during SLO-critical periods, lower frequency during idle periods)?
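If the adaptive option is chosen, the policy can be as small as a headroom-to-interval mapping; a sketch with illustrative thresholds, where `slo_headroom` is a hypothetical 0-to-1 health signal (1.0 fully healthy, 0.0 violating).

```python
def sampling_interval_ms(slo_headroom: float, base_ms: int = 1000) -> int:
    """Adaptive sampling sketch: sample faster as SLO headroom shrinks."""
    if slo_headroom < 0.1:
        return 100           # near violation: anomaly-detection-grade sampling
    if slo_headroom < 0.5:
        return base_ms // 2  # under pressure: intermediate rate
    return base_ms           # healthy: billing-grade 1-second sampling

assert sampling_interval_ms(0.9) == 1000
assert sampling_interval_ms(0.3) == 500
assert sampling_interval_ms(0.05) == 100
```

Whatever thresholds are chosen, the fast rate must stay within the 3x burst headroom budgeted for the pipeline.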

DP-TEL-02: Time-Series Database Selection Three candidates for the long-term metrics storage backend have been identified: (a) Prometheus with Thanos for object storage backend — widely adopted, strong community, operational maturity; (b) VictoriaMetrics — significantly higher storage efficiency (3-5x compression vs Prometheus), higher ingest throughput, but smaller ecosystem; (c) OpenTelemetry Collector with a custom backend — maximum flexibility but highest implementation cost. The decision should be driven by the ingest rate calculation (approximately 39 MB/s at 1-second granularity for 1024 nodes × 8 GPUs × 100 metrics) and the query latency requirement for SLO evaluation (sub-second query execution for dashboard rendering).

DP-TEL-03: SLO Enforcement Mechanism When an SLO violation is detected, what enforcement actions should DistOS take and in what order? Candidate escalation ladder: (1) warning signal to workload orchestrator (advisory); (2) CPU/memory throttling via cgroups; (3) GPU bandwidth throttling via NVIDIA MPS (Multi-Process Service) client thread percentage; (4) GPU time-slice reduction; (5) workload preemption (destructive if no checkpoint exists). The council must define the escalation policy including dwell times at each stage, the maximum rate of escalation, and the de-escalation criteria. A workload should not oscillate between enforcement states faster than its checkpoint interval.
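The dwell-time and anti-oscillation constraints can be captured in a few lines; in this sketch the ladder names and the "never faster than the checkpoint interval" rule mirror the text, while the function shape is illustrative.

```python
LADDER = ["advise", "cpu_throttle", "gpu_mps_limit", "timeslice_cut", "preempt"]

def next_level(level, time_at_level_s, dwell_s, checkpoint_interval_s):
    """Escalate one rung at a time, only after dwelling at the current rung
    for at least `dwell_s`, and never faster than the workload's checkpoint
    interval, so enforcement cannot oscillate across an un-checkpointable gap."""
    min_dwell = max(dwell_s, checkpoint_interval_s)
    if level >= len(LADDER) - 1 or time_at_level_s < min_dwell:
        return level
    return level + 1

assert LADDER[next_level(0, 30, dwell_s=60, checkpoint_interval_s=120)] == "advise"
assert LADDER[next_level(0, 120, dwell_s=60, checkpoint_interval_s=120)] == "cpu_throttle"
assert next_level(4, 9999, dwell_s=60, checkpoint_interval_s=120) == 4  # top rung
```

De-escalation would follow the same rule in reverse, with its own (typically longer) dwell time.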

DP-TEL-04: Cost Attribution for Contended Resources When two tenants compete for a shared resource (e.g., L2 cache, NVLink bandwidth) and both suffer performance degradation, how should the cost be attributed? Option A: attribute the physical resource consumption to both tenants proportionally (each pays for the resource they consumed regardless of contention); Option B: attribute the wasted capacity caused by contention to the lower-priority tenant (the disruptor pays); Option C: attribute contention overhead as an unallocated cluster cost absorbed by the operator. Option A is simple but may not align with tenants' SLO expectations; Option B requires a contention detection mechanism that is technically complex but aligns with the principle of least cost surprise.

DP-TEL-05: Energy Attribution Granularity Should energy attribution be per-GPU (simple: divide total GPU power draw by workload count), per-MIG-instance (more precise: DCGM provides per-MIG power estimates), per-CUDA-stream (most precise: requires performance counter access that may not be available in multi-tenant mode), or per-transaction (for inference serving: most useful for KACHING billing of per-request costs)? The appropriate granularity depends on the KACHING billing model and the measurement capabilities of the hardware. The council should document the measurement capability gaps and the estimation methodologies used to bridge them.

DP-TEL-06: Telemetry Data Retention and Downsampling Raw metric data at 1-second granularity generates approximately 39 MB/s. A 90-day retention period at this rate requires approximately 300 TB of storage. Is this acceptable, or should DistOS implement a tiered retention policy: raw data retained for 7 days, 1-minute downsampled for 30 days, 5-minute downsampled for 1 year? Downsampled data loses information about short-duration anomalies but dramatically reduces storage cost. The decision must specify the minimum retention period for data used in SLO violation forensics (which may require second-level resolution for the duration of the SLO window).
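A quick executable check of the retention arithmetic, ignoring compression and the extra aggregate series (min/max/avg) that rollups usually add: 39 MB/s of raw data held for 90 days works out to roughly 300 TB, while the tiered schedule stays under 30 TB.

```python
def retention_tb(rate_mb_s, schedule):
    """Storage for a tiered retention schedule.

    `schedule` lists (days, downsample_factor) tiers: factor 1 keeps raw
    1-second points, 60 keeps 1-minute rollups, 300 keeps 5-minute rollups.
    """
    mb = sum(rate_mb_s * days * 86_400 / factor for days, factor in schedule)
    return mb / 1e6

flat = retention_tb(39, [(90, 1)])                         # ≈ 303 TB raw
tiered = retention_tb(39, [(7, 1), (30, 60), (365, 300)])  # ≈ 29 TB
```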

DP-TEL-07: Carbon Signal Architecture Should DistOS maintain an internal carbon intensity model (estimating carbon per kWh from the data centre's contracted electricity mix), integrate with an external real-time carbon API (such as Electricity Maps or WattTime), or provide both with the internal model as a fallback? The external API provides higher accuracy and real-time grid-responsive signals but introduces an external dependency; the internal model is always available but may be inaccurate. The carbon signal is advisory (consumed by council-sched for carbon-aware placement), so accuracy requirements are lower than for billing data.
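The "both, with internal fallback" variant is straightforward to express. A sketch; the provider callable and contracted-mix default are placeholders for whatever API and figure the operator configures:

```python
from typing import Callable, Optional

def carbon_intensity(external_api: Callable[[], Optional[float]],
                     internal_model_gco2_kwh: float) -> tuple[float, str]:
    """Return (gCO2e/kWh, source). Prefer the external real-time signal;
    fall back to the internal contracted-mix model on any failure, so a
    carbon value is always available to council-sched."""
    try:
        value = external_api()
        if value is not None:
            return value, "external"
    except Exception:
        pass  # advisory signal: swallow the failure and degrade gracefully
    return internal_model_gco2_kwh, "internal-fallback"
```

Because the signal is advisory, a stale or fallback value degrades placement quality rather than correctness, which is what makes the hybrid acceptable.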

DP-TEL-08: Anomaly Detection Sensitivity and Alert Fatigue Anomaly detection systems in production clusters routinely suffer from alert fatigue: so many anomalies are detected that operators stop responding to alerts. DistOS must define an alert quality target (e.g., 90% of alerts should correspond to a genuine degradation event, measured over a rolling 7-day window) and design the anomaly detection algorithm parameters to achieve it. The council must choose between a sensitivity-first approach (catch everything, suppress via correlation) and a precision-first approach (only alert when confidence is high, miss some real anomalies).
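The alert quality target is directly measurable once fired alerts are labelled (by operator triage or incident correlation). A minimal sketch of the rolling-window check; names are illustrative:

```python
def alert_precision(labelled_alerts: list[bool]) -> float:
    """Precision over a window: fraction of fired alerts later confirmed
    as genuine degradation events (True = genuine)."""
    if not labelled_alerts:
        return 1.0  # no alerts fired: vacuously within target
    return sum(labelled_alerts) / len(labelled_alerts)

def meets_quality_target(labelled_alerts: list[bool],
                         target: float = 0.90) -> bool:
    """The rolling 7-day check behind the 90% alert quality target."""
    return alert_precision(labelled_alerts) >= target
```

Note that precision alone cannot distinguish the two approaches: a precision-first detector also needs a recall estimate (missed genuine degradations), which requires labelling incidents that never produced an alert.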


6. Dependencies on Other Councils

council-sched (Process Scheduling) — Events Required

Council-sched is the primary event source for workload-level attribution. Council-telemetry requires the following events from council-sched:

  • Workload placement event: workload ID, tenant ID, node ID, GPU ID (or MIG instance ID), start timestamp
  • Workload completion event: workload ID, end timestamp, exit status
  • Preemption event: workload ID, timestamp, preempting workload ID, reason
  • Checkpoint event: workload ID, checkpoint ID, timestamp, checkpoint size, Weka FS location
  • Priority change event: workload ID, old priority, new priority, timestamp

Without these events, council-telemetry cannot attribute GPU hardware counter readings to specific workloads. The attribution is performed by joining the scheduling event stream (from council-sched) with the DCGM counter stream on (node_id, gpu_id, time_window).
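The join described above can be sketched as follows. A simplification: it assumes each counter window falls wholly inside one residency interval, whereas a production implementation must pro-rate windows that straddle placement boundaries; field names follow the event list but are otherwise hypothetical:

```python
from collections import defaultdict

def attribute_counters(placements: list[dict],
                       counters: list[dict]) -> dict[str, float]:
    """Join the scheduling event stream with the DCGM counter stream on
    (node_id, gpu_id), attributing each counter window to the workload
    resident on that GPU when the window started."""
    by_gpu = defaultdict(list)
    for p in placements:
        by_gpu[(p["node_id"], p["gpu_id"])].append(p)

    usage: dict[str, float] = defaultdict(float)  # workload_id -> SM-seconds
    for c in counters:
        for p in by_gpu[(c["node_id"], c["gpu_id"])]:
            if p["start_ts"] <= c["window_start_ts"] < p["end_ts"]:
                usage[p["workload_id"]] += c["sm_active_seconds"]
    return dict(usage)
```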

Interface artifact: ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/council-sched-interface.md

council-mem (Distributed Memory) — Metrics Required

Council-telemetry requires memory utilisation metrics that council-mem produces or can expose:

  • Per-workload GPU memory allocation size (bytes, by memory type: HBM, NVLink DRAM, Weka-backed)
  • Memory bandwidth consumption per workload (GB/s, HBM and NVLink separately)
  • Memory allocation and deallocation events for quota accounting
  • L2 cache hit rate per workload (for contention detection)

Interface artifact: ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/council-mem-interface.md

council-net (Network Stack) — Metrics Required

Council-telemetry requires network metrics to provide end-to-end resource accounting:

  • Per-flow bandwidth consumption (bytes transferred, broken down by tenant workload where SPIFFE identity is available)
  • NVLink utilisation per link (for intra-node GPU-to-GPU bandwidth attribution)
  • InfiniBand queue pair statistics (for RDMA-based checkpoint I/O attribution)
  • Weka client I/O statistics per node (to attribute filesystem I/O to tenants)

Interface artifact: ucxl://council-telemetry:lead-architect@DistOS:telemetry/*^/integration/council-net-interface.md

council-sec (Security Model) — Constraint Direction

Council-sec provides the following constraints on council-telemetry's design:

  • Telemetry data classification: raw hardware counters are classified as internal operational data; per-tenant cost attribution data is classified as confidential tenant data; no tenant may query another tenant's attribution data
  • Audit log requirements: all SLO enforcement actions (throttling, preemption) must generate signed audit log entries readable by council-sec's audit log query interface
  • Telemetry pipeline security: the DCGM collection daemon, the Kafka transport, and the TSDB must all operate under SPIFFE-issued SVIDs; no unauthenticated writes to the metrics store are permitted

Constraint artifact: ucxl://council-sec:audit-engineer@DistOS:security/*^/integration/constraints-for-council-telemetry.md

KACHING Integration

KACHING is the CHORUS enterprise licensing and billing module. Council-telemetry is the sole authoritative source of metering data for KACHING billing. The integration requires:

  • A streaming metering event feed from council-telemetry to KACHING (Kafka topic with per-workload resource consumption events at billing granularity)
  • Immutable billing records: metering events used for billing must be cryptographically signed by council-telemetry and stored in KACHING's tamper-evident billing ledger
  • Reconciliation: periodic reconciliation between council-telemetry's attribution totals and KACHING's billed totals; discrepancies must be flagged as billing anomalies
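The signed-record requirement might look like the following sketch. HMAC-SHA256 stands in for whatever signature scheme the ratified billing contract specifies (an asymmetric scheme is likelier in practice, since KACHING should be able to verify without holding the signing key); the event fields are illustrative:

```python
import hashlib
import hmac
import json

def sign_metering_event(event: dict, key: bytes) -> dict:
    """Canonicalise a metering event and attach a tamper-evident tag
    before it is published to the KACHING billing topic."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":")).encode()
    tag = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return {"event": event, "sig": tag}

def verify_metering_event(record: dict, key: bytes) -> bool:
    """KACHING-side check before appending to the billing ledger."""
    payload = json.dumps(record["event"], sort_keys=True,
                         separators=(",", ":")).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, record["sig"])
```

Reconciliation then reduces to recomputing attribution totals from the verified events and diffing them against KACHING's billed totals.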

Interface artifact: ucxl://council-telemetry:cost-attribution-specialist@DistOS:telemetry/*^/integration/kaching-billing-integration.md


7. WHOOSH Configuration

Team Formation

{
  "council_id": "council-telemetry",
  "team_topic": "whoosh.team.distos-council-telemetry",
  "composition": {
    "lead-architect": 2,
    "gpu-metrics-engineer": 6,
    "telemetry-pipeline-engineer": 6,
    "cost-attribution-specialist": 4,
    "slo-engineer": 6,
    "distributed-tracing-specialist": 4,
    "energy-carbon-analyst": 4,
    "anomaly-detection-engineer": 4,
    "quota-manager": 2,
    "formal-spec-author": 2
  },
  "total_agents": 40,
  "quorum_policy": {
    "artifact_publication": "simple_majority",
    "architecture_decision": "two_thirds_supermajority",
    "slo_enforcement_policy": "all_slo_engineers_plus_one_lead",
    "kaching_billing_contract": "lead_architect_plus_cost_attribution_specialists",
    "formal_spec_ratification": "formal_spec_authors_plus_one_lead"
  },
  "join_timeout_minutes": 30,
  "inactivity_eviction_minutes": 120,
  "special_policy": "telemetry_api_spec_requires_acknowledgement_from_council_sched_council_mem_council_net_before_finalisation"
}

Subchannels

| Subchannel | Topic Suffix | Purpose |
| --- | --- | --- |
| Control | .control | Role assignments, join/leave events, phase transitions |
| Research | .research | Literature survey coordination, throughput calculations, system comparisons |
| Pipeline | .pipeline | Telemetry collection and storage design |
| SLO | .slo | SLO language design, enforcement policy debate |
| Attribution | .attribution | Cost attribution algorithm design, contention handling |
| Energy | .energy | Power metering, carbon signal design |
| Integration | .integration | Interface contract negotiation with other councils and KACHING |
| Voting | .voting | Quorum votes on decision records and artifact ratification |
| Artifacts | .artifacts | UCXL artifact announcement references |

Quorum Configuration

Council-telemetry, being the smallest of the DistOS research councils at 40 agents, operates with proportionally lower absolute quorum counts. A simple majority (21 agents) suffices for artifact publication. Architecture decisions require 27 agents (two-thirds). The KACHING billing contract requires sign-off from both lead architects and all four cost attribution specialists because billing accuracy errors have direct financial consequences that cannot be corrected retroactively without customer disputes.


8. Success Criteria

  1. Metering coverage: A complete catalogue of all GPU and system resources tracked by DistOS's telemetry layer, with coverage gaps explicitly documented and estimation methodologies described for each gap
  2. Pipeline throughput verification: An architectural analysis showing that the selected telemetry pipeline can sustain the calculated peak ingest rate (approximately 39 MB/s base plus 3x burst headroom) with sub-500ms end-to-end latency from metric production to TSDB availability
  3. Attribution completeness proof: A TLA+ specification with the invariant that every resource unit consumed in the cluster is attributed to exactly one tenant in every execution; verified by council-verify
  4. SLO enforcement latency: A formal specification proving that every SLO violation is detected and enforcement action initiated within a bounded interval (target: 5 seconds from violation onset to enforcement action, 30 seconds to preemption if lower enforcement levels are insufficient)
  5. KACHING integration: A ratified billing event schema that covers all billable resource types (GPU compute time, GPU memory-seconds, NVLink bandwidth, Weka I/O), with signed acknowledgement from the KACHING module team
  6. Energy metering validation: A methodology document showing how per-workload energy estimates are derived and validated, including uncertainty bounds on the estimates
  7. Interface contracts ratified: All three dependency interface contracts (council-sched, council-mem, council-net) are signed off by both parties and published to UCXL before Phase 3 begins
  8. Decision record completeness: All 8 decision points have corresponding Decision Records with rationale and alternatives documented

9. Timeline

Phase 1: Research (Days 1-3)

  • Day 1: Activate council, assign roles; gpu-metrics-engineers begin DCGM counter catalogue; telemetry-pipeline-engineers begin throughput calculations; lead architects draft telemetry interface requirements document for distribution to other councils
  • Day 2: GPU metrics survey, telemetry pipeline survey, and energy metering survey completed; SLO framework survey in progress; throughput estimate finalised and published
  • Day 3: All research artifacts published to UCXL; telemetry interface requirements document v0 distributed to council-sched, council-mem, and council-net for early review; research phase closes with prioritised list of open questions

Phase 2: Architecture (Days 3-6)

  • Day 3-4: GPU metering design and telemetry pipeline design drafted; initial decision records for DP-TEL-01 and DP-TEL-02 circulated; cost attribution model draft initiated
  • Day 4-5: SLO framework design and energy/carbon model drafted; telemetry API specification drafted and distributed to dependent councils for review; anomaly detection design initiated; DP-TEL-01 through DP-TEL-05 voted on
  • Day 5-6: Remaining decision records voted on; telemetry API specification finalised pending council-sched, council-mem, and council-net acknowledgement; architecture artifacts published; architecture phase closes

Phase 3: Formal Specification (Days 6-10)

  • Day 6-7: SLO evaluation TLA+ specification and quota accounting TLA+ specification drafted; telemetry data model schema published; council-verify liaison established
  • Day 7-8: Cost attribution TLA+ specification drafted; council-verify begins model checking on SLO evaluation spec
  • Day 8-9: First model-checking results from council-verify; spec refinements as needed; KACHING billing integration document drafted
  • Day 9-10: All TLA+ specifications in final state; formal spec phase closes with model-checking status documented for all three TLA+ specifications

Phase 4: Integration (Days 10-12)

  • Day 10-11: Interface contracts with council-sched, council-mem, and council-net finalised; any conflicts escalated to council-synth; KACHING billing integration reviewed by KACHING team
  • Day 11-12: council-sec security compliance review completed; any outstanding interface issues resolved; final integration artifacts published

Phase 5: Documentation (Days 12-14)

  • Day 12-13: Telemetry reference specification assembled; SLO configuration guide and operator metering guide drafted
  • Day 13-14: Carbon accounting guide and KACHING integration documentation completed; council-arch produces decision archaeology narrative; final UCXL navigability audit confirms all artifact addresses resolve correctly