Feature Request: CHORUS Swarm Scaling and Re-parametrization without Docker-in-Docker #7

Closed
opened 2025-09-21 13:25:44 +00:00 by tony · 5 comments
Owner

CHORUS Swarm Scaling Feature Request

Executive Summary

Implement capability for WHOOSH to scale and re-parametrize CHORUS instances in Docker Swarm without Docker-in-Docker. WHOOSH will communicate with the Docker Remote API on the Swarm manager (via restricted socket proxy or TLS), then pass role/model/prompt to new replicas via environment/configs or bootstrap fetch patterns.

Technical Implementation

1. Give WHOOSH Controlled Access to Swarm (No DinD)

Option A: Socket Proxy (Recommended)

  • Run tecnativa/docker-socket-proxy on manager with only needed endpoints (services, tasks)
  • Point WHOOSH at proxy via DOCKER_HOST=tcp://socket-proxy:2375
  • Benefits: No /var/run/docker.sock mount, scoped permissions

Option B: Docker Remote API over TLS

  • Enable daemon TCP listener on manager (:2376) with TLS client certs
  • Mount WHOOSH client certs as secrets, set DOCKER_HOST=tcp://manager:2376
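
For illustration, both options reduce to a one-line client setup in the Docker SDK. A minimal sketch (hostnames and cert paths are assumptions):

# Sketch: connecting WHOOSH to the Swarm manager under either option
import docker
from docker.tls import TLSConfig

# Option A: via the socket proxy (plain TCP; permissions enforced by the proxy)
proxy_client = docker.DockerClient(base_url='tcp://socket-proxy:2375')

# Option B: via the daemon's TLS listener, client certs mounted as Swarm secrets
tls = TLSConfig(
    client_cert=('/run/secrets/whoosh-cert.pem', '/run/secrets/whoosh-key.pem'),
    ca_cert='/run/secrets/ca.pem',
    verify=True,
)
tls_client = docker.DockerClient(base_url='tcp://manager:2376', tls=tls)

# Both clients expose the same API surface
print(proxy_client.services.list())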

2. WHOOSH → Swarm Control Implementation

Use the official Docker SDK for programmatic service management:

# Example Python implementation
import time

import docker

# Socket proxy connection
client = docker.DockerClient(base_url='tcp://socket-proxy:2375')

service = client.services.get('chorus_app')  # Target CHORUS service

# Scale replicated service
spec = service.attrs['Spec']
spec['Mode']['Replicated']['Replicas'] = 8  # Scale to 8 replicas

# Add config epoch so Swarm sees a changed spec and rolls the tasks
env = spec['TaskTemplate']['ContainerSpec'].get('Env', [])
env = [e for e in env if not e.startswith('CHORUS_CONFIG_EPOCH=')]
env.append(f'CHORUS_CONFIG_EPOCH={int(time.time())}')
spec['TaskTemplate']['ContainerSpec']['Env'] = env

# Apply update via the low-level API; the high-level Service.update()
# does not accept a raw TaskTemplate dict
client.api.update_service(
    service.id,
    version=service.version,  # current spec version (optimistic locking)
    task_template=spec['TaskTemplate'],
    name=spec['Name'],
    labels=spec.get('Labels'),
    mode=spec['Mode'],
    update_config=spec.get('UpdateConfig'),
    networks=spec.get('Networks'),
    endpoint_spec=spec.get('EndpointSpec')
)

3. Parameter Passing Patterns

Pattern 1: Bootstrap Pull (Recommended)

  • WHOOSH writes task assignments to UCXL or WHOOSH API
  • CHORUS replicas bootstrap by fetching params using task identity
  • Supports unique prompts per replica

Example service configuration:

services:
  chorus_app:
    image: your/chorus:latest
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints: ["node.labels.role==workers"]
    environment:
      - WHOOSH_ASSIGN_URL=http://whoosh:8080/assign
      - CHORUS_CLUSTER_ID=prod
      - TASK_SLOT={{.Task.Slot}}        # Swarm template
      - TASK_ID={{.Task.ID}}
    command: ["/bin/sh","-lc","/app/bootstrap.sh && exec /app/chorus"]
    networks: [overlay]

Bootstrap script:

#!/bin/sh
set -eu
curl -fsS "$WHOOSH_ASSIGN_URL?cluster=$CHORUS_CLUSTER_ID&slot=$TASK_SLOT&task=$TASK_ID" > /config/assignment.json
# Process assignment and export env vars, e.g. ROLE=$(jq -r .role /config/assignment.json)

Pattern 2: Config Objects

  • WHOOSH creates Swarm configs per epoch/role group
  • Service mounts config at /config/spec.json
  • CHORUS reads file on start or watches config epoch env
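
A minimal sketch of Pattern 2 with the Docker SDK (the payload shape and names are illustrative):

# Sketch: create an epoch-stamped Swarm config and reference it from a service
import json, time
import docker
from docker.types import ConfigReference

client = docker.DockerClient(base_url='tcp://socket-proxy:2375')

epoch = int(time.time())
payload = {"role": "planner", "model": "llama3.1-8b-instruct-q6"}  # illustrative

cfg_name = f"chorus-spec-{epoch}"
cfg = client.configs.create(
    name=cfg_name,
    data=json.dumps(payload).encode(),
    labels={"com.chorus.managed": "true"},
)

# Mounts the config at /config/spec.json inside each task
ref = ConfigReference(config_id=cfg.id, config_name=cfg_name, filename="/config/spec.json")
# Pass configs=[ref] to client.services.create(...) or to a service update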

Pattern 3: Per-Role Services

  • Separate services: chorus-writer, chorus-planner, chorus-coder
  • Each with distinct ENV/Config/Secrets
  • Simple scaling and clear intent

4. New Role-Specific Service Creation

# Create role-specific CHORUS service
spec_env = [
    "ROLE=planner",
    "MODEL=llama3.1-8b-instruct-q6",
    "PROMPT_UCXL=ucxl://projects/alpha/specs/planner_prompt.md",
    "WHOOSH_ASSIGN_URL=http://whoosh:8080/assign",
]

# services.create() builds the TaskTemplate itself from image/command/env,
# so no explicit TaskTemplate (and no duplicated image/env) is needed
client.services.create(
    image="your/chorus:latest",
    name="chorus_planner",
    command=["/bin/sh", "-lc", "/app/bootstrap.sh && exec /app/chorus"],
    env=spec_env,
    networks=["overlay"],
    mode=docker.types.ServiceMode("replicated", replicas=3)
)

Security & Operations

  • Never mount /var/run/docker.sock directly into WHOOSH
  • Use socket-proxy with restricted endpoints: -e SERVICES=1 -e TASKS=1 -e CONFIGS=1 -e SWARM=1 -e POST=1 (POST is required for service create/update/scale calls)
  • For TLS: Keep client keys as swarm secrets, not env vars
  • Implement RBAC around WHOOSH API for scaling operations
  • Use labels for service discovery: com.chorus.managed=true

Advanced Features

Rolling Parameter Updates

  • Support runtime reload via SIGHUP or file watching
  • Message bus integration (Redis/NATS) for instant updates

Service Discovery

  • Label-based filtering for managed services
  • Automated service lifecycle management
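
With the com.chorus.managed label in place, discovery is a one-line filter in the SDK (sketch, reusing the client from section 2):

# Sketch: list all WHOOSH-managed CHORUS services by label
managed = client.services.list(filters={'label': 'com.chorus.managed=true'})
for svc in managed:
    replicas = svc.attrs['Spec']['Mode'].get('Replicated', {}).get('Replicas')
    print(svc.name, replicas)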

Implementation Checklist

  • Set up docker-socket-proxy on Swarm manager
  • Implement WHOOSH Docker SDK integration
  • Create service scaling API endpoints
  • Implement bootstrap parameter fetch pattern
  • Add role-specific service creation
  • Implement security restrictions and RBAC
  • Add service discovery via labels
  • Create rolling update mechanisms
  • Add comprehensive logging and monitoring
  • Write integration tests

Benefits

  • Secure: No Docker-in-Docker, restricted API access
  • Scalable: Programmatic replica management
  • Flexible: Multiple parameter passing patterns
  • Production-Ready: Composed of standard, widely deployed Swarm primitives
  • Maintainable: Clean separation of concerns

References

  • Docker Socket Proxy: https://github.com/Tecnativa/docker-socket-proxy
  • Docker Python SDK: https://docker-py.readthedocs.io/
  • Docker Swarm Services API: https://docs.docker.com/engine/api/v1.42/#tag/Service
Author
Owner

CHORUS Swarm Scaling Implementation: Potential Friction Points

After analyzing the CHORUS codebase, I've identified several significant friction points that could complicate the implementation of the proposed WHOOSH → Swarm scaling feature:

🔴 High-Risk Friction Points

1. License Validation Bottleneck (internal/runtime/shared.go:156-165)

  • Issue: Every CHORUS instance requires license validation with KACHING before starting any P2P operations
  • Current Flow: licenseValidator.Validate() must succeed or the agent fails to start
  • Friction: KACHING needs to handle burst validation requests during rapid scaling
  • Impact: New replicas may fail to start if KACHING is overwhelmed or unavailable
// CRITICAL: Validate license before any P2P operations
runtime.Logger.Info("🔐 Validating CHORUS license with KACHING...")
if err := licenseValidator.Validate(); err != nil {
    return nil, fmt.Errorf("license validation failed: %v", err)
}

2. P2P Network Identity Conflicts (internal/runtime/shared.go:206-217)

  • Issue: Each CHORUS instance generates a unique P2P node ID, but peer discovery is based on these IDs
  • Current Flow: Node creates libp2p host with auto-generated peer ID
  • Friction: Rapid scaling may create peer discovery storms or connection conflicts
  • Impact: Network partitions or connection failures during scale-up events

3. Bootstrap Peer Dependency (internal/runtime/shared.go:410-448)

  • Issue: New instances require existing peers for DHT bootstrap and network joining
  • Current Flow: Hardcoded bootstrap peers in environment variables
  • Friction: Bootstrap peers may be overwhelmed during scale events; new instances can't join if bootstrap fails
  • Impact: New replicas remain isolated from the P2P mesh

🟡 Medium-Risk Friction Points

4. Leader Election Disruption (internal/runtime/shared.go:320-374)

  • Issue: Adding/removing nodes triggers election system recalculation
  • Current Flow: Democratic election with heartbeat-based leadership
  • Friction: Scale events may cause leadership thrashing or temporary admin vacuums
  • Impact: SLURP functionality (admin-only) may become intermittently unavailable

5. Configuration Environment Rigidity (pkg/config/config.go:144-245)

  • Issue: All configuration comes from environment variables, making per-replica customization difficult
  • Current Flow: Single environment-based config per service definition
  • Friction: The proposed "bootstrap pull" pattern conflicts with current environment-only configuration
  • Impact: Role/model/prompt assignment per replica requires significant refactoring

6. DHT Storage Encryption Key Management (internal/runtime/shared.go:450-471)

  • Issue: Encrypted DHT storage uses agent-specific keys
  • Current Flow: Keys generated per node instance
  • Friction: New replicas may not inherit correct decryption capabilities
  • Impact: Context/decision data may become inaccessible to new instances

7. BACKBEAT Integration Coupling (internal/runtime/shared.go:190-203)

  • Issue: Each instance expects BACKBEAT integration for operation tracking
  • Current Flow: Synchronous NATS-based coordination
  • Friction: Scaling events may overwhelm BACKBEAT message throughput
  • Impact: Operation tracking may become unreliable during scale events

🟢 Low-Risk but Notable Friction Points

8. Hypercore Logging Conflicts (internal/runtime/shared.go:220-226)

  • Issue: Each instance creates its own hypercore log
  • Friction: Log coordination across rapidly changing replica sets
  • Impact: Audit trail fragmentation

9. mDNS Discovery Limitations (internal/runtime/shared.go:229-233)

  • Issue: mDNS discovery may not scale well with many replicas
  • Friction: Network broadcast storms during scale events
  • Impact: Degraded peer discovery performance

10. Resource Management Assumptions (docker/docker-compose.yml:84-90)

  • Issue: Current resource limits assume manual scaling
  • Friction: Static resource allocation may not suit dynamic scaling patterns
  • Impact: Inefficient resource utilization during auto-scaling

🛠️ Recommended Mitigation Strategies

For High-Risk Items:

  1. License Validation:

    • Implement KACHING license token caching/sharing
    • Add license validation circuit breaker
    • Consider cluster-wide license validation
  2. P2P Network:

    • Implement staged peer introduction
    • Add connection rate limiting
    • Use deterministic peer ID generation for predictable networking
  3. Bootstrap Dependencies:

    • Implement dynamic bootstrap peer discovery via WHOOSH
    • Add peer list refresh mechanisms
    • Create bootstrap peer health monitoring

For Medium-Risk Items:

  1. Leader Election:

    • Add election stability windows
    • Implement gradual scaling to avoid election storms
    • Consider dedicated admin nodes
  2. Configuration System:

    • Extend config system to support runtime parameter injection
    • Implement WHOOSH → CHORUS parameter API
    • Add configuration templating system

For Implementation Priority:

  1. Phase 1: Address license validation and bootstrap dependencies
  2. Phase 2: Implement configuration flexibility for per-replica parameters
  3. Phase 3: Optimize P2P network behavior for dynamic scaling
  4. Phase 4: Add monitoring and failure recovery mechanisms

🎯 Critical Success Factors

  • KACHING must be scaled to handle burst license validation requests
  • Bootstrap peer infrastructure needs redesign for dynamic environments
  • Configuration system requires major refactoring to support per-replica parameters
  • P2P network behavior needs optimization for rapid topology changes

The current CHORUS architecture assumes relatively static, manually-managed deployments. Implementing dynamic Swarm scaling will require significant architectural changes to handle the distributed systems challenges that arise from rapid replica creation and destruction.

Author
Owner

Claude’s read isn’t wrong. Your top three risks (KACHING burst, bootstrap overload, and peer churn) will absolutely bite you the first time WHOOSH does a fast scale-out. The good news: you don’t need DinD, and you don’t need to rip out CHORUS. You need three surgical changes in CHORUS + one orchestration guardrail in WHOOSH.

Here’s the practical plan, in priority order, with concrete deltas.

0) Orchestrator rule: WHOOSH scales in waves (not floods)

  • Wave size: min( max(3, floor(total_nodes/10)), 8 ) replicas per 20–30s.

  • Gates: Do not start the next wave unless:

    • KACHING p95 latency < X ms for the last 30s,
    • bootstrap pool ≥ 3 healthy peers,
    • BackBeat subject lag < L,
    • join-success rate for last wave ≥ 80%.
  • Backoff/jitter: on any gate failure, exponential backoff (cap 2 minutes), add ±20% jitter.

This single rule keeps most “storm” failure modes contained.
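
A minimal sketch of the wave-size and backoff rules above (thresholds as just listed):

# Sketch: wave sizing and gated backoff for WHOOSH's scaler
import random

def wave_size(total_nodes: int, remaining: int) -> int:
    # min( max(3, floor(total_nodes/10)), 8 ), never more than what's left to add
    return min(max(3, total_nodes // 10), 8, remaining)

def backoff_seconds(attempt: int, base: float = 5.0, cap: float = 120.0) -> float:
    # exponential backoff capped at 2 minutes, with +/-20% jitter
    delay = min(base * (2 ** attempt), cap)
    return delay * random.uniform(0.8, 1.2)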


1) High-risk fixes (must-do)

1.1 KACHING: burst-proof the license path

Problem: cold-start thundering herd.

Fix:

  • Cluster lease tokens: KACHING issues a short-lived, revocable lease (chorus.cluster.lease) valid for N replicas for T minutes.

  • WHOOSH acts as broker: fetch one lease, fan it to the wave via WHOOSH’s assign API; replicas present lease, not raw credentials.

  • In-proc cache + circuit breaker in CHORUS:

    • Cache VALID for lease TTL – skew.
    • On KACHING outage: allow start with grace window (e.g., 90s) while retrying; if still failing after window, self-quarantine (no P2P join) but process local tasks that don’t require P2P.

CHORUS patch (Go, concept):

type LicenseGate struct {
  clusterID  string
  cache      atomic.Value // cached lease + expiry
  breaker    *gobreaker.CircuitBreaker
  graceUntil time.Time
}

func (g *LicenseGate) Validate(ctx context.Context) error {
  if lease := g.loadCachedLease(); lease.ValidNow() {
    return nil
  }
  _, err := g.breaker.Execute(func() (any, error) {
    l, e := kaching.RequestLease(ctx, kaching.LeaseRequest{ClusterID: g.clusterID, Need: 1})
    if e == nil { g.storeLease(l) }
    return nil, e
  })
  if err != nil {
    if time.Now().Before(g.graceUntil) { return nil } // soft-start within the grace window
    return fmt.Errorf("license validation failed: %w", err)
  }
  // refresh the grace window on success
  g.graceUntil = time.Now().Add(90 * time.Second)
  return nil
}

KACHING:

  • Rate-limit per cluster; queue; priority lane for renewals vs new.
  • Export a /health/burst metric WHOOSH can poll as a gate.

1.2 Bootstrap without meltdowns

Problem: DHT join storms against a tiny fixed peer set.

Fix:

  • Maintain a WHOOSH-managed bootstrap pool (10–30 stable peers) with rotating admission window.
  • WHOOSH assigns each new replica a bootstrap subset (2–3 peers) + staggered join delay (0–20s).
  • Connection rate limiter in CHORUS: max N new outbound dials / second; cap concurrent DHT queries.

CHORUS config additions:

BOOTSTRAP_JSON=/config/bootstrap.json  # file or WHOOSH /assign response
DIALS_PER_SEC=5
MAX_CONCURRENT_DHT=16
JOIN_STAGGER_MS=rand(0..20000)
DISABLE_MDNS=true  # in Swarm, keep mDNS disabled

libp2p: disable mDNS in containers; use static/gossip bootstrap via JSON list; enable AutoNAT + connection manager with low water/high water tuned (e.g., 64/256).


1.3 Peer identity sanity

Problem: peer churn plus the risk of identity collisions.

Fix:

  • Generate node keypair deterministically from (ClusterID, TaskID, secret salt) via HKDF; store ephemeral key in tmpfs. This gives stable IDs for restarts while keeping uniqueness.
  • Add connection manager watermarks to avoid over-peering during joins.
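
A sketch of the derivation (Python with the cryptography package for brevity; CHORUS itself would use the Go equivalent, e.g. golang.org/x/crypto/hkdf):

# Sketch: task-scoped HKDF seed for the node keypair
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.kdf.hkdf import HKDF

def derive_node_seed(cluster_secret: bytes, cluster_id: str, task_id: str) -> bytes:
    hkdf = HKDF(
        algorithm=hashes.SHA256(),
        length=32,                                # 32-byte seed for the keypair
        salt=b"chorus-peer-id-v1",                # domain-separation label
        info=f"{cluster_id}:{task_id}".encode(),  # task-scoped context
    )
    return hkdf.derive(cluster_secret)            # secret IKM from a Swarm secret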

2) Medium-risk fixes (next)

2.1 Leader election stability

  • Add stability window: after a leader is elected, ignore challenges for ELECTION_MIN_TERM (e.g., 30s) if leader heartbeats OK.
  • WHOOSH scales followers first, never scales down leaders; allow election only after a cooldown post-wave (e.g., 45–60s).
  • Consider dedicated admin quorum size=3 (unpinned services) for SLURP operations; workers don’t campaign unless admin quorum is unhealthy.

2.2 Config system: allow runtime assignment (no “env-only”)

Implement the bootstrap-pull we discussed:

  • On start, CHORUS reads ASSIGN_URL (with TASK_SLOT, TASK_ID) and merges returned JSON into the in-memory config.
  • Support a reload signal (SIGHUP or POST /v1/reload) to apply new role/model/prompt without restart.
  • Keep env as defaults; file/HTTP assignment overrides env.

Go sketch:

type RuntimeConfig struct {
  Base Config // from env
  Over Config // from /assign
}

func (c *RuntimeConfig) Get(k string) string {
  if v := c.Over.Get(k); v != "" { return v }
  return c.Base.Get(k)
}

func loadAssignment() (Config, error) {
  // GET $ASSIGN_URL?slot=$TASK_SLOT&id=$TASK_ID&cluster=$CLUSTER_ID
  var over Config
  // ... fetch, decode the JSON response, and merge it into Over ...
  return over, nil
}
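
On the WHOOSH side, the matching /assign endpoint can be a small keyed lookup. A Flask-style sketch (field names are illustrative):

# Sketch: WHOOSH /assign endpoint serving per-replica parameters
from flask import Flask, jsonify, request

app = Flask(__name__)

# Filled in by the scaler before each wave, keyed by (cluster, slot)
ASSIGNMENTS = {
    ("prod", "1"): {"role": "planner", "model": "llama3.1-8b-instruct-q6"},
}

@app.get("/assign")
def assign():
    key = (request.args.get("cluster"), request.args.get("slot"))
    return jsonify(ASSIGNMENTS.get(key, {}))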

2.3 DHT encryption keys

  • Derive agent keys from a cluster root key + role scope via HKDF; store per-node ephemeral decryptors.
  • WHOOSH can hand out the wrapped role key inside the assignment payload; node unwraps with cluster secret (from Swarm Secret).

3) BackBeat/NATS hardening

  • Move from raw core NATS pub/sub to JetStream with bounded streams, durable consumers, and max in-flight to prevent floods.
  • Namespace subjects per cluster/role: bb.<cluster>.<role>.*.
  • WHOOSH gate checks lag on critical subjects before the next wave.
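
A consumer-side sketch with nats-py (assumes a JetStream stream already covers the bb.<cluster>.<role>.* subjects):

# Sketch: durable JetStream consumer with bounded in-flight messages
import asyncio
import nats
from nats.js.api import ConsumerConfig

async def run():
    nc = await nats.connect("nats://backbeat:4222")
    js = nc.jetstream()

    async def handle(msg):
        # ... process one operation-tracking event ...
        await msg.ack()

    # Durable name survives restarts; max_ack_pending caps in-flight work
    await js.subscribe(
        "bb.prod.planner.>",
        durable="chorus-planner",
        cb=handle,
        config=ConsumerConfig(max_ack_pending=64),
        manual_ack=True,
    )
    await asyncio.Event().wait()  # keep the consumer alive

asyncio.run(run())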

4) Low-risk hygiene

  • Hypercore logs: write per-node, but also ship to a role stream (JetStream or Loki) keyed by AssignmentID to avoid audit fragmentation.
  • mDNS: keep off inside Swarm.
  • Swarm resources: define Reservations > 0 and Limits tuned for your GPU/CPU nodes; let WHOOSH pick placement via labels to avoid oversubscription.

docker-stack.yml excerpt:

services:
  chorus:
    image: registry/chorus:latest
    deploy:
      mode: replicated
      update_config:
        parallelism: 2
        delay: 15s
        order: start-first
        monitor: 30s
      restart_policy:
        condition: on-failure
        max_attempts: 3
      resources:
        reservations:
          cpus: '1.0'
          memory: 2G
        limits:
          cpus: '2.0'
          memory: 4G
      placement:
        constraints:
          - 'node.labels.role == worker'
    environment:
      - ASSIGN_URL=http://whoosh:8080/assign
      - CLUSTER_ID=prod
      - DISABLE_MDNS=true
    configs:
      - source: chorus-bootstrap
        target: /config/bootstrap.json
    command: ["/bin/sh","-lc","/app/bootstrap.sh && exec /app/chorus"]
configs:
  chorus-bootstrap:
    file: ./bootstrap.json

5) WHOOSH → Swarm control (what WHOOSH must implement)

Scale algorithm (pseudo):

def scale_service(target):
    while current_replicas < target:
        wave = min( max(3, nodes//10), 8, target - current_replicas )
        if not gates_ok(): sleep(backoff()); continue

        # create assignments + per-replica stagger
        for i in range(wave):
            assign = build_assignment(role, model, prompt, bootstrap_subset(), stagger_ms())
            store_assignment(assign)  # keyed by (TaskSlot or ephemeral token)

        swarm.scale('chorus', current_replicas + wave)

        wait_until(lambda: joined_count_since_wave() >= floor(0.8 * wave), timeout=90)  # seconds
        current_replicas += wave

Gates WHOOSH polls:

  • GET kaching/health (latency + 429 rate),
  • GET backbeat/metrics (subject lag),
  • GET chorus/peers (bootstrap pool ready).
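
A sketch of the gate check (endpoint paths and response fields are assumptions based on the list above):

# Sketch: WHOOSH health-gate poll before releasing the next wave
import requests

def gates_ok() -> bool:
    try:
        kaching = requests.get("http://kaching:8080/health", timeout=2).json()
        backbeat = requests.get("http://backbeat:8080/metrics", timeout=2).json()
        peers = requests.get("http://chorus:8080/peers", timeout=2).json()
    except (requests.RequestException, ValueError):
        return False
    return (
        kaching.get("p95_ms", float("inf")) < 250             # latency gate (X ms)
        and backbeat.get("subject_lag", float("inf")) < 1000  # lag gate (L)
        and len(peers.get("bootstrap_pool", [])) >= 3         # >= 3 healthy peers
    )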

6) Where I agree/disagree with Claude

  • Agree the current env-only config and fixed bootstrap will block smooth scaling. Change both.
  • Agree KACHING is your #1 single point of pain without leases and a breaker.
  • Nuance on “deterministic peer IDs”: don’t make them globally predictable; derive from task-scoped HKDF + cluster salt, rotate on assignment epoch if needed.
  • Leader election isn’t fatal if you add a stability window and scale followers first.

7) Minimal code changes checklist (you can hand this to a dev)

  1. Add license gate with cache + breaker + grace window (CHORUS).
  2. Add assignment merge layer (HTTP/file) + SIGHUP reload (CHORUS).
  3. Disable mDNS; add dial/QPS limiter; connection manager watermarks (CHORUS).
  4. Bootstrap pool JSON support + join stagger (CHORUS).
  5. Key derivation: cluster root → role keys → node unwrap (CHORUS).
  6. JetStream migration for BackBeat with lag metrics (infra).
  7. WHOOSH scaling waves + health gates + assignment broker (WHOOSH).
  8. KACHING lease tokens + rate-limit/priority renewal path + burst health (KACHING).

Do those eight and you’ll move from “static swarm with manual scale” to robust, wave-based autoscaling that doesn’t collapse your license server, your bootstrap peers, or your control plane.

Author
Owner

🔗 Related Issue Created

KACHING Issue: Support CHORUS Swarm Scaling: Cluster Lease Tokens and Burst Rate Limiting #2
(https://gitea.chorus.services/tony/KACHING/issues/2)

This tracks the required KACHING enhancements to support burst scaling:

  • Cluster lease token system
  • Burst-proof rate limiting with priority lanes
  • Health monitoring endpoints for WHOOSH gates
  • Enhanced telemetry and monitoring

The KACHING changes are a prerequisite for the CHORUS scaling implementation.

Author
Owner

CHORUS Surgical Changes Implemented

I've successfully implemented all the surgical changes to CHORUS based on the feedback. Here's what was delivered:

1. License Gate with Cache + Breaker + Grace Window

Files Added/Modified:

  • internal/licensing/license_gate.go - New burst-proof license validation
  • internal/licensing/validator.go - Enhanced with license gate integration

Features:

  • Circuit breaker with 3-failure trip threshold
  • 90-second grace period for soft-start during outages
  • Cached lease tokens with 2-minute safety margin
  • Fallback to legacy validation for backward compatibility
  • Cluster lease token support (when KACHING implements the API)

2. Assignment Merge Layer with HTTP/File Support + SIGHUP Reload

Files Added/Modified:

  • pkg/config/runtime_config.go - New runtime configuration system
  • pkg/config/config.go - Added LoadRuntimeConfig function

Features:

  • HTTP assignment fetch from WHOOSH (ASSIGN_URL with task slot/ID)
  • File-based assignment loading (ASSIGNMENT_FILE for config objects)
  • SIGHUP signal handler for configuration reload
  • Override precedence: Assignment > Environment
  • Real-time role/model/prompt updates without restart

3. Disable mDNS + Add Dial/QPS Limiters + Connection Manager

Files Modified:

  • p2p/config.go - Enhanced with rate limiting and connection management

Features:

  • mDNS disabled by default for container environments
  • Connection manager with low/high watermarks (32/128)
  • Dial rate limiting: 5 dials per second, max 10 concurrent
  • DHT query limiting: max 16 concurrent queries
  • Join stagger delay: configurable via assignment

4. Bootstrap Pool JSON Support with Join Stagger

Files Added:

  • pkg/bootstrap/pool_manager.go - Complete bootstrap pool management

Features:

  • JSON bootstrap configuration with peer health status
  • WHOOSH-managed bootstrap pool via URL endpoint
  • Staggered peer connections with random delay (0-20s configurable)
  • Rate-limited connection attempts to prevent storms
  • Bootstrap subset assignment (2-3 peers per replica)
  • Legacy environment fallback for backward compatibility

5. Key Derivation: Cluster Root → Role Keys → Node Unwrap

Files Added:

  • pkg/crypto/key_derivation.go - Complete key derivation system

Features:

  • HKDF-based key derivation from cluster root key
  • Role-scoped encryption (all nodes in role can decrypt)
  • Node-specific keys derived from role keys + agent ID
  • AGE encryption with deterministic key generation
  • Multi-recipient encryption for role-wide data sharing
  • Cluster-wide vs node-specific decryption strategies

6. Dependencies Added

Updated:

  • go.mod - Added github.com/sony/gobreaker and golang.org/x/crypto

🎯 Integration Points for WHOOSH

These CHORUS changes enable WHOOSH to:

  1. Fetch cluster leases from KACHING (reduces license validation storms)
  2. Assign role/model/prompt via /assign API endpoint
  3. Manage bootstrap pools via JSON configuration
  4. Control join timing via stagger delays in assignments
  5. Monitor health gates before scaling waves

🔄 Next Steps

  1. KACHING: Implement cluster lease tokens (see Issue #2)
  2. WHOOSH: Implement scaling waves with health gates
  3. Testing: Integration testing with burst scaling scenarios
  4. BackBeat: JetStream migration for better scaling behavior

All surgical changes maintain backward compatibility while adding the scaling capabilities needed for WHOOSH orchestration. The wave-based scaling approach with health gates should now be feasible without the friction points identified in the analysis.

Author
Owner

Some progress has been made on this at WALNUT. PR yet to be submitted. Push not possible.

tony closed this issue 2025-10-10 00:36:23 +00:00
Reference: tony/WHOOSH#7