Feature Request: CHORUS Swarm Scaling and Re-parametrization without Docker-in-Docker #7
CHORUS Swarm Scaling Feature Request
Executive Summary
Implement capability for WHOOSH to scale and re-parametrize CHORUS instances in Docker Swarm without Docker-in-Docker. WHOOSH will communicate with the Docker Remote API on the Swarm manager (via restricted socket proxy or TLS), then pass role/model/prompt to new replicas via environment/configs or bootstrap fetch patterns.
Technical Implementation
1. Give WHOOSH Controlled Access to Swarm (No DinD)
Option A: Socket Proxy (Recommended)
- Run `tecnativa/docker-socket-proxy` on a manager with only the needed endpoints (services, tasks) enabled
- WHOOSH uses `DOCKER_HOST=tcp://socket-proxy:2375`
- No `/var/run/docker.sock` mount, scoped permissions

Option B: Docker Remote API over TLS
- Expose the Docker Remote API on the manager (`:2376`) with TLS client certs
- WHOOSH uses `DOCKER_HOST=tcp://manager:2376`

2. WHOOSH → Swarm Control Implementation
Use official Docker SDK for programmatic service management:
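For example, a minimal sketch in Go using the official SDK (`github.com/docker/docker/client`); the service name `chorus` and the target replica count are placeholders, and the client assumes `DOCKER_HOST` points at the socket proxy or TLS endpoint from step 1:

```go
// Minimal sketch: scale a Swarm service with the official Docker Go SDK.
package main

import (
	"context"
	"log"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func scaleService(ctx context.Context, cli *client.Client, name string, replicas uint64) error {
	// Read the current spec and version; the version guards against concurrent updates.
	svc, _, err := cli.ServiceInspectWithRaw(ctx, name, types.ServiceInspectOptions{})
	if err != nil {
		return err
	}
	spec := svc.Spec
	// Assumes a replicated (not global) service, so Mode.Replicated is non-nil.
	spec.Mode.Replicated.Replicas = &replicas
	_, err = cli.ServiceUpdate(ctx, svc.ID, svc.Version, spec, types.ServiceUpdateOptions{})
	return err
}

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}
	if err := scaleService(context.Background(), cli, "chorus", 10); err != nil {
		log.Fatal(err)
	}
}
```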
3. Parameter Passing Patterns
Pattern 1: Bootstrap Pull (Recommended)
Example service configuration:
Bootstrap script:
Pattern 2: Config Objects
- Mount the assignment as a Swarm config object at `/config/spec.json`

Pattern 3: Per-Role Services
- Separate services per role: `chorus-writer`, `chorus-planner`, `chorus-coder`

4. New Role-Specific Service Creation
Security & Operations
- Never mount `/var/run/docker.sock` directly into WHOOSH
- Scope the socket proxy with `-e SERVICES=1 -e TASKS=1 -e CONFIGS=1 -e SWARM=1`
- Label managed services with `com.chorus.managed=true`

Advanced Features
Rolling Parameter Updates
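A minimal sketch of what a rolling parameter update could look like through the Docker Go SDK, assuming parameters are carried as service env vars: changing the env triggers Swarm's rolling task replacement, so a new role/model rolls out without manual restarts. The service name and the `CHORUS_ROLE` / `CHORUS_MODEL` variable names are assumptions for illustration.

```go
// Sketch: push a new role/model to a running service via a Swarm rolling update.
package main

import (
	"context"
	"log"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/api/types/swarm"
	"github.com/docker/docker/client"
)

func main() {
	ctx := context.Background()
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		log.Fatal(err)
	}

	svc, _, err := cli.ServiceInspectWithRaw(ctx, "chorus", types.ServiceInspectOptions{})
	if err != nil {
		log.Fatal(err)
	}

	spec := svc.Spec
	spec.TaskTemplate.ContainerSpec.Env = []string{
		"CHORUS_ROLE=planner",
		"CHORUS_MODEL=llama3.1:70b",
	}
	// Roll tasks a few at a time instead of all at once.
	spec.UpdateConfig = &swarm.UpdateConfig{
		Parallelism: 2,
		Delay:       10 * time.Second,
		Order:       swarm.UpdateOrderStartFirst,
	}

	if _, err := cli.ServiceUpdate(ctx, svc.ID, svc.Version, spec, types.ServiceUpdateOptions{}); err != nil {
		log.Fatal(err)
	}
}
```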
Service Discovery
Implementation Checklist
Benefits
References
CHORUS Swarm Scaling Implementation: Potential Friction Points
After analyzing the CHORUS codebase, I've identified several significant friction points that could complicate the implementation of the proposed WHOOSH → Swarm scaling feature:
🔴 High-Risk Friction Points
1. License Validation Bottleneck (`internal/runtime/shared.go:156-165`)
- `licenseValidator.Validate()` must succeed or the agent fails to start

2. P2P Network Identity Conflicts (`internal/runtime/shared.go:206-217`)

3. Bootstrap Peer Dependency (`internal/runtime/shared.go:410-448`)

🟡 Medium-Risk Friction Points

4. Leader Election Disruption (`internal/runtime/shared.go:320-374`)

5. Configuration Environment Rigidity (`pkg/config/config.go:144-245`)

6. DHT Storage Encryption Key Management (`internal/runtime/shared.go:450-471`)

7. BACKBEAT Integration Coupling (`internal/runtime/shared.go:190-203`)

🟢 Low-Risk but Notable Friction Points

8. Hypercore Logging Conflicts (`internal/runtime/shared.go:220-226`)

9. mDNS Discovery Limitations (`internal/runtime/shared.go:229-233`)

10. Resource Management Assumptions (`docker/docker-compose.yml:84-90`)

🛠️ Recommended Mitigation Strategies
For High-Risk Items:
License Validation:
P2P Network:
Bootstrap Dependencies:
For Medium-Risk Items:
Leader Election:
Configuration System:
For Implementation Priority:
🎯 Critical Success Factors
The current CHORUS architecture assumes relatively static, manually-managed deployments. Implementing dynamic Swarm scaling will require significant architectural changes to handle the distributed systems challenges that arise from rapid replica creation and destruction.
Claude’s read isn’t wrong. Your top three risks (KACHING burst, bootstrap overload, and peer churn) will absolutely bite you the first time WHOOSH does a fast scale-out. The good news: you don’t need DinD, and you don’t need to rip out CHORUS. You need three surgical changes in CHORUS + one orchestration guardrail in WHOOSH.
Here’s the practical plan, in priority order, with concrete deltas.
0) Orchestrator rule: WHOOSH scales in waves (not floods)
- Wave size: `min( max(3, floor(total_nodes/10)), 8 )` replicas per 20–30s.
- Gates: do not start the next wave unless the health gates WHOOSH polls (section 5) all pass.
- Backoff/jitter: on any gate failure, exponential backoff (cap 2 minutes), add ±20% jitter.
This single rule keeps most “storm” failure modes contained.
1) High-risk fixes (must-do)
1.1 KACHING: burst-proof the license path
Problem: cold-start thundering herd.
Fix:
- Cluster lease tokens: KACHING issues a short-lived, revocable lease (`chorus.cluster.lease`) valid for N replicas for T minutes.
- WHOOSH acts as broker: fetch one lease, fan it out to the wave via WHOOSH's assign API; replicas present the lease, not raw credentials.
- In-proc cache + circuit breaker in CHORUS: cache the last successful validation, trip a breaker on repeated failures, and allow a bounded grace window.
CHORUS patch (Go, concept):
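A minimal sketch of that cache + breaker + grace-window path, using `github.com/sony/gobreaker`; the `LicenseGate` type, TTL values, and the injected `validate` callback are assumptions, not the actual CHORUS licensing API:

```go
// Sketch: burst-proof license validation with cache, circuit breaker, and grace window.
package licensing

import (
	"sync"
	"time"

	"github.com/sony/gobreaker"
)

type LicenseGate struct {
	mu       sync.Mutex
	cached   bool          // last validation succeeded
	cachedAt time.Time     // when it succeeded
	cacheTTL time.Duration // how long a success avoids any KACHING traffic
	graceTTL time.Duration // how long a stale success keeps the agent alive
	breaker  *gobreaker.CircuitBreaker
	validate func() error // real call to KACHING
}

func NewLicenseGate(validate func() error) *LicenseGate {
	return &LicenseGate{
		cacheTTL: 5 * time.Minute,
		graceTTL: 30 * time.Minute,
		validate: validate,
		breaker: gobreaker.NewCircuitBreaker(gobreaker.Settings{
			Name:    "kaching",
			Timeout: 30 * time.Second, // half-open retry interval
			ReadyToTrip: func(c gobreaker.Counts) bool {
				return c.ConsecutiveFailures >= 3
			},
		}),
	}
}

// Check returns nil on a fresh cache hit, otherwise calls KACHING through the
// breaker; if KACHING is down, a stale success within the grace window still passes.
func (g *LicenseGate) Check() error {
	g.mu.Lock()
	defer g.mu.Unlock()

	age := time.Since(g.cachedAt)
	if g.cached && age < g.cacheTTL {
		return nil // fresh cache hit, no KACHING traffic
	}

	_, err := g.breaker.Execute(func() (interface{}, error) {
		return nil, g.validate()
	})
	if err == nil {
		g.cached, g.cachedAt = true, time.Now()
		return nil
	}
	if g.cached && age < g.graceTTL {
		return nil // KACHING unreachable or breaker open: ride out the grace window
	}
	return err
}
```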
KACHING:
1.2 Bootstrap without meltdowns
Problem: DHT join storms against a tiny fixed peer set.
Fix:
CHORUS config additions:
libp2p: disable mDNS in containers; use static/gossip bootstrap via JSON list; enable AutoNAT + connection manager with low water/high water tuned (e.g., 64/256).
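A sketch of that host wiring with go-libp2p (assumed option set, not the actual `p2p/config.go`): connection manager at 64/256 water marks, AutoNAT service enabled, and no mDNS discovery started in containers.

```go
// Sketch: libp2p host options for containerized CHORUS replicas.
package p2p

import (
	"time"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/p2p/net/connmgr"
)

func newHost() (host.Host, error) {
	cm, err := connmgr.NewConnManager(
		64,  // low water
		256, // high water
		connmgr.WithGracePeriod(time.Minute),
	)
	if err != nil {
		return nil, err
	}
	return libp2p.New(
		libp2p.ConnectionManager(cm),
		libp2p.EnableNATService(), // AutoNAT
		// mDNS is a separate discovery service; simply do not start it in containers,
		// and dial the static/gossip bootstrap list from the JSON pool instead.
	)
}
```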
1.3 Peer identity sanity
Problem: peer churn plus the risk of identity collisions.
Fix:
2) Medium-risk fixes (next)
2.1 Leader election stability
- Enforce an `ELECTION_MIN_TERM` (e.g., 30s) while leader heartbeats are OK.

2.2 Config system: allow runtime assignment (no “env-only”)
Implement the bootstrap-pull we discussed:
- On start, CHORUS fetches `ASSIGN_URL` (with `TASK_SLOT`, `TASK_ID`) and merges the returned JSON into the in-memory config.
- Support reload (`SIGHUP` or `POST /v1/reload`) to apply new role/model/prompt without restart.

Go sketch:
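The assignment JSON shape and the query parameter names below are assumptions; the point is the pull-then-merge flow and the SIGHUP-driven reload.

```go
// Sketch: bootstrap-pull of the per-replica assignment plus SIGHUP reload.
package config

import (
	"encoding/json"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"sync"
	"syscall"
)

type Assignment struct {
	Role   string `json:"role"`
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
}

type RuntimeConfig struct {
	mu      sync.RWMutex
	current Assignment
}

// fetchAssignment pulls this replica's assignment from WHOOSH.
func fetchAssignment() (Assignment, error) {
	var a Assignment
	url := fmt.Sprintf("%s?slot=%s&task=%s",
		os.Getenv("ASSIGN_URL"), os.Getenv("TASK_SLOT"), os.Getenv("TASK_ID"))
	resp, err := http.Get(url)
	if err != nil {
		return a, err
	}
	defer resp.Body.Close()
	return a, json.NewDecoder(resp.Body).Decode(&a)
}

// Reload re-fetches the assignment and swaps it in atomically.
func (rc *RuntimeConfig) Reload() error {
	a, err := fetchAssignment()
	if err != nil {
		return err
	}
	rc.mu.Lock()
	rc.current = a
	rc.mu.Unlock()
	return nil
}

// WatchSIGHUP re-applies the assignment on SIGHUP without restarting the agent.
func (rc *RuntimeConfig) WatchSIGHUP() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGHUP)
	go func() {
		for range ch {
			_ = rc.Reload()
		}
	}()
}
```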
2.3 DHT encryption keys
3) BackBeat/NATS hardening
- Partition subjects per cluster and role, e.g. `bb.<cluster>.<role>.*`.

4) Low-risk hygiene
- Record the `AssignmentID` (e.g., on Hypercore log entries) to avoid audit fragmentation.

docker-stack.yml excerpt:
5) WHOOSH → Swarm control (what WHOOSH must implement)
Scale algorithm (pseudo):
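A sketch of that algorithm in Go; the `Gates` and `Scaler` interfaces stand in for the real WHOOSH internals (the gate poller and the Docker SDK call from section 2), and the wave sizing follows the rule in section 0.

```go
// Sketch: wave-based scale-out with health gates and jittered backoff.
package orchestrator

import (
	"context"
	"math"
	"math/rand"
	"time"
)

// Gates would poll kaching/health, backbeat/metrics, and chorus/peers.
type Gates interface {
	Healthy(ctx context.Context) bool
}

// Scaler wraps the Docker SDK call that reads and sets the replica count.
type Scaler interface {
	Current(ctx context.Context) (uint64, error)
	SetReplicas(ctx context.Context, n uint64) error
}

// waveSize implements min( max(3, floor(total_nodes/10)), 8 ).
func waveSize(totalNodes int) uint64 {
	return uint64(math.Min(math.Max(3, math.Floor(float64(totalNodes)/10)), 8))
}

// ScaleInWaves grows the service toward target, one gated wave at a time.
func ScaleInWaves(ctx context.Context, s Scaler, g Gates, target uint64, totalNodes int) error {
	backoff := 20 * time.Second
	for {
		cur, err := s.Current(ctx)
		if err != nil {
			return err
		}
		if cur >= target {
			return nil
		}
		if !g.Healthy(ctx) {
			// Gate failed: exponential backoff capped at 2 minutes, ±20% jitter.
			jitter := time.Duration((rand.Float64()*0.4 - 0.2) * float64(backoff))
			time.Sleep(backoff + jitter)
			backoff *= 2
			if backoff > 2*time.Minute {
				backoff = 2 * time.Minute
			}
			continue
		}
		backoff = 20 * time.Second // reset after a healthy gate check
		next := cur + waveSize(totalNodes)
		if next > target {
			next = target
		}
		if err := s.SetReplicas(ctx, next); err != nil {
			return err
		}
		time.Sleep(25 * time.Second) // wave interval (20–30s per the rule above)
	}
}
```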
Gates WHOOSH polls:
- `GET kaching/health` (latency + 429 rate)
- `GET backbeat/metrics` (subject lag)
- `GET chorus/peers` (bootstrap pool ready)

6) Where I agree/disagree with Claude
7) Minimal code changes checklist (you can hand this to a dev)
- `SIGHUP` reload (CHORUS).

Do those eight and you’ll move from “static swarm with manual scale” to robust, wave-based autoscaling that doesn’t collapse your license server, your bootstrap peers, or your control plane.
🔗 Related Issue Created
KACHING Issue: Support CHORUS Swarm Scaling: Cluster Lease Tokens and Burst Rate Limiting #2
This tracks the required KACHING enhancements to support burst scaling:
The KACHING changes are a prerequisite for the CHORUS scaling implementation.
✅ CHORUS Surgical Changes Implemented
I've successfully implemented all the surgical changes to CHORUS based on the feedback. Here's what was delivered:
1. ✅ License Gate with Cache + Breaker + Grace Window
Files Added/Modified:
- `internal/licensing/license_gate.go` - New burst-proof license validation
- `internal/licensing/validator.go` - Enhanced with license gate integration

Features:
2. ✅ Assignment Merge Layer with HTTP/File Support + SIGHUP Reload
Files Added/Modified:
- `pkg/config/runtime_config.go` - New runtime configuration system
- `pkg/config/config.go` - Added LoadRuntimeConfig function

Features:
- HTTP assignment pull (`ASSIGN_URL` with task slot/ID)
- File-based assignment (`ASSIGNMENT_FILE` for config objects)

3. ✅ Disable mDNS + Add Dial/QPS Limiters + Connection Manager
Files Modified:
- `p2p/config.go` - Enhanced with rate limiting and connection management

Features:
4. ✅ Bootstrap Pool JSON Support with Join Stagger
Files Added:
- `pkg/bootstrap/pool_manager.go` - Complete bootstrap pool management

Features:
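As an illustration of the join-stagger behavior this pool manager is meant to provide, the sketch below loads a JSON bootstrap list and dials a small random subset after a jittered delay, so a whole wave of new replicas does not hit the same peers at the same instant. The file format and helper names are assumptions, not the actual `pool_manager.go` contract.

```go
// Sketch: bootstrap pool loading with a staggered, randomized join.
package bootstrap

import (
	"encoding/json"
	"math/rand"
	"os"
	"time"
)

type Peer struct {
	Multiaddr string `json:"multiaddr"`
}

// LoadPool reads a JSON list such as [{"multiaddr":"/dns4/peer0/tcp/9000/p2p/..."}].
func LoadPool(path string) ([]Peer, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	var pool []Peer
	return pool, json.Unmarshal(data, &pool)
}

// StaggeredSubset sleeps a random 0..maxDelay, then returns up to n shuffled peers to dial.
func StaggeredSubset(pool []Peer, n int, maxDelay time.Duration) []Peer {
	time.Sleep(time.Duration(rand.Int63n(int64(maxDelay))))
	shuffled := append([]Peer(nil), pool...)
	rand.Shuffle(len(shuffled), func(i, j int) { shuffled[i], shuffled[j] = shuffled[j], shuffled[i] })
	if n > len(shuffled) {
		n = len(shuffled)
	}
	return shuffled[:n]
}
```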
5. ✅ Key Derivation: Cluster Root → Role Keys → Node Unwrap
Files Added:
- `pkg/crypto/key_derivation.go` - Complete key derivation system

Features:
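For illustration, the cluster-root → role-key step can be done with HKDF from `golang.org/x/crypto`; this is a sketch of the approach, not the actual `key_derivation.go` contents, and the info-string format is an assumption.

```go
// Sketch: derive role-scoped keys from the cluster root secret via HKDF.
package crypto

import (
	"crypto/sha256"
	"io"

	"golang.org/x/crypto/hkdf"
)

// DeriveRoleKey derives a 32-byte role-scoped key from the cluster root secret.
// Binding the role into the HKDF info string keeps replicas of different roles
// from reading each other's encrypted DHT entries.
func DeriveRoleKey(clusterRoot []byte, role string) ([]byte, error) {
	r := hkdf.New(sha256.New, clusterRoot, nil /* salt */, []byte("chorus/role/"+role))
	key := make([]byte, 32)
	if _, err := io.ReadFull(r, key); err != nil {
		return nil, err
	}
	return key, nil
}
```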
6. ✅ Dependencies Added
Updated:
- `go.mod` - Added `github.com/sony/gobreaker` and `golang.org/x/crypto`

🎯 Integration Points for WHOOSH
These CHORUS changes enable WHOOSH to:
- Serve per-replica role/model/prompt assignments via the `/assign` API endpoint

🔄 Next Steps
All surgical changes maintain backward compatibility while adding the scaling capabilities needed for WHOOSH orchestration. The wave-based scaling approach with health gates should now be feasible without the friction points identified in the analysis.
Some progress has been made on this at WALNUT. A PR has not yet been submitted; pushing is not currently possible.