Files
CHORUS/README.md
anthonyrawlins e523c4b543 feat: Implement CHORUS scaling improvements for robust autoscaling
Address WHOOSH issue #7 with comprehensive scaling optimizations to prevent
license server, bootstrap peer, and control plane collapse during fast scale-out.

HIGH-RISK FIXES (Must-Do):
 License gate already implemented with cache + circuit breaker + grace window
 mDNS disabled in container environments (CHORUS_MDNS_ENABLED=false)
 Connection rate limiting (5 dials/sec, 16 concurrent DHT queries)
 Connection manager with watermarks (32 low, 128 high)
 AutoNAT enabled for container networking

MEDIUM-RISK FIXES (Next Priority):
 Assignment merge layer with HTTP/file config + SIGHUP reload
 Runtime configuration system with WHOOSH assignment API support
 Election stability windows to prevent churn:
  - CHORUS_ELECTION_MIN_TERM=30s (minimum time between elections)
  - CHORUS_LEADER_MIN_TERM=45s (minimum time before challenging healthy leader)
 Bootstrap pool JSON support with priority sorting and join stagger

NEW FEATURES:
- Runtime config system with assignment overrides from WHOOSH
- SIGHUP reload handler for live configuration updates
- JSON bootstrap configuration with peer metadata (region, roles, priority)
- Configurable election stability windows with environment variables
- Multi-format bootstrap support: Assignment → JSON → CSV

FILES MODIFIED:
- pkg/config/assignment.go (NEW): Runtime assignment merge system
- docker/bootstrap.json (NEW): Example JSON bootstrap configuration
- pkg/election/election.go: Added stability windows and churn prevention
- internal/runtime/shared.go: Integrated assignment loading and conditional mDNS
- p2p/node.go: Added connection management and rate limiting
- pkg/config/hybrid_config.go: Added rate limiting configuration fields
- docker/docker-compose.yml: Updated environment variables and configs
- README.md: Updated status table with scaling milestone

This implementation enables wave-based autoscaling without system collapse,
addressing all scaling concerns from WHOOSH issue #7.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-23 17:50:40 +10:00

88 lines
4.5 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

# CHORUS Container-First Context Platform (Alpha)
CHORUS is the runtime that ties the CHORUS ecosystem together: libp2p mesh, DHT-backed storage, council/task coordination, and (eventually) SLURP contextual intelligence. The repository you are looking at is the in-progress container-first refactor. Several core systems boot today, but higher-level services (SLURP, SHHH, full HMMM routing) are still landing.
## Current Status
| Area | Status | Notes |
| --- | --- | --- |
| libp2p node + PubSub | ✅ Running | `internal/runtime/shared.go` spins up the mesh, hypercore logging, availability broadcasts. |
| DHT + DecisionPublisher | ✅ Running | Encrypted storage wired through `pkg/dht`; decisions written via `ucxl.DecisionPublisher`. |
| **Leader Election System** | ✅ **FULLY FUNCTIONAL** | **🎉 MILESTONE: Complete admin election with consensus, discovery protocol, heartbeats, and SLURP activation!** |
| SLURP (context intelligence) | 🚧 Stubbed | `pkg/slurp/slurp.go` contains TODOs for resolver, temporal graphs, intelligence. Leader integration scaffolding exists but uses placeholder IDs/request forwarding. |
| SHHH (secrets sentinel) | 🚧 Sentinel live | `pkg/shhh` redacts hypercore + PubSub payloads with audit + metrics hooks (policy replay TBD). |
| HMMM routing | 🚧 Partial | PubSub topics join, but capability/role announcements and HMMM router wiring are placeholders (`internal/runtime/agent_support.go`). |
See `docs/progress/CHORUS-WHOOSH-development-plan.md` for the detailed build plan and `docs/progress/CHORUS-WHOOSH-roadmap.md` for sequencing.
## Quick Start (Alpha)
The container-first workflows are still evolving; expect frequent changes.
```bash
git clone https://gitea.chorus.services/tony/CHORUS.git
cd CHORUS
cp docker/chorus.env.example docker/chorus.env
# adjust env vars (KACHING license, bootstrap peers, etc.)
docker compose -f docker/docker-compose.yml up --build
```
Youll get a single agent container with:
- libp2p networking (mDNS + configured bootstrap peers)
- election heartbeat
- DHT storage (AGE-encrypted)
- HTTP API + health endpoints
**Missing today:** SLURP context resolution, advanced SHHH policy replay, HMMM per-issue routing. Expect log warnings/TODOs for those paths.
## 🎉 Leader Election System (NEW!)
CHORUS now features a complete, production-ready leader election system:
### Core Features
- **Consensus-based election** with weighted scoring (uptime, capabilities, resources)
- **Admin discovery protocol** for network-wide leader identification
- **Heartbeat system** with automatic failover (15-second intervals)
- **Concurrent election prevention** with randomized delays
- **SLURP activation** on elected admin nodes
### How It Works
1. **Bootstrap**: Nodes start in idle state, no admin known
2. **Discovery**: Nodes send discovery requests to find existing admin
3. **Election trigger**: If no admin found after grace period, trigger election
4. **Candidacy**: Eligible nodes announce themselves with capability scores
5. **Consensus**: Network selects winner based on highest score
6. **Leadership**: Winner starts heartbeats, activates SLURP functionality
7. **Monitoring**: Nodes continuously verify admin health via heartbeats
### Debugging
Use these log patterns to monitor election health:
```bash
# Monitor WHOAMI messages and leader identification
docker service logs CHORUS_chorus | grep "🤖 WHOAMI\|👑\|📡.*Discovered"
# Track election cycles
docker service logs CHORUS_chorus | grep "🗳️\|📢.*candidacy\|🏆.*winner"
# Watch discovery protocol
docker service logs CHORUS_chorus | grep "📩\|📤\|📥"
```
## Roadmap Highlights
1. **Security substrate** land SHHH sentinel, finish SLURP leader-only operations, validate COOEE enrolment (see roadmap Phase 1).
2. **Autonomous teams** coordinate with WHOOSH for deployment telemetry + SLURP context export.
3. **UCXL + KACHING** hook runtime telemetry into KACHING and enforce UCXL validator.
Track progress via the shared roadmap and weekly burndown dashboards.
## Related Projects
- [WHOOSH](https://gitea.chorus.services/tony/WHOOSH) council/team orchestration
- [KACHING](https://gitea.chorus.services/tony/KACHING) telemetry/licensing
- [SLURP](https://gitea.chorus.services/tony/SLURP) contextual intelligence prototypes
- [HMMM](https://gitea.chorus.services/tony/hmmm) meta-discussion layer
## Contributing
This repo is still alpha. Please coordinate via the roadmap tickets before landing changes. Major security/runtime decisions should include a Decision Record with a UCXL address so SLURP/BUBBLE can ingest it later.