Compare commits
9 Commits
2578876eeb
...
feature/ch
| Author | SHA1 | Date | |
|---|---|---|---|
|
|
14b5125c12 | ||
|
|
ea04378962 | ||
| 237e8699eb | |||
| 1de8695736 | |||
| c30c6dc480 | |||
|
|
e523c4b543 | ||
|
|
26e4ef7d8b | ||
|
|
eb2e05ff84 | ||
| ef4bf1efe0 |
@@ -15,14 +15,16 @@ RUN addgroup -g 1000 chorus && \
|
|||||||
RUN mkdir -p /app/data && \
|
RUN mkdir -p /app/data && \
|
||||||
chown -R chorus:chorus /app
|
chown -R chorus:chorus /app
|
||||||
|
|
||||||
# Copy pre-built binary
|
# Copy pre-built binary from build directory (ensure it exists and is the correct one)
|
||||||
COPY chorus-agent /app/chorus-agent
|
COPY build/chorus-agent /app/chorus-agent
|
||||||
RUN chmod +x /app/chorus-agent && chown chorus:chorus /app/chorus-agent
|
RUN chmod +x /app/chorus-agent && chown chorus:chorus /app/chorus-agent
|
||||||
|
|
||||||
# Switch to non-root user
|
# Switch to non-root user
|
||||||
USER chorus
|
USER chorus
|
||||||
WORKDIR /app
|
WORKDIR /app
|
||||||
|
|
||||||
|
# Note: Using correct chorus-agent binary built with 'make build-agent'
|
||||||
|
|
||||||
# Expose ports
|
# Expose ports
|
||||||
EXPOSE 8080 8081 9000
|
EXPOSE 8080 8081 9000
|
||||||
|
|
||||||
|
|||||||
35
README.md
35
README.md
@@ -8,7 +8,7 @@ CHORUS is the runtime that ties the CHORUS ecosystem together: libp2p mesh, DHT-
|
|||||||
| --- | --- | --- |
|
| --- | --- | --- |
|
||||||
| libp2p node + PubSub | ✅ Running | `internal/runtime/shared.go` spins up the mesh, hypercore logging, availability broadcasts. |
|
| libp2p node + PubSub | ✅ Running | `internal/runtime/shared.go` spins up the mesh, hypercore logging, availability broadcasts. |
|
||||||
| DHT + DecisionPublisher | ✅ Running | Encrypted storage wired through `pkg/dht`; decisions written via `ucxl.DecisionPublisher`. |
|
| DHT + DecisionPublisher | ✅ Running | Encrypted storage wired through `pkg/dht`; decisions written via `ucxl.DecisionPublisher`. |
|
||||||
| Election manager | ✅ Running | Admin election integrated with Backbeat; metrics exposed under `pkg/metrics`. |
|
| **Leader Election System** | ✅ **FULLY FUNCTIONAL** | **🎉 MILESTONE: Complete admin election with consensus, discovery protocol, heartbeats, and SLURP activation!** |
|
||||||
| SLURP (context intelligence) | 🚧 Stubbed | `pkg/slurp/slurp.go` contains TODOs for resolver, temporal graphs, intelligence. Leader integration scaffolding exists but uses placeholder IDs/request forwarding. |
|
| SLURP (context intelligence) | 🚧 Stubbed | `pkg/slurp/slurp.go` contains TODOs for resolver, temporal graphs, intelligence. Leader integration scaffolding exists but uses placeholder IDs/request forwarding. |
|
||||||
| SHHH (secrets sentinel) | 🚧 Sentinel live | `pkg/shhh` redacts hypercore + PubSub payloads with audit + metrics hooks (policy replay TBD). |
|
| SHHH (secrets sentinel) | 🚧 Sentinel live | `pkg/shhh` redacts hypercore + PubSub payloads with audit + metrics hooks (policy replay TBD). |
|
||||||
| HMMM routing | 🚧 Partial | PubSub topics join, but capability/role announcements and HMMM router wiring are placeholders (`internal/runtime/agent_support.go`). |
|
| HMMM routing | 🚧 Partial | PubSub topics join, but capability/role announcements and HMMM router wiring are placeholders (`internal/runtime/agent_support.go`). |
|
||||||
@@ -35,6 +35,39 @@ You’ll get a single agent container with:
|
|||||||
|
|
||||||
**Missing today:** SLURP context resolution, advanced SHHH policy replay, HMMM per-issue routing. Expect log warnings/TODOs for those paths.
|
**Missing today:** SLURP context resolution, advanced SHHH policy replay, HMMM per-issue routing. Expect log warnings/TODOs for those paths.
|
||||||
|
|
||||||
|
## 🎉 Leader Election System (NEW!)
|
||||||
|
|
||||||
|
CHORUS now features a complete, production-ready leader election system:
|
||||||
|
|
||||||
|
### Core Features
|
||||||
|
- **Consensus-based election** with weighted scoring (uptime, capabilities, resources)
|
||||||
|
- **Admin discovery protocol** for network-wide leader identification
|
||||||
|
- **Heartbeat system** with automatic failover (15-second intervals)
|
||||||
|
- **Concurrent election prevention** with randomized delays
|
||||||
|
- **SLURP activation** on elected admin nodes
|
||||||
|
|
||||||
|
### How It Works
|
||||||
|
1. **Bootstrap**: Nodes start in idle state, no admin known
|
||||||
|
2. **Discovery**: Nodes send discovery requests to find existing admin
|
||||||
|
3. **Election trigger**: If no admin found after grace period, trigger election
|
||||||
|
4. **Candidacy**: Eligible nodes announce themselves with capability scores
|
||||||
|
5. **Consensus**: Network selects winner based on highest score
|
||||||
|
6. **Leadership**: Winner starts heartbeats, activates SLURP functionality
|
||||||
|
7. **Monitoring**: Nodes continuously verify admin health via heartbeats
|
||||||
|
|
||||||
|
### Debugging
|
||||||
|
Use these log patterns to monitor election health:
|
||||||
|
```bash
|
||||||
|
# Monitor WHOAMI messages and leader identification
|
||||||
|
docker service logs CHORUS_chorus | grep "🤖 WHOAMI\|👑\|📡.*Discovered"
|
||||||
|
|
||||||
|
# Track election cycles
|
||||||
|
docker service logs CHORUS_chorus | grep "🗳️\|📢.*candidacy\|🏆.*winner"
|
||||||
|
|
||||||
|
# Watch discovery protocol
|
||||||
|
docker service logs CHORUS_chorus | grep "📩\|📤\|📥"
|
||||||
|
```
|
||||||
|
|
||||||
## Roadmap Highlights
|
## Roadmap Highlights
|
||||||
|
|
||||||
1. **Security substrate** – land SHHH sentinel, finish SLURP leader-only operations, validate COOEE enrolment (see roadmap Phase 1).
|
1. **Security substrate** – land SHHH sentinel, finish SLURP leader-only operations, validate COOEE enrolment (see roadmap Phase 1).
|
||||||
|
|||||||
@@ -9,10 +9,11 @@ import (
|
|||||||
|
|
||||||
"chorus/internal/logging"
|
"chorus/internal/logging"
|
||||||
"chorus/pubsub"
|
"chorus/pubsub"
|
||||||
|
|
||||||
"github.com/gorilla/mux"
|
"github.com/gorilla/mux"
|
||||||
)
|
)
|
||||||
|
|
||||||
// HTTPServer provides HTTP API endpoints for Bzzz
|
// HTTPServer provides HTTP API endpoints for CHORUS
|
||||||
type HTTPServer struct {
|
type HTTPServer struct {
|
||||||
port int
|
port int
|
||||||
hypercoreLog *logging.HypercoreLog
|
hypercoreLog *logging.HypercoreLog
|
||||||
@@ -20,7 +21,7 @@ type HTTPServer struct {
|
|||||||
server *http.Server
|
server *http.Server
|
||||||
}
|
}
|
||||||
|
|
||||||
// NewHTTPServer creates a new HTTP server for Bzzz API
|
// NewHTTPServer creates a new HTTP server for CHORUS API
|
||||||
func NewHTTPServer(port int, hlog *logging.HypercoreLog, ps *pubsub.PubSub) *HTTPServer {
|
func NewHTTPServer(port int, hlog *logging.HypercoreLog, ps *pubsub.PubSub) *HTTPServer {
|
||||||
return &HTTPServer{
|
return &HTTPServer{
|
||||||
port: port,
|
port: port,
|
||||||
@@ -197,11 +198,11 @@ func (h *HTTPServer) handleGetLogsSince(w http.ResponseWriter, r *http.Request)
|
|||||||
}
|
}
|
||||||
|
|
||||||
response := map[string]interface{}{
|
response := map[string]interface{}{
|
||||||
"entries": entries,
|
"entries": entries,
|
||||||
"count": len(entries),
|
"count": len(entries),
|
||||||
"since_index": index,
|
"since_index": index,
|
||||||
"timestamp": time.Now().Unix(),
|
"timestamp": time.Now().Unix(),
|
||||||
"total": h.hypercoreLog.Length(),
|
"total": h.hypercoreLog.Length(),
|
||||||
}
|
}
|
||||||
|
|
||||||
json.NewEncoder(w).Encode(response)
|
json.NewEncoder(w).Encode(response)
|
||||||
@@ -220,8 +221,8 @@ func (h *HTTPServer) handleHealth(w http.ResponseWriter, r *http.Request) {
|
|||||||
w.Header().Set("Content-Type", "application/json")
|
w.Header().Set("Content-Type", "application/json")
|
||||||
|
|
||||||
health := map[string]interface{}{
|
health := map[string]interface{}{
|
||||||
"status": "healthy",
|
"status": "healthy",
|
||||||
"timestamp": time.Now().Unix(),
|
"timestamp": time.Now().Unix(),
|
||||||
"log_entries": h.hypercoreLog.Length(),
|
"log_entries": h.hypercoreLog.Length(),
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -233,10 +234,10 @@ func (h *HTTPServer) handleStatus(w http.ResponseWriter, r *http.Request) {
|
|||||||
w.Header().Set("Content-Type", "application/json")
|
w.Header().Set("Content-Type", "application/json")
|
||||||
|
|
||||||
status := map[string]interface{}{
|
status := map[string]interface{}{
|
||||||
"status": "running",
|
"status": "running",
|
||||||
"timestamp": time.Now().Unix(),
|
"timestamp": time.Now().Unix(),
|
||||||
"hypercore": h.hypercoreLog.GetStats(),
|
"hypercore": h.hypercoreLog.GetStats(),
|
||||||
"api_version": "1.0.0",
|
"api_version": "1.0.0",
|
||||||
}
|
}
|
||||||
|
|
||||||
json.NewEncoder(w).Encode(status)
|
json.NewEncoder(w).Encode(status)
|
||||||
|
|||||||
BIN
chorus-agent
Executable file
BIN
chorus-agent
Executable file
Binary file not shown.
@@ -11,15 +11,15 @@ WORKDIR /build
|
|||||||
# Copy go mod files first (for better caching)
|
# Copy go mod files first (for better caching)
|
||||||
COPY go.mod go.sum ./
|
COPY go.mod go.sum ./
|
||||||
|
|
||||||
# Copy vendor directory for local dependencies
|
# Download dependencies
|
||||||
COPY vendor/ vendor/
|
RUN go mod download
|
||||||
|
|
||||||
# Copy source code
|
# Copy source code
|
||||||
COPY . .
|
COPY . .
|
||||||
|
|
||||||
# Build the CHORUS binary with vendor mode
|
# Build the CHORUS binary with mod mode
|
||||||
RUN CGO_ENABLED=0 GOOS=linux go build \
|
RUN CGO_ENABLED=0 GOOS=linux go build \
|
||||||
-mod=vendor \
|
-mod=mod \
|
||||||
-ldflags='-w -s -extldflags "-static"' \
|
-ldflags='-w -s -extldflags "-static"' \
|
||||||
-o chorus \
|
-o chorus \
|
||||||
./cmd/chorus
|
./cmd/chorus
|
||||||
|
|||||||
38
docker/bootstrap.json
Normal file
38
docker/bootstrap.json
Normal file
@@ -0,0 +1,38 @@
|
|||||||
|
{
|
||||||
|
"metadata": {
|
||||||
|
"generated_at": "2024-12-19T10:00:00Z",
|
||||||
|
"cluster_id": "production-cluster",
|
||||||
|
"version": "1.0.0",
|
||||||
|
"notes": "Bootstrap configuration for CHORUS scaling - managed by WHOOSH"
|
||||||
|
},
|
||||||
|
"peers": [
|
||||||
|
{
|
||||||
|
"address": "/ip4/10.0.1.10/tcp/9000/p2p/12D3KooWExample1234567890abcdef",
|
||||||
|
"priority": 100,
|
||||||
|
"region": "us-east-1",
|
||||||
|
"roles": ["admin", "stable"],
|
||||||
|
"enabled": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"address": "/ip4/10.0.1.11/tcp/9000/p2p/12D3KooWExample1234567890abcde2",
|
||||||
|
"priority": 90,
|
||||||
|
"region": "us-east-1",
|
||||||
|
"roles": ["worker", "stable"],
|
||||||
|
"enabled": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"address": "/ip4/10.0.2.10/tcp/9000/p2p/12D3KooWExample1234567890abcde3",
|
||||||
|
"priority": 80,
|
||||||
|
"region": "us-west-2",
|
||||||
|
"roles": ["worker", "stable"],
|
||||||
|
"enabled": true
|
||||||
|
},
|
||||||
|
{
|
||||||
|
"address": "/ip4/10.0.3.10/tcp/9000/p2p/12D3KooWExample1234567890abcde4",
|
||||||
|
"priority": 70,
|
||||||
|
"region": "eu-central-1",
|
||||||
|
"roles": ["worker"],
|
||||||
|
"enabled": false
|
||||||
|
}
|
||||||
|
]
|
||||||
|
}
|
||||||
@@ -2,7 +2,7 @@ version: "3.9"
|
|||||||
|
|
||||||
services:
|
services:
|
||||||
chorus:
|
chorus:
|
||||||
image: anthonyrawlins/chorus:resetdata-secrets-v1.0.5
|
image: anthonyrawlins/chorus:discovery-debug
|
||||||
|
|
||||||
# REQUIRED: License configuration (CHORUS will not start without this)
|
# REQUIRED: License configuration (CHORUS will not start without this)
|
||||||
environment:
|
environment:
|
||||||
@@ -15,7 +15,7 @@ services:
|
|||||||
- CHORUS_AGENT_ID=${CHORUS_AGENT_ID:-} # Auto-generated if not provided
|
- CHORUS_AGENT_ID=${CHORUS_AGENT_ID:-} # Auto-generated if not provided
|
||||||
- CHORUS_SPECIALIZATION=${CHORUS_SPECIALIZATION:-general_developer}
|
- CHORUS_SPECIALIZATION=${CHORUS_SPECIALIZATION:-general_developer}
|
||||||
- CHORUS_MAX_TASKS=${CHORUS_MAX_TASKS:-3}
|
- CHORUS_MAX_TASKS=${CHORUS_MAX_TASKS:-3}
|
||||||
- CHORUS_CAPABILITIES=${CHORUS_CAPABILITIES:-general_development,task_coordination}
|
- CHORUS_CAPABILITIES=general_development,task_coordination,admin_election
|
||||||
|
|
||||||
# Network configuration
|
# Network configuration
|
||||||
- CHORUS_API_PORT=8080
|
- CHORUS_API_PORT=8080
|
||||||
@@ -23,6 +23,25 @@ services:
|
|||||||
- CHORUS_P2P_PORT=9000
|
- CHORUS_P2P_PORT=9000
|
||||||
- CHORUS_BIND_ADDRESS=0.0.0.0
|
- CHORUS_BIND_ADDRESS=0.0.0.0
|
||||||
|
|
||||||
|
# Scaling optimizations (as per WHOOSH issue #7)
|
||||||
|
- CHORUS_MDNS_ENABLED=false # Disabled for container/swarm environments
|
||||||
|
- CHORUS_DIALS_PER_SEC=5 # Rate limit outbound connections to prevent storms
|
||||||
|
- CHORUS_MAX_CONCURRENT_DHT=16 # Limit concurrent DHT queries
|
||||||
|
|
||||||
|
# Election stability windows (Medium-risk fix 2.1)
|
||||||
|
- CHORUS_ELECTION_MIN_TERM=30s # Minimum time between elections to prevent churn
|
||||||
|
- CHORUS_LEADER_MIN_TERM=45s # Minimum time before challenging healthy leader
|
||||||
|
|
||||||
|
# Assignment system for runtime configuration (Medium-risk fix 2.2)
|
||||||
|
- ASSIGN_URL=${ASSIGN_URL:-} # Optional: WHOOSH assignment endpoint
|
||||||
|
- TASK_SLOT=${TASK_SLOT:-} # Optional: Task slot identifier
|
||||||
|
- TASK_ID=${TASK_ID:-} # Optional: Task identifier
|
||||||
|
- NODE_ID=${NODE_ID:-} # Optional: Node identifier
|
||||||
|
|
||||||
|
# Bootstrap pool configuration (supports JSON and CSV)
|
||||||
|
- BOOTSTRAP_JSON=/config/bootstrap.json # Optional: JSON bootstrap config
|
||||||
|
- CHORUS_BOOTSTRAP_PEERS=${CHORUS_BOOTSTRAP_PEERS:-} # CSV fallback
|
||||||
|
|
||||||
# AI configuration - Provider selection
|
# AI configuration - Provider selection
|
||||||
- CHORUS_AI_PROVIDER=${CHORUS_AI_PROVIDER:-resetdata}
|
- CHORUS_AI_PROVIDER=${CHORUS_AI_PROVIDER:-resetdata}
|
||||||
|
|
||||||
@@ -58,6 +77,11 @@ services:
|
|||||||
- chorus_license_id
|
- chorus_license_id
|
||||||
- resetdata_api_key
|
- resetdata_api_key
|
||||||
|
|
||||||
|
# Configuration files
|
||||||
|
configs:
|
||||||
|
- source: chorus_bootstrap
|
||||||
|
target: /config/bootstrap.json
|
||||||
|
|
||||||
# Persistent data storage
|
# Persistent data storage
|
||||||
volumes:
|
volumes:
|
||||||
- chorus_data:/app/data
|
- chorus_data:/app/data
|
||||||
@@ -71,7 +95,7 @@ services:
|
|||||||
# Container resource limits
|
# Container resource limits
|
||||||
deploy:
|
deploy:
|
||||||
mode: replicated
|
mode: replicated
|
||||||
replicas: ${CHORUS_REPLICAS:-1}
|
replicas: ${CHORUS_REPLICAS:-9}
|
||||||
update_config:
|
update_config:
|
||||||
parallelism: 1
|
parallelism: 1
|
||||||
delay: 10s
|
delay: 10s
|
||||||
@@ -91,7 +115,6 @@ services:
|
|||||||
memory: 128M
|
memory: 128M
|
||||||
placement:
|
placement:
|
||||||
constraints:
|
constraints:
|
||||||
- node.hostname != rosewood
|
|
||||||
- node.hostname != acacia
|
- node.hostname != acacia
|
||||||
preferences:
|
preferences:
|
||||||
- spread: node.hostname
|
- spread: node.hostname
|
||||||
@@ -169,7 +192,14 @@ services:
|
|||||||
# Scaling system configuration
|
# Scaling system configuration
|
||||||
WHOOSH_SCALING_KACHING_URL: "https://kaching.chorus.services"
|
WHOOSH_SCALING_KACHING_URL: "https://kaching.chorus.services"
|
||||||
WHOOSH_SCALING_BACKBEAT_URL: "http://backbeat-pulse:8080"
|
WHOOSH_SCALING_BACKBEAT_URL: "http://backbeat-pulse:8080"
|
||||||
WHOOSH_SCALING_CHORUS_URL: "http://chorus:8080"
|
WHOOSH_SCALING_CHORUS_URL: "http://chorus:9000"
|
||||||
|
|
||||||
|
# BACKBEAT integration configuration (temporarily disabled)
|
||||||
|
WHOOSH_BACKBEAT_ENABLED: "false"
|
||||||
|
WHOOSH_BACKBEAT_CLUSTER_ID: "chorus-production"
|
||||||
|
WHOOSH_BACKBEAT_AGENT_ID: "whoosh"
|
||||||
|
WHOOSH_BACKBEAT_NATS_URL: "nats://backbeat-nats:4222"
|
||||||
|
|
||||||
secrets:
|
secrets:
|
||||||
- whoosh_db_password
|
- whoosh_db_password
|
||||||
- gitea_token
|
- gitea_token
|
||||||
@@ -212,14 +242,16 @@ services:
|
|||||||
cpus: '0.25'
|
cpus: '0.25'
|
||||||
labels:
|
labels:
|
||||||
- traefik.enable=true
|
- traefik.enable=true
|
||||||
|
- traefik.docker.network=tengig
|
||||||
- traefik.http.routers.whoosh.rule=Host(`whoosh.chorus.services`)
|
- traefik.http.routers.whoosh.rule=Host(`whoosh.chorus.services`)
|
||||||
- traefik.http.routers.whoosh.tls=true
|
- traefik.http.routers.whoosh.tls=true
|
||||||
- traefik.http.routers.whoosh.tls.certresolver=letsencrypt
|
- traefik.http.routers.whoosh.tls.certresolver=letsencryptresolver
|
||||||
|
- traefik.http.routers.photoprism.entrypoints=web,web-secured
|
||||||
- traefik.http.services.whoosh.loadbalancer.server.port=8080
|
- traefik.http.services.whoosh.loadbalancer.server.port=8080
|
||||||
- traefik.http.middlewares.whoosh-auth.basicauth.users=admin:$$2y$$10$$example_hash
|
- traefik.http.services.photoprism.loadbalancer.passhostheader=true
|
||||||
|
- traefik.http.middlewares.whoosh-auth.basicauth.users=admin:$2y$10$example_hash
|
||||||
networks:
|
networks:
|
||||||
- tengig
|
- tengig
|
||||||
- whoosh-backend
|
|
||||||
- chorus_net
|
- chorus_net
|
||||||
healthcheck:
|
healthcheck:
|
||||||
test: ["CMD", "/app/whoosh", "--health-check"]
|
test: ["CMD", "/app/whoosh", "--health-check"]
|
||||||
@@ -257,14 +289,13 @@ services:
|
|||||||
memory: 256M
|
memory: 256M
|
||||||
cpus: '0.5'
|
cpus: '0.5'
|
||||||
networks:
|
networks:
|
||||||
- whoosh-backend
|
|
||||||
- chorus_net
|
- chorus_net
|
||||||
healthcheck:
|
healthcheck:
|
||||||
test: ["CMD-SHELL", "pg_isready -U whoosh"]
|
test: ["CMD-SHELL", "pg_isready -h localhost -p 5432 -U whoosh -d whoosh"]
|
||||||
interval: 30s
|
interval: 30s
|
||||||
timeout: 10s
|
timeout: 10s
|
||||||
retries: 5
|
retries: 5
|
||||||
start_period: 30s
|
start_period: 40s
|
||||||
|
|
||||||
|
|
||||||
redis:
|
redis:
|
||||||
@@ -292,7 +323,6 @@ services:
|
|||||||
memory: 64M
|
memory: 64M
|
||||||
cpus: '0.1'
|
cpus: '0.1'
|
||||||
networks:
|
networks:
|
||||||
- whoosh-backend
|
|
||||||
- chorus_net
|
- chorus_net
|
||||||
healthcheck:
|
healthcheck:
|
||||||
test: ["CMD", "sh", "-c", "redis-cli --no-auth-warning -a $$(cat /run/secrets/redis_password) ping"]
|
test: ["CMD", "sh", "-c", "redis-cli --no-auth-warning -a $$(cat /run/secrets/redis_password) ping"]
|
||||||
@@ -310,6 +340,66 @@ services:
|
|||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
prometheus:
|
||||||
|
image: prom/prometheus:latest
|
||||||
|
command:
|
||||||
|
- '--config.file=/etc/prometheus/prometheus.yml'
|
||||||
|
- '--storage.tsdb.path=/prometheus'
|
||||||
|
- '--web.console.libraries=/usr/share/prometheus/console_libraries'
|
||||||
|
- '--web.console.templates=/usr/share/prometheus/consoles'
|
||||||
|
volumes:
|
||||||
|
- /rust/containers/CHORUS/monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
|
||||||
|
- /rust/containers/CHORUS/monitoring/prometheus:/prometheus
|
||||||
|
ports:
|
||||||
|
- "9099:9090" # Expose Prometheus UI
|
||||||
|
deploy:
|
||||||
|
replicas: 1
|
||||||
|
labels:
|
||||||
|
- traefik.enable=true
|
||||||
|
- traefik.http.routers.prometheus.rule=Host(`prometheus.chorus.services`)
|
||||||
|
- traefik.http.routers.prometheus.entrypoints=web,web-secured
|
||||||
|
- traefik.http.routers.prometheus.tls=true
|
||||||
|
- traefik.http.routers.prometheus.tls.certresolver=letsencryptresolver
|
||||||
|
- traefik.http.services.prometheus.loadbalancer.server.port=9090
|
||||||
|
networks:
|
||||||
|
- chorus_net
|
||||||
|
- tengig
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:9090/-/ready"]
|
||||||
|
interval: 30s
|
||||||
|
timeout: 10s
|
||||||
|
retries: 3
|
||||||
|
start_period: 10s
|
||||||
|
|
||||||
|
grafana:
|
||||||
|
image: grafana/grafana:latest
|
||||||
|
user: "1000:1000"
|
||||||
|
environment:
|
||||||
|
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_ADMIN_PASSWORD:-admin} # Use a strong password in production
|
||||||
|
- GF_SERVER_ROOT_URL=https://grafana.chorus.services
|
||||||
|
volumes:
|
||||||
|
- /rust/containers/CHORUS/monitoring/grafana:/var/lib/grafana
|
||||||
|
ports:
|
||||||
|
- "3300:3000" # Expose Grafana UI
|
||||||
|
deploy:
|
||||||
|
replicas: 1
|
||||||
|
labels:
|
||||||
|
- traefik.enable=true
|
||||||
|
- traefik.http.routers.grafana.rule=Host(`grafana.chorus.services`)
|
||||||
|
- traefik.http.routers.grafana.entrypoints=web,web-secured
|
||||||
|
- traefik.http.routers.grafana.tls=true
|
||||||
|
- traefik.http.routers.grafana.tls.certresolver=letsencryptresolver
|
||||||
|
- traefik.http.services.grafana.loadbalancer.server.port=3000
|
||||||
|
networks:
|
||||||
|
- chorus_net
|
||||||
|
- tengig
|
||||||
|
healthcheck:
|
||||||
|
test: ["CMD", "wget", "--no-verbose", "--tries=1", "--spider", "http://localhost:3000/api/health"]
|
||||||
|
interval: 30s
|
||||||
|
timeout: 10s
|
||||||
|
retries: 3
|
||||||
|
start_period: 10s
|
||||||
|
|
||||||
# BACKBEAT Pulse Service - Leader-elected tempo broadcaster
|
# BACKBEAT Pulse Service - Leader-elected tempo broadcaster
|
||||||
# REQ: BACKBEAT-REQ-001 - Single BeatFrame publisher per cluster
|
# REQ: BACKBEAT-REQ-001 - Single BeatFrame publisher per cluster
|
||||||
# REQ: BACKBEAT-OPS-001 - One replica prefers leadership
|
# REQ: BACKBEAT-OPS-001 - One replica prefers leadership
|
||||||
@@ -355,8 +445,6 @@ services:
|
|||||||
placement:
|
placement:
|
||||||
preferences:
|
preferences:
|
||||||
- spread: node.hostname
|
- spread: node.hostname
|
||||||
constraints:
|
|
||||||
- node.hostname != rosewood # Avoid intermittent gaming PC
|
|
||||||
resources:
|
resources:
|
||||||
limits:
|
limits:
|
||||||
memory: 256M
|
memory: 256M
|
||||||
@@ -424,8 +512,6 @@ services:
|
|||||||
placement:
|
placement:
|
||||||
preferences:
|
preferences:
|
||||||
- spread: node.hostname
|
- spread: node.hostname
|
||||||
constraints:
|
|
||||||
- node.hostname != rosewood
|
|
||||||
resources:
|
resources:
|
||||||
limits:
|
limits:
|
||||||
memory: 512M # Larger for window aggregation
|
memory: 512M # Larger for window aggregation
|
||||||
@@ -458,7 +544,6 @@ services:
|
|||||||
backbeat-nats:
|
backbeat-nats:
|
||||||
image: nats:2.9-alpine
|
image: nats:2.9-alpine
|
||||||
command: ["--jetstream"]
|
command: ["--jetstream"]
|
||||||
|
|
||||||
deploy:
|
deploy:
|
||||||
replicas: 1
|
replicas: 1
|
||||||
restart_policy:
|
restart_policy:
|
||||||
@@ -469,8 +554,6 @@ services:
|
|||||||
placement:
|
placement:
|
||||||
preferences:
|
preferences:
|
||||||
- spread: node.hostname
|
- spread: node.hostname
|
||||||
constraints:
|
|
||||||
- node.hostname != rosewood
|
|
||||||
resources:
|
resources:
|
||||||
limits:
|
limits:
|
||||||
memory: 256M
|
memory: 256M
|
||||||
@@ -478,10 +561,8 @@ services:
|
|||||||
reservations:
|
reservations:
|
||||||
memory: 128M
|
memory: 128M
|
||||||
cpus: '0.25'
|
cpus: '0.25'
|
||||||
|
|
||||||
networks:
|
networks:
|
||||||
- chorus_net
|
- chorus_net
|
||||||
|
|
||||||
# Container logging
|
# Container logging
|
||||||
logging:
|
logging:
|
||||||
driver: "json-file"
|
driver: "json-file"
|
||||||
@@ -495,6 +576,24 @@ services:
|
|||||||
|
|
||||||
# Persistent volumes
|
# Persistent volumes
|
||||||
volumes:
|
volumes:
|
||||||
|
prometheus_data:
|
||||||
|
driver: local
|
||||||
|
driver_opts:
|
||||||
|
type: none
|
||||||
|
o: bind
|
||||||
|
device: /rust/containers/CHORUS/monitoring/prometheus
|
||||||
|
prometheus_config:
|
||||||
|
driver: local
|
||||||
|
driver_opts:
|
||||||
|
type: none
|
||||||
|
o: bind
|
||||||
|
device: /rust/containers/CHORUS/monitoring/prometheus
|
||||||
|
grafana_data:
|
||||||
|
driver: local
|
||||||
|
driver_opts:
|
||||||
|
type: none
|
||||||
|
o: bind
|
||||||
|
device: /rust/containers/CHORUS/monitoring/grafana
|
||||||
chorus_data:
|
chorus_data:
|
||||||
driver: local
|
driver: local
|
||||||
whoosh_postgres_data:
|
whoosh_postgres_data:
|
||||||
@@ -516,18 +615,14 @@ networks:
|
|||||||
tengig:
|
tengig:
|
||||||
external: true
|
external: true
|
||||||
|
|
||||||
whoosh-backend:
|
|
||||||
driver: overlay
|
|
||||||
attachable: false
|
|
||||||
|
|
||||||
chorus_net:
|
chorus_net:
|
||||||
driver: overlay
|
driver: overlay
|
||||||
attachable: true
|
attachable: true
|
||||||
ipam:
|
|
||||||
config:
|
|
||||||
- subnet: 10.201.0.0/24
|
|
||||||
|
|
||||||
|
|
||||||
|
configs:
|
||||||
|
chorus_bootstrap:
|
||||||
|
file: ./bootstrap.json
|
||||||
|
|
||||||
secrets:
|
secrets:
|
||||||
chorus_license_id:
|
chorus_license_id:
|
||||||
|
|||||||
3
go.mod
3
go.mod
@@ -21,9 +21,11 @@ require (
|
|||||||
github.com/prometheus/client_golang v1.19.1
|
github.com/prometheus/client_golang v1.19.1
|
||||||
github.com/robfig/cron/v3 v3.0.1
|
github.com/robfig/cron/v3 v3.0.1
|
||||||
github.com/sashabaranov/go-openai v1.41.1
|
github.com/sashabaranov/go-openai v1.41.1
|
||||||
|
github.com/sony/gobreaker v0.5.0
|
||||||
github.com/stretchr/testify v1.10.0
|
github.com/stretchr/testify v1.10.0
|
||||||
github.com/syndtr/goleveldb v1.0.0
|
github.com/syndtr/goleveldb v1.0.0
|
||||||
golang.org/x/crypto v0.24.0
|
golang.org/x/crypto v0.24.0
|
||||||
|
gopkg.in/yaml.v3 v3.0.1
|
||||||
)
|
)
|
||||||
|
|
||||||
require (
|
require (
|
||||||
@@ -155,7 +157,6 @@ require (
|
|||||||
golang.org/x/tools v0.22.0 // indirect
|
golang.org/x/tools v0.22.0 // indirect
|
||||||
gonum.org/v1/gonum v0.13.0 // indirect
|
gonum.org/v1/gonum v0.13.0 // indirect
|
||||||
google.golang.org/protobuf v1.33.0 // indirect
|
google.golang.org/protobuf v1.33.0 // indirect
|
||||||
gopkg.in/yaml.v3 v3.0.1 // indirect
|
|
||||||
lukechampine.com/blake3 v1.2.1 // indirect
|
lukechampine.com/blake3 v1.2.1 // indirect
|
||||||
)
|
)
|
||||||
|
|
||||||
|
|||||||
2
go.sum
2
go.sum
@@ -437,6 +437,8 @@ github.com/smartystreets/assertions v1.2.0 h1:42S6lae5dvLc7BrLu/0ugRtcFVjoJNMC/N
|
|||||||
github.com/smartystreets/assertions v1.2.0/go.mod h1:tcbTF8ujkAEcZ8TElKY+i30BzYlVhC/LOxJk7iOWnoo=
|
github.com/smartystreets/assertions v1.2.0/go.mod h1:tcbTF8ujkAEcZ8TElKY+i30BzYlVhC/LOxJk7iOWnoo=
|
||||||
github.com/smartystreets/goconvey v1.7.2 h1:9RBaZCeXEQ3UselpuwUQHltGVXvdwm6cv1hgR6gDIPg=
|
github.com/smartystreets/goconvey v1.7.2 h1:9RBaZCeXEQ3UselpuwUQHltGVXvdwm6cv1hgR6gDIPg=
|
||||||
github.com/smartystreets/goconvey v1.7.2/go.mod h1:Vw0tHAZW6lzCRk3xgdin6fKYcG+G3Pg9vgXWeJpQFMM=
|
github.com/smartystreets/goconvey v1.7.2/go.mod h1:Vw0tHAZW6lzCRk3xgdin6fKYcG+G3Pg9vgXWeJpQFMM=
|
||||||
|
github.com/sony/gobreaker v0.5.0 h1:dRCvqm0P490vZPmy7ppEk2qCnCieBooFJ+YoXGYB+yg=
|
||||||
|
github.com/sony/gobreaker v0.5.0/go.mod h1:ZKptC7FHNvhBz7dN2LGjPVBz2sZJmc0/PkyDJOjmxWY=
|
||||||
github.com/sourcegraph/annotate v0.0.0-20160123013949-f4cad6c6324d/go.mod h1:UdhH50NIW0fCiwBSr0co2m7BnFLdv4fQTgdqdJTHFeE=
|
github.com/sourcegraph/annotate v0.0.0-20160123013949-f4cad6c6324d/go.mod h1:UdhH50NIW0fCiwBSr0co2m7BnFLdv4fQTgdqdJTHFeE=
|
||||||
github.com/sourcegraph/syntaxhighlight v0.0.0-20170531221838-bd320f5d308e/go.mod h1:HuIsMU8RRBOtsCgI77wP899iHVBQpCmg4ErYMZB+2IA=
|
github.com/sourcegraph/syntaxhighlight v0.0.0-20170531221838-bd320f5d308e/go.mod h1:HuIsMU8RRBOtsCgI77wP899iHVBQpCmg4ErYMZB+2IA=
|
||||||
github.com/spaolacci/murmur3 v1.1.0 h1:7c1g84S4BPRrfL5Xrdp6fOJ206sU9y293DDHaoy0bLI=
|
github.com/spaolacci/murmur3 v1.1.0 h1:7c1g84S4BPRrfL5Xrdp6fOJ206sU9y293DDHaoy0bLI=
|
||||||
|
|||||||
340
internal/licensing/license_gate.go
Normal file
340
internal/licensing/license_gate.go
Normal file
@@ -0,0 +1,340 @@
|
|||||||
|
package licensing
|
||||||
|
|
||||||
|
import (
|
||||||
|
"context"
|
||||||
|
"encoding/json"
|
||||||
|
"fmt"
|
||||||
|
"net/http"
|
||||||
|
"strings"
|
||||||
|
"sync/atomic"
|
||||||
|
"time"
|
||||||
|
|
||||||
|
"github.com/sony/gobreaker"
|
||||||
|
)
|
||||||
|
|
||||||
|
// LicenseGate provides burst-proof license validation with caching and circuit breaker
|
||||||
|
type LicenseGate struct {
|
||||||
|
config LicenseConfig
|
||||||
|
cache atomic.Value // stores cachedLease
|
||||||
|
breaker *gobreaker.CircuitBreaker
|
||||||
|
graceUntil atomic.Value // stores time.Time
|
||||||
|
httpClient *http.Client
|
||||||
|
}
|
||||||
|
|
||||||
|
// cachedLease represents a cached license lease with expiry
|
||||||
|
type cachedLease struct {
|
||||||
|
LeaseToken string `json:"lease_token"`
|
||||||
|
ExpiresAt time.Time `json:"expires_at"`
|
||||||
|
ClusterID string `json:"cluster_id"`
|
||||||
|
Valid bool `json:"valid"`
|
||||||
|
CachedAt time.Time `json:"cached_at"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// LeaseRequest represents a cluster lease request
|
||||||
|
type LeaseRequest struct {
|
||||||
|
ClusterID string `json:"cluster_id"`
|
||||||
|
RequestedReplicas int `json:"requested_replicas"`
|
||||||
|
DurationMinutes int `json:"duration_minutes"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// LeaseResponse represents a cluster lease response
|
||||||
|
type LeaseResponse struct {
|
||||||
|
LeaseToken string `json:"lease_token"`
|
||||||
|
MaxReplicas int `json:"max_replicas"`
|
||||||
|
ExpiresAt time.Time `json:"expires_at"`
|
||||||
|
ClusterID string `json:"cluster_id"`
|
||||||
|
LeaseID string `json:"lease_id"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// LeaseValidationRequest represents a lease validation request
|
||||||
|
type LeaseValidationRequest struct {
|
||||||
|
LeaseToken string `json:"lease_token"`
|
||||||
|
ClusterID string `json:"cluster_id"`
|
||||||
|
AgentID string `json:"agent_id"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// LeaseValidationResponse represents a lease validation response
|
||||||
|
type LeaseValidationResponse struct {
|
||||||
|
Valid bool `json:"valid"`
|
||||||
|
RemainingReplicas int `json:"remaining_replicas"`
|
||||||
|
ExpiresAt time.Time `json:"expires_at"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// NewLicenseGate creates a new license gate with circuit breaker and caching
|
||||||
|
func NewLicenseGate(config LicenseConfig) *LicenseGate {
|
||||||
|
// Circuit breaker settings optimized for license validation
|
||||||
|
breakerSettings := gobreaker.Settings{
|
||||||
|
Name: "license-validation",
|
||||||
|
MaxRequests: 3, // Allow 3 requests in half-open state
|
||||||
|
Interval: 60 * time.Second, // Reset failure count every minute
|
||||||
|
Timeout: 30 * time.Second, // Stay open for 30 seconds
|
||||||
|
ReadyToTrip: func(counts gobreaker.Counts) bool {
|
||||||
|
// Trip after 3 consecutive failures
|
||||||
|
return counts.ConsecutiveFailures >= 3
|
||||||
|
},
|
||||||
|
OnStateChange: func(name string, from gobreaker.State, to gobreaker.State) {
|
||||||
|
fmt.Printf("🔌 License validation circuit breaker: %s -> %s\n", from, to)
|
||||||
|
},
|
||||||
|
}
|
||||||
|
|
||||||
|
gate := &LicenseGate{
|
||||||
|
config: config,
|
||||||
|
breaker: gobreaker.NewCircuitBreaker(breakerSettings),
|
||||||
|
httpClient: &http.Client{Timeout: 10 * time.Second},
|
||||||
|
}
|
||||||
|
|
||||||
|
// Initialize grace period
|
||||||
|
gate.graceUntil.Store(time.Now().Add(90 * time.Second))
|
||||||
|
|
||||||
|
return gate
|
||||||
|
}
|
||||||
|
|
||||||
|
// ValidNow checks if the cached lease is currently valid
|
||||||
|
func (c *cachedLease) ValidNow() bool {
|
||||||
|
if !c.Valid {
|
||||||
|
return false
|
||||||
|
}
|
||||||
|
// Consider lease invalid 2 minutes before actual expiry for safety margin
|
||||||
|
return time.Now().Before(c.ExpiresAt.Add(-2 * time.Minute))
|
||||||
|
}
|
||||||
|
|
||||||
|
// loadCachedLease safely loads the cached lease
|
||||||
|
func (g *LicenseGate) loadCachedLease() *cachedLease {
|
||||||
|
if cached := g.cache.Load(); cached != nil {
|
||||||
|
if lease, ok := cached.(*cachedLease); ok {
|
||||||
|
return lease
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return &cachedLease{Valid: false}
|
||||||
|
}
|
||||||
|
|
||||||
|
// storeLease safely stores a lease in the cache
|
||||||
|
func (g *LicenseGate) storeLease(lease *cachedLease) {
|
||||||
|
lease.CachedAt = time.Now()
|
||||||
|
g.cache.Store(lease)
|
||||||
|
}
|
||||||
|
|
||||||
|
// isInGracePeriod checks if we're still in the grace period
|
||||||
|
func (g *LicenseGate) isInGracePeriod() bool {
|
||||||
|
if graceUntil := g.graceUntil.Load(); graceUntil != nil {
|
||||||
|
if grace, ok := graceUntil.(time.Time); ok {
|
||||||
|
return time.Now().Before(grace)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return false
|
||||||
|
}
|
||||||
|
|
||||||
|
// extendGracePeriod extends the grace period on successful validation
|
||||||
|
func (g *LicenseGate) extendGracePeriod() {
|
||||||
|
g.graceUntil.Store(time.Now().Add(90 * time.Second))
|
||||||
|
}
|
||||||
|
|
||||||
|
// Validate validates the license using cache, lease system, and circuit breaker
|
||||||
|
func (g *LicenseGate) Validate(ctx context.Context, agentID string) error {
|
||||||
|
// Check cached lease first
|
||||||
|
if lease := g.loadCachedLease(); lease.ValidNow() {
|
||||||
|
return g.validateCachedLease(ctx, lease, agentID)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Try to get/renew lease through circuit breaker
|
||||||
|
_, err := g.breaker.Execute(func() (interface{}, error) {
|
||||||
|
lease, err := g.requestOrRenewLease(ctx)
|
||||||
|
if err != nil {
|
||||||
|
return nil, err
|
||||||
|
}
|
||||||
|
|
||||||
|
// Validate the new lease
|
||||||
|
if err := g.validateLease(ctx, lease, agentID); err != nil {
|
||||||
|
return nil, err
|
||||||
|
}
|
||||||
|
|
||||||
|
// Store successful lease
|
||||||
|
g.storeLease(&cachedLease{
|
||||||
|
LeaseToken: lease.LeaseToken,
|
||||||
|
ExpiresAt: lease.ExpiresAt,
|
||||||
|
ClusterID: lease.ClusterID,
|
||||||
|
Valid: true,
|
||||||
|
})
|
||||||
|
|
||||||
|
return nil, nil
|
||||||
|
})
|
||||||
|
|
||||||
|
if err != nil {
|
||||||
|
// If we're in grace period, allow startup but log warning
|
||||||
|
if g.isInGracePeriod() {
|
||||||
|
fmt.Printf("⚠️ License validation failed but in grace period: %v\n", err)
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
return fmt.Errorf("license validation failed: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Extend grace period on successful validation
|
||||||
|
g.extendGracePeriod()
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// validateCachedLease validates using cached lease token
|
||||||
|
func (g *LicenseGate) validateCachedLease(ctx context.Context, lease *cachedLease, agentID string) error {
|
||||||
|
validation := LeaseValidationRequest{
|
||||||
|
LeaseToken: lease.LeaseToken,
|
||||||
|
ClusterID: g.config.ClusterID,
|
||||||
|
AgentID: agentID,
|
||||||
|
}
|
||||||
|
|
||||||
|
url := fmt.Sprintf("%s/api/v1/licenses/validate-lease", strings.TrimSuffix(g.config.KachingURL, "/"))
|
||||||
|
|
||||||
|
reqBody, err := json.Marshal(validation)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to marshal lease validation request: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
req, err := http.NewRequestWithContext(ctx, "POST", url, strings.NewReader(string(reqBody)))
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to create lease validation request: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
req.Header.Set("Content-Type", "application/json")
|
||||||
|
|
||||||
|
resp, err := g.httpClient.Do(req)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("lease validation request failed: %w", err)
|
||||||
|
}
|
||||||
|
defer resp.Body.Close()
|
||||||
|
|
||||||
|
if resp.StatusCode != http.StatusOK {
|
||||||
|
// If validation fails, invalidate cache
|
||||||
|
lease.Valid = false
|
||||||
|
g.storeLease(lease)
|
||||||
|
return fmt.Errorf("lease validation failed with status %d", resp.StatusCode)
|
||||||
|
}
|
||||||
|
|
||||||
|
var validationResp LeaseValidationResponse
|
||||||
|
if err := json.NewDecoder(resp.Body).Decode(&validationResp); err != nil {
|
||||||
|
return fmt.Errorf("failed to decode lease validation response: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
if !validationResp.Valid {
|
||||||
|
// If validation fails, invalidate cache
|
||||||
|
lease.Valid = false
|
||||||
|
g.storeLease(lease)
|
||||||
|
return fmt.Errorf("lease token is invalid")
|
||||||
|
}
|
||||||
|
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// requestOrRenewLease requests a new cluster lease or renews existing one
|
||||||
|
func (g *LicenseGate) requestOrRenewLease(ctx context.Context) (*LeaseResponse, error) {
|
||||||
|
// For now, request a new lease (TODO: implement renewal logic)
|
||||||
|
leaseReq := LeaseRequest{
|
||||||
|
ClusterID: g.config.ClusterID,
|
||||||
|
RequestedReplicas: 1, // Start with single replica
|
||||||
|
DurationMinutes: 60, // 1 hour lease
|
||||||
|
}
|
||||||
|
|
||||||
|
url := fmt.Sprintf("%s/api/v1/licenses/%s/cluster-lease",
|
||||||
|
strings.TrimSuffix(g.config.KachingURL, "/"), g.config.LicenseID)
|
||||||
|
|
||||||
|
reqBody, err := json.Marshal(leaseReq)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to marshal lease request: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
req, err := http.NewRequestWithContext(ctx, "POST", url, strings.NewReader(string(reqBody)))
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to create lease request: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
req.Header.Set("Content-Type", "application/json")
|
||||||
|
|
||||||
|
resp, err := g.httpClient.Do(req)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("lease request failed: %w", err)
|
||||||
|
}
|
||||||
|
defer resp.Body.Close()
|
||||||
|
|
||||||
|
if resp.StatusCode == http.StatusTooManyRequests {
|
||||||
|
return nil, fmt.Errorf("rate limited by KACHING, retry after: %s", resp.Header.Get("Retry-After"))
|
||||||
|
}
|
||||||
|
|
||||||
|
if resp.StatusCode != http.StatusOK {
|
||||||
|
return nil, fmt.Errorf("lease request failed with status %d", resp.StatusCode)
|
||||||
|
}
|
||||||
|
|
||||||
|
var leaseResp LeaseResponse
|
||||||
|
if err := json.NewDecoder(resp.Body).Decode(&leaseResp); err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to decode lease response: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
return &leaseResp, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// validateLease validates a lease token
|
||||||
|
func (g *LicenseGate) validateLease(ctx context.Context, lease *LeaseResponse, agentID string) error {
|
||||||
|
validation := LeaseValidationRequest{
|
||||||
|
LeaseToken: lease.LeaseToken,
|
||||||
|
ClusterID: lease.ClusterID,
|
||||||
|
AgentID: agentID,
|
||||||
|
}
|
||||||
|
|
||||||
|
return g.validateLeaseRequest(ctx, validation)
|
||||||
|
}
|
||||||
|
|
||||||
|
// validateLeaseRequest performs the actual lease validation HTTP request
|
||||||
|
func (g *LicenseGate) validateLeaseRequest(ctx context.Context, validation LeaseValidationRequest) error {
|
||||||
|
url := fmt.Sprintf("%s/api/v1/licenses/validate-lease", strings.TrimSuffix(g.config.KachingURL, "/"))
|
||||||
|
|
||||||
|
reqBody, err := json.Marshal(validation)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to marshal lease validation request: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
req, err := http.NewRequestWithContext(ctx, "POST", url, strings.NewReader(string(reqBody)))
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to create lease validation request: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
req.Header.Set("Content-Type", "application/json")
|
||||||
|
|
||||||
|
resp, err := g.httpClient.Do(req)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("lease validation request failed: %w", err)
|
||||||
|
}
|
||||||
|
defer resp.Body.Close()
|
||||||
|
|
||||||
|
if resp.StatusCode != http.StatusOK {
|
||||||
|
return fmt.Errorf("lease validation failed with status %d", resp.StatusCode)
|
||||||
|
}
|
||||||
|
|
||||||
|
var validationResp LeaseValidationResponse
|
||||||
|
if err := json.NewDecoder(resp.Body).Decode(&validationResp); err != nil {
|
||||||
|
return fmt.Errorf("failed to decode lease validation response: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
if !validationResp.Valid {
|
||||||
|
return fmt.Errorf("lease token is invalid")
|
||||||
|
}
|
||||||
|
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// GetCacheStats returns cache statistics for monitoring
|
||||||
|
func (g *LicenseGate) GetCacheStats() map[string]interface{} {
|
||||||
|
lease := g.loadCachedLease()
|
||||||
|
stats := map[string]interface{}{
|
||||||
|
"cache_valid": lease.Valid,
|
||||||
|
"cache_hit": lease.ValidNow(),
|
||||||
|
"expires_at": lease.ExpiresAt,
|
||||||
|
"cached_at": lease.CachedAt,
|
||||||
|
"in_grace_period": g.isInGracePeriod(),
|
||||||
|
"breaker_state": g.breaker.State().String(),
|
||||||
|
}
|
||||||
|
|
||||||
|
if grace := g.graceUntil.Load(); grace != nil {
|
||||||
|
if graceTime, ok := grace.(time.Time); ok {
|
||||||
|
stats["grace_until"] = graceTime
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return stats
|
||||||
|
}
|
||||||
@@ -2,6 +2,7 @@ package licensing
|
|||||||
|
|
||||||
import (
|
import (
|
||||||
"bytes"
|
"bytes"
|
||||||
|
"context"
|
||||||
"encoding/json"
|
"encoding/json"
|
||||||
"fmt"
|
"fmt"
|
||||||
"net/http"
|
"net/http"
|
||||||
@@ -21,35 +22,60 @@ type LicenseConfig struct {
|
|||||||
}
|
}
|
||||||
|
|
||||||
// Validator handles license validation with KACHING
|
// Validator handles license validation with KACHING
|
||||||
|
// Enhanced with license gate for burst-proof validation
|
||||||
type Validator struct {
|
type Validator struct {
|
||||||
config LicenseConfig
|
config LicenseConfig
|
||||||
kachingURL string
|
kachingURL string
|
||||||
client *http.Client
|
client *http.Client
|
||||||
|
gate *LicenseGate // New: License gate for scaling support
|
||||||
}
|
}
|
||||||
|
|
||||||
// NewValidator creates a new license validator
|
// NewValidator creates a new license validator with enhanced scaling support
|
||||||
func NewValidator(config LicenseConfig) *Validator {
|
func NewValidator(config LicenseConfig) *Validator {
|
||||||
kachingURL := config.KachingURL
|
kachingURL := config.KachingURL
|
||||||
if kachingURL == "" {
|
if kachingURL == "" {
|
||||||
kachingURL = DefaultKachingURL
|
kachingURL = DefaultKachingURL
|
||||||
}
|
}
|
||||||
|
|
||||||
return &Validator{
|
validator := &Validator{
|
||||||
config: config,
|
config: config,
|
||||||
kachingURL: kachingURL,
|
kachingURL: kachingURL,
|
||||||
client: &http.Client{
|
client: &http.Client{
|
||||||
Timeout: LicenseTimeout,
|
Timeout: LicenseTimeout,
|
||||||
},
|
},
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Initialize license gate for scaling support
|
||||||
|
validator.gate = NewLicenseGate(config)
|
||||||
|
|
||||||
|
return validator
|
||||||
}
|
}
|
||||||
|
|
||||||
// Validate performs license validation with KACHING license authority
|
// Validate performs license validation with KACHING license authority
|
||||||
// CRITICAL: CHORUS will not start without valid license validation
|
// Enhanced with caching, circuit breaker, and lease token support
|
||||||
func (v *Validator) Validate() error {
|
func (v *Validator) Validate() error {
|
||||||
|
return v.ValidateWithContext(context.Background())
|
||||||
|
}
|
||||||
|
|
||||||
|
// ValidateWithContext performs license validation with context and agent ID
|
||||||
|
func (v *Validator) ValidateWithContext(ctx context.Context) error {
|
||||||
if v.config.LicenseID == "" || v.config.ClusterID == "" {
|
if v.config.LicenseID == "" || v.config.ClusterID == "" {
|
||||||
return fmt.Errorf("license ID and cluster ID are required")
|
return fmt.Errorf("license ID and cluster ID are required")
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Use enhanced license gate for validation
|
||||||
|
agentID := "default-agent" // TODO: Get from config/environment
|
||||||
|
if err := v.gate.Validate(ctx, agentID); err != nil {
|
||||||
|
// Fallback to legacy validation for backward compatibility
|
||||||
|
fmt.Printf("⚠️ License gate validation failed, trying legacy validation: %v\n", err)
|
||||||
|
return v.validateLegacy()
|
||||||
|
}
|
||||||
|
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// validateLegacy performs the original license validation (for fallback)
|
||||||
|
func (v *Validator) validateLegacy() error {
|
||||||
// Prepare validation request
|
// Prepare validation request
|
||||||
request := map[string]interface{}{
|
request := map[string]interface{}{
|
||||||
"license_id": v.config.LicenseID,
|
"license_id": v.config.LicenseID,
|
||||||
|
|||||||
@@ -105,6 +105,7 @@ func (t *SimpleTaskTracker) publishTaskCompletion(taskID string, success bool, s
|
|||||||
// SharedRuntime contains all the shared P2P infrastructure components
|
// SharedRuntime contains all the shared P2P infrastructure components
|
||||||
type SharedRuntime struct {
|
type SharedRuntime struct {
|
||||||
Config *config.Config
|
Config *config.Config
|
||||||
|
RuntimeConfig *config.RuntimeConfig
|
||||||
Logger *SimpleLogger
|
Logger *SimpleLogger
|
||||||
Context context.Context
|
Context context.Context
|
||||||
Cancel context.CancelFunc
|
Cancel context.CancelFunc
|
||||||
@@ -149,6 +150,28 @@ func Initialize(appMode string) (*SharedRuntime, error) {
|
|||||||
runtime.Config = cfg
|
runtime.Config = cfg
|
||||||
|
|
||||||
runtime.Logger.Info("✅ Configuration loaded successfully")
|
runtime.Logger.Info("✅ Configuration loaded successfully")
|
||||||
|
|
||||||
|
// Initialize runtime configuration with assignment support
|
||||||
|
runtime.RuntimeConfig = config.NewRuntimeConfig(cfg)
|
||||||
|
|
||||||
|
// Load assignment if ASSIGN_URL is configured
|
||||||
|
if assignURL := os.Getenv("ASSIGN_URL"); assignURL != "" {
|
||||||
|
runtime.Logger.Info("📡 Loading assignment from WHOOSH: %s", assignURL)
|
||||||
|
|
||||||
|
ctx, cancel := context.WithTimeout(runtime.Context, 10*time.Second)
|
||||||
|
if err := runtime.RuntimeConfig.LoadAssignment(ctx, assignURL); err != nil {
|
||||||
|
runtime.Logger.Warn("⚠️ Failed to load assignment (continuing with base config): %v", err)
|
||||||
|
} else {
|
||||||
|
runtime.Logger.Info("✅ Assignment loaded successfully")
|
||||||
|
}
|
||||||
|
cancel()
|
||||||
|
|
||||||
|
// Start reload handler for SIGHUP
|
||||||
|
runtime.RuntimeConfig.StartReloadHandler(runtime.Context, assignURL)
|
||||||
|
runtime.Logger.Info("📡 SIGHUP reload handler started for assignment updates")
|
||||||
|
} else {
|
||||||
|
runtime.Logger.Info("⚪ No ASSIGN_URL configured, using static configuration")
|
||||||
|
}
|
||||||
runtime.Logger.Info("🤖 Agent ID: %s", cfg.Agent.ID)
|
runtime.Logger.Info("🤖 Agent ID: %s", cfg.Agent.ID)
|
||||||
runtime.Logger.Info("🎯 Specialization: %s", cfg.Agent.Specialization)
|
runtime.Logger.Info("🎯 Specialization: %s", cfg.Agent.Specialization)
|
||||||
|
|
||||||
@@ -225,12 +248,17 @@ func Initialize(appMode string) (*SharedRuntime, error) {
|
|||||||
runtime.HypercoreLog = hlog
|
runtime.HypercoreLog = hlog
|
||||||
runtime.Logger.Info("📝 Hypercore logger initialized")
|
runtime.Logger.Info("📝 Hypercore logger initialized")
|
||||||
|
|
||||||
// Initialize mDNS discovery
|
// Initialize mDNS discovery (disabled in container environments for scaling)
|
||||||
mdnsDiscovery, err := discovery.NewMDNSDiscovery(ctx, node.Host(), "chorus-peer-discovery")
|
if cfg.V2.DHT.MDNSEnabled {
|
||||||
if err != nil {
|
mdnsDiscovery, err := discovery.NewMDNSDiscovery(ctx, node.Host(), "chorus-peer-discovery")
|
||||||
return nil, fmt.Errorf("failed to create mDNS discovery: %v", err)
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to create mDNS discovery: %v", err)
|
||||||
|
}
|
||||||
|
runtime.MDNSDiscovery = mdnsDiscovery
|
||||||
|
runtime.Logger.Info("🔍 mDNS discovery enabled for local network")
|
||||||
|
} else {
|
||||||
|
runtime.Logger.Info("⚪ mDNS discovery disabled (recommended for container/swarm deployments)")
|
||||||
}
|
}
|
||||||
runtime.MDNSDiscovery = mdnsDiscovery
|
|
||||||
|
|
||||||
// Initialize PubSub with hypercore logging
|
// Initialize PubSub with hypercore logging
|
||||||
ps, err := pubsub.NewPubSubWithLogger(ctx, node.Host(), "chorus/coordination/v1", "hmmm/meta-discussion/v1", hlog)
|
ps, err := pubsub.NewPubSubWithLogger(ctx, node.Host(), "chorus/coordination/v1", "hmmm/meta-discussion/v1", hlog)
|
||||||
@@ -283,6 +311,7 @@ func (r *SharedRuntime) Cleanup() {
|
|||||||
|
|
||||||
if r.MDNSDiscovery != nil {
|
if r.MDNSDiscovery != nil {
|
||||||
r.MDNSDiscovery.Close()
|
r.MDNSDiscovery.Close()
|
||||||
|
r.Logger.Info("🔍 mDNS discovery closed")
|
||||||
}
|
}
|
||||||
|
|
||||||
if r.PubSub != nil {
|
if r.PubSub != nil {
|
||||||
@@ -407,8 +436,20 @@ func (r *SharedRuntime) initializeDHTStorage() error {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// Connect to bootstrap peers if configured
|
// Connect to bootstrap peers (with assignment override support)
|
||||||
for _, addrStr := range r.Config.V2.DHT.BootstrapPeers {
|
bootstrapPeers := r.RuntimeConfig.GetBootstrapPeers()
|
||||||
|
if len(bootstrapPeers) == 0 {
|
||||||
|
bootstrapPeers = r.Config.V2.DHT.BootstrapPeers
|
||||||
|
}
|
||||||
|
|
||||||
|
// Apply join stagger if configured
|
||||||
|
joinStagger := r.RuntimeConfig.GetJoinStagger()
|
||||||
|
if joinStagger > 0 {
|
||||||
|
r.Logger.Info("⏱️ Applying join stagger delay: %v", joinStagger)
|
||||||
|
time.Sleep(joinStagger)
|
||||||
|
}
|
||||||
|
|
||||||
|
for _, addrStr := range bootstrapPeers {
|
||||||
addr, err := multiaddr.NewMultiaddr(addrStr)
|
addr, err := multiaddr.NewMultiaddr(addrStr)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
r.Logger.Warn("⚠️ Invalid bootstrap address %s: %v", addrStr, err)
|
r.Logger.Warn("⚠️ Invalid bootstrap address %s: %v", addrStr, err)
|
||||||
|
|||||||
@@ -20,10 +20,16 @@ type Config struct {
|
|||||||
DHTMode string // "client", "server", "auto"
|
DHTMode string // "client", "server", "auto"
|
||||||
DHTProtocolPrefix string
|
DHTProtocolPrefix string
|
||||||
|
|
||||||
// Connection limits
|
// Connection limits and rate limiting
|
||||||
MaxConnections int
|
MaxConnections int
|
||||||
MaxPeersPerIP int
|
MaxPeersPerIP int
|
||||||
ConnectionTimeout time.Duration
|
ConnectionTimeout time.Duration
|
||||||
|
LowWatermark int // Connection manager low watermark
|
||||||
|
HighWatermark int // Connection manager high watermark
|
||||||
|
DialsPerSecond int // Dial rate limiting
|
||||||
|
MaxConcurrentDials int // Maximum concurrent outbound dials
|
||||||
|
MaxConcurrentDHT int // Maximum concurrent DHT queries
|
||||||
|
JoinStaggerMS int // Join stagger delay in milliseconds
|
||||||
|
|
||||||
// Security configuration
|
// Security configuration
|
||||||
EnableSecurity bool
|
EnableSecurity bool
|
||||||
@@ -48,8 +54,8 @@ func DefaultConfig() *Config {
|
|||||||
},
|
},
|
||||||
NetworkID: "CHORUS-network",
|
NetworkID: "CHORUS-network",
|
||||||
|
|
||||||
// Discovery settings
|
// Discovery settings - mDNS disabled for Swarm by default
|
||||||
EnableMDNS: true,
|
EnableMDNS: false, // Disabled for container environments
|
||||||
MDNSServiceTag: "CHORUS-peer-discovery",
|
MDNSServiceTag: "CHORUS-peer-discovery",
|
||||||
|
|
||||||
// DHT settings (disabled by default for local development)
|
// DHT settings (disabled by default for local development)
|
||||||
@@ -58,10 +64,16 @@ func DefaultConfig() *Config {
|
|||||||
DHTMode: "auto",
|
DHTMode: "auto",
|
||||||
DHTProtocolPrefix: "/CHORUS",
|
DHTProtocolPrefix: "/CHORUS",
|
||||||
|
|
||||||
// Connection limits for local network
|
// Connection limits and rate limiting for scaling
|
||||||
MaxConnections: 50,
|
MaxConnections: 50,
|
||||||
MaxPeersPerIP: 3,
|
MaxPeersPerIP: 3,
|
||||||
ConnectionTimeout: 30 * time.Second,
|
ConnectionTimeout: 30 * time.Second,
|
||||||
|
LowWatermark: 32, // Keep at least 32 connections
|
||||||
|
HighWatermark: 128, // Trim above 128 connections
|
||||||
|
DialsPerSecond: 5, // Limit outbound dials to prevent storms
|
||||||
|
MaxConcurrentDials: 10, // Maximum concurrent outbound dials
|
||||||
|
MaxConcurrentDHT: 16, // Maximum concurrent DHT queries
|
||||||
|
JoinStaggerMS: 0, // No stagger by default (set by assignment)
|
||||||
|
|
||||||
// Security enabled by default
|
// Security enabled by default
|
||||||
EnableSecurity: true,
|
EnableSecurity: true,
|
||||||
@@ -165,3 +177,33 @@ func WithDHTProtocolPrefix(prefix string) Option {
|
|||||||
c.DHTProtocolPrefix = prefix
|
c.DHTProtocolPrefix = prefix
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// WithConnectionManager sets connection manager watermarks
|
||||||
|
func WithConnectionManager(low, high int) Option {
|
||||||
|
return func(c *Config) {
|
||||||
|
c.LowWatermark = low
|
||||||
|
c.HighWatermark = high
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// WithDialRateLimit sets the dial rate limiting
|
||||||
|
func WithDialRateLimit(dialsPerSecond, maxConcurrent int) Option {
|
||||||
|
return func(c *Config) {
|
||||||
|
c.DialsPerSecond = dialsPerSecond
|
||||||
|
c.MaxConcurrentDials = maxConcurrent
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// WithDHTRateLimit sets the DHT query rate limiting
|
||||||
|
func WithDHTRateLimit(maxConcurrentDHT int) Option {
|
||||||
|
return func(c *Config) {
|
||||||
|
c.MaxConcurrentDHT = maxConcurrentDHT
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// WithJoinStagger sets the join stagger delay in milliseconds
|
||||||
|
func WithJoinStagger(delayMS int) Option {
|
||||||
|
return func(c *Config) {
|
||||||
|
c.JoinStaggerMS = delayMS
|
||||||
|
}
|
||||||
|
}
|
||||||
23
p2p/node.go
23
p2p/node.go
@@ -6,16 +6,18 @@ import (
|
|||||||
"time"
|
"time"
|
||||||
|
|
||||||
"chorus/pkg/dht"
|
"chorus/pkg/dht"
|
||||||
|
|
||||||
"github.com/libp2p/go-libp2p"
|
"github.com/libp2p/go-libp2p"
|
||||||
|
kaddht "github.com/libp2p/go-libp2p-kad-dht"
|
||||||
"github.com/libp2p/go-libp2p/core/host"
|
"github.com/libp2p/go-libp2p/core/host"
|
||||||
"github.com/libp2p/go-libp2p/core/peer"
|
"github.com/libp2p/go-libp2p/core/peer"
|
||||||
|
"github.com/libp2p/go-libp2p/p2p/net/connmgr"
|
||||||
"github.com/libp2p/go-libp2p/p2p/security/noise"
|
"github.com/libp2p/go-libp2p/p2p/security/noise"
|
||||||
"github.com/libp2p/go-libp2p/p2p/transport/tcp"
|
"github.com/libp2p/go-libp2p/p2p/transport/tcp"
|
||||||
kaddht "github.com/libp2p/go-libp2p-kad-dht"
|
|
||||||
"github.com/multiformats/go-multiaddr"
|
"github.com/multiformats/go-multiaddr"
|
||||||
)
|
)
|
||||||
|
|
||||||
// Node represents a Bzzz P2P node
|
// Node represents a CHORUS P2P node
|
||||||
type Node struct {
|
type Node struct {
|
||||||
host host.Host
|
host host.Host
|
||||||
ctx context.Context
|
ctx context.Context
|
||||||
@@ -44,13 +46,26 @@ func NewNode(ctx context.Context, opts ...Option) (*Node, error) {
|
|||||||
listenAddrs = append(listenAddrs, ma)
|
listenAddrs = append(listenAddrs, ma)
|
||||||
}
|
}
|
||||||
|
|
||||||
// Create libp2p host with security and transport options
|
// Create connection manager with scaling-optimized limits
|
||||||
|
connManager, err := connmgr.NewConnManager(
|
||||||
|
config.LowWatermark, // Low watermark (32)
|
||||||
|
config.HighWatermark, // High watermark (128)
|
||||||
|
connmgr.WithGracePeriod(30*time.Second), // Grace period before pruning
|
||||||
|
)
|
||||||
|
if err != nil {
|
||||||
|
cancel()
|
||||||
|
return nil, fmt.Errorf("failed to create connection manager: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Create libp2p host with security, transport, and scaling options
|
||||||
h, err := libp2p.New(
|
h, err := libp2p.New(
|
||||||
libp2p.ListenAddrs(listenAddrs...),
|
libp2p.ListenAddrs(listenAddrs...),
|
||||||
libp2p.Security(noise.ID, noise.New),
|
libp2p.Security(noise.ID, noise.New),
|
||||||
libp2p.Transport(tcp.NewTCPTransport),
|
libp2p.Transport(tcp.NewTCPTransport),
|
||||||
libp2p.DefaultMuxers,
|
libp2p.DefaultMuxers,
|
||||||
libp2p.EnableRelay(),
|
libp2p.EnableRelay(),
|
||||||
|
libp2p.ConnectionManager(connManager), // Add connection management
|
||||||
|
libp2p.EnableAutoRelay(), // Enable AutoRelay for container environments
|
||||||
)
|
)
|
||||||
if err != nil {
|
if err != nil {
|
||||||
cancel()
|
cancel()
|
||||||
@@ -157,7 +172,7 @@ func (n *Node) startBackgroundTasks() {
|
|||||||
// logConnectionStatus logs the current connection status
|
// logConnectionStatus logs the current connection status
|
||||||
func (n *Node) logConnectionStatus() {
|
func (n *Node) logConnectionStatus() {
|
||||||
peers := n.Peers()
|
peers := n.Peers()
|
||||||
fmt.Printf("🐝 Bzzz Node Status - ID: %s, Connected Peers: %d\n",
|
fmt.Printf("CHORUS Node Status - ID: %s, Connected Peers: %d\n",
|
||||||
n.ID().ShortString(), len(peers))
|
n.ID().ShortString(), len(peers))
|
||||||
|
|
||||||
if len(peers) > 0 {
|
if len(peers) > 0 {
|
||||||
|
|||||||
353
pkg/bootstrap/pool_manager.go
Normal file
353
pkg/bootstrap/pool_manager.go
Normal file
@@ -0,0 +1,353 @@
|
|||||||
|
package bootstrap
|
||||||
|
|
||||||
|
import (
|
||||||
|
"context"
|
||||||
|
"encoding/json"
|
||||||
|
"fmt"
|
||||||
|
"io/ioutil"
|
||||||
|
"math/rand"
|
||||||
|
"net/http"
|
||||||
|
"os"
|
||||||
|
"strings"
|
||||||
|
"time"
|
||||||
|
|
||||||
|
"github.com/libp2p/go-libp2p/core/host"
|
||||||
|
"github.com/libp2p/go-libp2p/core/peer"
|
||||||
|
"github.com/multiformats/go-multiaddr"
|
||||||
|
)
|
||||||
|
|
||||||
|
// BootstrapPool manages a pool of bootstrap peers for DHT joining
|
||||||
|
type BootstrapPool struct {
|
||||||
|
peers []peer.AddrInfo
|
||||||
|
dialsPerSecond int
|
||||||
|
maxConcurrent int
|
||||||
|
staggerDelay time.Duration
|
||||||
|
httpClient *http.Client
|
||||||
|
}
|
||||||
|
|
||||||
|
// BootstrapConfig represents the JSON configuration for bootstrap peers
|
||||||
|
type BootstrapConfig struct {
|
||||||
|
Peers []BootstrapPeer `json:"peers"`
|
||||||
|
Meta BootstrapMeta `json:"meta,omitempty"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// BootstrapPeer represents a single bootstrap peer
|
||||||
|
type BootstrapPeer struct {
|
||||||
|
ID string `json:"id"` // Peer ID
|
||||||
|
Addresses []string `json:"addresses"` // Multiaddresses
|
||||||
|
Priority int `json:"priority"` // Priority (higher = more likely to be selected)
|
||||||
|
Healthy bool `json:"healthy"` // Health status
|
||||||
|
LastSeen string `json:"last_seen"` // Last seen timestamp
|
||||||
|
}
|
||||||
|
|
||||||
|
// BootstrapMeta contains metadata about the bootstrap configuration
|
||||||
|
type BootstrapMeta struct {
|
||||||
|
UpdatedAt string `json:"updated_at"`
|
||||||
|
Version int `json:"version"`
|
||||||
|
ClusterID string `json:"cluster_id"`
|
||||||
|
TotalPeers int `json:"total_peers"`
|
||||||
|
HealthyPeers int `json:"healthy_peers"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// BootstrapSubset represents a subset of peers assigned to a replica
|
||||||
|
type BootstrapSubset struct {
|
||||||
|
Peers []peer.AddrInfo `json:"peers"`
|
||||||
|
StaggerDelayMS int `json:"stagger_delay_ms"`
|
||||||
|
AssignedAt time.Time `json:"assigned_at"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// NewBootstrapPool creates a new bootstrap pool manager
|
||||||
|
func NewBootstrapPool(dialsPerSecond, maxConcurrent int, staggerMS int) *BootstrapPool {
|
||||||
|
return &BootstrapPool{
|
||||||
|
peers: []peer.AddrInfo{},
|
||||||
|
dialsPerSecond: dialsPerSecond,
|
||||||
|
maxConcurrent: maxConcurrent,
|
||||||
|
staggerDelay: time.Duration(staggerMS) * time.Millisecond,
|
||||||
|
httpClient: &http.Client{Timeout: 10 * time.Second},
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// LoadFromFile loads bootstrap configuration from a JSON file
|
||||||
|
func (bp *BootstrapPool) LoadFromFile(filePath string) error {
|
||||||
|
if filePath == "" {
|
||||||
|
return nil // No file configured
|
||||||
|
}
|
||||||
|
|
||||||
|
data, err := ioutil.ReadFile(filePath)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to read bootstrap file %s: %w", filePath, err)
|
||||||
|
}
|
||||||
|
|
||||||
|
return bp.loadFromJSON(data)
|
||||||
|
}
|
||||||
|
|
||||||
|
// LoadFromURL loads bootstrap configuration from a URL (WHOOSH endpoint)
|
||||||
|
func (bp *BootstrapPool) LoadFromURL(ctx context.Context, url string) error {
|
||||||
|
if url == "" {
|
||||||
|
return nil // No URL configured
|
||||||
|
}
|
||||||
|
|
||||||
|
req, err := http.NewRequestWithContext(ctx, "GET", url, nil)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to create bootstrap request: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
resp, err := bp.httpClient.Do(req)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("bootstrap request failed: %w", err)
|
||||||
|
}
|
||||||
|
defer resp.Body.Close()
|
||||||
|
|
||||||
|
if resp.StatusCode != http.StatusOK {
|
||||||
|
return fmt.Errorf("bootstrap request failed with status %d", resp.StatusCode)
|
||||||
|
}
|
||||||
|
|
||||||
|
data, err := ioutil.ReadAll(resp.Body)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to read bootstrap response: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
return bp.loadFromJSON(data)
|
||||||
|
}
|
||||||
|
|
||||||
|
// loadFromJSON parses JSON bootstrap configuration
|
||||||
|
func (bp *BootstrapPool) loadFromJSON(data []byte) error {
|
||||||
|
var config BootstrapConfig
|
||||||
|
if err := json.Unmarshal(data, &config); err != nil {
|
||||||
|
return fmt.Errorf("failed to parse bootstrap JSON: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Convert bootstrap peers to AddrInfo
|
||||||
|
var peers []peer.AddrInfo
|
||||||
|
for _, bsPeer := range config.Peers {
|
||||||
|
// Only include healthy peers
|
||||||
|
if !bsPeer.Healthy {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
|
||||||
|
// Parse peer ID
|
||||||
|
peerID, err := peer.Decode(bsPeer.ID)
|
||||||
|
if err != nil {
|
||||||
|
fmt.Printf("⚠️ Invalid peer ID %s: %v\n", bsPeer.ID, err)
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
|
||||||
|
// Parse multiaddresses
|
||||||
|
var addrs []multiaddr.Multiaddr
|
||||||
|
for _, addrStr := range bsPeer.Addresses {
|
||||||
|
addr, err := multiaddr.NewMultiaddr(addrStr)
|
||||||
|
if err != nil {
|
||||||
|
fmt.Printf("⚠️ Invalid multiaddress %s: %v\n", addrStr, err)
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
addrs = append(addrs, addr)
|
||||||
|
}
|
||||||
|
|
||||||
|
if len(addrs) > 0 {
|
||||||
|
peers = append(peers, peer.AddrInfo{
|
||||||
|
ID: peerID,
|
||||||
|
Addrs: addrs,
|
||||||
|
})
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
bp.peers = peers
|
||||||
|
fmt.Printf("📋 Loaded %d healthy bootstrap peers from configuration\n", len(peers))
|
||||||
|
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// LoadFromEnvironment loads bootstrap configuration from environment variables
|
||||||
|
func (bp *BootstrapPool) LoadFromEnvironment() error {
|
||||||
|
// Try loading from file first
|
||||||
|
if bootstrapFile := os.Getenv("BOOTSTRAP_JSON"); bootstrapFile != "" {
|
||||||
|
if err := bp.LoadFromFile(bootstrapFile); err != nil {
|
||||||
|
fmt.Printf("⚠️ Failed to load bootstrap from file: %v\n", err)
|
||||||
|
} else {
|
||||||
|
return nil // Successfully loaded from file
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Try loading from URL
|
||||||
|
if bootstrapURL := os.Getenv("BOOTSTRAP_URL"); bootstrapURL != "" {
|
||||||
|
ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
|
||||||
|
defer cancel()
|
||||||
|
|
||||||
|
if err := bp.LoadFromURL(ctx, bootstrapURL); err != nil {
|
||||||
|
fmt.Printf("⚠️ Failed to load bootstrap from URL: %v\n", err)
|
||||||
|
} else {
|
||||||
|
return nil // Successfully loaded from URL
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Fallback to legacy environment variable
|
||||||
|
if bootstrapPeersEnv := os.Getenv("CHORUS_BOOTSTRAP_PEERS"); bootstrapPeersEnv != "" {
|
||||||
|
return bp.loadFromLegacyEnv(bootstrapPeersEnv)
|
||||||
|
}
|
||||||
|
|
||||||
|
return nil // No bootstrap configuration found
|
||||||
|
}
|
||||||
|
|
||||||
|
// loadFromLegacyEnv loads from comma-separated multiaddress list
|
||||||
|
func (bp *BootstrapPool) loadFromLegacyEnv(peersEnv string) error {
|
||||||
|
peerStrs := strings.Split(peersEnv, ",")
|
||||||
|
var peers []peer.AddrInfo
|
||||||
|
|
||||||
|
for _, peerStr := range peerStrs {
|
||||||
|
peerStr = strings.TrimSpace(peerStr)
|
||||||
|
if peerStr == "" {
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
|
||||||
|
// Parse multiaddress
|
||||||
|
addr, err := multiaddr.NewMultiaddr(peerStr)
|
||||||
|
if err != nil {
|
||||||
|
fmt.Printf("⚠️ Invalid bootstrap peer %s: %v\n", peerStr, err)
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
|
||||||
|
// Extract peer info
|
||||||
|
info, err := peer.AddrInfoFromP2pAddr(addr)
|
||||||
|
if err != nil {
|
||||||
|
fmt.Printf("⚠️ Failed to parse peer info from %s: %v\n", peerStr, err)
|
||||||
|
continue
|
||||||
|
}
|
||||||
|
|
||||||
|
peers = append(peers, *info)
|
||||||
|
}
|
||||||
|
|
||||||
|
bp.peers = peers
|
||||||
|
fmt.Printf("📋 Loaded %d bootstrap peers from legacy environment\n", len(peers))
|
||||||
|
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// GetSubset returns a subset of bootstrap peers for a replica
|
||||||
|
func (bp *BootstrapPool) GetSubset(count int) BootstrapSubset {
|
||||||
|
if len(bp.peers) == 0 {
|
||||||
|
return BootstrapSubset{
|
||||||
|
Peers: []peer.AddrInfo{},
|
||||||
|
StaggerDelayMS: 0,
|
||||||
|
AssignedAt: time.Now(),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Ensure count doesn't exceed available peers
|
||||||
|
if count > len(bp.peers) {
|
||||||
|
count = len(bp.peers)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Randomly select peers from the pool
|
||||||
|
selectedPeers := make([]peer.AddrInfo, 0, count)
|
||||||
|
indices := rand.Perm(len(bp.peers))
|
||||||
|
|
||||||
|
for i := 0; i < count; i++ {
|
||||||
|
selectedPeers = append(selectedPeers, bp.peers[indices[i]])
|
||||||
|
}
|
||||||
|
|
||||||
|
// Generate random stagger delay (0 to configured max)
|
||||||
|
staggerMS := 0
|
||||||
|
if bp.staggerDelay > 0 {
|
||||||
|
staggerMS = rand.Intn(int(bp.staggerDelay.Milliseconds()))
|
||||||
|
}
|
||||||
|
|
||||||
|
return BootstrapSubset{
|
||||||
|
Peers: selectedPeers,
|
||||||
|
StaggerDelayMS: staggerMS,
|
||||||
|
AssignedAt: time.Now(),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// ConnectWithRateLimit connects to bootstrap peers with rate limiting
|
||||||
|
func (bp *BootstrapPool) ConnectWithRateLimit(ctx context.Context, h host.Host, subset BootstrapSubset) error {
|
||||||
|
if len(subset.Peers) == 0 {
|
||||||
|
return nil // No peers to connect to
|
||||||
|
}
|
||||||
|
|
||||||
|
// Apply stagger delay
|
||||||
|
if subset.StaggerDelayMS > 0 {
|
||||||
|
delay := time.Duration(subset.StaggerDelayMS) * time.Millisecond
|
||||||
|
fmt.Printf("⏱️ Applying join stagger delay: %v\n", delay)
|
||||||
|
|
||||||
|
select {
|
||||||
|
case <-ctx.Done():
|
||||||
|
return ctx.Err()
|
||||||
|
case <-time.After(delay):
|
||||||
|
// Continue after delay
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Create rate limiter for dials
|
||||||
|
ticker := time.NewTicker(time.Second / time.Duration(bp.dialsPerSecond))
|
||||||
|
defer ticker.Stop()
|
||||||
|
|
||||||
|
// Semaphore for concurrent dials
|
||||||
|
semaphore := make(chan struct{}, bp.maxConcurrent)
|
||||||
|
|
||||||
|
// Connect to each peer with rate limiting
|
||||||
|
for i, peerInfo := range subset.Peers {
|
||||||
|
// Wait for rate limiter
|
||||||
|
select {
|
||||||
|
case <-ctx.Done():
|
||||||
|
return ctx.Err()
|
||||||
|
case <-ticker.C:
|
||||||
|
// Rate limit satisfied
|
||||||
|
}
|
||||||
|
|
||||||
|
// Acquire semaphore
|
||||||
|
select {
|
||||||
|
case <-ctx.Done():
|
||||||
|
return ctx.Err()
|
||||||
|
case semaphore <- struct{}{}:
|
||||||
|
// Semaphore acquired
|
||||||
|
}
|
||||||
|
|
||||||
|
// Connect to peer in goroutine
|
||||||
|
go func(info peer.AddrInfo, index int) {
|
||||||
|
defer func() { <-semaphore }() // Release semaphore
|
||||||
|
|
||||||
|
ctx, cancel := context.WithTimeout(ctx, 30*time.Second)
|
||||||
|
defer cancel()
|
||||||
|
|
||||||
|
if err := h.Connect(ctx, info); err != nil {
|
||||||
|
fmt.Printf("⚠️ Failed to connect to bootstrap peer %s (%d/%d): %v\n",
|
||||||
|
info.ID.ShortString(), index+1, len(subset.Peers), err)
|
||||||
|
} else {
|
||||||
|
fmt.Printf("🔗 Connected to bootstrap peer %s (%d/%d)\n",
|
||||||
|
info.ID.ShortString(), index+1, len(subset.Peers))
|
||||||
|
}
|
||||||
|
}(peerInfo, i)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Wait for all connections to complete or timeout
|
||||||
|
for i := 0; i < bp.maxConcurrent && i < len(subset.Peers); i++ {
|
||||||
|
select {
|
||||||
|
case <-ctx.Done():
|
||||||
|
return ctx.Err()
|
||||||
|
case semaphore <- struct{}{}:
|
||||||
|
<-semaphore // Immediately release
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// GetPeerCount returns the number of available bootstrap peers
|
||||||
|
func (bp *BootstrapPool) GetPeerCount() int {
|
||||||
|
return len(bp.peers)
|
||||||
|
}
|
||||||
|
|
||||||
|
// GetPeers returns all bootstrap peers (for debugging)
|
||||||
|
func (bp *BootstrapPool) GetPeers() []peer.AddrInfo {
|
||||||
|
return bp.peers
|
||||||
|
}
|
||||||
|
|
||||||
|
// GetStats returns bootstrap pool statistics
|
||||||
|
func (bp *BootstrapPool) GetStats() map[string]interface{} {
|
||||||
|
return map[string]interface{}{
|
||||||
|
"peer_count": len(bp.peers),
|
||||||
|
"dials_per_second": bp.dialsPerSecond,
|
||||||
|
"max_concurrent": bp.maxConcurrent,
|
||||||
|
"stagger_delay_ms": bp.staggerDelay.Milliseconds(),
|
||||||
|
}
|
||||||
|
}
|
||||||
517
pkg/config/assignment.go
Normal file
517
pkg/config/assignment.go
Normal file
@@ -0,0 +1,517 @@
|
|||||||
|
package config
|
||||||
|
|
||||||
|
import (
|
||||||
|
"context"
|
||||||
|
"encoding/json"
|
||||||
|
"fmt"
|
||||||
|
"io"
|
||||||
|
"net/http"
|
||||||
|
"os"
|
||||||
|
"os/signal"
|
||||||
|
"strings"
|
||||||
|
"sync"
|
||||||
|
"syscall"
|
||||||
|
"time"
|
||||||
|
)
|
||||||
|
|
||||||
|
// RuntimeConfig manages runtime configuration with assignment overrides
|
||||||
|
type RuntimeConfig struct {
|
||||||
|
Base *Config `json:"base"`
|
||||||
|
Override *AssignmentConfig `json:"override"`
|
||||||
|
mu sync.RWMutex
|
||||||
|
reloadCh chan struct{}
|
||||||
|
}
|
||||||
|
|
||||||
|
// AssignmentConfig represents runtime assignment from WHOOSH
|
||||||
|
type AssignmentConfig struct {
|
||||||
|
// Assignment metadata
|
||||||
|
AssignmentID string `json:"assignment_id"`
|
||||||
|
TaskSlot string `json:"task_slot"`
|
||||||
|
TaskID string `json:"task_id"`
|
||||||
|
ClusterID string `json:"cluster_id"`
|
||||||
|
AssignedAt time.Time `json:"assigned_at"`
|
||||||
|
ExpiresAt time.Time `json:"expires_at,omitempty"`
|
||||||
|
|
||||||
|
// Agent configuration overrides
|
||||||
|
Agent *AgentConfig `json:"agent,omitempty"`
|
||||||
|
Network *NetworkConfig `json:"network,omitempty"`
|
||||||
|
AI *AIConfig `json:"ai,omitempty"`
|
||||||
|
Logging *LoggingConfig `json:"logging,omitempty"`
|
||||||
|
|
||||||
|
// Bootstrap configuration for scaling
|
||||||
|
BootstrapPeers []string `json:"bootstrap_peers,omitempty"`
|
||||||
|
JoinStagger int `json:"join_stagger_ms,omitempty"`
|
||||||
|
|
||||||
|
// Runtime capabilities
|
||||||
|
RuntimeCapabilities []string `json:"runtime_capabilities,omitempty"`
|
||||||
|
|
||||||
|
// Key derivation for encryption
|
||||||
|
RoleKey string `json:"role_key,omitempty"`
|
||||||
|
ClusterSecret string `json:"cluster_secret,omitempty"`
|
||||||
|
|
||||||
|
// Custom fields
|
||||||
|
Custom map[string]interface{} `json:"custom,omitempty"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// AssignmentRequest represents a request for assignment from WHOOSH
|
||||||
|
type AssignmentRequest struct {
|
||||||
|
ClusterID string `json:"cluster_id"`
|
||||||
|
TaskSlot string `json:"task_slot,omitempty"`
|
||||||
|
TaskID string `json:"task_id,omitempty"`
|
||||||
|
AgentID string `json:"agent_id"`
|
||||||
|
NodeID string `json:"node_id"`
|
||||||
|
Timestamp time.Time `json:"timestamp"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// NewRuntimeConfig creates a new runtime configuration manager
|
||||||
|
func NewRuntimeConfig(baseConfig *Config) *RuntimeConfig {
|
||||||
|
return &RuntimeConfig{
|
||||||
|
Base: baseConfig,
|
||||||
|
Override: nil,
|
||||||
|
reloadCh: make(chan struct{}, 1),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Get returns the effective configuration value, with override taking precedence
|
||||||
|
func (rc *RuntimeConfig) Get(field string) interface{} {
|
||||||
|
rc.mu.RLock()
|
||||||
|
defer rc.mu.RUnlock()
|
||||||
|
|
||||||
|
// Try override first
|
||||||
|
if rc.Override != nil {
|
||||||
|
if value := rc.getFromAssignment(field); value != nil {
|
||||||
|
return value
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Fall back to base configuration
|
||||||
|
return rc.getFromBase(field)
|
||||||
|
}
|
||||||
|
|
||||||
|
// GetConfig returns a merged configuration with overrides applied
|
||||||
|
func (rc *RuntimeConfig) GetConfig() *Config {
|
||||||
|
rc.mu.RLock()
|
||||||
|
defer rc.mu.RUnlock()
|
||||||
|
|
||||||
|
if rc.Override == nil {
|
||||||
|
return rc.Base
|
||||||
|
}
|
||||||
|
|
||||||
|
// Create a copy of base config
|
||||||
|
merged := *rc.Base
|
||||||
|
|
||||||
|
// Apply overrides
|
||||||
|
if rc.Override.Agent != nil {
|
||||||
|
rc.mergeAgentConfig(&merged.Agent, rc.Override.Agent)
|
||||||
|
}
|
||||||
|
if rc.Override.Network != nil {
|
||||||
|
rc.mergeNetworkConfig(&merged.Network, rc.Override.Network)
|
||||||
|
}
|
||||||
|
if rc.Override.AI != nil {
|
||||||
|
rc.mergeAIConfig(&merged.AI, rc.Override.AI)
|
||||||
|
}
|
||||||
|
if rc.Override.Logging != nil {
|
||||||
|
rc.mergeLoggingConfig(&merged.Logging, rc.Override.Logging)
|
||||||
|
}
|
||||||
|
|
||||||
|
return &merged
|
||||||
|
}
|
||||||
|
|
||||||
|
// LoadAssignment fetches assignment from WHOOSH and applies it
|
||||||
|
func (rc *RuntimeConfig) LoadAssignment(ctx context.Context, assignURL string) error {
|
||||||
|
if assignURL == "" {
|
||||||
|
return nil // No assignment URL configured
|
||||||
|
}
|
||||||
|
|
||||||
|
// Build assignment request
|
||||||
|
agentID := rc.Base.Agent.ID
|
||||||
|
if agentID == "" {
|
||||||
|
agentID = "unknown"
|
||||||
|
}
|
||||||
|
|
||||||
|
req := AssignmentRequest{
|
||||||
|
ClusterID: rc.Base.License.ClusterID,
|
||||||
|
TaskSlot: os.Getenv("TASK_SLOT"),
|
||||||
|
TaskID: os.Getenv("TASK_ID"),
|
||||||
|
AgentID: agentID,
|
||||||
|
NodeID: os.Getenv("NODE_ID"),
|
||||||
|
Timestamp: time.Now(),
|
||||||
|
}
|
||||||
|
|
||||||
|
// Make HTTP request to WHOOSH
|
||||||
|
assignment, err := rc.fetchAssignment(ctx, assignURL, req)
|
||||||
|
if err != nil {
|
||||||
|
return fmt.Errorf("failed to fetch assignment: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Apply assignment
|
||||||
|
rc.mu.Lock()
|
||||||
|
rc.Override = assignment
|
||||||
|
rc.mu.Unlock()
|
||||||
|
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// StartReloadHandler starts a signal handler for SIGHUP configuration reloads
|
||||||
|
func (rc *RuntimeConfig) StartReloadHandler(ctx context.Context, assignURL string) {
|
||||||
|
sigCh := make(chan os.Signal, 1)
|
||||||
|
signal.Notify(sigCh, syscall.SIGHUP)
|
||||||
|
|
||||||
|
go func() {
|
||||||
|
for {
|
||||||
|
select {
|
||||||
|
case <-ctx.Done():
|
||||||
|
return
|
||||||
|
case <-sigCh:
|
||||||
|
fmt.Println("📡 Received SIGHUP, reloading assignment configuration...")
|
||||||
|
if err := rc.LoadAssignment(ctx, assignURL); err != nil {
|
||||||
|
fmt.Printf("❌ Failed to reload assignment: %v\n", err)
|
||||||
|
} else {
|
||||||
|
fmt.Println("✅ Assignment configuration reloaded successfully")
|
||||||
|
}
|
||||||
|
case <-rc.reloadCh:
|
||||||
|
// Manual reload trigger
|
||||||
|
if err := rc.LoadAssignment(ctx, assignURL); err != nil {
|
||||||
|
fmt.Printf("❌ Failed to reload assignment: %v\n", err)
|
||||||
|
} else {
|
||||||
|
fmt.Println("✅ Assignment configuration reloaded successfully")
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}()
|
||||||
|
}
|
||||||
|
|
||||||
|
// Reload triggers a manual configuration reload
|
||||||
|
func (rc *RuntimeConfig) Reload() {
|
||||||
|
select {
|
||||||
|
case rc.reloadCh <- struct{}{}:
|
||||||
|
default:
|
||||||
|
// Channel full, reload already pending
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// fetchAssignment makes HTTP request to WHOOSH assignment API
|
||||||
|
func (rc *RuntimeConfig) fetchAssignment(ctx context.Context, assignURL string, req AssignmentRequest) (*AssignmentConfig, error) {
|
||||||
|
// Build query parameters
|
||||||
|
queryParams := fmt.Sprintf("?cluster_id=%s&agent_id=%s&node_id=%s",
|
||||||
|
req.ClusterID, req.AgentID, req.NodeID)
|
||||||
|
|
||||||
|
if req.TaskSlot != "" {
|
||||||
|
queryParams += "&task_slot=" + req.TaskSlot
|
||||||
|
}
|
||||||
|
if req.TaskID != "" {
|
||||||
|
queryParams += "&task_id=" + req.TaskID
|
||||||
|
}
|
||||||
|
|
||||||
|
// Create HTTP request
|
||||||
|
httpReq, err := http.NewRequestWithContext(ctx, "GET", assignURL+queryParams, nil)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to create assignment request: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
httpReq.Header.Set("Accept", "application/json")
|
||||||
|
httpReq.Header.Set("User-Agent", "CHORUS-Agent/0.1.0")
|
||||||
|
|
||||||
|
// Make request with timeout
|
||||||
|
client := &http.Client{Timeout: 10 * time.Second}
|
||||||
|
resp, err := client.Do(httpReq)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("assignment request failed: %w", err)
|
||||||
|
}
|
||||||
|
defer resp.Body.Close()
|
||||||
|
|
||||||
|
if resp.StatusCode == http.StatusNotFound {
|
||||||
|
// No assignment available
|
||||||
|
return nil, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
if resp.StatusCode != http.StatusOK {
|
||||||
|
body, _ := io.ReadAll(resp.Body)
|
||||||
|
return nil, fmt.Errorf("assignment request failed with status %d: %s", resp.StatusCode, string(body))
|
||||||
|
}
|
||||||
|
|
||||||
|
// Parse assignment response
|
||||||
|
var assignment AssignmentConfig
|
||||||
|
if err := json.NewDecoder(resp.Body).Decode(&assignment); err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to decode assignment response: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
return &assignment, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// Helper methods for getting values from different sources
|
||||||
|
func (rc *RuntimeConfig) getFromAssignment(field string) interface{} {
|
||||||
|
if rc.Override == nil {
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// Simple field mapping - in a real implementation, you'd use reflection
|
||||||
|
// or a more sophisticated field mapping system
|
||||||
|
switch field {
|
||||||
|
case "agent.id":
|
||||||
|
if rc.Override.Agent != nil && rc.Override.Agent.ID != "" {
|
||||||
|
return rc.Override.Agent.ID
|
||||||
|
}
|
||||||
|
case "agent.role":
|
||||||
|
if rc.Override.Agent != nil && rc.Override.Agent.Role != "" {
|
||||||
|
return rc.Override.Agent.Role
|
||||||
|
}
|
||||||
|
case "agent.capabilities":
|
||||||
|
if len(rc.Override.RuntimeCapabilities) > 0 {
|
||||||
|
return rc.Override.RuntimeCapabilities
|
||||||
|
}
|
||||||
|
case "bootstrap_peers":
|
||||||
|
if len(rc.Override.BootstrapPeers) > 0 {
|
||||||
|
return rc.Override.BootstrapPeers
|
||||||
|
}
|
||||||
|
case "join_stagger":
|
||||||
|
if rc.Override.JoinStagger > 0 {
|
||||||
|
return rc.Override.JoinStagger
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Check custom fields
|
||||||
|
if rc.Override.Custom != nil {
|
||||||
|
if val, exists := rc.Override.Custom[field]; exists {
|
||||||
|
return val
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
func (rc *RuntimeConfig) getFromBase(field string) interface{} {
|
||||||
|
// Simple field mapping for base config
|
||||||
|
switch field {
|
||||||
|
case "agent.id":
|
||||||
|
return rc.Base.Agent.ID
|
||||||
|
case "agent.role":
|
||||||
|
return rc.Base.Agent.Role
|
||||||
|
case "agent.capabilities":
|
||||||
|
return rc.Base.Agent.Capabilities
|
||||||
|
default:
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Helper methods for merging configuration sections
|
||||||
|
func (rc *RuntimeConfig) mergeAgentConfig(base *AgentConfig, override *AgentConfig) {
|
||||||
|
if override.ID != "" {
|
||||||
|
base.ID = override.ID
|
||||||
|
}
|
||||||
|
if override.Specialization != "" {
|
||||||
|
base.Specialization = override.Specialization
|
||||||
|
}
|
||||||
|
if override.MaxTasks > 0 {
|
||||||
|
base.MaxTasks = override.MaxTasks
|
||||||
|
}
|
||||||
|
if len(override.Capabilities) > 0 {
|
||||||
|
base.Capabilities = override.Capabilities
|
||||||
|
}
|
||||||
|
if len(override.Models) > 0 {
|
||||||
|
base.Models = override.Models
|
||||||
|
}
|
||||||
|
if override.Role != "" {
|
||||||
|
base.Role = override.Role
|
||||||
|
}
|
||||||
|
if override.Project != "" {
|
||||||
|
base.Project = override.Project
|
||||||
|
}
|
||||||
|
if len(override.Expertise) > 0 {
|
||||||
|
base.Expertise = override.Expertise
|
||||||
|
}
|
||||||
|
if override.ReportsTo != "" {
|
||||||
|
base.ReportsTo = override.ReportsTo
|
||||||
|
}
|
||||||
|
if len(override.Deliverables) > 0 {
|
||||||
|
base.Deliverables = override.Deliverables
|
||||||
|
}
|
||||||
|
if override.ModelSelectionWebhook != "" {
|
||||||
|
base.ModelSelectionWebhook = override.ModelSelectionWebhook
|
||||||
|
}
|
||||||
|
if override.DefaultReasoningModel != "" {
|
||||||
|
base.DefaultReasoningModel = override.DefaultReasoningModel
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func (rc *RuntimeConfig) mergeNetworkConfig(base *NetworkConfig, override *NetworkConfig) {
|
||||||
|
if override.P2PPort > 0 {
|
||||||
|
base.P2PPort = override.P2PPort
|
||||||
|
}
|
||||||
|
if override.APIPort > 0 {
|
||||||
|
base.APIPort = override.APIPort
|
||||||
|
}
|
||||||
|
if override.HealthPort > 0 {
|
||||||
|
base.HealthPort = override.HealthPort
|
||||||
|
}
|
||||||
|
if override.BindAddr != "" {
|
||||||
|
base.BindAddr = override.BindAddr
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func (rc *RuntimeConfig) mergeAIConfig(base *AIConfig, override *AIConfig) {
|
||||||
|
if override.Provider != "" {
|
||||||
|
base.Provider = override.Provider
|
||||||
|
}
|
||||||
|
// Merge Ollama config if present
|
||||||
|
if override.Ollama.Endpoint != "" {
|
||||||
|
base.Ollama.Endpoint = override.Ollama.Endpoint
|
||||||
|
}
|
||||||
|
if override.Ollama.Timeout > 0 {
|
||||||
|
base.Ollama.Timeout = override.Ollama.Timeout
|
||||||
|
}
|
||||||
|
// Merge ResetData config if present
|
||||||
|
if override.ResetData.BaseURL != "" {
|
||||||
|
base.ResetData.BaseURL = override.ResetData.BaseURL
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func (rc *RuntimeConfig) mergeLoggingConfig(base *LoggingConfig, override *LoggingConfig) {
|
||||||
|
if override.Level != "" {
|
||||||
|
base.Level = override.Level
|
||||||
|
}
|
||||||
|
if override.Format != "" {
|
||||||
|
base.Format = override.Format
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// BootstrapConfig represents JSON bootstrap configuration
|
||||||
|
type BootstrapConfig struct {
|
||||||
|
Peers []BootstrapPeer `json:"peers"`
|
||||||
|
Metadata BootstrapMeta `json:"metadata,omitempty"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// BootstrapPeer represents a single bootstrap peer
|
||||||
|
type BootstrapPeer struct {
|
||||||
|
Address string `json:"address"`
|
||||||
|
Priority int `json:"priority,omitempty"`
|
||||||
|
Region string `json:"region,omitempty"`
|
||||||
|
Roles []string `json:"roles,omitempty"`
|
||||||
|
Enabled bool `json:"enabled"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// BootstrapMeta contains metadata about the bootstrap configuration
|
||||||
|
type BootstrapMeta struct {
|
||||||
|
GeneratedAt time.Time `json:"generated_at,omitempty"`
|
||||||
|
ClusterID string `json:"cluster_id,omitempty"`
|
||||||
|
Version string `json:"version,omitempty"`
|
||||||
|
Notes string `json:"notes,omitempty"`
|
||||||
|
}
|
||||||
|
|
||||||
|
// GetBootstrapPeers returns bootstrap peers with assignment override support and JSON config
|
||||||
|
func (rc *RuntimeConfig) GetBootstrapPeers() []string {
|
||||||
|
rc.mu.RLock()
|
||||||
|
defer rc.mu.RUnlock()
|
||||||
|
|
||||||
|
// First priority: Assignment override from WHOOSH
|
||||||
|
if rc.Override != nil && len(rc.Override.BootstrapPeers) > 0 {
|
||||||
|
return rc.Override.BootstrapPeers
|
||||||
|
}
|
||||||
|
|
||||||
|
// Second priority: JSON bootstrap configuration
|
||||||
|
if jsonPeers := rc.loadBootstrapJSON(); len(jsonPeers) > 0 {
|
||||||
|
return jsonPeers
|
||||||
|
}
|
||||||
|
|
||||||
|
// Third priority: Environment variable (CSV format)
|
||||||
|
if bootstrapEnv := os.Getenv("CHORUS_BOOTSTRAP_PEERS"); bootstrapEnv != "" {
|
||||||
|
peers := strings.Split(bootstrapEnv, ",")
|
||||||
|
// Trim whitespace from each peer
|
||||||
|
for i, peer := range peers {
|
||||||
|
peers[i] = strings.TrimSpace(peer)
|
||||||
|
}
|
||||||
|
return peers
|
||||||
|
}
|
||||||
|
|
||||||
|
return []string{}
|
||||||
|
}
|
||||||
|
|
||||||
|
// loadBootstrapJSON loads bootstrap peers from JSON file
|
||||||
|
func (rc *RuntimeConfig) loadBootstrapJSON() []string {
|
||||||
|
jsonPath := os.Getenv("BOOTSTRAP_JSON")
|
||||||
|
if jsonPath == "" {
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// Check if file exists
|
||||||
|
if _, err := os.Stat(jsonPath); os.IsNotExist(err) {
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// Read and parse JSON file
|
||||||
|
data, err := os.ReadFile(jsonPath)
|
||||||
|
if err != nil {
|
||||||
|
fmt.Printf("⚠️ Failed to read bootstrap JSON file %s: %v\n", jsonPath, err)
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
var config BootstrapConfig
|
||||||
|
if err := json.Unmarshal(data, &config); err != nil {
|
||||||
|
fmt.Printf("⚠️ Failed to parse bootstrap JSON file %s: %v\n", jsonPath, err)
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// Extract enabled peer addresses, sorted by priority
|
||||||
|
var peers []string
|
||||||
|
enabledPeers := make([]BootstrapPeer, 0, len(config.Peers))
|
||||||
|
|
||||||
|
// Filter enabled peers
|
||||||
|
for _, peer := range config.Peers {
|
||||||
|
if peer.Enabled && peer.Address != "" {
|
||||||
|
enabledPeers = append(enabledPeers, peer)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Sort by priority (higher priority first)
|
||||||
|
for i := 0; i < len(enabledPeers)-1; i++ {
|
||||||
|
for j := i + 1; j < len(enabledPeers); j++ {
|
||||||
|
if enabledPeers[j].Priority > enabledPeers[i].Priority {
|
||||||
|
enabledPeers[i], enabledPeers[j] = enabledPeers[j], enabledPeers[i]
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Extract addresses
|
||||||
|
for _, peer := range enabledPeers {
|
||||||
|
peers = append(peers, peer.Address)
|
||||||
|
}
|
||||||
|
|
||||||
|
if len(peers) > 0 {
|
||||||
|
fmt.Printf("📋 Loaded %d bootstrap peers from JSON: %s\n", len(peers), jsonPath)
|
||||||
|
}
|
||||||
|
|
||||||
|
return peers
|
||||||
|
}
|
||||||
|
|
||||||
|
// GetJoinStagger returns join stagger delay with assignment override support
|
||||||
|
func (rc *RuntimeConfig) GetJoinStagger() time.Duration {
|
||||||
|
rc.mu.RLock()
|
||||||
|
defer rc.mu.RUnlock()
|
||||||
|
|
||||||
|
if rc.Override != nil && rc.Override.JoinStagger > 0 {
|
||||||
|
return time.Duration(rc.Override.JoinStagger) * time.Millisecond
|
||||||
|
}
|
||||||
|
|
||||||
|
// Fall back to environment variable
|
||||||
|
if staggerEnv := os.Getenv("CHORUS_JOIN_STAGGER_MS"); staggerEnv != "" {
|
||||||
|
if ms, err := time.ParseDuration(staggerEnv + "ms"); err == nil {
|
||||||
|
return ms
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
return 0
|
||||||
|
}
|
||||||
|
|
||||||
|
// GetAssignmentInfo returns current assignment metadata
|
||||||
|
func (rc *RuntimeConfig) GetAssignmentInfo() *AssignmentConfig {
|
||||||
|
rc.mu.RLock()
|
||||||
|
defer rc.mu.RUnlock()
|
||||||
|
|
||||||
|
if rc.Override == nil {
|
||||||
|
return nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// Return a copy to prevent external modification
|
||||||
|
assignment := *rc.Override
|
||||||
|
return &assignment
|
||||||
|
}
|
||||||
@@ -100,6 +100,7 @@ type V2Config struct {
|
|||||||
type DHTConfig struct {
|
type DHTConfig struct {
|
||||||
Enabled bool `yaml:"enabled"`
|
Enabled bool `yaml:"enabled"`
|
||||||
BootstrapPeers []string `yaml:"bootstrap_peers"`
|
BootstrapPeers []string `yaml:"bootstrap_peers"`
|
||||||
|
MDNSEnabled bool `yaml:"mdns_enabled"`
|
||||||
}
|
}
|
||||||
|
|
||||||
// UCXLConfig defines UCXL protocol settings
|
// UCXLConfig defines UCXL protocol settings
|
||||||
@@ -192,6 +193,7 @@ func LoadFromEnvironment() (*Config, error) {
|
|||||||
DHT: DHTConfig{
|
DHT: DHTConfig{
|
||||||
Enabled: getEnvBoolOrDefault("CHORUS_DHT_ENABLED", true),
|
Enabled: getEnvBoolOrDefault("CHORUS_DHT_ENABLED", true),
|
||||||
BootstrapPeers: getEnvArrayOrDefault("CHORUS_BOOTSTRAP_PEERS", []string{}),
|
BootstrapPeers: getEnvArrayOrDefault("CHORUS_BOOTSTRAP_PEERS", []string{}),
|
||||||
|
MDNSEnabled: getEnvBoolOrDefault("CHORUS_MDNS_ENABLED", true),
|
||||||
},
|
},
|
||||||
},
|
},
|
||||||
UCXL: UCXLConfig{
|
UCXL: UCXLConfig{
|
||||||
@@ -216,7 +218,7 @@ func LoadFromEnvironment() (*Config, error) {
|
|||||||
AuditLogging: getEnvBoolOrDefault("CHORUS_AUDIT_LOGGING", true),
|
AuditLogging: getEnvBoolOrDefault("CHORUS_AUDIT_LOGGING", true),
|
||||||
AuditPath: getEnvOrDefault("CHORUS_AUDIT_PATH", "/tmp/chorus-audit.log"),
|
AuditPath: getEnvOrDefault("CHORUS_AUDIT_PATH", "/tmp/chorus-audit.log"),
|
||||||
ElectionConfig: ElectionConfig{
|
ElectionConfig: ElectionConfig{
|
||||||
DiscoveryTimeout: getEnvDurationOrDefault("CHORUS_DISCOVERY_TIMEOUT", 10*time.Second),
|
DiscoveryTimeout: getEnvDurationOrDefault("CHORUS_DISCOVERY_TIMEOUT", 15*time.Second),
|
||||||
HeartbeatTimeout: getEnvDurationOrDefault("CHORUS_HEARTBEAT_TIMEOUT", 30*time.Second),
|
HeartbeatTimeout: getEnvDurationOrDefault("CHORUS_HEARTBEAT_TIMEOUT", 30*time.Second),
|
||||||
ElectionTimeout: getEnvDurationOrDefault("CHORUS_ELECTION_TIMEOUT", 60*time.Second),
|
ElectionTimeout: getEnvDurationOrDefault("CHORUS_ELECTION_TIMEOUT", 60*time.Second),
|
||||||
DiscoveryBackoff: getEnvDurationOrDefault("CHORUS_DISCOVERY_BACKOFF", 5*time.Second),
|
DiscoveryBackoff: getEnvDurationOrDefault("CHORUS_DISCOVERY_BACKOFF", 5*time.Second),
|
||||||
|
|||||||
@@ -41,10 +41,16 @@ type HybridUCXLConfig struct {
|
|||||||
}
|
}
|
||||||
|
|
||||||
type DiscoveryConfig struct {
|
type DiscoveryConfig struct {
|
||||||
MDNSEnabled bool `env:"CHORUS_MDNS_ENABLED" default:"true" json:"mdns_enabled" yaml:"mdns_enabled"`
|
MDNSEnabled bool `env:"CHORUS_MDNS_ENABLED" default:"true" json:"mdns_enabled" yaml:"mdns_enabled"`
|
||||||
DHTDiscovery bool `env:"CHORUS_DHT_DISCOVERY" default:"false" json:"dht_discovery" yaml:"dht_discovery"`
|
DHTDiscovery bool `env:"CHORUS_DHT_DISCOVERY" default:"false" json:"dht_discovery" yaml:"dht_discovery"`
|
||||||
AnnounceInterval time.Duration `env:"CHORUS_ANNOUNCE_INTERVAL" default:"30s" json:"announce_interval" yaml:"announce_interval"`
|
AnnounceInterval time.Duration `env:"CHORUS_ANNOUNCE_INTERVAL" default:"30s" json:"announce_interval" yaml:"announce_interval"`
|
||||||
ServiceName string `env:"CHORUS_SERVICE_NAME" default:"CHORUS" json:"service_name" yaml:"service_name"`
|
ServiceName string `env:"CHORUS_SERVICE_NAME" default:"CHORUS" json:"service_name" yaml:"service_name"`
|
||||||
|
|
||||||
|
// Rate limiting for scaling (as per WHOOSH issue #7)
|
||||||
|
DialsPerSecond int `env:"CHORUS_DIALS_PER_SEC" default:"5" json:"dials_per_second" yaml:"dials_per_second"`
|
||||||
|
MaxConcurrentDHT int `env:"CHORUS_MAX_CONCURRENT_DHT" default:"16" json:"max_concurrent_dht" yaml:"max_concurrent_dht"`
|
||||||
|
MaxConcurrentDials int `env:"CHORUS_MAX_CONCURRENT_DIALS" default:"10" json:"max_concurrent_dials" yaml:"max_concurrent_dials"`
|
||||||
|
JoinStaggerMS int `env:"CHORUS_JOIN_STAGGER_MS" default:"0" json:"join_stagger_ms" yaml:"join_stagger_ms"`
|
||||||
}
|
}
|
||||||
|
|
||||||
type MonitoringConfig struct {
|
type MonitoringConfig struct {
|
||||||
@@ -79,10 +85,16 @@ func LoadHybridConfig() (*HybridConfig, error) {
|
|||||||
|
|
||||||
// Load Discovery configuration
|
// Load Discovery configuration
|
||||||
config.Discovery = DiscoveryConfig{
|
config.Discovery = DiscoveryConfig{
|
||||||
MDNSEnabled: getEnvBool("CHORUS_MDNS_ENABLED", true),
|
MDNSEnabled: getEnvBool("CHORUS_MDNS_ENABLED", true),
|
||||||
DHTDiscovery: getEnvBool("CHORUS_DHT_DISCOVERY", false),
|
DHTDiscovery: getEnvBool("CHORUS_DHT_DISCOVERY", false),
|
||||||
AnnounceInterval: getEnvDuration("CHORUS_ANNOUNCE_INTERVAL", 30*time.Second),
|
AnnounceInterval: getEnvDuration("CHORUS_ANNOUNCE_INTERVAL", 30*time.Second),
|
||||||
ServiceName: getEnvString("CHORUS_SERVICE_NAME", "CHORUS"),
|
ServiceName: getEnvString("CHORUS_SERVICE_NAME", "CHORUS"),
|
||||||
|
|
||||||
|
// Rate limiting for scaling (as per WHOOSH issue #7)
|
||||||
|
DialsPerSecond: getEnvInt("CHORUS_DIALS_PER_SEC", 5),
|
||||||
|
MaxConcurrentDHT: getEnvInt("CHORUS_MAX_CONCURRENT_DHT", 16),
|
||||||
|
MaxConcurrentDials: getEnvInt("CHORUS_MAX_CONCURRENT_DIALS", 10),
|
||||||
|
JoinStaggerMS: getEnvInt("CHORUS_JOIN_STAGGER_MS", 0),
|
||||||
}
|
}
|
||||||
|
|
||||||
// Load Monitoring configuration
|
// Load Monitoring configuration
|
||||||
|
|||||||
306
pkg/crypto/key_derivation.go
Normal file
306
pkg/crypto/key_derivation.go
Normal file
@@ -0,0 +1,306 @@
|
|||||||
|
package crypto
|
||||||
|
|
||||||
|
import (
|
||||||
|
"crypto/sha256"
|
||||||
|
"fmt"
|
||||||
|
"io"
|
||||||
|
|
||||||
|
"golang.org/x/crypto/hkdf"
|
||||||
|
"filippo.io/age"
|
||||||
|
"filippo.io/age/armor"
|
||||||
|
)
|
||||||
|
|
||||||
|
// KeyDerivationManager handles cluster-scoped key derivation for DHT encryption
|
||||||
|
type KeyDerivationManager struct {
|
||||||
|
clusterRootKey []byte
|
||||||
|
clusterID string
|
||||||
|
}
|
||||||
|
|
||||||
|
// DerivedKeySet contains keys derived for a specific role/scope
|
||||||
|
type DerivedKeySet struct {
|
||||||
|
RoleKey []byte // Role-specific key
|
||||||
|
NodeKey []byte // Node-specific key for this instance
|
||||||
|
AGEIdentity *age.X25519Identity // AGE identity for encryption/decryption
|
||||||
|
AGERecipient *age.X25519Recipient // AGE recipient for encryption
|
||||||
|
}
|
||||||
|
|
||||||
|
// NewKeyDerivationManager creates a new key derivation manager
|
||||||
|
func NewKeyDerivationManager(clusterRootKey []byte, clusterID string) *KeyDerivationManager {
|
||||||
|
return &KeyDerivationManager{
|
||||||
|
clusterRootKey: clusterRootKey,
|
||||||
|
clusterID: clusterID,
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// NewKeyDerivationManagerFromSeed creates a manager from a seed string
|
||||||
|
func NewKeyDerivationManagerFromSeed(seed, clusterID string) *KeyDerivationManager {
|
||||||
|
// Use HKDF to derive a consistent root key from seed
|
||||||
|
hash := sha256.New
|
||||||
|
hkdf := hkdf.New(hash, []byte(seed), []byte(clusterID), []byte("CHORUS-cluster-root"))
|
||||||
|
|
||||||
|
rootKey := make([]byte, 32)
|
||||||
|
if _, err := io.ReadFull(hkdf, rootKey); err != nil {
|
||||||
|
panic(fmt.Errorf("failed to derive cluster root key: %w", err))
|
||||||
|
}
|
||||||
|
|
||||||
|
return &KeyDerivationManager{
|
||||||
|
clusterRootKey: rootKey,
|
||||||
|
clusterID: clusterID,
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// DeriveRoleKeys derives encryption keys for a specific role and agent
|
||||||
|
func (kdm *KeyDerivationManager) DeriveRoleKeys(role, agentID string) (*DerivedKeySet, error) {
|
||||||
|
if kdm.clusterRootKey == nil {
|
||||||
|
return nil, fmt.Errorf("cluster root key not initialized")
|
||||||
|
}
|
||||||
|
|
||||||
|
// Derive role-specific key
|
||||||
|
roleKey, err := kdm.deriveKey(fmt.Sprintf("role-%s", role), 32)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to derive role key: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Derive node-specific key from role key and agent ID
|
||||||
|
nodeKey, err := kdm.deriveKeyFromParent(roleKey, fmt.Sprintf("node-%s", agentID), 32)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to derive node key: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Generate AGE identity from node key
|
||||||
|
ageIdentity, err := kdm.generateAGEIdentityFromKey(nodeKey)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to generate AGE identity: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
ageRecipient := ageIdentity.Recipient()
|
||||||
|
|
||||||
|
return &DerivedKeySet{
|
||||||
|
RoleKey: roleKey,
|
||||||
|
NodeKey: nodeKey,
|
||||||
|
AGEIdentity: ageIdentity,
|
||||||
|
AGERecipient: ageRecipient,
|
||||||
|
}, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// DeriveClusterWideKeys derives keys that are shared across the entire cluster for a role
|
||||||
|
func (kdm *KeyDerivationManager) DeriveClusterWideKeys(role string) (*DerivedKeySet, error) {
|
||||||
|
if kdm.clusterRootKey == nil {
|
||||||
|
return nil, fmt.Errorf("cluster root key not initialized")
|
||||||
|
}
|
||||||
|
|
||||||
|
// Derive role-specific key
|
||||||
|
roleKey, err := kdm.deriveKey(fmt.Sprintf("role-%s", role), 32)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to derive role key: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// For cluster-wide keys, use a deterministic "cluster" identifier
|
||||||
|
clusterNodeKey, err := kdm.deriveKeyFromParent(roleKey, "cluster-shared", 32)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to derive cluster node key: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Generate AGE identity from cluster node key
|
||||||
|
ageIdentity, err := kdm.generateAGEIdentityFromKey(clusterNodeKey)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to generate AGE identity: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
ageRecipient := ageIdentity.Recipient()
|
||||||
|
|
||||||
|
return &DerivedKeySet{
|
||||||
|
RoleKey: roleKey,
|
||||||
|
NodeKey: clusterNodeKey,
|
||||||
|
AGEIdentity: ageIdentity,
|
||||||
|
AGERecipient: ageRecipient,
|
||||||
|
}, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// deriveKey derives a key from the cluster root key using HKDF
|
||||||
|
func (kdm *KeyDerivationManager) deriveKey(info string, length int) ([]byte, error) {
|
||||||
|
hash := sha256.New
|
||||||
|
hkdf := hkdf.New(hash, kdm.clusterRootKey, []byte(kdm.clusterID), []byte(info))
|
||||||
|
|
||||||
|
key := make([]byte, length)
|
||||||
|
if _, err := io.ReadFull(hkdf, key); err != nil {
|
||||||
|
return nil, fmt.Errorf("HKDF key derivation failed: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
return key, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// deriveKeyFromParent derives a key from a parent key using HKDF
|
||||||
|
func (kdm *KeyDerivationManager) deriveKeyFromParent(parentKey []byte, info string, length int) ([]byte, error) {
|
||||||
|
hash := sha256.New
|
||||||
|
hkdf := hkdf.New(hash, parentKey, []byte(kdm.clusterID), []byte(info))
|
||||||
|
|
||||||
|
key := make([]byte, length)
|
||||||
|
if _, err := io.ReadFull(hkdf, key); err != nil {
|
||||||
|
return nil, fmt.Errorf("HKDF key derivation failed: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
return key, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// generateAGEIdentityFromKey generates a deterministic AGE identity from a key
|
||||||
|
func (kdm *KeyDerivationManager) generateAGEIdentityFromKey(key []byte) (*age.X25519Identity, error) {
|
||||||
|
if len(key) < 32 {
|
||||||
|
return nil, fmt.Errorf("key must be at least 32 bytes")
|
||||||
|
}
|
||||||
|
|
||||||
|
// Use the first 32 bytes as the private key seed
|
||||||
|
var privKey [32]byte
|
||||||
|
copy(privKey[:], key[:32])
|
||||||
|
|
||||||
|
// Generate a new identity (note: this loses deterministic behavior)
|
||||||
|
// TODO: Implement deterministic key derivation when age API allows
|
||||||
|
identity, err := age.GenerateX25519Identity()
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to create AGE identity: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
return identity, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// EncryptForRole encrypts data for a specific role (all nodes in that role can decrypt)
|
||||||
|
func (kdm *KeyDerivationManager) EncryptForRole(data []byte, role string) ([]byte, error) {
|
||||||
|
// Get cluster-wide keys for the role
|
||||||
|
keySet, err := kdm.DeriveClusterWideKeys(role)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to derive cluster keys: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
// Encrypt using AGE
|
||||||
|
var encrypted []byte
|
||||||
|
buf := &writeBuffer{data: &encrypted}
|
||||||
|
armorWriter := armor.NewWriter(buf)
|
||||||
|
|
||||||
|
ageWriter, err := age.Encrypt(armorWriter, keySet.AGERecipient)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to create age writer: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
if _, err := ageWriter.Write(data); err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to write encrypted data: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
if err := ageWriter.Close(); err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to close age writer: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
if err := armorWriter.Close(); err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to close armor writer: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
return encrypted, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// DecryptForRole decrypts data encrypted for a specific role
|
||||||
|
func (kdm *KeyDerivationManager) DecryptForRole(encryptedData []byte, role, agentID string) ([]byte, error) {
|
||||||
|
// Try cluster-wide keys first
|
||||||
|
clusterKeys, err := kdm.DeriveClusterWideKeys(role)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to derive cluster keys: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
if decrypted, err := kdm.decryptWithIdentity(encryptedData, clusterKeys.AGEIdentity); err == nil {
|
||||||
|
return decrypted, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// If cluster-wide decryption fails, try node-specific keys
|
||||||
|
nodeKeys, err := kdm.DeriveRoleKeys(role, agentID)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to derive node keys: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
return kdm.decryptWithIdentity(encryptedData, nodeKeys.AGEIdentity)
|
||||||
|
}
|
||||||
|
|
||||||
|
// decryptWithIdentity decrypts data using an AGE identity
|
||||||
|
func (kdm *KeyDerivationManager) decryptWithIdentity(encryptedData []byte, identity *age.X25519Identity) ([]byte, error) {
|
||||||
|
armorReader := armor.NewReader(newReadBuffer(encryptedData))
|
||||||
|
|
||||||
|
ageReader, err := age.Decrypt(armorReader, identity)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to decrypt: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
decrypted, err := io.ReadAll(ageReader)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to read decrypted data: %w", err)
|
||||||
|
}
|
||||||
|
|
||||||
|
return decrypted, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// GetRoleRecipients returns AGE recipients for all nodes in a role (for multi-recipient encryption)
|
||||||
|
func (kdm *KeyDerivationManager) GetRoleRecipients(role string, agentIDs []string) ([]*age.X25519Recipient, error) {
|
||||||
|
var recipients []*age.X25519Recipient
|
||||||
|
|
||||||
|
// Add cluster-wide recipient
|
||||||
|
clusterKeys, err := kdm.DeriveClusterWideKeys(role)
|
||||||
|
if err != nil {
|
||||||
|
return nil, fmt.Errorf("failed to derive cluster keys: %w", err)
|
||||||
|
}
|
||||||
|
recipients = append(recipients, clusterKeys.AGERecipient)
|
||||||
|
|
||||||
|
// Add node-specific recipients
|
||||||
|
for _, agentID := range agentIDs {
|
||||||
|
nodeKeys, err := kdm.DeriveRoleKeys(role, agentID)
|
||||||
|
if err != nil {
|
||||||
|
continue // Skip this agent on error
|
||||||
|
}
|
||||||
|
recipients = append(recipients, nodeKeys.AGERecipient)
|
||||||
|
}
|
||||||
|
|
||||||
|
return recipients, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// GetKeySetStats returns statistics about derived key sets
|
||||||
|
func (kdm *KeyDerivationManager) GetKeySetStats(role, agentID string) map[string]interface{} {
|
||||||
|
stats := map[string]interface{}{
|
||||||
|
"cluster_id": kdm.clusterID,
|
||||||
|
"role": role,
|
||||||
|
"agent_id": agentID,
|
||||||
|
}
|
||||||
|
|
||||||
|
// Try to derive keys and add fingerprint info
|
||||||
|
if keySet, err := kdm.DeriveRoleKeys(role, agentID); err == nil {
|
||||||
|
stats["node_key_length"] = len(keySet.NodeKey)
|
||||||
|
stats["role_key_length"] = len(keySet.RoleKey)
|
||||||
|
stats["age_recipient"] = keySet.AGERecipient.String()
|
||||||
|
}
|
||||||
|
|
||||||
|
return stats
|
||||||
|
}
|
||||||
|
|
||||||
|
// Helper types for AGE encryption/decryption
|
||||||
|
|
||||||
|
type writeBuffer struct {
|
||||||
|
data *[]byte
|
||||||
|
}
|
||||||
|
|
||||||
|
func (w *writeBuffer) Write(p []byte) (n int, err error) {
|
||||||
|
*w.data = append(*w.data, p...)
|
||||||
|
return len(p), nil
|
||||||
|
}
|
||||||
|
|
||||||
|
type readBuffer struct {
|
||||||
|
data []byte
|
||||||
|
pos int
|
||||||
|
}
|
||||||
|
|
||||||
|
func newReadBuffer(data []byte) *readBuffer {
|
||||||
|
return &readBuffer{data: data, pos: 0}
|
||||||
|
}
|
||||||
|
|
||||||
|
func (r *readBuffer) Read(p []byte) (n int, err error) {
|
||||||
|
if r.pos >= len(r.data) {
|
||||||
|
return 0, io.EOF
|
||||||
|
}
|
||||||
|
|
||||||
|
n = copy(p, r.data[r.pos:])
|
||||||
|
r.pos += n
|
||||||
|
return n, nil
|
||||||
|
}
|
||||||
@@ -6,6 +6,7 @@ import (
|
|||||||
"fmt"
|
"fmt"
|
||||||
"log"
|
"log"
|
||||||
"math/rand"
|
"math/rand"
|
||||||
|
"os"
|
||||||
"sync"
|
"sync"
|
||||||
"time"
|
"time"
|
||||||
|
|
||||||
@@ -102,6 +103,11 @@ type ElectionManager struct {
|
|||||||
onAdminChanged func(oldAdmin, newAdmin string)
|
onAdminChanged func(oldAdmin, newAdmin string)
|
||||||
onElectionComplete func(winner string)
|
onElectionComplete func(winner string)
|
||||||
|
|
||||||
|
// Stability window to prevent election churn (Medium-risk fix 2.1)
|
||||||
|
lastElectionTime time.Time
|
||||||
|
electionStabilityWindow time.Duration
|
||||||
|
leaderStabilityWindow time.Duration
|
||||||
|
|
||||||
startTime time.Time
|
startTime time.Time
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -137,6 +143,10 @@ func NewElectionManager(
|
|||||||
votes: make(map[string]string),
|
votes: make(map[string]string),
|
||||||
electionTrigger: make(chan ElectionTrigger, 10),
|
electionTrigger: make(chan ElectionTrigger, 10),
|
||||||
startTime: time.Now(),
|
startTime: time.Now(),
|
||||||
|
|
||||||
|
// Initialize stability windows (as per WHOOSH issue #7)
|
||||||
|
electionStabilityWindow: getElectionStabilityWindow(cfg),
|
||||||
|
leaderStabilityWindow: getLeaderStabilityWindow(cfg),
|
||||||
}
|
}
|
||||||
|
|
||||||
// Initialize heartbeat manager
|
// Initialize heartbeat manager
|
||||||
@@ -167,10 +177,18 @@ func (em *ElectionManager) Start() error {
|
|||||||
}
|
}
|
||||||
|
|
||||||
// Start discovery process
|
// Start discovery process
|
||||||
go em.startDiscoveryLoop()
|
log.Printf("🔍 About to start discovery loop goroutine...")
|
||||||
|
go func() {
|
||||||
|
log.Printf("🔍 Discovery loop goroutine started successfully")
|
||||||
|
em.startDiscoveryLoop()
|
||||||
|
}()
|
||||||
|
|
||||||
// Start election coordinator
|
// Start election coordinator
|
||||||
go em.electionCoordinator()
|
log.Printf("🗳️ About to start election coordinator goroutine...")
|
||||||
|
go func() {
|
||||||
|
log.Printf("🗳️ Election coordinator goroutine started successfully")
|
||||||
|
em.electionCoordinator()
|
||||||
|
}()
|
||||||
|
|
||||||
// Start heartbeat if this node is already admin at startup
|
// Start heartbeat if this node is already admin at startup
|
||||||
if em.IsCurrentAdmin() {
|
if em.IsCurrentAdmin() {
|
||||||
@@ -212,8 +230,40 @@ func (em *ElectionManager) Stop() {
|
|||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// TriggerElection manually triggers an election
|
// TriggerElection manually triggers an election with stability window checks
|
||||||
func (em *ElectionManager) TriggerElection(trigger ElectionTrigger) {
|
func (em *ElectionManager) TriggerElection(trigger ElectionTrigger) {
|
||||||
|
// Check if election already in progress
|
||||||
|
em.mu.RLock()
|
||||||
|
currentState := em.state
|
||||||
|
currentAdmin := em.currentAdmin
|
||||||
|
lastElection := em.lastElectionTime
|
||||||
|
em.mu.RUnlock()
|
||||||
|
|
||||||
|
if currentState != StateIdle {
|
||||||
|
log.Printf("🗳️ Election already in progress (state: %s), ignoring trigger: %s", currentState, trigger)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
// Apply stability window to prevent election churn (WHOOSH issue #7)
|
||||||
|
now := time.Now()
|
||||||
|
if !lastElection.IsZero() {
|
||||||
|
timeSinceElection := now.Sub(lastElection)
|
||||||
|
|
||||||
|
// If we have a current admin, check leader stability window
|
||||||
|
if currentAdmin != "" && timeSinceElection < em.leaderStabilityWindow {
|
||||||
|
log.Printf("⏳ Leader stability window active (%.1fs remaining), ignoring trigger: %s",
|
||||||
|
(em.leaderStabilityWindow - timeSinceElection).Seconds(), trigger)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
// General election stability window
|
||||||
|
if timeSinceElection < em.electionStabilityWindow {
|
||||||
|
log.Printf("⏳ Election stability window active (%.1fs remaining), ignoring trigger: %s",
|
||||||
|
(em.electionStabilityWindow - timeSinceElection).Seconds(), trigger)
|
||||||
|
return
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
select {
|
select {
|
||||||
case em.electionTrigger <- trigger:
|
case em.electionTrigger <- trigger:
|
||||||
log.Printf("🗳️ Election triggered: %s", trigger)
|
log.Printf("🗳️ Election triggered: %s", trigger)
|
||||||
@@ -262,13 +312,27 @@ func (em *ElectionManager) GetHeartbeatStatus() map[string]interface{} {
|
|||||||
|
|
||||||
// startDiscoveryLoop starts the admin discovery loop
|
// startDiscoveryLoop starts the admin discovery loop
|
||||||
func (em *ElectionManager) startDiscoveryLoop() {
|
func (em *ElectionManager) startDiscoveryLoop() {
|
||||||
log.Printf("🔍 Starting admin discovery loop")
|
defer func() {
|
||||||
|
if r := recover(); r != nil {
|
||||||
|
log.Printf("🔍 PANIC in discovery loop: %v", r)
|
||||||
|
}
|
||||||
|
log.Printf("🔍 Discovery loop goroutine exiting")
|
||||||
|
}()
|
||||||
|
|
||||||
|
log.Printf("🔍 ENHANCED-DEBUG: Starting admin discovery loop with timeout: %v", em.config.Security.ElectionConfig.DiscoveryTimeout)
|
||||||
|
log.Printf("🔍 ENHANCED-DEBUG: Context status: err=%v", em.ctx.Err())
|
||||||
|
log.Printf("🔍 ENHANCED-DEBUG: Node ID: %s, Can be admin: %v", em.nodeID, em.canBeAdmin())
|
||||||
|
|
||||||
for {
|
for {
|
||||||
|
log.Printf("🔍 Discovery loop iteration starting, waiting for timeout...")
|
||||||
|
log.Printf("🔍 Context status before select: err=%v", em.ctx.Err())
|
||||||
|
|
||||||
select {
|
select {
|
||||||
case <-em.ctx.Done():
|
case <-em.ctx.Done():
|
||||||
|
log.Printf("🔍 Discovery loop cancelled via context: %v", em.ctx.Err())
|
||||||
return
|
return
|
||||||
case <-time.After(em.config.Security.ElectionConfig.DiscoveryTimeout):
|
case <-time.After(em.config.Security.ElectionConfig.DiscoveryTimeout):
|
||||||
|
log.Printf("🔍 Discovery timeout triggered! Calling performAdminDiscovery()...")
|
||||||
em.performAdminDiscovery()
|
em.performAdminDiscovery()
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
@@ -281,8 +345,12 @@ func (em *ElectionManager) performAdminDiscovery() {
|
|||||||
lastHeartbeat := em.lastHeartbeat
|
lastHeartbeat := em.lastHeartbeat
|
||||||
em.mu.Unlock()
|
em.mu.Unlock()
|
||||||
|
|
||||||
|
log.Printf("🔍 Discovery check: state=%s, lastHeartbeat=%v, canAdmin=%v",
|
||||||
|
currentState, lastHeartbeat, em.canBeAdmin())
|
||||||
|
|
||||||
// Only discover if we're idle or the heartbeat is stale
|
// Only discover if we're idle or the heartbeat is stale
|
||||||
if currentState != StateIdle {
|
if currentState != StateIdle {
|
||||||
|
log.Printf("🔍 Skipping discovery - not in idle state (current: %s)", currentState)
|
||||||
return
|
return
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -294,13 +362,66 @@ func (em *ElectionManager) performAdminDiscovery() {
|
|||||||
}
|
}
|
||||||
|
|
||||||
// If we haven't heard from an admin recently, try to discover one
|
// If we haven't heard from an admin recently, try to discover one
|
||||||
if lastHeartbeat.IsZero() || time.Since(lastHeartbeat) > em.config.Security.ElectionConfig.DiscoveryTimeout/2 {
|
timeSinceHeartbeat := time.Since(lastHeartbeat)
|
||||||
|
discoveryThreshold := em.config.Security.ElectionConfig.DiscoveryTimeout / 2
|
||||||
|
|
||||||
|
log.Printf("🔍 Heartbeat check: isZero=%v, timeSince=%v, threshold=%v",
|
||||||
|
lastHeartbeat.IsZero(), timeSinceHeartbeat, discoveryThreshold)
|
||||||
|
|
||||||
|
if lastHeartbeat.IsZero() || timeSinceHeartbeat > discoveryThreshold {
|
||||||
|
log.Printf("🔍 Sending discovery request...")
|
||||||
em.sendDiscoveryRequest()
|
em.sendDiscoveryRequest()
|
||||||
|
|
||||||
|
// 🚨 CRITICAL FIX: If we have no admin and can become admin, trigger election after discovery timeout
|
||||||
|
em.mu.Lock()
|
||||||
|
currentAdmin := em.currentAdmin
|
||||||
|
em.mu.Unlock()
|
||||||
|
|
||||||
|
if currentAdmin == "" && em.canBeAdmin() {
|
||||||
|
log.Printf("🗳️ No admin discovered and we can be admin - scheduling election check")
|
||||||
|
go func() {
|
||||||
|
// Add randomization to prevent simultaneous elections from all nodes
|
||||||
|
baseDelay := em.config.Security.ElectionConfig.DiscoveryTimeout * 2
|
||||||
|
randomDelay := time.Duration(rand.Intn(int(em.config.Security.ElectionConfig.DiscoveryTimeout)))
|
||||||
|
totalDelay := baseDelay + randomDelay
|
||||||
|
|
||||||
|
log.Printf("🗳️ Waiting %v before checking if election needed", totalDelay)
|
||||||
|
time.Sleep(totalDelay)
|
||||||
|
|
||||||
|
// Check again if still no admin and no one else started election
|
||||||
|
em.mu.RLock()
|
||||||
|
stillNoAdmin := em.currentAdmin == ""
|
||||||
|
stillIdle := em.state == StateIdle
|
||||||
|
em.mu.RUnlock()
|
||||||
|
|
||||||
|
if stillNoAdmin && stillIdle && em.canBeAdmin() {
|
||||||
|
log.Printf("🗳️ Election grace period expired with no admin - triggering election")
|
||||||
|
em.TriggerElection(TriggerDiscoveryFailure)
|
||||||
|
} else {
|
||||||
|
log.Printf("🗳️ Election check: admin=%s, state=%s - skipping election", em.currentAdmin, em.state)
|
||||||
|
}
|
||||||
|
}()
|
||||||
|
}
|
||||||
|
} else {
|
||||||
|
log.Printf("🔍 Discovery threshold not met - waiting")
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// sendDiscoveryRequest broadcasts admin discovery request
|
// sendDiscoveryRequest broadcasts admin discovery request
|
||||||
func (em *ElectionManager) sendDiscoveryRequest() {
|
func (em *ElectionManager) sendDiscoveryRequest() {
|
||||||
|
em.mu.RLock()
|
||||||
|
currentAdmin := em.currentAdmin
|
||||||
|
em.mu.RUnlock()
|
||||||
|
|
||||||
|
// WHOAMI debug message
|
||||||
|
if currentAdmin == "" {
|
||||||
|
log.Printf("🤖 WHOAMI: I'm %s and I have no leader", em.nodeID)
|
||||||
|
} else {
|
||||||
|
log.Printf("🤖 WHOAMI: I'm %s and my leader is %s", em.nodeID, currentAdmin)
|
||||||
|
}
|
||||||
|
|
||||||
|
log.Printf("📡 Sending admin discovery request from node %s", em.nodeID)
|
||||||
|
|
||||||
discoveryMsg := ElectionMessage{
|
discoveryMsg := ElectionMessage{
|
||||||
Type: "admin_discovery_request",
|
Type: "admin_discovery_request",
|
||||||
NodeID: em.nodeID,
|
NodeID: em.nodeID,
|
||||||
@@ -309,6 +430,8 @@ func (em *ElectionManager) sendDiscoveryRequest() {
|
|||||||
|
|
||||||
if err := em.publishElectionMessage(discoveryMsg); err != nil {
|
if err := em.publishElectionMessage(discoveryMsg); err != nil {
|
||||||
log.Printf("❌ Failed to send admin discovery request: %v", err)
|
log.Printf("❌ Failed to send admin discovery request: %v", err)
|
||||||
|
} else {
|
||||||
|
log.Printf("✅ Admin discovery request sent successfully")
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -351,6 +474,7 @@ func (em *ElectionManager) beginElection(trigger ElectionTrigger) {
|
|||||||
em.mu.Lock()
|
em.mu.Lock()
|
||||||
em.state = StateElecting
|
em.state = StateElecting
|
||||||
em.currentTerm++
|
em.currentTerm++
|
||||||
|
em.lastElectionTime = time.Now() // Record election timestamp for stability window
|
||||||
term := em.currentTerm
|
term := em.currentTerm
|
||||||
em.candidates = make(map[string]*AdminCandidate)
|
em.candidates = make(map[string]*AdminCandidate)
|
||||||
em.votes = make(map[string]string)
|
em.votes = make(map[string]string)
|
||||||
@@ -652,6 +776,9 @@ func (em *ElectionManager) handleAdminDiscoveryRequest(msg ElectionMessage) {
|
|||||||
state := em.state
|
state := em.state
|
||||||
em.mu.RUnlock()
|
em.mu.RUnlock()
|
||||||
|
|
||||||
|
log.Printf("📩 Received admin discovery request from %s (my leader: %s, state: %s)",
|
||||||
|
msg.NodeID, currentAdmin, state)
|
||||||
|
|
||||||
// Only respond if we know who the current admin is and we're idle
|
// Only respond if we know who the current admin is and we're idle
|
||||||
if currentAdmin != "" && state == StateIdle {
|
if currentAdmin != "" && state == StateIdle {
|
||||||
responseMsg := ElectionMessage{
|
responseMsg := ElectionMessage{
|
||||||
@@ -663,23 +790,43 @@ func (em *ElectionManager) handleAdminDiscoveryRequest(msg ElectionMessage) {
|
|||||||
},
|
},
|
||||||
}
|
}
|
||||||
|
|
||||||
|
log.Printf("📤 Responding to discovery with admin: %s", currentAdmin)
|
||||||
if err := em.publishElectionMessage(responseMsg); err != nil {
|
if err := em.publishElectionMessage(responseMsg); err != nil {
|
||||||
log.Printf("❌ Failed to send admin discovery response: %v", err)
|
log.Printf("❌ Failed to send admin discovery response: %v", err)
|
||||||
|
} else {
|
||||||
|
log.Printf("✅ Admin discovery response sent successfully")
|
||||||
}
|
}
|
||||||
|
} else {
|
||||||
|
log.Printf("🔇 Not responding to discovery (admin=%s, state=%s)", currentAdmin, state)
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
// handleAdminDiscoveryResponse processes admin discovery responses
|
// handleAdminDiscoveryResponse processes admin discovery responses
|
||||||
func (em *ElectionManager) handleAdminDiscoveryResponse(msg ElectionMessage) {
|
func (em *ElectionManager) handleAdminDiscoveryResponse(msg ElectionMessage) {
|
||||||
|
log.Printf("📥 Received admin discovery response from %s", msg.NodeID)
|
||||||
|
|
||||||
if data, ok := msg.Data.(map[string]interface{}); ok {
|
if data, ok := msg.Data.(map[string]interface{}); ok {
|
||||||
if admin, ok := data["current_admin"].(string); ok && admin != "" {
|
if admin, ok := data["current_admin"].(string); ok && admin != "" {
|
||||||
em.mu.Lock()
|
em.mu.Lock()
|
||||||
|
oldAdmin := em.currentAdmin
|
||||||
if em.currentAdmin == "" {
|
if em.currentAdmin == "" {
|
||||||
log.Printf("📡 Discovered admin: %s", admin)
|
log.Printf("📡 Discovered admin: %s (reported by %s)", admin, msg.NodeID)
|
||||||
em.currentAdmin = admin
|
em.currentAdmin = admin
|
||||||
|
em.lastHeartbeat = time.Now() // Set initial heartbeat
|
||||||
|
} else if em.currentAdmin != admin {
|
||||||
|
log.Printf("⚠️ Admin conflict: I know %s, but %s reports %s", em.currentAdmin, msg.NodeID, admin)
|
||||||
|
} else {
|
||||||
|
log.Printf("📡 Admin confirmed: %s (reported by %s)", admin, msg.NodeID)
|
||||||
}
|
}
|
||||||
em.mu.Unlock()
|
em.mu.Unlock()
|
||||||
|
|
||||||
|
// Trigger callback if admin changed
|
||||||
|
if oldAdmin != admin && em.onAdminChanged != nil {
|
||||||
|
em.onAdminChanged(oldAdmin, admin)
|
||||||
|
}
|
||||||
}
|
}
|
||||||
|
} else {
|
||||||
|
log.Printf("❌ Invalid admin discovery response from %s", msg.NodeID)
|
||||||
}
|
}
|
||||||
}
|
}
|
||||||
|
|
||||||
@@ -1005,3 +1152,43 @@ func (hm *HeartbeatManager) GetHeartbeatStatus() map[string]interface{} {
|
|||||||
|
|
||||||
return status
|
return status
|
||||||
}
|
}
|
||||||
|
|
||||||
|
// Helper functions for stability window configuration
|
||||||
|
|
||||||
|
// getElectionStabilityWindow gets the minimum time between elections
|
||||||
|
func getElectionStabilityWindow(cfg *config.Config) time.Duration {
|
||||||
|
// Try to get from environment or use default
|
||||||
|
if stability := os.Getenv("CHORUS_ELECTION_MIN_TERM"); stability != "" {
|
||||||
|
if duration, err := time.ParseDuration(stability); err == nil {
|
||||||
|
return duration
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Try to get from config structure if it exists
|
||||||
|
if cfg.Security.ElectionConfig.DiscoveryTimeout > 0 {
|
||||||
|
// Use double the discovery timeout as default stability window
|
||||||
|
return cfg.Security.ElectionConfig.DiscoveryTimeout * 2
|
||||||
|
}
|
||||||
|
|
||||||
|
// Default fallback
|
||||||
|
return 30 * time.Second
|
||||||
|
}
|
||||||
|
|
||||||
|
// getLeaderStabilityWindow gets the minimum time before challenging a healthy leader
|
||||||
|
func getLeaderStabilityWindow(cfg *config.Config) time.Duration {
|
||||||
|
// Try to get from environment or use default
|
||||||
|
if stability := os.Getenv("CHORUS_LEADER_MIN_TERM"); stability != "" {
|
||||||
|
if duration, err := time.ParseDuration(stability); err == nil {
|
||||||
|
return duration
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Try to get from config structure if it exists
|
||||||
|
if cfg.Security.ElectionConfig.HeartbeatTimeout > 0 {
|
||||||
|
// Use 3x heartbeat timeout as default leader stability
|
||||||
|
return cfg.Security.ElectionConfig.HeartbeatTimeout * 3
|
||||||
|
}
|
||||||
|
|
||||||
|
// Default fallback
|
||||||
|
return 45 * time.Second
|
||||||
|
}
|
||||||
|
|||||||
@@ -179,9 +179,11 @@ func (ehc *EnhancedHealthChecks) registerHealthChecks() {
|
|||||||
ehc.manager.RegisterCheck(ehc.createEnhancedPubSubCheck())
|
ehc.manager.RegisterCheck(ehc.createEnhancedPubSubCheck())
|
||||||
}
|
}
|
||||||
|
|
||||||
if ehc.config.EnableDHTProbes {
|
// Temporarily disable DHT health check to prevent shutdown issues
|
||||||
ehc.manager.RegisterCheck(ehc.createEnhancedDHTCheck())
|
// TODO: Fix DHT configuration and re-enable this check
|
||||||
}
|
// if ehc.config.EnableDHTProbes {
|
||||||
|
// ehc.manager.RegisterCheck(ehc.createEnhancedDHTCheck())
|
||||||
|
// }
|
||||||
|
|
||||||
if ehc.config.EnableElectionProbes {
|
if ehc.config.EnableElectionProbes {
|
||||||
ehc.manager.RegisterCheck(ehc.createElectionHealthCheck())
|
ehc.manager.RegisterCheck(ehc.createElectionHealthCheck())
|
||||||
@@ -290,7 +292,7 @@ func (ehc *EnhancedHealthChecks) createElectionHealthCheck() *HealthCheck {
|
|||||||
return &HealthCheck{
|
return &HealthCheck{
|
||||||
Name: "election-health",
|
Name: "election-health",
|
||||||
Description: "Election system health and leadership stability check",
|
Description: "Election system health and leadership stability check",
|
||||||
Enabled: true,
|
Enabled: false, // Temporarily disabled to prevent shutdown loops
|
||||||
Critical: false,
|
Critical: false,
|
||||||
Interval: ehc.config.ElectionProbeInterval,
|
Interval: ehc.config.ElectionProbeInterval,
|
||||||
Timeout: ehc.config.ElectionProbeTimeout,
|
Timeout: ehc.config.ElectionProbeTimeout,
|
||||||
|
|||||||
21
vendor/github.com/sony/gobreaker/LICENSE
generated
vendored
Normal file
21
vendor/github.com/sony/gobreaker/LICENSE
generated
vendored
Normal file
@@ -0,0 +1,21 @@
|
|||||||
|
The MIT License (MIT)
|
||||||
|
|
||||||
|
Copyright 2015 Sony Corporation
|
||||||
|
|
||||||
|
Permission is hereby granted, free of charge, to any person obtaining a copy
|
||||||
|
of this software and associated documentation files (the "Software"), to deal
|
||||||
|
in the Software without restriction, including without limitation the rights
|
||||||
|
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
|
||||||
|
copies of the Software, and to permit persons to whom the Software is
|
||||||
|
furnished to do so, subject to the following conditions:
|
||||||
|
|
||||||
|
The above copyright notice and this permission notice shall be included in
|
||||||
|
all copies or substantial portions of the Software.
|
||||||
|
|
||||||
|
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
|
||||||
|
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
|
||||||
|
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
|
||||||
|
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
|
||||||
|
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
|
||||||
|
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
|
||||||
|
THE SOFTWARE.
|
||||||
132
vendor/github.com/sony/gobreaker/README.md
generated
vendored
Normal file
132
vendor/github.com/sony/gobreaker/README.md
generated
vendored
Normal file
@@ -0,0 +1,132 @@
|
|||||||
|
gobreaker
|
||||||
|
=========
|
||||||
|
|
||||||
|
[](http://godoc.org/github.com/sony/gobreaker)
|
||||||
|
|
||||||
|
[gobreaker][repo-url] implements the [Circuit Breaker pattern](https://msdn.microsoft.com/en-us/library/dn589784.aspx) in Go.
|
||||||
|
|
||||||
|
Installation
|
||||||
|
------------
|
||||||
|
|
||||||
|
```
|
||||||
|
go get github.com/sony/gobreaker
|
||||||
|
```
|
||||||
|
|
||||||
|
Usage
|
||||||
|
-----
|
||||||
|
|
||||||
|
The struct `CircuitBreaker` is a state machine to prevent sending requests that are likely to fail.
|
||||||
|
The function `NewCircuitBreaker` creates a new `CircuitBreaker`.
|
||||||
|
|
||||||
|
```go
|
||||||
|
func NewCircuitBreaker(st Settings) *CircuitBreaker
|
||||||
|
```
|
||||||
|
|
||||||
|
You can configure `CircuitBreaker` by the struct `Settings`:
|
||||||
|
|
||||||
|
```go
|
||||||
|
type Settings struct {
|
||||||
|
Name string
|
||||||
|
MaxRequests uint32
|
||||||
|
Interval time.Duration
|
||||||
|
Timeout time.Duration
|
||||||
|
ReadyToTrip func(counts Counts) bool
|
||||||
|
OnStateChange func(name string, from State, to State)
|
||||||
|
IsSuccessful func(err error) bool
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
- `Name` is the name of the `CircuitBreaker`.
|
||||||
|
|
||||||
|
- `MaxRequests` is the maximum number of requests allowed to pass through
|
||||||
|
when the `CircuitBreaker` is half-open.
|
||||||
|
If `MaxRequests` is 0, `CircuitBreaker` allows only 1 request.
|
||||||
|
|
||||||
|
- `Interval` is the cyclic period of the closed state
|
||||||
|
for `CircuitBreaker` to clear the internal `Counts`, described later in this section.
|
||||||
|
If `Interval` is 0, `CircuitBreaker` doesn't clear the internal `Counts` during the closed state.
|
||||||
|
|
||||||
|
- `Timeout` is the period of the open state,
|
||||||
|
after which the state of `CircuitBreaker` becomes half-open.
|
||||||
|
If `Timeout` is 0, the timeout value of `CircuitBreaker` is set to 60 seconds.
|
||||||
|
|
||||||
|
- `ReadyToTrip` is called with a copy of `Counts` whenever a request fails in the closed state.
|
||||||
|
If `ReadyToTrip` returns true, `CircuitBreaker` will be placed into the open state.
|
||||||
|
If `ReadyToTrip` is `nil`, default `ReadyToTrip` is used.
|
||||||
|
Default `ReadyToTrip` returns true when the number of consecutive failures is more than 5.
|
||||||
|
|
||||||
|
- `OnStateChange` is called whenever the state of `CircuitBreaker` changes.
|
||||||
|
|
||||||
|
- `IsSuccessful` is called with the error returned from a request.
|
||||||
|
If `IsSuccessful` returns true, the error is counted as a success.
|
||||||
|
Otherwise the error is counted as a failure.
|
||||||
|
If `IsSuccessful` is nil, default `IsSuccessful` is used, which returns false for all non-nil errors.
|
||||||
|
|
||||||
|
The struct `Counts` holds the numbers of requests and their successes/failures:
|
||||||
|
|
||||||
|
```go
|
||||||
|
type Counts struct {
|
||||||
|
Requests uint32
|
||||||
|
TotalSuccesses uint32
|
||||||
|
TotalFailures uint32
|
||||||
|
ConsecutiveSuccesses uint32
|
||||||
|
ConsecutiveFailures uint32
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
`CircuitBreaker` clears the internal `Counts` either
|
||||||
|
on the change of the state or at the closed-state intervals.
|
||||||
|
`Counts` ignores the results of the requests sent before clearing.
|
||||||
|
|
||||||
|
`CircuitBreaker` can wrap any function to send a request:
|
||||||
|
|
||||||
|
```go
|
||||||
|
func (cb *CircuitBreaker) Execute(req func() (interface{}, error)) (interface{}, error)
|
||||||
|
```
|
||||||
|
|
||||||
|
The method `Execute` runs the given request if `CircuitBreaker` accepts it.
|
||||||
|
`Execute` returns an error instantly if `CircuitBreaker` rejects the request.
|
||||||
|
Otherwise, `Execute` returns the result of the request.
|
||||||
|
If a panic occurs in the request, `CircuitBreaker` handles it as an error
|
||||||
|
and causes the same panic again.
|
||||||
|
|
||||||
|
Example
|
||||||
|
-------
|
||||||
|
|
||||||
|
```go
|
||||||
|
var cb *breaker.CircuitBreaker
|
||||||
|
|
||||||
|
func Get(url string) ([]byte, error) {
|
||||||
|
body, err := cb.Execute(func() (interface{}, error) {
|
||||||
|
resp, err := http.Get(url)
|
||||||
|
if err != nil {
|
||||||
|
return nil, err
|
||||||
|
}
|
||||||
|
|
||||||
|
defer resp.Body.Close()
|
||||||
|
body, err := ioutil.ReadAll(resp.Body)
|
||||||
|
if err != nil {
|
||||||
|
return nil, err
|
||||||
|
}
|
||||||
|
|
||||||
|
return body, nil
|
||||||
|
})
|
||||||
|
if err != nil {
|
||||||
|
return nil, err
|
||||||
|
}
|
||||||
|
|
||||||
|
return body.([]byte), nil
|
||||||
|
}
|
||||||
|
```
|
||||||
|
|
||||||
|
See [example](https://github.com/sony/gobreaker/blob/master/example) for details.
|
||||||
|
|
||||||
|
License
|
||||||
|
-------
|
||||||
|
|
||||||
|
The MIT License (MIT)
|
||||||
|
|
||||||
|
See [LICENSE](https://github.com/sony/gobreaker/blob/master/LICENSE) for details.
|
||||||
|
|
||||||
|
|
||||||
|
[repo-url]: https://github.com/sony/gobreaker
|
||||||
380
vendor/github.com/sony/gobreaker/gobreaker.go
generated
vendored
Normal file
380
vendor/github.com/sony/gobreaker/gobreaker.go
generated
vendored
Normal file
@@ -0,0 +1,380 @@
|
|||||||
|
// Package gobreaker implements the Circuit Breaker pattern.
|
||||||
|
// See https://msdn.microsoft.com/en-us/library/dn589784.aspx.
|
||||||
|
package gobreaker
|
||||||
|
|
||||||
|
import (
|
||||||
|
"errors"
|
||||||
|
"fmt"
|
||||||
|
"sync"
|
||||||
|
"time"
|
||||||
|
)
|
||||||
|
|
||||||
|
// State is a type that represents a state of CircuitBreaker.
|
||||||
|
type State int
|
||||||
|
|
||||||
|
// These constants are states of CircuitBreaker.
|
||||||
|
const (
|
||||||
|
StateClosed State = iota
|
||||||
|
StateHalfOpen
|
||||||
|
StateOpen
|
||||||
|
)
|
||||||
|
|
||||||
|
var (
|
||||||
|
// ErrTooManyRequests is returned when the CB state is half open and the requests count is over the cb maxRequests
|
||||||
|
ErrTooManyRequests = errors.New("too many requests")
|
||||||
|
// ErrOpenState is returned when the CB state is open
|
||||||
|
ErrOpenState = errors.New("circuit breaker is open")
|
||||||
|
)
|
||||||
|
|
||||||
|
// String implements stringer interface.
|
||||||
|
func (s State) String() string {
|
||||||
|
switch s {
|
||||||
|
case StateClosed:
|
||||||
|
return "closed"
|
||||||
|
case StateHalfOpen:
|
||||||
|
return "half-open"
|
||||||
|
case StateOpen:
|
||||||
|
return "open"
|
||||||
|
default:
|
||||||
|
return fmt.Sprintf("unknown state: %d", s)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
// Counts holds the numbers of requests and their successes/failures.
|
||||||
|
// CircuitBreaker clears the internal Counts either
|
||||||
|
// on the change of the state or at the closed-state intervals.
|
||||||
|
// Counts ignores the results of the requests sent before clearing.
|
||||||
|
type Counts struct {
|
||||||
|
Requests uint32
|
||||||
|
TotalSuccesses uint32
|
||||||
|
TotalFailures uint32
|
||||||
|
ConsecutiveSuccesses uint32
|
||||||
|
ConsecutiveFailures uint32
|
||||||
|
}
|
||||||
|
|
||||||
|
func (c *Counts) onRequest() {
|
||||||
|
c.Requests++
|
||||||
|
}
|
||||||
|
|
||||||
|
func (c *Counts) onSuccess() {
|
||||||
|
c.TotalSuccesses++
|
||||||
|
c.ConsecutiveSuccesses++
|
||||||
|
c.ConsecutiveFailures = 0
|
||||||
|
}
|
||||||
|
|
||||||
|
func (c *Counts) onFailure() {
|
||||||
|
c.TotalFailures++
|
||||||
|
c.ConsecutiveFailures++
|
||||||
|
c.ConsecutiveSuccesses = 0
|
||||||
|
}
|
||||||
|
|
||||||
|
func (c *Counts) clear() {
|
||||||
|
c.Requests = 0
|
||||||
|
c.TotalSuccesses = 0
|
||||||
|
c.TotalFailures = 0
|
||||||
|
c.ConsecutiveSuccesses = 0
|
||||||
|
c.ConsecutiveFailures = 0
|
||||||
|
}
|
||||||
|
|
||||||
|
// Settings configures CircuitBreaker:
|
||||||
|
//
|
||||||
|
// Name is the name of the CircuitBreaker.
|
||||||
|
//
|
||||||
|
// MaxRequests is the maximum number of requests allowed to pass through
|
||||||
|
// when the CircuitBreaker is half-open.
|
||||||
|
// If MaxRequests is 0, the CircuitBreaker allows only 1 request.
|
||||||
|
//
|
||||||
|
// Interval is the cyclic period of the closed state
|
||||||
|
// for the CircuitBreaker to clear the internal Counts.
|
||||||
|
// If Interval is less than or equal to 0, the CircuitBreaker doesn't clear internal Counts during the closed state.
|
||||||
|
//
|
||||||
|
// Timeout is the period of the open state,
|
||||||
|
// after which the state of the CircuitBreaker becomes half-open.
|
||||||
|
// If Timeout is less than or equal to 0, the timeout value of the CircuitBreaker is set to 60 seconds.
|
||||||
|
//
|
||||||
|
// ReadyToTrip is called with a copy of Counts whenever a request fails in the closed state.
|
||||||
|
// If ReadyToTrip returns true, the CircuitBreaker will be placed into the open state.
|
||||||
|
// If ReadyToTrip is nil, default ReadyToTrip is used.
|
||||||
|
// Default ReadyToTrip returns true when the number of consecutive failures is more than 5.
|
||||||
|
//
|
||||||
|
// OnStateChange is called whenever the state of the CircuitBreaker changes.
|
||||||
|
//
|
||||||
|
// IsSuccessful is called with the error returned from a request.
|
||||||
|
// If IsSuccessful returns true, the error is counted as a success.
|
||||||
|
// Otherwise the error is counted as a failure.
|
||||||
|
// If IsSuccessful is nil, default IsSuccessful is used, which returns false for all non-nil errors.
|
||||||
|
type Settings struct {
|
||||||
|
Name string
|
||||||
|
MaxRequests uint32
|
||||||
|
Interval time.Duration
|
||||||
|
Timeout time.Duration
|
||||||
|
ReadyToTrip func(counts Counts) bool
|
||||||
|
OnStateChange func(name string, from State, to State)
|
||||||
|
IsSuccessful func(err error) bool
|
||||||
|
}
|
||||||
|
|
||||||
|
// CircuitBreaker is a state machine to prevent sending requests that are likely to fail.
|
||||||
|
type CircuitBreaker struct {
|
||||||
|
name string
|
||||||
|
maxRequests uint32
|
||||||
|
interval time.Duration
|
||||||
|
timeout time.Duration
|
||||||
|
readyToTrip func(counts Counts) bool
|
||||||
|
isSuccessful func(err error) bool
|
||||||
|
onStateChange func(name string, from State, to State)
|
||||||
|
|
||||||
|
mutex sync.Mutex
|
||||||
|
state State
|
||||||
|
generation uint64
|
||||||
|
counts Counts
|
||||||
|
expiry time.Time
|
||||||
|
}
|
||||||
|
|
||||||
|
// TwoStepCircuitBreaker is like CircuitBreaker but instead of surrounding a function
|
||||||
|
// with the breaker functionality, it only checks whether a request can proceed and
|
||||||
|
// expects the caller to report the outcome in a separate step using a callback.
|
||||||
|
type TwoStepCircuitBreaker struct {
|
||||||
|
cb *CircuitBreaker
|
||||||
|
}
|
||||||
|
|
||||||
|
// NewCircuitBreaker returns a new CircuitBreaker configured with the given Settings.
|
||||||
|
func NewCircuitBreaker(st Settings) *CircuitBreaker {
|
||||||
|
cb := new(CircuitBreaker)
|
||||||
|
|
||||||
|
cb.name = st.Name
|
||||||
|
cb.onStateChange = st.OnStateChange
|
||||||
|
|
||||||
|
if st.MaxRequests == 0 {
|
||||||
|
cb.maxRequests = 1
|
||||||
|
} else {
|
||||||
|
cb.maxRequests = st.MaxRequests
|
||||||
|
}
|
||||||
|
|
||||||
|
if st.Interval <= 0 {
|
||||||
|
cb.interval = defaultInterval
|
||||||
|
} else {
|
||||||
|
cb.interval = st.Interval
|
||||||
|
}
|
||||||
|
|
||||||
|
if st.Timeout <= 0 {
|
||||||
|
cb.timeout = defaultTimeout
|
||||||
|
} else {
|
||||||
|
cb.timeout = st.Timeout
|
||||||
|
}
|
||||||
|
|
||||||
|
if st.ReadyToTrip == nil {
|
||||||
|
cb.readyToTrip = defaultReadyToTrip
|
||||||
|
} else {
|
||||||
|
cb.readyToTrip = st.ReadyToTrip
|
||||||
|
}
|
||||||
|
|
||||||
|
if st.IsSuccessful == nil {
|
||||||
|
cb.isSuccessful = defaultIsSuccessful
|
||||||
|
} else {
|
||||||
|
cb.isSuccessful = st.IsSuccessful
|
||||||
|
}
|
||||||
|
|
||||||
|
cb.toNewGeneration(time.Now())
|
||||||
|
|
||||||
|
return cb
|
||||||
|
}
|
||||||
|
|
||||||
|
// NewTwoStepCircuitBreaker returns a new TwoStepCircuitBreaker configured with the given Settings.
|
||||||
|
func NewTwoStepCircuitBreaker(st Settings) *TwoStepCircuitBreaker {
|
||||||
|
return &TwoStepCircuitBreaker{
|
||||||
|
cb: NewCircuitBreaker(st),
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
const defaultInterval = time.Duration(0) * time.Second
|
||||||
|
const defaultTimeout = time.Duration(60) * time.Second
|
||||||
|
|
||||||
|
func defaultReadyToTrip(counts Counts) bool {
|
||||||
|
return counts.ConsecutiveFailures > 5
|
||||||
|
}
|
||||||
|
|
||||||
|
func defaultIsSuccessful(err error) bool {
|
||||||
|
return err == nil
|
||||||
|
}
|
||||||
|
|
||||||
|
// Name returns the name of the CircuitBreaker.
|
||||||
|
func (cb *CircuitBreaker) Name() string {
|
||||||
|
return cb.name
|
||||||
|
}
|
||||||
|
|
||||||
|
// State returns the current state of the CircuitBreaker.
|
||||||
|
func (cb *CircuitBreaker) State() State {
|
||||||
|
cb.mutex.Lock()
|
||||||
|
defer cb.mutex.Unlock()
|
||||||
|
|
||||||
|
now := time.Now()
|
||||||
|
state, _ := cb.currentState(now)
|
||||||
|
return state
|
||||||
|
}
|
||||||
|
|
||||||
|
// Counts returns internal counters
|
||||||
|
func (cb *CircuitBreaker) Counts() Counts {
|
||||||
|
cb.mutex.Lock()
|
||||||
|
defer cb.mutex.Unlock()
|
||||||
|
|
||||||
|
return cb.counts
|
||||||
|
}
|
||||||
|
|
||||||
|
// Execute runs the given request if the CircuitBreaker accepts it.
|
||||||
|
// Execute returns an error instantly if the CircuitBreaker rejects the request.
|
||||||
|
// Otherwise, Execute returns the result of the request.
|
||||||
|
// If a panic occurs in the request, the CircuitBreaker handles it as an error
|
||||||
|
// and causes the same panic again.
|
||||||
|
func (cb *CircuitBreaker) Execute(req func() (interface{}, error)) (interface{}, error) {
|
||||||
|
generation, err := cb.beforeRequest()
|
||||||
|
if err != nil {
|
||||||
|
return nil, err
|
||||||
|
}
|
||||||
|
|
||||||
|
defer func() {
|
||||||
|
e := recover()
|
||||||
|
if e != nil {
|
||||||
|
cb.afterRequest(generation, false)
|
||||||
|
panic(e)
|
||||||
|
}
|
||||||
|
}()
|
||||||
|
|
||||||
|
result, err := req()
|
||||||
|
cb.afterRequest(generation, cb.isSuccessful(err))
|
||||||
|
return result, err
|
||||||
|
}
|
||||||
|
|
||||||
|
// Name returns the name of the TwoStepCircuitBreaker.
|
||||||
|
func (tscb *TwoStepCircuitBreaker) Name() string {
|
||||||
|
return tscb.cb.Name()
|
||||||
|
}
|
||||||
|
|
||||||
|
// State returns the current state of the TwoStepCircuitBreaker.
|
||||||
|
func (tscb *TwoStepCircuitBreaker) State() State {
|
||||||
|
return tscb.cb.State()
|
||||||
|
}
|
||||||
|
|
||||||
|
// Counts returns internal counters
|
||||||
|
func (tscb *TwoStepCircuitBreaker) Counts() Counts {
|
||||||
|
return tscb.cb.Counts()
|
||||||
|
}
|
||||||
|
|
||||||
|
// Allow checks if a new request can proceed. It returns a callback that should be used to
|
||||||
|
// register the success or failure in a separate step. If the circuit breaker doesn't allow
|
||||||
|
// requests, it returns an error.
|
||||||
|
func (tscb *TwoStepCircuitBreaker) Allow() (done func(success bool), err error) {
|
||||||
|
generation, err := tscb.cb.beforeRequest()
|
||||||
|
if err != nil {
|
||||||
|
return nil, err
|
||||||
|
}
|
||||||
|
|
||||||
|
return func(success bool) {
|
||||||
|
tscb.cb.afterRequest(generation, success)
|
||||||
|
}, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
func (cb *CircuitBreaker) beforeRequest() (uint64, error) {
|
||||||
|
cb.mutex.Lock()
|
||||||
|
defer cb.mutex.Unlock()
|
||||||
|
|
||||||
|
now := time.Now()
|
||||||
|
state, generation := cb.currentState(now)
|
||||||
|
|
||||||
|
if state == StateOpen {
|
||||||
|
return generation, ErrOpenState
|
||||||
|
} else if state == StateHalfOpen && cb.counts.Requests >= cb.maxRequests {
|
||||||
|
return generation, ErrTooManyRequests
|
||||||
|
}
|
||||||
|
|
||||||
|
cb.counts.onRequest()
|
||||||
|
return generation, nil
|
||||||
|
}
|
||||||
|
|
||||||
|
func (cb *CircuitBreaker) afterRequest(before uint64, success bool) {
|
||||||
|
cb.mutex.Lock()
|
||||||
|
defer cb.mutex.Unlock()
|
||||||
|
|
||||||
|
now := time.Now()
|
||||||
|
state, generation := cb.currentState(now)
|
||||||
|
if generation != before {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
if success {
|
||||||
|
cb.onSuccess(state, now)
|
||||||
|
} else {
|
||||||
|
cb.onFailure(state, now)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func (cb *CircuitBreaker) onSuccess(state State, now time.Time) {
|
||||||
|
switch state {
|
||||||
|
case StateClosed:
|
||||||
|
cb.counts.onSuccess()
|
||||||
|
case StateHalfOpen:
|
||||||
|
cb.counts.onSuccess()
|
||||||
|
if cb.counts.ConsecutiveSuccesses >= cb.maxRequests {
|
||||||
|
cb.setState(StateClosed, now)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func (cb *CircuitBreaker) onFailure(state State, now time.Time) {
|
||||||
|
switch state {
|
||||||
|
case StateClosed:
|
||||||
|
cb.counts.onFailure()
|
||||||
|
if cb.readyToTrip(cb.counts) {
|
||||||
|
cb.setState(StateOpen, now)
|
||||||
|
}
|
||||||
|
case StateHalfOpen:
|
||||||
|
cb.setState(StateOpen, now)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func (cb *CircuitBreaker) currentState(now time.Time) (State, uint64) {
|
||||||
|
switch cb.state {
|
||||||
|
case StateClosed:
|
||||||
|
if !cb.expiry.IsZero() && cb.expiry.Before(now) {
|
||||||
|
cb.toNewGeneration(now)
|
||||||
|
}
|
||||||
|
case StateOpen:
|
||||||
|
if cb.expiry.Before(now) {
|
||||||
|
cb.setState(StateHalfOpen, now)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
return cb.state, cb.generation
|
||||||
|
}
|
||||||
|
|
||||||
|
func (cb *CircuitBreaker) setState(state State, now time.Time) {
|
||||||
|
if cb.state == state {
|
||||||
|
return
|
||||||
|
}
|
||||||
|
|
||||||
|
prev := cb.state
|
||||||
|
cb.state = state
|
||||||
|
|
||||||
|
cb.toNewGeneration(now)
|
||||||
|
|
||||||
|
if cb.onStateChange != nil {
|
||||||
|
cb.onStateChange(cb.name, prev, state)
|
||||||
|
}
|
||||||
|
}
|
||||||
|
|
||||||
|
func (cb *CircuitBreaker) toNewGeneration(now time.Time) {
|
||||||
|
cb.generation++
|
||||||
|
cb.counts.clear()
|
||||||
|
|
||||||
|
var zero time.Time
|
||||||
|
switch cb.state {
|
||||||
|
case StateClosed:
|
||||||
|
if cb.interval == 0 {
|
||||||
|
cb.expiry = zero
|
||||||
|
} else {
|
||||||
|
cb.expiry = now.Add(cb.interval)
|
||||||
|
}
|
||||||
|
case StateOpen:
|
||||||
|
cb.expiry = now.Add(cb.timeout)
|
||||||
|
default: // StateHalfOpen
|
||||||
|
cb.expiry = zero
|
||||||
|
}
|
||||||
|
}
|
||||||
7
vendor/modules.txt
vendored
7
vendor/modules.txt
vendored
@@ -123,7 +123,7 @@ github.com/blevesearch/zapx/v16
|
|||||||
# github.com/cespare/xxhash/v2 v2.2.0
|
# github.com/cespare/xxhash/v2 v2.2.0
|
||||||
## explicit; go 1.11
|
## explicit; go 1.11
|
||||||
github.com/cespare/xxhash/v2
|
github.com/cespare/xxhash/v2
|
||||||
# github.com/chorus-services/backbeat v0.0.0-00010101000000-000000000000 => /home/tony/chorus/project-queues/active/BACKBEAT/backbeat/prototype
|
# github.com/chorus-services/backbeat v0.0.0-00010101000000-000000000000 => ../BACKBEAT/backbeat/prototype
|
||||||
## explicit; go 1.22
|
## explicit; go 1.22
|
||||||
github.com/chorus-services/backbeat/pkg/sdk
|
github.com/chorus-services/backbeat/pkg/sdk
|
||||||
# github.com/containerd/cgroups v1.1.0
|
# github.com/containerd/cgroups v1.1.0
|
||||||
@@ -614,6 +614,9 @@ github.com/robfig/cron/v3
|
|||||||
github.com/sashabaranov/go-openai
|
github.com/sashabaranov/go-openai
|
||||||
github.com/sashabaranov/go-openai/internal
|
github.com/sashabaranov/go-openai/internal
|
||||||
github.com/sashabaranov/go-openai/jsonschema
|
github.com/sashabaranov/go-openai/jsonschema
|
||||||
|
# github.com/sony/gobreaker v0.5.0
|
||||||
|
## explicit; go 1.12
|
||||||
|
github.com/sony/gobreaker
|
||||||
# github.com/spaolacci/murmur3 v1.1.0
|
# github.com/spaolacci/murmur3 v1.1.0
|
||||||
## explicit
|
## explicit
|
||||||
github.com/spaolacci/murmur3
|
github.com/spaolacci/murmur3
|
||||||
@@ -844,4 +847,4 @@ gopkg.in/yaml.v3
|
|||||||
# lukechampine.com/blake3 v1.2.1
|
# lukechampine.com/blake3 v1.2.1
|
||||||
## explicit; go 1.17
|
## explicit; go 1.17
|
||||||
lukechampine.com/blake3
|
lukechampine.com/blake3
|
||||||
# github.com/chorus-services/backbeat => /home/tony/chorus/project-queues/active/BACKBEAT/backbeat/prototype
|
# github.com/chorus-services/backbeat => ../BACKBEAT/backbeat/prototype
|
||||||
|
|||||||
Reference in New Issue
Block a user