CHORUS License Validation System
Package: internal/licensing
Purpose: KACHING license authority integration with fail-closed validation
Critical: License validation is MANDATORY at startup - invalid license = immediate exit
Overview
The CHORUS licensing system enforces software licensing through integration with the KACHING license authority. The system implements a fail-closed security model: if license validation fails, CHORUS will not start. This ensures that all running instances are properly licensed and authorized.
Key Components
- Validator: Core license validation with KACHING server communication
- LicenseGate: Enhanced validation with caching, circuit breaker, and cluster lease management
- LicenseConfig: Configuration structure for licensing parameters
Security Model: FAIL-CLOSED
┌─────────────────────────────────────────────────────────────┐
│ CHORUS STARTUP │
├─────────────────────────────────────────────────────────────┤
│ │
│ 1. Load Configuration │
│ 2. Initialize Logger │
│ 3. ⚠️ VALIDATE LICENSE (CRITICAL GATE) │
│ │ │
│ ├─── SUCCESS ──→ Continue startup │
│ │ │
│ └─── FAILURE ──→ return error → IMMEDIATE EXIT │
│ │
│ 4. Initialize AI Provider │
│ 5. Start P2P Network │
│ 6. ... rest of initialization │
│ │
└─────────────────────────────────────────────────────────────┘
NO BYPASS: License validation cannot be skipped or bypassed.
Architecture
1. LicenseConfig Structure
Location: internal/licensing/validator.go and pkg/config/config.go
// Core licensing configuration
type LicenseConfig struct {
    LicenseID  string // Unique license identifier
    ClusterID  string // Cluster/deployment identifier
    KachingURL string // KACHING server URL
}

// Extended configuration in pkg/config
type LicenseConfig struct {
    LicenseID        string    `yaml:"license_id"`
    ClusterID        string    `yaml:"cluster_id"`
    OrganizationName string    `yaml:"organization_name"`
    KachingURL       string    `yaml:"kaching_url"`
    IsActive         bool      `yaml:"is_active"`
    LastValidated    time.Time `yaml:"last_validated"`
    GracePeriodHours int       `yaml:"grace_period_hours"`
    LicenseType      string    `yaml:"license_type"`
    ExpiresAt        time.Time `yaml:"expires_at"`
    MaxNodes         int       `yaml:"max_nodes"`
}
Configuration Fields:
| Field | Required | Purpose |
|---|---|---|
| `LicenseID` | ✅ Yes | Unique identifier for the license |
| `ClusterID` | ✅ Yes | Identifies the cluster/deployment |
| `KachingURL` | No | KACHING server URL (defaults to http://localhost:8083) |
| `OrganizationName` | No | Organization name for tracking |
| `LicenseType` | No | Type of license (e.g., "enterprise", "developer") |
| `ExpiresAt` | No | License expiration timestamp |
| `MaxNodes` | No | Maximum nodes allowed in cluster |
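The required/optional split in the table can be sketched as a small normalization step. This is an illustration only: `normalize` and `defaultKachingURL` are hypothetical names, and the real `NewValidator` may apply the default differently.

```go
package main

import (
	"errors"
	"fmt"
)

// LicenseConfig mirrors the core three-field struct shown above.
type LicenseConfig struct {
	LicenseID  string
	ClusterID  string
	KachingURL string
}

// defaultKachingURL is the documented fallback for KachingURL.
const defaultKachingURL = "http://localhost:8083"

// normalize is a hypothetical helper: it rejects a config that lacks the
// required fields and fills in the default KACHING URL when none is set.
func normalize(cfg LicenseConfig) (LicenseConfig, error) {
	if cfg.LicenseID == "" || cfg.ClusterID == "" {
		return cfg, errors.New("license ID and cluster ID are required")
	}
	if cfg.KachingURL == "" {
		cfg.KachingURL = defaultKachingURL
	}
	return cfg, nil
}

func main() {
	cfg, err := normalize(LicenseConfig{LicenseID: "lic_abc123", ClusterID: "cluster_xyz789"})
	fmt.Println(cfg.KachingURL, err)

	_, err = normalize(LicenseConfig{LicenseID: "lic_abc123"})
	fmt.Println(err)
}
```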
Validation Flow
Standard Validation Sequence
┌──────────────────────────────────────────────────────────────────┐
│ License Validation Flow │
└──────────────────────────────────────────────────────────────────┘
1. NewValidator(config) → Initialize Validator
│
├─→ Set KachingURL (default: http://localhost:8083)
├─→ Create HTTP client (timeout: 30s)
└─→ Initialize LicenseGate
│
└─→ Initialize Circuit Breaker
└─→ Set Grace Period (90 seconds from start)
2. Validate() → Perform validation
│
├─→ ValidateWithContext(ctx)
│ │
│ ├─→ Check required fields (LicenseID, ClusterID)
│ │
│ ├─→ LicenseGate.Validate(ctx, agentID)
│ │ │
│ │ ├─→ Check cached lease (if valid, use it)
│ │ │ ├─→ validateCachedLease()
│ │ │ │ ├─→ POST /api/v1/licenses/validate-lease
│ │ │ │ ├─→ Include: lease_token, cluster_id, agent_id
│ │ │ │ └─→ Response: valid, remaining_replicas, expires_at
│ │ │ │
│ │ │ └─→ Cache hit? → SUCCESS
│ │ │
│ │ ├─→ Cache miss? → Request new lease
│ │ │ │
│ │ │ ├─→ breaker.Execute() [Circuit Breaker]
│ │ │ │ │
│ │ │ │ ├─→ requestOrRenewLease()
│ │ │ │ │ ├─→ POST /api/v1/licenses/{id}/cluster-lease
│ │ │ │ │ ├─→ Request: cluster_id, requested_replicas, duration_minutes
│ │ │ │ │ └─→ Response: lease_token, max_replicas, expires_at, lease_id
│ │ │ │ │
│ │ │ │ ├─→ validateLease(lease, agentID)
│ │ │ │ │ └─→ POST /api/v1/licenses/validate-lease
│ │ │ │ │
│ │ │ │ └─→ storeLease() → Cache the valid lease
│ │ │ │
│ │ │ └─→ Extend grace period (90s)
│ │ │
│ │ └─→ Validation failed?
│ │ │
│ │ ├─→ In grace period? → Log warning, ALLOW startup
│ │ └─→ Outside grace period? → RETURN ERROR
│ │
│ └─→ Fallback to validateLegacy() on LicenseGate failure
│ │
│ ├─→ POST /v1/license/activate
│ ├─→ Request: license_id, cluster_id, metadata
│ └─→ Response: validation result
│
└─→ Return validation result
3. Result Handling (in runtime/shared.go)
│
├─→ SUCCESS → Log "✅ License validation successful"
│ → Continue initialization
│
└─→ FAILURE → return error → CHORUS EXITS IMMEDIATELY
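The branch order in step 2 can be condensed into a single decision function. The sketch below is a simplification under stated assumptions: `validateOnce` and `errUnreachable` are hypothetical names, the cached lease is treated as already re-validated, and the `validateLegacy()` fallback is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable stands in for any failure returned by the lease request.
var errUnreachable = errors.New("unable to contact license authority")

// validateOnce mirrors the branch order in the diagram: a cache hit wins,
// then a fresh lease request is tried, and a failure is tolerated only
// while the grace period is active. It returns which path granted access.
func validateOnce(cacheHit bool, requestLease func() error, inGrace bool) (string, error) {
	if cacheHit {
		return "cached lease", nil
	}
	err := requestLease()
	if err == nil {
		return "new lease", nil
	}
	if inGrace {
		// The real system logs a warning here and allows startup.
		return "grace period", nil
	}
	return "", fmt.Errorf("license validation failed: %w", err)
}

func main() {
	failing := func() error { return errUnreachable }

	fmt.Println(validateOnce(false, failing, true))
	fmt.Println(validateOnce(false, failing, false))
}
```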
Component Details
1. Validator Component
File: internal/licensing/validator.go
The Validator is the primary component for license validation, providing communication with the KACHING license authority.
Key Methods
NewValidator(config LicenseConfig)
func NewValidator(config LicenseConfig) *Validator
Creates a new license validator with:
- HTTP client with 30-second timeout
- Default KACHING URL if not specified
- Initialized LicenseGate for enhanced validation
Validate()
func (v *Validator) Validate() error
Performs license validation with KACHING authority:
- Validates required configuration fields
- Uses LicenseGate for cached/enhanced validation
- Falls back to legacy validation if needed
- Returns error if validation fails
validateLegacy()
func (v *Validator) validateLegacy() error
Legacy validation method (fallback):
- Direct HTTP POST to `/v1/license/activate`
- Sends license metadata (product, version, container flag)
- Fail-closed: Network error = validation failure
- Parses and validates response status
Request Example:
{
"license_id": "lic_abc123",
"cluster_id": "cluster_xyz789",
"metadata": {
"product": "CHORUS",
"version": "0.1.0-dev",
"container": "true"
}
}
Response Example (Success):
{
"status": "ok",
"message": "License valid",
"expires_at": "2025-12-31T23:59:59Z"
}
Response Example (Failure):
{
"status": "error",
"message": "License expired"
}
2. LicenseGate Component
File: internal/licensing/license_gate.go
Enhanced license validation with caching, circuit breaker, and cluster lease management for production scalability.
Key Features
- Caching: Stores valid lease tokens to reduce KACHING load
- Circuit Breaker: Prevents cascade failures during KACHING outages
- Grace Period: 90-second startup grace period for transient failures
- Cluster Leases: Supports multi-replica deployments with lease tokens
- Burst Protection: Rate limiting and retry logic
Data Structures
cachedLease
type cachedLease struct {
    LeaseToken string    `json:"lease_token"`
    ExpiresAt  time.Time `json:"expires_at"`
    ClusterID  string    `json:"cluster_id"`
    Valid      bool      `json:"valid"`
    CachedAt   time.Time `json:"cached_at"`
}
Lease Validation:
- Lease considered invalid 2 minutes before actual expiry (safety margin)
- Invalid leases are evicted from cache automatically
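The two-minute safety margin amounts to a single time comparison. A minimal sketch, assuming a trimmed-down `cachedLease` and a hypothetical `usableAt` method (the actual method name in `license_gate.go` is not shown in this document):

```go
package main

import (
	"fmt"
	"time"
)

// leaseSafetyMargin is the documented margin before actual expiry.
const leaseSafetyMargin = 2 * time.Minute

// exampleIssuedAt is an arbitrary fixed instant used for the demonstration.
var exampleIssuedAt = time.Date(2025, 9, 30, 14, 30, 0, 0, time.UTC)

// cachedLease is reduced to the fields the check needs.
type cachedLease struct {
	LeaseToken string
	ExpiresAt  time.Time
	Valid      bool
}

// usableAt reports whether the lease may still be served from cache at now:
// it must be marked valid and now must fall before expiry minus the margin.
func (l cachedLease) usableAt(now time.Time) bool {
	return l.Valid && now.Before(l.ExpiresAt.Add(-leaseSafetyMargin))
}

func main() {
	lease := cachedLease{
		LeaseToken: "lease_def456",
		ExpiresAt:  exampleIssuedAt.Add(60 * time.Minute),
		Valid:      true,
	}
	fmt.Println(lease.usableAt(exampleIssuedAt.Add(57 * time.Minute))) // true
	fmt.Println(lease.usableAt(exampleIssuedAt.Add(58 * time.Minute))) // false
}
```

With a 60-minute lease, the cache therefore stops serving the token at T+58m, which matches the lifecycle described later in this document.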
LeaseRequest
type LeaseRequest struct {
    ClusterID         string `json:"cluster_id"`
    RequestedReplicas int    `json:"requested_replicas"`
    DurationMinutes   int    `json:"duration_minutes"`
}
LeaseResponse
type LeaseResponse struct {
    LeaseToken  string    `json:"lease_token"`
    MaxReplicas int       `json:"max_replicas"`
    ExpiresAt   time.Time `json:"expires_at"`
    ClusterID   string    `json:"cluster_id"`
    LeaseID     string    `json:"lease_id"`
}
LeaseValidationRequest
type LeaseValidationRequest struct {
    LeaseToken string `json:"lease_token"`
    ClusterID  string `json:"cluster_id"`
    AgentID    string `json:"agent_id"`
}
LeaseValidationResponse
type LeaseValidationResponse struct {
    Valid             bool      `json:"valid"`
    RemainingReplicas int       `json:"remaining_replicas"`
    ExpiresAt         time.Time `json:"expires_at"`
}
Circuit Breaker Configuration
breakerSettings := gobreaker.Settings{
    Name:        "license-validation",
    MaxRequests: 3,                // Allow 3 requests in half-open state
    Interval:    60 * time.Second, // Reset failure count every minute
    Timeout:     30 * time.Second, // Stay open for 30 seconds
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.ConsecutiveFailures >= 3 // Trip after 3 failures
    },
    OnStateChange: func(name string, from, to gobreaker.State) {
        fmt.Printf("🔌 License validation circuit breaker: %s -> %s\n", from, to)
    },
}
Circuit Breaker States:
| State | Behavior | Transition |
|---|---|---|
| Closed | Normal operation, requests pass through | 3 consecutive failures → Open |
| Open | All requests fail immediately (30s) | After timeout → Half-Open |
| Half-Open | Allow 3 test requests | Success → Closed, Failure → Open |
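The transitions in the table can be traced with a small standard-library state machine. This is not the gobreaker implementation: the `breaker` type and its methods are hypothetical, the 30-second timeout is modelled as an explicit call, and the sketch assumes the half-open state closes only after three consecutive successes.

```go
package main

import "fmt"

type state int

const (
	closed state = iota
	halfOpen
	open
)

func (s state) String() string {
	return []string{"closed", "half-open", "open"}[s]
}

const (
	failureThreshold = 3 // consecutive failures that trip the breaker
	maxRequests      = 3 // trial requests allowed while half-open
)

// breaker tracks only what the transition table needs.
type breaker struct {
	st        state
	failures  int // consecutive failures while closed
	successes int // consecutive successes while half-open
}

// record applies the outcome of one request that was allowed through.
// Requests are rejected while open, so outcomes are ignored in that state.
func (b *breaker) record(ok bool) {
	switch b.st {
	case closed:
		if ok {
			b.failures = 0
			return
		}
		b.failures++
		if b.failures >= failureThreshold {
			b.st = open
		}
	case halfOpen:
		if !ok {
			b.st, b.successes = open, 0
			return
		}
		b.successes++
		if b.successes >= maxRequests {
			b.st, b.failures, b.successes = closed, 0, 0
		}
	}
}

// timeoutElapsed models the open-state timeout expiring.
func (b *breaker) timeoutElapsed() {
	if b.st == open {
		b.st, b.successes = halfOpen, 0
	}
}

func main() {
	b := &breaker{}
	for i := 0; i < 3; i++ {
		b.record(false)
	}
	fmt.Println(b.st) // open
	b.timeoutElapsed()
	fmt.Println(b.st) // half-open
	for i := 0; i < 3; i++ {
		b.record(true)
	}
	fmt.Println(b.st) // closed
}
```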
Key Methods
NewLicenseGate(config LicenseConfig)
func NewLicenseGate(config LicenseConfig) *LicenseGate
Initializes license gate with:
- Circuit breaker with production settings
- HTTP client with 10-second timeout
- 90-second grace period from startup
Validate(ctx context.Context, agentID string)
func (g *LicenseGate) Validate(ctx context.Context, agentID string) error
Primary validation method:
- Check cache: If valid cached lease exists, validate it
- Cache miss: Request new lease through circuit breaker
- Store result: Cache successful lease for future requests
- Grace period: Allow startup during grace period even if validation fails
- Extend grace: Extend grace period on successful validation
validateCachedLease(ctx, lease, agentID)
func (g *LicenseGate) validateCachedLease(ctx context.Context, lease *cachedLease, agentID string) error
Validates cached lease token:
- POST to `/api/v1/licenses/validate-lease`
- Invalidates cache if validation fails
- Returns error if lease is no longer valid
requestOrRenewLease(ctx)
func (g *LicenseGate) requestOrRenewLease(ctx context.Context) (*LeaseResponse, error)
Requests new cluster lease:
- POST to `/api/v1/licenses/{license_id}/cluster-lease`
- Default: 1 replica, 60-minute duration
- Handles rate limiting (429 Too Many Requests)
- Returns lease token and metadata
GetCacheStats()
func (g *LicenseGate) GetCacheStats() map[string]interface{}
Returns cache statistics for monitoring:
{
"cache_valid": true,
"cache_hit": true,
"expires_at": "2025-09-30T15:30:00Z",
"cached_at": "2025-09-30T14:30:00Z",
"in_grace_period": false,
"breaker_state": "closed",
"grace_until": "2025-09-30T14:31:30Z"
}
KACHING Server Integration
API Endpoints
1. Legacy Activation Endpoint
Endpoint: POST /v1/license/activate
Purpose: Legacy license validation (fallback)
Request:
{
"license_id": "lic_abc123",
"cluster_id": "cluster_xyz789",
"metadata": {
"product": "CHORUS",
"version": "0.1.0-dev",
"container": "true"
}
}
Response (Success):
{
"status": "ok",
"message": "License valid",
"expires_at": "2025-12-31T23:59:59Z"
}
Response (Failure):
{
"status": "error",
"message": "License expired"
}
2. Cluster Lease Endpoint
Endpoint: POST /api/v1/licenses/{license_id}/cluster-lease
Purpose: Request cluster deployment lease
Request:
{
"cluster_id": "cluster_xyz789",
"requested_replicas": 1,
"duration_minutes": 60
}
Response (Success):
{
"lease_token": "lease_def456",
"max_replicas": 5,
"expires_at": "2025-09-30T15:30:00Z",
"cluster_id": "cluster_xyz789",
"lease_id": "lease_def456"
}
Response (Rate Limited):
HTTP 429 Too Many Requests
Retry-After: 60
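A client for this endpoint might look like the sketch below, which runs against an in-process fake server rather than a real KACHING instance. `requestLease` and `newFakeKaching` are hypothetical names, and the error text for non-200, non-429 responses is an assumption; only the endpoint path, payload fields, and the rate-limit message come from this document.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

type LeaseRequest struct {
	ClusterID         string `json:"cluster_id"`
	RequestedReplicas int    `json:"requested_replicas"`
	DurationMinutes   int    `json:"duration_minutes"`
}

type LeaseResponse struct {
	LeaseToken  string    `json:"lease_token"`
	MaxReplicas int       `json:"max_replicas"`
	ExpiresAt   time.Time `json:"expires_at"`
	ClusterID   string    `json:"cluster_id"`
	LeaseID     string    `json:"lease_id"`
}

// requestLease posts a lease request and maps HTTP 429 to an error that
// carries the Retry-After header value.
func requestLease(client *http.Client, baseURL, licenseID string, req LeaseRequest) (*LeaseResponse, error) {
	body, err := json.Marshal(req)
	if err != nil {
		return nil, err
	}
	url := fmt.Sprintf("%s/api/v1/licenses/%s/cluster-lease", baseURL, licenseID)
	resp, err := client.Post(url, "application/json", bytes.NewReader(body))
	if err != nil {
		return nil, fmt.Errorf("unable to contact license authority: %w", err)
	}
	defer resp.Body.Close()

	if resp.StatusCode == http.StatusTooManyRequests {
		return nil, fmt.Errorf("rate limited by KACHING, retry after: %s", resp.Header.Get("Retry-After"))
	}
	if resp.StatusCode != http.StatusOK {
		// Assumed wording; the real message is not documented here.
		return nil, fmt.Errorf("lease request failed with status %d", resp.StatusCode)
	}
	var lease LeaseResponse
	if err := json.NewDecoder(resp.Body).Decode(&lease); err != nil {
		return nil, err
	}
	return &lease, nil
}

// newFakeKaching starts a stand-in server that either grants a lease or
// answers every request with 429 and Retry-After: 60.
func newFakeKaching(rateLimited bool) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if rateLimited {
			w.Header().Set("Retry-After", "60")
			w.WriteHeader(http.StatusTooManyRequests)
			return
		}
		json.NewEncoder(w).Encode(LeaseResponse{
			LeaseToken:  "lease_def456",
			MaxReplicas: 5,
			ExpiresAt:   time.Date(2025, 9, 30, 15, 30, 0, 0, time.UTC),
			ClusterID:   "cluster_xyz789",
			LeaseID:     "lease_def456",
		})
	}))
}

func main() {
	req := LeaseRequest{ClusterID: "cluster_xyz789", RequestedReplicas: 1, DurationMinutes: 60}

	srv := newFakeKaching(false)
	defer srv.Close()
	lease, err := requestLease(srv.Client(), srv.URL, "lic_abc123", req)
	fmt.Println(lease.LeaseToken, lease.MaxReplicas, err)

	limited := newFakeKaching(true)
	defer limited.Close()
	_, err = requestLease(limited.Client(), limited.URL, "lic_abc123", req)
	fmt.Println(err)
}
```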
3. Lease Validation Endpoint
Endpoint: POST /api/v1/licenses/validate-lease
Purpose: Validate lease token for agent startup
Request:
{
"lease_token": "lease_def456",
"cluster_id": "cluster_xyz789",
"agent_id": "agent_001"
}
Response (Success):
{
"valid": true,
"remaining_replicas": 4,
"expires_at": "2025-09-30T15:30:00Z"
}
Response (Invalid):
{
"valid": false,
"remaining_replicas": 0,
"expires_at": "2025-09-30T14:30:00Z"
}
Validation Sequence Diagram
┌─────────┐ ┌───────────┐ ┌──────────────┐ ┌──────────┐
│ CHORUS │ │ Validator │ │ LicenseGate │ │ KACHING │
│ Runtime │ │ │ │ │ │ Server │
└────┬────┘ └─────┬─────┘ └──────┬───────┘ └────┬─────┘
│ │ │ │
│ InitializeRuntime()│ │ │
│───────────────────>│ │ │
│ │ │ │
│ │ Validate() │ │
│ │──────────────────────>│ │
│ │ │ │
│ │ │ Check cache │
│ │ │────────┐ │
│ │ │ │ │
│ │ │<───────┘ │
│ │ │ │
│ │ │ Cache miss │
│ │ │ │
│ │ │ POST /cluster-lease │
│ │ │─────────────────────>│
│ │ │ │
│ │ │ Lease Response │
│ │ │<─────────────────────│
│ │ │ │
│ │ │ POST /validate-lease │
│ │ │─────────────────────>│
│ │ │ │
│ │ │ Validation Response │
│ │ │<─────────────────────│
│ │ │ │
│ │ │ Store in cache │
│ │ │────────┐ │
│ │ │ │ │
│ │ │<───────┘ │
│ │ │ │
│ │ SUCCESS │ │
│ │<──────────────────────│ │
│ │ │ │
│ Continue startup │ │ │
│<───────────────────│ │ │
│ │ │ │
┌─────────────────────────────────────────────────────────────────────┐
│ FAILURE SCENARIO │
└─────────────────────────────────────────────────────────────────────┘
│ │ │ │
│ │ │ POST /validate-lease │
│ │ │─────────────────────>│
│ │ │ │
│ │ │ INVALID LICENSE │
│ │ │<─────────────────────│
│ │ │ │
│ │ │ Check grace period │
│ │ │────────┐ │
│ │ │ │ │
│ │ │<───────┘ │
│ │ │ │
│ │ │ Outside grace period │
│ │ │ │
│ │ ERROR │ │
│ │<──────────────────────│ │
│ │ │ │
│ return error │ │ │
│<───────────────────│ │ │
│ │ │ │
│ EXIT │ │ │
│────────X │ │ │
Error Handling
Error Categories
1. Configuration Errors
Condition: Missing required configuration fields
if v.config.LicenseID == "" || v.config.ClusterID == "" {
    return fmt.Errorf("license ID and cluster ID are required")
}
Result: Immediate validation failure → CHORUS exits
2. Network Errors
Condition: Cannot contact KACHING server
resp, err := v.client.Post(licenseURL, "application/json", bytes.NewReader(requestBody))
if err != nil {
    // FAIL-CLOSED: No network = No license = No operation
    return fmt.Errorf("unable to contact license authority: %w", err)
}
Result:
- Outside grace period: Immediate validation failure → CHORUS exits
- Inside grace period: Log warning, allow startup
Fail-Closed Behavior: Once the grace period has expired, network unavailability is treated as a validation failure; there is no bypass.
3. Invalid License Errors
Condition: KACHING rejects license
if resp.StatusCode != http.StatusOK {
    message := "license validation failed"
    if msg, ok := licenseResponse["message"].(string); ok {
        message = msg
    }
    return fmt.Errorf("license validation failed: %s", message)
}
Possible Messages:
- "License expired"
- "License revoked"
- "License not found"
- "Cluster ID mismatch"
- "Maximum nodes exceeded"
Result: Immediate validation failure → CHORUS exits
4. Rate Limiting Errors
Condition: Too many requests to KACHING
if resp.StatusCode == http.StatusTooManyRequests {
    return nil, fmt.Errorf("rate limited by KACHING, retry after: %s",
        resp.Header.Get("Retry-After"))
}
Result:
- Circuit breaker may trip after repeated rate limiting
- Grace period allows startup if rate limiting is transient
5. Circuit Breaker Errors
Condition: Circuit breaker is open (too many failures)
Result:
- All requests fail immediately
- Grace period allows startup if breaker trips during initialization
- Circuit breaker auto-recovers after timeout (30s)
Error Messages Reference
User-Facing Error Messages
| Error Message | Cause | Resolution |
|---|---|---|
| `license ID and cluster ID are required` | Missing configuration | Set CHORUS_LICENSE_ID and CHORUS_CLUSTER_ID |
| `unable to contact license authority` | Network error | Check KACHING server accessibility |
| `license validation failed: License expired` | Expired license | Renew license with vendor |
| `license validation failed: License revoked` | Revoked license | Contact vendor |
| `license validation failed: Cluster ID mismatch` | Wrong cluster | Use correct cluster configuration |
| `rate limited by KACHING` | Too many requests | Wait for rate limit reset |
| `lease token is invalid` | Expired or invalid lease | System will auto-request new lease |
| `lease validation failed with status 404` | Lease not found | System will auto-request new lease |
| `License validation failed but in grace period` | Transient failure during startup | System continues with warning |
Grace Period Mechanism
Purpose
The grace period allows CHORUS to start even when license validation temporarily fails, preventing service disruption due to transient network issues or KACHING server maintenance.
Behavior
- Duration: 90 seconds from startup
- Triggered: When validation fails but grace period is active
- Effect: Validation returns success with warning log
- Extension: Each successful validation resets the grace period to end 90s after that validation
- Expiry: After grace period expires, validation failures cause immediate exit
Grace Period States
┌──────────────────────────────────────────────────────────────┐
│ Grace Period Timeline │
└──────────────────────────────────────────────────────────────┘
T+0s ┌─────────────────────────────────────────────┐
│ GRACE PERIOD ACTIVE (90s) │
│ Validation failures allowed with warning │
└─────────────────────────────────────────────┘
T+30s │ Validation SUCCESS │
└──> Grace period extended to T+120s │
T+90s │ Grace period expires (no successful validation)
└──> Next validation failure causes exit │
T+120s │ (Extended) Grace period expires
└──> Next validation failure causes exit │
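The timeline can be replayed with an injected clock. The sketch below is a stand-alone illustration: `graceTracker` and its methods are hypothetical stand-ins for the grace-period fields of `LicenseGate`, and times are passed in explicitly so the result does not depend on the wall clock.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// gracePeriod is the documented 90-second window.
const gracePeriod = 90 * time.Second

// exampleStart is an arbitrary fixed instant standing in for process start.
var exampleStart = time.Date(2025, 9, 30, 14, 30, 0, 0, time.UTC)

// graceTracker stores the end of the grace window atomically so it can be
// read and updated from concurrent validations.
type graceTracker struct {
	until atomic.Value // holds a time.Time
}

func newGraceTracker(start time.Time) *graceTracker {
	g := &graceTracker{}
	g.until.Store(start.Add(gracePeriod))
	return g
}

// active reports whether now still falls inside the grace window.
func (g *graceTracker) active(now time.Time) bool {
	return now.Before(g.until.Load().(time.Time))
}

// extend resets the window to end one grace period after now; it is meant
// to be called after a successful validation.
func (g *graceTracker) extend(now time.Time) {
	g.until.Store(now.Add(gracePeriod))
}

func main() {
	g := newGraceTracker(exampleStart)
	fmt.Println(g.active(exampleStart.Add(89 * time.Second))) // true

	// Successful validation at T+30s moves the end of the window to T+120s.
	g.extend(exampleStart.Add(30 * time.Second))
	fmt.Println(g.active(exampleStart.Add(100 * time.Second))) // true
	fmt.Println(g.active(exampleStart.Add(120 * time.Second))) // false
}
```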
Implementation
// Initialize grace period at startup
func NewLicenseGate(config LicenseConfig) *LicenseGate {
    gate := &LicenseGate{...}
    gate.graceUntil.Store(time.Now().Add(90 * time.Second))
    return gate
}

// Check grace period during validation
if err != nil {
    if g.isInGracePeriod() {
        fmt.Printf("⚠️ License validation failed but in grace period: %v\n", err)
        return nil // Allow startup
    }
    return fmt.Errorf("license validation failed: %w", err)
}

// Extend grace period on success
g.extendGracePeriod() // Adds 90s to current time
Startup Integration
Location
File: internal/runtime/shared.go
Function: InitializeRuntime()
Integration Point
func InitializeRuntime(cfg *config.CHORUSConfig) (*RuntimeContext, error) {
    // ... early initialization ...

    // CRITICAL: Validate license before any P2P operations
    runtime.Logger.Info("🔐 Validating CHORUS license with KACHING...")
    licenseValidator := licensing.NewValidator(licensing.LicenseConfig{
        LicenseID:  cfg.License.LicenseID,
        ClusterID:  cfg.License.ClusterID,
        KachingURL: cfg.License.KachingURL,
    })
    if err := licenseValidator.Validate(); err != nil {
        // This error causes InitializeRuntime to return error
        // which causes main() to exit immediately
        return nil, fmt.Errorf("license validation failed: %v", err)
    }
    runtime.Logger.Info("✅ License validation successful - CHORUS authorized to run")

    // ... continue with P2P, AI provider initialization, etc ...
}
Execution Order
1. Load configuration from YAML
2. Initialize logger
3. ⚠️ VALIDATE LICENSE ⚠️
└─→ FAILURE → return error → main() exits
4. Initialize AI provider
5. Initialize metrics collector
6. Initialize SHHH sentinel
7. Initialize P2P network
8. Start HAP server
9. Enter main runtime loop
Critical Note: License validation occurs BEFORE any P2P networking or AI provider initialization. If validation fails, no network connections are made and no services are started.
Configuration Examples
Minimal Configuration
license:
license_id: "lic_abc123"
cluster_id: "cluster_xyz789"
KACHING URL defaults to http://localhost:8083
Production Configuration
license:
license_id: "lic_prod_abc123"
cluster_id: "cluster_production_xyz789"
kaching_url: "https://kaching.chorus.services"
organization_name: "Acme Corporation"
license_type: "enterprise"
max_nodes: 10
Development Configuration
license:
license_id: "lic_dev_abc123"
cluster_id: "cluster_dev_local"
kaching_url: "http://localhost:8083"
organization_name: "Development Team"
license_type: "developer"
max_nodes: 1
Environment Variables
Licensing configuration can also be set via environment variables:
export CHORUS_LICENSE_ID="lic_abc123"
export CHORUS_CLUSTER_ID="cluster_xyz789"
export CHORUS_KACHING_URL="http://localhost:8083"
Monitoring and Observability
Log Messages
Successful Validation
🔐 Validating CHORUS license with KACHING...
✅ License validation successful - CHORUS authorized to run
Validation with Cached Lease
🔐 Validating CHORUS license with KACHING...
[Using cached lease token: lease_def456]
✅ License validation successful - CHORUS authorized to run
Validation During Grace Period
🔐 Validating CHORUS license with KACHING...
⚠️ License validation failed but in grace period: unable to contact license authority
✅ License validation successful - CHORUS authorized to run
Circuit Breaker State Changes
🔌 License validation circuit breaker: closed -> open
🔌 License validation circuit breaker: open -> half-open
🔌 License validation circuit breaker: half-open -> closed
Validation Failure (Fatal)
🔐 Validating CHORUS license with KACHING...
❌ License validation failed: License expired
Error: license validation failed: License expired
[CHORUS exits]
Cache Statistics API
stats := licenseGate.GetCacheStats()
Returns:
{
"cache_valid": true,
"cache_hit": true,
"expires_at": "2025-09-30T15:30:00Z",
"cached_at": "2025-09-30T14:30:00Z",
"in_grace_period": false,
"breaker_state": "closed",
"grace_until": "2025-09-30T14:31:30Z"
}
Recommended Monitoring Metrics
| Metric | Type | Description |
|---|---|---|
| `license_validation_success` | Counter | Successful validations |
| `license_validation_failure` | Counter | Failed validations |
| `license_validation_duration_ms` | Histogram | Validation latency |
| `license_cache_hit_rate` | Gauge | Percentage of cache hits |
| `license_grace_period_active` | Gauge | 1 if in grace period, 0 otherwise |
| `license_circuit_breaker_state` | Gauge | 0=closed, 1=half-open, 2=open |
| `license_lease_expiry_seconds` | Gauge | Seconds until lease expiry |
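Several of these gauges can be derived directly from the values that `GetCacheStats()` reports. The helpers below are a sketch: the function names are hypothetical, the stats fields are assumed to arrive as the strings and booleans shown in the JSON example, and wiring the results into a metrics library is left out.

```go
package main

import (
	"fmt"
	"time"
)

// exampleNow is an arbitrary fixed instant used for the demonstration.
var exampleNow = time.Date(2025, 9, 30, 14, 30, 0, 0, time.UTC)

// breakerGauge maps a breaker state name to the documented gauge encoding.
// Unknown names map to -1 so that they stand out on a dashboard.
func breakerGauge(state string) float64 {
	switch state {
	case "closed":
		return 0
	case "half-open":
		return 1
	case "open":
		return 2
	}
	return -1
}

// boolGauge converts a flag such as in_grace_period to 1 or 0.
func boolGauge(b bool) float64 {
	if b {
		return 1
	}
	return 0
}

// leaseExpirySeconds parses an RFC 3339 expires_at value and returns the
// number of seconds between now and that instant (negative once expired).
func leaseExpirySeconds(expiresAt string, now time.Time) (float64, error) {
	t, err := time.Parse(time.RFC3339, expiresAt)
	if err != nil {
		return 0, err
	}
	return t.Sub(now).Seconds(), nil
}

func main() {
	fmt.Println(breakerGauge("closed"), boolGauge(false))
	secs, err := leaseExpirySeconds("2025-09-30T15:30:00Z", exampleNow)
	fmt.Println(secs, err)
}
```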
Cluster Lease Management
Lease Lifecycle
┌──────────────────────────────────────────────────────────────────┐
│ Cluster Lease Lifecycle │
└──────────────────────────────────────────────────────────────────┘
1. REQUEST LEASE
├─→ POST /api/v1/licenses/{license_id}/cluster-lease
├─→ cluster_id: "cluster_xyz789"
├─→ requested_replicas: 1
└─→ duration_minutes: 60
2. RECEIVE LEASE
├─→ lease_token: "lease_def456"
├─→ max_replicas: 5
├─→ expires_at: T+60m
└─→ Store in cache
3. USE LEASE (per agent startup)
├─→ POST /api/v1/licenses/validate-lease
├─→ lease_token: "lease_def456"
├─→ cluster_id: "cluster_xyz789"
├─→ agent_id: "agent_001"
└─→ Decrements remaining_replicas
4. LEASE EXPIRY
├─→ Cache invalidated at T+58m (2min safety margin)
└─→ Next validation requests new lease
5. LEASE RENEWAL
└─→ Automatic on cache invalidation
Multi-Replica Support
The lease system supports multiple CHORUS agent replicas:
- max_replicas: Maximum concurrent agents allowed
- remaining_replicas: Available agent slots
- agent_id: Unique identifier for each agent instance
Example: License allows 5 replicas
- Request lease → `max_replicas: 5`
- Agent 1 validates → `remaining_replicas: 4`
- Agent 2 validates → `remaining_replicas: 3`
- Agent 6 validates → FAILURE (exceeds max_replicas)
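The slot accounting in this example can be simulated locally. The sketch models what the KACHING server is described as doing; it is not KACHING code. The `lease` type, `errNoReplicas`, and the choice that re-validating a known agent does not consume a second slot are all assumptions made for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoReplicas is a hypothetical error for an exhausted lease.
var errNoReplicas = errors.New("no remaining replicas")

// lease tracks which agents currently hold a replica slot.
type lease struct {
	maxReplicas int
	agents      map[string]bool
}

func newLease(maxReplicas int) *lease {
	return &lease{maxReplicas: maxReplicas, agents: make(map[string]bool)}
}

// validate admits agentID if a slot is free and returns the number of
// slots left. An agent that already holds a slot keeps it (assumption).
func (l *lease) validate(agentID string) (int, error) {
	if !l.agents[agentID] {
		if len(l.agents) >= l.maxReplicas {
			return 0, errNoReplicas
		}
		l.agents[agentID] = true
	}
	return l.maxReplicas - len(l.agents), nil
}

func main() {
	l := newLease(5)
	for i := 1; i <= 6; i++ {
		id := fmt.Sprintf("agent_%03d", i)
		remaining, err := l.validate(id)
		if err != nil {
			fmt.Printf("%s: FAILURE (%v)\n", id, err)
			continue
		}
		fmt.Printf("%s: remaining_replicas=%d\n", id, remaining)
	}
}
```

Running it prints remaining counts of 4 down to 0 for the first five agents and a failure for the sixth.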
Security Considerations
Fail-Closed Architecture
The licensing system implements fail-closed security:
- ✅ Network unavailable → Validation fails → CHORUS exits (unless in grace period)
- ✅ KACHING server down → Validation fails → CHORUS exits (unless in grace period)
- ✅ Invalid license → Validation fails → CHORUS exits (no grace period)
- ✅ Expired license → Validation fails → CHORUS exits (no grace period)
- ❌ No "development mode" bypass
- ❌ No "skip validation" flag
Grace Period Security
The grace period is designed for transient failures, NOT as a bypass:
- Limited to 90 seconds initially
- Only extends on successful validation
- Does NOT apply to invalid/expired licenses
- Primarily for network/KACHING server availability issues
License Token Security
- Lease tokens are short-lived (default: 60 minutes)
- Tokens cached in memory only (not persisted to disk)
- Tokens include cluster_id binding (cannot be used by other clusters)
- Agent ID tracking prevents token sharing between agents
Network Security
- HTTPS recommended for production KACHING URLs
- HTTP client timeouts (30 seconds for the Validator, 10 seconds for LicenseGate) prevent hanging on network issues
- Circuit breaker prevents cascade failures
Troubleshooting
Issue: "license ID and cluster ID are required"
Cause: Missing configuration
Resolution:
# config.yml
license:
license_id: "your_license_id"
cluster_id: "your_cluster_id"
Or via environment:
export CHORUS_LICENSE_ID="your_license_id"
export CHORUS_CLUSTER_ID="your_cluster_id"
Issue: "unable to contact license authority"
Cause: KACHING server unreachable
Resolution:
- Verify KACHING server is running
- Check network connectivity: `curl http://localhost:8083/health`
- Verify `kaching_url` configuration
- Check firewall rules
- If transient, grace period allows startup
Issue: "license validation failed: License expired"
Cause: License has expired
Resolution:
- Contact license vendor to renew
- Update license_id in configuration
- Restart CHORUS
Note: Grace period does NOT apply to expired licenses
Issue: "rate limited by KACHING"
Cause: Too many validation requests
Resolution:
- Check for rapid restart loops
- Verify cache is working (should reduce requests)
- Wait for rate limit reset (check Retry-After header)
- Consider increasing lease duration_minutes
Issue: Circuit breaker stuck in "open" state
Cause: Repeated validation failures
Resolution:
- Check KACHING server health
- Verify license configuration
- Circuit breaker auto-recovers after 30 seconds
- Check grace period status: may allow startup during recovery
Issue: "lease token is invalid"
Cause: Lease expired or revoked
Resolution:
- System should auto-request new lease
- If persistent, check license status with vendor
- Verify cluster_id matches license configuration
Testing
Unit Testing
// Test license validation success
func TestValidatorSuccess(t *testing.T) {
    // Mock KACHING server
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
        json.NewEncoder(w).Encode(map[string]interface{}{
            "status":  "ok",
            "message": "License valid",
        })
    }))
    defer server.Close()

    validator := licensing.NewValidator(licensing.LicenseConfig{
        LicenseID:  "test_license",
        ClusterID:  "test_cluster",
        KachingURL: server.URL,
    })
    err := validator.Validate()
    assert.NoError(t, err)
}

// Test license validation failure
func TestValidatorFailure(t *testing.T) {
    // Mock KACHING server that rejects the license
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusForbidden)
        json.NewEncoder(w).Encode(map[string]interface{}{
            "status":  "error",
            "message": "License expired",
        })
    }))
    defer server.Close()

    validator := licensing.NewValidator(licensing.LicenseConfig{
        LicenseID:  "test_license",
        ClusterID:  "test_cluster",
        KachingURL: server.URL,
    })
    err := validator.Validate()
    assert.Error(t, err)
    assert.Contains(t, err.Error(), "License expired")
}
Integration Testing
# Start KACHING test server
docker run -p 8083:8083 kaching:latest
# Test CHORUS startup with valid license
export CHORUS_LICENSE_ID="test_lic_123"
export CHORUS_CLUSTER_ID="test_cluster"
./chorus-agent
# Expected output:
# 🔐 Validating CHORUS license with KACHING...
# ✅ License validation successful - CHORUS authorized to run
# Test CHORUS startup with invalid license
export CHORUS_LICENSE_ID="invalid_license"
./chorus-agent
# Expected output:
# 🔐 Validating CHORUS license with KACHING...
# ❌ License validation failed: License not found
# Error: license validation failed: License not found
# [Exit code 1]
Future Enhancements
Planned Features
- Offline License Support
  - JWT-based license files for air-gapped deployments
  - Signature verification without KACHING connectivity
- License Renewal Automation
  - Background renewal of expiring licenses
  - Alert system for upcoming expirations
- Multi-License Support
  - Support for multiple license tiers
  - Feature flag based on license type
- License Analytics
  - Usage metrics reporting to KACHING
  - License utilization dashboards
- Enhanced Lease Management
  - Lease renewal before expiry
  - Dynamic replica scaling based on license
API Constants
Timeouts
const (
    DefaultKachingURL = "http://localhost:8083"
    LicenseTimeout    = 30 * time.Second // Validator HTTP timeout
    GateCTimeout      = 10 * time.Second // LicenseGate HTTP timeout
)
Grace Period
const (
    GracePeriodDuration = 90 * time.Second
)
Circuit Breaker
const (
    MaxRequests          = 3                // Half-open state test requests
    FailureThreshold     = 3                // Consecutive failures to trip
    CircuitTimeout       = 30 * time.Second // Open state duration
    FailureResetInterval = 60 * time.Second // Failure count reset
)
Lease Safety Margin
const (
    LeaseSafetyMargin = 2 * time.Minute // Cache invalidation before expiry
)
Related Documentation
- KACHING License Server: See KACHING documentation for server setup and API details
- CHORUS Configuration: `/docs/comprehensive/pkg/config.md`
- CHORUS Runtime: `/docs/comprehensive/internal/runtime.md`
- Deployment Guide: `/docs/deployment.md`
Summary
The CHORUS licensing system provides robust, fail-closed license enforcement through integration with the KACHING license authority. Key characteristics:
- Mandatory: License validation is required at startup
- Fail-Closed: Invalid license or network failure prevents startup (outside grace period)
- Cached: Lease tokens cached to reduce KACHING load
- Resilient: Circuit breaker and grace period handle transient failures
- Scalable: Cluster lease system supports multi-replica deployments
- Secure: No bypass mechanisms, short-lived tokens, cluster binding
The system ensures that all running CHORUS instances are properly licensed while providing operational flexibility through caching and grace periods for transient failures.