 c5b7311a8b
			
		
	
	c5b7311a8b
	
	
	
		
			
			Comprehensive documentation for coordination, messaging, discovery, and internal systems. Core Coordination Packages: - pkg/election - Democratic leader election (uptime-based, heartbeat mechanism, SLURP integration) - pkg/coordination - Meta-coordination with dependency detection (4 built-in rules) - coordinator/ - Task orchestration and assignment (AI-powered scoring) - discovery/ - mDNS peer discovery (automatic LAN detection) Messaging & P2P Infrastructure: - pubsub/ - GossipSub messaging (31 message types, role-based topics, HMMM integration) - p2p/ - libp2p networking (DHT modes, connection management, security) Monitoring & Health: - pkg/metrics - Prometheus metrics (80+ metrics across 12 categories) - pkg/health - Health monitoring (4 HTTP endpoints, enhanced checks, graceful degradation) Internal Systems: - internal/licensing - License validation (KACHING integration, cluster leases, fail-closed) - internal/hapui - Human Agent Portal UI (9 commands, HMMM wizard, UCXL browser, decision voting) - internal/backbeat - P2P operation telemetry (6 phases, beat synchronization, health reporting) Documentation Statistics (Phase 3): - 10 packages documented (~18,000 lines) - 31 PubSub message types cataloged - 80+ Prometheus metrics documented - Complete API references with examples - Integration patterns and best practices Key Features Documented: - Election: 5 triggers, candidate scoring (5 weighted components), stability windows - Coordination: AI-powered dependency detection, cross-repo sessions, escalation handling - PubSub: Topic patterns, message envelopes, SHHH redaction, Hypercore logging - Metrics: All metric types with labels, Prometheus scrape config, alert rules - Health: Liveness vs readiness, critical checks, Kubernetes integration - Licensing: Grace periods, circuit breaker, cluster lease management - HAP UI: Interactive terminal commands, HMMM composition wizard, web interface (beta) - BACKBEAT: 6-phase operation tracking, beat budget estimation, drift detection Implementation Status Marked: - ✅ Production: Election, metrics, health, licensing, pubsub, p2p, discovery, coordinator - 🔶 Beta: HAP web interface, BACKBEAT telemetry, advanced coordination - 🔷 Alpha: SLURP election scoring - ⚠️ Experimental: Meta-coordination, AI-powered dependency detection Progress: 22/62 files complete (35%) Next Phase: AI providers, SLURP system, API layer, reasoning engine 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
		
			
				
	
	
	
		
			40 KiB
		
	
	
	
	
	
	
	
			
		
		
	
	CHORUS License Validation System
Package: internal/licensing
Purpose: KACHING license authority integration with fail-closed validation
Critical: License validation is MANDATORY at startup - invalid license = immediate exit
Overview
The CHORUS licensing system enforces software licensing through integration with the KACHING license authority. The system implements a fail-closed security model: if license validation fails, CHORUS will not start. This ensures that all running instances are properly licensed and authorized.
Key Components
- Validator: Core license validation with KACHING server communication
- LicenseGate: Enhanced validation with caching, circuit breaker, and cluster lease management
- LicenseConfig: Configuration structure for licensing parameters
Security Model: FAIL-CLOSED
┌─────────────────────────────────────────────────────────────┐
│                    CHORUS STARTUP                            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Load Configuration                                       │
│  2. Initialize Logger                                        │
│  3. ⚠️  VALIDATE LICENSE (CRITICAL GATE)                     │
│     │                                                        │
│     ├─── SUCCESS ──→ Continue startup                       │
│     │                                                        │
│     └─── FAILURE ──→ return error → IMMEDIATE EXIT          │
│                                                              │
│  4. Initialize AI Provider                                   │
│  5. Start P2P Network                                        │
│  6. ... rest of initialization                               │
│                                                              │
└─────────────────────────────────────────────────────────────┘
NO BYPASS: License validation cannot be skipped or bypassed.
Architecture
1. LicenseConfig Structure
Location: internal/licensing/validator.go and pkg/config/config.go
// Core licensing configuration
type LicenseConfig struct {
    LicenseID   string    // Unique license identifier
    ClusterID   string    // Cluster/deployment identifier
    KachingURL  string    // KACHING server URL
}
// Extended configuration in pkg/config
type LicenseConfig struct {
    LicenseID        string    `yaml:"license_id"`
    ClusterID        string    `yaml:"cluster_id"`
    OrganizationName string    `yaml:"organization_name"`
    KachingURL       string    `yaml:"kaching_url"`
    IsActive         bool      `yaml:"is_active"`
    LastValidated    time.Time `yaml:"last_validated"`
    GracePeriodHours int       `yaml:"grace_period_hours"`
    LicenseType      string    `yaml:"license_type"`
    ExpiresAt        time.Time `yaml:"expires_at"`
    MaxNodes         int       `yaml:"max_nodes"`
}
Configuration Fields:
| Field | Required | Purpose | 
|---|---|---|
| LicenseID | ✅ Yes | Unique identifier for the license | 
| ClusterID | ✅ Yes | Identifies the cluster/deployment | 
| KachingURL | No | KACHING server URL (defaults to http://localhost:8083) | 
| OrganizationName | No | Organization name for tracking | 
| LicenseType | No | Type of license (e.g., "enterprise", "developer") | 
| ExpiresAt | No | License expiration timestamp | 
| MaxNodes | No | Maximum nodes allowed in cluster | 
Validation Flow
Standard Validation Sequence
┌──────────────────────────────────────────────────────────────────┐
│                    License Validation Flow                        │
└──────────────────────────────────────────────────────────────────┘
1. NewValidator(config) → Initialize Validator
   │
   ├─→ Set KachingURL (default: http://localhost:8083)
   ├─→ Create HTTP client (timeout: 30s)
   └─→ Initialize LicenseGate
       │
       └─→ Initialize Circuit Breaker
           └─→ Set Grace Period (90 seconds from start)
2. Validate() → Perform validation
   │
   ├─→ ValidateWithContext(ctx)
   │   │
   │   ├─→ Check required fields (LicenseID, ClusterID)
   │   │
   │   ├─→ LicenseGate.Validate(ctx, agentID)
   │   │   │
   │   │   ├─→ Check cached lease (if valid, use it)
   │   │   │   ├─→ validateCachedLease()
   │   │   │   │   ├─→ POST /api/v1/licenses/validate-lease
   │   │   │   │   ├─→ Include: lease_token, cluster_id, agent_id
   │   │   │   │   └─→ Response: valid, remaining_replicas, expires_at
   │   │   │   │
   │   │   │   └─→ Cache hit? → SUCCESS
   │   │   │
   │   │   ├─→ Cache miss? → Request new lease
   │   │   │   │
   │   │   │   ├─→ breaker.Execute() [Circuit Breaker]
   │   │   │   │   │
   │   │   │   │   ├─→ requestOrRenewLease()
   │   │   │   │   │   ├─→ POST /api/v1/licenses/{id}/cluster-lease
   │   │   │   │   │   ├─→ Request: cluster_id, requested_replicas, duration_minutes
   │   │   │   │   │   └─→ Response: lease_token, max_replicas, expires_at, lease_id
   │   │   │   │   │
   │   │   │   │   ├─→ validateLease(lease, agentID)
   │   │   │   │   │   └─→ POST /api/v1/licenses/validate-lease
   │   │   │   │   │
   │   │   │   │   └─→ storeLease() → Cache the valid lease
   │   │   │   │
   │   │   │   └─→ Extend grace period (90s)
   │   │   │
   │   │   └─→ Validation failed?
   │   │       │
   │   │       ├─→ In grace period? → Log warning, ALLOW startup
   │   │       └─→ Outside grace period? → RETURN ERROR
   │   │
   │   └─→ Fallback to validateLegacy() on LicenseGate failure
   │       │
   │       ├─→ POST /v1/license/activate
   │       ├─→ Request: license_id, cluster_id, metadata
   │       └─→ Response: validation result
   │
   └─→ Return validation result
3. Result Handling (in runtime/shared.go)
   │
   ├─→ SUCCESS → Log "✅ License validation successful"
   │            → Continue initialization
   │
   └─→ FAILURE → return error → CHORUS EXITS IMMEDIATELY
Component Details
1. Validator Component
File: internal/licensing/validator.go
The Validator is the primary component for license validation, providing communication with the KACHING license authority.
Key Methods
NewValidator(config LicenseConfig)
func NewValidator(config LicenseConfig) *Validator
Creates a new license validator with:
- HTTP client with 30-second timeout
- Default KACHING URL if not specified
- Initialized LicenseGate for enhanced validation
Validate()
func (v *Validator) Validate() error
Performs license validation with KACHING authority:
- Validates required configuration fields
- Uses LicenseGate for cached/enhanced validation
- Falls back to legacy validation if needed
- Returns error if validation fails
validateLegacy()
func (v *Validator) validateLegacy() error
Legacy validation method (fallback):
- Direct HTTP POST to /v1/license/activate
- Sends license metadata (product, version, container flag)
- Fail-closed: Network error = validation failure
- Parses and validates response status
Request Example:
{
  "license_id": "lic_abc123",
  "cluster_id": "cluster_xyz789",
  "metadata": {
    "product": "CHORUS",
    "version": "0.1.0-dev",
    "container": "true"
  }
}
Response Example (Success):
{
  "status": "ok",
  "message": "License valid",
  "expires_at": "2025-12-31T23:59:59Z"
}
Response Example (Failure):
{
  "status": "error",
  "message": "License expired"
}
2. LicenseGate Component
File: internal/licensing/license_gate.go
Enhanced license validation with caching, circuit breaker, and cluster lease management for production scalability.
Key Features
- Caching: Stores valid lease tokens to reduce KACHING load
- Circuit Breaker: Prevents cascade failures during KACHING outages
- Grace Period: 90-second startup grace period for transient failures
- Cluster Leases: Supports multi-replica deployments with lease tokens
- Burst Protection: Rate limiting and retry logic
Data Structures
cachedLease
type cachedLease struct {
    LeaseToken string    `json:"lease_token"`
    ExpiresAt  time.Time `json:"expires_at"`
    ClusterID  string    `json:"cluster_id"`
    Valid      bool      `json:"valid"`
    CachedAt   time.Time `json:"cached_at"`
}
Lease Validation:
- Lease considered invalid 2 minutes before actual expiry (safety margin)
- Invalid leases are evicted from cache automatically
LeaseRequest
type LeaseRequest struct {
    ClusterID         string `json:"cluster_id"`
    RequestedReplicas int    `json:"requested_replicas"`
    DurationMinutes   int    `json:"duration_minutes"`
}
LeaseResponse
type LeaseResponse struct {
    LeaseToken   string    `json:"lease_token"`
    MaxReplicas  int       `json:"max_replicas"`
    ExpiresAt    time.Time `json:"expires_at"`
    ClusterID    string    `json:"cluster_id"`
    LeaseID      string    `json:"lease_id"`
}
LeaseValidationRequest
type LeaseValidationRequest struct {
    LeaseToken string `json:"lease_token"`
    ClusterID  string `json:"cluster_id"`
    AgentID    string `json:"agent_id"`
}
LeaseValidationResponse
type LeaseValidationResponse struct {
    Valid             bool      `json:"valid"`
    RemainingReplicas int       `json:"remaining_replicas"`
    ExpiresAt         time.Time `json:"expires_at"`
}
Circuit Breaker Configuration
breakerSettings := gobreaker.Settings{
    Name:        "license-validation",
    MaxRequests: 3,              // Allow 3 requests in half-open state
    Interval:    60 * time.Second, // Reset failure count every minute
    Timeout:     30 * time.Second, // Stay open for 30 seconds
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.ConsecutiveFailures >= 3  // Trip after 3 failures
    },
    OnStateChange: func(name string, from, to gobreaker.State) {
        fmt.Printf("🔌 License validation circuit breaker: %s -> %s\n", from, to)
    },
}
Circuit Breaker States:
| State | Behavior | Transition | 
|---|---|---|
| Closed | Normal operation, requests pass through | 3 consecutive failures → Open | 
| Open | All requests fail immediately (30s) | After timeout → Half-Open | 
| Half-Open | Allow 3 test requests | Success → Closed, Failure → Open | 
Key Methods
NewLicenseGate(config LicenseConfig)
func NewLicenseGate(config LicenseConfig) *LicenseGate
Initializes license gate with:
- Circuit breaker with production settings
- HTTP client with 10-second timeout
- 90-second grace period from startup
Validate(ctx context.Context, agentID string)
func (g *LicenseGate) Validate(ctx context.Context, agentID string) error
Primary validation method:
- Check cache: If valid cached lease exists, validate it
- Cache miss: Request new lease through circuit breaker
- Store result: Cache successful lease for future requests
- Grace period: Allow startup during grace period even if validation fails
- Extend grace: Extend grace period on successful validation
validateCachedLease(ctx, lease, agentID)
func (g *LicenseGate) validateCachedLease(ctx context.Context, lease *cachedLease, agentID string) error
Validates cached lease token:
- POST to /api/v1/licenses/validate-lease
- Invalidates cache if validation fails
- Returns error if lease is no longer valid
requestOrRenewLease(ctx)
func (g *LicenseGate) requestOrRenewLease(ctx context.Context) (*LeaseResponse, error)
Requests new cluster lease:
- POST to /api/v1/licenses/{license_id}/cluster-lease
- Default: 1 replica, 60-minute duration
- Handles rate limiting (429 Too Many Requests)
- Returns lease token and metadata
GetCacheStats()
func (g *LicenseGate) GetCacheStats() map[string]interface{}
Returns cache statistics for monitoring:
{
  "cache_valid": true,
  "cache_hit": true,
  "expires_at": "2025-09-30T15:30:00Z",
  "cached_at": "2025-09-30T14:30:00Z",
  "in_grace_period": false,
  "breaker_state": "closed",
  "grace_until": "2025-09-30T14:31:30Z"
}
KACHING Server Integration
API Endpoints
1. Legacy Activation Endpoint
Endpoint: POST /v1/license/activate
Purpose: Legacy license validation (fallback)
Request:
{
  "license_id": "lic_abc123",
  "cluster_id": "cluster_xyz789",
  "metadata": {
    "product": "CHORUS",
    "version": "0.1.0-dev",
    "container": "true"
  }
}
Response (Success):
{
  "status": "ok",
  "message": "License valid",
  "expires_at": "2025-12-31T23:59:59Z"
}
Response (Failure):
{
  "status": "error",
  "message": "License expired"
}
2. Cluster Lease Endpoint
Endpoint: POST /api/v1/licenses/{license_id}/cluster-lease
Purpose: Request cluster deployment lease
Request:
{
  "cluster_id": "cluster_xyz789",
  "requested_replicas": 1,
  "duration_minutes": 60
}
Response (Success):
{
  "lease_token": "lease_def456",
  "max_replicas": 5,
  "expires_at": "2025-09-30T15:30:00Z",
  "cluster_id": "cluster_xyz789",
  "lease_id": "lease_def456"
}
Response (Rate Limited):
HTTP 429 Too Many Requests
Retry-After: 60
3. Lease Validation Endpoint
Endpoint: POST /api/v1/licenses/validate-lease
Purpose: Validate lease token for agent startup
Request:
{
  "lease_token": "lease_def456",
  "cluster_id": "cluster_xyz789",
  "agent_id": "agent_001"
}
Response (Success):
{
  "valid": true,
  "remaining_replicas": 4,
  "expires_at": "2025-09-30T15:30:00Z"
}
Response (Invalid):
{
  "valid": false,
  "remaining_replicas": 0,
  "expires_at": "2025-09-30T14:30:00Z"
}
Validation Sequence Diagram
┌─────────┐         ┌───────────┐         ┌──────────────┐         ┌──────────┐
│ CHORUS  │         │ Validator │         │ LicenseGate  │         │ KACHING  │
│ Runtime │         │           │         │              │         │  Server  │
└────┬────┘         └─────┬─────┘         └──────┬───────┘         └────┬─────┘
     │                    │                       │                      │
     │ InitializeRuntime()│                       │                      │
     │───────────────────>│                       │                      │
     │                    │                       │                      │
     │                    │ Validate()            │                      │
     │                    │──────────────────────>│                      │
     │                    │                       │                      │
     │                    │                       │ Check cache          │
     │                    │                       │────────┐             │
     │                    │                       │        │             │
     │                    │                       │<───────┘             │
     │                    │                       │                      │
     │                    │                       │ Cache miss           │
     │                    │                       │                      │
     │                    │                       │ POST /cluster-lease  │
     │                    │                       │─────────────────────>│
     │                    │                       │                      │
     │                    │                       │    Lease Response    │
     │                    │                       │<─────────────────────│
     │                    │                       │                      │
     │                    │                       │ POST /validate-lease │
     │                    │                       │─────────────────────>│
     │                    │                       │                      │
     │                    │                       │  Validation Response │
     │                    │                       │<─────────────────────│
     │                    │                       │                      │
     │                    │                       │ Store in cache       │
     │                    │                       │────────┐             │
     │                    │                       │        │             │
     │                    │                       │<───────┘             │
     │                    │                       │                      │
     │                    │   SUCCESS             │                      │
     │                    │<──────────────────────│                      │
     │                    │                       │                      │
     │   Continue startup │                       │                      │
     │<───────────────────│                       │                      │
     │                    │                       │                      │
┌─────────────────────────────────────────────────────────────────────┐
│                        FAILURE SCENARIO                              │
└─────────────────────────────────────────────────────────────────────┘
     │                    │                       │                      │
     │                    │                       │ POST /validate-lease │
     │                    │                       │─────────────────────>│
     │                    │                       │                      │
     │                    │                       │    INVALID LICENSE   │
     │                    │                       │<─────────────────────│
     │                    │                       │                      │
     │                    │                       │ Check grace period   │
     │                    │                       │────────┐             │
     │                    │                       │        │             │
     │                    │                       │<───────┘             │
     │                    │                       │                      │
     │                    │                       │ Outside grace period │
     │                    │                       │                      │
     │                    │   ERROR               │                      │
     │                    │<──────────────────────│                      │
     │                    │                       │                      │
     │   return error     │                       │                      │
     │<───────────────────│                       │                      │
     │                    │                       │                      │
     │ EXIT               │                       │                      │
     │────────X           │                       │                      │
Error Handling
Error Categories
1. Configuration Errors
Condition: Missing required configuration fields
if v.config.LicenseID == "" || v.config.ClusterID == "" {
    return fmt.Errorf("license ID and cluster ID are required")
}
Result: Immediate validation failure → CHORUS exits
2. Network Errors
Condition: Cannot contact KACHING server
resp, err := v.client.Post(licenseURL, "application/json", bytes.NewReader(requestBody))
if err != nil {
    // FAIL-CLOSED: No network = No license = No operation
    return fmt.Errorf("unable to contact license authority: %w", err)
}
Result:
- Outside grace period: Immediate validation failure → CHORUS exits
- Inside grace period: Log warning, allow startup
Fail-Closed Behavior: Network unavailability does NOT allow bypass
3. Invalid License Errors
Condition: KACHING rejects license
if resp.StatusCode != http.StatusOK {
    message := "license validation failed"
    if msg, ok := licenseResponse["message"].(string); ok {
        message = msg
    }
    return fmt.Errorf("license validation failed: %s", message)
}
Possible Messages:
- "License expired"
- "License revoked"
- "License not found"
- "Cluster ID mismatch"
- "Maximum nodes exceeded"
Result: Immediate validation failure → CHORUS exits
4. Rate Limiting Errors
Condition: Too many requests to KACHING
if resp.StatusCode == http.StatusTooManyRequests {
    return nil, fmt.Errorf("rate limited by KACHING, retry after: %s",
        resp.Header.Get("Retry-After"))
}
Result:
- Circuit breaker may trip after repeated rate limiting
- Grace period allows startup if rate limiting is transient
5. Circuit Breaker Errors
Condition: Circuit breaker is open (too many failures)
Result:
- All requests fail immediately
- Grace period allows startup if breaker trips during initialization
- Circuit breaker auto-recovers after timeout (30s)
Error Messages Reference
User-Facing Error Messages
| Error Message | Cause | Resolution | 
|---|---|---|
| license ID and cluster ID are required | Missing configuration | Set CHORUS_LICENSE_IDandCHORUS_CLUSTER_ID | 
| unable to contact license authority | Network error | Check KACHING server accessibility | 
| license validation failed: License expired | Expired license | Renew license with vendor | 
| license validation failed: License revoked | Revoked license | Contact vendor | 
| license validation failed: Cluster ID mismatch | Wrong cluster | Use correct cluster configuration | 
| rate limited by KACHING | Too many requests | Wait for rate limit reset | 
| lease token is invalid | Expired or invalid lease | System will auto-request new lease | 
| lease validation failed with status 404 | Lease not found | System will auto-request new lease | 
| License validation failed but in grace period | Transient failure during startup | System continues with warning | 
Grace Period Mechanism
Purpose
The grace period allows CHORUS to start even when license validation temporarily fails, preventing service disruption due to transient network issues or KACHING server maintenance.
Behavior
- Duration: 90 seconds from startup
- Triggered: When validation fails but grace period is active
- Effect: Validation returns success with warning log
- Extension: Grace period extends by 90s on each successful validation
- Expiry: After grace period expires, validation failures cause immediate exit
Grace Period States
┌──────────────────────────────────────────────────────────────┐
│                    Grace Period Timeline                      │
└──────────────────────────────────────────────────────────────┘
T+0s    ┌─────────────────────────────────────────────┐
        │        GRACE PERIOD ACTIVE (90s)            │
        │   Validation failures allowed with warning   │
        └─────────────────────────────────────────────┘
T+30s   │ Validation SUCCESS                          │
        └──> Grace period extended to T+120s          │
T+90s   │ Grace period expires (no successful validation)
        └──> Next validation failure causes exit      │
T+120s  │ (Extended) Grace period expires
        └──> Next validation failure causes exit      │
Implementation
// Initialize grace period at startup
func NewLicenseGate(config LicenseConfig) *LicenseGate {
    gate := &LicenseGate{...}
    gate.graceUntil.Store(time.Now().Add(90 * time.Second))
    return gate
}
// Check grace period during validation
if err != nil {
    if g.isInGracePeriod() {
        fmt.Printf("⚠️ License validation failed but in grace period: %v\n", err)
        return nil  // Allow startup
    }
    return fmt.Errorf("license validation failed: %w", err)
}
// Extend grace period on success
g.extendGracePeriod()  // Adds 90s to current time
Startup Integration
Location
File: internal/runtime/shared.go
Function: InitializeRuntime()
Integration Point
func InitializeRuntime(cfg *config.CHORUSConfig) (*RuntimeContext, error) {
    // ... early initialization ...
    // CRITICAL: Validate license before any P2P operations
    runtime.Logger.Info("🔐 Validating CHORUS license with KACHING...")
    licenseValidator := licensing.NewValidator(licensing.LicenseConfig{
        LicenseID:  cfg.License.LicenseID,
        ClusterID:  cfg.License.ClusterID,
        KachingURL: cfg.License.KachingURL,
    })
    if err := licenseValidator.Validate(); err != nil {
        // This error causes InitializeRuntime to return error
        // which causes main() to exit immediately
        return nil, fmt.Errorf("license validation failed: %v", err)
    }
    runtime.Logger.Info("✅ License validation successful - CHORUS authorized to run")
    // ... continue with P2P, AI provider initialization, etc ...
}
Execution Order
1. Load configuration from YAML
2. Initialize logger
3. ⚠️ VALIDATE LICENSE ⚠️
   └─→ FAILURE → return error → main() exits
4. Initialize AI provider
5. Initialize metrics collector
6. Initialize SHHH sentinel
7. Initialize P2P network
8. Start HAP server
9. Enter main runtime loop
Critical Note: License validation occurs BEFORE any P2P networking or AI provider initialization. If validation fails, no network connections are made and no services are started.
Configuration Examples
Minimal Configuration
license:
  license_id: "lic_abc123"
  cluster_id: "cluster_xyz789"
KACHING URL defaults to http://localhost:8083
Production Configuration
license:
  license_id: "lic_prod_abc123"
  cluster_id: "cluster_production_xyz789"
  kaching_url: "https://kaching.chorus.services"
  organization_name: "Acme Corporation"
  license_type: "enterprise"
  max_nodes: 10
Development Configuration
license:
  license_id: "lic_dev_abc123"
  cluster_id: "cluster_dev_local"
  kaching_url: "http://localhost:8083"
  organization_name: "Development Team"
  license_type: "developer"
  max_nodes: 1
Environment Variables
Licensing configuration can also be set via environment variables:
export CHORUS_LICENSE_ID="lic_abc123"
export CHORUS_CLUSTER_ID="cluster_xyz789"
export CHORUS_KACHING_URL="http://localhost:8083"
Monitoring and Observability
Log Messages
Successful Validation
🔐 Validating CHORUS license with KACHING...
✅ License validation successful - CHORUS authorized to run
Validation with Cached Lease
🔐 Validating CHORUS license with KACHING...
[Using cached lease token: lease_def456]
✅ License validation successful - CHORUS authorized to run
Validation During Grace Period
🔐 Validating CHORUS license with KACHING...
⚠️ License validation failed but in grace period: unable to contact license authority
✅ License validation successful - CHORUS authorized to run
Circuit Breaker State Changes
🔌 License validation circuit breaker: closed -> open
🔌 License validation circuit breaker: open -> half-open
🔌 License validation circuit breaker: half-open -> closed
Validation Failure (Fatal)
🔐 Validating CHORUS license with KACHING...
❌ License validation failed: License expired
Error: license validation failed: License expired
[CHORUS exits]
Cache Statistics API
stats := licenseGate.GetCacheStats()
Returns:
{
  "cache_valid": true,
  "cache_hit": true,
  "expires_at": "2025-09-30T15:30:00Z",
  "cached_at": "2025-09-30T14:30:00Z",
  "in_grace_period": false,
  "breaker_state": "closed",
  "grace_until": "2025-09-30T14:31:30Z"
}
Recommended Monitoring Metrics
| Metric | Type | Description | 
|---|---|---|
| license_validation_success | Counter | Successful validations | 
| license_validation_failure | Counter | Failed validations | 
| license_validation_duration_ms | Histogram | Validation latency | 
| license_cache_hit_rate | Gauge | Percentage of cache hits | 
| license_grace_period_active | Gauge | 1 if in grace period, 0 otherwise | 
| license_circuit_breaker_state | Gauge | 0=closed, 1=half-open, 2=open | 
| license_lease_expiry_seconds | Gauge | Seconds until lease expiry | 
Cluster Lease Management
Lease Lifecycle
┌──────────────────────────────────────────────────────────────────┐
│                      Cluster Lease Lifecycle                      │
└──────────────────────────────────────────────────────────────────┘
1. REQUEST LEASE
   ├─→ POST /api/v1/licenses/{license_id}/cluster-lease
   ├─→ cluster_id: "cluster_xyz789"
   ├─→ requested_replicas: 1
   └─→ duration_minutes: 60
2. RECEIVE LEASE
   ├─→ lease_token: "lease_def456"
   ├─→ max_replicas: 5
   ├─→ expires_at: T+60m
   └─→ Store in cache
3. USE LEASE (per agent startup)
   ├─→ POST /api/v1/licenses/validate-lease
   ├─→ lease_token: "lease_def456"
   ├─→ cluster_id: "cluster_xyz789"
   ├─→ agent_id: "agent_001"
   └─→ Decrements remaining_replicas
4. LEASE EXPIRY
   ├─→ Cache invalidated at T+58m (2min safety margin)
   └─→ Next validation requests new lease
5. LEASE RENEWAL
   └─→ Automatic on cache invalidation
Multi-Replica Support
The lease system supports multiple CHORUS agent replicas:
- max_replicas: Maximum concurrent agents allowed
- remaining_replicas: Available agent slots
- agent_id: Unique identifier for each agent instance
Example: License allows 5 replicas
- Request lease → max_replicas: 5
- Agent 1 validates → remaining_replicas: 4
- Agent 2 validates → remaining_replicas: 3
- Agent 6 validates → FAILURE (exceeds max_replicas)
Security Considerations
Fail-Closed Architecture
The licensing system implements fail-closed security:
- ✅ Network unavailable → Validation fails → CHORUS exits (unless in grace period)
- ✅ KACHING server down → Validation fails → CHORUS exits (unless in grace period)
- ✅ Invalid license → Validation fails → CHORUS exits (no grace period)
- ✅ Expired license → Validation fails → CHORUS exits (no grace period)
- ❌ No "development mode" bypass
- ❌ No "skip validation" flag
Grace Period Security
The grace period is designed for transient failures, NOT as a bypass:
- Limited to 90 seconds initially
- Only extends on successful validation
- Does NOT apply to invalid/expired licenses
- Primarily for network/KACHING server availability issues
License Token Security
- Lease tokens are short-lived (default: 60 minutes)
- Tokens cached in memory only (not persisted to disk)
- Tokens include cluster_id binding (cannot be used by other clusters)
- Agent ID tracking prevents token sharing between agents
Network Security
- HTTPS recommended for production KACHING URLs
- 30-second timeout prevents hanging on network issues
- Circuit breaker prevents cascade failures
Troubleshooting
Issue: "license ID and cluster ID are required"
Cause: Missing configuration
Resolution:
# config.yml
license:
  license_id: "your_license_id"
  cluster_id: "your_cluster_id"
Or via environment:
export CHORUS_LICENSE_ID="your_license_id"
export CHORUS_CLUSTER_ID="your_cluster_id"
Issue: "unable to contact license authority"
Cause: KACHING server unreachable
Resolution:
- Verify KACHING server is running
- Check network connectivity: curl http://localhost:8083/health
- Verify kaching_urlconfiguration
- Check firewall rules
- If transient, grace period allows startup
Issue: "license validation failed: License expired"
Cause: License has expired
Resolution:
- Contact license vendor to renew
- Update license_id in configuration
- Restart CHORUS
Note: Grace period does NOT apply to expired licenses
Issue: "rate limited by KACHING"
Cause: Too many validation requests
Resolution:
- Check for rapid restart loops
- Verify cache is working (should reduce requests)
- Wait for rate limit reset (check Retry-After header)
- Consider increasing lease duration_minutes
Issue: Circuit breaker stuck in "open" state
Cause: Repeated validation failures
Resolution:
- Check KACHING server health
- Verify license configuration
- Circuit breaker auto-recovers after 30 seconds
- Check grace period status: may allow startup during recovery
Issue: "lease token is invalid"
Cause: Lease expired or revoked
Resolution:
- System should auto-request new lease
- If persistent, check license status with vendor
- Verify cluster_id matches license configuration
Testing
Unit Testing
// Test license validation success
func TestValidatorSuccess(t *testing.T) {
    // Mock KACHING server
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
        json.NewEncoder(w).Encode(map[string]interface{}{
            "status": "ok",
            "message": "License valid",
        })
    }))
    defer server.Close()
    validator := licensing.NewValidator(licensing.LicenseConfig{
        LicenseID:  "test_license",
        ClusterID:  "test_cluster",
        KachingURL: server.URL,
    })
    err := validator.Validate()
    assert.NoError(t, err)
}
// Test license validation failure
func TestValidatorFailure(t *testing.T) {
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusForbidden)
        json.NewEncoder(w).Encode(map[string]interface{}{
            "status": "error",
            "message": "License expired",
        })
    }))
    defer server.Close()
    validator := licensing.NewValidator(licensing.LicenseConfig{
        LicenseID:  "test_license",
        ClusterID:  "test_cluster",
        KachingURL: server.URL,
    })
    err := validator.Validate()
    assert.Error(t, err)
    assert.Contains(t, err.Error(), "License expired")
}
Integration Testing
# Start KACHING test server
docker run -p 8083:8083 kaching:latest
# Test CHORUS startup with valid license
export CHORUS_LICENSE_ID="test_lic_123"
export CHORUS_CLUSTER_ID="test_cluster"
./chorus-agent
# Expected output:
# 🔐 Validating CHORUS license with KACHING...
# ✅ License validation successful - CHORUS authorized to run
# Test CHORUS startup with invalid license
export CHORUS_LICENSE_ID="invalid_license"
./chorus-agent
# Expected output:
# 🔐 Validating CHORUS license with KACHING...
# ❌ License validation failed: License not found
# Error: license validation failed: License not found
# [Exit code 1]
Future Enhancements
Planned Features
- 
Offline License Support - JWT-based license files for air-gapped deployments
- Signature verification without KACHING connectivity
 
- 
License Renewal Automation - Background renewal of expiring licenses
- Alert system for upcoming expirations
 
- 
Multi-License Support - Support for multiple license tiers
- Feature flag based on license type
 
- 
License Analytics - Usage metrics reporting to KACHING
- License utilization dashboards
 
- 
Enhanced Lease Management - Lease renewal before expiry
- Dynamic replica scaling based on license
 
API Constants
Timeouts
const (
    DefaultKachingURL = "http://localhost:8083"
    LicenseTimeout    = 30 * time.Second  // Validator HTTP timeout
    GateCTimeout      = 10 * time.Second  // LicenseGate HTTP timeout
)
Grace Period
const (
    GracePeriodDuration = 90 * time.Second
)
Circuit Breaker
const (
    MaxRequests          = 3              // Half-open state test requests
    FailureThreshold     = 3              // Consecutive failures to trip
    CircuitTimeout       = 30 * time.Second  // Open state duration
    FailureResetInterval = 60 * time.Second  // Failure count reset
)
Lease Safety Margin
const (
    LeaseSafetyMargin = 2 * time.Minute  // Cache invalidation before expiry
)
Related Documentation
- KACHING License Server: See KACHING documentation for server setup and API details
- CHORUS Configuration: /docs/comprehensive/pkg/config.md
- CHORUS Runtime: /docs/comprehensive/internal/runtime.md
- Deployment Guide: /docs/deployment.md
Summary
The CHORUS licensing system provides robust, fail-closed license enforcement through integration with the KACHING license authority. Key characteristics:
- Mandatory: License validation is required at startup
- Fail-Closed: Invalid license or network failure prevents startup (outside grace period)
- Cached: Lease tokens cached to reduce KACHING load
- Resilient: Circuit breaker and grace period handle transient failures
- Scalable: Cluster lease system supports multi-replica deployments
- Secure: No bypass mechanisms, short-lived tokens, cluster binding
The system ensures that all running CHORUS instances are properly licensed while providing operational flexibility through caching and grace periods for transient failures.