Files
CHORUS/docs/comprehensive/internal/licensing.md
anthonyrawlins c5b7311a8b docs: Add Phase 3 coordination and infrastructure documentation
Comprehensive documentation for coordination, messaging, discovery, and internal systems.

Core Coordination Packages:
- pkg/election - Democratic leader election (uptime-based, heartbeat mechanism, SLURP integration)
- pkg/coordination - Meta-coordination with dependency detection (4 built-in rules)
- coordinator/ - Task orchestration and assignment (AI-powered scoring)
- discovery/ - mDNS peer discovery (automatic LAN detection)

Messaging & P2P Infrastructure:
- pubsub/ - GossipSub messaging (31 message types, role-based topics, HMMM integration)
- p2p/ - libp2p networking (DHT modes, connection management, security)

Monitoring & Health:
- pkg/metrics - Prometheus metrics (80+ metrics across 12 categories)
- pkg/health - Health monitoring (4 HTTP endpoints, enhanced checks, graceful degradation)

Internal Systems:
- internal/licensing - License validation (KACHING integration, cluster leases, fail-closed)
- internal/hapui - Human Agent Portal UI (9 commands, HMMM wizard, UCXL browser, decision voting)
- internal/backbeat - P2P operation telemetry (6 phases, beat synchronization, health reporting)

Documentation Statistics (Phase 3):
- 10 packages documented (~18,000 lines)
- 31 PubSub message types cataloged
- 80+ Prometheus metrics documented
- Complete API references with examples
- Integration patterns and best practices

Key Features Documented:
- Election: 5 triggers, candidate scoring (5 weighted components), stability windows
- Coordination: AI-powered dependency detection, cross-repo sessions, escalation handling
- PubSub: Topic patterns, message envelopes, SHHH redaction, Hypercore logging
- Metrics: All metric types with labels, Prometheus scrape config, alert rules
- Health: Liveness vs readiness, critical checks, Kubernetes integration
- Licensing: Grace periods, circuit breaker, cluster lease management
- HAP UI: Interactive terminal commands, HMMM composition wizard, web interface (beta)
- BACKBEAT: 6-phase operation tracking, beat budget estimation, drift detection

Implementation Status Marked:
-  Production: Election, metrics, health, licensing, pubsub, p2p, discovery, coordinator
- 🔶 Beta: HAP web interface, BACKBEAT telemetry, advanced coordination
- 🔷 Alpha: SLURP election scoring
- ⚠️ Experimental: Meta-coordination, AI-powered dependency detection

Progress: 22/62 files complete (35%)

Next Phase: AI providers, SLURP system, API layer, reasoning engine

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-30 18:27:39 +10:00

40 KiB

CHORUS License Validation System

Package: internal/licensing Purpose: KACHING license authority integration with fail-closed validation Critical: License validation is MANDATORY at startup - invalid license = immediate exit

Overview

The CHORUS licensing system enforces software licensing through integration with the KACHING license authority. The system implements a fail-closed security model: if license validation fails, CHORUS will not start. This ensures that all running instances are properly licensed and authorized.

Key Components

  • Validator: Core license validation with KACHING server communication
  • LicenseGate: Enhanced validation with caching, circuit breaker, and cluster lease management
  • LicenseConfig: Configuration structure for licensing parameters

Security Model: FAIL-CLOSED

┌─────────────────────────────────────────────────────────────┐
│                    CHORUS STARTUP                            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Load Configuration                                       │
│  2. Initialize Logger                                        │
│  3. ⚠️  VALIDATE LICENSE (CRITICAL GATE)                     │
│     │                                                        │
│     ├─── SUCCESS ──→ Continue startup                       │
│     │                                                        │
│     └─── FAILURE ──→ return error → IMMEDIATE EXIT          │
│                                                              │
│  4. Initialize AI Provider                                   │
│  5. Start P2P Network                                        │
│  6. ... rest of initialization                               │
│                                                              │
└─────────────────────────────────────────────────────────────┘

NO BYPASS: License validation cannot be skipped or bypassed.

Architecture

1. LicenseConfig Structure

Location: internal/licensing/validator.go and pkg/config/config.go

// Core licensing configuration
type LicenseConfig struct {
    LicenseID   string    // Unique license identifier
    ClusterID   string    // Cluster/deployment identifier
    KachingURL  string    // KACHING server URL
}

// Extended configuration in pkg/config
type LicenseConfig struct {
    LicenseID        string    `yaml:"license_id"`
    ClusterID        string    `yaml:"cluster_id"`
    OrganizationName string    `yaml:"organization_name"`
    KachingURL       string    `yaml:"kaching_url"`
    IsActive         bool      `yaml:"is_active"`
    LastValidated    time.Time `yaml:"last_validated"`
    GracePeriodHours int       `yaml:"grace_period_hours"`
    LicenseType      string    `yaml:"license_type"`
    ExpiresAt        time.Time `yaml:"expires_at"`
    MaxNodes         int       `yaml:"max_nodes"`
}

Configuration Fields:

Field Required Purpose
LicenseID Yes Unique identifier for the license
ClusterID Yes Identifies the cluster/deployment
KachingURL No KACHING server URL (defaults to http://localhost:8083)
OrganizationName No Organization name for tracking
LicenseType No Type of license (e.g., "enterprise", "developer")
ExpiresAt No License expiration timestamp
MaxNodes No Maximum nodes allowed in cluster

Validation Flow

Standard Validation Sequence

┌──────────────────────────────────────────────────────────────────┐
│                    License Validation Flow                        │
└──────────────────────────────────────────────────────────────────┘

1. NewValidator(config) → Initialize Validator
   │
   ├─→ Set KachingURL (default: http://localhost:8083)
   ├─→ Create HTTP client (timeout: 30s)
   └─→ Initialize LicenseGate
       │
       └─→ Initialize Circuit Breaker
           └─→ Set Grace Period (90 seconds from start)

2. Validate() → Perform validation
   │
   ├─→ ValidateWithContext(ctx)
   │   │
   │   ├─→ Check required fields (LicenseID, ClusterID)
   │   │
   │   ├─→ LicenseGate.Validate(ctx, agentID)
   │   │   │
   │   │   ├─→ Check cached lease (if valid, use it)
   │   │   │   ├─→ validateCachedLease()
   │   │   │   │   ├─→ POST /api/v1/licenses/validate-lease
   │   │   │   │   ├─→ Include: lease_token, cluster_id, agent_id
   │   │   │   │   └─→ Response: valid, remaining_replicas, expires_at
   │   │   │   │
   │   │   │   └─→ Cache hit? → SUCCESS
   │   │   │
   │   │   ├─→ Cache miss? → Request new lease
   │   │   │   │
   │   │   │   ├─→ breaker.Execute() [Circuit Breaker]
   │   │   │   │   │
   │   │   │   │   ├─→ requestOrRenewLease()
   │   │   │   │   │   ├─→ POST /api/v1/licenses/{id}/cluster-lease
   │   │   │   │   │   ├─→ Request: cluster_id, requested_replicas, duration_minutes
   │   │   │   │   │   └─→ Response: lease_token, max_replicas, expires_at, lease_id
   │   │   │   │   │
   │   │   │   │   ├─→ validateLease(lease, agentID)
   │   │   │   │   │   └─→ POST /api/v1/licenses/validate-lease
   │   │   │   │   │
   │   │   │   │   └─→ storeLease() → Cache the valid lease
   │   │   │   │
   │   │   │   └─→ Extend grace period (90s)
   │   │   │
   │   │   └─→ Validation failed?
   │   │       │
   │   │       ├─→ In grace period? → Log warning, ALLOW startup
   │   │       └─→ Outside grace period? → RETURN ERROR
   │   │
   │   └─→ Fallback to validateLegacy() on LicenseGate failure
   │       │
   │       ├─→ POST /v1/license/activate
   │       ├─→ Request: license_id, cluster_id, metadata
   │       └─→ Response: validation result
   │
   └─→ Return validation result

3. Result Handling (in runtime/shared.go)
   │
   ├─→ SUCCESS → Log "✅ License validation successful"
   │            → Continue initialization
   │
   └─→ FAILURE → return error → CHORUS EXITS IMMEDIATELY

Component Details

1. Validator Component

File: internal/licensing/validator.go

The Validator is the primary component for license validation, providing communication with the KACHING license authority.

Key Methods

NewValidator(config LicenseConfig)
func NewValidator(config LicenseConfig) *Validator

Creates a new license validator with:

  • HTTP client with 30-second timeout
  • Default KACHING URL if not specified
  • Initialized LicenseGate for enhanced validation
Validate()
func (v *Validator) Validate() error

Performs license validation with KACHING authority:

  • Validates required configuration fields
  • Uses LicenseGate for cached/enhanced validation
  • Falls back to legacy validation if needed
  • Returns error if validation fails
validateLegacy()
func (v *Validator) validateLegacy() error

Legacy validation method (fallback):

  • Direct HTTP POST to /v1/license/activate
  • Sends license metadata (product, version, container flag)
  • Fail-closed: Network error = validation failure
  • Parses and validates response status

Request Example:

{
  "license_id": "lic_abc123",
  "cluster_id": "cluster_xyz789",
  "metadata": {
    "product": "CHORUS",
    "version": "0.1.0-dev",
    "container": "true"
  }
}

Response Example (Success):

{
  "status": "ok",
  "message": "License valid",
  "expires_at": "2025-12-31T23:59:59Z"
}

Response Example (Failure):

{
  "status": "error",
  "message": "License expired"
}

2. LicenseGate Component

File: internal/licensing/license_gate.go

Enhanced license validation with caching, circuit breaker, and cluster lease management for production scalability.

Key Features

  • Caching: Stores valid lease tokens to reduce KACHING load
  • Circuit Breaker: Prevents cascade failures during KACHING outages
  • Grace Period: 90-second startup grace period for transient failures
  • Cluster Leases: Supports multi-replica deployments with lease tokens
  • Burst Protection: Rate limiting and retry logic

Data Structures

cachedLease
type cachedLease struct {
    LeaseToken string    `json:"lease_token"`
    ExpiresAt  time.Time `json:"expires_at"`
    ClusterID  string    `json:"cluster_id"`
    Valid      bool      `json:"valid"`
    CachedAt   time.Time `json:"cached_at"`
}

Lease Validation:

  • Lease considered invalid 2 minutes before actual expiry (safety margin)
  • Invalid leases are evicted from cache automatically
LeaseRequest
type LeaseRequest struct {
    ClusterID         string `json:"cluster_id"`
    RequestedReplicas int    `json:"requested_replicas"`
    DurationMinutes   int    `json:"duration_minutes"`
}
LeaseResponse
type LeaseResponse struct {
    LeaseToken   string    `json:"lease_token"`
    MaxReplicas  int       `json:"max_replicas"`
    ExpiresAt    time.Time `json:"expires_at"`
    ClusterID    string    `json:"cluster_id"`
    LeaseID      string    `json:"lease_id"`
}
LeaseValidationRequest
type LeaseValidationRequest struct {
    LeaseToken string `json:"lease_token"`
    ClusterID  string `json:"cluster_id"`
    AgentID    string `json:"agent_id"`
}
LeaseValidationResponse
type LeaseValidationResponse struct {
    Valid             bool      `json:"valid"`
    RemainingReplicas int       `json:"remaining_replicas"`
    ExpiresAt         time.Time `json:"expires_at"`
}

Circuit Breaker Configuration

breakerSettings := gobreaker.Settings{
    Name:        "license-validation",
    MaxRequests: 3,              // Allow 3 requests in half-open state
    Interval:    60 * time.Second, // Reset failure count every minute
    Timeout:     30 * time.Second, // Stay open for 30 seconds
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.ConsecutiveFailures >= 3  // Trip after 3 failures
    },
    OnStateChange: func(name string, from, to gobreaker.State) {
        fmt.Printf("🔌 License validation circuit breaker: %s -> %s\n", from, to)
    },
}

Circuit Breaker States:

State Behavior Transition
Closed Normal operation, requests pass through 3 consecutive failures → Open
Open All requests fail immediately (30s) After timeout → Half-Open
Half-Open Allow 3 test requests Success → Closed, Failure → Open

Key Methods

NewLicenseGate(config LicenseConfig)
func NewLicenseGate(config LicenseConfig) *LicenseGate

Initializes license gate with:

  • Circuit breaker with production settings
  • HTTP client with 10-second timeout
  • 90-second grace period from startup
Validate(ctx context.Context, agentID string)
func (g *LicenseGate) Validate(ctx context.Context, agentID string) error

Primary validation method:

  1. Check cache: If valid cached lease exists, validate it
  2. Cache miss: Request new lease through circuit breaker
  3. Store result: Cache successful lease for future requests
  4. Grace period: Allow startup during grace period even if validation fails
  5. Extend grace: Extend grace period on successful validation
validateCachedLease(ctx, lease, agentID)
func (g *LicenseGate) validateCachedLease(ctx context.Context, lease *cachedLease, agentID string) error

Validates cached lease token:

  • POST to /api/v1/licenses/validate-lease
  • Invalidates cache if validation fails
  • Returns error if lease is no longer valid
requestOrRenewLease(ctx)
func (g *LicenseGate) requestOrRenewLease(ctx context.Context) (*LeaseResponse, error)

Requests new cluster lease:

  • POST to /api/v1/licenses/{license_id}/cluster-lease
  • Default: 1 replica, 60-minute duration
  • Handles rate limiting (429 Too Many Requests)
  • Returns lease token and metadata
GetCacheStats()
func (g *LicenseGate) GetCacheStats() map[string]interface{}

Returns cache statistics for monitoring:

{
  "cache_valid": true,
  "cache_hit": true,
  "expires_at": "2025-09-30T15:30:00Z",
  "cached_at": "2025-09-30T14:30:00Z",
  "in_grace_period": false,
  "breaker_state": "closed",
  "grace_until": "2025-09-30T14:31:30Z"
}

KACHING Server Integration

API Endpoints

1. Legacy Activation Endpoint

Endpoint: POST /v1/license/activate Purpose: Legacy license validation (fallback)

Request:

{
  "license_id": "lic_abc123",
  "cluster_id": "cluster_xyz789",
  "metadata": {
    "product": "CHORUS",
    "version": "0.1.0-dev",
    "container": "true"
  }
}

Response (Success):

{
  "status": "ok",
  "message": "License valid",
  "expires_at": "2025-12-31T23:59:59Z"
}

Response (Failure):

{
  "status": "error",
  "message": "License expired"
}

2. Cluster Lease Endpoint

Endpoint: POST /api/v1/licenses/{license_id}/cluster-lease Purpose: Request cluster deployment lease

Request:

{
  "cluster_id": "cluster_xyz789",
  "requested_replicas": 1,
  "duration_minutes": 60
}

Response (Success):

{
  "lease_token": "lease_def456",
  "max_replicas": 5,
  "expires_at": "2025-09-30T15:30:00Z",
  "cluster_id": "cluster_xyz789",
  "lease_id": "lease_def456"
}

Response (Rate Limited):

HTTP 429 Too Many Requests
Retry-After: 60

3. Lease Validation Endpoint

Endpoint: POST /api/v1/licenses/validate-lease Purpose: Validate lease token for agent startup

Request:

{
  "lease_token": "lease_def456",
  "cluster_id": "cluster_xyz789",
  "agent_id": "agent_001"
}

Response (Success):

{
  "valid": true,
  "remaining_replicas": 4,
  "expires_at": "2025-09-30T15:30:00Z"
}

Response (Invalid):

{
  "valid": false,
  "remaining_replicas": 0,
  "expires_at": "2025-09-30T14:30:00Z"
}

Validation Sequence Diagram

┌─────────┐         ┌───────────┐         ┌──────────────┐         ┌──────────┐
│ CHORUS  │         │ Validator │         │ LicenseGate  │         │ KACHING  │
│ Runtime │         │           │         │              │         │  Server  │
└────┬────┘         └─────┬─────┘         └──────┬───────┘         └────┬─────┘
     │                    │                       │                      │
     │ InitializeRuntime()│                       │                      │
     │───────────────────>│                       │                      │
     │                    │                       │                      │
     │                    │ Validate()            │                      │
     │                    │──────────────────────>│                      │
     │                    │                       │                      │
     │                    │                       │ Check cache          │
     │                    │                       │────────┐             │
     │                    │                       │        │             │
     │                    │                       │<───────┘             │
     │                    │                       │                      │
     │                    │                       │ Cache miss           │
     │                    │                       │                      │
     │                    │                       │ POST /cluster-lease  │
     │                    │                       │─────────────────────>│
     │                    │                       │                      │
     │                    │                       │    Lease Response    │
     │                    │                       │<─────────────────────│
     │                    │                       │                      │
     │                    │                       │ POST /validate-lease │
     │                    │                       │─────────────────────>│
     │                    │                       │                      │
     │                    │                       │  Validation Response │
     │                    │                       │<─────────────────────│
     │                    │                       │                      │
     │                    │                       │ Store in cache       │
     │                    │                       │────────┐             │
     │                    │                       │        │             │
     │                    │                       │<───────┘             │
     │                    │                       │                      │
     │                    │   SUCCESS             │                      │
     │                    │<──────────────────────│                      │
     │                    │                       │                      │
     │   Continue startup │                       │                      │
     │<───────────────────│                       │                      │
     │                    │                       │                      │

┌─────────────────────────────────────────────────────────────────────┐
│                        FAILURE SCENARIO                              │
└─────────────────────────────────────────────────────────────────────┘

     │                    │                       │                      │
     │                    │                       │ POST /validate-lease │
     │                    │                       │─────────────────────>│
     │                    │                       │                      │
     │                    │                       │    INVALID LICENSE   │
     │                    │                       │<─────────────────────│
     │                    │                       │                      │
     │                    │                       │ Check grace period   │
     │                    │                       │────────┐             │
     │                    │                       │        │             │
     │                    │                       │<───────┘             │
     │                    │                       │                      │
     │                    │                       │ Outside grace period │
     │                    │                       │                      │
     │                    │   ERROR               │                      │
     │                    │<──────────────────────│                      │
     │                    │                       │                      │
     │   return error     │                       │                      │
     │<───────────────────│                       │                      │
     │                    │                       │                      │
     │ EXIT               │                       │                      │
     │────────X           │                       │                      │

Error Handling

Error Categories

1. Configuration Errors

Condition: Missing required configuration fields

if v.config.LicenseID == "" || v.config.ClusterID == "" {
    return fmt.Errorf("license ID and cluster ID are required")
}

Result: Immediate validation failure → CHORUS exits

2. Network Errors

Condition: Cannot contact KACHING server

resp, err := v.client.Post(licenseURL, "application/json", bytes.NewReader(requestBody))
if err != nil {
    // FAIL-CLOSED: No network = No license = No operation
    return fmt.Errorf("unable to contact license authority: %w", err)
}

Result:

  • Outside grace period: Immediate validation failure → CHORUS exits
  • Inside grace period: Log warning, allow startup

Fail-Closed Behavior: Network unavailability does NOT allow bypass

3. Invalid License Errors

Condition: KACHING rejects license

if resp.StatusCode != http.StatusOK {
    message := "license validation failed"
    if msg, ok := licenseResponse["message"].(string); ok {
        message = msg
    }
    return fmt.Errorf("license validation failed: %s", message)
}

Possible Messages:

  • "License expired"
  • "License revoked"
  • "License not found"
  • "Cluster ID mismatch"
  • "Maximum nodes exceeded"

Result: Immediate validation failure → CHORUS exits

4. Rate Limiting Errors

Condition: Too many requests to KACHING

if resp.StatusCode == http.StatusTooManyRequests {
    return nil, fmt.Errorf("rate limited by KACHING, retry after: %s",
        resp.Header.Get("Retry-After"))
}

Result:

  • Circuit breaker may trip after repeated rate limiting
  • Grace period allows startup if rate limiting is transient

5. Circuit Breaker Errors

Condition: Circuit breaker is open (too many failures)

Result:

  • All requests fail immediately
  • Grace period allows startup if breaker trips during initialization
  • Circuit breaker auto-recovers after timeout (30s)

Error Messages Reference

User-Facing Error Messages

Error Message Cause Resolution
license ID and cluster ID are required Missing configuration Set CHORUS_LICENSE_ID and CHORUS_CLUSTER_ID
unable to contact license authority Network error Check KACHING server accessibility
license validation failed: License expired Expired license Renew license with vendor
license validation failed: License revoked Revoked license Contact vendor
license validation failed: Cluster ID mismatch Wrong cluster Use correct cluster configuration
rate limited by KACHING Too many requests Wait for rate limit reset
lease token is invalid Expired or invalid lease System will auto-request new lease
lease validation failed with status 404 Lease not found System will auto-request new lease
License validation failed but in grace period Transient failure during startup System continues with warning

Grace Period Mechanism

Purpose

The grace period allows CHORUS to start even when license validation temporarily fails, preventing service disruption due to transient network issues or KACHING server maintenance.

Behavior

  • Duration: 90 seconds from startup
  • Triggered: When validation fails but grace period is active
  • Effect: Validation returns success with warning log
  • Extension: Grace period extends by 90s on each successful validation
  • Expiry: After grace period expires, validation failures cause immediate exit

Grace Period States

┌──────────────────────────────────────────────────────────────┐
│                    Grace Period Timeline                      │
└──────────────────────────────────────────────────────────────┘

T+0s    ┌─────────────────────────────────────────────┐
        │        GRACE PERIOD ACTIVE (90s)            │
        │   Validation failures allowed with warning   │
        └─────────────────────────────────────────────┘

T+30s   │ Validation SUCCESS                          │
        └──> Grace period extended to T+120s          │

T+90s   │ Grace period expires (no successful validation)
        └──> Next validation failure causes exit      │

T+120s  │ (Extended) Grace period expires
        └──> Next validation failure causes exit      │

Implementation

// Initialize grace period at startup
func NewLicenseGate(config LicenseConfig) *LicenseGate {
    gate := &LicenseGate{...}
    gate.graceUntil.Store(time.Now().Add(90 * time.Second))
    return gate
}

// Check grace period during validation
if err != nil {
    if g.isInGracePeriod() {
        fmt.Printf("⚠️ License validation failed but in grace period: %v\n", err)
        return nil  // Allow startup
    }
    return fmt.Errorf("license validation failed: %w", err)
}

// Extend grace period on success
g.extendGracePeriod()  // Adds 90s to current time

Startup Integration

Location

File: internal/runtime/shared.go Function: InitializeRuntime()

Integration Point

func InitializeRuntime(cfg *config.CHORUSConfig) (*RuntimeContext, error) {
    // ... early initialization ...

    // CRITICAL: Validate license before any P2P operations
    runtime.Logger.Info("🔐 Validating CHORUS license with KACHING...")
    licenseValidator := licensing.NewValidator(licensing.LicenseConfig{
        LicenseID:  cfg.License.LicenseID,
        ClusterID:  cfg.License.ClusterID,
        KachingURL: cfg.License.KachingURL,
    })

    if err := licenseValidator.Validate(); err != nil {
        // This error causes InitializeRuntime to return error
        // which causes main() to exit immediately
        return nil, fmt.Errorf("license validation failed: %v", err)
    }

    runtime.Logger.Info("✅ License validation successful - CHORUS authorized to run")

    // ... continue with P2P, AI provider initialization, etc ...
}

Execution Order

1. Load configuration from YAML
2. Initialize logger
3. ⚠️ VALIDATE LICENSE ⚠️
   └─→ FAILURE → return error → main() exits
4. Initialize AI provider
5. Initialize metrics collector
6. Initialize SHHH sentinel
7. Initialize P2P network
8. Start HAP server
9. Enter main runtime loop

Critical Note: License validation occurs BEFORE any P2P networking or AI provider initialization. If validation fails, no network connections are made and no services are started.


Configuration Examples

Minimal Configuration

license:
  license_id: "lic_abc123"
  cluster_id: "cluster_xyz789"

KACHING URL defaults to http://localhost:8083

Production Configuration

license:
  license_id: "lic_prod_abc123"
  cluster_id: "cluster_production_xyz789"
  kaching_url: "https://kaching.chorus.services"
  organization_name: "Acme Corporation"
  license_type: "enterprise"
  max_nodes: 10

Development Configuration

license:
  license_id: "lic_dev_abc123"
  cluster_id: "cluster_dev_local"
  kaching_url: "http://localhost:8083"
  organization_name: "Development Team"
  license_type: "developer"
  max_nodes: 1

Environment Variables

Licensing configuration can also be set via environment variables:

export CHORUS_LICENSE_ID="lic_abc123"
export CHORUS_CLUSTER_ID="cluster_xyz789"
export CHORUS_KACHING_URL="http://localhost:8083"

Monitoring and Observability

Log Messages

Successful Validation

🔐 Validating CHORUS license with KACHING...
✅ License validation successful - CHORUS authorized to run

Validation with Cached Lease

🔐 Validating CHORUS license with KACHING...
[Using cached lease token: lease_def456]
✅ License validation successful - CHORUS authorized to run

Validation During Grace Period

🔐 Validating CHORUS license with KACHING...
⚠️ License validation failed but in grace period: unable to contact license authority
✅ License validation successful - CHORUS authorized to run

Circuit Breaker State Changes

🔌 License validation circuit breaker: closed -> open
🔌 License validation circuit breaker: open -> half-open
🔌 License validation circuit breaker: half-open -> closed

Validation Failure (Fatal)

🔐 Validating CHORUS license with KACHING...
❌ License validation failed: License expired
Error: license validation failed: License expired
[CHORUS exits]

Cache Statistics API

stats := licenseGate.GetCacheStats()

Returns:

{
  "cache_valid": true,
  "cache_hit": true,
  "expires_at": "2025-09-30T15:30:00Z",
  "cached_at": "2025-09-30T14:30:00Z",
  "in_grace_period": false,
  "breaker_state": "closed",
  "grace_until": "2025-09-30T14:31:30Z"
}
Metric Type Description
license_validation_success Counter Successful validations
license_validation_failure Counter Failed validations
license_validation_duration_ms Histogram Validation latency
license_cache_hit_rate Gauge Percentage of cache hits
license_grace_period_active Gauge 1 if in grace period, 0 otherwise
license_circuit_breaker_state Gauge 0=closed, 1=half-open, 2=open
license_lease_expiry_seconds Gauge Seconds until lease expiry

Cluster Lease Management

Lease Lifecycle

┌──────────────────────────────────────────────────────────────────┐
│                      Cluster Lease Lifecycle                      │
└──────────────────────────────────────────────────────────────────┘

1. REQUEST LEASE
   ├─→ POST /api/v1/licenses/{license_id}/cluster-lease
   ├─→ cluster_id: "cluster_xyz789"
   ├─→ requested_replicas: 1
   └─→ duration_minutes: 60

2. RECEIVE LEASE
   ├─→ lease_token: "lease_def456"
   ├─→ max_replicas: 5
   ├─→ expires_at: T+60m
   └─→ Store in cache

3. USE LEASE (per agent startup)
   ├─→ POST /api/v1/licenses/validate-lease
   ├─→ lease_token: "lease_def456"
   ├─→ cluster_id: "cluster_xyz789"
   ├─→ agent_id: "agent_001"
   └─→ Decrements remaining_replicas

4. LEASE EXPIRY
   ├─→ Cache invalidated at T+58m (2min safety margin)
   └─→ Next validation requests new lease

5. LEASE RENEWAL
   └─→ Automatic on cache invalidation

Multi-Replica Support

The lease system supports multiple CHORUS agent replicas:

  • max_replicas: Maximum concurrent agents allowed
  • remaining_replicas: Available agent slots
  • agent_id: Unique identifier for each agent instance

Example: License allows 5 replicas

  • Request lease → max_replicas: 5
  • Agent 1 validates → remaining_replicas: 4
  • Agent 2 validates → remaining_replicas: 3
  • Agent 6 validates → FAILURE (exceeds max_replicas)

Security Considerations

Fail-Closed Architecture

The licensing system implements fail-closed security:

  • Network unavailable → Validation fails → CHORUS exits (unless in grace period)
  • KACHING server down → Validation fails → CHORUS exits (unless in grace period)
  • Invalid license → Validation fails → CHORUS exits (no grace period)
  • Expired license → Validation fails → CHORUS exits (no grace period)
  • No "development mode" bypass
  • No "skip validation" flag

Grace Period Security

The grace period is designed for transient failures, NOT as a bypass:

  • Limited to 90 seconds initially
  • Only extends on successful validation
  • Does NOT apply to invalid/expired licenses
  • Primarily for network/KACHING server availability issues

License Token Security

  • Lease tokens are short-lived (default: 60 minutes)
  • Tokens cached in memory only (not persisted to disk)
  • Tokens include cluster_id binding (cannot be used by other clusters)
  • Agent ID tracking prevents token sharing between agents

Network Security

  • HTTPS recommended for production KACHING URLs
  • 30-second timeout prevents hanging on network issues
  • Circuit breaker prevents cascade failures

Troubleshooting

Issue: "license ID and cluster ID are required"

Cause: Missing configuration

Resolution:

# config.yml
license:
  license_id: "your_license_id"
  cluster_id: "your_cluster_id"

Or via environment:

export CHORUS_LICENSE_ID="your_license_id"
export CHORUS_CLUSTER_ID="your_cluster_id"

Issue: "unable to contact license authority"

Cause: KACHING server unreachable

Resolution:

  1. Verify KACHING server is running
  2. Check network connectivity: curl http://localhost:8083/health
  3. Verify kaching_url configuration
  4. Check firewall rules
  5. If transient, grace period allows startup

Issue: "license validation failed: License expired"

Cause: License has expired

Resolution:

  1. Contact license vendor to renew
  2. Update license_id in configuration
  3. Restart CHORUS

Note: Grace period does NOT apply to expired licenses


Issue: "rate limited by KACHING"

Cause: Too many validation requests

Resolution:

  1. Check for rapid restart loops
  2. Verify cache is working (should reduce requests)
  3. Wait for rate limit reset (check Retry-After header)
  4. Consider increasing lease duration_minutes

Issue: Circuit breaker stuck in "open" state

Cause: Repeated validation failures

Resolution:

  1. Check KACHING server health
  2. Verify license configuration
  3. Circuit breaker auto-recovers after 30 seconds
  4. Check grace period status: may allow startup during recovery

Issue: "lease token is invalid"

Cause: Lease expired or revoked

Resolution:

  • System should auto-request new lease
  • If persistent, check license status with vendor
  • Verify cluster_id matches license configuration

Testing

Unit Testing

// Test license validation success
func TestValidatorSuccess(t *testing.T) {
    // Mock KACHING server
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusOK)
        json.NewEncoder(w).Encode(map[string]interface{}{
            "status": "ok",
            "message": "License valid",
        })
    }))
    defer server.Close()

    validator := licensing.NewValidator(licensing.LicenseConfig{
        LicenseID:  "test_license",
        ClusterID:  "test_cluster",
        KachingURL: server.URL,
    })

    err := validator.Validate()
    assert.NoError(t, err)
}

// Test license validation failure
func TestValidatorFailure(t *testing.T) {
    server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        w.WriteHeader(http.StatusForbidden)
        json.NewEncoder(w).Encode(map[string]interface{}{
            "status": "error",
            "message": "License expired",
        })
    }))
    defer server.Close()

    validator := licensing.NewValidator(licensing.LicenseConfig{
        LicenseID:  "test_license",
        ClusterID:  "test_cluster",
        KachingURL: server.URL,
    })

    err := validator.Validate()
    assert.Error(t, err)
    assert.Contains(t, err.Error(), "License expired")
}

Integration Testing

# Start KACHING test server
docker run -p 8083:8083 kaching:latest

# Test CHORUS startup with valid license
export CHORUS_LICENSE_ID="test_lic_123"
export CHORUS_CLUSTER_ID="test_cluster"
./chorus-agent

# Expected output:
# 🔐 Validating CHORUS license with KACHING...
# ✅ License validation successful - CHORUS authorized to run

# Test CHORUS startup with invalid license
export CHORUS_LICENSE_ID="invalid_license"
./chorus-agent

# Expected output:
# 🔐 Validating CHORUS license with KACHING...
# ❌ License validation failed: License not found
# Error: license validation failed: License not found
# [Exit code 1]

Future Enhancements

Planned Features

  1. Offline License Support

    • JWT-based license files for air-gapped deployments
    • Signature verification without KACHING connectivity
  2. License Renewal Automation

    • Background renewal of expiring licenses
    • Alert system for upcoming expirations
  3. Multi-License Support

    • Support for multiple license tiers
    • Feature flag based on license type
  4. License Analytics

    • Usage metrics reporting to KACHING
    • License utilization dashboards
  5. Enhanced Lease Management

    • Lease renewal before expiry
    • Dynamic replica scaling based on license

API Constants

Timeouts

const (
    DefaultKachingURL = "http://localhost:8083"
    LicenseTimeout    = 30 * time.Second  // Validator HTTP timeout
    GateCTimeout      = 10 * time.Second  // LicenseGate HTTP timeout
)

Grace Period

const (
    GracePeriodDuration = 90 * time.Second
)

Circuit Breaker

const (
    MaxRequests          = 3              // Half-open state test requests
    FailureThreshold     = 3              // Consecutive failures to trip
    CircuitTimeout       = 30 * time.Second  // Open state duration
    FailureResetInterval = 60 * time.Second  // Failure count reset
)

Lease Safety Margin

const (
    LeaseSafetyMargin = 2 * time.Minute  // Cache invalidation before expiry
)

  • KACHING License Server: See KACHING documentation for server setup and API details
  • CHORUS Configuration: /docs/comprehensive/pkg/config.md
  • CHORUS Runtime: /docs/comprehensive/internal/runtime.md
  • Deployment Guide: /docs/deployment.md

Summary

The CHORUS licensing system provides robust, fail-closed license enforcement through integration with the KACHING license authority. Key characteristics:

  • Mandatory: License validation is required at startup
  • Fail-Closed: Invalid license or network failure prevents startup (outside grace period)
  • Cached: Lease tokens cached to reduce KACHING load
  • Resilient: Circuit breaker and grace period handle transient failures
  • Scalable: Cluster lease system supports multi-replica deployments
  • Secure: No bypass mechanisms, short-lived tokens, cluster binding

The system ensures that all running CHORUS instances are properly licensed while providing operational flexibility through caching and grace periods for transient failures.