CHORUS/docs/comprehensive/internal/licensing.md
anthonyrawlins c5b7311a8b docs: Add Phase 3 coordination and infrastructure documentation
2025-09-30 18:27:39 +10:00

# CHORUS License Validation System
**Package**: `internal/licensing`
**Purpose**: KACHING license authority integration with fail-closed validation
**Critical**: License validation is **MANDATORY** at startup - invalid license = immediate exit
## Overview
The CHORUS licensing system enforces software licensing through integration with the KACHING license authority. The system implements a **fail-closed** security model: if license validation fails, CHORUS will not start. This ensures that all running instances are properly licensed and authorized.
### Key Components
- **Validator**: Core license validation with KACHING server communication
- **LicenseGate**: Enhanced validation with caching, circuit breaker, and cluster lease management
- **LicenseConfig**: Configuration structure for licensing parameters
### Security Model: FAIL-CLOSED
```
┌─────────────────────────────────────────────────────────────┐
│                       CHORUS STARTUP                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Load Configuration                                      │
│  2. Initialize Logger                                       │
│  3. ⚠️ VALIDATE LICENSE (CRITICAL GATE)                     │
│     │                                                       │
│     ├─── SUCCESS ──→ Continue startup                       │
│     │                                                       │
│     └─── FAILURE ──→ return error → IMMEDIATE EXIT          │
│                                                             │
│  4. Initialize AI Provider                                  │
│  5. Start P2P Network                                       │
│  6. ... rest of initialization                              │
│                                                             │
└─────────────────────────────────────────────────────────────┘

NO BYPASS: License validation cannot be skipped or bypassed.
```
---
## Architecture
### 1. LicenseConfig Structure
**Location**: `internal/licensing/validator.go` and `pkg/config/config.go`
```go
// Core licensing configuration
type LicenseConfig struct {
	LicenseID  string // Unique license identifier
	ClusterID  string // Cluster/deployment identifier
	KachingURL string // KACHING server URL
}

// Extended configuration in pkg/config
type LicenseConfig struct {
	LicenseID        string    `yaml:"license_id"`
	ClusterID        string    `yaml:"cluster_id"`
	OrganizationName string    `yaml:"organization_name"`
	KachingURL       string    `yaml:"kaching_url"`
	IsActive         bool      `yaml:"is_active"`
	LastValidated    time.Time `yaml:"last_validated"`
	GracePeriodHours int       `yaml:"grace_period_hours"`
	LicenseType      string    `yaml:"license_type"`
	ExpiresAt        time.Time `yaml:"expires_at"`
	MaxNodes         int       `yaml:"max_nodes"`
}
```
**Configuration Fields**:
| Field | Required | Purpose |
|-------|----------|---------|
| `LicenseID` | ✅ Yes | Unique identifier for the license |
| `ClusterID` | ✅ Yes | Identifies the cluster/deployment |
| `KachingURL` | No | KACHING server URL (defaults to `http://localhost:8083`) |
| `OrganizationName` | No | Organization name for tracking |
| `LicenseType` | No | Type of license (e.g., "enterprise", "developer") |
| `ExpiresAt` | No | License expiration timestamp |
| `MaxNodes` | No | Maximum nodes allowed in cluster |
---
## Validation Flow
### Standard Validation Sequence
```
┌──────────────────────────────────────────────────────────────────┐
│                     License Validation Flow                      │
└──────────────────────────────────────────────────────────────────┘

1. NewValidator(config) → Initialize Validator
   ├─→ Set KachingURL (default: http://localhost:8083)
   ├─→ Create HTTP client (timeout: 30s)
   └─→ Initialize LicenseGate
       ├─→ Initialize Circuit Breaker
       └─→ Set Grace Period (90 seconds from start)

2. Validate() → Perform validation
   ├─→ ValidateWithContext(ctx)
   │   │
   │   ├─→ Check required fields (LicenseID, ClusterID)
   │   │
   │   ├─→ LicenseGate.Validate(ctx, agentID)
   │   │   │
   │   │   ├─→ Check cached lease (if valid, use it)
   │   │   │   ├─→ validateCachedLease()
   │   │   │   │   ├─→ POST /api/v1/licenses/validate-lease
   │   │   │   │   ├─→ Include: lease_token, cluster_id, agent_id
   │   │   │   │   └─→ Response: valid, remaining_replicas, expires_at
   │   │   │   │
   │   │   │   └─→ Cache hit? → SUCCESS
   │   │   │
   │   │   ├─→ Cache miss? → Request new lease
   │   │   │   │
   │   │   │   ├─→ breaker.Execute() [Circuit Breaker]
   │   │   │   │   │
   │   │   │   │   ├─→ requestOrRenewLease()
   │   │   │   │   │   ├─→ POST /api/v1/licenses/{id}/cluster-lease
   │   │   │   │   │   ├─→ Request: cluster_id, requested_replicas, duration_minutes
   │   │   │   │   │   └─→ Response: lease_token, max_replicas, expires_at, lease_id
   │   │   │   │   │
   │   │   │   │   ├─→ validateLease(lease, agentID)
   │   │   │   │   │   └─→ POST /api/v1/licenses/validate-lease
   │   │   │   │   │
   │   │   │   │   └─→ storeLease() → Cache the valid lease
   │   │   │   │
   │   │   │   └─→ Extend grace period (90s)
   │   │   │
   │   │   └─→ Validation failed?
   │   │       │
   │   │       ├─→ In grace period? → Log warning, ALLOW startup
   │   │       └─→ Outside grace period? → RETURN ERROR
   │   │
   │   └─→ Fallback to validateLegacy() on LicenseGate failure
   │       │
   │       ├─→ POST /v1/license/activate
   │       ├─→ Request: license_id, cluster_id, metadata
   │       └─→ Response: validation result
   │
   └─→ Return validation result

3. Result Handling (in runtime/shared.go)
   ├─→ SUCCESS → Log "✅ License validation successful"
   │             → Continue initialization
   └─→ FAILURE → return error → CHORUS EXITS IMMEDIATELY
```
---
## Component Details
### 1. Validator Component
**File**: `internal/licensing/validator.go`
The Validator is the primary component for license validation, providing communication with the KACHING license authority.
#### Key Methods
##### NewValidator(config LicenseConfig)
```go
func NewValidator(config LicenseConfig) *Validator
```
Creates a new license validator with:
- HTTP client with 30-second timeout
- Default KACHING URL if not specified
- Initialized LicenseGate for enhanced validation
##### Validate()
```go
func (v *Validator) Validate() error
```
Performs license validation with KACHING authority:
- Validates required configuration fields
- Uses LicenseGate for cached/enhanced validation
- Falls back to legacy validation if needed
- Returns error if validation fails
##### validateLegacy()
```go
func (v *Validator) validateLegacy() error
```
Legacy validation method (fallback):
- Direct HTTP POST to `/v1/license/activate`
- Sends license metadata (product, version, container flag)
- **Fail-closed**: Network error = validation failure
- Parses and validates response status
**Request Example**:
```json
{
  "license_id": "lic_abc123",
  "cluster_id": "cluster_xyz789",
  "metadata": {
    "product": "CHORUS",
    "version": "0.1.0-dev",
    "container": "true"
  }
}
```
**Response Example (Success)**:
```json
{
  "status": "ok",
  "message": "License valid",
  "expires_at": "2025-12-31T23:59:59Z"
}
```
**Response Example (Failure)**:
```json
{
  "status": "error",
  "message": "License expired"
}
```
---
### 2. LicenseGate Component
**File**: `internal/licensing/license_gate.go`
Enhanced license validation with caching, circuit breaker, and cluster lease management for production scalability.
#### Key Features
- **Caching**: Stores valid lease tokens to reduce KACHING load
- **Circuit Breaker**: Prevents cascade failures during KACHING outages
- **Grace Period**: 90-second startup grace period for transient failures
- **Cluster Leases**: Supports multi-replica deployments with lease tokens
- **Burst Protection**: Rate limiting and retry logic
#### Data Structures
##### cachedLease
```go
type cachedLease struct {
	LeaseToken string    `json:"lease_token"`
	ExpiresAt  time.Time `json:"expires_at"`
	ClusterID  string    `json:"cluster_id"`
	Valid      bool      `json:"valid"`
	CachedAt   time.Time `json:"cached_at"`
}
```
**Lease Validation**:
- Lease considered invalid 2 minutes before actual expiry (safety margin)
- Invalid leases are evicted from cache automatically
##### LeaseRequest
```go
type LeaseRequest struct {
	ClusterID         string `json:"cluster_id"`
	RequestedReplicas int    `json:"requested_replicas"`
	DurationMinutes   int    `json:"duration_minutes"`
}
```
##### LeaseResponse
```go
type LeaseResponse struct {
	LeaseToken  string    `json:"lease_token"`
	MaxReplicas int       `json:"max_replicas"`
	ExpiresAt   time.Time `json:"expires_at"`
	ClusterID   string    `json:"cluster_id"`
	LeaseID     string    `json:"lease_id"`
}
```
##### LeaseValidationRequest
```go
type LeaseValidationRequest struct {
	LeaseToken string `json:"lease_token"`
	ClusterID  string `json:"cluster_id"`
	AgentID    string `json:"agent_id"`
}
```
##### LeaseValidationResponse
```go
type LeaseValidationResponse struct {
	Valid             bool      `json:"valid"`
	RemainingReplicas int       `json:"remaining_replicas"`
	ExpiresAt         time.Time `json:"expires_at"`
}
```
#### Circuit Breaker Configuration
```go
breakerSettings := gobreaker.Settings{
	Name:        "license-validation",
	MaxRequests: 3,                // Allow 3 requests in half-open state
	Interval:    60 * time.Second, // Reset failure count every minute
	Timeout:     30 * time.Second, // Stay open for 30 seconds
	ReadyToTrip: func(counts gobreaker.Counts) bool {
		return counts.ConsecutiveFailures >= 3 // Trip after 3 failures
	},
	OnStateChange: func(name string, from, to gobreaker.State) {
		fmt.Printf("🔌 License validation circuit breaker: %s -> %s\n", from, to)
	},
}
```
**Circuit Breaker States**:
| State | Behavior | Transition |
|-------|----------|------------|
| **Closed** | Normal operation, requests pass through | 3 consecutive failures → **Open** |
| **Open** | All requests fail immediately (30s) | After timeout → **Half-Open** |
| **Half-Open** | Allow up to 3 test requests | 3 consecutive successes → **Closed**, any failure → **Open** |
#### Key Methods
##### NewLicenseGate(config LicenseConfig)
```go
func NewLicenseGate(config LicenseConfig) *LicenseGate
```
Initializes license gate with:
- Circuit breaker with production settings
- HTTP client with 10-second timeout
- 90-second grace period from startup
##### Validate(ctx context.Context, agentID string)
```go
func (g *LicenseGate) Validate(ctx context.Context, agentID string) error
```
Primary validation method:
1. **Check cache**: If valid cached lease exists, validate it
2. **Cache miss**: Request new lease through circuit breaker
3. **Store result**: Cache successful lease for future requests
4. **Grace period**: Allow startup during grace period even if validation fails
5. **Extend grace**: Extend grace period on successful validation
##### validateCachedLease(ctx, lease, agentID)
```go
func (g *LicenseGate) validateCachedLease(ctx context.Context, lease *cachedLease, agentID string) error
```
Validates cached lease token:
- POST to `/api/v1/licenses/validate-lease`
- Invalidates cache if validation fails
- Returns error if lease is no longer valid
##### requestOrRenewLease(ctx)
```go
func (g *LicenseGate) requestOrRenewLease(ctx context.Context) (*LeaseResponse, error)
```
Requests new cluster lease:
- POST to `/api/v1/licenses/{license_id}/cluster-lease`
- Default: 1 replica, 60-minute duration
- Handles rate limiting (429 Too Many Requests)
- Returns lease token and metadata
##### GetCacheStats()
```go
func (g *LicenseGate) GetCacheStats() map[string]interface{}
```
Returns cache statistics for monitoring:
```json
{
  "cache_valid": true,
  "cache_hit": true,
  "expires_at": "2025-09-30T15:30:00Z",
  "cached_at": "2025-09-30T14:30:00Z",
  "in_grace_period": false,
  "breaker_state": "closed",
  "grace_until": "2025-09-30T14:31:30Z"
}
```
---
## KACHING Server Integration
### API Endpoints
#### 1. Legacy Activation Endpoint
**Endpoint**: `POST /v1/license/activate`
**Purpose**: Legacy license validation (fallback)
**Request**:
```json
{
  "license_id": "lic_abc123",
  "cluster_id": "cluster_xyz789",
  "metadata": {
    "product": "CHORUS",
    "version": "0.1.0-dev",
    "container": "true"
  }
}
```
**Response (Success)**:
```json
{
  "status": "ok",
  "message": "License valid",
  "expires_at": "2025-12-31T23:59:59Z"
}
```
**Response (Failure)**:
```json
{
  "status": "error",
  "message": "License expired"
}
```
#### 2. Cluster Lease Endpoint
**Endpoint**: `POST /api/v1/licenses/{license_id}/cluster-lease`
**Purpose**: Request cluster deployment lease
**Request**:
```json
{
  "cluster_id": "cluster_xyz789",
  "requested_replicas": 1,
  "duration_minutes": 60
}
```
**Response (Success)**:
```json
{
  "lease_token": "lease_def456",
  "max_replicas": 5,
  "expires_at": "2025-09-30T15:30:00Z",
  "cluster_id": "cluster_xyz789",
  "lease_id": "lease_def456"
}
```
**Response (Rate Limited)**:
```
HTTP 429 Too Many Requests
Retry-After: 60
```
#### 3. Lease Validation Endpoint
**Endpoint**: `POST /api/v1/licenses/validate-lease`
**Purpose**: Validate lease token for agent startup
**Request**:
```json
{
  "lease_token": "lease_def456",
  "cluster_id": "cluster_xyz789",
  "agent_id": "agent_001"
}
```
**Response (Success)**:
```json
{
  "valid": true,
  "remaining_replicas": 4,
  "expires_at": "2025-09-30T15:30:00Z"
}
```
**Response (Invalid)**:
```json
{
  "valid": false,
  "remaining_replicas": 0,
  "expires_at": "2025-09-30T14:30:00Z"
}
```
---
## Validation Sequence Diagram
```
┌─────────┐         ┌───────────┐          ┌──────────────┐         ┌──────────┐
│ CHORUS  │         │ Validator │          │ LicenseGate  │         │ KACHING  │
│ Runtime │         │           │          │              │         │  Server  │
└────┬────┘         └─────┬─────┘          └──────┬───────┘         └────┬─────┘
     │                    │                       │                      │
     │ InitializeRuntime()│                       │                      │
     │───────────────────>│                       │                      │
     │                    │                       │                      │
     │                    │ Validate()            │                      │
     │                    │──────────────────────>│                      │
     │                    │                       │                      │
     │                    │                       │ Check cache          │
     │                    │                       │────────┐             │
     │                    │                       │        │             │
     │                    │                       │<───────┘             │
     │                    │                       │                      │
     │                    │                       │ Cache miss           │
     │                    │                       │                      │
     │                    │                       │ POST /cluster-lease  │
     │                    │                       │─────────────────────>│
     │                    │                       │                      │
     │                    │                       │ Lease Response       │
     │                    │                       │<─────────────────────│
     │                    │                       │                      │
     │                    │                       │ POST /validate-lease │
     │                    │                       │─────────────────────>│
     │                    │                       │                      │
     │                    │                       │ Validation Response  │
     │                    │                       │<─────────────────────│
     │                    │                       │                      │
     │                    │                       │ Store in cache       │
     │                    │                       │────────┐             │
     │                    │                       │        │             │
     │                    │                       │<───────┘             │
     │                    │                       │                      │
     │                    │ SUCCESS               │                      │
     │                    │<──────────────────────│                      │
     │                    │                       │                      │
     │ Continue startup   │                       │                      │
     │<───────────────────│                       │                      │
     │                    │                       │                      │
┌──────────────────────────────────────────────────────────────────────────────┐
│                               FAILURE SCENARIO                               │
└──────────────────────────────────────────────────────────────────────────────┘
     │                    │                       │                      │
     │                    │                       │ POST /validate-lease │
     │                    │                       │─────────────────────>│
     │                    │                       │                      │
     │                    │                       │ INVALID LICENSE      │
     │                    │                       │<─────────────────────│
     │                    │                       │                      │
     │                    │                       │ Check grace period   │
     │                    │                       │────────┐             │
     │                    │                       │        │             │
     │                    │                       │<───────┘             │
     │                    │                       │                      │
     │                    │                       │ Outside grace period │
     │                    │                       │                      │
     │                    │ ERROR                 │                      │
     │                    │<──────────────────────│                      │
     │                    │                       │                      │
     │ return error       │                       │                      │
     │<───────────────────│                       │                      │
     │                    │                       │                      │
     │ EXIT               │                       │                      │
     │────────X           │                       │                      │
```
---
## Error Handling
### Error Categories
#### 1. Configuration Errors
**Condition**: Missing required configuration fields
```go
if v.config.LicenseID == "" || v.config.ClusterID == "" {
	return fmt.Errorf("license ID and cluster ID are required")
}
```
**Result**: Immediate validation failure → CHORUS exits
#### 2. Network Errors
**Condition**: Cannot contact KACHING server
```go
resp, err := v.client.Post(licenseURL, "application/json", bytes.NewReader(requestBody))
if err != nil {
	// FAIL-CLOSED: No network = No license = No operation
	return fmt.Errorf("unable to contact license authority: %w", err)
}
```
**Result**:
- Outside grace period: Immediate validation failure → CHORUS exits
- Inside grace period: Log warning, allow startup
**Fail-Closed Behavior**: Once the grace period has expired, network unavailability never permits startup; there is no offline bypass.
#### 3. Invalid License Errors
**Condition**: KACHING rejects license
```go
if resp.StatusCode != http.StatusOK {
	message := "license validation failed"
	if msg, ok := licenseResponse["message"].(string); ok {
		message = msg
	}
	return fmt.Errorf("license validation failed: %s", message)
}
```
**Possible Messages**:
- "License expired"
- "License revoked"
- "License not found"
- "Cluster ID mismatch"
- "Maximum nodes exceeded"
**Result**: Immediate validation failure → CHORUS exits
#### 4. Rate Limiting Errors
**Condition**: Too many requests to KACHING
```go
if resp.StatusCode == http.StatusTooManyRequests {
	return nil, fmt.Errorf("rate limited by KACHING, retry after: %s",
		resp.Header.Get("Retry-After"))
}
```
**Result**:
- Circuit breaker may trip after repeated rate limiting
- Grace period allows startup if rate limiting is transient
#### 5. Circuit Breaker Errors
**Condition**: Circuit breaker is open (too many failures)
**Result**:
- All requests fail immediately
- Grace period allows startup if breaker trips during initialization
- Circuit breaker auto-recovers after timeout (30s)
---
## Error Messages Reference
### User-Facing Error Messages
| Error Message | Cause | Resolution |
|--------------|-------|------------|
| `license ID and cluster ID are required` | Missing configuration | Set `CHORUS_LICENSE_ID` and `CHORUS_CLUSTER_ID` |
| `unable to contact license authority` | Network error | Check KACHING server accessibility |
| `license validation failed: License expired` | Expired license | Renew license with vendor |
| `license validation failed: License revoked` | Revoked license | Contact vendor |
| `license validation failed: Cluster ID mismatch` | Wrong cluster | Use correct cluster configuration |
| `rate limited by KACHING` | Too many requests | Wait for rate limit reset |
| `lease token is invalid` | Expired or invalid lease | System will auto-request new lease |
| `lease validation failed with status 404` | Lease not found | System will auto-request new lease |
| `License validation failed but in grace period` | Transient failure during startup | System continues with warning |
---
## Grace Period Mechanism
### Purpose
The grace period allows CHORUS to start even when license validation temporarily fails, preventing service disruption due to transient network issues or KACHING server maintenance.
### Behavior
- **Duration**: 90 seconds from startup
- **Triggered**: When validation fails but grace period is active
- **Effect**: Validation returns success with warning log
- **Extension**: Each successful validation resets the deadline to 90 seconds from that moment (for example, a success at T+30s moves it to T+120s)
- **Expiry**: After grace period expires, validation failures cause immediate exit
### Grace Period States
```
┌──────────────────────────────────────────────────────────────┐
│                    Grace Period Timeline                     │
└──────────────────────────────────────────────────────────────┘

T+0s    ┌─────────────────────────────────────────────┐
        │ GRACE PERIOD ACTIVE (90s)                   │
        │ Validation failures allowed with warning    │
        └─────────────────────────────────────────────┘

T+30s   │ Validation SUCCESS
        └──> Grace period extended to T+120s

T+90s   │ Grace period expires (no successful validation)
        └──> Next validation failure causes exit

T+120s  │ (Extended) Grace period expires
        └──> Next validation failure causes exit
```
### Implementation
```go
// Initialize grace period at startup
func NewLicenseGate(config LicenseConfig) *LicenseGate {
	gate := &LicenseGate{...}
	gate.graceUntil.Store(time.Now().Add(90 * time.Second))
	return gate
}

// Check grace period during validation
if err != nil {
	if g.isInGracePeriod() {
		fmt.Printf("⚠️ License validation failed but in grace period: %v\n", err)
		return nil // Allow startup
	}
	return fmt.Errorf("license validation failed: %w", err)
}

// Extend grace period on success
g.extendGracePeriod() // Adds 90s to current time
```
---
## Startup Integration
### Location
**File**: `internal/runtime/shared.go`
**Function**: `InitializeRuntime()`
### Integration Point
```go
func InitializeRuntime(cfg *config.CHORUSConfig) (*RuntimeContext, error) {
	// ... early initialization ...

	// CRITICAL: Validate license before any P2P operations
	runtime.Logger.Info("🔐 Validating CHORUS license with KACHING...")
	licenseValidator := licensing.NewValidator(licensing.LicenseConfig{
		LicenseID:  cfg.License.LicenseID,
		ClusterID:  cfg.License.ClusterID,
		KachingURL: cfg.License.KachingURL,
	})
	if err := licenseValidator.Validate(); err != nil {
		// This error causes InitializeRuntime to return error
		// which causes main() to exit immediately
		return nil, fmt.Errorf("license validation failed: %v", err)
	}
	runtime.Logger.Info("✅ License validation successful - CHORUS authorized to run")

	// ... continue with P2P, AI provider initialization, etc ...
}
```
### Execution Order
```
1. Load configuration from YAML
2. Initialize logger
3. ⚠️ VALIDATE LICENSE ⚠️
   └─→ FAILURE → return error → main() exits
4. Initialize AI provider
5. Initialize metrics collector
6. Initialize SHHH sentinel
7. Initialize P2P network
8. Start HAP server
9. Enter main runtime loop
```
**Critical Note**: License validation occurs **BEFORE** any P2P networking or AI provider initialization. If validation fails, the only outbound traffic is the validation request to KACHING itself: no P2P connections are made and no services are started.
---
## Configuration Examples
### Minimal Configuration
```yaml
license:
  license_id: "lic_abc123"
  cluster_id: "cluster_xyz789"
```
When `kaching_url` is omitted, the KACHING URL defaults to `http://localhost:8083`.
### Production Configuration
```yaml
license:
  license_id: "lic_prod_abc123"
  cluster_id: "cluster_production_xyz789"
  kaching_url: "https://kaching.chorus.services"
  organization_name: "Acme Corporation"
  license_type: "enterprise"
  max_nodes: 10
```
### Development Configuration
```yaml
license:
  license_id: "lic_dev_abc123"
  cluster_id: "cluster_dev_local"
  kaching_url: "http://localhost:8083"
  organization_name: "Development Team"
  license_type: "developer"
  max_nodes: 1
```
### Environment Variables
Licensing configuration can also be set via environment variables:
```bash
export CHORUS_LICENSE_ID="lic_abc123"
export CHORUS_CLUSTER_ID="cluster_xyz789"
export CHORUS_KACHING_URL="http://localhost:8083"
```
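Reading these variables might look like the following. This is a sketch, not the CHORUS configuration loader: `fromEnv` is a hypothetical helper, and how environment values interact with the YAML file is not covered here.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

type LicenseConfig struct {
	LicenseID  string
	ClusterID  string
	KachingURL string
}

// fromEnv builds a config from the documented variables. getenv is injected
// (normally os.Getenv) so the lookup is easy to test.
func fromEnv(getenv func(string) string) (LicenseConfig, error) {
	cfg := LicenseConfig{
		LicenseID:  getenv("CHORUS_LICENSE_ID"),
		ClusterID:  getenv("CHORUS_CLUSTER_ID"),
		KachingURL: getenv("CHORUS_KACHING_URL"),
	}
	if cfg.LicenseID == "" || cfg.ClusterID == "" {
		return LicenseConfig{}, errors.New("license ID and cluster ID are required")
	}
	if cfg.KachingURL == "" {
		cfg.KachingURL = "http://localhost:8083" // documented default
	}
	return cfg, nil
}

func main() {
	cfg, err := fromEnv(os.Getenv)
	if err != nil {
		fmt.Println("startup aborted:", err)
		return
	}
	fmt.Println("using KACHING at", cfg.KachingURL)
}
```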
---
## Monitoring and Observability
### Log Messages
#### Successful Validation
```
🔐 Validating CHORUS license with KACHING...
✅ License validation successful - CHORUS authorized to run
```
#### Validation with Cached Lease
```
🔐 Validating CHORUS license with KACHING...
[Using cached lease token: lease_def456]
✅ License validation successful - CHORUS authorized to run
```
#### Validation During Grace Period
```
🔐 Validating CHORUS license with KACHING...
⚠️ License validation failed but in grace period: unable to contact license authority
✅ License validation successful - CHORUS authorized to run
```
#### Circuit Breaker State Changes
```
🔌 License validation circuit breaker: closed -> open
🔌 License validation circuit breaker: open -> half-open
🔌 License validation circuit breaker: half-open -> closed
```
#### Validation Failure (Fatal)
```
🔐 Validating CHORUS license with KACHING...
❌ License validation failed: License expired
Error: license validation failed: License expired
[CHORUS exits]
```
### Cache Statistics API
```go
stats := licenseGate.GetCacheStats()
```
Returns:
```json
{
  "cache_valid": true,
  "cache_hit": true,
  "expires_at": "2025-09-30T15:30:00Z",
  "cached_at": "2025-09-30T14:30:00Z",
  "in_grace_period": false,
  "breaker_state": "closed",
  "grace_until": "2025-09-30T14:31:30Z"
}
```
### Recommended Monitoring Metrics
| Metric | Type | Description |
|--------|------|-------------|
| `license_validation_success` | Counter | Successful validations |
| `license_validation_failure` | Counter | Failed validations |
| `license_validation_duration_ms` | Histogram | Validation latency |
| `license_cache_hit_rate` | Gauge | Percentage of cache hits |
| `license_grace_period_active` | Gauge | 1 if in grace period, 0 otherwise |
| `license_circuit_breaker_state` | Gauge | 0=closed, 1=half-open, 2=open |
| `license_lease_expiry_seconds` | Gauge | Seconds until lease expiry |
---
## Cluster Lease Management
### Lease Lifecycle
```
┌──────────────────────────────────────────────────────────────────┐
│                     Cluster Lease Lifecycle                      │
└──────────────────────────────────────────────────────────────────┘

1. REQUEST LEASE
   ├─→ POST /api/v1/licenses/{license_id}/cluster-lease
   ├─→ cluster_id: "cluster_xyz789"
   ├─→ requested_replicas: 1
   └─→ duration_minutes: 60

2. RECEIVE LEASE
   ├─→ lease_token: "lease_def456"
   ├─→ max_replicas: 5
   ├─→ expires_at: T+60m
   └─→ Store in cache

3. USE LEASE (per agent startup)
   ├─→ POST /api/v1/licenses/validate-lease
   ├─→ lease_token: "lease_def456"
   ├─→ cluster_id: "cluster_xyz789"
   ├─→ agent_id: "agent_001"
   └─→ Decrements remaining_replicas

4. LEASE EXPIRY
   ├─→ Cache invalidated at T+58m (2min safety margin)
   └─→ Next validation requests new lease

5. LEASE RENEWAL
   └─→ Automatic on cache invalidation
```
### Multi-Replica Support
The lease system supports multiple CHORUS agent replicas:
- **max_replicas**: Maximum concurrent agents allowed
- **remaining_replicas**: Available agent slots
- **agent_id**: Unique identifier for each agent instance
**Example**: License allows 5 replicas
- Request lease → `max_replicas: 5`
- Agent 1 validates → `remaining_replicas: 4`
- Agent 2 validates → `remaining_replicas: 3`
- Agent 6 validates → **FAILURE** (exceeds max_replicas)
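The replica arithmetic in this example can be reproduced with a toy model. It is purely illustrative: how KACHING tracks agents internally is not described here, so the `leaseSlots` type is hypothetical, as is the assumption that re-validating a known agent does not consume another slot.

```go
package main

import "fmt"

// leaseSlots is a hypothetical model of replica accounting on the license
// authority side.
type leaseSlots struct {
	maxReplicas int
	agents      map[string]bool
}

// validate admits an agent if a slot is free and returns the slots left.
// A known agent is re-admitted without using another slot (an assumption).
func (l *leaseSlots) validate(agentID string) (remaining int, ok bool) {
	if !l.agents[agentID] {
		if len(l.agents) >= l.maxReplicas {
			return 0, false // every slot is taken
		}
		l.agents[agentID] = true
	}
	return l.maxReplicas - len(l.agents), true
}

func main() {
	lease := &leaseSlots{maxReplicas: 5, agents: map[string]bool{}}
	for i := 1; i <= 6; i++ {
		id := fmt.Sprintf("agent_%03d", i)
		remaining, ok := lease.validate(id)
		fmt.Println(id, remaining, ok)
	}
	// agent_001 through agent_005 are admitted with 4, 3, 2, 1, 0 slots left;
	// agent_006 is rejected.
}
```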
---
## Security Considerations
### Fail-Closed Architecture
The licensing system implements **fail-closed** security:
- ✅ Network unavailable → Validation fails → CHORUS exits (unless in grace period)
- ✅ KACHING server down → Validation fails → CHORUS exits (unless in grace period)
- ✅ Invalid license → Validation fails → CHORUS exits (no grace period)
- ✅ Expired license → Validation fails → CHORUS exits (no grace period)
- ❌ No "development mode" bypass
- ❌ No "skip validation" flag
### Grace Period Security
The grace period is designed for transient failures, NOT as a bypass:
- Limited to 90 seconds initially
- Only extends on successful validation
- Does NOT apply to invalid/expired licenses
- Primarily for network/KACHING server availability issues
### License Token Security
- Lease tokens are short-lived (default: 60 minutes)
- Tokens cached in memory only (not persisted to disk)
- Tokens include cluster_id binding (cannot be used by other clusters)
- Agent ID tracking prevents token sharing between agents
### Network Security
- HTTPS recommended for production KACHING URLs
- HTTP client timeouts (30 seconds for the Validator, 10 seconds for the LicenseGate) prevent hanging on network issues
- Circuit breaker prevents cascade failures
---
## Troubleshooting
### Issue: "license ID and cluster ID are required"
**Cause**: Missing configuration
**Resolution**:
```yaml
# config.yml
license:
  license_id: "your_license_id"
  cluster_id: "your_cluster_id"
```
Or via environment:
```bash
export CHORUS_LICENSE_ID="your_license_id"
export CHORUS_CLUSTER_ID="your_cluster_id"
```
---
### Issue: "unable to contact license authority"
**Cause**: KACHING server unreachable
**Resolution**:
1. Verify KACHING server is running
2. Check network connectivity: `curl http://localhost:8083/health`
3. Verify `kaching_url` configuration
4. Check firewall rules
5. If transient, grace period allows startup
---
### Issue: "license validation failed: License expired"
**Cause**: License has expired
**Resolution**:
1. Contact license vendor to renew
2. Update license_id in configuration
3. Restart CHORUS
**Note**: Grace period does NOT apply to expired licenses
---
### Issue: "rate limited by KACHING"
**Cause**: Too many validation requests
**Resolution**:
1. Check for rapid restart loops
2. Verify cache is working (should reduce requests)
3. Wait for rate limit reset (check Retry-After header)
4. Consider increasing lease duration_minutes
---
### Issue: Circuit breaker stuck in "open" state
**Cause**: Repeated validation failures
**Resolution**:
1. Check KACHING server health
2. Verify license configuration
3. Circuit breaker auto-recovers after 30 seconds
4. Check grace period status: may allow startup during recovery
---
### Issue: "lease token is invalid"
**Cause**: Lease expired or revoked
**Resolution**:
- System should auto-request new lease
- If persistent, check license status with vendor
- Verify cluster_id matches license configuration
---
## Testing
### Unit Testing
```go
import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"

	// assert is assumed to be testify's assert package.
	"github.com/stretchr/testify/assert"
	// Also import the internal/licensing package via this repository's module path.
)

// Test license validation success
func TestValidatorSuccess(t *testing.T) {
	// Mock KACHING server
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		json.NewEncoder(w).Encode(map[string]interface{}{
			"status":  "ok",
			"message": "License valid",
		})
	}))
	defer server.Close()

	validator := licensing.NewValidator(licensing.LicenseConfig{
		LicenseID:  "test_license",
		ClusterID:  "test_cluster",
		KachingURL: server.URL,
	})
	err := validator.Validate()
	assert.NoError(t, err)
}

// Test license validation failure
func TestValidatorFailure(t *testing.T) {
	// Mock KACHING server that rejects every request
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusForbidden)
		json.NewEncoder(w).Encode(map[string]interface{}{
			"status":  "error",
			"message": "License expired",
		})
	}))
	defer server.Close()

	validator := licensing.NewValidator(licensing.LicenseConfig{
		LicenseID:  "test_license",
		ClusterID:  "test_cluster",
		KachingURL: server.URL,
	})
	err := validator.Validate()
	assert.Error(t, err)
	assert.Contains(t, err.Error(), "License expired")
}
```
### Integration Testing
```bash
# Start KACHING test server (detached, so the shell stays free for the next steps)
docker run -d -p 8083:8083 kaching:latest

# Test CHORUS startup with valid license
export CHORUS_LICENSE_ID="test_lic_123"
export CHORUS_CLUSTER_ID="test_cluster"
./chorus-agent
# Expected output:
# 🔐 Validating CHORUS license with KACHING...
# ✅ License validation successful - CHORUS authorized to run

# Test CHORUS startup with invalid license
export CHORUS_LICENSE_ID="invalid_license"
./chorus-agent
# Expected output:
# 🔐 Validating CHORUS license with KACHING...
# ❌ License validation failed: License not found
# Error: license validation failed: License not found
# [Exit code 1]
```
---
## Future Enhancements
### Planned Features
1. **Offline License Support**
   - JWT-based license files for air-gapped deployments
   - Signature verification without KACHING connectivity
2. **License Renewal Automation**
   - Background renewal of expiring licenses
   - Alert system for upcoming expirations
3. **Multi-License Support**
   - Support for multiple license tiers
   - Feature flag based on license type
4. **License Analytics**
   - Usage metrics reporting to KACHING
   - License utilization dashboards
5. **Enhanced Lease Management**
   - Lease renewal before expiry
   - Dynamic replica scaling based on license
---
## API Constants
### Timeouts
```go
const (
	DefaultKachingURL = "http://localhost:8083"
	LicenseTimeout    = 30 * time.Second // Validator HTTP timeout
	GateCTimeout      = 10 * time.Second // LicenseGate HTTP timeout
)
```
### Grace Period
```go
const (
	GracePeriodDuration = 90 * time.Second
)
```
### Circuit Breaker
```go
const (
	MaxRequests          = 3                // Half-open state test requests
	FailureThreshold     = 3                // Consecutive failures to trip
	CircuitTimeout       = 30 * time.Second // Open state duration
	FailureResetInterval = 60 * time.Second // Failure count reset
)
```
### Lease Safety Margin
```go
const (
	LeaseSafetyMargin = 2 * time.Minute // Cache invalidation before expiry
)
```
---
## Related Documentation
- **KACHING License Server**: See KACHING documentation for server setup and API details
- **CHORUS Configuration**: `/docs/comprehensive/pkg/config.md`
- **CHORUS Runtime**: `/docs/comprehensive/internal/runtime.md`
- **Deployment Guide**: `/docs/deployment.md`
---
## Summary
The CHORUS licensing system provides robust, fail-closed license enforcement through integration with the KACHING license authority. Key characteristics:
- **Mandatory**: License validation is required at startup
- **Fail-Closed**: Invalid license or network failure prevents startup (outside grace period)
- **Cached**: Lease tokens cached to reduce KACHING load
- **Resilient**: Circuit breaker and grace period handle transient failures
- **Scalable**: Cluster lease system supports multi-replica deployments
- **Secure**: No bypass mechanisms, short-lived tokens, cluster binding
The system ensures that all running CHORUS instances are properly licensed while providing operational flexibility through caching and grace periods for transient failures.