# CHORUS License Validation System

**Package**: `internal/licensing`
**Purpose**: KACHING license authority integration with fail-closed validation
**Critical**: License validation is **MANDATORY** at startup - invalid license = immediate exit

## Overview

The CHORUS licensing system enforces software licensing through integration with the KACHING license authority. The system implements a **fail-closed** security model: if license validation fails, CHORUS will not start. This ensures that all running instances are properly licensed and authorized.

### Key Components

- **Validator**: Core license validation with KACHING server communication
- **LicenseGate**: Enhanced validation with caching, circuit breaker, and cluster lease management
- **LicenseConfig**: Configuration structure for licensing parameters

### Security Model: FAIL-CLOSED

```
┌─────────────────────────────────────────────────────────────┐
│                       CHORUS STARTUP                        │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Load Configuration                                      │
│  2. Initialize Logger                                       │
│  3. ⚠️ VALIDATE LICENSE (CRITICAL GATE)                     │
│         │                                                   │
│         ├─── SUCCESS ──→ Continue startup                   │
│         │                                                   │
│         └─── FAILURE ──→ return error → IMMEDIATE EXIT      │
│                                                             │
│  4. Initialize AI Provider                                  │
│  5. Start P2P Network                                       │
│  6. ... rest of initialization                              │
│                                                             │
└─────────────────────────────────────────────────────────────┘

NO BYPASS: License validation cannot be skipped or bypassed.
```

---

## Architecture

### 1. LicenseConfig Structure

**Location**: `internal/licensing/validator.go` and `pkg/config/config.go`

```go
// Core licensing configuration
type LicenseConfig struct {
    LicenseID  string // Unique license identifier
    ClusterID  string // Cluster/deployment identifier
    KachingURL string // KACHING server URL
}

// Extended configuration in pkg/config
type LicenseConfig struct {
    LicenseID        string    `yaml:"license_id"`
    ClusterID        string    `yaml:"cluster_id"`
    OrganizationName string    `yaml:"organization_name"`
    KachingURL       string    `yaml:"kaching_url"`
    IsActive         bool      `yaml:"is_active"`
    LastValidated    time.Time `yaml:"last_validated"`
    GracePeriodHours int       `yaml:"grace_period_hours"`
    LicenseType      string    `yaml:"license_type"`
    ExpiresAt        time.Time `yaml:"expires_at"`
    MaxNodes         int       `yaml:"max_nodes"`
}
```

**Configuration Fields**:

| Field | Required | Purpose |
|-------|----------|---------|
| `LicenseID` | ✅ Yes | Unique identifier for the license |
| `ClusterID` | ✅ Yes | Identifies the cluster/deployment |
| `KachingURL` | No | KACHING server URL (defaults to `http://localhost:8083`) |
| `OrganizationName` | No | Organization name for tracking |
| `LicenseType` | No | Type of license (e.g., "enterprise", "developer") |
| `ExpiresAt` | No | License expiration timestamp |
| `MaxNodes` | No | Maximum nodes allowed in cluster |

---

## Validation Flow

### Standard Validation Sequence

```
┌──────────────────────────────────────────────────────────────────┐
│                     License Validation Flow                      │
└──────────────────────────────────────────────────────────────────┘

1. NewValidator(config) → Initialize Validator
   │
   ├─→ Set KachingURL (default: http://localhost:8083)
   ├─→ Create HTTP client (timeout: 30s)
   └─→ Initialize LicenseGate
       │
       └─→ Initialize Circuit Breaker
       └─→ Set Grace Period (90 seconds from start)

2. Validate() → Perform validation
   │
   ├─→ ValidateWithContext(ctx)
   │   │
   │   ├─→ Check required fields (LicenseID, ClusterID)
   │   │
   │   ├─→ LicenseGate.Validate(ctx, agentID)
   │   │   │
   │   │   ├─→ Check cached lease (if valid, use it)
   │   │   │   ├─→ validateCachedLease()
   │   │   │   │   ├─→ POST /api/v1/licenses/validate-lease
   │   │   │   │   ├─→ Include: lease_token, cluster_id, agent_id
   │   │   │   │   └─→ Response: valid, remaining_replicas, expires_at
   │   │   │   │
   │   │   │   └─→ Cache hit? → SUCCESS
   │   │   │
   │   │   ├─→ Cache miss? → Request new lease
   │   │   │   │
   │   │   │   ├─→ breaker.Execute() [Circuit Breaker]
   │   │   │   │   │
   │   │   │   │   ├─→ requestOrRenewLease()
   │   │   │   │   │   ├─→ POST /api/v1/licenses/{id}/cluster-lease
   │   │   │   │   │   ├─→ Request: cluster_id, requested_replicas, duration_minutes
   │   │   │   │   │   └─→ Response: lease_token, max_replicas, expires_at, lease_id
   │   │   │   │   │
   │   │   │   │   ├─→ validateLease(lease, agentID)
   │   │   │   │   │   └─→ POST /api/v1/licenses/validate-lease
   │   │   │   │   │
   │   │   │   │   └─→ storeLease() → Cache the valid lease
   │   │   │   │
   │   │   │   └─→ Extend grace period (90s)
   │   │   │
   │   │   └─→ Validation failed?
   │   │       │
   │   │       ├─→ In grace period? → Log warning, ALLOW startup
   │   │       └─→ Outside grace period? → RETURN ERROR
   │   │
   │   └─→ Fallback to validateLegacy() on LicenseGate failure
   │       │
   │       ├─→ POST /v1/license/activate
   │       ├─→ Request: license_id, cluster_id, metadata
   │       └─→ Response: validation result
   │
   └─→ Return validation result

3. Result Handling (in runtime/shared.go)
   │
   ├─→ SUCCESS → Log "✅ License validation successful"
   │            → Continue initialization
   │
   └─→ FAILURE → return error → CHORUS EXITS IMMEDIATELY
```

---

## Component Details

### 1. Validator Component

**File**: `internal/licensing/validator.go`

The Validator is the primary component for license validation, providing communication with the KACHING license authority.

#### Key Methods

##### NewValidator(config LicenseConfig)

```go
func NewValidator(config LicenseConfig) *Validator
```

Creates a new license validator with:
- HTTP client with 30-second timeout
- Default KACHING URL if not specified
- Initialized LicenseGate for enhanced validation

##### Validate()

```go
func (v *Validator) Validate() error
```

Performs license validation with KACHING authority:
- Validates required configuration fields
- Uses LicenseGate for cached/enhanced validation
- Falls back to legacy validation if needed
- Returns error if validation fails

##### validateLegacy()

```go
func (v *Validator) validateLegacy() error
```

Legacy validation method (fallback):
- Direct HTTP POST to `/v1/license/activate`
- Sends license metadata (product, version, container flag)
- **Fail-closed**: Network error = validation failure
- Parses and validates response status

**Request Example**:

```json
{
  "license_id": "lic_abc123",
  "cluster_id": "cluster_xyz789",
  "metadata": {
    "product": "CHORUS",
    "version": "0.1.0-dev",
    "container": "true"
  }
}
```

**Response Example (Success)**:

```json
{
  "status": "ok",
  "message": "License valid",
  "expires_at": "2025-12-31T23:59:59Z"
}
```

**Response Example (Failure)**:

```json
{
  "status": "error",
  "message": "License expired"
}
```

---

### 2. LicenseGate Component

**File**: `internal/licensing/license_gate.go`

Enhanced license validation with caching, circuit breaker, and cluster lease management for production scalability.

#### Key Features

- **Caching**: Stores valid lease tokens to reduce KACHING load
- **Circuit Breaker**: Prevents cascade failures during KACHING outages
- **Grace Period**: 90-second startup grace period for transient failures
- **Cluster Leases**: Supports multi-replica deployments with lease tokens
- **Burst Protection**: Rate limiting and retry logic

#### Data Structures

##### cachedLease

```go
type cachedLease struct {
    LeaseToken string    `json:"lease_token"`
    ExpiresAt  time.Time `json:"expires_at"`
    ClusterID  string    `json:"cluster_id"`
    Valid      bool      `json:"valid"`
    CachedAt   time.Time `json:"cached_at"`
}
```

**Lease Validation**:
- Lease considered invalid 2 minutes before actual expiry (safety margin)
- Invalid leases are evicted from cache automatically

##### LeaseRequest

```go
type LeaseRequest struct {
    ClusterID         string `json:"cluster_id"`
    RequestedReplicas int    `json:"requested_replicas"`
    DurationMinutes   int    `json:"duration_minutes"`
}
```

##### LeaseResponse

```go
type LeaseResponse struct {
    LeaseToken  string    `json:"lease_token"`
    MaxReplicas int       `json:"max_replicas"`
    ExpiresAt   time.Time `json:"expires_at"`
    ClusterID   string    `json:"cluster_id"`
    LeaseID     string    `json:"lease_id"`
}
```

##### LeaseValidationRequest

```go
type LeaseValidationRequest struct {
    LeaseToken string `json:"lease_token"`
    ClusterID  string `json:"cluster_id"`
    AgentID    string `json:"agent_id"`
}
```

##### LeaseValidationResponse

```go
type LeaseValidationResponse struct {
    Valid             bool      `json:"valid"`
    RemainingReplicas int       `json:"remaining_replicas"`
    ExpiresAt         time.Time `json:"expires_at"`
}
```

#### Circuit Breaker Configuration

```go
breakerSettings := gobreaker.Settings{
    Name:        "license-validation",
    MaxRequests: 3,                // Allow 3 requests in half-open state
    Interval:    60 * time.Second, // Reset failure count every minute
    Timeout:     30 * time.Second, // Stay open for 30 seconds
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        return counts.ConsecutiveFailures >= 3 // Trip after 3 failures
    },
    OnStateChange: func(name string, from, to gobreaker.State) {
        fmt.Printf("🔌 License validation circuit breaker: %s -> %s\n", from, to)
    },
}
```

**Circuit Breaker States**:

| State | Behavior | Transition |
|-------|----------|------------|
| **Closed** | Normal operation, requests pass through | 3 consecutive failures → **Open** |
| **Open** | All requests fail immediately (30s) | After timeout → **Half-Open** |
| **Half-Open** | Allow up to 3 test requests | 3 consecutive successes → **Closed**; any failure → **Open** |
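
The transitions in the table can be traced with a small stand-alone state machine. This is a simplified model written for illustration only: it does not use `gobreaker`, it ignores the `Interval` and `Timeout` clocks (a hypothetical `timeoutElapsed` call stands in for the 30-second wait), and every identifier in it is an assumption rather than part of the CHORUS code.

```go
package main

import "fmt"

type state int

const (
    closed state = iota
    halfOpen
    open
)

func (s state) String() string {
    return [...]string{"closed", "half-open", "open"}[s]
}

const (
    tripAfter   = 3 // consecutive failures that open the breaker
    maxRequests = 3 // consecutive half-open successes that close it again
)

// breaker is a toy model of the documented state transitions.
type breaker struct {
    st                   state
    consecutiveFailures  int
    consecutiveSuccesses int
}

func (b *breaker) transition(to state) {
    b.st = to
    b.consecutiveFailures = 0
    b.consecutiveSuccesses = 0
}

// record feeds the outcome of one request into the model.
func (b *breaker) record(success bool) {
    switch b.st {
    case closed:
        if success {
            b.consecutiveFailures = 0
            return
        }
        b.consecutiveFailures++
        if b.consecutiveFailures >= tripAfter {
            b.transition(open)
        }
    case halfOpen:
        if !success {
            b.transition(open)
            return
        }
        b.consecutiveSuccesses++
        if b.consecutiveSuccesses >= maxRequests {
            b.transition(closed)
        }
    case open:
        // Requests are rejected before reaching KACHING; nothing to record.
    }
}

// timeoutElapsed stands in for the 30-second open-state timeout.
func (b *breaker) timeoutElapsed() {
    if b.st == open {
        b.transition(halfOpen)
    }
}

func main() {
    b := &breaker{}
    for i := 0; i < 3; i++ {
        b.record(false)
    }
    fmt.Println(b.st) // open
    b.timeoutElapsed()
    fmt.Println(b.st) // half-open
    for i := 0; i < 3; i++ {
        b.record(true)
    }
    fmt.Println(b.st) // closed
}
```

The printed sequence matches the state-change log lines shown under Monitoring and Observability (`closed -> open`, `open -> half-open`, `half-open -> closed`).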

#### Key Methods

##### NewLicenseGate(config LicenseConfig)

```go
func NewLicenseGate(config LicenseConfig) *LicenseGate
```

Initializes license gate with:
- Circuit breaker with production settings
- HTTP client with 10-second timeout
- 90-second grace period from startup

##### Validate(ctx context.Context, agentID string)

```go
func (g *LicenseGate) Validate(ctx context.Context, agentID string) error
```

Primary validation method:

1. **Check cache**: If valid cached lease exists, validate it
2. **Cache miss**: Request new lease through circuit breaker
3. **Store result**: Cache successful lease for future requests
4. **Grace period**: Allow startup during grace period even if validation fails
5. **Extend grace**: Extend grace period on successful validation

##### validateCachedLease(ctx, lease, agentID)

```go
func (g *LicenseGate) validateCachedLease(ctx context.Context, lease *cachedLease, agentID string) error
```

Validates cached lease token:
- POST to `/api/v1/licenses/validate-lease`
- Invalidates cache if validation fails
- Returns error if lease is no longer valid

##### requestOrRenewLease(ctx)

```go
func (g *LicenseGate) requestOrRenewLease(ctx context.Context) (*LeaseResponse, error)
```

Requests new cluster lease:
- POST to `/api/v1/licenses/{license_id}/cluster-lease`
- Default: 1 replica, 60-minute duration
- Handles rate limiting (429 Too Many Requests)
- Returns lease token and metadata

##### GetCacheStats()

```go
func (g *LicenseGate) GetCacheStats() map[string]interface{}
```

Returns cache statistics for monitoring:

```json
{
  "cache_valid": true,
  "cache_hit": true,
  "expires_at": "2025-09-30T15:30:00Z",
  "cached_at": "2025-09-30T14:30:00Z",
  "in_grace_period": false,
  "breaker_state": "closed",
  "grace_until": "2025-09-30T14:31:30Z"
}
```

---

## KACHING Server Integration

### API Endpoints

#### 1. Legacy Activation Endpoint

**Endpoint**: `POST /v1/license/activate`
**Purpose**: Legacy license validation (fallback)

**Request**:
```json
{
  "license_id": "lic_abc123",
  "cluster_id": "cluster_xyz789",
  "metadata": {
    "product": "CHORUS",
    "version": "0.1.0-dev",
    "container": "true"
  }
}
```

**Response (Success)**:
```json
{
  "status": "ok",
  "message": "License valid",
  "expires_at": "2025-12-31T23:59:59Z"
}
```

**Response (Failure)**:
```json
{
  "status": "error",
  "message": "License expired"
}
```

#### 2. Cluster Lease Endpoint

**Endpoint**: `POST /api/v1/licenses/{license_id}/cluster-lease`
**Purpose**: Request cluster deployment lease

**Request**:
```json
{
  "cluster_id": "cluster_xyz789",
  "requested_replicas": 1,
  "duration_minutes": 60
}
```

**Response (Success)**:
```json
{
  "lease_token": "lease_def456",
  "max_replicas": 5,
  "expires_at": "2025-09-30T15:30:00Z",
  "cluster_id": "cluster_xyz789",
  "lease_id": "lease_def456"
}
```

**Response (Rate Limited)**:
```
HTTP 429 Too Many Requests
Retry-After: 60
```

#### 3. Lease Validation Endpoint

**Endpoint**: `POST /api/v1/licenses/validate-lease`
**Purpose**: Validate lease token for agent startup

**Request**:
```json
{
  "lease_token": "lease_def456",
  "cluster_id": "cluster_xyz789",
  "agent_id": "agent_001"
}
```

**Response (Success)**:
```json
{
  "valid": true,
  "remaining_replicas": 4,
  "expires_at": "2025-09-30T15:30:00Z"
}
```

**Response (Invalid)**:
```json
{
  "valid": false,
  "remaining_replicas": 0,
  "expires_at": "2025-09-30T14:30:00Z"
}
```

---

## Validation Sequence Diagram

```
┌─────────┐         ┌───────────┐          ┌──────────────┐         ┌──────────┐
│ CHORUS  │         │ Validator │          │ LicenseGate  │         │ KACHING  │
│ Runtime │         │           │          │              │         │  Server  │
└────┬────┘         └─────┬─────┘          └──────┬───────┘         └────┬─────┘
     │                    │                       │                      │
     │ InitializeRuntime()│                       │                      │
     │───────────────────>│                       │                      │
     │                    │                       │                      │
     │                    │ Validate()            │                      │
     │                    │──────────────────────>│                      │
     │                    │                       │                      │
     │                    │                       │ Check cache          │
     │                    │                       │────────┐             │
     │                    │                       │        │             │
     │                    │                       │<───────┘             │
     │                    │                       │                      │
     │                    │                       │ Cache miss           │
     │                    │                       │                      │
     │                    │                       │ POST /cluster-lease  │
     │                    │                       │─────────────────────>│
     │                    │                       │                      │
     │                    │                       │ Lease Response       │
     │                    │                       │<─────────────────────│
     │                    │                       │                      │
     │                    │                       │ POST /validate-lease │
     │                    │                       │─────────────────────>│
     │                    │                       │                      │
     │                    │                       │ Validation Response  │
     │                    │                       │<─────────────────────│
     │                    │                       │                      │
     │                    │                       │ Store in cache       │
     │                    │                       │────────┐             │
     │                    │                       │        │             │
     │                    │                       │<───────┘             │
     │                    │                       │                      │
     │                    │ SUCCESS               │                      │
     │                    │<──────────────────────│                      │
     │                    │                       │                      │
     │ Continue startup   │                       │                      │
     │<───────────────────│                       │                      │
     │                    │                       │                      │

┌─────────────────────────────────────────────────────────────────────┐
│                          FAILURE SCENARIO                           │
└─────────────────────────────────────────────────────────────────────┘

     │                    │                       │                      │
     │                    │                       │ POST /validate-lease │
     │                    │                       │─────────────────────>│
     │                    │                       │                      │
     │                    │                       │ INVALID LICENSE      │
     │                    │                       │<─────────────────────│
     │                    │                       │                      │
     │                    │                       │ Check grace period   │
     │                    │                       │────────┐             │
     │                    │                       │        │             │
     │                    │                       │<───────┘             │
     │                    │                       │                      │
     │                    │                       │ Outside grace period │
     │                    │                       │                      │
     │                    │ ERROR                 │                      │
     │                    │<──────────────────────│                      │
     │                    │                       │                      │
     │ return error       │                       │                      │
     │<───────────────────│                       │                      │
     │                    │                       │                      │
     │ EXIT               │                       │                      │
     │────────X           │                       │                      │
```

---

## Error Handling

### Error Categories

#### 1. Configuration Errors

**Condition**: Missing required configuration fields

```go
if v.config.LicenseID == "" || v.config.ClusterID == "" {
    return fmt.Errorf("license ID and cluster ID are required")
}
```

**Result**: Immediate validation failure → CHORUS exits

#### 2. Network Errors

**Condition**: Cannot contact KACHING server

```go
resp, err := v.client.Post(licenseURL, "application/json", bytes.NewReader(requestBody))
if err != nil {
    // FAIL-CLOSED: No network = No license = No operation
    return fmt.Errorf("unable to contact license authority: %w", err)
}
```

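**Result**:
- Outside grace period: Immediate validation failure → CHORUS exits
- Inside grace period: Log warning, allow startup

**Fail-Closed Behavior**: Network unavailability does NOT allow a permanent bypass. Startup is tolerated only while the grace period is active; once it has elapsed, an unreachable KACHING server causes validation to fail.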

#### 3. Invalid License Errors

**Condition**: KACHING rejects license

```go
if resp.StatusCode != http.StatusOK {
    message := "license validation failed"
    if msg, ok := licenseResponse["message"].(string); ok {
        message = msg
    }
    return fmt.Errorf("license validation failed: %s", message)
}
```

**Possible Messages**:
- "License expired"
- "License revoked"
- "License not found"
- "Cluster ID mismatch"
- "Maximum nodes exceeded"

**Result**: Immediate validation failure → CHORUS exits

#### 4. Rate Limiting Errors

**Condition**: Too many requests to KACHING

```go
if resp.StatusCode == http.StatusTooManyRequests {
    return nil, fmt.Errorf("rate limited by KACHING, retry after: %s",
        resp.Header.Get("Retry-After"))
}
```

**Result**:
- Circuit breaker may trip after repeated rate limiting
- Grace period allows startup if rate limiting is transient

#### 5. Circuit Breaker Errors

**Condition**: Circuit breaker is open (too many failures)

**Result**:
- All requests fail immediately
- Grace period allows startup if breaker trips during initialization
- Circuit breaker auto-recovers after timeout (30s)

---

## Error Messages Reference

### User-Facing Error Messages

| Error Message | Cause | Resolution |
|--------------|-------|------------|
| `license ID and cluster ID are required` | Missing configuration | Set `CHORUS_LICENSE_ID` and `CHORUS_CLUSTER_ID` |
| `unable to contact license authority` | Network error | Check KACHING server accessibility |
| `license validation failed: License expired` | Expired license | Renew license with vendor |
| `license validation failed: License revoked` | Revoked license | Contact vendor |
| `license validation failed: Cluster ID mismatch` | Wrong cluster | Use correct cluster configuration |
| `rate limited by KACHING` | Too many requests | Wait for rate limit reset |
| `lease token is invalid` | Expired or invalid lease | System will auto-request new lease |
| `lease validation failed with status 404` | Lease not found | System will auto-request new lease |
| `License validation failed but in grace period` | Transient failure during startup | System continues with warning |

---

## Grace Period Mechanism

### Purpose

The grace period allows CHORUS to start even when license validation temporarily fails, preventing service disruption due to transient network issues or KACHING server maintenance.

### Behavior

- **Duration**: 90 seconds from startup
- **Triggered**: When validation fails but grace period is active
- **Effect**: Validation returns success with warning log
- **Extension**: Each successful validation resets the grace period to 90s from the time of that validation (for example, a success at T+30s moves the expiry to T+120s)
- **Expiry**: After grace period expires, validation failures cause immediate exit

### Grace Period States

```
┌──────────────────────────────────────────────────────────────┐
│                    Grace Period Timeline                     │
└──────────────────────────────────────────────────────────────┘

T+0s    ┌─────────────────────────────────────────────┐
        │  GRACE PERIOD ACTIVE (90s)                  │
        │  Validation failures allowed with warning   │
        └─────────────────────────────────────────────┘

T+30s   │ Validation SUCCESS
        └──> Grace period extended to T+120s

T+90s   │ Grace period expires (no successful validation)
        └──> Next validation failure causes exit

T+120s  │ (Extended) Grace period expires
        └──> Next validation failure causes exit
```

### Implementation

```go
// Initialize grace period at startup
func NewLicenseGate(config LicenseConfig) *LicenseGate {
    gate := &LicenseGate{...}
    gate.graceUntil.Store(time.Now().Add(90 * time.Second))
    return gate
}

// Check grace period during validation
if err != nil {
    if g.isInGracePeriod() {
        fmt.Printf("⚠️ License validation failed but in grace period: %v\n", err)
        return nil // Allow startup
    }
    return fmt.Errorf("license validation failed: %w", err)
}

// Extend grace period on success
g.extendGracePeriod() // Adds 90s to current time
```

---

## Startup Integration

### Location

**File**: `internal/runtime/shared.go`
**Function**: `InitializeRuntime()`

### Integration Point

```go
func InitializeRuntime(cfg *config.CHORUSConfig) (*RuntimeContext, error) {
    // ... early initialization ...

    // CRITICAL: Validate license before any P2P operations
    runtime.Logger.Info("🔐 Validating CHORUS license with KACHING...")
    licenseValidator := licensing.NewValidator(licensing.LicenseConfig{
        LicenseID:  cfg.License.LicenseID,
        ClusterID:  cfg.License.ClusterID,
        KachingURL: cfg.License.KachingURL,
    })

    if err := licenseValidator.Validate(); err != nil {
        // This error causes InitializeRuntime to return error
        // which causes main() to exit immediately
        return nil, fmt.Errorf("license validation failed: %v", err)
    }

    runtime.Logger.Info("✅ License validation successful - CHORUS authorized to run")

    // ... continue with P2P, AI provider initialization, etc ...
}
```

### Execution Order

```
1. Load configuration from YAML
2. Initialize logger
3. ⚠️ VALIDATE LICENSE ⚠️
   └─→ FAILURE → return error → main() exits
4. Initialize AI provider
5. Initialize metrics collector
6. Initialize SHHH sentinel
7. Initialize P2P network
8. Start HAP server
9. Enter main runtime loop
```

**Critical Note**: License validation occurs **BEFORE** any P2P networking or AI provider initialization. If validation fails, no network connections are made and no services are started.

---

## Configuration Examples

### Minimal Configuration

```yaml
license:
  license_id: "lic_abc123"
  cluster_id: "cluster_xyz789"
```

KACHING URL defaults to `http://localhost:8083`

### Production Configuration

```yaml
license:
  license_id: "lic_prod_abc123"
  cluster_id: "cluster_production_xyz789"
  kaching_url: "https://kaching.chorus.services"
  organization_name: "Acme Corporation"
  license_type: "enterprise"
  max_nodes: 10
```

### Development Configuration

```yaml
license:
  license_id: "lic_dev_abc123"
  cluster_id: "cluster_dev_local"
  kaching_url: "http://localhost:8083"
  organization_name: "Development Team"
  license_type: "developer"
  max_nodes: 1
```

### Environment Variables

Licensing configuration can also be set via environment variables:

```bash
export CHORUS_LICENSE_ID="lic_abc123"
export CHORUS_CLUSTER_ID="cluster_xyz789"
export CHORUS_KACHING_URL="http://localhost:8083"
```
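
How the three variables map onto `LicenseConfig` can be sketched as follows. The variable names, the required-field check, and the default URL are taken from this document; the `licenseConfigFromEnv` function and its lookup-function parameter are hypothetical and are not a description of how the CHORUS config loader is actually written.

```go
package main

import (
    "errors"
    "fmt"
    "os"
)

const defaultKachingURL = "http://localhost:8083"

// LicenseConfig mirrors the core struct from internal/licensing.
type LicenseConfig struct {
    LicenseID  string
    ClusterID  string
    KachingURL string
}

// licenseConfigFromEnv builds a LicenseConfig from a lookup function with the
// same signature as os.LookupEnv, applying the documented default URL.
func licenseConfigFromEnv(lookup func(string) (string, bool)) (LicenseConfig, error) {
    get := func(key string) string {
        v, _ := lookup(key)
        return v
    }
    cfg := LicenseConfig{
        LicenseID:  get("CHORUS_LICENSE_ID"),
        ClusterID:  get("CHORUS_CLUSTER_ID"),
        KachingURL: get("CHORUS_KACHING_URL"),
    }
    if cfg.LicenseID == "" || cfg.ClusterID == "" {
        return LicenseConfig{}, errors.New("license ID and cluster ID are required")
    }
    if cfg.KachingURL == "" {
        cfg.KachingURL = defaultKachingURL
    }
    return cfg, nil
}

func main() {
    cfg, err := licenseConfigFromEnv(os.LookupEnv)
    if err != nil {
        fmt.Println("config error:", err)
        return
    }
    fmt.Printf("%+v\n", cfg)
}
```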

---

## Monitoring and Observability

### Log Messages

#### Successful Validation

```
🔐 Validating CHORUS license with KACHING...
✅ License validation successful - CHORUS authorized to run
```

#### Validation with Cached Lease

```
🔐 Validating CHORUS license with KACHING...
[Using cached lease token: lease_def456]
✅ License validation successful - CHORUS authorized to run
```

#### Validation During Grace Period

```
🔐 Validating CHORUS license with KACHING...
⚠️ License validation failed but in grace period: unable to contact license authority
✅ License validation successful - CHORUS authorized to run
```

#### Circuit Breaker State Changes

```
🔌 License validation circuit breaker: closed -> open
🔌 License validation circuit breaker: open -> half-open
🔌 License validation circuit breaker: half-open -> closed
```

#### Validation Failure (Fatal)

```
🔐 Validating CHORUS license with KACHING...
❌ License validation failed: License expired
Error: license validation failed: License expired
[CHORUS exits]
```

### Cache Statistics API

```go
stats := licenseGate.GetCacheStats()
```

Returns:

```json
{
  "cache_valid": true,
  "cache_hit": true,
  "expires_at": "2025-09-30T15:30:00Z",
  "cached_at": "2025-09-30T14:30:00Z",
  "in_grace_period": false,
  "breaker_state": "closed",
  "grace_until": "2025-09-30T14:31:30Z"
}
```

### Recommended Monitoring Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `license_validation_success` | Counter | Successful validations |
| `license_validation_failure` | Counter | Failed validations |
| `license_validation_duration_ms` | Histogram | Validation latency |
| `license_cache_hit_rate` | Gauge | Percentage of cache hits |
| `license_grace_period_active` | Gauge | 1 if in grace period, 0 otherwise |
| `license_circuit_breaker_state` | Gauge | 0=closed, 1=half-open, 2=open |
| `license_lease_expiry_seconds` | Gauge | Seconds until lease expiry |

---

## Cluster Lease Management

### Lease Lifecycle

```
┌──────────────────────────────────────────────────────────────────┐
│                     Cluster Lease Lifecycle                      │
└──────────────────────────────────────────────────────────────────┘

1. REQUEST LEASE
   ├─→ POST /api/v1/licenses/{license_id}/cluster-lease
   ├─→ cluster_id: "cluster_xyz789"
   ├─→ requested_replicas: 1
   └─→ duration_minutes: 60

2. RECEIVE LEASE
   ├─→ lease_token: "lease_def456"
   ├─→ max_replicas: 5
   ├─→ expires_at: T+60m
   └─→ Store in cache

3. USE LEASE (per agent startup)
   ├─→ POST /api/v1/licenses/validate-lease
   ├─→ lease_token: "lease_def456"
   ├─→ cluster_id: "cluster_xyz789"
   ├─→ agent_id: "agent_001"
   └─→ Decrements remaining_replicas

4. LEASE EXPIRY
   ├─→ Cache invalidated at T+58m (2min safety margin)
   └─→ Next validation requests new lease

5. LEASE RENEWAL
   └─→ Automatic on cache invalidation
```

### Multi-Replica Support

The lease system supports multiple CHORUS agent replicas:

- **max_replicas**: Maximum concurrent agents allowed
- **remaining_replicas**: Available agent slots
- **agent_id**: Unique identifier for each agent instance

**Example**: License allows 5 replicas
- Request lease → `max_replicas: 5`
- Agent 1 validates → `remaining_replicas: 4`
- Agent 2 validates → `remaining_replicas: 3`
- Agent 6 validates → **FAILURE** (exceeds max_replicas)

---

## Security Considerations

### Fail-Closed Architecture

The licensing system implements **fail-closed** security:

- ✅ Network unavailable → Validation fails → CHORUS exits (unless in grace period)
- ✅ KACHING server down → Validation fails → CHORUS exits (unless in grace period)
- ✅ Invalid license → Validation fails → CHORUS exits (no grace period)
- ✅ Expired license → Validation fails → CHORUS exits (no grace period)
- ❌ No "development mode" bypass
- ❌ No "skip validation" flag

### Grace Period Security

The grace period is designed for transient failures, NOT as a bypass:

- Limited to 90 seconds initially
- Only extends on successful validation
- Does NOT apply to invalid/expired licenses
- Primarily for network/KACHING server availability issues

### License Token Security

- Lease tokens are short-lived (default: 60 minutes)
- Tokens cached in memory only (not persisted to disk)
- Tokens include cluster_id binding (cannot be used by other clusters)
- Agent ID tracking prevents token sharing between agents

### Network Security

- HTTPS recommended for production KACHING URLs
- 30-second timeout prevents hanging on network issues
- Circuit breaker prevents cascade failures

---

## Troubleshooting

### Issue: "license ID and cluster ID are required"

**Cause**: Missing configuration

**Resolution**:
```yaml
# config.yml
license:
  license_id: "your_license_id"
  cluster_id: "your_cluster_id"
```

Or via environment:
```bash
export CHORUS_LICENSE_ID="your_license_id"
export CHORUS_CLUSTER_ID="your_cluster_id"
```

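
For orientation, a check that produces this error could look like the sketch below. It is an assumption about the shape of the check rather than the actual CHORUS code: `licenseConfig` and `configFromEnv` are hypothetical names, and only the two environment variable names and the error text come from this document. The lookup function is passed in so the check can be exercised without touching the real environment.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// licenseConfig is a hypothetical holder for the two required settings.
type licenseConfig struct {
	LicenseID string
	ClusterID string
}

// configFromEnv reads both IDs through getenv and rejects an incomplete
// configuration before any network request is attempted.
func configFromEnv(getenv func(string) string) (licenseConfig, error) {
	cfg := licenseConfig{
		LicenseID: getenv("CHORUS_LICENSE_ID"),
		ClusterID: getenv("CHORUS_CLUSTER_ID"),
	}
	if cfg.LicenseID == "" || cfg.ClusterID == "" {
		return cfg, errors.New("license ID and cluster ID are required")
	}
	return cfg, nil
}

func main() {
	cfg, err := configFromEnv(os.Getenv)
	if err != nil {
		fmt.Println("configuration error:", err)
		return
	}
	fmt.Printf("license %s bound to cluster %s\n", cfg.LicenseID, cfg.ClusterID)
}
```
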

---

### Issue: "unable to contact license authority"

**Cause**: The KACHING server is unreachable

**Resolution**:
1. Verify that the KACHING server is running
2. Check network connectivity: `curl http://localhost:8083/health`
3. Verify the `kaching_url` configuration
4. Check firewall rules
5. If the outage is transient, the grace period allows startup

---

### Issue: "license validation failed: License expired"

**Cause**: The license has expired

**Resolution**:
1. Contact the license vendor to renew
2. Update `license_id` in the configuration
3. Restart CHORUS

**Note**: The grace period does NOT apply to expired licenses

---

### Issue: "rate limited by KACHING"

**Cause**: Too many validation requests

**Resolution**:
1. Check for rapid restart loops
2. Verify that the lease cache is working (it should reduce the number of requests)
3. Wait for the rate limit to reset (check the `Retry-After` header)
4. Consider increasing the lease `duration_minutes`

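
Step 3 depends on reading `Retry-After` correctly. HTTP allows the header to carry either a number of seconds or an HTTP date, so a client should accept both. The sketch below shows one way to do that with the standard library; `retryDelay` is a hypothetical helper, and whether KACHING sends seconds or a date is not stated in this document.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// retryDelay interprets a Retry-After header value. It returns the time to
// wait and false when the value is missing or cannot be parsed.
func retryDelay(value string) (time.Duration, bool) {
	if value == "" {
		return 0, false
	}
	// Form 1: a non-negative number of seconds.
	if secs, err := strconv.Atoi(value); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second, true
	}
	// Form 2: an HTTP date such as "Wed, 21 Oct 2015 07:28:00 GMT".
	if when, err := http.ParseTime(value); err == nil {
		if d := time.Until(when); d > 0 {
			return d, true
		}
		return 0, true // the date has already passed: retry immediately
	}
	return 0, false
}

func main() {
	h := http.Header{}
	h.Set("Retry-After", "120")
	if d, ok := retryDelay(h.Get("Retry-After")); ok {
		fmt.Println("wait", d, "before the next validation request")
	}
}
```
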

---

### Issue: Circuit breaker stuck in "open" state

**Cause**: Repeated validation failures

**Resolution**:
1. Check KACHING server health
2. Verify the license configuration
3. Wait for the circuit breaker to leave the open state on its own, which happens after 30 seconds
4. Check the grace period status: it may allow startup during recovery

---

### Issue: "lease token is invalid"

**Cause**: The lease has expired or been revoked

**Resolution**:
- The system should request a new lease automatically
- If the error persists, check the license status with the vendor
- Verify that `cluster_id` matches the license configuration

---

## Testing

### Unit Testing

```go
package licensing_test

import (
	"encoding/json"
	"net/http"
	"net/http/httptest"
	"testing"

	// assert is assumed to be testify's assert package.
	"github.com/stretchr/testify/assert"
	// Also import the CHORUS internal/licensing package here; its import
	// path depends on the module name and is not shown in this document.
)

// Test license validation success
func TestValidatorSuccess(t *testing.T) {
	// Mock KACHING server that accepts every license
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		json.NewEncoder(w).Encode(map[string]interface{}{
			"status":  "ok",
			"message": "License valid",
		})
	}))
	defer server.Close()

	validator := licensing.NewValidator(licensing.LicenseConfig{
		LicenseID:  "test_license",
		ClusterID:  "test_cluster",
		KachingURL: server.URL,
	})

	err := validator.Validate()
	assert.NoError(t, err)
}

// Test license validation failure
func TestValidatorFailure(t *testing.T) {
	// Mock KACHING server that rejects every license as expired
	server := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusForbidden)
		json.NewEncoder(w).Encode(map[string]interface{}{
			"status":  "error",
			"message": "License expired",
		})
	}))
	defer server.Close()

	validator := licensing.NewValidator(licensing.LicenseConfig{
		LicenseID:  "test_license",
		ClusterID:  "test_cluster",
		KachingURL: server.URL,
	})

	err := validator.Validate()
	assert.Error(t, err)
	assert.Contains(t, err.Error(), "License expired")
}
```

### Integration Testing

```bash
# Start KACHING test server (detached, so the commands below can run in the same shell)
docker run -d -p 8083:8083 kaching:latest

# Test CHORUS startup with valid license
export CHORUS_LICENSE_ID="test_lic_123"
export CHORUS_CLUSTER_ID="test_cluster"
./chorus-agent

# Expected output:
# 🔐 Validating CHORUS license with KACHING...
# ✅ License validation successful - CHORUS authorized to run

# Test CHORUS startup with invalid license
export CHORUS_LICENSE_ID="invalid_license"
./chorus-agent

# Expected output:
# 🔐 Validating CHORUS license with KACHING...
# ❌ License validation failed: License not found
# Error: license validation failed: License not found
# [Exit code 1]
```

---

## Future Enhancements

### Planned Features

1. **Offline License Support**
   - JWT-based license files for air-gapped deployments
   - Signature verification without KACHING connectivity

2. **License Renewal Automation**
   - Background renewal of expiring licenses
   - Alert system for upcoming expirations

3. **Multi-License Support**
   - Support for multiple license tiers
   - Feature flags based on license type

4. **License Analytics**
   - Usage metrics reporting to KACHING
   - License utilization dashboards

5. **Enhanced Lease Management**
   - Lease renewal before expiry
   - Dynamic replica scaling based on license

---

## API Constants

### Timeouts

```go
const (
	DefaultKachingURL = "http://localhost:8083"
	LicenseTimeout    = 30 * time.Second // Validator HTTP timeout
	GateCTimeout      = 10 * time.Second // LicenseGate HTTP timeout
)
```

### Grace Period

```go
const (
	GracePeriodDuration = 90 * time.Second
)
```

### Circuit Breaker

```go
const (
	MaxRequests          = 3                // Half-open state test requests
	FailureThreshold     = 3                // Consecutive failures to trip
	CircuitTimeout       = 30 * time.Second // Open state duration
	FailureResetInterval = 60 * time.Second // Failure count reset
)
```

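
To show how these values interact, here is a deliberately small state machine. It is a sketch under stated assumptions, not the breaker CHORUS uses: the `breaker` type and its methods are hypothetical, `FailureResetInterval` is left out for brevity, and closing the circuit after a single successful probe is a simplification.

```go
package main

import (
	"fmt"
	"time"
)

const (
	maxRequests      = 3                // probe requests admitted while half-open
	failureThreshold = 3                // consecutive failures that trip the breaker
	circuitTimeout   = 30 * time.Second // time spent open before probing again
)

type breakerState int

const (
	stateClosed breakerState = iota
	stateOpen
	stateHalfOpen
)

// breaker is a hypothetical, single-goroutine circuit breaker.
type breaker struct {
	state    breakerState
	failures int // consecutive failures since the last success
	probes   int // requests admitted in the current half-open phase
	openedAt time.Time
}

// allow reports whether a request may be sent at time now.
func (b *breaker) allow(now time.Time) bool {
	switch b.state {
	case stateOpen:
		if now.Sub(b.openedAt) < circuitTimeout {
			return false
		}
		// The timeout has elapsed: start probing.
		b.state, b.probes = stateHalfOpen, 0
		fallthrough
	case stateHalfOpen:
		if b.probes >= maxRequests {
			return false
		}
		b.probes++
		return true
	default:
		return true
	}
}

// record stores the outcome of a request that allow admitted.
func (b *breaker) record(success bool, now time.Time) {
	if success {
		b.state, b.failures = stateClosed, 0
		return
	}
	b.failures++
	if b.state == stateHalfOpen || b.failures >= failureThreshold {
		b.state, b.openedAt, b.failures = stateOpen, now, 0
	}
}

func main() {
	b := &breaker{}
	now := time.Now()
	for i := 0; i < failureThreshold; i++ {
		b.record(false, now)
	}
	fmt.Println("allowed right after tripping:", b.allow(now))
	fmt.Println("allowed after the timeout:", b.allow(now.Add(circuitTimeout)))
	b.record(true, now.Add(circuitTimeout))
	fmt.Println("closed again:", b.state == stateClosed)
}
```

Passing `now` explicitly keeps the sketch deterministic; a production breaker would also need a mutex, because validation can run from several goroutines.
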
### Lease Safety Margin

```go
const (
	LeaseSafetyMargin = 2 * time.Minute // Cache invalidation before expiry
)
```

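
The margin means a cached token stops being used two minutes before it actually expires, which leaves time to request a fresh lease. A minimal sketch of that check follows; `cachedLease` and `usable` are hypothetical names, and the 60-minute lifetime in `main` is only the default quoted earlier.

```go
package main

import (
	"fmt"
	"time"
)

// leaseSafetyMargin mirrors the constant above.
const leaseSafetyMargin = 2 * time.Minute

// cachedLease is a hypothetical in-memory cache entry for a lease token.
type cachedLease struct {
	token     string
	expiresAt time.Time
}

// usable reports whether the cached token may still be used at time now.
// The token is retired leaseSafetyMargin before its real expiry.
func (c cachedLease) usable(now time.Time) bool {
	return c.token != "" && now.Before(c.expiresAt.Add(-leaseSafetyMargin))
}

func main() {
	issued := time.Now()
	lease := cachedLease{token: "example-token", expiresAt: issued.Add(60 * time.Minute)}
	fmt.Println(lease.usable(issued.Add(57 * time.Minute))) // three minutes left: still usable
	fmt.Println(lease.usable(issued.Add(59 * time.Minute))) // inside the margin: refresh first
}
```
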

---

## Related Documentation

- **KACHING License Server**: See KACHING documentation for server setup and API details
- **CHORUS Configuration**: `/docs/comprehensive/pkg/config.md`
- **CHORUS Runtime**: `/docs/comprehensive/internal/runtime.md`
- **Deployment Guide**: `/docs/deployment.md`

---

## Summary

The CHORUS licensing system provides fail-closed license enforcement through integration with the KACHING license authority. Key characteristics:

- **Mandatory**: License validation is required at startup
- **Fail-Closed**: An invalid or expired license always prevents startup; a network failure prevents startup once the grace period has lapsed
- **Cached**: Lease tokens are cached to reduce KACHING load
- **Resilient**: The circuit breaker and grace period handle transient failures
- **Scalable**: The cluster lease system supports multi-replica deployments
- **Secure**: No bypass mechanisms, short-lived tokens, cluster binding

The system ensures that all running CHORUS instances are properly licensed while providing operational flexibility through caching and grace periods for transient failures.