CHORUS Election Package Documentation
Package: chorus/pkg/election
Purpose: Democratic leader election and consensus coordination for distributed CHORUS agents
Status: Production-ready core system; SLURP integration experimental
Table of Contents
- Overview
- Architecture
- Election Algorithm
- Admin Heartbeat Mechanism
- Election Triggers
- Candidate Scoring System
- SLURP Integration
- Quorum and Consensus
- API Reference
- Configuration
- Message Formats
- State Machine
- Callbacks and Events
- Testing
- Production Considerations
Overview
The election package implements a democratic leader election system for distributed CHORUS agents. It enables autonomous agents to elect an "admin" node responsible for coordination, context curation, and key management tasks. The system uses uptime-based voting, capability scoring, and heartbeat monitoring to maintain stable leadership while allowing graceful failover.
Key Features
- Democratic Election: Nodes vote for the most qualified candidate based on uptime, capabilities, and resources
- Heartbeat Monitoring: Active admin sends periodic heartbeats (5s interval) to prove liveness
- Automatic Failover: Elections triggered on heartbeat timeout (15s), split-brain detection, or manual triggers
- Capability-Based Scoring: Candidates scored on admin capabilities, resources, uptime, and experience
- SLURP Integration: Experimental context leadership with Project Manager intelligence capabilities
- Stability Windows: Prevents election churn with configurable minimum term durations
- Graceful Transitions: Callback system for clean leadership handoffs
Use Cases
- Admin Node Selection: Elect a coordinator for project-wide context curation
- Split-Brain Recovery: Resolve network partition conflicts through re-election
- Load Distribution: Select admin based on available resources and current load
- Failover: Automatic promotion of standby nodes when admin becomes unavailable
- Context Leadership: (SLURP) Specialized election for AI context generation leadership
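Quick Start

A minimal usage sketch, assuming an existing config.Config, libp2p host, PubSub instance, and context are already available (variable names here are illustrative, not part of the package):

// Create the election manager and wire callbacks before starting it.
em := election.NewElectionManager(ctx, cfg, host, ps, nodeID)

em.SetCallbacks(
    func(oldAdmin, newAdmin string) {
        log.Printf("admin changed: %q -> %q", oldAdmin, newAdmin)
    },
    func(winner string) {
        log.Printf("election complete, winner: %s", winner)
    },
)

if err := em.Start(); err != nil {
    log.Fatalf("failed to start election manager: %v", err)
}
defer em.Stop()

// An election can also be requested explicitly.
em.TriggerElection(election.TriggerManual)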
Architecture
Component Structure
election/
├── election.go # Core election manager (production)
├── interfaces.go # Shared type definitions
├── slurp_election.go # SLURP election interface (experimental)
├── slurp_manager.go # SLURP election manager implementation (experimental)
├── slurp_scoring.go # SLURP candidate scoring (experimental)
└── election_test.go # Unit tests
Core Components
1. ElectionManager (Production)
The ElectionManager is the production-ready core election coordinator:
type ElectionManager struct {
ctx context.Context
cancel context.CancelFunc
config *config.Config
host libp2p.Host
pubsub *pubsub.PubSub
nodeID string
// Election state
mu sync.RWMutex
state ElectionState
currentTerm int
lastHeartbeat time.Time
currentAdmin string
candidates map[string]*AdminCandidate
votes map[string]string // voter -> candidate
// Timers and channels
heartbeatTimer *time.Timer
discoveryTimer *time.Timer
electionTimer *time.Timer
electionTrigger chan ElectionTrigger
// Heartbeat management
heartbeatManager *HeartbeatManager
// Callbacks
onAdminChanged func(oldAdmin, newAdmin string)
onElectionComplete func(winner string)
// Stability windows (prevents election churn)
lastElectionTime time.Time
electionStabilityWindow time.Duration
leaderStabilityWindow time.Duration
startTime time.Time
}
Key Responsibilities:
- Discovery of existing admin via broadcast queries
- Triggering elections based on heartbeat timeouts or manual triggers
- Managing candidate announcements and vote collection
- Determining election winners based on votes and scores
- Broadcasting election results to cluster
- Managing admin heartbeat lifecycle
2. HeartbeatManager
Manages the admin heartbeat transmission lifecycle:
type HeartbeatManager struct {
mu sync.Mutex
isRunning bool
stopCh chan struct{}
ticker *time.Ticker
electionMgr *ElectionManager
logger func(msg string, args ...interface{})
}
Configuration:
- Heartbeat Interval: HeartbeatTimeout / 2 (default ~7.5s)
- Heartbeat Timeout: 15 seconds (configurable via Security.ElectionConfig.HeartbeatTimeout)
- Transmission: Only when node is current admin
- Lifecycle: Automatically started/stopped on leadership changes
3. SLURPElectionManager (Experimental)
Extends ElectionManager with SLURP contextual intelligence for Project Manager duties:
type SLURPElectionManager struct {
*ElectionManager // Embeds base election manager
// SLURP-specific state
contextMu sync.RWMutex
contextManager ContextManager
slurpConfig *SLURPElectionConfig
contextCallbacks *ContextLeadershipCallbacks
// Context leadership state
isContextLeader bool
contextTerm int64
contextStartedAt *time.Time
lastHealthCheck time.Time
// Failover state
failoverState *ContextFailoverState
transferInProgress bool
// Monitoring
healthMonitor *ContextHealthMonitor
metricsCollector *ContextMetricsCollector
// Shutdown coordination
contextShutdown chan struct{}
contextWg sync.WaitGroup
}
Additional Capabilities:
- Context generation leadership
- Graceful leadership transfer with state preservation
- Health monitoring and metrics collection
- Failover state validation and recovery
- Advanced scoring for AI capabilities
Election Algorithm
Democratic Election Process
The election system implements a democratic voting algorithm where nodes elect the most qualified candidate based on objective metrics.
Election Flow
┌─────────────────────────────────────────────────────────────┐
│ 1. DISCOVERY PHASE │
│ - Node broadcasts admin discovery request │
│ - Existing admin (if any) responds │
│ - Node updates currentAdmin if discovered │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 2. ELECTION TRIGGER │
│ - Heartbeat timeout (15s without admin heartbeat) │
│ - No admin discovered after discovery attempts │
│ - Split-brain detection │
│ - Manual trigger │
│ - Quorum restoration │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 3. CANDIDATE ANNOUNCEMENT │
│ - Eligible nodes announce candidacy │
│ - Include: NodeID, capabilities, uptime, resources │
│ - Calculate and include candidate score │
│ - Broadcast to election topic │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 4. VOTE COLLECTION (Election Timeout Period) │
│ - Nodes receive candidate announcements │
│ - Nodes cast votes for highest-scoring candidate │
│ - Votes broadcast to cluster │
│ - Duration: Security.ElectionConfig.ElectionTimeout │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 5. WINNER DETERMINATION │
│ - Tally votes for each candidate │
│ - Winner: Most votes (ties broken by score) │
│ - Fallback: Highest score if no votes cast │
│ - Broadcast election winner │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ 6. LEADERSHIP TRANSITION │
│ - Update currentAdmin │
│ - Winner starts admin heartbeat │
│ - Previous admin stops heartbeat (if different node) │
│ - Trigger callbacks (OnAdminChanged, OnElectionComplete) │
│ - Return to DISCOVERY/MONITORING phase │
└─────────────────────────────────────────────────────────────┘
Eligibility Criteria
A node can become admin if it has any of these capabilities:
- admin_election - Core admin election capability
- context_curation - Context management capability
- project_manager - Project coordination capability
Checked via ElectionManager.canBeAdmin():
func (em *ElectionManager) canBeAdmin() bool {
for _, cap := range em.config.Agent.Capabilities {
if cap == "admin_election" || cap == "context_curation" || cap == "project_manager" {
return true
}
}
return false
}
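Eligibility therefore depends entirely on the capabilities declared in the agent configuration. A small illustrative sketch (it only shows the Agent.Capabilities field referenced above; how the rest of the configuration is loaded is assumed):

// Grant this node admin eligibility before constructing the election manager.
cfg.Agent.Capabilities = append(cfg.Agent.Capabilities,
    "admin_election",   // core eligibility for elections
    "context_curation", // weighted higher in candidate scoring
)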
Election Timing
- Discovery Loop: Runs continuously, interval = Security.ElectionConfig.DiscoveryTimeout (default: 10s)
- Election Timeout: Security.ElectionConfig.ElectionTimeout (default: 30s)
- Randomized Delay: When triggering an election after discovery failure, waits 2 × DiscoveryTimeout plus a random delay of up to DiscoveryTimeout to prevent simultaneous elections
Admin Heartbeat Mechanism
The admin heartbeat proves liveness and prevents unnecessary elections.
Heartbeat Configuration
| Parameter | Value | Description |
|---|---|---|
| Interval | HeartbeatTimeout / 2 | Heartbeat transmission frequency (~7.5s) |
| Timeout | HeartbeatTimeout | Max time without heartbeat before election (15s) |
| Topic | CHORUS/admin/heartbeat/v1 | PubSub topic for heartbeats |
| Format | JSON | Message serialization format |
Heartbeat Message Format
{
"node_id": "QmXxx...abc",
"timestamp": "2025-09-30T18:15:30.123456789Z"
}
Fields:
- node_id (string): Admin node's ID
- timestamp (RFC3339Nano): When the heartbeat was sent
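A standalone sketch of encoding and decoding this message shape (the heartbeat struct name below is illustrative and not part of the package API):

package main

import (
    "encoding/json"
    "fmt"
    "time"
)

// heartbeat mirrors the documented admin heartbeat payload.
type heartbeat struct {
    NodeID    string    `json:"node_id"`
    Timestamp time.Time `json:"timestamp"`
}

func main() {
    // Encode exactly the two documented fields.
    msg := heartbeat{NodeID: "QmAdmin...abc", Timestamp: time.Now().UTC()}
    data, _ := json.Marshal(msg)
    fmt.Println(string(data))

    // Decode as a receiving node would before updating lastHeartbeat.
    var got heartbeat
    _ = json.Unmarshal(data, &got)
    fmt.Println(got.NodeID, got.Timestamp.Format(time.RFC3339Nano))
}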
Heartbeat Lifecycle
Starting Heartbeat (Becoming Admin)
// Automatically called when node becomes admin
func (hm *HeartbeatManager) StartHeartbeat() error {
hm.mu.Lock()
defer hm.mu.Unlock()
if hm.isRunning {
return nil // Already running
}
if !hm.electionMgr.IsCurrentAdmin() {
return fmt.Errorf("not admin, cannot start heartbeat")
}
hm.stopCh = make(chan struct{})
interval := hm.electionMgr.config.Security.ElectionConfig.HeartbeatTimeout / 2
hm.ticker = time.NewTicker(interval)
hm.isRunning = true
go hm.heartbeatLoop()
return nil
}
Stopping Heartbeat (Losing Admin)
// Automatically called when node loses admin role
func (hm *HeartbeatManager) StopHeartbeat() error {
hm.mu.Lock()
defer hm.mu.Unlock()
if !hm.isRunning {
return nil // Already stopped
}
close(hm.stopCh)
if hm.ticker != nil {
hm.ticker.Stop()
hm.ticker = nil
}
hm.isRunning = false
return nil
}
Heartbeat Transmission Loop
func (hm *HeartbeatManager) heartbeatLoop() {
defer func() {
hm.mu.Lock()
hm.isRunning = false
hm.mu.Unlock()
}()
for {
select {
case <-hm.ticker.C:
// Only send heartbeat if still admin
if hm.electionMgr.IsCurrentAdmin() {
if err := hm.electionMgr.SendAdminHeartbeat(); err != nil {
hm.logger("Failed to send heartbeat: %v", err)
}
} else {
hm.logger("No longer admin, stopping heartbeat")
return
}
case <-hm.stopCh:
return
case <-hm.electionMgr.ctx.Done():
return
}
}
}
Heartbeat Processing
When a node receives a heartbeat:
func (em *ElectionManager) handleAdminHeartbeat(data []byte) {
var heartbeat struct {
NodeID string `json:"node_id"`
Timestamp time.Time `json:"timestamp"`
}
if err := json.Unmarshal(data, &heartbeat); err != nil {
log.Printf("❌ Failed to unmarshal heartbeat: %v", err)
return
}
em.mu.Lock()
defer em.mu.Unlock()
// Update admin and heartbeat timestamp
if em.currentAdmin == "" || em.currentAdmin == heartbeat.NodeID {
em.currentAdmin = heartbeat.NodeID
em.lastHeartbeat = heartbeat.Timestamp
}
}
Timeout Detection
Checked during discovery loop:
func (em *ElectionManager) performAdminDiscovery() {
em.mu.Lock()
lastHeartbeat := em.lastHeartbeat
em.mu.Unlock()
// Check if admin heartbeat has timed out
if !lastHeartbeat.IsZero() &&
time.Since(lastHeartbeat) > em.config.Security.ElectionConfig.HeartbeatTimeout {
log.Printf("⚰️ Admin heartbeat timeout detected (last: %v)", lastHeartbeat)
em.TriggerElection(TriggerHeartbeatTimeout)
return
}
}
Election Triggers
Elections can be triggered by multiple events, each with different stability guarantees.
Trigger Types
type ElectionTrigger string
const (
TriggerHeartbeatTimeout ElectionTrigger = "admin_heartbeat_timeout"
TriggerDiscoveryFailure ElectionTrigger = "no_admin_discovered"
TriggerSplitBrain ElectionTrigger = "split_brain_detected"
TriggerQuorumRestored ElectionTrigger = "quorum_restored"
TriggerManual ElectionTrigger = "manual_trigger"
)
Trigger Details
1. Heartbeat Timeout
When: No admin heartbeat received for HeartbeatTimeout duration (15s)
Behavior:
- Most common trigger for failover
- Indicates admin node failure or network partition
- Immediate election trigger (no stability window applies)
Example:
if time.Since(lastHeartbeat) > em.config.Security.ElectionConfig.HeartbeatTimeout {
em.TriggerElection(TriggerHeartbeatTimeout)
}
2. Discovery Failure
When: No admin discovered after multiple discovery attempts
Behavior:
- Occurs on cluster startup or after total admin loss
- Includes randomized delay to prevent simultaneous elections
- Base delay: 2 × DiscoveryTimeout + random(DiscoveryTimeout)
Example:
if currentAdmin == "" && em.canBeAdmin() {
baseDelay := em.config.Security.ElectionConfig.DiscoveryTimeout * 2
randomDelay := time.Duration(rand.Intn(int(em.config.Security.ElectionConfig.DiscoveryTimeout)))
totalDelay := baseDelay + randomDelay
time.Sleep(totalDelay)
if stillNoAdmin && stillIdle && em.canBeAdmin() {
em.TriggerElection(TriggerDiscoveryFailure)
}
}
3. Split-Brain Detection
When: Multiple nodes believe they are admin
Behavior:
- Detected through conflicting admin announcements
- Forces re-election to resolve conflict
- Should be rare in properly configured clusters
Usage: (Implementation-specific, typically in cluster coordination layer)
4. Quorum Restored
When: Network partition heals and quorum is re-established
Behavior:
- Allows cluster to re-elect with full member participation
- Ensures minority partition doesn't maintain stale admin
Usage: (Implementation-specific, typically in quorum management layer)
5. Manual Trigger
When: Explicitly triggered via API or administrative action
Behavior:
- Used for planned leadership transfers
- Used for testing and debugging
- Respects stability windows (can be overridden)
Example:
em.TriggerElection(TriggerManual)
Stability Windows
To prevent election churn, the system enforces minimum durations between elections:
Election Stability Window
Default: 2 × DiscoveryTimeout (20s)
Configuration: Environment variable CHORUS_ELECTION_MIN_TERM
Prevents rapid back-to-back elections regardless of admin state.
func getElectionStabilityWindow(cfg *config.Config) time.Duration {
if stability := os.Getenv("CHORUS_ELECTION_MIN_TERM"); stability != "" {
if duration, err := time.ParseDuration(stability); err == nil {
return duration
}
}
if cfg.Security.ElectionConfig.DiscoveryTimeout > 0 {
return cfg.Security.ElectionConfig.DiscoveryTimeout * 2
}
return 30 * time.Second // Fallback
}
Leader Stability Window
Default: 3 × HeartbeatTimeout (45s)
Configuration: Environment variable CHORUS_LEADER_MIN_TERM
Prevents challenging a healthy leader too quickly after election.
func getLeaderStabilityWindow(cfg *config.Config) time.Duration {
if stability := os.Getenv("CHORUS_LEADER_MIN_TERM"); stability != "" {
if duration, err := time.ParseDuration(stability); err == nil {
return duration
}
}
if cfg.Security.ElectionConfig.HeartbeatTimeout > 0 {
return cfg.Security.ElectionConfig.HeartbeatTimeout * 3
}
return 45 * time.Second // Fallback
}
Stability Window Enforcement
func (em *ElectionManager) TriggerElection(trigger ElectionTrigger) {
em.mu.RLock()
currentState := em.state
currentAdmin := em.currentAdmin
lastElection := em.lastElectionTime
em.mu.RUnlock()
if currentState != StateIdle {
log.Printf("🗳️ Election already in progress (state: %s), ignoring trigger: %s",
currentState, trigger)
return
}
now := time.Now()
if !lastElection.IsZero() {
timeSinceElection := now.Sub(lastElection)
// Leader stability window (if we have a current admin)
if currentAdmin != "" && timeSinceElection < em.leaderStabilityWindow {
log.Printf("⏳ Leader stability window active (%.1fs remaining), ignoring trigger: %s",
(em.leaderStabilityWindow - timeSinceElection).Seconds(), trigger)
return
}
// General election stability window
if timeSinceElection < em.electionStabilityWindow {
log.Printf("⏳ Election stability window active (%.1fs remaining), ignoring trigger: %s",
(em.electionStabilityWindow - timeSinceElection).Seconds(), trigger)
return
}
}
select {
case em.electionTrigger <- trigger:
log.Printf("🗳️ Election triggered: %s", trigger)
default:
log.Printf("⚠️ Election trigger buffer full, ignoring: %s", trigger)
}
}
Key Points:
- Stability windows prevent election storms during network instability
- Heartbeat timeout triggers bypass some stability checks (admin definitely unavailable)
- Manual triggers respect stability windows unless explicitly overridden
- Referenced in WHOOSH issue #7 as fix for election churn
Candidate Scoring System
Candidates are scored on multiple dimensions to determine the most qualified admin.
Base Election Scoring (Production)
Scoring Formula
finalScore = uptimeScore * 0.3 +
capabilityScore * 0.2 +
resourceScore * 0.2 +
networkQuality * 0.15 +
experienceScore * 0.15
Component Scores
1. Uptime Score (Weight: 0.3)
Measures node stability and continuous availability.
uptimeScore := min(1.0, candidate.Uptime.Hours() / 24.0)
- Calculation: Linear scaling from 0 to 1.0 over 24 hours
- Max Score: 1.0 (achieved at 24+ hours uptime)
- Purpose: Prefer nodes with proven stability
2. Capability Score (Weight: 0.2)
Measures administrative and coordination capabilities.
capabilityScore := 0.0
adminCapabilities := []string{
"admin_election",
"context_curation",
"key_reconstruction",
"semantic_analysis",
"project_manager",
}
for _, cap := range candidate.Capabilities {
for _, adminCap := range adminCapabilities {
if cap == adminCap {
weight := 0.25 // Default weight
// Project manager capabilities get higher weight
if adminCap == "project_manager" || adminCap == "context_curation" {
weight = 0.35
}
capabilityScore += weight
}
}
}
capabilityScore = min(1.0, capabilityScore)
- Admin Capabilities: +0.25 per capability (standard)
- Premium Capabilities: +0.35 for project_manager and context_curation
- Max Score: 1.0 (capped)
- Purpose: Prefer nodes with admin-specific capabilities
3. Resource Score (Weight: 0.2)
Measures available compute resources (lower usage = better).
resourceScore := (1.0 - candidate.Resources.CPUUsage) * 0.3 +
(1.0 - candidate.Resources.MemoryUsage) * 0.3 +
(1.0 - candidate.Resources.DiskUsage) * 0.2 +
candidate.Resources.NetworkQuality * 0.2
- CPU Usage: Lower is better (30% weight)
- Memory Usage: Lower is better (30% weight)
- Disk Usage: Lower is better (20% weight)
- Network Quality: Higher is better (20% weight)
- Purpose: Prefer nodes with available resources
4. Network Quality Score (Weight: 0.15)
Direct measure of network connectivity quality.
networkScore := candidate.Resources.NetworkQuality // Range: 0.0 to 1.0
- Source: Measured network quality metric
- Range: 0.0 (poor) to 1.0 (excellent)
- Purpose: Ensure admin has good connectivity
5. Experience Score (Weight: 0.15)
Measures long-term operational experience.
experienceScore := min(1.0, candidate.Experience.Hours() / 168.0)
- Calculation: Linear scaling from 0 to 1.0 over 1 week (168 hours)
- Max Score: 1.0 (achieved at 1+ week experience)
- Purpose: Prefer nodes with proven long-term reliability
Resource Metrics Structure
type ResourceMetrics struct {
CPUUsage float64 `json:"cpu_usage"` // 0.0 to 1.0 (0-100%)
MemoryUsage float64 `json:"memory_usage"` // 0.0 to 1.0 (0-100%)
DiskUsage float64 `json:"disk_usage"` // 0.0 to 1.0 (0-100%)
NetworkQuality float64 `json:"network_quality"` // 0.0 to 1.0 (quality score)
}
Note: Current implementation uses simulated values. Production systems should integrate actual resource monitoring.
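Putting the five components together, a simplified sketch of the composite calculation (field names follow the snippets above; this is illustrative rather than the package's exact implementation):

// scoreCandidateSketch combines the documented component scores using the
// production weights: 0.3 uptime, 0.2 capabilities, 0.2 resources,
// 0.15 network quality, 0.15 experience.
func scoreCandidateSketch(uptime, experience time.Duration, caps []string, res ResourceMetrics) float64 {
    uptimeScore := math.Min(1.0, uptime.Hours()/24.0)
    experienceScore := math.Min(1.0, experience.Hours()/168.0)

    capabilityScore := 0.0
    for _, c := range caps {
        switch c {
        case "project_manager", "context_curation":
            capabilityScore += 0.35
        case "admin_election", "key_reconstruction", "semantic_analysis":
            capabilityScore += 0.25
        }
    }
    capabilityScore = math.Min(1.0, capabilityScore)

    resourceScore := (1.0-res.CPUUsage)*0.3 + (1.0-res.MemoryUsage)*0.3 +
        (1.0-res.DiskUsage)*0.2 + res.NetworkQuality*0.2

    return uptimeScore*0.3 + capabilityScore*0.2 + resourceScore*0.2 +
        res.NetworkQuality*0.15 + experienceScore*0.15
}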
SLURP Candidate Scoring (Experimental)
SLURP extends base scoring with contextual intelligence metrics for Project Manager leadership.
Extended Scoring Formula
finalScore = baseScore * (baseWeightsSum) +
contextCapabilityScore * contextWeight +
intelligenceScore * intelligenceWeight +
coordinationScore * coordinationWeight +
qualityScore * qualityWeight +
performanceScore * performanceWeight +
specializationScore * specializationWeight +
availabilityScore * availabilityWeight +
reliabilityScore * reliabilityWeight
Normalized by total weight sum.
SLURP Scoring Weights (Default)
func DefaultSLURPScoringWeights() *SLURPScoringWeights {
return &SLURPScoringWeights{
// Base election weights (total: 0.4)
UptimeWeight: 0.08,
CapabilityWeight: 0.10,
ResourceWeight: 0.08,
NetworkWeight: 0.06,
ExperienceWeight: 0.08,
// SLURP-specific weights (total: 0.6)
ContextCapabilityWeight: 0.15, // Most important for context leadership
IntelligenceWeight: 0.12,
CoordinationWeight: 0.10,
QualityWeight: 0.08,
PerformanceWeight: 0.06,
SpecializationWeight: 0.04,
AvailabilityWeight: 0.03,
ReliabilityWeight: 0.02,
}
}
SLURP Component Scores
1. Context Capability Score (Weight: 0.15)
Core context generation capabilities:
score := 0.0
if caps.ContextGeneration { score += 0.3 } // Required for leadership
if caps.ContextCuration { score += 0.2 } // Content quality
if caps.ContextDistribution { score += 0.2 } // Delivery capability
if caps.ContextStorage { score += 0.1 } // Persistence
if caps.SemanticAnalysis { score += 0.1 } // Advanced analysis
if caps.RAGIntegration { score += 0.1 } // RAG capability
2. Intelligence Score (Weight: 0.12)
AI and analysis capabilities:
score := 0.0
if caps.SemanticAnalysis { score += 0.25 }
if caps.RAGIntegration { score += 0.25 }
if caps.TemporalAnalysis { score += 0.25 }
if caps.DecisionTracking { score += 0.25 }
// Apply quality multiplier
score = score * caps.GenerationQuality
3. Coordination Score (Weight: 0.10)
Cluster management capabilities:
score := 0.0
if caps.ClusterCoordination { score += 0.3 }
if caps.LoadBalancing { score += 0.25 }
if caps.HealthMonitoring { score += 0.2 }
if caps.ResourceManagement { score += 0.25 }
4. Quality Score (Weight: 0.08)
Average of quality metrics:
score := (caps.GenerationQuality + caps.ProcessingSpeed + caps.AccuracyScore) / 3.0
5. Performance Score (Weight: 0.06)
Historical operation success:
totalOps := caps.SuccessfulOperations + caps.FailedOperations
successRate := float64(caps.SuccessfulOperations) / float64(totalOps)
// Response time score (1s optimal, 10s poor)
responseTimeScore := calculateResponseTimeScore(caps.AverageResponseTime)
score := (successRate * 0.7) + (responseTimeScore * 0.3)
6. Specialization Score (Weight: 0.04)
Domain expertise coverage:
domainCoverage := float64(len(caps.DomainExpertise)) / 10.0
domainCoverage = min(1.0, domainCoverage)
score := (caps.SpecializationScore * 0.6) + (domainCoverage * 0.4)
7. Availability Score (Weight: 0.03)
Resource availability:
cpuScore := min(1.0, caps.AvailableCPU / 8.0) // 8 cores = 1.0
memoryScore := min(1.0, caps.AvailableMemory / 16GB) // 16GB = 1.0
storageScore := min(1.0, caps.AvailableStorage / 1TB) // 1TB = 1.0
networkScore := min(1.0, caps.NetworkBandwidth / 1Gbps) // 1Gbps = 1.0
score := (cpuScore * 0.3) + (memoryScore * 0.3) +
(storageScore * 0.2) + (networkScore * 0.2)
8. Reliability Score (Weight: 0.02)
Uptime and reliability:
score := (caps.ReliabilityScore * 0.6) + (caps.UptimePercentage * 0.4)
SLURP Requirements Filtering
Candidates must meet minimum requirements to be eligible:
func DefaultSLURPLeadershipRequirements() *SLURPLeadershipRequirements {
return &SLURPLeadershipRequirements{
RequiredCapabilities: []string{"context_generation", "context_curation"},
PreferredCapabilities: []string{"semantic_analysis", "cluster_coordination", "rag_integration"},
MinQualityScore: 0.6,
MinReliabilityScore: 0.7,
MinUptimePercentage: 0.8,
MinCPU: 2.0, // 2 CPU cores
MinMemory: 4 * GB, // 4GB
MinStorage: 100 * GB, // 100GB
MinNetworkBandwidth: 100 * Mbps, // 100 Mbps
MinSuccessfulOperations: 10,
MaxFailureRate: 0.1, // 10% max
MaxResponseTime: 5 * time.Second,
}
}
Disqualification: Candidates failing requirements receive score of 0.0 and are marked with disqualification reasons.
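A rough sketch of applying this gate before scoring (the capability and requirement struct shapes are inferred from the snippets above; treat the type names as assumptions):

// meetsRequirementsSketch reports whether a candidate clears the hard minimums;
// any failure would translate to a 0.0 score plus disqualification reasons.
func meetsRequirementsSketch(caps SLURPCandidateCapabilities, req SLURPLeadershipRequirements) (bool, []string) {
    var reasons []string
    if !caps.ContextGeneration || !caps.ContextCuration {
        reasons = append(reasons, "missing required context capabilities")
    }
    if caps.GenerationQuality < req.MinQualityScore {
        reasons = append(reasons, "generation quality below minimum")
    }
    if caps.UptimePercentage < req.MinUptimePercentage {
        reasons = append(reasons, "uptime percentage below minimum")
    }
    total := caps.SuccessfulOperations + caps.FailedOperations
    if total > 0 && float64(caps.FailedOperations)/float64(total) > req.MaxFailureRate {
        reasons = append(reasons, "failure rate above maximum")
    }
    return len(reasons) == 0, reasons
}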
Score Adjustments (Bonuses/Penalties)
// Bonuses
if caps.GenerationQuality > 0.95 {
finalScore += 0.05 // Exceptional quality
}
if caps.UptimePercentage > 0.99 {
finalScore += 0.03 // Exceptional uptime
}
if caps.ContextGeneration && caps.ContextCuration &&
caps.SemanticAnalysis && caps.ClusterCoordination {
finalScore += 0.02 // Full capability coverage
}
// Penalties
if caps.GenerationQuality < 0.5 {
finalScore -= 0.1 // Low quality
}
if caps.FailedOperations > caps.SuccessfulOperations {
finalScore -= 0.15 // High failure rate
}
SLURP Integration
SLURP (Semantic Layer for Understanding, Reasoning, and Planning) extends election with context generation leadership.
Architecture
┌─────────────────────────────────────────────────────────────┐
│ Base Election (Production) │
│ - Admin election │
│ - Heartbeat monitoring │
│ - Basic leadership │
└─────────────────────────────────────────────────────────────┘
│
│ Embeds
▼
┌─────────────────────────────────────────────────────────────┐
│ SLURP Election (Experimental) │
│ - Context leadership │
│ - Advanced scoring │
│ - Failover state management │
│ - Health monitoring │
└─────────────────────────────────────────────────────────────┘
Status: Experimental
The SLURP integration is experimental and provides:
- ✅ Extended candidate scoring for AI capabilities
- ✅ Context leadership state management
- ✅ Graceful failover with state preservation
- ✅ Health and metrics monitoring framework
- ⚠️ Incomplete: Actual context manager integration (TODOs present)
- ⚠️ Incomplete: State recovery mechanisms
- ⚠️ Incomplete: Production metrics collection
Context Leadership
Becoming Context Leader
When a node wins election and becomes admin, it can also become context leader:
func (sem *SLURPElectionManager) StartContextGeneration(ctx context.Context) error {
if !sem.IsCurrentAdmin() {
return fmt.Errorf("not admin, cannot start context generation")
}
sem.contextMu.Lock()
defer sem.contextMu.Unlock()
if sem.contextManager == nil {
return fmt.Errorf("no context manager registered")
}
// Mark as context leader
sem.isContextLeader = true
sem.contextTerm++
now := time.Now()
sem.contextStartedAt = &now
// Start background processes
sem.contextWg.Add(2)
go sem.runHealthMonitoring()
go sem.runMetricsCollection()
// Trigger callbacks
if sem.contextCallbacks != nil {
if sem.contextCallbacks.OnBecomeContextLeader != nil {
sem.contextCallbacks.OnBecomeContextLeader(ctx, sem.contextTerm)
}
if sem.contextCallbacks.OnContextGenerationStarted != nil {
sem.contextCallbacks.OnContextGenerationStarted(sem.nodeID)
}
}
// Broadcast context leadership start
// ...
return nil
}
Losing Context Leadership
When a node loses admin role or election, it stops context generation:
func (sem *SLURPElectionManager) StopContextGeneration(ctx context.Context) error {
// Signal shutdown to background processes
close(sem.contextShutdown)
// Wait for background processes with timeout
done := make(chan struct{})
go func() {
sem.contextWg.Wait()
close(done)
}()
select {
case <-done:
// Clean shutdown
case <-time.After(sem.slurpConfig.GenerationStopTimeout):
// Timeout
}
sem.contextMu.Lock()
sem.isContextLeader = false
sem.contextStartedAt = nil
sem.contextMu.Unlock()
// Trigger callbacks
// ...
return nil
}
Graceful Leadership Transfer
SLURP supports explicit leadership transfer with state preservation:
func (sem *SLURPElectionManager) TransferContextLeadership(
ctx context.Context,
targetNodeID string,
) error {
if !sem.IsContextLeader() {
return fmt.Errorf("not context leader, cannot transfer")
}
// Prepare failover state
state, err := sem.PrepareContextFailover(ctx)
if err != nil {
return err
}
// Send transfer message to cluster
transferMsg := ElectionMessage{
Type: "context_leadership_transfer",
NodeID: sem.nodeID,
Timestamp: time.Now(),
Term: int(sem.contextTerm),
Data: map[string]interface{}{
"target_node": targetNodeID,
"failover_state": state,
"reason": "manual_transfer",
},
}
if err := sem.publishElectionMessage(transferMsg); err != nil {
return err
}
// Stop context generation
sem.StopContextGeneration(ctx)
// Trigger new election
sem.TriggerElection(TriggerManual)
return nil
}
Failover State
Context leadership state preserved during failover:
type ContextFailoverState struct {
// Basic failover state
LeaderID string
Term int64
TransferTime time.Time
// Context generation state
QueuedRequests []*ContextGenerationRequest
ActiveJobs map[string]*ContextGenerationJob
CompletedJobs []*ContextGenerationJob
// Cluster coordination state
ClusterState *ClusterState
ResourceAllocations map[string]*ResourceAllocation
NodeAssignments map[string][]string
// Configuration state
ManagerConfig *ManagerConfig
GenerationPolicy *GenerationPolicy
QueuePolicy *QueuePolicy
// State validation
StateVersion int64
Checksum string
HealthSnapshot *ContextClusterHealth
// Transfer metadata
TransferReason string
TransferSource string
TransferDuration time.Duration
ValidationResults *ContextStateValidation
}
State Validation
Before accepting transferred state:
func (sem *SLURPElectionManager) ValidateContextState(
state *ContextFailoverState,
) (*ContextStateValidation, error) {
validation := &ContextStateValidation{
ValidatedAt: time.Now(),
ValidatedBy: sem.nodeID,
Valid: true,
}
// Check basic fields
if state.LeaderID == "" {
validation.Issues = append(validation.Issues, "missing leader ID")
validation.Valid = false
}
// Validate checksum
if state.Checksum != "" {
tempState := *state
tempState.Checksum = ""
data, _ := json.Marshal(tempState)
hash := md5.Sum(data)
expectedChecksum := fmt.Sprintf("%x", hash)
validation.ChecksumValid = (expectedChecksum == state.Checksum)
if !validation.ChecksumValid {
validation.Issues = append(validation.Issues, "checksum validation failed")
validation.Valid = false
}
}
// Validate timestamps, queue state, cluster state, config
// ...
// Set recovery requirements if issues found
if len(validation.Issues) > 0 {
validation.RequiresRecovery = true
validation.RecoverySteps = []string{
"Review validation issues",
"Perform partial state recovery",
"Restart context generation with defaults",
}
}
return validation, nil
}
Health Monitoring
SLURP election includes cluster health monitoring:
type ContextClusterHealth struct {
TotalNodes int
HealthyNodes int
UnhealthyNodes []string
CurrentLeader string
LeaderHealthy bool
GenerationActive bool
QueueHealth *QueueHealthStatus
NodeHealths map[string]*NodeHealthStatus
LastElection time.Time
NextHealthCheck time.Time
OverallHealthScore float64 // 0-1
}
Health checks run periodically (default: 30s):
func (sem *SLURPElectionManager) runHealthMonitoring() {
defer sem.contextWg.Done()
ticker := time.NewTicker(sem.slurpConfig.ContextHealthCheckInterval)
defer ticker.Stop()
for {
select {
case <-ticker.C:
sem.performHealthCheck()
case <-sem.contextShutdown:
return
}
}
}
Configuration
func DefaultSLURPElectionConfig() *SLURPElectionConfig {
return &SLURPElectionConfig{
EnableContextLeadership: true,
ContextLeadershipWeight: 0.3,
RequireContextCapability: true,
AutoStartGeneration: true,
GenerationStartDelay: 5 * time.Second,
GenerationStopTimeout: 30 * time.Second,
ContextFailoverTimeout: 60 * time.Second,
StateTransferTimeout: 30 * time.Second,
ValidationTimeout: 10 * time.Second,
RequireStateValidation: true,
ContextHealthCheckInterval: 30 * time.Second,
ClusterHealthThreshold: 0.7,
LeaderHealthThreshold: 0.8,
MaxQueueTransferSize: 1000,
QueueDrainTimeout: 60 * time.Second,
PreserveCompletedJobs: true,
CoordinationTimeout: 10 * time.Second,
MaxCoordinationRetries: 3,
CoordinationBackoff: 2 * time.Second,
}
}
Quorum and Consensus
Currently, the election system uses democratic voting without strict quorum requirements. This section describes the voting mechanism and future quorum considerations.
Voting Mechanism
Vote Casting
Nodes cast votes for candidates during the election period:
voteMsg := ElectionMessage{
Type: "election_vote",
NodeID: voterNodeID,
Timestamp: time.Now(),
Term: currentTerm,
Data: map[string]interface{}{
"candidate": chosenCandidateID,
},
}
Vote Tallying
Votes are tallied when election timeout occurs:
func (em *ElectionManager) findElectionWinner() *AdminCandidate {
if len(em.candidates) == 0 {
return nil
}
// Count votes for each candidate
voteCounts := make(map[string]int)
totalVotes := 0
for _, candidateID := range em.votes {
if _, exists := em.candidates[candidateID]; exists {
voteCounts[candidateID]++
totalVotes++
}
}
// If no votes cast, fall back to highest scoring candidate
if totalVotes == 0 {
var winner *AdminCandidate
highestScore := -1.0
for _, candidate := range em.candidates {
if candidate.Score > highestScore {
highestScore = candidate.Score
winner = candidate
}
}
return winner
}
// Find candidate with most votes (ties broken by score)
var winner *AdminCandidate
maxVotes := -1
highestScore := -1.0
for candidateID, voteCount := range voteCounts {
candidate := em.candidates[candidateID]
if voteCount > maxVotes ||
(voteCount == maxVotes && candidate.Score > highestScore) {
maxVotes = voteCount
highestScore = candidate.Score
winner = candidate
}
}
return winner
}
Key Points:
- Majority not required (simple plurality)
- If no votes cast, highest score wins (useful for single-node startup)
- Ties broken by candidate score
- Vote validation ensures voted candidate exists
Quorum Considerations
The system does not currently implement strict quorum requirements. This has implications:
Advantages:
- Works in small clusters (1-2 nodes)
- Allows elections during network partitions
- Simple consensus algorithm
Disadvantages:
- Risk of split-brain if network partitions occur
- No guarantee majority of cluster agrees on admin
- Potential for competing admins in partition scenarios
Future Enhancement: Consider implementing configurable quorum (e.g., "majority of last known cluster size") for production deployments.
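A minimal sketch of what such a check could look like on top of the existing vote tally (a possible enhancement, not current package behavior):

// hasQuorumSketch reports whether enough votes arrived to trust the result,
// using a simple "majority of last known cluster size" rule.
func hasQuorumSketch(votesReceived, lastKnownClusterSize int) bool {
    if lastKnownClusterSize <= 0 {
        return true // unknown or single-node cluster: keep current plurality behavior
    }
    return votesReceived >= lastKnownClusterSize/2+1
}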
Split-Brain Scenarios
Scenario: Network partition creates two isolated groups, each electing separate admin.
Detection Methods:
- Admin heartbeat conflicts (multiple nodes claiming admin)
- Cluster membership disagreements
- Partition healing revealing duplicate admins
Resolution:
- Detect conflicting admin via heartbeat messages
- Trigger TriggerSplitBrain election
- Re-elect with full cluster participation
- Higher-scored or higher-term admin typically wins
Mitigation:
- Stability windows reduce rapid re-elections
- Heartbeat timeout ensures dead admin detection
- Democratic voting resolves conflicts when partition heals
API Reference
ElectionManager (Production)
Constructor
func NewElectionManager(
ctx context.Context,
cfg *config.Config,
host libp2p.Host,
ps *pubsub.PubSub,
nodeID string,
) *ElectionManager
Creates new election manager.
Parameters:
- ctx: Parent context for lifecycle management
- cfg: CHORUS configuration (capabilities, election config)
- host: libp2p host for peer communication
- ps: PubSub instance for election messages
- nodeID: Unique identifier for this node
Returns: Configured ElectionManager
Methods
func (em *ElectionManager) Start() error
Starts the election management system. Subscribes to election and heartbeat topics, launches discovery and coordination goroutines.
Returns: Error if subscription fails
func (em *ElectionManager) Stop()
Stops the election manager. Stops heartbeat, cancels context, cleans up timers.
func (em *ElectionManager) TriggerElection(trigger ElectionTrigger)
Manually triggers an election.
Parameters:
trigger: Reason for triggering election (see Election Triggers)
Behavior:
- Respects stability windows
- Ignores if election already in progress
- Buffers trigger in channel (size 10)
func (em *ElectionManager) GetCurrentAdmin() string
Returns the current admin node ID.
Returns: Node ID string (empty if no admin)
func (em *ElectionManager) IsCurrentAdmin() bool
Checks if this node is the current admin.
Returns: true if this node is admin
func (em *ElectionManager) GetElectionState() ElectionState
Returns current election state.
Returns: One of: StateIdle, StateDiscovering, StateElecting, StateReconstructing, StateComplete
func (em *ElectionManager) SetCallbacks(
onAdminChanged func(oldAdmin, newAdmin string),
onElectionComplete func(winner string),
)
Sets election event callbacks.
Parameters:
- onAdminChanged: Called when admin changes (includes admin discovery)
- onElectionComplete: Called when election completes
func (em *ElectionManager) SendAdminHeartbeat() error
Sends admin heartbeat (only if this node is admin).
Returns: Error if not admin or send fails
func (em *ElectionManager) GetHeartbeatStatus() map[string]interface{}
Returns current heartbeat status.
Returns: Map with keys:
- running (bool): Whether heartbeat is active
- is_admin (bool): Whether this node is admin
- last_sent (time.Time): Last heartbeat time
- interval (string): Heartbeat interval (if running)
- next_heartbeat (time.Time): Next scheduled heartbeat (if running)
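For example, the returned map can be inspected directly using the keys above:

status := em.GetHeartbeatStatus()
if running, _ := status["running"].(bool); running {
    log.Printf("heartbeat active, last sent: %v", status["last_sent"])
} else if isAdmin, _ := status["is_admin"].(bool); isAdmin {
    log.Printf("warning: this node is admin but the heartbeat loop is not running")
}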
SLURPElectionManager (Experimental)
Constructor
func NewSLURPElectionManager(
ctx context.Context,
cfg *config.Config,
host libp2p.Host,
ps *pubsub.PubSub,
nodeID string,
slurpConfig *SLURPElectionConfig,
) *SLURPElectionManager
Creates new SLURP-enhanced election manager.
Parameters:
- (Same as NewElectionManager)
- slurpConfig: SLURP-specific configuration (nil for defaults)
Returns: Configured SLURPElectionManager
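A hedged construction sketch, assuming the same dependencies as the base manager plus a ContextManager implementation (myContextManager is illustrative):

// Build the SLURP-enhanced manager with default SLURP settings (nil config).
sem := election.NewSLURPElectionManager(ctx, cfg, host, ps, nodeID, nil)

// Register the context manager; with auto-start enabled, context generation
// begins automatically if this node is (or becomes) admin.
if err := sem.RegisterContextManager(myContextManager); err != nil {
    log.Printf("failed to register context manager: %v", err)
}

if err := sem.Start(); err != nil {
    log.Fatalf("failed to start SLURP election manager: %v", err)
}
defer sem.Stop()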
Methods
All ElectionManager methods plus:
func (sem *SLURPElectionManager) RegisterContextManager(manager ContextManager) error
Registers a context manager for leader duties.
Parameters:
- manager: Context manager implementing the ContextManager interface
Returns: Error if manager already registered
Behavior: If this node is already admin and auto-start enabled, starts context generation
func (sem *SLURPElectionManager) IsContextLeader() bool
Checks if this node is the current context generation leader.
Returns: true if context leader and admin
func (sem *SLURPElectionManager) GetContextManager() (ContextManager, error)
Returns the registered context manager (only if leader).
Returns:
ContextManagerif leader- Error if not leader or no manager registered
func (sem *SLURPElectionManager) StartContextGeneration(ctx context.Context) error
Begins context generation operations (leader only).
Returns: Error if not admin, already started, or no manager registered
Behavior:
- Marks node as context leader
- Increments context term
- Starts health monitoring and metrics collection
- Triggers callbacks
- Broadcasts context generation start
func (sem *SLURPElectionManager) StopContextGeneration(ctx context.Context) error
Stops context generation operations.
Returns: Error if issues during shutdown (logged, not fatal)
Behavior:
- Signals background processes to stop
- Waits for clean shutdown (with timeout)
- Triggers callbacks
- Broadcasts context generation stop
func (sem *SLURPElectionManager) TransferContextLeadership(
ctx context.Context,
targetNodeID string,
) error
Initiates graceful context leadership transfer.
Parameters:
- ctx: Context for transfer operations
- targetNodeID: Target node to receive leadership
Returns: Error if not leader, transfer in progress, or preparation fails
Behavior:
- Prepares failover state
- Broadcasts transfer message
- Stops context generation
- Triggers new election
func (sem *SLURPElectionManager) GetContextLeaderInfo() (*LeaderInfo, error)
Returns information about current context leader.
Returns:
LeaderInfowith leader details- Error if no current leader
func (sem *SLURPElectionManager) GetContextGenerationStatus() (*GenerationStatus, error)
Returns status of context operations.
Returns:
GenerationStatuswith current state- Error if retrieval fails
func (sem *SLURPElectionManager) SetContextLeadershipCallbacks(
callbacks *ContextLeadershipCallbacks,
) error
Sets callbacks for context leadership changes.
Parameters:
callbacks: Struct with context leadership event callbacks
Returns: Always nil (error reserved for future validation)
func (sem *SLURPElectionManager) GetContextClusterHealth() (*ContextClusterHealth, error)
Returns health of context generation cluster.
Returns: ContextClusterHealth with cluster health metrics
func (sem *SLURPElectionManager) PrepareContextFailover(
ctx context.Context,
) (*ContextFailoverState, error)
Prepares context state for leadership failover.
Returns:
ContextFailoverStatewith preserved state- Error if not context leader or preparation fails
Behavior:
- Collects queued requests, active jobs, configuration
- Captures health snapshot
- Calculates checksum for validation
func (sem *SLURPElectionManager) ExecuteContextFailover(
ctx context.Context,
state *ContextFailoverState,
) error
Executes context leadership failover from provided state.
Parameters:
- ctx: Context for failover operations
- state: Failover state from previous leader
Returns: Error if already leader, validation fails, or restoration fails
Behavior:
- Validates failover state
- Restores context leadership
- Applies configuration and state
- Starts background processes
func (sem *SLURPElectionManager) ValidateContextState(
state *ContextFailoverState,
) (*ContextStateValidation, error)
Validates context failover state before accepting.
Parameters:
state: Failover state to validate
Returns:
- ContextStateValidation with validation results
- Error only if the validation process itself fails (rare)
Validation Checks:
- Basic field presence (LeaderID, Term, StateVersion)
- Checksum validation (MD5)
- Timestamp validity
- Queue state validity
- Cluster state validity
- Configuration validity
Configuration
Election Configuration Structure
type ElectionConfig struct {
DiscoveryTimeout time.Duration // Admin discovery loop interval
DiscoveryBackoff time.Duration // Backoff after failed discovery
ElectionTimeout time.Duration // Election voting period duration
HeartbeatTimeout time.Duration // Max time without heartbeat before election
}
Configuration Sources
1. Config File (config.toml)
[security.election_config]
discovery_timeout = "10s"
discovery_backoff = "5s"
election_timeout = "30s"
heartbeat_timeout = "15s"
2. Environment Variables
# Stability windows
export CHORUS_ELECTION_MIN_TERM="30s" # Min time between elections
export CHORUS_LEADER_MIN_TERM="45s" # Min time before challenging healthy leader
3. Default Values (Fallback)
// In config package
ElectionConfig: ElectionConfig{
DiscoveryTimeout: 10 * time.Second,
DiscoveryBackoff: 5 * time.Second,
ElectionTimeout: 30 * time.Second,
HeartbeatTimeout: 15 * time.Second,
}
SLURP Configuration
type SLURPElectionConfig struct {
// Context leadership configuration
EnableContextLeadership bool // Enable context leadership
ContextLeadershipWeight float64 // Weight for context leadership scoring
RequireContextCapability bool // Require context capability for leadership
// Context generation configuration
AutoStartGeneration bool // Auto-start generation on leadership
GenerationStartDelay time.Duration // Delay before starting generation
GenerationStopTimeout time.Duration // Timeout for stopping generation
// Failover configuration
ContextFailoverTimeout time.Duration // Context failover timeout
StateTransferTimeout time.Duration // State transfer timeout
ValidationTimeout time.Duration // State validation timeout
RequireStateValidation bool // Require state validation
// Health monitoring configuration
ContextHealthCheckInterval time.Duration // Context health check interval
ClusterHealthThreshold float64 // Minimum cluster health for operations
LeaderHealthThreshold float64 // Minimum leader health
// Queue management configuration
MaxQueueTransferSize int // Max requests to transfer
QueueDrainTimeout time.Duration // Timeout for draining queue
PreserveCompletedJobs bool // Preserve completed jobs on transfer
// Coordination configuration
CoordinationTimeout time.Duration // Coordination operation timeout
MaxCoordinationRetries int // Max coordination retries
CoordinationBackoff time.Duration // Backoff between coordination retries
}
Defaults: See DefaultSLURPElectionConfig() in SLURP Integration
Message Formats
PubSub Topics
CHORUS/election/v1 # Election messages (candidates, votes, winners)
CHORUS/admin/heartbeat/v1 # Admin heartbeat messages
ElectionMessage Structure
type ElectionMessage struct {
Type string `json:"type"` // Message type
NodeID string `json:"node_id"` // Sender node ID
Timestamp time.Time `json:"timestamp"` // Message timestamp
Term int `json:"term"` // Election term
Data interface{} `json:"data,omitempty"` // Type-specific data
}
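A hedged sketch of publishing a message of this shape on the election topic (the Publish call follows the standard go-libp2p-pubsub Topic API; the electionTopic handle is an assumption):

// Marshal an ElectionMessage and publish it on CHORUS/election/v1.
msg := ElectionMessage{
    Type:      "election_vote",
    NodeID:    nodeID,
    Timestamp: time.Now().UTC(),
    Term:      currentTerm,
    Data:      map[string]interface{}{"candidate": chosenCandidateID},
}
payload, err := json.Marshal(msg)
if err != nil {
    return err
}
// electionTopic is assumed to be a *pubsub.Topic joined on "CHORUS/election/v1".
if err := electionTopic.Publish(ctx, payload); err != nil {
    return err
}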
Message Types
1. Admin Discovery Request
Type: admin_discovery_request
Purpose: Node searching for existing admin
Data: nil
Example:
{
"type": "admin_discovery_request",
"node_id": "QmXxx...abc",
"timestamp": "2025-09-30T18:15:30.123Z",
"term": 0
}
2. Admin Discovery Response
Type: admin_discovery_response
Purpose: Node informing requester of known admin
Data:
{
"current_admin": "QmYyy...def"
}
Example:
{
"type": "admin_discovery_response",
"node_id": "QmYyy...def",
"timestamp": "2025-09-30T18:15:30.456Z",
"term": 0,
"data": {
"current_admin": "QmYyy...def"
}
}
3. Election Started
Type: election_started
Purpose: Node announcing start of new election
Data:
{
"trigger": "admin_heartbeat_timeout"
}
Example:
{
"type": "election_started",
"node_id": "QmXxx...abc",
"timestamp": "2025-09-30T18:15:45.123Z",
"term": 5,
"data": {
"trigger": "admin_heartbeat_timeout"
}
}
4. Candidacy Announcement
Type: candidacy_announcement
Purpose: Node announcing candidacy in election
Data: AdminCandidate structure
Example:
{
"type": "candidacy_announcement",
"node_id": "QmXxx...abc",
"timestamp": "2025-09-30T18:15:46.123Z",
"term": 5,
"data": {
"node_id": "QmXxx...abc",
"peer_id": "QmXxx...abc",
"capabilities": ["admin_election", "context_curation"],
"uptime": "86400000000000",
"resources": {
"cpu_usage": 0.35,
"memory_usage": 0.52,
"disk_usage": 0.41,
"network_quality": 0.95
},
"experience": "604800000000000",
"score": 0.78
}
}
5. Election Vote
Type: election_vote
Purpose: Node casting vote for candidate
Data:
{
"candidate": "QmYyy...def"
}
Example:
{
"type": "election_vote",
"node_id": "QmZzz...ghi",
"timestamp": "2025-09-30T18:15:50.123Z",
"term": 5,
"data": {
"candidate": "QmYyy...def"
}
}
6. Election Winner
Type: election_winner
Purpose: Announcing election winner
Data: AdminCandidate structure (winner)
Example:
{
"type": "election_winner",
"node_id": "QmXxx...abc",
"timestamp": "2025-09-30T18:16:15.123Z",
"term": 5,
"data": {
"node_id": "QmYyy...def",
"peer_id": "QmYyy...def",
"capabilities": ["admin_election", "context_curation", "project_manager"],
"uptime": "172800000000000",
"resources": {
"cpu_usage": 0.25,
"memory_usage": 0.45,
"disk_usage": 0.38,
"network_quality": 0.98
},
"experience": "1209600000000000",
"score": 0.85
}
}
7. Context Leadership Transfer (SLURP)
Type: context_leadership_transfer
Purpose: Graceful transfer of context leadership
Data:
{
"target_node": "QmNewLeader...xyz",
"failover_state": { /* ContextFailoverState */ },
"reason": "manual_transfer"
}
8. Context Generation Started (SLURP)
Type: context_generation_started
Purpose: Node announcing start of context generation
Data:
{
"leader_id": "QmLeader...abc"
}
9. Context Generation Stopped (SLURP)
Type: context_generation_stopped
Purpose: Node announcing stop of context generation
Data:
{
"reason": "leadership_lost"
}
Admin Heartbeat Message
Topic: CHORUS/admin/heartbeat/v1
Format:
{
"node_id": "QmAdmin...abc",
"timestamp": "2025-09-30T18:15:30.123456789Z"
}
Frequency: Every HeartbeatTimeout / 2 (default: ~7.5s)
Purpose: Prove admin liveness, prevent unnecessary elections
State Machine
Election States
type ElectionState string
const (
StateIdle ElectionState = "idle"
StateDiscovering ElectionState = "discovering"
StateElecting ElectionState = "electing"
StateReconstructing ElectionState = "reconstructing_keys"
StateComplete ElectionState = "complete"
)
State Transitions
┌─────────────────────────────────┐
│ │
│ START │
│ │
└────────────┬────────────────────┘
│
▼
┌─────────────────────────────────┐
│ │
│ StateIdle │
│ - Monitoring heartbeats │
│ - Running discovery loop │
│ - Waiting for triggers │
│ │
└───┬─────────────────────────┬───┘
│ │
Discovery │ │ Election
Request │ │ Trigger
│ │
▼ ▼
┌────────────────────────────┐ ┌─────────────────────────────┐
│ │ │ │
│ StateDiscovering │ │ StateElecting │
│ - Broadcasting discovery │ │ - Collecting candidates │
│ - Waiting for responses │ │ - Collecting votes │
│ │ │ - Election timeout running │
└────────────┬───────────────┘ └──────────┬──────────────────┘
│ │
Admin │ │ Timeout
Found │ │ Reached
│ │
▼ ▼
┌────────────────────────────┐ ┌─────────────────────────────┐
│ │ │ │
│ Update currentAdmin │ │ StateComplete │
│ Trigger OnAdminChanged │ │ - Tallying votes │
│ Return to StateIdle │ │ - Determining winner │
│ │ │ - Broadcasting winner │
└────────────────────────────┘ └──────────┬──────────────────┘
│
Winner │
Announced │
│
▼
┌─────────────────────────────┐
│ │
│ Update currentAdmin │
│ Start/Stop heartbeat │
│ Trigger callbacks │
│ Return to StateIdle │
│ │
└─────────────────────────────┘
State Descriptions
StateIdle
Description: Normal operation state. Node is monitoring for admin heartbeats and ready to participate in elections.
Activities:
- Running discovery loop (periodic admin checks)
- Monitoring heartbeat timeout
- Listening for election messages
- Ready to trigger election
Transitions:
- → StateDiscovering: Discovery request sent
- → StateElecting: Election triggered
StateDiscovering
Description: Node is actively searching for existing admin.
Activities:
- Broadcasting discovery requests
- Waiting for discovery responses
- Timeout-based fallback to election
Transitions:
- → StateIdle: Admin discovered
- → StateElecting: No admin discovered (after timeout)
Note: Current implementation doesn't explicitly use this state; discovery is integrated into idle loop.
StateElecting
Description: Election in progress. Node is collecting candidates and votes.
Activities:
- Announcing candidacy (if eligible)
- Listening for candidate announcements
- Casting votes
- Collecting votes
- Waiting for election timeout
Transitions:
- → StateComplete: Election timeout reached
Duration: ElectionTimeout (default: 30s)
StateComplete
Description: Election complete, determining winner.
Activities:
- Tallying votes
- Determining winner (most votes or highest score)
- Broadcasting winner
- Updating currentAdmin
- Managing heartbeat lifecycle
- Triggering callbacks
Transitions:
- → StateIdle: Winner announced, system returns to normal
Duration: Momentary (immediate transition to StateIdle)
StateReconstructing
Description: Reserved for future key reconstruction operations.
Status: Not currently used in production code.
Purpose: Placeholder for post-election key reconstruction when Shamir Secret Sharing is integrated.
Callbacks and Events
Callback Types
1. OnAdminChanged
Signature:
func(oldAdmin, newAdmin string)
When Called:
- Admin discovered via discovery response
- Admin elected via election completion
- Admin changed due to re-election
Purpose: Notify application of admin leadership changes
Example:
em.SetCallbacks(
func(oldAdmin, newAdmin string) {
if oldAdmin == "" {
log.Printf("✅ Admin discovered: %s", newAdmin)
} else {
log.Printf("🔄 Admin changed: %s → %s", oldAdmin, newAdmin)
}
// Update application state
app.SetCoordinator(newAdmin)
},
nil,
)
2. OnElectionComplete
Signature:
func(winner string)
When Called:
- Election completes and winner is determined
Purpose: Notify application of election completion
Example:
em.SetCallbacks(
nil,
func(winner string) {
log.Printf("🏆 Election complete, winner: %s", winner)
// Record election in metrics
metrics.RecordElection(winner)
},
)
SLURP Context Leadership Callbacks
type ContextLeadershipCallbacks struct {
// Called when this node becomes context leader
OnBecomeContextLeader func(ctx context.Context, term int64) error
// Called when this node loses context leadership
OnLoseContextLeadership func(ctx context.Context, newLeader string) error
// Called when any leadership change occurs
OnContextLeaderChanged func(oldLeader, newLeader string, term int64)
// Called when context generation starts
OnContextGenerationStarted func(leaderID string)
// Called when context generation stops
OnContextGenerationStopped func(leaderID string, reason string)
// Called when context leadership failover occurs
OnContextFailover func(oldLeader, newLeader string, duration time.Duration)
// Called when context-related errors occur
OnContextError func(err error, severity ErrorSeverity)
}
Example:
sem.SetContextLeadershipCallbacks(&election.ContextLeadershipCallbacks{
OnBecomeContextLeader: func(ctx context.Context, term int64) error {
log.Printf("🚀 Became context leader (term %d)", term)
return app.InitializeContextGeneration()
},
OnLoseContextLeadership: func(ctx context.Context, newLeader string) error {
log.Printf("🔄 Lost context leadership to %s", newLeader)
return app.ShutdownContextGeneration()
},
OnContextError: func(err error, severity election.ErrorSeverity) {
log.Printf("⚠️ Context error [%s]: %v", severity, err)
if severity == election.ErrorSeverityCritical {
app.TriggerFailover()
}
},
})
Callback Threading
Important: Callbacks are invoked from election manager goroutines. Consider:
- Non-Blocking: Callbacks should be fast or spawn goroutines for slow operations
- Error Handling: Errors in callbacks are logged but don't prevent election operations
- Synchronization: Use proper locking if callbacks modify shared state
- Idempotency: Callbacks may be invoked multiple times for same event (rare but possible)
Good Practice:
em.SetCallbacks(
func(oldAdmin, newAdmin string) {
// Fast: Update local state
app.mu.Lock()
app.currentAdmin = newAdmin
app.mu.Unlock()
// Slow: Spawn goroutine for heavy work
go app.NotifyAdminChange(oldAdmin, newAdmin)
},
nil,
)
Testing
Test Structure
The package includes comprehensive unit tests in election_test.go.
Running Tests
# Run all election tests
cd /home/tony/chorus/project-queues/active/CHORUS
go test ./pkg/election
# Run with verbose output
go test -v ./pkg/election
# Run specific test
go test -v ./pkg/election -run TestElectionManagerCanBeAdmin
# Run with race detection
go test -race ./pkg/election
Test Utilities
newTestElectionManager
func newTestElectionManager(t *testing.T) *ElectionManager
Creates a fully-wired test election manager with:
- Real libp2p host (localhost)
- Real PubSub instance
- Test configuration
- Automatic cleanup
Example:
func TestMyFeature(t *testing.T) {
em := newTestElectionManager(t)
// Test uses real message passing
em.Start()
// ... test code ...
// Cleanup automatic via t.Cleanup()
}
Test Coverage
1. TestNewElectionManagerInitialState
Verifies initial state after construction:
- State is StateIdle
- Term is 0
- Node ID is populated
2. TestElectionManagerCanBeAdmin
Tests eligibility checking:
- Node with admin capabilities can be admin
- Node without admin capabilities cannot be admin
3. TestFindElectionWinnerPrefersVotesThenScore
Tests winner determination logic:
- Most votes wins
- Score breaks ties
- Fallback to highest score if no votes
4. TestHandleElectionMessageAddsCandidate
Tests candidacy announcement handling:
- Candidate added to candidates map
- Candidate data correctly deserialized
5. TestSendAdminHeartbeatRequiresLeadership
Tests heartbeat authorization:
- Non-admin cannot send heartbeat
- Admin can send heartbeat
Integration Testing
For integration testing with multiple nodes:
func TestMultiNodeElection(t *testing.T) {
// Create 3 test nodes
nodes := make([]*ElectionManager, 3)
for i := 0; i < 3; i++ {
nodes[i] = newTestElectionManager(t)
nodes[i].Start()
}
// Connect nodes (libp2p peer connection)
// ...
// Trigger election
nodes[0].TriggerElection(TriggerManual)
// Wait for election to complete
time.Sleep(35 * time.Second)
// Verify all nodes agree on admin
admin := nodes[0].GetCurrentAdmin()
for i, node := range nodes {
if node.GetCurrentAdmin() != admin {
t.Errorf("Node %d disagrees on admin", i)
}
}
}
Note: Multi-node tests require proper libp2p peer discovery and connection setup.
Production Considerations
Deployment Checklist
Configuration
- Set appropriate HeartbeatTimeout (default 15s)
- Set appropriate ElectionTimeout (default 30s)
- Configure stability windows via environment variables
- Ensure nodes have correct capabilities in config
- Configure discovery and backoff timeouts
Monitoring
- Monitor election frequency (should be rare)
- Monitor heartbeat status on admin node
- Alert on frequent admin changes (possible network issues)
- Track election duration and participation
- Monitor candidate scores and voting patterns
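The checklist above can be wired to the accessors already documented in this package (SetCallbacks, GetCurrentAdmin, GetHeartbeatStatus). The following is a minimal application-side sketch, not part of pkg/election; the Prometheus counter name is an assumption, and registry registration is omitted for brevity.

```go
// Illustrative monitoring hook (application-side sketch, not part of pkg/election).
package monitoring

import (
	"log"
	"time"

	"chorus/pkg/election"
	"github.com/prometheus/client_golang/prometheus"
)

// Hypothetical counter; register it with your Prometheus registry before scraping.
var adminChanges = prometheus.NewCounter(prometheus.CounterOpts{
	Name: "chorus_admin_changes_total",
	Help: "Number of admin leadership changes observed by this node.",
})

// WireElectionMonitoring records admin changes and periodically samples election state.
func WireElectionMonitoring(em *election.ElectionManager, interval time.Duration) {
	em.SetCallbacks(
		func(oldAdmin, newAdmin string) {
			log.Printf("admin changed: %s -> %s", oldAdmin, newAdmin)
			adminChanges.Inc()
		},
		func(winner string) {
			log.Printf("election complete: winner=%s", winner)
		},
	)

	// Periodically sample the current admin and heartbeat status for dashboards.
	go func() {
		ticker := time.NewTicker(interval)
		defer ticker.Stop()
		for range ticker.C {
			admin := em.GetCurrentAdmin()
			status := em.GetHeartbeatStatus()
			log.Printf("current admin=%q heartbeat=%+v", admin, status)
		}
	}()
}
```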
Network
- Ensure PubSub connectivity between all nodes
- Configure appropriate gossipsub parameters
- Test behavior during network partitions
- Verify heartbeat messages reach all nodes
- Monitor libp2p connection stability
Capabilities
- Ensure at least one node has admin capabilities
- Balance capabilities across cluster (redundancy)
- Test elections with different capability distributions
- Verify scoring weights match organizational priorities
Resource Metrics
- Implement actual resource metric collection (currently simulated; see the sketch after this list)
- Calibrate resource scoring weights
- Test behavior under high load
- Verify low-resource nodes don't become admin
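Because resource metrics are currently simulated, any real collection must be supplied by the deployment. Below is one possible sketch using the gopsutil library; the ResourceSample type and CollectResourceSample function are hypothetical names, not part of pkg/election, and how the values feed into candidate scoring is up to the integrator.

```go
// Hypothetical resource sampler; pkg/election currently simulates these values.
package resources

import (
	"time"

	"github.com/shirou/gopsutil/v3/cpu"
	"github.com/shirou/gopsutil/v3/disk"
	"github.com/shirou/gopsutil/v3/mem"
)

// ResourceSample holds the utilization values a scoring function could consume.
type ResourceSample struct {
	CPUUsedPercent  float64
	MemUsedPercent  float64
	DiskUsedPercent float64
}

// CollectResourceSample gathers a point-in-time view of node utilization.
func CollectResourceSample() (*ResourceSample, error) {
	// Average CPU usage over one second, across all cores combined.
	cpuPercents, err := cpu.Percent(time.Second, false)
	if err != nil {
		return nil, err
	}
	vm, err := mem.VirtualMemory()
	if err != nil {
		return nil, err
	}
	du, err := disk.Usage("/")
	if err != nil {
		return nil, err
	}
	return &ResourceSample{
		CPUUsedPercent:  cpuPercents[0],
		MemUsedPercent:  vm.UsedPercent,
		DiskUsedPercent: du.UsedPercent,
	}, nil
}
```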
Performance Characteristics
Latency
- Discovery Response: < 1s (network RTT + processing)
- Election Duration: ElectionTimeout + processing (~30-35s)
- Heartbeat Latency: < 1s (network RTT)
- Admin Failover: HeartbeatTimeout + ElectionTimeout (~45s)
Scalability
- Tested: 1-10 nodes
- Expected: 10-100 nodes (limited by gossipsub performance)
- Bottleneck: PubSub message fanout, JSON serialization overhead
Message Load
Per election cycle:
- Discovery: 1 request + N responses (N+1 messages)
- Election: 1 start + N candidacies + N votes + 1 winner (2N+2 messages)
- Heartbeat: 1 message every ~7.5s from the current admin
Combined, one discovery-plus-election cycle is roughly 3N+3 messages; for a 5-node cluster that is about 18 messages, excluding heartbeats.
Common Issues and Solutions
Issue: Rapid Election Churn
Symptoms: Elections occurring frequently, admin changing constantly
Causes:
- Network instability
- Insufficient stability windows
- Admin node resource exhaustion
Solutions:
- Increase stability windows:
  export CHORUS_ELECTION_MIN_TERM="60s"
  export CHORUS_LEADER_MIN_TERM="90s"
- Investigate network connectivity
- Check admin node resources
- Review scoring weights (prefer stable nodes)
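To act on these symptoms automatically, the admin-change callback (see the monitoring sketch earlier) can feed a small sliding-window detector. This is an application-side sketch under assumed thresholds, not a pkg/election API.

```go
// Sliding-window admin-churn detector (application-side sketch).
package monitoring

import (
	"sync"
	"time"
)

type ChurnDetector struct {
	mu      sync.Mutex
	window  time.Duration // e.g. 10 * time.Minute
	maxHits int           // e.g. 3 admin changes per window
	events  []time.Time
}

func NewChurnDetector(window time.Duration, maxHits int) *ChurnDetector {
	return &ChurnDetector{window: window, maxHits: maxHits}
}

// Record notes an admin change and reports whether churn now exceeds the threshold.
func (c *ChurnDetector) Record(now time.Time) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	cutoff := now.Add(-c.window)
	kept := c.events[:0]
	for _, t := range c.events {
		if t.After(cutoff) {
			kept = append(kept, t)
		}
	}
	c.events = append(kept, now)
	return len(c.events) > c.maxHits
}
```

Wired into the onAdminChanged callback, `if detector.Record(time.Now()) { ... }` can raise an alert that points operators at network stability and scoring weights rather than letting the cluster churn silently.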
Issue: Split-Brain (Multiple Admins)
Symptoms: Different nodes report different admins
Causes:
- Network partition
- PubSub message loss
- No quorum enforcement
Solutions:
- Trigger a manual election to force re-sync: em.TriggerElection(TriggerManual)
- Verify network connectivity
- Consider implementing quorum (future enhancement)
Issue: No Admin Elected
Symptoms: All nodes report empty admin
Causes:
- No nodes have admin capabilities
- Election timeout too short
- PubSub not properly connected
Solutions:
- Verify at least one node has admin capabilities: capabilities = ["admin_election", "context_curation"]
- Increase ElectionTimeout
- Check PubSub subscription status
- Verify nodes are connected in libp2p mesh
Issue: Admin Heartbeat Not Received
Symptoms: Frequent heartbeat timeout elections despite admin running
Causes:
- PubSub message loss
- Heartbeat goroutine stopped
- Clock skew
Solutions:
- Check heartbeat status:
  status := em.GetHeartbeatStatus()
  log.Printf("Heartbeat status: %+v", status)
- Verify PubSub connectivity
- Check admin node logs for heartbeat errors
- Ensure NTP synchronization across cluster
Security Considerations
Authentication
Current State: Election messages are not authenticated beyond libp2p peer IDs.
Risk: Malicious node could announce false election results.
Mitigation:
- Rely on libp2p transport security
- Future: Sign election messages with node private keys
- Future: Verify candidate claims against cluster membership
Authorization
Current State: Any node with admin capabilities can participate in elections.
Risk: Compromised node could win election and become admin.
Mitigation:
- Carefully control which nodes have admin capabilities
- Monitor election outcomes for suspicious patterns
- Future: Implement capability attestation
- Future: Add reputation scoring
Split-Brain Attacks
Current State: No strict quorum, partitions can elect separate admins.
Risk: Adversary could isolate admin and force minority election.
Mitigation:
- Use stability windows to prevent rapid changes
- Monitor for conflicting admin announcements
- Future: Implement configurable quorum requirements
- Future: Add partition detection and recovery
Message Spoofing
Current State: PubSub messages are authenticated by libp2p but content is not signed.
Risk: Man-in-the-middle could modify election messages.
Mitigation:
- Use libp2p transport security (TLS)
- Future: Add message signing with node keys
- Future: Implement message sequence numbers
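To make the "future" signing mitigations concrete, the sketch below shows one way election payloads could be signed and verified with the libp2p key pair each peer already holds. None of this exists in the package today; the SignedEnvelope type and field names are illustrative assumptions.

```go
// Illustrative (not implemented) signing of election payloads with libp2p node keys.
package signing

import (
	"encoding/json"

	"github.com/libp2p/go-libp2p/core/crypto"
)

// SignedEnvelope is a hypothetical wrapper around a raw election message.
type SignedEnvelope struct {
	Payload   json.RawMessage `json:"payload"`
	Signature []byte          `json:"signature"`
}

// Sign wraps a marshaled election message with a signature from the node's private key.
func Sign(priv crypto.PrivKey, payload []byte) (*SignedEnvelope, error) {
	sig, err := priv.Sign(payload)
	if err != nil {
		return nil, err
	}
	return &SignedEnvelope{Payload: payload, Signature: sig}, nil
}

// Verify checks the envelope against the sender's public key
// (e.g. the key associated with its libp2p peer ID).
func Verify(pub crypto.PubKey, env *SignedEnvelope) (bool, error) {
	return pub.Verify(env.Payload, env.Signature)
}
```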
SLURP Production Readiness
Status: Experimental - Not recommended for production
Incomplete Features:
- Context manager integration (TODOs present)
- State recovery mechanisms
- Production metrics collection
- Comprehensive failover testing
Production Use: Wait for the following before relying on SLURP features in production:
- Context manager interface stabilization
- Complete state recovery implementation
- Production metrics and monitoring
- Multi-node failover testing
- Documentation of recovery procedures
Summary
The CHORUS election package provides democratic leader election for distributed agent clusters. Key highlights:
Production Features (Ready)
- ✅ Democratic Voting: Uptime and capability-based candidate scoring
- ✅ Heartbeat Monitoring: ~7.5s send interval, 15s timeout for liveness detection
- ✅ Automatic Failover: Elections triggered on heartbeat timeout, split-brain, or manual request
- ✅ Stability Windows: Prevent election churn during network instability
- ✅ Clean Transitions: Callback system for graceful leadership handoffs
- ✅ Well-Tested: Comprehensive unit tests with real libp2p integration
Experimental Features (Not Production-Ready)
- ⚠️ SLURP Integration: Context leadership with advanced AI scoring
- ⚠️ Failover State: Graceful transfer with state preservation
- ⚠️ Health Monitoring: Cluster health tracking framework
Key Metrics
- Discovery Cycle: 10s (configurable)
- Heartbeat Interval: ~7.5s (HeartbeatTimeout / 2)
- Heartbeat Timeout: 15s (triggers election)
- Election Duration: 30s (voting period)
- Failover Time: ~45s (timeout + election)
Recommended Configuration
[security.election_config]
discovery_timeout = "10s"
election_timeout = "30s"
heartbeat_timeout = "15s"
Plus stability windows via environment variables:
export CHORUS_ELECTION_MIN_TERM="30s"
export CHORUS_LEADER_MIN_TERM="45s"
Next Steps for Production
- Implement Resource Metrics: Replace simulated metrics with actual system monitoring
- Add Quorum Support: Implement configurable quorum for split-brain prevention (see the sketch after this list)
- Complete SLURP Integration: Finish context manager integration and state recovery
- Enhanced Security: Add message signing and capability attestation
- Comprehensive Testing: Multi-node integration tests with partition scenarios
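For the quorum item, a strict-majority check is the usual starting point. The helper below sketches only the arithmetic, under the assumption that membership tracking and vote counting are handled elsewhere; it is not an existing pkg/election function.

```go
// Majority-quorum check (sketch): an election result is accepted only when the
// winner's votes represent a strict majority of known cluster members.
func hasQuorum(votesForWinner, clusterSize int) bool {
	if clusterSize <= 0 {
		return false
	}
	return votesForWinner >= clusterSize/2+1
}
```

For a 5-node cluster this requires at least 3 votes, so a 2-node minority partition cannot elect its own admin.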
Documentation Version: 1.0
Last Updated: 2025-09-30
Package Version: Based on commit at documentation time
Maintainer: CHORUS Development Team