# CHORUS Election Package Documentation

**Package:** `chorus/pkg/election`
**Purpose:** Democratic leader election and consensus coordination for distributed CHORUS agents
**Status:** Production-ready core system; SLURP integration experimental

---

## Table of Contents

1. [Overview](#overview)
2. [Architecture](#architecture)
3. [Election Algorithm](#election-algorithm)
4. [Admin Heartbeat Mechanism](#admin-heartbeat-mechanism)
5. [Election Triggers](#election-triggers)
6. [Candidate Scoring System](#candidate-scoring-system)
7. [SLURP Integration](#slurp-integration)
8. [Quorum and Consensus](#quorum-and-consensus)
9. [API Reference](#api-reference)
10. [Configuration](#configuration)
11. [Message Formats](#message-formats)
12. [State Machine](#state-machine)
13. [Callbacks and Events](#callbacks-and-events)
14. [Testing](#testing)
15. [Production Considerations](#production-considerations)

---

## Overview

The election package implements a democratic leader election system for distributed CHORUS agents. It enables autonomous agents to elect an "admin" node responsible for coordination, context curation, and key management tasks. The system uses uptime-based voting, capability scoring, and heartbeat monitoring to maintain stable leadership while allowing graceful failover.

### Key Features

- **Democratic Election**: Nodes vote for the most qualified candidate based on uptime, capabilities, and resources
- **Heartbeat Monitoring**: The active admin sends periodic heartbeats (every `HeartbeatTimeout / 2`, ~7.5s by default) to prove liveness
- **Automatic Failover**: Elections triggered on heartbeat timeout (15s), split-brain detection, or manual triggers
- **Capability-Based Scoring**: Candidates scored on admin capabilities, resources, uptime, and experience
- **SLURP Integration**: Experimental context leadership with Project Manager intelligence capabilities
- **Stability Windows**: Prevents election churn with configurable minimum term durations
- **Graceful Transitions**: Callback system for clean leadership handoffs

### Use Cases

1. **Admin Node Selection**: Elect a coordinator for project-wide context curation
2. **Split-Brain Recovery**: Resolve network partition conflicts through re-election
3. **Load Distribution**: Select admin based on available resources and current load
4. **Failover**: Automatic promotion of standby nodes when admin becomes unavailable
5. **Context Leadership**: (SLURP) Specialized election for AI context generation leadership
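The typical wiring, condensed from the [API Reference](#api-reference) below, looks roughly like this. Imports and the usual CHORUS bootstrap (loading `cfg`, creating the libp2p `host` and `ps` PubSub instance) are assumed and elided here:

```go
// Minimal lifecycle sketch: construct an ElectionManager, watch for admin
// changes, and start participating in elections.
em := election.NewElectionManager(ctx, cfg, host, ps, nodeID)

em.SetCallbacks(
    func(oldAdmin, newAdmin string) {
        log.Printf("admin changed: %s -> %s", oldAdmin, newAdmin)
    },
    func(winner string) {
        log.Printf("election complete, winner: %s", winner)
    },
)

if err := em.Start(); err != nil {
    log.Fatalf("failed to start election manager: %v", err)
}
defer em.Stop()

// Any component can later ask whether this node currently holds the admin role.
if em.IsCurrentAdmin() {
    // perform admin-only duties
}
```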
---

## Architecture

### Component Structure

```
election/
├── election.go        # Core election manager (production)
├── interfaces.go      # Shared type definitions
├── slurp_election.go  # SLURP election interface (experimental)
├── slurp_manager.go   # SLURP election manager implementation (experimental)
├── slurp_scoring.go   # SLURP candidate scoring (experimental)
└── election_test.go   # Unit tests
```

### Core Components

#### 1. ElectionManager (Production)

The `ElectionManager` is the production-ready core election coordinator:

```go
type ElectionManager struct {
    ctx    context.Context
    cancel context.CancelFunc
    config *config.Config
    host   libp2p.Host
    pubsub *pubsub.PubSub
    nodeID string

    // Election state
    mu            sync.RWMutex
    state         ElectionState
    currentTerm   int
    lastHeartbeat time.Time
    currentAdmin  string
    candidates    map[string]*AdminCandidate
    votes         map[string]string // voter -> candidate

    // Timers and channels
    heartbeatTimer  *time.Timer
    discoveryTimer  *time.Timer
    electionTimer   *time.Timer
    electionTrigger chan ElectionTrigger

    // Heartbeat management
    heartbeatManager *HeartbeatManager

    // Callbacks
    onAdminChanged     func(oldAdmin, newAdmin string)
    onElectionComplete func(winner string)

    // Stability windows (prevents election churn)
    lastElectionTime        time.Time
    electionStabilityWindow time.Duration
    leaderStabilityWindow   time.Duration

    startTime time.Time
}
```

**Key Responsibilities:**
- Discovery of existing admin via broadcast queries
- Triggering elections based on heartbeat timeouts or manual triggers
- Managing candidate announcements and vote collection
- Determining election winners based on votes and scores
- Broadcasting election results to cluster
- Managing admin heartbeat lifecycle

#### 2. HeartbeatManager

Manages the admin heartbeat transmission lifecycle:

```go
type HeartbeatManager struct {
    mu          sync.Mutex
    isRunning   bool
    stopCh      chan struct{}
    ticker      *time.Ticker
    electionMgr *ElectionManager
    logger      func(msg string, args ...interface{})
}
```

**Configuration:**
- **Heartbeat Interval**: `HeartbeatTimeout / 2` (default ~7.5s)
- **Heartbeat Timeout**: 15 seconds (configurable via `Security.ElectionConfig.HeartbeatTimeout`)
- **Transmission**: Only when node is current admin
- **Lifecycle**: Automatically started/stopped on leadership changes

#### 3. SLURPElectionManager (Experimental)

Extends `ElectionManager` with SLURP contextual intelligence for Project Manager duties:

```go
type SLURPElectionManager struct {
    *ElectionManager // Embeds base election manager

    // SLURP-specific state
    contextMu        sync.RWMutex
    contextManager   ContextManager
    slurpConfig      *SLURPElectionConfig
    contextCallbacks *ContextLeadershipCallbacks

    // Context leadership state
    isContextLeader  bool
    contextTerm      int64
    contextStartedAt *time.Time
    lastHealthCheck  time.Time

    // Failover state
    failoverState      *ContextFailoverState
    transferInProgress bool

    // Monitoring
    healthMonitor    *ContextHealthMonitor
    metricsCollector *ContextMetricsCollector

    // Shutdown coordination
    contextShutdown chan struct{}
    contextWg       sync.WaitGroup
}
```

**Additional Capabilities:**
- Context generation leadership
- Graceful leadership transfer with state preservation
- Health monitoring and metrics collection
- Failover state validation and recovery
- Advanced scoring for AI capabilities

---
## Election Algorithm

### Democratic Election Process

The election system implements a **democratic voting algorithm** where nodes elect the most qualified candidate based on objective metrics.

#### Election Flow

```
┌──────────────────────────────────────────────────────
│ 1. DISCOVERY PHASE
│    - Node broadcasts admin discovery request
│    - Existing admin (if any) responds
│    - Node updates currentAdmin if discovered
└──────────────────────────────────────────────────────
                          │
                          ▼
┌──────────────────────────────────────────────────────
│ 2. ELECTION TRIGGER
│    - Heartbeat timeout (15s without admin heartbeat)
│    - No admin discovered after discovery attempts
│    - Split-brain detection
│    - Manual trigger
│    - Quorum restoration
└──────────────────────────────────────────────────────
                          │
                          ▼
┌──────────────────────────────────────────────────────
│ 3. CANDIDATE ANNOUNCEMENT
│    - Eligible nodes announce candidacy
│    - Include: NodeID, capabilities, uptime, resources
│    - Calculate and include candidate score
│    - Broadcast to election topic
└──────────────────────────────────────────────────────
                          │
                          ▼
┌──────────────────────────────────────────────────────
│ 4. VOTE COLLECTION (Election Timeout Period)
│    - Nodes receive candidate announcements
│    - Nodes cast votes for highest-scoring candidate
│    - Votes broadcast to cluster
│    - Duration: Security.ElectionConfig.ElectionTimeout
└──────────────────────────────────────────────────────
                          │
                          ▼
┌──────────────────────────────────────────────────────
│ 5. WINNER DETERMINATION
│    - Tally votes for each candidate
│    - Winner: Most votes (ties broken by score)
│    - Fallback: Highest score if no votes cast
│    - Broadcast election winner
└──────────────────────────────────────────────────────
                          │
                          ▼
┌──────────────────────────────────────────────────────
│ 6. LEADERSHIP TRANSITION
│    - Update currentAdmin
│    - Winner starts admin heartbeat
│    - Previous admin stops heartbeat (if different node)
│    - Trigger callbacks (OnAdminChanged, OnElectionComplete)
│    - Return to DISCOVERY/MONITORING phase
└──────────────────────────────────────────────────────
```

### Eligibility Criteria

A node can become admin if it has **any** of these capabilities:
- `admin_election` - Core admin election capability
- `context_curation` - Context management capability
- `project_manager` - Project coordination capability

Checked via `ElectionManager.canBeAdmin()`:
```go
func (em *ElectionManager) canBeAdmin() bool {
    for _, cap := range em.config.Agent.Capabilities {
        if cap == "admin_election" || cap == "context_curation" || cap == "project_manager" {
            return true
        }
    }
    return false
}
```

### Election Timing

- **Discovery Loop**: Runs continuously, interval = `Security.ElectionConfig.DiscoveryTimeout` (default: 10s)
- **Election Timeout**: `Security.ElectionConfig.ElectionTimeout` (default: 30s)
- **Randomized Delay**: When triggering an election after discovery failure, the node waits a base of `2 × DiscoveryTimeout` plus up to `DiscoveryTimeout` of random jitter (i.e. 2×–3× `DiscoveryTimeout`) to prevent simultaneous elections; see the sketch after this list
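With the default 10s `DiscoveryTimeout`, the jittered delay therefore lands between 20s and 30s. A small illustration of the computation (mirroring the Discovery Failure trigger shown later; values and variable names are illustrative):

```go
// Illustrative only: the randomized election delay used after a failed
// admin discovery. With DiscoveryTimeout = 10s this yields 20s–30s.
discoveryTimeout := 10 * time.Second

baseDelay := 2 * discoveryTimeout                             // 20s
jitter := time.Duration(rand.Int63n(int64(discoveryTimeout))) // 0–10s
totalDelay := baseDelay + jitter                              // 20s–30s

fmt.Printf("waiting %s before triggering election\n", totalDelay)
```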
---

## Admin Heartbeat Mechanism

The admin heartbeat proves liveness and prevents unnecessary elections.

### Heartbeat Configuration

| Parameter | Value | Description |
|-----------|-------|-------------|
| **Interval** | `HeartbeatTimeout / 2` | Heartbeat transmission frequency (~7.5s) |
| **Timeout** | `HeartbeatTimeout` | Max time without heartbeat before election (15s) |
| **Topic** | `CHORUS/admin/heartbeat/v1` | PubSub topic for heartbeats |
| **Format** | JSON | Message serialization format |

### Heartbeat Message Format

```json
{
  "node_id": "QmXxx...abc",
  "timestamp": "2025-09-30T18:15:30.123456789Z"
}
```

**Fields:**
- `node_id` (string): Admin node's ID
- `timestamp` (RFC3339Nano): When heartbeat was sent
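Publishing is handled by `SendAdminHeartbeat()` (see the API Reference); conceptually it marshals this envelope and publishes it on the heartbeat topic. A hedged sketch of that shape from inside the manager — the `heartbeatTopic` handle is a stand-in for whatever pubsub wrapper the implementation actually uses:

```go
// Illustrative sketch of what sending a heartbeat involves; the real logic
// lives behind ElectionManager.SendAdminHeartbeat().
heartbeat := struct {
    NodeID    string    `json:"node_id"`
    Timestamp time.Time `json:"timestamp"`
}{
    NodeID:    em.nodeID,
    Timestamp: time.Now(),
}

data, err := json.Marshal(heartbeat)
if err != nil {
    return err
}

// "CHORUS/admin/heartbeat/v1" is the topic documented above.
return heartbeatTopic.Publish(em.ctx, data)
```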
### Heartbeat Lifecycle

#### Starting Heartbeat (Becoming Admin)

```go
// Automatically called when node becomes admin
func (hm *HeartbeatManager) StartHeartbeat() error {
    hm.mu.Lock()
    defer hm.mu.Unlock()

    if hm.isRunning {
        return nil // Already running
    }

    if !hm.electionMgr.IsCurrentAdmin() {
        return fmt.Errorf("not admin, cannot start heartbeat")
    }

    hm.stopCh = make(chan struct{})
    interval := hm.electionMgr.config.Security.ElectionConfig.HeartbeatTimeout / 2
    hm.ticker = time.NewTicker(interval)
    hm.isRunning = true

    go hm.heartbeatLoop()

    return nil
}
```

#### Stopping Heartbeat (Losing Admin)

```go
// Automatically called when node loses admin role
func (hm *HeartbeatManager) StopHeartbeat() error {
    hm.mu.Lock()
    defer hm.mu.Unlock()

    if !hm.isRunning {
        return nil // Already stopped
    }

    close(hm.stopCh)

    if hm.ticker != nil {
        hm.ticker.Stop()
        hm.ticker = nil
    }

    hm.isRunning = false
    return nil
}
```

#### Heartbeat Transmission Loop

```go
func (hm *HeartbeatManager) heartbeatLoop() {
    defer func() {
        hm.mu.Lock()
        hm.isRunning = false
        hm.mu.Unlock()
    }()

    for {
        select {
        case <-hm.ticker.C:
            // Only send heartbeat if still admin
            if hm.electionMgr.IsCurrentAdmin() {
                if err := hm.electionMgr.SendAdminHeartbeat(); err != nil {
                    hm.logger("Failed to send heartbeat: %v", err)
                }
            } else {
                hm.logger("No longer admin, stopping heartbeat")
                return
            }

        case <-hm.stopCh:
            return

        case <-hm.electionMgr.ctx.Done():
            return
        }
    }
}
```

### Heartbeat Processing

When a node receives a heartbeat:

```go
func (em *ElectionManager) handleAdminHeartbeat(data []byte) {
    var heartbeat struct {
        NodeID    string    `json:"node_id"`
        Timestamp time.Time `json:"timestamp"`
    }

    if err := json.Unmarshal(data, &heartbeat); err != nil {
        log.Printf("❌ Failed to unmarshal heartbeat: %v", err)
        return
    }

    em.mu.Lock()
    defer em.mu.Unlock()

    // Update admin and heartbeat timestamp
    if em.currentAdmin == "" || em.currentAdmin == heartbeat.NodeID {
        em.currentAdmin = heartbeat.NodeID
        em.lastHeartbeat = heartbeat.Timestamp
    }
}
```

### Timeout Detection

Checked during discovery loop:

```go
func (em *ElectionManager) performAdminDiscovery() {
    em.mu.Lock()
    lastHeartbeat := em.lastHeartbeat
    em.mu.Unlock()

    // Check if admin heartbeat has timed out
    if !lastHeartbeat.IsZero() &&
        time.Since(lastHeartbeat) > em.config.Security.ElectionConfig.HeartbeatTimeout {
        log.Printf("⚰️ Admin heartbeat timeout detected (last: %v)", lastHeartbeat)
        em.TriggerElection(TriggerHeartbeatTimeout)
        return
    }
}
```

---
## Election Triggers

Elections can be triggered by multiple events, each with different stability guarantees.

### Trigger Types

```go
type ElectionTrigger string

const (
    TriggerHeartbeatTimeout ElectionTrigger = "admin_heartbeat_timeout"
    TriggerDiscoveryFailure ElectionTrigger = "no_admin_discovered"
    TriggerSplitBrain       ElectionTrigger = "split_brain_detected"
    TriggerQuorumRestored   ElectionTrigger = "quorum_restored"
    TriggerManual           ElectionTrigger = "manual_trigger"
)
```

### Trigger Details

#### 1. Heartbeat Timeout

**When:** No admin heartbeat received for `HeartbeatTimeout` duration (15s)

**Behavior:**
- Most common trigger for failover
- Indicates admin node failure or network partition
- Immediate election trigger (no stability window applies)

**Example:**
```go
if time.Since(lastHeartbeat) > em.config.Security.ElectionConfig.HeartbeatTimeout {
    em.TriggerElection(TriggerHeartbeatTimeout)
}
```

#### 2. Discovery Failure

**When:** No admin discovered after multiple discovery attempts

**Behavior:**
- Occurs on cluster startup or after total admin loss
- Includes randomized delay to prevent simultaneous elections
- Base delay: `2 × DiscoveryTimeout` + random(`DiscoveryTimeout`)

**Example:**
```go
if currentAdmin == "" && em.canBeAdmin() {
    baseDelay := em.config.Security.ElectionConfig.DiscoveryTimeout * 2
    randomDelay := time.Duration(rand.Intn(int(em.config.Security.ElectionConfig.DiscoveryTimeout)))
    totalDelay := baseDelay + randomDelay

    time.Sleep(totalDelay)

    if stillNoAdmin && stillIdle && em.canBeAdmin() {
        em.TriggerElection(TriggerDiscoveryFailure)
    }
}
```

#### 3. Split-Brain Detection

**When:** Multiple nodes believe they are admin

**Behavior:**
- Detected through conflicting admin announcements
- Forces re-election to resolve conflict
- Should be rare in properly configured clusters

**Usage:** (Implementation-specific, typically in cluster coordination layer)

#### 4. Quorum Restored

**When:** Network partition heals and quorum is re-established

**Behavior:**
- Allows cluster to re-elect with full member participation
- Ensures minority partition doesn't maintain stale admin

**Usage:** (Implementation-specific, typically in quorum management layer)

#### 5. Manual Trigger

**When:** Explicitly triggered via API or administrative action

**Behavior:**
- Used for planned leadership transfers
- Used for testing and debugging
- Respects stability windows (can be overridden)

**Example:**
```go
em.TriggerElection(TriggerManual)
```

### Stability Windows

To prevent election churn, the system enforces minimum durations between elections:

#### Election Stability Window

**Default:** `2 × DiscoveryTimeout` (20s)
**Configuration:** Environment variable `CHORUS_ELECTION_MIN_TERM`

Prevents rapid back-to-back elections regardless of admin state.

```go
func getElectionStabilityWindow(cfg *config.Config) time.Duration {
    if stability := os.Getenv("CHORUS_ELECTION_MIN_TERM"); stability != "" {
        if duration, err := time.ParseDuration(stability); err == nil {
            return duration
        }
    }

    if cfg.Security.ElectionConfig.DiscoveryTimeout > 0 {
        return cfg.Security.ElectionConfig.DiscoveryTimeout * 2
    }

    return 30 * time.Second // Fallback
}
```

#### Leader Stability Window

**Default:** `3 × HeartbeatTimeout` (45s)
**Configuration:** Environment variable `CHORUS_LEADER_MIN_TERM`

Prevents challenging a healthy leader too quickly after election.

```go
func getLeaderStabilityWindow(cfg *config.Config) time.Duration {
    if stability := os.Getenv("CHORUS_LEADER_MIN_TERM"); stability != "" {
        if duration, err := time.ParseDuration(stability); err == nil {
            return duration
        }
    }

    if cfg.Security.ElectionConfig.HeartbeatTimeout > 0 {
        return cfg.Security.ElectionConfig.HeartbeatTimeout * 3
    }

    return 45 * time.Second // Fallback
}
```

#### Stability Window Enforcement

```go
func (em *ElectionManager) TriggerElection(trigger ElectionTrigger) {
    em.mu.RLock()
    currentState := em.state
    currentAdmin := em.currentAdmin
    lastElection := em.lastElectionTime
    em.mu.RUnlock()

    if currentState != StateIdle {
        log.Printf("🗳️ Election already in progress (state: %s), ignoring trigger: %s",
            currentState, trigger)
        return
    }

    now := time.Now()
    if !lastElection.IsZero() {
        timeSinceElection := now.Sub(lastElection)

        // Leader stability window (if we have a current admin)
        if currentAdmin != "" && timeSinceElection < em.leaderStabilityWindow {
            log.Printf("⏳ Leader stability window active (%.1fs remaining), ignoring trigger: %s",
                (em.leaderStabilityWindow - timeSinceElection).Seconds(), trigger)
            return
        }

        // General election stability window
        if timeSinceElection < em.electionStabilityWindow {
            log.Printf("⏳ Election stability window active (%.1fs remaining), ignoring trigger: %s",
                (em.electionStabilityWindow - timeSinceElection).Seconds(), trigger)
            return
        }
    }

    select {
    case em.electionTrigger <- trigger:
        log.Printf("🗳️ Election triggered: %s", trigger)
    default:
        log.Printf("⚠️ Election trigger buffer full, ignoring: %s", trigger)
    }
}
```

**Key Points:**
- Stability windows prevent election storms during network instability
- Heartbeat timeout triggers bypass some stability checks (admin definitely unavailable)
- Manual triggers respect stability windows unless explicitly overridden
- Referenced in WHOOSH issue #7 as fix for election churn

---
## Candidate Scoring System

Candidates are scored on multiple dimensions to determine the most qualified admin.

### Base Election Scoring (Production)

#### Scoring Formula

```
finalScore = uptimeScore * 0.3 +
             capabilityScore * 0.2 +
             resourceScore * 0.2 +
             networkQuality * 0.15 +
             experienceScore * 0.15
```

#### Component Scores

**1. Uptime Score (Weight: 0.3)**

Measures node stability and continuous availability.

```go
uptimeScore := min(1.0, candidate.Uptime.Hours() / 24.0)
```

- **Calculation:** Linear scaling from 0 to 1.0 over 24 hours
- **Max Score:** 1.0 (achieved at 24+ hours uptime)
- **Purpose:** Prefer nodes with proven stability

**2. Capability Score (Weight: 0.2)**

Measures administrative and coordination capabilities.

```go
capabilityScore := 0.0
adminCapabilities := []string{
    "admin_election",
    "context_curation",
    "key_reconstruction",
    "semantic_analysis",
    "project_manager",
}

for _, cap := range candidate.Capabilities {
    for _, adminCap := range adminCapabilities {
        if cap == adminCap {
            weight := 0.25 // Default weight
            // Project manager capabilities get higher weight
            if adminCap == "project_manager" || adminCap == "context_curation" {
                weight = 0.35
            }
            capabilityScore += weight
        }
    }
}
capabilityScore = min(1.0, capabilityScore)
```

- **Admin Capabilities:** +0.25 per capability (standard)
- **Premium Capabilities:** +0.35 for `project_manager` and `context_curation`
- **Max Score:** 1.0 (capped)
- **Purpose:** Prefer nodes with admin-specific capabilities

**3. Resource Score (Weight: 0.2)**

Measures available compute resources (lower usage = better).

```go
resourceScore := (1.0 - candidate.Resources.CPUUsage) * 0.3 +
    (1.0 - candidate.Resources.MemoryUsage) * 0.3 +
    (1.0 - candidate.Resources.DiskUsage) * 0.2 +
    candidate.Resources.NetworkQuality * 0.2
```

- **CPU Usage:** Lower is better (30% weight)
- **Memory Usage:** Lower is better (30% weight)
- **Disk Usage:** Lower is better (20% weight)
- **Network Quality:** Higher is better (20% weight)
- **Purpose:** Prefer nodes with available resources

**4. Network Quality Score (Weight: 0.15)**

Direct measure of network connectivity quality.

```go
networkScore := candidate.Resources.NetworkQuality // Range: 0.0 to 1.0
```

- **Source:** Measured network quality metric
- **Range:** 0.0 (poor) to 1.0 (excellent)
- **Purpose:** Ensure admin has good connectivity

**5. Experience Score (Weight: 0.15)**

Measures long-term operational experience.

```go
experienceScore := min(1.0, candidate.Experience.Hours() / 168.0)
```

- **Calculation:** Linear scaling from 0 to 1.0 over 1 week (168 hours)
- **Max Score:** 1.0 (achieved at 1+ week experience)
- **Purpose:** Prefer nodes with proven long-term reliability

#### Resource Metrics Structure

```go
type ResourceMetrics struct {
    CPUUsage       float64 `json:"cpu_usage"`       // 0.0 to 1.0 (0-100%)
    MemoryUsage    float64 `json:"memory_usage"`    // 0.0 to 1.0 (0-100%)
    DiskUsage      float64 `json:"disk_usage"`      // 0.0 to 1.0 (0-100%)
    NetworkQuality float64 `json:"network_quality"` // 0.0 to 1.0 (quality score)
}
```

**Note:** Current implementation uses simulated values. Production systems should integrate actual resource monitoring.
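As a concrete, purely hypothetical illustration of the formula above: a candidate with 12h uptime, the `admin_election` and `project_manager` capabilities, moderate load, good connectivity, and three days of experience would come out around 0.60:

```go
// Hypothetical candidate: 12h uptime, capabilities {admin_election, project_manager},
// CPU 35%, memory 50%, disk 40%, network quality 0.9, experience 72h.
uptimeScore := 12.0 / 24.0      // 0.50
capabilityScore := 0.25 + 0.35  // 0.60
resourceScore := (1-0.35)*0.3 + (1-0.50)*0.3 + (1-0.40)*0.2 + 0.9*0.2 // 0.645
networkScore := 0.9
experienceScore := 72.0 / 168.0 // ~0.43

finalScore := uptimeScore*0.3 + capabilityScore*0.2 + resourceScore*0.2 +
    networkScore*0.15 + experienceScore*0.15
// ≈ 0.150 + 0.120 + 0.129 + 0.135 + 0.064 ≈ 0.60
```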
### SLURP Candidate Scoring (Experimental)

SLURP extends base scoring with contextual intelligence metrics for Project Manager leadership.

#### Extended Scoring Formula

```
finalScore = baseScore * (baseWeightsSum) +
             contextCapabilityScore * contextWeight +
             intelligenceScore * intelligenceWeight +
             coordinationScore * coordinationWeight +
             qualityScore * qualityWeight +
             performanceScore * performanceWeight +
             specializationScore * specializationWeight +
             availabilityScore * availabilityWeight +
             reliabilityScore * reliabilityWeight
```

Normalized by total weight sum.
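In other words, the weighted sum is divided by the sum of all weights so the result stays in the 0–1 range even if the configured weights do not add up to exactly 1.0. A minimal sketch of that normalization (the `components` slice is illustrative, not a type from the package):

```go
// Illustrative normalization: keep the final score in [0, 1] even when the
// configured weights do not sum to exactly 1.0.
weightedSum := 0.0
totalWeight := 0.0
for _, c := range components { // each component carries a score and a weight
    weightedSum += c.Score * c.Weight
    totalWeight += c.Weight
}
finalScore := weightedSum / totalWeight
```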
#### SLURP Scoring Weights (Default)

```go
func DefaultSLURPScoringWeights() *SLURPScoringWeights {
    return &SLURPScoringWeights{
        // Base election weights (total: 0.4)
        UptimeWeight:     0.08,
        CapabilityWeight: 0.10,
        ResourceWeight:   0.08,
        NetworkWeight:    0.06,
        ExperienceWeight: 0.08,

        // SLURP-specific weights (total: 0.6)
        ContextCapabilityWeight: 0.15, // Most important for context leadership
        IntelligenceWeight:      0.12,
        CoordinationWeight:      0.10,
        QualityWeight:           0.08,
        PerformanceWeight:       0.06,
        SpecializationWeight:    0.04,
        AvailabilityWeight:      0.03,
        ReliabilityWeight:       0.02,
    }
}
```

#### SLURP Component Scores

**1. Context Capability Score (Weight: 0.15)**

Core context generation capabilities:

```go
score := 0.0
if caps.ContextGeneration   { score += 0.3 } // Required for leadership
if caps.ContextCuration     { score += 0.2 } // Content quality
if caps.ContextDistribution { score += 0.2 } // Delivery capability
if caps.ContextStorage      { score += 0.1 } // Persistence
if caps.SemanticAnalysis    { score += 0.1 } // Advanced analysis
if caps.RAGIntegration      { score += 0.1 } // RAG capability
```

**2. Intelligence Score (Weight: 0.12)**

AI and analysis capabilities:

```go
score := 0.0
if caps.SemanticAnalysis { score += 0.25 }
if caps.RAGIntegration   { score += 0.25 }
if caps.TemporalAnalysis { score += 0.25 }
if caps.DecisionTracking { score += 0.25 }

// Apply quality multiplier
score = score * caps.GenerationQuality
```

**3. Coordination Score (Weight: 0.10)**

Cluster management capabilities:

```go
score := 0.0
if caps.ClusterCoordination { score += 0.3 }
if caps.LoadBalancing       { score += 0.25 }
if caps.HealthMonitoring    { score += 0.2 }
if caps.ResourceManagement  { score += 0.25 }
```

**4. Quality Score (Weight: 0.08)**

Average of quality metrics:

```go
score := (caps.GenerationQuality + caps.ProcessingSpeed + caps.AccuracyScore) / 3.0
```

**5. Performance Score (Weight: 0.06)**

Historical operation success:

```go
totalOps := caps.SuccessfulOperations + caps.FailedOperations
successRate := float64(caps.SuccessfulOperations) / float64(totalOps)

// Response time score (1s optimal, 10s poor)
responseTimeScore := calculateResponseTimeScore(caps.AverageResponseTime)

score := (successRate * 0.7) + (responseTimeScore * 0.3)
```

**6. Specialization Score (Weight: 0.04)**

Domain expertise coverage:

```go
domainCoverage := float64(len(caps.DomainExpertise)) / 10.0
domainCoverage = min(1.0, domainCoverage)

score := (caps.SpecializationScore * 0.6) + (domainCoverage * 0.4)
```

**7. Availability Score (Weight: 0.03)**

Resource availability:

```go
cpuScore := min(1.0, caps.AvailableCPU / 8.0)           // 8 cores = 1.0
memoryScore := min(1.0, caps.AvailableMemory / 16GB)    // 16GB = 1.0
storageScore := min(1.0, caps.AvailableStorage / 1TB)   // 1TB = 1.0
networkScore := min(1.0, caps.NetworkBandwidth / 1Gbps) // 1Gbps = 1.0

score := (cpuScore * 0.3) + (memoryScore * 0.3) +
    (storageScore * 0.2) + (networkScore * 0.2)
```

**8. Reliability Score (Weight: 0.02)**

Uptime and reliability:

```go
score := (caps.ReliabilityScore * 0.6) + (caps.UptimePercentage * 0.4)
```

#### SLURP Requirements Filtering

Candidates must meet minimum requirements to be eligible:

```go
func DefaultSLURPLeadershipRequirements() *SLURPLeadershipRequirements {
    return &SLURPLeadershipRequirements{
        RequiredCapabilities:  []string{"context_generation", "context_curation"},
        PreferredCapabilities: []string{"semantic_analysis", "cluster_coordination", "rag_integration"},

        MinQualityScore:     0.6,
        MinReliabilityScore: 0.7,
        MinUptimePercentage: 0.8,

        MinCPU:              2.0,        // 2 CPU cores
        MinMemory:           4 * GB,     // 4GB
        MinStorage:          100 * GB,   // 100GB
        MinNetworkBandwidth: 100 * Mbps, // 100 Mbps

        MinSuccessfulOperations: 10,
        MaxFailureRate:          0.1, // 10% max
        MaxResponseTime:         5 * time.Second,
    }
}
```

**Disqualification:** Candidates failing requirements receive a score of 0.0 and are marked with disqualification reasons.

#### Score Adjustments (Bonuses/Penalties)

```go
// Bonuses
if caps.GenerationQuality > 0.95 {
    finalScore += 0.05 // Exceptional quality
}
if caps.UptimePercentage > 0.99 {
    finalScore += 0.03 // Exceptional uptime
}
if caps.ContextGeneration && caps.ContextCuration &&
    caps.SemanticAnalysis && caps.ClusterCoordination {
    finalScore += 0.02 // Full capability coverage
}

// Penalties
if caps.GenerationQuality < 0.5 {
    finalScore -= 0.1 // Low quality
}
if caps.FailedOperations > caps.SuccessfulOperations {
    finalScore -= 0.15 // High failure rate
}
```

---
## SLURP Integration

SLURP (Semantic Layer for Understanding, Reasoning, and Planning) extends election with context generation leadership.

### Architecture

```
┌──────────────────────────────────────────────────────
│ Base Election (Production)
│   - Admin election
│   - Heartbeat monitoring
│   - Basic leadership
└──────────────────────────────────────────────────────
                          │
                          │ Embeds
                          ▼
┌──────────────────────────────────────────────────────
│ SLURP Election (Experimental)
│   - Context leadership
│   - Advanced scoring
│   - Failover state management
│   - Health monitoring
└──────────────────────────────────────────────────────
```

### Status: Experimental

The SLURP integration is **experimental** and provides:
- ✅ Extended candidate scoring for AI capabilities
- ✅ Context leadership state management
- ✅ Graceful failover with state preservation
- ✅ Health and metrics monitoring framework
- ⚠️ Incomplete: Actual context manager integration (TODOs present)
- ⚠️ Incomplete: State recovery mechanisms
- ⚠️ Incomplete: Production metrics collection

### Context Leadership

#### Becoming Context Leader

When a node wins election and becomes admin, it can also become context leader:

```go
func (sem *SLURPElectionManager) StartContextGeneration(ctx context.Context) error {
    if !sem.IsCurrentAdmin() {
        return fmt.Errorf("not admin, cannot start context generation")
    }

    sem.contextMu.Lock()
    defer sem.contextMu.Unlock()

    if sem.contextManager == nil {
        return fmt.Errorf("no context manager registered")
    }

    // Mark as context leader
    sem.isContextLeader = true
    sem.contextTerm++
    now := time.Now()
    sem.contextStartedAt = &now

    // Start background processes
    sem.contextWg.Add(2)
    go sem.runHealthMonitoring()
    go sem.runMetricsCollection()

    // Trigger callbacks
    if sem.contextCallbacks != nil {
        if sem.contextCallbacks.OnBecomeContextLeader != nil {
            sem.contextCallbacks.OnBecomeContextLeader(ctx, sem.contextTerm)
        }
        if sem.contextCallbacks.OnContextGenerationStarted != nil {
            sem.contextCallbacks.OnContextGenerationStarted(sem.nodeID)
        }
    }

    // Broadcast context leadership start
    // ...

    return nil
}
```

#### Losing Context Leadership

When a node loses the admin role or an election, it stops context generation:

```go
func (sem *SLURPElectionManager) StopContextGeneration(ctx context.Context) error {
    // Signal shutdown to background processes
    close(sem.contextShutdown)

    // Wait for background processes with timeout
    done := make(chan struct{})
    go func() {
        sem.contextWg.Wait()
        close(done)
    }()

    select {
    case <-done:
        // Clean shutdown
    case <-time.After(sem.slurpConfig.GenerationStopTimeout):
        // Timeout
    }

    sem.contextMu.Lock()
    sem.isContextLeader = false
    sem.contextStartedAt = nil
    sem.contextMu.Unlock()

    // Trigger callbacks
    // ...

    return nil
}
```

### Graceful Leadership Transfer

SLURP supports explicit leadership transfer with state preservation:

```go
func (sem *SLURPElectionManager) TransferContextLeadership(
    ctx context.Context,
    targetNodeID string,
) error {
    if !sem.IsContextLeader() {
        return fmt.Errorf("not context leader, cannot transfer")
    }

    // Prepare failover state
    state, err := sem.PrepareContextFailover(ctx)
    if err != nil {
        return err
    }

    // Send transfer message to cluster
    transferMsg := ElectionMessage{
        Type:      "context_leadership_transfer",
        NodeID:    sem.nodeID,
        Timestamp: time.Now(),
        Term:      int(sem.contextTerm),
        Data: map[string]interface{}{
            "target_node":    targetNodeID,
            "failover_state": state,
            "reason":         "manual_transfer",
        },
    }

    if err := sem.publishElectionMessage(transferMsg); err != nil {
        return err
    }

    // Stop context generation
    sem.StopContextGeneration(ctx)

    // Trigger new election
    sem.TriggerElection(TriggerManual)

    return nil
}
```

### Failover State

Context leadership state preserved during failover:

```go
type ContextFailoverState struct {
    // Basic failover state
    LeaderID     string
    Term         int64
    TransferTime time.Time

    // Context generation state
    QueuedRequests []*ContextGenerationRequest
    ActiveJobs     map[string]*ContextGenerationJob
    CompletedJobs  []*ContextGenerationJob

    // Cluster coordination state
    ClusterState        *ClusterState
    ResourceAllocations map[string]*ResourceAllocation
    NodeAssignments     map[string][]string

    // Configuration state
    ManagerConfig    *ManagerConfig
    GenerationPolicy *GenerationPolicy
    QueuePolicy      *QueuePolicy

    // State validation
    StateVersion   int64
    Checksum       string
    HealthSnapshot *ContextClusterHealth

    // Transfer metadata
    TransferReason    string
    TransferSource    string
    TransferDuration  time.Duration
    ValidationResults *ContextStateValidation
}
```

#### State Validation

Before accepting transferred state:

```go
func (sem *SLURPElectionManager) ValidateContextState(
    state *ContextFailoverState,
) (*ContextStateValidation, error) {
    validation := &ContextStateValidation{
        ValidatedAt: time.Now(),
        ValidatedBy: sem.nodeID,
        Valid:       true,
    }

    // Check basic fields
    if state.LeaderID == "" {
        validation.Issues = append(validation.Issues, "missing leader ID")
        validation.Valid = false
    }

    // Validate checksum
    if state.Checksum != "" {
        tempState := *state
        tempState.Checksum = ""
        data, _ := json.Marshal(tempState)
        hash := md5.Sum(data)
        expectedChecksum := fmt.Sprintf("%x", hash)
        validation.ChecksumValid = (expectedChecksum == state.Checksum)
        if !validation.ChecksumValid {
            validation.Issues = append(validation.Issues, "checksum validation failed")
            validation.Valid = false
        }
    }

    // Validate timestamps, queue state, cluster state, config
    // ...

    // Set recovery requirements if issues found
    if len(validation.Issues) > 0 {
        validation.RequiresRecovery = true
        validation.RecoverySteps = []string{
            "Review validation issues",
            "Perform partial state recovery",
            "Restart context generation with defaults",
        }
    }

    return validation, nil
}
```

### Health Monitoring

SLURP election includes cluster health monitoring:

```go
type ContextClusterHealth struct {
    TotalNodes         int
    HealthyNodes       int
    UnhealthyNodes     []string
    CurrentLeader      string
    LeaderHealthy      bool
    GenerationActive   bool
    QueueHealth        *QueueHealthStatus
    NodeHealths        map[string]*NodeHealthStatus
    LastElection       time.Time
    NextHealthCheck    time.Time
    OverallHealthScore float64 // 0-1
}
```

Health checks run periodically (default: 30s):

```go
func (sem *SLURPElectionManager) runHealthMonitoring() {
    defer sem.contextWg.Done()

    ticker := time.NewTicker(sem.slurpConfig.ContextHealthCheckInterval)
    defer ticker.Stop()

    for {
        select {
        case <-ticker.C:
            sem.performHealthCheck()
        case <-sem.contextShutdown:
            return
        }
    }
}
```
### Configuration

```go
func DefaultSLURPElectionConfig() *SLURPElectionConfig {
    return &SLURPElectionConfig{
        EnableContextLeadership:  true,
        ContextLeadershipWeight:  0.3,
        RequireContextCapability: true,

        AutoStartGeneration:   true,
        GenerationStartDelay:  5 * time.Second,
        GenerationStopTimeout: 30 * time.Second,

        ContextFailoverTimeout: 60 * time.Second,
        StateTransferTimeout:   30 * time.Second,
        ValidationTimeout:      10 * time.Second,
        RequireStateValidation: true,

        ContextHealthCheckInterval: 30 * time.Second,
        ClusterHealthThreshold:     0.7,
        LeaderHealthThreshold:      0.8,

        MaxQueueTransferSize:  1000,
        QueueDrainTimeout:     60 * time.Second,
        PreserveCompletedJobs: true,

        CoordinationTimeout:    10 * time.Second,
        MaxCoordinationRetries: 3,
        CoordinationBackoff:    2 * time.Second,
    }
}
```
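To run with non-default settings, start from the defaults and override what you need before constructing the manager. A hedged sketch (the `ctx`/`cfg`/`host`/`ps`/`nodeID` wiring is assumed from the base election setup described earlier; the 15s interval is only an example):

```go
// Start from the documented defaults and tighten the health-check cadence.
slurpCfg := election.DefaultSLURPElectionConfig()
slurpCfg.ContextHealthCheckInterval = 15 * time.Second
slurpCfg.AutoStartGeneration = true

sem := election.NewSLURPElectionManager(ctx, cfg, host, ps, nodeID, slurpCfg)
if err := sem.Start(); err != nil {
    log.Fatalf("failed to start SLURP election manager: %v", err)
}
defer sem.Stop()
```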
---

## Quorum and Consensus

Currently, the election system uses **democratic voting** without strict quorum requirements. This section describes the voting mechanism and future quorum considerations.

### Voting Mechanism

#### Vote Casting

Nodes cast votes for candidates during the election period:

```go
voteMsg := ElectionMessage{
    Type:      "election_vote",
    NodeID:    voterNodeID,
    Timestamp: time.Now(),
    Term:      currentTerm,
    Data: map[string]interface{}{
        "candidate": chosenCandidateID,
    },
}
```

#### Vote Tallying

Votes are tallied when election timeout occurs:

```go
func (em *ElectionManager) findElectionWinner() *AdminCandidate {
    if len(em.candidates) == 0 {
        return nil
    }

    // Count votes for each candidate
    voteCounts := make(map[string]int)
    totalVotes := 0

    for _, candidateID := range em.votes {
        if _, exists := em.candidates[candidateID]; exists {
            voteCounts[candidateID]++
            totalVotes++
        }
    }

    // If no votes cast, fall back to highest scoring candidate
    if totalVotes == 0 {
        var winner *AdminCandidate
        highestScore := -1.0

        for _, candidate := range em.candidates {
            if candidate.Score > highestScore {
                highestScore = candidate.Score
                winner = candidate
            }
        }
        return winner
    }

    // Find candidate with most votes (ties broken by score)
    var winner *AdminCandidate
    maxVotes := -1
    highestScore := -1.0

    for candidateID, voteCount := range voteCounts {
        candidate := em.candidates[candidateID]
        if voteCount > maxVotes ||
            (voteCount == maxVotes && candidate.Score > highestScore) {
            maxVotes = voteCount
            highestScore = candidate.Score
            winner = candidate
        }
    }

    return winner
}
```

**Key Points:**
- Majority not required (simple plurality)
- If no votes cast, highest score wins (useful for single-node startup)
- Ties broken by candidate score
- Vote validation ensures voted candidate exists

### Quorum Considerations

The system does **not** currently implement strict quorum requirements. This has implications:

**Advantages:**
- Works in small clusters (1-2 nodes)
- Allows elections during network partitions
- Simple consensus algorithm

**Disadvantages:**
- Risk of split-brain if network partitions occur
- No guarantee majority of cluster agrees on admin
- Potential for competing admins in partition scenarios

**Future Enhancement:** Consider implementing configurable quorum (e.g., "majority of last known cluster size") for production deployments.
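A minimal sketch of such a quorum gate, assuming the manager tracked a last known cluster size; this is not part of the current implementation, purely an illustration of the suggested enhancement:

```go
// Hypothetical quorum check: only accept an election result when a majority
// of the last known cluster size actually voted. Not implemented today.
func hasQuorum(totalVotes, lastKnownClusterSize int) bool {
    if lastKnownClusterSize <= 0 {
        return true // no membership information; fall back to current behaviour
    }
    return totalVotes >= lastKnownClusterSize/2+1
}
```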
### Split-Brain Scenarios

**Scenario:** Network partition creates two isolated groups, each electing a separate admin.

**Detection Methods:**
1. Admin heartbeat conflicts (multiple nodes claiming admin)
2. Cluster membership disagreements
3. Partition healing revealing duplicate admins

**Resolution:**
1. Detect conflicting admin via heartbeat messages
2. Trigger `TriggerSplitBrain` election
3. Re-elect with full cluster participation
4. Higher-scored or higher-term admin typically wins

**Mitigation:**
- Stability windows reduce rapid re-elections
- Heartbeat timeout ensures dead admin detection
- Democratic voting resolves conflicts when partition heals

---
## API Reference

### ElectionManager (Production)

#### Constructor

```go
func NewElectionManager(
    ctx context.Context,
    cfg *config.Config,
    host libp2p.Host,
    ps *pubsub.PubSub,
    nodeID string,
) *ElectionManager
```

Creates a new election manager.

**Parameters:**
- `ctx`: Parent context for lifecycle management
- `cfg`: CHORUS configuration (capabilities, election config)
- `host`: libp2p host for peer communication
- `ps`: PubSub instance for election messages
- `nodeID`: Unique identifier for this node

**Returns:** Configured `ElectionManager`

#### Methods

```go
func (em *ElectionManager) Start() error
```

Starts the election management system. Subscribes to election and heartbeat topics, launches discovery and coordination goroutines.

**Returns:** Error if subscription fails

---

```go
func (em *ElectionManager) Stop()
```

Stops the election manager. Stops heartbeat, cancels context, cleans up timers.

---

```go
func (em *ElectionManager) TriggerElection(trigger ElectionTrigger)
```

Manually triggers an election.

**Parameters:**
- `trigger`: Reason for triggering election (see [Election Triggers](#election-triggers))

**Behavior:**
- Respects stability windows
- Ignores if election already in progress
- Buffers trigger in channel (size 10)

---

```go
func (em *ElectionManager) GetCurrentAdmin() string
```

Returns the current admin node ID.

**Returns:** Node ID string (empty if no admin)

---

```go
func (em *ElectionManager) IsCurrentAdmin() bool
```

Checks if this node is the current admin.

**Returns:** `true` if this node is admin

---

```go
func (em *ElectionManager) GetElectionState() ElectionState
```

Returns current election state.

**Returns:** One of: `StateIdle`, `StateDiscovering`, `StateElecting`, `StateReconstructing`, `StateComplete`

---

```go
func (em *ElectionManager) SetCallbacks(
    onAdminChanged func(oldAdmin, newAdmin string),
    onElectionComplete func(winner string),
)
```

Sets election event callbacks.

**Parameters:**
- `onAdminChanged`: Called when admin changes (includes admin discovery)
- `onElectionComplete`: Called when election completes

---

```go
func (em *ElectionManager) SendAdminHeartbeat() error
```

Sends admin heartbeat (only if this node is admin).

**Returns:** Error if not admin or send fails

---

```go
func (em *ElectionManager) GetHeartbeatStatus() map[string]interface{}
```

Returns current heartbeat status.

**Returns:** Map with keys:
- `running` (bool): Whether heartbeat is active
- `is_admin` (bool): Whether this node is admin
- `last_sent` (time.Time): Last heartbeat time
- `interval` (string): Heartbeat interval (if running)
- `next_heartbeat` (time.Time): Next scheduled heartbeat (if running)
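For example, a health endpoint or debug log can surface this map directly (a small usage sketch):

```go
// Surface heartbeat state for debugging or a health endpoint.
status := em.GetHeartbeatStatus()
log.Printf("heartbeat running=%v isAdmin=%v lastSent=%v",
    status["running"], status["is_admin"], status["last_sent"])
```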
### SLURPElectionManager (Experimental)

#### Constructor

```go
func NewSLURPElectionManager(
    ctx context.Context,
    cfg *config.Config,
    host libp2p.Host,
    ps *pubsub.PubSub,
    nodeID string,
    slurpConfig *SLURPElectionConfig,
) *SLURPElectionManager
```

Creates a new SLURP-enhanced election manager.

**Parameters:**
- (Same as `NewElectionManager`)
- `slurpConfig`: SLURP-specific configuration (nil for defaults)

**Returns:** Configured `SLURPElectionManager`

#### Methods

**All ElectionManager methods plus:**

```go
func (sem *SLURPElectionManager) RegisterContextManager(manager ContextManager) error
```

Registers a context manager for leader duties.

**Parameters:**
- `manager`: Context manager implementing `ContextManager` interface

**Returns:** Error if manager already registered

**Behavior:** If this node is already admin and auto-start enabled, starts context generation
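Typical wiring registers the context manager right after constructing the SLURP manager; `myContextManager` below stands in for whatever `ContextManager` implementation the application provides:

```go
// Register the application's ContextManager implementation. If this node is
// already admin and AutoStartGeneration is enabled, generation starts now.
if err := sem.RegisterContextManager(myContextManager); err != nil {
    log.Printf("context manager already registered: %v", err)
}

if sem.IsContextLeader() {
    log.Println("this node is the context generation leader")
}
```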
---

```go
func (sem *SLURPElectionManager) IsContextLeader() bool
```

Checks if this node is the current context generation leader.

**Returns:** `true` if context leader and admin

---

```go
func (sem *SLURPElectionManager) GetContextManager() (ContextManager, error)
```

Returns the registered context manager (only if leader).

**Returns:**
- `ContextManager` if leader
- Error if not leader or no manager registered

---

```go
func (sem *SLURPElectionManager) StartContextGeneration(ctx context.Context) error
```

Begins context generation operations (leader only).

**Returns:** Error if not admin, already started, or no manager registered

**Behavior:**
- Marks node as context leader
- Increments context term
- Starts health monitoring and metrics collection
- Triggers callbacks
- Broadcasts context generation start

---

```go
func (sem *SLURPElectionManager) StopContextGeneration(ctx context.Context) error
```

Stops context generation operations.

**Returns:** Error if issues during shutdown (logged, not fatal)

**Behavior:**
- Signals background processes to stop
- Waits for clean shutdown (with timeout)
- Triggers callbacks
- Broadcasts context generation stop

---

```go
func (sem *SLURPElectionManager) TransferContextLeadership(
    ctx context.Context,
    targetNodeID string,
) error
```

Initiates graceful context leadership transfer.

**Parameters:**
- `ctx`: Context for transfer operations
- `targetNodeID`: Target node to receive leadership

**Returns:** Error if not leader, transfer in progress, or preparation fails

**Behavior:**
- Prepares failover state
- Broadcasts transfer message
- Stops context generation
- Triggers new election
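A short usage sketch for a planned handover (the target peer ID is illustrative):

```go
// Planned handover of context leadership to another node; only valid while
// this node is the context leader.
if sem.IsContextLeader() {
    if err := sem.TransferContextLeadership(ctx, "QmTargetPeer..."); err != nil {
        log.Printf("leadership transfer failed: %v", err)
    }
}
```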
---

```go
func (sem *SLURPElectionManager) GetContextLeaderInfo() (*LeaderInfo, error)
```

Returns information about the current context leader.

**Returns:**
- `LeaderInfo` with leader details
- Error if no current leader

---

```go
func (sem *SLURPElectionManager) GetContextGenerationStatus() (*GenerationStatus, error)
```

Returns status of context operations.

**Returns:**
- `GenerationStatus` with current state
- Error if retrieval fails

---

```go
func (sem *SLURPElectionManager) SetContextLeadershipCallbacks(
    callbacks *ContextLeadershipCallbacks,
) error
```

Sets callbacks for context leadership changes.

**Parameters:**
- `callbacks`: Struct with context leadership event callbacks

**Returns:** Always `nil` (error reserved for future validation)

---

```go
func (sem *SLURPElectionManager) GetContextClusterHealth() (*ContextClusterHealth, error)
```

Returns health of the context generation cluster.

**Returns:** `ContextClusterHealth` with cluster health metrics

---

```go
func (sem *SLURPElectionManager) PrepareContextFailover(
    ctx context.Context,
) (*ContextFailoverState, error)
```

Prepares context state for leadership failover.

**Returns:**
- `ContextFailoverState` with preserved state
- Error if not context leader or preparation fails

**Behavior:**
- Collects queued requests, active jobs, configuration
- Captures health snapshot
- Calculates checksum for validation

---

```go
func (sem *SLURPElectionManager) ExecuteContextFailover(
    ctx context.Context,
    state *ContextFailoverState,
) error
```

Executes context leadership failover from provided state.

**Parameters:**
- `ctx`: Context for failover operations
- `state`: Failover state from previous leader

**Returns:** Error if already leader, validation fails, or restoration fails

**Behavior:**
- Validates failover state
- Restores context leadership
- Applies configuration and state
- Starts background processes

---

```go
func (sem *SLURPElectionManager) ValidateContextState(
    state *ContextFailoverState,
) (*ContextStateValidation, error)
```

Validates context failover state before accepting.

**Parameters:**
- `state`: Failover state to validate

**Returns:**
- `ContextStateValidation` with validation results
- Error only if validation process itself fails (rare)

**Validation Checks:**
- Basic field presence (LeaderID, Term, StateVersion)
- Checksum validation (MD5)
- Timestamp validity
- Queue state validity
- Cluster state validity
- Configuration validity

---
## Configuration
|
||
|
||
### Election Configuration Structure
|
||
|
||
```go
|
||
type ElectionConfig struct {
|
||
DiscoveryTimeout time.Duration // Admin discovery loop interval
|
||
DiscoveryBackoff time.Duration // Backoff after failed discovery
|
||
ElectionTimeout time.Duration // Election voting period duration
|
||
HeartbeatTimeout time.Duration // Max time without heartbeat before election
|
||
}
|
||
```
|
||
|
||
### Configuration Sources
|
||
|
||
#### 1. Config File (config.toml)
|
||
|
||
```toml
|
||
[security.election_config]
|
||
discovery_timeout = "10s"
|
||
discovery_backoff = "5s"
|
||
election_timeout = "30s"
|
||
heartbeat_timeout = "15s"
|
||
```
|
||
|
||
#### 2. Environment Variables
|
||
|
||
```bash
|
||
# Stability windows
|
||
export CHORUS_ELECTION_MIN_TERM="30s" # Min time between elections
|
||
export CHORUS_LEADER_MIN_TERM="45s" # Min time before challenging healthy leader
|
||
```
|
||
|
||
#### 3. Default Values (Fallback)
|
||
|
||
```go
|
||
// In config package
|
||
ElectionConfig: ElectionConfig{
|
||
DiscoveryTimeout: 10 * time.Second,
|
||
DiscoveryBackoff: 5 * time.Second,
|
||
ElectionTimeout: 30 * time.Second,
|
||
HeartbeatTimeout: 15 * time.Second,
|
||
}
|
||
```
|
||
|
||
### SLURP Configuration

```go
type SLURPElectionConfig struct {
    // Context leadership configuration
    EnableContextLeadership  bool    // Enable context leadership
    ContextLeadershipWeight  float64 // Weight for context leadership scoring
    RequireContextCapability bool    // Require context capability for leadership

    // Context generation configuration
    AutoStartGeneration   bool          // Auto-start generation on leadership
    GenerationStartDelay  time.Duration // Delay before starting generation
    GenerationStopTimeout time.Duration // Timeout for stopping generation

    // Failover configuration
    ContextFailoverTimeout time.Duration // Context failover timeout
    StateTransferTimeout   time.Duration // State transfer timeout
    ValidationTimeout      time.Duration // State validation timeout
    RequireStateValidation bool          // Require state validation

    // Health monitoring configuration
    ContextHealthCheckInterval time.Duration // Context health check interval
    ClusterHealthThreshold     float64       // Minimum cluster health for operations
    LeaderHealthThreshold      float64       // Minimum leader health

    // Queue management configuration
    MaxQueueTransferSize  int           // Max requests to transfer
    QueueDrainTimeout     time.Duration // Timeout for draining queue
    PreserveCompletedJobs bool          // Preserve completed jobs on transfer

    // Coordination configuration
    CoordinationTimeout    time.Duration // Coordination operation timeout
    MaxCoordinationRetries int           // Max coordination retries
    CoordinationBackoff    time.Duration // Backoff between coordination retries
}
```

**Defaults:** See `DefaultSLURPElectionConfig()` in [SLURP Integration](#slurp-integration)
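In practice a caller would typically start from the package defaults and override individual fields. The override values below are illustrative only, not recommendations:

```go
// Start from the package defaults and tighten failover behaviour for a small cluster.
cfg := election.DefaultSLURPElectionConfig()
cfg.RequireContextCapability = true
cfg.ContextFailoverTimeout = 20 * time.Second
cfg.ClusterHealthThreshold = 0.6
```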
---

## Message Formats

### PubSub Topics

```
CHORUS/election/v1           # Election messages (candidates, votes, winners)
CHORUS/admin/heartbeat/v1    # Admin heartbeat messages
```

### ElectionMessage Structure

```go
type ElectionMessage struct {
    Type      string      `json:"type"`           // Message type
    NodeID    string      `json:"node_id"`        // Sender node ID
    Timestamp time.Time   `json:"timestamp"`      // Message timestamp
    Term      int         `json:"term"`           // Election term
    Data      interface{} `json:"data,omitempty"` // Type-specific data
}
```
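A minimal sketch of building and serializing one of these envelopes. The `publish` call and the surrounding `nodeID`/`ctx` variables are placeholders; the actual publishing API lives in the `pubsub` package.

```go
// Build a discovery request envelope and serialize it for the election topic.
msg := ElectionMessage{
    Type:      "admin_discovery_request",
    NodeID:    nodeID,
    Timestamp: time.Now(),
    Term:      0,
}

payload, err := json.Marshal(msg)
if err != nil {
    return fmt.Errorf("failed to marshal election message: %w", err)
}

// Placeholder: publish to the election topic via the pubsub layer.
if err := publish(ctx, "CHORUS/election/v1", payload); err != nil {
    return fmt.Errorf("failed to publish election message: %w", err)
}
```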
### Message Types

#### 1. Admin Discovery Request

**Type:** `admin_discovery_request`

**Purpose:** Node searching for existing admin

**Data:** `nil`

**Example:**
```json
{
  "type": "admin_discovery_request",
  "node_id": "QmXxx...abc",
  "timestamp": "2025-09-30T18:15:30.123Z",
  "term": 0
}
```

#### 2. Admin Discovery Response

**Type:** `admin_discovery_response`

**Purpose:** Node informing requester of known admin

**Data:**
```json
{
  "current_admin": "QmYyy...def"
}
```

**Example:**
```json
{
  "type": "admin_discovery_response",
  "node_id": "QmYyy...def",
  "timestamp": "2025-09-30T18:15:30.456Z",
  "term": 0,
  "data": {
    "current_admin": "QmYyy...def"
  }
}
```

#### 3. Election Started

**Type:** `election_started`

**Purpose:** Node announcing start of new election

**Data:**
```json
{
  "trigger": "admin_heartbeat_timeout"
}
```

**Example:**
```json
{
  "type": "election_started",
  "node_id": "QmXxx...abc",
  "timestamp": "2025-09-30T18:15:45.123Z",
  "term": 5,
  "data": {
    "trigger": "admin_heartbeat_timeout"
  }
}
```
#### 4. Candidacy Announcement

**Type:** `candidacy_announcement`

**Purpose:** Node announcing candidacy in election

**Data:** `AdminCandidate` structure

**Example:**
```json
{
  "type": "candidacy_announcement",
  "node_id": "QmXxx...abc",
  "timestamp": "2025-09-30T18:15:46.123Z",
  "term": 5,
  "data": {
    "node_id": "QmXxx...abc",
    "peer_id": "QmXxx...abc",
    "capabilities": ["admin_election", "context_curation"],
    "uptime": "86400000000000",
    "resources": {
      "cpu_usage": 0.35,
      "memory_usage": 0.52,
      "disk_usage": 0.41,
      "network_quality": 0.95
    },
    "experience": "604800000000000",
    "score": 0.78
  }
}
```

#### 5. Election Vote

**Type:** `election_vote`

**Purpose:** Node casting vote for candidate

**Data:**
```json
{
  "candidate": "QmYyy...def"
}
```

**Example:**
```json
{
  "type": "election_vote",
  "node_id": "QmZzz...ghi",
  "timestamp": "2025-09-30T18:15:50.123Z",
  "term": 5,
  "data": {
    "candidate": "QmYyy...def"
  }
}
```

#### 6. Election Winner

**Type:** `election_winner`

**Purpose:** Announcing election winner

**Data:** `AdminCandidate` structure (winner)

**Example:**
```json
{
  "type": "election_winner",
  "node_id": "QmXxx...abc",
  "timestamp": "2025-09-30T18:16:15.123Z",
  "term": 5,
  "data": {
    "node_id": "QmYyy...def",
    "peer_id": "QmYyy...def",
    "capabilities": ["admin_election", "context_curation", "project_manager"],
    "uptime": "172800000000000",
    "resources": {
      "cpu_usage": 0.25,
      "memory_usage": 0.45,
      "disk_usage": 0.38,
      "network_quality": 0.98
    },
    "experience": "1209600000000000",
    "score": 0.85
  }
}
```
#### 7. Context Leadership Transfer (SLURP)

**Type:** `context_leadership_transfer`

**Purpose:** Graceful transfer of context leadership

**Data:**
```json
{
  "target_node": "QmNewLeader...xyz",
  "failover_state": { /* ContextFailoverState */ },
  "reason": "manual_transfer"
}
```

#### 8. Context Generation Started (SLURP)

**Type:** `context_generation_started`

**Purpose:** Node announcing start of context generation

**Data:**
```json
{
  "leader_id": "QmLeader...abc"
}
```

#### 9. Context Generation Stopped (SLURP)

**Type:** `context_generation_stopped`

**Purpose:** Node announcing stop of context generation

**Data:**
```json
{
  "reason": "leadership_lost"
}
```

### Admin Heartbeat Message

**Topic:** `CHORUS/admin/heartbeat/v1`

**Format:**
```json
{
  "node_id": "QmAdmin...abc",
  "timestamp": "2025-09-30T18:15:30.123456789Z"
}
```

**Frequency:** Every `HeartbeatTimeout / 2` (default: ~7.5s)

**Purpose:** Prove admin liveness, prevent unnecessary elections
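A minimal sketch of the heartbeat-sender pattern this implies: a ticker at half the heartbeat timeout publishing the JSON payload above until leadership is lost. The `isAdmin` and `publish` helpers and the `cfg`/`nodeID` variables are assumptions for illustration.

```go
// Send heartbeats at HeartbeatTimeout/2 while this node remains admin.
ticker := time.NewTicker(cfg.HeartbeatTimeout / 2)
defer ticker.Stop()

for {
    select {
    case <-ctx.Done():
        return
    case <-ticker.C:
        if !isAdmin() {
            return // Stop heartbeating once leadership is lost.
        }
        hb := map[string]interface{}{
            "node_id":   nodeID,
            "timestamp": time.Now().UTC(),
        }
        payload, err := json.Marshal(hb)
        if err != nil {
            log.Printf("heartbeat marshal failed: %v", err)
            continue
        }
        if err := publish(ctx, "CHORUS/admin/heartbeat/v1", payload); err != nil {
            log.Printf("heartbeat publish failed: %v", err)
        }
    }
}
```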
---

## State Machine

### Election States

```go
type ElectionState string

const (
    StateIdle           ElectionState = "idle"
    StateDiscovering    ElectionState = "discovering"
    StateElecting       ElectionState = "electing"
    StateReconstructing ElectionState = "reconstructing_keys"
    StateComplete       ElectionState = "complete"
)
```

### State Transitions

```
                          START
                            │
                            ▼
              ┌────────────────────────────┐
              │         StateIdle          │
              │  - Monitoring heartbeats   │
              │  - Running discovery loop  │
              │  - Waiting for triggers    │
              └───────┬────────────┬───────┘
          Discovery   │            │   Election
          request     │            │   trigger
                      ▼            ▼
┌────────────────────────────┐  ┌─────────────────────────────┐
│      StateDiscovering      │  │        StateElecting        │
│ - Broadcasting discovery   │  │ - Collecting candidates     │
│ - Waiting for responses    │  │ - Collecting votes          │
└─────────────┬──────────────┘  │ - Election timeout running  │
              │ Admin found     └──────────────┬──────────────┘
              ▼                                │ Timeout reached
┌────────────────────────────┐                 ▼
│ Update currentAdmin        │  ┌─────────────────────────────┐
│ Trigger OnAdminChanged     │  │        StateComplete        │
│ Return to StateIdle        │  │ - Tallying votes            │
└────────────────────────────┘  │ - Determining winner        │
                                │ - Broadcasting winner       │
                                └──────────────┬──────────────┘
                                               │ Winner announced
                                               ▼
                                ┌─────────────────────────────┐
                                │ Update currentAdmin         │
                                │ Start/Stop heartbeat        │
                                │ Trigger callbacks           │
                                │ Return to StateIdle         │
                                └─────────────────────────────┘
```
### State Descriptions

#### StateIdle

**Description:** Normal operation state. Node is monitoring for admin heartbeats and ready to participate in elections.

**Activities:**
- Running discovery loop (periodic admin checks)
- Monitoring heartbeat timeout
- Listening for election messages
- Ready to trigger election

**Transitions:**
- → `StateDiscovering`: Discovery request sent
- → `StateElecting`: Election triggered

#### StateDiscovering

**Description:** Node is actively searching for existing admin.

**Activities:**
- Broadcasting discovery requests
- Waiting for discovery responses
- Timeout-based fallback to election

**Transitions:**
- → `StateIdle`: Admin discovered
- → `StateElecting`: No admin discovered (after timeout)

**Note:** Current implementation doesn't explicitly use this state; discovery is integrated into the idle loop.

#### StateElecting

**Description:** Election in progress. Node is collecting candidates and votes.

**Activities:**
- Announcing candidacy (if eligible)
- Listening for candidate announcements
- Casting votes
- Collecting votes
- Waiting for election timeout

**Transitions:**
- → `StateComplete`: Election timeout reached

**Duration:** `ElectionTimeout` (default: 30s)

#### StateComplete

**Description:** Election complete, determining winner.

**Activities:**
- Tallying votes
- Determining winner (most votes or highest score; see the sketch below)
- Broadcasting winner
- Updating currentAdmin
- Managing heartbeat lifecycle
- Triggering callbacks

**Transitions:**
- → `StateIdle`: Winner announced, system returns to normal

**Duration:** Momentary (immediate transition to `StateIdle`)
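The winner-selection rule described above (most votes first, candidate score as the tie-breaker and as the fallback when no votes are cast) can be sketched as follows. The map shapes are assumptions for illustration, not the package's internal types.

```go
// pickWinner selects the candidate with the highest vote count; candidate score
// breaks ties and decides the winner when no votes were cast at all.
func pickWinner(candidates map[string]*AdminCandidate, votes map[string]string) *AdminCandidate {
    tally := make(map[string]int)
    for _, candidateID := range votes {
        tally[candidateID]++
    }

    var winner *AdminCandidate
    bestVotes := -1
    for id, cand := range candidates {
        v := tally[id]
        if v > bestVotes || (v == bestVotes && winner != nil && cand.Score > winner.Score) {
            winner = cand
            bestVotes = v
        }
    }
    return winner
}
```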
#### StateReconstructing

**Description:** Reserved for future key reconstruction operations.

**Status:** Not currently used in production code.

**Purpose:** Placeholder for post-election key reconstruction when Shamir Secret Sharing is integrated.

---

## Callbacks and Events

### Callback Types

#### 1. OnAdminChanged

**Signature:**
```go
func(oldAdmin, newAdmin string)
```

**When Called:**
- Admin discovered via discovery response
- Admin elected via election completion
- Admin changed due to re-election

**Purpose:** Notify application of admin leadership changes

**Example:**
```go
em.SetCallbacks(
    func(oldAdmin, newAdmin string) {
        if oldAdmin == "" {
            log.Printf("✅ Admin discovered: %s", newAdmin)
        } else {
            log.Printf("🔄 Admin changed: %s → %s", oldAdmin, newAdmin)
        }

        // Update application state
        app.SetCoordinator(newAdmin)
    },
    nil,
)
```

#### 2. OnElectionComplete

**Signature:**
```go
func(winner string)
```

**When Called:**
- Election completes and winner is determined

**Purpose:** Notify application of election completion

**Example:**
```go
em.SetCallbacks(
    nil,
    func(winner string) {
        log.Printf("🏆 Election complete, winner: %s", winner)

        // Record election in metrics
        metrics.RecordElection(winner)
    },
)
```
### SLURP Context Leadership Callbacks

```go
type ContextLeadershipCallbacks struct {
    // Called when this node becomes context leader
    OnBecomeContextLeader func(ctx context.Context, term int64) error

    // Called when this node loses context leadership
    OnLoseContextLeadership func(ctx context.Context, newLeader string) error

    // Called when any leadership change occurs
    OnContextLeaderChanged func(oldLeader, newLeader string, term int64)

    // Called when context generation starts
    OnContextGenerationStarted func(leaderID string)

    // Called when context generation stops
    OnContextGenerationStopped func(leaderID string, reason string)

    // Called when context leadership failover occurs
    OnContextFailover func(oldLeader, newLeader string, duration time.Duration)

    // Called when context-related errors occur
    OnContextError func(err error, severity ErrorSeverity)
}
```

**Example:**
```go
sem.SetContextLeadershipCallbacks(&election.ContextLeadershipCallbacks{
    OnBecomeContextLeader: func(ctx context.Context, term int64) error {
        log.Printf("🚀 Became context leader (term %d)", term)
        return app.InitializeContextGeneration()
    },

    OnLoseContextLeadership: func(ctx context.Context, newLeader string) error {
        log.Printf("🔄 Lost context leadership to %s", newLeader)
        return app.ShutdownContextGeneration()
    },

    OnContextError: func(err error, severity election.ErrorSeverity) {
        log.Printf("⚠️ Context error [%s]: %v", severity, err)
        if severity == election.ErrorSeverityCritical {
            app.TriggerFailover()
        }
    },
})
```

### Callback Threading

**Important:** Callbacks are invoked from election manager goroutines. Consider:

1. **Non-Blocking:** Callbacks should be fast or spawn goroutines for slow operations
2. **Error Handling:** Errors in callbacks are logged but don't prevent election operations
3. **Synchronization:** Use proper locking if callbacks modify shared state
4. **Idempotency:** Callbacks may be invoked multiple times for the same event (rare but possible)

**Good Practice:**
```go
em.SetCallbacks(
    func(oldAdmin, newAdmin string) {
        // Fast: Update local state
        app.mu.Lock()
        app.currentAdmin = newAdmin
        app.mu.Unlock()

        // Slow: Spawn goroutine for heavy work
        go app.NotifyAdminChange(oldAdmin, newAdmin)
    },
    nil,
)
```
---

## Testing

### Test Structure

The package includes comprehensive unit tests in `election_test.go`.

### Running Tests

```bash
# Run all election tests
cd /home/tony/chorus/project-queues/active/CHORUS
go test ./pkg/election

# Run with verbose output
go test -v ./pkg/election

# Run specific test
go test -v ./pkg/election -run TestElectionManagerCanBeAdmin

# Run with race detection
go test -race ./pkg/election
```

### Test Utilities

#### newTestElectionManager

```go
func newTestElectionManager(t *testing.T) *ElectionManager
```

Creates a fully-wired test election manager with:
- Real libp2p host (localhost)
- Real PubSub instance
- Test configuration
- Automatic cleanup

**Example:**
```go
func TestMyFeature(t *testing.T) {
    em := newTestElectionManager(t)

    // Test uses real message passing
    em.Start()

    // ... test code ...

    // Cleanup automatic via t.Cleanup()
}
```

### Test Coverage

#### 1. TestNewElectionManagerInitialState

Verifies initial state after construction:
- State is `StateIdle`
- Term is `0`
- Node ID is populated

#### 2. TestElectionManagerCanBeAdmin

Tests eligibility checking:
- Node with admin capabilities can be admin
- Node without admin capabilities cannot be admin

#### 3. TestFindElectionWinnerPrefersVotesThenScore

Tests winner determination logic:
- Most votes wins
- Score breaks ties
- Fallback to highest score if no votes

#### 4. TestHandleElectionMessageAddsCandidate

Tests candidacy announcement handling:
- Candidate added to candidates map
- Candidate data correctly deserialized

#### 5. TestSendAdminHeartbeatRequiresLeadership

Tests heartbeat authorization:
- Non-admin cannot send heartbeat
- Admin can send heartbeat

### Integration Testing

For integration testing with multiple nodes:

```go
func TestMultiNodeElection(t *testing.T) {
    // Create 3 test nodes
    nodes := make([]*ElectionManager, 3)
    for i := 0; i < 3; i++ {
        nodes[i] = newTestElectionManager(t)
        nodes[i].Start()
    }

    // Connect nodes (libp2p peer connection)
    // ...

    // Trigger election
    nodes[0].TriggerElection(TriggerManual)

    // Wait for election to complete
    time.Sleep(35 * time.Second)

    // Verify all nodes agree on admin
    admin := nodes[0].GetCurrentAdmin()
    for i, node := range nodes {
        if node.GetCurrentAdmin() != admin {
            t.Errorf("Node %d disagrees on admin", i)
        }
    }
}
```

**Note:** Multi-node tests require proper libp2p peer discovery and connection setup.

---
## Production Considerations

### Deployment Checklist

#### Configuration

- [ ] Set appropriate `HeartbeatTimeout` (default 15s)
- [ ] Set appropriate `ElectionTimeout` (default 30s)
- [ ] Configure stability windows via environment variables
- [ ] Ensure nodes have correct capabilities in config
- [ ] Configure discovery and backoff timeouts

#### Monitoring

- [ ] Monitor election frequency (should be rare)
- [ ] Monitor heartbeat status on admin node
- [ ] Alert on frequent admin changes (possible network issues)
- [ ] Track election duration and participation
- [ ] Monitor candidate scores and voting patterns

#### Network

- [ ] Ensure PubSub connectivity between all nodes
- [ ] Configure appropriate gossipsub parameters
- [ ] Test behavior during network partitions
- [ ] Verify heartbeat messages reach all nodes
- [ ] Monitor libp2p connection stability

#### Capabilities

- [ ] Ensure at least one node has admin capabilities
- [ ] Balance capabilities across cluster (redundancy)
- [ ] Test elections with different capability distributions
- [ ] Verify scoring weights match organizational priorities

#### Resource Metrics

- [ ] Implement actual resource metric collection (currently simulated; see the sketch below)
- [ ] Calibrate resource scoring weights
- [ ] Test behavior under high load
- [ ] Verify low-resource nodes don't become admin
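One possible way to replace the simulated metrics, shown as a sketch using the third-party `gopsutil` library (not currently a dependency of this package). The returned values are normalized to the 0.0-1.0 range used in the candidacy message; network quality is intentionally left out.

```go
import (
    "time"

    "github.com/shirou/gopsutil/v3/cpu"
    "github.com/shirou/gopsutil/v3/disk"
    "github.com/shirou/gopsutil/v3/mem"
)

// collectResourceMetrics samples real host metrics for candidate scoring.
func collectResourceMetrics() (cpuUsage, memUsage, diskUsage float64, err error) {
    cpuPercents, err := cpu.Percent(time.Second, false)
    if err != nil || len(cpuPercents) == 0 {
        return 0, 0, 0, err
    }
    vm, err := mem.VirtualMemory()
    if err != nil {
        return 0, 0, 0, err
    }
    du, err := disk.Usage("/")
    if err != nil {
        return 0, 0, 0, err
    }
    return cpuPercents[0] / 100, vm.UsedPercent / 100, du.UsedPercent / 100, nil
}
```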
### Performance Characteristics

#### Latency

- **Discovery Response:** < 1s (network RTT + processing)
- **Election Duration:** `ElectionTimeout` + processing (~30-35s)
- **Heartbeat Latency:** < 1s (network RTT)
- **Admin Failover:** `HeartbeatTimeout` + `ElectionTimeout` (~45s)

#### Scalability

- **Tested:** 1-10 nodes
- **Expected:** 10-100 nodes (limited by gossipsub performance)
- **Bottleneck:** PubSub message fanout, JSON serialization overhead

#### Message Load

Per election cycle:
- Discovery: 1 request + N responses
- Election: 1 start + N candidacies + N votes + 1 winner = ~2N+2 messages (roughly 22 messages for a 10-node cluster)
- Heartbeat: 1 message every ~7.5s from admin
### Common Issues and Solutions

#### Issue: Rapid Election Churn

**Symptoms:** Elections occurring frequently, admin changing constantly

**Causes:**
- Network instability
- Insufficient stability windows
- Admin node resource exhaustion

**Solutions:**
1. Increase stability windows:
   ```bash
   export CHORUS_ELECTION_MIN_TERM="60s"
   export CHORUS_LEADER_MIN_TERM="90s"
   ```
2. Investigate network connectivity
3. Check admin node resources
4. Review scoring weights (prefer stable nodes)

#### Issue: Split-Brain (Multiple Admins)

**Symptoms:** Different nodes report different admins

**Causes:**
- Network partition
- PubSub message loss
- No quorum enforcement

**Solutions:**
1. Trigger manual election to force re-sync:
   ```go
   em.TriggerElection(TriggerManual)
   ```
2. Verify network connectivity
3. Consider implementing quorum (future enhancement)

#### Issue: No Admin Elected

**Symptoms:** All nodes report empty admin

**Causes:**
- No nodes have admin capabilities
- Election timeout too short
- PubSub not properly connected

**Solutions:**
1. Verify at least one node has capabilities:
   ```toml
   capabilities = ["admin_election", "context_curation"]
   ```
2. Increase `ElectionTimeout`
3. Check PubSub subscription status
4. Verify nodes are connected in libp2p mesh

#### Issue: Admin Heartbeat Not Received

**Symptoms:** Frequent heartbeat timeout elections despite admin running

**Causes:**
- PubSub message loss
- Heartbeat goroutine stopped
- Clock skew

**Solutions:**
1. Check heartbeat status:
   ```go
   status := em.GetHeartbeatStatus()
   log.Printf("Heartbeat status: %+v", status)
   ```
2. Verify PubSub connectivity
3. Check admin node logs for heartbeat errors
4. Ensure NTP synchronization across cluster
### Security Considerations

#### Authentication

**Current State:** Election messages are not authenticated beyond libp2p peer IDs.

**Risk:** Malicious node could announce false election results.

**Mitigation:**
- Rely on libp2p transport security
- Future: Sign election messages with node private keys
- Future: Verify candidate claims against cluster membership

#### Authorization

**Current State:** Any node with admin capabilities can participate in elections.

**Risk:** Compromised node could win election and become admin.

**Mitigation:**
- Carefully control which nodes have admin capabilities
- Monitor election outcomes for suspicious patterns
- Future: Implement capability attestation
- Future: Add reputation scoring

#### Split-Brain Attacks

**Current State:** No strict quorum, partitions can elect separate admins.

**Risk:** Adversary could isolate admin and force minority election.

**Mitigation:**
- Use stability windows to prevent rapid changes
- Monitor for conflicting admin announcements
- Future: Implement configurable quorum requirements
- Future: Add partition detection and recovery

#### Message Spoofing

**Current State:** PubSub messages are authenticated by libp2p but content is not signed.

**Risk:** Man-in-the-middle could modify election messages.

**Mitigation:**
- Use libp2p transport security (TLS)
- Future: Add message signing with node keys
- Future: Implement message sequence numbers

### SLURP Production Readiness

**Status:** Experimental - Not recommended for production

**Incomplete Features:**
- Context manager integration (TODOs present)
- State recovery mechanisms
- Production metrics collection
- Comprehensive failover testing

**Production Use:** Wait for:
1. Context manager interface stabilization
2. Complete state recovery implementation
3. Production metrics and monitoring
4. Multi-node failover testing
5. Documentation of recovery procedures
---

## Summary

The CHORUS election package provides democratic leader election for distributed agent clusters. Key highlights:

### Production Features (Ready)

✅ **Democratic Voting:** Uptime and capability-based candidate scoring
✅ **Heartbeat Monitoring:** ~7.5s interval, 15s timeout for liveness detection
✅ **Automatic Failover:** Elections triggered on heartbeat timeout, split-brain detection, or manual request
✅ **Stability Windows:** Prevents election churn during network instability
✅ **Clean Transitions:** Callback system for graceful leadership handoffs
✅ **Well-Tested:** Comprehensive unit tests with real libp2p integration

### Experimental Features (Not Production-Ready)

⚠️ **SLURP Integration:** Context leadership with advanced AI scoring
⚠️ **Failover State:** Graceful transfer with state preservation
⚠️ **Health Monitoring:** Cluster health tracking framework

### Key Metrics

- **Discovery Cycle:** 10s (configurable)
- **Heartbeat Interval:** ~7.5s (HeartbeatTimeout / 2)
- **Heartbeat Timeout:** 15s (triggers election)
- **Election Duration:** 30s (voting period)
- **Failover Time:** ~45s (timeout + election)

### Recommended Configuration

```toml
[security.election_config]
discovery_timeout = "10s"
election_timeout = "30s"
heartbeat_timeout = "15s"
```

```bash
export CHORUS_ELECTION_MIN_TERM="30s"
export CHORUS_LEADER_MIN_TERM="45s"
```

### Next Steps for Production

1. **Implement Resource Metrics:** Replace simulated metrics with actual system monitoring
2. **Add Quorum Support:** Implement configurable quorum for split-brain prevention
3. **Complete SLURP Integration:** Finish context manager integration and state recovery
4. **Enhanced Security:** Add message signing and capability attestation
5. **Comprehensive Testing:** Multi-node integration tests with partition scenarios

---

**Documentation Version:** 1.0
**Last Updated:** 2025-09-30
**Package Version:** Based on commit at documentation time
**Maintainer:** CHORUS Development Team