
anthonyrawlins c5b7311a8b docs: Add Phase 3 coordination and infrastructure documentation

Comprehensive documentation for coordination, messaging, discovery, and internal systems.

Core Coordination Packages:
- pkg/election - Democratic leader election (uptime-based, heartbeat mechanism, SLURP integration)
- pkg/coordination - Meta-coordination with dependency detection (4 built-in rules)
- coordinator/ - Task orchestration and assignment (AI-powered scoring)
- discovery/ - mDNS peer discovery (automatic LAN detection)

Messaging & P2P Infrastructure:
- pubsub/ - GossipSub messaging (31 message types, role-based topics, HMMM integration)
- p2p/ - libp2p networking (DHT modes, connection management, security)

Monitoring & Health:
- pkg/metrics - Prometheus metrics (80+ metrics across 12 categories)
- pkg/health - Health monitoring (4 HTTP endpoints, enhanced checks, graceful degradation)

Internal Systems:
- internal/licensing - License validation (KACHING integration, cluster leases, fail-closed)
- internal/hapui - Human Agent Portal UI (9 commands, HMMM wizard, UCXL browser, decision voting)
- internal/backbeat - P2P operation telemetry (6 phases, beat synchronization, health reporting)

Documentation Statistics (Phase 3):
- 10 packages documented (~18,000 lines)
- 31 PubSub message types cataloged
- 80+ Prometheus metrics documented
- Complete API references with examples
- Integration patterns and best practices

Key Features Documented:
- Election: 5 triggers, candidate scoring (5 weighted components), stability windows
- Coordination: AI-powered dependency detection, cross-repo sessions, escalation handling
- PubSub: Topic patterns, message envelopes, SHHH redaction, Hypercore logging
- Metrics: All metric types with labels, Prometheus scrape config, alert rules
- Health: Liveness vs readiness, critical checks, Kubernetes integration
- Licensing: Grace periods, circuit breaker, cluster lease management
- HAP UI: Interactive terminal commands, HMMM composition wizard, web interface (beta)
- BACKBEAT: 6-phase operation tracking, beat budget estimation, drift detection

Implementation Status Marked:
- ✅ Production: Election, metrics, health, licensing, pubsub, p2p, discovery, coordinator
- 🔶 Beta: HAP web interface, BACKBEAT telemetry, advanced coordination
- 🔷 Alpha: SLURP election scoring
- ⚠️ Experimental: Meta-coordination, AI-powered dependency detection

Progress: 22/62 files complete (35%)

Next Phase: AI providers, SLURP system, API layer, reasoning engine

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-09-30 18:27:39 +10:00


Package: coordinator

Location: /home/tony/chorus/project-queues/active/CHORUS/coordinator/

Overview

The coordinator package provides TaskCoordinator, the main orchestrator for distributed task management in CHORUS. It handles task discovery, capability-based assignment, execution coordination, and real-time progress tracking across multiple repositories and agents. The coordinator integrates with the PubSub system for role-based collaboration and delegates autonomous task completion to AI-powered execution engines.

Core Components

TaskCoordinator

The central orchestrator managing task lifecycle across the distributed CHORUS network.

type TaskCoordinator struct {
    pubsub     *pubsub.PubSub
    hlog       *logging.HypercoreLog
    ctx        context.Context
    config     *config.Config
    hmmmRouter *hmmm.Router

    // Repository management
    providers    map[int]repository.TaskProvider // projectID -> provider
    providerLock sync.RWMutex
    factory      repository.ProviderFactory

    // Task management
    activeTasks map[string]*ActiveTask // taskKey -> active task
    taskLock    sync.RWMutex
    taskMatcher repository.TaskMatcher
    taskTracker TaskProgressTracker

    // Task execution
    executionEngine execution.TaskExecutionEngine

    // Agent tracking
    nodeID    string
    agentInfo *repository.AgentInfo

    // Sync settings
    syncInterval time.Duration
    lastSync     map[int]time.Time
    syncLock     sync.RWMutex
}

Key Responsibilities:

Discover available tasks across multiple repositories
Score and assign tasks based on agent capabilities and expertise
Coordinate task execution with AI-powered execution engines
Track active tasks and broadcast progress updates
Request and coordinate multi-agent collaboration
Integrate with HMMM for meta-discussion and coordination

ActiveTask

Represents a task currently being worked on by an agent.

type ActiveTask struct {
    Task      *repository.Task
    Provider  repository.TaskProvider
    ProjectID int
    ClaimedAt time.Time
    Status    string // claimed, working, completed, failed
    AgentID   string
    Results   map[string]interface{}
}

Task Lifecycle States:

claimed - Task has been claimed by an agent
working - Agent is actively executing the task
completed - Task finished successfully
failed - Task execution failed

TaskProgressTracker Interface

Callback interface for tracking task progress and updating availability broadcasts.

type TaskProgressTracker interface {
    AddTask(taskID string)
    RemoveTask(taskID string)
}

This interface ensures availability broadcasts accurately reflect current workload.
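A minimal sketch of what an implementation of this interface might look like, assuming an in-memory set guarded by a mutex. The names memoryTracker, newMemoryTracker, and Snapshot are hypothetical and do not come from the CHORUS codebase; the real tracker lives elsewhere and may behave differently.

```go
package main

import (
	"fmt"
	"sync"
)

// TaskProgressTracker mirrors the interface documented above.
type TaskProgressTracker interface {
	AddTask(taskID string)
	RemoveTask(taskID string)
}

// memoryTracker is a hypothetical in-memory implementation used only
// for illustration; it is not part of the coordinator package.
type memoryTracker struct {
	mu       sync.Mutex
	tasks    map[string]struct{}
	maxTasks int
}

func newMemoryTracker(maxTasks int) *memoryTracker {
	return &memoryTracker{tasks: make(map[string]struct{}), maxTasks: maxTasks}
}

// AddTask records a claimed task. Adding the same ID twice is a no-op.
func (m *memoryTracker) AddTask(taskID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	m.tasks[taskID] = struct{}{}
}

// RemoveTask forgets a finished task. Unknown IDs are ignored.
func (m *memoryTracker) RemoveTask(taskID string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	delete(m.tasks, taskID)
}

// Snapshot returns the two values an availability broadcast would need:
// the current task count and whether there is spare capacity.
func (m *memoryTracker) Snapshot() (current int, available bool) {
	m.mu.Lock()
	defer m.mu.Unlock()
	current = len(m.tasks)
	return current, current < m.maxTasks
}

// Compile-time check that memoryTracker satisfies the interface.
var _ TaskProgressTracker = (*memoryTracker)(nil)

func main() {
	tracker := newMemoryTracker(3)
	tracker.AddTask("myrepo:42")
	current, available := tracker.Snapshot()
	fmt.Println(current, available)
}
```

Keeping the set keyed by task ID makes AddTask idempotent, so a duplicate claim notification would not inflate the reported workload.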

Task Coordination Flow

1. Initialization

coordinator := NewTaskCoordinator(
    ctx,
    ps,           // PubSub instance
    hlog,         // Hypercore log
    cfg,          // Agent configuration
    nodeID,       // P2P node ID
    hmmmRouter,   // HMMM router for meta-discussion
    tracker,      // Task progress tracker
)

coordinator.Start()

Initialization Process:

Creates agent info from configuration
Sets up task execution engine with AI providers
Announces agent role and capabilities via PubSub
Starts task discovery loop
Begins listening for role-based messages

2. Task Discovery and Assignment

Discovery Loop (runs every 30 seconds):

taskDiscoveryLoop() ->
  (task discovery is now delegated to the WHOOSH integration)

Task Evaluation (shouldProcessTask):

func (tc *TaskCoordinator) shouldProcessTask(task *repository.Task) bool {
    // 1. Check capacity: currentTasks < maxTasks
    // 2. Check if already assigned to this agent
    // 3. Score task fit for agent capabilities
    // 4. Return true if score > 0.5 threshold
}
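The four checks listed in the comments above could be sketched as follows. This is an illustration under assumptions, not the coordinator's actual code: agentState and shouldProcess are hypothetical names, the score is passed in rather than computed, and step 2 is interpreted as "skip tasks this agent already holds".

```go
package main

import "fmt"

// scoreThreshold is the 0.5 threshold stated in the documentation.
const scoreThreshold = 0.5

// agentState is a hypothetical stand-in for the coordinator's internal state.
type agentState struct {
	currentTasks int
	maxTasks     int
	active       map[string]bool // task keys this agent already holds
}

// shouldProcess applies the documented checks in order.
func shouldProcess(s agentState, taskKey string, score float64) bool {
	// 1. Capacity: refuse when the agent is already at its task limit.
	if s.currentTasks >= s.maxTasks {
		return false
	}
	// 2. Assumption: a task already held by this agent is not processed again.
	if s.active[taskKey] {
		return false
	}
	// 3 and 4. The caller supplies the fit score; it must exceed the threshold.
	return score > scoreThreshold
}

func main() {
	state := agentState{
		currentTasks: 1,
		maxTasks:     3,
		active:       map[string]bool{"myrepo:41": true},
	}
	fmt.Println(shouldProcess(state, "myrepo:42", 0.8))
	fmt.Println(shouldProcess(state, "myrepo:41", 0.9))
	fmt.Println(shouldProcess(state, "myrepo:43", 0.5))
}
```

Note that the comparison is strict, so a score of exactly 0.5 is rejected, matching the "score > 0.5" wording.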

Task Scoring:

Agent role matches required role
Agent expertise matches required expertise
Current workload vs capacity
Task priority level
Historical performance scores

3. Task Claiming and Processing

processTask() flow:
  1. Evaluate if collaboration needed (shouldRequestCollaboration)
  2. Request collaboration via PubSub if needed
  3. Claim task through repository provider
  4. Create ActiveTask and store in activeTasks map
  5. Log claim to Hypercore
  6. Announce claim via PubSub (TaskProgress message)
  7. Seed HMMM meta-discussion room for task
  8. Start execution in background goroutine

Collaboration Request Criteria:

Task priority >= 8 (high priority)
Task requires expertise the agent lacks
Complex multi-component tasks
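The first two criteria above lend themselves to a short sketch. The function names and the "high_priority" reason string are assumptions made for illustration ("expertise_gap" is the reason value shown in the TaskHelpRequest example). The third criterion is left out because the document does not say how task complexity is measured.

```go
package main

import (
	"fmt"
	"strings"
)

// collaborationPriority is the documented cut-off for high-priority tasks.
const collaborationPriority = 8

// missingExpertise returns the required skills the agent lacks,
// comparing names case-insensitively.
func missingExpertise(required, have []string) []string {
	owned := make(map[string]bool, len(have))
	for _, h := range have {
		owned[strings.ToLower(h)] = true
	}
	var missing []string
	for _, r := range required {
		if !owned[strings.ToLower(r)] {
			missing = append(missing, r)
		}
	}
	return missing
}

// needsCollaboration reports whether help should be requested and why.
func needsCollaboration(priority int, required, have []string) (bool, string) {
	if priority >= collaborationPriority {
		return true, "high_priority" // hypothetical reason label
	}
	if len(missingExpertise(required, have)) > 0 {
		return true, "expertise_gap"
	}
	return false, ""
}

func main() {
	have := []string{"Go", "PostgreSQL", "Kubernetes"}
	ask, reason := needsCollaboration(5, []string{"PostgreSQL", "Query Optimization"}, have)
	fmt.Println(ask, reason)
}
```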

4. Task Execution

AI-Powered Execution (executeTaskWithAI):

executionRequest := &execution.TaskExecutionRequest{
    ID:          "repo:taskNumber",
    Type:        determineTaskType(task), // bug_fix, feature_development, etc.
    Description: buildTaskDescription(task),
    Context:     buildTaskContext(task),
    Requirements: &execution.TaskRequirements{
        AIModel:        "", // Auto-selected based on role
        SandboxType:    "docker",
        RequiredTools:  []string{"git", "curl"},
        EnvironmentVars: map[string]string{
            "TASK_ID":    taskID,
            "REPOSITORY": repoName,
            "AGENT_ID":   agentID,
            "AGENT_ROLE": agentRole,
        },
    },
    Timeout: 10 * time.Minute,
}

result := tc.executionEngine.ExecuteTask(ctx, executionRequest)

Task Type Detection:

bug_fix - Keywords: "bug", "fix"
feature_development - Keywords: "feature", "implement"
testing - Keywords: "test"
documentation - Keywords: "doc", "documentation"
refactoring - Keywords: "refactor"
code_review - Keywords: "review"
development - Default for general tasks
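A keyword table like the one above might be applied as in the following sketch. It assumes that matching is a case-insensitive substring search over the title and description and that rules are tried in the order listed; neither detail is confirmed by the source, and detectTaskType is a hypothetical stand-in for determineTaskType.

```go
package main

import (
	"fmt"
	"strings"
)

// rule pairs a task type with the keywords that select it.
type rule struct {
	taskType string
	keywords []string
}

// typeRules follows the order of the list above (an assumption:
// the first matching rule wins).
var typeRules = []rule{
	{"bug_fix", []string{"bug", "fix"}},
	{"feature_development", []string{"feature", "implement"}},
	{"testing", []string{"test"}},
	{"documentation", []string{"doc", "documentation"}},
	{"refactoring", []string{"refactor"}},
	{"code_review", []string{"review"}},
}

// detectTaskType returns the first matching type, or "development"
// when no keyword is found.
func detectTaskType(title, description string) string {
	text := strings.ToLower(title + " " + description)
	for _, r := range typeRules {
		for _, kw := range r.keywords {
			if strings.Contains(text, kw) {
				return r.taskType
			}
		}
	}
	return "development"
}

func main() {
	fmt.Println(detectTaskType("Fix login bug", ""))
	fmt.Println(detectTaskType("Update README", "Tidy the wording"))
}
```

Substring matching is simple but imprecise: a title containing "prefix" would match the "fix" keyword, so a production matcher would likely want word boundaries or labels.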

Fallback Mock Execution: If the AI execution engine is unavailable or fails, the coordinator falls back to mock execution with simulated work time.

5. Task Completion

executeTask() completion flow:
  1. Update ActiveTask status to "completed"
  2. Complete task through repository provider
  3. Remove from activeTasks map
  4. Update TaskProgressTracker
  5. Log completion to Hypercore
  6. Announce completion via PubSub

Task Result Structure:

type TaskResult struct {
    Success  bool
    Message  string
    Metadata map[string]interface{} // Includes:
                                     // - execution_type (ai_powered/mock)
                                     // - duration
                                     // - commands_executed
                                     // - files_generated
                                     // - resource_usage
                                     // - artifacts
}

PubSub Integration

Published Message Types

1. RoleAnnouncement

Topic: hmmm/meta-discussion/v1
Frequency: Once on startup, and again whenever capabilities change

{
  "type": "role_announcement",
  "from": "peer_id",
  "from_role": "Senior Backend Developer",
  "data": {
    "agent_id": "agent-001",
    "node_id": "Qm...",
    "role": "Senior Backend Developer",
    "expertise": ["Go", "PostgreSQL", "Kubernetes"],
    "capabilities": ["code", "test", "deploy"],
    "max_tasks": 3,
    "current_tasks": 0,
    "status": "ready",
    "specialization": "microservices"
  }
}

2. TaskProgress

Topic: CHORUS/coordination/v1
Frequency: On claim, start, and completion

Task Claim:

{
  "type": "task_progress",
  "from": "peer_id",
  "from_role": "Senior Backend Developer",
  "thread_id": "task-myrepo-42",
  "data": {
    "task_number": 42,
    "repository": "myrepo",
    "title": "Add authentication endpoint",
    "agent_id": "agent-001",
    "agent_role": "Senior Backend Developer",
    "claim_time": "2025-09-30T10:00:00Z",
    "estimated_completion": "2025-09-30T11:00:00Z"
  }
}

Task Status Update (status is either "started" or "completed"):

{
  "type": "task_progress",
  "from": "peer_id",
  "from_role": "Senior Backend Developer",
  "thread_id": "task-myrepo-42",
  "data": {
    "task_number": 42,
    "repository": "myrepo",
    "agent_id": "agent-001",
    "agent_role": "Senior Backend Developer",
    "status": "started",
    "timestamp": "2025-09-30T10:05:00Z"
  }
}

3. TaskHelpRequest

Topic: hmmm/meta-discussion/v1
Frequency: When collaboration is needed

{
  "type": "task_help_request",
  "from": "peer_id",
  "from_role": "Senior Backend Developer",
  "to_roles": ["Database Specialist"],
  "required_expertise": ["PostgreSQL", "Query Optimization"],
  "priority": "high",
  "thread_id": "task-myrepo-42",
  "data": {
    "task_number": 42,
    "repository": "myrepo",
    "title": "Optimize database queries",
    "required_role": "Database Specialist",
    "required_expertise": ["PostgreSQL", "Query Optimization"],
    "priority": 8,
    "requester_role": "Senior Backend Developer",
    "reason": "expertise_gap"
  }
}

Received Message Types

1. TaskHelpRequest

Handler: handleTaskHelpRequest

Response Logic:

Check if agent has required expertise
Verify agent has available capacity (currentTasks < maxTasks)
If can help, send TaskHelpResponse
Reflect offer into HMMM per-issue room

Response Message:

{
  "type": "task_help_response",
  "from": "peer_id",
  "from_role": "Database Specialist",
  "thread_id": "task-myrepo-42",
  "data": {
    "agent_id": "agent-002",
    "agent_role": "Database Specialist",
    "expertise": ["PostgreSQL", "Query Optimization", "Indexing"],
    "availability": 2,
    "offer_type": "collaboration",
    "response_to": { /* original help request data */ }
  }
}
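The response logic for handleTaskHelpRequest could be sketched roughly as below. The sketch assumes that sharing at least one required skill counts as "has required expertise" and that availability is the number of free task slots; both are guesses, and helpOffer and evaluateHelpRequest are hypothetical names.

```go
package main

import (
	"fmt"
	"strings"
)

// helpOffer holds the fields a task_help_response would carry.
type helpOffer struct {
	AgentRole    string
	Expertise    []string
	Availability int
}

// sharesExpertise reports whether the agent has at least one of the
// required skills (case-insensitive). Requiring all of them would be
// an equally plausible reading of the documentation.
func sharesExpertise(required, have []string) bool {
	owned := make(map[string]bool, len(have))
	for _, h := range have {
		owned[strings.ToLower(h)] = true
	}
	for _, r := range required {
		if owned[strings.ToLower(r)] {
			return true
		}
	}
	return false
}

// evaluateHelpRequest decides whether to answer a help request.
func evaluateHelpRequest(role string, have, required []string, currentTasks, maxTasks int) (helpOffer, bool) {
	if !sharesExpertise(required, have) {
		return helpOffer{}, false
	}
	if currentTasks >= maxTasks {
		return helpOffer{}, false // no capacity left
	}
	return helpOffer{
		AgentRole:    role,
		Expertise:    have,
		Availability: maxTasks - currentTasks, // assumption: free slots
	}, true
}

func main() {
	offer, ok := evaluateHelpRequest(
		"Database Specialist",
		[]string{"PostgreSQL", "Query Optimization", "Indexing"},
		[]string{"PostgreSQL", "Query Optimization"},
		1, 3,
	)
	fmt.Println(ok, offer.Availability)
}
```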

2. ExpertiseRequest

Handler: handleExpertiseRequest

Processes requests for specific expertise areas.

3. CoordinationRequest

Handler: handleCoordinationRequest

Handles coordination requests for multi-agent tasks.

4. RoleAnnouncement

Handler: handleRoleAnnouncement

Logs when other agents announce their roles and capabilities.

HMMM Integration

Per-Issue Room Seeding

When a task is claimed, the coordinator seeds a HMMM meta-discussion room:

seedMsg := hmmm.Message{
    Version:   1,
    Type:      "meta_msg",
    IssueID:   int64(taskNumber),
    ThreadID:  fmt.Sprintf("issue-%d", taskNumber),
    MsgID:     uuid.New().String(),
    NodeID:    nodeID,
    HopCount:  0,
    Timestamp: time.Now().UTC(),
    Message:   "Seed: Task 'title' claimed. Description: ...",
}

hmmmRouter.Publish(ctx, seedMsg)

Purpose:

Creates dedicated discussion space for task
Enables agents to coordinate on specific tasks
Integrates with broader meta-coordination system
Provides context for SLURP event generation

Help Offer Reflection

When agents offer help, the offer is reflected into the HMMM room:

hmsg := hmmm.Message{
    Version:   1,
    Type:      "meta_msg",
    IssueID:   issueID,
    ThreadID:  fmt.Sprintf("issue-%d", issueID),
    MsgID:     uuid.New().String(),
    NodeID:    nodeID,
    HopCount:  0,
    Timestamp: time.Now().UTC(),
    Message:   fmt.Sprintf("Help offer from %s (availability %d)",
                          agentRole, availableSlots),
}

Availability Tracking

The coordinator tracks task progress to keep availability broadcasts accurate:

// When task is claimed:
if tc.taskTracker != nil {
    tc.taskTracker.AddTask(taskKey)
}

// When task completes:
if tc.taskTracker != nil {
    tc.taskTracker.RemoveTask(taskKey)
}

This ensures the availability broadcaster (in internal/runtime) has accurate real-time data:

{
  "type": "availability_broadcast",
  "data": {
    "node_id": "Qm...",
    "available_for_work": true,
    "current_tasks": 1,
    "max_tasks": 3,
    "last_activity": 1727692800,
    "status": "working",
    "timestamp": 1727692800
  }
}

Task Assignment Algorithm

Scoring System

The TaskMatcher scores tasks for agents based on multiple factors:

Score = (roleMatch * 0.4) +
        (expertiseMatch * 0.3) +
        (availabilityScore * 0.2) +
        (performanceScore * 0.1)

Where:
- roleMatch: 1.0 if agent role matches required role, 0.5 for partial match
- expertiseMatch: percentage of required expertise agent possesses
- availabilityScore: (maxTasks - currentTasks) / maxTasks
- performanceScore: agent's historical performance metric (0.0-1.0)

Threshold: Tasks with score > 0.5 are considered for assignment.

Assignment Priority

Tasks are prioritized by:

Priority Level (task.Priority field, 0-10)
Task Score (calculated by matcher)
Age (older tasks first)
Dependencies (tasks blocking others)
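One way to read the list above is as a sequence of tie-breakers, which the following sketch implements for the first three criteria. That reading is an assumption, as are the candidate type and its fields; the dependency criterion is omitted because the document does not describe how blocking relationships are represented.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical view of a task awaiting assignment.
type candidate struct {
	Key         string
	Priority    int     // task.Priority, 0-10
	Score       float64 // matcher score
	CreatedUnix int64   // creation time in Unix seconds
}

// orderCandidates sorts in place: higher priority first, then higher
// score, then older tasks first.
func orderCandidates(c []candidate) {
	sort.SliceStable(c, func(i, j int) bool {
		if c[i].Priority != c[j].Priority {
			return c[i].Priority > c[j].Priority
		}
		if c[i].Score != c[j].Score {
			return c[i].Score > c[j].Score
		}
		return c[i].CreatedUnix < c[j].CreatedUnix
	})
}

func main() {
	queue := []candidate{
		{"myrepo:40", 5, 0.9, 1727690000},
		{"myrepo:41", 8, 0.6, 1727691000},
		{"myrepo:42", 8, 0.6, 1727689000},
		{"myrepo:43", 8, 0.7, 1727692000},
	}
	orderCandidates(queue)
	for _, c := range queue {
		fmt.Println(c.Key)
	}
}
```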

Claim Race Condition Handling

Multiple agents may attempt to claim the same task:

1. Agent A evaluates task: score = 0.8, attempts claim
2. Agent B evaluates task: score = 0.7, attempts claim
3. Repository provider uses atomic claim operation
4. First successful claim wins
5. Other agents receive claim failure
6. Failed agents continue to next task
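The "first successful claim wins" behaviour can be demonstrated with a small in-process sketch. It uses a mutex-guarded map as a stand-in for the repository provider's atomic claim; how the real providers achieve atomicity is not described in this document, and claimRegistry, Claim, and raceForTask are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errAlreadyClaimed = errors.New("task already claimed")

// claimRegistry is a stand-in for a provider with an atomic claim.
type claimRegistry struct {
	mu     sync.Mutex
	owners map[string]string // taskKey -> agentID
}

func newClaimRegistry() *claimRegistry {
	return &claimRegistry{owners: make(map[string]string)}
}

// Claim succeeds only for the first caller per task key. The check and
// the write happen under one lock, so they cannot interleave.
func (r *claimRegistry) Claim(taskKey, agentID string) error {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, taken := r.owners[taskKey]; taken {
		return errAlreadyClaimed
	}
	r.owners[taskKey] = agentID
	return nil
}

// raceForTask lets every agent attempt the claim concurrently and
// returns how many of them succeeded.
func raceForTask(r *claimRegistry, taskKey string, agents []string) int {
	var wg sync.WaitGroup
	var mu sync.Mutex
	wins := 0
	for _, id := range agents {
		wg.Add(1)
		go func(agentID string) {
			defer wg.Done()
			if err := r.Claim(taskKey, agentID); err == nil {
				mu.Lock()
				wins++
				mu.Unlock()
			}
		}(id)
	}
	wg.Wait()
	return wins
}

func main() {
	registry := newClaimRegistry()
	wins := raceForTask(registry, "myrepo:42", []string{"agent-001", "agent-002", "agent-003"})
	fmt.Println("successful claims:", wins)
}
```

Agents whose claim fails would simply move on to the next candidate task, as step 6 describes.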

Error Handling

Task Execution Failures

// On AI execution failure:
if err := tc.executeTaskWithAI(activeTask); err != nil {
    // Fall back to mock execution
    taskResult = tc.executeMockTask(activeTask)
}

// On completion failure:
if err := provider.CompleteTask(task, result); err != nil {
    // Update status to failed
    activeTask.Status = "failed"
    activeTask.Results = map[string]interface{}{
        "error": err.Error(),
    }
}

Collaboration Request Failures

err := tc.pubsub.PublishRoleBasedMessage(
    pubsub.TaskHelpRequest, data, opts)
if err != nil {
    // Log error but continue with task
    fmt.Printf("⚠️ Failed to request collaboration: %v\n", err)
    // Task execution proceeds without collaboration
}

HMMM Seeding Failures

if err := tc.hmmmRouter.Publish(ctx, seedMsg); err != nil {
    // Log error to Hypercore
    tc.hlog.AppendString("system_error", map[string]interface{}{
        "error":       "hmmm_seed_failed",
        "task_number": taskNumber,
        "repository":  repository,
        "message":     err.Error(),
    })
    // Task execution continues without HMMM room
}

Agent Configuration

Required Configuration

agent:
  id: "agent-001"
  role: "Senior Backend Developer"
  expertise:
    - "Go"
    - "PostgreSQL"
    - "Docker"
    - "Kubernetes"
  capabilities:
    - "code"
    - "test"
    - "deploy"
  max_tasks: 3
  specialization: "microservices"
  models:
    - name: "llama3.1:70b"
      provider: "ollama"
      endpoint: "http://192.168.1.72:11434"

AgentInfo Structure

type AgentInfo struct {
    ID           string
    Role         string
    Expertise    []string
    CurrentTasks int
    MaxTasks     int
    Status       string // ready, working, busy, offline
    LastSeen     time.Time
    Performance  map[string]interface{} // e.g. {"score": 0.8}
    Availability string // available, busy, offline
}

Hypercore Logging

All coordination events are logged to Hypercore:

Task Claimed

hlog.Append(logging.TaskClaimed, map[string]interface{}{
    "task_number":   taskNumber,
    "repository":    repository,
    "title":         title,
    "required_role": requiredRole,
    "priority":      priority,
})

Task Completed

hlog.Append(logging.TaskCompleted, map[string]interface{}{
    "task_number": taskNumber,
    "repository":  repository,
    "duration":    durationSeconds,
    "results":     resultsMap,
})

Status Reporting

Coordinator Status

status := coordinator.GetStatus()
// Returns:
{
    "agent_id":         "agent-001",
    "role":             "Senior Backend Developer",
    "expertise":        ["Go", "PostgreSQL", "Docker"],
    "current_tasks":    1,
    "max_tasks":        3,
    "active_providers": 2,
    "status":           "working",
    "active_tasks": [
        {
            "repository": "myrepo",
            "number":     42,
            "title":      "Add authentication",
            "status":     "working",
            "claimed_at": "2025-09-30T10:00:00Z"
        }
    ]
}

Best Practices

Task Coordinator Usage

Initialize Early: Create coordinator during agent startup
Set Task Tracker: Always provide TaskProgressTracker for accurate availability
Configure HMMM: Wire up hmmmRouter for meta-discussion integration
Monitor Status: Periodically check GetStatus() for health monitoring
Handle Failures: Implement proper error handling for degraded operation

Configuration Tuning

Max Tasks: Set based on agent resources (CPU, memory, AI model capacity)
Sync Interval: Balance between responsiveness and network overhead (default: 30s)
Task Scoring: Adjust threshold (default: 0.5) based on task availability
Collaboration: Enable for high-priority or expertise-gap tasks

Performance Optimization

Task Discovery: Delegate to WHOOSH for efficient search and indexing
Concurrent Execution: Use goroutines for parallel task execution
Lock Granularity: Minimize lock contention with separate locks for providers/tasks
Caching: Cache agent info and provider connections

Integration Points

With PubSub

Publishes: RoleAnnouncement, TaskProgress, TaskHelpRequest, TaskHelpResponse
Subscribes: TaskHelpRequest, ExpertiseRequest, CoordinationRequest, RoleAnnouncement
Topics: CHORUS/coordination/v1, hmmm/meta-discussion/v1

With HMMM

Seeds per-issue discussion rooms
Reflects help offers into rooms
Enables agent coordination on specific tasks

With Repository Providers

Claims tasks atomically
Fetches task details
Updates task status
Completes tasks with results

With Execution Engine

Converts repository tasks to execution requests
Executes tasks with AI providers
Handles sandbox environments
Collects execution metrics and artifacts

With Hypercore

Logs task claims
Logs task completions
Logs coordination errors
Provides audit trail

Task Message Format

PubSub Task Messages

All task-related messages follow the standard PubSub Message format:

type Message struct {
    Type              MessageType            // e.g., "task_progress"
    From              string                 // Peer ID
    Timestamp         time.Time
    Data              map[string]interface{} // Message payload
    HopCount          int
    FromRole          string                 // Agent role
    ToRoles           []string               // Target roles
    RequiredExpertise []string               // Required expertise
    ProjectID         string
    Priority          string                 // low, medium, high, urgent
    ThreadID          string                 // Conversation thread
}
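To show how such a message might map onto the JSON examples earlier in this document, here is a reduced sketch that builds and round-trips a task_progress message. The struct carries only a subset of the fields, and the JSON tags are inferred from the example payloads rather than taken from the pubsub package; newTaskProgress, encode, and decode are hypothetical helpers.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

type MessageType string

// Message is a reduced copy of the struct above. The JSON tags are
// assumptions based on the field names in the example payloads.
type Message struct {
	Type      MessageType            `json:"type"`
	From      string                 `json:"from"`
	Timestamp time.Time              `json:"timestamp"`
	Data      map[string]interface{} `json:"data"`
	FromRole  string                 `json:"from_role,omitempty"`
	ThreadID  string                 `json:"thread_id,omitempty"`
}

// newTaskProgress builds a status update. The thread ID follows the
// "task-<repository>-<number>" pattern used in the examples.
func newTaskProgress(peerID, role, repo string, taskNumber int, status string) Message {
	return Message{
		Type:      "task_progress",
		From:      peerID,
		Timestamp: time.Now().UTC(),
		FromRole:  role,
		ThreadID:  fmt.Sprintf("task-%s-%d", repo, taskNumber),
		Data: map[string]interface{}{
			"task_number": taskNumber,
			"repository":  repo,
			"agent_role":  role,
			"status":      status,
		},
	}
}

func encode(m Message) ([]byte, error) { return json.Marshal(m) }

func decode(raw []byte) (Message, error) {
	var m Message
	err := json.Unmarshal(raw, &m)
	return m, err
}

func main() {
	raw, err := encode(newTaskProgress("peer_id", "Senior Backend Developer", "myrepo", 42, "started"))
	if err != nil {
		panic(err)
	}
	fmt.Println(string(raw))
}
```

After decoding, numeric values inside Data arrive as float64 because the payload is an untyped map, which is worth keeping in mind when reading task_number on the receiving side.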

Task Assignment Message Flow

1. TaskAnnouncement (WHOOSH → PubSub)
   ├─ Available task discovered
   └─ Broadcast to coordination topic

2. Task Evaluation (Local)
   ├─ Score task for agent
   └─ Decide whether to claim

3. TaskClaim (Agent → Repository)
   ├─ Atomic claim operation
   └─ Only one agent succeeds

4. TaskProgress (Agent → PubSub)
   ├─ Announce claim to network
   └─ Status: "claimed"

5. TaskHelpRequest (Optional, Agent → PubSub)
   ├─ Request collaboration if needed
   └─ Target specific roles/expertise

6. TaskHelpResponse (Other Agents → PubSub)
   ├─ Offer assistance
   └─ Include availability info

7. TaskProgress (Agent → PubSub)
   ├─ Announce work started
   └─ Status: "started"

8. Task Execution (Local with AI Engine)
   ├─ Execute task in sandbox
   └─ Generate artifacts

9. TaskProgress (Agent → PubSub)
   ├─ Announce completion
   └─ Status: "completed"
