Package: discovery

Location: /home/tony/chorus/project-queues/active/CHORUS/discovery/

Overview

The discovery package provides mDNS-based peer discovery for automatic detection and connection of CHORUS agents on the local network. It enables zero-configuration peer discovery using multicast DNS (mDNS), allowing agents to find and connect to each other without manual configuration or central coordination.

Architecture

mDNS Overview

Multicast DNS (mDNS) is a protocol that resolves hostnames to IP addresses within small networks that do not include a local name server. It uses:

  • Multicast IP: 224.0.0.251 (IPv4) or FF02::FB (IPv6)
  • UDP Port: 5353
  • Service Discovery: Advertises and discovers services on the local network

CHORUS Service Tag

Default Service Name: "CHORUS-peer-discovery"

This service tag identifies CHORUS peers on the network. All CHORUS agents advertise themselves with this tag and listen for other agents using the same tag.

Core Components

MDNSDiscovery

Main structure managing mDNS discovery operations.

type MDNSDiscovery struct {
    host        host.Host                 // libp2p host
    service     mdns.Service              // mDNS service
    notifee     *mdnsNotifee             // Peer notification handler
    ctx         context.Context           // Discovery context
    cancel      context.CancelFunc        // Context cancellation
    serviceTag  string                    // Service name (default: "CHORUS-peer-discovery")
}

Key Responsibilities:

  • Advertise local agent as mDNS service
  • Listen for mDNS announcements from other agents
  • Automatically connect to discovered peers
  • Handle peer connection lifecycle

mdnsNotifee

Internal notification handler for discovered peers.

type mdnsNotifee struct {
    h         host.Host                // libp2p host
    ctx       context.Context          // Context for operations
    peersChan chan peer.AddrInfo       // Channel for discovered peers (buffer: 10)
}

Implements the mDNS notification interface to receive peer discovery events.
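
HandlePeerFound is the single method required by the mdns Notifee interface. A minimal sketch of how it might forward discoveries into the buffered channel without blocking (the actual implementation may differ):

// HandlePeerFound satisfies the mdns.Notifee interface.
func (n *mdnsNotifee) HandlePeerFound(pi peer.AddrInfo) {
    select {
    case n.peersChan <- pi:
        // Queued for the connection handler goroutine.
    case <-n.ctx.Done():
        // Discovery is shutting down; drop the event.
    default:
        // Buffer full (10); drop and rely on the next mDNS announcement.
        fmt.Printf("⚠️ Discovery channel full, skipping peer %s\n", pi.ID.ShortString())
    }
}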

Discovery Flow

1. Service Initialization

discovery, err := NewMDNSDiscovery(ctx, host, "CHORUS-peer-discovery")
if err != nil {
    return fmt.Errorf("failed to start mDNS discovery: %w", err)
}

Initialization Steps (sketched in code below):

  1. Create discovery context with cancellation
  2. Initialize mdnsNotifee with peer channel
  3. Create mDNS service with service tag
  4. Start mDNS service (begins advertising and listening)
  5. Launch background peer connection handler
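
A minimal sketch of how these steps might be wired together, using the mdns package from github.com/libp2p/go-libp2p/p2p/discovery/mdns (the actual constructor may differ in detail):

func NewMDNSDiscovery(ctx context.Context, h host.Host, serviceTag string) (*MDNSDiscovery, error) {
    if serviceTag == "" {
        serviceTag = "CHORUS-peer-discovery" // fall back to the default tag
    }

    // Step 1: discovery context with cancellation
    dctx, cancel := context.WithCancel(ctx)

    // Step 2: notifee with buffered peer channel
    notifee := &mdnsNotifee{h: h, ctx: dctx, peersChan: make(chan peer.AddrInfo, 10)}

    // Steps 3-4: create and start the mDNS service (advertise + listen)
    svc := mdns.NewMdnsService(h, serviceTag, notifee)
    if err := svc.Start(); err != nil {
        cancel()
        return nil, fmt.Errorf("failed to start mDNS service: %w", err)
    }

    d := &MDNSDiscovery{
        host:       h,
        service:    svc,
        notifee:    notifee,
        ctx:        dctx,
        cancel:     cancel,
        serviceTag: serviceTag,
    }

    // Step 5: background goroutine that auto-connects to discovered peers
    go d.handleDiscoveredPeers()
    return d, nil
}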

2. Service Advertisement

When the service starts, it automatically advertises:

Service Type: _CHORUS-peer-discovery._udp.local
Port: libp2p host port
Addresses: All local IP addresses (IPv4 and IPv6)

This allows other CHORUS agents on the network to discover this peer.

3. Peer Discovery

Discovery Process:

1. mDNS Service listens for multicast announcements
   ├─ Receives service announcement from peer
   └─ Extracts peer.AddrInfo (ID + addresses)

2. mdnsNotifee.HandlePeerFound() called
   ├─ Peer info sent to peersChan
   └─ Non-blocking send (drops if channel full)

3. handleDiscoveredPeers() goroutine receives
   ├─ Skip if peer is self
   ├─ Skip if already connected
   └─ Attempt connection

4. Automatic Connection

func (d *MDNSDiscovery) handleDiscoveredPeers() {
    for {
        select {
        case <-d.ctx.Done():
            return
        case peerInfo := <-d.notifee.peersChan:
            // Skip self
            if peerInfo.ID == d.host.ID() {
                continue
            }

            // Check if already connected (1 == network.Connected)
            if d.host.Network().Connectedness(peerInfo.ID) == 1 {
                continue
            }

            // Attempt connection with timeout
            connectCtx, cancel := context.WithTimeout(d.ctx, 10*time.Second)
            err := d.host.Connect(connectCtx, peerInfo)
            cancel()

            if err != nil {
                fmt.Printf("❌ Failed to connect to peer %s: %v\n",
                          peerInfo.ID.ShortString(), err)
            } else {
                fmt.Printf("✅ Successfully connected to peer %s\n",
                          peerInfo.ID.ShortString())
            }
        }
    }
}

Connection Features:

  • 10-second timeout per connection attempt
  • Idempotent: Safe to attempt connection to already-connected peer
  • Self-filtering: Ignores own mDNS announcements
  • Duplicate filtering: Checks existing connections before attempting
  • Non-blocking: Runs in background goroutine

Usage

Basic Usage

import (
    "context"
    "fmt"

    "chorus/discovery"

    "github.com/libp2p/go-libp2p/core/host"
)

func setupDiscovery(ctx context.Context, h host.Host) (*discovery.MDNSDiscovery, error) {
    // Start mDNS discovery with default service tag
    disc, err := discovery.NewMDNSDiscovery(ctx, h, "")
    if err != nil {
        return nil, err
    }

    fmt.Println("🔍 mDNS discovery started")
    return disc, nil
}

Custom Service Tag

// Use custom service tag for specific environments
disc, err := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-dev-network")
if err != nil {
    return nil, err
}

Monitoring Discovered Peers

// Access peer channel for custom handling
peersChan := disc.PeersChan()

go func() {
    for peerInfo := range peersChan {
        fmt.Printf("🔍 Discovered peer: %s with %d addresses\n",
                  peerInfo.ID.ShortString(),
                  len(peerInfo.Addrs))

        // Custom peer processing
        handleNewPeer(peerInfo)
    }
}()

Graceful Shutdown

// Close discovery service
if err := disc.Close(); err != nil {
    log.Printf("Error closing discovery: %v", err)
}

Peer Information Structure

peer.AddrInfo

Discovered peers are represented as libp2p peer.AddrInfo:

type AddrInfo struct {
    ID    peer.ID           // Unique peer identifier
    Addrs []multiaddr.Multiaddr  // Peer addresses
}

Example Multiaddresses:

/ip4/192.168.1.100/tcp/4001/p2p/QmPeerID...
/ip6/fe80::1/tcp/4001/p2p/QmPeerID...
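
For reference, a full /p2p/ multiaddress like the examples above can be converted back into a peer.AddrInfo with the standard libp2p helpers (a small sketch):

import (
    "github.com/libp2p/go-libp2p/core/peer"
    ma "github.com/multiformats/go-multiaddr"
)

// addrInfoFromString splits a full multiaddress into a peer ID and dial address.
func addrInfoFromString(s string) (*peer.AddrInfo, error) {
    addr, err := ma.NewMultiaddr(s) // e.g. "/ip4/192.168.1.100/tcp/4001/p2p/<peer-id>"
    if err != nil {
        return nil, err
    }
    return peer.AddrInfoFromP2pAddr(addr)
}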

Network Configuration

Firewall Requirements

mDNS requires the following ports to be open:

  • UDP 5353: mDNS multicast
  • TCP/UDP 4001 (or configured libp2p port): libp2p connections

Network Scope

mDNS operates on local network only:

  • Same subnet required for discovery
  • Does not traverse routers (by design)
  • Ideal for LAN-based agent clusters

Multicast Group

mDNS uses standard multicast groups:

  • IPv4: 224.0.0.251
  • IPv6: FF02::FB

Integration with CHORUS

Cluster Formation

mDNS discovery enables automatic cluster formation:

Startup Sequence (sketched in code below):
1. Agent starts with libp2p host
2. mDNS discovery initialized
3. Agent advertises itself via mDNS
4. Agent listens for other agents
5. Auto-connects to discovered peers
6. PubSub gossip network forms
7. Task coordination begins
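
A condensed sketch of that startup sequence; GossipSub setup uses go-libp2p-pubsub, and the topic name and error handling are illustrative only:

import (
    "context"

    "chorus/discovery"

    "github.com/libp2p/go-libp2p"
    pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func startAgent(ctx context.Context) error {
    // 1. Agent starts with a libp2p host
    h, err := libp2p.New(libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/4001"))
    if err != nil {
        return err
    }

    // 2-5. mDNS discovery: advertise, listen, auto-connect to discovered peers
    disc, err := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-peer-discovery")
    if err != nil {
        return err
    }
    defer disc.Close()

    // 6. PubSub gossip network forms over the discovered connections
    ps, err := pubsub.NewGossipSub(ctx, h)
    if err != nil {
        return err
    }
    topic, err := ps.Join("chorus/coordination/v1") // illustrative topic name
    if err != nil {
        return err
    }
    _ = topic // 7. Task coordination publishes and subscribes here

    return nil
}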

Multi-Node Cluster Example

Network: 192.168.1.0/24

Node 1 (walnut):     192.168.1.27  - Agent: backend-dev
Node 2 (ironwood):   192.168.1.72  - Agent: frontend-dev
Node 3 (rosewood):   192.168.1.113 - Agent: devops-specialist

Discovery Flow:
1. All nodes start with CHORUS-peer-discovery tag
2. Each node multicasts to 224.0.0.251:5353
3. All nodes receive each other's announcements
4. Automatic connection establishment:
   walnut ↔ ironwood
   walnut ↔ rosewood
   ironwood ↔ rosewood
5. Full mesh topology formed
6. PubSub topics synchronized

Error Handling

Service Start Failure

disc, err := discovery.NewMDNSDiscovery(ctx, h, serviceTag)
if err != nil {
    // Common causes:
    // - Port 5353 already in use
    // - Insufficient permissions to join the multicast group
    // - Network interface unavailable
    return fmt.Errorf("failed to start mDNS discovery: %w", err)
}

Connection Failures

Connection failures are logged but do not stop the discovery process:

❌ Failed to connect to peer Qm... : context deadline exceeded

Common Causes:

  • Peer behind firewall
  • Network congestion
  • Peer offline/restarting
  • Connection limit reached

Behavior: Discovery continues, will retry on next mDNS announcement.

Channel Full

If peer discovery is faster than connection handling:

⚠️ Discovery channel full, skipping peer Qm...

Buffer Size: 10 peers

Mitigation: Non-critical; the peer will be rediscovered on the next announcement cycle.

Performance Characteristics

Discovery Latency

  • Initial Advertisement: ~1-2 seconds after service start
  • Discovery Response: Typically < 1 second on LAN
  • Connection Establishment: 1-10 seconds (with 10s timeout)
  • Re-announcement: Periodic (standard mDNS timing)

Resource Usage

  • Memory: Minimal (~1MB per discovery service)
  • CPU: Very low (event-driven)
  • Network: Minimal (periodic multicast announcements)
  • Concurrent Connections: Handled by libp2p connection manager

Configuration Options

Service Tag Customization

// Production environment
disc, _ := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-production")

// Development environment
disc, _ := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-dev")

// Testing environment
disc, _ := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-test")

Use Case: Isolate environments on same physical network.
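
One possible pattern (not part of the package) is to select the tag from an environment variable so a single binary can join different logical networks; the variable name below is hypothetical:

// Hypothetical helper; CHORUS_MDNS_SERVICE_TAG is not a documented setting.
func serviceTagFromEnv() string {
    if tag := os.Getenv("CHORUS_MDNS_SERVICE_TAG"); tag != "" {
        return tag
    }
    return "CHORUS-peer-discovery" // default tag
}

disc, err := discovery.NewMDNSDiscovery(ctx, h, serviceTagFromEnv())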

Connection Timeout Adjustment

Currently hardcoded to 10 seconds. For customization:

// In handleDiscoveredPeers():
connectTimeout := 30 * time.Second  // Longer for slow networks
connectCtx, cancel := context.WithTimeout(d.ctx, connectTimeout)

Advanced Usage

Custom Peer Handling

Bypass automatic connection and implement custom logic:

// Subscribe to peer channel
peersChan := disc.PeersChan()

go func() {
    for peerInfo := range peersChan {
        // Custom filtering
        if shouldConnectToPeer(peerInfo) {
            // Custom connection logic
            connectWithRetry(peerInfo)
        }
    }
}()
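
shouldConnectToPeer and connectWithRetry above are placeholders. One possible shape for connectWithRetry, with a bounded number of attempts and exponential backoff (a sketch, not part of the package):

// Sketch of the hypothetical connectWithRetry helper referenced above.
func connectWithRetry(ctx context.Context, h host.Host, pi peer.AddrInfo) error {
    backoff := 2 * time.Second
    for attempt := 1; attempt <= 3; attempt++ {
        connectCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
        err := h.Connect(connectCtx, pi)
        cancel()
        if err == nil {
            return nil
        }
        if attempt < 3 {
            select {
            case <-ctx.Done():
                return ctx.Err()
            case <-time.After(backoff):
                backoff *= 2 // back off before the next attempt
            }
        }
    }
    return fmt.Errorf("giving up on peer %s after 3 attempts", pi.ID.ShortString())
}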

Discovery Metrics

type DiscoveryMetrics struct {
    PeersDiscovered   int
    ConnectionsSuccess int
    ConnectionsFailed  int
    LastDiscovery     time.Time
}

// Track metrics
var metrics DiscoveryMetrics

// In handleDiscoveredPeers():
metrics.PeersDiscovered++
if err := host.Connect(ctx, peerInfo); err != nil {
    metrics.ConnectionsFailed++
} else {
    metrics.ConnectionsSuccess++
}
metrics.LastDiscovery = time.Now()
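
Because handleDiscoveredPeers runs in a background goroutine, counters read from other goroutines (for example a metrics endpoint) should be synchronized. A sketch of the same idea using sync/atomic (assumes Go 1.19+):

import (
    "sync/atomic"
    "time"
)

type DiscoveryMetrics struct {
    PeersDiscovered    atomic.Int64
    ConnectionsSuccess atomic.Int64
    ConnectionsFailed  atomic.Int64
    lastDiscoveryUnix  atomic.Int64 // UnixNano of the most recent discovery
}

// RecordDiscovery is safe to call from the discovery goroutine.
func (m *DiscoveryMetrics) RecordDiscovery(connectErr error) {
    m.PeersDiscovered.Add(1)
    if connectErr != nil {
        m.ConnectionsFailed.Add(1)
    } else {
        m.ConnectionsSuccess.Add(1)
    }
    m.lastDiscoveryUnix.Store(time.Now().UnixNano())
}

func (m *DiscoveryMetrics) LastDiscovery() time.Time {
    return time.Unix(0, m.lastDiscoveryUnix.Load())
}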

Comparison with Other Discovery Methods

mDNS vs DHT

Feature            mDNS                     DHT (Kademlia)
Network Scope      Local network only       Global
Setup              Zero-config              Requires bootstrap nodes
Speed              Very fast (< 1s)         Slower (seconds to minutes)
Privacy            Local only               Public network
Reliability        High on LAN              Depends on DHT health
Use Case           LAN clusters             Internet-wide P2P

CHORUS Choice: mDNS for local agent clusters, DHT could be added for internet-wide coordination.
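
If internet-wide coordination were added, the usual pattern is to run a Kademlia DHT alongside mDNS on the same host; a rough sketch using go-libp2p-kad-dht (an assumed dependency, not currently wired into this package):

import (
    dht "github.com/libp2p/go-libp2p-kad-dht"
)

// Sketch: run a Kademlia DHT in parallel with mDNS discovery.
kadDHT, err := dht.New(ctx, h)
if err != nil {
    return err
}
if err := kadDHT.Bootstrap(ctx); err != nil { // join the wider DHT network
    return err
}
// mDNS keeps discovering LAN peers; the DHT provides internet-wide peer routing.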

mDNS vs Bootstrap List

Feature            mDNS                Bootstrap List
Configuration      None                Manual list
Maintenance        Automatic           Manual updates
Scalability        Limited to LAN      Unlimited
Flexibility        Dynamic             Static
Failure Handling   Auto-discovery      Manual intervention

CHORUS Choice: mDNS for local discovery, bootstrap list as fallback.
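
A sketch of the bootstrap-list fallback: dial a static set of multiaddresses (placeholders) when mDNS has not produced any peers, reusing the multiaddr helpers shown earlier:

// Sketch: fall back to a static bootstrap list (addresses are placeholders).
func connectBootstrapPeers(ctx context.Context, h host.Host, addrs []string) {
    for _, s := range addrs {
        m, err := ma.NewMultiaddr(s)
        if err != nil {
            continue // skip malformed entries
        }
        info, err := peer.AddrInfoFromP2pAddr(m)
        if err != nil {
            continue // entry lacks a /p2p/<peer-id> component
        }
        connectCtx, cancel := context.WithTimeout(ctx, 10*time.Second)
        if err := h.Connect(connectCtx, *info); err != nil {
            fmt.Printf("❌ Bootstrap dial to %s failed: %v\n", info.ID.ShortString(), err)
        }
        cancel()
    }
}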

libp2p Integration

Host Requirement

mDNS discovery requires a libp2p host:

import (
    "github.com/libp2p/go-libp2p"
    "github.com/libp2p/go-libp2p/core/host"
)

// Create libp2p host
h, err := libp2p.New(
    libp2p.ListenAddrStrings(
        "/ip4/0.0.0.0/tcp/4001",
        "/ip6/::/tcp/4001",
    ),
)
if err != nil {
    return err
}

// Initialize mDNS discovery with host
disc, err := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-peer-discovery")

Connection Manager Integration

mDNS discovery works with libp2p connection manager:

// Recent go-libp2p: NewConnManager takes functional options and returns an error.
cm, err := connmgr.NewConnManager(
    100,  // Low water mark
    400,  // High water mark
    connmgr.WithGracePeriod(time.Minute),
)
if err != nil {
    return err
}

h, err := libp2p.New(
    libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/4001"),
    libp2p.ConnectionManager(cm),
)

// mDNS-discovered connections managed by connection manager
disc, err := discovery.NewMDNSDiscovery(ctx, h, "")

Security Considerations

Trust Model

mDNS operates on local network trust:

  • Assumes local network is trusted
  • No authentication at mDNS layer
  • Authentication handled by libp2p security transport

Attack Vectors

  1. Peer ID Spoofing: Mitigated by libp2p peer ID verification
  2. DoS via Fake Peers: Limited by channel buffer and connection timeout
  3. Network Snooping: mDNS announcements are plaintext (by design)

Best Practices

  1. Use libp2p Security: TLS or Noise transport for encrypted connections (see the sketch below)
  2. Peer Authentication: Verify peer identities after connection
  3. Network Isolation: Deploy on trusted networks
  4. Connection Limits: Use libp2p connection manager
  5. Monitoring: Log all discovery and connection events
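
A sketch of practices 1 and 4 applied at host construction time; go-libp2p enables secure transports by default, so this simply makes the choice explicit and reuses the connection manager (cm) from the Connection Manager Integration example above:

import (
    noise "github.com/libp2p/go-libp2p/p2p/security/noise"
)

h, err := libp2p.New(
    libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/4001"),
    libp2p.Security(noise.ID, noise.New), // best practice 1: encrypted connections
    libp2p.ConnectionManager(cm),         // best practice 4: connection limits
)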

Troubleshooting

No Peers Discovered

Symptoms: Service starts but no peers found.

Checks:

  1. Verify all agents on same subnet
  2. Check firewall rules (UDP 5353)
  3. Verify mDNS/multicast not blocked by network
  4. Check service tag matches across agents
  5. Verify no mDNS conflicts with other services

Connection Failures

Symptoms: Peers discovered but connections fail.

Checks:

  1. Verify libp2p port open (default: TCP 4001)
  2. Check connection manager limits
  3. Verify peer addresses are reachable
  4. Check for NAT/firewall between peers
  5. Verify sufficient system resources (file descriptors, memory)

High CPU/Network Usage

Symptoms: Excessive mDNS traffic or CPU usage.

Causes:

  • Rapid peer restarts (re-announcements)
  • Many peers on network
  • Short announcement intervals

Solutions:

  • Implement connection caching
  • Adjust mDNS announcement timing
  • Use connection limits

Monitoring and Debugging

Discovery Events

// Log all discovery events
disc, _ := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-peer-discovery")

peersChan := disc.PeersChan()
go func() {
    for peerInfo := range peersChan {
        logger.Info("Discovered peer",
            "peer_id", peerInfo.ID.String(),
            "addresses", peerInfo.Addrs,
            "timestamp", time.Now())
    }
}()

Connection Status

// Monitor connection status
func monitorConnections(h host.Host) {
    ticker := time.NewTicker(30 * time.Second)
    defer ticker.Stop()

    for range ticker.C {
        peers := h.Network().Peers()
        fmt.Printf("📊 Connected to %d peers: %v\n",
                  len(peers), peers)
    }
}

See Also