Package: discovery
Location: /home/tony/chorus/project-queues/active/CHORUS/discovery/
Overview
The discovery package provides mDNS-based peer discovery for automatic detection and connection of CHORUS agents on the local network. It enables zero-configuration peer discovery using multicast DNS (mDNS), allowing agents to find and connect to each other without manual configuration or central coordination.
Architecture
mDNS Overview
Multicast DNS (mDNS) is a protocol that resolves hostnames to IP addresses within small networks that do not include a local name server. It uses:
- Multicast IP: 224.0.0.251 (IPv4) or FF02::FB (IPv6)
- UDP Port: 5353
- Service Discovery: Advertises and discovers services on the local network
CHORUS Service Tag
Default Service Name: "CHORUS-peer-discovery"
This service tag identifies CHORUS peers on the network. All CHORUS agents advertise themselves with this tag and listen for other agents using the same tag.
Core Components
MDNSDiscovery
Main structure managing mDNS discovery operations.
type MDNSDiscovery struct {
host host.Host // libp2p host
service mdns.Service // mDNS service
notifee *mdnsNotifee // Peer notification handler
ctx context.Context // Discovery context
cancel context.CancelFunc // Context cancellation
serviceTag string // Service name (default: "CHORUS-peer-discovery")
}
Key Responsibilities:
- Advertise local agent as mDNS service
- Listen for mDNS announcements from other agents
- Automatically connect to discovered peers
- Handle peer connection lifecycle
mdnsNotifee
Internal notification handler for discovered peers.
type mdnsNotifee struct {
h host.Host // libp2p host
ctx context.Context // Context for operations
peersChan chan peer.AddrInfo // Channel for discovered peers (buffer: 10)
}
Implements the mDNS notification interface to receive peer discovery events.
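A minimal sketch of the notification callback, assuming the standard go-libp2p mdns.Notifee contract (a single HandlePeerFound method); the non-blocking send is what produces the "channel full" behavior covered under Error Handling below.
func (n *mdnsNotifee) HandlePeerFound(pi peer.AddrInfo) {
	select {
	case n.peersChan <- pi:
		// Queued for the background connection handler
	default:
		// Buffer (10) is full; the peer will be rediscovered on its next announcement
		fmt.Printf("⚠️ Discovery channel full, skipping peer %s\n", pi.ID.ShortString())
	}
}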
Discovery Flow
1. Service Initialization
discovery, err := NewMDNSDiscovery(ctx, host, "CHORUS-peer-discovery")
if err != nil {
return fmt.Errorf("failed to start mDNS discovery: %w", err)
}
Initialization Steps:
- Create discovery context with cancellation
- Initialize mdnsNotifee with peer channel
- Create mDNS service with service tag
- Start mDNS service (begins advertising and listening)
- Launch background peer connection handler
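A minimal constructor sketch covering these steps, assuming the go-libp2p mdns helper (github.com/libp2p/go-libp2p/p2p/discovery/mdns); field names follow the structs documented above, and the production implementation may differ in logging and error detail.
func NewMDNSDiscovery(ctx context.Context, h host.Host, serviceTag string) (*MDNSDiscovery, error) {
	if serviceTag == "" {
		serviceTag = "CHORUS-peer-discovery" // Default service tag
	}

	// Step 1: discovery context with cancellation
	discoveryCtx, cancel := context.WithCancel(ctx)

	// Step 2: notifee with buffered peer channel
	notifee := &mdnsNotifee{
		h:         h,
		ctx:       discoveryCtx,
		peersChan: make(chan peer.AddrInfo, 10),
	}

	// Steps 3-4: create and start the mDNS service (advertise + listen)
	service := mdns.NewMdnsService(h, serviceTag, notifee)
	if err := service.Start(); err != nil {
		cancel()
		return nil, fmt.Errorf("failed to start mDNS discovery: %w", err)
	}

	d := &MDNSDiscovery{
		host:       h,
		service:    service,
		notifee:    notifee,
		ctx:        discoveryCtx,
		cancel:     cancel,
		serviceTag: serviceTag,
	}

	// Step 5: background peer connection handler
	go d.handleDiscoveredPeers()
	return d, nil
}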
2. Service Advertisement
When the service starts, it automatically advertises:
Service Type: _CHORUS-peer-discovery._udp.local
Port: libp2p host port
Addresses: All local IP addresses (IPv4 and IPv6)
This allows other CHORUS agents on the network to discover this peer.
3. Peer Discovery
Discovery Process:
1. mDNS Service listens for multicast announcements
├─ Receives service announcement from peer
└─ Extracts peer.AddrInfo (ID + addresses)
2. mdnsNotifee.HandlePeerFound() called
├─ Peer info sent to peersChan
└─ Non-blocking send (drops if channel full)
3. handleDiscoveredPeers() goroutine receives
├─ Skip if peer is self
├─ Skip if already connected
└─ Attempt connection
4. Automatic Connection
func (d *MDNSDiscovery) handleDiscoveredPeers() {
for {
select {
case <-d.ctx.Done():
return
case peerInfo := <-d.notifee.peersChan:
// Skip self
if peerInfo.ID == d.host.ID() {
continue
}
// Check if already connected (network.Connected from core/network)
if d.host.Network().Connectedness(peerInfo.ID) == network.Connected {
continue
}
// Attempt connection with timeout
connectCtx, cancel := context.WithTimeout(d.ctx, 10*time.Second)
err := d.host.Connect(connectCtx, peerInfo)
cancel()
if err != nil {
fmt.Printf("❌ Failed to connect to peer %s: %v\n",
peerInfo.ID.ShortString(), err)
} else {
fmt.Printf("✅ Successfully connected to peer %s\n",
peerInfo.ID.ShortString())
}
}
}
}
Connection Features:
- 10-second timeout per connection attempt
- Idempotent: Safe to attempt connection to already-connected peer
- Self-filtering: Ignores own mDNS announcements
- Duplicate filtering: Checks existing connections before attempting
- Non-blocking: Runs in background goroutine
Usage
Basic Usage
import (
"context"
"fmt"
"chorus/discovery"
"github.com/libp2p/go-libp2p/core/host"
)
func setupDiscovery(ctx context.Context, h host.Host) (*discovery.MDNSDiscovery, error) {
// Start mDNS discovery with default service tag
disc, err := discovery.NewMDNSDiscovery(ctx, h, "")
if err != nil {
return nil, err
}
fmt.Println("🔍 mDNS discovery started")
return disc, nil
}
Custom Service Tag
// Use custom service tag for specific environments
disc, err := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-dev-network")
if err != nil {
return nil, err
}
Monitoring Discovered Peers
// Access peer channel for custom handling
peersChan := disc.PeersChan()
go func() {
for peerInfo := range peersChan {
fmt.Printf("🔍 Discovered peer: %s with %d addresses\n",
peerInfo.ID.ShortString(),
len(peerInfo.Addrs))
// Custom peer processing
handleNewPeer(peerInfo)
}
}()
Graceful Shutdown
// Close discovery service
if err := disc.Close(); err != nil {
log.Printf("Error closing discovery: %v", err)
}
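Based on the fields documented above, Close plausibly cancels the discovery context (stopping the connection handler goroutine) and shuts down the mDNS service; a sketch:
func (d *MDNSDiscovery) Close() error {
	d.cancel()               // Stops handleDiscoveredPeers()
	return d.service.Close() // Stops advertising and listening
}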
Peer Information Structure
peer.AddrInfo
Discovered peers are represented as libp2p peer.AddrInfo:
type AddrInfo struct {
ID peer.ID // Unique peer identifier
Addrs []multiaddr.Multiaddr // Peer addresses
}
Example Multiaddresses:
/ip4/192.168.1.100/tcp/4001/p2p/QmPeerID...
/ip6/fe80::1/tcp/4001/p2p/QmPeerID...
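A full multiaddress that includes the /p2p component can be converted back into an AddrInfo with the go-libp2p helper peer.AddrInfoFromString (the peer ID below is the placeholder from the example above):
addrInfo, err := peer.AddrInfoFromString("/ip4/192.168.1.100/tcp/4001/p2p/QmPeerID...")
if err != nil {
	return fmt.Errorf("invalid multiaddress: %w", err)
}
fmt.Printf("Peer %s has %d address(es)\n", addrInfo.ID.ShortString(), len(addrInfo.Addrs))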
Network Configuration
Firewall Requirements
mDNS requires the following ports to be open:
- UDP 5353: mDNS multicast
- TCP/UDP 4001 (or configured libp2p port): libp2p connections
Network Scope
mDNS operates on local network only:
- Same subnet required for discovery
- Does not traverse routers (by design)
- Ideal for LAN-based agent clusters
Multicast Group
mDNS uses standard multicast groups:
- IPv4: 224.0.0.251
- IPv6: FF02::FB
Integration with CHORUS
Cluster Formation
mDNS discovery enables automatic cluster formation:
Startup Sequence:
1. Agent starts with libp2p host
2. mDNS discovery initialized
3. Agent advertises itself via mDNS
4. Agent listens for other agents
5. Auto-connects to discovered peers
6. PubSub gossip network forms
7. Task coordination begins
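A condensed sketch of this sequence, assuming go-libp2p-pubsub for the gossip layer; the actual wiring lives in internal/runtime/ and carries considerably more configuration.
import (
	"context"

	"chorus/discovery"

	"github.com/libp2p/go-libp2p"
	pubsub "github.com/libp2p/go-libp2p-pubsub"
)

func startNode(ctx context.Context) (*discovery.MDNSDiscovery, *pubsub.PubSub, error) {
	// 1. Agent starts with libp2p host
	h, err := libp2p.New(libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/4001"))
	if err != nil {
		return nil, nil, err
	}

	// 2-5. mDNS discovery: advertise, listen, auto-connect to discovered peers
	disc, err := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-peer-discovery")
	if err != nil {
		return nil, nil, err
	}

	// 6. PubSub gossip network forms over the discovered connections
	ps, err := pubsub.NewGossipSub(ctx, h)
	if err != nil {
		disc.Close()
		return nil, nil, err
	}

	// 7. Task coordination begins (see coordinator/ and pubsub/)
	return disc, ps, nil
}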
Multi-Node Cluster Example
Network: 192.168.1.0/24
Node 1 (walnut): 192.168.1.27 - Agent: backend-dev
Node 2 (ironwood): 192.168.1.72 - Agent: frontend-dev
Node 3 (rosewood): 192.168.1.113 - Agent: devops-specialist
Discovery Flow:
1. All nodes start with CHORUS-peer-discovery tag
2. Each node multicasts to 224.0.0.251:5353
3. All nodes receive each other's announcements
4. Automatic connection establishment:
walnut ↔ ironwood
walnut ↔ rosewood
ironwood ↔ rosewood
5. Full mesh topology formed
6. PubSub topics synchronized
Error Handling
Service Start Failure
disc, err := discovery.NewMDNSDiscovery(ctx, h, serviceTag)
if err != nil {
// Common causes:
// - Port 5353 already in use
// - Insufficient permissions to join multicast groups
// - Network interface unavailable
return fmt.Errorf("failed to start mDNS discovery: %w", err)
}
Connection Failures
Connection failures are logged but do not stop the discovery process:
❌ Failed to connect to peer Qm... : context deadline exceeded
Common Causes:
- Peer behind firewall
- Network congestion
- Peer offline/restarting
- Connection limit reached
Behavior: Discovery continues, will retry on next mDNS announcement.
Channel Full
If peer discovery is faster than connection handling:
⚠️ Discovery channel full, skipping peer Qm...
Buffer Size: 10 peers
Mitigation: Non-critical; the peer will be rediscovered on the next announcement cycle.
Performance Characteristics
Discovery Latency
- Initial Advertisement: ~1-2 seconds after service start
- Discovery Response: Typically < 1 second on LAN
- Connection Establishment: 1-10 seconds (with 10s timeout)
- Re-announcement: Periodic (standard mDNS timing)
Resource Usage
- Memory: Minimal (~1MB per discovery service)
- CPU: Very low (event-driven)
- Network: Minimal (periodic multicast announcements)
- Concurrent Connections: Handled by libp2p connection manager
Configuration Options
Service Tag Customization
// Production environment
disc, _ := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-production")
// Development environment
disc, _ := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-dev")
// Testing environment
disc, _ := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-test")
Use Case: Isolate environments on same physical network.
Connection Timeout Adjustment
The connection timeout is currently hardcoded to 10 seconds. To customize it, modify the timeout value:
// In handleDiscoveredPeers():
connectTimeout := 30 * time.Second // Longer for slow networks
connectCtx, cancel := context.WithTimeout(d.ctx, connectTimeout)
Advanced Usage
Custom Peer Handling
Bypass automatic connection and implement custom logic:
// Subscribe to peer channel
peersChan := disc.PeersChan()
go func() {
for peerInfo := range peersChan {
// Custom filtering
if shouldConnectToPeer(peerInfo) {
// Custom connection logic
connectWithRetry(peerInfo)
}
}
}()
Discovery Metrics
type DiscoveryMetrics struct {
PeersDiscovered int
ConnectionsSuccess int
ConnectionsFailed int
LastDiscovery time.Time
}
// Track metrics
var metrics DiscoveryMetrics
// In handleDiscoveredPeers():
metrics.PeersDiscovered++
if err := d.host.Connect(connectCtx, peerInfo); err != nil {
metrics.ConnectionsFailed++
} else {
metrics.ConnectionsSuccess++
}
metrics.LastDiscovery = time.Now()
Comparison with Other Discovery Methods
mDNS vs DHT
| Feature | mDNS | DHT (Kademlia) |
|---|---|---|
| Network Scope | Local network only | Global |
| Setup | Zero-config | Requires bootstrap nodes |
| Speed | Very fast (< 1s) | Slower (seconds to minutes) |
| Privacy | Local only | Public network |
| Reliability | High on LAN | Depends on DHT health |
| Use Case | LAN clusters | Internet-wide P2P |
CHORUS Choice: mDNS for local agent clusters, DHT could be added for internet-wide coordination.
mDNS vs Bootstrap List
| Feature | mDNS | Bootstrap List |
|---|---|---|
| Configuration | None | Manual list |
| Maintenance | Automatic | Manual updates |
| Scalability | Limited to LAN | Unlimited |
| Flexibility | Dynamic | Static |
| Failure Handling | Auto-discovery | Manual intervention |
CHORUS Choice: mDNS for local discovery, bootstrap list as fallback.
libp2p Integration
Host Requirement
mDNS discovery requires a libp2p host:
import (
"context"
"chorus/discovery"
"github.com/libp2p/go-libp2p"
"github.com/libp2p/go-libp2p/core/host"
)
// Create libp2p host
h, err := libp2p.New(
libp2p.ListenAddrStrings(
"/ip4/0.0.0.0/tcp/4001",
"/ip6/::/tcp/4001",
),
)
if err != nil {
return err
}
// Initialize mDNS discovery with host
disc, err := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-peer-discovery")
Connection Manager Integration
mDNS discovery works with libp2p connection manager:
h, err := libp2p.New(
libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/4001"),
libp2p.ConnectionManager(connmgr.NewConnManager(
100, // Low water mark
400, // High water mark
time.Minute,
)),
)
// mDNS-discovered connections managed by connection manager
disc, err := discovery.NewMDNSDiscovery(ctx, h, "")
Security Considerations
Trust Model
mDNS operates on local network trust:
- Assumes local network is trusted
- No authentication at mDNS layer
- Authentication handled by libp2p security transport
Attack Vectors
- Peer ID Spoofing: Mitigated by libp2p peer ID verification
- DoS via Fake Peers: Limited by channel buffer and connection timeout
- Network Snooping: mDNS announcements are plaintext (by design)
Best Practices
- Use libp2p Security: TLS or Noise transport for encrypted connections
- Peer Authentication: Verify peer identities after connection
- Network Isolation: Deploy on trusted networks
- Connection Limits: Use libp2p connection manager
- Monitoring: Log all discovery and connection events
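A hedged sketch of the first two practices, assuming current go-libp2p import paths for the TLS and Noise security transports:
import (
	"github.com/libp2p/go-libp2p"
	noise "github.com/libp2p/go-libp2p/p2p/security/noise"
	libp2ptls "github.com/libp2p/go-libp2p/p2p/security/tls"
)

h, err := libp2p.New(
	libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/4001"),
	// Encrypted connections; peer identities are verified during the security handshake
	libp2p.Security(libp2ptls.ID, libp2ptls.New),
	libp2p.Security(noise.ID, noise.New),
)
if err != nil {
	return err
}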
Troubleshooting
No Peers Discovered
Symptoms: Service starts but no peers found.
Checks:
- Verify all agents on same subnet
- Check firewall rules (UDP 5353)
- Verify mDNS/multicast not blocked by network
- Check service tag matches across agents
- Verify no mDNS conflicts with other services
Connection Failures
Symptoms: Peers discovered but connections fail.
Checks:
- Verify libp2p port open (default: TCP 4001)
- Check connection manager limits
- Verify peer addresses are reachable
- Check for NAT/firewall between peers
- Verify sufficient system resources (file descriptors, memory)
High CPU/Network Usage
Symptoms: Excessive mDNS traffic or CPU usage.
Causes:
- Rapid peer restarts (re-announcements)
- Many peers on network
- Short announcement intervals
Solutions:
- Implement connection caching
- Adjust mDNS announcement timing
- Use connection limits
Monitoring and Debugging
Discovery Events
// Log all discovery events
disc, _ := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-peer-discovery")
peersChan := disc.PeersChan()
go func() {
for peerInfo := range peersChan {
logger.Info("Discovered peer",
"peer_id", peerInfo.ID.String(),
"addresses", peerInfo.Addrs,
"timestamp", time.Now())
}
}()
Connection Status
// Monitor connection status
func monitorConnections(h host.Host) {
ticker := time.NewTicker(30 * time.Second)
defer ticker.Stop()
for range ticker.C {
peers := h.Network().Peers()
fmt.Printf("📊 Connected to %d peers: %v\n",
len(peers), peers)
}
}
See Also
- coordinator/ - Task coordination using discovered peers
- pubsub/ - PubSub over discovered peer network
- internal/runtime/ - Runtime initialization with discovery
- libp2p Documentation - libp2p concepts and APIs
- mDNS RFC 6762 - mDNS protocol specification