# Package: discovery

**Location**: `/home/tony/chorus/project-queues/active/CHORUS/discovery/`

## Overview

The `discovery` package provides **mDNS-based peer discovery** for automatic detection and connection of CHORUS agents on the local network. It enables zero-configuration peer discovery using multicast DNS (mDNS), allowing agents to find and connect to each other without manual configuration or central coordination.

## Architecture

### mDNS Overview

Multicast DNS (mDNS) is a protocol that resolves hostnames to IP addresses within small networks that do not include a local name server. It uses:

- **Multicast IP**: 224.0.0.251 (IPv4) or FF02::FB (IPv6)
- **UDP Port**: 5353
- **Service Discovery**: Advertises and discovers services on the local network

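The multicast endpoints listed above can be inspected with the Go standard library alone. The following is a minimal sketch, not part of the `discovery` package; the helper name `mdnsGroups` is an assumption made for illustration. It only parses the well-known addresses and does not open any sockets.

```go
package main

import (
	"fmt"
	"net"
)

// mdnsGroups returns the standard mDNS multicast endpoints (RFC 6762).
// Hypothetical helper for illustration; it performs no network I/O.
func mdnsGroups() (v4, v6 *net.UDPAddr, err error) {
	v4, err = net.ResolveUDPAddr("udp", "224.0.0.251:5353")
	if err != nil {
		return nil, nil, err
	}
	v6, err = net.ResolveUDPAddr("udp", "[ff02::fb]:5353")
	if err != nil {
		return nil, nil, err
	}
	return v4, v6, nil
}

func main() {
	v4, v6, err := mdnsGroups()
	if err != nil {
		fmt.Println("resolve failed:", err)
		return
	}
	// Both groups are link-local multicast, so routers do not forward them.
	fmt.Println(v4, v4.IP.IsLinkLocalMulticast())
	fmt.Println(v6, v6.IP.IsLinkLocalMulticast())
}
```

Because both groups are link-local, discovery is confined to a single network segment.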
### CHORUS Service Tag

**Default Service Name**: `"CHORUS-peer-discovery"`

This service tag identifies CHORUS peers on the network. All CHORUS agents advertise themselves with this tag and listen for other agents using the same tag.

## Core Components

### MDNSDiscovery

Main structure managing mDNS discovery operations.

```go
type MDNSDiscovery struct {
	host       host.Host          // libp2p host
	service    mdns.Service       // mDNS service
	notifee    *mdnsNotifee       // Peer notification handler
	ctx        context.Context    // Discovery context
	cancel     context.CancelFunc // Context cancellation
	serviceTag string             // Service name (default: "CHORUS-peer-discovery")
}
```

**Key Responsibilities:**
- Advertise local agent as mDNS service
- Listen for mDNS announcements from other agents
- Automatically connect to discovered peers
- Handle peer connection lifecycle

### mdnsNotifee

Internal notification handler for discovered peers.

```go
type mdnsNotifee struct {
	h         host.Host          // libp2p host
	ctx       context.Context    // Context for operations
	peersChan chan peer.AddrInfo // Channel for discovered peers (buffer: 10)
}
```

Implements the mDNS notification interface to receive peer discovery events.

## Discovery Flow

### 1. Service Initialization

```go
discovery, err := NewMDNSDiscovery(ctx, host, "CHORUS-peer-discovery")
if err != nil {
	return fmt.Errorf("failed to start mDNS discovery: %w", err)
}
```

**Initialization Steps:**
1. Create discovery context with cancellation
2. Initialize mdnsNotifee with peer channel
3. Create mDNS service with service tag
4. Start mDNS service (begins advertising and listening)
5. Launch background peer connection handler

### 2. Service Advertisement

When the service starts, it automatically advertises:

```
Service Type: _CHORUS-peer-discovery._udp.local
Port: libp2p host port
Addresses: All local IP addresses (IPv4 and IPv6)
```

This allows other CHORUS agents on the network to discover this peer.

### 3. Peer Discovery

**Discovery Process:**

```
1. mDNS Service listens for multicast announcements
   ├─ Receives service announcement from peer
   └─ Extracts peer.AddrInfo (ID + addresses)

2. mdnsNotifee.HandlePeerFound() called
   ├─ Peer info sent to peersChan
   └─ Non-blocking send (drops if channel full)

3. handleDiscoveredPeers() goroutine receives
   ├─ Skip if peer is self
   ├─ Skip if already connected
   └─ Attempt connection
```

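The non-blocking send in step 2 is the piece that keeps a slow connection handler from stalling the mDNS listener. The following is a minimal, standard-library sketch of that pattern. The `peerInfo` type stands in for libp2p's `peer.AddrInfo`, and `notifee`, `newNotifee`, and `handlePeerFound` are hypothetical names that only mirror the behavior described above, not the package's actual code.

```go
package main

import "fmt"

// peerInfo is a simplified stand-in for libp2p's peer.AddrInfo.
type peerInfo struct {
	ID    string
	Addrs []string
}

// notifee mirrors the documented design: a buffered channel fed by a
// non-blocking send. All names here are illustrative assumptions.
type notifee struct {
	peersChan chan peerInfo
}

func newNotifee(buffer int) *notifee {
	return &notifee{peersChan: make(chan peerInfo, buffer)}
}

// handlePeerFound queues the peer if there is room and reports whether it
// was queued. It never blocks: when the buffer is full the peer is dropped.
func (n *notifee) handlePeerFound(p peerInfo) bool {
	select {
	case n.peersChan <- p:
		return true
	default:
		return false
	}
}

func main() {
	n := newNotifee(10) // same buffer size as documented for peersChan
	dropped := 0
	for i := 0; i < 15; i++ {
		if !n.handlePeerFound(peerInfo{ID: fmt.Sprintf("peer-%d", i)}) {
			dropped++
		}
	}
	fmt.Printf("queued=%d dropped=%d\n", len(n.peersChan), dropped) // queued=10 dropped=5
}
```

Dropping is acceptable here because mDNS announcements repeat, so a dropped peer gets another chance on a later announcement.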
### 4. Automatic Connection

```go
func (d *MDNSDiscovery) handleDiscoveredPeers() {
	for {
		select {
		case <-d.ctx.Done():
			return
		case peerInfo := <-d.notifee.peersChan:
			// Skip self
			if peerInfo.ID == d.host.ID() {
				continue
			}

			// Check if already connected
			// (network.Connected, from github.com/libp2p/go-libp2p/core/network, has the value 1)
			if d.host.Network().Connectedness(peerInfo.ID) == network.Connected {
				continue
			}

			// Attempt connection with timeout
			connectCtx, cancel := context.WithTimeout(d.ctx, 10*time.Second)
			err := d.host.Connect(connectCtx, peerInfo)
			cancel()

			if err != nil {
				fmt.Printf("❌ Failed to connect to peer %s: %v\n",
					peerInfo.ID.ShortString(), err)
			} else {
				fmt.Printf("✅ Successfully connected to peer %s\n",
					peerInfo.ID.ShortString())
			}
		}
	}
}
```

**Connection Features:**
- **10-second timeout** per connection attempt
- **Idempotent**: Safe to attempt connection to already-connected peer
- **Self-filtering**: Ignores own mDNS announcements
- **Duplicate filtering**: Checks existing connections before attempting
- **Non-blocking**: Runs in background goroutine

## Usage

### Basic Usage

```go
import (
	"context"
	"fmt"

	"chorus/discovery"
	"github.com/libp2p/go-libp2p/core/host"
)

func setupDiscovery(ctx context.Context, h host.Host) (*discovery.MDNSDiscovery, error) {
	// Start mDNS discovery with default service tag
	disc, err := discovery.NewMDNSDiscovery(ctx, h, "")
	if err != nil {
		return nil, err
	}

	fmt.Println("🔍 mDNS discovery started")
	return disc, nil
}
```

### Custom Service Tag

```go
// Use custom service tag for specific environments
disc, err := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-dev-network")
if err != nil {
	return nil, err
}
```

### Monitoring Discovered Peers

```go
// Access peer channel for custom handling
peersChan := disc.PeersChan()

go func() {
	for peerInfo := range peersChan {
		fmt.Printf("🔍 Discovered peer: %s with %d addresses\n",
			peerInfo.ID.ShortString(),
			len(peerInfo.Addrs))

		// Custom peer processing
		handleNewPeer(peerInfo)
	}
}()
```

### Graceful Shutdown

```go
// Close discovery service
if err := disc.Close(); err != nil {
	log.Printf("Error closing discovery: %v", err)
}
```

## Peer Information Structure

### peer.AddrInfo

Discovered peers are represented as libp2p `peer.AddrInfo`:

```go
type AddrInfo struct {
	ID    peer.ID               // Unique peer identifier
	Addrs []multiaddr.Multiaddr // Peer addresses
}
```

**Example Multiaddresses:**
```
/ip4/192.168.1.100/tcp/4001/p2p/QmPeerID...
/ip6/fe80::1/tcp/4001/p2p/QmPeerID...
```

## Network Configuration

### Firewall Requirements

mDNS requires the following ports to be open:

- **UDP 5353**: mDNS multicast
- **TCP/UDP 4001** (or configured libp2p port): libp2p connections

### Network Scope

mDNS operates on **local network** only:
- Same subnet required for discovery
- Does not traverse routers (by design)
- Ideal for LAN-based agent clusters

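When agents fail to see each other, a quick first check is whether their addresses fall inside the same subnet. The following is a minimal sketch using only the standard library; `sameSubnet` is a hypothetical helper, not part of the `discovery` package. Sharing a subnet is necessary but not sufficient, since multicast can still be filtered by switches or host firewalls.

```go
package main

import (
	"fmt"
	"net"
)

// sameSubnet reports whether every address in addrs lies inside cidr.
// Hypothetical diagnostic helper for illustration.
func sameSubnet(cidr string, addrs ...string) (bool, error) {
	_, network, err := net.ParseCIDR(cidr)
	if err != nil {
		return false, err
	}
	for _, a := range addrs {
		ip := net.ParseIP(a)
		if ip == nil {
			return false, fmt.Errorf("invalid IP address %q", a)
		}
		if !network.Contains(ip) {
			return false, nil
		}
	}
	return true, nil
}

func main() {
	ok, err := sameSubnet("192.168.1.0/24", "192.168.1.27", "192.168.1.72", "192.168.1.113")
	fmt.Println(ok, err)

	ok, err = sameSubnet("192.168.1.0/24", "192.168.1.27", "10.0.0.5")
	fmt.Println(ok, err)
}
```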
### Multicast Group

mDNS uses standard multicast groups:
- **IPv4**: 224.0.0.251
- **IPv6**: FF02::FB

## Integration with CHORUS

### Cluster Formation

mDNS discovery enables automatic cluster formation:

```
Startup Sequence:
1. Agent starts with libp2p host
2. mDNS discovery initialized
3. Agent advertises itself via mDNS
4. Agent listens for other agents
5. Auto-connects to discovered peers
6. PubSub gossip network forms
7. Task coordination begins
```

### Multi-Node Cluster Example

```
Network: 192.168.1.0/24

Node 1 (walnut):   192.168.1.27  - Agent: backend-dev
Node 2 (ironwood): 192.168.1.72  - Agent: frontend-dev
Node 3 (rosewood): 192.168.1.113 - Agent: devops-specialist

Discovery Flow:
1. All nodes start with CHORUS-peer-discovery tag
2. Each node multicasts to 224.0.0.251:5353
3. All nodes receive each other's announcements
4. Automatic connection establishment:
   walnut ↔ ironwood
   walnut ↔ rosewood
   ironwood ↔ rosewood
5. Full mesh topology formed
6. PubSub topics synchronized
```

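In a full mesh, every pair of nodes shares one connection, so n nodes form n(n-1)/2 links. The following is a minimal sketch that enumerates those pairs for the three example nodes; `meshLinks` is a hypothetical helper introduced only for illustration.

```go
package main

import "fmt"

// meshLinks lists every unordered pair of nodes, i.e. the connections a
// full mesh needs. For n nodes it returns n*(n-1)/2 links.
func meshLinks(nodes []string) [][2]string {
	links := make([][2]string, 0, len(nodes)*(len(nodes)-1)/2)
	for i := 0; i < len(nodes); i++ {
		for j := i + 1; j < len(nodes); j++ {
			links = append(links, [2]string{nodes[i], nodes[j]})
		}
	}
	return links
}

func main() {
	for _, l := range meshLinks([]string{"walnut", "ironwood", "rosewood"}) {
		fmt.Printf("%s <-> %s\n", l[0], l[1])
	}
}
```

The link count grows quadratically with cluster size, which is one reason to pair mDNS discovery with a connection manager on larger networks.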
## Error Handling

### Service Start Failure

```go
disc, err := discovery.NewMDNSDiscovery(ctx, h, serviceTag)
if err != nil {
	// Common causes:
	// - Port 5353 already in use
	// - Insufficient permissions (multicast access is required)
	// - Network interface unavailable
	return fmt.Errorf("failed to start mDNS discovery: %w", err)
}
```

### Connection Failures

Connection failures are logged but do not stop the discovery process:

```
❌ Failed to connect to peer Qm... : context deadline exceeded
```

**Common Causes:**
- Peer behind firewall
- Network congestion
- Peer offline/restarting
- Connection limit reached

**Behavior**: Discovery continues, and the connection is retried on the next mDNS announcement from that peer.

### Channel Full

If peers are discovered faster than connections are handled, the peer channel fills up and new discoveries are skipped:

```
⚠️ Discovery channel full, skipping peer Qm...
```

**Buffer Size**: 10 peers
**Mitigation**: Non-critical; the peer will be rediscovered on the next announcement cycle.

## Performance Characteristics

### Discovery Latency

- **Initial Advertisement**: ~1-2 seconds after service start
- **Discovery Response**: Typically < 1 second on LAN
- **Connection Establishment**: 1-10 seconds (with 10s timeout)
- **Re-announcement**: Periodic (standard mDNS timing)

### Resource Usage

- **Memory**: Minimal (~1MB per discovery service)
- **CPU**: Very low (event-driven)
- **Network**: Minimal (periodic multicast announcements)
- **Concurrent Connections**: Handled by libp2p connection manager

## Configuration Options

### Service Tag Customization

```go
// Pick one tag per environment; agents only discover peers that use the same tag.

// Production environment
disc, err := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-production")

// Development environment
// disc, err := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-dev")

// Testing environment
// disc, err := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-test")
```

**Use Case**: Isolate environments on the same physical network.

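One way to avoid hardcoding the tag is to derive it from the deployment environment at startup. The following is a minimal sketch, assuming a hypothetical `CHORUS_ENV` environment variable and a hypothetical `serviceTagFor` helper; neither is defined by the `discovery` package. Unknown values fall back to the empty string, which the package treats as the default tag.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// serviceTagFor maps an environment name to an mDNS service tag.
// Hypothetical helper; the tag names follow the examples in this document.
func serviceTagFor(env string) string {
	switch strings.ToLower(strings.TrimSpace(env)) {
	case "production", "prod":
		return "CHORUS-production"
	case "development", "dev":
		return "CHORUS-dev"
	case "testing", "test":
		return "CHORUS-test"
	default:
		// Empty tag selects the default ("CHORUS-peer-discovery").
		return ""
	}
}

func main() {
	// CHORUS_ENV is an assumed variable name used only for this sketch.
	tag := serviceTagFor(os.Getenv("CHORUS_ENV"))
	fmt.Printf("service tag: %q\n", tag)
}
```

The returned value can be passed directly as the third argument to `NewMDNSDiscovery`.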
### Connection Timeout Adjustment

The connection timeout is currently hardcoded to 10 seconds. To change it, edit the value in `handleDiscoveredPeers()`:

```go
// In handleDiscoveredPeers():
connectTimeout := 30 * time.Second // Longer for slow networks
connectCtx, cancel := context.WithTimeout(d.ctx, connectTimeout)
```

## Advanced Usage

### Custom Peer Handling

Bypass automatic connection and implement custom logic:

```go
// Subscribe to peer channel
peersChan := disc.PeersChan()

go func() {
	for peerInfo := range peersChan {
		// Custom filtering
		if shouldConnectToPeer(peerInfo) {
			// Custom connection logic
			connectWithRetry(peerInfo)
		}
	}
}()
```

### Discovery Metrics

```go
type DiscoveryMetrics struct {
	PeersDiscovered    int
	ConnectionsSuccess int
	ConnectionsFailed  int
	LastDiscovery      time.Time
}

// Track metrics
var metrics DiscoveryMetrics

// In handleDiscoveredPeers():
metrics.PeersDiscovered++
if err := d.host.Connect(connectCtx, peerInfo); err != nil {
	metrics.ConnectionsFailed++
} else {
	metrics.ConnectionsSuccess++
}
metrics.LastDiscovery = time.Now()
```

## Comparison with Other Discovery Methods

### mDNS vs DHT

| Feature | mDNS | DHT (Kademlia) |
|---------|------|----------------|
| Network Scope | Local network only | Global |
| Setup | Zero-config | Requires bootstrap nodes |
| Speed | Very fast (< 1s) | Slower (seconds to minutes) |
| Privacy | Local only | Public network |
| Reliability | High on LAN | Depends on DHT health |
| Use Case | LAN clusters | Internet-wide P2P |

**CHORUS Choice**: mDNS for local agent clusters; DHT could be added for internet-wide coordination.

### mDNS vs Bootstrap List

| Feature | mDNS | Bootstrap List |
|---------|------|----------------|
| Configuration | None | Manual list |
| Maintenance | Automatic | Manual updates |
| Scalability | Limited to LAN | Unlimited |
| Flexibility | Dynamic | Static |
| Failure Handling | Auto-discovery | Manual intervention |

**CHORUS Choice**: mDNS for local discovery, with a bootstrap list as fallback.

## libp2p Integration

### Host Requirement

mDNS discovery requires a libp2p host:

```go
import (
	"github.com/libp2p/go-libp2p"
)

// Create libp2p host
h, err := libp2p.New(
	libp2p.ListenAddrStrings(
		"/ip4/0.0.0.0/tcp/4001",
		"/ip6/::/tcp/4001",
	),
)
if err != nil {
	return err
}

// Initialize mDNS discovery with host
disc, err := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-peer-discovery")
```

### Connection Manager Integration

mDNS discovery works with the libp2p connection manager:

```go
// connmgr is github.com/libp2p/go-libp2p/p2p/net/connmgr. In current go-libp2p
// releases NewConnManager returns an error and takes the grace period as an option.
cm, err := connmgr.NewConnManager(
	100, // Low water mark
	400, // High water mark
	connmgr.WithGracePeriod(time.Minute),
)
if err != nil {
	return err
}

h, err := libp2p.New(
	libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/4001"),
	libp2p.ConnectionManager(cm),
)

// mDNS-discovered connections managed by connection manager
disc, err := discovery.NewMDNSDiscovery(ctx, h, "")
```

## Security Considerations

### Trust Model

mDNS operates on **local network trust**:
- Assumes local network is trusted
- No authentication at mDNS layer
- Authentication handled by libp2p security transport

### Attack Vectors

1. **Peer ID Spoofing**: Mitigated by libp2p peer ID verification
2. **DoS via Fake Peers**: Limited by channel buffer and connection timeout
3. **Network Snooping**: mDNS announcements are plaintext (by design)

### Best Practices

1. **Use libp2p Security**: TLS or Noise transport for encrypted connections
2. **Peer Authentication**: Verify peer identities after connection
3. **Network Isolation**: Deploy on trusted networks
4. **Connection Limits**: Use libp2p connection manager
5. **Monitoring**: Log all discovery and connection events

## Troubleshooting

### No Peers Discovered

**Symptoms**: Service starts but no peers found.

**Checks:**
1. Verify all agents on same subnet
2. Check firewall rules (UDP 5353)
3. Verify mDNS/multicast not blocked by network
4. Check service tag matches across agents
5. Verify no mDNS conflicts with other services

### Connection Failures

**Symptoms**: Peers discovered but connections fail.

**Checks:**
1. Verify libp2p port open (default: TCP 4001)
2. Check connection manager limits
3. Verify peer addresses are reachable
4. Check for NAT/firewall between peers
5. Verify sufficient system resources (file descriptors, memory)

### High CPU/Network Usage

**Symptoms**: Excessive mDNS traffic or CPU usage.

**Causes:**
- Rapid peer restarts (re-announcements)
- Many peers on network
- Short announcement intervals

**Solutions:**
- Implement connection caching
- Adjust mDNS announcement timing
- Use connection limits

## Monitoring and Debugging

### Discovery Events

```go
// Log all discovery events
disc, _ := discovery.NewMDNSDiscovery(ctx, h, "CHORUS-peer-discovery")

peersChan := disc.PeersChan()
go func() {
	for peerInfo := range peersChan {
		logger.Info("Discovered peer",
			"peer_id", peerInfo.ID.String(),
			"addresses", peerInfo.Addrs,
			"timestamp", time.Now())
	}
}()
```

### Connection Status

```go
// Monitor connection status
func monitorConnections(h host.Host) {
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		peers := h.Network().Peers()
		fmt.Printf("📊 Connected to %d peers: %v\n",
			len(peers), peers)
	}
}
```

## See Also

- [coordinator/](coordinator.md) - Task coordination using discovered peers
- [pubsub/](../pubsub.md) - PubSub over discovered peer network
- [internal/runtime/](../internal/runtime.md) - Runtime initialization with discovery
- [libp2p Documentation](https://docs.libp2p.io/) - libp2p concepts and APIs
- [mDNS RFC 6762](https://tools.ietf.org/html/rfc6762) - mDNS protocol specification