P2P Mesh Status Report - HMMM Monitor Integration
Date: 2025-10-12
Status: ✅ Working (with limitations)
System: CHORUS agents + HMMM monitor + WHOOSH bootstrap
Summary
The HMMM monitor is now successfully connected to the P2P mesh and receiving GossipSub messages from CHORUS agents. However, there are several limitations and inefficiencies that need addressing in future iterations.
Current Working State
What's Working ✅
- P2P Connections Established
  - HMMM monitor connects to bootstrap peers via overlay network IPs
  - Monitor subscribes to 3 GossipSub topics:
    - CHORUS/coordination/v1 (task coordination)
    - hmmm/meta-discussion/v1 (meta-discussion)
    - CHORUS/context-feedback/v1 (context feedback)
- Message Broadcast System
  - Agents broadcast availability every 30 seconds
  - Messages include: node_id, available_for_work, current_tasks, max_tasks, last_activity, status, timestamp
- Docker Swarm Overlay Network
  - Monitor and agents on same network: lz9ny9bmvm6fzalvy9ckpxpcw
  - Direct IP-based connections work within overlay network
- Bootstrap Discovery
  - WHOOSH queries agent /api/health endpoints
  - Agents expose peer IDs and multiaddrs
  - Monitor fetches bootstrap list from WHOOSH
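The availability broadcast listed above could be modeled roughly as follows. This is a minimal sketch: the JSON field names come from the list above, but the type name `AvailabilityMessage`, the Go field types, and the assumption that the payload is JSON are illustrative and not taken from the CHORUS source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AvailabilityMessage is a hypothetical shape for the 30-second broadcast.
// Field names follow the report; the types are assumptions.
type AvailabilityMessage struct {
	NodeID           string `json:"node_id"`
	AvailableForWork bool   `json:"available_for_work"`
	CurrentTasks     int    `json:"current_tasks"`
	MaxTasks         int    `json:"max_tasks"`
	LastActivity     string `json:"last_activity"`
	Status           string `json:"status"`
	Timestamp        string `json:"timestamp"`
}

// DecodeAvailability parses one broadcast payload as JSON.
func DecodeAvailability(payload []byte) (AvailabilityMessage, error) {
	var msg AvailabilityMessage
	err := json.Unmarshal(payload, &msg)
	return msg, err
}

func main() {
	payload := []byte(`{"node_id":"agent-1","available_for_work":true,"current_tasks":1,"max_tasks":3,"status":"ready"}`)
	msg, err := DecodeAvailability(payload)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%s: %d/%d tasks, available=%t\n",
		msg.NodeID, msg.CurrentTasks, msg.MaxTasks, msg.AvailableForWork)
}
```

If the real agents encode timestamps as numbers rather than strings, the corresponding field types would need to change.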
Key Issues & Limitations ⚠️
1. Limited Agent Discovery
Problem: Only 2-3 unique agents discovered out of 10 running replicas
Evidence:
✅ Fetched 3 bootstrap peers from WHOOSH
🔗 Connected to bootstrap peer: <peer.ID 12*isFYCH> (2 connections)
🔗 Connected to bootstrap peer: <peer.ID 12*RS37W6> (1 connection)
✅ Connected to 3/3 bootstrap peers
Root Cause: WHOOSH's P2P discovery mechanism (p2pDiscovery.GetAgents()) is not returning all 10 agent replicas consistently.
Impact:
- Monitor only connects to a subset of agents
- Some agents' messages may not be visible to monitor
- P2P mesh is incomplete
2. Docker Swarm VIP Load Balancing
Problem: Service DNS names (chorus:8080) use VIP load balancing, which breaks direct P2P connections
Why This Breaks P2P:
- Monitor resolves chorus:8080 → VIP load balancer
- VIP routes to random agent container
- That container has different peer ID than expected
- libp2p handshake fails: "peer id mismatch"
Current Workaround:
- Agents expose overlay network IPs: /ip4/10.0.13.x/tcp/9000/p2p/{peer_id}
- Monitor connects directly to container IPs
- Bypasses VIP load balancer
Limitation: Relies on overlay network IP addresses being stable and routable
3. Multiple Multiaddrs Per Agent
Problem: Each agent has multiple network interfaces (localhost + overlay IP), creating duplicate multiaddrs
Example:
Agent has 2 addresses:
- /ip4/127.0.0.1/tcp/9000 (localhost - skipped)
- /ip4/10.0.13.227/tcp/9000 (overlay IP - used)
Current Fix: WHOOSH now returns only first multiaddr per agent
Better Solution Needed: Filter multiaddrs to only include routable overlay IPs, exclude localhost
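One way the filtering could look is sketched below, using only the Go standard library and treating multiaddrs as plain strings. The function name `FilterRoutable` and the `10.0.13.0/24` subnet are assumptions for illustration (the report only shows addresses of the form `10.0.13.x`); the real overlay subnet should be read from the Swarm network configuration.

```go
package main

import (
	"fmt"
	"net/netip"
	"strings"
)

// FilterRoutable keeps only /ip4/... multiaddr strings whose IP lies inside
// the given overlay CIDR. Loopback and link-local addresses are rejected
// explicitly so the intent survives a change of CIDR.
func FilterRoutable(addrs []string, overlayCIDR string) ([]string, error) {
	overlay, err := netip.ParsePrefix(overlayCIDR)
	if err != nil {
		return nil, fmt.Errorf("invalid overlay CIDR %q: %w", overlayCIDR, err)
	}
	var kept []string
	for _, a := range addrs {
		// "/ip4/10.0.13.227/tcp/9000" splits into ["", "ip4", "10.0.13.227", "tcp", "9000"].
		parts := strings.Split(a, "/")
		if len(parts) < 3 || parts[1] != "ip4" {
			continue
		}
		ip, err := netip.ParseAddr(parts[2])
		if err != nil || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
			continue
		}
		if overlay.Contains(ip) {
			kept = append(kept, a)
		}
	}
	return kept, nil
}

func main() {
	addrs := []string{"/ip4/127.0.0.1/tcp/9000", "/ip4/10.0.13.227/tcp/9000"}
	kept, err := FilterRoutable(addrs, "10.0.13.0/24") // assumed overlay subnet
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(kept)
}
```

Filtering by subnet rather than by position in the list means the result no longer depends on the order in which interfaces are enumerated.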
4. Incomplete Agent Health Endpoint
Current Implementation (/home/tony/chorus/project-queues/active/CHORUS/api/http_server.go:319-366):
// Agents expose:
- peer_id: string
- multiaddrs: []string (all interfaces)
- connected_peers: int
- gossipsub_topics: []string
Missing Information:
- No agent metadata (capabilities, specialization, version)
- No P2P connection quality metrics
- No topic subscription status per peer
- No mesh topology visibility
5. WHOOSH Bootstrap Discovery Issues
Problem: WHOOSH's agent discovery is incomplete and inconsistent
Observed Behavior:
- Only 3-5 agents discovered out of 10 running
- Duplicate agent entries with different names:
  - chorus-agent-001
  - chorus-agent-http-//chorus-8080
  - chorus-agent-http-//CHORUS_chorus-8080
Root Cause: WHOOSH's P2P discovery mechanism not reliably detecting all Swarm replicas
Location: /home/tony/chorus/project-queues/active/WHOOSH/internal/server/bootstrap.go:48
agents := s.p2pDiscovery.GetAgents()
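If the differently named entries turn out to refer to the same underlying peer (the report does not establish this), one possible mitigation is to deduplicate discovery results by peer ID before building the bootstrap list. The sketch below is illustrative only: the `Agent` type and `DedupeByPeerID` function are hypothetical and do not reflect WHOOSH's actual types.

```go
package main

import "fmt"

// Agent is a hypothetical, simplified discovery record.
type Agent struct {
	Name   string
	PeerID string
}

// DedupeByPeerID keeps the first record seen for each non-empty peer ID
// and preserves the input order. Records without a peer ID are dropped.
func DedupeByPeerID(agents []Agent) []Agent {
	seen := make(map[string]bool, len(agents))
	out := make([]Agent, 0, len(agents))
	for _, a := range agents {
		if a.PeerID == "" || seen[a.PeerID] {
			continue
		}
		seen[a.PeerID] = true
		out = append(out, a)
	}
	return out
}

func main() {
	agents := []Agent{
		{Name: "chorus-agent-001", PeerID: "peer-A"},
		{Name: "chorus-agent-http-//chorus-8080", PeerID: "peer-A"},
		{Name: "chorus-agent-http-//CHORUS_chorus-8080", PeerID: "peer-B"},
	}
	for _, a := range DedupeByPeerID(agents) {
		fmt.Println(a.PeerID, a.Name)
	}
}
```

The peer IDs in `main` are placeholders. Deduplication only hides duplicate names; it does not explain why fewer than 10 replicas are discovered.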
Architecture Decisions Made
1. Use Overlay Network IPs Instead of Service DNS
Rationale:
- Service DNS uses VIP load balancing
- VIP breaks direct P2P connections (peer ID mismatch)
- Overlay IPs allow direct container-to-container communication
Trade-offs:
- ✅ P2P connections work
- ✅ No need for port-per-replica (20+ ports)
- ⚠️ Depends on overlay network IP stability
- ⚠️ IPs not externally routable (monitor must be on same network)
2. Single Multiaddr Per Agent in Bootstrap
Rationale:
- Avoid duplicate connections to same peer
- Simplify bootstrap list
- Reduce connection overhead
Implementation: WHOOSH returns only first multiaddr per agent
Trade-offs:
- ✅ No duplicate connections
- ✅ Cleaner bootstrap list
- ⚠️ No failover if first multiaddr unreachable
- ⚠️ Doesn't leverage libp2p multi-address resilience
3. Monitor on Same Overlay Network as Agents
Rationale:
- Overlay IPs only routable within overlay network
- Simplest solution for P2P connectivity
Trade-offs:
- ✅ Direct connectivity works
- ✅ No additional networking configuration
- ⚠️ Monitor tightly coupled to agent network
- ⚠️ Can't monitor from external networks
Code Changes Summary
1. CHORUS Agent Health Endpoint
File: /home/tony/chorus/project-queues/active/CHORUS/api/http_server.go
Changes:
- Added node *p2p.Node field to HTTPServer
- Enhanced handleHealth() to expose:
  - peer_id: Full peer ID string
  - multiaddrs: Overlay network IPs with peer ID
  - connected_peers: Current P2P connection count
  - gossipsub_topics: Subscribed topics
- Added debug logging for address resolution
Key Logic (lines 319-366):
// Extract overlay network IPs (skip localhost).
// ip and port are derived from addr; this excerpt omits that step.
for _, addr := range h.node.Addresses() {
	if ip == "127.0.0.1" || ip == "::1" {
		continue // Skip localhost
	}
	multiaddr := fmt.Sprintf("/ip4/%s/tcp/%s/p2p/%s", ip, port, h.node.ID().String())
	multiaddrs = append(multiaddrs, multiaddr)
}
2. WHOOSH Bootstrap Endpoint
File: /home/tony/chorus/project-queues/active/WHOOSH/internal/server/bootstrap.go
Changes:
- Modified HandleBootstrapPeers() to:
  - Query each agent's /api/health endpoint
  - Extract peer_id and multiaddrs from health response
  - Return only first multiaddr per agent (deduplication)
  - Add proper error handling for unavailable agents
Key Logic (lines 87-103):
// Add only first multiaddr per agent to avoid duplicates
if len(health.Multiaddrs) > 0 {
	bootstrapPeers = append(bootstrapPeers, BootstrapPeer{
		Multiaddr: health.Multiaddrs[0], // Only first
		PeerID:    health.PeerID,
		Name:      agent.ID,
		Priority:  priority + 1,
	})
}
3. HMMM Monitor Topic Names
File: /home/tony/chorus/project-queues/active/CHORUS/hmmm-monitor/main.go
Changes:
- Fixed topic name case sensitivity (line 128):
  - Was: "chorus/coordination/v1" (lowercase)
  - Now: "CHORUS/coordination/v1" (uppercase)
- Matches agent topic names from pubsub/pubsub.go:138-143
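Because topic names are compared as exact strings, a name that differs only in letter case is a different topic and receives no messages. A small guard like the following could catch that class of mistake at startup. It is a sketch only: the constant names and the `CaseMismatches` helper are hypothetical, and the topic strings are the three listed in this report.

```go
package main

import (
	"fmt"
	"strings"
)

// Topic names as listed in this report. The constant names are illustrative.
const (
	TopicCoordination    = "CHORUS/coordination/v1"
	TopicMetaDiscussion  = "hmmm/meta-discussion/v1"
	TopicContextFeedback = "CHORUS/context-feedback/v1"
)

// CaseMismatches returns every subscribed name that differs from a known
// topic only by letter case, which would silently receive no messages.
func CaseMismatches(subscribed []string) []string {
	known := []string{TopicCoordination, TopicMetaDiscussion, TopicContextFeedback}
	var bad []string
	for _, s := range subscribed {
		for _, k := range known {
			if s != k && strings.EqualFold(s, k) {
				bad = append(bad, s)
			}
		}
	}
	return bad
}

func main() {
	subs := []string{"chorus/coordination/v1", TopicMetaDiscussion}
	for _, name := range CaseMismatches(subs) {
		fmt.Println("topic name differs only by case:", name)
	}
}
```

Sharing one set of constants between the agents and the monitor would remove the need for the check altogether.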
Performance Metrics
Connection Success Rate
- Target: 10/10 agents connected
- Actual: 3/10 agents connected (30%)
- Bottleneck: WHOOSH agent discovery
Message Visibility
- Expected: All agent broadcasts visible to monitor
- Actual: Only broadcasts from connected agents visible
- Coverage: ~30% of mesh traffic
Connection Latency
- Bootstrap fetch: < 1s
- P2P connection establishment: < 1s per peer
- GossipSub message propagation: < 100ms (estimated)
Recommended Improvements
High Priority
- Fix WHOOSH Agent Discovery
  - Problem: Only 3/10 agents discovered
  - Root Cause: p2pDiscovery.GetAgents() incomplete
  - Solution: Investigate discovery mechanism, possibly use Docker API directly
  - File: /home/tony/chorus/project-queues/active/WHOOSH/internal/discovery/...
- Add Health Check Retry Logic
  - Problem: WHOOSH may query agents before they're ready
  - Solution: Retry failed health checks with exponential backoff
  - File: /home/tony/chorus/project-queues/active/WHOOSH/internal/server/bootstrap.go
- Improve Multiaddr Filtering
  - Problem: Including all interfaces, not just routable ones
  - Solution: Filter for overlay network IPs only, exclude localhost/link-local
  - File: /home/tony/chorus/project-queues/active/CHORUS/api/http_server.go
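The retry recommendation could be sketched as a small helper that wraps any health-check call. The names `Retry` and `ErrNotReady` are hypothetical, and the attempt count and base delay are placeholders to be tuned; nothing here is taken from the WHOOSH code.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// ErrNotReady is a placeholder error for an agent that is still starting.
var ErrNotReady = errors.New("agent not ready")

// Retry calls fn up to attempts times. After each failure except the last
// it sleeps, doubling the delay each time: base, 2*base, 4*base, ...
func Retry(attempts int, base time.Duration, fn func() error) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", attempts, err)
}

func main() {
	calls := 0
	err := Retry(4, 10*time.Millisecond, func() error {
		calls++
		if calls < 3 {
			return ErrNotReady // simulate an agent that needs two more tries
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

In a real handler the total retry time should stay below the HTTP timeout of the bootstrap endpoint, and a cap on the delay or added jitter may be worth considering.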
Medium Priority
- Add Mesh Topology Visibility
  - Enhancement: Monitor should report full mesh topology
  - Data Needed: Which agents are connected to which peers
  - UI: Add dashboard showing P2P mesh graph
- Implement Peer Discovery via DHT
  - Problem: Relying solely on WHOOSH for bootstrap
  - Solution: Add libp2p DHT for peer-to-peer discovery
  - Benefit: Agents can discover each other without WHOOSH
- Add Connection Quality Metrics
  - Enhancement: Track latency, bandwidth, reliability per peer
  - Data: Round-trip time, message success rate, connection uptime
  - Use: Identify and debug problematic P2P connections
Low Priority
- Support External Monitor Deployment
  - Limitation: Monitor must be on same overlay network
  - Solution: Use libp2p relay or expose agents on host network
  - Use Case: Monitor from laptop/external host
- Add Multiaddr Failover
  - Enhancement: Try all multiaddrs if first fails
  - Current: Only use first multiaddr per agent
  - Benefit: Better resilience to network issues
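The failover idea can be illustrated independently of libp2p by injecting the dial step as a function. In this sketch `DialFunc`, `DialFirstReachable`, and `ErrUnreachable` are hypothetical names; in the real monitor the dial function would wrap whatever connect call the libp2p host exposes.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrUnreachable is a placeholder error for a failed dial attempt.
var ErrUnreachable = errors.New("unreachable")

// DialFunc stands in for the real connection attempt (assumption: the
// monitor would supply a function that dials one multiaddr).
type DialFunc func(addr string) error

// DialFirstReachable tries each address in order and returns the first one
// that dials successfully. If all fail, the last error is wrapped.
func DialFirstReachable(addrs []string, dial DialFunc) (string, error) {
	if len(addrs) == 0 {
		return "", errors.New("no multiaddrs to try")
	}
	var lastErr error
	for _, a := range addrs {
		if err := dial(a); err != nil {
			lastErr = fmt.Errorf("dial %s: %w", a, err)
			continue
		}
		return a, nil
	}
	return "", fmt.Errorf("all %d multiaddrs failed, last error: %w", len(addrs), lastErr)
}

func main() {
	addrs := []string{"/ip4/10.0.13.5/tcp/9000", "/ip4/10.0.13.227/tcp/9000"}
	used, err := DialFirstReachable(addrs, func(a string) error {
		if a == addrs[0] {
			return ErrUnreachable // simulate a dead first address
		}
		return nil
	})
	fmt.Println("connected via:", used, "err:", err)
}
```

For this to help, the bootstrap endpoint would have to return more than the first multiaddr per agent, which is the trade-off noted under the architecture decisions.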
Testing Checklist
Functional Tests Needed
- All 10 agents appear in bootstrap list
- Monitor connects to all 10 agents
- Monitor receives broadcasts from all agents
- Agent restart doesn't break monitor connectivity
- WHOOSH restart doesn't break monitor connectivity
- Scale agents to 20 replicas → all visible to monitor
Performance Tests Needed
- Message delivery latency < 100ms
- Bootstrap list refresh < 1s
- Monitor handles 100+ messages/sec
- CPU/memory usage acceptable under load
Edge Cases to Test
- Agent crashes/restarts → monitor reconnects
- Network partition → monitor detects split
- Duplicate peer IDs → handled gracefully
- Invalid multiaddrs → skipped without crash
- WHOOSH unavailable → monitor uses cached bootstrap
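The last edge case assumes the monitor keeps a cached bootstrap list. A minimal sketch of that behavior, with the fetch step injected so it can be tested without WHOOSH, might look like this. `BootstrapCache`, `Refresh`, and `ErrWhooshDown` are hypothetical names, and the type is not safe for concurrent use as written.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrWhooshDown is a placeholder error for a failed bootstrap fetch.
var ErrWhooshDown = errors.New("WHOOSH unavailable")

// BootstrapCache remembers the last successfully fetched peer list.
// It is not safe for concurrent use.
type BootstrapCache struct {
	last []string
}

// Refresh calls fetch. On success it stores a copy of the result and
// returns it. On failure it returns a copy of the cached list if one
// exists, otherwise it returns the fetch error.
func (c *BootstrapCache) Refresh(fetch func() ([]string, error)) ([]string, error) {
	peers, err := fetch()
	if err == nil {
		c.last = append([]string(nil), peers...)
		return peers, nil
	}
	if len(c.last) > 0 {
		return append([]string(nil), c.last...), nil
	}
	return nil, fmt.Errorf("no cached bootstrap list: %w", err)
}

func main() {
	cache := &BootstrapCache{}
	ok := func() ([]string, error) { return []string{"/ip4/10.0.13.227/tcp/9000"}, nil }
	down := func() ([]string, error) { return nil, ErrWhooshDown }

	peers, err := cache.Refresh(ok)
	fmt.Println(peers, err)
	peers, err = cache.Refresh(down) // falls back to the cached list
	fmt.Println(peers, err)
}
```

A cached list can go stale after agents restart with new overlay IPs, so an expiry time may be needed in practice.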
System Architecture Diagram
┌─────────────────────────────────────────────────────────────┐
│                Docker Swarm Overlay Network                 │
│                 (lz9ny9bmvm6fzalvy9ckpxpcw)                 │
│                                                             │
│  ┌──────────────┐     ┌──────────────┐                      │
│  │ CHORUS Agent │────▶│    WHOOSH    │◀──── HTTP Query      │
│  │ (10 replicas)│     │  Bootstrap   │                      │
│  │              │     │    Server    │                      │
│  │ /api/health  │     │              │                      │
│  │ - peer_id    │     │  /api/v1/    │                      │
│  │ - multiaddrs │     │  bootstrap-  │                      │
│  │ - topics     │     │  peers       │                      │
│  └───────┬──────┘     └──────────────┘                      │
│          │                                                  │
│          │ GossipSub                                        │
│          │ Messages                                         │
│          ▼                                                  │
│  ┌──────────────┐                                           │
│  │ HMMM Monitor │                                           │
│  │              │                                           │
│  │ Subscribes:  │                                           │
│  │ - CHORUS/    │                                           │
│  │  coordination│                                           │
│  │ - hmmm/meta  │                                           │
│  │ - context    │                                           │
│  └──────────────┘                                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘
Flow:
1. WHOOSH queries agent /api/health endpoints
2. Agents respond with peer_id + overlay IP multiaddrs
3. WHOOSH aggregates into bootstrap list
4. Monitor fetches bootstrap list
5. Monitor connects directly to agent overlay IPs
6. Monitor subscribes to GossipSub topics
7. Agents broadcast messages every 30s
8. Monitor receives and logs messages
Known Issues
Issue #1: Incomplete Agent Discovery
- Severity: High
- Impact: Only 30% of agents visible to monitor
- Workaround: None
- Fix Required: Investigate WHOOSH discovery mechanism
Issue #2: No Automatic Peer Discovery
- Severity: Medium
- Impact: Monitor relies on WHOOSH for all peer discovery
- Workaround: Manual restart to refresh bootstrap
- Fix Required: Implement DHT or mDNS discovery
Issue #3: Topic Name Case Sensitivity
- Severity: Low (fixed)
- Impact: Was preventing message reception
- Fix: Corrected topic names to match agents
- Status: Resolved
Deployment Instructions
Current Deployment State
All components deployed and running:
- ✅ CHORUS agents: 10 replicas (anthonyrawlins/chorus:latest)
- ✅ WHOOSH: 1 replica (anthonyrawlins/whoosh:latest)
- ✅ HMMM monitor: 1 replica (anthonyrawlins/hmmm-monitor:latest)
To Redeploy After Changes
# 1. Rebuild and deploy CHORUS agents
cd /home/tony/chorus/project-queues/active/CHORUS
env GOWORK=off go build -v -o build/chorus-agent ./cmd/agent
docker build -f Dockerfile.ubuntu -t anthonyrawlins/chorus:latest .
docker push anthonyrawlins/chorus:latest
ssh acacia "docker service update --image anthonyrawlins/chorus:latest CHORUS_chorus"
# 2. Rebuild and deploy WHOOSH
cd /home/tony/chorus/project-queues/active/WHOOSH
docker build -t anthonyrawlins/whoosh:latest .
docker push anthonyrawlins/whoosh:latest
ssh acacia "docker service update --image anthonyrawlins/whoosh:latest CHORUS_whoosh"
# 3. Rebuild and deploy HMMM monitor
cd /home/tony/chorus/project-queues/active/CHORUS/hmmm-monitor
docker build -t anthonyrawlins/hmmm-monitor:latest .
docker push anthonyrawlins/hmmm-monitor:latest
ssh acacia "docker service update --image anthonyrawlins/hmmm-monitor:latest CHORUS_hmmm-monitor"
# 4. Verify deployment
ssh acacia "docker service ps CHORUS_chorus CHORUS_whoosh CHORUS_hmmm-monitor"
ssh acacia "docker service logs --tail 20 CHORUS_hmmm-monitor"
References
- Architecture Plan: /home/tony/chorus/project-queues/active/CHORUS/docs/P2P_MESH_ARCHITECTURE_PLAN.md
- Agent Health Endpoint: /home/tony/chorus/project-queues/active/CHORUS/api/http_server.go:319-366
- WHOOSH Bootstrap: /home/tony/chorus/project-queues/active/WHOOSH/internal/server/bootstrap.go:41-47
- HMMM Monitor: /home/tony/chorus/project-queues/active/CHORUS/hmmm-monitor/main.go
- Agent Pubsub: /home/tony/chorus/project-queues/active/CHORUS/pubsub/pubsub.go:138-143
Conclusion
The P2P mesh is functionally working but requires improvements to achieve full reliability and visibility. The primary blocker is WHOOSH's incomplete agent discovery, which prevents the monitor from seeing all 10 agents. Once this is resolved, the system should achieve 100% message visibility across the entire mesh.
Next Steps:
- Debug WHOOSH agent discovery to ensure all 10 replicas are discovered
- Add retry logic for health endpoint queries
- Improve multiaddr filtering to exclude non-routable addresses
- Add mesh topology monitoring and visualization
Status: ✅ Working, ⚠️ Needs Improvement