
P2P Mesh Status Report - HMMM Monitor Integration

Date: 2025-10-12
Status: ✅ Working (with limitations)
System: CHORUS agents + HMMM monitor + WHOOSH bootstrap

Summary

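The HMMM monitor is now connected to the P2P mesh and receives GossipSub messages from CHORUS agents. However, WHOOSH currently discovers only a subset of the 10 agent replicas, so the monitor sees only part of the mesh, and the workarounds in place (overlay IPs, a single multiaddr per agent) carry trade-offs that need addressing in future iterations.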

Current Working State

What's Working ✅

- P2P Connections Established
  - HMMM monitor connects to bootstrap peers via overlay network IPs
  - Monitor subscribes to 3 GossipSub topics:
    - CHORUS/coordination/v1 (task coordination)
    - hmmm/meta-discussion/v1 (meta-discussion)
    - CHORUS/context-feedback/v1 (context feedback)
- Message Broadcast System
  - Agents broadcast availability every 30 seconds
  - Messages include: node_id, available_for_work, current_tasks, max_tasks, last_activity, status, timestamp
- Docker Swarm Overlay Network
  - Monitor and agents on same network: lz9ny9bmvm6fzalvy9ckpxpcw
  - Direct IP-based connections work within overlay network
- Bootstrap Discovery
  - WHOOSH queries agent /api/health endpoints
  - Agents expose peer IDs and multiaddrs
  - Monitor fetches bootstrap list from WHOOSH
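For reference, the availability broadcast listed above could be modelled roughly as follows. This is a minimal sketch: the field names come from this report, but the Go types, the JSON encoding, the Unix-seconds timestamps, and the `HasCapacity` helper are assumptions, not taken from the CHORUS source.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// AvailabilityMessage mirrors the fields listed in the report.
// The Go types and the JSON encoding are assumptions for illustration.
type AvailabilityMessage struct {
	NodeID           string `json:"node_id"`
	AvailableForWork bool   `json:"available_for_work"`
	CurrentTasks     int    `json:"current_tasks"`
	MaxTasks         int    `json:"max_tasks"`
	LastActivity     int64  `json:"last_activity"` // assumed Unix seconds
	Status           string `json:"status"`
	Timestamp        int64  `json:"timestamp"` // assumed Unix seconds
}

// HasCapacity is a hypothetical helper: true when the agent reports itself
// available and is below its task limit.
func (m AvailabilityMessage) HasCapacity() bool {
	return m.AvailableForWork && m.CurrentTasks < m.MaxTasks
}

// decodeAvailability parses one message payload.
func decodeAvailability(payload []byte) (AvailabilityMessage, error) {
	var m AvailabilityMessage
	err := json.Unmarshal(payload, &m)
	return m, err
}

func main() {
	payload := []byte(`{"node_id":"agent-1","available_for_work":true,"current_tasks":1,"max_tasks":3,"status":"ready"}`)
	m, err := decodeAvailability(payload)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(m.NodeID, m.HasCapacity())
}
```

A typed message like this would let the monitor log which agents still have capacity instead of printing raw payloads.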

Key Issues & Limitations ⚠️

1. Limited Agent Discovery

Problem: Only 2-3 unique agents discovered out of 10 running replicas

Evidence:

✅ Fetched 3 bootstrap peers from WHOOSH
🔗 Connected to bootstrap peer: <peer.ID 12*isFYCH>  (2 connections)
🔗 Connected to bootstrap peer: <peer.ID 12*RS37W6>  (1 connection)
✅ Connected to 3/3 bootstrap peers

Root Cause: WHOOSH's P2P discovery mechanism (p2pDiscovery.GetAgents()) is not returning all 10 agent replicas consistently.

Impact:

Monitor only connects to a subset of agents
Some agents' messages may not be visible to monitor
P2P mesh is incomplete

2. Docker Swarm VIP Load Balancing

Problem: Service DNS names (chorus:8080) use VIP load balancing, which breaks direct P2P connections

Why This Breaks P2P:

1. Monitor resolves chorus:8080 → VIP load balancer
2. VIP routes to random agent container
3. That container has different peer ID than expected
4. libp2p handshake fails: "peer id mismatch"

Current Workaround:

Agents expose overlay network IPs: /ip4/10.0.13.x/tcp/9000/p2p/{peer_id}
Monitor connects directly to container IPs
Bypasses VIP load balancer

Limitation: Relies on overlay network IP addresses being stable and routable
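The workaround only holds if the peer ID embedded in each multiaddr belongs to the container at that IP. The sketch below splits the textual form `/ip4/<ip>/tcp/<port>/p2p/<peer_id>` into its parts so the expected peer ID can be logged or compared. `splitP2PAddr` is a hypothetical helper using only the standard library, and the peer ID in the example is a placeholder; real code would more likely use a multiaddr library.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// splitP2PAddr is a hypothetical helper that splits the textual form
// /ip4/<ip>/tcp/<port>/p2p/<peer_id> into its parts. It handles only this
// one shape; real code would normally use a multiaddr library instead.
func splitP2PAddr(addr string) (ip, port, peerID string, err error) {
	parts := strings.Split(addr, "/")
	// The leading "/" yields an empty first element, so expect 7 parts.
	if len(parts) != 7 || parts[0] != "" ||
		parts[1] != "ip4" || parts[3] != "tcp" || parts[5] != "p2p" {
		return "", "", "", errors.New("unsupported multiaddr: " + addr)
	}
	if parts[2] == "" || parts[4] == "" || parts[6] == "" {
		return "", "", "", errors.New("empty component in: " + addr)
	}
	return parts[2], parts[4], parts[6], nil
}

func main() {
	// The peer ID below is a made-up placeholder.
	ip, port, id, err := splitP2PAddr("/ip4/10.0.13.227/tcp/9000/p2p/12D3KooWExample")
	fmt.Println(ip, port, id, err)
}
```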

3. Multiple Multiaddrs Per Agent

Problem: Each agent has multiple network interfaces (localhost + overlay IP), so it advertises several multiaddrs for the same peer, and not all of them are routable from other containers

Example:

Agent has 2 addresses:
- /ip4/127.0.0.1/tcp/9000  (localhost - skipped)
- /ip4/10.0.13.227/tcp/9000  (overlay IP - used)

Current Fix: WHOOSH now returns only first multiaddr per agent

Better Solution Needed: Filter multiaddrs to only include routable overlay IPs, exclude localhost
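One way to approach that filter is sketched below with the standard `net/netip` package. The function name is hypothetical, and the subnet `10.0.13.0/24` is only inferred from the `10.0.13.x` addresses seen in this report; the real overlay subnet should be confirmed before relying on it.

```go
package main

import (
	"fmt"
	"net/netip"
)

// routableOverlayIPs keeps only the addresses inside the given overlay subnet.
// Loopback and link-local addresses are rejected explicitly, and entries that
// do not parse as IP addresses are skipped rather than treated as fatal.
func routableOverlayIPs(ips []string, overlayCIDR string) ([]string, error) {
	overlay, err := netip.ParsePrefix(overlayCIDR)
	if err != nil {
		return nil, fmt.Errorf("invalid overlay subnet %q: %w", overlayCIDR, err)
	}
	var out []string
	for _, s := range ips {
		addr, err := netip.ParseAddr(s)
		if err != nil {
			continue // skip invalid input
		}
		if addr.IsLoopback() || addr.IsLinkLocalUnicast() {
			continue
		}
		if overlay.Contains(addr) {
			out = append(out, s)
		}
	}
	return out, nil
}

func main() {
	// Assumed subnet; see the note above.
	ips := []string{"127.0.0.1", "10.0.13.227", "169.254.1.5", "172.18.0.4", "not-an-ip"}
	kept, err := routableOverlayIPs(ips, "10.0.13.0/24")
	fmt.Println(kept, err)
}
```

Filtering on the subnet rather than taking the first entry would make the result independent of interface ordering.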

4. Incomplete Agent Health Endpoint

Current Implementation (/home/tony/chorus/project-queues/active/CHORUS/api/http_server.go:319-366):

// Agents expose:
- peer_id: string
- multiaddrs: []string  (all non-localhost interfaces)
- connected_peers: int
- gossipsub_topics: []string

Missing Information:

No agent metadata (capabilities, specialization, version)
No P2P connection quality metrics
No topic subscription status per peer
No mesh topology visibility

5. WHOOSH Bootstrap Discovery Issues

Problem: WHOOSH's agent discovery is incomplete and inconsistent

Observed Behavior:

Only 3-5 agents discovered out of 10 running
Duplicate agent entries with different names:
- chorus-agent-001
- chorus-agent-http-//chorus-8080
- chorus-agent-http-//CHORUS_chorus-8080

Root Cause: WHOOSH's P2P discovery mechanism not reliably detecting all Swarm replicas

Location: /home/tony/chorus/project-queues/active/WHOOSH/internal/server/bootstrap.go:48

agents := s.p2pDiscovery.GetAgents()
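Until the discovery mechanism itself is fixed, the duplicate entries could at least be collapsed by peer ID before the bootstrap list is returned. The sketch below is an assumption-laden illustration: `BootstrapPeer` reuses the field names from the bootstrap excerpt in this report with guessed types, `dedupeByPeerID` is hypothetical, and the mapping of the three agent names to peer IDs is invented for the example.

```go
package main

import "fmt"

// BootstrapPeer uses the field names shown in the report's bootstrap excerpt;
// the field types are assumptions.
type BootstrapPeer struct {
	Multiaddr string
	PeerID    string
	Name      string
	Priority  int
}

// dedupeByPeerID is a hypothetical helper that keeps the first entry seen for
// each peer ID and preserves input order. Entries without a peer ID are dropped.
func dedupeByPeerID(peers []BootstrapPeer) []BootstrapPeer {
	seen := make(map[string]bool, len(peers))
	out := make([]BootstrapPeer, 0, len(peers))
	for _, p := range peers {
		if p.PeerID == "" || seen[p.PeerID] {
			continue
		}
		seen[p.PeerID] = true
		out = append(out, p)
	}
	return out
}

func main() {
	// Illustrative data only: which names share a peer ID is not known.
	peers := []BootstrapPeer{
		{Name: "chorus-agent-001", PeerID: "peerA"},
		{Name: "chorus-agent-http-//chorus-8080", PeerID: "peerA"},
		{Name: "chorus-agent-http-//CHORUS_chorus-8080", PeerID: "peerB"},
	}
	for _, p := range dedupeByPeerID(peers) {
		fmt.Println(p.Name, p.PeerID)
	}
}
```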

Architecture Decisions Made

1. Use Overlay Network IPs Instead of Service DNS

Rationale:

Service DNS uses VIP load balancing
VIP breaks direct P2P connections (peer ID mismatch)
Overlay IPs allow direct container-to-container communication

Trade-offs:

✅ P2P connections work
✅ No need for port-per-replica (20+ ports)
⚠️ Depends on overlay network IP stability
⚠️ IPs not externally routable (monitor must be on same network)

2. Single Multiaddr Per Agent in Bootstrap

Rationale:

Avoid duplicate connections to same peer
Simplify bootstrap list
Reduce connection overhead

Implementation: WHOOSH returns only first multiaddr per agent

Trade-offs:

✅ No duplicate connections
✅ Cleaner bootstrap list
⚠️ No failover if first multiaddr unreachable
⚠️ Doesn't leverage libp2p multi-address resilience

3. Monitor on Same Overlay Network as Agents

Rationale:

Overlay IPs only routable within overlay network
Simplest solution for P2P connectivity

Trade-offs:

✅ Direct connectivity works
✅ No additional networking configuration
⚠️ Monitor tightly coupled to agent network
⚠️ Can't monitor from external networks

Code Changes Summary

1. CHORUS Agent Health Endpoint

File: /home/tony/chorus/project-queues/active/CHORUS/api/http_server.go

Changes:

Added node *p2p.Node field to HTTPServer
Enhanced handleHealth() to expose:
- peer_id: Full peer ID string
- multiaddrs: Overlay network IPs with peer ID
- connected_peers: Current P2P connection count
- gossipsub_topics: Subscribed topics
Added debug logging for address resolution

Key Logic (lines 319-366):

// Extract overlay network IPs (skip localhost)
for _, addr := range h.node.Addresses() {
    // Excerpt only: ip and port are derived from addr (derivation not shown here)
    if ip == "127.0.0.1" || ip == "::1" {
        continue  // Skip localhost
    }
    // The /ip4/ prefix assumes ip is an IPv4 address
    multiaddr := fmt.Sprintf("/ip4/%s/tcp/%s/p2p/%s", ip, port, h.node.ID().String())
    multiaddrs = append(multiaddrs, multiaddr)
}

2. WHOOSH Bootstrap Endpoint

File: /home/tony/chorus/project-queues/active/WHOOSH/internal/server/bootstrap.go

Changes:

Modified HandleBootstrapPeers() to:
- Query each agent's /api/health endpoint
- Extract peer_id and multiaddrs from health response
- Return only first multiaddr per agent (deduplication)
- Add proper error handling for unavailable agents

Key Logic (lines 87-103):

// Add only first multiaddr per agent to avoid duplicates
if len(health.Multiaddrs) > 0 {
    bootstrapPeers = append(bootstrapPeers, BootstrapPeer{
        Multiaddr: health.Multiaddrs[0],  // Only first
        PeerID:    health.PeerID,
        Name:      agent.ID,
        Priority:  priority + 1,
    })
}

3. HMMM Monitor Topic Names

File: /home/tony/chorus/project-queues/active/CHORUS/hmmm-monitor/main.go

Changes:

Fixed topic name case sensitivity (line 128):
- Was: "chorus/coordination/v1" (lowercase)
- Now: "CHORUS/coordination/v1" (uppercase)
Matches agent topic names from pubsub/pubsub.go:138-143
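Because topic names are plain strings and compared exactly, a mismatch in case silently produces a separate topic. The sketch below keeps the three topic names from this report as constants and reports which expected topics are missing from a subscription list. Sharing constants between agent and monitor is a suggestion of this sketch, not something the report says the project does, and `missingTopics` is a hypothetical helper.

```go
package main

import "fmt"

// Topic names as listed in this report.
const (
	TopicCoordination    = "CHORUS/coordination/v1"
	TopicMetaDiscussion  = "hmmm/meta-discussion/v1"
	TopicContextFeedback = "CHORUS/context-feedback/v1"
)

// missingTopics returns the expected topics that do not appear in subscribed.
// The comparison is exact, so "chorus/..." does not satisfy "CHORUS/...".
func missingTopics(subscribed []string) []string {
	have := make(map[string]bool, len(subscribed))
	for _, t := range subscribed {
		have[t] = true
	}
	var missing []string
	for _, want := range []string{TopicCoordination, TopicMetaDiscussion, TopicContextFeedback} {
		if !have[want] {
			missing = append(missing, want)
		}
	}
	return missing
}

func main() {
	// The pre-fix monitor configuration, with the lowercase coordination topic.
	before := []string{"chorus/coordination/v1", TopicMetaDiscussion, TopicContextFeedback}
	fmt.Println(missingTopics(before))
}
```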

Performance Metrics

Connection Success Rate

Target: 10/10 agents connected
Actual: 3/10 agents connected (30%)
Bottleneck: WHOOSH agent discovery

Message Visibility

Expected: All agent broadcasts visible to monitor
Actual: Only broadcasts from connected agents visible
Coverage: ~30% of mesh traffic

Connection Latency

Bootstrap fetch: < 1s
P2P connection establishment: < 1s per peer
GossipSub message propagation: < 100ms (estimated)

Recommended Improvements

High Priority

- Fix WHOOSH Agent Discovery
  - Problem: Only 3/10 agents discovered
  - Root Cause: p2pDiscovery.GetAgents() incomplete
  - Solution: Investigate discovery mechanism, possibly use Docker API directly
  - File: /home/tony/chorus/project-queues/active/WHOOSH/internal/discovery/...
- Add Health Check Retry Logic
  - Problem: WHOOSH may query agents before they're ready
  - Solution: Retry failed health checks with exponential backoff
  - File: /home/tony/chorus/project-queues/active/WHOOSH/internal/server/bootstrap.go
- Improve Multiaddr Filtering
  - Problem: Including all interfaces, not just routable ones
  - Solution: Filter for overlay network IPs only, exclude localhost/link-local
  - File: /home/tony/chorus/project-queues/active/CHORUS/api/http_server.go
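The health check retry recommended above might look roughly like this. It is a generic sketch of retry with exponential backoff, not WHOOSH code: `retryWithBackoff`, `errNotReady`, and the attempt count and delays are all illustrative assumptions, and the health check is replaced by a stand-in function.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errNotReady is a placeholder error for an agent that is still starting.
var errNotReady = errors.New("agent not ready")

// retryWithBackoff calls fn up to attempts times, doubling the wait after each
// failure, starting at base. It returns nil on the first success, or the last
// error if every attempt fails.
func retryWithBackoff(attempts int, base time.Duration, fn func() error) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}
	var err error
	delay := base
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		if i < attempts-1 {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return err
}

func main() {
	calls := 0
	// Stand-in for a health check that succeeds on the third try.
	check := func() error {
		calls++
		if calls < 3 {
			return errNotReady
		}
		return nil
	}
	err := retryWithBackoff(5, 10*time.Millisecond, check)
	fmt.Println(calls, err)
}
```

In a real handler the total retry time would need a cap so one slow agent cannot stall the whole bootstrap response.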

Medium Priority

- Add Mesh Topology Visibility
  - Enhancement: Monitor should report full mesh topology
  - Data Needed: Which agents are connected to which peers
  - UI: Add dashboard showing P2P mesh graph
- Implement Peer Discovery via DHT
  - Problem: Relying solely on WHOOSH for bootstrap
  - Solution: Add libp2p DHT for peer-to-peer discovery
  - Benefit: Agents can discover each other without WHOOSH
- Add Connection Quality Metrics
  - Enhancement: Track latency, bandwidth, reliability per peer
  - Data: Round-trip time, message success rate, connection uptime
  - Use: Identify and debug problematic P2P connections

Low Priority

- Support External Monitor Deployment
  - Limitation: Monitor must be on same overlay network
  - Solution: Use libp2p relay or expose agents on host network
  - Use Case: Monitor from laptop/external host
- Add Multiaddr Failover
  - Enhancement: Try all multiaddrs if first fails
  - Current: Only use first multiaddr per agent
  - Benefit: Better resilience to network issues
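The multiaddr failover item can be illustrated with a small loop that tries each address in turn. This is a sketch only: `connectFirstReachable` and `errUnreachable` are hypothetical, and the `dial` callback stands in for whatever call the monitor actually uses to open a connection.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable is a placeholder error used by the fake dialer below.
var errUnreachable = errors.New("unreachable")

// connectFirstReachable tries each address in order and returns the first one
// for which dial succeeds. dial is injected so the sketch stays self-contained.
func connectFirstReachable(addrs []string, dial func(addr string) error) (string, error) {
	var lastErr error
	for _, a := range addrs {
		if err := dial(a); err != nil {
			lastErr = err
			continue
		}
		return a, nil
	}
	if lastErr == nil {
		return "", errors.New("no addresses to try")
	}
	return "", fmt.Errorf("all %d addresses failed, last error: %w", len(addrs), lastErr)
}

func main() {
	addrs := []string{"/ip4/10.0.13.5/tcp/9000", "/ip4/10.0.13.227/tcp/9000"}
	// Fake dialer: pretend only the second address is reachable.
	dial := func(addr string) error {
		if addr == "/ip4/10.0.13.227/tcp/9000" {
			return nil
		}
		return errUnreachable
	}
	fmt.Println(connectFirstReachable(addrs, dial))
}
```

This requires WHOOSH to return every routable multiaddr per agent rather than only the first one.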

Testing Checklist

Functional Tests Needed

All 10 agents appear in bootstrap list
Monitor connects to all 10 agents
Monitor receives broadcasts from all agents
Agent restart doesn't break monitor connectivity
WHOOSH restart doesn't break monitor connectivity
Scale agents to 20 replicas → all visible to monitor

Performance Tests Needed

Message delivery latency < 100ms
Bootstrap list refresh < 1s
Monitor handles 100+ messages/sec
CPU/memory usage acceptable under load

Edge Cases to Test

Agent crashes/restarts → monitor reconnects
Network partition → monitor detects split
Duplicate peer IDs → handled gracefully
Invalid multiaddrs → skipped without crash
WHOOSH unavailable → monitor uses cached bootstrap
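The last edge case, falling back to a cached bootstrap list when WHOOSH is unavailable, could be handled along these lines. All names here (`bootstrapCache`, `Resolve`, `errWhooshDown`) are hypothetical, and the cache is in-memory only, so it would not survive a monitor restart.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errWhooshDown is a placeholder error for a failed bootstrap fetch.
var errWhooshDown = errors.New("WHOOSH unavailable")

// bootstrapCache remembers the last non-empty bootstrap list so the monitor
// can fall back to it when a fetch fails.
type bootstrapCache struct {
	mu   sync.Mutex
	last []string
}

// Resolve calls fetch. On success with a non-empty list it updates the cache.
// Otherwise it returns the cached list, if any, with fromCache set to true.
func (c *bootstrapCache) Resolve(fetch func() ([]string, error)) (peers []string, fromCache bool, err error) {
	fetched, fetchErr := fetch()
	c.mu.Lock()
	defer c.mu.Unlock()
	if fetchErr == nil && len(fetched) > 0 {
		c.last = append([]string(nil), fetched...) // copy so callers cannot mutate the cache
		return fetched, false, nil
	}
	if len(c.last) > 0 {
		return append([]string(nil), c.last...), true, nil
	}
	if fetchErr == nil {
		fetchErr = errors.New("bootstrap list is empty and nothing is cached")
	}
	return nil, false, fetchErr
}

func main() {
	cache := &bootstrapCache{}
	ok := func() ([]string, error) { return []string{"/ip4/10.0.13.227/tcp/9000"}, nil }
	down := func() ([]string, error) { return nil, errWhooshDown }

	fmt.Println(cache.Resolve(ok))   // fresh list, stored in the cache
	fmt.Println(cache.Resolve(down)) // served from the cache
}
```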

System Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                    Docker Swarm Overlay Network             │
│                    (lz9ny9bmvm6fzalvy9ckpxpcw)              │
│                                                             │
│  ┌───────────────┐     ┌──────────────┐                     │
│  │ CHORUS Agent  │────▶│   WHOOSH     │◀──── HTTP Query     │
│  │  (10 replicas)│     │  Bootstrap   │                     │
│  │               │     │   Server     │                     │
│  │ /api/health   │     │              │                     │
│  │  - peer_id    │     │ /api/v1/     │                     │
│  │  - multiaddrs │     │ bootstrap-   │                     │
│  │  - topics     │     │ peers        │                     │
│  └───────┬───────┘     └──────────────┘                     │
│          │                                                  │
│          │ GossipSub                                        │
│          │ Messages                                         │
│          ▼                                                  │
│  ┌───────────────┐                                          │
│  │ HMMM Monitor  │                                          │
│  │               │                                          │
│  │ Subscribes:   │                                          │
│  │ - CHORUS/     │                                          │
│  │   coordination│                                          │
│  │ - hmmm/meta   │                                          │
│  │ - context     │                                          │
│  └───────────────┘                                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Flow:
1. WHOOSH queries agent /api/health endpoints
2. Agents respond with peer_id + overlay IP multiaddrs
3. WHOOSH aggregates into bootstrap list
4. Monitor fetches bootstrap list
5. Monitor connects directly to agent overlay IPs
6. Monitor subscribes to GossipSub topics
7. Agents broadcast messages every 30s
8. Monitor receives and logs messages

Known Issues

Issue #1: Incomplete Agent Discovery

Severity: High
Impact: Only 30% of agents visible to monitor
Workaround: None
Fix Required: Investigate WHOOSH discovery mechanism

Issue #2: No Automatic Peer Discovery

Severity: Medium
Impact: Monitor relies on WHOOSH for all peer discovery
Workaround: Manual restart to refresh bootstrap
Fix Required: Implement DHT or mDNS discovery

Issue #3: Topic Name Case Sensitivity

Severity: Low (fixed)
Impact: Was preventing message reception
Fix: Corrected topic names to match agents
Status: Resolved

Deployment Instructions

Current Deployment State

All components deployed and running:

✅ CHORUS agents: 10 replicas (anthonyrawlins/chorus:latest)
✅ WHOOSH: 1 replica (anthonyrawlins/whoosh:latest)
✅ HMMM monitor: 1 replica (anthonyrawlins/hmmm-monitor:latest)

To Redeploy After Changes

# 1. Rebuild and deploy CHORUS agents
cd /home/tony/chorus/project-queues/active/CHORUS
env GOWORK=off go build -v -o build/chorus-agent ./cmd/agent
docker build -f Dockerfile.ubuntu -t anthonyrawlins/chorus:latest .
docker push anthonyrawlins/chorus:latest
ssh acacia "docker service update --image anthonyrawlins/chorus:latest CHORUS_chorus"

# 2. Rebuild and deploy WHOOSH
cd /home/tony/chorus/project-queues/active/WHOOSH
docker build -t anthonyrawlins/whoosh:latest .
docker push anthonyrawlins/whoosh:latest
ssh acacia "docker service update --image anthonyrawlins/whoosh:latest CHORUS_whoosh"

# 3. Rebuild and deploy HMMM monitor
cd /home/tony/chorus/project-queues/active/CHORUS/hmmm-monitor
docker build -t anthonyrawlins/hmmm-monitor:latest .
docker push anthonyrawlins/hmmm-monitor:latest
ssh acacia "docker service update --image anthonyrawlins/hmmm-monitor:latest CHORUS_hmmm-monitor"

# 4. Verify deployment
ssh acacia "docker service ps CHORUS_chorus CHORUS_whoosh CHORUS_hmmm-monitor"
ssh acacia "docker service logs --tail 20 CHORUS_hmmm-monitor"

References

Architecture Plan: /home/tony/chorus/project-queues/active/CHORUS/docs/P2P_MESH_ARCHITECTURE_PLAN.md
Agent Health Endpoint: /home/tony/chorus/project-queues/active/CHORUS/api/http_server.go:319-366
WHOOSH Bootstrap: /home/tony/chorus/project-queues/active/WHOOSH/internal/server/bootstrap.go:41-47
HMMM Monitor: /home/tony/chorus/project-queues/active/CHORUS/hmmm-monitor/main.go
Agent Pubsub: /home/tony/chorus/project-queues/active/CHORUS/pubsub/pubsub.go:138-143

Conclusion

The P2P mesh is functionally working but requires improvements to achieve full reliability and visibility. The primary blocker is WHOOSH's incomplete agent discovery, which prevents the monitor from seeing all 10 agents. Once this is resolved, the system should achieve 100% message visibility across the entire mesh.

Next Steps:

1. Debug WHOOSH agent discovery to ensure all 10 replicas are discovered
2. Add retry logic for health endpoint queries
3. Improve multiaddr filtering to exclude non-routable addresses
4. Add mesh topology monitoring and visualization

Status: ✅ Working, ⚠️ Needs Improvement
