# P2P Mesh Status Report - HMMM Monitor Integration

**Date**: 2025-10-12
**Status**: ✅ Working (with limitations)
**System**: CHORUS agents + HMMM monitor + WHOOSH bootstrap

---

## Summary

The HMMM monitor is now successfully connected to the P2P mesh and receiving GossipSub messages from CHORUS agents. However, there are several limitations and inefficiencies that need addressing in future iterations.

---

## Current Working State

### What's Working ✅

1. **P2P Connections Established**
   - HMMM monitor connects to bootstrap peers via overlay network IPs
   - Monitor subscribes to 3 GossipSub topics:
     - `CHORUS/coordination/v1` (task coordination)
     - `hmmm/meta-discussion/v1` (meta-discussion)
     - `CHORUS/context-feedback/v1` (context feedback)

2. **Message Broadcast System**
   - Agents broadcast availability every 30 seconds
   - Messages include: `node_id`, `available_for_work`, `current_tasks`, `max_tasks`, `last_activity`, `status`, `timestamp` (see the struct sketch after this list)

3. **Docker Swarm Overlay Network**
   - Monitor and agents on same network: `lz9ny9bmvm6fzalvy9ckpxpcw`
   - Direct IP-based connections work within overlay network

4. **Bootstrap Discovery**
   - WHOOSH queries agent `/api/health` endpoints
   - Agents expose peer IDs and multiaddrs
   - Monitor fetches bootstrap list from WHOOSH

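For reference, the availability broadcast can be modelled roughly as the Go struct below. The field names come from the list above; the types, JSON tags, and the `AvailabilityBroadcast` name itself are assumptions for illustration, not the actual CHORUS wire format.

```go
package chorusmsgs

import "time"

// AvailabilityBroadcast mirrors the fields observed in the 30-second availability
// messages. Only the field names are documented; types and JSON tags are assumed.
type AvailabilityBroadcast struct {
    NodeID           string    `json:"node_id"`
    AvailableForWork bool      `json:"available_for_work"`
    CurrentTasks     int       `json:"current_tasks"`
    MaxTasks         int       `json:"max_tasks"`
    LastActivity     time.Time `json:"last_activity"`
    Status           string    `json:"status"`
    Timestamp        time.Time `json:"timestamp"`
}
```
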
---

## Key Issues & Limitations ⚠️

### 1. Limited Agent Discovery

**Problem**: Only 2-3 unique agents discovered out of 10 running replicas

**Evidence**:
```
✅ Fetched 3 bootstrap peers from WHOOSH
🔗 Connected to bootstrap peer: <peer.ID 12*isFYCH> (2 connections)
🔗 Connected to bootstrap peer: <peer.ID 12*RS37W6> (1 connection)
✅ Connected to 3/3 bootstrap peers
```

**Root Cause**: WHOOSH's P2P discovery mechanism (`p2pDiscovery.GetAgents()`) is not returning all 10 agent replicas consistently.

**Impact**:
- Monitor only connects to a subset of agents
- Some agents' messages may not be visible to the monitor
- The P2P mesh is incomplete

---

### 2. Docker Swarm VIP Load Balancing

**Problem**: Service DNS names (`chorus:8080`) use VIP load balancing, which breaks direct P2P connections

**Why This Breaks P2P**:
1. Monitor resolves `chorus:8080` → VIP load balancer
2. VIP routes to a random agent container
3. That container has a different peer ID than expected
4. libp2p handshake fails: "peer id mismatch"

**Current Workaround**:
- Agents expose overlay network IPs: `/ip4/10.0.13.x/tcp/9000/p2p/{peer_id}`
- Monitor connects directly to container IPs
- Bypasses the VIP load balancer

**Limitation**: Relies on overlay network IP addresses being stable and routable

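To make the workaround concrete, here is a minimal go-libp2p sketch of dialing one bootstrap entry; `connectBootstrap` is a hypothetical helper name, not the monitor's actual code.

```go
package meshmonitor

import (
    "context"
    "fmt"

    "github.com/libp2p/go-libp2p/core/host"
    "github.com/libp2p/go-libp2p/core/peer"
)

// connectBootstrap dials a single bootstrap entry of the form
// "/ip4/<overlay-ip>/tcp/9000/p2p/<peer-id>", as returned by WHOOSH.
func connectBootstrap(ctx context.Context, h host.Host, addr string) error {
    info, err := peer.AddrInfoFromString(addr)
    if err != nil {
        return fmt.Errorf("parse bootstrap multiaddr %q: %w", addr, err)
    }
    // libp2p verifies the remote identity against the /p2p/ component during the
    // security handshake; if a Swarm VIP routed this dial to a different replica,
    // Connect fails with a peer ID mismatch. That is why the bootstrap list uses
    // direct overlay IPs instead of the `chorus:8080` service DNS name.
    return h.Connect(ctx, *info)
}
```
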
---

### 3. Multiple Multiaddrs Per Agent

**Problem**: Each agent has multiple network interfaces (localhost + overlay IP), creating duplicate multiaddrs

**Example**:
```
Agent has 2 addresses:
- /ip4/127.0.0.1/tcp/9000 (localhost - skipped)
- /ip4/10.0.13.227/tcp/9000 (overlay IP - used)
```

**Current Fix**: WHOOSH now returns only the first multiaddr per agent

**Better Solution Needed**: Filter multiaddrs to include only routable overlay IPs and exclude localhost (see the sketch below)

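A sketch of what that filtering could look like, assuming the addresses are go-multiaddr values; `routableOnly` is a hypothetical helper name.

```go
package meshaddrs

import (
    "net"

    ma "github.com/multiformats/go-multiaddr"
)

// routableOnly keeps only multiaddrs whose IPv4 component is a routable
// (non-loopback, non-link-local) address, e.g. the Docker overlay IPs,
// and drops everything else.
func routableOnly(addrs []ma.Multiaddr) []ma.Multiaddr {
    var out []ma.Multiaddr
    for _, addr := range addrs {
        ipStr, err := addr.ValueForProtocol(ma.P_IP4)
        if err != nil {
            continue // no IPv4 component
        }
        ip := net.ParseIP(ipStr)
        if ip == nil || ip.IsLoopback() || ip.IsLinkLocalUnicast() {
            continue // 127.0.0.1, 169.254.x.x, etc.
        }
        out = append(out, addr)
    }
    return out
}
```
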
---

### 4. Incomplete Agent Health Endpoint

**Current Implementation** (`/home/tony/chorus/project-queues/active/CHORUS/api/http_server.go:319-366`):

The health endpoint currently exposes:
- `peer_id`: string
- `multiaddrs`: []string (all interfaces)
- `connected_peers`: int
- `gossipsub_topics`: []string

**Missing Information**:
- No agent metadata (capabilities, specialization, version)
- No P2P connection quality metrics
- No topic subscription status per peer
- No mesh topology visibility

---

### 5. WHOOSH Bootstrap Discovery Issues

**Problem**: WHOOSH's agent discovery is incomplete and inconsistent

**Observed Behavior**:
- Only 3-5 agents discovered out of 10 running
- Duplicate agent entries with different names:
  - `chorus-agent-001`
  - `chorus-agent-http-//chorus-8080`
  - `chorus-agent-http-//CHORUS_chorus-8080`

**Root Cause**: WHOOSH's P2P discovery mechanism is not reliably detecting all Swarm replicas

**Location**: `/home/tony/chorus/project-queues/active/WHOOSH/internal/server/bootstrap.go:48`
```go
agents := s.p2pDiscovery.GetAgents()
```

---

## Architecture Decisions Made

### 1. Use Overlay Network IPs Instead of Service DNS

**Rationale**:
- Service DNS uses VIP load balancing
- VIP breaks direct P2P connections (peer ID mismatch)
- Overlay IPs allow direct container-to-container communication

**Trade-offs**:
- ✅ P2P connections work
- ✅ No need for port-per-replica (20+ ports)
- ⚠️ Depends on overlay network IP stability
- ⚠️ IPs not externally routable (monitor must be on same network)

### 2. Single Multiaddr Per Agent in Bootstrap

**Rationale**:
- Avoid duplicate connections to same peer
- Simplify bootstrap list
- Reduce connection overhead

**Implementation**: WHOOSH returns only first multiaddr per agent

**Trade-offs**:
- ✅ No duplicate connections
- ✅ Cleaner bootstrap list
- ⚠️ No failover if first multiaddr unreachable
- ⚠️ Doesn't leverage libp2p multi-address resilience

### 3. Monitor on Same Overlay Network as Agents

**Rationale**:
- Overlay IPs only routable within overlay network
- Simplest solution for P2P connectivity

**Trade-offs**:
- ✅ Direct connectivity works
- ✅ No additional networking configuration
- ⚠️ Monitor tightly coupled to agent network
- ⚠️ Can't monitor from external networks

---

## Code Changes Summary

### 1. CHORUS Agent Health Endpoint
**File**: `/home/tony/chorus/project-queues/active/CHORUS/api/http_server.go`

**Changes**:
- Added `node *p2p.Node` field to HTTPServer
- Enhanced `handleHealth()` to expose:
  - `peer_id`: Full peer ID string
  - `multiaddrs`: Overlay network IPs with peer ID
  - `connected_peers`: Current P2P connection count
  - `gossipsub_topics`: Subscribed topics
- Added debug logging for address resolution

**Key Logic** (lines 319-366):
```go
// Extract overlay network IPs (skip localhost).
// Sketch of the handler logic; the ip/port extraction assumes Addresses()
// returns go-multiaddr values (ma = github.com/multiformats/go-multiaddr).
for _, addr := range h.node.Addresses() {
    ip, err := addr.ValueForProtocol(ma.P_IP4)
    if err != nil {
        continue // no IPv4 component
    }
    port, _ := addr.ValueForProtocol(ma.P_TCP)
    if ip == "127.0.0.1" || ip == "::1" {
        continue // Skip localhost
    }
    multiaddr := fmt.Sprintf("/ip4/%s/tcp/%s/p2p/%s", ip, port, h.node.ID().String())
    multiaddrs = append(multiaddrs, multiaddr)
}
```

### 2. WHOOSH Bootstrap Endpoint
**File**: `/home/tony/chorus/project-queues/active/WHOOSH/internal/server/bootstrap.go`

**Changes**:
- Modified `HandleBootstrapPeers()` to:
  - Query each agent's `/api/health` endpoint
  - Extract `peer_id` and `multiaddrs` from health response
  - Return only first multiaddr per agent (deduplication)
  - Add proper error handling for unavailable agents

**Key Logic** (lines 87-103):
```go
// Add only first multiaddr per agent to avoid duplicates
if len(health.Multiaddrs) > 0 {
    bootstrapPeers = append(bootstrapPeers, BootstrapPeer{
        Multiaddr: health.Multiaddrs[0], // Only first
        PeerID:    health.PeerID,
        Name:      agent.ID,
        Priority:  priority + 1,
    })
}
```

### 3. HMMM Monitor Topic Names
**File**: `/home/tony/chorus/project-queues/active/CHORUS/hmmm-monitor/main.go`

**Changes**:
- Fixed topic name case sensitivity (line 128):
  - Was: `"chorus/coordination/v1"` (lowercase)
  - Now: `"CHORUS/coordination/v1"` (uppercase)
  - Matches agent topic names from `pubsub/pubsub.go:138-143` (see the sketch below)

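A minimal sketch of the subscription side with go-libp2p-pubsub, showing why the exact topic strings matter; `subscribeAll` and `monitorTopics` are illustrative names, not the monitor's actual code.

```go
package meshmonitor

import (
    "context"

    pubsub "github.com/libp2p/go-libp2p-pubsub"
    "github.com/libp2p/go-libp2p/core/host"
)

// GossipSub topic names are compared byte-for-byte, so the case must match the
// agents exactly (see pubsub/pubsub.go:138-143).
var monitorTopics = []string{
    "CHORUS/coordination/v1",
    "hmmm/meta-discussion/v1",
    "CHORUS/context-feedback/v1",
}

// subscribeAll joins and subscribes to each monitored topic.
func subscribeAll(ctx context.Context, h host.Host) ([]*pubsub.Subscription, error) {
    ps, err := pubsub.NewGossipSub(ctx, h)
    if err != nil {
        return nil, err
    }
    var subs []*pubsub.Subscription
    for _, name := range monitorTopics {
        topic, err := ps.Join(name) // a lower-cased name silently joins a different topic
        if err != nil {
            return nil, err
        }
        sub, err := topic.Subscribe()
        if err != nil {
            return nil, err
        }
        subs = append(subs, sub)
    }
    return subs, nil
}
```
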
---

## Performance Metrics

### Connection Success Rate
- **Target**: 10/10 agents connected
- **Actual**: 3/10 agents connected (30%)
- **Bottleneck**: WHOOSH agent discovery

### Message Visibility
- **Expected**: All agent broadcasts visible to monitor
- **Actual**: Only broadcasts from connected agents visible
- **Coverage**: ~30% of mesh traffic

### Connection Latency
- **Bootstrap fetch**: < 1s
- **P2P connection establishment**: < 1s per peer
- **GossipSub message propagation**: < 100ms (estimated)

---

## Recommended Improvements

### High Priority

1. **Fix WHOOSH Agent Discovery**
   - **Problem**: Only 3/10 agents discovered
   - **Root Cause**: `p2pDiscovery.GetAgents()` incomplete
   - **Solution**: Investigate discovery mechanism, possibly use Docker API directly
   - **File**: `/home/tony/chorus/project-queues/active/WHOOSH/internal/discovery/...`

2. **Add Health Check Retry Logic**
   - **Problem**: WHOOSH may query agents before they're ready
   - **Solution**: Retry failed health checks with exponential backoff (see the sketch after this list)
   - **File**: `/home/tony/chorus/project-queues/active/WHOOSH/internal/server/bootstrap.go`

3. **Improve Multiaddr Filtering**
   - **Problem**: Including all interfaces, not just routable ones
   - **Solution**: Filter for overlay network IPs only, exclude localhost/link-local
   - **File**: `/home/tony/chorus/project-queues/active/CHORUS/api/http_server.go`

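A sketch of item 2, assuming a plain `net/http` client inside WHOOSH; `fetchHealthWithRetry` is a hypothetical helper, not existing WHOOSH code.

```go
package bootstrap

import (
    "context"
    "fmt"
    "net/http"
    "time"
)

// fetchHealthWithRetry retries an agent's /api/health endpoint with exponential
// backoff so agents that are still starting up are not silently dropped from
// the bootstrap list.
func fetchHealthWithRetry(ctx context.Context, client *http.Client, url string) (*http.Response, error) {
    backoff := 500 * time.Millisecond
    const maxAttempts = 5

    var lastErr error
    for attempt := 1; attempt <= maxAttempts; attempt++ {
        req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
        if err != nil {
            return nil, err
        }
        resp, err := client.Do(req)
        if err == nil && resp.StatusCode == http.StatusOK {
            return resp, nil
        }
        if err == nil {
            resp.Body.Close()
            err = fmt.Errorf("unexpected status %d", resp.StatusCode)
        }
        lastErr = err

        select {
        case <-ctx.Done():
            return nil, ctx.Err()
        case <-time.After(backoff):
            backoff *= 2 // 0.5s, 1s, 2s, 4s, ...
        }
    }
    return nil, fmt.Errorf("health check failed after %d attempts: %w", maxAttempts, lastErr)
}
```
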
### Medium Priority

4. **Add Mesh Topology Visibility**
   - **Enhancement**: Monitor should report full mesh topology
   - **Data Needed**: Which agents are connected to which peers
   - **UI**: Add dashboard showing P2P mesh graph

5. **Implement Peer Discovery via DHT**
   - **Problem**: Relying solely on WHOOSH for bootstrap
   - **Solution**: Add libp2p DHT for peer-to-peer discovery (see the sketch after this list)
   - **Benefit**: Agents can discover each other without WHOOSH

6. **Add Connection Quality Metrics**
   - **Enhancement**: Track latency, bandwidth, reliability per peer
   - **Data**: Round-trip time, message success rate, connection uptime
   - **Use**: Identify and debug problematic P2P connections

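A rough sketch of item 5 using go-libp2p's Kademlia DHT and routing discovery; the rendezvous string, helper name, and option choices are assumptions, not a worked design.

```go
package meshdiscovery

import (
    "context"

    dht "github.com/libp2p/go-libp2p-kad-dht"
    "github.com/libp2p/go-libp2p/core/host"
    "github.com/libp2p/go-libp2p/core/peer"
    drouting "github.com/libp2p/go-libp2p/p2p/discovery/routing"
    dutil "github.com/libp2p/go-libp2p/p2p/discovery/util"
)

// discoverViaDHT advertises a shared rendezvous string on the Kademlia DHT and
// looks up other nodes advertising the same string, removing the hard
// dependency on WHOOSH for peer discovery.
func discoverViaDHT(ctx context.Context, h host.Host, rendezvous string) (<-chan peer.AddrInfo, error) {
    kadDHT, err := dht.New(ctx, h, dht.Mode(dht.ModeServer))
    if err != nil {
        return nil, err
    }
    if err := kadDHT.Bootstrap(ctx); err != nil {
        return nil, err
    }

    disc := drouting.NewRoutingDiscovery(kadDHT)
    dutil.Advertise(ctx, disc, rendezvous) // announce ourselves under the rendezvous tag
    return disc.FindPeers(ctx, rendezvous) // stream of peers advertising the same tag
}
```
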
### Low Priority

7. **Support External Monitor Deployment**
   - **Limitation**: Monitor must be on same overlay network
   - **Solution**: Use libp2p relay or expose agents on host network
   - **Use Case**: Monitor from laptop/external host

8. **Add Multiaddr Failover**
   - **Enhancement**: Try all multiaddrs if first fails (see the sketch below)
   - **Current**: Only use first multiaddr per agent
   - **Benefit**: Better resilience to network issues

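A sketch of item 8: group every known transport address for a peer into one `peer.AddrInfo` so libp2p can try them in turn; `connectWithFailover` is a hypothetical helper, and the addresses are assumed to be plain transport multiaddrs (e.g. `/ip4/10.0.13.x/tcp/9000`).

```go
package meshmonitor

import (
    "context"

    "github.com/libp2p/go-libp2p/core/host"
    "github.com/libp2p/go-libp2p/core/peer"
    ma "github.com/multiformats/go-multiaddr"
)

// connectWithFailover hands libp2p every known address for the peer in one
// AddrInfo and lets it try them until one succeeds.
func connectWithFailover(ctx context.Context, h host.Host, id peer.ID, addrs []string) error {
    info := peer.AddrInfo{ID: id}
    for _, s := range addrs {
        addr, err := ma.NewMultiaddr(s)
        if err != nil {
            continue // skip unparsable entries rather than failing the whole peer
        }
        info.Addrs = append(info.Addrs, addr)
    }
    // Connect attempts the addresses until one works (or all fail).
    return h.Connect(ctx, info)
}
```
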
---

## Testing Checklist

### Functional Tests Needed
- [ ] All 10 agents appear in bootstrap list
- [ ] Monitor connects to all 10 agents
- [ ] Monitor receives broadcasts from all agents
- [ ] Agent restart doesn't break monitor connectivity
- [ ] WHOOSH restart doesn't break monitor connectivity
- [ ] Scale agents to 20 replicas → all visible to monitor

### Performance Tests Needed
- [ ] Message delivery latency < 100ms
- [ ] Bootstrap list refresh < 1s
- [ ] Monitor handles 100+ messages/sec
- [ ] CPU/memory usage acceptable under load

### Edge Cases to Test
- [ ] Agent crashes/restarts → monitor reconnects
- [ ] Network partition → monitor detects split
- [ ] Duplicate peer IDs → handled gracefully
- [ ] Invalid multiaddrs → skipped without crash
- [ ] WHOOSH unavailable → monitor uses cached bootstrap

---

## System Architecture Diagram

```
┌──────────────────────────────────────────────────────────┐
│              Docker Swarm Overlay Network                │
│               (lz9ny9bmvm6fzalvy9ckpxpcw)                │
│                                                          │
│   ┌──────────────┐     ┌──────────────┐                  │
│   │ CHORUS Agent │────▶│ WHOOSH       │◀──── HTTP Query  │
│   │ (10 replicas)│     │ Bootstrap    │                  │
│   │              │     │ Server       │                  │
│   │ /api/health  │     │              │                  │
│   │ - peer_id    │     │ /api/v1/     │                  │
│   │ - multiaddrs │     │ bootstrap-   │                  │
│   │ - topics     │     │ peers        │                  │
│   └───────┬──────┘     └──────────────┘                  │
│           │                                              │
│           │ GossipSub                                    │
│           │ Messages                                     │
│           ▼                                              │
│   ┌──────────────┐                                       │
│   │ HMMM Monitor │                                       │
│   │              │                                       │
│   │ Subscribes:  │                                       │
│   │ - CHORUS/    │                                       │
│   │  coordination│                                       │
│   │ - hmmm/meta  │                                       │
│   │ - context    │                                       │
│   └──────────────┘                                       │
│                                                          │
└──────────────────────────────────────────────────────────┘

Flow:
1. WHOOSH queries agent /api/health endpoints
2. Agents respond with peer_id + overlay IP multiaddrs
3. WHOOSH aggregates into bootstrap list
4. Monitor fetches bootstrap list
5. Monitor connects directly to agent overlay IPs
6. Monitor subscribes to GossipSub topics
7. Agents broadcast messages every 30s
8. Monitor receives and logs messages
```

---

## Known Issues

### Issue #1: Incomplete Agent Discovery
- **Severity**: High
- **Impact**: Only 30% of agents visible to monitor
- **Workaround**: None
- **Fix Required**: Investigate WHOOSH discovery mechanism

### Issue #2: No Automatic Peer Discovery
- **Severity**: Medium
- **Impact**: Monitor relies on WHOOSH for all peer discovery
- **Workaround**: Manual restart to refresh bootstrap
- **Fix Required**: Implement DHT or mDNS discovery

### Issue #3: Topic Name Case Sensitivity
- **Severity**: Low (fixed)
- **Impact**: Was preventing message reception
- **Fix**: Corrected topic names to match agents
- **Status**: Resolved

---

## Deployment Instructions

### Current Deployment State
All components deployed and running:
- ✅ CHORUS agents: 10 replicas (anthonyrawlins/chorus:latest)
- ✅ WHOOSH: 1 replica (anthonyrawlins/whoosh:latest)
- ✅ HMMM monitor: 1 replica (anthonyrawlins/hmmm-monitor:latest)

### To Redeploy After Changes

```bash
# 1. Rebuild and deploy CHORUS agents
cd /home/tony/chorus/project-queues/active/CHORUS
env GOWORK=off go build -v -o build/chorus-agent ./cmd/agent
docker build -f Dockerfile.ubuntu -t anthonyrawlins/chorus:latest .
docker push anthonyrawlins/chorus:latest
ssh acacia "docker service update --image anthonyrawlins/chorus:latest CHORUS_chorus"

# 2. Rebuild and deploy WHOOSH
cd /home/tony/chorus/project-queues/active/WHOOSH
docker build -t anthonyrawlins/whoosh:latest .
docker push anthonyrawlins/whoosh:latest
ssh acacia "docker service update --image anthonyrawlins/whoosh:latest CHORUS_whoosh"

# 3. Rebuild and deploy HMMM monitor
cd /home/tony/chorus/project-queues/active/CHORUS/hmmm-monitor
docker build -t anthonyrawlins/hmmm-monitor:latest .
docker push anthonyrawlins/hmmm-monitor:latest
ssh acacia "docker service update --image anthonyrawlins/hmmm-monitor:latest CHORUS_hmmm-monitor"

# 4. Verify deployment
ssh acacia "docker service ps CHORUS_chorus CHORUS_whoosh CHORUS_hmmm-monitor"
ssh acacia "docker service logs --tail 20 CHORUS_hmmm-monitor"
```

---

## References

- **Architecture Plan**: `/home/tony/chorus/project-queues/active/CHORUS/docs/P2P_MESH_ARCHITECTURE_PLAN.md`
- **Agent Health Endpoint**: `/home/tony/chorus/project-queues/active/CHORUS/api/http_server.go:319-366`
- **WHOOSH Bootstrap**: `/home/tony/chorus/project-queues/active/WHOOSH/internal/server/bootstrap.go:41-47`
- **HMMM Monitor**: `/home/tony/chorus/project-queues/active/CHORUS/hmmm-monitor/main.go`
- **Agent Pubsub**: `/home/tony/chorus/project-queues/active/CHORUS/pubsub/pubsub.go:138-143`

---

## Conclusion

The P2P mesh is **functionally working** but requires improvements to achieve full reliability and visibility. The primary blocker is WHOOSH's incomplete agent discovery, which prevents the monitor from seeing all 10 agents. Once this is resolved, the system should achieve 100% message visibility across the entire mesh.

**Next Steps**:
1. Debug WHOOSH agent discovery to ensure all 10 replicas are discovered
2. Add retry logic for health endpoint queries
3. Improve multiaddr filtering to exclude non-routable addresses
4. Add mesh topology monitoring and visualization

**Status**: ✅ Working, ⚠️ Needs Improvement