Adopt IPvlan + Tailscale Backbone for Reliable Cross-Host CHORUS Mesh & Remote Clusters #16

Open
opened 2025-10-15 06:15:28 +00:00 by tony · 1 comment

Context / Problem

  • Swarm’s VXLAN overlay keeps pruning long-lived libp2p TCP sessions. Result: each CHORUS replica only sees peers on the same physical
    host; every Stage S2/S3 test tops out at 3–4 peers.
  • Existing Stage S1/S2/S3 harness now captures this clearly.
  • We also need to extend CHORUS to off-site clusters—overlay issues would be even worse across the WAN.

Libp2p demands a stable, routable L2, not the fragile overlay. Goal: move CHORUS onto a real subnet and use Tailscale to haul that
traffic between sites.

Proposed Solution

  1. Give CHORUS real IPs (L2)
    • Reserve 192.168.1.192/26 in the LAN (exclude it from the router's DHCP pool).

    • Create swarm-scoped IPvlan network on each node:

      docker network create -d ipvlan --scope swarm --attachable \
        --subnet 192.168.1.0/24 --gateway 192.168.1.1 \
        --ip-range 192.168.1.192/26 \
        -o parent=enp11s0 -o ipvlan_mode=l2 \
        chorus_ipvlan
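
    • Note: depending on the Docker version, swarm-scoped ipvlan networks may need to be built from per-node config-only networks rather than created directly on every node. A sketch of that pattern (the _conf suffix is a placeholder):

      # On every node: record the local ipvlan settings (creates no network yet)
      docker network create --config-only \
        --subnet 192.168.1.0/24 --gateway 192.168.1.1 \
        --ip-range 192.168.1.192/26 \
        -o parent=enp11s0 -o ipvlan_mode=l2 \
        chorus_ipvlan_conf

      # On one manager: create the swarm-scoped network from those configs
      docker network create -d ipvlan --scope swarm --attachable \
        --config-from chorus_ipvlan_conf \
        chorus_ipvlan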

    • Update compose files (docker/docker-compose.yml, stage testing compose files, hmmm-monitor) so all CHORUS services attach to
      chorus_ipvlan instead of the overlay.
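
    • A minimal compose sketch of that attachment, assuming the network is declared external (image name is a placeholder):

      services:
        chorus:
          image: registry.local/chorus:latest   # placeholder image reference
          networks:
            - chorus_ipvlan

      networks:
        chorus_ipvlan:
          external: true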

  2. Join every swarm node to Tailscale
    • Install tailscaled, then join the tailnet with a tagged auth key:

      tailscale up --authkey=<tskey> --hostname=$(hostname)-chorus \
        --ssh --accept-dns=false \
        --advertise-tags=tag:chorus-cluster \
        --advertise-routes=192.168.1.192/26

    • Approve route + ACLs in Tailscale Admin, enable IP forwarding (e.g. sysctl -w net.ipv4.ip_forward=1).
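
    • To persist forwarding across reboots, a sketch using a sysctl.d drop-in (the file name is an assumption):

      # /etc/sysctl.d/99-tailscale.conf keeps forwarding enabled for subnet routing
      echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-tailscale.conf
      echo 'net.ipv6.conf.all.forwarding = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf
      sudo sysctl -p /etc/sysctl.d/99-tailscale.conf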

    • Sample ACL:

      {
        "acls": [
          {
            "src": ["tag:chorus-cluster"],
            "dst": ["tag:chorus-cluster:*", "192.168.1.192/26:*", "192.168.2.192/26:*"]
          }
        ],
        "tagOwners": { "tag:chorus-cluster": ["group:admins"] },
        "ssh": [{ "action": "accept", "src": ["tag:chorus-cluster"], "dst": ["tag:chorus-cluster"] }]
      }

  3. Remote sites
    • On a new cluster (e.g. off-site) reserve another /26 (192.168.2.192/26), create the same IPvlan network with local parent NIC,
      join tailnet advertising the new route, approve in admin panel.
    • Now both sites can route to each other’s container IPs via Tailscale, no overlay involved.
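    • A sketch of the remote-site bootstrap, mirroring the commands above (the eth0 parent NIC is an assumption):

      # Remote node: same ipvlan pattern on the second /26
      docker network create -d ipvlan --scope swarm --attachable \
        --subnet 192.168.2.0/24 --gateway 192.168.2.1 \
        --ip-range 192.168.2.192/26 \
        -o parent=eth0 -o ipvlan_mode=l2 \
        chorus_ipvlan

      # Join the tailnet advertising the new container range
      tailscale up --authkey=<tskey> --hostname=$(hostname)-chorus \
        --ssh --accept-dns=false \
        --advertise-tags=tag:chorus-cluster \
        --advertise-routes=192.168.2.192/26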
  4. Validate
    • Re-run staged peer tests: python scripts/peer_connectivity_test.py ... for S1, S2, S3. Expect Stage S2 peer counts ≥5, Stage
      S3 >12.
    • Use tailscale status for health, docker service logs ... | grep "Connected Peers" for quick diagnostics.
    • All artefacts drop into artifacts/peer-connectivity/sX/ for review.
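    • Quick diagnostics sketch (the service name and remote IP are placeholders):

      tailscale status                                   # node health + approved routes
      ping -c 3 192.168.2.193                            # hypothetical remote container IP
      docker service logs <chorus-service> 2>&1 | grep "Connected Peers"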
  5. Rollout / safeguards
    • Ensure the router permanently reserves the container IP ranges.
    • Keep Tailscale ACLs tight (only CHORUS hosts/networks).
    • Document the tailscale up command/version in the infra repo so new nodes follow the same pattern.

Acceptance Criteria

  • CHORUS services run on the IPvlan network (chorus_ipvlan) with the reserved IP block.
  • All swarm nodes show up in Tailscale with tag:chorus-cluster, advertised routes approved.
  • Stage S1/S2/S3 tests produce full mesh peer counts (≥12 for S3), confirming no overlay churn.
  • Remote cluster bootstrap (at least lab setup) can dial peers across sites using the tailnet.
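  • A spot-check sketch for the first two criteria (the grep pattern is illustrative):

      docker network inspect chorus_ipvlan --format '{{json .Containers}}'   # reserved IPs in use
      tailscale status | grep chorus                                         # tagged nodes visible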

Notes / Extras

  • If you need host ↔ container access locally, add a macvlan/ipvlan shim (ip link add macvlan0 ...) on each host (see the sketch after this list), or switch the IPvlan mode to L3.
  • The harness already writes task-level logs and /health snapshots for easy debugging.
  • Once satisfied, remove the legacy overlay network (docker network rm CHORUS_chorus_net) so we don’t accidentally attach to it again.
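
  • Shim sketch for that host ↔ container access, assuming the enp11s0 parent from above (192.168.1.190 is a hypothetical spare address outside the container range):

      # An IPvlan parent interface can't reach its own child endpoints directly,
      # so give the host its own ipvlan leg on the same parent NIC
      sudo ip link add chorus-shim link enp11s0 type ipvlan mode l2
      sudo ip addr add 192.168.1.190/32 dev chorus-shim
      sudo ip link set chorus-shim up
      sudo ip route add 192.168.1.192/26 dev chorus-shim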

This plan migrates CHORUS to stable networking and lays the groundwork for hybrid deployments without sacrificing libp2p health.

tony added the help wanted, enhancement, bzzz-task labels 2025-10-15 06:15:28 +00:00
tony (Author, Owner) commented:

There's an existing account with Tailscale, and walnut, rosewood, acacia, and ironwood are already connected to it at the system level; see http://ironwood.tail04519c.ts.net/ for example.
