Adopt IPvlan + Tailscale Backbone for Reliable Cross-Host CHORUS Mesh & Remote Clusters #16

Open
opened 2025-10-15 06:15:28 +00:00 by tony · 1 comment

Context / Problem

  • Swarm’s VXLAN overlay keeps pruning long-lived libp2p TCP sessions. Result: each CHORUS replica only sees peers on the same physical
    host; every Stage S2/S3 test tops out at 3–4 peers.
  • Existing Stage S1/S2/S3 harness now captures this clearly.
  • We also need to extend CHORUS to off-site clusters—overlay issues would be even worse across the WAN.

Libp2p demands a stable, routable L2, not the fragile overlay. Goal: move CHORUS onto a real subnet and use Tailscale to haul that
traffic between sites.

Proposed Solution

  1. Give CHORUS real IPs (L2)
    • Reserve 192.168.1.192/26 in the LAN (exclude it from the router's DHCP pool).

    • Create swarm-scoped IPvlan network on each node:

      docker network create -d ipvlan --scope swarm --attachable \
        --subnet 192.168.1.0/24 --gateway 192.168.1.1 \
        --ip-range 192.168.1.192/26 \
        -o parent=enp11s0 -o ipvlan_mode=l2 \
        chorus_ipvlan
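
    • Note: depending on the Docker version, swarm-scoped ipvlan networks may need to be built from per-node config-only networks rather than created directly on every node. A sketch of that pattern (the _conf suffix is a placeholder):

      # On every node: record the local ipvlan settings (creates no network yet)
      docker network create --config-only \
        --subnet 192.168.1.0/24 --gateway 192.168.1.1 \
        --ip-range 192.168.1.192/26 \
        -o parent=enp11s0 -o ipvlan_mode=l2 \
        chorus_ipvlan_conf

      # On one manager: create the swarm-scoped network from those configs
      docker network create -d ipvlan --scope swarm --attachable \
        --config-from chorus_ipvlan_conf \
        chorus_ipvlan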

    • Update compose files (docker/docker-compose.yml, stage testing compose files, hmmm-monitor) so all CHORUS services attach to
      chorus_ipvlan instead of the overlay.
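
    • A minimal compose sketch of that attachment, assuming the network is declared external (image name is a placeholder):

      services:
        chorus:
          image: registry.local/chorus:latest   # placeholder image reference
          networks:
            - chorus_ipvlan

      networks:
        chorus_ipvlan:
          external: true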

  2. Join every swarm node to Tailscale
    • Install tailscaled, then join the tailnet with a tagged auth key:

      tailscale up --authkey=<tskey> --hostname=$(hostname)-chorus \
        --ssh --accept-dns=false \
        --advertise-tags=tag:chorus-cluster \
        --advertise-routes=192.168.1.192/26

    • Approve route + ACLs in Tailscale Admin, enable IP forwarding (e.g. sysctl -w net.ipv4.ip_forward=1).
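
    • To persist forwarding across reboots, a sketch using a sysctl.d drop-in (the file name is an assumption):

      # /etc/sysctl.d/99-tailscale.conf keeps forwarding enabled for subnet routing
      echo 'net.ipv4.ip_forward = 1' | sudo tee /etc/sysctl.d/99-tailscale.conf
      echo 'net.ipv6.conf.all.forwarding = 1' | sudo tee -a /etc/sysctl.d/99-tailscale.conf
      sudo sysctl -p /etc/sysctl.d/99-tailscale.conf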

    • Sample ACL:

      {
        "acls": [
          {
            "src": ["tag:chorus-cluster"],
            "dst": ["tag:chorus-cluster:*", "192.168.1.192/26:*", "192.168.2.192/26:*"]
          }
        ],
        "tagOwners": { "tag:chorus-cluster": ["group:admins"] },
        "ssh": [{ "action": "accept", "src": ["tag:chorus-cluster"], "dst": ["tag:chorus-cluster"] }]
      }

  3. Remote sites
    • On a new cluster (e.g. off-site) reserve another /26 (192.168.2.192/26), create the same IPvlan network with local parent NIC,
      join tailnet advertising the new route, approve in admin panel.
    • Now both sites can route to each other’s container IPs via Tailscale, no overlay involved.
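    • A sketch of the remote-site bootstrap, mirroring the commands above (the eth0 parent NIC is an assumption):

      # Remote node: same ipvlan pattern on the second /26
      docker network create -d ipvlan --scope swarm --attachable \
        --subnet 192.168.2.0/24 --gateway 192.168.2.1 \
        --ip-range 192.168.2.192/26 \
        -o parent=eth0 -o ipvlan_mode=l2 \
        chorus_ipvlan

      # Join the tailnet advertising the new container range
      tailscale up --authkey=<tskey> --hostname=$(hostname)-chorus \
        --ssh --accept-dns=false \
        --advertise-tags=tag:chorus-cluster \
        --advertise-routes=192.168.2.192/26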
  4. Validate
    • Re-run staged peer tests: python scripts/peer_connectivity_test.py ... for S1, S2, S3. Expect Stage S2 peer counts ≥5, Stage
      S3 >12.
    • Use tailscale status for health, docker service logs ... | grep "Connected Peers" for quick diagnostics.
    • All artefacts drop into artifacts/peer-connectivity/sX/ for review.
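    • Quick diagnostics sketch (the service name and remote IP are placeholders):

      tailscale status                                   # node health + approved routes
      ping -c 3 192.168.2.193                            # hypothetical remote container IP
      docker service logs <chorus-service> 2>&1 | grep "Connected Peers"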
  5. Rollout / safeguards
    • Ensure the router permanently reserves the container IP ranges.
    • Keep Tailscale ACLs tight (only CHORUS hosts/networks).
    • Document the tailscale up command/version in the infra repo so new nodes follow the same pattern.

Acceptance Criteria

  • CHORUS services run on the IPvlan network (chorus_ipvlan) with the reserved IP block.
  • All swarm nodes show up in Tailscale with tag:chorus-cluster, advertised routes approved.
  • Stage S1/S2/S3 tests produce full mesh peer counts (≥12 for S3), confirming no overlay churn.
  • Remote cluster bootstrap (at least lab setup) can dial peers across sites using the tailnet.
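  • A spot-check sketch for the first two criteria (the grep pattern is illustrative):

      docker network inspect chorus_ipvlan --format '{{json .Containers}}'   # reserved IPs in use
      tailscale status | grep chorus                                         # tagged nodes visible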

Notes / Extras

  • If you need host ↔ container access locally, add a macvlan/ipvlan shim (ip link add macvlan0 ...) on each host (see the sketch after this list), or switch the IPvlan mode to L3.
  • The harness already writes task-level logs and /health snapshots for easy debugging.
  • Once satisfied, remove the legacy overlay network (docker network rm CHORUS_chorus_net) so we don’t accidentally attach to it again.
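
  • Shim sketch for that host ↔ container access, assuming the enp11s0 parent from above (192.168.1.190 is a hypothetical spare address outside the container range):

      # An IPvlan parent interface can't reach its own child endpoints directly,
      # so give the host its own ipvlan leg on the same parent NIC
      sudo ip link add chorus-shim link enp11s0 type ipvlan mode l2
      sudo ip addr add 192.168.1.190/32 dev chorus-shim
      sudo ip link set chorus-shim up
      sudo ip route add 192.168.1.192/26 dev chorus-shim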

This plan migrates CHORUS to stable networking and lays the groundwork for hybrid deployments without sacrificing libp2p health.

tony added the help wanted, enhancement, bzzz-task labels 2025-10-15 06:15:28 +00:00
tony (Author, Owner) commented:

There's an existing account with Tailscale, and walnut, rosewood, acacia, and ironwood are already connected to it at the system level; see http://ironwood.tail04519c.ts.net/ for example.
