Docker Swarm Networking Troubleshooting Guide
Date: July 8, 2025
Context: Comprehensive analysis of Docker Swarm routing mesh and Traefik integration issues
Status: Diagnostic guide based on official documentation and community findings
🎯 Executive Summary
This guide provides a troubleshooting framework for Docker Swarm networking issues, focusing on routing mesh failures and Traefik integration problems. It draws on official Docker and Traefik documentation, community forums, and practical testing to identify the most common root causes and lay out systematic diagnostic procedures.
📋 Problem Categories
1. Routing Mesh Failures
- Symptom: Published service ports not accessible via localhost:port
- Impact: Services only accessible via direct node IP addresses
- Root Cause: Infrastructure-level networking issues
2. Traefik Integration Issues
- Symptom: HTTPS endpoints return "Bad Gateway" (502)
- Impact: External access to services fails despite internal health
- Root Cause: Service discovery and overlay network connectivity
3. Selective Service Failures
- Symptom: Some services work via routing mesh while others fail
- Impact: Inconsistent service availability
- Root Cause: Service-specific configuration or placement issues
🔍 Diagnostic Framework
Phase 1: Infrastructure Validation
1.1 Required Port Connectivity
Docker Swarm requires specific ports to be open between ALL nodes:
# Test cluster management port
nc -zv <node-ip> 2377
# Test container network discovery (TCP/UDP)
nc -zv <node-ip> 7946
nc -zuv <node-ip> 7946
# Test overlay network data path
nc -zuv <node-ip> 4789
Expected Result: All ports should be reachable from all nodes
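The per-port probes above can be generated from one list of port/protocol pairs so that no port is tested with the wrong protocol. The following is a minimal sketch, assuming a netcat that supports `-z`, `-u`, `-v`, and `-w`; the function names `swarm_ports`, `nc_check_cmd`, and `swarm_port_checks` are hypothetical helpers, not Docker commands. A UDP probe with `nc -zu` can report success even when a firewall silently drops the packet, so UDP results are only a hint.

```shell
# Ports Docker Swarm needs between nodes, as port/protocol pairs
swarm_ports() {
    printf '%s\n' 2377/tcp 7946/tcp 7946/udp 4789/udp
}

# Print the nc command that probes one port/protocol pair on a node
# $1 = node IP or hostname, $2 = port/protocol (for example 4789/udp)
nc_check_cmd() {
    port=${2%/*}
    proto=${2#*/}
    if [ "$proto" = "udp" ]; then
        printf 'nc -z -u -v -w 3 %s %s\n' "$1" "$port"
    else
        printf 'nc -z -v -w 3 %s %s\n' "$1" "$port"
    fi
}

# Print every probe command for one node (nothing is executed here)
swarm_port_checks() {
    swarm_ports | while read -r spec; do
        nc_check_cmd "$1" "$spec"
    done
}
```

Running `swarm_port_checks <node-ip>` prints the commands for review; piping that output to `sh` executes them.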
1.2 Kernel Module Verification
Docker Swarm overlay networks require specific kernel modules:
# Check required kernel modules
lsmod | grep -E "(bridge|ip_tables|nf_nat|overlay|br_netfilter)"
# Load missing modules if needed
sudo modprobe bridge
sudo modprobe ip_tables
sudo modprobe nf_nat
sudo modprobe overlay
sudo modprobe br_netfilter
Expected Result: All modules should be loaded and active
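Reading the `lsmod` output by eye makes it easy to overlook a missing module. The sketch below reports which of the modules listed above are absent; `missing_modules` and `REQUIRED_MODULES` are hypothetical names introduced for illustration. `lsmod` lists only loadable modules, so a module compiled directly into the kernel shows up as "missing" here even though it is available.

```shell
REQUIRED_MODULES="bridge ip_tables nf_nat overlay br_netfilter"

# Read lsmod output on stdin and print each required module that is
# not listed, one per line
missing_modules() {
    # Skip the header line and keep only the module-name column
    loaded=$(awk 'NR > 1 { print $1 }')
    for mod in $REQUIRED_MODULES; do
        printf '%s\n' "$loaded" | grep -qx "$mod" || printf '%s\n' "$mod"
    done
}
```

Typical use is `lsmod | missing_modules`; empty output means every module in the list is loaded.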
1.3 Firewall Configuration
Ensure permissive rules for internal cluster communication:
# Add comprehensive internal subnet rules
sudo ufw allow from 192.168.1.0/24 to any
sudo ufw allow out to 192.168.1.0/24
# Add specific Docker Swarm ports
sudo ufw allow 2377/tcp
sudo ufw allow 7946
sudo ufw allow 4789/udp
Expected Result: All cluster traffic should be permitted
Phase 2: Docker Swarm Health Assessment
2.1 Cluster Status Validation
# Check overall cluster health
docker node ls
# Verify node addresses
docker node inspect <node-name> --format '{{.Status.Addr}}'
# Check swarm configuration
docker system info | grep -A 10 "Swarm"
Expected Result: All nodes should be "Ready" with proper IP addresses
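On larger clusters it can help to filter the node list down to the entries that need attention. This is a small sketch that assumes the `--format` placeholders `.Hostname`, `.Status`, and `.Availability` of `docker node ls`; the function names `not_ready_nodes` and `check_swarm_nodes` are hypothetical.

```shell
# Read "hostname status availability" lines on stdin and print the
# hostname and status of every node whose status is not Ready
not_ready_nodes() {
    awk '$2 != "Ready" { print $1, $2 }'
}

# Hypothetical wrapper: query the swarm (run on a manager node)
check_swarm_nodes() {
    docker node ls --format '{{.Hostname}} {{.Status}} {{.Availability}}' | not_ready_nodes
}
```

Empty output from `check_swarm_nodes` means every node reports "Ready".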
2.2 Ingress Network Inspection
# Examine ingress network configuration
docker network inspect ingress
# Check ingress network containers
docker network inspect ingress --format '{{json .Containers}}' | python3 -m json.tool
# Verify ingress network subnet
docker network inspect ingress --format '{{json .IPAM.Config}}'
Expected Result: Ingress network should contain active service containers
2.3 Service Port Publishing Verification
# Check service port configuration
docker service inspect <service-name> --format '{{json .Endpoint.Ports}}'
# Verify service placement
docker service ps <service-name>
# Check service labels (for Traefik)
docker service inspect <service-name> --format '{{json .Spec.Labels}}'
Expected Result: Ports should be properly published with "ingress" mode
Phase 3: Service-Specific Diagnostics
3.1 Internal Service Connectivity
# Test service-to-service communication
docker run --rm --network <network-name> alpine/curl -s http://<service-name>:<port>/health
# Check DNS resolution (alpine/curl runs curl as its entrypoint, so use plain alpine for nslookup)
docker run --rm --network <network-name> alpine nslookup <service-name>
# Test direct container connectivity
docker run --rm --network <network-name> alpine/curl -s http://<container-ip>:<port>/health
Expected Result: Services should be reachable via service names
3.2 Routing Mesh Validation
# Test routing mesh functionality
curl -s http://localhost:<published-port>/ --connect-timeout 5
# Test from different nodes
ssh <node-ip> "curl -s http://localhost:<published-port>/ --connect-timeout 5"
# Check port binding status
ss -tulpn | grep :<published-port>
Expected Result: Services should be accessible from all nodes
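To compare nodes side by side, the same probe can be run against each node's address from a single machine. The sketch below is one way to do that, assuming the service speaks HTTP on the published port; `mesh_probe` is a hypothetical helper. curl reports the code `000` when the connection fails or times out.

```shell
# Probe a published port on each node and print "<node> <http-code>"
# $1 = published port, remaining arguments = node IPs or hostnames
mesh_probe() {
    port=$1
    shift
    for node in "$@"; do
        # Fall back to 000 if curl exits with an error
        code=$(curl -s -o /dev/null -w '%{http_code}' --connect-timeout 5 "http://$node:$port/") || code=000
        printf '%s %s\n' "$node" "$code"
    done
}
```

For example, `mesh_probe <published-port> <node-ip-1> <node-ip-2>` prints one line per node; a node that shows `000` while others return a real status code points at that node's ingress path.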
3.3 Traefik Integration Assessment
# Test Traefik service discovery
curl -s https://traefik.home.deepblack.cloud/api/rawdata
# Check Traefik service status
docker service logs <traefik-service> --tail 20
# Verify certificate provisioning
curl -I https://<service-domain>/
Expected Result: Traefik should discover services and provision certificates
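When Traefik does not pick up a service, the service labels are worth checking before digging into the network. The sketch below assumes Traefik v2 or later with the Swarm provider and checks for the label names that setup commonly needs; the exact set depends on the Traefik configuration (for example, `traefik.enable=true` only matters when services are not exposed by default), and `missing_traefik_labels` is a hypothetical helper.

```shell
# Read "key=value" label lines on stdin and print the label patterns
# (assumed Traefik v2+ names) that are absent
missing_traefik_labels() {
    labels=$(cat)
    printf '%s\n' "$labels" | grep -q '^traefik\.enable=true$' \
        || echo 'traefik.enable=true'
    printf '%s\n' "$labels" | grep -q '^traefik\.http\.routers\.[^.]*\.rule=' \
        || echo 'traefik.http.routers.<name>.rule'
    printf '%s\n' "$labels" | grep -q '^traefik\.http\.services\.[^.]*\.loadbalancer\.server\.port=' \
        || echo 'traefik.http.services.<name>.loadbalancer.server.port'
}

# Example input (labels as key=value lines):
# docker service inspect <service-name> \
#   --format '{{range $k, $v := .Spec.Labels}}{{$k}}={{$v}}{{"\n"}}{{end}}' \
#   | missing_traefik_labels
```

Empty output means all three label patterns were found; it does not validate the label values themselves.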
🛠️ Common Resolution Strategies
Strategy 1: Infrastructure Fixes
Firewall Resolution
# Apply comprehensive firewall rules
sudo ufw allow from 192.168.1.0/24 to any
sudo ufw allow out to 192.168.1.0/24
sudo ufw allow 2377/tcp
sudo ufw allow 7946
sudo ufw allow 4789/udp
Kernel Module Resolution
# Load all required modules (-a is needed to load several modules in one call)
sudo modprobe -a bridge ip_tables nf_nat overlay br_netfilter
# Make persistent (add to /etc/modules)
echo -e "bridge\nip_tables\nnf_nat\noverlay\nbr_netfilter" | sudo tee -a /etc/modules
Docker Daemon Restart
# Restart Docker daemon to reset networking
sudo systemctl restart docker
# Wait for swarm reconvergence
sleep 60
# Verify cluster health
docker node ls
Strategy 2: Configuration Fixes
Service Placement Optimization
# Remove restrictive placement constraints
deploy:
  placement:
    constraints: [] # Remove manager-only constraints
Network Configuration
# Ensure proper network configuration
networks:
  - hive-network # Internal communication
  - tengig # Traefik integration
Port Mapping Standardization
# Add explicit port mappings for debugging
ports:
  - "<external-port>:<internal-port>"
Strategy 3: Advanced Troubleshooting
Data Path Port Change
# If port 4789 conflicts, choose a different data path port
# (the flag only takes effect when the swarm is first initialized)
docker swarm init --data-path-port=4790
Service Force Restart
# Force service restart to reset networking
docker service update --force <service-name>
Ingress Network Recreation
# Nuclear option: recreate ingress network
# (remove services that publish ports first; docker network rm asks for confirmation)
docker network rm ingress
docker network create \
  --driver overlay \
  --ingress \
  --subnet=10.0.0.0/24 \
  --gateway=10.0.0.1 \
  --opt com.docker.network.driver.mtu=1200 \
  ingress
📊 Diagnostic Checklist
Infrastructure Level
- All required ports open between nodes (2377/tcp, 7946/tcp and udp, 4789/udp)
- Kernel modules loaded (bridge, ip_tables, nf_nat, overlay, br_netfilter)
- Firewall rules permit cluster communication
- No network interface checksum offloading issues
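The checksum offloading item can be checked from the output of `ethtool -k <interface>`. The sketch below only detects whether TX checksum offload is reported as on; `tx_checksum_offload_on` is a hypothetical helper, and whether offloading actually causes trouble depends on the NIC and hypervisor. A commonly cited workaround on affected hosts is `ethtool -K <interface> tx off`, which should be tested before being made permanent.

```shell
# Read `ethtool -k <interface>` output on stdin; exit 0 if TX checksum
# offload is reported as on, non-zero otherwise
tx_checksum_offload_on() {
    grep -Eq '^[[:space:]]*tx-checksumming:[[:space:]]*on'
}
```

For example: `ethtool -k <interface> | tx_checksum_offload_on && echo "TX checksum offload is on"`.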
Docker Swarm Level
- All nodes in "Ready" state
- Proper node IP addresses configured
- Ingress network contains service containers
- Service ports properly published with "ingress" mode
Service Level
- Services respond to internal health checks
- DNS resolution works for service names
- Traefik labels correctly formatted
- Services connected to proper networks
Application Level
- Applications bind to 0.0.0.0 (not localhost)
- Health check endpoints respond correctly
- No port conflicts between services
- Proper service dependencies configured
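The first application-level item, binding to 0.0.0.0 rather than localhost, can be verified from the listener table inside the container. This is a minimal sketch that assumes `ss -tln` output, where the local address is the fourth column; `loopback_only_listeners` is a hypothetical helper, and the container image must include `ss` for the example to work.

```shell
# Read `ss -tln` output on stdin and print listeners on the given port
# that are bound to a loopback address
# $1 = port number
loopback_only_listeners() {
    awk -v port="$1" '
        NR > 1 {
            addr = $4
            if (addr ~ (":" port "$") && (addr ~ /^127\./ || addr ~ /^\[::1\]/))
                print addr
        }'
}
```

For example, `docker exec <container> ss -tln | loopback_only_listeners <port>` prints a line such as `127.0.0.1:<port>` when the application is reachable only from inside its own container; empty output means no loopback-bound listener was found on that port.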
🔄 Systematic Troubleshooting Process
Step 1: Quick Validation
# Test basic connectivity
curl -s http://localhost:80/ --connect-timeout 2 # Should work (Traefik)
curl -s http://localhost:<service-port>/ --connect-timeout 2 # Test target service
Step 2: Infrastructure Assessment
# Run infrastructure diagnostics
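for port in 2377 7946; do nc -zv <node-ip> "$port"; done # TCP ports
for port in 7946 4789; do nc -zuv <node-ip> "$port"; done # UDP ports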
lsmod | grep -E "(bridge|ip_tables|nf_nat|overlay|br_netfilter)"
docker node ls
Step 3: Service-Specific Testing
# Test direct service connectivity
curl -s http://<node-ip>:<service-port>/health
docker service ps <service-name>
docker service inspect <service-name> --format '{{json .Endpoint.Ports}}'
Step 4: Network Deep Dive
# Analyze network configuration
docker network inspect ingress
docker network inspect <service-network>
ss -tulpn | grep <service-port>
Step 5: Resolution Implementation
# Apply fixes based on findings
sudo ufw allow from 192.168.1.0/24 to any # Fix firewall
sudo modprobe -a overlay bridge # Fix kernel modules
docker service update --force <service-name> # Reset service
🎯 Key Insights
Critical Understanding
- Routing Mesh vs Service Discovery: Traefik reaches backend services over shared overlay networks, not through the routing mesh
- Port Requirements: Ports 2377/tcp, 7946/tcp and udp, and 4789/udp must be open between ALL nodes
- Kernel Dependencies: Overlay networks require specific kernel modules
- Firewall Impact: Most routing mesh issues are firewall-related
Best Practices
- Always test infrastructure first before troubleshooting applications
- Use permissive firewall rules for internal cluster communication
- Verify kernel modules in containerized environments
- Test routing mesh systematically across all nodes
Common Pitfalls
- Assuming localhost works: localhost can resolve to the IPv6 address ::1, where the routing mesh may not answer; test with 127.0.0.1 or the node IP as well
- Ignoring kernel modules: Missing modules cause silent failures
- Firewall confusion: Docker manages its own iptables rules, so UFW rules may not cover all Docker traffic
- Service placement assumptions: Placement constraints can break routing
🚀 Quick Reference Commands
Infrastructure Testing
# Test all required ports
for port in 2377 7946; do nc -zv <node-ip> "$port"; done # TCP ports
for port in 7946 4789; do nc -zuv <node-ip> "$port"; done # UDP ports
# Check kernel modules
lsmod | grep -E "(bridge|ip_tables|nf_nat|overlay|br_netfilter)"
# Test routing mesh
curl -s http://localhost:<port>/ --connect-timeout 5
Service Diagnostics
# Service health check
docker service ps <service-name>
docker service inspect <service-name> --format '{{json .Endpoint.Ports}}'
curl -s http://<node-ip>:<port>/health
Network Analysis
# Network inspection
docker network inspect ingress
docker network inspect <service-network>
ss -tulpn | grep <port>
Refer to this guide whenever Docker Swarm networking issues arise; it provides a systematic approach to diagnosis and resolution.