# BZZZ v2 Deployment Runbook

## Overview

This runbook provides step-by-step procedures for deploying, operating, and maintaining BZZZ v2 infrastructure. It covers normal operations, emergency procedures, and troubleshooting guidelines.

## Prerequisites

### System Requirements

- **Cluster**: 3 nodes (WALNUT, IRONWOOD, ACACIA)
- **OS**: Ubuntu 22.04 LTS or newer
- **Docker**: Version 24+ with Swarm mode enabled
- **Storage**: NFS mount at `/rust/` with 500GB+ available
- **Network**: Internal 192.168.1.0/24 with external internet access
- **Secrets**: OpenAI API key and database credentials

### Access Requirements

- SSH access to all cluster nodes
- Docker Swarm manager privileges
- Sudo access for system configuration
- GitLab access for CI/CD pipeline management

## Pre-Deployment Checklist

### Infrastructure Verification

```bash
# Verify Docker Swarm status
docker node ls
docker network ls | grep tengig

# Check available storage
df -h /rust/

# Verify network connectivity
ping -c 3 192.168.1.27   # WALNUT
ping -c 3 192.168.1.113  # IRONWOOD
ping -c 3 192.168.1.xxx  # ACACIA (substitute the node's actual address)

# Test registry access
docker pull registry.home.deepblack.cloud/hello-world || echo "Registry access FAILED"
```

### Security Hardening

```bash
# Run security hardening script
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/security
sudo ./security-hardening.sh

# Verify firewall status
sudo ufw status verbose

# Check fail2ban status
sudo fail2ban-client status
```

## Deployment Procedures

### 1. Initial Deployment (Fresh Install)

#### Step 1: Prepare Infrastructure

```bash
# Create directory structure
mkdir -p /rust/bzzz-v2/{config,data,logs,backup}
mkdir -p /rust/bzzz-v2/data/{blobs,conversations,dht,postgres,redis}
mkdir -p /rust/bzzz-v2/config/{swarm,monitoring,security}

# Set permissions
sudo chown -R tony:tony /rust/bzzz-v2
chmod -R 755 /rust/bzzz-v2
```

#### Step 2: Configure Secrets and Configs

```bash
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure

# Create Docker secrets
docker secret create bzzz_postgres_password config/secrets/postgres_password
docker secret create bzzz_openai_api_key ~/chorus/business/secrets/openai-api-key
docker secret create bzzz_grafana_admin_password config/secrets/grafana_admin_password

# Create Docker configs
docker config create bzzz_v2_config config/bzzz-config.yaml
docker config create bzzz_prometheus_config monitoring/configs/prometheus.yml
docker config create bzzz_alertmanager_config monitoring/configs/alertmanager.yml
```

#### Step 3: Deploy Core Services

```bash
# Deploy main BZZZ v2 stack
docker stack deploy -c docker-compose.swarm.yml bzzz-v2

# Wait for services to start (this may take 5-10 minutes)
watch docker stack ps bzzz-v2
```

#### Step 4: Deploy Monitoring Stack

```bash
# Deploy monitoring services
docker stack deploy -c monitoring/docker-compose.monitoring.yml bzzz-monitoring

# Verify monitoring services
curl -f http://localhost:9090/-/healthy  # Prometheus
curl -f http://localhost:3000/api/health # Grafana
```

#### Step 5: Verify Deployment

```bash
# Check all services are running
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2

# Test external endpoints
curl -f https://bzzz.deepblack.cloud/health
curl -f https://mcp.deepblack.cloud/health
curl -f https://resolve.deepblack.cloud/health

# Check P2P mesh connectivity
# (docker exec only sees containers on the current node; run this where an agent task is placed)
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
    curl -s http://localhost:9000/api/v2/peers | jq '.connected_peers | length'
```

### 2. Update Deployment (Rolling Update)

#### Step 1: Pre-Update Checks

```bash
# Check current deployment health
docker stack ps bzzz-v2 | grep -v "Shutdown\|Failed"

# Record current config and secret names (compute the timestamp once so all files share a directory)
BACKUP_DIR="/rust/bzzz-v2/backup/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
docker config ls | grep bzzz_ > "$BACKUP_DIR/configs.txt"
docker secret ls | grep bzzz_ > "$BACKUP_DIR/secrets.txt"
```

#### Step 2: Update Images

```bash
# Update to new image version
export NEW_IMAGE_TAG="v2.1.0"

# Update Docker Compose file with new image tags
sed -i "s/registry.home.deepblack.cloud\/bzzz:.*$/registry.home.deepblack.cloud\/bzzz:${NEW_IMAGE_TAG}/g" \
    docker-compose.swarm.yml

# Deploy updated stack (rolling update)
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
```

#### Step 3: Monitor Update Progress

```bash
# Watch rolling update progress
watch "docker service ps bzzz-v2_bzzz-agent | head -20"

# Check for any failed updates
# (docker service ps has no current-state filter, so match on the CurrentState column)
docker service ps bzzz-v2_bzzz-agent --format '{{.Name}}  {{.CurrentState}}  {{.Error}}' \
    | grep -i "failed\|rejected"
```

### 3. Migration from v1 to v2

```bash
# Use the automated migration script
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/migration-scripts

# Dry run first to preview changes
./migrate-v1-to-v2.sh --dry-run

# Execute full migration
./migrate-v1-to-v2.sh

# If rollback is needed
./migrate-v1-to-v2.sh --rollback
```

## Monitoring and Health Checks

### Health Check Commands

```bash
# Service health checks
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
docker service ps bzzz-v2_bzzz-agent --filter desired-state=running

# Application health checks
curl -f https://bzzz.deepblack.cloud/health
curl -f https://mcp.deepblack.cloud/health
curl -f https://resolve.deepblack.cloud/health
curl -f https://openai.deepblack.cloud/health

# P2P network health
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
    curl -s http://localhost:9000/api/v2/dht/stats | jq '.'

# Database connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
    pg_isready -U bzzz -d bzzz_v2

# Cache connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_redis) \
    redis-cli ping
```

### Performance Monitoring

```bash
# Check resource usage
docker stats --no-stream

# Monitor disk usage
df -h /rust/bzzz-v2/data/

# Check network connections
netstat -tuln | grep -E ":(9000|3001|3002|3003|9101|9102|9103)"

# Monitor OpenAI API usage
curl -s http://localhost:9203/metrics | grep openai_cost
```

## Troubleshooting Guide

### Common Issues and Solutions

#### 1. Service Won't Start

**Symptoms:** Service stuck in `preparing` or constantly restarting

**Diagnosis:**

```bash
# Check service logs
docker service logs bzzz-v2_bzzz-agent --tail 50

# Check node resources
docker node ls
docker system df

# Verify secrets and configs
docker secret ls | grep bzzz_
docker config ls | grep bzzz_
```

**Solutions:**

- Check resource constraints and availability
- Verify secrets and configs are accessible
- Ensure image is available and correct
- Check node labels and placement constraints

#### 2. P2P Network Issues

**Symptoms:** Agents not discovering each other, DHT lookups failing

**Diagnosis:**

```bash
# Check peer connections
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
    curl -s http://localhost:9000/api/v2/peers

# Check DHT bootstrap nodes
curl http://localhost:9101/health
curl http://localhost:9102/health
curl http://localhost:9103/health

# Check network connectivity
docker network inspect bzzz-internal
```

**Solutions:**

- Restart DHT bootstrap services
- Check firewall rules for P2P ports
- Verify Docker Swarm overlay network
- Check for port conflicts

#### 3. High OpenAI Costs

**Symptoms:** Cost alerts triggering, rate limits being hit

**Diagnosis:**

```bash
# Check current usage
curl -s http://localhost:9203/metrics | grep -E "openai_(cost|requests|tokens)"

# Check rate limiting
docker service logs bzzz-v2_openai-proxy --tail 100 | grep "rate limit"
```

**Solutions:**

- Adjust rate limiting parameters
- Review conversation patterns for excessive API calls
- Implement request caching
- Consider model selection optimization

#### 4. Database Connection Issues

**Symptoms:** Service errors related to database connectivity

**Diagnosis:**

```bash
# Check PostgreSQL status
docker service logs bzzz-v2_postgres --tail 50

# Test connection from agent
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
    pg_isready -h postgres -U bzzz

# Check connection limits
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
    psql -U bzzz -d bzzz_v2 -c "SELECT count(*) FROM pg_stat_activity;"
```

**Solutions:**

- Restart PostgreSQL service
- Check connection pool settings
- Increase max_connections if needed
- Review long-running queries

#### 5. Storage Issues

**Symptoms:** Disk full alerts, content store errors

**Diagnosis:**

```bash
# Check disk usage
df -h /rust/bzzz-v2/data/
du -sh /rust/bzzz-v2/data/blobs/

# Check content store health
curl -s http://localhost:9202/metrics | grep content_store
```

**Solutions:**

- Run garbage collection on old blobs
- Clean up old conversation threads
- Increase storage capacity
- Adjust retention policies

## Emergency Procedures

### Service Outage Response

#### Priority 1: Complete Service Outage

```bash
# 1. Check cluster status
docker node ls
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2

# 2. Emergency restart of critical services
docker service update --force bzzz-v2_bzzz-agent
docker service update --force bzzz-v2_postgres
docker service update --force bzzz-v2_redis

# 3. If stack is corrupted, redeploy
docker stack rm bzzz-v2
sleep 60
docker stack deploy -c docker-compose.swarm.yml bzzz-v2

# 4. Monitor recovery
watch docker stack ps bzzz-v2
```

#### Priority 2: Partial Service Degradation

```bash
# 1. Identify problematic services
# (docker service ps has no current-state filter, so match on the CurrentState column)
docker service ps bzzz-v2_bzzz-agent --format '{{.Name}}  {{.CurrentState}}  {{.Error}}' \
    | grep -i "failed\|rejected"

# 2. Scale up healthy replicas
docker service update --replicas 3 bzzz-v2_bzzz-agent

# 3. Remove unhealthy tasks
docker service update --force bzzz-v2_bzzz-agent
```

### Security Incident Response

#### Step 1: Immediate Containment

```bash
# 1. Block suspicious IPs
sudo ufw insert 1 deny from SUSPICIOUS_IP

# 2. Check for compromise indicators
sudo fail2ban-client status
sudo tail -100 /var/log/audit/audit.log | grep -i "denied\|failed\|error"

# 3. Isolate affected services
docker service update --replicas 0 AFFECTED_SERVICE
```

#### Step 2: Investigation

```bash
# 1. Check access logs
docker service logs bzzz-v2_bzzz-agent --since 1h | grep -i "error\|failed\|unauthorized"

# 2. Review firing alerts (queried from Prometheus, which tracks the firing state)
curl -s http://localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'

# 3. Examine network connections
netstat -tuln
ss -tulpn | grep -E ":(9000|3001|3002|3003)"
```

#### Step 3: Recovery

```bash
# 1. Update security rules
sudo ./infrastructure/security/security-hardening.sh

# 2. Rotate secrets if compromised
#    (docker secret rm fails while a running service still references the secret;
#    remove the stack or detach the secret from its services first)
docker secret rm bzzz_postgres_password
openssl rand -base64 32 | docker secret create bzzz_postgres_password -

# 3. Restart services with new secrets
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
```

### Data Recovery Procedures

#### Backup Restoration

```bash
# 1. Stop services
docker stack rm bzzz-v2

# 2. Restore from backup
BACKUP_DATE="20241201-120000"
rsync -av "/rust/bzzz-v2/backup/${BACKUP_DATE}/" /rust/bzzz-v2/data/

# 3. Restart services
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
```

#### Database Recovery

```bash
# 1. Stop application services
docker service scale bzzz-v2_bzzz-agent=0

# 2. Create database backup
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
    pg_dump -U bzzz bzzz_v2 > /rust/bzzz-v2/backup/database-$(date +%Y%m%d-%H%M%S).sql

# 3. Restore database (substitute the dump file you want to restore)
docker exec -i $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
    psql -U bzzz -d bzzz_v2 < /rust/bzzz-v2/backup/database-backup.sql

# 4. Restart application services
docker service scale bzzz-v2_bzzz-agent=3
```

## Maintenance Procedures

### Routine Maintenance (Weekly)

```bash
#!/bin/bash
# Weekly maintenance script

# 1. Check service health
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
docker system df

# 2. Clean up unused resources
docker system prune -f
docker volume prune -f

# 3. Backup critical data
pg_dump -h localhost -U bzzz bzzz_v2 | gzip > \
    /rust/bzzz-v2/backup/weekly-db-$(date +%Y%m%d).sql.gz

# 4. Rotate logs
find /rust/bzzz-v2/logs -name "*.log" -mtime +7 -delete

# 5. Check certificate expiration
openssl x509 -in /rust/bzzz-v2/config/tls/server/walnut.pem -noout -dates

# 6. Update security rules
fail2ban-client reload

# 7. Generate maintenance report
echo "Maintenance completed on $(date)" >> /rust/bzzz-v2/logs/maintenance.log
```

### Scaling Procedures

#### Scale Up

```bash
# Increase replica count
docker service scale bzzz-v2_bzzz-agent=5
docker service scale bzzz-v2_mcp-server=5

# Add new node to cluster (run on new node)
docker swarm join --token "$WORKER_TOKEN" "$MANAGER_IP:2377"

# Label new node
docker node update --label-add bzzz.role=agent NEW_NODE_HOSTNAME
```

#### Scale Down

```bash
# Gracefully reduce replicas
docker service scale bzzz-v2_bzzz-agent=2
docker service scale bzzz-v2_mcp-server=2

# Remove node from cluster
# (docker node rm refuses an active node; run `docker swarm leave` on that node after draining)
docker node update --availability drain NODE_HOSTNAME
docker node rm NODE_HOSTNAME
```

## Performance Tuning

### Database Optimization

```bash
# PostgreSQL tuning
# ALTER SYSTEM cannot run inside a transaction block, and psql wraps a multi-statement
# -c string in one transaction, so each statement gets its own -c.
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
    psql -U bzzz -d bzzz_v2 \
    -c "ALTER SYSTEM SET shared_buffers = '1GB';" \
    -c "ALTER SYSTEM SET max_connections = 200;" \
    -c "ALTER SYSTEM SET checkpoint_timeout = '15min';" \
    -c "SELECT pg_reload_conf();"
# shared_buffers and max_connections only take effect after a PostgreSQL restart,
# e.g. docker service update --force bzzz-v2_postgres
```

### Storage Optimization

```bash
# Content store optimization
find /rust/bzzz-v2/data/blobs -name "*.tmp" -mtime +1 -delete
find /rust/bzzz-v2/data/blobs -type f -size 0 -delete

# Compress old logs
find /rust/bzzz-v2/logs -name "*.log" -mtime +3 -exec gzip {} \;
```

### Network Optimization

```bash
# Optimize network buffer sizes
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 4096 87380 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.tcp_wmem = 4096 65536 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```

## Contact Information

### On-Call Procedures

- **Primary Contact**: DevOps Team Lead
- **Secondary Contact**: Senior Site Reliability Engineer
- **Escalation**: Platform Engineering Manager

### Communication Channels

- **Slack**: #bzzz-incidents
- **Email**: devops@deepblack.cloud
- **Phone**: Emergency On-Call Rotation

### Documentation

- **Runbooks**: This document
- **Architecture**: `/docs/BZZZ_V2_INFRASTRUCTURE_ARCHITECTURE.md`
- **API Documentation**: https://bzzz.deepblack.cloud/docs
- **Monitoring Dashboards**: https://grafana.deepblack.cloud

---

*This runbook should be reviewed and updated monthly. Last updated: $(date)*
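
## Appendix: Retrying Health Checks

Several procedures in this runbook deploy a stack and then run one-shot `curl -f` health checks, while noting that services may take 5-10 minutes to start. A check run too early fails even when the deployment is healthy. The sketch below shows one way to wrap those checks in a retry loop. The `retry` and `check_endpoints` function names, the attempt count, and the delay are assumptions made for illustration and are not part of the BZZZ tooling; the endpoint URLs are the ones listed under Health Check Commands.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command until it succeeds or the attempt budget runs out.
# Usage: retry MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
retry() {
    local max_attempts="$1"
    local delay="$2"
    local attempt=1
    shift 2
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Hypothetical wrapper: poll each external health endpoint from this runbook.
# 30 attempts x 20 seconds covers the 10-minute startup window (assumed values).
check_endpoints() {
    local url
    for url in \
        https://bzzz.deepblack.cloud/health \
        https://mcp.deepblack.cloud/health \
        https://resolve.deepblack.cloud/health; do
        retry 30 20 curl -fsS -o /dev/null "$url" || return 1
    done
}
```

After `docker stack deploy`, sourcing this file and calling `check_endpoints` returns non-zero as soon as one endpoint exhausts its retries, which makes it usable as a gate in a CI/CD job. The definitions have no side effects until a function is called.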