bzzz/infrastructure/docs/DEPLOYMENT_RUNBOOK.md
anthonyrawlins 065dddf8d5 Prepare for v2 development: Add MCP integration and future development planning
2025-08-07 14:38:22 +10:00

BZZZ v2 Deployment Runbook

Overview

This runbook provides step-by-step procedures for deploying, operating, and maintaining BZZZ v2 infrastructure. It covers normal operations, emergency procedures, and troubleshooting guidelines.

Prerequisites

System Requirements

  • Cluster: 3 nodes (WALNUT, IRONWOOD, ACACIA)
  • OS: Ubuntu 22.04 LTS or newer
  • Docker: Version 24+ with Swarm mode enabled
  • Storage: NFS mount at /rust/ with 500GB+ available
  • Network: Internal 192.168.1.0/24 with external internet access
  • Secrets: OpenAI API key and database credentials

Access Requirements

  • SSH access to all cluster nodes
  • Docker Swarm manager privileges
  • Sudo access for system configuration
  • GitLab access for CI/CD pipeline management

Pre-Deployment Checklist

Infrastructure Verification

# Verify Docker Swarm status
docker node ls
docker network ls | grep tengig

# Check available storage
df -h /rust/

# Verify network connectivity
ping -c 3 192.168.1.27  # WALNUT
ping -c 3 192.168.1.113 # IRONWOOD  
ping -c 3 192.168.1.xxx # ACACIA

# Test registry access
docker pull registry.home.deepblack.cloud/hello-world || echo "Registry access test FAILED"
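
The storage prerequisite (500GB+ available on /rust/) can be turned into a pass/fail check rather than read off `df -h` by eye. The following is a minimal sketch; the function name `has_min_free_gb` is hypothetical and not part of the BZZZ tooling.

```shell
# Succeed only if the filesystem holding PATH has at least MIN_GB gigabytes available.
has_min_free_gb() {
  local path="$1" min_gb="$2" avail_kb
  # df -P forces one output line per filesystem; with -k, column 4 is available 1K blocks
  avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
  [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Example (run on a cluster node):
#   has_min_free_gb /rust 500 || echo "Less than 500GB available on /rust" >&2
```

A non-zero exit status makes the check usable as a gate in a pre-deployment script.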

Security Hardening

# Run security hardening script
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/security
sudo ./security-hardening.sh

# Verify firewall status
sudo ufw status verbose

# Check fail2ban status
sudo fail2ban-client status

Deployment Procedures

1. Initial Deployment (Fresh Install)

Step 1: Prepare Infrastructure

# Create directory structure
mkdir -p /rust/bzzz-v2/{config,data,logs,backup}
mkdir -p /rust/bzzz-v2/data/{blobs,conversations,dht,postgres,redis}
mkdir -p /rust/bzzz-v2/config/{swarm,monitoring,security}

# Set permissions
sudo chown -R tony:tony /rust/bzzz-v2
chmod -R 755 /rust/bzzz-v2

Step 2: Configure Secrets and Configs

cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure

# Create Docker secrets
docker secret create bzzz_postgres_password config/secrets/postgres_password
docker secret create bzzz_openai_api_key ~/chorus/business/secrets/openai-api-key
docker secret create bzzz_grafana_admin_password config/secrets/grafana_admin_password

# Create Docker configs
docker config create bzzz_v2_config config/bzzz-config.yaml
docker config create bzzz_prometheus_config monitoring/configs/prometheus.yml
docker config create bzzz_alertmanager_config monitoring/configs/alertmanager.yml

Step 3: Deploy Core Services

# Deploy main BZZZ v2 stack
docker stack deploy -c docker-compose.swarm.yml bzzz-v2

# Wait for services to start (this may take 5-10 minutes)
watch docker stack ps bzzz-v2
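
As an alternative to watching the task list by hand, a script can poll until every service reports its desired replica count. This is a sketch under assumptions: the helper names `unconverged_services` and `wait_for_stack` are hypothetical, and the 10-second poll interval and default of 60 polls (about 10 minutes) are illustrative values chosen to match the wait time mentioned above.

```shell
# Read "NAME CURRENT/DESIRED" lines and print the names that have not converged.
unconverged_services() {
  awk '{ split($2, r, "/"); if (r[1] + 0 != r[2] + 0) print $1 }'
}

# Poll until every service in STACK is converged, or give up after TRIES polls.
wait_for_stack() {
  local stack="$1" tries="${2:-60}" pending
  while [ "$tries" -gt 0 ]; do
    pending=$(docker service ls \
      --filter "label=com.docker.stack.namespace=$stack" \
      --format '{{.Name}} {{.Replicas}}' | unconverged_services)
    [ -z "$pending" ] && return 0
    echo "Waiting for: $pending"
    sleep 10
    tries=$((tries - 1))
  done
  return 1
}

# Example: wait_for_stack bzzz-v2 || echo "Stack did not converge in time" >&2
```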

Step 4: Deploy Monitoring Stack

# Deploy monitoring services
docker stack deploy -c monitoring/docker-compose.monitoring.yml bzzz-monitoring

# Verify monitoring services
curl -f http://localhost:9090/-/healthy  # Prometheus
curl -f http://localhost:3000/api/health # Grafana

Step 5: Verify Deployment

# Check all services are running
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2

# Test external endpoints
curl -f https://bzzz.deepblack.cloud/health
curl -f https://mcp.deepblack.cloud/health
curl -f https://resolve.deepblack.cloud/health

# Check P2P mesh connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  curl -s http://localhost:9000/api/v2/peers | jq '.connected_peers | length'
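
Running the endpoint checks one at a time stops being practical once the list grows. The sketch below runs a check command against every URL and reports all failures in one pass. The function name `check_urls` is hypothetical; the check command is passed in as a string so that the `curl` options can be changed in one place.

```shell
# Run CHECK against every URL; print "FAIL <url>" for each failure.
# Returns non-zero if at least one URL failed.
check_urls() {
  local check="$1" failed=0 url
  shift
  for url in "$@"; do
    # $check is intentionally unquoted so "curl -fsS ..." splits into words
    if ! $check "$url" >/dev/null 2>&1; then
      echo "FAIL $url"
      failed=1
    fi
  done
  return "$failed"
}

# Example:
#   check_urls "curl -fsS --max-time 10" \
#     https://bzzz.deepblack.cloud/health \
#     https://mcp.deepblack.cloud/health \
#     https://resolve.deepblack.cloud/health
```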

2. Update Deployment (Rolling Update)

Step 1: Pre-Update Checks

# Check current deployment health
docker stack ps bzzz-v2 | grep -v "Shutdown\|Failed"

# Backup current configuration (compute the timestamp once so all files land in the same directory)
BACKUP_DIR=/rust/bzzz-v2/backup/$(date +%Y%m%d-%H%M%S)
mkdir -p "$BACKUP_DIR"
docker config ls | grep bzzz_ > "$BACKUP_DIR/configs.txt"
docker secret ls | grep bzzz_ > "$BACKUP_DIR/secrets.txt"

Step 2: Update Images

# Update to new image version
export NEW_IMAGE_TAG="v2.1.0"

# Update Docker Compose file with new image tags
sed -i "s/registry.home.deepblack.cloud\/bzzz:.*$/registry.home.deepblack.cloud\/bzzz:${NEW_IMAGE_TAG}/g" \
  docker-compose.swarm.yml

# Deploy updated stack (rolling update)
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
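
Because the `sed` expression uses `/` as its delimiter, a tag containing a slash or other unexpected character would corrupt the substitution. A small guard can reject such values before the compose file is touched. This is a sketch: the function name `is_valid_image_tag` is hypothetical, and the rule it applies is the Docker tag grammar (a word character followed by up to 127 word characters, dots, or hyphens).

```shell
# Succeed only if TAG is a syntactically valid Docker image tag.
is_valid_image_tag() {
  case "$1" in
    # empty, bad first character, or any character outside the allowed set
    ''|[!A-Za-z0-9_]*|*[!A-Za-z0-9_.-]*) return 1 ;;
  esac
  [ "${#1}" -le 128 ]
}

# Example:
#   is_valid_image_tag "$NEW_IMAGE_TAG" || { echo "Bad tag: $NEW_IMAGE_TAG" >&2; exit 1; }
```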

Step 3: Monitor Update Progress

# Watch rolling update progress
watch "docker service ps bzzz-v2_bzzz-agent | head -20"

# Check for any failed updates
docker service ps bzzz-v2_bzzz-agent --no-trunc | grep -Ei "failed|rejected"
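
Swarm also records the overall state of a rolling update on the service itself, which a script can read with `docker service inspect`. The sketch below maps that state to a simple verdict. The function name `update_verdict` and the ok/wait/fail labels are hypothetical; the state names are the ones Swarm reports in `UpdateStatus.State`.

```shell
# Map a Swarm UpdateStatus.State value to ok / wait / fail.
update_verdict() {
  case "$1" in
    completed|'') echo ok ;;                    # finished, or the service was never updated
    updating|rollback_started) echo wait ;;     # still in progress
    *) echo fail ;;                             # paused, rollback_paused, rollback_completed, unknown
  esac
}

# Example (the {{if}} guard avoids a template error when no update has run yet):
#   state=$(docker service inspect bzzz-v2_bzzz-agent \
#     --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}')
#   [ "$(update_verdict "$state")" = fail ] && docker service rollback bzzz-v2_bzzz-agent
```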

3. Migration from v1 to v2

# Use the automated migration script
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/migration-scripts

# Dry run first to preview changes
./migrate-v1-to-v2.sh --dry-run

# Execute full migration
./migrate-v1-to-v2.sh

# If rollback is needed
./migrate-v1-to-v2.sh --rollback

Monitoring and Health Checks

Health Check Commands

# Service health checks
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
docker service ps bzzz-v2_bzzz-agent --filter desired-state=running

# Application health checks
curl -f https://bzzz.deepblack.cloud/health
curl -f https://mcp.deepblack.cloud/health
curl -f https://resolve.deepblack.cloud/health
curl -f https://openai.deepblack.cloud/health

# P2P network health
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  curl -s http://localhost:9000/api/v2/dht/stats | jq '.'

# Database connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  pg_isready -U bzzz -d bzzz_v2

# Cache connectivity  
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_redis) \
  redis-cli ping

Performance Monitoring

# Check resource usage
docker stats --no-stream

# Monitor disk usage
df -h /rust/bzzz-v2/data/

# Check network connections
netstat -tuln | grep -E ":(9000|3001|3002|3003|9101|9102|9103)"

# Monitor OpenAI API usage
curl -s http://localhost:9203/metrics | grep openai_cost
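
The `grep` above prints one line per label combination. To get a single figure, the matching samples can be summed. This is a sketch that assumes the endpoint serves the Prometheus text exposition format without trailing timestamps; the function name `sum_metric` is hypothetical, and the exact metric names exposed on port 9203 are not confirmed by this runbook.

```shell
# Sum the values of all samples whose metric name starts with PREFIX.
# Reads Prometheus text exposition format on stdin.
sum_metric() {
  awk -v p="$1" '
    /^#/ { next }                          # skip HELP and TYPE comment lines
    index($1, p) == 1 { sum += $NF; n++ }  # value is the last field (no timestamps assumed)
    END { if (n) print sum; else print 0 }
  '
}

# Example:
#   curl -s http://localhost:9203/metrics | sum_metric openai_cost
```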

Troubleshooting Guide

Common Issues and Solutions

1. Service Won't Start

Symptoms: Service stuck in preparing or constantly restarting

Diagnosis:

# Check service logs
docker service logs bzzz-v2_bzzz-agent --tail 50

# Check node resources
docker node ls
docker system df

# Verify secrets and configs
docker secret ls | grep bzzz_
docker config ls | grep bzzz_

Solutions:

  • Check resource constraints and availability
  • Verify secrets and configs are accessible
  • Ensure image is available and correct
  • Check node labels and placement constraints
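
When a service will not start, the per-task error message usually names the cause (missing secret, unsatisfied placement constraint, image pull failure). The sketch below extracts only the failed or rejected tasks together with their errors. The function name `failed_tasks` and the `|` separator are choices made for this example.

```shell
# Read "NAME|CURRENT_STATE|ERROR" lines and print failed or rejected tasks with their error.
failed_tasks() {
  awk -F'|' '$2 ~ /^(Failed|Rejected)/ { print $1 ": " $3 }'
}

# Example:
#   docker service ps bzzz-v2_bzzz-agent --no-trunc \
#     --format '{{.Name}}|{{.CurrentState}}|{{.Error}}' | failed_tasks
```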

2. P2P Network Issues

Symptoms: Agents not discovering each other, DHT lookups failing

Diagnosis:

# Check peer connections
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  curl -s http://localhost:9000/api/v2/peers

# Check DHT bootstrap nodes
curl http://localhost:9101/health
curl http://localhost:9102/health  
curl http://localhost:9103/health

# Check network connectivity
docker network inspect bzzz-internal

Solutions:

  • Restart DHT bootstrap services
  • Check firewall rules for P2P ports
  • Verify Docker Swarm overlay network
  • Check for port conflicts

3. High OpenAI Costs

Symptoms: Cost alerts triggering, rate limits being hit

Diagnosis:

# Check current usage
curl -s http://localhost:9203/metrics | grep -E "openai_(cost|requests|tokens)"

# Check rate limiting
docker service logs bzzz-v2_openai-proxy --tail 100 | grep "rate limit"

Solutions:

  • Adjust rate limiting parameters
  • Review conversation patterns for excessive API calls
  • Implement request caching
  • Consider model selection optimization

4. Database Connection Issues

Symptoms: Service errors related to database connectivity

Diagnosis:

# Check PostgreSQL status
docker service logs bzzz-v2_postgres --tail 50

# Test connection from agent
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  pg_isready -h postgres -U bzzz

# Check connection limits
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  psql -U bzzz -d bzzz_v2 -c "SELECT count(*) FROM pg_stat_activity;"

Solutions:

  • Restart PostgreSQL service
  • Check connection pool settings
  • Increase max_connections if needed
  • Review long-running queries
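
Before raising max_connections it helps to know how close the server is to the limit. The sketch below computes connection usage as an integer percentage. The function name `conn_usage_pct` is hypothetical, and the 80% threshold in the usage example is an illustrative value, not a documented BZZZ setting.

```shell
# Print current connection usage as an integer percentage of the limit.
conn_usage_pct() {
  [ "$2" -gt 0 ] || return 1   # guard against division by zero
  echo $(( $1 * 100 / $2 ))
}

# Example:
#   pg=$(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres)
#   used=$(docker exec "$pg" psql -U bzzz -d bzzz_v2 -At -c "SELECT count(*) FROM pg_stat_activity;")
#   max=$(docker exec "$pg" psql -U bzzz -d bzzz_v2 -At -c "SHOW max_connections;")
#   [ "$(conn_usage_pct "$used" "$max")" -ge 80 ] && echo "Connection usage above 80%"
```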

5. Storage Issues

Symptoms: Disk full alerts, content store errors

Diagnosis:

# Check disk usage
df -h /rust/bzzz-v2/data/
du -sh /rust/bzzz-v2/data/blobs/

# Check content store health
curl -s http://localhost:9202/metrics | grep content_store

Solutions:

  • Run garbage collection on old blobs
  • Clean up old conversation threads
  • Increase storage capacity
  • Adjust retention policies
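
Before deleting anything, it is safer to list cleanup candidates and review them. The sketch below only lists files older than a given number of days and deletes nothing. The function name `list_old_files` is hypothetical, and file age is only a rough proxy: whether an old blob is still referenced by a conversation is not something this runbook describes, so treat the output as candidates, not as a deletion list.

```shell
# List regular files under DIR whose modification time is more than DAYS days old.
list_old_files() {
  find "$1" -type f -mtime +"$2" -print | sort
}

# Example (dry run, nothing is removed):
#   list_old_files /rust/bzzz-v2/data/blobs 90 | head -20
```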

Emergency Procedures

Service Outage Response

Priority 1: Complete Service Outage

# 1. Check cluster status
docker node ls
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2

# 2. Emergency restart of critical services
docker service update --force bzzz-v2_bzzz-agent
docker service update --force bzzz-v2_postgres
docker service update --force bzzz-v2_redis

# 3. If stack is corrupted, redeploy
docker stack rm bzzz-v2
sleep 60
docker stack deploy -c docker-compose.swarm.yml bzzz-v2

# 4. Monitor recovery
watch docker stack ps bzzz-v2
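
The fixed `sleep 60` in step 3 is a guess at how long stack removal takes; redeploying while the old networks still exist can fail. A condition-based wait is one alternative. This is a sketch: the helper names `wait_until` and `stack_is_gone` are hypothetical, and the 120-second timeout in the example is an illustrative value.

```shell
# Re-run a command until it succeeds or TIMEOUT seconds have elapsed.
wait_until() {
  local timeout="$1" interval="$2" deadline
  shift 2
  deadline=$(( $(date +%s) + timeout ))
  until "$@"; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep "$interval"
  done
}

# Succeed once no network labelled as belonging to STACK remains.
stack_is_gone() {
  [ -z "$(docker network ls -q --filter "label=com.docker.stack.namespace=$1")" ]
}

# Example:
#   docker stack rm bzzz-v2
#   wait_until 120 5 stack_is_gone bzzz-v2 && \
#     docker stack deploy -c docker-compose.swarm.yml bzzz-v2
```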

Priority 2: Partial Service Degradation

# 1. Identify problematic services
docker service ps bzzz-v2_bzzz-agent --no-trunc | grep -Ei "failed|rejected"

# 2. Scale up healthy replicas
docker service update --replicas 3 bzzz-v2_bzzz-agent

# 3. Remove unhealthy tasks
docker service update --force bzzz-v2_bzzz-agent

Security Incident Response

Step 1: Immediate Containment

# 1. Block suspicious IPs
sudo ufw insert 1 deny from SUSPICIOUS_IP

# 2. Check for compromise indicators
sudo fail2ban-client status
sudo tail -100 /var/log/audit/audit.log | grep -i "denied\|failed\|error"

# 3. Isolate affected services
docker service update --replicas 0 AFFECTED_SERVICE

Step 2: Investigation

# 1. Check access logs
docker service logs bzzz-v2_bzzz-agent --since 1h | grep -i "error\|failed\|unauthorized"

# 2. Review monitoring alerts (Alertmanager v2 API; active alerts only)
curl -s http://localhost:9093/api/v2/alerts | jq '.[] | select(.status.state=="active")'

# 3. Examine network connections
netstat -tuln
ss -tulpn | grep -E ":(9000|3001|3002|3003)"

Step 3: Recovery

# 1. Update security rules
./infrastructure/security/security-hardening.sh

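# 2. Rotate secrets if compromised
# (Docker refuses to remove a secret that a service still references; detach it with
#  "docker service update --secret-rm" or remove the stack first)
docker secret rm bzzz_postgres_password
openssl rand -base64 32 | docker secret create bzzz_postgres_password -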

# 3. Restart services with new secrets
docker stack deploy -c docker-compose.swarm.yml bzzz-v2

Data Recovery Procedures

Backup Restoration

# 1. Stop services
docker stack rm bzzz-v2

# 2. Restore from backup
BACKUP_DATE="20241201-120000"
rsync -av /rust/bzzz-v2/backup/$BACKUP_DATE/ /rust/bzzz-v2/data/

# 3. Restart services
docker stack deploy -c docker-compose.swarm.yml bzzz-v2

Database Recovery

# 1. Stop application services
docker service scale bzzz-v2_bzzz-agent=0

# 2. Back up the current database state before overwriting it
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  pg_dump -U bzzz bzzz_v2 > /rust/bzzz-v2/backup/database-$(date +%Y%m%d-%H%M%S).sql

# 3. Restore database (database-backup.sql stands for the dump file being restored)
docker exec -i $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  psql -U bzzz -d bzzz_v2 < /rust/bzzz-v2/backup/database-backup.sql

# 4. Restart application services
docker service scale bzzz-v2_bzzz-agent=3
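
Restoring from an empty or truncated file is an easy mistake under pressure. A quick sanity check on the dump file can run before step 3. This is a sketch: the function name `looks_like_pg_dump` is hypothetical, and it relies on the header comment that `pg_dump` writes at the top of plain-text SQL dumps, so it does not apply to compressed or custom-format dumps.

```shell
# Succeed if FILE is non-empty and starts with the plain-text pg_dump header.
looks_like_pg_dump() {
  [ -s "$1" ] && head -n 5 "$1" | grep -q 'PostgreSQL database dump'
}

# Example:
#   looks_like_pg_dump /rust/bzzz-v2/backup/database-backup.sql || \
#     { echo "Refusing to restore: file is not a plain pg_dump" >&2; exit 1; }
```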

Maintenance Procedures

Routine Maintenance (Weekly)

#!/bin/bash
# Weekly maintenance script

# 1. Check service health
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
docker system df

# 2. Clean up unused resources
docker system prune -f
docker volume prune -f

# 3. Backup critical data
pg_dump -h localhost -U bzzz bzzz_v2 | gzip > \
  /rust/bzzz-v2/backup/weekly-db-$(date +%Y%m%d).sql.gz

# 4. Rotate logs
find /rust/bzzz-v2/logs -name "*.log" -mtime +7 -delete

# 5. Check certificate expiration
openssl x509 -in /rust/bzzz-v2/config/tls/server/walnut.pem -noout -dates

# 6. Update security rules
fail2ban-client reload

# 7. Generate maintenance report
echo "Maintenance completed on $(date)" >> /rust/bzzz-v2/logs/maintenance.log

Scaling Procedures

Scale Up

# Increase replica count
docker service scale bzzz-v2_bzzz-agent=5
docker service scale bzzz-v2_mcp-server=5

# Add new node to cluster (run on new node)
docker swarm join --token $WORKER_TOKEN $MANAGER_IP:2377

# Label new node
docker node update --label-add bzzz.role=agent NEW_NODE_HOSTNAME

Scale Down

# Gracefully reduce replicas
docker service scale bzzz-v2_bzzz-agent=2
docker service scale bzzz-v2_mcp-server=2

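# Remove node from cluster
docker node update --availability drain NODE_HOSTNAME
# On the node being removed, once its tasks have drained:
docker swarm leave
# Back on a manager (the node must show as Down before it can be removed):
docker node rm NODE_HOSTNAME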

Performance Tuning

Database Optimization

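# PostgreSQL tuning
# ALTER SYSTEM cannot run inside a transaction block, so each statement gets its own -c
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  psql -U bzzz -d bzzz_v2 \
    -c "ALTER SYSTEM SET shared_buffers = '1GB';" \
    -c "ALTER SYSTEM SET max_connections = 200;" \
    -c "ALTER SYSTEM SET checkpoint_timeout = '15min';" \
    -c "SELECT pg_reload_conf();"

# shared_buffers and max_connections only take effect after a PostgreSQL restart
docker service update --force bzzz-v2_postgres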

Storage Optimization

# Content store optimization
find /rust/bzzz-v2/data/blobs -name "*.tmp" -mtime +1 -delete
find /rust/bzzz-v2/data/blobs -type f -size 0 -delete

# Compress old logs
find /rust/bzzz-v2/logs -name "*.log" -mtime +3 -exec gzip {} \;

Network Optimization

# Optimize network buffer sizes
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 4096 87380 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.tcp_wmem = 4096 65536 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
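
Appending with `tee -a` adds a duplicate line every time the commands are re-run. One alternative is to write each setting idempotently, replacing any existing assignment of the same key. This is a sketch: the function name `set_sysctl_line` and the drop-in file name in the example are hypothetical.

```shell
# Set KEY = VALUE in FILE, replacing any existing assignment of KEY.
set_sysctl_line() {
  local file="$1" key="$2" value="$3" key_re tmp
  key_re=$(printf '%s' "$key" | sed 's/\./\\./g')   # escape dots for the regex
  tmp=$(mktemp)
  if [ -f "$file" ]; then
    # keep every line except existing assignments of KEY
    grep -v -E "^[[:space:]]*${key_re}[[:space:]]*=" "$file" > "$tmp" || true
  fi
  printf '%s = %s\n' "$key" "$value" >> "$tmp"
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Example (run as root; 99-bzzz.conf is an illustrative file name):
#   set_sysctl_line /etc/sysctl.d/99-bzzz.conf net.core.rmem_max 134217728
#   sysctl --system
```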

Contact Information

On-Call Procedures

  • Primary Contact: DevOps Team Lead
  • Secondary Contact: Senior Site Reliability Engineer
  • Escalation: Platform Engineering Manager

Communication Channels

Documentation


This runbook should be reviewed and updated monthly. Last updated: 2025-08-07