bzzz/infrastructure/docs/DEPLOYMENT_RUNBOOK.md
anthonyrawlins 065dddf8d5 Prepare for v2 development: Add MCP integration and future development planning
- Add FUTURE_DEVELOPMENT.md with comprehensive v2 protocol specification
- Add MCP integration design and implementation foundation
- Add infrastructure and deployment configurations
- Update system architecture for v2 evolution

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-07 14:38:22 +10:00


# BZZZ v2 Deployment Runbook
## Overview
This runbook provides step-by-step procedures for deploying, operating, and maintaining BZZZ v2 infrastructure. It covers normal operations, emergency procedures, and troubleshooting guidelines.
## Prerequisites
### System Requirements
- **Cluster**: 3 nodes (WALNUT, IRONWOOD, ACACIA)
- **OS**: Ubuntu 22.04 LTS or newer
- **Docker**: Version 24+ with Swarm mode enabled
- **Storage**: NFS mount at `/rust/` with 500GB+ available
- **Network**: Internal 192.168.1.0/24 with external internet access
- **Secrets**: OpenAI API key and database credentials
### Access Requirements
- SSH access to all cluster nodes
- Docker Swarm manager privileges
- Sudo access for system configuration
- GitLab access for CI/CD pipeline management
## Pre-Deployment Checklist
### Infrastructure Verification
```bash
# Verify Docker Swarm status
docker node ls
docker network ls | grep tengig
# Check available storage
df -h /rust/
# Verify network connectivity
ping -c 3 192.168.1.27 # WALNUT
ping -c 3 192.168.1.113 # IRONWOOD
ping -c 3 192.168.1.xxx # ACACIA
# Test registry access
docker pull registry.home.deepblack.cloud/hello-world || echo "Registry access test FAILED"
```
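The 500GB storage requirement can be checked mechanically instead of by reading `df -h` output. The following is a minimal sketch; the function name `check_free_gb` is hypothetical and not part of the BZZZ tooling, and it reports GiB (1024-based), which is slightly stricter than decimal GB.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the BZZZ tooling): succeed only if the
# filesystem holding PATH has at least MIN_GB GiB available.
check_free_gb() {
  local path="$1" min_gb="$2" avail_kb avail_gb
  # -P forces one output line per filesystem; -k reports 1024-byte blocks.
  avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
  # An empty result means df failed (for example, the path does not exist).
  [ -n "$avail_kb" ] || return 2
  avail_gb=$((avail_kb / 1024 / 1024))
  echo "${path}: ${avail_gb} GiB available (need ${min_gb})"
  [ "$avail_gb" -ge "$min_gb" ]
}
```

A possible pre-deployment gate would be `check_free_gb /rust/ 500 || exit 1`.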
### Security Hardening
```bash
# Run security hardening script
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/security
sudo ./security-hardening.sh
# Verify firewall status
sudo ufw status verbose
# Check fail2ban status
sudo fail2ban-client status
```
## Deployment Procedures
### 1. Initial Deployment (Fresh Install)
#### Step 1: Prepare Infrastructure
```bash
# Create directory structure
mkdir -p /rust/bzzz-v2/{config,data,logs,backup}
mkdir -p /rust/bzzz-v2/data/{blobs,conversations,dht,postgres,redis}
mkdir -p /rust/bzzz-v2/config/{swarm,monitoring,security}
# Set permissions
sudo chown -R tony:tony /rust/bzzz-v2
chmod -R 755 /rust/bzzz-v2
```
#### Step 2: Configure Secrets and Configs
```bash
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure
# Create Docker secrets
docker secret create bzzz_postgres_password config/secrets/postgres_password
docker secret create bzzz_openai_api_key ~/chorus/business/secrets/openai-api-key
docker secret create bzzz_grafana_admin_password config/secrets/grafana_admin_password
# Create Docker configs
docker config create bzzz_v2_config config/bzzz-config.yaml
docker config create bzzz_prometheus_config monitoring/configs/prometheus.yml
docker config create bzzz_alertmanager_config monitoring/configs/alertmanager.yml
```
#### Step 3: Deploy Core Services
```bash
# Deploy main BZZZ v2 stack
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
# Wait for services to start (this may take 5-10 minutes)
watch docker stack ps bzzz-v2
```
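Watching `docker stack ps` by hand works for an interactive deployment, but a scripted wait is easier to reuse from CI. The sketch below parses the output of `docker service ls --format '{{.Name}} {{.Replicas}}'`; the function names and the default 600-second timeout are assumptions chosen to match the "5-10 minutes" estimate above.

```bash
#!/usr/bin/env bash
# Reads "NAME RUNNING/DESIRED" lines on stdin (the shape produced by
#   docker service ls --format '{{.Name}} {{.Replicas}}')
# and succeeds only if at least one service is listed and every service
# has all of its desired replicas running.
all_replicas_ready() {
  awk '
    { split($2, r, "/")
      if (r[1] != r[2]) { print "not ready: " $1 " (" $2 ")"; bad = 1 } }
    END { exit (bad || NR == 0) }
  '
}

# Hypothetical wrapper: poll a stack every 10 seconds until it is ready,
# giving up after TIMEOUT seconds (default 600).
wait_for_stack() {
  local stack="$1" timeout="${2:-600}" waited=0
  until docker service ls \
          --filter "label=com.docker.stack.namespace=${stack}" \
          --format '{{.Name}} {{.Replicas}}' | all_replicas_ready; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 10
    waited=$((waited + 10))
  done
}
```

After `docker stack deploy`, calling `wait_for_stack bzzz-v2` returns 0 once every service reports full replicas and 1 if the timeout is reached first.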
#### Step 4: Deploy Monitoring Stack
```bash
# Deploy monitoring services
docker stack deploy -c monitoring/docker-compose.monitoring.yml bzzz-monitoring
# Verify monitoring services
curl -f http://localhost:9090/-/healthy # Prometheus
curl -f http://localhost:3000/api/health # Grafana
```
#### Step 5: Verify Deployment
```bash
# Check all services are running
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
# Test external endpoints
curl -f https://bzzz.deepblack.cloud/health
curl -f https://mcp.deepblack.cloud/health
curl -f https://resolve.deepblack.cloud/health
# Check P2P mesh connectivity
# (run on a node that hosts a bzzz-agent task: `docker ps` only lists local containers)
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
curl -s http://localhost:9000/api/v2/peers | jq '.connected_peers | length'
```
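Right after a deployment the external endpoints may need a few seconds before they answer, so a single `curl -f` can fail spuriously. A minimal sketch of a retrying probe follows; the function names, retry counts, and delays are assumptions, not values taken from the BZZZ configuration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: probe URL up to ATTEMPTS times, pausing DELAY seconds
# between tries. Returns 0 on the first successful response.
probe() {
  local url="$1" attempts="${2:-5}" delay="${3:-10}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f: treat HTTP >= 400 as failure; -sS: quiet, but still print errors;
    # --max-time: do not hang on an unresponsive endpoint.
    if curl -fsS --max-time 10 -o /dev/null "$url"; then
      echo "OK   $url"
      return 0
    fi
    [ "$i" -lt "$attempts" ] && sleep "$delay"
    i=$((i + 1))
  done
  echo "FAIL $url"
  return 1
}

# Probe every endpoint and report all failures instead of stopping at the first.
check_endpoints() {
  local rc=0 url
  for url in "$@"; do
    probe "$url" 3 5 || rc=1
  done
  return "$rc"
}
```

For this deployment the call would be `check_endpoints https://bzzz.deepblack.cloud/health https://mcp.deepblack.cloud/health https://resolve.deepblack.cloud/health`.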
### 2. Update Deployment (Rolling Update)
#### Step 1: Pre-Update Checks
```bash
# Check current deployment health
docker stack ps bzzz-v2 | grep -v "Shutdown\|Failed"
# Backup current configuration (names only: `docker secret ls` cannot export secret values)
# Compute the timestamp once so all three commands use the same directory
BACKUP_DIR="/rust/bzzz-v2/backup/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
docker config ls | grep bzzz_ > "$BACKUP_DIR/configs.txt"
docker secret ls | grep bzzz_ > "$BACKUP_DIR/secrets.txt"
```
#### Step 2: Update Images
```bash
# Update to new image version
export NEW_IMAGE_TAG="v2.1.0"
# Update Docker Compose file with new image tags
sed -i "s/registry.home.deepblack.cloud\/bzzz:.*$/registry.home.deepblack.cloud\/bzzz:${NEW_IMAGE_TAG}/g" \
docker-compose.swarm.yml
# Deploy updated stack (rolling update)
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
```
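Because `sed -i` rewrites the compose file in place, it can help to preview the substitution first. The sketch below applies an equivalent pattern without `-i`; the helper name `preview_tag_update` and the compose fragment are made up for illustration, and the real service definitions in `docker-compose.swarm.yml` may differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print FILE with every bzzz image tag replaced by NEW_TAG,
# without modifying the file, so the change can be reviewed first.
preview_tag_update() {
  local file="$1" new_tag="$2"
  # "|" as the sed delimiter avoids escaping "/" in the image path.
  # The tag pattern stops at whitespace or a quote instead of consuming
  # the rest of the line, so trailing quotes and comments survive.
  sed "s|registry\.home\.deepblack\.cloud/bzzz:[^[:space:]\"']*|registry.home.deepblack.cloud/bzzz:${new_tag}|g" "$file"
}

# Made-up compose fragment, for demonstration only.
sample=$(mktemp)
cat > "$sample" <<'EOF'
services:
  bzzz-agent:
    image: registry.home.deepblack.cloud/bzzz:v2.0.0
EOF

preview_tag_update "$sample" "v2.1.0"
rm -f "$sample"
```

Run against the sample, the `image:` line comes back with the tag `v2.1.0` and the other lines are unchanged.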
#### Step 3: Monitor Update Progress
```bash
# Watch rolling update progress
watch "docker service ps bzzz-v2_bzzz-agent | head -20"
# Check for any failed updates
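# (`docker service ps` has no current-state filter, so filter the formatted output instead)
docker service ps bzzz-v2_bzzz-agent --no-trunc \
--format '{{.Name}} {{.CurrentState}} {{.Error}}' | grep -iE 'failed|rejected'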
```
### 3. Migration from v1 to v2
```bash
# Use the automated migration script
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/migration-scripts
# Dry run first to preview changes
./migrate-v1-to-v2.sh --dry-run
# Execute full migration
./migrate-v1-to-v2.sh
# If rollback is needed
./migrate-v1-to-v2.sh --rollback
```
## Monitoring and Health Checks
### Health Check Commands
```bash
# Service health checks
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
docker service ps bzzz-v2_bzzz-agent --filter desired-state=running
# Application health checks
curl -f https://bzzz.deepblack.cloud/health
curl -f https://mcp.deepblack.cloud/health
curl -f https://resolve.deepblack.cloud/health
curl -f https://openai.deepblack.cloud/health
# P2P network health
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
curl -s http://localhost:9000/api/v2/dht/stats | jq '.'
# Database connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
pg_isready -U bzzz -d bzzz_v2
# Cache connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_redis) \
redis-cli ping
```
### Performance Monitoring
```bash
# Check resource usage
docker stats --no-stream
# Monitor disk usage
df -h /rust/bzzz-v2/data/
# Check network connections
netstat -tuln | grep -E ":(9000|3001|3002|3003|9101|9102|9103)"
# Monitor OpenAI API usage
curl -s http://localhost:9203/metrics | grep openai_cost
```
## Troubleshooting Guide
### Common Issues and Solutions
#### 1. Service Won't Start
**Symptoms:** Service stuck in `preparing` or constantly restarting
**Diagnosis:**
```bash
# Check service logs
docker service logs bzzz-v2_bzzz-agent --tail 50
# Check node resources
docker node ls
docker system df
# Verify secrets and configs
docker secret ls | grep bzzz_
docker config ls | grep bzzz_
```
**Solutions:**
- Check resource constraints and availability
- Verify secrets and configs are accessible
- Ensure image is available and correct
- Check node labels and placement constraints
#### 2. P2P Network Issues
**Symptoms:** Agents not discovering each other, DHT lookups failing
**Diagnosis:**
```bash
# Check peer connections
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
curl -s http://localhost:9000/api/v2/peers
# Check DHT bootstrap nodes
curl http://localhost:9101/health
curl http://localhost:9102/health
curl http://localhost:9103/health
# Check network connectivity
docker network inspect bzzz-internal
```
**Solutions:**
- Restart DHT bootstrap services
- Check firewall rules for P2P ports
- Verify Docker Swarm overlay network
- Check for port conflicts
#### 3. High OpenAI Costs
**Symptoms:** Cost alerts triggering, rate limits being hit
**Diagnosis:**
```bash
# Check current usage
curl -s http://localhost:9203/metrics | grep -E "openai_(cost|requests|tokens)"
# Check rate limiting
docker service logs bzzz-v2_openai-proxy --tail 100 | grep "rate limit"
```
**Solutions:**
- Adjust rate limiting parameters
- Review conversation patterns for excessive API calls
- Implement request caching
- Consider model selection optimization
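
As a starting point for the cost review, the metrics endpoint can be summarized from the shell. This is a sketch under assumptions: the exact metric names and labels exposed on port 9203 are not documented here, so the helper matches by name prefix, and it assumes samples carry no trailing timestamp.

```bash
#!/usr/bin/env bash
# Sum all samples whose metric name starts with PREFIX, reading Prometheus
# text-format input on stdin. Comment lines (# HELP / # TYPE) are skipped.
# Assumes the sample value is the last field on each line (no timestamps).
sum_metric() {
  awk -v prefix="$1" '
    /^#/ { next }
    index($1, prefix) == 1 { total += $NF }
    END { printf "%.2f\n", total }
  '
}

# Succeed if VALUE <= LIMIT. awk is used because both may be decimals,
# which the shell's integer-only [ -le ] cannot compare.
within_budget() {
  awk -v v="$1" -v limit="$2" 'BEGIN { exit !(v <= limit) }'
}
```

For example, `curl -s http://localhost:9203/metrics | sum_metric openai_cost` prints a single total that can then be passed to `within_budget` together with whatever daily limit the team has agreed on.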
#### 4. Database Connection Issues
**Symptoms:** Service errors related to database connectivity
**Diagnosis:**
```bash
# Check PostgreSQL status
docker service logs bzzz-v2_postgres --tail 50
# Test connection from agent
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
pg_isready -h postgres -U bzzz
# Check connection limits
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
psql -U bzzz -d bzzz_v2 -c "SELECT count(*) FROM pg_stat_activity;"
```
**Solutions:**
- Restart PostgreSQL service
- Check connection pool settings
- Increase max_connections if needed
- Review long-running queries
#### 5. Storage Issues
**Symptoms:** Disk full alerts, content store errors
**Diagnosis:**
```bash
# Check disk usage
df -h /rust/bzzz-v2/data/
du -sh /rust/bzzz-v2/data/blobs/
# Check content store health
curl -s http://localhost:9202/metrics | grep content_store
```
**Solutions:**
- Run garbage collection on old blobs
- Clean up old conversation threads
- Increase storage capacity
- Adjust retention policies
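
Before deleting anything, it is safer to list what a retention policy would remove. The sketch below is a dry run only: the helper name and the age threshold are hypothetical, and modification time is just a heuristic, since an old blob may still be referenced by live content.

```bash
#!/usr/bin/env bash
# Hypothetical dry-run helper: list regular files under DIR whose age in whole
# days is greater than DAYS, with their size in KiB. Nothing is deleted.
list_stale_files() {
  local dir="$1" days="$2"
  [ -d "$dir" ] || { echo "no such directory: $dir" >&2; return 1; }
  # find truncates file age to whole days, so "-mtime +7" matches files
  # last modified at least 8 full days ago.
  find "$dir" -type f -mtime +"$days" -exec du -k {} +
}
```

For example, `list_stale_files /rust/bzzz-v2/data/blobs 30` shows candidate files and their sizes, which gives an estimate of how much space a cleanup would recover.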
## Emergency Procedures
### Service Outage Response
#### Priority 1: Complete Service Outage
```bash
# 1. Check cluster status
docker node ls
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
# 2. Emergency restart of critical services
docker service update --force bzzz-v2_bzzz-agent
docker service update --force bzzz-v2_postgres
docker service update --force bzzz-v2_redis
# 3. If stack is corrupted, redeploy
docker stack rm bzzz-v2
sleep 60
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
# 4. Monitor recovery
watch docker stack ps bzzz-v2
```
#### Priority 2: Partial Service Degradation
```bash
# 1. Identify problematic services
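# (`docker service ps` has no current-state filter, so filter the formatted output instead)
docker service ps bzzz-v2_bzzz-agent --no-trunc \
--format '{{.Name}} {{.CurrentState}} {{.Error}}' | grep -iE 'failed|rejected'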
# 2. Scale up healthy replicas
docker service update --replicas 3 bzzz-v2_bzzz-agent
# 3. Remove unhealthy tasks
docker service update --force bzzz-v2_bzzz-agent
```
### Security Incident Response
#### Step 1: Immediate Containment
```bash
# 1. Block suspicious IPs
sudo ufw insert 1 deny from SUSPICIOUS_IP
# 2. Check for compromise indicators
sudo fail2ban-client status
sudo tail -100 /var/log/audit/audit.log | grep -i "denied\|failed\|error"
# 3. Isolate affected services
docker service update --replicas 0 AFFECTED_SERVICE
```
#### Step 2: Investigation
```bash
# 1. Check access logs
docker service logs bzzz-v2_bzzz-agent --since 1h | grep -i "error\|failed\|unauthorized"
# 2. Review monitoring alerts
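# (port 9093 is Alertmanager; its v2 API returns a JSON array and marks firing alerts as "active")
curl -s http://localhost:9093/api/v2/alerts | jq '.[] | select(.status.state=="active")'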
# 3. Examine network connections
netstat -tuln
ss -tulpn | grep -E ":(9000|3001|3002|3003)"
```
#### Step 3: Recovery
```bash
# 1. Update security rules
./infrastructure/security/security-hardening.sh
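# 2. Rotate secrets if compromised. A secret still attached to a service cannot
#    be removed, so create the replacement under a new name (_v2 is an example).
openssl rand -base64 32 | docker secret create bzzz_postgres_password_v2 -
# 3. Swap the secret on each service that mounts it (this restarts its tasks),
#    then remove the old one and reference the new name in docker-compose.swarm.yml.
#    An already-initialized database also needs its role password changed to match.
docker service update \
--secret-rm bzzz_postgres_password \
--secret-add source=bzzz_postgres_password_v2,target=bzzz_postgres_password \
bzzz-v2_postgres
docker secret rm bzzz_postgres_password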
```
### Data Recovery Procedures
#### Backup Restoration
```bash
# 1. Stop services
docker stack rm bzzz-v2
# 2. Restore from backup
BACKUP_DATE="20241201-120000"
rsync -av "/rust/bzzz-v2/backup/${BACKUP_DATE}/" /rust/bzzz-v2/data/
# 3. Restart services
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
```
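Since the restore starts by removing the stack, a mistyped `BACKUP_DATE` would be discovered only after services are already down. A small guard run beforehand avoids that; the function name `backup_is_usable` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical guard: succeed only if DIR exists and contains at least one entry.
backup_is_usable() {
  local dir="$1"
  [ -d "$dir" ] || { echo "backup not found: $dir" >&2; return 1; }
  # ls -A lists everything except "." and "..", so empty output means empty dir.
  [ -n "$(ls -A "$dir")" ] || { echo "backup is empty: $dir" >&2; return 1; }
}
```

Running `backup_is_usable "/rust/bzzz-v2/backup/$BACKUP_DATE" || exit 1` before `docker stack rm` keeps the outage from starting when there is nothing to restore. Adding `-n` to the `rsync` command previews the transfer without writing any files.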
#### Database Recovery
```bash
# 1. Stop application services
docker service scale bzzz-v2_bzzz-agent=0
# 2. Create database backup
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
pg_dump -U bzzz bzzz_v2 > /rust/bzzz-v2/backup/database-$(date +%Y%m%d-%H%M%S).sql
# 3. Restore database
docker exec -i $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
psql -U bzzz -d bzzz_v2 < /rust/bzzz-v2/backup/database-backup.sql
# 4. Restart application services
docker service scale bzzz-v2_bzzz-agent=3
```
## Maintenance Procedures
### Routine Maintenance (Weekly)
```bash
#!/bin/bash
# Weekly maintenance script
# 1. Check service health
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
docker system df
# 2. Clean up unused resources
docker system prune -f
docker volume prune -f
# 3. Backup critical data
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
pg_dump -U bzzz bzzz_v2 | gzip > \
/rust/bzzz-v2/backup/weekly-db-$(date +%Y%m%d).sql.gz
# 4. Rotate logs
find /rust/bzzz-v2/logs -name "*.log" -mtime +7 -delete
# 5. Check certificate expiration
openssl x509 -in /rust/bzzz-v2/config/tls/server/walnut.pem -noout -dates
# 6. Update security rules
fail2ban-client reload
# 7. Generate maintenance report
echo "Maintenance completed on $(date)" >> /rust/bzzz-v2/logs/maintenance.log
```
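The certificate step in the script prints the validity dates but leaves the comparison to the reader. `openssl x509 -checkend` can make that decision itself. The sketch below wraps it; the function name and the 30-day window are assumptions.

```bash
#!/usr/bin/env bash
# Succeed (exit 0) if the certificate in FILE expires within DAYS days.
# A missing or unreadable file also counts as "expiring", so the caller is alerted.
cert_expires_within() {
  local file="$1" days="$2"
  # -checkend N exits 0 when the certificate is still valid N seconds from
  # now and non-zero otherwise, so the result is inverted here.
  ! openssl x509 -in "$file" -noout -checkend $((days * 86400)) >/dev/null 2>&1
}
```

In the weekly script this could read `cert_expires_within /rust/bzzz-v2/config/tls/server/walnut.pem 30 && echo "WARNING: certificate expires within 30 days"`.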
### Scaling Procedures
#### Scale Up
```bash
# Increase replica count
docker service scale bzzz-v2_bzzz-agent=5
docker service scale bzzz-v2_mcp-server=5
# Add new node to cluster (run on new node)
docker swarm join --token $WORKER_TOKEN $MANAGER_IP:2377
# Label new node
docker node update --label-add bzzz.role=agent NEW_NODE_HOSTNAME
```
#### Scale Down
```bash
# Gracefully reduce replicas
docker service scale bzzz-v2_bzzz-agent=2
docker service scale bzzz-v2_mcp-server=2
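# Remove node from cluster: drain it, have it leave, then delete it
docker node update --availability drain NODE_HOSTNAME
# Run on the node being removed (`docker node rm` refuses nodes that are not down)
docker swarm leave
# Back on a manager, once `docker node ls` shows the node as Down
docker node rm NODE_HOSTNAME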
```
## Performance Tuning
### Database Optimization
```bash
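# PostgreSQL tuning
# Each ALTER SYSTEM gets its own -c: it cannot run inside the single
# transaction that psql uses for a multi-statement -c string
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
psql -U bzzz -d bzzz_v2 \
-c "ALTER SYSTEM SET shared_buffers = '1GB';" \
-c "ALTER SYSTEM SET max_connections = 200;" \
-c "ALTER SYSTEM SET checkpoint_timeout = '15min';" \
-c "SELECT pg_reload_conf();"
# The reload applies checkpoint_timeout only; shared_buffers and max_connections
# take effect after a PostgreSQL restart, for example:
#   docker service update --force bzzz-v2_postgres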
```
### Storage Optimization
```bash
# Content store optimization
find /rust/bzzz-v2/data/blobs -name "*.tmp" -mtime +1 -delete
find /rust/bzzz-v2/data/blobs -type f -size 0 -delete
# Compress old logs
find /rust/bzzz-v2/logs -name "*.log" -mtime +3 -exec gzip {} \;
```
### Network Optimization
```bash
# Optimize network buffer sizes
# Written to a drop-in file (the file name is an example) so that re-running
# this step does not append duplicate lines to /etc/sysctl.conf
sudo tee /etc/sysctl.d/99-bzzz-network.conf > /dev/null <<'EOF'
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
EOF
sudo sysctl --system
```
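The buffer limit used above, 134217728 bytes, is 128 MiB. After applying the settings it is worth confirming that the kernel actually holds the expected values. A minimal sketch follows; the helper name `sysctl_matches` is hypothetical.

```bash
#!/usr/bin/env bash
# 134217728 bytes expressed in MiB:
echo $((134217728 / 1024 / 1024))   # prints 128

# Hypothetical helper: compare a live kernel setting with the expected value.
# Multi-value keys such as tcp_rmem are tab-separated in sysctl output, so
# whitespace is normalized to single spaces before comparing.
sysctl_matches() {
  local key="$1" expected="$2" actual
  actual=$(sysctl -n "$key" 2>/dev/null | tr -s '[:space:]' ' ' | sed 's/ $//')
  [ "$actual" = "$expected" ]
}
```

For example, `sysctl_matches net.ipv4.tcp_rmem "4096 87380 134217728" || echo "tcp_rmem not applied"`.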
## Contact Information
### On-Call Procedures
- **Primary Contact**: DevOps Team Lead
- **Secondary Contact**: Senior Site Reliability Engineer
- **Escalation**: Platform Engineering Manager
### Communication Channels
- **Slack**: #bzzz-incidents
- **Email**: devops@deepblack.cloud
- **Phone**: Emergency On-Call Rotation
### Documentation
- **Runbooks**: This document
- **Architecture**: `/docs/BZZZ_V2_INFRASTRUCTURE_ARCHITECTURE.md`
- **API Documentation**: https://bzzz.deepblack.cloud/docs
- **Monitoring Dashboards**: https://grafana.deepblack.cloud
---
*This runbook should be reviewed and updated monthly. Last updated: 2025-08-07*