# BZZZ v2 Deployment Runbook

## Overview

This runbook provides step-by-step procedures for deploying, operating, and maintaining BZZZ v2 infrastructure. It covers normal operations, emergency procedures, and troubleshooting guidelines.

## Prerequisites

### System Requirements

- **Cluster**: 3 nodes (WALNUT, IRONWOOD, ACACIA)
- **OS**: Ubuntu 22.04 LTS or newer
- **Docker**: Version 24+ with Swarm mode enabled
- **Storage**: NFS mount at `/rust/` with 500GB+ available
- **Network**: Internal 192.168.1.0/24 with external internet access
- **Secrets**: OpenAI API key and database credentials

### Access Requirements

- SSH access to all cluster nodes
- Docker Swarm manager privileges
- Sudo access for system configuration
- GitLab access for CI/CD pipeline management

## Pre-Deployment Checklist

### Infrastructure Verification

```bash
# Verify Docker Swarm status
docker node ls
docker network ls | grep tengig

# Check available storage
df -h /rust/

# Verify network connectivity
ping -c 3 192.168.1.27 # WALNUT
ping -c 3 192.168.1.113 # IRONWOOD
ping -c 3 192.168.1.xxx # ACACIA

# Test registry access
docker pull registry.home.deepblack.cloud/hello-world || echo "Registry access test failed"
```

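
The storage requirement (500GB+ on `/rust/`) can also be checked by a script rather than by reading `df -h` output. The following is a minimal sketch, assuming a POSIX-compatible `df`; the function name `has_min_free_gb` is hypothetical and not part of the BZZZ tooling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only if PATH has at least MIN_GB gigabytes free.
has_min_free_gb() {
  local path="$1" min_gb="$2" avail_kb
  # -P forces POSIX single-line output; -k reports sizes in 1024-byte blocks.
  avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  [ -n "$avail_kb" ] || return 2          # df failed or the path is missing
  [ "$avail_kb" -ge $(( min_gb * 1024 * 1024 )) ]
}

# Example use:
#   has_min_free_gb /rust/ 500 || echo "Less than 500GB free on /rust/" >&2
```

A return code of 2 separates "could not measure" from "not enough space", which may be useful when the NFS mount itself is missing.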
### Security Hardening

```bash
# Run security hardening script
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/security
sudo ./security-hardening.sh

# Verify firewall status
sudo ufw status verbose

# Check fail2ban status
sudo fail2ban-client status
```

## Deployment Procedures

### 1. Initial Deployment (Fresh Install)

#### Step 1: Prepare Infrastructure

```bash
# Create directory structure
mkdir -p /rust/bzzz-v2/{config,data,logs,backup}
mkdir -p /rust/bzzz-v2/data/{blobs,conversations,dht,postgres,redis}
mkdir -p /rust/bzzz-v2/config/{swarm,monitoring,security}

# Set permissions
sudo chown -R tony:tony /rust/bzzz-v2
chmod -R 755 /rust/bzzz-v2
```

#### Step 2: Configure Secrets and Configs

```bash
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure

# Create Docker secrets
docker secret create bzzz_postgres_password config/secrets/postgres_password
docker secret create bzzz_openai_api_key ~/chorus/business/secrets/openai-api-key
docker secret create bzzz_grafana_admin_password config/secrets/grafana_admin_password

# Create Docker configs
docker config create bzzz_v2_config config/bzzz-config.yaml
docker config create bzzz_prometheus_config monitoring/configs/prometheus.yml
docker config create bzzz_alertmanager_config monitoring/configs/alertmanager.yml
```

#### Step 3: Deploy Core Services

```bash
# Deploy main BZZZ v2 stack
docker stack deploy -c docker-compose.swarm.yml bzzz-v2

# Wait for services to start (this may take 5-10 minutes)
watch docker stack ps bzzz-v2
```

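
Watching `docker stack ps` works interactively, but a script needs a pass/fail signal. One possible approach, sketched here under the assumption that `docker service ls --format '{{.Name}} {{.Replicas}}'` prints lines such as `bzzz-v2_bzzz-agent 3/3`, is to parse the replica column. The helper name `replicas_converged` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reads "NAME REPLICAS" lines on stdin and succeeds only
# when every service reports running == desired (for example "3/3").
replicas_converged() {
  awk '
    {
      split($2, r, "/")               # r[1] = running, r[2] = desired
      if (r[1] + 0 != r[2] + 0) { pending++; print "waiting: " $1 " " $2 }
    }
    END { if (pending > 0 || NR == 0) exit 1 }   # no input counts as not ready
  '
}

# Example use (polls every 10 seconds until the stack has converged):
#   until docker service ls --filter label=com.docker.stack.namespace=bzzz-v2 \
#           --format '{{.Name}} {{.Replicas}}' | replicas_converged; do
#     sleep 10
#   done
```

Treating empty input as "not ready" avoids reporting success when the stack name is mistyped and the filter matches nothing.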
#### Step 4: Deploy Monitoring Stack

```bash
# Deploy monitoring services
docker stack deploy -c monitoring/docker-compose.monitoring.yml bzzz-monitoring

# Verify monitoring services
curl -f http://localhost:9090/-/healthy # Prometheus
curl -f http://localhost:3000/api/health # Grafana
```

#### Step 5: Verify Deployment

```bash
# Check all services are running
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2

# Test external endpoints
curl -f https://bzzz.deepblack.cloud/health
curl -f https://mcp.deepblack.cloud/health
curl -f https://resolve.deepblack.cloud/health

# Check P2P mesh connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  curl -s http://localhost:9000/api/v2/peers | jq '.connected_peers | length'
```

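
Health endpoints often fail for the first minutes after a deploy, so a single `curl -f` can give a false alarm. A minimal retry wrapper is sketched below; the name `retry` and the attempt and delay values in the example are illustrative assumptions, not values taken from the BZZZ configuration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: run a command up to ATTEMPTS times, sleeping DELAY
# seconds between tries; succeeds as soon as the command succeeds.
retry() {
  local attempts="$1" delay="$2" n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "gave up after $n attempts: $*" >&2
      return 1
    fi
    n=$(( n + 1 ))
    sleep "$delay"
  done
}

# Example use (up to 30 tries, 10 seconds apart):
#   retry 30 10 curl -fsS https://bzzz.deepblack.cloud/health
```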
### 2. Update Deployment (Rolling Update)

#### Step 1: Pre-Update Checks

```bash
# Check current deployment health
docker stack ps bzzz-v2 | grep -v "Shutdown\|Failed"

# Backup current configuration
# Compute the timestamp once so all three commands use the same directory
BACKUP_DIR="/rust/bzzz-v2/backup/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
docker config ls | grep bzzz_ > "$BACKUP_DIR/configs.txt"
docker secret ls | grep bzzz_ > "$BACKUP_DIR/secrets.txt"
```

#### Step 2: Update Images

```bash
# Update to new image version
export NEW_IMAGE_TAG="v2.1.0"

# Update Docker Compose file with new image tags
sed -i "s/registry.home.deepblack.cloud\/bzzz:.*$/registry.home.deepblack.cloud\/bzzz:${NEW_IMAGE_TAG}/g" \
  docker-compose.swarm.yml

# Deploy updated stack (rolling update)
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
```

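
The `sed` expression above replaces everything from the tag to the end of the line, which would also drop a trailing quote or comment if the compose file has one. A more conservative variant is sketched below as one option; the function name `set_image_tag` is hypothetical, and it assumes tags contain only letters, digits, dots, underscores, and hyphens.

```bash
#!/usr/bin/env bash
# Hypothetical helper: rewrite the tag of every
# registry.home.deepblack.cloud/bzzz image reference in a compose file.
set_image_tag() {
  local file="$1" tag="$2" tmp rc
  # Refuse tags containing characters that would break the sed expression.
  case "$tag" in
    ""|*[!A-Za-z0-9._-]*) echo "invalid tag: '$tag'" >&2; return 1 ;;
  esac
  tmp=$(mktemp) || return 1
  # Only the tag characters are replaced; the rest of the line is kept.
  sed -E "s#(registry\.home\.deepblack\.cloud/bzzz):[A-Za-z0-9._-]+#\1:${tag}#g" \
    "$file" > "$tmp" && cat "$tmp" > "$file"
  rc=$?
  rm -f "$tmp"
  return "$rc"
}

# Example use:
#   set_image_tag docker-compose.swarm.yml "$NEW_IMAGE_TAG"
```

Because the pattern requires a colon directly after `bzzz`, images whose names merely start with `bzzz` (for instance a hypothetical `bzzz-mcp`) are left untouched.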
#### Step 3: Monitor Update Progress

```bash
# Watch rolling update progress
watch "docker service ps bzzz-v2_bzzz-agent | head -20"

# Check for any failed updates
# (docker service ps cannot filter on current state, so match on the output)
docker service ps bzzz-v2_bzzz-agent --no-trunc \
  --format '{{.Name}} {{.CurrentState}} {{.Error}}' | grep -iE 'failed|rejected'
```

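
For unattended updates, the service's update status can be turned into an exit code. This is a sketch under the assumption that Swarm reports one of the states `updating`, `paused`, `completed`, or a `rollback_*` variant; the helper name `update_state_rc` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a Swarm UpdateStatus.State string to an exit code.
#   0 = finished, 1 = still in progress, 2 = needs operator attention
update_state_rc() {
  case "$1" in
    completed)          return 0 ;;
    updating)           return 1 ;;
    paused|rollback_*)  return 2 ;;
    *)                  return 2 ;;   # empty or unknown state
  esac
}

# Example use (the template guards against a service with no update history):
#   state=$(docker service inspect bzzz-v2_bzzz-agent \
#     --format '{{if .UpdateStatus}}{{.UpdateStatus.State}}{{end}}')
#   update_state_rc "$state"
```

Unknown or empty states are deliberately reported as needing attention rather than as success.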
### 3. Migration from v1 to v2

```bash
# Use the automated migration script
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/migration-scripts

# Dry run first to preview changes
./migrate-v1-to-v2.sh --dry-run

# Execute full migration
./migrate-v1-to-v2.sh

# If rollback is needed
./migrate-v1-to-v2.sh --rollback
```

## Monitoring and Health Checks

### Health Check Commands

```bash
# Service health checks
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
docker service ps bzzz-v2_bzzz-agent --filter desired-state=running

# Application health checks
curl -f https://bzzz.deepblack.cloud/health
curl -f https://mcp.deepblack.cloud/health
curl -f https://resolve.deepblack.cloud/health
curl -f https://openai.deepblack.cloud/health

# P2P network health
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  curl -s http://localhost:9000/api/v2/dht/stats | jq '.'

# Database connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  pg_isready -U bzzz -d bzzz_v2

# Cache connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_redis) \
  redis-cli ping
```

### Performance Monitoring

```bash
# Check resource usage
docker stats --no-stream

# Monitor disk usage
df -h /rust/bzzz-v2/data/

# Check network connections
netstat -tuln | grep -E ":(9000|3001|3002|3003|9101|9102|9103)"

# Monitor OpenAI API usage
curl -s http://localhost:9203/metrics | grep openai_cost
```

## Troubleshooting Guide

### Common Issues and Solutions

#### 1. Service Won't Start

**Symptoms:** Service stuck in `preparing` or constantly restarting

**Diagnosis:**
```bash
# Check service logs
docker service logs bzzz-v2_bzzz-agent --tail 50

# Check node resources
docker node ls
docker system df

# Verify secrets and configs
docker secret ls | grep bzzz_
docker config ls | grep bzzz_
```

**Solutions:**
- Check resource constraints and availability
- Verify secrets and configs are accessible
- Ensure image is available and correct
- Check node labels and placement constraints

#### 2. P2P Network Issues

**Symptoms:** Agents not discovering each other, DHT lookups failing

**Diagnosis:**
```bash
# Check peer connections
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  curl -s http://localhost:9000/api/v2/peers

# Check DHT bootstrap nodes
curl http://localhost:9101/health
curl http://localhost:9102/health
curl http://localhost:9103/health

# Check network connectivity
docker network inspect bzzz-internal
```

**Solutions:**
- Restart DHT bootstrap services
- Check firewall rules for P2P ports
- Verify Docker Swarm overlay network
- Check for port conflicts

#### 3. High OpenAI Costs

**Symptoms:** Cost alerts triggering, rate limits being hit

**Diagnosis:**
```bash
# Check current usage
curl -s http://localhost:9203/metrics | grep -E "openai_(cost|requests|tokens)"

# Check rate limiting
docker service logs bzzz-v2_openai-proxy --tail 100 | grep "rate limit"
```

**Solutions:**
- Adjust rate limiting parameters
- Review conversation patterns for excessive API calls
- Implement request caching
- Consider model selection optimization

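
The `grep` in the diagnosis step prints one line per label set, which makes a total hard to read off. The sketch below sums a single metric from Prometheus text-format output. The metric name `openai_cost_usd_total` used in the example is an assumption (the source only shows the `openai_cost` prefix), and the parser assumes samples carry no trailing timestamp.

```bash
#!/usr/bin/env bash
# Hypothetical helper: sum every sample of one metric (across all label sets)
# from Prometheus text-format input on stdin.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                            # skip HELP/TYPE comment lines
    {
      metric = $1
      sub(/[{].*/, "", metric)               # drop the {label="..."} part
      if (metric == name) total += $NF       # value is the last field
    }
    END { printf "%.2f\n", total }
  '
}

# Example use (metric name is an assumption, check the actual /metrics output):
#   curl -s http://localhost:9203/metrics | sum_metric openai_cost_usd_total
```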
#### 4. Database Connection Issues

**Symptoms:** Service errors related to database connectivity

**Diagnosis:**
```bash
# Check PostgreSQL status
docker service logs bzzz-v2_postgres --tail 50

# Test connection from agent
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  pg_isready -h postgres -U bzzz

# Check connection limits
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  psql -U bzzz -d bzzz_v2 -c "SELECT count(*) FROM pg_stat_activity;"
```

**Solutions:**
- Restart PostgreSQL service
- Check connection pool settings
- Increase max_connections if needed
- Review long-running queries

#### 5. Storage Issues

**Symptoms:** Disk full alerts, content store errors

**Diagnosis:**
```bash
# Check disk usage
df -h /rust/bzzz-v2/data/
du -sh /rust/bzzz-v2/data/blobs/

# Check content store health
curl -s http://localhost:9202/metrics | grep content_store
```

**Solutions:**
- Run garbage collection on old blobs
- Clean up old conversation threads
- Increase storage capacity
- Adjust retention policies

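
Before any garbage collection, it helps to see what an age-based policy would select. The following sketch only lists candidates and never deletes. Whether file age is a safe criterion depends on how BZZZ references blobs, which this runbook does not specify, so treat the output as input for review. The function name `list_stale_blobs` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print (never delete) regular files under DIR whose
# modification time is more than DAYS days old.
list_stale_blobs() {
  local dir="$1" days="$2"
  [ -d "$dir" ] || { echo "not a directory: $dir" >&2; return 1; }
  find "$dir" -type f -mtime +"$days" -print
}

# Example use (the 90-day threshold is illustrative):
#   list_stale_blobs /rust/bzzz-v2/data/blobs 90 | wc -l
```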
## Emergency Procedures

### Service Outage Response

#### Priority 1: Complete Service Outage

```bash
# 1. Check cluster status
docker node ls
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2

# 2. Emergency restart of critical services
docker service update --force bzzz-v2_bzzz-agent
docker service update --force bzzz-v2_postgres
docker service update --force bzzz-v2_redis

# 3. If stack is corrupted, redeploy
docker stack rm bzzz-v2
sleep 60
docker stack deploy -c docker-compose.swarm.yml bzzz-v2

# 4. Monitor recovery
watch docker stack ps bzzz-v2
```

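
The fixed `sleep 60` between `docker stack rm` and the redeploy is an estimate of how long teardown takes. One alternative, sketched here as an assumption rather than the project's method, is to poll until the stack's resources are gone. The helper name `wait_until_empty` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: poll a command once per second until it prints nothing,
# or fail after TIMEOUT seconds. A command that errors out also prints
# nothing, so confirm the Docker daemon is reachable before relying on it.
wait_until_empty() {
  local timeout="$1" waited=0
  shift
  while [ -n "$("$@" 2>/dev/null)" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "still not empty after ${timeout}s: $*" >&2
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
}

# Example use (waits up to 120 seconds for the stack's networks to disappear):
#   wait_until_empty 120 docker network ls -q \
#     --filter label=com.docker.stack.namespace=bzzz-v2
```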
#### Priority 2: Partial Service Degradation

```bash
# 1. Identify problematic services
# (docker service ps cannot filter on current state, so match on the output)
docker service ps bzzz-v2_bzzz-agent --no-trunc \
  --format '{{.Name}} {{.CurrentState}} {{.Error}}' | grep -iE 'failed|rejected'

# 2. Scale up healthy replicas
docker service update --replicas 3 bzzz-v2_bzzz-agent

# 3. Remove unhealthy tasks
docker service update --force bzzz-v2_bzzz-agent
```

### Security Incident Response

#### Step 1: Immediate Containment

```bash
# 1. Block suspicious IPs
sudo ufw insert 1 deny from SUSPICIOUS_IP

# 2. Check for compromise indicators
sudo fail2ban-client status
sudo tail -100 /var/log/audit/audit.log | grep -i "denied\|failed\|error"

# 3. Isolate affected services
docker service update --replicas 0 AFFECTED_SERVICE
```

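
During an incident, addresses are often pasted from logs, so validating the value before passing it to `ufw` with `sudo` is a cheap safeguard. A minimal IPv4-only sketch follows; the function name `is_ipv4` is hypothetical, and IPv6 or CIDR ranges would need separate handling.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed only for a dotted-quad IPv4 address whose
# four octets are each in the range 0-255.
is_ipv4() {
  local ip="$1" octet
  [[ "$ip" =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] || return 1
  local IFS=.
  for octet in $ip; do
    # 10# forces base 10 so octets with a leading zero are not read as octal.
    [ "$((10#$octet))" -le 255 ] || return 1
  done
}

# Example use:
#   is_ipv4 "$SUSPICIOUS_IP" && sudo ufw insert 1 deny from "$SUSPICIOUS_IP"
```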
#### Step 2: Investigation

```bash
# 1. Check access logs
docker service logs bzzz-v2_bzzz-agent --since 1h | grep -i "error\|failed\|unauthorized"

# 2. Review monitoring alerts
# Alertmanager (port 9093) reports active alerts under .status.state in its v2 API
curl -s http://localhost:9093/api/v2/alerts | jq '.[] | select(.status.state=="active")'

# 3. Examine network connections
netstat -tuln
ss -tulpn | grep -E ":(9000|3001|3002|3003)"
```

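#### Step 3: Recovery

```bash
# 1. Update security rules
sudo ./infrastructure/security/security-hardening.sh

# 2. Rotate secrets if compromised
# Note: Docker rejects "secret rm" while any service still references the
# secret, so detach it from the service (or remove the stack) first.
# If the database is already initialized, the role's password may also need
# to be changed to match the new secret.
docker secret rm bzzz_postgres_password
openssl rand -base64 32 | docker secret create bzzz_postgres_password -

# 3. Restart services with new secrets
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
```
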
### Data Recovery Procedures

#### Backup Restoration

```bash
# 1. Stop services
docker stack rm bzzz-v2

# 2. Restore from backup
BACKUP_DATE="20241201-120000"
rsync -av "/rust/bzzz-v2/backup/$BACKUP_DATE/" /rust/bzzz-v2/data/

# 3. Restart services
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
```

#### Database Recovery

```bash
# 1. Stop application services
docker service scale bzzz-v2_bzzz-agent=0

# 2. Create database backup
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  pg_dump -U bzzz bzzz_v2 > /rust/bzzz-v2/backup/database-$(date +%Y%m%d-%H%M%S).sql

# 3. Restore database
docker exec -i $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  psql -U bzzz -d bzzz_v2 < /rust/bzzz-v2/backup/database-backup.sql

# 4. Restart application services
docker service scale bzzz-v2_bzzz-agent=3
```

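
If `pg_dump` fails (for example because `docker ps` matched no container), the shell redirection in step 2 still creates an empty file. A quick sanity check before restoring can catch this. The sketch below assumes plain-format dumps, which begin with a `PostgreSQL database dump` comment header; the function name `dump_looks_valid` is hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: cheap sanity check for a plain-format pg_dump file.
# It does not prove the dump is complete, only that it is not obviously bad.
dump_looks_valid() {
  local file="$1"
  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
  # Plain-text dumps start with a comment header naming the tool.
  head -n 5 "$file" | grep -q 'PostgreSQL database dump' || {
    echo "no pg_dump header in: $file" >&2
    return 1
  }
}

# Example use:
#   dump_looks_valid /rust/bzzz-v2/backup/database-backup.sql || exit 1
```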
## Maintenance Procedures

### Routine Maintenance (Weekly)

```bash
#!/bin/bash
# Weekly maintenance script

# 1. Check service health
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
docker system df

# 2. Clean up unused resources
docker system prune -f
docker volume prune -f

# 3. Backup critical data
pg_dump -h localhost -U bzzz bzzz_v2 | gzip > \
  /rust/bzzz-v2/backup/weekly-db-$(date +%Y%m%d).sql.gz

# 4. Rotate logs
find /rust/bzzz-v2/logs -name "*.log" -mtime +7 -delete

# 5. Check certificate expiration
openssl x509 -in /rust/bzzz-v2/config/tls/server/walnut.pem -noout -dates

# 6. Update security rules
fail2ban-client reload

# 7. Generate maintenance report
echo "Maintenance completed on $(date)" >> /rust/bzzz-v2/logs/maintenance.log
```

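
Step 5 of the script prints the certificate dates but leaves the comparison to the reader. `openssl x509` also offers `-checkend`, which exits non-zero when the certificate expires within the given number of seconds. A minimal sketch follows; the function name `cert_valid_for_days` and the 30-day threshold in the example are assumptions.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if the certificate is still valid DAYS days
# from now. Relies on openssl's -checkend, which takes a number of seconds.
cert_valid_for_days() {
  local cert="$1" days="$2"
  openssl x509 -in "$cert" -noout -checkend $(( days * 86400 )) >/dev/null
}

# Example use:
#   cert_valid_for_days /rust/bzzz-v2/config/tls/server/walnut.pem 30 \
#     || echo "walnut.pem expires within 30 days" >&2
```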
### Scaling Procedures

#### Scale Up

```bash
# Increase replica count
docker service scale bzzz-v2_bzzz-agent=5
docker service scale bzzz-v2_mcp-server=5

# Add new node to cluster (run on new node)
docker swarm join --token $WORKER_TOKEN $MANAGER_IP:2377

# Label new node
docker node update --label-add bzzz.role=agent NEW_NODE_HOSTNAME
```

#### Scale Down

```bash
# Gracefully reduce replicas
docker service scale bzzz-v2_bzzz-agent=2
docker service scale bzzz-v2_mcp-server=2

# Remove node from cluster
docker node update --availability drain NODE_HOSTNAME
# Run "docker swarm leave" on the node itself first; "docker node rm" only
# removes nodes that are down unless --force is given
docker node rm NODE_HOSTNAME
```

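## Performance Tuning

### Database Optimization

```bash
# PostgreSQL tuning
# Each statement gets its own -c because ALTER SYSTEM cannot run inside the
# implicit transaction that wraps a multi-statement -c string.
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  psql -U bzzz -d bzzz_v2 \
    -c "ALTER SYSTEM SET shared_buffers = '1GB';" \
    -c "ALTER SYSTEM SET max_connections = 200;" \
    -c "ALTER SYSTEM SET checkpoint_timeout = '15min';" \
    -c "SELECT pg_reload_conf();"
# shared_buffers and max_connections only take effect after PostgreSQL restarts;
# the reload applies checkpoint_timeout immediately.
```
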
### Storage Optimization

```bash
# Content store optimization
find /rust/bzzz-v2/data/blobs -name "*.tmp" -mtime +1 -delete
find /rust/bzzz-v2/data/blobs -type f -size 0 -delete

# Compress old logs
find /rust/bzzz-v2/logs -name "*.log" -mtime +3 -exec gzip {} \;
```

### Network Optimization

```bash
# Optimize network buffer sizes
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 4096 87380 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.tcp_wmem = 4096 65536 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```

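
Appending with `tee -a` adds duplicate lines to `/etc/sysctl.conf` each time the step is repeated. One alternative, sketched here as an assumption, is to write the same four settings to a dedicated drop-in file that is overwritten on each run. The function name `write_sysctl_dropin` and the file name in the example are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write the buffer settings to a dedicated file so the
# step can be repeated without appending duplicate lines.
write_sysctl_dropin() {
  local target="$1"
  cat > "$target" <<'EOF'
net.core.rmem_max = 134217728
net.core.wmem_max = 134217728
net.ipv4.tcp_rmem = 4096 87380 134217728
net.ipv4.tcp_wmem = 4096 65536 134217728
EOF
}

# Example use (stage in a temp file, install as root, then reload all files):
#   tmp=$(mktemp) && write_sysctl_dropin "$tmp"
#   sudo install -m 0644 "$tmp" /etc/sysctl.d/99-bzzz-network.conf
#   sudo sysctl --system
```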
## Contact Information

### On-Call Procedures

- **Primary Contact**: DevOps Team Lead
- **Secondary Contact**: Senior Site Reliability Engineer
- **Escalation**: Platform Engineering Manager

### Communication Channels

- **Slack**: #bzzz-incidents
- **Email**: devops@deepblack.cloud
- **Phone**: Emergency On-Call Rotation

### Documentation

- **Runbooks**: This document
- **Architecture**: `/docs/BZZZ_V2_INFRASTRUCTURE_ARCHITECTURE.md`
- **API Documentation**: https://bzzz.deepblack.cloud/docs
- **Monitoring Dashboards**: https://grafana.deepblack.cloud

---

*This runbook should be reviewed and updated monthly. Last updated: $(date)*