Prepare for v2 development: Add MCP integration and future development planning

- Add FUTURE_DEVELOPMENT.md with comprehensive v2 protocol specification
- Add MCP integration design and implementation foundation
- Add infrastructure and deployment configurations
- Update system architecture for v2 evolution

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
581
infrastructure/docs/DEPLOYMENT_RUNBOOK.md
Normal file
@@ -0,0 +1,581 @@
# BZZZ v2 Deployment Runbook

## Overview

This runbook provides step-by-step procedures for deploying, operating, and maintaining BZZZ v2 infrastructure. It covers normal operations, emergency procedures, and troubleshooting guidelines.

## Prerequisites

### System Requirements

- **Cluster**: 3 nodes (WALNUT, IRONWOOD, ACACIA)
- **OS**: Ubuntu 22.04 LTS or newer
- **Docker**: Version 24+ with Swarm mode enabled
- **Storage**: NFS mount at `/rust/` with 500GB+ available
- **Network**: Internal 192.168.1.0/24 with external internet access
- **Secrets**: OpenAI API key and database credentials

### Access Requirements

- SSH access to all cluster nodes
- Docker Swarm manager privileges
- Sudo access for system configuration
- GitLab access for CI/CD pipeline management

## Pre-Deployment Checklist

### Infrastructure Verification

```bash
# Verify Docker Swarm status
docker node ls
docker network ls | grep tengig

# Check available storage
df -h /rust/

# Verify network connectivity
ping -c 3 192.168.1.27 # WALNUT
ping -c 3 192.168.1.113 # IRONWOOD
ping -c 3 192.168.1.xxx # ACACIA (substitute the actual address)

# Test registry access
docker pull registry.home.deepblack.cloud/hello-world || echo "Registry access test failed" >&2
```
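
The storage requirement (500GB+ on `/rust/`) can also be checked by a script instead of by reading `df -h` output. The following is a minimal sketch; the function name `has_free_space` is an illustrative assumption and not part of the BZZZ tooling.

```bash
#!/usr/bin/env bash
# Succeeds when PATH has at least MIN_GB gigabytes available.
# has_free_space is a hypothetical helper, not an existing BZZZ script.
has_free_space() {
  local path="$1" min_gb="$2" avail_kb
  # -P keeps one line per filesystem; -k reports 1024-byte blocks
  avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
  [ -n "$avail_kb" ] || return 2
  [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
}

# Example: enforce the 500GB requirement on the NFS mount
# has_free_space /rust/ 500 || echo "Insufficient space on /rust/" >&2
```

A non-zero exit status lets a CI job or wrapper script stop before deployment begins.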

### Security Hardening

```bash
# Run security hardening script
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/security
sudo ./security-hardening.sh

# Verify firewall status
sudo ufw status verbose

# Check fail2ban status
sudo fail2ban-client status
```

## Deployment Procedures

### 1. Initial Deployment (Fresh Install)

#### Step 1: Prepare Infrastructure

```bash
# Create directory structure
mkdir -p /rust/bzzz-v2/{config,data,logs,backup}
mkdir -p /rust/bzzz-v2/data/{blobs,conversations,dht,postgres,redis}
mkdir -p /rust/bzzz-v2/config/{swarm,monitoring,security}

# Set permissions
sudo chown -R tony:tony /rust/bzzz-v2
chmod -R 755 /rust/bzzz-v2
```

#### Step 2: Configure Secrets and Configs

```bash
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure

# Create Docker secrets
docker secret create bzzz_postgres_password config/secrets/postgres_password
docker secret create bzzz_openai_api_key ~/chorus/business/secrets/openai-api-key
docker secret create bzzz_grafana_admin_password config/secrets/grafana_admin_password

# Create Docker configs
docker config create bzzz_v2_config config/bzzz-config.yaml
docker config create bzzz_prometheus_config monitoring/configs/prometheus.yml
docker config create bzzz_alertmanager_config monitoring/configs/alertmanager.yml
```

#### Step 3: Deploy Core Services

```bash
# Deploy main BZZZ v2 stack
docker stack deploy -c docker-compose.swarm.yml bzzz-v2

# Wait for services to start (this may take 5-10 minutes)
watch docker stack ps bzzz-v2
```
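
`watch` needs someone at the terminal. For unattended runs, the wait can be scripted by polling the replica counts that `docker service ls` reports. This is a sketch under assumptions: `all_converged` and `wait_for_stack` are hypothetical helper names, and the 600-second default simply mirrors the 5-10 minute estimate above.

```bash
#!/usr/bin/env bash
# Reads "NAME REPLICAS" lines (e.g. "bzzz-v2_redis 1/1") on stdin and
# succeeds only when every listed service has running == desired replicas.
all_converged() {
  awk '
    { split($2, r, "/"); if (r[1] != r[2]) bad = 1; seen = 1 }
    END { exit (seen && !bad) ? 0 : 1 }
  '
}

# wait_for_stack STACK [TIMEOUT_SECONDS]: poll every 10s until converged.
wait_for_stack() {
  local stack="$1" timeout="${2:-600}" waited=0
  until docker service ls \
          --filter "label=com.docker.stack.namespace=${stack}" \
          --format '{{.Name}} {{.Replicas}}' | all_converged; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 10
    waited=$((waited + 10))
  done
}

# Example:
# wait_for_stack bzzz-v2 600 || echo "bzzz-v2 did not converge in time" >&2
```

Keeping the parsing in `all_converged` separate from the Docker call makes the convergence rule easy to check on sample input.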

#### Step 4: Deploy Monitoring Stack

```bash
# Deploy monitoring services
docker stack deploy -c monitoring/docker-compose.monitoring.yml bzzz-monitoring

# Verify monitoring services
curl -f http://localhost:9090/-/healthy # Prometheus
curl -f http://localhost:3000/api/health # Grafana
```

#### Step 5: Verify Deployment

```bash
# Check all services are running
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2

# Test external endpoints
curl -f https://bzzz.deepblack.cloud/health
curl -f https://mcp.deepblack.cloud/health
curl -f https://resolve.deepblack.cloud/health

# Check P2P mesh connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  curl -s http://localhost:9000/api/v2/peers | jq '.connected_peers | length'
```
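
Right after a deploy, health endpoints may fail for a short while as services warm up, so a single `curl -f` can give a false alarm. A small retry wrapper is one way to handle that. The `retry` function below is an illustrative assumption rather than part of the BZZZ scripts, and the attempt count and delay in the example are arbitrary.

```bash
#!/usr/bin/env bash
# retry ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or ATTEMPTS is exhausted.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    "$@" && return 0
    # Sleep between attempts, but not after the final one
    [ "$i" -lt "$attempts" ] && sleep "$delay"
  done
  return 1
}

# Example against the endpoints listed above:
# for host in bzzz mcp resolve; do
#   retry 5 10 curl -fsS -o /dev/null "https://${host}.deepblack.cloud/health" \
#     || echo "UNHEALTHY: ${host}" >&2
# done
```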

### 2. Update Deployment (Rolling Update)

#### Step 1: Pre-Update Checks

```bash
# Check current deployment health
docker stack ps bzzz-v2 | grep -v "Shutdown\|Failed"

# Backup current configuration
# Capture the timestamp once so both files land in the directory that was created
BACKUP_DIR="/rust/bzzz-v2/backup/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
docker config ls | grep bzzz_ > "$BACKUP_DIR/configs.txt"
docker secret ls | grep bzzz_ > "$BACKUP_DIR/secrets.txt"
```

#### Step 2: Update Images

```bash
# Update to new image version
export NEW_IMAGE_TAG="v2.1.0"

# Update Docker Compose file with new image tags
sed -i "s/registry.home.deepblack.cloud\/bzzz:.*$/registry.home.deepblack.cloud\/bzzz:${NEW_IMAGE_TAG}/g" \
  docker-compose.swarm.yml

# Deploy updated stack (rolling update)
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
```
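
The `sed` pattern above ends in `.*$`, so it replaces everything after the tag up to the end of the line, including a closing quote or trailing comment if the compose file has one. A narrower substitution that can be previewed before editing in place might look like the sketch below. `set_image_tag` is a hypothetical helper name, and it assumes tags contain no whitespace, quotes, or `|` characters.

```bash
#!/usr/bin/env bash
# Rewrites only the tag of the bzzz image in compose text read from stdin.
# set_image_tag is an illustrative helper, not part of the BZZZ tooling.
set_image_tag() {
  local tag="$1"
  # Match the tag up to the next whitespace or quote, leaving the rest of the line intact
  sed -E "s|(registry\.home\.deepblack\.cloud/bzzz):[^[:space:]\"']+|\1:${tag}|g"
}

# Preview the change without touching the file:
# set_image_tag "$NEW_IMAGE_TAG" < docker-compose.swarm.yml | diff docker-compose.swarm.yml -
```

Reviewing the diff first gives a chance to confirm that only the intended `image:` lines change.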

#### Step 3: Monitor Update Progress

```bash
# Watch rolling update progress
watch "docker service ps bzzz-v2_bzzz-agent | head -20"

# Check for any failed updates
# (docker service ps has no current-state filter, so match on the formatted state instead)
docker service ps bzzz-v2_bzzz-agent --format '{{.Name}}\t{{.CurrentState}}\t{{.Error}}' | grep -Ei 'failed|rejected'
```

### 3. Migration from v1 to v2

```bash
# Use the automated migration script
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/migration-scripts

# Dry run first to preview changes
./migrate-v1-to-v2.sh --dry-run

# Execute full migration
./migrate-v1-to-v2.sh

# If rollback is needed
./migrate-v1-to-v2.sh --rollback
```

## Monitoring and Health Checks

### Health Check Commands

```bash
# Service health checks
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
docker service ps bzzz-v2_bzzz-agent --filter desired-state=running

# Application health checks
curl -f https://bzzz.deepblack.cloud/health
curl -f https://mcp.deepblack.cloud/health
curl -f https://resolve.deepblack.cloud/health
curl -f https://openai.deepblack.cloud/health

# P2P network health
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  curl -s http://localhost:9000/api/v2/dht/stats | jq '.'

# Database connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  pg_isready -U bzzz -d bzzz_v2

# Cache connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_redis) \
  redis-cli ping
```

### Performance Monitoring

```bash
# Check resource usage
docker stats --no-stream

# Monitor disk usage
df -h /rust/bzzz-v2/data/

# Check network connections
netstat -tuln | grep -E ":(9000|3001|3002|3003|9101|9102|9103)"

# Monitor OpenAI API usage
curl -s http://localhost:9203/metrics | grep openai_cost
```
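
`grep openai_cost` prints one line per label set, which still has to be added up by hand. Assuming the proxy exposes the standard Prometheus text format, a small `awk` filter can total a metric across its labels. The metric name `openai_cost_total` in the example is a guess for illustration, since the runbook only shows the `openai_cost` prefix, and `sum_metric` is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sums every sample of METRIC (across all label sets) from Prometheus
# text-format input on stdin. Assumes samples carry no trailing timestamp.
sum_metric() {
  awk -v name="$1" '
    /^#/ { next }                                  # skip HELP/TYPE comment lines
    index($0, name) == 1 {
      rest = substr($0, length(name) + 1, 1)       # character right after the name
      if (rest == "{" || rest == " ") total += $NF # exact metric name only
    }
    END { print total + 0 }
  '
}

# Example (metric name is an assumption):
# curl -s http://localhost:9203/metrics | sum_metric openai_cost_total
```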

## Troubleshooting Guide

### Common Issues and Solutions

#### 1. Service Won't Start

**Symptoms:** Service stuck in `preparing` or constantly restarting

**Diagnosis:**
```bash
# Check service logs
docker service logs bzzz-v2_bzzz-agent --tail 50

# Check node resources
docker node ls
docker system df

# Verify secrets and configs
docker secret ls | grep bzzz_
docker config ls | grep bzzz_
```

**Solutions:**
- Check resource constraints and availability
- Verify secrets and configs are accessible
- Ensure image is available and correct
- Check node labels and placement constraints

#### 2. P2P Network Issues

**Symptoms:** Agents not discovering each other, DHT lookups failing

**Diagnosis:**
```bash
# Check peer connections
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  curl -s http://localhost:9000/api/v2/peers

# Check DHT bootstrap nodes
curl http://localhost:9101/health
curl http://localhost:9102/health
curl http://localhost:9103/health

# Check network connectivity
docker network inspect bzzz-internal
```

**Solutions:**
- Restart DHT bootstrap services
- Check firewall rules for P2P ports
- Verify Docker Swarm overlay network
- Check for port conflicts

#### 3. High OpenAI Costs

**Symptoms:** Cost alerts triggering, rate limits being hit

**Diagnosis:**
```bash
# Check current usage
curl -s http://localhost:9203/metrics | grep -E "openai_(cost|requests|tokens)"

# Check rate limiting
docker service logs bzzz-v2_openai-proxy --tail 100 | grep "rate limit"
```

**Solutions:**
- Adjust rate limiting parameters
- Review conversation patterns for excessive API calls
- Implement request caching
- Consider model selection optimization

#### 4. Database Connection Issues

**Symptoms:** Service errors related to database connectivity

**Diagnosis:**
```bash
# Check PostgreSQL status
docker service logs bzzz-v2_postgres --tail 50

# Test connection from agent
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  pg_isready -h postgres -U bzzz

# Check connection limits
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  psql -U bzzz -d bzzz_v2 -c "SELECT count(*) FROM pg_stat_activity;"
```

**Solutions:**
- Restart PostgreSQL service
- Check connection pool settings
- Increase max_connections if needed
- Review long-running queries
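
To review long-running queries, `pg_stat_activity` can be filtered by how long each session's current query has been running. The helper below only builds the SQL. `long_running_sql` and the 5-minute default are illustrative assumptions, not values taken from the BZZZ configuration.

```bash
#!/usr/bin/env bash
# Prints SQL listing non-idle sessions whose current query has been
# running longer than MINUTES (default 5, an arbitrary example threshold).
long_running_sql() {
  local minutes="${1:-5}"
  cat <<SQL
SELECT pid, now() - query_start AS duration, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '${minutes} minutes'
ORDER BY duration DESC;
SQL
}

# Example, reusing the container lookup from the diagnosis block:
# long_running_sql 10 | docker exec -i \
#   "$(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres)" \
#   psql -U bzzz -d bzzz_v2
```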

#### 5. Storage Issues

**Symptoms:** Disk full alerts, content store errors

**Diagnosis:**
```bash
# Check disk usage
df -h /rust/bzzz-v2/data/
du -sh /rust/bzzz-v2/data/blobs/

# Check content store health
curl -s http://localhost:9202/metrics | grep content_store
```

**Solutions:**
- Run garbage collection on old blobs
- Clean up old conversation threads
- Increase storage capacity
- Adjust retention policies
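
The runbook does not say how blob garbage collection is done. Before deleting anything, it may help to list candidates by age, as in the sketch below. It assumes that modification time is a usable staleness signal, which may not hold if blobs are content-addressed and still referenced. `list_stale_blobs` is a hypothetical helper, and it only lists files.

```bash
#!/usr/bin/env bash
# Prints regular files under DIR not modified for more than DAYS days.
# Listing only: nothing is deleted.
list_stale_blobs() {
  local dir="$1" days="$2"
  find "$dir" -type f -mtime +"$days" -print
}

# Example (the 30-day window is an arbitrary illustration):
# list_stale_blobs /rust/bzzz-v2/data/blobs 30 | wc -l
```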

## Emergency Procedures

### Service Outage Response

#### Priority 1: Complete Service Outage

```bash
# 1. Check cluster status
docker node ls
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2

# 2. Emergency restart of critical services
docker service update --force bzzz-v2_bzzz-agent
docker service update --force bzzz-v2_postgres
docker service update --force bzzz-v2_redis

# 3. If stack is corrupted, redeploy
docker stack rm bzzz-v2
sleep 60
docker stack deploy -c docker-compose.swarm.yml bzzz-v2

# 4. Monitor recovery
watch docker stack ps bzzz-v2
```

#### Priority 2: Partial Service Degradation

```bash
# 1. Identify problematic services
# (docker service ps has no current-state filter, so match on the formatted state instead)
docker service ps bzzz-v2_bzzz-agent --format '{{.Name}}\t{{.CurrentState}}\t{{.Error}}' | grep -Ei 'failed|rejected'

# 2. Scale up healthy replicas
docker service update --replicas 3 bzzz-v2_bzzz-agent

# 3. Remove unhealthy tasks
docker service update --force bzzz-v2_bzzz-agent
```

### Security Incident Response

#### Step 1: Immediate Containment

```bash
# 1. Block suspicious IPs
sudo ufw insert 1 deny from SUSPICIOUS_IP

# 2. Check for compromise indicators
sudo fail2ban-client status
sudo tail -100 /var/log/audit/audit.log | grep -i "denied\|failed\|error"

# 3. Isolate affected services
docker service update --replicas 0 AFFECTED_SERVICE
```

#### Step 2: Investigation

```bash
# 1. Check access logs
docker service logs bzzz-v2_bzzz-agent --since 1h | grep -i "error\|failed\|unauthorized"

# 2. Review monitoring alerts
# (port 9093 is Alertmanager, whose v2 API reports alert state as .status.state == "active")
curl -s http://localhost:9093/api/v2/alerts | jq '.[] | select(.status.state=="active")'

# 3. Examine network connections
netstat -tuln
ss -tulpn | grep -E ":(9000|3001|3002|3003)"
```

#### Step 3: Recovery

```bash
# 1. Update security rules
sudo ./infrastructure/security/security-hardening.sh

# 2. Rotate secrets if compromised
# Note: docker secret rm is rejected while a service still uses the secret,
# so detach it from running services first. The database role password must
# also be changed inside PostgreSQL to match the new value.
docker secret rm bzzz_postgres_password
openssl rand -base64 32 | docker secret create bzzz_postgres_password -

# 3. Restart services with new secrets
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
```
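
Docker refuses to remove a secret that a running service still references. A common alternative is to create the replacement under a new name and swap it in with `docker service update --secret-rm/--secret-add`. The sketch below makes several assumptions: `rotate_secret` and the `_v2` suffix are hypothetical, the compose file would need to reference the new secret name before the next `docker stack deploy`, and the PostgreSQL role password still has to be changed separately. Setting `DOCKER=echo` prints the commands instead of running them.

```bash
#!/usr/bin/env bash
DOCKER="${DOCKER:-docker}"   # set DOCKER=echo for a dry run

# rotate_secret SERVICE SECRET_NAME VERSION_SUFFIX
# Creates SECRET_NAME_VERSION_SUFFIX and mounts it under the old target name.
rotate_secret() {
  local service="$1" name="$2" new="${2}_${3}"
  # 1. Create the replacement under a new, versioned name
  openssl rand -base64 32 | $DOCKER secret create "$new" - || return 1
  # 2. Swap it in, keeping the in-container file name unchanged
  $DOCKER service update \
    --secret-rm "$name" \
    --secret-add "source=${new},target=${name}" \
    "$service"
}

# Example (dry run):
# DOCKER=echo rotate_secret bzzz-v2_postgres bzzz_postgres_password v2
```

Once no service references the old secret, it can be removed with `docker secret rm`.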

### Data Recovery Procedures

#### Backup Restoration

```bash
# 1. Stop services
docker stack rm bzzz-v2

# 2. Restore from backup
BACKUP_DATE="20241201-120000"
rsync -av /rust/bzzz-v2/backup/$BACKUP_DATE/ /rust/bzzz-v2/data/

# 3. Restart services
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
```

#### Database Recovery

```bash
# 1. Stop application services
docker service scale bzzz-v2_bzzz-agent=0

# 2. Back up the current database before overwriting it
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  pg_dump -U bzzz bzzz_v2 > /rust/bzzz-v2/backup/database-$(date +%Y%m%d-%H%M%S).sql

# 3. Restore database (database-backup.sql stands for the dump being restored)
docker exec -i $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  psql -U bzzz -d bzzz_v2 < /rust/bzzz-v2/backup/database-backup.sql

# 4. Restart application services
docker service scale bzzz-v2_bzzz-agent=3
```
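
Because the restore pipes a file straight into `psql`, an empty or truncated dump would only show up after the database has been touched. A cheap sanity check beforehand is sketched below. It relies on `pg_dump` writing a `PostgreSQL database dump` comment at the top of plain-text dumps. `dump_looks_valid` is a hypothetical helper and does not verify the dump is complete.

```bash
#!/usr/bin/env bash
# Succeeds when FILE is non-empty and starts with the header comment that
# pg_dump writes at the top of plain-text (SQL) dumps.
dump_looks_valid() {
  local file="$1"
  [ -s "$file" ] || return 1
  head -n 5 "$file" | grep -q 'PostgreSQL database dump'
}

# Example:
# dump_looks_valid /rust/bzzz-v2/backup/database-backup.sql \
#   || { echo "Refusing to restore: dump missing or malformed" >&2; exit 1; }
```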

## Maintenance Procedures

### Routine Maintenance (Weekly)

```bash
#!/bin/bash
# Weekly maintenance script

# 1. Check service health
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
docker system df

# 2. Clean up unused resources
docker system prune -f
docker volume prune -f

# 3. Backup critical data
pg_dump -h localhost -U bzzz bzzz_v2 | gzip > \
  /rust/bzzz-v2/backup/weekly-db-$(date +%Y%m%d).sql.gz

# 4. Rotate logs
find /rust/bzzz-v2/logs -name "*.log" -mtime +7 -delete

# 5. Check certificate expiration
openssl x509 -in /rust/bzzz-v2/config/tls/server/walnut.pem -noout -dates

# 6. Update security rules
fail2ban-client reload

# 7. Generate maintenance report
echo "Maintenance completed on $(date)" >> /rust/bzzz-v2/logs/maintenance.log
```
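
The weekly script writes a new `weekly-db-*.sql.gz` each run but never removes old ones, so the backup directory grows without bound. If a retention limit is wanted, one option is to list every dump beyond the newest N and review the list before deleting. This is a sketch only: `backups_to_prune` and the retention count are assumptions, since the runbook defines no retention policy.

```bash
#!/usr/bin/env bash
# Prints the weekly dumps in DIR beyond the newest KEEP (by modification time).
# Listing only; pipe to `xargs -r rm --` once the output looks right.
backups_to_prune() {
  local dir="$1" keep="$2"
  # ls -1t sorts newest first; tail skips the first KEEP entries
  ls -1t "$dir"/weekly-db-*.sql.gz 2>/dev/null | tail -n +"$((keep + 1))"
}

# Example (keeping 8 weeks is an arbitrary illustration):
# backups_to_prune /rust/bzzz-v2/backup 8
```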

### Scaling Procedures

#### Scale Up

```bash
# Increase replica count
docker service scale bzzz-v2_bzzz-agent=5
docker service scale bzzz-v2_mcp-server=5

# Add new node to cluster (run on new node)
docker swarm join --token $WORKER_TOKEN $MANAGER_IP:2377

# Label new node
docker node update --label-add bzzz.role=agent NEW_NODE_HOSTNAME
```

#### Scale Down

```bash
# Gracefully reduce replicas
docker service scale bzzz-v2_bzzz-agent=2
docker service scale bzzz-v2_mcp-server=2

# Remove node from cluster
docker node update --availability drain NODE_HOSTNAME
# docker node rm only succeeds once the node is down, so run
# `docker swarm leave` on NODE_HOSTNAME before removing it from a manager
docker node rm NODE_HOSTNAME
```

## Performance Tuning

### Database Optimization

```bash
# PostgreSQL tuning
# Each ALTER SYSTEM is passed as its own -c because it cannot run inside the
# implicit transaction that wraps a multi-statement command string.
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  psql -U bzzz -d bzzz_v2 \
    -c "ALTER SYSTEM SET shared_buffers = '1GB';" \
    -c "ALTER SYSTEM SET max_connections = 200;" \
    -c "ALTER SYSTEM SET checkpoint_timeout = '15min';" \
    -c "SELECT pg_reload_conf();"
# The reload applies checkpoint_timeout; shared_buffers and max_connections
# only take effect after PostgreSQL is restarted.
```

### Storage Optimization

```bash
# Content store optimization
find /rust/bzzz-v2/data/blobs -name "*.tmp" -mtime +1 -delete
find /rust/bzzz-v2/data/blobs -type f -size 0 -delete

# Compress old logs
find /rust/bzzz-v2/logs -name "*.log" -mtime +3 -exec gzip {} \;
```

### Network Optimization

```bash
# Optimize network buffer sizes
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 4096 87380 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.tcp_wmem = 4096 65536 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```
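
Appending with `tee -a` adds the same lines again every time the block is re-run. An idempotent alternative is to replace an existing assignment or append a new one. The sketch below is an assumption-laden illustration: `set_sysctl` is a hypothetical helper, and the drop-in file name `/etc/sysctl.d/99-bzzz.conf` in the example is invented.

```bash
#!/usr/bin/env bash
# set_sysctl FILE KEY VALUE
# Replaces an existing "KEY = ..." line in FILE or appends one, so repeated
# runs leave exactly one assignment per key.
set_sysctl() {
  local file="$1" key="$2" value="$3" tmp
  touch "$file"
  tmp=$(mktemp) || return 1
  # Copy every line except previous assignments of KEY
  awk -v k="$key" '{ split($0, kv, "="); gsub(/[ \t]/, "", kv[1]); if (kv[1] != k) print }' \
    "$file" > "$tmp"
  printf '%s = %s\n' "$key" "$value" >> "$tmp"
  cat "$tmp" > "$file" && rm -f "$tmp"
}

# Example (run as root; the file name is hypothetical):
# set_sysctl /etc/sysctl.d/99-bzzz.conf net.core.rmem_max 134217728
# set_sysctl /etc/sysctl.d/99-bzzz.conf net.core.wmem_max 134217728
# sysctl --system
```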

## Contact Information

### On-Call Procedures

- **Primary Contact**: DevOps Team Lead
- **Secondary Contact**: Senior Site Reliability Engineer
- **Escalation**: Platform Engineering Manager

### Communication Channels

- **Slack**: #bzzz-incidents
- **Email**: devops@deepblack.cloud
- **Phone**: Emergency On-Call Rotation

### Documentation

- **Runbooks**: This document
- **Architecture**: `/docs/BZZZ_V2_INFRASTRUCTURE_ARCHITECTURE.md`
- **API Documentation**: https://bzzz.deepblack.cloud/docs
- **Monitoring Dashboards**: https://grafana.deepblack.cloud

---

*This runbook should be reviewed and updated monthly. Last updated: $(date)*