Prepare for v2 development: Add MCP integration and future development planning

- Add FUTURE_DEVELOPMENT.md with comprehensive v2 protocol specification
- Add MCP integration design and implementation foundation
- Add infrastructure and deployment configurations
- Update system architecture for v2 evolution

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-07 14:38:22 +10:00
parent 5f94288fbb
commit 065dddf8d5
41 changed files with 14970 additions and 161 deletions
--- a/infrastructure/docs/DEPLOYMENT_RUNBOOK.md
+++ b/infrastructure/docs/DEPLOYMENT_RUNBOOK.md
@@ -0,0 +1,581 @@
+# BZZZ v2 Deployment Runbook
+
+## Overview
+
+This runbook provides step-by-step procedures for deploying, operating, and maintaining BZZZ v2 infrastructure. It covers normal operations, emergency procedures, and troubleshooting guidelines.
+
+## Prerequisites
+
+### System Requirements
+
+- **Cluster**: 3 nodes (WALNUT, IRONWOOD, ACACIA)
+- **OS**: Ubuntu 22.04 LTS or newer
+- **Docker**: Version 24+ with Swarm mode enabled
+- **Storage**: NFS mount at `/rust/` with 500GB+ available
+- **Network**: Internal 192.168.1.0/24 with external internet access
+- **Secrets**: OpenAI API key and database credentials
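
The Docker version and storage requirements above can be checked with a short preflight script. This is a minimal sketch, not part of the BZZZ tooling: the function names are hypothetical, and it assumes GNU `df` (for `--output`) and a reachable Docker daemon.

```bash
# Hypothetical preflight helpers; the names are illustrative, not part of BZZZ.

# Succeed when the major component of a version string such as "24.0.7"
# is at least the required major version.
docker_major_at_least() {
  local major="${1%%.*}" required="$2"
  case "$major" in
    ''|*[!0-9]*) return 1 ;;  # not a plain number
  esac
  [ "$major" -ge "$required" ]
}

# Succeed when the available space in whole gigabytes meets the minimum.
storage_at_least_gb() {
  [ "$1" -ge "$2" ]
}

# Run both checks against the live system (needs Docker and the /rust/ mount).
preflight() {
  local version available_gb
  version="$(docker version --format '{{.Server.Version}}')" || return 1
  # `df -BG` reports whole gigabytes such as "812G"; keep only the digits.
  available_gb="$(df -BG --output=avail /rust/ | tail -n 1 | tr -dc '0-9')"
  docker_major_at_least "$version" 24 ||
    { echo "Docker 24+ required, found $version"; return 1; }
  storage_at_least_gb "$available_gb" 500 ||
    { echo "Need 500GB+ free on /rust/, found ${available_gb}GB"; return 1; }
  echo "Preflight checks passed"
}
```

Calling `preflight` on each node before deployment reports the first unmet requirement. The two comparison helpers do not touch Docker or the filesystem, so they can be exercised on their own.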
+
+### Access Requirements
+
+- SSH access to all cluster nodes
+- Docker Swarm manager privileges
+- Sudo access for system configuration
+- GitLab access for CI/CD pipeline management
+
+## Pre-Deployment Checklist
+
+### Infrastructure Verification
+
+```bash
+# Verify Docker Swarm status
+docker node ls
+docker network ls | grep tengig
+
+# Check available storage
+df -h /rust/
+
+# Verify network connectivity
+ping -c 3 192.168.1.27  # WALNUT
+ping -c 3 192.168.1.113 # IRONWOOD  
+ping -c 3 192.168.1.xxx # ACACIA
+
+# Test registry access
+docker pull registry.home.deepblack.cloud/hello-world || echo "Registry access test failed"
+```
+
+### Security Hardening
+
+```bash
+# Run security hardening script
+cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/security
+sudo ./security-hardening.sh
+
+# Verify firewall status
+sudo ufw status verbose
+
+# Check fail2ban status
+sudo fail2ban-client status
+```
+
+## Deployment Procedures
+
+### 1. Initial Deployment (Fresh Install)
+
+#### Step 1: Prepare Infrastructure
+
+```bash
+# Create directory structure
+mkdir -p /rust/bzzz-v2/{config,data,logs,backup}
+mkdir -p /rust/bzzz-v2/data/{blobs,conversations,dht,postgres,redis}
+mkdir -p /rust/bzzz-v2/config/{swarm,monitoring,security}
+
+# Set permissions
+sudo chown -R tony:tony /rust/bzzz-v2
+chmod -R 755 /rust/bzzz-v2
+```
+
+#### Step 2: Configure Secrets and Configs
+
+```bash
+cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure
+
+# Create Docker secrets
+docker secret create bzzz_postgres_password config/secrets/postgres_password
+docker secret create bzzz_openai_api_key ~/chorus/business/secrets/openai-api-key
+docker secret create bzzz_grafana_admin_password config/secrets/grafana_admin_password
+
+# Create Docker configs
+docker config create bzzz_v2_config config/bzzz-config.yaml
+docker config create bzzz_prometheus_config monitoring/configs/prometheus.yml
+docker config create bzzz_alertmanager_config monitoring/configs/alertmanager.yml
+```
+
+#### Step 3: Deploy Core Services
+
+```bash
+# Deploy main BZZZ v2 stack
+docker stack deploy -c docker-compose.swarm.yml bzzz-v2
+
+# Wait for services to start (this may take 5-10 minutes)
+watch docker stack ps bzzz-v2
+```
+
+#### Step 4: Deploy Monitoring Stack
+
+```bash
+# Deploy monitoring services
+docker stack deploy -c monitoring/docker-compose.monitoring.yml bzzz-monitoring
+
+# Verify monitoring services
+curl -f http://localhost:9090/-/healthy  # Prometheus
+curl -f http://localhost:3000/api/health # Grafana
+```
+
+#### Step 5: Verify Deployment
+
+```bash
+# Check all services are running
+docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
+
+# Test external endpoints
+curl -f https://bzzz.deepblack.cloud/health
+curl -f https://mcp.deepblack.cloud/health
+curl -f https://resolve.deepblack.cloud/health
+
+# Check P2P mesh connectivity
+docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
+  curl -s http://localhost:9000/api/v2/peers | jq '.connected_peers | length'
+```
+
+### 2. Update Deployment (Rolling Update)
+
+#### Step 1: Pre-Update Checks
+
+```bash
+# Check current deployment health
+docker stack ps bzzz-v2 | grep -v "Shutdown\|Failed"
+
+# Backup current configuration (capture the timestamp once so every file lands in the same directory)
+BACKUP_DIR="/rust/bzzz-v2/backup/$(date +%Y%m%d-%H%M%S)"
+mkdir -p "$BACKUP_DIR"
+docker config ls | grep bzzz_ > "$BACKUP_DIR/configs.txt"
+docker secret ls | grep bzzz_ > "$BACKUP_DIR/secrets.txt"
+```
+
+#### Step 2: Update Images
+
+```bash
+# Update to new image version
+export NEW_IMAGE_TAG="v2.1.0"
+
+# Update Docker Compose file with new image tags
+sed -i "s/registry.home.deepblack.cloud\/bzzz:.*$/registry.home.deepblack.cloud\/bzzz:${NEW_IMAGE_TAG}/g" \
+  docker-compose.swarm.yml
+
+# Deploy updated stack (rolling update)
+docker stack deploy -c docker-compose.swarm.yml bzzz-v2
+```
+
+#### Step 3: Monitor Update Progress
+
+```bash
+# Watch rolling update progress
+watch "docker service ps bzzz-v2_bzzz-agent | head -20"
+
+# Check for any failed tasks (`docker service ps` has no current-state filter, so filter the output instead)
+docker service ps bzzz-v2_bzzz-agent --no-trunc --format '{{.Name}} {{.CurrentState}} {{.Error}}' | grep -i failed
+```
+
+### 3. Migration from v1 to v2
+
+```bash
+# Use the automated migration script
+cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/migration-scripts
+
+# Dry run first to preview changes
+./migrate-v1-to-v2.sh --dry-run
+
+# Execute full migration
+./migrate-v1-to-v2.sh
+
+# If rollback is needed
+./migrate-v1-to-v2.sh --rollback
+```
+
+## Monitoring and Health Checks
+
+### Health Check Commands
+
+```bash
+# Service health checks
+docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
+docker service ps bzzz-v2_bzzz-agent --filter desired-state=running
+
+# Application health checks
+curl -f https://bzzz.deepblack.cloud/health
+curl -f https://mcp.deepblack.cloud/health
+curl -f https://resolve.deepblack.cloud/health
+curl -f https://openai.deepblack.cloud/health
+
+# P2P network health
+docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
+  curl -s http://localhost:9000/api/v2/dht/stats | jq '.'
+
+# Database connectivity
+docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
+  pg_isready -U bzzz -d bzzz_v2
+
+# Cache connectivity  
+docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_redis) \
+  redis-cli ping
+```
+
+### Performance Monitoring
+
+```bash
+# Check resource usage
+docker stats --no-stream
+
+# Monitor disk usage
+df -h /rust/bzzz-v2/data/
+
+# Check network connections
+netstat -tuln | grep -E ":(9000|3001|3002|3003|9101|9102|9103)"
+
+# Monitor OpenAI API usage
+curl -s http://localhost:9203/metrics | grep openai_cost
+```
+
+## Troubleshooting Guide
+
+### Common Issues and Solutions
+
+#### 1. Service Won't Start
+
+**Symptoms:** Service stuck in `preparing` or constantly restarting
+
+**Diagnosis:**
+```bash
+# Check service logs
+docker service logs bzzz-v2_bzzz-agent --tail 50
+
+# Check node resources
+docker node ls
+docker system df
+
+# Verify secrets and configs
+docker secret ls | grep bzzz_
+docker config ls | grep bzzz_
+```
+
+**Solutions:**
+- Check resource constraints and availability
+- Verify secrets and configs are accessible
+- Ensure image is available and correct
+- Check node labels and placement constraints
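
The secret, config, and placement checks in the list above can be scripted roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and the secret names are the three created in the initial deployment step. Adjust the list if your stack references others.

```bash
# Hypothetical helpers; names are illustrative, not part of BZZZ.

# Print every name from the first space-separated list that is absent
# from the second space-separated list.
missing_names() {
  local required="$1" present="$2" name
  for name in $required; do
    case " $present " in
      *" $name "*) ;;          # found, nothing to report
      *) echo "$name" ;;
    esac
  done
}

# Report missing secrets, then show the agent's placement constraints next to
# the labels on each node so mismatches are visible (needs a Swarm manager).
check_service_inputs() {
  local present
  present="$(docker secret ls --format '{{.Name}}' | tr '\n' ' ')"
  missing_names \
    "bzzz_postgres_password bzzz_openai_api_key bzzz_grafana_admin_password" \
    "$present"
  docker service inspect \
    --format '{{json .Spec.TaskTemplate.Placement.Constraints}}' \
    bzzz-v2_bzzz-agent
  docker node ls -q | xargs docker node inspect \
    --format '{{.Description.Hostname}} {{json .Spec.Labels}}'
}
```

An empty first section of output means all expected secrets exist. A constraint such as `node.labels.bzzz.role==agent` with no node carrying that label would explain tasks that never get scheduled.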
+
+#### 2. P2P Network Issues
+
+**Symptoms:** Agents not discovering each other, DHT lookups failing
+
+**Diagnosis:**
+```bash
+# Check peer connections
+docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
+  curl -s http://localhost:9000/api/v2/peers
+
+# Check DHT bootstrap nodes
+curl http://localhost:9101/health
+curl http://localhost:9102/health  
+curl http://localhost:9103/health
+
+# Check network connectivity
+docker network inspect bzzz-internal
+```
+
+**Solutions:**
+- Restart DHT bootstrap services
+- Check firewall rules for P2P ports
+- Verify Docker Swarm overlay network
+- Check for port conflicts
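
For the firewall and port checks above, it can help to see which of the expected ports are actually listening. A minimal sketch, assuming the port numbers used elsewhere in this runbook (9000 for the agent API, 9101-9103 for the DHT bootstrap nodes); the function name is hypothetical.

```bash
# Hypothetical helper; the name is illustrative, not part of BZZZ.

# Read `ss -tuln` output on stdin and print each port from the
# space-separated list that has no listening socket.
ports_not_listening() {
  local ports="$1" listing port
  listing="$(cat)"
  for port in $ports; do
    # The local address column ends in ":PORT" followed by whitespace.
    printf '%s\n' "$listing" | grep -Eq ":${port}[[:space:]]" || echo "$port"
  done
}
```

Usage on a node: `ss -tuln | ports_not_listening "9000 9101 9102 9103"`. Any port printed is not bound locally, which points at a stopped service rather than a firewall rule. Ports that are bound but unreachable from other nodes are more likely blocked, which `sudo ufw status verbose` can confirm.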
+
+#### 3. High OpenAI Costs
+
+**Symptoms:** Cost alerts triggering, rate limits being hit
+
+**Diagnosis:**
+```bash
+# Check current usage
+curl -s http://localhost:9203/metrics | grep -E "openai_(cost|requests|tokens)"
+
+# Check rate limiting
+docker service logs bzzz-v2_openai-proxy --tail 100 | grep "rate limit"
+```
+
+**Solutions:**
+- Adjust rate limiting parameters
+- Review conversation patterns for excessive API calls
+- Implement request caching
+- Consider model selection optimization
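
Before adjusting rate limits, it may be useful to total the cost metrics exposed on port 9203. The sketch below sums every sample whose metric name starts with a given prefix. It assumes the endpoint serves the Prometheus text format without trailing timestamps; the metric name in the example is hypothetical, since the runbook only shows that the names contain `openai_cost`.

```bash
# Hypothetical helper; the name is illustrative, not part of BZZZ.

# Read Prometheus text-format metrics on stdin and print the sum of all
# samples whose metric name starts with the given prefix.
sum_metric() {
  awk -v prefix="$1" '
    /^#/ { next }                              # skip HELP and TYPE lines
    index($1, prefix) == 1 { total += $NF }    # value is the last field
    END { printf "%.2f\n", total + 0 }
  '
}
```

Usage: `curl -s http://localhost:9203/metrics | sum_metric openai_cost`. Running it at two points in time and comparing the totals gives a rough spend rate to weigh against the configured limits.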
+
+#### 4. Database Connection Issues
+
+**Symptoms:** Service errors related to database connectivity
+
+**Diagnosis:**
+```bash
+# Check PostgreSQL status
+docker service logs bzzz-v2_postgres --tail 50
+
+# Test connection from agent
+docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
+  pg_isready -h postgres -U bzzz
+
+# Check connection limits
+docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
+  psql -U bzzz -d bzzz_v2 -c "SELECT count(*) FROM pg_stat_activity;"
+```
+
+**Solutions:**
+- Restart PostgreSQL service
+- Check connection pool settings
+- Increase max_connections if needed
+- Review long-running queries
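
The connection-limit and long-running-query checks above can be combined into one report. This is a sketch, not an established BZZZ script: the function names and the five-minute threshold are assumptions, while the container lookup, user, and database name follow the commands used elsewhere in this runbook.

```bash
# Hypothetical helpers; names and thresholds are illustrative.

# Integer percentage of connections in use, e.g. 150 of 200 gives 75.
connection_usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Run one SQL statement in the PostgreSQL container and print bare values.
pg_query() {
  docker exec "$(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres)" \
    psql -U bzzz -d bzzz_v2 -At -c "$1"
}

# Show connection headroom and any statement running longer than five minutes.
report_connections() {
  local used max
  used="$(pg_query 'SELECT count(*) FROM pg_stat_activity;')"
  max="$(pg_query 'SHOW max_connections;')"
  echo "Connections: ${used}/${max} ($(connection_usage_pct "$used" "$max")%)"
  pg_query "SELECT pid, now() - query_start AS runtime, state, left(query, 80)
            FROM pg_stat_activity
            WHERE state <> 'idle'
              AND now() - query_start > interval '5 minutes'
            ORDER BY runtime DESC;"
}
```

If usage is close to 100%, check pool settings before raising `max_connections`, since each connection consumes server memory. Long-running statements listed by the second query are candidates for review.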
+
+#### 5. Storage Issues
+
+**Symptoms:** Disk full alerts, content store errors
+
+**Diagnosis:**
+```bash
+# Check disk usage
+df -h /rust/bzzz-v2/data/
+du -sh /rust/bzzz-v2/data/blobs/
+
+# Check content store health
+curl -s http://localhost:9202/metrics | grep content_store
+```
+
+**Solutions:**
+- Run garbage collection on old blobs
+- Clean up old conversation threads
+- Increase storage capacity
+- Adjust retention policies
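
Before deleting anything, a dry run that lists cleanup candidates is a safer first step. The sketch below only reports. The function names and the 30-day age are assumptions, and whether an old blob is safe to remove depends on the content store's own retention rules, which this script does not know about. It assumes GNU `df` and `find`.

```bash
# Hypothetical helpers; names and the age threshold are illustrative.

# Print regular files under a directory last modified more than N days ago.
list_stale_files() {
  find "$1" -type f -mtime +"$2" | sort
}

# Print the usage percentage of the filesystem holding a path.
disk_usage_pct() {
  df --output=pcent "$1" | tail -n 1 | tr -dc '0-9'
}

# Dry run: report usage and list candidates without deleting anything.
storage_report() {
  echo "Usage: $(disk_usage_pct /rust/bzzz-v2/data/)%"
  list_stale_files /rust/bzzz-v2/data/blobs 30
}
```

Review the output of `storage_report` against the retention policy before removing any file.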
+
+## Emergency Procedures
+
+### Service Outage Response
+
+#### Priority 1: Complete Service Outage
+
+```bash
+# 1. Check cluster status
+docker node ls
+docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
+
+# 2. Emergency restart of critical services
+docker service update --force bzzz-v2_bzzz-agent
+docker service update --force bzzz-v2_postgres
+docker service update --force bzzz-v2_redis
+
+# 3. If stack is corrupted, redeploy
+docker stack rm bzzz-v2
+sleep 60
+docker stack deploy -c docker-compose.swarm.yml bzzz-v2
+
+# 4. Monitor recovery
+watch docker stack ps bzzz-v2
+```
+
+#### Priority 2: Partial Service Degradation
+
+```bash
+# 1. Identify problematic services (`docker service ps` has no current-state filter, so filter the output instead)
+docker service ps bzzz-v2_bzzz-agent --no-trunc --format '{{.Name}} {{.CurrentState}} {{.Error}}' | grep -i failed
+
+# 2. Scale up healthy replicas
+docker service update --replicas 3 bzzz-v2_bzzz-agent
+
+# 3. Remove unhealthy tasks
+docker service update --force bzzz-v2_bzzz-agent
+```
+
+### Security Incident Response
+
+#### Step 1: Immediate Containment
+
+```bash
+# 1. Block suspicious IPs
+sudo ufw insert 1 deny from SUSPICIOUS_IP
+
+# 2. Check for compromise indicators
+sudo fail2ban-client status
+sudo tail -100 /var/log/audit/audit.log | grep -i "denied\|failed\|error"
+
+# 3. Isolate affected services
+docker service update --replicas 0 AFFECTED_SERVICE
+```
+
+#### Step 2: Investigation
+
+```bash
+# 1. Check access logs
+docker service logs bzzz-v2_bzzz-agent --since 1h | grep -i "error\|failed\|unauthorized"
+
+# 2. Review monitoring alerts (Alertmanager v2 API; firing alerts are reported with status.state "active")
+curl -s http://localhost:9093/api/v2/alerts | jq '.[] | select(.status.state=="active")'
+
+# 3. Examine network connections
+netstat -tuln
+ss -tulpn | grep -E ":(9000|3001|3002|3003)"
+```
+
+#### Step 3: Recovery
+
+```bash
+# 1. Update security rules
+./infrastructure/security/security-hardening.sh
+
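+# 2. Rotate secrets if compromised
+# Note: `docker secret rm` fails while any service still references the secret;
+# remove the stack or detach the secret from its services first.
+docker secret rm bzzz_postgres_password
+openssl rand -base64 32 | docker secret create bzzz_postgres_password -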
+
+# 3. Restart services with new secrets
+docker stack deploy -c docker-compose.swarm.yml bzzz-v2
+```
+
+### Data Recovery Procedures
+
+#### Backup Restoration
+
+```bash
+# 1. Stop services
+docker stack rm bzzz-v2
+
+# 2. Restore from backup
+BACKUP_DATE="20241201-120000"
+rsync -av /rust/bzzz-v2/backup/$BACKUP_DATE/ /rust/bzzz-v2/data/
+
+# 3. Restart services
+docker stack deploy -c docker-compose.swarm.yml bzzz-v2
+```
+
+#### Database Recovery
+
+```bash
+# 1. Stop application services
+docker service scale bzzz-v2_bzzz-agent=0
+
+# 2. Create database backup
+docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
+  pg_dump -U bzzz bzzz_v2 > /rust/bzzz-v2/backup/database-$(date +%Y%m%d-%H%M%S).sql
+
+# 3. Restore database
+docker exec -i $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
+  psql -U bzzz -d bzzz_v2 < /rust/bzzz-v2/backup/database-backup.sql
+
+# 4. Restart application services
+docker service scale bzzz-v2_bzzz-agent=3
+```
+
+## Maintenance Procedures
+
+### Routine Maintenance (Weekly)
+
+```bash
+#!/bin/bash
+# Weekly maintenance script
+
+# 1. Check service health
+docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
+docker system df
+
+# 2. Clean up unused resources
+docker system prune -f
+docker volume prune -f
+
+# 3. Backup critical data
+pg_dump -h localhost -U bzzz bzzz_v2 | gzip > \
+  /rust/bzzz-v2/backup/weekly-db-$(date +%Y%m%d).sql.gz
+
+# 4. Rotate logs
+find /rust/bzzz-v2/logs -name "*.log" -mtime +7 -delete
+
+# 5. Check certificate expiration
+openssl x509 -in /rust/bzzz-v2/config/tls/server/walnut.pem -noout -dates
+
+# 6. Update security rules
+fail2ban-client reload
+
+# 7. Generate maintenance report
+echo "Maintenance completed on $(date)" >> /rust/bzzz-v2/logs/maintenance.log
+```
+
+### Scaling Procedures
+
+#### Scale Up
+
+```bash
+# Increase replica count
+docker service scale bzzz-v2_bzzz-agent=5
+docker service scale bzzz-v2_mcp-server=5
+
+# Add new node to cluster (run on new node)
+docker swarm join --token $WORKER_TOKEN $MANAGER_IP:2377
+
+# Label new node
+docker node update --label-add bzzz.role=agent NEW_NODE_HOSTNAME
+```
+
+#### Scale Down
+
+```bash
+# Gracefully reduce replicas
+docker service scale bzzz-v2_bzzz-agent=2
+docker service scale bzzz-v2_mcp-server=2
+
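+# Remove node from cluster
+docker node update --availability drain NODE_HOSTNAME
+# Run `docker swarm leave` on the node itself first; `docker node rm` only succeeds once the node is down
+docker node rm NODE_HOSTNAME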
+```
+
+## Performance Tuning
+
+### Database Optimization
+
+```bash
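+# PostgreSQL tuning
+# Each ALTER SYSTEM is passed as its own -c option: ALTER SYSTEM cannot run inside
+# a transaction block, and a single multi-statement -c string is sent as one transaction.
+docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
+  psql -U bzzz -d bzzz_v2 \
+    -c "ALTER SYSTEM SET shared_buffers = '1GB';" \
+    -c "ALTER SYSTEM SET max_connections = 200;" \
+    -c "ALTER SYSTEM SET checkpoint_timeout = '15min';" \
+    -c "SELECT pg_reload_conf();"
+
+# shared_buffers and max_connections only take effect after a PostgreSQL restart
+docker service update --force bzzz-v2_postgres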
+```
+
+### Storage Optimization
+
+```bash
+# Content store optimization
+find /rust/bzzz-v2/data/blobs -name "*.tmp" -mtime +1 -delete
+find /rust/bzzz-v2/data/blobs -type f -size 0 -delete
+
+# Compress old logs
+find /rust/bzzz-v2/logs -name "*.log" -mtime +3 -exec gzip {} \;
+```
+
+### Network Optimization
+
+```bash
+# Optimize network buffer sizes
+# Written to a drop-in file so that re-running does not append duplicate lines
+# to /etc/sysctl.conf; the file name is an example.
+sudo tee /etc/sysctl.d/99-bzzz-network.conf > /dev/null <<'EOF'
+net.core.rmem_max = 134217728
+net.core.wmem_max = 134217728
+net.ipv4.tcp_rmem = 4096 87380 134217728
+net.ipv4.tcp_wmem = 4096 65536 134217728
+EOF
+sudo sysctl --system
+```
+
+## Contact Information
+
+### On-Call Procedures
+
+- **Primary Contact**: DevOps Team Lead
+- **Secondary Contact**: Senior Site Reliability Engineer  
+- **Escalation**: Platform Engineering Manager
+
+### Communication Channels
+
+- **Slack**: #bzzz-incidents
+- **Email**: devops@deepblack.cloud
+- **Phone**: Emergency On-Call Rotation
+
+### Documentation
+
+- **Runbooks**: This document
+- **Architecture**: `/docs/BZZZ_V2_INFRASTRUCTURE_ARCHITECTURE.md`
+- **API Documentation**: https://bzzz.deepblack.cloud/docs
+- **Monitoring Dashboards**: https://grafana.deepblack.cloud
+
+---
+
+*This runbook should be reviewed and updated monthly. Last updated: 2025-08-07*