
anthonyrawlins 065dddf8d5 Prepare for v2 development: Add MCP integration and future development planning

- Add FUTURE_DEVELOPMENT.md with comprehensive v2 protocol specification
- Add MCP integration design and implementation foundation
- Add infrastructure and deployment configurations
- Update system architecture for v2 evolution

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-08-07 14:38:22 +10:00


BZZZ v2 Deployment Runbook

Overview

This runbook provides step-by-step procedures for deploying, operating, and maintaining BZZZ v2 infrastructure. It covers normal operations, emergency procedures, and troubleshooting guidelines.

Prerequisites

System Requirements

Cluster: 3 nodes (WALNUT, IRONWOOD, ACACIA)
OS: Ubuntu 22.04 LTS or newer
Docker: Version 24+ with Swarm mode enabled
Storage: NFS mount at /rust/ with 500GB+ available
Network: Internal 192.168.1.0/24 with external internet access
Secrets: OpenAI API key and database credentials

Access Requirements

SSH access to all cluster nodes
Docker Swarm manager privileges
Sudo access for system configuration
GitLab access for CI/CD pipeline management

Pre-Deployment Checklist

Infrastructure Verification

# Verify Docker Swarm status
docker node ls
docker network ls | grep tengig

# Check available storage
df -h /rust/

# Verify network connectivity
ping -c 3 192.168.1.27  # WALNUT
ping -c 3 192.168.1.113 # IRONWOOD  
ping -c 3 192.168.1.xxx # ACACIA (substitute the node's actual address for xxx)

# Test registry access
docker pull registry.home.deepblack.cloud/hello-world || echo "Registry access test FAILED"
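The storage requirement (500GB+ available on /rust/) can be checked mechanically rather than read off `df -h`. The following is a minimal sketch; the function name `has_free_gb` is illustrative and not part of the BZZZ tooling.

```bash
# Succeeds when the filesystem holding PATH has at least MIN_GB GiB available.
has_free_gb() {
  path="$1"; min_gb="$2"
  # df -Pk prints POSIX-format output in 1 KiB blocks; column 4 is "Available"
  avail_kb=$(df -Pk "$path" | awk 'NR == 2 { print $4 }')
  if [ -z "$avail_kb" ]; then
    echo "Could not read free space for ${path}" >&2
    return 2
  fi
  avail_gb=$(( avail_kb / 1024 / 1024 ))
  echo "${path}: ${avail_gb} GiB available (need ${min_gb})"
  [ "$avail_gb" -ge "$min_gb" ]
}
```

Example: `has_free_gb /rust/ 500 || echo "Not enough space on /rust/" >&2`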

Security Hardening

# Run security hardening script
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/security
sudo ./security-hardening.sh

# Verify firewall status
sudo ufw status verbose

# Check fail2ban status
sudo fail2ban-client status

Deployment Procedures

1. Initial Deployment (Fresh Install)

Step 1: Prepare Infrastructure

# Create directory structure
mkdir -p /rust/bzzz-v2/{config,data,logs,backup}
mkdir -p /rust/bzzz-v2/data/{blobs,conversations,dht,postgres,redis}
mkdir -p /rust/bzzz-v2/config/{swarm,monitoring,security}

# Set permissions
sudo chown -R tony:tony /rust/bzzz-v2
chmod -R 755 /rust/bzzz-v2

Step 2: Configure Secrets and Configs

cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure

# Create Docker secrets
docker secret create bzzz_postgres_password config/secrets/postgres_password
docker secret create bzzz_openai_api_key ~/chorus/business/secrets/openai-api-key
docker secret create bzzz_grafana_admin_password config/secrets/grafana_admin_password

# Create Docker configs
docker config create bzzz_v2_config config/bzzz-config.yaml
docker config create bzzz_prometheus_config monitoring/configs/prometheus.yml
docker config create bzzz_alertmanager_config monitoring/configs/alertmanager.yml

Step 3: Deploy Core Services

# Deploy main BZZZ v2 stack
docker stack deploy -c docker-compose.swarm.yml bzzz-v2

# Wait for services to start (this may take 5-10 minutes)
watch docker stack ps bzzz-v2
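For unattended deployments, `watch` can be replaced by a polling loop that returns once every service reports its desired replica count. This is a sketch under the assumption that all services are replicated or global, so the `Replicas` column reads `current/desired`; the function names and the default 600 second timeout are illustrative.

```bash
# Reads "current/desired" pairs from stdin, one per line.
# Succeeds only if at least one line was read and every pair matches.
replicas_converged() {
  awk '
    { split($1, r, "/"); seen = 1; if (r[1] != r[2]) bad = 1 }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}

# Polls the stack until all services are converged or the timeout expires.
wait_for_stack() {
  stack="$1"; timeout="${2:-600}"; interval="${3:-10}"; waited=0
  until docker service ls \
          --filter "label=com.docker.stack.namespace=${stack}" \
          --format '{{.Replicas}}' | replicas_converged; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out after ${timeout}s waiting for ${stack}" >&2
      return 1
    fi
    sleep "$interval"
    waited=$(( waited + interval ))
  done
  echo "All services in ${stack} are at their desired replica count"
}
```

Example: `wait_for_stack bzzz-v2 600 10`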

Step 4: Deploy Monitoring Stack

# Deploy monitoring services
docker stack deploy -c monitoring/docker-compose.monitoring.yml bzzz-monitoring

# Verify monitoring services
curl -f http://localhost:9090/-/healthy  # Prometheus
curl -f http://localhost:3000/api/health # Grafana

Step 5: Verify Deployment

# Check all services are running
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2

# Test external endpoints
curl -f https://bzzz.deepblack.cloud/health
curl -f https://mcp.deepblack.cloud/health
curl -f https://resolve.deepblack.cloud/health

# Check P2P mesh connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  curl -s http://localhost:9000/api/v2/peers | jq '.connected_peers | length'

2. Update Deployment (Rolling Update)

Step 1: Pre-Update Checks

# Check current deployment health
docker stack ps bzzz-v2 | grep -v "Shutdown\|Failed"

# Backup current configuration
# Capture the timestamp once so all three commands use the same directory
BACKUP_DIR="/rust/bzzz-v2/backup/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"
docker config ls | grep bzzz_ > "$BACKUP_DIR/configs.txt"
docker secret ls | grep bzzz_ > "$BACKUP_DIR/secrets.txt"

Step 2: Update Images

# Update to new image version
export NEW_IMAGE_TAG="v2.1.0"

# Update Docker Compose file with new image tags
sed -i "s/registry.home.deepblack.cloud\/bzzz:.*$/registry.home.deepblack.cloud\/bzzz:${NEW_IMAGE_TAG}/g" \
  docker-compose.swarm.yml

# Deploy updated stack (rolling update)
docker stack deploy -c docker-compose.swarm.yml bzzz-v2
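The image tag substitution can be wrapped so that a malformed tag is rejected before it reaches `sed`, and so that a backup copy of the compose file is kept for review. A minimal sketch follows; `set_image_tag` is an illustrative helper, and it assumes the image reference appears as `registry.home.deepblack.cloud/bzzz:<tag>` as in the command above.

```bash
# Rewrites the bzzz image tag in FILE, keeping the original as FILE.bak.
set_image_tag() {
  file="$1"; tag="$2"
  # Accept only characters that are valid in a Docker image tag
  case "$tag" in
    ''|[.-]*|*[!A-Za-z0-9_.-]*)
      echo "Refusing invalid image tag: ${tag}" >&2
      return 1 ;;
  esac
  # Replace only the tag itself, stopping at whitespace or a closing quote
  sed -i.bak -E \
    "s|(registry\.home\.deepblack\.cloud/bzzz):[^[:space:]\"']+|\1:${tag}|g" \
    "$file"
}
```

Example: `set_image_tag docker-compose.swarm.yml "$NEW_IMAGE_TAG"`, then `diff -u docker-compose.swarm.yml.bak docker-compose.swarm.yml` to review the change before deploying.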

Step 3: Monitor Update Progress

# Watch rolling update progress
watch "docker service ps bzzz-v2_bzzz-agent | head -20"

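# Check for any failed updates
# (docker service ps has no current-state filter, so filter the output instead)
docker service ps bzzz-v2_bzzz-agent --no-trunc \
  --format '{{.Name}}  {{.CurrentState}}  {{.Error}}' | grep -iE 'failed|rejected'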

3. Migration from v1 to v2

# Use the automated migration script
cd /home/tony/chorus/project-queues/active/BZZZ/infrastructure/migration-scripts

# Dry run first to preview changes
./migrate-v1-to-v2.sh --dry-run

# Execute full migration
./migrate-v1-to-v2.sh

# If rollback is needed
./migrate-v1-to-v2.sh --rollback
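The three invocations above can be chained so that a failed dry run stops the procedure and a failed migration triggers the rollback. This sketch uses only the flags shown in this section (`--dry-run`, `--rollback`); the wrapper name `run_migration` is illustrative, and it assumes the migration script signals failure through a non-zero exit status.

```bash
# Runs SCRIPT as: dry run, then real migration, then rollback on failure.
run_migration() {
  script="$1"
  if ! "$script" --dry-run; then
    echo "Dry run failed; aborting before the real migration" >&2
    return 1
  fi
  if "$script"; then
    echo "Migration finished"
  else
    echo "Migration failed; rolling back" >&2
    "$script" --rollback
    return 1
  fi
}
```

Example: `run_migration ./migrate-v1-to-v2.sh`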

Monitoring and Health Checks

Health Check Commands

# Service health checks
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
docker service ps bzzz-v2_bzzz-agent --filter desired-state=running

# Application health checks
curl -f https://bzzz.deepblack.cloud/health
curl -f https://mcp.deepblack.cloud/health
curl -f https://resolve.deepblack.cloud/health
curl -f https://openai.deepblack.cloud/health

# P2P network health
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  curl -s http://localhost:9000/api/v2/dht/stats | jq '.'

# Database connectivity
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  pg_isready -U bzzz -d bzzz_v2

# Cache connectivity  
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_redis) \
  redis-cli ping
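The application health checks can be run as one pass that reports every endpoint and fails if any of them is down, which is convenient for cron jobs or CI. A minimal sketch; `check_endpoints` is an illustrative name, and the 10 second timeout and 2 retries are assumed values.

```bash
# Probes each URL and prints OK/FAIL per endpoint.
# Returns non-zero if any endpoint failed.
check_endpoints() {
  failed=0
  for url in "$@"; do
    if curl -fsS --max-time 10 --retry 2 -o /dev/null "$url"; then
      echo "OK   ${url}"
    else
      echo "FAIL ${url}"
      failed=$(( failed + 1 ))
    fi
  done
  [ "$failed" -eq 0 ]
}

# Usage with the endpoints listed in this runbook:
# check_endpoints \
#   https://bzzz.deepblack.cloud/health \
#   https://mcp.deepblack.cloud/health \
#   https://resolve.deepblack.cloud/health \
#   https://openai.deepblack.cloud/health
```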

Performance Monitoring

# Check resource usage
docker stats --no-stream

# Monitor disk usage
df -h /rust/bzzz-v2/data/

# Check network connections
netstat -tuln | grep -E ":(9000|3001|3002|3003|9101|9102|9103)"

# Monitor OpenAI API usage
curl -s http://localhost:9203/metrics | grep openai_cost

Troubleshooting Guide

Common Issues and Solutions

1. Service Won't Start

Symptoms: Service stuck in preparing or constantly restarting

Diagnosis:

# Check service logs
docker service logs bzzz-v2_bzzz-agent --tail 50

# Check node resources
docker node ls
docker system df

# Verify secrets and configs
docker secret ls | grep bzzz_
docker config ls | grep bzzz_

Solutions:

Check resource constraints and availability
Verify secrets and configs are accessible
Ensure image is available and correct
Check node labels and placement constraints
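A restart loop shows up as many Failed or Shutdown tasks next to few Running ones. The following sketch counts tasks by the first word of their current state; `task_state_summary` is an illustrative helper that only post-processes `docker service ps` output.

```bash
# Reads task states such as "Running 5 minutes ago" from stdin
# and prints a count per state, sorted by state name.
task_state_summary() {
  awk 'NF { count[$1]++ } END { for (s in count) print s, count[s] }' | sort
}

# Usage:
# docker service ps bzzz-v2_bzzz-agent --format '{{.CurrentState}}' | task_state_summary
```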

2. P2P Network Issues

Symptoms: Agents not discovering each other, DHT lookups failing

Diagnosis:

# Check peer connections
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  curl -s http://localhost:9000/api/v2/peers

# Check DHT bootstrap nodes
curl http://localhost:9101/health
curl http://localhost:9102/health  
curl http://localhost:9103/health

# Check network connectivity
docker network inspect bzzz-internal

Solutions:

Restart DHT bootstrap services
Check firewall rules for P2P ports
Verify Docker Swarm overlay network
Check for port conflicts

3. High OpenAI Costs

Symptoms: Cost alerts triggering, rate limits being hit

Diagnosis:

# Check current usage
curl -s http://localhost:9203/metrics | grep -E "openai_(cost|requests|tokens)"

# Check rate limiting
docker service logs bzzz-v2_openai-proxy --tail 100 | grep "rate limit"

Solutions:

Adjust rate limiting parameters
Review conversation patterns for excessive API calls
Implement request caching
Consider model selection optimization

4. Database Connection Issues

Symptoms: Service errors related to database connectivity

Diagnosis:

# Check PostgreSQL status
docker service logs bzzz-v2_postgres --tail 50

# Test connection from agent
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_bzzz-agent | head -1) \
  pg_isready -h postgres -U bzzz

# Check connection limits
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  psql -U bzzz -d bzzz_v2 -c "SELECT count(*) FROM pg_stat_activity;"

Solutions:

Restart PostgreSQL service
Check connection pool settings
Increase max_connections if needed
Review long-running queries
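Whether `max_connections` needs raising depends on how close current usage is to the limit. This sketch compares the two numbers; `connection_pressure` and the 80% warning threshold are illustrative choices, and the commented usage assumes the same container lookup used elsewhere in this runbook.

```bash
# Prints connection usage and fails when it reaches WARN_PCT (default 80).
connection_pressure() {
  used="$1"; max="$2"; warn_pct="${3:-80}"
  pct=$(( used * 100 / max ))
  echo "${used}/${max} connections in use (${pct}%)"
  [ "$pct" -lt "$warn_pct" ]
}

# Usage: -A and -t make psql print the bare value only
# pg() {
#   docker exec "$(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres)" \
#     psql -U bzzz -d bzzz_v2 -Atc "$1"
# }
# connection_pressure "$(pg 'SELECT count(*) FROM pg_stat_activity')" "$(pg 'SHOW max_connections')"
```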

5. Storage Issues

Symptoms: Disk full alerts, content store errors

Diagnosis:

# Check disk usage
df -h /rust/bzzz-v2/data/
du -sh /rust/bzzz-v2/data/blobs/

# Check content store health
curl -s http://localhost:9202/metrics | grep content_store

Solutions:

Run garbage collection on old blobs
Clean up old conversation threads
Increase storage capacity
Adjust retention policies

Emergency Procedures

Service Outage Response

Priority 1: Complete Service Outage

# 1. Check cluster status
docker node ls
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2

# 2. Emergency restart of critical services
docker service update --force bzzz-v2_bzzz-agent
docker service update --force bzzz-v2_postgres
docker service update --force bzzz-v2_redis

# 3. If stack is corrupted, redeploy
docker stack rm bzzz-v2
sleep 60
docker stack deploy -c docker-compose.swarm.yml bzzz-v2

# 4. Monitor recovery
watch docker stack ps bzzz-v2
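The fixed `sleep 60` in step 3 can be replaced by waiting until the stack's services and stack-created networks are actually gone, since redeploying while a network is still being removed can fail. A sketch, assuming the resources carry the standard `com.docker.stack.namespace` label; the function name and 120 second default are illustrative.

```bash
# Waits until no services or networks labelled for STACK remain.
wait_for_stack_removal() {
  stack="$1"; timeout="${2:-120}"; waited=0
  filter="label=com.docker.stack.namespace=${stack}"
  while [ -n "$(docker service ls -q --filter "$filter")" ] ||
        [ -n "$(docker network ls -q --filter "$filter")" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Stack ${stack} still has resources after ${timeout}s" >&2
      return 1
    fi
    sleep 2
    waited=$(( waited + 2 ))
  done
}

# Usage:
# docker stack rm bzzz-v2 && wait_for_stack_removal bzzz-v2 && \
#   docker stack deploy -c docker-compose.swarm.yml bzzz-v2
```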

Priority 2: Partial Service Degradation

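# 1. Identify problematic services
# (docker service ps has no current-state filter, so filter the output instead)
docker service ps bzzz-v2_bzzz-agent --no-trunc \
  --format '{{.Name}}  {{.CurrentState}}  {{.Error}}' | grep -iE 'failed|rejected'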

# 2. Scale up healthy replicas
docker service update --replicas 3 bzzz-v2_bzzz-agent

# 3. Remove unhealthy tasks
docker service update --force bzzz-v2_bzzz-agent

Security Incident Response

Step 1: Immediate Containment

# 1. Block suspicious IPs
sudo ufw insert 1 deny from SUSPICIOUS_IP

# 2. Check for compromise indicators
sudo fail2ban-client status
sudo tail -100 /var/log/audit/audit.log | grep -i "denied\|failed\|error"

# 3. Isolate affected services
docker service update --replicas 0 AFFECTED_SERVICE

Step 2: Investigation

# 1. Check access logs
docker service logs bzzz-v2_bzzz-agent --since 1h | grep -i "error\|failed\|unauthorized"

# 2. Review monitoring alerts (Alertmanager v2 API; "active" alerts are the ones currently firing)
curl -s http://localhost:9093/api/v2/alerts | jq '.[] | select(.status.state=="active")'

# 3. Examine network connections
netstat -tuln
ss -tulpn | grep -E ":(9000|3001|3002|3003)"

Step 3: Recovery

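# 1. Update security rules
sudo ./infrastructure/security/security-hardening.sh

# 2. Rotate secrets if compromised
#    Docker refuses to remove a secret that a service still references,
#    so take the stack down first (this causes downtime).
docker stack rm bzzz-v2
sleep 60
docker secret rm bzzz_postgres_password
openssl rand -base64 32 | docker secret create bzzz_postgres_password -
#    An already-initialized database keeps its old password until the role
#    is also updated inside PostgreSQL (ALTER ROLE ... PASSWORD).

# 3. Restart services with new secrets
docker stack deploy -c docker-compose.swarm.yml bzzz-v2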

Data Recovery Procedures

Backup Restoration

# 1. Stop services
docker stack rm bzzz-v2

# 2. Restore from backup
BACKUP_DATE="20241201-120000"
rsync -av /rust/bzzz-v2/backup/$BACKUP_DATE/ /rust/bzzz-v2/data/

# 3. Restart services
docker stack deploy -c docker-compose.swarm.yml bzzz-v2

Database Recovery

# 1. Stop application services
docker service scale bzzz-v2_bzzz-agent=0

# 2. Create database backup
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  pg_dump -U bzzz bzzz_v2 > /rust/bzzz-v2/backup/database-$(date +%Y%m%d-%H%M%S).sql

# 3. Restore database
docker exec -i $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  psql -U bzzz -d bzzz_v2 < /rust/bzzz-v2/backup/database-backup.sql

# 4. Restart application services
docker service scale bzzz-v2_bzzz-agent=3
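Before restoring in step 3, it is worth confirming that the dump file is not empty or truncated. Plain-format `pg_dump` output ends with a "PostgreSQL database dump complete" comment, which the sketch below looks for; `dump_looks_complete` is an illustrative helper and does not validate the SQL itself.

```bash
# Succeeds if FILE is non-empty and ends with pg_dump's completion marker.
dump_looks_complete() {
  file="$1"
  if [ ! -s "$file" ]; then
    echo "${file} is missing or empty" >&2
    return 1
  fi
  if ! tail -n 10 "$file" | grep -q 'PostgreSQL database dump complete'; then
    echo "${file} has no completion marker; the dump may be truncated" >&2
    return 1
  fi
}
```

Example: `dump_looks_complete /rust/bzzz-v2/backup/database-backup.sql || exit 1`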

Maintenance Procedures

Routine Maintenance (Weekly)

#!/bin/bash
# Weekly maintenance script

# 1. Check service health
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2
docker system df

# 2. Clean up unused resources
docker system prune -f
docker volume prune -f

# 3. Backup critical data
pg_dump -h localhost -U bzzz bzzz_v2 | gzip > \
  /rust/bzzz-v2/backup/weekly-db-$(date +%Y%m%d).sql.gz

# 4. Rotate logs
find /rust/bzzz-v2/logs -name "*.log" -mtime +7 -delete

# 5. Check certificate expiration
openssl x509 -in /rust/bzzz-v2/config/tls/server/walnut.pem -noout -dates

# 6. Update security rules
fail2ban-client reload

# 7. Generate maintenance report
echo "Maintenance completed on $(date)" >> /rust/bzzz-v2/logs/maintenance.log

Scaling Procedures

Scale Up

# Increase replica count
docker service scale bzzz-v2_bzzz-agent=5
docker service scale bzzz-v2_mcp-server=5

# Add new node to cluster (run on new node)
docker swarm join --token $WORKER_TOKEN $MANAGER_IP:2377

# Label new node
docker node update --label-add bzzz.role=agent NEW_NODE_HOSTNAME

Scale Down

# Gracefully reduce replicas
docker service scale bzzz-v2_bzzz-agent=2
docker service scale bzzz-v2_mcp-server=2

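# Remove node from cluster
docker node update --availability drain NODE_HOSTNAME
# Wait for tasks to move off the node, then run on the node being removed:
docker swarm leave
# Back on a manager, once the node shows as Down in `docker node ls`:
docker node rm NODE_HOSTNAME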

Performance Tuning

Database Optimization

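# PostgreSQL tuning
# Each statement gets its own -c because ALTER SYSTEM cannot run inside the
# implicit transaction that psql wraps around a multi-statement -c string.
docker exec $(docker ps -q -f label=com.docker.swarm.service.name=bzzz-v2_postgres) \
  psql -U bzzz -d bzzz_v2 \
    -c "ALTER SYSTEM SET shared_buffers = '1GB';" \
    -c "ALTER SYSTEM SET max_connections = 200;" \
    -c "ALTER SYSTEM SET checkpoint_timeout = '15min';" \
    -c "SELECT pg_reload_conf();"

# shared_buffers and max_connections only take effect after a server restart;
# the reload above applies checkpoint_timeout only.
docker service update --force bzzz-v2_postgres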

Storage Optimization

# Content store optimization
find /rust/bzzz-v2/data/blobs -name "*.tmp" -mtime +1 -delete
find /rust/bzzz-v2/data/blobs -type f -size 0 -delete

# Compress old logs
find /rust/bzzz-v2/logs -name "*.log" -mtime +3 -exec gzip {} \;

Network Optimization

# Optimize network buffer sizes
echo 'net.core.rmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.tcp_rmem = 4096 87380 134217728' | sudo tee -a /etc/sysctl.conf
echo 'net.ipv4.tcp_wmem = 4096 65536 134217728' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
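Because `tee -a` appends, re-running the commands above leaves duplicate entries in /etc/sysctl.conf. A sketch of an idempotent alternative follows: it rewrites a single key in a dedicated file. The helper name `set_sysctl` and the drop-in filename in the usage comment are assumptions.

```bash
# Sets KEY = VALUE in FILE, replacing any earlier line for the same key.
set_sysctl() {
  file="$1"; key="$2"; value="$3"
  # Escape dots so the key is matched literally
  key_re=$(printf '%s' "$key" | sed 's/\./\\./g')
  tmp=$(mktemp)
  if [ -f "$file" ]; then
    # Keep every line except a previous setting of this key
    grep -v -E "^[[:space:]]*${key_re}[[:space:]]*=" "$file" > "$tmp" || true
  fi
  printf '%s = %s\n' "$key" "$value" >> "$tmp"
  cat "$tmp" > "$file"
  rm -f "$tmp"
}

# Usage (run as root; the filename is hypothetical):
# set_sysctl /etc/sysctl.d/99-bzzz-network.conf net.core.rmem_max 134217728
# set_sysctl /etc/sysctl.d/99-bzzz-network.conf net.core.wmem_max 134217728
# sysctl --system
```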

Contact Information

On-Call Procedures

Primary Contact: DevOps Team Lead
Secondary Contact: Senior Site Reliability Engineer
Escalation: Platform Engineering Manager

Communication Channels

Slack: #bzzz-incidents
Email: devops@deepblack.cloud
Phone: Emergency On-Call Rotation

Documentation

Runbooks: This document
Architecture: /docs/BZZZ_V2_INFRASTRUCTURE_ARCHITECTURE.md
API Documentation: https://bzzz.deepblack.cloud/docs
Monitoring Dashboards: https://grafana.deepblack.cloud

This runbook should be reviewed and updated monthly. Last updated: 2025-08-07 (date of the most recent commit).
