# BZZZ v2 Infrastructure Architecture & Deployment Strategy

## Executive Summary

This document outlines the comprehensive infrastructure architecture and deployment strategy for the BZZZ v2 evolution. The design maintains the reliability of the existing 3-node cluster while enabling advanced protocol features including content-addressed storage, DHT networking, OpenAI integration, and MCP server capabilities.

## Current Infrastructure Analysis

### Existing v1 Deployment

- **Cluster**: WALNUT (192.168.1.27), IRONWOOD (192.168.1.113), ACACIA (192.168.1.xxx)
- **Deployment**: SystemD services with P2P mesh networking
- **Protocol**: libp2p with mDNS discovery and pubsub messaging
- **Storage**: File-based configuration and in-memory state
- **Integration**: Basic WHOOSH API connectivity and task coordination

### Infrastructure Dependencies

- **Docker Swarm**: Existing cluster with `tengig` network
- **Traefik**: Load balancing and SSL termination
- **Private Registry**: registry.home.deepblack.cloud
- **GitLab CI/CD**: gitlab.deepblack.cloud
- **Secrets**: ~/chorus/business/secrets/ management
- **Storage**: NFS mounts on /rust/ for shared data

## BZZZ v2 Architecture Design

### 1. Protocol Evolution Architecture

```
┌─────────────────────── BZZZ v2 Protocol Stack ────────────────────────┐
│                                                                        │
│  ┌─────────────────┐   ┌─────────────────┐   ┌─────────────────────┐  │
│  │   MCP Server    │   │  OpenAI Proxy   │   │  bzzz:// Resolver   │  │
│  │   (Port 3001)   │   │   (Port 3002)   │   │     (Port 3003)     │  │
│  └─────────────────┘   └─────────────────┘   └─────────────────────┘  │
│           │                     │                       │             │
│  ┌──────────────────────────── Content Layer ─────────────────────┐   │
│  │  ┌──────────────┐  ┌───────────────┐  ┌──────────────────────┐ │   │
│  │  │ Conversation │  │ Content Store │  │    BLAKE3 Hasher     │ │   │
│  │  │  Threading   │  │  (CAS Blobs)  │  │ (Content Addressing) │ │   │
│  │  └──────────────┘  └───────────────┘  └──────────────────────┘ │   │
│  └─────────────────────────────────────────────────────────────────┘  │
│           │                                                            │
│  ┌──────────────────────────── P2P Layer ─────────────────────────┐   │
│  │  ┌──────────────┐  ┌───────────────┐  ┌──────────────────────┐ │   │
│  │  │  libp2p DHT  │  │ Content Route │  │ Stream Multiplexing  │ │   │
│  │  │ (Discovery)  │  │   (Routing)   │  │    (Yamux/mplex)     │ │   │
│  │  └──────────────┘  └───────────────┘  └──────────────────────┘ │   │
│  └─────────────────────────────────────────────────────────────────┘  │
│                                                                        │
└────────────────────────────────────────────────────────────────────────┘
```
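The bzzz:// resolver in this stack turns a semantic address into a DHT lookup key. The address grammar is not specified in this document, so the Go sketch below assumes a simple `bzzz://<content-hash>/<path>` form purely for illustration; `parseBzzzAddress` is a hypothetical helper, not part of the v2 codebase.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

// parseBzzzAddress splits a bzzz:// URI into the content hash used as the DHT
// lookup key and an optional sub-path inside the addressed content.
// Hypothetical helper; the real bzzz:// grammar may differ.
func parseBzzzAddress(addr string) (hash, subPath string, err error) {
	u, err := url.Parse(addr)
	if err != nil {
		return "", "", err
	}
	if u.Scheme != "bzzz" {
		return "", "", fmt.Errorf("not a bzzz:// address: %q", addr)
	}
	return u.Host, strings.TrimPrefix(u.Path, "/"), nil
}

func main() {
	hash, subPath, err := parseBzzzAddress("bzzz://9f86d081884c7d65/threads/42")
	if err != nil {
		panic(err)
	}
	fmt.Printf("DHT key: %s  sub-path: %s\n", hash, subPath)
}
```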
### 2. Content-Addressed Storage (CAS) Architecture

```
┌────────────────── Content-Addressed Storage System ───────────────────┐
│                                                                        │
│  ┌────────────────────── Node Distribution ──────────────────────┐    │
│  │                                                                │    │
│  │      WALNUT              IRONWOOD              ACACIA          │    │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐       │    │
│  │  │   Primary   │────▶│  Secondary  │────▶│  Tertiary   │       │    │
│  │  │  Blob Store │     │   Replica   │     │   Replica   │       │    │
│  │  └─────────────┘     └─────────────┘     └─────────────┘       │    │
│  │         │                   │                   │              │    │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐       │    │
│  │  │BLAKE3 Index │     │BLAKE3 Index │     │BLAKE3 Index │       │    │
│  │  │  (Primary)  │     │ (Secondary) │     │ (Tertiary)  │       │    │
│  │  └─────────────┘     └─────────────┘     └─────────────┘       │    │
│  └────────────────────────────────────────────────────────────────┘    │
│                                                                        │
│  ┌─────────────────────── Storage Layout ────────────────────────┐    │
│  │  /rust/bzzz-v2/blobs/                                          │    │
│  │  ├── data/                # Raw blob storage                   │    │
│  │  │   ├── bl/              # BLAKE3 prefix sharding             │    │
│  │  │   │   └── 3k/          # Further sharding                   │    │
│  │  │   └── conversations/   # Conversation threads               │    │
│  │  ├── index/               # BLAKE3 hash indices                │    │
│  │  │   ├── primary.db       # Primary hash->location mapping     │    │
│  │  │   └── replication.db   # Replication metadata               │    │
│  │  └── temp/                # Temporary staging area             │    │
│  └────────────────────────────────────────────────────────────────┘    │
└────────────────────────────────────────────────────────────────────────┘
```

### 3. DHT and Network Architecture

```
┌────────────────────── DHT Network Topology ───────────────────────────┐
│                                                                        │
│  ┌────────────────── Bootstrap & Discovery ──────────────────────┐    │
│  │                                                                │    │
│  │   ┌─────────────┐     ┌─────────────┐     ┌─────────────┐      │    │
│  │   │   WALNUT    │────▶│  IRONWOOD   │────▶│   ACACIA    │      │    │
│  │   │(Bootstrap 1)│◀────│(Bootstrap 2)│◀────│(Bootstrap 3)│      │    │
│  │   └─────────────┘     └─────────────┘     └─────────────┘      │    │
│  │                                                                │    │
│  │   DHT Responsibilities:                                        │    │
│  │   • WALNUT:   Content Routing + Agent Discovery                │    │
│  │   • IRONWOOD: Conversation Threading + OpenAI Coordination     │    │
│  │   • ACACIA:   MCP Services + External Integration              │    │
│  └────────────────────────────────────────────────────────────────┘    │
│                                                                        │
│  ┌────────────────── Network Protocols ──────────────────────────┐    │
│  │                                                                │    │
│  │   Protocol support:                                            │    │
│  │   • bzzz:// semantic addressing (DHT resolution)               │    │
│  │   • Content routing via DHT (BLAKE3 hash lookup)               │    │
│  │   • Agent discovery and capability broadcasting                │    │
│  │   • Stream multiplexing for concurrent conversations           │    │
│  │   • NAT traversal and hole punching                            │    │
│  │                                                                │    │
│  │   Port allocation:                                             │    │
│  │   • P2P listen:       9000-9100 (configurable range)           │    │
│  │   • DHT bootstrap:    9101-9103 (per node)                     │    │
│  │   • Content routing:  9200-9300 (dynamic allocation)           │    │
│  │   • mDNS discovery:   5353 (standard multicast DNS)            │    │
│  └────────────────────────────────────────────────────────────────┘    │
└────────────────────────────────────────────────────────────────────────┘
```
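A minimal sketch of how a node might join this DHT with go-libp2p, using the bootstrap ports listed above. The import paths and options reflect current go-libp2p / kad-dht APIs rather than anything prescribed by BZZZ, and the peer IDs are placeholders.

```go
package main

import (
	"context"
	"log"

	"github.com/libp2p/go-libp2p"
	dht "github.com/libp2p/go-libp2p-kad-dht"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/multiformats/go-multiaddr"
)

func main() {
	ctx := context.Background()

	// Listen at the bottom of the configurable P2P range (9000-9100).
	host, err := libp2p.New(libp2p.ListenAddrStrings("/ip4/0.0.0.0/tcp/9000"))
	if err != nil {
		log.Fatal(err)
	}

	// Bootstrap addresses for WALNUT and IRONWOOD on their per-node DHT ports.
	// <..._PEER_ID> are placeholders that must be replaced with real peer IDs.
	var bootstrap []peer.AddrInfo
	for _, s := range []string{
		"/ip4/192.168.1.27/tcp/9101/p2p/<WALNUT_PEER_ID>",
		"/ip4/192.168.1.113/tcp/9102/p2p/<IRONWOOD_PEER_ID>",
	} {
		ma, err := multiaddr.NewMultiaddr(s)
		if err != nil {
			continue // placeholder IDs do not parse; skip until filled in
		}
		if info, err := peer.AddrInfoFromP2pAddr(ma); err == nil {
			bootstrap = append(bootstrap, *info)
		}
	}

	// Run in server mode so this node also stores and serves routing records.
	kad, err := dht.New(ctx, host, dht.Mode(dht.ModeServer), dht.BootstrapPeers(bootstrap...))
	if err != nil {
		log.Fatal(err)
	}
	if err := kad.Bootstrap(ctx); err != nil {
		log.Fatal(err)
	}
	log.Printf("DHT node %s ready", host.ID())
}
```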
### 4. Service Architecture

```
┌─────────────────────── BZZZ v2 Service Stack ─────────────────────────┐
│                                                                        │
│  ┌────────────────────── External Layer ─────────────────────────┐    │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐       │    │
│  │  │   Traefik   │────▶│   OpenAI    │────▶│     MCP     │       │    │
│  │  │Load Balancer│     │   Gateway   │     │   Clients   │       │    │
│  │  │ (SSL Term)  │     │(Rate Limit) │     │ (External)  │       │    │
│  │  └─────────────┘     └─────────────┘     └─────────────┘       │    │
│  └────────────────────────────────────────────────────────────────┘    │
│                                                                        │
│  ┌────────────────────── Application Layer ──────────────────────┐    │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐       │    │
│  │  │ BZZZ Agent  │────▶│ Conversation│────▶│   Content   │       │    │
│  │  │   Manager   │     │  Threading  │     │  Resolver   │       │    │
│  │  └─────────────┘     └─────────────┘     └─────────────┘       │    │
│  │         │                   │                   │              │    │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐       │    │
│  │  │     MCP     │     │   OpenAI    │     │     DHT     │       │    │
│  │  │   Server    │     │   Client    │     │   Manager   │       │    │
│  │  └─────────────┘     └─────────────┘     └─────────────┘       │    │
│  └────────────────────────────────────────────────────────────────┘    │
│                                                                        │
│  ┌────────────────────── Storage Layer ──────────────────────────┐    │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐       │    │
│  │  │     CAS     │────▶│ PostgreSQL  │────▶│    Redis    │       │    │
│  │  │  Blob Store │     │ (Metadata)  │     │   (Cache)   │       │    │
│  │  └─────────────┘     └─────────────┘     └─────────────┘       │    │
│  └────────────────────────────────────────────────────────────────┘    │
└────────────────────────────────────────────────────────────────────────┘
```

## Migration Strategy

### Phase 1: Parallel Deployment (Weeks 1-2)

#### 1.1 Infrastructure Preparation

```bash
# Create v2 directory structure
/rust/bzzz-v2/
├── config/
│   ├── swarm/
│   ├── systemd/
│   └── secrets/
├── data/
│   ├── blobs/
│   ├── conversations/
│   └── dht/
└── logs/
    ├── application/
    ├── p2p/
    └── monitoring/
```

#### 1.2 Service Deployment Strategy

- Deploy v2 services on non-standard ports (9000+ range)
- Maintain v1 SystemD services during transition
- Use Docker Swarm stack for v2 components
- Implement health checks and readiness probes

#### 1.3 Database Migration

- Create new PostgreSQL schema for v2 metadata
- Implement data migration scripts for conversation history
- Set up Redis cluster for DHT caching
- Configure backup and recovery procedures

### Phase 2: Feature Migration (Weeks 3-4)

#### 2.1 Content Store Migration

```bash
# Migration workflow
1. Export v1 conversation logs from Hypercore
2. Convert to BLAKE3-addressed blobs
3. Populate content store with historical data
4. Verify data integrity and accessibility
5. Update references in conversation threads
```
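Step 2 of this workflow converts exported logs into BLAKE3-addressed blobs. A minimal Go sketch, assuming `lukechampine.com/blake3` for hashing and a two-level hex prefix sharding scheme (the `bl/3k/` directories in the storage layout are interpreted here as 2+2 hex-character shards, which is an assumption); `putBlob` is a hypothetical helper.

```go
package main

import (
	"encoding/hex"
	"fmt"
	"os"
	"path/filepath"

	"lukechampine.com/blake3"
)

// putBlob stores data under its BLAKE3 hash with two-level hex prefix
// sharding. Hypothetical helper; the real sharding width may differ.
func putBlob(root string, data []byte) (string, error) {
	sum := blake3.Sum256(data)
	name := hex.EncodeToString(sum[:])

	// Stage in the temp/ area first, then rename into the sharded data/ tree
	// so readers never observe partially written blobs.
	tmp := filepath.Join(root, "temp", name)
	dir := filepath.Join(root, "data", name[:2], name[2:4])
	for _, d := range []string{filepath.Dir(tmp), dir} {
		if err := os.MkdirAll(d, 0o755); err != nil {
			return "", err
		}
	}
	if err := os.WriteFile(tmp, data, 0o644); err != nil {
		return "", err
	}
	return name, os.Rename(tmp, filepath.Join(dir, name))
}

func main() {
	hash, err := putBlob("/rust/bzzz-v2/blobs", []byte("exported conversation thread"))
	if err != nil {
		panic(err)
	}
	fmt.Println("stored blob:", hash)
}
```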
#### 2.2 P2P Protocol Upgrade

- Implement dual-protocol support (v1 + v2)
- Migrate peer discovery from mDNS to DHT
- Update message formats and routing
- Maintain backward compatibility during transition

### Phase 3: Service Cutover (Weeks 5-6)

#### 3.1 Traffic Migration

- Implement feature flags for v2 protocol
- Gradual migration of agents to v2 endpoints
- Monitor performance and error rates
- Implement automatic rollback triggers

#### 3.2 Monitoring and Validation

- Deploy comprehensive monitoring stack
- Validate all v2 protocol operations
- Performance benchmarking against v1
- Load testing with conversation threading

### Phase 4: Production Deployment (Weeks 7-8)

#### 4.1 Full Cutover

- Disable v1 protocol endpoints
- Remove v1 SystemD services
- Update all client configurations
- Archive v1 data and configurations

#### 4.2 Optimization and Tuning

- Performance optimization based on production load
- Resource allocation tuning
- Security hardening and audit
- Documentation and training completion

## Container Orchestration

### Docker Swarm Stack Configuration

```yaml
# docker-compose.swarm.yml
version: '3.8'

services:
  bzzz-agent:
    image: registry.home.deepblack.cloud/bzzz:v2.0.0
    networks:
      - tengig
      - bzzz-internal
    ports:
      - "9000-9100:9000-9100"
    volumes:
      - /rust/bzzz-v2/data:/app/data
      - /rust/bzzz-v2/config:/app/config
    environment:
      - BZZZ_VERSION=2.0.0
      - BZZZ_PROTOCOL=bzzz://
      - DHT_BOOTSTRAP_NODES=walnut:9101,ironwood:9102,acacia:9103
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == walnut
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.bzzz-agent.rule=Host(`bzzz.deepblack.cloud`)"
        - "traefik.http.services.bzzz-agent.loadbalancer.server.port=9000"

  mcp-server:
    image: registry.home.deepblack.cloud/bzzz-mcp:v2.0.0
    networks:
      - tengig
    ports:
      - "3001:3001"
    environment:
      - MCP_VERSION=1.0.0
      - BZZZ_ENDPOINT=http://bzzz-agent:9000
    deploy:
      replicas: 3
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.mcp-server.rule=Host(`mcp.deepblack.cloud`)"

  openai-proxy:
    image: registry.home.deepblack.cloud/bzzz-openai-proxy:v2.0.0
    networks:
      - tengig
      - bzzz-internal
    ports:
      - "3002:3002"
    environment:
      - OPENAI_API_KEY_FILE=/run/secrets/openai_api_key
      - RATE_LIMIT_RPM=1000
      - COST_TRACKING_ENABLED=true
    secrets:
      - openai_api_key
    deploy:
      replicas: 2

  content-resolver:
    image: registry.home.deepblack.cloud/bzzz-resolver:v2.0.0
    networks:
      - bzzz-internal
    ports:
      - "3003:3003"
    volumes:
      - /rust/bzzz-v2/data/blobs:/app/blobs:ro
    deploy:
      replicas: 3

  postgres:
    image: postgres:15-alpine
    networks:
      - bzzz-internal
    environment:
      - POSTGRES_DB=bzzz_v2
      - POSTGRES_USER_FILE=/run/secrets/postgres_user
      - POSTGRES_PASSWORD_FILE=/run/secrets/postgres_password
    volumes:
      - /rust/bzzz-v2/data/postgres:/var/lib/postgresql/data
    secrets:
      - postgres_user
      - postgres_password
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == walnut

  redis:
    image: redis:7-alpine
    networks:
      - bzzz-internal
    volumes:
      - /rust/bzzz-v2/data/redis:/data
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == ironwood

networks:
  tengig:
    external: true
  bzzz-internal:
    driver: overlay
    internal: true

secrets:
  openai_api_key:
    external: true
  postgres_user:
    external: true
  postgres_password:
    external: true
```
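Several services in the stack above receive credentials through `*_FILE` environment variables pointing at Docker Swarm secrets mounted under /run/secrets. A minimal Go sketch of how a service might resolve them; `secretFromEnv` is a hypothetical helper, not part of the BZZZ codebase.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// secretFromEnv resolves settings like OPENAI_API_KEY / OPENAI_API_KEY_FILE:
// if the _FILE variant is set, read the Swarm secret from the referenced
// path; otherwise fall back to a plain environment variable.
func secretFromEnv(name string) (string, error) {
	if path := os.Getenv(name + "_FILE"); path != "" {
		b, err := os.ReadFile(path)
		if err != nil {
			return "", err
		}
		return strings.TrimSpace(string(b)), nil
	}
	if v := os.Getenv(name); v != "" {
		return v, nil
	}
	return "", fmt.Errorf("%s is not configured", name)
}

func main() {
	key, err := secretFromEnv("OPENAI_API_KEY")
	if err != nil {
		panic(err)
	}
	fmt.Printf("loaded OpenAI API key (%d bytes)\n", len(key))
}
```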
## CI/CD Pipeline Configuration

### GitLab CI Pipeline

```yaml
# .gitlab-ci.yml
stages:
  - build
  - test
  - deploy-staging
  - deploy-production

variables:
  REGISTRY: registry.home.deepblack.cloud
  IMAGE_TAG: ${CI_COMMIT_SHORT_SHA}

build:
  stage: build
  script:
    - docker build -t ${REGISTRY}/bzzz:${IMAGE_TAG} .
    - docker build -t ${REGISTRY}/bzzz-mcp:${IMAGE_TAG} -f Dockerfile.mcp .
    - docker build -t ${REGISTRY}/bzzz-openai-proxy:${IMAGE_TAG} -f Dockerfile.proxy .
    - docker build -t ${REGISTRY}/bzzz-resolver:${IMAGE_TAG} -f Dockerfile.resolver .
    - docker push ${REGISTRY}/bzzz:${IMAGE_TAG}
    - docker push ${REGISTRY}/bzzz-mcp:${IMAGE_TAG}
    - docker push ${REGISTRY}/bzzz-openai-proxy:${IMAGE_TAG}
    - docker push ${REGISTRY}/bzzz-resolver:${IMAGE_TAG}
  only:
    - main
    - develop

test-protocol:
  stage: test
  script:
    - go test ./...
    - docker run --rm ${REGISTRY}/bzzz:${IMAGE_TAG} /app/test-suite
  dependencies:
    - build

test-integration:
  stage: test
  script:
    - docker-compose -f docker-compose.test.yml up -d
    - ./scripts/integration-tests.sh
    - docker-compose -f docker-compose.test.yml down
  dependencies:
    - build

deploy-staging:
  stage: deploy-staging
  script:
    - docker stack deploy -c docker-compose.staging.yml bzzz-v2-staging
  environment:
    name: staging
  only:
    - develop

deploy-production:
  stage: deploy-production
  script:
    - docker stack deploy -c docker-compose.swarm.yml bzzz-v2
  environment:
    name: production
  only:
    - main
  when: manual
```

## Monitoring and Operations

### Monitoring Stack

```yaml
# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    networks:
      - monitoring
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - /rust/bzzz-v2/data/prometheus:/prometheus
    deploy:
      replicas: 1

  grafana:
    image: grafana/grafana:latest
    networks:
      - monitoring
      - tengig
    volumes:
      - /rust/bzzz-v2/data/grafana:/var/lib/grafana
    deploy:
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.bzzz-grafana.rule=Host(`bzzz-monitor.deepblack.cloud`)"

  alertmanager:
    image: prom/alertmanager:latest
    networks:
      - monitoring
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    deploy:
      replicas: 1

networks:
  monitoring:
    driver: overlay
  tengig:
    external: true
```

### Key Metrics to Monitor

1. **Protocol Metrics**
   - DHT lookup latency and success rate
   - Content resolution time
   - Peer discovery and connection stability
   - bzzz:// address resolution performance

2. **Service Metrics**
   - MCP server response times
   - OpenAI API usage and costs
   - Conversation threading performance
   - Content store I/O operations

3. **Infrastructure Metrics**
   - Docker Swarm service health
   - Network connectivity between nodes
   - Storage utilization and performance
   - Resource utilization (CPU, memory, disk)

### Alerting Configuration

```yaml
# monitoring/alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@deepblack.cloud'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#bzzz-alerts'
        title: 'BZZZ v2 Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']
```
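For Prometheus to collect the protocol metrics listed above, each service has to expose them. A minimal sketch using the Go client library (`prometheus/client_golang`); the metric name and listen port are illustrative assumptions, not prescribed by BZZZ.

```go
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// dhtLookupSeconds covers the "DHT lookup latency" protocol metric above.
var dhtLookupSeconds = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "bzzz_dht_lookup_duration_seconds", // hypothetical metric name
	Help:    "Latency of DHT content lookups.",
	Buckets: prometheus.DefBuckets,
})

func main() {
	prometheus.MustRegister(dhtLookupSeconds)

	// Example instrumentation point: time a lookup and record its duration.
	start := time.Now()
	// ... perform the DHT lookup here ...
	dhtLookupSeconds.Observe(time.Since(start).Seconds())

	// Expose /metrics for the Prometheus service in the monitoring stack.
	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":2112", nil))
}
```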
## Security and Networking

### Security Architecture

1. **Network Isolation**
   - Internal overlay network for inter-service communication
   - External network exposure only through Traefik
   - Firewall rules restricting P2P ports to the local network

2. **Secret Management**
   - Docker Swarm secrets for sensitive data
   - Encrypted storage of API keys and credentials
   - Regular secret rotation procedures

3. **Access Control**
   - mTLS for P2P communication
   - API authentication and authorization
   - Role-based access for MCP endpoints

### Networking Configuration

```bash
# UFW firewall rules for BZZZ v2
sudo ufw allow from 192.168.1.0/24 to any port 9000:9300 proto tcp
sudo ufw allow from 192.168.1.0/24 to any port 5353 proto udp
sudo ufw allow from 192.168.1.0/24 to any port 2377 proto tcp   # Docker Swarm
sudo ufw allow from 192.168.1.0/24 to any port 7946 proto tcp   # Docker Swarm
sudo ufw allow from 192.168.1.0/24 to any port 4789 proto udp   # Docker Swarm
```

## Rollback Procedures

### Automatic Rollback Triggers

1. **Health Check Failures**
   - Service health checks failing for > 5 minutes
   - DHT network partition detection
   - Content store corruption detection
   - Critical error rate > 5% (a watchdog sketch for this threshold appears after the operational procedures below)

2. **Performance Degradation**
   - Response time increase > 200% from baseline
   - Memory usage > 90% for > 10 minutes
   - Storage I/O errors > 1% rate

### Manual Rollback Process

```bash
#!/bin/bash
# rollback-v2.sh - Emergency rollback to v1

echo "🚨 Initiating BZZZ v2 rollback procedure..."

# Step 1: Stop v2 services
docker stack rm bzzz-v2
sleep 30

# Step 2: Restart v1 SystemD services
sudo systemctl start bzzz@walnut
sudo systemctl start bzzz@ironwood
sudo systemctl start bzzz@acacia

# Step 3: Verify v1 connectivity
./scripts/verify-v1-mesh.sh

# Step 4: Update load balancer configuration
./scripts/update-traefik-v1.sh

# Step 5: Notify operations team
curl -X POST $SLACK_WEBHOOK -d '{"text":"🚨 BZZZ rollback to v1 completed"}'

echo "✅ Rollback completed successfully"
```

## Resource Requirements

### Node Specifications

| Component | CPU | Memory | Storage | Network |
|-----------|-----|--------|---------|---------|
| BZZZ Agent | 2 cores | 4GB | 20GB | 1Gbps |
| MCP Server | 1 core | 2GB | 5GB | 100Mbps |
| OpenAI Proxy | 1 core | 2GB | 5GB | 100Mbps |
| Content Store | 2 cores | 8GB | 500GB | 1Gbps |
| DHT Manager | 1 core | 4GB | 50GB | 1Gbps |

### Scaling Considerations

1. **Horizontal Scaling**
   - Add nodes to the DHT for increased capacity
   - Scale MCP servers based on external demand
   - Replicate content store across availability zones

2. **Vertical Scaling**
   - Increase memory for larger conversation contexts
   - Add storage for content addressing requirements
   - Enhance network capacity for P2P traffic

## Operational Procedures

### Daily Operations

1. **Health Monitoring**
   - Review Grafana dashboards for anomalies
   - Check DHT network connectivity
   - Verify content store replication status
   - Monitor OpenAI API usage and costs

2. **Maintenance Tasks**
   - Log rotation and archival
   - Content store garbage collection
   - DHT routing table optimization
   - Security patch deployment

### Weekly Operations

1. **Performance Review**
   - Analyze response time trends
   - Review resource utilization patterns
   - Assess scaling requirements
   - Update capacity planning

2. **Security Audit**
   - Review access logs
   - Validate secret rotation
   - Check for security updates
   - Test backup and recovery procedures

### Incident Response

1. **Incident Classification**
   - P0: Complete service outage
   - P1: Major feature degradation
   - P2: Performance issues
   - P3: Minor functionality problems

2. **Response Procedures**
   - Automated alerting and escalation
   - Incident commander assignment
   - Communication protocols
   - Post-incident review process
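The automatic rollback triggers and the automated escalation step above can be wired together with a small watchdog. A minimal Go sketch that polls Prometheus for the critical error rate and invokes `rollback-v2.sh` when the 5% threshold is breached; the recording-rule name, Prometheus URL, and script path are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/url"
	"os/exec"
	"strconv"
)

// checkErrorRate queries Prometheus and runs the rollback script when the
// critical error rate exceeds the 5% threshold defined above.
func checkErrorRate(promURL string) error {
	q := url.Values{"query": {"bzzz:critical_error_rate:ratio_5m"}} // hypothetical recording rule
	resp, err := http.Get(promURL + "/api/v1/query?" + q.Encode())
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	var body struct {
		Data struct {
			Result []struct {
				Value [2]interface{} `json:"value"` // [timestamp, "value"]
			} `json:"result"`
		} `json:"data"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&body); err != nil {
		return err
	}
	for _, r := range body.Data.Result {
		s, ok := r.Value[1].(string)
		if !ok {
			continue
		}
		rate, err := strconv.ParseFloat(s, 64)
		if err != nil {
			continue
		}
		if rate > 0.05 {
			fmt.Printf("error rate %.2f%% over threshold, rolling back\n", rate*100)
			return exec.Command("/bin/bash", "./rollback-v2.sh").Run()
		}
	}
	return nil
}

func main() {
	if err := checkErrorRate("http://prometheus:9090"); err != nil {
		panic(err)
	}
}
```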
This infrastructure architecture provides a robust foundation for the BZZZ v2 deployment while maintaining operational excellence and enabling future growth. The design prioritizes reliability, security, and maintainability while introducing the advanced protocol features required for the next generation of the BZZZ ecosystem.