bzzz/infrastructure/BZZZ_V2_INFRASTRUCTURE_ARCHITECTURE.md
anthonyrawlins 065dddf8d5 Prepare for v2 development: Add MCP integration and future development planning
- Add FUTURE_DEVELOPMENT.md with comprehensive v2 protocol specification
- Add MCP integration design and implementation foundation
- Add infrastructure and deployment configurations
- Update system architecture for v2 evolution

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-07 14:38:22 +10:00


BZZZ v2 Infrastructure Architecture & Deployment Strategy

Executive Summary

This document describes the infrastructure architecture and deployment strategy for the move from BZZZ v1 to v2. The design keeps the reliability of the existing 3-node cluster while adding the v2 protocol features: content-addressed storage, DHT networking, OpenAI integration, and MCP server capabilities.

Current Infrastructure Analysis

Existing v1 Deployment

  • Cluster: WALNUT (192.168.1.27), IRONWOOD (192.168.1.113), ACACIA (192.168.1.xxx)
  • Deployment: SystemD services with P2P mesh networking
  • Protocol: libp2p with mDNS discovery and pubsub messaging
  • Storage: File-based configuration and in-memory state
  • Integration: Basic Hive API connectivity and task coordination

Infrastructure Dependencies

  • Docker Swarm: Existing cluster with tengig network
  • Traefik: Load balancing and SSL termination
  • Private Registry: registry.home.deepblack.cloud
  • GitLab CI/CD: gitlab.deepblack.cloud
  • Secrets: ~/chorus/business/secrets/ management
  • Storage: NFS mounts on /rust/ for shared data

BZZZ v2 Architecture Design

1. Protocol Evolution Architecture

┌─────────────────────── BZZZ v2 Protocol Stack ───────────────────────┐
│                                                                       │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────┐   │
│  │   MCP Server    │  │  OpenAI Proxy   │  │  bzzz:// Resolver   │   │
│  │   (Port 3001)   │  │   (Port 3002)   │  │   (Port 3003)       │   │
│  └─────────────────┘  └─────────────────┘  └─────────────────────┘   │
│           │                     │                      │             │
│  ┌─────────────────────────────────────────────────────────────────┐ │
│  │                  Content Layer                                  │ │
│  │  ┌─────────────┐ ┌──────────────┐ ┌─────────────────────────┐  │ │
│  │  │ Conversation│ │ Content Store│ │    BLAKE3 Hasher        │  │ │
│  │  │ Threading   │ │  (CAS Blobs) │ │   (Content Addressing)  │  │ │
│  │  └─────────────┘ └──────────────┘ └─────────────────────────┘  │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                   │                                   │
│  ┌─────────────────────────────────────────────────────────────────┐ │
│  │                    P2P Layer                                    │ │
│  │  ┌─────────────┐ ┌──────────────┐ ┌─────────────────────────┐  │ │
│  │  │  libp2p DHT │ │Content Route │ │    Stream Multiplexing  │  │ │
│  │  │  (Discovery)│ │   (Routing)  │ │      (Yamux/mplex)      │  │ │
│  │  └─────────────┘ └──────────────┘ └─────────────────────────┘  │ │
│  └─────────────────────────────────────────────────────────────────┘ │
│                                                                       │
└───────────────────────────────────────────────────────────────────────┘

2. Content-Addressed Storage (CAS) Architecture

┌────────────────── Content-Addressed Storage System ──────────────────┐
│                                                                       │
│  ┌─────────────────────────── Node Distribution ────────────────────┐ │
│  │                                                                   │ │
│  │  WALNUT              IRONWOOD             ACACIA                 │ │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐        │ │
│  │  │  Primary    │────▶│  Secondary  │────▶│  Tertiary   │        │ │
│  │  │  Blob Store │     │  Replica    │     │  Replica    │        │ │
│  │  └─────────────┘     └─────────────┘     └─────────────┘        │ │
│  │       │                    │                    │               │ │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐        │ │
│  │  │BLAKE3 Index │     │BLAKE3 Index │     │BLAKE3 Index │        │ │
│  │  │  (Primary)  │     │ (Secondary) │     │ (Tertiary)  │        │ │
│  │  └─────────────┘     └─────────────┘     └─────────────┘        │ │
│  └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│  ┌─────────────────── Storage Layout ──────────────────────────────┐ │
│  │  /rust/bzzz-v2/blobs/                                           │ │
│  │  ├── data/                   # Raw blob storage                 │ │
│  │  │   ├── bl/                # BLAKE3 prefix sharding           │ │
│  │  │   │   └── 3k/            # Further sharding                 │ │
│  │  │   └── conversations/      # Conversation threads            │ │
│  │  ├── index/                 # BLAKE3 hash indices              │ │
│  │  │   ├── primary.db         # Primary hash->location mapping   │ │
│  │  │   └── replication.db     # Replication metadata            │ │
│  │  └── temp/                  # Temporary staging area           │ │
│  └───────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
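
The two-level prefix sharding in the storage layout can be sketched as a small path helper. This is only an illustration: `blob_path` is a hypothetical name, and the sketch assumes the blob file is named after the full encoded digest, which the layout above does not state.

```shell
#!/bin/bash
# Sketch: map an encoded BLAKE3 digest to its sharded location under data/.
# Assumptions (not from the layout): the blob file is named after the full
# digest string, and blob_path is a hypothetical helper name.

# blob_path ROOT DIGEST
blob_path() {
    local root="$1" digest="$2" shard1 shard2
    # Two shard levels of two characters each need at least four characters.
    if [ "${#digest}" -lt 4 ]; then
        echo "blob_path: digest too short: '$digest'" >&2
        return 1
    fi
    shard1=$(printf '%s' "$digest" | cut -c1-2)   # first level, e.g. "bl"
    shard2=$(printf '%s' "$digest" | cut -c3-4)   # second level, e.g. "3k"
    printf '%s/data/%s/%s/%s\n' "$root" "$shard1" "$shard2" "$digest"
}
```

For example, `blob_path /rust/bzzz-v2/blobs bl3kq7zz` prints a path under `data/bl/3k/`. Sharding this way keeps any single directory from accumulating every blob.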

3. DHT and Network Architecture

┌────────────────────── DHT Network Topology ──────────────────────────┐
│                                                                       │
│  ┌─────────────────── Bootstrap & Discovery ────────────────────────┐ │
│  │                                                                   │ │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐        │ │
│  │  │   WALNUT    │────▶│  IRONWOOD   │────▶│   ACACIA    │        │ │
│  │  │(Bootstrap 1)│◀────│(Bootstrap 2)│◀────│(Bootstrap 3)│        │ │
│  │  └─────────────┘     └─────────────┘     └─────────────┘        │ │
│  │                                                                   │ │
│  │  ┌─────────────────── DHT Responsibilities ────────────────────┐ │ │
│  │  │ WALNUT: Content Routing + Agent Discovery                  │ │ │
│  │  │ IRONWOOD: Conversation Threading + OpenAI Coordination     │ │ │
│  │  │ ACACIA: MCP Services + External Integration               │ │ │
│  │  └─────────────────────────────────────────────────────────────┘ │ │
│  └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│  ┌─────────────────── Network Protocols ────────────────────────────┐ │
│  │                                                                   │ │
│  │  Protocol Support:                                               │ │
│  │  • bzzz:// semantic addressing (DHT resolution)                  │ │
│  │  • Content routing via DHT (BLAKE3 hash lookup)                  │ │
│  │  • Agent discovery and capability broadcasting                   │ │
│  │  • Stream multiplexing for concurrent conversations              │ │
│  │  • NAT traversal and hole punching                              │ │
│  │                                                                   │ │
│  │  Port Allocation:                                                │ │
│  │  • P2P Listen: 9000-9100 (configurable range)                   │ │
│  │  • DHT Bootstrap: 9101-9103 (per node)                          │ │
│  │  • Content Routing: 9200-9300 (dynamic allocation)              │ │
│  │  • mDNS Discovery: 5353 (standard multicast DNS)                │ │
│  └───────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
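
As a quick cross-check of the port allocation above, the ranges can be encoded in a lookup helper. `port_role` and the role labels it prints are hypothetical names chosen for this sketch; only the port numbers come from the allocation list.

```shell
#!/bin/bash
# Sketch: classify a port against the v2 port allocation listed above.
# port_role and its output labels are hypothetical; the ranges are from the list.

# port_role PORT
port_role() {
    local port="$1"
    # Reject empty or non-numeric input before doing integer comparisons.
    case "$port" in
        ''|*[!0-9]*) echo "invalid"; return 1 ;;
    esac
    if [ "$port" -eq 5353 ]; then
        echo "mdns-discovery"
    elif [ "$port" -ge 9000 ] && [ "$port" -le 9100 ]; then
        echo "p2p-listen"
    elif [ "$port" -ge 9101 ] && [ "$port" -le 9103 ]; then
        echo "dht-bootstrap"
    elif [ "$port" -ge 9200 ] && [ "$port" -le 9300 ]; then
        echo "content-routing"
    else
        echo "other"
    fi
}
```

A helper like this can be useful when auditing firewall rules or published ports against the plan.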

4. Service Architecture

┌─────────────────────── BZZZ v2 Service Stack ────────────────────────┐
│                                                                       │
│  ┌─────────────────── External Layer ───────────────────────────────┐ │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐        │ │
│  │  │   Traefik   │────▶│   OpenAI    │────▶│    MCP      │        │ │
│  │  │Load Balancer│     │   Gateway   │     │  Clients    │        │ │
│  │  │  (SSL Term) │     │(Rate Limit) │     │(External)   │        │ │
│  │  └─────────────┘     └─────────────┘     └─────────────┘        │ │
│  └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│  ┌─────────────────── Application Layer ────────────────────────────┐ │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐        │ │
│  │  │ BZZZ Agent  │────▶│ Conversation│────▶│   Content   │        │ │
│  │  │  Manager    │     │  Threading  │     │  Resolver   │        │ │
│  │  └─────────────┘     └─────────────┘     └─────────────┘        │ │
│  │          │                   │                   │             │ │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐        │ │
│  │  │    MCP      │     │   OpenAI    │     │    DHT      │        │ │
│  │  │   Server    │     │   Client    │     │  Manager    │        │ │
│  │  └─────────────┘     └─────────────┘     └─────────────┘        │ │
│  └───────────────────────────────────────────────────────────────────┘ │
│                                                                       │
│  ┌─────────────────── Storage Layer ─────────────────────────────────┐ │
│  │  ┌─────────────┐     ┌─────────────┐     ┌─────────────┐        │ │
│  │  │    CAS      │────▶│ PostgreSQL  │────▶│    Redis    │        │ │
│  │  │ Blob Store  │     │(Metadata)   │     │  (Cache)    │        │ │
│  │  └─────────────┘     └─────────────┘     └─────────────┘        │ │
│  └───────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘

Migration Strategy

Phase 1: Parallel Deployment (Weeks 1-2)

1.1 Infrastructure Preparation

# Target v2 directory structure
/rust/bzzz-v2/
├── config/
│   ├── swarm/
│   ├── systemd/
│   └── secrets/
├── data/
│   ├── blobs/
│   ├── conversations/
│   └── dht/
└── logs/
    ├── application/
    ├── p2p/
    └── monitoring/
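
The tree above could be created with a short helper like the following. This is a sketch: `create_v2_layout` is a hypothetical name, and restricting permissions on the secrets directory is an assumption rather than something the plan specifies.

```shell
#!/bin/bash
# Sketch: create the v2 directory tree shown above.
# create_v2_layout is a hypothetical helper; the chmod is an assumption.

# create_v2_layout [ROOT]   (ROOT defaults to /rust/bzzz-v2)
create_v2_layout() {
    local root="${1:-/rust/bzzz-v2}"
    local dir
    for dir in \
        config/swarm config/systemd config/secrets \
        data/blobs data/conversations data/dht \
        logs/application logs/p2p logs/monitoring
    do
        mkdir -p "$root/$dir" || return 1
    done
    # Assumption: the secrets directory should not be readable by other users.
    chmod 700 "$root/config/secrets"
}
```

Because `mkdir -p` is idempotent, the helper can be re-run safely on a node where part of the tree already exists.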

1.2 Service Deployment Strategy

  • Deploy v2 services on dedicated ports (P2P in the 9000+ range, service endpoints on 3001-3003)
  • Maintain v1 SystemD services during transition
  • Use Docker Swarm stack for v2 components
  • Implement health checks and readiness probes
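
A readiness probe during the parallel deployment can be as simple as retrying a check command until it succeeds. The sketch below assumes nothing about the services themselves: `wait_ready` is a hypothetical helper, and the probe command is supplied by the caller.

```shell
#!/bin/bash
# Sketch: retry a probe command until it succeeds or attempts run out.
# wait_ready is a hypothetical helper; the probe itself is caller-supplied.

# wait_ready ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_ready() {
    local attempts="$1" delay="$2"
    shift 2
    local i=1
    while [ "$i" -le "$attempts" ]; do
        # Probe output is discarded; only the exit status matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Do not sleep after the final failed attempt.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}
```

A call such as `wait_ready 30 2 curl -fsS http://localhost:3001/health` would wait up to about a minute for the MCP server. The `/health` path is an assumption; the document does not define the health endpoints.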

1.3 Database Migration

  • Create new PostgreSQL schema for v2 metadata
  • Implement data migration scripts for conversation history
  • Set up Redis cluster for DHT caching
  • Configure backup and recovery procedures

Phase 2: Feature Migration (Weeks 3-4)

2.1 Content Store Migration

# Migration workflow
1. Export v1 conversation logs from Hypercore
2. Convert to BLAKE3-addressed blobs
3. Populate content store with historical data
4. Verify data integrity and accessibility
5. Update references in conversation threads
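
Step 4 of the workflow (integrity verification) could be sketched as follows. It assumes that each blob file is named after the lowercase hex digest of its content and that the `b3sum` CLI from the BLAKE3 reference tools is installed; neither is stated in the plan, and a different digest encoding would need a conversion step. `verify_blob` and `HASH_CMD` are hypothetical names.

```shell
#!/bin/bash
# Sketch: check that a blob's file name matches the digest of its content.
# Assumptions: blobs are named by lowercase hex digest, and b3sum is installed.
# verify_blob and HASH_CMD are hypothetical names.

# verify_blob FILE
verify_blob() {
    local file="$1" expected actual
    expected=$(basename "$file")
    # b3sum and the coreutils *sum tools both print "<hex digest>  <file>".
    actual=$("${HASH_CMD:-b3sum}" "$file" | cut -d' ' -f1)
    [ -n "$actual" ] && [ "$actual" = "$expected" ]
}
```

Setting `HASH_CMD=sha256sum` swaps in a different hash tool with the same output format, which is handy for trying the helper on a machine without `b3sum`. That is a stand-in only and does not produce BLAKE3 digests.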

2.2 P2P Protocol Upgrade

  • Implement dual-protocol support (v1 + v2)
  • Migrate peer discovery from mDNS to DHT
  • Update message formats and routing
  • Maintain backward compatibility during transition

Phase 3: Service Cutover (Weeks 5-6)

3.1 Traffic Migration

  • Implement feature flags for v2 protocol
  • Gradual migration of agents to v2 endpoints
  • Monitor performance and error rates
  • Implement automatic rollback triggers
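
One common way to implement a gradual, flag-driven migration is to assign each agent a stable bucket and compare it with a rollout percentage. The sketch below uses a CRC of the agent ID from `cksum`; this is an illustrative technique, not the project's confirmed mechanism, and `rollout_bucket` and `protocol_for` are hypothetical names.

```shell
#!/bin/bash
# Sketch: deterministic percentage rollout keyed on an agent ID.
# rollout_bucket and protocol_for are hypothetical helpers.

# rollout_bucket AGENT_ID: print a stable bucket in 0..99.
rollout_bucket() {
    local crc
    # With input on stdin, cksum prints "<crc> <byte count>"; keep the CRC.
    crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $((crc % 100))
}

# protocol_for AGENT_ID PERCENT: print "v2" if the agent is inside the
# rollout percentage, otherwise "v1".
protocol_for() {
    local bucket
    bucket=$(rollout_bucket "$1")
    if [ "$bucket" -lt "$2" ]; then
        echo "v2"
    else
        echo "v1"
    fi
}
```

Because the bucket depends only on the ID, raising the percentage moves additional agents to v2 without flipping agents that were already migrated, and setting it to 0 sends every agent back to v1.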

3.2 Monitoring and Validation

  • Deploy comprehensive monitoring stack
  • Validate all v2 protocol operations
  • Performance benchmarking vs v1
  • Load testing with conversation threading

Phase 4: Production Deployment (Weeks 7-8)

4.1 Full Cutover

  • Disable v1 protocol endpoints
  • Remove v1 SystemD services
  • Update all client configurations
  • Archive v1 data and configurations

4.2 Optimization and Tuning

  • Performance optimization based on production load
  • Resource allocation tuning
  • Security hardening and audit
  • Documentation and training completion

Container Orchestration

Docker Swarm Stack Configuration

# docker-compose.swarm.yml
version: '3.8'

services:
  bzzz-agent:
    image: registry.home.deepblack.cloud/bzzz:v2.0.0
    networks:
      - tengig
      - bzzz-internal
    ports:
      - "9000-9100:9000-9100"
    volumes:
      - /rust/bzzz-v2/data:/app/data
      - /rust/bzzz-v2/config:/app/config
    environment:
      - BZZZ_VERSION=2.0.0
      - BZZZ_PROTOCOL=bzzz://
      - DHT_BOOTSTRAP_NODES=walnut:9101,ironwood:9102,acacia:9103
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == walnut
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.bzzz-agent.rule=Host(`bzzz.deepblack.cloud`)"
        - "traefik.http.services.bzzz-agent.loadbalancer.server.port=9000"

  mcp-server:
    image: registry.home.deepblack.cloud/bzzz-mcp:v2.0.0
    networks:
      - tengig
    ports:
      - "3001:3001"
    environment:
      - MCP_VERSION=1.0.0
      - BZZZ_ENDPOINT=http://bzzz-agent:9000
    deploy:
      replicas: 3
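      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.mcp-server.rule=Host(`mcp.deepblack.cloud`)"
        # Traefik cannot auto-detect the service port in Swarm mode, so declare it.
        - "traefik.http.services.mcp-server.loadbalancer.server.port=3001"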

  openai-proxy:
    image: registry.home.deepblack.cloud/bzzz-openai-proxy:v2.0.0
    networks:
      - tengig
      - bzzz-internal
    ports:
      - "3002:3002"
    environment:
      - OPENAI_API_KEY_FILE=/run/secrets/openai_api_key
      - RATE_LIMIT_RPM=1000
      - COST_TRACKING_ENABLED=true
    secrets:
      - openai_api_key
    deploy:
      replicas: 2

  content-resolver:
    image: registry.home.deepblack.cloud/bzzz-resolver:v2.0.0
    networks:
      - bzzz-internal
    ports:
      - "3003:3003"
    volumes:
      - /rust/bzzz-v2/data/blobs:/app/blobs:ro
    deploy:
      replicas: 3

  postgres:
    image: postgres:15-alpine
    networks:
      - bzzz-internal
    environment:
      - POSTGRES_DB=bzzz_v2
      - POSTGRES_USER_FILE=/run/secrets/postgres_user
      - POSTGRES_PASSWORD_FILE=/run/secrets/postgres_password
    volumes:
      - /rust/bzzz-v2/data/postgres:/var/lib/postgresql/data
    secrets:
      - postgres_user
      - postgres_password
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == walnut

  redis:
    image: redis:7-alpine
    networks:
      - bzzz-internal
    volumes:
      - /rust/bzzz-v2/data/redis:/data
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == ironwood

networks:
  tengig:
    external: true
  bzzz-internal:
    driver: overlay
    internal: true

secrets:
  openai_api_key:
    external: true
  postgres_user:
    external: true
  postgres_password:
    external: true
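
The stack declares three external secrets, so `docker stack deploy` fails if any of them has not been created beforehand. A pre-deploy check could look like the sketch below; `missing_secrets` is a hypothetical helper that reads the existing secret names from standard input.

```shell
#!/bin/bash
# Sketch: report which required Swarm secrets are not yet defined.
# missing_secrets is a hypothetical helper name.

# missing_secrets NAME...
# Reads existing secret names from stdin (one per line), prints each required
# NAME that is absent, and returns 1 if anything is missing.
missing_secrets() {
    local existing name status=0
    existing=$(cat)
    for name in "$@"; do
        # -x matches whole lines only, -F treats the name as a literal string.
        if ! printf '%s\n' "$existing" | grep -qxF -- "$name"; then
            echo "$name"
            status=1
        fi
    done
    return "$status"
}
```

On a Swarm manager it can be fed from the Docker CLI: `docker secret ls --format '{{.Name}}' | missing_secrets openai_api_key postgres_user postgres_password`.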

CI/CD Pipeline Configuration

GitLab CI Pipeline

# .gitlab-ci.yml
stages:
  - build
  - test
  - deploy-staging
  - deploy-production

variables:
  REGISTRY: registry.home.deepblack.cloud
  IMAGE_TAG: ${CI_COMMIT_SHORT_SHA}

build:
  stage: build
  script:
    - docker build -t ${REGISTRY}/bzzz:${IMAGE_TAG} .
    - docker build -t ${REGISTRY}/bzzz-mcp:${IMAGE_TAG} -f Dockerfile.mcp .
    - docker build -t ${REGISTRY}/bzzz-openai-proxy:${IMAGE_TAG} -f Dockerfile.proxy .
    - docker build -t ${REGISTRY}/bzzz-resolver:${IMAGE_TAG} -f Dockerfile.resolver .
    - docker push ${REGISTRY}/bzzz:${IMAGE_TAG}
    - docker push ${REGISTRY}/bzzz-mcp:${IMAGE_TAG}
    - docker push ${REGISTRY}/bzzz-openai-proxy:${IMAGE_TAG}
    - docker push ${REGISTRY}/bzzz-resolver:${IMAGE_TAG}
  only:
    - main
    - develop

test-protocol:
  stage: test
  script:
    - go test ./...
    - docker run --rm ${REGISTRY}/bzzz:${IMAGE_TAG} /app/test-suite
  dependencies:
    - build

test-integration:
  stage: test
  script:
    - docker-compose -f docker-compose.test.yml up -d
    - ./scripts/integration-tests.sh
    - docker-compose -f docker-compose.test.yml down
  dependencies:
    - build

deploy-staging:
  stage: deploy-staging
  script:
    - docker stack deploy -c docker-compose.staging.yml bzzz-v2-staging
  environment:
    name: staging
  only:
    - develop

deploy-production:
  stage: deploy-production
  script:
    - docker stack deploy -c docker-compose.swarm.yml bzzz-v2
  environment:
    name: production
  only:
    - main
  when: manual

Monitoring and Operations

Monitoring Stack

# docker-compose.monitoring.yml
version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    networks:
      - monitoring
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - /rust/bzzz-v2/data/prometheus:/prometheus
    deploy:
      replicas: 1

  grafana:
    image: grafana/grafana:latest
    networks:
      - monitoring
      - tengig
    volumes:
      - /rust/bzzz-v2/data/grafana:/var/lib/grafana
    deploy:
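      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.bzzz-grafana.rule=Host(`bzzz-monitor.deepblack.cloud`)"
        # Grafana listens on 3000 by default; Swarm mode needs the port declared.
        - "traefik.http.services.bzzz-grafana.loadbalancer.server.port=3000"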

  alertmanager:
    image: prom/alertmanager:latest
    networks:
      - monitoring
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    deploy:
      replicas: 1

networks:
  monitoring:
    driver: overlay
  tengig:
    external: true

Key Metrics to Monitor

  1. Protocol Metrics

    • DHT lookup latency and success rate
    • Content resolution time
    • Peer discovery and connection stability
    • bzzz:// address resolution performance
  2. Service Metrics

    • MCP server response times
    • OpenAI API usage and costs
    • Conversation threading performance
    • Content store I/O operations
  3. Infrastructure Metrics

    • Docker Swarm service health
    • Network connectivity between nodes
    • Storage utilization and performance
    • Resource utilization (CPU, memory, disk)

Alerting Configuration

# monitoring/alertmanager.yml
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@deepblack.cloud'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'web.hook'

receivers:
  - name: 'web.hook'
    slack_configs:
      - api_url: 'YOUR_SLACK_WEBHOOK_URL'
        channel: '#bzzz-alerts'
        title: 'BZZZ v2 Alert'
        text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'dev', 'instance']

Security and Networking

Security Architecture

  1. Network Isolation

    • Internal overlay network for inter-service communication
    • External network exposure only through Traefik
    • Firewall rules restricting P2P ports to local network
  2. Secret Management

    • Docker Swarm secrets for sensitive data
    • Encrypted storage of API keys and credentials
    • Regular secret rotation procedures
  3. Access Control

    • mTLS for P2P communication
    • API authentication and authorization
    • Role-based access for MCP endpoints

Networking Configuration

# UFW firewall rules for BZZZ v2
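sudo ufw allow from 192.168.1.0/24 to any port 9000:9300 proto tcp  # BZZZ P2P, DHT bootstrap, content routing
sudo ufw allow from 192.168.1.0/24 to any port 5353 proto udp       # mDNS discovery
sudo ufw allow from 192.168.1.0/24 to any port 2377 proto tcp       # Docker Swarm cluster management
sudo ufw allow from 192.168.1.0/24 to any port 7946 proto tcp       # Docker Swarm node communication
sudo ufw allow from 192.168.1.0/24 to any port 7946 proto udp       # Docker Swarm node communication
sudo ufw allow from 192.168.1.0/24 to any port 4789 proto udp       # Docker Swarm overlay traffic (VXLAN)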

Rollback Procedures

Automatic Rollback Triggers

  1. Health Check Failures

    • Service health checks failing for > 5 minutes
    • DHT network partition detection
    • Content store corruption detection
    • Critical error rate > 5%
  2. Performance Degradation

    • Response time increase > 200% from baseline
    • Memory usage > 90% for > 10 minutes
    • Storage I/O errors > 1% rate
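
The numeric thresholds above can be sketched as a single decision helper. The sketch does not model the time windows (5 and 10 minutes); it assumes the caller passes values that have already been sustained for the required duration. DHT partition and corruption detection are also out of scope here. `exceeds` and `should_rollback` are hypothetical names.

```shell
#!/bin/bash
# Sketch: evaluate the numeric rollback thresholds listed above.
# Assumption: inputs are already averaged over the required time window.
# exceeds and should_rollback are hypothetical helpers.

# exceeds VALUE THRESHOLD: succeed when VALUE is strictly greater (decimals ok).
exceeds() {
    awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 > t + 0) }'
}

# should_rollback ERROR_PCT LATENCY_INCREASE_PCT MEMORY_PCT IO_ERROR_PCT
# Prints the first tripped trigger and returns 0; returns 1 if none tripped.
should_rollback() {
    if exceeds "$1" 5;   then echo "critical-error-rate"; return 0; fi
    if exceeds "$2" 200; then echo "response-time";       return 0; fi
    if exceeds "$3" 90;  then echo "memory-usage";        return 0; fi
    if exceeds "$4" 1;   then echo "storage-io-errors";   return 0; fi
    return 1
}
```

The comparisons are strict, so a value sitting exactly on a threshold (for example a 5% error rate) does not trigger a rollback, matching the ">" wording of the triggers.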

Manual Rollback Process

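#!/bin/bash
# rollback-v2.sh - Emergency rollback to v1

# Abort on the first failed step so success is never reported after an error.
set -euo pipefail

# Fail early if the notification webhook is not configured.
: "${SLACK_WEBHOOK:?SLACK_WEBHOOK must be set}"

echo "🚨 Initiating BZZZ v2 rollback procedure..."

# Step 1: Stop v2 services and give Swarm time to tear them down
docker stack rm bzzz-v2
sleep 30

# Step 2: Restart v1 SystemD services
sudo systemctl start bzzz@walnut
sudo systemctl start bzzz@ironwood
sudo systemctl start bzzz@acacia

# Step 3: Verify v1 connectivity
./scripts/verify-v1-mesh.sh

# Step 4: Update load balancer configuration
./scripts/update-traefik-v1.sh

# Step 5: Notify operations team (a failed notification does not undo the rollback)
curl -fsS -X POST -H 'Content-Type: application/json' \
  -d '{"text":"🚨 BZZZ rollback to v1 completed"}' "$SLACK_WEBHOOK" \
  || echo "⚠️ Slack notification failed" >&2

echo "✅ Rollback completed successfully"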

Resource Requirements

Node Specifications

Component       CPU       Memory   Storage   Network
BZZZ Agent      2 cores   4GB      20GB      1Gbps
MCP Server      1 core    2GB      5GB       100Mbps
OpenAI Proxy    1 core    2GB      5GB       100Mbps
Content Store   2 cores   8GB      500GB     1Gbps
DHT Manager     1 core    4GB      50GB      1Gbps
Scaling Considerations

  1. Horizontal Scaling

    • Add nodes to DHT for increased capacity
    • Scale MCP servers based on external demand
    • Replicate content store across availability zones
  2. Vertical Scaling

    • Increase memory for larger conversation contexts
    • Add storage for content addressing requirements
    • Enhance network capacity for P2P traffic

Operational Procedures

Daily Operations

  1. Health Monitoring

    • Review Grafana dashboards for anomalies
    • Check DHT network connectivity
    • Verify content store replication status
    • Monitor OpenAI API usage and costs
  2. Maintenance Tasks

    • Log rotation and archival
    • Content store garbage collection
    • DHT routing table optimization
    • Security patch deployment

Weekly Operations

  1. Performance Review

    • Analyze response time trends
    • Review resource utilization patterns
    • Assess scaling requirements
    • Update capacity planning
  2. Security Audit

    • Review access logs
    • Validate secret rotation
    • Check for security updates
    • Test backup and recovery procedures

Incident Response

  1. Incident Classification

    • P0: Complete service outage
    • P1: Major feature degradation
    • P2: Performance issues
    • P3: Minor functionality problems
  2. Response Procedures

    • Automated alerting and escalation
    • Incident commander assignment
    • Communication protocols
    • Post-incident review process

This architecture gives BZZZ v2 a deployment path that preserves the reliability of the existing 3-node cluster while adding content-addressed storage, DHT networking, OpenAI integration, and MCP server capabilities. The design prioritizes reliability, security, and maintainability, with a phased migration and defined rollback procedures at each step.