Prepare for v2 development: Add MCP integration and future development planning

- Add FUTURE_DEVELOPMENT.md with comprehensive v2 protocol specification
- Add MCP integration design and implementation foundation
- Add infrastructure and deployment configurations
- Update system architecture for v2 evolution

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit is contained in:
anthonyrawlins
2025-08-07 14:38:22 +10:00
parent 5f94288fbb
commit 065dddf8d5
41 changed files with 14970 additions and 161 deletions

# BZZZ v2 Infrastructure Architecture & Deployment Strategy
## Executive Summary
This document outlines the comprehensive infrastructure architecture and deployment strategy for BZZZ v2 evolution. The design maintains the existing 3-node cluster reliability while enabling advanced protocol features including content-addressed storage, DHT networking, OpenAI integration, and MCP server capabilities.
## Current Infrastructure Analysis
### Existing v1 Deployment
- **Cluster**: WALNUT (192.168.1.27), IRONWOOD (192.168.1.113), ACACIA (192.168.1.xxx)
- **Deployment**: SystemD services with P2P mesh networking
- **Protocol**: libp2p with mDNS discovery and pubsub messaging
- **Storage**: File-based configuration and in-memory state
- **Integration**: Basic Hive API connectivity and task coordination
### Infrastructure Dependencies
- **Docker Swarm**: Existing cluster with `tengig` network
- **Traefik**: Load balancing and SSL termination
- **Private Registry**: registry.home.deepblack.cloud
- **GitLab CI/CD**: gitlab.deepblack.cloud
- **Secrets**: ~/chorus/business/secrets/ management
- **Storage**: NFS mounts on /rust/ for shared data
## BZZZ v2 Architecture Design
### 1. Protocol Evolution Architecture
```
┌─────────────────────── BZZZ v2 Protocol Stack ───────────────────────┐
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────────┐ │
│ │ MCP Server │ │ OpenAI Proxy │ │ bzzz:// Resolver │ │
│ │ (Port 3001) │ │ (Port 3002) │ │ (Port 3003) │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────────┘ │
│ │ │ │ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Content Layer │ │
│ │ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │ │
│ │ │ Conversation│ │ Content Store│ │ BLAKE3 Hasher │ │ │
│ │ │ Threading │ │ (CAS Blobs) │ │ (Content Addressing) │ │ │
│ │ └─────────────┘ └──────────────┘ └─────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ P2P Layer │ │
│ │ ┌─────────────┐ ┌──────────────┐ ┌─────────────────────────┐ │ │
│ │ │ libp2p DHT │ │Content Route │ │ Stream Multiplexing │ │ │
│ │ │ (Discovery)│ │ (Routing) │ │ (Yamux/mplex) │ │ │
│ │ └─────────────┘ └──────────────┘ └─────────────────────────┘ │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────────────────┘
```
### 2. Content-Addressed Storage (CAS) Architecture
```
┌────────────────── Content-Addressed Storage System ──────────────────┐
│ │
│ ┌─────────────────────────── Node Distribution ────────────────────┐ │
│ │ │ │
│ │ WALNUT IRONWOOD ACACIA │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Primary │────▶│ Secondary │────▶│ Tertiary │ │ │
│ │ │ Blob Store │ │ Replica │ │ Replica │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │BLAKE3 Index │ │BLAKE3 Index │ │BLAKE3 Index │ │ │
│ │ │ (Primary) │ │ (Secondary) │ │ (Tertiary) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────── Storage Layout ──────────────────────────────┐ │
│ │ /rust/bzzz-v2/blobs/ │ │
│ │ ├── data/ # Raw blob storage │ │
│ │ │ ├── bl/ # BLAKE3 prefix sharding │ │
│ │ │ │ └── 3k/ # Further sharding │ │
│ │ │ └── conversations/ # Conversation threads │ │
│ │ ├── index/ # BLAKE3 hash indices │ │
│ │ │ ├── primary.db # Primary hash->location mapping │ │
│ │ │ └── replication.db # Replication metadata │ │
│ │ └── temp/ # Temporary staging area │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
```
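The prefix-sharded layout above can be sketched in a few lines. This is an illustrative sketch, not the actual implementation: the design specifies BLAKE3, but the snippet below substitutes the standard library's `hashlib.blake2b` (BLAKE3 requires a third-party package), and the helper names `content_address`, `shard_path`, and `store_blob` are hypothetical.

```python
import hashlib
import os

def content_address(blob: bytes) -> str:
    # The v2 design specifies BLAKE3; blake2b stands in here because
    # BLAKE3 is not in the Python standard library.
    return hashlib.blake2b(blob, digest_size=32).hexdigest()

def shard_path(root: str, digest: str) -> str:
    # Two-level prefix sharding, matching the data/bl/3k/ layout above:
    # the first four hex characters become two directory levels.
    return os.path.join(root, "data", digest[:2], digest[2:4], digest)

def store_blob(root: str, blob: bytes) -> str:
    digest = content_address(blob)
    path = shard_path(root, digest)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    # Stage through temp/ so readers never observe a partially written blob.
    tmp = os.path.join(root, "temp", digest)
    os.makedirs(os.path.dirname(tmp), exist_ok=True)
    with open(tmp, "wb") as f:
        f.write(blob)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems
    return digest
```

Because the path is derived from the content hash, identical blobs deduplicate automatically and any blob can be verified by re-hashing it against its own filename.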
### 3. DHT and Network Architecture
```
┌────────────────────── DHT Network Topology ──────────────────────────┐
│ │
│ ┌─────────────────── Bootstrap & Discovery ────────────────────────┐ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ WALNUT │────▶│ IRONWOOD │────▶│ ACACIA │ │ │
│ │ │(Bootstrap 1)│◀────│(Bootstrap 2)│◀────│(Bootstrap 3)│ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ │ ┌─────────────────── DHT Responsibilities ────────────────────┐ │ │
│ │ │ WALNUT: Content Routing + Agent Discovery │ │ │
│ │ │ IRONWOOD: Conversation Threading + OpenAI Coordination │ │ │
│ │ │ ACACIA: MCP Services + External Integration │ │ │
│ │ └─────────────────────────────────────────────────────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────── Network Protocols ────────────────────────────┐ │
│ │ │ │
│ │ Protocol Support: │ │
│ │ • bzzz:// semantic addressing (DHT resolution) │ │
│ │ • Content routing via DHT (BLAKE3 hash lookup) │ │
│ │ • Agent discovery and capability broadcasting │ │
│ │ • Stream multiplexing for concurrent conversations │ │
│ │ • NAT traversal and hole punching │ │
│ │ │ │
│ │ Port Allocation: │ │
│ │ • P2P Listen: 9000-9100 (configurable range) │ │
│ │ • DHT Bootstrap: 9101-9103 (per node) │ │
│ │ • Content Routing: 9200-9300 (dynamic allocation) │ │
│ │ • mDNS Discovery: 5353 (standard multicast DNS) │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
```
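A minimal sketch of bzzz:// resolution follows. The address grammar (`bzzz://<authority>/<path>`) and the dict standing in for a libp2p DHT lookup are assumptions for illustration only; the real resolver's grammar and return types may differ.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class BzzzAddress:
    authority: str  # agent or service name (assumed grammar)
    path: str

def parse_bzzz(url: str) -> BzzzAddress:
    # urlparse handles arbitrary schemes as long as "//" is present.
    parsed = urlparse(url)
    if parsed.scheme != "bzzz":
        raise ValueError(f"not a bzzz:// address: {url}")
    return BzzzAddress(parsed.netloc, parsed.path.lstrip("/"))

def resolve(url: str, dht: dict) -> str:
    # `dht` stands in for a libp2p DHT lookup (key -> provider multiaddr).
    addr = parse_bzzz(url)
    try:
        return dht[addr.authority]
    except KeyError:
        raise LookupError(f"no provider for {addr.authority}")
```

In production the dict lookup would be an asynchronous DHT query against the bootstrap nodes listed above, with the returned multiaddr dialed over the 9200-9300 content-routing range.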
### 4. Service Architecture
```
┌─────────────────────── BZZZ v2 Service Stack ────────────────────────┐
│ │
│ ┌─────────────────── External Layer ───────────────────────────────┐ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ Traefik │────▶│ OpenAI │────▶│ MCP │ │ │
│ │ │Load Balancer│ │ Gateway │ │ Clients │ │ │
│ │ │ (SSL Term) │ │(Rate Limit) │ │(External) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────── Application Layer ────────────────────────────┐ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ BZZZ Agent │────▶│ Conversation│────▶│ Content │ │ │
│ │ │ Manager │ │ Threading │ │ Resolver │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ MCP │ │ OpenAI │ │ DHT │ │ │
│ │ │ Server │ │ Client │ │ Manager │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────── Storage Layer ─────────────────────────────────┐ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ CAS │────▶│ PostgreSQL │────▶│ Redis │ │ │
│ │ │ Blob Store │ │(Metadata) │ │ (Cache) │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ └───────────────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
```
## Migration Strategy
### Phase 1: Parallel Deployment (Weeks 1-2)
#### 1.1 Infrastructure Preparation
```bash
# Create v2 directory structure
/rust/bzzz-v2/
├── config/
│ ├── swarm/
│ ├── systemd/
│ └── secrets/
├── data/
│ ├── blobs/
│ ├── conversations/
│ └── dht/
└── logs/
├── application/
├── p2p/
└── monitoring/
```
#### 1.2 Service Deployment Strategy
- Deploy v2 services on non-standard ports (9000+ range)
- Maintain v1 SystemD services during transition
- Use Docker Swarm stack for v2 components
- Implement health checks and readiness probes
#### 1.3 Database Migration
- Create new PostgreSQL schema for v2 metadata
- Implement data migration scripts for conversation history
- Set up Redis cluster for DHT caching
- Configure backup and recovery procedures
### Phase 2: Feature Migration (Weeks 3-4)
#### 2.1 Content Store Migration
Migration workflow:
1. Export v1 conversation logs from Hypercore
2. Convert to BLAKE3-addressed blobs
3. Populate content store with historical data
4. Verify data integrity and accessibility
5. Update references in conversation threads
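The integrity-verification step reduces to re-hashing each migrated blob and comparing against the digest encoded in its filename. A sketch, again using stdlib `blake2b` as a stand-in for BLAKE3 and a hypothetical `verify_blob` helper:

```python
import hashlib
import os

def verify_blob(path: str) -> bool:
    # A blob is valid iff re-hashing its bytes reproduces the digest
    # in its filename (BLAKE3 in the real design; blake2b stands in here).
    expected = os.path.basename(path)
    with open(path, "rb") as f:
        actual = hashlib.blake2b(f.read(), digest_size=32).hexdigest()
    return actual == expected
```

Running this over the full blob tree after step 3 catches truncated or corrupted writes before any conversation-thread references are updated.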
#### 2.2 P2P Protocol Upgrade
- Implement dual-protocol support (v1 + v2)
- Migrate peer discovery from mDNS to DHT
- Update message formats and routing
- Maintain backward compatibility during transition
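Dual-protocol support hinges on version negotiation at connection time. A deliberately simplified sketch (the protocol-ID strings and the lexicographic comparison are illustrative assumptions; a real implementation would use semantic-version parsing):

```python
def negotiate(ours: set, theirs: set) -> str:
    # Prefer the newest protocol both peers support; falling back to v1
    # keeps v1-only peers working during the transition.
    common = ours & theirs
    if not common:
        raise ConnectionError("no shared protocol version")
    # Lexicographic max works for single-digit majors like these;
    # real code should compare parsed version tuples instead.
    return max(common)
```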
### Phase 3: Service Cutover (Weeks 5-6)
#### 3.1 Traffic Migration
- Implement feature flags for v2 protocol
- Gradual migration of agents to v2 endpoints
- Monitor performance and error rates
- Implement automatic rollback triggers
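Gradual migration of agents is typically implemented as a deterministic percentage rollout. The sketch below is one common approach, not the project's actual flag mechanism: hashing the agent ID into a stable bucket means raising the rollout percentage only ever adds agents to v2 and never flip-flops an agent between protocol versions across restarts.

```python
import hashlib

def in_v2_cohort(agent_id: str, rollout_percent: int) -> bool:
    # Map the agent ID deterministically into one of 100 buckets.
    bucket = int(hashlib.sha256(agent_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_percent
```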
#### 3.2 Monitoring and Validation
- Deploy comprehensive monitoring stack
- Validate all v2 protocol operations
- Performance benchmarking vs v1
- Load testing with conversation threading
### Phase 4: Production Deployment (Weeks 7-8)
#### 4.1 Full Cutover
- Disable v1 protocol endpoints
- Remove v1 SystemD services
- Update all client configurations
- Archive v1 data and configurations
#### 4.2 Optimization and Tuning
- Performance optimization based on production load
- Resource allocation tuning
- Security hardening and audit
- Documentation and training completion
## Container Orchestration
### Docker Swarm Stack Configuration
```yaml
# docker-compose.swarm.yml
version: '3.8'
services:
bzzz-agent:
image: registry.home.deepblack.cloud/bzzz:v2.0.0
networks:
- tengig
- bzzz-internal
ports:
- "9000-9100:9000-9100"
volumes:
- /rust/bzzz-v2/data:/app/data
- /rust/bzzz-v2/config:/app/config
environment:
- BZZZ_VERSION=2.0.0
- BZZZ_PROTOCOL=bzzz://
- DHT_BOOTSTRAP_NODES=walnut:9101,ironwood:9102,acacia:9103
deploy:
replicas: 1
placement:
constraints:
- node.hostname == walnut
labels:
- "traefik.enable=true"
- "traefik.http.routers.bzzz-agent.rule=Host(`bzzz.deepblack.cloud`)"
- "traefik.http.services.bzzz-agent.loadbalancer.server.port=9000"
mcp-server:
image: registry.home.deepblack.cloud/bzzz-mcp:v2.0.0
networks:
- tengig
ports:
- "3001:3001"
environment:
- MCP_VERSION=1.0.0
- BZZZ_ENDPOINT=http://bzzz-agent:9000
deploy:
replicas: 3
labels:
- "traefik.enable=true"
- "traefik.http.routers.mcp-server.rule=Host(`mcp.deepblack.cloud`)"
openai-proxy:
image: registry.home.deepblack.cloud/bzzz-openai-proxy:v2.0.0
networks:
- tengig
- bzzz-internal
ports:
- "3002:3002"
environment:
- OPENAI_API_KEY_FILE=/run/secrets/openai_api_key
- RATE_LIMIT_RPM=1000
- COST_TRACKING_ENABLED=true
secrets:
- openai_api_key
deploy:
replicas: 2
content-resolver:
image: registry.home.deepblack.cloud/bzzz-resolver:v2.0.0
networks:
- bzzz-internal
ports:
- "3003:3003"
volumes:
- /rust/bzzz-v2/data/blobs:/app/blobs:ro
deploy:
replicas: 3
postgres:
image: postgres:15-alpine
networks:
- bzzz-internal
environment:
- POSTGRES_DB=bzzz_v2
- POSTGRES_USER_FILE=/run/secrets/postgres_user
- POSTGRES_PASSWORD_FILE=/run/secrets/postgres_password
volumes:
- /rust/bzzz-v2/data/postgres:/var/lib/postgresql/data
secrets:
- postgres_user
- postgres_password
deploy:
replicas: 1
placement:
constraints:
- node.hostname == walnut
redis:
image: redis:7-alpine
networks:
- bzzz-internal
volumes:
- /rust/bzzz-v2/data/redis:/data
deploy:
replicas: 1
placement:
constraints:
- node.hostname == ironwood
networks:
tengig:
external: true
bzzz-internal:
driver: overlay
internal: true
secrets:
openai_api_key:
external: true
postgres_user:
external: true
postgres_password:
external: true
```
## CI/CD Pipeline Configuration
### GitLab CI Pipeline
```yaml
# .gitlab-ci.yml
stages:
- build
- test
- deploy-staging
- deploy-production
variables:
REGISTRY: registry.home.deepblack.cloud
IMAGE_TAG: ${CI_COMMIT_SHORT_SHA}
build:
stage: build
script:
- docker build -t ${REGISTRY}/bzzz:${IMAGE_TAG} .
- docker build -t ${REGISTRY}/bzzz-mcp:${IMAGE_TAG} -f Dockerfile.mcp .
- docker build -t ${REGISTRY}/bzzz-openai-proxy:${IMAGE_TAG} -f Dockerfile.proxy .
- docker build -t ${REGISTRY}/bzzz-resolver:${IMAGE_TAG} -f Dockerfile.resolver .
- docker push ${REGISTRY}/bzzz:${IMAGE_TAG}
- docker push ${REGISTRY}/bzzz-mcp:${IMAGE_TAG}
- docker push ${REGISTRY}/bzzz-openai-proxy:${IMAGE_TAG}
- docker push ${REGISTRY}/bzzz-resolver:${IMAGE_TAG}
only:
- main
- develop
test-protocol:
stage: test
script:
- go test ./...
- docker run --rm ${REGISTRY}/bzzz:${IMAGE_TAG} /app/test-suite
dependencies:
- build
test-integration:
stage: test
script:
- docker-compose -f docker-compose.test.yml up -d
- ./scripts/integration-tests.sh
- docker-compose -f docker-compose.test.yml down
dependencies:
- build
deploy-staging:
stage: deploy-staging
script:
- docker stack deploy -c docker-compose.staging.yml bzzz-v2-staging
environment:
name: staging
only:
- develop
deploy-production:
stage: deploy-production
script:
- docker stack deploy -c docker-compose.swarm.yml bzzz-v2
environment:
name: production
only:
- main
when: manual
```
## Monitoring and Operations
### Monitoring Stack
```yaml
# docker-compose.monitoring.yml
version: '3.8'
services:
prometheus:
image: prom/prometheus:latest
networks:
- monitoring
volumes:
- ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
- /rust/bzzz-v2/data/prometheus:/prometheus
deploy:
replicas: 1
grafana:
image: grafana/grafana:latest
networks:
- monitoring
- tengig
volumes:
- /rust/bzzz-v2/data/grafana:/var/lib/grafana
deploy:
labels:
- "traefik.enable=true"
- "traefik.http.routers.bzzz-grafana.rule=Host(`bzzz-monitor.deepblack.cloud`)"
alertmanager:
image: prom/alertmanager:latest
networks:
- monitoring
volumes:
- ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml
deploy:
replicas: 1
networks:
monitoring:
driver: overlay
tengig:
external: true
```
### Key Metrics to Monitor
1. **Protocol Metrics**
- DHT lookup latency and success rate
- Content resolution time
- Peer discovery and connection stability
- bzzz:// address resolution performance
2. **Service Metrics**
- MCP server response times
- OpenAI API usage and costs
- Conversation threading performance
- Content store I/O operations
3. **Infrastructure Metrics**
- Docker Swarm service health
- Network connectivity between nodes
- Storage utilization and performance
- Resource utilization (CPU, memory, disk)
### Alerting Configuration
```yaml
# monitoring/alertmanager.yml
global:
smtp_smarthost: 'localhost:587'
smtp_from: 'alerts@deepblack.cloud'
route:
group_by: ['alertname']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'web.hook'
receivers:
- name: 'web.hook'
slack_configs:
- api_url: 'YOUR_SLACK_WEBHOOK_URL'
channel: '#bzzz-alerts'
title: 'BZZZ v2 Alert'
text: '{{ range .Alerts }}{{ .Annotations.summary }}{{ end }}'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'dev', 'instance']
```
## Security and Networking
### Security Architecture
1. **Network Isolation**
- Internal overlay network for inter-service communication
- External network exposure only through Traefik
- Firewall rules restricting P2P ports to local network
2. **Secret Management**
- Docker Swarm secrets for sensitive data
- Encrypted storage of API keys and credentials
- Regular secret rotation procedures
3. **Access Control**
- mTLS for P2P communication
- API authentication and authorization
- Role-based access for MCP endpoints
### Networking Configuration
```bash
# UFW firewall rules for BZZZ v2
sudo ufw allow from 192.168.1.0/24 to any port 9000:9300 proto tcp
sudo ufw allow from 192.168.1.0/24 to any port 5353 proto udp
sudo ufw allow from 192.168.1.0/24 to any port 2377 proto tcp # Docker Swarm
sudo ufw allow from 192.168.1.0/24 to any port 7946 proto tcp # Docker Swarm
sudo ufw allow from 192.168.1.0/24 to any port 4789 proto udp # Docker Swarm
```
## Rollback Procedures
### Automatic Rollback Triggers
1. **Health Check Failures**
- Service health checks failing for > 5 minutes
- DHT network partition detection
- Content store corruption detection
- Critical error rate > 5%
2. **Performance Degradation**
- Response time increase > 200% from baseline
- Memory usage > 90% for > 10 minutes
- Storage I/O errors > 1% rate
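The numeric thresholds above can be condensed into a single trigger check. This sketch evaluates one metrics sample only; the duration conditions (failing for > 5 minutes, memory high for > 10 minutes) are omitted for brevity and would require windowed state in practice.

```python
def should_rollback(error_rate: float, latency_ms: float,
                    baseline_latency_ms: float, mem_used_frac: float) -> bool:
    # Thresholds from the triggers above.
    if error_rate > 0.05:                          # critical error rate > 5%
        return True
    if latency_ms > 3 * baseline_latency_ms:       # increase > 200% == 3x baseline
        return True
    if mem_used_frac > 0.90:                       # memory usage > 90%
        return True
    return False
```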
### Manual Rollback Process
```bash
#!/bin/bash
# rollback-v2.sh - Emergency rollback to v1
set -euo pipefail

echo "🚨 Initiating BZZZ v2 rollback procedure..."

# Step 1: Stop v2 services
docker stack rm bzzz-v2
sleep 30

# Step 2: Restart v1 SystemD services
sudo systemctl start bzzz@walnut
sudo systemctl start bzzz@ironwood
sudo systemctl start bzzz@acacia

# Step 3: Verify v1 connectivity
./scripts/verify-v1-mesh.sh

# Step 4: Update load balancer configuration
./scripts/update-traefik-v1.sh

# Step 5: Notify operations team
curl -X POST "$SLACK_WEBHOOK" \
  -H 'Content-Type: application/json' \
  -d '{"text":"🚨 BZZZ rollback to v1 completed"}'

echo "✅ Rollback completed successfully"
```
## Resource Requirements
### Node Specifications
| Component | CPU | Memory | Storage | Network |
|-----------|-----|---------|---------|---------|
| BZZZ Agent | 2 cores | 4GB | 20GB | 1Gbps |
| MCP Server | 1 core | 2GB | 5GB | 100Mbps |
| OpenAI Proxy | 1 core | 2GB | 5GB | 100Mbps |
| Content Store | 2 cores | 8GB | 500GB | 1Gbps |
| DHT Manager | 1 core | 4GB | 50GB | 1Gbps |
### Scaling Considerations
1. **Horizontal Scaling**
- Add nodes to DHT for increased capacity
- Scale MCP servers based on external demand
- Replicate content store across availability zones
2. **Vertical Scaling**
- Increase memory for larger conversation contexts
- Add storage for content addressing requirements
- Enhance network capacity for P2P traffic
## Operational Procedures
### Daily Operations
1. **Health Monitoring**
- Review Grafana dashboards for anomalies
- Check DHT network connectivity
- Verify content store replication status
- Monitor OpenAI API usage and costs
2. **Maintenance Tasks**
- Log rotation and archival
- Content store garbage collection
- DHT routing table optimization
- Security patch deployment
### Weekly Operations
1. **Performance Review**
- Analyze response time trends
- Review resource utilization patterns
- Assess scaling requirements
- Update capacity planning
2. **Security Audit**
- Review access logs
- Validate secret rotation
- Check for security updates
- Test backup and recovery procedures
### Incident Response
1. **Incident Classification**
- P0: Complete service outage
- P1: Major feature degradation
- P2: Performance issues
- P3: Minor functionality problems
2. **Response Procedures**
- Automated alerting and escalation
- Incident commander assignment
- Communication protocols
- Post-incident review process
This comprehensive infrastructure architecture provides a robust foundation for BZZZ v2 deployment while maintaining operational excellence and enabling future growth. The design prioritizes reliability, security, and maintainability while introducing advanced protocol features required for the next generation of the BZZZ ecosystem.