| # BZZZ Infrastructure Operational Runbook
 | |
| 
 | |
| ## Table of Contents
 | |
| 1. [Quick Reference](#quick-reference)
 | |
| 2. [System Architecture Overview](#system-architecture-overview)
 | |
| 3. [Common Operational Tasks](#common-operational-tasks)
 | |
| 4. [Incident Response Procedures](#incident-response-procedures)
 | |
| 5. [Health Check Procedures](#health-check-procedures)
 | |
| 6. [Performance Tuning](#performance-tuning)
 | |
| 7. [Backup and Recovery](#backup-and-recovery)
 | |
| 8. [Troubleshooting Guide](#troubleshooting-guide)
 | |
| 9. [Maintenance Procedures](#maintenance-procedures)
 | |
| 
 | |
| ## Quick Reference
 | |
| 
 | |
| ### Critical Service Endpoints
 | |
| - **Grafana Dashboard**: https://grafana.chorus.services
 | |
| - **Prometheus**: https://prometheus.chorus.services
 | |
| - **AlertManager**: https://alerts.chorus.services
 | |
| - **BZZZ Main API**: https://bzzz.deepblack.cloud
 | |
| - **Health Checks**: https://bzzz.deepblack.cloud/health
 | |
| 
 | |
| ### Emergency Contacts
 | |
| - **Primary Oncall**: Slack #bzzz-alerts
 | |
| - **System Administrator**: @tony
 | |
| - **Infrastructure Team**: @platform-team
 | |
| 
 | |
| ### Key Commands
 | |
| ```bash
 | |
| # Check system health
 | |
| curl -s https://bzzz.deepblack.cloud/health | jq
 | |
| 
 | |
| # View logs
 | |
| docker service logs bzzz-v2_bzzz-agent -f --tail 100
 | |
| 
 | |
| # Scale service
 | |
| docker service scale bzzz-v2_bzzz-agent=5
 | |
| 
 | |
| # Force service update
 | |
| docker service update --force bzzz-v2_bzzz-agent
 | |
| ```
 | |
| 
 | |
| ## System Architecture Overview
 | |
| 
 | |
| ### Component Relationships
 | |
| ```
 | |
| ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
 | |
| │   PubSub    │────│     DHT     │────│  Election   │
 | |
| │  Messaging  │    │   Storage   │    │   Manager   │
 | |
| └─────────────┘    └─────────────┘    └─────────────┘
 | |
|        │                   │                   │
 | |
|        └───────────────────┼───────────────────┘
 | |
|                            │
 | |
|                     ┌─────────────┐
 | |
|                     │    SLURP    │
 | |
|                     │   Context   │
 | |
|                     │  Generator  │
 | |
|                     └─────────────┘
 | |
|                            │
 | |
|                     ┌─────────────┐
 | |
|                     │    UCXI     │
 | |
|                     │  Protocol   │
 | |
|                     │  Resolver   │
 | |
|                     └─────────────┘
 | |
| ```
 | |
| 
 | |
| ### Data Flow
 | |
| 1. **Task Requests** → PubSub → Task Coordinator → SLURP (if admin)
 | |
| 2. **Context Generation** → DHT Storage → UCXI Resolution
 | |
| 3. **Health Monitoring** → Prometheus → AlertManager → Notifications
 | |
| 
 | |
| ### Critical Dependencies
 | |
| - **Docker Swarm**: Container orchestration
 | |
| - **NFS Storage**: Persistent data storage
 | |
| - **Prometheus Stack**: Monitoring and alerting
 | |
| - **DHT Bootstrap Nodes**: P2P network foundation
 | |
| 
 | |
| ## Common Operational Tasks
 | |
| 
 | |
| ### Service Management
 | |
| 
 | |
| #### Check Service Status
 | |
| ```bash
 | |
| # List all BZZZ services
 | |
| docker service ls | grep bzzz
 | |
| 
 | |
| # Check specific service
 | |
| docker service ps bzzz-v2_bzzz-agent
 | |
| 
 | |
| # View service configuration
 | |
| docker service inspect bzzz-v2_bzzz-agent
 | |
| ```
 | |
| 
 | |
| #### Scale Services
 | |
| ```bash
 | |
| # Scale main BZZZ service
 | |
| docker service scale bzzz-v2_bzzz-agent=5
 | |
| 
 | |
| # Scale monitoring stack
 | |
| docker service scale bzzz-monitoring_prometheus=1
 | |
| docker service scale bzzz-monitoring_grafana=1
 | |
| ```
 | |
| 
 | |
| #### Update Services
 | |
| ```bash
 | |
| # Update to new image version
 | |
| docker service update \
 | |
|   --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
 | |
|   bzzz-v2_bzzz-agent
 | |
| 
 | |
| # Update environment variables
 | |
| docker service update \
 | |
|   --env-add LOG_LEVEL=debug \
 | |
|   bzzz-v2_bzzz-agent
 | |
| 
 | |
| # Update resource limits
 | |
| docker service update \
 | |
|   --limit-memory 4G \
 | |
|   --limit-cpu 2 \
 | |
|   bzzz-v2_bzzz-agent
 | |
| ```
 | |
| 
 | |
| ### Configuration Management
 | |
| 
 | |
| #### Update Docker Secrets
 | |
| ```bash
 | |
| # Create new secret (printf avoids the trailing newline that echo would store in the secret)
 | |
| printf '%s' "new_password" | docker secret create bzzz_postgres_password_v2 -
 | |
| 
 | |
| # Update service to use new secret
 | |
| docker service update \
 | |
|   --secret-rm bzzz_postgres_password \
 | |
|   --secret-add bzzz_postgres_password_v2 \
 | |
|   bzzz-v2_postgres
 | |
| ```
 | |
| 
 | |
| #### Update Docker Configs
 | |
| ```bash
 | |
| # Create new config
 | |
| docker config create bzzz_v2_config_v3 /path/to/new/config.yaml
 | |
| 
 | |
| # Update service
 | |
| docker service update \
 | |
|   --config-rm bzzz_v2_config \
 | |
|   --config-add source=bzzz_v2_config_v3,target=/app/config/config.yaml \
 | |
|   bzzz-v2_bzzz-agent
 | |
| ```
 | |
| 
 | |
| ### Monitoring and Alerting
 | |
| 
 | |
| #### Check Alert Status
 | |
| ```bash
 | |
| # View active alerts
 | |
| curl -s http://alertmanager:9093/api/v1/alerts | jq '.data[] | select(.status.state == "active")'
 | |
| 
 | |
| # Silence alert
 | |
| curl -X POST http://alertmanager:9093/api/v1/silences \
 | |
|   -H 'Content-Type: application/json' \
 | |
|   -d '{
 | |
|     "matchers": [{"name": "alertname", "value": "BZZZSystemHealthCritical", "isRegex": false}],
 | |
|     "startsAt": "2025-01-01T00:00:00Z",
 | |
|     "endsAt": "2025-01-01T01:00:00Z",
 | |
|     "comment": "Maintenance window",
 | |
|     "createdBy": "operator"
 | |
|   }'
 | |
| ```
 | |
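
The silence request above hard-codes its `startsAt` and `endsAt` values, which are easy to forget to update. The following is a minimal sketch of a helper that computes the window from a start time and a duration. The function names `rfc3339_utc` and `silence_payload` are hypothetical, GNU `date` is assumed, and the comment and author strings are inserted without JSON escaping, so keep them free of double quotes.

```bash
# Format an epoch timestamp as an RFC 3339 UTC string.
# Assumption: GNU date (the -d "@epoch" form).
rfc3339_utc() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# Build a silence request body for one alert name (hypothetical helper).
# Usage: silence_payload <alertname> <start_epoch> <duration_seconds> <comment> <created_by>
silence_payload() {
  cat <<EOF
{
  "matchers": [{"name": "alertname", "value": "$1", "isRegex": false}],
  "startsAt": "$(rfc3339_utc "$2")",
  "endsAt": "$(rfc3339_utc "$(($2 + $3))")",
  "comment": "$4",
  "createdBy": "$5"
}
EOF
}

# Example (not executed here): silence for one hour starting now.
# silence_payload BZZZSystemHealthCritical "$(date +%s)" 3600 "Maintenance window" operator |
#   curl -X POST -H 'Content-Type: application/json' -d @- http://alertmanager:9093/api/v1/silences
```

Passing the start time as an argument keeps the helper deterministic, which makes it straightforward to check before pointing it at AlertManager.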
| 
 | |
| #### Query Metrics
 | |
| ```bash
 | |
| # Check system health
 | |
| curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq
 | |
| 
 | |
| # Check connected peers
 | |
| curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' | jq
 | |
| 
 | |
| # Check error rates (-G with --data-urlencode stops curl from treating [5m] as a URL glob)
 | |
| curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=rate(bzzz_errors_total[5m])' | jq
 | |
| ```
 | |
| 
 | |
| ## Incident Response Procedures
 | |
| 
 | |
| ### Severity Levels
 | |
| 
 | |
| #### Critical (P0)
 | |
| - System completely unavailable
 | |
| - Data loss or corruption
 | |
| - Security breach
 | |
| - **Response Time**: 15 minutes
 | |
| - **Resolution Target**: 2 hours
 | |
| 
 | |
| #### High (P1)
 | |
| - Major functionality impaired
 | |
| - Performance severely degraded
 | |
| - **Response Time**: 1 hour
 | |
| - **Resolution Target**: 4 hours
 | |
| 
 | |
| #### Medium (P2)
 | |
| - Minor functionality issues
 | |
| - Performance slightly degraded
 | |
| - **Response Time**: 4 hours
 | |
| - **Resolution Target**: 24 hours
 | |
| 
 | |
| #### Low (P3)
 | |
| - Cosmetic issues
 | |
| - Enhancement requests
 | |
| - **Response Time**: 24 hours
 | |
| - **Resolution Target**: 1 week
 | |
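
For scripting around these targets (for example in a paging or ticketing hook), the four severity levels can be expressed as a lookup. This is only a sketch: the function name `sla_targets` is hypothetical, and the values are the response and resolution targets listed above, converted to minutes.

```bash
# Print "<response_minutes> <resolution_minutes>" for a severity level.
# Values mirror the Severity Levels list; the function name is illustrative.
sla_targets() {
  case "$1" in
    P0) echo "15 120" ;;      # 15 minutes / 2 hours
    P1) echo "60 240" ;;      # 1 hour / 4 hours
    P2) echo "240 1440" ;;    # 4 hours / 24 hours
    P3) echo "1440 10080" ;;  # 24 hours / 1 week
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Example: read both targets into variables.
# set -- $(sla_targets P1); echo "respond within $1 min, resolve within $2 min"
```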
| 
 | |
| ### Common Incident Scenarios
 | |
| 
 | |
| #### System Health Critical (Alert: BZZZSystemHealthCritical)
 | |
| 
 | |
| **Symptoms**: System health score < 0.5
 | |
| 
 | |
| **Immediate Actions**:
 | |
| 1. Check Grafana dashboard for component failures
 | |
| 2. Review recent deployments or changes
 | |
| 3. Check resource utilization (CPU, memory, disk)
 | |
| 4. Verify P2P connectivity
 | |
| 
 | |
| **Investigation Steps**:
 | |
| ```bash
 | |
| # Check overall system status
 | |
| curl -s https://bzzz.deepblack.cloud/health | jq
 | |
| 
 | |
| # Check component health
 | |
| curl -s https://bzzz.deepblack.cloud/health/checks | jq
 | |
| 
 | |
| # Review recent logs
 | |
| docker service logs bzzz-v2_bzzz-agent --since 1h | tail -100
 | |
| 
 | |
| # Check resource usage
 | |
| docker stats --no-stream
 | |
| ```
 | |
| 
 | |
| **Recovery Actions**:
 | |
| 1. If memory leak: Restart affected services
 | |
| 2. If disk full: Clean up logs and temporary files
 | |
| 3. If network issues: Restart networking components
 | |
| 4. If database issues: Check PostgreSQL health
 | |
| 
 | |
| #### P2P Network Partition (Alert: BZZZInsufficientPeers)
 | |
| 
 | |
| **Symptoms**: Connected peers < 3
 | |
| 
 | |
| **Immediate Actions**:
 | |
| 1. Check network connectivity between nodes
 | |
| 2. Verify DHT bootstrap nodes are running
 | |
| 3. Check firewall rules and port accessibility
 | |
| 
 | |
| **Investigation Steps**:
 | |
| ```bash
 | |
| # Check DHT bootstrap nodes
 | |
| for node in walnut:9101 ironwood:9102 acacia:9103; do
 | |
|   echo "Checking $node:"
 | |
|   nc -zv ${node%:*} ${node#*:}
 | |
| done
 | |
| 
 | |
| # Check P2P connectivity
 | |
| docker service logs bzzz-v2_dht-bootstrap-walnut --since 1h
 | |
| 
 | |
| # Test network between nodes
 | |
| docker run --rm --network host nicolaka/netshoot ping -c 3 ironwood
 | |
| ```
 | |
| 
 | |
| **Recovery Actions**:
 | |
| 1. Restart DHT bootstrap services
 | |
| 2. Clear peer store if corrupted
 | |
| 3. Check and fix network configuration
 | |
| 4. Restart affected BZZZ agents
 | |
| 
 | |
| #### Election System Failure (Alert: BZZZNoAdminElected)
 | |
| 
 | |
| **Symptoms**: No admin elected or frequent leadership changes
 | |
| 
 | |
| **Immediate Actions**:
 | |
| 1. Check election state on all nodes
 | |
| 2. Review heartbeat status
 | |
| 3. Verify role configurations
 | |
| 
 | |
| **Investigation Steps**:
 | |
| ```bash
 | |
| # Check election status on each node (assumes SSH access to the Swarm nodes)
 | |
| for node in walnut ironwood acacia; do
 | |
|   echo "Node $node election status:"
 | |
|   ssh "$node" 'docker exec $(docker ps -q -f name=bzzz-agent | head -n 1) curl -s localhost:8081/health/checks' \
 | |
|     | jq '.checks["election-health"]'
 | |
| done
 | |
| 
 | |
| # Check role configurations
 | |
| docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | grep -A5 -B5 role
 | |
| ```
 | |
| 
 | |
| **Recovery Actions**:
 | |
| 1. Force re-election by restarting election managers
 | |
| 2. Fix role configuration issues
 | |
| 3. Clear election state if corrupted
 | |
| 4. Ensure at least one node has admin capabilities
 | |
| 
 | |
| #### DHT Replication Failure (Alert: BZZZDHTReplicationDegraded)
 | |
| 
 | |
| **Symptoms**: Average replication factor < 2
 | |
| 
 | |
| **Immediate Actions**:
 | |
| 1. Check DHT provider records
 | |
| 2. Verify replication manager status
 | |
| 3. Check storage availability
 | |
| 
 | |
| **Investigation Steps**:
 | |
| ```bash
 | |
| # Check DHT metrics
 | |
| curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_replication_factor' | jq
 | |
| 
 | |
| # Check provider records
 | |
| curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_provider_records' | jq
 | |
| 
 | |
| # Check replication manager logs
 | |
| docker service logs bzzz-v2_bzzz-agent | grep -i replication
 | |
| ```
 | |
| 
 | |
| **Recovery Actions**:
 | |
| 1. Restart replication managers
 | |
| 2. Force re-provision of content
 | |
| 3. Check and fix storage issues
 | |
| 4. Verify DHT network connectivity
 | |
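
The symptom thresholds quoted in the scenarios above (health score < 0.5, connected peers < 3, average replication factor < 2) can be checked together once the three values have been read from Prometheus. The sketch below assumes you already have the numbers. The helper names `below` and `triggered_alerts` are hypothetical, and the real alert rules may add conditions (such as a `for:` duration) that are not modelled here.

```bash
# Succeeds (exit 0) when value < threshold; awk handles fractional values.
below() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit !(v + 0 < t + 0) }'
}

# Print the alert names whose symptom threshold is crossed.
# Usage: triggered_alerts <health_score> <connected_peers> <replication_factor>
triggered_alerts() {
  below "$1" 0.5 && echo "BZZZSystemHealthCritical"
  below "$2" 3   && echo "BZZZInsufficientPeers"
  below "$3" 2   && echo "BZZZDHTReplicationDegraded"
  return 0
}

# Example: triggered_alerts 0.42 5 1.5
```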
| 
 | |
| ### Escalation Procedures
 | |
| 
 | |
| #### When to Escalate
 | |
| - Unable to resolve P0/P1 incident within target time
 | |
| - Incident requires specialized knowledge
 | |
| - Multiple systems affected
 | |
| - Potential security implications
 | |
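
The first escalation criterion is mechanical: compare how long the incident has been open with its resolution target (2 hours for P0, 4 hours for P1). A minimal sketch, assuming the incident start time is available as a Unix epoch; the function name `past_target` is hypothetical.

```bash
# Succeeds when an incident has been open longer than its resolution target.
# Usage: past_target <opened_epoch> <target_minutes> [now_epoch]
past_target() {
  _pt_now=${3:-$(date +%s)}
  _pt_elapsed=$(( (_pt_now - $1) / 60 ))   # whole minutes since the incident opened
  [ "$_pt_elapsed" -gt "$2" ]
}

# Example: a P0 (120-minute target) opened at epoch 1735689600.
# past_target 1735689600 120 && echo "escalate to @tech-lead"
```

The optional third argument exists so the check can be run against a fixed "now", which is convenient when reviewing an incident after the fact.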
| 
 | |
| #### Escalation Contacts
 | |
| 1. **Technical Lead**: @tech-lead (Slack)
 | |
| 2. **Infrastructure Team**: @infra-team (Slack)
 | |
| 3. **Management**: @management (for business-critical issues)
 | |
| 
 | |
| ## Health Check Procedures
 | |
| 
 | |
| ### Manual Health Verification
 | |
| 
 | |
| #### System-Level Checks
 | |
| ```bash
 | |
| # 1. Overall system health
 | |
| curl -s https://bzzz.deepblack.cloud/health | jq '.status'
 | |
| 
 | |
| # 2. Component health checks
 | |
| curl -s https://bzzz.deepblack.cloud/health/checks | jq
 | |
| 
 | |
| # 3. Resource utilization
 | |
| docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
 | |
| 
 | |
| # 4. Service status
 | |
| docker service ls | grep bzzz
 | |
| 
 | |
| # 5. Network connectivity
 | |
| docker network ls | grep bzzz
 | |
| ```
 | |
| 
 | |
| #### Component-Specific Checks
 | |
| 
 | |
| **P2P Network**:
 | |
| ```bash
 | |
| # Check connected peers
 | |
| curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers'
 | |
| 
 | |
| # Test P2P messaging
 | |
| docker exec -it $(docker ps -q -f name=bzzz-agent) \
 | |
|   /app/bzzz test-p2p-message
 | |
| ```
 | |
| 
 | |
| **DHT Storage**:
 | |
| ```bash
 | |
| # Check DHT operations
 | |
| curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=rate(bzzz_dht_put_operations_total[5m])'
 | |
| 
 | |
| # Test DHT functionality
 | |
| docker exec -it $(docker ps -q -f name=bzzz-agent) \
 | |
|   /app/bzzz test-dht-operations
 | |
| ```
 | |
| 
 | |
| **Election System**:
 | |
| ```bash
 | |
| # Check current admin
 | |
| curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_election_state'
 | |
| 
 | |
| # Check heartbeat status
 | |
| curl -s https://bzzz.deepblack.cloud/api/election/status | jq
 | |
| ```
 | |
| 
 | |
| ### Automated Health Monitoring
 | |
| 
 | |
| #### Prometheus Queries for Health
 | |
| ```promql
 | |
| # Overall system health
 | |
| bzzz_system_health_score
 | |
| 
 | |
| # Component health scores
 | |
| bzzz_component_health_score
 | |
| 
 | |
| # SLI compliance
 | |
| rate(bzzz_health_checks_passed_total[5m]) / (rate(bzzz_health_checks_failed_total[5m]) + rate(bzzz_health_checks_passed_total[5m]))
 | |
| 
 | |
| # Error budget burn rate
 | |
| 1 - bzzz:dht_success_rate > 0.01  # 1% error budget
 | |
| ```
 | |
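
To make the SLI and error-budget arithmetic concrete, here is a small worked sketch on raw counts instead of Prometheus rates. The function names `sli_ratio` and `budget_exceeded` are hypothetical, and the 1% default mirrors the error budget in the query above.

```bash
# Fraction of checks that passed, to four decimal places.
# Usage: sli_ratio <passed> <failed>
sli_ratio() {
  awk -v p="$1" -v f="$2" 'BEGIN {
    total = p + f
    if (total == 0) { print "n/a"; exit }
    printf "%.4f\n", p / total
  }'
}

# Succeeds when the failure fraction exceeds the error budget (default 1%).
# Usage: budget_exceeded <passed> <failed> [budget]
budget_exceeded() {
  awk -v p="$1" -v f="$2" -v b="${3:-0.01}" 'BEGIN {
    total = p + f
    exit !(total > 0 && f / total > b)
  }'
}

# Example: 985 passed and 15 failed gives an SLI of 0.9850, and the
# 1.5% failure fraction is over the 1% budget.
# sli_ratio 985 15; budget_exceeded 985 15 && echo "budget burned"
```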
| 
 | |
| #### Alert Validation
 | |
| After resolving issues, verify alerts clear:
 | |
| ```bash
 | |
| # Check if alerts are resolved
 | |
| curl -s http://alertmanager:9093/api/v1/alerts | \
 | |
|   jq '.data[] | select(.status.state == "active") | .labels.alertname'
 | |
| ```
 | |
| 
 | |
| ## Performance Tuning
 | |
| 
 | |
| ### Resource Optimization
 | |
| 
 | |
| #### Memory Tuning
 | |
| ```bash
 | |
| # Increase memory limits for heavy workloads
 | |
| docker service update --limit-memory 8G bzzz-v2_bzzz-agent
 | |
| 
 | |
| # Set a Go runtime soft memory limit (the agent is profiled with go tool pprof; GOMEMLIMIT needs Go 1.19+)
 | |
| docker service update \
 | |
|   --env-add GOMEMLIMIT=4GiB \
 | |
|   bzzz-v2_bzzz-agent
 | |
| ```
 | |
| 
 | |
| #### CPU Optimization
 | |
| ```bash
 | |
| # Adjust CPU limits
 | |
| docker service update --limit-cpu 4 bzzz-v2_bzzz-agent
 | |
| 
 | |
| # Pin critical services to nodes labelled as high-performance
 | |
| docker service update \
 | |
|   --constraint-add "node.labels.cpu_type==high_performance" \
 | |
|   bzzz-v2_bzzz-agent
 | |
| ```
 | |
| 
 | |
| #### Network Optimization
 | |
| ```bash
 | |
| # Optimize network buffer sizes
 | |
| echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf
 | |
| echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf
 | |
| sysctl -p
 | |
| ```
 | |
| 
 | |
| ### Application-Level Tuning
 | |
| 
 | |
| #### DHT Performance
 | |
| - Increase replication factor for critical content
 | |
| - Optimize provider record refresh intervals
 | |
| - Tune cache sizes based on memory availability
 | |
| 
 | |
| #### PubSub Performance  
 | |
| - Adjust message batch sizes
 | |
| - Optimize topic subscription patterns
 | |
| - Configure message retention policies
 | |
| 
 | |
| #### Election Stability
 | |
| - Tune heartbeat intervals
 | |
| - Adjust election timeouts based on network latency
 | |
| - Optimize candidate scoring algorithms
 | |
| 
 | |
| ### Monitoring Performance Impact
 | |
| ```bash
 | |
| # Before tuning - capture baseline
 | |
| curl -sG 'http://prometheus:9090/api/v1/query_range' \
 | |
|   --data-urlencode 'query=rate(bzzz_dht_operation_latency_seconds_sum[5m])/rate(bzzz_dht_operation_latency_seconds_count[5m])' \
 | |
|   --data-urlencode 'start=2025-01-01T00:00:00Z' \
 | |
|   --data-urlencode 'end=2025-01-01T01:00:00Z' \
 | |
|   --data-urlencode 'step=60s'
 | |
| 
 | |
| # After tuning - compare results
 | |
| # Use Grafana dashboards to visualize improvements
 | |
| ```
 | |
| 
 | |
| ## Backup and Recovery
 | |
| 
 | |
| ### Critical Data Identification
 | |
| 
 | |
| #### Persistent Data
 | |
| - **PostgreSQL Database**: User data, task history, conversation threads
 | |
| - **DHT Content**: Distributed content storage
 | |
| - **Configuration**: Docker secrets, configs, service definitions
 | |
| - **Prometheus Data**: Historical metrics (optional but valuable)
 | |
| 
 | |
| #### Backup Schedule
 | |
| - **PostgreSQL**: Daily full backup, continuous WAL archiving
 | |
| - **Configuration**: Weekly backup, immediately after changes
 | |
| - **Prometheus**: Weekly backup of selected metrics
 | |
| 
 | |
| ### Backup Procedures
 | |
| 
 | |
| #### Database Backup
 | |
| ```bash
 | |
| # Create a compressed database backup (one timestamp, so the dump and the archive name match)
 | |
| TS=$(date +%Y%m%d_%H%M%S)
 | |
| docker exec $(docker ps -q -f name=postgres) \
 | |
|   pg_dump -U bzzz -d bzzz_v2 | gzip > /rust/bzzz-v2/backups/bzzz_${TS}.sql.gz
 | |
| 
 | |
| # Upload to off-site storage
 | |
| aws s3 cp /rust/bzzz-v2/backups/ s3://chorus-backups/bzzz/ --recursive
 | |
| ```
 | |
| 
 | |
| #### Configuration Backup
 | |
| ```bash
 | |
| # Export secret metadata (Docker never returns secret values; keep the values themselves in a separate secure store)
 | |
| for secret in $(docker secret ls -q); do
 | |
|   docker secret inspect $secret > /backup/secrets/${secret}.json
 | |
| done
 | |
| 
 | |
| # Export all configs
 | |
| for config in $(docker config ls -q); do
 | |
|   docker config inspect $config > /backup/configs/${config}.json
 | |
| done
 | |
| 
 | |
| # Export service definitions as a single JSON array
 | |
| docker service inspect $(docker service ls -q) > /backup/services.json
 | |
| ```
 | |
| 
 | |
| #### Prometheus Data Backup
 | |
| ```bash
 | |
| # Snapshot Prometheus data (requires Prometheus to run with --web.enable-admin-api)
 | |
| SNAPSHOT=$(curl -s -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')
 | |
| 
 | |
| # Copy the named snapshot to the backup location
 | |
| docker cp "prometheus_container:/prometheus/snapshots/${SNAPSHOT}" /backup/prometheus/$(date +%Y%m%d)
 | |
| ```
 | |
| 
 | |
| ### Recovery Procedures
 | |
| 
 | |
| #### Full System Recovery
 | |
| 1. **Restore Infrastructure**: Deploy Docker Swarm stack
 | |
| 2. **Restore Configuration**: Import secrets and configs
 | |
| 3. **Restore Database**: Restore PostgreSQL from backup
 | |
| 4. **Validate Services**: Verify all services are healthy
 | |
| 5. **Test Functionality**: Run end-to-end tests
 | |
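
Step 4 usually means polling until the services report healthy instead of checking once. The following is a minimal sketch of a generic retry helper. The name `wait_until` is hypothetical, and the example in the comments assumes the health endpoint returns a non-2xx status while the system is unhealthy, which this runbook does not state.

```bash
# Retry a command until it succeeds or the attempt budget is exhausted.
# Usage: wait_until <max_attempts> <sleep_seconds> <command> [args...]
wait_until() {
  _wu_max=$1
  _wu_pause=$2
  shift 2
  _wu_attempt=1
  while [ "$_wu_attempt" -le "$_wu_max" ]; do
    if "$@"; then
      return 0
    fi
    _wu_attempt=$((_wu_attempt + 1))
    # Sleep only if another attempt will follow.
    [ "$_wu_attempt" -le "$_wu_max" ] && sleep "$_wu_pause"
  done
  return 1
}

# Example (not executed here): poll the health endpoint for up to 5 minutes.
# wait_until 30 10 curl -fsS -o /dev/null https://bzzz.deepblack.cloud/health \
#   || echo "services did not become healthy in time" >&2
```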
| 
 | |
| #### Database Recovery
 | |
| ```bash
 | |
| # Stop application services
 | |
| docker service scale bzzz-v2_bzzz-agent=0
 | |
| 
 | |
| # Restore database
 | |
| gunzip -c /rust/bzzz-v2/backups/bzzz_20250101_120000.sql.gz | \
 | |
|   docker exec -i $(docker ps -q -f name=postgres) \
 | |
|   psql -U bzzz -d bzzz_v2
 | |
| 
 | |
| # Start application services
 | |
| docker service scale bzzz-v2_bzzz-agent=3
 | |
| ```
 | |
| 
 | |
| #### Point-in-Time Recovery
 | |
| ```bash
 | |
| # Take a base backup (the starting point for WAL-based recovery)
 | |
| docker exec $(docker ps -q -f name=postgres) \
 | |
|   pg_basebackup -U postgres -D /backup/base -X stream -P
 | |
| 
 | |
| # Restore to specific time
 | |
| # (Implementation depends on PostgreSQL configuration)
 | |
| ```
 | |
| 
 | |
| ### Recovery Testing
 | |
| 
 | |
| #### Monthly Recovery Tests
 | |
| ```bash
 | |
| # Test database restore
 | |
| ./scripts/test-db-restore.sh
 | |
| 
 | |
| # Test configuration restore  
 | |
| ./scripts/test-config-restore.sh
 | |
| 
 | |
| # Test full system restore (staging environment)
 | |
| ./scripts/test-full-restore.sh staging
 | |
| ```
 | |
| 
 | |
| #### Recovery Validation
 | |
| - Verify all services start successfully
 | |
| - Check data integrity and completeness
 | |
| - Validate P2P network connectivity
 | |
| - Test core functionality (task coordination, context generation)
 | |
| - Monitor system health for 24 hours post-recovery
 | |
| 
 | |
| ## Troubleshooting Guide
 | |
| 
 | |
| ### Log Analysis
 | |
| 
 | |
| #### Centralized Logging
 | |
| ```bash
 | |
| # View aggregated logs through Loki
 | |
| curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
 | |
|   --data-urlencode 'query={job="bzzz"}' \
 | |
|   --data-urlencode 'start=2025-01-01T00:00:00Z' \
 | |
|   --data-urlencode 'end=2025-01-01T01:00:00Z' | jq
 | |
| 
 | |
| # Search for specific errors
 | |
| curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
 | |
|   --data-urlencode 'query={job="bzzz"} |= "ERROR"' | jq
 | |
| ```
 | |
| 
 | |
| #### Service-Specific Logs
 | |
| ```bash
 | |
| # BZZZ agent logs
 | |
| docker service logs bzzz-v2_bzzz-agent -f --tail 100
 | |
| 
 | |
| # DHT bootstrap logs
 | |
| docker service logs bzzz-v2_dht-bootstrap-walnut -f
 | |
| 
 | |
| # Database logs
 | |
| docker service logs bzzz-v2_postgres -f
 | |
| 
 | |
| # Filter for specific patterns
 | |
| docker service logs bzzz-v2_bzzz-agent | grep -E "(ERROR|FATAL|panic)"
 | |
| ```
 | |
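
When the filtered output is long, a per-pattern count is often a quicker first look than the raw lines. A small sketch that reads log lines on standard input and counts the same three patterns as the `grep -E` filter above; the function name `count_log_levels` is hypothetical.

```bash
# Count lines containing ERROR, FATAL, or panic in the log stream on stdin.
count_log_levels() {
  awk '
    /ERROR/ { n["ERROR"]++ }
    /FATAL/ { n["FATAL"]++ }
    /panic/ { n["panic"]++ }
    END { printf "ERROR=%d FATAL=%d panic=%d\n", n["ERROR"], n["FATAL"], n["panic"] }
  '
}

# Example (not executed here):
# docker service logs bzzz-v2_bzzz-agent --since 1h 2>&1 | count_log_levels
```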
| 
 | |
| ### Common Issues and Solutions
 | |
| 
 | |
| #### "No Admin Elected" Error
 | |
| ```bash
 | |
| # Check role configurations
 | |
| docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | yq '.agent.role'
 | |
| 
 | |
| # Force election
 | |
| docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz trigger-election
 | |
| 
 | |
| # Restart election managers
 | |
| docker service update --force bzzz-v2_bzzz-agent
 | |
| ```
 | |
| 
 | |
| #### "DHT Operations Failing" Error
 | |
| ```bash
 | |
| # Check DHT bootstrap nodes
 | |
| for port in 9101 9102 9103; do
 | |
|   nc -zv localhost $port
 | |
| done
 | |
| 
 | |
| # Restart DHT services
 | |
| docker service update --force bzzz-v2_dht-bootstrap-walnut
 | |
| docker service update --force bzzz-v2_dht-bootstrap-ironwood
 | |
| docker service update --force bzzz-v2_dht-bootstrap-acacia
 | |
| 
 | |
| # Clear DHT cache
 | |
| docker exec -it $(docker ps -q -f name=bzzz-agent) rm -rf /app/data/dht/cache/*
 | |
| ```
 | |
| 
 | |
| #### "High Memory Usage" Alert
 | |
| ```bash
 | |
| # Identify memory-hungry containers (highest memory percentage first)
 | |
| docker stats --no-stream --format "{{.MemPerc}}\t{{.Container}}\t{{.MemUsage}}" | sort -rn
 | |
| 
 | |
| # Check for memory leaks (heap profile from the agent's debug endpoint)
 | |
| curl -s http://localhost:8080/debug/pprof/heap > heap.prof && go tool pprof -top heap.prof
 | |
| 
 | |
| # Restart high-memory services
 | |
| docker service update --force bzzz-v2_bzzz-agent
 | |
| ```
 | |
| 
 | |
| #### "Network Connectivity Issues"
 | |
| ```bash
 | |
| # Check overlay network
 | |
| docker network inspect bzzz-internal
 | |
| 
 | |
| # Test connectivity between services
 | |
| docker run --rm --network bzzz-internal nicolaka/netshoot ping -c 3 postgres
 | |
| 
 | |
| # Check firewall rules
 | |
| iptables -L -n | grep -E "(9000|9101|9102|9103)"
 | |
| 
 | |
| # Restart networking
 | |
| docker network disconnect bzzz-internal $(docker ps -q -f name=bzzz-agent)
 | |
| docker network connect bzzz-internal $(docker ps -q -f name=bzzz-agent)
 | |
| ```
 | |
| 
 | |
| ### Performance Issues
 | |
| 
 | |
| #### High Latency Diagnosis
 | |
| ```bash
 | |
| # Check operation latencies
 | |
| curl -sG 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket[5m]))'
 | |
| 
 | |
| # Identify bottlenecks
 | |
| docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz profile-cpu 30
 | |
| 
 | |
| # Check network latency between nodes
 | |
| for node in walnut ironwood acacia; do
 | |
|   ping -c 10 $node | tail -1
 | |
| done
 | |
| ```
 | |
| 
 | |
| #### Resource Contention
 | |
| ```bash
 | |
| # Check CPU usage
 | |
| docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}"
 | |
| 
 | |
| # Check I/O wait
 | |
| iostat -x 1 5
 | |
| 
 | |
| # Check network utilization
 | |
| iftop -i eth0
 | |
| ```
 | |
| 
 | |
| ### Debugging Tools
 | |
| 
 | |
| #### Application Debugging
 | |
| ```bash
 | |
| # Enable debug logging
 | |
| docker service update --env-add LOG_LEVEL=debug bzzz-v2_bzzz-agent
 | |
| 
 | |
| # Access debug endpoints
 | |
| curl -s http://localhost:8080/debug/pprof/heap > heap.prof
 | |
| go tool pprof heap.prof
 | |
| 
 | |
| # Trace requests
 | |
| curl -s http://localhost:8080/debug/requests
 | |
| ```
 | |
| 
 | |
| #### System Debugging
 | |
| ```bash
 | |
| # System resource usage
 | |
| htop
 | |
| iotop
 | |
| nethogs
 | |
| 
 | |
| # Process analysis
 | |
| ps aux --sort=-%cpu | head -20
 | |
| ps aux --sort=-%mem | head -20
 | |
| 
 | |
| # Network analysis
 | |
| netstat -tulpn | grep -E ":9000|:9101|:9102|:9103"
 | |
| ss -tuln | grep -E ":9000|:9101|:9102|:9103"
 | |
| ```
 | |
| 
 | |
| ## Maintenance Procedures
 | |
| 
 | |
| ### Scheduled Maintenance
 | |
| 
 | |
| #### Weekly Maintenance (Low-impact)
 | |
| - Review system health metrics
 | |
| - Check log sizes and rotate if necessary
 | |
| - Update monitoring dashboards
 | |
| - Validate backup integrity
 | |
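
For the weekly "Validate backup integrity" item, a basic check is that each archive exists, is non-empty, and decompresses cleanly. This sketch assumes gzip-compressed SQL dumps like the `.sql.gz` file used in Database Recovery. The function name `verify_backup` is hypothetical, and a clean archive does not prove the dump restores, which is what the monthly recovery tests are for.

```bash
# Check that a gzip-compressed backup exists, is non-empty, and is not corrupt.
# Usage: verify_backup <path-to-.sql.gz>
verify_backup() {
  [ -s "$1" ] || { echo "missing or empty: $1" >&2; return 1; }
  gzip -t "$1" 2>/dev/null || { echo "corrupt archive: $1" >&2; return 1; }
  echo "ok: $1"
}

# Example (not executed here): check every archive in the backup directory.
# for f in /rust/bzzz-v2/backups/*.sql.gz; do verify_backup "$f"; done
```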
| 
 | |
| #### Monthly Maintenance (Medium-impact)
 | |
| - Update non-critical components
 | |
| - Perform capacity planning review
 | |
| - Test disaster recovery procedures
 | |
| - Security scan and updates
 | |
| 
 | |
| #### Quarterly Maintenance (High-impact)
 | |
| - Major version updates
 | |
| - Infrastructure upgrades
 | |
| - Performance optimization review
 | |
| - Security audit and remediation
 | |
| 
 | |
| ### Update Procedures
 | |
| 
 | |
| #### Rolling Updates
 | |
| ```bash
 | |
| # Update with zero downtime
 | |
| docker service update \
 | |
|   --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
 | |
|   --update-parallelism 1 \
 | |
|   --update-delay 30s \
 | |
|   --update-failure-action rollback \
 | |
|   bzzz-v2_bzzz-agent
 | |
| ```
 | |
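
With `--update-parallelism 1` and `--update-delay 30s`, the pauses alone set a floor on how long a rollout takes. A rough sketch of that arithmetic follows. The function name `rollout_delay_seconds` is hypothetical, and the figure is a lower bound that ignores the time each task needs to start and pass its health check.

```bash
# Minimum time spent in --update-delay pauses during a rolling update.
# Usage: rollout_delay_seconds <replicas> <parallelism> <delay_seconds>
rollout_delay_seconds() {
  _rd_batches=$(( ($1 + $2 - 1) / $2 ))   # ceil(replicas / parallelism)
  echo $(( (_rd_batches - 1) * $3 ))      # one pause between consecutive batches
}

# Example: 5 replicas, one task at a time, 30 seconds between batches.
# rollout_delay_seconds 5 1 30
```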
| 
 | |
| #### Configuration Updates
 | |
| ```bash
 | |
| # Roll out new configuration (Swarm restarts the service's tasks to apply it)
 | |
| docker config create bzzz_v2_config_new /path/to/new/config.yaml
 | |
| 
 | |
| docker service update \
 | |
|   --config-rm bzzz_v2_config \
 | |
|   --config-add source=bzzz_v2_config_new,target=/app/config/config.yaml \
 | |
|   bzzz-v2_bzzz-agent
 | |
| 
 | |
| # Cleanup old config
 | |
| docker config rm bzzz_v2_config
 | |
| ```
 | |
| 
 | |
| #### Database Maintenance
 | |
| ```bash
 | |
| # Database optimization
 | |
| docker exec -it $(docker ps -q -f name=postgres) \
 | |
|   psql -U bzzz -d bzzz_v2 -c "VACUUM ANALYZE;"
 | |
| 
 | |
| # Update statistics
 | |
| docker exec -it $(docker ps -q -f name=postgres) \
 | |
|   psql -U bzzz -d bzzz_v2 -c "ANALYZE;"
 | |
| 
 | |
| # Check database size
 | |
| docker exec -it $(docker ps -q -f name=postgres) \
 | |
|   psql -U bzzz -d bzzz_v2 -c "SELECT pg_size_pretty(pg_database_size('bzzz_v2'));"
 | |
| ```
 | |
| 
 | |
| ### Capacity Planning
 | |
| 
 | |
| #### Growth Projections
 | |
| - Monitor resource usage trends over time
 | |
| - Project capacity needs based on growth patterns
 | |
| - Plan for seasonal or event-driven spikes
 | |
| 
 | |
| #### Scaling Decisions
 | |
| ```bash
 | |
| # Horizontal scaling
 | |
| docker service scale bzzz-v2_bzzz-agent=5
 | |
| 
 | |
| # Vertical scaling
 | |
| docker service update \
 | |
|   --limit-memory 8G \
 | |
|   --limit-cpu 4 \
 | |
|   bzzz-v2_bzzz-agent
 | |
| 
 | |
| # Add new node to swarm (prints the join command to run on the new worker)
 | |
| docker swarm join-token worker
 | |
| ```
 | |
| 
 | |
| #### Resource Monitoring
 | |
| - Set up capacity alerts at 70% utilization
 | |
| - Monitor growth rate and extrapolate
 | |
| - Plan infrastructure expansions 3-6 months ahead
 | |
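
"Monitor growth rate and extrapolate" can be as simple as a linear projection towards the 70% alert threshold. The sketch below assumes utilization is expressed as a percentage and growth as percentage points per day. The function name `days_until_threshold` is hypothetical, and real growth is rarely linear, so treat the result as a planning hint.

```bash
# Days until utilization reaches the threshold, assuming linear growth.
# Usage: days_until_threshold <current_pct> <growth_pct_per_day> [threshold_pct]
days_until_threshold() {
  awk -v c="$1" -v g="$2" -v t="${3:-70}" 'BEGIN {
    if (c + 0 >= t + 0) { print 0; exit }      # already at or above the threshold
    if (g + 0 <= 0)     { print "never"; exit } # flat or shrinking usage
    d = (t - c) / g
    days = int(d)
    if (days < d) days++                        # round up to whole days
    print days
  }'
}

# Example: 55% used today, growing 0.5 points per day, 70% alert threshold.
# days_until_threshold 55 0.5 70
```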
| 
 | |
| ---
 | |
| 
 | |
| ## Contact Information
 | |
| 
 | |
| **Primary Contact**: Tony (@tony)
 | |
| **Team**: BZZZ Infrastructure Team
 | |
| **Documentation**: https://wiki.chorus.services/bzzz
 | |
| **Source Code**: https://gitea.chorus.services/tony/BZZZ
 | |
| 
 | |
| **Last Updated**: 2025-01-01
 | |
| **Version**: 2.0
 | |
| **Review Date**: 2025-04-01 |