# BZZZ Infrastructure Operational Runbook

## Table of Contents

1. [Quick Reference](#quick-reference)
2. [System Architecture Overview](#system-architecture-overview)
3. [Common Operational Tasks](#common-operational-tasks)
4. [Incident Response Procedures](#incident-response-procedures)
5. [Health Check Procedures](#health-check-procedures)
6. [Performance Tuning](#performance-tuning)
7. [Backup and Recovery](#backup-and-recovery)
8. [Troubleshooting Guide](#troubleshooting-guide)
9. [Maintenance Procedures](#maintenance-procedures)

## Quick Reference

### Critical Service Endpoints

- **Grafana Dashboard**: https://grafana.chorus.services
- **Prometheus**: https://prometheus.chorus.services
- **AlertManager**: https://alerts.chorus.services
- **BZZZ Main API**: https://bzzz.deepblack.cloud
- **Health Checks**: https://bzzz.deepblack.cloud/health

### Emergency Contacts

- **Primary On-call**: Slack #bzzz-alerts
- **System Administrator**: @tony
- **Infrastructure Team**: @platform-team

### Key Commands

```bash
# Check system health
curl -s https://bzzz.deepblack.cloud/health | jq

# View logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100

# Scale service
docker service scale bzzz-v2_bzzz-agent=5

# Force service update
docker service update --force bzzz-v2_bzzz-agent
```

## System Architecture Overview

### Component Relationships

```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   PubSub    │────│     DHT     │────│  Election   │
│  Messaging  │    │   Storage   │    │   Manager   │
└─────────────┘    └─────────────┘    └─────────────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
                  ┌─────────────┐
                  │    SLURP    │
                  │   Context   │
                  │  Generator  │
                  └─────────────┘
                          │
                  ┌─────────────┐
                  │    UCXI     │
                  │  Protocol   │
                  │  Resolver   │
                  └─────────────┘
```

### Data Flow

1. **Task Requests** → PubSub → Task Coordinator → SLURP (if admin)
2. **Context Generation** → DHT Storage → UCXI Resolution
3. **Health Monitoring** → Prometheus → AlertManager → Notifications
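To confirm the monitoring path (flow 3) end to end, the sketch below checks each hop in turn. It assumes the internal hostnames `prometheus` and `alertmanager` resolve from wherever you run it (the same names the monitoring examples later in this runbook use).

```bash
#!/usr/bin/env bash
# Walk the health-monitoring flow: agent -> Prometheus -> AlertManager.
set -euo pipefail

# 1. The agent reports its own health
curl -sf https://bzzz.deepblack.cloud/health | jq '.status'

# 2. Prometheus has scraped it into the health-score metric
curl -sf 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' \
  | jq '.data.result[].value[1]'

# 3. AlertManager shows whether anything is currently firing
curl -sf http://alertmanager:9093/api/v1/alerts \
  | jq '[.data[] | select(.status.state == "active")] | length'
```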
### Critical Dependencies

- **Docker Swarm**: Container orchestration
- **NFS Storage**: Persistent data storage
- **Prometheus Stack**: Monitoring and alerting
- **DHT Bootstrap Nodes**: P2P network foundation

## Common Operational Tasks

### Service Management

#### Check Service Status

```bash
# List all BZZZ services
docker service ls | grep bzzz

# Check a specific service
docker service ps bzzz-v2_bzzz-agent

# View service configuration
docker service inspect bzzz-v2_bzzz-agent
```

#### Scale Services

```bash
# Scale the main BZZZ service
docker service scale bzzz-v2_bzzz-agent=5

# Scale the monitoring stack
docker service scale bzzz-monitoring_prometheus=1
docker service scale bzzz-monitoring_grafana=1
```

#### Update Services

```bash
# Update to a new image version
docker service update \
  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
  bzzz-v2_bzzz-agent

# Update environment variables
docker service update \
  --env-add LOG_LEVEL=debug \
  bzzz-v2_bzzz-agent

# Update resource limits
docker service update \
  --limit-memory 4G \
  --limit-cpu 2 \
  bzzz-v2_bzzz-agent
```

### Configuration Management

#### Update Docker Secrets

```bash
# Create a new secret
echo "new_password" | docker secret create bzzz_postgres_password_v2 -

# Update the service to use the new secret
docker service update \
  --secret-rm bzzz_postgres_password \
  --secret-add bzzz_postgres_password_v2 \
  bzzz-v2_postgres
```

#### Update Docker Configs

```bash
# Create a new config
docker config create bzzz_v2_config_v3 /path/to/new/config.yaml

# Update the service
docker service update \
  --config-rm bzzz_v2_config \
  --config-add source=bzzz_v2_config_v3,target=/app/config/config.yaml \
  bzzz-v2_bzzz-agent
```

### Monitoring and Alerting

#### Check Alert Status

```bash
# View active alerts
curl -s http://alertmanager:9093/api/v1/alerts | jq '.data[] | select(.status.state == "active")'

# Silence an alert
curl -X POST http://alertmanager:9093/api/v1/silences \
  -H 'Content-Type: application/json' \
  -d '{
    "matchers": [{"name": "alertname", "value": "BZZZSystemHealthCritical"}],
    "startsAt": "2025-01-01T00:00:00Z",
    "endsAt": "2025-01-01T01:00:00Z",
    "comment": "Maintenance window",
    "createdBy": "operator"
  }'
```

#### Query Metrics

```bash
# Check system health
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq

# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' | jq

# Check error rates
curl -s 'http://prometheus:9090/api/v1/query?query=rate(bzzz_errors_total[5m])' | jq
```
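These instant queries all share the same shape, so a tiny shell helper keeps ad-hoc checks short. This is a convenience sketch, not part of the stack; it assumes `prometheus:9090` is reachable and `jq` is installed.

```bash
# promq: run a PromQL instant query and print just the result set.
promq() {
  curl -s 'http://prometheus:9090/api/v1/query' \
    --data-urlencode "query=$1" \
    | jq '.data.result'
}

# Usage:
#   promq 'bzzz_system_health_score'
#   promq 'rate(bzzz_errors_total[5m])'
```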
## Incident Response Procedures

### Severity Levels

#### Critical (P0)
- System completely unavailable
- Data loss or corruption
- Security breach
- **Response Time**: 15 minutes
- **Resolution Target**: 2 hours

#### High (P1)
- Major functionality impaired
- Performance severely degraded
- **Response Time**: 1 hour
- **Resolution Target**: 4 hours

#### Medium (P2)
- Minor functionality issues
- Performance slightly degraded
- **Response Time**: 4 hours
- **Resolution Target**: 24 hours

#### Low (P3)
- Cosmetic issues
- Enhancement requests
- **Response Time**: 24 hours
- **Resolution Target**: 1 week

### Common Incident Scenarios

#### System Health Critical (Alert: BZZZSystemHealthCritical)

**Symptoms**: System health score < 0.5

**Immediate Actions**:
1. Check the Grafana dashboard for component failures
2. Review recent deployments or changes
3. Check resource utilization (CPU, memory, disk)
4. Verify P2P connectivity

**Investigation Steps**:
```bash
# Check overall system status
curl -s https://bzzz.deepblack.cloud/health | jq

# Check component health
curl -s https://bzzz.deepblack.cloud/health/checks | jq

# Review recent logs
docker service logs bzzz-v2_bzzz-agent --since 1h | tail -100

# Check resource usage
docker stats --no-stream
```

**Recovery Actions**:
1. If memory leak: restart affected services
2. If disk full: clean up logs and temporary files
3. If network issues: restart networking components
4. If database issues: check PostgreSQL health

#### P2P Network Partition (Alert: BZZZInsufficientPeers)

**Symptoms**: Connected peers < 3

**Immediate Actions**:
1. Check network connectivity between nodes
2. Verify DHT bootstrap nodes are running
3. Check firewall rules and port accessibility

**Investigation Steps**:
```bash
# Check DHT bootstrap nodes
for node in walnut:9101 ironwood:9102 acacia:9103; do
  echo "Checking $node:"
  nc -zv ${node%:*} ${node#*:}
done

# Check P2P connectivity
docker service logs bzzz-v2_dht-bootstrap-walnut --since 1h

# Test the network between nodes
docker run --rm --network host nicolaka/netshoot ping -c 3 ironwood
```

**Recovery Actions**:
1. Restart DHT bootstrap services
2. Clear the peer store if corrupted
3. Check and fix network configuration
4. Restart affected BZZZ agents

#### Election System Failure (Alert: BZZZNoAdminElected)

**Symptoms**: No admin elected, or frequent leadership changes

**Immediate Actions**:
1. Check election state on all nodes
2. Review heartbeat status
3. Verify role configurations

**Investigation Steps**:
```bash
# Check election status on each node
for node in walnut ironwood acacia; do
  echo "Node $node election status:"
  docker exec $(docker ps -q --filter label=com.docker.swarm.node.id) \
    curl -s localhost:8081/health/checks | jq '.checks["election-health"]'
done

# Check role configurations
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | grep -A5 -B5 role
```

**Recovery Actions**:
1. Force re-election by restarting election managers
2. Fix role configuration issues
3. Clear election state if corrupted
4. Ensure at least one node has admin capabilities

#### DHT Replication Failure (Alert: BZZZDHTReplicationDegraded)

**Symptoms**: Average replication factor < 2

**Immediate Actions**:
1. Check DHT provider records
2. Verify replication manager status
3. Check storage availability

**Investigation Steps**:
```bash
# Check DHT metrics
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_replication_factor' | jq

# Check provider records
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_provider_records' | jq

# Check replication manager logs
docker service logs bzzz-v2_bzzz-agent | grep -i replication
```

**Recovery Actions**:
1. Restart replication managers
2. Force re-provisioning of content
3. Check and fix storage issues
4. Verify DHT network connectivity

### Escalation Procedures

#### When to Escalate

- Unable to resolve a P0/P1 incident within the target time
- Incident requires specialized knowledge
- Multiple systems affected
- Potential security implications

#### Escalation Contacts

1. **Technical Lead**: @tech-lead (Slack)
2. **Infrastructure Team**: @infra-team (Slack)
3. **Management**: @management (for business-critical issues)

## Health Check Procedures

### Manual Health Verification

#### System-Level Checks

```bash
# 1. Overall system health
curl -s https://bzzz.deepblack.cloud/health | jq '.status'

# 2. Component health checks
curl -s https://bzzz.deepblack.cloud/health/checks | jq

# 3. Resource utilization
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

# 4. Service status
docker service ls | grep bzzz

# 5. Network connectivity
docker network ls | grep bzzz
```
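The five checks above can be strung into a single pass/fail probe, handy as a cron job or a pre-change gate. A minimal sketch, assuming `/health` reports a top-level `status` field of `"healthy"` when all component checks pass (the same field the jq filter above reads):

```bash
#!/usr/bin/env bash
# One-shot health probe over the system-level checks above.
# Exit code 0 = healthy, 1 = something needs attention.
fail=0

# The health endpoint should report healthy
status=$(curl -sf https://bzzz.deepblack.cloud/health | jq -r '.status')
if [ "$status" != "healthy" ]; then
  echo "health endpoint reports: ${status:-unreachable}"
  fail=1
fi

# Every bzzz service should run its desired replica count (e.g. "3/3")
while read -r name replicas; do
  if [ "${replicas%/*}" != "${replicas#*/}" ]; then
    echo "$name degraded: $replicas"
    fail=1
  fi
done < <(docker service ls --format '{{.Name}} {{.Replicas}}' | grep bzzz)

exit $fail
```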
#### Component-Specific Checks

**P2P Network**:
```bash
# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers'

# Test P2P messaging
docker exec -it $(docker ps -q -f name=bzzz-agent) \
  /app/bzzz test-p2p-message
```

**DHT Storage**:
```bash
# Check DHT operations
curl -s 'http://prometheus:9090/api/v1/query?query=rate(bzzz_dht_put_operations_total[5m])'

# Test DHT functionality
docker exec -it $(docker ps -q -f name=bzzz-agent) \
  /app/bzzz test-dht-operations
```

**Election System**:
```bash
# Check the current admin
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_election_state'

# Check heartbeat status
curl -s https://bzzz.deepblack.cloud/api/election/status | jq
```

### Automated Health Monitoring

#### Prometheus Queries for Health

```promql
# Overall system health
bzzz_system_health_score

# Component health scores
bzzz_component_health_score

# SLI compliance (share of health checks passing)
rate(bzzz_health_checks_passed_total[5m])
  / (rate(bzzz_health_checks_passed_total[5m]) + rate(bzzz_health_checks_failed_total[5m]))

# Error budget burn: fires when DHT errors exceed the 1% budget
1 - bzzz:dht_success_rate > 0.01
```

#### Alert Validation

After resolving issues, verify that the alerts clear:

```bash
# Check whether any alerts are still active
curl -s http://alertmanager:9093/api/v1/alerts | \
  jq '.data[] | select(.status.state == "active") | .labels.alertname'
```

## Performance Tuning

### Resource Optimization

#### Memory Tuning

```bash
# Increase memory limits for heavy workloads
docker service update --limit-memory 8G bzzz-v2_bzzz-agent

# Optimize JVM heap size (if applicable)
docker service update \
  --env-add JAVA_OPTS="-Xmx4g -Xms2g" \
  bzzz-v2_bzzz-agent
```

#### CPU Optimization

```bash
# Adjust CPU limits
docker service update --limit-cpu 4 bzzz-v2_bzzz-agent

# Pin critical services to high-performance nodes
docker service update \
  --constraint-add 'node.labels.cpu_type==high_performance' \
  bzzz-v2_bzzz-agent
```

#### Network Optimization

```bash
# Increase network buffer sizes (run on each host)
echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf
sysctl -p
```

### Application-Level Tuning

#### DHT Performance
- Increase the replication factor for critical content
- Optimize provider-record refresh intervals
- Tune cache sizes based on memory availability

#### PubSub Performance
- Adjust message batch sizes
- Optimize topic subscription patterns
- Configure message retention policies

#### Election Stability
- Tune heartbeat intervals
- Adjust election timeouts based on network latency
- Optimize candidate scoring algorithms

### Monitoring Performance Impact

```bash
# Before tuning - capture a baseline of mean DHT operation latency
curl -s 'http://prometheus:9090/api/v1/query_range?query=rate(bzzz_dht_operation_latency_seconds_sum[5m])/rate(bzzz_dht_operation_latency_seconds_count[5m])&start=2025-01-01T00:00:00Z&end=2025-01-01T01:00:00Z&step=60s'

# After tuning - rerun the same query and compare
# Use the Grafana dashboards to visualize improvements
```

## Backup and Recovery

### Critical Data Identification

#### Persistent Data

- **PostgreSQL Database**: User data, task history, conversation threads
- **DHT Content**: Distributed content storage
- **Configuration**: Docker secrets, configs, service definitions
- **Prometheus Data**: Historical metrics (optional but valuable)

#### Backup Schedule

- **PostgreSQL**: Daily full backup, continuous WAL archiving
- **Configuration**: Weekly backup, plus immediately after changes
- **Prometheus**: Weekly backup of selected metrics
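Expressed as cron entries, that schedule might look like the sketch below. The script paths are placeholders; each script would wrap the corresponding commands from the procedures that follow.

```bash
# Illustrative root crontab for the backup schedule above.
# Paths are hypothetical wrappers around the commands in the next section.
0 2 * * *  /opt/bzzz/scripts/backup-postgres.sh     # daily full DB backup
0 3 * * 0  /opt/bzzz/scripts/backup-configs.sh      # weekly config export
0 4 * * 0  /opt/bzzz/scripts/backup-prometheus.sh   # weekly metrics snapshot
```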
### Backup Procedures

#### Database Backup

```bash
# Create a database backup (capture one timestamp so the filenames match)
TS=$(date +%Y%m%d_%H%M%S)
docker exec $(docker ps -q -f name=postgres) \
  pg_dump -U bzzz -d bzzz_v2 -f /backup/bzzz_${TS}.sql

# Compress and store (assumes the container's /backup is the host's /rust/bzzz-v2/backups)
gzip /rust/bzzz-v2/backups/bzzz_${TS}.sql
aws s3 cp /rust/bzzz-v2/backups/ s3://chorus-backups/bzzz/ --recursive
```

#### Configuration Backup

```bash
# Export secret metadata (secret values themselves cannot be read back out of Swarm)
for secret in $(docker secret ls -q); do
  docker secret inspect $secret > /backup/secrets/${secret}.json
done

# Export all configs
for config in $(docker config ls -q); do
  docker config inspect $config > /backup/configs/${config}.json
done

# Export service definitions
docker service inspect $(docker service ls -q) > /backup/services.json
```

#### Prometheus Data Backup

```bash
# Snapshot Prometheus data (requires the admin API to be enabled)
SNAPSHOT=$(curl -s -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')

# Copy the snapshot to the backup location
docker cp prometheus_container:/prometheus/snapshots/${SNAPSHOT} /backup/prometheus/$(date +%Y%m%d)
```

### Recovery Procedures

#### Full System Recovery

1. **Restore Infrastructure**: Deploy the Docker Swarm stack
2. **Restore Configuration**: Import secrets and configs
3. **Restore Database**: Restore PostgreSQL from backup
4. **Validate Services**: Verify all services are healthy
5. **Test Functionality**: Run end-to-end tests

#### Database Recovery

```bash
# Stop application services
docker service scale bzzz-v2_bzzz-agent=0

# Restore the database
gunzip -c /backup/bzzz_20250101_120000.sql.gz | \
  docker exec -i $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2

# Start application services
docker service scale bzzz-v2_bzzz-agent=3
```

#### Point-in-Time Recovery

```bash
# For WAL-based recovery, start from a base backup
docker exec $(docker ps -q -f name=postgres) \
  pg_basebackup -U postgres -D /backup/base -X stream -P

# Restore to a specific time
# (Implementation depends on the PostgreSQL configuration)
```

### Recovery Testing

#### Monthly Recovery Tests

```bash
# Test database restore
./scripts/test-db-restore.sh

# Test configuration restore
./scripts/test-config-restore.sh

# Test full system restore (staging environment)
./scripts/test-full-restore.sh staging
```

#### Recovery Validation

- Verify all services start successfully
- Check data integrity and completeness
- Validate P2P network connectivity
- Test core functionality (task coordination, context generation)
- Monitor system health for 24 hours post-recovery
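A quick validation pass can reuse commands from earlier in this runbook; the sketch below covers the first three checklist items (service startup, database reachability, P2P connectivity) in one shot.

```bash
#!/usr/bin/env bash
# Post-recovery spot checks, following the validation checklist above.
set -e

# Services came back up
docker service ls | grep bzzz

# The restored database answers queries
docker exec $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c 'SELECT 1;'

# The P2P mesh reformed (expect >= 3 peers, matching the alert threshold)
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' | jq

# End-to-end health
curl -s https://bzzz.deepblack.cloud/health | jq '.status'
```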
## Troubleshooting Guide

### Log Analysis

#### Centralized Logging

```bash
# View aggregated logs through Loki
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="bzzz"}' \
  --data-urlencode 'start=2025-01-01T00:00:00Z' \
  --data-urlencode 'end=2025-01-01T01:00:00Z' | jq

# Search for specific errors
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="bzzz"} |= "ERROR"' | jq
```

#### Service-Specific Logs

```bash
# BZZZ agent logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100

# DHT bootstrap logs
docker service logs bzzz-v2_dht-bootstrap-walnut -f

# Database logs
docker service logs bzzz-v2_postgres -f

# Filter for specific patterns
docker service logs bzzz-v2_bzzz-agent | grep -E "(ERROR|FATAL|panic)"
```

### Common Issues and Solutions

#### "No Admin Elected" Error

```bash
# Check role configurations
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | yq '.agent.role'

# Force an election
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz trigger-election

# Restart election managers
docker service update --force bzzz-v2_bzzz-agent
```

#### "DHT Operations Failing" Error

```bash
# Check DHT bootstrap nodes
for port in 9101 9102 9103; do
  nc -zv localhost $port
done

# Restart DHT services
docker service update --force bzzz-v2_dht-bootstrap-walnut
docker service update --force bzzz-v2_dht-bootstrap-ironwood
docker service update --force bzzz-v2_dht-bootstrap-acacia

# Clear the DHT cache
docker exec -it $(docker ps -q -f name=bzzz-agent) rm -rf /app/data/dht/cache/*
```

#### "High Memory Usage" Alert

```bash
# Identify memory-hungry containers (sorted by memory %, highest first)
docker stats --no-stream --format "{{.Container}}\t{{.MemPerc}}" | sort -t$'\t' -k2 -rn

# Check for memory leaks
docker exec -it $(docker ps -q -f name=bzzz-agent) pprof -http=:6060 /app/bzzz

# Restart high-memory services
docker service update --force bzzz-v2_bzzz-agent
```

#### "Network Connectivity Issues"

```bash
# Check the overlay network
docker network inspect bzzz-internal

# Test connectivity between services
docker run --rm --network bzzz-internal nicolaka/netshoot ping -c 3 postgres

# Check firewall rules
iptables -L | grep -E "(9000|9101|9102|9103)"

# Reattach the agent to the network
docker network disconnect bzzz-internal $(docker ps -q -f name=bzzz-agent)
docker network connect bzzz-internal $(docker ps -q -f name=bzzz-agent)
```

### Performance Issues

#### High Latency Diagnosis

```bash
# Check operation latencies (p95)
curl -s 'http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket[5m]))'

# Identify bottlenecks
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz profile-cpu 30

# Check network latency between nodes
for node in walnut ironwood acacia; do
  ping -c 10 $node | tail -1
done
```

#### Resource Contention

```bash
# Check CPU usage
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}"

# Check I/O wait
iostat -x 1 5

# Check network utilization
iftop -i eth0
```

### Debugging Tools

#### Application Debugging

```bash
# Enable debug logging
docker service update --env-add LOG_LEVEL=debug bzzz-v2_bzzz-agent

# Access debug endpoints
curl -s http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof heap.prof

# Trace requests
curl -s http://localhost:8080/debug/requests
```

#### System Debugging

```bash
# System resource usage
htop
iotop
nethogs

# Process analysis
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20

# Network analysis
netstat -tulpn | grep -E ":9000|:9101|:9102|:9103"
ss -tuln | grep -E ":9000|:9101|:9102|:9103"
```

## Maintenance Procedures

### Scheduled Maintenance

#### Weekly Maintenance (Low-impact)

- Review system health metrics
- Check log sizes and rotate if necessary
- Update monitoring dashboards
- Validate backup integrity

#### Monthly Maintenance (Medium-impact)

- Update non-critical components
- Perform a capacity planning review
- Test disaster recovery procedures
- Run security scans and updates

#### Quarterly Maintenance (High-impact)

- Major version updates
- Infrastructure upgrades
- Performance optimization review
- Security audit and remediation

### Update Procedures

#### Rolling Updates

```bash
# Update with zero downtime
docker service update \
  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-failure-action rollback \
  bzzz-v2_bzzz-agent
```
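While a rolling update runs, it helps to watch task state and keep the manual escape hatch handy. A short sketch using standard Swarm commands:

```bash
# Watch the rollout progress task by task
watch -n 5 'docker service ps bzzz-v2_bzzz-agent --format "table {{.Name}}\t{{.Image}}\t{{.CurrentState}}"'

# Roll back manually if the update misbehaves before
# --update-failure-action triggers on its own
docker service update --rollback bzzz-v2_bzzz-agent
```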
#### Configuration Updates

```bash
# Swap in a new config (tasks restart on a rolling basis)
docker config create bzzz_v2_config_new /path/to/new/config.yaml
docker service update \
  --config-rm bzzz_v2_config \
  --config-add source=bzzz_v2_config_new,target=/app/config/config.yaml \
  bzzz-v2_bzzz-agent

# Clean up the old config
docker config rm bzzz_v2_config
```

#### Database Maintenance

```bash
# Database optimization
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "VACUUM ANALYZE;"

# Update planner statistics
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "ANALYZE;"

# Check database size
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "SELECT pg_size_pretty(pg_database_size('bzzz_v2'));"
```

### Capacity Planning

#### Growth Projections

- Monitor resource usage trends over time
- Project capacity needs based on growth patterns
- Plan for seasonal or event-driven spikes

#### Scaling Decisions

```bash
# Horizontal scaling
docker service scale bzzz-v2_bzzz-agent=5

# Vertical scaling
docker service update \
  --limit-memory 8G \
  --limit-cpu 4 \
  bzzz-v2_bzzz-agent

# Print the token for joining a new worker node to the swarm
docker swarm join-token worker
```

#### Resource Monitoring

- Set capacity alerts at 70% utilization
- Monitor the growth rate and extrapolate
- Plan infrastructure expansions 3-6 months ahead

---

## Contact Information

**Primary Contact**: Tony (@tony)
**Team**: BZZZ Infrastructure Team
**Documentation**: https://wiki.chorus.services/bzzz
**Source Code**: https://gitea.chorus.services/tony/BZZZ

**Last Updated**: 2025-01-01
**Version**: 2.0
**Review Date**: 2025-04-01