bzzz/infrastructure/docs/OPERATIONAL_RUNBOOK.md

BZZZ Infrastructure Operational Runbook

Table of Contents

  1. Quick Reference
  2. System Architecture Overview
  3. Common Operational Tasks
  4. Incident Response Procedures
  5. Health Check Procedures
  6. Performance Tuning
  7. Backup and Recovery
  8. Troubleshooting Guide
  9. Maintenance Procedures

Quick Reference

Critical Service Endpoints

Emergency Contacts

  • Primary Oncall: Slack #bzzz-alerts
  • System Administrator: @tony
  • Infrastructure Team: @platform-team

Key Commands

# Check system health
curl -s https://bzzz.deepblack.cloud/health | jq

# View logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100

# Scale service
docker service scale bzzz-v2_bzzz-agent=5

# Force service update
docker service update --force bzzz-v2_bzzz-agent

System Architecture Overview

Component Relationships

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   PubSub    │────│     DHT     │────│  Election   │
│  Messaging  │    │   Storage   │    │   Manager   │
└─────────────┘    └─────────────┘    └─────────────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                    ┌─────────────┐
                    │    SLURP    │
                    │   Context   │
                    │  Generator  │
                    └─────────────┘
                           │
                    ┌─────────────┐
                    │    UCXI     │
                    │  Protocol   │
                    │  Resolver   │
                    └─────────────┘

Data Flow

  1. Task Requests → PubSub → Task Coordinator → SLURP (if admin)
  2. Context Generation → DHT Storage → UCXI Resolution
  3. Health Monitoring → Prometheus → AlertManager → Notifications

Critical Dependencies

  • Docker Swarm: Container orchestration
  • NFS Storage: Persistent data storage
  • Prometheus Stack: Monitoring and alerting
  • DHT Bootstrap Nodes: P2P network foundation

Common Operational Tasks

Service Management

Check Service Status

# List all BZZZ services
docker service ls | grep bzzz

# Check specific service
docker service ps bzzz-v2_bzzz-agent

# View service configuration
docker service inspect bzzz-v2_bzzz-agent

Scale Services

# Scale main BZZZ service
docker service scale bzzz-v2_bzzz-agent=5

# Scale monitoring stack
docker service scale bzzz-monitoring_prometheus=1
docker service scale bzzz-monitoring_grafana=1

Update Services

# Update to new image version
docker service update \
  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
  bzzz-v2_bzzz-agent

# Update environment variables
docker service update \
  --env-add LOG_LEVEL=debug \
  bzzz-v2_bzzz-agent

# Update resource limits
docker service update \
  --limit-memory 4G \
  --limit-cpu 2 \
  bzzz-v2_bzzz-agent

Configuration Management

Update Docker Secrets

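# Create new secret (printf avoids the trailing newline that echo would store in the secret)
printf '%s' "new_password" | docker secret create bzzz_postgres_password_v2 -

# Update service to use new secret, mounted under the original file name
docker service update \
  --secret-rm bzzz_postgres_password \
  --secret-add source=bzzz_postgres_password_v2,target=bzzz_postgres_password \
  bzzz-v2_postgres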

Update Docker Configs

# Create new config
docker config create bzzz_v2_config_v3 /path/to/new/config.yaml

# Update service
docker service update \
  --config-rm bzzz_v2_config \
  --config-add source=bzzz_v2_config_v3,target=/app/config/config.yaml \
  bzzz-v2_bzzz-agent

Monitoring and Alerting

Check Alert Status

# View active alerts (Alertmanager API v2; the v1 API was removed in Alertmanager 0.27)
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state == "active")'

# Silence alert
curl -X POST http://alertmanager:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
    "matchers": [{"name": "alertname", "value": "BZZZSystemHealthCritical", "isRegex": false}],
    "startsAt": "2025-01-01T00:00:00Z",
    "endsAt": "2025-01-01T01:00:00Z",
    "comment": "Maintenance window",
    "createdBy": "operator"
  }'
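
The `startsAt` and `endsAt` values in a silence request are usually computed relative to the current time rather than typed by hand. The following is a minimal sketch of that calculation, assuming GNU `date` (the `-d "@EPOCH"` form); the helper names `iso_utc` and `silence_window` are hypothetical and not part of BZZZ or Alertmanager.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for building a silence window.
# Assumes GNU date; the -d "@EPOCH" form is not portable to BSD/macOS.

# Format a Unix epoch (seconds) as an RFC 3339 UTC timestamp.
iso_utc() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# Print "startsAt endsAt" for a window of $2 minutes starting at epoch $1.
silence_window() {
  local start=$1 minutes=$2
  echo "$(iso_utc "$start") $(iso_utc $(( start + minutes * 60 )))"
}
```

For a one-hour window starting now, `read -r starts ends < <(silence_window "$(date +%s)" 60)` fills two variables that can be substituted into the JSON body.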

Query Metrics

# Check system health
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq

# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' | jq

# Check error rates (-G with --data-urlencode keeps curl from treating [5m] as a URL glob)
curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(bzzz_errors_total[5m])' | jq

Incident Response Procedures

Severity Levels

Critical (P0)

  • System completely unavailable
  • Data loss or corruption
  • Security breach
  • Response Time: 15 minutes
  • Resolution Target: 2 hours

High (P1)

  • Major functionality impaired
  • Performance severely degraded
  • Response Time: 1 hour
  • Resolution Target: 4 hours

Medium (P2)

  • Minor functionality issues
  • Performance slightly degraded
  • Response Time: 4 hours
  • Resolution Target: 24 hours

Low (P3)

  • Cosmetic issues
  • Enhancement requests
  • Response Time: 24 hours
  • Resolution Target: 1 week
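
The severity targets above can be encoded as a small lookup for use in paging or ticketing scripts. This is a sketch only: the function name `severity_targets` and the output format are assumptions, while the values are copied from the list above.

```bash
#!/usr/bin/env bash
# Look up the response and resolution targets for a severity level.
# The function name and output format are hypothetical.
severity_targets() {
  case "$1" in
    P0) echo "response=15m resolution=2h" ;;
    P1) echo "response=1h resolution=4h" ;;
    P2) echo "response=4h resolution=24h" ;;
    P3) echo "response=24h resolution=1w" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

Returning a non-zero status for an unknown level lets a calling script fail fast instead of paging with empty targets.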

Common Incident Scenarios

System Health Critical (Alert: BZZZSystemHealthCritical)

Symptoms: System health score < 0.5

Immediate Actions:

  1. Check Grafana dashboard for component failures
  2. Review recent deployments or changes
  3. Check resource utilization (CPU, memory, disk)
  4. Verify P2P connectivity

Investigation Steps:

# Check overall system status
curl -s https://bzzz.deepblack.cloud/health | jq

# Check component health
curl -s https://bzzz.deepblack.cloud/health/checks | jq

# Review recent logs
docker service logs bzzz-v2_bzzz-agent --since 1h | tail -100

# Check resource usage
docker stats --no-stream

Recovery Actions:

  1. If memory leak: Restart affected services
  2. If disk full: Clean up logs and temporary files
  3. If network issues: Restart networking components
  4. If database issues: Check PostgreSQL health
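
The alert condition for this scenario is a plain threshold on the health score. A minimal sketch of that comparison follows, using `awk` because shell arithmetic has no floating point; the function name `health_state` and the `critical`/`ok` labels are illustrative.

```bash
#!/usr/bin/env bash
# Classify a health score against the critical threshold (default 0.5).
# health_state SCORE [THRESHOLD]; name and labels are illustrative.
health_state() {
  awk -v s="$1" -v t="${2:-0.5}" \
    'BEGIN { print ((s + 0 < t + 0) ? "critical" : "ok") }'
}
```

The score can be taken from the Prometheus instant-query response, where the sample value is the string at `.data.result[0].value[1]`.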

P2P Network Partition (Alert: BZZZInsufficientPeers)

Symptoms: Connected peers < 3

Immediate Actions:

  1. Check network connectivity between nodes
  2. Verify DHT bootstrap nodes are running
  3. Check firewall rules and port accessibility

Investigation Steps:

# Check DHT bootstrap nodes
for node in walnut:9101 ironwood:9102 acacia:9103; do
  echo "Checking $node:"
  nc -zv ${node%:*} ${node#*:}
done

# Check P2P connectivity
docker service logs bzzz-v2_dht-bootstrap-walnut --since 1h

# Test network between nodes
docker run --rm --network host nicolaka/netshoot ping -c 3 ironwood
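
The bootstrap loop above splits each `host:port` pair with shell parameter expansion: `${node%:*}` removes the shortest suffix matching `:*` and leaves the host, and `${node#*:}` removes the shortest prefix matching `*:` and leaves the port. A small sketch that isolates the technique (the function name `split_hostport` is hypothetical):

```bash
#!/usr/bin/env bash
# Split "host:port" into "host port" using parameter expansion only.
split_hostport() {
  local hp=$1
  echo "${hp%:*} ${hp#*:}"
}
```

For example, `read -r host port < <(split_hostport ironwood:9102)` sets `host` and `port` for a later `nc -zv "$host" "$port"` call.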

Recovery Actions:

  1. Restart DHT bootstrap services
  2. Clear peer store if corrupted
  3. Check and fix network configuration
  4. Restart affected BZZZ agents

Election System Failure (Alert: BZZZNoAdminElected)

Symptoms: No admin elected or frequent leadership changes

Immediate Actions:

  1. Check election state on all nodes
  2. Review heartbeat status
  3. Verify role configurations

Investigation Steps:

# Check election status on each node (assumes SSH access to each swarm node,
# so that the check runs on that node rather than on the local host)
for node in walnut ironwood acacia; do
  echo "Node $node election status:"
  ssh "$node" 'docker exec "$(docker ps -q -f name=bzzz-agent | head -n 1)" curl -s localhost:8081/health/checks' \
    | jq '.checks["election-health"]'
done

# Check role configurations (docker config inspect returns a JSON array; -r strips the quotes before decoding)
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | grep -A5 -B5 role

Recovery Actions:

  1. Force re-election by restarting election managers
  2. Fix role configuration issues
  3. Clear election state if corrupted
  4. Ensure at least one node has admin capabilities

DHT Replication Failure (Alert: BZZZDHTReplicationDegraded)

Symptoms: Average replication factor < 2

Immediate Actions:

  1. Check DHT provider records
  2. Verify replication manager status
  3. Check storage availability

Investigation Steps:

# Check DHT metrics
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_replication_factor' | jq

# Check provider records
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_provider_records' | jq

# Check replication manager logs (2>&1 includes lines the containers wrote to stderr)
docker service logs bzzz-v2_bzzz-agent 2>&1 | grep -i replication

Recovery Actions:

  1. Restart replication managers
  2. Force re-provision of content
  3. Check and fix storage issues
  4. Verify DHT network connectivity

Escalation Procedures

When to Escalate

  • Unable to resolve P0/P1 incident within target time
  • Incident requires specialized knowledge
  • Multiple systems affected
  • Potential security implications

Escalation Contacts

  1. Technical Lead: @tech-lead (Slack)
  2. Infrastructure Team: @infra-team (Slack)
  3. Management: @management (for business-critical issues)

Health Check Procedures

Manual Health Verification

System-Level Checks

# 1. Overall system health
curl -s https://bzzz.deepblack.cloud/health | jq '.status'

# 2. Component health checks
curl -s https://bzzz.deepblack.cloud/health/checks | jq

# 3. Resource utilization
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

# 4. Service status
docker service ls | grep bzzz

# 5. Network connectivity
docker network ls | grep bzzz

Component-Specific Checks

P2P Network:

# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers'

# Test P2P messaging
docker exec -it $(docker ps -q -f name=bzzz-agent) \
  /app/bzzz test-p2p-message

DHT Storage:

# Check DHT operations (-G with --data-urlencode keeps curl from treating [5m] as a URL glob)
curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(bzzz_dht_put_operations_total[5m])'

# Test DHT functionality
docker exec -it $(docker ps -q -f name=bzzz-agent) \
  /app/bzzz test-dht-operations

Election System:

# Check current admin
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_election_state'

# Check heartbeat status
curl -s https://bzzz.deepblack.cloud/api/election/status | jq

Automated Health Monitoring

Prometheus Queries for Health

# Overall system health
bzzz_system_health_score

# Component health scores
bzzz_component_health_score

# SLI compliance (rate() must be applied to each counter before the rates are added)
rate(bzzz_health_checks_passed_total[5m]) / (rate(bzzz_health_checks_failed_total[5m]) + rate(bzzz_health_checks_passed_total[5m]))

# Error budget burn rate
1 - bzzz:dht_success_rate > 0.01  # 1% error budget
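
As a worked example of the arithmetic behind the two SLI queries: with 990 passed and 10 failed checks, compliance is 990 / (990 + 10) = 0.99, and the error ratio is 10 / 1000 = 0.01, which sits exactly at the 1% budget and does not exceed it. The sketch below reproduces that calculation offline; the function names and sample counts are illustrative and are not taken from the BZZZ code base.

```bash
#!/usr/bin/env bash
# Compliance ratio from pass/fail counts: passed / (passed + failed).
sli_compliance() {
  awk -v p="$1" -v f="$2" 'BEGIN {
    total = p + f
    if (total == 0) { print "n/a"; exit }
    printf "%.4f\n", p / total
  }'
}

# Succeeds (exit 0) when the error ratio is above the budget (default 1%).
budget_exceeded() {
  awk -v p="$1" -v f="$2" -v budget="${3:-0.01}" 'BEGIN {
    total = p + f
    exit !(total > 0 && f / total > budget)
  }'
}
```

A guard for a zero total avoids a division error during quiet periods when no checks ran.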

Alert Validation

After resolving issues, verify alerts clear:

# Check if alerts are resolved (Alertmanager API v2 returns a bare JSON array)
curl -s http://alertmanager:9093/api/v2/alerts | \
  jq '.[] | select(.status.state == "active") | .labels.alertname'

Performance Tuning

Resource Optimization

Memory Tuning

# Increase memory limits for heavy workloads
docker service update --limit-memory 8G bzzz-v2_bzzz-agent

# Optimize JVM heap size (only applies if the service runs on a JVM; JAVA_OPTS has no effect on a Go binary)
docker service update \
  --env-add JAVA_OPTS="-Xmx4g -Xms2g" \
  bzzz-v2_bzzz-agent

CPU Optimization

# Adjust CPU limits
docker service update --limit-cpu 4 bzzz-v2_bzzz-agent

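# Pin critical services to nodes labeled as high-performance
# (docker service update takes --constraint-add; spread preferences accept only a label name, not an expression)
docker service update \
  --constraint-add 'node.labels.cpu_type==high_performance' \
  bzzz-v2_bzzz-agent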

Network Optimization

# Optimize network buffer sizes
echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf
sysctl -p

Application-Level Tuning

DHT Performance

  • Increase replication factor for critical content
  • Optimize provider record refresh intervals
  • Tune cache sizes based on memory availability

PubSub Performance

  • Adjust message batch sizes
  • Optimize topic subscription patterns
  • Configure message retention policies

Election Stability

  • Tune heartbeat intervals
  • Adjust election timeouts based on network latency
  • Optimize candidate scoring algorithms

Monitoring Performance Impact

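# Before tuning - capture baseline (-G with --data-urlencode keeps curl from treating [5m] as a URL glob)
curl -s -G 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(bzzz_dht_operation_latency_seconds_sum[5m])/rate(bzzz_dht_operation_latency_seconds_count[5m])' \
  --data-urlencode 'start=2025-01-01T00:00:00Z' \
  --data-urlencode 'end=2025-01-01T01:00:00Z' \
  --data-urlencode 'step=60s'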

# After tuning - compare results
# Use Grafana dashboards to visualize improvements

Backup and Recovery

Critical Data Identification

Persistent Data

  • PostgreSQL Database: User data, task history, conversation threads
  • DHT Content: Distributed content storage
  • Configuration: Docker secrets, configs, service definitions
  • Prometheus Data: Historical metrics (optional but valuable)

Backup Schedule

  • PostgreSQL: Daily full backup, continuous WAL archiving
  • Configuration: Weekly backup, immediately after changes
  • Prometheus: Weekly backup of selected metrics

Backup Procedures

Database Backup

# Create database backup (the file name is computed once so every step refers to the same file;
# pg_dump streams to stdout so the dump lands on the host)
BACKUP_FILE="/rust/bzzz-v2/backups/bzzz_$(date +%Y%m%d_%H%M%S).sql"
docker exec $(docker ps -q -f name=postgres) \
  pg_dump -U bzzz -d bzzz_v2 > "$BACKUP_FILE"

# Compress and store
gzip "$BACKUP_FILE"
aws s3 cp /rust/bzzz-v2/backups/ s3://chorus-backups/bzzz/ --recursive
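
Backup steps that each call `date` on their own can disagree on the file name if the clock ticks between them, so it helps to derive the name once and reuse it. A minimal sketch of such a helper follows, assuming GNU `date`; the function name `backup_path` is hypothetical, and it formats the time in UTC, which differs from the local-time names used elsewhere in this runbook.

```bash
#!/usr/bin/env bash
# Build a backup path from a single timestamp.
# backup_path DIR [EPOCH]; EPOCH defaults to the current time.
# Assumes GNU date (-d "@EPOCH"); formats in UTC.
backup_path() {
  local dir=$1 epoch=${2:-$(date +%s)}
  echo "$dir/bzzz_$(date -u -d "@$epoch" +%Y%m%d_%H%M%S).sql"
}
```

Assigning the result to a variable, for example `file=$(backup_path /backup)`, lets the dump, compression, and upload steps all use the same path.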

Configuration Backup

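# Export secret metadata (docker secret inspect never returns secret values,
# so keep the values themselves in a separate secure store)
for secret in $(docker secret ls -q); do
  docker secret inspect "$secret" > "/backup/secrets/${secret}.json"
done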

# Export all configs
for config in $(docker config ls -q); do
  docker config inspect $config > /backup/configs/${config}.json
done

# Export service definitions as a single JSON array
docker service inspect $(docker service ls -q) > /backup/services.json

Prometheus Data Backup

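# Snapshot Prometheus data (requires Prometheus to run with --web.enable-admin-api);
# the response contains the name of the new snapshot directory
SNAPSHOT=$(curl -s -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')

# Copy snapshot to backup location
docker cp "prometheus_container:/prometheus/snapshots/$SNAPSHOT" /backup/prometheus/$(date +%Y%m%d)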

Recovery Procedures

Full System Recovery

  1. Restore Infrastructure: Deploy Docker Swarm stack
  2. Restore Configuration: Import secrets and configs
  3. Restore Database: Restore PostgreSQL from backup
  4. Validate Services: Verify all services are healthy
  5. Test Functionality: Run end-to-end tests

Database Recovery

# Stop application services
docker service scale bzzz-v2_bzzz-agent=0

# Restore database
gunzip -c /backup/bzzz_20250101_120000.sql.gz | \
  docker exec -i $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2

# Start application services
docker service scale bzzz-v2_bzzz-agent=3

Point-in-Time Recovery

# For WAL-based recovery
docker exec $(docker ps -q -f name=postgres) \
  pg_basebackup -U postgres -D /backup/base -X stream -P

# Restore to specific time
# (Implementation depends on PostgreSQL configuration)

Recovery Testing

Monthly Recovery Tests

# Test database restore
./scripts/test-db-restore.sh

# Test configuration restore  
./scripts/test-config-restore.sh

# Test full system restore (staging environment)
./scripts/test-full-restore.sh staging

Recovery Validation

  • Verify all services start successfully
  • Check data integrity and completeness
  • Validate P2P network connectivity
  • Test core functionality (task coordination, context generation)
  • Monitor system health for 24 hours post-recovery

Troubleshooting Guide

Log Analysis

Centralized Logging

# View aggregated logs through Loki
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="bzzz"}' \
  --data-urlencode 'start=2025-01-01T00:00:00Z' \
  --data-urlencode 'end=2025-01-01T01:00:00Z' | jq

# Search for specific errors
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="bzzz"} |= "ERROR"' | jq

Service-Specific Logs

# BZZZ agent logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100

# DHT bootstrap logs
docker service logs bzzz-v2_dht-bootstrap-walnut -f

# Database logs
docker service logs bzzz-v2_postgres -f

# Filter for specific patterns (2>&1 includes lines the containers wrote to stderr)
docker service logs bzzz-v2_bzzz-agent 2>&1 | grep -E "(ERROR|FATAL|panic)"
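
When triaging, a count of matching lines is often more useful than the lines themselves. The sketch below counts lines on standard input that match the same pattern as the filter above; the function name `count_errors` is hypothetical.

```bash
#!/usr/bin/env bash
# Count stdin lines matching ERROR, FATAL, or panic.
# grep -c exits non-zero when the count is 0, so "|| true" keeps the
# function usable under "set -e".
count_errors() {
  grep -cE '(ERROR|FATAL|panic)' || true
}
```

A typical call would be `docker service logs bzzz-v2_bzzz-agent --since 1h 2>&1 | count_errors`.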

Common Issues and Solutions

"No Admin Elected" Error

# Check role configurations (docker config inspect returns a JSON array; -r strips the quotes before decoding)
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | yq '.agent.role'

# Force election
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz trigger-election

# Restart election managers
docker service update --force bzzz-v2_bzzz-agent

"DHT Operations Failing" Error

# Check DHT bootstrap nodes
for port in 9101 9102 9103; do
  nc -zv localhost $port
done

# Restart DHT services
docker service update --force bzzz-v2_dht-bootstrap-walnut
docker service update --force bzzz-v2_dht-bootstrap-ironwood
docker service update --force bzzz-v2_dht-bootstrap-acacia

# Clear DHT cache (sh -c makes the glob expand inside the container, not on the host)
docker exec -it $(docker ps -q -f name=bzzz-agent) sh -c 'rm -rf /app/data/dht/cache/*'

"High Memory Usage" Alert

# Identify memory-hungry containers (highest memory percentage first)
docker stats --no-stream --format "{{.MemPerc}}\t{{.Container}}\t{{.MemUsage}}" | sort -rn

# Check for memory leaks: capture a heap profile from the pprof debug endpoint and list the top allocators
curl -s http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof -top heap.prof

# Restart high-memory services
docker service update --force bzzz-v2_bzzz-agent

"Network Connectivity Issues"

# Check overlay network
docker network inspect bzzz-internal

# Test connectivity between services
docker run --rm --network bzzz-internal nicolaka/netshoot ping -c 3 postgres

# Check firewall rules (-n prints numeric ports so the grep can match them)
iptables -L -n | grep -E "(9000|9101|9102|9103)"

# Restart networking: containers managed by a swarm service cannot be reliably
# disconnected and reconnected by hand, so recreate the tasks instead
docker service update --force bzzz-v2_bzzz-agent

Performance Issues

High Latency Diagnosis

# Check operation latencies (the query contains spaces and brackets, so it must be URL-encoded)
curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket[5m]))'

# Identify bottlenecks
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz profile-cpu 30

# Check network latency between nodes
for node in walnut ironwood acacia; do
  ping -c 10 $node | tail -1
done

Resource Contention

# Check CPU usage
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}"

# Check I/O wait
iostat -x 1 5

# Check network utilization
iftop -i eth0

Debugging Tools

Application Debugging

# Enable debug logging
docker service update --env-add LOG_LEVEL=debug bzzz-v2_bzzz-agent

# Access debug endpoints
curl -s http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof heap.prof

# Trace requests
curl -s http://localhost:8080/debug/requests

System Debugging

# System resource usage
htop
iotop
nethogs

# Process analysis
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20

# Network analysis
netstat -tulpn | grep -E ":9000|:9101|:9102|:9103"
ss -tuln | grep -E ":9000|:9101|:9102|:9103"

Maintenance Procedures

Scheduled Maintenance

Weekly Maintenance (Low-impact)

  • Review system health metrics
  • Check log sizes and rotate if necessary
  • Update monitoring dashboards
  • Validate backup integrity

Monthly Maintenance (Medium-impact)

  • Update non-critical components
  • Perform capacity planning review
  • Test disaster recovery procedures
  • Security scan and updates

Quarterly Maintenance (High-impact)

  • Major version updates
  • Infrastructure upgrades
  • Performance optimization review
  • Security audit and remediation

Update Procedures

Rolling Updates

# Update with zero downtime
docker service update \
  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-failure-action rollback \
  bzzz-v2_bzzz-agent
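
The `--update-parallelism` and `--update-delay` settings put a floor under how long a rollout takes: tasks are updated in batches, with the delay applied between batches. With 5 replicas, a parallelism of 1, and a 30s delay, that is 5 batches and 4 waits, so at least 120 seconds. The sketch below computes that lower bound; it is an estimate only and ignores image pull, task start-up, and health-check time. The function name `rollout_delay` is hypothetical.

```bash
#!/usr/bin/env bash
# Lower bound (seconds) spent waiting between batches in a rolling update.
# rollout_delay REPLICAS PARALLELISM DELAY_SECONDS
rollout_delay() {
  local replicas=$1 parallelism=$2 delay=$3
  # Ceiling division: number of batches needed to cover all replicas.
  local batches=$(( (replicas + parallelism - 1) / parallelism ))
  if (( batches < 1 )); then
    echo 0
    return
  fi
  echo $(( (batches - 1) * delay ))
}
```

Raising the parallelism to 2 for the same 5 replicas gives 3 batches and a 60-second floor, at the cost of having more tasks out of service at once.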

Configuration Updates

# Update configuration (changing a config replaces the service's tasks in a rolling restart)
docker config create bzzz_v2_config_new /path/to/new/config.yaml

docker service update \
  --config-rm bzzz_v2_config \
  --config-add source=bzzz_v2_config_new,target=/app/config/config.yaml \
  bzzz-v2_bzzz-agent

# Cleanup old config
docker config rm bzzz_v2_config

Database Maintenance

# Database optimization
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "VACUUM ANALYZE;"

# Update statistics
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "ANALYZE;"

# Check database size
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "SELECT pg_size_pretty(pg_database_size('bzzz_v2'));"

Capacity Planning

Growth Projections

  • Monitor resource usage trends over time
  • Project capacity needs based on growth patterns
  • Plan for seasonal or event-driven spikes

Scaling Decisions

# Horizontal scaling
docker service scale bzzz-v2_bzzz-agent=5

# Vertical scaling
docker service update \
  --limit-memory 8G \
  --limit-cpu 4 \
  bzzz-v2_bzzz-agent

# Add new node to swarm: print the join command on a manager, then run it on the new node
docker swarm join-token worker

Resource Monitoring

  • Set up capacity alerts at 70% utilization
  • Monitor growth rate and extrapolate
  • Plan infrastructure expansions 3-6 months ahead
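
The extrapolation step can be made concrete with a simple linear model: if utilization is at 55% and grows by 3 percentage points per month, the 70% alert level is reached in (70 - 55) / 3 = 5 months. The sketch below performs that calculation and rounds up to whole months. Linear growth is an assumption made for illustration, and the function name `months_until` is hypothetical.

```bash
#!/usr/bin/env bash
# Months until utilization reaches a threshold, assuming linear growth.
# months_until CURRENT_PCT GROWTH_PCT_PER_MONTH [THRESHOLD_PCT=70]
months_until() {
  awk -v c="$1" -v g="$2" -v t="${3:-70}" 'BEGIN {
    if (c + 0 >= t + 0) { print 0; exit }
    if (g + 0 <= 0)     { print "never"; exit }
    m = (t - c) / g
    # Round up to whole months.
    printf "%d\n", ((m == int(m)) ? m : int(m) + 1)
  }'
}
```

A result below the 3-6 month planning horizon suggests that expansion work should already be scheduled.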

Contact Information

  • Primary Contact: Tony (@tony)
  • Team: BZZZ Infrastructure Team
  • Documentation: https://wiki.chorus.services/bzzz
  • Source Code: https://gitea.chorus.services/tony/BZZZ

Last Updated: 2025-01-01
Version: 2.0
Review Date: 2025-04-01