BZZZ Infrastructure Operational Runbook
Table of Contents
- Quick Reference
- System Architecture Overview
- Common Operational Tasks
- Incident Response Procedures
- Health Check Procedures
- Performance Tuning
- Backup and Recovery
- Troubleshooting Guide
- Maintenance Procedures
Quick Reference
Critical Service Endpoints
- Grafana Dashboard: https://grafana.chorus.services
- Prometheus: https://prometheus.chorus.services
- AlertManager: https://alerts.chorus.services
- BZZZ Main API: https://bzzz.deepblack.cloud
- Health Checks: https://bzzz.deepblack.cloud/health
Emergency Contacts
- Primary Oncall: Slack #bzzz-alerts
- System Administrator: @tony
- Infrastructure Team: @platform-team
Key Commands
# Check system health
curl -s https://bzzz.deepblack.cloud/health | jq
# View logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100
# Scale service
docker service scale bzzz-v2_bzzz-agent=5
# Force service update
docker service update --force bzzz-v2_bzzz-agent
System Architecture Overview
Component Relationships
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   PubSub    │─────│     DHT     │─────│  Election   │
│  Messaging  │     │   Storage   │     │   Manager   │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                    ┌─────────────┐
                    │    SLURP    │
                    │   Context   │
                    │  Generator  │
                    └─────────────┘
                           │
                    ┌─────────────┐
                    │    UCXI     │
                    │  Protocol   │
                    │  Resolver   │
                    └─────────────┘
Data Flow
- Task Requests → PubSub → Task Coordinator → SLURP (if admin)
- Context Generation → DHT Storage → UCXI Resolution
- Health Monitoring → Prometheus → AlertManager → Notifications
Critical Dependencies
- Docker Swarm: Container orchestration
- NFS Storage: Persistent data storage
- Prometheus Stack: Monitoring and alerting
- DHT Bootstrap Nodes: P2P network foundation
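These dependencies can be spot-checked from any manager node with a short script. The hostnames, bootstrap ports, and the /rust mount point below follow the examples used elsewhere in this runbook and are assumptions to adjust for your environment.
#!/usr/bin/env bash
# Sketch: spot-check the critical dependencies listed above.
# Hostnames, ports, and the /rust mount are assumptions taken from this runbook's examples.
set -u
echo "== Docker Swarm =="
docker node ls --format '{{.Hostname}}: {{.Status}} ({{.Availability}})'
echo "== NFS storage =="
mountpoint -q /rust && echo "/rust mounted" || echo "WARNING: /rust not mounted"
echo "== Prometheus stack =="
curl -sf http://prometheus:9090/-/healthy >/dev/null && echo "Prometheus healthy" || echo "WARNING: Prometheus unreachable"
echo "== DHT bootstrap nodes =="
for node in walnut:9101 ironwood:9102 acacia:9103; do
  nc -z -w 3 "${node%:*}" "${node#*:}" && echo "$node reachable" || echo "WARNING: $node unreachable"
done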
Common Operational Tasks
Service Management
Check Service Status
# List all BZZZ services
docker service ls | grep bzzz
# Check specific service
docker service ps bzzz-v2_bzzz-agent
# View service configuration
docker service inspect bzzz-v2_bzzz-agent
Scale Services
# Scale main BZZZ service
docker service scale bzzz-v2_bzzz-agent=5
# Scale monitoring stack
docker service scale bzzz-monitoring_prometheus=1
docker service scale bzzz-monitoring_grafana=1
Update Services
# Update to new image version
docker service update \
--image registry.home.deepblack.cloud/bzzz:v2.1.0 \
bzzz-v2_bzzz-agent
# Update environment variables
docker service update \
--env-add LOG_LEVEL=debug \
bzzz-v2_bzzz-agent
# Update resource limits
docker service update \
--limit-memory 4G \
--limit-cpu 2 \
bzzz-v2_bzzz-agent
Configuration Management
Update Docker Secrets
# Create new secret
echo "new_password" | docker secret create bzzz_postgres_password_v2 -
# Update service to use new secret
docker service update \
--secret-rm bzzz_postgres_password \
--secret-add bzzz_postgres_password_v2 \
bzzz-v2_postgres
Update Docker Configs
# Create new config
docker config create bzzz_v2_config_v3 /path/to/new/config.yaml
# Update service
docker service update \
--config-rm bzzz_v2_config \
--config-add source=bzzz_v2_config_v3,target=/app/config/config.yaml \
bzzz-v2_bzzz-agent
Monitoring and Alerting
Check Alert Status
# View active alerts
curl -s http://alertmanager:9093/api/v1/alerts | jq '.data[] | select(.status.state == "active")'
# Silence alert
curl -X POST http://alertmanager:9093/api/v1/silences \
-d '{
"matchers": [{"name": "alertname", "value": "BZZZSystemHealthCritical"}],
"startsAt": "2025-01-01T00:00:00Z",
"endsAt": "2025-01-01T01:00:00Z",
"comment": "Maintenance window",
"createdBy": "operator"
}'
Query Metrics
# Check system health
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq
# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' | jq
# Check error rates
curl -s 'http://prometheus:9090/api/v1/query?query=rate(bzzz_errors_total[5m])' | jq
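The queries above (and the others in this runbook) all follow the same pattern, so a small helper keeps them readable. It assumes prometheus:9090 is reachable from wherever you run it, as in the examples above.
# Small helper: run an instant PromQL query and print metric => value pairs.
# Assumes prometheus:9090 is reachable (e.g. from a host or container on the monitoring network).
promq() {
  curl -s --get 'http://prometheus:9090/api/v1/query' \
    --data-urlencode "query=$1" |
    jq -r '.data.result[] | "\(.metric) => \(.value[1])"'
}
# Usage:
promq 'bzzz_system_health_score'
promq 'rate(bzzz_errors_total[5m])'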
Incident Response Procedures
Severity Levels
Critical (P0)
- System completely unavailable
- Data loss or corruption
- Security breach
- Response Time: 15 minutes
- Resolution Target: 2 hours
High (P1)
- Major functionality impaired
- Performance severely degraded
- Response Time: 1 hour
- Resolution Target: 4 hours
Medium (P2)
- Minor functionality issues
- Performance slightly degraded
- Response Time: 4 hours
- Resolution Target: 24 hours
Low (P3)
- Cosmetic issues
- Enhancement requests
- Response Time: 24 hours
- Resolution Target: 1 week
Common Incident Scenarios
System Health Critical (Alert: BZZZSystemHealthCritical)
Symptoms: System health score < 0.5
Immediate Actions:
- Check Grafana dashboard for component failures
- Review recent deployments or changes
- Check resource utilization (CPU, memory, disk)
- Verify P2P connectivity
Investigation Steps:
# Check overall system status
curl -s https://bzzz.deepblack.cloud/health | jq
# Check component health
curl -s https://bzzz.deepblack.cloud/health/checks | jq
# Review recent logs
docker service logs bzzz-v2_bzzz-agent --since 1h | tail -100
# Check resource usage
docker stats --no-stream
Recovery Actions:
- If memory leak: Restart affected services
- If disk full: Clean up logs and temporary files
- If network issues: Restart networking components
- If database issues: Check PostgreSQL health
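For the disk-full case specifically, a cleanup along these lines is a reasonable starting point; the retention window and size threshold are assumptions, so review what will be removed before running it.
# Sketch: reclaim disk space on an affected node (retention values are assumptions).
# Remove stopped containers, dangling images, and unused build cache
docker system prune -f
# Trim systemd journal logs older than 7 days
journalctl --vacuum-time=7d
# List oversized container JSON logs before deciding what to truncate
find /var/lib/docker/containers -name '*-json.log' -size +500M -exec ls -lh {} \;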
P2P Network Partition (Alert: BZZZInsufficientPeers)
Symptoms: Connected peers < 3
Immediate Actions:
- Check network connectivity between nodes
- Verify DHT bootstrap nodes are running
- Check firewall rules and port accessibility
Investigation Steps:
# Check DHT bootstrap nodes
for node in walnut:9101 ironwood:9102 acacia:9103; do
echo "Checking $node:"
nc -zv ${node%:*} ${node#*:}
done
# Check P2P connectivity
docker service logs bzzz-v2_dht-bootstrap-walnut --since 1h
# Test network between nodes
docker run --rm --network host nicolaka/netshoot ping -c 3 ironwood
Recovery Actions:
- Restart DHT bootstrap services
- Clear peer store if corrupted
- Check and fix network configuration
- Restart affected BZZZ agents
Election System Failure (Alert: BZZZNoAdminElected)
Symptoms: No admin elected or frequent leadership changes
Immediate Actions:
- Check election state on all nodes
- Review heartbeat status
- Verify role configurations
Investigation Steps:
# Check election status on each node (assumes SSH access to walnut/ironwood/acacia)
for node in walnut ironwood acacia; do
  echo "Node $node election status:"
  ssh $node 'docker exec $(docker ps -q -f name=bzzz-agent) \
    curl -s localhost:8081/health/checks' | jq '.checks["election-health"]'
done
# Check role configurations
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | grep -A5 -B5 role
Recovery Actions:
- Force re-election by restarting election managers
- Fix role configuration issues
- Clear election state if corrupted
- Ensure at least one node has admin capabilities
DHT Replication Failure (Alert: BZZZDHTReplicationDegraded)
Symptoms: Average replication factor < 2
Immediate Actions:
- Check DHT provider records
- Verify replication manager status
- Check storage availability
Investigation Steps:
# Check DHT metrics
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_replication_factor' | jq
# Check provider records
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_provider_records' | jq
# Check replication manager logs
docker service logs bzzz-v2_bzzz-agent | grep -i replication
Recovery Actions:
- Restart replication managers
- Force re-provision of content
- Check and fix storage issues
- Verify DHT network connectivity
Escalation Procedures
When to Escalate
- Unable to resolve P0/P1 incident within target time
- Incident requires specialized knowledge
- Multiple systems affected
- Potential security implications
Escalation Contacts
- Technical Lead: @tech-lead (Slack)
- Infrastructure Team: @infra-team (Slack)
- Management: @management (for business-critical issues)
Health Check Procedures
Manual Health Verification
System-Level Checks
# 1. Overall system health
curl -s https://bzzz.deepblack.cloud/health | jq '.status'
# 2. Component health checks
curl -s https://bzzz.deepblack.cloud/health/checks | jq
# 3. Resource utilization
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
# 4. Service status
docker service ls | grep bzzz
# 5. Network connectivity
docker network ls | grep bzzz
Component-Specific Checks
P2P Network:
# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers'
# Test P2P messaging
docker exec -it $(docker ps -q -f name=bzzz-agent) \
/app/bzzz test-p2p-message
DHT Storage:
# Check DHT operations
curl -s 'http://prometheus:9090/api/v1/query?query=rate(bzzz_dht_put_operations_total[5m])'
# Test DHT functionality
docker exec -it $(docker ps -q -f name=bzzz-agent) \
/app/bzzz test-dht-operations
Election System:
# Check current admin
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_election_state'
# Check heartbeat status
curl -s https://bzzz.deepblack.cloud/api/election/status | jq
Automated Health Monitoring
Prometheus Queries for Health
# Overall system health
bzzz_system_health_score
# Component health scores
bzzz_component_health_score
# SLI compliance
rate(bzzz_health_checks_passed_total[5m]) / (rate(bzzz_health_checks_passed_total[5m]) + rate(bzzz_health_checks_failed_total[5m]))
# Error budget burn rate
1 - bzzz:dht_success_rate > 0.01 # 1% error budget
Alert Validation
After resolving issues, verify alerts clear:
# Check if alerts are resolved
curl -s http://alertmanager:9093/api/v1/alerts | \
jq '.data[] | select(.status.state == "active") | .labels.alertname'
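When multiple alerts were firing, it can help to poll until a specific one clears instead of re-checking by hand. A minimal loop, assuming alertmanager:9093 is reachable as in the examples above:
# Poll AlertManager until the named alert is no longer active.
ALERT="BZZZSystemHealthCritical"
until ! curl -s http://alertmanager:9093/api/v1/alerts | \
        jq -e --arg a "$ALERT" \
        '.data[] | select(.status.state == "active" and .labels.alertname == $a)' >/dev/null; do
  echo "$(date) - $ALERT still firing, waiting..."
  sleep 30
done
echo "$ALERT resolved"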
Performance Tuning
Resource Optimization
Memory Tuning
# Increase memory limits for heavy workloads
docker service update --limit-memory 8G bzzz-v2_bzzz-agent
# Optimize JVM heap size (if applicable)
docker service update \
--env-add JAVA_OPTS="-Xmx4g -Xms2g" \
bzzz-v2_bzzz-agent
CPU Optimization
# Adjust CPU limits
docker service update --limit-cpu 4 bzzz-v2_bzzz-agent
# Pin critical services to nodes labeled for high performance
docker service update \
  --constraint-add "node.labels.cpu_type==high_performance" \
  bzzz-v2_bzzz-agent
Network Optimization
# Optimize network buffer sizes
echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf
sysctl -p
Application-Level Tuning
DHT Performance
- Increase replication factor for critical content
- Optimize provider record refresh intervals
- Tune cache sizes based on memory availability
PubSub Performance
- Adjust message batch sizes
- Optimize topic subscription patterns
- Configure message retention policies
Election Stability
- Tune heartbeat intervals
- Adjust election timeouts based on network latency
- Optimize candidate scoring algorithms
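These knobs live in the BZZZ configuration file. The key names below are hypothetical placeholders (check the actual config.yaml schema before using them); the roll-out itself follows the configuration-update procedure described under Maintenance Procedures.
# Sketch: apply tuning values via a new Docker config. YAML keys are hypothetical placeholders.
cat > /tmp/config-tuned.yaml <<'EOF'
dht:
  replication_factor: 3          # raise for critical content
  provider_refresh_interval: 1h  # provider record refresh
  cache_size_mb: 512             # tune to available memory
pubsub:
  message_batch_size: 64
  retention: 24h
election:
  heartbeat_interval: 5s
  election_timeout: 30s          # adjust for network latency
EOF
docker config create bzzz_v2_config_tuned /tmp/config-tuned.yaml
docker service update \
  --config-rm bzzz_v2_config \
  --config-add source=bzzz_v2_config_tuned,target=/app/config/config.yaml \
  bzzz-v2_bzzz-agent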
Monitoring Performance Impact
# Before tuning - capture baseline
curl -s 'http://prometheus:9090/api/v1/query_range?query=rate(bzzz_dht_operation_latency_seconds_sum[5m])/rate(bzzz_dht_operation_latency_seconds_count[5m])&start=2025-01-01T00:00:00Z&end=2025-01-01T01:00:00Z&step=60s'
# After tuning - compare results
# Use Grafana dashboards to visualize improvements
Backup and Recovery
Critical Data Identification
Persistent Data
- PostgreSQL Database: User data, task history, conversation threads
- DHT Content: Distributed content storage
- Configuration: Docker secrets, configs, service definitions
- Prometheus Data: Historical metrics (optional but valuable)
Backup Schedule
- PostgreSQL: Daily full backup, continuous WAL archiving
- Configuration: Weekly backup, immediately after changes
- Prometheus: Weekly backup of selected metrics
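One way to implement this schedule is plain cron on a manager node; the wrapper script paths below are assumptions and should point at whatever automates the backup commands in the next section.
# Sketch: cron entries implementing the schedule above (script paths are assumptions).
# Daily PostgreSQL full backup at 02:00
0 2 * * *  /rust/bzzz-v2/scripts/backup-postgres.sh   >> /var/log/bzzz-backup.log 2>&1
# Weekly configuration export, Sundays at 03:00
0 3 * * 0  /rust/bzzz-v2/scripts/backup-config.sh     >> /var/log/bzzz-backup.log 2>&1
# Weekly Prometheus snapshot, Sundays at 04:00
0 4 * * 0  /rust/bzzz-v2/scripts/backup-prometheus.sh >> /var/log/bzzz-backup.log 2>&1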
Backup Procedures
Database Backup
# Create database backup
TS=$(date +%Y%m%d_%H%M%S)
docker exec $(docker ps -q -f name=postgres) \
  pg_dump -U bzzz -d bzzz_v2 -f /backup/bzzz_${TS}.sql
# Compress and store (assumes the container's /backup maps to /rust/bzzz-v2/backups on the host)
gzip /rust/bzzz-v2/backups/bzzz_${TS}.sql
aws s3 cp /rust/bzzz-v2/backups/ s3://chorus-backups/bzzz/ --recursive
Configuration Backup
# Export secret metadata (Swarm does not return secret values; keep the originals in a separate vault)
for secret in $(docker secret ls -q); do
docker secret inspect $secret > /backup/secrets/${secret}.json
done
# Export all configs
for config in $(docker config ls -q); do
docker config inspect $config > /backup/configs/${config}.json
done
# Export service definitions
docker service inspect $(docker service ls -q) > /backup/services.json
Prometheus Data Backup
# Snapshot Prometheus data (requires Prometheus to run with --web.enable-admin-api)
SNAPSHOT=$(curl -s -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')
# Copy the named snapshot to the backup location
docker cp prometheus_container:/prometheus/snapshots/${SNAPSHOT} /backup/prometheus/$(date +%Y%m%d)
Recovery Procedures
Full System Recovery
- Restore Infrastructure: Deploy Docker Swarm stack
- Restore Configuration: Import secrets and configs
- Restore Database: Restore PostgreSQL from backup
- Validate Services: Verify all services are healthy
- Test Functionality: Run end-to-end tests
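The five steps can be strung together into a driver script; the stack file path and backup filename below are assumptions, and each step should be verified before moving on to the next.
#!/usr/bin/env bash
# Sketch: end-to-end recovery driver. Stack file path and backup filename are assumptions.
set -euo pipefail
# 1. Restore infrastructure
docker stack deploy -c /rust/bzzz-v2/docker-compose.swarm.yml bzzz-v2
# 2. Restore configuration (re-create secrets and configs from the configuration backup first)
# 3. Restore database (detailed steps under Database Recovery below)
BACKUP=/backup/bzzz_20250101_120000.sql.gz   # example filename; use the latest backup
gunzip -c "$BACKUP" | \
  docker exec -i $(docker ps -q -f name=postgres) psql -U bzzz -d bzzz_v2
# 4. Validate services
docker service ls | grep bzzz
curl -sf https://bzzz.deepblack.cloud/health | jq '.status'
# 5. Test functionality (run the end-to-end suite used in Recovery Testing below)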
Database Recovery
# Stop application services
docker service scale bzzz-v2_bzzz-agent=0
# Restore database
gunzip -c /backup/bzzz_20250101_120000.sql.gz | \
docker exec -i $(docker ps -q -f name=postgres) \
psql -U bzzz -d bzzz_v2
# Start application services
docker service scale bzzz-v2_bzzz-agent=3
Point-in-Time Recovery
# For WAL-based recovery
docker exec $(docker ps -q -f name=postgres) \
pg_basebackup -U postgres -D /backup/base -X stream -P
# Restore to specific time
# (Implementation depends on PostgreSQL configuration)
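For reference, point-in-time recovery on PostgreSQL 12+ follows roughly the shape below. It assumes WAL archiving to /backup/wal is already configured (both the archive path and the target time are placeholders), and the exact procedure depends on how the postgres container mounts its data directory.
# Sketch only: PITR shape for PostgreSQL 12+ (paths and target time are placeholders).
# 1. Stop PostgreSQL and replace the data directory with the base backup taken above
# 2. Point recovery at the WAL archive and the desired timestamp
cat >> "$PGDATA/postgresql.conf" <<'EOF'
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2025-01-01 12:00:00 UTC'
EOF
touch "$PGDATA/recovery.signal"
# 3. Start PostgreSQL; it replays WAL up to the target time before accepting connections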
Recovery Testing
Monthly Recovery Tests
# Test database restore
./scripts/test-db-restore.sh
# Test configuration restore
./scripts/test-config-restore.sh
# Test full system restore (staging environment)
./scripts/test-full-restore.sh staging
Recovery Validation
- Verify all services start successfully
- Check data integrity and completeness
- Validate P2P network connectivity
- Test core functionality (task coordination, context generation)
- Monitor system health for 24 hours post-recovery
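The checklist above lends itself to a small script that can be run immediately after a restore; the endpoints and the 3-peer threshold mirror earlier sections of this runbook, and the expected "healthy" status value is an assumption about the health endpoint's response.
#!/usr/bin/env bash
# Sketch: post-recovery validation based on the checklist above.
set -u
fail=0
# Services running
docker service ls | grep bzzz
# Overall health (assumes the endpoint reports status "healthy")
[ "$(curl -s https://bzzz.deepblack.cloud/health | jq -r '.status')" = "healthy" ] \
  || { echo "FAIL: system health check"; fail=1; }
# P2P connectivity: at least 3 peers (threshold from the BZZZInsufficientPeers alert)
peers=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' \
  | jq -r '.data.result[0].value[1] // "0"')
awk -v p="$peers" 'BEGIN { exit !(p >= 3) }' || { echo "FAIL: only $peers peers connected"; fail=1; }
exit $fail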
Troubleshooting Guide
Log Analysis
Centralized Logging
# View aggregated logs through Loki
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
--data-urlencode 'query={job="bzzz"}' \
--data-urlencode 'start=2025-01-01T00:00:00Z' \
--data-urlencode 'end=2025-01-01T01:00:00Z' | jq
# Search for specific errors
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
--data-urlencode 'query={job="bzzz"} |= "ERROR"' | jq
Service-Specific Logs
# BZZZ agent logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100
# DHT bootstrap logs
docker service logs bzzz-v2_dht-bootstrap-walnut -f
# Database logs
docker service logs bzzz-v2_postgres -f
# Filter for specific patterns
docker service logs bzzz-v2_bzzz-agent | grep -E "(ERROR|FATAL|panic)"
Common Issues and Solutions
"No Admin Elected" Error
# Check role configurations
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | yq '.agent.role'
# Force election
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz trigger-election
# Restart election managers
docker service update --force bzzz-v2_bzzz-agent
"DHT Operations Failing" Error
# Check DHT bootstrap nodes
for port in 9101 9102 9103; do
nc -zv localhost $port
done
# Restart DHT services
docker service update --force bzzz-v2_dht-bootstrap-walnut
docker service update --force bzzz-v2_dht-bootstrap-ironwood
docker service update --force bzzz-v2_dht-bootstrap-acacia
# Clear DHT cache
docker exec -it $(docker ps -q -f name=bzzz-agent) rm -rf /app/data/dht/cache/*
"High Memory Usage" Alert
# Identify memory-hungry containers (sorted by memory percentage, highest first)
docker stats --no-stream --format "{{.Container}}\t{{.MemPerc}}" | sort -t$'\t' -k2 -rn
# Check for memory leaks by inspecting the heap profile via the Go pprof endpoint
docker exec -it $(docker ps -q -f name=bzzz-agent) \
  pprof -http=:6060 http://localhost:8080/debug/pprof/heap
# Restart high-memory services
docker service update --force bzzz-v2_bzzz-agent
"Network Connectivity Issues"
# Check overlay network
docker network inspect bzzz-internal
# Test connectivity between services
docker run --rm --network bzzz-internal nicolaka/netshoot ping -c 3 postgres
# Check firewall rules
iptables -L -n | grep -E "(9000|9101|9102|9103)"
# Restart networking
docker network disconnect bzzz-internal $(docker ps -q -f name=bzzz-agent)
docker network connect bzzz-internal $(docker ps -q -f name=bzzz-agent)
Performance Issues
High Latency Diagnosis
# Check operation latencies
curl -s 'http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket[5m]))'
# Identify bottlenecks
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz profile-cpu 30
# Check network latency between nodes
for node in walnut ironwood acacia; do
ping -c 10 $node | tail -1
done
Resource Contention
# Check CPU usage
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}"
# Check I/O wait
iostat -x 1 5
# Check network utilization
iftop -i eth0
Debugging Tools
Application Debugging
# Enable debug logging
docker service update --env-add LOG_LEVEL=debug bzzz-v2_bzzz-agent
# Access debug endpoints
curl -s http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof heap.prof
# Trace requests
curl -s http://localhost:8080/debug/requests
System Debugging
# System resource usage
htop
iotop
nethogs
# Process analysis
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20
# Network analysis
netstat -tulpn | grep -E ":9000|:9101|:9102|:9103"
ss -tuln | grep -E ":9000|:9101|:9102|:9103"
Maintenance Procedures
Scheduled Maintenance
Weekly Maintenance (Low-impact)
- Review system health metrics
- Check log sizes and rotate if necessary
- Update monitoring dashboards
- Validate backup integrity
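A couple of the weekly items are easy to script; the 1 GiB threshold and the 2 GiB journal cap below are assumptions.
# Sketch: weekly log-size review (thresholds are assumptions; tune to your disk budget).
# Flag container JSON logs over 1 GiB
find /var/lib/docker/containers -name '*-json.log' -size +1G -exec ls -lh {} \;
# Report journal disk usage; uncomment the vacuum to enforce a cap
journalctl --disk-usage
# journalctl --vacuum-size=2G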
Monthly Maintenance (Medium-impact)
- Update non-critical components
- Perform capacity planning review
- Test disaster recovery procedures
- Security scan and updates
Quarterly Maintenance (High-impact)
- Major version updates
- Infrastructure upgrades
- Performance optimization review
- Security audit and remediation
Update Procedures
Rolling Updates
# Update with zero downtime
docker service update \
--image registry.home.deepblack.cloud/bzzz:v2.1.0 \
--update-parallelism 1 \
--update-delay 30s \
--update-failure-action rollback \
bzzz-v2_bzzz-agent
Configuration Updates
# Roll out a new configuration (tasks are restarted in a rolling fashion)
docker config create bzzz_v2_config_new /path/to/new/config.yaml
docker service update \
--config-rm bzzz_v2_config \
--config-add source=bzzz_v2_config_new,target=/app/config/config.yaml \
bzzz-v2_bzzz-agent
# Cleanup old config
docker config rm bzzz_v2_config
Database Maintenance
# Database optimization
docker exec -it $(docker ps -q -f name=postgres) \
psql -U bzzz -d bzzz_v2 -c "VACUUM ANALYZE;"
# Update statistics
docker exec -it $(docker ps -q -f name=postgres) \
psql -U bzzz -d bzzz_v2 -c "ANALYZE;"
# Check database size
docker exec -it $(docker ps -q -f name=postgres) \
psql -U bzzz -d bzzz_v2 -c "SELECT pg_size_pretty(pg_database_size('bzzz_v2'));"
Capacity Planning
Growth Projections
- Monitor resource usage trends over time
- Project capacity needs based on growth patterns
- Plan for seasonal or event-driven spikes
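Prometheus can do the extrapolation directly with predict_linear; the example below assumes node_exporter filesystem metrics are being scraped, which may need adjusting to the metrics actually collected by this stack.
# Projected free disk space 30 days out, based on the last 7 days of trend.
# Assumes node_exporter metrics are scraped; adjust the metric name and labels to your setup.
curl -s --get 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[7d], 30*24*3600)' | jq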
Scaling Decisions
# Horizontal scaling
docker service scale bzzz-v2_bzzz-agent=5
# Vertical scaling
docker service update \
--limit-memory 8G \
--limit-cpu 4 \
bzzz-v2_bzzz-agent
# Add new node to swarm
docker swarm join-token worker
Resource Monitoring
- Set up capacity alerts at 70% utilization
- Monitor growth rate and extrapolate
- Plan infrastructure expansions 3-6 months ahead
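The 70% threshold can be encoded as an ordinary Prometheus alerting rule. The sketch below uses cAdvisor-style container memory metrics as an example, which is an assumption about what this stack scrapes, and the rule-file path is a placeholder.
# Sketch: capacity alert at 70% utilization (metrics and rule-file path are assumptions).
cat > /rust/bzzz-v2/monitoring/rules/capacity.yml <<'EOF'
groups:
  - name: bzzz-capacity
    rules:
      - alert: BZZZMemoryCapacity70Percent
        expr: |
          sum(container_memory_usage_bytes{container_label_com_docker_swarm_service_name=~"bzzz-v2_.*"})
            /
          sum(container_spec_memory_limit_bytes{container_label_com_docker_swarm_service_name=~"bzzz-v2_.*"}) > 0.70
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "BZZZ memory utilization above 70% - plan capacity expansion"
EOF
# Reload Prometheus after adding the rule file (requires --web.enable-lifecycle)
curl -X POST http://prometheus:9090/-/reload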
Contact Information
- Primary Contact: Tony (@tony)
- Team: BZZZ Infrastructure Team
- Documentation: https://wiki.chorus.services/bzzz
- Source Code: https://gitea.chorus.services/tony/BZZZ
Last Updated: 2025-01-01 | Version: 2.0 | Review Date: 2025-04-01