bzzz/infrastructure/docs/OPERATIONAL_RUNBOOK.md

# BZZZ Infrastructure Operational Runbook

## Table of Contents
1. [Quick Reference](#quick-reference)
2. [System Architecture Overview](#system-architecture-overview)
3. [Common Operational Tasks](#common-operational-tasks)
4. [Incident Response Procedures](#incident-response-procedures)
5. [Health Check Procedures](#health-check-procedures)
6. [Performance Tuning](#performance-tuning)
7. [Backup and Recovery](#backup-and-recovery)
8. [Troubleshooting Guide](#troubleshooting-guide)
9. [Maintenance Procedures](#maintenance-procedures)

## Quick Reference

### Critical Service Endpoints
- **Grafana Dashboard**: https://grafana.chorus.services
- **Prometheus**: https://prometheus.chorus.services
- **AlertManager**: https://alerts.chorus.services
- **BZZZ Main API**: https://bzzz.deepblack.cloud
- **Health Checks**: https://bzzz.deepblack.cloud/health

### Emergency Contacts
- **Primary Oncall**: Slack #bzzz-alerts
- **System Administrator**: @tony
- **Infrastructure Team**: @platform-team

### Key Commands
```bash
# Check system health
curl -s https://bzzz.deepblack.cloud/health | jq

# View logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100

# Scale service
docker service scale bzzz-v2_bzzz-agent=5

# Force service update
docker service update --force bzzz-v2_bzzz-agent
```

## System Architecture Overview

### Component Relationships
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   PubSub    │────│     DHT     │────│  Election   │
│  Messaging  │    │   Storage   │    │   Manager   │
└─────────────┘    └─────────────┘    └─────────────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                    ┌─────────────┐
                    │    SLURP    │
                    │   Context   │
                    │  Generator  │
                    └─────────────┘
                           │
                    ┌─────────────┐
                    │    UCXI     │
                    │  Protocol   │
                    │  Resolver   │
                    └─────────────┘
```

### Data Flow
1. **Task Requests** → PubSub → Task Coordinator → SLURP (if admin)
2. **Context Generation** → DHT Storage → UCXI Resolution
3. **Health Monitoring** → Prometheus → AlertManager → Notifications

### Critical Dependencies
- **Docker Swarm**: Container orchestration
- **NFS Storage**: Persistent data storage
- **Prometheus Stack**: Monitoring and alerting
- **DHT Bootstrap Nodes**: P2P network foundation

## Common Operational Tasks

### Service Management

#### Check Service Status
```bash
# List all BZZZ services
docker service ls | grep bzzz

# Check specific service
docker service ps bzzz-v2_bzzz-agent

# View service configuration
docker service inspect bzzz-v2_bzzz-agent
```

#### Scale Services
```bash
# Scale main BZZZ service
docker service scale bzzz-v2_bzzz-agent=5

# Scale monitoring stack
docker service scale bzzz-monitoring_prometheus=1
docker service scale bzzz-monitoring_grafana=1
```

#### Update Services
```bash
# Update to new image version
docker service update \
  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
  bzzz-v2_bzzz-agent

# Update environment variables
docker service update \
  --env-add LOG_LEVEL=debug \
  bzzz-v2_bzzz-agent

# Update resource limits
docker service update \
  --limit-memory 4G \
  --limit-cpu 2 \
  bzzz-v2_bzzz-agent
```

### Configuration Management

#### Update Docker Secrets
```bash
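# Create new secret (printf avoids storing the trailing newline that echo would add)
printf '%s' "new_password" | docker secret create bzzz_postgres_password_v2 -

# Update service to use new secret, keeping the original file name under /run/secrets
docker service update \
  --secret-rm bzzz_postgres_password \
  --secret-add source=bzzz_postgres_password_v2,target=bzzz_postgres_password \
  bzzz-v2_postgres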
```

#### Update Docker Configs
```bash
# Create new config
docker config create bzzz_v2_config_v3 /path/to/new/config.yaml

# Update service
docker service update \
  --config-rm bzzz_v2_config \
  --config-add source=bzzz_v2_config_v3,target=/app/config/config.yaml \
  bzzz-v2_bzzz-agent
```

### Monitoring and Alerting

#### Check Alert Status
```bash
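# View active alerts (API v2; API v1 is deprecated and absent from recent Alertmanager releases)
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state == "active")'

# Silence alert (the v2 API expects a JSON content type and an explicit isRegex per matcher)
curl -X POST http://alertmanager:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
    "matchers": [{"name": "alertname", "value": "BZZZSystemHealthCritical", "isRegex": false}],
    "startsAt": "2025-01-01T00:00:00Z",
    "endsAt": "2025-01-01T01:00:00Z",
    "comment": "Maintenance window",
    "createdBy": "operator"
  }'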
```
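
The silence request above uses fixed example timestamps. A minimal sketch of building the payload from arguments follows; `silence_payload` is a hypothetical helper (not part of BZZZ or Alertmanager tooling), and the commented usage assumes the Alertmanager v2 API and GNU `date`.

```bash
# Build an Alertmanager silence payload (hypothetical helper).
# Arguments: alert name, start time, end time (RFC 3339), comment, creator.
# Values are inserted verbatim, so they must not contain double quotes.
silence_payload() {
  printf '{"matchers":[{"name":"alertname","value":"%s","isRegex":false}],' "$1"
  printf '"startsAt":"%s","endsAt":"%s","comment":"%s","createdBy":"%s"}\n' \
    "$2" "$3" "$4" "${5:-operator}"
}

# Example: silence an alert for the next hour (GNU date syntax assumed):
#   silence_payload BZZZSystemHealthCritical \
#     "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
#     "$(date -u -d '+1 hour' +%Y-%m-%dT%H:%M:%SZ)" \
#     "Maintenance window" |
#   curl -s -X POST -H 'Content-Type: application/json' \
#     -d @- http://alertmanager:9093/api/v2/silences
```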

#### Query Metrics
```bash
# Check system health
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq

# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' | jq

# Check error rates (-G with --data-urlencode stops curl from treating [5m] as a URL glob)
curl -sG http://prometheus:9090/api/v1/query --data-urlencode 'query=rate(bzzz_errors_total[5m])' | jq
```
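
PromQL expressions often contain brackets, spaces, and parentheses, which are awkward inside a URL (curl interprets `[...]` as a glob range). A small wrapper that always URL-encodes the expression can reduce that friction. This is a sketch: `prom_query` and the `PROM_URL` override are assumed names, not existing tooling.

```bash
# Run an instant query against Prometheus and print the raw JSON response.
# PROM_URL is an assumed override; the default matches the commands above.
prom_query() {
  curl -sG "${PROM_URL:-http://prometheus:9090}/api/v1/query" \
    --data-urlencode "query=$1"
}

# Example:
#   prom_query 'rate(bzzz_errors_total[5m])' | jq
```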

## Incident Response Procedures

### Severity Levels

#### Critical (P0)
- System completely unavailable
- Data loss or corruption
- Security breach
- **Response Time**: 15 minutes
- **Resolution Target**: 2 hours

#### High (P1)
- Major functionality impaired
- Performance severely degraded
- **Response Time**: 1 hour
- **Resolution Target**: 4 hours

#### Medium (P2)
- Minor functionality issues
- Performance slightly degraded
- **Response Time**: 4 hours
- **Resolution Target**: 24 hours

#### Low (P3)
- Cosmetic issues
- Enhancement requests
- **Response Time**: 24 hours
- **Resolution Target**: 1 week
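
For paging or ticket automation, the targets above can be encoded as a simple lookup. A minimal sketch; the function name `sla_for` and its output format are illustrative, not part of BZZZ tooling.

```bash
# Print the response time and resolution target for a severity level.
# Values mirror the Severity Levels list; `sla_for` is a hypothetical helper.
sla_for() {
  case "$1" in
    P0) echo "response=15m resolution=2h" ;;
    P1) echo "response=1h resolution=4h" ;;
    P2) echo "response=4h resolution=24h" ;;
    P3) echo "response=24h resolution=1w" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `sla_for P1` prints `response=1h resolution=4h`.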

### Common Incident Scenarios

#### System Health Critical (Alert: BZZZSystemHealthCritical)

**Symptoms**: System health score < 0.5

**Immediate Actions**:
1. Check Grafana dashboard for component failures
2. Review recent deployments or changes
3. Check resource utilization (CPU, memory, disk)
4. Verify P2P connectivity

**Investigation Steps**:
```bash
# Check overall system status
curl -s https://bzzz.deepblack.cloud/health | jq

# Check component health
curl -s https://bzzz.deepblack.cloud/health/checks | jq

# Review recent logs (2>&1 includes lines the containers wrote to stderr)
docker service logs bzzz-v2_bzzz-agent --since 1h 2>&1 | tail -100

# Check resource usage
docker stats --no-stream
```
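
When scripting around this alert, the threshold from the symptom description (health score < 0.5) can be checked with a floating-point comparison. A minimal sketch; `health_is_critical` is a hypothetical helper, and the commented usage assumes `jq` is available and that the metric returns a single series.

```bash
# Return success (0) when a health score is below the critical threshold.
# The default threshold of 0.5 comes from the alert's symptom description.
health_is_critical() {
  awk -v s="$1" -v t="${2:-0.5}" 'BEGIN { exit !(s + 0 < t + 0) }'
}

# Example with a score pulled from Prometheus (requires jq):
#   score=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' |
#     jq -r '.data.result[0].value[1]')
#   health_is_critical "$score" && echo "CRITICAL: health score $score"
```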

**Recovery Actions**:
1. If memory leak: Restart affected services
2. If disk full: Clean up logs and temporary files
3. If network issues: Restart networking components
4. If database issues: Check PostgreSQL health

#### P2P Network Partition (Alert: BZZZInsufficientPeers)

**Symptoms**: Connected peers < 3

**Immediate Actions**:
1. Check network connectivity between nodes
2. Verify DHT bootstrap nodes are running
3. Check firewall rules and port accessibility

**Investigation Steps**:
```bash
# Check DHT bootstrap nodes
for node in walnut:9101 ironwood:9102 acacia:9103; do
  echo "Checking $node:"
  nc -zv ${node%:*} ${node#*:}
done

# Check P2P connectivity
docker service logs bzzz-v2_dht-bootstrap-walnut --since 1h

# Test network between nodes
docker run --rm --network host nicolaka/netshoot ping -c 3 ironwood
```

**Recovery Actions**:
1. Restart DHT bootstrap services
2. Clear peer store if corrupted
3. Check and fix network configuration
4. Restart affected BZZZ agents

#### Election System Failure (Alert: BZZZNoAdminElected)

**Symptoms**: No admin elected or frequent leadership changes

**Immediate Actions**:
1. Check election state on all nodes
2. Review heartbeat status
3. Verify role configurations

**Investigation Steps**:
```bash
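# Check election status on each node
# (docker exec only reaches local containers, so each command runs over SSH; SSH access to the nodes is assumed)
for node in walnut ironwood acacia; do
  echo "Node $node election status:"
  ssh "$node" 'docker exec "$(docker ps -q -f name=bzzz-agent | head -n 1)" curl -s localhost:8081/health/checks' \
    | jq '.checks["election-health"]'
done

# Check role configurations (inspect returns a JSON array; -r strips the quotes before decoding)
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | grep -A5 -B5 role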
```

**Recovery Actions**:
1. Force re-election by restarting election managers
2. Fix role configuration issues
3. Clear election state if corrupted
4. Ensure at least one node has admin capabilities

#### DHT Replication Failure (Alert: BZZZDHTReplicationDegraded)

**Symptoms**: Average replication factor < 2

**Immediate Actions**:
1. Check DHT provider records
2. Verify replication manager status
3. Check storage availability

**Investigation Steps**:
```bash
# Check DHT metrics
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_replication_factor' | jq

# Check provider records
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_provider_records' | jq

# Check replication manager logs (2>&1 includes lines the containers wrote to stderr)
docker service logs bzzz-v2_bzzz-agent 2>&1 | grep -i replication
```

**Recovery Actions**:
1. Restart replication managers
2. Force re-provision of content
3. Check and fix storage issues
4. Verify DHT network connectivity

### Escalation Procedures

#### When to Escalate
- Unable to resolve P0/P1 incident within target time
- Incident requires specialized knowledge
- Multiple systems affected
- Potential security implications

#### Escalation Contacts
1. **Technical Lead**: @tech-lead (Slack)
2. **Infrastructure Team**: @infra-team (Slack)
3. **Management**: @management (for business-critical issues)

## Health Check Procedures

### Manual Health Verification

#### System-Level Checks
```bash
# 1. Overall system health
curl -s https://bzzz.deepblack.cloud/health | jq '.status'

# 2. Component health checks
curl -s https://bzzz.deepblack.cloud/health/checks | jq

# 3. Resource utilization
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

# 4. Service status
docker service ls | grep bzzz

# 5. Network connectivity
docker network ls | grep bzzz
```

#### Component-Specific Checks

**P2P Network**:
```bash
# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers'

# Test P2P messaging
docker exec -it $(docker ps -q -f name=bzzz-agent) \
  /app/bzzz test-p2p-message
```

**DHT Storage**:
```bash
# Check DHT operations (-G with --data-urlencode stops curl from treating [5m] as a URL glob)
curl -sG http://prometheus:9090/api/v1/query --data-urlencode 'query=rate(bzzz_dht_put_operations_total[5m])'

# Test DHT functionality
docker exec -it $(docker ps -q -f name=bzzz-agent) \
  /app/bzzz test-dht-operations
```

**Election System**:
```bash
# Check current admin
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_election_state'

# Check heartbeat status
curl -s https://bzzz.deepblack.cloud/api/election/status | jq
```

### Automated Health Monitoring

#### Prometheus Queries for Health
```promql
# Overall system health
bzzz_system_health_score

# Component health scores
bzzz_component_health_score

# SLI compliance (rate() takes a single range vector, so each counter is rated before adding)
rate(bzzz_health_checks_passed_total[5m]) / (rate(bzzz_health_checks_failed_total[5m]) + rate(bzzz_health_checks_passed_total[5m]))

# Error budget burn rate
1 - bzzz:dht_success_rate > 0.01  # 1% error budget
```
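
The two ratio expressions above reduce to simple arithmetic: compliance is passed / (passed + failed), and the budget check asks whether 1 minus the success rate exceeds 1%. A small sketch for sanity-checking values by hand; both function names are hypothetical.

```bash
# Compliance ratio = passed / (passed + failed), printed with four decimals.
sli_ratio() {
  awk -v p="$1" -v f="$2" 'BEGIN {
    total = p + f
    if (total == 0) { print "no data"; exit 1 }
    printf "%.4f\n", p / total
  }'
}

# Succeed when the error rate (1 - success rate) exceeds the budget (default 1%).
budget_exceeded() {
  awk -v s="$1" -v b="${2:-0.01}" 'BEGIN { exit !((1 - s) > b + 0) }'
}
```

For example, `sli_ratio 99 1` prints `0.9900`, and `budget_exceeded 0.98` succeeds because a 2% error rate is above the 1% budget.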

#### Alert Validation
After resolving issues, verify alerts clear:
```bash
# Check if alerts are resolved (API v2 returns a bare JSON array)
curl -s http://alertmanager:9093/api/v2/alerts | \
  jq '.[] | select(.status.state == "active") | .labels.alertname'
```

## Performance Tuning

### Resource Optimization

#### Memory Tuning
```bash
# Increase memory limits for heavy workloads
docker service update --limit-memory 8G bzzz-v2_bzzz-agent

# Optimize JVM heap size (if applicable)
docker service update \
  --env-add JAVA_OPTS="-Xmx4g -Xms2g" \
  bzzz-v2_bzzz-agent
```

#### CPU Optimization
```bash
# Adjust CPU limits
docker service update --limit-cpu 4 bzzz-v2_bzzz-agent

# Restrict critical services to nodes labeled cpu_type=high_performance
docker service update \
  --constraint-add 'node.labels.cpu_type==high_performance' \
  bzzz-v2_bzzz-agent
```

#### Network Optimization
```bash
# Optimize network buffer sizes (requires root; run on each node)
echo 'net.core.rmem_max = 16777216' | sudo tee -a /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```

### Application-Level Tuning

#### DHT Performance
- Increase replication factor for critical content
- Optimize provider record refresh intervals
- Tune cache sizes based on memory availability

#### PubSub Performance
- Adjust message batch sizes
- Optimize topic subscription patterns
- Configure message retention policies

#### Election Stability
- Tune heartbeat intervals
- Adjust election timeouts based on network latency
- Optimize candidate scoring algorithms

### Monitoring Performance Impact
```bash
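# Before tuning - capture baseline (parameters are URL-encoded so [5m] is not read as a curl glob)
curl -sG http://prometheus:9090/api/v1/query_range \
  --data-urlencode 'query=rate(bzzz_dht_operation_latency_seconds_sum[5m])/rate(bzzz_dht_operation_latency_seconds_count[5m])' \
  --data-urlencode 'start=2025-01-01T00:00:00Z' \
  --data-urlencode 'end=2025-01-01T01:00:00Z' \
  --data-urlencode 'step=60s'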

# After tuning - compare results
# Use Grafana dashboards to visualize improvements
```

## Backup and Recovery

### Critical Data Identification

#### Persistent Data
- **PostgreSQL Database**: User data, task history, conversation threads
- **DHT Content**: Distributed content storage
- **Configuration**: Docker secrets, configs, service definitions
- **Prometheus Data**: Historical metrics (optional but valuable)

#### Backup Schedule
- **PostgreSQL**: Daily full backup, continuous WAL archiving
- **Configuration**: Weekly backup, immediately after changes
- **Prometheus**: Weekly backup of selected metrics

### Backup Procedures

#### Database Backup
```bash
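# Capture the timestamp once so the dump and gzip steps refer to the same file
TS=$(date +%Y%m%d_%H%M%S)

# Create database backup
# (assumes /backup inside the container is the host directory /rust/bzzz-v2/backups)
docker exec "$(docker ps -q -f name=postgres | head -n 1)" \
  pg_dump -U bzzz -d bzzz_v2 -f "/backup/bzzz_${TS}.sql"

# Compress and store
gzip "/rust/bzzz-v2/backups/bzzz_${TS}.sql"
aws s3 cp /rust/bzzz-v2/backups/ s3://chorus-backups/bzzz/ --recursive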
```

#### Configuration Backup
```bash
# Export secret metadata (Docker never returns secret values; back those up from their source of truth)
for secret in $(docker secret ls -q); do
  docker secret inspect $secret > /backup/secrets/${secret}.json
done

# Export all configs
for config in $(docker config ls -q); do
  docker config inspect $config > /backup/configs/${config}.json
done

# Export service definitions (a single inspect call writes one valid JSON array)
docker service inspect $(docker service ls -q) > /backup/services.json
```

#### Prometheus Data Backup
```bash
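# Snapshot Prometheus data (requires Prometheus to run with --web.enable-admin-api)
SNAPSHOT=$(curl -s -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')

# Copy snapshot to backup location (the snapshot directory name comes from the API response)
docker cp "$(docker ps -q -f name=prometheus | head -n 1):/prometheus/snapshots/${SNAPSHOT}" \
  "/backup/prometheus/$(date +%Y%m%d)"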
```

### Recovery Procedures

#### Full System Recovery
1. **Restore Infrastructure**: Deploy Docker Swarm stack
2. **Restore Configuration**: Import secrets and configs
3. **Restore Database**: Restore PostgreSQL from backup
4. **Validate Services**: Verify all services are healthy
5. **Test Functionality**: Run end-to-end tests

#### Database Recovery
```bash
# Stop application services
docker service scale bzzz-v2_bzzz-agent=0

# Restore database
gunzip -c /backup/bzzz_20250101_120000.sql.gz | \
  docker exec -i $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2

# Start application services
docker service scale bzzz-v2_bzzz-agent=3
```
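
The restore command above hard-codes a backup filename. Because the filenames embed a sortable timestamp, the most recent dump can be selected by name. A minimal sketch; `latest_backup` is a hypothetical helper, and the `bzzz_*.sql.gz` pattern is inferred from the example filename.

```bash
# Print the newest backup in a directory. Filenames embed a sortable
# timestamp (bzzz_YYYYmmdd_HHMMSS.sql.gz), so lexical order is chronological.
latest_backup() {
  local dir="${1:-/backup}" newest="" f
  for f in "$dir"/bzzz_*.sql.gz; do
    [ -e "$f" ] || continue      # glob matched nothing
    newest="$f"                  # globs expand in sorted order; last is newest
  done
  [ -n "$newest" ] && printf '%s\n' "$newest"
}

# Example:
#   gunzip -c "$(latest_backup /backup)" | docker exec -i ... psql -U bzzz -d bzzz_v2
```

The function returns a non-zero status when no backup matches, so a restore script can stop before touching the database.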

#### Point-in-Time Recovery
```bash
# Take a base backup (the starting point for WAL-based recovery)
docker exec $(docker ps -q -f name=postgres) \
  pg_basebackup -U postgres -D /backup/base -X stream -P

# Restore to specific time
# (Implementation depends on PostgreSQL configuration)
```

### Recovery Testing

#### Monthly Recovery Tests
```bash
# Test database restore
./scripts/test-db-restore.sh

# Test configuration restore
./scripts/test-config-restore.sh

# Test full system restore (staging environment)
./scripts/test-full-restore.sh staging
```

#### Recovery Validation
- Verify all services start successfully
- Check data integrity and completeness
- Validate P2P network connectivity
- Test core functionality (task coordination, context generation)
- Monitor system health for 24 hours post-recovery

## Troubleshooting Guide

### Log Analysis

#### Centralized Logging
```bash
# View aggregated logs through Loki
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="bzzz"}' \
  --data-urlencode 'start=2025-01-01T00:00:00Z' \
  --data-urlencode 'end=2025-01-01T01:00:00Z' | jq

# Search for specific errors
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="bzzz"} |= "ERROR"' | jq
```

#### Service-Specific Logs
```bash
# BZZZ agent logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100

# DHT bootstrap logs
docker service logs bzzz-v2_dht-bootstrap-walnut -f

# Database logs
docker service logs bzzz-v2_postgres -f

# Filter for specific patterns (2>&1 includes lines the containers wrote to stderr)
docker service logs bzzz-v2_bzzz-agent 2>&1 | grep -E "(ERROR|FATAL|panic)"
```
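
Beyond filtering, a quick count per severity keyword helps judge whether an error burst is new. A small sketch that reads log lines from stdin; `count_severities` is a hypothetical helper, and the keywords are the ones used in the grep pattern above.

```bash
# Count log lines per severity keyword from stdin.
count_severities() {
  awk '
    /panic/ { n["panic"]++ }
    /FATAL/ { n["FATAL"]++ }
    /ERROR/ { n["ERROR"]++ }
    END {
      printf "ERROR=%d FATAL=%d panic=%d\n", n["ERROR"], n["FATAL"], n["panic"]
    }'
}

# Example:
#   docker service logs bzzz-v2_bzzz-agent --since 1h 2>&1 | count_severities
```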

### Common Issues and Solutions

#### "No Admin Elected" Error
```bash
# Check role configurations (inspect returns a JSON array; -r strips the quotes before decoding)
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | yq '.agent.role'

# Force election
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz trigger-election

# Restart election managers
docker service update --force bzzz-v2_bzzz-agent
```

#### "DHT Operations Failing" Error
```bash
# Check DHT bootstrap nodes
for port in 9101 9102 9103; do
  nc -zv localhost $port
done

# Restart DHT services
docker service update --force bzzz-v2_dht-bootstrap-walnut
docker service update --force bzzz-v2_dht-bootstrap-ironwood
docker service update --force bzzz-v2_dht-bootstrap-acacia

# Clear DHT cache (sh -c makes the glob expand inside the container, not on the host)
docker exec "$(docker ps -q -f name=bzzz-agent | head -n 1)" sh -c 'rm -rf /app/data/dht/cache/*'
```

#### "High Memory Usage" Alert
```bash
# Identify memory-hungry containers (tab-separated output so sort can key on the percentage column)
docker stats --no-stream --format "{{.Container}}\t{{.MemUsage}}\t{{.MemPerc}}" | sort -t "$(printf '\t')" -k3 -rn

# Check for memory leaks (capture a heap profile from the agent's pprof endpoint; analysis needs a Go toolchain)
curl -s http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof -top heap.prof

# Restart high-memory services
docker service update --force bzzz-v2_bzzz-agent
```

#### "Network Connectivity Issues"
```bash
# Check overlay network
docker network inspect bzzz-internal

# Test connectivity between services
docker run --rm --network bzzz-internal nicolaka/netshoot ping -c 3 postgres

# Check firewall rules (-n prints numeric ports so the grep can match them)
iptables -L -n | grep -E "(9000|9101|9102|9103)"

# Restart networking
# (Swarm manages overlay attachments for service tasks, so recreate the tasks instead of reconnecting containers by hand)
docker service update --force bzzz-v2_bzzz-agent
```

### Performance Issues

#### High Latency Diagnosis
```bash
# Check operation latencies (the query is URL-encoded so the space and [5m] survive)
curl -sG http://prometheus:9090/api/v1/query \
  --data-urlencode 'query=histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket[5m]))'

# Identify bottlenecks
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz profile-cpu 30

# Check network latency between nodes
for node in walnut ironwood acacia; do
  ping -c 10 $node | tail -1
done
```
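
To compare nodes numerically, the average round-trip time can be pulled out of the ping summary line. A minimal sketch; `ping_avg_ms` is a hypothetical helper and assumes a summary of the form `rtt min/avg/max/mdev = a/b/c/d ms`, as printed by common Linux ping implementations.

```bash
# Extract the average RTT (ms) from a ping summary line read on stdin.
# Prints nothing when the line has no " = " (for example, total packet loss).
ping_avg_ms() {
  awk -F' = ' 'NF == 2 { split($2, v, "/"); print v[2] }'
}

# Example:
#   for node in walnut ironwood acacia; do
#     echo "$node: $(ping -c 10 "$node" | tail -1 | ping_avg_ms) ms"
#   done
```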

#### Resource Contention
```bash
# Check CPU usage
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}"

# Check I/O wait
iostat -x 1 5

# Check network utilization
iftop -i eth0
```

### Debugging Tools

#### Application Debugging
```bash
# Enable debug logging
docker service update --env-add LOG_LEVEL=debug bzzz-v2_bzzz-agent

# Access debug endpoints
curl -s http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof heap.prof

# Trace requests
curl -s http://localhost:8080/debug/requests
```

#### System Debugging
```bash
# System resource usage
htop
iotop
nethogs

# Process analysis
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20

# Network analysis
netstat -tulpn | grep -E ":9000|:9101|:9102|:9103"
ss -tuln | grep -E ":9000|:9101|:9102|:9103"
```

## Maintenance Procedures

### Scheduled Maintenance

#### Weekly Maintenance (Low-impact)
- Review system health metrics
- Check log sizes and rotate if necessary
- Update monitoring dashboards
- Validate backup integrity

#### Monthly Maintenance (Medium-impact)
- Update non-critical components
- Perform capacity planning review
- Test disaster recovery procedures
- Security scan and updates

#### Quarterly Maintenance (High-impact)
- Major version updates
- Infrastructure upgrades
- Performance optimization review
- Security audit and remediation

### Update Procedures

#### Rolling Updates
```bash
# Update with zero downtime
docker service update \
  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-failure-action rollback \
  bzzz-v2_bzzz-agent
```
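
After starting a rolling update, it helps to confirm that every desired task is running again before moving on. A minimal sketch that parses the `running/desired` replica string reported by `docker service ls`; `replicas_converged` is a hypothetical helper.

```bash
# Succeed when a "running/desired" replica string (e.g. "3/3") shows
# every desired task running and at least one task desired.
replicas_converged() {
  local r="${1%% *}"                       # drop any trailing annotation
  local running="${r%%/*}" desired="${r##*/}"
  [ "$desired" -gt 0 ] 2>/dev/null && [ "$running" -eq "$desired" ] 2>/dev/null
}

# Example: poll until the agent service has all replicas running.
#   until replicas_converged "$(docker service ls \
#       --filter name=bzzz-v2_bzzz-agent --format '{{.Replicas}}')"; do
#     sleep 5
#   done
```

Replica counts alone can read as complete while older tasks are still being replaced, so it is worth also checking `docker service inspect --format '{{.UpdateStatus.State}}' bzzz-v2_bzzz-agent` for `completed`.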

#### Configuration Updates
```bash
# Update configuration (swapping a config triggers a rolling restart of the service's tasks)
docker config create bzzz_v2_config_new /path/to/new/config.yaml

docker service update \
  --config-rm bzzz_v2_config \
  --config-add source=bzzz_v2_config_new,target=/app/config/config.yaml \
  bzzz-v2_bzzz-agent

# Cleanup old config
docker config rm bzzz_v2_config
```

#### Database Maintenance
```bash
# Database optimization
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "VACUUM ANALYZE;"

# Update statistics
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "ANALYZE;"

# Check database size
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "SELECT pg_size_pretty(pg_database_size('bzzz_v2'));"
```

### Capacity Planning

#### Growth Projections
- Monitor resource usage trends over time
- Project capacity needs based on growth patterns
- Plan for seasonal or event-driven spikes

#### Scaling Decisions
```bash
# Horizontal scaling
docker service scale bzzz-v2_bzzz-agent=5

# Vertical scaling
docker service update \
  --limit-memory 8G \
  --limit-cpu 4 \
  bzzz-v2_bzzz-agent

# Add new node to swarm: print the join command on a manager, then run it on the new node
docker swarm join-token worker
```

#### Resource Monitoring
- Set up capacity alerts at 70% utilization
- Monitor growth rate and extrapolate
- Plan infrastructure expansions 3-6 months ahead
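
The extrapolation step can be approximated with a linear projection: given current utilization and an observed daily growth rate, estimate how long until the 70% alert level is reached. A rough sketch under that linear-growth assumption; `days_until_threshold` is a hypothetical helper.

```bash
# Estimate whole days until utilization reaches a threshold, assuming
# linear growth. Arguments: current %, growth in % per day, threshold %
# (default 70, the capacity alert level above).
days_until_threshold() {
  awk -v cur="$1" -v rate="$2" -v thr="${3:-70}" 'BEGIN {
    if (cur + 0 >= thr + 0) { print 0; exit }
    if (rate + 0 <= 0)      { print "never"; exit }
    printf "%d\n", (thr - cur) / rate
  }'
}
```

For example, `days_until_threshold 50 0.5` prints `40`: at 0.5 percentage points per day, 50% utilization reaches 70% in 40 days, which is well inside the 3-6 month planning horizon.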

---

## Contact Information

**Primary Contact**: Tony (@tony)
**Team**: BZZZ Infrastructure Team
**Documentation**: https://wiki.chorus.services/bzzz
**Source Code**: https://gitea.chorus.services/tony/BZZZ

**Last Updated**: 2025-01-01
**Version**: 2.0
**Review Date**: 2025-04-01