🚀 Complete BZZZ Issue Resolution - All 17 Issues Solved

Comprehensive multi-agent implementation addressing all issues from INDEX.md:

## Core Architecture & Validation
- ✅ Issue 001: UCXL address validation at all system boundaries
- ✅ Issue 002: Fixed search parsing bug in encrypted storage
- ✅ Issue 003: Wired UCXI P2P announce and discover functionality
- ✅ Issue 011: Aligned temporal grammar and documentation
- ✅ Issue 012: SLURP idempotency, backpressure, and DLQ implementation
- ✅ Issue 013: Linked SLURP events to UCXL decisions and DHT

## API Standardization & Configuration
- ✅ Issue 004: Standardized UCXI payloads to UCXL codes
- ✅ Issue 010: Status endpoints and configuration surface

## Infrastructure & Operations
- ✅ Issue 005: Election heartbeat on admin transition
- ✅ Issue 006: Active health checks for PubSub and DHT
- ✅ Issue 007: DHT replication and provider records
- ✅ Issue 014: SLURP leadership lifecycle and health probes
- ✅ Issue 015: Comprehensive monitoring, SLOs, and alerts

## Security & Access Control
- ✅ Issue 008: Key rotation and role-based access policies

## Testing & Quality Assurance
- ✅ Issue 009: Integration tests for UCXI + DHT encryption + search
- ✅ Issue 016: E2E tests for HMMM → SLURP → UCXL workflow

## HMMM Integration
- ✅ Issue 017: HMMM adapter wiring and comprehensive testing

## Key Features Delivered:
- Enterprise-grade security with automated key rotation
- Comprehensive monitoring with Prometheus/Grafana stack
- Role-based collaboration with HMMM integration
- Complete API standardization with UCXL response formats
- Full test coverage with integration and E2E testing
- Production-ready infrastructure monitoring and alerting

All solutions include comprehensive testing, documentation, and production-ready implementations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-08-29 12:39:38 +10:00
parent 59f40e17a5
commit 92779523c0
136 changed files with 56649 additions and 134 deletions
--- a/infrastructure/docs/OPERATIONAL_RUNBOOK.md
+++ b/infrastructure/docs/OPERATIONAL_RUNBOOK.md
@@ -0,0 +1,835 @@
+# BZZZ Infrastructure Operational Runbook
+
+## Table of Contents
+1. [Quick Reference](#quick-reference)
+2. [System Architecture Overview](#system-architecture-overview)
+3. [Common Operational Tasks](#common-operational-tasks)
+4. [Incident Response Procedures](#incident-response-procedures)
+5. [Health Check Procedures](#health-check-procedures)
+6. [Performance Tuning](#performance-tuning)
+7. [Backup and Recovery](#backup-and-recovery)
+8. [Troubleshooting Guide](#troubleshooting-guide)
+9. [Maintenance Procedures](#maintenance-procedures)
+
+## Quick Reference
+
+### Critical Service Endpoints
+- **Grafana Dashboard**: https://grafana.chorus.services
+- **Prometheus**: https://prometheus.chorus.services
+- **AlertManager**: https://alerts.chorus.services
+- **BZZZ Main API**: https://bzzz.deepblack.cloud
+- **Health Checks**: https://bzzz.deepblack.cloud/health
+
+### Emergency Contacts
+- **Primary On-Call**: Slack #bzzz-alerts
+- **System Administrator**: @tony
+- **Infrastructure Team**: @platform-team
+
+### Key Commands
+```bash
+# Check system health
+curl -s https://bzzz.deepblack.cloud/health | jq
+
+# View logs
+docker service logs bzzz-v2_bzzz-agent -f --tail 100
+
+# Scale service
+docker service scale bzzz-v2_bzzz-agent=5
+
+# Force service update
+docker service update --force bzzz-v2_bzzz-agent
+```
+
+## System Architecture Overview
+
+### Component Relationships
+```
+┌─────────────┐    ┌─────────────┐    ┌─────────────┐
+│   PubSub    │────│     DHT     │────│  Election   │
+│  Messaging  │    │   Storage   │    │   Manager   │
+└─────────────┘    └─────────────┘    └─────────────┘
+       │                   │                   │
+       └───────────────────┼───────────────────┘
+                           │
+                    ┌─────────────┐
+                    │    SLURP    │
+                    │   Context   │
+                    │  Generator  │
+                    └─────────────┘
+                           │
+                    ┌─────────────┐
+                    │    UCXI     │
+                    │  Protocol   │
+                    │  Resolver   │
+                    └─────────────┘
+```
+
+### Data Flow
+1. **Task Requests** → PubSub → Task Coordinator → SLURP (if admin)
+2. **Context Generation** → DHT Storage → UCXI Resolution
+3. **Health Monitoring** → Prometheus → AlertManager → Notifications
+
+### Critical Dependencies
+- **Docker Swarm**: Container orchestration
+- **NFS Storage**: Persistent data storage
+- **Prometheus Stack**: Monitoring and alerting
+- **DHT Bootstrap Nodes**: P2P network foundation
+
+## Common Operational Tasks
+
+### Service Management
+
+#### Check Service Status
+```bash
+# List all BZZZ services
+docker service ls | grep bzzz
+
+# Check specific service
+docker service ps bzzz-v2_bzzz-agent
+
+# View service configuration
+docker service inspect bzzz-v2_bzzz-agent
+```
+
+#### Scale Services
+```bash
+# Scale main BZZZ service
+docker service scale bzzz-v2_bzzz-agent=5
+
+# Scale monitoring stack
+docker service scale bzzz-monitoring_prometheus=1
+docker service scale bzzz-monitoring_grafana=1
+```
+
+#### Update Services
+```bash
+# Update to new image version
+docker service update \
+  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
+  bzzz-v2_bzzz-agent
+
+# Update environment variables
+docker service update \
+  --env-add LOG_LEVEL=debug \
+  bzzz-v2_bzzz-agent
+
+# Update resource limits
+docker service update \
+  --limit-memory 4G \
+  --limit-cpu 2 \
+  bzzz-v2_bzzz-agent
+```
+
+### Configuration Management
+
+#### Update Docker Secrets
+```bash
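+# Create new secret (printf avoids the trailing newline that echo would store in the secret)
+printf '%s' "new_password" | docker secret create bzzz_postgres_password_v2 -
+
+# Update service to use new secret
+# (target= keeps the original file name under /run/secrets so the application path does not change)
+docker service update \
+  --secret-rm bzzz_postgres_password \
+  --secret-add source=bzzz_postgres_password_v2,target=bzzz_postgres_password \
+  bzzz-v2_postgres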
+```
+
+#### Update Docker Configs
+```bash
+# Create new config
+docker config create bzzz_v2_config_v3 /path/to/new/config.yaml
+
+# Update service
+docker service update \
+  --config-rm bzzz_v2_config \
+  --config-add source=bzzz_v2_config_v3,target=/app/config/config.yaml \
+  bzzz-v2_bzzz-agent
+```
+
+### Monitoring and Alerting
+
+#### Check Alert Status
+```bash
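+# View active alerts
+# (Alertmanager 0.27+ removed the v1 API; there, use /api/v2/alerts, which returns a bare array without the .data wrapper)
+curl -s http://alertmanager:9093/api/v1/alerts | jq '.data[] | select(.status.state == "active")'
+
+# Silence alert
+curl -X POST http://alertmanager:9093/api/v1/silences \
+  -H 'Content-Type: application/json' \
+  -d '{
+    "matchers": [{"name": "alertname", "value": "BZZZSystemHealthCritical", "isRegex": false}],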
+    "startsAt": "2025-01-01T00:00:00Z",
+    "endsAt": "2025-01-01T01:00:00Z",
+    "comment": "Maintenance window",
+    "createdBy": "operator"
+  }'
+```
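+
+Hard-coded `startsAt`/`endsAt` values go stale quickly. A minimal sketch for deriving a silence window from a start time and a duration, assuming GNU `date` (the `-d` flag is not available in BSD/macOS `date`). The `silence_window` helper is hypothetical and not part of the BZZZ tooling:
+
+```bash
+# Print "<startsAt> <endsAt>" as UTC timestamps for a window of N minutes.
+# Usage: silence_window <start_epoch_seconds> <duration_minutes>
+silence_window() {
+  local start_epoch="$1" minutes="$2"
+  local end_epoch=$(( start_epoch + minutes * 60 ))
+  printf '%s %s\n' \
+    "$(date -u -d "@${start_epoch}" +%Y-%m-%dT%H:%M:%SZ)" \
+    "$(date -u -d "@${end_epoch}" +%Y-%m-%dT%H:%M:%SZ)"
+}
+
+# Example: a one-hour maintenance window starting now
+silence_window "$(date +%s)" 60
+```
+
+The two printed values can be substituted into the `startsAt` and `endsAt` fields of the silence payload.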
+
+#### Query Metrics
+```bash
+# Check system health
+curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq
+
+# Check connected peers
+curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' | jq
+
+# Check error rates (-G with --data-urlencode stops curl from treating [5m] as a URL glob)
+curl -sG 'http://prometheus:9090/api/v1/query' \
+  --data-urlencode 'query=rate(bzzz_errors_total[5m])' | jq
+```
+
+## Incident Response Procedures
+
+### Severity Levels
+
+#### Critical (P0)
+- System completely unavailable
+- Data loss or corruption
+- Security breach
+- **Response Time**: 15 minutes
+- **Resolution Target**: 2 hours
+
+#### High (P1)
+- Major functionality impaired
+- Performance severely degraded
+- **Response Time**: 1 hour
+- **Resolution Target**: 4 hours
+
+#### Medium (P2)
+- Minor functionality issues
+- Performance slightly degraded
+- **Response Time**: 4 hours
+- **Resolution Target**: 24 hours
+
+#### Low (P3)
+- Cosmetic issues
+- Enhancement requests
+- **Response Time**: 24 hours
+- **Resolution Target**: 1 week
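+
+The targets above can be encoded as a small lookup for use in paging or ticketing scripts. This is a sketch only; the `severity_targets` name is hypothetical, and the values are the response and resolution targets listed above, converted to minutes:
+
+```bash
+# Print "<response_minutes> <resolution_minutes>" for a severity level.
+severity_targets() {
+  case "$1" in
+    P0) echo "15 120" ;;      # 15 minutes / 2 hours
+    P1) echo "60 240" ;;      # 1 hour / 4 hours
+    P2) echo "240 1440" ;;    # 4 hours / 24 hours
+    P3) echo "1440 10080" ;;  # 24 hours / 1 week
+    *)  echo "unknown severity: $1" >&2; return 1 ;;
+  esac
+}
+
+# Example: look up the targets for a P1 incident
+severity_targets P1
+```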
+
+### Common Incident Scenarios
+
+#### System Health Critical (Alert: BZZZSystemHealthCritical)
+
+**Symptoms**: System health score < 0.5
+
+**Immediate Actions**:
+1. Check Grafana dashboard for component failures
+2. Review recent deployments or changes
+3. Check resource utilization (CPU, memory, disk)
+4. Verify P2P connectivity
+
+**Investigation Steps**:
+```bash
+# Check overall system status
+curl -s https://bzzz.deepblack.cloud/health | jq
+
+# Check component health
+curl -s https://bzzz.deepblack.cloud/health/checks | jq
+
+# Review recent logs
+docker service logs bzzz-v2_bzzz-agent --since 1h | tail -100
+
+# Check resource usage
+docker stats --no-stream
+```
+
+**Recovery Actions**:
+1. If memory leak: Restart affected services
+2. If disk full: Clean up logs and temporary files
+3. If network issues: Restart networking components
+4. If database issues: Check PostgreSQL health
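+
+When scripting around this alert, the `< 0.5` comparison needs floating-point handling, which plain shell arithmetic lacks. A minimal sketch using `awk`; the `health_is_critical` helper is hypothetical, and the 0.5 default mirrors the symptom threshold above:
+
+```bash
+# Return 0 when the score is below the critical threshold (default 0.5).
+# Usage: health_is_critical <score> [threshold]
+health_is_critical() {
+  local score="$1" threshold="${2:-0.5}"
+  awk -v s="$score" -v t="$threshold" 'BEGIN { exit !(s + 0 < t + 0) }'
+}
+
+# Example: react to a score read from monitoring
+if health_is_critical 0.42; then
+  echo "health score is critical"
+fi
+```
+
+The score itself could be taken from the `bzzz_system_health_score` metric; for an instant query, Prometheus returns the sample value as a string at `.data.result[0].value[1]`.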
+
+#### P2P Network Partition (Alert: BZZZInsufficientPeers)
+
+**Symptoms**: Connected peers < 3
+
+**Immediate Actions**:
+1. Check network connectivity between nodes
+2. Verify DHT bootstrap nodes are running
+3. Check firewall rules and port accessibility
+
+**Investigation Steps**:
+```bash
+# Check DHT bootstrap nodes
+for node in walnut:9101 ironwood:9102 acacia:9103; do
+  echo "Checking $node:"
+  nc -zv ${node%:*} ${node#*:}
+done
+
+# Check P2P connectivity
+docker service logs bzzz-v2_dht-bootstrap-walnut --since 1h
+
+# Test network between nodes
+docker run --rm --network host nicolaka/netshoot ping -c 3 ironwood
+```
+
+**Recovery Actions**:
+1. Restart DHT bootstrap services
+2. Clear peer store if corrupted
+3. Check and fix network configuration
+4. Restart affected BZZZ agents
+
+#### Election System Failure (Alert: BZZZNoAdminElected)
+
+**Symptoms**: No admin elected or frequent leadership changes
+
+**Immediate Actions**:
+1. Check election state on all nodes
+2. Review heartbeat status
+3. Verify role configurations
+
+**Investigation Steps**:
+```bash
+# Check election status on each node
+# (assumes SSH access to each node and that curl is available in the agent container)
+for node in walnut ironwood acacia; do
+  echo "Node $node election status:"
+  ssh "$node" 'docker exec "$(docker ps -q -f name=bzzz-agent | head -n 1)" curl -s localhost:8081/health/checks' \
+    | jq '.checks["election-health"]'
+done
+
+# Check role configurations (inspect returns a JSON array; -r strips the quotes before decoding)
+docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | grep -A5 -B5 role
+```
+
+**Recovery Actions**:
+1. Force re-election by restarting election managers
+2. Fix role configuration issues
+3. Clear election state if corrupted
+4. Ensure at least one node has admin capabilities
+
+#### DHT Replication Failure (Alert: BZZZDHTReplicationDegraded)
+
+**Symptoms**: Average replication factor < 2
+
+**Immediate Actions**:
+1. Check DHT provider records
+2. Verify replication manager status
+3. Check storage availability
+
+**Investigation Steps**:
+```bash
+# Check DHT metrics
+curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_replication_factor' | jq
+
+# Check provider records
+curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_provider_records' | jq
+
+# Check replication manager logs
+docker service logs bzzz-v2_bzzz-agent | grep -i replication
+```
+
+**Recovery Actions**:
+1. Restart replication managers
+2. Force re-provision of content
+3. Check and fix storage issues
+4. Verify DHT network connectivity
+
+### Escalation Procedures
+
+#### When to Escalate
+- Unable to resolve P0/P1 incident within target time
+- Incident requires specialized knowledge
+- Multiple systems affected
+- Potential security implications
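+
+The first criterion is time-based and can be checked mechanically. A sketch, assuming the P0 and P1 resolution targets from the severity levels (2 hours and 4 hours); `should_escalate` is a hypothetical helper, and the other criteria still need human judgment:
+
+```bash
+# Return 0 (escalate) when a P0/P1 incident has been open longer than its resolution target.
+# Usage: should_escalate <severity> <minutes_open>
+should_escalate() {
+  local severity="$1" open_minutes="$2" target
+  case "$severity" in
+    P0) target=120 ;;   # 2 hours
+    P1) target=240 ;;   # 4 hours
+    *)  return 1 ;;     # P2/P3 are not covered by the time-based rule
+  esac
+  [ "$open_minutes" -gt "$target" ]
+}
+
+# Example: a P0 incident open for 150 minutes
+if should_escalate P0 150; then
+  echo "escalate to the technical lead"
+fi
+```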
+
+#### Escalation Contacts
+1. **Technical Lead**: @tech-lead (Slack)
+2. **Infrastructure Team**: @infra-team (Slack)
+3. **Management**: @management (for business-critical issues)
+
+## Health Check Procedures
+
+### Manual Health Verification
+
+#### System-Level Checks
+```bash
+# 1. Overall system health
+curl -s https://bzzz.deepblack.cloud/health | jq '.status'
+
+# 2. Component health checks
+curl -s https://bzzz.deepblack.cloud/health/checks | jq
+
+# 3. Resource utilization
+docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
+
+# 4. Service status
+docker service ls | grep bzzz
+
+# 5. Network connectivity
+docker network ls | grep bzzz
+```
+
+#### Component-Specific Checks
+
+**P2P Network**:
+```bash
+# Check connected peers
+curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers'
+
+# Test P2P messaging
+docker exec -it $(docker ps -q -f name=bzzz-agent) \
+  /app/bzzz test-p2p-message
+```
+
+**DHT Storage**:
+```bash
+# Check DHT operations (-G with --data-urlencode stops curl from treating [5m] as a URL glob)
+curl -sG 'http://prometheus:9090/api/v1/query' \
+  --data-urlencode 'query=rate(bzzz_dht_put_operations_total[5m])'
+
+# Test DHT functionality
+docker exec -it $(docker ps -q -f name=bzzz-agent) \
+  /app/bzzz test-dht-operations
+```
+
+**Election System**:
+```bash
+# Check current admin
+curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_election_state'
+
+# Check heartbeat status
+curl -s https://bzzz.deepblack.cloud/api/election/status | jq
+```
+
+### Automated Health Monitoring
+
+#### Prometheus Queries for Health
+```promql
+# Overall system health
+bzzz_system_health_score
+
+# Component health scores
+bzzz_component_health_score
+
+# SLI compliance (rate() must wrap each range vector separately)
+rate(bzzz_health_checks_passed_total[5m]) / (rate(bzzz_health_checks_failed_total[5m]) + rate(bzzz_health_checks_passed_total[5m]))
+
+# Error budget exceeded: error rate above the 1% budget
+(1 - bzzz:dht_success_rate) > 0.01
+```
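+
+As a worked example of the arithmetic behind the SLI and error-budget expressions, the same ratios can be computed from raw counts. This is an illustrative sketch; `sli_compliance` and `budget_exceeded` are hypothetical helper names, and the 0.01 default is the 1% budget used above:
+
+```bash
+# Print SLI compliance = passed / (passed + failed) with four decimals.
+sli_compliance() {
+  local passed="$1" failed="$2"
+  awk -v p="$passed" -v f="$failed" 'BEGIN {
+    total = p + f
+    if (total == 0) { print "n/a"; exit }
+    printf "%.4f\n", p / total
+  }'
+}
+
+# Return 0 when the error rate (1 - success rate) exceeds the budget (default 0.01).
+budget_exceeded() {
+  local success_rate="$1" budget="${2:-0.01}"
+  awk -v s="$success_rate" -v b="$budget" 'BEGIN { exit !((1 - s) > b + 0) }'
+}
+
+# Example: 990 passed checks and 10 failed checks
+sli_compliance 990 10
+```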
+
+#### Alert Validation
+After resolving issues, verify alerts clear:
+```bash
+# Check if alerts are resolved
+curl -s http://alertmanager:9093/api/v1/alerts | \
+  jq '.data[] | select(.status.state == "active") | .labels.alertname'
+```
+
+## Performance Tuning
+
+### Resource Optimization
+
+#### Memory Tuning
+```bash
+# Increase memory limits for heavy workloads
+docker service update --limit-memory 8G bzzz-v2_bzzz-agent
+
+# Set JVM heap size (only relevant to JVM-based services; JAVA_OPTS has no effect on other runtimes)
+docker service update \
+  --env-add JAVA_OPTS="-Xmx4g -Xms2g" \
+  bzzz-v2_bzzz-agent
+```
+
+#### CPU Optimization
+```bash
+# Adjust CPU limits
+docker service update --limit-cpu 4 bzzz-v2_bzzz-agent
+
+# Pin critical services to high-performance nodes
+# (a placement constraint on a node label; Swarm does not set CPU affinity)
+docker service update \
+  --constraint-add 'node.labels.cpu_type==high_performance' \
+  bzzz-v2_bzzz-agent
+```
+
+#### Network Optimization
+```bash
+# Optimize network buffer sizes
+echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf
+echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf
+sysctl -p
+```
+
+### Application-Level Tuning
+
+#### DHT Performance
+- Increase replication factor for critical content
+- Optimize provider record refresh intervals
+- Tune cache sizes based on memory availability
+
+#### PubSub Performance  
+- Adjust message batch sizes
+- Optimize topic subscription patterns
+- Configure message retention policies
+
+#### Election Stability
+- Tune heartbeat intervals
+- Adjust election timeouts based on network latency
+- Optimize candidate scoring algorithms
+
+### Monitoring Performance Impact
+```bash
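+# Before tuning - capture baseline
+# (-G with --data-urlencode stops curl from treating [5m] as a URL glob)
+curl -sG 'http://prometheus:9090/api/v1/query_range' \
+  --data-urlencode 'query=rate(bzzz_dht_operation_latency_seconds_sum[5m])/rate(bzzz_dht_operation_latency_seconds_count[5m])' \
+  --data-urlencode 'start=2025-01-01T00:00:00Z' \
+  --data-urlencode 'end=2025-01-01T01:00:00Z' \
+  --data-urlencode 'step=60s'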
+
+# After tuning - compare results
+# Use Grafana dashboards to visualize improvements
+```
+
+## Backup and Recovery
+
+### Critical Data Identification
+
+#### Persistent Data
+- **PostgreSQL Database**: User data, task history, conversation threads
+- **DHT Content**: Distributed content storage
+- **Configuration**: Docker secrets, configs, service definitions
+- **Prometheus Data**: Historical metrics (optional but valuable)
+
+#### Backup Schedule
+- **PostgreSQL**: Daily full backup, continuous WAL archiving
+- **Configuration**: Weekly backup, immediately after changes
+- **Prometheus**: Weekly backup of selected metrics
+
+### Backup Procedures
+
+#### Database Backup
+```bash
+# Create a compressed database backup
+# (the timestamp is captured once; calling date twice can produce two different file names)
+ts=$(date +%Y%m%d_%H%M%S)
+docker exec $(docker ps -q -f name=postgres) \
+  pg_dump -U bzzz -d bzzz_v2 | gzip > /rust/bzzz-v2/backups/bzzz_${ts}.sql.gz
+
+# Copy to off-site storage
+aws s3 cp /rust/bzzz-v2/backups/ s3://chorus-backups/bzzz/ --recursive
+```
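+
+A backup is only useful if it can be read back. A minimal integrity check for a compressed dump, offered as a sketch; `verify_backup` is a hypothetical helper, and it only confirms that the file is a non-empty, valid gzip stream, not that the SQL inside restores cleanly:
+
+```bash
+# Return 0 only if the file exists, is non-empty, and passes gzip's integrity test.
+# Usage: verify_backup <path-to-.sql.gz>
+verify_backup() {
+  local file="$1"
+  [ -s "$file" ] || { echo "missing or empty: $file" >&2; return 1; }
+  gzip -t "$file" 2>/dev/null || { echo "corrupt gzip: $file" >&2; return 1; }
+}
+
+# Example: check a freshly written dump in a scratch directory
+scratch=$(mktemp -d)
+printf 'SELECT 1;\n' | gzip > "$scratch/example.sql.gz"
+verify_backup "$scratch/example.sql.gz" && echo "backup OK"
+rm -rf "$scratch"
+```
+
+A full restore test against a scratch database remains the stronger check.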
+
+#### Configuration Backup
+```bash
+# Export secret metadata only (docker secret inspect never returns secret values; keep the values in a separate secure store)
+for secret in $(docker secret ls -q); do
+  docker secret inspect $secret > /backup/secrets/${secret}.json
+done
+
+# Export all configs
+for config in $(docker config ls -q); do
+  docker config inspect $config > /backup/configs/${config}.json
+done
+
+# Export service definitions (a single inspect call yields one valid JSON array)
+docker service inspect $(docker service ls -q) > /backup/services.json
+```
+
+#### Prometheus Data Backup
+```bash
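+# Snapshot Prometheus data (requires Prometheus to run with --web.enable-admin-api)
+snapshot=$(curl -s -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')
+
+# Copy snapshot to backup location (the snapshot directory name comes from the API response)
+docker cp "prometheus_container:/prometheus/snapshots/${snapshot}" /backup/prometheus/$(date +%Y%m%d)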
+```
+
+### Recovery Procedures
+
+#### Full System Recovery
+1. **Restore Infrastructure**: Deploy Docker Swarm stack
+2. **Restore Configuration**: Import secrets and configs
+3. **Restore Database**: Restore PostgreSQL from backup
+4. **Validate Services**: Verify all services are healthy
+5. **Test Functionality**: Run end-to-end tests
+
+#### Database Recovery
+```bash
+# Stop application services
+docker service scale bzzz-v2_bzzz-agent=0
+
+# Restore database
+gunzip -c /backup/bzzz_20250101_120000.sql.gz | \
+  docker exec -i $(docker ps -q -f name=postgres) \
+  psql -U bzzz -d bzzz_v2
+
+# Start application services
+docker service scale bzzz-v2_bzzz-agent=3
+```
+
+#### Point-in-Time Recovery
+```bash
+# Take a base backup as the starting point for WAL-based recovery
+docker exec $(docker ps -q -f name=postgres) \
+  pg_basebackup -U postgres -D /backup/base -X stream -P
+
+# Restore to specific time
+# (Implementation depends on PostgreSQL configuration)
+```
+
+### Recovery Testing
+
+#### Monthly Recovery Tests
+```bash
+# Test database restore
+./scripts/test-db-restore.sh
+
+# Test configuration restore  
+./scripts/test-config-restore.sh
+
+# Test full system restore (staging environment)
+./scripts/test-full-restore.sh staging
+```
+
+#### Recovery Validation
+- Verify all services start successfully
+- Check data integrity and completeness
+- Validate P2P network connectivity
+- Test core functionality (task coordination, context generation)
+- Monitor system health for 24 hours post-recovery
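+
+Several of these validation steps amount to polling until a check passes. A generic retry helper, as a sketch; `wait_until` is a hypothetical name, and the example relies on `curl -f`, which exits non-zero on HTTP errors (whether `/health` returns an error status when unhealthy is an assumption):
+
+```bash
+# Run a command until it succeeds, up to <attempts> times, sleeping <delay> seconds between tries.
+# Usage: wait_until <attempts> <delay_seconds> <command...>
+wait_until() {
+  local attempts="$1" delay="$2" i=1
+  shift 2
+  while [ "$i" -le "$attempts" ]; do
+    "$@" && return 0
+    [ "$i" -lt "$attempts" ] && sleep "$delay"
+    i=$(( i + 1 ))
+  done
+  return 1
+}
+
+# Example (not executed here): poll the health endpoint for up to 5 minutes
+# wait_until 30 10 curl -sf https://bzzz.deepblack.cloud/health
+```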
+
+## Troubleshooting Guide
+
+### Log Analysis
+
+#### Centralized Logging
+```bash
+# View aggregated logs through Loki
+curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
+  --data-urlencode 'query={job="bzzz"}' \
+  --data-urlencode 'start=2025-01-01T00:00:00Z' \
+  --data-urlencode 'end=2025-01-01T01:00:00Z' | jq
+
+# Search for specific errors
+curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
+  --data-urlencode 'query={job="bzzz"} |= "ERROR"' | jq
+```
+
+#### Service-Specific Logs
+```bash
+# BZZZ agent logs
+docker service logs bzzz-v2_bzzz-agent -f --tail 100
+
+# DHT bootstrap logs
+docker service logs bzzz-v2_dht-bootstrap-walnut -f
+
+# Database logs
+docker service logs bzzz-v2_postgres -f
+
+# Filter for specific patterns
+docker service logs bzzz-v2_bzzz-agent | grep -E "(ERROR|FATAL|panic)"
+```
+
+### Common Issues and Solutions
+
+#### "No Admin Elected" Error
+```bash
+# Check role configurations (inspect returns a JSON array; -r strips the quotes before decoding)
+docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | yq '.agent.role'
+
+# Force election
+docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz trigger-election
+
+# Restart election managers
+docker service update --force bzzz-v2_bzzz-agent
+```
+
+#### "DHT Operations Failing" Error
+```bash
+# Check DHT bootstrap nodes
+for port in 9101 9102 9103; do
+  nc -zv localhost $port
+done
+
+# Restart DHT services
+docker service update --force bzzz-v2_dht-bootstrap-walnut
+docker service update --force bzzz-v2_dht-bootstrap-ironwood
+docker service update --force bzzz-v2_dht-bootstrap-acacia
+
+# Clear DHT cache (sh -c makes the glob expand inside the container, not on the host)
+docker exec -it $(docker ps -q -f name=bzzz-agent) sh -c 'rm -rf /app/data/dht/cache/*'
+```
+
+#### "High Memory Usage" Alert
+```bash
+# Identify memory-hungry containers (percentage first so the numeric sort works; highest usage on top)
+docker stats --no-stream --format "{{.MemPerc}}\t{{.Container}}\t{{.MemUsage}}" | sort -rn
+
+# Check for memory leaks (heap profile from the agent's pprof debug endpoint)
+curl -s http://localhost:8080/debug/pprof/heap > heap.prof
+go tool pprof -top heap.prof
+
+# Restart high-memory services
+docker service update --force bzzz-v2_bzzz-agent
+```
+
+#### "Network Connectivity Issues"
+```bash
+# Check overlay network
+docker network inspect bzzz-internal
+
+# Test connectivity between services
+docker run --rm --network bzzz-internal nicolaka/netshoot ping -c 3 postgres
+
+# Check firewall rules (-n keeps ports numeric so the grep can match them)
+iptables -L -n | grep -E "(9000|9101|9102|9103)"
+
+# Restart networking
+# (Swarm manages overlay attachments for service tasks, so recreate the tasks instead of reconnecting containers by hand)
+docker service update --force bzzz-v2_bzzz-agent
+```
+
+### Performance Issues
+
+#### High Latency Diagnosis
+```bash
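+# Check operation latencies (95th percentile; --data-urlencode handles the space and the [5m] brackets)
+curl -sG 'http://prometheus:9090/api/v1/query' \
+  --data-urlencode 'query=histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket[5m]))'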
+
+# Identify bottlenecks
+docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz profile-cpu 30
+
+# Check network latency between nodes
+for node in walnut ironwood acacia; do
+  ping -c 10 $node | tail -1
+done
+```
+
+#### Resource Contention
+```bash
+# Check CPU usage
+docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}"
+
+# Check I/O wait
+iostat -x 1 5
+
+# Check network utilization
+iftop -i eth0
+```
+
+### Debugging Tools
+
+#### Application Debugging
+```bash
+# Enable debug logging
+docker service update --env-add LOG_LEVEL=debug bzzz-v2_bzzz-agent
+
+# Access debug endpoints
+curl -s http://localhost:8080/debug/pprof/heap > heap.prof
+go tool pprof heap.prof
+
+# Trace requests
+curl -s http://localhost:8080/debug/requests
+```
+
+#### System Debugging
+```bash
+# System resource usage
+htop
+iotop
+nethogs
+
+# Process analysis
+ps aux --sort=-%cpu | head -20
+ps aux --sort=-%mem | head -20
+
+# Network analysis
+netstat -tulpn | grep -E ":9000|:9101|:9102|:9103"
+ss -tuln | grep -E ":9000|:9101|:9102|:9103"
+```
+
+## Maintenance Procedures
+
+### Scheduled Maintenance
+
+#### Weekly Maintenance (Low-impact)
+- Review system health metrics
+- Check log sizes and rotate if necessary
+- Update monitoring dashboards
+- Validate backup integrity
+
+#### Monthly Maintenance (Medium-impact)
+- Update non-critical components
+- Perform capacity planning review
+- Test disaster recovery procedures
+- Security scan and updates
+
+#### Quarterly Maintenance (High-impact)
+- Major version updates
+- Infrastructure upgrades
+- Performance optimization review
+- Security audit and remediation
+
+### Update Procedures
+
+#### Rolling Updates
+```bash
+# Update with zero downtime
+docker service update \
+  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
+  --update-parallelism 1 \
+  --update-delay 30s \
+  --update-failure-action rollback \
+  bzzz-v2_bzzz-agent
+```
+
+#### Configuration Updates
+```bash
+# Rotate configuration (changing a config triggers a rolling restart of the service's tasks)
+docker config create bzzz_v2_config_new /path/to/new/config.yaml
+
+docker service update \
+  --config-rm bzzz_v2_config \
+  --config-add source=bzzz_v2_config_new,target=/app/config/config.yaml \
+  bzzz-v2_bzzz-agent
+
+# Cleanup old config
+docker config rm bzzz_v2_config
+```
+
+#### Database Maintenance
+```bash
+# Database optimization
+docker exec -it $(docker ps -q -f name=postgres) \
+  psql -U bzzz -d bzzz_v2 -c "VACUUM ANALYZE;"
+
+# Update statistics
+docker exec -it $(docker ps -q -f name=postgres) \
+  psql -U bzzz -d bzzz_v2 -c "ANALYZE;"
+
+# Check database size
+docker exec -it $(docker ps -q -f name=postgres) \
+  psql -U bzzz -d bzzz_v2 -c "SELECT pg_size_pretty(pg_database_size('bzzz_v2'));"
+```
+
+### Capacity Planning
+
+#### Growth Projections
+- Monitor resource usage trends over time
+- Project capacity needs based on growth patterns
+- Plan for seasonal or event-driven spikes
+
+#### Scaling Decisions
+```bash
+# Horizontal scaling
+docker service scale bzzz-v2_bzzz-agent=5
+
+# Vertical scaling
+docker service update \
+  --limit-memory 8G \
+  --limit-cpu 4 \
+  bzzz-v2_bzzz-agent
+
+# Print the join command to run on a new worker node
+docker swarm join-token worker
+```
+
+#### Resource Monitoring
+- Set up capacity alerts at 70% utilization
+- Monitor growth rate and extrapolate
+- Plan infrastructure expansions 3-6 months ahead
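+
+The 70% threshold and the growth extrapolation can be sketched with integer arithmetic. The helper names below are hypothetical, the inputs are assumed to be whole numbers in a common unit (for example GiB), and the projection assumes constant linear growth:
+
+```bash
+# Print utilization as a whole-number percentage (rounded down).
+utilization_pct() {
+  local used="$1" total="$2"
+  echo $(( used * 100 / total ))
+}
+
+# Return 0 when utilization is at or above the alert threshold (default 70%).
+capacity_alert() {
+  local used="$1" total="$2" threshold="${3:-70}"
+  [ "$(utilization_pct "$used" "$total")" -ge "$threshold" ]
+}
+
+# Print whole days until full at a constant growth rate per day, or "inf" if there is no growth.
+days_until_full() {
+  local used="$1" total="$2" growth_per_day="$3"
+  [ "$growth_per_day" -gt 0 ] || { echo "inf"; return 0; }
+  echo $(( (total - used) / growth_per_day ))
+}
+
+# Example: 350 of 500 GiB used, growing 5 GiB per day
+capacity_alert 350 500 && echo "capacity alert: $(days_until_full 350 500 5) days until full"
+```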
+
+---
+
+## Contact Information
+
+**Primary Contact**: Tony (@tony)
+**Team**: BZZZ Infrastructure Team
+**Documentation**: https://wiki.chorus.services/bzzz
+**Source Code**: https://gitea.chorus.services/tony/BZZZ
+
+**Last Updated**: 2025-01-01
+**Version**: 2.0
+**Review Date**: 2025-04-01