🚀 Complete BZZZ Issue Resolution - All 17 Issues Solved

Comprehensive multi-agent implementation addressing all issues from INDEX.md:

## Core Architecture & Validation
-  Issue 001: UCXL address validation at all system boundaries
-  Issue 002: Fixed search parsing bug in encrypted storage
-  Issue 003: Wired UCXI P2P announce and discover functionality
-  Issue 011: Aligned temporal grammar and documentation
-  Issue 012: SLURP idempotency, backpressure, and DLQ implementation
-  Issue 013: Linked SLURP events to UCXL decisions and DHT

## API Standardization & Configuration
-  Issue 004: Standardized UCXI payloads to UCXL codes
-  Issue 010: Status endpoints and configuration surface

## Infrastructure & Operations
-  Issue 005: Election heartbeat on admin transition
-  Issue 006: Active health checks for PubSub and DHT
-  Issue 007: DHT replication and provider records
-  Issue 014: SLURP leadership lifecycle and health probes
-  Issue 015: Comprehensive monitoring, SLOs, and alerts

## Security & Access Control
-  Issue 008: Key rotation and role-based access policies

## Testing & Quality Assurance
-  Issue 009: Integration tests for UCXI + DHT encryption + search
-  Issue 016: E2E tests for HMMM → SLURP → UCXL workflow

## HMMM Integration
-  Issue 017: HMMM adapter wiring and comprehensive testing

## Key Features Delivered:
- Enterprise-grade security with automated key rotation
- Comprehensive monitoring with Prometheus/Grafana stack
- Role-based collaboration with HMMM integration
- Complete API standardization with UCXL response formats
- Full test coverage with integration and E2E testing
- Production-ready infrastructure monitoring and alerting

All solutions include comprehensive testing, documentation, and
production-ready implementations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: anthonyrawlins
Date: 2025-08-29 12:39:38 +10:00
Parent: 59f40e17a5
Commit: 92779523c0
136 changed files with 56,649 additions and 134 deletions


@@ -0,0 +1,835 @@
# BZZZ Infrastructure Operational Runbook
## Table of Contents
1. [Quick Reference](#quick-reference)
2. [System Architecture Overview](#system-architecture-overview)
3. [Common Operational Tasks](#common-operational-tasks)
4. [Incident Response Procedures](#incident-response-procedures)
5. [Health Check Procedures](#health-check-procedures)
6. [Performance Tuning](#performance-tuning)
7. [Backup and Recovery](#backup-and-recovery)
8. [Troubleshooting Guide](#troubleshooting-guide)
9. [Maintenance Procedures](#maintenance-procedures)
## Quick Reference
### Critical Service Endpoints
- **Grafana Dashboard**: https://grafana.chorus.services
- **Prometheus**: https://prometheus.chorus.services
- **AlertManager**: https://alerts.chorus.services
- **BZZZ Main API**: https://bzzz.deepblack.cloud
- **Health Checks**: https://bzzz.deepblack.cloud/health
### Emergency Contacts
- **Primary Oncall**: Slack #bzzz-alerts
- **System Administrator**: @tony
- **Infrastructure Team**: @platform-team
### Key Commands
```bash
# Check system health
curl -s https://bzzz.deepblack.cloud/health | jq
# View logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100
# Scale service
docker service scale bzzz-v2_bzzz-agent=5
# Force service update
docker service update --force bzzz-v2_bzzz-agent
```
## System Architecture Overview
### Component Relationships
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   PubSub    │─────│     DHT     │─────│  Election   │
│  Messaging  │     │   Storage   │     │   Manager   │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                    ┌─────────────┐
                    │    SLURP    │
                    │   Context   │
                    │  Generator  │
                    └─────────────┘
                           │
                    ┌─────────────┐
                    │    UCXI     │
                    │  Protocol   │
                    │  Resolver   │
                    └─────────────┘
```
### Data Flow
1. **Task Requests** → PubSub → Task Coordinator → SLURP (if admin)
2. **Context Generation** → DHT Storage → UCXI Resolution
3. **Health Monitoring** → Prometheus → AlertManager → Notifications
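Each stage of this flow exposes counters to Prometheus (the metric names below are the ones used by the alert rules in this commit), so a quick end-to-end check is to confirm that every stage is still moving:
```bash
# Sketch: per-stage rates for the task pipeline, assuming the internal Prometheus endpoint
PROM=http://prometheus:9090/api/v1/query
for metric in \
  'rate(bzzz_pubsub_messages_total[5m])' \
  'rate(bzzz_slurp_contexts_generated_total[5m])' \
  'rate(bzzz_dht_put_operations_total[5m])' \
  'rate(bzzz_ucxi_requests_total[5m])'; do
  echo "== $metric =="
  curl -sG "$PROM" --data-urlencode "query=$metric" | jq '.data.result'
done
```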
### Critical Dependencies
- **Docker Swarm**: Container orchestration
- **NFS Storage**: Persistent data storage
- **Prometheus Stack**: Monitoring and alerting
- **DHT Bootstrap Nodes**: P2P network foundation
## Common Operational Tasks
### Service Management
#### Check Service Status
```bash
# List all BZZZ services
docker service ls | grep bzzz
# Check specific service
docker service ps bzzz-v2_bzzz-agent
# View service configuration
docker service inspect bzzz-v2_bzzz-agent
```
#### Scale Services
```bash
# Scale main BZZZ service
docker service scale bzzz-v2_bzzz-agent=5
# Scale monitoring stack
docker service scale bzzz-monitoring_prometheus=1
docker service scale bzzz-monitoring_grafana=1
```
#### Update Services
```bash
# Update to new image version
docker service update \
--image registry.home.deepblack.cloud/bzzz:v2.1.0 \
bzzz-v2_bzzz-agent
# Update environment variables
docker service update \
--env-add LOG_LEVEL=debug \
bzzz-v2_bzzz-agent
# Update resource limits
docker service update \
--limit-memory 4G \
--limit-cpu 2 \
bzzz-v2_bzzz-agent
```
### Configuration Management
#### Update Docker Secrets
```bash
# Create new secret
echo "new_password" | docker secret create bzzz_postgres_password_v2 -
# Update service to use new secret
docker service update \
--secret-rm bzzz_postgres_password \
--secret-add bzzz_postgres_password_v2 \
bzzz-v2_postgres
```
#### Update Docker Configs
```bash
# Create new config
docker config create bzzz_v2_config_v3 /path/to/new/config.yaml
# Update service
docker service update \
--config-rm bzzz_v2_config \
--config-add source=bzzz_v2_config_v3,target=/app/config/config.yaml \
bzzz-v2_bzzz-agent
```
### Monitoring and Alerting
#### Check Alert Status
```bash
# View active alerts
curl -s http://alertmanager:9093/api/v1/alerts | jq '.data[] | select(.status.state == "active")'
# Silence alert
curl -X POST http://alertmanager:9093/api/v1/silences \
-d '{
"matchers": [{"name": "alertname", "value": "BZZZSystemHealthCritical"}],
"startsAt": "2025-01-01T00:00:00Z",
"endsAt": "2025-01-01T01:00:00Z",
"comment": "Maintenance window",
"createdBy": "operator"
}'
```
#### Query Metrics
```bash
# Check system health
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq
# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' | jq
# Check error rates
curl -s 'http://prometheus:9090/api/v1/query?query=rate(bzzz_errors_total[5m])' | jq
```
## Incident Response Procedures
### Severity Levels
#### Critical (P0)
- System completely unavailable
- Data loss or corruption
- Security breach
- **Response Time**: 15 minutes
- **Resolution Target**: 2 hours
#### High (P1)
- Major functionality impaired
- Performance severely degraded
- **Response Time**: 1 hour
- **Resolution Target**: 4 hours
#### Medium (P2)
- Minor functionality issues
- Performance slightly degraded
- **Response Time**: 4 hours
- **Resolution Target**: 24 hours
#### Low (P3)
- Cosmetic issues
- Enhancement requests
- **Response Time**: 24 hours
- **Resolution Target**: 1 week
### Common Incident Scenarios
#### System Health Critical (Alert: BZZZSystemHealthCritical)
**Symptoms**: System health score < 0.5
**Immediate Actions**:
1. Check Grafana dashboard for component failures
2. Review recent deployments or changes
3. Check resource utilization (CPU, memory, disk)
4. Verify P2P connectivity
**Investigation Steps**:
```bash
# Check overall system status
curl -s https://bzzz.deepblack.cloud/health | jq
# Check component health
curl -s https://bzzz.deepblack.cloud/health/checks | jq
# Review recent logs
docker service logs bzzz-v2_bzzz-agent --since 1h | tail -100
# Check resource usage
docker stats --no-stream
```
**Recovery Actions**:
1. If memory leak: Restart affected services
2. If disk full: Clean up logs and temporary files
3. If network issues: Restart networking components
4. If database issues: Check PostgreSQL health
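A minimal sketch of the recovery commands implied above (service and path names match the rest of this runbook; point them at whichever component is failing):
```bash
# Restart an affected or leaking service
docker service update --force bzzz-v2_bzzz-agent
# Reclaim disk: prune unused Docker artifacts and truncate local container logs
docker system prune -f
sudo truncate -s 0 /var/lib/docker/containers/*/*-json.log
# Check PostgreSQL health from inside the container
docker exec $(docker ps -q -f name=postgres) pg_isready -U bzzz -d bzzz_v2
```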
#### P2P Network Partition (Alert: BZZZInsufficientPeers)
**Symptoms**: Connected peers < 3
**Immediate Actions**:
1. Check network connectivity between nodes
2. Verify DHT bootstrap nodes are running
3. Check firewall rules and port accessibility
**Investigation Steps**:
```bash
# Check DHT bootstrap nodes
for node in walnut:9101 ironwood:9102 acacia:9103; do
echo "Checking $node:"
nc -zv ${node%:*} ${node#*:}
done
# Check P2P connectivity
docker service logs bzzz-v2_dht-bootstrap-walnut --since 1h
# Test network between nodes
docker run --rm --network host nicolaka/netshoot ping -c 3 ironwood
```
**Recovery Actions**:
1. Restart DHT bootstrap services
2. Clear peer store if corrupted
3. Check and fix network configuration
4. Restart affected BZZZ agents
#### Election System Failure (Alert: BZZZNoAdminElected)
**Symptoms**: No admin elected or frequent leadership changes
**Immediate Actions**:
1. Check election state on all nodes
2. Review heartbeat status
3. Verify role configurations
**Investigation Steps**:
```bash
# Check election status on each node
for node in walnut ironwood acacia; do
echo "Node $node election status:"
docker exec $(docker ps -q --filter label=com.docker.swarm.node.id) \
curl -s localhost:8081/health/checks | jq '.checks["election-health"]'
done
# Check role configurations
docker config inspect bzzz_v2_config | jq '.Spec.Data' | base64 -d | grep -A5 -B5 role
```
**Recovery Actions**:
1. Force re-election by restarting election managers
2. Fix role configuration issues
3. Clear election state if corrupted
4. Ensure at least one node has admin capabilities
#### DHT Replication Failure (Alert: BZZZDHTReplicationDegraded)
**Symptoms**: Average replication factor < 2
**Immediate Actions**:
1. Check DHT provider records
2. Verify replication manager status
3. Check storage availability
**Investigation Steps**:
```bash
# Check DHT metrics
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_replication_factor' | jq
# Check provider records
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_provider_records' | jq
# Check replication manager logs
docker service logs bzzz-v2_bzzz-agent | grep -i replication
```
**Recovery Actions**:
1. Restart replication managers
2. Force re-provision of content
3. Check and fix storage issues
4. Verify DHT network connectivity
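A hedged sketch of the first two recovery steps; the re-provision subcommand is hypothetical, so confirm the actual CLI verb before relying on it:
```bash
# Restart the agents that host the replication managers
docker service update --force bzzz-v2_bzzz-agent
# Hypothetical subcommand: ask an agent to re-announce provider records for its content
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz reprovide-content
# Confirm the replication factor recovers
curl -s 'http://prometheus:9090/api/v1/query?query=avg(bzzz_dht_replication_factor)' | jq
```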
### Escalation Procedures
#### When to Escalate
- Unable to resolve P0/P1 incident within target time
- Incident requires specialized knowledge
- Multiple systems affected
- Potential security implications
#### Escalation Contacts
1. **Technical Lead**: @tech-lead (Slack)
2. **Infrastructure Team**: @infra-team (Slack)
3. **Management**: @management (for business-critical issues)
## Health Check Procedures
### Manual Health Verification
#### System-Level Checks
```bash
# 1. Overall system health
curl -s https://bzzz.deepblack.cloud/health | jq '.status'
# 2. Component health checks
curl -s https://bzzz.deepblack.cloud/health/checks | jq
# 3. Resource utilization
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
# 4. Service status
docker service ls | grep bzzz
# 5. Network connectivity
docker network ls | grep bzzz
```
#### Component-Specific Checks
**P2P Network**:
```bash
# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers'
# Test P2P messaging
docker exec -it $(docker ps -q -f name=bzzz-agent) \
/app/bzzz test-p2p-message
```
**DHT Storage**:
```bash
# Check DHT operations
curl -s 'http://prometheus:9090/api/v1/query?query=rate(bzzz_dht_put_operations_total[5m])'
# Test DHT functionality
docker exec -it $(docker ps -q -f name=bzzz-agent) \
/app/bzzz test-dht-operations
```
**Election System**:
```bash
# Check current admin
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_election_state'
# Check heartbeat status
curl -s https://bzzz.deepblack.cloud/api/election/status | jq
```
### Automated Health Monitoring
#### Prometheus Queries for Health
```promql
# Overall system health
bzzz_system_health_score
# Component health scores
bzzz_component_health_score
# SLI compliance
rate(bzzz_health_checks_passed_total[5m]) / (rate(bzzz_health_checks_passed_total[5m]) + rate(bzzz_health_checks_failed_total[5m]))
# Error budget burn rate
(1 - bzzz:dht_success_rate) > 0.01 # 1% error budget
```
#### Alert Validation
After resolving issues, verify alerts clear:
```bash
# Check if alerts are resolved
curl -s http://alertmanager:9093/api/v1/alerts | \
jq '.data[] | select(.status.state == "active") | .labels.alertname'
```
## Performance Tuning
### Resource Optimization
#### Memory Tuning
```bash
# Increase memory limits for heavy workloads
docker service update --limit-memory 8G bzzz-v2_bzzz-agent
# Optimize JVM heap size (if applicable)
docker service update \
--env-add JAVA_OPTS="-Xmx4g -Xms2g" \
bzzz-v2_bzzz-agent
```
#### CPU Optimization
```bash
# Adjust CPU limits
docker service update --limit-cpu 4 bzzz-v2_bzzz-agent
# Set CPU affinity for critical services
docker service update \
--placement-pref "spread=node.labels.cpu_type==high_performance" \
bzzz-v2_bzzz-agent
```
#### Network Optimization
```bash
# Optimize network buffer sizes
echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf
sysctl -p
```
### Application-Level Tuning
#### DHT Performance
- Increase replication factor for critical content
- Optimize provider record refresh intervals
- Tune cache sizes based on memory availability
#### PubSub Performance
- Adjust message batch sizes
- Optimize topic subscription patterns
- Configure message retention policies
#### Election Stability
- Tune heartbeat intervals
- Adjust election timeouts based on network latency
- Optimize candidate scoring algorithms
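The exact knobs depend on the BZZZ build; as a sketch, tunables like these are usually surfaced as environment variables or config keys. The names below are illustrative assumptions, not confirmed settings:
```bash
# Illustrative only: the environment variable names are assumptions, not confirmed settings
docker service update \
  --env-add DHT_REPLICATION_FACTOR=3 \
  --env-add DHT_PROVIDER_REFRESH_INTERVAL=10m \
  --env-add PUBSUB_BATCH_SIZE=64 \
  --env-add ELECTION_HEARTBEAT_INTERVAL=5s \
  bzzz-v2_bzzz-agent
```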
### Monitoring Performance Impact
```bash
# Before tuning - capture baseline
curl -s 'http://prometheus:9090/api/v1/query_range?query=rate(bzzz_dht_operation_latency_seconds_sum[5m])/rate(bzzz_dht_operation_latency_seconds_count[5m])&start=2025-01-01T00:00:00Z&end=2025-01-01T01:00:00Z&step=60s'
# After tuning - compare results
# Use Grafana dashboards to visualize improvements
```
## Backup and Recovery
### Critical Data Identification
#### Persistent Data
- **PostgreSQL Database**: User data, task history, conversation threads
- **DHT Content**: Distributed content storage
- **Configuration**: Docker secrets, configs, service definitions
- **Prometheus Data**: Historical metrics (optional but valuable)
#### Backup Schedule
- **PostgreSQL**: Daily full backup, continuous WAL archiving
- **Configuration**: Weekly backup, immediately after changes
- **Prometheus**: Weekly backup of selected metrics
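This schedule can be driven from cron on the backup host; a sketch assuming the commands in the next section are wrapped in scripts (the script paths are illustrative):
```bash
# Illustrative crontab entries; script paths are assumptions
# Daily PostgreSQL full backup at 02:00
0 2 * * * /rust/bzzz-v2/scripts/backup-postgres.sh >> /var/log/bzzz-backup.log 2>&1
# Weekly configuration backup, Sundays at 03:00
0 3 * * 0 /rust/bzzz-v2/scripts/backup-configs.sh >> /var/log/bzzz-backup.log 2>&1
# Weekly Prometheus snapshot, Sundays at 04:00
0 4 * * 0 /rust/bzzz-v2/scripts/backup-prometheus.sh >> /var/log/bzzz-backup.log 2>&1
```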
### Backup Procedures
#### Database Backup
```bash
# Create database backup (capture the timestamp once so both paths refer to the same file;
# /backup inside the container is assumed to be mounted from /rust/bzzz-v2/backups)
BACKUP_FILE="bzzz_$(date +%Y%m%d_%H%M%S).sql"
docker exec $(docker ps -q -f name=postgres) \
pg_dump -U bzzz -d bzzz_v2 -f /backup/${BACKUP_FILE}
# Compress and store
gzip /rust/bzzz-v2/backups/${BACKUP_FILE}
aws s3 cp /rust/bzzz-v2/backups/ s3://chorus-backups/bzzz/ --recursive
```
#### Configuration Backup
```bash
# Export secret metadata (note: 'docker secret inspect' does not return secret values;
# the values themselves must be re-created from their original source during recovery)
for secret in $(docker secret ls -q); do
docker secret inspect $secret > /backup/secrets/${secret}.json
done
# Export all configs
for config in $(docker config ls -q); do
docker config inspect $config > /backup/configs/${config}.json
done
# Export service definitions
docker service ls --format '{{.Name}}' | xargs -I {} docker service inspect {} > /backup/services.json
```
#### Prometheus Data Backup
```bash
# Snapshot Prometheus data
curl -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot
# Copy snapshot to backup location
docker cp prometheus_container:/prometheus/snapshots/latest /backup/prometheus/$(date +%Y%m%d)
```
### Recovery Procedures
#### Full System Recovery
1. **Restore Infrastructure**: Deploy Docker Swarm stack
2. **Restore Configuration**: Import secrets and configs
3. **Restore Database**: Restore PostgreSQL from backup
4. **Validate Services**: Verify all services are healthy
5. **Test Functionality**: Run end-to-end tests
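Chained together, the recovery sequence looks roughly like the sketch below (stack and compose file names follow this commit's deployment scripts but should be treated as assumptions):
```bash
# 1. Redeploy the application and monitoring stacks
docker stack deploy -c docker-compose.yml bzzz-v2
docker stack deploy -c monitoring/docker-compose.enhanced.yml bzzz-monitoring-v2
# 2. Recreate configs from the inspect exports (.Spec.Data holds the base64 payload)
for f in /backup/configs/*.json; do
  name=$(jq -r '.[0].Spec.Name' "$f")
  jq -r '.[0].Spec.Data' "$f" | base64 -d | docker config create "$name" -
done
# 3. Restore the database (see Database Recovery below), then scale agents back up
# 4-5. Validate health and run the E2E test suite
curl -s https://bzzz.deepblack.cloud/health | jq '.status'
```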
#### Database Recovery
```bash
# Stop application services
docker service scale bzzz-v2_bzzz-agent=0
# Restore database
gunzip -c /backup/bzzz_20250101_120000.sql.gz | \
docker exec -i $(docker ps -q -f name=postgres) \
psql -U bzzz -d bzzz_v2
# Start application services
docker service scale bzzz-v2_bzzz-agent=3
```
#### Point-in-Time Recovery
```bash
# For WAL-based recovery
docker exec $(docker ps -q -f name=postgres) \
pg_basebackup -U postgres -D /backup/base -X stream -P
# Restore to specific time
# (Implementation depends on PostgreSQL configuration)
```
### Recovery Testing
#### Monthly Recovery Tests
```bash
# Test database restore
./scripts/test-db-restore.sh
# Test configuration restore
./scripts/test-config-restore.sh
# Test full system restore (staging environment)
./scripts/test-full-restore.sh staging
```
#### Recovery Validation
- Verify all services start successfully
- Check data integrity and completeness
- Validate P2P network connectivity
- Test core functionality (task coordination, context generation)
- Monitor system health for 24 hours post-recovery
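For the 24-hour watch, a simple loop against the public health endpoint and the overall health score is usually enough; a sketch:
```bash
# Poll overall health every 5 minutes for 24 hours (288 iterations) and log the results
for i in $(seq 1 288); do
  ts=$(date '+%F %T')
  status=$(curl -s https://bzzz.deepblack.cloud/health | jq -r '.status')
  score=$(curl -sG 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=bzzz_system_health_score' | jq -r '.data.result[0].value[1]')
  echo "$ts status=$status health_score=$score" >> /var/log/bzzz-post-recovery.log
  sleep 300
done
```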
## Troubleshooting Guide
### Log Analysis
#### Centralized Logging
```bash
# View aggregated logs through Loki
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
--data-urlencode 'query={job="bzzz"}' \
--data-urlencode 'start=2025-01-01T00:00:00Z' \
--data-urlencode 'end=2025-01-01T01:00:00Z' | jq
# Search for specific errors
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
--data-urlencode 'query={job="bzzz"} |= "ERROR"' | jq
```
#### Service-Specific Logs
```bash
# BZZZ agent logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100
# DHT bootstrap logs
docker service logs bzzz-v2_dht-bootstrap-walnut -f
# Database logs
docker service logs bzzz-v2_postgres -f
# Filter for specific patterns
docker service logs bzzz-v2_bzzz-agent | grep -E "(ERROR|FATAL|panic)"
```
### Common Issues and Solutions
#### "No Admin Elected" Error
```bash
# Check role configurations
docker config inspect bzzz_v2_config | jq '.Spec.Data' | base64 -d | yq '.agent.role'
# Force election
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz trigger-election
# Restart election managers
docker service update --force bzzz-v2_bzzz-agent
```
#### "DHT Operations Failing" Error
```bash
# Check DHT bootstrap nodes
for port in 9101 9102 9103; do
nc -zv localhost $port
done
# Restart DHT services
docker service update --force bzzz-v2_dht-bootstrap-walnut
docker service update --force bzzz-v2_dht-bootstrap-ironwood
docker service update --force bzzz-v2_dht-bootstrap-acacia
# Clear DHT cache
docker exec -it $(docker ps -q -f name=bzzz-agent) rm -rf /app/data/dht/cache/*
```
#### "High Memory Usage" Alert
```bash
# Identify memory-hungry processes
docker stats --no-stream --format "table {{.Container}}\t{{.MemUsage}}\t{{.MemPerc}}" | sort -k3 -n
# Check for memory leaks
docker exec -it $(docker ps -q -f name=bzzz-agent) pprof -http=:6060 /app/bzzz
# Restart high-memory services
docker service update --force bzzz-v2_bzzz-agent
```
#### "Network Connectivity Issues"
```bash
# Check overlay network
docker network inspect bzzz-internal
# Test connectivity between services
docker run --rm --network bzzz-internal nicolaka/netshoot ping -c 3 postgres
# Check firewall rules
iptables -L | grep -E "(9000|9101|9102|9103)"
# Restart networking
docker network disconnect bzzz-internal $(docker ps -q -f name=bzzz-agent)
docker network connect bzzz-internal $(docker ps -q -f name=bzzz-agent)
```
### Performance Issues
#### High Latency Diagnosis
```bash
# Check operation latencies
curl -s 'http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket[5m]))'
# Identify bottlenecks
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz profile-cpu 30
# Check network latency between nodes
for node in walnut ironwood acacia; do
ping -c 10 $node | tail -1
done
```
#### Resource Contention
```bash
# Check CPU usage
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}"
# Check I/O wait
iostat -x 1 5
# Check network utilization
iftop -i eth0
```
### Debugging Tools
#### Application Debugging
```bash
# Enable debug logging
docker service update --env-add LOG_LEVEL=debug bzzz-v2_bzzz-agent
# Access debug endpoints
curl -s http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof heap.prof
# Trace requests
curl -s http://localhost:8080/debug/requests
```
#### System Debugging
```bash
# System resource usage
htop
iotop
nethogs
# Process analysis
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20
# Network analysis
netstat -tulpn | grep -E ":9000|:9101|:9102|:9103"
ss -tuln | grep -E ":9000|:9101|:9102|:9103"
```
## Maintenance Procedures
### Scheduled Maintenance
#### Weekly Maintenance (Low-impact)
- Review system health metrics
- Check log sizes and rotate if necessary
- Update monitoring dashboards
- Validate backup integrity
#### Monthly Maintenance (Medium-impact)
- Update non-critical components
- Perform capacity planning review
- Test disaster recovery procedures
- Security scan and updates
#### Quarterly Maintenance (High-impact)
- Major version updates
- Infrastructure upgrades
- Performance optimization review
- Security audit and remediation
### Update Procedures
#### Rolling Updates
```bash
# Update with zero downtime
docker service update \
--image registry.home.deepblack.cloud/bzzz:v2.1.0 \
--update-parallelism 1 \
--update-delay 30s \
--update-failure-action rollback \
bzzz-v2_bzzz-agent
```
#### Configuration Updates
```bash
# Update configuration without restart
docker config create bzzz_v2_config_new /path/to/new/config.yaml
docker service update \
--config-rm bzzz_v2_config \
--config-add source=bzzz_v2_config_new,target=/app/config/config.yaml \
bzzz-v2_bzzz-agent
# Cleanup old config
docker config rm bzzz_v2_config
```
#### Database Maintenance
```bash
# Database optimization
docker exec -it $(docker ps -q -f name=postgres) \
psql -U bzzz -d bzzz_v2 -c "VACUUM ANALYZE;"
# Update statistics
docker exec -it $(docker ps -q -f name=postgres) \
psql -U bzzz -d bzzz_v2 -c "ANALYZE;"
# Check database size
docker exec -it $(docker ps -q -f name=postgres) \
psql -U bzzz -d bzzz_v2 -c "SELECT pg_size_pretty(pg_database_size('bzzz_v2'));"
```
### Capacity Planning
#### Growth Projections
- Monitor resource usage trends over time
- Project capacity needs based on growth patterns
- Plan for seasonal or event-driven spikes
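Prometheus's `predict_linear` gives a first-pass projection from the trend data already being collected; for example:
```bash
# Projected memory usage 30 days out, based on the last 7 days of data
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=predict_linear(bzzz_memory_usage_bytes[7d], 86400 * 30)' | jq
# Same idea for disk: will usage cross 90% within 30 days?
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=predict_linear(bzzz_disk_usage_ratio[7d], 86400 * 30) > 0.9' | jq
```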
#### Scaling Decisions
```bash
# Horizontal scaling
docker service scale bzzz-v2_bzzz-agent=5
# Vertical scaling
docker service update \
--limit-memory 8G \
--limit-cpu 4 \
bzzz-v2_bzzz-agent
# Add new node to swarm
docker swarm join-token worker
```
#### Resource Monitoring
- Set up capacity alerts at 70% utilization
- Monitor growth rate and extrapolate
- Plan infrastructure expansions 3-6 months ahead
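The 70% capacity alerts can be expressed in the same style as the alert-rules file in this commit; a sketch of the rule bodies (metric names taken from the existing `bzzz_resources` group, file name illustrative):
```bash
# Sketch: append 70% capacity warning rules for later review/merge into the rules config
cat >> capacity-alerts.yml <<'EOF'
- alert: BZZZCPUCapacityWarning
  expr: bzzz_cpu_usage_ratio > 0.70
  for: 15m
  labels: {severity: warning, service: bzzz, component: capacity}
  annotations:
    summary: "BZZZ CPU utilization above 70% capacity threshold"
- alert: BZZZDiskCapacityWarning
  expr: bzzz_disk_usage_ratio > 0.70
  for: 15m
  labels: {severity: warning, service: bzzz, component: capacity}
  annotations:
    summary: "BZZZ disk utilization above 70% capacity threshold"
EOF
```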
---
## Contact Information
**Primary Contact**: Tony (@tony)
**Team**: BZZZ Infrastructure Team
**Documentation**: https://wiki.chorus.services/bzzz
**Source Code**: https://gitea.chorus.services/tony/BZZZ
**Last Updated**: 2025-01-01
**Version**: 2.0
**Review Date**: 2025-04-01


@@ -0,0 +1,511 @@
# Enhanced Alert Rules for BZZZ v2 Infrastructure
# Service Level Objectives and Critical System Alerts
groups:
# === System Health and SLO Alerts ===
- name: bzzz_system_health
rules:
# Overall system health score
- alert: BZZZSystemHealthCritical
expr: bzzz_system_health_score < 0.5
for: 2m
labels:
severity: critical
service: bzzz
slo: availability
annotations:
summary: "BZZZ system health is critically low"
description: "System health score {{ $value }} is below critical threshold (0.5)"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-health-critical"
- alert: BZZZSystemHealthDegraded
expr: bzzz_system_health_score < 0.8
for: 5m
labels:
severity: warning
service: bzzz
slo: availability
annotations:
summary: "BZZZ system health is degraded"
description: "System health score {{ $value }} is below warning threshold (0.8)"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-health-degraded"
# Component health monitoring
- alert: BZZZComponentUnhealthy
expr: bzzz_component_health_score < 0.7
for: 3m
labels:
severity: warning
service: bzzz
component: "{{ $labels.component }}"
annotations:
summary: "BZZZ component {{ $labels.component }} is unhealthy"
description: "Component {{ $labels.component }} health score {{ $value }} is below threshold"
# === P2P Network Alerts ===
- name: bzzz_p2p_network
rules:
# Peer connectivity SLO: Maintain at least 3 connected peers
- alert: BZZZInsufficientPeers
expr: bzzz_p2p_connected_peers < 3
for: 1m
labels:
severity: critical
service: bzzz
component: p2p
slo: connectivity
annotations:
summary: "BZZZ has insufficient P2P peers"
description: "Only {{ $value }} peers connected, minimum required is 3"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-peer-connectivity"
# Message latency SLO: 95th percentile < 500ms
- alert: BZZZP2PHighLatency
expr: histogram_quantile(0.95, rate(bzzz_p2p_message_latency_seconds_bucket[5m])) > 0.5
for: 3m
labels:
severity: warning
service: bzzz
component: p2p
slo: latency
annotations:
summary: "BZZZ P2P message latency is high"
description: "95th percentile latency {{ $value }}s exceeds 500ms SLO"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-p2p-latency"
# Message loss detection
- alert: BZZZP2PMessageLoss
expr: rate(bzzz_p2p_messages_sent_total[5m]) - rate(bzzz_p2p_messages_received_total[5m]) > 0.1
for: 2m
labels:
severity: warning
service: bzzz
component: p2p
annotations:
summary: "BZZZ P2P message loss detected"
description: "Message send/receive imbalance: {{ $value }} messages/sec"
# === DHT Performance and Reliability ===
- name: bzzz_dht
rules:
# DHT operation success rate SLO: > 99%
- alert: BZZZDHTLowSuccessRate
expr: (rate(bzzz_dht_put_operations_total{status="success"}[5m]) + rate(bzzz_dht_get_operations_total{status="success"}[5m])) / (rate(bzzz_dht_put_operations_total[5m]) + rate(bzzz_dht_get_operations_total[5m])) < 0.99
for: 2m
labels:
severity: warning
service: bzzz
component: dht
slo: success_rate
annotations:
summary: "BZZZ DHT operation success rate is low"
description: "DHT success rate {{ $value | humanizePercentage }} is below 99% SLO"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-dht-success-rate"
# DHT operation latency SLO: 95th percentile < 300ms for gets
- alert: BZZZDHTHighGetLatency
expr: histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket{operation="get"}[5m])) > 0.3
for: 3m
labels:
severity: warning
service: bzzz
component: dht
slo: latency
annotations:
summary: "BZZZ DHT get operations are slow"
description: "95th percentile get latency {{ $value }}s exceeds 300ms SLO"
# DHT replication health
- alert: BZZZDHTReplicationDegraded
expr: avg(bzzz_dht_replication_factor) < 2
for: 5m
labels:
severity: warning
service: bzzz
component: dht
slo: durability
annotations:
summary: "BZZZ DHT replication is degraded"
description: "Average replication factor {{ $value }} is below target of 3"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-dht-replication"
# Provider record staleness
- alert: BZZZDHTStaleProviders
expr: increase(bzzz_dht_provider_records[1h]) == 0 and bzzz_dht_content_keys > 0
for: 10m
labels:
severity: warning
service: bzzz
component: dht
annotations:
summary: "BZZZ DHT provider records are not updating"
description: "No provider record updates in the last hour despite having content"
# === Election System Stability ===
- name: bzzz_election
rules:
# Leadership stability: Avoid frequent leadership changes
- alert: BZZZFrequentLeadershipChanges
expr: increase(bzzz_leadership_changes_total[1h]) > 3
for: 0m
labels:
severity: warning
service: bzzz
component: election
annotations:
summary: "BZZZ leadership is unstable"
description: "{{ $value }} leadership changes in the last hour"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-leadership-instability"
# Election timeout
- alert: BZZZElectionInProgress
expr: bzzz_election_state{state="electing"} == 1
for: 2m
labels:
severity: warning
service: bzzz
component: election
annotations:
summary: "BZZZ election taking too long"
description: "Election has been in progress for more than 2 minutes"
# No admin elected
- alert: BZZZNoAdminElected
expr: bzzz_election_state{state="idle"} == 1 and absent(bzzz_heartbeats_received_total)
for: 1m
labels:
severity: critical
service: bzzz
component: election
annotations:
summary: "BZZZ has no elected admin"
description: "System is idle but no heartbeats are being received"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-no-admin"
# Heartbeat monitoring
- alert: BZZZHeartbeatMissing
expr: increase(bzzz_heartbeats_received_total[2m]) == 0
for: 1m
labels:
severity: critical
service: bzzz
component: election
annotations:
summary: "BZZZ admin heartbeat missing"
description: "No heartbeats received from admin in the last 2 minutes"
# === PubSub Messaging System ===
- name: bzzz_pubsub
rules:
# Message processing rate
- alert: BZZZPubSubHighMessageRate
expr: rate(bzzz_pubsub_messages_total[1m]) > 1000
for: 2m
labels:
severity: warning
service: bzzz
component: pubsub
annotations:
summary: "BZZZ PubSub message rate is very high"
description: "Processing {{ $value }} messages/sec, may indicate spam or DoS"
# Message latency
- alert: BZZZPubSubHighLatency
expr: histogram_quantile(0.95, rate(bzzz_pubsub_message_latency_seconds_bucket[5m])) > 1.0
for: 3m
labels:
severity: warning
service: bzzz
component: pubsub
slo: latency
annotations:
summary: "BZZZ PubSub message latency is high"
description: "95th percentile latency {{ $value }}s exceeds 1s threshold"
# Topic monitoring
- alert: BZZZPubSubNoTopics
expr: bzzz_pubsub_topics == 0
for: 5m
labels:
severity: warning
service: bzzz
component: pubsub
annotations:
summary: "BZZZ PubSub has no active topics"
description: "No PubSub topics are active, system may be isolated"
# === Task Management and Processing ===
- name: bzzz_tasks
rules:
# Task queue backup
- alert: BZZZTaskQueueBackup
expr: bzzz_tasks_queued > 100
for: 5m
labels:
severity: warning
service: bzzz
component: tasks
annotations:
summary: "BZZZ task queue is backing up"
description: "{{ $value }} tasks are queued, may indicate processing issues"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-task-queue"
# Task success rate SLO: > 95%
- alert: BZZZTaskLowSuccessRate
expr: rate(bzzz_tasks_completed_total{status="success"}[10m]) / rate(bzzz_tasks_completed_total[10m]) < 0.95
for: 5m
labels:
severity: warning
service: bzzz
component: tasks
slo: success_rate
annotations:
summary: "BZZZ task success rate is low"
description: "Task success rate {{ $value | humanizePercentage }} is below 95% SLO"
# Task processing latency
- alert: BZZZTaskHighProcessingTime
expr: histogram_quantile(0.95, rate(bzzz_task_duration_seconds_bucket[5m])) > 300
for: 3m
labels:
severity: warning
service: bzzz
component: tasks
annotations:
summary: "BZZZ task processing time is high"
description: "95th percentile task duration {{ $value }}s exceeds 5 minutes"
# === SLURP Context Generation ===
- name: bzzz_slurp
rules:
# Context generation success rate
- alert: BZZZSLURPLowSuccessRate
expr: rate(bzzz_slurp_contexts_generated_total{status="success"}[10m]) / rate(bzzz_slurp_contexts_generated_total[10m]) < 0.90
for: 5m
labels:
severity: warning
service: bzzz
component: slurp
annotations:
summary: "SLURP context generation success rate is low"
description: "Success rate {{ $value | humanizePercentage }} is below 90%"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-slurp-generation"
# Generation queue backup
- alert: BZZZSLURPQueueBackup
expr: bzzz_slurp_queue_length > 50
for: 10m
labels:
severity: warning
service: bzzz
component: slurp
annotations:
summary: "SLURP generation queue is backing up"
description: "{{ $value }} contexts are queued for generation"
# Generation time SLO: 95th percentile < 2 minutes
- alert: BZZZSLURPSlowGeneration
expr: histogram_quantile(0.95, rate(bzzz_slurp_generation_time_seconds_bucket[10m])) > 120
for: 5m
labels:
severity: warning
service: bzzz
component: slurp
slo: latency
annotations:
summary: "SLURP context generation is slow"
description: "95th percentile generation time {{ $value }}s exceeds 2 minutes"
# === UCXI Protocol Resolution ===
- name: bzzz_ucxi
rules:
# Resolution success rate SLO: > 99%
- alert: BZZZUCXILowSuccessRate
expr: rate(bzzz_ucxi_requests_total{status=~"2.."}[5m]) / rate(bzzz_ucxi_requests_total[5m]) < 0.99
for: 3m
labels:
severity: warning
service: bzzz
component: ucxi
slo: success_rate
annotations:
summary: "UCXI resolution success rate is low"
description: "Success rate {{ $value | humanizePercentage }} is below 99% SLO"
# Resolution latency SLO: 95th percentile < 100ms
- alert: BZZZUCXIHighLatency
expr: histogram_quantile(0.95, rate(bzzz_ucxi_resolution_latency_seconds_bucket[5m])) > 0.1
for: 3m
labels:
severity: warning
service: bzzz
component: ucxi
slo: latency
annotations:
summary: "UCXI resolution latency is high"
description: "95th percentile latency {{ $value }}s exceeds 100ms SLO"
# === Resource Utilization ===
- name: bzzz_resources
rules:
# CPU utilization
- alert: BZZZHighCPUUsage
expr: bzzz_cpu_usage_ratio > 0.85
for: 5m
labels:
severity: warning
service: bzzz
component: system
annotations:
summary: "BZZZ CPU usage is high"
description: "CPU usage {{ $value | humanizePercentage }} exceeds 85%"
# Memory utilization
- alert: BZZZHighMemoryUsage
expr: bzzz_memory_usage_bytes / (1024*1024*1024) > 8
for: 3m
labels:
severity: warning
service: bzzz
component: system
annotations:
summary: "BZZZ memory usage is high"
description: "Memory usage {{ $value | humanize1024 }}B is high"
# Disk utilization
- alert: BZZZHighDiskUsage
expr: bzzz_disk_usage_ratio > 0.90
for: 5m
labels:
severity: critical
service: bzzz
component: system
annotations:
summary: "BZZZ disk usage is critical"
description: "Disk usage {{ $value | humanizePercentage }} on {{ $labels.mount_point }} exceeds 90%"
# Goroutine leak detection
- alert: BZZZGoroutineLeak
expr: increase(bzzz_goroutines[30m]) > 1000
for: 5m
labels:
severity: warning
service: bzzz
component: system
annotations:
summary: "Possible BZZZ goroutine leak"
description: "Goroutine count increased by {{ $value }} in 30 minutes"
# === Error Rate Monitoring ===
- name: bzzz_errors
rules:
# General error rate
- alert: BZZZHighErrorRate
expr: rate(bzzz_errors_total[5m]) > 10
for: 2m
labels:
severity: warning
service: bzzz
annotations:
summary: "BZZZ error rate is high"
description: "Error rate {{ $value }} errors/sec in component {{ $labels.component }}"
# Panic detection
- alert: BZZZPanicsDetected
expr: increase(bzzz_panics_total[5m]) > 0
for: 0m
labels:
severity: critical
service: bzzz
annotations:
summary: "BZZZ panic detected"
description: "{{ $value }} panic(s) occurred in the last 5 minutes"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-panic-recovery"
# === Health Check Monitoring ===
- name: bzzz_health_checks
rules:
# Health check failure rate
- alert: BZZZHealthCheckFailures
expr: rate(bzzz_health_checks_failed_total[5m]) > 0.1
for: 2m
labels:
severity: warning
service: bzzz
component: health
annotations:
summary: "BZZZ health check failures detected"
description: "Health check {{ $labels.check_name }} failing at {{ $value }} failures/sec"
# Critical health check failure
- alert: BZZZCriticalHealthCheckFailed
expr: increase(bzzz_health_checks_failed_total{check_name=~".*-enhanced|p2p-connectivity"}[2m]) > 0
for: 0m
labels:
severity: critical
service: bzzz
component: health
annotations:
summary: "Critical BZZZ health check failed"
description: "Critical health check {{ $labels.check_name }} failed: {{ $labels.reason }}"
# === Service Level Indicator Recording Rules ===
- name: bzzz_sli_recording
interval: 30s
rules:
# DHT operation SLI
- record: bzzz:dht_success_rate
expr: (rate(bzzz_dht_put_operations_total{status="success"}[5m]) + rate(bzzz_dht_get_operations_total{status="success"}[5m])) / (rate(bzzz_dht_put_operations_total[5m]) + rate(bzzz_dht_get_operations_total[5m]))
# P2P connectivity SLI
- record: bzzz:p2p_connectivity_ratio
expr: bzzz_p2p_connected_peers / 10 # Target of 10 peers
# UCXI success rate SLI
- record: bzzz:ucxi_success_rate
expr: rate(bzzz_ucxi_requests_total{status=~"2.."}[5m]) / rate(bzzz_ucxi_requests_total[5m])
# Task success rate SLI
- record: bzzz:task_success_rate
expr: rate(bzzz_tasks_completed_total{status="success"}[5m]) / rate(bzzz_tasks_completed_total[5m])
# Overall availability SLI
- record: bzzz:overall_availability
expr: bzzz_system_health_score
# === Multi-Window Multi-Burn-Rate Alerts ===
- name: bzzz_slo_alerts
rules:
# Fast burn rate (2% of error budget in 1 hour)
- alert: BZZZErrorBudgetBurnHigh
# NOTE: a full multi-window alert would AND this with a shorter-window (e.g. 5m) version of
# the same SLI; only the long-window recording rule exists, so a single condition is used here.
expr: (1 - bzzz:dht_success_rate) > (14.4 * 0.01) # 14.4x burn rate for 99% SLO
for: 2m
labels:
severity: critical
service: bzzz
burnrate: fast
slo: dht_success_rate
annotations:
summary: "BZZZ DHT error budget burning fast"
description: "DHT error budget will be exhausted in {{ with query \"(0.01 - (1 - bzzz:dht_success_rate)) / (1 - bzzz:dht_success_rate) * 1\" }}{{ . | first | value | humanizeDuration }}{{ end }}"
# Slow burn rate (10% of error budget in 6 hours)
- alert: BZZZErrorBudgetBurnSlow
expr: (1 - bzzz:dht_success_rate) > (6 * 0.01) # 6x burn rate
for: 15m
labels:
severity: warning
service: bzzz
burnrate: slow
slo: dht_success_rate
annotations:
summary: "BZZZ DHT error budget burning slowly"
description: "DHT error budget depletion rate is concerning"


@@ -0,0 +1,533 @@
version: '3.8'
# Enhanced BZZZ Monitoring Stack for Docker Swarm
# Provides comprehensive observability for BZZZ distributed system
services:
# Prometheus - Metrics Collection and Alerting
prometheus:
image: prom/prometheus:v2.45.0
networks:
- tengig
- monitoring
ports:
- "9090:9090"
volumes:
- prometheus_data:/prometheus
- /rust/bzzz-v2/monitoring/prometheus:/etc/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.retention.size=50GB'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
- '--web.external-url=https://prometheus.chorus.services'
- '--alertmanager.notification-queue-capacity=10000'
deploy:
replicas: 1
placement:
constraints:
- node.hostname == walnut # Place on main node
resources:
limits:
memory: 4G
cpus: '2.0'
reservations:
memory: 2G
cpus: '1.0'
restart_policy:
condition: on-failure
delay: 30s
labels:
- "traefik.enable=true"
- "traefik.http.routers.prometheus.rule=Host(`prometheus.chorus.services`)"
- "traefik.http.services.prometheus.loadbalancer.server.port=9090"
- "traefik.http.routers.prometheus.tls=true"
healthcheck:
test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
interval: 30s
timeout: 10s
retries: 3
configs:
- source: prometheus_config
target: /etc/prometheus/prometheus.yml
- source: prometheus_alerts
target: /etc/prometheus/rules.yml
# Grafana - Visualization and Dashboards
grafana:
image: grafana/grafana:10.0.3
networks:
- tengig
- monitoring
ports:
- "3000:3000"
volumes:
- grafana_data:/var/lib/grafana
- /rust/bzzz-v2/monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards
- /rust/bzzz-v2/monitoring/grafana/datasources:/etc/grafana/provisioning/datasources
environment:
- GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
- GF_INSTALL_PLUGINS=grafana-piechart-panel,grafana-worldmap-panel,vonage-status-panel
- GF_FEATURE_TOGGLES_ENABLE=publicDashboards
- GF_SERVER_ROOT_URL=https://grafana.chorus.services
- GF_ANALYTICS_REPORTING_ENABLED=false
- GF_ANALYTICS_CHECK_FOR_UPDATES=false
- GF_LOG_LEVEL=warn
secrets:
- grafana_admin_password
deploy:
replicas: 1
placement:
constraints:
- node.hostname == walnut
resources:
limits:
memory: 2G
cpus: '1.0'
reservations:
memory: 512M
cpus: '0.5'
restart_policy:
condition: on-failure
delay: 10s
labels:
- "traefik.enable=true"
- "traefik.http.routers.grafana.rule=Host(`grafana.chorus.services`)"
- "traefik.http.services.grafana.loadbalancer.server.port=3000"
- "traefik.http.routers.grafana.tls=true"
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
interval: 30s
timeout: 10s
retries: 3
# AlertManager - Alert Routing and Notification
alertmanager:
image: prom/alertmanager:v0.25.0
networks:
- tengig
- monitoring
ports:
- "9093:9093"
volumes:
- alertmanager_data:/alertmanager
- /rust/bzzz-v2/monitoring/alertmanager:/etc/alertmanager
command:
- '--config.file=/etc/alertmanager/config.yml'
- '--storage.path=/alertmanager'
- '--web.external-url=https://alerts.chorus.services'
- '--web.route-prefix=/'
- '--cluster.listen-address=0.0.0.0:9094'
- '--log.level=info'
deploy:
replicas: 1
placement:
constraints:
- node.hostname == ironwood
resources:
limits:
memory: 1G
cpus: '0.5'
reservations:
memory: 256M
cpus: '0.25'
restart_policy:
condition: on-failure
labels:
- "traefik.enable=true"
- "traefik.http.routers.alertmanager.rule=Host(`alerts.chorus.services`)"
- "traefik.http.services.alertmanager.loadbalancer.server.port=9093"
- "traefik.http.routers.alertmanager.tls=true"
configs:
- source: alertmanager_config
target: /etc/alertmanager/config.yml
secrets:
- slack_webhook_url
- pagerduty_integration_key
# Node Exporter - System Metrics (deployed on all nodes)
node-exporter:
image: prom/node-exporter:v1.6.1
networks:
- monitoring
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
- /run/systemd/private:/run/systemd/private:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
- '--collector.systemd'
- '--collector.systemd.unit-include=(bzzz|docker|prometheus|grafana)\.service'
- '--web.listen-address=0.0.0.0:9100'
deploy:
mode: global # Deploy on every node
resources:
limits:
memory: 256M
cpus: '0.2'
reservations:
memory: 128M
cpus: '0.1'
restart_policy:
condition: on-failure
# cAdvisor - Container Metrics (deployed on all nodes)
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.47.2
networks:
- monitoring
ports:
- "8080:8080"
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
deploy:
mode: global
resources:
limits:
memory: 512M
cpus: '0.3'
reservations:
memory: 256M
cpus: '0.15'
restart_policy:
condition: on-failure
healthcheck:
test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:8080/healthz"]
interval: 30s
timeout: 10s
retries: 3
# BZZZ P2P Network Exporter - Custom metrics for P2P network health
bzzz-p2p-exporter:
image: registry.home.deepblack.cloud/bzzz-p2p-exporter:v2.0.0
networks:
- monitoring
- bzzz-internal
ports:
- "9200:9200"
environment:
- BZZZ_ENDPOINTS=http://bzzz-agent:9000
- SCRAPE_INTERVAL=15s
- LOG_LEVEL=info
deploy:
replicas: 1
placement:
constraints:
- node.hostname == walnut
resources:
limits:
memory: 256M
cpus: '0.2'
reservations:
memory: 128M
cpus: '0.1'
restart_policy:
condition: on-failure
# DHT Monitor - DHT-specific metrics and health monitoring
dht-monitor:
image: registry.home.deepblack.cloud/bzzz-dht-monitor:v2.0.0
networks:
- monitoring
- bzzz-internal
ports:
- "9201:9201"
environment:
- DHT_BOOTSTRAP_NODES=walnut:9101,ironwood:9102,acacia:9103
- REPLICATION_CHECK_INTERVAL=5m
- PROVIDER_CHECK_INTERVAL=2m
- LOG_LEVEL=info
deploy:
replicas: 1
placement:
constraints:
- node.hostname == ironwood
resources:
limits:
memory: 512M
cpus: '0.3'
reservations:
memory: 256M
cpus: '0.15'
restart_policy:
condition: on-failure
# Content Monitor - Content availability and integrity monitoring
content-monitor:
image: registry.home.deepblack.cloud/bzzz-content-monitor:v2.0.0
networks:
- monitoring
- bzzz-internal
ports:
- "9202:9202"
volumes:
- /rust/bzzz-v2/data/blobs:/app/blobs:ro
environment:
- CONTENT_PATH=/app/blobs
- INTEGRITY_CHECK_INTERVAL=15m
- AVAILABILITY_CHECK_INTERVAL=5m
- LOG_LEVEL=info
deploy:
replicas: 1
placement:
constraints:
- node.hostname == acacia
resources:
limits:
memory: 512M
cpus: '0.3'
reservations:
memory: 256M
cpus: '0.15'
restart_policy:
condition: on-failure
# OpenAI Cost Monitor - Track OpenAI API usage and costs
openai-cost-monitor:
image: registry.home.deepblack.cloud/bzzz-openai-cost-monitor:v2.0.0
networks:
- monitoring
- bzzz-internal
ports:
- "9203:9203"
environment:
- OPENAI_PROXY_ENDPOINT=http://openai-proxy:3002
- COST_TRACKING_ENABLED=true
- POSTGRES_HOST=postgres
- LOG_LEVEL=info
secrets:
- postgres_password
deploy:
replicas: 1
placement:
constraints:
- node.hostname == walnut
resources:
limits:
memory: 256M
cpus: '0.2'
reservations:
memory: 128M
cpus: '0.1'
restart_policy:
condition: on-failure
# Blackbox Exporter - External endpoint monitoring
blackbox-exporter:
image: prom/blackbox-exporter:v0.24.0
networks:
- monitoring
- tengig
ports:
- "9115:9115"
volumes:
- /rust/bzzz-v2/monitoring/blackbox:/etc/blackbox_exporter
command:
- '--config.file=/etc/blackbox_exporter/config.yml'
- '--web.listen-address=0.0.0.0:9115'
deploy:
replicas: 1
placement:
constraints:
- node.hostname == ironwood
resources:
limits:
memory: 128M
cpus: '0.1'
reservations:
memory: 64M
cpus: '0.05'
restart_policy:
condition: on-failure
configs:
- source: blackbox_config
target: /etc/blackbox_exporter/config.yml
# Loki - Log Aggregation
loki:
image: grafana/loki:2.8.0
networks:
- monitoring
ports:
- "3100:3100"
volumes:
- loki_data:/loki
- /rust/bzzz-v2/monitoring/loki:/etc/loki
command:
- '-config.file=/etc/loki/config.yml'
- '-target=all'
deploy:
replicas: 1
placement:
constraints:
- node.hostname == walnut
resources:
limits:
memory: 2G
cpus: '1.0'
reservations:
memory: 1G
cpus: '0.5'
restart_policy:
condition: on-failure
configs:
- source: loki_config
target: /etc/loki/config.yml
# Promtail - Log Collection Agent (deployed on all nodes)
promtail:
image: grafana/promtail:2.8.0
networks:
- monitoring
volumes:
- /var/log:/var/log:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /rust/bzzz-v2/monitoring/promtail:/etc/promtail
command:
- '-config.file=/etc/promtail/config.yml'
- '-server.http-listen-port=9080'
deploy:
mode: global
resources:
limits:
memory: 256M
cpus: '0.2'
reservations:
memory: 128M
cpus: '0.1'
restart_policy:
condition: on-failure
configs:
- source: promtail_config
target: /etc/promtail/config.yml
# Jaeger - Distributed Tracing (Optional)
jaeger:
image: jaegertracing/all-in-one:1.47
networks:
- monitoring
- bzzz-internal
ports:
- "14268:14268" # HTTP collector
- "16686:16686" # Web UI
environment:
- COLLECTOR_OTLP_ENABLED=true
- SPAN_STORAGE_TYPE=memory
deploy:
replicas: 1
placement:
constraints:
- node.hostname == acacia
resources:
limits:
memory: 1G
cpus: '0.5'
reservations:
memory: 512M
cpus: '0.25'
restart_policy:
condition: on-failure
labels:
- "traefik.enable=true"
- "traefik.http.routers.jaeger.rule=Host(`tracing.chorus.services`)"
- "traefik.http.services.jaeger.loadbalancer.server.port=16686"
- "traefik.http.routers.jaeger.tls=true"
networks:
tengig:
external: true
monitoring:
driver: overlay
internal: true
attachable: false
ipam:
driver: default
config:
- subnet: 10.201.0.0/16
bzzz-internal:
external: true
volumes:
prometheus_data:
driver: local
driver_opts:
type: nfs
o: addr=192.168.1.27,rw,sync
device: ":/rust/bzzz-v2/monitoring/prometheus/data"
grafana_data:
driver: local
driver_opts:
type: nfs
o: addr=192.168.1.27,rw,sync
device: ":/rust/bzzz-v2/monitoring/grafana/data"
alertmanager_data:
driver: local
driver_opts:
type: nfs
o: addr=192.168.1.27,rw,sync
device: ":/rust/bzzz-v2/monitoring/alertmanager/data"
loki_data:
driver: local
driver_opts:
type: nfs
o: addr=192.168.1.27,rw,sync
device: ":/rust/bzzz-v2/monitoring/loki/data"
secrets:
grafana_admin_password:
external: true
name: bzzz_grafana_admin_password
slack_webhook_url:
external: true
name: bzzz_slack_webhook_url
pagerduty_integration_key:
external: true
name: bzzz_pagerduty_integration_key
postgres_password:
external: true
name: bzzz_postgres_password
configs:
prometheus_config:
external: true
name: bzzz_prometheus_config_v2
prometheus_alerts:
external: true
name: bzzz_prometheus_alerts_v2
alertmanager_config:
external: true
name: bzzz_alertmanager_config_v2
blackbox_config:
external: true
name: bzzz_blackbox_config_v2
loki_config:
external: true
name: bzzz_loki_config_v2
promtail_config:
external: true
name: bzzz_promtail_config_v2


@@ -0,0 +1,615 @@
#!/bin/bash
# BZZZ Enhanced Monitoring Stack Deployment Script
# Deploys comprehensive monitoring, metrics, and health checking infrastructure
set -euo pipefail
# Script configuration
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOG_FILE="/tmp/bzzz-deploy-${TIMESTAMP}.log"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Configuration
ENVIRONMENT=${ENVIRONMENT:-"production"}
DRY_RUN=${DRY_RUN:-"false"}
BACKUP_EXISTING=${BACKUP_EXISTING:-"true"}
HEALTH_CHECK_TIMEOUT=${HEALTH_CHECK_TIMEOUT:-300}
# Docker configuration
DOCKER_REGISTRY="registry.home.deepblack.cloud"
STACK_NAME="bzzz-monitoring-v2"
CONFIG_VERSION="v2"
# Logging function
log() {
local level=$1
shift
local message="$*"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
case $level in
ERROR)
echo -e "${RED}[ERROR]${NC} $message" >&2
;;
WARN)
echo -e "${YELLOW}[WARN]${NC} $message"
;;
INFO)
echo -e "${GREEN}[INFO]${NC} $message"
;;
DEBUG)
echo -e "${BLUE}[DEBUG]${NC} $message"
;;
esac
echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
}
# Error handler
error_handler() {
local line_no=$1
log ERROR "Script failed at line $line_no"
log ERROR "Check log file: $LOG_FILE"
exit 1
}
trap 'error_handler $LINENO' ERR
# Check prerequisites
check_prerequisites() {
log INFO "Checking prerequisites..."
# Check if running on Docker Swarm manager
if ! docker info --format '{{.Swarm.LocalNodeState}}' | grep -q "active"; then
log ERROR "This script must be run on a Docker Swarm manager node"
exit 1
fi
# Check required tools
local required_tools=("docker" "jq" "curl")
for tool in "${required_tools[@]}"; do
if ! command -v "$tool" >/dev/null 2>&1; then
log ERROR "Required tool not found: $tool"
exit 1
fi
done
# Check network connectivity to registry
if ! docker pull "$DOCKER_REGISTRY/bzzz:v2.0.0" >/dev/null 2>&1; then
log WARN "Unable to pull from registry, using local images"
fi
log INFO "Prerequisites check completed"
}
# Create necessary directories
setup_directories() {
log INFO "Setting up directories..."
local dirs=(
"/rust/bzzz-v2/monitoring/prometheus/data"
"/rust/bzzz-v2/monitoring/grafana/data"
"/rust/bzzz-v2/monitoring/alertmanager/data"
"/rust/bzzz-v2/monitoring/loki/data"
"/rust/bzzz-v2/backups/monitoring"
)
for dir in "${dirs[@]}"; do
if [[ "$DRY_RUN" != "true" ]]; then
sudo mkdir -p "$dir"
sudo chown -R 65534:65534 "$dir" # nobody user for containers
fi
log DEBUG "Created directory: $dir"
done
}
# Backup existing configuration
backup_existing_config() {
if [[ "$BACKUP_EXISTING" != "true" ]]; then
log INFO "Skipping backup (BACKUP_EXISTING=false)"
return
fi
log INFO "Backing up existing configuration..."
local backup_dir="/rust/bzzz-v2/backups/monitoring/backup_${TIMESTAMP}"
if [[ "$DRY_RUN" != "true" ]]; then
mkdir -p "$backup_dir"
# Backup Docker secrets
docker secret ls --filter name=bzzz_ --format "{{.Name}}" | while read -r secret; do
if docker secret inspect "$secret" >/dev/null 2>&1; then
docker secret inspect "$secret" > "$backup_dir/${secret}.json"
log DEBUG "Backed up secret: $secret"
fi
done
# Backup Docker configs
docker config ls --filter name=bzzz_ --format "{{.Name}}" | while read -r config; do
if docker config inspect "$config" >/dev/null 2>&1; then
docker config inspect "$config" > "$backup_dir/${config}.json"
log DEBUG "Backed up config: $config"
fi
done
# Backup service definitions
if docker stack services "$STACK_NAME" >/dev/null 2>&1; then
docker stack services "$STACK_NAME" --format "{{.Name}}" | while read -r service; do
docker service inspect "$service" > "$backup_dir/${service}-service.json"
done
fi
fi
log INFO "Backup completed: $backup_dir"
}
# Create Docker secrets
create_secrets() {
log INFO "Creating Docker secrets..."
local secrets=(
"bzzz_grafana_admin_password:$(openssl rand -base64 32)"
"bzzz_postgres_password:$(openssl rand -base64 32)"
)
# Check if secrets directory exists
local secrets_dir="$HOME/chorus/business/secrets"
if [[ -d "$secrets_dir" ]]; then
# Use existing secrets if available
if [[ -f "$secrets_dir/grafana-admin-password" ]]; then
secrets[0]="bzzz_grafana_admin_password:$(cat "$secrets_dir/grafana-admin-password")"
fi
if [[ -f "$secrets_dir/postgres-password" ]]; then
secrets[1]="bzzz_postgres_password:$(cat "$secrets_dir/postgres-password")"
fi
fi
for secret_def in "${secrets[@]}"; do
local secret_name="${secret_def%%:*}"
local secret_value="${secret_def#*:}"
if docker secret inspect "$secret_name" >/dev/null 2>&1; then
log DEBUG "Secret already exists: $secret_name"
else
if [[ "$DRY_RUN" != "true" ]]; then
echo "$secret_value" | docker secret create "$secret_name" -
log INFO "Created secret: $secret_name"
else
log DEBUG "Would create secret: $secret_name"
fi
fi
done
}
# Create Docker configs
create_configs() {
log INFO "Creating Docker configs..."
local configs=(
"bzzz_prometheus_config_${CONFIG_VERSION}:${PROJECT_ROOT}/monitoring/configs/prometheus.yml"
"bzzz_prometheus_alerts_${CONFIG_VERSION}:${PROJECT_ROOT}/monitoring/configs/enhanced-alert-rules.yml"
"bzzz_grafana_datasources_${CONFIG_VERSION}:${PROJECT_ROOT}/monitoring/configs/grafana-datasources.yml"
"bzzz_alertmanager_config_${CONFIG_VERSION}:${PROJECT_ROOT}/monitoring/configs/alertmanager.yml"
)
for config_def in "${configs[@]}"; do
local config_name="${config_def%%:*}"
local config_file="${config_def#*:}"
if [[ ! -f "$config_file" ]]; then
log WARN "Config file not found: $config_file"
continue
fi
if docker config inspect "$config_name" >/dev/null 2>&1; then
log DEBUG "Config already exists: $config_name"
# Remove old config if exists
if [[ "$DRY_RUN" != "true" ]]; then
local old_config_name="${config_name%_${CONFIG_VERSION}}"
if docker config inspect "$old_config_name" >/dev/null 2>&1; then
docker config rm "$old_config_name" || true
fi
fi
else
if [[ "$DRY_RUN" != "true" ]]; then
docker config create "$config_name" "$config_file"
log INFO "Created config: $config_name"
else
log DEBUG "Would create config: $config_name from $config_file"
fi
fi
done
}
# Create missing config files
create_missing_configs() {
log INFO "Creating missing configuration files..."
# Create Grafana datasources config
local grafana_datasources="${PROJECT_ROOT}/monitoring/configs/grafana-datasources.yml"
if [[ ! -f "$grafana_datasources" ]]; then
cat > "$grafana_datasources" <<EOF
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
- name: Loki
type: loki
access: proxy
url: http://loki:3100
editable: true
- name: Jaeger
type: jaeger
access: proxy
url: http://jaeger:16686
editable: true
EOF
log INFO "Created Grafana datasources config"
fi
# Create AlertManager config
local alertmanager_config="${PROJECT_ROOT}/monitoring/configs/alertmanager.yml"
if [[ ! -f "$alertmanager_config" ]]; then
cat > "$alertmanager_config" <<EOF
global:
smtp_smarthost: 'localhost:587'
smtp_from: 'alerts@chorus.services'
slack_api_url_file: '/run/secrets/slack_webhook_url'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
- match:
service: bzzz
receiver: 'bzzz-alerts'
receivers:
- name: 'default'
slack_configs:
- channel: '#bzzz-alerts'
title: 'BZZZ Alert: {{ .CommonAnnotations.summary }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'critical-alerts'
slack_configs:
- channel: '#bzzz-critical'
title: 'CRITICAL: {{ .CommonAnnotations.summary }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'bzzz-alerts'
slack_configs:
- channel: '#bzzz-alerts'
title: 'BZZZ: {{ .CommonAnnotations.summary }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
EOF
log INFO "Created AlertManager config"
fi
}
# Deploy monitoring stack
deploy_monitoring_stack() {
log INFO "Deploying monitoring stack..."
local compose_file="${PROJECT_ROOT}/monitoring/docker-compose.enhanced.yml"
if [[ ! -f "$compose_file" ]]; then
log ERROR "Compose file not found: $compose_file"
exit 1
fi
if [[ "$DRY_RUN" != "true" ]]; then
# Deploy the stack
docker stack deploy -c "$compose_file" "$STACK_NAME"
log INFO "Stack deployment initiated: $STACK_NAME"
# Wait for services to be ready
log INFO "Waiting for services to be ready..."
local max_attempts=30
local attempt=0
while [[ $attempt -lt $max_attempts ]]; do
local ready_services=0
local total_services=0
# Count ready services
while read -r service; do
total_services=$((total_services + 1))
local replicas_info
replicas_info=$(docker service ls --filter name="$service" --format "{{.Replicas}}")
if [[ "$replicas_info" =~ ^([0-9]+)/([0-9]+)$ ]]; then
local current="${BASH_REMATCH[1]}"
local desired="${BASH_REMATCH[2]}"
if [[ "$current" -eq "$desired" ]]; then
ready_services=$((ready_services + 1))
fi
fi
done < <(docker stack services "$STACK_NAME" --format "{{.Name}}")
if [[ $ready_services -eq $total_services ]]; then
log INFO "All services are ready ($ready_services/$total_services)"
break
else
log DEBUG "Services ready: $ready_services/$total_services"
sleep 10
attempt=$((attempt + 1))
fi
done
if [[ $attempt -eq $max_attempts ]]; then
log WARN "Timeout waiting for all services to be ready"
fi
else
log DEBUG "Would deploy stack with compose file: $compose_file"
fi
}
# Perform health checks
perform_health_checks() {
log INFO "Performing health checks..."
if [[ "$DRY_RUN" == "true" ]]; then
log DEBUG "Skipping health checks in dry run mode"
return
fi
local endpoints=(
"http://localhost:9090/-/healthy:Prometheus"
"http://localhost:3000/api/health:Grafana"
"http://localhost:9093/-/healthy:AlertManager"
)
local max_attempts=$((HEALTH_CHECK_TIMEOUT / 10))
local attempt=0
while [[ $attempt -lt $max_attempts ]]; do
local healthy_endpoints=0
for endpoint_def in "${endpoints[@]}"; do
local endpoint="${endpoint_def%:*}"
local service="${endpoint_def##*:}"
if curl -sf "$endpoint" >/dev/null 2>&1; then
healthy_endpoints=$((healthy_endpoints + 1))
log DEBUG "Health check passed: $service"
else
log DEBUG "Health check pending: $service"
fi
done
if [[ $healthy_endpoints -eq ${#endpoints[@]} ]]; then
log INFO "All health checks passed"
return
fi
sleep 10
attempt=$((attempt + 1))
done
log WARN "Some health checks failed after ${HEALTH_CHECK_TIMEOUT}s timeout"
}
# Validate deployment
validate_deployment() {
log INFO "Validating deployment..."
if [[ "$DRY_RUN" == "true" ]]; then
log DEBUG "Skipping validation in dry run mode"
return
fi
# Check stack services
local services
services=$(docker stack services "$STACK_NAME" --format "{{.Name}}" | wc -l)
log INFO "Deployed services: $services"
# Check if Prometheus is collecting metrics
sleep 30 # Allow time for initial metric collection
if curl -sf "http://localhost:9090/api/v1/query?query=up" | jq -r '.data.result | length' | grep -q "^[1-9]"; then
log INFO "Prometheus is collecting metrics"
else
log WARN "Prometheus may not be collecting metrics yet"
fi
# Check if Grafana can connect to Prometheus
if curl -sf "http://admin:admin@localhost:3000/api/datasources/proxy/1/api/v1/query?query=up" >/dev/null 2>&1; then
log INFO "Grafana can connect to Prometheus"
else
log WARN "Grafana datasource connection may be pending"
fi
# Check AlertManager configuration
if curl -sf "http://localhost:9093/api/v1/status" >/dev/null 2>&1; then
log INFO "AlertManager is operational"
else
log WARN "AlertManager may not be ready"
fi
}
# Import Grafana dashboards
import_dashboards() {
log INFO "Importing Grafana dashboards..."
if [[ "$DRY_RUN" == "true" ]]; then
log DEBUG "Skipping dashboard import in dry run mode"
return
fi
# Wait for Grafana to be ready
local max_attempts=30
local attempt=0
while [[ $attempt -lt $max_attempts ]]; do
if curl -sf "http://admin:admin@localhost:3000/api/health" >/dev/null 2>&1; then
break
fi
sleep 5
attempt=$((attempt + 1))
done
if [[ $attempt -eq $max_attempts ]]; then
log WARN "Grafana not ready for dashboard import"
return
fi
# Import dashboards
local dashboard_dir="${PROJECT_ROOT}/monitoring/grafana-dashboards"
if [[ -d "$dashboard_dir" ]]; then
for dashboard_file in "$dashboard_dir"/*.json; do
if [[ -f "$dashboard_file" ]]; then
local dashboard_name
dashboard_name=$(basename "$dashboard_file" .json)
if curl -X POST \
-H "Content-Type: application/json" \
-d "@$dashboard_file" \
"http://admin:admin@localhost:3000/api/dashboards/db" \
>/dev/null 2>&1; then
log INFO "Imported dashboard: $dashboard_name"
else
log WARN "Failed to import dashboard: $dashboard_name"
fi
fi
done
fi
}
# Generate deployment report
generate_report() {
log INFO "Generating deployment report..."
local report_file="/tmp/bzzz-monitoring-deployment-report-${TIMESTAMP}.txt"
cat > "$report_file" <<EOF
BZZZ Enhanced Monitoring Stack Deployment Report
================================================
Deployment Time: $(date)
Environment: $ENVIRONMENT
Stack Name: $STACK_NAME
Dry Run: $DRY_RUN
Services Deployed:
EOF
if [[ "$DRY_RUN" != "true" ]]; then
docker stack services "$STACK_NAME" --format " - {{.Name}}: {{.Replicas}}" >> "$report_file"
echo "" >> "$report_file"
echo "Service Health:" >> "$report_file"
# Add health check results
local health_endpoints=(
"http://localhost:9090/-/healthy:Prometheus"
"http://localhost:3000/api/health:Grafana"
"http://localhost:9093/-/healthy:AlertManager"
)
for endpoint_def in "${health_endpoints[@]}"; do
local endpoint="${endpoint_def%:*}"
local service="${endpoint_def##*:}"
if curl -sf "$endpoint" >/dev/null 2>&1; then
echo " - $service: ✅ Healthy" >> "$report_file"
else
echo " - $service: ❌ Unhealthy" >> "$report_file"
fi
done
else
echo " [Dry run mode - no services deployed]" >> "$report_file"
fi
cat >> "$report_file" <<EOF
Access URLs:
- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- AlertManager: http://localhost:9093
Configuration:
- Log file: $LOG_FILE
- Backup directory: /rust/bzzz-v2/backups/monitoring/backup_${TIMESTAMP}
- Config version: $CONFIG_VERSION
Next Steps:
1. Change default Grafana admin password
2. Configure notification channels in AlertManager
3. Review and customize alert rules
4. Set up external authentication (optional)
EOF
log INFO "Deployment report generated: $report_file"
# Display report
echo ""
echo "=========================================="
cat "$report_file"
echo "=========================================="
}
# Main execution
main() {
log INFO "Starting BZZZ Enhanced Monitoring Stack deployment"
log INFO "Environment: $ENVIRONMENT, Dry Run: $DRY_RUN"
log INFO "Log file: $LOG_FILE"
check_prerequisites
setup_directories
backup_existing_config
create_missing_configs
create_secrets
create_configs
deploy_monitoring_stack
perform_health_checks
validate_deployment
import_dashboards
generate_report
log INFO "Deployment completed successfully!"
if [[ "$DRY_RUN" != "true" ]]; then
echo ""
echo "🎉 BZZZ Enhanced Monitoring Stack is now running!"
echo "📊 Grafana Dashboard: http://localhost:3000"
echo "📈 Prometheus: http://localhost:9090"
echo "🚨 AlertManager: http://localhost:9093"
echo ""
echo "Next steps:"
echo "1. Change default Grafana password"
echo "2. Configure alert notification channels"
echo "3. Review monitoring dashboards"
echo "4. Run reliability tests: ./infrastructure/testing/run-tests.sh all"
fi
}
# Script execution
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
main "$@"
fi


@@ -0,0 +1,686 @@
# BZZZ Infrastructure Reliability Testing Plan
## Overview
This document outlines comprehensive testing procedures to validate the reliability, performance, and operational readiness of the BZZZ distributed system infrastructure enhancements.
## Test Categories
1. Component Health Testing
2. Integration Testing
3. Chaos Engineering
4. Performance Testing
5. Monitoring and Alerting Validation
6. Disaster Recovery Testing
---
## 1. Component Health Testing
### 1.1 Enhanced Health Checks Validation
**Objective**: Verify enhanced health check implementations work correctly.
#### Test Cases
**TC-01: PubSub Health Probes**
```bash
# Test PubSub round-trip functionality
curl -X POST http://bzzz-agent:8080/test/pubsub-health \
-H "Content-Type: application/json" \
-d '{"test_duration": "30s", "message_count": 100}'
# Expected: Success rate > 99%, latency < 100ms
```
**TC-02: DHT Health Probes**
```bash
# Test DHT put/get operations
curl -X POST http://bzzz-agent:8080/test/dht-health \
-H "Content-Type: application/json" \
-d '{"test_duration": "60s", "operation_count": 50}'
# Expected: Success rate > 99%, p95 latency < 300ms
```
**TC-03: Election Health Monitoring**
```bash
# Test election stability
curl -X GET http://bzzz-agent:8080/health/checks | jq '.checks["election-health"]'
# Trigger controlled election
curl -X POST http://bzzz-agent:8080/admin/trigger-election
# Expected: Stable admin election within 30 seconds
```
#### Validation Criteria
- [ ] All health checks report accurate status
- [ ] Health check latencies are within SLO thresholds
- [ ] Failed health checks trigger appropriate alerts
- [ ] Health history is properly maintained
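A minimal sketch for automating the first two criteria above, assuming the aggregated `/health/checks` endpoint returns a JSON map of named checks with `status` and `latency_ms` fields (the field names are illustrative, not confirmed from the API):
```bash
#!/bin/bash
# Poll the aggregated health endpoint and fail if any check is unhealthy
# or slower than the assumed latency threshold.
ENDPOINT="http://bzzz-agent:8080/health/checks"
MAX_LATENCY_MS=300

checks=$(curl -sf "$ENDPOINT") || { echo "health endpoint unreachable"; exit 1; }

failed=0
while read -r name status latency; do
    latency=${latency%.*}  # drop any fractional milliseconds for bash arithmetic
    if [[ "$status" != "healthy" ]]; then
        echo "❌ $name is $status"
        failed=1
    elif (( latency > MAX_LATENCY_MS )); then
        echo "⚠️ $name latency ${latency}ms exceeds ${MAX_LATENCY_MS}ms"
        failed=1
    else
        echo "✅ $name (${latency}ms)"
    fi
done < <(echo "$checks" | jq -r '.checks | to_entries[] | "\(.key) \(.value.status) \(.value.latency_ms)"')

exit $failed
```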
### 1.2 SLURP Leadership Health Testing
**TC-04: Leadership Transition Health**
```bash
# Test leadership transition health
./scripts/test-leadership-transition.sh
# Expected outcomes:
# - Clean leadership transitions
# - No dropped tasks during transition
# - Health scores maintain > 0.8 during transition
```
**TC-05: Degraded Leader Detection**
```bash
# Simulate resource exhaustion
docker service update --limit-memory 512M bzzz-v2_bzzz-agent
# Expected: Transition to degraded leader state within 2 minutes
# Expected: Health alerts fired appropriately
```
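A sketch of how the degraded-leader window could be measured, reusing the `bzzz_system_health_score` Prometheus query used elsewhere in this plan; the 0.8 threshold mirrors the validation criteria above:
```bash
#!/bin/bash
# Watch the system health score after the memory limit is applied and
# report how long it takes to drop below the degraded threshold.
PROM="http://prometheus:9090"
THRESHOLD=0.8
START=$(date +%s)

while true; do
    score=$(curl -s "${PROM}/api/v1/query?query=bzzz_system_health_score" \
        | jq -r '.data.result[0].value[1] // "1"')
    elapsed=$(( $(date +%s) - START ))
    if (( $(echo "$score < $THRESHOLD" | bc -l) )); then
        echo "Degradation detected after ${elapsed}s (score=${score})"
        break
    fi
    if (( elapsed > 180 )); then
        echo "❌ No degradation detected within 3 minutes"
        exit 1
    fi
    sleep 10
done
```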
---
## 2. Integration Testing
### 2.1 End-to-End System Testing
**TC-06: Complete Task Lifecycle**
```bash
#!/bin/bash
# Test complete task flow from submission to completion
# 1. Submit context generation task
TASK_ID=$(curl -X POST http://bzzz.deepblack.cloud/api/slurp/generate \
-H "Content-Type: application/json" \
-d '{
"ucxl_address": "ucxl://test/document.md",
"role": "test_analyst",
"priority": "high"
}' | jq -r '.task_id')
echo "Task submitted: $TASK_ID"
# 2. Monitor task progress
while true; do
STATUS=$(curl -s http://bzzz.deepblack.cloud/api/slurp/status/$TASK_ID | jq -r '.status')
echo "Task status: $STATUS"
if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
break
fi
sleep 5
done
# 3. Validate results
if [ "$STATUS" = "completed" ]; then
echo "✅ Task completed successfully"
RESULT=$(curl -s http://bzzz.deepblack.cloud/api/slurp/result/$TASK_ID)
echo "Result size: $(echo $RESULT | jq -r '.content | length')"
else
echo "❌ Task failed"
exit 1
fi
```
**TC-07: Multi-Node Coordination**
```bash
# Test coordination across cluster nodes
./scripts/test-multi-node-coordination.sh
# Test matrix:
# - Task submission on node A, execution on node B
# - DHT storage on node A, retrieval on node C
# - Election on mixed node topology
```
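The coordination script itself is not reproduced here, but a cut-down sketch of the DHT leg of the matrix (store through one node, read back through another) could look like the following; the node hostnames and `/api/dht/*` paths are placeholders for whatever storage endpoints the agents actually expose:
```bash
#!/bin/bash
# Store a value via node A and read it back via node C to confirm
# cross-node DHT coordination (hostnames and endpoint paths are illustrative).
NODE_A="http://walnut:8080"
NODE_C="http://acacia:8080"
KEY="test/coordination-$(date +%s)"
VALUE="multi-node-check"

curl -sf -X PUT "${NODE_A}/api/dht/${KEY}" -d "$VALUE" >/dev/null \
    || { echo "❌ PUT via node A failed"; exit 1; }

sleep 5  # allow replication/propagation

fetched=$(curl -sf "${NODE_C}/api/dht/${KEY}") || { echo "❌ GET via node C failed"; exit 1; }

if [[ "$fetched" == "$VALUE" ]]; then
    echo "✅ Value written on node A is readable from node C"
else
    echo "❌ Mismatch: expected '$VALUE', got '$fetched'"
    exit 1
fi
```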
### 2.2 Inter-Service Communication Testing
**TC-08: Service Mesh Validation**
```bash
# Test all service-to-service communications
./scripts/test-service-mesh.sh
# Validate:
# - bzzz-agent ↔ postgres
# - bzzz-agent ↔ redis
# - bzzz-agent ↔ dht-bootstrap nodes
# - mcp-server ↔ bzzz-agent
# - content-resolver ↔ bzzz-agent
```
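A minimal sketch of the pairwise connectivity portion of the mesh check, assuming each dependency can be probed with a TCP connect from inside a running agent container and that `nc` and `timeout` are available in the agent image (service names and ports are assumptions based on the stack layout):
```bash
#!/bin/bash
# Probe each downstream dependency from inside a bzzz-agent task on this node.
DEPENDENCIES=(
    "postgres:5432"
    "redis:6379"
    "dht-bootstrap-1:4001"
    "mcp-server:3000"
    "content-resolver:8081"
)

container=$(docker ps --filter "name=bzzz-v2_bzzz-agent" --format '{{.ID}}' | head -n1)
[[ -n "$container" ]] || { echo "no bzzz-agent container found on this node"; exit 1; }

for dep in "${DEPENDENCIES[@]}"; do
    host="${dep%:*}"
    port="${dep##*:}"
    if docker exec "$container" sh -c "timeout 3 nc -z $host $port" 2>/dev/null; then
        echo "✅ bzzz-agent → ${host}:${port}"
    else
        echo "❌ bzzz-agent → ${host}:${port} unreachable"
    fi
done
```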
---
## 3. Chaos Engineering
### 3.1 Node Failure Testing
**TC-09: Single Node Failure**
```bash
#!/bin/bash
# Test system resilience to single node failure
# 1. Record baseline metrics
echo "Recording baseline metrics..."
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' > baseline_metrics.json
# 2. Identify current leader
LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')
echo "Current leader: $LEADER"
# 3. Simulate node failure
echo "Simulating failure of node: $LEADER"
docker node update --availability drain $LEADER
# 4. Monitor recovery
START_TIME=$(date +%s)
while true; do
CURRENT_TIME=$(date +%s)
ELAPSED=$((CURRENT_TIME - START_TIME))
# Check if new leader elected
NEW_LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')
if [ "$NEW_LEADER" != "null" ] && [ "$NEW_LEADER" != "$LEADER" ]; then
echo "✅ New leader elected: $NEW_LEADER (${ELAPSED}s)"
break
fi
if [ $ELAPSED -gt 120 ]; then
echo "❌ Leadership recovery timeout"
exit 1
fi
sleep 5
done
# 5. Validate system health
sleep 30 # Allow system to stabilize
HEALTH_SCORE=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq -r '.data.result[0].value[1]')
echo "Post-failure health score: $HEALTH_SCORE"
if (( $(echo "$HEALTH_SCORE > 0.8" | bc -l) )); then
echo "✅ System recovered successfully"
else
echo "❌ System health degraded: $HEALTH_SCORE"
exit 1
fi
# 6. Restore node
docker node update --availability active $LEADER
```
**TC-10: Multi-Node Cascade Failure**
```bash
# Test system resilience to cascade failures
./scripts/test-cascade-failure.sh
# Scenario: Fail 2 out of 5 nodes simultaneously
# Expected: System continues operating with degraded performance
# Expected: All critical data remains available
```
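A sketch of the two-node drain used for the cascade scenario; the node names are examples and should match your swarm inventory:
```bash
#!/bin/bash
# Drain two worker nodes simultaneously, then verify the stack keeps serving.
NODES=("ironwood" "acacia")   # example node names

for node in "${NODES[@]}"; do
    docker node update --availability drain "$node"
done

sleep 60  # allow tasks to reschedule onto remaining nodes

if curl -sf https://bzzz.deepblack.cloud/health >/dev/null; then
    echo "✅ API still responding with 2 nodes drained"
else
    echo "❌ API unavailable during cascade failure"
fi

# Restore the nodes regardless of the outcome
for node in "${NODES[@]}"; do
    docker node update --availability active "$node"
done
```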
### 3.2 Network Partition Testing
**TC-11: DHT Network Partition**
```bash
#!/bin/bash
# Test DHT resilience to network partitions
# 1. Create network partition
echo "Creating network partition..."
iptables -A INPUT -s 192.168.1.72 -j DROP # Block ironwood
iptables -A OUTPUT -d 192.168.1.72 -j DROP
# 2. Monitor DHT health
./scripts/monitor-dht-partition-recovery.sh &
MONITOR_PID=$!
# 3. Wait for partition duration
sleep 300 # 5 minute partition
# 4. Heal partition
echo "Healing network partition..."
iptables -D INPUT -s 192.168.1.72 -j DROP
iptables -D OUTPUT -d 192.168.1.72 -j DROP
# 5. Wait for recovery
sleep 180 # 3 minute recovery window
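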
# 6. Validate recovery
kill $MONITOR_PID
./scripts/validate-dht-recovery.sh
```
### 3.3 Resource Exhaustion Testing
**TC-12: Memory Exhaustion**
```bash
# Test behavior under memory pressure
stress-ng --vm 4 --vm-bytes 75% --timeout 300s &
STRESS_PID=$!
# Monitor system behavior
./scripts/monitor-memory-exhaustion.sh
# Expected: Graceful degradation, no crashes
# Expected: Health checks detect degradation
# Expected: Alerts fired appropriately
kill $STRESS_PID
```
**TC-13: Disk Space Exhaustion**
```bash
# Test disk space exhaustion handling (size the file to approach the node's free space)
dd if=/dev/zero of=/tmp/fill-disk bs=1M count=$(( $(df -m --output=avail /tmp | tail -1) - 512 ))
# (remove /tmp/fill-disk after the test to restore disk space)
# Expected: Services detect low disk space
# Expected: Appropriate cleanup mechanisms activate
# Expected: System remains operational
```
---
## 4. Performance Testing
### 4.1 Load Testing
**TC-14: Context Generation Load Test**
```bash
#!/bin/bash
# Load test context generation system
# Test configuration
CONCURRENT_USERS=50
TEST_DURATION=600 # 10 minutes
RAMP_UP_TIME=60 # 1 minute
# Run load test
# Ramp up to the target VU count, then hold for the test duration (expressed as k6 stages)
k6 run \
  --stage "${RAMP_UP_TIME}s:${CONCURRENT_USERS}" \
  --stage "${TEST_DURATION}s:${CONCURRENT_USERS}" \
  ./scripts/load-test-context-generation.js
# Success criteria:
# - Throughput: > 10 requests/second
# - P95 latency: < 2 seconds
# - Error rate: < 1%
# - System health score: > 0.8 throughout test
```
**TC-15: DHT Throughput Test**
```bash
# Test DHT operation throughput
./scripts/dht-throughput-test.sh
# Test matrix:
# - PUT operations: Target 100 ops/sec
# - GET operations: Target 500 ops/sec
# - Mixed workload: 80% GET, 20% PUT
```
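A rough sketch of the mixed-workload leg (80% GET / 20% PUT) with a coarse ops/sec figure; the `/api/dht/*` paths are illustrative placeholders for the agent's storage API:
```bash
#!/bin/bash
# Mixed DHT workload sketch: 80% GET, 20% PUT, reporting rough throughput.
AGENT="http://bzzz-agent:8080"
TOTAL_OPS=1000

# Seed a key so the first GETs have something to read
curl -sf -X PUT "${AGENT}/api/dht/throughput-key-0" -d "seed" >/dev/null

start=$(date +%s)
for i in $(seq 1 "$TOTAL_OPS"); do
    if (( i % 5 == 0 )); then
        curl -sf -X PUT "${AGENT}/api/dht/throughput-key-${i}" -d "payload-${i}" >/dev/null
    else
        last_put=$(( (i / 5) * 5 ))   # most recently written key
        curl -sf "${AGENT}/api/dht/throughput-key-${last_put}" >/dev/null
    fi
done
elapsed=$(( $(date +%s) - start ))
(( elapsed == 0 )) && elapsed=1

echo "Completed ${TOTAL_OPS} mixed operations in ${elapsed}s (~$(( TOTAL_OPS / elapsed )) ops/sec)"
```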
### 4.2 Scalability Testing
**TC-16: Horizontal Scaling Test**
```bash
#!/bin/bash
# Test horizontal scaling behavior
# Baseline measurement
echo "Recording baseline performance..."
./scripts/measure-baseline-performance.sh
# Scale up
echo "Scaling up services..."
docker service scale bzzz-v2_bzzz-agent=6
sleep 60 # Allow services to start
# Measure scaled performance
echo "Measuring scaled performance..."
./scripts/measure-scaled-performance.sh
# Validate improvements
echo "Validating scaling improvements..."
./scripts/validate-scaling-improvements.sh
# Expected: Linear improvement in throughput
# Expected: No degradation in latency
# Expected: Stable error rates
```
---
## 5. Monitoring and Alerting Validation
### 5.1 Alert Testing
**TC-17: Critical Alert Testing**
```bash
#!/bin/bash
# Test critical alert firing and resolution
ALERTS_TO_TEST=(
"BZZZSystemHealthCritical"
"BZZZInsufficientPeers"
"BZZZDHTLowSuccessRate"
"BZZZNoAdminElected"
"BZZZTaskQueueBackup"
)
for alert in "${ALERTS_TO_TEST[@]}"; do
echo "Testing alert: $alert"
# Trigger condition
./scripts/trigger-alert-condition.sh "$alert"
# Wait for alert
timeout 300 ./scripts/wait-for-alert.sh "$alert"
if [ $? -eq 0 ]; then
echo "✅ Alert $alert fired successfully"
else
echo "❌ Alert $alert failed to fire"
fi
# Resolve condition
./scripts/resolve-alert-condition.sh "$alert"
# Wait for resolution
timeout 300 ./scripts/wait-for-alert-resolution.sh "$alert"
if [ $? -eq 0 ]; then
echo "✅ Alert $alert resolved successfully"
else
echo "❌ Alert $alert failed to resolve"
fi
done
```
### 5.2 Metrics Validation
**TC-18: Metrics Accuracy Test**
```bash
# Validate metrics accuracy against actual system state
./scripts/validate-metrics-accuracy.sh
# Test cases:
# - Connected peers count vs actual P2P connections
# - DHT operation counters vs logged operations
# - Task completion rates vs actual completions
# - Resource usage vs system measurements
```
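One of the spot checks from the list above, sketched out: compare the Prometheus peer-count metric against the agent's own view. The metric name follows the `bzzz_*` convention used in this plan and the `/api/p2p/peers` path is an assumption:
```bash
#!/bin/bash
# Cross-check the reported peer-count metric against the agent API.
PROM="http://prometheus:9090"
AGENT="http://bzzz-agent:8080"

metric_peers=$(curl -s "${PROM}/api/v1/query?query=bzzz_connected_peers" \
    | jq -r '.data.result[0].value[1] // "0"')
api_peers=$(curl -s "${AGENT}/api/p2p/peers" | jq -r '.peers | length')

echo "Prometheus reports ${metric_peers} peers, agent API reports ${api_peers}"

if [[ "${metric_peers%.*}" -eq "$api_peers" ]]; then
    echo "✅ Peer count metric matches actual P2P connections"
else
    echo "❌ Metric drift detected"
    exit 1
fi
```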
### 5.3 Dashboard Functionality
**TC-19: Grafana Dashboard Test**
```bash
# Test all Grafana dashboards
./scripts/test-grafana-dashboards.sh
# Validation:
# - All panels load without errors
# - Data displays correctly for all time ranges
# - Drill-down functionality works
# - Alert annotations appear correctly
```
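A sketch of the "all panels load without errors" portion, using Grafana's standard HTTP API and the default admin credentials from this deployment (change them in production):
```bash
#!/bin/bash
# List all dashboards via the Grafana search API and confirm each one can be
# fetched, reporting its panel count.
GRAFANA="http://admin:admin@localhost:3000"

uids=$(curl -sf "${GRAFANA}/api/search?type=dash-db" | jq -r '.[].uid')

for uid in $uids; do
    dash=$(curl -sf "${GRAFANA}/api/dashboards/uid/${uid}") \
        || { echo "❌ Failed to load dashboard ${uid}"; continue; }
    title=$(echo "$dash" | jq -r '.dashboard.title')
    panels=$(echo "$dash" | jq '.dashboard.panels | length')
    echo "✅ ${title}: ${panels} panels"
done
```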
---
## 6. Disaster Recovery Testing
### 6.1 Data Recovery Testing
**TC-20: Database Recovery Test**
```bash
#!/bin/bash
# Test database backup and recovery procedures
# 1. Create test data
echo "Creating test data..."
./scripts/create-test-data.sh
# 2. Perform backup
echo "Creating backup..."
./scripts/backup-database.sh
# 3. Simulate data loss
echo "Simulating data loss..."
docker service scale bzzz-v2_postgres=0
docker volume rm bzzz-v2_postgres_data
# 4. Restore from backup
echo "Restoring from backup..."
./scripts/restore-database.sh
# 5. Validate data integrity
echo "Validating data integrity..."
./scripts/validate-restored-data.sh
# Expected: 100% data recovery
# Expected: All relationships intact
# Expected: System fully operational
```
### 6.2 Configuration Recovery
**TC-21: Configuration Disaster Recovery**
```bash
# Test recovery of all system configurations
./scripts/test-configuration-recovery.sh
# Test scenarios:
# - Docker secrets loss and recovery
# - Docker configs corruption and recovery
# - Service definition recovery
# - Network configuration recovery
```
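Docker secrets cannot be read back once created, so secret recovery has to restore material from an out-of-band source; a sketch of the config side (which is inspectable) plus the secret re-creation step, with illustrative names and backup paths:
```bash
#!/bin/bash
# Back up all Docker configs for the stack and re-create a lost secret from
# a secured out-of-band backup file (names and paths are illustrative).
BACKUP_DIR="/rust/bzzz-v2/backups/configs/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Configs can be exported directly (Spec includes the base64-encoded data)
for cfg in $(docker config ls --format '{{.Name}}' | grep '^bzzz_'); do
    docker config inspect --format '{{json .Spec}}' "$cfg" > "${BACKUP_DIR}/${cfg}.json"
    echo "Backed up config: $cfg"
done

# Secrets cannot be read back; re-create from the secured backup copy
SECRET_NAME="bzzz_slack_webhook_url"                              # assumed name
SECRET_SOURCE="/rust/bzzz-v2/backups/secrets/slack_webhook_url"   # assumed location
if ! docker secret inspect "$SECRET_NAME" >/dev/null 2>&1; then
    docker secret create "$SECRET_NAME" "$SECRET_SOURCE"
    echo "Re-created secret: $SECRET_NAME"
fi
```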
### 6.3 Full System Recovery
**TC-22: Complete Infrastructure Recovery**
```bash
#!/bin/bash
# Test complete system recovery from scratch
# 1. Document current state
echo "Documenting current system state..."
./scripts/document-system-state.sh > pre-disaster-state.json
# 2. Simulate complete infrastructure loss
echo "Simulating infrastructure disaster..."
docker stack rm bzzz-v2
docker system prune -f --volumes
# 3. Recover infrastructure
echo "Recovering infrastructure..."
./scripts/deploy-from-scratch.sh
# 4. Validate recovery
echo "Validating recovery..."
./scripts/validate-complete-recovery.sh pre-disaster-state.json
# Success criteria:
# - All services operational within 15 minutes
# - All data recovered correctly
# - System health score > 0.9
# - All integrations functional
```
---
## Test Execution Framework
### Automated Test Runner
```bash
#!/bin/bash
# Main test execution script
TEST_SUITE=${1:-"all"}
ENVIRONMENT=${2:-"staging"}
echo "Running BZZZ reliability tests..."
echo "Suite: $TEST_SUITE"
echo "Environment: $ENVIRONMENT"
# Setup test environment
./scripts/setup-test-environment.sh $ENVIRONMENT
# Run test suites
case $TEST_SUITE in
"health")
./scripts/run-health-tests.sh
;;
"integration")
./scripts/run-integration-tests.sh
;;
"chaos")
./scripts/run-chaos-tests.sh
;;
"performance")
./scripts/run-performance-tests.sh
;;
"monitoring")
./scripts/run-monitoring-tests.sh
;;
"disaster-recovery")
./scripts/run-disaster-recovery-tests.sh
;;
"all")
./scripts/run-all-tests.sh
;;
*)
echo "Unknown test suite: $TEST_SUITE"
exit 1
;;
esac
# Generate test report
./scripts/generate-test-report.sh
echo "Test execution completed."
```
### Test Environment Setup
```yaml
# test-environment.yml
version: '3.8'
services:
# Staging environment with reduced resource requirements
bzzz-agent-test:
image: registry.home.deepblack.cloud/bzzz:test-latest
environment:
- LOG_LEVEL=debug
- TEST_MODE=true
- METRICS_ENABLED=true
networks:
- test-network
deploy:
replicas: 3
resources:
limits:
memory: 1G
cpus: '0.5'
# Test data generator
test-data-generator:
image: registry.home.deepblack.cloud/bzzz-test-generator:latest
environment:
- TARGET_ENDPOINT=http://bzzz-agent-test:9000
- DATA_VOLUME=medium
networks:
- test-network
networks:
test-network:
driver: overlay
```
### Continuous Testing Pipeline
```yaml
# .github/workflows/reliability-testing.yml
name: BZZZ Reliability Testing
on:
schedule:
- cron: '0 2 * * *' # Daily at 2 AM
workflow_dispatch:
jobs:
health-tests:
runs-on: self-hosted
steps:
- uses: actions/checkout@v3
- name: Run Health Tests
run: ./infrastructure/testing/run-tests.sh health staging
performance-tests:
runs-on: self-hosted
needs: health-tests
steps:
- name: Run Performance Tests
run: ./infrastructure/testing/run-tests.sh performance staging
chaos-tests:
runs-on: self-hosted
needs: health-tests
if: github.event_name == 'workflow_dispatch'
steps:
- name: Run Chaos Tests
run: ./infrastructure/testing/run-tests.sh chaos staging
```
---
## Success Criteria
### Overall System Reliability Targets
- **Availability SLO**: 99.9% uptime
- **Performance SLO**:
- Context generation: p95 < 2 seconds
- DHT operations: p95 < 300ms
- P2P messaging: p95 < 500ms
- **Error Rate SLO**: < 0.1% for all operations
- **Recovery Time Objective (RTO)**: < 15 minutes
- **Recovery Point Objective (RPO)**: < 5 minutes
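These targets can be spot-checked directly against Prometheus. A minimal sketch for the DHT latency SLO, assuming a `bzzz_dht_operation_duration_seconds` histogram (the metric name is illustrative):
```bash
#!/bin/bash
# Query the p95 DHT operation latency over the last hour and compare it to
# the 300ms SLO target.
PROM="http://prometheus:9090"
QUERY='histogram_quantile(0.95, sum(rate(bzzz_dht_operation_duration_seconds_bucket[1h])) by (le))'

p95=$(curl -s --get "${PROM}/api/v1/query" --data-urlencode "query=${QUERY}" \
    | jq -r '.data.result[0].value[1] // "NaN"')

if [[ "$p95" == "NaN" ]]; then
    echo "⚠️ No data returned for the query"
    exit 1
fi

echo "DHT p95 latency over the last hour: ${p95}s"

if (( $(echo "$p95 < 0.3" | bc -l) )); then
    echo "✅ Within the 300ms SLO"
else
    echo "❌ SLO breach"
    exit 1
fi
```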
### Test Pass Criteria
- **Health Tests**: 100% of health checks function correctly
- **Integration Tests**: 95% pass rate for all integration scenarios
- **Chaos Tests**: System recovers within SLO targets for all failure scenarios
- **Performance Tests**: All performance metrics meet SLO targets under load
- **Monitoring Tests**: 100% of alerts fire and resolve correctly
- **Disaster Recovery**: Complete system recovery within RTO/RPO targets
### Continuous Monitoring
- Daily automated health and integration tests
- Weekly performance regression testing
- Monthly chaos engineering exercises
- Quarterly disaster recovery drills
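For clusters where the scheduled GitHub Actions workflow above is not available, the daily and weekly cadences map naturally onto cron on a manager node (the monthly and quarterly exercises are better scheduled manually); a sketch, assuming the repo is checked out at /rust/bzzz-v2/bzzz:
```bash
#!/bin/bash
# Install the daily/weekly test schedule on a manager node.
REPO="/rust/bzzz-v2/bzzz"
( crontab -l 2>/dev/null
  # Daily health and integration tests at 02:00 / 02:30
  echo "0 2 * * * ${REPO}/infrastructure/testing/run-tests.sh health staging >> /var/log/bzzz-tests.log 2>&1"
  echo "30 2 * * * ${REPO}/infrastructure/testing/run-tests.sh integration staging >> /var/log/bzzz-tests.log 2>&1"
  # Weekly performance regression run, Sundays at 03:00
  echo "0 3 * * 0 ${REPO}/infrastructure/testing/run-tests.sh performance staging >> /var/log/bzzz-tests.log 2>&1"
) | sort -u | crontab -   # sort -u avoids duplicate entries on re-runs
```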
---
## Test Reporting and Documentation
### Test Results Dashboard
- Real-time test execution status
- Historical test results and trends
- Performance benchmarks over time
- Failure analysis and remediation tracking
### Test Documentation
- Detailed test procedures and scripts
- Failure scenarios and response procedures
- Performance baselines and regression analysis
- Disaster recovery validation reports
This comprehensive testing plan ensures that all infrastructure enhancements are thoroughly validated and the system meets its reliability and performance objectives.