# BZZZ Infrastructure Reliability Testing Plan

## Overview

This document outlines comprehensive testing procedures to validate the reliability, performance, and operational readiness of the BZZZ distributed system infrastructure enhancements.

## Test Categories

1. Component Health Testing
2. Integration Testing
3. Chaos Engineering
4. Performance Testing
5. Monitoring and Alerting Validation
6. Disaster Recovery Testing

---

## 1. Component Health Testing

### 1.1 Enhanced Health Checks Validation

**Objective**: Verify enhanced health check implementations work correctly.

#### Test Cases

**TC-01: PubSub Health Probes**

```bash
# Test PubSub round-trip functionality
curl -X POST http://bzzz-agent:8080/test/pubsub-health \
  -H "Content-Type: application/json" \
  -d '{"test_duration": "30s", "message_count": 100}'

# Expected: Success rate > 99%, latency < 100ms
```

**TC-02: DHT Health Probes**

```bash
# Test DHT put/get operations
curl -X POST http://bzzz-agent:8080/test/dht-health \
  -H "Content-Type: application/json" \
  -d '{"test_duration": "60s", "operation_count": 50}'

# Expected: Success rate > 99%, p95 latency < 300ms
```

**TC-03: Election Health Monitoring**

```bash
# Test election stability
curl -X GET http://bzzz-agent:8080/health/checks | jq '.checks["election-health"]'

# Trigger controlled election
curl -X POST http://bzzz-agent:8080/admin/trigger-election

# Expected: Stable admin election within 30 seconds
```

#### Validation Criteria

- [ ] All health checks report accurate status
- [ ] Health check latencies are within SLO thresholds
- [ ] Failed health checks trigger appropriate alerts
- [ ] Health history is properly maintained

### 1.2 SLURP Leadership Health Testing

**TC-04: Leadership Transition Health**

```bash
# Test leadership transition health
./scripts/test-leadership-transition.sh

# Expected outcomes:
# - Clean leadership transitions
# - No dropped tasks during transition
# - Health scores maintain > 0.8 during transition
```
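The transition script itself is not reproduced in this plan. A minimal sketch of the kind of check it can perform is shown below; it reuses the election trigger from TC-03 and the election status and health score queries from TC-09, and the exact endpoints, thresholds, and timings are assumptions carried over from those test cases.

```bash
#!/bin/bash
# Illustrative leadership-transition check (not the actual
# test-leadership-transition.sh). Endpoints and thresholds are assumed
# from TC-03 and TC-09.

OLD_ADMIN=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')

# Force a controlled election on the current admin's agent (TC-03 endpoint)
curl -X POST http://bzzz-agent:8080/admin/trigger-election

# Wait for a new admin and confirm the health score stays above 0.8 throughout
for i in $(seq 1 30); do
  NEW_ADMIN=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')
  HEALTH=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' \
    | jq -r '.data.result[0].value[1]')

  if (( $(echo "$HEALTH < 0.8" | bc -l) )); then
    echo "❌ Health score dropped to $HEALTH during transition"
    exit 1
  fi

  if [ "$NEW_ADMIN" != "null" ] && [ "$NEW_ADMIN" != "$OLD_ADMIN" ]; then
    echo "✅ Clean transition: $OLD_ADMIN -> $NEW_ADMIN"
    exit 0
  fi

  sleep 2
done

echo "❌ No leadership transition observed"
exit 1
```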
**TC-05: Degraded Leader Detection**

```bash
# Simulate resource exhaustion
docker service update --limit-memory 512M bzzz-v2_bzzz-agent

# Expected: Transition to degraded leader state within 2 minutes
# Expected: Health alerts fired appropriately
```

---

## 2. Integration Testing

### 2.1 End-to-End System Testing

**TC-06: Complete Task Lifecycle**

```bash
#!/bin/bash
# Test complete task flow from submission to completion

# 1. Submit context generation task
TASK_ID=$(curl -X POST http://bzzz.deepblack.cloud/api/slurp/generate \
  -H "Content-Type: application/json" \
  -d '{
    "ucxl_address": "ucxl://test/document.md",
    "role": "test_analyst",
    "priority": "high"
  }' | jq -r '.task_id')

echo "Task submitted: $TASK_ID"

# 2. Monitor task progress
while true; do
  STATUS=$(curl -s http://bzzz.deepblack.cloud/api/slurp/status/$TASK_ID | jq -r '.status')
  echo "Task status: $STATUS"

  if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
    break
  fi

  sleep 5
done

# 3. Validate results
if [ "$STATUS" = "completed" ]; then
  echo "✅ Task completed successfully"
  RESULT=$(curl -s http://bzzz.deepblack.cloud/api/slurp/result/$TASK_ID)
  echo "Result size: $(echo $RESULT | jq -r '.content | length')"
else
  echo "❌ Task failed"
  exit 1
fi
```

**TC-07: Multi-Node Coordination**

```bash
# Test coordination across cluster nodes
./scripts/test-multi-node-coordination.sh

# Test matrix:
# - Task submission on node A, execution on node B
# - DHT storage on node A, retrieval on node C
# - Election on mixed node topology
```

### 2.2 Inter-Service Communication Testing

**TC-08: Service Mesh Validation**

```bash
# Test all service-to-service communications
./scripts/test-service-mesh.sh

# Validate:
# - bzzz-agent ↔ postgres
# - bzzz-agent ↔ redis
# - bzzz-agent ↔ dht-bootstrap nodes
# - mcp-server ↔ bzzz-agent
# - content-resolver ↔ bzzz-agent
```
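As with the other helper scripts, `test-service-mesh.sh` is referenced rather than included. A minimal reachability sketch along these lines could back it; the service names mirror the validation list above, while the port numbers are assumptions (only the bzzz-agent port 8080 appears elsewhere in this plan) and should be adjusted to the deployed stack.

```bash
#!/bin/bash
# Illustrative service-to-service reachability check (not the actual
# test-service-mesh.sh). Ports are assumed values except 8080 (bzzz-agent).

declare -A LINKS=(
  ["bzzz-agent->postgres"]="postgres:5432"
  ["bzzz-agent->redis"]="redis:6379"
  ["bzzz-agent->dht-bootstrap"]="dht-bootstrap:4001"
  ["mcp-server->bzzz-agent"]="bzzz-agent:8080"
  ["content-resolver->bzzz-agent"]="bzzz-agent:8080"
)

FAILED=0
for link in "${!LINKS[@]}"; do
  target=${LINKS[$link]}
  host=${target%%:*}
  port=${target##*:}

  if nc -z -w 5 "$host" "$port"; then
    echo "✅ $link ($target) reachable"
  else
    echo "❌ $link ($target) unreachable"
    FAILED=1
  fi
done

exit $FAILED
```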
---

## 3. Chaos Engineering

### 3.1 Node Failure Testing

**TC-09: Single Node Failure**

```bash
#!/bin/bash
# Test system resilience to single node failure

# 1. Record baseline metrics
echo "Recording baseline metrics..."
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' > baseline_metrics.json

# 2. Identify current leader
LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')
echo "Current leader: $LEADER"

# 3. Simulate node failure
echo "Simulating failure of node: $LEADER"
docker node update --availability drain $LEADER

# 4. Monitor recovery
START_TIME=$(date +%s)
while true; do
  CURRENT_TIME=$(date +%s)
  ELAPSED=$((CURRENT_TIME - START_TIME))

  # Check if new leader elected
  NEW_LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')

  if [ "$NEW_LEADER" != "null" ] && [ "$NEW_LEADER" != "$LEADER" ]; then
    echo "✅ New leader elected: $NEW_LEADER (${ELAPSED}s)"
    break
  fi

  if [ $ELAPSED -gt 120 ]; then
    echo "❌ Leadership recovery timeout"
    exit 1
  fi

  sleep 5
done

# 5. Validate system health
sleep 30  # Allow system to stabilize
HEALTH_SCORE=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq -r '.data.result[0].value[1]')
echo "Post-failure health score: $HEALTH_SCORE"

if (( $(echo "$HEALTH_SCORE > 0.8" | bc -l) )); then
  echo "✅ System recovered successfully"
else
  echo "❌ System health degraded: $HEALTH_SCORE"
  exit 1
fi

# 6. Restore node
docker node update --availability active $LEADER
```

**TC-10: Multi-Node Cascade Failure**

```bash
# Test system resilience to cascade failures
./scripts/test-cascade-failure.sh

# Scenario: Fail 2 out of 5 nodes simultaneously
# Expected: System continues operating with degraded performance
# Expected: All critical data remains available
```

### 3.2 Network Partition Testing

**TC-11: DHT Network Partition**

```bash
#!/bin/bash
# Test DHT resilience to network partitions

# 1. Create network partition
echo "Creating network partition..."
iptables -A INPUT -s 192.168.1.72 -j DROP   # Block ironwood
iptables -A OUTPUT -d 192.168.1.72 -j DROP

# 2. Monitor DHT health
./scripts/monitor-dht-partition-recovery.sh &
MONITOR_PID=$!

# 3. Wait for partition duration
sleep 300  # 5 minute partition

# 4. Heal partition
echo "Healing network partition..."
iptables -D INPUT -s 192.168.1.72 -j DROP
iptables -D OUTPUT -d 192.168.1.72 -j DROP

# 5. Wait for recovery
sleep 180  # 3 minute recovery window

# 6. Validate recovery
kill $MONITOR_PID
./scripts/validate-dht-recovery.sh
```

### 3.3 Resource Exhaustion Testing

**TC-12: Memory Exhaustion**

```bash
# Test behavior under memory pressure
stress-ng --vm 4 --vm-bytes 75% --timeout 300s &
STRESS_PID=$!

# Monitor system behavior
./scripts/monitor-memory-exhaustion.sh

# Expected: Graceful degradation, no crashes
# Expected: Health checks detect degradation
# Expected: Alerts fired appropriately

kill $STRESS_PID
```

**TC-13: Disk Space Exhaustion**

```bash
# Test disk space exhaustion handling
dd if=/dev/zero of=/tmp/fill-disk bs=1M count=1000

# Expected: Services detect low disk space
# Expected: Appropriate cleanup mechanisms activate
# Expected: System remains operational
```

---

## 4. Performance Testing

### 4.1 Load Testing

**TC-14: Context Generation Load Test**

```bash
#!/bin/bash
# Load test context generation system

# Test configuration
CONCURRENT_USERS=50
TEST_DURATION=600  # 10 minutes
RAMP_UP_TIME=60    # 1 minute

# Run load test (ramp-up and steady state expressed as k6 stages)
k6 run \
  --stage ${RAMP_UP_TIME}s:${CONCURRENT_USERS} \
  --stage ${TEST_DURATION}s:${CONCURRENT_USERS} \
  ./scripts/load-test-context-generation.js

# Success criteria:
# - Throughput: > 10 requests/second
# - P95 latency: < 2 seconds
# - Error rate: < 1%
# - System health score: > 0.8 throughout test
```
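The final success criterion (health score above 0.8 throughout the test) can be checked in parallel with the k6 run. A minimal watcher sketch, reusing the Prometheus query and metric from TC-09; the script name and polling interval are illustrative:

```bash
#!/bin/bash
# Polls the system health score for the duration of the load test and
# fails if it ever drops below the 0.8 threshold. Run alongside k6.
# Assumes the same Prometheus endpoint and metric as TC-09.

THRESHOLD=0.8
DURATION=${1:-600}   # seconds; defaults to the TC-14 test duration
INTERVAL=15

END=$(( $(date +%s) + DURATION ))
while [ "$(date +%s)" -lt "$END" ]; do
  SCORE=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' \
    | jq -r '.data.result[0].value[1]')

  if (( $(echo "$SCORE < $THRESHOLD" | bc -l) )); then
    echo "❌ Health score dropped to $SCORE during load test"
    exit 1
  fi

  sleep $INTERVAL
done

echo "✅ Health score stayed above $THRESHOLD for the full test"
```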
**TC-15: DHT Throughput Test**

```bash
# Test DHT operation throughput
./scripts/dht-throughput-test.sh

# Test matrix:
# - PUT operations: Target 100 ops/sec
# - GET operations: Target 500 ops/sec
# - Mixed workload: 80% GET, 20% PUT
```

### 4.2 Scalability Testing

**TC-16: Horizontal Scaling Test**

```bash
#!/bin/bash
# Test horizontal scaling behavior

# Baseline measurement
echo "Recording baseline performance..."
./scripts/measure-baseline-performance.sh

# Scale up
echo "Scaling up services..."
docker service scale bzzz-v2_bzzz-agent=6
sleep 60  # Allow services to start

# Measure scaled performance
echo "Measuring scaled performance..."
./scripts/measure-scaled-performance.sh

# Validate improvements
echo "Validating scaling improvements..."
./scripts/validate-scaling-improvements.sh

# Expected: Linear improvement in throughput
# Expected: No degradation in latency
# Expected: Stable error rates
```

---

## 5. Monitoring and Alerting Validation

### 5.1 Alert Testing

**TC-17: Critical Alert Testing**

```bash
#!/bin/bash
# Test critical alert firing and resolution

ALERTS_TO_TEST=(
  "BZZZSystemHealthCritical"
  "BZZZInsufficientPeers"
  "BZZZDHTLowSuccessRate"
  "BZZZNoAdminElected"
  "BZZZTaskQueueBackup"
)

for alert in "${ALERTS_TO_TEST[@]}"; do
  echo "Testing alert: $alert"

  # Trigger condition
  ./scripts/trigger-alert-condition.sh "$alert"

  # Wait for alert to fire
  timeout 300 ./scripts/wait-for-alert.sh "$alert"
  if [ $? -eq 0 ]; then
    echo "✅ Alert $alert fired successfully"
  else
    echo "❌ Alert $alert failed to fire"
  fi

  # Resolve condition
  ./scripts/resolve-alert-condition.sh "$alert"

  # Wait for resolution
  timeout 300 ./scripts/wait-for-alert-resolution.sh "$alert"
  if [ $? -eq 0 ]; then
    echo "✅ Alert $alert resolved successfully"
  else
    echo "❌ Alert $alert failed to resolve"
  fi
done
```

### 5.2 Metrics Validation

**TC-18: Metrics Accuracy Test**

```bash
# Validate metrics accuracy against actual system state
./scripts/validate-metrics-accuracy.sh

# Test cases:
# - Connected peers count vs actual P2P connections
# - DHT operation counters vs logged operations
# - Task completion rates vs actual completions
# - Resource usage vs system measurements
```

### 5.3 Dashboard Functionality

**TC-19: Grafana Dashboard Test**

```bash
# Test all Grafana dashboards
./scripts/test-grafana-dashboards.sh

# Validation:
# - All panels load without errors
# - Data displays correctly for all time ranges
# - Drill-down functionality works
# - Alert annotations appear correctly
```

---

## 6. Disaster Recovery Testing

### 6.1 Data Recovery Testing

**TC-20: Database Recovery Test**

```bash
#!/bin/bash
# Test database backup and recovery procedures

# 1. Create test data
echo "Creating test data..."
./scripts/create-test-data.sh

# 2. Perform backup
echo "Creating backup..."
./scripts/backup-database.sh

# 3. Simulate data loss
echo "Simulating data loss..."
docker service scale bzzz-v2_postgres=0
docker volume rm bzzz-v2_postgres_data

# 4. Restore from backup
echo "Restoring from backup..."
./scripts/restore-database.sh

# 5. Validate data integrity
echo "Validating data integrity..."
./scripts/validate-restored-data.sh

# Expected: 100% data recovery
# Expected: All relationships intact
# Expected: System fully operational
```

### 6.2 Configuration Recovery

**TC-21: Configuration Disaster Recovery**

```bash
# Test recovery of all system configurations
./scripts/test-configuration-recovery.sh

# Test scenarios:
# - Docker secrets loss and recovery
# - Docker configs corruption and recovery
# - Service definition recovery
# - Network configuration recovery
```

### 6.3 Full System Recovery

**TC-22: Complete Infrastructure Recovery**

```bash
#!/bin/bash
# Test complete system recovery from scratch

# 1. Document current state
echo "Documenting current system state..."
./scripts/document-system-state.sh > pre-disaster-state.json

# 2. Simulate complete infrastructure loss
echo "Simulating infrastructure disaster..."
docker stack rm bzzz-v2
docker system prune -f --volumes

# 3. Recover infrastructure
echo "Recovering infrastructure..."
./scripts/deploy-from-scratch.sh

# 4. Validate recovery
echo "Validating recovery..."
./scripts/validate-complete-recovery.sh pre-disaster-state.json

# Success criteria:
# - All services operational within 15 minutes
# - All data recovered correctly
# - System health score > 0.9
# - All integrations functional
```

---

## Test Execution Framework

### Automated Test Runner

```bash
#!/bin/bash
# Main test execution script

TEST_SUITE=${1:-"all"}
ENVIRONMENT=${2:-"staging"}

echo "Running BZZZ reliability tests..."
echo "Suite: $TEST_SUITE"
echo "Environment: $ENVIRONMENT"

# Setup test environment
./scripts/setup-test-environment.sh $ENVIRONMENT

# Run test suites
case $TEST_SUITE in
  "health")
    ./scripts/run-health-tests.sh
    ;;
  "integration")
    ./scripts/run-integration-tests.sh
    ;;
  "chaos")
    ./scripts/run-chaos-tests.sh
    ;;
  "performance")
    ./scripts/run-performance-tests.sh
    ;;
  "monitoring")
    ./scripts/run-monitoring-tests.sh
    ;;
  "disaster-recovery")
    ./scripts/run-disaster-recovery-tests.sh
    ;;
  "all")
    ./scripts/run-all-tests.sh
    ;;
  *)
    echo "Unknown test suite: $TEST_SUITE"
    exit 1
    ;;
esac

# Generate test report
./scripts/generate-test-report.sh

echo "Test execution completed."
```
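Typical invocations, assuming the runner is saved as `run-tests.sh` under `infrastructure/testing/` (the path used by the CI pipeline below):

```bash
# Run only the health-check suite against staging
./infrastructure/testing/run-tests.sh health staging

# Run chaos experiments against staging (manual trigger only)
./infrastructure/testing/run-tests.sh chaos staging

# Run the full suite against the default environment (staging)
./infrastructure/testing/run-tests.sh all
```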
### Test Environment Setup

```yaml
# test-environment.yml
version: '3.8'

services:
  # Staging environment with reduced resource requirements
  bzzz-agent-test:
    image: registry.home.deepblack.cloud/bzzz:test-latest
    environment:
      - LOG_LEVEL=debug
      - TEST_MODE=true
      - METRICS_ENABLED=true
    networks:
      - test-network
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 1G
          cpus: '0.5'

  # Test data generator
  test-data-generator:
    image: registry.home.deepblack.cloud/bzzz-test-generator:latest
    environment:
      - TARGET_ENDPOINT=http://bzzz-agent-test:9000
      - DATA_VOLUME=medium
    networks:
      - test-network

networks:
  test-network:
    driver: overlay
```

### Continuous Testing Pipeline

```yaml
# .github/workflows/reliability-testing.yml
name: BZZZ Reliability Testing

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM
  workflow_dispatch:

jobs:
  health-tests:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - name: Run Health Tests
        run: ./infrastructure/testing/run-tests.sh health staging

  performance-tests:
    runs-on: self-hosted
    needs: health-tests
    steps:
      - name: Run Performance Tests
        run: ./infrastructure/testing/run-tests.sh performance staging

  chaos-tests:
    runs-on: self-hosted
    needs: health-tests
    if: github.event_name == 'workflow_dispatch'
    steps:
      - name: Run Chaos Tests
        run: ./infrastructure/testing/run-tests.sh chaos staging
```

---

## Success Criteria

### Overall System Reliability Targets

- **Availability SLO**: 99.9% uptime
- **Performance SLO**:
  - Context generation: p95 < 2 seconds
  - DHT operations: p95 < 300ms
  - P2P messaging: p95 < 500ms
- **Error Rate SLO**: < 0.1% for all operations
- **Recovery Time Objective (RTO)**: < 15 minutes
- **Recovery Point Objective (RPO)**: < 5 minutes

### Test Pass Criteria

- **Health Tests**: 100% of health checks function correctly
- **Integration Tests**: 95% pass rate for all integration scenarios
- **Chaos Tests**: System recovers within SLO targets for all failure scenarios
- **Performance Tests**: All performance metrics meet SLO targets under load
- **Monitoring Tests**: 100% of alerts fire and resolve correctly
- **Disaster Recovery**: Complete system recovery within RTO/RPO targets

### Continuous Monitoring

- Daily automated health and integration tests
- Weekly performance regression testing
- Monthly chaos engineering exercises
- Quarterly disaster recovery drills

---

## Test Reporting and Documentation

### Test Results Dashboard

- Real-time test execution status
- Historical test results and trends
- Performance benchmarks over time
- Failure analysis and remediation tracking

### Test Documentation

- Detailed test procedures and scripts
- Failure scenarios and response procedures
- Performance baselines and regression analysis
- Disaster recovery validation reports

This comprehensive testing plan ensures that all infrastructure enhancements are thoroughly validated and that the system meets its reliability and performance objectives.