
BZZZ Infrastructure Reliability Testing Plan

Overview

This document defines the testing procedures used to validate the reliability, performance, and operational readiness of the infrastructure enhancements to the BZZZ distributed system.

Test Categories

1. Component Health Testing
2. Integration Testing
3. Chaos Engineering
4. Performance Testing
5. Monitoring and Alerting Validation
6. Disaster Recovery Testing


1. Component Health Testing

1.1 Enhanced Health Checks Validation

Objective: Verify that the enhanced health check implementations work correctly.

Test Cases

TC-01: PubSub Health Probes

# Test PubSub round-trip functionality
curl -X POST http://bzzz-agent:8080/test/pubsub-health \
  -H "Content-Type: application/json" \
  -d '{"test_duration": "30s", "message_count": 100}'

# Expected: Success rate > 99%, latency < 100ms

TC-02: DHT Health Probes

# Test DHT put/get operations
curl -X POST http://bzzz-agent:8080/test/dht-health \
  -H "Content-Type: application/json" \
  -d '{"test_duration": "60s", "operation_count": 50}'

# Expected: Success rate > 99%, p95 latency < 300ms

TC-03: Election Health Monitoring

# Test election stability
curl -X GET http://bzzz-agent:8080/health/checks | jq '.checks["election-health"]'

# Trigger controlled election
curl -X POST http://bzzz-agent:8080/admin/trigger-election

# Expected: Stable admin election within 30 seconds

Validation Criteria

  • All health checks report accurate status
  • Health check latencies are within SLO thresholds
  • Failed health checks trigger appropriate alerts
  • Health history is properly maintained
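
These criteria can be spot-checked directly against the health endpoint used in TC-03. The following is a minimal sketch; the response shape (a `checks` map with `status` and `latency_ms` fields) is an assumption and should be adjusted to the actual payload.

#!/bin/bash
# Sketch: spot-check health statuses and latencies against an SLO threshold.
# Assumes /health/checks returns {"checks": {"<name>": {"status": ..., "latency_ms": ...}}}.

ENDPOINT="http://bzzz-agent:8080/health/checks"
MAX_LATENCY_MS=300

curl -s "$ENDPOINT" \
  | jq -r '.checks | to_entries[] | "\(.key) \(.value.status) \(.value.latency_ms)"' \
  | while read -r name status latency_ms; do
      if [ "$status" != "healthy" ]; then
        echo "❌ $name reports status: $status"
      elif [ "${latency_ms%.*}" -gt "$MAX_LATENCY_MS" ]; then
        echo "❌ $name latency ${latency_ms}ms exceeds ${MAX_LATENCY_MS}ms"
      else
        echo "✅ $name healthy (${latency_ms}ms)"
      fi
    done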

1.2 SLURP Leadership Health Testing

TC-04: Leadership Transition Health

# Test leadership transition health
./scripts/test-leadership-transition.sh

# Expected outcomes:
# - Clean leadership transitions
# - No dropped tasks during transition
# - Health scores remain above 0.8 during the transition

TC-05: Degraded Leader Detection

# Simulate resource exhaustion
docker service update --limit-memory 512M bzzz-v2_bzzz-agent

# Expected: Transition to degraded leader state within 2 minutes
# Expected: Health alerts fired appropriately
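
A minimal polling sketch for the expected transition is shown below; the `.leader_state` field is an assumption about the election status payload and should be replaced with whatever field the API actually exposes.

#!/bin/bash
# Sketch: poll election status until the leader reports a degraded state (2 minute budget).
# The ".leader_state" field is assumed; substitute the actual field name.

DEADLINE=$(( $(date +%s) + 120 ))

while [ "$(date +%s)" -lt "$DEADLINE" ]; do
  STATE=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.leader_state')
  echo "Leader state: $STATE"

  if [ "$STATE" = "degraded" ]; then
    echo "✅ Degraded leader detected within budget"
    exit 0
  fi

  sleep 10
done

echo "❌ Leader did not enter degraded state within 2 minutes"
exit 1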

2. Integration Testing

2.1 End-to-End System Testing

TC-06: Complete Task Lifecycle

#!/bin/bash
# Test complete task flow from submission to completion

# 1. Submit context generation task
TASK_ID=$(curl -s -X POST http://bzzz.deepblack.cloud/api/slurp/generate \
  -H "Content-Type: application/json" \
  -d '{
    "ucxl_address": "ucxl://test/document.md",
    "role": "test_analyst",
    "priority": "high"
  }' | jq -r '.task_id')

echo "Task submitted: $TASK_ID"

# 2. Monitor task progress
while true; do
  STATUS=$(curl -s http://bzzz.deepblack.cloud/api/slurp/status/$TASK_ID | jq -r '.status')
  echo "Task status: $STATUS"
  
  if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
    break
  fi
  
  sleep 5
done

# 3. Validate results
if [ "$STATUS" = "completed" ]; then
  echo "✅ Task completed successfully"
  RESULT=$(curl -s http://bzzz.deepblack.cloud/api/slurp/result/$TASK_ID)
  echo "Result size: $(echo $RESULT | jq -r '.content | length')"
else
  echo "❌ Task failed"
  exit 1
fi

TC-07: Multi-Node Coordination

# Test coordination across cluster nodes
./scripts/test-multi-node-coordination.sh

# Test matrix:
# - Task submission on node A, execution on node B
# - DHT storage on node A, retrieval on node C
# - Election on mixed node topology
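
As a sketch of the first matrix entry, the snippet below submits a task against one node and checks where it executed. The node hostname and the `.assigned_node` status field are assumptions for illustration only.

#!/bin/bash
# Sketch: submit on node A and confirm execution landed on a different node.
# "node-a" and the ".assigned_node" status field are placeholders/assumptions.

TASK_ID=$(curl -s -X POST http://node-a:8080/api/slurp/generate \
  -H "Content-Type: application/json" \
  -d '{"ucxl_address": "ucxl://test/coordination.md", "role": "test_analyst", "priority": "high"}' \
  | jq -r '.task_id')

sleep 10

ASSIGNED=$(curl -s http://node-a:8080/api/slurp/status/$TASK_ID | jq -r '.assigned_node')
echo "Task $TASK_ID assigned to: $ASSIGNED"

if [ -n "$ASSIGNED" ] && [ "$ASSIGNED" != "node-a" ]; then
  echo "✅ Cross-node scheduling observed"
else
  echo "⚠️ Task did not land on a remote node; check scheduler placement"
fi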

2.2 Inter-Service Communication Testing

TC-08: Service Mesh Validation

# Test all service-to-service communications
./scripts/test-service-mesh.sh

# Validate:
# - bzzz-agent ↔ postgres
# - bzzz-agent ↔ redis
# - bzzz-agent ↔ dht-bootstrap nodes
# - mcp-server ↔ bzzz-agent
# - content-resolver ↔ bzzz-agent
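
The per-link check reduces to simple reachability tests run from a container attached to the stack's overlay network. A sketch follows; the hostnames and ports are assumptions based on the stack's service names and standard defaults.

#!/bin/bash
# Sketch of the per-link reachability check; hostnames and ports are assumptions
# and should match the services on the stack's overlay network.

LINKS=(
  "postgres:5432"
  "redis:6379"
  "dht-bootstrap:4001"
  "mcp-server:3000"
  "content-resolver:8080"
)

for link in "${LINKS[@]}"; do
  host=${link%%:*}
  port=${link##*:}
  if nc -z -w 5 "$host" "$port"; then
    echo "✅ bzzz-agent → $host:$port reachable"
  else
    echo "❌ bzzz-agent → $host:$port unreachable"
  fi
done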

3. Chaos Engineering

3.1 Node Failure Testing

TC-09: Single Node Failure

#!/bin/bash
# Test system resilience to single node failure

# 1. Record baseline metrics
echo "Recording baseline metrics..."
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' > baseline_metrics.json

# 2. Identify current leader
LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')
echo "Current leader: $LEADER"

# 3. Simulate node failure
echo "Simulating failure of node: $LEADER"
docker node update --availability drain $LEADER

# 4. Monitor recovery
START_TIME=$(date +%s)
while true; do
  CURRENT_TIME=$(date +%s)
  ELAPSED=$((CURRENT_TIME - START_TIME))
  
  # Check if new leader elected
  NEW_LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')
  
  if [ "$NEW_LEADER" != "null" ] && [ "$NEW_LEADER" != "$LEADER" ]; then
    echo "✅ New leader elected: $NEW_LEADER (${ELAPSED}s)"
    break
  fi
  
  if [ $ELAPSED -gt 120 ]; then
    echo "❌ Leadership recovery timeout"
    exit 1
  fi
  
  sleep 5
done

# 5. Validate system health
sleep 30  # Allow system to stabilize
HEALTH_SCORE=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq -r '.data.result[0].value[1]')
echo "Post-failure health score: $HEALTH_SCORE"

if (( $(echo "$HEALTH_SCORE > 0.8" | bc -l) )); then
  echo "✅ System recovered successfully"
else
  echo "❌ System health degraded: $HEALTH_SCORE"
  exit 1
fi

# 6. Restore node
docker node update --availability active $LEADER

TC-10: Multi-Node Cascade Failure

# Test system resilience to cascade failures
./scripts/test-cascade-failure.sh

# Scenario: Fail 2 out of 5 nodes simultaneously
# Expected: System continues operating with degraded performance
# Expected: All critical data remains available
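
A minimal sketch of the failure trigger, following the drain approach from TC-09. Node names are placeholders; choose two of the five nodes from `docker node ls`, avoiding the current leader if the scenario calls for it.

#!/bin/bash
# Sketch: drain two nodes at once, then restore them after the outage window.

NODES_TO_FAIL=("node-b" "node-c")

for node in "${NODES_TO_FAIL[@]}"; do
  echo "Draining node: $node"
  docker node update --availability drain "$node"
done

# Observe system behaviour during the outage window, then restore both nodes.
sleep 300

for node in "${NODES_TO_FAIL[@]}"; do
  echo "Restoring node: $node"
  docker node update --availability active "$node"
done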

3.2 Network Partition Testing

TC-11: DHT Network Partition

#!/bin/bash
# Test DHT resilience to network partitions

# 1. Create network partition
echo "Creating network partition..."
iptables -A INPUT -s 192.168.1.72 -j DROP  # Block ironwood
iptables -A OUTPUT -d 192.168.1.72 -j DROP

# 2. Monitor DHT health
./scripts/monitor-dht-partition-recovery.sh &
MONITOR_PID=$!

# 3. Wait for partition duration
sleep 300  # 5 minute partition

# 4. Heal partition
echo "Healing network partition..."
iptables -D INPUT -s 192.168.1.72 -j DROP
iptables -D OUTPUT -d 192.168.1.72 -j DROP

# 5. Wait for recovery
sleep 180  # 3 minute recovery window

# 6. Validate recovery
kill $MONITOR_PID
./scripts/validate-dht-recovery.sh

3.3 Resource Exhaustion Testing

TC-12: Memory Exhaustion

# Test behavior under memory pressure
stress-ng --vm 4 --vm-bytes 75% --timeout 300s &
STRESS_PID=$!

# Monitor system behavior
./scripts/monitor-memory-exhaustion.sh

# Expected: Graceful degradation, no crashes
# Expected: Health checks detect degradation
# Expected: Alerts fired appropriately

kill $STRESS_PID

TC-13: Disk Space Exhaustion

# Test disk space exhaustion handling
dd if=/dev/zero of=/tmp/fill-disk bs=1M count=1000

# Expected: Services detect low disk space
# Expected: Appropriate cleanup mechanisms activate
# Expected: System remains operational

4. Performance Testing

4.1 Load Testing

TC-14: Context Generation Load Test

#!/bin/bash
# Load test context generation system

# Test configuration
CONCURRENT_USERS=50
TEST_DURATION=600  # 10 minutes
RAMP_UP_TIME=60    # 1 minute

# Run load test (first stage ramps VUs up, second stage holds for the test duration)
k6 run --stage ${RAMP_UP_TIME}s:${CONCURRENT_USERS} \
       --stage ${TEST_DURATION}s:${CONCURRENT_USERS} \
       ./scripts/load-test-context-generation.js

# Success criteria:
# - Throughput: > 10 requests/second
# - P95 latency: < 2 seconds
# - Error rate: < 1%
# - System health score: > 0.8 throughout test

TC-15: DHT Throughput Test

# Test DHT operation throughput
./scripts/dht-throughput-test.sh

# Test matrix:
# - PUT operations: Target 100 ops/sec
# - GET operations: Target 500 ops/sec
# - Mixed workload: 80% GET, 20% PUT
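
For a rough ops/sec estimate outside the dedicated script, the `/test/dht-health` probe from TC-02 can be timed directly. This sketch assumes the probe executes the requested operations synchronously; the script above remains the source of truth for the full mixed-workload matrix.

#!/bin/bash
# Sketch: rough DHT throughput estimate via the /test/dht-health probe.

OPERATION_COUNT=500

START=$(date +%s)
curl -s -X POST http://bzzz-agent:8080/test/dht-health \
  -H "Content-Type: application/json" \
  -d "{\"test_duration\": \"120s\", \"operation_count\": $OPERATION_COUNT}" > /dev/null
END=$(date +%s)

ELAPSED=$((END - START))
[ "$ELAPSED" -eq 0 ] && ELAPSED=1
echo "Approximate DHT throughput: $((OPERATION_COUNT / ELAPSED)) ops/sec over ${ELAPSED}s"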

4.2 Scalability Testing

TC-16: Horizontal Scaling Test

#!/bin/bash
# Test horizontal scaling behavior

# Baseline measurement
echo "Recording baseline performance..."
./scripts/measure-baseline-performance.sh

# Scale up
echo "Scaling up services..."
docker service scale bzzz-v2_bzzz-agent=6
sleep 60  # Allow services to start

# Measure scaled performance
echo "Measuring scaled performance..."
./scripts/measure-scaled-performance.sh

# Validate improvements
echo "Validating scaling improvements..."
./scripts/validate-scaling-improvements.sh

# Expected: Linear improvement in throughput
# Expected: No degradation in latency
# Expected: Stable error rates

5. Monitoring and Alerting Validation

5.1 Alert Testing

TC-17: Critical Alert Testing

#!/bin/bash
# Test critical alert firing and resolution

ALERTS_TO_TEST=(
  "BZZZSystemHealthCritical"
  "BZZZInsufficientPeers" 
  "BZZZDHTLowSuccessRate"
  "BZZZNoAdminElected"
  "BZZZTaskQueueBackup"
)

for alert in "${ALERTS_TO_TEST[@]}"; do
  echo "Testing alert: $alert"
  
  # Trigger condition
  ./scripts/trigger-alert-condition.sh "$alert"
  
  # Wait for alert
  timeout 300 ./scripts/wait-for-alert.sh "$alert"
  if [ $? -eq 0 ]; then
    echo "✅ Alert $alert fired successfully"
  else
    echo "❌ Alert $alert failed to fire"
  fi
  
  # Resolve condition
  ./scripts/resolve-alert-condition.sh "$alert"
  
  # Wait for resolution
  timeout 300 ./scripts/wait-for-alert-resolution.sh "$alert"
  if [ $? -eq 0 ]; then
    echo "✅ Alert $alert resolved successfully"
  else
    echo "❌ Alert $alert failed to resolve"
  fi
done

5.2 Metrics Validation

TC-18: Metrics Accuracy Test

# Validate metrics accuracy against actual system state
./scripts/validate-metrics-accuracy.sh

# Test cases:
# - Connected peers count vs actual P2P connections
# - DHT operation counters vs logged operations  
# - Task completion rates vs actual completions
# - Resource usage vs system measurements
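
A sketch of one such check, comparing the connected-peers metric with the agent's own peer list. Both the metric name (`bzzz_connected_peers`) and the peers endpoint are assumptions; substitute the names the agent actually exports.

#!/bin/bash
# Sketch: compare the peer-count metric in Prometheus with the agent's peer list.

METRIC=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_connected_peers' \
  | jq -r '.data.result[0].value[1]')
ACTUAL=$(curl -s http://bzzz-agent:8080/api/p2p/peers | jq 'length')

echo "Prometheus reports: $METRIC peers, agent reports: $ACTUAL peers"

if [ "${METRIC%.*}" -eq "$ACTUAL" ]; then
  echo "✅ Peer count metric matches actual P2P connections"
else
  echo "❌ Peer count mismatch between metrics and agent state"
fi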

5.3 Dashboard Functionality

TC-19: Grafana Dashboard Test

# Test all Grafana dashboards
./scripts/test-grafana-dashboards.sh

# Validation:
# - All panels load without errors
# - Data displays correctly for all time ranges
# - Drill-down functionality works
# - Alert annotations appear correctly
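
A basic "panels load" smoke test can be driven through the Grafana HTTP API, as in the sketch below. `GRAFANA_TOKEN` is a placeholder for a viewer-scoped API token; full rendering and drill-down checks still need the dedicated script.

#!/bin/bash
# Sketch: verify every dashboard is retrievable through the Grafana HTTP API.

GRAFANA_URL=${GRAFANA_URL:-"http://grafana:3000"}
AUTH="Authorization: Bearer $GRAFANA_TOKEN"

curl -s -H "$AUTH" "$GRAFANA_URL/api/search?type=dash-db" | jq -r '.[].uid' \
  | while read -r uid; do
      TITLE=$(curl -s -H "$AUTH" "$GRAFANA_URL/api/dashboards/uid/$uid" | jq -r '.dashboard.title')
      if [ -n "$TITLE" ] && [ "$TITLE" != "null" ]; then
        echo "✅ Dashboard loads: $TITLE ($uid)"
      else
        echo "❌ Dashboard $uid failed to load"
      fi
    done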

6. Disaster Recovery Testing

6.1 Data Recovery Testing

TC-20: Database Recovery Test

#!/bin/bash
# Test database backup and recovery procedures

# 1. Create test data
echo "Creating test data..."
./scripts/create-test-data.sh

# 2. Perform backup
echo "Creating backup..."
./scripts/backup-database.sh

# 3. Simulate data loss
echo "Simulating data loss..."
docker service scale bzzz-v2_postgres=0
docker volume rm bzzz-v2_postgres_data

# 4. Restore from backup
echo "Restoring from backup..."
./scripts/restore-database.sh

# 5. Validate data integrity
echo "Validating data integrity..."
./scripts/validate-restored-data.sh

# Expected: 100% data recovery
# Expected: All relationships intact
# Expected: System fully operational

6.2 Configuration Recovery

TC-21: Configuration Disaster Recovery

# Test recovery of all system configurations
./scripts/test-configuration-recovery.sh

# Test scenarios:
# - Docker secrets loss and recovery
# - Docker configs corruption and recovery
# - Service definition recovery
# - Network configuration recovery
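
A minimal sketch of the configuration snapshot the recovery script depends on. Secret payloads cannot be read back from Swarm, so they must be re-created from their original source during recovery; only the names are recorded here.

#!/bin/bash
# Sketch: snapshot Swarm-level configuration for the bzzz-v2 stack.

BACKUP_DIR="config-backup-$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Service definitions for the bzzz-v2 stack
docker service ls --filter label=com.docker.stack.namespace=bzzz-v2 -q \
  | xargs docker service inspect > "$BACKUP_DIR/services.json"

# Config and secret inventories (names only; secret payloads are not retrievable)
docker config ls --format '{{.Name}}' > "$BACKUP_DIR/config-names.txt"
docker secret ls --format '{{.Name}}' > "$BACKUP_DIR/secret-names.txt"

# Overlay network definitions
docker network ls --filter driver=overlay -q \
  | xargs docker network inspect > "$BACKUP_DIR/networks.json"

echo "Configuration snapshot written to $BACKUP_DIR"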

6.3 Full System Recovery

TC-22: Complete Infrastructure Recovery

#!/bin/bash
# Test complete system recovery from scratch

# 1. Document current state
echo "Documenting current system state..."
./scripts/document-system-state.sh > pre-disaster-state.json

# 2. Simulate complete infrastructure loss
echo "Simulating infrastructure disaster..."
docker stack rm bzzz-v2
docker system prune -f --volumes

# 3. Recover infrastructure
echo "Recovering infrastructure..."
./scripts/deploy-from-scratch.sh

# 4. Validate recovery
echo "Validating recovery..."
./scripts/validate-complete-recovery.sh pre-disaster-state.json

# Success criteria:
# - All services operational within 15 minutes
# - All data recovered correctly  
# - System health score > 0.9
# - All integrations functional

Test Execution Framework

Automated Test Runner

#!/bin/bash
# Main test execution script

TEST_SUITE=${1:-"all"}
ENVIRONMENT=${2:-"staging"}

echo "Running BZZZ reliability tests..."
echo "Suite: $TEST_SUITE"
echo "Environment: $ENVIRONMENT"

# Setup test environment
./scripts/setup-test-environment.sh $ENVIRONMENT

# Run test suites
case $TEST_SUITE in
  "health")
    ./scripts/run-health-tests.sh
    ;;
  "integration") 
    ./scripts/run-integration-tests.sh
    ;;
  "chaos")
    ./scripts/run-chaos-tests.sh
    ;;
  "performance")
    ./scripts/run-performance-tests.sh
    ;;
  "monitoring")
    ./scripts/run-monitoring-tests.sh
    ;;
  "disaster-recovery")
    ./scripts/run-disaster-recovery-tests.sh
    ;;
  "all")
    ./scripts/run-all-tests.sh
    ;;
  *)
    echo "Unknown test suite: $TEST_SUITE"
    exit 1
    ;;
esac

# Generate test report
./scripts/generate-test-report.sh

echo "Test execution completed."

Test Environment Setup

# test-environment.yml
version: '3.8'

services:
  # Staging environment with reduced resource requirements
  bzzz-agent-test:
    image: registry.home.deepblack.cloud/bzzz:test-latest
    environment:
      - LOG_LEVEL=debug
      - TEST_MODE=true
      - METRICS_ENABLED=true
    networks:
      - test-network
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 1G
          cpus: '0.5'

  # Test data generator
  test-data-generator:
    image: registry.home.deepblack.cloud/bzzz-test-generator:latest
    environment:
      - TARGET_ENDPOINT=http://bzzz-agent-test:9000
      - DATA_VOLUME=medium
    networks:
      - test-network

networks:
  test-network:
    driver: overlay

Continuous Testing Pipeline

# .github/workflows/reliability-testing.yml
name: BZZZ Reliability Testing

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM
  workflow_dispatch:

jobs:
  health-tests:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - name: Run Health Tests
        run: ./infrastructure/testing/run-tests.sh health staging

  performance-tests:
    runs-on: self-hosted
    needs: health-tests
    steps:
      - name: Run Performance Tests
        run: ./infrastructure/testing/run-tests.sh performance staging

  chaos-tests:
    runs-on: self-hosted
    needs: health-tests
    if: github.event_name == 'workflow_dispatch'
    steps:
      - name: Run Chaos Tests
        run: ./infrastructure/testing/run-tests.sh chaos staging

Success Criteria

Overall System Reliability Targets

  • Availability SLO: 99.9% uptime
  • Performance SLO:
    • Context generation: p95 < 2 seconds
    • DHT operations: p95 < 300ms
    • P2P messaging: p95 < 500ms
  • Error Rate SLO: < 0.1% for all operations
  • Recovery Time Objective (RTO): < 15 minutes
  • Recovery Point Objective (RPO): < 5 minutes
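
The latency SLOs above can be evaluated directly with PromQL. The sketch below queries the Prometheus API for the relevant p95 values; the histogram metric names are assumptions and should be replaced with the metrics actually exported.

#!/bin/bash
# Sketch: read p95 latencies for the SLO targets from Prometheus.

PROM="http://prometheus:9090/api/v1/query"

# Context generation p95 (target: < 2s)
curl -sG "$PROM" --data-urlencode \
  'query=histogram_quantile(0.95, sum(rate(bzzz_context_generation_duration_seconds_bucket[5m])) by (le))' \
  | jq -r '.data.result[0].value[1]'

# DHT operations p95 (target: < 0.3s)
curl -sG "$PROM" --data-urlencode \
  'query=histogram_quantile(0.95, sum(rate(bzzz_dht_operation_duration_seconds_bucket[5m])) by (le))' \
  | jq -r '.data.result[0].value[1]'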

Test Pass Criteria

  • Health Tests: 100% of health checks function correctly
  • Integration Tests: 95% pass rate for all integration scenarios
  • Chaos Tests: System recovers within SLO targets for all failure scenarios
  • Performance Tests: All performance metrics meet SLO targets under load
  • Monitoring Tests: 100% of alerts fire and resolve correctly
  • Disaster Recovery: Complete system recovery within RTO/RPO targets

Continuous Monitoring

  • Daily automated health and integration tests
  • Weekly performance regression testing
  • Monthly chaos engineering exercises
  • Quarterly disaster recovery drills

Test Reporting and Documentation

Test Results Dashboard

  • Real-time test execution status
  • Historical test results and trends
  • Performance benchmarks over time
  • Failure analysis and remediation tracking

Test Documentation

  • Detailed test procedures and scripts
  • Failure scenarios and response procedures
  • Performance baselines and regression analysis
  • Disaster recovery validation reports

This testing plan ensures that all infrastructure enhancements are thoroughly validated and that the system meets its reliability and performance objectives.