# BZZZ Infrastructure Reliability Testing Plan
## Overview
This document outlines comprehensive testing procedures to validate the reliability, performance, and operational readiness of the BZZZ distributed system infrastructure enhancements.
## Test Categories
1. Component Health Testing
2. Integration Testing
3. Chaos Engineering
4. Performance Testing
5. Monitoring and Alerting Validation
6. Disaster Recovery Testing
---
## 1. Component Health Testing
### 1.1 Enhanced Health Checks Validation
**Objective**: Verify enhanced health check implementations work correctly.
#### Test Cases
**TC-01: PubSub Health Probes**
```bash
# Test PubSub round-trip functionality
curl -X POST http://bzzz-agent:8080/test/pubsub-health \
  -H "Content-Type: application/json" \
  -d '{"test_duration": "30s", "message_count": 100}'
# Expected: Success rate > 99%, latency < 100ms
```
**TC-02: DHT Health Probes**
```bash
# Test DHT put/get operations
curl -X POST http://bzzz-agent:8080/test/dht-health \
  -H "Content-Type: application/json" \
  -d '{"test_duration": "60s", "operation_count": 50}'
# Expected: Success rate > 99%, p95 latency < 300ms
```
**TC-03: Election Health Monitoring**
```bash
# Test election stability
curl -X GET http://bzzz-agent:8080/health/checks | jq '.checks["election-health"]'
# Trigger controlled election
curl -X POST http://bzzz-agent:8080/admin/trigger-election
# Expected: Stable admin election within 30 seconds
```
#### Validation Criteria
- [ ] All health checks report accurate status
- [ ] Health check latencies are within SLO thresholds
- [ ] Failed health checks trigger appropriate alerts
- [ ] Health history is properly maintained
### 1.2 SLURP Leadership Health Testing
**TC-04: Leadership Transition Health**
```bash
# Test leadership transition health
./scripts/test-leadership-transition.sh
# Expected outcomes:
# - Clean leadership transitions
# - No dropped tasks during transition
# - Health scores maintain > 0.8 during transition
```
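The transition script itself is not shown here; a minimal sketch of what `test-leadership-transition.sh` might check, reusing the election trigger endpoint from TC-03, the election status API from TC-09, and the `bzzz_system_health_score` metric from TC-09, could look like this (thresholds mirror the expected outcomes above):
```bash
#!/bin/bash
# Illustrative sketch of test-leadership-transition.sh (not the shipped script).
STATUS_URL=http://bzzz.deepblack.cloud/api/election/status
PROM_QUERY='http://prometheus:9090/api/v1/query?query=bzzz_system_health_score'

# Trigger a controlled election (same endpoint as TC-03)
curl -s -X POST http://bzzz-agent:8080/admin/trigger-election > /dev/null

# Expect a stable admin within 30 seconds
ADMIN="null"
for i in $(seq 1 30); do
  ADMIN=$(curl -s "$STATUS_URL" | jq -r '.current_admin')
  if [ "$ADMIN" != "null" ] && [ -n "$ADMIN" ]; then
    echo "✅ Admin elected after ${i}s: $ADMIN"
    break
  fi
  sleep 1
done
[ "$ADMIN" != "null" ] || { echo "❌ No admin within 30s"; exit 1; }

# Health score should stay above 0.8 through the transition
SCORE=$(curl -s "$PROM_QUERY" | jq -r '.data.result[0].value[1]')
echo "Health score during transition: $SCORE"
(( $(echo "$SCORE > 0.8" | bc -l) )) || { echo "❌ Health degraded"; exit 1; }
```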
**TC-05: Degraded Leader Detection**
```bash
# Simulate resource exhaustion
docker service update --limit-memory 512M bzzz-v2_bzzz-agent
# Expected: Transition to degraded leader state within 2 minutes
# Expected: Health alerts fired appropriately
```
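A hedged sketch of the detection check follows; it assumes the election status API exposes a `leader_state` field (hypothetical field name) and that the original memory limit is restored afterwards (limit value assumed):
```bash
#!/bin/bash
# Hypothetical sketch: poll until the leader reports a degraded state.
# The 'leader_state' field name is an assumption, not a confirmed API field.
STATUS_URL=http://bzzz.deepblack.cloud/api/election/status
START=$(date +%s)
while true; do
  STATE=$(curl -s "$STATUS_URL" | jq -r '.leader_state')
  ELAPSED=$(( $(date +%s) - START ))
  if [ "$STATE" = "degraded" ]; then
    echo "✅ Degraded leader detected after ${ELAPSED}s"
    break
  fi
  if [ "$ELAPSED" -gt 120 ]; then
    echo "❌ Degraded state not detected within 2 minutes"
    exit 1
  fi
  sleep 5
done
# Restore the service memory limit (original value assumed to be 2G)
docker service update --limit-memory 2G bzzz-v2_bzzz-agent
```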
---
## 2. Integration Testing
### 2.1 End-to-End System Testing
**TC-06: Complete Task Lifecycle**
```bash
#!/bin/bash
# Test complete task flow from submission to completion
# 1. Submit context generation task
TASK_ID=$(curl -s -X POST http://bzzz.deepblack.cloud/api/slurp/generate \
  -H "Content-Type: application/json" \
  -d '{
        "ucxl_address": "ucxl://test/document.md",
        "role": "test_analyst",
        "priority": "high"
      }' | jq -r '.task_id')
echo "Task submitted: $TASK_ID"

# 2. Monitor task progress
while true; do
  STATUS=$(curl -s "http://bzzz.deepblack.cloud/api/slurp/status/$TASK_ID" | jq -r '.status')
  echo "Task status: $STATUS"
  if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
    break
  fi
  sleep 5
done

# 3. Validate results
if [ "$STATUS" = "completed" ]; then
  echo "✅ Task completed successfully"
  RESULT=$(curl -s "http://bzzz.deepblack.cloud/api/slurp/result/$TASK_ID")
  echo "Result size: $(echo "$RESULT" | jq -r '.content | length')"
else
  echo "❌ Task failed"
  exit 1
fi
```
**TC-07: Multi-Node Coordination**
```bash
# Test coordination across cluster nodes
./scripts/test-multi-node-coordination.sh
# Test matrix:
# - Task submission on node A, execution on node B
# - DHT storage on node A, retrieval on node C
# - Election on mixed node topology
```
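One matrix entry, sketched; the node URL and the `assigned_node` response field are assumptions, not confirmed API details:
```bash
#!/bin/bash
# Hypothetical sketch: submit on node A and confirm the task ran elsewhere.
NODE_A=http://node-a:8080   # placeholder hostname
TASK_ID=$(curl -s -X POST "$NODE_A/api/slurp/generate" \
  -H "Content-Type: application/json" \
  -d '{"ucxl_address": "ucxl://test/coordination.md", "role": "test_analyst", "priority": "high"}' \
  | jq -r '.task_id')

sleep 10
ASSIGNED=$(curl -s "$NODE_A/api/slurp/status/$TASK_ID" | jq -r '.assigned_node')   # assumed field
echo "Task $TASK_ID assigned to: $ASSIGNED"
if [ -n "$ASSIGNED" ] && [ "$ASSIGNED" != "node-a" ]; then
  echo "✅ Cross-node execution confirmed"
else
  echo "❌ Task stayed on the submitting node (or assignment unknown)"
  exit 1
fi
```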
### 2.2 Inter-Service Communication Testing
**TC-08: Service Mesh Validation**
```bash
# Test all service-to-service communications
./scripts/test-service-mesh.sh
# Validate:
# - bzzz-agent ↔ postgres
# - bzzz-agent ↔ redis
# - bzzz-agent ↔ dht-bootstrap nodes
# - mcp-server ↔ bzzz-agent
# - content-resolver ↔ bzzz-agent
```
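A simple connectivity sweep covering the pairs listed above might look like the following sketch; the port numbers are assumptions:
```bash
#!/bin/bash
# Hypothetical connectivity sweep; service names match the list above,
# port numbers are assumptions.
declare -A TARGETS=(
  [postgres]=5432
  [redis]=6379
  [dht-bootstrap]=4001
  [mcp-server]=3000
  [content-resolver]=8081
)
FAILED=0
for svc in "${!TARGETS[@]}"; do
  port=${TARGETS[$svc]}
  if timeout 5 bash -c "cat < /dev/null > /dev/tcp/$svc/$port" 2>/dev/null; then
    echo "✅ $svc:$port reachable"
  else
    echo "❌ $svc:$port unreachable"
    FAILED=1
  fi
done
exit $FAILED
```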
---
## 3. Chaos Engineering
### 3.1 Node Failure Testing
**TC-09: Single Node Failure**
```bash
#!/bin/bash
# Test system resilience to single node failure
# 1. Record baseline metrics
echo "Recording baseline metrics..."
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' > baseline_metrics.json
# 2. Identify current leader
LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')
echo "Current leader: $LEADER"
# 3. Simulate node failure
echo "Simulating failure of node: $LEADER"
docker node update --availability drain $LEADER
# 4. Monitor recovery
START_TIME=$(date +%s)
while true; do
  CURRENT_TIME=$(date +%s)
  ELAPSED=$((CURRENT_TIME - START_TIME))

  # Check if new leader elected
  NEW_LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')
  if [ "$NEW_LEADER" != "null" ] && [ "$NEW_LEADER" != "$LEADER" ]; then
    echo "✅ New leader elected: $NEW_LEADER (${ELAPSED}s)"
    break
  fi

  if [ $ELAPSED -gt 120 ]; then
    echo "❌ Leadership recovery timeout"
    exit 1
  fi

  sleep 5
done
# 5. Validate system health
sleep 30 # Allow system to stabilize
HEALTH_SCORE=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq -r '.data.result[0].value[1]')
echo "Post-failure health score: $HEALTH_SCORE"
if (( $(echo "$HEALTH_SCORE > 0.8" | bc -l) )); then
  echo "✅ System recovered successfully"
else
  echo "❌ System health degraded: $HEALTH_SCORE"
  exit 1
fi
# 6. Restore node
docker node update --availability active $LEADER
```
**TC-10: Multi-Node Cascade Failure**
```bash
# Test system resilience to cascade failures
./scripts/test-cascade-failure.sh
# Scenario: Fail 2 out of 5 nodes simultaneously
# Expected: System continues operating with degraded performance
# Expected: All critical data remains available
```
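A sketch of the scenario follows; the drained node names are placeholders, and treating a health score above 0.5 as "operating with degraded performance" is an assumed threshold:
```bash
#!/bin/bash
# Hypothetical sketch: drain 2 of 5 nodes and confirm the system keeps serving.
NODES_TO_FAIL=("node-b" "node-c")   # placeholder node names

for node in "${NODES_TO_FAIL[@]}"; do
  docker node update --availability drain "$node"
done
sleep 120   # allow Swarm to reschedule and the cluster to settle

HEALTH=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' \
  | jq -r '.data.result[0].value[1]')
echo "Health score with 2/5 nodes down: $HEALTH"
# Degraded but operational: expect a reduced, non-zero health score (threshold assumed)
(( $(echo "$HEALTH > 0.5" | bc -l) )) || { echo "❌ System not operational"; exit 1; }

for node in "${NODES_TO_FAIL[@]}"; do
  docker node update --availability active "$node"
done
```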
### 3.2 Network Partition Testing
**TC-11: DHT Network Partition**
```bash
#!/bin/bash
# Test DHT resilience to network partitions
# 1. Create network partition
echo "Creating network partition..."
iptables -A INPUT -s 192.168.1.72 -j DROP # Block ironwood
iptables -A OUTPUT -d 192.168.1.72 -j DROP
# 2. Monitor DHT health
./scripts/monitor-dht-partition-recovery.sh &
MONITOR_PID=$!
# 3. Wait for partition duration
sleep 300 # 5 minute partition
# 4. Heal partition
echo "Healing network partition..."
iptables -D INPUT -s 192.168.1.72 -j DROP
iptables -D OUTPUT -d 192.168.1.72 -j DROP
# 5. Wait for recovery
sleep 180 # 3 minute recovery window
# 6. Validate recovery
kill $MONITOR_PID
./scripts/validate-dht-recovery.sh
```
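A sketch of what `monitor-dht-partition-recovery.sh` could record, reusing the `/health/checks` endpoint from TC-03; the `dht-health` check name is an assumption:
```bash
#!/bin/bash
# Hypothetical monitor: sample the DHT health check every 15s during the
# partition and log it for the later validation step.
OUT=dht-partition-log.csv
echo "timestamp,dht_status" > "$OUT"
while true; do
  STATUS=$(curl -s http://bzzz-agent:8080/health/checks \
    | jq -r '.checks["dht-health"].status // "unknown"')   # assumed check name
  echo "$(date +%s),$STATUS" >> "$OUT"
  sleep 15
done
```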
### 3.3 Resource Exhaustion Testing
**TC-12: Memory Exhaustion**
```bash
# Test behavior under memory pressure
stress-ng --vm 4 --vm-bytes 75% --timeout 300s &
STRESS_PID=$!
# Monitor system behavior
./scripts/monitor-memory-exhaustion.sh
# Expected: Graceful degradation, no crashes
# Expected: Health checks detect degradation
# Expected: Alerts fired appropriately
kill $STRESS_PID
```
**TC-13: Disk Space Exhaustion**
```bash
# Test disk space exhaustion handling
# Fill the volume until free space drops below the alert threshold
# (size count= to the test host; 1000 x 1M blocks is only ~1 GB)
dd if=/dev/zero of=/tmp/fill-disk bs=1M count=1000
# Expected: Services detect low disk space
# Expected: Appropriate cleanup mechanisms activate
# Expected: System remains operational
# Clean up the filler file after validation
rm -f /tmp/fill-disk
```
---
## 4. Performance Testing
### 4.1 Load Testing
**TC-14: Context Generation Load Test**
```bash
#!/bin/bash
# Load test context generation system
# Test configuration
CONCURRENT_USERS=50
TEST_DURATION=600 # 10 minutes
RAMP_UP_TIME=60 # 1 minute
# Run load test
# Ramp-up is modelled with k6 --stage entries (duration:target VUs)
k6 run \
  --stage "${RAMP_UP_TIME}s:${CONCURRENT_USERS}" \
  --stage "${TEST_DURATION}s:${CONCURRENT_USERS}" \
  ./scripts/load-test-context-generation.js
# Success criteria:
# - Throughput: > 10 requests/second
# - P95 latency: < 2 seconds
# - Error rate: < 1%
# - System health score: > 0.8 throughout test
```
**TC-15: DHT Throughput Test**
```bash
# Test DHT operation throughput
./scripts/dht-throughput-test.sh
# Test matrix:
# - PUT operations: Target 100 ops/sec
# - GET operations: Target 500 ops/sec
# - Mixed workload: 80% GET, 20% PUT
```
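As a rough throughput probe, the DHT test endpoint from TC-02 can be driven with a larger operation count and timed; the response field names in this sketch are assumptions:
```bash
#!/bin/bash
# Hypothetical throughput probe built on the TC-02 DHT test endpoint.
OPS=1000
START=$(date +%s)
curl -s -X POST http://bzzz-agent:8080/test/dht-health \
  -H "Content-Type: application/json" \
  -d "{\"test_duration\": \"120s\", \"operation_count\": $OPS}" > dht-result.json
ELAPSED=$(( $(date +%s) - START ))
[ "$ELAPSED" -gt 0 ] || ELAPSED=1

echo "Completed $OPS operations in ${ELAPSED}s (~$(( OPS / ELAPSED )) ops/sec)"
jq '{success_rate, p95_latency_ms}' dht-result.json   # assumed response fields
```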
### 4.2 Scalability Testing
**TC-16: Horizontal Scaling Test**
```bash
#!/bin/bash
# Test horizontal scaling behavior
# Baseline measurement
echo "Recording baseline performance..."
./scripts/measure-baseline-performance.sh
# Scale up
echo "Scaling up services..."
docker service scale bzzz-v2_bzzz-agent=6
sleep 60 # Allow services to start
# Measure scaled performance
echo "Measuring scaled performance..."
./scripts/measure-scaled-performance.sh
# Validate improvements
echo "Validating scaling improvements..."
./scripts/validate-scaling-improvements.sh
# Expected: Linear improvement in throughput
# Expected: No degradation in latency
# Expected: Stable error rates
```
---
## 5. Monitoring and Alerting Validation
### 5.1 Alert Testing
**TC-17: Critical Alert Testing**
```bash
#!/bin/bash
# Test critical alert firing and resolution
ALERTS_TO_TEST=(
  "BZZZSystemHealthCritical"
  "BZZZInsufficientPeers"
  "BZZZDHTLowSuccessRate"
  "BZZZNoAdminElected"
  "BZZZTaskQueueBackup"
)

for alert in "${ALERTS_TO_TEST[@]}"; do
  echo "Testing alert: $alert"

  # Trigger condition
  ./scripts/trigger-alert-condition.sh "$alert"

  # Wait for alert
  if timeout 300 ./scripts/wait-for-alert.sh "$alert"; then
    echo "✅ Alert $alert fired successfully"
  else
    echo "❌ Alert $alert failed to fire"
  fi

  # Resolve condition
  ./scripts/resolve-alert-condition.sh "$alert"

  # Wait for resolution
  if timeout 300 ./scripts/wait-for-alert-resolution.sh "$alert"; then
    echo "✅ Alert $alert resolved successfully"
  else
    echo "❌ Alert $alert failed to resolve"
  fi
done
```
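The helper scripts above are referenced but not shown; `wait-for-alert.sh` could be a thin poll against the Alertmanager v2 API, as in this sketch (Alertmanager address assumed):
```bash
#!/bin/bash
# Hypothetical wait-for-alert.sh: poll Alertmanager until the named alert fires.
# The outer `timeout 300` in TC-17 supplies the deadline.
ALERT_NAME=$1
while true; do
  STATE=$(curl -s http://alertmanager:9093/api/v2/alerts \
    | jq -r --arg name "$ALERT_NAME" \
        '[.[] | select(.labels.alertname == $name) | .status.state] | first // "none"')
  if [ "$STATE" = "active" ]; then
    echo "Alert $ALERT_NAME is firing"
    exit 0
  fi
  sleep 10
done
```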
### 5.2 Metrics Validation
**TC-18: Metrics Accuracy Test**
```bash
# Validate metrics accuracy against actual system state
./scripts/validate-metrics-accuracy.sh
# Test cases:
# - Connected peers count vs actual P2P connections
# - DHT operation counters vs logged operations
# - Task completion rates vs actual completions
# - Resource usage vs system measurements
```
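One of these checks, sketched; the `bzzz_connected_peers` metric name and the peers endpoint are assumptions:
```bash
#!/bin/bash
# Hypothetical accuracy check: Prometheus peer gauge vs the agent's own view.
METRIC=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_connected_peers' \
  | jq -r '.data.result[0].value[1]')                                  # assumed metric name
ACTUAL=$(curl -s http://bzzz-agent:8080/api/p2p/peers | jq 'length')   # assumed endpoint

echo "Prometheus reports $METRIC peers, agent reports $ACTUAL"
if [ "${METRIC%.*}" -eq "$ACTUAL" ]; then
  echo "✅ Peer count metric matches"
else
  echo "❌ Metric drift detected"
  exit 1
fi
```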
### 5.3 Dashboard Functionality
**TC-19: Grafana Dashboard Test**
```bash
# Test all Grafana dashboards
./scripts/test-grafana-dashboards.sh
# Validation:
# - All panels load without errors
# - Data displays correctly for all time ranges
# - Drill-down functionality works
# - Alert annotations appear correctly
```
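A basic automated pass over the dashboards can use Grafana's HTTP API (`/api/search` and `/api/dashboards/uid/...`); in this sketch the Grafana URL and API token are assumptions:
```bash
#!/bin/bash
# Hypothetical dashboard smoke test via the Grafana HTTP API.
GRAFANA_URL=http://grafana:3000                 # assumed address
AUTH="Authorization: Bearer $GRAFANA_API_TOKEN" # token supplied by the environment

curl -s -H "$AUTH" "$GRAFANA_URL/api/search?type=dash-db" | jq -r '.[].uid' \
  | while read -r uid; do
      CODE=$(curl -s -o /dev/null -w '%{http_code}' -H "$AUTH" \
        "$GRAFANA_URL/api/dashboards/uid/$uid")
      if [ "$CODE" = "200" ]; then
        echo "✅ Dashboard $uid loads"
      else
        echo "❌ Dashboard $uid returned HTTP $CODE"
      fi
    done
```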
---
## 6. Disaster Recovery Testing
### 6.1 Data Recovery Testing
**TC-20: Database Recovery Test**
```bash
#!/bin/bash
# Test database backup and recovery procedures
# 1. Create test data
echo "Creating test data..."
./scripts/create-test-data.sh
# 2. Perform backup
echo "Creating backup..."
./scripts/backup-database.sh
# 3. Simulate data loss
echo "Simulating data loss..."
docker service scale bzzz-v2_postgres=0
docker volume rm bzzz-v2_postgres_data
# 4. Restore from backup
echo "Restoring from backup..."
./scripts/restore-database.sh
# 5. Validate data integrity
echo "Validating data integrity..."
./scripts/validate-restored-data.sh
# Expected: 100% data recovery
# Expected: All relationships intact
# Expected: System fully operational
```
### 6.2 Configuration Recovery
**TC-21: Configuration Disaster Recovery**
```bash
# Test recovery of all system configurations
./scripts/test-configuration-recovery.sh
# Test scenarios:
# - Docker secrets loss and recovery
# - Docker configs corruption and recovery
# - Service definition recovery
# - Network configuration recovery
```
### 6.3 Full System Recovery
**TC-22: Complete Infrastructure Recovery**
```bash
#!/bin/bash
# Test complete system recovery from scratch
# 1. Document current state
echo "Documenting current system state..."
./scripts/document-system-state.sh > pre-disaster-state.json
# 2. Simulate complete infrastructure loss
echo "Simulating infrastructure disaster..."
docker stack rm bzzz-v2
docker system prune -f --volumes
# 3. Recover infrastructure
echo "Recovering infrastructure..."
./scripts/deploy-from-scratch.sh
# 4. Validate recovery
echo "Validating recovery..."
./scripts/validate-complete-recovery.sh pre-disaster-state.json
# Success criteria:
# - All services operational within 15 minutes
# - All data recovered correctly
# - System health score > 0.9
# - All integrations functional
```
---
## Test Execution Framework
### Automated Test Runner
```bash
#!/bin/bash
# Main test execution script
TEST_SUITE=${1:-"all"}
ENVIRONMENT=${2:-"staging"}
echo "Running BZZZ reliability tests..."
echo "Suite: $TEST_SUITE"
echo "Environment: $ENVIRONMENT"
# Setup test environment
./scripts/setup-test-environment.sh $ENVIRONMENT
# Run test suites
case $TEST_SUITE in
  "health")
    ./scripts/run-health-tests.sh
    ;;
  "integration")
    ./scripts/run-integration-tests.sh
    ;;
  "chaos")
    ./scripts/run-chaos-tests.sh
    ;;
  "performance")
    ./scripts/run-performance-tests.sh
    ;;
  "monitoring")
    ./scripts/run-monitoring-tests.sh
    ;;
  "disaster-recovery")
    ./scripts/run-disaster-recovery-tests.sh
    ;;
  "all")
    ./scripts/run-all-tests.sh
    ;;
  *)
    echo "Unknown test suite: $TEST_SUITE"
    exit 1
    ;;
esac
# Generate test report
./scripts/generate-test-report.sh
echo "Test execution completed."
```
### Test Environment Setup
```yaml
# test-environment.yml
version: '3.8'
services:
  # Staging environment with reduced resource requirements
  bzzz-agent-test:
    image: registry.home.deepblack.cloud/bzzz:test-latest
    environment:
      - LOG_LEVEL=debug
      - TEST_MODE=true
      - METRICS_ENABLED=true
    networks:
      - test-network
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 1G
          cpus: '0.5'

  # Test data generator
  test-data-generator:
    image: registry.home.deepblack.cloud/bzzz-test-generator:latest
    environment:
      - TARGET_ENDPOINT=http://bzzz-agent-test:9000
      - DATA_VOLUME=medium
    networks:
      - test-network

networks:
  test-network:
    driver: overlay
```
### Continuous Testing Pipeline
```yaml
# .github/workflows/reliability-testing.yml
name: BZZZ Reliability Testing
on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM
  workflow_dispatch:

jobs:
  health-tests:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - name: Run Health Tests
        run: ./infrastructure/testing/run-tests.sh health staging

  performance-tests:
    runs-on: self-hosted
    needs: health-tests
    steps:
      - name: Run Performance Tests
        run: ./infrastructure/testing/run-tests.sh performance staging

  chaos-tests:
    runs-on: self-hosted
    needs: health-tests
    if: github.event_name == 'workflow_dispatch'
    steps:
      - name: Run Chaos Tests
        run: ./infrastructure/testing/run-tests.sh chaos staging
```
---
## Success Criteria
### Overall System Reliability Targets
- **Availability SLO**: 99.9% uptime
- **Performance SLO**:
  - Context generation: p95 < 2 seconds
  - DHT operations: p95 < 300ms
  - P2P messaging: p95 < 500ms
- **Error Rate SLO**: < 0.1% for all operations
- **Recovery Time Objective (RTO)**: < 15 minutes
- **Recovery Point Objective (RPO)**: < 5 minutes
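The latency SLOs above can be spot-checked ad hoc against Prometheus with `histogram_quantile`; the histogram metric names in this sketch are assumptions and should be replaced with the names the agents actually export:
```bash
#!/bin/bash
# Ad-hoc SLO spot check against Prometheus (metric names assumed).
check_p95() {
  local name=$1 metric=$2 threshold=$3
  local query="histogram_quantile(0.95, sum(rate(${metric}_bucket[5m])) by (le))"
  local value
  value=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
    --data-urlencode "query=$query" | jq -r '.data.result[0].value[1] // "NaN"')
  echo "$name p95 = ${value}s (target < ${threshold}s)"
}

check_p95 "Context generation" bzzz_context_generation_duration_seconds 2
check_p95 "DHT operation"      bzzz_dht_operation_duration_seconds      0.3
check_p95 "P2P message"        bzzz_pubsub_roundtrip_duration_seconds   0.5
```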
### Test Pass Criteria
- **Health Tests**: 100% of health checks function correctly
- **Integration Tests**: 95% pass rate for all integration scenarios
- **Chaos Tests**: System recovers within SLO targets for all failure scenarios
- **Performance Tests**: All performance metrics meet SLO targets under load
- **Monitoring Tests**: 100% of alerts fire and resolve correctly
- **Disaster Recovery**: Complete system recovery within RTO/RPO targets
### Continuous Monitoring
- Daily automated health and integration tests
- Weekly performance regression testing
- Monthly chaos engineering exercises
- Quarterly disaster recovery drills
---
## Test Reporting and Documentation
### Test Results Dashboard
- Real-time test execution status
- Historical test results and trends
- Performance benchmarks over time
- Failure analysis and remediation tracking
### Test Documentation
- Detailed test procedures and scripts
- Failure scenarios and response procedures
- Performance baselines and regression analysis
- Disaster recovery validation reports
This comprehensive testing plan ensures that all infrastructure enhancements are thoroughly validated and the system meets its reliability and performance objectives.