# BZZZ Infrastructure Reliability Testing Plan
## Overview
This document outlines comprehensive testing procedures to validate the reliability, performance, and operational readiness of the BZZZ distributed system infrastructure enhancements.
## Test Categories
1. Component Health Testing
2. Integration Testing
3. Chaos Engineering
4. Performance Testing
5. Monitoring and Alerting Validation
6. Disaster Recovery Testing
---
## 1. Component Health Testing
### 1.1 Enhanced Health Checks Validation
**Objective**: Verify enhanced health check implementations work correctly.
#### Test Cases
**TC-01: PubSub Health Probes**
```bash
# Test PubSub round-trip functionality
curl -X POST http://bzzz-agent:8080/test/pubsub-health \
  -H "Content-Type: application/json" \
  -d '{"test_duration": "30s", "message_count": 100}'
# Expected: Success rate > 99%, latency < 100ms
```
**TC-02: DHT Health Probes**
```bash
# Test DHT put/get operations
curl -X POST http://bzzz-agent:8080/test/dht-health \
  -H "Content-Type: application/json" \
  -d '{"test_duration": "60s", "operation_count": 50}'
# Expected: Success rate > 99%, p95 latency < 300ms
```
**TC-03: Election Health Monitoring**
```bash
# Test election stability
curl -X GET http://bzzz-agent:8080/health/checks | jq '.checks["election-health"]'
# Trigger controlled election
curl -X POST http://bzzz-agent:8080/admin/trigger-election
# Expected: Stable admin election within 30 seconds
```
#### Validation Criteria
- [ ] All health checks report accurate status
- [ ] Health check latencies are within SLO thresholds
- [ ] Failed health checks trigger appropriate alerts
- [ ] Health history is properly maintained
### 1.2 SLURP Leadership Health Testing
**TC-04: Leadership Transition Health**
```bash
# Test leadership transition health
./scripts/test-leadership-transition.sh
# Expected outcomes:
# - Clean leadership transitions
# - No dropped tasks during transition
# - Health scores maintain > 0.8 during transition
```
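The transition script itself is not shown here; a minimal sketch of what `test-leadership-transition.sh` might check, reusing the election trigger endpoint from TC-03, the election status API from TC-09, and the `bzzz_system_health_score` metric from TC-09, could look like this (thresholds mirror the expected outcomes above):
```bash
#!/bin/bash
# Illustrative sketch of test-leadership-transition.sh (not the shipped script).
STATUS_URL=http://bzzz.deepblack.cloud/api/election/status
PROM_QUERY='http://prometheus:9090/api/v1/query?query=bzzz_system_health_score'

# Trigger a controlled election (same endpoint as TC-03)
curl -s -X POST http://bzzz-agent:8080/admin/trigger-election > /dev/null

# Expect a stable admin within 30 seconds
ADMIN="null"
for i in $(seq 1 30); do
  ADMIN=$(curl -s "$STATUS_URL" | jq -r '.current_admin')
  if [ "$ADMIN" != "null" ] && [ -n "$ADMIN" ]; then
    echo "✅ Admin elected after ${i}s: $ADMIN"
    break
  fi
  sleep 1
done
[ "$ADMIN" != "null" ] || { echo "❌ No admin within 30s"; exit 1; }

# Health score should stay above 0.8 through the transition
SCORE=$(curl -s "$PROM_QUERY" | jq -r '.data.result[0].value[1]')
echo "Health score during transition: $SCORE"
(( $(echo "$SCORE > 0.8" | bc -l) )) || { echo "❌ Health degraded"; exit 1; }
```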
**TC-05: Degraded Leader Detection**
```bash
# Simulate resource exhaustion
docker service update --limit-memory 512M bzzz-v2_bzzz-agent
# Expected: Transition to degraded leader state within 2 minutes
# Expected: Health alerts fired appropriately
```
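A hedged sketch of the detection check follows; it assumes the election status API exposes a `leader_state` field (hypothetical field name) and that the original memory limit is restored afterwards (limit value assumed):
```bash
#!/bin/bash
# Hypothetical sketch: poll until the leader reports a degraded state.
# The 'leader_state' field name is an assumption, not a confirmed API field.
STATUS_URL=http://bzzz.deepblack.cloud/api/election/status
START=$(date +%s)
while true; do
  STATE=$(curl -s "$STATUS_URL" | jq -r '.leader_state')
  ELAPSED=$(( $(date +%s) - START ))
  if [ "$STATE" = "degraded" ]; then
    echo "✅ Degraded leader detected after ${ELAPSED}s"
    break
  fi
  if [ "$ELAPSED" -gt 120 ]; then
    echo "❌ Degraded state not detected within 2 minutes"
    exit 1
  fi
  sleep 5
done
# Restore the service memory limit (original value assumed to be 2G)
docker service update --limit-memory 2G bzzz-v2_bzzz-agent
```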
---
## 2. Integration Testing
### 2.1 End-to-End System Testing
**TC-06: Complete Task Lifecycle**
```bash
#!/bin/bash
# Test complete task flow from submission to completion
# 1. Submit context generation task
TASK_ID=$(curl -s -X POST http://bzzz.deepblack.cloud/api/slurp/generate \
  -H "Content-Type: application/json" \
  -d '{
        "ucxl_address": "ucxl://test/document.md",
        "role": "test_analyst",
        "priority": "high"
      }' | jq -r '.task_id')
echo "Task submitted: $TASK_ID"

# 2. Monitor task progress
while true; do
  STATUS=$(curl -s "http://bzzz.deepblack.cloud/api/slurp/status/$TASK_ID" | jq -r '.status')
  echo "Task status: $STATUS"
  if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
    break
  fi
  sleep 5
done

# 3. Validate results
if [ "$STATUS" = "completed" ]; then
  echo "✅ Task completed successfully"
  RESULT=$(curl -s "http://bzzz.deepblack.cloud/api/slurp/result/$TASK_ID")
  echo "Result size: $(echo "$RESULT" | jq -r '.content | length')"
else
  echo "❌ Task failed"
  exit 1
fi
```
**TC-07: Multi-Node Coordination**
```bash
# Test coordination across cluster nodes
./scripts/test-multi-node-coordination.sh
# Test matrix:
# - Task submission on node A, execution on node B
# - DHT storage on node A, retrieval on node C
# - Election on mixed node topology
```
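One matrix entry, sketched; the node URL and the `assigned_node` response field are assumptions, not confirmed API details:
```bash
#!/bin/bash
# Hypothetical sketch: submit on node A and confirm the task ran elsewhere.
NODE_A=http://node-a:8080   # placeholder hostname
TASK_ID=$(curl -s -X POST "$NODE_A/api/slurp/generate" \
  -H "Content-Type: application/json" \
  -d '{"ucxl_address": "ucxl://test/coordination.md", "role": "test_analyst", "priority": "high"}' \
  | jq -r '.task_id')

sleep 10
ASSIGNED=$(curl -s "$NODE_A/api/slurp/status/$TASK_ID" | jq -r '.assigned_node')   # assumed field
echo "Task $TASK_ID assigned to: $ASSIGNED"
if [ -n "$ASSIGNED" ] && [ "$ASSIGNED" != "node-a" ]; then
  echo "✅ Cross-node execution confirmed"
else
  echo "❌ Task stayed on the submitting node (or assignment unknown)"
  exit 1
fi
```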
### 2.2 Inter-Service Communication Testing
**TC-08: Service Mesh Validation**
```bash
# Test all service-to-service communications
./scripts/test-service-mesh.sh
# Validate:
# - bzzz-agent ↔ postgres
# - bzzz-agent ↔ redis
# - bzzz-agent ↔ dht-bootstrap nodes
# - mcp-server ↔ bzzz-agent
# - content-resolver ↔ bzzz-agent
```
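A simple connectivity sweep covering the pairs listed above might look like the following sketch; the port numbers are assumptions:
```bash
#!/bin/bash
# Hypothetical connectivity sweep; service names match the list above,
# port numbers are assumptions.
declare -A TARGETS=(
  [postgres]=5432
  [redis]=6379
  [dht-bootstrap]=4001
  [mcp-server]=3000
  [content-resolver]=8081
)
FAILED=0
for svc in "${!TARGETS[@]}"; do
  port=${TARGETS[$svc]}
  if timeout 5 bash -c "cat < /dev/null > /dev/tcp/$svc/$port" 2>/dev/null; then
    echo "✅ $svc:$port reachable"
  else
    echo "❌ $svc:$port unreachable"
    FAILED=1
  fi
done
exit $FAILED
```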
---
## 3. Chaos Engineering
### 3.1 Node Failure Testing
**TC-09: Single Node Failure**
```bash
#!/bin/bash
# Test system resilience to single node failure
# 1. Record baseline metrics
echo "Recording baseline metrics..."
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' > baseline_metrics.json
# 2. Identify current leader
LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')
echo "Current leader: $LEADER"
# 3. Simulate node failure
echo "Simulating failure of node: $LEADER"
docker node update --availability drain $LEADER
# 4. Monitor recovery
START_TIME=$(date +%s)
while true; do
  CURRENT_TIME=$(date +%s)
  ELAPSED=$((CURRENT_TIME - START_TIME))

  # Check if new leader elected
  NEW_LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')
  if [ "$NEW_LEADER" != "null" ] && [ "$NEW_LEADER" != "$LEADER" ]; then
    echo "✅ New leader elected: $NEW_LEADER (${ELAPSED}s)"
    break
  fi

  if [ $ELAPSED -gt 120 ]; then
    echo "❌ Leadership recovery timeout"
    exit 1
  fi

  sleep 5
done
# 5. Validate system health
sleep 30 # Allow system to stabilize
HEALTH_SCORE=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq -r '.data.result[0].value[1]')
echo "Post-failure health score: $HEALTH_SCORE"
if (( $(echo "$HEALTH_SCORE > 0.8" | bc -l) )); then
  echo "✅ System recovered successfully"
else
  echo "❌ System health degraded: $HEALTH_SCORE"
  exit 1
fi
# 6. Restore node
docker node update --availability active $LEADER
```
**TC-10: Multi-Node Cascade Failure**
```bash
# Test system resilience to cascade failures
./scripts/test-cascade-failure.sh
# Scenario: Fail 2 out of 5 nodes simultaneously
# Expected: System continues operating with degraded performance
# Expected: All critical data remains available
```
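A sketch of the scenario follows; the drained node names are placeholders, and treating a health score above 0.5 as "operating with degraded performance" is an assumed threshold:
```bash
#!/bin/bash
# Hypothetical sketch: drain 2 of 5 nodes and confirm the system keeps serving.
NODES_TO_FAIL=("node-b" "node-c")   # placeholder node names

for node in "${NODES_TO_FAIL[@]}"; do
  docker node update --availability drain "$node"
done
sleep 120   # allow Swarm to reschedule and the cluster to settle

HEALTH=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' \
  | jq -r '.data.result[0].value[1]')
echo "Health score with 2/5 nodes down: $HEALTH"
# Degraded but operational: expect a reduced, non-zero health score (threshold assumed)
(( $(echo "$HEALTH > 0.5" | bc -l) )) || { echo "❌ System not operational"; exit 1; }

for node in "${NODES_TO_FAIL[@]}"; do
  docker node update --availability active "$node"
done
```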
### 3.2 Network Partition Testing
**TC-11: DHT Network Partition**
```bash
#!/bin/bash
# Test DHT resilience to network partitions
# 1. Create network partition
echo "Creating network partition..."
iptables -A INPUT -s 192.168.1.72 -j DROP # Block ironwood
iptables -A OUTPUT -d 192.168.1.72 -j DROP
# 2. Monitor DHT health
./scripts/monitor-dht-partition-recovery.sh &
MONITOR_PID=$!
# 3. Wait for partition duration
sleep 300 # 5 minute partition
# 4. Heal partition
echo "Healing network partition..."
iptables -D INPUT -s 192.168.1.72 -j DROP
iptables -D OUTPUT -d 192.168.1.72 -j DROP
# 5. Wait for recovery
sleep 180 # 3 minute recovery window
# 6. Validate recovery
kill $MONITOR_PID
./scripts/validate-dht-recovery.sh
```
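A sketch of what `monitor-dht-partition-recovery.sh` could record, reusing the `/health/checks` endpoint from TC-03; the `dht-health` check name is an assumption:
```bash
#!/bin/bash
# Hypothetical monitor: sample the DHT health check every 15s during the
# partition and log it for the later validation step.
OUT=dht-partition-log.csv
echo "timestamp,dht_status" > "$OUT"
while true; do
  STATUS=$(curl -s http://bzzz-agent:8080/health/checks \
    | jq -r '.checks["dht-health"].status // "unknown"')   # assumed check name
  echo "$(date +%s),$STATUS" >> "$OUT"
  sleep 15
done
```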
### 3.3 Resource Exhaustion Testing
**TC-12: Memory Exhaustion**
```bash
# Test behavior under memory pressure
stress-ng --vm 4 --vm-bytes 75% --timeout 300s &
STRESS_PID=$!
# Monitor system behavior
./scripts/monitor-memory-exhaustion.sh
# Expected: Graceful degradation, no crashes
# Expected: Health checks detect degradation
# Expected: Alerts fired appropriately
kill $STRESS_PID
```
**TC-13: Disk Space Exhaustion**
```bash
# Test disk space exhaustion handling
# Fill the volume until free space drops below the alert threshold
# (size count= to the test host; 1000 x 1M blocks is only ~1 GB)
dd if=/dev/zero of=/tmp/fill-disk bs=1M count=1000
# Expected: Services detect low disk space
# Expected: Appropriate cleanup mechanisms activate
# Expected: System remains operational
# Clean up the filler file after validation
rm -f /tmp/fill-disk
```
---
## 4. Performance Testing
### 4.1 Load Testing
**TC-14: Context Generation Load Test**
```bash
#!/bin/bash
# Load test context generation system
# Test configuration
CONCURRENT_USERS=50
TEST_DURATION=600 # 10 minutes
RAMP_UP_TIME=60 # 1 minute
# Run load test
# Ramp-up is modelled with k6 --stage entries (duration:target VUs)
k6 run \
  --stage "${RAMP_UP_TIME}s:${CONCURRENT_USERS}" \
  --stage "${TEST_DURATION}s:${CONCURRENT_USERS}" \
  ./scripts/load-test-context-generation.js
# Success criteria:
# - Throughput: > 10 requests/second
# - P95 latency: < 2 seconds
# - Error rate: < 1%
# - System health score: > 0.8 throughout test
```
**TC-15: DHT Throughput Test**
```bash
# Test DHT operation throughput
./scripts/dht-throughput-test.sh
# Test matrix:
# - PUT operations: Target 100 ops/sec
# - GET operations: Target 500 ops/sec
# - Mixed workload: 80% GET, 20% PUT
```
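As a rough throughput probe, the DHT test endpoint from TC-02 can be driven with a larger operation count and timed; the response field names in this sketch are assumptions:
```bash
#!/bin/bash
# Hypothetical throughput probe built on the TC-02 DHT test endpoint.
OPS=1000
START=$(date +%s)
curl -s -X POST http://bzzz-agent:8080/test/dht-health \
  -H "Content-Type: application/json" \
  -d "{\"test_duration\": \"120s\", \"operation_count\": $OPS}" > dht-result.json
ELAPSED=$(( $(date +%s) - START ))
[ "$ELAPSED" -gt 0 ] || ELAPSED=1

echo "Completed $OPS operations in ${ELAPSED}s (~$(( OPS / ELAPSED )) ops/sec)"
jq '{success_rate, p95_latency_ms}' dht-result.json   # assumed response fields
```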
### 4.2 Scalability Testing
**TC-16: Horizontal Scaling Test**
```bash
#!/bin/bash
# Test horizontal scaling behavior
# Baseline measurement
echo "Recording baseline performance..."
./scripts/measure-baseline-performance.sh
# Scale up
echo "Scaling up services..."
docker service scale bzzz-v2_bzzz-agent=6
sleep 60 # Allow services to start
# Measure scaled performance
echo "Measuring scaled performance..."
./scripts/measure-scaled-performance.sh
# Validate improvements
echo "Validating scaling improvements..."
./scripts/validate-scaling-improvements.sh
# Expected: Linear improvement in throughput
# Expected: No degradation in latency
# Expected: Stable error rates
```
---
## 5. Monitoring and Alerting Validation
### 5.1 Alert Testing
**TC-17: Critical Alert Testing**
```bash
#!/bin/bash
# Test critical alert firing and resolution
ALERTS_TO_TEST=(
  "BZZZSystemHealthCritical"
  "BZZZInsufficientPeers"
  "BZZZDHTLowSuccessRate"
  "BZZZNoAdminElected"
  "BZZZTaskQueueBackup"
)

for alert in "${ALERTS_TO_TEST[@]}"; do
  echo "Testing alert: $alert"

  # Trigger condition
  ./scripts/trigger-alert-condition.sh "$alert"

  # Wait for alert
  if timeout 300 ./scripts/wait-for-alert.sh "$alert"; then
    echo "✅ Alert $alert fired successfully"
  else
    echo "❌ Alert $alert failed to fire"
  fi

  # Resolve condition
  ./scripts/resolve-alert-condition.sh "$alert"

  # Wait for resolution
  if timeout 300 ./scripts/wait-for-alert-resolution.sh "$alert"; then
    echo "✅ Alert $alert resolved successfully"
  else
    echo "❌ Alert $alert failed to resolve"
  fi
done
```
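The helper scripts above are referenced but not shown; `wait-for-alert.sh` could be a thin poll against the Alertmanager v2 API, as in this sketch (Alertmanager address assumed):
```bash
#!/bin/bash
# Hypothetical wait-for-alert.sh: poll Alertmanager until the named alert fires.
# The outer `timeout 300` in TC-17 supplies the deadline.
ALERT_NAME=$1
while true; do
  STATE=$(curl -s http://alertmanager:9093/api/v2/alerts \
    | jq -r --arg name "$ALERT_NAME" \
        '[.[] | select(.labels.alertname == $name) | .status.state] | first // "none"')
  if [ "$STATE" = "active" ]; then
    echo "Alert $ALERT_NAME is firing"
    exit 0
  fi
  sleep 10
done
```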
### 5.2 Metrics Validation
**TC-18: Metrics Accuracy Test**
```bash
# Validate metrics accuracy against actual system state
./scripts/validate-metrics-accuracy.sh
# Test cases:
# - Connected peers count vs actual P2P connections
# - DHT operation counters vs logged operations
# - Task completion rates vs actual completions
# - Resource usage vs system measurements
```
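One of these checks, sketched; the `bzzz_connected_peers` metric name and the peers endpoint are assumptions:
```bash
#!/bin/bash
# Hypothetical accuracy check: Prometheus peer gauge vs the agent's own view.
METRIC=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_connected_peers' \
  | jq -r '.data.result[0].value[1]')                                  # assumed metric name
ACTUAL=$(curl -s http://bzzz-agent:8080/api/p2p/peers | jq 'length')   # assumed endpoint

echo "Prometheus reports $METRIC peers, agent reports $ACTUAL"
if [ "${METRIC%.*}" -eq "$ACTUAL" ]; then
  echo "✅ Peer count metric matches"
else
  echo "❌ Metric drift detected"
  exit 1
fi
```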
### 5.3 Dashboard Functionality
**TC-19: Grafana Dashboard Test**
```bash
# Test all Grafana dashboards
./scripts/test-grafana-dashboards.sh
# Validation:
# - All panels load without errors
# - Data displays correctly for all time ranges
# - Drill-down functionality works
# - Alert annotations appear correctly
```
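A basic automated pass over the dashboards can use Grafana's HTTP API (`/api/search` and `/api/dashboards/uid/...`); in this sketch the Grafana URL and API token are assumptions:
```bash
#!/bin/bash
# Hypothetical dashboard smoke test via the Grafana HTTP API.
GRAFANA_URL=http://grafana:3000                 # assumed address
AUTH="Authorization: Bearer $GRAFANA_API_TOKEN" # token supplied by the environment

curl -s -H "$AUTH" "$GRAFANA_URL/api/search?type=dash-db" | jq -r '.[].uid' \
  | while read -r uid; do
      CODE=$(curl -s -o /dev/null -w '%{http_code}' -H "$AUTH" \
        "$GRAFANA_URL/api/dashboards/uid/$uid")
      if [ "$CODE" = "200" ]; then
        echo "✅ Dashboard $uid loads"
      else
        echo "❌ Dashboard $uid returned HTTP $CODE"
      fi
    done
```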
---
## 6. Disaster Recovery Testing
### 6.1 Data Recovery Testing
**TC-20: Database Recovery Test**
```bash
#!/bin/bash
# Test database backup and recovery procedures
# 1. Create test data
echo "Creating test data..."
./scripts/create-test-data.sh
# 2. Perform backup
echo "Creating backup..."
./scripts/backup-database.sh
# 3. Simulate data loss
echo "Simulating data loss..."
docker service scale bzzz-v2_postgres=0
docker volume rm bzzz-v2_postgres_data
# 4. Restore from backup
echo "Restoring from backup..."
./scripts/restore-database.sh
# 5. Validate data integrity
echo "Validating data integrity..."
./scripts/validate-restored-data.sh
# Expected: 100% data recovery
# Expected: All relationships intact
# Expected: System fully operational
```
### 6.2 Configuration Recovery
**TC-21: Configuration Disaster Recovery**
```bash
# Test recovery of all system configurations
./scripts/test-configuration-recovery.sh
# Test scenarios:
# - Docker secrets loss and recovery
# - Docker configs corruption and recovery
# - Service definition recovery
# - Network configuration recovery
```
### 6.3 Full System Recovery
**TC-22: Complete Infrastructure Recovery**
```bash
#!/bin/bash
# Test complete system recovery from scratch
# 1. Document current state
echo "Documenting current system state..."
./scripts/document-system-state.sh > pre-disaster-state.json
# 2. Simulate complete infrastructure loss
echo "Simulating infrastructure disaster..."
docker stack rm bzzz-v2
docker system prune -f --volumes
# 3. Recover infrastructure
echo "Recovering infrastructure..."
./scripts/deploy-from-scratch.sh
# 4. Validate recovery
echo "Validating recovery..."
./scripts/validate-complete-recovery.sh pre-disaster-state.json
# Success criteria:
# - All services operational within 15 minutes
# - All data recovered correctly
# - System health score > 0.9
# - All integrations functional
```
---
## Test Execution Framework
### Automated Test Runner
```bash
#!/bin/bash
# Main test execution script
TEST_SUITE=${1:-"all"}
ENVIRONMENT=${2:-"staging"}
echo "Running BZZZ reliability tests..."
echo "Suite: $TEST_SUITE"
echo "Environment: $ENVIRONMENT"
# Setup test environment
./scripts/setup-test-environment.sh $ENVIRONMENT
# Run test suites
case $TEST_SUITE in
  "health")
    ./scripts/run-health-tests.sh
    ;;
  "integration")
    ./scripts/run-integration-tests.sh
    ;;
  "chaos")
    ./scripts/run-chaos-tests.sh
    ;;
  "performance")
    ./scripts/run-performance-tests.sh
    ;;
  "monitoring")
    ./scripts/run-monitoring-tests.sh
    ;;
  "disaster-recovery")
    ./scripts/run-disaster-recovery-tests.sh
    ;;
  "all")
    ./scripts/run-all-tests.sh
    ;;
  *)
    echo "Unknown test suite: $TEST_SUITE"
    exit 1
    ;;
esac
# Generate test report
./scripts/generate-test-report.sh
echo "Test execution completed."
```
### Test Environment Setup
```yaml
# test-environment.yml
version: '3.8'
services:
  # Staging environment with reduced resource requirements
  bzzz-agent-test:
    image: registry.home.deepblack.cloud/bzzz:test-latest
    environment:
      - LOG_LEVEL=debug
      - TEST_MODE=true
      - METRICS_ENABLED=true
    networks:
      - test-network
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 1G
          cpus: '0.5'

  # Test data generator
  test-data-generator:
    image: registry.home.deepblack.cloud/bzzz-test-generator:latest
    environment:
      - TARGET_ENDPOINT=http://bzzz-agent-test:9000
      - DATA_VOLUME=medium
    networks:
      - test-network

networks:
  test-network:
    driver: overlay
```
### Continuous Testing Pipeline
```yaml
# .github/workflows/reliability-testing.yml
name: BZZZ Reliability Testing
on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM
  workflow_dispatch:

jobs:
  health-tests:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - name: Run Health Tests
        run: ./infrastructure/testing/run-tests.sh health staging

  performance-tests:
    runs-on: self-hosted
    needs: health-tests
    steps:
      - name: Run Performance Tests
        run: ./infrastructure/testing/run-tests.sh performance staging

  chaos-tests:
    runs-on: self-hosted
    needs: health-tests
    if: github.event_name == 'workflow_dispatch'
    steps:
      - name: Run Chaos Tests
        run: ./infrastructure/testing/run-tests.sh chaos staging
```
---
## Success Criteria
### Overall System Reliability Targets
- **Availability SLO**: 99.9% uptime
- **Performance SLO**:
  - Context generation: p95 < 2 seconds
  - DHT operations: p95 < 300ms
  - P2P messaging: p95 < 500ms
- **Error Rate SLO**: < 0.1% for all operations
- **Recovery Time Objective (RTO)**: < 15 minutes
- **Recovery Point Objective (RPO)**: < 5 minutes
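The latency SLOs above can be spot-checked ad hoc against Prometheus with `histogram_quantile`; the histogram metric names in this sketch are assumptions and should be replaced with the names the agents actually export:
```bash
#!/bin/bash
# Ad-hoc SLO spot check against Prometheus (metric names assumed).
check_p95() {
  local name=$1 metric=$2 threshold=$3
  local query="histogram_quantile(0.95, sum(rate(${metric}_bucket[5m])) by (le))"
  local value
  value=$(curl -s -G 'http://prometheus:9090/api/v1/query' \
    --data-urlencode "query=$query" | jq -r '.data.result[0].value[1] // "NaN"')
  echo "$name p95 = ${value}s (target < ${threshold}s)"
}

check_p95 "Context generation" bzzz_context_generation_duration_seconds 2
check_p95 "DHT operation"      bzzz_dht_operation_duration_seconds      0.3
check_p95 "P2P message"        bzzz_pubsub_roundtrip_duration_seconds   0.5
```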
### Test Pass Criteria
- **Health Tests**: 100% of health checks function correctly
- **Integration Tests**: 95% pass rate for all integration scenarios
- **Chaos Tests**: System recovers within SLO targets for all failure scenarios
- **Performance Tests**: All performance metrics meet SLO targets under load
- **Monitoring Tests**: 100% of alerts fire and resolve correctly
- **Disaster Recovery**: Complete system recovery within RTO/RPO targets
### Continuous Monitoring
- Daily automated health and integration tests
- Weekly performance regression testing
- Monthly chaos engineering exercises
- Quarterly disaster recovery drills
---
## Test Reporting and Documentation
### Test Results Dashboard
- Real-time test execution status
- Historical test results and trends
- Performance benchmarks over time
- Failure analysis and remediation tracking
### Test Documentation
- Detailed test procedures and scripts
- Failure scenarios and response procedures
- Performance baselines and regression analysis
- Disaster recovery validation reports
This comprehensive testing plan ensures that all infrastructure enhancements are thoroughly validated and the system meets its reliability and performance objectives.