🚀 Complete BZZZ Issue Resolution - All 17 Issues Solved

Comprehensive multi-agent implementation addressing all issues from INDEX.md:

## Core Architecture & Validation
-  Issue 001: UCXL address validation at all system boundaries
-  Issue 002: Fixed search parsing bug in encrypted storage
-  Issue 003: Wired UCXI P2P announce and discover functionality
-  Issue 011: Aligned temporal grammar and documentation
-  Issue 012: SLURP idempotency, backpressure, and DLQ implementation
-  Issue 013: Linked SLURP events to UCXL decisions and DHT

## API Standardization & Configuration
-  Issue 004: Standardized UCXI payloads to UCXL codes
-  Issue 010: Status endpoints and configuration surface

## Infrastructure & Operations
-  Issue 005: Election heartbeat on admin transition
-  Issue 006: Active health checks for PubSub and DHT
-  Issue 007: DHT replication and provider records
-  Issue 014: SLURP leadership lifecycle and health probes
-  Issue 015: Comprehensive monitoring, SLOs, and alerts

## Security & Access Control
-  Issue 008: Key rotation and role-based access policies

## Testing & Quality Assurance
-  Issue 009: Integration tests for UCXI + DHT encryption + search
-  Issue 016: E2E tests for HMMM → SLURP → UCXL workflow

## HMMM Integration
-  Issue 017: HMMM adapter wiring and comprehensive testing

## Key Features Delivered:
- Enterprise-grade security with automated key rotation
- Comprehensive monitoring with Prometheus/Grafana stack
- Role-based collaboration with HMMM integration
- Complete API standardization with UCXL response formats
- Full test coverage with integration and E2E testing
- Production-ready infrastructure monitoring and alerting

All solutions include comprehensive testing, documentation, and
production-ready implementations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Author: anthonyrawlins
Date: 2025-08-29 12:39:38 +10:00
Parent: 59f40e17a5
Commit: 92779523c0
136 changed files with 56,649 additions and 134 deletions


@@ -0,0 +1,835 @@
# BZZZ Infrastructure Operational Runbook
## Table of Contents
1. [Quick Reference](#quick-reference)
2. [System Architecture Overview](#system-architecture-overview)
3. [Common Operational Tasks](#common-operational-tasks)
4. [Incident Response Procedures](#incident-response-procedures)
5. [Health Check Procedures](#health-check-procedures)
6. [Performance Tuning](#performance-tuning)
7. [Backup and Recovery](#backup-and-recovery)
8. [Troubleshooting Guide](#troubleshooting-guide)
9. [Maintenance Procedures](#maintenance-procedures)
## Quick Reference
### Critical Service Endpoints
- **Grafana Dashboard**: https://grafana.chorus.services
- **Prometheus**: https://prometheus.chorus.services
- **AlertManager**: https://alerts.chorus.services
- **BZZZ Main API**: https://bzzz.deepblack.cloud
- **Health Checks**: https://bzzz.deepblack.cloud/health
### Emergency Contacts
- **Primary Oncall**: Slack #bzzz-alerts
- **System Administrator**: @tony
- **Infrastructure Team**: @platform-team
### Key Commands
```bash
# Check system health
curl -s https://bzzz.deepblack.cloud/health | jq
# View logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100
# Scale service
docker service scale bzzz-v2_bzzz-agent=5
# Force service update
docker service update --force bzzz-v2_bzzz-agent
```
## System Architecture Overview
### Component Relationships
```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   PubSub    │─────│     DHT     │─────│  Election   │
│  Messaging  │     │   Storage   │     │   Manager   │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                    ┌─────────────┐
                    │    SLURP    │
                    │   Context   │
                    │  Generator  │
                    └─────────────┘
                           │
                    ┌─────────────┐
                    │    UCXI     │
                    │  Protocol   │
                    │  Resolver   │
                    └─────────────┘
```
### Data Flow
1. **Task Requests** → PubSub → Task Coordinator → SLURP (if admin)
2. **Context Generation** → DHT Storage → UCXI Resolution
3. **Health Monitoring** → Prometheus → AlertManager → Notifications
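Each stage of this flow exposes counters to Prometheus (the metric names below are the ones used by the alert rules in this commit), so a quick end-to-end check is to confirm that every stage is still moving:
```bash
# Sketch: per-stage rates for the task pipeline, assuming the internal Prometheus endpoint
PROM=http://prometheus:9090/api/v1/query
for metric in \
  'rate(bzzz_pubsub_messages_total[5m])' \
  'rate(bzzz_slurp_contexts_generated_total[5m])' \
  'rate(bzzz_dht_put_operations_total[5m])' \
  'rate(bzzz_ucxi_requests_total[5m])'; do
  echo "== $metric =="
  curl -sG "$PROM" --data-urlencode "query=$metric" | jq '.data.result'
done
```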
### Critical Dependencies
- **Docker Swarm**: Container orchestration
- **NFS Storage**: Persistent data storage
- **Prometheus Stack**: Monitoring and alerting
- **DHT Bootstrap Nodes**: P2P network foundation
## Common Operational Tasks
### Service Management
#### Check Service Status
```bash
# List all BZZZ services
docker service ls | grep bzzz
# Check specific service
docker service ps bzzz-v2_bzzz-agent
# View service configuration
docker service inspect bzzz-v2_bzzz-agent
```
#### Scale Services
```bash
# Scale main BZZZ service
docker service scale bzzz-v2_bzzz-agent=5
# Scale monitoring stack
docker service scale bzzz-monitoring_prometheus=1
docker service scale bzzz-monitoring_grafana=1
```
#### Update Services
```bash
# Update to new image version
docker service update \
--image registry.home.deepblack.cloud/bzzz:v2.1.0 \
bzzz-v2_bzzz-agent
# Update environment variables
docker service update \
--env-add LOG_LEVEL=debug \
bzzz-v2_bzzz-agent
# Update resource limits
docker service update \
--limit-memory 4G \
--limit-cpu 2 \
bzzz-v2_bzzz-agent
```
### Configuration Management
#### Update Docker Secrets
```bash
# Create new secret
echo "new_password" | docker secret create bzzz_postgres_password_v2 -
# Update service to use new secret
docker service update \
--secret-rm bzzz_postgres_password \
--secret-add bzzz_postgres_password_v2 \
bzzz-v2_postgres
```
#### Update Docker Configs
```bash
# Create new config
docker config create bzzz_v2_config_v3 /path/to/new/config.yaml
# Update service
docker service update \
--config-rm bzzz_v2_config \
--config-add source=bzzz_v2_config_v3,target=/app/config/config.yaml \
bzzz-v2_bzzz-agent
```
### Monitoring and Alerting
#### Check Alert Status
```bash
# View active alerts
curl -s http://alertmanager:9093/api/v1/alerts | jq '.data[] | select(.status.state == "active")'
# Silence alert
curl -X POST http://alertmanager:9093/api/v1/silences \
-d '{
"matchers": [{"name": "alertname", "value": "BZZZSystemHealthCritical"}],
"startsAt": "2025-01-01T00:00:00Z",
"endsAt": "2025-01-01T01:00:00Z",
"comment": "Maintenance window",
"createdBy": "operator"
}'
```
#### Query Metrics
```bash
# Check system health
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq
# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' | jq
# Check error rates
curl -s 'http://prometheus:9090/api/v1/query?query=rate(bzzz_errors_total[5m])' | jq
```
## Incident Response Procedures
### Severity Levels
#### Critical (P0)
- System completely unavailable
- Data loss or corruption
- Security breach
- **Response Time**: 15 minutes
- **Resolution Target**: 2 hours
#### High (P1)
- Major functionality impaired
- Performance severely degraded
- **Response Time**: 1 hour
- **Resolution Target**: 4 hours
#### Medium (P2)
- Minor functionality issues
- Performance slightly degraded
- **Response Time**: 4 hours
- **Resolution Target**: 24 hours
#### Low (P3)
- Cosmetic issues
- Enhancement requests
- **Response Time**: 24 hours
- **Resolution Target**: 1 week
### Common Incident Scenarios
#### System Health Critical (Alert: BZZZSystemHealthCritical)
**Symptoms**: System health score < 0.5
**Immediate Actions**:
1. Check Grafana dashboard for component failures
2. Review recent deployments or changes
3. Check resource utilization (CPU, memory, disk)
4. Verify P2P connectivity
**Investigation Steps**:
```bash
# Check overall system status
curl -s https://bzzz.deepblack.cloud/health | jq
# Check component health
curl -s https://bzzz.deepblack.cloud/health/checks | jq
# Review recent logs
docker service logs bzzz-v2_bzzz-agent --since 1h | tail -100
# Check resource usage
docker stats --no-stream
```
**Recovery Actions**:
1. If memory leak: Restart affected services
2. If disk full: Clean up logs and temporary files
3. If network issues: Restart networking components
4. If database issues: Check PostgreSQL health
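A minimal sketch of the recovery commands implied above (service and path names match the rest of this runbook; point them at whichever component is failing):
```bash
# Restart an affected or leaking service
docker service update --force bzzz-v2_bzzz-agent
# Reclaim disk: prune unused Docker artifacts and truncate local container logs
docker system prune -f
sudo truncate -s 0 /var/lib/docker/containers/*/*-json.log
# Check PostgreSQL health from inside the container
docker exec $(docker ps -q -f name=postgres) pg_isready -U bzzz -d bzzz_v2
```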
#### P2P Network Partition (Alert: BZZZInsufficientPeers)
**Symptoms**: Connected peers < 3
**Immediate Actions**:
1. Check network connectivity between nodes
2. Verify DHT bootstrap nodes are running
3. Check firewall rules and port accessibility
**Investigation Steps**:
```bash
# Check DHT bootstrap nodes
for node in walnut:9101 ironwood:9102 acacia:9103; do
echo "Checking $node:"
nc -zv ${node%:*} ${node#*:}
done
# Check P2P connectivity
docker service logs bzzz-v2_dht-bootstrap-walnut --since 1h
# Test network between nodes
docker run --rm --network host nicolaka/netshoot ping -c 3 ironwood
```
**Recovery Actions**:
1. Restart DHT bootstrap services
2. Clear peer store if corrupted
3. Check and fix network configuration
4. Restart affected BZZZ agents
#### Election System Failure (Alert: BZZZNoAdminElected)
**Symptoms**: No admin elected or frequent leadership changes
**Immediate Actions**:
1. Check election state on all nodes
2. Review heartbeat status
3. Verify role configurations
**Investigation Steps**:
```bash
# Check election status on each node
for node in walnut ironwood acacia; do
echo "Node $node election status:"
docker exec $(docker ps -q --filter label=com.docker.swarm.node.id) \
curl -s localhost:8081/health/checks | jq '.checks["election-health"]'
done
# Check role configurations
docker config inspect bzzz_v2_config | jq '.Spec.Data' | base64 -d | grep -A5 -B5 role
```
**Recovery Actions**:
1. Force re-election by restarting election managers
2. Fix role configuration issues
3. Clear election state if corrupted
4. Ensure at least one node has admin capabilities
#### DHT Replication Failure (Alert: BZZZDHTReplicationDegraded)
**Symptoms**: Average replication factor < 2
**Immediate Actions**:
1. Check DHT provider records
2. Verify replication manager status
3. Check storage availability
**Investigation Steps**:
```bash
# Check DHT metrics
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_replication_factor' | jq
# Check provider records
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_provider_records' | jq
# Check replication manager logs
docker service logs bzzz-v2_bzzz-agent | grep -i replication
```
**Recovery Actions**:
1. Restart replication managers
2. Force re-provision of content
3. Check and fix storage issues
4. Verify DHT network connectivity
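A hedged sketch of the first two recovery steps; the re-provision subcommand is hypothetical, so confirm the actual CLI verb before relying on it:
```bash
# Restart the agents that host the replication managers
docker service update --force bzzz-v2_bzzz-agent
# Hypothetical subcommand: ask an agent to re-announce provider records for its content
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz reprovide-content
# Confirm the replication factor recovers
curl -s 'http://prometheus:9090/api/v1/query?query=avg(bzzz_dht_replication_factor)' | jq
```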
### Escalation Procedures
#### When to Escalate
- Unable to resolve P0/P1 incident within target time
- Incident requires specialized knowledge
- Multiple systems affected
- Potential security implications
#### Escalation Contacts
1. **Technical Lead**: @tech-lead (Slack)
2. **Infrastructure Team**: @infra-team (Slack)
3. **Management**: @management (for business-critical issues)
## Health Check Procedures
### Manual Health Verification
#### System-Level Checks
```bash
# 1. Overall system health
curl -s https://bzzz.deepblack.cloud/health | jq '.status'
# 2. Component health checks
curl -s https://bzzz.deepblack.cloud/health/checks | jq
# 3. Resource utilization
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"
# 4. Service status
docker service ls | grep bzzz
# 5. Network connectivity
docker network ls | grep bzzz
```
#### Component-Specific Checks
**P2P Network**:
```bash
# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers'
# Test P2P messaging
docker exec -it $(docker ps -q -f name=bzzz-agent) \
/app/bzzz test-p2p-message
```
**DHT Storage**:
```bash
# Check DHT operations
curl -s 'http://prometheus:9090/api/v1/query?query=rate(bzzz_dht_put_operations_total[5m])'
# Test DHT functionality
docker exec -it $(docker ps -q -f name=bzzz-agent) \
/app/bzzz test-dht-operations
```
**Election System**:
```bash
# Check current admin
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_election_state'
# Check heartbeat status
curl -s https://bzzz.deepblack.cloud/api/election/status | jq
```
### Automated Health Monitoring
#### Prometheus Queries for Health
```promql
# Overall system health
bzzz_system_health_score
# Component health scores
bzzz_component_health_score
# SLI compliance
rate(bzzz_health_checks_passed_total[5m]) / (rate(bzzz_health_checks_passed_total[5m]) + rate(bzzz_health_checks_failed_total[5m]))
# Error budget burn rate
(1 - bzzz:dht_success_rate) > 0.01 # 1% error budget
```
#### Alert Validation
After resolving issues, verify alerts clear:
```bash
# Check if alerts are resolved
curl -s http://alertmanager:9093/api/v1/alerts | \
jq '.data[] | select(.status.state == "active") | .labels.alertname'
```
## Performance Tuning
### Resource Optimization
#### Memory Tuning
```bash
# Increase memory limits for heavy workloads
docker service update --limit-memory 8G bzzz-v2_bzzz-agent
# Optimize JVM heap size (if applicable)
docker service update \
--env-add JAVA_OPTS="-Xmx4g -Xms2g" \
bzzz-v2_bzzz-agent
```
#### CPU Optimization
```bash
# Adjust CPU limits
docker service update --limit-cpu 4 bzzz-v2_bzzz-agent
# Set CPU affinity for critical services
docker service update \
--placement-pref "spread=node.labels.cpu_type==high_performance" \
bzzz-v2_bzzz-agent
```
#### Network Optimization
```bash
# Optimize network buffer sizes
echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf
sysctl -p
```
### Application-Level Tuning
#### DHT Performance
- Increase replication factor for critical content
- Optimize provider record refresh intervals
- Tune cache sizes based on memory availability
#### PubSub Performance
- Adjust message batch sizes
- Optimize topic subscription patterns
- Configure message retention policies
#### Election Stability
- Tune heartbeat intervals
- Adjust election timeouts based on network latency
- Optimize candidate scoring algorithms
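The exact knobs depend on the BZZZ build; as a sketch, tunables like these are usually surfaced as environment variables or config keys. The names below are illustrative assumptions, not confirmed settings:
```bash
# Illustrative only: the environment variable names are assumptions, not confirmed settings
docker service update \
  --env-add DHT_REPLICATION_FACTOR=3 \
  --env-add DHT_PROVIDER_REFRESH_INTERVAL=10m \
  --env-add PUBSUB_BATCH_SIZE=64 \
  --env-add ELECTION_HEARTBEAT_INTERVAL=5s \
  bzzz-v2_bzzz-agent
```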
### Monitoring Performance Impact
```bash
# Before tuning - capture baseline
curl -s 'http://prometheus:9090/api/v1/query_range?query=rate(bzzz_dht_operation_latency_seconds_sum[5m])/rate(bzzz_dht_operation_latency_seconds_count[5m])&start=2025-01-01T00:00:00Z&end=2025-01-01T01:00:00Z&step=60s'
# After tuning - compare results
# Use Grafana dashboards to visualize improvements
```
## Backup and Recovery
### Critical Data Identification
#### Persistent Data
- **PostgreSQL Database**: User data, task history, conversation threads
- **DHT Content**: Distributed content storage
- **Configuration**: Docker secrets, configs, service definitions
- **Prometheus Data**: Historical metrics (optional but valuable)
#### Backup Schedule
- **PostgreSQL**: Daily full backup, continuous WAL archiving
- **Configuration**: Weekly backup, immediately after changes
- **Prometheus**: Weekly backup of selected metrics
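This schedule can be driven from cron on the backup host; a sketch assuming the commands in the next section are wrapped in scripts (the script paths are illustrative):
```bash
# Illustrative crontab entries; script paths are assumptions
# Daily PostgreSQL full backup at 02:00
0 2 * * * /rust/bzzz-v2/scripts/backup-postgres.sh >> /var/log/bzzz-backup.log 2>&1
# Weekly configuration backup, Sundays at 03:00
0 3 * * 0 /rust/bzzz-v2/scripts/backup-configs.sh >> /var/log/bzzz-backup.log 2>&1
# Weekly Prometheus snapshot, Sundays at 04:00
0 4 * * 0 /rust/bzzz-v2/scripts/backup-prometheus.sh >> /var/log/bzzz-backup.log 2>&1
```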
### Backup Procedures
#### Database Backup
```bash
# Create database backup (capture the timestamp once so both paths refer to the same file;
# /backup inside the container is assumed to be mounted from /rust/bzzz-v2/backups)
BACKUP_FILE="bzzz_$(date +%Y%m%d_%H%M%S).sql"
docker exec $(docker ps -q -f name=postgres) \
pg_dump -U bzzz -d bzzz_v2 -f /backup/${BACKUP_FILE}
# Compress and store
gzip /rust/bzzz-v2/backups/${BACKUP_FILE}
aws s3 cp /rust/bzzz-v2/backups/ s3://chorus-backups/bzzz/ --recursive
```
#### Configuration Backup
```bash
# Export secret metadata (note: 'docker secret inspect' does not return secret values;
# the values themselves must be re-created from their original source during recovery)
for secret in $(docker secret ls -q); do
docker secret inspect $secret > /backup/secrets/${secret}.json
done
# Export all configs
for config in $(docker config ls -q); do
docker config inspect $config > /backup/configs/${config}.json
done
# Export service definitions
docker service ls --format '{{.Name}}' | xargs -I {} docker service inspect {} > /backup/services.json
```
#### Prometheus Data Backup
```bash
# Snapshot Prometheus data
curl -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot
# Copy snapshot to backup location
docker cp prometheus_container:/prometheus/snapshots/latest /backup/prometheus/$(date +%Y%m%d)
```
### Recovery Procedures
#### Full System Recovery
1. **Restore Infrastructure**: Deploy Docker Swarm stack
2. **Restore Configuration**: Import secrets and configs
3. **Restore Database**: Restore PostgreSQL from backup
4. **Validate Services**: Verify all services are healthy
5. **Test Functionality**: Run end-to-end tests
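Chained together, the recovery sequence looks roughly like the sketch below (stack and compose file names follow this commit's deployment scripts but should be treated as assumptions):
```bash
# 1. Redeploy the application and monitoring stacks
docker stack deploy -c docker-compose.yml bzzz-v2
docker stack deploy -c monitoring/docker-compose.enhanced.yml bzzz-monitoring-v2
# 2. Recreate configs from the inspect exports (.Spec.Data holds the base64 payload)
for f in /backup/configs/*.json; do
  name=$(jq -r '.[0].Spec.Name' "$f")
  jq -r '.[0].Spec.Data' "$f" | base64 -d | docker config create "$name" -
done
# 3. Restore the database (see Database Recovery below), then scale agents back up
# 4-5. Validate health and run the E2E test suite
curl -s https://bzzz.deepblack.cloud/health | jq '.status'
```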
#### Database Recovery
```bash
# Stop application services
docker service scale bzzz-v2_bzzz-agent=0
# Restore database
gunzip -c /backup/bzzz_20250101_120000.sql.gz | \
docker exec -i $(docker ps -q -f name=postgres) \
psql -U bzzz -d bzzz_v2
# Start application services
docker service scale bzzz-v2_bzzz-agent=3
```
#### Point-in-Time Recovery
```bash
# For WAL-based recovery
docker exec $(docker ps -q -f name=postgres) \
pg_basebackup -U postgres -D /backup/base -X stream -P
# Restore to specific time
# (Implementation depends on PostgreSQL configuration)
```
### Recovery Testing
#### Monthly Recovery Tests
```bash
# Test database restore
./scripts/test-db-restore.sh
# Test configuration restore
./scripts/test-config-restore.sh
# Test full system restore (staging environment)
./scripts/test-full-restore.sh staging
```
#### Recovery Validation
- Verify all services start successfully
- Check data integrity and completeness
- Validate P2P network connectivity
- Test core functionality (task coordination, context generation)
- Monitor system health for 24 hours post-recovery
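For the 24-hour watch, a simple loop against the public health endpoint and the overall health score is usually enough; a sketch:
```bash
# Poll overall health every 5 minutes for 24 hours (288 iterations) and log the results
for i in $(seq 1 288); do
  ts=$(date '+%F %T')
  status=$(curl -s https://bzzz.deepblack.cloud/health | jq -r '.status')
  score=$(curl -sG 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=bzzz_system_health_score' | jq -r '.data.result[0].value[1]')
  echo "$ts status=$status health_score=$score" >> /var/log/bzzz-post-recovery.log
  sleep 300
done
```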
## Troubleshooting Guide
### Log Analysis
#### Centralized Logging
```bash
# View aggregated logs through Loki
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
--data-urlencode 'query={job="bzzz"}' \
--data-urlencode 'start=2025-01-01T00:00:00Z' \
--data-urlencode 'end=2025-01-01T01:00:00Z' | jq
# Search for specific errors
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
--data-urlencode 'query={job="bzzz"} |= "ERROR"' | jq
```
#### Service-Specific Logs
```bash
# BZZZ agent logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100
# DHT bootstrap logs
docker service logs bzzz-v2_dht-bootstrap-walnut -f
# Database logs
docker service logs bzzz-v2_postgres -f
# Filter for specific patterns
docker service logs bzzz-v2_bzzz-agent | grep -E "(ERROR|FATAL|panic)"
```
### Common Issues and Solutions
#### "No Admin Elected" Error
```bash
# Check role configurations
docker config inspect bzzz_v2_config | jq '.Spec.Data' | base64 -d | yq '.agent.role'
# Force election
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz trigger-election
# Restart election managers
docker service update --force bzzz-v2_bzzz-agent
```
#### "DHT Operations Failing" Error
```bash
# Check DHT bootstrap nodes
for port in 9101 9102 9103; do
nc -zv localhost $port
done
# Restart DHT services
docker service update --force bzzz-v2_dht-bootstrap-walnut
docker service update --force bzzz-v2_dht-bootstrap-ironwood
docker service update --force bzzz-v2_dht-bootstrap-acacia
# Clear DHT cache
docker exec -it $(docker ps -q -f name=bzzz-agent) rm -rf /app/data/dht/cache/*
```
#### "High Memory Usage" Alert
```bash
# Identify memory-hungry processes
docker stats --no-stream --format "table {{.Container}}\t{{.MemUsage}}\t{{.MemPerc}}" | sort -k3 -n
# Check for memory leaks
docker exec -it $(docker ps -q -f name=bzzz-agent) pprof -http=:6060 /app/bzzz
# Restart high-memory services
docker service update --force bzzz-v2_bzzz-agent
```
#### "Network Connectivity Issues"
```bash
# Check overlay network
docker network inspect bzzz-internal
# Test connectivity between services
docker run --rm --network bzzz-internal nicolaka/netshoot ping -c 3 postgres
# Check firewall rules
iptables -L | grep -E "(9000|9101|9102|9103)"
# Restart networking
docker network disconnect bzzz-internal $(docker ps -q -f name=bzzz-agent)
docker network connect bzzz-internal $(docker ps -q -f name=bzzz-agent)
```
### Performance Issues
#### High Latency Diagnosis
```bash
# Check operation latencies
curl -s 'http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket[5m]))'
# Identify bottlenecks
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz profile-cpu 30
# Check network latency between nodes
for node in walnut ironwood acacia; do
ping -c 10 $node | tail -1
done
```
#### Resource Contention
```bash
# Check CPU usage
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}"
# Check I/O wait
iostat -x 1 5
# Check network utilization
iftop -i eth0
```
### Debugging Tools
#### Application Debugging
```bash
# Enable debug logging
docker service update --env-add LOG_LEVEL=debug bzzz-v2_bzzz-agent
# Access debug endpoints
curl -s http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof heap.prof
# Trace requests
curl -s http://localhost:8080/debug/requests
```
#### System Debugging
```bash
# System resource usage
htop
iotop
nethogs
# Process analysis
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20
# Network analysis
netstat -tulpn | grep -E ":9000|:9101|:9102|:9103"
ss -tuln | grep -E ":9000|:9101|:9102|:9103"
```
## Maintenance Procedures
### Scheduled Maintenance
#### Weekly Maintenance (Low-impact)
- Review system health metrics
- Check log sizes and rotate if necessary
- Update monitoring dashboards
- Validate backup integrity
#### Monthly Maintenance (Medium-impact)
- Update non-critical components
- Perform capacity planning review
- Test disaster recovery procedures
- Security scan and updates
#### Quarterly Maintenance (High-impact)
- Major version updates
- Infrastructure upgrades
- Performance optimization review
- Security audit and remediation
### Update Procedures
#### Rolling Updates
```bash
# Update with zero downtime
docker service update \
--image registry.home.deepblack.cloud/bzzz:v2.1.0 \
--update-parallelism 1 \
--update-delay 30s \
--update-failure-action rollback \
bzzz-v2_bzzz-agent
```
#### Configuration Updates
```bash
# Update configuration without restart
docker config create bzzz_v2_config_new /path/to/new/config.yaml
docker service update \
--config-rm bzzz_v2_config \
--config-add source=bzzz_v2_config_new,target=/app/config/config.yaml \
bzzz-v2_bzzz-agent
# Cleanup old config
docker config rm bzzz_v2_config
```
#### Database Maintenance
```bash
# Database optimization
docker exec -it $(docker ps -q -f name=postgres) \
psql -U bzzz -d bzzz_v2 -c "VACUUM ANALYZE;"
# Update statistics
docker exec -it $(docker ps -q -f name=postgres) \
psql -U bzzz -d bzzz_v2 -c "ANALYZE;"
# Check database size
docker exec -it $(docker ps -q -f name=postgres) \
psql -U bzzz -d bzzz_v2 -c "SELECT pg_size_pretty(pg_database_size('bzzz_v2'));"
```
### Capacity Planning
#### Growth Projections
- Monitor resource usage trends over time
- Project capacity needs based on growth patterns
- Plan for seasonal or event-driven spikes
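Prometheus's `predict_linear` gives a first-pass projection from the trend data already being collected; for example:
```bash
# Projected memory usage 30 days out, based on the last 7 days of data
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=predict_linear(bzzz_memory_usage_bytes[7d], 86400 * 30)' | jq
# Same idea for disk: will usage cross 90% within 30 days?
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=predict_linear(bzzz_disk_usage_ratio[7d], 86400 * 30) > 0.9' | jq
```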
#### Scaling Decisions
```bash
# Horizontal scaling
docker service scale bzzz-v2_bzzz-agent=5
# Vertical scaling
docker service update \
--limit-memory 8G \
--limit-cpu 4 \
bzzz-v2_bzzz-agent
# Add new node to swarm
docker swarm join-token worker
```
#### Resource Monitoring
- Set up capacity alerts at 70% utilization
- Monitor growth rate and extrapolate
- Plan infrastructure expansions 3-6 months ahead
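The 70% capacity alerts can be expressed in the same style as the alert-rules file in this commit; a sketch of the rule bodies (metric names taken from the existing `bzzz_resources` group, file name illustrative):
```bash
# Sketch: append 70% capacity warning rules for later review/merge into the rules config
cat >> capacity-alerts.yml <<'EOF'
- alert: BZZZCPUCapacityWarning
  expr: bzzz_cpu_usage_ratio > 0.70
  for: 15m
  labels: {severity: warning, service: bzzz, component: capacity}
  annotations:
    summary: "BZZZ CPU utilization above 70% capacity threshold"
- alert: BZZZDiskCapacityWarning
  expr: bzzz_disk_usage_ratio > 0.70
  for: 15m
  labels: {severity: warning, service: bzzz, component: capacity}
  annotations:
    summary: "BZZZ disk utilization above 70% capacity threshold"
EOF
```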
---
## Contact Information
**Primary Contact**: Tony (@tony)
**Team**: BZZZ Infrastructure Team
**Documentation**: https://wiki.chorus.services/bzzz
**Source Code**: https://gitea.chorus.services/tony/BZZZ
**Last Updated**: 2025-01-01
**Version**: 2.0
**Review Date**: 2025-04-01


@@ -0,0 +1,511 @@
# Enhanced Alert Rules for BZZZ v2 Infrastructure
# Service Level Objectives and Critical System Alerts
groups:
# === System Health and SLO Alerts ===
- name: bzzz_system_health
rules:
# Overall system health score
- alert: BZZZSystemHealthCritical
expr: bzzz_system_health_score < 0.5
for: 2m
labels:
severity: critical
service: bzzz
slo: availability
annotations:
summary: "BZZZ system health is critically low"
description: "System health score {{ $value }} is below critical threshold (0.5)"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-health-critical"
- alert: BZZZSystemHealthDegraded
expr: bzzz_system_health_score < 0.8
for: 5m
labels:
severity: warning
service: bzzz
slo: availability
annotations:
summary: "BZZZ system health is degraded"
description: "System health score {{ $value }} is below warning threshold (0.8)"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-health-degraded"
# Component health monitoring
- alert: BZZZComponentUnhealthy
expr: bzzz_component_health_score < 0.7
for: 3m
labels:
severity: warning
service: bzzz
component: "{{ $labels.component }}"
annotations:
summary: "BZZZ component {{ $labels.component }} is unhealthy"
description: "Component {{ $labels.component }} health score {{ $value }} is below threshold"
# === P2P Network Alerts ===
- name: bzzz_p2p_network
rules:
# Peer connectivity SLO: Maintain at least 3 connected peers
- alert: BZZZInsufficientPeers
expr: bzzz_p2p_connected_peers < 3
for: 1m
labels:
severity: critical
service: bzzz
component: p2p
slo: connectivity
annotations:
summary: "BZZZ has insufficient P2P peers"
description: "Only {{ $value }} peers connected, minimum required is 3"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-peer-connectivity"
# Message latency SLO: 95th percentile < 500ms
- alert: BZZZP2PHighLatency
expr: histogram_quantile(0.95, rate(bzzz_p2p_message_latency_seconds_bucket[5m])) > 0.5
for: 3m
labels:
severity: warning
service: bzzz
component: p2p
slo: latency
annotations:
summary: "BZZZ P2P message latency is high"
description: "95th percentile latency {{ $value }}s exceeds 500ms SLO"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-p2p-latency"
# Message loss detection
- alert: BZZZP2PMessageLoss
expr: rate(bzzz_p2p_messages_sent_total[5m]) - rate(bzzz_p2p_messages_received_total[5m]) > 0.1
for: 2m
labels:
severity: warning
service: bzzz
component: p2p
annotations:
summary: "BZZZ P2P message loss detected"
description: "Message send/receive imbalance: {{ $value }} messages/sec"
# === DHT Performance and Reliability ===
- name: bzzz_dht
rules:
# DHT operation success rate SLO: > 99%
- alert: BZZZDHTLowSuccessRate
expr: (rate(bzzz_dht_put_operations_total{status="success"}[5m]) + rate(bzzz_dht_get_operations_total{status="success"}[5m])) / (rate(bzzz_dht_put_operations_total[5m]) + rate(bzzz_dht_get_operations_total[5m])) < 0.99
for: 2m
labels:
severity: warning
service: bzzz
component: dht
slo: success_rate
annotations:
summary: "BZZZ DHT operation success rate is low"
description: "DHT success rate {{ $value | humanizePercentage }} is below 99% SLO"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-dht-success-rate"
# DHT operation latency SLO: 95th percentile < 300ms for gets
- alert: BZZZDHTHighGetLatency
expr: histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket{operation="get"}[5m])) > 0.3
for: 3m
labels:
severity: warning
service: bzzz
component: dht
slo: latency
annotations:
summary: "BZZZ DHT get operations are slow"
description: "95th percentile get latency {{ $value }}s exceeds 300ms SLO"
# DHT replication health
- alert: BZZZDHTReplicationDegraded
expr: avg(bzzz_dht_replication_factor) < 2
for: 5m
labels:
severity: warning
service: bzzz
component: dht
slo: durability
annotations:
summary: "BZZZ DHT replication is degraded"
description: "Average replication factor {{ $value }} is below target of 3"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-dht-replication"
# Provider record staleness
- alert: BZZZDHTStaleProviders
expr: increase(bzzz_dht_provider_records[1h]) == 0 and bzzz_dht_content_keys > 0
for: 10m
labels:
severity: warning
service: bzzz
component: dht
annotations:
summary: "BZZZ DHT provider records are not updating"
description: "No provider record updates in the last hour despite having content"
# === Election System Stability ===
- name: bzzz_election
rules:
# Leadership stability: Avoid frequent leadership changes
- alert: BZZZFrequentLeadershipChanges
expr: increase(bzzz_leadership_changes_total[1h]) > 3
for: 0m
labels:
severity: warning
service: bzzz
component: election
annotations:
summary: "BZZZ leadership is unstable"
description: "{{ $value }} leadership changes in the last hour"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-leadership-instability"
# Election timeout
- alert: BZZZElectionInProgress
expr: bzzz_election_state{state="electing"} == 1
for: 2m
labels:
severity: warning
service: bzzz
component: election
annotations:
summary: "BZZZ election taking too long"
description: "Election has been in progress for more than 2 minutes"
# No admin elected
- alert: BZZZNoAdminElected
expr: bzzz_election_state{state="idle"} == 1 and absent(bzzz_heartbeats_received_total)
for: 1m
labels:
severity: critical
service: bzzz
component: election
annotations:
summary: "BZZZ has no elected admin"
description: "System is idle but no heartbeats are being received"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-no-admin"
# Heartbeat monitoring
- alert: BZZZHeartbeatMissing
expr: increase(bzzz_heartbeats_received_total[2m]) == 0
for: 1m
labels:
severity: critical
service: bzzz
component: election
annotations:
summary: "BZZZ admin heartbeat missing"
description: "No heartbeats received from admin in the last 2 minutes"
# === PubSub Messaging System ===
- name: bzzz_pubsub
rules:
# Message processing rate
- alert: BZZZPubSubHighMessageRate
expr: rate(bzzz_pubsub_messages_total[1m]) > 1000
for: 2m
labels:
severity: warning
service: bzzz
component: pubsub
annotations:
summary: "BZZZ PubSub message rate is very high"
description: "Processing {{ $value }} messages/sec, may indicate spam or DoS"
# Message latency
- alert: BZZZPubSubHighLatency
expr: histogram_quantile(0.95, rate(bzzz_pubsub_message_latency_seconds_bucket[5m])) > 1.0
for: 3m
labels:
severity: warning
service: bzzz
component: pubsub
slo: latency
annotations:
summary: "BZZZ PubSub message latency is high"
description: "95th percentile latency {{ $value }}s exceeds 1s threshold"
# Topic monitoring
- alert: BZZZPubSubNoTopics
expr: bzzz_pubsub_topics == 0
for: 5m
labels:
severity: warning
service: bzzz
component: pubsub
annotations:
summary: "BZZZ PubSub has no active topics"
description: "No PubSub topics are active, system may be isolated"
# === Task Management and Processing ===
- name: bzzz_tasks
rules:
# Task queue backup
- alert: BZZZTaskQueueBackup
expr: bzzz_tasks_queued > 100
for: 5m
labels:
severity: warning
service: bzzz
component: tasks
annotations:
summary: "BZZZ task queue is backing up"
description: "{{ $value }} tasks are queued, may indicate processing issues"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-task-queue"
# Task success rate SLO: > 95%
- alert: BZZZTaskLowSuccessRate
expr: rate(bzzz_tasks_completed_total{status="success"}[10m]) / rate(bzzz_tasks_completed_total[10m]) < 0.95
for: 5m
labels:
severity: warning
service: bzzz
component: tasks
slo: success_rate
annotations:
summary: "BZZZ task success rate is low"
description: "Task success rate {{ $value | humanizePercentage }} is below 95% SLO"
# Task processing latency
- alert: BZZZTaskHighProcessingTime
expr: histogram_quantile(0.95, rate(bzzz_task_duration_seconds_bucket[5m])) > 300
for: 3m
labels:
severity: warning
service: bzzz
component: tasks
annotations:
summary: "BZZZ task processing time is high"
description: "95th percentile task duration {{ $value }}s exceeds 5 minutes"
# === SLURP Context Generation ===
- name: bzzz_slurp
rules:
# Context generation success rate
- alert: BZZZSLURPLowSuccessRate
expr: rate(bzzz_slurp_contexts_generated_total{status="success"}[10m]) / rate(bzzz_slurp_contexts_generated_total[10m]) < 0.90
for: 5m
labels:
severity: warning
service: bzzz
component: slurp
annotations:
summary: "SLURP context generation success rate is low"
description: "Success rate {{ $value | humanizePercentage }} is below 90%"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-slurp-generation"
# Generation queue backup
- alert: BZZZSLURPQueueBackup
expr: bzzz_slurp_queue_length > 50
for: 10m
labels:
severity: warning
service: bzzz
component: slurp
annotations:
summary: "SLURP generation queue is backing up"
description: "{{ $value }} contexts are queued for generation"
# Generation time SLO: 95th percentile < 2 minutes
- alert: BZZZSLURPSlowGeneration
expr: histogram_quantile(0.95, rate(bzzz_slurp_generation_time_seconds_bucket[10m])) > 120
for: 5m
labels:
severity: warning
service: bzzz
component: slurp
slo: latency
annotations:
summary: "SLURP context generation is slow"
description: "95th percentile generation time {{ $value }}s exceeds 2 minutes"
# === UCXI Protocol Resolution ===
- name: bzzz_ucxi
rules:
# Resolution success rate SLO: > 99%
- alert: BZZZUCXILowSuccessRate
expr: rate(bzzz_ucxi_requests_total{status=~"2.."}[5m]) / rate(bzzz_ucxi_requests_total[5m]) < 0.99
for: 3m
labels:
severity: warning
service: bzzz
component: ucxi
slo: success_rate
annotations:
summary: "UCXI resolution success rate is low"
description: "Success rate {{ $value | humanizePercentage }} is below 99% SLO"
# Resolution latency SLO: 95th percentile < 100ms
- alert: BZZZUCXIHighLatency
expr: histogram_quantile(0.95, rate(bzzz_ucxi_resolution_latency_seconds_bucket[5m])) > 0.1
for: 3m
labels:
severity: warning
service: bzzz
component: ucxi
slo: latency
annotations:
summary: "UCXI resolution latency is high"
description: "95th percentile latency {{ $value }}s exceeds 100ms SLO"
# === Resource Utilization ===
- name: bzzz_resources
rules:
# CPU utilization
- alert: BZZZHighCPUUsage
expr: bzzz_cpu_usage_ratio > 0.85
for: 5m
labels:
severity: warning
service: bzzz
component: system
annotations:
summary: "BZZZ CPU usage is high"
description: "CPU usage {{ $value | humanizePercentage }} exceeds 85%"
# Memory utilization
- alert: BZZZHighMemoryUsage
expr: bzzz_memory_usage_bytes / (1024*1024*1024) > 8
for: 3m
labels:
severity: warning
service: bzzz
component: system
annotations:
summary: "BZZZ memory usage is high"
description: "Memory usage {{ $value | humanize1024 }}B is high"
# Disk utilization
- alert: BZZZHighDiskUsage
expr: bzzz_disk_usage_ratio > 0.90
for: 5m
labels:
severity: critical
service: bzzz
component: system
annotations:
summary: "BZZZ disk usage is critical"
description: "Disk usage {{ $value | humanizePercentage }} on {{ $labels.mount_point }} exceeds 90%"
# Goroutine leak detection
- alert: BZZZGoroutineLeak
expr: increase(bzzz_goroutines[30m]) > 1000
for: 5m
labels:
severity: warning
service: bzzz
component: system
annotations:
summary: "Possible BZZZ goroutine leak"
description: "Goroutine count increased by {{ $value }} in 30 minutes"
# === Error Rate Monitoring ===
- name: bzzz_errors
rules:
# General error rate
- alert: BZZZHighErrorRate
expr: rate(bzzz_errors_total[5m]) > 10
for: 2m
labels:
severity: warning
service: bzzz
annotations:
summary: "BZZZ error rate is high"
description: "Error rate {{ $value }} errors/sec in component {{ $labels.component }}"
# Panic detection
- alert: BZZZPanicsDetected
expr: increase(bzzz_panics_total[5m]) > 0
for: 0m
labels:
severity: critical
service: bzzz
annotations:
summary: "BZZZ panic detected"
description: "{{ $value }} panic(s) occurred in the last 5 minutes"
runbook_url: "https://wiki.chorus.services/runbooks/bzzz-panic-recovery"
# === Health Check Monitoring ===
- name: bzzz_health_checks
rules:
# Health check failure rate
- alert: BZZZHealthCheckFailures
expr: rate(bzzz_health_checks_failed_total[5m]) > 0.1
for: 2m
labels:
severity: warning
service: bzzz
component: health
annotations:
summary: "BZZZ health check failures detected"
description: "Health check {{ $labels.check_name }} failing at {{ $value }} failures/sec"
# Critical health check failure
- alert: BZZZCriticalHealthCheckFailed
expr: increase(bzzz_health_checks_failed_total{check_name=~".*-enhanced|p2p-connectivity"}[2m]) > 0
for: 0m
labels:
severity: critical
service: bzzz
component: health
annotations:
summary: "Critical BZZZ health check failed"
description: "Critical health check {{ $labels.check_name }} failed: {{ $labels.reason }}"
# === Service Level Indicator Recording Rules ===
- name: bzzz_sli_recording
interval: 30s
rules:
# DHT operation SLI
- record: bzzz:dht_success_rate
expr: (rate(bzzz_dht_put_operations_total{status="success"}[5m]) + rate(bzzz_dht_get_operations_total{status="success"}[5m])) / (rate(bzzz_dht_put_operations_total[5m]) + rate(bzzz_dht_get_operations_total[5m]))
# P2P connectivity SLI
- record: bzzz:p2p_connectivity_ratio
expr: bzzz_p2p_connected_peers / 10 # Target of 10 peers
# UCXI success rate SLI
- record: bzzz:ucxi_success_rate
expr: rate(bzzz_ucxi_requests_total{status=~"2.."}[5m]) / rate(bzzz_ucxi_requests_total[5m])
# Task success rate SLI
- record: bzzz:task_success_rate
expr: rate(bzzz_tasks_completed_total{status="success"}[5m]) / rate(bzzz_tasks_completed_total[5m])
# Overall availability SLI
- record: bzzz:overall_availability
expr: bzzz_system_health_score
# === Multi-Window Multi-Burn-Rate Alerts ===
- name: bzzz_slo_alerts
rules:
# Fast burn rate (2% of error budget in 1 hour)
- alert: BZZZErrorBudgetBurnHigh
# NOTE: a full multi-window alert would AND this with a shorter-window (e.g. 5m) version of
# the same SLI; only the long-window recording rule exists, so a single condition is used here.
expr: (1 - bzzz:dht_success_rate) > (14.4 * 0.01) # 14.4x burn rate for 99% SLO
for: 2m
labels:
severity: critical
service: bzzz
burnrate: fast
slo: dht_success_rate
annotations:
summary: "BZZZ DHT error budget burning fast"
description: "DHT error budget will be exhausted in {{ with query \"(0.01 - (1 - bzzz:dht_success_rate)) / (1 - bzzz:dht_success_rate) * 1\" }}{{ . | first | value | humanizeDuration }}{{ end }}"
# Slow burn rate (10% of error budget in 6 hours)
- alert: BZZZErrorBudgetBurnSlow
expr: (1 - bzzz:dht_success_rate) > (6 * 0.01) # 6x burn rate
for: 15m
labels:
severity: warning
service: bzzz
burnrate: slow
slo: dht_success_rate
annotations:
summary: "BZZZ DHT error budget burning slowly"
description: "DHT error budget depletion rate is concerning"


@@ -0,0 +1,533 @@
version: '3.8'
# Enhanced BZZZ Monitoring Stack for Docker Swarm
# Provides comprehensive observability for BZZZ distributed system
services:
# Prometheus - Metrics Collection and Alerting
prometheus:
image: prom/prometheus:v2.45.0
networks:
- tengig
- monitoring
ports:
- "9090:9090"
volumes:
- prometheus_data:/prometheus
- /rust/bzzz-v2/monitoring/prometheus:/etc/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.path=/prometheus'
- '--storage.tsdb.retention.time=30d'
- '--storage.tsdb.retention.size=50GB'
- '--web.console.libraries=/etc/prometheus/console_libraries'
- '--web.console.templates=/etc/prometheus/consoles'
- '--web.enable-lifecycle'
- '--web.enable-admin-api'
- '--web.external-url=https://prometheus.chorus.services'
- '--alertmanager.notification-queue-capacity=10000'
deploy:
replicas: 1
placement:
constraints:
- node.hostname == walnut # Place on main node
resources:
limits:
memory: 4G
cpus: '2.0'
reservations:
memory: 2G
cpus: '1.0'
restart_policy:
condition: on-failure
delay: 30s
labels:
- "traefik.enable=true"
- "traefik.http.routers.prometheus.rule=Host(`prometheus.chorus.services`)"
- "traefik.http.services.prometheus.loadbalancer.server.port=9090"
- "traefik.http.routers.prometheus.tls=true"
healthcheck:
test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
interval: 30s
timeout: 10s
retries: 3
configs:
- source: prometheus_config
target: /etc/prometheus/prometheus.yml
- source: prometheus_alerts
target: /etc/prometheus/rules.yml
# Grafana - Visualization and Dashboards
grafana:
image: grafana/grafana:10.0.3
networks:
- tengig
- monitoring
ports:
- "3000:3000"
volumes:
- grafana_data:/var/lib/grafana
- /rust/bzzz-v2/monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards
- /rust/bzzz-v2/monitoring/grafana/datasources:/etc/grafana/provisioning/datasources
environment:
- GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
- GF_INSTALL_PLUGINS=grafana-piechart-panel,grafana-worldmap-panel,vonage-status-panel
- GF_FEATURE_TOGGLES_ENABLE=publicDashboards
- GF_SERVER_ROOT_URL=https://grafana.chorus.services
- GF_ANALYTICS_REPORTING_ENABLED=false
- GF_ANALYTICS_CHECK_FOR_UPDATES=false
- GF_LOG_LEVEL=warn
secrets:
- grafana_admin_password
deploy:
replicas: 1
placement:
constraints:
- node.hostname == walnut
resources:
limits:
memory: 2G
cpus: '1.0'
reservations:
memory: 512M
cpus: '0.5'
restart_policy:
condition: on-failure
delay: 10s
labels:
- "traefik.enable=true"
- "traefik.http.routers.grafana.rule=Host(`grafana.chorus.services`)"
- "traefik.http.services.grafana.loadbalancer.server.port=3000"
- "traefik.http.routers.grafana.tls=true"
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
interval: 30s
timeout: 10s
retries: 3
# AlertManager - Alert Routing and Notification
alertmanager:
image: prom/alertmanager:v0.25.0
networks:
- tengig
- monitoring
ports:
- "9093:9093"
volumes:
- alertmanager_data:/alertmanager
- /rust/bzzz-v2/monitoring/alertmanager:/etc/alertmanager
command:
- '--config.file=/etc/alertmanager/config.yml'
- '--storage.path=/alertmanager'
- '--web.external-url=https://alerts.chorus.services'
- '--web.route-prefix=/'
- '--cluster.listen-address=0.0.0.0:9094'
- '--log.level=info'
deploy:
replicas: 1
placement:
constraints:
- node.hostname == ironwood
resources:
limits:
memory: 1G
cpus: '0.5'
reservations:
memory: 256M
cpus: '0.25'
restart_policy:
condition: on-failure
labels:
- "traefik.enable=true"
- "traefik.http.routers.alertmanager.rule=Host(`alerts.chorus.services`)"
- "traefik.http.services.alertmanager.loadbalancer.server.port=9093"
- "traefik.http.routers.alertmanager.tls=true"
configs:
- source: alertmanager_config
target: /etc/alertmanager/config.yml
secrets:
- slack_webhook_url
- pagerduty_integration_key
# Node Exporter - System Metrics (deployed on all nodes)
node-exporter:
image: prom/node-exporter:v1.6.1
networks:
- monitoring
ports:
- "9100:9100"
volumes:
- /proc:/host/proc:ro
- /sys:/host/sys:ro
- /:/rootfs:ro
- /run/systemd/private:/run/systemd/private:ro
command:
- '--path.procfs=/host/proc'
- '--path.sysfs=/host/sys'
- '--path.rootfs=/rootfs'
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
- '--collector.systemd'
- '--collector.systemd.unit-include=(bzzz|docker|prometheus|grafana)\.service'
- '--web.listen-address=0.0.0.0:9100'
deploy:
mode: global # Deploy on every node
resources:
limits:
memory: 256M
cpus: '0.2'
reservations:
memory: 128M
cpus: '0.1'
restart_policy:
condition: on-failure
# cAdvisor - Container Metrics (deployed on all nodes)
cadvisor:
image: gcr.io/cadvisor/cadvisor:v0.47.2
networks:
- monitoring
ports:
- "8080:8080"
volumes:
- /:/rootfs:ro
- /var/run:/var/run:ro
- /sys:/sys:ro
- /var/lib/docker/:/var/lib/docker:ro
- /dev/disk/:/dev/disk:ro
deploy:
mode: global
resources:
limits:
memory: 512M
cpus: '0.3'
reservations:
memory: 256M
cpus: '0.15'
restart_policy:
condition: on-failure
healthcheck:
test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:8080/healthz"]
interval: 30s
timeout: 10s
retries: 3
# BZZZ P2P Network Exporter - Custom metrics for P2P network health
bzzz-p2p-exporter:
image: registry.home.deepblack.cloud/bzzz-p2p-exporter:v2.0.0
networks:
- monitoring
- bzzz-internal
ports:
- "9200:9200"
environment:
- BZZZ_ENDPOINTS=http://bzzz-agent:9000
- SCRAPE_INTERVAL=15s
- LOG_LEVEL=info
deploy:
replicas: 1
placement:
constraints:
- node.hostname == walnut
resources:
limits:
memory: 256M
cpus: '0.2'
reservations:
memory: 128M
cpus: '0.1'
restart_policy:
condition: on-failure
# DHT Monitor - DHT-specific metrics and health monitoring
dht-monitor:
image: registry.home.deepblack.cloud/bzzz-dht-monitor:v2.0.0
networks:
- monitoring
- bzzz-internal
ports:
- "9201:9201"
environment:
- DHT_BOOTSTRAP_NODES=walnut:9101,ironwood:9102,acacia:9103
- REPLICATION_CHECK_INTERVAL=5m
- PROVIDER_CHECK_INTERVAL=2m
- LOG_LEVEL=info
deploy:
replicas: 1
placement:
constraints:
- node.hostname == ironwood
resources:
limits:
memory: 512M
cpus: '0.3'
reservations:
memory: 256M
cpus: '0.15'
restart_policy:
condition: on-failure
# Content Monitor - Content availability and integrity monitoring
content-monitor:
image: registry.home.deepblack.cloud/bzzz-content-monitor:v2.0.0
networks:
- monitoring
- bzzz-internal
ports:
- "9202:9202"
volumes:
- /rust/bzzz-v2/data/blobs:/app/blobs:ro
environment:
- CONTENT_PATH=/app/blobs
- INTEGRITY_CHECK_INTERVAL=15m
- AVAILABILITY_CHECK_INTERVAL=5m
- LOG_LEVEL=info
deploy:
replicas: 1
placement:
constraints:
- node.hostname == acacia
resources:
limits:
memory: 512M
cpus: '0.3'
reservations:
memory: 256M
cpus: '0.15'
restart_policy:
condition: on-failure
# OpenAI Cost Monitor - Track OpenAI API usage and costs
openai-cost-monitor:
image: registry.home.deepblack.cloud/bzzz-openai-cost-monitor:v2.0.0
networks:
- monitoring
- bzzz-internal
ports:
- "9203:9203"
environment:
- OPENAI_PROXY_ENDPOINT=http://openai-proxy:3002
- COST_TRACKING_ENABLED=true
- POSTGRES_HOST=postgres
- LOG_LEVEL=info
secrets:
- postgres_password
deploy:
replicas: 1
placement:
constraints:
- node.hostname == walnut
resources:
limits:
memory: 256M
cpus: '0.2'
reservations:
memory: 128M
cpus: '0.1'
restart_policy:
condition: on-failure
# Blackbox Exporter - External endpoint monitoring
blackbox-exporter:
image: prom/blackbox-exporter:v0.24.0
networks:
- monitoring
- tengig
ports:
- "9115:9115"
volumes:
- /rust/bzzz-v2/monitoring/blackbox:/etc/blackbox_exporter
command:
- '--config.file=/etc/blackbox_exporter/config.yml'
- '--web.listen-address=0.0.0.0:9115'
deploy:
replicas: 1
placement:
constraints:
- node.hostname == ironwood
resources:
limits:
memory: 128M
cpus: '0.1'
reservations:
memory: 64M
cpus: '0.05'
restart_policy:
condition: on-failure
configs:
- source: blackbox_config
target: /etc/blackbox_exporter/config.yml
# Loki - Log Aggregation
loki:
image: grafana/loki:2.8.0
networks:
- monitoring
ports:
- "3100:3100"
volumes:
- loki_data:/loki
- /rust/bzzz-v2/monitoring/loki:/etc/loki
command:
- '-config.file=/etc/loki/config.yml'
- '-target=all'
deploy:
replicas: 1
placement:
constraints:
- node.hostname == walnut
resources:
limits:
memory: 2G
cpus: '1.0'
reservations:
memory: 1G
cpus: '0.5'
restart_policy:
condition: on-failure
configs:
- source: loki_config
target: /etc/loki/config.yml
# Promtail - Log Collection Agent (deployed on all nodes)
promtail:
image: grafana/promtail:2.8.0
networks:
- monitoring
volumes:
- /var/log:/var/log:ro
- /var/lib/docker/containers:/var/lib/docker/containers:ro
- /rust/bzzz-v2/monitoring/promtail:/etc/promtail
command:
- '-config.file=/etc/promtail/config.yml'
- '-server.http-listen-port=9080'
deploy:
mode: global
resources:
limits:
memory: 256M
cpus: '0.2'
reservations:
memory: 128M
cpus: '0.1'
restart_policy:
condition: on-failure
configs:
- source: promtail_config
target: /etc/promtail/config.yml
# Jaeger - Distributed Tracing (Optional)
jaeger:
image: jaegertracing/all-in-one:1.47
networks:
- monitoring
- bzzz-internal
ports:
- "14268:14268" # HTTP collector
- "16686:16686" # Web UI
environment:
- COLLECTOR_OTLP_ENABLED=true
- SPAN_STORAGE_TYPE=memory
deploy:
replicas: 1
placement:
constraints:
- node.hostname == acacia
resources:
limits:
memory: 1G
cpus: '0.5'
reservations:
memory: 512M
cpus: '0.25'
restart_policy:
condition: on-failure
labels:
- "traefik.enable=true"
- "traefik.http.routers.jaeger.rule=Host(`tracing.chorus.services`)"
- "traefik.http.services.jaeger.loadbalancer.server.port=16686"
- "traefik.http.routers.jaeger.tls=true"
networks:
tengig:
external: true
monitoring:
driver: overlay
internal: true
attachable: false
ipam:
driver: default
config:
- subnet: 10.201.0.0/16
bzzz-internal:
external: true
volumes:
prometheus_data:
driver: local
driver_opts:
type: nfs
o: addr=192.168.1.27,rw,sync
device: ":/rust/bzzz-v2/monitoring/prometheus/data"
grafana_data:
driver: local
driver_opts:
type: nfs
o: addr=192.168.1.27,rw,sync
device: ":/rust/bzzz-v2/monitoring/grafana/data"
alertmanager_data:
driver: local
driver_opts:
type: nfs
o: addr=192.168.1.27,rw,sync
device: ":/rust/bzzz-v2/monitoring/alertmanager/data"
loki_data:
driver: local
driver_opts:
type: nfs
o: addr=192.168.1.27,rw,sync
device: ":/rust/bzzz-v2/monitoring/loki/data"
secrets:
grafana_admin_password:
external: true
name: bzzz_grafana_admin_password
slack_webhook_url:
external: true
name: bzzz_slack_webhook_url
pagerduty_integration_key:
external: true
name: bzzz_pagerduty_integration_key
postgres_password:
external: true
name: bzzz_postgres_password
configs:
prometheus_config:
external: true
name: bzzz_prometheus_config_v2
prometheus_alerts:
external: true
name: bzzz_prometheus_alerts_v2
alertmanager_config:
external: true
name: bzzz_alertmanager_config_v2
blackbox_config:
external: true
name: bzzz_blackbox_config_v2
loki_config:
external: true
name: bzzz_loki_config_v2
promtail_config:
external: true
name: bzzz_promtail_config_v2


@@ -0,0 +1,615 @@
#!/bin/bash
# BZZZ Enhanced Monitoring Stack Deployment Script
# Deploys comprehensive monitoring, metrics, and health checking infrastructure
set -euo pipefail
# Script configuration
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOG_FILE="/tmp/bzzz-deploy-${TIMESTAMP}.log"
# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color
# Configuration
ENVIRONMENT=${ENVIRONMENT:-"production"}
DRY_RUN=${DRY_RUN:-"false"}
BACKUP_EXISTING=${BACKUP_EXISTING:-"true"}
HEALTH_CHECK_TIMEOUT=${HEALTH_CHECK_TIMEOUT:-300}
# Docker configuration
DOCKER_REGISTRY="registry.home.deepblack.cloud"
STACK_NAME="bzzz-monitoring-v2"
CONFIG_VERSION="v2"
# Logging function
log() {
local level=$1
shift
local message="$*"
local timestamp=$(date '+%Y-%m-%d %H:%M:%S')
case $level in
ERROR)
echo -e "${RED}[ERROR]${NC} $message" >&2
;;
WARN)
echo -e "${YELLOW}[WARN]${NC} $message"
;;
INFO)
echo -e "${GREEN}[INFO]${NC} $message"
;;
DEBUG)
echo -e "${BLUE}[DEBUG]${NC} $message"
;;
esac
echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
}
# Error handler
error_handler() {
local line_no=$1
log ERROR "Script failed at line $line_no"
log ERROR "Check log file: $LOG_FILE"
exit 1
}
trap 'error_handler $LINENO' ERR
# Check prerequisites
check_prerequisites() {
log INFO "Checking prerequisites..."
# Check if running on Docker Swarm manager
if ! docker info --format '{{.Swarm.LocalNodeState}}' | grep -q "active"; then
log ERROR "This script must be run on a Docker Swarm manager node"
exit 1
fi
# Check required tools
local required_tools=("docker" "jq" "curl")
for tool in "${required_tools[@]}"; do
if ! command -v "$tool" >/dev/null 2>&1; then
log ERROR "Required tool not found: $tool"
exit 1
fi
done
# Check network connectivity to registry
if ! docker pull "$DOCKER_REGISTRY/bzzz:v2.0.0" >/dev/null 2>&1; then
log WARN "Unable to pull from registry, using local images"
fi
log INFO "Prerequisites check completed"
}
# Create necessary directories
setup_directories() {
log INFO "Setting up directories..."
local dirs=(
"/rust/bzzz-v2/monitoring/prometheus/data"
"/rust/bzzz-v2/monitoring/grafana/data"
"/rust/bzzz-v2/monitoring/alertmanager/data"
"/rust/bzzz-v2/monitoring/loki/data"
"/rust/bzzz-v2/backups/monitoring"
)
for dir in "${dirs[@]}"; do
if [[ "$DRY_RUN" != "true" ]]; then
sudo mkdir -p "$dir"
sudo chown -R 65534:65534 "$dir" # nobody user for containers
fi
log DEBUG "Created directory: $dir"
done
}
# Backup existing configuration
backup_existing_config() {
if [[ "$BACKUP_EXISTING" != "true" ]]; then
log INFO "Skipping backup (BACKUP_EXISTING=false)"
return
fi
log INFO "Backing up existing configuration..."
local backup_dir="/rust/bzzz-v2/backups/monitoring/backup_${TIMESTAMP}"
if [[ "$DRY_RUN" != "true" ]]; then
mkdir -p "$backup_dir"
# Backup Docker secrets
docker secret ls --filter name=bzzz_ --format "{{.Name}}" | while read -r secret; do
if docker secret inspect "$secret" >/dev/null 2>&1; then
docker secret inspect "$secret" > "$backup_dir/${secret}.json"
log DEBUG "Backed up secret: $secret"
fi
done
# Backup Docker configs
docker config ls --filter name=bzzz_ --format "{{.Name}}" | while read -r config; do
if docker config inspect "$config" >/dev/null 2>&1; then
docker config inspect "$config" > "$backup_dir/${config}.json"
log DEBUG "Backed up config: $config"
fi
done
# Backup service definitions
if docker stack services "$STACK_NAME" >/dev/null 2>&1; then
docker stack services "$STACK_NAME" --format "{{.Name}}" | while read -r service; do
docker service inspect "$service" > "$backup_dir/${service}-service.json"
done
fi
fi
log INFO "Backup completed: $backup_dir"
}
# Create Docker secrets
create_secrets() {
log INFO "Creating Docker secrets..."
local secrets=(
"bzzz_grafana_admin_password:$(openssl rand -base64 32)"
"bzzz_postgres_password:$(openssl rand -base64 32)"
)
# Check if secrets directory exists
local secrets_dir="$HOME/chorus/business/secrets"
if [[ -d "$secrets_dir" ]]; then
# Use existing secrets if available
if [[ -f "$secrets_dir/grafana-admin-password" ]]; then
secrets[0]="bzzz_grafana_admin_password:$(cat "$secrets_dir/grafana-admin-password")"
fi
if [[ -f "$secrets_dir/postgres-password" ]]; then
secrets[1]="bzzz_postgres_password:$(cat "$secrets_dir/postgres-password")"
fi
fi
for secret_def in "${secrets[@]}"; do
local secret_name="${secret_def%%:*}"
local secret_value="${secret_def#*:}"
if docker secret inspect "$secret_name" >/dev/null 2>&1; then
log DEBUG "Secret already exists: $secret_name"
else
if [[ "$DRY_RUN" != "true" ]]; then
echo "$secret_value" | docker secret create "$secret_name" -
log INFO "Created secret: $secret_name"
else
log DEBUG "Would create secret: $secret_name"
fi
fi
done
}
# Create Docker configs
create_configs() {
log INFO "Creating Docker configs..."
local configs=(
"bzzz_prometheus_config_${CONFIG_VERSION}:${PROJECT_ROOT}/monitoring/configs/prometheus.yml"
"bzzz_prometheus_alerts_${CONFIG_VERSION}:${PROJECT_ROOT}/monitoring/configs/enhanced-alert-rules.yml"
"bzzz_grafana_datasources_${CONFIG_VERSION}:${PROJECT_ROOT}/monitoring/configs/grafana-datasources.yml"
"bzzz_alertmanager_config_${CONFIG_VERSION}:${PROJECT_ROOT}/monitoring/configs/alertmanager.yml"
)
for config_def in "${configs[@]}"; do
local config_name="${config_def%%:*}"
local config_file="${config_def#*:}"
if [[ ! -f "$config_file" ]]; then
log WARN "Config file not found: $config_file"
continue
fi
if docker config inspect "$config_name" >/dev/null 2>&1; then
log DEBUG "Config already exists: $config_name"
# Remove old config if exists
if [[ "$DRY_RUN" != "true" ]]; then
local old_config_name="${config_name%_${CONFIG_VERSION}}"
if docker config inspect "$old_config_name" >/dev/null 2>&1; then
docker config rm "$old_config_name" || true
fi
fi
else
if [[ "$DRY_RUN" != "true" ]]; then
docker config create "$config_name" "$config_file"
log INFO "Created config: $config_name"
else
log DEBUG "Would create config: $config_name from $config_file"
fi
fi
done
}
# Create missing config files
create_missing_configs() {
log INFO "Creating missing configuration files..."
# Create Grafana datasources config
local grafana_datasources="${PROJECT_ROOT}/monitoring/configs/grafana-datasources.yml"
if [[ ! -f "$grafana_datasources" ]]; then
cat > "$grafana_datasources" <<EOF
apiVersion: 1
datasources:
- name: Prometheus
type: prometheus
access: proxy
url: http://prometheus:9090
isDefault: true
editable: true
- name: Loki
type: loki
access: proxy
url: http://loki:3100
editable: true
- name: Jaeger
type: jaeger
access: proxy
url: http://jaeger:16686
editable: true
EOF
log INFO "Created Grafana datasources config"
fi
# Create AlertManager config
local alertmanager_config="${PROJECT_ROOT}/monitoring/configs/alertmanager.yml"
if [[ ! -f "$alertmanager_config" ]]; then
cat > "$alertmanager_config" <<EOF
global:
smtp_smarthost: 'localhost:587'
smtp_from: 'alerts@chorus.services'
slack_api_url_file: '/run/secrets/slack_webhook_url'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'critical-alerts'
- match:
service: bzzz
receiver: 'bzzz-alerts'
receivers:
- name: 'default'
slack_configs:
- channel: '#bzzz-alerts'
title: 'BZZZ Alert: {{ .CommonAnnotations.summary }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'critical-alerts'
slack_configs:
- channel: '#bzzz-critical'
title: 'CRITICAL: {{ .CommonAnnotations.summary }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'bzzz-alerts'
slack_configs:
- channel: '#bzzz-alerts'
title: 'BZZZ: {{ .CommonAnnotations.summary }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
EOF
log INFO "Created AlertManager config"
fi
}
# Deploy monitoring stack
deploy_monitoring_stack() {
log INFO "Deploying monitoring stack..."
local compose_file="${PROJECT_ROOT}/monitoring/docker-compose.enhanced.yml"
if [[ ! -f "$compose_file" ]]; then
log ERROR "Compose file not found: $compose_file"
exit 1
fi
if [[ "$DRY_RUN" != "true" ]]; then
# Deploy the stack
docker stack deploy -c "$compose_file" "$STACK_NAME"
log INFO "Stack deployment initiated: $STACK_NAME"
# Wait for services to be ready
log INFO "Waiting for services to be ready..."
local max_attempts=30
local attempt=0
while [[ $attempt -lt $max_attempts ]]; do
local ready_services=0
local total_services=0
# Count ready services
while read -r service; do
total_services=$((total_services + 1))
local replicas_info
replicas_info=$(docker service ls --filter name="$service" --format "{{.Replicas}}")
if [[ "$replicas_info" =~ ^([0-9]+)/([0-9]+)$ ]]; then
local current="${BASH_REMATCH[1]}"
local desired="${BASH_REMATCH[2]}"
if [[ "$current" -eq "$desired" ]]; then
ready_services=$((ready_services + 1))
fi
fi
done < <(docker stack services "$STACK_NAME" --format "{{.Name}}")
if [[ $ready_services -eq $total_services ]]; then
log INFO "All services are ready ($ready_services/$total_services)"
break
else
log DEBUG "Services ready: $ready_services/$total_services"
sleep 10
attempt=$((attempt + 1))
fi
done
if [[ $attempt -eq $max_attempts ]]; then
log WARN "Timeout waiting for all services to be ready"
fi
else
log DEBUG "Would deploy stack with compose file: $compose_file"
fi
}
# Perform health checks
perform_health_checks() {
log INFO "Performing health checks..."
if [[ "$DRY_RUN" == "true" ]]; then
log DEBUG "Skipping health checks in dry run mode"
return
fi
local endpoints=(
"http://localhost:9090/-/healthy:Prometheus"
"http://localhost:3000/api/health:Grafana"
"http://localhost:9093/-/healthy:AlertManager"
)
local max_attempts=$((HEALTH_CHECK_TIMEOUT / 10))
local attempt=0
while [[ $attempt -lt $max_attempts ]]; do
local healthy_endpoints=0
for endpoint_def in "${endpoints[@]}"; do
local endpoint="${endpoint_def%:*}"
local service="${endpoint_def##*:}"
if curl -sf "$endpoint" >/dev/null 2>&1; then
healthy_endpoints=$((healthy_endpoints + 1))
log DEBUG "Health check passed: $service"
else
log DEBUG "Health check pending: $service"
fi
done
if [[ $healthy_endpoints -eq ${#endpoints[@]} ]]; then
log INFO "All health checks passed"
return
fi
sleep 10
attempt=$((attempt + 1))
done
log WARN "Some health checks failed after ${HEALTH_CHECK_TIMEOUT}s timeout"
}
# Validate deployment
validate_deployment() {
log INFO "Validating deployment..."
if [[ "$DRY_RUN" == "true" ]]; then
log DEBUG "Skipping validation in dry run mode"
return
fi
# Check stack services
local services
services=$(docker stack services "$STACK_NAME" --format "{{.Name}}" | wc -l)
log INFO "Deployed services: $services"
# Check if Prometheus is collecting metrics
sleep 30 # Allow time for initial metric collection
if curl -sf "http://localhost:9090/api/v1/query?query=up" | jq -r '.data.result | length' | grep -q "^[1-9]"; then
log INFO "Prometheus is collecting metrics"
else
log WARN "Prometheus may not be collecting metrics yet"
fi
# Check if Grafana can connect to Prometheus
if curl -sf "http://admin:admin@localhost:3000/api/datasources/proxy/1/api/v1/query?query=up" >/dev/null 2>&1; then
log INFO "Grafana can connect to Prometheus"
else
log WARN "Grafana datasource connection may be pending"
fi
# Check AlertManager configuration
if curl -sf "http://localhost:9093/api/v1/status" >/dev/null 2>&1; then
log INFO "AlertManager is operational"
else
log WARN "AlertManager may not be ready"
fi
}
# Import Grafana dashboards
import_dashboards() {
log INFO "Importing Grafana dashboards..."
if [[ "$DRY_RUN" == "true" ]]; then
log DEBUG "Skipping dashboard import in dry run mode"
return
fi
# Wait for Grafana to be ready
local max_attempts=30
local attempt=0
while [[ $attempt -lt $max_attempts ]]; do
if curl -sf "http://admin:admin@localhost:3000/api/health" >/dev/null 2>&1; then
break
fi
sleep 5
attempt=$((attempt + 1))
done
if [[ $attempt -eq $max_attempts ]]; then
log WARN "Grafana not ready for dashboard import"
return
fi
# Import dashboards
local dashboard_dir="${PROJECT_ROOT}/monitoring/grafana-dashboards"
if [[ -d "$dashboard_dir" ]]; then
for dashboard_file in "$dashboard_dir"/*.json; do
if [[ -f "$dashboard_file" ]]; then
local dashboard_name
dashboard_name=$(basename "$dashboard_file" .json)
if curl -X POST \
-H "Content-Type: application/json" \
-d "@$dashboard_file" \
"http://admin:admin@localhost:3000/api/dashboards/db" \
>/dev/null 2>&1; then
log INFO "Imported dashboard: $dashboard_name"
else
log WARN "Failed to import dashboard: $dashboard_name"
fi
fi
done
fi
}
# Generate deployment report
generate_report() {
log INFO "Generating deployment report..."
local report_file="/tmp/bzzz-monitoring-deployment-report-${TIMESTAMP}.txt"
cat > "$report_file" <<EOF
BZZZ Enhanced Monitoring Stack Deployment Report
================================================
Deployment Time: $(date)
Environment: $ENVIRONMENT
Stack Name: $STACK_NAME
Dry Run: $DRY_RUN
Services Deployed:
EOF
if [[ "$DRY_RUN" != "true" ]]; then
docker stack services "$STACK_NAME" --format " - {{.Name}}: {{.Replicas}}" >> "$report_file"
echo "" >> "$report_file"
echo "Service Health:" >> "$report_file"
# Add health check results
local health_endpoints=(
"http://localhost:9090/-/healthy:Prometheus"
"http://localhost:3000/api/health:Grafana"
"http://localhost:9093/-/healthy:AlertManager"
)
for endpoint_def in "${health_endpoints[@]}"; do
local endpoint="${endpoint_def%:*}"
local service="${endpoint_def##*:}"
if curl -sf "$endpoint" >/dev/null 2>&1; then
echo " - $service: ✅ Healthy" >> "$report_file"
else
echo " - $service: ❌ Unhealthy" >> "$report_file"
fi
done
else
echo " [Dry run mode - no services deployed]" >> "$report_file"
fi
cat >> "$report_file" <<EOF
Access URLs:
- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- AlertManager: http://localhost:9093
Configuration:
- Log file: $LOG_FILE
- Backup directory: /rust/bzzz-v2/backups/monitoring/backup_${TIMESTAMP}
- Config version: $CONFIG_VERSION
Next Steps:
1. Change default Grafana admin password
2. Configure notification channels in AlertManager
3. Review and customize alert rules
4. Set up external authentication (optional)
EOF
log INFO "Deployment report generated: $report_file"
# Display report
echo ""
echo "=========================================="
cat "$report_file"
echo "=========================================="
}
# Main execution
main() {
log INFO "Starting BZZZ Enhanced Monitoring Stack deployment"
log INFO "Environment: $ENVIRONMENT, Dry Run: $DRY_RUN"
log INFO "Log file: $LOG_FILE"
check_prerequisites
setup_directories
backup_existing_config
create_missing_configs
create_secrets
create_configs
deploy_monitoring_stack
perform_health_checks
validate_deployment
import_dashboards
generate_report
log INFO "Deployment completed successfully!"
if [[ "$DRY_RUN" != "true" ]]; then
echo ""
echo "🎉 BZZZ Enhanced Monitoring Stack is now running!"
echo "📊 Grafana Dashboard: http://localhost:3000"
echo "📈 Prometheus: http://localhost:9090"
echo "🚨 AlertManager: http://localhost:9093"
echo ""
echo "Next steps:"
echo "1. Change default Grafana password"
echo "2. Configure alert notification channels"
echo "3. Review monitoring dashboards"
echo "4. Run reliability tests: ./infrastructure/testing/run-tests.sh all"
fi
}
# Script execution
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
main "$@"
fi


@@ -0,0 +1,686 @@
# BZZZ Infrastructure Reliability Testing Plan
## Overview
This document outlines comprehensive testing procedures to validate the reliability, performance, and operational readiness of the BZZZ distributed system infrastructure enhancements.
## Test Categories
1. Component Health Testing
2. Integration Testing
3. Chaos Engineering
4. Performance Testing
5. Monitoring and Alerting Validation
6. Disaster Recovery Testing
---
## 1. Component Health Testing
### 1.1 Enhanced Health Checks Validation
**Objective**: Verify enhanced health check implementations work correctly.
#### Test Cases
**TC-01: PubSub Health Probes**
```bash
# Test PubSub round-trip functionality
curl -X POST http://bzzz-agent:8080/test/pubsub-health \
-H "Content-Type: application/json" \
-d '{"test_duration": "30s", "message_count": 100}'
# Expected: Success rate > 99%, latency < 100ms
```
**TC-02: DHT Health Probes**
```bash
# Test DHT put/get operations
curl -X POST http://bzzz-agent:8080/test/dht-health \
-H "Content-Type: application/json" \
-d '{"test_duration": "60s", "operation_count": 50}'
# Expected: Success rate > 99%, p95 latency < 300ms
```
**TC-03: Election Health Monitoring**
```bash
# Test election stability
curl -X GET http://bzzz-agent:8080/health/checks | jq '.checks["election-health"]'
# Trigger controlled election
curl -X POST http://bzzz-agent:8080/admin/trigger-election
# Expected: Stable admin election within 30 seconds
```
#### Validation Criteria
- [ ] All health checks report accurate status
- [ ] Health check latencies are within SLO thresholds
- [ ] Failed health checks trigger appropriate alerts
- [ ] Health history is properly maintained
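A minimal sketch for automating the first two criteria above, assuming the aggregated `/health/checks` endpoint returns a JSON map of named checks with `status` and `latency_ms` fields (the field names are illustrative, not confirmed from the API):
```bash
#!/bin/bash
# Poll the aggregated health endpoint and fail if any check is unhealthy
# or slower than the assumed latency threshold.
ENDPOINT="http://bzzz-agent:8080/health/checks"
MAX_LATENCY_MS=300

checks=$(curl -sf "$ENDPOINT") || { echo "health endpoint unreachable"; exit 1; }

failed=0
while read -r name status latency; do
    latency=${latency%.*}  # drop any fractional milliseconds for bash arithmetic
    if [[ "$status" != "healthy" ]]; then
        echo "❌ $name is $status"
        failed=1
    elif (( latency > MAX_LATENCY_MS )); then
        echo "⚠️ $name latency ${latency}ms exceeds ${MAX_LATENCY_MS}ms"
        failed=1
    else
        echo "✅ $name (${latency}ms)"
    fi
done < <(echo "$checks" | jq -r '.checks | to_entries[] | "\(.key) \(.value.status) \(.value.latency_ms)"')

exit $failed
```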
### 1.2 SLURP Leadership Health Testing
**TC-04: Leadership Transition Health**
```bash
# Test leadership transition health
./scripts/test-leadership-transition.sh
# Expected outcomes:
# - Clean leadership transitions
# - No dropped tasks during transition
# - Health scores maintain > 0.8 during transition
```
**TC-05: Degraded Leader Detection**
```bash
# Simulate resource exhaustion
docker service update --limit-memory 512M bzzz-v2_bzzz-agent
# Expected: Transition to degraded leader state within 2 minutes
# Expected: Health alerts fired appropriately
```
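A sketch of how the degraded-leader window could be measured, reusing the `bzzz_system_health_score` Prometheus query used elsewhere in this plan; the 0.8 threshold mirrors the validation criteria above:
```bash
#!/bin/bash
# Watch the system health score after the memory limit is applied and
# report how long it takes to drop below the degraded threshold.
PROM="http://prometheus:9090"
THRESHOLD=0.8
START=$(date +%s)

while true; do
    score=$(curl -s "${PROM}/api/v1/query?query=bzzz_system_health_score" \
        | jq -r '.data.result[0].value[1] // "1"')
    elapsed=$(( $(date +%s) - START ))
    if (( $(echo "$score < $THRESHOLD" | bc -l) )); then
        echo "Degradation detected after ${elapsed}s (score=${score})"
        break
    fi
    if (( elapsed > 180 )); then
        echo "❌ No degradation detected within 3 minutes"
        exit 1
    fi
    sleep 10
done
```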
---
## 2. Integration Testing
### 2.1 End-to-End System Testing
**TC-06: Complete Task Lifecycle**
```bash
#!/bin/bash
# Test complete task flow from submission to completion
# 1. Submit context generation task
TASK_ID=$(curl -X POST http://bzzz.deepblack.cloud/api/slurp/generate \
-H "Content-Type: application/json" \
-d '{
"ucxl_address": "ucxl://test/document.md",
"role": "test_analyst",
"priority": "high"
}' | jq -r '.task_id')
echo "Task submitted: $TASK_ID"
# 2. Monitor task progress
while true; do
STATUS=$(curl -s http://bzzz.deepblack.cloud/api/slurp/status/$TASK_ID | jq -r '.status')
echo "Task status: $STATUS"
if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
break
fi
sleep 5
done
# 3. Validate results
if [ "$STATUS" = "completed" ]; then
echo "✅ Task completed successfully"
RESULT=$(curl -s http://bzzz.deepblack.cloud/api/slurp/result/$TASK_ID)
echo "Result size: $(echo $RESULT | jq -r '.content | length')"
else
echo "❌ Task failed"
exit 1
fi
```
**TC-07: Multi-Node Coordination**
```bash
# Test coordination across cluster nodes
./scripts/test-multi-node-coordination.sh
# Test matrix:
# - Task submission on node A, execution on node B
# - DHT storage on node A, retrieval on node C
# - Election on mixed node topology
```
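The coordination script itself is not reproduced here, but a cut-down sketch of the DHT leg of the matrix (store through one node, read back through another) could look like the following; the node hostnames and `/api/dht/*` paths are placeholders for whatever storage endpoints the agents actually expose:
```bash
#!/bin/bash
# Store a value via node A and read it back via node C to confirm
# cross-node DHT coordination (hostnames and endpoint paths are illustrative).
NODE_A="http://walnut:8080"
NODE_C="http://acacia:8080"
KEY="test/coordination-$(date +%s)"
VALUE="multi-node-check"

curl -sf -X PUT "${NODE_A}/api/dht/${KEY}" -d "$VALUE" >/dev/null \
    || { echo "❌ PUT via node A failed"; exit 1; }

sleep 5  # allow replication/propagation

fetched=$(curl -sf "${NODE_C}/api/dht/${KEY}") || { echo "❌ GET via node C failed"; exit 1; }

if [[ "$fetched" == "$VALUE" ]]; then
    echo "✅ Value written on node A is readable from node C"
else
    echo "❌ Mismatch: expected '$VALUE', got '$fetched'"
    exit 1
fi
```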
### 2.2 Inter-Service Communication Testing
**TC-08: Service Mesh Validation**
```bash
# Test all service-to-service communications
./scripts/test-service-mesh.sh
# Validate:
# - bzzz-agent ↔ postgres
# - bzzz-agent ↔ redis
# - bzzz-agent ↔ dht-bootstrap nodes
# - mcp-server ↔ bzzz-agent
# - content-resolver ↔ bzzz-agent
```
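A minimal sketch of the pairwise connectivity portion of the mesh check, assuming each dependency can be probed with a TCP connect from inside a running agent container and that `nc` and `timeout` are available in the agent image (service names and ports are assumptions based on the stack layout):
```bash
#!/bin/bash
# Probe each downstream dependency from inside a bzzz-agent task on this node.
DEPENDENCIES=(
    "postgres:5432"
    "redis:6379"
    "dht-bootstrap-1:4001"
    "mcp-server:3000"
    "content-resolver:8081"
)

container=$(docker ps --filter "name=bzzz-v2_bzzz-agent" --format '{{.ID}}' | head -n1)
[[ -n "$container" ]] || { echo "no bzzz-agent container found on this node"; exit 1; }

for dep in "${DEPENDENCIES[@]}"; do
    host="${dep%:*}"
    port="${dep##*:}"
    if docker exec "$container" sh -c "timeout 3 nc -z $host $port" 2>/dev/null; then
        echo "✅ bzzz-agent → ${host}:${port}"
    else
        echo "❌ bzzz-agent → ${host}:${port} unreachable"
    fi
done
```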
---
## 3. Chaos Engineering
### 3.1 Node Failure Testing
**TC-09: Single Node Failure**
```bash
#!/bin/bash
# Test system resilience to single node failure
# 1. Record baseline metrics
echo "Recording baseline metrics..."
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' > baseline_metrics.json
# 2. Identify current leader
LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')
echo "Current leader: $LEADER"
# 3. Simulate node failure
echo "Simulating failure of node: $LEADER"
docker node update --availability drain $LEADER
# 4. Monitor recovery
START_TIME=$(date +%s)
while true; do
CURRENT_TIME=$(date +%s)
ELAPSED=$((CURRENT_TIME - START_TIME))
# Check if new leader elected
NEW_LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')
if [ "$NEW_LEADER" != "null" ] && [ "$NEW_LEADER" != "$LEADER" ]; then
echo "✅ New leader elected: $NEW_LEADER (${ELAPSED}s)"
break
fi
if [ $ELAPSED -gt 120 ]; then
echo "❌ Leadership recovery timeout"
exit 1
fi
sleep 5
done
# 5. Validate system health
sleep 30 # Allow system to stabilize
HEALTH_SCORE=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq -r '.data.result[0].value[1]')
echo "Post-failure health score: $HEALTH_SCORE"
if (( $(echo "$HEALTH_SCORE > 0.8" | bc -l) )); then
echo "✅ System recovered successfully"
else
echo "❌ System health degraded: $HEALTH_SCORE"
exit 1
fi
# 6. Restore node
docker node update --availability active $LEADER
```
**TC-10: Multi-Node Cascade Failure**
```bash
# Test system resilience to cascade failures
./scripts/test-cascade-failure.sh
# Scenario: Fail 2 out of 5 nodes simultaneously
# Expected: System continues operating with degraded performance
# Expected: All critical data remains available
```
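A sketch of the two-node drain used for the cascade scenario; the node names are examples and should match your swarm inventory:
```bash
#!/bin/bash
# Drain two worker nodes simultaneously, then verify the stack keeps serving.
NODES=("ironwood" "acacia")   # example node names

for node in "${NODES[@]}"; do
    docker node update --availability drain "$node"
done

sleep 60  # allow tasks to reschedule onto remaining nodes

if curl -sf https://bzzz.deepblack.cloud/health >/dev/null; then
    echo "✅ API still responding with 2 nodes drained"
else
    echo "❌ API unavailable during cascade failure"
fi

# Restore the nodes regardless of the outcome
for node in "${NODES[@]}"; do
    docker node update --availability active "$node"
done
```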
### 3.2 Network Partition Testing
**TC-11: DHT Network Partition**
```bash
#!/bin/bash
# Test DHT resilience to network partitions
# 1. Create network partition
echo "Creating network partition..."
iptables -A INPUT -s 192.168.1.72 -j DROP # Block ironwood
iptables -A OUTPUT -d 192.168.1.72 -j DROP
# 2. Monitor DHT health
./scripts/monitor-dht-partition-recovery.sh &
MONITOR_PID=$!
# 3. Wait for partition duration
sleep 300 # 5 minute partition
# 4. Heal partition
echo "Healing network partition..."
iptables -D INPUT -s 192.168.1.72 -j DROP
iptables -D OUTPUT -d 192.168.1.72 -j DROP
# 5. Wait for recovery
sleep 180 # 3 minute recovery window
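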
# 6. Validate recovery
kill $MONITOR_PID
./scripts/validate-dht-recovery.sh
```
### 3.3 Resource Exhaustion Testing
**TC-12: Memory Exhaustion**
```bash
# Test behavior under memory pressure
stress-ng --vm 4 --vm-bytes 75% --timeout 300s &
STRESS_PID=$!
# Monitor system behavior
./scripts/monitor-memory-exhaustion.sh
# Expected: Graceful degradation, no crashes
# Expected: Health checks detect degradation
# Expected: Alerts fired appropriately
kill $STRESS_PID
```
**TC-13: Disk Space Exhaustion**
```bash
# Test disk space exhaustion handling (size the file to approach the node's free space)
dd if=/dev/zero of=/tmp/fill-disk bs=1M count=$(( $(df -m --output=avail /tmp | tail -1) - 512 ))
# (remove /tmp/fill-disk after the test to restore disk space)
# Expected: Services detect low disk space
# Expected: Appropriate cleanup mechanisms activate
# Expected: System remains operational
```
---
## 4. Performance Testing
### 4.1 Load Testing
**TC-14: Context Generation Load Test**
```bash
#!/bin/bash
# Load test context generation system
# Test configuration
CONCURRENT_USERS=50
TEST_DURATION=600 # 10 minutes
RAMP_UP_TIME=60 # 1 minute
# Run load test
# Ramp up to the target VU count, then hold for the test duration (expressed as k6 stages)
k6 run \
  --stage "${RAMP_UP_TIME}s:${CONCURRENT_USERS}" \
  --stage "${TEST_DURATION}s:${CONCURRENT_USERS}" \
  ./scripts/load-test-context-generation.js
# Success criteria:
# - Throughput: > 10 requests/second
# - P95 latency: < 2 seconds
# - Error rate: < 1%
# - System health score: > 0.8 throughout test
```
**TC-15: DHT Throughput Test**
```bash
# Test DHT operation throughput
./scripts/dht-throughput-test.sh
# Test matrix:
# - PUT operations: Target 100 ops/sec
# - GET operations: Target 500 ops/sec
# - Mixed workload: 80% GET, 20% PUT
```
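A rough sketch of the mixed-workload leg (80% GET / 20% PUT) with a coarse ops/sec figure; the `/api/dht/*` paths are illustrative placeholders for the agent's storage API:
```bash
#!/bin/bash
# Mixed DHT workload sketch: 80% GET, 20% PUT, reporting rough throughput.
AGENT="http://bzzz-agent:8080"
TOTAL_OPS=1000

# Seed a key so the first GETs have something to read
curl -sf -X PUT "${AGENT}/api/dht/throughput-key-0" -d "seed" >/dev/null

start=$(date +%s)
for i in $(seq 1 "$TOTAL_OPS"); do
    if (( i % 5 == 0 )); then
        curl -sf -X PUT "${AGENT}/api/dht/throughput-key-${i}" -d "payload-${i}" >/dev/null
    else
        last_put=$(( (i / 5) * 5 ))   # most recently written key
        curl -sf "${AGENT}/api/dht/throughput-key-${last_put}" >/dev/null
    fi
done
elapsed=$(( $(date +%s) - start ))
(( elapsed == 0 )) && elapsed=1

echo "Completed ${TOTAL_OPS} mixed operations in ${elapsed}s (~$(( TOTAL_OPS / elapsed )) ops/sec)"
```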
### 4.2 Scalability Testing
**TC-16: Horizontal Scaling Test**
```bash
#!/bin/bash
# Test horizontal scaling behavior
# Baseline measurement
echo "Recording baseline performance..."
./scripts/measure-baseline-performance.sh
# Scale up
echo "Scaling up services..."
docker service scale bzzz-v2_bzzz-agent=6
sleep 60 # Allow services to start
# Measure scaled performance
echo "Measuring scaled performance..."
./scripts/measure-scaled-performance.sh
# Validate improvements
echo "Validating scaling improvements..."
./scripts/validate-scaling-improvements.sh
# Expected: Linear improvement in throughput
# Expected: No degradation in latency
# Expected: Stable error rates
```
---
## 5. Monitoring and Alerting Validation
### 5.1 Alert Testing
**TC-17: Critical Alert Testing**
```bash
#!/bin/bash
# Test critical alert firing and resolution
ALERTS_TO_TEST=(
"BZZZSystemHealthCritical"
"BZZZInsufficientPeers"
"BZZZDHTLowSuccessRate"
"BZZZNoAdminElected"
"BZZZTaskQueueBackup"
)
for alert in "${ALERTS_TO_TEST[@]}"; do
echo "Testing alert: $alert"
# Trigger condition
./scripts/trigger-alert-condition.sh "$alert"
# Wait for alert
timeout 300 ./scripts/wait-for-alert.sh "$alert"
if [ $? -eq 0 ]; then
echo "✅ Alert $alert fired successfully"
else
echo "❌ Alert $alert failed to fire"
fi
# Resolve condition
./scripts/resolve-alert-condition.sh "$alert"
# Wait for resolution
timeout 300 ./scripts/wait-for-alert-resolution.sh "$alert"
if [ $? -eq 0 ]; then
echo "✅ Alert $alert resolved successfully"
else
echo "❌ Alert $alert failed to resolve"
fi
done
```
### 5.2 Metrics Validation
**TC-18: Metrics Accuracy Test**
```bash
# Validate metrics accuracy against actual system state
./scripts/validate-metrics-accuracy.sh
# Test cases:
# - Connected peers count vs actual P2P connections
# - DHT operation counters vs logged operations
# - Task completion rates vs actual completions
# - Resource usage vs system measurements
```
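One of the spot checks from the list above, sketched out: compare the Prometheus peer-count metric against the agent's own view. The metric name follows the `bzzz_*` convention used in this plan and the `/api/p2p/peers` path is an assumption:
```bash
#!/bin/bash
# Cross-check the reported peer-count metric against the agent API.
PROM="http://prometheus:9090"
AGENT="http://bzzz-agent:8080"

metric_peers=$(curl -s "${PROM}/api/v1/query?query=bzzz_connected_peers" \
    | jq -r '.data.result[0].value[1] // "0"')
api_peers=$(curl -s "${AGENT}/api/p2p/peers" | jq -r '.peers | length')

echo "Prometheus reports ${metric_peers} peers, agent API reports ${api_peers}"

if [[ "${metric_peers%.*}" -eq "$api_peers" ]]; then
    echo "✅ Peer count metric matches actual P2P connections"
else
    echo "❌ Metric drift detected"
    exit 1
fi
```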
### 5.3 Dashboard Functionality
**TC-19: Grafana Dashboard Test**
```bash
# Test all Grafana dashboards
./scripts/test-grafana-dashboards.sh
# Validation:
# - All panels load without errors
# - Data displays correctly for all time ranges
# - Drill-down functionality works
# - Alert annotations appear correctly
```
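A sketch of the "all panels load without errors" portion, using Grafana's standard HTTP API and the default admin credentials from this deployment (change them in production):
```bash
#!/bin/bash
# List all dashboards via the Grafana search API and confirm each one can be
# fetched, reporting its panel count.
GRAFANA="http://admin:admin@localhost:3000"

uids=$(curl -sf "${GRAFANA}/api/search?type=dash-db" | jq -r '.[].uid')

for uid in $uids; do
    dash=$(curl -sf "${GRAFANA}/api/dashboards/uid/${uid}") \
        || { echo "❌ Failed to load dashboard ${uid}"; continue; }
    title=$(echo "$dash" | jq -r '.dashboard.title')
    panels=$(echo "$dash" | jq '.dashboard.panels | length')
    echo "✅ ${title}: ${panels} panels"
done
```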
---
## 6. Disaster Recovery Testing
### 6.1 Data Recovery Testing
**TC-20: Database Recovery Test**
```bash
#!/bin/bash
# Test database backup and recovery procedures
# 1. Create test data
echo "Creating test data..."
./scripts/create-test-data.sh
# 2. Perform backup
echo "Creating backup..."
./scripts/backup-database.sh
# 3. Simulate data loss
echo "Simulating data loss..."
docker service scale bzzz-v2_postgres=0
docker volume rm bzzz-v2_postgres_data
# 4. Restore from backup
echo "Restoring from backup..."
./scripts/restore-database.sh
# 5. Validate data integrity
echo "Validating data integrity..."
./scripts/validate-restored-data.sh
# Expected: 100% data recovery
# Expected: All relationships intact
# Expected: System fully operational
```
### 6.2 Configuration Recovery
**TC-21: Configuration Disaster Recovery**
```bash
# Test recovery of all system configurations
./scripts/test-configuration-recovery.sh
# Test scenarios:
# - Docker secrets loss and recovery
# - Docker configs corruption and recovery
# - Service definition recovery
# - Network configuration recovery
```
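Docker secrets cannot be read back once created, so secret recovery has to restore material from an out-of-band source; a sketch of the config side (which is inspectable) plus the secret re-creation step, with illustrative names and backup paths:
```bash
#!/bin/bash
# Back up all Docker configs for the stack and re-create a lost secret from
# a secured out-of-band backup file (names and paths are illustrative).
BACKUP_DIR="/rust/bzzz-v2/backups/configs/$(date +%Y%m%d-%H%M%S)"
mkdir -p "$BACKUP_DIR"

# Configs can be exported directly (Spec includes the base64-encoded data)
for cfg in $(docker config ls --format '{{.Name}}' | grep '^bzzz_'); do
    docker config inspect --format '{{json .Spec}}' "$cfg" > "${BACKUP_DIR}/${cfg}.json"
    echo "Backed up config: $cfg"
done

# Secrets cannot be read back; re-create from the secured backup copy
SECRET_NAME="bzzz_slack_webhook_url"                              # assumed name
SECRET_SOURCE="/rust/bzzz-v2/backups/secrets/slack_webhook_url"   # assumed location
if ! docker secret inspect "$SECRET_NAME" >/dev/null 2>&1; then
    docker secret create "$SECRET_NAME" "$SECRET_SOURCE"
    echo "Re-created secret: $SECRET_NAME"
fi
```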
### 6.3 Full System Recovery
**TC-22: Complete Infrastructure Recovery**
```bash
#!/bin/bash
# Test complete system recovery from scratch
# 1. Document current state
echo "Documenting current system state..."
./scripts/document-system-state.sh > pre-disaster-state.json
# 2. Simulate complete infrastructure loss
echo "Simulating infrastructure disaster..."
docker stack rm bzzz-v2
docker system prune -f --volumes
# 3. Recover infrastructure
echo "Recovering infrastructure..."
./scripts/deploy-from-scratch.sh
# 4. Validate recovery
echo "Validating recovery..."
./scripts/validate-complete-recovery.sh pre-disaster-state.json
# Success criteria:
# - All services operational within 15 minutes
# - All data recovered correctly
# - System health score > 0.9
# - All integrations functional
```
---
## Test Execution Framework
### Automated Test Runner
```bash
#!/bin/bash
# Main test execution script
TEST_SUITE=${1:-"all"}
ENVIRONMENT=${2:-"staging"}
echo "Running BZZZ reliability tests..."
echo "Suite: $TEST_SUITE"
echo "Environment: $ENVIRONMENT"
# Setup test environment
./scripts/setup-test-environment.sh $ENVIRONMENT
# Run test suites
case $TEST_SUITE in
"health")
./scripts/run-health-tests.sh
;;
"integration")
./scripts/run-integration-tests.sh
;;
"chaos")
./scripts/run-chaos-tests.sh
;;
"performance")
./scripts/run-performance-tests.sh
;;
"monitoring")
./scripts/run-monitoring-tests.sh
;;
"disaster-recovery")
./scripts/run-disaster-recovery-tests.sh
;;
"all")
./scripts/run-all-tests.sh
;;
*)
echo "Unknown test suite: $TEST_SUITE"
exit 1
;;
esac
# Generate test report
./scripts/generate-test-report.sh
echo "Test execution completed."
```
### Test Environment Setup
```yaml
# test-environment.yml
version: '3.8'
services:
# Staging environment with reduced resource requirements
bzzz-agent-test:
image: registry.home.deepblack.cloud/bzzz:test-latest
environment:
- LOG_LEVEL=debug
- TEST_MODE=true
- METRICS_ENABLED=true
networks:
- test-network
deploy:
replicas: 3
resources:
limits:
memory: 1G
cpus: '0.5'
# Test data generator
test-data-generator:
image: registry.home.deepblack.cloud/bzzz-test-generator:latest
environment:
- TARGET_ENDPOINT=http://bzzz-agent-test:9000
- DATA_VOLUME=medium
networks:
- test-network
networks:
test-network:
driver: overlay
```
### Continuous Testing Pipeline
```yaml
# .github/workflows/reliability-testing.yml
name: BZZZ Reliability Testing
on:
schedule:
- cron: '0 2 * * *' # Daily at 2 AM
workflow_dispatch:
jobs:
health-tests:
runs-on: self-hosted
steps:
- uses: actions/checkout@v3
- name: Run Health Tests
run: ./infrastructure/testing/run-tests.sh health staging
performance-tests:
runs-on: self-hosted
needs: health-tests
steps:
- name: Run Performance Tests
run: ./infrastructure/testing/run-tests.sh performance staging
chaos-tests:
runs-on: self-hosted
needs: health-tests
if: github.event_name == 'workflow_dispatch'
steps:
- name: Run Chaos Tests
run: ./infrastructure/testing/run-tests.sh chaos staging
```
---
## Success Criteria
### Overall System Reliability Targets
- **Availability SLO**: 99.9% uptime
- **Performance SLO**:
- Context generation: p95 < 2 seconds
- DHT operations: p95 < 300ms
- P2P messaging: p95 < 500ms
- **Error Rate SLO**: < 0.1% for all operations
- **Recovery Time Objective (RTO)**: < 15 minutes
- **Recovery Point Objective (RPO)**: < 5 minutes
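These targets can be spot-checked directly against Prometheus. A minimal sketch for the DHT latency SLO, assuming a `bzzz_dht_operation_duration_seconds` histogram (the metric name is illustrative):
```bash
#!/bin/bash
# Query the p95 DHT operation latency over the last hour and compare it to
# the 300ms SLO target.
PROM="http://prometheus:9090"
QUERY='histogram_quantile(0.95, sum(rate(bzzz_dht_operation_duration_seconds_bucket[1h])) by (le))'

p95=$(curl -s --get "${PROM}/api/v1/query" --data-urlencode "query=${QUERY}" \
    | jq -r '.data.result[0].value[1] // "NaN"')

if [[ "$p95" == "NaN" ]]; then
    echo "⚠️ No data returned for the query"
    exit 1
fi

echo "DHT p95 latency over the last hour: ${p95}s"

if (( $(echo "$p95 < 0.3" | bc -l) )); then
    echo "✅ Within the 300ms SLO"
else
    echo "❌ SLO breach"
    exit 1
fi
```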
### Test Pass Criteria
- **Health Tests**: 100% of health checks function correctly
- **Integration Tests**: 95% pass rate for all integration scenarios
- **Chaos Tests**: System recovers within SLO targets for all failure scenarios
- **Performance Tests**: All performance metrics meet SLO targets under load
- **Monitoring Tests**: 100% of alerts fire and resolve correctly
- **Disaster Recovery**: Complete system recovery within RTO/RPO targets
### Continuous Monitoring
- Daily automated health and integration tests
- Weekly performance regression testing
- Monthly chaos engineering exercises
- Quarterly disaster recovery drills
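For clusters where the scheduled GitHub Actions workflow above is not available, the daily and weekly cadences map naturally onto cron on a manager node (the monthly and quarterly exercises are better scheduled manually); a sketch, assuming the repo is checked out at /rust/bzzz-v2/bzzz:
```bash
#!/bin/bash
# Install the daily/weekly test schedule on a manager node.
REPO="/rust/bzzz-v2/bzzz"
( crontab -l 2>/dev/null
  # Daily health and integration tests at 02:00 / 02:30
  echo "0 2 * * * ${REPO}/infrastructure/testing/run-tests.sh health staging >> /var/log/bzzz-tests.log 2>&1"
  echo "30 2 * * * ${REPO}/infrastructure/testing/run-tests.sh integration staging >> /var/log/bzzz-tests.log 2>&1"
  # Weekly performance regression run, Sundays at 03:00
  echo "0 3 * * 0 ${REPO}/infrastructure/testing/run-tests.sh performance staging >> /var/log/bzzz-tests.log 2>&1"
) | sort -u | crontab -   # sort -u avoids duplicate entries on re-runs
```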
---
## Test Reporting and Documentation
### Test Results Dashboard
- Real-time test execution status
- Historical test results and trends
- Performance benchmarks over time
- Failure analysis and remediation tracking
### Test Documentation
- Detailed test procedures and scripts
- Failure scenarios and response procedures
- Performance baselines and regression analysis
- Disaster recovery validation reports
This comprehensive testing plan ensures that all infrastructure enhancements are thoroughly validated and the system meets its reliability and performance objectives.