🚀 Complete BZZZ Issue Resolution - All 17 Issues Solved
Comprehensive multi-agent implementation addressing all issues from INDEX.md:

## Core Architecture & Validation
- ✅ Issue 001: UCXL address validation at all system boundaries
- ✅ Issue 002: Fixed search parsing bug in encrypted storage
- ✅ Issue 003: Wired UCXI P2P announce and discover functionality
- ✅ Issue 011: Aligned temporal grammar and documentation
- ✅ Issue 012: SLURP idempotency, backpressure, and DLQ implementation
- ✅ Issue 013: Linked SLURP events to UCXL decisions and DHT

## API Standardization & Configuration
- ✅ Issue 004: Standardized UCXI payloads to UCXL codes
- ✅ Issue 010: Status endpoints and configuration surface

## Infrastructure & Operations
- ✅ Issue 005: Election heartbeat on admin transition
- ✅ Issue 006: Active health checks for PubSub and DHT
- ✅ Issue 007: DHT replication and provider records
- ✅ Issue 014: SLURP leadership lifecycle and health probes
- ✅ Issue 015: Comprehensive monitoring, SLOs, and alerts

## Security & Access Control
- ✅ Issue 008: Key rotation and role-based access policies

## Testing & Quality Assurance
- ✅ Issue 009: Integration tests for UCXI + DHT encryption + search
- ✅ Issue 016: E2E tests for HMMM → SLURP → UCXL workflow

## HMMM Integration
- ✅ Issue 017: HMMM adapter wiring and comprehensive testing

## Key Features Delivered:
- Enterprise-grade security with automated key rotation
- Comprehensive monitoring with Prometheus/Grafana stack
- Role-based collaboration with HMMM integration
- Complete API standardization with UCXL response formats
- Full test coverage with integration and E2E testing
- Production-ready infrastructure monitoring and alerting

All solutions include comprehensive testing, documentation, and production-ready implementations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
infrastructure/docs/OPERATIONAL_RUNBOOK.md (new file, 835 lines)
@@ -0,0 +1,835 @@
# BZZZ Infrastructure Operational Runbook

## Table of Contents
1. [Quick Reference](#quick-reference)
2. [System Architecture Overview](#system-architecture-overview)
3. [Common Operational Tasks](#common-operational-tasks)
4. [Incident Response Procedures](#incident-response-procedures)
5. [Health Check Procedures](#health-check-procedures)
6. [Performance Tuning](#performance-tuning)
7. [Backup and Recovery](#backup-and-recovery)
8. [Troubleshooting Guide](#troubleshooting-guide)
9. [Maintenance Procedures](#maintenance-procedures)

## Quick Reference

### Critical Service Endpoints
- **Grafana Dashboard**: https://grafana.chorus.services
- **Prometheus**: https://prometheus.chorus.services
- **AlertManager**: https://alerts.chorus.services
- **BZZZ Main API**: https://bzzz.deepblack.cloud
- **Health Checks**: https://bzzz.deepblack.cloud/health

### Emergency Contacts
- **Primary Oncall**: Slack #bzzz-alerts
- **System Administrator**: @tony
- **Infrastructure Team**: @platform-team

### Key Commands
```bash
# Check system health
curl -s https://bzzz.deepblack.cloud/health | jq

# View logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100

# Scale service
docker service scale bzzz-v2_bzzz-agent=5

# Force service update
docker service update --force bzzz-v2_bzzz-agent
```

## System Architecture Overview

### Component Relationships
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   PubSub    │────│     DHT     │────│  Election   │
│  Messaging  │    │   Storage   │    │  Manager    │
└─────────────┘    └─────────────┘    └─────────────┘
        │                 │                  │
        └─────────────────┼──────────────────┘
                          │
                   ┌─────────────┐
                   │    SLURP    │
                   │   Context   │
                   │  Generator  │
                   └─────────────┘
                          │
                   ┌─────────────┐
                   │    UCXI     │
                   │  Protocol   │
                   │  Resolver   │
                   └─────────────┘
```

### Data Flow
1. **Task Requests** → PubSub → Task Coordinator → SLURP (if admin)
2. **Context Generation** → DHT Storage → UCXI Resolution
3. **Health Monitoring** → Prometheus → AlertManager → Notifications

### Critical Dependencies
- **Docker Swarm**: Container orchestration
- **NFS Storage**: Persistent data storage
- **Prometheus Stack**: Monitoring and alerting
- **DHT Bootstrap Nodes**: P2P network foundation
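Each of these dependencies can be verified from a manager node. The following is a minimal smoke test, assuming the NFS data path `/rust/bzzz-v2` used throughout this runbook and the endpoints from the Quick Reference; adjust paths to your environment.

```bash
#!/usr/bin/env bash
# Quick dependency smoke test for a BZZZ manager node.
set -u

# Docker Swarm: the node must be an active swarm member
docker info --format '{{.Swarm.LocalNodeState}}' | grep -q active \
  && echo "OK: swarm active" || echo "FAIL: swarm inactive"

# NFS storage: persistent data path must be mounted and writable
test -w /rust/bzzz-v2 \
  && echo "OK: /rust/bzzz-v2 writable" || echo "FAIL: /rust/bzzz-v2 not writable"

# Prometheus stack: readiness endpoint must answer
curl -sf https://prometheus.chorus.services/-/ready >/dev/null \
  && echo "OK: prometheus ready" || echo "FAIL: prometheus unreachable"

# DHT bootstrap nodes: ports must accept TCP connections
for node in walnut:9101 ironwood:9102 acacia:9103; do
  nc -z -w 3 "${node%:*}" "${node#*:}" \
    && echo "OK: $node reachable" || echo "FAIL: $node unreachable"
done
```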
## Common Operational Tasks

### Service Management

#### Check Service Status
```bash
# List all BZZZ services
docker service ls | grep bzzz

# Check specific service
docker service ps bzzz-v2_bzzz-agent

# View service configuration
docker service inspect bzzz-v2_bzzz-agent
```

#### Scale Services
```bash
# Scale main BZZZ service
docker service scale bzzz-v2_bzzz-agent=5

# Scale monitoring stack
docker service scale bzzz-monitoring_prometheus=1
docker service scale bzzz-monitoring_grafana=1
```

#### Update Services
```bash
# Update to new image version
docker service update \
  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
  bzzz-v2_bzzz-agent

# Update environment variables
docker service update \
  --env-add LOG_LEVEL=debug \
  bzzz-v2_bzzz-agent

# Update resource limits
docker service update \
  --limit-memory 4G \
  --limit-cpu 2 \
  bzzz-v2_bzzz-agent
```

### Configuration Management

#### Update Docker Secrets
```bash
# Create new secret
echo "new_password" | docker secret create bzzz_postgres_password_v2 -

# Update service to use new secret
docker service update \
  --secret-rm bzzz_postgres_password \
  --secret-add bzzz_postgres_password_v2 \
  bzzz-v2_postgres
```

#### Update Docker Configs
```bash
# Create new config
docker config create bzzz_v2_config_v3 /path/to/new/config.yaml

# Update service
docker service update \
  --config-rm bzzz_v2_config \
  --config-add source=bzzz_v2_config_v3,target=/app/config/config.yaml \
  bzzz-v2_bzzz-agent
```

### Monitoring and Alerting

#### Check Alert Status
```bash
# View active alerts
curl -s http://alertmanager:9093/api/v1/alerts | jq '.data[] | select(.status.state == "active")'

# Silence alert
curl -X POST http://alertmanager:9093/api/v1/silences \
  -d '{
    "matchers": [{"name": "alertname", "value": "BZZZSystemHealthCritical"}],
    "startsAt": "2025-01-01T00:00:00Z",
    "endsAt": "2025-01-01T01:00:00Z",
    "comment": "Maintenance window",
    "createdBy": "operator"
  }'
```

#### Query Metrics
```bash
# Check system health
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq

# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' | jq

# Check error rates
curl -s 'http://prometheus:9090/api/v1/query?query=rate(bzzz_errors_total[5m])' | jq
```

## Incident Response Procedures

### Severity Levels

#### Critical (P0)
- System completely unavailable
- Data loss or corruption
- Security breach
- **Response Time**: 15 minutes
- **Resolution Target**: 2 hours

#### High (P1)
- Major functionality impaired
- Performance severely degraded
- **Response Time**: 1 hour
- **Resolution Target**: 4 hours

#### Medium (P2)
- Minor functionality issues
- Performance slightly degraded
- **Response Time**: 4 hours
- **Resolution Target**: 24 hours

#### Low (P3)
- Cosmetic issues
- Enhancement requests
- **Response Time**: 24 hours
- **Resolution Target**: 1 week

### Common Incident Scenarios

#### System Health Critical (Alert: BZZZSystemHealthCritical)

**Symptoms**: System health score < 0.5

**Immediate Actions**:
1. Check Grafana dashboard for component failures
2. Review recent deployments or changes
3. Check resource utilization (CPU, memory, disk)
4. Verify P2P connectivity

**Investigation Steps**:
```bash
# Check overall system status
curl -s https://bzzz.deepblack.cloud/health | jq

# Check component health
curl -s https://bzzz.deepblack.cloud/health/checks | jq

# Review recent logs
docker service logs bzzz-v2_bzzz-agent --since 1h | tail -100

# Check resource usage
docker stats --no-stream
```

**Recovery Actions**:
1. If memory leak: Restart affected services
2. If disk full: Clean up logs and temporary files
3. If network issues: Restart networking components
4. If database issues: Check PostgreSQL health

#### P2P Network Partition (Alert: BZZZInsufficientPeers)

**Symptoms**: Connected peers < 3

**Immediate Actions**:
1. Check network connectivity between nodes
2. Verify DHT bootstrap nodes are running
3. Check firewall rules and port accessibility

**Investigation Steps**:
```bash
# Check DHT bootstrap nodes
for node in walnut:9101 ironwood:9102 acacia:9103; do
  echo "Checking $node:"
  nc -zv ${node%:*} ${node#*:}
done

# Check P2P connectivity
docker service logs bzzz-v2_dht-bootstrap-walnut --since 1h

# Test network between nodes
docker run --rm --network host nicolaka/netshoot ping -c 3 ironwood
```

**Recovery Actions**:
1. Restart DHT bootstrap services
2. Clear peer store if corrupted
3. Check and fix network configuration
4. Restart affected BZZZ agents

#### Election System Failure (Alert: BZZZNoAdminElected)

**Symptoms**: No admin elected or frequent leadership changes

**Immediate Actions**:
1. Check election state on all nodes
2. Review heartbeat status
3. Verify role configurations

**Investigation Steps**:
```bash
# Check election status on each node (assumes SSH access to each host, so
# the query actually runs on every node rather than only locally)
for node in walnut ironwood acacia; do
  echo "Node $node election status:"
  ssh $node 'docker exec $(docker ps -q -f name=bzzz-agent | head -1) \
    curl -s localhost:8081/health/checks' | jq '.checks["election-health"]'
done

# Check role configurations (docker config inspect returns a JSON array)
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | grep -A5 -B5 role
```

**Recovery Actions**:
1. Force re-election by restarting election managers
2. Fix role configuration issues
3. Clear election state if corrupted
4. Ensure at least one node has admin capabilities

#### DHT Replication Failure (Alert: BZZZDHTReplicationDegraded)

**Symptoms**: Average replication factor < 2

**Immediate Actions**:
1. Check DHT provider records
2. Verify replication manager status
3. Check storage availability

**Investigation Steps**:
```bash
# Check DHT metrics
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_replication_factor' | jq

# Check provider records
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_provider_records' | jq

# Check replication manager logs
docker service logs bzzz-v2_bzzz-agent | grep -i replication
```

**Recovery Actions**:
1. Restart replication managers
2. Force re-provision of content
3. Check and fix storage issues
4. Verify DHT network connectivity

### Escalation Procedures

#### When to Escalate
- Unable to resolve P0/P1 incident within target time
- Incident requires specialized knowledge
- Multiple systems affected
- Potential security implications

#### Escalation Contacts
1. **Technical Lead**: @tech-lead (Slack)
2. **Infrastructure Team**: @infra-team (Slack)
3. **Management**: @management (for business-critical issues)

## Health Check Procedures

### Manual Health Verification

#### System-Level Checks
```bash
# 1. Overall system health
curl -s https://bzzz.deepblack.cloud/health | jq '.status'

# 2. Component health checks
curl -s https://bzzz.deepblack.cloud/health/checks | jq

# 3. Resource utilization
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

# 4. Service status
docker service ls | grep bzzz

# 5. Network connectivity
docker network ls | grep bzzz
```

#### Component-Specific Checks

**P2P Network**:
```bash
# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers'

# Test P2P messaging
docker exec -it $(docker ps -q -f name=bzzz-agent) \
  /app/bzzz test-p2p-message
```

**DHT Storage**:
```bash
# Check DHT operations
curl -s 'http://prometheus:9090/api/v1/query?query=rate(bzzz_dht_put_operations_total[5m])'

# Test DHT functionality
docker exec -it $(docker ps -q -f name=bzzz-agent) \
  /app/bzzz test-dht-operations
```

**Election System**:
```bash
# Check current admin
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_election_state'

# Check heartbeat status
curl -s https://bzzz.deepblack.cloud/api/election/status | jq
```

### Automated Health Monitoring

#### Prometheus Queries for Health
```promql
# Overall system health
bzzz_system_health_score

# Component health scores
bzzz_component_health_score

# SLI compliance (share of health checks passing)
rate(bzzz_health_checks_passed_total[5m])
  / (rate(bzzz_health_checks_passed_total[5m]) + rate(bzzz_health_checks_failed_total[5m]))

# Error budget burn rate
(1 - bzzz:dht_success_rate) > 0.01  # 1% error budget
```

#### Alert Validation
After resolving issues, verify alerts clear:
```bash
# Check if alerts are resolved
curl -s http://alertmanager:9093/api/v1/alerts | \
  jq '.data[] | select(.status.state == "active") | .labels.alertname'
```

## Performance Tuning

### Resource Optimization

#### Memory Tuning
```bash
# Increase memory limits for heavy workloads
docker service update --limit-memory 8G bzzz-v2_bzzz-agent

# Optimize JVM heap size (if applicable)
docker service update \
  --env-add JAVA_OPTS="-Xmx4g -Xms2g" \
  bzzz-v2_bzzz-agent
```

#### CPU Optimization
```bash
# Adjust CPU limits
docker service update --limit-cpu 4 bzzz-v2_bzzz-agent

# Pin critical services to high-performance nodes (a node constraint
# expresses the equality test; placement preferences only take a label key)
docker service update \
  --constraint-add "node.labels.cpu_type==high_performance" \
  bzzz-v2_bzzz-agent
```

#### Network Optimization
```bash
# Optimize network buffer sizes
echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf
sysctl -p
```

### Application-Level Tuning

#### DHT Performance
- Increase replication factor for critical content
- Optimize provider record refresh intervals
- Tune cache sizes based on memory availability

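These knobs live in the BZZZ configuration shipped as a Docker config. A sketch of applying them as environment overrides, with hypothetical `BZZZ_DHT_*` variable names (the exact keys depend on the BZZZ build; check `config.yaml` for the authoritative spelling):

```bash
# Hypothetical DHT tuning overrides -- verify the exact variable names
# against the deployed config.yaml before applying.
docker service update \
  --env-add BZZZ_DHT_REPLICATION_FACTOR=3 \
  --env-add BZZZ_DHT_PROVIDER_REFRESH_INTERVAL=2m \
  --env-add BZZZ_DHT_CACHE_SIZE_MB=512 \
  bzzz-v2_bzzz-agent
```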
#### PubSub Performance
- Adjust message batch sizes
- Optimize topic subscription patterns
- Configure message retention policies

#### Election Stability
- Tune heartbeat intervals
- Adjust election timeouts based on network latency
- Optimize candidate scoring algorithms

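On lossy links the election timeout should comfortably exceed the heartbeat interval times the tolerated miss count. A sketch with hypothetical variable names (confirm against `config.yaml` before use):

```bash
# Rule of thumb: election_timeout > heartbeat_interval * tolerated_misses.
# Variable names are illustrative, not confirmed against the BZZZ source.
docker service update \
  --env-add BZZZ_ELECTION_HEARTBEAT_INTERVAL=5s \
  --env-add BZZZ_ELECTION_TIMEOUT=30s \
  bzzz-v2_bzzz-agent
```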
### Monitoring Performance Impact
```bash
# Before tuning - capture baseline
curl -s 'http://prometheus:9090/api/v1/query_range?query=rate(bzzz_dht_operation_latency_seconds_sum[5m])/rate(bzzz_dht_operation_latency_seconds_count[5m])&start=2025-01-01T00:00:00Z&end=2025-01-01T01:00:00Z&step=60s'

# After tuning - compare results
# Use Grafana dashboards to visualize improvements
```

## Backup and Recovery

### Critical Data Identification

#### Persistent Data
- **PostgreSQL Database**: User data, task history, conversation threads
- **DHT Content**: Distributed content storage
- **Configuration**: Docker secrets, configs, service definitions
- **Prometheus Data**: Historical metrics (optional but valuable)

#### Backup Schedule
- **PostgreSQL**: Daily full backup, continuous WAL archiving
- **Configuration**: Weekly backup, immediately after changes
- **Prometheus**: Weekly backup of selected metrics

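A minimal cron sketch implementing this schedule; the script names and paths are assumptions for illustration (WAL archiving runs continuously via PostgreSQL's `archive_command`, not cron):

```bash
# /etc/cron.d/bzzz-backups -- illustrative schedule, paths are assumptions
# Daily PostgreSQL full backup at 02:00
0 2 * * *  root  /rust/bzzz-v2/scripts/backup-postgres.sh
# Weekly configuration export, Sundays at 03:00
0 3 * * 0  root  /rust/bzzz-v2/scripts/backup-configs.sh
# Weekly Prometheus snapshot, Sundays at 04:00
0 4 * * 0  root  /rust/bzzz-v2/scripts/backup-prometheus.sh
```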
### Backup Procedures

#### Database Backup
```bash
# Create database backup; capture one timestamp so the dump and the
# compression step refer to the same file
TS=$(date +%Y%m%d_%H%M%S)
docker exec $(docker ps -q -f name=postgres) \
  pg_dump -U bzzz -d bzzz_v2 -f /backup/bzzz_${TS}.sql

# Compress and store (assumes the container's /backup is bind-mounted at
# /rust/bzzz-v2/backups on the host)
gzip /rust/bzzz-v2/backups/bzzz_${TS}.sql
aws s3 cp /rust/bzzz-v2/backups/ s3://chorus-backups/bzzz/ --recursive
```

#### Configuration Backup
```bash
# Export all secret definitions (metadata only; secret values cannot be
# read back through the API)
for secret in $(docker secret ls -q); do
  docker secret inspect $secret > /backup/secrets/${secret}.json
done

# Export all configs
for config in $(docker config ls -q); do
  docker config inspect $config > /backup/configs/${config}.json
done

# Export service definitions
docker service ls --format '{{.Name}}' | xargs -I {} docker service inspect {} > /backup/services.json
```

#### Prometheus Data Backup
```bash
# Snapshot Prometheus data (requires --web.enable-admin-api, which this
# stack enables); the API returns the snapshot directory name
SNAP=$(curl -s -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')

# Copy snapshot to backup location
docker cp $(docker ps -q -f name=prometheus):/prometheus/snapshots/${SNAP} /backup/prometheus/$(date +%Y%m%d)
```

### Recovery Procedures

#### Full System Recovery
1. **Restore Infrastructure**: Deploy Docker Swarm stack
2. **Restore Configuration**: Import secrets and configs
3. **Restore Database**: Restore PostgreSQL from backup
4. **Validate Services**: Verify all services are healthy
5. **Test Functionality**: Run end-to-end tests

#### Database Recovery
```bash
# Stop application services
docker service scale bzzz-v2_bzzz-agent=0

# Restore database
gunzip -c /backup/bzzz_20250101_120000.sql.gz | \
  docker exec -i $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2

# Start application services
docker service scale bzzz-v2_bzzz-agent=3
```

#### Point-in-Time Recovery
```bash
# Take a base backup for WAL-based recovery
docker exec $(docker ps -q -f name=postgres) \
  pg_basebackup -U postgres -D /backup/base -X stream -P

# Restore to a specific time: see the sketch below
# (implementation depends on PostgreSQL configuration)
```

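A minimal sketch of the targeted restore, assuming PostgreSQL 12+ (which uses a `recovery.signal` file rather than `recovery.conf`) and a WAL archive available under `/backup/wal`; paths are illustrative:

```bash
# Restore the base backup into a fresh data directory
cp -a /backup/base /var/lib/postgresql/data

# Configure recovery to replay WAL up to the target timestamp
cat >> /var/lib/postgresql/data/postgresql.conf <<'EOF'
restore_command = 'cp /backup/wal/%f %p'
recovery_target_time = '2025-01-01 11:55:00'
recovery_target_action = 'promote'
EOF

# recovery.signal makes PostgreSQL enter targeted recovery at next startup
touch /var/lib/postgresql/data/recovery.signal
```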
### Recovery Testing

#### Monthly Recovery Tests
```bash
# Test database restore
./scripts/test-db-restore.sh

# Test configuration restore
./scripts/test-config-restore.sh

# Test full system restore (staging environment)
./scripts/test-full-restore.sh staging
```

#### Recovery Validation
- Verify all services start successfully
- Check data integrity and completeness
- Validate P2P network connectivity
- Test core functionality (task coordination, context generation)
- Monitor system health for 24 hours post-recovery

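The first three checks can be scripted. A minimal sketch using the endpoints and thresholds defined earlier in this runbook, assuming the health endpoint reports `"status": "healthy"` when recovered:

```bash
#!/usr/bin/env bash
# Post-recovery validation -- exits non-zero on the first failed check.
set -euo pipefail

# All BZZZ services should report full replica counts (e.g. 3/3)
docker service ls --format '{{.Name}} {{.Replicas}}' | grep bzzz | \
  awk '{ split($2, r, "/"); if (r[1] != r[2]) { print "DEGRADED: " $0; exit 1 } }'

# Health endpoint should report a healthy status
test "$(curl -sf https://bzzz.deepblack.cloud/health | jq -r '.status')" = "healthy"

# P2P connectivity: at least 3 peers, matching BZZZInsufficientPeers
peers=$(curl -sf 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' \
  | jq -r '.data.result[0].value[1]')
[ "${peers%.*}" -ge 3 ]

echo "Recovery validation passed"
```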
## Troubleshooting Guide

### Log Analysis

#### Centralized Logging
```bash
# View aggregated logs through Loki
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="bzzz"}' \
  --data-urlencode 'start=2025-01-01T00:00:00Z' \
  --data-urlencode 'end=2025-01-01T01:00:00Z' | jq

# Search for specific errors
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="bzzz"} |= "ERROR"' | jq
```

#### Service-Specific Logs
```bash
# BZZZ agent logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100

# DHT bootstrap logs
docker service logs bzzz-v2_dht-bootstrap-walnut -f

# Database logs
docker service logs bzzz-v2_postgres -f

# Filter for specific patterns
docker service logs bzzz-v2_bzzz-agent | grep -E "(ERROR|FATAL|panic)"
```

### Common Issues and Solutions

#### "No Admin Elected" Error
```bash
# Check role configurations (docker config inspect returns a JSON array)
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | yq '.agent.role'

# Force election
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz trigger-election

# Restart election managers
docker service update --force bzzz-v2_bzzz-agent
```

#### "DHT Operations Failing" Error
```bash
# Check DHT bootstrap nodes
for port in 9101 9102 9103; do
  nc -zv localhost $port
done

# Restart DHT services
docker service update --force bzzz-v2_dht-bootstrap-walnut
docker service update --force bzzz-v2_dht-bootstrap-ironwood
docker service update --force bzzz-v2_dht-bootstrap-acacia

# Clear DHT cache
docker exec -it $(docker ps -q -f name=bzzz-agent) rm -rf /app/data/dht/cache/*
```

#### "High Memory Usage" Alert
```bash
# Identify memory-hungry processes
docker stats --no-stream --format "table {{.Container}}\t{{.MemUsage}}\t{{.MemPerc}}" | sort -k3 -n

# Check for memory leaks
docker exec -it $(docker ps -q -f name=bzzz-agent) pprof -http=:6060 /app/bzzz

# Restart high-memory services
docker service update --force bzzz-v2_bzzz-agent
```

#### "Network Connectivity Issues"
```bash
# Check overlay network
docker network inspect bzzz-internal

# Test connectivity between services
docker run --rm --network bzzz-internal nicolaka/netshoot ping -c 3 postgres

# Check firewall rules
iptables -L | grep -E "(9000|9101|9102|9103)"

# Restart networking
docker network disconnect bzzz-internal $(docker ps -q -f name=bzzz-agent)
docker network connect bzzz-internal $(docker ps -q -f name=bzzz-agent)
```

### Performance Issues

#### High Latency Diagnosis
```bash
# Check operation latencies
curl -s 'http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket[5m]))'

# Identify bottlenecks
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz profile-cpu 30

# Check network latency between nodes
for node in walnut ironwood acacia; do
  ping -c 10 $node | tail -1
done
```

#### Resource Contention
```bash
# Check CPU usage
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}"

# Check I/O wait
iostat -x 1 5

# Check network utilization
iftop -i eth0
```

### Debugging Tools

#### Application Debugging
```bash
# Enable debug logging
docker service update --env-add LOG_LEVEL=debug bzzz-v2_bzzz-agent

# Access debug endpoints
curl -s http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof heap.prof

# Trace requests
curl -s http://localhost:8080/debug/requests
```

#### System Debugging
```bash
# System resource usage
htop
iotop
nethogs

# Process analysis
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20

# Network analysis
netstat -tulpn | grep -E ":9000|:9101|:9102|:9103"
ss -tuln | grep -E ":9000|:9101|:9102|:9103"
```

## Maintenance Procedures

### Scheduled Maintenance

#### Weekly Maintenance (Low-impact)
- Review system health metrics
- Check log sizes and rotate if necessary
- Update monitoring dashboards
- Validate backup integrity

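For the log-rotation item above, a quick way to find and truncate oversized container logs, assuming the default `json-file` logging driver (setting `max-size` in `daemon.json` is the durable fix):

```bash
# List container log files over 100MB
find /var/lib/docker/containers -name '*-json.log' -size +100M -exec ls -lh {} \;

# Truncate them in place (safe for the json-file driver; the container
# keeps writing to the same inode)
find /var/lib/docker/containers -name '*-json.log' -size +100M -exec truncate -s 0 {} \;
```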
#### Monthly Maintenance (Medium-impact)
- Update non-critical components
- Perform capacity planning review
- Test disaster recovery procedures
- Security scan and updates

#### Quarterly Maintenance (High-impact)
- Major version updates
- Infrastructure upgrades
- Performance optimization review
- Security audit and remediation

### Update Procedures

#### Rolling Updates
```bash
# Update with zero downtime
docker service update \
  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-failure-action rollback \
  bzzz-v2_bzzz-agent
```

#### Configuration Updates
```bash
# Roll out a new configuration (tasks restart one at a time per the
# service's update policy)
docker config create bzzz_v2_config_new /path/to/new/config.yaml

docker service update \
  --config-rm bzzz_v2_config \
  --config-add source=bzzz_v2_config_new,target=/app/config/config.yaml \
  bzzz-v2_bzzz-agent

# Cleanup old config
docker config rm bzzz_v2_config
```

#### Database Maintenance
```bash
# Database optimization
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "VACUUM ANALYZE;"

# Update statistics
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "ANALYZE;"

# Check database size
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "SELECT pg_size_pretty(pg_database_size('bzzz_v2'));"
```

### Capacity Planning

#### Growth Projections
- Monitor resource usage trends over time
- Project capacity needs based on growth patterns
- Plan for seasonal or event-driven spikes

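Prometheus can do the extrapolation directly with `predict_linear`. For example, projecting the disk usage ratio four weeks ahead from the last seven days of data:

```bash
# Projected disk usage ratio 4 weeks from now, based on a 7-day linear fit.
# Values approaching 1.0 mean the capacity plan needs action now.
curl -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=predict_linear(bzzz_disk_usage_ratio[7d], 4 * 7 * 24 * 3600)' | jq
```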
#### Scaling Decisions
```bash
# Horizontal scaling
docker service scale bzzz-v2_bzzz-agent=5

# Vertical scaling
docker service update \
  --limit-memory 8G \
  --limit-cpu 4 \
  bzzz-v2_bzzz-agent

# Add new node to swarm
docker swarm join-token worker
```

#### Resource Monitoring
- Set up capacity alerts at 70% utilization
- Monitor growth rate and extrapolate
- Plan infrastructure expansions 3-6 months ahead

---

## Contact Information

**Primary Contact**: Tony (@tony)
**Team**: BZZZ Infrastructure Team
**Documentation**: https://wiki.chorus.services/bzzz
**Source Code**: https://gitea.chorus.services/tony/BZZZ

**Last Updated**: 2025-01-01
**Version**: 2.0
**Review Date**: 2025-04-01

infrastructure/monitoring/configs/enhanced-alert-rules.yml (new file, 511 lines)
@@ -0,0 +1,511 @@
# Enhanced Alert Rules for BZZZ v2 Infrastructure
# Service Level Objectives and Critical System Alerts

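# Validate this file before deploying, e.g.:
#   promtool check rules enhanced-alert-rules.yml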
groups:
  # === System Health and SLO Alerts ===
  - name: bzzz_system_health
    rules:
      # Overall system health score
      - alert: BZZZSystemHealthCritical
        expr: bzzz_system_health_score < 0.5
        for: 2m
        labels:
          severity: critical
          service: bzzz
          slo: availability
        annotations:
          summary: "BZZZ system health is critically low"
          description: "System health score {{ $value }} is below critical threshold (0.5)"
          runbook_url: "https://wiki.chorus.services/runbooks/bzzz-health-critical"

      - alert: BZZZSystemHealthDegraded
        expr: bzzz_system_health_score < 0.8
        for: 5m
        labels:
          severity: warning
          service: bzzz
          slo: availability
        annotations:
          summary: "BZZZ system health is degraded"
          description: "System health score {{ $value }} is below warning threshold (0.8)"
          runbook_url: "https://wiki.chorus.services/runbooks/bzzz-health-degraded"

      # Component health monitoring
      - alert: BZZZComponentUnhealthy
        expr: bzzz_component_health_score < 0.7
        for: 3m
        labels:
          severity: warning
          service: bzzz
          component: "{{ $labels.component }}"
        annotations:
          summary: "BZZZ component {{ $labels.component }} is unhealthy"
          description: "Component {{ $labels.component }} health score {{ $value }} is below threshold"

  # === P2P Network Alerts ===
  - name: bzzz_p2p_network
    rules:
      # Peer connectivity SLO: Maintain at least 3 connected peers
      - alert: BZZZInsufficientPeers
        expr: bzzz_p2p_connected_peers < 3
        for: 1m
        labels:
          severity: critical
          service: bzzz
          component: p2p
          slo: connectivity
        annotations:
          summary: "BZZZ has insufficient P2P peers"
          description: "Only {{ $value }} peers connected, minimum required is 3"
          runbook_url: "https://wiki.chorus.services/runbooks/bzzz-peer-connectivity"

      # Message latency SLO: 95th percentile < 500ms
      - alert: BZZZP2PHighLatency
        expr: histogram_quantile(0.95, rate(bzzz_p2p_message_latency_seconds_bucket[5m])) > 0.5
        for: 3m
        labels:
          severity: warning
          service: bzzz
          component: p2p
          slo: latency
        annotations:
          summary: "BZZZ P2P message latency is high"
          description: "95th percentile latency {{ $value }}s exceeds 500ms SLO"
          runbook_url: "https://wiki.chorus.services/runbooks/bzzz-p2p-latency"

      # Message loss detection
      - alert: BZZZP2PMessageLoss
        expr: rate(bzzz_p2p_messages_sent_total[5m]) - rate(bzzz_p2p_messages_received_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
          service: bzzz
          component: p2p
        annotations:
          summary: "BZZZ P2P message loss detected"
          description: "Message send/receive imbalance: {{ $value }} messages/sec"

  # === DHT Performance and Reliability ===
  - name: bzzz_dht
    rules:
      # DHT operation success rate SLO: > 99%
      - alert: BZZZDHTLowSuccessRate
        expr: (rate(bzzz_dht_put_operations_total{status="success"}[5m]) + rate(bzzz_dht_get_operations_total{status="success"}[5m])) / (rate(bzzz_dht_put_operations_total[5m]) + rate(bzzz_dht_get_operations_total[5m])) < 0.99
        for: 2m
        labels:
          severity: warning
          service: bzzz
          component: dht
          slo: success_rate
        annotations:
          summary: "BZZZ DHT operation success rate is low"
          description: "DHT success rate {{ $value | humanizePercentage }} is below 99% SLO"
          runbook_url: "https://wiki.chorus.services/runbooks/bzzz-dht-success-rate"

      # DHT operation latency SLO: 95th percentile < 300ms for gets
      - alert: BZZZDHTHighGetLatency
        expr: histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket{operation="get"}[5m])) > 0.3
        for: 3m
        labels:
          severity: warning
          service: bzzz
          component: dht
          slo: latency
        annotations:
          summary: "BZZZ DHT get operations are slow"
          description: "95th percentile get latency {{ $value }}s exceeds 300ms SLO"

      # DHT replication health
      - alert: BZZZDHTReplicationDegraded
        expr: avg(bzzz_dht_replication_factor) < 2
        for: 5m
        labels:
          severity: warning
          service: bzzz
          component: dht
          slo: durability
        annotations:
          summary: "BZZZ DHT replication is degraded"
          description: "Average replication factor {{ $value }} is below target of 3"
          runbook_url: "https://wiki.chorus.services/runbooks/bzzz-dht-replication"

      # Provider record staleness
      - alert: BZZZDHTStaleProviders
        expr: increase(bzzz_dht_provider_records[1h]) == 0 and bzzz_dht_content_keys > 0
        for: 10m
        labels:
          severity: warning
          service: bzzz
          component: dht
        annotations:
          summary: "BZZZ DHT provider records are not updating"
          description: "No provider record updates in the last hour despite having content"

  # === Election System Stability ===
  - name: bzzz_election
    rules:
      # Leadership stability: Avoid frequent leadership changes
      - alert: BZZZFrequentLeadershipChanges
        expr: increase(bzzz_leadership_changes_total[1h]) > 3
        for: 0m
        labels:
          severity: warning
          service: bzzz
          component: election
        annotations:
          summary: "BZZZ leadership is unstable"
          description: "{{ $value }} leadership changes in the last hour"
          runbook_url: "https://wiki.chorus.services/runbooks/bzzz-leadership-instability"

      # Election timeout
      - alert: BZZZElectionInProgress
        expr: bzzz_election_state{state="electing"} == 1
        for: 2m
        labels:
          severity: warning
          service: bzzz
          component: election
        annotations:
          summary: "BZZZ election taking too long"
          description: "Election has been in progress for more than 2 minutes"

      # No admin elected
      - alert: BZZZNoAdminElected
        # "on()" is needed because absent() yields an empty label set; with a
        # plain "and" the sides would never match and the alert could not fire
        expr: bzzz_election_state{state="idle"} == 1 and on() absent(bzzz_heartbeats_received_total)
        for: 1m
        labels:
          severity: critical
          service: bzzz
          component: election
        annotations:
          summary: "BZZZ has no elected admin"
          description: "System is idle but no heartbeats are being received"
          runbook_url: "https://wiki.chorus.services/runbooks/bzzz-no-admin"

      # Heartbeat monitoring
      - alert: BZZZHeartbeatMissing
        expr: increase(bzzz_heartbeats_received_total[2m]) == 0
        for: 1m
        labels:
          severity: critical
          service: bzzz
          component: election
        annotations:
          summary: "BZZZ admin heartbeat missing"
          description: "No heartbeats received from admin in the last 2 minutes"

  # === PubSub Messaging System ===
  - name: bzzz_pubsub
    rules:
      # Message processing rate
      - alert: BZZZPubSubHighMessageRate
        expr: rate(bzzz_pubsub_messages_total[1m]) > 1000
        for: 2m
        labels:
          severity: warning
          service: bzzz
          component: pubsub
        annotations:
          summary: "BZZZ PubSub message rate is very high"
          description: "Processing {{ $value }} messages/sec, may indicate spam or DoS"

      # Message latency
      - alert: BZZZPubSubHighLatency
        expr: histogram_quantile(0.95, rate(bzzz_pubsub_message_latency_seconds_bucket[5m])) > 1.0
        for: 3m
        labels:
          severity: warning
          service: bzzz
          component: pubsub
          slo: latency
        annotations:
          summary: "BZZZ PubSub message latency is high"
          description: "95th percentile latency {{ $value }}s exceeds 1s threshold"

      # Topic monitoring
      - alert: BZZZPubSubNoTopics
        expr: bzzz_pubsub_topics == 0
        for: 5m
        labels:
          severity: warning
          service: bzzz
          component: pubsub
        annotations:
          summary: "BZZZ PubSub has no active topics"
          description: "No PubSub topics are active, system may be isolated"

  # === Task Management and Processing ===
  - name: bzzz_tasks
    rules:
      # Task queue backup
      - alert: BZZZTaskQueueBackup
        expr: bzzz_tasks_queued > 100
        for: 5m
        labels:
          severity: warning
          service: bzzz
          component: tasks
        annotations:
          summary: "BZZZ task queue is backing up"
          description: "{{ $value }} tasks are queued, may indicate processing issues"
          runbook_url: "https://wiki.chorus.services/runbooks/bzzz-task-queue"

      # Task success rate SLO: > 95%
      - alert: BZZZTaskLowSuccessRate
        expr: rate(bzzz_tasks_completed_total{status="success"}[10m]) / rate(bzzz_tasks_completed_total[10m]) < 0.95
        for: 5m
        labels:
          severity: warning
          service: bzzz
          component: tasks
          slo: success_rate
        annotations:
          summary: "BZZZ task success rate is low"
          description: "Task success rate {{ $value | humanizePercentage }} is below 95% SLO"

      # Task processing latency
      - alert: BZZZTaskHighProcessingTime
        expr: histogram_quantile(0.95, rate(bzzz_task_duration_seconds_bucket[5m])) > 300
        for: 3m
        labels:
          severity: warning
          service: bzzz
          component: tasks
        annotations:
          summary: "BZZZ task processing time is high"
          description: "95th percentile task duration {{ $value }}s exceeds 5 minutes"

  # === SLURP Context Generation ===
  - name: bzzz_slurp
    rules:
      # Context generation success rate
      - alert: BZZZSLURPLowSuccessRate
        expr: rate(bzzz_slurp_contexts_generated_total{status="success"}[10m]) / rate(bzzz_slurp_contexts_generated_total[10m]) < 0.90
        for: 5m
        labels:
          severity: warning
          service: bzzz
          component: slurp
        annotations:
          summary: "SLURP context generation success rate is low"
          description: "Success rate {{ $value | humanizePercentage }} is below 90%"
          runbook_url: "https://wiki.chorus.services/runbooks/bzzz-slurp-generation"

      # Generation queue backup
      - alert: BZZZSLURPQueueBackup
        expr: bzzz_slurp_queue_length > 50
        for: 10m
        labels:
          severity: warning
          service: bzzz
          component: slurp
        annotations:
          summary: "SLURP generation queue is backing up"
          description: "{{ $value }} contexts are queued for generation"

      # Generation time SLO: 95th percentile < 2 minutes
      - alert: BZZZSLURPSlowGeneration
        expr: histogram_quantile(0.95, rate(bzzz_slurp_generation_time_seconds_bucket[10m])) > 120
        for: 5m
        labels:
          severity: warning
          service: bzzz
          component: slurp
          slo: latency
        annotations:
          summary: "SLURP context generation is slow"
          description: "95th percentile generation time {{ $value }}s exceeds 2 minutes"

  # === UCXI Protocol Resolution ===
  - name: bzzz_ucxi
    rules:
      # Resolution success rate SLO: > 99%
      - alert: BZZZUCXILowSuccessRate
        expr: rate(bzzz_ucxi_requests_total{status=~"2.."}[5m]) / rate(bzzz_ucxi_requests_total[5m]) < 0.99
        for: 3m
        labels:
          severity: warning
          service: bzzz
          component: ucxi
          slo: success_rate
        annotations:
          summary: "UCXI resolution success rate is low"
          description: "Success rate {{ $value | humanizePercentage }} is below 99% SLO"

      # Resolution latency SLO: 95th percentile < 100ms
      - alert: BZZZUCXIHighLatency
        expr: histogram_quantile(0.95, rate(bzzz_ucxi_resolution_latency_seconds_bucket[5m])) > 0.1
        for: 3m
        labels:
          severity: warning
          service: bzzz
          component: ucxi
          slo: latency
        annotations:
          summary: "UCXI resolution latency is high"
          description: "95th percentile latency {{ $value }}s exceeds 100ms SLO"

  # === Resource Utilization ===
  - name: bzzz_resources
    rules:
      # CPU utilization
      - alert: BZZZHighCPUUsage
        expr: bzzz_cpu_usage_ratio > 0.85
        for: 5m
        labels:
          severity: warning
          service: bzzz
          component: system
        annotations:
          summary: "BZZZ CPU usage is high"
          description: "CPU usage {{ $value | humanizePercentage }} exceeds 85%"

      # Memory utilization
      - alert: BZZZHighMemoryUsage
        expr: bzzz_memory_usage_bytes / (1024*1024*1024) > 8
        for: 3m
        labels:
          severity: warning
          service: bzzz
          component: system
        annotations:
          summary: "BZZZ memory usage is high"
          description: "Memory usage {{ $value | humanize1024 }}B is high"

      # Disk utilization
      - alert: BZZZHighDiskUsage
        expr: bzzz_disk_usage_ratio > 0.90
        for: 5m
        labels:
          severity: critical
          service: bzzz
          component: system
        annotations:
          summary: "BZZZ disk usage is critical"
          description: "Disk usage {{ $value | humanizePercentage }} on {{ $labels.mount_point }} exceeds 90%"

      # Goroutine leak detection
      - alert: BZZZGoroutineLeak
        expr: increase(bzzz_goroutines[30m]) > 1000
        for: 5m
        labels:
          severity: warning
          service: bzzz
          component: system
        annotations:
          summary: "Possible BZZZ goroutine leak"
          description: "Goroutine count increased by {{ $value }} in 30 minutes"

  # === Error Rate Monitoring ===
  - name: bzzz_errors
    rules:
      # General error rate
      - alert: BZZZHighErrorRate
        expr: rate(bzzz_errors_total[5m]) > 10
        for: 2m
        labels:
          severity: warning
          service: bzzz
        annotations:
          summary: "BZZZ error rate is high"
          description: "Error rate {{ $value }} errors/sec in component {{ $labels.component }}"

      # Panic detection
      - alert: BZZZPanicsDetected
        expr: increase(bzzz_panics_total[5m]) > 0
        for: 0m
        labels:
          severity: critical
          service: bzzz
        annotations:
          summary: "BZZZ panic detected"
          description: "{{ $value }} panic(s) occurred in the last 5 minutes"
          runbook_url: "https://wiki.chorus.services/runbooks/bzzz-panic-recovery"

  # === Health Check Monitoring ===
  - name: bzzz_health_checks
    rules:
      # Health check failure rate
      - alert: BZZZHealthCheckFailures
        expr: rate(bzzz_health_checks_failed_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
          service: bzzz
          component: health
        annotations:
          summary: "BZZZ health check failures detected"
          description: "Health check {{ $labels.check_name }} failing at {{ $value }} failures/sec"

      # Critical health check failure
      - alert: BZZZCriticalHealthCheckFailed
        expr: increase(bzzz_health_checks_failed_total{check_name=~".*-enhanced|p2p-connectivity"}[2m]) > 0
        for: 0m
        labels:
          severity: critical
          service: bzzz
          component: health
        annotations:
          summary: "Critical BZZZ health check failed"
          description: "Critical health check {{ $labels.check_name }} failed: {{ $labels.reason }}"

  # === Service Level Indicator Recording Rules ===
  - name: bzzz_sli_recording
    interval: 30s
    rules:
      # DHT operation SLI
      - record: bzzz:dht_success_rate
        # Parentheses are required: "/" binds tighter than "+" in PromQL
        expr: (rate(bzzz_dht_put_operations_total{status="success"}[5m]) + rate(bzzz_dht_get_operations_total{status="success"}[5m])) / (rate(bzzz_dht_put_operations_total[5m]) + rate(bzzz_dht_get_operations_total[5m]))

      # P2P connectivity SLI
      - record: bzzz:p2p_connectivity_ratio
        expr: bzzz_p2p_connected_peers / 10  # Target of 10 peers

      # UCXI success rate SLI
      - record: bzzz:ucxi_success_rate
        expr: rate(bzzz_ucxi_requests_total{status=~"2.."}[5m]) / rate(bzzz_ucxi_requests_total[5m])

      # Task success rate SLI
      - record: bzzz:task_success_rate
        expr: rate(bzzz_tasks_completed_total{status="success"}[5m]) / rate(bzzz_tasks_completed_total[5m])

      # Overall availability SLI
      - record: bzzz:overall_availability
        expr: bzzz_system_health_score

  # === Multi-Window Multi-Burn-Rate Alerts ===
  - name: bzzz_slo_alerts
    rules:
      # Fast burn rate (2% of error budget in 1 hour). A full multi-window
      # alert would AND this with the same test over a shorter window to
      # reduce flapping; only the 5m-based recording rule exists today, so a
      # single condition is used.
      - alert: BZZZErrorBudgetBurnHigh
        expr: (1 - bzzz:dht_success_rate) > (14.4 * 0.01)  # 14.4x burn rate for 99% SLO
        for: 2m
        labels:
          severity: critical
          service: bzzz
          burnrate: fast
          slo: dht_success_rate
        annotations:
          summary: "BZZZ DHT error budget burning fast"
          description: "DHT error budget is burning at over 14.4x the sustainable rate; at this pace a 30-day budget is exhausted in roughly two days"

      # Slow burn rate (10% of error budget in 6 hours)
      - alert: BZZZErrorBudgetBurnSlow
        expr: (1 - bzzz:dht_success_rate) > (6 * 0.01)  # 6x burn rate
        for: 15m
        labels:
          severity: warning
          service: bzzz
          burnrate: slow
          slo: dht_success_rate
        annotations:
          summary: "BZZZ DHT error budget burning slowly"
          description: "DHT error budget depletion rate is concerning"
infrastructure/monitoring/docker-compose.enhanced.yml (new file, 533 lines)
@@ -0,0 +1,533 @@
version: '3.8'

# Enhanced BZZZ Monitoring Stack for Docker Swarm
# Provides comprehensive observability for BZZZ distributed system

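# Deploy as a swarm stack named "bzzz-monitoring" so service names match the
# operational runbook (e.g. bzzz-monitoring_prometheus):
#   docker stack deploy -c docker-compose.enhanced.yml bzzz-monitoring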
services:
|
||||
# Prometheus - Metrics Collection and Alerting
|
||||
prometheus:
|
||||
image: prom/prometheus:v2.45.0
|
||||
networks:
|
||||
- tengig
|
||||
- monitoring
|
||||
ports:
|
||||
- "9090:9090"
|
||||
volumes:
|
||||
- prometheus_data:/prometheus
|
||||
- /rust/bzzz-v2/monitoring/prometheus:/etc/prometheus
|
||||
command:
|
||||
- '--config.file=/etc/prometheus/prometheus.yml'
|
||||
- '--storage.tsdb.path=/prometheus'
|
||||
- '--storage.tsdb.retention.time=30d'
|
||||
- '--storage.tsdb.retention.size=50GB'
|
||||
- '--web.console.libraries=/etc/prometheus/console_libraries'
|
||||
- '--web.console.templates=/etc/prometheus/consoles'
|
||||
- '--web.enable-lifecycle'
|
||||
- '--web.enable-admin-api'
|
||||
- '--web.external-url=https://prometheus.chorus.services'
|
||||
- '--alertmanager.notification-queue-capacity=10000'
|
||||
deploy:
|
||||
replicas: 1
|
||||
placement:
|
||||
constraints:
|
||||
- node.hostname == walnut # Place on main node
|
||||
resources:
|
||||
limits:
|
||||
memory: 4G
|
||||
cpus: '2.0'
|
||||
reservations:
|
||||
memory: 2G
|
||||
cpus: '1.0'
|
||||
restart_policy:
|
||||
condition: on-failure
|
||||
delay: 30s
|
||||
labels:
|
||||
- "traefik.enable=true"
|
||||
- "traefik.http.routers.prometheus.rule=Host(`prometheus.chorus.services`)"
|
||||
- "traefik.http.services.prometheus.loadbalancer.server.port=9090"
|
||||
- "traefik.http.routers.prometheus.tls=true"
|
||||
healthcheck:
|
||||
test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:9090/-/healthy"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
configs:
|
||||
- source: prometheus_config
|
||||
target: /etc/prometheus/prometheus.yml
|
||||
- source: prometheus_alerts
|
||||
target: /etc/prometheus/rules.yml
|
||||
|
||||
# Grafana - Visualization and Dashboards
|
||||
grafana:
|
||||
image: grafana/grafana:10.0.3
|
||||
networks:
|
||||
- tengig
|
||||
- monitoring
|
||||
ports:
|
||||
- "3000:3000"
|
||||
volumes:
|
||||
- grafana_data:/var/lib/grafana
|
||||
- /rust/bzzz-v2/monitoring/grafana/dashboards:/etc/grafana/provisioning/dashboards
|
||||
- /rust/bzzz-v2/monitoring/grafana/datasources:/etc/grafana/provisioning/datasources
|
||||
environment:
|
||||
- GF_SECURITY_ADMIN_PASSWORD__FILE=/run/secrets/grafana_admin_password
|
||||
- GF_INSTALL_PLUGINS=grafana-piechart-panel,grafana-worldmap-panel,vonage-status-panel
|
||||
- GF_FEATURE_TOGGLES_ENABLE=publicDashboards
|
||||
- GF_SERVER_ROOT_URL=https://grafana.chorus.services
|
||||
- GF_ANALYTICS_REPORTING_ENABLED=false
|
||||
- GF_ANALYTICS_CHECK_FOR_UPDATES=false
|
||||
- GF_LOG_LEVEL=warn
|
||||
secrets:
|
||||
- grafana_admin_password
|
||||
deploy:
|
||||
replicas: 1
|
||||
placement:
|
||||
constraints:
|
||||
- node.hostname == walnut
|
||||
resources:
|
||||
limits:
|
||||
memory: 2G
|
||||
cpus: '1.0'
|
||||
reservations:
|
||||
memory: 512M
|
||||
cpus: '0.5'
|
||||
restart_policy:
|
||||
condition: on-failure
|
||||
delay: 10s
|
||||
labels:
|
||||
- "traefik.enable=true"
|
||||
- "traefik.http.routers.grafana.rule=Host(`grafana.chorus.services`)"
|
||||
- "traefik.http.services.grafana.loadbalancer.server.port=3000"
|
||||
- "traefik.http.routers.grafana.tls=true"
|
||||
healthcheck:
|
||||
test: ["CMD-SHELL", "curl -f http://localhost:3000/api/health || exit 1"]
|
||||
interval: 30s
|
||||
timeout: 10s
|
||||
retries: 3
|
||||
|
||||
# AlertManager - Alert Routing and Notification
|
||||
alertmanager:
|
||||
image: prom/alertmanager:v0.25.0
|
||||
networks:
|
||||
- tengig
|
||||
- monitoring
|
||||
ports:
|
||||
- "9093:9093"
|
||||
volumes:
|
||||
- alertmanager_data:/alertmanager
|
||||
- /rust/bzzz-v2/monitoring/alertmanager:/etc/alertmanager
|
||||
command:
|
||||
- '--config.file=/etc/alertmanager/config.yml'
|
||||
- '--storage.path=/alertmanager'
|
||||
- '--web.external-url=https://alerts.chorus.services'
|
||||
- '--web.route-prefix=/'
|
||||
- '--cluster.listen-address=0.0.0.0:9094'
|
||||
- '--log.level=info'
|
||||
deploy:
|
||||
replicas: 1
|
||||
placement:
|
||||
constraints:
|
||||
- node.hostname == ironwood
|
||||
resources:
|
||||
limits:
|
||||
memory: 1G
|
||||
cpus: '0.5'
|
||||
reservations:
|
||||
memory: 256M
|
||||
cpus: '0.25'
|
||||
restart_policy:
|
||||
condition: on-failure
|
||||
labels:
|
||||
- "traefik.enable=true"
|
||||
- "traefik.http.routers.alertmanager.rule=Host(`alerts.chorus.services`)"
|
||||
- "traefik.http.services.alertmanager.loadbalancer.server.port=9093"
|
||||
- "traefik.http.routers.alertmanager.tls=true"
|
||||
configs:
|
||||
- source: alertmanager_config
|
||||
target: /etc/alertmanager/config.yml
|
||||
secrets:
|
||||
- slack_webhook_url
|
||||
- pagerduty_integration_key
|
||||
|
||||
# Node Exporter - System Metrics (deployed on all nodes)
|
||||
node-exporter:
|
||||
image: prom/node-exporter:v1.6.1
|
||||
networks:
|
||||
- monitoring
|
||||
ports:
|
||||
- "9100:9100"
|
||||
volumes:
|
||||
- /proc:/host/proc:ro
|
||||
- /sys:/host/sys:ro
|
||||
- /:/rootfs:ro
|
||||
- /run/systemd/private:/run/systemd/private:ro
|
||||
command:
|
||||
- '--path.procfs=/host/proc'
|
||||
- '--path.sysfs=/host/sys'
|
||||
- '--path.rootfs=/rootfs'
|
||||
- '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
|
||||
- '--collector.systemd'
|
||||
- '--collector.systemd.unit-include=(bzzz|docker|prometheus|grafana)\.service'
|
||||
- '--web.listen-address=0.0.0.0:9100'
|
||||
deploy:
|
||||
mode: global # Deploy on every node
|
||||
resources:
|
||||
limits:
|
||||
memory: 256M
|
||||
cpus: '0.2'
|
||||
reservations:
|
||||
memory: 128M
|
||||
cpus: '0.1'
|
||||
restart_policy:
|
||||
        condition: on-failure

  # cAdvisor - Container Metrics (deployed on all nodes)
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.2
    networks:
      - monitoring
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    deploy:
      mode: global
      resources:
        limits:
          memory: 512M
          cpus: '0.3'
        reservations:
          memory: 256M
          cpus: '0.15'
      restart_policy:
        condition: on-failure
    healthcheck:
      test: ["CMD", "wget", "--quiet", "--tries=1", "--spider", "http://localhost:8080/healthz"]
      interval: 30s
      timeout: 10s
      retries: 3

  # BZZZ P2P Network Exporter - Custom metrics for P2P network health
  bzzz-p2p-exporter:
    image: registry.home.deepblack.cloud/bzzz-p2p-exporter:v2.0.0
    networks:
      - monitoring
      - bzzz-internal
    ports:
      - "9200:9200"
    environment:
      - BZZZ_ENDPOINTS=http://bzzz-agent:9000
      - SCRAPE_INTERVAL=15s
      - LOG_LEVEL=info
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == walnut
      resources:
        limits:
          memory: 256M
          cpus: '0.2'
        reservations:
          memory: 128M
          cpus: '0.1'
      restart_policy:
        condition: on-failure

  # DHT Monitor - DHT-specific metrics and health monitoring
  dht-monitor:
    image: registry.home.deepblack.cloud/bzzz-dht-monitor:v2.0.0
    networks:
      - monitoring
      - bzzz-internal
    ports:
      - "9201:9201"
    environment:
      - DHT_BOOTSTRAP_NODES=walnut:9101,ironwood:9102,acacia:9103
      - REPLICATION_CHECK_INTERVAL=5m
      - PROVIDER_CHECK_INTERVAL=2m
      - LOG_LEVEL=info
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == ironwood
      resources:
        limits:
          memory: 512M
          cpus: '0.3'
        reservations:
          memory: 256M
          cpus: '0.15'
      restart_policy:
        condition: on-failure

  # Content Monitor - Content availability and integrity monitoring
  content-monitor:
    image: registry.home.deepblack.cloud/bzzz-content-monitor:v2.0.0
    networks:
      - monitoring
      - bzzz-internal
    ports:
      - "9202:9202"
    volumes:
      - /rust/bzzz-v2/data/blobs:/app/blobs:ro
    environment:
      - CONTENT_PATH=/app/blobs
      - INTEGRITY_CHECK_INTERVAL=15m
      - AVAILABILITY_CHECK_INTERVAL=5m
      - LOG_LEVEL=info
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == acacia
      resources:
        limits:
          memory: 512M
          cpus: '0.3'
        reservations:
          memory: 256M
          cpus: '0.15'
      restart_policy:
        condition: on-failure

  # OpenAI Cost Monitor - Track OpenAI API usage and costs
  openai-cost-monitor:
    image: registry.home.deepblack.cloud/bzzz-openai-cost-monitor:v2.0.0
    networks:
      - monitoring
      - bzzz-internal
    ports:
      - "9203:9203"
    environment:
      - OPENAI_PROXY_ENDPOINT=http://openai-proxy:3002
      - COST_TRACKING_ENABLED=true
      - POSTGRES_HOST=postgres
      - LOG_LEVEL=info
    secrets:
      - postgres_password
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == walnut
      resources:
        limits:
          memory: 256M
          cpus: '0.2'
        reservations:
          memory: 128M
          cpus: '0.1'
      restart_policy:
        condition: on-failure

  # Blackbox Exporter - External endpoint monitoring
  blackbox-exporter:
    image: prom/blackbox-exporter:v0.24.0
    networks:
      - monitoring
      - tengig
    ports:
      - "9115:9115"
    volumes:
      - /rust/bzzz-v2/monitoring/blackbox:/etc/blackbox_exporter
    command:
      - '--config.file=/etc/blackbox_exporter/config.yml'
      - '--web.listen-address=0.0.0.0:9115'
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == ironwood
      resources:
        limits:
          memory: 128M
          cpus: '0.1'
        reservations:
          memory: 64M
          cpus: '0.05'
      restart_policy:
        condition: on-failure
    configs:
      - source: blackbox_config
        target: /etc/blackbox_exporter/config.yml
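
  # The blackbox_config above is maintained externally (bzzz_blackbox_config_v2).
  # For reference, a minimal sketch of what such a blackbox exporter config might
  # contain -- the module names and probe settings here are assumptions, not the
  # deployed configuration:
  #
  #   modules:
  #     http_2xx:
  #       prober: http
  #       timeout: 5s
  #       http:
  #         method: GET
  #     icmp:
  #       prober: icmp
  #       timeout: 5s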

  # Loki - Log Aggregation
  loki:
    image: grafana/loki:2.8.0
    networks:
      - monitoring
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki
      - /rust/bzzz-v2/monitoring/loki:/etc/loki
    command:
      - '-config.file=/etc/loki/config.yml'
      - '-target=all'
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == walnut
      resources:
        limits:
          memory: 2G
          cpus: '1.0'
        reservations:
          memory: 1G
          cpus: '0.5'
      restart_policy:
        condition: on-failure
    configs:
      - source: loki_config
        target: /etc/loki/config.yml

  # Promtail - Log Collection Agent (deployed on all nodes)
  promtail:
    image: grafana/promtail:2.8.0
    networks:
      - monitoring
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - /rust/bzzz-v2/monitoring/promtail:/etc/promtail
    command:
      - '-config.file=/etc/promtail/config.yml'
      - '-server.http-listen-port=9080'
    deploy:
      mode: global
      resources:
        limits:
          memory: 256M
          cpus: '0.2'
        reservations:
          memory: 128M
          cpus: '0.1'
      restart_policy:
        condition: on-failure
    configs:
      - source: promtail_config
        target: /etc/promtail/config.yml

  # Jaeger - Distributed Tracing (Optional)
  jaeger:
    image: jaegertracing/all-in-one:1.47
    networks:
      - monitoring
      - bzzz-internal
    ports:
      - "14268:14268"  # HTTP collector
      - "16686:16686"  # Web UI
    environment:
      - COLLECTOR_OTLP_ENABLED=true
      - SPAN_STORAGE_TYPE=memory
    deploy:
      replicas: 1
      placement:
        constraints:
          - node.hostname == acacia
      resources:
        limits:
          memory: 1G
          cpus: '0.5'
        reservations:
          memory: 512M
          cpus: '0.25'
      restart_policy:
        condition: on-failure
      labels:
        - "traefik.enable=true"
        - "traefik.http.routers.jaeger.rule=Host(`tracing.chorus.services`)"
        - "traefik.http.services.jaeger.loadbalancer.server.port=16686"
        - "traefik.http.routers.jaeger.tls=true"

networks:
  tengig:
    external: true
  monitoring:
    driver: overlay
    internal: true
    attachable: false
    ipam:
      driver: default
      config:
        - subnet: 10.201.0.0/16
  bzzz-internal:
    external: true

volumes:
  prometheus_data:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.1.27,rw,sync
      device: ":/rust/bzzz-v2/monitoring/prometheus/data"

  grafana_data:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.1.27,rw,sync
      device: ":/rust/bzzz-v2/monitoring/grafana/data"

  alertmanager_data:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.1.27,rw,sync
      device: ":/rust/bzzz-v2/monitoring/alertmanager/data"

  loki_data:
    driver: local
    driver_opts:
      type: nfs
      o: addr=192.168.1.27,rw,sync
      device: ":/rust/bzzz-v2/monitoring/loki/data"
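
# All four data volumes mount NFS exports from 192.168.1.27. A quick pre-flight
# check from any node before deploying (assumes the NFS client tools are
# installed); if an export is missing, the service that mounts it will fail to
# schedule:
#
#   showmount -e 192.168.1.27 | grep /rust/bzzz-v2/monitoring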
secrets:
  grafana_admin_password:
    external: true
    name: bzzz_grafana_admin_password

  slack_webhook_url:
    external: true
    name: bzzz_slack_webhook_url

  pagerduty_integration_key:
    external: true
    name: bzzz_pagerduty_integration_key

  postgres_password:
    external: true
    name: bzzz_postgres_password

configs:
  prometheus_config:
    external: true
    name: bzzz_prometheus_config_v2

  prometheus_alerts:
    external: true
    name: bzzz_prometheus_alerts_v2

  alertmanager_config:
    external: true
    name: bzzz_alertmanager_config_v2

  blackbox_config:
    external: true
    name: bzzz_blackbox_config_v2

  loki_config:
    external: true
    name: bzzz_loki_config_v2

  promtail_config:
    external: true
    name: bzzz_promtail_config_v2
615
infrastructure/scripts/deploy-enhanced-monitoring.sh
Executable file
@@ -0,0 +1,615 @@
#!/bin/bash

# BZZZ Enhanced Monitoring Stack Deployment Script
# Deploys comprehensive monitoring, metrics, and health checking infrastructure

set -euo pipefail

# Script configuration
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
PROJECT_ROOT="$(cd "$SCRIPT_DIR/.." && pwd)"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
LOG_FILE="/tmp/bzzz-deploy-${TIMESTAMP}.log"

# Colors for output
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
BLUE='\033[0;34m'
NC='\033[0m' # No Color

# Configuration
ENVIRONMENT=${ENVIRONMENT:-"production"}
DRY_RUN=${DRY_RUN:-"false"}
BACKUP_EXISTING=${BACKUP_EXISTING:-"true"}
HEALTH_CHECK_TIMEOUT=${HEALTH_CHECK_TIMEOUT:-300}

# Docker configuration
DOCKER_REGISTRY="registry.home.deepblack.cloud"
STACK_NAME="bzzz-monitoring-v2"
CONFIG_VERSION="v2"

# Logging function
log() {
    local level=$1
    shift
    local message="$*"
    local timestamp=$(date '+%Y-%m-%d %H:%M:%S')

    case $level in
        ERROR)
            echo -e "${RED}[ERROR]${NC} $message" >&2
            ;;
        WARN)
            echo -e "${YELLOW}[WARN]${NC} $message"
            ;;
        INFO)
            echo -e "${GREEN}[INFO]${NC} $message"
            ;;
        DEBUG)
            echo -e "${BLUE}[DEBUG]${NC} $message"
            ;;
    esac

    echo "[$timestamp] [$level] $message" >> "$LOG_FILE"
}

# Error handler
error_handler() {
    local line_no=$1
    log ERROR "Script failed at line $line_no"
    log ERROR "Check log file: $LOG_FILE"
    exit 1
}
trap 'error_handler $LINENO' ERR

# Check prerequisites
check_prerequisites() {
    log INFO "Checking prerequisites..."

    # Check if running on a Docker Swarm manager.
    # Swarm.ControlAvailable is true only on manager nodes; Swarm.LocalNodeState
    # reads "active" on workers too, so it is not a sufficient check.
    if ! docker info --format '{{.Swarm.ControlAvailable}}' | grep -q "true"; then
        log ERROR "This script must be run on a Docker Swarm manager node"
        exit 1
    fi

    # Check required tools
    local required_tools=("docker" "jq" "curl")
    for tool in "${required_tools[@]}"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            log ERROR "Required tool not found: $tool"
            exit 1
        fi
    done

    # Check network connectivity to registry
    if ! docker pull "$DOCKER_REGISTRY/bzzz:v2.0.0" >/dev/null 2>&1; then
        log WARN "Unable to pull from registry, using local images"
    fi

    log INFO "Prerequisites check completed"
}

# Create necessary directories
setup_directories() {
    log INFO "Setting up directories..."

    local dirs=(
        "/rust/bzzz-v2/monitoring/prometheus/data"
        "/rust/bzzz-v2/monitoring/grafana/data"
        "/rust/bzzz-v2/monitoring/alertmanager/data"
        "/rust/bzzz-v2/monitoring/loki/data"
        "/rust/bzzz-v2/backups/monitoring"
    )

    for dir in "${dirs[@]}"; do
        if [[ "$DRY_RUN" != "true" ]]; then
            sudo mkdir -p "$dir"
            sudo chown -R 65534:65534 "$dir"  # nobody user for containers
        fi
        log DEBUG "Created directory: $dir"
    done
}

# Backup existing configuration
backup_existing_config() {
    if [[ "$BACKUP_EXISTING" != "true" ]]; then
        log INFO "Skipping backup (BACKUP_EXISTING=false)"
        return
    fi

    log INFO "Backing up existing configuration..."

    local backup_dir="/rust/bzzz-v2/backups/monitoring/backup_${TIMESTAMP}"

    if [[ "$DRY_RUN" != "true" ]]; then
        mkdir -p "$backup_dir"

        # Backup Docker secrets (metadata only; secret payloads are never
        # returned by inspect)
        docker secret ls --filter name=bzzz_ --format "{{.Name}}" | while read -r secret; do
            if docker secret inspect "$secret" >/dev/null 2>&1; then
                docker secret inspect "$secret" > "$backup_dir/${secret}.json"
                log DEBUG "Backed up secret: $secret"
            fi
        done

        # Backup Docker configs
        docker config ls --filter name=bzzz_ --format "{{.Name}}" | while read -r config; do
            if docker config inspect "$config" >/dev/null 2>&1; then
                docker config inspect "$config" > "$backup_dir/${config}.json"
                log DEBUG "Backed up config: $config"
            fi
        done

        # Backup service definitions
        if docker stack services "$STACK_NAME" >/dev/null 2>&1; then
            docker stack services "$STACK_NAME" --format "{{.Name}}" | while read -r service; do
                docker service inspect "$service" > "$backup_dir/${service}-service.json"
            done
        fi
    fi

    log INFO "Backup completed: $backup_dir"
}

# Create Docker secrets
create_secrets() {
    log INFO "Creating Docker secrets..."

    local secrets=(
        "bzzz_grafana_admin_password:$(openssl rand -base64 32)"
        "bzzz_postgres_password:$(openssl rand -base64 32)"
    )

    # Check if secrets directory exists
    local secrets_dir="$HOME/chorus/business/secrets"
    if [[ -d "$secrets_dir" ]]; then
        # Use existing secrets if available
        if [[ -f "$secrets_dir/grafana-admin-password" ]]; then
            secrets[0]="bzzz_grafana_admin_password:$(cat "$secrets_dir/grafana-admin-password")"
        fi
        if [[ -f "$secrets_dir/postgres-password" ]]; then
            secrets[1]="bzzz_postgres_password:$(cat "$secrets_dir/postgres-password")"
        fi
    fi

    for secret_def in "${secrets[@]}"; do
        local secret_name="${secret_def%%:*}"
        local secret_value="${secret_def#*:}"

        if docker secret inspect "$secret_name" >/dev/null 2>&1; then
            log DEBUG "Secret already exists: $secret_name"
        else
            if [[ "$DRY_RUN" != "true" ]]; then
                # printf avoids embedding a trailing newline in the secret value
                printf '%s' "$secret_value" | docker secret create "$secret_name" -
                log INFO "Created secret: $secret_name"
            else
                log DEBUG "Would create secret: $secret_name"
            fi
        fi
    done
}

# Create Docker configs
create_configs() {
    log INFO "Creating Docker configs..."

    local configs=(
        "bzzz_prometheus_config_${CONFIG_VERSION}:${PROJECT_ROOT}/monitoring/configs/prometheus.yml"
        "bzzz_prometheus_alerts_${CONFIG_VERSION}:${PROJECT_ROOT}/monitoring/configs/enhanced-alert-rules.yml"
        "bzzz_grafana_datasources_${CONFIG_VERSION}:${PROJECT_ROOT}/monitoring/configs/grafana-datasources.yml"
        "bzzz_alertmanager_config_${CONFIG_VERSION}:${PROJECT_ROOT}/monitoring/configs/alertmanager.yml"
    )

    for config_def in "${configs[@]}"; do
        local config_name="${config_def%%:*}"
        local config_file="${config_def#*:}"

        if [[ ! -f "$config_file" ]]; then
            log WARN "Config file not found: $config_file"
            continue
        fi

        if docker config inspect "$config_name" >/dev/null 2>&1; then
            log DEBUG "Config already exists: $config_name"
            # Remove the old unversioned config if it exists
            if [[ "$DRY_RUN" != "true" ]]; then
                local old_config_name="${config_name%_${CONFIG_VERSION}}"
                if docker config inspect "$old_config_name" >/dev/null 2>&1; then
                    docker config rm "$old_config_name" || true
                fi
            fi
        else
            if [[ "$DRY_RUN" != "true" ]]; then
                docker config create "$config_name" "$config_file"
                log INFO "Created config: $config_name"
            else
                log DEBUG "Would create config: $config_name from $config_file"
            fi
        fi
    done
}

# Create missing config files
create_missing_configs() {
    log INFO "Creating missing configuration files..."

    # Create Grafana datasources config
    local grafana_datasources="${PROJECT_ROOT}/monitoring/configs/grafana-datasources.yml"
    if [[ ! -f "$grafana_datasources" ]]; then
        cat > "$grafana_datasources" <<EOF
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: true

  - name: Jaeger
    type: jaeger
    access: proxy
    url: http://jaeger:16686
    editable: true
EOF
        log INFO "Created Grafana datasources config"
    fi

    # Create AlertManager config
    local alertmanager_config="${PROJECT_ROOT}/monitoring/configs/alertmanager.yml"
    if [[ ! -f "$alertmanager_config" ]]; then
        cat > "$alertmanager_config" <<EOF
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: 'alerts@chorus.services'
  slack_api_url_file: '/run/secrets/slack_webhook_url'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        service: bzzz
      receiver: 'bzzz-alerts'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#bzzz-alerts'
        title: 'BZZZ Alert: {{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'critical-alerts'
    slack_configs:
      - channel: '#bzzz-critical'
        title: 'CRITICAL: {{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'bzzz-alerts'
    slack_configs:
      - channel: '#bzzz-alerts'
        title: 'BZZZ: {{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
EOF
        log INFO "Created AlertManager config"
    fi
}

# Deploy monitoring stack
deploy_monitoring_stack() {
    log INFO "Deploying monitoring stack..."

    local compose_file="${PROJECT_ROOT}/monitoring/docker-compose.enhanced.yml"

    if [[ ! -f "$compose_file" ]]; then
        log ERROR "Compose file not found: $compose_file"
        exit 1
    fi

    if [[ "$DRY_RUN" != "true" ]]; then
        # Deploy the stack
        docker stack deploy -c "$compose_file" "$STACK_NAME"
        log INFO "Stack deployment initiated: $STACK_NAME"

        # Wait for services to be ready
        log INFO "Waiting for services to be ready..."
        local max_attempts=30
        local attempt=0

        while [[ $attempt -lt $max_attempts ]]; do
            local ready_services=0
            local total_services=0

            # Count ready services
            while read -r service; do
                total_services=$((total_services + 1))
                local replicas_info
                replicas_info=$(docker service ls --filter name="$service" --format "{{.Replicas}}")

                if [[ "$replicas_info" =~ ^([0-9]+)/([0-9]+)$ ]]; then
                    local current="${BASH_REMATCH[1]}"
                    local desired="${BASH_REMATCH[2]}"

                    if [[ "$current" -eq "$desired" ]]; then
                        ready_services=$((ready_services + 1))
                    fi
                fi
            done < <(docker stack services "$STACK_NAME" --format "{{.Name}}")

            if [[ $ready_services -eq $total_services ]]; then
                log INFO "All services are ready ($ready_services/$total_services)"
                break
            else
                log DEBUG "Services ready: $ready_services/$total_services"
                sleep 10
                attempt=$((attempt + 1))
            fi
        done

        if [[ $attempt -eq $max_attempts ]]; then
            log WARN "Timeout waiting for all services to be ready"
        fi
    else
        log DEBUG "Would deploy stack with compose file: $compose_file"
    fi
}

# Perform health checks
perform_health_checks() {
    log INFO "Performing health checks..."

    if [[ "$DRY_RUN" == "true" ]]; then
        log DEBUG "Skipping health checks in dry run mode"
        return
    fi

    local endpoints=(
        "http://localhost:9090/-/healthy:Prometheus"
        "http://localhost:3000/api/health:Grafana"
        "http://localhost:9093/-/healthy:AlertManager"
    )

    local max_attempts=$((HEALTH_CHECK_TIMEOUT / 10))
    local attempt=0

    while [[ $attempt -lt $max_attempts ]]; do
        local healthy_endpoints=0

        for endpoint_def in "${endpoints[@]}"; do
            # URLs contain colons, so split on the *last* colon: everything
            # before it is the endpoint, everything after is the service name
            local endpoint="${endpoint_def%:*}"
            local service="${endpoint_def##*:}"

            if curl -sf "$endpoint" >/dev/null 2>&1; then
                healthy_endpoints=$((healthy_endpoints + 1))
                log DEBUG "Health check passed: $service"
            else
                log DEBUG "Health check pending: $service"
            fi
        done

        if [[ $healthy_endpoints -eq ${#endpoints[@]} ]]; then
            log INFO "All health checks passed"
            return
        fi

        sleep 10
        attempt=$((attempt + 1))
    done

    log WARN "Some health checks failed after ${HEALTH_CHECK_TIMEOUT}s timeout"
}

# Validate deployment
validate_deployment() {
    log INFO "Validating deployment..."

    if [[ "$DRY_RUN" == "true" ]]; then
        log DEBUG "Skipping validation in dry run mode"
        return
    fi

    # Check stack services
    local services
    services=$(docker stack services "$STACK_NAME" --format "{{.Name}}" | wc -l)
    log INFO "Deployed services: $services"

    # Check if Prometheus is collecting metrics
    sleep 30  # Allow time for initial metric collection

    if curl -sf "http://localhost:9090/api/v1/query?query=up" | jq -r '.data.result | length' | grep -q "^[1-9]"; then
        log INFO "Prometheus is collecting metrics"
    else
        log WARN "Prometheus may not be collecting metrics yet"
    fi

    # Check if Grafana can connect to Prometheus
    if curl -sf "http://admin:admin@localhost:3000/api/datasources/proxy/1/api/v1/query?query=up" >/dev/null 2>&1; then
        log INFO "Grafana can connect to Prometheus"
    else
        log WARN "Grafana datasource connection may be pending"
    fi

    # Check AlertManager configuration
    if curl -sf "http://localhost:9093/api/v1/status" >/dev/null 2>&1; then
        log INFO "AlertManager is operational"
    else
        log WARN "AlertManager may not be ready"
    fi
}

# Import Grafana dashboards
import_dashboards() {
    log INFO "Importing Grafana dashboards..."

    if [[ "$DRY_RUN" == "true" ]]; then
        log DEBUG "Skipping dashboard import in dry run mode"
        return
    fi

    # Wait for Grafana to be ready
    local max_attempts=30
    local attempt=0

    while [[ $attempt -lt $max_attempts ]]; do
        if curl -sf "http://admin:admin@localhost:3000/api/health" >/dev/null 2>&1; then
            break
        fi
        sleep 5
        attempt=$((attempt + 1))
    done

    if [[ $attempt -eq $max_attempts ]]; then
        log WARN "Grafana not ready for dashboard import"
        return
    fi

    # Import dashboards
    local dashboard_dir="${PROJECT_ROOT}/monitoring/grafana-dashboards"
    if [[ -d "$dashboard_dir" ]]; then
        for dashboard_file in "$dashboard_dir"/*.json; do
            if [[ -f "$dashboard_file" ]]; then
                local dashboard_name
                dashboard_name=$(basename "$dashboard_file" .json)

                if curl -X POST \
                    -H "Content-Type: application/json" \
                    -d "@$dashboard_file" \
                    "http://admin:admin@localhost:3000/api/dashboards/db" \
                    >/dev/null 2>&1; then
                    log INFO "Imported dashboard: $dashboard_name"
                else
                    log WARN "Failed to import dashboard: $dashboard_name"
                fi
            fi
        done
    fi
}

# Generate deployment report
generate_report() {
    log INFO "Generating deployment report..."

    local report_file="/tmp/bzzz-monitoring-deployment-report-${TIMESTAMP}.txt"

    cat > "$report_file" <<EOF
BZZZ Enhanced Monitoring Stack Deployment Report
================================================

Deployment Time: $(date)
Environment: $ENVIRONMENT
Stack Name: $STACK_NAME
Dry Run: $DRY_RUN

Services Deployed:
EOF

    if [[ "$DRY_RUN" != "true" ]]; then
        docker stack services "$STACK_NAME" --format " - {{.Name}}: {{.Replicas}}" >> "$report_file"

        echo "" >> "$report_file"
        echo "Service Health:" >> "$report_file"

        # Add health check results
        local health_endpoints=(
            "http://localhost:9090/-/healthy:Prometheus"
            "http://localhost:3000/api/health:Grafana"
            "http://localhost:9093/-/healthy:AlertManager"
        )

        for endpoint_def in "${health_endpoints[@]}"; do
            # Split on the last colon (URLs contain colons)
            local endpoint="${endpoint_def%:*}"
            local service="${endpoint_def##*:}"

            if curl -sf "$endpoint" >/dev/null 2>&1; then
                echo " - $service: ✅ Healthy" >> "$report_file"
            else
                echo " - $service: ❌ Unhealthy" >> "$report_file"
            fi
        done
    else
        echo " [Dry run mode - no services deployed]" >> "$report_file"
    fi

    cat >> "$report_file" <<EOF

Access URLs:
- Grafana: http://localhost:3000 (admin/admin)
- Prometheus: http://localhost:9090
- AlertManager: http://localhost:9093

Configuration:
- Log file: $LOG_FILE
- Backup directory: /rust/bzzz-v2/backups/monitoring/backup_${TIMESTAMP}
- Config version: $CONFIG_VERSION

Next Steps:
1. Change default Grafana admin password
2. Configure notification channels in AlertManager
3. Review and customize alert rules
4. Set up external authentication (optional)

EOF

    log INFO "Deployment report generated: $report_file"

    # Display report
    echo ""
    echo "=========================================="
    cat "$report_file"
    echo "=========================================="
}

# Main execution
main() {
    log INFO "Starting BZZZ Enhanced Monitoring Stack deployment"
    log INFO "Environment: $ENVIRONMENT, Dry Run: $DRY_RUN"
    log INFO "Log file: $LOG_FILE"

    check_prerequisites
    setup_directories
    backup_existing_config
    create_missing_configs
    create_secrets
    create_configs
    deploy_monitoring_stack
    perform_health_checks
    validate_deployment
    import_dashboards
    generate_report

    log INFO "Deployment completed successfully!"

    if [[ "$DRY_RUN" != "true" ]]; then
        echo ""
        echo "🎉 BZZZ Enhanced Monitoring Stack is now running!"
        echo "📊 Grafana Dashboard: http://localhost:3000"
        echo "📈 Prometheus: http://localhost:9090"
        echo "🚨 AlertManager: http://localhost:9093"
        echo ""
        echo "Next steps:"
        echo "1. Change default Grafana password"
        echo "2. Configure alert notification channels"
        echo "3. Review monitoring dashboards"
        echo "4. Run reliability tests: ./infrastructure/testing/run-tests.sh all"
    fi
}

# Script execution
if [[ "${BASH_SOURCE[0]}" == "${0}" ]]; then
    main "$@"
fi
686
infrastructure/testing/RELIABILITY_TESTING_PLAN.md
Normal file
@@ -0,0 +1,686 @@
# BZZZ Infrastructure Reliability Testing Plan

## Overview

This document outlines comprehensive testing procedures to validate the reliability, performance, and operational readiness of the BZZZ distributed system infrastructure enhancements.

## Test Categories

### 1. Component Health Testing
### 2. Integration Testing
### 3. Chaos Engineering
### 4. Performance Testing
### 5. Monitoring and Alerting Validation
### 6. Disaster Recovery Testing

---

## 1. Component Health Testing

### 1.1 Enhanced Health Checks Validation

**Objective**: Verify enhanced health check implementations work correctly.

#### Test Cases

**TC-01: PubSub Health Probes**
```bash
# Test PubSub round-trip functionality
curl -X POST http://bzzz-agent:8080/test/pubsub-health \
  -H "Content-Type: application/json" \
  -d '{"test_duration": "30s", "message_count": 100}'

# Expected: Success rate > 99%, latency < 100ms
```

**TC-02: DHT Health Probes**
```bash
# Test DHT put/get operations
curl -X POST http://bzzz-agent:8080/test/dht-health \
  -H "Content-Type: application/json" \
  -d '{"test_duration": "60s", "operation_count": 50}'

# Expected: Success rate > 99%, p95 latency < 300ms
```

**TC-03: Election Health Monitoring**
```bash
# Test election stability
curl -X GET http://bzzz-agent:8080/health/checks | jq '.checks["election-health"]'

# Trigger controlled election
curl -X POST http://bzzz-agent:8080/admin/trigger-election

# Expected: Stable admin election within 30 seconds
```

#### Validation Criteria
- [ ] All health checks report accurate status (see the spot-check sketch below)
- [ ] Health check latencies are within SLO thresholds
- [ ] Failed health checks trigger appropriate alerts
- [ ] Health history is properly maintained
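A quick way to spot-check the first two criteria is to walk the same `/health/checks` endpoint used in TC-03. The `status` and `latency_ms` fields below are assumptions about the response shape, not a documented contract:

```bash
#!/bin/bash
# Flag any health check that is not reporting healthy (sketch; field names assumed).
CHECKS=$(curl -sf http://bzzz-agent:8080/health/checks) || { echo "health endpoint unreachable"; exit 1; }

echo "$CHECKS" | jq -r '.checks | to_entries[] | "\(.key) \(.value.status) \(.value.latency_ms)"' |
while read -r name status latency; do
    if [ "$status" != "healthy" ]; then
        echo "❌ $name is $status (latency: ${latency}ms)"
    fi
done
```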
### 1.2 SLURP Leadership Health Testing

**TC-04: Leadership Transition Health**
```bash
# Test leadership transition health
./scripts/test-leadership-transition.sh

# Expected outcomes:
# - Clean leadership transitions
# - No dropped tasks during transition
# - Health scores maintain > 0.8 during transition
```

**TC-05: Degraded Leader Detection**
```bash
# Simulate resource exhaustion
docker service update --limit-memory 512M bzzz-v2_bzzz-agent

# Expected: Transition to degraded leader state within 2 minutes
# Expected: Health alerts fired appropriately
```

---

## 2. Integration Testing

### 2.1 End-to-End System Testing

**TC-06: Complete Task Lifecycle**
```bash
#!/bin/bash
# Test complete task flow from submission to completion

# 1. Submit context generation task
TASK_ID=$(curl -X POST http://bzzz.deepblack.cloud/api/slurp/generate \
  -H "Content-Type: application/json" \
  -d '{
    "ucxl_address": "ucxl://test/document.md",
    "role": "test_analyst",
    "priority": "high"
  }' | jq -r '.task_id')

echo "Task submitted: $TASK_ID"

# 2. Monitor task progress
while true; do
    STATUS=$(curl -s http://bzzz.deepblack.cloud/api/slurp/status/$TASK_ID | jq -r '.status')
    echo "Task status: $STATUS"

    if [ "$STATUS" = "completed" ] || [ "$STATUS" = "failed" ]; then
        break
    fi

    sleep 5
done

# 3. Validate results
if [ "$STATUS" = "completed" ]; then
    echo "✅ Task completed successfully"
    RESULT=$(curl -s http://bzzz.deepblack.cloud/api/slurp/result/$TASK_ID)
    echo "Result size: $(echo $RESULT | jq -r '.content | length')"
else
    echo "❌ Task failed"
    exit 1
fi
```

**TC-07: Multi-Node Coordination**
```bash
# Test coordination across cluster nodes
./scripts/test-multi-node-coordination.sh

# Test matrix:
# - Task submission on node A, execution on node B
# - DHT storage on node A, retrieval on node C
# - Election on mixed node topology
```

### 2.2 Inter-Service Communication Testing

**TC-08: Service Mesh Validation**
```bash
# Test all service-to-service communications
./scripts/test-service-mesh.sh

# Validate:
# - bzzz-agent ↔ postgres
# - bzzz-agent ↔ redis
# - bzzz-agent ↔ dht-bootstrap nodes
# - mcp-server ↔ bzzz-agent
# - content-resolver ↔ bzzz-agent
```
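If the helper script is unavailable, a rough reachability sweep over the same pairs can be run from inside a bzzz-agent container. Service names and ports follow the stack definitions, but treat this as a sketch (it assumes `nc` is present in the agent image), not a replacement for the full mesh test:

```bash
# Minimal service-to-service reachability sweep (sketch).
CONTAINER=$(docker ps --filter name=bzzz-v2_bzzz-agent -q | head -1)
for target in postgres:5432 redis:6379 bzzz-agent:9000; do
    host="${target%:*}"; port="${target#*:}"
    if docker exec "$CONTAINER" nc -z -w 3 "$host" "$port"; then
        echo "✅ $host:$port reachable"
    else
        echo "❌ $host:$port unreachable"
    fi
done
```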
---

## 3. Chaos Engineering

### 3.1 Node Failure Testing

**TC-09: Single Node Failure**
```bash
#!/bin/bash
# Test system resilience to single node failure

# 1. Record baseline metrics
echo "Recording baseline metrics..."
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' > baseline_metrics.json

# 2. Identify current leader
LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')
echo "Current leader: $LEADER"

# 3. Simulate node failure
echo "Simulating failure of node: $LEADER"
docker node update --availability drain $LEADER

# 4. Monitor recovery
START_TIME=$(date +%s)
while true; do
    CURRENT_TIME=$(date +%s)
    ELAPSED=$((CURRENT_TIME - START_TIME))

    # Check if new leader elected
    NEW_LEADER=$(curl -s http://bzzz.deepblack.cloud/api/election/status | jq -r '.current_admin')

    if [ "$NEW_LEADER" != "null" ] && [ "$NEW_LEADER" != "$LEADER" ]; then
        echo "✅ New leader elected: $NEW_LEADER (${ELAPSED}s)"
        break
    fi

    if [ $ELAPSED -gt 120 ]; then
        echo "❌ Leadership recovery timeout"
        exit 1
    fi

    sleep 5
done

# 5. Validate system health
sleep 30  # Allow system to stabilize
HEALTH_SCORE=$(curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq -r '.data.result[0].value[1]')
echo "Post-failure health score: $HEALTH_SCORE"

if (( $(echo "$HEALTH_SCORE > 0.8" | bc -l) )); then
    echo "✅ System recovered successfully"
else
    echo "❌ System health degraded: $HEALTH_SCORE"
    exit 1
fi

# 6. Restore node
docker node update --availability active $LEADER
```

**TC-10: Multi-Node Cascade Failure**
```bash
# Test system resilience to cascade failures
./scripts/test-cascade-failure.sh

# Scenario: Fail 2 out of 5 nodes simultaneously
# Expected: System continues operating with degraded performance
# Expected: All critical data remains available
```
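One way to approximate the cascade without the helper script is to drain two of the five nodes directly, then run the same recovery and health checks as TC-09 before restoring them:

```bash
# Drain two nodes at once to approximate a cascade failure (sketch).
for node in ironwood acacia; do
    docker node update --availability drain "$node"
done

# ... run the TC-09 recovery and health checks here ...

for node in ironwood acacia; do
    docker node update --availability active "$node"
done
```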
### 3.2 Network Partition Testing

**TC-11: DHT Network Partition**
```bash
#!/bin/bash
# Test DHT resilience to network partitions

# 1. Create network partition
echo "Creating network partition..."
iptables -A INPUT -s 192.168.1.72 -j DROP   # Block ironwood
iptables -A OUTPUT -d 192.168.1.72 -j DROP

# 2. Monitor DHT health
./scripts/monitor-dht-partition-recovery.sh &
MONITOR_PID=$!

# 3. Wait for partition duration
sleep 300  # 5 minute partition

# 4. Heal partition
echo "Healing network partition..."
iptables -D INPUT -s 192.168.1.72 -j DROP
iptables -D OUTPUT -d 192.168.1.72 -j DROP

# 5. Wait for recovery
sleep 180  # 3 minute recovery window

# 6. Validate recovery
kill $MONITOR_PID
./scripts/validate-dht-recovery.sh
```

### 3.3 Resource Exhaustion Testing

**TC-12: Memory Exhaustion**
```bash
# Test behavior under memory pressure
stress-ng --vm 4 --vm-bytes 75% --timeout 300s &
STRESS_PID=$!

# Monitor system behavior
./scripts/monitor-memory-exhaustion.sh

# Expected: Graceful degradation, no crashes
# Expected: Health checks detect degradation
# Expected: Alerts fired appropriately

kill $STRESS_PID
```

**TC-13: Disk Space Exhaustion**
```bash
# Test disk space exhaustion handling
dd if=/dev/zero of=/tmp/fill-disk bs=1M count=1000

# Expected: Services detect low disk space
# Expected: Appropriate cleanup mechanisms activate
# Expected: System remains operational

# Remove the filler file once the checks are done
rm -f /tmp/fill-disk
```

---

## 4. Performance Testing

### 4.1 Load Testing

**TC-14: Context Generation Load Test**
```bash
#!/bin/bash
# Load test context generation system

# Test configuration
CONCURRENT_USERS=50
TEST_DURATION=600  # 10 minutes
RAMP_UP_TIME=60    # 1 minute

# Run load test (k6 has no ramp-up flag; the ramp-up is expressed as a stage)
k6 run --stage ${RAMP_UP_TIME}s:${CONCURRENT_USERS} \
       --stage ${TEST_DURATION}s:${CONCURRENT_USERS} \
       ./scripts/load-test-context-generation.js

# Success criteria:
# - Throughput: > 10 requests/second
# - P95 latency: < 2 seconds
# - Error rate: < 1%
# - System health score: > 0.8 throughout test
```

**TC-15: DHT Throughput Test**
```bash
# Test DHT operation throughput
./scripts/dht-throughput-test.sh

# Test matrix:
# - PUT operations: Target 100 ops/sec
# - GET operations: Target 500 ops/sec
# - Mixed workload: 80% GET, 20% PUT
```

### 4.2 Scalability Testing

**TC-16: Horizontal Scaling Test**
```bash
#!/bin/bash
# Test horizontal scaling behavior

# Baseline measurement
echo "Recording baseline performance..."
./scripts/measure-baseline-performance.sh

# Scale up
echo "Scaling up services..."
docker service scale bzzz-v2_bzzz-agent=6
sleep 60  # Allow services to start

# Measure scaled performance
echo "Measuring scaled performance..."
./scripts/measure-scaled-performance.sh

# Validate improvements
echo "Validating scaling improvements..."
./scripts/validate-scaling-improvements.sh

# Expected: Linear improvement in throughput
# Expected: No degradation in latency
# Expected: Stable error rates
```

---

## 5. Monitoring and Alerting Validation

### 5.1 Alert Testing

**TC-17: Critical Alert Testing**
```bash
#!/bin/bash
# Test critical alert firing and resolution

ALERTS_TO_TEST=(
    "BZZZSystemHealthCritical"
    "BZZZInsufficientPeers"
    "BZZZDHTLowSuccessRate"
    "BZZZNoAdminElected"
    "BZZZTaskQueueBackup"
)

for alert in "${ALERTS_TO_TEST[@]}"; do
    echo "Testing alert: $alert"

    # Trigger condition
    ./scripts/trigger-alert-condition.sh "$alert"

    # Wait for alert
    if timeout 300 ./scripts/wait-for-alert.sh "$alert"; then
        echo "✅ Alert $alert fired successfully"
    else
        echo "❌ Alert $alert failed to fire"
    fi

    # Resolve condition
    ./scripts/resolve-alert-condition.sh "$alert"

    # Wait for resolution
    if timeout 300 ./scripts/wait-for-alert-resolution.sh "$alert"; then
        echo "✅ Alert $alert resolved successfully"
    else
        echo "❌ Alert $alert failed to resolve"
    fi
done
```

### 5.2 Metrics Validation

**TC-18: Metrics Accuracy Test**
```bash
# Validate metrics accuracy against actual system state
./scripts/validate-metrics-accuracy.sh

# Test cases:
# - Connected peers count vs actual P2P connections
# - DHT operation counters vs logged operations
# - Task completion rates vs actual completions
# - Resource usage vs system measurements
```
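A single one of these comparisons can be scripted directly. The sketch below checks the replica count Prometheus sees against the tasks Docker reports as running; the `job="bzzz-agent"` label is an assumption about the scrape configuration:

```bash
# Compare running bzzz-agent tasks with healthy Prometheus scrape targets (sketch).
RUNNING=$(docker service ps bzzz-v2_bzzz-agent --filter desired-state=running -q | wc -l)
SCRAPED=$(curl -sG 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=up{job="bzzz-agent"}' \
    | jq '[.data.result[] | select(.value[1] == "1")] | length')

if [ "$RUNNING" -eq "$SCRAPED" ]; then
    echo "✅ Prometheus sees all $RUNNING agent tasks"
else
    echo "❌ Mismatch: $RUNNING running tasks vs $SCRAPED healthy scrape targets"
fi
```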
### 5.3 Dashboard Functionality

**TC-19: Grafana Dashboard Test**
```bash
# Test all Grafana dashboards
./scripts/test-grafana-dashboards.sh

# Validation:
# - All panels load without errors
# - Data displays correctly for all time ranges
# - Drill-down functionality works
# - Alert annotations appear correctly
```
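The "panels load" check can also be automated against Grafana's HTTP API (`/api/search` and `/api/dashboards/uid/...` are standard Grafana endpoints; the admin/admin credentials match the defaults used elsewhere in this plan):

```bash
# Fetch every dashboard via the Grafana API and flag any that fail to load.
GRAFANA=http://admin:admin@localhost:3000
for uid in $(curl -sf "$GRAFANA/api/search?type=dash-db" | jq -r '.[].uid'); do
    if curl -sf "$GRAFANA/api/dashboards/uid/$uid" >/dev/null; then
        echo "✅ Dashboard $uid loads"
    else
        echo "❌ Dashboard $uid failed to load"
    fi
done
```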
---

## 6. Disaster Recovery Testing

### 6.1 Data Recovery Testing

**TC-20: Database Recovery Test**
```bash
#!/bin/bash
# Test database backup and recovery procedures

# 1. Create test data
echo "Creating test data..."
./scripts/create-test-data.sh

# 2. Perform backup
echo "Creating backup..."
./scripts/backup-database.sh

# 3. Simulate data loss
echo "Simulating data loss..."
docker service scale bzzz-v2_postgres=0
docker volume rm bzzz-v2_postgres_data

# 4. Restore from backup
echo "Restoring from backup..."
./scripts/restore-database.sh

# 5. Validate data integrity
echo "Validating data integrity..."
./scripts/validate-restored-data.sh

# Expected: 100% data recovery
# Expected: All relationships intact
# Expected: System fully operational
```

### 6.2 Configuration Recovery

**TC-21: Configuration Disaster Recovery**
```bash
# Test recovery of all system configurations
./scripts/test-configuration-recovery.sh

# Test scenarios:
# - Docker secrets loss and recovery
# - Docker configs corruption and recovery
# - Service definition recovery
# - Network configuration recovery
```
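The deployment script backs up configs as `docker config inspect` JSON, which carries the payload base64-encoded under `.Spec.Data`, so a config can be re-created straight from its backup file. Secrets cannot be restored this way: `docker secret inspect` never returns the secret payload, so secret recovery must come from the original secrets directory. A minimal restore sketch (the backup path is illustrative):

```bash
# Re-create a Docker config from an inspect-style backup file (sketch).
BACKUP_FILE=/rust/bzzz-v2/backups/monitoring/backup_20250101_000000/bzzz_prometheus_config_v2.json

jq -r '.[0].Spec.Data' "$BACKUP_FILE" | base64 -d | \
    docker config create bzzz_prometheus_config_v2 -
```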
### 6.3 Full System Recovery

**TC-22: Complete Infrastructure Recovery**
```bash
#!/bin/bash
# Test complete system recovery from scratch

# 1. Document current state
echo "Documenting current system state..."
./scripts/document-system-state.sh > pre-disaster-state.json

# 2. Simulate complete infrastructure loss
echo "Simulating infrastructure disaster..."
docker stack rm bzzz-v2
docker system prune -f --volumes

# 3. Recover infrastructure
echo "Recovering infrastructure..."
./scripts/deploy-from-scratch.sh

# 4. Validate recovery
echo "Validating recovery..."
./scripts/validate-complete-recovery.sh pre-disaster-state.json

# Success criteria:
# - All services operational within 15 minutes
# - All data recovered correctly
# - System health score > 0.9
# - All integrations functional
```

---

## Test Execution Framework

### Automated Test Runner

```bash
#!/bin/bash
# Main test execution script

TEST_SUITE=${1:-"all"}
ENVIRONMENT=${2:-"staging"}

echo "Running BZZZ reliability tests..."
echo "Suite: $TEST_SUITE"
echo "Environment: $ENVIRONMENT"

# Setup test environment
./scripts/setup-test-environment.sh $ENVIRONMENT

# Run test suites
case $TEST_SUITE in
    "health")
        ./scripts/run-health-tests.sh
        ;;
    "integration")
        ./scripts/run-integration-tests.sh
        ;;
    "chaos")
        ./scripts/run-chaos-tests.sh
        ;;
    "performance")
        ./scripts/run-performance-tests.sh
        ;;
    "monitoring")
        ./scripts/run-monitoring-tests.sh
        ;;
    "disaster-recovery")
        ./scripts/run-disaster-recovery-tests.sh
        ;;
    "all")
        ./scripts/run-all-tests.sh
        ;;
    *)
        echo "Unknown test suite: $TEST_SUITE"
        exit 1
        ;;
esac

# Generate test report
./scripts/generate-test-report.sh

echo "Test execution completed."
```

### Test Environment Setup

```yaml
# test-environment.yml
version: '3.8'

services:
  # Staging environment with reduced resource requirements
  bzzz-agent-test:
    image: registry.home.deepblack.cloud/bzzz:test-latest
    environment:
      - LOG_LEVEL=debug
      - TEST_MODE=true
      - METRICS_ENABLED=true
    networks:
      - test-network
    deploy:
      replicas: 3
      resources:
        limits:
          memory: 1G
          cpus: '0.5'

  # Test data generator
  test-data-generator:
    image: registry.home.deepblack.cloud/bzzz-test-generator:latest
    environment:
      - TARGET_ENDPOINT=http://bzzz-agent-test:9000
      - DATA_VOLUME=medium
    networks:
      - test-network

networks:
  test-network:
    driver: overlay
```

### Continuous Testing Pipeline

```yaml
# .github/workflows/reliability-testing.yml
name: BZZZ Reliability Testing

on:
  schedule:
    - cron: '0 2 * * *'  # Daily at 2 AM
  workflow_dispatch:

jobs:
  health-tests:
    runs-on: self-hosted
    steps:
      - uses: actions/checkout@v3
      - name: Run Health Tests
        run: ./infrastructure/testing/run-tests.sh health staging

  performance-tests:
    runs-on: self-hosted
    needs: health-tests
    steps:
      - name: Run Performance Tests
        run: ./infrastructure/testing/run-tests.sh performance staging

  chaos-tests:
    runs-on: self-hosted
    needs: health-tests
    if: github.event_name == 'workflow_dispatch'
    steps:
      - name: Run Chaos Tests
        run: ./infrastructure/testing/run-tests.sh chaos staging
```

---

## Success Criteria

### Overall System Reliability Targets

- **Availability SLO**: 99.9% uptime
- **Performance SLO**:
  - Context generation: p95 < 2 seconds
  - DHT operations: p95 < 300ms
  - P2P messaging: p95 < 500ms
- **Error Rate SLO**: < 0.1% for all operations
- **Recovery Time Objective (RTO)**: < 15 minutes
- **Recovery Point Objective (RPO)**: < 5 minutes
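The latency and health targets can be spot-checked directly against Prometheus. In the sketch below, `bzzz_system_health_score` appears elsewhere in this plan, but the histogram metric name is illustrative and may not match the exporter's actual output:

```bash
# p95 context-generation latency over the last hour (metric name assumed)
curl -sG 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=histogram_quantile(0.95, sum(rate(bzzz_context_generation_duration_seconds_bucket[1h])) by (le))'

# Current overall system health score
curl -sG 'http://prometheus:9090/api/v1/query' \
    --data-urlencode 'query=bzzz_system_health_score'
```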
### Test Pass Criteria

- **Health Tests**: 100% of health checks function correctly
- **Integration Tests**: 95% pass rate for all integration scenarios
- **Chaos Tests**: System recovers within SLO targets for all failure scenarios
- **Performance Tests**: All performance metrics meet SLO targets under load
- **Monitoring Tests**: 100% of alerts fire and resolve correctly
- **Disaster Recovery**: Complete system recovery within RTO/RPO targets

### Continuous Monitoring

- Daily automated health and integration tests
- Weekly performance regression testing
- Monthly chaos engineering exercises
- Quarterly disaster recovery drills

---

## Test Reporting and Documentation

### Test Results Dashboard
- Real-time test execution status
- Historical test results and trends
- Performance benchmarks over time
- Failure analysis and remediation tracking

### Test Documentation
- Detailed test procedures and scripts
- Failure scenarios and response procedures
- Performance baselines and regression analysis
- Disaster recovery validation reports

This comprehensive testing plan ensures that all infrastructure enhancements are thoroughly validated and the system meets its reliability and performance objectives.