🚀 Complete BZZZ Issue Resolution - All 17 Issues Solved

Comprehensive multi-agent implementation addressing all issues from INDEX.md:

## Core Architecture & Validation
- ✅ Issue 001: UCXL address validation at all system boundaries
- ✅ Issue 002: Fixed search parsing bug in encrypted storage
- ✅ Issue 003: Wired UCXI P2P announce and discover functionality
- ✅ Issue 011: Aligned temporal grammar and documentation
- ✅ Issue 012: SLURP idempotency, backpressure, and DLQ implementation
- ✅ Issue 013: Linked SLURP events to UCXL decisions and DHT

## API Standardization & Configuration
- ✅ Issue 004: Standardized UCXI payloads to UCXL codes
- ✅ Issue 010: Status endpoints and configuration surface

## Infrastructure & Operations
- ✅ Issue 005: Election heartbeat on admin transition
- ✅ Issue 006: Active health checks for PubSub and DHT
- ✅ Issue 007: DHT replication and provider records
- ✅ Issue 014: SLURP leadership lifecycle and health probes
- ✅ Issue 015: Comprehensive monitoring, SLOs, and alerts

## Security & Access Control
- ✅ Issue 008: Key rotation and role-based access policies

## Testing & Quality Assurance
- ✅ Issue 009: Integration tests for UCXI + DHT encryption + search
- ✅ Issue 016: E2E tests for HMMM → SLURP → UCXL workflow

## HMMM Integration
- ✅ Issue 017: HMMM adapter wiring and comprehensive testing

## Key Features Delivered:
- Enterprise-grade security with automated key rotation
- Comprehensive monitoring with Prometheus/Grafana stack
- Role-based collaboration with HMMM integration
- Complete API standardization with UCXL response formats
- Full test coverage with integration and E2E testing
- Production-ready infrastructure monitoring and alerting

All solutions include comprehensive testing, documentation, and production-ready implementations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

infrastructure/docs/OPERATIONAL_RUNBOOK.md
# BZZZ Infrastructure Operational Runbook

## Table of Contents
1. [Quick Reference](#quick-reference)
2. [System Architecture Overview](#system-architecture-overview)
3. [Common Operational Tasks](#common-operational-tasks)
4. [Incident Response Procedures](#incident-response-procedures)
5. [Health Check Procedures](#health-check-procedures)
6. [Performance Tuning](#performance-tuning)
7. [Backup and Recovery](#backup-and-recovery)
8. [Troubleshooting Guide](#troubleshooting-guide)
9. [Maintenance Procedures](#maintenance-procedures)

## Quick Reference

### Critical Service Endpoints
- **Grafana Dashboard**: https://grafana.chorus.services
- **Prometheus**: https://prometheus.chorus.services
- **AlertManager**: https://alerts.chorus.services
- **BZZZ Main API**: https://bzzz.deepblack.cloud
- **Health Checks**: https://bzzz.deepblack.cloud/health

### Emergency Contacts
- **Primary Oncall**: Slack #bzzz-alerts
- **System Administrator**: @tony
- **Infrastructure Team**: @platform-team

### Key Commands
```bash
# Check system health
curl -s https://bzzz.deepblack.cloud/health | jq

# View logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100

# Scale service
docker service scale bzzz-v2_bzzz-agent=5

# Force service update
docker service update --force bzzz-v2_bzzz-agent
```

## System Architecture Overview

### Component Relationships
```
┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   PubSub    │────│     DHT     │────│  Election   │
│  Messaging  │    │   Storage   │    │   Manager   │
└─────────────┘    └─────────────┘    └─────────────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
                   ┌─────────────┐
                   │    SLURP    │
                   │   Context   │
                   │  Generator  │
                   └─────────────┘
                          │
                   ┌─────────────┐
                   │    UCXI     │
                   │  Protocol   │
                   │  Resolver   │
                   └─────────────┘
```

### Data Flow
1. **Task Requests** → PubSub → Task Coordinator → SLURP (if admin)
2. **Context Generation** → DHT Storage → UCXI Resolution
3. **Health Monitoring** → Prometheus → AlertManager → Notifications

### Critical Dependencies
- **Docker Swarm**: Container orchestration
- **NFS Storage**: Persistent data storage
- **Prometheus Stack**: Monitoring and alerting
- **DHT Bootstrap Nodes**: P2P network foundation

## Common Operational Tasks

### Service Management

#### Check Service Status
```bash
# List all BZZZ services
docker service ls | grep bzzz

# Check specific service
docker service ps bzzz-v2_bzzz-agent

# View service configuration
docker service inspect bzzz-v2_bzzz-agent
```

#### Scale Services
```bash
# Scale main BZZZ service
docker service scale bzzz-v2_bzzz-agent=5

# Scale monitoring stack
docker service scale bzzz-monitoring_prometheus=1
docker service scale bzzz-monitoring_grafana=1
```

#### Update Services
```bash
# Update to new image version
docker service update \
  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
  bzzz-v2_bzzz-agent

# Update environment variables
docker service update \
  --env-add LOG_LEVEL=debug \
  bzzz-v2_bzzz-agent

# Update resource limits
docker service update \
  --limit-memory 4G \
  --limit-cpu 2 \
  bzzz-v2_bzzz-agent
```

### Configuration Management

#### Update Docker Secrets
```bash
# Create new secret (printf avoids the trailing newline that echo would store in the secret)
printf '%s' "new_password" | docker secret create bzzz_postgres_password_v2 -

# Update service to use new secret
# (target= keeps the in-container file name /run/secrets/bzzz_postgres_password unchanged)
docker service update \
  --secret-rm bzzz_postgres_password \
  --secret-add source=bzzz_postgres_password_v2,target=bzzz_postgres_password \
  bzzz-v2_postgres
```

#### Update Docker Configs
```bash
# Create new config
docker config create bzzz_v2_config_v3 /path/to/new/config.yaml

# Update service
docker service update \
  --config-rm bzzz_v2_config \
  --config-add source=bzzz_v2_config_v3,target=/app/config/config.yaml \
  bzzz-v2_bzzz-agent
```

### Monitoring and Alerting

#### Check Alert Status
```bash
# View active alerts
# (the v1 API was removed in Alertmanager 0.27; on newer versions use /api/v2/alerts, which returns a bare array)
curl -s http://alertmanager:9093/api/v1/alerts | jq '.data[] | select(.status.state == "active")'

# Silence alert (adjust startsAt/endsAt to the actual maintenance window)
curl -X POST http://alertmanager:9093/api/v1/silences \
  -H 'Content-Type: application/json' \
  -d '{
    "matchers": [{"name": "alertname", "value": "BZZZSystemHealthCritical", "isRegex": false}],
    "startsAt": "2025-01-01T00:00:00Z",
    "endsAt": "2025-01-01T01:00:00Z",
    "comment": "Maintenance window",
    "createdBy": "operator"
  }'
```
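
The silence request above uses fixed example timestamps. As a minimal sketch, the payload could be generated for a window that starts at a given time. The helper names `iso_utc` and `silence_payload` are hypothetical, and GNU `date` is assumed for the `-d @EPOCH` form.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not part of BZZZ or Alertmanager.
# Assumes GNU date (for "-d @EPOCH").

# Format a Unix epoch as an RFC 3339 UTC timestamp.
iso_utc() {
  date -u -d "@$1" +%Y-%m-%dT%H:%M:%SZ
}

# Print a silence payload for alert $1, starting at epoch $2 and lasting $3 minutes.
silence_payload() {
  local alert=$1 start=$2 minutes=$3
  local end=$(( start + minutes * 60 ))
  printf '{"matchers":[{"name":"alertname","value":"%s","isRegex":false}],' "$alert"
  printf '"startsAt":"%s","endsAt":"%s",' "$(iso_utc "$start")" "$(iso_utc "$end")"
  printf '"comment":"Maintenance window","createdBy":"operator"}\n'
}
```

To silence for the next hour, one option is to pipe `silence_payload BZZZSystemHealthCritical "$(date +%s)" 60` into `curl -s -X POST -H 'Content-Type: application/json' -d @-` against the silences endpoint shown above.
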

#### Query Metrics
```bash
# Check system health
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq

# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' | jq

# Check error rates
# (-G with --data-urlencode stops curl from treating [5m] as a URL glob range)
curl -G -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(bzzz_errors_total[5m])' | jq
```

## Incident Response Procedures

### Severity Levels

#### Critical (P0)
- System completely unavailable
- Data loss or corruption
- Security breach
- **Response Time**: 15 minutes
- **Resolution Target**: 2 hours

#### High (P1)
- Major functionality impaired
- Performance severely degraded
- **Response Time**: 1 hour
- **Resolution Target**: 4 hours

#### Medium (P2)
- Minor functionality issues
- Performance slightly degraded
- **Response Time**: 4 hours
- **Resolution Target**: 24 hours

#### Low (P3)
- Cosmetic issues
- Enhancement requests
- **Response Time**: 24 hours
- **Resolution Target**: 1 week
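
The response and resolution targets above can be encoded as a small lookup for use in on-call scripts. This is a sketch, not existing tooling: the function names `sla_minutes` and `past_resolution_target` are hypothetical, and all times are converted to minutes.

```bash
#!/usr/bin/env bash
# Sketch only: function names are illustrative.

# Print "<response minutes> <resolution minutes>" for a severity level.
sla_minutes() {
  case "$1" in
    P0) echo "15 120" ;;       # 15 minutes / 2 hours
    P1) echo "60 240" ;;       # 1 hour / 4 hours
    P2) echo "240 1440" ;;     # 4 hours / 24 hours
    P3) echo "1440 10080" ;;   # 24 hours / 1 week
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Exit 0 when an incident of severity $1, open for $2 minutes,
# has exceeded its resolution target.
past_resolution_target() {
  local targets resolution
  targets=$(sla_minutes "$1") || return 2
  resolution=${targets#* }
  [ "$2" -gt "$resolution" ]
}
```

For example, `past_resolution_target P0 130 && echo "escalate"` prints `escalate`, because 130 minutes exceeds the 2-hour P0 target.
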

### Common Incident Scenarios

#### System Health Critical (Alert: BZZZSystemHealthCritical)

**Symptoms**: System health score < 0.5

**Immediate Actions**:
1. Check Grafana dashboard for component failures
2. Review recent deployments or changes
3. Check resource utilization (CPU, memory, disk)
4. Verify P2P connectivity

**Investigation Steps**:
```bash
# Check overall system status
curl -s https://bzzz.deepblack.cloud/health | jq

# Check component health
curl -s https://bzzz.deepblack.cloud/health/checks | jq

# Review recent logs (2>&1 so lines the service wrote to stderr reach tail as well)
docker service logs bzzz-v2_bzzz-agent --since 1h 2>&1 | tail -100

# Check resource usage
docker stats --no-stream
```

**Recovery Actions**:
1. If memory leak: Restart affected services
2. If disk full: Clean up logs and temporary files
3. If network issues: Restart networking components
4. If database issues: Check PostgreSQL health

#### P2P Network Partition (Alert: BZZZInsufficientPeers)

**Symptoms**: Connected peers < 3

**Immediate Actions**:
1. Check network connectivity between nodes
2. Verify DHT bootstrap nodes are running
3. Check firewall rules and port accessibility

**Investigation Steps**:
```bash
# Check DHT bootstrap nodes
for node in walnut:9101 ironwood:9102 acacia:9103; do
  echo "Checking $node:"
  nc -zv ${node%:*} ${node#*:}
done

# Check P2P connectivity
docker service logs bzzz-v2_dht-bootstrap-walnut --since 1h

# Test network between nodes
docker run --rm --network host nicolaka/netshoot ping -c 3 ironwood
```

**Recovery Actions**:
1. Restart DHT bootstrap services
2. Clear peer store if corrupted
3. Check and fix network configuration
4. Restart affected BZZZ agents
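
The bootstrap-node check in the investigation steps can be wrapped so that it reports only the unreachable nodes and returns a usable exit status. This is a sketch under assumptions: `unreachable_nodes` and the `PROBE` variable are hypothetical names, and the default probe assumes an `nc` that supports `-z` and `-w`.

```bash
#!/usr/bin/env bash
# Sketch only: names are illustrative. PROBE can be overridden to use another tool.
PROBE=${PROBE:-nc -z -w 3}

# Print every unreachable "host:port" argument; exit non-zero if any were found.
unreachable_nodes() {
  local node failed=0
  for node in "$@"; do
    # ${node%:*} strips the ":port" suffix (host); ${node#*:} strips the "host:" prefix (port).
    if ! $PROBE "${node%:*}" "${node#*:}" >/dev/null 2>&1; then
      echo "$node"
      failed=1
    fi
  done
  return "$failed"
}
```

A call such as `unreachable_nodes walnut:9101 ironwood:9102 acacia:9103` prints nothing and exits 0 when all three bootstrap nodes accept connections.
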

#### Election System Failure (Alert: BZZZNoAdminElected)

**Symptoms**: No admin elected or frequent leadership changes

**Immediate Actions**:
1. Check election state on all nodes
2. Review heartbeat status
3. Verify role configurations

**Investigation Steps**:
```bash
# Check election status on each node
# (assumes SSH access to each swarm node; docker exec only reaches containers on the local node)
for node in walnut ironwood acacia; do
  echo "Node $node election status:"
  ssh "$node" 'docker exec "$(docker ps -q -f name=bzzz-agent | head -n 1)" curl -s localhost:8081/health/checks' \
    | jq '.checks["election-health"]'
done

# Check role configurations
# (docker config inspect returns a JSON array; -r strips the quotes before base64 decoding)
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | grep -A5 -B5 role
```

**Recovery Actions**:
1. Force re-election by restarting election managers
2. Fix role configuration issues
3. Clear election state if corrupted
4. Ensure at least one node has admin capabilities

#### DHT Replication Failure (Alert: BZZZDHTReplicationDegraded)

**Symptoms**: Average replication factor < 2

**Immediate Actions**:
1. Check DHT provider records
2. Verify replication manager status
3. Check storage availability

**Investigation Steps**:
```bash
# Check DHT metrics
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_replication_factor' | jq

# Check provider records
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_provider_records' | jq

# Check replication manager logs (2>&1 so stderr lines are searched as well)
docker service logs bzzz-v2_bzzz-agent 2>&1 | grep -i replication
```

**Recovery Actions**:
1. Restart replication managers
2. Force re-provision of content
3. Check and fix storage issues
4. Verify DHT network connectivity
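
The numeric symptom thresholds listed for the scenarios above can be checked together in one triage helper. This is a sketch: `breached_alerts` is a hypothetical name, the thresholds are copied from the **Symptoms** lines, and the actual Prometheus alert rules may use different expressions.

```bash
#!/usr/bin/env bash
# Sketch only: thresholds come from the Symptoms lines in this runbook,
# not from the deployed alert rules.

# Print the name of every alert whose symptom threshold is breached.
# Arguments: health score, connected peers, average replication factor.
breached_alerts() {
  awk -v health="$1" -v peers="$2" -v repl="$3" 'BEGIN {
    if (health < 0.5) print "BZZZSystemHealthCritical"
    if (peers  < 3)   print "BZZZInsufficientPeers"
    if (repl   < 2)   print "BZZZDHTReplicationDegraded"
  }'
}
```

`BZZZNoAdminElected` is left out because the runbook gives no numeric threshold for it.
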

### Escalation Procedures

#### When to Escalate
- Unable to resolve P0/P1 incident within target time
- Incident requires specialized knowledge
- Multiple systems affected
- Potential security implications

#### Escalation Contacts
1. **Technical Lead**: @tech-lead (Slack)
2. **Infrastructure Team**: @infra-team (Slack)
3. **Management**: @management (for business-critical issues)

## Health Check Procedures

### Manual Health Verification

#### System-Level Checks
```bash
# 1. Overall system health
curl -s https://bzzz.deepblack.cloud/health | jq '.status'

# 2. Component health checks
curl -s https://bzzz.deepblack.cloud/health/checks | jq

# 3. Resource utilization
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

# 4. Service status
docker service ls | grep bzzz

# 5. Network connectivity
docker network ls | grep bzzz
```

#### Component-Specific Checks

**P2P Network**:
```bash
# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers'

# Test P2P messaging
docker exec -it $(docker ps -q -f name=bzzz-agent) \
  /app/bzzz test-p2p-message
```

**DHT Storage**:
```bash
# Check DHT operations
# (-G with --data-urlencode stops curl from treating [5m] as a URL glob range)
curl -G -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=rate(bzzz_dht_put_operations_total[5m])'

# Test DHT functionality
docker exec -it $(docker ps -q -f name=bzzz-agent) \
  /app/bzzz test-dht-operations
```

**Election System**:
```bash
# Check current admin
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_election_state'

# Check heartbeat status
curl -s https://bzzz.deepblack.cloud/api/election/status | jq
```

### Automated Health Monitoring

#### Prometheus Queries for Health
```promql
# Overall system health
bzzz_system_health_score

# Component health scores
bzzz_component_health_score

# SLI compliance (rate() must wrap each counter separately; range vectors cannot be added)
rate(bzzz_health_checks_passed_total[5m]) / (rate(bzzz_health_checks_failed_total[5m]) + rate(bzzz_health_checks_passed_total[5m]))

# Error rate above the 1% error budget
(1 - bzzz:dht_success_rate) > 0.01
```
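
For a quick offline check, the same SLI arithmetic can be reproduced from raw counts. This is a sketch with hypothetical function names; it mirrors the two expressions above but is not how Prometheus evaluates them.

```bash
#!/usr/bin/env bash
# Sketch only: function names are illustrative.

# Fraction of health checks that passed, given passed ($1) and failed ($2) counts.
sli_ratio() {
  awk -v p="$1" -v f="$2" 'BEGIN {
    total = p + f
    if (total == 0) { print "nan"; exit }   # no samples in the window
    printf "%.4f\n", p / total
  }'
}

# Exit 0 when the error rate (1 - success rate $1) exceeds the budget $2 (default 0.01).
budget_exceeded() {
  awk -v s="$1" -v b="${2:-0.01}" 'BEGIN { exit !((1 - s) > b) }'
}
```

With 99 passed and 1 failed check, `sli_ratio 99 1` prints `0.9900`.
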

#### Alert Validation
After resolving issues, verify alerts clear:
```bash
# Check if alerts are resolved
curl -s http://alertmanager:9093/api/v1/alerts | \
  jq '.data[] | select(.status.state == "active") | .labels.alertname'
```

## Performance Tuning

### Resource Optimization

#### Memory Tuning
```bash
# Increase memory limits for heavy workloads
docker service update --limit-memory 8G bzzz-v2_bzzz-agent

# Optimize JVM heap size (applies only if the service runs on a JVM; it has no effect on a Go binary)
docker service update \
  --env-add JAVA_OPTS="-Xmx4g -Xms2g" \
  bzzz-v2_bzzz-agent
```

#### CPU Optimization
```bash
# Adjust CPU limits
docker service update --limit-cpu 4 bzzz-v2_bzzz-agent

# Schedule critical services on high-performance nodes
# (this is a placement constraint on a node label, not CPU pinning)
docker service update \
  --constraint-add 'node.labels.cpu_type==high_performance' \
  bzzz-v2_bzzz-agent
```

#### Network Optimization
```bash
# Optimize network buffer sizes (run as root on each node)
echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf
sysctl -p
```

### Application-Level Tuning

#### DHT Performance
- Increase replication factor for critical content
- Optimize provider record refresh intervals
- Tune cache sizes based on memory availability

#### PubSub Performance
- Adjust message batch sizes
- Optimize topic subscription patterns
- Configure message retention policies

#### Election Stability
- Tune heartbeat intervals
- Adjust election timeouts based on network latency
- Optimize candidate scoring algorithms

### Monitoring Performance Impact
```bash
# Before tuning - capture baseline (mean DHT operation latency over one hour)
curl -G -s 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(bzzz_dht_operation_latency_seconds_sum[5m])/rate(bzzz_dht_operation_latency_seconds_count[5m])' \
  --data-urlencode 'start=2025-01-01T00:00:00Z' \
  --data-urlencode 'end=2025-01-01T01:00:00Z' \
  --data-urlencode 'step=60s'

# After tuning - compare results
# Use Grafana dashboards to visualize improvements
```

## Backup and Recovery

### Critical Data Identification

#### Persistent Data
- **PostgreSQL Database**: User data, task history, conversation threads
- **DHT Content**: Distributed content storage
- **Configuration**: Docker secrets, configs, service definitions
- **Prometheus Data**: Historical metrics (optional but valuable)

#### Backup Schedule
- **PostgreSQL**: Daily full backup, continuous WAL archiving
- **Configuration**: Weekly backup, immediately after changes
- **Prometheus**: Weekly backup of selected metrics
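
The schedule above implies a maximum acceptable age for each backup set, which can be checked mechanically. This is a sketch: the function names and data-set keys are hypothetical, and the hour values are simply the daily and weekly intervals from the schedule.

```bash
#!/usr/bin/env bash
# Sketch only: names and keys are illustrative.

# Maximum backup age in hours per data set, derived from the schedule above.
max_age_hours() {
  case "$1" in
    postgresql)    echo 24 ;;    # daily
    configuration) echo 168 ;;   # weekly
    prometheus)    echo 168 ;;   # weekly
    *) echo "unknown data set: $1" >&2; return 1 ;;
  esac
}

# Exit 0 when a backup taken at epoch $1 is older than $3 hours at epoch $2.
backup_is_stale() {
  local taken=$1 now=$2 max_hours=$3
  [ $(( now - taken )) -gt $(( max_hours * 3600 )) ]
}
```

On a host with GNU `stat`, the file's modification time from `stat -c %Y <backup file>` can serve as the first argument and `date +%s` as the second.
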

### Backup Procedures

#### Database Backup
```bash
# Create database backup (capture the timestamp once so every step refers to the same file)
BACKUP_TS=$(date +%Y%m%d_%H%M%S)
docker exec $(docker ps -q -f name=postgres) \
  pg_dump -U bzzz -d bzzz_v2 -f "/backup/bzzz_${BACKUP_TS}.sql"

# Compress and store
# (assumes the container's /backup directory is mounted from /rust/bzzz-v2/backups on the host)
gzip "/rust/bzzz-v2/backups/bzzz_${BACKUP_TS}.sql"
aws s3 cp /rust/bzzz-v2/backups/ s3://chorus-backups/bzzz/ --recursive
```

#### Configuration Backup
```bash
# Export secret metadata
# (docker secret inspect never returns secret values; back those up from their original source)
for secret in $(docker secret ls -q); do
  docker secret inspect "$secret" > "/backup/secrets/${secret}.json"
done

# Export all configs
for config in $(docker config ls -q); do
  docker config inspect "$config" > "/backup/configs/${config}.json"
done

# Export service definitions (one inspect call yields a single valid JSON array)
docker service inspect $(docker service ls -q) > /backup/services.json
```

#### Prometheus Data Backup
```bash
# Snapshot Prometheus data (requires Prometheus to run with --web.enable-admin-api)
SNAPSHOT=$(curl -s -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')

# Copy snapshot to backup location (the snapshot directory is named in the API response)
docker cp "prometheus_container:/prometheus/snapshots/${SNAPSHOT}" /backup/prometheus/$(date +%Y%m%d)
```

### Recovery Procedures

#### Full System Recovery
1. **Restore Infrastructure**: Deploy Docker Swarm stack
2. **Restore Configuration**: Import secrets and configs
3. **Restore Database**: Restore PostgreSQL from backup
4. **Validate Services**: Verify all services are healthy
5. **Test Functionality**: Run end-to-end tests
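
Because each recovery step depends on the one before it, a driver script should stop at the first failure rather than continue. The sketch below shows one way to do that. `run_steps` is a hypothetical helper, and each step is expected to be a shell function or command that the operator supplies.

```bash
#!/usr/bin/env bash
# Sketch only: run_steps is illustrative; the step commands are supplied by the caller.

# Run each argument as a command, in order; stop at the first failure.
run_steps() {
  local step n=0
  for step in "$@"; do
    n=$(( n + 1 ))
    echo "[$n/$#] $step"
    if ! "$step"; then
      echo "FAILED at step $n: $step" >&2
      return 1
    fi
  done
  echo "all steps completed"
}
```

The five steps above would map to five operator-defined functions (for example `restore_infrastructure`, `restore_configuration`, and so on, all hypothetical names) passed to `run_steps` in order.
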

#### Database Recovery
```bash
# Stop application services
docker service scale bzzz-v2_bzzz-agent=0

# Restore database
gunzip -c /backup/bzzz_20250101_120000.sql.gz | \
  docker exec -i $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2

# Start application services
docker service scale bzzz-v2_bzzz-agent=3
```

#### Point-in-Time Recovery
```bash
# For WAL-based recovery
docker exec $(docker ps -q -f name=postgres) \
  pg_basebackup -U postgres -D /backup/base -X stream -P

# Restore to specific time
# (Implementation depends on PostgreSQL configuration)
```

### Recovery Testing

#### Monthly Recovery Tests
```bash
# Test database restore
./scripts/test-db-restore.sh

# Test configuration restore
./scripts/test-config-restore.sh

# Test full system restore (staging environment)
./scripts/test-full-restore.sh staging
```

#### Recovery Validation
- Verify all services start successfully
- Check data integrity and completeness
- Validate P2P network connectivity
- Test core functionality (task coordination, context generation)
- Monitor system health for 24 hours post-recovery

## Troubleshooting Guide

### Log Analysis

#### Centralized Logging
```bash
# View aggregated logs through Loki
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="bzzz"}' \
  --data-urlencode 'start=2025-01-01T00:00:00Z' \
  --data-urlencode 'end=2025-01-01T01:00:00Z' | jq

# Search for specific errors
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="bzzz"} |= "ERROR"' | jq
```

#### Service-Specific Logs
```bash
# BZZZ agent logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100

# DHT bootstrap logs
docker service logs bzzz-v2_dht-bootstrap-walnut -f

# Database logs
docker service logs bzzz-v2_postgres -f

# Filter for specific patterns (2>&1 so stderr lines are searched as well)
docker service logs bzzz-v2_bzzz-agent 2>&1 | grep -E "(ERROR|FATAL|panic)"
```

### Common Issues and Solutions

#### "No Admin Elected" Error
```bash
# Check role configurations
# (docker config inspect returns a JSON array; -r strips the quotes before base64 decoding)
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | yq '.agent.role'

# Force election
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz trigger-election

# Restart election managers
docker service update --force bzzz-v2_bzzz-agent
```

#### "DHT Operations Failing" Error
```bash
# Check DHT bootstrap nodes
for port in 9101 9102 9103; do
  nc -zv localhost $port
done

# Restart DHT services
docker service update --force bzzz-v2_dht-bootstrap-walnut
docker service update --force bzzz-v2_dht-bootstrap-ironwood
docker service update --force bzzz-v2_dht-bootstrap-acacia

# Clear DHT cache
# (sh -c makes the glob expand inside the container rather than on the host; assumes the image ships sh)
docker exec -it $(docker ps -q -f name=bzzz-agent) sh -c 'rm -rf /app/data/dht/cache/*'
```

#### "High Memory Usage" Alert
```bash
# Identify memory-hungry containers (highest memory percentage first)
docker stats --no-stream --format "{{.MemPerc}}\t{{.MemUsage}}\t{{.Container}}" | sort -rn

# Check for memory leaks (heap profile from the pprof endpoint listed under Debugging Tools)
curl -s http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof -top heap.prof

# Restart high-memory services
docker service update --force bzzz-v2_bzzz-agent
```

#### "Network Connectivity Issues"
```bash
# Check overlay network
docker network inspect bzzz-internal

# Test connectivity between services
docker run --rm --network bzzz-internal nicolaka/netshoot ping -c 3 postgres

# Check firewall rules (-n prints numeric ports so the grep can match them)
iptables -L -n | grep -E "(9000|9101|9102|9103)"

# Restart networking
docker network disconnect bzzz-internal $(docker ps -q -f name=bzzz-agent)
docker network connect bzzz-internal $(docker ps -q -f name=bzzz-agent)
```

### Performance Issues

#### High Latency Diagnosis
```bash
# Check operation latencies (95th percentile)
# (-G with --data-urlencode handles the space and the [5m] brackets in the query)
curl -G -s 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket[5m]))'

# Identify bottlenecks
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz profile-cpu 30

# Check network latency between nodes
for node in walnut ironwood acacia; do
  ping -c 10 $node | tail -1
done
```

#### Resource Contention
```bash
# Check CPU usage
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}"

# Check I/O wait
iostat -x 1 5

# Check network utilization
iftop -i eth0
```

### Debugging Tools

#### Application Debugging
```bash
# Enable debug logging
docker service update --env-add LOG_LEVEL=debug bzzz-v2_bzzz-agent

# Access debug endpoints
curl -s http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof heap.prof

# Trace requests
curl -s http://localhost:8080/debug/requests
```

#### System Debugging
```bash
# System resource usage
htop
iotop
nethogs

# Process analysis
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20

# Network analysis
netstat -tulpn | grep -E ":9000|:9101|:9102|:9103"
ss -tuln | grep -E ":9000|:9101|:9102|:9103"
```

## Maintenance Procedures

### Scheduled Maintenance

#### Weekly Maintenance (Low-impact)
- Review system health metrics
- Check log sizes and rotate if necessary
- Update monitoring dashboards
- Validate backup integrity

#### Monthly Maintenance (Medium-impact)
- Update non-critical components
- Perform capacity planning review
- Test disaster recovery procedures
- Security scan and updates

#### Quarterly Maintenance (High-impact)
- Major version updates
- Infrastructure upgrades
- Performance optimization review
- Security audit and remediation

### Update Procedures

#### Rolling Updates
```bash
# Update with zero downtime
docker service update \
  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-failure-action rollback \
  bzzz-v2_bzzz-agent
```

#### Configuration Updates
```bash
# Update configuration (swapping a config triggers a rolling restart of the service's tasks)
docker config create bzzz_v2_config_new /path/to/new/config.yaml

docker service update \
  --config-rm bzzz_v2_config \
  --config-add source=bzzz_v2_config_new,target=/app/config/config.yaml \
  bzzz-v2_bzzz-agent

# Cleanup old config (only after the update has completed and no service references it)
docker config rm bzzz_v2_config
```

#### Database Maintenance
```bash
# Database optimization
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "VACUUM ANALYZE;"

# Update statistics
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "ANALYZE;"

# Check database size
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "SELECT pg_size_pretty(pg_database_size('bzzz_v2'));"
```

### Capacity Planning

#### Growth Projections
- Monitor resource usage trends over time
- Project capacity needs based on growth patterns
- Plan for seasonal or event-driven spikes

#### Scaling Decisions
```bash
# Horizontal scaling
docker service scale bzzz-v2_bzzz-agent=5

# Vertical scaling
docker service update \
  --limit-memory 8G \
  --limit-cpu 4 \
  bzzz-v2_bzzz-agent

# Add new node to swarm (run on a manager; prints the join command to run on the new node)
docker swarm join-token worker
```

#### Resource Monitoring
- Set up capacity alerts at 70% utilization
- Monitor growth rate and extrapolate
- Plan infrastructure expansions 3-6 months ahead
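
The 70% threshold and the extrapolation step can be worked through with simple arithmetic. The sketch below assumes linear growth, which is a simplification, and all function names are hypothetical. Usage, capacity, and monthly growth must share one unit (for example GB).

```bash
#!/usr/bin/env bash
# Sketch only: assumes linear growth; names are illustrative.

# Utilization as a whole-number percentage (rounded down). Args: used, total (integers).
utilization_pct() { echo $(( $1 * 100 / $2 )); }

# Exit 0 when utilization has reached the alert threshold $3 (default 70%).
capacity_alert() {
  [ "$(utilization_pct "$1" "$2")" -ge "${3:-70}" ]
}

# Whole months until usage reaches the threshold.
# Args: used, total, growth per month, threshold percent (default 70).
months_until_threshold() {
  awk -v used="$1" -v total="$2" -v growth="$3" -v pct="${4:-70}" 'BEGIN {
    limit = total * pct / 100
    if (used >= limit) { print 0; exit }
    if (growth <= 0)   { print "never"; exit }
    m = (limit - used) / growth
    printf "%d\n", (m == int(m)) ? m : int(m) + 1   # round up to whole months
  }'
}
```

For example, with 400 GB used of 1000 GB and 50 GB of growth per month, `months_until_threshold 400 1000 50` prints `6`, which falls inside the 3-6 month planning horizon.
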

---

## Contact Information

**Primary Contact**: Tony (@tony)
**Team**: BZZZ Infrastructure Team
**Documentation**: https://wiki.chorus.services/bzzz
**Source Code**: https://gitea.chorus.services/tony/BZZZ

**Last Updated**: 2025-01-01
**Version**: 2.0
**Review Date**: 2025-04-01