anthonyrawlins 92779523c0 🚀 Complete BZZZ Issue Resolution - All 17 Issues Solved

Comprehensive multi-agent implementation addressing all issues from INDEX.md:

## Core Architecture & Validation
- ✅ Issue 001: UCXL address validation at all system boundaries
- ✅ Issue 002: Fixed search parsing bug in encrypted storage
- ✅ Issue 003: Wired UCXI P2P announce and discover functionality
- ✅ Issue 011: Aligned temporal grammar and documentation
- ✅ Issue 012: SLURP idempotency, backpressure, and DLQ implementation
- ✅ Issue 013: Linked SLURP events to UCXL decisions and DHT

## API Standardization & Configuration
- ✅ Issue 004: Standardized UCXI payloads to UCXL codes
- ✅ Issue 010: Status endpoints and configuration surface

## Infrastructure & Operations
- ✅ Issue 005: Election heartbeat on admin transition
- ✅ Issue 006: Active health checks for PubSub and DHT
- ✅ Issue 007: DHT replication and provider records
- ✅ Issue 014: SLURP leadership lifecycle and health probes
- ✅ Issue 015: Comprehensive monitoring, SLOs, and alerts

## Security & Access Control
- ✅ Issue 008: Key rotation and role-based access policies

## Testing & Quality Assurance
- ✅ Issue 009: Integration tests for UCXI + DHT encryption + search
- ✅ Issue 016: E2E tests for HMMM → SLURP → UCXL workflow

## HMMM Integration
- ✅ Issue 017: HMMM adapter wiring and comprehensive testing

## Key Features Delivered:
- Enterprise-grade security with automated key rotation
- Comprehensive monitoring with Prometheus/Grafana stack
- Role-based collaboration with HMMM integration
- Complete API standardization with UCXL response formats
- Full test coverage with integration and E2E testing
- Production-ready infrastructure monitoring and alerting

All solutions include comprehensive testing, documentation, and
production-ready implementations.

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>

2025-08-29 12:39:38 +10:00

BZZZ Infrastructure Operational Runbook

Quick Reference
System Architecture Overview
Common Operational Tasks
Incident Response Procedures
Health Check Procedures
Performance Tuning
Backup and Recovery
Troubleshooting Guide
Maintenance Procedures

Quick Reference

Critical Service Endpoints

Grafana Dashboard: https://grafana.chorus.services
Prometheus: https://prometheus.chorus.services
AlertManager: https://alerts.chorus.services
BZZZ Main API: https://bzzz.deepblack.cloud
Health Checks: https://bzzz.deepblack.cloud/health

Emergency Contacts

Primary Oncall: Slack #bzzz-alerts
System Administrator: @tony
Infrastructure Team: @platform-team

Key Commands

# Check system health
curl -s https://bzzz.deepblack.cloud/health | jq

# View logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100

# Scale service
docker service scale bzzz-v2_bzzz-agent=5

# Force service update
docker service update --force bzzz-v2_bzzz-agent

System Architecture Overview

Component Relationships

┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   PubSub    │────│     DHT     │────│  Election   │
│  Messaging  │    │   Storage   │    │   Manager   │
└─────────────┘    └─────────────┘    └─────────────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                    ┌─────────────┐
                    │    SLURP    │
                    │   Context   │
                    │  Generator  │
                    └─────────────┘
                           │
                    ┌─────────────┐
                    │    UCXI     │
                    │  Protocol   │
                    │  Resolver   │
                    └─────────────┘

Data Flow

Task Requests → PubSub → Task Coordinator → SLURP (if admin)
Context Generation → DHT Storage → UCXI Resolution
Health Monitoring → Prometheus → AlertManager → Notifications

Critical Dependencies

Docker Swarm: Container orchestration
NFS Storage: Persistent data storage
Prometheus Stack: Monitoring and alerting
DHT Bootstrap Nodes: P2P network foundation

Common Operational Tasks

Service Management

Check Service Status

# List all BZZZ services
docker service ls | grep bzzz

# Check specific service
docker service ps bzzz-v2_bzzz-agent

# View service configuration
docker service inspect bzzz-v2_bzzz-agent

Scale Services

# Scale main BZZZ service
docker service scale bzzz-v2_bzzz-agent=5

# Scale monitoring stack
docker service scale bzzz-monitoring_prometheus=1
docker service scale bzzz-monitoring_grafana=1

Update Services

# Update to new image version
docker service update \
  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
  bzzz-v2_bzzz-agent

# Update environment variables
docker service update \
  --env-add LOG_LEVEL=debug \
  bzzz-v2_bzzz-agent

# Update resource limits
docker service update \
  --limit-memory 4G \
  --limit-cpu 2 \
  bzzz-v2_bzzz-agent

Configuration Management

Update Docker Secrets

# Create new secret (printf avoids the trailing newline that echo would store in the secret)
printf '%s' "new_password" | docker secret create bzzz_postgres_password_v2 -

# Update service to use new secret
docker service update \
  --secret-rm bzzz_postgres_password \
  --secret-add bzzz_postgres_password_v2 \
  bzzz-v2_postgres

Update Docker Configs

# Create new config
docker config create bzzz_v2_config_v3 /path/to/new/config.yaml

# Update service
docker service update \
  --config-rm bzzz_v2_config \
  --config-add source=bzzz_v2_config_v3,target=/app/config/config.yaml \
  bzzz-v2_bzzz-agent

Monitoring and Alerting

Check Alert Status

# View active alerts (Alertmanager API v2; the v1 API has been removed from current releases)
curl -s http://alertmanager:9093/api/v2/alerts | jq '.[] | select(.status.state == "active")'

# Silence alert
curl -X POST http://alertmanager:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{
    "matchers": [{"name": "alertname", "value": "BZZZSystemHealthCritical", "isRegex": false}],
    "startsAt": "2025-01-01T00:00:00Z",
    "endsAt": "2025-01-01T01:00:00Z",
    "comment": "Maintenance window",
    "createdBy": "operator"
  }'
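
The silence request above uses fixed example timestamps. A minimal sketch for generating a real window follows. It assumes GNU date (for `-d @epoch`), and `silence_window` is a hypothetical helper name, not part of Alertmanager or BZZZ tooling.

```shell
#!/usr/bin/env bash
# silence_window <start_epoch> <duration_seconds>
# Prints "startsAt endsAt" as UTC RFC 3339 timestamps.
# Assumption: GNU date is available (the -d @epoch form is GNU-specific).
silence_window() {
  local start="$1"
  local end=$(( $1 + $2 ))
  printf '%s %s\n' \
    "$(date -u -d "@$start" +%Y-%m-%dT%H:%M:%SZ)" \
    "$(date -u -d "@$end" +%Y-%m-%dT%H:%M:%SZ)"
}

# Example: a one-hour window starting now
silence_window "$(date +%s)" 3600
```

The two values can be substituted for "startsAt" and "endsAt" in the silence payload.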

Query Metrics

# Check system health
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_system_health_score' | jq

# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers' | jq

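# Check error rates (query passed via --data-urlencode so curl does not treat [5m] as a URL glob)
curl -s -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=rate(bzzz_errors_total[5m])' | jq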

Incident Response Procedures

Severity Levels

Critical (P0)

System completely unavailable
Data loss or corruption
Security breach
Response Time: 15 minutes
Resolution Target: 2 hours

High (P1)

Major functionality impaired
Performance severely degraded
Response Time: 1 hour
Resolution Target: 4 hours

Medium (P2)

Minor functionality issues
Performance slightly degraded
Response Time: 4 hours
Resolution Target: 24 hours

Low (P3)

Cosmetic issues
Enhancement requests
Response Time: 24 hours
Resolution Target: 1 week

Common Incident Scenarios

System Health Critical (Alert: BZZZSystemHealthCritical)

Symptoms: System health score < 0.5

Immediate Actions:

Check Grafana dashboard for component failures
Review recent deployments or changes
Check resource utilization (CPU, memory, disk)
Verify P2P connectivity

Investigation Steps:

# Check overall system status
curl -s https://bzzz.deepblack.cloud/health | jq

# Check component health
curl -s https://bzzz.deepblack.cloud/health/checks | jq

# Review recent logs
docker service logs bzzz-v2_bzzz-agent --since 1h | tail -100

# Check resource usage
docker stats --no-stream

Recovery Actions:

If memory leak: Restart affected services
If disk full: Clean up logs and temporary files
If network issues: Restart networking components
If database issues: Check PostgreSQL health

P2P Network Partition (Alert: BZZZInsufficientPeers)

Symptoms: Connected peers < 3

Immediate Actions:

Check network connectivity between nodes
Verify DHT bootstrap nodes are running
Check firewall rules and port accessibility

Investigation Steps:

# Check DHT bootstrap nodes
for node in walnut:9101 ironwood:9102 acacia:9103; do
  echo "Checking $node:"
  nc -zv ${node%:*} ${node#*:}
done

# Check P2P connectivity
docker service logs bzzz-v2_dht-bootstrap-walnut --since 1h

# Test network between nodes
docker run --rm --network host nicolaka/netshoot ping -c 3 ironwood

Recovery Actions:

Restart DHT bootstrap services
Clear peer store if corrupted
Check and fix network configuration
Restart affected BZZZ agents

Election System Failure (Alert: BZZZNoAdminElected)

Symptoms: No admin elected or frequent leadership changes

Immediate Actions:

Check election state on all nodes
Review heartbeat status
Verify role configurations

Investigation Steps:

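# Check election status on each node
# (docker ps only lists local containers, so the check runs on each node; assumes SSH access to the node hostnames)
for node in walnut ironwood acacia; do
  echo "Node $node election status:"
  ssh "$node" 'docker exec $(docker ps -q -f name=bzzz-agent | head -n 1) curl -s localhost:8081/health/checks' \
    | jq '.checks["election-health"]'
done

# Check role configurations (docker config inspect returns a JSON array; -r strips the quotes before decoding)
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | grep -A5 -B5 role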

Recovery Actions:

Force re-election by restarting election managers
Fix role configuration issues
Clear election state if corrupted
Ensure at least one node has admin capabilities

DHT Replication Failure (Alert: BZZZDHTReplicationDegraded)

Symptoms: Average replication factor < 2

Immediate Actions:

Check DHT provider records
Verify replication manager status
Check storage availability

Investigation Steps:

# Check DHT metrics
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_replication_factor' | jq

# Check provider records
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_dht_provider_records' | jq

# Check replication manager logs
docker service logs bzzz-v2_bzzz-agent | grep -i replication

Recovery Actions:

Restart replication managers
Force re-provision of content
Check and fix storage issues
Verify DHT network connectivity
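
The numeric symptom thresholds in the scenarios above (health score < 0.5, connected peers < 3, replication factor < 2) can be sketched as a single check. This is an illustration only: `check_thresholds` is a hypothetical helper, and the values would come from the Prometheus queries shown in this runbook. The election scenario has no numeric threshold and is not covered.

```shell
#!/usr/bin/env bash
# check_thresholds <health_score> <connected_peers> <replication_factor>
# Prints the name of each alert whose symptom threshold is crossed.
# Alert names and thresholds are taken from the incident scenarios in this runbook;
# the function itself is a hypothetical helper.
check_thresholds() {
  awk -v health="$1" -v peers="$2" -v repl="$3" 'BEGIN {
    if (health < 0.5) print "BZZZSystemHealthCritical"
    if (peers < 3)    print "BZZZInsufficientPeers"
    if (repl < 2)     print "BZZZDHTReplicationDegraded"
  }'
}

# Example: healthy score and replication, but only two peers connected
check_thresholds 0.92 2 2.5
```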

Escalation Procedures

When to Escalate

Unable to resolve P0/P1 incident within target time
Incident requires specialized knowledge
Multiple systems affected
Potential security implications
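
The first criterion (P0/P1 not resolved within target time) is mechanical and can be sketched as below. `should_escalate` is a hypothetical helper; the targets are the 2 hour and 4 hour resolution targets from Severity Levels, expressed in minutes. The other criteria need human judgment and are not modeled.

```shell
#!/usr/bin/env bash
# should_escalate <severity> <elapsed_minutes>
# Exit status 0 when a P0/P1 incident has exceeded its resolution target.
# Hypothetical helper; targets come from the Severity Levels section.
should_escalate() {
  local severity="$1" elapsed="$2" target
  case "$severity" in
    P0) target=120 ;;  # 2 hours
    P1) target=240 ;;  # 4 hours
    *)  return 1 ;;    # the time-based rule applies to P0/P1 only
  esac
  [ "$elapsed" -gt "$target" ]
}

# Example: a P0 incident open for 135 minutes
if should_escalate P0 135; then
  echo "escalate"
fi
```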

Escalation Contacts

Technical Lead: @tech-lead (Slack)
Infrastructure Team: @infra-team (Slack)
Management: @management (for business-critical issues)

Health Check Procedures

Manual Health Verification

System-Level Checks

# 1. Overall system health
curl -s https://bzzz.deepblack.cloud/health | jq '.status'

# 2. Component health checks
curl -s https://bzzz.deepblack.cloud/health/checks | jq

# 3. Resource utilization
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}\t{{.MemUsage}}\t{{.MemPerc}}"

# 4. Service status
docker service ls | grep bzzz

# 5. Network connectivity
docker network ls | grep bzzz

Component-Specific Checks

P2P Network:

# Check connected peers
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_p2p_connected_peers'

# Test P2P messaging
docker exec -it $(docker ps -q -f name=bzzz-agent) \
  /app/bzzz test-p2p-message

DHT Storage:

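# Check DHT operations (query passed via --data-urlencode so curl does not treat [5m] as a URL glob)
curl -s -G 'http://prometheus:9090/api/v1/query' --data-urlencode 'query=rate(bzzz_dht_put_operations_total[5m])'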

# Test DHT functionality
docker exec -it $(docker ps -q -f name=bzzz-agent) \
  /app/bzzz test-dht-operations

Election System:

# Check current admin
curl -s 'http://prometheus:9090/api/v1/query?query=bzzz_election_state'

# Check heartbeat status
curl -s https://bzzz.deepblack.cloud/api/election/status | jq

Automated Health Monitoring

Prometheus Queries for Health

# Overall system health
bzzz_system_health_score

# Component health scores
bzzz_component_health_score

# SLI compliance (rate() is applied to each counter separately; range vectors cannot be added)
rate(bzzz_health_checks_passed_total[5m]) / (rate(bzzz_health_checks_failed_total[5m]) + rate(bzzz_health_checks_passed_total[5m]))

# Error budget burn rate
1 - bzzz:dht_success_rate > 0.01  # 1% error budget
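
The arithmetic behind the SLI compliance and error budget expressions can be worked through offline. The sketch below uses plain counts rather than Prometheus rates, and `sli_ratio` and `budget_exceeded` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# sli_ratio <passed> <failed>
# Fraction of health checks that passed: passed / (passed + failed).
sli_ratio() {
  awk -v p="$1" -v f="$2" 'BEGIN {
    if (p + f == 0) { print "nan"; exit }
    printf "%.4f\n", p / (p + f)
  }'
}

# budget_exceeded <success_rate>
# Exit status 0 when the error rate (1 - success_rate) is above the 1% budget.
budget_exceeded() {
  awk -v s="$1" 'BEGIN { exit !((1 - s) > 0.01) }'
}

# Example: 990 passed and 10 failed checks give a ratio of 0.9900
sli_ratio 990 10
# Example: a 98.5% success rate is a 1.5% error rate, above the 1% budget
budget_exceeded 0.985 && echo "error budget exceeded"
```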

Alert Validation

After resolving issues, verify alerts clear:

# Check if alerts are resolved (Alertmanager API v2 returns a plain JSON array)
curl -s http://alertmanager:9093/api/v2/alerts | \
  jq '.[] | select(.status.state == "active") | .labels.alertname'

Performance Tuning

Resource Optimization

Memory Tuning

# Increase memory limits for heavy workloads
docker service update --limit-memory 8G bzzz-v2_bzzz-agent

# Optimize JVM heap size (if applicable)
docker service update \
  --env-add JAVA_OPTS="-Xmx4g -Xms2g" \
  bzzz-v2_bzzz-agent

CPU Optimization

# Adjust CPU limits
docker service update --limit-cpu 4 bzzz-v2_bzzz-agent

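# Pin critical services to high-performance nodes
# (placement preferences only spread tasks across a label; a constraint restricts where they run)
docker service update \
  --constraint-add 'node.labels.cpu_type==high_performance' \
  bzzz-v2_bzzz-agent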

Network Optimization

# Optimize network buffer sizes
echo 'net.core.rmem_max = 16777216' >> /etc/sysctl.conf
echo 'net.core.wmem_max = 16777216' >> /etc/sysctl.conf
sysctl -p

Application-Level Tuning

DHT Performance

Increase replication factor for critical content
Optimize provider record refresh intervals
Tune cache sizes based on memory availability

PubSub Performance

Adjust message batch sizes
Optimize topic subscription patterns
Configure message retention policies

Election Stability

Tune heartbeat intervals
Adjust election timeouts based on network latency
Optimize candidate scoring algorithms

Monitoring Performance Impact

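# Before tuning - capture baseline (parameters passed via --data-urlencode so curl does not glob [5m])
curl -s -G 'http://prometheus:9090/api/v1/query_range' \
  --data-urlencode 'query=rate(bzzz_dht_operation_latency_seconds_sum[5m])/rate(bzzz_dht_operation_latency_seconds_count[5m])' \
  --data-urlencode 'start=2025-01-01T00:00:00Z' \
  --data-urlencode 'end=2025-01-01T01:00:00Z' \
  --data-urlencode 'step=60s'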

# After tuning - compare results
# Use Grafana dashboards to visualize improvements

Backup and Recovery

Critical Data Identification

Persistent Data

PostgreSQL Database: User data, task history, conversation threads
DHT Content: Distributed content storage
Configuration: Docker secrets, configs, service definitions
Prometheus Data: Historical metrics (optional but valuable)

Backup Schedule

PostgreSQL: Daily full backup, continuous WAL archiving
Configuration: Weekly backup, immediately after changes
Prometheus: Weekly backup of selected metrics

Backup Procedures

Database Backup

# Create a compressed database backup (timestamp captured once so every step uses the same file name)
BACKUP_FILE="/rust/bzzz-v2/backups/bzzz_$(date +%Y%m%d_%H%M%S).sql.gz"
docker exec $(docker ps -q -f name=postgres) \
  pg_dump -U bzzz -d bzzz_v2 | gzip > "$BACKUP_FILE"

# Store off-site
aws s3 cp "$BACKUP_FILE" s3://chorus-backups/bzzz/

Configuration Backup

# Export secret metadata (docker secret inspect never returns secret values; back those up from their original source)
for secret in $(docker secret ls -q); do
  docker secret inspect $secret > /backup/secrets/${secret}.json
done

# Export all configs
for config in $(docker config ls -q); do
  docker config inspect $config > /backup/configs/${config}.json
done

# Export service definitions
docker service ls --format '{{.Name}}' | xargs -I {} docker service inspect {} > /backup/services.json

Prometheus Data Backup

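# Snapshot Prometheus data (requires --web.enable-admin-api; the response names the snapshot directory)
SNAPSHOT=$(curl -s -X POST http://prometheus:9090/api/v1/admin/tsdb/snapshot | jq -r '.data.name')

# Copy snapshot to backup location
docker cp "prometheus_container:/prometheus/snapshots/$SNAPSHOT" /backup/prometheus/$(date +%Y%m%d)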

Recovery Procedures

Full System Recovery

Restore Infrastructure: Deploy Docker Swarm stack
Restore Configuration: Import secrets and configs
Restore Database: Restore PostgreSQL from backup
Validate Services: Verify all services are healthy
Test Functionality: Run end-to-end tests

Database Recovery

# Stop application services
docker service scale bzzz-v2_bzzz-agent=0

# Restore database
gunzip -c /backup/bzzz_20250101_120000.sql.gz | \
  docker exec -i $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2

# Start application services
docker service scale bzzz-v2_bzzz-agent=3

Point-in-Time Recovery

# Take a base backup (the starting point that archived WAL is replayed onto during recovery)
docker exec $(docker ps -q -f name=postgres) \
  pg_basebackup -U postgres -D /backup/base -X stream -P

# Restore to specific time
# (Implementation depends on PostgreSQL configuration)

Recovery Testing

Monthly Recovery Tests

# Test database restore
./scripts/test-db-restore.sh

# Test configuration restore  
./scripts/test-config-restore.sh

# Test full system restore (staging environment)
./scripts/test-full-restore.sh staging

Recovery Validation

Verify all services start successfully
Check data integrity and completeness
Validate P2P network connectivity
Test core functionality (task coordination, context generation)
Monitor system health for 24 hours post-recovery

Troubleshooting Guide

Log Analysis

Centralized Logging

# View aggregated logs through Loki
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="bzzz"}' \
  --data-urlencode 'start=2025-01-01T00:00:00Z' \
  --data-urlencode 'end=2025-01-01T01:00:00Z' | jq

# Search for specific errors
curl -G -s 'http://loki:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={job="bzzz"} |= "ERROR"' | jq

Service-Specific Logs

# BZZZ agent logs
docker service logs bzzz-v2_bzzz-agent -f --tail 100

# DHT bootstrap logs
docker service logs bzzz-v2_dht-bootstrap-walnut -f

# Database logs
docker service logs bzzz-v2_postgres -f

# Filter for specific patterns
docker service logs bzzz-v2_bzzz-agent | grep -E "(ERROR|FATAL|panic)"

Common Issues and Solutions

"No Admin Elected" Error

# Check role configurations (docker config inspect returns a JSON array; -r strips the quotes before decoding)
docker config inspect bzzz_v2_config | jq -r '.[0].Spec.Data' | base64 -d | yq '.agent.role'

# Force election
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz trigger-election

# Restart election managers
docker service update --force bzzz-v2_bzzz-agent

"DHT Operations Failing" Error

# Check DHT bootstrap nodes
for port in 9101 9102 9103; do
  nc -zv localhost $port
done

# Restart DHT services
docker service update --force bzzz-v2_dht-bootstrap-walnut
docker service update --force bzzz-v2_dht-bootstrap-ironwood
docker service update --force bzzz-v2_dht-bootstrap-acacia

# Clear DHT cache
docker exec -it $(docker ps -q -f name=bzzz-agent) rm -rf /app/data/dht/cache/*

"High Memory Usage" Alert

# Identify memory-hungry containers (memory percentage first, highest at the top)
docker stats --no-stream --format "{{.MemPerc}}\t{{.Container}}\t{{.MemUsage}}" | sort -rn

# Check for memory leaks
docker exec -it $(docker ps -q -f name=bzzz-agent) pprof -http=:6060 /app/bzzz

# Restart high-memory services
docker service update --force bzzz-v2_bzzz-agent

"Network Connectivity Issues"

# Check overlay network
docker network inspect bzzz-internal

# Test connectivity between services
docker run --rm --network bzzz-internal nicolaka/netshoot ping -c 3 postgres

# Check firewall rules (-n prints numeric ports so the grep can match them)
iptables -L -n | grep -E "(9000|9101|9102|9103)"

# Restart networking
docker network disconnect bzzz-internal $(docker ps -q -f name=bzzz-agent)
docker network connect bzzz-internal $(docker ps -q -f name=bzzz-agent)

Performance Issues

High Latency Diagnosis

# Check operation latencies (query passed via --data-urlencode because it contains a space and [5m])
curl -s -G 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.95, rate(bzzz_dht_operation_latency_seconds_bucket[5m]))'

# Identify bottlenecks
docker exec -it $(docker ps -q -f name=bzzz-agent) /app/bzzz profile-cpu 30

# Check network latency between nodes
for node in walnut ironwood acacia; do
  ping -c 10 $node | tail -1
done

Resource Contention

# Check CPU usage
docker stats --no-stream --format "table {{.Container}}\t{{.CPUPerc}}"

# Check I/O wait
iostat -x 1 5

# Check network utilization
iftop -i eth0

Debugging Tools

Application Debugging

# Enable debug logging
docker service update --env-add LOG_LEVEL=debug bzzz-v2_bzzz-agent

# Access debug endpoints
curl -s http://localhost:8080/debug/pprof/heap > heap.prof
go tool pprof heap.prof

# Trace requests
curl -s http://localhost:8080/debug/requests

System Debugging

# System resource usage
htop
iotop
nethogs

# Process analysis
ps aux --sort=-%cpu | head -20
ps aux --sort=-%mem | head -20

# Network analysis
netstat -tulpn | grep -E ":9000|:9101|:9102|:9103"
ss -tuln | grep -E ":9000|:9101|:9102|:9103"

Maintenance Procedures

Scheduled Maintenance

Weekly Maintenance (Low-impact)

Review system health metrics
Check log sizes and rotate if necessary
Update monitoring dashboards
Validate backup integrity

Monthly Maintenance (Medium-impact)

Update non-critical components
Perform capacity planning review
Test disaster recovery procedures
Security scan and updates

Quarterly Maintenance (High-impact)

Major version updates
Infrastructure upgrades
Performance optimization review
Security audit and remediation

Update Procedures

Rolling Updates

# Update with zero downtime
docker service update \
  --image registry.home.deepblack.cloud/bzzz:v2.1.0 \
  --update-parallelism 1 \
  --update-delay 30s \
  --update-failure-action rollback \
  bzzz-v2_bzzz-agent
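
As a rough planning aid, the update parameters above put a floor on how long a rollout takes: with tasks updated in batches of `--update-parallelism`, the waits between batches alone add up to (batches - 1) × delay. The sketch below computes that lower bound only; it ignores image pull and task start-up time, and `min_rollout_delay` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# min_rollout_delay <replicas> <parallelism> <delay_seconds>
# Lower bound in seconds on the time spent waiting between update batches:
# (ceil(replicas / parallelism) - 1) * delay. Hypothetical helper for estimation only.
min_rollout_delay() {
  local replicas="$1" parallelism="$2" delay="$3"
  local batches=$(( (replicas + parallelism - 1) / parallelism ))
  if [ "$batches" -lt 1 ]; then
    echo 0
    return
  fi
  echo $(( (batches - 1) * delay ))
}

# Example: 5 replicas, one task at a time, 30s between batches
min_rollout_delay 5 1 30
```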

Configuration Updates

# Rotate configuration (changing a config triggers a rolling restart of the service tasks)
docker config create bzzz_v2_config_new /path/to/new/config.yaml

docker service update \
  --config-rm bzzz_v2_config \
  --config-add source=bzzz_v2_config_new,target=/app/config/config.yaml \
  bzzz-v2_bzzz-agent

# Cleanup old config
docker config rm bzzz_v2_config

Database Maintenance

# Database optimization
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "VACUUM ANALYZE;"

# Update statistics
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "ANALYZE;"

# Check database size
docker exec -it $(docker ps -q -f name=postgres) \
  psql -U bzzz -d bzzz_v2 -c "SELECT pg_size_pretty(pg_database_size('bzzz_v2'));"

Capacity Planning

Growth Projections

Monitor resource usage trends over time
Project capacity needs based on growth patterns
Plan for seasonal or event-driven spikes

Scaling Decisions

# Horizontal scaling
docker service scale bzzz-v2_bzzz-agent=5

# Vertical scaling
docker service update \
  --limit-memory 8G \
  --limit-cpu 4 \
  bzzz-v2_bzzz-agent

# Add new node to swarm (prints the join command to run on the new node)
docker swarm join-token worker

Resource Monitoring

Set up capacity alerts at 70% utilization
Monitor growth rate and extrapolate
Plan infrastructure expansions 3-6 months ahead
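
The 70% alert threshold and the growth extrapolation can be sketched with simple arithmetic. This assumes linear growth, which is a simplification, and `over_capacity_threshold` and `months_until_full` are hypothetical helper names. Units are arbitrary as long as used, total, and growth agree.

```shell
#!/usr/bin/env bash
# over_capacity_threshold <used> <total>
# Exit status 0 when utilization is at or above 70%.
over_capacity_threshold() {
  awk -v u="$1" -v t="$2" 'BEGIN { exit !(t > 0 && u * 100 >= 70 * t) }'
}

# months_until_full <used> <total> <growth_per_month>
# Linear extrapolation of the months remaining before capacity is exhausted.
months_until_full() {
  awk -v u="$1" -v t="$2" -v g="$3" 'BEGIN {
    if (g <= 0) { print "never"; exit }
    printf "%.1f\n", (t - u) / g
  }'
}

# Example: 350 of 500 GB used, growing by 25 GB per month
over_capacity_threshold 350 500 && echo "capacity alert"
months_until_full 350 500 25
```

In this example utilization is exactly 70% and the extrapolation gives six months of headroom, which is the outer edge of the 3-6 month planning horizon.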

Contact Information

Primary Contact: Tony (@tony)
Team: BZZZ Infrastructure Team
Documentation: https://wiki.chorus.services/bzzz
Source Code: https://gitea.chorus.services/tony/BZZZ

Last Updated: 2025-01-01
Version: 2.0
Review Date: 2025-04-01
