BZZZ Installation & Deployment Plan

Architecture Overview

BZZZ employs a distributed installation strategy that progresses through distinct phases: initial setup, SSH-based cluster deployment, P2P network formation, leader election, and DHT-based business configuration storage, followed by model distribution and role-based key management.

Key Principles

  1. Security-First Design: Multi-layered key management with Shamir's Secret Sharing
  2. Distributed Authority: Clear separation between Admin (human oversight) and Leader (network operations)
  3. P2P Model Distribution: Bandwidth-efficient model replication across the cluster
  4. DHT Business Storage: Configuration data stored in distributed hash table post-bootstrap
  5. Capability-Based Discovery: Nodes announce capabilities and auto-organize

Phase 1: Initial Node Setup & Key Generation

1.1 Bootstrap Machine Installation

curl -fsSL https://chorus.services/install.sh | sh

Actions Performed:

  • System detection and validation
  • BZZZ binary installation
  • Docker and dependency setup
  • Launch configuration web UI at http://[node-ip]:8080/setup

1.2 Master Key Generation & Display

Key Generation Process:

  1. Master Key Pair Generation

    • Generate RSA 4096-bit master key pair
    • CRITICAL: Display private key ONCE in read-only format
    • User must securely store master private key (not stored on system)
    • Master public key stored locally for validation
  2. Admin Role Key Generation

    • Generate admin role RSA 4096-bit key pair
    • Admin public key stored locally
    • Admin private key split using Shamir's Secret Sharing
  3. Shamir's Secret Sharing Implementation (sketched below)

    • Split the admin private key into N shares (where N = cluster size)
    • Require K shares for reconstruction (K = ceiling(N/2) + 1)
    • Distribute shares to BZZZ peers once the network is established
    • Ensures that no single node failure compromises admin access
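
The Shamir math itself is standard polynomial secret sharing over a prime field. Below is a minimal, self-contained sketch; it assumes the shared secret is a 256-bit key-wrapping key protecting the admin private key (splitting a full RSA-4096 key directly would need a larger field or chunking), and none of these names are BZZZ APIs:

import secrets

PRIME = 2**521 - 1  # Mersenne prime; comfortably larger than a 256-bit secret

def split(secret: int, n: int, k: int) -> list[tuple[int, int]]:
    """Split secret into n shares; any k of them reconstruct it."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def eval_poly(x: int) -> int:
        acc = 0
        for c in reversed(coeffs):  # Horner evaluation mod PRIME
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, eval_poly(x)) for x in range(1, n + 1)]

def combine(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 recovers the secret."""
    total = 0
    for xi, yi in shares:
        num = den = 1
        for xj, _ in shares:
            if xj != xi:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total

# N = 5 peers, K = ceiling(5/2) + 1 = 4 shares required
secret = int.from_bytes(secrets.token_bytes(32), "big")
shares = split(secret, n=5, k=4)
assert combine(shares[:4]) == secret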

1.3 Web UI Security Display

┌─────────────────────────────────────────────────────────────────┐
│ 🔐 CRITICAL: Master Private Key - DISPLAY ONCE ONLY              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ -----BEGIN RSA PRIVATE KEY-----                                 │
│ [MASTER_PRIVATE_KEY_CONTENT]                                    │
│ -----END RSA PRIVATE KEY-----                                   │
│                                                                 │
│ ⚠️  SECURITY NOTICE:                                            │
│ • This key will NEVER be displayed again                       │
│ • Store in secure password manager immediately                 │
│ • Required for emergency cluster recovery                      │
│ • Loss of this key may require complete reinstallation        │
│                                                                 │
│ [ ] I have securely stored the master private key              │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

Phase 2: Cluster Node Discovery & SSH Deployment

2.1 Manual IP Entry Interface

Web UI Node Discovery:

┌─────────────────────────────────────────────────────────────────┐
│ 🌐 Cluster Node Discovery                                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ Enter IP addresses for cluster nodes (one per line):           │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ 192.168.1.101                                               │ │
│ │ 192.168.1.102                                               │ │
│ │ 192.168.1.103                                               │ │
│ │ 192.168.1.104                                               │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                                                 │
│ SSH Configuration:                                              │
│ Username: [admin_user    ] Port: [22  ]                       │
│ Password: [••••••••••••••] or Key: [Browse...]                │
│                                                                 │
│ [ ] Test SSH Connectivity    [Deploy to Cluster]               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

2.2 SSH-Based Remote Installation

For Each Target Node:

  1. SSH Connectivity Validation

    • Test SSH access with provided credentials
    • Validate sudo privileges
    • Check system compatibility
  2. Remote BZZZ Installation

    # Executed via SSH on each target node
    ssh admin_user@192.168.1.101 "curl -fsSL https://chorus.services/install.sh | BZZZ_ROLE=worker sh"
    
  3. Configuration Transfer

    • Copy master public key to node
    • Install BZZZ binaries and dependencies
    • Configure systemd services
    • Set initial network parameters (bootstrap node address)
  4. Service Initialization

    • Start BZZZ service in cluster-join mode
    • Configure P2P network parameters
    • Set announce channel subscription
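
A minimal deployment loop might drive the standard ssh client from Python; this sketch assumes key-based authentication (BatchMode disables password prompts) and is illustrative rather than the actual installer logic:

import subprocess

NODES = ["192.168.1.101", "192.168.1.102", "192.168.1.103", "192.168.1.104"]
INSTALL_CMD = "curl -fsSL https://chorus.services/install.sh | BZZZ_ROLE=worker sh"

def deploy(node: str, user: str = "admin_user", port: int = 22) -> bool:
    """Validate SSH connectivity, then run the installer on one node."""
    target = f"{user}@{node}"
    # Connectivity probe: 'true' exits 0 only if auth succeeds
    probe = subprocess.run(
        ["ssh", "-p", str(port), "-o", "BatchMode=yes", target, "true"],
        capture_output=True,
    )
    if probe.returncode != 0:
        print(f"[skip] {node}: SSH unreachable")
        return False
    # Remote installation
    result = subprocess.run(["ssh", "-p", str(port), target, INSTALL_CMD])
    return result.returncode == 0

for node in NODES:
    deploy(node)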

Phase 3: P2P Network Formation & Capability Discovery

3.1 P2P Network Bootstrap

Network Formation Process:

  1. Bootstrap Node Configuration

    • First installed node becomes bootstrap node
    • Listens for P2P connections on configured port
    • Maintains peer discovery registry
  2. Peer Discovery via Announce Channel

    announce_message:
      node_id: "node-192168001101-20250810"
      capabilities:
        - gpu_count: 4
        - gpu_type: "nvidia"
        - gpu_memory: [24576, 24576, 24576, 24576] # MB per GPU
        - cpu_cores: 32
        - memory_gb: 128
        - storage_gb: 2048
        - ollama_type: "parallama"
      network_info:
        ip_address: "192.168.1.101"
        p2p_port: 8081
        services:
          - bzzz_go: 8080
          - mcp_server: 3000
      joined_at: "2025-08-10T16:22:20Z"
    
  3. Capability-Based Network Organization

    • Nodes self-organize based on announced capabilities
    • GPU-enabled nodes form AI processing pools
    • Storage nodes identified for DHT participation
    • Network topology dynamically optimized
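
As an illustration of the self-organization step, a coordinator could bucket peers into pools straight from their announce messages; the attribute names follow the announce schema above, but the pool logic is a sketch, not the BZZZ implementation:

def organize_pools(nodes):
    """Group peers into capability pools based on announced hardware."""
    pools = {"ai_processing": [], "dht_storage": []}
    for node in nodes:
        caps = node.capabilities  # parsed from the node's announce message
        if caps.get("gpu_count", 0) > 0:
            pools["ai_processing"].append(node)
        if caps.get("storage_gb", 0) >= 512:  # illustrative threshold
            pools["dht_storage"].append(node)
    return pools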

3.2 Shamir Share Distribution

Once P2P Network Established:

  1. Generate N shares of admin private key (N = peer count)
  2. Distribute one share to each peer via encrypted P2P channel
  3. Each peer stores share encrypted with their node-specific key
  4. Verify share distribution and reconstruction capability
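
Steps 2 and 3 might look like the following sketch, using symmetric Fernet encryption from the cryptography package as a stand-in for the node-specific keys; peer.node_key and peer.send_encrypted_share are hypothetical:

from cryptography.fernet import Fernet

def distribute_shares(shares, peers):
    """Encrypt each Shamir share with that peer's key before sending."""
    for (x, y), peer in zip(shares, peers):
        cipher = Fernet(peer.node_key)          # node-specific key (assumed)
        payload = cipher.encrypt(f"{x}:{y}".encode())
        peer.send_encrypted_share(payload)      # encrypted P2P channel (assumed)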

Phase 4: Leader Election & SLURP Responsibilities

4.1 Leader Election Algorithm

Election Criteria (Weighted Scoring):

  • Network Stability: Uptime and connection quality (30%)
  • Hardware Resources: CPU, Memory, Storage capacity (25%)
  • Network Position: Connectivity to other peers (20%)
  • Geographic Distribution: Network latency optimization (15%)
  • Load Capacity: Current resource utilization (10%)
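
Reduced to code, the weighted score is a dot product of per-criterion scores (each normalized to 0-100) with the weights above; the metric names are illustrative:

WEIGHTS = {
    "stability": 0.30,  # uptime and connection quality
    "hardware":  0.25,  # CPU, memory, storage capacity
    "position":  0.20,  # connectivity to other peers
    "latency":   0.15,  # geographic / latency optimization
    "load":      0.10,  # headroom in current utilization
}

def fitness_score(metrics: dict[str, float]) -> float:
    """Weighted sum of per-criterion scores, each in 0..100."""
    return sum(WEIGHTS[name] * metrics[name] for name in WEIGHTS)

fitness_score({"stability": 96, "hardware": 95, "position": 89,
               "latency": 90, "load": 85})  # -> 92.35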

Election Process:

  1. Each node calculates its fitness score
  2. Nodes broadcast their scores and capabilities
  3. Consensus algorithm determines leader (highest score + network agreement)
  4. Leader election occurs every 24 hours or on leader failure
  5. Leader ≠ Admin: Leader handles operations, Admin handles oversight

4.2 SLURP Responsibilities (Leader Node)

SLURP = Service Layer Unified Resource Protocol

Leader Responsibilities:

  • Resource Orchestration: Task distribution across cluster
  • Model Distribution: Coordinate ollama model replication
  • Load Balancing: Distribute AI workloads optimally
  • Network Health: Monitor peer connectivity and performance
  • DHT Coordination: Manage distributed storage operations

Leader Election Display:

🏆 Network Leader Election Results
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Current Leader: node-192168001103-20250810
├─ Hardware Score: 95/100 (4x RTX 4090, 128GB RAM)
├─ Network Score: 89/100 (Central position, low latency)
├─ Stability Score: 96/100 (99.8% uptime)
└─ Overall Score: 93.2/100

Network Topology:
├─ Total Nodes: 5
├─ GPU Nodes: 4 (Parallama enabled)
├─ Storage Nodes: 5 (DHT participants)
├─ Available VRAM: 384GB total
└─ Network Latency: avg 2.3ms

Next Election: 2025-08-11 16:22:20 UTC

Phase 5: Business Configuration & DHT Storage

5.1 DHT Bootstrap & Business Data Storage

Only After Leader Election:

  • DHT network becomes available for business data storage
  • Configuration data migrated from local storage to DHT
  • Business decisions stored using UCXL addresses

UCXL Address Format:

ucxl://bzzz.cluster.config/network_topology
ucxl://bzzz.cluster.config/resource_allocation  
ucxl://bzzz.cluster.config/ai_models
ucxl://bzzz.cluster.config/user_projects
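
Conceptually, the UCXL address is just the DHT key. A toy sketch of the read/write pattern (the real BZZZ client API is not shown here, so this interface is assumed):

import json

class DHTClient:
    """In-memory stand-in for the distributed hash table."""
    def __init__(self):
        self._store = {}

    def put(self, ucxl_address: str, value: dict) -> None:
        self._store[ucxl_address] = json.dumps(value)  # encrypt before storing in practice

    def get(self, ucxl_address: str) -> dict:
        return json.loads(self._store[ucxl_address])

dht = DHTClient()
dht.put("ucxl://bzzz.cluster.config/network_topology",
        {"leader": "node-192168001103-20250810", "nodes": 5})
dht.get("ucxl://bzzz.cluster.config/network_topology")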

5.2 Business Configuration Categories

Stored in DHT (Post-Bootstrap):

  • Network topology and node roles
  • Resource allocation policies
  • AI model distribution strategies
  • User project configurations
  • Cost management settings
  • Monitoring and alerting rules

Kept Locally (Security/Bootstrap):

  • Admin user's public key
  • Master public key for validation
  • Initial IP candidate list
  • Domain/DNS configuration
  • Bootstrap node addresses

Phase 6: Model Distribution & Synchronization

6.1 P2P Model Distribution Strategy

Model Distribution Logic:

def distribute_model(model_info, cluster_nodes, leader, primary_model_node):
    """Replicate a model to every cluster node with enough free VRAM."""
    vram_req = model_info.vram_requirement_gb

    # Find nodes whose free VRAM can hold the model
    eligible_nodes = [node for node in cluster_nodes
                      if node.available_vram_gb >= vram_req]

    # Schedule a P2P transfer to each eligible node that lacks the model
    for node in eligible_nodes:
        if not node.has_model(model_info.id):
            leader.schedule_model_transfer(
                source=primary_model_node,
                target=node,
                model=model_info,
            )

Distribution Priorities:

  1. GPU Memory Threshold: Model must fit in available VRAM
  2. Redundancy: Minimum 3 copies across different nodes (enforced in the sketch below)
  3. Geographic Distribution: Spread across network topology
  4. Load Balancing: Distribute based on current node utilization
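
Enforcing the redundancy floor (priority 2) on top of the eligibility filter could look like this sketch; current_utilization is an assumed node attribute:

MIN_REPLICAS = 3

def choose_targets(eligible_nodes, model_id):
    """Pick extra replica targets until the 3-copy floor is met,
    preferring the least-loaded nodes (priority 4)."""
    holders = [n for n in eligible_nodes if n.has_model(model_id)]
    needed = max(MIN_REPLICAS - len(holders), 0)
    candidates = sorted(
        (n for n in eligible_nodes if not n.has_model(model_id)),
        key=lambda n: n.current_utilization,
    )
    return candidates[:needed]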

6.2 Model Version Synchronization (TODO)

Current Status: Implementation pending.

Requirements:

  • Track model versions across all nodes
  • Coordinate updates when new model versions released
  • Handle rollback scenarios for failed updates
  • Maintain consistency during network partitions

TODO Items to Address:

  • Design version tracking mechanism
  • Implement distributed consensus for updates
  • Create rollback/recovery procedures
  • Handle split-brain scenarios during updates

Phase 7: Role-Based Key Generation

7.1 Dynamic Role Key Creation

Using Admin Private Key (Post-Bootstrap):

  1. User Defines Custom Roles via web UI:

    roles:
      - name: "data_scientist"
        permissions: ["model_access", "job_submit", "resource_view"]
      - name: "ml_engineer" 
        permissions: ["model_deploy", "cluster_config", "monitoring"]
      - name: "project_manager"
        permissions: ["user_management", "cost_monitoring", "reporting"]
    
  2. Admin Key Reconstruction (sketched after this list):

    • Collect K shares from network peers
    • Reconstruct admin private key temporarily in memory
    • Generate role-specific key pairs
    • Sign role public keys with admin private key
    • Clear admin private key from memory
  3. Role Key Distribution:

    • Store role key pairs in DHT with UCXL addresses
    • Distribute to authorized users via secure channels
    • Revocation handled through DHT updates
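
A sketch of the reconstruct-sign-clear flow in step 2, reusing combine() from the Phase 1 sketch; decrypt_admin_private_key, request_admin_share, and the sign() method are hypothetical placeholders, not BZZZ APIs:

def sign_role_keys(peers, role_public_keys, k):
    """Temporarily reconstruct the admin key, sign role keys, then drop it."""
    # Collect K shares from network peers (transport assumed encrypted)
    shares = [peer.request_admin_share() for peer in peers[:k]]

    # Rebuild the key-wrapping secret and unwrap the admin private key
    wrapping_key = combine(shares)
    admin_key = decrypt_admin_private_key(wrapping_key)

    try:
        # Sign each role public key with the admin private key
        return {name: admin_key.sign(pub)
                for name, pub in role_public_keys.items()}
    finally:
        # Drop references immediately; true zeroization of memory
        # requires lower-level control than pure Python offers
        del admin_key, wrapping_key, shares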

Installation Flow Summary

Phase 1: Bootstrap Setup
├─ curl install.sh → Web UI → Master Key Display (ONCE)
├─ Generate admin keys → Shamir split preparation
└─ Manual IP entry for cluster nodes

Phase 2: SSH Cluster Deployment  
├─ SSH connectivity validation
├─ Remote BZZZ installation on all nodes
└─ Service startup with P2P parameters

Phase 3: P2P Network Formation
├─ Capability announcement via announce channel
├─ Peer discovery and network topology
└─ Shamir share distribution

Phase 4: Leader Election
├─ Fitness score calculation and consensus
├─ Leader takes SLURP responsibilities
└─ Network operational status achieved

Phase 5: DHT & Business Storage
├─ DHT network becomes available
├─ Business configuration migrated to UCXL addresses
└─ Local storage limited to security essentials

Phase 6: Model Distribution
├─ P2P model replication based on VRAM capacity
├─ Version synchronization (TODO)
└─ Load balancing and redundancy

Phase 7: Role Management
├─ Dynamic role definition via web UI
├─ Admin key reconstruction for signing
└─ Role-based access control deployment

Security Considerations

Data Storage Security

  • Sensitive Data: Never stored in DHT (keys, passwords)
  • Business Data: Encrypted before DHT storage
  • Network Communication: All P2P traffic encrypted
  • Key Recovery: Master key required for emergency access

Network Security

  • mTLS: All inter-node communication secured
  • Certificate Rotation: Automated cert renewal
  • Access Control: Role-based permissions enforced
  • Audit Logging: All privileged operations logged

Monitoring & Observability

Network Health Metrics

  • P2P connection quality and latency
  • DHT data consistency and replication
  • Model distribution status and synchronization
  • Leader election frequency and stability

Business Metrics

  • Resource utilization across cluster
  • Cost tracking and budget adherence
  • AI workload distribution and performance
  • User activity and access patterns

Failure Recovery Procedures

Leader Failure

  1. Automatic re-election triggered
  2. New leader assumes SLURP responsibilities
  3. DHT operations continue uninterrupted
  4. Model distribution resumes under new leader

Network Partition

  1. Majority partition continues operations
  2. Minority partitions enter read-only mode
  3. Automatic healing when connectivity restored
  4. Conflict resolution via timestamp ordering
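
With consistent ISO-8601 timestamps, the timestamp-ordering rule in step 4 reduces to last-write-wins; the record shape here is assumed:

def resolve_conflict(records):
    """Keep the record with the newest timestamp (last-write-wins)."""
    return max(records, key=lambda r: r["timestamp"])

resolve_conflict([
    {"value": "a", "timestamp": "2025-08-10T16:22:20Z"},
    {"value": "b", "timestamp": "2025-08-10T16:25:02Z"},
])  # -> the "b" record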

Admin Key Recovery

  1. Master private key required for recovery
  2. Generate new admin key pair if needed
  3. Re-split and redistribute Shamir shares
  4. Update role signatures with new admin key

This plan provides a comprehensive, security-focused approach to BZZZ cluster deployment with clear separation of concerns and robust failure recovery mechanisms.