# BZZZ Installation & Deployment Plan

## Architecture Overview

BZZZ employs a sophisticated distributed installation strategy that progresses through distinct phases: initial setup, SSH-based cluster deployment, P2P network formation, leader election, and finally DHT-based business configuration storage.

## Key Principles

1. **Security-First Design**: Multi-layered key management with Shamir's Secret Sharing
2. **Distributed Authority**: Clear separation between Admin (human oversight) and Leader (network operations)
3. **P2P Model Distribution**: Bandwidth-efficient model replication across the cluster
4. **DHT Business Storage**: Configuration data stored in a distributed hash table post-bootstrap
5. **Capability-Based Discovery**: Nodes announce capabilities and auto-organize

## Phase 1: Initial Node Setup & Key Generation

### 1.1 Bootstrap Machine Installation

```bash
curl -fsSL https://chorus.services/install.sh | sh
```

**Actions Performed:**
- System detection and validation
- BZZZ binary installation
- Docker and dependency setup
- Launch configuration web UI at `http://[node-ip]:8080/setup`

### 1.2 Master Key Generation & Display

**Key Generation Process:**

1. **Master Key Pair Generation**
   - Generate RSA 4096-bit master key pair
   - **CRITICAL**: Display private key ONCE in read-only format
   - User must securely store master private key (not stored on system)
   - Master public key stored locally for validation

2. **Admin Role Key Generation**
   - Generate admin role RSA 4096-bit key pair
   - Admin public key stored locally
   - **Admin private key split using Shamir's Secret Sharing**

3. **Shamir's Secret Sharing Implementation**
   - Split admin private key into N shares (where N = cluster size)
   - Require K shares for reconstruction (K = ceiling(N/2) + 1)
   - Distribute shares to BZZZ peers once the network is established
   - Ensures no single node failure compromises admin access

### 1.3 Web UI Security Display

```
┌─────────────────────────────────────────────────────────────────┐
│ 🔐 CRITICAL: Master Private Key - DISPLAY ONCE ONLY             │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ -----BEGIN RSA PRIVATE KEY-----                                 │
│ [MASTER_PRIVATE_KEY_CONTENT]                                    │
│ -----END RSA PRIVATE KEY-----                                   │
│                                                                 │
│ ⚠️ SECURITY NOTICE:                                             │
│ • This key will NEVER be displayed again                        │
│ • Store in secure password manager immediately                  │
│ • Required for emergency cluster recovery                       │
│ • Loss of this key may require complete reinstallation         │
│                                                                 │
│ [ ] I have securely stored the master private key               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```

## Phase 2: Cluster Node Discovery & SSH Deployment

### 2.1 Manual IP Entry Interface

**Web UI Node Discovery:**

```
┌─────────────────────────────────────────────────────────────────┐
│ 🌐 Cluster Node Discovery                                       │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ Enter IP addresses for cluster nodes (one per line):            │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ 192.168.1.101                                               │ │
│ │ 192.168.1.102                                               │ │
│ │ 192.168.1.103                                               │ │
│ │ 192.168.1.104                                               │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                                                 │
│ SSH Configuration:                                              │
│ Username: [admin_user ]  Port: [22 ]                            │
│ Password: [••••••••••••••]  or Key: [Browse...]                 │
│                                                                 │
│ [ ] Test SSH Connectivity          [Deploy to Cluster]          │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
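The **Test SSH Connectivity** action corresponds to the validation step in the next section. A minimal sketch of what that check might run, assuming Python with the `paramiko` library; the `sudo -n true` probe and the hosts/credentials shown are illustrative, not the shipped implementation:

```python
import paramiko

def test_ssh_connectivity(hosts, username, password=None, key_path=None, port=22):
    """Validate SSH reachability and passwordless sudo on each candidate node."""
    results = {}
    for host in hosts:
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        try:
            client.connect(host, port=port, username=username,
                           password=password, key_filename=key_path, timeout=10)
            # "sudo -n" exits non-zero instead of prompting, so it doubles
            # as a safe probe for passwordless sudo privileges.
            _, stdout, _ = client.exec_command("sudo -n true")
            ok = stdout.channel.recv_exit_status() == 0
            results[host] = "ok" if ok else "no sudo"
        except Exception as exc:
            results[host] = f"unreachable: {exc}"
        finally:
            client.close()
    return results

print(test_ssh_connectivity(["192.168.1.101", "192.168.1.102"], "admin_user",
                            password="example-password"))
```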
### 2.2 SSH-Based Remote Installation

**For Each Target Node:**

1. **SSH Connectivity Validation**
   - Test SSH access with provided credentials
   - Validate sudo privileges
   - Check system compatibility

2. **Remote BZZZ Installation**

   ```bash
   # Executed via SSH on each target node
   ssh admin_user@192.168.1.101 "curl -fsSL https://chorus.services/install.sh | BZZZ_ROLE=worker sh"
   ```

3. **Configuration Transfer**
   - Copy master public key to node
   - Install BZZZ binaries and dependencies
   - Configure systemd services
   - Set initial network parameters (bootstrap node address)

4. **Service Initialization**
   - Start BZZZ service in cluster-join mode
   - Configure P2P network parameters
   - Set announce channel subscription

## Phase 3: P2P Network Formation & Capability Discovery

### 3.1 P2P Network Bootstrap

**Network Formation Process:**

1. **Bootstrap Node Configuration**
   - First installed node becomes the bootstrap node
   - Listens for P2P connections on the configured port
   - Maintains peer discovery registry

2. **Peer Discovery via Announce Channel**

   ```yaml
   announce_message:
     node_id: "node-192168001101-20250810"
     capabilities:
       - gpu_count: 4
       - gpu_type: "nvidia"
       - gpu_memory: [24576, 24576, 24576, 24576]  # MB per GPU
       - cpu_cores: 32
       - memory_gb: 128
       - storage_gb: 2048
       - ollama_type: "parallama"
     network_info:
       ip_address: "192.168.1.101"
       p2p_port: 8081
       services:
         - bzzz_go: 8080
         - mcp_server: 3000
     joined_at: "2025-08-10T16:22:20Z"
   ```

3. **Capability-Based Network Organization**
   - Nodes self-organize based on announced capabilities
   - GPU-enabled nodes form AI processing pools
   - Storage nodes identified for DHT participation
   - Network topology dynamically optimized

### 3.2 Shamir Share Distribution

**Once P2P Network Established:**

1. Generate N shares of the admin private key (N = peer count)
2. Distribute one share to each peer via an encrypted P2P channel
3. Each peer stores its share encrypted with its node-specific key
4. Verify share distribution and reconstruction capability (see the sketch below)
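A toy illustration of the split-and-reconstruct cycle, assuming a textbook Shamir scheme over a single prime field with the threshold K = ceiling(N/2) + 1 from Phase 1.2; the integer secret stands in for the admin private key, and a production build would use a vetted library operating on the full key bytes:

```python
import math
import secrets

PRIME = 2**127 - 1  # Mersenne prime field; fine for a demo-sized secret

def split(secret: int, n: int) -> list[tuple[int, int]]:
    """Split `secret` into n shares with threshold k = ceiling(n/2) + 1."""
    k = math.ceil(n / 2) + 1
    # Random polynomial of degree k-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]
    return [(x, sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME)
            for x in range(1, n + 1)]

def reconstruct(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 recovers the constant term."""
    total = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * -xj % PRIME
                den = den * (xi - xj) % PRIME
        total = (total + yi * num * pow(den, -1, PRIME)) % PRIME
    return total

shares = split(secret=123456789, n=5)        # 5 peers -> threshold k = 4
assert reconstruct(shares[:4]) == 123456789  # any 4 shares recover the secret
assert reconstruct(shares[1:]) == 123456789  # a different 4 shares also work
```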
## Phase 4: Leader Election & SLURP Responsibilities

### 4.1 Leader Election Algorithm

**Election Criteria (Weighted Scoring):**
- **Network Stability**: Uptime and connection quality (30%)
- **Hardware Resources**: CPU, memory, storage capacity (25%)
- **Network Position**: Connectivity to other peers (20%)
- **Geographic Distribution**: Network latency optimization (15%)
- **Load Capacity**: Current resource utilization (10%)

**Election Process:**
1. Each node calculates its fitness score
2. Nodes broadcast their scores and capabilities
3. Consensus algorithm determines the leader (highest score + network agreement)
4. Leader election occurs every 24 hours or on leader failure
5. **Leader ≠ Admin**: the Leader handles operations, the Admin handles oversight

### 4.2 SLURP Responsibilities (Leader Node)

**SLURP = Service Layer Unified Resource Protocol**

**Leader Responsibilities:**
- **Resource Orchestration**: Task distribution across the cluster
- **Model Distribution**: Coordinate ollama model replication
- **Load Balancing**: Distribute AI workloads optimally
- **Network Health**: Monitor peer connectivity and performance
- **DHT Coordination**: Manage distributed storage operations

**Leader Election Display:**

```
🏆 Network Leader Election Results
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Current Leader: node-192168001103-20250810
├─ Hardware Score: 95/100 (4x RTX 4090, 128GB RAM)
├─ Network Score: 89/100 (Central position, low latency)
├─ Stability Score: 96/100 (99.8% uptime)
└─ Overall Score: 93.2/100

Network Topology:
├─ Total Nodes: 5
├─ GPU Nodes: 4 (Parallama enabled)
├─ Storage Nodes: 5 (DHT participants)
├─ Available VRAM: 384GB total
└─ Network Latency: avg 2.3ms

Next Election: 2025-08-11 16:22:20 UTC
```

## Phase 5: Business Configuration & DHT Storage

### 5.1 DHT Bootstrap & Business Data Storage

**Only After Leader Election:**
- DHT network becomes available for business data storage
- Configuration data migrated from local storage to the DHT
- Business decisions stored using UCXL addresses

**UCXL Address Format:**

```
ucxl://bzzz.cluster.config/network_topology
ucxl://bzzz.cluster.config/resource_allocation
ucxl://bzzz.cluster.config/ai_models
ucxl://bzzz.cluster.config/user_projects
```

### 5.2 Business Configuration Categories

**Stored in DHT (Post-Bootstrap):**
- Network topology and node roles
- Resource allocation policies
- AI model distribution strategies
- User project configurations
- Cost management settings
- Monitoring and alerting rules

**Kept Locally (Security/Bootstrap):**
- Admin user's public key
- Master public key for validation
- Initial IP candidate list
- Domain/DNS configuration
- Bootstrap node addresses

## Phase 6: Model Distribution & Synchronization

### 6.1 P2P Model Distribution Strategy

**Model Distribution Logic:**

```python
def distribute_model(model_info):
    """Replicate a model to every node with enough free VRAM to host it.

    `cluster_nodes`, `leader`, and `primary_model_node` come from the
    cluster runtime context.
    """
    vram_required = model_info.vram_requirement_gb

    # Find eligible nodes: the model must fit in available VRAM.
    eligible_nodes = [node for node in cluster_nodes
                      if node.available_vram_gb >= vram_required]

    # Schedule a transfer to each eligible node that lacks the model.
    for node in eligible_nodes:
        if not node.has_model(model_info.id):
            leader.schedule_model_transfer(
                source=primary_model_node,
                target=node,
                model=model_info,
            )
```

**Distribution Priorities** (a selection sketch follows this list):
1. **GPU Memory Threshold**: Model must fit in available VRAM
2. **Redundancy**: Minimum 3 copies across different nodes
3. **Geographic Distribution**: Spread across network topology
4. **Load Balancing**: Distribute based on current node utilization
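How these priorities might combine when choosing transfer targets, as a hedged sketch layered on the pseudocode above; the `Node` fields are assumptions, and geographic spreading (priority 3) is omitted for brevity:

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: str
    available_vram_gb: float
    utilization: float  # 0.0 (idle) .. 1.0 (saturated)
    has_copy: bool      # node already holds this model

MIN_REPLICAS = 3  # priority 2: minimum redundancy

def pick_targets(nodes: list[Node], vram_required: float) -> list[Node]:
    """Choose nodes to receive new copies, per the distribution priorities."""
    # Priority 1: the model must fit in available VRAM.
    eligible = [n for n in nodes if n.available_vram_gb >= vram_required]
    existing = sum(1 for n in eligible if n.has_copy)
    needed = max(MIN_REPLICAS - existing, 0)
    # Priority 4: prefer the least-loaded nodes for new copies.
    candidates = sorted((n for n in eligible if not n.has_copy),
                        key=lambda n: n.utilization)
    return candidates[:needed]

nodes = [Node("a", 24, 0.2, True), Node("b", 24, 0.9, False),
         Node("c", 48, 0.1, False), Node("d", 8, 0.0, False)]
print([n.node_id for n in pick_targets(nodes, vram_required=16)])  # ['c', 'b']
```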
### 6.2 Model Version Synchronization (TODO)

**Current Status**: Implementation pending

**Requirements:**
- Track model versions across all nodes
- Coordinate updates when new model versions are released
- Handle rollback scenarios for failed updates
- Maintain consistency during network partitions

**TODO Items to Address:**
- [ ] Design version tracking mechanism
- [ ] Implement distributed consensus for updates
- [ ] Create rollback/recovery procedures
- [ ] Handle split-brain scenarios during updates

## Phase 7: Role-Based Key Generation

### 7.1 Dynamic Role Key Creation

**Using Admin Private Key (Post-Bootstrap):**

1. **User Defines Custom Roles** via web UI:

   ```yaml
   roles:
     - name: "data_scientist"
       permissions: ["model_access", "job_submit", "resource_view"]
     - name: "ml_engineer"
       permissions: ["model_deploy", "cluster_config", "monitoring"]
     - name: "project_manager"
       permissions: ["user_management", "cost_monitoring", "reporting"]
   ```

2. **Admin Key Reconstruction**:
   - Collect K shares from network peers
   - Reconstruct the admin private key temporarily in memory
   - Generate role-specific key pairs
   - Sign role public keys with the admin private key
   - Clear the admin private key from memory

3. **Role Key Distribution**:
   - Store role key pairs in the DHT under UCXL addresses
   - Distribute to authorized users via secure channels
   - Handle revocation through DHT updates

## Installation Flow Summary

```
Phase 1: Bootstrap Setup
├─ curl install.sh → Web UI → Master Key Display (ONCE)
├─ Generate admin keys → Shamir split preparation
└─ Manual IP entry for cluster nodes

Phase 2: SSH Cluster Deployment
├─ SSH connectivity validation
├─ Remote BZZZ installation on all nodes
└─ Service startup with P2P parameters

Phase 3: P2P Network Formation
├─ Capability announcement via announce channel
├─ Peer discovery and network topology
└─ Shamir share distribution

Phase 4: Leader Election
├─ Fitness score calculation and consensus
├─ Leader takes SLURP responsibilities
└─ Network operational status achieved

Phase 5: DHT & Business Storage
├─ DHT network becomes available
├─ Business configuration migrated to UCXL addresses
└─ Local storage limited to security essentials

Phase 6: Model Distribution
├─ P2P model replication based on VRAM capacity
├─ Version synchronization (TODO)
└─ Load balancing and redundancy

Phase 7: Role Management
├─ Dynamic role definition via web UI
├─ Admin key reconstruction for signing
└─ Role-based access control deployment
```

## Security Considerations

### Data Storage Security
- **Sensitive Data**: Never stored in the DHT (keys, passwords)
- **Business Data**: Encrypted before DHT storage
- **Network Communication**: All P2P traffic encrypted
- **Key Recovery**: Master key required for emergency access

### Network Security
- **mTLS**: All inter-node communication secured
- **Certificate Rotation**: Automated certificate renewal
- **Access Control**: Role-based permissions enforced
- **Audit Logging**: All privileged operations logged

## Monitoring & Observability

### Network Health Metrics
- P2P connection quality and latency
- DHT data consistency and replication
- Model distribution status and synchronization
- Leader election frequency and stability

### Business Metrics
- Resource utilization across the cluster
- Cost tracking and budget adherence
- AI workload distribution and performance
- User activity and access patterns

## Failure Recovery Procedures

### Leader Failure
1. Automatic re-election triggered
2. New leader assumes SLURP responsibilities
3. DHT operations continue uninterrupted
4. Model distribution resumes under the new leader

### Network Partition
1. Majority partition continues operations
2. Minority partitions enter read-only mode
3. Automatic healing when connectivity is restored
4. Conflict resolution via timestamp ordering

### Admin Key Recovery
1. Master private key required for recovery
2. Generate new admin key pair if needed
3. Re-split and redistribute Shamir shares
4. Update role signatures with the new admin key

This plan provides a comprehensive, security-focused approach to BZZZ cluster deployment, with clear separation of concerns and robust failure recovery mechanisms.