Major updates and improvements to BZZZ system
- Updated configuration and deployment files
- Improved system architecture and components
- Enhanced documentation and testing
- Fixed various issues and added new features

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
412
deployments/bare-metal/INSTALLATION-DEPLOYMENT-PLAN.md
Normal file
@@ -0,0 +1,412 @@
# BZZZ Installation & Deployment Plan

## Architecture Overview

BZZZ uses a distributed installation strategy that progresses through distinct phases: initial setup, SSH-based cluster deployment, P2P network formation, leader election, and finally DHT-based business configuration storage.

## Key Principles

1. **Security-First Design**: Multi-layered key management with Shamir's Secret Sharing
2. **Distributed Authority**: Clear separation between Admin (human oversight) and Leader (network operations)
3. **P2P Model Distribution**: Bandwidth-efficient model replication across the cluster
4. **DHT Business Storage**: Configuration data stored in a distributed hash table post-bootstrap
5. **Capability-Based Discovery**: Nodes announce their capabilities and self-organize
## Phase 1: Initial Node Setup & Key Generation

### 1.1 Bootstrap Machine Installation

```bash
curl -fsSL https://chorus.services/install.sh | sh
```

**Actions Performed:**
- System detection and validation
- BZZZ binary installation
- Docker and dependency setup
- Launch of the configuration web UI at `http://[node-ip]:8080/setup`
### 1.2 Master Key Generation & Display

**Key Generation Process:**
1. **Master Key Pair Generation**
   - Generate RSA 4096-bit master key pair
   - **CRITICAL**: Display private key ONCE in read-only format
   - User must securely store master private key (not stored on system)
   - Master public key stored locally for validation

2. **Admin Role Key Generation**
   - Generate admin role RSA 4096-bit key pair
   - Admin public key stored locally
   - **Admin private key split using Shamir's Secret Sharing**

3. **Shamir's Secret Sharing Implementation**
   - Split admin private key into N shares (where N = cluster size)
   - Require K shares for reconstruction (K = ceiling(N/2) + 1)
   - Distribute shares to BZZZ peers once network is established
   - Ensures no single node failure compromises admin access
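The threshold arithmetic and the split/reconstruct round trip can be illustrated with a small, self-contained sketch. This is not the BZZZ implementation; a production deployment would use a vetted secret-sharing library and operate on the actual admin key material, but the K = ceiling(N/2) + 1 rule and the reconstruction step look like this:

```python
import random
from math import ceil

PRIME = 2**127 - 1  # Mersenne prime; all share arithmetic happens in this field

def split_secret(secret: int, n: int, k: int):
    """Split `secret` into n shares; any k of them reconstruct it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]

    def poly(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME

    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term (the secret)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

n = 5                # hypothetical cluster size
k = ceil(n / 2) + 1  # threshold from the plan: K = ceiling(N/2) + 1, so 4 of 5 here
shares = split_secret(123456789, n, k)
assert reconstruct(shares[:k]) == 123456789   # any K shares recover the secret
assert reconstruct(shares[-k:]) == 123456789
```

In the actual flow the secret would be the admin private key (or a key-encryption key protecting it), and each share would be delivered to a peer over the encrypted P2P channel described in Phase 3.2.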
### 1.3 Web UI Security Display

```
┌─────────────────────────────────────────────────────────────────┐
│ 🔐 CRITICAL: Master Private Key - DISPLAY ONCE ONLY              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ -----BEGIN RSA PRIVATE KEY-----                                 │
│ [MASTER_PRIVATE_KEY_CONTENT]                                    │
│ -----END RSA PRIVATE KEY-----                                   │
│                                                                 │
│ ⚠️ SECURITY NOTICE:                                              │
│ • This key will NEVER be displayed again                        │
│ • Store in secure password manager immediately                  │
│ • Required for emergency cluster recovery                       │
│ • Loss of this key may require complete reinstallation          │
│                                                                 │
│ [ ] I have securely stored the master private key               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
## Phase 2: Cluster Node Discovery & SSH Deployment

### 2.1 Manual IP Entry Interface

**Web UI Node Discovery:**
```
┌─────────────────────────────────────────────────────────────────┐
│ 🌐 Cluster Node Discovery                                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ Enter IP addresses for cluster nodes (one per line):            │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ 192.168.1.101                                               │ │
│ │ 192.168.1.102                                               │ │
│ │ 192.168.1.103                                               │ │
│ │ 192.168.1.104                                               │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                                                 │
│ SSH Configuration:                                              │
│ Username: [admin_user     ]          Port: [22   ]              │
│ Password: [••••••••••••••] or Key: [Browse...]                  │
│                                                                 │
│ [ ] Test SSH Connectivity                   [Deploy to Cluster] │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
### 2.2 SSH-Based Remote Installation

**For Each Target Node:**
1. **SSH Connectivity Validation**
   - Test SSH access with provided credentials
   - Validate sudo privileges
   - Check system compatibility

2. **Remote BZZZ Installation**
   ```bash
   # Executed via SSH on each target node
   ssh admin_user@192.168.1.101 "curl -fsSL https://chorus.services/install.sh | BZZZ_ROLE=worker sh"
   ```

3. **Configuration Transfer**
   - Copy master public key to node
   - Install BZZZ binaries and dependencies
   - Configure systemd services
   - Set initial network parameters (bootstrap node address)

4. **Service Initialization**
   - Start BZZZ service in cluster-join mode
   - Configure P2P network parameters
   - Set announce channel subscription
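A minimal driver for the per-node loop above might look like the sketch below. It simply shells out to `ssh` with the install command from step 2; the node list, username, and error handling are illustrative assumptions rather than actual installer code.

```python
import subprocess

INSTALL_CMD = "curl -fsSL https://chorus.services/install.sh | BZZZ_ROLE=worker sh"

def deploy_to_cluster(node_ips, user="admin_user", port=22):
    """Run the BZZZ worker install on each target node over SSH; return failures."""
    failed = []
    for ip in node_ips:
        result = subprocess.run(
            ["ssh", "-p", str(port), f"{user}@{ip}", INSTALL_CMD],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            failed.append((ip, result.stderr.strip()))
    return failed

# Example: the IPs entered in the web UI in section 2.1
errors = deploy_to_cluster(["192.168.1.101", "192.168.1.102", "192.168.1.103", "192.168.1.104"])
for ip, err in errors:
    print(f"deployment failed on {ip}: {err}")
```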
## Phase 3: P2P Network Formation & Capability Discovery

### 3.1 P2P Network Bootstrap

**Network Formation Process:**
1. **Bootstrap Node Configuration**
   - First installed node becomes the bootstrap node
   - Listens for P2P connections on the configured port
   - Maintains the peer discovery registry

2. **Peer Discovery via Announce Channel**
   ```yaml
   announce_message:
     node_id: "node-192168001101-20250810"
     capabilities:
       - gpu_count: 4
       - gpu_type: "nvidia"
       - gpu_memory: [24576, 24576, 24576, 24576]  # MB per GPU
       - cpu_cores: 32
       - memory_gb: 128
       - storage_gb: 2048
       - ollama_type: "parallama"
     network_info:
       ip_address: "192.168.1.101"
       p2p_port: 8081
     services:
       - bzzz_go: 8080
       - mcp_server: 3000
     joined_at: "2025-08-10T16:22:20Z"
   ```

3. **Capability-Based Network Organization**
   - Nodes self-organize based on announced capabilities
   - GPU-enabled nodes form AI processing pools
   - Storage nodes identified for DHT participation
   - Network topology dynamically optimized
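As a rough illustration of how announce messages could drive this self-organization, the sketch below groups peers into GPU and storage pools from their announced capabilities. The field names follow the `announce_message` example above; the storage threshold for DHT participation is an assumption.

```python
def organize_peers(announcements):
    """Group peers into capability pools based on their announce messages."""
    gpu_pool, storage_pool = [], []
    for msg in announcements:
        # capabilities arrive as a list of single-key maps, as in the YAML above
        caps = {k: v for item in msg["capabilities"] for k, v in item.items()}
        if caps.get("gpu_count", 0) > 0:
            gpu_pool.append(msg["node_id"])
        if caps.get("storage_gb", 0) >= 500:  # assumed threshold for DHT participation
            storage_pool.append(msg["node_id"])
    return {"ai_processing": gpu_pool, "dht_storage": storage_pool}

# Example with one announce message shaped like the example above
announcements = [{
    "node_id": "node-192168001101-20250810",
    "capabilities": [
        {"gpu_count": 4}, {"gpu_type": "nvidia"},
        {"cpu_cores": 32}, {"memory_gb": 128}, {"storage_gb": 2048},
    ],
}]
print(organize_peers(announcements))
```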
### 3.2 Shamir Share Distribution

**Once P2P Network Established:**
1. Generate N shares of the admin private key (N = peer count)
2. Distribute one share to each peer via encrypted P2P channel
3. Each peer stores its share encrypted with its node-specific key
4. Verify share distribution and reconstruction capability
## Phase 4: Leader Election & SLURP Responsibilities

### 4.1 Leader Election Algorithm

**Election Criteria (Weighted Scoring):**
- **Network Stability**: Uptime and connection quality (30%)
- **Hardware Resources**: CPU, memory, storage capacity (25%)
- **Network Position**: Connectivity to other peers (20%)
- **Geographic Distribution**: Network latency optimization (15%)
- **Load Capacity**: Current resource utilization (10%)
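A node's fitness score is then a weighted sum of these criteria. The sketch below shows the arithmetic with the weights listed above; the per-criterion sub-scores (0-100) would come from each node's own measurements and are hypothetical here.

```python
WEIGHTS = {
    "network_stability": 0.30,        # uptime and connection quality
    "hardware_resources": 0.25,       # CPU, memory, storage capacity
    "network_position": 0.20,         # connectivity to other peers
    "geographic_distribution": 0.15,  # network latency optimization
    "load_capacity": 0.10,            # current resource utilization
}

def fitness_score(subscores: dict) -> float:
    """Weighted sum of 0-100 sub-scores; higher means a better leader candidate."""
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)

# Hypothetical sub-scores for one candidate node
candidate = {
    "network_stability": 96,
    "hardware_resources": 95,
    "network_position": 89,
    "geographic_distribution": 92,
    "load_capacity": 93,
}
print(f"fitness: {fitness_score(candidate):.1f}")
```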
**Election Process:**
1. Each node calculates its fitness score
2. Nodes broadcast their scores and capabilities
3. Consensus algorithm determines the leader (highest score + network agreement)
4. Leader election occurs every 24 hours or on leader failure
5. **Leader ≠ Admin**: Leader handles operations, Admin handles oversight
### 4.2 SLURP Responsibilities (Leader Node)

**SLURP = Service Layer Unified Resource Protocol**

**Leader Responsibilities:**
- **Resource Orchestration**: Task distribution across the cluster
- **Model Distribution**: Coordinate Ollama model replication
- **Load Balancing**: Distribute AI workloads optimally
- **Network Health**: Monitor peer connectivity and performance
- **DHT Coordination**: Manage distributed storage operations

**Leader Election Display:**
```
🏆 Network Leader Election Results
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Current Leader: node-192168001103-20250810
├─ Hardware Score: 95/100 (4x RTX 4090, 128GB RAM)
├─ Network Score: 89/100 (Central position, low latency)
├─ Stability Score: 96/100 (99.8% uptime)
└─ Overall Score: 93.2/100

Network Topology:
├─ Total Nodes: 5
├─ GPU Nodes: 4 (Parallama enabled)
├─ Storage Nodes: 5 (DHT participants)
├─ Available VRAM: 384GB total
└─ Network Latency: avg 2.3ms

Next Election: 2025-08-11 16:22:20 UTC
```
## Phase 5: Business Configuration & DHT Storage

### 5.1 DHT Bootstrap & Business Data Storage

**Only After Leader Election:**
- DHT network becomes available for business data storage
- Configuration data migrated from local storage to DHT
- Business decisions stored using UCXL addresses

**UCXL Address Format:**
```
ucxl://bzzz.cluster.config/network_topology
ucxl://bzzz.cluster.config/resource_allocation
ucxl://bzzz.cluster.config/ai_models
ucxl://bzzz.cluster.config/user_projects
```
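As an illustration of how business configuration could be keyed by these addresses, the sketch below composes a `ucxl://` address and derives a content-addressed DHT key from it. The `dht.put` call is a hypothetical placeholder, not the actual BZZZ API, and the payload fields are likewise assumptions.

```python
import hashlib
import json

def ucxl_address(namespace: str, key: str) -> str:
    """Compose a UCXL address in the form shown above."""
    return f"ucxl://{namespace}/{key}"

def dht_key(address: str) -> str:
    """Hypothetical mapping from a UCXL address to a content-addressed DHT key."""
    return hashlib.sha256(address.encode()).hexdigest()

addr = ucxl_address("bzzz.cluster.config", "resource_allocation")
payload = json.dumps({"gpu_pool_max_jobs": 8, "reserve_vram_gb": 4})  # assumed policy fields

# dht.put(dht_key(addr), payload)  # placeholder for the real DHT client call
print(addr, "->", dht_key(addr)[:16], payload)
```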
### 5.2 Business Configuration Categories

**Stored in DHT (Post-Bootstrap):**
- Network topology and node roles
- Resource allocation policies
- AI model distribution strategies
- User project configurations
- Cost management settings
- Monitoring and alerting rules

**Kept Locally (Security/Bootstrap):**
- Admin user's public key
- Master public key for validation
- Initial IP candidate list
- Domain/DNS configuration
- Bootstrap node addresses
## Phase 6: Model Distribution & Synchronization

### 6.1 P2P Model Distribution Strategy

**Model Distribution Logic:**
```python
def distribute_model(model_info):
    model_size = model_info.size_gb
    model_vram_req = model_info.vram_requirement_gb

    # Find eligible nodes
    eligible_nodes = []
    for node in cluster_nodes:
        if node.available_vram_gb >= model_vram_req:
            eligible_nodes.append(node)

    # Distribute to all eligible nodes
    for node in eligible_nodes:
        if not node.has_model(model_info.id):
            leader.schedule_model_transfer(
                source=primary_model_node,
                target=node,
                model=model_info,
            )
```
**Distribution Priorities:**
1. **GPU Memory Threshold**: Model must fit in available VRAM
2. **Redundancy**: Minimum 3 copies across different nodes
3. **Geographic Distribution**: Spread across network topology
4. **Load Balancing**: Distribute based on current node utilization
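Taken together, priorities 2 and 4 suggest a placement rule along the lines of the sketch below: rank the VRAM-eligible nodes by current utilization and keep at least three replicas. The `Node` shape and the utilization field are assumptions for illustration.

```python
from collections import namedtuple

MIN_REPLICAS = 3  # priority 2: minimum redundancy

def choose_replica_targets(eligible_nodes, desired=MIN_REPLICAS):
    """Pick the least-loaded eligible nodes to hold model replicas (priority 4)."""
    ranked = sorted(eligible_nodes, key=lambda n: n.current_utilization)
    return ranked[:max(desired, MIN_REPLICAS)]

# Example with hypothetical utilization figures
Node = namedtuple("Node", "name current_utilization")
targets = choose_replica_targets(
    [Node("gpu-01", 0.7), Node("gpu-02", 0.2), Node("gpu-03", 0.5), Node("gpu-04", 0.9)]
)
print([n.name for n in targets])  # the three least-loaded nodes: gpu-02, gpu-03, gpu-01
```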
### 6.2 Model Version Synchronization (TODO)

**Current Status**: Implementation pending

**Requirements:**
- Track model versions across all nodes
- Coordinate updates when new model versions are released
- Handle rollback scenarios for failed updates
- Maintain consistency during network partitions

**TODO Items to Address:**
- [ ] Design version tracking mechanism
- [ ] Implement distributed consensus for updates
- [ ] Create rollback/recovery procedures
- [ ] Handle split-brain scenarios during updates
## Phase 7: Role-Based Key Generation

### 7.1 Dynamic Role Key Creation

**Using Admin Private Key (Post-Bootstrap):**
1. **User Defines Custom Roles** via web UI:
   ```yaml
   roles:
     - name: "data_scientist"
       permissions: ["model_access", "job_submit", "resource_view"]
     - name: "ml_engineer"
       permissions: ["model_deploy", "cluster_config", "monitoring"]
     - name: "project_manager"
       permissions: ["user_management", "cost_monitoring", "reporting"]
   ```
2. **Admin Key Reconstruction** (a minimal signing sketch follows this list):
   - Collect K shares from network peers
   - Reconstruct admin private key temporarily in memory
   - Generate role-specific key pairs
   - Sign role public keys with admin private key
   - Clear admin private key from memory

3. **Role Key Distribution**:
   - Store role key pairs in DHT with UCXL addresses
   - Distribute to authorized users via secure channels
   - Revocation handled through DHT updates
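The reconstruction-and-signing step in item 2 could look roughly like the sketch below, using the widely available Python `cryptography` package. The reconstructed admin key is represented here by a freshly generated key; in the real flow it would be rebuilt from K Shamir shares, used only in memory, and then discarded.

```python
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Stand-in for the admin key reconstructed from K shares (illustration only)
admin_key = rsa.generate_private_key(public_exponent=65537, key_size=4096)

# Generate a role-specific key pair, e.g. for "data_scientist"
role_key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
role_pub_pem = role_key.public_key().public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
)

# Sign the role public key so peers can verify it against the admin public key
signature = admin_key.sign(
    role_pub_pem,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# Any peer holding the admin public key can check the role key's provenance
admin_key.public_key().verify(
    signature,
    role_pub_pem,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

del admin_key  # in the real flow, the reconstructed key is cleared from memory here
```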
## Installation Flow Summary

```
Phase 1: Bootstrap Setup
├─ curl install.sh → Web UI → Master Key Display (ONCE)
├─ Generate admin keys → Shamir split preparation
└─ Manual IP entry for cluster nodes

Phase 2: SSH Cluster Deployment
├─ SSH connectivity validation
├─ Remote BZZZ installation on all nodes
└─ Service startup with P2P parameters

Phase 3: P2P Network Formation
├─ Capability announcement via announce channel
├─ Peer discovery and network topology
└─ Shamir share distribution

Phase 4: Leader Election
├─ Fitness score calculation and consensus
├─ Leader takes SLURP responsibilities
└─ Network operational status achieved

Phase 5: DHT & Business Storage
├─ DHT network becomes available
├─ Business configuration migrated to UCXL addresses
└─ Local storage limited to security essentials

Phase 6: Model Distribution
├─ P2P model replication based on VRAM capacity
├─ Version synchronization (TODO)
└─ Load balancing and redundancy

Phase 7: Role Management
├─ Dynamic role definition via web UI
├─ Admin key reconstruction for signing
└─ Role-based access control deployment
```
## Security Considerations

### Data Storage Security
- **Sensitive Data**: Never stored in DHT (keys, passwords)
- **Business Data**: Encrypted before DHT storage
- **Network Communication**: All P2P traffic encrypted
- **Key Recovery**: Master key required for emergency access

### Network Security
- **mTLS**: All inter-node communication secured
- **Certificate Rotation**: Automated cert renewal
- **Access Control**: Role-based permissions enforced
- **Audit Logging**: All privileged operations logged
## Monitoring & Observability

### Network Health Metrics
- P2P connection quality and latency
- DHT data consistency and replication
- Model distribution status and synchronization
- Leader election frequency and stability

### Business Metrics
- Resource utilization across the cluster
- Cost tracking and budget adherence
- AI workload distribution and performance
- User activity and access patterns
## Failure Recovery Procedures

### Leader Failure
1. Automatic re-election triggered
2. New leader assumes SLURP responsibilities
3. DHT operations continue uninterrupted
4. Model distribution resumes under new leader

### Network Partition
1. Majority partition continues operations
2. Minority partitions enter read-only mode
3. Automatic healing when connectivity is restored
4. Conflict resolution via timestamp ordering
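The majority/minority decision reduces to a simple quorum check, sketched below under the assumption that each node knows the configured cluster size and how many peers it can currently reach.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """True if this node plus its reachable peers form a strict majority."""
    return (reachable_peers + 1) > cluster_size // 2

# In a 5-node cluster: a 3-node partition keeps writing, a 2-node partition goes read-only
assert has_quorum(reachable_peers=2, cluster_size=5) is True
assert has_quorum(reachable_peers=1, cluster_size=5) is False
```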
### Admin Key Recovery
1. Master private key required for recovery
2. Generate new admin key pair if needed
3. Re-split and redistribute Shamir shares
4. Update role signatures with new admin key

This plan provides a comprehensive, security-focused approach to BZZZ cluster deployment with clear separation of concerns and robust failure recovery mechanisms.