# BZZZ Installation & Deployment Plan

## Architecture Overview

BZZZ uses a distributed installation strategy that progresses through distinct phases: initial setup, SSH-based cluster deployment, P2P network formation, leader election, and finally DHT-based business configuration storage.

## Key Principles

1. **Security-First Design**: Multi-layered key management with Shamir's Secret Sharing
2. **Distributed Authority**: Clear separation between Admin (human oversight) and Leader (network operations)
3. **P2P Model Distribution**: Bandwidth-efficient model replication across the cluster
4. **DHT Business Storage**: Configuration data stored in a distributed hash table post-bootstrap
5. **Capability-Based Discovery**: Nodes announce capabilities and auto-organize

## Phase 1: Initial Node Setup & Key Generation

### 1.1 Bootstrap Machine Installation

```bash
curl -fsSL https://chorus.services/install.sh | sh
```

**Actions Performed:**

- System detection and validation
- BZZZ binary installation
- Docker and dependency setup
- Launch configuration web UI at `http://[node-ip]:8080/setup`

### 1.2 Master Key Generation & Display

**Key Generation Process:**

1. **Master Key Pair Generation**
   - Generate an RSA 4096-bit master key pair
   - **CRITICAL**: Display the private key ONCE in read-only format
   - User must securely store the master private key (it is not stored on the system)
   - Master public key stored locally for validation

2. **Admin Role Key Generation**
   - Generate an admin role RSA 4096-bit key pair
   - Admin public key stored locally
   - **Admin private key split using Shamir's Secret Sharing**

3. **Shamir's Secret Sharing Implementation** (a minimal sketch follows this list)
   - Split the admin private key into N shares (where N = cluster size)
   - Require K shares for reconstruction (K = ceiling(N/2) + 1)
   - Distribute shares to BZZZ peers once the network is established
   - Ensures no single node failure compromises admin access

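To make the threshold arithmetic concrete, here is a minimal, self-contained sketch of a Shamir split over a prime field. It is illustrative only: a real deployment would chunk the RSA private-key bytes or use a vetted secret-sharing library, and the prime, example secret, and helper names are assumptions of this sketch.

```python
import secrets

# Toy prime field; must be larger than the secret being split.
PRIME = 2**521 - 1

def split_secret(secret: int, n_shares: int, threshold: int):
    """Evaluate a random degree-(threshold-1) polynomial at x = 1..n_shares."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_shares + 1):
        y = sum(c * pow(x, power, PRIME) for power, c in enumerate(coeffs)) % PRIME
        shares.append((x, y))
    return shares

def reconstruct_secret(shares):
    """Lagrange-interpolate the polynomial at x = 0 using any `threshold` shares."""
    secret = 0
    for i, (x_i, y_i) in enumerate(shares):
        num, den = 1, 1
        for j, (x_j, _) in enumerate(shares):
            if i != j:
                num = (num * -x_j) % PRIME
                den = (den * (x_i - x_j)) % PRIME
        secret = (secret + y_i * num * pow(den, -1, PRIME)) % PRIME
    return secret

# K = ceiling(N/2) + 1, as specified above, for a 5-node cluster
n = 5
k = (n + 1) // 2 + 1
shares = split_secret(123456789, n, k)
assert reconstruct_secret(shares[:k]) == 123456789
```

Any K of the N shares reconstruct the key; fewer than K reveal nothing, which is what allows shares to sit on individual peers without any single node compromising admin access.
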
### 1.3 Web UI Security Display

```
┌─────────────────────────────────────────────────────────────────┐
│ 🔐 CRITICAL: Master Private Key - DISPLAY ONCE ONLY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ -----BEGIN RSA PRIVATE KEY----- │
│ [MASTER_PRIVATE_KEY_CONTENT] │
│ -----END RSA PRIVATE KEY----- │
│ │
│ ⚠️ SECURITY NOTICE: │
│ • This key will NEVER be displayed again │
│ • Store in secure password manager immediately │
│ • Required for emergency cluster recovery │
│ • Loss of this key may require complete reinstallation │
│ │
│ [ ] I have securely stored the master private key │
│ │
└─────────────────────────────────────────────────────────────────┘
```

## Phase 2: Cluster Node Discovery & SSH Deployment

### 2.1 Manual IP Entry Interface

**Web UI Node Discovery:**

```
┌─────────────────────────────────────────────────────────────────┐
│ 🌐 Cluster Node Discovery │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Enter IP addresses for cluster nodes (one per line): │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ 192.168.1.101 │ │
│ │ 192.168.1.102 │ │
│ │ 192.168.1.103 │ │
│ │ 192.168.1.104 │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ SSH Configuration: │
│ Username: [admin_user ] Port: [22 ] │
│ Password: [••••••••••••••] or Key: [Browse...] │
│ │
│ [ ] Test SSH Connectivity [Deploy to Cluster] │
│ │
└─────────────────────────────────────────────────────────────────┘
```

### 2.2 SSH-Based Remote Installation

**For Each Target Node:**

1. **SSH Connectivity Validation** (see the deployment sketch after this list)
   - Test SSH access with the provided credentials
   - Validate sudo privileges
   - Check system compatibility

2. **Remote BZZZ Installation**

   ```bash
   # Executed via SSH on each target node
   ssh admin_user@192.168.1.101 "curl -fsSL https://chorus.services/install.sh | BZZZ_ROLE=worker sh"
   ```

3. **Configuration Transfer**
   - Copy the master public key to the node
   - Install BZZZ binaries and dependencies
   - Configure systemd services
   - Set initial network parameters (bootstrap node address)

4. **Service Initialization**
   - Start the BZZZ service in cluster-join mode
   - Configure P2P network parameters
   - Set announce channel subscription

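A minimal orchestration sketch of these steps, assuming key-based SSH access to each node already works. The installer URL and `BZZZ_ROLE=worker` come from the command above; the `deploy_node` helper, the `sudo -n true` connectivity probe, and the host list are illustrative.

```python
import subprocess

INSTALL_CMD = 'curl -fsSL https://chorus.services/install.sh | BZZZ_ROLE=worker sh'

def deploy_node(host: str, user: str = "admin_user", port: int = 22) -> bool:
    """Validate SSH/sudo access, then run the BZZZ installer on the target node."""
    target = f"{user}@{host}"
    # Step 1: connectivity and sudo check (the "Test SSH Connectivity" action)
    check = subprocess.run(["ssh", "-p", str(port), target, "sudo -n true"],
                           capture_output=True, text=True)
    if check.returncode != 0:
        return False
    # Step 2: remote installation
    install = subprocess.run(["ssh", "-p", str(port), target, INSTALL_CMD],
                             capture_output=True, text=True)
    return install.returncode == 0

# Deploy to the nodes entered in the web UI
for ip in ["192.168.1.101", "192.168.1.102", "192.168.1.103", "192.168.1.104"]:
    print(ip, "ok" if deploy_node(ip) else "failed")
```
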
## Phase 3: P2P Network Formation & Capability Discovery

### 3.1 P2P Network Bootstrap

**Network Formation Process:**

1. **Bootstrap Node Configuration**
   - The first installed node becomes the bootstrap node
   - Listens for P2P connections on the configured port
   - Maintains the peer discovery registry

2. **Peer Discovery via Announce Channel**

   ```yaml
   announce_message:
     node_id: "node-192168001101-20250810"
     capabilities:
       gpu_count: 4
       gpu_type: "nvidia"
       gpu_memory: [24576, 24576, 24576, 24576]  # MB per GPU
       cpu_cores: 32
       memory_gb: 128
       storage_gb: 2048
       ollama_type: "parallama"
     network_info:
       ip_address: "192.168.1.101"
       p2p_port: 8081
     services:
       bzzz_go: 8080
       mcp_server: 3000
     joined_at: "2025-08-10T16:22:20Z"
   ```

3. **Capability-Based Network Organization** (a grouping sketch follows this list)
   - Nodes self-organize based on announced capabilities
   - GPU-enabled nodes form AI processing pools
   - Storage nodes are identified for DHT participation
   - Network topology is dynamically optimized

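A sketch of how announced capabilities might drive pool formation. The field names mirror the announce message above; the storage threshold and pool names are assumptions of this sketch.

```python
def organize_pools(announcements):
    """Group peers into capability pools from their announce messages."""
    ai_pool, dht_pool = [], []
    for msg in announcements:
        caps = msg["capabilities"]
        if caps.get("gpu_count", 0) > 0:
            ai_pool.append(msg["node_id"])       # AI processing pool
        if caps.get("storage_gb", 0) >= 512:     # illustrative storage threshold
            dht_pool.append(msg["node_id"])      # DHT participants
    return {"ai": ai_pool, "dht": dht_pool}
```
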
### 3.2 Shamir Share Distribution

**Once the P2P Network Is Established:**

1. Generate N shares of the admin private key (N = peer count)
2. Distribute one share to each peer via an encrypted P2P channel
3. Each peer stores its share encrypted with a node-specific key
4. Verify share distribution and reconstruction capability

## Phase 4: Leader Election & SLURP Responsibilities

### 4.1 Leader Election Algorithm

**Election Criteria (Weighted Scoring):**

- **Network Stability**: Uptime and connection quality (30%)
- **Hardware Resources**: CPU, memory, and storage capacity (25%)
- **Network Position**: Connectivity to other peers (20%)
- **Geographic Distribution**: Network latency optimization (15%)
- **Load Capacity**: Current resource utilization (10%)

**Election Process:**

1. Each node calculates its fitness score (see the scoring sketch after this list)
2. Nodes broadcast their scores and capabilities
3. A consensus algorithm determines the leader (highest score plus network agreement)
4. Leader election occurs every 24 hours or on leader failure
5. **Leader ≠ Admin**: the Leader handles operations, the Admin handles oversight

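The weighted score follows directly from the percentages above. A minimal sketch; the per-criterion inputs (each on a 0-100 scale) and the example values are illustrative, not actual node metrics.

```python
# Weights taken from the election criteria above
WEIGHTS = {
    "network_stability": 0.30,
    "hardware_resources": 0.25,
    "network_position": 0.20,
    "geo_distribution": 0.15,
    "load_capacity": 0.10,
}

def fitness_score(scores: dict) -> float:
    """Weighted sum of per-criterion scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Illustrative values only; metric collection is node-specific
print(round(fitness_score({
    "network_stability": 96,
    "hardware_resources": 95,
    "network_position": 89,
    "geo_distribution": 92,
    "load_capacity": 90,
}), 1))
```
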
### 4.2 SLURP Responsibilities (Leader Node)

**SLURP = Service Layer Unified Resource Protocol**

**Leader Responsibilities:**

- **Resource Orchestration**: Task distribution across the cluster
- **Model Distribution**: Coordinate ollama model replication
- **Load Balancing**: Distribute AI workloads optimally
- **Network Health**: Monitor peer connectivity and performance
- **DHT Coordination**: Manage distributed storage operations

**Leader Election Display:**

```
🏆 Network Leader Election Results
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Current Leader: node-192168001103-20250810
├─ Hardware Score: 95/100 (4x RTX 4090, 128GB RAM)
├─ Network Score: 89/100 (Central position, low latency)
├─ Stability Score: 96/100 (99.8% uptime)
└─ Overall Score: 93.2/100

Network Topology:
├─ Total Nodes: 5
├─ GPU Nodes: 4 (Parallama enabled)
├─ Storage Nodes: 5 (DHT participants)
├─ Available VRAM: 384GB total
└─ Network Latency: avg 2.3ms

Next Election: 2025-08-11 16:22:20 UTC
```

## Phase 5: Business Configuration & DHT Storage

### 5.1 DHT Bootstrap & Business Data Storage

**Only After Leader Election:**

- The DHT network becomes available for business data storage
- Configuration data is migrated from local storage to the DHT
- Business decisions are stored using UCXL addresses (see the storage sketch below)

**UCXL Address Format:**

```
ucxl://bzzz.cluster.config/network_topology
ucxl://bzzz.cluster.config/resource_allocation
ucxl://bzzz.cluster.config/ai_models
ucxl://bzzz.cluster.config/user_projects
```

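A sketch of storing a configuration record under a UCXL address once the DHT is live. `dht.put` stands in for whatever interface the BZZZ DHT client actually exposes, and the payload would be encrypted before storage per the security considerations below.

```python
import json

def ucxl_address(namespace: str, key: str) -> str:
    """Compose an address in the ucxl:// form shown above."""
    return f"ucxl://{namespace}/{key}"

def store_business_config(dht, key: str, value: dict) -> str:
    """Serialize a business-configuration record and store it under its UCXL address.

    `dht` is a placeholder for the real DHT client; the value should be
    encrypted before being written.
    """
    address = ucxl_address("bzzz.cluster.config", key)
    dht.put(address, json.dumps(value).encode("utf-8"))
    return address

# Hypothetical usage once the DHT is available post-election:
# store_business_config(dht, "network_topology", {"total_nodes": 5, "gpu_nodes": 4})
```
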
### 5.2 Business Configuration Categories

**Stored in the DHT (Post-Bootstrap):**

- Network topology and node roles
- Resource allocation policies
- AI model distribution strategies
- User project configurations
- Cost management settings
- Monitoring and alerting rules

**Kept Locally (Security/Bootstrap):**

- Admin user's public key
- Master public key for validation
- Initial IP candidate list
- Domain/DNS configuration
- Bootstrap node addresses

## Phase 6: Model Distribution & Synchronization

### 6.1 P2P Model Distribution Strategy

**Model Distribution Logic:**

```python
def distribute_model(model_info, cluster_nodes, primary_model_node, leader):
    """Replicate a model to every node with enough free VRAM to host it."""
    model_vram_req = model_info.vram_requirement_gb

    # Find nodes whose free VRAM can hold the model
    eligible_nodes = [
        node for node in cluster_nodes
        if node.available_vram_gb >= model_vram_req
    ]

    # Schedule a transfer to every eligible node that does not yet hold the model
    for node in eligible_nodes:
        if not node.has_model(model_info.id):
            leader.schedule_model_transfer(
                source=primary_model_node,
                target=node,
                model=model_info,
            )
```

**Distribution Priorities:**

1. **GPU Memory Threshold**: The model must fit in available VRAM
2. **Redundancy**: Minimum 3 copies across different nodes (see the sketch after this list)
3. **Geographic Distribution**: Spread across the network topology
4. **Load Balancing**: Distribute based on current node utilization

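A small helper, sketched against the same node interface as `distribute_model` above, that checks the redundancy priority; the function name and 3-copy default are taken from the list but otherwise illustrative.

```python
def replication_deficit(model_id, cluster_nodes, min_copies=3):
    """Return how many additional replicas are needed to satisfy the redundancy priority."""
    copies = sum(1 for node in cluster_nodes if node.has_model(model_id))
    return max(0, min_copies - copies)
```
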
### 6.2 Model Version Synchronization (TODO)

**Current Status**: Implementation pending

**Requirements:**

- Track model versions across all nodes
- Coordinate updates when new model versions are released
- Handle rollback scenarios for failed updates
- Maintain consistency during network partitions

**TODO Items to Address:**

- [ ] Design version tracking mechanism
- [ ] Implement distributed consensus for updates
- [ ] Create rollback/recovery procedures
- [ ] Handle split-brain scenarios during updates

## Phase 7: Role-Based Key Generation

### 7.1 Dynamic Role Key Creation

**Using the Admin Private Key (Post-Bootstrap):**

1. **User Defines Custom Roles** via the web UI:

   ```yaml
   roles:
     - name: "data_scientist"
       permissions: ["model_access", "job_submit", "resource_view"]
     - name: "ml_engineer"
       permissions: ["model_deploy", "cluster_config", "monitoring"]
     - name: "project_manager"
       permissions: ["user_management", "cost_monitoring", "reporting"]
   ```

2. **Admin Key Reconstruction** (see the signing sketch after this list):
   - Collect K shares from network peers
   - Reconstruct the admin private key temporarily in memory
   - Generate role-specific key pairs
   - Sign role public keys with the admin private key
   - Clear the admin private key from memory

3. **Role Key Distribution**:
   - Store role key pairs in the DHT under UCXL addresses
   - Distribute to authorized users via secure channels
   - Handle revocation through DHT updates

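A sketch of the signing step using the `cryptography` package. The admin key is assumed to have been reconstructed from K Shamir shares (as in the Phase 1 sketch) and is released after use; the helper name and return shape are illustrative.

```python
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa, padding

def issue_role_key(admin_private_key, role_name: str):
    """Create a role key pair and sign its public key with the reconstructed admin key."""
    role_key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
    role_public_pem = role_key.public_key().public_bytes(
        serialization.Encoding.PEM,
        serialization.PublicFormat.SubjectPublicKeyInfo,
    )
    signature = admin_private_key.sign(
        role_public_pem,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )
    return {"role": role_name, "private_key": role_key,
            "public_key_pem": role_public_pem, "admin_signature": signature}

# After signing, drop all references to the admin key so it can be garbage-collected:
# del admin_private_key
```
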
## Installation Flow Summary

```
Phase 1: Bootstrap Setup
├─ curl install.sh → Web UI → Master Key Display (ONCE)
├─ Generate admin keys → Shamir split preparation
└─ Manual IP entry for cluster nodes

Phase 2: SSH Cluster Deployment
├─ SSH connectivity validation
├─ Remote BZZZ installation on all nodes
└─ Service startup with P2P parameters

Phase 3: P2P Network Formation
├─ Capability announcement via announce channel
├─ Peer discovery and network topology
└─ Shamir share distribution

Phase 4: Leader Election
├─ Fitness score calculation and consensus
├─ Leader takes SLURP responsibilities
└─ Network operational status achieved

Phase 5: DHT & Business Storage
├─ DHT network becomes available
├─ Business configuration migrated to UCXL addresses
└─ Local storage limited to security essentials

Phase 6: Model Distribution
├─ P2P model replication based on VRAM capacity
├─ Version synchronization (TODO)
└─ Load balancing and redundancy

Phase 7: Role Management
├─ Dynamic role definition via web UI
├─ Admin key reconstruction for signing
└─ Role-based access control deployment
```

## Security Considerations

### Data Storage Security

- **Sensitive Data**: Never stored in the DHT (keys, passwords)
- **Business Data**: Encrypted before DHT storage
- **Network Communication**: All P2P traffic encrypted
- **Key Recovery**: Master key required for emergency access

### Network Security

- **mTLS**: All inter-node communication secured
- **Certificate Rotation**: Automated cert renewal
- **Access Control**: Role-based permissions enforced
- **Audit Logging**: All privileged operations logged

## Monitoring & Observability

### Network Health Metrics

- P2P connection quality and latency
- DHT data consistency and replication
- Model distribution status and synchronization
- Leader election frequency and stability

### Business Metrics

- Resource utilization across the cluster
- Cost tracking and budget adherence
- AI workload distribution and performance
- User activity and access patterns

## Failure Recovery Procedures

### Leader Failure

1. Automatic re-election is triggered
2. The new leader assumes SLURP responsibilities
3. DHT operations continue uninterrupted
4. Model distribution resumes under the new leader

### Network Partition

1. The majority partition continues operations
2. Minority partitions enter read-only mode
3. Automatic healing occurs when connectivity is restored
4. Conflicts are resolved via timestamp ordering

### Admin Key Recovery

1. The master private key is required for recovery
2. Generate a new admin key pair if needed
3. Re-split and redistribute Shamir shares
4. Update role signatures with the new admin key

This plan provides a comprehensive, security-focused approach to BZZZ cluster deployment with clear separation of concerns and robust failure recovery mechanisms.