# BZZZ Installation & Deployment Plan

## Architecture Overview

BZZZ uses a distributed installation strategy that progresses through distinct phases: initial setup, SSH-based cluster deployment, P2P network formation, leader election, and finally DHT-based business configuration storage.

## Key Principles

1. **Security-First Design**: Multi-layered key management with Shamir's Secret Sharing
2. **Distributed Authority**: Clear separation between Admin (human oversight) and Leader (network operations)
3. **P2P Model Distribution**: Bandwidth-efficient model replication across the cluster
4. **DHT Business Storage**: Configuration data stored in a distributed hash table post-bootstrap
5. **Capability-Based Discovery**: Nodes announce capabilities and auto-organize

## Phase 1: Initial Node Setup & Key Generation

### 1.1 Bootstrap Machine Installation

```bash
curl -fsSL https://chorus.services/install.sh | sh
```

**Actions Performed:**

- System detection and validation
- BZZZ binary installation
- Docker and dependency setup
- Launch configuration web UI at `http://[node-ip]:8080/setup`

### 1.2 Master Key Generation & Display

**Key Generation Process:**

1. **Master Key Pair Generation**
   - Generate an RSA 4096-bit master key pair
   - **CRITICAL**: Display the private key ONCE in read-only format
   - User must securely store the master private key (it is not stored on the system)
   - Master public key stored locally for validation

2. **Admin Role Key Generation**
   - Generate an admin role RSA 4096-bit key pair
   - Admin public key stored locally
   - **Admin private key split using Shamir's Secret Sharing**

3. **Shamir's Secret Sharing Implementation** (a minimal sketch follows this list)
   - Split the admin private key into N shares (where N = cluster size)
   - Require K shares for reconstruction (K = ceiling(N/2) + 1)
   - Distribute shares to BZZZ peers once the network is established
   - Ensures no single node failure compromises admin access

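To make the threshold arithmetic concrete, here is a minimal, self-contained sketch of a Shamir split over a prime field. It is illustrative only: a real deployment would chunk the RSA private-key bytes or use a vetted secret-sharing library, and the prime, example secret, and helper names are assumptions of this sketch.

```python
import secrets

# Toy prime field; must be larger than the secret being split.
PRIME = 2**521 - 1

def split_secret(secret: int, n_shares: int, threshold: int):
    """Evaluate a random degree-(threshold-1) polynomial at x = 1..n_shares."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, n_shares + 1):
        y = sum(c * pow(x, power, PRIME) for power, c in enumerate(coeffs)) % PRIME
        shares.append((x, y))
    return shares

def reconstruct_secret(shares):
    """Lagrange-interpolate the polynomial at x = 0 using any `threshold` shares."""
    secret = 0
    for i, (x_i, y_i) in enumerate(shares):
        num, den = 1, 1
        for j, (x_j, _) in enumerate(shares):
            if i != j:
                num = (num * -x_j) % PRIME
                den = (den * (x_i - x_j)) % PRIME
        secret = (secret + y_i * num * pow(den, -1, PRIME)) % PRIME
    return secret

# K = ceiling(N/2) + 1, as specified above, for a 5-node cluster
n = 5
k = (n + 1) // 2 + 1
shares = split_secret(123456789, n, k)
assert reconstruct_secret(shares[:k]) == 123456789
```

Any K of the N shares reconstruct the key; fewer than K reveal nothing, which is what allows shares to sit on individual peers without any single node compromising admin access.
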
### 1.3 Web UI Security Display

```
┌─────────────────────────────────────────────────────────────────┐
│ 🔐 CRITICAL: Master Private Key - DISPLAY ONCE ONLY │
├─────────────────────────────────────────────────────────────────┤
│ │
│ -----BEGIN RSA PRIVATE KEY----- │
│ [MASTER_PRIVATE_KEY_CONTENT] │
│ -----END RSA PRIVATE KEY----- │
│ │
│ ⚠️ SECURITY NOTICE: │
│ • This key will NEVER be displayed again │
│ • Store in secure password manager immediately │
│ • Required for emergency cluster recovery │
│ • Loss of this key may require complete reinstallation │
│ │
│ [ ] I have securely stored the master private key │
│ │
└─────────────────────────────────────────────────────────────────┘
```

## Phase 2: Cluster Node Discovery & SSH Deployment

### 2.1 Manual IP Entry Interface

**Web UI Node Discovery:**

```
┌─────────────────────────────────────────────────────────────────┐
│ 🌐 Cluster Node Discovery │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Enter IP addresses for cluster nodes (one per line): │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ 192.168.1.101 │ │
│ │ 192.168.1.102 │ │
│ │ 192.168.1.103 │ │
│ │ 192.168.1.104 │ │
│ └─────────────────────────────────────────────────────────────┘ │
│ │
│ SSH Configuration: │
│ Username: [admin_user ] Port: [22 ] │
│ Password: [••••••••••••••] or Key: [Browse...] │
│ │
│ [ ] Test SSH Connectivity [Deploy to Cluster] │
│ │
└─────────────────────────────────────────────────────────────────┘
```

### 2.2 SSH-Based Remote Installation

**For Each Target Node:**

1. **SSH Connectivity Validation** (see the deployment sketch after this list)
   - Test SSH access with the provided credentials
   - Validate sudo privileges
   - Check system compatibility

2. **Remote BZZZ Installation**

   ```bash
   # Executed via SSH on each target node
   ssh admin_user@192.168.1.101 "curl -fsSL https://chorus.services/install.sh | BZZZ_ROLE=worker sh"
   ```

3. **Configuration Transfer**
   - Copy the master public key to the node
   - Install BZZZ binaries and dependencies
   - Configure systemd services
   - Set initial network parameters (bootstrap node address)

4. **Service Initialization**
   - Start the BZZZ service in cluster-join mode
   - Configure P2P network parameters
   - Set announce channel subscription

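A minimal orchestration sketch of these steps, assuming key-based SSH access to each node already works. The installer URL and `BZZZ_ROLE=worker` come from the command above; the `deploy_node` helper, the `sudo -n true` connectivity probe, and the host list are illustrative.

```python
import subprocess

INSTALL_CMD = 'curl -fsSL https://chorus.services/install.sh | BZZZ_ROLE=worker sh'

def deploy_node(host: str, user: str = "admin_user", port: int = 22) -> bool:
    """Validate SSH/sudo access, then run the BZZZ installer on the target node."""
    target = f"{user}@{host}"
    # Step 1: connectivity and sudo check (the "Test SSH Connectivity" action)
    check = subprocess.run(["ssh", "-p", str(port), target, "sudo -n true"],
                           capture_output=True, text=True)
    if check.returncode != 0:
        return False
    # Step 2: remote installation
    install = subprocess.run(["ssh", "-p", str(port), target, INSTALL_CMD],
                             capture_output=True, text=True)
    return install.returncode == 0

# Deploy to the nodes entered in the web UI
for ip in ["192.168.1.101", "192.168.1.102", "192.168.1.103", "192.168.1.104"]:
    print(ip, "ok" if deploy_node(ip) else "failed")
```
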
## Phase 3: P2P Network Formation & Capability Discovery

### 3.1 P2P Network Bootstrap

**Network Formation Process:**

1. **Bootstrap Node Configuration**
   - The first installed node becomes the bootstrap node
   - Listens for P2P connections on the configured port
   - Maintains the peer discovery registry

2. **Peer Discovery via Announce Channel**

   ```yaml
   announce_message:
     node_id: "node-192168001101-20250810"
     capabilities:
       gpu_count: 4
       gpu_type: "nvidia"
       gpu_memory: [24576, 24576, 24576, 24576]  # MB per GPU
       cpu_cores: 32
       memory_gb: 128
       storage_gb: 2048
       ollama_type: "parallama"
     network_info:
       ip_address: "192.168.1.101"
       p2p_port: 8081
     services:
       bzzz_go: 8080
       mcp_server: 3000
     joined_at: "2025-08-10T16:22:20Z"
   ```

3. **Capability-Based Network Organization** (a grouping sketch follows this list)
   - Nodes self-organize based on announced capabilities
   - GPU-enabled nodes form AI processing pools
   - Storage nodes are identified for DHT participation
   - Network topology is dynamically optimized

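A sketch of how announced capabilities might drive pool formation. The field names mirror the announce message above; the storage threshold and pool names are assumptions of this sketch.

```python
def organize_pools(announcements):
    """Group peers into capability pools from their announce messages."""
    ai_pool, dht_pool = [], []
    for msg in announcements:
        caps = msg["capabilities"]
        if caps.get("gpu_count", 0) > 0:
            ai_pool.append(msg["node_id"])       # AI processing pool
        if caps.get("storage_gb", 0) >= 512:     # illustrative storage threshold
            dht_pool.append(msg["node_id"])      # DHT participants
    return {"ai": ai_pool, "dht": dht_pool}
```
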
### 3.2 Shamir Share Distribution

**Once the P2P Network Is Established:**

1. Generate N shares of the admin private key (N = peer count)
2. Distribute one share to each peer via an encrypted P2P channel
3. Each peer stores its share encrypted with a node-specific key
4. Verify share distribution and reconstruction capability

## Phase 4: Leader Election & SLURP Responsibilities

### 4.1 Leader Election Algorithm

**Election Criteria (Weighted Scoring):**

- **Network Stability**: Uptime and connection quality (30%)
- **Hardware Resources**: CPU, memory, and storage capacity (25%)
- **Network Position**: Connectivity to other peers (20%)
- **Geographic Distribution**: Network latency optimization (15%)
- **Load Capacity**: Current resource utilization (10%)

**Election Process:**

1. Each node calculates its fitness score (see the scoring sketch after this list)
2. Nodes broadcast their scores and capabilities
3. A consensus algorithm determines the leader (highest score plus network agreement)
4. Leader election occurs every 24 hours or on leader failure
5. **Leader ≠ Admin**: the Leader handles operations, the Admin handles oversight

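The weighted score follows directly from the percentages above. A minimal sketch; the per-criterion inputs (each on a 0-100 scale) and the example values are illustrative, not actual node metrics.

```python
# Weights taken from the election criteria above
WEIGHTS = {
    "network_stability": 0.30,
    "hardware_resources": 0.25,
    "network_position": 0.20,
    "geo_distribution": 0.15,
    "load_capacity": 0.10,
}

def fitness_score(scores: dict) -> float:
    """Weighted sum of per-criterion scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Illustrative values only; metric collection is node-specific
print(round(fitness_score({
    "network_stability": 96,
    "hardware_resources": 95,
    "network_position": 89,
    "geo_distribution": 92,
    "load_capacity": 90,
}), 1))
```
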
### 4.2 SLURP Responsibilities (Leader Node)

**SLURP = Service Layer Unified Resource Protocol**

**Leader Responsibilities:**

- **Resource Orchestration**: Task distribution across the cluster
- **Model Distribution**: Coordinate ollama model replication
- **Load Balancing**: Distribute AI workloads optimally
- **Network Health**: Monitor peer connectivity and performance
- **DHT Coordination**: Manage distributed storage operations

**Leader Election Display:**

```
🏆 Network Leader Election Results
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Current Leader: node-192168001103-20250810
├─ Hardware Score: 95/100 (4x RTX 4090, 128GB RAM)
├─ Network Score: 89/100 (Central position, low latency)
├─ Stability Score: 96/100 (99.8% uptime)
└─ Overall Score: 93.2/100

Network Topology:
├─ Total Nodes: 5
├─ GPU Nodes: 4 (Parallama enabled)
├─ Storage Nodes: 5 (DHT participants)
├─ Available VRAM: 384GB total
└─ Network Latency: avg 2.3ms

Next Election: 2025-08-11 16:22:20 UTC
```

## Phase 5: Business Configuration & DHT Storage

### 5.1 DHT Bootstrap & Business Data Storage

**Only After Leader Election:**

- The DHT network becomes available for business data storage
- Configuration data is migrated from local storage to the DHT
- Business decisions are stored using UCXL addresses (see the storage sketch below)

**UCXL Address Format:**

```
ucxl://bzzz.cluster.config/network_topology
ucxl://bzzz.cluster.config/resource_allocation
ucxl://bzzz.cluster.config/ai_models
ucxl://bzzz.cluster.config/user_projects
```

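A sketch of storing a configuration record under a UCXL address once the DHT is live. `dht.put` stands in for whatever interface the BZZZ DHT client actually exposes, and the payload would be encrypted before storage per the security considerations below.

```python
import json

def ucxl_address(namespace: str, key: str) -> str:
    """Compose an address in the ucxl:// form shown above."""
    return f"ucxl://{namespace}/{key}"

def store_business_config(dht, key: str, value: dict) -> str:
    """Serialize a business-configuration record and store it under its UCXL address.

    `dht` is a placeholder for the real DHT client; the value should be
    encrypted before being written.
    """
    address = ucxl_address("bzzz.cluster.config", key)
    dht.put(address, json.dumps(value).encode("utf-8"))
    return address

# Hypothetical usage once the DHT is available post-election:
# store_business_config(dht, "network_topology", {"total_nodes": 5, "gpu_nodes": 4})
```
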
### 5.2 Business Configuration Categories

**Stored in the DHT (Post-Bootstrap):**

- Network topology and node roles
- Resource allocation policies
- AI model distribution strategies
- User project configurations
- Cost management settings
- Monitoring and alerting rules

**Kept Locally (Security/Bootstrap):**

- Admin user's public key
- Master public key for validation
- Initial IP candidate list
- Domain/DNS configuration
- Bootstrap node addresses

## Phase 6: Model Distribution & Synchronization

### 6.1 P2P Model Distribution Strategy

**Model Distribution Logic:**

```python
def distribute_model(model_info, cluster_nodes, primary_model_node, leader):
    """Replicate a model to every node with enough free VRAM to host it."""
    model_vram_req = model_info.vram_requirement_gb

    # Find nodes whose free VRAM can hold the model
    eligible_nodes = [
        node for node in cluster_nodes
        if node.available_vram_gb >= model_vram_req
    ]

    # Schedule a transfer to every eligible node that does not yet hold the model
    for node in eligible_nodes:
        if not node.has_model(model_info.id):
            leader.schedule_model_transfer(
                source=primary_model_node,
                target=node,
                model=model_info,
            )
```

**Distribution Priorities:**

1. **GPU Memory Threshold**: The model must fit in available VRAM
2. **Redundancy**: Minimum 3 copies across different nodes (see the sketch after this list)
3. **Geographic Distribution**: Spread across the network topology
4. **Load Balancing**: Distribute based on current node utilization

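A small helper, sketched against the same node interface as `distribute_model` above, that checks the redundancy priority; the function name and 3-copy default are taken from the list but otherwise illustrative.

```python
def replication_deficit(model_id, cluster_nodes, min_copies=3):
    """Return how many additional replicas are needed to satisfy the redundancy priority."""
    copies = sum(1 for node in cluster_nodes if node.has_model(model_id))
    return max(0, min_copies - copies)
```
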
### 6.2 Model Version Synchronization (TODO)

**Current Status**: Implementation pending

**Requirements:**

- Track model versions across all nodes
- Coordinate updates when new model versions are released
- Handle rollback scenarios for failed updates
- Maintain consistency during network partitions

**TODO Items to Address:**

- [ ] Design version tracking mechanism
- [ ] Implement distributed consensus for updates
- [ ] Create rollback/recovery procedures
- [ ] Handle split-brain scenarios during updates

## Phase 7: Role-Based Key Generation

### 7.1 Dynamic Role Key Creation

**Using the Admin Private Key (Post-Bootstrap):**

1. **User Defines Custom Roles** via the web UI:

   ```yaml
   roles:
     - name: "data_scientist"
       permissions: ["model_access", "job_submit", "resource_view"]
     - name: "ml_engineer"
       permissions: ["model_deploy", "cluster_config", "monitoring"]
     - name: "project_manager"
       permissions: ["user_management", "cost_monitoring", "reporting"]
   ```

2. **Admin Key Reconstruction** (see the signing sketch after this list):
   - Collect K shares from network peers
   - Reconstruct the admin private key temporarily in memory
   - Generate role-specific key pairs
   - Sign role public keys with the admin private key
   - Clear the admin private key from memory

3. **Role Key Distribution**:
   - Store role key pairs in the DHT under UCXL addresses
   - Distribute to authorized users via secure channels
   - Handle revocation through DHT updates

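A sketch of the signing step using the `cryptography` package. The admin key is assumed to have been reconstructed from K Shamir shares (as in the Phase 1 sketch) and is released after use; the helper name and return shape are illustrative.

```python
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa, padding

def issue_role_key(admin_private_key, role_name: str):
    """Create a role key pair and sign its public key with the reconstructed admin key."""
    role_key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
    role_public_pem = role_key.public_key().public_bytes(
        serialization.Encoding.PEM,
        serialization.PublicFormat.SubjectPublicKeyInfo,
    )
    signature = admin_private_key.sign(
        role_public_pem,
        padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
        hashes.SHA256(),
    )
    return {"role": role_name, "private_key": role_key,
            "public_key_pem": role_public_pem, "admin_signature": signature}

# After signing, drop all references to the admin key so it can be garbage-collected:
# del admin_private_key
```
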
## Installation Flow Summary

```
Phase 1: Bootstrap Setup
├─ curl install.sh → Web UI → Master Key Display (ONCE)
├─ Generate admin keys → Shamir split preparation
└─ Manual IP entry for cluster nodes

Phase 2: SSH Cluster Deployment
├─ SSH connectivity validation
├─ Remote BZZZ installation on all nodes
└─ Service startup with P2P parameters

Phase 3: P2P Network Formation
├─ Capability announcement via announce channel
├─ Peer discovery and network topology
└─ Shamir share distribution

Phase 4: Leader Election
├─ Fitness score calculation and consensus
├─ Leader takes SLURP responsibilities
└─ Network operational status achieved

Phase 5: DHT & Business Storage
├─ DHT network becomes available
├─ Business configuration migrated to UCXL addresses
└─ Local storage limited to security essentials

Phase 6: Model Distribution
├─ P2P model replication based on VRAM capacity
├─ Version synchronization (TODO)
└─ Load balancing and redundancy

Phase 7: Role Management
├─ Dynamic role definition via web UI
├─ Admin key reconstruction for signing
└─ Role-based access control deployment
```

## Security Considerations

### Data Storage Security

- **Sensitive Data**: Never stored in the DHT (keys, passwords)
- **Business Data**: Encrypted before DHT storage
- **Network Communication**: All P2P traffic encrypted
- **Key Recovery**: Master key required for emergency access

### Network Security

- **mTLS**: All inter-node communication secured
- **Certificate Rotation**: Automated cert renewal
- **Access Control**: Role-based permissions enforced
- **Audit Logging**: All privileged operations logged

## Monitoring & Observability

### Network Health Metrics

- P2P connection quality and latency
- DHT data consistency and replication
- Model distribution status and synchronization
- Leader election frequency and stability

### Business Metrics

- Resource utilization across the cluster
- Cost tracking and budget adherence
- AI workload distribution and performance
- User activity and access patterns

## Failure Recovery Procedures

### Leader Failure

1. Automatic re-election is triggered
2. The new leader assumes SLURP responsibilities
3. DHT operations continue uninterrupted
4. Model distribution resumes under the new leader

### Network Partition

1. The majority partition continues operations
2. Minority partitions enter read-only mode
3. Automatic healing occurs when connectivity is restored
4. Conflicts are resolved via timestamp ordering

### Admin Key Recovery

1. The master private key is required for recovery
2. Generate a new admin key pair if needed
3. Re-split and redistribute Shamir shares
4. Update role signatures with the new admin key

This plan provides a comprehensive, security-focused approach to BZZZ cluster deployment with clear separation of concerns and robust failure recovery mechanisms.