Major updates and improvements to BZZZ system
- Updated configuration and deployment files
- Improved system architecture and components
- Enhanced documentation and testing
- Fixed various issues and added new features

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
412
deployments/bare-metal/INSTALLATION-DEPLOYMENT-PLAN.md
Normal file
@@ -0,0 +1,412 @@
# BZZZ Installation & Deployment Plan

## Architecture Overview

BZZZ uses a distributed installation strategy that progresses through distinct phases: initial setup, SSH-based cluster deployment, P2P network formation, leader election, and finally DHT-based business configuration storage.

## Key Principles

1. **Security-First Design**: Multi-layered key management with Shamir's Secret Sharing
2. **Distributed Authority**: Clear separation between Admin (human oversight) and Leader (network operations)
3. **P2P Model Distribution**: Bandwidth-efficient model replication across the cluster
4. **DHT Business Storage**: Configuration data stored in a distributed hash table post-bootstrap
5. **Capability-Based Discovery**: Nodes announce their capabilities and self-organize
## Phase 1: Initial Node Setup & Key Generation

### 1.1 Bootstrap Machine Installation

```bash
curl -fsSL https://chorus.services/install.sh | sh
```

**Actions Performed:**
- System detection and validation
- BZZZ binary installation
- Docker and dependency setup
- Launch of the configuration web UI at `http://[node-ip]:8080/setup`
### 1.2 Master Key Generation & Display

**Key Generation Process:**
1. **Master Key Pair Generation**
   - Generate RSA 4096-bit master key pair
   - **CRITICAL**: Display private key ONCE in read-only format
   - User must securely store master private key (not stored on system)
   - Master public key stored locally for validation

2. **Admin Role Key Generation**
   - Generate admin role RSA 4096-bit key pair
   - Admin public key stored locally
   - **Admin private key split using Shamir's Secret Sharing**

3. **Shamir's Secret Sharing Implementation**
   - Split admin private key into N shares (where N = cluster size)
   - Require K shares for reconstruction (K = ceiling(N/2) + 1)
   - Distribute shares to BZZZ peers once network is established
   - Ensures no single node failure compromises admin access
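The threshold arithmetic and the split/reconstruct round trip can be illustrated with a small, self-contained sketch. This is not the BZZZ implementation; a production deployment would use a vetted secret-sharing library and operate on the actual admin key material, but the K = ceiling(N/2) + 1 rule and the reconstruction step look like this:

```python
import random
from math import ceil

PRIME = 2**127 - 1  # Mersenne prime; all share arithmetic happens in this field

def split_secret(secret: int, n: int, k: int):
    """Split `secret` into n shares; any k of them reconstruct it."""
    coeffs = [secret] + [random.randrange(PRIME) for _ in range(k - 1)]

    def poly(x):
        return sum(c * pow(x, i, PRIME) for i, c in enumerate(coeffs)) % PRIME

    return [(x, poly(x)) for x in range(1, n + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 recovers the constant term (the secret)."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, PRIME - 2, PRIME)) % PRIME
    return secret

n = 5                # hypothetical cluster size
k = ceil(n / 2) + 1  # threshold from the plan: K = ceiling(N/2) + 1, so 4 of 5 here
shares = split_secret(123456789, n, k)
assert reconstruct(shares[:k]) == 123456789   # any K shares recover the secret
assert reconstruct(shares[-k:]) == 123456789
```

In the actual flow the secret would be the admin private key (or a key-encryption key protecting it), and each share would be delivered to a peer over the encrypted P2P channel described in Phase 3.2.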
### 1.3 Web UI Security Display

```
┌─────────────────────────────────────────────────────────────────┐
│ 🔐 CRITICAL: Master Private Key - DISPLAY ONCE ONLY              │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ -----BEGIN RSA PRIVATE KEY-----                                 │
│ [MASTER_PRIVATE_KEY_CONTENT]                                    │
│ -----END RSA PRIVATE KEY-----                                   │
│                                                                 │
│ ⚠️ SECURITY NOTICE:                                              │
│ • This key will NEVER be displayed again                        │
│ • Store in secure password manager immediately                  │
│ • Required for emergency cluster recovery                       │
│ • Loss of this key may require complete reinstallation          │
│                                                                 │
│ [ ] I have securely stored the master private key               │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
## Phase 2: Cluster Node Discovery & SSH Deployment

### 2.1 Manual IP Entry Interface

**Web UI Node Discovery:**
```
┌─────────────────────────────────────────────────────────────────┐
│ 🌐 Cluster Node Discovery                                        │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│ Enter IP addresses for cluster nodes (one per line):            │
│ ┌─────────────────────────────────────────────────────────────┐ │
│ │ 192.168.1.101                                               │ │
│ │ 192.168.1.102                                               │ │
│ │ 192.168.1.103                                               │ │
│ │ 192.168.1.104                                               │ │
│ └─────────────────────────────────────────────────────────────┘ │
│                                                                 │
│ SSH Configuration:                                              │
│ Username: [admin_user     ]          Port: [22   ]              │
│ Password: [••••••••••••••] or Key: [Browse...]                  │
│                                                                 │
│ [ ] Test SSH Connectivity                   [Deploy to Cluster] │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
```
### 2.2 SSH-Based Remote Installation

**For Each Target Node:**
1. **SSH Connectivity Validation**
   - Test SSH access with provided credentials
   - Validate sudo privileges
   - Check system compatibility

2. **Remote BZZZ Installation**
   ```bash
   # Executed via SSH on each target node
   ssh admin_user@192.168.1.101 "curl -fsSL https://chorus.services/install.sh | BZZZ_ROLE=worker sh"
   ```

3. **Configuration Transfer**
   - Copy master public key to node
   - Install BZZZ binaries and dependencies
   - Configure systemd services
   - Set initial network parameters (bootstrap node address)

4. **Service Initialization**
   - Start BZZZ service in cluster-join mode
   - Configure P2P network parameters
   - Set announce channel subscription
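A minimal driver for the per-node loop above might look like the sketch below. It simply shells out to `ssh` with the install command from step 2; the node list, username, and error handling are illustrative assumptions rather than actual installer code.

```python
import subprocess

INSTALL_CMD = "curl -fsSL https://chorus.services/install.sh | BZZZ_ROLE=worker sh"

def deploy_to_cluster(node_ips, user="admin_user", port=22):
    """Run the BZZZ worker install on each target node over SSH; return failures."""
    failed = []
    for ip in node_ips:
        result = subprocess.run(
            ["ssh", "-p", str(port), f"{user}@{ip}", INSTALL_CMD],
            capture_output=True, text=True,
        )
        if result.returncode != 0:
            failed.append((ip, result.stderr.strip()))
    return failed

# Example: the IPs entered in the web UI in section 2.1
errors = deploy_to_cluster(["192.168.1.101", "192.168.1.102", "192.168.1.103", "192.168.1.104"])
for ip, err in errors:
    print(f"deployment failed on {ip}: {err}")
```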
## Phase 3: P2P Network Formation & Capability Discovery

### 3.1 P2P Network Bootstrap

**Network Formation Process:**
1. **Bootstrap Node Configuration**
   - First installed node becomes the bootstrap node
   - Listens for P2P connections on the configured port
   - Maintains the peer discovery registry

2. **Peer Discovery via Announce Channel**
   ```yaml
   announce_message:
     node_id: "node-192168001101-20250810"
     capabilities:
       - gpu_count: 4
       - gpu_type: "nvidia"
       - gpu_memory: [24576, 24576, 24576, 24576]  # MB per GPU
       - cpu_cores: 32
       - memory_gb: 128
       - storage_gb: 2048
       - ollama_type: "parallama"
     network_info:
       ip_address: "192.168.1.101"
       p2p_port: 8081
     services:
       - bzzz_go: 8080
       - mcp_server: 3000
     joined_at: "2025-08-10T16:22:20Z"
   ```

3. **Capability-Based Network Organization**
   - Nodes self-organize based on announced capabilities
   - GPU-enabled nodes form AI processing pools
   - Storage nodes identified for DHT participation
   - Network topology dynamically optimized
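As a rough illustration of how announce messages could drive this self-organization, the sketch below groups peers into GPU and storage pools from their announced capabilities. The field names follow the `announce_message` example above; the storage threshold for DHT participation is an assumption.

```python
def organize_peers(announcements):
    """Group peers into capability pools based on their announce messages."""
    gpu_pool, storage_pool = [], []
    for msg in announcements:
        # capabilities arrive as a list of single-key maps, as in the YAML above
        caps = {k: v for item in msg["capabilities"] for k, v in item.items()}
        if caps.get("gpu_count", 0) > 0:
            gpu_pool.append(msg["node_id"])
        if caps.get("storage_gb", 0) >= 500:  # assumed threshold for DHT participation
            storage_pool.append(msg["node_id"])
    return {"ai_processing": gpu_pool, "dht_storage": storage_pool}

# Example with one announce message shaped like the example above
announcements = [{
    "node_id": "node-192168001101-20250810",
    "capabilities": [
        {"gpu_count": 4}, {"gpu_type": "nvidia"},
        {"cpu_cores": 32}, {"memory_gb": 128}, {"storage_gb": 2048},
    ],
}]
print(organize_peers(announcements))
```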
### 3.2 Shamir Share Distribution

**Once P2P Network Established:**
1. Generate N shares of the admin private key (N = peer count)
2. Distribute one share to each peer via encrypted P2P channel
3. Each peer stores its share encrypted with its node-specific key
4. Verify share distribution and reconstruction capability
## Phase 4: Leader Election & SLURP Responsibilities

### 4.1 Leader Election Algorithm

**Election Criteria (Weighted Scoring):**
- **Network Stability**: Uptime and connection quality (30%)
- **Hardware Resources**: CPU, memory, storage capacity (25%)
- **Network Position**: Connectivity to other peers (20%)
- **Geographic Distribution**: Network latency optimization (15%)
- **Load Capacity**: Current resource utilization (10%)
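A node's fitness score is then a weighted sum of these criteria. The sketch below shows the arithmetic with the weights listed above; the per-criterion sub-scores (0-100) would come from each node's own measurements and are hypothetical here.

```python
WEIGHTS = {
    "network_stability": 0.30,        # uptime and connection quality
    "hardware_resources": 0.25,       # CPU, memory, storage capacity
    "network_position": 0.20,         # connectivity to other peers
    "geographic_distribution": 0.15,  # network latency optimization
    "load_capacity": 0.10,            # current resource utilization
}

def fitness_score(subscores: dict) -> float:
    """Weighted sum of 0-100 sub-scores; higher means a better leader candidate."""
    return sum(WEIGHTS[name] * subscores[name] for name in WEIGHTS)

# Hypothetical sub-scores for one candidate node
candidate = {
    "network_stability": 96,
    "hardware_resources": 95,
    "network_position": 89,
    "geographic_distribution": 92,
    "load_capacity": 93,
}
print(f"fitness: {fitness_score(candidate):.1f}")
```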
**Election Process:**
1. Each node calculates its fitness score
2. Nodes broadcast their scores and capabilities
3. Consensus algorithm determines the leader (highest score + network agreement)
4. Leader election occurs every 24 hours or on leader failure
5. **Leader ≠ Admin**: Leader handles operations, Admin handles oversight
### 4.2 SLURP Responsibilities (Leader Node)

**SLURP = Service Layer Unified Resource Protocol**

**Leader Responsibilities:**
- **Resource Orchestration**: Task distribution across the cluster
- **Model Distribution**: Coordinate Ollama model replication
- **Load Balancing**: Distribute AI workloads optimally
- **Network Health**: Monitor peer connectivity and performance
- **DHT Coordination**: Manage distributed storage operations

**Leader Election Display:**
```
🏆 Network Leader Election Results
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Current Leader: node-192168001103-20250810
├─ Hardware Score: 95/100 (4x RTX 4090, 128GB RAM)
├─ Network Score: 89/100 (Central position, low latency)
├─ Stability Score: 96/100 (99.8% uptime)
└─ Overall Score: 93.2/100

Network Topology:
├─ Total Nodes: 5
├─ GPU Nodes: 4 (Parallama enabled)
├─ Storage Nodes: 5 (DHT participants)
├─ Available VRAM: 384GB total
└─ Network Latency: avg 2.3ms

Next Election: 2025-08-11 16:22:20 UTC
```
## Phase 5: Business Configuration & DHT Storage

### 5.1 DHT Bootstrap & Business Data Storage

**Only After Leader Election:**
- DHT network becomes available for business data storage
- Configuration data migrated from local storage to DHT
- Business decisions stored using UCXL addresses

**UCXL Address Format:**
```
ucxl://bzzz.cluster.config/network_topology
ucxl://bzzz.cluster.config/resource_allocation
ucxl://bzzz.cluster.config/ai_models
ucxl://bzzz.cluster.config/user_projects
```
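As an illustration of how business configuration could be keyed by these addresses, the sketch below composes a `ucxl://` address and derives a content-addressed DHT key from it. The `dht.put` call is a hypothetical placeholder, not the actual BZZZ API, and the payload fields are likewise assumptions.

```python
import hashlib
import json

def ucxl_address(namespace: str, key: str) -> str:
    """Compose a UCXL address in the form shown above."""
    return f"ucxl://{namespace}/{key}"

def dht_key(address: str) -> str:
    """Hypothetical mapping from a UCXL address to a content-addressed DHT key."""
    return hashlib.sha256(address.encode()).hexdigest()

addr = ucxl_address("bzzz.cluster.config", "resource_allocation")
payload = json.dumps({"gpu_pool_max_jobs": 8, "reserve_vram_gb": 4})  # assumed policy fields

# dht.put(dht_key(addr), payload)  # placeholder for the real DHT client call
print(addr, "->", dht_key(addr)[:16], payload)
```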
### 5.2 Business Configuration Categories

**Stored in DHT (Post-Bootstrap):**
- Network topology and node roles
- Resource allocation policies
- AI model distribution strategies
- User project configurations
- Cost management settings
- Monitoring and alerting rules

**Kept Locally (Security/Bootstrap):**
- Admin user's public key
- Master public key for validation
- Initial IP candidate list
- Domain/DNS configuration
- Bootstrap node addresses
## Phase 6: Model Distribution & Synchronization

### 6.1 P2P Model Distribution Strategy

**Model Distribution Logic:**
```python
def distribute_model(model_info):
    model_size = model_info.size_gb
    model_vram_req = model_info.vram_requirement_gb

    # Find eligible nodes
    eligible_nodes = []
    for node in cluster_nodes:
        if node.available_vram_gb >= model_vram_req:
            eligible_nodes.append(node)

    # Distribute to all eligible nodes
    for node in eligible_nodes:
        if not node.has_model(model_info.id):
            leader.schedule_model_transfer(
                source=primary_model_node,
                target=node,
                model=model_info,
            )
```
**Distribution Priorities:**
1. **GPU Memory Threshold**: Model must fit in available VRAM
2. **Redundancy**: Minimum 3 copies across different nodes
3. **Geographic Distribution**: Spread across network topology
4. **Load Balancing**: Distribute based on current node utilization
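Taken together, priorities 2 and 4 suggest a placement rule along the lines of the sketch below: rank the VRAM-eligible nodes by current utilization and keep at least three replicas. The `Node` shape and the utilization field are assumptions for illustration.

```python
from collections import namedtuple

MIN_REPLICAS = 3  # priority 2: minimum redundancy

def choose_replica_targets(eligible_nodes, desired=MIN_REPLICAS):
    """Pick the least-loaded eligible nodes to hold model replicas (priority 4)."""
    ranked = sorted(eligible_nodes, key=lambda n: n.current_utilization)
    return ranked[:max(desired, MIN_REPLICAS)]

# Example with hypothetical utilization figures
Node = namedtuple("Node", "name current_utilization")
targets = choose_replica_targets(
    [Node("gpu-01", 0.7), Node("gpu-02", 0.2), Node("gpu-03", 0.5), Node("gpu-04", 0.9)]
)
print([n.name for n in targets])  # the three least-loaded nodes: gpu-02, gpu-03, gpu-01
```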
### 6.2 Model Version Synchronization (TODO)

**Current Status**: Implementation pending

**Requirements:**
- Track model versions across all nodes
- Coordinate updates when new model versions are released
- Handle rollback scenarios for failed updates
- Maintain consistency during network partitions

**TODO Items to Address:**
- [ ] Design version tracking mechanism
- [ ] Implement distributed consensus for updates
- [ ] Create rollback/recovery procedures
- [ ] Handle split-brain scenarios during updates
## Phase 7: Role-Based Key Generation

### 7.1 Dynamic Role Key Creation

**Using Admin Private Key (Post-Bootstrap):**
1. **User Defines Custom Roles** via web UI:
   ```yaml
   roles:
     - name: "data_scientist"
       permissions: ["model_access", "job_submit", "resource_view"]
     - name: "ml_engineer"
       permissions: ["model_deploy", "cluster_config", "monitoring"]
     - name: "project_manager"
       permissions: ["user_management", "cost_monitoring", "reporting"]
   ```
2. **Admin Key Reconstruction** (a minimal signing sketch follows this list):
   - Collect K shares from network peers
   - Reconstruct admin private key temporarily in memory
   - Generate role-specific key pairs
   - Sign role public keys with admin private key
   - Clear admin private key from memory

3. **Role Key Distribution**:
   - Store role key pairs in DHT with UCXL addresses
   - Distribute to authorized users via secure channels
   - Revocation handled through DHT updates
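The reconstruction-and-signing step in item 2 could look roughly like the sketch below, using the widely available Python `cryptography` package. The reconstructed admin key is represented here by a freshly generated key; in the real flow it would be rebuilt from K Shamir shares, used only in memory, and then discarded.

```python
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Stand-in for the admin key reconstructed from K shares (illustration only)
admin_key = rsa.generate_private_key(public_exponent=65537, key_size=4096)

# Generate a role-specific key pair, e.g. for "data_scientist"
role_key = rsa.generate_private_key(public_exponent=65537, key_size=4096)
role_pub_pem = role_key.public_key().public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
)

# Sign the role public key so peers can verify it against the admin public key
signature = admin_key.sign(
    role_pub_pem,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# Any peer holding the admin public key can check the role key's provenance
admin_key.public_key().verify(
    signature,
    role_pub_pem,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

del admin_key  # in the real flow, the reconstructed key is cleared from memory here
```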
## Installation Flow Summary

```
Phase 1: Bootstrap Setup
├─ curl install.sh → Web UI → Master Key Display (ONCE)
├─ Generate admin keys → Shamir split preparation
└─ Manual IP entry for cluster nodes

Phase 2: SSH Cluster Deployment
├─ SSH connectivity validation
├─ Remote BZZZ installation on all nodes
└─ Service startup with P2P parameters

Phase 3: P2P Network Formation
├─ Capability announcement via announce channel
├─ Peer discovery and network topology
└─ Shamir share distribution

Phase 4: Leader Election
├─ Fitness score calculation and consensus
├─ Leader takes SLURP responsibilities
└─ Network operational status achieved

Phase 5: DHT & Business Storage
├─ DHT network becomes available
├─ Business configuration migrated to UCXL addresses
└─ Local storage limited to security essentials

Phase 6: Model Distribution
├─ P2P model replication based on VRAM capacity
├─ Version synchronization (TODO)
└─ Load balancing and redundancy

Phase 7: Role Management
├─ Dynamic role definition via web UI
├─ Admin key reconstruction for signing
└─ Role-based access control deployment
```
## Security Considerations

### Data Storage Security
- **Sensitive Data**: Never stored in DHT (keys, passwords)
- **Business Data**: Encrypted before DHT storage
- **Network Communication**: All P2P traffic encrypted
- **Key Recovery**: Master key required for emergency access

### Network Security
- **mTLS**: All inter-node communication secured
- **Certificate Rotation**: Automated cert renewal
- **Access Control**: Role-based permissions enforced
- **Audit Logging**: All privileged operations logged
## Monitoring & Observability

### Network Health Metrics
- P2P connection quality and latency
- DHT data consistency and replication
- Model distribution status and synchronization
- Leader election frequency and stability

### Business Metrics
- Resource utilization across the cluster
- Cost tracking and budget adherence
- AI workload distribution and performance
- User activity and access patterns
## Failure Recovery Procedures

### Leader Failure
1. Automatic re-election triggered
2. New leader assumes SLURP responsibilities
3. DHT operations continue uninterrupted
4. Model distribution resumes under new leader

### Network Partition
1. Majority partition continues operations
2. Minority partitions enter read-only mode
3. Automatic healing when connectivity is restored
4. Conflict resolution via timestamp ordering
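The majority/minority decision reduces to a simple quorum check, sketched below under the assumption that each node knows the configured cluster size and how many peers it can currently reach.

```python
def has_quorum(reachable_peers: int, cluster_size: int) -> bool:
    """True if this node plus its reachable peers form a strict majority."""
    return (reachable_peers + 1) > cluster_size // 2

# In a 5-node cluster: a 3-node partition keeps writing, a 2-node partition goes read-only
assert has_quorum(reachable_peers=2, cluster_size=5) is True
assert has_quorum(reachable_peers=1, cluster_size=5) is False
```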
### Admin Key Recovery
1. Master private key required for recovery
2. Generate new admin key pair if needed
3. Re-split and redistribute Shamir shares
4. Update role signatures with new admin key

This plan provides a comprehensive, security-focused approach to BZZZ cluster deployment with clear separation of concerns and robust failure recovery mechanisms.