hive/HIVE_BZZZ_REGISTRATION_ARCHITECTURE.md
anthonyrawlins b6bff318d9 WIP: Save current work before CHORUS rebrand
2025-08-01 02:20:56 +10:00

🏗️ Hive-Bzzz Registration Architecture Design Plan

🔍 Current Architecture Problems

  1. Static Configuration: Hardcoded node IPs in cluster_service.py
  2. SSH Dependencies: Requires SSH keys, network access, security risks
  3. Docker Isolation: Can't SSH from container to host network
  4. No Dynamic Discovery: Nodes can't join/leave dynamically
  5. Stale Data: No real-time hardware/status updates

🎯 Proposed Architecture: Registration-Based Cluster

Nodes join the cluster by presenting a token, similar to Docker Swarm's docker swarm join --token:

# Bzzz clients register with Hive coordinator
HIVE_CLUSTER_TOKEN=abc123... HIVE_COORDINATOR_URL=https://hive.example.com bzzz-client

📋 Implementation Plan

Phase 1: Hive Coordinator Registration System

1.1 Database Schema Changes

-- Cluster registration tokens
CREATE TABLE cluster_tokens (
    id SERIAL PRIMARY KEY,
    token VARCHAR(64) UNIQUE NOT NULL,
    description TEXT,
    created_at TIMESTAMP DEFAULT NOW(),
    expires_at TIMESTAMP,
    is_active BOOLEAN DEFAULT true
);

-- Registered cluster nodes  
CREATE TABLE cluster_nodes (
    id SERIAL PRIMARY KEY,
    node_id VARCHAR(64) UNIQUE NOT NULL,
    hostname VARCHAR(255) NOT NULL,
    ip_address INET NOT NULL,
    registration_token VARCHAR(64) REFERENCES cluster_tokens(token),
    
    -- Hardware info (reported by client)
    cpu_info JSONB,
    memory_info JSONB, 
    gpu_info JSONB,
    disk_info JSONB,
    
    -- Status tracking
    status VARCHAR(20) DEFAULT 'online',
    last_heartbeat TIMESTAMP DEFAULT NOW(),
    first_registered TIMESTAMP DEFAULT NOW(),
    
    -- Capabilities
    services JSONB, -- ollama, docker, etc.
    capabilities JSONB -- models, tools, etc.
);
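
The is_active and expires_at columns imply a validity check at registration time. A minimal sketch of that check follows, assuming a NULL expires_at means the token never expires (the schema does not say). `ClusterToken` and `token_is_usable` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ClusterToken:
    """In-memory mirror of one cluster_tokens row (illustrative only)."""
    token: str
    is_active: bool = True
    expires_at: Optional[datetime] = None  # assumption: None means "never expires"


def token_is_usable(row: ClusterToken, now: datetime) -> bool:
    """Return True if the token is active and has not yet expired."""
    if not row.is_active:
        return False
    if row.expires_at is not None and now >= row.expires_at:
        return False
    return True


now = datetime(2025, 8, 1, tzinfo=timezone.utc)
fresh = ClusterToken("a" * 64, expires_at=now + timedelta(days=7))
expired = ClusterToken("b" * 64, expires_at=now - timedelta(seconds=1))
revoked = ClusterToken("c" * 64, is_active=False)
```

In a real deployment the same condition would more likely live in the SQL WHERE clause, so that the check and the lookup happen in one query.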

1.2 Registration API Endpoints

# /api/cluster/register (POST)
# - Validates token
# - Records node hardware info
# - Returns node_id and heartbeat interval

# /api/cluster/heartbeat (POST)  
# - Updates last_heartbeat
# - Updates current status/metrics
# - Returns cluster commands/tasks

# /api/cluster/tokens (GET/POST)
# - Generate/list/revoke cluster tokens
# - Admin endpoint for token management
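
The register and heartbeat behaviour listed above can be sketched without any web framework. This is a toy model only: `InMemoryCoordinator` is a hypothetical class, dicts stand in for the two tables, and the 401/404 status codes and response field names are assumptions not fixed by this plan.

```python
import uuid
from datetime import datetime, timezone


class InMemoryCoordinator:
    """Toy stand-in for the Hive coordinator; a dict replaces cluster_nodes."""

    def __init__(self, valid_tokens, heartbeat_interval=30):
        self.valid_tokens = set(valid_tokens)
        self.heartbeat_interval = heartbeat_interval
        self.nodes = {}  # node_id -> node record

    def register(self, payload):
        """Model POST /api/cluster/register; returns (http_status, body)."""
        if payload.get("token") not in self.valid_tokens:
            return 401, {"error": "invalid or revoked token"}
        # Reuse the client's node_id if given so a node can rejoin under one ID.
        node_id = payload.get("node_id") or f"{payload['hostname']}-{uuid.uuid4()}"
        self.nodes[node_id] = {
            "hostname": payload["hostname"],
            "ip_address": payload["ip_address"],
            "system_info": payload.get("system_info", {}),
            "status": "online",
            "last_heartbeat": datetime.now(timezone.utc),
        }
        return 200, {"node_id": node_id, "heartbeat_interval": self.heartbeat_interval}

    def heartbeat(self, node_id, status):
        """Model POST /api/cluster/heartbeat; returns (http_status, body)."""
        node = self.nodes.get(node_id)
        if node is None:
            return 404, {"error": "unknown node, register first"}
        node["last_heartbeat"] = datetime.now(timezone.utc)
        node["status"] = status
        return 200, {"commands": []}  # no task queue in this sketch


hive = InMemoryCoordinator(valid_tokens={"abc123"})
code, body = hive.register(
    {"token": "abc123", "node_id": "walnut-1", "hostname": "walnut", "ip_address": "192.168.1.27"}
)
```

Returning 404 for an unknown node gives the client a clear signal to register again, which fits the rejoin-after-network-issues goal.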

Phase 2: Bzzz Client Registration Capability

2.1 Environment Variables

HIVE_CLUSTER_TOKEN=token_here       # Required for registration
HIVE_COORDINATOR_URL=https://hive.local:8000  # Hive API endpoint
HIVE_NODE_ID=walnut-$(hostname)     # Optional: custom node ID
HIVE_HEARTBEAT_INTERVAL=30          # Seconds between heartbeats
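
One way the client might load and validate these settings at startup is sketched below. `ClusterConfig` and `load_cluster_config` are hypothetical names, and treating HIVE_COORDINATOR_URL as required alongside the token is an assumption; the list above only marks the token as required.

```python
import os
import socket
from dataclasses import dataclass
from uuid import uuid4


@dataclass(frozen=True)
class ClusterConfig:
    token: str
    coordinator_url: str
    node_id: str
    heartbeat_interval: int


def load_cluster_config(env=None) -> ClusterConfig:
    """Build a ClusterConfig from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Assumption: both the token and the coordinator URL must be present.
    missing = [k for k in ("HIVE_CLUSTER_TOKEN", "HIVE_COORDINATOR_URL") if not env.get(k)]
    if missing:
        raise ValueError(f"missing required settings: {', '.join(missing)}")
    interval = int(env.get("HIVE_HEARTBEAT_INTERVAL", "30"))
    if interval <= 0:
        raise ValueError("HIVE_HEARTBEAT_INTERVAL must be a positive integer")
    return ClusterConfig(
        token=env["HIVE_CLUSTER_TOKEN"],
        coordinator_url=env["HIVE_COORDINATOR_URL"].rstrip("/"),
        node_id=env.get("HIVE_NODE_ID") or f"{socket.gethostname()}-{uuid4()}",
        heartbeat_interval=interval,
    )
```

Passing a plain dict instead of `os.environ` keeps the function easy to test. Failing fast on a missing token gives a clearer startup error than a rejected registration request later.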

2.2 Hardware Detection Module

# bzzz/system_info.py
def get_system_info():
    return {
        "cpu": detect_cpu(),           # lscpu parsing
        "memory": detect_memory(),     # /proc/meminfo
        "gpu": detect_gpu(),           # nvidia-smi, lspci
        "disk": detect_storage(),      # df, lsblk
        "services": detect_services(), # docker, ollama, etc.
        "capabilities": detect_models() # available models
    }
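
As an illustration of one detector, here is a possible shape for detect_memory() based on /proc/meminfo, as the comment above suggests. The `parse_meminfo` helper and the returned key names (`total_bytes`, `available_bytes`) are assumptions, and the sketch is Linux-only.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field_name: value_in_bytes_or_count}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines without a "Name: value" shape
        parts = rest.split()
        if not parts or not parts[0].isdigit():
            continue
        value = int(parts[0])
        if len(parts) > 1 and parts[1] == "kB":
            value *= 1024  # the kernel's "kB" here means 1024 bytes
        info[key.strip()] = value
    return info


def detect_memory(path: str = "/proc/meminfo") -> dict:
    """Return total and available memory in bytes (Linux only)."""
    with open(path, encoding="ascii") as fh:
        info = parse_meminfo(fh.read())
    return {
        "total_bytes": info.get("MemTotal"),
        "available_bytes": info.get("MemAvailable"),
    }


sample = "MemTotal:       16384000 kB\nMemAvailable:    8192000 kB\nHugePages_Total:       0\n"
parsed = parse_meminfo(sample)
```

Splitting parsing from file access lets the parser be tested on any platform with sample text.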

2.3 Registration Logic

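# bzzz/cluster_client.py
import asyncio
import os
import socket
from uuid import uuid4

from bzzz.system_info import get_system_info  # module from section 2.2

# get_local_ip() and get_current_status() are helpers that still need to be written.


class HiveClusterClient:
    def __init__(self):
        self.token = os.getenv('HIVE_CLUSTER_TOKEN')
        self.coordinator_url = os.getenv('HIVE_COORDINATOR_URL')
        # Without HIVE_NODE_ID the ID changes on every start (new uuid4 each time)
        self.node_id = os.getenv('HIVE_NODE_ID', f"{socket.gethostname()}-{uuid4()}")
        # Local default; the interval returned by /api/cluster/register may replace it
        self.heartbeat_interval = int(os.getenv('HIVE_HEARTBEAT_INTERVAL', '30'))
        
    async def register(self):
        """Register with Hive coordinator"""
        system_info = get_system_info()
        payload = {
            "token": self.token,
            "node_id": self.node_id,
            "hostname": socket.gethostname(),
            "ip_address": get_local_ip(),
            "system_info": system_info
        }
        # POST payload to /api/cluster/register
        
    async def heartbeat_loop(self):
        """Send periodic heartbeats with current status"""
        while True:
            current_status = get_current_status()
            # POST current_status to /api/cluster/heartbeat
            await asyncio.sleep(self.heartbeat_interval)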
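
The POST steps are left as comments above. One dependency-free way to fill them in is the standard library's urllib, sketched here; `build_register_request` and `send` are hypothetical helpers, and the JSON request and response bodies are assumptions about the API.

```python
import json
import urllib.request


def build_register_request(coordinator_url: str, payload: dict) -> urllib.request.Request:
    """Create (but do not send) the POST request for /api/cluster/register."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=coordinator_url.rstrip("/") + "/api/cluster/register",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> dict:
    """Send the request and decode a JSON response body (blocking call)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_register_request(
    "https://hive.local:8000/",
    {"token": "token_here", "node_id": "walnut-1"},
)
```

Because `send` blocks, an async caller on Python 3.9+ could run it with `await asyncio.to_thread(send, request)` so the heartbeat loop is not stalled. An async HTTP client library would be an equally reasonable choice.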

Phase 3: Integration & Migration

3.1 Remove Hardcoded Nodes

  • Delete static cluster_nodes dict from cluster_service.py
  • Replace with dynamic database queries
  • Update all cluster APIs to use registered nodes
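
Once nodes come from the database, their liveness can be derived from last_heartbeat rather than stored statically. A minimal sketch follows, assuming a node counts as offline after three missed heartbeats; that threshold, the function names, and the sample node IDs are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def effective_status(last_heartbeat, now, heartbeat_interval=30, missed_allowed=3):
    """Return 'online' if the node reported within the grace window, else 'offline'."""
    grace = timedelta(seconds=heartbeat_interval * missed_allowed)
    return "online" if now - last_heartbeat <= grace else "offline"


def list_online_nodes(rows, now):
    """Filter node rows (dicts with node_id and last_heartbeat) down to live ones."""
    return [r["node_id"] for r in rows if effective_status(r["last_heartbeat"], now) == "online"]


now = datetime(2025, 8, 1, 12, 0, 0, tzinfo=timezone.utc)
rows = [
    {"node_id": "node-a", "last_heartbeat": now - timedelta(seconds=20)},
    {"node_id": "node-b", "last_heartbeat": now - timedelta(seconds=300)},
]
online = list_online_nodes(rows, now)
```

Computing status at query time avoids a background job that flips the status column, at the cost of every reader needing the same grace-window rule.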

3.2 Frontend Updates

  • Node Management UI: View/approve/remove registered nodes
  • Token Management: Generate/revoke cluster tokens
  • Real-time Status: Live hardware metrics from heartbeats
  • Registration Instructions: Show token and join commands

3.3 Bzzz Client Integration

  • Add cluster client to Bzzz startup sequence
  • Environment variable configuration
  • Graceful handling of registration failures
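
Graceful handling of registration failures could mean retrying with exponential backoff, so that Bzzz still starts when the coordinator is briefly unreachable. The sketch below is one possible approach, not a committed design; `register_with_retry`, its parameters, and the jitter range are assumptions.

```python
import asyncio
import random


async def register_with_retry(register, max_attempts=5, base_delay=1.0,
                              max_delay=60.0, sleep=asyncio.sleep):
    """Await the async callable `register` until it succeeds or attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await register()
        except (OSError, asyncio.TimeoutError):
            if attempt == max_attempts:
                raise  # give up and let the caller decide what to do
            # Double the delay each attempt, capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            # Jitter keeps many nodes from retrying at the same instant.
            delay *= random.uniform(0.5, 1.0)
            await sleep(delay)
```

The `sleep` parameter exists only so tests can substitute a fake and avoid real waiting. Only network-style errors are retried here; a rejected token would presumably be reported as a non-retryable error instead.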

🔄 Registration Flow

sequenceDiagram
    participant B as Bzzz Client
    participant H as Hive Coordinator
    participant DB as Database
    
    Note over H: Admin generates token
    H->>DB: INSERT cluster_token
    
    Note over B: Start with env vars
    B->>B: Detect system info
    B->>H: POST /api/cluster/register
    H->>DB: Validate token
    H->>DB: INSERT cluster_node
    H->>B: Return node_id, heartbeat_interval
    
    loop Every 30 seconds
        B->>B: Get current status
        B->>H: POST /api/cluster/heartbeat  
        H->>DB: UPDATE last_heartbeat
    end

🔐 Security Considerations

  1. Token-based Auth: No SSH keys or passwords needed
  2. Token Expiration: Configurable token lifetimes
  3. IP Validation: Optional IP whitelist for token usage
  4. TLS Required: All communication over HTTPS
  5. Token Rotation: Ability to revoke/regenerate tokens
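
Several of these points map onto Python standard-library calls. The sketch below shows one way to generate a token that fits the VARCHAR(64) column, compare tokens in constant time, and apply an optional IP allowlist. The function names are hypothetical, and expressing the allowlist as CIDR networks is an assumption.

```python
import hmac
import ipaddress
import secrets


def generate_cluster_token() -> str:
    """Return 64 random hex characters, which fits the VARCHAR(64) token column."""
    return secrets.token_hex(32)


def tokens_match(presented: str, stored: str) -> bool:
    """Compare tokens in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(presented.encode("utf-8"), stored.encode("utf-8"))


def ip_allowed(ip: str, allowed_networks) -> bool:
    """Check a client IP against an optional CIDR allowlist; empty means allow all."""
    if not allowed_networks:
        return True
    address = ipaddress.ip_address(ip)
    return any(address in ipaddress.ip_network(net) for net in allowed_networks)


token = generate_cluster_token()
```

Revoking a token then amounts to setting is_active to false, and rotation to issuing a new token before revoking the old one.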

Benefits of New Architecture

  1. Dynamic Discovery: Nodes self-register, no pre-configuration
  2. Real-time Data: Live hardware metrics from heartbeats
  3. Security: No SSH keys to distribute and no inbound ports on nodes; each node only makes outbound HTTPS requests to the coordinator
  4. Scalability: Works with any number of nodes
  5. Fault Tolerance: Nodes can rejoin after network issues
  6. Docker Friendly: No host network access required
  7. Cloud Ready: Works across NAT, VPCs, and different networks, because nodes initiate every connection

🚀 Implementation Priority

  1. High Priority: Database schema, registration endpoints, basic heartbeat
  2. Medium Priority: Bzzz client integration, hardware detection
  3. Low Priority: Advanced UI features, token management UI

📝 Implementation Status

  • Phase 1.1: Database schema migration
  • Phase 1.2: Registration API endpoints
  • Phase 2.1: Bzzz environment variable support
  • Phase 2.2: System hardware detection module
  • Phase 2.3: Registration client logic
  • Phase 3.1: Remove hardcoded cluster nodes
  • Phase 3.2: Frontend cluster management UI
  • Phase 3.3: Full Bzzz integration

Related Files

  • /backend/app/services/cluster_service.py - Current hardcoded implementation
  • /backend/app/api/cluster.py - Cluster API endpoints
  • /backend/migrations/ - Database schema changes
  • /frontend/src/components/cluster/ - Cluster UI components

Created: 2025-07-31
Status: Planning Phase
Priority: High
Impact: Solves fundamental hardware detection and cluster management issues