Major updates and improvements to BZZZ system

- Updated configuration and deployment files
- Improved system architecture and components
- Enhanced documentation and testing
- Fixed various issues and added new features

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-09-17 18:06:57 +10:00
parent 4e6140de03
commit f5f96ba505
71 changed files with 664 additions and 3823 deletions
--- a/deployments/bare-metal/config-ui/requirements.md
+++ b/deployments/bare-metal/config-ui/requirements.md
@@ -0,0 +1,337 @@
+# BZZZ Configuration Web Interface Requirements
+
+## Overview
+A web-based configuration interface that guides users through setting up their BZZZ cluster after the initial installation.
+
+## User Information Requirements
+
+### 1. Cluster Infrastructure Configuration
+
+#### Network Settings
+- **Subnet IP Range** (CIDR notation)
+  - Auto-detected from system
+  - User can override (e.g., `192.168.1.0/24`)
+  - Validation of CIDR format
+  - Conflict detection with existing networks
+
+- **Node Discovery Method**
+  - Option 1: Automatic discovery via broadcast
+  - Option 2: Manual IP address list
+  - Option 3: DNS-based discovery
+  - Integration with existing network infrastructure
+
+- **Network Interface Selection**
+  - Dropdown of available interfaces
+  - Auto-select primary interface
+  - Show interface details (IP, status, speed)
+  - Validation for interface accessibility
+
+- **Port Configuration**
+  - BZZZ Go Service Port (default: 8080)
+  - MCP Server Port (default: 3000)
+  - Web UI Port (default: 8080)
+  - WebSocket Port (default: 8081)
+  - Reserved port range exclusions
+  - Port conflict detection
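
The subnet requirements above (CIDR validation plus conflict detection) could be sketched roughly as follows. This is an illustration only, not the BZZZ implementation: the function name `validateSubnet` and the idea of passing existing networks in as strings are assumptions, and the auto-detection step that would supply those networks is not shown.

```go
package main

import (
	"fmt"
	"net/netip"
)

// validateSubnet parses a user-supplied CIDR string and rejects it when it is
// malformed, has host bits set, or overlaps a network already in use.
// The "existing" list is assumed to come from system auto-detection.
func validateSubnet(cidr string, existing []string) (netip.Prefix, error) {
	p, err := netip.ParsePrefix(cidr)
	if err != nil {
		return netip.Prefix{}, fmt.Errorf("invalid CIDR %q: %w", cidr, err)
	}
	// Require the canonical network address (192.168.1.0/24, not 192.168.1.7/24).
	if p != p.Masked() {
		return netip.Prefix{}, fmt.Errorf("%s has host bits set; did you mean %s?", p, p.Masked())
	}
	for _, raw := range existing {
		e, err := netip.ParsePrefix(raw)
		if err != nil {
			return netip.Prefix{}, fmt.Errorf("invalid existing network %q: %w", raw, err)
		}
		if p.Overlaps(e) {
			return netip.Prefix{}, fmt.Errorf("%s conflicts with existing network %s", p, e)
		}
	}
	return p, nil
}

func main() {
	existing := []string{"10.0.0.0/8"} // hypothetical networks already in use
	inputs := []string{"192.168.1.0/24", "192.168.1.7/24", "10.1.0.0/16", "not-a-cidr"}
	for _, in := range inputs {
		if p, err := validateSubnet(in, existing); err != nil {
			fmt.Println("rejected:", err)
		} else {
			fmt.Println("accepted:", p)
		}
	}
}
```

Rejecting prefixes with host bits set is a design choice made for this sketch; an interface could equally normalize the input with `Masked()` and show the user the corrected value.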
+
+#### Firewall & Security
+- **Firewall Configuration**
+  - Auto-configure firewall rules (ufw/iptables)
+  - Manual firewall setup instructions
+  - Port testing and validation
+  - Network connectivity verification
+
+### 2. Authentication & Security Setup
+
+#### SSH Key Management
+- **SSH Key Options**
+  - Generate new SSH key pair
+  - Upload existing public key
+  - Use existing system SSH keys
+  - Key distribution to cluster nodes
+
+- **SSH Access Configuration**
+  - SSH username for cluster access
+  - Sudo privileges configuration
+  - SSH port (default: 22)
+  - Key-based vs password authentication
+
+#### Security Settings
+- **TLS/SSL Configuration**
+  - Generate self-signed certificates
+  - Upload existing certificates
+  - Let's Encrypt integration
+  - Certificate distribution
+
+- **Authentication Methods**
+  - Token-based authentication
+  - OAuth2 integration
+  - LDAP/Active Directory
+  - Local user management
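
For the token-based option above, one common pattern is to issue a random token, store only its hash, and compare in constant time. The sketch below assumes that pattern; the requirements do not say how BZZZ tokens are generated or stored, and the function names are hypothetical.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"crypto/subtle"
	"encoding/hex"
	"fmt"
)

// newToken returns a random 256-bit token, hex-encoded (64 characters).
func newToken() (string, error) {
	b := make([]byte, 32)
	if _, err := rand.Read(b); err != nil {
		return "", err
	}
	return hex.EncodeToString(b), nil
}

// hashToken returns the SHA-256 digest of a token so that only the digest,
// never the token itself, needs to be persisted.
func hashToken(token string) [32]byte {
	return sha256.Sum256([]byte(token))
}

// verifyToken hashes the presented token and compares it with the stored
// digest in constant time to avoid leaking information through timing.
func verifyToken(presented string, stored [32]byte) bool {
	h := hashToken(presented)
	return subtle.ConstantTimeCompare(h[:], stored[:]) == 1
}

func main() {
	tok, err := newToken()
	if err != nil {
		fmt.Println("token generation failed:", err)
		return
	}
	stored := hashToken(tok)
	fmt.Println("valid token accepted:", verifyToken(tok, stored))
	fmt.Println("wrong token accepted:", verifyToken("guess", stored))
}
```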
+
+### 3. AI Model Configuration
+
+#### OpenAI Integration
+- **API Key Management**
+  - Secure API key input
+  - Key validation and testing
+  - Organization and project settings
+  - Usage monitoring setup
+
+- **Model Preferences**
+  - Default model selection (GPT-5)
+  - Model-to-task mapping
+  - Custom model parameters
+  - Fallback model configuration
+
+#### Local AI Models (Ollama/Parallama)
+- **Ollama/Parallama Installation**
+  - Option to install standard Ollama
+  - Option to install Parallama (multi-GPU fork)
+  - Auto-detect existing Ollama installations
+  - Upgrade/migrate from Ollama to Parallama
+
+- **Node Discovery & Configuration**
+  - Auto-discover Ollama/Parallama instances
+  - Manual endpoint configuration
+  - Model availability checking
+  - Load balancing preferences
+  - GPU assignment for Parallama
+
+- **Multi-GPU Configuration (Parallama)**
+  - GPU topology detection
+  - Model sharding across GPUs
+  - Memory allocation per GPU
+  - Performance optimization settings
+  - GPU failure handling
+
+- **Model Distribution Strategy**
+  - Which models on which nodes
+  - GPU-specific model placement
+  - Automatic model pulling
+  - Storage requirements
+  - Model update policies
+
+### 4. Cost Management
+
+#### Spending Limits
+- **Daily Limits** (USD)
+  - Per-user limits
+  - Per-project limits
+  - Global daily limit
+  - Warning thresholds
+
+- **Monthly Limits** (USD)
+  - Budget allocation
+  - Automatic budget reset
+  - Cost tracking granularity
+  - Billing integration
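
A minimal sketch of how per-user, per-project, and global limits with warning thresholds might be evaluated together. The types, field names, and the "first exceeded scope denies" rule are assumptions made for illustration; the requirements only list which limits exist.

```go
package main

import "fmt"

// Decision is the outcome of checking a request against spending limits.
type Decision int

const (
	Allow Decision = iota
	Warn
	Deny
)

// Limit describes one spending scope. Amounts are integer cents to avoid
// floating-point rounding on currency values.
type Limit struct {
	Scope       string // hypothetical label: "user", "project", "global"
	LimitCents  int64  // hard limit for the period
	SpentCents  int64  // amount already spent in the period
	WarnPercent int64  // warn once projected spend reaches this % of the limit
}

// evaluate checks a request of costCents against every scope. Any scope whose
// limit would be exceeded denies the request; otherwise the first scope that
// crosses its warning threshold is reported.
func evaluate(limits []Limit, costCents int64) (Decision, string) {
	decision, scope := Allow, ""
	for _, l := range limits {
		projected := l.SpentCents + costCents
		if projected > l.LimitCents {
			return Deny, l.Scope
		}
		if decision == Allow && projected*100 >= l.LimitCents*l.WarnPercent {
			decision, scope = Warn, l.Scope
		}
	}
	return decision, scope
}

func main() {
	limits := []Limit{
		{Scope: "user", LimitCents: 500, SpentCents: 100, WarnPercent: 80},
		{Scope: "project", LimitCents: 2000, SpentCents: 1500, WarnPercent: 80},
		{Scope: "global", LimitCents: 10000, SpentCents: 2000, WarnPercent: 80},
	}
	for _, cost := range []int64{50, 100, 450} {
		d, scope := evaluate(limits, cost)
		fmt.Printf("cost=%d decision=%d scope=%q\n", cost, d, scope)
	}
}
```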
+
+#### Cost Optimization
+- **Usage Monitoring**
+  - Real-time cost tracking
+  - Historical usage reports
+  - Cost per model/task type
+  - Optimization recommendations
+
+### 5. Hardware & Resource Detection
+
+#### System Resources
+- **CPU Configuration**
+  - Core count and allocation
+  - CPU affinity settings
+  - Performance optimization
+  - Load balancing
+
+- **Memory Management**
+  - Available RAM detection
+  - Memory allocation per service
+  - Swap configuration
+  - Memory monitoring
+
+- **Storage Configuration**
+  - Available disk space
+  - Storage paths for data/logs
+  - Backup storage locations
+  - Storage monitoring
+
+#### GPU Resources
+- **GPU Detection**
+  - NVIDIA CUDA support
+  - AMD ROCm support
+  - GPU memory allocation
+  - Multi-GPU configuration
+
+- **AI Workload Optimization**
+  - GPU scheduling
+  - Model-to-GPU assignment
+  - Power management
+  - Temperature monitoring
+
+### 6. Service Configuration
+
+#### Container Management
+- **Docker Configuration**
+  - Container registry selection
+  - Image pull policies
+  - Resource limits per container
+  - Container orchestration (Docker Swarm/K8s)
+
+- **Registry Settings**
+  - Public registry (Docker Hub)
+  - Private registry setup
+  - Authentication for registries
+  - Image versioning strategy
+
+#### Update Management
+- **Release Channels**
+  - Stable releases
+  - Beta releases
+  - Development builds
+  - Custom release sources
+
+- **Auto-Update Settings**
+  - Automatic updates enabled/disabled
+  - Update scheduling
+  - Rollback capabilities
+  - Update notifications
+
+### 7. Monitoring & Observability
+
+#### Logging Configuration
+- **Log Levels**
+  - Debug, Info, Warn, Error
+  - Per-component log levels
+  - Log rotation settings
+  - Centralized logging
+
+- **Log Destinations**
+  - Local file logging
+  - Syslog integration
+  - External log collectors
+  - Log retention policies
+
+#### Metrics & Monitoring
+- **Metrics Collection**
+  - Prometheus integration
+  - Custom metrics
+  - Performance monitoring
+  - Health checks
+
+- **Alerting**
+  - Alert rules configuration
+  - Notification channels
+  - Escalation policies
+  - Alert suppression
+
+### 8. Cluster Topology
+
+#### Node Roles
+- **Coordinator Nodes**
+  - Primary coordinator selection
+  - Coordinator failover
+  - Load balancing
+  - State synchronization
+
+- **Worker Nodes**
+  - Worker node capabilities
+  - Task scheduling preferences
+  - Resource allocation
+  - Worker health monitoring
+
+- **Storage Nodes**
+  - Distributed storage setup
+  - Replication factors
+  - Data consistency
+  - Backup strategies
+
+#### High Availability
+- **Failover Configuration**
+  - Automatic failover
+  - Manual failover procedures
+  - Split-brain prevention
+  - Recovery strategies
+
+- **Load Balancing**
+  - Load balancing algorithms
+  - Health check configuration
+  - Traffic distribution
+  - Performance optimization
+
+## Configuration Flow
+
+### Step 1: System Detection
+- Detect hardware resources
+- Identify network interfaces
+- Check system dependencies
+- Validate installation
+
+### Step 2: Network Configuration
+- Configure network settings
+- Set up firewall rules
+- Test connectivity
+- Validate port accessibility
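
Part of this step is catching services that claim the same port. The sketch below assumes a simple service-to-port map (the service labels are hypothetical) and uses the defaults listed under Port Configuration, where both the BZZZ Go service and the Web UI default to 8080.

```go
package main

import (
	"fmt"
	"sort"
)

// validatePort rejects values outside the valid TCP/UDP port range.
func validatePort(p int) error {
	if p < 1 || p > 65535 {
		return fmt.Errorf("port %d is outside 1-65535", p)
	}
	return nil
}

// findPortConflicts groups services by port and returns only the ports that
// are claimed by more than one service, with service names sorted.
func findPortConflicts(ports map[string]int) map[int][]string {
	byPort := make(map[int][]string)
	for svc, p := range ports {
		byPort[p] = append(byPort[p], svc)
	}
	conflicts := make(map[int][]string)
	for p, svcs := range byPort {
		if len(svcs) > 1 {
			sort.Strings(svcs)
			conflicts[p] = svcs
		}
	}
	return conflicts
}

func main() {
	// Defaults as listed in the requirements; the map keys are made-up labels.
	defaults := map[string]int{
		"bzzz-go-service": 8080,
		"mcp-server":      3000,
		"web-ui":          8080,
		"websocket":       8081,
	}
	for svc, p := range defaults {
		if err := validatePort(p); err != nil {
			fmt.Println(svc, "invalid:", err)
		}
	}
	for port, svcs := range findPortConflicts(defaults) {
		fmt.Printf("port %d is shared by %v\n", port, svcs)
	}
}
```

With the listed defaults this reports port 8080 as shared, so a single-host deployment would need one of the two defaults changed before both services start.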
+
+### Step 3: Security Setup
+- Configure authentication
+- Set up SSH access
+- Generate/install certificates
+- Test security settings
+
+### Step 4: AI Integration
+- Configure OpenAI API
+- Set up Ollama endpoints
+- Configure model preferences
+- Test AI connectivity
+
+### Step 5: Resource Allocation
+- Allocate CPU/memory
+- Configure storage paths
+- Set up GPU resources
+- Configure monitoring
+
+### Step 6: Service Deployment
+- Deploy BZZZ services
+- Configure service parameters
+- Start services
+- Validate service health
+
+### Step 7: Cluster Formation
+- Discover other nodes
+- Join/create cluster
+- Configure replication
+- Test cluster connectivity
+
+### Step 8: Testing & Validation
+- Run connectivity tests
+- Test AI model access
+- Validate security settings
+- Performance benchmarking
+
+## Technical Implementation
+
+### Frontend Framework
+- **React/Next.js** for modern UI
+- **Material-UI** or **Tailwind CSS** for components
+- **Real-time updates** via WebSocket
+- **Progressive Web App** capabilities
+
+### Backend API
+- **Go REST API** integrated with BZZZ service
+- **Configuration validation** and testing
+- **Real-time status updates**
+- **Secure configuration storage**
+
+### Configuration Persistence
+- **YAML configuration files**
+- **Environment variable generation**
+- **Docker Compose generation**
+- **Systemd service configuration**
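
Environment variable generation could look roughly like the sketch below, which flattens dotted setting names into `KEY=value` lines. The `BZZZ_` prefix, the dotted key convention, and the function name are assumptions for illustration; the requirements do not define a naming scheme.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// renderEnv converts flat settings such as "ports.web-ui" into lines like
// "BZZZ_PORTS_WEB_UI=8080". Keys are sorted so the output is stable across
// runs. Values containing newlines are rejected because they cannot be
// represented on a single KEY=value line.
func renderEnv(prefix string, settings map[string]string) (string, error) {
	keys := make([]string, 0, len(settings))
	for k := range settings {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	toName := strings.NewReplacer(".", "_", "-", "_")
	var b strings.Builder
	for _, k := range keys {
		v := settings[k]
		if strings.ContainsAny(v, "\r\n") {
			return "", fmt.Errorf("value for %q contains a newline", k)
		}
		fmt.Fprintf(&b, "%s%s=%s\n", prefix, strings.ToUpper(toName.Replace(k)), v)
	}
	return b.String(), nil
}

func main() {
	out, err := renderEnv("BZZZ_", map[string]string{
		"ports.web-ui": "8080",
		"ports.mcp":    "3000",
		"log.level":    "info",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

Sorting the keys keeps regenerated files diff-friendly, which matters if the generated output is checked into version control or compared between runs.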
+
+### Validation & Testing
+- **Network connectivity testing**
+- **Service health validation**
+- **Configuration syntax checking**
+- **Resource availability verification**
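
Network connectivity testing can be as simple as attempting a TCP connection with a timeout. The sketch below shows that one check under the assumption that reachability of a `host:port` endpoint is what needs verifying; `checkTCP` is a hypothetical helper, and the demo starts its own throwaway listener so it does not depend on any BZZZ service being up.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// checkTCP reports whether a TCP connection to addr ("host:port") can be
// opened within timeoutMs milliseconds. It returns nil on success.
func checkTCP(addr string, timeoutMs int) error {
	conn, err := net.DialTimeout("tcp", addr, time.Duration(timeoutMs)*time.Millisecond)
	if err != nil {
		return fmt.Errorf("cannot reach %s: %w", addr, err)
	}
	return conn.Close()
}

func main() {
	// Listen on an ephemeral loopback port so the demo has a reachable target.
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		fmt.Println("listen failed:", err)
		return
	}
	defer ln.Close()

	fmt.Println("listener reachable:", checkTCP(ln.Addr().String(), 500) == nil)
	fmt.Println("malformed address:", checkTCP("no-port-here", 500))
}
```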
+
+This configuration system is intended to let users set up and manage their BZZZ clusters regardless of their level of technical expertise.