# 🏗️ Hive Architecture Documentation

## System Overview
Hive is designed as a microservices architecture with clear separation of concerns, real-time communication, and scalable agent management.

### Core Services Architecture

```mermaid
graph TB
    subgraph "Frontend Layer"
        UI[React Dashboard]
        WS_CLIENT[WebSocket Client]
        API_CLIENT[API Client]
    end

    subgraph "API Gateway"
        NGINX[Nginx/Traefik]
        AUTH[Authentication Middleware]
        RATE_LIMIT[Rate Limiting]
    end

    subgraph "Backend Services"
        COORDINATOR[Hive Coordinator]
        WORKFLOW_ENGINE[Workflow Engine]
        AGENT_MANAGER[Agent Manager]
        PERF_MONITOR[Performance Monitor]
        MCP_BRIDGE[MCP Bridge]
    end

    subgraph "Data Layer"
        POSTGRES[(PostgreSQL)]
        REDIS[(Redis Cache)]
        INFLUX[(InfluxDB Metrics)]
    end

    subgraph "Agent Network"
        ACACIA[ACACIA Agent]
        WALNUT[WALNUT Agent]
        IRONWOOD[IRONWOOD Agent]
        AGENTS[... Additional Agents]
    end

    UI --> NGINX
    WS_CLIENT --> NGINX
    API_CLIENT --> NGINX
    NGINX --> AUTH
    AUTH --> COORDINATOR
    AUTH --> WORKFLOW_ENGINE
    AUTH --> AGENT_MANAGER
    COORDINATOR --> POSTGRES
    COORDINATOR --> REDIS
    COORDINATOR --> PERF_MONITOR
    WORKFLOW_ENGINE --> MCP_BRIDGE
    AGENT_MANAGER --> ACACIA
    AGENT_MANAGER --> WALNUT
    AGENT_MANAGER --> IRONWOOD
    PERF_MONITOR --> INFLUX
```

## Component Specifications

### 🧠 Hive Coordinator

**Purpose:** Central orchestration service that manages task distribution, workflow execution, and system coordination.

**Key Responsibilities:**
- Task queue management with priority scheduling
- Agent assignment based on capabilities and availability
- Workflow lifecycle management
- Real-time status coordination
- Performance metrics aggregation

**API Endpoints:**

```http
POST   /api/tasks                  # Create a new task
GET    /api/tasks/{id}             # Get task status
PUT    /api/tasks/{id}/assign      # Assign task to agent
DELETE /api/tasks/{id}             # Cancel task
GET    /api/status/cluster         # Overall cluster status
GET    /api/status/agents          # All agent statuses
GET    /api/metrics/performance    # Performance metrics
```
**Database Schema:**

```sql
CREATE TABLE tasks (
    id UUID PRIMARY KEY,
    title VARCHAR(255),
    description TEXT,
    priority INTEGER,
    status task_status_enum,
    assigned_agent_id UUID,
    created_at TIMESTAMP,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    metadata JSONB
);

CREATE TABLE task_dependencies (
    task_id UUID REFERENCES tasks(id),
    depends_on_task_id UUID REFERENCES tasks(id),
    PRIMARY KEY (task_id, depends_on_task_id)
);
```
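
For a concrete view of the task lifecycle, the sketch below creates a task and polls its status through the endpoints above. It is illustrative only: the coordinator address and the payload fields (mirroring the `tasks` schema) are assumptions.

```python
import requests

HIVE_API = "http://localhost:8000"  # assumed coordinator address

# Create a task; the payload fields mirror the tasks table above
resp = requests.post(
    f"{HIVE_API}/api/tasks",
    json={
        "title": "Refactor auth middleware",
        "description": "Split token validation out of the request handler",
        "priority": 2,
        "metadata": {"repo": "hive", "requested_by": "dev-team"},
    },
    timeout=10,
)
resp.raise_for_status()
task = resp.json()

# Poll the task until the coordinator reports a terminal status
status = requests.get(f"{HIVE_API}/api/tasks/{task['id']}", timeout=10).json()
print(status["status"])
```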

### 🤖 Agent Manager

**Purpose:** Manages the lifecycle, health, and capabilities of all AI agents in the network.

**Key Responsibilities:**
- Agent registration and discovery
- Health monitoring and heartbeat tracking
- Capability assessment and scoring
- Load balancing and routing decisions
- Performance benchmarking

**Agent Registration Protocol:**

```json
{
  "agent_id": "acacia",
  "name": "ACACIA Infrastructure Specialist",
  "endpoint": "http://192.168.1.72:11434",
  "model": "deepseek-r1:7b",
  "capabilities": [
    {"name": "devops", "proficiency": 0.95},
    {"name": "architecture", "proficiency": 0.90},
    {"name": "deployment", "proficiency": 0.88}
  ],
  "hardware": {
    "gpu_type": "AMD Radeon RX 7900 XTX",
    "vram_gb": 24,
    "cpu_cores": 16,
    "ram_gb": 64
  },
  "performance_targets": {
    "min_tps": 15,
    "max_response_time": 30
  }
}
```
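
Given registrations like this, assignment can rank agents by declared proficiency for a task's required capabilities. A minimal scoring sketch (the averaging and the availability filter are illustrative assumptions, not the production algorithm):

```python
from typing import Dict, List, Optional

def score_agent(agent: Dict, required: List[str]) -> float:
    """Average declared proficiency across the capabilities a task requires."""
    profs = {c["name"]: c["proficiency"] for c in agent.get("capabilities", [])}
    if not all(name in profs for name in required):
        return 0.0  # agent lacks a required capability
    return sum(profs[name] for name in required) / len(required)

def pick_agent(agents: List[Dict], required: List[str]) -> Optional[Dict]:
    """Choose the highest-scoring available agent, or None if nothing qualifies."""
    candidates = [(score_agent(a, required), a)
                  for a in agents if a.get("available", True)]
    best_score, best_agent = max(candidates, key=lambda p: p[0], default=(0.0, None))
    return best_agent if best_score > 0 else None
```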

**Health Check System:**

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class AgentHealthCheck:
    agent_id: str
    timestamp: datetime
    response_time: float
    tokens_per_second: float
    cpu_usage: float
    memory_usage: float
    gpu_usage: float
    available: bool
    error_message: Optional[str] = None
```
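
A polling loop might populate these records. The sketch below probes an Ollama-style agent endpoint with `httpx`; the `/api/tags` liveness path, the timeout, and the zeroed resource fields (which a real monitor would fill from node exporters and benchmark runs) are assumptions.

```python
import time
from datetime import datetime

import httpx

async def check_agent(client: httpx.AsyncClient, agent_id: str,
                      endpoint: str) -> AgentHealthCheck:
    """Probe one agent; reuses the AgentHealthCheck dataclass above."""
    start = time.monotonic()
    try:
        # Lightweight liveness probe; Ollama-style agents answer GET /api/tags
        resp = await client.get(f"{endpoint}/api/tags", timeout=5.0)
        resp.raise_for_status()
        return AgentHealthCheck(
            agent_id=agent_id, timestamp=datetime.utcnow(),
            response_time=time.monotonic() - start,
            tokens_per_second=0.0,  # filled in by separate benchmark runs
            cpu_usage=0.0, memory_usage=0.0, gpu_usage=0.0,  # from node exporters
            available=True,
        )
    except httpx.HTTPError as exc:
        return AgentHealthCheck(
            agent_id=agent_id, timestamp=datetime.utcnow(),
            response_time=time.monotonic() - start,
            tokens_per_second=0.0, cpu_usage=0.0, memory_usage=0.0,
            gpu_usage=0.0, available=False, error_message=str(exc),
        )
```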

### 🔄 Workflow Engine

**Purpose:** Executes n8n-compatible workflows with real-time monitoring and MCP integration.

**Core Components:**
- **n8n Parser**: Converts n8n JSON into an executable workflow graph
- **Execution Engine**: Manages workflow execution with dependency resolution
- **MCP Bridge**: Translates workflow nodes into MCP tool calls
- **Progress Tracker**: Streams real-time execution status and metrics

**Workflow Execution Flow:**

```python
class WorkflowExecution:
    async def execute(self, workflow: Workflow, input_data: Dict) -> ExecutionResult:
        # Parse the workflow into an execution graph
        graph = self.parser.parse_n8n_workflow(workflow.n8n_data)

        # Validate dependencies and create an execution plan
        execution_plan = self.planner.create_execution_plan(graph)

        # Execute nodes in dependency order, tracking the last node's output
        node_result = input_data
        for step in execution_plan:
            node_result = await self.execute_node(step, input_data)
            await self.emit_progress_update(step, node_result)

        return ExecutionResult(status="completed", output=node_result)
```

**WebSocket Events:**

```typescript
interface WorkflowEvent {
  type: 'execution_started' | 'node_completed' | 'execution_completed' | 'error';
  execution_id: string;
  workflow_id: string;
  timestamp: string;
  data: {
    node_id?: string;
    progress?: number;
    result?: any;
    error?: string;
  };
}
```

### 📊 Performance Monitor

**Purpose:** Collects, analyzes, and visualizes system and agent performance metrics.

**Metrics Collection:**

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class PerformanceMetrics:
    # System metrics
    cpu_usage: float
    memory_usage: float
    disk_usage: float
    network_io: Dict[str, float]

    # AI-specific metrics
    tokens_per_second: float
    response_time: float
    queue_length: int
    active_tasks: int

    # GPU metrics (if available)
    gpu_usage: float
    gpu_memory: float
    gpu_temperature: float

    # Quality metrics
    success_rate: float
    error_rate: float
    retry_count: int
```

**Alert System:**

```yaml
alerts:
  high_cpu:
    condition: "cpu_usage > 85"
    severity: "warning"
    cooldown: 300   # 5 minutes
  agent_down:
    condition: "agent_available == false"
    severity: "critical"
    cooldown: 60    # 1 minute
  slow_response:
    condition: "avg_response_time > 60"
    severity: "warning"
    cooldown: 180   # 3 minutes
```
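
Evaluating these rules amounts to checking a condition and suppressing repeats inside the cooldown window. A minimal sketch (the in-memory bookkeeping is illustrative; a real deployment would evaluate against live metrics):

```python
import time
from typing import Callable, Dict

# Alert name -> monotonic time it last fired
_last_fired: Dict[str, float] = {}

def evaluate_alert(name: str, condition: Callable[[], bool], cooldown: int) -> bool:
    """Fire an alert at most once per cooldown window."""
    if not condition():
        return False
    now = time.monotonic()
    if now - _last_fired.get(name, float("-inf")) < cooldown:
        return False  # still cooling down
    _last_fired[name] = now
    return True

# Mirrors the high_cpu rule above (threshold and cooldown taken from the YAML)
metrics = {"cpu_usage": 91.0}
if evaluate_alert("high_cpu", lambda: metrics["cpu_usage"] > 85, cooldown=300):
    print("WARNING: high_cpu fired")
```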

### 🌉 MCP Bridge

**Purpose:** Provides standardized integration between n8n workflows and MCP (Model Context Protocol) servers.

**Protocol Translation:**

```python
class MCPBridge:
    async def translate_n8n_node(self, node: N8nNode) -> MCPTool:
        """Convert n8n node to MCP tool specification"""
        match node.type:  # structural pattern matching requires Python 3.10+
            case "n8n-nodes-base.httpRequest":
                return MCPTool(
                    name="http_request",
                    description=node.parameters.get("description", ""),
                    input_schema=self.extract_input_schema(node),
                    function=self.create_http_handler(node.parameters),
                )
            case "n8n-nodes-base.code":
                return MCPTool(
                    name="code_execution",
                    description="Execute custom code",
                    input_schema={"code": "string", "language": "string"},
                    function=self.create_code_handler(node.parameters),
                )
```

**MCP Server Registry:**

```json
{
  "servers": {
    "comfyui": {
      "endpoint": "ws://localhost:8188/api/mcp",
      "capabilities": ["image_generation", "image_processing"],
      "version": "1.0.0",
      "status": "active"
    },
    "code_review": {
      "endpoint": "http://localhost:8000/mcp",
      "capabilities": ["code_analysis", "security_scan"],
      "version": "1.2.0",
      "status": "active"
    }
  }
}
```
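
With a registry like this, the bridge can resolve a capability to an endpoint before dispatching a tool call. A minimal lookup sketch (dispatch itself is elided; first-match routing is an assumption):

```python
from typing import Dict, Optional

def find_server(registry: Dict, capability: str) -> Optional[str]:
    """Return the endpoint of an active server advertising the capability."""
    for server in registry["servers"].values():
        if server["status"] == "active" and capability in server["capabilities"]:
            return server["endpoint"]
    return None

# Example against the registry document above
registry = {"servers": {"comfyui": {
    "endpoint": "ws://localhost:8188/api/mcp",
    "capabilities": ["image_generation", "image_processing"],
    "version": "1.0.0", "status": "active"}}}
print(find_server(registry, "image_generation"))  # ws://localhost:8188/api/mcp
```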

## Data Layer Design

### 🗄️ Database Schema

**Core Tables:**

```sql
-- Agent management
CREATE TABLE agents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(255) NOT NULL,
    endpoint VARCHAR(512) NOT NULL,
    model VARCHAR(255),
    specialization VARCHAR(100),
    hardware_config JSONB,
    capabilities JSONB,
    status agent_status DEFAULT 'offline',
    created_at TIMESTAMP DEFAULT NOW(),
    last_seen TIMESTAMP
);

-- Workflow management
CREATE TABLE workflows (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name VARCHAR(255) NOT NULL,
    description TEXT,
    n8n_data JSONB NOT NULL,
    mcp_tools JSONB,
    created_by UUID REFERENCES users(id),
    version INTEGER DEFAULT 1,
    active BOOLEAN DEFAULT true,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Execution tracking
CREATE TABLE executions (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    workflow_id UUID REFERENCES workflows(id),
    status execution_status DEFAULT 'pending',
    input_data JSONB,
    output_data JSONB,
    error_message TEXT,
    started_at TIMESTAMP,
    completed_at TIMESTAMP,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Performance metrics (time series)
CREATE TABLE agent_metrics (
    agent_id UUID REFERENCES agents(id),
    timestamp TIMESTAMP NOT NULL,
    metrics JSONB NOT NULL,
    PRIMARY KEY (agent_id, timestamp)
);

CREATE INDEX idx_agent_metrics_timestamp ON agent_metrics(timestamp);
-- Note: a separate (agent_id, timestamp) index is unnecessary;
-- the primary key above already provides it.
```

**Indexing Strategy:**

```sql
-- Performance optimization indexes
CREATE INDEX idx_tasks_status ON tasks(status) WHERE status IN ('pending', 'running');
CREATE INDEX idx_tasks_priority ON tasks(priority DESC, created_at ASC);
CREATE INDEX idx_executions_workflow_status ON executions(workflow_id, status);
-- Note: PostgreSQL requires partial-index predicates to be immutable, so a
-- "recent metrics" index cannot use NOW(); rely on the timestamp index above,
-- or partition agent_metrics by time instead.
```

### 🔄 Caching Strategy

**Redis Cache Layout:**

```text
# Agent status cache (TTL: 30 seconds)
agent:status:{agent_id} -> {status, last_seen, performance}

# Task queue cache
task:queue:high   -> [task_id_1, task_id_2, ...]
task:queue:medium -> [task_id_3, task_id_4, ...]
task:queue:low    -> [task_id_5, task_id_6, ...]

# Workflow cache (TTL: 5 minutes)
workflow:{workflow_id} -> {serialized_workflow_data}

# Performance metrics cache (TTL: 1 minute)
metrics:cluster          -> {aggregated_cluster_metrics}
metrics:agent:{agent_id} -> {recent_agent_metrics}
```
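
A thin cache layer can enforce these key names and TTLs. A sketch using `redis.asyncio` (the JSON serialization and helper names are illustrative):

```python
import json
from typing import Optional

import redis.asyncio as redis

async def cache_agent_status(r: redis.Redis, agent_id: str, status: dict) -> None:
    # TTL matches the 30-second agent-status window above
    await r.set(f"agent:status:{agent_id}", json.dumps(status), ex=30)

async def get_agent_status(r: redis.Redis, agent_id: str) -> Optional[dict]:
    raw = await r.get(f"agent:status:{agent_id}")
    return json.loads(raw) if raw else None

async def enqueue_task(r: redis.Redis, priority: str, task_id: str) -> None:
    # Queues are plain Redis lists; priority is encoded in the key name
    await r.rpush(f"task:queue:{priority}", task_id)
```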

## Real-time Communication

### 🔌 WebSocket Architecture

**Connection Management:**

```typescript
interface WebSocketConnection {
  id: string;
  userId: string;
  subscriptions: Set<string>;  // Topic subscriptions
  lastPing: Date;
  authenticated: boolean;
}

// Alert severities used by the monitor (see Alert System above)
type Severity = 'warning' | 'critical';

// Subscription topics
type SubscriptionTopic =
  | `agent.${string}`        // Specific agent updates
  | `execution.${string}`    // Specific execution updates
  | `cluster.status`         // Overall cluster status
  | `alerts.${Severity}`     // Alerts by severity
  | `user.${string}`;        // User-specific notifications
```

**Message Protocol:**

```typescript
interface WebSocketMessage {
  id: string;
  type: 'subscribe' | 'unsubscribe' | 'data' | 'error' | 'ping' | 'pong';
  topic?: string;
  data?: any;
  timestamp: string;
}
```

Example message on the `agent.acacia` topic:

```json
{
  "id": "msg_123",
  "type": "data",
  "topic": "agent.acacia",
  "data": {
    "status": "busy",
    "current_task": "task_456",
    "performance": {
      "tps": 18.5,
      "cpu_usage": 67.2
    }
  },
  "timestamp": "2025-07-06T12:00:00Z"
}
```

### 📡 Event Streaming

**Event Bus Architecture:**

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Any, Callable, Dict, Optional

@dataclass
class HiveEvent:
    id: str
    type: str
    source: str
    timestamp: datetime
    data: Dict[str, Any]
    correlation_id: Optional[str] = None

class EventBus:
    async def publish(self, event: HiveEvent) -> None:
        """Publish event to all subscribers"""

    async def subscribe(self, event_type: str, handler: Callable) -> str:
        """Subscribe to specific event types"""

    async def unsubscribe(self, subscription_id: str) -> None:
        """Remove subscription"""
```

**Event Types:**

```python
# Agent events
AGENT_REGISTERED = "agent.registered"
AGENT_STATUS_CHANGED = "agent.status_changed"
AGENT_PERFORMANCE_UPDATE = "agent.performance_update"

# Task events
TASK_CREATED = "task.created"
TASK_ASSIGNED = "task.assigned"
TASK_STARTED = "task.started"
TASK_COMPLETED = "task.completed"
TASK_FAILED = "task.failed"

# Workflow events
WORKFLOW_EXECUTION_STARTED = "workflow.execution_started"
WORKFLOW_NODE_COMPLETED = "workflow.node_completed"
WORKFLOW_EXECUTION_COMPLETED = "workflow.execution_completed"

# System events
SYSTEM_ALERT = "system.alert"
SYSTEM_MAINTENANCE = "system.maintenance"
```

## Security Architecture

### 🔒 Authentication & Authorization

**JWT Token Structure:**

```json
{
  "sub": "user_id",
  "iat": 1625097600,
  "exp": 1625184000,
  "roles": ["admin", "developer"],
  "permissions": [
    "workflows.create",
    "agents.manage",
    "executions.view"
  ],
  "tenant": "organization_id"
}
```

**Permission Matrix:**

```yaml
roles:
  admin:
    permissions: ["*"]
    description: "Full system access"
  developer:
    permissions:
      - "workflows.*"
      - "executions.*"
      - "agents.view"
      - "tasks.create"
    description: "Development and execution access"
  viewer:
    permissions:
      - "workflows.view"
      - "executions.view"
      - "agents.view"
    description: "Read-only access"
```

### 🛡️ API Security

**Rate Limiting:**

```python
# Rate limits by endpoint and user role (0 = not permitted)
RATE_LIMITS = {
    "api.workflows.create": {"admin": 100, "developer": 50, "viewer": 0},
    "api.executions.start": {"admin": 200, "developer": 100, "viewer": 0},
    "api.agents.register": {"admin": 10, "developer": 0, "viewer": 0},
}
```
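
One way to enforce these budgets is a fixed-window counter per user and endpoint. A sketch (the one-minute window is an assumption, since the table above does not state its unit; production systems typically keep the counters in Redis so limits survive restarts):

```python
import time
from collections import defaultdict
from typing import Dict, List, Tuple

# (user_id, endpoint) -> [window_start, request_count]
_windows: Dict[Tuple[str, str], List[float]] = defaultdict(lambda: [0.0, 0])

def allow_request(user_id: str, role: str, endpoint: str,
                  window: float = 60.0) -> bool:
    """Fixed-window check against the per-role budgets above."""
    limit = RATE_LIMITS.get(endpoint, {}).get(role, 0)
    if limit <= 0:
        return False  # role has no budget for this endpoint
    now = time.monotonic()
    state = _windows[(user_id, endpoint)]
    if now - state[0] >= window:
        state[0], state[1] = now, 0  # start a fresh window
    state[1] += 1
    return state[1] <= limit
```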

**Input Validation:**

```python
from typing import Any, Dict, Optional

from pydantic import BaseModel, validator

class WorkflowCreateRequest(BaseModel):
    name: str
    description: Optional[str]
    n8n_data: Dict[str, Any]

    @validator('name')
    def validate_name(cls, v):
        if len(v) < 3 or len(v) > 255:
            raise ValueError('Name must be 3-255 characters')
        return v

    @validator('n8n_data')
    def validate_n8n_data(cls, v):
        required_fields = ['nodes', 'connections']
        if not all(field in v for field in required_fields):
            raise ValueError('Invalid n8n workflow format')
        return v
```

## Deployment Architecture

### 🐳 Container Strategy

**Docker Compose Structure:**

```yaml
version: '3.8'

services:
  hive-coordinator:
    image: hive/coordinator:latest
    environment:
      - DATABASE_URL=postgresql://user:pass@postgres:5432/hive
      - REDIS_URL=redis://redis:6379
    depends_on: [postgres, redis]

  hive-frontend:
    image: hive/frontend:latest
    environment:
      - API_URL=http://hive-coordinator:8000
    depends_on: [hive-coordinator]

  postgres:
    image: postgres:15
    environment:
      - POSTGRES_DB=hive
      - POSTGRES_USER=hive
      - POSTGRES_PASSWORD=${DB_PASSWORD}
    volumes:
      - postgres_data:/var/lib/postgresql/data

  redis:
    image: redis:7-alpine
    volumes:
      - redis_data:/data

  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana:latest
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}
    volumes:
      - grafana_data:/var/lib/grafana

# Named volumes must be declared for Compose to create them
volumes:
  postgres_data:
  redis_data:
  grafana_data:
```

### 🌐 Network Architecture

**Production Network Topology:**

```text
Internet
    ↓
[Traefik Load Balancer] (SSL termination)
    ↓
[tengig Overlay Network]
    ↓
┌─────────────────────────────────────┐
│     Hive Application Services       │
│  ├── Frontend (React)               │
│  ├── Backend API (FastAPI)          │
│  ├── WebSocket Gateway              │
│  └── Task Queue Workers             │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│          Data Services              │
│  ├── PostgreSQL (Primary DB)        │
│  ├── Redis (Cache + Sessions)       │
│  ├── InfluxDB (Metrics)             │
│  └── Prometheus (Monitoring)        │
└─────────────────────────────────────┘
    ↓
┌─────────────────────────────────────┐
│         AI Agent Network            │
│  ├── ACACIA   (192.168.1.72:11434)  │
│  ├── WALNUT   (192.168.1.27:11434)  │
│  ├── IRONWOOD (192.168.1.113:11434) │
│  └── [Additional Agents...]         │
└─────────────────────────────────────┘
```

## Performance Considerations

### 🚀 Optimization Strategies

**Database Optimization:**
- Connection pooling with asyncpg
- Query optimization with proper indexing
- Time-series data partitioning for metrics
- Read replicas for analytics queries

**Caching Strategy:**
- Redis for session and temporary data
- Application-level caching for expensive computations
- CDN for static assets
- Database query result caching

**Concurrency Management:**
- AsyncIO for I/O-bound operations
- Connection pools for database and HTTP clients
- Semaphores for limiting concurrent agent requests (see the sketch after this list)
- Queue-based task processing
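
As referenced above, a semaphore is the simplest way to cap concurrent agent requests. A sketch against an Ollama-style `/api/generate` endpoint (the endpoint shape and the limit of 4 are assumptions):

```python
import asyncio
from typing import Any, Dict, List

import httpx

async def query_agent(client: httpx.AsyncClient, sem: asyncio.Semaphore,
                      endpoint: str, prompt: str) -> Dict[str, Any]:
    async with sem:  # blocks while the pool is saturated
        resp = await client.post(
            f"{endpoint}/api/generate",  # Ollama-style generate endpoint (assumption)
            json={"model": "deepseek-r1:7b", "prompt": prompt, "stream": False},
            timeout=60.0,
        )
        resp.raise_for_status()
        return resp.json()

async def fan_out(endpoint: str, prompts: List[str],
                  limit: int = 4) -> List[Dict[str, Any]]:
    sem = asyncio.Semaphore(limit)  # cap concurrent requests; 4 is illustrative
    async with httpx.AsyncClient() as client:
        return await asyncio.gather(
            *(query_agent(client, sem, endpoint, p) for p in prompts))
```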

### 📊 Monitoring & Observability

**Key Metrics:**

```text
# Application metrics
- hive_active_agents_total
- hive_task_queue_length
- hive_workflow_executions_total
- hive_api_request_duration_seconds
- hive_websocket_connections_active

# Infrastructure metrics
- hive_database_connections_active
- hive_redis_memory_usage_bytes
- hive_container_cpu_usage_percent
- hive_container_memory_usage_bytes

# Business metrics
- hive_workflows_created_daily
- hive_execution_success_rate
- hive_agent_utilization_percent
- hive_average_task_completion_time
```
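
These metrics can be registered with the standard Prometheus Python client. A sketch defining a few of them (the label names and scrape port are assumptions):

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Definitions for a few of the metrics listed above
ACTIVE_AGENTS = Gauge("hive_active_agents_total",
                      "Agents currently registered and healthy")
TASK_QUEUE_LENGTH = Gauge("hive_task_queue_length", "Queued tasks", ["priority"])
EXECUTIONS = Counter("hive_workflow_executions_total",
                     "Workflow executions", ["status"])
REQUEST_DURATION = Histogram("hive_api_request_duration_seconds",
                             "API latency", ["endpoint"])

# Expose /metrics for Prometheus to scrape (port is an assumption)
start_http_server(8001)

ACTIVE_AGENTS.set(3)
TASK_QUEUE_LENGTH.labels(priority="high").set(2)
EXECUTIONS.labels(status="completed").inc()
with REQUEST_DURATION.labels(endpoint="/api/tasks").time():
    pass  # timed work goes here
```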

**Alerting Rules:**

```yaml
groups:
  - name: hive.rules
    rules:
      - alert: HighErrorRate
        expr: rate(hive_api_errors_total[5m]) > 0.1
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"

      - alert: AgentDown
        expr: hive_agent_health_status == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Agent {{ $labels.agent_id }} is down"
```
This architecture provides a solid foundation for the unified Hive platform, combining the best practices from our existing distributed AI projects while ensuring scalability, maintainability, and observability.