Add environment configuration and local development documentation

- Parameterize CORS_ORIGINS in docker-compose.swarm.yml
- Add .env.example with configuration options
- Create comprehensive LOCAL_DEVELOPMENT.md guide
- Update README.md with environment variable documentation
- Provide alternatives for local development without production domain

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-07-10 18:20:52 +10:00
parent daf0766e29
commit f3cbb5c6f7
50 changed files with 6339 additions and 528 deletions
--- a/backend/DEPLOYMENT_FIXES.md
+++ b/backend/DEPLOYMENT_FIXES.md
@@ -0,0 +1,219 @@
+# Hive Backend Deployment Fixes
+
+## Critical Issues Identified and Fixed
+
+### 1. Database Connection Issues ✅ FIXED
+
+**Problem:**
+- `DATABASE_URL` fell back to SQLite by default, even in production
+- No connection pooling
+- No retry logic for database connections
+- Missing connection validation
+
+**Solution:**
+- Added PostgreSQL connection pooling with proper configuration
+- Implemented database connection retry logic
+- Added connection validation and health checks
+- Enhanced error handling for database operations
+
+**Files Modified:**
+- `/home/tony/AI/projects/hive/backend/app/core/database.py`
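
The retry logic listed under the solution can be sketched as follows. This is a minimal illustration using only the standard library, not the code in `database.py`: the function name `connect_with_retry`, its parameters, and the exponential backoff schedule are all assumptions made for the example.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_retry(
    connect: Callable[[], T],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `connect` until it succeeds or `attempts` tries are used up.

    Hypothetical helper: `connect` stands in for whatever opens the
    database connection. A real implementation would usually catch the
    driver's specific error type instead of `Exception`.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return connect()
        except Exception:
            if attempt == attempts:
                raise  # out of retries: surface the last error
            # Exponential backoff: base_delay, then 2x, 4x, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Passing `sleep` as a parameter keeps the helper easy to test without real delays; in production the default `time.sleep` applies.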
+
+### 2. FastAPI Lifecycle Management ✅ FIXED
+
+**Problem:**
+- Synchronous database table creation in async context
+- No error handling in startup/shutdown
+- No graceful handling of initialization failures
+
+**Solution:**
+- Added retry logic for database initialization
+- Enhanced error handling in lifespan manager
+- Proper cleanup on startup failures
+- Graceful shutdown handling
+
+**Files Modified:**
+- `/home/tony/AI/projects/hive/backend/app/main.py`
+
+### 3. Health Check Robustness ✅ FIXED
+
+**Problem:**
+- Health check could fail if coordinator was unhealthy
+- No database connection testing
+- Insufficient error handling
+
+**Solution:**
+- Enhanced health check with comprehensive component testing
+- Added database connection validation
+- Proper error reporting with appropriate HTTP status codes
+- Component-wise health status reporting
+
+**Files Modified:**
+- `/home/tony/AI/projects/hive/backend/app/main.py`
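
Component-wise health reporting with an appropriate HTTP status code might look like the sketch below. It is framework-independent and is not the endpoint in `main.py`: the name `build_health_report`, the response shape, and every status string other than "operational" are assumptions for illustration.

```python
from typing import Callable, Dict, Tuple


def build_health_report(
    checks: Dict[str, Callable[[], bool]],
) -> Tuple[int, dict]:
    """Run each component check and derive an overall HTTP status code.

    Hypothetical helper: `checks` maps a component name (for example
    "database" or "coordinator") to a callable that returns True when
    that component is healthy.
    """
    components = {}
    for name, check in checks.items():
        try:
            components[name] = "operational" if check() else "unhealthy"
        except Exception as exc:
            # A crashing check is reported, not allowed to break the endpoint.
            components[name] = f"error: {exc}"
    healthy = all(state == "operational" for state in components.values())
    status_code = 200 if healthy else 503
    body = {
        "status": "healthy" if healthy else "unhealthy",
        "components": components,
    }
    return status_code, body
```

Returning 503 when any component is degraded lets an orchestrator or load balancer treat the replica as unavailable, while the body still shows which component failed.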
+
+### 4. Coordinator Initialization ✅ FIXED
+
+**Problem:**
+- No proper error handling during initialization
+- Agent HTTP requests lacked timeout configuration
+- No graceful shutdown for running tasks
+- Memory leaks possible with task storage
+
+**Solution:**
+- Added HTTP client session with proper timeout configuration
+- Enhanced error handling during initialization
+- Proper task cancellation during shutdown
+- Resource cleanup on errors
+
+**Files Modified:**
+- `/home/tony/AI/projects/hive/backend/app/core/hive_coordinator.py`
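
Task cancellation during shutdown could be handled along the following lines. This is a sketch built on `asyncio` alone and does not reproduce `hive_coordinator.py`: the function name `cancel_running_tasks` and the 5-second default timeout are assumptions.

```python
import asyncio
from typing import Iterable


async def cancel_running_tasks(
    tasks: Iterable[asyncio.Task], timeout: float = 5.0
) -> int:
    """Cancel unfinished tasks and wait for them to unwind.

    Hypothetical helper. Returns the number of tasks that were cancelled.
    """
    pending = [task for task in tasks if not task.done()]
    for task in pending:
        task.cancel()
    if pending:
        # asyncio.wait does not raise on timeout; tasks that ignore the
        # cancellation are simply left behind once the timeout expires.
        await asyncio.wait(pending, timeout=timeout)
    return len(pending)
```

Waiting after `cancel()` matters because cancellation only takes effect once the task next runs; returning immediately could close shared resources such as an HTTP session while tasks are still using them.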
+
+### 5. Docker Production Readiness ✅ FIXED
+
+**Problem:**
+- Missing environment variable defaults
+- No database migration handling
+- Health check reliability issues
+- No proper signal handling
+
+**Solution:**
+- Added environment variable defaults
+- Enhanced health check with longer startup period
+- Added dumb-init for proper signal handling
+- Production-ready configuration
+
+**Files Modified:**
+- `/home/tony/AI/projects/hive/backend/Dockerfile`
+- `/home/tony/AI/projects/hive/backend/.env.production`
+
+## Root Cause Analysis
+
+### Primary Issues:
+1. **Database Connection Failures**: Lack of retry logic and connection pooling
+2. **Race Conditions**: Poor initialization order and error handling
+3. **Resource Management**: No proper cleanup of HTTP sessions and tasks
+4. **Production Configuration**: Missing environment variables and timeouts
+
+### Secondary Issues:
+1. **CORS Configuration**: Limited to localhost only
+2. **Error Handling**: Insufficient error context and logging
+3. **Health Checks**: Not comprehensive enough for production
+4. **Signal Handling**: No graceful shutdown support
+
+## Deployment Instructions
+
+### 1. Environment Setup
+```bash
+# Copy production environment file
+cp .env.production .env
+
+# Update secret key and other sensitive values
+nano .env
+```
+
+### 2. Database Migration
+```bash
+# Create migration if needed
+alembic revision --autogenerate -m "Initial migration"
+
+# Apply migrations
+alembic upgrade head
+```
+
+### 3. Docker Build
+```bash
+# Build with production configuration
+docker build -t hive-backend:latest .
+
+# Test locally
+docker run -p 8000:8000 --env-file .env hive-backend:latest
+```
+
+### 4. Health Check Verification
+```bash
+# Test health endpoint
+curl -f http://localhost:8000/health
+
+# The response should report every component as "operational"
+```
+
+## Service Scaling Recommendations
+
+### 1. Database Configuration
+- **Connection Pool**: 10 connections with 20 max overflow
+- **Connection Recycling**: 3600 seconds (1 hour)
+- **Pre-ping**: Enabled for connection validation
+
+### 2. Application Scaling
+- **Replicas**: Start with 2 replicas for high availability (HA)
+- **Workers**: 1 worker per container (better isolation)
+- **Resources**: 512MB memory, 0.5 CPU per replica
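
Combining the pool settings with the replica and worker counts above gives a rough upper bound on database connections. The sketch below works through that arithmetic. The keyword names follow SQLAlchemy's `create_engine` options, which assumes the backend uses SQLAlchemy (the document only mentions Alembic), and `peak_db_connections` is a hypothetical helper.

```python
# Pool settings from the recommendations above, written with the
# keyword names SQLAlchemy's create_engine accepts (an assumption
# about the backend; adapt the names for another library).
POOL_SETTINGS = {
    "pool_size": 10,        # persistent connections per process
    "max_overflow": 20,     # extra connections allowed under load
    "pool_recycle": 3600,   # seconds before a connection is replaced
    "pool_pre_ping": True,  # test each connection before handing it out
}


def peak_db_connections(
    replicas: int,
    pool_size: int,
    max_overflow: int,
    workers_per_replica: int = 1,
) -> int:
    """Worst-case connection count across all replicas and workers."""
    return replicas * workers_per_replica * (pool_size + max_overflow)


peak = peak_db_connections(
    replicas=2,
    pool_size=POOL_SETTINGS["pool_size"],
    max_overflow=POOL_SETTINGS["max_overflow"],
)
print(peak)  # 2 * 1 * (10 + 20) = 60
```

With 2 replicas and 1 worker each, the peak is 60 connections, which should stay below the database server's connection limit (PostgreSQL's default `max_connections` is 100). Adding replicas or workers raises the peak proportionally.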
+
+### 3. Load Balancing
+- **Health Check**: `/health` endpoint with 30s interval
+- **Startup Grace**: 60 seconds for initialization
+- **Timeout**: 10 seconds for health checks
+
+### 4. Monitoring
+- **Prometheus**: Metrics available at `/api/metrics`
+- **Logging**: Structured JSON logs for aggregation
+- **Alerts**: Set up for failed health checks
+
+## Troubleshooting Guide
+
+### Backend Not Starting
+1. Check database connectivity
+2. Verify environment variables
+3. Check coordinator initialization logs
+4. Validate HTTP client connectivity
+
+### Service Scaling Issues
+1. Monitor memory usage (coordinator stores tasks)
+2. Check database connection pool exhaustion
+3. Verify HTTP session limits
+4. Review task execution timeouts
+
+### Health Check Failures
+1. Database connection issues
+2. Coordinator initialization failures
+3. HTTP client timeout problems
+4. Resource exhaustion
+
+## Production Monitoring
+
+### Key Metrics to Watch:
+- Database connection pool usage
+- Task execution success rate
+- HTTP client connection errors
+- Memory usage trends
+- Response times for health checks
+
+### Log Analysis:
+- Search for "initialization failed" patterns
+- Monitor database connection errors
+- Track coordinator shutdown messages
+- Watch for HTTP timeout errors
+
+## Security Considerations
+
+### Environment Variables:
+- Never commit `.env` files to version control
+- Use secrets management for sensitive values
+- Rotate database credentials regularly
+- Implement proper RBAC for API access
+
+### Network Security:
+- Use HTTPS in production
+- Implement rate limiting
+- Configure proper CORS origins
+- Use network policies for pod-to-pod communication
+
+## Next Steps
+
+1. **Deploy Updated Images**: Build and deploy with fixes
+2. **Monitor Metrics**: Set up monitoring and alerting
+3. **Load Testing**: Verify scaling behavior under load
+4. **Security Audit**: Review security configurations
+5. **Documentation**: Update operational runbooks
+
+The fixes above address the root causes of the 1/2 replica scaling issue and should result in a stable 2/2 replica deployment.