Pre-cleanup snapshot - all current files
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
## Plan: Hybrid Secret Detection with Sanitized Log Replication

### 1. Objective

To implement a robust, two-stage secret detection pipeline that:

1. Reads from a primary hypercore log in real time.
2. Uses a fast, regex-based scanner for initial detection.
3. Leverages a local LLM (via Ollama) for deeper, context-aware analysis of potential secrets to reduce false positives.
4. Writes a fully sanitized version of the log to a new, parallel "sister" hypercore stream.
5. Quarantines and alerts on confirmed high-severity secrets, ensuring the original log remains untouched for audit purposes while the sanitized log is safe for wider consumption.

### 2. High-Level Architecture & Data Flow

The process will follow this data flow:

```
                                 ┌──────────────────────────┐
  [Primary Hypercore Log] ─────► │     HypercoreReader      │
                                 └────────────┬─────────────┘
                                              │ (Raw Log Entry)
                                              ▼
                                 ┌──────────────────────────┐
                                 │     MessageProcessor     │
                                 │      (Orchestrator)      │
                                 └────────────┬─────────────┘
                                              │
                      ┌───────────────────────▼───────────────────────┐
                      │           Stage 1: Fast Regex Scan            │
                      │               (SecretDetector)                │
                      └───────────────────────┬───────────────────────┘
                                              │
              ┌───────────────────────────────┼─────────────────────────────┐
              │ (No Match)                    │ (Potential Match)           │ (High-Confidence Match)
              ▼                               ▼                             ▼
┌──────────────────────────┐    ┌──────────────────────────┐    ┌──────────────────────────┐
│     SanitizedWriter      │    │  Stage 2: LLM Analysis   │    │        (Skip LLM)        │
│ (Writes original entry)  │    │      (LLMAnalyzer)       │    │  Quarantine Immediately  │
└──────────────────────────┘    └────────────┬─────────────┘    └────────────┬─────────────┘
              ▲                              │ (LLM Confirms)                │
              │                              ▼                              ▼
              │                 ┌──────────────────────────┐    ┌──────────────────────────┐
              │                 │    QuarantineManager     │    │     Alerting System      │
              │                 │  (DB Storage, Alerts)    │    │        (Webhooks)        │
              │                 └────────────┬─────────────┘    └──────────────────────────┘
              │                              │
              │                              ▼
              │                 ┌──────────────────────────┐
              └─────────────────┤     SanitizedWriter      │
                                │ (Writes REDACTED entry)  │
                                └────────────┬─────────────┘
                                             │
                                             ▼
                                 [Sanitized Hypercore Log]
```

### 3. Component Implementation Plan

This plan modifies existing components and adds new ones.

#### 3.1. New Component: `core/llm_analyzer.py`

This new file will contain all logic for interacting with the Ollama instance. This isolates the dependency and makes it easy to test or swap out the LLM backend.

```python
# core/llm_analyzer.py
import json

import requests


class LLMAnalyzer:
    """Analyzes text for secrets using a local LLM via Ollama."""

    def __init__(self, endpoint: str, model: str, system_prompt: str):
        self.endpoint = endpoint
        self.model = model
        self.system_prompt = system_prompt

    def analyze(self, text: str) -> dict:
        """
        Sends text to the Ollama API for analysis and returns a structured JSON response.

        Returns:
            A dictionary like:
            {
                "secret_found": bool,
                "secret_type": str,
                "confidence_score": float,
                "severity": str
            }
            Returns a default "not found" response on error.
        """
        prompt = f"Log entry: \"{text}\"\n\nAnalyze this for secrets and respond with only the required JSON."
        payload = {
            "model": self.model,
            "system": self.system_prompt,
            "prompt": prompt,
            "format": "json",
            "stream": False,
        }
        try:
            response = requests.post(self.endpoint, json=payload, timeout=15)
            response.raise_for_status()
            # Ollama returns the model output as a JSON string in the
            # "response" field, which needs a second parse.
            analysis = json.loads(response.json().get("response", "{}"))
            return analysis
        except (requests.exceptions.RequestException, json.JSONDecodeError) as e:
            print(f"[ERROR] LLMAnalyzer failed: {e}")
            # Fallback: if the LLM fails, assume no secret was found to avoid
            # blocking the pipeline.
            return {"secret_found": False}
```
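
A minimal usage sketch (assumes a local Ollama instance with the `llama3` model pulled; the inline system prompt is a stand-in for the real `SHHH_SECRETS_SENTINEL_AGENT_PROMPT.md`):

```python
# Illustrative only: the sample log entry and prompt text are made up.
analyzer = LLMAnalyzer(
    endpoint="http://localhost:11434/api/generate",
    model="llama3",
    system_prompt=(
        "You are a secret-detection assistant. Respond ONLY with JSON of the form "
        '{"secret_found": bool, "secret_type": str, "confidence_score": float, "severity": str}.'
    ),
)
result = analyzer.analyze('export STRIPE_KEY="sk_live_xxxxxxxxxxxx"')
print(result.get("secret_found"), result.get("severity"))
```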

#### 3.2. New Component: `core/sanitized_writer.py`

This component is responsible for writing to the new, sanitized hypercore log. This abstraction allows us to easily change the output destination in the future.

```python
# core/sanitized_writer.py
class SanitizedWriter:
    """Writes log entries to the sanitized sister hypercore log."""

    def __init__(self, sanitized_log_path: str):
        self.log_path = sanitized_log_path
        # Placeholder for hypercore writing logic. For now, we'll append to a file.
        self.log_file = open(self.log_path, "a")

    def write(self, log_entry: str):
        """Writes a single log entry to the sanitized stream."""
        self.log_file.write(log_entry + "\n")
        self.log_file.flush()

    def close(self):
        self.log_file.close()
```

#### 3.3. Modify: `core/detector.py`

We will enhance the `SecretDetector` to not only find matches but also redact them.

```python
# core/detector.py
import re


class SecretDetector:
    def __init__(self, patterns_file: str = "patterns.yaml"):
        # ... (load_patterns remains the same) ...
        ...

    def scan(self, text: str) -> list[dict]:
        """Scans text and returns a list of found secrets with metadata."""
        matches = []
        for pattern_name, pattern in self.patterns.items():
            if pattern.get("active", True):
                regex_match = re.search(pattern["regex"], text)
                if regex_match:
                    matches.append({
                        "secret_type": pattern_name,
                        "value": regex_match.group(0),
                        "confidence": pattern.get("confidence", 0.8),  # Default confidence
                        "severity": pattern.get("severity", "MEDIUM"),
                    })
        return matches

    def redact(self, text: str, secret_value: str) -> str:
        """Redacts a specific secret value within a string."""
        if len(secret_value) <= 8:
            # A prefix/suffix mask would leak a short value entirely; mask it all.
            redacted_str = "****"
        else:
            redacted_str = secret_value[:4] + "****" + secret_value[-4:]
        return text.replace(secret_value, f"[REDACTED:{redacted_str}]")
```
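
For reference, a `patterns.yaml` entry shaped the way `scan` reads it might look like this (illustrative patterns and scores, not the project's actual rule set):

```yaml
# patterns.yaml (illustrative entries)
aws_access_key_id:
  regex: "AKIA[0-9A-Z]{16}"
  confidence: 0.95   # >= llm_threshold, so it quarantines without LLM review
  severity: "HIGH"
  active: true
generic_password:
  regex: "password\\s*[:=]\\s*\\S+"
  confidence: 0.5    # below the threshold: routed to Stage 2 LLM analysis
  severity: "MEDIUM"
  active: true
```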

#### 3.4. Modify: `pipeline/processor.py`

This is the orchestrator and will see the most significant changes to implement the hybrid logic.

```python
# pipeline/processor.py
from core.detector import SecretDetector
from core.hypercore_reader import HypercoreReader
from core.llm_analyzer import LLMAnalyzer
from core.quarantine import QuarantineManager
from core.sanitized_writer import SanitizedWriter


class MessageProcessor:
    def __init__(self, reader: HypercoreReader, detector: SecretDetector,
                 llm_analyzer: LLMAnalyzer, quarantine: QuarantineManager,
                 writer: SanitizedWriter, llm_threshold: float):
        self.reader = reader
        self.detector = detector
        self.llm_analyzer = llm_analyzer
        self.quarantine = quarantine
        self.writer = writer
        self.llm_threshold = llm_threshold  # e.g., 0.90

    async def process_stream(self):
        """Main processing loop for the hybrid detection model."""
        async for entry in self.reader.stream_entries():
            # Stage 1: Fast Regex Scan
            regex_matches = self.detector.scan(entry.content)

            if not regex_matches:
                # No secrets found; write the original entry to the sanitized log.
                self.writer.write(entry.content)
                continue

            # A potential secret was found. Default to the original content,
            # which may be replaced with a redacted version below.
            sanitized_content = entry.content
            should_quarantine = False
            confirmed_secret = None

            for match in regex_matches:
                # High-confidence regex matches trigger immediate quarantine, skipping the LLM.
                if match['confidence'] >= self.llm_threshold:
                    should_quarantine = True
                    confirmed_secret = match
                    break  # One high-confidence match is enough.

                # Stage 2: Low-confidence matches go to the LLM for verification.
                llm_result = self.llm_analyzer.analyze(entry.content)
                if llm_result.get("secret_found"):
                    should_quarantine = True
                    # Prefer the LLM's classification but use the regex value for redaction.
                    confirmed_secret = {
                        "secret_type": llm_result.get("secret_type", match['secret_type']),
                        "value": match['value'],
                        "severity": llm_result.get("severity", match['severity']),
                    }
                    break

            if should_quarantine and confirmed_secret:
                # A secret is confirmed. Redact, quarantine, and alert.
                sanitized_content = self.detector.redact(entry.content, confirmed_secret['value'])
                self.quarantine.quarantine_message(
                    message=entry,
                    secret_type=confirmed_secret['secret_type'],
                    severity=confirmed_secret['severity'],
                    redacted_content=sanitized_content,
                )
                # Potentially trigger alerts here as well.
                print(f"[ALERT] Confirmed secret {confirmed_secret['secret_type']} found and quarantined.")

            # Write the (potentially redacted) content to the sanitized log.
            self.writer.write(sanitized_content)
```
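
One caveat with this sketch: `LLMAnalyzer.analyze` uses the synchronous `requests` client, so each Stage 2 call blocks the event loop for up to the 15-second timeout. If that matters in practice, the blocking call can be pushed onto a worker thread. A minimal, self-contained illustration (Python 3.9+; `blocking_analyze` is a made-up stand-in for the real analyzer call):

```python
import asyncio
import time

def blocking_analyze(text: str) -> dict:
    time.sleep(0.1)  # stand-in for the synchronous requests.post round-trip
    return {"secret_found": False}

async def main():
    # Inside process_stream, the equivalent change would be:
    #   llm_result = await asyncio.to_thread(self.llm_analyzer.analyze, entry.content)
    result = await asyncio.to_thread(blocking_analyze, "some log entry")
    print(result)

asyncio.run(main())
```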

#### 3.5. Modify: `main.py`

The main entry point will be updated to instantiate and wire together the new and modified components.

```python
# main.py
import asyncio

from core.detector import SecretDetector
from core.hypercore_reader import HypercoreReader
from core.llm_analyzer import LLMAnalyzer
from core.quarantine import QuarantineManager
from core.sanitized_writer import SanitizedWriter
from pipeline.processor import MessageProcessor
# ... other imports


def main():
    # 1. Configuration
    # Load from a new config.yaml or environment variables.
    PRIMARY_LOG_PATH = "/path/to/primary/hypercore.log"
    SANITIZED_LOG_PATH = "/path/to/sanitized/hypercore.log"
    PATTERNS_PATH = "patterns.yaml"
    DB_CONNECTION = "..."
    OLLAMA_ENDPOINT = "http://localhost:11434/api/generate"
    OLLAMA_MODEL = "llama3"
    LLM_CONFIDENCE_THRESHOLD = 0.90  # Regex confidence >= this skips the LLM

    with open("SHHH_SECRETS_SENTINEL_AGENT_PROMPT.md", "r") as f:
        OLLAMA_SYSTEM_PROMPT = f.read()

    # 2. Instantiation
    reader = HypercoreReader(PRIMARY_LOG_PATH)
    detector = SecretDetector(PATTERNS_PATH)
    llm_analyzer = LLMAnalyzer(OLLAMA_ENDPOINT, OLLAMA_MODEL, OLLAMA_SYSTEM_PROMPT)
    quarantine = QuarantineManager(DB_CONNECTION)
    writer = SanitizedWriter(SANITIZED_LOG_PATH)

    processor = MessageProcessor(
        reader=reader,
        detector=detector,
        llm_analyzer=llm_analyzer,
        quarantine=quarantine,
        writer=writer,
        llm_threshold=LLM_CONFIDENCE_THRESHOLD
    )

    # 3. Execution
    print("Starting SHHH Hypercore Monitor...")
    try:
        asyncio.run(processor.process_stream())
    except KeyboardInterrupt:
        print("Shutting down...")
    finally:
        writer.close()


if __name__ == "__main__":
    main()
```

### 4. Phased Rollout

1. **Phase 1: Component Implementation (1-2 days)**
    * Create `core/llm_analyzer.py` and `core/sanitized_writer.py`.
    * Write unit tests for both new components. Mock the `requests` calls for the analyzer (a sketch of such a test follows this phase).
    * Update `core/detector.py` with the `redact` method and update its unit tests.
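
A minimal sketch of the mocked-`requests` analyzer test (assumes `pytest`; names and sample values are illustrative):

```python
# tests/test_llm_analyzer.py (sketch)
import json
from unittest.mock import MagicMock, patch

from core.llm_analyzer import LLMAnalyzer

def test_analyze_parses_ollama_response():
    fake_response = MagicMock()
    fake_response.json.return_value = {
        "response": json.dumps({"secret_found": True, "secret_type": "api_key",
                                "confidence_score": 0.95, "severity": "HIGH"})
    }
    # Patch requests.post as imported in core.llm_analyzer; no Ollama instance needed.
    with patch("core.llm_analyzer.requests.post", return_value=fake_response) as post:
        analyzer = LLMAnalyzer("http://localhost:11434/api/generate", "llama3", "system prompt")
        result = analyzer.analyze("AKIAIOSFODNN7EXAMPLE looks like an AWS key")
    assert result["secret_found"] is True
    assert post.call_args.kwargs["json"]["model"] == "llama3"
```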

2. **Phase 2: Orchestration Logic (2-3 days)**
    * Implement the new logic in `pipeline/processor.py`.
    * Write integration tests for the processor that simulate the full flow: no match, low-confidence match (with a mocked LLM response), and high-confidence match. One such test is sketched after this phase.
    * Update `main.py` to wire everything together.
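
A sketch of the low-confidence path with everything but the processor mocked (class and field names follow sections 3.1-3.4; the log content is made up):

```python
# tests/test_processor_flow.py (sketch)
import asyncio
from unittest.mock import MagicMock

from pipeline.processor import MessageProcessor

class FakeEntry:
    content = "db password = hunter2-aaaa-bbbb"

async def fake_stream():
    yield FakeEntry()

def test_low_confidence_match_is_verified_and_redacted():
    reader = MagicMock()
    reader.stream_entries = fake_stream  # async generator function
    detector = MagicMock()
    detector.scan.return_value = [{"secret_type": "generic_password",
                                   "value": "hunter2-aaaa-bbbb",
                                   "confidence": 0.5, "severity": "MEDIUM"}]
    detector.redact.return_value = "db password = [REDACTED:hunt****bbbb]"
    llm = MagicMock()
    llm.analyze.return_value = {"secret_found": True, "secret_type": "password",
                                "severity": "HIGH"}
    quarantine, writer = MagicMock(), MagicMock()

    processor = MessageProcessor(reader, detector, llm, quarantine, writer,
                                 llm_threshold=0.90)
    asyncio.run(processor.process_stream())

    llm.analyze.assert_called_once()                  # Stage 2 ran
    quarantine.quarantine_message.assert_called_once()
    writer.write.assert_called_once_with("db password = [REDACTED:hunt****bbbb]")
```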

3. **Phase 3: Configuration & Testing (1 day)**
    * Add a `config.yaml` to manage all paths, thresholds, and endpoints (an illustrative layout is sketched after this phase).
    * Perform an end-to-end test run with a sample log file and a running Ollama instance.
    * Verify that the primary log is untouched, the sanitized log is created correctly (with and without redactions), and the quarantine database is populated as expected.
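
One possible `config.yaml` layout, mirroring the constants in `main.py` (paths and values are placeholders):

```yaml
# config.yaml (illustrative)
logs:
  primary_path: /path/to/primary/hypercore.log
  sanitized_path: /path/to/sanitized/hypercore.log
detection:
  patterns_file: patterns.yaml
  llm_confidence_threshold: 0.90   # regex confidence >= this skips the LLM
llm:
  endpoint: http://localhost:11434/api/generate
  model: llama3
  system_prompt_file: SHHH_SECRETS_SENTINEL_AGENT_PROMPT.md
database:
  connection: "..."                # elided in the plan; left as-is here
```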

### 5. Success Criteria

* **Zero Leaks:** The sanitized log stream contains no secrets.
* **High Accuracy:** The false-positive rate is demonstrably lower than that of a regex-only solution, verified during testing.
* **Performance:** The pipeline maintains acceptable latency (<200ms per log entry on average, accounting for occasional LLM analysis).
* **Auditability:** The primary log remains a perfect, unaltered source of truth. All detection and quarantine events are logged in the PostgreSQL database.