Pre-cleanup snapshot - all current files
🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
## Plan: Hybrid Secret Detection with Sanitized Log Replication

### 1. Objective

To implement a robust, two-stage secret detection pipeline that:

1. Reads from a primary hypercore log in real time.
2. Uses a fast, regex-based scanner for initial detection.
3. Leverages a local LLM (via Ollama) for deeper, context-aware analysis of potential secrets to reduce false positives.
4. Writes a fully sanitized version of the log to a new, parallel "sister" hypercore stream.
5. Quarantines and alerts on confirmed high-severity secrets, ensuring the original log remains untouched for audit purposes while the sanitized log is safe for wider consumption.

### 2. High-Level Architecture & Data Flow

The process will follow this data flow:

```
                                 ┌──────────────────────────┐
  [Primary Hypercore Log] ─────► │     HypercoreReader      │
                                 └────────────┬─────────────┘
                                              │ (Raw Log Entry)
                                              ▼
                                 ┌──────────────────────────┐
                                 │     MessageProcessor     │
                                 │      (Orchestrator)      │
                                 └────────────┬─────────────┘
                                              │
                      ┌───────────────────────▼───────────────────────┐
                      │           Stage 1: Fast Regex Scan            │
                      │               (SecretDetector)                │
                      └───────────────────────┬───────────────────────┘
                                              │
              ┌───────────────────────────────┼─────────────────────────────┐
              │ (No Match)                    │ (Potential Match)           │ (High-Confidence Match)
              ▼                               ▼                             ▼
┌──────────────────────────┐    ┌──────────────────────────┐    ┌──────────────────────────┐
│     SanitizedWriter      │    │  Stage 2: LLM Analysis   │    │        (Skip LLM)        │
│ (Writes original entry)  │    │      (LLMAnalyzer)       │    │  Quarantine Immediately  │
└──────────────────────────┘    └────────────┬─────────────┘    └────────────┬─────────────┘
              ▲                              │ (LLM Confirms)                │
              │                              ▼                              ▼
              │                 ┌──────────────────────────┐    ┌──────────────────────────┐
              │                 │    QuarantineManager     │    │     Alerting System      │
              │                 │  (DB Storage, Alerts)    │    │        (Webhooks)        │
              │                 └────────────┬─────────────┘    └──────────────────────────┘
              │                              │
              │                              ▼
              │                 ┌──────────────────────────┐
              └─────────────────┤     SanitizedWriter      │
                                │ (Writes REDACTED entry)  │
                                └────────────┬─────────────┘
                                             │
                                             ▼
                                 [Sanitized Hypercore Log]
```

### 3. Component Implementation Plan

This plan modifies existing components and adds new ones.

#### 3.1. New Component: `core/llm_analyzer.py`

This new file will contain all logic for interacting with the Ollama instance. This isolates the dependency and makes it easy to test or swap out the LLM backend.

```python
# core/llm_analyzer.py
import json

import requests


class LLMAnalyzer:
    """Analyzes text for secrets using a local LLM via Ollama."""

    def __init__(self, endpoint: str, model: str, system_prompt: str):
        self.endpoint = endpoint
        self.model = model
        self.system_prompt = system_prompt

    def analyze(self, text: str) -> dict:
        """
        Sends text to the Ollama API for analysis and returns a structured JSON response.

        Returns:
            A dictionary like:
            {
                "secret_found": bool,
                "secret_type": str,
                "confidence_score": float,
                "severity": str
            }
            Returns a default "not found" response on error.
        """
        prompt = f"Log entry: \"{text}\"\n\nAnalyze this for secrets and respond with only the required JSON."
        payload = {
            "model": self.model,
            "system": self.system_prompt,
            "prompt": prompt,
            "format": "json",
            "stream": False,
        }
        try:
            response = requests.post(self.endpoint, json=payload, timeout=15)
            response.raise_for_status()
            # Ollama returns the model output as a JSON string in the
            # "response" field, which needs a second parse.
            analysis = json.loads(response.json().get("response", "{}"))
            return analysis
        except (requests.exceptions.RequestException, json.JSONDecodeError) as e:
            print(f"[ERROR] LLMAnalyzer failed: {e}")
            # Fallback: if the LLM fails, assume no secret was found to avoid
            # blocking the pipeline.
            return {"secret_found": False}
```
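
A minimal usage sketch (assumes a local Ollama instance with the `llama3` model pulled; the inline system prompt is a stand-in for the real `SHHH_SECRETS_SENTINEL_AGENT_PROMPT.md`):

```python
# Illustrative only: the sample log entry and prompt text are made up.
analyzer = LLMAnalyzer(
    endpoint="http://localhost:11434/api/generate",
    model="llama3",
    system_prompt=(
        "You are a secret-detection assistant. Respond ONLY with JSON of the form "
        '{"secret_found": bool, "secret_type": str, "confidence_score": float, "severity": str}.'
    ),
)
result = analyzer.analyze('export STRIPE_KEY="sk_live_xxxxxxxxxxxx"')
print(result.get("secret_found"), result.get("severity"))
```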

#### 3.2. New Component: `core/sanitized_writer.py`

This component is responsible for writing to the new, sanitized hypercore log. This abstraction allows us to easily change the output destination in the future.

```python
# core/sanitized_writer.py
class SanitizedWriter:
    """Writes log entries to the sanitized sister hypercore log."""

    def __init__(self, sanitized_log_path: str):
        self.log_path = sanitized_log_path
        # Placeholder for hypercore writing logic. For now, we'll append to a file.
        self.log_file = open(self.log_path, "a")

    def write(self, log_entry: str):
        """Writes a single log entry to the sanitized stream."""
        self.log_file.write(log_entry + "\n")
        self.log_file.flush()

    def close(self):
        self.log_file.close()
```

#### 3.3. Modify: `core/detector.py`

We will enhance the `SecretDetector` to not only find matches but also redact them.

```python
# core/detector.py
import re


class SecretDetector:
    def __init__(self, patterns_file: str = "patterns.yaml"):
        # ... (load_patterns remains the same) ...
        ...

    def scan(self, text: str) -> list[dict]:
        """Scans text and returns a list of found secrets with metadata."""
        matches = []
        for pattern_name, pattern in self.patterns.items():
            if pattern.get("active", True):
                regex_match = re.search(pattern["regex"], text)
                if regex_match:
                    matches.append({
                        "secret_type": pattern_name,
                        "value": regex_match.group(0),
                        "confidence": pattern.get("confidence", 0.8),  # Default confidence
                        "severity": pattern.get("severity", "MEDIUM"),
                    })
        return matches

    def redact(self, text: str, secret_value: str) -> str:
        """Redacts a specific secret value within a string."""
        if len(secret_value) <= 8:
            # A prefix/suffix mask would leak a short value entirely; mask it all.
            redacted_str = "****"
        else:
            redacted_str = secret_value[:4] + "****" + secret_value[-4:]
        return text.replace(secret_value, f"[REDACTED:{redacted_str}]")
```
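
For reference, a `patterns.yaml` entry shaped the way `scan` reads it might look like this (illustrative patterns and scores, not the project's actual rule set):

```yaml
# patterns.yaml (illustrative entries)
aws_access_key_id:
  regex: "AKIA[0-9A-Z]{16}"
  confidence: 0.95   # >= llm_threshold, so it quarantines without LLM review
  severity: "HIGH"
  active: true
generic_password:
  regex: "password\\s*[:=]\\s*\\S+"
  confidence: 0.5    # below the threshold: routed to Stage 2 LLM analysis
  severity: "MEDIUM"
  active: true
```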

#### 3.4. Modify: `pipeline/processor.py`

This is the orchestrator and will see the most significant changes to implement the hybrid logic.

```python
# pipeline/processor.py
from core.detector import SecretDetector
from core.hypercore_reader import HypercoreReader
from core.llm_analyzer import LLMAnalyzer
from core.quarantine import QuarantineManager
from core.sanitized_writer import SanitizedWriter


class MessageProcessor:
    def __init__(self, reader: HypercoreReader, detector: SecretDetector,
                 llm_analyzer: LLMAnalyzer, quarantine: QuarantineManager,
                 writer: SanitizedWriter, llm_threshold: float):
        self.reader = reader
        self.detector = detector
        self.llm_analyzer = llm_analyzer
        self.quarantine = quarantine
        self.writer = writer
        self.llm_threshold = llm_threshold  # e.g., 0.90

    async def process_stream(self):
        """Main processing loop for the hybrid detection model."""
        async for entry in self.reader.stream_entries():
            # Stage 1: Fast Regex Scan
            regex_matches = self.detector.scan(entry.content)

            if not regex_matches:
                # No secrets found; write the original entry to the sanitized log.
                self.writer.write(entry.content)
                continue

            # A potential secret was found. Default to the original content,
            # which may be replaced with a redacted version below.
            sanitized_content = entry.content
            should_quarantine = False
            confirmed_secret = None

            for match in regex_matches:
                # High-confidence regex matches trigger immediate quarantine, skipping the LLM.
                if match['confidence'] >= self.llm_threshold:
                    should_quarantine = True
                    confirmed_secret = match
                    break  # One high-confidence match is enough.

                # Stage 2: Low-confidence matches go to the LLM for verification.
                llm_result = self.llm_analyzer.analyze(entry.content)
                if llm_result.get("secret_found"):
                    should_quarantine = True
                    # Prefer the LLM's classification but use the regex value for redaction.
                    confirmed_secret = {
                        "secret_type": llm_result.get("secret_type", match['secret_type']),
                        "value": match['value'],
                        "severity": llm_result.get("severity", match['severity']),
                    }
                    break

            if should_quarantine and confirmed_secret:
                # A secret is confirmed. Redact, quarantine, and alert.
                sanitized_content = self.detector.redact(entry.content, confirmed_secret['value'])
                self.quarantine.quarantine_message(
                    message=entry,
                    secret_type=confirmed_secret['secret_type'],
                    severity=confirmed_secret['severity'],
                    redacted_content=sanitized_content,
                )
                # Potentially trigger alerts here as well.
                print(f"[ALERT] Confirmed secret {confirmed_secret['secret_type']} found and quarantined.")

            # Write the (potentially redacted) content to the sanitized log.
            self.writer.write(sanitized_content)
```
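
One caveat with this sketch: `LLMAnalyzer.analyze` uses the synchronous `requests` client, so each Stage 2 call blocks the event loop for up to the 15-second timeout. If that matters in practice, the blocking call can be pushed onto a worker thread. A minimal, self-contained illustration (Python 3.9+; `blocking_analyze` is a made-up stand-in for the real analyzer call):

```python
import asyncio
import time

def blocking_analyze(text: str) -> dict:
    time.sleep(0.1)  # stand-in for the synchronous requests.post round-trip
    return {"secret_found": False}

async def main():
    # Inside process_stream, the equivalent change would be:
    #   llm_result = await asyncio.to_thread(self.llm_analyzer.analyze, entry.content)
    result = await asyncio.to_thread(blocking_analyze, "some log entry")
    print(result)

asyncio.run(main())
```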

#### 3.5. Modify: `main.py`

The main entry point will be updated to instantiate and wire together the new and modified components.

```python
# main.py
import asyncio

from core.detector import SecretDetector
from core.hypercore_reader import HypercoreReader
from core.llm_analyzer import LLMAnalyzer
from core.quarantine import QuarantineManager
from core.sanitized_writer import SanitizedWriter
from pipeline.processor import MessageProcessor
# ... other imports


def main():
    # 1. Configuration
    # Load from a new config.yaml or environment variables.
    PRIMARY_LOG_PATH = "/path/to/primary/hypercore.log"
    SANITIZED_LOG_PATH = "/path/to/sanitized/hypercore.log"
    PATTERNS_PATH = "patterns.yaml"
    DB_CONNECTION = "..."
    OLLAMA_ENDPOINT = "http://localhost:11434/api/generate"
    OLLAMA_MODEL = "llama3"
    LLM_CONFIDENCE_THRESHOLD = 0.90  # Regex confidence >= this skips the LLM

    with open("SHHH_SECRETS_SENTINEL_AGENT_PROMPT.md", "r") as f:
        OLLAMA_SYSTEM_PROMPT = f.read()

    # 2. Instantiation
    reader = HypercoreReader(PRIMARY_LOG_PATH)
    detector = SecretDetector(PATTERNS_PATH)
    llm_analyzer = LLMAnalyzer(OLLAMA_ENDPOINT, OLLAMA_MODEL, OLLAMA_SYSTEM_PROMPT)
    quarantine = QuarantineManager(DB_CONNECTION)
    writer = SanitizedWriter(SANITIZED_LOG_PATH)

    processor = MessageProcessor(
        reader=reader,
        detector=detector,
        llm_analyzer=llm_analyzer,
        quarantine=quarantine,
        writer=writer,
        llm_threshold=LLM_CONFIDENCE_THRESHOLD
    )

    # 3. Execution
    print("Starting SHHH Hypercore Monitor...")
    try:
        asyncio.run(processor.process_stream())
    except KeyboardInterrupt:
        print("Shutting down...")
    finally:
        writer.close()


if __name__ == "__main__":
    main()
```

### 4. Phased Rollout

1. **Phase 1: Component Implementation (1-2 days)**
    * Create `core/llm_analyzer.py` and `core/sanitized_writer.py`.
    * Write unit tests for both new components. Mock the `requests` calls for the analyzer (a sketch of such a test follows this phase).
    * Update `core/detector.py` with the `redact` method and update its unit tests.
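
A minimal sketch of the mocked-`requests` analyzer test (assumes `pytest`; names and sample values are illustrative):

```python
# tests/test_llm_analyzer.py (sketch)
import json
from unittest.mock import MagicMock, patch

from core.llm_analyzer import LLMAnalyzer

def test_analyze_parses_ollama_response():
    fake_response = MagicMock()
    fake_response.json.return_value = {
        "response": json.dumps({"secret_found": True, "secret_type": "api_key",
                                "confidence_score": 0.95, "severity": "HIGH"})
    }
    # Patch requests.post as imported in core.llm_analyzer; no Ollama instance needed.
    with patch("core.llm_analyzer.requests.post", return_value=fake_response) as post:
        analyzer = LLMAnalyzer("http://localhost:11434/api/generate", "llama3", "system prompt")
        result = analyzer.analyze("AKIAIOSFODNN7EXAMPLE looks like an AWS key")
    assert result["secret_found"] is True
    assert post.call_args.kwargs["json"]["model"] == "llama3"
```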

2. **Phase 2: Orchestration Logic (2-3 days)**
    * Implement the new logic in `pipeline/processor.py`.
    * Write integration tests for the processor that simulate the full flow: no match, low-confidence match (with a mocked LLM response), and high-confidence match. One such test is sketched after this phase.
    * Update `main.py` to wire everything together.
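
A sketch of the low-confidence path with everything but the processor mocked (class and field names follow sections 3.1-3.4; the log content is made up):

```python
# tests/test_processor_flow.py (sketch)
import asyncio
from unittest.mock import MagicMock

from pipeline.processor import MessageProcessor

class FakeEntry:
    content = "db password = hunter2-aaaa-bbbb"

async def fake_stream():
    yield FakeEntry()

def test_low_confidence_match_is_verified_and_redacted():
    reader = MagicMock()
    reader.stream_entries = fake_stream  # async generator function
    detector = MagicMock()
    detector.scan.return_value = [{"secret_type": "generic_password",
                                   "value": "hunter2-aaaa-bbbb",
                                   "confidence": 0.5, "severity": "MEDIUM"}]
    detector.redact.return_value = "db password = [REDACTED:hunt****bbbb]"
    llm = MagicMock()
    llm.analyze.return_value = {"secret_found": True, "secret_type": "password",
                                "severity": "HIGH"}
    quarantine, writer = MagicMock(), MagicMock()

    processor = MessageProcessor(reader, detector, llm, quarantine, writer,
                                 llm_threshold=0.90)
    asyncio.run(processor.process_stream())

    llm.analyze.assert_called_once()                  # Stage 2 ran
    quarantine.quarantine_message.assert_called_once()
    writer.write.assert_called_once_with("db password = [REDACTED:hunt****bbbb]")
```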

3. **Phase 3: Configuration & Testing (1 day)**
    * Add a `config.yaml` to manage all paths, thresholds, and endpoints (an illustrative layout is sketched after this phase).
    * Perform an end-to-end test run with a sample log file and a running Ollama instance.
    * Verify that the primary log is untouched, the sanitized log is created correctly (with and without redactions), and the quarantine database is populated as expected.
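
One possible `config.yaml` layout, mirroring the constants in `main.py` (paths and values are placeholders):

```yaml
# config.yaml (illustrative)
logs:
  primary_path: /path/to/primary/hypercore.log
  sanitized_path: /path/to/sanitized/hypercore.log
detection:
  patterns_file: patterns.yaml
  llm_confidence_threshold: 0.90   # regex confidence >= this skips the LLM
llm:
  endpoint: http://localhost:11434/api/generate
  model: llama3
  system_prompt_file: SHHH_SECRETS_SENTINEL_AGENT_PROMPT.md
database:
  connection: "..."                # elided in the plan; left as-is here
```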

### 5. Success Criteria

* **Zero Leaks:** The sanitized log stream contains no secrets.
* **High Accuracy:** The false-positive rate is demonstrably lower than that of a regex-only solution, verified during testing.
* **Performance:** The pipeline maintains acceptable latency (<200ms per log entry on average, accounting for occasional LLM analysis).
* **Auditability:** The primary log remains a perfect, unaltered source of truth. All detection and quarantine events are logged in the PostgreSQL database.