Claude Code a6ee31f237 Phase 2 build initial

2025-07-30 09:34:16 +10:00

# PROJECT_PLAN.md

## 📘 Title

Context‑Aware Hierarchical Context File System (HCFS): Unifying file system paths with context blobs for agentic AI cognition

1. Research Motivation & Literature Review 🧠

Semantic and context‑aware file systems: Gifford et al. (1991) proposed early semantic file systems using directory paths as semantic queries (Wikipedia). Later work explored tag‑based and ontology‑based systems for richer metadata and context-aware retrieval (Wikipedia).
LLM‑driven semantic FS (LSFS): The ICLR 2025 LSFS paper proposes integrating vector databases and semantic indexing into a filesystem that supports prompt-driven file operations and semantic rollback (OpenReview).
Path-structure embeddings: Recent Transformer-based work shows file paths can be modeled as sequences for semantic anomaly detection—capturing hierarchy and semantics in embeddings (MDPI).
Context modeling frameworks: Ontology-driven context models (e.g. OWL/SOCAM) support representing, reasoning about, and sharing context hierarchically (arXiv).

Your HCFS merges these prior insights into a hybrid: directory navigation defines the query scope, and each path is backed by semantic context blobs in a DB, so agentic systems can zoom in or out contextually by moving up or down the tree.

2. Objectives & Scope

Design a virtual filesystem layer that maps hierarchical paths to context blobs.
Build a context storage system (DB) to hold context units, versioned and indexed.
Define APIs and syscalls for agents to:
- navigate context scope (cd‑style),
- request context retrieval,
- push new context,
- merge or inherit context across levels.
Enable decentralized context sharing: agents can publish updates at path-nodes; peer agents subscribe by tree‑paths.
Prototype on a controlled dataset / toy project tree to validate:
- latency,
- correct retrieval,
- hierarchical inheritance semantics.

3. System Architecture Overview

3.1 Virtual Filesystem Layer (e.g. FUSE or AIOS integration)

Presents standard POSIX (or AIOS‑style) tree structure.
Each directory or file node has metadata pointers into context‑blob IDs.
Traversal (e.g., ls, cd) triggers context lookup for that path.

3.2 Context Database Backend

Two possible designs:
- Relational/SQLite + versioned tables: simple, transactional, supports hierarchical inheritance via path parent pointers.
- Graph DB (e.g., Neo4j): ideal for multi-parent contexts, symlink-like context inheritance.
Context blobs include:
- blob ID,
- path(s) bound,
- timestamp/version, author/agent,
- embedding or semantic tags,
- content or summary.
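
The fields listed above could be captured in a small immutable record. The sketch below is only illustrative: the class name `ContextBlob`, the field names, and the choice of a UUID for the blob ID are assumptions, not part of the plan.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ContextBlob:
    """One immutable unit of context; all names here are hypothetical."""
    paths: tuple[str, ...]             # path(s) the blob is bound to
    content: str                       # full content or a summary
    agent: str                         # author/agent that produced it
    version: int = 1
    tags: tuple[str, ...] = ()         # semantic tags
    embedding: tuple[float, ...] = ()  # optional embedding vector
    blob_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


blob = ContextBlob(
    paths=("/projects/hcfs",),
    content="Phase 2 targets the DB backend.",
    agent="agent-1",
)
```

Freezing the dataclass matches the later idea of immutable blob storage: an update would create a new record with a higher `version` rather than mutate this one.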

3.3 Indexing & Embeddings

Generate embeddings of context blobs for semantic similarity retrieval (e.g. for context folding) (OpenReview, MDPI).
Use a combination of BM25 and embedding ranking (contextual retrieval) for accurate scope-based retrieval (TECHCOMMUNITY.MICROSOFT.COM).
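
One way to combine the two signals is a weighted sum of min-max normalized scores. The sketch below is a toy under stated assumptions: documents are pre-tokenized, the embedding vectors are hand-written stand-ins for a real model's output, and the function names and the `alpha` weight are hypothetical. It also assumes a non-empty document list.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against the query with Okapi BM25."""
    n = len(docs)
    avgdl = sum(len(d) for d in docs) / n
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def normalize(xs):
    """Min-max scale to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]


def hybrid_rank(query_tokens, query_vec, docs, doc_vecs, alpha=0.5):
    """Return doc indices, best first; alpha weights lexical vs semantic."""
    lexical = normalize(bm25_scores(query_tokens, docs))
    semantic = normalize([cosine(query_vec, v) for v in doc_vecs])
    combined = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)


docs = [["fuse", "mount", "options"],
        ["context", "blob", "versioning"],
        ["neo4j", "graph", "traversal"]]
doc_vecs = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]  # toy embeddings
ranking = hybrid_rank(["context", "versioning"], [0.1, 0.9, 0.2], docs, doc_vecs)
```

In HCFS the candidate set would first be restricted to blobs in the current path scope, so the ranking only decides order within that scope.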

3.4 API & Syscalls

context_cd(path): sets current context pointer.
context_get(depth=N): retrieves cumulative context from current node up N levels.
context_push(path, blob): insert new context tied to a path.
context_list(path): lists available context blobs at that path.
context_subscribe(path): agent registers to receive updates at a path.
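
A minimal in-memory sketch of these five calls follows. It is not a syscall layer: the class name `ContextStore`, the `callback` argument to `context_subscribe`, and the choice to notify subscribers of ancestor paths are all assumptions made for illustration.

```python
from collections import defaultdict


class ContextStore:
    """In-memory stand-in for the HCFS API (hypothetical)."""

    def __init__(self):
        self.blobs = defaultdict(list)        # path -> list of blob contents
        self.subscribers = defaultdict(list)  # path -> list of callbacks
        self.cwd = "/"

    def context_cd(self, path):
        self.cwd = path

    def context_push(self, path, blob):
        self.blobs[path].append(blob)
        # Assumption: a subscription on a path also covers its subtree.
        for p in self._chain(path):
            for callback in self.subscribers[p]:
                callback(path, blob)

    def context_list(self, path):
        return list(self.blobs.get(path, []))

    def context_subscribe(self, path, callback):
        self.subscribers[path].append(callback)

    def context_get(self, depth=0):
        """Collect blobs from the current node up `depth` levels, root-most first."""
        chain = self._chain(self.cwd)[: depth + 1]
        return [b for p in reversed(chain) for b in self.blobs.get(p, [])]

    @staticmethod
    def _chain(path):
        """Return [path, parent, ..., '/'] for an absolute path."""
        parts = [s for s in path.split("/") if s]
        return ["/" + "/".join(parts[:i]) for i in range(len(parts), 0, -1)] + ["/"]


store = ContextStore()
seen = []
store.context_subscribe("/proj", lambda path, blob: seen.append((path, blob)))
store.context_push("/", "global conventions")
store.context_push("/proj", "project goals")
store.context_push("/proj/src", "module notes")
store.context_cd("/proj/src")
nearby = store.context_get(depth=1)  # blobs from /proj, then /proj/src
```

Returning root-most blobs first lets a consumer treat later entries as more specific, which fits the override-style merge discussed later in the plan.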

4. Project Timeline & Milestones

| Phase | Duration | Deliverables |
| --- | --- | --- |
| Phase 0: Research & Design | 2 weeks | Literature review doc, architecture draft |
| Phase 1: Prototype FS layer | 4 weeks | Minimal FUSE‑based path→context mapping, CLI demo |
| Phase 2: Backend DB & storage | 4 weeks | Context blob storage, path linkage, versioning |
| Phase 3: Embedding & retrieval integration | 3 weeks | Embeddings + BM25 hybrid ranking for context relevance |
| Phase 4: API/Syscall layer scripting | 3 weeks | Python (or AIOS) service exposing navigation + push APIs |
| Phase 5: Agent integration & simulation | 3 weeks | Dummy AI agents navigating, querying, publishing context |
| Phase 6: Evaluation & refinement | 2 weeks | Usability, latency, retrieval relevance metrics |
| Phase 7: Write-up & publication | 2 weeks | Report, possible poster/paper submission |

5. Risks & Alternatives

Semantic vs hierarchical mismatch: Flat tag systems (e.g. Tagsistant) offer semantic tagging but lack path-based inheritance (research.ijcaonline.org, OpenReview, Wikipedia, arXiv, Anthropic).
Context explosion: many small blobs flooding the DB—mitigate via summarization/folding.
Performance trade‑offs: FS lookups must stay acceptable; versioned graph storage might slow down. Consider caching snapshots at each node.

6. Peer‑Reviewed References

David Gifford et al., Semantic file systems, ACM Operating Systems Review (1991) (Wikipedia)
ICLR 2025: From Commands to Prompts: LLM-based Semantic File System for AIOS (LSFS) (OpenReview)
Xiaoyu et al., Transformer-based path sequence modeling for file‑path anomaly detection (MDPI)
Tao Gu et al., Ontology‑based Context Model in Intelligent Environments (SOCAM) (arXiv)

7. Next Steps

Review cited literature, build an annotated bibliography.
Choose backend stack (SQLite vs graph DB) and test embedding pipeline.
Begin Phase 1: implementing minimal context‑aware FS mock.

Core Architecture Considerations

FS design on top of FUSE and DB schema selection with versioning.

🖥️ 1. FS Architecture: FUSE Layer & Path‑Context Mapping

Why FUSE makes sense

FUSE (Filesystem in Userspace) provides a widely used, flexible interface for prototyping new FS models without kernel hacking, enabling rapid development of virtual filesystems that you can mount and interact with via standard POSIX tools (IBM Research, Wikipedia).
Performance varies—but optimized designs or alternatives like RFUSE help improve kernel‑userspace communication latency and throughput, making user‑space FS viable even in demanding use cases (USENIX).

Path‑to‑Context Mapping Schema

You’d implement a mapping where each path (directory or file) is bound to zero or more context blob IDs. Concepts:

Directory traversal (cd, ls) triggers path-based context lookups in your backend.
File reads (e.g. readfile(context)) return the merged or inherited context blob(s) for that node.
Inheritance: a context layer at /a/b/c implicitly inherits from the context at /a/b, /a, and /, as needed.
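
The inheritance rule can be sketched with the standard library's `PurePosixPath`. The helper names `inheritance_chain` and `resolve`, and the `bindings` dictionary standing in for the backend lookup, are hypothetical.

```python
from pathlib import PurePosixPath


def inheritance_chain(path):
    """Return the path and its ancestors, nearest first: /a/b/c, /a/b, /a, /."""
    node = PurePosixPath(path)
    return [str(node)] + [str(p) for p in node.parents]


def resolve(path, bindings):
    """Gather blob IDs root-first so nearer levels can refine farther ones."""
    ids = []
    for p in reversed(inheritance_chain(path)):
        ids.extend(bindings.get(p, []))
    return ids


# Toy path -> blob ID bindings; a real system would query the context DB.
bindings = {"/": ["b1"], "/a/b": ["b2", "b3"], "/a/b/c": ["b4"]}
chain = inheritance_chain("/a/b/c")
inherited = resolve("/a/b/c", bindings)
```

Paths with no bound blobs (here `/a`) simply contribute nothing, which matches the "zero or more context blob IDs" wording above.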

Caching & Merge Layers

Cache context snapshots at each directory layer to reduce repeated database hits.
Provide configurable merge strategies (union, override, summarization) to maintain efficient context retrieval.
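
A sketch of the three merge strategies with a per-path snapshot cache follows. The names are hypothetical, `load_layers` stands in for the database query, and the summarizer is a truncation placeholder, since the plan does not specify how summarization would work.

```python
def merge_union(layers):
    """Keep every distinct blob, root-most layer first."""
    seen, merged = set(), []
    for layer in layers:
        for blob in layer:
            if blob not in seen:
                seen.add(blob)
                merged.append(blob)
    return merged


def merge_override(layers):
    """Use only the deepest non-empty layer."""
    for layer in reversed(layers):
        if layer:
            return list(layer)
    return []


def merge_summarize(layers, max_chars=40):
    """Placeholder summarizer: truncate each blob (an assumption, not a real summary)."""
    return [blob[:max_chars] for blob in merge_union(layers)]


STRATEGIES = {"union": merge_union, "override": merge_override, "summarize": merge_summarize}


class SnapshotCache:
    """Memoize merged context per (path, strategy); invalidate on writes."""

    def __init__(self, load_layers):
        self.load_layers = load_layers  # callable: path -> list of layers, root first
        self.snapshots = {}
        self.misses = 0

    def get(self, path, strategy="union"):
        key = (path, strategy)
        if key not in self.snapshots:
            self.misses += 1
            self.snapshots[key] = STRATEGIES[strategy](self.load_layers(path))
        return self.snapshots[key]

    def invalidate(self, path):
        """Drop snapshots for `path` and everything beneath it."""
        prefix = path.rstrip("/") + "/"
        stale = [k for k in self.snapshots if k[0] == path or k[0].startswith(prefix)]
        for key in stale:
            del self.snapshots[key]


layers_by_path = {"/a/b": [["root rule"], ["a rule", "root rule"], []]}
cache = SnapshotCache(lambda p: layers_by_path[p])
union_view = cache.get("/a/b")
cache.get("/a/b")                       # served from the snapshot
override_view = cache.get("/a/b", "override")
```

Invalidation walks down the subtree because a write at `/a` changes what every descendant inherits.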

📦 2. DB Design: Relational vs Graph & Context Versioning

Relational (e.g. SQLite/PostgreSQL)

Strong transactional guarantees and simple schema: tables like Blobs(blob_id, path, version, content, timestamp) plus a PathHierarchy(path, parent_path) table for inheritance.
Good for simple single-parent hierarchies and transactional versioning (with version numbers or history tables).
But joins across deep path hierarchies can get costly; semantic relationships or multi-parent inheritance are more cumbersome.
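
For a single-parent hierarchy, a recursive common table expression can walk the parent pointers in one query instead of one join per level. The sketch below uses Python's built-in `sqlite3` module; the snake_case table names and the trimmed column set (no timestamp) are simplifications of the tables named above, and the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE path_hierarchy (path TEXT PRIMARY KEY, parent_path TEXT);
CREATE TABLE blobs (
    blob_id INTEGER PRIMARY KEY,
    path TEXT REFERENCES path_hierarchy(path),
    version INTEGER,
    content TEXT
);
INSERT INTO path_hierarchy VALUES ('/', NULL), ('/a', '/'), ('/a/b', '/a');
INSERT INTO blobs (path, version, content) VALUES
    ('/', 1, 'global conventions'),
    ('/a', 1, 'project goals'),
    ('/a/b', 1, 'module notes');
""")

# Walk from the requested path up to the root, then join blobs, root first.
INHERITED = """
WITH RECURSIVE chain(path, parent_path, depth) AS (
    SELECT path, parent_path, 0 FROM path_hierarchy WHERE path = ?
    UNION ALL
    SELECT h.path, h.parent_path, c.depth + 1
    FROM path_hierarchy h JOIN chain c ON h.path = c.parent_path
)
SELECT c.path, b.content
FROM chain c JOIN blobs b ON b.path = c.path
ORDER BY c.depth DESC, b.version
"""

rows = conn.execute(INHERITED, ("/a/b",)).fetchall()
```

The recursion stops at the root because its `parent_path` is NULL, which never matches a join. Cost still grows with tree depth, which is the trade-off the bullet above points at.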

Graph Database (e.g. Neo4j)

Nodes represent paths and context blobs; edges represent parent-child, semantic relations, and "derived" links.
Ideal for multi-parent or symlink-like context inheritance, semantic network traversal, or hierarchy restructuring (Wikipedia).
Enables queries like: “find all context blobs reachable within N hops from path X,” or “retrieve peers with similar context semantics.”
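
The "within N hops" query is a bounded breadth-first search. The sketch below runs it over a plain adjacency dictionary rather than a graph database; the function name, node labels, and toy graph are hypothetical.

```python
from collections import deque


def blobs_within(graph, blob_nodes, start, max_hops):
    """Breadth-first search: return blob nodes at most `max_hops` edges from `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if node in blob_nodes:
            found.append(node)
        if hops == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return found


# Toy graph: path nodes link to parent/child paths and to their blobs.
graph = {
    "/a": ["/a/b", "blob:1"],
    "/a/b": ["/a", "/a/b/c", "blob:2"],
    "/a/b/c": ["/a/b", "blob:3"],
}
blob_nodes = {"blob:1", "blob:2", "blob:3"}
two_hops = blobs_within(graph, blob_nodes, "/a", 2)
```

A graph engine would express the same idea as a variable-length path pattern; the Python version only shows the traversal logic.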

Hybrid Approaches

A relational backend augmented with semantic tables or converted into a graph as needed for richer queries (memgraph.com, link.springer.com).
Example: relational for version history and base structure, graph/cloud-based embeddings for semantic relationships.

Context Versioning

Must support hierarchical version control: each blob should have metadata like blob_id, version_id, parent_version, agent_id, timestamp.
You can implement simple version chains in a relational DB (each row pointing at its parent version), or model them in a graph with edges representing “version-of” relationships.
Track changes with immutable blob storage; allow rollbacks or context diffs.
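
A minimal append-only version chain with rollback and diff could look like the sketch below. The class names are hypothetical, the metadata follows the fields listed above, and the timestamp is left out to keep the example deterministic.

```python
import difflib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Version:
    version_id: int
    blob_id: str
    parent_version: Optional[int]
    agent_id: str
    content: str


class VersionedStore:
    """Append-only store: every write creates a new immutable Version."""

    def __init__(self):
        self.versions = {}  # version_id -> Version
        self.heads = {}     # blob_id -> current version_id

    def commit(self, blob_id, content, agent_id):
        version_id = len(self.versions) + 1
        self.versions[version_id] = Version(
            version_id, blob_id, self.heads.get(blob_id), agent_id, content
        )
        self.heads[blob_id] = version_id
        return version_id

    def rollback(self, blob_id):
        """Move the head back to its parent; the newer version is kept, not deleted."""
        parent = self.versions[self.heads[blob_id]].parent_version
        if parent is None:
            raise ValueError("already at the first version")
        self.heads[blob_id] = parent
        return parent

    def diff(self, old_id, new_id):
        old, new = self.versions[old_id], self.versions[new_id]
        return list(difflib.unified_diff(
            old.content.splitlines(), new.content.splitlines(), lineterm=""
        ))


vstore = VersionedStore()
v1 = vstore.commit("b1", "use SQLite", "agent-1")
v2 = vstore.commit("b1", "use PostgreSQL", "agent-2")
changes = vstore.diff(v1, v2)
vstore.rollback("b1")
```

Because rollback only moves the head pointer, a commit made after a rollback gets the older version as its parent, which gives branching for free.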

🔍 Comparison Table

| Feature | Relational DB | Graph DB |
| --- | --- | --- |
| Hierarchy resilience | Works well for a strict tree; joins required for multi-parent | Native multi-parent and traversals |
| Performance | Fast for simple lookups; may slow with joins | Traversal cost scales with the subgraph visited rather than total data size |
| Versioning | Straightforward with version tables; chronology easy | Version graph edges, easier branching/merging |
| Semantic links | Requires additional tables or indexes | First-class properties/relationships |
| Cost & tooling | SQLite is lightweight and well-known | Requires graph engine (Neo4j, etc.) |

🧠 Integration Architecture

FS Layer

Run FUSE-based FS presenting standard directories/files.
On lookup, FS resolves the path and queries DB for context blobs.
On read, the FS returns the merged context string; a write or push maps to context_push(path, blob_content), which can also be exposed via MCP endpoints.
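
The lookup/read/write flow can be mocked without mounting anything. The sketch below does not use a FUSE binding: the class `MockContextFS` and its method names only loosely mirror filesystem callbacks and are assumptions for illustration.

```python
import errno


class MockContextFS:
    """FUSE-free mock of the FS layer (hypothetical names throughout)."""

    def __init__(self):
        self.blobs = {"/": []}  # known path -> list of context strings

    def mkdir(self, path):
        self.blobs.setdefault(path, [])

    def lookup(self, path):
        """Resolve a path to its own blobs, or fail like a missing file would."""
        if path not in self.blobs:
            raise OSError(errno.ENOENT, "no such path", path)
        return self.blobs[path]

    def read(self, path):
        """Return the merged context for a path: ancestors first, then the node."""
        self.lookup(path)
        parts = [s for s in path.split("/") if s]
        chain = ["/"] + ["/" + "/".join(parts[: i + 1]) for i in range(len(parts))]
        merged = [b for p in chain for b in self.blobs.get(p, [])]
        return "\n".join(merged)

    def write(self, path, data):
        """A write maps to context_push(path, blob_content)."""
        self.context_push(path, data)
        return len(data)

    def context_push(self, path, blob_content):
        self.lookup(path).append(blob_content)


fs = MockContextFS()
fs.mkdir("/a")
fs.mkdir("/a/b")
fs.write("/", "root")
fs.write("/a/b", "leaf")
merged_text = fs.read("/a/b")
```

A real mount would put the same logic behind the callbacks of whichever FUSE binding is chosen, with the dictionary replaced by the context DB.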

Backend DB Schema Sketch

Relational (SQL)

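-- PostgreSQL dialect; in SQLite, declare blob_id as INTEGER PRIMARY KEY instead of SERIAL.
CREATE TABLE path (
  path TEXT PRIMARY KEY,
  parent TEXT REFERENCES path(path)  -- NULL for the root path
);
CREATE TABLE context_blob (
  blob_id SERIAL PRIMARY KEY,
  path TEXT REFERENCES path(path),  -- path the blob is bound to
  version INT,
  parent_blob INT REFERENCES context_blob(blob_id),  -- previous version; NULL for the first
  agent TEXT,  -- authoring agent
  timestamp TIMESTAMP,
  content TEXT
);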

Graph (Property Graph)

Node labels:
- (:Path {path: "...", last_blob: id})
- (:Blob {blob_id, version, agent, timestamp, content})
Edges:
- (Path)-[:HAS_BLOB]->(Blob)
- (Blob)-[:PARENT_VERSION]->(Blob)
- (Path)-[:PARENT_PATH]->(Path)

🧩 Summary & Recommendation

A FUSE-based FS layer is well-suited for interface compatibility and rapid prototyping; RFUSE-style frameworks may help with performance if you scale.
For the backend, if you expect strict single-parent hierarchical contexts, a relational DB is safe and simple.
If you want multi-parent inheritance, semantic linking, branching, or merging, a graph DB offers greater flexibility.
Versioning is supported in both: relational via version chains and history tables; graph via version edges.
Hybrid: use PostgreSQL with graph extensions or embed a graph layer atop SQL for embeddings and semantic dive queries (academia.edu, sciencedirect.com, milvus.io, filesystems.org).
