HCFS/PROJECT_PLAN.md

---

# PROJECT\_PLAN.md

## 📘 Title

**Context‑Aware Hierarchical Context File System (HCFS)**: Unifying file system paths with context blobs for agentic AI cognition

---

## 1. Research Motivation & Literature Review 🧠

* **Semantic and context‑aware file systems**: Gifford et al. (1991) proposed early semantic file systems using directory paths as semantic queries ([Wikipedia][1]). Later work explored tag‑based and ontology‑based systems for richer metadata and context-aware retrieval ([Wikipedia][1]).
* **LLM‑driven semantic FS (LSFS)**: LSFS (ICLR 2025) proposes integrating vector DBs and semantic indexing into a filesystem that supports prompt-driven file operations and semantic rollback ([OpenReview][2]).
* **Path-structure embeddings**: Recent Transformer-based work shows file paths can be modeled as sequences for semantic anomaly detection—capturing hierarchy and semantics in embeddings ([MDPI][3]).
* **Context modeling frameworks**: Ontology-driven context models (e.g. OWL/SOCAM) support representing, reasoning about, and sharing context hierarchically ([arXiv][4]).

HCFS merges these prior insights into a hybrid: directory navigation defines the query scope, backed by semantic context blobs in a DB, so that agentic systems can zoom in and out of context by moving through the tree.

---

## 2. Objectives & Scope

1. Design a **virtual filesystem layer** that maps hierarchical paths to context blobs.
2. Build a **context storage system** (DB) to hold context units, versioned and indexed.
3. Define **APIs and syscalls** for agents to:

   * navigate context scope (`cd`‑style),
   * request context retrieval,
   * push new context,
   * merge or inherit context across levels.
4. Enable **decentralized context sharing**: agents can publish updates at path-nodes; peer agents subscribe by tree‑paths.
5. Prototype on a controlled dataset / toy project tree to validate:

   * latency,
   * correct retrieval,
   * hierarchical inheritance semantics.

---

## 3. System Architecture Overview

### 3.1 Virtual Filesystem Layer (e.g. FUSE or AIOS integration)

* Presents standard POSIX (or AIOS‑style) tree structure.
* Each directory or file node has metadata pointers into context‑blob IDs.
* Traversal (e.g., `ls`, `cd`) triggers context lookup for that path.

### 3.2 Context Database Backend

* Two possible designs:

  * **Relational/SQLite + versioned tables**: simple, transactional, supports hierarchical inheritance via path parent pointers.
  * **Graph DB (e.g., Neo4j)**: ideal for multi-parent contexts, symlink-like context inheritance.
* Context blobs include:

  * blob ID,
  * path(s) bound,
  * timestamp/version, author/agent,
  * embedding or semantic tags,
  * content or summary.
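
As a rough illustration of the fields listed above, a context blob could be modeled as an immutable record. This is a minimal sketch: the class name `ContextBlob`, the field names, and the use of a UUID for the blob ID are assumptions, not part of the plan.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional
import uuid


@dataclass(frozen=True)
class ContextBlob:
    """One context unit; all names here are illustrative assumptions."""
    paths: List[str]                 # path(s) the blob is bound to
    content: str                     # full content or a summary
    agent: str                       # author/agent that produced the blob
    version: int = 1
    tags: List[str] = field(default_factory=list)   # semantic tags
    embedding: Optional[List[float]] = None          # set by an embedding step
    blob_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


blob = ContextBlob(paths=["/projects/hcfs"], content="Design notes", agent="agent-1")
```

Freezing the dataclass fits the versioning goal: an update produces a new blob with a higher `version` rather than mutating an existing one.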

### 3.3 Indexing & Embeddings

* Generate embeddings of context blobs for semantic similarity retrieval (e.g. for context folding) ([OpenReview][5], [OpenReview][2], [MDPI][3]).
* Use a combination of BM25 and embedding ranking (contextual retrieval) for accurate scope-based retrieval ([TECHCOMMUNITY.MICROSOFT.COM][6]).
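
The hybrid ranking idea can be sketched in pure Python. This is an illustration under stated assumptions: the BM25 scorer is a small textbook implementation, the vectors are hand-written stand-ins for real embeddings, and the weighted sum controlled by `alpha` is one possible fusion rule (the cited articles may combine scores differently). All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document for the query."""
    tokenized = [tokenize(d) for d in docs]
    avg_len = sum(len(t) for t in tokenized) / len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    n = len(docs)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def hybrid_rank(query, query_vec, docs, doc_vecs, alpha=0.5):
    """Rank doc indices by alpha * normalized BM25 + (1 - alpha) * cosine."""
    lexical = bm25_scores(query, docs)
    top = max(lexical) or 1.0   # avoid dividing by zero when nothing matches
    combined = [
        alpha * (lex / top) + (1 - alpha) * cosine(query_vec, vec)
        for lex, vec in zip(lexical, doc_vecs)
    ]
    return sorted(range(len(docs)), key=lambda i: combined[i], reverse=True)


docs = [
    "fuse mount options for the context filesystem",
    "sqlite schema for context blobs",
    "meeting notes about lunch",
]
doc_vecs = [[0.9, 0.1], [0.2, 0.9], [0.1, 0.1]]   # hypothetical embeddings
ranking = hybrid_rank("sqlite schema", [0.1, 1.0], docs, doc_vecs)
```

In a real pipeline the candidate set would first be restricted to blobs bound to the current path scope, and the vectors would come from an embedding model.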

### 3.4 API & Syscalls

* `context_cd(path)`: sets current context pointer.
* `context_get(depth=N)`: retrieves cumulative context from current node up N levels.
* `context_push(path, blob)`: insert new context tied to a path.
* `context_list(path)`: lists available context blobs at that path.
* `context_subscribe(path)`: agent registers to receive updates at a path.
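
To make the call semantics concrete, here is a minimal in-memory sketch of the five calls. It rests on several assumptions not fixed by the plan: blobs are plain strings, `context_subscribe` takes a callback as a second argument, and a subscription at a path also receives updates pushed anywhere below it. The class `ContextStore` is hypothetical.

```python
import posixpath


class ContextStore:
    """In-memory stand-in for the HCFS backend (illustrative only)."""

    def __init__(self):
        self.blobs = {}         # path -> list of blob strings
        self.subscribers = {}   # path -> list of callbacks
        self.cwd = "/"

    @staticmethod
    def _norm(path):
        # Force a single leading slash and collapse "." / ".." segments.
        return posixpath.normpath("/" + path.lstrip("/"))

    def context_cd(self, path):
        self.cwd = self._norm(posixpath.join(self.cwd, path))
        return self.cwd

    def context_push(self, path, blob):
        path = self._norm(path)
        self.blobs.setdefault(path, []).append(blob)
        # Notify subscribers at this path and at every ancestor.
        node = path
        while True:
            for callback in self.subscribers.get(node, []):
                callback(path, blob)
            if node == "/":
                break
            node = posixpath.dirname(node)

    def context_list(self, path):
        return list(self.blobs.get(self._norm(path), []))

    def context_get(self, depth=0):
        """Blobs from the current node and up to `depth` ancestors, root-most first."""
        collected, node = [], self.cwd
        for _ in range(depth + 1):
            collected.append((node, list(self.blobs.get(node, []))))
            if node == "/":
                break
            node = posixpath.dirname(node)
        return collected[::-1]

    def context_subscribe(self, path, callback):
        self.subscribers.setdefault(self._norm(path), []).append(callback)


store = ContextStore()
events = []
store.context_subscribe("/projects", lambda p, b: events.append((p, b)))
store.context_push("/projects", "Team conventions")
store.context_push("/projects/hcfs", "FUSE prototype notes")
store.context_cd("/projects/hcfs")
ctx = store.context_get(depth=1)
```

Here `ctx` holds the `/projects` layer followed by the `/projects/hcfs` layer, and the subscriber at `/projects` sees both pushes.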

---

## 4. Project Timeline & Milestones

| Phase                                          | Duration | Deliverables                                             |
| ---------------------------------------------- | -------- | -------------------------------------------------------- |
| **Phase 0: Research & Design**                 | 2 weeks  | Literature review doc, architecture draft                |
| **Phase 1: Prototype FS layer**                | 4 weeks  | Minimal FUSE‑based path→context mapping, CLI demo        |
| **Phase 2: Backend DB & storage**              | 4 weeks  | Context blob storage, path linkage, versioning           |
| **Phase 3: Embedding & retrieval integration** | 3 weeks  | Embeddings + BM25 hybrid ranking for context relevance   |
| **Phase 4: API/Syscall layer scripting**       | 3 weeks  | Python (or AIOS) service exposing navigation + push APIs |
| **Phase 5: Agent integration & simulation**    | 3 weeks  | Dummy AI agents navigating, querying, publishing context |
| **Phase 6: Evaluation & refinement**           | 2 weeks  | Usability, latency, retrieval relevance metrics          |
| **Phase 7: Write-up & publication**            | 2 weeks  | Report, possible poster/paper submission                 |

---

## 5. Risks & Alternatives

* **Semantic vs hierarchical mismatch**: Flat tag systems (e.g. Tagsistant) offer semantic tagging but lack path-based inheritance ([research.ijcaonline.org][7], [OpenReview][2], [Wikipedia][1], [arXiv][8], [Anthropic][9], [OpenReview][5], [Wikipedia][10]).
* **Context explosion**: many small blobs flooding the DB—mitigate via summarization/folding.
* **Performance trade‑offs**: FS lookup latency must stay acceptable, and versioned graph storage may slow lookups down. Consider caching snapshots at each node.

---

## 6. Peer‑Reviewed References

* David Gifford et al., *Semantic file systems*, ACM SIGOPS Operating Systems Review (1991) ([Wikipedia][1])
* ICLR 2025: *From Commands to Prompts: LLM-based Semantic File System for AIOS* (LSFS) ([OpenReview][2])
* Xiaoyu et al., *Effective Context-Aware File Path Embeddings for Anomaly Detection* ([MDPI][3])
* Tao Gu et al., *An Ontology-based Context Model in Intelligent Environments* (SOCAM) ([arXiv][4])

---

## 7. Next Steps

* Review cited literature, build an annotated bibliography.
* Choose backend stack (SQLite vs graph DB) and test embedding pipeline.
* Begin Phase 1: implementing minimal context‑aware FS mock.

---

[1]: https://en.wikipedia.org/wiki/Semantic_file_system?utm_source=chatgpt.com "Semantic file system"
[2]: https://openreview.net/forum?id=2G021ZqUEZ&utm_source=chatgpt.com "From Commands to Prompts: LLM-based Semantic File System for AIOS"
[3]: https://www.mdpi.com/2079-8954/13/6/403?utm_source=chatgpt.com "Effective Context-Aware File Path Embeddings for Anomaly Detection - MDPI"
[4]: https://arxiv.org/abs/2003.05055?utm_source=chatgpt.com "An Ontology-based Context Model in Intelligent Environments"
[5]: https://openreview.net/pdf?id=2G021ZqUEZ&utm_source=chatgpt.com "From Commands to Prompts: LLM-based Semantic File System for AIOS - OpenReview"
[6]: https://techcommunity.microsoft.com/blog/azure-ai-services-blog/building-a-contextual-retrieval-system-for-improving-rag-accuracy/4271924?utm_source=chatgpt.com "Building a Contextual Retrieval System for Improving RAG Accuracy"
[7]: https://research.ijcaonline.org/volume121/number1/pxc3904433.pdf?utm_source=chatgpt.com "A Survey on Different File System Approach - research.ijcaonline.org"
[8]: https://arxiv.org/abs/1909.10123?utm_source=chatgpt.com "SplitFS: Reducing Software Overhead in File Systems for Persistent Memory"
[9]: https://www.anthropic.com/news/contextual-retrieval?utm_source=chatgpt.com "Introducing Contextual Retrieval \ Anthropic"
[10]: https://en.wikipedia.org/wiki/Tagsistant?utm_source=chatgpt.com "Tagsistant"


---

# Core Architecture Considerations
**FS design on top of FUSE** and **DB schema selection with versioning**.

---

## 🖥️ 1. FS Architecture: FUSE Layer & Path‑Context Mapping

### Why FUSE makes sense

* FUSE (Filesystem in Userspace) provides a widely used, flexible interface for prototyping new FS models without kernel hacking, enabling rapid development of virtual filesystems that you can mount and interact with via standard POSIX tools ([IBM Research][1], [Wikipedia][2]).
* Performance varies—but optimized designs or alternatives like RFUSE help improve kernel‑userspace communication latency and throughput, making user‑space FS viable even in demanding use cases ([USENIX][3]).

### Path‑to‑Context Mapping Schema

You’d implement a mapping where each path (directory or file) is bound to zero or more **context blob IDs**. Concepts:

* Directory traversal (`cd`, `ls`) triggers path-based context lookups in your backend.
* File reads (e.g. `readfile(context)`) return the merged or inherited context blob(s) for that node.
* Inheritance: a context layer at `/a/b/c` implicitly inherits from `/a/b`, `/a`, and `/` context as-needed.
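
The inheritance rule amounts to walking from a node up to the root. A minimal sketch, assuming absolute POSIX-style paths; the function name `inheritance_chain` is hypothetical:

```python
import posixpath


def inheritance_chain(path):
    """Return the paths a node inherits from, root first, ending with the node."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute path")
    # Collapse duplicate slashes and "." / ".." segments before walking up.
    node = posixpath.normpath("/" + path.lstrip("/"))
    chain = [node]
    while node != "/":
        node = posixpath.dirname(node)
        chain.append(node)
    return chain[::-1]


chain = inheritance_chain("/a/b/c")   # ["/", "/a", "/a/b", "/a/b/c"]
```

Returning the chain root-first makes it straightforward to apply layers in order, so that deeper paths can refine or override what they inherit.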

### Caching & Merge Layers

* Cache context snapshots at each directory layer to reduce repeated database hits.
* Provide configurable merge strategies (union, override, summarization) to maintain efficient context retrieval.
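
A sketch of how two of these merge strategies and a per-path snapshot cache might fit together. It assumes each context layer is a flat key/value dictionary, which the plan does not specify; `LAYERS`, `snapshot`, and the strategy names are illustrative. Summarization is left out because it would need a model call.

```python
from functools import lru_cache

# Hypothetical per-path context layers.
LAYERS = {
    "/": {"style": "concise", "lang": "en"},
    "/proj": {"lang": "de", "owner": "team-a"},
}


def merge_override(layers):
    """Deeper layers override keys set by shallower ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


def merge_union(layers):
    """Keep every distinct value seen for a key, in root-first order."""
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            values = merged.setdefault(key, [])
            if value not in values:
                values.append(value)
    return merged


STRATEGIES = {"override": merge_override, "union": merge_union}


@lru_cache(maxsize=1024)
def snapshot(path_chain, strategy="override"):
    """Cached merged context for a tuple of paths ordered root first."""
    layers = [LAYERS.get(p, {}) for p in path_chain]
    return STRATEGIES[strategy](layers)


merged = snapshot(("/", "/proj"))
combined = snapshot(("/", "/proj"), "union")
```

Two caveats with this approach: the cache hands back the same dictionary object on every hit, so callers should treat it as read-only, and `snapshot.cache_clear()` (or a finer-grained invalidation) has to run whenever a new blob is pushed to a path in the chain.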

---

## 📦 2. DB Design: Relational vs Graph & Context Versioning

### Relational (e.g. SQLite/PostgreSQL)

* Strong transactional guarantees and simple schema: tables like `Blobs(blob_id, path, version, content, timestamp)` plus a `PathHierarchy(path, parent_path)` table for inheritance.
* Good for simple single-parent hierarchies and transactional versioning (with version numbers or history tables).
* But joins across deep path hierarchies can get costly; semantic relationships or multi-parent inheritance are more cumbersome.

### Graph Database (e.g. Neo4j)

* Nodes represent paths and context blobs; edges represent parent-child, semantic relations, and "derived" links.
* Ideal for multi-parent or symlink-like context inheritance, semantic network traversal, or hierarchy restructuring ([Wikipedia][4]).
* Enables queries like: “find all context blobs reachable within N hops from path X,” or “retrieve peers with similar context semantics.”
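
The "within N hops" query can be illustrated without a graph engine as a breadth-first search over adjacency lists. The node naming scheme (`path:` and `blob:` prefixes) and the sample edges are assumptions made for this sketch; in a graph database the same idea would typically be written as a variable-length path pattern.

```python
from collections import deque

# Hypothetical property graph stored as adjacency lists.
EDGES = {
    "path:/": ["blob:0"],
    "path:/a": ["path:/", "blob:1"],
    "path:/a/b": ["path:/a", "blob:2"],
    "blob:2": ["blob:1"],   # version-style link between blobs
}


def blobs_within(start, max_hops):
    """Breadth-first search for blob nodes reachable in at most max_hops."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if node.startswith("blob:"):
            found.append((node, hops))
        if hops == max_hops:
            continue   # do not expand beyond the hop limit
        for neighbor in EDGES.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return found


nearby = blobs_within("path:/a/b", 2)
```

Each blob is reported with the shortest hop count at which it was reached, which could double as a crude relevance signal (closer context first).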

### Hybrid Approaches

* A relational backend augmented with semantic tables or converted into a graph as needed for richer queries ([memgraph.com][5], [link.springer.com][6]).
* Example: relational for version history and base structure, graph/cloud-based embeddings for semantic relationships.

### Context Versioning

* Must support **hierarchical version control**: each blob should have metadata like `blob_id`, `version_id`, `parent_version`, `agent_id`, `timestamp`.
* You can implement simple version chains in a relational DB, or model versions in a graph (e.g. edges representing “version-of” relationships).
* Track changes with immutable blob storage; allow rollbacks or context diffs.
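
A minimal sketch of an append-only version chain with rollback and diff, using the metadata fields named above (the timestamp is omitted for brevity). The class names and the choice to implement rollback as a new commit are assumptions, not a settled design.

```python
import difflib
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class BlobVersion:
    version_id: int
    parent_version: Optional[int]
    agent_id: str
    content: str


class VersionChain:
    """Append-only history for one blob; nothing is ever overwritten."""

    def __init__(self):
        self.versions: Dict[int, BlobVersion] = {}
        self.head: Optional[int] = None

    def commit(self, agent_id, content):
        version_id = len(self.versions) + 1
        self.versions[version_id] = BlobVersion(
            version_id, self.head, agent_id, content
        )
        self.head = version_id
        return version_id

    def rollback(self, version_id, agent_id):
        """Restore old content as a new head, keeping the full history."""
        return self.commit(agent_id, self.versions[version_id].content)

    def diff(self, old, new):
        a = self.versions[old].content.splitlines()
        b = self.versions[new].content.splitlines()
        return list(difflib.unified_diff(a, b, f"v{old}", f"v{new}", lineterm=""))


chain = VersionChain()
v1 = chain.commit("agent-1", "use sqlite")
v2 = chain.commit("agent-2", "use neo4j")
changes = chain.diff(v1, v2)
v3 = chain.rollback(v1, "agent-1")
```

Treating rollback as a new version keeps the chain linear and auditable; a branching model would instead allow several versions to share one `parent_version`.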

---

## 🔍 Comparison Table

| Feature                  | Relational DB                                               | Graph DB                                      |
| ------------------------ | ----------------------------------------------------------- | --------------------------------------------- |
| **Hierarchy resilience** | Works well for strict tree; joins required for multi-parent | Native multi-parent and traversals            |
| **Performance**          | Fast for simple lookups; may slow with joins                | O(1) per hop; total cost grows with traversal |
| **Versioning**           | Straightforward with version tables; chronology easy        | Version graph edges, easier branching/merging |
| **Semantic links**       | Requires additional tables or indexes                       | First-class properties/relationships          |
| **Cost & tooling**       | SQLite is lightweight and well-known                        | Requires graph engine (Neo4j, etc.)           |

---

## 🧠 Integration Architecture

### FS Layer

* Run FUSE-based FS presenting standard directories/files.
* On `lookup`, FS resolves the path and queries DB for context blobs.
* On `read`, the FS returns the merged context string; `write` or `push` maps to `context_push(path, blob_content)`, which can also be exposed as an MCP endpoint.

### Backend DB Schema Sketch

**Relational (SQL)**

```sql
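-- One row per node in the hierarchy; the root has a NULL parent.
CREATE TABLE path (
  path TEXT PRIMARY KEY,
  parent TEXT REFERENCES path(path)
);
-- Versioned context blobs bound to a path.
-- SERIAL is PostgreSQL syntax; in SQLite use INTEGER PRIMARY KEY instead.
CREATE TABLE context_blob (
  blob_id SERIAL PRIMARY KEY,
  path TEXT REFERENCES path(path),
  version INT,
  parent_blob INT REFERENCES context_blob(blob_id),  -- previous version, NULL for the first
  agent TEXT,
  timestamp TIMESTAMP,
  content TEXT
);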
```
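
To show how inheritance could be resolved against a schema like this, here is a sketch using Python's built-in `sqlite3` module and a recursive common table expression. The schema is a trimmed SQLite adaptation (`INTEGER PRIMARY KEY` in place of `SERIAL`), and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE path (
  path TEXT PRIMARY KEY,
  parent TEXT REFERENCES path(path)
);
CREATE TABLE context_blob (
  blob_id INTEGER PRIMARY KEY,   -- SQLite assigns the id automatically
  path TEXT REFERENCES path(path),
  version INT,
  content TEXT
);
INSERT INTO path VALUES ('/', NULL), ('/a', '/'), ('/a/b', '/a');
INSERT INTO context_blob (path, version, content) VALUES
  ('/', 1, 'global rules'),
  ('/a', 1, 'project a notes'),
  ('/a/b', 1, 'module b details');
""")

# Walk parent pointers from the requested path up to the root,
# then return the blobs root-first.
INHERITED = """
WITH RECURSIVE chain(path, parent, depth) AS (
  SELECT path, parent, 0 FROM path WHERE path = ?
  UNION ALL
  SELECT p.path, p.parent, chain.depth + 1
  FROM path AS p JOIN chain ON p.path = chain.parent
)
SELECT chain.path, b.content
FROM chain JOIN context_blob AS b ON b.path = chain.path
ORDER BY chain.depth DESC;
"""

rows = conn.execute(INHERITED, ("/a/b",)).fetchall()
```

For `/a/b` the query returns the root blob first, then `/a`, then `/a/b`. A single recursive query avoids one round trip per ancestor, though deep trees may still benefit from the snapshot caching discussed earlier.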

**Graph (Property Graph)**

* Node labels:

  * `(:Path {path: "...", last_blob: id})`
  * `(:Blob {blob_id, version, agent, timestamp, content})`
* Edges:

  * `(Path)-[:HAS_BLOB]->(Blob)`
  * `(Blob)-[:PARENT_VERSION]->(Blob)`
  * `(Path)-[:PARENT_PATH]->(Path)`

---

## 🧩 Summary & Recommendation

* A **FUSE-based FS layer** is well-suited for interface compatibility and rapid prototyping; RFUSE-style frameworks may help with performance if you scale.
* For backend, if you expect **strict single-parent hierarchical contexts**, relational DB is safe and simple.
* If you want **multi-parent inheritance, semantic linking, branching, merging**, graph DB offers greater flexibility.
* Versioning is supported in both: relational via version chains and history tables; graph via version edges.
* Hybrid: use PostgreSQL with graph extensions or embed a graph layer atop SQL for embeddings and semantic dive queries ([academia.edu][7], [sciencedirect.com][8], [milvus.io][9], [filesystems.org][10]).

---

[1]: https://research.ibm.com/publications/to-fuse-or-not-to-fuse-performance-of-user-space-file-systems?utm_source=chatgpt.com "To fuse or not to fuse: Performance of user-space file systems"
[2]: https://en.wikipedia.org/wiki/Filesystem_in_Userspace?utm_source=chatgpt.com "Filesystem in Userspace"
[3]: https://www.usenix.org/system/files/fast24-cho.pdf?utm_source=chatgpt.com "RFUSE: Modernizing Userspace Filesystem Framework through Scalable ..."
[4]: https://en.wikipedia.org/wiki/Graph_database?utm_source=chatgpt.com "Graph database"
[5]: https://memgraph.com/docs/ai-ecosystem/graph-rag?utm_source=chatgpt.com "GraphRAG with Memgraph"
[6]: https://link.springer.com/chapter/10.1007/978-3-031-74701-4_13?utm_source=chatgpt.com "Exploring the Hybrid Approach: Integrating Relational and Graph ..."
[7]: https://www.academia.edu/13092788/SEMANTIC_BASED_DATA_STORAGE_WITH_NEXT_GENERATION_CATEGORIZER?utm_source=chatgpt.com "SEMANTIC BASED DATA STORAGE WITH NEXT GENERATION CATEGORIZER"
[8]: https://www.sciencedirect.com/science/article/pii/S1319157822002920?utm_source=chatgpt.com "FUSE based file system for efficient storage and retrieval of ..."
[9]: https://milvus.io/ai-quick-reference/what-strategies-exist-for-longterm-memory-in-model-context-protocol-mcp?utm_source=chatgpt.com "What strategies exist for long-term memory in Model Context Protocol (MCP)?"
[10]: https://www.filesystems.org/docs/fuse/bharath-msthesis.pdf?utm_source=chatgpt.com "To FUSE or not to FUSE? Analysis and Performance ... - File System"