HCFS/PROJECT_PLAN.md
2025-07-30 09:34:16 +10:00

---
# PROJECT\_PLAN.md
## 📘 Title
**Context-Aware Hierarchical Context File System (HCFS)**: Unifying file system paths with context blobs for agentic AI cognition
---
## 1. Research Motivation & Literature Review 🧠
* **Semantic and context-aware file systems**: Gifford et al. (1991) proposed early semantic file systems using directory paths as semantic queries ([Wikipedia][1]). Later work explored tag-based and ontology-based systems for richer metadata and context-aware retrieval ([Wikipedia][1]).
* **LLM-driven semantic FS (LSFS)**: LSFS, presented at ICLR 2025, proposes integrating vector DBs and semantic indexing into a filesystem that supports prompt-driven file operations and semantic rollback ([OpenReview][2]).
* **Path-structure embeddings**: Recent Transformer-based work shows file paths can be modeled as sequences for semantic anomaly detection—capturing hierarchy and semantics in embeddings ([MDPI][3]).
* **Context modeling frameworks**: Ontology-driven context models (e.g. OWL/SOCAM) support representing, reasoning about, and sharing context hierarchically ([arXiv][4]).
Your HCFS merges these prior insights into a hybrid: directory navigation = query scope, backed by semantic context blobs in a DB, enabling agentic systems to zoom in/out contextually.
---
## 2. Objectives & Scope
1. Design a **virtual filesystem layer** that maps hierarchical paths to context blobs.
2. Build a **context storage system** (DB) to hold context units, versioned and indexed.
3. Define **APIs and syscalls** for agents to:
   * navigate context scope (`cd`-style),
   * request context retrieval,
   * push new context,
   * merge or inherit context across levels.
4. Enable **decentralized context sharing**: agents can publish updates at path nodes; peer agents subscribe by tree paths.
5. Prototype on a controlled dataset / toy project tree to validate:
   * latency,
   * correct retrieval,
   * hierarchical inheritance semantics.
---
## 3. System Architecture Overview
### 3.1 Virtual Filesystem Layer (e.g. FUSE or AIOS integration)
* Presents a standard POSIX (or AIOS-style) tree structure.
* Each directory or file node has metadata pointers to context-blob IDs.
* Traversal (e.g., `ls`, `cd`) triggers context lookup for that path.
### 3.2 Context Database Backend
* Two possible designs:
  * **Relational/SQLite + versioned tables**: simple, transactional, supports hierarchical inheritance via path parent pointers.
  * **Graph DB (e.g., Neo4j)**: ideal for multi-parent contexts, symlink-like context inheritance.
* Context blobs include:
  * blob ID,
  * path(s) bound,
  * timestamp/version, author/agent,
  * embedding or semantic tags,
  * content or summary.
### 3.3 Indexing & Embeddings
* Generate embeddings of context blobs for semantic similarity retrieval (e.g. for context folding) ([OpenReview][5], [OpenReview][2], [MDPI][3]).
* Use a combination of BM25 and embedding ranking (contextual retrieval) for accurate scope-based retrieval ([TECHCOMMUNITY.MICROSOFT.COM][6]).
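
The hybrid ranking idea can be sketched as a weighted fusion of two score sets. This is only an illustration: the function names, the min-max normalization, and the `alpha` weight are assumptions introduced here, not part of the plan or of the cited article (which may fuse ranks rather than scores). The BM25 and cosine scores are taken as precomputed inputs.

```python
from typing import Dict, List


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale scores to [0, 1] so BM25 and cosine values are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {blob_id: 0.0 for blob_id in scores}
    return {blob_id: (s - lo) / (hi - lo) for blob_id, s in scores.items()}


def hybrid_rank(bm25: Dict[str, float],
                cosine: Dict[str, float],
                alpha: float = 0.5) -> List[str]:
    """Rank blob IDs by alpha * BM25 + (1 - alpha) * embedding similarity.

    A blob missing from one score set contributes 0 for that signal.
    Ties are broken by blob ID to keep the order deterministic.
    """
    b, c = min_max(bm25), min_max(cosine)
    ids = set(b) | set(c)
    fused = {i: alpha * b.get(i, 0.0) + (1 - alpha) * c.get(i, 0.0) for i in ids}
    return sorted(ids, key=lambda i: (-fused[i], i))


# Toy scores for three hypothetical blobs under the current path scope.
ranking = hybrid_rank(
    bm25={"blob-a": 2.0, "blob-b": 1.0, "blob-c": 0.0},
    cosine={"blob-a": 0.1, "blob-b": 0.9, "blob-c": 0.5},
)
```

With equal weights, a blob that scores moderately on both signals can outrank one that scores highly on only one of them, which is the behavior a hybrid ranker is meant to provide.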
### 3.4 API & Syscalls
* `context_cd(path)`: sets current context pointer.
* `context_get(depth=N)`: retrieves cumulative context from current node up N levels.
* `context_push(path, blob)`: insert new context tied to a path.
* `context_list(path)`: lists available context blobs at that path.
* `context_subscribe(path)`: agent registers to receive updates at a path.
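
A minimal in-memory sketch of these five calls may help pin down their semantics. The `ContextStore` class and its helpers are hypothetical, blobs are plain strings, and two behaviors are assumptions made for the sketch: `context_get` returns blobs ordered from the most distant ancestor to the current node, and subscribers are notified only for pushes at the exact path they registered.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def normalize(path: str) -> str:
    """Collapse repeated or trailing slashes: 'a//b/' -> '/a/b'."""
    return "/" + "/".join(p for p in path.split("/") if p)


def ancestors(path: str) -> List[str]:
    """Return the path followed by each parent up to the root."""
    parts = [p for p in path.split("/") if p]
    chain = ["/" + "/".join(parts[:i]) for i in range(len(parts), 0, -1)]
    return chain + ["/"]


class ContextStore:
    """In-memory stand-in for the FS layer plus context DB (illustration only)."""

    def __init__(self) -> None:
        self.blobs: Dict[str, List[str]] = defaultdict(list)
        self.subscribers: Dict[str, List[Callable[[str, str], None]]] = defaultdict(list)
        self.cwd = "/"

    def context_cd(self, path: str) -> None:
        self.cwd = normalize(path)

    def context_get(self, depth: int = 0) -> List[str]:
        """Collect blobs from the current node and up to `depth` ancestors."""
        chain = ancestors(self.cwd)[: depth + 1]
        out: List[str] = []
        for node in reversed(chain):  # most distant ancestor first
            out.extend(self.blobs.get(node, []))
        return out

    def context_push(self, path: str, blob: str) -> None:
        path = normalize(path)
        self.blobs[path].append(blob)
        for callback in self.subscribers.get(path, []):
            callback(path, blob)

    def context_list(self, path: str) -> List[str]:
        return list(self.blobs.get(normalize(path), []))

    def context_subscribe(self, path: str, callback: Callable[[str, str], None]) -> None:
        self.subscribers[normalize(path)].append(callback)


store = ContextStore()
store.context_push("/", "root notes")
store.context_push("/proj", "project notes")
store.context_push("/proj/src", "source notes")
store.context_cd("/proj/src")
```

Calling `store.context_get(depth=1)` here would return the project notes followed by the source notes, while `depth=0` would return only what is bound to `/proj/src`.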
---
## 4. Project Timeline & Milestones
| Phase                                          | Duration | Deliverables                                             |
| ---------------------------------------------- | -------- | -------------------------------------------------------- |
| **Phase 0: Research & Design**                 | 2 weeks  | Literature review doc, architecture draft                |
| **Phase 1: Prototype FS layer**                | 4 weeks  | Minimal FUSE-based path→context mapping, CLI demo        |
| **Phase 2: Backend DB & storage**              | 4 weeks  | Context blob storage, path linkage, versioning           |
| **Phase 3: Embedding & retrieval integration** | 3 weeks  | Embeddings + BM25 hybrid ranking for context relevance   |
| **Phase 4: API/Syscall layer scripting**       | 3 weeks  | Python (or AIOS) service exposing navigation + push APIs |
| **Phase 5: Agent integration & simulation**    | 3 weeks  | Dummy AI agents navigating, querying, publishing context |
| **Phase 6: Evaluation & refinement**           | 2 weeks  | Usability, latency, retrieval relevance metrics          |
| **Phase 7: Write-up & publication**            | 2 weeks  | Report, possible poster/paper submission                 |
---
## 5. Risks & Alternatives
* **Semantic vs hierarchical mismatch**: Flat tag systems (e.g. Tagsistant) offer semantic tagging but lack path-based inheritance ([research.ijcaonline.org][7], [OpenReview][2], [Wikipedia][1], [arXiv][8], [Anthropic][9], [OpenReview][5], [Wikipedia][10]).
* **Context explosion**: many small blobs flooding the DB—mitigate via summarization/folding.
* **Performance tradeoffs**: FS lookups must stay acceptable; versioned graph storage might slow down. Consider caching snapshots at each node.
---
## 6. Peer-Reviewed References
* David Gifford et al., *Semantic file systems*, ACM SIGOPS Operating Systems Review (1991) ([Wikipedia][1])
* ICLR 2025: *From Commands to Prompts: LLM-based Semantic File System for AIOS* (LSFS) ([OpenReview][2])
* Xiaoyu et al., *Transformer-based path sequence modeling for filepath anomaly detection* ([MDPI][3])
* Tao Gu et al., *An Ontology-based Context Model in Intelligent Environments* (SOCAM) ([arXiv][4])
---
## 7. Next Steps
* Review cited literature, build an annotated bibliography.
* Choose backend stack (SQLite vs graph DB) and test embedding pipeline.
* Begin Phase 1: implementing a minimal context-aware FS mock.
---
[1]: https://en.wikipedia.org/wiki/Semantic_file_system?utm_source=chatgpt.com "Semantic file system"
[2]: https://openreview.net/forum?id=2G021ZqUEZ&utm_source=chatgpt.com "From Commands to Prompts: LLM-based Semantic File System for AIOS"
[3]: https://www.mdpi.com/2079-8954/13/6/403?utm_source=chatgpt.com "Effective Context-Aware File Path Embeddings for Anomaly Detection - MDPI"
[4]: https://arxiv.org/abs/2003.05055?utm_source=chatgpt.com "An Ontology-based Context Model in Intelligent Environments"
[5]: https://openreview.net/pdf?id=2G021ZqUEZ&utm_source=chatgpt.com "From Commands to Prompts: LLM-based Semantic File System for AIOS - OpenReview"
[6]: https://techcommunity.microsoft.com/blog/azure-ai-services-blog/building-a-contextual-retrieval-system-for-improving-rag-accuracy/4271924?utm_source=chatgpt.com "Building a Contextual Retrieval System for Improving RAG Accuracy"
[7]: https://research.ijcaonline.org/volume121/number1/pxc3904433.pdf?utm_source=chatgpt.com "A Survey on Different File System Approach - research.ijcaonline.org"
[8]: https://arxiv.org/abs/1909.10123?utm_source=chatgpt.com "SplitFS: Reducing Software Overhead in File Systems for Persistent Memory"
[9]: https://www.anthropic.com/news/contextual-retrieval?utm_source=chatgpt.com "Introducing Contextual Retrieval \ Anthropic"
[10]: https://en.wikipedia.org/wiki/Tagsistant?utm_source=chatgpt.com "Tagsistant"
---
# Core Architecture Considerations
**FS design on top of FUSE** and **DB schema selection with versioning**.
---
## 🖥️ 1. FS Architecture: FUSE Layer & Path-Context Mapping
### Why FUSE makes sense
* FUSE (Filesystem in Userspace) provides a widely used, flexible interface for prototyping new FS models without kernel hacking, enabling rapid development of virtual filesystems that you can mount and interact with via standard POSIX tools ([IBM Research][1], [Wikipedia][2]).
* Performance varies, but optimized designs or alternatives like RFUSE help improve kernel-userspace communication latency and throughput, making userspace FS viable even in demanding use cases ([USENIX][3]).
### Path-to-Context Mapping Schema
You'd implement a mapping where each path (directory or file) is bound to zero or more **context blob IDs**. Concepts:
* Directory traversal (`cd`, `ls`) triggers path-based context lookups in your backend.
* File reads (e.g. `readfile(context)`) return the merged or inherited context blob(s) for that node.
* Inheritance: a context layer at `/a/b/c` implicitly inherits from `/a/b`, `/a`, and `/` context as needed.
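
The inheritance rule can be sketched as a walk over the path's ancestors. The function names and the `bindings` dictionary (path to bound blob IDs) are hypothetical stand-ins for the backend lookup, and the root-to-leaf ordering is an assumption of this sketch.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def inheritance_chain(path: str) -> List[str]:
    """Return every path a node inherits from, ordered root -> node."""
    node = PurePosixPath(path)
    return [str(p) for p in reversed(node.parents)] + [str(node)]


def resolve_blob_ids(path: str, bindings: Dict[str, List[str]]) -> List[str]:
    """Collect blob IDs bound to the node and to each of its ancestors."""
    ids: List[str] = []
    for level in inheritance_chain(path):
        ids.extend(bindings.get(level, []))
    return ids


# Hypothetical bindings: /a/b has no blobs of its own.
bindings = {"/": ["blob-1"], "/a": ["blob-2"], "/a/b/c": ["blob-3", "blob-4"]}
chain = inheritance_chain("/a/b/c")
inherited = resolve_blob_ids("/a/b/c", bindings)
```

Paths with no bound blobs, such as `/a/b` above, contribute nothing, which matches the "zero or more" binding described earlier.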
### Caching & Merge Layers
* Cache context snapshots at each directory layer to reduce repeated database hits.
* Provide configurable merge strategies (union, override, summarization) to maintain efficient context retrieval.
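
As a rough sketch of how snapshot caching and merge strategies could fit together: `SnapshotCache`, `merge_union`, and `merge_override` are hypothetical names, layers are assumed to be lists of blob strings ordered root to leaf, and the summarization strategy is left out because it would need a model call.

```python
from typing import Callable, Dict, List, Tuple

Layers = List[List[str]]  # blobs per path level, ordered root -> leaf


def merge_union(layers: Layers) -> List[str]:
    """Keep every distinct blob, preserving root-to-leaf order."""
    seen, out = set(), []
    for layer in layers:
        for blob in layer:
            if blob not in seen:
                seen.add(blob)
                out.append(blob)
    return out


def merge_override(layers: Layers) -> List[str]:
    """The most specific non-empty layer replaces everything above it."""
    for layer in reversed(layers):
        if layer:
            return list(layer)
    return []


STRATEGIES: Dict[str, Callable[[Layers], List[str]]] = {
    "union": merge_union,
    "override": merge_override,
}


class SnapshotCache:
    """Caches merged context per (path, strategy) to avoid repeated DB hits."""

    def __init__(self, load_layers: Callable[[str], Layers]) -> None:
        self._load = load_layers  # stand-in for the database query
        self._snapshots: Dict[Tuple[str, str], List[str]] = {}
        self.misses = 0

    def get(self, path: str, strategy: str = "union") -> List[str]:
        key = (path, strategy)
        if key not in self._snapshots:
            self.misses += 1
            self._snapshots[key] = STRATEGIES[strategy](self._load(path))
        return self._snapshots[key]

    def invalidate(self, path: str) -> None:
        """Drop snapshots for a path and its descendants after a push."""
        prefix = path.rstrip("/") + "/"
        stale = [k for k in self._snapshots if k[0] == path or k[0].startswith(prefix)]
        for key in stale:
            del self._snapshots[key]


layers_by_path = {"/a": [["root"], ["a"]], "/a/b": [["root"], ["a"], []]}
cache = SnapshotCache(lambda p: layers_by_path[p])
```

Invalidating descendants on a push matters because, under inheritance, a change at `/a` alters the merged view of everything below it.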
---
## 📦 2. DB Design: Relational vs Graph & Context Versioning
### Relational (e.g. SQLite/PostgreSQL)
* Strong transactional guarantees and simple schema: tables like `Blobs(blob_id, path, version, content, timestamp)` plus a `PathHierarchy(path, parent_path)` table for inheritance.
* Good for simple single-parent hierarchies and transactional versioning (with version numbers or history tables).
* But joins across deep path hierarchies can get costly; semantic relationships or multi-parent inheritance are more cumbersome.
### Graph Database (e.g. Neo4j)
* Nodes represent paths and context blobs; edges represent parent-child, semantic relations, and "derived" links.
* Ideal for multi-parent or symlink-like context inheritance, semantic network traversal, or hierarchy restructuring ([Wikipedia][4]).
* Enables queries like: “find all context blobs reachable within N hops from path X,” or “retrieve peers with similar context semantics.”
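
To make the N-hop query concrete without committing to a graph engine, here is a breadth-first sketch over a plain adjacency dictionary. The graph contents, the `blob:` prefix convention, and the function name are assumptions for illustration; a real graph DB would express this in its own query language.

```python
from collections import deque
from typing import Dict, List, Set


def within_hops(graph: Dict[str, List[str]], start: str, max_hops: int) -> Set[str]:
    """Return all nodes reachable from `start` in at most `max_hops` edges."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return seen - {start}


# Hypothetical graph: path nodes link to child paths and to bound blobs.
graph = {
    "/a": ["/a/b", "blob:1"],
    "/a/b": ["blob:2"],
    "blob:2": ["blob:3"],  # e.g. a "derived" link between blobs
}
blobs_within_2 = {n for n in within_hops(graph, "/a", 2) if n.startswith("blob:")}
```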
### Hybrid Approaches
* A relational backend augmented with semantic tables or converted into a graph as needed for richer queries ([memgraph.com][5], [link.springer.com][6]).
* Example: relational for version history and base structure, graph/cloud-based embeddings for semantic relationships.
### Context Versioning
* Must support **hierarchical version control**: each blob should have metadata like `blob_id`, `version_id`, `parent_version`, `agent_id`, `timestamp`.
* You can implement simple version chains in a relational DB, or model versions in a graph DB (e.g. graph edges representing “version-of” relationships).
* Track changes with immutable blob storage; allow rollbacks or context diffs.
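
A small sketch of an immutable version chain with rollback and diffs, using the metadata fields listed above. The `VersionedBlob` class is hypothetical, version IDs are assumed to be sequential integers, and rollback is modeled as moving a head pointer so that history is never rewritten.

```python
import difflib
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)  # frozen: stored versions cannot be mutated
class BlobVersion:
    blob_id: str
    version_id: int
    parent_version: Optional[int]
    agent_id: str
    timestamp: float
    content: str


class VersionedBlob:
    def __init__(self, blob_id: str) -> None:
        self.blob_id = blob_id
        self.versions: List[BlobVersion] = []
        self.head: Optional[int] = None  # version_id currently in effect

    def commit(self, content: str, agent_id: str, timestamp: float) -> BlobVersion:
        """Append a new version whose parent is the current head."""
        version = BlobVersion(self.blob_id, len(self.versions) + 1,
                              self.head, agent_id, timestamp, content)
        self.versions.append(version)
        self.head = version.version_id
        return version

    def rollback(self, version_id: int) -> None:
        """Point head at an older version without deleting anything."""
        if not 1 <= version_id <= len(self.versions):
            raise KeyError(version_id)
        self.head = version_id

    def current(self) -> BlobVersion:
        if self.head is None:
            raise LookupError("no versions committed yet")
        return self.versions[self.head - 1]

    def diff(self, old: int, new: int) -> List[str]:
        a = self.versions[old - 1].content.splitlines()
        b = self.versions[new - 1].content.splitlines()
        return list(difflib.unified_diff(a, b, lineterm=""))


blob = VersionedBlob("blob-1")
blob.commit("goal: parse logs", agent_id="agent-a", timestamp=1.0)
blob.commit("goal: parse logs\nstatus: done", agent_id="agent-b", timestamp=2.0)
blob.rollback(1)
branched = blob.commit("goal: parse metrics", agent_id="agent-a", timestamp=3.0)
```

Committing after a rollback produces a version whose parent is the rolled-back version, so the chain naturally becomes a tree, which is where graph-style "version-of" edges become convenient.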
---
## 🔍 Comparison Table
| Feature                  | Relational DB                                               | Graph DB                                                            |
| ------------------------ | ----------------------------------------------------------- | ------------------------------------------------------------------- |
| **Hierarchy resilience** | Works well for strict tree; joins required for multi-parent | Native multi-parent and traversals                                  |
| **Performance**          | Fast for simple lookups; may slow with joins                | Cheap per-hop traversal; total cost grows with the subgraph visited |
| **Versioning**           | Straightforward with version tables; chronology easy        | Version graph edges, easier branching/merging                       |
| **Semantic links**       | Requires additional tables or indexes                       | First-class properties/relationships                                |
| **Cost & tooling**       | SQLite is lightweight and well-known                        | Requires graph engine (Neo4j, etc.)                                 |
---
## 🧠 Integration Architecture
### FS Layer
* Run FUSE-based FS presenting standard directories/files.
* On `lookup`, FS resolves the path and queries DB for context blobs.
* On `read`, the FS returns the merged context string; `write` or `push` maps to `context_push(path, blob_content)`, which can also be exposed via MCP endpoints.
### Backend DB Schema Sketch
**Relational (SQL)**
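```sql
-- One row per node in the virtual tree; the root has a NULL parent.
CREATE TABLE path (
    path   TEXT PRIMARY KEY,
    parent TEXT REFERENCES path(path)
);

-- One row per blob version. SERIAL is PostgreSQL syntax;
-- in SQLite, use INTEGER PRIMARY KEY instead.
CREATE TABLE context_blob (
    blob_id     SERIAL PRIMARY KEY,
    path        TEXT REFERENCES path(path),           -- node the blob is bound to
    version     INT,
    parent_blob INT REFERENCES context_blob(blob_id), -- previous version, if any
    agent       TEXT,                                 -- authoring agent
    timestamp   TIMESTAMP,
    content     TEXT
);
```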
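
One way to query inherited context against this schema is a recursive common table expression that walks `parent` pointers up to the root. The sketch below adapts the schema to SQLite (an assumption: `INTEGER PRIMARY KEY` replaces `SERIAL`, and timestamps are stored as text) and uses made-up rows; the function names are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE path (
    path   TEXT PRIMARY KEY,
    parent TEXT REFERENCES path(path)
);
CREATE TABLE context_blob (
    blob_id     INTEGER PRIMARY KEY,
    path        TEXT REFERENCES path(path),
    version     INT,
    parent_blob INT REFERENCES context_blob(blob_id),
    agent       TEXT,
    timestamp   TEXT,
    content     TEXT
);
"""

# Walk from the requested path up to the root, then join the blobs
# bound at each level. depth 0 is the requested node itself.
INHERITED = """
WITH RECURSIVE chain(path, parent, depth) AS (
    SELECT path, parent, 0 FROM path WHERE path = ?
    UNION ALL
    SELECT p.path, p.parent, chain.depth + 1
    FROM path AS p JOIN chain ON p.path = chain.parent
)
SELECT chain.path, b.content
FROM chain JOIN context_blob AS b ON b.path = chain.path
ORDER BY chain.depth DESC, b.version
"""


def build_demo_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO path VALUES (?, ?)",
                     [("/", None), ("/a", "/"), ("/a/b", "/a")])
    conn.executemany(
        "INSERT INTO context_blob (path, version, agent, timestamp, content) "
        "VALUES (?, ?, ?, ?, ?)",
        [("/", 1, "agent-a", "2025-01-01", "root ctx"),
         ("/a", 1, "agent-a", "2025-01-02", "a ctx"),
         ("/a/b", 1, "agent-b", "2025-01-03", "b ctx")],
    )
    return conn


def inherited_context(conn: sqlite3.Connection, path: str):
    """Return (path, content) rows ordered from the root down to `path`."""
    return conn.execute(INHERITED, (path,)).fetchall()


conn = build_demo_db()
rows = inherited_context(conn, "/a/b")
```

The recursion performs one step per ancestor level, which is the cost the relational design pays for deep hierarchies; caching snapshots per directory is one way to offset it.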
**Graph (Property Graph)**
* Node labels:
  * `(:Path {path: "...", last_blob: id})`
  * `(:Blob {blob_id, version, agent, timestamp, content})`
* Edges:
  * `(Path)-[:HAS_BLOB]->(Blob)`
  * `(Blob)-[:PARENT_VERSION]->(Blob)`
  * `(Path)-[:PARENT_PATH]->(Path)`
---
## 🧩 Summary & Recommendation
* A **FUSE-based FS layer** is well-suited for interface compatibility and rapid prototyping; RFUSE-style frameworks may help with performance if you scale.
* For the backend, if you expect **strict single-parent hierarchical contexts**, a relational DB is safe and simple.
* If you want **multi-parent inheritance, semantic linking, branching, merging**, a graph DB offers greater flexibility.
* Versioning is supported in both: relational via version chains and history tables; graph via version edges.
* Hybrid: use PostgreSQL with graph extensions, or embed a graph layer atop SQL for embeddings and semantic deep-dive queries ([academia.edu][7], [sciencedirect.com][8], [milvus.io][9], [filesystems.org][10]).
---
[1]: https://research.ibm.com/publications/to-fuse-or-not-to-fuse-performance-of-user-space-file-systems?utm_source=chatgpt.com "To fuse or not to fuse: Performance of user-space file systems"
[2]: https://en.wikipedia.org/wiki/Filesystem_in_Userspace?utm_source=chatgpt.com "Filesystem in Userspace"
[3]: https://www.usenix.org/system/files/fast24-cho.pdf?utm_source=chatgpt.com "RFUSE: Modernizing Userspace Filesystem Framework through Scalable ..."
[4]: https://en.wikipedia.org/wiki/Graph_database?utm_source=chatgpt.com "Graph database"
[5]: https://memgraph.com/docs/ai-ecosystem/graph-rag?utm_source=chatgpt.com "GraphRAG with Memgraph"
[6]: https://link.springer.com/chapter/10.1007/978-3-031-74701-4_13?utm_source=chatgpt.com "Exploring the Hybrid Approach: Integrating Relational and Graph ..."
[7]: https://www.academia.edu/13092788/SEMANTIC_BASED_DATA_STORAGE_WITH_NEXT_GENERATION_CATEGORIZER?utm_source=chatgpt.com "SEMANTIC BASED DATA STORAGE WITH NEXT GENERATION CATEGORIZER"
[8]: https://www.sciencedirect.com/science/article/pii/S1319157822002920?utm_source=chatgpt.com "FUSE based file system for efficient storage and retrieval of ..."
[9]: https://milvus.io/ai-quick-reference/what-strategies-exist-for-longterm-memory-in-model-context-protocol-mcp?utm_source=chatgpt.com "What strategies exist for long-term memory in Model Context Protocol (MCP)?"
[10]: https://www.filesystems.org/docs/fuse/bharath-msthesis.pdf?utm_source=chatgpt.com "To FUSE or not to FUSE? Analysis and Performance ... - File System"