1. Research Motivation & Literature Review 🧠
- Semantic and context‑aware file systems: Gifford et al. (1991) proposed early semantic file systems using directory paths as semantic queries (Wikipedia). Later work explored tag‑based and ontology‑based systems for richer metadata and context-aware retrieval (Wikipedia).
- LLM‑driven semantic FS (LSFS): The recent ICLR 2025 LSFS proposes integrating vector DBs and semantic indexing into a filesystem that supports prompt-driven file operations and semantic rollback (OpenReview).
- Path-structure embeddings: Recent Transformer-based work shows file paths can be modeled as sequences for semantic anomaly detection—capturing hierarchy and semantics in embeddings (MDPI).
- Context modeling frameworks: Ontology-driven context models (e.g. OWL/SOCAM) support representing, reasoning about, and sharing context hierarchically (arXiv).
Your HCFS merges these prior insights into a hybrid: directory navigation = query scope, backed by semantic context blobs in a DB, enabling agentic systems to zoom in/out contextually.
2. Objectives & Scope
- Design a virtual filesystem layer that maps hierarchical paths to context blobs.
- Build a context storage system (DB) to hold context units, versioned and indexed.
- Define APIs and syscalls for agents to:
  - navigate context scope (`cd`-style),
  - request context retrieval,
  - push new context,
  - merge or inherit context across levels.
- Enable decentralized context sharing: agents can publish updates at path-nodes; peer agents subscribe by tree-paths.
- Prototype on a controlled dataset / toy project tree to validate:
  - latency,
  - correct retrieval,
  - hierarchical inheritance semantics.
3. System Architecture Overview
3.1 Virtual Filesystem Layer (e.g. FUSE or AIOS integration)
- Presents standard POSIX (or AIOS‑style) tree structure.
- Each directory or file node has metadata pointers into context‑blob IDs.
- Traversal (e.g., `ls`, `cd`) triggers context lookup for that path.
3.2 Context Database Backend
- Two possible designs:
  - Relational/SQLite + versioned tables: simple, transactional, supports hierarchical inheritance via path parent pointers.
  - Graph DB (e.g., Neo4j): ideal for multi-parent contexts, symlink-like context inheritance.
- Context blobs include:
  - blob ID,
  - path(s) bound,
  - timestamp/version, author/agent,
  - embedding or semantic tags,
  - content or summary.
3.3 Indexing & Embeddings
- Generate embeddings of context blobs for semantic similarity retrieval (e.g. for context folding) (OpenReview, MDPI).
- Use a combination of BM25 and embedding ranking (contextual retrieval) for accurate scope-based retrieval (TECHCOMMUNITY.MICROSOFT.COM).
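A minimal sketch of the BM25 + embedding idea, under stated assumptions: the embedding here is a toy bag-of-words vector standing in for a real embedding model, and the two rankings are combined with reciprocal rank fusion, which is only one possible fusion method (the plan does not specify one). All function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each document against the query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(t) for t in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def toy_embed(text):
    """Placeholder embedding (assumption): a bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(u, v):
    dot = sum(u[t] * v[t] for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_rank(query, docs, k=60):
    """Return document indices, best first, fusing both rankings (RRF)."""
    lexical = bm25_scores(query, docs)
    q_vec = toy_embed(query)
    dense = [cosine(q_vec, toy_embed(d)) for d in docs]
    fused = [0.0] * len(docs)
    for scores in (lexical, dense):
        order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
        for rank, i in enumerate(order, start=1):
            fused[i] += 1.0 / (k + rank)
    return sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)
```

In a real pipeline the candidate set would first be restricted to blobs bound to the current path scope, and `toy_embed` would be replaced by the chosen embedding model.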
3.4 API & Syscalls
- `context_cd(path)`: sets current context pointer.
- `context_get(depth=N)`: retrieves cumulative context from current node up N levels.
- `context_push(path, blob)`: insert new context tied to a path.
- `context_list(path)`: lists available context blobs at that path.
- `context_subscribe(path)`: agent registers to receive updates at a path.
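As a rough illustration of how these five calls could fit together, here is an in-memory sketch. The `ContextStore` class, the callback argument to `context_subscribe`, and the choice that a subscriber at a path also receives pushes to its descendants are all assumptions for illustration, not part of the plan.

```python
import posixpath
from collections import defaultdict


class ContextStore:
    """In-memory stand-in for the context backend (hypothetical)."""

    def __init__(self):
        self.blobs = defaultdict(list)        # path -> list of blobs
        self.subscribers = defaultdict(list)  # path -> list of callbacks
        self.cwd = "/"

    def context_cd(self, path):
        # Relative paths are resolved against the current context pointer.
        self.cwd = posixpath.normpath(posixpath.join(self.cwd, path))
        return self.cwd

    def context_get(self, depth=0):
        # Collect blobs from the current node, then up to `depth` ancestors.
        collected, node = [], self.cwd
        for _ in range(depth + 1):
            collected.extend(self.blobs.get(node, []))
            if node == "/":
                break
            node = posixpath.dirname(node)
        return collected

    def context_push(self, path, blob):
        path = posixpath.normpath(path)
        self.blobs[path].append(blob)
        # Assumption: a subscription covers the path and its whole subtree.
        for sub_path, callbacks in list(self.subscribers.items()):
            prefix = sub_path.rstrip("/") + "/"
            if path == sub_path or path.startswith(prefix):
                for callback in callbacks:
                    callback(path, blob)

    def context_list(self, path):
        return list(self.blobs.get(posixpath.normpath(path), []))

    def context_subscribe(self, path, callback):
        self.subscribers[posixpath.normpath(path)].append(callback)


store = ContextStore()
store.context_push("/proj", "project goals")
store.context_push("/proj/src", "coding conventions")
store.context_cd("/proj/src")
print(store.context_get(depth=1))
```

Here `context_get` returns the nearest context first; whether the root or the leaf should come first is a design choice the merge strategy would settle.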
4. Project Timeline & Milestones
| Phase | Duration | Deliverables |
|---|---|---|
| Phase 0: Research & Design | 2 weeks | Literature review doc, architecture draft |
| Phase 1: Prototype FS layer | 4 weeks | Minimal FUSE‑based path→context mapping, CLI demo |
| Phase 2: Backend DB & storage | 4 weeks | Context blob storage, path linkage, versioning |
| Phase 3: Embedding & retrieval integration | 3 weeks | Embeddings + BM25 hybrid ranking for context relevance |
| Phase 4: API/Syscall layer scripting | 3 weeks | Python (or AIOS) service exposing navigation + push APIs |
| Phase 5: Agent integration & simulation | 3 weeks | Dummy AI agents navigating, querying, publishing context |
| Phase 6: Evaluation & refinement | 2 weeks | Usability, latency, retrieval relevance metrics |
| Phase 7: Write-up & publication | 2 weeks | Report, possible poster/paper submission |
5. Risks & Alternatives
- Semantic vs hierarchical mismatch: Flat tag systems (e.g. Tagsistant) offer semantic tagging but lack path-based inheritance (research.ijcaonline.org, OpenReview, Wikipedia, arXiv, Anthropic).
- Context explosion: many small blobs flooding the DB—mitigate via summarization/folding.
- Performance trade-offs: FS lookup latency must stay acceptable, and versioned graph storage may slow lookups down. Consider caching snapshots at each node.
6. Peer‑Reviewed References
- David Gifford et al., Semantic file systems, ACM Operating Systems Review (1991) (Wikipedia)
- ICLR 2025: From Commands to Prompts: LLM-based Semantic File System for AIOS (LSFS) (OpenReview)
- Xiaoyu et al., Transformer-based path sequence modeling for file‑path anomaly detection (MDPI)
- Tao Gu et al., Ontology‑based Context Model in Intelligent Environments (SOCAM) (arXiv)
7. Next Steps
- Review cited literature, build an annotated bibliography.
- Choose backend stack (SQLite vs graph DB) and test embedding pipeline.
- Begin Phase 1: implementing minimal context‑aware FS mock.
Core Architecture Considerations
FS design on top of FUSE and DB schema selection with versioning.
🖥️ 1. FS Architecture: FUSE Layer & Path‑Context Mapping
Why FUSE makes sense
- FUSE (Filesystem in Userspace) provides a widely used, flexible interface for prototyping new FS models without kernel hacking, enabling rapid development of virtual filesystems that you can mount and interact with via standard POSIX tools (IBM Research, Wikipedia).
- Performance varies—but optimized designs or alternatives like RFUSE help improve kernel‑userspace communication latency and throughput, making user‑space FS viable even in demanding use cases (USENIX).
Path‑to‑Context Mapping Schema
You’d implement a mapping where each path (directory or file) is bound to zero or more context blob IDs. Concepts:
- Directory traversal (`cd`, `ls`) triggers path-based context lookups in your backend.
- File reads (e.g. `readfile(context)`) return the merged or inherited context blob(s) for that node.
- Inheritance: a context layer at `/a/b/c` implicitly inherits from `/a/b`, `/a`, and `/` context as needed.
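The inheritance rule can be sketched as a walk over a path's ancestors. This is only an illustration: `inheritance_chain`, `resolve_context`, and the `bindings` dict (path to blob IDs) are hypothetical names, and the root-first ordering is an assumption.

```python
from pathlib import PurePosixPath


def inheritance_chain(path):
    """Return the path followed by its ancestors, nearest first."""
    p = PurePosixPath(path)
    return [str(p)] + [str(parent) for parent in p.parents]


def resolve_context(path, bindings):
    """Collect blob IDs bound to the path and its ancestors, root first."""
    blob_ids = []
    for node in reversed(inheritance_chain(path)):
        blob_ids.extend(bindings.get(node, []))
    return blob_ids


# Each path is bound to zero or more blob IDs; "/a" has none here.
bindings = {"/": [1], "/a/b": [2, 3], "/a/b/c": [4]}
print(inheritance_chain("/a/b/c"))
print(resolve_context("/a/b/c", bindings))
```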
Caching & Merge Layers
- Cache context snapshots at each directory layer to reduce repeated database hits.
- Provide configurable merge strategies (union, override, summarization) to maintain efficient context retrieval.
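One way the three merge strategies and per-node snapshot caching might look, as a sketch only. The function and class names are hypothetical, `layers` is assumed to be a root-first list of blob lists, and the "summarization" strategy is a truncation placeholder rather than a real summarizer.

```python
def merge_union(layers):
    """Keep every distinct blob, root-most layer first."""
    seen, merged = set(), []
    for layer in layers:
        for blob in layer:
            if blob not in seen:
                seen.add(blob)
                merged.append(blob)
    return merged


def merge_override(layers):
    """The deepest non-empty layer replaces everything above it."""
    for layer in reversed(layers):
        if layer:
            return list(layer)
    return []


def merge_summarize(layers, max_chars=40):
    """Placeholder for summarization: truncate each blob to max_chars."""
    return [blob[:max_chars] for blob in merge_union(layers)]


class SnapshotCache:
    """Caches merged context per (path, strategy) until invalidated."""

    def __init__(self, load_layers):
        self.load_layers = load_layers  # callable: path -> root-first layers
        self.snapshots = {}
        self.misses = 0                 # counts backend loads

    def get(self, path, strategy=merge_union):
        key = (path, strategy.__name__)
        if key not in self.snapshots:
            self.misses += 1
            self.snapshots[key] = strategy(self.load_layers(path))
        return self.snapshots[key]

    def invalidate(self, path):
        # A push at `path` changes that node and every descendant snapshot.
        prefix = path.rstrip("/") + "/"
        for key in list(self.snapshots):
            if key[0] == path or key[0].startswith(prefix):
                del self.snapshots[key]
```

Invalidating the whole subtree on each push is the simplest correct policy under inheritance; a finer-grained scheme would need to track which ancestors actually contributed to each snapshot.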
📦 2. DB Design: Relational vs Graph & Context Versioning
Relational (e.g. SQLite/PostgreSQL)
- Strong transactional guarantees and simple schema: tables like `Blobs(blob_id, path, version, content, timestamp)` plus a `PathHierarchy(path, parent_path)` table for inheritance.
- Good for simple single-parent hierarchies and transactional versioning (with version numbers or history tables).
- But joins across deep path hierarchies can get costly; semantic relationships or multi-parent inheritance are more cumbersome.
Graph Database (e.g. Neo4j)
- Nodes represent paths and context blobs; edges represent parent-child, semantic relations, and "derived" links.
- Ideal for multi-parent or symlink-like context inheritance, semantic network traversal, or hierarchy restructuring (Wikipedia).
- Enables queries like: “find all context blobs reachable within N hops from path X,” or “retrieve peers with similar context semantics.”
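The "within N hops" query can be illustrated without a graph engine by a breadth-first search over an adjacency map. This is a sketch under assumptions: `graph`, `labels`, and `blobs_within` are hypothetical names, and edges are followed in their stored direction only.

```python
from collections import deque


def blobs_within(graph, labels, start, max_hops):
    """Blob nodes reachable from `start` in at most `max_hops` edges (BFS)."""
    seen = {start}
    queue = deque([(start, 0)])
    found = []
    while queue:
        node, hops = queue.popleft()
        if labels[node] == "Blob":
            found.append(node)
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return found


# Toy graph: /a has a blob and a child path /a/b, which has its own blob.
graph = {"/a": ["blob1", "/a/b"], "/a/b": ["blob2"]}
labels = {"/a": "Path", "/a/b": "Path", "blob1": "Blob", "blob2": "Blob"}
print(blobs_within(graph, labels, "/a", 1))
print(blobs_within(graph, labels, "/a", 2))
```

A graph database would express the same query declaratively with a bounded variable-length path pattern; the BFS above only shows the semantics.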
Hybrid Approaches
- A relational backend augmented with semantic tables or converted into a graph as needed for richer queries (memgraph.com, link.springer.com).
- Example: relational for version history and base structure, graph/cloud-based embeddings for semantic relationships.
Context Versioning
- Must support hierarchical version control: each blob should have metadata like `blob_id`, `version_id`, `parent_version`, `agent_id`, `timestamp`.
- You can implement simple version chains in a relational DB, or model them in a graph DB (e.g. graph edges representing “version-of” relationships).
- Track changes with immutable blob storage; allow rollbacks or context diffs.
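A small sketch of immutable version chains with rollback and diffs, using the metadata fields listed above. The `BlobVersion` and `VersionChain` classes are hypothetical, and modelling rollback as a new commit that restores old content is an assumption chosen to keep history append-only.

```python
import difflib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)  # frozen: a stored version can never be mutated
class BlobVersion:
    blob_id: str
    version_id: int
    parent_version: Optional[int]
    agent_id: str
    timestamp: float
    content: str


class VersionChain:
    """Append-only history for one context blob (hypothetical)."""

    def __init__(self, blob_id):
        self.blob_id = blob_id
        self.versions = []  # version_id N lives at index N - 1
        self.head = None    # version_id of the current version

    def commit(self, content, agent_id, timestamp):
        version = BlobVersion(self.blob_id, len(self.versions) + 1,
                              self.head, agent_id, timestamp, content)
        self.versions.append(version)
        self.head = version.version_id
        return version

    def get(self, version_id):
        return self.versions[version_id - 1]

    def rollback(self, version_id, agent_id, timestamp):
        # Restore old content as a new version; nothing is deleted.
        return self.commit(self.get(version_id).content, agent_id, timestamp)

    def diff(self, old_id, new_id):
        old = self.get(old_id).content.splitlines()
        new = self.get(new_id).content.splitlines()
        return list(difflib.unified_diff(old, new, lineterm=""))


chain = VersionChain("blob-1")
chain.commit("use sqlite", "agent-a", 1.0)
chain.commit("use sqlite\nadd cache", "agent-b", 2.0)
print("\n".join(chain.diff(1, 2)))
```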
🔍 Comparison Table
| Feature | Relational DB | Graph DB |
|---|---|---|
| Hierarchy resilience | Works well for strict tree; joins required for multi-parent | Native multi-parent and traversals |
| Performance | Fast for simple lookups; may slow with joins | Constant-time per-hop traversal (index-free adjacency) for connected queries |
| Versioning | Straightforward with version tables; chronology easy | Version graph edges, easier branching/merging |
| Semantic links | Requires additional tables or indexes | First-class properties/relationships |
| Cost & tooling | SQLite is lightweight and well-known | Requires graph engine (Neo4j, etc.) |
🧠 Integration Architecture
FS Layer
- Run FUSE-based FS presenting standard directories/files.
- On `lookup`, FS resolves the path and queries DB for context blobs.
- On `read`, FS returns merged context string; `write` or `push` maps to `context_push(path, blob_content)` exposing MCP endpoints.
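To make the lookup/read/write flow concrete, here is a library-independent sketch. No FUSE binding is imported: the `ContextFSHandlers` class and its method shapes are hypothetical and only loosely mirror the handler style of FUSE bindings, with an in-memory dict standing in for the DB.

```python
import errno
import stat


class ContextFSHandlers:
    """Path -> context handlers backed by an in-memory dict (hypothetical)."""

    def __init__(self):
        self.blobs = {"/": []}  # path -> list of context strings

    def _chain(self, path):
        # Root first, then each ancestor down to the path itself.
        parts = [p for p in path.split("/") if p]
        return ["/"] + ["/" + "/".join(parts[:i + 1]) for i in range(len(parts))]

    def _merged(self, path):
        lines = [blob for node in self._chain(path)
                 for blob in self.blobs.get(node, [])]
        return "\n".join(lines).encode("utf-8")

    def getattr(self, path):
        # Lookup: unknown paths are reported the way a filesystem would.
        if path not in self.blobs:
            raise OSError(errno.ENOENT, "no such context node", path)
        return {"st_mode": stat.S_IFREG | 0o644,
                "st_size": len(self._merged(path))}

    def read(self, path, size, offset):
        # Read: return the requested byte window of the merged context.
        return self._merged(path)[offset:offset + size]

    def write(self, path, data, offset=0):
        # Write is treated as a push; the offset is ignored in this sketch.
        self.blobs.setdefault(path, []).append(data.decode("utf-8"))
        return len(data)


fs = ContextFSHandlers()
fs.write("/", b"project goals")
fs.write("/src", b"coding conventions")
print(fs.read("/src", 4096, 0).decode("utf-8"))
```

Wiring these handlers into an actual mount would go through whichever FUSE binding is chosen in Phase 1, and the dict would be replaced by queries against the context DB.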
Backend DB Schema Sketch
Relational (SQL)
```sql
CREATE TABLE path (
    path TEXT PRIMARY KEY,
    parent TEXT REFERENCES path(path)
);
CREATE TABLE context_blob (
    blob_id SERIAL PRIMARY KEY,
    path TEXT REFERENCES path(path),
    version INT,
    parent_blob INT REFERENCES context_blob(blob_id),
    agent TEXT,
    timestamp TIMESTAMP,
    content TEXT
);
```
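A trimmed version of this schema can be exercised with Python's built-in `sqlite3` module. This is a sketch under assumptions: `SERIAL` is PostgreSQL syntax, so `INTEGER PRIMARY KEY` is swapped in for SQLite, several columns are omitted for brevity, and `inherited_context` is a hypothetical helper that walks the `parent` pointers with a recursive CTE.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE path (
    path   TEXT PRIMARY KEY,
    parent TEXT REFERENCES path(path)
);
CREATE TABLE context_blob (
    blob_id INTEGER PRIMARY KEY,  -- SQLite counterpart of SERIAL
    path    TEXT REFERENCES path(path),
    version INT,
    content TEXT
);
""")
conn.executemany("INSERT INTO path VALUES (?, ?)",
                 [("/", None), ("/a", "/"), ("/a/b", "/a")])
conn.executemany(
    "INSERT INTO context_blob (path, version, content) VALUES (?, ?, ?)",
    [("/", 1, "root"), ("/a", 1, "a-notes"), ("/a/b", 1, "b-notes")])


def inherited_context(conn, path):
    """Walk parent pointers with a recursive CTE; nearest node first."""
    rows = conn.execute("""
        WITH RECURSIVE chain(path, parent, depth) AS (
            SELECT path, parent, 0 FROM path WHERE path = ?
            UNION ALL
            SELECT p.path, p.parent, chain.depth + 1
            FROM path AS p JOIN chain ON p.path = chain.parent
        )
        SELECT b.content
        FROM chain JOIN context_blob AS b ON b.path = chain.path
        ORDER BY chain.depth, b.version
    """, (path,)).fetchall()
    return [content for (content,) in rows]


print(inherited_context(conn, "/a/b"))
```

The recursion stops at the root because its `parent` is NULL, which never matches the join condition.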
Graph (Property Graph)
- Node labels:
  - `(:Path {path: "...", last_blob: id})`
  - `(:Blob {blob_id, version, agent, timestamp, content})`
- Edges:
  - `(Path)-[:HAS_BLOB]->(Blob)`
  - `(Blob)-[:PARENT_VERSION]->(Blob)`
  - `(Path)-[:PARENT_PATH]->(Path)`
🧩 Summary & Recommendation
- A FUSE-based FS layer is well-suited for interface compatibility and rapid prototyping; RFUSE-style frameworks may help with performance if you scale.
- For backend, if you expect strict single-parent hierarchical contexts, relational DB is safe and simple.
- If you want multi-parent inheritance, semantic linking, branching, merging, graph DB offers greater flexibility.
- Versioning is supported in both: relational via version chains and history tables; graph via version edges.
- Hybrid: use PostgreSQL with graph extensions or embed a graph layer atop SQL for embeddings and semantic dive queries (academia.edu, sciencedirect.com, milvus.io, filesystems.org).