HCFS/PROJECT_PLAN.md
2025-07-30 09:34:16 +10:00

# PROJECT_PLAN.md

## 📘 Title

Context-Aware Hierarchical Context File System (HCFS): Unifying file system paths with context blobs for agentic AI cognition

1. Research Motivation & Literature Review 🧠

  • Semantic and context-aware file systems: Gifford et al. (1991) proposed early semantic file systems using directory paths as semantic queries (Wikipedia). Later work explored tag-based and ontology-based systems for richer metadata and context-aware retrieval (Wikipedia).
  • LLM-driven semantic FS (LSFS): The recent ICLR 2025 LSFS paper proposes integrating vector DBs and semantic indexing into a filesystem that supports prompt-driven file operations and semantic rollback (OpenReview).
  • Path-structure embeddings: Recent Transformer-based work shows file paths can be modeled as sequences for semantic anomaly detection—capturing hierarchy and semantics in embeddings (MDPI).
  • Context modeling frameworks: Ontology-driven context models (e.g. OWL/SOCAM) support representing, reasoning about, and sharing context hierarchically (arXiv).

Your HCFS merges these prior insights into a hybrid: directory navigation = query scope, backed by semantic context blobs in a DB, enabling agentic systems to zoom in/out contextually.


2. Objectives & Scope

  1. Design a virtual filesystem layer that maps hierarchical paths to context blobs.

  2. Build a context storage system (DB) to hold context units, versioned and indexed.

  3. Define APIs and syscalls for agents to:

    • navigate context scope (cd-style),
    • request context retrieval,
    • push new context,
    • merge or inherit context across levels.
  4. Enable decentralized context sharing: agents can publish updates at path nodes; peer agents subscribe by tree paths.

  5. Prototype on a controlled dataset / toy project tree to validate:

    • latency,
    • correct retrieval,
    • hierarchical inheritance semantics.

3. System Architecture Overview

3.1 Virtual Filesystem Layer (e.g. FUSE or AIOS integration)

  • Presents a standard POSIX (or AIOS-style) tree structure.
  • Each directory or file node has metadata pointers to context-blob IDs.
  • Traversal (e.g., ls, cd) triggers context lookup for that path.

3.2 Context Database Backend

  • Two possible designs:

    • Relational/SQLite + versioned tables: simple, transactional, supports hierarchical inheritance via path parent pointers.
    • Graph DB (e.g., Neo4j): ideal for multi-parent contexts, symlink-like context inheritance.
  • Context blobs include:

    • blob ID,
    • path(s) bound,
    • timestamp/version, author/agent,
    • embedding or semantic tags,
    • content or summary.
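
The blob fields listed above could be captured in a small record type. This is a minimal sketch, not a fixed schema: the class name `ContextBlob`, the field names, and the choice of a tuple of tags in place of an embedding vector are all illustrative assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class ContextBlob:
    """One context unit; field names mirror the list above but are illustrative."""
    blob_id: int
    paths: tuple[str, ...]      # path(s) the blob is bound to
    version: int
    agent: str                  # author/agent that produced the blob
    content: str                # full content or a summary
    tags: tuple[str, ...] = ()  # semantic tags; an embedding could be stored instead
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# Hypothetical example blob bound to a single path.
blob = ContextBlob(blob_id=1, paths=("/projects/hcfs",), version=1,
                   agent="agent-a", content="HCFS design notes")
```

Making the record frozen fits the idea of versioned blobs: an update produces a new blob rather than mutating an old one.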

3.3 Indexing & Embeddings

  • Generate embeddings of context blobs for semantic similarity retrieval (e.g. for context folding) (OpenReview, MDPI).
  • Use a combination of BM25 and embedding ranking (contextual retrieval) for accurate scope-based retrieval (TECHCOMMUNITY.MICROSOFT.COM).
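
One way to combine the two signals is a weighted sum of a normalized BM25 score and a cosine similarity. The sketch below assumes pre-tokenized blobs and hand-written two-dimensional vectors standing in for real embeddings; the function names, the `alpha` weight, and the sample blobs are hypothetical, and a real pipeline would use an embedding model and a tuned fusion method.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of each tokenized doc for the tokenized query."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Non-negative IDF variant (adds 1 inside the log).
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1.0)
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def hybrid_rank(query_tokens, query_vec, blobs, alpha=0.5):
    """Rank blob ids by alpha * normalized BM25 + (1 - alpha) * cosine similarity."""
    lexical = bm25_scores(query_tokens, [blob["tokens"] for blob in blobs])
    top = max(lexical) or 1.0  # avoid dividing by zero when nothing matches
    scored = [
        (alpha * lex / top + (1 - alpha) * cosine(query_vec, blob["vec"]), blob["id"])
        for lex, blob in zip(lexical, blobs)
    ]
    return [blob_id for _, blob_id in sorted(scored, reverse=True)]


# Toy blobs; the vectors are made up for illustration.
blobs = [
    {"id": "fuse-notes", "tokens": ["fuse", "mount", "latency"], "vec": [0.9, 0.1]},
    {"id": "db-schema", "tokens": ["sqlite", "schema", "version"], "vec": [0.1, 0.9]},
]
ranking = hybrid_rank(["fuse", "latency"], [1.0, 0.0], blobs)
```

Restricting `blobs` to those bound to the current path and its ancestors before ranking is what would make the retrieval scope-based.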

3.4 API & Syscalls

  • context_cd(path): sets current context pointer.
  • context_get(depth=N): retrieves cumulative context from current node up N levels.
  • context_push(path, blob): insert new context tied to a path.
  • context_list(path): lists available context blobs at that path.
  • context_subscribe(path): agent registers to receive updates at a path.
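
To make the call semantics concrete, here is an in-memory sketch of the five calls. It is an assumption-laden stand-in, not the planned service: the class name `ContextStore`, the callback argument on `context_subscribe`, and the nearest-first ordering of `context_get` are choices made for illustration only.

```python
import posixpath
from collections import defaultdict


class ContextStore:
    """In-memory stand-in for the API above; a real system would sit on a DB."""

    def __init__(self):
        self.blobs = defaultdict(list)        # path -> list of blob strings
        self.subscribers = defaultdict(list)  # path -> list of callbacks
        self.cwd = "/"

    def context_cd(self, path):
        """Set the current context pointer (relative paths resolve against it)."""
        self.cwd = posixpath.normpath(posixpath.join(self.cwd, path))
        return self.cwd

    def context_push(self, path, blob):
        """Attach a blob to a path and notify subscribers of that exact path."""
        path = posixpath.normpath(path)
        self.blobs[path].append(blob)
        for callback in self.subscribers[path]:
            callback(path, blob)

    def context_list(self, path):
        """Return the blobs bound directly to a path, without inheritance."""
        return list(self.blobs.get(posixpath.normpath(path), []))

    def context_get(self, depth=0):
        """Collect blobs from the current node and up to `depth` ancestors."""
        collected, path = [], self.cwd
        for _ in range(depth + 1):
            collected.extend(self.blobs.get(path, []))
            if path == "/":
                break
            path = posixpath.dirname(path)
        return collected

    def context_subscribe(self, path, callback):
        """Register a callback(path, blob) for future pushes at a path."""
        self.subscribers[posixpath.normpath(path)].append(callback)


store = ContextStore()
seen = []
store.context_subscribe("/proj/src", lambda path, blob: seen.append((path, blob)))
store.context_push("/proj", "project overview")
store.context_push("/proj/src", "source conventions")
store.context_cd("/proj/src")
nearest_first = store.context_get(depth=1)
```

After these calls, `nearest_first` holds the blob from `/proj/src` followed by the one from `/proj`, and `seen` holds only the push made at the subscribed path.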

4. Project Timeline & Milestones

| Phase | Duration | Deliverables |
|---|---|---|
| Phase 0: Research & Design | 2 weeks | Literature review doc, architecture draft |
| Phase 1: Prototype FS layer | 4 weeks | Minimal FUSE-based path→context mapping, CLI demo |
| Phase 2: Backend DB & storage | 4 weeks | Context blob storage, path linkage, versioning |
| Phase 3: Embedding & retrieval integration | 3 weeks | Embeddings + BM25 hybrid ranking for context relevance |
| Phase 4: API/Syscall layer scripting | 3 weeks | Python (or AIOS) service exposing navigation + push APIs |
| Phase 5: Agent integration & simulation | 3 weeks | Dummy AI agents navigating, querying, publishing context |
| Phase 6: Evaluation & refinement | 2 weeks | Usability, latency, retrieval relevance metrics |
| Phase 7: Write-up & publication | 2 weeks | Report, possible poster/paper submission |

5. Risks & Alternatives

  • Semantic vs hierarchical mismatch: Flat tag systems (e.g. Tagsistant) offer semantic tagging but lack path-based inheritance (research.ijcaonline.org, OpenReview, Wikipedia, arXiv, Anthropic).
  • Context explosion: many small blobs can flood the DB; mitigate via summarization/folding.
  • Performance tradeoffs: FS lookup latency must stay acceptable, and versioned graph storage may slow lookups down. Consider caching snapshots at each node.

6. Peer-Reviewed References

  • David Gifford et al., Semantic file systems, ACM SIGOPS Operating Systems Review (1991) (Wikipedia)
  • ICLR 2025: From Commands to Prompts: LLM-based Semantic File System for AIOS (LSFS) (OpenReview)
  • Xiaoyu et al., Transformer-based path sequence modeling for file-path anomaly detection (MDPI)
  • Tao Gu et al., Ontology-based Context Model in Intelligent Environments (SOCAM) (arXiv)

7. Next Steps

  • Review cited literature, build an annotated bibliography.
  • Choose backend stack (SQLite vs graph DB) and test embedding pipeline.
  • Begin Phase 1: implementing a minimal context-aware FS mock.


Core Architecture Considerations

FS design on top of FUSE and DB schema selection with versioning.

🖥️ 1. FS Architecture: FUSE Layer & Path-Context Mapping

Why FUSE makes sense

  • FUSE (Filesystem in Userspace) provides a widely used, flexible interface for prototyping new FS models without kernel hacking, enabling rapid development of virtual filesystems that you can mount and interact with via standard POSIX tools (IBM Research, Wikipedia).
  • Performance varies, but optimized designs or alternatives like RFUSE improve kernel-userspace communication latency and throughput, making userspace filesystems viable even in demanding use cases (USENIX).

Path-to-Context Mapping Schema

You'd implement a mapping where each path (directory or file) is bound to zero or more context blob IDs. Concepts:

  • Directory traversal (cd, ls) triggers path-based context lookups in your backend.
  • File reads (e.g. readfile(context)) return the merged or inherited context blob(s) for that node.
  • Inheritance: a context layer at /a/b/c implicitly inherits from the /a/b, /a, and / contexts as needed.
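
The inheritance rule above amounts to walking a path's ancestor chain and collecting the blob IDs bound at each level. A minimal sketch follows, assuming a plain dict for the path-to-blob bindings and root-first ordering; both helper names are hypothetical.

```python
from pathlib import PurePosixPath


def inheritance_chain(path):
    """Return the path and its ancestors, root first: /a/b -> ['/', '/a', '/a/b']."""
    node = PurePosixPath(path)
    return [str(p) for p in reversed([node, *node.parents])]


def resolve_blob_ids(path, bindings):
    """Collect blob IDs bound along the chain, from the root down to `path`."""
    ids = []
    for level in inheritance_chain(path):
        ids.extend(bindings.get(level, []))
    return ids


# Hypothetical bindings: a path may carry zero or more blob IDs.
bindings = {"/": [1], "/a/b": [7, 8]}
inherited = resolve_blob_ids("/a/b/c", bindings)
```

Root-first ordering lets a later merge step treat deeper levels as more specific than shallower ones.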

Caching & Merge Layers

  • Cache context snapshots at each directory layer to reduce repeated database hits.
  • Provide configurable merge strategies (union, override, summarization) to maintain efficient context retrieval.
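
The sketch below shows how per-directory snapshots and two of the merge strategies might fit together. It assumes each layer is a small key/value dict and uses `functools.lru_cache` as the snapshot cache; the `LAYERS` data, the hit counter, and all function names are made up for illustration, and the summarization strategy is left out because it would need a model.

```python
from functools import lru_cache
from pathlib import PurePosixPath

# Hypothetical backing store: each path maps to key/value context entries.
LAYERS = {
    "/": {"lang": "en", "style": "terse"},
    "/proj": {"style": "detailed", "owner": "agent-a"},
}
DB_HITS = {"count": 0}  # counts simulated database reads


def load_layer(path):
    """Simulate one database lookup for the context bound to a path."""
    DB_HITS["count"] += 1
    return LAYERS.get(path, {})


def merge_override(layers):
    """Deeper layers overwrite keys set by shallower ones."""
    merged = {}
    for layer in layers:
        merged.update(layer)
    return merged


def merge_union(layers):
    """Keep every distinct value per key, shallowest first."""
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            values = merged.setdefault(key, [])
            if value not in values:
                values.append(value)
    return merged


MERGE = {"override": merge_override, "union": merge_union}


@lru_cache(maxsize=256)
def snapshot(path, strategy="override"):
    """Merged context for a path; repeated calls are served from the cache."""
    node = PurePosixPath(path)
    chain = [str(p) for p in reversed([node, *node.parents])]
    return MERGE[strategy]([load_layer(level) for level in chain])


merged = snapshot("/proj")        # two layer loads: "/" and "/proj"
merged_again = snapshot("/proj")  # cache hit, no further loads
```

A push at any path would have to invalidate the cached snapshots for that path and its descendants; calling `snapshot.cache_clear()` is the bluntest way to do that in this sketch.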

📦 2. DB Design: Relational vs Graph & Context Versioning

Relational (e.g. SQLite/PostgreSQL)

  • Strong transactional guarantees and simple schema: tables like Blobs(blob_id, path, version, content, timestamp) plus a PathHierarchy(path, parent_path) table for inheritance.
  • Good for simple single-parent hierarchies and transactional versioning (with version numbers or history tables).
  • But joins across deep path hierarchies can get costly; semantic relationships or multi-parent inheritance are more cumbersome.
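
For a single-parent hierarchy, a recursive common table expression can fetch inherited blobs in one query instead of one join per level. The following sketch uses SQLite through Python's standard `sqlite3` module; the table names follow the bullet above, while the sample rows and the nearest-first ordering are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PathHierarchy (path TEXT PRIMARY KEY, parent_path TEXT);
CREATE TABLE Blobs (
    blob_id INTEGER PRIMARY KEY,
    path TEXT REFERENCES PathHierarchy(path),
    version INTEGER,
    content TEXT,
    timestamp TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Made-up sample data for illustration.
INSERT INTO PathHierarchy VALUES ('/', NULL), ('/a', '/'), ('/a/b', '/a');
INSERT INTO Blobs (path, version, content) VALUES
    ('/', 1, 'root context'),
    ('/a', 1, 'a context'),
    ('/a/b', 1, 'b context');
""")

# Walk parent pointers upward from the starting path, then join the blobs.
INHERITED = """
WITH RECURSIVE ancestors(path, depth) AS (
    SELECT ?, 0
    UNION ALL
    SELECT h.parent_path, a.depth + 1
    FROM ancestors AS a JOIN PathHierarchy AS h ON h.path = a.path
    WHERE h.parent_path IS NOT NULL
)
SELECT b.path, b.content
FROM ancestors AS a JOIN Blobs AS b ON b.path = a.path
ORDER BY a.depth
"""
rows = conn.execute(INHERITED, ("/a/b",)).fetchall()
```

The query returns the blob at `/a/b` first, then those at `/a` and `/`. Multi-parent inheritance would need a separate edge table, which is where the cost noted above starts to show.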

Graph Database (e.g. Neo4j)

  • Nodes represent paths and context blobs; edges represent parent-child, semantic relations, and "derived" links.
  • Ideal for multi-parent or symlink-like context inheritance, semantic network traversal, or hierarchy restructuring (Wikipedia).
  • Enables queries like: “find all context blobs reachable within N hops from path X,” or “retrieve peers with similar context semantics.”
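
The N-hop query mentioned above is a bounded breadth-first traversal. A graph database would express it in its own query language; the Python sketch below only illustrates the traversal logic over a hypothetical in-memory edge map, with node names and the `blob:` prefix invented for the example.

```python
from collections import deque

# Hypothetical directed edges: nodes are paths ("/a") or blobs ("blob:1").
EDGES = {
    "/a/b": ["/a", "blob:3"],
    "/a": ["/", "blob:2"],
    "/": ["blob:1"],
    "blob:3": ["blob:2"],  # e.g. a "derived" link between blobs
}


def blobs_within(start, max_hops):
    """Return (blob, hops) pairs reachable from `start` in at most `max_hops` edges."""
    seen, found = {start}, []
    queue = deque([(start, 0)])
    while queue:
        node, hops = queue.popleft()
        if node.startswith("blob:") and node != start:
            found.append((node, hops))
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in EDGES.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, hops + 1))
    return found


nearby = blobs_within("/a/b", 2)
```

With a limit of two hops, the blob bound at the root stays out of reach because it sits three edges away from `/a/b`.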

Hybrid Approaches

  • A relational backend augmented with semantic tables or converted into a graph as needed for richer queries (memgraph.com, link.springer.com).
  • Example: relational for version history and base structure, graph/cloud-based embeddings for semantic relationships.

Context Versioning

  • Must support hierarchical version control: each blob should have metadata like blob_id, version_id, parent_version, agent_id, timestamp.
  • You can implement simple version chains in relational DB or LTS support (e.g. graph edges representing “version-of” relationships).
  • Track changes with immutable blob storage; allow rollbacks or context diffs.
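
A version chain with immutable entries can support both rollback and diffs. This is a minimal sketch under stated assumptions: the class names are hypothetical, timestamps are omitted for brevity, and rollback is modeled as a new commit that copies old content so history is never rewritten.

```python
import difflib
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BlobVersion:
    version_id: int
    parent_version: Optional[int]  # None for the first version
    agent_id: str
    content: str


class VersionChain:
    """Append-only history for one blob; stored versions are never mutated."""

    def __init__(self):
        self.versions = {}  # version_id -> BlobVersion
        self.head = None

    def commit(self, agent_id, content):
        """Append a new version whose parent is the current head."""
        version_id = len(self.versions) + 1
        self.versions[version_id] = BlobVersion(version_id, self.head, agent_id, content)
        self.head = version_id
        return version_id

    def rollback(self, version_id, agent_id):
        """Roll back by committing a copy of an old version's content."""
        return self.commit(agent_id, self.versions[version_id].content)

    def diff(self, old_id, new_id):
        """Unified diff between the contents of two versions, as a list of lines."""
        old, new = self.versions[old_id], self.versions[new_id]
        return list(difflib.unified_diff(
            old.content.splitlines(), new.content.splitlines(),
            f"v{old_id}", f"v{new_id}", lineterm=""))


chain = VersionChain()
v1 = chain.commit("agent-a", "use SQLite")
v2 = chain.commit("agent-b", "use Neo4j")
v3 = chain.rollback(v1, "agent-a")
changes = chain.diff(v1, v2)
```

Because every entry records `parent_version`, the same data maps directly onto either a relational history table or "version-of" edges in a graph.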

🔍 Comparison Table

| Feature | Relational DB | Graph DB |
|---|---|---|
| Hierarchy resilience | Works well for strict trees; joins required for multi-parent | Native multi-parent and traversals |
| Performance | Fast for simple lookups; may slow with joins | Roughly constant cost per hop; total cost grows with the number of nodes visited |
| Versioning | Straightforward with version tables; chronology easy | Version graph edges, easier branching/merging |
| Semantic links | Requires additional tables or indexes | First-class properties/relationships |
| Cost & tooling | SQLite is lightweight and well known | Requires a graph engine (Neo4j, etc.) |

🧠 Integration Architecture

FS Layer

  • Run FUSE-based FS presenting standard directories/files.
  • On lookup, FS resolves the path and queries DB for context blobs.
  • On read, the FS returns the merged context string; a write or push maps to context_push(path, blob_content), which is also exposed via MCP endpoints.

Backend DB Schema Sketch

Relational (SQL)

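-- One row per directory or file node; parent is NULL for the root.
CREATE TABLE path (
  path TEXT PRIMARY KEY,
  parent TEXT REFERENCES path(path)
);
-- One row per blob version; parent_blob points at the previous version.
CREATE TABLE context_blob (
  blob_id SERIAL PRIMARY KEY,  -- SERIAL is PostgreSQL syntax
  path TEXT REFERENCES path(path),
  version INT,
  parent_blob INT REFERENCES context_blob(blob_id),
  agent TEXT,
  timestamp TIMESTAMP,
  content TEXT
);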

Graph (Property Graph)

  • Node labels:

    • (:Path {path: "...", last_blob: id})
    • (:Blob {blob_id, version, agent, timestamp, content})
  • Edges:

    • (Path)-[:HAS_BLOB]->(Blob)
    • (Blob)-[:PARENT_VERSION]->(Blob)
    • (Path)-[:PARENT_PATH]->(Path)

🧩 Summary & Recommendation

  • A FUSE-based FS layer is well-suited for interface compatibility and rapid prototyping; RFUSE-style frameworks may help with performance if you scale.
  • For backend, if you expect strict single-parent hierarchical contexts, relational DB is safe and simple.
  • If you want multi-parent inheritance, semantic linking, branching, merging, graph DB offers greater flexibility.
  • Versioning is supported in both: relational via version chains and history tables; graph via version edges.
  • Hybrid: use PostgreSQL with graph extensions or embed a graph layer atop SQL for embeddings and semantic dive queries (academia.edu, sciencedirect.com, milvus.io, filesystems.org).