Feature: CodeGraph-like repository knowledge graph (ast-grep extractor + graph embeddings) #490

@AlexMikhalev

Problem

Claude/agent sessions repeatedly pay an “exploration tax” (many file reads/greps) to rebuild a mental model of a repo. CodeGraph addresses this by building a local, queryable code graph once and exposing agent-friendly queries (context building, callers/callees, impact).

Terraphim is well-positioned to provide a similar UX with additional advantages (thesaurus/rolegraph + graph embeddings), but we need a first-class per-repository code structure KG and agent-facing tools.

Proposal

Implement a CodeGraph-like repo index inside Terraphim:

1) Build a per-repo code KG

  • Extract symbols + relations using ast-grep (initial focus: defs + imports; later calls).
  • Normalize to a stable schema:
    • Nodes: CodeSymbol (function/class/method/interface/type), File, Module
    • Edges: Imports, Defines, Extends, Implements, References, Calls (phase 2)
    • Track file, span, signature, and optional docstring/snippet per node, plus a confidence score per edge
  • Store the graph alongside the source text (for lexical search, BM25/FTS) and embeddings (for semantic retrieval).
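
As a rough illustration of the schema above, the node and edge records might look like the following Python sketch. The type names (`Node`, `Edge`, `NodeKind`, `EdgeKind`), the field names, and the `file::name` ID format are all assumptions made for illustration, not existing Terraphim types.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class NodeKind(Enum):
    FILE = "file"
    MODULE = "module"
    FUNCTION = "function"
    CLASS = "class"
    METHOD = "method"
    INTERFACE = "interface"
    TYPE = "type"


class EdgeKind(Enum):
    IMPORTS = "imports"
    DEFINES = "defines"
    EXTENDS = "extends"
    IMPLEMENTS = "implements"
    REFERENCES = "references"
    CALLS = "calls"  # phase 2 in the proposal


@dataclass(frozen=True)
class Node:
    node_id: str
    kind: NodeKind
    name: str
    file: str
    span: Tuple[int, int]  # (start_line, end_line); 1-based inclusive is an assumption
    signature: Optional[str] = None
    docstring: Optional[str] = None


@dataclass(frozen=True)
class Edge:
    source: str  # node_id of the origin
    target: str  # node_id of the destination
    kind: EdgeKind
    confidence: float = 1.0  # lower when name resolution was heuristic


def make_node_id(file: str, name: str) -> str:
    """Hypothetical stable ID: repo-relative path plus symbol name."""
    return f"{file}::{name}"
```

Deriving IDs from path and name rather than from a counter is one way to keep them reproducible across re-indexing, which the acceptance criteria ask for.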

2) Add agent-facing query tools (MCP)

Provide structured endpoints that avoid file-by-file exploration:

  • repo_context(task, entrypoints?, depth?) → top files/symbols + snippets + why
  • repo_search(query, mode=name|semantic|hybrid) → ranked symbols/files
  • repo_callers(symbol_id) / repo_callees(symbol_id) (phase 2 when call graph is good)
  • repo_impact(symbol_id|file, radius) → blast radius + risk score
  • (optional) repo_path(from_symbol,to_symbol) → explanation chain
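
To make `repo_impact` concrete, here is a minimal sketch that treats the blast radius as a breadth-first walk over reversed import edges. The function names, the `(importer, imported)` edge tuples, and the file names are hypothetical; the risk score is left out because the proposal does not define it.

```python
from collections import deque
from typing import Dict, Iterable, List, Tuple


def build_reverse_index(edges: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each imported node to the nodes that import it."""
    reverse: Dict[str, List[str]] = {}
    for importer, imported in edges:
        reverse.setdefault(imported, []).append(importer)
    return reverse


def repo_impact(reverse: Dict[str, List[str]], start: str, radius: int) -> Dict[str, int]:
    """Return {dependent: hops} for every node within `radius` hops of `start`."""
    distances: Dict[str, int] = {}
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if dist == radius:
            continue  # do not expand beyond the requested radius
        for dependent in reverse.get(node, []):
            if dependent not in seen:
                seen.add(dependent)
                distances[dependent] = dist + 1
                queue.append((dependent, dist + 1))
    return distances


# Illustrative import graph: (importer, imported)
edges = [
    ("api.py", "auth.py"),
    ("cli.py", "api.py"),
    ("tests.py", "cli.py"),
    ("auth.py", "crypto.py"),
]
reverse = build_reverse_index(edges)
impact = repo_impact(reverse, "auth.py", radius=2)
```

Here `impact` contains `api.py` at one hop and `cli.py` at two hops; `tests.py` falls outside the radius, and `crypto.py` is not affected because it is a dependency of `auth.py`, not a dependent.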

3) Leverage Terraphim graph embeddings (differentiator)

  • Combine text embeddings with graph embeddings (connectivity-aware) for ranking.
  • Use rolegraph/thesaurus to bias retrieval toward domain concepts (auth/payment/etc).
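
One simple way to combine the two signals is a weighted sum of cosine similarities. The sketch below assumes each candidate symbol already has a text vector and a graph vector, and that the graph-side query vector comes from an entrypoint symbol; the `alpha` weight and all names are hypothetical, and Terraphim's actual graph embeddings may be combined differently.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_rank(
    query_text: Sequence[float],
    query_graph: Sequence[float],
    candidates: Dict[str, Tuple[Sequence[float], Sequence[float]]],
    alpha: float = 0.7,
) -> List[Tuple[str, float]]:
    """Rank candidates by alpha * text similarity + (1 - alpha) * graph similarity."""
    scored = []
    for symbol_id, (text_vec, graph_vec) in candidates.items():
        score = alpha * cosine(query_text, text_vec) + (1 - alpha) * cosine(query_graph, graph_vec)
        scored.append((symbol_id, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy 2-d vectors: (text embedding, graph embedding) per symbol
candidates = {
    "login": ([1.0, 0.0], [0.0, 1.0]),
    "logout": ([0.0, 1.0], [1.0, 0.0]),
}
ranked = hybrid_rank([1.0, 0.0], [1.0, 0.0], candidates, alpha=0.7)
```

With `alpha=0.7` the text match dominates and `login` ranks first; lowering `alpha` shifts the ranking toward graph proximity. A rolegraph/thesaurus bias could be added as a further term, which this sketch does not attempt.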

4) Incremental updates

  • init: index repo at HEAD
  • update: re-index changed files (git diff / mtimes)
  • Optional git hook integration
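
The update step can be sketched as a comparison between two manifests. This example uses content hashes instead of `git diff` or mtimes, purely as a self-contained stand-in; `snapshot` and `plan_update` are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Dict, List, Tuple


def snapshot(root: Path) -> Dict[str, str]:
    """Map each file's repo-relative path to the SHA-256 of its contents.

    A real indexer would also skip .git and ignored paths; omitted here.
    """
    manifest: Dict[str, str] = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            rel = path.relative_to(root).as_posix()
            manifest[rel] = hashlib.sha256(path.read_bytes()).hexdigest()
    return manifest


def plan_update(old: Dict[str, str], new: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (files to re-index, files to drop from the index)."""
    to_reindex = sorted(p for p, digest in new.items() if old.get(p) != digest)
    to_remove = sorted(p for p in old if p not in new)
    return to_reindex, to_remove


# Illustrative manifests with shortened digests
old = {"a.py": "1", "b.py": "2", "c.py": "3"}
new = {"a.py": "1", "b.py": "9", "d.py": "4"}
to_reindex, to_remove = plan_update(old, new)
```

In this example `b.py` (changed) and `d.py` (new) would be re-indexed and `c.py` (deleted) would be removed, while `a.py` is left untouched.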

Phased delivery

Phase 1 (MVP)

  • ast-grep extraction for defs + imports
  • store nodes/edges + snippets
  • MCP tools: repo_search, repo_context, repo_impact (import graph)
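
To show the shape of a "defs + imports" extraction result, the sketch below swaps in Python's standard-library `ast` module for ast-grep and handles Python source only. It is not the ast-grep API, and the output format is an assumption.

```python
import ast
from typing import Dict, List


def extract_defs_and_imports(source: str) -> Dict[str, List[str]]:
    """Collect defined symbol names and imported module names from Python source."""
    tree = ast.parse(source)
    defs: List[str] = []
    imports: List[str] = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            defs.append(node.name)
        elif isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Keep leading dots so relative imports stay distinguishable
            imports.append("." * node.level + (node.module or ""))
    return {"defs": sorted(defs), "imports": sorted(imports)}


sample = '''
import os
from pathlib import Path

class Indexer:
    def run(self):
        return os.getcwd()

def main():
    Indexer().run()
'''
result = extract_defs_and_imports(sample)
```

For `sample`, `result["defs"]` holds `Indexer`, `main`, and `run`, and `result["imports"]` holds `os` and `pathlib`. The call from `main` to `Indexer().run` is deliberately not captured, matching the Phase 1 scope.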

Phase 2 (Parity)

  • call graph extraction + improved resolution (start with TS/JS or Rust)
  • enable repo_callers/repo_callees with acceptable precision

Phase 3 (Enhancement)

  • graph embeddings integrated into retrieval/ranking

Acceptance criteria

  • On a non-trivial repo (e.g., terraphim-ai), an agent can answer “where is X handled” and “what breaks if I change Y” with significantly fewer file reads/greps than the same agent without the index (the baseline), measured as tool calls per task.
  • Index is local, reproducible, and updates incrementally.
  • Tool outputs are structured (IDs + locations) and suitable for automated context building.

Notes

  • CodeGraph reference (conceptual): structured code graph + MCP tools to avoid repeated exploration.
  • Implement storage Terraphim-native; keep UX/tooling CodeGraph-like.
