feat(symgraph): improve ingestion pipeline for large repositories. #1

Merged
grahambrooks merged 3 commits into grahambrooks:main from Zenithar:zenithar/symgraph/recursion_fix
May 30, 2026

@Zenithar

Contributor

Summary

This changes full reindexing from an in-place incremental update into a clean rebuild of a shadow database followed by an atomic swap.

It also hardens indexing and extraction in a few places:

  • keeps full rebuilds from leaving stale symbols behind
  • keeps the live DB usable if a rebuild fails
  • improves bulk indexing behaviour for FTS/WAL handling
  • avoids stack overflows on deeply nested ASTs
  • makes CLI indexing show progress

Problem

Before this change, the system treated “reindex everything” too much like “incrementally update the existing DB.”

That caused a few concrete issues:

  1. Full reindexes could leave stale rows behind after file deletes or symbol renames.
  2. A failed rebuild risked interfering with the live on-disk database state.
  3. Bulk indexing did unnecessary per-file/per-row maintenance work.
  4. Deeply nested syntax trees could overflow the extractor’s recursive traversal.
  5. The CLI and MCP surface described reindexing in ways that no longer matched the behaviour we actually want.

What changed

1. Full rebuilds now use a shadow database and atomic swap

  • Added build_full_index for building a complete index into an empty target DB.
  • Added shadow DB helpers:
    • open_shadow_database
    • rebuild_project_database
    • swap_shadow_database
  • Added DB swap support:
    • prepare_for_swap
    • replace_with_shadow
    • sidecar cleanup / reopen helpers

Why:

  • Solves stale data after deletes/renames by rebuilding from scratch.
  • Gives us a clean cutover instead of mutating the live DB in place.
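
The build-then-swap flow can be sketched with plain files. This is a minimal illustration, not the PR's implementation: `shadow_path_for` and `rebuild_with_swap` are hypothetical names, the `build` closure stands in for the real indexing work, and the sketch ignores SQLite sidecar files (WAL/SHM) that the real swap helpers have to clean up.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Hypothetical helper: place the shadow file next to the live DB so the
/// final rename never crosses a filesystem boundary.
fn shadow_path_for(live: &Path) -> PathBuf {
    let mut name = live.file_name().unwrap_or_default().to_os_string();
    name.push(".shadow");
    live.with_file_name(name)
}

/// Build into the shadow file, then rename it over the live file.
/// `build` is a stand-in for the real full-index build.
fn rebuild_with_swap<F>(live: &Path, build: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    let shadow = shadow_path_for(live);
    // Start from an empty target so no stale rows can survive the rebuild.
    let _ = fs::remove_file(&shadow);

    if let Err(err) = build(&shadow) {
        // Failed build: discard the shadow and leave the live file untouched.
        let _ = fs::remove_file(&shadow);
        return Err(err);
    }

    // Replaces the destination in one step; this only works when both
    // paths are on the same mount point.
    fs::rename(&shadow, live)
}
```

Because the live file is only touched by the final rename, a build that fails partway leaves readers on the previous index.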

2. CLI index now performs a real rebuild

  • index_command now calls rebuild_project_database(...) instead of incremental indexing.
  • Added show_progress to IndexConfig.
  • Added indicatif and wired progress bars for scanning/parsing/storing/resolving.

Why:

  • Solves the mismatch where symgraph index sounded like a fresh rebuild but behaved incrementally.
  • Makes large indexing runs easier to understand from the CLI.

3. MCP reindex semantics are now explicit

  • files: Some([...]) still performs targeted in-place reindexing.
  • Omitting files now performs a full shadow rebuild.
  • Updated MCP descriptions and output text to reflect that.

Why:

  • Solves API ambiguity.
  • Preserves the cheap targeted path for specific files while making full rebuilds correct.
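
The dispatch rule above amounts to a match on the optional file list. A rough sketch follows; `ReindexPlan` and `plan_reindex` are hypothetical names, not the PR's handler. The PR does not say how an empty list is handled, so the sketch simply treats `Some(vec![])` as a targeted request with nothing to do.

```rust
/// Hypothetical type naming which path a reindex request takes.
#[derive(Debug, PartialEq)]
enum ReindexPlan {
    /// Reindex only the listed files, in place, against the live DB.
    Targeted(Vec<String>),
    /// Rebuild everything into a shadow DB, then swap it in.
    FullShadowRebuild,
}

/// Map the optional `files` argument onto one of the two paths.
fn plan_reindex(files: Option<Vec<String>>) -> ReindexPlan {
    match files {
        Some(list) => ReindexPlan::Targeted(list),
        None => ReindexPlan::FullShadowRebuild,
    }
}
```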

4. Bulk indexing path is split by mode

  • Introduced IndexMode::{Incremental, FullBuild}.
  • Incremental mode:
    • disables FTS automerge during bulk inserts
    • checkpoints WAL every CHECKPOINT_INTERVAL files
    • optimises FTS after inserts
  • Full-build mode:
    • inserts canonical rows first
    • skips per-insert FTS maintenance
    • rebuilds FTS once at the end

Why:

  • Solves unnecessary bulk-write overhead.
  • Keeps WAL size bounded during large incremental updates.
  • Makes full rebuilds behave like full rebuilds rather than repeated incremental updates.
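
To make the difference between the two modes concrete, the sketch below lists the maintenance steps each mode would schedule around the per-file inserts. Only `IndexMode` and its variants come from this PR. `Step` and `plan_steps` are hypothetical, and the checkpoint interval is passed as a parameter because the real value of `CHECKPOINT_INTERVAL` is not stated here.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum IndexMode {
    Incremental,
    FullBuild,
}

/// Hypothetical description of one unit of work in a bulk indexing run.
#[derive(Debug, PartialEq)]
enum Step {
    DisableFtsAutomerge,
    InsertFile(usize),
    CheckpointWal,
    OptimizeFts,
    RebuildFts,
}

/// List the steps for `file_count` files in the given mode.
fn plan_steps(mode: IndexMode, file_count: usize, checkpoint_interval: usize) -> Vec<Step> {
    let mut steps = Vec::new();
    match mode {
        IndexMode::Incremental => {
            steps.push(Step::DisableFtsAutomerge);
            for i in 1..=file_count {
                steps.push(Step::InsertFile(i));
                // Periodic checkpoints keep the WAL from growing without bound.
                if checkpoint_interval > 0 && i % checkpoint_interval == 0 {
                    steps.push(Step::CheckpointWal);
                }
            }
            steps.push(Step::OptimizeFts);
        }
        IndexMode::FullBuild => {
            // Canonical rows first, no per-insert FTS maintenance.
            for i in 1..=file_count {
                steps.push(Step::InsertFile(i));
            }
            // One FTS rebuild at the very end.
            steps.push(Step::RebuildFts);
        }
    }
    steps
}
```

With four files and an interval of two, the incremental plan contains two WAL checkpoints, while the full-build plan contains none and ends with a single FTS rebuild.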

5. Extraction traversal is now iterative

  • Replaced recursive symbol traversal with an explicit work stack.
  • Kept contains-edge / nested symbol behaviour intact through explicit push/pop handling.

Why:

  • Solves stack overflow risk on deeply nested ASTs.
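
The general technique is sketched below: a depth-first walk driven by an explicit work stack, with paired enter/exit items so the current parent is still known when contains-edges are emitted. The tree is a hypothetical stand-in (a list of child indices per node), not the extractor's real AST type, and `contains_edges` is an illustrative name.

```rust
/// Work items: visit a node, or leave the most recently entered one.
enum Work {
    Enter(usize),
    Exit,
}

/// Walk the tree without recursion and return (parent, child) edges in
/// pre-order. `children[i]` lists the child indices of node `i`.
fn contains_edges(children: &[Vec<usize>], root: usize) -> Vec<(usize, usize)> {
    let mut edges = Vec::new();
    // Stack of currently open ancestors; replaces the implicit call stack.
    let mut parents: Vec<usize> = Vec::new();
    let mut work = vec![Work::Enter(root)];

    while let Some(item) = work.pop() {
        match item {
            Work::Enter(id) => {
                if let Some(&parent) = parents.last() {
                    edges.push((parent, id));
                }
                parents.push(id);
                // The Exit marker pops this node once its subtree is done.
                work.push(Work::Exit);
                // Push in reverse so children are visited left to right.
                for &child in children[id].iter().rev() {
                    work.push(Work::Enter(child));
                }
            }
            Work::Exit => {
                parents.pop();
            }
        }
    }
    edges
}
```

Depth is now bounded by heap memory for the two vectors rather than by the thread's stack size, so a chain of hundreds of thousands of nested nodes is walked the same way as a shallow tree.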

6. Signature truncation is UTF-8 safe

  • Long signatures are now truncated at a valid character boundary.

Why:

  • Solves potential invalid slicing on non-ASCII signatures.
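
One common way to do this is to back the cut point up to the nearest character boundary before slicing. The sketch below assumes a byte-length limit and no trailing ellipsis; neither detail is stated in this PR, and `truncate_signature` is a hypothetical name.

```rust
/// Return at most `max_bytes` bytes of `sig`, never splitting a character.
fn truncate_signature(sig: &str, max_bytes: usize) -> &str {
    if sig.len() <= max_bytes {
        return sig;
    }
    let mut end = max_bytes;
    // Index 0 is always a boundary, so this loop terminates.
    while !sig.is_char_boundary(end) {
        end -= 1;
    }
    &sig[..end]
}
```

Slicing `&sig[..max_bytes]` directly would panic whenever `max_bytes` lands inside a multi-byte character, which is the failure this change removes.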

Tests added / updated

Added regression coverage for:

  • full build populating an empty DB
  • full rebuild replacing stale rows
  • keeping the live DB intact on rebuild failure
  • MCP full rebuild reopening the live handle correctly
  • file-scoped reindex staying in place
  • periodic checkpoint behaviour in incremental indexing
  • deeply nested extraction without stack overflow

Verification

Ran targeted tests:

  • cargo test test_build_full_index_populates_empty_target_db
  • cargo test test_rebuild_project_database
  • cargo test test_incremental_index_codebase_periodic_checkpoint
  • cargo test test_handle_reindex_
  • cargo test test_deeply_nested_does_not_overflow

All passed.

Notes for reviewers

The main design decision here is a clean cutover:

  • full rebuilds are now correctness-first via shadow DB swap
  • targeted file reindex remains incremental and scoped

@Zenithar

Contributor Author

@grahambrooks, is there any chance of getting a review of this PR?

…licts

Brings the coupling-analysis features, new extraction edges, and configurable
index storage from main into Zenithar's shadow-DB rebuild PR. Conflicts were in
extraction and the CLI DB layer where both sides changed the same code:

- src/extraction/mod.rs: main added an edge-extraction walker (field
  accesses/mutates, imports, enum dispatch, &mut params) on the old recursive
  traversal; this PR rewrote traversal + call-finding into an iterative
  work-stack to avoid stack overflows. Resolved by porting the new edge logic
  onto the iterative structure — a single iterative `find_references` now emits
  Calls/Accesses/Mutates/References, plus the import + &mut-param hooks — so the
  recursion-safety fix and the coupling edges both hold.

- src/cli/db_utils.rs: this PR's shadow database hardcoded `.symgraph/`. main
  made the live DB location configurable (git-dir / cache / .symgraph). Resolved
  by co-locating the shadow with the *resolved* live DB so the atomic
  rename in `replace_with_shadow` stays on one filesystem regardless of storage
  strategy (the old `ensure_database_directory` is replaced by inline creation).

- src/cli/commands.rs: union of imports; index uses `rebuild_project_database`
  while status/search/context use the new `resolve_db` (the removed
  `database_path` is dropped).

- src/lib.rs: `build_full_index` now rejects a non-directory root, so a failed
  rebuild reliably preserves the live DB regardless of where the shadow lives
  (restores the intent of test_rebuild_project_database_keeps_live_db_on_failure
  after storage resolution changed its incidental failure path).

All 181 tests pass; clippy -D warnings and rustfmt clean. Dogfooded: a full
rebuild via the shadow swap produces the new accesses/mutates/imports/dispatch
edges and field/enum_member nodes.

@grahambrooks merged commit 5493b1b into grahambrooks:main on May 30, 2026
11 checks passed
@grahambrooks

Owner

@Zenithar

Sorry about the delay in review and merge. I'll push a new release tomorrow.
