Skip to content

Persistent SQLite DB index with cache-aware search#47

Open
tony wants to merge 25 commits into
masterfrom
streamline-04
Open

Persistent SQLite DB index with cache-aware search#47
tony wants to merge 25 commits into
masterfrom
streamline-04

Conversation

@tony
Copy link
Copy Markdown
Owner

@tony tony commented Jun 6, 2026

Summary

  • Add a persistent SQLite DB index (ADR 0005): agentgrep db sync|status|explain materializes discovered agent history into a local WAL-mode database with an FTS5 text index, a source ledger with freshness state, and stable source/record ids. The DB is derived state — Codex, Claude, Cursor, Gemini, Grok, Pi, and OpenCode stores remain the source of truth.
  • Add cache-aware search execution: grep and search gain --cache auto|require|off (and --no-cache). The default auto serves a query from the DB index only when it can answer it and falls back to the live scanner otherwise; require insists on the DB path, and unsupported cache-required queries fail as clean CLI errors instead of tracebacks.
  • Add live stderr sync progress with cooperative early exit — Enter stops at the next source boundary and keeps partial results; structured stdout stays machine-readable in JSON and NDJSON modes.
  • Add cache-fast sync defaults: source_state fingerprint checks skip unchanged sources unless --force is passed, and a source-id index keeps per-source replacement fast.
  • Add human-readable semantic summaries as the default text output for db payloads, with explicit --json/--ndjson machine modes.
  • Add the read-only db_status MCP tool.

Changes by area

DB index

src/agentgrep/db.py: DbStore (SQLite/WAL/FTS5 schema: source ledger plus source_state, normalized records), DbRuntime sync/status/search, sync progress protocol, freshness fingerprints, and a default path under the XDG cache directory overridable via AGENTGREP_DB or --db.

Search integration

src/agentgrep/__init__.py, src/agentgrep/_engine/: cache-mode plumbing through SearchRuntime so search-shaped commands consult the DB index under --cache auto|require and preserve the live-scan path under --no-cache.

CLI

src/agentgrep/cli/parser.py, src/agentgrep/cli/render.py: the db command group with progress flags, semantic human summaries in text mode, JSON/NDJSON machine modes, and clean CLI errors for unsupported cached queries.

MCP

src/agentgrep/mcp/tools/db_tools.py, src/agentgrep/mcp/models.py: the read-only db_status tool with its pydantic response model and capability registration.

Docs

ADR 0005; per-command CLI pages for db, plus argparse-backed pages for the existing search and ui commands; a DB-index development page; DB cache controls in the configuration guide.

Design decisions

  • The DB is a materialized read model, never the source of truth: deleting the database loses cached records only; agent history stores stay untouched and agentgrep stays read-only over them.
  • --cache auto is conservative: the index only answers queries it can satisfy; anything else falls through to the live scanner, and --cache require turns unsupported queries into explicit CLI errors.
  • Sync skips by fingerprint, not by guess: source_state records the synced mtime and content fingerprint; --force is the explicit full-rebuild path.

Test plan

  • tests/test_db_index.pyDbStore schema and sync behavior: freshness skips, --force resync, FTS queries, deterministic ids, early exit
  • tests/test_cache_cli.py — CLI contracts for db and cache modes: JSON and NDJSON stdout, human summaries, progress output, error paths
  • tests/test_agentgrep_mcp.py — read-only db_status payloads and capability registration
  • Full gate per commit: ruff check / ruff format, ty check, uv run pytest (incl. doctests), just build-docs

Stacked PR

The insights/suggestions surface that previously shared this branch now lives in its own PR stacked on this one.

@tony tony force-pushed the streamline-04 branch from 22883f0 to 9f1177b Compare June 6, 2026 18:25
@tony tony force-pushed the streamline-04 branch from 9f1177b to 1042dae Compare June 6, 2026 19:27
@tony tony changed the title Persistent DB index with deterministic insights and suggestions Persistent SQLite DB index with cache-aware search Jun 6, 2026
@tony
Copy link
Copy Markdown
Owner Author

tony commented Jun 6, 2026

Code review

Found 3 issues:

  1. _source_fingerprint stats only the main source file, so WAL-mode SQLite stores (Cursor state.vscdb, Grok, OpenCode) whose commits land in the -wal sidecar look unchanged — source_is_current returns true and db sync skips them, serving stale records. This is the same failure mode trunk commit f41a2df fixed for the in-memory scan cache in _engine/scanning.py (_source_scan_cache_key includes the sidecar's size/mtime for source_kind == "sqlite"); the DB freshness path doesn't carry that fix forward.

def _source_fingerprint(source: agentgrep.SourceHandle) -> str:
"""Return a cheap source fingerprint for cache freshness checks."""
try:
stat = source.path.stat()
except OSError:
size = 0
mtime_ns = source.mtime_ns
else:
size = stat.st_size
mtime_ns = stat.st_mtime_ns
return text_hash(f"{source.path}\0{size}\0{mtime_ns}")

  1. _db_status_sync opens a DbRuntime (one sqlite3.connect per call) and never closes it. In the long-lived MCP server every db_status call leaks a connection/file descriptors (verified: 3 fds per call), and unclosed WAL-mode connections can suppress checkpointing. DbStore.close() exists but is unreachable here — DbRuntime has no close/context-manager.

)
status = DbRuntime.open(path).status()
return DbStatusModel(

  1. The docstring of test_collect_search_records_does_not_import_profiler_when_inactive was changed to "The non-profiled db path keeps the profiler module unloaded", but the test exercises collect_search_records (the record-collection path) and has nothing to do with the db command — the original "collection path" wording referred to record collection, not a command name, so the new text is factually wrong.

) -> None:
"""The non-profiled db path keeps the profiler module unloaded."""
source = agentgrep.SourceHandle(

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

@tony
Copy link
Copy Markdown
Owner Author

tony commented Jun 6, 2026

Follow-up: all review findings addressed, including the ones below the posting threshold, one commit per issue.

Finding Commit
WAL sidecars invisible to sync freshness (posted, #1) agentgrep(fix[db]): Fingerprint SQLite WAL sidecars in sync freshness
db_status connection leak (posted, #2) agentgrep(fix[db]): Close per-call DB runtimes
Wrong profiling docstring (posted, #3) agentgrep(fix[test]): Describe the profiling guard test accurately
FTS5 external-content deletes with placeholder values agentgrep(fix[db]): Delete FTS rows with their stored values
Paste-cache cross-file adapters served stale syncs agentgrep(fix[db]): Exempt cross-file adapters from sync freshness skips
Whole-token FTS dropped substring matches under --cache require agentgrep(fix[db]): Match cached searches by exact substring semantics (trigram tokenizer + exact post-filter; rebuild policy in ADR 0008)
Cache path bypassed per-session dedup agentgrep(fix[db]): Apply per-session dedup to cached search results
Read-only db_status wrote schema metadata agentgrep(fix[db]): Open status surfaces read-only
db explain identical to db status agentgrep(feat[db]): Report cache diagnostics from db explain
Missing doctests on pure helpers agentgrep(test[db]): Add doctests to pure db and render helpers

Found along the way: the test suite could read the developer's real cache via default_db_path() — now isolated per-test (agentgrep(test[db]): Isolate the user cache from the test suite), and ADR 0008 records the cache schema rebuild policy. The stacked insights PR picks up the same lifecycle/read-only treatment for its tool surfaces.

@tony
Copy link
Copy Markdown
Owner Author

tony commented Jun 6, 2026

Round-2 follow-up: the second review pass (both PRs) surfaced 12 findings; none crossed the posting threshold, but the legitimate ones are fixed here, one commit per issue.

Finding Commit PR
Result cap applied before per-session dedup undercounted unique results fix[db]: Count result caps in unique cached records #47
Cache candidates covered title+text only (live matches model/role/path) and casefold-expanding text escaped trigram matching fix[db]: Index the casefolded live match surface — one design fix for both: the cache now indexes the same casefolded haystack the live matcher uses #47
Schema rebuild dropped only base tables, leaving stale artifact rows fix[db]: Drop artifact tables in schema rebuilds (+ sqlite_master drift test) #48
CLI read actions (insights list/explain, suggestions show/render/list) opened read-write fix[insights]: Open CLI read actions read-only #48
Suggestion listings unbounded vs the insights pattern feat[suggestions]: Bound persisted suggestion listings (--limit, totals, truncation, MCP fields) #48
Function-local cheap stdlib import fix[insights]: Import json at module level #48
Missing doctests on pure helpers test[db]: Add doctests to remaining pure helpers #48

Declined with reasons: SCHEMA_VERSION bump for the new tables (nothing has shipped; the schema is born final), the explain ok/error overlap (the offending row state has no writer), the _pytest alias (load-bearing under the TYPE_CHECKING-only import), Parameters docstring sections (repo convention uses short prose for internal helpers), and the private exemption-constant import (single source of truth, behavior-tested).

tony added 2 commits June 6, 2026 19:15
why: The DB cache work needs a reviewable architecture record before
the implementation lands. ADR 0005 captures the SQLite-first decision,
the derived-state contract, and the prior systems that motivate the
schema shape.

what:
- Add ADR 0005 for the persistent SQLite-first agentic DB index.
- Register the new ADR in the development architecture decision index.
- Align ADR 0004 terminology with the planner/aggregation vocabulary.
why: ADR 0005 calls for a SQLite-first local cache so broad queries
stop re-parsing every upstream store. A records/FTS index plus a
cache-aware execution path gives search-shaped commands a fast local
answer while keeping explicit cold live scans available.

what:
- Add SQLite/WAL DbStore with sources, source_state, records, and FTS5
  tables plus stable source/record ids and freshness fingerprints.
- Add DbRuntime sync/status/search and a cache-aware SearchRuntime with
  --cache auto|require|off and --no-cache on grep and search.
- Add agentgrep db sync|status|explain CLI commands with JSON and
  NDJSON output and clean CLI errors for unsupported cached queries.
- Add the read-only db_status MCP tool and document the cache controls.
- Cover sync, FTS search, cache modes, CLI contracts, and MCP payloads
  with regression tests.
tony added 9 commits June 6, 2026 20:37
why: The db command group should follow the existing CLI documentation
contract: one argparse-backed page per command path, with development
internals under /dev/ and the MCP tool surface documented alongside the
other read-only tools.

what:
- Add argparse-backed CLI pages for db sync|status|explain plus the
  search and ui commands, with grid cards and toctree entries.
- Add the DB index development page and register the db_status MCP tool
  across the MCP docs, reference pages, and docs-build shim.
- Keep command-group no-action invocations as help directories and add
  rendered-doc regression coverage for the db pages.
why: DB sync can take long enough that a silent command looks stalled,
especially when indexing large local histories. Giving it grep/search-style
stderr progress keeps structured stdout machine-readable while letting
interactive users stop before the next source and keep partial sync results.

what:
- Add db sync progress flags and typed color/progress argument plumbing.
- Add a headless DB sync progress protocol plus cooperative source-boundary
  early exit.
- Render DB sync progress on stderr with matching color semantics and
  Enter-to-exit behavior.
- Cover parser, runtime, JSON stdout, TTY hint, and docs behavior.
why: Full DB sync repeated source replacement work even when nothing
changed on disk. Consulting source_state freshness before opening
record iterators keeps the cache refresh path responsive on large
histories while preserving an explicit full-rebuild path.

what:
- Skip unchanged sources during sync using source_state fingerprint
  checks, and track skipped sources in sync results and progress.
- Add --force to resync unchanged sources for full rebuilds.
- Add a records source-id index so per-source replacement scales.
- Cover skip behavior, force refresh, CLI plumbing, and docs for the
  new sync semantics.
why: The default text output for db status and sync looked like
malformed JSON. Semantic terminal summaries keep the CLI readable for
humans while preserving explicit JSON and NDJSON for scripts.

what:
- Render DB status and sync structured payloads as semantic human
  summaries in text mode, with counts colored by meaning.
- Keep JSON and NDJSON machine modes explicit and unchanged.
- Cover the text contracts with case-driven regression tests.
why: External-content FTS5 delete commands must receive the originally
indexed column values. Passing placeholders left the old tokens mapped
to rowids SQLite later reuses, so the raw index kept answering for
text that no longer existed and accumulated stale entries on every
resync. The result-level post-filter masked it from search output but
not from the index itself.

what:
- Select title and text alongside rowid in _remove_source_records and
  pass them to the FTS5 'delete' command.
- Add a resync regression test asserting the raw FTS index drops old
  tokens and keeps exactly the fresh ones.
why: WAL-mode SQLite stores commit into the -wal sidecar while the main
database file's size and mtime stay unchanged until a checkpoint, so
the freshness check saw WAL-only writes as "current" and db sync
skipped re-reading them, serving stale records. The engine's source
scan cache already fingerprints the sidecar for the same reason.

what:
- Fold the -wal sidecar's size and mtime into _source_fingerprint for
  sqlite sources, treating a missing sidecar as a stable marker.
- Cover WAL-write resync, unchanged-skip, and non-sqlite indifference
  with named freshness cases.
why: The freshness fingerprint stats only the primary source file.
Claude history expands paste-cache sibling files into record text, so
a changed sibling left stale expansions in the index while the skip
optimization saw the source as current. The engine's source-scan cache
already exempts the same adapters for the same reason.

what:
- Treat scan-cache-exempt adapters as never current in
  DbStore.source_is_current, reusing the engine's exemption set as the
  single source of truth.
- Cover the always-resync behavior for claude.history_jsonl.v1.
why: Cache-aware search consults default_db_path() when no explicit db
path is given, so suite runs on a machine with a real agentgrep cache
were reading it during tests — and any future schema rebuild would
erase it. Tests must never depend on or mutate developer state.

what:
- Add an autouse fixture pointing AGENTGREP_DB at a per-test tmp path.
why: The DB index schema will keep evolving, and a derived cache should
not accumulate incremental migration code that preserves rows the next
sync regenerates anyway. Recording the rebuild-on-mismatch decision
keeps future schema changes one-line version bumps.

what:
- Add ADR 0008: a single meta schema_version key, with any mismatch
  dropping and recreating the schema for the next sync to repopulate.
- Register the ADR in the development architecture decision index.
tony added 14 commits June 6, 2026 20:37
why: The live engine matches case-insensitive substrings, but the FTS
index matched whole tokens only, so cached answers silently dropped
records like "scruffy" for the term "ruff" — and --cache require turned
that into final zero-result answers with no fallback. Trigram
tokenization makes FTS candidates a provable superset of substring
matches, and the existing live-predicate post-filter keeps precision.

what:
- Tokenize record_text_fts with FTS5 trigram so MATCH candidates cover
  every substring occurrence of three-plus-character ASCII terms.
- Route shorter or non-ASCII terms through an exact table scan with an
  ASCII substring prefilter, post-filtered by the live predicate.
- Rebuild the cache on meta schema-version mismatch per ADR 0008.
- Cover prefix, mid-token, mixed, case, short-term, non-ASCII, and
  zero-match parity plus the version-mismatch rebuild with named cases.
why: The search event stream guarantees emitted records passed the
per-session dedup decision, and the live path enforces that inside the
execution driver. Cached records bypassed the driver, so duplicates
persisted under different sources leaked through cache-served answers.

what:
- Dedup cached records by record_dedupe_key in _db_search_result, the
  single choke point for both eager and event-stream consumers, when
  the query requests dedup.
- Cover dedupe-on and dedupe-off cached behavior with named cases.
why: Every db_status MCP call and db CLI invocation opened a SQLite
connection it never closed. The long-lived MCP server accumulated file
descriptors per call, and unclosed WAL-mode connections can suppress
checkpointing.

what:
- Add DbRuntime.close and context-manager support on DbRuntime and
  DbStore.
- Close the per-call runtime in the db_status tool and run_db_command
  on every exit path; the SearchRuntime-owned db stays open by design.
- Assert both surfaces leave their connections closed.
why: The db_status MCP tool and db status CLI promised read-only
behavior but opened the cache through the migration path, which wrote
the schema-version row and WAL pragmas on every call. Status reads now
cannot write, work on read-only filesystems, and no longer create a
missing cache file as a side effect.

what:
- Add DbStore.open_readonly and DbRuntime.open_readonly using the
  read-only SQLite URI mode, skipping migration and pragmas.
- Route the db_status tool and the status/explain CLI branch through
  read-only opens, reporting zero counts for missing caches and clean
  errors for foreign files.
- Assert byte-for-byte write-free status reads, chmod 0444 operation,
  no-create on missing paths, and traceback-free foreign-file failures.
why: db explain returned the same payload as db status while its docs
promised planner diagnostics for debugging cache decisions. Explain
now answers the questions the page describes: how the last sync went
and which query forms the cache can serve.

what:
- Add a DbExplain payload with sync-state breakdown (ok and error
  counts, most recent sync time) and the answerable-query summary.
- Route db explain through the read-only open and render a distinct
  human summary alongside the JSON and NDJSON modes.
- Align the explain doc page prose and cover the JSON and text
  contracts.
why: The docstring claimed the test exercises the db path, but it
exercises collect_search_records — the record-collection path. The
"collection" in the original wording referred to record collection,
not a command name.

what:
- Name collect_search_records directly in the docstring so the wording
  cannot be confused with a command surface again.
why: The doctest guidance scopes runnable examples to pure helpers
with no external state. The db normalization, hashing, FTS quoting,
and count-formatting helpers qualify and had none.

what:
- Add Examples sections to normalize_record_text, token_set,
  text_hash, and _quote_fts_term.
- Add Examples sections to the four db count formatters.
why: The cache applied the result limit before per-session dedup, so
cross-source duplicates inside the first N rows returned fewer than N
unique records. The live driver dedups during collection, where a cap
always means N unique records.

what:
- Dedup inside DbStore.search_records after sorting and before the
  limit slice, owning the event-stream uniqueness invariant in one
  place.
- Drop the now-redundant dedup pass in _db_search_result.
- Cover limit-with-duplicates and no-dedup-with-limit with named cases,
  and pin the duplicate-storage test to an undeduped query.
why: The cache generated candidates from raw title and text while the
live engine matches a five-field haystack with Python case folding, so
records matching a term only through model, role, or path — or through
casefold expansions like Strasse for Straße — never reached the
post-filter, and --cache require silently under-returned.

what:
- Store a casefolded haystack column built from the same
  build_record_match_surface helper the live matcher uses, and point
  the trigram FTS index and the scan prefilter at it.
- Casefold query terms before MATCH so both sides fold in Python; the
  trigram index never re-folds, letting non-ASCII terms use FTS and
  removing the ASCII guard from _fts_indexable.
- Keep the live-predicate post-filter authoritative so text-surface
  queries narrow correctly from haystack candidates.
- Cover model, role, path, casefold-expansion, and text-surface parity
  with named cases. Pre-release dev caches need one resync.
why: Benchmark harnesses, CI jobs, and MCP server configuration blocks
cannot pass per-command flags, so cache behavior needs an environment
lever. A tri-state mirror of --cache covers both cold runs (off) and
honest warm runs (require, which fails loudly instead of silently
measuring live scans). A cross-project audit (issue #49) found no
ecosystem env-var convention to borrow, so the name follows the
existing AGENTGREP_* prefix.

what:
- Resolve cache mode as flag > AGENTGREP_CACHE > auto via pure,
  doctested helpers; invalid values fail at parse time with a clean
  message.
- Honor the variable in the MCP server runtime, attaching the cache
  read-only when it exists; the server never writes the cache file.
- Document the variable in the configuration guide and cover
  resolution precedence, the invalid-value error, and the server
  runtime with named cases.
why: Cache-served searches opened the DB through the migration path,
writing schema metadata on every cached query, and --cache require
with a missing cache silently created an empty database and returned
zero results. Reads must not mutate the cache, and require should fail
loudly so warm benchmark runs cannot measure an accidental live scan.

what:
- Open the CLI search runtime via DbRuntime.open_readonly.
- Error cleanly on --cache require when the cache is missing or not an
  agentgrep database, without creating files; auto keeps falling back
  to the live scanner.
- Assert byte-identical cache files across cached searches and the
  clean require/auto missing-cache behaviors.
why: Cache behavior needs transparency to be trustworthy: profilers
and benchmarks must be able to see whether a query was served from the
DB cache, fell back to the live scanner, and why. One aggregate sample
per consulted query keeps the telemetry n+1-safe — never a span per
record.

what:
- Emit a search.cache.decision sample from the cache choke point with
  the cache mode, handled flag, served record count, and fallback
  reason (no-db, unsupported, or empty) — privacy-safe scalars only.
- Document the span in the profiling vocabulary (AGENTS.md and the
  benchmark dev page).
- Cover served, empty-fallback, unsupported-fallback, and off paths
  with named cases.
why: Cache-aware search makes timings ambiguous unless each run states
its cache posture. Cold entries bypass the cache outright; warm entries
serve from a pre-synced bench-scoped cache under require mode, so a
cold cache fails the bench instead of silently timing live scans.

what:
- Add paired *-cache-cold-* and *-cache-warm-* bench entries; warm
  entries declare a setup_command that syncs the bench-scoped cache
  once before timing.
- Support the optional setup_command key in the harness, failing the
  row cleanly when setup exits non-zero.
- Record cache_mode on every measurement row (parsed from the env
  prefix, which stays visible in the sanitized command string) and in
  profiler artifacts.
- Document cold and warm invocations in the benchmark guide and cover
  extraction, pairing, and metadata with named cases.
why: Cache levers grew across the flag surface, the environment, and
the profiling spans; the architecture record should state the policy
they implement so future cache layers inherit it deliberately.

what:
- Add a "Cache control and transparency" section to ADR 0005: every
  cache layer must expose a bypass lever and a hit-or-miss signal,
  with the AGENTGREP_DB / --cache + AGENTGREP_CACHE / decision-span
  levers and their precedence.
@tony
Copy link
Copy Markdown
Owner Author

tony commented Jun 7, 2026

Code review

Found 1 issue:

  1. profile_engine.py records cache_mode in profiler artifacts but the profiled path never consults the cache. payload["cache_mode"] is read from AGENTGREP_CACHE, yet profile_search_query builds its query pipeline without a SearchRuntime/DbRuntime, so _db_search_result early-returns and every component measures a live scan. Running AGENTGREP_CACHE=require produces live-scan timings labeled cache_mode: "require" (and off changes nothing), which misleads benchmark analysis and conflicts with Add explicit cache bypass controls for profiling and benchmarks #49's acceptance criterion that AGENTGREP_CACHE "selects the same modes for CLI, MCP server, benchmark, and profiler subprocesses". Either honor the mode in the profiled path (build a DbRuntime when the env var asks for it) or record cache_mode: "off"/omit the field for engine profiles so the artifact reflects what was measured.

payload["artifact_kind"] = PROFILE_RUN_ARTIFACT_KIND
payload["cache_mode"] = os.environ.get("AGENTGREP_CACHE", "auto")
payload["agent_count"] = len(agents)

Profiled path that never reaches the cache:

def profile_search_query(
home: pathlib.Path,
query: agentgrep.SearchQuery,
*,
backends: agentgrep.BackendSelection | None = None,
control: agentgrep.SearchControl | None = None,
runtime: SearchRuntime | None = None,

🤖 Generated with Claude Code

- If this code review was useful, please react with 👍. Otherwise, react with 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant