Persistent SQLite DB index with cache-aware search#47
Conversation
Code reviewFound 3 issues:
Lines 231 to 243 in 1042dae
agentgrep/src/agentgrep/mcp/tools/db_tools.py Lines 38 to 40 in 1042dae
agentgrep/tests/test_engine_profiling.py Lines 693 to 695 in 1042dae 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
|
Follow-up: all review findings addressed, including the ones below the posting threshold, one commit per issue.
Found along the way: the test suite could read the developer's real cache via |
|
Round-2 follow-up: the second review pass (both PRs) surfaced 12 findings; none crossed the posting threshold, but the legitimate ones are fixed here, one commit per issue.
Declined with reasons: SCHEMA_VERSION bump for the new tables (nothing has shipped; the schema is born final), the explain ok/error overlap (the offending row state has no writer), the |
why: The DB cache work needs a reviewable architecture record before the implementation lands. ADR 0005 captures the SQLite-first decision, the derived-state contract, and the prior systems that motivate the schema shape. what: - Add ADR 0005 for the persistent SQLite-first agentic DB index. - Register the new ADR in the development architecture decision index. - Align ADR 0004 terminology with the planner/aggregation vocabulary.
why: ADR 0005 calls for a SQLite-first local cache so broad queries stop re-parsing every upstream store. A records/FTS index plus a cache-aware execution path gives search-shaped commands a fast local answer while keeping explicit cold live scans available. what: - Add SQLite/WAL DbStore with sources, source_state, records, and FTS5 tables plus stable source/record ids and freshness fingerprints. - Add DbRuntime sync/status/search and a cache-aware SearchRuntime with --cache auto|require|off and --no-cache on grep and search. - Add agentgrep db sync|status|explain CLI commands with JSON and NDJSON output and clean CLI errors for unsupported cached queries. - Add the read-only db_status MCP tool and document the cache controls. - Cover sync, FTS search, cache modes, CLI contracts, and MCP payloads with regression tests.
why: The db command group should follow the existing CLI documentation contract: one argparse-backed page per command path, with development internals under /dev/ and the MCP tool surface documented alongside the other read-only tools. what: - Add argparse-backed CLI pages for db sync|status|explain plus the search and ui commands, with grid cards and toctree entries. - Add the DB index development page and register the db_status MCP tool across the MCP docs, reference pages, and docs-build shim. - Keep command-group no-action invocations as help directories and add rendered-doc regression coverage for the db pages.
why: DB sync can take long enough that a silent command looks stalled, especially when indexing large local histories. Giving it grep/search-style stderr progress keeps structured stdout machine-readable while letting interactive users stop before the next source and keep partial sync results. what: - Add db sync progress flags and typed color/progress argument plumbing. - Add a headless DB sync progress protocol plus cooperative source-boundary early exit. - Render DB sync progress on stderr with matching color semantics and Enter-to-exit behavior. - Cover parser, runtime, JSON stdout, TTY hint, and docs behavior.
why: Full DB sync repeated source replacement work even when nothing changed on disk. Consulting source_state freshness before opening record iterators keeps the cache refresh path responsive on large histories while preserving an explicit full-rebuild path. what: - Skip unchanged sources during sync using source_state fingerprint checks, and track skipped sources in sync results and progress. - Add --force to resync unchanged sources for full rebuilds. - Add a records source-id index so per-source replacement scales. - Cover skip behavior, force refresh, CLI plumbing, and docs for the new sync semantics.
why: The default text output for db status and sync looked like malformed JSON. Semantic terminal summaries keep the CLI readable for humans while preserving explicit JSON and NDJSON for scripts. what: - Render DB status and sync structured payloads as semantic human summaries in text mode, with counts colored by meaning. - Keep JSON and NDJSON machine modes explicit and unchanged. - Cover the text contracts with case-driven regression tests.
why: External-content FTS5 delete commands must receive the originally indexed column values. Passing placeholders left the old tokens mapped to rowids SQLite later reuses, so the raw index kept answering for text that no longer existed and accumulated stale entries on every resync. The result-level post-filter masked it from search output but not from the index itself. what: - Select title and text alongside rowid in _remove_source_records and pass them to the FTS5 'delete' command. - Add a resync regression test asserting the raw FTS index drops old tokens and keeps exactly the fresh ones.
why: WAL-mode SQLite stores commit into the -wal sidecar while the main database file's size and mtime stay unchanged until a checkpoint, so the freshness check saw WAL-only writes as "current" and db sync skipped re-reading them, serving stale records. The engine's source scan cache already fingerprints the sidecar for the same reason. what: - Fold the -wal sidecar's size and mtime into _source_fingerprint for sqlite sources, treating a missing sidecar as a stable marker. - Cover WAL-write resync, unchanged-skip, and non-sqlite indifference with named freshness cases.
why: The freshness fingerprint stats only the primary source file. Claude history expands paste-cache sibling files into record text, so a changed sibling left stale expansions in the index while the skip optimization saw the source as current. The engine's source-scan cache already exempts the same adapters for the same reason. what: - Treat scan-cache-exempt adapters as never current in DbStore.source_is_current, reusing the engine's exemption set as the single source of truth. - Cover the always-resync behavior for claude.history_jsonl.v1.
why: Cache-aware search consults default_db_path() when no explicit db path is given, so suite runs on a machine with a real agentgrep cache were reading it during tests — and any future schema rebuild would erase it. Tests must never depend on or mutate developer state. what: - Add an autouse fixture pointing AGENTGREP_DB at a per-test tmp path.
why: The DB index schema will keep evolving, and a derived cache should not accumulate incremental migration code that preserves rows the next sync regenerates anyway. Recording the rebuild-on-mismatch decision keeps future schema changes one-line version bumps. what: - Add ADR 0008: a single meta schema_version key, with any mismatch dropping and recreating the schema for the next sync to repopulate. - Register the ADR in the development architecture decision index.
why: The live engine matches case-insensitive substrings, but the FTS index matched whole tokens only, so cached answers silently dropped records like "scruffy" for the term "ruff" — and --cache require turned that into final zero-result answers with no fallback. Trigram tokenization makes FTS candidates a provable superset of substring matches, and the existing live-predicate post-filter keeps precision. what: - Tokenize record_text_fts with FTS5 trigram so MATCH candidates cover every substring occurrence of three-plus-character ASCII terms. - Route shorter or non-ASCII terms through an exact table scan with an ASCII substring prefilter, post-filtered by the live predicate. - Rebuild the cache on meta schema-version mismatch per ADR 0008. - Cover prefix, mid-token, mixed, case, short-term, non-ASCII, and zero-match parity plus the version-mismatch rebuild with named cases.
why: The search event stream guarantees emitted records passed the per-session dedup decision, and the live path enforces that inside the execution driver. Cached records bypassed the driver, so duplicates persisted under different sources leaked through cache-served answers. what: - Dedup cached records by record_dedupe_key in _db_search_result, the single choke point for both eager and event-stream consumers, when the query requests dedup. - Cover dedupe-on and dedupe-off cached behavior with named cases.
why: Every db_status MCP call and db CLI invocation opened a SQLite connection it never closed. The long-lived MCP server accumulated file descriptors per call, and unclosed WAL-mode connections can suppress checkpointing. what: - Add DbRuntime.close and context-manager support on DbRuntime and DbStore. - Close the per-call runtime in the db_status tool and run_db_command on every exit path; the SearchRuntime-owned db stays open by design. - Assert both surfaces leave their connections closed.
why: The db_status MCP tool and db status CLI promised read-only behavior but opened the cache through the migration path, which wrote the schema-version row and WAL pragmas on every call. Status reads now cannot write, work on read-only filesystems, and no longer create a missing cache file as a side effect. what: - Add DbStore.open_readonly and DbRuntime.open_readonly using the read-only SQLite URI mode, skipping migration and pragmas. - Route the db_status tool and the status/explain CLI branch through read-only opens, reporting zero counts for missing caches and clean errors for foreign files. - Assert byte-for-byte write-free status reads, chmod 0444 operation, no-create on missing paths, and traceback-free foreign-file failures.
why: db explain returned the same payload as db status while its docs promised planner diagnostics for debugging cache decisions. Explain now answers the questions the page describes: how the last sync went and which query forms the cache can serve. what: - Add a DbExplain payload with sync-state breakdown (ok and error counts, most recent sync time) and the answerable-query summary. - Route db explain through the read-only open and render a distinct human summary alongside the JSON and NDJSON modes. - Align the explain doc page prose and cover the JSON and text contracts.
why: The docstring claimed the test exercises the db path, but it exercises collect_search_records — the record-collection path. The "collection" in the original wording referred to record collection, not a command name. what: - Name collect_search_records directly in the docstring so the wording cannot be confused with a command surface again.
why: The doctest guidance scopes runnable examples to pure helpers with no external state. The db normalization, hashing, FTS quoting, and count-formatting helpers qualify and had none. what: - Add Examples sections to normalize_record_text, token_set, text_hash, and _quote_fts_term. - Add Examples sections to the four db count formatters.
why: The cache applied the result limit before per-session dedup, so cross-source duplicates inside the first N rows returned fewer than N unique records. The live driver dedups during collection, where a cap always means N unique records. what: - Dedup inside DbStore.search_records after sorting and before the limit slice, owning the event-stream uniqueness invariant in one place. - Drop the now-redundant dedup pass in _db_search_result. - Cover limit-with-duplicates and no-dedup-with-limit with named cases, and pin the duplicate-storage test to an undeduped query.
why: The cache generated candidates from raw title and text while the live engine matches a five-field haystack with Python case folding, so records matching a term only through model, role, or path — or through casefold expansions like Strasse for Straße — never reached the post-filter, and --cache require silently under-returned. what: - Store a casefolded haystack column built from the same build_record_match_surface helper the live matcher uses, and point the trigram FTS index and the scan prefilter at it. - Casefold query terms before MATCH so both sides fold in Python; the trigram index never re-folds, letting non-ASCII terms use FTS and removing the ASCII guard from _fts_indexable. - Keep the live-predicate post-filter authoritative so text-surface queries narrow correctly from haystack candidates. - Cover model, role, path, casefold-expansion, and text-surface parity with named cases. Pre-release dev caches need one resync.
why: Benchmark harnesses, CI jobs, and MCP server configuration blocks cannot pass per-command flags, so cache behavior needs an environment lever. A tri-state mirror of --cache covers both cold runs (off) and honest warm runs (require, which fails loudly instead of silently measuring live scans). A cross-project audit (issue #49) found no ecosystem env-var convention to borrow, so the name follows the existing AGENTGREP_* prefix. what: - Resolve cache mode as flag > AGENTGREP_CACHE > auto via pure, doctested helpers; invalid values fail at parse time with a clean message. - Honor the variable in the MCP server runtime, attaching the cache read-only when it exists; the server never writes the cache file. - Document the variable in the configuration guide and cover resolution precedence, the invalid-value error, and the server runtime with named cases.
why: Cache-served searches opened the DB through the migration path, writing schema metadata on every cached query, and --cache require with a missing cache silently created an empty database and returned zero results. Reads must not mutate the cache, and require should fail loudly so warm benchmark runs cannot measure an accidental live scan. what: - Open the CLI search runtime via DbRuntime.open_readonly. - Error cleanly on --cache require when the cache is missing or not an agentgrep database, without creating files; auto keeps falling back to the live scanner. - Assert byte-identical cache files across cached searches and the clean require/auto missing-cache behaviors.
why: Cache behavior needs transparency to be trustworthy: profilers and benchmarks must be able to see whether a query was served from the DB cache, fell back to the live scanner, and why. One aggregate sample per consulted query keeps the telemetry n+1-safe — never a span per record. what: - Emit a search.cache.decision sample from the cache choke point with the cache mode, handled flag, served record count, and fallback reason (no-db, unsupported, or empty) — privacy-safe scalars only. - Document the span in the profiling vocabulary (AGENTS.md and the benchmark dev page). - Cover served, empty-fallback, unsupported-fallback, and off paths with named cases.
why: Cache-aware search makes timings ambiguous unless each run states its cache posture. Cold entries bypass the cache outright; warm entries serve from a pre-synced bench-scoped cache under require mode, so a cold cache fails the bench instead of silently timing live scans. what: - Add paired *-cache-cold-* and *-cache-warm-* bench entries; warm entries declare a setup_command that syncs the bench-scoped cache once before timing. - Support the optional setup_command key in the harness, failing the row cleanly when setup exits non-zero. - Record cache_mode on every measurement row (parsed from the env prefix, which stays visible in the sanitized command string) and in profiler artifacts. - Document cold and warm invocations in the benchmark guide and cover extraction, pairing, and metadata with named cases.
why: Cache levers grew across the flag surface, the environment, and the profiling spans; the architecture record should state the policy they implement so future cache layers inherit it deliberately. what: - Add a "Cache control and transparency" section to ADR 0005: every cache layer must expose a bypass lever and a hit-or-miss signal, with the AGENTGREP_DB / --cache + AGENTGREP_CACHE / decision-span levers and their precedence.
Code reviewFound 1 issue:
agentgrep/scripts/profile_engine.py Lines 297 to 299 in c4b8fdc Profiled path that never reaches the cache: agentgrep/src/agentgrep/_engine/profiling.py Lines 203 to 210 in c4b8fdc 🤖 Generated with Claude Code - If this code review was useful, please react with 👍. Otherwise, react with 👎. |
Summary
agentgrep db sync|status|explainmaterializes discovered agent history into a local WAL-mode database with an FTS5 text index, a source ledger with freshness state, and stable source/record ids. The DB is derived state — Codex, Claude, Cursor, Gemini, Grok, Pi, and OpenCode stores remain the source of truth.grepandsearchgain--cache auto|require|off(and--no-cache). The defaultautoserves a query from the DB index only when it can answer it and falls back to the live scanner otherwise;requireinsists on the DB path, and unsupported cache-required queries fail as clean CLI errors instead of tracebacks.source_statefingerprint checks skip unchanged sources unless--forceis passed, and a source-id index keeps per-source replacement fast.--json/--ndjsonmachine modes.db_statusMCP tool.Changes by area
DB index
src/agentgrep/db.py:DbStore(SQLite/WAL/FTS5 schema: source ledger plussource_state, normalized records),DbRuntimesync/status/search, sync progress protocol, freshness fingerprints, and a default path under the XDG cache directory overridable viaAGENTGREP_DBor--db.Search integration
src/agentgrep/__init__.py,src/agentgrep/_engine/: cache-mode plumbing throughSearchRuntimeso search-shaped commands consult the DB index under--cache auto|requireand preserve the live-scan path under--no-cache.CLI
src/agentgrep/cli/parser.py,src/agentgrep/cli/render.py: thedbcommand group with progress flags, semantic human summaries in text mode, JSON/NDJSON machine modes, and clean CLI errors for unsupported cached queries.MCP
src/agentgrep/mcp/tools/db_tools.py,src/agentgrep/mcp/models.py: the read-onlydb_statustool with its pydantic response model and capability registration.Docs
ADR 0005; per-command CLI pages for
db, plus argparse-backed pages for the existingsearchanduicommands; a DB-index development page; DB cache controls in the configuration guide.Design decisions
--cache autois conservative: the index only answers queries it can satisfy; anything else falls through to the live scanner, and--cache requireturns unsupported queries into explicit CLI errors.source_staterecords the synced mtime and content fingerprint;--forceis the explicit full-rebuild path.Test plan
tests/test_db_index.py—DbStoreschema and sync behavior: freshness skips,--forceresync, FTS queries, deterministic ids, early exittests/test_cache_cli.py— CLI contracts fordband cache modes: JSON and NDJSON stdout, human summaries, progress output, error pathstests/test_agentgrep_mcp.py— read-onlydb_statuspayloads and capability registrationruff check/ruff format,ty check,uv run pytest(incl. doctests),just build-docsStacked PR
The insights/suggestions surface that previously shared this branch now lives in its own PR stacked on this one.