Summary
Before parsing file-backed sources, agentgrep prunes them with a ripgrep
prefilter (prefilter_sources_by_root): one rg/ag pass per
discovered search root keeps only the sources whose root matched the
query terms. Three opt-in modes defeat or strain that prefilter and fall
back to a (near) full-corpus parse that materialises every matching
record in memory (~1.3 GB RSS observed on a full-scan run):
- OR / any-term: unions the per-term match sets, so the surviving
source set balloons toward the whole corpus.
- regex terms: broad patterns match a large fraction of roots.
- files-without-match (-L): by definition must consider every
planned source, so nothing can be pruned.
Because these can't yet be served within bounded memory, their public
toggles are being removed: the CLI search --regex / --any flags, the
regex / any_term parameters on the MCP search and validate_query
tools, and grep -L. grep and find stay regex-by-default — that
path runs through ripgrep itself and is unaffected. This issue tracks
reintroducing the three modes without the memory cliff.
Where it lives (v0.1.0a8)
Proposed approach
- Regex: derive literal atoms from each pattern to seed the
prefilter; fall back to a bounded scan only when no atom can be
extracted.
- OR / any-term: prefilter per term and stream the union, capping
the number of sources parsed concurrently instead of materialising all
records at once.
- -L (files-without-match): compute the no-match complement
against the enumerated / prefiltered source set rather than parsing
every source's contents.
- Memory: stream records through dedupe with a bounded working set
(and/or an explicit --max-sources guard) instead of building one big
in-memory dict.
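
To make the OR, -L and Memory bullets concrete, here is a minimal Python sketch. It is not agentgrep's code: union_sources, files_without_match, stream_records, the parse callback and the dedupe_window parameter are all hypothetical names, and max_sources only mirrors the proposed --max-sources guard. It assumes a source is identified by a path string and a record is hashable. The bounded window trades exactness for memory: a duplicate that recurs after more than dedupe_window distinct records is emitted again.

```python
from collections import OrderedDict
from typing import Callable, Hashable, Iterable, Iterator, Optional


def union_sources(per_term_matches: Iterable[Iterable[str]]) -> Iterator[str]:
    """Yield each matched source path once, in first-seen order across terms."""
    seen: set[str] = set()  # holds paths only, never parsed records
    for matches in per_term_matches:
        for path in matches:
            if path not in seen:
                seen.add(path)
                yield path


def files_without_match(planned: Iterable[str], matched: Iterable[str]) -> list[str]:
    """Return planned sources absent from the prefilter's match set.

    This is a set complement over paths; no source contents are parsed.
    """
    matched_set = set(matched)
    return [path for path in planned if path not in matched_set]


def stream_records(
    sources: Iterable[str],
    parse: Callable[[str], Iterable[Hashable]],  # hypothetical per-source parser
    max_sources: Optional[int] = None,           # hypothetical guard
    dedupe_window: int = 10_000,                 # hypothetical working-set bound
) -> Iterator[Hashable]:
    """Parse one source at a time and dedupe against a bounded recent window."""
    recent: "OrderedDict[Hashable, None]" = OrderedDict()
    for index, source in enumerate(sources):
        if max_sources is not None and index >= max_sources:
            break  # stop before parsing more than max_sources sources
        for record in parse(source):
            if record in recent:
                recent.move_to_end(record)  # refresh recency, skip the duplicate
                continue
            recent[record] = None
            if len(recent) > dedupe_window:
                recent.popitem(last=False)  # evict the oldest remembered record
            yield record
```

Because stream_records is a generator, a caller that writes each record out as it arrives never holds more than the dedupe window plus the source currently being parsed. Whether approximate dedupe is acceptable for agentgrep's results is an open design question for this issue.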
Acceptance criteria
- The three modes return correct results with peak RSS bounded
regardless of corpus size.
- Re-expose the toggles (CLI + MCP) and restore the docs and the MCP
server-instruction lines only once the bounded-memory path lands.
- Regression coverage over a large synthetic corpus asserting a memory
ceiling.
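
A regression check along the lines of the last criterion could look roughly like the sketch below. It is an assumption-laden illustration, not an existing agentgrep test: synthetic_records, peak_bytes and the two consumers are hypothetical, and tracemalloc measures Python-level allocations rather than process RSS, so a real test may prefer an RSS-based measurement.

```python
import tracemalloc
from typing import Callable, Iterator


def synthetic_records(n_sources: int, records_per_source: int) -> Iterator[str]:
    """Lazily generate a synthetic corpus without holding it in memory."""
    for s in range(n_sources):
        for r in range(records_per_source):
            yield f"source-{s}/record-{r}: " + "x" * 200


def peak_bytes(consume: Callable[[Iterator[str]], int],
               records: Iterator[str]) -> tuple[int, int]:
    """Run consume(records) and return (result, peak traced bytes)."""
    tracemalloc.start()
    try:
        result = consume(records)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak


def count_streaming(records: Iterator[str]) -> int:
    """Stand-in for a bounded-memory path: touch each record, keep none."""
    return sum(1 for _ in records)


def count_materialised(records: Iterator[str]) -> int:
    """Stand-in for the current behaviour: build everything in memory first."""
    return len(list(records))
```

A test would then assert that the streaming peak stays below a fixed ceiling while the corpus size grows, for example by comparing peak_bytes(count_streaming, synthetic_records(200, 100))[1] against a ceiling such as 1 MB. The ceiling value is illustrative only.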