Prefilter-aware regex / OR / files-without-match search (bounded memory) #32

@tony

Summary

Before parsing file-backed sources, agentgrep prunes them with a ripgrep
prefilter (prefilter_sources_by_root): one rg/ag pass per
discovered search root keeps only the sources whose root matched the
query terms. Three opt-in modes defeat or strain that prefilter and fall
back to a (near) full-corpus parse that materialises every matching
record in memory (~1.3 GB RSS observed on a full-scan run):

  • OR / any-term — unions the per-term match sets, so the surviving
    source set balloons toward the whole corpus.
  • regex terms — broad patterns match a large fraction of roots.
  • files-without-match (-L) — by definition must consider every
    planned source, so nothing can be pruned.

Because these can't yet be served within bounded memory, their public
toggles are being removed: the CLI search --regex / --any flags, the
regex / any_term parameters on the MCP search and validate_query
tools, and grep -L. grep and find stay regex-by-default — that
path runs through ripgrep itself and is unaffected. This issue tracks
reintroducing the three modes without the memory cliff.

Where it lives (v0.1.0a8)

Proposed approach

  • Regex: derive literal atoms from each pattern to seed the
    prefilter; fall back to a bounded scan only when no atom can be
    extracted.
  • OR / any-term: prefilter per term and stream the union, capping
    the number of sources parsed concurrently instead of materialising all
    records at once.
  • -L (files-without-match): compute the no-match complement
    against the enumerated / prefiltered source set rather than parsing
    every source's contents.
  • Memory: stream records through dedupe with a bounded working set
    (and/or an explicit --max-sources guard) instead of building one big
    in-memory dict.
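
A minimal sketch of two of the ideas above, for discussion only. The helper names (literal_atoms, files_without_match) are hypothetical and do not exist in agentgrep. The atom extractor assumes case-sensitive matching with no regex flags, and it gives up (returns an empty list, meaning "fall back to the bounded scan") on anything it does not understand. The -L helper assumes a prefilter hit on a source is equivalent to a match for that source, which may not hold if matching is decided on parsed records.

```python
from typing import Iterable, List


def literal_atoms(pattern: str, min_len: int = 3) -> List[str]:
    """Return literal substrings that every match of `pattern` must contain.

    Conservative by design: groups, alternation, and numeric/hex escapes
    yield [] so the caller falls back to a bounded scan.
    """
    if any(ch in pattern for ch in "|()"):
        return []
    atoms: List[str] = []
    run: List[str] = []

    def flush() -> None:
        # Keep the current literal run only if it is long enough to be useful.
        if len(run) >= min_len:
            atoms.append("".join(run))
        run.clear()

    i, n = 0, len(pattern)
    while i < n:
        ch = pattern[i]
        if ch == "\\":
            nxt = pattern[i + 1] if i + 1 < n else ""
            if not nxt:
                return []  # dangling backslash: invalid pattern
            if nxt in "dDwWsSbBAZ":
                flush()  # class or anchor escape: ends the literal run
            elif nxt.isalnum():
                return []  # \x41, \n, \1, ...: not handled in this sketch
            else:
                run.append(nxt)  # escaped punctuation is a literal
            i += 2
            continue
        if ch == "[":
            # Skip the whole character class; it contributes no literal.
            j = i + 1
            if j < n and pattern[j] == "^":
                j += 1
            if j < n and pattern[j] == "]":
                j += 1
            while j < n and pattern[j] != "]":
                j += 2 if pattern[j] == "\\" else 1
            flush()
            i = j + 1
            continue
        if ch in ".^$":
            flush()
        elif ch in "*?":
            if run:
                run.pop()  # the preceding character is optional
            flush()
        elif ch == "+":
            flush()  # preceding character is required, but may repeat
        elif ch == "{":
            if run:
                run.pop()  # be conservative: treat as a possible {0,n}
            flush()
            end = pattern.find("}", i)
            i = end + 1 if end != -1 else i + 1
            continue
        else:
            run.append(ch)
        i += 1
    flush()
    return atoms


def files_without_match(planned: Iterable[str], matched: Iterable[str]) -> List[str]:
    """-L as a set complement: planned sources minus prefilter hits."""
    matched_set = set(matched)
    return [src for src in planned if src not in matched_set]
```

For example, literal_atoms(r"session_id=\d+") yields ["session_id="], which could seed a fixed-string prefilter pass, while literal_atoms(r"(foo|bar)") yields [] and would take the bounded-scan path. The complement helper never opens a source, so its memory use is proportional to the number of planned sources rather than to their contents.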

Acceptance criteria

  • The three modes return correct results with peak RSS bounded
    regardless of corpus size.
  • Re-expose the toggles (CLI + MCP) and restore the docs and the MCP
    server-instruction lines only once the bounded-memory path lands.
  • Regression coverage over a large synthetic corpus asserting a memory
    ceiling.
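
One possible shape for the memory-ceiling regression, sketched under assumptions. The record layout, bounded_dedupe, and synthetic_records are all hypothetical stand-ins, not agentgrep code. The dedupe keeps only a fixed window of recently seen keys, so duplicates further apart than the window would slip through; whether that trade-off is acceptable is an open question. tracemalloc measures Python-level allocations, which is only a proxy for the peak RSS the criteria refer to.

```python
import tracemalloc
from collections import OrderedDict
from typing import Dict, Iterable, Iterator, Tuple

Record = Dict[str, str]


def synthetic_records(n_sources: int, per_source: int) -> Iterator[Record]:
    """Generate a synthetic corpus lazily; each record is emitted twice."""
    for s in range(n_sources):
        for r in range(per_source):
            rec = {"source": f"src-{s}", "text": f"line {r}"}
            yield rec
            yield dict(rec)  # immediate duplicate to exercise dedupe


def bounded_dedupe(records: Iterable[Record], window: int = 1024) -> Iterator[Record]:
    """Drop duplicates seen within the last `window` distinct keys."""
    seen: "OrderedDict[Tuple[str, str], None]" = OrderedDict()
    for rec in records:
        key = (rec["source"], rec["text"])
        if key in seen:
            seen.move_to_end(key)  # refresh recency, skip the duplicate
            continue
        seen[key] = None
        if len(seen) > window:
            seen.popitem(last=False)  # evict the oldest key
        yield rec


def peak_bytes_while_streaming(n_sources: int, per_source: int) -> Tuple[int, int]:
    """Stream the corpus through dedupe and report (unique count, peak bytes)."""
    tracemalloc.start()
    try:
        count = sum(1 for _ in bounded_dedupe(synthetic_records(n_sources, per_source)))
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return count, peak


CEILING_BYTES = 4 * 1024 * 1024  # arbitrary ceiling chosen for this sketch
```

A test could then run the same pipeline over a small and a ten-times-larger corpus and assert that both stay under CEILING_BYTES, which is the property "bounded regardless of corpus size" asks for. A real regression would likely measure the actual search path and process RSS instead.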
