perf: buffer-at-a-time search for literal patterns #16
Merging this PR will improve performance by ×19
| | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| ⚡ | literal_no_match | 27.6 ms | 1.3 ms | ×21 |
| ⚡ | search_pattern | 29.5 ms | 1.6 ms | ×18 |
| ⚡ | fixed_string | 29.5 ms | 1.6 ms | ×18 |
Comparing literal-fast-path (56d774f) with main (f4798cb)
Footnotes
- 17 benchmarks were skipped, so the baseline results were used instead.
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #16 +/- ##
==========================================
+ Coverage 95.28% 95.67% +0.38%
==========================================
Files 6 6
Lines 1422 1758 +336
Branches 140 188 +48
==========================================
+ Hits 1355 1682 +327
- Misses 66 75 +9
Partials 1 1
0c9bbbe to 0846294
c3840c2 to b3d70c0
Literal searches were ~50-70x slower than GNU grep because every line paid per-line costs (terminator scan, NUL scan, dispatch) even when a buffer held no match. Add a buffer-at-a-time driver that scans whole chunks with a substring searcher and only locates line boundaries around the matches it finds; a chunk with no match costs a single vectorized sweep and no per-line work.

The driver activates only for plain ASCII literal patterns (case sensitive, no metacharacters) in the simpler output modes: -c, -l, -L, -q, and plain line printing with -n/-b/filename/-m. Anything needing match positions, context, inversion, color, or special binary handling falls back to the unchanged line-at-a-time path. Output stays byte-identical to that path, including binary/invalid-UTF-8 behavior.

- line_buffer: read_chunk() yields the largest span of complete lines.
- matcher: expose per-pattern memmem searchers when every pattern is a plain literal (plain_literal()).
- searcher: eligible_for_fast_path(), fast_locate(), fast_print().

All scanning rides on the memchr crate (SIMD memchr/memrchr/memmem). Unit tests for read_chunk and plain_literal; integration tests for prefixes, -m, and multi-chunk line-number correctness.

Benchmarks (31 MB corpus) vs prior release:

  -F (no match): 232ms -> 15ms (15.9x; now faster than GNU)
  -c literal: 229ms -> 15ms (15.2x)
  plain print: 248ms -> 18ms (13.5x)

Regex and -i paths are unchanged (still the line-at-a-time engine).
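For readers following along, here is a minimal sketch of the buffer-at-a-time idea described in the commit message above. It is not the PR's code: the names `find`, `complete_lines_len`, and `matching_lines` are hypothetical, and a naive std-only substring scan stands in for the memchr crate's SIMD searchers so the example has no dependencies. It assumes `\n` terminators and a non-empty needle.

```rust
/// Stand-in for a memmem-style search: first offset of `needle` in `haystack`.
/// Naive scan, used only so this sketch needs no external crates.
/// Empty needles are out of scope for the sketch and never match.
fn find(haystack: &[u8], needle: &[u8]) -> Option<usize> {
    if needle.is_empty() {
        return None;
    }
    haystack.windows(needle.len()).position(|w| w == needle)
}

/// Length of the largest prefix of `buf` that ends on a line terminator,
/// i.e. the span of complete lines a chunk reader could hand to the search.
fn complete_lines_len(buf: &[u8]) -> usize {
    buf.iter().rposition(|&b| b == b'\n').map_or(0, |i| i + 1)
}

/// Returns (1-based line number, line bytes without '\n') for every line of
/// `chunk` that contains `needle`. Line boundaries are located only around
/// hits; a chunk without a hit costs one search and no per-line work.
fn matching_lines<'a>(chunk: &'a [u8], needle: &[u8]) -> Vec<(usize, &'a [u8])> {
    let mut out = Vec::new();
    let mut pos = 0; // where the next substring search starts
    let mut counted_to = 0; // newlines before this offset are already counted
    let mut line_no = 1; // number of the line that contains `counted_to`
    while let Some(rel) = find(&chunk[pos..], needle) {
        let hit = pos + rel;
        // Walk back to the start of the matching line and forward to its end.
        let start = chunk[..hit]
            .iter()
            .rposition(|&b| b == b'\n')
            .map_or(0, |i| i + 1);
        let end = chunk[hit..]
            .iter()
            .position(|&b| b == b'\n')
            .map_or(chunk.len(), |i| hit + i);
        // Derive the line number lazily by counting only the skipped newlines.
        line_no += chunk[counted_to..start]
            .iter()
            .filter(|&&b| b == b'\n')
            .count();
        counted_to = start;
        out.push((line_no, &chunk[start..end]));
        // Resume after this line so each matching line is reported once.
        pos = end + 1;
        if pos >= chunk.len() {
            break;
        }
    }
    out
}
```

A caller would first trim the read buffer with `complete_lines_len`, pass that span to `matching_lines`, and carry the trailing partial line over into the next read. A real implementation would also have to carry the running line count across chunks, which this sketch leaves out.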
The buffer-at-a-time fast path now serves the literal patterns that the existing -l/-L/-q and binary tests used, leaving the line-at-a-time engine's equivalents uncovered. Add bracket-class (non-literal) tests for -l/-L/-q and binary handling (notice, -a text, without-match bail, and the finalize-time notice), plus a fast-path test for a NUL that is only discovered after a line was already printed. No dead code was found: the remaining uncovered lines are writer I/O error-propagation arms and pre-existing filesystem error handlers.
b3d70c0 to 56d774f
Auto-merged before they bit-rot with all the contribs.
I apologize for not reviewing it earlier. I'm currently being swamped over in https://github.com/microsoft/coreutils.

I have to extra-apologize for this PR, because I personally do not believe that this PR is going in an ideal direction. It introduces a distinct path for fixed patterns, but such invocations are not any more common than other invocations as far as I can tell. The added code path is rather large and so is not worth the cost in my opinion. Lastly, it uses the same

The correct approach in my opinion is to always read in full chunks without

An alternative approach in the meantime is to adopt more of ripgrep's approach and always use
@lhecker no worries: revert here: #57
not sure i agree: see #16 (comment) :)