
grep: avoid multi-character case folds #54

Open
wondr-wclabs wants to merge 1 commit into uutils:main from wondr-wclabs:codex/simple-case-folding

Conversation

@wondr-wclabs
Contributor

Fixes #32.

This changes the ignore-case compile path to use Oniguruma's per-regex onig_new_deluxe API with the default case-fold flags minus INTERNAL_ONIGENC_CASE_FOLD_MULTI_CHAR.

The important distinction is that this is not only an output formatting issue. With Oniguruma's default full-fold behavior, one regex atom can consume multiple input characters: [[:alpha:]] can match the two-character input "st" as a single match, and ß can match "SS". That changes -o output, match counts, and cursor advancement. GNU grep avoids those multi-character fold expansions for grep -i, so the matcher needs to stay on simple one-to-one folds even before broader locale support exists.
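To make the one-to-one constraint concrete, here is a minimal sketch in plain Rust. It is not the code in this PR and does not call Oniguruma. `simple_fold` and `consumed_chars` are hypothetical helpers, and the fold is only an approximation built on `char::to_lowercase` (real simple folding follows Unicode's CaseFolding.txt). The point is that a single pattern character can consume at most one input character.

```rust
/// Approximate simple (one-to-one) case fold: take the lowercase mapping only
/// when it is exactly one character long, otherwise keep the character as-is.
/// Hypothetical helper; an approximation, not Unicode's official simple fold.
fn simple_fold(c: char) -> char {
    let mut lower = c.to_lowercase();
    match (lower.next(), lower.next()) {
        (Some(single), None) => single,
        _ => c,
    }
}

/// Number of input characters that one literal pattern character consumes at
/// the start of `input` under one-to-one folding: always 0 or 1.
fn consumed_chars(pattern_char: char, input: &str) -> usize {
    match input.chars().next() {
        Some(c) if simple_fold(c) == simple_fold(pattern_char) => 1,
        _ => 0,
    }
}
```

Under this model `consumed_chars('ß', "SS")` is 0, because ß only relates to "SS" through a multi-character mapping (`'ß'.to_uppercase()` yields two characters), while `consumed_chars('s', "St")` is 1.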

I avoided ONIG_OPTION_IGNORECASE_IS_ASCII because that would be a broader behavioral change: it would discard non-ASCII simple folds too, while the incompatibility here is specifically the multi-character fold expansion. I also avoided onig_set_default_case_fold_flag because it is process-global; using compile-time OnigCompileInfo keeps the change local to each compiled regex and avoids races between different pattern compilations.
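The flag computation itself is a single bit clear. The sketch below shows the idea with stand-in constants. The numeric values are assumptions for illustration only; the real ones come from oniguruma.h through the bindings, and `simple_case_fold_flags` is a hypothetical name rather than a function from this PR.

```rust
// Stand-in for INTERNAL_ONIGENC_CASE_FOLD_MULTI_CHAR. The value is assumed
// for illustration; read the real constant from the Oniguruma bindings.
const CASE_FOLD_MULTI_CHAR: u32 = 1 << 30;

// Stand-in for an unrelated fold flag, present only to show that other bits
// survive the mask. Hypothetical value.
const CASE_FOLD_OTHER: u32 = 1 << 20;

/// Start from the library's default case-fold flags and clear only the
/// multi-character bit, leaving every other flag untouched.
fn simple_case_fold_flags(default_flags: u32) -> u32 {
    default_flags & !CASE_FOLD_MULTI_CHAR
}
```

Because the result is passed per compilation rather than written to a process-wide default, two patterns compiled concurrently cannot observe each other's flags.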

The small raw wrapper mirrors the existing onig::Regex shape where relevant: it serializes regex construction with a mutex, keeps the same Send/Sync ownership model for immutable compiled regexes, reuses onig::Region's transparent raw representation, and formats errors through Oniguruma's own error-message API.
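For readers who want the ownership model spelled out, here is a rough, self-contained sketch of that shape. It is not the wrapper in this PR: `RawRegex`, `CompiledRegex`, and `COMPILE_LOCK` are hypothetical names, the raw object is a Rust stand-in for the C regex handle, and the error check is a toy in place of Oniguruma's error-message API.

```rust
use std::ptr::NonNull;
use std::sync::Mutex;

/// Stand-in for the C regex object that the real wrapper would own.
struct RawRegex {
    pattern: String,
}

/// Serializes construction, as the description above says the wrapper does.
static COMPILE_LOCK: Mutex<()> = Mutex::new(());

/// Owns one compiled regex and frees it on drop.
pub struct CompiledRegex {
    raw: NonNull<RawRegex>,
}

// SAFETY (assumption of this sketch): the compiled object is never mutated
// after construction, so sharing or moving it across threads is sound.
unsafe impl Send for CompiledRegex {}
unsafe impl Sync for CompiledRegex {}

impl CompiledRegex {
    pub fn new(pattern: &str) -> Result<Self, String> {
        // Hold the lock for the whole construction step.
        let _guard = COMPILE_LOCK.lock().map_err(|e| e.to_string())?;
        // Toy failure condition standing in for a real compile error.
        if pattern.ends_with('\\') {
            return Err("stand-in compile error: trailing backslash".to_owned());
        }
        let boxed = Box::new(RawRegex { pattern: pattern.to_owned() });
        Ok(Self { raw: NonNull::from(Box::leak(boxed)) })
    }

    /// Read-only access; nothing here mutates the compiled object.
    pub fn pattern(&self) -> &str {
        // SAFETY: `raw` points to a live allocation owned by `self`.
        unsafe { &self.raw.as_ref().pattern }
    }
}

impl Drop for CompiledRegex {
    fn drop(&mut self) {
        // SAFETY: `raw` came from `Box::leak` in `new` and is freed only here.
        unsafe { drop(Box::from_raw(self.raw.as_ptr())) };
    }
}
```

With this shape a compiled regex can be placed in an `Arc` and read from several threads, while only construction contends on the lock.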

Validation:

  • cargo fmt --all -- --check
  • cargo test
  • cargo clippy --all-targets --workspace -puu_grep -- -D warnings
  • git diff --check
  • printf 'st\nss\nffi\n' | cargo run --quiet -- -o -i '[[:alpha:]]' now emits one character per line
  • printf 'SS\n' | cargo run --quiet -- -o -i 'ß' exits 1 with no output

@codspeed-hq

codspeed-hq Bot commented Jun 5, 2026

Merging this PR will improve performance by 23.76%

⚡ 5 improved benchmarks
✅ 5 untouched benchmarks
⏩ 17 skipped benchmarks [1]

Performance Changes

| Benchmark | BASE | HEAD | Efficiency |
| --- | --- | --- | --- |
| recursive_no_binary | 29.4 ms | 22.7 ms | +29.79% |
| regex_no_match | 29.8 ms | 23 ms | +29.33% |
| context | 32.2 ms | 25.5 ms | +26.51% |
| only_matching | 55.1 ms | 47.1 ms | +16.95% |
| filename_lineno_color | 55.2 ms | 47.2 ms | +16.91% |

Comparing wondr-wclabs:codex/simple-case-folding (fcc30a4) with main (d28bf76)

Footnotes

  1. 17 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, archive them to remove them from the performance reports.

Development

Successfully merging this pull request may close these issues.

-i matches multi-character sequences via Unicode case folding where GNU matches one