grep: support POSIX equivalence classes #39

Closed
wondr-wclabs wants to merge 1 commit into uutils:main from wondr-wclabs:codex/posix-equivalence-classes

@wondr-wclabs
Contributor

Fixes #35.

This adds a small BRE/ERE pre-compile normalization step for POSIX equivalence-class entries inside bracket expressions. Because uu_grep does not currently implement locale collation, the implementation only normalizes single-character [=c=] entries to the equivalent literal character, which matches the C-locale behavior described in the issue. Multi-character collating elements are left unchanged rather than guessing at locale-specific behavior.

The parser handles POSIX character-class and collating-symbol tokens inside bracket expressions so patterns such as [[:alpha:]][[=1=]] keep their existing meaning, and fixed-string mode is intentionally left literal.
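For readers who want a concrete picture of the normalization described above, here is a minimal sketch. It is not the code from this PR: the names `normalize_equivalence_classes` and `copy_bracket` are hypothetical, and the guard that leaves `[=]=]`, `[=^=]`, `[=-=]`, `[=\=]` and `[=[=]` untouched is an assumption made here so the rewrite cannot change the shape of the bracket expression. Fixed-string mode would simply not call it.

```rust
/// Hypothetical sketch: rewrite single-character `[=c=]` entries inside
/// bracket expressions to the literal `c` (C-locale behavior only).
fn normalize_equivalence_classes(pattern: &str) -> String {
    let chars: Vec<char> = pattern.chars().collect();
    let mut out = String::with_capacity(pattern.len());
    let mut i = 0;
    while i < chars.len() {
        match chars[i] {
            // Outside brackets a backslash escapes the next character,
            // so `\[` must not be treated as the start of a bracket expression.
            '\\' => {
                out.push('\\');
                if i + 1 < chars.len() {
                    out.push(chars[i + 1]);
                }
                i += 2;
            }
            '[' => i = copy_bracket(&chars, i, &mut out),
            c => {
                out.push(c);
                i += 1;
            }
        }
    }
    out
}

/// Copies one bracket expression starting at `start` (which must be `[`)
/// and returns the index just past its closing `]`.
fn copy_bracket(chars: &[char], start: usize, out: &mut String) -> usize {
    out.push('[');
    let mut j = start + 1;
    // A leading `^` negates; a `]` in first position is a literal.
    if j < chars.len() && chars[j] == '^' {
        out.push('^');
        j += 1;
    }
    if j < chars.len() && chars[j] == ']' {
        out.push(']');
        j += 1;
    }
    while j < chars.len() {
        let c = chars[j];
        if c == ']' {
            out.push(']');
            return j + 1;
        }
        if c == '[' && j + 1 < chars.len() && matches!(chars[j + 1], ':' | '.' | '=') {
            let delim = chars[j + 1];
            // Look for the matching `:]`, `.]` or `=]` terminator.
            let close = (j + 2..chars.len().saturating_sub(1))
                .find(|&k| chars[k] == delim && chars[k + 1] == ']');
            if let Some(k) = close {
                let inner = &chars[j + 2..k];
                // Assumption: skip characters that are special inside brackets.
                let safe = inner.len() == 1
                    && !matches!(inner[0], ']' | '^' | '-' | '\\' | '[');
                if delim == '=' && safe {
                    out.push(inner[0]);
                } else {
                    // Character classes, collating symbols and multi-character
                    // equivalence classes are copied through unchanged.
                    out.extend(chars[j..k + 2].iter());
                }
                j = k + 2;
                continue;
            }
        }
        out.push(c);
        j += 1;
    }
    j // unterminated bracket: everything was copied verbatim
}
```

With this sketch, `[[:alpha:]][[=1=]]` would become `[[:alpha:]][1]`, while `[[=ch=]]` and `[[.hyphen.]]` pass through untouched.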

@codspeed-hq

codspeed-hq Bot commented Jun 5, 2026

Merging this PR will not alter performance

✅ 10 untouched benchmarks
⏩ 17 skipped benchmarks [1]


Comparing wondr-wclabs:codex/posix-equivalence-classes (c1d4187) with main (d28bf76)

Footnotes

  1. 17 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@wondr-wclabs wondr-wclabs force-pushed the codex/posix-equivalence-classes branch from 910bbd7 to c1d4187 Compare June 5, 2026 08:42
Collaborator

@lhecker lhecker left a comment

I'm personally willing to accept this as a broken feature until we have an alternative to Oniguruma. The equivalence classes are somewhat niche. Not sure how others view it...

@wondr-wclabs
Contributor Author

That is a fair concern. This patch is deliberately narrow: it only normalizes single-character [=c=] entries to the C-locale literal form, and it does not attempt locale collation or multi-character collating elements. That means it fixes the concrete C-locale behavior from #35, but it still leaves the larger POSIX equivalence-class semantics unimplemented.

Given the feedback on #55, I agree the main tradeoff is not whether this small parser can pass the local case, but whether grep should grow local pre-compile normalizers around Oniguruma at all. If the project direction is to avoid those until there is a broader regex-engine strategy, then this PR is probably not worth merging even though the current checks were green before main moved.

My view is: this is acceptable only if maintainers want the narrow C-locale compatibility improvement now. If the preferred direction is “leave equivalence classes broken until the engine story changes,” I can close this PR rather than rebase and keep adding review noise.

@lhecker
Collaborator

lhecker commented Jun 5, 2026

> the engine story

I'll close this PR for now then. If other maintainers consider this a priority to fix, let's reopen the PR!

@lhecker lhecker closed this Jun 5, 2026

Development

Successfully merging this pull request may close these issues.

POSIX equivalence classes [[=c=]] do not match like GNU

2 participants