Support POSIX bracket classes in regex char classes #1552
Code Review
This pull request implements support for POSIX bracket classes (e.g., [:alpha:], [:digit:]) within character classes in the regex engine, accompanied by new test cases. The review feedback identifies two key improvements: raising a compile error for invalid POSIX class names instead of falling back to literal parsing to avoid silent failures, and adding support for the [:ascii:] class to improve compatibility with Ruby's regex engine.
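For reference, here is a minimal sketch of the CRuby behavior the two review points refer to. The helper name `compile_error_for` and the misspelled class name are illustrative assumptions, not part of this PR.

```ruby
# Reference behavior in CRuby. compile_error_for is a hypothetical helper
# that returns the compile error message, or nil if the pattern compiles.
def compile_error_for(source)
  Regexp.new(source)
  nil
rescue RegexpError => e
  e.message
end

# A misspelled POSIX class name is rejected at compile time.
unknown_name_error = compile_error_for("[[:alpah:]]")

# [:ascii:] matches only bytes 0x00..0x7F, so the trailing "é" is left out.
ascii_runs = "caf\u00e9".scan(/[[:ascii:]]+/)
```

Raising at compile time makes a typo in the class name visible immediately, whereas a literal fallback would produce a pattern that compiles and then quietly matches the wrong characters.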
Recognize [:name:] POSIX classes inside bracket expressions and expand
them to their ASCII ranges via the existing class-add helpers. Supports
alpha, digit, alnum, space, upper, lower, punct, blank, cntrl, xdigit,
print, graph, word, and ascii. The in-class negated form [:^name:] adds
the ASCII complement plus utf8_any, mirroring the \D/\W/\S shorthands.
Enclosing negation ([^[:alpha:]]) is applied by the existing RE_NCLASS
emit. An unrecognized POSIX class name raises a RegexpError ("invalid
POSIX bracket type"), matching CRuby, instead of silently matching
nothing. Previously [:alpha:] was parsed as the literal set {:,a,l,p,h},
so the class never matched alphabetic input.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
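The expansion described in the commit message can be modeled roughly as follows. This is an illustrative Ruby sketch, not the C code in the PR: `POSIX_ASCII_RANGES`, `expand_posix_class`, and `complement_ascii` are hypothetical names, the ranges are the standard C-locale definitions, and the non-ASCII (utf8_any) part of the negated form is deliberately left out.

```ruby
# Illustrative model only; none of these names come from re_compile.c.
# Ranges follow the standard C-locale (ASCII) POSIX class definitions.
POSIX_ASCII_RANGES = {
  "alpha"  => [0x41..0x5A, 0x61..0x7A],
  "digit"  => [0x30..0x39],
  "alnum"  => [0x30..0x39, 0x41..0x5A, 0x61..0x7A],
  "space"  => [0x09..0x0D, 0x20..0x20],
  "upper"  => [0x41..0x5A],
  "lower"  => [0x61..0x7A],
  "punct"  => [0x21..0x2F, 0x3A..0x40, 0x5B..0x60, 0x7B..0x7E],
  "blank"  => [0x09..0x09, 0x20..0x20],
  "cntrl"  => [0x00..0x1F, 0x7F..0x7F],
  "xdigit" => [0x30..0x39, 0x41..0x46, 0x61..0x66],
  "print"  => [0x20..0x7E],
  "graph"  => [0x21..0x7E],
  "word"   => [0x30..0x39, 0x41..0x5A, 0x5F..0x5F, 0x61..0x7A],
  "ascii"  => [0x00..0x7F],
}.freeze

# Complement of sorted, non-overlapping ranges within 0x00..0x7F.
def complement_ascii(ranges)
  out = []
  next_free = 0
  ranges.sort_by(&:first).each do |r|
    out << (next_free..r.first - 1) if r.first > next_free
    next_free = [next_free, r.last + 1].max
  end
  out << (next_free..0x7F) if next_free <= 0x7F
  out
end

# Accepts "[:name:]" or "[:^name:]" and returns ASCII byte ranges.
# Unknown names raise instead of falling back to literal characters.
def expand_posix_class(token)
  m = /\A\[:(\^?)([a-z]+):\]\z/.match(token)
  raise ArgumentError, "not a POSIX bracket: #{token}" unless m
  ranges = POSIX_ASCII_RANGES[m[2]]
  raise RegexpError, "invalid POSIX bracket type" unless ranges
  m[1].empty? ? ranges : complement_ascii(ranges)
end
```

In this model, `expand_posix_class("[:^digit:]")` yields the two ranges on either side of `0`..`9`; the real engine would additionally admit non-ASCII code points for the negated form, as the commit message notes.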
4c7468a to e0af587
POSIX bracket classes inside a character class ([[:alpha:]], [[:digit:]], and the rest of the family) previously matched nothing. The bracket reader had no recognition of the [:name:] syntax, so [:alpha:] was parsed as a char class containing the literal characters :, a, l, p, h, which never matches alphabetic input.

The bracket-class reader in lib/regexp/re_compile.c now recognizes [:…:] and expands the named class to its ASCII ranges using the same range/bit helpers that back \d and \w (no parallel mechanism). Supported names: alpha, digit, alnum, space, upper, lower, punct, blank, cntrl, xdigit, print, graph, word. Enclosing negation such as [^[:alpha:]] is handled by the existing RE_NCLASS emit, identical to how the engine already negates classes. Semantics are ASCII-only and match CRuby for ASCII input; all 13 names were verified byte-for-byte against ruby. Unicode \p{L} properties remain out of scope.

A new test/regex_posix_class.rb (.expected regenerated from ruby) covers alpha/digit/alnum/space/upper/lower, a negated class, combined literal+POSIX brackets, both scan and a plain match, and a non-constant subject routed through a method parameter so the runtime engine is exercised rather than constant-folded. The full suite is green under -Werror.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
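A sketch of the kind of cases the test description lists, written against CRuby's behavior for ASCII input. The method names are hypothetical and this is not the content of test/regex_posix_class.rb.

```ruby
# Hypothetical helpers. Passing the subject as a parameter keeps it from
# being a compile-time constant, so the match has to happen at run time.
def alpha_runs(subject)
  subject.scan(/[[:alpha:]]+/)
end

# Enclosing negation: first character that is not a digit, or nil.
def first_non_digit(subject)
  m = /[^[:digit:]]/.match(subject)
  m && m[0]
end

# A literal ("_") combined with a POSIX class in the same bracket.
def ident_chars(subject)
  subject.scan(/[_[:alnum:]]+/)
end

p alpha_runs("ab12cd")        # => ["ab", "cd"]
p first_non_digit("42x7")     # => "x"
p ident_chars("foo_1 bar-2")  # => ["foo_1", "bar", "2"]
```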