Regex nfa dfa benchmarks #527

Merged
timbray merged 6 commits into timbray:main from sayrer:regex-nfa-dfa-benchmarks on May 30, 2026

Contributor

@sayrer sayrer commented May 18, 2026

These are tests that subsume #492.

sayrer and others added 5 commits May 18, 2026 11:11
Previously patterns and events drew emojis from the same pool but
independently, so events with random emoji pairs rarely matched any
of the random pattern pairs (especially at patterns=32/64), tripping
the per-iteration b.Fatalf assertion.

Track the (e1, e2) pairs used to build patterns; sample events from
that same set so every event matches at least one pattern. The
benchmark still measures NFA traversal cost on dense multi-byte
UTF-8 input — the only thing that changed is correctness of the
match-presence sanity check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds stable baseline benchmarks for representative match-time workloads:

  - ExactString:              1 exact pattern
  - SingleShellstyle:         1 wildcard pattern
  - ManyOverlappingWildcards: N=8..128 overlapping wildcards
  - RegexAlternation:         20 regex patterns with alternation
  - LiteralInRegex:           literal substring inside regex
  - QuantifiedCharClass:      regex with {n,m} quantifier
  - ManyAnchoredRegex:        200 anchored regex patterns
  - DeepEpsilonNest:          regex with nested alternation/quantifiers
  - CacheThrashing:           adversarial input over wide state space
  - ParallelMatchers:         8..64 goroutines via Copy()

Each warms the matcher with ~100 iterations before resetting the timer
so first-call laziness does not pollute steady-state measurements.

These are intended as stable workload baselines: subsequent matcher
optimization work can be evaluated by re-running these benchmarks
unchanged and comparing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
for _, sp := range simplePatterns {
	b.Run(sp.name, func(b *testing.B) {
		q, _ := New()
		pattern := fmt.Sprintf(`{"val": [{"shellstyle": %q}]}`, sp.shellstyle)
Owner


Not a blocker but let's use wildcard rather than shellstyle in the future.

Contributor Author


Sorry to be an epic nitpicker, but there is a problem here. "shellstyle" is the description of the overall pattern, while "wildcard" is the specific production. I don't care how this distinction is resolved, but it's there.

@timbray timbray merged commit 567d4d5 into timbray:main May 30, 2026
7 checks passed
Owner

timbray commented May 30, 2026 via email

Contributor Author

sayrer commented May 30, 2026

I am surprised by your strong opinion here, although I do not disagree with it. I'll try to knock out something to cover this issue tomorrow morning. Happy to do it your way, it's just confusing.
