kaizen: nfa: simplify smallTable.step, eliminate per-traverse stepOut alloc by sayrer · Pull Request #519 · timbray/quamina

sayrer · 2026-04-16T23:26:25Z

step() previously returned via a *stepOut parameter whose epsilons field was written on every call but never read — traverseNFA iterates the precomputed epsilonClosure separately. Change step to return *faState directly, drop the stepOut struct, and drop the stack slot in traverseNFA (it was escaping to the heap as one alloc per match).

Also replace the forbiddenBytes map with a range check; the three ranges {0xC0, 0xC1, 0xF5-0xFF} are enough for a couple of compares and the map fetch was pure cost on the cold path.

Bench (Apple M1 Ultra, n=6):

ShellStyleBuildTime-20 56.10m -> 42.44m -24.36%
958 B -> 709 B -25.99%
1 allocs -> 0 allocs
geomean (10 benchmarks): -2.26% time

Other targeted benchmarks are within +/-2% (statistically significant but small); ShellstyleMultiMatch / TablePointerDedup / CityLots allocations unchanged at 0.

Tested #519 and #520 together under the race detector and it was clean, but it took 117 seconds.

step() previously returned via a *stepOut parameter whose `epsilons` field was written on every call but never read — traverseNFA iterates the precomputed epsilonClosure separately. Change step to return *faState directly, drop the stepOut struct, and drop the stack slot in traverseNFA (it was escaping to the heap as one alloc per match). Also replace the forbiddenBytes map with a range check; the three ranges {0xC0, 0xC1, 0xF5-0xFF} are enough for a couple of compares and the map fetch was pure cost on the cold path. Bench (Apple M1 Ultra, n=6): ShellStyleBuildTime-20 56.10m -> 42.44m -24.36% 958 B -> 709 B -25.99% 1 allocs -> 0 allocs geomean (10 benchmarks): -2.26% time Other targeted benchmarks are within +/-2% (statistically significant but small); ShellstyleMultiMatch / TablePointerDedup / CityLots allocations unchanged at 0. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

timbray · 2026-04-17T00:03:24Z

I like this one a lot, but I'm going to merge mine first unless someone turns up issues with it, because it's weighing on my mind.

sayrer · 2026-04-17T00:29:14Z

Yes, get yours in first. It's better at measuring what follows, and I'm only sending these for the game--how fast can this thing get? (I knew the bug in #516 existed, because I have standing instructions for Claude to tell me I am stupid, and not to mince words, but I do not send these to other people)

sayrer · 2026-04-17T07:19:55Z

It's because I say "Claude, roast this code", and it usually has some things to improve. But I do not send these to other people.

timbray · 2026-04-17T14:14:09Z

Which bug in #516 is that?

sayrer · 2026-04-17T15:54:03Z

Oops, I meant #518. I had it run that criticism on my lazyDFA branch, and it spotted those issues too. But I figured we need #516 first, and this has to be resilient to processes coming and going. So, before worrying about a fancy eviction or GC mechanism, just exit(0).

timbray

I'm approving but not merging this yet. Try doing a before/after on the new BenchmarkMemoryBoundaries benchmark. It introduces a new metric, how many 5-char wildcard patterns per 100M of memory. Not sure what you should be looking for, but I'll be interested to hear if you are surprised.

sayrer · 2026-04-17T17:35:51Z

  memory_cost_test.go:166 assumes the memory budget of current - 1 will be exceeded by the second add, but under Go 1.23 + -race (CI's 
  config) the allocation accounting differs just enough that it slips under:                                                           
                                                                                                                                       
  // memory_cost_test.go:157-167                                                                                                       
  _, current = cm.getMemoryBudget()                                                                                                    
  cm = newCoreMatcher()                                                                                                                
  _, _ = cm.setMemoryBudget(current - 1)  // 1-byte margin — too tight                                                                 
  err = cm.addPattern("x", `{"x": ["x"]}`)                                                                                             
  if err != nil {                                                                                                                      
      t.Error("x?")                                                                                                                    
  }               
  err = cm.addPattern("x", fmt.Sprintf(`{"x": ["%s"]}`, i100))                                                                         
  if err == nil {                                                                                                                      
      t.Error("should fail")   // line 166 — fires under 1.23 + -race
  }                                                                                                                                    
                  
  This test comes from main's memory-budget PR #516, not my change. Cover job (no -race) passes; Tests job (-race) fails because the   
  race detector shifts TotalAlloc by enough to invalidate the 1-byte margin.
                                                                                                                                       
  Reproduced locally on the same branch with GOTOOLCHAIN=go1.23.0 GOFLAGS='-v -race -count=1 -json' go test:                           
  --- FAIL: TestStringFA (0.00s)
      memory_cost_test.go:166: should fail                                                                                             
                                                                                                                                       
  The fix is upstream's to make (widen the margin, e.g. current/2 or current - i100Size*2). Since this isn't caused by my step()
  change, and it's on main as well, I'd recommend either:                                                                              
                  
  1. Flag it to Tim as a flaky test from #516 — minimal change: widen the margin.                                                      
  2. If you want this branch's PR to go green without waiting, cherry-pick a local fix just to demonstrate it.```

The final sub-case set the budget to current-1 and expected the 100-char string pattern add to fail. Under Go 1.23 with -race, the race detector's bookkeeping shifts TotalAlloc by more than 1 byte between the reference measurement and the replay on a fresh matcher, so addPattern stayed under budget and the test tripped its "should fail" assertion. (Go 1.24+ on the same hardware behaved differently, and the Cover job without -race passes.) Widen the margin to len(i100) — comfortably smaller than the i100 pattern's own FA cost (thousands of bytes) but large enough to absorb alloc variance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sayrer · 2026-04-17T17:54:57Z

So, it wants #521. I'll put this comparison in there.

                                                                                                                                       
  Identical on all memory axes. 5 samples each, benchstat:                                                                             
                                                                                                                                       
  ┌───────────┬───────────┬───────────────────┬─────────────┐                                                                          
  │  Metric   │   main    │ nfa-step-simplify │      Δ      │                                                                          
  ├───────────┼───────────┼───────────────────┼─────────────┤                                                                          
  │ sec/op    │ 19.17     │ 19.11             │ ~ (p=0.310) │
  ├───────────┼───────────┼───────────────────┼─────────────┤                                                                          
  │ B/op      │ 3.301 GiB │ 3.301 GiB         │ ~ (p=0.421) │                                                                          
  ├───────────┼───────────┼───────────────────┼─────────────┤                                                                          
  │ allocs/op │ 5.234 M   │ 5.234 M           │ ~ (p=0.056) │                                                                          
  └───────────┴───────────┴───────────────────┴─────────────┘                                                                          
                                                                                                                                       
  Per-budget patterns supported (from the bench's stdout) — identical run-to-run across branches:                                      
   
  ┌────────┬────────────────────┬───────────┐                                                                                          
  │ Budget │ Patterns supported │ B/pattern │
  ├────────┼────────────────────┼───────────┤                                                                                          
  │ 100 M  │ 1306–1308          │ ~74.7 K   │
  ├────────┼────────────────────┼───────────┤
  │ 200 M  │ 1978               │ 98.7 K    │
  ├────────┼────────────────────┼───────────┤                                                                                          
  │ 400 M  │ 2914               │ 134.1 K   │
  ├────────┼────────────────────┼───────────┤                                                                                          
  │ 800 M  │ 4294               │ 181.9 K   │
  ├────────┼────────────────────┼───────────┤
  │ 2000 M │ 7008               │ 278.7 K   │
  ├────────┼────────────────────┼───────────┤                                                                                          
  │ Total  │ ~17501             │           │
  └────────┴────────────────────┴───────────┘                                                                                          
                  
  Expected — this branch only touches the match-time step() signature and drops the dead stepOut.epsilons write. It doesn't change NFA 
  construction, smallTable/faState size, or allocation shape, which is what this benchmark measures.
                                                                                                                                       
  The build-context-extract branch (separate) would move this needle: it shrinks smallTable by 24 B and faState by 8 B permanently.    
  Want me to spin up that comparison against main next?```

sayrer mentioned this pull request Apr 16, 2026

kaizen: core_matcher: pool nfaBuffers and flattener in the test helper path #520

Merged

Merge branch 'main' into nfa-step-simplify

50cdd72

timbray reviewed Apr 17, 2026

View reviewed changes

timbray previously approved these changes Apr 17, 2026

View reviewed changes

Merge branch 'main' into nfa-step-simplify

1d0fdbd

sayrer dismissed timbray’s stale review via b96ef09 April 17, 2026 17:42

timbray mentioned this pull request Apr 17, 2026

kaizen: nfa: move build-time gen state off smallTable/faState to side tables #521

Merged

timbray merged commit 53515a0 into timbray:main Apr 17, 2026
7 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kaizen: nfa: simplify smallTable.step, eliminate per-traverse stepOut alloc#519

kaizen: nfa: simplify smallTable.step, eliminate per-traverse stepOut alloc#519
timbray merged 4 commits into
timbray:mainfrom
sayrer:nfa-step-simplify

sayrer commented Apr 16, 2026 •

edited

Loading

Uh oh!

timbray commented Apr 17, 2026

Uh oh!

sayrer commented Apr 17, 2026 •

edited

Loading

Uh oh!

sayrer commented Apr 17, 2026

Uh oh!

timbray commented Apr 17, 2026

Uh oh!

sayrer commented Apr 17, 2026

Uh oh!

timbray left a comment

Uh oh!

sayrer commented Apr 17, 2026

Uh oh!

sayrer commented Apr 17, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

sayrer commented Apr 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

timbray commented Apr 17, 2026

Uh oh!

sayrer commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sayrer commented Apr 17, 2026

Uh oh!

timbray commented Apr 17, 2026

Uh oh!

sayrer commented Apr 17, 2026

Uh oh!

timbray left a comment

Choose a reason for hiding this comment

Uh oh!

sayrer commented Apr 17, 2026

Uh oh!

sayrer commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

sayrer commented Apr 16, 2026 •

edited

Loading

sayrer commented Apr 17, 2026 •

edited

Loading

sayrer commented Apr 17, 2026 •

edited

Loading