Skip to content

kaizen: nfa: simplify smallTable.step, eliminate per-traverse stepOut alloc#519

Merged
timbray merged 4 commits into
timbray:mainfrom
sayrer:nfa-step-simplify
Apr 17, 2026
Merged

kaizen: nfa: simplify smallTable.step, eliminate per-traverse stepOut alloc#519
timbray merged 4 commits into
timbray:mainfrom
sayrer:nfa-step-simplify

Conversation

@sayrer
Copy link
Copy Markdown
Contributor

@sayrer sayrer commented Apr 16, 2026

step() previously returned via a *stepOut parameter whose epsilons field was written on every call but never read — traverseNFA iterates the precomputed epsilonClosure separately. Change step to return *faState directly, drop the stepOut struct, and drop the stack slot in traverseNFA (it was escaping to the heap as one alloc per match).

Also replace the forbiddenBytes map with a range check; the three ranges {0xC0, 0xC1, 0xF5-0xFF} are enough for a couple of compares and the map fetch was pure cost on the cold path.

Bench (Apple M1 Ultra, n=6):

ShellStyleBuildTime-20 56.10m -> 42.44m -24.36%
958 B -> 709 B -25.99%
1 allocs -> 0 allocs
geomean (10 benchmarks): -2.26% time

Other targeted benchmarks are within +/-2% (statistically significant but small); ShellstyleMultiMatch / TablePointerDedup / CityLots allocations unchanged at 0.

Tested #519 and #520 together under the race detector and it was clean, but it took 117 seconds.

step() previously returned via a *stepOut parameter whose `epsilons`
field was written on every call but never read — traverseNFA iterates
the precomputed epsilonClosure separately. Change step to return
*faState directly, drop the stepOut struct, and drop the stack slot
in traverseNFA (it was escaping to the heap as one alloc per match).

Also replace the forbiddenBytes map with a range check; the three
ranges {0xC0, 0xC1, 0xF5-0xFF} are enough for a couple of compares
and the map fetch was pure cost on the cold path.

Bench (Apple M1 Ultra, n=6):

  ShellStyleBuildTime-20   56.10m -> 42.44m  -24.36%
                            958 B ->  709 B  -25.99%
                           1 allocs -> 0 allocs
  geomean (10 benchmarks):             -2.26% time

Other targeted benchmarks are within +/-2% (statistically significant
but small); ShellstyleMultiMatch / TablePointerDedup / CityLots
allocations unchanged at 0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@timbray
Copy link
Copy Markdown
Owner

timbray commented Apr 17, 2026

I like this one a lot, but I'm going to merge mine first unless someone turns up issues with it, because it's weighing on my mind.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 17, 2026

Yes, get yours in first. It's better at measuring what follows, and I'm only sending these for the game--how fast can this thing get? (I knew the bug in #516 existed, because I have standing instructions for Claude to tell me I am stupid, and not to mince words, but I do not send these to other people)

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 17, 2026

It's because I say "Claude, roast this code", and it usually has some things to improve. But I do not send these to other people.

@timbray
Copy link
Copy Markdown
Owner

timbray commented Apr 17, 2026

Which bug in #516 is that?

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 17, 2026

Oops, I meant #518. I had it run that criticism on my lazyDFA branch, and it spotted those issues too. But I figured we need #516 first, and this has to be resilient to processes coming and going. So, before worrying about a fancy eviction or GC mechanism, just exit(0).

Copy link
Copy Markdown
Owner

@timbray timbray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm approving but not merging this yet. Try doing a before/after on the new BenchmarkMemoryBoundaries benchmark. It introduces a new metric, how many 5-char wildcard patterns per 100M of memory. Not sure what you should be looking for, but I'll be interested to hear if you are surprised.

timbray
timbray previously approved these changes Apr 17, 2026
@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 17, 2026

  memory_cost_test.go:166 assumes the memory budget of current - 1 will be exceeded by the second add, but under Go 1.23 + -race (CI's 
  config) the allocation accounting differs just enough that it slips under:                                                           
                                                                                                                                       
  // memory_cost_test.go:157-167                                                                                                       
  _, current = cm.getMemoryBudget()                                                                                                    
  cm = newCoreMatcher()                                                                                                                
  _, _ = cm.setMemoryBudget(current - 1)  // 1-byte margin — too tight                                                                 
  err = cm.addPattern("x", `{"x": ["x"]}`)                                                                                             
  if err != nil {                                                                                                                      
      t.Error("x?")                                                                                                                    
  }               
  err = cm.addPattern("x", fmt.Sprintf(`{"x": ["%s"]}`, i100))                                                                         
  if err == nil {                                                                                                                      
      t.Error("should fail")   // line 166 — fires under 1.23 + -race
  }                                                                                                                                    
                  
  This test comes from main's memory-budget PR #516, not my change. Cover job (no -race) passes; Tests job (-race) fails because the   
  race detector shifts TotalAlloc by enough to invalidate the 1-byte margin.
                                                                                                                                       
  Reproduced locally on the same branch with GOTOOLCHAIN=go1.23.0 GOFLAGS='-v -race -count=1 -json' go test:                           
  --- FAIL: TestStringFA (0.00s)
      memory_cost_test.go:166: should fail                                                                                             
                                                                                                                                       
  The fix is upstream's to make (widen the margin, e.g. current/2 or current - i100Size*2). Since this isn't caused by my step()
  change, and it's on main as well, I'd recommend either:                                                                              
                  
  1. Flag it to Tim as a flaky test from #516 — minimal change: widen the margin.                                                      
  2. If you want this branch's PR to go green without waiting, cherry-pick a local fix just to demonstrate it.```

The final sub-case set the budget to current-1 and expected the
100-char string pattern add to fail. Under Go 1.23 with -race, the
race detector's bookkeeping shifts TotalAlloc by more than 1 byte
between the reference measurement and the replay on a fresh matcher,
so addPattern stayed under budget and the test tripped its
"should fail" assertion. (Go 1.24+ on the same hardware behaved
differently, and the Cover job without -race passes.)

Widen the margin to len(i100) — comfortably smaller than the
i100 pattern's own FA cost (thousands of bytes) but large enough to
absorb alloc variance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 17, 2026

So, it wants #521. I'll put this comparison in there.

                                                                                                                                       
  Identical on all memory axes. 5 samples each, benchstat:                                                                             
                                                                                                                                       
  ┌───────────┬───────────┬───────────────────┬─────────────┐                                                                          
  │  Metric   │   main    │ nfa-step-simplify │      Δ      │                                                                          
  ├───────────┼───────────┼───────────────────┼─────────────┤                                                                          
  │ sec/op    │ 19.17     │ 19.11             │ ~ (p=0.310) │
  ├───────────┼───────────┼───────────────────┼─────────────┤                                                                          
  │ B/op      │ 3.301 GiB │ 3.301 GiB         │ ~ (p=0.421) │                                                                          
  ├───────────┼───────────┼───────────────────┼─────────────┤                                                                          
  │ allocs/op │ 5.234 M   │ 5.234 M           │ ~ (p=0.056) │                                                                          
  └───────────┴───────────┴───────────────────┴─────────────┘                                                                          
                                                                                                                                       
  Per-budget patterns supported (from the bench's stdout) — identical run-to-run across branches:                                      
   
  ┌────────┬────────────────────┬───────────┐                                                                                          
  │ Budget │ Patterns supported │ B/pattern │
  ├────────┼────────────────────┼───────────┤                                                                                          
  │ 100 M  │ 1306–1308          │ ~74.7 K   │
  ├────────┼────────────────────┼───────────┤
  │ 200 M  │ 1978               │ 98.7 K    │
  ├────────┼────────────────────┼───────────┤                                                                                          
  │ 400 M  │ 2914               │ 134.1 K   │
  ├────────┼────────────────────┼───────────┤                                                                                          
  │ 800 M  │ 4294               │ 181.9 K   │
  ├────────┼────────────────────┼───────────┤
  │ 2000 M │ 7008               │ 278.7 K   │
  ├────────┼────────────────────┼───────────┤                                                                                          
  │ Total  │ ~17501             │           │
  └────────┴────────────────────┴───────────┘                                                                                          
                  
  Expected — this branch only touches the match-time step() signature and drops the dead stepOut.epsilons write. It doesn't change NFA 
  construction, smallTable/faState size, or allocation shape, which is what this benchmark measures.
                                                                                                                                       
  The build-context-extract branch (separate) would move this needle: it shrinks smallTable by 24 B and faState by 8 B permanently.    
  Want me to spin up that comparison against main next?```

@timbray timbray merged commit 53515a0 into timbray:main Apr 17, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants