kaizen: nfa: simplify smallTable.step, eliminate per-traverse stepOut alloc#519
Conversation
step() previously returned via a *stepOut parameter whose `epsilons`
field was written on every call but never read — traverseNFA iterates
the precomputed epsilonClosure separately. Change step to return
*faState directly, drop the stepOut struct, and drop the stack slot
in traverseNFA (it was escaping to the heap as one alloc per match).
Also replace the forbiddenBytes map with a range check; the three
ranges {0xC0, 0xC1, 0xF5-0xFF} are enough for a couple of compares
and the map fetch was pure cost on the cold path.
Bench (Apple M1 Ultra, n=6):
ShellStyleBuildTime-20 56.10m -> 42.44m -24.36%
958 B -> 709 B -25.99%
1 allocs -> 0 allocs
geomean (10 benchmarks): -2.26% time
Other targeted benchmarks are within +/-2% (statistically significant
but small); ShellstyleMultiMatch / TablePointerDedup / CityLots
allocations unchanged at 0.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
I like this one a lot, but I'm going to merge mine first unless someone turns up issues with it, because it's weighing on my mind. |
|
Yes, get yours in first. It's better at measuring what follows, and I'm only sending these for the game--how fast can this thing get? (I knew the bug in #516 existed, because I have standing instructions for Claude to tell me I am stupid, and not to mince words, but I do not send these to other people) |
|
It's because I say "Claude, roast this code", and it usually has some things to improve. But I do not send these to other people. |
|
Which bug in #516 is that? |
timbray
left a comment
There was a problem hiding this comment.
I'm approving but not merging this yet. Try doing a before/after on the new BenchmarkMemoryBoundaries benchmark. It introduces a new metric, how many 5-char wildcard patterns per 100M of memory. Not sure what you should be looking for, but I'll be interested to hear if you are surprised.
|
The final sub-case set the budget to current-1 and expected the 100-char string pattern add to fail. Under Go 1.23 with -race, the race detector's bookkeeping shifts TotalAlloc by more than 1 byte between the reference measurement and the replay on a fresh matcher, so addPattern stayed under budget and the test tripped its "should fail" assertion. (Go 1.24+ on the same hardware behaved differently, and the Cover job without -race passes.) Widen the margin to len(i100) — comfortably smaller than the i100 pattern's own FA cost (thousands of bytes) but large enough to absorb alloc variance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
So, it wants #521. I'll put this comparison in there. |
step() previously returned via a *stepOut parameter whose
epsilonsfield was written on every call but never read — traverseNFA iterates the precomputed epsilonClosure separately. Change step to return *faState directly, drop the stepOut struct, and drop the stack slot in traverseNFA (it was escaping to the heap as one alloc per match).Also replace the forbiddenBytes map with a range check; the three ranges {0xC0, 0xC1, 0xF5-0xFF} are enough for a couple of compares and the map fetch was pure cost on the cold path.
Bench (Apple M1 Ultra, n=6):
ShellStyleBuildTime-20 56.10m -> 42.44m -24.36%
958 B -> 709 B -25.99%
1 allocs -> 0 allocs
geomean (10 benchmarks): -2.26% time
Other targeted benchmarks are within +/-2% (statistically significant but small); ShellstyleMultiMatch / TablePointerDedup / CityLots allocations unchanged at 0.
Tested #519 and #520 together under the race detector and it was clean, but it took 117 seconds.