kaizen: Embed smalltable#535
Conversation
Adds tableShareKey, a stable identifier for the share group of slice-headers inside a smallTable. After smallTable is embedded into faState by value, two states that hold copies of one source smallTable still share the underlying steps/ceilings/epsilons backing arrays — and a key built from SliceData(steps) + len(steps) captures that equivalence. This replaces *smallTable-pointer-identity as the dedup key in the next commit. No behavioral change yet; this commit only adds the helper and unit tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switches closureBuffers.tables from map[*smallTable]*tableMark to map[tableShareKey]*tableMark. With *smallTable pointer identity still intact (pre-embed), the new key is equivalent: two states that share a *smallTable trivially share their steps backing. No behavioral change. Sets up the next commit, which embeds smallTable in faState by value — at which point pointer-identity goes away but slice-backing identity remains.
faState.table changes from *smallTable to smallTable (inline). This shrinks per-state memory: - faState 64B + smallTable 80B-class = 144B per state pair - embedded faState fits 128B size class = 128B per state - saves 16B/state on workloads without table-pointer sharing vmFields.startTable becomes startState *faState (the start state owns the start table inline). traverseNFA, epsilonClosure, and related helpers take *faState instead of *smallTable. Epsilon-closure dedup continues to work via tableShareKey (added in the previous two commits) — value-copies of one source smallTable still share their slice backings, and SliceData(steps) is the new identity. Size-assertion tests are not yet recalibrated; that follows in the next commit.
Per-state size shrunk (faState now 128B with inline smallTable). Update hand-calibrated constants in TestMcBasicSizes, TestQuaminaMemoryCost, TestMcNfaSizes. Update TestPP s/t numbers and restore trailing whitespace in wanted literal (producer still emits "[s/t X/Y] \n" with the trailing space before the newline). TestTablePointerDedup's tableSharing and totalEntries expectations also needed updating — the dedup metric now uses slice-backing identity (via tableShareKey from earlier commits) rather than *smallTable pointer identity, so value-copies of a source smallTable register as shared. Recalibrated to the new ground-truth values. Minor cleanups: stale "startTable" reference in TestQuaminaMemoryCostSingleton comment (renamed to startState); inline a dead local in copyShellNode. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Port of the build-context-extract fix (6947edf) to the embedded-smallTable design. embed-smalltable inherited the same two regressions from the shared 32dc2a9 ancestor; shellstyle build at 1000 words was ~7x slower than main. 1. closureForNfa's walk dedup was broken. The refactor collapsed two independent counters into a single bufs.gen that closureForState mutates, so the walk's visited check never matched after the first state and the heavily-shared shellstyle graph got re-traversed (O(V*E)). Restored via bufs.walkGen, a snapshot closureForState never touches. 2. Per-call allocation. epsilonClosure allocated fresh maps per call and tableMarkOf heap-allocated a *tableMark per share group. closureBuffers are now pooled (sync.Pool, GC-reclaimable so no steady-state cost), maps are reused via a monotonic generation (no clearing), and tableMark is stored by value. Unlike the *smallTable-keyed bce branch, the walk here dedups by *faState identity rather than tableShareKey. Share-key dedup is unsafe for the walk: distinct states can share a steps backing array yet have different epsilons, and the zero key collapses all no-byte tables. The faState pointer is the natural unique identity now that smallTable is embedded by value. The post-pass table-pointer dedup keeps tableShareKey (collision-safe there, as it re-checks sameFieldTransitions) and now skips the zero key explicitly. Shellstyle build at 1000 words: 1363ms -> 473ms. Full suite passes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nfa2Dfa/intern is slated to go live. Profiling BenchmarkNfa2Dfa showed it
~64% slower than main, entirely inside intern()'s dedup: clear(sl.seen) plus
a per-state map assign into a map[*faState]struct{} cost ~600ms where main's
faState.closureSetGen generation-counter compare cost ~50ms. That field was
removed to shrink steady-state memory, so the map was the fallback.
intern already sorts the state set by pointer to build a canonical key, so
duplicates are adjacent after the sort. Replacing the map-based dedup with
slices.Compact over the sorted buffer removes the map (and its clear()) with
no per-faState field and no extra sort — sorting was already happening.
Nfa2Dfa vs main (geomean, n=6): time +63.8% -> +2.65%, B/op -1.05% -> -1.65%,
allocs/op -7.8% -> -9.6%. Full suite passes.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts: # epsi_closure.go # memory_cost_test.go # state_lists.go
|
(afk for a while, will take a closer look) |
|
Now, granted, all these AddPattern calls are wildcards, and the epsilon merging gets kind of ugly, so probably not a representative workload. |
|
I'm just trying to figure out whether I care about this. Don't have an opinion yet. |
|
I just got that workload, seems like I can do some damage to it with some small changes. |
|
I gotta say, if those numbers are correct, the permanent memory saving is just not that dramatic and the AddPattern impact is pretty significant. So maybe the thing to do is back out the last PR and then cherrypick the good stuff from the faState/smallTable branch with the add-ons that made nfa2Dfa faster and so on? |
|
I think I can get it with no regression, just measuring. |
|
Even just the first step went well. The closure walk was improved, and had been going over the more space-efficient structures too much. That's here: |
|
So, this is already getting close on AddPattern, and the footprint improvements improve the Matcher. |
|
Is this PR up-to-date with your latest? |
|
No, this is on https://github.com/sayrer/quamina/tree/embed-smalltable-research, which has everything in this PR, your research PR, and further patches to see if the AddPattern cost is really necessary. I didn't want to pile them all in to one review, but I did want to see whether it's possible to avoid any unpleasant trade-offs, and that's looking promising. |
|
Oh, ok, I'll pull that and have a look around tomorrow sometime. |


This one carries #521 further, fixes some bugs that caused buildshellstyle quadratic behavior, and also speeds up Nfa2dfa.