kaizen: nfa: move build-time gen state off smallTable/faState to side tables#521
Conversation
bacd3e8 to
dec9a56
Compare
|
|
|
I think Claude is being lazy here, and C is the right choice. See https://thecodinggopher.substack.com/p/gos-memory-allocator |
|
Thanks for the work and analysis. Unfortunately I'm AFK for the next few hours. Before I go… I tried using HeapAlloc instead of TotalAlloc in memory_cost.go and that produced bad results but I don't remember the details - could maybe revisit that decision. Clearly, the idea of moving all that deduping machinery off into ephemeral AddPattern-time-only storage is attractive. Let me think about this while I'm offline. I wonder if there's a way to exclude it from the memory_cost bookkeeping. But anyhow, there's no reason not to merge #519, right? |
|
I think #519 is all good. As ever, I'm not in a rush, I just send these in a flurry when I have time to think about them. There's no real urgency. |
|
I had to look this up, so by no means an expert, but I think the suggestion of |
|
Why don't you go ahead and, as an experiment, update the Also, the process of finally getting the memory_cost stuff working taught me that a couple of my assumptions are wrong, and now I have in mind a couple of cheaper and ( think) more accurate approaches so maybe I'll have re-do it. |
|
|
So, we're working on the right things here, it just needs some tuning in the tests and the accountant code. |
|
|
OK, that's looking pretty good, but still need to go in and nitpick the benchmarks. For another day. |
|
|
I thought of this while walking to get lunch. The MemoryBoundaries benchmark has really high thresholds because it was using the monotonically increasing |
|
This is about what we want to see, but it will get tougher with Nfa2Dfa. |
timbray
left a comment
There was a problem hiding this comment.
If you can rebase this (mostly lose the budget stuff) we should merge it now.
|
I won't get to it until tomorrow morning, but no problem |
|
It dawns on me that another memory-saver (8 bytes/state) would be to have an faState contain a smallTable not a *smallTable. Just embed the sucker right in there. Probably get better memory locality and a perf bump too. BUT, faState is at the core of everything and it could turn into a nontrivial refactoring. |
smallTable.lastVisitedGen, smallTable.closureGen, smallTable.closureRep
and faState.closureSetGen were all only read/written during pattern-add
time (epsilon closure computation, merge simplification, and the
test-only nfa2Dfa intern path). They sat on every table and state
permanently, bloating steady-state memory by 24 B/table and 8 B/state.
Move that state to a per-call side table held on closureBuffers (maps
keyed by *smallTable / *faState). A fresh closureBuffers is created at
the top of each epsilonClosure call and dropped when the call returns.
simplifySplices and stateLists.intern each get their own local visited
set using the same pattern. The global closureGeneration counter is no
longer needed.
Struct sizes (measured via unsafe.Sizeof):
smallTable 96 B -> 72 B -24
faState 72 B -> 64 B -8
-------
per pair -32 B
Bench (Apple M1 Ultra, n=6):
ShellStyleBuildTime-20 56.10m -> 48.77m -13.07%
958 B -> 797 B -16.81%
1 allocs -> 0 allocs
Match-time benchmarks are neutral or marginally faster (CityLots,
NumberMatching, 8259Example each show ~-1% — likely the cache-footprint
win from the shrunken structs). Passes -race.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moving build-time generation state off faState/smallTable (commit 192aaa0, cherry-picked from dec9a56) shrinks both structs. The hardcoded size constants in TestQuaminaMemoryCost, TestMcNfaSizes, and TestPP were calibrated against the pre-move sizes on main: faState/smallTable bytes per state: 1481 -> 1321 (-160 / -10.8%) prettyprinter s/t per state: 240/312 -> 216/280 (-24 / -32) TestMcBasicSizes computes its expectation from unsafe.Sizeof so it needs no update; only the hand-calibrated values do. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
898e4cd to
2326bc0
Compare
|
Still a little to check here: |
|
Hmm, having trouble parsing that table. I don't think any of those actually measure the size of the data structures after the AddPattern()s? The new quamina.GetMatcherStats() does measure that for you. That's where I was assuming the big win here would come from. |
|
I didn't try this yet. There's also a bug in |
…hers
When a valueMatcher uses the singleton-match optimization (singletonMatch
holds the raw bytes, startTable is nil — e.g. boolean-valued patterns
like `{"Animated": [false]}`), cmFieldMatcherStats was constructing
&faState{table: nil} and passing it to cmStateStats, which then crashed
on state.table.steps. The singleton byte count is already added a few
lines above; there is no NFA to walk, so just skip the nil-table case.
Add TestQuaminaMemoryCostSingleton as a regression test. Minimal repro
is q.AddPattern("p", `{"Animated": [false]}`) followed by
q.GetMatcherStats() — without this fix it panics.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Now we can run 8259. |
|
Let me know when you're happy with this PR's status and I'll have a closer look & probably merge. |
|
I'm going to do the faState thing based on this branch, but not in this branch. Then see how it looks. It's a mechanical change but I think the patch will be a monster. |
|
Also, I broke one of my own rules, always look at the red/green coverage on everything with <100%. Which would have revealed that the singleton case was never exercised, sigh. |
|
OK, take this one as ready. I did your faState idea and it worked, but it's better to get this one first. https://github.com/sayrer/quamina/tree/embed-smalltable |
timbray
left a comment
There was a problem hiding this comment.
Most of these are nits, one or two real (minor) issues.
Curious: Your table shows the memory reduction, which is good, did you observe any speedup?
Speedup for the regex-heavy ones #521 (comment), slight problems with the table-dedup ones (like a couple percent), but the follow-up faState one makes it all moot. I checked that everything was the same (made Claude go through and check the state space and the assembly), it was just a matter of whether the GC from build-time happened during match time. A benchmark artifact in my view. |
|
I think the best approach is to make things smaller so they start fitting within cache lines where they didn't before, then really tune those ~2% things. But there are no algorithmic issues that I can spot. I'll follow up on your other comments. |
Tim flagged that the singleton-match optimization isn't specific to booleans — it applies to any field with a single string or literal value, where the matcher can use bytes.Compare instead of building an FA. Reword the regression-test comment accordingly; keep the boolean pattern as the minimal repro rather than the defining case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The side-tables refactor (32dc2a9) replaced field-level comments with a block comment about the struct's purpose. The block comment explains why two maps exist but stopped documenting what each field is for. Put the field-level docs back; matches the pre-refactor style for generation/closureSetGen/closureList and adds documentation for the two new map fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tim flagged that `visited[s] = struct{}{}` isn't a familiar idiom.
Add a comment above the insert line explaining map[K]struct{} is
Go's keys-only set type (struct{}{} is zero bytes, so the value half
is free), with the insert and membership idioms.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Tim — the byte cost reported by GetMatcherStats should reflect the backing array footprint, matching how mcSmallTable / mcFaState already count their slice fields (cap, not len). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Tim — the rest of the codebase (match_set, stats, matcher,
memory_cost, anything_but) uses map[K]bool for sets. The two newer
sites that used map[K]struct{} (stateLists.seen and simplifyCollect's
visited) were stylistic outliers; switch to bool for consistency.
Also drops the comment I just added explaining the struct{}{} idiom —
no longer needed since the code is now plain map[K]bool with
`if m[k]` / `m[k] = true`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The seen-field type swap (struct{} → bool) narrowed the field type,
which gofmt then re-aligned the adjacent comments around. Apply the
formatter.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
You done? |
I think so, just running the benchmarks. |
|
So, what this patch does is improve NFA performance. But, it is very bad for eager DFA performance (but not lazy DFA). So, it's all a bit mixed up. I'd argue to just make the NFA path as fast as possible and then go after DFA. Gory details: https://gist.github.com/sayrer/1bcd6364d08305c8c3136a7e2d65d09c |
I looked at your gist and no surprises there. But I don't get why this PR is "bad for eager DFA". |
|
oh, I actually read it wrong yesterday. Eager is faster but unbounded. I think it's ready then. |
See: #518. I have a patch for each of the points there, but it's difficult to measure the impact all at once.
smallTable.lastVisitedGen, smallTable.closureGen, smallTable.closureRep and faState.closureSetGen were all only read/written during pattern-add time (epsilon closure computation, merge simplification, and the test-only nfa2Dfa intern path). They sat on every table and state permanently, bloating steady-state memory by 24 B/table and 8 B/state.
Move that state to a per-call side table held on closureBuffers (maps keyed by *smallTable / *faState). A fresh closureBuffers is created at the top of each epsilonClosure call and dropped when the call returns. simplifySplices and stateLists.intern each get their own local visited set using the same pattern. The global closureGeneration counter is no longer needed.
Struct sizes (measured via unsafe.Sizeof):
smallTable 96 B -> 72 B -24
faState 72 B -> 64 B -8
-------
per pair -32 B
Bench (Apple M1 Ultra, n=6):
ShellStyleBuildTime-20 56.10m -> 48.77m -13.07%
958 B -> 797 B -16.81%
1 allocs -> 0 allocs
Match-time benchmarks are neutral or marginally faster (CityLots, NumberMatching, 8259Example each show ~-1% — likely the cache-footprint win from the shrunken structs). Passes -race.