Skip to content

kaizen: Embed smalltable#535

Open
sayrer wants to merge 7 commits into
timbray:mainfrom
sayrer:embed-smalltable
Open

kaizen: Embed smalltable#535
sayrer wants to merge 7 commits into
timbray:mainfrom
sayrer:embed-smalltable

Conversation

@sayrer
Copy link
Copy Markdown
Contributor

@sayrer sayrer commented May 31, 2026

This one carries #521 further, fixes some bugs that caused buildshellstyle quadratic behavior, and also speeds up Nfa2dfa.

sayrer and others added 7 commits May 30, 2026 14:33
Adds tableShareKey, a stable identifier for the share group of
slice-headers inside a smallTable. After smallTable is embedded
into faState by value, two states that hold copies of one source
smallTable still share the underlying steps/ceilings/epsilons
backing arrays — and a key built from SliceData(steps) + len(steps)
captures that equivalence. This replaces *smallTable-pointer-identity
as the dedup key in the next commit.

No behavioral change yet; this commit only adds the helper and unit
tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switches closureBuffers.tables from map[*smallTable]*tableMark to
map[tableShareKey]*tableMark. With *smallTable pointer identity
still intact (pre-embed), the new key is equivalent: two states
that share a *smallTable trivially share their steps backing.

No behavioral change. Sets up the next commit, which embeds
smallTable in faState by value — at which point pointer-identity
goes away but slice-backing identity remains.
faState.table changes from *smallTable to smallTable (inline). This
shrinks per-state memory:
  - faState 64B + smallTable 80B-class = 144B per state pair
  - embedded faState fits 128B size class = 128B per state
  - saves 16B/state on workloads without table-pointer sharing

vmFields.startTable becomes startState *faState (the start state
owns the start table inline). traverseNFA, epsilonClosure, and
related helpers take *faState instead of *smallTable.

Epsilon-closure dedup continues to work via tableShareKey (added in
the previous two commits) — value-copies of one source smallTable
still share their slice backings, and SliceData(steps) is the new
identity.

Size-assertion tests are not yet recalibrated; that follows in the
next commit.
Per-state size shrunk (faState now 128B with inline smallTable). Update
hand-calibrated constants in TestMcBasicSizes, TestQuaminaMemoryCost,
TestMcNfaSizes. Update TestPP s/t numbers and restore trailing whitespace
in wanted literal (producer still emits "[s/t X/Y] \n" with the trailing
space before the newline).

TestTablePointerDedup's tableSharing and totalEntries expectations also
needed updating — the dedup metric now uses slice-backing identity
(via tableShareKey from earlier commits) rather than *smallTable pointer
identity, so value-copies of a source smallTable register as shared.
Recalibrated to the new ground-truth values.

Minor cleanups: stale "startTable" reference in TestQuaminaMemoryCostSingleton
comment (renamed to startState); inline a dead local in copyShellNode.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Port of the build-context-extract fix (6947edf) to the embedded-smallTable
design. embed-smalltable inherited the same two regressions from the shared
32dc2a9 ancestor; shellstyle build at 1000 words was ~7x slower than main.

1. closureForNfa's walk dedup was broken. The refactor collapsed two
   independent counters into a single bufs.gen that closureForState mutates,
   so the walk's visited check never matched after the first state and the
   heavily-shared shellstyle graph got re-traversed (O(V*E)). Restored via
   bufs.walkGen, a snapshot closureForState never touches.

2. Per-call allocation. epsilonClosure allocated fresh maps per call and
   tableMarkOf heap-allocated a *tableMark per share group. closureBuffers
   are now pooled (sync.Pool, GC-reclaimable so no steady-state cost), maps
   are reused via a monotonic generation (no clearing), and tableMark is
   stored by value.

Unlike the *smallTable-keyed bce branch, the walk here dedups by *faState
identity rather than tableShareKey. Share-key dedup is unsafe for the walk:
distinct states can share a steps backing array yet have different epsilons,
and the zero key collapses all no-byte tables. The faState pointer is the
natural unique identity now that smallTable is embedded by value. The
post-pass table-pointer dedup keeps tableShareKey (collision-safe there, as
it re-checks sameFieldTransitions) and now skips the zero key explicitly.

Shellstyle build at 1000 words: 1363ms -> 473ms. Full suite passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
nfa2Dfa/intern is slated to go live. Profiling BenchmarkNfa2Dfa showed it
~64% slower than main, entirely inside intern()'s dedup: clear(sl.seen) plus
a per-state map assign into a map[*faState]struct{} cost ~600ms where main's
faState.closureSetGen generation-counter compare cost ~50ms. That field was
removed to shrink steady-state memory, so the map was the fallback.

intern already sorts the state set by pointer to build a canonical key, so
duplicates are adjacent after the sort. Replacing the map-based dedup with
slices.Compact over the sorted buffer removes the map (and its clear()) with
no per-faState field and no extra sort — sorting was already happening.

Nfa2Dfa vs main (geomean, n=6): time +63.8% -> +2.65%, B/op -1.05% -> -1.65%,
allocs/op -7.8% -> -9.6%. Full suite passes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
# Conflicts:
#	epsi_closure.go
#	memory_cost_test.go
#	state_lists.go
@timbray
Copy link
Copy Markdown
Owner

timbray commented May 31, 2026

(afk for a while, will take a closer look)

@sayrer sayrer changed the title Embed smalltable kaizen: Embed smalltable May 31, 2026
@timbray
Copy link
Copy Markdown
Owner

timbray commented May 31, 2026

OK, checked out 535 and ran the research-main CSV generator. I'll attach the xlsx files and a graph of sec/100-addP-calls

may30 pr535 [2026-05-31.xlsx](https://github.com/user-attachments/files/28443966/2026-05-31.xlsx) [2026-05-30.xlsx](https://github.com/user-attachments/files/28443967/2026-05-30.xlsx)

@timbray
Copy link
Copy Markdown
Owner

timbray commented May 31, 2026

Now, granted, all these AddPattern calls are wildcards, and the epsilon merging gets kind of ugly, so probably not a representative workload.

@timbray
Copy link
Copy Markdown
Owner

timbray commented May 31, 2026

I'm just trying to figure out whether I care about this. Don't have an opinion yet.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 31, 2026

I just got that workload, seems like I can do some damage to it with some small changes.

@timbray
Copy link
Copy Markdown
Owner

timbray commented May 31, 2026

I gotta say, if those numbers are correct, the permanent memory saving is just not that dramatic and the AddPattern impact is pretty significant. So maybe the thing to do is back out the last PR and then cherrypick the good stuff from the faState/smallTable branch with the add-ons that made nfa2Dfa faster and so on?

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 31, 2026

I think I can get it with no regression, just measuring.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 31, 2026

Even just the first step went well. The closure walk was improved, and had been going over the more space-efficient structures too much.

That's here:
5fe237b

  Incremental epsilon closure — results

  Plan executed (commits on embed-smalltable-research): spec 5fe237b → plan
  534327a → immutability gate 18547f6 → order-independence guard a6c44a9 →
  prune 5976a4f → review nits f96e39a (plus 20c980d research -cpuprofile
  flag).

  The win (10k-pattern research harness):

  ┌──────────────────────┬──────────────┬─────────────────────────────────┐
  │                      │ before prune │           after prune           │
  ├──────────────────────┼──────────────┼─────────────────────────────────┤
  │ patterns/sec         │ 133–140      │ 418 (3×)                        │
  ├──────────────────────┼──────────────┼─────────────────────────────────┤
  │ wall time            │ ~72 s        │ 24.6 s                          │
  ├──────────────────────┼──────────────┼─────────────────────────────────┤
  │ build-scale @ N=1000 │ 473 ms       │ 225 ms (~2.1×)                  │
  ├──────────────────────┼──────────────┼─────────────────────────────────┤
  │ closureForNfa flat   │ 8.46 s       │ 0.06 s (walk re-traversal gone) │
  └──────────────────────┴──────────────┴─────────────────────────────────┘

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 31, 2026

So, this is already getting close on AddPattern, and the footprint improvements improve the Matcher.

@timbray
Copy link
Copy Markdown
Owner

timbray commented May 31, 2026

Is this PR up-to-date with your latest?

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Jun 1, 2026

No, this is on https://github.com/sayrer/quamina/tree/embed-smalltable-research, which has everything in this PR, your research PR, and further patches to see if the AddPattern cost is really necessary. I didn't want to pile them all in to one review, but I did want to see whether it's possible to avoid any unpleasant trade-offs, and that's looking promising.

@timbray
Copy link
Copy Markdown
Owner

timbray commented Jun 1, 2026

Oh, ok, I'll pull that and have a look around tomorrow sometime.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants