kaizen: nfa: move build-time gen state off smallTable/faState to side tables by sayrer · Pull Request #521 · timbray/quamina

sayrer · 2026-04-16T23:59:13Z

See: #518. I have a patch for each of the points there, but it's difficult to measure the impact all at once.

smallTable.lastVisitedGen, smallTable.closureGen, smallTable.closureRep and faState.closureSetGen were all only read/written during pattern-add time (epsilon closure computation, merge simplification, and the test-only nfa2Dfa intern path). They sat on every table and state permanently, bloating steady-state memory by 24 B/table and 8 B/state.

Move that state to a per-call side table held on closureBuffers (maps keyed by *smallTable / *faState). A fresh closureBuffers is created at the top of each epsilonClosure call and dropped when the call returns. simplifySplices and stateLists.intern each get their own local visited set using the same pattern. The global closureGeneration counter is no longer needed.

Struct sizes (measured via unsafe.Sizeof):

smallTable 96 B -> 72 B -24
faState 72 B -> 64 B -8
-------
per pair -32 B

Bench (Apple M1 Ultra, n=6):

ShellStyleBuildTime-20 56.10m -> 48.77m -13.07%
958 B -> 797 B -16.81%
1 allocs -> 0 allocs

Match-time benchmarks are neutral or marginally faster (CityLots, NumberMatching, 8259Example each show ~-1% — likely the cache-footprint win from the shrunken structs). Passes -race.

sayrer · 2026-04-17T18:05:00Z

Significant tradeoff found:

  ┌──────────────────┬───────────┬───────────────────────┬───────────┐
  │      Metric      │   main    │ build-context-extract │     Δ     │
  ├──────────────────┼───────────┼───────────────────────┼───────────┤
  │ sec/op           │ 19.17     │ 12.79                 │ –33.3% 🎉 │
  ├──────────────────┼───────────┼───────────────────────┼───────────┤
  │ B/op             │ 3.301 GiB │ 3.277 GiB             │ –0.7%     │
  ├──────────────────┼───────────┼───────────────────────┼───────────┤
  │ allocs/op        │ 5.234 M   │ 36.80 M               │ +603% 🚨  │
  ├──────────────────┼───────────┼───────────────────────┼───────────┤
  │ Patterns @100 M  │ 1307      │ 462                   │ –65% 🚨   │
  ├──────────────────┼───────────┼───────────────────────┼───────────┤
  │ Patterns @2000 M │ 7008      │ 2342                  │ –65%      │
  ├──────────────────┼───────────┼───────────────────────┼───────────┤
  │ Total patterns   │ 17501     │ 5902                  │ –66%      │
  └──────────────────┴───────────┴───────────────────────┴───────────┘


  The regression: SetMemoryBudget uses runtime.ReadMemStats.TotalAlloc to
  decide when to stop adding. Our per-call side-table maps generate lots of
  short-lived allocations during build (the +603% allocs/op). Those allocs
  get GC'd and never show up as steady-state memory (so B/op is fine), but
  they inflate TotalAlloc, which the budget accountant reads as "memory
  used." Result: the user-visible SetMemoryBudget is 3× less generous.

  Struct sizes still shrink as advertised — this is purely a TotalAlloc
  accounting interaction.```

sayrer · 2026-04-17T18:27:04Z

                                                                    
  With value-stored tableMark (no *tableMark allocation per entry): 
                                                                    
  ┌────────────────┬───────────┬───────────────┬───────────┐        
  │     Metric     │   main    │ bce value-sem │     Δ     │        
  ├────────────────┼───────────┼───────────────┼───────────┤        
  │ sec/op         │ 19.17     │ 12.29         │ –35.9%    │
  ├────────────────┼───────────┼───────────────┼───────────┤
  │ B/op           │ 3.301 GiB │ 3.275 GiB     │ –0.8%     │        
  ├────────────────┼───────────┼───────────────┼───────────┤        
  │ allocs/op      │ 5.234 M   │ 2.133 M       │ –59.3% 🎉 │        
  ├────────────────┼───────────┼───────────────┼───────────┤        
  │ Total patterns │ 17501     │ 5579          │ –68% 🚨   │
  └────────────────┴───────────┴───────────────┴───────────┘        
  

  The allocs went below main's — fewer maps / smaller per-entry. But
  patterns-per-budget still craters. That tells us the problem isn't alloc
  count; it's the bytes-per-pattern as measured by TotalAlloc:
                                                                              
  - main: ~197 KB/pattern (all permanent FA structs)  
  - bce: ~587 KB/pattern (FA structs + transient maps growing + being
    discarded)
                                                                              
  The TotalAlloc accountant (memory_cost.go:82) deliberately counts transient
  allocations too — per its comment, because during build  the old matcher is
  still live alongside the new one. That choice makes any side-table approach a
  loss for the budget feature,        regardless of pooling or value-semantics
  tricks.               
                  
  Paths forward

  A. Keep this branch as-is — ship the real wins (–36% time, –59% allocs,
  smaller structs in steady state). Users relying on           SetMemoryBudget
  see ~3× fewer patterns supported per budget byte. The budget is a heuristic
  anyway.
                                                                              
  B. Revert build-context-extract, switch to suggestion #3 (int32 counters +
  field reorder). Saves 16 B/pair permanently (half of the  32 B/pair the
  current branch would save at steady state), zero transient allocation, zero
  budget regression. Safest.
                                                                              
  C. Ship build-context-extract + fix memory_cost.go to use HeapInuse. Bigger
  scope, crosses into Tim's memory-budget work (#516).     
  
  My recommendation: B. The budget regression in A is user-visible and
  surprising; the patterns-per-budget number is exactly the thing Tim's #516
  was optimizing. C is better semantically but crosses PR boundaries. B gets
  half the steady-state win with none of the downside.

sayrer · 2026-04-17T18:28:45Z

I think Claude is being lazy here, and C is the right choice. See https://thecodinggopher.substack.com/p/gos-memory-allocator

timbray · 2026-04-17T18:37:33Z

Thanks for the work and analysis. Unfortunately I'm AFK for the next few hours. Before I go… I tried using HeapAlloc instead of TotalAlloc in memory_cost.go and that produced bad results but I don't remember the details - could maybe revisit that decision. Clearly, the idea of moving all that deduping machinery off into ephemeral AddPattern-time-only storage is attractive. Let me think about this while I'm offline. I wonder if there's a way to exclude it from the memory_cost bookkeeping.

But anyhow, there's no reason not to merge #519, right?

sayrer · 2026-04-17T18:39:49Z

I think #519 is all good. As ever, I'm not in a rush, I just send these in a flurry when I have time to think about them. There's no real urgency.

sayrer · 2026-04-17T19:21:38Z

I had to look this up, so by no means an expert, but I think the suggestion of HeapInuse not (HeapAlloc) could make sense. I think TotalAlloc only goes up. My general thought pattern is that the DFA stuff will be way faster, but stress this part of the program.

timbray · 2026-04-17T21:07:53Z

Why don't you go ahead and, as an experiment, update the bytesAllocated call in memory_cost.go to use HeapInUse and see what happens?

Also, the process of finally getting the memory_cost stuff working taught me that a couple of my assumptions are wrong, and now I have in mind a couple of cheaper and ( think) more accurate approaches so maybe I'll have re-do it.

sayrer · 2026-04-18T18:21:25Z

⏺ Comparison: main vs build-context-extract, BenchmarkMemoryBoundaries  
                                                                                
  Answering why the bench is slow is actually the story: under
  build-context-extract + HeapAlloc, the budget is so much more generous  that
  the bench's fixed pattern set can't saturate it.  
                                                                                
  Patterns-per-budget (same 12959 shellstyle patterns, same hardware)

                                                                                
  ┌────────────────────┬───────────────────┬──────────────────────────────────────────────────┐
  │       Budget       │ main (TotalAlloc) │        build-context-extract (HeapAlloc)         │
  ├────────────────────┼───────────────────┼──────────────────────────────────────────────────┤
  │ 100 M              │ 1306              │ 12959 (all patterns fit)                         │
  ├────────────────────┼───────────────────┼──────────────────────────────────────────────────┤
  │ 200 M              │ 1978              │ 12959 (all patterns fit)                         │
  ├────────────────────┼───────────────────┼──────────────────────────────────────────────────┤
  │ 400 M              │ 2914              │ (didn't measure — can't exceed 12959)            │
  ├────────────────────┼───────────────────┼──────────────────────────────────────────────────┤
  │ 800 M              │ 4294              │ —                                                │
  ├────────────────────┼───────────────────┼──────────────────────────────────────────────────┤
  │ 2000 M             │ 7008              │ —                                                │
  ├────────────────────┼───────────────────┼──────────────────────────────────────────────────┤
  │ Total across tiers │ 17501             │ ~5× the same 12959, capped by the pattern supply │
  └────────────────────┴───────────────────┴──────────────────────────────────────────────────┘
   

  Wall-clock per b.Loop() iteration  
                                                                                  
  - main: ~19 s (bench hits budget cap fast, abandons each tier)  
  - build-context-extract: ~2222 s (bench loads all 12959 patterns × 5 tiers ×
    ~34 ms/add = ~64K pattern-adds)                       
                                                                                  
  The 116× slowdown of the bench is misleading — it's not that BCE is slower per
  pattern-add, it's that the bench is completing the    work the budget now
  permits. Per-pattern cost is actually similar.  
                                                                                  
  What this means 

  - Under the old TotalAlloc accounting, the 100 M budget rejected patterns
    after ~10% of the set. That rejection was spurious — it was counting
    transient allocations that GC would reclaim.
  - Under HeapAlloc, the 100 M budget fits the entire test corpus. Users can
    load ~10× more patterns per budget byte, which is real    capacity they
    were paying for but never getting.  
  - Combined with the struct shrinkagefrom build-context-extract (–32 B/pair),
    the change directly addresses the "too much memory" motivation of
    issue #518.                                            
                  
 What still needs doing                                                       
 
  The bench itself should grow a larger pattern corpus (or an
  expanding-quantifier pattern generator) so it can still exert budget pressure
  under HeapAlloc — otherwise it'll just time out on big matchers and we can't
  distinguish "budget works" from "pattern supply exhausted." Worth raising
  when Tim does the memory_cost rework he mentioned.

sayrer · 2026-04-18T18:25:52Z

So, we're working on the right things here, it just needs some tuning in the tests and the accountant code. HeapAlloc ended up working better, because it's more granular, and Quamina makes a lot of small transient allocations (instantly and cheaply reclaimed often). By doing this, we're converging on what the Rust one does with an arena allocator.

sayrer · 2026-04-18T18:41:27Z

HeapAlloc vs HeapInuse                                                     
                                                                                
  Both come from runtime.MemStats and both describe currently-held heap memory
  (unlike TotalAlloc, which is cumulative). The difference is granularity.  
                                                                                
  HeapAlloc       

  - Bytes occupied by live-plus-unreclaimed heap objects.  
  - Grows on every allocation; shrinks the moment GC frees an object.
  - Byte-precise: allocate a 16-byte object, HeapAlloc goes up by 16.  
                                                                                
  HeapInuse                                                       
                                                                                
  - Bytes in heap spans that contain at least one object.  
  - Grows when a newspan is claimed; shrinks only when a span becomes fully
    empty.
  - Span-granular: allocate a 16-byte object, HeapInuse goes up by the full span
    size (typically 8192 bytes).                          
                                                                                
  Why the delta exists                                                       
                                                                                
  Go's allocator works in size-class spans — chunks of memory (often 8 KB)
  subdivided into fixed-size slots.  A span counts as "in use" as long as any
  one slot is occupied.
                                                                                
  Allocate one 16-byte object → lands in one slot of an 8192-byte span.
                                                                                
    HeapAlloc  = 16       (just the object) HeapInuse  = 8192     (the whole
    span)                       
                                                                                
  As more 16-byte objects fill the same span:  
  - HeapAlloc grows linearly (16, 32, 48, …).
  - HeapInuse stays flat at 8192 until the span fills, then jumps to 16384 when
    a second span is claimed.                              
                                                                                
  Free 511 of 512 slots:                                                       
 - HeapAlloc drops to 16.  
  - HeapInuse stays 8192 (one live object keeps thespan "in use").  
                                                                                
  So HeapInuse ≥ HeapAlloc always. The gap is fragmentation overhead — slots
  committed to size classes but currently empty.            
                                                                                
  Practical picks                                                       
                                                                                
  ┌───────────────────────────────────────────────────────┬───────────────────────────┐ 
  │                       Question                        │        Right field        │
  ├───────────────────────────────────────────────────────┼───────────────────────────┤ 
  │ "How much do my live objects occupy?"                 │ HeapAlloc                 │
  ├───────────────────────────────────────────────────────┼───────────────────────────┤
  │ "How much memory is the runtime holding from the OS?" │ HeapInuse (closer to RSS) │ 
  ├───────────────────────────────────────────────────────┼───────────────────────────┤ 
  │ "Will I OOM?"                                         │ HeapInuse or HeapSys      │ 
  ├───────────────────────────────────────────────────────┼───────────────────────────┤ 
  │ Byte-precise measurement for fine-grained budgets?    │ HeapAlloc                 │
  └───────────────────────────────────────────────────────┴───────────────────────────┘ 

  Why we picked HeapAlloc for Quamina's budget  
                  
  Quamina allocates many small objects (*faState, *smallTable, pointer slices).
  In a warm process those slip into slots of             already-claimed
  spans — HeapInuse stays flat even as the matcher grows, and the budget
  under-counts retention.
                                                                                        
  HeapAlloc tracks every byte, so the budget reflects actual per-object growth.
  Tradeoff: it omits span overhead ("fragmentation"), so the user's ceiling
  is "bytes my live objects  occupy" rather than "bytes my process holds in spans"
 — fine for Quamina's shape, less honest for a workload of few-but-large objects.

sayrer · 2026-04-18T19:08:23Z

OK, that's looking pretty good, but still need to go in and nitpick the benchmarks. For another day.

sayrer · 2026-04-18T19:21:16Z

  ┌────────────────────────────────────────────┬──────────┬─────────┬──────────┐                                                       
  │                 Benchmark                  │ sec/op Δ │ B/op Δ  │ allocs Δ │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ ShellStyleBuildTime                        │ −30.78%  │ −33.40% │ 1 → 0    │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ 8259Example                                │ −1.77%   │ —       │ —        │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ TablePointerDedup/deeply-nested            │ −0.90%   │ —       │ —        │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ TablePointerDedup/overlapping-char-classes │ −1.97%   │ —       │ —        │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ TablePointerDedup/shell+deep-overlap       │ −1.90%   │ —       │ —        │
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ CityLots                                   │ ~        │ —       │ —        │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ NumberMatching                             │ ~        │ —       │ —        │
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ ShellstyleMultiMatch                       │ ~        │ —       │ —        │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤
  │ TablePointerDedup/6-regexps-12-shell       │ ~        │ —       │ —        │
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ TablePointerDedup/20-nested-regexps        │ ~        │ —       │ —        │
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ geomean                                    │ −4.66%   │ −3.98%  │          │ 
  └────────────────────────────────────────────┴──────────┴─────────┴──────────┘ 

   
  Headline: ShellStyleBuildTime drops −31% time / −33% memory / 1 → 0 allocs —
  this bench exercises a lot of NFA construction where the struct-shrink(+ step
  () simplification already on main) compounds. Other match-path benchmarks are
  neutral or slightly better (≤2%), reflecting the cache-footprint win from the
  smaller faState/smallTable.           
                                                                                 
  No regressions in the ±5% band. Memory-boundaries (which blew up earlier)
  showed the real user-facing win: same 100 M budget now fits ~10× more
  patterns.

sayrer · 2026-04-18T21:09:42Z

I thought of this while walking to get lunch. The MemoryBoundaries benchmark has really high thresholds because it was using the monotonically increasing TotalAlloc. But it should be HeapAlloc at 2,4,8,16,32,64 MB (RE2 default is 8MB)

sayrer · 2026-04-19T19:42:27Z

This is about what we want to see, but it will get tougher with Nfa2Dfa.

HeapAlloc-scale:

  ┌─────────────────┬───────────────────────┬───────────┐
  │     Budget      │       Patterns        │ B/pattern │
  ├─────────────────┼───────────────────────┼───────────┤
  │ 2 M             │ 101                   │ 20.3 K    │
  ├─────────────────┼───────────────────────┼───────────┤
  │ 4 M             │ 482                   │ 8.5 K     │
  ├─────────────────┼───────────────────────┼───────────┤
  │ 8 M (RE2-equiv) │ 1398                  │ 5.9 K     │
  ├─────────────────┼───────────────────────┼───────────┤
  │ 16 M            │ 2890                  │ 5.7 K     │
  ├─────────────────┼───────────────────────┼───────────┤
  │ 32 M            │ 6335                  │ 5.2 K     │
  ├─────────────────┼───────────────────────┼───────────┤
  │ 64 M            │ all 12959 (untripped) │ —         │
  └─────────────────┴───────────────────────┴───────────┘


  Per-pattern cost drops with budget (amortization across shared NFA structure),
  as you'd expect from a well-behaved matcher.

timbray

If you can rebase this (mostly lose the budget stuff) we should merge it now.

sayrer · 2026-05-30T00:46:57Z

I won't get to it until tomorrow morning, but no problem

timbray · 2026-05-30T16:44:32Z

It dawns on me that another memory-saver (8 bytes/state) would be to have an faState contain a smallTable not a *smallTable. Just embed the sucker right in there. Probably get better memory locality and a perf bump too. BUT, faState is at the core of everything and it could turn into a nontrivial refactoring.

smallTable.lastVisitedGen, smallTable.closureGen, smallTable.closureRep and faState.closureSetGen were all only read/written during pattern-add time (epsilon closure computation, merge simplification, and the test-only nfa2Dfa intern path). They sat on every table and state permanently, bloating steady-state memory by 24 B/table and 8 B/state. Move that state to a per-call side table held on closureBuffers (maps keyed by *smallTable / *faState). A fresh closureBuffers is created at the top of each epsilonClosure call and dropped when the call returns. simplifySplices and stateLists.intern each get their own local visited set using the same pattern. The global closureGeneration counter is no longer needed. Struct sizes (measured via unsafe.Sizeof): smallTable 96 B -> 72 B -24 faState 72 B -> 64 B -8 ------- per pair -32 B Bench (Apple M1 Ultra, n=6): ShellStyleBuildTime-20 56.10m -> 48.77m -13.07% 958 B -> 797 B -16.81% 1 allocs -> 0 allocs Match-time benchmarks are neutral or marginally faster (CityLots, NumberMatching, 8259Example each show ~-1% — likely the cache-footprint win from the shrunken structs). Passes -race. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Moving build-time generation state off faState/smallTable (commit 192aaa0, cherry-picked from dec9a56) shrinks both structs. The hardcoded size constants in TestQuaminaMemoryCost, TestMcNfaSizes, and TestPP were calibrated against the pre-move sizes on main: faState/smallTable bytes per state: 1481 -> 1321 (-160 / -10.8%) prettyprinter s/t per state: 240/312 -> 216/280 (-24 / -32) TestMcBasicSizes computes its expectation from unsafe.Sizeof so it needs no update; only the hand-calibrated values do. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sayrer · 2026-05-30T18:37:01Z

Still a little to check here:

  ┌────────────────────────────────────────────┬─────────────┬───────────────┬─────────┬───────────┬─────────────┬─────────┐
  │                 Benchmark                  │ main sec/op │ branch sec/op │    Δ    │ main B/op │ branch B/op │    Δ    │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ ShellStyleBuildTime                        │ 41.07m ± 1% │ 35.80m ± 5%   │ −12.83% │ 684 ± 4%  │ 598 ± 3%    │ −12.57% │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ CityLots                                   │ 5.287µ ± 1% │ 5.312µ ± 1%   │ ~       │ 14        │ 14          │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ 8259Example                                │ 5.392µ ± 3% │ 5.519µ ± 1%   │ ~       │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ NumberMatching                             │ 413.1n ± 3% │ 413.0n ± 2%   │ ~       │ 280       │ 280         │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ ShellstyleMultiMatch                       │ 27.59µ ± 4% │ 28.72µ ± 3%   │ +4.10%  │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ TablePointerDedup/6-regexps-12-shell       │ 6.733µ ± 1% │ 7.208µ ± 1%   │ +7.05%  │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ TablePointerDedup/20-nested-regexps        │ 759.2n ± 1% │ 766.2n ± 2%   │ +0.92%  │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ TablePointerDedup/deeply-nested            │ 793.4n ± 1% │ 803.3n ± 1%   │ +1.24%  │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ TablePointerDedup/overlapping-char-classes │ 596.5n ± 2% │ 601.0n ± 2%   │ ~       │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ TablePointerDedup/shell+deep-overlap       │ 8.954µ ± 2% │ 9.341µ ± 2%   │ +4.32%  │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ geomean                                    │ 7.016µ      │ 7.065µ        │ +0.70%  │ —         │ —           │ −1.33%  │

timbray · 2026-05-30T18:42:34Z

Hmm, having trouble parsing that table. I don't think any of those actually measure the size of the data structures after the AddPattern()s? The new quamina.GetMatcherStats() does measure that for you. That's where I was assuming the big win here would come from.

sayrer · 2026-05-30T18:47:32Z

  Main vs branch — steady-state matcher size

  ┌──────────────────────────────┬────────┬────────────┬──────────────┬──────────┬─────────┬──────────────┬────────────────┬───────────┐
  │           Workload           │ states │ main bytes │ branch bytes │ Δ bytes  │   Δ %   │ main B/state │ branch B/state │ Δ B/state │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ ShellStyleBuild-1000         │  7,409 │  3,157,290 │    2,920,202 │ −237,088 │  −7.51% │        426.1 │          394.1 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/6-regexps-12-shell       │  1,102 │    420,405 │      385,141 │  −35,264 │  −8.39% │        381.5 │          349.5 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/20-nested-regexps        │    150 │     41,483 │       36,683 │   −4,800 │ −11.57% │        276.6 │          244.6 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/deeply-nested            │     60 │     20,483 │       18,563 │   −1,920 │  −9.37% │        341.4 │          309.4 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/overlapping-char-classes │     86 │     23,561 │       20,809 │   −2,752 │ −11.68% │        274.0 │          242.0 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/shell+deep-overlap       │    838 │    312,541 │      285,725 │  −26,816 │  −8.58% │        373.0 │          341.0 │     −32.0 │
  └──────────────────────────────┴────────┴────────────┴──────────────┴──────────┴─────────┴──────────────┴────────────────┴───────────┘

sayrer · 2026-05-30T18:49:40Z

It dawns on me that another memory-saver (8 bytes/state) would be to have an faState contain a smallTable not a *smallTable. Just embed the sucker right in there. Probably get better memory locality and a perf bump too. BUT, faState is at the core of everything and it could turn into a nontrivial refactoring.

I didn't try this yet. There's also a bug in cmStateStats that I'll get first.

…hers When a valueMatcher uses the singleton-match optimization (singletonMatch holds the raw bytes, startTable is nil — e.g. boolean-valued patterns like `{"Animated": [false]}`), cmFieldMatcherStats was constructing &faState{table: nil} and passing it to cmStateStats, which then crashed on state.table.steps. The singleton byte count is already added a few lines above; there is no NFA to walk, so just skip the nil-table case. Add TestQuaminaMemoryCostSingleton as a regression test. Minimal repro is q.AddPattern("p", `{"Animated": [false]}`) followed by q.GetMatcherStats() — without this fix it panics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sayrer · 2026-05-30T19:06:01Z

Now we can run 8259.

  Main vs branch — steady-state matcher size (with fix)

  ┌──────────────────────────────┬────────┬────────────┬──────────────┬──────────┬─────────┬──────────────┬────────────────┬───────────┐
  │           Workload           │ states │ main bytes │ branch bytes │ Δ bytes  │   Δ %   │ main B/state │ branch B/state │ Δ B/state │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ 8259Example                  │ 20,360 │  6,854,729 │    6,203,209 │ −651,520 │  −9.50% │        336.7 │          304.7 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ ShellStyleBuild-1000         │  7,409 │  3,157,290 │    2,920,202 │ −237,088 │  −7.51% │        426.1 │          394.1 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/6-regexps-12-shell       │  1,102 │    420,405 │      385,141 │  −35,264 │  −8.39% │        381.5 │          349.5 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/20-nested-regexps        │    150 │     41,483 │       36,683 │   −4,800 │ −11.57% │        276.6 │          244.6 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/deeply-nested            │     60 │     20,483 │       18,563 │   −1,920 │  −9.37% │        341.4 │          309.4 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/overlapping-char-classes │     86 │     23,561 │       20,809 │   −2,752 │ −11.68% │        274.0 │          242.0 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/shell+deep-overlap       │    838 │    312,541 │      285,725 │  −26,816 │  −8.58% │        373.0 │          341.0 │     −32.0 │
  └──────────────────────────────┴────────┴────────────┴──────────────┴──────────┴─────────┴──────────────┴────────────────┴───────────┘

timbray · 2026-05-30T19:12:04Z

Let me know when you're happy with this PR's status and I'll have a closer look & probably merge.

sayrer · 2026-05-30T19:14:45Z

I'm going to do the faState thing based on this branch, but not in this branch. Then see how it looks. It's a mechanical change but I think the patch will be a monster.

timbray · 2026-05-30T19:15:17Z

Also, I broke one of my own rules, always look at the red/green coverage on everything with <100%. Which would have revealed that the singleton case was never exercised, sigh.

sayrer · 2026-05-30T21:37:20Z

OK, take this one as ready. I did your faState idea and it worked, but it's better to get this one first.

https://github.com/sayrer/quamina/tree/embed-smalltable

  - Build-time: ShellStyleBuildTime −17.33% sec/op, −18.23% B/op vs build-context-extract tip.
  - Match-time: every benchmark either improved (3.5–8.6%) or was statistically flat. Geomean −4.77%.
  - Steady-state matcher size: ~8 bytes/state lighter across all workloads (the embed eliminates the *smallTable pointer
  field + the separate 80B-class smallTable allocation; per-state allocation count halved).
  - Dedup behavior preserved: switched from *smallTable pointer identity to slice-backing identity (tableShareKey keyed by
  unsafe.SliceData(steps) + len), validated via the existing TablePointerDedup benchmarks (5 of 5 within ±1% or faster) and
  the mutation audit (all 21 state.table.X = … sites in production code are safe under the construction-time-only
  invariant).
  
  The branch stacks on build-context-extract (which has its own open PR). When/if build-context-extract lands, this branch
  rebases cleanly onto main.

timbray

Most of these are nits, one or two real (minor) issues.

Curious: Your table shows the memory reduction, which is good, did you observe any speedup?

sayrer · 2026-05-30T22:23:05Z

Most of these are nits, one or two real (minor) issues.

Curious: Your table shows the memory reduction, which is good, did you observe any speedup?

Speedup for the regex-heavy ones #521 (comment), slight problems with the table-dedup ones (like a couple percent), but the follow-up faState one makes it all moot. I checked that everything was the same (made Claude go through and check the state space and the assembly), it was just a matter of whether the GC from build-time happened during match time. A benchmark artifact in my view.

sayrer · 2026-05-30T22:47:12Z

I think the best approach is to make things smaller so they start fitting within cache lines where they didn't before, then really tune those ~2% things. But there are no algorithmic issues that I can spot. I'll follow up on your other comments.

Tim flagged that the singleton-match optimization isn't specific to booleans — it applies to any field with a single string or literal value, where the matcher can use bytes.Compare instead of building an FA. Reword the regression-test comment accordingly; keep the boolean pattern as the minimal repro rather than the defining case. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The side-tables refactor (32dc2a9) replaced field-level comments with a block comment about the struct's purpose. The block comment explains why two maps exist but stopped documenting what each field is for. Put the field-level docs back; matches the pre-refactor style for generation/closureSetGen/closureList and adds documentation for the two new map fields. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Tim flagged that `visited[s] = struct{}{}` isn't a familiar idiom. Add a comment above the insert line explaining map[K]struct{} is Go's keys-only set type (struct{}{} is zero bytes, so the value half is free), with the insert and membership idioms. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per Tim — the byte cost reported by GetMatcherStats should reflect the backing array footprint, matching how mcSmallTable / mcFaState already count their slice fields (cap, not len). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per Tim — the rest of the codebase (match_set, stats, matcher, memory_cost, anything_but) uses map[K]bool for sets. The two newer sites that used map[K]struct{} (stateLists.seen and simplifyCollect's visited) were stylistic outliers; switch to bool for consistency. Also drops the comment I just added explaining the struct{}{} idiom — no longer needed since the code is now plain map[K]bool with `if m[k]` / `m[k] = true`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The seen-field type swap (struct{} → bool) narrowed the field type, which gofmt then re-aligned the adjacent comments around. Apply the formatter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

timbray · 2026-05-31T00:14:32Z

You done?

sayrer · 2026-05-31T00:15:59Z

You done?

I think so, just running the benchmarks.

sayrer · 2026-05-31T01:20:23Z

So, what this patch does is improve NFA performance. But, it is very bad for eager DFA performance (but not lazy DFA). So, it's all a bit mixed up. I'd argue to just make the NFA path as fast as possible and then go after DFA. Gory details:

https://gist.github.com/sayrer/1bcd6364d08305c8c3136a7e2d65d09c

timbray · 2026-05-31T04:24:57Z

So, what this patch does is improve NFA performance. But, it is very bad for eager DFA performance (but not lazy DFA). So, it's all a bit mixed up.

I looked at your gist and no surprises there. But I don't get why this PR is "bad for eager DFA".

sayrer · 2026-05-31T15:22:14Z

oh, I actually read it wrong yesterday. Eager is faster but unbounded. I think it's ready then.

sayrer force-pushed the build-context-extract branch from bacd3e8 to dec9a56 Compare April 17, 2026 00:01

sayrer mentioned this pull request Apr 17, 2026

kaizen: nfa: simplify smallTable.step, eliminate per-traverse stepOut alloc #519

Merged

sayrer mentioned this pull request Apr 30, 2026

NFA => DFA optimization #481

Open

timbray reviewed May 29, 2026

View reviewed changes

sayrer and others added 2 commits May 30, 2026 11:30

sayrer force-pushed the build-context-extract branch from 898e4cd to 2326bc0 Compare May 30, 2026 18:35

timbray reviewed May 30, 2026

View reviewed changes

Comment thread epsi_closure.go Outdated

Comment thread epsi_closure.go

Comment thread memory_cost.go Outdated

Comment thread nfa.go

Comment thread nfa.go Outdated

Comment thread small_table.go

Comment thread state_lists.go Outdated

Comment thread memory_cost_test.go Outdated

sayrer and others added 6 commits May 30, 2026 16:04

state_lists: gofmt comment alignment

408dc57

The seen-field type swap (struct{} → bool) narrowed the field type, which gofmt then re-aligned the adjacent comments around. Apply the formatter. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

timbray merged commit 1ca29de into timbray:main May 31, 2026
7 checks passed

sayrer mentioned this pull request May 31, 2026

kaizen: Embed smalltable #535

Open

Conversation

sayrer commented Apr 16, 2026

Uh oh!

sayrer commented Apr 17, 2026

Uh oh!

sayrer commented Apr 17, 2026

Uh oh!

sayrer commented Apr 17, 2026

Uh oh!

timbray commented Apr 17, 2026

Uh oh!

sayrer commented Apr 17, 2026

Uh oh!

sayrer commented Apr 17, 2026

Uh oh!

timbray commented Apr 17, 2026

Uh oh!

sayrer commented Apr 18, 2026

Uh oh!

sayrer commented Apr 18, 2026

Uh oh!

sayrer commented Apr 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sayrer commented Apr 18, 2026

Uh oh!

sayrer commented Apr 18, 2026

Uh oh!

sayrer commented Apr 18, 2026

Uh oh!

sayrer commented Apr 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

timbray left a comment

Choose a reason for hiding this comment

Uh oh!

sayrer commented May 30, 2026

Uh oh!

timbray commented May 30, 2026

Uh oh!

sayrer commented May 30, 2026

Uh oh!

timbray commented May 30, 2026

Uh oh!

sayrer commented May 30, 2026

Uh oh!

sayrer commented May 30, 2026

Uh oh!

sayrer commented May 30, 2026

Uh oh!

timbray commented May 30, 2026

Uh oh!

sayrer commented May 30, 2026

Uh oh!

timbray commented May 30, 2026

Uh oh!

sayrer commented May 30, 2026

Uh oh!

timbray left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

sayrer commented May 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sayrer commented May 30, 2026

Uh oh!

timbray commented May 31, 2026

Uh oh!

sayrer commented May 31, 2026

Uh oh!

sayrer commented May 31, 2026

Uh oh!

sayrer commented Apr 18, 2026 •

edited

Loading

sayrer commented Apr 19, 2026 •

edited

Loading

sayrer commented May 30, 2026 •

edited

Loading