Skip to content

kaizen: nfa: move build-time gen state off smallTable/faState to side tables#521

Merged
timbray merged 9 commits into
timbray:mainfrom
sayrer:build-context-extract
May 31, 2026
Merged

kaizen: nfa: move build-time gen state off smallTable/faState to side tables#521
timbray merged 9 commits into
timbray:mainfrom
sayrer:build-context-extract

Conversation

@sayrer
Copy link
Copy Markdown
Contributor

@sayrer sayrer commented Apr 16, 2026

See: #518. I have a patch for each of the points there, but it's difficult to measure the impact all at once.

smallTable.lastVisitedGen, smallTable.closureGen, smallTable.closureRep and faState.closureSetGen were all only read/written during pattern-add time (epsilon closure computation, merge simplification, and the test-only nfa2Dfa intern path). They sat on every table and state permanently, bloating steady-state memory by 24 B/table and 8 B/state.

Move that state to a per-call side table held on closureBuffers (maps keyed by *smallTable / *faState). A fresh closureBuffers is created at the top of each epsilonClosure call and dropped when the call returns. simplifySplices and stateLists.intern each get their own local visited set using the same pattern. The global closureGeneration counter is no longer needed.

Struct sizes (measured via unsafe.Sizeof):

smallTable 96 B -> 72 B -24
faState 72 B -> 64 B -8
-------
per pair -32 B

Bench (Apple M1 Ultra, n=6):

ShellStyleBuildTime-20 56.10m -> 48.77m -13.07%
958 B -> 797 B -16.81%
1 allocs -> 0 allocs

Match-time benchmarks are neutral or marginally faster (CityLots, NumberMatching, 8259Example each show ~-1% — likely the cache-footprint win from the shrunken structs). Passes -race.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 17, 2026

Significant tradeoff found:

  ┌──────────────────┬───────────┬───────────────────────┬───────────┐
  │      Metric      │   main    │ build-context-extract │     Δ     │
  ├──────────────────┼───────────┼───────────────────────┼───────────┤
  │ sec/op           │ 19.17     │ 12.79                 │ –33.3% 🎉 │
  ├──────────────────┼───────────┼───────────────────────┼───────────┤
  │ B/op             │ 3.301 GiB │ 3.277 GiB             │ –0.7%     │
  ├──────────────────┼───────────┼───────────────────────┼───────────┤
  │ allocs/op        │ 5.234 M   │ 36.80 M               │ +603% 🚨  │
  ├──────────────────┼───────────┼───────────────────────┼───────────┤
  │ Patterns @100 M  │ 1307      │ 462                   │ –65% 🚨   │
  ├──────────────────┼───────────┼───────────────────────┼───────────┤
  │ Patterns @2000 M │ 7008      │ 2342                  │ –65%      │
  ├──────────────────┼───────────┼───────────────────────┼───────────┤
  │ Total patterns   │ 17501     │ 5902                  │ –66%      │
  └──────────────────┴───────────┴───────────────────────┴───────────┘


  The regression: SetMemoryBudget uses runtime.ReadMemStats.TotalAlloc to
  decide when to stop adding. Our per-call side-table maps generate lots of
  short-lived allocations during build (the +603% allocs/op). Those allocs
  get GC'd and never show up as steady-state memory (so B/op is fine), but
  they inflate TotalAlloc, which the budget accountant reads as "memory
  used." Result: the user-visible SetMemoryBudget is 3× less generous.

  Struct sizes still shrink as advertised — this is purely a TotalAlloc
  accounting interaction.```

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 17, 2026

                                                                    
  With value-stored tableMark (no *tableMark allocation per entry): 
                                                                    
  ┌────────────────┬───────────┬───────────────┬───────────┐        
  │     Metric     │   main    │ bce value-sem │     Δ     │        
  ├────────────────┼───────────┼───────────────┼───────────┤        
  │ sec/op         │ 19.17     │ 12.29         │ –35.9%    │
  ├────────────────┼───────────┼───────────────┼───────────┤
  │ B/op           │ 3.301 GiB │ 3.275 GiB     │ –0.8%     │        
  ├────────────────┼───────────┼───────────────┼───────────┤        
  │ allocs/op      │ 5.234 M   │ 2.133 M       │ –59.3% 🎉 │        
  ├────────────────┼───────────┼───────────────┼───────────┤        
  │ Total patterns │ 17501     │ 5579          │ –68% 🚨   │
  └────────────────┴───────────┴───────────────┴───────────┘        
  

  The allocs went below main's — fewer maps / smaller per-entry. But
  patterns-per-budget still craters. That tells us the problem isn't alloc
  count; it's the bytes-per-pattern as measured by TotalAlloc:
                                                                              
  - main: ~197 KB/pattern (all permanent FA structs)  
  - bce: ~587 KB/pattern (FA structs + transient maps growing + being
    discarded)
                                                                              
  The TotalAlloc accountant (memory_cost.go:82) deliberately counts transient
  allocations too — per its comment, because during build  the old matcher is
  still live alongside the new one. That choice makes any side-table approach a
  loss for the budget feature,        regardless of pooling or value-semantics
  tricks.               
                  
  Paths forward

  A. Keep this branch as-is — ship the real wins (–36% time, –59% allocs,
  smaller structs in steady state). Users relying on           SetMemoryBudget
  see ~3× fewer patterns supported per budget byte. The budget is a heuristic
  anyway.
                                                                              
  B. Revert build-context-extract, switch to suggestion #3 (int32 counters +
  field reorder). Saves 16 B/pair permanently (half of the  32 B/pair the
  current branch would save at steady state), zero transient allocation, zero
  budget regression. Safest.
                                                                              
  C. Ship build-context-extract + fix memory_cost.go to use HeapInuse. Bigger
  scope, crosses into Tim's memory-budget work (#516).     
  
  My recommendation: B. The budget regression in A is user-visible and
  surprising; the patterns-per-budget number is exactly the thing Tim's #516
  was optimizing. C is better semantically but crosses PR boundaries. B gets
  half the steady-state win with none of the downside.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 17, 2026

I think Claude is being lazy here, and C is the right choice. See https://thecodinggopher.substack.com/p/gos-memory-allocator

@timbray
Copy link
Copy Markdown
Owner

timbray commented Apr 17, 2026

Thanks for the work and analysis. Unfortunately I'm AFK for the next few hours. Before I go… I tried using HeapAlloc instead of TotalAlloc in memory_cost.go and that produced bad results but I don't remember the details - could maybe revisit that decision. Clearly, the idea of moving all that deduping machinery off into ephemeral AddPattern-time-only storage is attractive. Let me think about this while I'm offline. I wonder if there's a way to exclude it from the memory_cost bookkeeping.

But anyhow, there's no reason not to merge #519, right?

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 17, 2026

I think #519 is all good. As ever, I'm not in a rush, I just send these in a flurry when I have time to think about them. There's no real urgency.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 17, 2026

I had to look this up, so by no means an expert, but I think the suggestion of HeapInuse not (HeapAlloc) could make sense. I think TotalAlloc only goes up. My general thought pattern is that the DFA stuff will be way faster, but stress this part of the program.

@timbray
Copy link
Copy Markdown
Owner

timbray commented Apr 17, 2026

Why don't you go ahead and, as an experiment, update the bytesAllocated call in memory_cost.go to use HeapInUse and see what happens?

Also, the process of finally getting the memory_cost stuff working taught me that a couple of my assumptions are wrong, and now I have in mind a couple of cheaper and ( think) more accurate approaches so maybe I'll have re-do it.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 18, 2026

⏺ Comparison: main vs build-context-extract, BenchmarkMemoryBoundaries  
                                                                                
  Answering why the bench is slow is actually the story: under
  build-context-extract + HeapAlloc, the budget is so much more generous  that
  the bench's fixed pattern set can't saturate it.  
                                                                                
  Patterns-per-budget (same 12959 shellstyle patterns, same hardware)

                                                                                
  ┌────────────────────┬───────────────────┬──────────────────────────────────────────────────┐
  │       Budget       │ main (TotalAlloc) │        build-context-extract (HeapAlloc)         │
  ├────────────────────┼───────────────────┼──────────────────────────────────────────────────┤
  │ 100 M              │ 1306              │ 12959 (all patterns fit)                         │
  ├────────────────────┼───────────────────┼──────────────────────────────────────────────────┤
  │ 200 M              │ 1978              │ 12959 (all patterns fit)                         │
  ├────────────────────┼───────────────────┼──────────────────────────────────────────────────┤
  │ 400 M              │ 2914              │ (didn't measure — can't exceed 12959)            │
  ├────────────────────┼───────────────────┼──────────────────────────────────────────────────┤
  │ 800 M              │ 4294              │ —                                                │
  ├────────────────────┼───────────────────┼──────────────────────────────────────────────────┤
  │ 2000 M             │ 7008              │ —                                                │
  ├────────────────────┼───────────────────┼──────────────────────────────────────────────────┤
  │ Total across tiers │ 17501             │ ~5× the same 12959, capped by the pattern supply │
  └────────────────────┴───────────────────┴──────────────────────────────────────────────────┘
   

  Wall-clock per b.Loop() iteration  
                                                                                  
  - main: ~19 s (bench hits budget cap fast, abandons each tier)  
  - build-context-extract: ~2222 s (bench loads all 12959 patterns × 5 tiers ×
    ~34 ms/add = ~64K pattern-adds)                       
                                                                                  
  The 116× slowdown of the bench is misleading — it's not that BCE is slower per
  pattern-add, it's that the bench is completing the    work the budget now
  permits. Per-pattern cost is actually similar.  
                                                                                  
  What this means 

  - Under the old TotalAlloc accounting, the 100 M budget rejected patterns
    after ~10% of the set. That rejection was spurious — it was counting
    transient allocations that GC would reclaim.
  - Under HeapAlloc, the 100 M budget fits the entire test corpus. Users can
    load ~10× more patterns per budget byte, which is real    capacity they
    were paying for but never getting.  
  - Combined with the struct shrinkagefrom build-context-extract (–32 B/pair),
    the change directly addresses the "too much memory" motivation of
    issue #518.                                            
                  
 What still needs doing                                                       
 
  The bench itself should grow a larger pattern corpus (or an
  expanding-quantifier pattern generator) so it can still exert budget pressure
  under HeapAlloc — otherwise it'll just time out on big matchers and we can't
  distinguish "budget works" from "pattern supply exhausted." Worth raising
  when Tim does the memory_cost rework he mentioned.    

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 18, 2026

So, we're working on the right things here, it just needs some tuning in the tests and the accountant code. HeapAlloc ended up working better, because it's more granular, and Quamina makes a lot of small transient allocations (instantly and cheaply reclaimed often). By doing this, we're converging on what the Rust one does with an arena allocator.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 18, 2026

HeapAlloc vs HeapInuse                                                     
                                                                                
  Both come from runtime.MemStats and both describe currently-held heap memory
  (unlike TotalAlloc, which is cumulative). The difference is granularity.  
                                                                                
  HeapAlloc       

  - Bytes occupied by live-plus-unreclaimed heap objects.  
  - Grows on every allocation; shrinks the moment GC frees an object.
  - Byte-precise: allocate a 16-byte object, HeapAlloc goes up by 16.  
                                                                                
  HeapInuse                                                       
                                                                                
  - Bytes in heap spans that contain at least one object.  
  - Grows when a newspan is claimed; shrinks only when a span becomes fully
    empty.
  - Span-granular: allocate a 16-byte object, HeapInuse goes up by the full span
    size (typically 8192 bytes).                          
                                                                                
  Why the delta exists                                                       
                                                                                
  Go's allocator works in size-class spans — chunks of memory (often 8 KB)
  subdivided into fixed-size slots.  A span counts as "in use" as long as any
  one slot is occupied.
                                                                                
  Allocate one 16-byte object → lands in one slot of an 8192-byte span.
                                                                                
    HeapAlloc  = 16       (just the object) HeapInuse  = 8192     (the whole
    span)                       
                                                                                
  As more 16-byte objects fill the same span:  
  - HeapAlloc grows linearly (16, 32, 48, …).
  - HeapInuse stays flat at 8192 until the span fills, then jumps to 16384 when
    a second span is claimed.                              
                                                                                
  Free 511 of 512 slots:                                                       
 - HeapAlloc drops to 16.  
  - HeapInuse stays 8192 (one live object keeps thespan "in use").  
                                                                                
  So HeapInuse ≥ HeapAlloc always. The gap is fragmentation overhead — slots
  committed to size classes but currently empty.            
                                                                                
  Practical picks                                                       
                                                                                
  ┌───────────────────────────────────────────────────────┬───────────────────────────┐ 
  │                       Question                        │        Right field        │
  ├───────────────────────────────────────────────────────┼───────────────────────────┤ 
  │ "How much do my live objects occupy?"                 │ HeapAlloc                 │
  ├───────────────────────────────────────────────────────┼───────────────────────────┤
  │ "How much memory is the runtime holding from the OS?" │ HeapInuse (closer to RSS) │ 
  ├───────────────────────────────────────────────────────┼───────────────────────────┤ 
  │ "Will I OOM?"                                         │ HeapInuse or HeapSys      │ 
  ├───────────────────────────────────────────────────────┼───────────────────────────┤ 
  │ Byte-precise measurement for fine-grained budgets?    │ HeapAlloc                 │
  └───────────────────────────────────────────────────────┴───────────────────────────┘ 

  Why we picked HeapAlloc for Quamina's budget  
                  
  Quamina allocates many small objects (*faState, *smallTable, pointer slices).
  In a warm process those slip into slots of             already-claimed
  spans — HeapInuse stays flat even as the matcher grows, and the budget
  under-counts retention.
                                                                                        
  HeapAlloc tracks every byte, so the budget reflects actual per-object growth.
  Tradeoff: it omits span overhead ("fragmentation"), so the user's ceiling
  is "bytes my live objects  occupy" rather than "bytes my process holds in spans"
 — fine for Quamina's shape, less honest for a workload of few-but-large objects.  

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 18, 2026

OK, that's looking pretty good, but still need to go in and nitpick the benchmarks. For another day.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 18, 2026

  ┌────────────────────────────────────────────┬──────────┬─────────┬──────────┐                                                       
  │                 Benchmark                  │ sec/op Δ │ B/op Δ  │ allocs Δ │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ ShellStyleBuildTime                        │ −30.78%  │ −33.40% │ 1 → 0    │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ 8259Example                                │ −1.77%   │ —       │ —        │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ TablePointerDedup/deeply-nested            │ −0.90%   │ —       │ —        │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ TablePointerDedup/overlapping-char-classes │ −1.97%   │ —       │ —        │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ TablePointerDedup/shell+deep-overlap       │ −1.90%   │ —       │ —        │
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ CityLots                                   │ ~        │ —       │ —        │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ NumberMatching                             │ ~        │ —       │ —        │
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ ShellstyleMultiMatch                       │ ~        │ —       │ —        │ 
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤
  │ TablePointerDedup/6-regexps-12-shell       │ ~        │ —       │ —        │
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ TablePointerDedup/20-nested-regexps        │ ~        │ —       │ —        │
  ├────────────────────────────────────────────┼──────────┼─────────┼──────────┤ 
  │ geomean                                    │ −4.66%   │ −3.98%  │          │ 
  └────────────────────────────────────────────┴──────────┴─────────┴──────────┘ 

   
  Headline: ShellStyleBuildTime drops −31% time / −33% memory / 1 → 0 allocs —
  this bench exercises a lot of NFA construction where the struct-shrink(+ step
  () simplification already on main) compounds. Other match-path benchmarks are
  neutral or slightly better (≤2%), reflecting the cache-footprint win from the
  smaller faState/smallTable.           
                                                                                 
  No regressions in the ±5% band. Memory-boundaries (which blew up earlier)
  showed the real user-facing win: same 100 M budget now fits ~10× more
  patterns.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 18, 2026

I thought of this while walking to get lunch. The MemoryBoundaries benchmark has really high thresholds because it was using the monotonically increasing TotalAlloc. But it should be HeapAlloc at 2,4,8,16,32,64 MB (RE2 default is 8MB)

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented Apr 19, 2026

This is about what we want to see, but it will get tougher with Nfa2Dfa.

HeapAlloc-scale:

  ┌─────────────────┬───────────────────────┬───────────┐
  │     Budget      │       Patterns        │ B/pattern │
  ├─────────────────┼───────────────────────┼───────────┤
  │ 2 M             │ 101                   │ 20.3 K    │
  ├─────────────────┼───────────────────────┼───────────┤
  │ 4 M             │ 482                   │ 8.5 K     │
  ├─────────────────┼───────────────────────┼───────────┤
  │ 8 M (RE2-equiv) │ 1398                  │ 5.9 K     │
  ├─────────────────┼───────────────────────┼───────────┤
  │ 16 M            │ 2890                  │ 5.7 K     │
  ├─────────────────┼───────────────────────┼───────────┤
  │ 32 M            │ 6335                  │ 5.2 K     │
  ├─────────────────┼───────────────────────┼───────────┤
  │ 64 M            │ all 12959 (untripped) │ —         │
  └─────────────────┴───────────────────────┴───────────┘


  Per-pattern cost drops with budget (amortization across shared NFA structure),
  as you'd expect from a well-behaved matcher. 

Copy link
Copy Markdown
Owner

@timbray timbray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you can rebase this (mostly lose the budget stuff) we should merge it now.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 30, 2026

I won't get to it until tomorrow morning, but no problem

@timbray
Copy link
Copy Markdown
Owner

timbray commented May 30, 2026

It dawns on me that another memory-saver (8 bytes/state) would be to have an faState contain a smallTable not a *smallTable. Just embed the sucker right in there. Probably get better memory locality and a perf bump too. BUT, faState is at the core of everything and it could turn into a nontrivial refactoring.

sayrer and others added 2 commits May 30, 2026 11:30
smallTable.lastVisitedGen, smallTable.closureGen, smallTable.closureRep
and faState.closureSetGen were all only read/written during pattern-add
time (epsilon closure computation, merge simplification, and the
test-only nfa2Dfa intern path). They sat on every table and state
permanently, bloating steady-state memory by 24 B/table and 8 B/state.

Move that state to a per-call side table held on closureBuffers (maps
keyed by *smallTable / *faState). A fresh closureBuffers is created at
the top of each epsilonClosure call and dropped when the call returns.
simplifySplices and stateLists.intern each get their own local visited
set using the same pattern. The global closureGeneration counter is no
longer needed.

Struct sizes (measured via unsafe.Sizeof):

  smallTable   96 B -> 72 B   -24
  faState      72 B -> 64 B    -8
                             -------
  per pair                    -32 B

Bench (Apple M1 Ultra, n=6):

  ShellStyleBuildTime-20   56.10m -> 48.77m   -13.07%
                            958 B ->  797 B   -16.81%
                           1 allocs -> 0 allocs

Match-time benchmarks are neutral or marginally faster (CityLots,
NumberMatching, 8259Example each show ~-1% — likely the cache-footprint
win from the shrunken structs). Passes -race.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Moving build-time generation state off faState/smallTable (commit
192aaa0, cherry-picked from dec9a56) shrinks both structs. The
hardcoded size constants in TestQuaminaMemoryCost, TestMcNfaSizes,
and TestPP were calibrated against the pre-move sizes on main:

  faState/smallTable bytes per state: 1481 -> 1321 (-160 / -10.8%)
  prettyprinter s/t per state:        240/312 -> 216/280 (-24 / -32)

TestMcBasicSizes computes its expectation from unsafe.Sizeof so it
needs no update; only the hand-calibrated values do.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sayrer sayrer force-pushed the build-context-extract branch from 898e4cd to 2326bc0 Compare May 30, 2026 18:35
@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 30, 2026

Still a little to check here:

  ┌────────────────────────────────────────────┬─────────────┬───────────────┬─────────┬───────────┬─────────────┬─────────┐
  │                 Benchmark                  │ main sec/op │ branch sec/op │    Δ    │ main B/op │ branch B/op │    Δ    │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ ShellStyleBuildTime                        │ 41.07m ± 1% │ 35.80m ± 5%   │ −12.83% │ 684 ± 4%  │ 598 ± 3%    │ −12.57% │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ CityLots                                   │ 5.287µ ± 1% │ 5.312µ ± 1%   │ ~       │ 14        │ 14          │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ 8259Example                                │ 5.392µ ± 3% │ 5.519µ ± 1%   │ ~       │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ NumberMatching                             │ 413.1n ± 3% │ 413.0n ± 2%   │ ~       │ 280       │ 280         │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ ShellstyleMultiMatch                       │ 27.59µ ± 4% │ 28.72µ ± 3%   │ +4.10%  │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ TablePointerDedup/6-regexps-12-shell       │ 6.733µ ± 1% │ 7.208µ ± 1%   │ +7.05%  │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ TablePointerDedup/20-nested-regexps        │ 759.2n ± 1% │ 766.2n ± 2%   │ +0.92%  │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ TablePointerDedup/deeply-nested            │ 793.4n ± 1% │ 803.3n ± 1%   │ +1.24%  │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ TablePointerDedup/overlapping-char-classes │ 596.5n ± 2% │ 601.0n ± 2%   │ ~       │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ TablePointerDedup/shell+deep-overlap       │ 8.954µ ± 2% │ 9.341µ ± 2%   │ +4.32%  │ 0         │ 0           │ —       │
  ├────────────────────────────────────────────┼─────────────┼───────────────┼─────────┼───────────┼─────────────┼─────────┤
  │ geomean                                    │ 7.016µ      │ 7.065µ        │ +0.70%  │ —         │ —           │ −1.33%  │

@timbray
Copy link
Copy Markdown
Owner

timbray commented May 30, 2026

Hmm, having trouble parsing that table. I don't think any of those actually measure the size of the data structures after the AddPattern()s? The new quamina.GetMatcherStats() does measure that for you. That's where I was assuming the big win here would come from.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 30, 2026

  Main vs branch — steady-state matcher size

  ┌──────────────────────────────┬────────┬────────────┬──────────────┬──────────┬─────────┬──────────────┬────────────────┬───────────┐
  │           Workload           │ states │ main bytes │ branch bytes │ Δ bytes  │   Δ %   │ main B/state │ branch B/state │ Δ B/state │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ ShellStyleBuild-1000         │  7,409 │  3,157,290 │    2,920,202 │ −237,088 │  −7.51% │        426.1 │          394.1 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/6-regexps-12-shell       │  1,102 │    420,405 │      385,141 │  −35,264 │  −8.39% │        381.5 │          349.5 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/20-nested-regexps        │    150 │     41,483 │       36,683 │   −4,800 │ −11.57% │        276.6 │          244.6 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/deeply-nested            │     60 │     20,483 │       18,563 │   −1,920 │  −9.37% │        341.4 │          309.4 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/overlapping-char-classes │     86 │     23,561 │       20,809 │   −2,752 │ −11.68% │        274.0 │          242.0 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/shell+deep-overlap       │    838 │    312,541 │      285,725 │  −26,816 │  −8.58% │        373.0 │          341.0 │     −32.0 │
  └──────────────────────────────┴────────┴────────────┴──────────────┴──────────┴─────────┴──────────────┴────────────────┴───────────┘

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 30, 2026

It dawns on me that another memory-saver (8 bytes/state) would be to have an faState contain a smallTable not a *smallTable. Just embed the sucker right in there. Probably get better memory locality and a perf bump too. BUT, faState is at the core of everything and it could turn into a nontrivial refactoring.

I didn't try this yet. There's also a bug in cmStateStats that I'll get first.

…hers

When a valueMatcher uses the singleton-match optimization (singletonMatch
holds the raw bytes, startTable is nil — e.g. boolean-valued patterns
like `{"Animated": [false]}`), cmFieldMatcherStats was constructing
&faState{table: nil} and passing it to cmStateStats, which then crashed
on state.table.steps. The singleton byte count is already added a few
lines above; there is no NFA to walk, so just skip the nil-table case.

Add TestQuaminaMemoryCostSingleton as a regression test. Minimal repro
is q.AddPattern("p", `{"Animated": [false]}`) followed by
q.GetMatcherStats() — without this fix it panics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 30, 2026

Now we can run 8259.

  Main vs branch — steady-state matcher size (with fix)

  ┌──────────────────────────────┬────────┬────────────┬──────────────┬──────────┬─────────┬──────────────┬────────────────┬───────────┐
  │           Workload           │ states │ main bytes │ branch bytes │ Δ bytes  │   Δ %   │ main B/state │ branch B/state │ Δ B/state │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ 8259Example                  │ 20,360 │  6,854,729 │    6,203,209 │ −651,520 │  −9.50% │        336.7 │          304.7 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ ShellStyleBuild-1000         │  7,409 │  3,157,290 │    2,920,202 │ −237,088 │  −7.51% │        426.1 │          394.1 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/6-regexps-12-shell       │  1,102 │    420,405 │      385,141 │  −35,264 │  −8.39% │        381.5 │          349.5 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/20-nested-regexps        │    150 │     41,483 │       36,683 │   −4,800 │ −11.57% │        276.6 │          244.6 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/deeply-nested            │     60 │     20,483 │       18,563 │   −1,920 │  −9.37% │        341.4 │          309.4 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/overlapping-char-classes │     86 │     23,561 │       20,809 │   −2,752 │ −11.68% │        274.0 │          242.0 │     −32.0 │
  ├──────────────────────────────┼────────┼────────────┼──────────────┼──────────┼─────────┼──────────────┼────────────────┼───────────┤
  │ TPD/shell+deep-overlap       │    838 │    312,541 │      285,725 │  −26,816 │  −8.58% │        373.0 │          341.0 │     −32.0 │
  └──────────────────────────────┴────────┴────────────┴──────────────┴──────────┴─────────┴──────────────┴────────────────┴───────────┘

@timbray
Copy link
Copy Markdown
Owner

timbray commented May 30, 2026

Let me know when you're happy with this PR's status and I'll have a closer look & probably merge.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 30, 2026

I'm going to do the faState thing based on this branch, but not in this branch. Then see how it looks. It's a mechanical change but I think the patch will be a monster.

@timbray
Copy link
Copy Markdown
Owner

timbray commented May 30, 2026

Also, I broke one of my own rules, always look at the red/green coverage on everything with <100%. Which would have revealed that the singleton case was never exercised, sigh.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 30, 2026

OK, take this one as ready. I did your faState idea and it worked, but it's better to get this one first.

https://github.com/sayrer/quamina/tree/embed-smalltable

  - Build-time: ShellStyleBuildTime −17.33% sec/op, −18.23% B/op vs build-context-extract tip.
  - Match-time: every benchmark either improved (3.5–8.6%) or was statistically flat. Geomean −4.77%.
  - Steady-state matcher size: ~8 bytes/state lighter across all workloads (the embed eliminates the *smallTable pointer
  field + the separate 80B-class smallTable allocation; per-state allocation count halved).
  - Dedup behavior preserved: switched from *smallTable pointer identity to slice-backing identity (tableShareKey keyed by
  unsafe.SliceData(steps) + len), validated via the existing TablePointerDedup benchmarks (5 of 5 within ±1% or faster) and
  the mutation audit (all 21 state.table.X = … sites in production code are safe under the construction-time-only
  invariant).
  
  The branch stacks on build-context-extract (which has its own open PR). When/if build-context-extract lands, this branch
  rebases cleanly onto main.

Copy link
Copy Markdown
Owner

@timbray timbray left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most of these are nits, one or two real (minor) issues.

Curious: Your table shows the memory reduction, which is good, did you observe any speedup?

Comment thread epsi_closure.go Outdated
Comment thread epsi_closure.go
Comment thread memory_cost.go Outdated
Comment thread nfa.go
Comment thread nfa.go Outdated
Comment thread small_table.go
Comment thread state_lists.go Outdated
Comment thread memory_cost_test.go Outdated
@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 30, 2026

Most of these are nits, one or two real (minor) issues.

Curious: Your table shows the memory reduction, which is good, did you observe any speedup?

Speedup for the regex-heavy ones #521 (comment), slight problems with the table-dedup ones (like a couple percent), but the follow-up faState one makes it all moot. I checked that everything was the same (made Claude go through and check the state space and the assembly), it was just a matter of whether the GC from build-time happened during match time. A benchmark artifact in my view.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 30, 2026

I think the best approach is to make things smaller so they start fitting within cache lines where they didn't before, then really tune those ~2% things. But there are no algorithmic issues that I can spot. I'll follow up on your other comments.

sayrer and others added 6 commits May 30, 2026 16:04
Tim flagged that the singleton-match optimization isn't specific to
booleans — it applies to any field with a single string or literal
value, where the matcher can use bytes.Compare instead of building
an FA. Reword the regression-test comment accordingly; keep the
boolean pattern as the minimal repro rather than the defining case.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The side-tables refactor (32dc2a9) replaced field-level comments with a
block comment about the struct's purpose. The block comment explains
why two maps exist but stopped documenting what each field is for. Put
the field-level docs back; matches the pre-refactor style for
generation/closureSetGen/closureList and adds documentation for the
two new map fields.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Tim flagged that `visited[s] = struct{}{}` isn't a familiar idiom.
Add a comment above the insert line explaining map[K]struct{} is
Go's keys-only set type (struct{}{} is zero bytes, so the value half
is free), with the insert and membership idioms.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Tim — the byte cost reported by GetMatcherStats should reflect
the backing array footprint, matching how mcSmallTable / mcFaState
already count their slice fields (cap, not len).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per Tim — the rest of the codebase (match_set, stats, matcher,
memory_cost, anything_but) uses map[K]bool for sets. The two newer
sites that used map[K]struct{} (stateLists.seen and simplifyCollect's
visited) were stylistic outliers; switch to bool for consistency.

Also drops the comment I just added explaining the struct{}{} idiom —
no longer needed since the code is now plain map[K]bool with
`if m[k]` / `m[k] = true`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The seen-field type swap (struct{} → bool) narrowed the field type,
which gofmt then re-aligned the adjacent comments around. Apply the
formatter.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@timbray
Copy link
Copy Markdown
Owner

timbray commented May 31, 2026

You done?

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 31, 2026

You done?

I think so, just running the benchmarks.

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 31, 2026

So, what this patch does is improve NFA performance. But, it is very bad for eager DFA performance (but not lazy DFA). So, it's all a bit mixed up. I'd argue to just make the NFA path as fast as possible and then go after DFA. Gory details:

https://gist.github.com/sayrer/1bcd6364d08305c8c3136a7e2d65d09c

@timbray
Copy link
Copy Markdown
Owner

timbray commented May 31, 2026

So, what this patch does is improve NFA performance. But, it is very bad for eager DFA performance (but not lazy DFA). So, it's all a bit mixed up.

I looked at your gist and no surprises there. But I don't get why this PR is "bad for eager DFA".

@sayrer
Copy link
Copy Markdown
Contributor Author

sayrer commented May 31, 2026

oh, I actually read it wrong yesterday. Eager is faster but unbounded. I think it's ready then.

@timbray timbray merged commit 1ca29de into timbray:main May 31, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants