[NetKAT] Store decision nodes by field: per-field unique tables + level-packed handles by smolkaj · Pull Request #102 · google/netkat

smolkaj · 2026-06-10T10:42:37Z

What

Reorganizes both managers' decision-node storage by packet field (level) — the standard node organization in BDD packages — in two commits that build on each other:

Commit 1: Per-field unique tables keyed by node index.
The unique tables (packet_by_node_, transformer_by_node_) previously stored a full copy of every DecisionNode as the map key, duplicating all node storage (especially costly for transformer nodes, which own heap-allocated btree maps). They are now per-field flat_hash_set<uint32_t>s that hash/compare entries by dereferencing into the node storage — each node is stored exactly once, roughly halving node-related memory. Per-field tables also keep probe sequences short and drop the field from hashing (it's implied by the table).
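A minimal sketch of the index-keyed table idea, for one field. It is not the PR's code: it uses `std::unordered_set` instead of `flat_hash_set`, and it emulates heterogeneous lookup with a reserved index (`kCandidate`) that the functors resolve to the node being probed. The names `Node`, `Lookup`, `FieldTable`, and `kCandidate` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified decision node (the real nodes are richer).
struct Node {
  uint32_t value;
  uint32_t true_child;
  uint32_t false_child;
  bool operator==(const Node& o) const {
    return value == o.value && true_child == o.true_child &&
           false_child == o.false_child;
  }
};

// Reserved index meaning "the candidate currently being probed".
constexpr uint32_t kCandidate = UINT32_MAX;

// State shared by the hash and equality functors.
struct Lookup {
  const std::vector<Node>* nodes;
  const Node* candidate;
  const Node& Get(uint32_t i) const {
    return i == kCandidate ? *candidate : (*nodes)[i];
  }
};

// Hashes an index by dereferencing into node storage.
struct IndexHash {
  const Lookup* lookup;
  size_t operator()(uint32_t i) const {
    const Node& n = lookup->Get(i);
    size_t h = std::hash<uint32_t>{}(n.value);
    h = h * 31 + std::hash<uint32_t>{}(n.true_child);
    h = h * 31 + std::hash<uint32_t>{}(n.false_child);
    return h;
  }
};

// Compares two indices by comparing the nodes they refer to.
struct IndexEq {
  const Lookup* lookup;
  bool operator()(uint32_t a, uint32_t b) const {
    return lookup->Get(a) == lookup->Get(b);
  }
};

// Unique table for a single field; a manager would hold one per field.
class FieldTable {
 public:
  FieldTable()
      : lookup_{&nodes_, nullptr},
        table_(16, IndexHash{&lookup_}, IndexEq{&lookup_}) {}
  FieldTable(const FieldTable&) = delete;
  FieldTable& operator=(const FieldTable&) = delete;

  // Returns the index of `node`, appending it only if it is new.
  uint32_t Intern(const Node& node) {
    lookup_.candidate = &node;
    auto it = table_.find(kCandidate);  // probe before appending
    lookup_.candidate = nullptr;
    if (it != table_.end()) return *it;
    uint32_t index = static_cast<uint32_t>(nodes_.size());
    nodes_.push_back(node);  // the only stored copy of the node
    table_.insert(index);
    return index;
  }

  size_t size() const { return nodes_.size(); }

 private:
  std::vector<Node> nodes_;
  Lookup lookup_;
  std::unordered_set<uint32_t, IndexHash, IndexEq> table_;
};
```

The table holds only 4-byte indices, so each node lives once in `nodes_`, and the field never enters the hash because the table itself is per field.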

Commit 2: Pack the field into handles, store nodes in per-field arenas.
PacketSetHandle and PacketTransformerHandle now encode a (field, slot) pair in their 32 bits (10/22 split), and nodes live in one arena per field. This buys:

Field reads without node loads. Every recursive step consults fields to maintain the field-ordering invariant; And() now canonicalizes argument order on handle bits alone and never loads the higher-field operand's node in its field-mismatch case.
Level-clustered memory layout. Operations sweep the node graph level by level; same-field nodes now sit together.
Cheap pages. Arena pages shrink 64 MiB → 16 KiB, so short-lived managers recycle pages through malloc freelists instead of paying an mmap/munmap syscall pair per compilation (measured via strace -c; sizes above ~128 KiB hit glibc's mmap/trim thresholds).
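The (field, slot) packing can be sketched as follows. This is an illustration under assumptions, not the PR's encoding: it assumes the field occupies the high 10 bits and the slot the low 22, and the names `Handle`, `MakeHandle`, and `Canonicalize` are hypothetical. Plain `assert` stands in for the CHECKs the description mentions.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

constexpr int kFieldBits = 10;
constexpr int kSlotBits = 22;
constexpr uint32_t kSlotMask = (uint32_t{1} << kSlotBits) - 1;
// The top field index is reserved (the description uses it for sentinels).
constexpr uint32_t kSentinelField = (uint32_t{1} << kFieldBits) - 1;
constexpr uint32_t kMaxFields = kSentinelField;           // 1023 usable fields
constexpr uint32_t kMaxSlots = uint32_t{1} << kSlotBits;  // ~4M nodes per field

struct Handle {
  uint32_t bits;
  // Both accessors read handle bits only; no node is loaded from memory.
  uint32_t field() const { return bits >> kSlotBits; }
  uint32_t slot() const { return bits & kSlotMask; }
  bool is_sentinel() const { return field() == kSentinelField; }
};

// Packs (field, slot), enforcing the capacity caps at creation time.
inline Handle MakeHandle(uint32_t field, uint32_t slot) {
  assert(field < kMaxFields);
  assert(slot < kMaxSlots);
  return Handle{(field << kSlotBits) | slot};
}

// Orders the operands of a commutative operation by comparing raw bits.
// With the field in the high bits this sorts by field first, then by slot.
inline std::pair<Handle, Handle> Canonicalize(Handle a, Handle b) {
  if (a.bits > b.bits) std::swap(a, b);
  return {a, b};
}
```

Under this layout a field-mismatch check such as `a.field() < b.field()` costs two shifts and a compare, which is the property the list above relies on.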

Benchmarks

Micro-benchmarks (medians of 5, -c opt, vs main):

Benchmark	main	this PR
BM_FirstTimeCompileNonOverlappingPredicate	45.0µs	38.2µs
BM_ReCompileNonOverlappingPredicate	641ns	682ns
BM_FirstTimeCompileOverlappingPredicate	11.1µs	3.7µs (3×)
BM_ReCompileOverlappingPredicate	649ns	668ns
BM_FirstTimeCompileNonOverlappingPolicy	310.5µs	268.8µs
BM_ReCompileNonOverlappingPolicy	4.23µs	4.27µs
BM_FirstTimeCompileOverlappingPolicy	46.5µs	31.1µs
BM_ReCompileOverlappingPolicy	4.12µs	4.08µs

On the scale benchmark (#103, realistically large inputs): consistent −10–42% at small/medium sizes (e.g. CompileForwardingTable/512: 34.4ms → 27.7ms), compressing to −1–9% at the largest sizes where runtime is dominated by the known algorithmic issues (no operation memoization, b/382379263) that this PR deliberately does not address — it is the memory/locality/constant-factor layer underneath that future work.

Design notes

Bit split: 10/22 supports 1023 fields and ~4M nodes per field, enforced by CHECKs at node creation (the reserved top field index encodes the existing sentinels unchanged). The old flat encoding allowed ~4B nodes total; per-field stats from real models should inform revisiting the split. An 8/24 split is a one-line change if preferred.
Interning probes before appending (heterogeneous lookup), keeping the common already-interned case copy- and allocation-free.
Manager moves: the table functors reference node storage through a heap-allocated location slot; the now-manual move operations repoint it, keeping moves O(1).
Handles print as field:slot (e.g. PacketSetHandle<1:7>), making debug output and the regenerated goldens more informative.
Nodes keep their (now redundant) field member: removing it saves nothing due to padding; CheckInternalInvariants verifies consistency.
Reordering-readiness: handles encode the field id, which today coincides with the level; dynamic reordering, if ever wanted, adds a permutation table without changing the encoding.
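The "Manager moves" note can be illustrated with a small sketch. It assumes a simplified manager whose nodes are plain integers, and, unlike the PR, it appends before probing to keep the example short. `Manager`, `Storage`, `DerefHash`, and `DerefEq` are hypothetical names. The point shown is only the heap-allocated location slot that the move constructor repoints.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <unordered_set>
#include <utility>
#include <vector>

using Storage = std::vector<uint64_t>;  // stand-in for node storage

// The functors hold a pointer to a heap-allocated slot that in turn holds
// the current address of the storage, so they never point at a manager.
struct DerefHash {
  Storage* const* slot;
  size_t operator()(uint32_t i) const {
    return std::hash<uint64_t>{}((**slot)[i]);
  }
};
struct DerefEq {
  Storage* const* slot;
  bool operator()(uint32_t a, uint32_t b) const {
    return (**slot)[a] == (**slot)[b];
  }
};

class Manager {
 public:
  Manager()
      : slot_(std::make_unique<Storage*>(&nodes_)),
        table_(16, DerefHash{slot_.get()}, DerefEq{slot_.get()}) {}

  // O(1) move: steal storage, slot, and table, then repoint the slot at
  // this manager's storage. The moved-from manager must not be used again.
  Manager(Manager&& other) noexcept
      : nodes_(std::move(other.nodes_)),
        slot_(std::move(other.slot_)),
        table_(std::move(other.table_)) {
    *slot_ = &nodes_;
  }
  Manager& operator=(Manager&&) = delete;  // omitted for brevity

  // Simplification: append first and undo on a duplicate.
  uint32_t Intern(uint64_t value) {
    nodes_.push_back(value);
    uint32_t index = static_cast<uint32_t>(nodes_.size() - 1);
    auto result = table_.insert(index);
    if (!result.second) nodes_.pop_back();
    return *result.first;
  }

  size_t size() const { return nodes_.size(); }

 private:
  Storage nodes_;
  std::unique_ptr<Storage*> slot_;
  std::unordered_set<uint32_t, DerefHash, DerefEq> table_;
};
```

Because the table's functors are copied along with the table on move, they keep pointing at the same heap slot; only the slot's contents change, so no rehash or functor rewrite is needed.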

Testing

All 17 bazel test targets pass at each commit, including the fuzz tests comparing against reference implementations. Golden .expected files are regenerated for the new handle format.

🤖 Generated with Claude Code

…ode index.

Previously, the unique tables (packet_by_node_, transformer_by_node_) stored a full copy of every decision node as the map key, duplicating all node storage. This was especially costly for PacketTransformerManager, whose nodes own heap-allocated btree maps. Now the tables store 4-byte node indices and hash/compare by dereferencing into nodes_, so each node is stored exactly once.

Splitting the tables by packet field keeps each table - and thus its probe sequences - small, and lays the groundwork for storing the nodes themselves by field/level (standard practice in BDD packages, where it improves locality and enables variable reordering). A follow-up change builds on this.

Implementation notes:
* Interning probes the table with the candidate node via heterogeneous lookup (no node is added to nodes_ unless it is new), keeping the common already-interned case copy- and allocation-free.
* The table functors reference nodes_ through a heap-allocated location slot so that they survive manager moves; move operations repoint the slot.

Benchmarks are flat overall: first-time compilation is unchanged, recompile microbenchmarks regress by ~8% (~50ns) due to the indirection in the unique tables, in exchange for halving node-related memory.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

google-cla · 2026-06-10T10:42:55Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

smolkaj · 2026-06-10T11:05:02Z

Measured this PR stack on the new scale benchmark from #103 (realistically large inputs; medians of 3, -c opt, same machine, back-to-back):

Benchmark	main	#100	#102 (stack)
CompileForwardingTable/64	513µs	427µs	408µs (−20%)
CompileForwardingTable/512	34.4ms	27.9ms	27.7ms (−19%)
CompileForwardingTable/4096	2.98s	2.76s	2.72s (−9%)
CompileAclAsPacketSet/16	45.3µs	37.4µs	36.9µs (−19%)
CompileAclAsPacketSet/64	427µs	383µs	381µs (−11%)
CompileAclAsPacketSet/256	5.60ms	5.42ms	5.53ms (−1%)
CompileAclDifference/16	89.4µs	70.1µs	70.2µs (−21%)
CompileAclDifference/256	11.0ms	11.1ms	11.3ms (+2%)
CompileRingReachability/2	39.9µs	36.6µs	23.1µs (−42%)
CompileRingReachability/8	2.12ms	2.07ms	2.04ms (−4%)
CompileRingReachability/32	414ms	411ms	409ms (−1%)

Two takeaways:

The stack helps consistently at small/medium scale (−10–42%), and #100's node-copy elimination shows up much more here than in the micro-benchmarks: interning wide forwarding-table nodes previously copied a heap-backed branch array per duplicate probe.
At the largest sizes the gains wash out (−1–9%): runtime is dominated by the known algorithmic issues (no operation memoization, b/382379263; map copies in the transformer combinators). That's the expected division of labor: this stack is the memory/locality/constant-factor layer; the asymptotic wins need the memoization work, which #103 now makes measurable.

Builds on the per-field unique tables: node storage itself is now organized by field/level, the standard organization in BDD packages. PacketSetHandle and PacketTransformerHandle now encode a (field, slot) pair (10/22 bit split) instead of a flat index, and the managers store nodes in one arena per field.

This yields two performance properties:
* The field of a node is available directly from the handle, without loading the node from memory. Operations consult the field on every recursive step to maintain the field-ordering invariant; e.g. And() now decides its field-mismatch case and canonicalizes argument order purely on handle bits, and never loads the higher-field operand's node in the mismatch case.
* Nodes of the same field are clustered in memory. Operations tend to sweep the graph level by level, improving locality on large models.

Benchmarks are flat to substantially better; most notably, first-time compilation of a small predicate improves 3x (10.8us -> 3.7us). This is because arena pages shrink from 64 MiB to 16 KiB: short-lived managers previously paid an mmap/munmap pair per compilation, while small pages are recycled through the allocator's freelists (this also motivates 16 KiB over larger sizes, which exceed typical malloc mmap/trim thresholds).

Tradeoffs and choices:
* The 10/22 split supports 1023 fields and ~4M nodes per field, enforced by CHECKs at node creation. The previous flat encoding supported ~4B nodes total; per-field stats from real models should inform revisiting the split.
* Handles now print as field:slot (e.g. PacketSetHandle<1:7>), making debug output more informative; the golden files are regenerated accordingly.
* Nodes keep their (now redundant) field member: removing it saves no memory due to padding, and keeping it avoids churning every node-construction site. CheckInternalInvariants verifies it stays consistent with the arena.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
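To make the page-size point concrete, here is a hedged sketch of a paged arena with 16 KiB pages. The PR's actual arena is not shown on this page, so the class `PagedArena`, its interface, and the slot-to-page arithmetic are all assumptions. Each page is a separate allocation small enough that a typical malloc can serve and recycle it from its freelists rather than going through mmap/munmap.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Append-only storage split into fixed-size pages. Elements never move once
// written, and each page is one small heap allocation.
template <typename T, size_t kPageBytes = 16 * 1024>
class PagedArena {
 public:
  static constexpr size_t kPerPage = kPageBytes / sizeof(T);
  static_assert(kPageBytes >= sizeof(T), "page too small for T");

  // Appends `value` and returns its slot number within this arena.
  uint32_t Append(const T& value) {
    if (size_ == pages_.size() * kPerPage) {
      // All existing pages are full: allocate one more page.
      pages_.push_back(std::make_unique<T[]>(kPerPage));
    }
    pages_[size_ / kPerPage][size_ % kPerPage] = value;
    return static_cast<uint32_t>(size_++);
  }

  // Maps a slot to (page, offset) with one division and one remainder.
  const T& operator[](uint32_t slot) const {
    return pages_[slot / kPerPage][slot % kPerPage];
  }

  size_t size() const { return size_; }
  size_t page_count() const { return pages_.size(); }

 private:
  std::vector<std::unique_ptr<T[]>> pages_;
  size_t size_ = 0;
};
```

A per-field layout would then keep one such arena per field, for example a `std::vector<PagedArena<Node>>` indexed by the handle's field, with the handle's slot used as the index into that arena.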

smolkaj · 2026-06-10T16:00:25Z

Closing after measuring this stack on the scale benchmark (#103) and attributing the wins to its components:

Component	Measured benefit	Now lives in
Per-field unique tables keyed by node index	~½ node memory; −7–20% on table-heavy compiles	#100 (reopened)
16 KiB arena pages	the 3× first-time-compile win (kills per-manager mmap/munmap)	#105 (one-line change on main)
Packed (field, slot) handles + per-field arenas	+0–4% beyond the above at scale	deferred (this branch, preserved)

The packed-handle layer adds the most complexity (bit-split encoding, parallel per-field vectors, new hard caps of 1023 fields / 4M nodes-per-field vs ~4B total on main) while its real payoffs — level-clustered cache locality and field-reads-from-handle-bits in hot apply loops — can only materialize once operation memoization (b/382379263) makes the backend cache-probe-dominated rather than blowup-dominated. Landing a hard capacity cap on guessed bit budgets, ahead of the benefit, is the wrong order.

Plan of record: land #105 and #100, then memoization with #103 as the yardstick, then revisit this branch with per-field node statistics from real models to choose the field/slot split on evidence. The branch (level-packed-handles) stays on the fork with all benchmarks and design notes in this PR's description.

This was referenced Jun 10, 2026

[NetKAT] Add scale benchmark with realistically large inputs #103

Draft

[NetKAT] Intern decision nodes via per-field unique tables keyed by node index #100

Draft

smolkaj changed the title ~~[NetKAT] Pack the field into handles and store decision nodes by field~~ [NetKAT] Store decision nodes by field: per-field unique tables + level-packed handles Jun 10, 2026

smolkaj force-pushed the level-packed-handles branch from 525eb66 to eb28b80 Compare June 10, 2026 11:15

smolkaj mentioned this pull request Jun 10, 2026

[NetKAT] Shrink node storage pages from 64 MiB to 16 KiB #105

Closed

smolkaj closed this Jun 10, 2026

smolkaj wants to merge 2 commits into google:main from smolkaj:level-packed-handles