
[NetKAT] Store decision nodes by field: per-field unique tables + level-packed handles #102

Closed
smolkaj wants to merge 2 commits into google:main from smolkaj:level-packed-handles

smolkaj commented Jun 10, 2026

Contributor

What

Reorganizes both managers' decision-node storage by packet field (level) — the standard node organization in BDD packages — in two commits that build on each other:

Commit 1: Per-field unique tables keyed by node index.
The unique tables (packet_by_node_, transformer_by_node_) previously stored a full copy of every DecisionNode as the map key, duplicating all node storage (especially costly for transformer nodes, which own heap-allocated btree maps). They are now per-field flat_hash_set<uint32_t>s that hash/compare entries by dereferencing into the node storage — each node is stored exactly once, roughly halving node-related memory. Per-field tables also keep probe sequences short and drop the field from hashing (it's implied by the table).
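
A minimal sketch of the index-keyed interning idea described above, not the actual implementation. The PR uses `flat_hash_set<uint32_t>` with heterogeneous lookup; to stay standard-library-only, this sketch swaps in a small hand-rolled linear-probing table. `Node`, `HashChildren`, and `Interner` are hypothetical names, and the node layout is simplified to a field plus two child handles.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, simplified decision node (assumption: two children only).
struct Node {
  uint32_t field;
  uint32_t true_branch;
  uint32_t false_branch;
  bool operator==(const Node& o) const {
    return field == o.field && true_branch == o.true_branch &&
           false_branch == o.false_branch;
  }
};

constexpr uint32_t kEmpty = 0xFFFFFFFFu;  // marks an unused table slot

// Hashes only the children: the field is implied by which table is probed.
inline size_t HashChildren(const Node& n) {
  return std::hash<uint64_t>{}((uint64_t{n.true_branch} << 32) |
                               n.false_branch);
}

class Interner {
 public:
  explicit Interner(size_t num_fields) : tables_(num_fields) {}

  // Returns the index of an equal node, appending the candidate to nodes_
  // only if no equal node has been interned before.
  uint32_t Intern(const Node& candidate) {
    assert(candidate.field < tables_.size());
    Table& table = tables_[candidate.field];
    if ((table.size + 1) * 2 > table.slots.size()) Grow(table);
    size_t i = Probe(table, candidate);
    if (table.slots[i] != kEmpty) return table.slots[i];  // already interned
    nodes_.push_back(candidate);  // the node is stored exactly once
    table.slots[i] = static_cast<uint32_t>(nodes_.size() - 1);
    ++table.size;
    return table.slots[i];
  }

  size_t num_nodes() const { return nodes_.size(); }

 private:
  // One table per field; entries are 4-byte indices into nodes_.
  struct Table {
    std::vector<uint32_t> slots;
    size_t size = 0;
  };

  // Linear probing: returns the slot holding an equal node, or the first
  // empty slot. Entries are compared by dereferencing into nodes_.
  size_t Probe(const Table& table, const Node& candidate) const {
    size_t mask = table.slots.size() - 1;  // slot count is a power of two
    size_t i = HashChildren(candidate) & mask;
    while (table.slots[i] != kEmpty &&
           !(nodes_[table.slots[i]] == candidate)) {
      i = (i + 1) & mask;
    }
    return i;
  }

  // Doubles the table (load factor stays at or below 1/2) and rehashes.
  void Grow(Table& table) {
    std::vector<uint32_t> old = std::move(table.slots);
    table.slots.assign(old.empty() ? 8 : old.size() * 2, kEmpty);
    for (uint32_t index : old) {
      if (index != kEmpty) table.slots[Probe(table, nodes_[index])] = index;
    }
  }

  std::vector<Node> nodes_;
  std::vector<Table> tables_;
};
```

Interning the same `Node` twice returns the same index and leaves `num_nodes()` unchanged, which is the "probe before append" behavior the design notes rely on.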

Commit 2: Pack the field into handles, store nodes in per-field arenas.
PacketSetHandle and PacketTransformerHandle now encode a (field, slot) pair in their 32 bits (10/22 split), and nodes live in one arena per field. This buys:

  • Field reads without node loads. Every recursive step consults fields to maintain the field-ordering invariant; And() now canonicalizes argument order on handle bits alone and never loads the higher-field operand's node in its field-mismatch case.
  • Level-clustered memory layout. Operations sweep the node graph level by level; same-field nodes now sit together.
  • Cheap pages. Arena pages shrink 64 MiB → 16 KiB, so short-lived managers recycle pages through malloc freelists instead of paying an mmap/munmap syscall pair per compilation (measured via strace -c; sizes above ~128 KiB hit glibc's mmap/trim thresholds).
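
A sketch of what a 10/22 packed handle could look like. The names (`Handle`, `MakeHandle`, `FieldOf`, `SlotOf`, `CanonicalOrder`) are hypothetical, and placing the field in the high bits is an assumption made here so that comparing raw handle bits orders by field first; the PR text only states the split and the reserved top field index.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>

constexpr uint32_t kFieldBits = 10;
constexpr uint32_t kSlotBits = 22;
constexpr uint32_t kMaxSlot = (1u << kSlotBits) - 1;
// Top field index is reserved (for sentinels in the PR), leaving 1023 fields.
constexpr uint32_t kReservedField = (1u << kFieldBits) - 1;

struct Handle {
  uint32_t bits;
};

// Assumption: field in the high 10 bits, slot in the low 22 bits.
inline Handle MakeHandle(uint32_t field, uint32_t slot) {
  assert(field < kReservedField);  // stand-in for the CHECK at node creation
  assert(slot <= kMaxSlot);
  return Handle{(field << kSlotBits) | slot};
}

// Reading the field is a shift on the handle; no node is loaded.
inline uint32_t FieldOf(Handle h) { return h.bits >> kSlotBits; }
inline uint32_t SlotOf(Handle h) { return h.bits & kMaxSlot; }

// Orders two operands so the lower field comes first, using handle bits only.
inline std::pair<Handle, Handle> CanonicalOrder(Handle a, Handle b) {
  if (a.bits > b.bits) std::swap(a, b);
  return {a, b};
}
```

With this layout a commutative operation such as `And()` could canonicalize its arguments before touching node memory, and in the field-mismatch case only the lower-field operand's node needs to be loaded.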

Benchmarks

Micro-benchmarks (medians of 5, -c opt, vs main):

| Benchmark | main | this PR |
|---|---|---|
| BM_FirstTimeCompileNonOverlappingPredicate | 45.0µs | 38.2µs |
| BM_ReCompileNonOverlappingPredicate | 641ns | 682ns |
| BM_FirstTimeCompileOverlappingPredicate | 11.1µs | 3.7µs (3×) |
| BM_ReCompileOverlappingPredicate | 649ns | 668ns |
| BM_FirstTimeCompileNonOverlappingPolicy | 310.5µs | 268.8µs |
| BM_ReCompileNonOverlappingPolicy | 4.23µs | 4.27µs |
| BM_FirstTimeCompileOverlappingPolicy | 46.5µs | 31.1µs |
| BM_ReCompileOverlappingPolicy | 4.12µs | 4.08µs |

On the scale benchmark (#103, realistically large inputs), the stack is consistently 10–42% faster at small and medium sizes (e.g. CompileForwardingTable/512: 34.4ms → 27.7ms). The gain compresses to 1–9% at the largest sizes, where runtime is dominated by the known algorithmic issues (no operation memoization, b/382379263) that this PR deliberately does not address: it is the memory/locality/constant-factor layer underneath that future work.

Design notes

  • Bit split: 10/22 supports 1023 fields and ~4M nodes per field, enforced by CHECKs at node creation (the reserved top field index encodes the existing sentinels unchanged). The old flat encoding allowed ~4B nodes total; per-field stats from real models should inform revisiting the split. An 8/24 split is a one-line change if preferred.
  • Interning probes before appending (heterogeneous lookup), keeping the common already-interned case copy- and allocation-free.
  • Manager moves: the table functors reference node storage through a heap-allocated location slot; the now-manual move operations repoint it, keeping moves O(1).
  • Handles print as field:slot (e.g. PacketSetHandle<1:7>), making debug output and the regenerated goldens more informative.
  • Nodes keep their (now redundant) field member: removing it saves nothing due to padding; CheckInternalInvariants verifies consistency.
  • Reordering-readiness: handles encode the field id, which today coincides with the level; dynamic reordering, if ever wanted, adds a permutation table without changing the encoding.
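
The capacity arithmetic behind the bit-split note can be checked at compile time. This is a sketch: `SplitCapacity` and `Capacity` are hypothetical names, and it assumes any split keeps the same convention of reserving the top field index. The node count is the raw upper bound from the slot bits.

```cpp
#include <cstdint>

struct SplitCapacity {
  uint64_t max_fields;           // usable field ids (top index reserved)
  uint64_t max_nodes_per_field;  // slots addressable within one field
};

// Capacity of a 32-bit handle that spends `field_bits` bits on the field.
constexpr SplitCapacity Capacity(unsigned field_bits) {
  return SplitCapacity{(uint64_t{1} << field_bits) - 1,
                       uint64_t{1} << (32 - field_bits)};
}

// 10/22 split: 1023 fields, 2^22 (about 4M) nodes per field.
static_assert(Capacity(10).max_fields == 1023, "10/22 fields");
static_assert(Capacity(10).max_nodes_per_field == 4194304, "10/22 slots");

// 8/24 alternative: 255 fields, 2^24 (about 16.8M) nodes per field.
static_assert(Capacity(8).max_fields == 255, "8/24 fields");
static_assert(Capacity(8).max_nodes_per_field == 16777216, "8/24 slots");
```

For comparison, a flat 32-bit index addresses 2^32 (about 4.3B) nodes in total, with no per-field limit.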

Testing

All 17 bazel test targets pass at each commit, including the fuzz tests comparing against reference implementations. Golden .expected files are regenerated for the new handle format.

🤖 Generated with Claude Code

…ode index.

Previously, the unique tables (packet_by_node_, transformer_by_node_) stored a
full copy of every decision node as the map key, duplicating all node storage.
This was especially costly for PacketTransformerManager, whose nodes own
heap-allocated btree maps. Now the tables store 4-byte node indices and
hash/compare by dereferencing into nodes_, so each node is stored exactly once.

Splitting the tables by packet field keeps each table - and thus its probe
sequences - small, and lays the groundwork for storing the nodes themselves by
field/level (standard practice in BDD packages, where it improves locality and
enables variable reordering). A follow-up change builds on this.

Implementation notes:
* Interning probes the table with the candidate node via heterogeneous lookup
  (no node is added to nodes_ unless it is new), keeping the common
  already-interned case copy- and allocation-free.
* The table functors reference nodes_ through a heap-allocated location slot
  so that they survive manager moves; move operations repoint the slot.
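
One way the location-slot idea could look, as a sketch under assumptions rather than the code in this change: `NodesPtr`, `NodeEq`, and `Manager` are hypothetical, the node is simplified, and only an equality functor is shown (a hash functor would hold the same slot pointer).

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <utility>
#include <vector>

struct Node {
  uint32_t field;
  uint32_t value;
};

using NodesPtr = const std::vector<Node>*;

// Functor stored inside a table. It holds the address of a heap-allocated
// slot, and the slot holds the address of the manager's current node vector.
struct NodeEq {
  const NodesPtr* location;
  bool operator()(uint32_t a, uint32_t b) const {
    const std::vector<Node>& nodes = **location;
    return nodes[a].field == nodes[b].field &&
           nodes[a].value == nodes[b].value;
  }
};

class Manager {
 public:
  Manager()
      : location_(std::make_unique<NodesPtr>(&nodes_)),
        eq_{location_.get()} {}

  // The heap slot keeps its address across the move, so the functor stays
  // valid; only the slot's contents are repointed. O(1), no rehashing.
  Manager(Manager&& other) noexcept
      : nodes_(std::move(other.nodes_)),
        location_(std::move(other.location_)),
        eq_(other.eq_) {
    *location_ = &nodes_;
  }

  Manager& operator=(Manager&& other) noexcept {
    nodes_ = std::move(other.nodes_);
    location_ = std::move(other.location_);
    eq_ = other.eq_;
    *location_ = &nodes_;
    return *this;
  }

  uint32_t Add(Node node) {
    nodes_.push_back(node);
    return static_cast<uint32_t>(nodes_.size() - 1);
  }
  bool SameNode(uint32_t a, uint32_t b) const { return eq_(a, b); }
  size_t num_nodes() const { return nodes_.size(); }

 private:
  std::vector<Node> nodes_;             // declared first: location_ points here
  std::unique_ptr<NodesPtr> location_;  // the heap-allocated location slot
  NodeEq eq_;                           // would live inside the hash set
};
```

A moved-from `Manager` in this sketch has a null slot and must not be used until reassigned.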

Benchmarks are flat overall: first-time compilation is unchanged, recompile
microbenchmarks regress by ~8% (~50ns) due to the indirection in the unique
tables, in exchange for halving node-related memory.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

google-cla Bot commented Jun 10, 2026


Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

smolkaj commented Jun 10, 2026

Contributor Author

Measured this PR stack on the new scale benchmark from #103 (realistically large inputs; medians of 3, -c opt, same machine, back-to-back):

| Benchmark | main | #100 | #102 (stack) |
|---|---|---|---|
| CompileForwardingTable/64 | 513µs | 427µs | 408µs (−20%) |
| CompileForwardingTable/512 | 34.4ms | 27.9ms | 27.7ms (−19%) |
| CompileForwardingTable/4096 | 2.98s | 2.76s | 2.72s (−9%) |
| CompileAclAsPacketSet/16 | 45.3µs | 37.4µs | 36.9µs (−19%) |
| CompileAclAsPacketSet/64 | 427µs | 383µs | 381µs (−11%) |
| CompileAclAsPacketSet/256 | 5.60ms | 5.42ms | 5.53ms (−1%) |
| CompileAclDifference/16 | 89.4µs | 70.1µs | 70.2µs (−21%) |
| CompileAclDifference/256 | 11.0ms | 11.1ms | 11.3ms (+2%) |
| CompileRingReachability/2 | 39.9µs | 36.6µs | 23.1µs (−42%) |
| CompileRingReachability/8 | 2.12ms | 2.07ms | 2.04ms (−4%) |
| CompileRingReachability/32 | 414ms | 411ms | 409ms (−1%) |

Two takeaways:

  1. The stack helps consistently at small/medium scale (−10–42%), and #100's node-copy elimination shows up much more here than in the micro-benchmarks: interning wide forwarding-table nodes previously copied a heap-backed branch array per duplicate probe.
  2. At the largest sizes the gains wash out (−1–9%): runtime is dominated by the known algorithmic issues (no operation memoization, b/382379263; map copies in the transformer combinators). That's the expected division of labor: this stack is the memory/locality/constant-factor layer; the asymptotic wins need the memoization work, which #103 now makes measurable.

smolkaj changed the title from "[NetKAT] Pack the field into handles and store decision nodes by field" to "[NetKAT] Store decision nodes by field: per-field unique tables + level-packed handles" on Jun 10, 2026

Builds on the per-field unique tables: node storage itself is now organized
by field/level, the standard organization in BDD packages.

PacketSetHandle and PacketTransformerHandle now encode a (field, slot) pair
(10/22 bit split) instead of a flat index, and the managers store nodes in
one arena per field. This yields two performance properties:

* The field of a node is available directly from the handle, without loading
  the node from memory. Operations consult the field on every recursive step
  to maintain the field-ordering invariant; e.g. And() now decides its
  field-mismatch case and canonicalizes argument order purely on handle bits,
  and never loads the higher-field operand's node in the mismatch case.
* Nodes of the same field are clustered in memory. Operations tend to sweep
  the graph level by level, improving locality on large models.

Benchmarks are flat to substantially better; most notably, first-time
compilation of a small predicate improves 3x (10.8us -> 3.7us). This is
because arena pages shrink from 64 MiB to 16 KiB: short-lived managers
previously paid an mmap/munmap pair per compilation, while small pages are
recycled through the allocator's freelists (this also motivates 16 KiB over
larger sizes, which exceed typical malloc mmap/trim thresholds).
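
A sketch of a paged arena with 16 KiB pages, to illustrate the page-size point above. `PagedArena` is a hypothetical name and not the arena used by the managers; it assumes `T` is default-constructible and copy-assignable, and it relies on `operator new[]` being served by the allocator's ordinary small-allocation path for blocks of this size.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

template <typename T, size_t kPageBytes = 16 * 1024>
class PagedArena {
 public:
  // Number of elements that fit into one page.
  static constexpr size_t kPerPage = kPageBytes / sizeof(T);
  static_assert(kPageBytes >= sizeof(T), "page too small for T");

  // Appends a value and returns its slot. A new page is allocated only when
  // the current one is full; existing elements never move.
  uint32_t Append(const T& value) {
    if (size_ % kPerPage == 0) {
      pages_.push_back(std::make_unique<T[]>(kPerPage));
    }
    pages_[size_ / kPerPage][size_ % kPerPage] = value;
    return static_cast<uint32_t>(size_++);
  }

  const T& operator[](uint32_t slot) const {
    return pages_[slot / kPerPage][slot % kPerPage];
  }

  size_t size() const { return size_; }
  size_t num_pages() const { return pages_.size(); }

 private:
  std::vector<std::unique_ptr<T[]>> pages_;
  size_t size_ = 0;
};
```

A short-lived arena of this shape allocates and frees a handful of 16 KiB blocks, which stay below the ~128 KiB range where the PR description reports glibc switching to mmap.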

Tradeoffs and choices:
* The 10/22 split supports 1023 fields and ~4M nodes per field, enforced by
  CHECKs at node creation. The previous flat encoding supported ~4B nodes
  total; per-field stats from real models should inform revisiting the split.
* Handles now print as field:slot (e.g. PacketSetHandle<1:7>), making debug
  output more informative; the golden files are regenerated accordingly.
* Nodes keep their (now redundant) field member: removing it saves no memory
  due to padding, and keeping it avoids churning every node-construction
  site. CheckInternalInvariants verifies it stays consistent with the arena.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

smolkaj commented Jun 10, 2026

Contributor Author

Closing after measuring this stack on the scale benchmark (#103) and attributing the wins to its components:

| Component | Measured benefit | Now lives in |
|---|---|---|
| Per-field unique tables keyed by node index | ~½ node memory; −7–20% on table-heavy compiles | #100 (reopened) |
| 16 KiB arena pages | the 3× first-time-compile win (kills per-manager mmap/munmap) | #105 (one-line change on main) |
| Packed (field, slot) handles + per-field arenas | +0–4% beyond the above at scale | deferred (this branch, preserved) |

The packed-handle layer adds the most complexity (bit-split encoding, parallel per-field vectors, new hard caps of 1023 fields / 4M nodes-per-field vs ~4B total on main) while its real payoffs — level-clustered cache locality and field-reads-from-handle-bits in hot apply loops — can only materialize once operation memoization (b/382379263) makes the backend cache-probe-dominated rather than blowup-dominated. Landing a hard capacity cap on guessed bit budgets, ahead of the benefit, is the wrong order.

Plan of record: land #105 and #100, then memoization with #103 as the yardstick, then revisit this branch with per-field node statistics from real models to choose the field/slot split on evidence. The branch (level-packed-handles) stays on the fork with all benchmarks and design notes in this PR's description.

smolkaj closed this Jun 10, 2026