[NetKAT] Store decision nodes by field: per-field unique tables + level-packed handles #102
smolkaj wants to merge 2 commits into
…ode index.

Previously, the unique tables (packet_by_node_, transformer_by_node_) stored a full copy of every decision node as the map key, duplicating all node storage. This was especially costly for PacketTransformerManager, whose nodes own heap-allocated btree maps. Now the tables store 4-byte node indices and hash/compare by dereferencing into nodes_, so each node is stored exactly once.

Splitting the tables by packet field keeps each table - and thus its probe sequences - small, and lays the groundwork for storing the nodes themselves by field/level (standard practice in BDD packages, where it improves locality and enables variable reordering). A follow-up change builds on this.

Implementation notes:
* Interning probes the table with the candidate node via heterogeneous lookup (no node is added to nodes_ unless it is new), keeping the common already-interned case copy- and allocation-free.
* The table functors reference nodes_ through a heap-allocated location slot so that they survive manager moves; move operations repoint the slot.

Benchmarks are flat overall: first-time compilation is unchanged, recompile microbenchmarks regress by ~8% (~50ns) due to the indirection in the unique tables, in exchange for halving node-related memory.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
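For readers unfamiliar with index-keyed unique tables, here is a minimal sketch of the idea, not the code in this PR. It swaps the PR's flat_hash_set with heterogeneous lookup for a small hand-rolled open-addressing table so that it compiles with the standard library alone. The names Node, HashNode and Interner are hypothetical, and the node layout (a field plus two child indices) is a simplification.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical, simplified decision node: a field plus two child indices.
struct Node {
  uint32_t field;
  uint32_t lo;
  uint32_t hi;
  bool operator==(const Node& o) const {
    return field == o.field && lo == o.lo && hi == o.hi;
  }
};

inline size_t HashNode(const Node& n) {
  size_t h = std::hash<uint32_t>{}(n.field);
  h = h * 1000003u ^ std::hash<uint32_t>{}(n.lo);
  h = h * 1000003u ^ std::hash<uint32_t>{}(n.hi);
  return h;
}

constexpr uint32_t kEmpty = 0xFFFFFFFFu;  // Marks an unused table slot.

class Interner {
 public:
  Interner() : slots_(16, kEmpty) {}

  // Returns the index of `candidate`, appending it to nodes_ only if no
  // equal node exists. The probe compares against nodes_ directly, so an
  // already-interned candidate is never copied.
  uint32_t Intern(const Node& candidate) {
    if ((nodes_.size() + 1) * 2 > slots_.size()) Grow();
    const size_t mask = slots_.size() - 1;
    size_t i = HashNode(candidate) & mask;
    while (slots_[i] != kEmpty) {
      if (nodes_[slots_[i]] == candidate) return slots_[i];
      i = (i + 1) & mask;  // Linear probing.
    }
    const uint32_t index = static_cast<uint32_t>(nodes_.size());
    nodes_.push_back(candidate);
    slots_[i] = index;
    return index;
  }

  size_t size() const { return nodes_.size(); }

 private:
  // Doubles the table and re-inserts every index by rehashing its node.
  void Grow() {
    std::vector<uint32_t> bigger(slots_.size() * 2, kEmpty);
    const size_t mask = bigger.size() - 1;
    for (uint32_t index = 0; index < nodes_.size(); ++index) {
      size_t i = HashNode(nodes_[index]) & mask;
      while (bigger[i] != kEmpty) i = (i + 1) & mask;
      bigger[i] = index;
    }
    slots_.swap(bigger);
  }

  std::vector<Node> nodes_;      // Each node is stored exactly once.
  std::vector<uint32_t> slots_;  // The table holds 4-byte indices only.
};
```

Because the table entries are indices rather than pointers or functors bound to a particular nodes_ object, this sketch sidesteps the move-safety issue that the location slot addresses in the real change.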
Measured this PR stack on the new scale benchmark from #103 (realistically large inputs; medians of 3,
Two takeaways:
Builds on the per-field unique tables: node storage itself is now organized by field/level, the standard organization in BDD packages. PacketSetHandle and PacketTransformerHandle now encode a (field, slot) pair (10/22 bit split) instead of a flat index, and the managers store nodes in one arena per field.

This yields two performance properties:
* The field of a node is available directly from the handle, without loading the node from memory. Operations consult the field on every recursive step to maintain the field-ordering invariant; e.g. And() now decides its field-mismatch case and canonicalizes argument order purely on handle bits, and never loads the higher-field operand's node in the mismatch case.
* Nodes of the same field are clustered in memory. Operations tend to sweep the graph level by level, improving locality on large models.

Benchmarks are flat to substantially better; most notably, first-time compilation of a small predicate improves 3x (10.8us -> 3.7us). This is because arena pages shrink from 64 MiB to 16 KiB: short-lived managers previously paid an mmap/munmap pair per compilation, while small pages are recycled through the allocator's freelists (this also motivates 16 KiB over larger sizes, which exceed typical malloc mmap/trim thresholds).

Tradeoffs and choices:
* The 10/22 split supports 1023 fields and ~4M nodes per field, enforced by CHECKs at node creation. The previous flat encoding supported ~4B nodes total; per-field stats from real models should inform revisiting the split.
* Handles now print as field:slot (e.g. PacketSetHandle<1:7>), making debug output more informative; the golden files are regenerated accordingly.
* Nodes keep their (now redundant) field member: removing it saves no memory due to padding, and keeping it avoids churning every node-construction site. CheckInternalInvariants verifies it stays consistent with the arena.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
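To make the 10/22 split concrete, the following is a rough sketch of how a (field, slot) pair might be packed into 32 bits. It is an illustration under assumptions, not the PR's encoding: placing the field in the high bits, the helper names (Fits, Pack, FieldOf, SlotOf, Canonicalize), and the reading that one of the 1024 encodable field values is held back are all hypothetical.

```cpp
#include <cstdint>
#include <utility>

constexpr uint32_t kFieldBits = 10;
constexpr uint32_t kSlotBits = 22;
constexpr uint32_t kSlotMask = (1u << kSlotBits) - 1;

// The commit message states a limit of 1023 fields. Assumption: one of the
// 1024 encodable values is held back, which is why this is not 1024.
constexpr uint32_t kMaxFields = (1u << kFieldBits) - 1;
constexpr uint32_t kMaxSlotsPerField = 1u << kSlotBits;  // ~4M

// Capacity check corresponding to the CHECKs at node creation.
constexpr bool Fits(uint32_t field, uint32_t slot) {
  return field < kMaxFields && slot < kMaxSlotsPerField;
}

// Assumption: field occupies the high bits, so comparing two handles as
// integers compares their fields first.
constexpr uint32_t Pack(uint32_t field, uint32_t slot) {
  return (field << kSlotBits) | slot;
}

constexpr uint32_t FieldOf(uint32_t handle) { return handle >> kSlotBits; }
constexpr uint32_t SlotOf(uint32_t handle) { return handle & kSlotMask; }

// Orders two handles using handle bits alone; no node is loaded.
inline std::pair<uint32_t, uint32_t> Canonicalize(uint32_t a, uint32_t b) {
  return a <= b ? std::make_pair(a, b) : std::make_pair(b, a);
}
```

With this layout, an operation can detect a field mismatch with `FieldOf(a) != FieldOf(b)` before touching either node, which is the property the first bullet above describes.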
Closing after measuring this stack on the scale benchmark (#103) and attributing the wins to its components:
The packed-handle layer adds the most complexity (bit-split encoding, parallel per-field vectors, new hard caps of 1023 fields / 4M nodes-per-field vs ~4B total on main), while its real payoffs (level-clustered cache locality and field-reads-from-handle-bits in hot apply loops) can only materialize once operation memoization (b/382379263) makes the backend cache-probe-dominated rather than blowup-dominated. Landing a hard capacity cap on guessed bit budgets, ahead of the benefit, is the wrong order.

Plan of record: land #105 and #100, then memoization with #103 as the yardstick, then revisit this branch with per-field node statistics from real models to choose the field/slot split on evidence. The branch (
What
Reorganizes both managers' decision-node storage by packet field (level) — the standard node organization in BDD packages — in two commits that build on each other:
Commit 1: Per-field unique tables keyed by node index.
The unique tables (`packet_by_node_`, `transformer_by_node_`) previously stored a full copy of every `DecisionNode` as the map key, duplicating all node storage (especially costly for transformer nodes, which own heap-allocated btree maps). They are now per-field `flat_hash_set<uint32_t>`s that hash/compare entries by dereferencing into the node storage, so each node is stored exactly once, roughly halving node-related memory. Per-field tables also keep probe sequences short and drop the field from hashing (it's implied by the table).

Commit 2: Pack the field into handles, store nodes in per-field arenas.
`PacketSetHandle` and `PacketTransformerHandle` now encode a (field, slot) pair in their 32 bits (10/22 split), and nodes live in one arena per field. This buys:
* `And()` now canonicalizes argument order on handle bits alone and never loads the higher-field operand's node in its field-mismatch case.
* … `strace -c`; sizes above ~128 KiB hit glibc's mmap/trim thresholds).

Benchmarks
Micro-benchmarks (medians of 5, `-c opt`, vs main):

On the scale benchmark (#103, realistically large inputs): runtime drops by a consistent 10–42% at small/medium sizes (e.g. CompileForwardingTable/512: 34.4ms → 27.7ms), compressing to a 1–9% drop at the largest sizes, where runtime is dominated by the known algorithmic issues (no operation memoization, b/382379263) that this PR deliberately does not address. This PR is the memory/locality/constant-factor layer underneath that future work.
Design notes
* … `PacketSetHandle<1:7>`), making debug output and the regenerated goldens more informative.
* … `CheckInternalInvariants` verifies consistency.

Testing
All 17 bazel test targets pass at each commit, including the fuzz tests comparing against reference implementations. Golden `.expected` files are regenerated for the new handle format.

🤖 Generated with Claude Code