[NetKAT] Speed up PagedStableVector indexing with power-of-two page sizes by smolkaj · Pull Request #101 · google/netkat

smolkaj · 2026-06-10T10:32:08Z

Note

Stacked on #105 — the first two commits shown here belong to that PR. Review only the four commits starting at "Simplify PagedStableVector bookkeeping"; this PR will be rebased once #105 merges.

The win

This PR hardens and instruments the node arena after #105 right-sizes its pages:

Compile-time enforcement that PagedStableVector page sizes are powers of two — the property that makes its operator[] (the hot path of nearly every BDD operation, via GetNodeOrDie) compile to shifts/masks and run 1.1–2.6× faster than with a non-power-of-two page size, matching a flat std::vector on sequential reads. [NetKAT] Shrink node storage pages from 64 MiB to 16 KiB #105 chooses such page sizes; this PR makes choosing anything else a compile error rather than a silent regression.
A latent exception-safety bug fixed in push_back/emplace_back.
Benchmark infrastructure the repo was missing: a PagedStableVector microbenchmark and large-scale (~10^5–10^6 node) packet-set benchmarks — the latter already used to validate [NetKAT] Shrink node storage pages from 64 MiB to 16 KiB #105's claims.

End-to-end, the arena-level changes are worth ~1–3% (see "How much this matters" below); the big wins here are robustness and measurement.

The story

PagedStableVector::operator[] computes index / PageSize and index % PageSize on every access. When the page size is derived from a byte budget — as it was on head, and as #105 initially did — the element count is generally not a power of two, so the compiler emits multiply-based division sequences instead of single shift/mask instructions. The transformer manager only ever escaped this by luck (64-byte nodes happen to divide byte budgets evenly).

With #105 defining both page sizes as power-of-two node counts, this PR adds static_assert(std::has_single_bit(PageSize)) to PagedStableVector itself, turning the convention into a guarantee.

The push_back/emplace_back bookkeeping is also simplified: page boundaries are detected via the last page's size rather than recomputing size() % PageSize per insertion. Besides being cheaper, this fixes a latent invariant bug: if an insertion threw after allocating a fresh page, the next insertion would allocate a second empty page, leaving a hole mid-vector that silently corrupts the index arithmetic for all subsequent elements. (Unreachable today since DecisionNode insertions don't throw, but a footgun for future element types.)

Microbenchmark proof

The power-of-two impact was measured with a temporary variant of the new paged_stable_vector_benchmark instantiating both page-size flavors in the same binary (perfectly load-matched), at L3-resident (256k elements) and DRAM-resident (4M elements) working-set sizes. CPU-time medians, 5 repetitions:

Benchmark	non-pow2 pages	pow2 pages	speedup
Sequential read, 256k elems (L3)	222.6 µs	84.7 µs	2.6×
Sequential read, 4M elems (DRAM)	3.72 ms	2.15 ms	1.7×
Random read, 256k elems (L3)	385 µs	292 µs	1.3×
Random read, 4M elems (DRAM)	37.0 ms	33.1 ms	1.1×

The committed benchmark uses the production page size (512 nodes per #105) and keeps the long-term job of guarding paged-vs-flat performance: with the small pages, paged sequential reads still match flat std::vector (2.081 ms vs 2.070 ms at 4M elements — the 8k-entry page table costs nothing measurable), and paged appends still beat flat by ~1.4× thanks to the absence of relocation copies.

How much this matters end to end

The pre-existing benchmarks compile predicates of only tens of BDD nodes — far too small to exercise the arena. This PR adds large-scale benchmarks to packet_set_benchmark: pseudo-random sets drawn from a ~1M element space over 5 hex-digit fields, whose incompressible BDDs reach ~10^5–10^6 nodes like real-world models. The arena-level changes (power-of-two indexing + cheaper appends) are worth ~1–3% end to end on these workloads — node creation is dominated by unique-table hashing (each DecisionNode hash includes its heap-allocated branch array) and flat_hash_map probing, which dwarf the few instructions saved per arena access. The big end-to-end levers are algorithmic and already tracked: operation memoization (b/382379263) and complement edges (b/382380335) — today Or is Not(And(Not(l), Not(r))) and Not copies its entire operand. These benchmarks double as the yardstick for that future work, and were already used to verify #105's "large workloads unaffected" claim (they improved ~8%).

All 17 bazel test targets pass in -c opt. The pointer-stability test is also strengthened: it now holds a reference to every element across several pages (the boundary-adjacent positions are the ones most at risk during page allocation), rather than just the first two elements of the first page.

🤖 Generated with Claude Code

google-cla · 2026-06-10T10:32:24Z

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

The managers' node vectors allocate memory in pages. At 64 MiB, every page allocation exceeds malloc's mmap threshold (typically 128 KiB), so each manager pays an mmap/munmap syscall pair - significant for short-lived managers, which compile a policy and are discarded. At 16 KiB, pages are recycled through the allocator's freelists, while still amortizing allocation over hundreds of nodes. In benchmarks, this speeds up first-time compilation of small policies by up to 3x (e.g. BM_FirstTimeCompileOverlappingPredicate: 10.8us -> 3.7us); the syscall cost was diagnosed with strace -c. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Deriving the page size from a byte budget yields a non-power-of-two node count for packet sets (16 KiB / 24 B = 682), which forces the index arithmetic in PagedStableVector::operator[] -- on the hot path of nearly every operation -- to compile to multiply sequences instead of single shift/mask instructions. Round to 512 nodes (12 KiB) instead; transformer pages become an explicit 256 nodes (16 KiB), numerically unchanged. Both stay far below the malloc mmap/trim thresholds, which is what this PR is about. This also unblocks stacking google#101, which enforces power-of-two page sizes at compile time. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Detect page boundaries via the last page's size instead of recomputing size() % PageSize on every insertion. Besides being cheaper, this fixes a latent invariant bug: if an insertion threw after allocating a fresh page, the next insertion would allocate a second empty page, leaving a hole mid-vector that silently corrupts the index arithmetic for all subsequent elements. (Unreachable today since DecisionNode insertions do not throw, but a footgun for future element types.) The new microbenchmark quantifies the data structure's design choices: power-of-two vs non-power-of-two page sizes for operator[] indexing (up to 2.6x), and paged vs flat appends (~1.4x in favor of paged, no relocation copies), with a flat std::vector as the reference. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

Turn the prose recommendation into a static_assert: fast index arithmetic is the reason this class can match a flat std::vector on reads, so a non-power-of-two PageSize should be a compile error rather than a silent 2x regression. With the property enforced, the benchmark no longer needs to re-litigate the page-size choice, and is slimmed down to its long-term job of guarding paged-vs-flat performance. Also drop the cached size_ member in favor of deriving the size from the pages again: it measured neutral, and the static_assert now guarantees the derived computation compiles to a shift, so the extra invariant bought nothing. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The existing benchmarks compile predicates of only tens of BDD nodes, so they cannot detect effects that only manifest at scale: node arena performance, unique-table pressure, or the algorithmic complexity of the set operations. This makes them blind both to regressions at realistic model sizes and to the improvements we most care about landing next, such as operation memoization (b/382379263) and complement edges (b/382380335). Add benchmarks over pseudo-random sets drawn from a ~1M element space encoded across 5 hex-digit fields. Random sets have incompressible BDDs, so node counts scale with set size (~10^5-10^6 nodes), mimicking large real-world NetKAT models and providing a yardstick for future algorithmic work. These benchmarks put the page-size change of this PR in perspective: it yields ~1-3% end to end, since unique-table hashing dominates node creation. The 1.1-2.6x microbenchmark win applies to the arena in isolation only. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The test only held references to the first two elements, both in the first page. The positions most at risk of invalidation are elements adjacent to page boundaries, at the moments a new page or a larger page table gets allocated. Hold a reference to every element across several pages instead, and verify contents in addition to addresses. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

The managers' pages shrank from 2^21 to 2^9 nodes (see google#105); keep the microbenchmark representative of what production uses. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

smolkaj and others added 2 commits June 10, 2026 08:59

smolkaj mentioned this pull request Jun 10, 2026

[NetKAT] Shrink node storage pages from 64 MiB to 16 KiB #105

Closed

smolkaj and others added 5 commits June 10, 2026 09:09

[NetKAT] Align microbenchmark page size with the production page size.

1801421

The managers' pages shrank from 2^21 to 2^9 nodes (see google#105); keep the microbenchmark representative of what production uses. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

smolkaj force-pushed the paged-stable-vector-perf branch from 16fb260 to 1801421 Compare June 10, 2026 16:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[NetKAT] Speed up PagedStableVector indexing with power-of-two page sizes#101

[NetKAT] Speed up PagedStableVector indexing with power-of-two page sizes#101
smolkaj wants to merge 7 commits into
google:mainfrom
smolkaj:paged-stable-vector-perf

smolkaj commented Jun 10, 2026 •

edited

Loading

Uh oh!

google-cla Bot commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

smolkaj commented Jun 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

The win

The story

Microbenchmark proof

How much this matters end to end

Uh oh!

google-cla Bot commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

smolkaj commented Jun 10, 2026 •

edited

Loading