Skip to content

[NetKAT] Speed up PagedStableVector indexing with power-of-two page sizes#101

Draft
smolkaj wants to merge 7 commits into
google:mainfrom
smolkaj:paged-stable-vector-perf
Draft

[NetKAT] Speed up PagedStableVector indexing with power-of-two page sizes#101
smolkaj wants to merge 7 commits into
google:mainfrom
smolkaj:paged-stable-vector-perf

Conversation

@smolkaj

@smolkaj smolkaj commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Note

Stacked on #105 — the first two commits shown here belong to that PR. Review only the four commits starting at "Simplify PagedStableVector bookkeeping"; this PR will be rebased once #105 merges.

The win

This PR hardens and instruments the node arena after #105 right-sizes its pages:

  • Compile-time enforcement that PagedStableVector page sizes are powers of two — the property that makes its operator[] (the hot path of nearly every BDD operation, via GetNodeOrDie) compile to shifts/masks and run 1.1–2.6× faster than with a non-power-of-two page size, matching a flat std::vector on sequential reads. [NetKAT] Shrink node storage pages from 64 MiB to 16 KiB #105 chooses such page sizes; this PR makes choosing anything else a compile error rather than a silent regression.
  • A latent exception-safety bug fixed in push_back/emplace_back.
  • Benchmark infrastructure the repo was missing: a PagedStableVector microbenchmark and large-scale (~10^5–10^6 node) packet-set benchmarks — the latter already used to validate [NetKAT] Shrink node storage pages from 64 MiB to 16 KiB #105's claims.

End-to-end, the arena-level changes are worth ~1–3% (see "How much this matters" below); the big wins here are robustness and measurement.

The story

PagedStableVector::operator[] computes index / PageSize and index % PageSize on every access. When the page size is derived from a byte budget — as it was on head, and as #105 initially did — the element count is generally not a power of two, so the compiler emits multiply-based division sequences instead of single shift/mask instructions. The transformer manager only ever escaped this by luck (64-byte nodes happen to divide byte budgets evenly).

With #105 defining both page sizes as power-of-two node counts, this PR adds static_assert(std::has_single_bit(PageSize)) to PagedStableVector itself, turning the convention into a guarantee.

The push_back/emplace_back bookkeeping is also simplified: page boundaries are detected via the last page's size rather than recomputing size() % PageSize per insertion. Besides being cheaper, this fixes a latent invariant bug: if an insertion threw after allocating a fresh page, the next insertion would allocate a second empty page, leaving a hole mid-vector that silently corrupts the index arithmetic for all subsequent elements. (Unreachable today since DecisionNode insertions don't throw, but a footgun for future element types.)

Microbenchmark proof

The power-of-two impact was measured with a temporary variant of the new paged_stable_vector_benchmark instantiating both page-size flavors in the same binary (perfectly load-matched), at L3-resident (256k elements) and DRAM-resident (4M elements) working-set sizes. CPU-time medians, 5 repetitions:

Benchmark non-pow2 pages pow2 pages speedup
Sequential read, 256k elems (L3) 222.6 µs 84.7 µs 2.6×
Sequential read, 4M elems (DRAM) 3.72 ms 2.15 ms 1.7×
Random read, 256k elems (L3) 385 µs 292 µs 1.3×
Random read, 4M elems (DRAM) 37.0 ms 33.1 ms 1.1×

The committed benchmark uses the production page size (512 nodes per #105) and keeps the long-term job of guarding paged-vs-flat performance: with the small pages, paged sequential reads still match flat std::vector (2.081 ms vs 2.070 ms at 4M elements — the 8k-entry page table costs nothing measurable), and paged appends still beat flat by ~1.4× thanks to the absence of relocation copies.

How much this matters end to end

The pre-existing benchmarks compile predicates of only tens of BDD nodes — far too small to exercise the arena. This PR adds large-scale benchmarks to packet_set_benchmark: pseudo-random sets drawn from a ~1M element space over 5 hex-digit fields, whose incompressible BDDs reach ~10^5–10^6 nodes like real-world models. The arena-level changes (power-of-two indexing + cheaper appends) are worth ~1–3% end to end on these workloads — node creation is dominated by unique-table hashing (each DecisionNode hash includes its heap-allocated branch array) and flat_hash_map probing, which dwarf the few instructions saved per arena access. The big end-to-end levers are algorithmic and already tracked: operation memoization (b/382379263) and complement edges (b/382380335) — today Or is Not(And(Not(l), Not(r))) and Not copies its entire operand. These benchmarks double as the yardstick for that future work, and were already used to verify #105's "large workloads unaffected" claim (they improved ~8%).

All 17 bazel test targets pass in -c opt. The pointer-stability test is also strengthened: it now holds a reference to every element across several pages (the boundary-adjacent positions are the ones most at risk during page allocation), rather than just the first two elements of the first page.

🤖 Generated with Claude Code

@google-cla

google-cla Bot commented Jun 10, 2026

Copy link
Copy Markdown

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

smolkaj and others added 2 commits June 10, 2026 08:59
The managers' node vectors allocate memory in pages. At 64 MiB, every page
allocation exceeds malloc's mmap threshold (typically 128 KiB), so each
manager pays an mmap/munmap syscall pair - significant for short-lived
managers, which compile a policy and are discarded. At 16 KiB, pages are
recycled through the allocator's freelists, while still amortizing
allocation over hundreds of nodes.

In benchmarks, this speeds up first-time compilation of small policies by
up to 3x (e.g. BM_FirstTimeCompileOverlappingPredicate: 10.8us -> 3.7us);
the syscall cost was diagnosed with strace -c.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Deriving the page size from a byte budget yields a non-power-of-two
node count for packet sets (16 KiB / 24 B = 682), which forces the
index arithmetic in PagedStableVector::operator[] -- on the hot path of
nearly every operation -- to compile to multiply sequences instead of
single shift/mask instructions. Round to 512 nodes (12 KiB) instead;
transformer pages become an explicit 256 nodes (16 KiB), numerically
unchanged. Both stay far below the malloc mmap/trim thresholds, which
is what this PR is about.

This also unblocks stacking google#101, which enforces power-of-two page
sizes at compile time.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
smolkaj and others added 5 commits June 10, 2026 09:09
Detect page boundaries via the last page's size instead of recomputing
size() % PageSize on every insertion. Besides being cheaper, this fixes
a latent invariant bug: if an insertion threw after allocating a fresh
page, the next insertion would allocate a second empty page, leaving a
hole mid-vector that silently corrupts the index arithmetic for all
subsequent elements. (Unreachable today since DecisionNode insertions
do not throw, but a footgun for future element types.)

The new microbenchmark quantifies the data structure's design choices:
power-of-two vs non-power-of-two page sizes for operator[] indexing
(up to 2.6x), and paged vs flat appends (~1.4x in favor of paged, no
relocation copies), with a flat std::vector as the reference.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Turn the prose recommendation into a static_assert: fast index
arithmetic is the reason this class can match a flat std::vector on
reads, so a non-power-of-two PageSize should be a compile error rather
than a silent 2x regression. With the property enforced, the benchmark
no longer needs to re-litigate the page-size choice, and is slimmed
down to its long-term job of guarding paged-vs-flat performance.

Also drop the cached size_ member in favor of deriving the size from
the pages again: it measured neutral, and the static_assert now
guarantees the derived computation compiles to a shift, so the extra
invariant bought nothing.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The existing benchmarks compile predicates of only tens of BDD nodes,
so they cannot detect effects that only manifest at scale: node arena
performance, unique-table pressure, or the algorithmic complexity of
the set operations. This makes them blind both to regressions at
realistic model sizes and to the improvements we most care about
landing next, such as operation memoization (b/382379263) and
complement edges (b/382380335).

Add benchmarks over pseudo-random sets drawn from a ~1M element space
encoded across 5 hex-digit fields. Random sets have incompressible
BDDs, so node counts scale with set size (~10^5-10^6 nodes), mimicking
large real-world NetKAT models and providing a yardstick for future
algorithmic work.

These benchmarks put the page-size change of this PR in perspective:
it yields ~1-3% end to end, since unique-table hashing dominates node
creation. The 1.1-2.6x microbenchmark win applies to the arena in
isolation only.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The test only held references to the first two elements, both in the
first page. The positions most at risk of invalidation are elements
adjacent to page boundaries, at the moments a new page or a larger page
table gets allocated. Hold a reference to every element across several
pages instead, and verify contents in addition to addresses.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The managers' pages shrank from 2^21 to 2^9 nodes (see google#105); keep the
microbenchmark representative of what production uses.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@smolkaj smolkaj force-pushed the paged-stable-vector-perf branch from 16fb260 to 1801421 Compare June 10, 2026 16:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant