[NetKAT] Speed up PagedStableVector indexing with power-of-two page sizes#101
Draft
smolkaj wants to merge 7 commits into
Draft
[NetKAT] Speed up PagedStableVector indexing with power-of-two page sizes#101smolkaj wants to merge 7 commits into
smolkaj wants to merge 7 commits into
Conversation
|
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request. |
The managers' node vectors allocate memory in pages. At 64 MiB, every page allocation exceeds malloc's mmap threshold (typically 128 KiB), so each manager pays an mmap/munmap syscall pair - significant for short-lived managers, which compile a policy and are discarded. At 16 KiB, pages are recycled through the allocator's freelists, while still amortizing allocation over hundreds of nodes. In benchmarks, this speeds up first-time compilation of small policies by up to 3x (e.g. BM_FirstTimeCompileOverlappingPredicate: 10.8us -> 3.7us); the syscall cost was diagnosed with strace -c. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Deriving the page size from a byte budget yields a non-power-of-two node count for packet sets (16 KiB / 24 B = 682), which forces the index arithmetic in PagedStableVector::operator[] -- on the hot path of nearly every operation -- to compile to multiply sequences instead of single shift/mask instructions. Round to 512 nodes (12 KiB) instead; transformer pages become an explicit 256 nodes (16 KiB), numerically unchanged. Both stay far below the malloc mmap/trim thresholds, which is what this PR is about. This also unblocks stacking google#101, which enforces power-of-two page sizes at compile time. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Detect page boundaries via the last page's size instead of recomputing size() % PageSize on every insertion. Besides being cheaper, this fixes a latent invariant bug: if an insertion threw after allocating a fresh page, the next insertion would allocate a second empty page, leaving a hole mid-vector that silently corrupts the index arithmetic for all subsequent elements. (Unreachable today since DecisionNode insertions do not throw, but a footgun for future element types.) The new microbenchmark quantifies the data structure's design choices: power-of-two vs non-power-of-two page sizes for operator[] indexing (up to 2.6x), and paged vs flat appends (~1.4x in favor of paged, no relocation copies), with a flat std::vector as the reference. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Turn the prose recommendation into a static_assert: fast index arithmetic is the reason this class can match a flat std::vector on reads, so a non-power-of-two PageSize should be a compile error rather than a silent 2x regression. With the property enforced, the benchmark no longer needs to re-litigate the page-size choice, and is slimmed down to its long-term job of guarding paged-vs-flat performance. Also drop the cached size_ member in favor of deriving the size from the pages again: it measured neutral, and the static_assert now guarantees the derived computation compiles to a shift, so the extra invariant bought nothing. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The existing benchmarks compile predicates of only tens of BDD nodes, so they cannot detect effects that only manifest at scale: node arena performance, unique-table pressure, or the algorithmic complexity of the set operations. This makes them blind both to regressions at realistic model sizes and to the improvements we most care about landing next, such as operation memoization (b/382379263) and complement edges (b/382380335). Add benchmarks over pseudo-random sets drawn from a ~1M element space encoded across 5 hex-digit fields. Random sets have incompressible BDDs, so node counts scale with set size (~10^5-10^6 nodes), mimicking large real-world NetKAT models and providing a yardstick for future algorithmic work. These benchmarks put the page-size change of this PR in perspective: it yields ~1-3% end to end, since unique-table hashing dominates node creation. The 1.1-2.6x microbenchmark win applies to the arena in isolation only. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The test only held references to the first two elements, both in the first page. The positions most at risk of invalidation are elements adjacent to page boundaries, at the moments a new page or a larger page table gets allocated. Hold a reference to every element across several pages instead, and verify contents in addition to addresses. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The managers' pages shrank from 2^21 to 2^9 nodes (see google#105); keep the microbenchmark representative of what production uses. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
16fb260 to
1801421
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Note
Stacked on #105 — the first two commits shown here belong to that PR. Review only the four commits starting at "Simplify PagedStableVector bookkeeping"; this PR will be rebased once #105 merges.
The win
This PR hardens and instruments the node arena after #105 right-sizes its pages:
PagedStableVectorpage sizes are powers of two — the property that makes itsoperator[](the hot path of nearly every BDD operation, viaGetNodeOrDie) compile to shifts/masks and run 1.1–2.6× faster than with a non-power-of-two page size, matching a flatstd::vectoron sequential reads. [NetKAT] Shrink node storage pages from 64 MiB to 16 KiB #105 chooses such page sizes; this PR makes choosing anything else a compile error rather than a silent regression.push_back/emplace_back.PagedStableVectormicrobenchmark and large-scale (~10^5–10^6 node) packet-set benchmarks — the latter already used to validate [NetKAT] Shrink node storage pages from 64 MiB to 16 KiB #105's claims.End-to-end, the arena-level changes are worth ~1–3% (see "How much this matters" below); the big wins here are robustness and measurement.
The story
PagedStableVector::operator[]computesindex / PageSizeandindex % PageSizeon every access. When the page size is derived from a byte budget — as it was on head, and as #105 initially did — the element count is generally not a power of two, so the compiler emits multiply-based division sequences instead of single shift/mask instructions. The transformer manager only ever escaped this by luck (64-byte nodes happen to divide byte budgets evenly).With #105 defining both page sizes as power-of-two node counts, this PR adds
static_assert(std::has_single_bit(PageSize))toPagedStableVectoritself, turning the convention into a guarantee.The
push_back/emplace_backbookkeeping is also simplified: page boundaries are detected via the last page's size rather than recomputingsize() % PageSizeper insertion. Besides being cheaper, this fixes a latent invariant bug: if an insertion threw after allocating a fresh page, the next insertion would allocate a second empty page, leaving a hole mid-vector that silently corrupts the index arithmetic for all subsequent elements. (Unreachable today sinceDecisionNodeinsertions don't throw, but a footgun for future element types.)Microbenchmark proof
The power-of-two impact was measured with a temporary variant of the new
paged_stable_vector_benchmarkinstantiating both page-size flavors in the same binary (perfectly load-matched), at L3-resident (256k elements) and DRAM-resident (4M elements) working-set sizes. CPU-time medians, 5 repetitions:The committed benchmark uses the production page size (512 nodes per #105) and keeps the long-term job of guarding paged-vs-flat performance: with the small pages, paged sequential reads still match flat
std::vector(2.081 ms vs 2.070 ms at 4M elements — the 8k-entry page table costs nothing measurable), and paged appends still beat flat by ~1.4× thanks to the absence of relocation copies.How much this matters end to end
The pre-existing benchmarks compile predicates of only tens of BDD nodes — far too small to exercise the arena. This PR adds large-scale benchmarks to
packet_set_benchmark: pseudo-random sets drawn from a ~1M element space over 5 hex-digit fields, whose incompressible BDDs reach ~10^5–10^6 nodes like real-world models. The arena-level changes (power-of-two indexing + cheaper appends) are worth ~1–3% end to end on these workloads — node creation is dominated by unique-table hashing (eachDecisionNodehash includes its heap-allocated branch array) andflat_hash_mapprobing, which dwarf the few instructions saved per arena access. The big end-to-end levers are algorithmic and already tracked: operation memoization (b/382379263) and complement edges (b/382380335) — todayOrisNot(And(Not(l), Not(r)))andNotcopies its entire operand. These benchmarks double as the yardstick for that future work, and were already used to verify #105's "large workloads unaffected" claim (they improved ~8%).All 17 bazel test targets pass in
-c opt. The pointer-stability test is also strengthened: it now holds a reference to every element across several pages (the boundary-adjacent positions are the ones most at risk during page allocation), rather than just the first two elements of the first page.🤖 Generated with Claude Code