[NetKAT] Shrink node storage pages from 64 MiB to 16 KiB #105
The managers' node vectors allocate memory in pages. At 64 MiB, every page allocation exceeds malloc's mmap threshold (typically 128 KiB), so each manager pays an mmap/munmap syscall pair - significant for short-lived managers, which compile a policy and are discarded. At 16 KiB, pages are recycled through the allocator's freelists, while still amortizing allocation over hundreds of nodes. In benchmarks, this speeds up first-time compilation of small policies by up to 3x (e.g. BM_FirstTimeCompileOverlappingPredicate: 10.8us -> 3.7us); the syscall cost was diagnosed with strace -c.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
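To make "allocating in pages" concrete, here is a minimal sketch of a paged vector. It is a simplified, hypothetical stand-in (the class name `TinyPagedVector` and its members are invented for illustration) and is not the repository's PagedStableVector. It shows the two properties the commit message relies on: one allocation is amortized over `kPageSize` elements, and existing elements never move when the vector grows.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical, simplified stand-in for a paged node vector; the real
// PagedStableVector in the repository may differ in every detail.
template <typename T, std::size_t kPageSize>
class TinyPagedVector {
 public:
  // Appends a value, allocating a fresh page only when all pages are full,
  // so one heap allocation is amortized over kPageSize elements.
  void push_back(const T& value) {
    if (size_ == pages_.size() * kPageSize) {
      pages_.push_back(std::make_unique<T[]>(kPageSize));
    }
    pages_[size_ / kPageSize][size_ % kPageSize] = value;
    ++size_;
  }

  // Element access: pick the page, then the slot within the page.
  T& operator[](std::size_t index) {
    return pages_[index / kPageSize][index % kPageSize];
  }

  std::size_t size() const { return size_; }
  std::size_t page_count() const { return pages_.size(); }

 private:
  // Growing this outer vector moves only the page pointers, never the
  // elements, so references to stored elements stay valid.
  std::vector<std::unique_ptr<T[]>> pages_;
  std::size_t size_ = 0;
};
```

With a page size of 512, storing 1000 elements needs two page allocations. Whether each such allocation is served from the allocator's freelists or via mmap depends on the page's byte size, which is the knob this PR turns.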
Standalone benchmark run confirmed (medians of 5, -c opt, vs main measured back-to-back on the same machine):
Notably, these match the numbers previously measured for the full #102 stack almost exactly: this two-line change accounts for essentially all of the first-time-compilation speedup observed there. Recompilation is unchanged, as expected (no allocation on that path).
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).
Deriving the page size from a byte budget yields a non-power-of-two node count for packet sets (16 KiB / 24 B = 682), which forces the index arithmetic in PagedStableVector::operator[] -- on the hot path of nearly every operation -- to compile to multiply sequences instead of single shift/mask instructions. Round to 512 nodes (12 KiB) instead; transformer pages become an explicit 256 nodes (16 KiB), numerically unchanged. Both stay far below the malloc mmap/trim thresholds, which is what this PR is about. This also unblocks stacking google#101, which enforces power-of-two page sizes at compile time.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
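The index-arithmetic point can be sketched as follows. All names here (`PageSlot`, `SplitByDivision`, `SplitByShiftMask`) are hypothetical and chosen for illustration; this is not the code of PagedStableVector::operator[]. The sketch only shows why a power-of-two node count lets the page/offset split reduce to a shift and a mask, while a count such as 682 needs a real division and remainder, which compilers typically lower to multiply-based sequences.

```cpp
#include <cstddef>

// Hypothetical names throughout; not the actual PagedStableVector code.
struct PageSlot {
  std::size_t page;    // which page holds the element
  std::size_t offset;  // position of the element within that page
};

// Works for any page size. For a constant such as 682, compilers typically
// lower the division and remainder to multiply/shift sequences.
template <std::size_t kNodesPerPage>
constexpr PageSlot SplitByDivision(std::size_t index) {
  return PageSlot{index / kNodesPerPage, index % kNodesPerPage};
}

constexpr bool IsPowerOfTwo(std::size_t n) {
  return n != 0 && (n & (n - 1)) == 0;
}

// Number of bits to shift by for a power-of-two n (floor(log2(n)) in general).
constexpr unsigned Log2(std::size_t n) {
  unsigned bits = 0;
  while (n > 1) {
    n >>= 1;
    ++bits;
  }
  return bits;
}

// Only valid for power-of-two page sizes: one shift and one mask.
template <std::size_t kNodesPerPage>
constexpr PageSlot SplitByShiftMask(std::size_t index) {
  static_assert(IsPowerOfTwo(kNodesPerPage),
                "page size must be a power of two");
  return PageSlot{index >> Log2(kNodesPerPage), index & (kNodesPerPage - 1)};
}

// Sanity check: both strategies agree for the 512-node page size.
inline bool SplitsAgree(std::size_t index) {
  const PageSlot a = SplitByDivision<512>(index);
  const PageSlot b = SplitByShiftMask<512>(index);
  return a.page == b.page && a.offset == b.offset;
}
```

In practice an optimizing compiler already turns `index / 512` and `index % 512` on unsigned operands into the shift/mask form, so the explicit version mainly documents the intent; the `static_assert` is in the spirit of the compile-time check the commit message attributes to google#101.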
The managers' pages shrank from 2^21 to 2^9 nodes (see google#105); keep the microbenchmark representative of what production uses.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This seems to be overfitting to small workflows, and will make large workflows slower. Reference: https://claude.ai/share/6c75784f-9b5a-4124-ad54-14811c88ed94 Closing.
What
The page size of the managers' node vectors goes from 64 MiB to 12 KiB (packet sets, 512 nodes) / 16 KiB (transformers, 256 nodes), defined directly as power-of-two node counts.
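The byte sizes quoted above can be checked with a few lines of arithmetic. In this sketch the 24 B packet-set node size comes from the PR text, the 64 B transformer node size is inferred from "256 nodes = 16 KiB", and 128 KiB is the typical mmap threshold the PR mentions, not a value read from the allocator. All constant and function names are hypothetical.

```cpp
#include <cstddef>

constexpr std::size_t kKiB = 1024;
constexpr std::size_t kMiB = 1024 * kKiB;

// Node sizes: 24 B is stated in the PR; 64 B is inferred (16 KiB / 256 nodes).
constexpr std::size_t kPacketSetNodeBytes = 24;
constexpr std::size_t kTransformerNodeBytes = 64;

// Page sizes as power-of-two node counts, as described in the PR.
constexpr std::size_t kPacketSetPageNodes = 512;
constexpr std::size_t kTransformerPageNodes = 256;

// Assumed "typical" malloc mmap threshold from the PR text; the real
// threshold is allocator-specific and may be tuned or adjusted at runtime.
constexpr std::size_t kAssumedMmapThresholdBytes = 128 * kKiB;

constexpr std::size_t PageBytes(std::size_t nodes, std::size_t node_bytes) {
  return nodes * node_bytes;
}

constexpr bool BelowAssumedThreshold(std::size_t page_bytes) {
  return page_bytes < kAssumedMmapThresholdBytes;
}

static_assert(PageBytes(kPacketSetPageNodes, kPacketSetNodeBytes) == 12 * kKiB,
              "512 packet-set nodes occupy 12 KiB");
static_assert(PageBytes(kTransformerPageNodes, kTransformerNodeBytes) == 16 * kKiB,
              "256 transformer nodes occupy 16 KiB");
static_assert(16 * kKiB / kPacketSetNodeBytes == 682,
              "a 16 KiB byte budget yields 682 packet-set nodes");
static_assert(!BelowAssumedThreshold(64 * kMiB),
              "the old 64 MiB pages exceed the assumed threshold");
```

Both new page sizes are roughly an order of magnitude below the assumed 128 KiB threshold, while the old 64 MiB pages exceed it by a factor of 512.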
Why
At 64 MiB, every page allocation exceeds malloc's mmap threshold (typically 128 KiB), so every manager pays an mmap/munmap syscall pair. This was diagnosed with strace -c, which shows one mmap+munmap per benchmark iteration on head. This is significant for short-lived managers (compile a policy, answer a query, discard), the dominant pattern in tests and the analysis engine today. At 12–16 KiB, pages stay below the mmap/trim thresholds and are recycled through the allocator's freelists, while still amortizing allocation over hundreds of nodes.
The page sizes are expressed as power-of-two node counts rather than derived from a byte budget: a byte-budget division yields a non-power-of-two count for packet sets (16 KiB / 24 B = 682), which would force the index arithmetic in PagedStableVector::operator[], the hot path of nearly every operation, to compile to multiply sequences instead of single shift/mask instructions. (#101, designed to stack on this PR, enforces the power-of-two property at compile time and adds benchmark infrastructure.)
Measured results (CPU-time medians, 5 reps)
Small-policy compilation, the workload this PR targets, vs head:
BM_FirstTimeCompileOverlappingPredicate
BM_FirstTimeCompileNonOverlappingPredicate
BM_FirstTimeCompileNonOverlappingPolicy (transformer)
Large workloads are not just unaffected but slightly improved, verified with the large-scale benchmarks from #101 (random packet sets of ~10^5–10^6 BDD nodes): compiling a 32k-member set goes from 264.6 ms to 243.1 ms (−8%), Xor of two 32k-member sets from 3.19 ms to 3.13 ms, Not unchanged.
Testing
All 17 bazel test targets pass.
🤖 Generated with Claude Code