[NetKAT] Shrink node storage pages from 64 MiB to 16 KiB #105

Closed
smolkaj wants to merge 2 commits into google:main from smolkaj:small-arena-pages

smolkaj commented Jun 10, 2026

Contributor

What

The page size of the managers' node vectors shrinks from 64 MiB to 12 KiB for packet sets (512 nodes) and 16 KiB for transformers (256 nodes). Both sizes are defined directly as power-of-two node counts.

Why

At 64 MiB, every page allocation exceeds malloc's mmap threshold (typically 128 KiB), so every manager pays an mmap/munmap syscall pair — diagnosed with strace -c, which shows one mmap+munmap per benchmark iteration on head. This is significant for short-lived managers (compile a policy, answer a query, discard), the dominant pattern in tests and the analysis engine today. At 12–16 KiB, pages stay below the mmap/trim thresholds and are recycled through the allocator's freelists, while still amortizing allocation over hundreds of nodes.
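The size argument above can be checked with a few lines of arithmetic. This is a minimal sketch, not code from the PR: the 24 B packet-set node size is quoted in the description, the 64 B transformer node size is inferred from 16 KiB / 256 nodes, and the 128 KiB threshold is glibc's documented default for `M_MMAP_THRESHOLD` (other allocators, and glibc's dynamic threshold adjustment, may behave differently). All identifiers are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kKiB = 1024;
constexpr std::size_t kMiB = 1024 * kKiB;

// Node sizes: 24 B is stated in the PR; 64 B is inferred (16 KiB / 256 nodes).
constexpr std::size_t kPacketSetNodeBytes = 24;
constexpr std::size_t kTransformerNodeBytes = 64;

// Assumption: glibc's default M_MMAP_THRESHOLD. Requests at or above this
// size are typically served by mmap instead of the allocator's freelists.
constexpr std::size_t kAssumedMmapThreshold = 128 * kKiB;

// Bytes occupied by one page holding `nodes` nodes of `node_bytes` each.
constexpr std::size_t PageBytes(std::size_t nodes, std::size_t node_bytes) {
  return nodes * node_bytes;
}

// True if an allocation of `bytes` would likely go through mmap/munmap.
constexpr bool LikelyServedByMmap(std::size_t bytes) {
  return bytes >= kAssumedMmapThreshold;
}

// Number of page allocations needed to store `nodes` nodes.
inline std::size_t PagesNeeded(std::size_t nodes, std::size_t nodes_per_page) {
  assert(nodes_per_page > 0);
  return (nodes + nodes_per_page - 1) / nodes_per_page;
}

static_assert(PageBytes(512, kPacketSetNodeBytes) == 12 * kKiB, "12 KiB page");
static_assert(PageBytes(256, kTransformerNodeBytes) == 16 * kKiB, "16 KiB page");
static_assert(LikelyServedByMmap(64 * kMiB), "old pages exceed the threshold");
static_assert(!LikelyServedByMmap(16 * kKiB), "new pages stay below it");
```

Under these assumptions, a manager that stores 1000 packet-set nodes needs `PagesNeeded(1000, 512)` small allocations, none of which crosses the threshold, whereas a single 64 MiB page always does.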

The page sizes are expressed as power-of-two node counts rather than derived from a byte budget: a byte-budget division yields a non-power-of-two count for packet sets (16 KiB / 24 B = 682), which would force the index arithmetic in PagedStableVector::operator[] — the hot path of nearly every operation — to compile to multiply sequences instead of single shift/mask instructions. (#101, designed to stack on this PR, enforces the power-of-two property at compile time and adds benchmark infrastructure.)
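To illustrate why the node count matters, here is a minimal sketch of a paged vector with a compile-time page size. It is not the actual `PagedStableVector`; the class name, members, and layout are assumptions made for illustration. With an unsigned index and a power-of-two `kPageSize`, compilers can typically lower `i / kPageSize` and `i % kPageSize` to a shift and a mask; with a constant such as 682 they generally emit a multiply-based sequence instead.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for PagedStableVector; illustration only.
template <typename T, std::size_t kPageSize>
class PagedVectorSketch {
  // Reject page sizes such as 682 that are not powers of two.
  static_assert(kPageSize > 0 && (kPageSize & (kPageSize - 1)) == 0,
                "page size must be a power of two");

 public:
  // Appends a value, allocating a fresh page when the last one is full.
  // Existing elements never move, so references to them stay valid.
  void push_back(const T& value) {
    if (size_ % kPageSize == 0) {
      pages_.push_back(std::make_unique<T[]>(kPageSize));
    }
    pages_[size_ / kPageSize][size_ % kPageSize] = value;
    ++size_;
  }

  // Hot path: page index and in-page offset computed from the element index.
  T& operator[](std::size_t i) {
    assert(i < size_);
    return pages_[i / kPageSize][i % kPageSize];
  }

  std::size_t size() const { return size_; }
  std::size_t page_count() const { return pages_.size(); }

 private:
  std::vector<std::unique_ptr<T[]>> pages_;
  std::size_t size_ = 0;
};

// For a 512-node page, division and modulo agree with shift and mask.
static_assert((1000u >> 9) == 1000u / 512u, "page index via shift");
static_assert((1000u & 511u) == 1000u % 512u, "offset via mask");
```

Instantiating `PagedVectorSketch<int, 512>` compiles, while `PagedVectorSketch<int, 682>` fails the `static_assert`, which is the kind of compile-time check the description attributes to #101.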

Measured results (CPU-time medians, 5 reps)

Small-policy compilation, the workload this PR targets, vs head:

| Benchmark | Before | After | Speedup |
|---|---|---|---|
| BM_FirstTimeCompileOverlappingPredicate | 11.2 µs | 3.8 µs | 3.0× |
| BM_FirstTimeCompileNonOverlappingPredicate | 45.0 µs | 36.1 µs | 1.25× |
| BM_FirstTimeCompileNonOverlappingPolicy (transformer) | 585 µs | 266 µs | 2.2× |
| Recompile benchmarks (no allocation) | ~640 ns | ~635 ns | unchanged |

Large workloads are not just unaffected but slightly improved, verified with the large-scale benchmarks from #101 (random packet sets of ~10^5–10^6 BDD nodes): compiling a 32k-member set goes from 264.6 ms to 243.1 ms (−8%), Xor of two 32k-member sets from 3.19 ms to 3.13 ms, Not unchanged.

Testing

All 17 bazel test targets pass.

🤖 Generated with Claude Code

The managers' node vectors allocate memory in pages. At 64 MiB, every page
allocation exceeds malloc's mmap threshold (typically 128 KiB), so each
manager pays an mmap/munmap syscall pair - significant for short-lived
managers, which compile a policy and are discarded. At 16 KiB, pages are
recycled through the allocator's freelists, while still amortizing
allocation over hundreds of nodes.

In benchmarks, this speeds up first-time compilation of small policies by
up to 3x (e.g. BM_FirstTimeCompileOverlappingPredicate: 10.8us -> 3.7us);
the syscall cost was diagnosed with strace -c.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

smolkaj commented Jun 10, 2026

Contributor Author

Standalone benchmark run confirmed (medians of 5, -c opt, vs main measured back-to-back on the same machine):

| Benchmark | main | this PR |
|---|---|---|
| BM_FirstTimeCompileNonOverlappingPredicate | 45.0 µs | 37.1 µs (−18%) |
| BM_ReCompileNonOverlappingPredicate | 641 ns | 641 ns |
| BM_FirstTimeCompileOverlappingPredicate | 11.1 µs | 3.8 µs (2.9×) |
| BM_ReCompileOverlappingPredicate | 649 ns | 636 ns |
| BM_FirstTimeCompileNonOverlappingPolicy | 310.5 µs | 268.1 µs (−14%) |
| BM_ReCompileNonOverlappingPolicy | 4.23 µs | 4.10 µs |
| BM_FirstTimeCompileOverlappingPolicy | 46.5 µs | 31.9 µs (−31%) |
| BM_ReCompileOverlappingPolicy | 4.12 µs | 4.05 µs |

Notably, these match the numbers previously measured for the full #102 stack almost exactly — this two-line change accounts for essentially all of the first-time-compilation speedup observed there. Recompilation is unchanged, as expected (no allocation on that path).
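For reference, the percentages and speedup factors quoted in this comment follow from the raw medians by simple arithmetic. The helpers below are a hypothetical sketch of that calculation and are not part of the benchmark suite.

```cpp
#include <cassert>
#include <cmath>

// Relative change in percent; negative means the new time is lower.
inline double PercentChange(double before, double after) {
  assert(before > 0.0);
  return (after - before) / before * 100.0;
}

// Speedup factor; values above 1 mean the new time is lower.
inline double Speedup(double before, double after) {
  assert(after > 0.0);
  return before / after;
}
```

For example, `PercentChange(45.0, 37.1)` rounds to −18 and `Speedup(11.1, 3.8)` is roughly 2.9, matching the figures reported for the two predicate benchmarks.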

google-cla Bot commented Jun 10, 2026


Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

Deriving the page size from a byte budget yields a non-power-of-two
node count for packet sets (16 KiB / 24 B = 682), which forces the
index arithmetic in PagedStableVector::operator[] -- on the hot path of
nearly every operation -- to compile to multiply sequences instead of
single shift/mask instructions. Round to 512 nodes (12 KiB) instead;
transformer pages become an explicit 256 nodes (16 KiB), numerically
unchanged. Both stay far below the malloc mmap/trim thresholds, which
is what this PR is about.

This also unblocks stacking google#101, which enforces power-of-two page
sizes at compile time.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
smolkaj added a commit to smolkaj/netkat that referenced this pull request Jun 10, 2026
The managers' pages shrank from 2^21 to 2^9 nodes (see google#105); keep the
microbenchmark representative of what production uses.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
smolkaj commented Jun 11, 2026

Contributor Author

This seems to be overfitting to small workflows and will make large workflows slower. Reference: https://claude.ai/share/6c75784f-9b5a-4124-ad54-14811c88ed94

Closing.

smolkaj closed this Jun 11, 2026