feat(live-fork, memfd): Back Mem Snapshot with Hugepages#230

Open
theflashwin wants to merge 1 commit into deeplethe:main from theflashwin:hugepage-backed-mem-snapshot

@theflashwin theflashwin commented Jun 5, 2026

PR for #6.

Summary

During the branch command, we want to minimize how long the parent VM is paused while its memory is copied into a new memory snapshot. The parent VM is paused in two places:

  • To copy the VM state (CPU registers, etc.)
  • To copy RAM into the new memory snapshot that we branch from

Copying the VM state is unavoidable but takes very little time (<10 ms), while copying the RAM is an intensive process. A big contributor to this delay is high TLB pressure, because we have to walk the VM's entire memory. To mitigate this, we back the copy with huge pages.
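
For a rough sense of the TLB argument, the sketch below counts how many pages a full walk of the snapshot touches at each page size. It is illustrative arithmetic only, not code from this PR; `pages_needed` and the constants are hypothetical names, and the 512 MiB figure is the benchmark source size quoted in the Benchmarking section.

```rust
// Illustrative arithmetic only; `pages_needed`, `KIB` and `MIB` are
// hypothetical names, not identifiers from this PR.
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;

/// Number of pages needed to cover `bytes` at the given page size,
/// rounding up so a partial trailing page still counts as one page.
fn pages_needed(bytes: u64, page_size: u64) -> u64 {
    (bytes + page_size - 1) / page_size
}
```

With these definitions, `pages_needed(512 * MIB, 4 * KIB)` is 131072 while `pages_needed(512 * MIB, 2 * MIB)` is 256, so a full copy needs far fewer distinct translations when the destination is backed by 2 MiB pages.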

Changes:

  • Added a use_hugepages boolean flag that makes the memfd_create syscall be called with libc::MFD_HUGETLB.
  • Added a copy_via_mmap function to work around the fact that hugepage-backed memfds cannot be written to with the write() syscall.

Testing

5 new tests added to the existing memfd::tests module:

  • copy_via_mmap_size_guard_rejects_oversized_request - verifies that passing size_bytes >
    alloc_size returns an InvalidInput error immediately, without touching any mmap.
  • copy_via_mmap_content_matches - creates a source file with a known byte pattern, calls
    copy_via_mmap directly into a plain (non-hugetlb) memfd, reads back through the fd and asserts
    byte-for-byte equality.
  • hugepages_metadata_correct - calls create_and_populate with use_hugepages=true, asserts
    size_bytes() returns the original file size (not the hugepage-aligned alloc_size), and asserts
    backend_path() has the correct /proc//fd/ format. Skips gracefully if HugePages_Free=0.
  • hugepages_content_matches_source - same as the existing populated_memfd_content_matches_source
    but with use_hugepages=true. Verifies the copy_via_mmap path (used for hugetlb memfds) produces
    identical bytes to the source.
  • hugepages_size_bytes_is_source_size_not_aligned - source is 4096 bytes (well below 2 MiB).
    Asserts region.size_bytes() returns 4096, not the 2 MiB-aligned alloc_size.
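
The size guard and the size_bytes versus alloc_size distinction that these tests exercise can be sketched without any mmap. The block below is a simplified stand-in, not the PR's implementation: the helper names are hypothetical, a 2 MiB hugepage size is assumed, and a zero-filled `Vec<u8>` plays the role of the ftruncate'd memfd.

```rust
use std::io;

// Assumed hugepage size (2 MiB), matching the figure in the test descriptions.
const HUGEPAGE_SIZE: u64 = 2 * 1024 * 1024;

/// Round `size_bytes` up to the next multiple of the hugepage size.
/// (Hypothetical helper; the PR's own alignment code is not shown here.)
fn align_up_to_hugepage(size_bytes: u64) -> u64 {
    (size_bytes + HUGEPAGE_SIZE - 1) / HUGEPAGE_SIZE * HUGEPAGE_SIZE
}

/// Reject a copy that would not fit in the allocation, before mapping anything.
fn check_copy_size(size_bytes: u64, alloc_size: u64) -> io::Result<()> {
    if size_bytes > alloc_size {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "size_bytes exceeds alloc_size",
        ));
    }
    Ok(())
}

/// Stand-in for the padded destination: a zero-filled buffer of `alloc_size`
/// bytes with the source copied to the front. The tail stays zero.
fn padded_copy(src: &[u8]) -> io::Result<Vec<u8>> {
    let size_bytes = src.len() as u64;
    let alloc_size = align_up_to_hugepage(size_bytes);
    check_copy_size(size_bytes, alloc_size)?;
    let mut dst = vec![0u8; alloc_size as usize];
    dst[..src.len()].copy_from_slice(src);
    Ok(dst)
}
```

For a 4096-byte source, `padded_copy` returns a 2 MiB buffer whose first 4096 bytes match the source and whose remainder is zero, which is why the reported size has to stay at the source size rather than the aligned one.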

The three hugepages tests check HugePages_Free at runtime and return early with an eprintln! hint if the pool isn't available.

Also added a forkd doctor check to verify hugepage allocation.
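
As a rough illustration of what such a check involves, the sketch below parses the HugePages_Total and HugePages_Free fields out of /proc/meminfo-style text and classifies the pool. It is a sketch under assumptions, not the forkd doctor code: the enum and function names are hypothetical, and the input is passed in as a string so it can be tried on any machine.

```rust
/// Hypothetical classification of the hugepage pool; not a type from this PR.
#[derive(Debug, PartialEq)]
enum HugepageStatus {
    /// No pool reserved (HugePages_Total is 0 or missing).
    NoPool,
    /// A pool exists but every page is in use.
    Exhausted { total: u64 },
    /// At least one free hugepage.
    Available { free: u64, total: u64 },
}

/// Read one numeric field such as "HugePages_Free" from meminfo-style text.
fn meminfo_field(meminfo: &str, key: &str) -> Option<u64> {
    meminfo.lines().find_map(|line| {
        let (name, rest) = line.split_once(':')?;
        if name.trim() != key {
            return None;
        }
        // The value is the first whitespace-separated token after the colon.
        rest.split_whitespace().next()?.parse().ok()
    })
}

/// Classify the pool from the two counters; missing fields count as zero.
fn hugepage_status(meminfo: &str) -> HugepageStatus {
    let total = meminfo_field(meminfo, "HugePages_Total").unwrap_or(0);
    let free = meminfo_field(meminfo, "HugePages_Free").unwrap_or(0);
    match (total, free) {
        (0, _) => HugepageStatus::NoPool,
        (total, 0) => HugepageStatus::Exhausted { total },
        (total, free) => HugepageStatus::Available { free, total },
    }
}
```

On a real host the input would come from something like `std::fs::read_to_string("/proc/meminfo")`; the tests described above would skip when the result is anything other than `Available`.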

Benchmarking

  • DigitalOcean 8 GiB droplet (1 vCPU, Ubuntu 24.04, Linux 6.8)
  • Snapshot: 512 MiB source (py-bench, Python 3.12 + numpy, built locally)
  • 1024 hugepages reserved (2 GiB)
  • 20 iterations per configuration, interleaved
  • Branch mode: diff (live mode requires the vendored FC; used diff as a reproducible proxy for
    pause_ms)
  • N tested: 1, 3, 5 (N=100 not feasible on this host — would need ~25 GiB hugepage pool for a 512
    MiB source)

Results:

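| n | baseline p50 | hugepages p50 | speedup p50 | baseline p99 | hugepages p99 | speedup p99 |
|---|--------------|---------------|-------------|--------------|---------------|-------------|
| 1 | 481 ms       | 410 ms        | 1.17×       | 517 ms       | 524 ms        | ~1×         |
| 3 | 1326 ms      | 1108 ms       | 1.20×       | 1530 ms      | 1198 ms       | 1.28×       |
| 5 | 2055 ms      | 1698 ms       | 1.21×       | 2137 ms      | 1874 ms       | 1.14×       |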
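
The speedup columns are the baseline latency divided by the hugepages latency for the same row. A small sketch that recomputes them from the reported values (the function name is illustrative, not part of the bench harness):

```rust
/// Speedup of the hugepages run relative to baseline; values above 1.0
/// mean the hugepages configuration was faster.
/// (Illustrative helper, not taken from bench-hugepages.py.)
fn speedup(baseline_ms: f64, hugepages_ms: f64) -> f64 {
    baseline_ms / hugepages_ms
}

/// Format a speedup the way the results table does, to two decimals.
fn format_speedup(baseline_ms: f64, hugepages_ms: f64) -> String {
    format!("{:.2}×", speedup(baseline_ms, hugepages_ms))
}
```

For example, `format_speedup(1326.0, 1108.0)` gives `1.20×`, matching the n=3 p50 row.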

@WaylandYang
Contributor

Hey @theflashwin — read through the full diff. This is solid work, and unusually complete for a first PR: you didn't just ship the memfd flag, you plumbed it end-to-end through the REST API, CLI, Python SDK, TS SDK, doctor check, and a 430-line bench harness. That's senior-level scope.

A few things I particularly liked:

  1. MemoryBackend::MemfdShared → MemfdShared { use_hugepages } as a struct variant is the right Rust extension here — every matches! and match arm is updated correctly, and the API stays exhaustive. Future flags can slot in without breaking call sites.

  2. copy_via_mmap is the non-trivial bit and you got it right: hugetlb memfds can't be written via write(2), so MAP_SHARED dst + MAP_PRIVATE src + copy_nonoverlapping is exactly the dance the kernel wants. Each unsafe block has a focused SAFETY comment, and the error path correctly munmaps dst_ptr before returning. Nice attention to that.

  3. ENOMEM fallback to 4 KiB pages with a tracing::warn! rather than a hard fail. The right default — users get hugepages where the pool allows, never a stuck daemon when it doesn't.

  4. doctor check reads /proc/meminfo correctly and surfaces three states (no pool / pool exhausted / pass with free/total count) with actionable echo 512 | sudo tee /proc/sys/vm/nr_hugepages hints. Exactly the shape we want.

  5. #[arg(long, requires = "live_fork")] on the CLI — clap-level enforcement so --hugepages without --live-fork is rejected at parse time, not at daemon time. Nice catch.

  6. bench-hugepages.py interleaves baseline and hugepages iterations so thermal/cache effects wash out symmetrically. The p99 + max reporting alongside p50 is the right shape for a memory benchmark (tail behavior matters as much as median). The CSV output makes it easy to compare across runs.
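
To make the ENOMEM fallback in item 3 concrete, here is a minimal sketch of the control flow, assuming the allocation step is passed in as a closure. The function name and closure shape are hypothetical and are not the PR's API; the warning goes to stderr here, where the PR uses tracing::warn!.

```rust
use std::io;

// Linux errno value for "out of memory"; real code would use libc::ENOMEM.
const ENOMEM: i32 = 12;

/// Try the hugepage-backed allocation first and fall back to regular pages
/// only when the hugepage attempt fails with ENOMEM. Any other error is
/// returned unchanged. The returned bool says whether hugepages were used.
/// (Hypothetical helper: `create(true)` means "allocate with hugepages".)
fn create_with_fallback<T, F>(want_hugepages: bool, mut create: F) -> io::Result<(T, bool)>
where
    F: FnMut(bool) -> io::Result<T>,
{
    if want_hugepages {
        match create(true) {
            Ok(region) => return Ok((region, true)),
            Err(e) if e.raw_os_error() == Some(ENOMEM) => {
                eprintln!("warn: hugepage pool exhausted, falling back to 4 KiB pages");
            }
            Err(e) => return Err(e),
        }
    }
    create(false).map(|region| (region, false))
}
```

Restricting the fallback to ENOMEM keeps unrelated failures (for example a permissions error) visible instead of silently masking them with a second attempt.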

A handful of small things — all fine to address in this PR or as follow-ups, none blocking:

  1. Three typos in the create_and_populate doc comment: boolen → boolean, usally → usually, exhuasted → exhausted.

  2. Could replace the magic 21 << 26 with 21 << libc::MFD_HUGE_SHIFT — libc does expose MFD_HUGE_SHIFT (verified). Makes the intent self-documenting without the explanatory comment.

  3. In copy_via_mmap, worth a one-line comment noting that the dst memfd is sized to alloc_size (the hugepage-aligned size) while only size_bytes worth of data is copied — the tail alloc_size - size_bytes is the post-ftruncate zero-fill, which FC never reads since the VMM API call uses size_bytes. Future readers will want to know why padding is safe.

  4. Did you actually run the bench against a host with hugepages reserved? If yes — what were the numbers? Especially curious about p99 spawn at N=100, and BRANCH pause_ms for the bulk-copy pass. If you want a place to drop them as a follow-up commit, bench/live-fork-pause-window/RESULTS-hugepages.md alongside RESULTS-v0.4.md would be the natural home.

  5. Bench --branch-mode defaults to diff — was that intentional? For a hugepages-vs-baseline test on live_fork=true sandboxes I'd have expected live to be the primary measurement (the bulk-copy is the part hugepages should help most). Curious whether you found diff more reproducible.

  6. PR description still says "Do not review yet please!" — when you're ready, flip out of draft so CI runs (cargo fmt --all -- --check, cargo clippy --all-targets -- -D warnings, cargo test). If anything trips, I'm happy to push fmt/clippy fixes to your fork — say the word.
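
For item 2, the sketch below shows where the 21 comes from: the hugepage size is encoded as log2(page size) shifted left by MFD_HUGE_SHIFT. The constants are redeclared locally with the values from the Linux UAPI headers so the snippet builds without the libc crate; real code would take them from libc. The function name is hypothetical.

```rust
// Values as defined in the Linux UAPI <linux/memfd.h>; redeclared here only
// so the sketch is self-contained. Real code would use the libc constants.
const MFD_HUGETLB: u32 = 0x0004;
const MFD_HUGE_SHIFT: u32 = 26;

/// Encode a hugepage size for memfd_create flags: log2(page_size) placed in
/// the bits starting at MFD_HUGE_SHIFT. `page_size` must be a power of two.
/// (Hypothetical helper, not part of this PR.)
const fn mfd_huge_flag(page_size: u64) -> u32 {
    // For a power of two, trailing_zeros() equals log2.
    page_size.trailing_zeros() << MFD_HUGE_SHIFT
}
```

With this, `mfd_huge_flag(2 * 1024 * 1024)` equals `21 << 26`, and OR-ing it with `MFD_HUGETLB` gives the flag value for a 2 MiB hugetlb memfd.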

Other than items 1–3 (5 minutes of polish), this is mergeable as-is. The architecture is sound, the SAFETY discipline is real, and the bench tells the story. You'll be credited in the v0.5.2 release notes (which this'll likely cut, since it's the first material feature post-v0.5.1).

Welcome aboard. 🚀

@theflashwin theflashwin force-pushed the hugepage-backed-mem-snapshot branch from d89316d to 065c355 Compare June 7, 2026 15:51
@theflashwin theflashwin marked this pull request as ready for review June 7, 2026 15:51
@theflashwin
Author

Hi @WaylandYang! Thanks for the input. Let me know if there's anything else to change!

Also, a note on N=100 benchmarking: I don't have enough compute to test this, but I'm very curious myself.

@WaylandYang
Contributor

@theflashwin nice — thanks for the quick reply! Two things on my side:

  1. CI was stuck waiting for first-time-contributor approval (that's why your pushes showed as action_required). I just approved the queue, runs are kicking off now — sorry that wasn't obvious from your end.

  2. N=100 bench — I'll run it on my dev box (4×8 dual-socket Xeon, 512 GiB RAM, 2 MiB hugepages staged on a tmpfs). Will paste numbers under your bench README within an hour or two. Don't worry about reproducing — that's exactly the kind of "I have the iron, send the patch" split I was hoping for.

Once CI lands and the N=100 numbers are in, I think we can merge as-is (the typo / MFD_HUGE_SHIFT cleanups can ride along in a tiny commit on top, no need to re-roll). Tagging for v0.5.2 release notes.
