feat(live-fork, memfd): Back Mem Snapshot with Hugepages by theflashwin · Pull Request #230 · deeplethe/forkd

theflashwin · 2026-06-05T13:36:50Z

PR for #6.

Summary

During the branch command, we want to minimize how long the parent VM stays paused while we copy its memory into a new memory snapshot. The parent VM is paused in two places:

Copying over the VM state (CPU registers, etc.)
Copying over RAM to the new memory snapshot we branch off of

Copying over the VM state is unavoidable but takes very little time (<10 ms), while copying over the RAM is an intensive process. A big contributor to this delay is high TLB pressure, because we have to walk the VM's entire memory. To mitigate this, we back the new memory snapshot with huge pages.

Changes:

Added a use_hugepages boolean flag that makes the memfd_create syscall pass libc::MFD_HUGETLB.
Added a copy_via_mmap function, because a hugepage-backed memfd cannot be written to with the write() syscall; this function works around that by copying through mmap.
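
For readers who want to see the shape of an mmap-based copy, here is a minimal sketch. It is not the code from this PR: the name copy_via_mmap_sketch, the use of plain File handles instead of a hugetlb memfd, and the locally declared mmap bindings and constants (64-bit Linux assumed, so no extra crates are needed) are all assumptions made for illustration. It maps the destination MAP_SHARED so the writes land in the backing object, maps the source MAP_PRIVATE read-only, and copies between the two.

```rust
use std::ffi::c_void;
use std::fs::File;
use std::io;
use std::os::fd::AsRawFd;
use std::ptr;

// Values from the Linux mmap(2) ABI, declared locally for this sketch.
const PROT_READ: i32 = 0x1;
const PROT_WRITE: i32 = 0x2;
const MAP_SHARED: i32 = 0x01;
const MAP_PRIVATE: i32 = 0x02;

// Hand-written bindings; off_t is assumed to be 64 bits wide.
unsafe extern "C" {
    fn mmap(addr: *mut c_void, len: usize, prot: i32, flags: i32, fd: i32, off: i64) -> *mut c_void;
    fn munmap(addr: *mut c_void, len: usize) -> i32;
}

/// Copies `size_bytes` from `src` into `dst` through two mappings.
/// The caller is expected to have sized `dst` to `alloc_size` already.
fn copy_via_mmap_sketch(
    src: &File,
    dst: &File,
    size_bytes: usize,
    alloc_size: usize,
) -> io::Result<()> {
    // Size guard: reject the request before any mapping is created.
    if size_bytes > alloc_size {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "size_bytes exceeds alloc_size",
        ));
    }
    if size_bytes == 0 {
        return Ok(());
    }
    // SAFETY: both mappings are checked against MAP_FAILED before use, they are
    // distinct mappings so the ranges cannot overlap, and the copy stays within
    // size_bytes <= alloc_size. Both mappings are unmapped on every exit path.
    unsafe {
        let dst_ptr = mmap(
            ptr::null_mut(),
            alloc_size,
            PROT_READ | PROT_WRITE,
            MAP_SHARED,
            dst.as_raw_fd(),
            0,
        );
        if dst_ptr as isize == -1 {
            return Err(io::Error::last_os_error());
        }
        let src_ptr = mmap(
            ptr::null_mut(),
            size_bytes,
            PROT_READ,
            MAP_PRIVATE,
            src.as_raw_fd(),
            0,
        );
        if src_ptr as isize == -1 {
            let err = io::Error::last_os_error();
            munmap(dst_ptr, alloc_size); // do not leak the dst mapping
            return Err(err);
        }
        ptr::copy_nonoverlapping(src_ptr as *const u8, dst_ptr as *mut u8, size_bytes);
        munmap(src_ptr, size_bytes);
        munmap(dst_ptr, alloc_size);
    }
    Ok(())
}
```

Bytes past size_bytes in the destination are never written by the sketch, so they keep whatever the backing object already held (zeros for a freshly extended file or memfd).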

Testing

5 new tests added to the existing memfd::tests module:

copy_via_mmap_size_guard_rejects_oversized_request - verifies that passing size_bytes > alloc_size returns an InvalidInput error immediately, without touching any mmap.
copy_via_mmap_content_matches - creates a source file with a known byte pattern, calls copy_via_mmap directly into a plain (non-hugetlb) memfd, reads back through the fd and asserts byte-for-byte equality.
hugepages_metadata_correct - calls create_and_populate with use_hugepages=true, asserts size_bytes() returns the original file size (not the hugepage-aligned alloc_size), and asserts backend_path() has the correct /proc//fd/ format. Skips gracefully if HugePages_Free=0.
hugepages_content_matches_source - same as the existing populated_memfd_content_matches_source but with use_hugepages=true. Verifies the copy_via_mmap path (used for hugetlb memfds) produces identical bytes to the source.
hugepages_size_bytes_is_source_size_not_aligned - source is 4096 bytes (well below 2 MiB). Asserts region.size_bytes() returns 4096, not the 2 MiB-aligned alloc_size.
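
The size_bytes versus alloc_size distinction in these tests comes down to rounding up to a hugepage boundary. A small sketch of that arithmetic, assuming 2 MiB hugepages; the helper names below are made up for illustration and are not taken from the PR:

```rust
/// Size of one hugepage, assuming the common 2 MiB size.
const HUGEPAGE_SIZE: u64 = 2 * 1024 * 1024;

/// Number of 2 MiB hugepages needed to hold `size_bytes`.
fn hugepages_needed(size_bytes: u64) -> u64 {
    size_bytes.div_ceil(HUGEPAGE_SIZE)
}

/// Rounds `size_bytes` up to the next multiple of the hugepage size.
/// This plays the role of alloc_size; size_bytes itself stays unchanged.
fn align_up_to_hugepage(size_bytes: u64) -> u64 {
    hugepages_needed(size_bytes) * HUGEPAGE_SIZE
}
```

With these definitions a 4096-byte source still occupies one full 2 MiB page, while a 512 MiB source is already aligned and takes 256 pages.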

The three hugepages tests check HugePages_Free at runtime and return early with an eprintln! hint if the pool isn't available.

Also added a forkd doctor check to verify hugepage allocation.
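
As a rough illustration of how the pool state can be read, the sketch below parses the HugePages_Total and HugePages_Free fields out of /proc/meminfo text and classifies the result. The enum, the function names, and the three-way split are assumptions for this example, not the actual doctor or test code.

```rust
/// Hypothetical classification of the hugepage pool.
#[derive(Debug, PartialEq)]
enum HugepagePool {
    /// HugePages_Total is 0 or missing: no pool reserved.
    NotConfigured,
    /// A pool exists but every page is in use.
    Exhausted { total: u64 },
    /// At least one free page.
    Available { free: u64, total: u64 },
}

/// Extracts the numeric value of a `Key:   value` line from meminfo text.
fn meminfo_field(meminfo: &str, key: &str) -> Option<u64> {
    meminfo.lines().find_map(|line| {
        let rest = line.strip_prefix(key)?.strip_prefix(':')?;
        rest.split_whitespace().next()?.parse().ok()
    })
}

/// Classifies the pool from the contents of /proc/meminfo.
fn classify_pool(meminfo: &str) -> HugepagePool {
    let total = meminfo_field(meminfo, "HugePages_Total").unwrap_or(0);
    let free = meminfo_field(meminfo, "HugePages_Free").unwrap_or(0);
    match (total, free) {
        (0, _) => HugepagePool::NotConfigured,
        (total, 0) => HugepagePool::Exhausted { total },
        (total, free) => HugepagePool::Available { free, total },
    }
}
```

A test could call classify_pool(&std::fs::read_to_string("/proc/meminfo")?) and return early unless the result is Available.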

Benchmarking

DigitalOcean 8 GiB droplet (1 vCPU, Ubuntu 24.04, Linux 6.8)
Snapshot: 512 MiB source (py-bench, Python 3.12 + numpy, built locally)
1024 hugepages reserved (2 GiB)
20 iterations per configuration, interleaved
Branch mode: diff (live mode requires the vendored FC; used diff as a reproducible proxy for pause_ms)
N tested: 1, 3, 5 (N=100 not feasible on this host; it would need a ~25 GiB hugepage pool for a 512 MiB source)

Results:

n	baseline p50	hugepages p50	speedup p50	baseline p99	hugepages p99	speedup p99
1	481 ms	410 ms	1.17×	517 ms	524 ms	~1×
3	1326 ms	1108 ms	1.20×	1530 ms	1198 ms	1.28×
5	2055 ms	1698 ms	1.21×	2137 ms	1874 ms	1.14×
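
For anyone re-deriving the table, the speedup columns are baseline divided by hugepages. The sketch below shows that ratio plus a nearest-rank percentile over raw samples. The percentile method used by the bench script is not stated in this PR, so nearest-rank is an assumption, and both function names are illustrative.

```rust
/// Nearest-rank percentile (p in 0..=100) over a set of samples.
/// Returns None for an empty slice or an out-of-range p.
fn percentile(samples: &[f64], p: f64) -> Option<f64> {
    if samples.is_empty() || !(0.0..=100.0).contains(&p) {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Nearest rank: ceil(p/100 * n), clamped to a valid 1-based rank.
    let rank = ((p / 100.0) * sorted.len() as f64).ceil() as usize;
    Some(sorted[rank.clamp(1, sorted.len()) - 1])
}

/// Speedup as reported in the table: baseline latency over hugepages latency.
fn speedup(baseline_ms: f64, hugepages_ms: f64) -> f64 {
    baseline_ms / hugepages_ms
}
```

Applied to the N=1 row, speedup(481.0, 410.0) formats to 1.17 with two decimals, matching the table.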

WaylandYang · 2026-06-07T02:39:28Z

Hey @theflashwin — read through the full diff. This is solid work, and unusually complete for a first PR: you didn't just ship the memfd flag, you plumbed it end-to-end through the REST API, CLI, Python SDK, TS SDK, doctor check, and a 430-line bench harness. That's senior-level scope.

A few things I particularly liked:

MemoryBackend::MemfdShared → MemfdShared { use_hugepages } as a struct variant is the right Rust extension here — every matches! and match arm is updated correctly, and the API stays exhaustive. Future flags can slot in without breaking call sites.
copy_via_mmap is the non-trivial bit and you got it right: hugetlb memfds can't be written via write(2), so MAP_SHARED dst + MAP_PRIVATE src + copy_nonoverlapping is exactly the dance the kernel wants. Each unsafe block has a focused SAFETY comment, and the error path correctly munmaps dst_ptr before returning. Nice attention to that.
ENOMEM fallback to 4 KiB pages with a tracing::warn! rather than a hard fail. The right default — users get hugepages where the pool allows, never a stuck daemon when it doesn't.
doctor check reads /proc/meminfo correctly and surfaces three states (no pool / pool exhausted / pass with free/total count) with actionable echo 512 | sudo tee /proc/sys/vm/nr_hugepages hints. Exactly the shape we want.
#[arg(long, requires = "live_fork")] on the CLI — clap-level enforcement so --hugepages without --live-fork is rejected at parse time, not at daemon time. Nice catch.
bench-hugepages.py interleaves baseline and hugepages iterations so thermal/cache effects wash out symmetrically. The p99 + max reporting alongside p50 is the right shape for a memory benchmark (tail behavior matters as much as median). The CSV output makes it easy to compare across runs.
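
To make the ENOMEM fallback concrete, here is a minimal sketch of the pattern. It assumes a caller-supplied create closure standing in for whatever actually builds the region; the names create_with_fallback and PageBacking are invented for this example, and eprintln! stands in for tracing::warn! so the snippet has no dependencies.

```rust
use std::io;

/// ENOMEM as defined on Linux.
const ENOMEM: i32 = 12;

/// Which kind of pages ended up backing the region.
#[derive(Debug, PartialEq)]
enum PageBacking {
    Huge,
    Regular,
}

/// Tries the hugepage path first when requested. Only ENOMEM triggers the
/// fallback to regular pages; any other error is returned unchanged.
fn create_with_fallback<T>(
    use_hugepages: bool,
    mut create: impl FnMut(bool) -> io::Result<T>,
) -> io::Result<(T, PageBacking)> {
    if use_hugepages {
        match create(true) {
            Ok(region) => return Ok((region, PageBacking::Huge)),
            Err(e) if e.raw_os_error() == Some(ENOMEM) => {
                eprintln!("warning: hugepage pool exhausted, falling back to 4 KiB pages");
            }
            Err(e) => return Err(e),
        }
    }
    create(false).map(|region| (region, PageBacking::Regular))
}
```

Restricting the fallback to ENOMEM keeps genuine failures (bad arguments, permission errors) visible instead of silently retrying them on regular pages.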

A handful of small things — all fine to address in this PR or as follow-ups, none blocking:

1. Three typos in the create_and_populate doc comment: boolen → boolean, usally → usually, exhuasted → exhausted.
2. Could replace the magic 21 << 26 with 21 << libc::MFD_HUGE_SHIFT; libc does expose MFD_HUGE_SHIFT (verified). Makes the intent self-documenting without the explanatory comment.
3. In copy_via_mmap, worth a one-line comment noting that the dst memfd is sized to alloc_size (the hugepage-aligned size) while only size_bytes worth of data is copied. The tail, alloc_size - size_bytes, is the post-ftruncate zero-fill, which FC never reads since the VMM API call uses size_bytes. Future readers will want to know why padding is safe.
4. Did you actually run the bench against a host with hugepages reserved? If yes, what were the numbers? Especially curious about p99 spawn at N=100, and BRANCH pause_ms for the bulk-copy pass. If you want a place to drop them as a follow-up commit, bench/live-fork-pause-window/RESULTS-hugepages.md alongside RESULTS-v0.4.md would be the natural home.
5. Bench --branch-mode defaults to diff. Was that intentional? For a hugepages-vs-baseline test on live_fork=true sandboxes I'd have expected live to be the primary measurement (the bulk-copy is the part hugepages should help most). Curious whether you found diff more reproducible.
6. PR description still says "Do not review yet please!". When you're ready, flip out of draft so CI runs (cargo fmt --all -- --check, cargo clippy --all-targets -- -D warnings, cargo test). If anything trips, I'm happy to push fmt/clippy fixes to your fork; say the word.
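
On the 21 << 26 point, the encoding is the log2 of the page size shifted into the upper bits of the memfd_create flags. A small sketch with the constants declared locally (values mirror the Linux memfd UAPI header; the libc crate exposes the same names). The memfd_flags helper and its use of MFD_CLOEXEC are assumptions for illustration, not the flag set this PR passes.

```rust
// Values mirrored from the Linux memfd UAPI header.
const MFD_CLOEXEC: u32 = 0x0001;
const MFD_HUGETLB: u32 = 0x0004;
const MFD_HUGE_SHIFT: u32 = 26;

/// log2 of the 2 MiB hugepage size (1 << 21 == 2 MiB).
const LOG2_2MIB: u32 = 21;

/// Same value as the magic 21 << 26, with the intent spelled out.
const MFD_HUGE_2MB: u32 = LOG2_2MIB << MFD_HUGE_SHIFT;

/// Hypothetical helper: builds memfd_create flags for the requested backing.
fn memfd_flags(use_hugepages: bool) -> u32 {
    if use_hugepages {
        MFD_CLOEXEC | MFD_HUGETLB | MFD_HUGE_2MB
    } else {
        MFD_CLOEXEC
    }
}
```

Writing the shift by name also makes it easier to support another page size later, since only the log2 value changes.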

Other than items 1–3 (5 minutes of polish), this is mergeable as-is. The architecture is sound, the SAFETY discipline is real, and the bench tells the story. You'll be credited in the v0.5.2 release notes (which this'll likely cut, since it's the first material feature post-v0.5.1).

Welcome aboard. 🚀

theflashwin · 2026-06-07T15:54:34Z

Hi @WaylandYang ! Thanks for the input, let me know if there's anything else to change!

Also, note on N=100 benchmarking, I don't have enough compute to test this out, but am very curious myself.

WaylandYang · 2026-06-07T16:04:16Z

@theflashwin nice — thanks for the quick reply! Two things on my side:

CI was stuck waiting for first-time-contributor approval (that's why your pushes showed as action_required). I just approved the queue, runs are kicking off now — sorry that wasn't obvious from your end.
N=100 bench — I'll run it on my dev box (4×8 dual-socket Xeon, 512 GiB RAM, 2 MiB hugepages staged on a tmpfs). Will paste numbers under your bench README within an hour or two. Don't worry about reproducing — that's exactly the kind of "I have the iron, send the patch" split I was hoping for.

Once CI lands and the N=100 numbers are in, I think we can merge as-is (the typo / MFD_HUGE_SHIFT cleanups can ride along in a tiny commit on top, no need to re-roll). Tagging for v0.5.2 release notes.

theflashwin mentioned this pull request on Jun 6, 2026: Hugepage-backed snapshot memory file #6 (Open, 2 tasks)

Commit 065c355: feat(live-fork, memfd): Back Mem Snapshot with Hugepages

theflashwin force-pushed the hugepage-backed-mem-snapshot branch from d89316d to 065c355 on June 7, 2026 15:51

theflashwin marked this pull request as ready for review on June 7, 2026 15:51

theflashwin wants to merge 1 commit into deeplethe:main from theflashwin:hugepage-backed-mem-snapshot
