uffd: record warmup faults and prefetch them on later forks#218
Draft
sjmiller609 wants to merge 9 commits into
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When a Firecracker fork descends from a Template source, skip copying the snapshot mem-file and hardlink it to the source's instead. Firecracker mmaps the mem-file MAP_PRIVATE on restore, so all forks COW from the same backing inode; no per-fork copy is required.

Hardlink rather than symlink: Firecracker's restore path temporarily aliases the source data dir to the fork data dir while loading the snapshot (withSnapshotSourceDirAlias). A symlink whose target traverses the source dir would resolve back into the fork dir during that window and trip ELOOP; a hardlink resolves by inode so the alias has no effect on it. Hardlinks require both paths on the same filesystem, which holds for our standard data-dir layout.

Gated to Firecracker only because other hypervisors (cloud-hypervisor, qemu, vz) don't share MAP_PRIVATE semantics on their snapshot layouts.

Restricted to Template sources because they are explicitly promoted as fork-only and can never be restored; sharing the mem-file with a non-Template source would let a later RestoreInstance mutate the file out from under live forks.

Stacked on hypeship/template-as-state so the Template state both gates "this snapshot is safe to fan out from" and lets fork counts be derived at read time.
Adds lib/uffd, a userfaultfd page server that backs many concurrent fan-out forks against a single read-only template mem-file instead of letting each fork mmap it privately. Firecracker connects to a per-fork UDS, hands us its userfaultfd via SCM_RIGHTS along with a JSON mappings handshake, and the server then services UFFD_EVENT_PAGEFAULT events with UFFDIO_COPY reads from the template.

The Linux hot path lives behind a build tag; non-Linux builds return ErrUnsupported so callers can fall back to MAP_PRIVATE. Cross-platform tests cover the handshake parser and the server lifecycle.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a hot-page recorder + prefetch primitive on top of the userfaultfd page server. During a template's first warmup fork the server can record every served page (Config.RecordHotPages); the resulting HotPageList is stable-sorted, deduplicated, and saved to disk in a small binary format alongside the template. Later forks call Server.Prefetch(forkID, list) to issue UFFDIO_COPY for every recorded page against their userfaultfd before the guest unpauses, eliminating the fault round-trips on those addresses.

The prefetcher is installed by the platform-specific listener once the fork's uffd has been received and registered, so callers can race Prefetch and the fault loop safely. EEXIST/EAGAIN are tolerated the same way the fault handler does to absorb first-touch races with vCPUs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds StateTemplate to the instance state machine. A Standby instance is auto-promoted to Template the first time it's forked from a snapshot, and ForkCount is bumped on each subsequent fork. Templates can't wake while ForkCount > 0; un-promote (Template -> Standby) and delete (Template -> Stopped) are both refused until forks drain. Fork bookkeeping lives on StoredMetadata (IsTemplate, ForkCount, ForkOfTemplate, plus a reserved HotPagesPath for the prefetch path). Deleting a fork decrements the parent template's ForkCount under the parent's lock; deletion of the fork's own data has already happened, so worst case is refcount drift that a future reconciliation pass fixes. The running-fork flow keeps skipping promotion: it restores the source back to Running afterward, and a template can't wake.
Drops the persisted ForkCount field from StoredMetadata and the decrement bookkeeping in DeleteInstance. Live forks of a template are now counted by scanning metadata for ForkOfTemplate matches via a new countTemplateForks helper. The fork-of-template field itself remains the single source of truth, so there's no drift to reconcile. Template promotion on fork only flips IsTemplate when not already set; deletion of a template still refuses when forks exist, but the count is computed from disk rather than read from a denormalized field.
Previously ForkInstance auto-promoted a Standby source to Template the
first time it was forked from a snapshot, and RestoreInstance auto-demoted
a Template before waking it. That implicit lifecycle blurred the rules: a
Standby and a "Standby that has been forked once" behaved differently,
and callers had to know that restoring a Template was a two-step
operation under the hood.
Replace it with explicit PromoteToTemplate / DemoteTemplate manager
methods (and matching POST /instances/{id}/promote-template and
/demote-template endpoints). Promotion is now Standby -> Template only;
demotion is Template -> Standby only and refuses while live forks
reference the template. ForkInstance only records the parent linkage if
the source is already a Template, and RestoreInstance no longer
auto-demotes — callers must demote first.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Silently continuing past an unreadable metadata file could undercount forks of a template, allowing DemoteTemplate or DeleteInstance to free a template whose pages are still mapped by a live fork.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Stacked on: #216 (uffd page server) — review #213 → #214 → #216 first.
Summary
- HotPage / HotPageList types with a sort+dedup snapshot, atomic Save, and LoadHotPageList (binary varint format with an HPL1 magic).
- Config.RecordHotPages flag turns on per-fault recording in the page-fault loop.
- Server.Prefetch(forkID, list) issues UFFDIO_COPY for every entry in a hot-page list against the fork's userfaultfd before the guest unpauses.
- EEXIST/EAGAIN are tolerated to absorb first-touch races with vCPUs.

Why
Even with the shared mem-file + UFFD page server, a fresh fork still pays a fault round-trip on every page the guest needs to boot, which is tens of thousands of page-fault round-trips on the critical path. Recording the hot set during a template's first warmup fork and prefetching it on every later fork eliminates those round-trips for the recorded pages.
Template.HotPagesPath (reserved in PR 2) finally has a producer/consumer.

Test plan
- go test ./lib/uffd/... (covers HotPageList sort/dedup/save/load + bad-magic + truncation)
- RecordHotPages: true, save the list, fork without prefetch and time boot; fork with prefetch and time boot; confirm the second is faster

🤖 Generated with Claude Code