Skip to content

[v1.14] Expose memory mapping & dirty pages; Make memfile dump optional#8

Draft
bchalios wants to merge 53 commits intofirecracker-v1.14from
firecracker-v1.14-direct-mem
Draft

[v1.14] Expose memory mapping & dirty pages; Make memfile dump optional#8
bchalios wants to merge 53 commits intofirecracker-v1.14from
firecracker-v1.14-direct-mem

Conversation

@bchalios
Copy link
Copy Markdown

@bchalios bchalios commented Feb 5, 2026

WIP

License Acceptance

By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.

PR Checklist

  • I have read and understand CONTRIBUTING.md.
  • I have run tools/devtool checkbuild --all to verify that the PR passes
    build checks on all supported architectures.
  • I have run tools/devtool checkstyle to verify that the PR passes the
    automated style checks.
  • I have described what is done in these changes, why they are needed, and
    how they are solving the problem in a clear and encompassing way.
  • I have updated any relevant documentation (both in code and in the docs)
    in the PR.
  • I have mentioned all user-facing changes in CHANGELOG.md.
  • If a specific issue led to this PR, this PR closes the issue.
  • When making API changes, I have followed the
    Runbook for Firecracker API changes.
  • I have tested all new and changed functionalities in unit tests and/or
    integration tests.
  • I have linked an issue to every new TODO.

  • This functionality cannot be added in rust-vmm.

@bchalios bchalios marked this pull request as draft February 5, 2026 19:16
@ValentaTomas ValentaTomas added the wontfix This will not be worked on label Feb 5, 2026
@bchalios bchalios force-pushed the firecracker-v1.14-direct-mem branch from 88151c3 to 89538bc Compare February 5, 2026 20:20
@bchalios bchalios force-pushed the firecracker-v1.14-direct-mem branch from a041675 to 61fdd9d Compare February 12, 2026 23:10
@bchalios bchalios force-pushed the firecracker-v1.14-direct-mem branch from d91bdb1 to 61fdd9d Compare February 13, 2026 23:58
ShadowCurse and others added 17 commits February 24, 2026 12:31
The commit
7ee43cc
("fix: Support creating file with open_file_nonblock")
did modify the file opening utility function by adding
`open` option, but it also removed the `read` option from it.
This causes an error during metrics and logs file initialization code if
the the file is a FIFO and there are no readers already reading from it.
This is because `open` returns `ENXIO` when opening a FIFO write-only as
described in the man page:
```
ENXIO  O_NONBLOCK | O_WRONLY is set, the named file is a FIFO,  and  no
             process has the FIFO open for reading.
```
Fix is just a partial revert of the part that changed the file opening
logic by re-introducing same `open_file_nonblock` as it was before but
with added `create` flag.

Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
No functional change. Just cleanup.

Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Add note about fix in the firecracker-microvm#5698 PR.

Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
In GuestMemoryExtension::dump_dirty() and GuestMemorySlot::dump_dirty()
there are several unwraps which will cause Firecracker to panic. Instead
of panicking simply propagate errors up the call chain to mark the
snapshot operation as failed. To make error handling less verbose and
more descriptive make GuestMemorySlot::dump_dirty() return MemoryError
instead of GuestMemoryError and define new error types for MemoryError.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
The comment in test_dump_dirty() before the last snapshot is taken says:
// First region pages: [dirty, clean]

However, the following code line updates the second page rather than the
first one, so in reality the state of the bitmap is [clean, dirty]. Fix
the comment.

Additionally, the "Dump only the dirty pages." comment further above is
misleading. Since we write to all pages of all regions, the Firecracker
bitmap (rather than the KVM bitmap) will be all 1s, therefore the first
snapshot taken by this function will dump all memory pages.

Rename 'dirty_bitmap' to 'kvm_dirty_bitmap' to make it clear which
bitmap it refers to and add comments about the state of the Firecracker
bitmap too. To make things more obvious for the reader re-configure the
state of the KVM dirty bitmap next to the relevant comment, just before
we take the second snapshot.

Since we are in the neighbourhood, rename 'expected_first_region' to
'expected_file_contents' since it clearly stores the contents of both
regions rather than the first one only.

No functional change intended.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
It is possible that the dirty bitmap dump_dirty() receives is larger or
smaller than the slot size. Return an error in that case.

Additionally, the inner loop in that function always iterates over the
0..64 range. Typically the region size won't be a multiple of 64, so we
need to make sure that we break after we check the last bit that
corresponds to the last page of the region.

Extend the test_dump_dirty() test case to exercise the new code paths by
supplying wrongly sized  bitmaps to the function.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
GuestMemorySlot::dump_dirty() does not advance the file cursor when the
pages at the end of the memory region are clean. This causes the next
slot to start writing at an incorrect offset and corrupting the contents
of the previous region. Fix this by always advancing the cursor at the
end of dump_dirty().

As per the original report from EJ Ciramella this only affects VMs with
more than one memory slots and it can cause Firecracker or the guest to
crash when loading a corrupted snapshot.

Fixes: 6c4c1bf ("feat(mem): introduce KVM slots per GuestMemoryRegion")
Reported-by: EJ Ciramella <ejc3@meta.com>
Suggested-by: EJ Ciramella <ejc3@meta.com>
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
The test_dump_dirty() test currently checks the file contents of a
differential snapshot where the last page of both memory regions is
dirty.

Extend this test to also test the case where the last page of both
regions is clean. Additionally, check that the logical size of the
resulting file is different than the physical size due to the holes
representing clean pages. Finally, make sure that if the KVM dirty
bitmap is larger than the region size, the extra bits are ignored.

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
The test_diff_snapshot_overlay() case tests differential snapshots on
VMs that have a single memory slot since basic_config() uses a 256MiB
memory size by default.

Parametrize the test so that it's repeated for both 256MiB and 4096MiB
sizes. On x86 this will create 2 memory slots and hence test a different
scenario.

Suggested-by: Riccardo Mancini <mancio@amazon.com>
Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Update v1.14.2 CHANGELOG mentioning the fix for
firecracker-microvm#5705

Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
This one was only used in devtool, so
move it there.

No functional change.

Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
There are several issues with the current implementation which
result in a spurious failures/timeouts in a CI. Additionally,
the kvm update logic was duplicated in a couple of places.
To resolve these issues, refactor the code a bit to move
kvm loading/checking logic into one place and implement more
robust locking mechanism.

Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Update version number / CHANGELOG / CREDITS

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
Bump aws-lc-rs to 1.16.1 to take the newest aws-lc-sys 0.38.0 which is
required for security advisory.

(cherry picked from commit f056e65)
Signed-off-by: Jack Thomson <jackabt@amazon.com>
Previously the value of the opt len was not validated which contradicts
the spec.

Per RFC 9293 (MUST-7), opt len includes the kind and length bytes so the
minimum valid value length is 2.

Reported-by: Kai Mitsuzawa <kaizawap@amazon.co.jp> (@kaizawa97)
(cherry picked from commit fa3e4d7)
Signed-off-by: Jack Thomson <jackabt@amazon.com>
Ensure we reject invalid option lengths

(cherry picked from commit e0eb5e7)
Signed-off-by: Jack Thomson <jackabt@amazon.com>
Add new entry for the MMDS TCP option length fix

Signed-off-by: Jack Thomson <jackabt@amazon.com>
@ValentaTomas ValentaTomas requested review from ValentaTomas and removed request for ValentaTomas March 12, 2026 19:56
@ValentaTomas ValentaTomas self-assigned this Mar 12, 2026
@ValentaTomas ValentaTomas self-requested a review March 12, 2026 19:57
@ValentaTomas ValentaTomas removed their assignment Mar 12, 2026
zulinx86 and others added 15 commits April 7, 2026 10:41
The match on `!self.device_status & status` only checked which new bit
was being set, but did not verify that all previously set bits were
preserved in the written value.  This allowed transitions like writing
bare DRIVER_OK from FEATURES_OK state, which clears ACKNOWLEDGE, DRIVER,
and FEATURE_OK bits in violation of the virtio spec [1]: "The driver
MUST NOT clear a device status bit."

Add a guard ensuring the written value is exactly the current status
with one new bit added.

[1]: https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-120001
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
We already have Rust unit tests for the immutability for virtio PCI
queue config fields (queue_size, queue_eanble, queue_desc, queue_avail,
queue_used):
- test_queue_size
- test_queue_enable
- test_queue_addresses

However, those queue config fields would be of the greatest interest
from the security perspective among all the fields.  To simulate a more
realistic scenario, add an integration test that verifies those fields
cannot be modified after boot.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Add a Rust unit test that verifies virtio MMIO queue configuration
fields (QueueNum, QueueReady, QueueDescLow/High, QueueAvailLow/High,
QueueUsedLog/High) cannot be after the device has been activated.
We do have this test especially for the queue config fields because they
would be of the greatest interest from the security perspective among
all the fields.  Ideally thgis would be an integraion test like PCI
counterpart, but MMIO queue config registers are write-only, making
readback verification from the guest impossible.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
When the backend virtio device doesn't implement reset(), the PCI
transport sets device_status to INIT so that the Linux PCI driver's
reset poll terminates correctly.  However, the backend device is still
active.  Without any guard, the driver could do re-initialization
against a stil-live backend device.

Pass device_activated from VirtioPciDevice into set_device_status()
through the write() -> write_common_config_byte() chain.  When
device_activated is true and device_status is already INIT, reject all
status transitions.  This condition uniquely identifies a
post-failed-reset state and naturally survives snapshot restore since
both device_activated and device_status are persisted.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
The MMIO transport sets DEVICE_NEEDS_RESET [1] when device activation
fails, but the PCI transport didn't.

[1]: https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-1220001
Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
The MMIO transport rejects writes to the driver feature register
outsidfe of the DRIVER state, but the PCI transport accepted them
unconditionally.  Add the same guard: only call ack_features_by_page()
when DRIVER Is set and FEATURES_OK, FAILED, and DEVICE_NEEDS_RESET are
all clear.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Add entries for the virtio transport fixes introduced in the preceding
commits.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Since host kernel 6.3 (commit 7af0c2534f4c), KVM fabricates CLIDR_EL1
instead of passing through the host's real value. On hosts with IDC=1
and DIC=0 (e.g. Neoverse V1), the fabricated CLIDR exposes only
L1=Unified when the host actually has separate L1d+L1i, L2, and L3.

Guest kernels >= 6.1.156 backported init_of_cache_level() which counts
cache leaves from the DT, while populate_cache_leaves() uses CLIDR_EL1.
When the DT (built from host sysfs) describes more cache entries than
CLIDR_EL1, the mismatch causes cache sysfs entries to not be created,
breaking /sys/devices/system/cpu/cpu*/cache/* in the guest.

Fix this by reading the current CLIDR_EL1 from vCPU 0, merging in the
ctype and LoC fields derived from the host's sysfs cache topology, and
writing the result back to each vCPU via KVM_SET_ONE_REG. Fields that
cannot be derived from sysfs (LoUU, LoUIS, ICB, Ttype) are preserved
from the original CLIDR_EL1. This makes CLIDR_EL1 consistent with the
FDT, which already describes the real host caches.

On pre-6.3 kernels, KVM passes through the real host CLIDR rather than
fabricating one. Since the sysfs cache topology already matches the real
CLIDR, the merge produces the same value, the write is skipped, and the
override is effectively a no-op.

This approach preserves the full host cache information for the guest
rather than stripping the FDT to match the fabricated CLIDR.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
The previous implementation checked whether either slot endpoint fell
inside the requested range. This missed the containment case where a
slot fully contains the range (neither endpoint inside it), causing
update_kvm_slots to silently skip KVM slot registration/removal for any
block not aligned to a slot boundary.

Replace the two addr_in_range endpoint checks with a proper half-open
interval intersection test: slot_start < range_end && range_start <
slot_end.

Remove the now-unused addr_in_range helper and add a table-driven unit
test covering boundary, interior, cross-slot, full-region, outside, and
zero-length ranges.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
process_stats_queue() used the guest-provided descriptor len field as
the loop bound without validation. A misbehaving guest could set this to
u32::MAX, causing excessive iterations that temporarily monopolise the
VMM event loop.

Add a MAX_STATS_DESC_LEN check before entering the loop. The limit uses
a generous upper bound (256 tags) rather than the current spec count, so
future kernel additions won't silently break stats collection. Oversized
descriptors are logged and held without updating stats, preserving the
stats request/response protocol.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
When a non-compliant driver submits more than one stats buffer,
process_stats_queue returns the previous descriptor via add_used but
never calls advance_used_ring_idx or signal_used_queue. The write to the
used ring is therefore invisible to the guest, which can never reclaim
the buffer.

Add the missing advance_used_ring_idx and signal_used_queue calls so the
guest actually sees the returned descriptor.

Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Firecracker has never advanced the clock on restore for any of the
supported clocksources. Since Linux 5.16, the KVM_CLOCK_REALTIME has
been passed to kvm-clock, causing the monotonic time in the guest to
jump when using kvm-clock as clock source.

Despite being unexpected and not what Firecracker should do, we
recognize this may be a valid usecase so this patch adds a way to
configure it, keeping the default to the expected documented behaviour.

This patch adds a new API flag to LoadSnapshot, clock_realtime, that
advances the clock on restore when set (default is False). Rather than
the clock flags being decided at snapshot time, the restore path ignores
those flags and decides what to do depending on the clock_realtime flag.
This is because the other available flag (KVM_CLOCK_TSC_STABLE) cannot
even be passed to `set_clock`, meaning the only valid flag is
KVM_CLOCK_REALTIME.

The name of the flag was kept generic as we may add this behaviour for
the other clock sources in the future, if the need arises.

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
We actually support all of them.

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
Update version number / CHANGELOG / CREDITS

Signed-off-by: Riccardo Mancini <mancio@amazon.com>
The change was made privately, so the PR number wans't available.

Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
@ValentaTomas ValentaTomas removed their assignment Apr 8, 2026
bchalios added 11 commits April 14, 2026 17:46
Add a few APIs to get information about guest memory:

* An endpoint for guest memory mappings (guest physical to host
  virtual).
* An endpoint for resident and empty pages.
* An endpoint for dirty pages.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
There are cases where a user might want to snapshot the memoyr of a VM
externally. In these cases, we can ask Firecracker to avoid serializing
the memory file to disk when we create a snapshot.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Implement API /memory/mappings which returns the memory mappings of
guest physical to host virtual memory.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Implement API /memory which returns two bitmaps: resident and empty.
`resident` tracks whether a guest page is in the resident set and `empty`
tracks whether it's actually all 0s.

Both bitmaps are structures as vectors of u64, so their length is:
total_number_of_pages.div_ceil(64).

Pages are ordered in the order of pages as reported by/memory/mappings.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Implement API /memory/dirty which returns a bitmap tracking dirty guest
memory. The bitmap is structured as a vector of u64, so its length is:
total_number_of_pages.div_ceil(64).

Pages are ordered in the order of pages as reported by /memory/mappings.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
UFFD provides an API to enable write-protection for memory ranges
tracked by a userfault file descriptor. Detailed information can be
found here: https://docs.kernel.org/admin-guide/mm/userfaultfd.html.

To use the feature, users need to register the memory region with
UFFDIO_REGISTER_MODE_WP. Then, users need to enable explicitly
write-protection for sub-ranges of the registered region.

Writes in pages within write-protected memory ranges can be handled in
one of two ways. In synchronous mode, writes in a protected page will
cause kernel to send a write protection event over the userfaultfd.
In asynchronous mode, the kernel will automatically handle writes to
protected pages by clearing the write-protection bit. Userspace can
later observe the write protection bit by looking into the corresponding
entry of /proc/<pid>/pagemap.

This commit, uncoditionally, enables write protection for guest memory
using the asynchronous mode.

!NOTE!: asynchronous write protection requires (host) kernel version 6.7
or later).

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
This is an optional test on the Firecracker side and most of the times
it's ignored (when valid dependency changes happen). Having this fail
blocks our fc-versions releases.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
TODO

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Add descriptions for MicovmState from previous Firecracker versions.
Moreover, add methods to translate a snapshot file from previous
versions in the current one.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Now that we have logic for translating snapshot formats, we can allow
the /snapshot/load API to parse v1.10 and v1.12 snapshots. We change the
logic that parses the snapshot file to first read the version from the
file and then (if needed) translate it to the expected v1.14 version.

Currently older versions supported are v1.10 and v1.12.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Changes we did for supporting older snapshot formats, did not really
compile on ARM systems. Fix the compilation issues. The issues were
mainly bad re-exports.

Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
@bchalios bchalios force-pushed the firecracker-v1.14-direct-mem branch from 76f16f0 to 458ca91 Compare April 14, 2026 15:51
@cursor
Copy link
Copy Markdown

cursor bot commented Apr 14, 2026

PR Summary

Medium Risk
Touches snapshot/restore behavior and virtio transport state machines (MMIO + PCI), which can impact guest device initialization and compatibility; also introduces a git-sourced userfaultfd dependency and KVM clock restore flag handling.

Overview
Adds new GET /memory, GET /memory/mappings, and GET /memory/dirty API endpoints (wired through VmmAction/VmmData) to expose guest memory mapping metadata plus resident/empty and dirty-page bitmaps.

Makes snapshot memfile dumping optional by changing SnapshotCreateParams.mem_file_path to optional (and updating parsing/tests/spec), and extends LoadSnapshot with an optional clock_realtime flag (x86_64-only) that restores kvm-clock with KVM_CLOCK_REALTIME when supported.

Hardens device and parsing logic: caps virtio-rng per-request allocations to 64KiB, bounds balloon stats descriptor processing, validates TCP option lengths in MMDS, and enforces virtio device status/queue configuration sequencing in both MMIO and PCI transports (including DEVICE_NEEDS_RESET on activation failure). Also improves aarch64 snapshot compatibility by making GIC/ITS state restore tolerant of missing older fields and overrides fabricated CLIDR cache topology to match sysfs.

Updates release metadata/docs (v1.14.4 changelog, clocksource docs, release policy), adjusts seccomp allowlist (pread64), relaxes deny.toml license list, removes the dependency-modification CI workflow, and updates dependencies (notably aws-lc-rs and userfaultfd switched to a git branch).

Reviewed by Cursor Bugbot for commit 458ca91. Bugbot is set up for automated code reviews on this repo. Configure here.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 458ca91. Configure here.

config_space: NetConfigSpaceState,
pub config_space: NetConfigSpaceState,
pub virtio_state: VirtioDeviceState,
pub rx_buffers_state: RxBufferState,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing serde default for new snapshot field breaks compatibility

High Severity

The new rx_buffers_state field in NetState lacks a #[serde(default)] attribute. Since NetState derives Deserialize, loading snapshots created by older Firecracker versions (which don't include this field) will fail with a deserialization error. RxBufferState already derives Default, so adding #[serde(default)] would allow backward-compatible deserialization while preserving the zero-valued default.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 458ca91. Configure here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

wontfix This will not be worked on

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants