[v1.14] Expose memory mapping & dirty pages; Make memfile dump optional#8
[v1.14] Expose memory mapping & dirty pages; Make memfile dump optional#8bchalios wants to merge 53 commits intofirecracker-v1.14from
Conversation
88151c3 to
89538bc
Compare
a041675 to
61fdd9d
Compare
d91bdb1 to
61fdd9d
Compare
The commit 7ee43cc ("fix: Support creating file with open_file_nonblock") did modify the file opening utility function by adding `open` option, but it also removed the `read` option from it. This causes an error during metrics and logs file initialization code if the the file is a FIFO and there are no readers already reading from it. This is because `open` returns `ENXIO` when opening a FIFO write-only as described in the man page: ``` ENXIO O_NONBLOCK | O_WRONLY is set, the named file is a FIFO, and no process has the FIFO open for reading. ``` Fix is just a partial revert of the part that changed the file opening logic by re-introducing same `open_file_nonblock` as it was before but with added `create` flag. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
No functional change. Just cleanup. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Add note about fix in the firecracker-microvm#5698 PR. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
In GuestMemoryExtension::dump_dirty() and GuestMemorySlot::dump_dirty() there are several unwraps which will cause Firecracker to panic. Instead of panicking simply propagate errors up the call chain to mark the snapshot operation as failed. To make error handling less verbose and more descriptive make GuestMemorySlot::dump_dirty() return MemoryError instead of GuestMemoryError and define new error types for MemoryError. Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
The comment in test_dump_dirty() before the last snapshot is taken says: // First region pages: [dirty, clean] However, the following code line updates the second page rather than the first one, so in reality the state of the bitmap is [clean, dirty]. Fix the comment. Additionally, the "Dump only the dirty pages." comment further above is misleading. Since we write to all pages of all regions, the Firecracker bitmap (rather than the KVM bitmap) will be all 1s, therefore the first snapshot taken by this function will dump all memory pages. Rename 'dirty_bitmap' to 'kvm_dirty_bitmap' to make it clear which bitmap it refers to and add comments about the state of the Firecracker bitmap too. To make things more obvious for the reader re-configure the state of the KVM dirty bitmap next to the relevant comment, just before we take the second snapshot. Since we are in the neighbourhood, rename 'expected_first_region' to 'expected_file_contents' since it clearly stores the contents of both regions rather than the first one only. No functional change intended. Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
It is possible that the dirty bitmap dump_dirty() receives is larger or smaller than the slot size. Return an error in that case. Additionally, the inner loop in that function always iterates over the 0..64 range. Typically the region size won't be a multiple of 64, so we need to make sure that we break after we check the last bit that corresponds to the last page of the region. Extend the test_dump_dirty() test case to exercise the new code paths by supplying wrongly sized bitmaps to the function. Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
GuestMemorySlot::dump_dirty() does not advance the file cursor when the pages at the end of the memory region are clean. This causes the next slot to start writing at an incorrect offset and corrupting the contents of the previous region. Fix this by always advancing the cursor at the end of dump_dirty(). As per the original report from EJ Ciramella this only affects VMs with more than one memory slots and it can cause Firecracker or the guest to crash when loading a corrupted snapshot. Fixes: 6c4c1bf ("feat(mem): introduce KVM slots per GuestMemoryRegion") Reported-by: EJ Ciramella <ejc3@meta.com> Suggested-by: EJ Ciramella <ejc3@meta.com> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
The test_dump_dirty() test currently checks the file contents of a differential snapshot where the last page of both memory regions is dirty. Extend this test to also test the case where the last page of both regions is clean. Additionally, check that the logical size of the resulting file is different than the physical size due to the holes representing clean pages. Finally, make sure that if the KVM dirty bitmap is larger than the region size, the extra bits are ignored. Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
The test_diff_snapshot_overlay() case tests differential snapshots on VMs that have a single memory slot since basic_config() uses a 256MiB memory size by default. Parametrize the test so that it's repeated for both 256MiB and 4096MiB sizes. On x86 this will create 2 memory slots and hence test a different scenario. Suggested-by: Riccardo Mancini <mancio@amazon.com> Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
Update v1.14.2 CHANGELOG mentioning the fix for firecracker-microvm#5705 Signed-off-by: Ilias Stamatis <ilstam@amazon.com>
This one was only used in devtool, so move it there. No functional change. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
There are several issues with the current implementation which result in a spurious failures/timeouts in a CI. Additionally, the kvm update logic was duplicated in a couple of places. To resolve these issues, refactor the code a bit to move kvm loading/checking logic into one place and implement more robust locking mechanism. Signed-off-by: Egor Lazarchuk <yegorlz@amazon.co.uk>
Update version number / CHANGELOG / CREDITS Signed-off-by: Riccardo Mancini <mancio@amazon.com>
Bump aws-lc-rs to 1.16.1 to take the newest aws-lc-sys 0.38.0 which is required for security advisory. (cherry picked from commit f056e65) Signed-off-by: Jack Thomson <jackabt@amazon.com>
Previously the value of the opt len was not validated which contradicts the spec. Per RFC 9293 (MUST-7), opt len includes the kind and length bytes so the minimum valid value length is 2. Reported-by: Kai Mitsuzawa <kaizawap@amazon.co.jp> (@kaizawa97) (cherry picked from commit fa3e4d7) Signed-off-by: Jack Thomson <jackabt@amazon.com>
Ensure we reject invalid option lengths (cherry picked from commit e0eb5e7) Signed-off-by: Jack Thomson <jackabt@amazon.com>
Add new entry for the MMDS TCP option length fix Signed-off-by: Jack Thomson <jackabt@amazon.com>
The match on `!self.device_status & status` only checked which new bit was being set, but did not verify that all previously set bits were preserved in the written value. This allowed transitions like writing bare DRIVER_OK from FEATURES_OK state, which clears ACKNOWLEDGE, DRIVER, and FEATURE_OK bits in violation of the virtio spec [1]: "The driver MUST NOT clear a device status bit." Add a guard ensuring the written value is exactly the current status with one new bit added. [1]: https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-120001 Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
We already have Rust unit tests for the immutability for virtio PCI queue config fields (queue_size, queue_eanble, queue_desc, queue_avail, queue_used): - test_queue_size - test_queue_enable - test_queue_addresses However, those queue config fields would be of the greatest interest from the security perspective among all the fields. To simulate a more realistic scenario, add an integration test that verifies those fields cannot be modified after boot. Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Add a Rust unit test that verifies virtio MMIO queue configuration fields (QueueNum, QueueReady, QueueDescLow/High, QueueAvailLow/High, QueueUsedLog/High) cannot be after the device has been activated. We do have this test especially for the queue config fields because they would be of the greatest interest from the security perspective among all the fields. Ideally thgis would be an integraion test like PCI counterpart, but MMIO queue config registers are write-only, making readback verification from the guest impossible. Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
When the backend virtio device doesn't implement reset(), the PCI transport sets device_status to INIT so that the Linux PCI driver's reset poll terminates correctly. However, the backend device is still active. Without any guard, the driver could do re-initialization against a stil-live backend device. Pass device_activated from VirtioPciDevice into set_device_status() through the write() -> write_common_config_byte() chain. When device_activated is true and device_status is already INIT, reject all status transitions. This condition uniquely identifies a post-failed-reset state and naturally survives snapshot restore since both device_activated and device_status are persisted. Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
The MMIO transport sets DEVICE_NEEDS_RESET [1] when device activation fails, but the PCI transport didn't. [1]: https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-1220001 Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
The MMIO transport rejects writes to the driver feature register outsidfe of the DRIVER state, but the PCI transport accepted them unconditionally. Add the same guard: only call ack_features_by_page() when DRIVER Is set and FEATURES_OK, FAILED, and DEVICE_NEEDS_RESET are all clear. Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Add entries for the virtio transport fixes introduced in the preceding commits. Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Since host kernel 6.3 (commit 7af0c2534f4c), KVM fabricates CLIDR_EL1 instead of passing through the host's real value. On hosts with IDC=1 and DIC=0 (e.g. Neoverse V1), the fabricated CLIDR exposes only L1=Unified when the host actually has separate L1d+L1i, L2, and L3. Guest kernels >= 6.1.156 backported init_of_cache_level() which counts cache leaves from the DT, while populate_cache_leaves() uses CLIDR_EL1. When the DT (built from host sysfs) describes more cache entries than CLIDR_EL1, the mismatch causes cache sysfs entries to not be created, breaking /sys/devices/system/cpu/cpu*/cache/* in the guest. Fix this by reading the current CLIDR_EL1 from vCPU 0, merging in the ctype and LoC fields derived from the host's sysfs cache topology, and writing the result back to each vCPU via KVM_SET_ONE_REG. Fields that cannot be derived from sysfs (LoUU, LoUIS, ICB, Ttype) are preserved from the original CLIDR_EL1. This makes CLIDR_EL1 consistent with the FDT, which already describes the real host caches. On pre-6.3 kernels, KVM passes through the real host CLIDR rather than fabricating one. Since the sysfs cache topology already matches the real CLIDR, the merge produces the same value, the write is skipped, and the override is effectively a no-op. This approach preserves the full host cache information for the guest rather than stripping the FDT to match the fabricated CLIDR. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
The previous implementation checked whether either slot endpoint fell inside the requested range. This missed the containment case where a slot fully contains the range (neither endpoint inside it), causing update_kvm_slots to silently skip KVM slot registration/removal for any block not aligned to a slot boundary. Replace the two addr_in_range endpoint checks with a proper half-open interval intersection test: slot_start < range_end && range_start < slot_end. Remove the now-unused addr_in_range helper and add a table-driven unit test covering boundary, interior, cross-slot, full-region, outside, and zero-length ranges. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
process_stats_queue() used the guest-provided descriptor len field as the loop bound without validation. A misbehaving guest could set this to u32::MAX, causing excessive iterations that temporarily monopolise the VMM event loop. Add a MAX_STATS_DESC_LEN check before entering the loop. The limit uses a generous upper bound (256 tags) rather than the current spec count, so future kernel additions won't silently break stats collection. Oversized descriptors are logged and held without updating stats, preserving the stats request/response protocol. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
When a non-compliant driver submits more than one stats buffer, process_stats_queue returns the previous descriptor via add_used but never calls advance_used_ring_idx or signal_used_queue. The write to the used ring is therefore invisible to the guest, which can never reclaim the buffer. Add the missing advance_used_ring_idx and signal_used_queue calls so the guest actually sees the returned descriptor. Signed-off-by: Nikita Kalyazin <kalyazin@amazon.com>
Firecracker has never advanced the clock on restore for any of the supported clocksources. Since Linux 5.16, the KVM_CLOCK_REALTIME has been passed to kvm-clock, causing the monotonic time in the guest to jump when using kvm-clock as clock source. Despite being unexpected and not what Firecracker should do, we recognize this may be a valid usecase so this patch adds a way to configure it, keeping the default to the expected documented behaviour. This patch adds a new API flag to LoadSnapshot, clock_realtime, that advances the clock on restore when set (default is False). Rather than the clock flags being decided at snapshot time, the restore path ignores those flags and decides what to do depending on the clock_realtime flag. This is because the other available flag (KVM_CLOCK_TSC_STABLE) cannot even be passed to `set_clock`, meaning the only valid flag is KVM_CLOCK_REALTIME. The name of the flag was kept generic as we may add this behaviour for the other clock sources in the future, if the need arises. Signed-off-by: Riccardo Mancini <mancio@amazon.com>
We actually support all of them. Signed-off-by: Riccardo Mancini <mancio@amazon.com>
Update version number / CHANGELOG / CREDITS Signed-off-by: Riccardo Mancini <mancio@amazon.com>
The change was made privately, so the PR number wans't available. Signed-off-by: Takahiro Itazuri <itazur@amazon.com>
Add a few APIs to get information about guest memory: * An endpoint for guest memory mappings (guest physical to host virtual). * An endpoint for resident and empty pages. * An endpoint for dirty pages. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
There are cases where a user might want to snapshot the memoyr of a VM externally. In these cases, we can ask Firecracker to avoid serializing the memory file to disk when we create a snapshot. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Implement API /memory/mappings which returns the memory mappings of guest physical to host virtual memory. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Implement API /memory which returns two bitmaps: resident and empty. `resident` tracks whether a guest page is in the resident set and `empty` tracks whether it's actually all 0s. Both bitmaps are structures as vectors of u64, so their length is: total_number_of_pages.div_ceil(64). Pages are ordered in the order of pages as reported by/memory/mappings. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Implement API /memory/dirty which returns a bitmap tracking dirty guest memory. The bitmap is structured as a vector of u64, so its length is: total_number_of_pages.div_ceil(64). Pages are ordered in the order of pages as reported by /memory/mappings. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
UFFD provides an API to enable write-protection for memory ranges tracked by a userfault file descriptor. Detailed information can be found here: https://docs.kernel.org/admin-guide/mm/userfaultfd.html. To use the feature, users need to register the memory region with UFFDIO_REGISTER_MODE_WP. Then, users need to enable explicitly write-protection for sub-ranges of the registered region. Writes in pages within write-protected memory ranges can be handled in one of two ways. In synchronous mode, writes in a protected page will cause kernel to send a write protection event over the userfaultfd. In asynchronous mode, the kernel will automatically handle writes to protected pages by clearing the write-protection bit. Userspace can later observe the write protection bit by looking into the corresponding entry of /proc/<pid>/pagemap. This commit, uncoditionally, enables write protection for guest memory using the asynchronous mode. !NOTE!: asynchronous write protection requires (host) kernel version 6.7 or later). Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
This is an optional test on the Firecracker side and most of the times it's ignored (when valid dependency changes happen). Having this fail blocks our fc-versions releases. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
TODO Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Add descriptions for MicovmState from previous Firecracker versions. Moreover, add methods to translate a snapshot file from previous versions in the current one. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Now that we have logic for translating snapshot formats, we can allow the /snapshot/load API to parse v1.10 and v1.12 snapshots. We change the logic that parses the snapshot file to first read the version from the file and then (if needed) translate it to the expected v1.14 version. Currently older versions supported are v1.10 and v1.12. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
Changes we did for supporting older snapshot formats, did not really compile on ARM systems. Fix the compilation issues. The issues were mainly bad re-exports. Signed-off-by: Babis Chalios <babis.chalios@e2b.dev>
76f16f0 to
458ca91
Compare
PR SummaryMedium Risk Overview Makes snapshot memfile dumping optional by changing Hardens device and parsing logic: caps virtio-rng per-request allocations to 64KiB, bounds balloon stats descriptor processing, validates TCP option lengths in MMDS, and enforces virtio device status/queue configuration sequencing in both MMIO and PCI transports (including Updates release metadata/docs (v1.14.4 changelog, clocksource docs, release policy), adjusts seccomp allowlist ( Reviewed by Cursor Bugbot for commit 458ca91. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 458ca91. Configure here.
| config_space: NetConfigSpaceState, | ||
| pub config_space: NetConfigSpaceState, | ||
| pub virtio_state: VirtioDeviceState, | ||
| pub rx_buffers_state: RxBufferState, |
There was a problem hiding this comment.
Missing serde default for new snapshot field breaks compatibility
High Severity
The new rx_buffers_state field in NetState lacks a #[serde(default)] attribute. Since NetState derives Deserialize, loading snapshots created by older Firecracker versions (which don't include this field) will fail with a deserialization error. RxBufferState already derives Default, so adding #[serde(default)] would allow backward-compatible deserialization while preserving the zero-valued default.
Reviewed by Cursor Bugbot for commit 458ca91. Configure here.


WIP
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.PR Checklist
tools/devtool checkbuild --allto verify that the PR passesbuild checks on all supported architectures.
tools/devtool checkstyleto verify that the PR passes theautomated style checks.
how they are solving the problem in a clear and encompassing way.
in the PR.
CHANGELOG.md.Runbook for Firecracker API changes.
integration tests.
TODO.rust-vmm.