nvme_driver: only support keepalive for a certain class of devices #3415
gurasinghMS wants to merge 18 commits into microsoft:main
Conversation
Pull request overview
This PR narrows NVMe "keepalive" (servicing without device reset) behavior in Underhill to only a specific class of devices, as a mitigation for known incompatibilities with certain devices.
Changes:
- Introduces a helper to decide whether a device is “keepalive-compatible” based on its identifier.
- During servicing shutdown, only enables keepalive (do-not-reset / skip shutdown) for compatible devices; logs when keepalive is requested but disabled per-device.
- During servicing save, skips persisting NVMe state for incompatible devices.
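Based on the review thread below, the new helper gates on the VPCI instance GUID containing `c05b`. A minimal sketch of that heuristic follows; the function body here is an assumption, not the actual OpenVMM implementation (which lives in `nvme_manager/mod.rs`):

```rust
/// Sketch of the keepalive gate, assuming the `c05b` substring heuristic
/// described in the review comments below. Not the real implementation.
pub(crate) fn is_nvme_keepalive_compatible(pci_id: &str) -> bool {
    // ASAP devices are identified by the VPCI instance GUID containing `c05b`.
    pci_id.contains("c05b")
}

fn main() {
    // A c05b device keeps keepalive; anything else falls back to reset.
    assert!(is_nvme_keepalive_compatible("c05b1234-0000-0000-0000-000000000000"));
    assert!(!is_nvme_keepalive_compatible("deadbeef-0000-0000-0000-000000000000"));
    println!("ok");
}
```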
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| openhcl/underhill_core/src/nvme_manager/mod.rs | Adds a helper function to gate keepalive compatibility based on an identifier heuristic. |
| openhcl/underhill_core/src/nvme_manager/manager.rs | Applies the compatibility gate to shutdown options and to which devices participate in save/restore. |
```rust
// nvme_keepalive is received from host but it is only valid
// when memory pool allocator supports save/restore. Further,
// as a partial mitigation for known incompatibilities between
// keepalive and NVMe Direct v2 devices, we only honor
// keepalive for ASAP devices (identified by the VPCI
// instance GUID containing `c05b`).
let host_requested_keepalive =
    nvme_keepalive && self.context.save_restore_supported;
let device_keepalive =
    host_requested_keepalive && is_nvme_keepalive_compatible(&pci_id);
if host_requested_keepalive && !device_keepalive {
    tracing::info!(
        %pci_id,
        "disabling nvme keepalive for non-ASAP device; \
         falling back to reset-on-servicing"
    );
}
```
This is good for a proof of concept and for evaluating this fix. As discussed offline, you'll want to use device identity data to make this analysis more robust.
Cool, thanks for giving this a try. Please make sure to write a CI test that validates this works when there are two devices: device (a) that supports keepalive and device (b) that does not. Hopefully that will be rather quick, and it will catch basic problems in this path.
```rust
do_not_reset: nvme_keepalive
    && self.context.save_restore_supported
    && is_nvme_keepalive_compatible(&pci_id),
skip_device_shutdown: nvme_keepalive
    && self.context.save_restore_supported
    && is_nvme_keepalive_compatible(&pci_id),
```
I don't follow. I think the code's logic is correct.
This is interesting; it makes me want to think twice about it, because when my local Copilot first wrote this code it also decided to use !is_nvme_keepalive_compatible(&pci_id) here, and I had corrected it.
```rust
.filter_map(|(pci_id, driver)| {
    if is_nvme_keepalive_compatible(pci_id) {
        Some((pci_id.clone(), driver.client().clone()))
    } else {
        tracing::info!(
            %pci_id,
            "skipping save of nvme device; \
             keepalive disabled for this device"
        );
        None
    }
})
```
This makes sense. I don't know how long the logging prolongs the time inside the lock (which lock? I don't see locks in this function), so I don't know if we should fix this.
I don't know what lock it is talking about either. But unless there are hundreds of devices, which there shouldn't be, I think this is fine. @alandau you ok moving forward with this?
alandau left a comment
Looks good, a few nits inline.
Do we want to allow keepalive only on c05b, or find out the PCI ID for the problematic hardware and disable keepalive only for it?
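The alternative raised here, deny-listing only the problematic hardware rather than allow-listing `c05b`, might look like the following. The vendor/device IDs are placeholders; the actual problematic part is not named in the PR:

```rust
/// Hypothetical deny-list variant of the gate: keepalive stays enabled for
/// everything except known-incompatible vendor/device pairs.
fn is_keepalive_denied(vendor_id: u16, device_id: u16) -> bool {
    // Placeholder IDs; the real problematic hardware was not identified here.
    const DENY_LIST: &[(u16, u16)] = &[(0xabcd, 0x1234)];
    DENY_LIST.contains(&(vendor_id, device_id))
}

fn main() {
    assert!(is_keepalive_denied(0xabcd, 0x1234));
    // The Microsoft vendor/device pair from the test diff is not denied.
    assert!(!is_keepalive_denied(0x1414, 0x00a9));
    println!("ok");
}
```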
…over, it will tear down that device state and start again ... at least it should
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
openhcl/underhill_core/src/nvme_manager/manager.rs:1
- The restore path still iterates every saved NVMe entry and always passes Some(&disk.driver_state), even when the PCI ID is not keepalive-compatible. That means a servicing upgrade from an older build that saved all NVMe controllers will still attempt to restore state for non-`c05b` devices instead of forcing them through the reset path. The save-side filter added in this PR only protects states produced by the new code, not states produced by the previous version.
// Copyright (c) Microsoft Corporation.
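A restore-side guard that would close the gap Copilot describes might look like the following. This is a hypothetical sketch, not part of the PR, which only filters on the save side:

```rust
/// Hypothetical restore-side guard: drop saved state for devices the
/// current gate rejects, forcing them through the reset path instead.
fn restore_state_for<'a, S>(pci_id: &str, saved: Option<&'a S>) -> Option<&'a S> {
    // Stand-in for the gate; the c05b heuristic is from the review thread.
    fn is_nvme_keepalive_compatible(pci_id: &str) -> bool {
        pci_id.contains("c05b")
    }
    if is_nvme_keepalive_compatible(pci_id) { saved } else { None }
}

fn main() {
    let state = 42u32;
    // A compatible device keeps its saved state; an incompatible one resets.
    assert_eq!(restore_state_for("c05b-dev", Some(&state)), Some(&42));
    assert_eq!(restore_state_for("other-dev", Some(&state)), None::<&u32>);
    println!("ok");
}
```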
```rust
vm.restart_openhcl(igvm_file.clone(), flags).await?;

agent.ping().await?;

// The non-c05b device must issue CREATE_IO_COMPLETION_QUEUE after
// servicing because its keepalive was downgraded to reset.
CancelContext::new()
    .with_timeout(Duration::from_secs(60))
    .until_cancelled(no_keepalive_create_seen_recv)
    .await
```
```rust
source: NvmeSpawnerError,
}

/// Device prefix for NVMe devices that are compatible with keepalive.
/// Returns whether the given PCI ID corresponds to an NVMe device that is
/// compatible with keepalive.
/// DEV_NOTE: This is a weak, heuristic-based approach which is not ideal
/// but is necessary.
pub(crate) fn is_nvme_keepalive_compatible(pci_id: &str) -> bool {
```
LGTM
… figure out keepalive info
```rust
// Read the PCI vendor/device IDs once and cache the resulting
// keepalive compatibility flag on the `NvmeDriverManager`.
let keepalive_compatible = is_nvme_keepalive_compatible(&pci_id);
```
@alandau do you know if the device would be resolved at this point? If the device has not arrived back and reading the config space leads to a stall, it could impact boot times (which is no better than not doing keepalive at all).
I don't. But I remember at some point we ask VfioDevice to restore and it has to go to /sys to do its magic. You can try seeing if it's doing it past this point or not.
```rust
let keepalive_compatible = is_nvme_keepalive_compatible(&pci_id);

let driver = NvmeDriverManager::new(
    &context.driver_source,
    &pci_id,
    context.vp_count,
    context.save_restore_supported,
    context.save_restore_supported && keepalive_compatible,
```
```rust
#[openvmm_test(openhcl_linux_direct_x64 [LATEST_LINUX_DIRECT_TEST_X64])]
async fn servicing_keepalive_per_device_gate(
    config: PetriVmBuilder<OpenVmmPetriBackend>,
    (igvm_file,): (ResolvedArtifact<impl petri_artifacts_common::tags::IsOpenhclIgvm>,),
) -> Result<(), anyhow::Error> {
```
```rust
.with_hardware_config_fault(
    HardwareConfigFaultConfig::new()
        .with_vendor_id(0x1414)
        .with_device_id(0xb111),
```
```rust
let hardware_config_fault = fault_configuration.hardware_config_fault.take();
let vendor_id = hardware_config_fault
    .and_then(|f| f.vendor_id)
    .unwrap_or(VENDOR_ID);
let device_id = hardware_config_fault
    .and_then(|f| f.device_id)
    .unwrap_or(0x00a9);
```
LGTM, a few small ones inline
alandau left a comment
Ohh, apparently commenting on the discussion tab doesn't submit pending line comments.
```rust
use nvme_resources::fault::AdminQueueFaultBehavior;
use nvme_resources::fault::AdminQueueFaultConfig;
use nvme_resources::fault::FaultConfiguration;
use nvme_resources::fault::HardwareConfigFaultConfig;
```
```rust
match guard.entry(pci_id.clone()) {
    hash_map::Entry::Occupied(_) => unreachable!(), // We checked above that this entry does not exist.
    hash_map::Entry::Vacant(entry) => {
        let keepalive_compatible = is_nvme_keepalive_compatible(&pci_id);
```
```rust
let mut devices_to_save: HashMap<String, NvmeDriverManagerClient> = self
    .context
    .devices
    .write()
    .iter()
```
This reverts commit 4efda83.
No description provided.