nvme_driver: adding admin queue logging #3409
gurasinghMS wants to merge 14 commits into microsoft:main from
Conversation
…le commit to fix build pipeline issues
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds detailed admin-queue command logging in the NVMe driver, and adjusts servicing/save related timeouts and defaults to better align with host behavior (including enabling NVMe keepalive by default).
Changes:
- Log admin-queue command submission/completion details (CID/opcode/nsid/cdw10/cdw11 + elapsed).
- Cap/override host-provided servicing timeout hints to 59s and update related logging.
- Change default NVMe keepalive behavior (enabled by default) and update VMM tests’ OpenHCL cmdlines accordingly.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 14 comments.
| File | Description |
|---|---|
| vmm_tests/vmm_tests/tests/tests/x86_64/openhcl_uefi.rs | Removes explicit keepalive override from test cmdlines to match new defaults. |
| vmm_tests/vmm_tests/tests/tests/multiarch/openhcl_servicing.rs | Removes explicit keepalive override from servicing tests’ cmdlines. |
| vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs | Adds admin command submission/completion logging and threads device id into pending command tracking. |
| vm/devices/storage/disk_nvme/nvme_driver/src/driver.rs | Adds extra queue PFN logging; refactors worker loop; increases log verbosity in several paths. |
| openhcl/underhill_core/src/nvme_manager/device.rs | Adds shutdown-request logging in NVMe manager worker. |
| openhcl/underhill_core/src/dispatch/mod.rs | Changes timeout-hint handling to default/cap to 59s and updates log message. |
| openhcl/openhcl_boot/src/cmdline.rs | Enables NVMe keepalive by default and adjusts cmdline parsing semantics. |
```diff
  enable_vtl2_gpa_pool: Vtl2GpaPoolConfig::Heuristics(Vtl2GpaPoolLookupTable::Release), // use the release config by default
  sidecar: SidecarOptions::default(),
- disable_nvme_keep_alive: true,
+ disable_nvme_keep_alive: false,
```
With the new default disable_nvme_keep_alive: false, the parser no longer disables keepalive when OPENHCL_DISABLE_NVME_KEEP_ALIVE is present without an explicit =... value (because arg becomes None and the branch is skipped). For backward-compatible boolean-flag behavior, treat a missing value as true (disable keepalive), and only treat an explicit =0 as false.
```rust
if arg.is_some_and(|a| a != "0") {
    self.disable_nvme_keep_alive = true;
}
```
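The backward-compatible behavior the review asks for can be sketched as a small standalone function. This is an illustrative sketch, not the actual cmdline parser: the function name and the `Option<&str>` argument shape are assumptions standing in for the real parsing code.

```rust
// Hypothetical sketch of backward-compatible boolean-flag parsing. `arg` is
// the value after `=` (None when the flag appears bare, with no `=value`).
// A bare flag or any value other than "0" disables keepalive; an explicit
// `=0` keeps the new default (keepalive enabled).
fn parse_disable_nvme_keep_alive(arg: Option<&str>) -> bool {
    match arg {
        None => true,       // flag present without a value: disable keepalive
        Some("0") => false, // explicit opt-out of disabling
        Some(_) => true,    // any other explicit value: disable keepalive
    }
}
```

With this shape, `OPENHCL_DISABLE_NVME_KEEP_ALIVE` alone still disables keepalive as it did under the old default, while `OPENHCL_DISABLE_NVME_KEEP_ALIVE=0` opts back in.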
```diff
  // If the host provided timeout hint is >= 59 seconds,
  // we treat that as a signal from the host that no
  // timeout duration was set. We instead limit servicing to
- // 200s in that case.
- let timeout_hint = if timeout_hint >= Duration::from_secs(u16::MAX as u64) {
+ // 59s in that case.
+ let timeout_hint = if timeout_hint >= Duration::from_secs(59) {
      tracing::info!(
          CVM_ALLOWED,
-         "host provided UINT16_MAX timeout hint, defaulting to 200s"
+         host_timeout_hint_ms = timeout_hint.as_millis() as u64,
+         "host provided timeout hint > 59s, defaulting to 59s"
      );
-     Duration::from_secs(200)
+     Duration::from_secs(59)
```
The comment describes a sentinel-value interpretation (“signal … no timeout duration was set”), but the code is implementing a hard cap for any hint >= 59s. Update the comment/log wording to reflect capping (or, if sentinel semantics are intended, keep the old sentinel check and apply a separate cap). Also, the log message says > 59s but the condition is >=—align the message/condition to avoid confusion.
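The distinction the comment draws can be sketched by separating the two behaviors: treat the `u16::MAX` sentinel as "no hint was set", then cap every hint at 59s. This is a hedged sketch, not the PR's code; the function name and constant names are illustrative.

```rust
use std::time::Duration;

// Illustrative sketch: sentinel detection and capping as two distinct steps.
// The 59s cap and the u16::MAX sentinel mirror the values in the diff.
fn effective_timeout(hint: Duration) -> Duration {
    const CAP: Duration = Duration::from_secs(59);
    let sentinel = Duration::from_secs(u16::MAX as u64);
    if hint >= sentinel {
        // Host signalled "no timeout duration was set": fall back to the cap.
        CAP
    } else {
        // Honor the hint, but never wait longer than the cap.
        hint.min(CAP)
    }
}
```

Keeping the sentinel check explicit preserves the old comment's meaning while still enforcing the new 59s ceiling on ordinary hints like the host's 200s.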
```diff
  tracing::info!(
      CVM_ALLOWED,
-     "host provided UINT16_MAX timeout hint, defaulting to 200s"
+     host_timeout_hint_ms = timeout_hint.as_millis() as u64,
```
Casting as_millis() (u128) to u64 can truncate very large durations and produces a potentially misleading telemetry field. Prefer logging the u128 directly (or log seconds as u64 via as_secs()), so the value is always accurate.
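The truncation the review warns about is easy to demonstrate with `std::time::Duration` alone: `as_millis()` returns `u128`, and an `as u64` cast silently keeps only the low 64 bits for very large durations. The function names below are illustrative, not from the PR.

```rust
use std::time::Duration;

// Lossy variant: mirrors the cast in the diff; wraps above u64::MAX ms.
fn millis_lossy(d: Duration) -> u64 {
    d.as_millis() as u64
}

// Accurate variant: keep the u128 (or log seconds via as_secs() as u64).
fn millis_exact(d: Duration) -> u128 {
    d.as_millis()
}
```

For any timeout hint a host would realistically send the two agree, but logging the `u128` (or seconds) removes the edge case entirely.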
```rust
submitted_at: (self.qid == 0).then(|| {
    tracing::info!(
        pci_id = %self.device_id,
        ?cid,
        opcode = ?command.cdw0.opcode(),
        opname = ?spec::AdminOpcode(command.cdw0.opcode()),
        nsid = ?command.nsid,
        cdw10 = command.cdw10,
        cdw11 = command.cdw11,
        "inserted admin command",
    );
    Instant::now()
}),
```
The completion log records elapsed as a Duration debug value. The PR description/examples refer to elapsed_ms=..., which is both a different field name and implies a specific unit. Consider standardizing the field name/unit (e.g., log elapsed_ms as an integer/f64 milliseconds, or rename the example/docs to match elapsed as a structured Duration). This makes log consumption/queries more consistent.
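Two unit-explicit alternatives for the completion-log field can be sketched as plain conversions; the field/function names are illustrative, not from the PR.

```rust
use std::time::Duration;

// Integer milliseconds: unambiguous unit, truncates sub-millisecond remainder.
fn elapsed_ms_int(elapsed: Duration) -> u128 {
    elapsed.as_millis()
}

// Fractional milliseconds: preserves sub-millisecond precision for fast
// admin commands, at the cost of a float field in the log pipeline.
fn elapsed_ms_f64(elapsed: Duration) -> f64 {
    elapsed.as_secs_f64() * 1_000.0
}
```

Either way, naming the field `elapsed_ms` and emitting a number keeps log queries consistent with the examples in the PR description.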
```diff
-         None => break,
-     }
+ loop {
+     let cmd = stop.until_stopped(self.recv.next()).await?;
```
Using await? on until_stopped(...) means the task returns early on stop/cancel without emitting an exit log (the "exiting cleanly" log only happens when the channel closes). If exit logging is important for operational debugging, handle the Cancelled path explicitly (log "exiting due to stop") before returning the cancellation error.
Suggested change:
```diff
- let cmd = stop.until_stopped(self.recv.next()).await?;
+ let cmd = match stop.until_stopped(self.recv.next()).await {
+     Ok(cmd) => cmd,
+     Err(err) => {
+         tracing::info!(
+             pci_id = %self.device.id(),
+             "nvme worker task exiting due to stop"
+         );
+         return Err(err);
+     }
+ };
```
```rust
tracing::info!(pci_id = %self.device.id(), "nvme worker task exiting cleanly");
Ok(())
```
```diff
  impl<D: DeviceBacking> Drop for NvmeDriver<D> {
      fn drop(&mut self) {
-         tracing::trace!(pci_id = ?self.device_id, ka = self.nvme_keepalive, task = self.task.is_some(), "dropping nvme driver");
+         tracing::info!(pci_id = ?self.device_id, ka = self.nvme_keepalive, task = self.task.is_some(), "dropping nvme driver");
```
Several paths that can occur frequently (driver drop lifecycle across tests, IO issuer requests, IO queue creation) were bumped from trace/debug to info. This can materially increase log volume in normal runs and make the new admin-queue logs harder to spot. Consider keeping high-frequency lifecycle events at debug/trace and reserving info for user-actionable state transitions/errors.
Suggested change:
```diff
- tracing::info!(pci_id = ?self.device_id, ka = self.nvme_keepalive, task = self.task.is_some(), "dropping nvme driver");
+ tracing::debug!(pci_id = ?self.device_id, ka = self.nvme_keepalive, task = self.task.is_some(), "dropping nvme driver");
```
```diff
  async fn create_io_issuer(&mut self, state: &mut WorkerState, cpu: u32) {
-     tracing::debug!(cpu, pci_id = ?self.device.id(), "issuer request");
+     tracing::info!(cpu, pci_id = ?self.device.id(), "create io issuer request");
```
Several paths that can occur frequently (driver drop lifecycle across tests, IO issuer requests, IO queue creation) were bumped from trace/debug to info. This can materially increase log volume in normal runs and make the new admin-queue logs harder to spot. Consider keeping high-frequency lifecycle events at debug/trace and reserving info for user-actionable state transitions/errors.
Suggested change:
```diff
- tracing::info!(cpu, pci_id = ?self.device.id(), "create io issuer request");
+ tracing::debug!(cpu, pci_id = ?self.device.id(), "create io issuer request");
```
```rust
self.next_ioq_id += 1;
```
```diff
- tracing::debug!(cpu, qid, iv, pci_id = ?self.device.id(), "creating io queue");
+ tracing::info!(cpu, qid, iv, pci_id = ?self.device.id(), "create_nvme_io_queue_fn: creating io queue");
```
Several paths that can occur frequently (driver drop lifecycle across tests, IO issuer requests, IO queue creation) were bumped from trace/debug to info. This can materially increase log volume in normal runs and make the new admin-queue logs harder to spot. Consider keeping high-frequency lifecycle events at debug/trace and reserving info for user-actionable state transitions/errors.
Suggested change:
```diff
- tracing::info!(cpu, qid, iv, pci_id = ?self.device.id(), "create_nvme_io_queue_fn: creating io queue");
+ tracing::debug!(cpu, qid, iv, pci_id = ?self.device.id(), "create_nvme_io_queue_fn: creating io queue");
```
…over, it will tear down that device state and start again ... at least it should
… figure out keepalive info
Adds logging for commands issued on the admin queue, recording the CID, opcode, and related fields. The number of admin commands processed is (or should be) quite small, so logging these details should not meaningfully impact perf. In a vmm_test run with 2 CPUs (so 2 IO queues created), we observe 12 admin commands being issued.
Output looks like:
NOTE: This PR also decreases the default save timeout duration to 59s. Currently the host provides us a hint of 200s but then terminates VTL2 at 60s. 59s is an acceptable wait for save, given that the expected servicing duration is 1s. This will allow us to get crash dumps when servicing is hung.
Clone of #3246 with the commit history squashed