Skip to content

nvme_driver: adding admin queue logging#3409

Draft
gurasinghMS wants to merge 14 commits intomicrosoft:mainfrom
gurasinghMS:adding-admin-queue-logging-2
Draft

nvme_driver: adding admin queue logging#3409
gurasinghMS wants to merge 14 commits intomicrosoft:mainfrom
gurasinghMS:adding-admin-queue-logging-2

Conversation

@gurasinghMS
Copy link
Copy Markdown
Contributor

Adds logging for commands being issued on the admin queue. Logs the CID, Opcode and such. The number of admin commands being processed is (should be) quite small so logging these details should not impact perf all that much. In a vmm_test run with 2 CPUs (so 2 IO Queues created) we observe 12 admin commands being issued.

Output looks like:

[24.176059] nvme_driver::queue_pair:  INFO  completed admin command cid=2001 opcode=5 opname=CREATE_IO_COMPLETION_QUEUE nsid=0 cdw10=0xff0003 cdw11=0x20003 elapsed_ms=10.0112721s
[24.181200] nvme_driver::queue_pair:  INFO  inserted admin command cid=2401 opcode=1 opname=CREATE_IO_SUBMISSION_QUEUE nsid=0 cdw10=0xff0003 cdw11=0x30001
[24.186052] nvme_driver::queue_pair:  INFO  completed admin command cid=2401 opcode=1 opname=CREATE_IO_SUBMISSION_QUEUE nsid=0 cdw10=0xff0003 cdw11=0x30001 elapsed_ms=141.9µs

NOTE: This PR also decreases the default save timeout duration to 59s. currently the host provides us a hint of 200s but then terminates vtl2 at 60s. 59s is an acceptable wait for save given that expectation for servicing duration is 1s. This will allow us to get crash dumps when servicing is hung.

Clone of #3246 with the commit history squashed

Copilot AI review requested due to automatic review settings April 30, 2026 19:15
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds detailed admin-queue command logging in the NVMe driver, and adjusts servicing/save related timeouts and defaults to better align with host behavior (including enabling NVMe keepalive by default).

Changes:

  • Log admin-queue command submission/completion details (CID/opcode/nsid/cdw10/cdw11 + elapsed).
  • Cap/override host-provided servicing timeout hints to 59s and update related logging.
  • Change default NVMe keepalive behavior (enabled by default) and update VMM tests’ OpenHCL cmdlines accordingly.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 14 comments.

Show a summary per file
File Description
vmm_tests/vmm_tests/tests/tests/x86_64/openhcl_uefi.rs Removes explicit keepalive override from test cmdlines to match new defaults.
vmm_tests/vmm_tests/tests/tests/multiarch/openhcl_servicing.rs Removes explicit keepalive override from servicing tests’ cmdlines.
vm/devices/storage/disk_nvme/nvme_driver/src/queue_pair.rs Adds admin command submission/completion logging and threads device id into pending command tracking.
vm/devices/storage/disk_nvme/nvme_driver/src/driver.rs Adds extra queue PFN logging; refactors worker loop; increases log verbosity in several paths.
openhcl/underhill_core/src/nvme_manager/device.rs Adds shutdown-request logging in NVMe manager worker.
openhcl/underhill_core/src/dispatch/mod.rs Changes timeout-hint handling to default/cap to 59s and updates log message.
openhcl/openhcl_boot/src/cmdline.rs Enables NVMe keepalive by default and adjusts cmdline parsing semantics.

enable_vtl2_gpa_pool: Vtl2GpaPoolConfig::Heuristics(Vtl2GpaPoolLookupTable::Release), // use the release config by default
sidecar: SidecarOptions::default(),
disable_nvme_keep_alive: true,
disable_nvme_keep_alive: false,
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new default disable_nvme_keep_alive: false, the parser no longer disables keepalive when OPENHCL_DISABLE_NVME_KEEP_ALIVE is present without an explicit =... value (because arg becomes None and the branch is skipped). For backward-compatible boolean-flag behavior, treat a missing value as true (disable keepalive), and only treat an explicit =0 as false.

Copilot uses AI. Check for mistakes.
Comment on lines +186 to 188
if arg.is_some_and(|a| a != "0") {
self.disable_nvme_keep_alive = true;
}
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the new default disable_nvme_keep_alive: false, the parser no longer disables keepalive when OPENHCL_DISABLE_NVME_KEEP_ALIVE is present without an explicit =... value (because arg becomes None and the branch is skipped). For backward-compatible boolean-flag behavior, treat a missing value as true (disable keepalive), and only treat an explicit =0 as false.

Copilot uses AI. Check for mistakes.
Comment on lines +447 to +457
// If the host provided timeout hint is >= 59 seconds,
// we treat that as a signal from the host that no
// timeout duration was set. We instead limit servicing to
// 200s in that case.
let timeout_hint = if timeout_hint >= Duration::from_secs(u16::MAX as u64) {
// 59s in that case.
let timeout_hint = if timeout_hint >= Duration::from_secs(59) {
tracing::info!(
CVM_ALLOWED,
"host provided UINT16_MAX timeout hint, defaulting to 200s"
host_timeout_hint_ms = timeout_hint.as_millis() as u64,
"host provided timeout hint > 59s, defaulting to 59s"
);
Duration::from_secs(200)
Duration::from_secs(59)
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment describes a sentinel-value interpretation (“signal … no timeout duration was set”), but the code is implementing a hard cap for any hint >= 59s. Update the comment/log wording to reflect capping (or, if sentinel semantics are intended, keep the old sentinel check and apply a separate cap). Also, the log message says > 59s but the condition is >=—align the message/condition to avoid confusion.

Copilot uses AI. Check for mistakes.
tracing::info!(
CVM_ALLOWED,
"host provided UINT16_MAX timeout hint, defaulting to 200s"
host_timeout_hint_ms = timeout_hint.as_millis() as u64,
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Casting as_millis() (u128) to u64 can truncate very large durations and produces a potentially misleading telemetry field. Prefer logging the u128 directly (or log seconds as u64 via as_secs()), so the value is always accurate.

Copilot uses AI. Check for mistakes.
Comment on lines +128 to +140
submitted_at: (self.qid == 0).then(|| {
tracing::info!(
pci_id = %self.device_id,
?cid,
opcode = ?command.cdw0.opcode(),
opname = ?spec::AdminOpcode(command.cdw0.opcode()),
nsid = ?command.nsid,
cdw10 = command.cdw10,
cdw11 = command.cdw11,
"inserted admin command",
);
Instant::now()
}),
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The completion log records elapsed as a Duration debug value. The PR description/examples refer to elapsed_ms=..., which is both a different field name and implies a specific unit. Consider standardizing the field name/unit (e.g., log elapsed_ms as an integer/f64 milliseconds, or rename the example/docs to match elapsed as a structured Duration). This makes log consumption/queries more consistent.

Copilot uses AI. Check for mistakes.
None => break,
}
loop {
let cmd = stop.until_stopped(self.recv.next()).await?;
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using await? on until_stopped(...) means the task returns early on stop/cancel without emitting an exit log (the "exiting cleanly" log only happens when the channel closes). If exit logging is important for operational debugging, handle the Cancelled path explicitly (log "exiting due to stop") before returning the cancellation error.

Suggested change
let cmd = stop.until_stopped(self.recv.next()).await?;
let cmd = match stop.until_stopped(self.recv.next()).await {
Ok(cmd) => cmd,
Err(err) => {
tracing::info!(
pci_id = %self.device.id(),
"nvme worker task exiting due to stop"
);
return Err(err);
}
};

Copilot uses AI. Check for mistakes.
Comment on lines +1334 to +1335
tracing::info!(pci_id = %self.device.id(), "nvme worker task exiting cleanly");
Ok(())
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using await? on until_stopped(...) means the task returns early on stop/cancel without emitting an exit log (the "exiting cleanly" log only happens when the channel closes). If exit logging is important for operational debugging, handle the Cancelled path explicitly (log "exiting due to stop") before returning the cancellation error.

Copilot uses AI. Check for mistakes.
impl<D: DeviceBacking> Drop for NvmeDriver<D> {
fn drop(&mut self) {
tracing::trace!(pci_id = ?self.device_id, ka = self.nvme_keepalive, task = self.task.is_some(), "dropping nvme driver");
tracing::info!(pci_id = ?self.device_id, ka = self.nvme_keepalive, task = self.task.is_some(), "dropping nvme driver");
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several paths that can occur frequently (driver drop lifecycle across tests, IO issuer requests, IO queue creation) were bumped from trace/debug to info. This can materially increase log volume in normal runs and make the new admin-queue logs harder to spot. Consider keeping high-frequency lifecycle events at debug/trace and reserving info for user-actionable state transitions/errors.

Suggested change
tracing::info!(pci_id = ?self.device_id, ka = self.nvme_keepalive, task = self.task.is_some(), "dropping nvme driver");
tracing::debug!(pci_id = ?self.device_id, ka = self.nvme_keepalive, task = self.task.is_some(), "dropping nvme driver");

Copilot uses AI. Check for mistakes.

async fn create_io_issuer(&mut self, state: &mut WorkerState, cpu: u32) {
tracing::debug!(cpu, pci_id = ?self.device.id(), "issuer request");
tracing::info!(cpu, pci_id = ?self.device.id(), "create io issuer request");
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several paths that can occur frequently (driver drop lifecycle across tests, IO issuer requests, IO queue creation) were bumped from trace/debug to info. This can materially increase log volume in normal runs and make the new admin-queue logs harder to spot. Consider keeping high-frequency lifecycle events at debug/trace and reserving info for user-actionable state transitions/errors.

Suggested change
tracing::info!(cpu, pci_id = ?self.device.id(), "create io issuer request");
tracing::debug!(cpu, pci_id = ?self.device.id(), "create io issuer request");

Copilot uses AI. Check for mistakes.
self.next_ioq_id += 1;

tracing::debug!(cpu, qid, iv, pci_id = ?self.device.id(), "creating io queue");
tracing::info!(cpu, qid, iv, pci_id = ?self.device.id(), "create_nvme_io_queue_fn: creating io queue");
Copy link

Copilot AI Apr 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Several paths that can occur frequently (driver drop lifecycle across tests, IO issuer requests, IO queue creation) were bumped from trace/debug to info. This can materially increase log volume in normal runs and make the new admin-queue logs harder to spot. Consider keeping high-frequency lifecycle events at debug/trace and reserving info for user-actionable state transitions/errors.

Suggested change
tracing::info!(cpu, qid, iv, pci_id = ?self.device.id(), "create_nvme_io_queue_fn: creating io queue");
tracing::debug!(cpu, qid, iv, pci_id = ?self.device.id(), "create_nvme_io_queue_fn: creating io queue");

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants