Fix NVMe hot-swap LED stuck in FAILURE under VMD controllers #275

Open
tasleson wants to merge 3 commits into md-raid-utilities:main from tasleson:stuck_failure
@tasleson
Collaborator

@tasleson tasleson commented Apr 9, 2026

Root cause

NVMe udev events arrive with a virtual nvme-subsystem sysfs path (e.g.
/sys/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n1), but ledmon stores block devices using their physical
PCI sysfs path. The _compare() function in udev event handling calls block_device_init() on the virtual path,
which fails because block_get_controller() cannot match a virtual path to any PCI controller. This silently drops
all add and remove udev events for NVMe devices.
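
The mismatch described above, and the general idea of a name-based fallback, can be sketched as follows. This is a minimal illustration, not the PR's actual code: `last_component` and `paths_match` are hypothetical helpers, and the physical path in the usage note is made up for the example.

```c
#include <stdbool.h>
#include <string.h>

/* Return the last component of a sysfs path (e.g. "nvme1n1").
 * Hypothetical helper for illustration; not a ledmon function. */
static const char *last_component(const char *path)
{
	const char *slash = strrchr(path, '/');

	return slash ? slash + 1 : path;
}

/* Match a udev event path against a stored block device path.
 * Try an exact path match first; if that fails, fall back to comparing
 * the device names, so a virtual nvme-subsystem path can still be paired
 * with a physical PCI path that ends in the same name. */
static bool paths_match(const char *event_path, const char *stored_path)
{
	if (strcmp(event_path, stored_path) == 0)
		return true;

	return strcmp(last_component(event_path),
		      last_component(stored_path)) == 0;
}
```

With this sketch, `/sys/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n1` would match a stored physical path such as `/sys/devices/pci0000:00/0000:00:01.0/nvme/nvme1/nvme1n1` (an invented path) because both end in `nvme1n1`, even though the full paths differ.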

Without udev event processing, ledmon detects removal only through a timestamp mismatch in _send_msg(), which
sets FAILED_DRIVE. On re-insertion the device reappears in the sysfs scan, but FAILED_DRIVE is intentionally
sticky in _add_block() (to protect RAID members), so the state is never cleared. The udev add event that would
normally break out of this via the ADDED -> ONESHOT_NORMAL -> UNKNOWN state machine is never matched.
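
The sticky behavior, together with the non-RAID exception this PR proposes, can be sketched roughly as below. The enum, the function `merge_pattern`, and the `allow_recovery` switch are all hypothetical simplifications; ledmon's real `_add_block()` handles more patterns and conditions than this.

```c
#include <stdbool.h>

/* Simplified stand-ins for ledmon's IBPI patterns; names are illustrative. */
enum pattern {
	PATTERN_UNKNOWN,
	PATTERN_NORMAL,
	PATTERN_ADDED,
	PATTERN_FAILED_DRIVE,
};

/* Decide which pattern to keep for a device seen again in a sysfs scan.
 * current:        pattern stored from earlier scans
 * scanned:        pattern computed by the latest scan
 * raid_member:    true if the device belongs to a RAID volume
 * allow_recovery: hypothetical switch enabling the non-RAID recovery path */
static enum pattern merge_pattern(enum pattern current, enum pattern scanned,
				  bool raid_member, bool allow_recovery)
{
	/* Anything other than FAILED_DRIVE simply follows the new scan. */
	if (current != PATTERN_FAILED_DRIVE)
		return scanned;

	/* FAILED_DRIVE is sticky: RAID members always keep it. */
	if (raid_member || !allow_recovery)
		return PATTERN_FAILED_DRIVE;

	/* A standalone device that reappeared restarts at ADDED. */
	return PATTERN_ADDED;
}
```

Passing the recovery decision in as a flag keeps the legacy sticky behavior available without touching the RAID path.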

Changes

  1. Fix udev event matching — Add a devnode name fallback to _compare() so that virtual nvme-subsystem paths
    are matched to their corresponding block device.

  2. Allow non-RAID recovery from FAILED_DRIVE — When a non-RAID device in FAILED_DRIVE state reappears in the
    sysfs scan, transition it to ADDED so the state machine can drive it back to normal. RAID members remain sticky
    and require explicit intervention regardless of whether the removal was intentional or caused by
    hardware. Note: This change may need to be placed behind a configuration setting.

  3. Validate device node in ledctl slot reporting — Add a stat() check on the device node before associating a
    block device with a PCI slot, so ledctl --list-slots does not report a device that is no longer present.
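
As an illustration of the device node check in item 3, a stat()-based presence test might look like the sketch below. The helper name `devnode_present` and the `/dev/<name>` path construction are assumptions made for the example; the PR's real check may be structured differently.

```c
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* Return true if /dev/<name> currently exists and is a block device.
 * Hypothetical helper for illustration; not a ledmon function. */
static bool devnode_present(const char *name)
{
	char path[256];
	struct stat st;
	int len = snprintf(path, sizeof(path), "/dev/%s", name);

	if (len < 0 || (size_t)len >= sizeof(path))
		return false;	/* name too long to form a valid path */

	if (stat(path, &st) != 0)
		return false;	/* node is gone or not accessible */

	return S_ISBLK(st.st_mode);
}
```

A check like this only narrows the window in which a removed device can still be reported; the node can disappear right after stat() returns.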

This needs careful review and, ideally, testing by the user who reported the issue.

Resolves: #274

tasleson added 3 commits April 9, 2026 14:09
NVMe udev events may arrive with a virtual nvme-subsystem sysfs path
(e.g. /sys/devices/virtual/nvme-subsystem/nvme-subsys1/nvme1n1) that
cannot be resolved to a PCI controller. This causes block_device_init()
to fail in _compare(), silently dropping add and remove events.

Add a devnode name fallback to _compare() so that virtual nvme-subsystem
paths are matched to their corresponding block device. This restores
udev event processing for NVMe hot-swap under VMD controllers.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com>

FAILED_DRIVE is intentionally sticky across sysfs scan cycles to
prevent RAID member fault LEDs from flickering. However, this also
prevents standalone NVMe drives from recovering after a hot-swap
cycle, leaving the failure LED on permanently.

When a non-RAID device in FAILED_DRIVE state reappears in the sysfs
scan, transition it to ADDED so the normal state machine can drive
it back to healthy operation. RAID members remain sticky and require
intervention from mdadm or another RAID tool to clear the fault.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com>

When ledctl walks sysfs for --list-slots, VMD PCI slot entries may
persist after a drive is physically removed because the VMD controller
maintains the PCI topology. This causes ledctl to report a device in
the slot that is no longer present.

Add a stat() check on the device node to verify the block device
actually exists before associating it with the slot.

Signed-off-by: Tony Asleson <tasleson@redhat.com>
Co-developed-by: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Claude Opus 4.6 <noreply@anthropic.com>
Comment thread src/lib/pci_slot.c

result->bl_device = get_block_device_from_sysfs_path(pci_slot->ctx,
pci_slot->address, true);
if (result->bl_device) {
Member

I don't think we should care. There is always a race window: we can still hit the moment between stat() and snprintf(), although the window is reduced.

I think that we should always trust the state saved by _sysfs_scan.
It might not be up to date, but in the worst case we will print a device that is gone. Not a big deal.

Member

There will be more places like that, so I would stick with no handling.

(Unless I'm missing something?)

Comment thread src/ledmon/ledmon.c
temp->ibpi = block->ibpi;
}
} else if (temp->ibpi == LED_IBPI_PATTERN_FAILED_DRIVE &&
!temp->raid_dev) {
Member

@mtkaczyk mtkaczyk Apr 10, 2026

It requires a config option because we are changing legacy behavior. I would add something like:
"BLINK_PERSISTENT_FAIL_ON_READD = TRUE", and that should be the default.

We cannot change it as is because it is too big a risk. I cannot predict how many deployments rely on this behavior. It would be especially harmful for users who rely on the failure pattern as an indication that a disk is misbehaving.

@bkucman
Collaborator

bkucman commented Apr 10, 2026

Hi @tasleson @mtkaczyk

@bkucman could you validate this on your side?

I’m starting vacation right now, I’ll be back on April 27, and when I come back I’ll analyze this change and try to validate it on our hardware to check whether it breaks any expected behavior.

@czsczsczs2

Hi @tasleson @mtkaczyk

This works on my side.
For detailed logs, please refer to my reply in #274

Development

Successfully merging this pull request may close these issues.

[BUG]: NVMe LED state stuck in FAILURE during hot-plug due to Multipath/Virtual syspath mismatch

4 participants