Description
Hey,
We use the NVIDIA GPU Operator on OpenShift to expose passthrough GPUs to KubeVirt.
Issue
We experienced an issue where one of the GPUs on a Node became unavailable, but the Node did not update its reported GPU Capacity or Allocatable resources. The GPU itself was not usable, and when I tried to create a new VM it went into a CrashLoopBackOff state until the GPU became available again.
Only after I restarted the nvidia-sandbox-device-plugin-daemonset pod on the specific Node did the Allocatable and Capacity GPU counts change to the right number.
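For reference, the restart I do is roughly the following (a sketch; the `nvidia-gpu-operator` namespace and the `app=nvidia-sandbox-device-plugin-daemonset` label are assumptions from a default install and may differ in your cluster):

```bash
# Delete the sandbox device plugin pod on the affected node so the DaemonSet
# recreates it and it re-enumerates the GPUs.
oc delete pod -n nvidia-gpu-operator \
  -l app=nvidia-sandbox-device-plugin-daemonset \
  --field-selector spec.nodeName=<node-name>
```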
I checked the pods on this Node:

- nvidia-sandbox-device-plugin
- nvidia-sandbox-validator
- nvidia-vfio-manager

There were no errors in their logs, and I couldn't see any new log entries from these pods.
It looks like the pods run an initial health check and then never run it again. Is there a way to make the Operator pods validate the health of the GPUs at a regular interval, so that the resources available on the Node are reflected correctly?
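As a stopgap on our side, the closest thing I can think of is a loop on the node that compares what the PCI bus sees with what the node advertises, and restarts the plugin pod when they diverge. This is only a rough sketch; the resource name `nvidia.com/<GPU_NAME>`, the namespace, and the label selector are assumptions that would need to be adapted:

```bash
#!/usr/bin/env bash
# Workaround sketch, meant to run on the node itself (where lspci was used below):
# periodically compare the number of NVIDIA GPUs visible on the PCI bus with the
# allocatable count the node advertises, and restart the sandbox device plugin
# pod when they diverge. Namespace, label and resource name are assumptions.
NODE="<node-name>"
RESOURCE="nvidia.com/<GPU_NAME>"   # passthrough resource name for this GPU model (assumed)

while true; do
  # NVIDIA VGA / 3D controller functions currently visible on the PCI bus (vendor ID 10de).
  host_gpus=$(lspci -d 10de: | grep -c -E 'VGA|3D controller')

  # What the node currently advertises as allocatable for that resource.
  node_gpus=$(oc get node "$NODE" -o jsonpath="{.status.allocatable['$RESOURCE']}")

  if [ "$host_gpus" != "${node_gpus:-0}" ]; then
    oc delete pod -n nvidia-gpu-operator \
      -l app=nvidia-sandbox-device-plugin-daemonset \
      --field-selector spec.nodeName="$NODE"
  fi
  sleep 300
done
```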
How to reproduce
I reproduced the issue by logically removing one of the GPU PCI devices from the node using the command:
echo "1" > /sys/bus/pci/devices/<gpu_pci_id>/removeand validated the GPU is no longer visible from the host using lspci.
Then, using `oc describe node <node>`, I saw that the number of GPUs exposed didn't change. After restarting the sandbox pod, the number of GPUs was updated to the right number.
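A quick way to see just those numbers before and after (the exact resource name varies per GPU model; with passthrough the sandbox device plugin exposes something like `nvidia.com/<GPU_NAME>`):

```bash
# Show every nvidia.com/* resource the node reports under Capacity,
# Allocatable and Allocated resources.
oc describe node <node-name> | grep -i 'nvidia.com/'
```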
To re-add the GPU you can run the command:
echo "1" > /sys/bus/pci/rescanand restart the sandbox pod again
Versions
- NVIDIA GPU Operator: 23.6.0
- NVIDIA KubeVirt GPU Device Plugin: v1.2.2
- OpenShift: 4.12.35
- nvidia-sandbox-device-plugin image: nvcr.io/nvidia/kubevirt-gpu-device-plugin@sha256:9484110986c80ab83bc404066ca4b7be115124ec04ca16bce775403e92bfd890