Cannot run kubevirt virtual machine using nvidia GPU plugin #89

@keepthemomentum

Hello,

I have a Kubernetes cluster running virtual machines with KubeVirt.
One worker node in the cluster has a GPU, and I want to use that GPU in a VM in passthrough mode.
I enabled the feature gate and deployed the kubevirt-gpu-device-plugin on the cluster.

lspci -nn | grep -i nvidia
04:00.0 3D controller [0302]: NVIDIA Corporation GK110BGL [Tesla K40m] [10de:1023] (rev a1)

lspci -nnk -d 10de:
04:00.0 3D controller [0302]: NVIDIA Corporation GK110BGL [Tesla K40m] [10de:1023] (rev a1)
Subsystem: NVIDIA Corporation 12GB Computational Accelerator [10de:097e]
Kernel driver in use: vfio-pci
Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia

This is what the logs from the gpu-device-plugin pod look like:

kubectl logs pod/nvidia-kubevirt-gpu-dp-daemonset-4xmm8 -n kube-system

2024/01/05 12:59:06 Not a device, continuing
2024/01/05 12:59:06 Nvidia device 0000:03:00.0
2024/01/05 12:59:06 Iommu Group 22
2024/01/05 12:59:06 Device Id 1023
2024/01/05 12:59:06 Error accessing file path "/sys/bus/mdev/devices": lstat /sys/bus/mdev/devices: no such file or directory
2024/01/05 12:59:06 Iommu Map map[22:[{0000:03:00.0}]]
2024/01/05 12:59:06 Device Map map[1023:[22]]
2024/01/05 12:59:06 vGPU Map map[]
2024/01/05 12:59:06 GPU vGPU Map map[]
2024/01/05 12:59:06 DP Name GK110BGL_TESLA_K40M
2024/01/05 12:59:06 Devicename GK110BGL_TESLA_K40M
2024/01/05 12:59:06 GK110BGL_TESLA_K40M Device plugin server ready
2024/01/05 12:59:06 healthCheck(GK110BGL_TESLA_K40M): invoked

ls -l /var/lib/kubelet/device-plugins/
total 40
-rw------- 1 root root 39215 Jan 5 13:59 kubelet_internal_checkpoint
srwxr-xr-x 1 root root 0 Jan 4 11:40 kubelet.sock
srwxr-xr-x 1 root root 0 Jan 5 13:59 kubevirt-GK110BGL_TESLA_K40M.sock

The pod still cannot be scheduled; the events say 'No preemption victims found for incoming pod'.
What am I missing? Could someone help?

Events:
Type     Reason            Age                From               Message
----     ------            ----               ----               -------
Warning  FailedScheduling  47m                default-scheduler  0/5 nodes are available: 1 Insufficient nvidia.com/GK110BGL_Tesla_K40m, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 1 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..
Warning  FailedScheduling  16m (x6 over 41m)  default-scheduler  0/5 nodes are available: 1 Insufficient nvidia.com/GK110BGL_Tesla_K40m, 4 node(s) didn't match Pod's node affinity/selector. preemption: 0/5 nodes are available: 1 No preemption victims found for incoming pod, 4 Preemption is not helpful for scheduling..
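
One detail worth checking in the output above: the plugin log reports the device plugin name as GK110BGL_TESLA_K40M (all upper case), while the scheduler event refers to nvidia.com/GK110BGL_Tesla_K40m (mixed case). Kubernetes treats extended resource names as exact, case-sensitive strings, so a difference in case alone could lead to an "Insufficient" error. The following is a minimal shell sketch of that comparison, not a confirmed diagnosis. It assumes the plugin advertises the resource as nvidia.com/ followed by the DP Name from its log, and compare_resource_names is a hypothetical helper written only for this illustration.

```shell
#!/bin/sh
# Name the pod requested, copied from the FailedScheduling event.
requested="nvidia.com/GK110BGL_Tesla_K40m"
# ASSUMPTION: advertised name = "nvidia.com/" + "DP Name" from the plugin log.
advertised="nvidia.com/GK110BGL_TESLA_K40M"

# Hypothetical helper: classify how two resource names relate.
# Prints "match", "case-mismatch", or "different".
compare_resource_names() {
  # Lower-case both names so a case-only difference can be detected.
  a_lower=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  b_lower=$(printf '%s' "$2" | tr '[:upper:]' '[:lower:]')
  if [ "$1" = "$2" ]; then
    echo "match"
  elif [ "$a_lower" = "$b_lower" ]; then
    echo "case-mismatch"
  else
    echo "different"
  fi
}

compare_resource_names "$requested" "$advertised"   # prints "case-mismatch"
```

To see what the node actually advertises, kubectl get node <node-name> -o jsonpath='{.status.allocatable}' lists its allocatable resources, where <node-name> is a placeholder for the GPU worker node. If the names differ only in case, the resource name in the VM spec and in the KubeVirt permitted host devices configuration would need to match the advertised one exactly.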
