Device plugin can't detect the vgpus #78

@esposem

Description

I am running OpenShift 4.13 with OpenShift Virtualization (CNV) installed.
I installed the NVIDIA drivers using https://github.com/vladikr/ocp-nvidia-vgpu-installer, and they work as expected.

I added the following to the HyperConverged YAML:

spec:
  mediatedDevicesConfiguration:
    mediatedDevicesTypes: 
    - nvidia-258
  permittedHostDevices:
    mediatedDevices:
    - mdevNameSelector: "GRID RTX6000-3Q"
      resourceName: "nvidia.com/GRID_RTX6000-3Q"
      externalResourceProvider: true

I checked that the nvidia-258 type exists and has free instances:

$ cd /sys/bus/pci/devices/0000:05:00.0/mdev_supported_types
$ cat nvidia-258/available_instances 
8
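
A broader version of this check is to list every mdev type the GPU offers together with its human-readable name, so that the name behind nvidia-258 can be compared with the mdevNameSelector in the HyperConverged spec. The following is only a sketch: it assumes the standard mdev sysfs layout (a `name` and an `available_instances` file per type), and `list_mdev_types` is a hypothetical helper name.

```shell
# Hypothetical helper: print "type-id<TAB>name<TAB>available_instances" for
# every mdev type found under the given mdev_supported_types directory.
# Assumes each type directory contains "name" and "available_instances".
list_mdev_types() {
    types_dir=$1
    for type_dir in "$types_dir"/*; do
        # Skip anything that is not a type directory (also covers an unmatched glob).
        [ -d "$type_dir" ] || continue
        type_id=$(basename "$type_dir")
        name=$(cat "$type_dir/name" 2>/dev/null)
        avail=$(cat "$type_dir/available_instances" 2>/dev/null)
        printf '%s\t%s\t%s\n' "$type_id" "$name" "$avail"
    done
}

# Example invocation for the GPU used in this issue:
# list_mdev_types /sys/bus/pci/devices/0000:05:00.0/mdev_supported_types
```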

Then I created two mdev devices:

$ UUID=$(uuidgen)
$ echo "${UUID}" > nvidia-258/create
$ mdevctl define --auto --uuid "${UUID}"
$ mdevctl list
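
To cross-check what the host exposes after these steps, the created devices can also be read directly from sysfs. This is a minimal sketch, assuming the usual layout in which each device appears as /sys/bus/mdev/devices/<UUID> with an `mdev_type/name` file; `list_mdev_devices` is a hypothetical helper name.

```shell
# Hypothetical helper: print "UUID<TAB>type name" for every mediated device.
# Assumes the standard mdev sysfs layout; the directory can be overridden
# with the first argument.
list_mdev_devices() {
    devices_dir=${1:-/sys/bus/mdev/devices}
    for dev in "$devices_dir"/*; do
        # Skip entries without a readable type name (also covers an unmatched glob).
        [ -e "$dev/mdev_type/name" ] || continue
        printf '%s\t%s\n' "$(basename "$dev")" "$(cat "$dev/mdev_type/name")"
    done
}

# Example invocation:
# list_mdev_devices
```

Each UUID reported by `mdevctl list` would be expected to show up here with the same type name.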

Then I installed the kubevirt-gpu-device-plugin, but when I inspect the node's logs I see:

2023/08/29 09:36:53 Not a device, continuing
2023/08/29 09:36:53 Nvidia device 0000:05:00.0
2023/08/29 09:36:53 Not a device, continuing
2023/08/29 09:36:53 Gpu id is 0000:05:00.0
2023/08/29 09:36:53 Vgpu id is GRID_RTX6000-3Q
2023/08/29 09:36:53 Gpu id is 0000:05:00.0
2023/08/29 09:36:53 Vgpu id is GRID_RTX6000-3Q
2023/08/29 09:36:53 Iommu Map map[]
2023/08/29 09:36:53 Device Map map[]
2023/08/29 09:36:53 vGPU Map map[GRID_RTX6000-3Q:[{21ad712a-f454-498c-84d5-4116f3723c01} {43922f20-6573-4d6b-9223-a2ca02f83b29}]]
2023/08/29 09:36:53 GPU vGPU Map map[0000:05:00.0:[21ad712a-f454-498c-84d5-4116f3723c01 43922f20-6573-4d6b-9223-a2ca02f83b29]]
2023/08/29 09:36:53 Could not find NVIDIA device with id: GRID_RTX6000-3Q
2023/08/29 09:36:53 DP Name GRID_RTX6000-3Q
2023/08/29 09:36:53 Devicename GRID_RTX6000-3Q
2023/08/29 09:36:58 [GRID_RTX6000-3Q] Error registering with device plugin manager: context deadline exceeded
2023/08/29 09:36:58 Error starting GRID_RTX6000-3Q device plugin: context deadline exceeded

I also can't run any VM/VMI: once I create one, it is never scheduled, because no vGPU is found to be available when I add the following to its YAML:

spec:
      gpus:
      - deviceName: nvidia.com/GRID_RTX6000-3Q
        name: vgpu1
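
The deviceName above has to match the resource name that the plugin advertises. Judging only from the log (the mdev name "GRID RTX6000-3Q" is reported as "GRID_RTX6000-3Q"), the plugin appears to replace spaces with underscores. The sketch below reproduces that mapping for comparison; the rule is an assumption inferred from the log, not taken from the plugin source, and `mdev_resource_name` is a hypothetical helper.

```shell
# Hypothetical helper: derive the expected resource name from an mdev type name.
# Assumption (inferred from the log output only): spaces become underscores
# and the result is prefixed with "nvidia.com/".
mdev_resource_name() {
    printf 'nvidia.com/%s\n' "$(printf '%s' "$1" | tr ' ' '_')"
}

mdev_resource_name "GRID RTX6000-3Q"   # prints nvidia.com/GRID_RTX6000-3Q
```

If the node advertises the resource, that name should appear in the output of `oc get node <node-name> -o jsonpath='{.status.allocatable}'`, where `<node-name>` is a placeholder.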

What did I do wrong?
