
feat: Add Fabric Manager partition support for NVLink-enabled multi-GPU VMs #158

Open
dnugmanov wants to merge 1 commit into NVIDIA:master from dnugmanov:feat/fabric-manager-partition-support

Conversation

@dnugmanov

Related Issue

Closes #133 - How to adapt the Shared NVSwitch Virtualization Model of FM to activate nvlink in multi-gpu VMs


Description

This PR implements support for NVIDIA Fabric Manager (FM) partition-aware GPU allocation, enabling NVLink connectivity for multi-GPU VMs in KubeVirt on DGX/HGX H100 systems using the Shared NVSwitch Virtualization Model.

Background

In virtualized environments with DGX/HGX H100 systems, NVIDIA provides the Shared NVSwitch Virtualization Model to enable NVLink connections for multi-GPU VMs. In this model, NVLink fabric connectivity is only established when all GPUs assigned to a VM belong to the same FM partition, i.e. one of the predefined GPU groupings (for example, 1-, 2-, 4-, or 8-GPU sets on an 8-GPU baseboard) that Fabric Manager exposes and activates on request.

What This PR Does

  1. FM SDK Integration: Adds CGO bindings to the Fabric Manager SDK (libnvfm) for partition discovery and activation
  2. Partition-Aware Allocation: Implements GetPreferredAllocation to recommend GPUs from the same partition (see the sketch after this list)
  3. Automatic Partition Activation: Activates FM partitions during Allocate() and deactivates on pod deletion
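
To make the partition-aware selection concrete, here is a minimal, self-contained sketch of the allocation idea. This is not the PR's actual code: the `Partition` type, the PCI addresses, and the smallest-fit heuristic are illustrative placeholders; in the real implementation the partition list comes from the FM SDK bindings, and the chosen partition is activated during `Allocate()`.

```go
package main

import (
	"fmt"
	"sort"
)

// Partition models one Fabric Manager partition: a group of GPUs that gets
// NVLink connectivity once the partition is activated. Field names are
// placeholders, not the FM SDK's actual types.
type Partition struct {
	ID   uint32
	GPUs []string // GPU identifiers (e.g. PCI addresses) in this partition
}

// preferSamePartition picks `count` GPUs from `available`, taking them all
// from a single partition. It prefers the smallest partition that fits so
// larger partitions stay free for bigger VMs. Returns nil if no single
// partition can satisfy the request.
func preferSamePartition(partitions []Partition, available map[string]bool, count int) []string {
	sorted := make([]Partition, len(partitions))
	copy(sorted, partitions)
	sort.Slice(sorted, func(i, j int) bool { return len(sorted[i].GPUs) < len(sorted[j].GPUs) })

	for _, p := range sorted {
		var usable []string
		for _, gpu := range p.GPUs {
			if available[gpu] {
				usable = append(usable, gpu)
			}
		}
		if len(usable) >= count {
			return usable[:count]
		}
	}
	return nil
}

func main() {
	// Hypothetical 2-GPU and 4-GPU partitions with made-up PCI addresses.
	partitions := []Partition{
		{ID: 1, GPUs: []string{"0000:17:00.0", "0000:2a:00.0"}},
		{ID: 2, GPUs: []string{"0000:3d:00.0", "0000:5e:00.0", "0000:9a:00.0", "0000:ab:00.0"}},
	}
	available := map[string]bool{
		"0000:17:00.0": true, "0000:2a:00.0": true,
		"0000:3d:00.0": true, "0000:5e:00.0": true,
		"0000:9a:00.0": true, "0000:ab:00.0": true,
	}
	// A 2-GPU request is served from the 2-GPU partition, leaving the
	// 4-GPU partition intact for a larger VM.
	fmt.Println(preferSamePartition(partitions, available, 2))
}
```

Whether smallest-fit is the right heuristic is exactly the kind of thing we would like maintainer feedback on; the sketch only illustrates the constraint that a VM's GPUs must come from one partition.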

Environment Assumptions

| Assumption | Details |
| --- | --- |
| FM Daemon Running | Fabric Manager daemon must be running in `shared_nvswitch` mode on the host |
| Driver Version | Tested with NVIDIA driver 580.x series |
| GPU Architecture | Tested on H100 SXM5 80GB (HGX H100 system) |
| VFIO Binding | GPUs must be bound to the `vfio-pci` driver for passthrough |
| Host Network | Plugin requires `hostNetwork: true` to access FM on localhost |
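
For reference, a sketch of the Fabric Manager configuration these assumptions imply. The option names and the default config path follow the FM user guide for recent driver branches and should be verified against the installed FM version:

```ini
# /usr/share/nvidia/nvswitch/fabricmanager.cfg
FABRIC_MODE=1                    # 1 = Shared NVSwitch multitenancy (shared_nvswitch) mode
FM_CMD_BIND_INTERFACE=127.0.0.1  # FM API bound to localhost
FM_CMD_PORT_NUMBER=6666          # matches the localhost:6666 address the plugin uses
```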

We would appreciate feedback from maintainers on this implementation approach and any suggestions for improvement.

@copy-pr-bot

copy-pr-bot Bot commented Dec 29, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

dnugmanov force-pushed the feat/fabric-manager-partition-support branch from 9557994 to 13496fa on December 29, 2025 at 09:50
feat: Add Fabric Manager partition support for NVLink-enabled multi-GPU VMs

Implements support for NVIDIA Fabric Manager partition-aware GPU allocation,
enabling NVLink connectivity for multi-GPU VMs in KubeVirt on DGX/HGX H100
systems using the Shared NVSwitch Virtualization Model.

Closes NVIDIA#133

Changes:
- Add pkg/fabric_manager/ with FM SDK CGO bindings
- Implement GetPreferredAllocation for partition-aware allocation
- Add automatic partition activation/deactivation in Allocate
- Update Dockerfile with FM SDK installation
- Add --fm-enabled and --fm-address CLI flags
dnugmanov force-pushed the feat/fabric-manager-partition-support branch from 13496fa to 620a537 on December 29, 2025 at 09:51
name: nvidia-kubevirt-gpu-dp-ds
spec:
priorityClassName: system-node-critical
hostNetwork: true # Required for FM API access on localhost:6666
Collaborator

There should be a way to do this without going over the Node's network. A shared socket would be better.

- name: vfio
  hostPath:
    path: /dev/vfio
- name: sys
Collaborator

Can you explain what we need access to /sys for?

	return dpi
}

// SetPartitionManager sets the Fabric Manager partition manager for NVLink support.
Collaborator

Why does the device plugin need to configure the Fabric Manager? Can't this happen another way, like through cloud-init or a side-car?
