Fix hami vGPU scheduling failure in large and medium-scale clusters by linuxfhy · Pull Request #5393 · volcano-sh/volcano

linuxfhy · 2026-06-06T03:29:37Z

What type of PR is this?

fix

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #5361

#5361
Project-HAMi/volcano-vgpu-device-plugin#132

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR removes the node “handshake” annotation flow for vGPU discovery and instead gates vGPU device initialization on the presence of non-zero allocatable vGPU resources on the Node.

Changes:

Removed node annotation patch helper and handshake-based gating logic.
Added checks in NewGPUDevices to require non-zero allocatable vgpu-number, vgpu-cores, and vgpu-memory.
Removed the VolcanoVGPUHandshake constant from device config.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
pkg/scheduler/api/devices/nvidia/vgpu/utils.go	Removes the node-annotation patch helper previously used by handshake logic.
pkg/scheduler/api/devices/nvidia/vgpu/device_info.go	Switches gating from handshake annotation to Node allocatable vGPU resources checks.
pkg/scheduler/api/devices/config/vgpu.go	Removes exported handshake annotation constant.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

 	VolcanoVGPUNumber = "volcano.sh/vgpu-number"
 	// VolcanoVGPURegister virtual gpu information registered from device-plugin to scheduler
 	VolcanoVGPURegister = "volcano.sh/node-vgpu-register"


+	// Check if allocatable resources in node status exist and are not zero
+	if node.Status.Allocatable != nil {
+		// Check if volcano.sh/gpu-number resource exists and is not zero
+		gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)]
+		if !gpuNumberExists || gpuNumberRes.Value() == 0 {
+			klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber)
+			return nil
+		}
+
+		// Check if volcano.sh/vgpu-cores resource exists and is not zero
+		vgpuCoresRes, vgpuCoresExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUCores)]
+		if !vgpuCoresExists || vgpuCoresRes.Value() == 0 {
+			klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUCores)
+			return nil
+		}
+
+		// Check if volcano.sh/vgpu-memory resource exists and is not zero
+		vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)]
+		if !vgpuMemoryExists || vgpuMemoryRes.Value() == 0 {
+			klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory)
+			return nil
+		}


+		if !gpuNumberExists || gpuNumberRes.Value() == 0 {
+			klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber)
+			return nil
+		}


gemini-code-assist

Code Review

This pull request removes the vGPU handshake mechanism and its associated node annotation patching logic, including the unused patchNodeAnnotations helper function. Instead, it introduces validation checks to ensure that allocatable resources (VolcanoVGPUNumber, VolcanoVGPUCores, and VolcanoVGPUMemory) exist and have non-zero values in the node status. Feedback suggests improving robustness by checking if these resource values are <= 0 rather than strictly == 0, and correcting a minor typo in the code comments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-06T03:30:36Z

+	if node.Status.Allocatable != nil {
+		// Check if volcano.sh/gpu-number resource exists and is not zero
+		gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)]
+		if !gpuNumberExists || gpuNumberRes.Value() == 0 {
+			klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber)
+			return nil
+		}
+
+		// Check if volcano.sh/vgpu-cores resource exists and is not zero
+		vgpuCoresRes, vgpuCoresExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUCores)]
+		if !vgpuCoresExists || vgpuCoresRes.Value() == 0 {
+			klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUCores)
+			return nil
+		}
+
+		// Check if volcano.sh/vgpu-memory resource exists and is not zero
+		vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)]
+		if !vgpuMemoryExists || vgpuMemoryRes.Value() == 0 {
+			klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory)
+			return nil
+		}


For better robustness and defensive programming, we should check if the allocatable resource values are <= 0 instead of exactly == 0. This ensures that any negative values (which could theoretically occur in corrupted states or mock/test environments) are also safely handled. Additionally, there is a minor typo in the comment on line 105 (volcano.sh/gpu-number instead of volcano.sh/vgpu-number).

if node.Status.Allocatable != nil { // Check if volcano.sh/vgpu-number resource exists and is positive gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)] if !gpuNumberExists || gpuNumberRes.Value() <= 0 { klog.V(3).Infof("Node %s does not have allocatable %s resource or value is <= 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber) return nil } // Check if volcano.sh/vgpu-cores resource exists and is positive vgpuCoresRes, vgpuCoresExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUCores)] if !vgpuCoresExists || vgpuCoresRes.Value() <= 0 { klog.V(3).Infof("Node %s does not have allocatable %s resource or value is <= 0, returning nil", node.Name, deviceconfig.VolcanoVGPUCores) return nil } // Check if volcano.sh/vgpu-memory resource exists and is positive vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)] if !vgpuMemoryExists || vgpuMemoryRes.Value() <= 0 { klog.V(3).Infof("Node %s does not have allocatable %s resource or value is <= 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory) return nil } }

JesseStutler · 2026-06-06T06:20:00Z

@linuxfhy Please improve the title and issue description, the context is too limited.

linuxfhy · 2026-06-07T16:15:20Z

@linuxfhy Please improve the title and issue description, the context is too limited.

The title was auto-generated when creating the PR. I've revised it now. Additionally, I've also updated the title of the PR submitted to the hami-device-plugin repository.

archlitchi · 2026-06-08T01:39:34Z

please resolve these AI comments

archlitchi · 2026-06-08T01:41:29Z

-	handshake, ok := node.Annotations[deviceconfig.VolcanoVGPUHandshake]
-	if !ok {
+
+	// Check if allocatable resources in node status exist and are not zero


please remove these ai-generated comments

archlitchi · 2026-06-08T01:41:53Z

have you tried on your local cluster?

Signed-off-by: fanhy36 <fanhy36@chinaunicom.cn>

linuxfhy · 2026-06-08T08:02:50Z

have you tried on your local cluster?

We have conducted tests on our small-scale validation cluster with modified volcano-scheduler and hami-dp. The verification results are as follows:
1.Scheduling functionality remains unaffected.
2.Frequent node patching operations have been eliminated.
3.The scheduler can detect crashes of hami-dp, and all Pods requesting Hami vGPU resources will stay in the Pending state.

The changes for large-scale clusters will be deployed within this week.

For detailed verification information, please refer to: https://mp.weixin.qq.com/s/z-72X1dVXAyzGc2N-Imk0Q

archlitchi

/lgtm

volcano-sh-bot · 2026-06-09T08:01:31Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details

Needs approval from an approver in each of these files:

~~pkg/scheduler/OWNERS~~ [archlitchi]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

JesseStutler · 2026-06-09T08:15:34Z

+		vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)]
+		if !vgpuMemoryExists || vgpuMemoryRes.Value() == 0 {
+			klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory)
+			return nil


Is the order of judgment for this resource fixed? Is it possible that there might be a case where there is no memory in the allocatable but there are core + number existing in the allocatable? @linuxfhy @archlitchi

JesseStutler · 2026-06-11T06:57:41Z

/cherrypick release-1.14
/cherrypick release-1.15
/cherrypick release-1.13

volcano-sh-bot · 2026-06-11T06:58:25Z

@JesseStutler: new pull request created: #5427

Details

In response to this:

/cherrypick release-1.14
/cherrypick release-1.15
/cherrypick release-1.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copilot AI review requested due to automatic review settings June 6, 2026 03:29

volcano-sh-bot requested review from Thor-wl and alcorj-mizar June 6, 2026 03:29

volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 6, 2026

Copilot AI reviewed Jun 6, 2026

View reviewed changes

gemini-code-assist Bot reviewed Jun 6, 2026

View reviewed changes

linuxfhy changed the title ~~Pr 20260606 handshake~~ Fix hami vGPU scheduling failure in large and medium-scale clusters Jun 7, 2026

archlitchi reviewed Jun 8, 2026

View reviewed changes

删除scheduler和hami-dp之间的handshake

aec04d7

Signed-off-by: fanhy36 <fanhy36@chinaunicom.cn>

linuxfhy force-pushed the pr_20260606_handshake branch from 3582eb9 to aec04d7 Compare June 8, 2026 07:52

archlitchi approved these changes Jun 9, 2026

View reviewed changes

volcano-sh-bot assigned archlitchi Jun 9, 2026

volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 9, 2026

volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2026

volcano-sh-bot merged commit ce598b7 into volcano-sh:master Jun 9, 2026
50 of 53 checks passed

JesseStutler reviewed Jun 9, 2026

View reviewed changes

DSFans2014 mentioned this pull request Jun 10, 2026

feat: check ascend vnpu health with Allocatable #5418

Merged

volcano-sh-bot mentioned this pull request Jun 11, 2026

[release-1.14] Fix hami vGPU scheduling failure in large and medium-scale clusters #5427

Closed

Conversation

linuxfhy commented Jun 6, 2026

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

Does this PR introduce a user-facing change?

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 6, 2026

Choose a reason for hiding this comment

Uh oh!

JesseStutler commented Jun 6, 2026

Uh oh!

linuxfhy commented Jun 7, 2026

Uh oh!

archlitchi commented Jun 8, 2026

Uh oh!

archlitchi Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

linuxfhy Jun 8, 2026

Choose a reason for hiding this comment

Uh oh!

archlitchi commented Jun 8, 2026

Uh oh!

linuxfhy commented Jun 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

archlitchi left a comment

Choose a reason for hiding this comment

Uh oh!

volcano-sh-bot commented Jun 9, 2026

Uh oh!

Uh oh!

JesseStutler Jun 9, 2026

Choose a reason for hiding this comment

Uh oh!

JesseStutler commented Jun 11, 2026

Uh oh!

volcano-sh-bot commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

linuxfhy commented Jun 8, 2026 •

edited

Loading