Skip to content

Fix hami vGPU scheduling failure in large and medium-scale clusters#5393

Merged
volcano-sh-bot merged 1 commit into
volcano-sh:masterfrom
linuxfhy:pr_20260606_handshake
Jun 9, 2026
Merged

Fix hami vGPU scheduling failure in large and medium-scale clusters#5393
volcano-sh-bot merged 1 commit into
volcano-sh:masterfrom
linuxfhy:pr_20260606_handshake

Conversation

@linuxfhy

@linuxfhy linuxfhy commented Jun 6, 2026

Copy link
Copy Markdown
Member

What type of PR is this?

fix

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #5361

#5361
Project-HAMi/volcano-vgpu-device-plugin#132

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Copilot AI review requested due to automatic review settings June 6, 2026 03:29
@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 6, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR removes the node “handshake” annotation flow for vGPU discovery and instead gates vGPU device initialization on the presence of non-zero allocatable vGPU resources on the Node.

Changes:

  • Removed node annotation patch helper and handshake-based gating logic.
  • Added checks in NewGPUDevices to require non-zero allocatable vgpu-number, vgpu-cores, and vgpu-memory.
  • Removed the VolcanoVGPUHandshake constant from device config.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File Description
pkg/scheduler/api/devices/nvidia/vgpu/utils.go Removes the node-annotation patch helper previously used by handshake logic.
pkg/scheduler/api/devices/nvidia/vgpu/device_info.go Switches gating from handshake annotation to Node allocatable vGPU resources checks.
pkg/scheduler/api/devices/config/vgpu.go Removes exported handshake annotation constant.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines 27 to 29
VolcanoVGPUNumber = "volcano.sh/vgpu-number"
// VolcanoVGPURegister virtual gpu information registered from device-plugin to scheduler
VolcanoVGPURegister = "volcano.sh/node-vgpu-register"
Comment on lines +103 to +124
// Check if allocatable resources in node status exist and are not zero
if node.Status.Allocatable != nil {
// Check if volcano.sh/gpu-number resource exists and is not zero
gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)]
if !gpuNumberExists || gpuNumberRes.Value() == 0 {
klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber)
return nil
}

// Check if volcano.sh/vgpu-cores resource exists and is not zero
vgpuCoresRes, vgpuCoresExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUCores)]
if !vgpuCoresExists || vgpuCoresRes.Value() == 0 {
klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUCores)
return nil
}

// Check if volcano.sh/vgpu-memory resource exists and is not zero
vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)]
if !vgpuMemoryExists || vgpuMemoryRes.Value() == 0 {
klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory)
return nil
}
Comment on lines +107 to +110
if !gpuNumberExists || gpuNumberRes.Value() == 0 {
klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber)
return nil
}

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes the vGPU handshake mechanism and its associated node annotation patching logic, including the unused patchNodeAnnotations helper function. Instead, it introduces validation checks to ensure that allocatable resources (VolcanoVGPUNumber, VolcanoVGPUCores, and VolcanoVGPUMemory) exist and have non-zero values in the node status. Feedback suggests improving robustness by checking if these resource values are <= 0 rather than strictly == 0, and correcting a minor typo in the code comments.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +104 to +124
if node.Status.Allocatable != nil {
// Check if volcano.sh/gpu-number resource exists and is not zero
gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)]
if !gpuNumberExists || gpuNumberRes.Value() == 0 {
klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber)
return nil
}

// Check if volcano.sh/vgpu-cores resource exists and is not zero
vgpuCoresRes, vgpuCoresExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUCores)]
if !vgpuCoresExists || vgpuCoresRes.Value() == 0 {
klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUCores)
return nil
}

// Check if volcano.sh/vgpu-memory resource exists and is not zero
vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)]
if !vgpuMemoryExists || vgpuMemoryRes.Value() == 0 {
klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory)
return nil
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For better robustness and defensive programming, we should check if the allocatable resource values are <= 0 instead of exactly == 0. This ensures that any negative values (which could theoretically occur in corrupted states or mock/test environments) are also safely handled. Additionally, there is a minor typo in the comment on line 105 (volcano.sh/gpu-number instead of volcano.sh/vgpu-number).

	if node.Status.Allocatable != nil {
		// Check if volcano.sh/vgpu-number resource exists and is positive
		gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)]
		if !gpuNumberExists || gpuNumberRes.Value() <= 0 {
			klog.V(3).Infof("Node %s does not have allocatable %s resource or value is <= 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber)
			return nil
		}

		// Check if volcano.sh/vgpu-cores resource exists and is positive
		vgpuCoresRes, vgpuCoresExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUCores)]
		if !vgpuCoresExists || vgpuCoresRes.Value() <= 0 {
			klog.V(3).Infof("Node %s does not have allocatable %s resource or value is <= 0, returning nil", node.Name, deviceconfig.VolcanoVGPUCores)
			return nil
		}

		// Check if volcano.sh/vgpu-memory resource exists and is positive
		vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)]
		if !vgpuMemoryExists || vgpuMemoryRes.Value() <= 0 {
			klog.V(3).Infof("Node %s does not have allocatable %s resource or value is <= 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory)
			return nil
		}
	}

@JesseStutler

Copy link
Copy Markdown
Member

@linuxfhy Please improve the title and issue description, the context is too limited.

@linuxfhy linuxfhy changed the title Pr 20260606 handshake Fix hami vGPU scheduling failure in large and medium-scale clusters Jun 7, 2026
@linuxfhy

linuxfhy commented Jun 7, 2026

Copy link
Copy Markdown
Member Author

@linuxfhy Please improve the title and issue description, the context is too limited.

The title was auto-generated when creating the PR. I've revised it now. Additionally, I've also updated the title of the PR submitted to the hami-device-plugin repository.

@archlitchi

Copy link
Copy Markdown
Contributor

please resolve these AI comments

handshake, ok := node.Annotations[deviceconfig.VolcanoVGPUHandshake]
if !ok {

// Check if allocatable resources in node status exist and are not zero

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please remove these ai-generated comments

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@archlitchi

Copy link
Copy Markdown
Contributor

have you tried on your local cluster?

Signed-off-by: fanhy36 <fanhy36@chinaunicom.cn>
@linuxfhy linuxfhy force-pushed the pr_20260606_handshake branch from 3582eb9 to aec04d7 Compare June 8, 2026 07:52
@linuxfhy

linuxfhy commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

have you tried on your local cluster?

We have conducted tests on our small-scale validation cluster with modified volcano-scheduler and hami-dp. The verification results are as follows:
1.Scheduling functionality remains unaffected.
2.Frequent node patching operations have been eliminated.
3.The scheduler can detect crashes of hami-dp, and all Pods requesting Hami vGPU resources will stay in the Pending state.

The changes for large-scale clusters will be deployed within this week.

For detailed verification information, please refer to: https://mp.weixin.qq.com/s/z-72X1dVXAyzGc2N-Imk0Q

@archlitchi archlitchi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 9, 2026
@volcano-sh-bot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 9, 2026
@volcano-sh-bot volcano-sh-bot merged commit ce598b7 into volcano-sh:master Jun 9, 2026
50 of 53 checks passed
vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)]
if !vgpuMemoryExists || vgpuMemoryRes.Value() == 0 {
klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory)
return nil

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the order of judgment for this resource fixed? Is it possible that there might be a case where there is no memory in the allocatable but there are core + number existing in the allocatable? @linuxfhy @archlitchi

@JesseStutler

Copy link
Copy Markdown
Member

/cherrypick release-1.14
/cherrypick release-1.15
/cherrypick release-1.13

@volcano-sh-bot

Copy link
Copy Markdown
Contributor

@JesseStutler: new pull request created: #5427

Details

In response to this:

/cherrypick release-1.14
/cherrypick release-1.15
/cherrypick release-1.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Hami vGPU scheduling failure in large and medium-scale clusters (200 nodes × 8 GPUs)

5 participants