Fix hami vGPU scheduling failure in large and medium-scale clusters#5393
Conversation
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR removes the node “handshake” annotation flow for vGPU discovery and instead gates vGPU device initialization on the presence of non-zero allocatable vGPU resources on the Node.
Changes:
- Removed node annotation patch helper and handshake-based gating logic.
- Added checks in
NewGPUDevicesto require non-zero allocatablevgpu-number,vgpu-cores, andvgpu-memory. - Removed the
VolcanoVGPUHandshakeconstant from device config.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/scheduler/api/devices/nvidia/vgpu/utils.go | Removes the node-annotation patch helper previously used by handshake logic. |
| pkg/scheduler/api/devices/nvidia/vgpu/device_info.go | Switches gating from handshake annotation to Node allocatable vGPU resources checks. |
| pkg/scheduler/api/devices/config/vgpu.go | Removes exported handshake annotation constant. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| VolcanoVGPUNumber = "volcano.sh/vgpu-number" | ||
| // VolcanoVGPURegister virtual gpu information registered from device-plugin to scheduler | ||
| VolcanoVGPURegister = "volcano.sh/node-vgpu-register" |
| // Check if allocatable resources in node status exist and are not zero | ||
| if node.Status.Allocatable != nil { | ||
| // Check if volcano.sh/gpu-number resource exists and is not zero | ||
| gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)] | ||
| if !gpuNumberExists || gpuNumberRes.Value() == 0 { | ||
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber) | ||
| return nil | ||
| } | ||
|
|
||
| // Check if volcano.sh/vgpu-cores resource exists and is not zero | ||
| vgpuCoresRes, vgpuCoresExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUCores)] | ||
| if !vgpuCoresExists || vgpuCoresRes.Value() == 0 { | ||
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUCores) | ||
| return nil | ||
| } | ||
|
|
||
| // Check if volcano.sh/vgpu-memory resource exists and is not zero | ||
| vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)] | ||
| if !vgpuMemoryExists || vgpuMemoryRes.Value() == 0 { | ||
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory) | ||
| return nil | ||
| } |
| if !gpuNumberExists || gpuNumberRes.Value() == 0 { | ||
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber) | ||
| return nil | ||
| } |
There was a problem hiding this comment.
Code Review
This pull request removes the vGPU handshake mechanism and its associated node annotation patching logic, including the unused patchNodeAnnotations helper function. Instead, it introduces validation checks to ensure that allocatable resources (VolcanoVGPUNumber, VolcanoVGPUCores, and VolcanoVGPUMemory) exist and have non-zero values in the node status. Feedback suggests improving robustness by checking if these resource values are <= 0 rather than strictly == 0, and correcting a minor typo in the code comments.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if node.Status.Allocatable != nil { | ||
| // Check if volcano.sh/gpu-number resource exists and is not zero | ||
| gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)] | ||
| if !gpuNumberExists || gpuNumberRes.Value() == 0 { | ||
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber) | ||
| return nil | ||
| } | ||
|
|
||
| // Check if volcano.sh/vgpu-cores resource exists and is not zero | ||
| vgpuCoresRes, vgpuCoresExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUCores)] | ||
| if !vgpuCoresExists || vgpuCoresRes.Value() == 0 { | ||
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUCores) | ||
| return nil | ||
| } | ||
|
|
||
| // Check if volcano.sh/vgpu-memory resource exists and is not zero | ||
| vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)] | ||
| if !vgpuMemoryExists || vgpuMemoryRes.Value() == 0 { | ||
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory) | ||
| return nil | ||
| } |
There was a problem hiding this comment.
For better robustness and defensive programming, we should check if the allocatable resource values are <= 0 instead of exactly == 0. This ensures that any negative values (which could theoretically occur in corrupted states or mock/test environments) are also safely handled. Additionally, there is a minor typo in the comment on line 105 (volcano.sh/gpu-number instead of volcano.sh/vgpu-number).
if node.Status.Allocatable != nil {
// Check if volcano.sh/vgpu-number resource exists and is positive
gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)]
if !gpuNumberExists || gpuNumberRes.Value() <= 0 {
klog.V(3).Infof("Node %s does not have allocatable %s resource or value is <= 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber)
return nil
}
// Check if volcano.sh/vgpu-cores resource exists and is positive
vgpuCoresRes, vgpuCoresExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUCores)]
if !vgpuCoresExists || vgpuCoresRes.Value() <= 0 {
klog.V(3).Infof("Node %s does not have allocatable %s resource or value is <= 0, returning nil", node.Name, deviceconfig.VolcanoVGPUCores)
return nil
}
// Check if volcano.sh/vgpu-memory resource exists and is positive
vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)]
if !vgpuMemoryExists || vgpuMemoryRes.Value() <= 0 {
klog.V(3).Infof("Node %s does not have allocatable %s resource or value is <= 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory)
return nil
}
}|
@linuxfhy Please improve the title and issue description, the context is too limited. |
The title was auto-generated when creating the PR. I've revised it now. Additionally, I've also updated the title of the PR submitted to the hami-device-plugin repository. |
|
please resolve these AI comments |
| handshake, ok := node.Annotations[deviceconfig.VolcanoVGPUHandshake] | ||
| if !ok { | ||
|
|
||
| // Check if allocatable resources in node status exist and are not zero |
There was a problem hiding this comment.
please remove these ai-generated comments
|
have you tried on your local cluster? |
Signed-off-by: fanhy36 <fanhy36@chinaunicom.cn>
3582eb9 to
aec04d7
Compare
We have conducted tests on our small-scale validation cluster with modified volcano-scheduler and hami-dp. The verification results are as follows: The changes for large-scale clusters will be deployed within this week. For detailed verification information, please refer to: https://mp.weixin.qq.com/s/z-72X1dVXAyzGc2N-Imk0Q |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: archlitchi The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
| vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)] | ||
| if !vgpuMemoryExists || vgpuMemoryRes.Value() == 0 { | ||
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory) | ||
| return nil |
There was a problem hiding this comment.
Is the order of judgment for this resource fixed? Is it possible that there might be a case where there is no memory in the allocatable but there are core + number existing in the allocatable? @linuxfhy @archlitchi
|
/cherrypick release-1.14 |
|
@JesseStutler: new pull request created: #5427 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
What type of PR is this?
fix
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #5361
#5361
Project-HAMi/volcano-vgpu-device-plugin#132
Special notes for your reviewer:
Does this PR introduce a user-facing change?