[release-1.14] Fix hami vGPU scheduling failure in large and medium-scale clusters#5427
Conversation
Signed-off-by: fanhy36 <fanhy36@chinaunicom.cn>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request removes the vGPU handshake mechanism and its associated node annotation patching, while also removing the unused patchNodeAnnotations utility function. Instead, it introduces validation checks to ensure that nodes have allocatable vGPU resources (VolcanoVGPUNumber, VolcanoVGPUCores, and VolcanoVGPUMemory) with non-zero values. The feedback suggests refactoring these repetitive resource validation checks into a loop to improve code maintainability.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)] | ||
| if !gpuNumberExists || gpuNumberRes.Value() == 0 { | ||
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber) | ||
| return nil | ||
| } | ||
|
|
||
| vgpuCoresRes, vgpuCoresExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUCores)] | ||
| if !vgpuCoresExists || vgpuCoresRes.Value() == 0 { | ||
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUCores) | ||
| return nil | ||
| } | ||
|
|
||
| vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)] | ||
| if !vgpuMemoryExists || vgpuMemoryRes.Value() == 0 { | ||
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory) | ||
| return nil | ||
| } |
There was a problem hiding this comment.
The checks for the three allocatable vGPU resources (VolcanoVGPUNumber, VolcanoVGPUCores, and VolcanoVGPUMemory) are highly repetitive. We can simplify this code and improve maintainability by iterating over a slice of the required resource names in a loop.
| gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)] | |
| if !gpuNumberExists || gpuNumberRes.Value() == 0 { | |
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber) | |
| return nil | |
| } | |
| vgpuCoresRes, vgpuCoresExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUCores)] | |
| if !vgpuCoresExists || vgpuCoresRes.Value() == 0 { | |
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUCores) | |
| return nil | |
| } | |
| vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)] | |
| if !vgpuMemoryExists || vgpuMemoryRes.Value() == 0 { | |
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory) | |
| return nil | |
| } | |
| requiredResources := []string{ | |
| deviceconfig.VolcanoVGPUNumber, | |
| deviceconfig.VolcanoVGPUCores, | |
| deviceconfig.VolcanoVGPUMemory, | |
| } | |
| for _, resName := range requiredResources { | |
| res, exists := node.Status.Allocatable[v1.ResourceName(resName)] | |
| if !exists || res.Value() == 0 { | |
| klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, resName) | |
| return nil | |
| } | |
| } |
|
This PR wrongfully contains Chinese commit, which needs to refactor, I will allow Copilot to amend the commit and cherrypick |
This is an automated cherry-pick of #5393