[release-1.14] Fix hami vGPU scheduling failure in large and medium-scale clusters#5427

Closed
volcano-sh-bot wants to merge 1 commit into volcano-sh:release-1.14 from volcano-sh-bot:cherry-pick-5393-to-release-1.14

Conversation

@volcano-sh-bot

Contributor

This is an automated cherry-pick of #5393


Signed-off-by: fanhy36 <fanhy36@chinaunicom.cn>
@volcano-sh-bot

Contributor Author

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign kingeasternsun for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 11, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request removes the vGPU handshake mechanism and its associated node annotation patching, while also removing the unused patchNodeAnnotations utility function. Instead, it introduces validation checks to ensure that nodes have allocatable vGPU resources (VolcanoVGPUNumber, VolcanoVGPUCores, and VolcanoVGPUMemory) with non-zero values. The feedback suggests refactoring these repetitive resource validation checks into a loop to improve code maintainability.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +103 to +119
gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)]
if !gpuNumberExists || gpuNumberRes.Value() == 0 {
	klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber)
	return nil
}

vgpuCoresRes, vgpuCoresExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUCores)]
if !vgpuCoresExists || vgpuCoresRes.Value() == 0 {
	klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUCores)
	return nil
}

vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)]
if !vgpuMemoryExists || vgpuMemoryRes.Value() == 0 {
	klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory)
	return nil
}

medium

The checks for the three allocatable vGPU resources (VolcanoVGPUNumber, VolcanoVGPUCores, and VolcanoVGPUMemory) are highly repetitive. We can simplify this code and improve maintainability by iterating over a slice of the required resource names in a loop.

Suggested change
-gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)]
-if !gpuNumberExists || gpuNumberRes.Value() == 0 {
-	klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUNumber)
-	return nil
-}
-vgpuCoresRes, vgpuCoresExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUCores)]
-if !vgpuCoresExists || vgpuCoresRes.Value() == 0 {
-	klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUCores)
-	return nil
-}
-vgpuMemoryRes, vgpuMemoryExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUMemory)]
-if !vgpuMemoryExists || vgpuMemoryRes.Value() == 0 {
-	klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, deviceconfig.VolcanoVGPUMemory)
-	return nil
-}
+requiredResources := []string{
+	deviceconfig.VolcanoVGPUNumber,
+	deviceconfig.VolcanoVGPUCores,
+	deviceconfig.VolcanoVGPUMemory,
+}
+for _, resName := range requiredResources {
+	res, exists := node.Status.Allocatable[v1.ResourceName(resName)]
+	if !exists || res.Value() == 0 {
+		klog.V(3).Infof("Node %s does not have allocatable %s resource or value is 0, returning nil", node.Name, resName)
+		return nil
+	}
+}
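
For reference, the loop-based check suggested above can be tried in isolation with a minimal, standalone sketch. It is not the Volcano code: a plain `map[string]int64` stands in for `node.Status.Allocatable`, the helper name `firstMissingResource` is hypothetical, and the three resource-name strings are assumed placeholders for the `deviceconfig` constants.

```go
package main

import "fmt"

// Assumed placeholder names standing in for deviceconfig.VolcanoVGPUNumber,
// deviceconfig.VolcanoVGPUCores and deviceconfig.VolcanoVGPUMemory.
const (
	vgpuNumber = "volcano.sh/vgpu-number"
	vgpuCores  = "volcano.sh/vgpu-cores"
	vgpuMemory = "volcano.sh/vgpu-memory"
)

// firstMissingResource (hypothetical helper) returns the first required
// resource that is absent from allocatable or has a zero value.
// It returns "" when every required resource is present and non-zero.
func firstMissingResource(allocatable map[string]int64, required []string) string {
	for _, name := range required {
		if v, ok := allocatable[name]; !ok || v == 0 {
			return name
		}
	}
	return ""
}

func main() {
	required := []string{vgpuNumber, vgpuCores, vgpuMemory}

	// This node advertises vGPU count and cores but no vGPU memory.
	node := map[string]int64{vgpuNumber: 4, vgpuCores: 400}

	if missing := firstMissingResource(node, required); missing != "" {
		fmt.Printf("node skipped: %s is missing or zero\n", missing)
	}
}
```

Returning the offending resource name keeps a single log statement at the call site, which mirrors the one `klog.V(3).Infof` call inside the suggested loop.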

@JesseStutler

Member

This PR mistakenly includes a Chinese commit, which needs to be reworked. I will let Copilot amend the commit and redo the cherry-pick.

Labels

size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

3 participants