[release-1.14] Fix hami vGPU scheduling failure in large and medium-scale clusters by Copilot · Pull Request #5434 · volcano-sh/volcano

Copilot · 2026-06-11T12:25:35Z

Cherry-pick of aec04d7 from master to release-1.14. The handshake annotation mechanism (volcano.sh/node-vgpu-handshake) caused vGPU scheduling failures in medium/large clusters due to timing inconsistencies between the scheduler and hami-dp.

Changes

pkg/scheduler/api/devices/config/vgpu.go: Remove VolcanoVGPUHandshake constant
pkg/scheduler/api/devices/nvidia/vgpu/device_info.go: Replace handshake annotation check in NewGPUDevices with direct node.Status.Allocatable checks for vgpu-number, vgpu-cores, and vgpu-memory
pkg/scheduler/api/devices/nvidia/vgpu/utils.go: Remove patchNodeAnnotations helper (no longer referenced)

Before → After

// Before: gated on a handshake annotation that could get stuck
handshake, ok := node.Annotations[deviceconfig.VolcanoVGPUHandshake]
if !ok {
    return nil
}
// ... timing-based state machine using "Requesting_" / "Deleted_" strings

// After: check allocatable resources directly
gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)]
if !gpuNumberExists || gpuNumberRes.Value() == 0 {
    return nil
}
// ... similar checks for vgpu-cores and vgpu-memory
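
The "After" snippet elides the checks for vgpu-cores and vgpu-memory. The following is a minimal, self-contained sketch of what the combined gate might look like, not the code from the PR. It assumes that all three resources must be present and nonzero, it uses a plain `map[string]int64` as a stand-in for `node.Status.Allocatable`, and the full resource names and the helper `hasVGPUResources` are hypothetical.

```go
package main

import "fmt"

// Assumed resource names; the PR refers to them only through the
// deviceconfig constants, so these literal strings are illustrative.
const (
	vgpuNumber = "volcano.sh/vgpu-number"
	vgpuCores  = "volcano.sh/vgpu-cores"
	vgpuMemory = "volcano.sh/vgpu-memory"
)

// hasVGPUResources is a hypothetical helper. It reports whether every
// vGPU resource is advertised with a nonzero quantity. The map stands in
// for node.Status.Allocatable, with quantities reduced to int64 values.
func hasVGPUResources(allocatable map[string]int64) bool {
	for _, name := range []string{vgpuNumber, vgpuCores, vgpuMemory} {
		if v, ok := allocatable[name]; !ok || v == 0 {
			return false
		}
	}
	return true
}

func main() {
	ready := map[string]int64{vgpuNumber: 8, vgpuCores: 800, vgpuMemory: 65536}
	partial := map[string]int64{vgpuNumber: 8} // cores and memory not yet reported
	fmt.Println(hasVGPUResources(ready), hasVGPUResources(partial))
}
```

A check of this shape depends only on the node's current state, so it has no timestamps or intermediate states that can drift out of sync between the scheduler and hami-dp.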

Signed-off-by: fanhy36 <fanhy36@chinaunicom.cn>

volcano-sh-bot · 2026-06-11T12:30:14Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: (none)

Once this PR has been reviewed and has the lgtm label, please ask for approval from jessestutler. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/scheduler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Initial plan

38dfe33

Copilot AI assigned Copilot and JesseStutler Jun 11, 2026

volcano-sh-bot added the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Jun 11, 2026

Copilot started work on behalf of JesseStutler June 11, 2026 12:25

volcano-sh-bot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) Jun 11, 2026

Remove the handshake between scheduler and hami-dp

daaa3db

Signed-off-by: fanhy36 <fanhy36@chinaunicom.cn>

volcano-sh-bot added the size/M label (denotes a PR that changes 30-99 lines, ignoring generated files) and removed the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) Jun 11, 2026

Copilot AI changed the title ~~[WIP] Remove handshake between scheduler and hami-dp~~ cherry-pick: remove handshake between scheduler and hami-dp (release-1.14) Jun 11, 2026

Copilot finished work on behalf of JesseStutler June 11, 2026 12:33

Copilot AI requested a review from JesseStutler June 11, 2026 12:33

JesseStutler changed the title ~~cherry-pick: remove handshake between scheduler and hami-dp (release-1.14)~~ [release-1.14] Fix hami vGPU scheduling failure in large and medium-scale clusters Jun 11, 2026

JesseStutler marked this pull request as ready for review June 11, 2026 12:40

volcano-sh-bot removed the do-not-merge/work-in-progress label (indicates that a PR should not merge because it is a work in progress) Jun 11, 2026

volcano-sh-bot requested review from k82cn and wangyang0616 June 11, 2026 12:40

Copilot wants to merge 2 commits into release-1.14 from copilot/release-114


3 participants
