[release-1.13] Fix hami vGPU scheduling failure in large and medium-scale clusters#5431

Open
Copilot wants to merge 2 commits into release-1.13 from copilot/release-113

Copilot AI commented Jun 11, 2026

Contributor

Cherry-picks the vGPU scheduling fix from master into release-1.13. The fix removes the scheduler/device-plugin handshake path and relies on node allocatable resources instead, which aligns release-1.13 with the upstream fix for vGPU scheduling failures in large and medium-scale clusters.

  • Config cleanup

    • Removed obsolete VolcanoVGPUHandshake constant from pkg/scheduler/api/devices/config/vgpu.go.
  • Scheduler device discovery logic

    • Updated NewGPUDevices in pkg/scheduler/api/devices/nvidia/vgpu/device_info.go to:
      • stop reading/modifying volcano.sh/node-vgpu-handshake
      • require non-zero allocatable values for:
        • volcano.sh/vgpu-number
        • volcano.sh/vgpu-cores
        • volcano.sh/vgpu-memory
      • return early when allocatable data is missing or invalid.
  • Unit coverage updates

    • Added table-driven cases in pkg/scheduler/api/devices/nvidia/vgpu/device_info_test.go for allocatable gating behavior (missing/zero resources vs valid resource set).

```go
gpuNumberRes, gpuNumberExists := node.Status.Allocatable[v1.ResourceName(deviceconfig.VolcanoVGPUNumber)]
if !gpuNumberExists || gpuNumberRes.Value() == 0 {
    return nil
}
```
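
The allocatable gating described in this PR can be sketched in isolation as follows. This is a simplified illustration, not the code from `device_info.go`: the `allocatable` type is a hypothetical stand-in for `node.Status.Allocatable` (a plain map of resource name to integer quantity), `hasUsableVGPU` is an illustrative helper name, and the quantities in the table are made-up values. Only the three resource names come from the description above.

```go
package main

import "fmt"

// Resource names as listed in the PR description.
const (
	vgpuNumber = "volcano.sh/vgpu-number"
	vgpuCores  = "volcano.sh/vgpu-cores"
	vgpuMemory = "volcano.sh/vgpu-memory"
)

// allocatable is a hypothetical, simplified stand-in for
// node.Status.Allocatable: resource name -> integer quantity.
type allocatable map[string]int64

// hasUsableVGPU (illustrative name) reports whether all three vGPU
// resources are present and non-zero. A caller would return early,
// without building device info, when this is false.
func hasUsableVGPU(a allocatable) bool {
	for _, name := range []string{vgpuNumber, vgpuCores, vgpuMemory} {
		if v, ok := a[name]; !ok || v == 0 {
			return false
		}
	}
	return true
}

func main() {
	// Table-driven cases: missing or zero resources vs. a valid set.
	// All quantities are illustrative.
	cases := []struct {
		name string
		a    allocatable
		want bool
	}{
		{"no vGPU resources", allocatable{}, false},
		{"zero vgpu-number", allocatable{vgpuNumber: 0, vgpuCores: 100, vgpuMemory: 16384}, false},
		{"missing vgpu-memory", allocatable{vgpuNumber: 4, vgpuCores: 400}, false},
		{"valid resource set", allocatable{vgpuNumber: 4, vgpuCores: 400, vgpuMemory: 65536}, true},
	}
	for _, c := range cases {
		got := hasUsableVGPU(c.a)
		fmt.Printf("%-22s got=%v want=%v\n", c.name, got, c.want)
	}
}
```

Gating on allocatable values means the scheduler depends only on what the node already reports, rather than on a separate annotation that both the scheduler and the device plugin must keep updating.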

@volcano-sh-bot volcano-sh-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 11, 2026
@volcano-sh-bot volcano-sh-bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Jun 11, 2026
Signed-off-by: fanhy36 <fanhy36@chinaunicom.cn>
@volcano-sh-bot

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign monokaix for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jun 11, 2026
Copilot AI changed the title from "[WIP] Cherry-pick commit aec04d7 for vGPU scheduling fix" to "release-1.13: remove scheduler↔hami-dp vGPU handshake and gate on allocatable resources" Jun 11, 2026
Copilot AI requested a review from JesseStutler June 11, 2026 12:31
@JesseStutler JesseStutler changed the title from "release-1.13: remove scheduler↔hami-dp vGPU handshake and gate on allocatable resources" to "[release-1.13] Fix hami vGPU scheduling failure in large and medium-scale clusters" Jun 11, 2026
@JesseStutler JesseStutler marked this pull request as ready for review June 11, 2026 12:40
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 11, 2026
@volcano-sh-bot volcano-sh-bot requested a review from hwdef June 11, 2026 12:40
Labels

size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

3 participants