Skip to content

Clear device-related annotations during pod release#5416

Merged
volcano-sh-bot merged 1 commit into
volcano-sh:masterfrom
archlitchi:master
Jun 11, 2026
Merged

Clear device-related annotations during pod release#5416
volcano-sh-bot merged 1 commit into
volcano-sh:masterfrom
archlitchi:master

Conversation

@archlitchi

Copy link
Copy Markdown
Contributor

/kind bug

fix #5335

This applies to all sharable devices(NV, Ascend, etc..), so we need a unified fix instead of fixing them one by one.

Copilot AI review requested due to automatic review settings June 9, 2026 10:12
@volcano-sh-bot volcano-sh-bot added the kind/bug Categorizes issue or PR as related to a bug. label Jun 9, 2026
@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 9, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the Release method for AscendDevices, NPUDevices, and GPUDevices to clean up speculative pod annotations during rollback or preemption, utilizing a new RemovePodAnnotations helper. Feedback focuses on preventing potential nil pointer dereferences by moving nil checks before calling SubResource in GPUDevices.Release, adding missing SubResource calls and receiver nil checks in AscendDevices.Release and NPUDevices.Release to avoid in-memory resource leaks, and adding defensive nil checks in RemovePodAnnotations.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pkg/scheduler/api/devices/nvidia/vgpu/device_info.go
Comment thread pkg/scheduler/api/devices/ascend/hami/device_info.go
Comment thread pkg/scheduler/api/devices/util.go

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR ensures device-related pod annotations are cleaned up during Release() so speculative allocations don’t “stick” and affect subsequent scheduling, while preserving annotations for committed/running pods.

Changes:

  • Added a shared helper to remove Pod annotations via API server patch.
  • Updated multiple device plugins’ Release() implementations to clear specific device annotations (with “success” phase preserved where applicable).
  • Added regression tests for vGPU and HAMi to verify cleanup behavior and preservation for committed pods.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
pkg/scheduler/api/devices/util.go Adds RemovePodAnnotations helper that patches the Pod to delete annotation keys.
pkg/scheduler/api/devices/nvidia/vgpu/device_info.go Cleans vGPU annotations in Release() for speculative allocations.
pkg/scheduler/api/devices/nvidia/vgpu/device_info_test.go Adds regression tests verifying speculative cleanup and committed preservation.
pkg/scheduler/api/devices/ascend/mindcluster/ascend310p/vnpu/device_info.go Cleans NPU-related annotations in Release().
pkg/scheduler/api/devices/ascend/hami/device_info.go Cleans HAMi/Ascend annotations in Release() (including dynamic device keys).
pkg/scheduler/api/devices/ascend/hami/device_info_test.go Adds tests for speculative cleanup and committed preservation; initializes device key maps for the tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/scheduler/api/devices/nvidia/vgpu/device_info.go Outdated
Comment thread pkg/scheduler/api/devices/ascend/hami/device_info.go
Comment thread pkg/scheduler/api/devices/ascend/hami/device_info_test.go Outdated
Comment thread pkg/scheduler/api/devices/util.go
Comment thread pkg/scheduler/api/devices/nvidia/vgpu/device_info.go
@volcano-sh-bot volcano-sh-bot added do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. and removed do-not-merge/contains-merge-commits labels Jun 9, 2026
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jun 9, 2026
Comment thread pkg/scheduler/api/devices/nvidia/vgpu/device_info.go
Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
@archlitchi

Copy link
Copy Markdown
Contributor Author

CC @JesseStutler

@JesseStutler

Copy link
Copy Markdown
Member

/approve
/lgtm
Do we need to backport? @archlitchi

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 11, 2026
@volcano-sh-bot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: JesseStutler, nishitnshah

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2026
@volcano-sh-bot volcano-sh-bot merged commit ba800be into volcano-sh:master Jun 11, 2026
32 checks passed
@JesseStutler

Copy link
Copy Markdown
Member

/cherrypick release-1.15
/cherrypick release-1.14
/cherrypick release-1.13

@volcano-sh-bot

Copy link
Copy Markdown
Contributor

@JesseStutler: new pull request created: #5429

Details

In response to this:

/cherrypick release-1.15
/cherrypick release-1.14
/cherrypick release-1.13

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

vgpu: stale annotations from dry-run rollback cause GPU over-subscription in subGroup scheduling

5 participants