Clear device-related annotations during pod release#5416
Conversation
There was a problem hiding this comment.
Code Review
This pull request implements the Release method for AscendDevices, NPUDevices, and GPUDevices to clean up speculative pod annotations during rollback or preemption, utilizing a new RemovePodAnnotations helper. Feedback focuses on preventing potential nil pointer dereferences by moving nil checks before calling SubResource in GPUDevices.Release, adding missing SubResource calls and receiver nil checks in AscendDevices.Release and NPUDevices.Release to avoid in-memory resource leaks, and adding defensive nil checks in RemovePodAnnotations.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR ensures device-related pod annotations are cleaned up during Release() so speculative allocations don’t “stick” and affect subsequent scheduling, while preserving annotations for committed/running pods.
Changes:
- Added a shared helper to remove Pod annotations via API server patch.
- Updated multiple device plugins’
Release()implementations to clear specific device annotations (with “success” phase preserved where applicable). - Added regression tests for vGPU and HAMi to verify cleanup behavior and preservation for committed pods.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| pkg/scheduler/api/devices/util.go | Adds RemovePodAnnotations helper that patches the Pod to delete annotation keys. |
| pkg/scheduler/api/devices/nvidia/vgpu/device_info.go | Cleans vGPU annotations in Release() for speculative allocations. |
| pkg/scheduler/api/devices/nvidia/vgpu/device_info_test.go | Adds regression tests verifying speculative cleanup and committed preservation. |
| pkg/scheduler/api/devices/ascend/mindcluster/ascend310p/vnpu/device_info.go | Cleans NPU-related annotations in Release(). |
| pkg/scheduler/api/devices/ascend/hami/device_info.go | Cleans HAMi/Ascend annotations in Release() (including dynamic device keys). |
| pkg/scheduler/api/devices/ascend/hami/device_info_test.go | Adds tests for speculative cleanup and committed preservation; initializes device key maps for the tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
|
/approve |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: JesseStutler, nishitnshah The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
/cherrypick release-1.15 |
|
@JesseStutler: new pull request created: #5429 DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/kind bug
fix #5335
This applies to all sharable devices(NV, Ascend, etc..), so we need a unified fix instead of fixing them one by one.