[release-1.15] Clear device-related annotations during pod release#5429
[release-1.15] Clear device-related annotations during pod release#5429volcano-sh-bot wants to merge 1 commit into
Conversation
Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
There was a problem hiding this comment.
Code Review
This pull request implements the Release method across several device plugins (Ascend HAMi, Ascend NPU, and NVIDIA vGPU) to clean up speculative or discarded device annotations from Pods, preventing them from persisting across node reassignments. It also introduces a helper function RemovePodAnnotations to delete these annotations via the API server, along with corresponding unit tests. However, the review highlights critical concurrency and safety issues: iterating over the global map devices.InRequestDevices in HAMi's Release method is prone to concurrent map read/write panics, and directly mutating pod.Annotations via delete in NVIDIA vGPU and Ascend NPU Release methods can cause concurrent map write panics or shared cache corruption if the pod object is a shared pointer from the informer cache.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| for commonWord := range devices.InRequestDevices { | ||
| keys = append(keys, | ||
| devices.InRequestDevices[commonWord], | ||
| devices.SupportDevices[commonWord], | ||
| fmt.Sprintf("huawei.com/%s", commonWord), | ||
| ) | ||
| } |
There was a problem hiding this comment.
Iterating over the global map devices.InRequestDevices here is prone to concurrent map read/write panics.\n\nThe global maps devices.InRequestDevices and devices.SupportDevices are mutated dynamically inside InitDevices (which is called by NewAscendDevices for every node/session) without any synchronization. Since Release can be called concurrently with scheduling sessions or node updates, this concurrent read and write will cause the scheduler process to crash with a fatal error: concurrent map read and map write panic.\n\nRecommendation:\n1. Protect devices.InRequestDevices and devices.SupportDevices with a sync.RWMutex or use a thread-safe map implementation (like sync.Map).\n2. Alternatively, ensure that these global maps are fully initialized once during startup/initialization rather than lazily and repeatedly inside NewAscendDevices.
| for _, k := range keys { | ||
| delete(pod.Annotations, k) | ||
| } |
There was a problem hiding this comment.
Mutating pod.Annotations directly via delete can lead to concurrent map write panics or shared cache corruption if the pod object passed to Release is a shared pointer from the informer cache.\n\nIn Kubernetes, objects retrieved from listers/informers are shared and read-only. Modifying them directly mutates the cache in-place, which is not thread-safe and can cause race conditions or crashes in other parts of the scheduler.\n\nRecommendation:\nEnsure that the caller always passes a deep-copied pod object, or perform a copy-on-write of the annotations map before mutating it.
| for _, k := range keys { | ||
| delete(pod.Annotations, k) | ||
| } |
There was a problem hiding this comment.
Mutating pod.Annotations directly via delete can lead to concurrent map write panics or shared cache corruption if the pod object passed to Release is a shared pointer from the informer cache.\n\nIn Kubernetes, objects retrieved from listers/informers are shared and read-only. Modifying them directly mutates the cache in-place, which is not thread-safe and can cause race conditions or crashes in other parts of the scheduler.\n\nRecommendation:\nEnsure that the caller always passes a deep-copied pod object, or perform a copy-on-write of the annotations map before mutating it.
This is an automated cherry-pick of #5416