Skip to content

[release-1.15] Clear device-related annotations during pod release#5429

Open
volcano-sh-bot wants to merge 1 commit into
volcano-sh:release-1.15from
volcano-sh-bot:cherry-pick-5416-to-release-1.15
Open

[release-1.15] Clear device-related annotations during pod release#5429
volcano-sh-bot wants to merge 1 commit into
volcano-sh:release-1.15from
volcano-sh-bot:cherry-pick-5416-to-release-1.15

Conversation

@volcano-sh-bot

Copy link
Copy Markdown
Contributor

This is an automated cherry-pick of #5416

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>
@volcano-sh-bot

Copy link
Copy Markdown
Contributor Author

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hwdef for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 11, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request implements the Release method across several device plugins (Ascend HAMi, Ascend NPU, and NVIDIA vGPU) to clean up speculative or discarded device annotations from Pods, preventing them from persisting across node reassignments. It also introduces a helper function RemovePodAnnotations to delete these annotations via the API server, along with corresponding unit tests. However, the review highlights critical concurrency and safety issues: iterating over the global map devices.InRequestDevices in HAMi's Release method is prone to concurrent map read/write panics, and directly mutating pod.Annotations via delete in NVIDIA vGPU and Ascend NPU Release methods can cause concurrent map write panics or shared cache corruption if the pod object is a shared pointer from the informer cache.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +352 to +358
for commonWord := range devices.InRequestDevices {
keys = append(keys,
devices.InRequestDevices[commonWord],
devices.SupportDevices[commonWord],
fmt.Sprintf("huawei.com/%s", commonWord),
)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Iterating over the global map devices.InRequestDevices here is prone to concurrent map read/write panics.\n\nThe global maps devices.InRequestDevices and devices.SupportDevices are mutated dynamically inside InitDevices (which is called by NewAscendDevices for every node/session) without any synchronization. Since Release can be called concurrently with scheduling sessions or node updates, this concurrent read and write will cause the scheduler process to crash with a fatal error: concurrent map read and map write panic.\n\nRecommendation:\n1. Protect devices.InRequestDevices and devices.SupportDevices with a sync.RWMutex or use a thread-safe map implementation (like sync.Map).\n2. Alternatively, ensure that these global maps are fully initialized once during startup/initialization rather than lazily and repeatedly inside NewAscendDevices.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@archlitchi Could you take a look and verify it?

Comment on lines +300 to +302
for _, k := range keys {
delete(pod.Annotations, k)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Mutating pod.Annotations directly via delete can lead to concurrent map write panics or shared cache corruption if the pod object passed to Release is a shared pointer from the informer cache.\n\nIn Kubernetes, objects retrieved from listers/informers are shared and read-only. Modifying them directly mutates the cache in-place, which is not thread-safe and can cause race conditions or crashes in other parts of the scheduler.\n\nRecommendation:\nEnsure that the caller always passes a deep-copied pod object, or perform a copy-on-write of the annotations map before mutating it.

Comment on lines +220 to +222
for _, k := range keys {
delete(pod.Annotations, k)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Mutating pod.Annotations directly via delete can lead to concurrent map write panics or shared cache corruption if the pod object passed to Release is a shared pointer from the informer cache.\n\nIn Kubernetes, objects retrieved from listers/informers are shared and read-only. Modifying them directly mutates the cache in-place, which is not thread-safe and can cause race conditions or crashes in other parts of the scheduler.\n\nRecommendation:\nEnsure that the caller always passes a deep-copied pod object, or perform a copy-on-write of the annotations map before mutating it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants