[release-1.15] Clear device-related annotations during pod release by volcano-sh-bot · Pull Request #5429 · volcano-sh/volcano

volcano-sh-bot · 2026-06-11T07:01:23Z

This is an automated cherry-pick of #5416

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

volcano-sh-bot · 2026-06-11T07:01:30Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign hwdef for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/scheduler/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist

Code Review

This pull request implements the Release method across several device plugins (Ascend HAMi, Ascend NPU, and NVIDIA vGPU) to clean up speculative or discarded device annotations from Pods, preventing them from persisting across node reassignments. It also introduces a helper function RemovePodAnnotations to delete these annotations via the API server, along with corresponding unit tests. However, the review highlights critical concurrency and safety issues: iterating over the global map devices.InRequestDevices in HAMi's Release method is prone to concurrent map read/write panics, and directly mutating pod.Annotations via delete in NVIDIA vGPU and Ascend NPU Release methods can cause concurrent map write panics or shared cache corruption if the pod object is a shared pointer from the informer cache.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-11T07:03:25Z

+	for commonWord := range devices.InRequestDevices {
+		keys = append(keys,
+			devices.InRequestDevices[commonWord],
+			devices.SupportDevices[commonWord],
+			fmt.Sprintf("huawei.com/%s", commonWord),
+		)
+	}


Iterating over the global map devices.InRequestDevices here is prone to concurrent map read/write panics.\n\nThe global maps devices.InRequestDevices and devices.SupportDevices are mutated dynamically inside InitDevices (which is called by NewAscendDevices for every node/session) without any synchronization. Since Release can be called concurrently with scheduling sessions or node updates, this concurrent read and write will cause the scheduler process to crash with a fatal error: concurrent map read and map write panic.\n\nRecommendation:\n1. Protect devices.InRequestDevices and devices.SupportDevices with a sync.RWMutex or use a thread-safe map implementation (like sync.Map).\n2. Alternatively, ensure that these global maps are fully initialized once during startup/initialization rather than lazily and repeatedly inside NewAscendDevices.

@archlitchi Could you take a look and verify it?

gemini-code-assist · 2026-06-11T07:03:25Z

+	for _, k := range keys {
+		delete(pod.Annotations, k)
+	}


Mutating pod.Annotations directly via delete can lead to concurrent map write panics or shared cache corruption if the pod object passed to Release is a shared pointer from the informer cache.\n\nIn Kubernetes, objects retrieved from listers/informers are shared and read-only. Modifying them directly mutates the cache in-place, which is not thread-safe and can cause race conditions or crashes in other parts of the scheduler.\n\nRecommendation:\nEnsure that the caller always passes a deep-copied pod object, or perform a copy-on-write of the annotations map before mutating it.

gemini-code-assist · 2026-06-11T07:03:25Z

+	for _, k := range keys {
+		delete(pod.Annotations, k)
+	}


Mutating pod.Annotations directly via delete can lead to concurrent map write panics or shared cache corruption if the pod object passed to Release is a shared pointer from the informer cache.\n\nIn Kubernetes, objects retrieved from listers/informers are shared and read-only. Modifying them directly mutates the cache in-place, which is not thread-safe and can cause race conditions or crashes in other parts of the scheduler.\n\nRecommendation:\nEnsure that the caller always passes a deep-copied pod object, or perform a copy-on-write of the annotations map before mutating it.

Clean up annotations during release

a95f7ea

Signed-off-by: limengxuan <mengxuan.li@dynamia.ai>

volcano-sh-bot mentioned this pull request Jun 11, 2026

Clear device-related annotations during pod release #5416

Merged

volcano-sh-bot requested review from JesseStutler and alcorj-mizar June 11, 2026 07:01

volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 11, 2026

gemini-code-assist Bot reviewed Jun 11, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[release-1.15] Clear device-related annotations during pod release#5429

[release-1.15] Clear device-related annotations during pod release#5429
volcano-sh-bot wants to merge 1 commit into
volcano-sh:release-1.15from
volcano-sh-bot:cherry-pick-5416-to-release-1.15

volcano-sh-bot commented Jun 11, 2026

Uh oh!

volcano-sh-bot commented Jun 11, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 11, 2026

Uh oh!

JesseStutler Jun 11, 2026

Uh oh!

gemini-code-assist Bot Jun 11, 2026

Uh oh!

gemini-code-assist Bot Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

volcano-sh-bot commented Jun 11, 2026

Uh oh!

volcano-sh-bot commented Jun 11, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

JesseStutler Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants