Skip to content

fix ascend vnpu addResource#5363

Merged
volcano-sh-bot merged 2 commits into
volcano-sh:masterfrom
DSFans2014:fix/ascend-vnpu-allocate
Jun 4, 2026
Merged

fix ascend vnpu addResource#5363
volcano-sh-bot merged 2 commits into
volcano-sh:masterfrom
DSFans2014:fix/ascend-vnpu-allocate

Conversation

@DSFans2014

@DSFans2014 DSFans2014 commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #5360

Special notes for your reviewer:

Does this PR introduce a user-facing change?


@volcano-sh-bot volcano-sh-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 2, 2026
@volcano-sh-bot volcano-sh-bot requested review from hwdef and k82cn June 2, 2026 03:03
@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 2, 2026

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a PodMap to AscendDevice to track device usage per pod, preventing duplicate resource additions or subtractions. It also refactors AddResourceUsage and SubResourceUsage to accept *AscendDevice directly and switches from InRequestDevices to SupportDevices. Feedback on the changes highlights an inefficiency in SubResource where a new DeviceUsage is redundantly allocated and assigned to PodMap immediately before the key is deleted.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pkg/scheduler/api/devices/ascend/hami/device_info.go Outdated
Signed-off-by: james <open4pd@4paradigm.com>
@DSFans2014 DSFans2014 force-pushed the fix/ascend-vnpu-allocate branch from efa0705 to c05e901 Compare June 2, 2026 03:07
@volcano-sh-bot volcano-sh-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 2, 2026
@DSFans2014 DSFans2014 marked this pull request as ready for review June 3, 2026 02:25
Copilot AI review requested due to automatic review settings June 3, 2026 02:25
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 3, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds per-pod tracking for Ascend device allocations to make resource accounting idempotent and improves unit coverage around resource addition.

Changes:

  • Introduces PodMap on AscendDevice to track per-pod device usage and prevent double-counting.
  • Updates add/sub resource logic to use PodMap and to skip unknown devices safely.
  • Adds a table-driven test suite for AscendDevices.AddResource.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

File Description
pkg/scheduler/api/devices/ascend/hami/device_info.go Adds per-pod usage tracking (PodMap), updates add/sub accounting flow, and extends deepcopy/snapshot to include the new map.
pkg/scheduler/api/devices/ascend/hami/device_info_test.go Adds table-driven tests and helper constructors for AddResource behavior (single/multi/duplicate/invalid annotation, etc.).
Comments suppressed due to low confidence (2)

pkg/scheduler/api/devices/ascend/hami/device_info.go:154

  • This changes the public method signatures from (id string, ...) to (*AscendDevice, ...), which is a breaking API change for any external callers. Consider keeping the original id-based exported methods as wrappers (lookup + call an unexported helper), or make the pointer-taking variants unexported to avoid forcing downstream changes.
func (ads *AscendDevices) AddResourceUsage(dev *AscendDevice, cores int32, mem int32) error {
	dev.DeviceUsage.Used++
	dev.DeviceUsage.Usedcores += cores
	dev.DeviceUsage.Usedmem += mem
	return nil
}

func (ads *AscendDevices) SubResourceUsage(dev *AscendDevice, cores int32, mem int32) error {
	dev.DeviceUsage.Used--
	dev.DeviceUsage.Usedcores -= cores
	dev.DeviceUsage.Usedmem -= mem
	return nil
}

pkg/scheduler/api/devices/ascend/hami/device_info.go:171

  • The SubResource behavior changed materially (annotation key source + PodMap-gated subtraction). There are new tests for AddResource, but no corresponding unit tests asserting SubResource decrements exactly once and cleans up PodMap (including cases like missing device, repeated SubResource calls for same pod, and PodMap missing entry behavior). Adding targeted tests here would prevent regressions.
func (ads *AscendDevices) SubResource(pod *v1.Pod) {
	if ads == nil {
		return
	}
	ano_key := devices.SupportDevices[ads.Type]
	ano, ok := pod.Annotations[ano_key]
	if !ok {
		return
	}

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/scheduler/api/devices/ascend/hami/device_info.go
Comment thread pkg/scheduler/api/devices/ascend/hami/device_info.go
Comment thread pkg/scheduler/api/devices/ascend/hami/device_info_test.go Outdated
Comment thread pkg/scheduler/api/devices/ascend/hami/device_info_test.go Outdated
Signed-off-by: james <open4pd@4paradigm.com>
@DSFans2014 DSFans2014 force-pushed the fix/ascend-vnpu-allocate branch from ca2b977 to e6c8337 Compare June 3, 2026 02:48
@JesseStutler JesseStutler added the kind/bug Categorizes issue or PR as related to a bug. label Jun 3, 2026
@JesseStutler

Copy link
Copy Markdown
Member

Please help /cc @archlitchi

@archlitchi

Copy link
Copy Markdown
Contributor

@DSFans2014 have you tested on a NPU environment?

@DSFans2014

Copy link
Copy Markdown
Contributor Author

@DSFans2014 have you tested on a NPU environment?

Yes, I've already tested this fix locally.

@archlitchi archlitchi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Jun 4, 2026
@volcano-sh-bot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 4, 2026
@volcano-sh-bot volcano-sh-bot merged commit a66638c into volcano-sh:master Jun 4, 2026
33 checks passed
@DSFans2014 DSFans2014 deleted the fix/ascend-vnpu-allocate branch June 4, 2026 07:42
@JesseStutler

Copy link
Copy Markdown
Member

/cherrypick release-1.15

@volcano-sh-bot

Copy link
Copy Markdown
Contributor

@JesseStutler: new pull request created: #5377

Details

In response to this:

/cherrypick release-1.15

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

HAMi Ascend 集成场景下,Allocate 后清空 *-devices-to-allocate annotation 导致 Volcano 资源统计失真

5 participants