Skip to content

feat: check ascend vnpu health with Allocatable#5418

Merged
volcano-sh-bot merged 2 commits into
volcano-sh:masterfrom
DSFans2014:feat/check-vnpu-health
Jun 11, 2026
Merged

feat: check ascend vnpu health with Allocatable#5418
volcano-sh-bot merged 2 commits into
volcano-sh:masterfrom
DSFans2014:feat/check-vnpu-health

Conversation

@DSFans2014

Copy link
Copy Markdown
Contributor

What type of PR is this?

A new mechanism was introduced to check device health In #5393. This PR applies this mechanism to ascend vnpu.

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

Does this PR introduce a user-facing change?


Signed-off-by: james <open4pd@4paradigm.com>
Copilot AI review requested due to automatic review settings June 10, 2026 03:28
@volcano-sh-bot volcano-sh-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 10, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Removes the legacy “handshake” health-check plumbing and tightens Ascend device discovery by filtering out nodes that don’t advertise allocatable Ascend resources.

Changes:

  • Deleted the generic CheckHealth helper (and related imports) from the shared devices API.
  • Removed handshakeAnno from Ascend device structs/copies and stopped wiring handshake annotations in Ascend init.
  • Added node.Status.Allocatable presence + per-device allocatable resource checks in NewAscendDevices.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
pkg/scheduler/api/devices/device_info.go Removes CheckHealth and now-unused imports tied to handshake-based health checking.
pkg/scheduler/api/devices/ascend/hami/device_info.go Drops handshake annotation fields and adds allocatable-resource gating for Ascend device discovery.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/scheduler/api/devices/ascend/hami/device_info.go
Comment thread pkg/scheduler/api/devices/ascend/hami/device_info.go Outdated
Comment thread pkg/scheduler/api/devices/ascend/hami/device_info.go

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request removes the handshake annotation and its associated health check logic from the device APIs. It also introduces checks in NewAscendDevices to verify that the node has allocatable resources before processing. Feedback suggests using the configured dev.config.ResourceName instead of hardcoding the resource name prefix, and defensively checking for resource values less than or equal to zero.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread pkg/scheduler/api/devices/ascend/hami/device_info.go Outdated
Signed-off-by: james <open4pd@4paradigm.com>
@archlitchi

Copy link
Copy Markdown
Contributor

Does hami/ascend-device-plugin needs adjustment?

@archlitchi

Copy link
Copy Markdown
Contributor

/assign

@DSFans2014

Copy link
Copy Markdown
Contributor Author

Does hami/ascend-device-plugin needs adjustment?

It can work well without requiring any modifications to ascend-device-plugin

@hajnalmt hajnalmt left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/area scheduling
/label tide/merge-method-squash
/lgtm

Thanks for the update, this looks good to me. Using the configured Allocatable resource name here matches the health-check direction from the earlier device work.

@archlitchi could you please also take a look and approve from the HAMI / Volcano vGPU side?

@volcano-sh-bot volcano-sh-bot added area/scheduling tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges. lgtm Indicates that a PR is ready to be merged. labels Jun 10, 2026
@JesseStutler

Copy link
Copy Markdown
Member

/cc @archlitchi Could you take a look and approve it if you agree with it?

@archlitchi archlitchi left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@volcano-sh-bot

Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: archlitchi, hajnalmt

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 11, 2026
@volcano-sh-bot volcano-sh-bot merged commit 3ecdd4c into volcano-sh:master Jun 11, 2026
34 checks passed
@JesseStutler

Copy link
Copy Markdown
Member

/cherrypick release-1.15
/cherrypick release-1.14

@volcano-sh-bot

Copy link
Copy Markdown
Contributor

@JesseStutler: new pull request created: #5428

Details

In response to this:

/cherrypick release-1.15
/cherrypick release-1.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@JesseStutler

Copy link
Copy Markdown
Member

/cherrypick release-1.14

@volcano-sh-bot

Copy link
Copy Markdown
Contributor

@JesseStutler: #5418 failed to apply on top of branch "release-1.14":

Applying: feat: check vnpu health
Using index info to reconstruct a base tree...
M	pkg/scheduler/api/devices/ascend/hami/device_info.go
Falling back to patching base and 3-way merge...
Auto-merging pkg/scheduler/api/devices/ascend/hami/device_info.go
CONFLICT (content): Merge conflict in pkg/scheduler/api/devices/ascend/hami/device_info.go
error: Failed to merge in the changes.
hint: Use 'git am --show-current-patch=diff' to see the failed patch
Patch failed at 0001 feat: check vnpu health
When you have resolved this problem, run "git am --continue".
If you prefer to skip this patch, run "git am --skip" instead.
To restore the original branch and stop patching, run "git am --abort".

Details

In response to this:

/cherrypick release-1.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@volcano-sh-bot

Copy link
Copy Markdown
Contributor

@JesseStutler: new issue created for failed cherrypick: #5432

Details

In response to this:

/cherrypick release-1.14

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/scheduling lgtm Indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. tide/merge-method-squash Denotes a PR that should be squashed by tide when it merges.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants