feat: check ascend vnpu health with Allocatable #5418
Signed-off-by: james <open4pd@4paradigm.com>
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Removes the legacy “handshake” health-check plumbing and tightens Ascend device discovery by filtering out nodes that don’t advertise allocatable Ascend resources.
Changes:
- Deleted the generic `CheckHealth` helper (and related imports) from the shared devices API.
- Removed `handshakeAnno` from Ascend device structs/copies and stopped wiring handshake annotations in Ascend init.
- Added `node.Status.Allocatable` presence + per-device allocatable resource checks in `NewAscendDevices`.
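The allocatable gating described above can be illustrated with a minimal sketch. This is not the code from the PR: `ResourceList` is a simplified stand-in for `corev1.ResourceList`, and `advertisedDevices` and the `example.com/...` resource names are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// ResourceList is a simplified stand-in (assumption) for corev1.ResourceList:
// it maps a resource name to an integer quantity.
type ResourceList map[string]int64

// advertisedDevices (hypothetical helper) returns the configured resource
// names that the node actually lists in its allocatable resources.
// A node with no allocatable resources at all is skipped entirely.
func advertisedDevices(allocatable ResourceList, configured []string) []string {
	if len(allocatable) == 0 {
		return nil // node-level gate: nothing is allocatable
	}
	var found []string
	for _, name := range configured {
		// per-device gate: keep only resources the node advertises
		if _, ok := allocatable[name]; ok {
			found = append(found, name)
		}
	}
	return found
}

func main() {
	configured := []string{"example.com/vnpu-a", "example.com/vnpu-b"}
	node := ResourceList{"example.com/vnpu-a": 4}

	fmt.Println(advertisedDevices(node, configured))     // [example.com/vnpu-a]
	fmt.Println(len(advertisedDevices(nil, configured))) // 0
}
```

The sketch only models the two levels of filtering (node first, then each device type); the real function also builds device structures from node annotations, which is omitted here.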
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| pkg/scheduler/api/devices/device_info.go | Removes CheckHealth and now-unused imports tied to handshake-based health checking. |
| pkg/scheduler/api/devices/ascend/hami/device_info.go | Drops handshake annotation fields and adds allocatable-resource gating for Ascend device discovery. |
Code Review
This pull request removes the handshake annotation and its associated health check logic from the device APIs. It also introduces checks in NewAscendDevices to verify that the node has allocatable resources before processing. Feedback suggests using the configured dev.config.ResourceName instead of hardcoding the resource name prefix, and defensively checking for resource values less than or equal to zero.
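The two review suggestions (look up the configured resource name rather than a hardcoded prefix, and treat values less than or equal to zero as unavailable) could look roughly like the sketch below. It is an illustration under assumptions, not the merged code: `deviceConfig` stands in for the `dev.config` value mentioned in the review, quantities are plain `int64` instead of `resource.Quantity`, and `isSchedulable` is a hypothetical name.

```go
package main

import "fmt"

// deviceConfig is a hypothetical stand-in for the per-device configuration
// that carries the configured resource name (dev.config.ResourceName).
type deviceConfig struct {
	ResourceName string
}

// isSchedulable (hypothetical helper) reports whether the node advertises
// the configured resource with a strictly positive allocatable quantity.
// Missing, zero, and negative values are all treated as unavailable.
func isSchedulable(allocatable map[string]int64, cfg deviceConfig) bool {
	qty, ok := allocatable[cfg.ResourceName]
	return ok && qty > 0
}

func main() {
	cfg := deviceConfig{ResourceName: "example.com/vnpu-a"} // assumed name

	fmt.Println(isSchedulable(map[string]int64{"example.com/vnpu-a": 2}, cfg)) // true
	fmt.Println(isSchedulable(map[string]int64{"example.com/vnpu-a": 0}, cfg)) // false
	fmt.Println(isSchedulable(map[string]int64{"cpu": 8}, cfg))                // false
}
```

Reading the name from configuration keeps the check correct if an operator renames the resource, and the `qty > 0` guard covers a resource that is still listed but has dropped to zero.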
Does hami/ascend-device-plugin need adjustment?
/assign
It works without requiring any modifications to ascend-device-plugin.
hajnalmt left a comment
/area scheduling
/label tide/merge-method-squash
/lgtm
Thanks for the update, this looks good to me. Using the configured Allocatable resource name here matches the health-check direction from the earlier device work.
@archlitchi could you please also take a look and approve from the HAMI / Volcano vGPU side?
/cc @archlitchi Could you take a look and approve it if you agree with it?
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: archlitchi, hajnalmt
/cherrypick release-1.15
@JesseStutler: new pull request created: #5428
/cherrypick release-1.14
@JesseStutler: #5418 failed to apply on top of branch "release-1.14".
@JesseStutler: new issue created for failed cherrypick: #5432
What type of PR is this?
A new mechanism for checking device health was introduced in #5393. This PR applies that mechanism to Ascend vNPU.
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Does this PR introduce a user-facing change?