Hi maintainers,
Thank you for your work on ascend-device-plugin.
I would like to start a friendly discussion about the current vNPU cleanup behavior, especially the principle used to decide whether a vNPU is considered "idle" and can be destroyed.
Background
In the current implementation, the cleanup logic is roughly:
    for _, vDev := range vDevInfos.VDevInfo {
        if vDev.QueryInfo.IsContainerUsed == 0 {
            err := am.mgr.DestroyVirtualDevice(logicID, uint32(vDev.VDevID))
            ...
        }
    }
From the Huawei documentation, the original description is:
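"is_container_used indicates whether the current container has already started using it" (original: is_container_used表示当前容器是否已经开始使用)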
Reference:
https://support.huawei.com/enterprise/zh/doc/EDOC1100568432/1717c5f?idPath=23710424|251366513|254884019|261408772|261457531#ZH-CN_TOPIC_0000002541395355
This seems to imply that IsContainerUsed == 0 does not necessarily mean that the vNPU is unused in the scheduling or allocation sense. It may also mean that the container has been created and the Pod is already Running, but the workload has not started accessing the NPU yet.
Observed behavior
In my testing, I observed behavior similar to the issue discussed here:
Project-HAMi/HAMi#1696
A typical sequence is:
- Create a Pod that uses vNPU.
- The Pod becomes Running.
- On the node, the corresponding vNPU can be seen briefly.
- Very soon after that, the vNPU is destroyed by the cleanup logic.
- Then running npu-smi info inside the container fails.
Interestingly, if I enter the container immediately after the Pod becomes Running and execute npu-smi info right away, the vNPU is not destroyed, and subsequent npu-smi info commands continue to work.
This makes the current behavior look like a timing window:
- the vNPU has already been allocated to the Pod,
- but the container has not yet "started using" it from the driver's perspective,
- so the cleanup logic may reclaim it too early.
Why this may be worth revisiting
The current cleanup principle is aggressive: it appears to rely only on a driver-level runtime signal.
However, for newly allocated vNPUs, there is a meaningful gap between:
- "the Pod has been allocated a vNPU and is already running", and
- "the container has started using the vNPU"
During this gap, reclaiming the vNPU may affect normal workloads that simply have not touched the device yet.
So the main question is:
Should vNPU cleanup be based only on IsContainerUsed, or should it also consider whether the vNPU is still allocated to an existing Pod?
Related implementation in mind-cluster
For comparison, mind-cluster appears to use a different principle in DestroyNotUsedVNPU:
- it collects Pod information from the Kubernetes side,
- it determines which virtual devices are still referenced by Pods,
- and it destroys only virtual devices that are no longer associated with any Pod allocation.
This looks more conservative, and perhaps safer for startup timing, because it does not treat "not yet started using" as equivalent to "unused and safe to destroy".
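To make the allocation-based idea concrete, here is a minimal Go sketch of the selection step. It is not the mind-cluster or ascend-device-plugin code: the VDev type, the unreferencedVDevs function, and the referenced set are all hypothetical names, and I am assuming the set of vDev IDs still referenced by Pods has already been collected from the Kubernetes side by some other means.

```go
package main

// VDev is a hypothetical, simplified view of one virtual device on a node.
// It is not a type from ascend-device-plugin or mind-cluster.
type VDev struct {
	ID            uint32
	ContainerUsed bool // mirrors IsContainerUsed != 0 from the driver query
}

// unreferencedVDevs returns the IDs of virtual devices that no Pod allocation
// references. The referenced set is assumed to be built elsewhere from
// Kubernetes-side Pod information. ContainerUsed is deliberately ignored, so a
// Pod that has not touched the device yet still protects its vNPU.
func unreferencedVDevs(vdevs []VDev, referenced map[uint32]bool) []uint32 {
	var out []uint32
	for _, d := range vdevs {
		if referenced[d.ID] {
			continue // still allocated to an existing Pod: keep it
		}
		out = append(out, d.ID)
	}
	return out
}
```

With this rule, a vNPU whose Pod is Running but has not yet accessed the NPU is never selected, which is exactly the case that fails in the sequence above.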
Suggestion for discussion
Maybe the cleanup policy could be reconsidered along one of these directions:
- keep the current mechanism but add a protection window for newly allocated vNPUs;
- combine IsContainerUsed with Pod allocation state before deciding to destroy a vNPU;
- use a Pod/allocation-based rule similar to the one used in mind-cluster;
- only reclaim vNPUs that have been unreferenced or unused for a longer period, instead of reclaiming immediately when IsContainerUsed == 0.
I am not claiming that one specific design is definitely correct for all scenarios. I mainly hope to discuss whether the current cleanup principle is too aggressive for valid workloads that start using the NPU a little later.
Thanks again for maintaining this project.