Discussion: vNPU cleanup policy #86

Hi maintainers,

Thank you for your work on `ascend-device-plugin`.

I would like to start a friendly discussion about the current vNPU cleanup behavior, especially the principle used to decide whether a vNPU is considered "idle" and can be destroyed.

## Background

In the current implementation, the cleanup logic is roughly:

```go
// Destroy every vNPU that the driver reports as not in use by a container.
for _, vDev := range vDevInfos.VDevInfo {
    if vDev.QueryInfo.IsContainerUsed == 0 {
        err := am.mgr.DestroyVirtualDevice(logicID, uint32(vDev.VDevID))
        // ...
    }
}
```

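The Huawei documentation describes the field as follows (translated from the Chinese original):

> `is_container_used` indicates whether the current container has already started using the device.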

Reference:
<https://support.huawei.com/enterprise/zh/doc/EDOC1100568432/1717c5f?idPath=23710424|251366513|254884019|261408772|261457531#ZH-CN_TOPIC_0000002541395355>

This seems to imply that `IsContainerUsed == 0` does **not necessarily** mean that the vNPU is unused in the scheduling or allocation sense. It may also mean that the container has been created and the Pod is already `Running`, but the workload has not started accessing the NPU yet.

## Observed behavior

In my testing, I observed behavior similar to the issue discussed here:
<https://github.com/Project-HAMi/HAMi/issues/1696>

A typical sequence is:

1. Create a Pod that uses vNPU.
2. The Pod becomes `Running`.
3. On the node, the corresponding vNPU can be seen briefly.
4. Very soon after that, the vNPU is destroyed by the cleanup logic.
5. Then running `npu-smi info` inside the container fails.

Interestingly, if I enter the container immediately after the Pod becomes `Running` and execute `npu-smi info` right away, the vNPU is not destroyed, and subsequent `npu-smi info` commands continue to work.

This makes the current behavior look like a timing window:

- the vNPU has already been allocated to the Pod,
- but the container has not yet "started using" it from the driver's perspective,
- so the cleanup logic may reclaim it too early.

## Why this may be worth revisiting

The current cleanup rule is aggressive: it appears to rely only on a single driver-level runtime signal, `IsContainerUsed`.

However, for newly allocated vNPUs, there is a meaningful gap between:

- "the Pod has been allocated a vNPU and is already running", and
- "the container has started using the vNPU"

During this gap, reclaiming the vNPU may affect normal workloads that simply have not touched the device yet.

So the main question is:

**Should vNPU cleanup be based only on `IsContainerUsed`, or should it also consider whether the vNPU is still allocated to an existing Pod?**

## Related implementation in mind-cluster

For comparison, [`mind-cluster`](https://gitcode.com/Ascend/mind-cluster) appears to use a different principle in [`DestroyNotUsedVNPU`](https://gitcode.com/Ascend/mind-cluster/blob/v26.0.0/component/ascend-device-plugin/pkg/server/plugin.go):

- it collects Pod information from the Kubernetes side,
- it determines which virtual devices are still referenced by Pods,
- and it destroys only virtual devices that are no longer associated with any Pod allocation.

This looks more conservative and is probably safer during container startup, because it does not treat "not yet started using" as equivalent to "unused and safe to destroy".
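To make the comparison concrete, here is a minimal sketch of a Pod-reference-based rule. It is not the actual `mind-cluster` code: the function `unreferencedVDevs` and its inputs are hypothetical, and how the set of Pod-referenced vNPU IDs is collected (annotations, kubelet, API server) is deliberately left out.

```go
package main

import "fmt"

// unreferencedVDevs returns the IDs of virtual devices that exist on the node
// but are not referenced by any Pod allocation. Both inputs are assumptions
// for illustration: onNode would come from the driver, referencedByPods from
// the Kubernetes side.
func unreferencedVDevs(onNode []int, referencedByPods map[int]bool) []int {
	var orphans []int
	for _, id := range onNode {
		// A vNPU that some Pod still references is kept, regardless of
		// whether the container has touched the device yet.
		if !referencedByPods[id] {
			orphans = append(orphans, id)
		}
	}
	return orphans
}

func main() {
	onNode := []int{100, 101, 102}
	referenced := map[int]bool{100: true, 102: true}

	// Only vNPU 101 has no Pod referencing it, so only it is a cleanup candidate.
	fmt.Println(unreferencedVDevs(onNode, referenced)) // prints [101]
}
```

With a rule like this, a Pod that is `Running` but has not yet opened the device keeps its vNPU, because the decision never consults `IsContainerUsed`.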

## Suggestion for discussion

Maybe the cleanup policy could be reconsidered along one of these directions:

1. keep the current mechanism but add a protection window for newly allocated vNPUs;
2. combine `IsContainerUsed` with Pod allocation state before deciding to destroy a vNPU;
3. use a Pod/allocation-based rule similar to the one used in `mind-cluster`;
4. only reclaim vNPUs that have been unreferenced or unused for a longer period, instead of reclaiming immediately when `IsContainerUsed == 0`.

I am not claiming that one specific design is definitely correct for all scenarios. I mainly hope to discuss whether the current cleanup principle is too aggressive for valid workloads that start using the NPU a little later.

Thanks again for maintaining this project.
