Discussion: vNPU cleanup policy #86

@fuzhy

Hi maintainers,

Thank you for your work on ascend-device-plugin.

I would like to start a friendly discussion about the current vNPU cleanup behavior, especially the principle used to decide whether a vNPU is considered "idle" and can be destroyed.

Background

In the current implementation, the cleanup logic is roughly:

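// Walk every vNPU reported for this physical device (logicID).
for _, vDev := range vDevInfos.VDevInfo {
    // The driver flag is the only condition checked before destroying.
    if vDev.QueryInfo.IsContainerUsed == 0 {
        err := am.mgr.DestroyVirtualDevice(logicID, uint32(vDev.VDevID))
        ...
    }
}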

The Huawei documentation describes the field as follows (translated from Chinese):

is_container_used indicates whether the current container has already started using the device

Reference:
https://support.huawei.com/enterprise/zh/doc/EDOC1100568432/1717c5f?idPath=23710424|251366513|254884019|261408772|261457531#ZH-CN_TOPIC_0000002541395355

This seems to imply that IsContainerUsed == 0 does not necessarily mean that the vNPU is unused in the scheduling or allocation sense. It may also mean that the container has been created and the Pod is already Running, but the workload has not started accessing the NPU yet.

Observed behavior

In my testing, I observed behavior similar to the issue discussed here:
Project-HAMi/HAMi#1696

A typical sequence is:

  1. Create a Pod that uses vNPU.
  2. The Pod becomes Running.
  3. On the node, the corresponding vNPU can be seen briefly.
  4. Very soon after that, the vNPU is destroyed by the cleanup logic.
  5. Then running npu-smi info inside the container fails.

Interestingly, if I enter the container immediately after the Pod becomes Running and execute npu-smi info right away, the vNPU is not destroyed, and subsequent npu-smi info commands continue to work.

This makes the current behavior look like a timing window:

  • the vNPU has already been allocated to the Pod,
  • but the container has not yet "started using" it from the driver's perspective,
  • so the cleanup logic may reclaim it too early.

Why this may be worth revisiting

The current cleanup rule is aggressive: it appears to rely on a single driver-level runtime signal and reclaims as soon as that signal reads 0.

However, for newly allocated vNPUs, there is a meaningful gap between:

  • "the Pod has been allocated a vNPU and is already running", and
  • "the container has started using the vNPU"

During this gap, reclaiming the vNPU may affect normal workloads that simply have not touched the device yet.

So the main question is:

Should vNPU cleanup be based only on IsContainerUsed, or should it also consider whether the vNPU is still allocated to an existing Pod?
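
To make the question concrete, here is a minimal sketch of the two rules side by side. The vNPUState type, its AllocatedToPod field, and both function names are hypothetical and exist only for illustration; only IsContainerUsed corresponds to a field in the current code, and how the plugin would learn the allocation state is left open.

```go
package main

import "fmt"

// vNPUState is a hypothetical, simplified view of one vNPU.
// IsContainerUsed mirrors the driver flag checked by the current cleanup loop.
// AllocatedToPod is an assumed signal derived from the Kubernetes side.
type vNPUState struct {
	IsContainerUsed uint32
	AllocatedToPod  bool
}

// destroyByDriverFlagOnly models the current rule: reclaim when the flag is 0.
func destroyByDriverFlagOnly(s vNPUState) bool {
	return s.IsContainerUsed == 0
}

// destroyIfAlsoUnallocated models the alternative: reclaim only when the flag
// is 0 and no existing Pod still holds the vNPU.
func destroyIfAlsoUnallocated(s vNPUState) bool {
	return s.IsContainerUsed == 0 && !s.AllocatedToPod
}

func main() {
	// A Running Pod whose workload has not touched the device yet.
	startup := vNPUState{IsContainerUsed: 0, AllocatedToPod: true}
	fmt.Println(destroyByDriverFlagOnly(startup))  // true
	fmt.Println(destroyIfAlsoUnallocated(startup)) // false
}
```

The two rules disagree only in the startup case above; for a vNPU that no Pod holds, both would reclaim it.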

Related implementation in mind-cluster

For comparison, mind-cluster appears to use a different principle in DestroyNotUsedVNPU:

  • it collects Pod information from the Kubernetes side,
  • it determines which virtual devices are still referenced by Pods,
  • and it destroys only virtual devices that are no longer associated with any Pod allocation.

This looks more conservative, and perhaps safer for startup timing, because it does not treat "not yet started using" as equivalent to "unused and safe to destroy".
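
As a rough illustration of that principle (this is not mind-cluster's actual code), the sketch below computes which vNPUs on a node are no longer referenced by any Pod. The podAllocation type and unreferencedVDevs function are hypothetical names, and how the vNPU IDs are read from Pods is assumed to be solved elsewhere.

```go
package main

import "fmt"

// podAllocation is a hypothetical record of the vNPU IDs assigned to one Pod,
// however the plugin chooses to obtain them from the Kubernetes side.
type podAllocation struct {
	PodName string
	VDevIDs []uint32
}

// unreferencedVDevs returns the vNPU IDs present on the node that no Pod
// in pods references. Only these would be candidates for destruction.
func unreferencedVDevs(onNode []uint32, pods []podAllocation) []uint32 {
	referenced := make(map[uint32]struct{})
	for _, p := range pods {
		for _, id := range p.VDevIDs {
			referenced[id] = struct{}{}
		}
	}
	var orphans []uint32
	for _, id := range onNode {
		if _, ok := referenced[id]; !ok {
			orphans = append(orphans, id)
		}
	}
	return orphans
}

func main() {
	onNode := []uint32{100, 101, 102}
	pods := []podAllocation{{PodName: "train-0", VDevIDs: []uint32{100, 102}}}
	fmt.Println(unreferencedVDevs(onNode, pods)) // [101]
}
```

With this rule, a vNPU whose Pod is Running but idle stays in the referenced set, so the startup window described above would not trigger a reclaim.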

Suggestion for discussion

Maybe the cleanup policy could be reconsidered along one of these directions:

  1. keep the current mechanism but add a protection window for newly allocated vNPUs;
  2. combine IsContainerUsed with Pod allocation state before deciding to destroy a vNPU;
  3. use a Pod/allocation-based rule similar to the one used in mind-cluster;
  4. only reclaim vNPUs that have been unreferenced or unused for a longer period, instead of reclaiming immediately when IsContainerUsed == 0.
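
Directions 1 and 4 could be sketched roughly as follows. The idleTracker type, its methods, and the two-minute window are all hypothetical choices for illustration and are not part of ascend-device-plugin. Note that this sketch starts the clock at the first idle observation rather than at allocation time, which is one possible reading of a "protection window".

```go
package main

import (
	"fmt"
	"time"
)

// idleTracker remembers when each vNPU was first observed with
// IsContainerUsed == 0. Hypothetical type, for illustration only.
type idleTracker struct {
	grace     time.Duration
	idleSince map[uint32]time.Time
}

func newIdleTracker(grace time.Duration) *idleTracker {
	return &idleTracker{grace: grace, idleSince: make(map[uint32]time.Time)}
}

// shouldDestroy reports whether the vNPU has been continuously idle for at
// least the grace period. Any observation with a non-zero flag resets the timer.
func (t *idleTracker) shouldDestroy(vDevID, isContainerUsed uint32, now time.Time) bool {
	if isContainerUsed != 0 {
		delete(t.idleSince, vDevID)
		return false
	}
	first, seen := t.idleSince[vDevID]
	if !seen {
		t.idleSince[vDevID] = now
		return false
	}
	return now.Sub(first) >= t.grace
}

func main() {
	tr := newIdleTracker(2 * time.Minute) // assumed window length
	t0 := time.Date(2024, 1, 1, 0, 0, 0, 0, time.UTC)
	fmt.Println(tr.shouldDestroy(100, 0, t0))                    // false: first idle observation
	fmt.Println(tr.shouldDestroy(100, 0, t0.Add(time.Minute)))   // false: still inside the window
	fmt.Println(tr.shouldDestroy(100, 0, t0.Add(3*time.Minute))) // true: idle longer than the window
}
```

A time-based window reduces the race but does not remove it for workloads that start later than the window, which is why combining it with Pod allocation state (direction 2) may still be worth considering.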

I am not claiming that one specific design is definitely correct for all scenarios. I mainly hope to discuss whether the current cleanup principle is too aggressive for valid workloads that start using the NPU a little later.

Thanks again for maintaining this project.
