Creating multiple pods at the same time causes them to be allocated the same NPU #52

@lomtom

Description

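When multiple pods that each request an NPU are created at the same time, they can be allocated the same NPU, and using the NPU then fails with: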

DrvMngGetConsoleLogLevel failed. (ret=4)
dcmi model initialized failed, because the device is used. ret is -8020
  • scheduler: volcano
  • device-plugin: latest
  • npu: 910B (eight cards)
  1. Create a Deployment (2 replicas)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: npu
spec:
  replicas: 2
  selector:
    matchLabels:
      run: npu
  template:
    metadata:
      labels:
        run: npu
      name: npu
    spec:
      containers:
        - command:
            - sh
            - -c
          args:
            - sleep 1d
          image: swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04
          imagePullPolicy: IfNotPresent
          name: npu
          resources:
            limits:
              cpu: "8"
              huawei.com/Ascend910B4: "1"
              memory: 16Gi
            requests:
              cpu: "8"
              huawei.com/Ascend910B4: "1"
              memory: 16Gi
          securityContext:
            privileged: false
      schedulerName: volcano
  2. device-plugin logs
I0130 07:56:32.030596       1 server.go:349] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[C281A66C-8120A9DA-54B03472-B9D00485-104301E3-3],},},}
I0130 07:56:32.053587       1 server.go:387] allocate response: {map[ASCEND_VISIBLE_DEVICES:7] [] [] map[] [] {} 0}
I0130 07:56:35.077178       1 server.go:349] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[C281A66C-81808FDA-09B87472-B9D00485-104301E3-0],},},}
I0130 07:56:35.099104       1 server.go:387] allocate response: {map[ASCEND_VISIBLE_DEVICES:7] [] [] map[] [] {} 0}
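
The two requests above carry different device IDs, yet both responses return `ASCEND_VISIBLE_DEVICES:7`. A minimal sketch of how this could be spotted automatically in a longer log follows. It assumes the exact log format shown above, and it assumes each `allocate response` line belongs to the most recent `Allocate` line, which may not hold if the plugin logs concurrent requests interleaved. The function name and regular expressions are hypothetical and not part of the device plugin.

```python
import re
from collections import defaultdict

# Patterns assume the log format shown in this issue; other versions may differ.
REQUEST_RE = re.compile(r"Allocate: .*?DevicesIDs:\[([^\]]*)\]")
RESPONSE_RE = re.compile(r"allocate response: \{map\[ASCEND_VISIBLE_DEVICES:([^\]]*)\]")


def find_duplicate_allocations(log_lines):
    """Return {visible_devices: [requested device IDs]} for values returned more than once."""
    assigned = defaultdict(list)
    pending_ids = None
    for line in log_lines:
        request = REQUEST_RE.search(line)
        if request:
            pending_ids = request.group(1)
            continue
        response = RESPONSE_RE.search(line)
        # Assumption: a response line belongs to the most recent request line.
        if response and pending_ids is not None:
            assigned[response.group(1)].append(pending_ids)
            pending_ids = None
    return {dev: ids for dev, ids in assigned.items() if len(ids) > 1}


sample_log = [
    "I0130 07:56:32.030596       1 server.go:349] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[C281A66C-8120A9DA-54B03472-B9D00485-104301E3-3],},},}",
    "I0130 07:56:32.053587       1 server.go:387] allocate response: {map[ASCEND_VISIBLE_DEVICES:7] [] [] map[] [] {} 0}",
    "I0130 07:56:35.077178       1 server.go:349] Allocate: &AllocateRequest{ContainerRequests:[]*ContainerAllocateRequest{&ContainerAllocateRequest{DevicesIDs:[C281A66C-81808FDA-09B87472-B9D00485-104301E3-0],},},}",
    "I0130 07:56:35.099104       1 server.go:387] allocate response: {map[ASCEND_VISIBLE_DEVICES:7] [] [] map[] [] {} 0}",
]
print(find_duplicate_allocations(sample_log))
```

For the four log lines in this issue, the sketch reports device value `7` with both requested device IDs attached.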
  3. Check the containers and the cards allocated to them
# nerdctl ps |grep default/npu
67b466d3d67d    swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04                                                                                                                   "sh -c sleep 1d"          About a minute ago    Up                 k8s://default/npu-dcfc4bdc4-ckmt9/npu
7ad9f89dfb0f    sealos.hub:5000/pause:3.6                                                                                                                               "/pause"                  About a minute ago    Up                 k8s://default/npu-dcfc4bdc4-ckmt9
b75f0592887b    swr.cn-south-1.myhuaweicloud.com/ascendhub/ascend-pytorch:24.0.RC1-A2-1.11.0-ubuntu20.04                                                                                                                   "sh -c sleep 1d"          About a minute ago    Up                 k8s://default/npu-dcfc4bdc4-9cpcv/npu
99d73958ca35    sealos.hub:5000/pause:3.6                                                                                                                               "/pause"                  About a minute ago    Up                 k8s://default/npu-dcfc4bdc4-9cpcv

# nerdctl inspect 67b466d3d67d |grep VIS
                "ASCEND_VISIBLE_DEVICES=7",
# nerdctl inspect b75f0592887b |grep VIS
                "ASCEND_VISIBLE_DEVICES=7",
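
The same check can be made across all containers on a node rather than by inspecting them one at a time. The sketch below takes each container's environment entries (for example, collected from the inspect output) and lists any device index claimed by more than one container. The function name `shared_devices` is hypothetical, and treating the variable's value as a comma-separated list of indices is an assumption; other value formats are not handled.

```python
from collections import defaultdict

ENV_NAME = "ASCEND_VISIBLE_DEVICES"


def shared_devices(container_envs):
    """Map each device index to the containers claiming it, keeping only shared indices."""
    owners = defaultdict(list)
    for container_id, env in container_envs.items():
        for entry in env:
            name, _, value = entry.partition("=")
            if name != ENV_NAME:
                continue
            # Assumption: the value is a comma-separated list of device indices.
            for device in value.split(","):
                device = device.strip()
                if device:
                    owners[device].append(container_id)
    return {dev: ids for dev, ids in owners.items() if len(ids) > 1}


# Environment entries as observed for the two containers in this issue.
envs = {
    "67b466d3d67d": ["ASCEND_VISIBLE_DEVICES=7"],
    "b75f0592887b": ["ASCEND_VISIBLE_DEVICES=7"],
}
print(shared_devices(envs))
```

With the two containers above, device `7` is reported as shared by both container IDs.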
  4. Use the NPU inside the pods
# kubectl exec -it npu-dcfc4bdc4-9cpcv -- bash
root@npu-dcfc4bdc4-9cpcv:/# npu-smi info
+------------------------------------------------------------------------------------------------+
| npu-smi 25.3.rc1                 Version: 25.3.rc1                                             |
+---------------------------+---------------+----------------------------------------------------+
| NPU   Name                | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
| Chip                      | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
+===========================+===============+====================================================+
| 7     910B4               | Warning       | 88.4        40                0    / 0             |
| 0                         | 0000:42:00.0  | 0           0    / 0          2887 / 32768         |
+===========================+===============+====================================================+
+---------------------------+---------------+----------------------------------------------------+
| NPU     Chip              | Process id    | Process name             | Process memory(MB)      |
+===========================+===============+====================================================+
| No running processes found in NPU 7                                                            |
+===========================+===============+====================================================+


# kubectl exec -it npu-dcfc4bdc4-ckmt9 -- bash      
root@npu-dcfc4bdc4-ckmt9:/# npu-smi info
DrvMngGetConsoleLogLevel failed. (ret=4)
dcmi model initialized failed, because the device is used. ret is -8020
