
Ascend Atlas 300I Duo: memory allocation fails after partitioning with the new libvnpu version #64

@zhangQiWorr

Description


Device: Atlas 300I Duo
Partitioned with libvnpu. Driver version: 25.5.1, CANN version: 8.1.RC1

Pod YAML:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ascend-soft-slice-pod
spec:
  replicas: 1  # adjust the replica count as needed
  selector:
    matchLabels:
      app: ascend310p
  template:
    metadata:
      annotations:
        huawei.com/vnpu-mode: 'hami-core'
      labels:
        app: ascend310p  # matched by the Deployment selector
    spec:
      runtimeClassName: ascend
      containers:
        - name: ubuntu-container
          image: dev.bingosoft.net/bingomatrix/my-mindie:1.0.0-300I-Duo
          imagePullPolicy: IfNotPresent
          command: ["bash", "-c", "sleep 86400"]
          resources:
            limits:
              huawei.com/Ascend310P: "1"            # request 1 physical NPU
              huawei.com/Ascend310P-memory: "10240" # request 10Gi of device memory
              huawei.com/Ascend310P-core: "40"      # request 40% of compute
          volumeMounts:
          - name: dshm
            mountPath: /dev/shm
          - name: ascend-toolkit
            mountPath: /usr/local/Ascend/ascend-toolkit
            readOnly: true  # read-only recommended, to avoid polluting the host environment
      volumes:
          - name: ascend-toolkit
            hostPath:
              path: /usr/local/Ascend/ascend-toolkit
              type: Directory
          - name: dshm
            emptyDir:
              medium: Memory
              sizeLimit: 2Gi  # adjust as needed, e.g. 1Gi or 2Gi
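The Deployment mounts a memory-backed `emptyDir` named `dshm` at `/dev/shm`, so inside the container that path is a fresh tmpfs shared only by the pod's containers, replacing the runtime's default `/dev/shm`. This issue does not say where the libvnpu limiter looks for the "NPU Manager shmem" named in the panic further down; if it expects a segment under `/dev/shm`, this mount could hide it. That is a hypothesis to check, not a confirmed cause (the panic also suggests the daemon may simply not be running). A minimal stdlib-only sketch, with a hypothetical helper name `list_shm`, that lists what the container can actually see there:

```python
import os
import stat


def list_shm(path="/dev/shm"):
    """Return sorted (name, size_in_bytes) pairs for regular files under `path`.

    On Linux, POSIX shared-memory objects created with shm_open() appear as
    files in /dev/shm, so this shows which segments this process can see.
    Returns an empty list if `path` is missing or unreadable.
    """
    entries = []
    try:
        names = sorted(os.listdir(path))
    except OSError:
        return entries  # path missing, not a directory, or no permission
    for name in names:
        full = os.path.join(path, name)
        try:
            st = os.stat(full)
        except OSError:
            continue  # entry vanished or is unreadable; skip it
        if stat.S_ISREG(st.st_mode):
            entries.append((name, st.st_size))
    return entries


if __name__ == "__main__":
    found = list_shm()
    if not found:
        print("no shared-memory files visible under /dev/shm")
    for name, size in found:
        print(f"{name}\t{size} bytes")
```

Running it inside the pod (for example via `kubectl exec`) and on the host where the daemon runs, then comparing the two listings, shows whether a segment visible on the host is missing in the container.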

Allocating device memory with ACL inside the pod fails:
🔧 Initializing ACL...
📦 Allocating two 5GB device buffers (total ~10GB)...
[2026-04-24T07:38:58Z INFO limiter::worker] [Worker PID:549] Initialize SchedulerClient...

thread '<unnamed>' (549) panicked at crates/limiter/src/shmem.rs:33:25:
Worker failed to open NPU Manager shmem! Is the Daemon running?
note: run with RUST_BACKTRACE=1 environment variable to display a backtrace

thread '<unnamed>' (549) panicked at /rustc/59807616e1fa2540724bfbac14d7976d7e4a3860/library/core/src/panicking.rs:225:5:
panic in a function that cannot unwind
stack backtrace:
0: 0xffffa999cb30 - <<std[cc6062c208ed37d1]::sys::backtrace::BacktraceLock>::print::DisplayBacktrace as core[4e11ee24e72d71de]::fmt::Display>::fmt
1: 0xffffa99b1238 - core[4e11ee24e72d71de]::fmt::write
2: 0xffffa99a2ff0 - <std[cc6062c208ed37d1]::sys::stdio::unix::Stderr as std[cc6062c208ed37d1]::io::Write>::write_fmt
3: 0xffffa998d4cc - std[cc6062c208ed37d1]::panicking::default_hook::{closure#0}
4: 0xffffa9999f04 - std[cc6062c208ed37d1]::panicking::default_hook
5: 0xffffa999a0bc - std[cc6062c208ed37d1]::panicking::panic_with_hook
6: 0xffffa998d5a8 - std[cc6062c208ed37d1]::panicking::panic_handler::{closure#0}
7: 0xffffa9984d78 - std[cc6062c208ed37d1]::sys::backtrace::__rust_end_short_backtrace::<std[cc6062c208ed37d1]::panicking::panic_handler::{closure#0}, !>
8: 0xffffa998dd04 - __rustc[b7974e8690430dd9]::rust_begin_unwind
9: 0xffffa98d9fdc - core[4e11ee24e72d71de]::panicking::panic_nounwind_fmt
10: 0xffffa98d9f64 - core[4e11ee24e72d71de]::panicking::panic_nounwind
11: 0xffffa98da0bc - core[4e11ee24e72d71de]::panicking::panic_cannot_unwind
12: 0xffffa98db0cc - rtMalloc
13: 0xffff7369e7f0 - _ZN3acl17aclMallocMemInnerEPPvmb20aclrtMemMallocPolicyt
14: 0xffff7369fd3c - aclrtMalloc
15: 0xffff738e62e8 - <unknown>
16: 0xffffa94db9e8 - <unknown>
17: 0xffffa94db7a0 - _PyObject_MakeTpCall
18: 0xffffa942301c - _PyEval_EvalFrameDefault
19: 0xffffa95dc01c - <unknown>
20: 0xffffa95dc0c0 - PyEval_EvalCode
21: 0xffffa95dca7c - <unknown>
22: 0xffffa95dcb74 - <unknown>
23: 0xffffa95dcc98 - <unknown>
24: 0xffffa95e2cc8 - _PyRun_SimpleFileObject
25: 0xffffa95e3204 - _PyRun_AnyFileObject
26: 0xffffa95e413c - Py_RunMain
27: 0xffffa9619ed0 - Py_BytesMain
28: 0xffffa90f76c4 - <unknown>
29: 0xffffa90f77a8 - __libc_start_main
30: 0xaaaab44b08b0 - _start
31: 0x0 - <unknown>
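Two 5 GiB buffers add up to exactly the 10240 MiB requested via `huawei.com/Ascend310P-memory`, leaving no headroom; whether the limiter also counts runtime overhead against that quota is not stated in this issue. The sketch below reproduces the allocation pattern from the log, assuming the test script uses pyACL (the `acl` Python module shipped with CANN) and that "5GB" means 5 GiB; the helper names and constants are illustrative, not taken from the reporter's script. Because the limiter panics inside `rtMalloc` in a function that cannot unwind, the process aborts at the first allocation, as in the backtrace above, instead of returning an error code to Python.

```python
# Sketch: reproduce the issue's allocation pattern and check it against the quota.
# Assumptions: the test uses pyACL (module `acl` from CANN) and "5GB" means 5 GiB.

MIB = 1024 ** 2
GIB = 1024 ** 3

BUFFER_SIZES = [5 * GIB, 5 * GIB]  # two buffers, as in the log
QUOTA_MIB = 10240                  # huawei.com/Ascend310P-memory from the Deployment
ACL_MEM_MALLOC_HUGE_FIRST = 0      # first value of the aclrtMemMallocPolicy enum


def headroom_mib(buffer_sizes, quota_mib):
    """Quota left after the requested buffers, in MiB (negative means over quota)."""
    return quota_mib - sum(buffer_sizes) / MIB


def try_allocate(buffer_sizes, device_id=0):
    """Allocate, then free, each buffer with pyACL. Returns True if all succeed."""
    try:
        import acl  # pyACL, installed with the CANN toolkit
    except ImportError:
        print("module 'acl' not importable; run this inside the NPU container")
        return False

    if acl.init() != 0:  # 0 is ACL_SUCCESS
        print("acl.init failed")
        return False
    try:
        if acl.rt.set_device(device_id) != 0:
            print(f"acl.rt.set_device({device_id}) failed")
            return False
        ptrs = []
        try:
            for size in buffer_sizes:
                # In the issue's backtrace the panic fires in rtMalloc under aclrtMalloc.
                ptr, ret = acl.rt.malloc(size, ACL_MEM_MALLOC_HUGE_FIRST)
                if ret != 0:
                    print(f"acl.rt.malloc({size}) returned {ret}")
                    return False
                ptrs.append(ptr)
            return True
        finally:
            for ptr in ptrs:
                acl.rt.free(ptr)
            acl.rt.reset_device(device_id)
    finally:
        acl.finalize()


if __name__ == "__main__":
    print(f"headroom: {headroom_mib(BUFFER_SIZES, QUOTA_MIB):.0f} MiB")
    print("allocation ok" if try_allocate(BUFFER_SIZES) else "allocation failed")
```

If "5GB" instead means 5×10^9 bytes, the pair totals about 9537 MiB, roughly 703 MiB under the quota.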
