‘pynvml.NVMLError_NotSupported: Not Supported’ error on DGX Spark #180

Description

@LisaLiNVIDIA

We are building the Cosmos3 ARM container on DGX Spark, but we ran into the issue below:

torchrun --nproc_per_node=1 -m cosmos3.scripts.inference --checkpoint-path Cosmos3-Nano -i /workspace/inputs/omni/action_forward_dynamics_camera.json -o outputs/omni --seed=0 --warmup 2 --num-iterations 3 --resolution 256 --no-guardrails --benchmark --use-torch-compile --no-use-cuda-graphs
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/workspace/cosmos3/scripts/inference.py", line 99, in <module>
    main()
    ~~~~^^
  File "/workspace/cosmos3/scripts/inference.py", line 95, in main
    inference(args)
    ~~~~~~~~~^^^^^^
  File "/workspace/cosmos3/scripts/inference.py", line 52, in inference
    setup_args = args.setup.build_setup()
  File "/workspace/cosmos3/args.py", line 1020, in build_setup
    self._build_parallelism(
    ~~~~~~~~~~~~~~~~~~~~~~~^
        world_size=world_size, local_world_size=local_world_size, device_memory_bytes=device_memory_bytes
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/workspace/cosmos3/args.py", line 1007, in _build_parallelism
    device_memory_bytes = _get_device_memory_bytes()
  File "/workspace/cosmos3/args.py", line 1036, in _get_device_memory_bytes
    info = pynvml.nvmlDeviceGetMemoryInfo(handle)
  File "/workspace/.venv/lib/python3.13/site-packages/pynvml.py", line 3762, in nvmlDeviceGetMemoryInfo
    _nvmlCheckReturn(ret)
    ~~~~~~~~~~~~~~~~^^^^^
  File "/workspace/.venv/lib/python3.13/site-packages/pynvml.py", line 1083, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
E0602 11:14:24.331000 45 torch/distributed/elastic/multiprocessing/api.py:984] failed (exitcode: 1) local_rank: 0 (pid: 67) of binary: /workspace/.venv/bin/python
E0602 11:14:24.331000 45 torch/distributed/elastic/multiprocessing/errors/error_handler.py:145] no error file defined for parent, to copy child error file (/tmp/torchelastic_cqswkwbj/none_9sniorw6/attempt_0/0/error.json)
Traceback (most recent call last):
  File "/workspace/.venv/bin/torchrun", line 10, in <module>
    sys.exit(main())
             ~~~~^^
  File "/workspace/.venv/lib/python3.13/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 362, in wrapper
    return f(*args, **kwargs)
  File "/workspace/.venv/lib/python3.13/site-packages/torch/distributed/run.py", line 991, in main
    run(args)
    ~~~^^^^^^
  File "/workspace/.venv/lib/python3.13/site-packages/torch/distributed/run.py", line 982, in run
    elastic_launch(
    ~~~~~~~~~~~~~~~
        config=config,
        ~~~~~~~~~~~~~~
        entrypoint=cmd,
        ~~~~~~~~~~~~~~~
    )(*cmd_args)
    ~^^^^^^^^^^^
  File "/workspace/.venv/lib/python3.13/site-packages/torch/distributed/launcher/api.py", line 170, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/workspace/.venv/lib/python3.13/site-packages/torch/distributed/launcher/api.py", line 317, in launch_agent
    raise ChildFailedError(
    ...<2 lines>...
    )
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
cosmos3.scripts.inference FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2026-06-02_11:14:23
  host      : be677ad18848
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 67)
  error_file: /tmp/torchelastic_cqswkwbj/none_9sniorw6/attempt_0/0/error.json
  traceback : NoneType: None

DGX Spark:
DGX-SPARK FASTOS 1.135.29 ARM64

Driver:

root@be677ad18848:/workspace# nvidia-smi
Tue Jun  2 11:16:16 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.142                Driver Version: 580.142        CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GB10                    On  |   0000000F:01:00.0  On |                  N/A |
| N/A   42C    P0             10W /  N/A  | Not Supported          |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found
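
Possible workaround (a sketch, not the actual Cosmos3 code): the traceback shows `_get_device_memory_bytes` in `cosmos3/args.py` calling `pynvml.nvmlDeviceGetMemoryInfo` with no fallback, and `nvidia-smi` also reports `Not Supported` for Memory-Usage on the GB10. A minimal sketch of a fallback chain is below. It assumes `torch.cuda.mem_get_info` still works on this device, and that total host RAM is an acceptable upper bound if the GPU shares memory with the CPU. Neither assumption is verified here. The names `get_device_memory_bytes`, `_nvml_total_bytes`, `_torch_total_bytes`, and `_host_total_bytes` are hypothetical.

```python
import os
from typing import Callable, Iterable, Optional


def _device_index() -> int:
    # torchrun sets LOCAL_RANK for each worker; default to GPU 0 otherwise.
    return int(os.environ.get("LOCAL_RANK", "0"))


def _nvml_total_bytes() -> int:
    # Same NVML query as in the traceback; raises NVMLError_NotSupported on GB10.
    import pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(_device_index())
        return int(pynvml.nvmlDeviceGetMemoryInfo(handle).total)
    finally:
        pynvml.nvmlShutdown()


def _torch_total_bytes() -> int:
    # Assumption: the CUDA runtime can report memory even though NVML cannot.
    import torch

    _free_bytes, total_bytes = torch.cuda.mem_get_info(_device_index())
    return int(total_bytes)


def _host_total_bytes() -> int:
    # Last resort (Linux only): total physical RAM, used as an upper bound
    # under the assumption that the GPU shares memory with the CPU.
    return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")


def get_device_memory_bytes(
    probes: Optional[Iterable[Callable[[], int]]] = None,
) -> int:
    """Return the first positive result from the probes, tried in order."""
    if probes is None:
        probes = (_nvml_total_bytes, _torch_total_bytes, _host_total_bytes)
    errors = []
    for probe in probes:
        try:
            value = probe()
        except Exception as exc:  # includes pynvml.NVMLError_NotSupported
            errors.append(f"{getattr(probe, '__name__', 'probe')}: {exc!r}")
            continue
        if value > 0:
            return value
    raise RuntimeError("could not determine device memory: " + "; ".join(errors))
```

Catching a broad `Exception` is deliberate in this sketch so that a missing `pynvml` or `torch` install falls through to the next probe. A real fix would probably catch `pynvml.NVMLError_NotSupported` specifically and log which probe was used.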
