torchrun --nproc_per_node=1 -m cosmos3.scripts.inference --checkpoint-path Cosmos3-Nano -i /workspace/inputs/omni/action_forward_dynamics_camera.json -o outputs/omni --seed=0 --warmup 2 --num-iterations 3 --resolution 256 --no-guardrails --benchmark --use-torch-compile --no-use-cuda-graphs
Traceback (most recent call last):
File "<frozen runpy>", line 198, in _run_module_as_main
File "<frozen runpy>", line 88, in _run_code
File "/workspace/cosmos3/scripts/inference.py", line 99, in <module>
main()
~~~~^^
File "/workspace/cosmos3/scripts/inference.py", line 95, in main
inference(args)
~~~~~~~~~^^^^^^
File "/workspace/cosmos3/scripts/inference.py", line 52, in inference
setup_args = args.setup.build_setup()
File "/workspace/cosmos3/args.py", line 1020, in build_setup
self._build_parallelism(
~~~~~~~~~~~~~~~~~~~~~~~^
world_size=world_size, local_world_size=local_world_size, device_memory_bytes=device_memory_bytes
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
)
^
File "/workspace/cosmos3/args.py", line 1007, in _build_parallelism
device_memory_bytes = _get_device_memory_bytes()
File "/workspace/cosmos3/args.py", line 1036, in _get_device_memory_bytes
info = pynvml.nvmlDeviceGetMemoryInfo(handle)
File "/workspace/.venv/lib/python3.13/site-packages/pynvml.py", line 3762, in nvmlDeviceGetMemoryInfo
_nvmlCheckReturn(ret)
~~~~~~~~~~~~~~~~^^^^^
File "/workspace/.venv/lib/python3.13/site-packages/pynvml.py", line 1083, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
E0602 11:14:24.331000 45 torch/distributed/elastic/multiprocessing/api.py:984] failed (exitcode: 1) local_rank: 0 (pid: 67) of binary: /workspace/.venv/bin/python
E0602 11:14:24.331000 45 torch/distributed/elastic/multiprocessing/errors/error_handler.py:145] no error file defined for parent, to copy child error file (/tmp/torchelastic_cqswkwbj/none_9sniorw6/attempt_0/0/error.json)
Traceback (most recent call last):
File "/workspace/.venv/bin/torchrun", line 10, in <module>
sys.exit(main())
~~~~^^
File "/workspace/.venv/lib/python3.13/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 362, in wrapper
return f(*args, **kwargs)
File "/workspace/.venv/lib/python3.13/site-packages/torch/distributed/run.py", line 991, in main
run(args)
~~~^^^^^^
File "/workspace/.venv/lib/python3.13/site-packages/torch/distributed/run.py", line 982, in run
elastic_launch(
~~~~~~~~~~~~~~~
config=config,
~~~~~~~~~~~~~~
entrypoint=cmd,
~~~~~~~~~~~~~~~
)(*cmd_args)
~^^^^^^^^^^^
File "/workspace/.venv/lib/python3.13/site-packages/torch/distributed/launcher/api.py", line 170, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/workspace/.venv/lib/python3.13/site-packages/torch/distributed/launcher/api.py", line 317, in launch_agent
raise ChildFailedError(
...<2 lines>...
)
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
cosmos3.scripts.inference FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2026-06-02_11:14:23
host : be677ad18848
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 67)
error_file: /tmp/torchelastic_cqswkwbj/none_9sniorw6/attempt_0/0/error.json
traceback : NoneType: None
root@be677ad18848:/workspace# nvidia-smi
Tue Jun 2 11:16:16 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.142 Driver Version: 580.142 CUDA Version: 13.0 |
+-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GB10 On | 0000000F:01:00.0 On | N/A |
| N/A 42C P0 10W / N/A | Not Supported | 0% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found
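A possible workaround, sketched under assumptions: the traceback and the `Not Supported` Memory-Usage field above suggest that NVML cannot report memory for this GPU, so the memory query could fall back to another source instead of raising. The helper names below (`first_supported`, `nvml_total_bytes`, `torch_total_bytes`, `get_device_memory_bytes`) are hypothetical and are not the actual code in `cosmos3/args.py`; the fallback to `torch.cuda.get_device_properties(...).total_memory` is an untested guess for GB10.

```python
from typing import Callable, Sequence


def first_supported(queries: Sequence[Callable[[], int]]) -> int:
    """Return the result of the first query that does not raise."""
    errors = []
    for query in queries:
        try:
            return int(query())
        except Exception as exc:  # broad on purpose: NVMLError, ImportError, ...
            errors.append(f"{getattr(query, '__name__', 'query')}: {exc!r}")
    raise RuntimeError("no device-memory query succeeded: " + "; ".join(errors))


def nvml_total_bytes(index: int = 0) -> int:
    """Total device memory via NVML (fails with NotSupported in the log above)."""
    import pynvml  # imported lazily so the module loads without pynvml

    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(index)
        return pynvml.nvmlDeviceGetMemoryInfo(handle).total
    finally:
        pynvml.nvmlShutdown()


def torch_total_bytes(index: int = 0) -> int:
    """Total device memory as reported by the CUDA runtime through PyTorch."""
    import torch  # imported lazily for the same reason

    return torch.cuda.get_device_properties(index).total_memory


def get_device_memory_bytes(index: int = 0) -> int:
    """Try NVML first, then fall back to PyTorch (assumed order, not verified)."""
    return first_supported(
        [lambda: nvml_total_bytes(index), lambda: torch_total_bytes(index)]
    )
```

If something along these lines were used in place of the direct `pynvml.nvmlDeviceGetMemoryInfo(handle)` call, `_build_parallelism` would receive a value even when NVML reports `Not Supported`. Whether the number PyTorch reports is appropriate for sizing parallelism on this system is not something the log can confirm.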
We are building the Cosmos3 ARM container on DGX Spark, but we ran into the issue below:
DGX Spark:
DGX-SPARK FASTOS 1.135.29 ARM64
Driver: