Version
2.1
Describe the bug.
I've been getting crashes with CUDA errors when feeding data more than once into a CPU-only pipeline:
pipe.feed_input("Reader", batch1)
results1 = pipe.run() # works
pipe.feed_input("Reader", batch2) # ← crash here
results2 = pipe.run()
I had minimax 2.7 trace this crash, and it seems to originate from a bug in AccessOrder::wait():
A device-ordered AccessOrder leaks into a CPU-only pipeline when a GPU conversion (.gpu()) is chained downstream. On the second feed_input, AccessOrder::wait() is called with this in host order and other in device order; it enters the host branch and unconditionally calls cudaGetDevice(). On systems with CUDA installed, cudaStreamSynchronize() is then called on a stale device stream handle. The result is non-deterministic: a RuntimeError, a silent segfault, or a hang.
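To make the suspected control flow easier to follow, here is a toy model in plain Python. It is only a sketch of the behaviour described above, not DALI code: `Order`, `FakeCuda`, and this `wait()` are hypothetical stand-ins, the stream handle `7` is arbitrary, and the way the device order leaks and goes stale between the two feeds is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Order:
    """Hypothetical stand-in for an access order; stream=None means host order."""
    stream: Optional[int] = None

    @property
    def is_device(self) -> bool:
        return self.stream is not None


class FakeCuda:
    """Pretend CUDA runtime that only tracks which stream handles are still valid."""

    def __init__(self):
        self.live_streams = set()
        self.calls = []

    def get_device(self):
        self.calls.append("cudaGetDevice")
        return 0

    def stream_synchronize(self, stream):
        self.calls.append("cudaStreamSynchronize")
        if stream not in self.live_streams:
            raise RuntimeError(
                "cudaErrorInvalidResourceHandle (400): invalid resource handle"
            )


def wait(this, other, cuda):
    """Toy version of the suspected branch: `this` waits for work ordered in `other`."""
    if not other.is_device:
        return  # other is host-ordered: nothing to wait for
    if not this.is_device:
        # Host branch: the CUDA runtime is touched even in a CPU-only pipeline.
        cuda.get_device()
        cuda.stream_synchronize(other.stream)
    # The device/device case is left out of this sketch.


cuda = FakeCuda()
host = Order()

# First feed_input: both sides are host-ordered, so no CUDA call is made.
wait(host, host, cuda)

# Assumption: a device order leaks in after the first run, and its stream
# is no longer valid by the time the second feed_input happens.
leaked = Order(stream=7)

try:
    wait(host, leaked, cuda)  # second feed_input
except RuntimeError as err:
    print("second feed failed:", err)
```

If this model matches what actually happens, the host branch would need to skip the CUDA calls (or the stale device order would need to be reset) for CPU-only pipelines, but I have not verified that against the DALI sources.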
Minimum reproducible example
Here's a repo which minimally reproduces the crash:
https://github.com/fversaci/feed-bug
Relevant log output
$ uv run main.py
Loaded plugin: /mnt/tdm-dic/users/cesco-safe/code/feed-bug/libfeedbug.so
First feed_input + run: OK
Traceback (most recent call last):
File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/main.py", line 85, in main
pipe.feed_input("Reader", batch2) # ← crash here
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/.venv/lib/python3.13/site-packages/nvidia/dali/pipeline.py", line 1337, in feed_input
self._feed_input(name, data, layout, cuda_stream, use_copy_kernel)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/.venv/lib/python3.13/site-packages/nvidia/dali/pipeline.py", line 1248, in _feed_input
self._pipe.SetExternalTLInput(name, data, cuda_stream_ptr, use_copy_kernel)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA runtime API error cudaErrorInvalidResourceHandle (400):
invalid resource handle
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/main.py", line 99, in <module>
main()
~~~~^^
File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/main.py", line 90, in main
raise RuntimeError(
...<4 lines>...
) from e
RuntimeError: BUG REPRODUCED: Second feed_input crashed.
This error is caused by DALI's AccessOrder::wait() when cudaGetDevice() is called in the host-order branch.
Original error: CUDA runtime API error cudaErrorInvalidResourceHandle (400):
invalid resource handle
Other/Misc.
No response