CPU-only pipeline crashes on second feed_input due to cudaGetDevice() in AccessOrder::wait() host branch #6361

@fversaci

Version

2.1

Describe the bug.

I've been getting crashes with CUDA errors when feeding data more than once into a CPU-only pipeline:

pipe.feed_input("Reader", batch1)
results1 = pipe.run()      # works
pipe.feed_input("Reader", batch2)  # ← crash here
results2 = pipe.run()

I had minimax 2.7 trace this crash, and it seems to originate from a bug in AccessOrder::wait():

A device-ordered AccessOrder leaks into a CPU-only pipeline when a GPU conversion (.gpu()) is chained downstream. On the second feed_input, AccessOrder::wait() is called with the calling object (this) in host order and the argument (other) in device order, so it enters the host branch and unconditionally calls cudaGetDevice(). On systems with CUDA installed, cudaStreamSynchronize() is then called on a stale device stream handle. The result is non-deterministic: a RuntimeError, a silent segfault, or a hang.
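
To make the control flow described above easier to follow, here is a toy Python model of it. This is not DALI's code: Order, FakeCuda and wait are hypothetical stand-ins I made up for illustration, and the branch structure is only my reading of the trace above. It just shows how a host-order caller that receives a leaked device order ends up synchronizing on a stream handle that is no longer valid.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Order:
    """Hypothetical stand-in for an access order; stream=None means host order."""
    stream: Optional[int] = None

    def is_device(self) -> bool:
        return self.stream is not None


class FakeCuda:
    """Hypothetical fake runtime: records calls and rejects stale stream handles."""

    def __init__(self, live_streams):
        self.live_streams = set(live_streams)
        self.calls = []

    def get_device(self) -> int:
        self.calls.append("cudaGetDevice")
        return 0

    def stream_synchronize(self, stream: int) -> None:
        self.calls.append("cudaStreamSynchronize")
        if stream not in self.live_streams:
            raise RuntimeError("invalid resource handle")


def wait(this: Order, other: Order, cuda: FakeCuda) -> None:
    """Toy model of the wait logic as described in this report (an assumption)."""
    if this == other or not other.is_device():
        return  # nothing to wait for
    if this.is_device():
        cuda.calls.append("cudaStreamWaitEvent")  # device branch, not the issue here
    else:
        # Host branch: touches the CUDA runtime even though the pipeline is CPU-only.
        cuda.get_device()
        cuda.stream_synchronize(other.stream)


if __name__ == "__main__":
    cuda = FakeCuda(live_streams=set())   # no live streams in a CPU-only pipeline
    wait(Order(), Order(), cuda)          # first feed: host vs. host, no CUDA calls
    try:
        wait(Order(), Order(stream=7), cuda)  # second feed: leaked device order
    except RuntimeError as err:
        print("second wait failed:", err)
    print(cuda.calls)
```

In this model the first call returns without touching the fake runtime, while the second one reaches both cudaGetDevice and cudaStreamSynchronize before failing, which matches the pattern seen in the log below.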

Minimum reproducible example

Here's a repo which minimally reproduces the crash:
https://github.com/fversaci/feed-bug

Relevant log output

$ uv run main.py
Loaded plugin: /mnt/tdm-dic/users/cesco-safe/code/feed-bug/libfeedbug.so
First feed_input + run: OK
Traceback (most recent call last):
  File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/main.py", line 85, in main
    pipe.feed_input("Reader", batch2)  # ← crash here
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/.venv/lib/python3.13/site-packages/nvidia/dali/pipeline.py", line 1337, in feed_input
    self._feed_input(name, data, layout, cuda_stream, use_copy_kernel)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/.venv/lib/python3.13/site-packages/nvidia/dali/pipeline.py", line 1248, in _feed_input
    self._pipe.SetExternalTLInput(name, data, cuda_stream_ptr, use_copy_kernel)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA runtime API error cudaErrorInvalidResourceHandle (400):
invalid resource handle

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/main.py", line 99, in <module>
    main()
    ~~~~^^
  File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/main.py", line 90, in main
    raise RuntimeError(
    ...<4 lines>...
    ) from e
RuntimeError: BUG REPRODUCED: Second feed_input crashed.
This error is caused by DALI's AccessOrder::wait() when cudaGetDevice() is called in the host-order branch.

Original error: CUDA runtime API error cudaErrorInvalidResourceHandle (400):
invalid resource handle

Other/Misc.

No response

Check for duplicates

  • I have searched the open bugs/issues and have found no duplicates for this bug report

Labels

bug