CPU-only pipeline crashes on second feed_input due to cudaGetDevice() in AccessOrder::wait() host branch #6361

### Version

2.1

### Describe the bug.

I've been getting crashes with CUDA errors when feeding data into a CPU-only pipeline more than once:
```python
pipe.feed_input("Reader", batch1)
results1 = pipe.run()      # works
pipe.feed_input("Reader", batch2)  # ← crash here
results2 = pipe.run()
```
I had minimax 2.7 trace this crash, and it seems to originate from a bug in `AccessOrder::wait()`:

> A device-ordered `AccessOrder` leaks into a CPU-only pipeline when a GPU conversion (`.gpu()`) is chained downstream. On the second `feed_input`, `AccessOrder::wait()` is called with the `this` side in host order and `other` in device order, which enters the host branch and unconditionally calls `cudaGetDevice()`. On systems with CUDA installed, `cudaStreamSynchronize()` is then called on a stale device stream handle. The result is non-deterministic: a `RuntimeError`, silent segfault, or hang.

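To make the traced path easier to follow, here is a toy Python model of the control flow described above. It is only a sketch of my reading of the trace, not DALI's source: `FakeCudaRuntime`, `FakeCudaError`, `Order`, and `wait` are hypothetical names, and the stale stream handle is simulated by removing it from a set.

```python
from dataclasses import dataclass
from typing import Optional


class FakeCudaError(RuntimeError):
    """Stand-in for a CUDA runtime error (hypothetical, not a DALI/CUDA type)."""


class FakeCudaRuntime:
    """Hypothetical stand-in for the CUDA runtime; tracks live stream handles."""

    def __init__(self):
        self.live_streams = set()
        self.calls = []

    def get_device(self):
        self.calls.append("cudaGetDevice")
        return 0

    def stream_synchronize(self, stream):
        self.calls.append("cudaStreamSynchronize")
        if stream not in self.live_streams:
            raise FakeCudaError("invalid resource handle")


@dataclass(frozen=True)
class Order:
    """Toy access order: stream is None for host order, a handle for device order."""
    stream: Optional[int] = None

    @property
    def is_device(self):
        return self.stream is not None


def wait(this, other, cuda):
    """Models only the reported case: `this` in host order, `other` in device order."""
    if this == other:
        return
    if not this.is_device and other.is_device:
        # Host branch as described in the trace: query the device, then
        # synchronize on the other order's stream, whether or not it is still valid.
        cuda.get_device()
        cuda.stream_synchronize(other.stream)
    # The device-order branch is omitted; it is not part of the reported path.


cuda = FakeCudaRuntime()
cuda.live_streams.add(1)
host, leaked = Order(), Order(stream=1)

wait(host, leaked, cuda)       # first feed_input: the stream is still live
cuda.live_streams.discard(1)   # simulate the handle going stale

try:
    wait(host, leaked, cuda)   # second feed_input: same path, stale handle
    second_feed_error = None
except FakeCudaError as exc:
    second_feed_error = str(exc)

print(second_feed_error)
```

In this model the first call succeeds and the second raises, which mirrors the "works once, fails on the second `feed_input`" pattern. It says nothing about why the handle becomes stale in the real library.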

### Minimum reproducible example

Here's a repo that minimally reproduces the crash:
https://github.com/fversaci/feed-bug

### Relevant log output

```shell
$ uv run main.py
Loaded plugin: /mnt/tdm-dic/users/cesco-safe/code/feed-bug/libfeedbug.so
First feed_input + run: OK
Traceback (most recent call last):
  File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/main.py", line 85, in main
    pipe.feed_input("Reader", batch2)  # ← crash here
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^
  File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/.venv/lib/python3.13/site-packages/nvidia/dali/pipeline.py", line 1337, in feed_input
    self._feed_input(name, data, layout, cuda_stream, use_copy_kernel)
    ~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/.venv/lib/python3.13/site-packages/nvidia/dali/pipeline.py", line 1248, in _feed_input
    self._pipe.SetExternalTLInput(name, data, cuda_stream_ptr, use_copy_kernel)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA runtime API error cudaErrorInvalidResourceHandle (400):
invalid resource handle

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/main.py", line 99, in <module>
    main()
    ~~~~^^
  File "/mnt/tdm-dic/users/cesco-safe/code/feed-bug/main.py", line 90, in main
    raise RuntimeError(
    ...<4 lines>...
    ) from e
RuntimeError: BUG REPRODUCED: Second feed_input crashed.
This error is caused by DALI's AccessOrder::wait() when cudaGetDevice() is called in the host-order branch.

Original error: CUDA runtime API error cudaErrorInvalidResourceHandle (400):
invalid resource handle
```

### Other/Misc.

_No response_

### Check for duplicates

- [x] I have searched the [open bugs/issues](https://github.com/NVIDIA/DALI/issues) and have found no duplicates for this bug report
