[Perf] Streams 1: Add CUDA stream and event API #407
hughperkins wants to merge 2 commits into main from
Conversation
Introduces qd.create_stream() and qd.create_event() for launching kernels on separate CUDA streams with event-based synchronization. The qd_stream kwarg on kernel calls routes the launch to a specific stream. Non-CUDA backends return no-op handles (0). Kernel launcher memory ops are also routed through the active stream.
- Make CUDAContext::stream_ thread_local for thread-safety
- Convert sync memcpy_host_to_device to async on active_stream
- Use weakref in Stream/Event __del__ to safely handle interpreter shutdown
- Add __enter__/__exit__ context manager support for Stream and Event
- Use consistent qd_stream parameter naming in Event.record and Event.wait
- Add handle==0 guard to stream_synchronize
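The lifetime rules described above (handle 0 as a no-op, explicit destroy(), context-manager support) can be sketched in pure Python. This is an illustrative mock, not the real qd implementation; the names Stream, destroy_fn, and the handle values are assumptions for the sketch.

```python
# Minimal pure-Python sketch of the Stream wrapper semantics described
# above (illustrative names; the real class lives in the qd runtime).
# On non-CUDA backends create_stream() returns a no-op handle of 0.

class Stream:
    def __init__(self, handle, destroy_fn):
        self._handle = handle          # 0 means "no native stream" (no-op)
        self._destroy_fn = destroy_fn  # backend stream_destroy, injected here

    def destroy(self):
        if self._handle:               # guard: skip backend call when handle == 0
            self._destroy_fn(self._handle)
            self._handle = 0

    # Context-manager support, so cleanup is automatic:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.destroy()
        return False

destroyed = []
with Stream(handle=42, destroy_fn=destroyed.append) as s:
    pass                               # kernels would be launched on s here
# destroyed == [42]; a handle-0 stream would have skipped the backend call
```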
Review from Opus (written before the last commit above):

PR Review: Add CUDA Stream and Event API
Branch:
Summary: This PR introduces a CUDA stream and event API to enable concurrent kernel execution on separate GPU streams. It adds:
The design is clean and well-layered. On non-CUDA backends, everything degrades to no-ops (handle=0).
Issues and Concerns
1. Thread-safety of
    """Wraps a backend-specific GPU stream for concurrent kernel execution.

    On backends without native streams (e.g. CPU), this is a no-op object.
    Call destroy() explicitly or use as a context manager to ensure cleanup.
I would rather present it as usable only as a context manager, aligning with the torch.profiler API. Managing streams manually without a context sounds like bad practice, and the easy path should be the correct one.
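The context-manager-only design the comment suggests can be sketched with contextlib: the factory yields the stream and unconditionally destroys it on exit, so there is no manual lifetime to mismanage. The names create_stream, FakeBackend, and the backend methods are assumptions for this sketch, not the real qd API.

```python
import contextlib

# Sketch of a context-manager-only stream factory (illustrative names):
# the stream can only exist inside a `with` block and is always destroyed.
@contextlib.contextmanager
def create_stream(backend):
    handle = backend.stream_create()
    try:
        yield handle
    finally:
        backend.stream_destroy(handle)   # cleanup even if the body raises

class FakeBackend:
    """Stand-in backend that just tracks live stream handles."""
    def __init__(self):
        self.live = set()
    def stream_create(self):
        h = len(self.live) + 1
        self.live.add(h)
        return h
    def stream_destroy(self, h):
        self.live.discard(h)

backend = FakeBackend()
with create_stream(backend) as s:
    assert s in backend.live   # stream exists only inside the block
assert not backend.live        # and is destroyed on exit
```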
class Event:
    """Wraps a backend-specific GPU event for stream synchronization.
Could you clarify in the documentation what an "event" is? I have no idea what it is.
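For context: a CUDA event marks a point in one stream's command queue, and another stream can be told to wait until that point has been reached, which is how cross-stream ordering is expressed. The sketch below simulates the record/wait pattern with worker threads standing in for streams; all names here are illustrative, not the qd implementation.

```python
import queue
import threading

class SimStream:
    """Simulated stream: a worker thread draining a FIFO of work items."""
    def __init__(self):
        self.q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()
    def _run(self):
        while True:
            self.q.get()()        # execute enqueued work strictly in order
    def launch(self, fn):
        self.q.put(fn)

class SimEvent:
    """Simulated event: a marker one stream records and another waits on."""
    def __init__(self):
        self._reached = threading.Event()
    def record(self, stream):     # mark this point in `stream`'s queue
        stream.launch(self._reached.set)
    def wait(self, stream):       # make `stream` stall until the mark is hit
        stream.launch(self._reached.wait)

log, done = [], threading.Event()
a, b, ev = SimStream(), SimStream(), SimEvent()
a.launch(lambda: log.append("A: produce"))
ev.record(a)                      # event fires once A's work is done
ev.wait(b)                        # B blocks here until the event fires
b.launch(lambda: log.append("B: consume"))
b.launch(done.set)
done.wait(timeout=5)
# log == ["A: produce", "B: consume"] — B ran only after A finished
```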
        if self._handle != 0:
            prog = impl.get_runtime().prog
            prog.event_destroy(self._handle)
            self._handle = 0
    def __del__(self):
        if self._handle != 0 and self._prog_ref is not None:
Personally I prefer if self._handle:. It is semantically clearer; whether the handle is an int or some more complex object does not matter much.
        .def("stream_create", &Program::stream_create)
        .def("stream_destroy", &Program::stream_destroy)
        .def("stream_synchronize", &Program::stream_synchronize)
        .def("set_current_cuda_stream", &Program::set_current_cuda_stream)
        .def("event_create", &Program::event_create)
        .def("event_destroy", &Program::event_destroy)
        .def("event_record", &Program::event_record)
        .def("event_synchronize", &Program::event_synchronize)
        .def("stream_wait_event", &Program::stream_wait_event);
What is CUDA-specific here and what is not? Is only 'set_current_cuda_stream' CUDA-specific? If so, are streams still usable on other backends, or is this function required to make them useful?
// Stream management
PER_CUDA_FUNCTION(stream_create, cuStreamCreate, void **, uint32);
PER_CUDA_FUNCTION(stream_destroy, cuStreamDestroy_v2, void *);
What is 'cuStreamDestroy_v2'? Very odd name.
Why do we have functions with a '_v2' suffix in multiple places?
@@ -242,11 +242,11 @@ def fun(value: qd.types.ndarray(), offset: qd.template()):
     qd_init_same_arch(offline_cache_file_path=str(tmp_path), offline_cache=True)
     is_valid = False

-    def launch_kernel(self, key, t_kernel, compiled_kernel_data, *args):
+    def launch_kernel(self, key, t_kernel, compiled_kernel_data, *args, qd_stream=None):
         nonlocal is_valid
         is_valid = True
         assert compiled_kernel_data is not None
-        return launch_kernel_orig(self, key, t_kernel, compiled_kernel_data, *args)
+        return launch_kernel_orig(self, key, t_kernel, compiled_kernel_data, *args, qd_stream=qd_stream)
I would rather follow the existing pattern and move 'qd_stream' before *args.
Moreover, I see no reason to prefix the stream with qd. What does it mean? This is the quadrants project, so of course it is related to quadrants. It is just a GPU stream, no? Or is the prefix meant to clarify that it is a GPU computation stream rather than some other kind of stream? I don't think that is necessary: you are passing this to functions like 'launch_kernel', so of course it is about launching kernels.
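The wrapper pattern in the test diff above (intercept the launch, set a flag, forward everything including the new kwarg to the original function) can be shown in isolation. All names here are illustrative stand-ins, not the real qd launcher.

```python
# Sketch of the monkeypatch pattern from the test diff above: a wrapper
# that observes a launch and forwards the stream kwarg unchanged
# (illustrative names, not the real qd launcher).
def launch_kernel_orig(self, key, t_kernel, *args, qd_stream=None):
    # stand-in for the original launcher; just echoes what it received
    return ("launched", key, qd_stream)

seen = {}

def launch_kernel(self, key, t_kernel, *args, qd_stream=None):
    seen["valid"] = True              # plays the role of the test's is_valid flag
    assert t_kernel is not None
    # forward *args and the qd_stream kwarg to the original launcher
    return launch_kernel_orig(self, key, t_kernel, *args, qd_stream=qd_stream)

result = launch_kernel(None, "k0", object(), 1, 2, qd_stream="s1")
# result == ("launched", "k0", "s1") and seen["valid"] is True
```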
Lines of code added: +481 - 197 - 4 - 4 = +276
Issue: #