[Perf] Streams 1: Add CUDA stream and event API #407
hughperkins wants to merge 2 commits into main from
Conversation
Introduces qd.create_stream() and qd.create_event() for launching kernels on separate CUDA streams with event-based synchronization. The qd_stream kwarg on kernel calls routes the launch to a specific stream. Non-CUDA backends return no-op handles (0). Kernel launcher memory ops are also routed through the active stream.
- Make CUDAContext::stream_ thread_local for thread-safety
- Convert sync memcpy_host_to_device to async on active_stream
- Use weakref in Stream/Event __del__ to safely handle interpreter shutdown
- Add __enter__/__exit__ context manager support for Stream and Event
- Use consistent qd_stream parameter naming in Event.record and Event.wait
- Add handle==0 guard to stream_synchronize
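The lifetime rules described above (handle 0 as a no-op, explicit destroy(), context-manager support) can be sketched in pure Python. This is an illustrative mock, not the real qd implementation; the names Stream, destroy_fn, and the handle values are assumptions for the sketch.

```python
# Minimal pure-Python sketch of the Stream wrapper semantics described
# above (illustrative names; the real class lives in the qd runtime).
# On non-CUDA backends create_stream() returns a no-op handle of 0.

class Stream:
    def __init__(self, handle, destroy_fn):
        self._handle = handle          # 0 means "no native stream" (no-op)
        self._destroy_fn = destroy_fn  # backend stream_destroy, injected here

    def destroy(self):
        if self._handle:               # guard: skip backend call when handle == 0
            self._destroy_fn(self._handle)
            self._handle = 0

    # Context-manager support, so cleanup is automatic:
    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        self.destroy()
        return False

destroyed = []
with Stream(handle=42, destroy_fn=destroyed.append) as s:
    pass                               # kernels would be launched on s here
# destroyed == [42]; a handle-0 stream would have skipped the backend call
```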
Review from Opus (written before the last commit above):

PR Review: Add CUDA Stream and Event API
Branch:
Summary: This PR introduces a CUDA stream and event API to enable concurrent kernel execution on separate GPU streams. It adds:
The design is clean and well-layered. On non-CUDA backends, everything degrades to no-ops (handle=0).
Issues and Concerns
1. Thread-safety of
    """Wraps a backend-specific GPU stream for concurrent kernel execution.

    On backends without native streams (e.g. CPU), this is a no-op object.
    Call destroy() explicitly or use as a context manager to ensure cleanup.
I would rather present it as usable only as a context manager, aligning with the torch.profiler API. Managing streams manually without a context sounds like bad practice, and the easy path should be the correct one.
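The context-manager-only design the comment suggests can be sketched with contextlib: the factory yields the stream and unconditionally destroys it on exit, so there is no manual lifetime to mismanage. The names create_stream, FakeBackend, and the backend methods are assumptions for this sketch, not the real qd API.

```python
import contextlib

# Sketch of a context-manager-only stream factory (illustrative names):
# the stream can only exist inside a `with` block and is always destroyed.
@contextlib.contextmanager
def create_stream(backend):
    handle = backend.stream_create()
    try:
        yield handle
    finally:
        backend.stream_destroy(handle)   # cleanup even if the body raises

class FakeBackend:
    """Stand-in backend that just tracks live stream handles."""
    def __init__(self):
        self.live = set()
    def stream_create(self):
        h = len(self.live) + 1
        self.live.add(h)
        return h
    def stream_destroy(self, h):
        self.live.discard(h)

backend = FakeBackend()
with create_stream(backend) as s:
    assert s in backend.live   # stream exists only inside the block
assert not backend.live        # and is destroyed on exit
```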
class Event:
    """Wraps a backend-specific GPU event for stream synchronization.
Could you clarify in the documentation what an "event" is? I have no idea what it is.
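For context: a CUDA event marks a point in one stream's command queue, and another stream can be told to wait until that point has been reached, which is how cross-stream ordering is expressed. The sketch below simulates the record/wait pattern with worker threads standing in for streams; all names here are illustrative, not the qd implementation.

```python
import queue
import threading

class SimStream:
    """Simulated stream: a worker thread draining a FIFO of work items."""
    def __init__(self):
        self.q = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()
    def _run(self):
        while True:
            self.q.get()()        # execute enqueued work strictly in order
    def launch(self, fn):
        self.q.put(fn)

class SimEvent:
    """Simulated event: a marker one stream records and another waits on."""
    def __init__(self):
        self._reached = threading.Event()
    def record(self, stream):     # mark this point in `stream`'s queue
        stream.launch(self._reached.set)
    def wait(self, stream):       # make `stream` stall until the mark is hit
        stream.launch(self._reached.wait)

log, done = [], threading.Event()
a, b, ev = SimStream(), SimStream(), SimEvent()
a.launch(lambda: log.append("A: produce"))
ev.record(a)                      # event fires once A's work is done
ev.wait(b)                        # B blocks here until the event fires
b.launch(lambda: log.append("B: consume"))
b.launch(done.set)
done.wait(timeout=5)
# log == ["A: produce", "B: consume"] — B ran only after A finished
```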
        if self._handle != 0:
            prog = impl.get_runtime().prog
            prog.event_destroy(self._handle)
            self._handle = 0
    def __del__(self):
        if self._handle != 0 and self._prog_ref is not None:
Personally I prefer if self._handle:. It is semantically clearer; whether the handle is an int or some more complex object does not matter much.
        .def("stream_create", &Program::stream_create)
        .def("stream_destroy", &Program::stream_destroy)
        .def("stream_synchronize", &Program::stream_synchronize)
        .def("set_current_cuda_stream", &Program::set_current_cuda_stream)
        .def("event_create", &Program::event_create)
        .def("event_destroy", &Program::event_destroy)
        .def("event_record", &Program::event_record)
        .def("event_synchronize", &Program::event_synchronize)
        .def("stream_wait_event", &Program::stream_wait_event);
What is CUDA-specific here and what is not? Is only 'set_current_cuda_stream' CUDA-specific? If so, are streams still usable on other backends, or is this function required to make them useful?
// Stream management
PER_CUDA_FUNCTION(stream_create, cuStreamCreate, void **, uint32);
PER_CUDA_FUNCTION(stream_destroy, cuStreamDestroy_v2, void *);
What is 'cuStreamDestroy_v2'? Very odd name.
Why do we have functions with a '_v2' suffix in multiple places?
@@ -242,11 +242,11 @@ def fun(value: qd.types.ndarray(), offset: qd.template()):
     qd_init_same_arch(offline_cache_file_path=str(tmp_path), offline_cache=True)
     is_valid = False

-    def launch_kernel(self, key, t_kernel, compiled_kernel_data, *args):
+    def launch_kernel(self, key, t_kernel, compiled_kernel_data, *args, qd_stream=None):
         nonlocal is_valid
         is_valid = True
         assert compiled_kernel_data is not None
-        return launch_kernel_orig(self, key, t_kernel, compiled_kernel_data, *args)
+        return launch_kernel_orig(self, key, t_kernel, compiled_kernel_data, *args, qd_stream=qd_stream)
I would rather follow the existing pattern and move 'qd_stream' before *args.
Moreover, I see no reason to prefix the stream with qd. What does it mean? This is the quadrants project, so of course it is related to quadrants. It is just a GPU stream, no? Or is the prefix meant to clarify that it is a GPU computation stream rather than some other kind of stream? I don't think that is necessary: you are passing this to functions like 'launch_kernel', so of course it is about launching kernels.
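The wrapper pattern in the test diff above (intercept the launch, set a flag, forward everything including the new kwarg to the original function) can be shown in isolation. All names here are illustrative stand-ins, not the real qd launcher.

```python
# Sketch of the monkeypatch pattern from the test diff above: a wrapper
# that observes a launch and forwards the stream kwarg unchanged
# (illustrative names, not the real qd launcher).
def launch_kernel_orig(self, key, t_kernel, *args, qd_stream=None):
    # stand-in for the original launcher; just echoes what it received
    return ("launched", key, qd_stream)

seen = {}

def launch_kernel(self, key, t_kernel, *args, qd_stream=None):
    seen["valid"] = True              # plays the role of the test's is_valid flag
    assert t_kernel is not None
    # forward *args and the qd_stream kwarg to the original launcher
    return launch_kernel_orig(self, key, t_kernel, *args, qd_stream=qd_stream)

result = launch_kernel(None, "k0", object(), 1, 2, qd_stream="s1")
# result == ("launched", "k0", "s1") and seen["valid"] is True
```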
Lines of code added: +481 - 197 - 4 - 4 = +276
Issue: #