
[Perf] Streams 2: Add AMDGPU/HIP stream support#408

Draft
hughperkins wants to merge 8 commits into hp/streams-quadrantsic-1-cuda-streams from hp/streams-quadrantsic-2-amdgpu-cpu

Conversation

@hughperkins
Collaborator

Mirrors the CUDA stream implementation for HIP: adds a stream_ member to AMDGPUContext, adds stream_destroy/stream_wait_event/malloc_async/mem_free_async to the HIP driver functions, and adds AMDGPU branches to all Program stream/event methods. Converts the AMDGPU kernel launcher to use async memory operations through the active stream. The CPU backend returns 0 handles (no-op).

Issue: #


@hughperkins hughperkins marked this pull request as draft March 11, 2026 23:54
Batch the device_result_buffer free into the stream pipeline before the
sync barrier, matching the CUDA kernel launcher's ordering for
consistency and marginal performance improvement.
Use memcpy_host_to_device_async for external array transfers so they
are properly ordered on the active stream, matching the CUDA launcher.
Lower GPU speedup threshold from 1.5x to 1.3x to reduce flakiness in
CI under contention, and print actual timings for diagnostics.

Mirror the CUDA fixes: guard stream_synchronize against handle==0 to
avoid unintentional default stream sync, and make AMDGPUContext::stream_
thread_local for thread-safety.
@hughperkins
Collaborator Author

Opus review (written before last 6 commits above):

Code Review: hp/streams-quadrantsic-2-amdgpu-cpu

Summary

This PR adds AMDGPU (HIP) support for the stream and event APIs that were previously implemented only for CUDA. It touches four areas:

  1. program.cpp -- Mirrors every CUDA stream/event entry point with an AMDGPU equivalent under #ifdef QD_WITH_AMDGPU
  2. AMDGPU RHI layer (amdgpu_context.h/cpp, amdgpu_driver_functions.inc.h) -- Adds stream_ member, getter/setter, missing HIP driver function bindings, and wires the active stream through kernel launch
  3. AMDGPU kernel launcher -- Migrates from synchronous malloc/memcpy/mem_free to their async variants, stream-aware throughout
  4. Tests -- Extends test_create_and_destroy_stream and test_create_and_destroy_event to AMDGPU; adds a new concurrency timing test

The approach is sound: it closely mirrors the existing CUDA stream implementation, and the diff is consistent and well-scoped.


Architecture & Design

The pattern of duplicating each #ifdef QD_WITH_CUDA block with an #ifdef QD_WITH_AMDGPU block in program.cpp is consistent with the existing codebase style. However, this does produce a lot of nearly-identical code (9 blocks). This is a pre-existing design issue and not something to block the PR on, but it's worth noting that a future refactor toward a backend-agnostic GPUDriver interface could eliminate this duplication.

The AMDGPU kernel launcher now structurally matches the CUDA one closely, which is the right approach.


Issues

1. Operation ordering divergence from CUDA kernel launcher (minor)

In the CUDA kernel launcher, mem_free_async(device_result_buffer) is queued on the stream before the synchronization that follows the device-to-host transfers:

// quadrants/runtime/cuda/kernel_launcher.cpp:144-198
// ... kernel launches ...
if (ctx.arg_buffer_size > 0) {
    CUDADriver::get_instance().mem_free_async(device_arg_buffer, active_stream);
}
if (ctx.result_buffer_size > 0) {
    CUDADriver::get_instance().memcpy_device_to_host_async(
        host_result_buffer, device_result_buffer, ctx.result_buffer_size,
        active_stream);
}
CUDADriver::get_instance().mem_free_async(device_result_buffer,
                                          active_stream);
// copy data back to host
if (transfers.size() > 0) {
    CUDADriver::get_instance().stream_synchronize(active_stream);

In the AMDGPU version, mem_free_async(device_result_buffer) comes after the transfers sync:

// quadrants/runtime/amdgpu/kernel_launcher.cpp:168-180
if (transfers.size()) {
    AMDGPUDriver::get_instance().stream_synchronize(active_stream);
    for (auto itr = transfers.begin(); itr != transfers.end(); itr++) {
        // ...
    }
}
AMDGPUDriver::get_instance().mem_free_async(device_result_buffer,
                                            active_stream);

This isn't a correctness bug, but it's a minor inefficiency: in the CUDA version, the free is batched into the stream pipeline before the sync barrier; in the AMDGPU version, the sync has already drained the stream, so mem_free_async runs on an idle stream. Consider matching the CUDA ordering for consistency and marginal performance.

2. memcpy_host_to_device for external arrays is still synchronous

// quadrants/runtime/amdgpu/kernel_launcher.cpp:71-72
AMDGPUDriver::get_instance().memcpy_host_to_device(
    (void *)device_ptrs[data_ptr_idx], data_ptr, arr_sz);

The external array transfer path still uses synchronous memcpy_host_to_device, which will operate on the default stream rather than active_stream. This is consistent with the CUDA launcher (which does the same), but worth noting: if a user sets a custom stream and passes host-backed external arrays, the transfer won't be ordered with respect to that stream. This is a pre-existing limitation in both backends.

3. free_kernel_arg_pointer uses synchronous mem_free on async-allocated memory

// quadrants/rhi/amdgpu/amdgpu_context.h:44-50
void free_kernel_arg_pointer() {
    for (auto &i : kernel_arg_pointer_) {
        AMDGPUDriver::get_instance().mem_free(i);
    }
    kernel_arg_pointer_.erase(kernel_arg_pointer_.begin(),
                              kernel_arg_pointer_.end());
}

context_pointer is allocated with malloc_async on active_stream but freed later via mem_free (synchronous). Mixing is allowed by HIP, and the synchronous free will implicitly wait, so this isn't a correctness issue. But if free_kernel_arg_pointer is ever called while the kernel is still in flight on a non-default stream, the synchronous free will stall. This is pre-existing and not introduced by this PR.

4. hipMallocAsync / hipFreeAsync minimum ROCm version requirement

The old code had a comment noting "Malloc_Async and Free_Async are available after ROCm 5.4". This comment was removed in the diff (which is fine since it was in the wrong place), but there's no build-system or runtime check ensuring the user's HIP runtime supports these functions. If someone runs this on ROCm < 5.4, the PER_AMDGPU_FUNCTION dynamic symbol lookup will fail at runtime. Consider documenting the minimum ROCm version requirement, or adding a graceful fallback.


Test Feedback

5. Timing-based concurrency test may be flaky in CI

The test_concurrent_streams_with_events test asserts speedup > 1.5 on GPU. Timing-based tests are inherently fragile, especially in shared CI environments or on GPUs under contention. The warmup and JIT compilation are handled well, but 5 million spin iterations may not produce enough wall-clock time for a stable ratio on very fast GPUs (where launch overhead dominates), or could be too slow on older GPUs.

Consider:

  • Logging the actual timings (serial/stream) for diagnostic purposes even on pass
  • Using a more relaxed threshold (e.g., > 1.3) to reduce flakiness
  • Adding a pytest.mark.slow or similar marker if this isn't intended to run in every CI pass

6. test_concurrent_streams_with_events uses @test_utils.test() (all archs) including CPU

The test correctly handles CPU with a relaxed speedup > 0.75 check. However, the CPU path still runs two heavy spin loops (5M iterations each) serially, which adds significant wall time to the CPU test suite. If this is a concern, you could restrict the test to GPU-only and keep the CPU path in a lighter test.


Nits

  • The removed comment block in kernel_launcher.cpp (lines 37-44 in the old code) was verbose but actually contained useful context about why device_result_buffer must always be allocated. The replacement comment is more concise but could lose the specific error message context. The new comment is adequate.

  • In program.cpp, the function set_current_cuda_stream is semantically misnamed for AMDGPU -- it sets the HIP stream, not a CUDA stream. This is pre-existing (the function name comes from the base branch), but might be worth a follow-up rename to something like set_current_gpu_stream.


Verdict

The PR is clean, well-structured, and correctly mirrors the CUDA stream implementation for AMDGPU. The core changes (async memory operations, stream plumbing through context and kernel launch, driver function bindings) are correct. The issues raised are mostly minor consistency/efficiency points and pre-existing design concerns. I'd recommend addressing items 1 (ordering) and 4 (ROCm version) before merging; the rest can be tracked as follow-ups.

@hughperkins
Collaborator Author

I'm reverting the relaxation of the timing threshold in the tests. I think 1.5x is already pretty relaxed.

@hughperkins hughperkins marked this pull request as ready for review March 12, 2026 01:24
@hughperkins hughperkins changed the title [Perf] Streams part 2: Add AMDGPU/HIP stream support [Perf] Streams 2: Add AMDGPU/HIP stream support Mar 12, 2026
@hughperkins hughperkins marked this pull request as draft March 12, 2026 04:59
Comment on lines +128 to +132
PER_AMDGPU_FUNCTION(stream_wait_event,
hipStreamWaitEvent,
void *,
void *,
uint32);
Contributor

Maybe it is time to increase linewidth of C++ code. 80 chars is painful nowadays. I would recommend either 100 or 120, as a matter of taste.

Comment on lines -36 to -43
// Here we have to guarantee the result_result_buffer isn't nullptr
// It is interesting - The code following
// L60: DeviceAllocation devalloc =
// executor->allocate_memory_on_device( call another kernel and it will result
// in
// Memory access fault by GPU node-1 (Agent handle: 0xeda5ca0) on address
// (nil). Reason: Page not present or supervisor privilege.
// if you don't allocate it.
Contributor

Why did you remove this comment? It is no longer applicable? It is irrelevant?
