
[Perf] Streams 2: Add AMDGPU/HIP stream support#408

Draft
hughperkins wants to merge 8 commits into hp/streams-quadrantsic-1-cuda-streams from hp/streams-quadrantsic-2-amdgpu-cpu

Conversation

@hughperkins
Collaborator

Mirrors the CUDA stream implementation for HIP: adds a stream_ member to AMDGPUContext, adds stream_destroy/stream_wait_event/malloc_async/mem_free_async to the HIP driver functions, and adds AMDGPU branches to all Program stream/event methods. Converts the AMDGPU kernel launcher to use async memory operations through the active stream. The CPU backend returns 0 handles (no-op).

Issue: #


@hughperkins hughperkins marked this pull request as draft March 11, 2026 23:54
Batch the device_result_buffer free into the stream pipeline before the
sync barrier, matching the CUDA kernel launcher's ordering for
consistency and marginal performance improvement.
Use memcpy_host_to_device_async for external array transfers so they
are properly ordered on the active stream, matching the CUDA launcher.
Lower GPU speedup threshold from 1.5x to 1.3x to reduce flakiness in
CI under contention, and print actual timings for diagnostics.

Mirror the CUDA fixes: guard stream_synchronize against handle==0 to
avoid unintentional default stream sync, and make AMDGPUContext::stream_
thread_local for thread-safety.
@hughperkins
Collaborator Author

Opus review (written before last 6 commits above):

Code Review: hp/streams-quadrantsic-2-amdgpu-cpu

Summary

This PR adds AMDGPU (HIP) support for the stream and event APIs that were previously implemented only for CUDA. It touches four areas:

  1. program.cpp -- Mirrors every CUDA stream/event entry point with an AMDGPU equivalent under #ifdef QD_WITH_AMDGPU
  2. AMDGPU RHI layer (amdgpu_context.h/cpp, amdgpu_driver_functions.inc.h) -- Adds stream_ member, getter/setter, missing HIP driver function bindings, and wires the active stream through kernel launch
  3. AMDGPU kernel launcher -- Migrates from synchronous malloc/memcpy/mem_free to their async variants, stream-aware throughout
  4. Tests -- Extends test_create_and_destroy_stream and test_create_and_destroy_event to AMDGPU; adds a new concurrency timing test

The approach is sound: it closely mirrors the existing CUDA stream implementation, and the diff is consistent and well-scoped.


Architecture & Design

The pattern of duplicating each #ifdef QD_WITH_CUDA block with an #ifdef QD_WITH_AMDGPU block in program.cpp is consistent with the existing codebase style. However, this does produce a lot of nearly-identical code (9 blocks). This is a pre-existing design issue and not something to block the PR on, but it's worth noting that a future refactor toward a backend-agnostic GPUDriver interface could eliminate this duplication.

The AMDGPU kernel launcher now structurally matches the CUDA one closely, which is the right approach.


Issues

1. Operation ordering divergence from CUDA kernel launcher (minor)

In the CUDA kernel launcher, mem_free_async(device_result_buffer) is queued on the stream before the synchronization that follows the device-to-host transfers:

// quadrants/runtime/cuda/kernel_launcher.cpp:144-198
// ... kernel launches ...
if (ctx.arg_buffer_size > 0) {
    CUDADriver::get_instance().mem_free_async(device_arg_buffer, active_stream);
}
if (ctx.result_buffer_size > 0) {
    CUDADriver::get_instance().memcpy_device_to_host_async(
        host_result_buffer, device_result_buffer, ctx.result_buffer_size,
        active_stream);
}
CUDADriver::get_instance().mem_free_async(device_result_buffer,
                                          active_stream);
// copy data back to host
if (transfers.size() > 0) {
    CUDADriver::get_instance().stream_synchronize(active_stream);

In the AMDGPU version, mem_free_async(device_result_buffer) comes after the transfers sync:

// quadrants/runtime/amdgpu/kernel_launcher.cpp:168-180
if (transfers.size()) {
    AMDGPUDriver::get_instance().stream_synchronize(active_stream);
    for (auto itr = transfers.begin(); itr != transfers.end(); itr++) {
        // ...
    }
}
AMDGPUDriver::get_instance().mem_free_async(device_result_buffer,
                                            active_stream);

This isn't a correctness bug, but it's a minor inefficiency: in the CUDA version, the free is batched into the stream pipeline before the sync barrier; in the AMDGPU version, the sync has already drained the stream, so mem_free_async runs on an idle stream. Consider matching the CUDA ordering for consistency and marginal performance.

2. memcpy_host_to_device for external arrays is still synchronous

// quadrants/runtime/amdgpu/kernel_launcher.cpp:71-72
AMDGPUDriver::get_instance().memcpy_host_to_device(
    (void *)device_ptrs[data_ptr_idx], data_ptr, arr_sz);

The external array transfer path still uses synchronous memcpy_host_to_device, which will operate on the default stream rather than active_stream. This is consistent with the CUDA launcher (which does the same), but worth noting: if a user sets a custom stream and passes host-backed external arrays, the transfer won't be ordered with respect to that stream. This is a pre-existing limitation in both backends.

3. free_kernel_arg_pointer uses synchronous mem_free on async-allocated memory

// quadrants/rhi/amdgpu/amdgpu_context.h:44-50
void free_kernel_arg_pointer() {
    for (auto &i : kernel_arg_pointer_) {
        AMDGPUDriver::get_instance().mem_free(i);
    }
    kernel_arg_pointer_.erase(kernel_arg_pointer_.begin(),
                              kernel_arg_pointer_.end());
}

context_pointer is allocated with malloc_async on active_stream but freed later via mem_free (synchronous). Mixing is allowed by HIP, and the synchronous free will implicitly wait, so this isn't a correctness issue. But if free_kernel_arg_pointer is ever called while the kernel is still in flight on a non-default stream, the synchronous free will stall. This is pre-existing and not introduced by this PR.

4. hipMallocAsync / hipFreeAsync minimum ROCm version requirement

The old code had a comment noting "Malloc_Async and Free_Async are available after ROCm 5.4". This comment was removed in the diff (which is fine since it was in the wrong place), but there's no build-system or runtime check ensuring the user's HIP runtime supports these functions. If someone runs this on ROCm < 5.4, the PER_AMDGPU_FUNCTION dynamic symbol lookup will fail at runtime. Consider documenting the minimum ROCm version requirement, or adding a graceful fallback.


Test Feedback

5. Timing-based concurrency test may be flaky in CI

The test_concurrent_streams_with_events test asserts speedup > 1.5 on GPU. Timing-based tests are inherently fragile, especially in shared CI environments or on GPUs under contention. The warmup and JIT compilation are handled well, but 5 million spin iterations may not produce enough wall-clock time for a stable ratio on very fast GPUs (where launch overhead dominates), or could be too slow on older GPUs.

Consider:

  • Logging the actual timings (serial/stream) for diagnostic purposes even on pass
  • Using a more relaxed threshold (e.g., > 1.3) to reduce flakiness
  • Adding a pytest.mark.slow or similar marker if this isn't intended to run in every CI pass

6. test_concurrent_streams_with_events uses @test_utils.test() (all archs) including CPU

The test correctly handles CPU with a relaxed speedup > 0.75 check. However, the CPU path still runs two heavy spin loops (5M iterations each) serially, which adds significant wall time to the CPU test suite. If this is a concern, you could restrict the test to GPU-only and keep the CPU path in a lighter test.


Nits

  • The removed comment block in kernel_launcher.cpp (lines 37-44 in the old code) was verbose but actually contained useful context about why device_result_buffer must always be allocated. The replacement comment is more concise but could lose the specific error message context. The new comment is adequate.

  • In program.cpp, the function set_current_cuda_stream is semantically misnamed for AMDGPU -- it sets the HIP stream, not a CUDA stream. This is pre-existing (the function name comes from the base branch), but might be worth a follow-up rename to something like set_current_gpu_stream.


Verdict

The PR is clean, well-structured, and correctly mirrors the CUDA stream implementation for AMDGPU. The core changes (async memory operations, stream plumbing through context and kernel launch, driver function bindings) are correct. The issues raised are mostly minor consistency/efficiency points and pre-existing design concerns. I'd recommend addressing items 1 (ordering) and 4 (ROCm version) before merging; the rest can be tracked as follow-ups.

@hughperkins
Collaborator Author

I'm reverting the relaxation of the timing threshold in the tests. I think 1.5x is already pretty relaxed.

@hughperkins hughperkins marked this pull request as ready for review March 12, 2026 01:24
@hughperkins hughperkins changed the title [Perf] Streams part 2: Add AMDGPU/HIP stream support [Perf] Streams 2: Add AMDGPU/HIP stream support Mar 12, 2026
@hughperkins hughperkins marked this pull request as draft March 12, 2026 04:59
Comment on lines +128 to +132
PER_AMDGPU_FUNCTION(stream_wait_event,
hipStreamWaitEvent,
void *,
void *,
uint32);
Contributor

Maybe it is time to increase linewidth of C++ code. 80 chars is painful nowadays. I would recommend either 100 or 120, as a matter of taste.

Comment on lines -36 to -43
// Here we have to guarantee the result_result_buffer isn't nullptr
// It is interesting - The code following
// L60: DeviceAllocation devalloc =
// executor->allocate_memory_on_device( call another kernel and it will result
// in
// Memory access fault by GPU node-1 (Agent handle: 0xeda5ca0) on address
// (nil). Reason: Page not present or supervisor privilege.
// if you don't allocate it.
Contributor

Why did you remove this comment? It is no longer applicable? It is irrelevant?
