[Pef] CUDA graph 4: call from multiple locations #420

Draft
hughperkins wants to merge 194 commits into main from hp/cuda-graph-mvp-4-handle-ndarray-change-2

Conversation

@hughperkins
Collaborator

Issue: #

Brief Summary

When multiple locations call the same graph, the counter ndarray will very likely be a physically different object at each calling site.

Prior to this PR, this caused an exception.

This PR makes it possible to pass in a different ndarray object without triggering a recompile, throwing an exception, or producing incorrect results (the three possible outcomes before this fix).

copilot:summary

Walkthrough

copilot:walkthrough

When QD_CUDA_GRAPH=1, kernels with 2+ top-level for loops (offloaded
tasks) are captured into a CUDA graph on first launch and replayed on
subsequent launches, eliminating per-kernel launch overhead.

Uses the explicit graph node API (cuGraphAddKernelNode) with persistent
device arg/result buffers. Assumes stable ndarray device pointers.
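
The capture-once / replay-many behavior can be sketched with a small pure-Python cache model. `GraphCache`, `launch`, and `total_builds` are illustrative stand-ins for this sketch, not the real launcher API:

```python
# Minimal pure-Python model of "capture on first launch, replay after":
# the first launch of a kernel builds a graph; later launches reuse it.

class GraphCache:
    def __init__(self):
        self.graphs = {}       # kernel name -> "captured graph"
        self.total_builds = 0  # how many captures actually happened

    def launch(self, kernel_name, body, args):
        if kernel_name not in self.graphs:
            # First launch: "capture" the kernel body into a graph.
            self.graphs[kernel_name] = body
            self.total_builds += 1
        # Every launch, including the first, executes via the graph.
        return self.graphs[kernel_name](*args)

cache = GraphCache()
double = lambda xs: [2 * x for x in xs]
out1 = cache.launch("double", double, ([1, 2, 3],))
out2 = cache.launch("double", double, ([4, 5],))
# Two launches, but only one graph build.
```
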

Made-with: Cursor

Replace the global QD_CUDA_GRAPH=1 env var with a per-kernel opt-in.
The flag flows from the Python decorator through LaunchContextBuilder
to the CUDA kernel launcher, avoiding interference with internal
kernels like ndarray_to_ext_arr.

Made-with: Cursor

Verify that cuda_graph=True is a harmless no-op on non-CUDA backends
(tested on x64/CPU). Passes on both x64 and CUDA.

Made-with: Cursor

On each graph replay, re-resolve ndarray device pointers and re-upload
the arg buffer to the persistent device buffer. This ensures correct
results when the kernel is called with different ndarrays after the
graph was first captured.

Refactored ndarray pointer resolution into resolve_ctx_ndarray_ptrs().
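
A rough pure-Python model of this re-resolution step (only the helper name `resolve_ctx_ndarray_ptrs` comes from the commit above; `FakeNdarray`, the helper's signature, and `CachedGraph` are illustrative):

```python
# Sketch: before every replay, current device pointers are resolved and
# written into the persistent arg buffer, so a graph captured with one
# ndarray still computes on whichever ndarray is passed later.

class FakeNdarray:
    _next_addr = 0x1000
    def __init__(self, data):
        self.data = data
        self.device_ptr = FakeNdarray._next_addr  # stand-in for a CUDA pointer
        FakeNdarray._next_addr += 0x100

def resolve_ctx_ndarray_ptrs(args):
    """Collect the current device pointer of every ndarray argument."""
    return [a.device_ptr for a in args]

class CachedGraph:
    def __init__(self):
        self.persistent_arg_buffer = []  # stand-in for the device arg buffer

    def launch(self, args):
        # Re-upload freshly resolved pointers before each replay.
        self.persistent_arg_buffer = resolve_ctx_ndarray_ptrs(args)
        return self.persistent_arg_buffer

g = CachedGraph()
a, b = FakeNdarray([1]), FakeNdarray([2])
first = g.launch([a])
second = g.launch([b])   # different ndarray: buffer updated, no rebuild
```
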

Made-with: Cursor

Apply lint formatting fixes (clang-format, ruff) and remove
cuda_graph flag from autodiff adjoint kernel until the interaction
with reverse-mode AD is validated.

Implements @qd.kernel(graph_while='flag_arg') which wraps the kernel
offloaded tasks in a CUDA conditional while node (requires SM 9.0+).
The named argument is a scalar i32 ndarray on device; the loop
continues while its value is non-zero.

Key implementation details:
- Condition kernel compiled as PTX and JIT-linked with libcudadevrt.a
  at runtime to access cudaGraphSetConditional device function
- CU_GRAPH_COND_ASSIGN_DEFAULT flag ensures handle is reset each launch
- Works with both counter-based (decrement to 0) and boolean flag
  (set to 0 when done) patterns
- graph_while implicitly enables cuda_graph=True

Tests: counter, boolean done flag, multiple loops, graph replay.
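
The loop semantics can be modeled in plain Python. `run_graph_while` is an illustrative stand-in for the captured conditional graph, and a 1-element list stands in for the scalar i32 ndarray:

```python
# Pure-Python model of graph_while control flow: the offloaded tasks run
# at least once (do-while), and the loop continues while the named flag
# is non-zero.

def run_graph_while(body, flag):
    """flag is a 1-element list standing in for a scalar i32 ndarray."""
    iterations = 0
    while True:
        body(flag)           # body always executes at least once
        iterations += 1
        if flag[0] == 0:     # condition node: loop while non-zero
            break
    return iterations

# Counter pattern: decrement to zero.
counter = [3]
n_counter = run_graph_while(lambda f: f.__setitem__(0, f[0] - 1), counter)

# Boolean done-flag pattern: set the flag to 0 when finished.
state = {"steps": 0}
def step(flag):
    state["steps"] += 1
    if state["steps"] >= 2:
        flag[0] = 0
done = [1]
n_flag = run_graph_while(step, done)
```

Note the do-while constraint mentioned above: even a flag that starts at zero would still run the body once.
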
…allback

The graph_while_arg_id was computed using Python-level parameter indices,
which is wrong when struct parameters are flattened into many C++ args
(e.g. Genesis solver has 40 C++ params from 6 Python params). Now tracks
the flattened C++ arg index during launch context setup and caches it.

Also adds C++ do-while fallback loops for CPU, CUDA (non-graph path), and
AMDGPU backends so graph_while works identically on all platforms.
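
A toy illustration of the index mismatch described above (parameter names and per-parameter field counts here are made up):

```python
# Why Python-level parameter indices go wrong: a struct parameter
# flattens into several C++ args, so the flag's C++ arg index must be
# tracked during flattening rather than taken from the Python signature.

def flattened_arg_index(params, flag_name):
    """params: list of (name, number_of_cpp_args_after_flattening)."""
    cpp_index = 0
    for name, width in params:
        if name == flag_name:
            return cpp_index
        cpp_index += width
    raise ValueError(f"graph_while arg {flag_name!r} not found")

params = [("solver_state", 7), ("grid", 3), ("steps_left", 1)]
# Python-level index of 'steps_left' is 2, but its C++ arg index is 10.
idx = flattened_arg_index(params, "steps_left")
```
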
Falls back to non-graph path with a warning on pre-Hopper GPUs,
instead of failing with an unhelpful JIT link error.

Checks env-var-derived paths before the hardcoded fallbacks, so
custom toolkit installs (e.g. conda, non-default prefix) are found.

Document cuda_graph=True and graph_while API in kernel() docstring,
and add a user guide page covering usage patterns, cross-platform
behavior, and the do-while semantics constraint.

The graph path doesn't copy the result buffer back to the host,
so struct returns would silently return stale data. Error early
instead of producing wrong results.

Verifies that calling a cuda_graph=True kernel first with small
arrays then with larger ones produces correct results for all
elements — catches stale grid dims if the graph were incorrectly
replayed from the first capture.

Re-add documentation comments for |transfers|, |device_ptrs|,
zero-sized array handling, external array logic, and the
host copy-back section in the non-graph launch path.

Verify that a cuda_graph=True kernel works correctly after a
reset/reinit cycle — exercises the full teardown and rebuild
of the KernelLauncher and its graph cache.

The condition kernel's flag pointer was baked into the CUDA graph at
creation time. Passing a different ndarray on replay would cause the
condition kernel to read from a stale device address. Invalidate the
cached graph when the flag pointer changes so it gets rebuilt.

Raise ValueError immediately if the graph_while name doesn't match any
kernel parameter, instead of silently running the kernel once without
looping. Also document the CUDA API version for CudaGraphNodeParams.

Add get_cuda_graph_cache_size() through the KernelLauncher -> Program ->
pybind chain so tests can verify that graphs are actually being created
(or not) rather than only checking output correctness.

Made-with: Cursor

Tracks whether the CUDA graph cache was used on the most recent kernel
launch, exposed through KernelLauncher -> Program -> pybind so tests
can assert the graph path was (or was not) taken.

Made-with: Cursor

Every test now verifies graph caching behavior, not just output
correctness. Cross-platform test uses platform_supports_graph to
make assertions conditional on the backend.

Made-with: Cursor
@hughperkins hughperkins changed the base branch from main to hp/cuda-graph-mvp-3-add-fallback March 16, 2026 16:57
.reg .b64 %rd<5>;

// Load the two kernel parameters into registers:
// %rd1 = conditional node handle
Collaborator Author

Let's get these comments back.

@hughperkins hughperkins force-pushed the hp/cuda-graph-mvp-4-handle-ndarray-change-2 branch from 36d6d2d to fd78ff6 Compare March 16, 2026 17:08
selp.u32 %r2, 1, 0, %p1;

// Tell the conditional while node whether to loop again or stop.
// cudaGraphSetConditional(handle, should_continue)
Collaborator Author

Let's get this comment back.

@hughperkins hughperkins force-pushed the hp/cuda-graph-mvp-4-handle-ndarray-change-2 branch from fd78ff6 to 7f03036 Compare March 16, 2026 17:10
"Reuse the same ndarray for the condition parameter across calls.");
if (use_graph_do_while && cached.counter_ptr_slot) {
void *flag_ptr = ctx.graph_do_while_flag_dev_ptr;
CUDADriver::get_instance().memcpy_host_to_device(cached.counter_ptr_slot,
Collaborator Author

I wonder if this could/should be async?

Use device-side pointer indirection so the condition kernel reads the
counter address through a persistent slot. Updating the slot via memcpy
before each launch lets different ndarrays be used without rebuilding
the CUDA graph.

Replaces the previous error ("condition ndarray changed between calls")
with transparent support for swapping.
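
The indirection scheme can be modeled with a fake flat memory in Python; the addresses and names here are illustrative, not real device pointers:

```python
# Pure-Python model of device-side pointer indirection: the graph bakes
# in the address of a persistent slot, and the slot holds the current
# flag address. Swapping ndarrays only rewrites the slot, so the graph
# never needs rebuilding.

memory = {}          # fake device memory: address -> value

SLOT = 0x10          # persistent slot; this address is baked into the graph
memory[SLOT] = 0

def condition_kernel():
    flag_addr = memory[SLOT]        # indirection: slot -> current flag addr
    return memory[flag_addr] != 0   # then read the flag value itself

def launch(flag_addr):
    memory[SLOT] = flag_addr        # cheap memcpy before each launch
    return condition_kernel()

memory[0x100] = 1    # counter ndarray at one calling site
memory[0x200] = 0    # a different ndarray at another calling site
r1 = launch(0x100)   # reads non-zero: keep looping
r2 = launch(0x200)   # reads zero: stop
```

This also shows why the per-launch memcpy in the review comment above is on the critical path: the slot must hold the right address before the condition kernel runs.
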
@hughperkins hughperkins force-pushed the hp/cuda-graph-mvp-4-handle-ndarray-change-2 branch from 7f03036 to 9de68d5 Compare March 16, 2026 17:16
// Allocate a persistent device-side pointer slot and write the initial
// counter address into it. The condition kernel reads through this slot,
// so swapping the counter ndarray later only requires updating the slot.
CUDADriver::get_instance().malloc(&cached.counter_ptr_slot, sizeof(void *));
Collaborator Author

Hard to tell if this is OK. Let's move this to the constructor of CachedCudaGraph, perhaps?

Introduce CudaDeviceBuffer, an RAII wrapper around CUDADriver::malloc/mem_free,
replacing raw void*/char* pointers for persistent_device_arg_buffer,
persistent_device_result_buffer, and counter_ptr_slot. Add a parameterized
CachedCudaGraph constructor that allocates all device buffers upfront,
eliminating scattered malloc calls in try_launch.
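
A rough Python analogue of the RAII idea (the real class is C++; `driver_malloc` and `driver_free` here are stand-ins for CUDADriver::malloc/mem_free, and the context-manager form is Python's closest equivalent):

```python
# Tie a device allocation's lifetime to an object so the free can never
# be forgotten, mirroring what an RAII wrapper like CudaDeviceBuffer does.

live_allocations = set()

def driver_malloc(size):
    addr = 0x1000 + 0x100 * len(live_allocations)
    live_allocations.add(addr)
    return addr

def driver_free(addr):
    live_allocations.discard(addr)

class DeviceBuffer:
    def __init__(self, size):
        self.addr = driver_malloc(size)
    def __enter__(self):
        return self
    def __exit__(self, *exc):
        driver_free(self.addr)

with DeviceBuffer(8) as arg_buf, DeviceBuffer(8) as result_buf:
    in_scope = len(live_allocations)    # both buffers live here
after_scope = len(live_allocations)     # both freed on scope exit
```
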

Move device buffer allocation (arg, result, counter_ptr_slot) and
RuntimeContext setup into a new constructor, removing scattered malloc
calls from try_launch.

env.sh is generated by ./build.py and should not be tracked.
…p-4-handle-ndarray-change-2

# Conflicts:
#	env.sh
…raph-while

# Conflicts:
#	.github/workflows/test_gpu.yml
c2.from_numpy(np.array(1, dtype=np.int32))
with pytest.raises(RuntimeError, match="condition ndarray changed"):

for iteration in range(3):
Collaborator Author

I think we should check that we aren't simply rebuilding the graph on each call.

Collaborator Author

added _cuda_graph_total_builds assert

@hughperkins hughperkins changed the title [Pef] CUDA graph 4: handle multiple locations calling same graph [Pef] CUDA graph 4: call from multiple locations Mar 16, 2026
…dd-fallback

# Conflicts:
#	docs/source/user_guide/cuda_graph.md
#	python/quadrants/lang/misc.py
#	quadrants/runtime/amdgpu/kernel_launcher.cpp
#	quadrants/runtime/cpu/kernel_launcher.cpp
#	quadrants/runtime/cuda/cuda_graph_manager.cpp
#	tests/python/test_cuda_graph_do_while.py
Base automatically changed from hp/cuda-graph-mvp-3-add-fallback to main March 16, 2026 19:46