[Perf] CUDA graph 3: add fallbacks #416

Merged
hughperkins merged 182 commits into main from
hp/cuda-graph-mvp-3-add-fallback
Mar 16, 2026
Conversation

@hughperkins (Collaborator)

  • add fallbacks on non-CUDA platforms

Made-with: Cursor

Issue: #


When QD_CUDA_GRAPH=1, kernels with 2+ top-level for loops (offloaded
tasks) are captured into a CUDA graph on first launch and replayed on
subsequent launches, eliminating per-kernel launch overhead.

Uses the explicit graph node API (cuGraphAddKernelNode) with persistent
device arg/result buffers. Assumes stable ndarray device pointers.
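The capture-once / replay-after behavior can be modeled in plain Python (the real implementation uses the CUDA driver graph API; the `KernelLauncher` shape here is illustrative, not the actual quadrants code):

```python
class KernelLauncher:
    """Toy model: the first launch 'captures' the work, later launches replay it."""
    def __init__(self):
        self.graph = None
        self.captures = 0
        self.replays = 0

    def launch(self, tasks, args):
        if self.graph is None:
            self.graph = list(tasks)   # capture all top-level tasks once
            self.captures += 1
        else:
            self.replays += 1          # replay: no per-task launch overhead
        result = args
        for task in self.graph:
            result = task(result)
        return result

launcher = KernelLauncher()
double = lambda xs: [2 * x for x in xs]
inc = lambda xs: [x + 1 for x in xs]
out1 = launcher.launch([double, inc], [1, 2])   # first call: capture
out2 = launcher.launch([double, inc], [3])      # second call: replay
```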

Made-with: Cursor
Replace the global QD_CUDA_GRAPH=1 env var with a per-kernel opt-in.
The flag flows from the Python decorator through LaunchContextBuilder
to the CUDA kernel launcher, avoiding interference with internal
kernels like ndarray_to_ext_arr.
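A minimal sketch of how a per-kernel opt-in flag can flow from a decorator down to a launcher, so internal kernels stay on the plain path (names `kernel`, `cuda_graph`, and `Launcher` are illustrative, not the actual quadrants API):

```python
import functools

class Launcher:
    """Toy launcher: only kernels that opted in take the graph path."""
    def __init__(self):
        self.graph_launches = 0
        self.plain_launches = 0

    def launch(self, fn, use_cuda_graph, *args):
        if use_cuda_graph:
            self.graph_launches += 1   # capture/replay path
        else:
            self.plain_launches += 1   # ordinary launch path
        return fn(*args)

LAUNCHER = Launcher()

def kernel(fn=None, *, cuda_graph=False):
    """Decorator that carries a per-kernel flag down to the launcher."""
    def wrap(f):
        @functools.wraps(f)
        def runner(*args):
            return LAUNCHER.launch(f, cuda_graph, *args)
        return runner
    return wrap(fn) if fn is not None else wrap

@kernel(cuda_graph=True)
def saxpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

@kernel
def internal_copy(x):   # no opt-in: stays on the plain path
    return list(x)

out = saxpy(2.0, [1.0, 2.0], [3.0, 4.0])
```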

Made-with: Cursor
Verify that cuda_graph=True is a harmless no-op on non-CUDA backends
(tested on x64/CPU). Passes on both x64 and CUDA.

Made-with: Cursor
On each graph replay, re-resolve ndarray device pointers and re-upload
the arg buffer to the persistent device buffer. This ensures correct
results when the kernel is called with different ndarrays after the
graph was first captured.

Refactored ndarray pointer resolution into resolve_ctx_ndarray_ptrs().

Made-with: Cursor
Apply lint formatting fixes (clang-format, ruff) and remove
cuda_graph flag from autodiff adjoint kernel until the interaction
with reverse-mode AD is validated.
Implements @qd.kernel(graph_while='flag_arg'), which wraps the kernel's
offloaded tasks in a CUDA conditional while node (requires SM 9.0+).
The named argument is a scalar i32 ndarray on device; the loop
continues while its value is non-zero.

Key implementation details:
- Condition kernel compiled as PTX and JIT-linked with libcudadevrt.a
  at runtime to access cudaGraphSetConditional device function
- CU_GRAPH_COND_ASSIGN_DEFAULT flag ensures handle is reset each launch
- Works with both counter-based (decrement to 0) and boolean flag
  (set to 0 when done) patterns
- graph_while implicitly enables cuda_graph=True

Tests: counter, boolean done flag, multiple loops, graph replay.
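The do-while semantics of the conditional node can be illustrated with a plain-Python model (illustrative only; the real condition kernel runs on device): the body always executes at least once, and the loop repeats while the flag's value is non-zero.

```python
def graph_while_run(body, flag):
    """Model of the conditional while node. `flag` is a 1-element list
    standing in for a scalar i32 device ndarray."""
    iterations = 0
    while True:
        body(flag)          # body runs before the condition is re-read
        iterations += 1
        if flag[0] == 0:    # loop continues while the value is non-zero
            break
    return iterations

# counter pattern: decrement to 0
counter = [3]
n = graph_while_run(lambda f: f.__setitem__(0, f[0] - 1), counter)

# boolean done-flag pattern: body sets the flag to 0 when finished
state = {"steps": 0}
def step(flag):
    state["steps"] += 1
    if state["steps"] >= 5:
        flag[0] = 0

flag = [1]
m = graph_while_run(step, flag)
```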
…allback

The graph_while_arg_id was computed using Python-level parameter indices,
which is wrong when struct parameters are flattened into many C++ args
(e.g. Genesis solver has 40 C++ params from 6 Python params). Now tracks
the flattened C++ arg index during launch context setup and caches it.

Also adds C++ do-while fallback loops for CPU, CUDA (non-graph path), and
AMDGPU backends so graph_while works identically on all platforms.
Falls back to non-graph path with a warning on pre-Hopper GPUs,
instead of failing with an unhelpful JIT link error.
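The flattened-index bookkeeping described above can be sketched as follows (a toy helper; the parameter names and `fields_per_struct` mapping are illustrative, not the actual quadrants internals):

```python
def flattened_arg_index(py_params, target_name, fields_per_struct):
    """Map a Python-level parameter name to its flattened C++ arg index.
    fields_per_struct: number of C++ args each struct-typed parameter
    flattens into; everything else contributes a single arg."""
    idx = 0
    for name in py_params:
        if name == target_name:
            return idx
        idx += fields_per_struct.get(name, 1)  # scalars/ndarrays -> 1 arg
    raise ValueError(f"graph_while arg {target_name!r} not found")

# e.g. 6 Python params where two structs flatten into many C++ args
params = ["state", "dt", "flag", "out", "cfg", "n"]
widths = {"state": 20, "cfg": 17}
i = flattened_arg_index(params, "flag", widths)
```

Using the Python-level index (2) here would point at the wrong C++ argument once `state` expands into 20 args.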
Checks env-var-derived paths before the hardcoded fallbacks, so
custom toolkit installs (e.g. conda, non-default prefix) are found.
Document cuda_graph=True and graph_while API in kernel() docstring,
and add a user guide page covering usage patterns, cross-platform
behavior, and the do-while semantics constraint.
The graph path doesn't copy the result buffer back to the host,
so struct returns would silently return stale data. Error early
instead of producing wrong results.
Verifies that calling a cuda_graph=True kernel first with small
arrays then with larger ones produces correct results for all
elements — catches stale grid dims if the graph were incorrectly
replayed from the first capture.
Re-add documentation comments for |transfers|, |device_ptrs|,
zero-sized array handling, external array logic, and the
host copy-back section in the non-graph launch path.
Verify that a cuda_graph=True kernel works correctly after a
reset/reinit cycle — exercises the full teardown and rebuild
of the KernelLauncher and its graph cache.
The condition kernel's flag pointer was baked into the CUDA graph at
creation time. Passing a different ndarray on replay would cause the
condition kernel to read from a stale device address. Invalidate the
cached graph when the flag pointer changes so it gets rebuilt.
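A toy model of that invalidation rule (illustrative; the real code keys a captured CUDA graph, not a dict entry):

```python
class GraphCache:
    """Caches one 'captured graph' per kernel; the entry is invalidated
    when the condition flag's device pointer changes."""
    def __init__(self):
        self.entries = {}   # kernel name -> (flag_ptr, graph)
        self.captures = 0

    def get_or_capture(self, kernel, flag_ptr):
        entry = self.entries.get(kernel)
        if entry is not None and entry[0] == flag_ptr:
            return entry[1]                 # same pointer: replay
        self.captures += 1                  # pointer changed: rebuild
        graph = f"graph#{self.captures}"
        self.entries[kernel] = (flag_ptr, graph)
        return graph

cache = GraphCache()
g1 = cache.get_or_capture("step", flag_ptr=0x1000)
g2 = cache.get_or_capture("step", flag_ptr=0x1000)  # same ndarray: replay
g3 = cache.get_or_capture("step", flag_ptr=0x2000)  # new ndarray: rebuild
```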
Raise ValueError immediately if the graph_while name doesn't match any
kernel parameter, instead of silently running the kernel once without
looping. Also document the CUDA API version for CudaGraphNodeParams.
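That fail-fast validation could look like this (an illustrative sketch using `inspect.signature`, not the actual quadrants code):

```python
import inspect

def check_graph_while_arg(fn, graph_while):
    """Raise ValueError immediately if graph_while names no parameter."""
    params = list(inspect.signature(fn).parameters)
    if graph_while not in params:
        raise ValueError(
            f"graph_while={graph_while!r} does not match any parameter "
            f"of {fn.__name__}; available: {params}")
    return params.index(graph_while)

def step(state, flag, out):
    pass

ok = check_graph_while_arg(step, "flag")
try:
    check_graph_while_arg(step, "flags")   # typo: should fail loudly
    failed = False
except ValueError:
    failed = True
```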
Add get_cuda_graph_cache_size() through the KernelLauncher -> Program ->
pybind chain so tests can verify that graphs are actually being created
(or not) rather than only checking output correctness.

Made-with: Cursor
Tracks whether the CUDA graph cache was used on the most recent kernel
launch, exposed through KernelLauncher -> Program -> pybind so tests
can assert the graph path was (or was not) taken.

Made-with: Cursor
Every test now verifies graph caching behavior, not just output
correctness. Cross-platform test uses platform_supports_graph to
make assertions conditional on the backend.

Made-with: Cursor
The LLVM x64 backend generates extra tasks per ndarray argument for
serialization/setup, so exact equality checks fail. Use >= instead.

Made-with: Cursor
…-3-add-fallback

Made-with: Cursor

# Conflicts:
#	docs/source/user_guide/cuda_graph.md
@hughperkins (Collaborator, Author)

I have read and reviewed every line added in this PR. I take responsibility for the lines added and removed in this PR, and won't blame any issues on Opus.

Resolve conflicts from squash-merged MVP-1 PR (#405) vs branch's
pre-existing MVP-1 merge commits. Keep all graph_do_while (MVP-2)
additions. Incorporate grad_ptr local variable cleanup from main.
env.sh is generated by ./build.py and should not be tracked.
…raph-while

# Conflicts:
#	.github/workflows/test_gpu.yml
Base automatically changed from hp/cuda-graph-mvp-2-graph-while to main March 16, 2026 18:32
…dd-fallback

# Conflicts:
#	docs/source/user_guide/cuda_graph.md
#	python/quadrants/lang/misc.py
#	quadrants/runtime/amdgpu/kernel_launcher.cpp
#	quadrants/runtime/cpu/kernel_launcher.cpp
#	quadrants/runtime/cuda/cuda_graph_manager.cpp
#	tests/python/test_cuda_graph_do_while.py
@hughperkins hughperkins enabled auto-merge (squash) March 16, 2026 18:48
@hughperkins hughperkins merged commit a346d2d into main Mar 16, 2026
47 checks passed
@hughperkins hughperkins deleted the hp/cuda-graph-mvp-3-add-fallback branch March 16, 2026 19:46