[Perf] CUDA graph 2: graph_do_while #406
Conversation
When QD_CUDA_GRAPH=1, kernels with 2+ top-level for loops (offloaded tasks) are captured into a CUDA graph on first launch and replayed on subsequent launches, eliminating per-kernel launch overhead. Uses the explicit graph node API (cuGraphAddKernelNode) with persistent device arg/result buffers. Assumes stable ndarray device pointers. Made-with: Cursor
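For intuition, here is a minimal Python sketch of the capture-once/replay-thereafter behavior described above. `GraphCache`, `launch`, and the callable "tasks" are illustrative stand-ins, not the launcher's real API (which builds graph nodes with `cuGraphAddKernelNode` and replays with a single graph launch):

```python
# Illustrative sketch of capture-once / replay-thereafter caching.
# The "tasks" here are plain Python callables standing in for the
# kernel's offloaded tasks; the real launcher captures device kernels.

class GraphCache:
    def __init__(self):
        self._graphs = {}   # kernel id -> "captured graph" (here: a task list)
        self.captures = 0
        self.replays = 0

    def launch(self, kernel_id, tasks, args):
        graph = self._graphs.get(kernel_id)
        if graph is None:
            # First launch: "capture" every offloaded task into one graph,
            # paying per-task launch overhead exactly once.
            self.captures += 1
            graph = list(tasks)          # freeze the task list
            self._graphs[kernel_id] = graph
        else:
            self.replays += 1
        for task in graph:               # replay: run the captured tasks
            task(args)

out = []
cache = GraphCache()
tasks = [lambda a: out.append(a * 2), lambda a: out.append(a + 1)]
cache.launch("k0", tasks, 3)   # first call captures
cache.launch("k0", tasks, 3)   # second call replays
```

In the real implementation the second call is a single graph launch rather than a per-task loop; the sketch only models the cache-hit/cache-miss split.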
Replace the global QD_CUDA_GRAPH=1 env var with a per-kernel opt-in. The flag flows from the Python decorator through LaunchContextBuilder to the CUDA kernel launcher, avoiding interference with internal kernels like ndarray_to_ext_arr. Made-with: Cursor
Verify that cuda_graph=True is a harmless no-op on non-CUDA backends (tested on x64/CPU). Passes on both x64 and CUDA. Made-with: Cursor
On each graph replay, re-resolve ndarray device pointers and re-upload the arg buffer to the persistent device buffer. This ensures correct results when the kernel is called with different ndarrays after the graph was first captured. Refactored ndarray pointer resolution into resolve_ctx_ndarray_ptrs(). Made-with: Cursor
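The replay-time refresh can be sketched as follows. `resolve_ctx_ndarray_ptrs` is modeled as a plain function and the persistent device buffer as a Python list, so this only illustrates the in-place re-upload idea, not the actual C++ code:

```python
# Sketch: the graph's kernel nodes are frozen, but the persistent arg
# buffer they read from is rewritten before every replay, so ndarrays
# passed after capture are still picked up.

def resolve_ctx_ndarray_ptrs(ndarrays):
    # Stand-in for querying each ndarray's current device pointer.
    return [nd["device_ptr"] for nd in ndarrays]

def replay(graph_arg_buffer, ndarrays):
    # Re-upload in place: the graph's captured pointer-to-buffer stays
    # valid while the buffer's contents are refreshed.
    ptrs = resolve_ctx_ndarray_ptrs(ndarrays)
    graph_arg_buffer[:len(ptrs)] = ptrs
    return graph_arg_buffer
```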
Apply lint formatting fixes (clang-format, ruff) and remove cuda_graph flag from autodiff adjoint kernel until the interaction with reverse-mode AD is validated.
Implements @qd.kernel(graph_while='flag_arg') which wraps the kernel offloaded tasks in a CUDA conditional while node (requires SM 9.0+). The named argument is a scalar i32 ndarray on device; the loop continues while its value is non-zero. Key implementation details: - Condition kernel compiled as PTX and JIT-linked with libcudadevrt.a at runtime to access cudaGraphSetConditional device function - CU_GRAPH_COND_ASSIGN_DEFAULT flag ensures handle is reset each launch - Works with both counter-based (decrement to 0) and boolean flag (set to 0 when done) patterns - graph_while implicitly enables cuda_graph=True Tests: counter, boolean done flag, multiple loops, graph replay.
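A host-level Python model of the loop semantics may help: the body always executes once before the condition is read (do-while), and the loop continues while the scalar flag is non-zero. The helper below is a sketch of the semantics, not the actual qd API:

```python
def graph_do_while(flag, body):
    """Sketch: body runs at least once; loop continues while flag[0] != 0."""
    while True:
        body()
        if flag[0] == 0:
            break

# Counter pattern: the body decrements until the counter hits zero.
counter = [3]
runs = [0]
def step():
    runs[0] += 1
    counter[0] -= 1
graph_do_while(counter, step)

# Boolean-done pattern: the body clears the flag when finished.
done = [1]
iters = [0]
def work():
    iters[0] += 1
    if iters[0] >= 5:
        done[0] = 0
graph_do_while(done, work)
```

Note the do-while shape: a counter that starts at 0 is decremented to -1 on the first iteration and never reaches zero again, so callers must intend at least one iteration.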
…allback The graph_while_arg_id was computed using Python-level parameter indices, which is wrong when struct parameters are flattened into many C++ args (e.g. Genesis solver has 40 C++ params from 6 Python params). Now tracks the flattened C++ arg index during launch context setup and caches it. Also adds C++ do-while fallback loops for CPU, CUDA (non-graph path), and AMDGPU backends so graph_while works identically on all platforms.
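The index fix boils down to summing flattening widths. A sketch with hypothetical names (`param_widths[i]` is how many C++ args Python parameter i expands into; neither name is from the codebase):

```python
# Hypothetical sketch: map a Python-level parameter index to the index
# of its first flattened C++ arg. Structs flatten into many C++ args,
# so Python indices and C++ indices diverge.

def flattened_arg_index(param_widths, py_index):
    return sum(param_widths[:py_index])

# e.g. six Python params whose structs flatten into 40 C++ args total,
# loosely mirroring the Genesis-solver shape mentioned above:
widths = [1, 12, 12, 12, 2, 1]
assert sum(widths) == 40
```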
Falls back to non-graph path with a warning on pre-Hopper GPUs, instead of failing with an unhelpful JIT link error.
Checks env-var-derived paths before the hardcoded fallbacks, so custom toolkit installs (e.g. conda, non-default prefix) are found.
Document cuda_graph=True and graph_while API in kernel() docstring, and add a user guide page covering usage patterns, cross-platform behavior, and the do-while semantics constraint.
The graph path doesn't copy the result buffer back to the host, so struct returns would silently return stale data. Error early instead of producing wrong results.
Verifies that calling a cuda_graph=True kernel first with small arrays then with larger ones produces correct results for all elements — catches stale grid dims if the graph were incorrectly replayed from the first capture.
Re-add documentation comments for |transfers|, |device_ptrs|, zero-sized array handling, external array logic, and the host copy-back section in the non-graph launch path.
Verify that a cuda_graph=True kernel works correctly after a reset/reinit cycle — exercises the full teardown and rebuild of the KernelLauncher and its graph cache.
Opus 4.6 review:

What it does: Adds …

The implementation threads through cleanly: Python decorator → …

What's good: …
Concerns

1. Graph replay bug with changed ndarray pointers (correctness)

When the CUDA graph is replayed (cache hit), the work kernels are fine because the arg buffer is re-uploaded. But the condition kernel's flag pointer is not:

```cpp
// kernel_launcher.cpp, inside launch_llvm_kernel_graph
void *flag_ptr = ctx.graph_while_flag_dev_ptr;
void *cond_args[2] = {&cond_handle, &flag_ptr};
// ... added as a kernel node — pointer captured permanently
```

If a user creates a new ndarray for the counter between calls, the condition kernel still reads from the old device address. The tests pass because they reuse the same ndarray object — but that's a latent bug. Fix: Store the condition kernel node in …

2. Fragile ABI-coupled struct (maintainability)
```cpp
// kernel_launcher.h
struct CudaGraphNodeParams {
    unsigned int type;
    int reserved0[3];
    unsigned long long handle;
    unsigned int condType;
    unsigned int size;
    void *phGraph_out;
    void *ctx;
    char _pad[232 - 8 - 4 - 4 - 8 - 8];
    long long reserved2;
};
```

The struct hand-mirrors the CUDA driver's node-params ABI layout, so a driver header change could silently break it. …

3. No validation on the `graph_while` argument name …
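The pad arithmetic in that struct can be checked mechanically. Below is a ctypes mirror (an illustration assuming a 64-bit ABI, not project code) reproducing the same field order: the hand-computed pad of 232 - 8 - 4 - 4 - 8 - 8 = 200 bytes yields a 256-byte struct, which is the fixed size the layout is coupled to:

```python
import ctypes

# ctypes mirror of the hand-rolled CudaGraphNodeParams above
# (64-bit ABI assumed; field order and padding copied verbatim).
class CudaGraphNodeParams(ctypes.Structure):
    _fields_ = [
        ("type", ctypes.c_uint),
        ("reserved0", ctypes.c_int * 3),
        ("handle", ctypes.c_ulonglong),
        ("condType", ctypes.c_uint),
        ("size", ctypes.c_uint),
        ("phGraph_out", ctypes.c_void_p),
        ("ctx", ctypes.c_void_p),
        ("_pad", ctypes.c_char * (232 - 8 - 4 - 4 - 8 - 8)),  # 200 bytes
        ("reserved2", ctypes.c_longlong),
    ]

assert ctypes.sizeof(CudaGraphNodeParams) == 256
```

A check like this (or a C++ `static_assert` against the real driver type) would catch a silent ABI drift at build time instead of at runtime.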
The condition kernel's flag pointer was baked into the CUDA graph at creation time. Passing a different ndarray on replay would cause the condition kernel to read from a stale device address. Invalidate the cached graph when the flag pointer changes so it gets rebuilt.
Raise ValueError immediately if the graph_while name doesn't match any kernel parameter, instead of silently running the kernel once without looping. Also document the CUDA API version for CudaGraphNodeParams.
Update from Opus on concern 4: Thinking this through more carefully: counter=0 causes an infinite loop on all paths, not just the fallbacks. On the CUDA graph path, defaultLaunchValue=1 forces the body to execute at least once …
Update from Opus on concern 5: Concern 5 (host-side fallback performance) is just a documentation suggestion — noting that the fallback is correct but not a performance win. Looking at it again, I think the docs/source/user_guide/user_guide/cuda_graphs.md lines 68-70 …

The fact that it's a "host-side do-while loop" already implies sync-per-iteration. Anyone who understands enough to care about this overhead will read "host-side" and …
Update from Opus on concern 6: Looking at the code, the existing AMDGPU backend already uses stream_synchronize(nullptr) everywhere — line 95 has the same pattern, predating graph_while. The new code at …
Addressed all 6 of Opus's concerns.
…-2-graph-while Made-with: Cursor
# Conflicts:
#   quadrants/runtime/cuda/cuda_graph_manager.cpp
#   quadrants/runtime/cuda/cuda_graph_manager.h
…loop Replace `@qd.kernel(graph_do_while="var")` with explicit `while qd.graph_do_while(var):` inside the kernel body. The AST transformer recognises the pattern and sets the condition arg without emitting a while-loop IR node. No C++ changes needed. Made-with: Cursor
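The pattern recognition can be sketched with Python's `ast` module. This simplified matcher only extracts the flag argument's name; the real transformer additionally rewrites the loop so no while-loop IR node is emitted:

```python
import ast

# Sketch: find `while qd.graph_do_while(<name>):` in kernel source and
# return the flag argument's name, or None if the pattern is absent.

def find_graph_do_while_arg(source):
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.While) and isinstance(node.test, ast.Call):
            fn = node.test.func
            if (isinstance(fn, ast.Attribute) and fn.attr == "graph_do_while"
                    and isinstance(fn.value, ast.Name) and fn.value.id == "qd"
                    and node.test.args
                    and isinstance(node.test.args[0], ast.Name)):
                return node.test.args[0].id
    return None

src = """
def k(counter):
    while qd.graph_do_while(counter):
        counter[0] -= 1
"""
```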
docs/source/user_guide/cuda_graph.md
- On SM 9.0+ (Hopper), this uses CUDA conditional while nodes — the entire iteration runs on the GPU with no host involvement.
- Older CUDA GPUs and non-CUDA backends are not currently supported.
- `graph_do_while` implicitly enables `cuda_graph=True`.
- Using `qd.graph_do_while()` implicitly enables `cuda_graph=True` if not already set.
This should be removed.
Made-with: Cursor
```python
        f"Available parameters: {arg_names}"
    )
kernel.graph_do_while_arg = graph_do_while_arg
if not kernel.use_cuda_graph:
```
Raise an error if cuda_graph is not already enabled.
…ling it Made-with: Cursor
Made-with: Cursor
Test that using qd.graph_do_while() without cuda_graph=True and with a non-existent parameter name both raise QuadrantsSyntaxError. Made-with: Cursor
Made-with: Cursor
I have read and reviewed every line added in this PR. I take responsibility for the lines added and removed, and won't blame any issues on Opus.
The LLVM x64 backend generates extra tasks per ndarray argument for serialization/setup, so exact equality checks fail. Use >= instead. Made-with: Cursor
Ndarray kernels can produce additional serial tasks beyond the user-visible loops, so hardcoding expected node counts breaks. Use the actual num_offloaded_tasks instead.
Resolve conflicts from squash-merged MVP-1 PR (#405) vs branch's pre-existing MVP-1 merge commits. Keep all graph_do_while (MVP-2) additions. Incorporate grad_ptr local variable cleanup from main.
env.sh shouldn't be here.
env.sh is generated by ./build.py and should not be tracked.
…raph-while
# Conflicts:
#   .github/workflows/test_gpu.yml
Issue: #
Brief Summary
In this PR, we add graph_do_while for CUDA only; fallbacks for other platforms will be added in a later PR.
The do-while is implemented using a CUDA graph conditional node.