Fold repeated CUDA global load adjoints #3
Open
minansys wants to merge 6 commits into
Root cause: CUDA device pointers often arrive as generic addrspace(0), generated derivative shadow arguments were not carrying the opt-in noalias attribute, and Enzyme-generated shadow atomics lacked metadata that later passes could use to identify and safely optimize them. In face-loop kernels this left repeated direct loads and duplicated branch-tail shadow atomic sites, making Enzyme much slower than the cached/manual adjoint shape.

Changes:
- allow repeated-load forwarding for generic CUDA pointers using AA must-alias/clobber checks;
- propagate -enzyme-noalias to generated primal and shadow pointer arguments/calls;
- tag generated shadow atomic fadds with derivative alias metadata;
- coalesce tagged repeated CUDA shadow fadds;
- add opt-in -enzyme-enable-cuda-atomic-tail-merge to merge identical Enzyme shadow atomic tails through PHIs.

Added focused lit coverage plus CUDA Green-Gauss, momentum residual, 2D face-loop, and 3D face-loop benchmarks/repros.

Risks: -enzyme-noalias remains an explicit user assertion and is unsafe if the user passes aliasing primal/shadow buffers. Atomic tail merge is opt-in and restricted to Enzyme-tagged monotonic unused shadow fadd atomics on NVPTX; it rejects PHI successors, aliasing clobbers, non-Enzyme atomics, and fences. Atomic fadd coalescing can change floating-point accumulation order within the usual CUDA atomic nondeterminism, so runtime checks use tolerances.

Tests:
- cmake --build build/enzyme-llvmorg-19.1.7 --target LLVMEnzyme-19 ClangEnzyme-19 -- -j8
- focused llvm-lit for noalias-shadow-args.ll, cuda-fold-atomic-add.ll, cuda_repeated_atomic_fadd.ll, cuda_repeated_generic_load.ll, cuda_shadow_atomic_tail_merge.ll
- scripts/build_enzyme_cuda.sh --skip-checkout --skip-llvm-build --skip-enzyme-tests
- CUDA runtime repros for Green-Gauss 512x512x20, momentum residual 512x512x20, face-loop 2048x2048x100, face-loop-3d 128x128x128x50

Static atomics with tail merge: 2D direct/cached/manual 8/8/8; 3D direct/cached/manual 10/10/10.
Skipped: check-enzyme target is blocked by local CMake invoking /llvm-lit.
The CUDA shadow atomic optimization added Enzyme-specific metadata so later passes can identify generated shadow atomics. That metadata was being attached even when -enzyme-enable-cuda-repeated-loads was disabled, which meant the disabled/default path was not byte-for-byte the same IR shape as before the optimization.

Gate both CUDA atomic folding and shadow-atomic metadata tagging on the same opt-in condition: -enzyme-enable-cuda-repeated-loads on NVPTX. With the flag off, generated CUDA adjoint atomics are emitted without the new !enzyme_shadow_atomic marker or the associated atomic alias/noalias metadata. The tail-merge flag remains independently opt-in and has no effect unless repeated-loads is enabled.

Risk is low: this only removes metadata from the disabled path. The enabled path is unchanged except for the clearer shared gate name.

Tests: rebuilt LLVMEnzyme-19 and ClangEnzyme-19; focused llvm-lit for cuda-fold-atomic-add.ll, noalias-shadow-args.ll, cuda_repeated_atomic_fadd.ll, cuda_repeated_generic_load.ll, cuda_shadow_atomic_tail_merge.ll; explicit default/off IR grep confirmed no !enzyme_shadow_atomic in differepeated; default/off CUDA face-loop 512x512x20 passed; scripts/build_enzyme_cuda.sh --skip-checkout --skip-llvm-build --skip-enzyme-tests passed.
CUDA face-loop CFD adjoints frequently load row pointers from float ** tables, perform atomic residual updates through the loaded rows, then reload the same table entries. The existing repeated-load forwarding conservatively invalidated those row/table loads across the atomics because AA cannot prove the row data is disjoint from the pointer table or readonly geometry arrays.

Add an opt-in -enzyme-enable-cuda-pointer-table-loads flag that lets the NVPTX repeated-load pass keep readonly argument/global loads live across writes through rows loaded from readonly pointer tables. This preserves default behavior, requires -enzyme-enable-cuda-repeated-loads, and still invalidates on unknown writes and non-readonly tables.

Also avoid emitting atomic floating-point shadow loads for NVPTX atomic fadd adjoints, since the NVPTX backend cannot lower load atomic float.

Add focused SimpleGVN and reverse-mode regressions plus a 3D unstructured finite-volume CUDA benchmark with Green-Gauss gradients and owner/neighbor residual accumulation.

Validation: rebuilt LLVMEnzyme-19 and ClangEnzyme-19; llvm-lit focused CUDA/SimpleGVN set passed; CUDA benchmark compiled to device IR and executable; runtime validation passed on 262144 cells / 1048576 faces / 20 reps. Direct FVM derivative static loads drop from 95 loads / 42 pointer loads to 85 / 36, matching cached; atomics remain 40.

Risk: pointer-table flag is an explicit opt-in disjointness assumption for CFD-style row tables and should not be enabled if row data may alias readonly input/table storage.
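The row-table access pattern described above can be illustrated with a small host-side C++ sketch. This is not code from the PR: the function names, the single-row layout, and the use of plain `+=` in place of CUDA atomics are all assumptions made for illustration. The "direct" shape reloads the row pointer from the table after a write through that row, while the "cached" shape loads it once; the two agree only when row data does not alias the pointer table, which is the assumption the opt-in flag asks the user to make.

```cpp
#include <cstddef>

// Hypothetical "direct" shape: the row pointer is loaded from the table for
// every update. Without a disjointness guarantee, a compiler must assume the
// first write could have changed rows[r] and so reloads it.
inline void scatter_direct(float* const* rows, std::size_t r,
                           std::size_t owner, std::size_t neigh, float flux) {
    rows[r][owner] += flux;  // load rows[r], then write through the row
    rows[r][neigh] -= flux;  // rows[r] is loaded again here
}

// Hypothetical "cached" shape: the row pointer is loaded once and reused.
// This is only equivalent when the row data is disjoint from the table.
inline void scatter_cached(float* const* rows, std::size_t r,
                           std::size_t owner, std::size_t neigh, float flux) {
    float* row = rows[r];    // single table load
    row[owner] += flux;
    row[neigh] -= flux;
}
```

When the table and the rows live in separate allocations, both functions leave the same values in the row, so forwarding the table load is safe under that assumption.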
Update the 3D unstructured FVM CUDA benchmark so it validates correctness before reporting timing. Add a direct-forward kernel and compare it against the cached-row forward residual, replace first-mismatch checks with aggregate max/RMS error summaries, and add a finite-difference directional derivative check for the manual adjoint.

The finite-difference path perturbs all active momentum inputs (vel, pressure, viscosity, velocity gradients, pressure gradients), recomputes the seeded residual objective on GPU, and compares it with the manual adjoint dot perturbation. Enzyme direct/cached adjoints are still checked against the manual adjoint.

Also report best and mean timing over multiple rounds for direct/cached forward, manual reverse, and Enzyme direct/cached reverse. This makes small codegen improvements from pointer-table load forwarding easier to validate under GPU timing noise.

Validation: optimized and default CUDA executables rebuilt; static direct derivative loads are 95/42 pointer loads by default and 85/36 with pointer-table forwarding; runtime checks passed on 32768 cells / 131072 faces and 262144 cells / 1048576 faces.
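The two checks described above (a finite-difference directional derivative compared with the adjoint dot perturbation, and aggregate max/RMS error summaries) can be sketched on the host as follows. This is a minimal illustration and not the benchmark's code: the helper names, the use of a central difference, double precision, and a generic scalar objective `J` are all assumptions.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

using Objective = std::function<double(const std::vector<double>&)>;

// Central finite difference of J at x along direction dx (assumed scheme).
inline double fd_directional(const Objective& J, const std::vector<double>& x,
                             const std::vector<double>& dx, double eps) {
    std::vector<double> xp(x), xm(x);
    for (std::size_t i = 0; i < x.size(); ++i) {
        xp[i] += eps * dx[i];
        xm[i] -= eps * dx[i];
    }
    return (J(xp) - J(xm)) / (2.0 * eps);
}

// Adjoint-side value to compare against: gradient dotted with the perturbation.
inline double adjoint_dot(const std::vector<double>& grad,
                          const std::vector<double>& dx) {
    double s = 0.0;
    for (std::size_t i = 0; i < grad.size(); ++i) s += grad[i] * dx[i];
    return s;
}

struct ErrorSummary {
    double max_abs;  // largest absolute difference
    double rms;      // root-mean-square difference
};

// Aggregate summary over the common prefix of a and b.
inline ErrorSummary summarize_error(const std::vector<double>& a,
                                    const std::vector<double>& b) {
    const std::size_t n = a.size() < b.size() ? a.size() : b.size();
    ErrorSummary e{0.0, 0.0};
    double sum_sq = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        const double d = std::fabs(a[i] - b[i]);
        if (d > e.max_abs) e.max_abs = d;
        sum_sq += d * d;
    }
    e.rms = n ? std::sqrt(sum_sq / static_cast<double>(n)) : 0.0;
    return e;
}
```

A check in this style passes when `fd_directional` and `adjoint_dot` agree within a tolerance chosen for the precision in use, and the summary reports how far two result arrays differ overall instead of stopping at the first mismatch.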
Add a foldable atomic AD microcase to the CUDA FVM benchmark. The source performs two active loads from the same cell separated by an output store that may alias the input, so the unoptimized derivative emits two shadow atomic adds to the same input adjoint. With the CUDA repeated-load/atomic folding flags, the generated derivative folds those two atomics into one.

The benchmark now validates this microcase at runtime against a manual gradient before running the realistic FVM checks. This keeps the real FVM case honest: FVM owner/neighbor residual assembly still has distinct scatter addresses and therefore does not reduce atomics, while the foldable case proves the atomic reduction path is exercised and correct.

Also tighten cuda-fold-atomic-add.ll to check the exact atomic count: two atomics with the flag off and one atomic with the flag on.

Validation: llvm-lit cuda-fold-atomic-add.ll and cuda_repeated_atomic_fadd.ll passed; CUDA benchmark default and optimized builds passed on 32768 cells / 131072 faces and 262144 cells / 1048576 faces. Static IR counts show atomic_fold goes from 2 atomics to 1, while FVM direct stays at 40 atomics and optimized FVM direct load count stays matched to cached.
Root cause
Repeated active CUDA global loads from the same global address can lead reverse mode to emit duplicate adjoint updates to the same shadow address. The first CI attempt also exposed two portability issues: LLVM 15 uses llvm/ADT/Triple.h, and typed-pointer CI needs the new .ll tests to pass -opaque-pointers only on LLVM versions that support it.
Solution
-enzyme-enable-cuda-repeated-loads.
Validation
/home/minxu/code/enzyme/.tools/bin/clang-format-16 -i enzyme/Enzyme/DiffeGradientUtils.cpp enzyme/Enzyme/FunctionUtils.cpp enzyme/Enzyme/SimpleGVN.cpp enzyme/Enzyme/SimpleGVN.h enzyme/test/Integration/ReverseMode/Inputs/cuda_repeated_global_load_correctness.cu
cmake --build build/enzyme-llvmorg-19.1.7 --target LLVMEnzyme-19 --parallel 8
cmake --build build/enzyme-llvmorg-19.1.7 --target ClangEnzyme-19 --parallel 8
/home/minxu/code/enzyme/build/llvm-llvmorg-19.1.7/bin/llvm-lit -v build/enzyme-llvmorg-19.1.7/test/Enzyme/SimpleGVN/cuda_repeated_global_load.ll build/enzyme-llvmorg-19.1.7/test/Enzyme/ReverseMode/cuda-fold-atomic-add.ll
CUDA_BENCHMARK_N=65536 CUDA_BENCHMARK_REPS=2 CUDA_BENCHMARK_BINS=1024 scripts/build_enzyme_cuda.sh --skip-checkout --skip-llvm-build --skip-enzyme-build --skip-enzyme-tests
atomicrmw fadd operations in diffe_Z13repeated_load
n=65536 bins=1024 reps=2 cuda repeated-load gradients verified
Notes
/llvm-lit. Direct full-directory lit runs also hit existing LLVM 19 typed-pointer/opaque-pointer incompatibilities in unrelated tests, so the validation above stayed focused on this change.