Fold repeated CUDA global load adjoints #3

Open: minansys wants to merge 6 commits into main from fix/cuda-repeated-global-loads

minansys (Owner) commented Jun 2, 2026

Root cause

Repeated active CUDA global loads from the same global address can lead reverse mode to emit duplicate adjoint updates to the same shadow address. The first CI attempt also exposed two portability issues: LLVM 15 uses llvm/ADT/Triple.h, and typed-pointer CI needs the new .ll tests to pass -opaque-pointers only on LLVM versions that support it.
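The duplicate-update pattern can be modeled on the host in a few lines. This is a minimal sketch, not Enzyme's generated code: `ShadowBuffer`, `reverseNaive`, and `reverseFolded` are hypothetical names, and a plain counter stands in for the atomicrmw fadd operations emitted on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for a shadow (adjoint) buffer. Every call to add()
// models one atomic fadd and is counted.
struct ShadowBuffer {
  std::vector<float> data;
  int updates;
  explicit ShadowBuffer(std::size_t n) : data(n, 0.0f), updates(0) {}
  void add(std::size_t i, float v) {
    data[i] += v;
    ++updates;
  }
};

// Primal: y = a * x[i] + b * x[i], reading x[i] twice.
inline float primal(const std::vector<float> &x, std::size_t i, float a,
                    float b) {
  float first = x[i];
  float second = x[i]; // repeated load from the same address
  return a * first + b * second;
}

// Unfolded reverse pass: one shadow update per primal load.
inline void reverseNaive(ShadowBuffer &dx, std::size_t i, float a, float b,
                         float dy) {
  dx.add(i, a * dy);
  dx.add(i, b * dy);
}

// Folded reverse pass: the two contributions are summed in a register and
// written with a single shadow update.
inline void reverseFolded(ShadowBuffer &dx, std::size_t i, float a, float b,
                          float dy) {
  dx.add(i, a * dy + b * dy);
}
```

Both reverse passes accumulate the same adjoint into `dx.data[i]`; only the number of updates differs, which is what the IR validation counts.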

Solution

  • Add a hidden default-off option: -enzyme-enable-cuda-repeated-loads.
  • Gate NVPTX repeated global load forwarding in SimpleGVN behind that option.
  • Gate the reverse-mode adjacent atomic fadd fold behind the same option.
  • Keep forwarding conservative: NVPTX only, addrspace(1), same block, matching pointer/type, and invalidation on aliasing writes or side-effecting instructions.
  • Add lit coverage for both default-off fallback behavior and opt-in folded behavior.
  • Add a more complex CUDA correctness input with repeated reads, an intervening independent read, bias input, nonuniform output adjoints, output verification, and gradient verification.
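The conservative forwarding rules listed above can be illustrated with a toy model over a single basic block. This is a sketch under simplifying assumptions, not the SimpleGVN implementation: `Inst`, `Op`, and `forwardableLoads` are hypothetical, pointers are compared by symbolic name, and any store flagged as possibly aliasing invalidates every tracked load.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <tuple>
#include <vector>

enum class Op { Load, Store, Call };

// Toy instruction: only the fields needed for the forwarding decision.
struct Inst {
  Op op;
  std::string ptr;  // symbolic pointer name (unused for Call)
  std::string type; // accessed type, e.g. "float"
  int addrSpace;    // 1 models CUDA global memory, addrspace(1)
  bool mayAlias;    // Store only: may write to a tracked address
};

// Returns the indices of loads that can reuse an earlier load in this block.
inline std::vector<std::size_t>
forwardableLoads(const std::vector<Inst> &block) {
  // Key: (pointer, type). Value: index of the first load still available.
  std::map<std::tuple<std::string, std::string>, std::size_t> available;
  std::vector<std::size_t> forwarded;
  for (std::size_t i = 0; i < block.size(); ++i) {
    const Inst &inst = block[i];
    switch (inst.op) {
    case Op::Load: {
      if (inst.addrSpace != 1)
        break; // only addrspace(1) loads are candidates
      auto key = std::make_tuple(inst.ptr, inst.type);
      if (available.count(key))
        forwarded.push_back(i); // same pointer and type: reuse earlier value
      else
        available.emplace(key, i);
      break;
    }
    case Op::Store:
      if (inst.mayAlias)
        available.clear(); // aliasing write invalidates tracked loads
      break;
    case Op::Call:
      available.clear(); // treated as side-effecting
      break;
    }
  }
  return forwarded;
}
```

Clearing the whole table on an aliasing write is deliberately coarse; it trades missed forwarding opportunities for never reusing a stale value.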

Validation

  • /home/minxu/code/enzyme/.tools/bin/clang-format-16 -i enzyme/Enzyme/DiffeGradientUtils.cpp enzyme/Enzyme/FunctionUtils.cpp enzyme/Enzyme/SimpleGVN.cpp enzyme/Enzyme/SimpleGVN.h enzyme/test/Integration/ReverseMode/Inputs/cuda_repeated_global_load_correctness.cu
  • cmake --build build/enzyme-llvmorg-19.1.7 --target LLVMEnzyme-19 --parallel 8
  • cmake --build build/enzyme-llvmorg-19.1.7 --target ClangEnzyme-19 --parallel 8
  • /home/minxu/code/enzyme/build/llvm-llvmorg-19.1.7/bin/llvm-lit -v build/enzyme-llvmorg-19.1.7/test/Enzyme/SimpleGVN/cuda_repeated_global_load.ll build/enzyme-llvmorg-19.1.7/test/Enzyme/ReverseMode/cuda-fold-atomic-add.ll
    • 2 passed
  • CUDA_BENCHMARK_N=65536 CUDA_BENCHMARK_REPS=2 CUDA_BENCHMARK_BINS=1024 scripts/build_enzyme_cuda.sh --skip-checkout --skip-llvm-build --skip-enzyme-build --skip-enzyme-tests
    • IR validation: exactly two atomicrmw fadd operations in diffe_Z13repeated_load
    • Runtime: n=65536 bins=1024 reps=2 cuda repeated-load gradients verified

Notes

  • Broader local CMake lit targets are still blocked because this local build generates the lit path as /llvm-lit. Direct full-directory lit runs also hit existing LLVM 19 typed-pointer/opaque-pointer incompatibilities in unrelated tests, so the validation above stayed focused on this change.

@minansys minansys force-pushed the fix/cuda-repeated-global-loads branch from 747e736 to 6cc6c0f Compare June 2, 2026 11:56
superustc added 5 commits June 3, 2026 09:15
Root cause: CUDA device pointers often arrive as generic addrspace(0), generated derivative shadow arguments were not carrying the opt-in noalias attribute, and Enzyme-generated shadow atomics lacked metadata that later passes could use to identify and safely optimize them. In face-loop kernels this left repeated direct loads and duplicated branch-tail shadow atomic sites, making Enzyme much slower than the cached/manual adjoint shape.

Changes: allow repeated-load forwarding for generic CUDA pointers using AA must-alias/clobber checks; propagate -enzyme-noalias to generated primal and shadow pointer arguments/calls; tag generated shadow atomic fadds with derivative alias metadata; coalesce tagged repeated CUDA shadow fadds; and add opt-in -enzyme-enable-cuda-atomic-tail-merge to merge identical Enzyme shadow atomic tails through PHIs. Added focused lit coverage plus CUDA Green-Gauss, momentum residual, 2D face-loop, and 3D face-loop benchmarks/repros.

Risks: -enzyme-noalias remains an explicit user assertion and is unsafe if the user passes aliasing primal/shadow buffers. Atomic tail merge is opt-in and restricted to Enzyme-tagged monotonic unused shadow fadd atomics on NVPTX; it rejects PHI successors, aliasing clobbers, non-Enzyme atomics, and fences. Atomic fadd coalescing can change floating-point accumulation order within the usual CUDA atomic nondeterminism, so runtime checks use tolerances.
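The accumulation-order point can be shown in isolation. The sketch below assumes IEEE-754 binary32 `float` arithmetic evaluated at float precision; `sumLeft`, `sumRight`, and `closeEnough` are hypothetical helpers, and the tolerance form (absolute plus relative) is one common choice rather than the one the runtime checks necessarily use.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// (a + b) + c, with the intermediate stored as float.
inline float sumLeft(float a, float b, float c) {
  float ab = a + b;
  return ab + c;
}

// a + (b + c), the same terms in a different association.
inline float sumRight(float a, float b, float c) {
  float bc = b + c;
  return a + bc;
}

// Mixed absolute/relative comparison for results whose accumulation order
// is not fixed.
inline bool closeEnough(float got, float want, float relTol, float absTol) {
  float diff = std::fabs(got - want);
  float scale = std::max(std::fabs(got), std::fabs(want));
  return diff <= absTol + relTol * scale;
}
```

With inputs such as `1e8f, -1e8f, 1.0f` the two associations give different results because `-1e8f + 1.0f` rounds back to `-1e8f`. That case is deliberately extreme; gradient checks usually see differences of a few units in the last place, which a small relative tolerance absorbs.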

Tests: cmake --build build/enzyme-llvmorg-19.1.7 --target LLVMEnzyme-19 ClangEnzyme-19 -- -j8; focused llvm-lit for noalias-shadow-args.ll, cuda-fold-atomic-add.ll, cuda_repeated_atomic_fadd.ll, cuda_repeated_generic_load.ll, cuda_shadow_atomic_tail_merge.ll; scripts/build_enzyme_cuda.sh --skip-checkout --skip-llvm-build --skip-enzyme-tests; CUDA runtime repros for Green-Gauss 512x512x20, momentum residual 512x512x20, face-loop 2048x2048x100, face-loop-3d 128x128x128x50. Static atomics with tail merge: 2D direct/cached/manual 8/8/8; 3D direct/cached/manual 10/10/10. Skipped: check-enzyme target is blocked by local CMake invoking /llvm-lit.

The CUDA shadow atomic optimization added Enzyme-specific metadata so that later passes can identify generated shadow atomics. That metadata was attached even when -enzyme-enable-cuda-repeated-loads was disabled, so the default (flag-off) path no longer produced byte-for-byte the same IR as before the optimization.

Gate both CUDA atomic folding and shadow-atomic metadata tagging on the same opt-in condition: -enzyme-enable-cuda-repeated-loads on NVPTX. With the flag off, generated CUDA adjoint atomics are emitted without the new !enzyme_shadow_atomic marker or the associated atomic alias/noalias metadata. The tail-merge flag remains independently opt-in and has no effect unless repeated-loads is enabled.
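The gating relationship described here reduces to a small truth table. The following is only a model of that logic with hypothetical names (`CudaOptFlags`, `sharedGate`, and so on); it does not reflect how the options are actually declared or queried inside Enzyme.

```cpp
#include <cassert>

// Models the two command-line options as plain booleans.
struct CudaOptFlags {
  bool repeatedLoads;   // -enzyme-enable-cuda-repeated-loads
  bool atomicTailMerge; // -enzyme-enable-cuda-atomic-tail-merge
};

// Shared gate: the repeated-loads flag on an NVPTX target.
inline bool sharedGate(const CudaOptFlags &f, bool isNVPTX) {
  return isNVPTX && f.repeatedLoads;
}

// Atomic folding and shadow-atomic metadata tagging use the same gate.
inline bool foldAtomics(const CudaOptFlags &f, bool isNVPTX) {
  return sharedGate(f, isNVPTX);
}

inline bool tagShadowAtomics(const CudaOptFlags &f, bool isNVPTX) {
  return sharedGate(f, isNVPTX);
}

// Tail merge is independently opt-in but has no effect without the gate.
inline bool mergeAtomicTails(const CudaOptFlags &f, bool isNVPTX) {
  return sharedGate(f, isNVPTX) && f.atomicTailMerge;
}
```

Routing every decision through one predicate is what keeps the flag-off path free of the new metadata.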

Risk is low: this only removes metadata from the disabled path. The enabled path is unchanged except for the clearer shared gate name.

Tests: rebuilt LLVMEnzyme-19 and ClangEnzyme-19; focused llvm-lit for cuda-fold-atomic-add.ll, noalias-shadow-args.ll, cuda_repeated_atomic_fadd.ll, cuda_repeated_generic_load.ll, cuda_shadow_atomic_tail_merge.ll; explicit default/off IR grep confirmed no !enzyme_shadow_atomic in differepeated; default/off CUDA face-loop 512x512x20 passed; scripts/build_enzyme_cuda.sh --skip-checkout --skip-llvm-build --skip-enzyme-tests passed.

CUDA face-loop CFD adjoints frequently load row pointers from float ** tables, perform atomic residual updates through the loaded rows, then reload the same table entries. The existing repeated-load forwarding conservatively invalidated those row/table loads across the atomics because AA cannot prove that the row data is disjoint from the pointer table or readonly geometry arrays.

Add an opt-in -enzyme-enable-cuda-pointer-table-loads flag that lets the NVPTX repeated-load pass keep readonly argument/global loads live across writes through rows loaded from readonly pointer tables. This preserves default behavior, requires -enzyme-enable-cuda-repeated-loads, and still invalidates on unknown writes and non-readonly tables.
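The access pattern in question looks roughly like the host-side sketch below. The function names are hypothetical and plain `+=` stands in for the atomic residual updates. The two functions agree only under the assumption the flag encodes, namely that writes through a row never modify the pointer table itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Conservative form: reload table[i] after the write through the row,
// because the write might have changed the table entry.
inline void scatterReload(float *const *table, std::size_t i, std::size_t j,
                          float a, float b) {
  float *row = table[i];
  row[j] += a;
  row = table[i]; // redundant if row data cannot alias the table
  row[j] += b;
}

// Forwarded form: keep the first loaded row pointer live across the write.
// Valid only when the row storage is disjoint from the pointer table.
inline void scatterForwarded(float *const *table, std::size_t i,
                             std::size_t j, float a, float b) {
  float *row = table[i];
  row[j] += a;
  row[j] += b;
}
```

If a row were allowed to overlap the table storage, the first write could redirect `table[i]` and the forwarded form would update the wrong row, which is why the flag is an explicit opt-in.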

Also avoid emitting atomic floating-point shadow loads for NVPTX atomic fadd adjoints, since the NVPTX backend cannot lower load atomic float. Add focused SimpleGVN and reverse-mode regressions plus a 3D unstructured finite-volume CUDA benchmark with Green-Gauss gradients and owner/neighbor residual accumulation.

Validation: rebuilt LLVMEnzyme-19 and ClangEnzyme-19; llvm-lit focused CUDA/SimpleGVN set passed; CUDA benchmark compiled to device IR and executable; runtime validation passed on 262144 cells / 1048576 faces / 20 reps. Direct FVM derivative static loads drop from 95 loads / 42 pointer loads to 85 / 36, matching cached; atomics remain 40. Risk: pointer-table flag is an explicit opt-in disjointness assumption for CFD-style row tables and should not be enabled if row data may alias readonly input/table storage.

Update the 3D unstructured FVM CUDA benchmark so it validates correctness before reporting timing. Add a direct-forward kernel and compare it against the cached-row forward residual, replace first-mismatch checks with aggregate max/RMS error summaries, and add a finite-difference directional derivative check for the manual adjoint.
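An aggregate max/RMS summary of the kind mentioned here can be computed as in the sketch below. `ErrorSummary` and `summarizeError` are hypothetical names, and accumulating in double precision is an assumption made for this example rather than something the benchmark is known to do.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct ErrorSummary {
  double maxAbs; // largest absolute difference over all entries
  double rms;    // root-mean-square difference
};

// Compares two equally sized result arrays and reports aggregate errors
// instead of stopping at the first mismatch.
inline ErrorSummary summarizeError(const std::vector<float> &got,
                                   const std::vector<float> &want) {
  assert(got.size() == want.size());
  double maxAbs = 0.0;
  double sumSq = 0.0;
  for (std::size_t i = 0; i < got.size(); ++i) {
    double d = static_cast<double>(got[i]) - static_cast<double>(want[i]);
    maxAbs = std::max(maxAbs, std::fabs(d));
    sumSq += d * d;
  }
  double rms =
      got.empty() ? 0.0 : std::sqrt(sumSq / static_cast<double>(got.size()));
  return {maxAbs, rms};
}
```

Reporting both numbers separates a single outlier (large max, small RMS) from a systematic drift (RMS close to max).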

The finite-difference path perturbs all active momentum inputs (vel, pressure, viscosity, velocity gradients, pressure gradients), recomputes the seeded residual objective on GPU, and compares it with the manual adjoint dot perturbation. Enzyme direct/cached adjoints are still checked against the manual adjoint.
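The directional check can be sketched on a toy objective. Everything here is illustrative: the quadratic `objective` stands in for the seeded residual objective, the central difference is an assumed choice (the text does not say which stencil the benchmark uses), and all function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Toy objective: J(x) = sum_i w[i] * x[i]^2.
inline double objective(const std::vector<double> &x,
                        const std::vector<double> &w) {
  double j = 0.0;
  for (std::size_t i = 0; i < x.size(); ++i)
    j += w[i] * x[i] * x[i];
  return j;
}

// Hand-written adjoint of the toy objective: dJ/dx[i] = 2 * w[i] * x[i].
inline std::vector<double> manualGradient(const std::vector<double> &x,
                                          const std::vector<double> &w) {
  std::vector<double> g(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    g[i] = 2.0 * w[i] * x[i];
  return g;
}

inline double dot(const std::vector<double> &a, const std::vector<double> &b) {
  double s = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i)
    s += a[i] * b[i];
  return s;
}

// Central-difference directional derivative along v:
// (J(x + eps * v) - J(x - eps * v)) / (2 * eps).
inline double fdDirectional(const std::vector<double> &x,
                            const std::vector<double> &w,
                            const std::vector<double> &v, double eps) {
  std::vector<double> plus(x), minus(x);
  for (std::size_t i = 0; i < x.size(); ++i) {
    plus[i] += eps * v[i];
    minus[i] -= eps * v[i];
  }
  return (objective(plus, w) - objective(minus, w)) / (2.0 * eps);
}
```

The check passes when `fdDirectional(x, w, v, eps)` agrees with `dot(manualGradient(x, w), v)` to within a tolerance. Perturbing all inputs along one direction costs two objective evaluations regardless of the number of inputs.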

Also report best and mean timing over multiple rounds for direct/cached forward, manual reverse, and Enzyme direct/cached reverse. This makes small codegen improvements from pointer-table load forwarding easier to validate under GPU timing noise.

Validation: optimized and default CUDA executables rebuilt; static direct derivative loads are 95/42 pointer loads by default and 85/36 with pointer-table forwarding; runtime checks passed on 32768 cells / 131072 faces and 262144 cells / 1048576 faces.

Add a foldable atomic AD microcase to the CUDA FVM benchmark. The source performs two active loads from the same cell separated by an output store that may alias the input, so the unoptimized derivative emits two shadow atomic adds to the same input adjoint. With the CUDA repeated-load/atomic folding flags, the generated derivative folds those two atomics into one.

The benchmark now validates this microcase at runtime against a manual gradient before running the realistic FVM checks. This keeps the real FVM case honest: FVM owner/neighbor residual assembly still has distinct scatter addresses and therefore does not reduce atomics, while the foldable case proves the atomic reduction path is exercised and correct.

Also tighten cuda-fold-atomic-add.ll to check the exact atomic count: two atomics with the flag off and one atomic with the flag on.

Validation: llvm-lit cuda-fold-atomic-add.ll and cuda_repeated_atomic_fadd.ll passed; CUDA benchmark default and optimized builds passed on 32768 cells / 131072 faces and 262144 cells / 1048576 faces. Static IR counts show atomic_fold goes from 2 atomics to 1, while FVM direct stays at 40 atomics and optimized FVM direct load count stays matched to cached.
