Avoid atomics for CUDA elementwise shadow updates by minansys · Pull Request #2 · minansys/Enzyme

minansys · 2026-06-02T01:56:08Z

Summary

This centralizes the decision of whether Enzyme's reverse-mode shadow accumulation needs an atomic update in DiffeGradientUtils::shouldUseAtomicShadowUpdate, and makes the CUDA elementwise-read optimization explicit.

For enzyme_elementwise_read functions, each CUDA work item is expected to accumulate into a distinct shadow location, so Enzyme can emit a normal load/add/store instead of a generated atomicrmw fadd. Conservative cases still keep atomics, including unannotated CUDA paths, shared-memory reductions, and unknown aliasing cases.
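The precondition for dropping the atomic is that no two work items write to the same shadow address. The following is a minimal host-side C++ sketch of that precondition, not code from this PR: the names writesAreDisjoint, elementwiseIndices, and broadcastIndices are hypothetical, and the index vectors stand in for the shadow location each work item would update.

```cpp
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not part of Enzyme: returns true when every work item
// writes to a different shadow index, so a plain load/add/store cannot race.
bool writesAreDisjoint(const std::vector<std::size_t> &shadowIndexPerWorkItem) {
  std::unordered_set<std::size_t> seen;
  for (std::size_t idx : shadowIndexPerWorkItem) {
    // insert().second is false when idx was already recorded.
    if (!seen.insert(idx).second)
      return false;
  }
  return true;
}

// Elementwise pattern: work item i reads x[i], so its adjoint goes to dx[i].
std::vector<std::size_t> elementwiseIndices(std::size_t n) {
  std::vector<std::size_t> out(n);
  for (std::size_t i = 0; i < n; ++i)
    out[i] = i;
  return out;
}

// Broadcast pattern: every work item reads x[0], so all adjoints hit dx[0].
std::vector<std::size_t> broadcastIndices(std::size_t n) {
  return std::vector<std::size_t>(n, 0);
}

// Usage: the elementwise mapping is race-free, the broadcast mapping is not.
inline void checkExamples() {
  assert(writesAreDisjoint(elementwiseIndices(4)));
  assert(!writesAreDisjoint(broadcastIndices(4)));
}
```

In the broadcast case several work items add into one location, which is the situation where an atomicrmw fadd (or a reduction) is still required.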

Changes

Add a named helper for deciding when generated shadow updates need atomics.
Add cuda-elementwise-atomic.ll to check both sides of the behavior:
- annotated CUDA elementwise read emits no atomicrmw
- unannotated CUDA read still emits atomicrmw fadd
Add a CUDA benchmark input that validates gradient correctness and compares unannotated atomic versus annotated elementwise kernels.
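The policy behind the named helper can be sketched as a small predicate. This is an illustration under stated assumptions, not the signature or body of DiffeGradientUtils::shouldUseAtomicShadowUpdate: the struct ShadowUpdateContext, its fields, and the function name shouldUseAtomicShadowUpdateSketch are hypothetical, it covers only the CUDA cases named in the summary, and it assumes that a conservative condition overrides the annotation.

```cpp
#include <cassert>

// Hypothetical summary of what the real helper would derive from LLVM IR.
struct ShadowUpdateContext {
  bool hasElementwiseReadAttr; // function carries enzyme_elementwise_read
  bool shadowInSharedMemory;   // shadow lives in CUDA shared memory
  bool aliasingUnknown;        // cannot prove per-work-item locations
};

// Sketch of the policy: keep atomics by default and drop them only for
// annotated elementwise reads with no conservative condition present.
bool shouldUseAtomicShadowUpdateSketch(const ShadowUpdateContext &ctx) {
  if (ctx.shadowInSharedMemory)
    return true; // shared-memory reductions keep atomics
  if (ctx.aliasingUnknown)
    return true; // unknown aliasing keeps atomics
  if (!ctx.hasElementwiseReadAttr)
    return true; // unannotated CUDA path keeps atomics
  return false;  // annotated elementwise read: plain load/add/store
}

// Usage: only the fully annotated, non-conservative case drops the atomic.
inline void checkPolicy() {
  ShadowUpdateContext annotated{true, false, false};
  ShadowUpdateContext unannotated{false, false, false};
  assert(!shouldUseAtomicShadowUpdateSketch(annotated));
  assert(shouldUseAtomicShadowUpdateSketch(unannotated));
}
```

Ordering the checks so that atomics are the default means a missed case costs performance rather than correctness.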

Validation

Formatted changed C++/CUDA files with /home/minxu/code/enzyme/.tools/bin/clang-format-16.
Rebuilt Enzyme:
- cmake --build build/enzyme-llvmorg-19.1.7 -- -j 8
New focused lit regression:
- python3 deps/llvm-project/llvm/utils/lit/lit.py -sv build/enzyme-llvmorg-19.1.7/test --filter 'Enzyme/ReverseMode/cuda-elementwise-atomic\.ll$'
Direct generated-IR checks for existing behavior:
- elementwise-read.ll: no atomicrmw
- cuda.ll: atomicrmw fadd present
- sharedmem.ll: atomicrmw fadd present
- atomicfadd.ll: atomicrmw volatile fadd present
CUDA benchmark/smoke on NVIDIA GeForce RTX 3070 Laptop GPU, sm_86:
- scripts/build_enzyme_cuda.sh --skip-checkout --skip-build --skip-enzyme-tests
- Run 1: n=4194304 reps=50 atomic_ms=0.592671 elementwise_ms=0.572068 speedup=1.036x
- Run 2: n=4194304 reps=50 atomic_ms=0.590989 elementwise_ms=0.546058 speedup=1.082x
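The speedup figures follow from dividing the atomic time by the elementwise time. The short C++ sketch below reproduces that arithmetic; the function names speedup and roundTo3 are hypothetical, and rounding to three decimals is an assumption about how the numbers were reported.

```cpp
#include <cassert>
#include <cmath>

// Speedup of the elementwise kernel relative to the atomic kernel.
double speedup(double atomicMs, double elementwiseMs) {
  assert(elementwiseMs > 0.0);
  return atomicMs / elementwiseMs;
}

// Round to three decimals (assumed reporting precision).
double roundTo3(double x) { return std::round(x * 1000.0) / 1000.0; }

// Run 1: roundTo3(speedup(0.592671, 0.572068)) gives 1.036
// Run 2: roundTo3(speedup(0.590989, 0.546058)) gives 1.082
```

Both runs show an improvement of under 10%, and the two results differ by several percentage points, so more repetitions would be needed to pin down the effect size.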

Notes

The local check-enzyme target is misconfigured in this workspace and invokes /llvm-lit, which does not exist.
After symlinking the built FileCheck into the LLVM install prefix, a broad lit run over the .ll tests with --max-failures=20 still failed for two existing reasons: FileCheck spelling mismatches caused by LLVM 19 opaque pointers, and a missing not helper binary in the local lit environment. The new regression passed when run on its own.

minansys · 2026-06-02T02:28:13Z

Superseded by #3, which implements the repeated CUDA global-load root fix instead of the earlier elementwise atomic approach.

Avoid atomics for CUDA elementwise shadow updates

c3f4dd1

minansys closed this Jun 2, 2026

minansys deleted the fix/cuda-elementwise-no-atomic branch June 2, 2026 11:32

minansys wants to merge 1 commit into main from fix/cuda-elementwise-no-atomic
