Avoid atomics for CUDA elementwise shadow updates #2

Closed. minansys wants to merge 1 commit into main from fix/cuda-elementwise-no-atomic

Conversation

minansys (Owner) commented Jun 2, 2026

Summary

This PR centralizes the decision of whether Enzyme's reverse-mode shadow accumulation uses atomics in DiffeGradientUtils::shouldUseAtomicShadowUpdate, and makes the CUDA elementwise-read optimization explicit.

For enzyme_elementwise_read functions, each CUDA work item is expected to accumulate into a distinct shadow location, so Enzyme can emit a normal load/add/store instead of a generated atomicrmw fadd. Conservative cases still keep atomics, including unannotated CUDA paths, shared-memory reductions, and unknown aliasing cases.
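The cases listed above can be sketched as a small predicate. This is only an illustration: the struct ShadowUpdateInfo, its fields, and the function needsAtomicShadowUpdate are hypothetical names, and the actual signature and inputs of DiffeGradientUtils::shouldUseAtomicShadowUpdate are not shown in this description.

```cpp
// Hypothetical model of the cases named in this PR description.
// None of these names come from Enzyme itself.
struct ShadowUpdateInfo {
  bool elementwiseReadAnnotated; // function carries enzyme_elementwise_read
  bool shadowInSharedMemory;     // shadow lives in CUDA shared memory
  bool aliasingKnown;            // shadow locations are known to be distinct
};

// Returns true when the generated shadow update should stay atomic.
bool needsAtomicShadowUpdate(const ShadowUpdateInfo &info) {
  if (!info.elementwiseReadAnnotated)
    return true; // unannotated CUDA path: keep atomicrmw fadd
  if (info.shadowInSharedMemory)
    return true; // shared-memory reductions may collide
  if (!info.aliasingKnown)
    return true; // unknown aliasing: stay conservative
  return false;  // distinct location per work item: plain load/add/store
}
```

The predicate defaults to atomics and only relaxes when every conservative condition is ruled out, which mirrors the intent stated above.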

Changes

  • Add a named helper for deciding when generated shadow updates need atomics.
  • Add cuda-elementwise-atomic.ll to check both sides of the behavior:
    • annotated CUDA elementwise read emits no atomicrmw
    • unannotated CUDA read still emits atomicrmw fadd
  • Add a CUDA benchmark input that validates gradient correctness and compares unannotated atomic versus annotated elementwise kernels.
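As a rough CPU-side analogue of what the benchmark compares (not the CUDA benchmark added in this PR), the sketch below accumulates the gradient of y[i] = x[i] * x[i] in two ways: through an atomic add, and through a plain load/add/store that is only safe because each index writes a distinct shadow slot. The function names gradAtomic, gradElementwise, and atomicAddFloat are assumptions made for illustration.

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

// Atomic-style accumulation: correct even if several writers share a slot.
inline void atomicAddFloat(std::atomic<float> &slot, float value) {
  float old = slot.load(std::memory_order_relaxed);
  // On failure compare_exchange_weak reloads `old`, so the loop retries
  // with the freshly observed value.
  while (!slot.compare_exchange_weak(old, old + value,
                                     std::memory_order_relaxed)) {
  }
}

// Reverse pass of y[i] = x[i] * x[i]: d_x[i] += 2 * x[i] * d_y[i].
std::vector<float> gradAtomic(const std::vector<float> &x,
                              const std::vector<float> &dy) {
  std::vector<std::atomic<float>> shadow(x.size());
  for (auto &s : shadow)
    s.store(0.0f); // explicit zero-init of every shadow slot
  for (std::size_t i = 0; i < x.size(); ++i)
    atomicAddFloat(shadow[i], 2.0f * x[i] * dy[i]);
  std::vector<float> out(x.size());
  for (std::size_t i = 0; i < x.size(); ++i)
    out[i] = shadow[i].load();
  return out;
}

// Elementwise accumulation: plain load/add/store, valid only because
// index i is the sole writer of shadow[i].
std::vector<float> gradElementwise(const std::vector<float> &x,
                                   const std::vector<float> &dy) {
  std::vector<float> shadow(x.size(), 0.0f);
  for (std::size_t i = 0; i < x.size(); ++i)
    shadow[i] = shadow[i] + 2.0f * x[i] * dy[i];
  return shadow;
}
```

A correctness check in this style compares the two result vectors element by element before any timing is trusted.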

Validation

  • Formatted changed C++/CUDA files with /home/minxu/code/enzyme/.tools/bin/clang-format-16.
  • Rebuilt Enzyme:
    • cmake --build build/enzyme-llvmorg-19.1.7 -- -j 8
  • New focused lit regression:
    • python3 deps/llvm-project/llvm/utils/lit/lit.py -sv build/enzyme-llvmorg-19.1.7/test --filter 'Enzyme/ReverseMode/cuda-elementwise-atomic\.ll$'
  • Direct generated-IR checks for existing behavior:
    • elementwise-read.ll: no atomicrmw
    • cuda.ll: atomicrmw fadd present
    • sharedmem.ll: atomicrmw fadd present
    • atomicfadd.ll: atomicrmw volatile fadd present
  • CUDA benchmark/smoke on NVIDIA GeForce RTX 3070 Laptop GPU, sm_86:
    • scripts/build_enzyme_cuda.sh --skip-checkout --skip-build --skip-enzyme-tests
    • Run 1: n=4194304 reps=50 atomic_ms=0.592671 elementwise_ms=0.572068 speedup=1.036x
    • Run 2: n=4194304 reps=50 atomic_ms=0.590989 elementwise_ms=0.546058 speedup=1.082x
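The reported speedups are consistent with the ratio atomic_ms / elementwise_ms. A minimal sketch of that arithmetic, assuming this is how the benchmark derives the figure (the helper name speedup is hypothetical):

```cpp
// Ratio of the atomic kernel time to the elementwise kernel time.
// Values above 1.0 mean the elementwise kernel was faster.
double speedup(double atomicMs, double elementwiseMs) {
  return atomicMs / elementwiseMs;
}
```

With the timings above, speedup(0.592671, 0.572068) rounds to 1.036 and speedup(0.590989, 0.546058) rounds to 1.082, matching Run 1 and Run 2.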

Notes

  • The local check-enzyme target is misconfigured in this workspace and invokes /llvm-lit, which does not exist.
  • After symlinking the built FileCheck into the LLVM install prefix, a broad .ll lit run with --max-failures=20 still failed. The failures came from existing LLVM 19 opaque-pointer FileCheck spelling mismatches and a missing not helper in the local lit environment. The new regression passed when run on its own.

minansys (Owner, Author) commented Jun 2, 2026

Superseded by #3, which implements the repeated CUDA global-load root fix instead of the earlier elementwise atomic approach.

@minansys minansys closed this Jun 2, 2026
@minansys minansys deleted the fix/cuda-elementwise-no-atomic branch June 2, 2026 11:32