Enable AllGather Triton Backend by mfrancepillois · Pull Request #799 · ROCm/xla

mfrancepillois · 2026-04-13T11:33:25Z

This PR enables the AllGather Triton backend:

Enhances the common collective emitter to handle the AllGather op (which returns a tuple).
Adds two Triton kernel implementations: a default one and one that uses swizzling when loading data.
Adds e2e tests.
(This support requires the triton-xla atomic operations, which is why the PR is based on top of the branch ci_maxime_allreduce_triton_rocm_elementwise_rocm.)

xla/backends/gpu/codegen/triton/collective_emitter.cc

xla/backends/gpu/codegen/triton/support.cc

xla/backends/gpu/codegen/triton/collective_emitter.h

xla/service/gpu/thunk_emitter.cc

xla/hlo/analysis/indexing_analysis.cc

xla/backends/gpu/codegen/triton/fusion.cc

claude · 2026-04-13T11:43:47Z

Review Summary

This PR extends the collective emitter infrastructure (originally built for AllReduce) to support AllGather via the Triton backend. It adds two kernel implementations (default and swizzled), tuple unpacking for AllGatherStart's (input, output) shape, a new KernelArguments::Create overload, and comprehensive e2e tests. The feature is gated behind xla_gpu_unsupported_use_all_gather_triton_backend (default: false).

Key issues found:

Bug — dtype check bypassed for AllGatherStart: all_gather.shape().element_type() returns TUPLE for AllGatherStart, so the F8/S4 unsupported-type guard never fires. Should use the operand's element type instead.
Bug — inconsistent kMaxBlocksPerGrid constants: Launch dimensions use 32, but signal buffer shape uses 24. Should be a single shared constant.
Correctness — GetTupleElement identity indexing map: The new GTE case in ComputeOutputToInputIndexing maps output shape to the tuple operand, which is semantically incorrect for general GTE operations. Needs a guard or scoping.
Dead code with latent bugs: EmitAllGatherSwizzled hardcodes gather dim=0 and has a potential division-by-zero. Both will break if the swizzled path is enabled.
Logging noise: ~24 new LOG(INFO) calls on common code paths (including non-collective Triton fusions). Should be VLOG(n).
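
The dtype issue in the first item can be illustrated with a minimal, self-contained sketch. It does not use XLA headers: ElemType, MockShape, and both guard functions are hypothetical stand-ins for the real types, and the (input, output) tuple layout is taken from the summary above. It only shows why a guard that reads the result shape sees TUPLE and lets unsupported types through.

```cpp
#include <vector>

// Hypothetical stand-ins for XLA's element types and shapes (simplified).
enum class ElemType { kF32, kF8, kS4, kTuple };

struct MockShape {
  ElemType element_type;
  std::vector<MockShape> tuple_shapes;  // Non-empty only for tuple shapes.
};

// True for the narrow types the review lists as unsupported.
bool IsUnsupportedType(ElemType t) {
  return t == ElemType::kF8 || t == ElemType::kS4;
}

// Problematic pattern: the start op's result is an (input, output) tuple,
// so this guard only ever sees kTuple and never rejects anything.
bool GuardOnResultShape(const MockShape& result) {
  return IsUnsupportedType(result.element_type);
}

// Suggested pattern: inspect the operand's element type instead.
bool GuardOnOperandShape(const MockShape& operand) {
  return IsUnsupportedType(operand.element_type);
}
```

With an S4 operand wrapped in a tuple result, GuardOnResultShape returns false (the unsupported type slips through), while GuardOnOperandShape returns true.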

Details in inline comments.
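
For the kMaxBlocksPerGrid item, one possible shape of the fix is a single constant that both the launch-dimension code and the signal-buffer sizing read. The sketch below is an assumption-laden illustration: the helper names are hypothetical, the value 32 is used only as an example (the summary does not say which of 32 or 24 is intended), and the one-slot-per-(device, block) layout is assumed rather than taken from the PR.

```cpp
#include <algorithm>
#include <cstdint>

// Single source of truth for the grid-size cap. The value is illustrative.
constexpr int64_t kMaxBlocksPerGrid = 32;

// Hypothetical helper: number of blocks to launch for a given tile count.
constexpr int64_t NumBlocksToLaunch(int64_t num_tiles) {
  return std::min(num_tiles, kMaxBlocksPerGrid);
}

// Hypothetical helper: signal slots, assuming one slot per
// (device, block) pair. It uses the same cap as the launch dimensions.
constexpr int64_t NumSignalSlots(int64_t num_devices) {
  return num_devices * kMaxBlocksPerGrid;
}

// The launched block count can never exceed the per-device slot count.
static_assert(NumBlocksToLaunch(1000) <= kMaxBlocksPerGrid,
              "launch dimensions must respect the shared cap");
```

Because both helpers derive from the same constant, changing the cap in one place cannot leave the buffer smaller than the grid.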

Automated review by Claude

i-chaochen · 2026-04-13T20:19:26Z

Wondering whether this branch is based on upstream or xla-0.9.1?

mfrancepillois · 2026-04-14T08:04:45Z

Wondering whether this branch is based on upstream or xla-0.9.1?

This branch is based on ci_maxime_allreduce_triton_rocm_elementwise_rocm because it needs the triton_xla atomic operations to be implemented. That branch, ci_maxime_allreduce_triton_rocm_elementwise_rocm, is in turn based on rocm-jaxlib-v0.9.1.

mfrancepillois · 2026-04-14T08:22:49Z

xla/backends/gpu/codegen/triton/collective_emitter.cc

+  // group_size_m = min(num_pid_m - first_pid_m, GROUP_SIZE_M)
+  // pid_m = first_pid_m + ((tile_id % num_pid_in_group) % group_size_m)
+  // pid_n = (tile_id % num_pid_in_group) // group_size_m
+  mlir::LogicalResult EmitAllGatherSwizzled(int64_t group_size_m) {


Currently, the swizzled kernel is not called but I'm keeping it until the performance evaluation is complete.
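
For readers unfamiliar with the tile ordering in the comment above, here is a minimal host-side sketch of the index arithmetic. The last three formulas come from the diff comment. The definitions of num_pid_in_group, group_id, and first_pid_m are not visible in the excerpt, so they are assumed here to follow the grouped-ordering convention of the Triton matmul tutorial. SwizzleTileId and TileCoord are hypothetical names, not part of the PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical result type: tile coordinates along the M and N dimensions.
struct TileCoord {
  int64_t pid_m;
  int64_t pid_n;
};

// Maps a linear tile id to (pid_m, pid_n) using grouped ("swizzled") order.
TileCoord SwizzleTileId(int64_t tile_id, int64_t num_pid_m, int64_t num_pid_n,
                        int64_t max_group_size_m) {
  // Guard the divisors so the modulo and division below cannot hit zero.
  assert(num_pid_m > 0 && num_pid_n > 0 && max_group_size_m > 0);
  assert(tile_id >= 0 && tile_id < num_pid_m * num_pid_n);

  // Assumed definitions (Triton matmul tutorial convention).
  const int64_t num_pid_in_group = max_group_size_m * num_pid_n;
  const int64_t group_id = tile_id / num_pid_in_group;
  const int64_t first_pid_m = group_id * max_group_size_m;

  // Formulas from the comment in the diff. The last group may be smaller
  // than max_group_size_m when num_pid_m is not a multiple of it.
  const int64_t group_size_m =
      std::min(num_pid_m - first_pid_m, max_group_size_m);
  const int64_t id_in_group = tile_id % num_pid_in_group;
  return {first_pid_m + id_in_group % group_size_m,
          id_in_group / group_size_m};
}
```

With num_pid_m = 4, num_pid_n = 3, and a group size of 2, tile ids 0 through 5 cover rows 0 and 1 across all three columns before tile id 6 moves on to row 2. The range check on tile_id also guarantees that group_size_m is at least 1, which avoids the division by zero mentioned in the review summary.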

mfrancepillois added 7 commits April 10, 2026 09:24

Initial commit

2c23001

Improve robustness + add tests

f175d36

Fix bug in offset calculation

2aaf795

Add swizzled AllGather kernel + add e2e tests on larger buffer

23618ab

Add tile size limitation for large inputs

3144708

Merge fixes

d42ce9a

fix launch dimension + rename function

eeaafd4

mfrancepillois added the claude-review label (Request a Claude AI code review for this PR) Apr 13, 2026

Clang format

3f72abf