SSA Phi Explosion: VGPR Waste and Spills When Large Accumulators Cross scf.IfOp Boundaries
TL;DR — When a large MFMA accumulator (32 x f32x4 = 128 VGPRs) is yielded through scf.IfOp phi from structurally different branches, LLVM's register allocator assigns disjoint VGPR ranges per branch. This caused 71 unique MFMA dst groups (ideal: 32) and 170 scratch spills in a production MLA kernel. The hand-written HK implementation of the same algorithm has 0 spills because it pins the accumulator to fixed VGPRs. Neither memref_alloca(Register) nor raising LLVM coalescer thresholds helps — the interference is fundamental. FlyDSL's memref_alloca(Register) already provides the right user-facing API, but the LLVM lowering must be changed so the alloca survives mem2reg and maps to a physical register without creating phi nodes.
1. Problem Observed in Production
The Kernel
kn_mla_fwd_decode_m16x8_fp8_fp8 — an MLA (Multi-head Latent Attention) forward-decode kernel for nhead=128, fp8 on gfx942 (MI300X), implemented in FlyDSL. The kernel uses 8 warps / 512 threads in a persistent-thread architecture with multi-tile KV processing via scf.ForOp.
What Happened
The kernel's inner loop contains an scf.IfOp for boundary checking (tiles near the end of a sequence require masked loads). Both branches call _process_tile(), which runs GEMM1 (24 MFMAs) and GEMM2 (32 MFMAs), producing a 128-VGPR output accumulator (oaccu: 32 x f32x4). The full oaccu is yielded through scf.yield from each branch, creating 32 phi nodes of type <4 x float> at the merge point.
Measured Impact
| Configuration | Unique MFMA dst groups | VGPR count | Scratch spills |
| --- | --- | --- | --- |
| 4 branch instances (FlyDSL) | 49 (ideal: 32) | 256 | 0 |
| 6 branch instances (FlyDSL) | 71 (ideal: 32) | 256 | 170 |
| Hand-written asm (HK, 12 instances) | 34 | ~180 | 0 |
With 6 branch instances, LLVM produced 71 unique MFMA destination register groups — more than twice the ideal 32 — and 170 scratch spills (132 scratch_load + 65 scratch_store). Each scratch access is a global memory operation costing hundreds of cycles, versus zero for a register move. Inside a hot inner loop, this was catastrophic for performance.
By contrast, the hand-written HipKittens (HK) implementation of the same algorithm handles 12 branch instances with only 34 unique MFMA dst groups and zero spills. The programmer pins the accumulator to a fixed VGPR range (v[128:255]) across all branches — something that SSA IR cannot express.
Current Workaround
We restructured the kernel so that GEMM2 (the 128-VGPR accumulator producer) executes after the scf.IfOp merge, not inside the branches. Only 5 small values (7 VGPRs total) now cross the phi. This eliminated all spills.
But this is a workaround, not a fix. It required significant manual code restructuring, limits how the kernel can be written, and is fragile — LLVM's tail duplication pass may undo it by pulling post-merge code back into the branches. The fundamental issue remains.
2. Background: SSA and Phi Nodes
What Is SSA?
Static Single Assignment (SSA) form is a naming discipline used in virtually all modern compiler IRs — LLVM IR, MLIR, GCC GIMPLE, V8 Turbofan — where every variable is defined exactly once. This single-definition property makes compiler analyses dramatically simpler: use-def chains are trivial, dead code elimination is easy, and constant propagation is straightforward.
But when a variable needs different values depending on which branch was taken, SSA has a problem — two definitions reach the merge point. SSA resolves this with phi functions:
x1 = 1
if (cond):
    x2 = 2            // definition 1
else:
    x3 = 3            // definition 2
x4 = phi(x2, x3)      // "select whichever branch we came from"
y = x4 + 1
The phi function x4 = phi(x2, x3) is not a real machine instruction — it is a notational device that the compiler must eventually eliminate by inserting register copies or by coalescing registers. The phi is the only way to merge values from different control flow paths in SSA. There is no "mutation" or "pass-by-reference" for SSA registers.
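The selection a phi performs can be mimicked in ordinary code by remembering which predecessor executed. Below is a minimal Python sketch of the example above; the `env`/`pred` bookkeeping is invented purely for illustration and is not how any compiler represents SSA:

```python
# Toy model of SSA phi semantics: every name is assigned exactly once,
# and the "phi" selects a definition based on the predecessor taken.
def eval_ssa(cond):
    env = {"x1": 1}
    if cond:
        env["x2"] = 2        # definition 1
        pred = "then"
    else:
        env["x3"] = 3        # definition 2
        pred = "else"
    # x4 = phi(x2, x3): pick whichever branch we came from
    env["x4"] = env["x2"] if pred == "then" else env["x3"]
    return env["x4"] + 1     # y = x4 + 1
```

The `pred` variable is exactly the information a real machine does not track at runtime, which is why the compiler must lower the phi into copies or coalesced registers instead.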
MLIR's scf.IfOp: Structured SSA with Regions
MLIR uses scf.IfOp with scf.yield to express if/else with value results. Values defined inside a region cannot escape that region — the only way to export a value is via scf.yield, and the yielded values become the results of the scf.if operation:
%result = scf.if %cond -> (vector<4xf32>) {
  %val_a = ... compute in then-branch ...
  scf.yield %val_a : vector<4xf32>
} else {
  %val_b = ... compute in else-branch ...
  scf.yield %val_b : vector<4xf32>
}
// %result = phi(%val_a, %val_b) — semantically a phi node
When lowered to LLVM IR, each scf.yield result becomes a phi instruction at the merge block. The scf.if results are phi nodes — there is no way to avoid them within SSA.
3. Root Cause: Why SSA Phi Causes the Explosion
The Mechanism (Step by Step)
Consider 32 x f32x4 (128 VGPRs) yielded through scf.IfOp phi from two structurally different branches:
Step 1 — PHI Elimination. LLVM's PHIElimination pass converts each phi to COPY instructions:
Before:

BB_then:
  %val_a_0 = MFMA(...)
  ...
  %val_a_31 = MFMA(...)
  br BB_merge

BB_else:
  %val_b_0 = MFMA(...)
  ...
  %val_b_31 = MFMA(...)
  br BB_merge

BB_merge:
  %r_0 = PHI(%val_a_0, %val_b_0)
  ...
  %r_31 = PHI(%val_a_31, %val_b_31)

After:

BB_then:
  %val_a_0 = MFMA(...)
  ...
  %val_a_31 = MFMA(...)
  COPY %inc_0, %val_a_0    // 32 copies
  ...
  COPY %inc_31, %val_a_31
  br BB_merge

BB_else:
  %val_b_0 = MFMA(...)
  ...
  %val_b_31 = MFMA(...)
  COPY %inc_0, %val_b_0
  ...
  COPY %inc_31, %val_b_31
  br BB_merge

BB_merge:
  // use %inc_0 ... %inc_31
All 32 %inc_i VRegs are live simultaneously at the branch exit (all copies execute before the branch), and they remain live through the merge block.
Step 2 — Register Coalescing. The coalescer tries to merge the COPYs (e.g., assign %val_a_0 and %inc_0 to the same physical register so the COPY can be deleted). It may succeed for some copies, but interference prevents complete coalescing:
- All 32 MFMA outputs %val_a_0..31 are live simultaneously within each branch (they must all survive to the copy point at the branch exit).
- All 32 %inc_0..31 are live simultaneously (they survive from the copy point to the merge block).
- On AMDGPU, <4 x float> occupies a VGPR_128 tuple with stride-1 subreg indexing: v[0:3] and v[1:4] share register units v1, v2, v3. Two f32x4 VRegs that are live at the same time cannot occupy overlapping tuples, which means each f32x4 effectively blocks 4 consecutive VGPRs from any other simultaneous f32x4 allocation.
The coalescer processes copies one at a time. When it processes COPY %inc_0, %val_a_0, it tries to merge their live intervals. But %val_a_0's live range overlaps with %inc_1..31 (all live simultaneously), so %val_a_0 must be assigned to a physical register that doesn't conflict with any of the already-assigned %inc_j registers. This constraint gets progressively harder to satisfy as more copies are processed: the interference graph becomes denser, and eventually the coalescer fails for many of the 32 copies.
Step 3 — Greedy Allocation Divergence. The greedy register allocator processes VRegs by priority (longer intervals first). The 32 %inc_i intervals (spanning predecessor → merge) get assigned first, consuming v[0:3] through v[124:127]. Then, within each branch, the MFMA outputs try to coalesce into these same registers. But the two branches have different live range contexts (different computation structures create different interference patterns), so the allocator makes independent choices per branch. Branch A's MFMAs may coalesce to v[0:3]..v[124:127], but Branch B's MFMAs — facing different interference — get assigned to v[128:131]..v[252:255].
Result: Instead of the ideal 32 unique MFMA dst groups (both branches sharing the same VGPRs), we get 76 unique groups. Each branch uses largely disjoint register ranges, effectively doubling register consumption at the merge point.
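The doubling can be seen with a back-of-envelope VGPR accounting. This is a toy model of the pressure described above, not the allocator's actual computation; the function and its parameters are illustrative only:

```python
# Toy VGPR-pressure accounting for a merge point with N accumulator
# values of W VGPRs each, crossed by phi from 2 branches.
def merge_pressure(num_values, vgprs_per_value, branches_share_registers):
    # The %inc_i intervals are always live across the merge.
    inc_vgprs = num_values * vgprs_per_value
    # If per-branch assignments diverge, each branch's MFMA outputs
    # occupy a disjoint range on top of the %inc range at the copy points.
    branch_vgprs = 0 if branches_share_registers else num_values * vgprs_per_value
    return inc_vgprs + branch_vgprs

ideal = merge_pressure(32, 4, True)      # fully coalesced: 128 VGPRs
diverged = merge_pressure(32, 4, False)  # disjoint per-branch ranges: 256 VGPRs
```

The diverged figure matches the 256-VGPR count measured for the FlyDSL kernel, while the hand-written HK version stays near the coalesced figure because its accumulator range is shared by construction.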
Experimental Validation
We built three test kernels (committed on branch jruan/ssa_phi) to isolate the necessary conditions via 2x2 factorial analysis:
| | Identical branches | Different branches |
| --- | --- | --- |
| Small phi (1 x f32x4) | 0 phi, 3 dst groups, 10 VGPRs | 1 phi, 3 dst groups, 10 VGPRs |
| Large phi (32 x f32x4) | 0 phi (branches merged) | 32 phi, 76 dst groups, 134 VGPRs |
- Small + identical (test_ssa_phi_old.py): LLVM merges identical branches via v_cndmask pointer selection. Phi eliminated entirely.
- Small + different (test_ssa_phi_factors.py, Factor B): Phi survives but is trivially coalesced. Only 4 VGPRs cross the merge — no pressure.
- Large + identical (test_ssa_phi_factors.py, Factor A): LLVM again merges branches — identical structure lets it replace the branch with a pointer select. No phi.
- Large + different (test_ssa_phi_divergence.py): Explosion. 32 phi <4 x float>, 76 unique MFMA dst groups, 134 VGPRs.
Conclusion: Both conditions are necessary — large phi width AND structurally different branches. Structural difference prevents LLVM from merging branches (which would eliminate phi entirely), and large phi width amplifies the register waste beyond the allocator's ability to coalesce.
Why LLVM Thresholds Cannot Fix This
We identified two LLVM throttle mechanisms that appeared relevant:
- Coalescer throttle (RegisterCoalescer.cpp): isHighCostLiveInterval() refuses coalescing when a live interval has >= 100 value numbers AND has been visited >= 256 times.
- Eviction cutoff (RegAllocEvictionAdvisor.cpp): EvictInterferenceCutoff = 10 refuses eviction when >= 10 interfering VRegs exist.
We raised all three associated thresholds (two coalescer limits, one eviction cutoff) at runtime via FlyDSL's llvm_options():
with llvm_options({
    'large-interval-size-threshold': 512,
    'large-interval-freq-threshold': 4096,
    'regalloc-eviction-max-interference-cutoff': 128,
}):
    launch_fn = build_ssa_phi_accu(N)
Result: identical output — 76 unique MFMA dst groups, 134 VGPRs. The throttles are not the bottleneck. The coalescer is genuinely trying to coalesce and failing because of true register interference from stride-1 tuple aliasing, not because it gave up early.
4. Why memref_alloca(Register) and Mutable Variables Don't Work
One natural idea is to sidestep phi entirely by using mutable storage — write to a "register variable" from each branch and read it after the merge:
# Attempt: use mutable register storage
buf = fx.memref_alloca(reg_ty, layout) # before IfOp
# Branch A:
fx.memref_store_vec(val_a, buf) # store in branch A
# Branch B:
fx.memref_store_vec(val_b, buf) # store in branch B
# After merge:
result = fx.memref_load_vec(buf) # load — no phi!
This avoids phi in the source IR, but does not solve the problem:
Case 1: mem2reg / SROA Succeeds (Most Common)
LLVM's mem2reg and SROA passes recognize the textbook pattern — one alloca, stores in predecessor blocks, load at the merge — and promote it back to SSA with phi:
; After mem2reg — the alloca is gone, phi is back:
merge:
%result = phi <4 x float> [%val_a, %then], [%val_b, %else]
The mutable storage was a source-level illusion that the compiler erased before register allocation. The exact same phi explosion occurs.
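The promotion itself is mechanical for this pattern. The sketch below is a toy version of what mem2reg does here; the tuple-based mini-IR is invented for illustration, while real mem2reg operates on LLVM IR and places phis using dominance frontiers:

```python
# Toy mem2reg for a single alloca with stores in predecessor blocks
# and one load at the merge: stores vanish, the load becomes a phi.
def promote_single_alloca(blocks, preds, alloca):
    """blocks: {name: [op_tuple, ...]}, preds: {name: [pred names]}."""
    last_store = {}  # block name -> value stored to `alloca` there
    for name, ops in blocks.items():
        for op in ops:
            if op[0] == "store" and op[2] == alloca:
                last_store[name] = op[1]
    promoted = {}
    for name, ops in blocks.items():
        out = []
        for op in ops:
            if op[0] == "store" and op[2] == alloca:
                continue  # the store disappears
            elif op[0] == "load" and op[1] == alloca:
                # the load is replaced by a phi over incoming stored values
                out.append(("phi", [(p, last_store[p]) for p in preds[name]]))
            else:
                out.append(op)
        promoted[name] = out
    return promoted

blocks = {
    "then":  [("store", "%val_a", "%buf")],
    "else":  [("store", "%val_b", "%buf")],
    "merge": [("load", "%buf")],
}
preds = {"then": [], "else": [], "merge": ["then", "else"]}
print(promote_single_alloca(blocks, preds, "%buf")["merge"])
# [('phi', [('then', '%val_a'), ('else', '%val_b')])]
```

Note that the output of the toy pass is precisely the phi we were trying to avoid: the alloca round-trip is erased before the backend ever sees it.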
Case 2: mem2reg Fails
If promotion fails (e.g., the alloca escapes, uses volatile, or has mismatched types), the value stays in actual memory — scratch memory on AMDGPU. Every "read" is a global memory load (~hundreds of cycles), every "write" is a global memory store. This is worse than phi copies.
FlyDSL's memref_alloca(AddressSpace.Register)
FlyDSL provides memref_alloca(AddressSpace.Register), which lowers through fly.make_ptr(register) → llvm.alloca in address space 5 (AMDGPU private/scratch). We verified that mem2reg promotes this identically — the alloca, all stores, and the load disappear, and the same phi reappears.
Summary
| Approach | What happens | Cost |
| --- | --- | --- |
| SSA phi (direct) | Regalloc must coalesce or copy/spill | Copy or spill |
| alloca + mem2reg succeeds | Promoted back to SSA phi | Same as above |
| alloca + mem2reg fails | Value stays in scratch memory | Hundreds of cycles per access (worse) |
| memref_alloca(Register) | mem2reg promotes to phi | Same as direct SSA |
There is no escape from the phi problem within SSA. The issue is not syntactic (phi vs. alloca) but physical: at the hardware level, if two branches want to put different values in the same register, a resolution mechanism is needed at the merge point.
5. How PHI Elimination Works in LLVM and Why It Cannot Fix the Issue
The Pipeline
MLIR scf.IfOp + scf.yield
│ (convert-scf-to-cf)
▼
MLIR cf.cond_br + block arguments
│ (convert-cf-to-llvm → LLVM IR translation)
▼
LLVM IR: phi instructions at merge blocks
│ (PHIElimination pass)
▼
Machine IR: COPY instructions in predecessor blocks
│ (RegisterCoalescer)
▼
Machine IR: COPYs coalesced where possible
│ (Greedy Register Allocator)
▼
Machine IR: physical registers assigned, uncoalesced COPYs remain as v_mov
│ (VirtRegRewriter)
▼
Final ISA
PHI Elimination (PHIElimination.cpp: LowerPHINode)
For each phi node, the pass:
- Creates a new virtual register %inc (same register class as the phi result).
- In each predecessor block: inserts COPY %inc, %src_from_this_branch before the terminator.
- In the merge block: inserts COPY %dest, %inc after all phis, replacing uses of the phi result.
For 32 phi nodes with 2 predecessors, this creates:
- 32 new %inc VRegs
- 64 COPYs in predecessors (32 per branch)
- 32 COPYs in the merge block
- All 32 %inc VRegs are live simultaneously (they all survive from the predecessor to the merge block)
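The counts above follow directly from the per-phi rules. A one-liner makes the scaling explicit (illustrative arithmetic only):

```python
def phi_elimination_cost(num_phis, num_preds):
    """Artifacts PHIElimination creates for num_phis phis with num_preds predecessors."""
    inc_vregs = num_phis                  # one fresh %inc per phi
    pred_copies = num_phis * num_preds    # one COPY per (phi, predecessor) pair
    merge_copies = num_phis               # one COPY %dest, %inc per phi in the merge
    return inc_vregs, pred_copies, merge_copies
```

For the MLA accumulator (32 phis, 2 predecessors) this gives 32 VRegs, 64 predecessor copies, and 32 merge copies: 96 copies the coalescer must then try to delete.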
Register Coalescing: Why It Fails at Scale
The coalescer processes COPYs one at a time, trying to merge source and destination into the same physical register so the COPY can be deleted. For 1 phi <4 x float>, this works perfectly — there is only one %inc VReg, no interference, and all three values (%val_a, %inc, %val_b) get v[0:3]. Zero copies, 3 unique MFMA dst groups, 10 VGPRs.
For 32 phi nodes, the 32 %inc VRegs are all live simultaneously, occupying v[0:3] through v[124:127]. When the coalescer tries to merge %val_a_i into %inc_i's register, %val_a_i's live range (from MFMA def to the copy point) overlaps with other %inc_j values. As explained in Section 3, the dense interference from stride-1 VGPR_128 tuples makes many coalescing attempts fail. Uncoalesced MFMAs get fresh physical registers — producing 76 unique dst groups instead of 32, and 134 VGPRs instead of 10.
The 1-phi case is 3 dst groups / 10 VGPRs. The 32-phi case is 76 dst groups / 134 VGPRs. The scaling is catastrophic.
PHI elimination is doing its job correctly — it faithfully translates phi semantics into copies. The problem is that the resulting register allocation problem is fundamentally intractable when 32 wide values are simultaneously live across a merge point.
6. Proposal
6.1 Current Workaround: Restructure to Avoid Phi on Large Values
Today, the only reliable solution is to restructure the kernel code so that large register groups (MFMA accumulators, etc.) never cross scf.IfOp boundaries:
# BAD: 128 VGPRs through phi — causes explosion
result_types = [f32x4_type] * 32
if_op = scf.IfOp(cond, result_types, has_else=True)
with ir.InsertionPoint(if_op.then_block):
    oaccu_a = gemm2(...)               # 32 x f32x4
    scf.yield(oaccu_a)
with ir.InsertionPoint(if_op.else_block):
    oaccu_b = gemm2(...)               # 32 x f32x4
    scf.yield(oaccu_b)
# 32 phi nodes at merge — explosion

# GOOD: only small values through phi, large compute after merge
result_types = [f32, f32, i64, f32]    # 4 small values
if_op = scf.IfOp(cond, result_types, has_else=True)
with ir.InsertionPoint(if_op.then_block):
    p_pack_a = gemm1_and_pack(...)     # consumes f32x4 within branch
    scf.yield(rm_a, rse_a, p_pack_a, rs_a)
with ir.InsertionPoint(if_op.else_block):
    p_pack_b = gemm1_and_pack(...)
    scf.yield(rm_b, rse_b, p_pack_b, rs_b)
# 4 phi nodes — coalesced perfectly
oaccu = gemm2(if_op.results)           # 128 VGPRs, never enters phi
This workaround is effective but has significant drawbacks:
- Constrains kernel architecture. The programmer must manually reason about which values can/cannot cross branch boundaries — a concern that should be the compiler's job.
- Not always possible. Some algorithms inherently need large accumulators to be conditionally computed (e.g., when the two branches perform fundamentally different matrix multiplications).
- Fragile. LLVM's tail duplication pass may pull the post-merge computation back into the branches, partially undoing the restructuring. We observed this in our MLA kernel — GEMM2 appeared duplicated in the ISA despite being placed after the merge in MLIR.
6.2 Proposed Fix: Mutable Register Semantics Through LLVM
The FlyDSL frontend side is straightforward — memref_alloca(AddressSpace.Register) already provides the right user-facing API (store/load on a register-address-space buffer). The problem is entirely in the LLVM lowering: today this lowers to an llvm.alloca in address space 5, which mem2reg promotes back to SSA with phi, recreating the exact problem we're trying to avoid. The real work is making LLVM honor the "mutable register" semantics — keeping the value in a physical register across branches without creating phi nodes.
What We Need From LLVM
In hand-written assembly (HipKittens), the programmer pins accumulators to fixed VGPR ranges:
; oaccu pinned to v[128:255] for the entire kernel lifetime
; Branch A:
v_mfma_f32_16x16x32_fp8_fp8 v[128:131], ..., v[128:131] ; write in-place
; Branch B:
v_mfma_f32_16x16x32_fp8_fp8 v[128:131], ..., v[128:131] ; same registers
; After merge: v[128:131] contains the result. No phi, no copies.
We need the same guarantee but without hardcoding the register number. The allocator is free to choose which register — it just cannot choose different registers per branch. The key constraint is: the same logical accumulator must map to the same physical register across all branches, and both branches can mutate it in place without creating new SSA definitions.
What This Looks Like at Each Level
FlyDSL user code (no API changes needed — reuse memref_alloca):
oaccu = fx.memref_alloca(reg_ty, layout)   # already exists today
# Inside scf.IfOp — both branches mutate the SAME register
with then_branch:
    fx.memref_store_vec(mfma(k_a, q, fx.memref_load_vec(oaccu)), oaccu)
with else_branch:
    fx.memref_store_vec(mfma(k_b, q, fx.memref_load_vec(oaccu)), oaccu)
# After merge — no phi needed, oaccu is still valid
result = fx.memref_load_vec(oaccu)
Today's LLVM lowering (broken — mem2reg undoes it):
memref_alloca(Register)
→ fly.make_ptr(register)
→ llvm.alloca addrspace(5) // address space 5 = AMDGPU private
→ mem2reg promotes to SSA
→ phi nodes reappear // back to square one
What we need — the alloca must survive mem2reg and be lowered to a register after SSA optimizations but before register allocation:
memref_alloca(Register)
→ fly.make_ptr(register)
→ llvm.alloca addrspace(5) + "no-promote" attribute
→ mem2reg SKIPS this alloca // key change
→ [all SSA passes run normally — loads/stores remain]
→ late FlyDSL pass: replace alloca with a pinned VReg
→ register allocator: assigns one physical register for all defs
→ no phi, no copies, no spills
Possible LLVM Implementation Strategies
| Strategy | Where | Approach | Pros | Cons |
| --- | --- | --- | --- | --- |
| Opaque alloca | FlyDSL lowering | Mark alloca with metadata/attribute that mem2reg skips; late pass lowers to register | No upstream LLVM changes; works within existing infra | Late lowering may miss optimization opportunities; risk of falling to scratch if the late pass fails |
| Tied MFMA intrinsic | LLVM backend | New intrinsic variant with explicit tied constraint between accumulator input/output | Register allocator guarantees same register; fully visible to scheduler | Requires upstream LLVM changes per MFMA variant; doesn't generalize beyond MFMA |
| Register-pinned VReg | LLVM regalloc | New machine-level pseudo: "this VReg must map to the same physical register at all def points" | Directly expresses the constraint; generalizes to any wide value | Requires new register allocator support; significant LLVM change |
| Predicated execution | FlyDSL lowering | Lower scf.IfOp to predicated instructions (both paths execute, results masked by exec) | Eliminates branches entirely; no phi at all | Wastes compute on the not-taken path; not always semantically valid (side effects) |
The opaque alloca approach is the most pragmatic starting point: it requires no upstream LLVM changes and can be implemented within FlyDSL's lowering pipeline. The key question is how to prevent mem2reg from promoting it. Options include:
- Custom metadata: attach !fly.no_promote metadata to the alloca. Requires a small mem2reg patch to check for it.
- Volatile load/store: use volatile to prevent promotion (LLVM already respects this). A late FlyDSL pass strips the volatile flag and replaces with register ops. Drawback: volatile also prevents instruction reordering during optimization.
- Address space trick: use a custom address space (not 5) that mem2reg does not recognize as promotable, then lower to register in a late pass.
The tied MFMA intrinsic is the cleanest long-term solution. LLVM already has precedent: int_amdgcn_wmma_f16_16x16x16_f16_tied ties the accumulator input to output for WMMA instructions. Extending this to all MFMA variants would let the register allocator directly enforce same-register constraints without any alloca machinery.
Impact on Kernel Development
With mutable register semantics, the MLA kernel could be written naturally:
oaccu = fx.memref_alloca(reg_ty, layout)       # existing API
oaccu.store(zeros)
for tile in kv_tiles:
    if needs_boundary_check(tile):
        process_tile_with_bounds(tile, oaccu)  # mutates oaccu in place
    else:
        process_tile_no_bounds(tile, oaccu)    # mutates oaccu in place
store_output(oaccu.load())
Both branches write to the same physical registers. No phi nodes, no register divergence, no spills. The compiler is free to schedule instructions and choose registers — it just cannot create separate register assignments per branch for the accumulator.
This matches the mental model that GPU kernel programmers already have: accumulators are long-lived, mutable state that persists across branches and loop iterations. SSA's single-definition rule is the wrong abstraction for GPU MFMA accumulators that span 128 VGPRs.
Reproduction
Test kernels are committed on branch jruan/ssa_phi of the FlyDSL repo:
# Run all three experiments
cd /jruan/ws/FlyDSL
git checkout jruan/ssa_phi
# 1. SSA phi explosion (32 x f32x4 through phi, different branches)
# Expected: 32 phi <4 x float>, 76 unique MFMA dst groups, 134 VGPRs
FLYDSL_DUMP_IR=1 python tests/kernels/test_ssa_phi_divergence.py
# 2. No explosion baseline (1 x i64 through phi, identical branches)
# Expected: 0 phi <4 x float>, 3 unique MFMA dst groups, 10 VGPRs
FLYDSL_DUMP_IR=1 python tests/kernels/test_ssa_phi_old.py
# 3. Factorial analysis (isolate which factor is necessary)
# Factor A (large phi + identical branches): LLVM merges branches, no explosion
# Factor B (small phi + different branches): phi survives but trivially coalesced
FLYDSL_DUMP_IR=1 python tests/kernels/test_ssa_phi_factors.py
# Inspect generated artifacts
cat ~/.flydsl/debug/ssa_phi_accu_kernel_0/16_llvm_ir.ll | grep "phi <4 x float>"
grep 'v_mfma' ~/.flydsl/debug/ssa_phi_accu_kernel_0/17_final_isa.s | \
awk '{for(i=1;i<=NF;i++) if($i~/^v\[/) print $i}' | sort -u | wc -l
grep 'vgpr_spill_count' ~/.flydsl/debug/ssa_phi_accu_kernel_0/17_final_isa.s
References
- LLVM PHIElimination.cpp: LowerPHINode() — creates COPY instructions from phi nodes
- LLVM RegisterCoalescer.cpp: joinVirtRegs() — attempts to merge phi copies; throttled by isHighCostLiveInterval() at 100 valnos / 256 visits (not the bottleneck)
- LLVM RegAllocEvictionAdvisor.cpp: EvictInterferenceCutoff = 10 — limits eviction attempts (not the bottleneck)
- LLVM RegAllocGreedy.cpp: greedy allocator processes longest intervals first, causing branch divergence
- LLVM Issue #131954: Large MFMA tiles + software pipelining spills (related AMDGPU register pressure issue)
- LLVM PR #145024: AMDGPU VGPR-to-AGPR MFMA rewrite pass
- FlyDSL report: mlir_llvm_ssa_phi_restriction.md (detailed ISA analysis of the MLA kernel)
SSA Phi Explosion: VGPR Waste and Spills When Large Accumulators Cross
scf.IfOpBoundariesTL;DR — When a large MFMA accumulator (32 x f32x4 = 128 VGPRs) is yielded through
scf.IfOpphi from structurally different branches, LLVM's register allocator assigns disjoint VGPR ranges per branch. This caused 71 unique MFMA dst groups (ideal: 32) and 170 scratch spills in a production MLA kernel. The hand-written HK implementation of the same algorithm has 0 spills because it pins the accumulator to fixed VGPRs. Neithermemref_alloca(Register)nor raising LLVM coalescer thresholds helps — the interference is fundamental. FlyDSL'smemref_alloca(Register)already provides the right user-facing API, but the LLVM lowering must be changed so the alloca survivesmem2regand maps to a physical register without creating phi nodes.1. Problem Observed in Production
The Kernel
kn_mla_fwd_decode_m16x8_fp8_fp8— an MLA (Multi-head Latent Attention) forward-decode kernel fornhead=128, fp8on gfx942 (MI300X), implemented in FlyDSL. The kernel uses 8 warps / 512 threads in a persistent-thread architecture with multi-tile KV processing viascf.ForOp.What Happened
The kernel's inner loop contains an
scf.IfOpfor boundary checking (tiles near the end of a sequence require masked loads). Both branches call_process_tile(), which runs GEMM1 (24 MFMAs) and GEMM2 (32 MFMAs), producing a 128-VGPR output accumulator (oaccu: 32 x f32x4). The fulloaccuis yielded throughscf.yieldfrom each branch, creating 32 phi nodes of type<4 x float>at the merge point.Measured Impact
With 6 branch instances, LLVM produced 71 unique MFMA destination register groups — more than twice the ideal 32 — and 170 scratch spills (132
scratch_load+ 65scratch_store). Each scratch access is a global memory operation costing hundreds of cycles, versus zero for a register move. Inside a hot inner loop, this was catastrophic for performance.By contrast, the hand-written HipKittens (HK) implementation of the same algorithm handles 12 branch instances with only 34 unique MFMA dst groups and zero spills. The programmer pins the accumulator to a fixed VGPR range (
v[128:255]) across all branches — something that SSA IR cannot express.Current Workaround
We restructured the kernel so that GEMM2 (the 128-VGPR accumulator producer) executes after the
scf.IfOpmerge, not inside the branches. Only 5 small values (7 VGPRs total) now cross the phi. This eliminated all spills.But this is a workaround, not a fix. It required significant manual code restructuring, limits how the kernel can be written, and is fragile — LLVM's tail duplication pass may undo it by pulling post-merge code back into the branches. The fundamental issue remains.
2. Background: SSA and Phi Nodes
What Is SSA?
Static Single Assignment (SSA) form is a naming discipline used in virtually all modern compiler IRs — LLVM IR, MLIR, GCC GIMPLE, V8 Turbofan — where every variable is defined exactly once. This single-definition property makes compiler analyses dramatically simpler: use-def chains are trivial, dead code elimination is easy, and constant propagation is straightforward.
But when a variable needs different values depending on which branch was taken, SSA has a problem — two definitions reach the merge point. SSA resolves this with phi functions:
The phi function
x4 = phi(x2, x3)is not a real machine instruction — it is a notational device that the compiler must eventually eliminate by inserting register copies or by coalescing registers. The phi is the only way to merge values from different control flow paths in SSA. There is no "mutation" or "pass-by-reference" for SSA registers.MLIR's
scf.IfOp: Structured SSA with RegionsMLIR uses
scf.IfOpwithscf.yieldto express if/else with value results. Values defined inside a region cannot escape that region — the only way to export a value is viascf.yield, and the yielded values become the results of thescf.ifoperation:When lowered to LLVM IR, each
scf.yieldresult becomes a phi instruction at the merge block. Thescf.ifresults are phi nodes — there is no way to avoid them within SSA.3. Root Cause: Why SSA Phi Causes the Explosion
The Mechanism (Step by Step)
Consider 32 x f32x4 (128 VGPRs) yielded through
scf.IfOpphi from two structurally different branches:Step 1 — PHI Elimination. LLVM's
PHIEliminationpass converts each phi to COPY instructions:All 32
%inc_iVRegs are live simultaneously at the branch exit (all copies execute before the branch), and they remain live through the merge block.Step 2 — Register Coalescing. The coalescer tries to merge the COPYs (e.g., assign
%val_a_0and%inc_0to the same physical register so the COPY can be deleted). It may succeed for some copies, but interference prevents complete coalescing:%val_a_0..31are live simultaneously within each branch (they must all survive to the copy point at the branch exit).%inc_0..31are live simultaneously (they survive from the copy point to the merge block).<4 x float>occupies a VGPR_128 tuple with stride-1 subreg indexing:v[0:3]andv[1:4]share register units v1, v2, v3. Two f32x4 VRegs that are live at the same time cannot occupy overlapping tuples, which means each f32x4 effectively blocks 4 consecutive VGPRs from any other simultaneous f32x4 allocation.The coalescer processes copies one at a time. When it processes
COPY %inc_0, %val_a_0, it tries to merge their live intervals. But%val_a_0's live range overlaps with%inc_1..31(all live simultaneously), so%val_a_0must be assigned to a physical register that doesn't conflict with any of the already-assigned%inc_jregisters. This constraint gets progressively harder to satisfy as more copies are processed: the interference graph becomes denser, and eventually the coalescer fails for many of the 32 copies.Step 3 — Greedy Allocation Divergence. The greedy register allocator processes VRegs by priority (longer intervals first). The 32
%inc_iintervals (spanning predecessor → merge) get assigned first, consumingv[0:3]throughv[124:127]. Then, within each branch, the MFMA outputs try to coalesce into these same registers. But the two branches have different live range contexts (different computation structures create different interference patterns), so the allocator makes independent choices per branch. Branch A's MFMAs may coalesce tov[0:3]..v[124:127], but Branch B's MFMAs — facing different interference — get assigned tov[128:131]..v[252:255].Result: Instead of the ideal 32 unique MFMA dst groups (both branches sharing the same VGPRs), we get 76 unique groups. Each branch uses largely disjoint register ranges, effectively doubling register consumption at the merge point.
Experimental Validation
We built three test kernels (committed on branch
jruan/ssa_phi) to isolate the necessary conditions via 2x2 factorial analysis:test_ssa_phi_old.py): LLVM merges identical branches viav_cndmaskpointer selection. Phi eliminated entirely.test_ssa_phi_factors.py, Factor B): Phi survives but trivially coalesced. Only 4 VGPRs cross the merge — no pressure.test_ssa_phi_factors.py, Factor A): LLVM again merges branches — identical structure lets it replace the branch with a pointer select. No phi.test_ssa_phi_divergence.py): Explosion. 32 phi<4 x float>, 76 unique MFMA dst groups, 134 VGPRs.Conclusion: Both conditions are necessary — large phi width AND structurally different branches. Structural difference prevents LLVM from merging branches (which would eliminate phi entirely), and large phi width amplifies the register waste beyond the allocator's ability to coalesce.
Why LLVM Thresholds Cannot Fix This
We identified two LLVM throttle mechanisms that appeared relevant:

- Coalescer throttle (`RegisterCoalescer.cpp`): `isHighCostLiveInterval()` refuses coalescing when a live interval has >= 100 value numbers AND has been visited >= 256 times.
- Eviction throttle (`RegAllocEvictionAdvisor.cpp`): `EvictInterferenceCutoff = 10` refuses eviction when >= 10 interfering VRegs exist.

We raised all three thresholds at runtime via FlyDSL's `llvm_options()`. Result: identical output — 76 unique MFMA dst groups, 134 VGPRs. The throttles are not the bottleneck. The coalescer is genuinely trying to coalesce and failing because of true register interference from stride-1 tuple aliasing, not because it gave up early.
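For reference, these throttles are exposed as LLVM `cl::opt` flags. The overrides look roughly like the following sketch (the flag names are the upstream LLVM option names backing the thresholds above; the raised values are illustrative, and the exact `llvm_options()` invocation is omitted here):

```
-large-interval-size-threshold=100000            # isHighCostLiveInterval: value-number cap (default 100)
-large-interval-freq-threshold=100000            # isHighCostLiveInterval: visit-count cap
-regalloc-eviction-max-interference-cutoff=1000  # EvictInterferenceCutoff (default 10)
```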
4. Why `memref_alloca(Register)` and Mutable Variables Don't Work

One natural idea is to sidestep phi entirely by using mutable storage — write to a "register variable" from each branch and read it after the merge.
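In LLVM IR terms, the pattern looks like the following simplified sketch (one `<4 x float>` lane of the 32-lane accumulator; function and value names are illustrative):

```llvm
; Mutable "register variable": one alloca, a store in each branch,
; a single load after the merge.
define <4 x float> @mutable_acc(i1 %cond, <4 x float> %a, <4 x float> %b) {
entry:
  %acc = alloca <4 x float>, align 16, addrspace(5)   ; the "register variable"
  br i1 %cond, label %then, label %else
then:
  store <4 x float> %a, ptr addrspace(5) %acc         ; branch A writes
  br label %merge
else:
  store <4 x float> %b, ptr addrspace(5) %acc         ; branch B writes
  br label %merge
merge:
  %v = load <4 x float>, ptr addrspace(5) %acc        ; single read after merge
  ret <4 x float> %v
}
```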
This avoids phi in the source IR, but does not solve the problem:
Case 1: `mem2reg` / SROA Succeeds (Most Common)

LLVM's `mem2reg` and `SROA` passes recognize the textbook pattern — one alloca, stores in predecessor blocks, a load at the merge — and promote it back to SSA with phi. The mutable storage was a source-level illusion that the compiler erased before register allocation. The exact same phi explosion occurs.
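Concretely, `mem2reg` rewrites the alloca pattern into a phi at the merge block (sketch; block and value names illustrative):

```llvm
; After mem2reg: the alloca, both stores, and the load are gone;
; the branch values meet in a phi exactly as if scf.yield had been used.
merge:
  %v = phi <4 x float> [ %a, %then ], [ %b, %else ]
```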
Case 2: `mem2reg` Fails

If promotion fails (e.g., the alloca escapes, is accessed with volatile operations, or has mismatched access types), the value stays in actual memory — scratch memory on AMDGPU. Every "read" is then a global memory load (hundreds of cycles), and every "write" a global memory store. This is worse than phi copies.
FlyDSL's `memref_alloca(AddressSpace.Register)`

FlyDSL provides `memref_alloca(AddressSpace.Register)`, which lowers through `fly.make_ptr(register)` → `llvm.alloca` in address space 5 (AMDGPU private/scratch). We verified that `mem2reg` promotes this identically — the alloca, all stores, and the load disappear, and the same phi reappears.

Summary

Whether the accumulator is held in SSA values or behind `memref_alloca(Register)`, there is no escape from the phi problem within SSA. The issue is not syntactic (phi vs. alloca) but physical: at the hardware level, if two branches want to put different values in the same register, a resolution mechanism is needed at the merge point.
5. How PHI Elimination Works in LLVM and Why It Cannot Fix the Issue
The Pipeline
PHI Elimination (`PHIElimination.cpp`: `LowerPHINode`)

For each phi node, the pass:
- creates a fresh virtual register `%inc` (same register class as the phi result);
- inserts `COPY %inc, %src_from_this_branch` in each predecessor, before the terminator;
- inserts `COPY %dest, %inc` after all phis in the merge block, replacing uses of the phi result.

For 32 phi nodes with 2 predecessors, this creates:
- 32 fresh `%inc` VRegs
- 96 COPY instructions (64 in the predecessors, 32 at the merge)
- a window in which all 32 `%inc` VRegs are live simultaneously (they all survive from the predecessor to the merge block)

Register Coalescing: Why It Fails at Scale
The coalescer processes COPYs one at a time, trying to merge source and destination into the same physical register so the COPY can be deleted. For 1 phi of type `<4 x float>`, this works perfectly — there is only one `%inc` VReg, no interference, and all three values (`%val_a`, `%inc`, `%val_b`) get `v[0:3]`. Zero copies remain, 3 unique MFMA dst groups, 10 VGPRs.
%incVRegs are all live simultaneously, occupyingv[0:3]throughv[124:127]. When the coalescer tries to merge%val_a_iinto%inc_i's register,%val_a_i's live range (from MFMA def to the copy point) overlaps with other%inc_jvalues. As explained in Section 3, the dense interference from stride-1 VGPR_128 tuples makes many coalescing attempts fail. Uncoalesced MFMAs get fresh physical registers — producing 76 unique dst groups instead of 32, and 134 VGPRs instead of 10.The 1-phi case is 3 dst groups / 10 VGPRs. The 32-phi case is 76 dst groups / 134 VGPRs. The scaling is catastrophic.
PHI elimination is doing its job correctly — it faithfully translates phi semantics into copies. The problem is that the resulting register allocation problem is fundamentally intractable when 32 wide values are simultaneously live across a merge point.
6. Proposal
6.1 Current Workaround: Restructure to Avoid Phi on Large Values
Today, the only reliable solution is to restructure the kernel code so that large register groups (MFMA accumulators, etc.) never cross `scf.IfOp` boundaries. This workaround is effective but has significant drawbacks.
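A sketch of the restructuring (FlyDSL-style pseudocode; the helper names `make_masked_view` / `make_full_view` are hypothetical, shown only to illustrate the shape):

```
# Branch only on small values (addresses / masks), so the 128-VGPR oaccu
# never crosses the scf.IfOp and no wide phi is created.
if is_boundary_tile:
    kv_view, mask = make_masked_view(kv, tile)   # only small values are yielded
else:
    kv_view, mask = make_full_view(kv, tile)
oaccu = _process_tile(kv_view, mask, oaccu)      # single call outside the branch
```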
6.2 Proposed Fix: Mutable Register Semantics Through LLVM
The FlyDSL frontend side is straightforward — `memref_alloca(AddressSpace.Register)` already provides the right user-facing API (store/load on a register-address-space buffer). The problem is entirely in the LLVM lowering: today this lowers to an `llvm.alloca` in address space 5, which `mem2reg` promotes back to SSA with phi, recreating the exact problem we're trying to avoid. The real work is making LLVM honor the "mutable register" semantics — keeping the value in a physical register across branches without creating phi nodes.

What We Need From LLVM
In hand-written assembly (HipKittens), the programmer pins accumulators to fixed VGPR ranges:
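Illustrative only — this is not the actual HK source, and the MFMA opcode and operand ranges are assumptions for gfx942:

```
; Both branches target the same fixed accumulator group v[0:3],
; so no merge-time resolution is ever needed.
v_mfma_f32_16x16x32_fp8_fp8 v[0:3], v[32:33], v[34:35], v[0:3]   ; branch A
; ...
v_mfma_f32_16x16x32_fp8_fp8 v[0:3], v[36:37], v[38:39], v[0:3]   ; branch B
```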
We need the same guarantee but without hardcoding the register number. The allocator is free to choose which register — it just cannot choose different registers per branch. The key constraint is: the same logical accumulator must map to the same physical register across all branches, and both branches can mutate it in place without creating new SSA definitions.
What This Looks Like at Each Level
- FlyDSL user code: no API changes needed — reuse `memref_alloca`.
- Today's LLVM lowering: broken — `mem2reg` undoes it.
- What we need: the alloca must survive `mem2reg` and be lowered to a register after SSA optimizations but before register allocation.

Possible LLVM Implementation Strategies
- Opaque alloca: an `alloca` carrying metadata/an attribute that `mem2reg` skips; a late pass lowers it to a register.
- Predication: lower the `scf.IfOp` to predicated instructions (both paths execute, results masked by `exec`).
- Tied MFMA intrinsics: tie the accumulator input to the output so the allocator must keep it in one register group.

The opaque alloca approach is the most pragmatic starting point: it requires no upstream LLVM changes and can be implemented within FlyDSL's lowering pipeline. The key question is how to prevent `mem2reg` from promoting it. Options include:

- Attach `!fly.no_promote` metadata to the alloca. Requires a small `mem2reg` patch to check for it.
- Mark the accesses `volatile` to prevent promotion (LLVM already respects this). A late FlyDSL pass strips the `volatile` flag and replaces the accesses with register ops. Drawback: `volatile` also prevents instruction reordering during optimization.
- Emit a form `mem2reg` does not recognize as promotable, then lower to a register in a late pass.

The tied MFMA intrinsic is the cleanest long-term solution. LLVM already has precedent:
`int_amdgcn_wmma_f16_16x16x16_f16_tied` ties the accumulator input to the output for WMMA instructions. Extending this to all MFMA variants would let the register allocator directly enforce same-register constraints without any alloca machinery.

Impact on Kernel Development
With mutable register semantics, the MLA kernel could be written naturally:
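For example (FlyDSL-style pseudocode; the exact signatures of `memref_alloca`, `_process_tile`, and `is_boundary` are assumptions, shown only to illustrate the shape):

```
# oaccu is a mutable register buffer; both branches mutate it in place,
# so nothing is yielded through the scf.IfOp and no phi is created.
oaccu = memref_alloca((32,), T.f32x4, AddressSpace.Register)
for tile in kv_tiles:
    if is_boundary(tile):
        _process_tile(oaccu, tile, masked=True)    # mutate in place
    else:
        _process_tile(oaccu, tile, masked=False)   # same physical registers
```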
Both branches write to the same physical registers. No phi nodes, no register divergence, no spills. The compiler is free to schedule instructions and choose registers — it just cannot create separate register assignments per branch for the accumulator.
This matches the mental model that GPU kernel programmers already have: accumulators are long-lived, mutable state that persists across branches and loop iterations. SSA's single-definition rule is the wrong abstraction for GPU MFMA accumulators that span 128 VGPRs.
Reproduction
Test kernels are committed on branch `jruan/ssa_phi` of the FlyDSL repo: `test_ssa_phi_old.py`, `test_ssa_phi_factors.py`, and `test_ssa_phi_divergence.py`.

References
- `PHIElimination.cpp`: `LowerPHINode()` — creates COPY instructions from phi nodes
- `RegisterCoalescer.cpp`: `joinVirtRegs()` — attempts to merge phi copies; throttled by `isHighCostLiveInterval()` at 100 valnos / 256 visits (not the bottleneck)
- `RegAllocEvictionAdvisor.cpp`: `EvictInterferenceCutoff = 10` — limits eviction attempts (not the bottleneck)
- `RegAllocGreedy.cpp`: greedy allocator processes longest intervals first, causing branch divergence
- `mlir_llvm_ssa_phi_restriction.md` (detailed ISA analysis of the MLA kernel)