SSA Phi Explosion: VGPR Waste and Spills When Large Accumulators Cross scf.IfOp Boundaries
TL;DR — When a large MFMA accumulator (32 x f32x4 = 128 VGPRs) is yielded through scf.IfOp phi from structurally different branches, LLVM's register allocator assigns disjoint VGPR ranges per branch. This caused 71 unique MFMA dst groups (ideal: 32) and 170 scratch spills in a production MLA kernel. The hand-written HK implementation of the same algorithm has 0 spills because it pins the accumulator to fixed VGPRs. Neither memref_alloca(Register) nor raising LLVM coalescer thresholds helps — the interference is fundamental. FlyDSL's memref_alloca(Register) already provides the right user-facing API, but the LLVM lowering must be changed so the alloca survives mem2reg and maps to a physical register without creating phi nodes.
1. Problem Observed in Production
The Kernel
kn_mla_fwd_decode_m16x8_fp8_fp8 — an MLA (Multi-head Latent Attention) forward-decode kernel for nhead=128, fp8 on gfx942 (MI300X), implemented in FlyDSL. The kernel uses 8 warps / 512 threads in a persistent-thread architecture with multi-tile KV processing via scf.ForOp.
What Happened
The kernel's inner loop contains an scf.IfOp for boundary checking (tiles near the end of a sequence require masked loads). Both branches call _process_tile(), which runs GEMM1 (24 MFMAs) and GEMM2 (32 MFMAs), producing a 128-VGPR output accumulator (oaccu: 32 x f32x4). The full oaccu is yielded through scf.yield from each branch, creating 32 phi nodes of type <4 x float> at the merge point.
Measured Impact
| Configuration | Unique MFMA dst groups | VGPR count | Scratch spills |
| --- | --- | --- | --- |
| 4 branch instances (FlyDSL) | 49 (ideal: 32) | 256 | 0 |
| 6 branch instances (FlyDSL) | 71 (ideal: 32) | 256 | 170 |
| Hand-written asm (HK, 12 instances) | 34 | ~180 | 0 |
With 6 branch instances, LLVM produced 71 unique MFMA destination register groups — more than twice the ideal 32 — and 170 scratch spills (132 scratch_load + 65 scratch_store). Each scratch access is a global memory operation costing hundreds of cycles, versus zero for a register move. Inside a hot inner loop, this was catastrophic for performance.
By contrast, the hand-written HipKittens (HK) implementation of the same algorithm handles 12 branch instances with only 34 unique MFMA dst groups and zero spills. The programmer pins the accumulator to a fixed VGPR range (v[128:255]) across all branches — something that SSA IR cannot express.
Current Workaround
We restructured the kernel so that GEMM2 (the 128-VGPR accumulator producer) executes after the scf.IfOp merge, not inside the branches. Only 5 small values (7 VGPRs total) now cross the phi. This eliminated all spills.
But this is a workaround, not a fix. It required significant manual code restructuring, limits how the kernel can be written, and is fragile — LLVM's tail duplication pass may undo it by pulling post-merge code back into the branches. The fundamental issue remains.
2. Background: SSA and Phi Nodes
What Is SSA?
Static Single Assignment (SSA) form is a naming discipline used in virtually all modern compiler IRs — LLVM IR, MLIR, GCC GIMPLE, V8 Turbofan — where every variable is defined exactly once. This single-definition property makes compiler analyses dramatically simpler: use-def chains are trivial, dead code elimination is easy, and constant propagation is straightforward.
But when a variable needs different values depending on which branch was taken, SSA has a problem — two definitions reach the merge point. SSA resolves this with phi functions:
x1 = 1
if (cond):
    x2 = 2            // definition 1
else:
    x3 = 3            // definition 2
x4 = phi(x2, x3)      // "select whichever branch we came from"
y = x4 + 1
The phi function x4 = phi(x2, x3) is not a real machine instruction — it is a notational device that the compiler must eventually eliminate by inserting register copies or by coalescing registers. The phi is the only way to merge values from different control flow paths in SSA. There is no "mutation" or "pass-by-reference" for SSA registers.
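The selection a phi performs can be mimicked in ordinary code by remembering which predecessor executed. Below is a minimal Python sketch of the example above; the `env`/`pred` bookkeeping is invented purely for illustration and is not how any compiler represents SSA:

```python
# Toy model of SSA phi semantics: every name is assigned exactly once,
# and the "phi" selects a definition based on the predecessor taken.
def eval_ssa(cond):
    env = {"x1": 1}
    if cond:
        env["x2"] = 2        # definition 1
        pred = "then"
    else:
        env["x3"] = 3        # definition 2
        pred = "else"
    # x4 = phi(x2, x3): pick whichever branch we came from
    env["x4"] = env["x2"] if pred == "then" else env["x3"]
    return env["x4"] + 1     # y = x4 + 1
```

The `pred` variable is exactly the information a real machine does not track at runtime, which is why the compiler must lower the phi into copies or coalesced registers instead.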
MLIR's scf.IfOp: Structured SSA with Regions
MLIR uses scf.IfOp with scf.yield to express if/else with value results. Values defined inside a region cannot escape that region — the only way to export a value is via scf.yield, and the yielded values become the results of the scf.if operation:
%result = scf.if %cond -> (vector<4xf32>) {
  %val_a = ... compute in then-branch ...
  scf.yield %val_a : vector<4xf32>
} else {
  %val_b = ... compute in else-branch ...
  scf.yield %val_b : vector<4xf32>
}
// %result = phi(%val_a, %val_b) — semantically a phi node
When lowered to LLVM IR, each scf.yield result becomes a phi instruction at the merge block. The scf.if results are phi nodes — there is no way to avoid them within SSA.
3. Root Cause: Why SSA Phi Causes the Explosion
The Mechanism (Step by Step)
Consider 32 x f32x4 (128 VGPRs) yielded through scf.IfOp phi from two structurally different branches:
Step 1 — PHI Elimination. LLVM's PHIElimination pass converts each phi to COPY instructions:
Before:

BB_then:
  %val_a_0 = MFMA(...)
  ...
  %val_a_31 = MFMA(...)
  br BB_merge

BB_else:
  %val_b_0 = MFMA(...)
  ...
  %val_b_31 = MFMA(...)
  br BB_merge

BB_merge:
  %r_0 = PHI(%val_a_0, %val_b_0)
  ...
  %r_31 = PHI(%val_a_31, %val_b_31)

After:

BB_then:
  %val_a_0 = MFMA(...)
  ...
  %val_a_31 = MFMA(...)
  COPY %inc_0, %val_a_0    // 32 copies
  ...
  COPY %inc_31, %val_a_31
  br BB_merge

BB_else:
  %val_b_0 = MFMA(...)
  ...
  %val_b_31 = MFMA(...)
  COPY %inc_0, %val_b_0
  ...
  COPY %inc_31, %val_b_31
  br BB_merge

BB_merge:
  // use %inc_0 ... %inc_31
All 32 %inc_i VRegs are live simultaneously at the branch exit (all copies execute before the branch), and they remain live through the merge block.
Step 2 — Register Coalescing. The coalescer tries to merge the COPYs (e.g., assign %val_a_0 and %inc_0 to the same physical register so the COPY can be deleted). It may succeed for some copies, but interference prevents complete coalescing:
- All 32 MFMA outputs %val_a_0..31 are live simultaneously within each branch (they must all survive to the copy point at the branch exit).
- All 32 %inc_0..31 are live simultaneously (they survive from the copy point to the merge block).
- On AMDGPU, <4 x float> occupies a VGPR_128 tuple with stride-1 subreg indexing: v[0:3] and v[1:4] share register units v1, v2, v3. Two f32x4 VRegs that are live at the same time cannot occupy overlapping tuples, which means each f32x4 effectively blocks 4 consecutive VGPRs from any other simultaneous f32x4 allocation.
The coalescer processes copies one at a time. When it processes COPY %inc_0, %val_a_0, it tries to merge their live intervals. But %val_a_0's live range overlaps with %inc_1..31 (all live simultaneously), so %val_a_0 must be assigned to a physical register that doesn't conflict with any of the already-assigned %inc_j registers. This constraint gets progressively harder to satisfy as more copies are processed: the interference graph becomes denser, and eventually the coalescer fails for many of the 32 copies.
Step 3 — Greedy Allocation Divergence. The greedy register allocator processes VRegs by priority (longer intervals first). The 32 %inc_i intervals (spanning predecessor → merge) get assigned first, consuming v[0:3] through v[124:127]. Then, within each branch, the MFMA outputs try to coalesce into these same registers. But the two branches have different live range contexts (different computation structures create different interference patterns), so the allocator makes independent choices per branch. Branch A's MFMAs may coalesce to v[0:3]..v[124:127], but Branch B's MFMAs — facing different interference — get assigned to v[128:131]..v[252:255].
Result: Instead of the ideal 32 unique MFMA dst groups (both branches sharing the same VGPRs), we get 76 unique groups. Each branch uses largely disjoint register ranges, effectively doubling register consumption at the merge point.
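The doubling can be seen with a back-of-envelope VGPR accounting. This is a toy model of the pressure described above, not the allocator's actual computation; the function and its parameters are illustrative only:

```python
# Toy VGPR-pressure accounting for a merge point with N accumulator
# values of W VGPRs each, crossed by phi from 2 branches.
def merge_pressure(num_values, vgprs_per_value, branches_share_registers):
    # The %inc_i intervals are always live across the merge.
    inc_vgprs = num_values * vgprs_per_value
    # If per-branch assignments diverge, each branch's MFMA outputs
    # occupy a disjoint range on top of the %inc range at the copy points.
    branch_vgprs = 0 if branches_share_registers else num_values * vgprs_per_value
    return inc_vgprs + branch_vgprs

ideal = merge_pressure(32, 4, True)      # fully coalesced: 128 VGPRs
diverged = merge_pressure(32, 4, False)  # disjoint per-branch ranges: 256 VGPRs
```

The diverged figure matches the 256-VGPR count measured for the FlyDSL kernel, while the hand-written HK version stays near the coalesced figure because its accumulator range is shared by construction.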
Experimental Validation
We built three test kernels (committed on branch jruan/ssa_phi) to isolate the necessary conditions via 2x2 factorial analysis:
| | Identical branches | Different branches |
| --- | --- | --- |
| Small phi (1 x f32x4) | 0 phi, 3 dst groups, 10 VGPRs | 1 phi, 3 dst groups, 10 VGPRs |
| Large phi (32 x f32x4) | 0 phi (branches merged) | 32 phi, 76 dst groups, 134 VGPRs |
- Small + identical (test_ssa_phi_old.py): LLVM merges identical branches via v_cndmask pointer selection. Phi eliminated entirely.
- Small + different (test_ssa_phi_factors.py, Factor B): Phi survives but is trivially coalesced. Only 4 VGPRs cross the merge — no pressure.
- Large + identical (test_ssa_phi_factors.py, Factor A): LLVM again merges branches — identical structure lets it replace the branch with a pointer select. No phi.
- Large + different (test_ssa_phi_divergence.py): Explosion. 32 phi <4 x float>, 76 unique MFMA dst groups, 134 VGPRs.
Conclusion: Both conditions are necessary — large phi width AND structurally different branches. Structural difference prevents LLVM from merging branches (which would eliminate phi entirely), and large phi width amplifies the register waste beyond the allocator's ability to coalesce.
Why LLVM Thresholds Cannot Fix This
We identified two LLVM throttle mechanisms that appeared relevant:
- Coalescer throttle (RegisterCoalescer.cpp): isHighCostLiveInterval() refuses coalescing when a live interval has >= 100 value numbers AND has been visited >= 256 times.
- Eviction cutoff (RegAllocEvictionAdvisor.cpp): EvictInterferenceCutoff = 10 refuses eviction when >= 10 interfering VRegs exist.
We raised all three associated thresholds (two coalescer limits, one eviction cutoff) at runtime via FlyDSL's llvm_options():
with llvm_options({
    'large-interval-size-threshold': 512,
    'large-interval-freq-threshold': 4096,
    'regalloc-eviction-max-interference-cutoff': 128,
}):
    launch_fn = build_ssa_phi_accu(N)
Result: identical output — 76 unique MFMA dst groups, 134 VGPRs. The throttles are not the bottleneck. The coalescer is genuinely trying to coalesce and failing because of true register interference from stride-1 tuple aliasing, not because it gave up early.
4. Why memref_alloca(Register) and Mutable Variables Don't Work
One natural idea is to sidestep phi entirely by using mutable storage — write to a "register variable" from each branch and read it after the merge:
# Attempt: use mutable register storage
buf = fx.memref_alloca(reg_ty, layout) # before IfOp
# Branch A:
fx.memref_store_vec(val_a, buf) # store in branch A
# Branch B:
fx.memref_store_vec(val_b, buf) # store in branch B
# After merge:
result = fx.memref_load_vec(buf) # load — no phi!
This avoids phi in the source IR, but does not solve the problem:
Case 1: mem2reg / SROA Succeeds (Most Common)
LLVM's mem2reg and SROA passes recognize the textbook pattern — one alloca, stores in predecessor blocks, load at the merge — and promote it back to SSA with phi:
; After mem2reg — the alloca is gone, phi is back:
merge:
%result = phi <4 x float> [%val_a, %then], [%val_b, %else]
The mutable storage was a source-level illusion that the compiler erased before register allocation. The exact same phi explosion occurs.
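The promotion itself is mechanical for this pattern. The sketch below is a toy version of what mem2reg does here; the tuple-based mini-IR is invented for illustration, while real mem2reg operates on LLVM IR and places phis using dominance frontiers:

```python
# Toy mem2reg for a single alloca with stores in predecessor blocks
# and one load at the merge: stores vanish, the load becomes a phi.
def promote_single_alloca(blocks, preds, alloca):
    """blocks: {name: [op_tuple, ...]}, preds: {name: [pred names]}."""
    last_store = {}  # block name -> value stored to `alloca` there
    for name, ops in blocks.items():
        for op in ops:
            if op[0] == "store" and op[2] == alloca:
                last_store[name] = op[1]
    promoted = {}
    for name, ops in blocks.items():
        out = []
        for op in ops:
            if op[0] == "store" and op[2] == alloca:
                continue  # the store disappears
            elif op[0] == "load" and op[1] == alloca:
                # the load is replaced by a phi over incoming stored values
                out.append(("phi", [(p, last_store[p]) for p in preds[name]]))
            else:
                out.append(op)
        promoted[name] = out
    return promoted

blocks = {
    "then":  [("store", "%val_a", "%buf")],
    "else":  [("store", "%val_b", "%buf")],
    "merge": [("load", "%buf")],
}
preds = {"then": [], "else": [], "merge": ["then", "else"]}
print(promote_single_alloca(blocks, preds, "%buf")["merge"])
# [('phi', [('then', '%val_a'), ('else', '%val_b')])]
```

Note that the output of the toy pass is precisely the phi we were trying to avoid: the alloca round-trip is erased before the backend ever sees it.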
Case 2: mem2reg Fails
If promotion fails (e.g., the alloca escapes, uses volatile, or has mismatched types), the value stays in actual memory — scratch memory on AMDGPU. Every "read" is a global memory load (~hundreds of cycles), every "write" is a global memory store. This is worse than phi copies.
FlyDSL's memref_alloca(AddressSpace.Register)
FlyDSL provides memref_alloca(AddressSpace.Register), which lowers through fly.make_ptr(register) → llvm.alloca in address space 5 (AMDGPU private/scratch). We verified that mem2reg promotes this identically — the alloca, all stores, and the load disappear, and the same phi reappears.
Summary
| Approach | What happens | Cost |
| --- | --- | --- |
| SSA phi (direct) | Regalloc must coalesce or copy/spill | Copy or spill |
| alloca + mem2reg succeeds | Promoted back to SSA phi | Same as above |
| alloca + mem2reg fails | Value stays in scratch memory | Hundreds of cycles per access (worse) |
| memref_alloca(Register) | mem2reg promotes to phi | Same as direct SSA |
There is no escape from the phi problem within SSA. The issue is not syntactic (phi vs. alloca) but physical: at the hardware level, if two branches want to put different values in the same register, a resolution mechanism is needed at the merge point.
5. How PHI Elimination Works in LLVM and Why It Cannot Fix the Issue
The Pipeline
MLIR scf.IfOp + scf.yield
│ (convert-scf-to-cf)
▼
MLIR cf.cond_br + block arguments
│ (convert-cf-to-llvm → LLVM IR translation)
▼
LLVM IR: phi instructions at merge blocks
│ (PHIElimination pass)
▼
Machine IR: COPY instructions in predecessor blocks
│ (RegisterCoalescer)
▼
Machine IR: COPYs coalesced where possible
│ (Greedy Register Allocator)
▼
Machine IR: physical registers assigned, uncoalesced COPYs remain as v_mov
│ (VirtRegRewriter)
▼
Final ISA
PHI Elimination (PHIElimination.cpp: LowerPHINode)
For each phi node, the pass:
- Creates a new virtual register %inc (same register class as the phi result).
- In each predecessor block: inserts COPY %inc, %src_from_this_branch before the terminator.
- In the merge block: inserts COPY %dest, %inc after all phis, replacing uses of the phi result.
For 32 phi nodes with 2 predecessors, this creates:
- 32 new %inc VRegs
- 64 COPYs in predecessors (32 per branch)
- 32 COPYs in the merge block
- All 32 %inc VRegs are live simultaneously (they all survive from the predecessor to the merge block)
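The counts above follow directly from the per-phi rules. A one-liner makes the scaling explicit (illustrative arithmetic only):

```python
def phi_elimination_cost(num_phis, num_preds):
    """Artifacts PHIElimination creates for num_phis phis with num_preds predecessors."""
    inc_vregs = num_phis                  # one fresh %inc per phi
    pred_copies = num_phis * num_preds    # one COPY per (phi, predecessor) pair
    merge_copies = num_phis               # one COPY %dest, %inc per phi in the merge
    return inc_vregs, pred_copies, merge_copies
```

For the MLA accumulator (32 phis, 2 predecessors) this gives 32 VRegs, 64 predecessor copies, and 32 merge copies: 96 copies the coalescer must then try to delete.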
Register Coalescing: Why It Fails at Scale
The coalescer processes COPYs one at a time, trying to merge source and destination into the same physical register so the COPY can be deleted. For 1 phi <4 x float>, this works perfectly — there is only one %inc VReg, no interference, and all three values (%val_a, %inc, %val_b) get v[0:3]. Zero copies, 3 unique MFMA dst groups, 10 VGPRs.
For 32 phi nodes, the 32 %inc VRegs are all live simultaneously, occupying v[0:3] through v[124:127]. When the coalescer tries to merge %val_a_i into %inc_i's register, %val_a_i's live range (from MFMA def to the copy point) overlaps with other %inc_j values. As explained in Section 3, the dense interference from stride-1 VGPR_128 tuples makes many coalescing attempts fail. Uncoalesced MFMAs get fresh physical registers — producing 76 unique dst groups instead of 32, and 134 VGPRs instead of 10.
The 1-phi case is 3 dst groups / 10 VGPRs. The 32-phi case is 76 dst groups / 134 VGPRs. The scaling is catastrophic.
PHI elimination is doing its job correctly — it faithfully translates phi semantics into copies. The problem is that the resulting register allocation problem is fundamentally intractable when 32 wide values are simultaneously live across a merge point.
6. Proposal
6.1 Current Workaround: Restructure to Avoid Phi on Large Values
Today, the only reliable solution is to restructure the kernel code so that large register groups (MFMA accumulators, etc.) never cross scf.IfOp boundaries:
# BAD: 128 VGPRs through phi — causes explosion
result_types = [f32x4_type] * 32
if_op = scf.IfOp(cond, result_types, has_else=True)
with ir.InsertionPoint(if_op.then_block):
    oaccu_a = gemm2(...)               # 32 x f32x4
    scf.yield(oaccu_a)
with ir.InsertionPoint(if_op.else_block):
    oaccu_b = gemm2(...)               # 32 x f32x4
    scf.yield(oaccu_b)
# 32 phi nodes at merge — explosion

# GOOD: only small values through phi, large compute after merge
result_types = [f32, f32, i64, f32]    # 4 small values
if_op = scf.IfOp(cond, result_types, has_else=True)
with ir.InsertionPoint(if_op.then_block):
    p_pack_a = gemm1_and_pack(...)     # consumes f32x4 within branch
    scf.yield(rm_a, rse_a, p_pack_a, rs_a)
with ir.InsertionPoint(if_op.else_block):
    p_pack_b = gemm1_and_pack(...)
    scf.yield(rm_b, rse_b, p_pack_b, rs_b)
# 4 phi nodes — coalesced perfectly
oaccu = gemm2(if_op.results)           # 128 VGPRs, never enters phi
This workaround is effective but has significant drawbacks:
- Constrains kernel architecture. The programmer must manually reason about which values can/cannot cross branch boundaries — a concern that should be the compiler's job.
- Not always possible. Some algorithms inherently need large accumulators to be conditionally computed (e.g., when the two branches perform fundamentally different matrix multiplications).
- Fragile. LLVM's tail duplication pass may pull the post-merge computation back into the branches, partially undoing the restructuring. We observed this in our MLA kernel — GEMM2 appeared duplicated in the ISA despite being placed after the merge in MLIR.
6.2 Proposed Fix: Mutable Register Semantics Through LLVM
The FlyDSL frontend side is straightforward — memref_alloca(AddressSpace.Register) already provides the right user-facing API (store/load on a register-address-space buffer). The problem is entirely in the LLVM lowering: today this lowers to an llvm.alloca in address space 5, which mem2reg promotes back to SSA with phi, recreating the exact problem we're trying to avoid. The real work is making LLVM honor the "mutable register" semantics — keeping the value in a physical register across branches without creating phi nodes.
What We Need From LLVM
In hand-written assembly (HipKittens), the programmer pins accumulators to fixed VGPR ranges:
; oaccu pinned to v[128:255] for the entire kernel lifetime
; Branch A:
v_mfma_f32_16x16x32_fp8_fp8 v[128:131], ..., v[128:131] ; write in-place
; Branch B:
v_mfma_f32_16x16x32_fp8_fp8 v[128:131], ..., v[128:131] ; same registers
; After merge: v[128:131] contains the result. No phi, no copies.
We need the same guarantee but without hardcoding the register number. The allocator is free to choose which register — it just cannot choose different registers per branch. The key constraint is: the same logical accumulator must map to the same physical register across all branches, and both branches can mutate it in place without creating new SSA definitions.
What This Looks Like at Each Level
FlyDSL user code (no API changes needed — reuse memref_alloca):
oaccu = fx.memref_alloca(reg_ty, layout)   # already exists today
# Inside scf.IfOp — both branches mutate the SAME register
with then_branch:
    fx.memref_store_vec(mfma(k_a, q, fx.memref_load_vec(oaccu)), oaccu)
with else_branch:
    fx.memref_store_vec(mfma(k_b, q, fx.memref_load_vec(oaccu)), oaccu)
# After merge — no phi needed, oaccu is still valid
result = fx.memref_load_vec(oaccu)
Today's LLVM lowering (broken — mem2reg undoes it):
memref_alloca(Register)
→ fly.make_ptr(register)
→ llvm.alloca addrspace(5) // address space 5 = AMDGPU private
→ mem2reg promotes to SSA
→ phi nodes reappear // back to square one
What we need — the alloca must survive mem2reg and be lowered to a register after SSA optimizations but before register allocation:
memref_alloca(Register)
→ fly.make_ptr(register)
→ llvm.alloca addrspace(5) + "no-promote" attribute
→ mem2reg SKIPS this alloca // key change
→ [all SSA passes run normally — loads/stores remain]
→ late FlyDSL pass: replace alloca with a pinned VReg
→ register allocator: assigns one physical register for all defs
→ no phi, no copies, no spills
Possible LLVM Implementation Strategies
| Strategy | Where | Approach | Pros | Cons |
| --- | --- | --- | --- | --- |
| Opaque alloca | FlyDSL lowering | Mark alloca with metadata/attribute that mem2reg skips; late pass lowers to register | No upstream LLVM changes; works within existing infra | Late lowering may miss optimization opportunities; risk of falling to scratch if the late pass fails |
| Tied MFMA intrinsic | LLVM backend | New intrinsic variant with explicit tied constraint between accumulator input/output | Register allocator guarantees same register; fully visible to scheduler | Requires upstream LLVM changes per MFMA variant; doesn't generalize beyond MFMA |
| Register-pinned VReg | LLVM regalloc | New machine-level pseudo: "this VReg must map to the same physical register at all def points" | Directly expresses the constraint; generalizes to any wide value | Requires new register allocator support; significant LLVM change |
| Predicated execution | FlyDSL lowering | Lower scf.IfOp to predicated instructions (both paths execute, results masked by exec) | Eliminates branches entirely; no phi at all | Wastes compute on the not-taken path; not always semantically valid (side effects) |
The opaque alloca approach is the most pragmatic starting point: it requires no upstream LLVM changes and can be implemented within FlyDSL's lowering pipeline. The key question is how to prevent mem2reg from promoting it. Options include:
- Custom metadata: attach !fly.no_promote metadata to the alloca. Requires a small mem2reg patch to check for it.
- Volatile load/store: use volatile to prevent promotion (LLVM already respects this). A late FlyDSL pass strips the volatile flag and replaces with register ops. Drawback: volatile also prevents instruction reordering during optimization.
- Address space trick: use a custom address space (not 5) that mem2reg does not recognize as promotable, then lower to register in a late pass.
The tied MFMA intrinsic is the cleanest long-term solution. LLVM already has precedent: int_amdgcn_wmma_f16_16x16x16_f16_tied ties the accumulator input to output for WMMA instructions. Extending this to all MFMA variants would let the register allocator directly enforce same-register constraints without any alloca machinery.
Impact on Kernel Development
With mutable register semantics, the MLA kernel could be written naturally:
oaccu = fx.memref_alloca(reg_ty, layout)       # existing API
oaccu.store(zeros)
for tile in kv_tiles:
    if needs_boundary_check(tile):
        process_tile_with_bounds(tile, oaccu)  # mutates oaccu in place
    else:
        process_tile_no_bounds(tile, oaccu)    # mutates oaccu in place
store_output(oaccu.load())
Both branches write to the same physical registers. No phi nodes, no register divergence, no spills. The compiler is free to schedule instructions and choose registers — it just cannot create separate register assignments per branch for the accumulator.
This matches the mental model that GPU kernel programmers already have: accumulators are long-lived, mutable state that persists across branches and loop iterations. SSA's single-definition rule is the wrong abstraction for GPU MFMA accumulators that span 128 VGPRs.
Reproduction
Test kernels are committed on branch jruan/ssa_phi of the FlyDSL repo:
# Run all three experiments
cd /jruan/ws/FlyDSL
git checkout jruan/ssa_phi
# 1. SSA phi explosion (32 x f32x4 through phi, different branches)
# Expected: 32 phi <4 x float>, 76 unique MFMA dst groups, 134 VGPRs
FLYDSL_DUMP_IR=1 python tests/kernels/test_ssa_phi_divergence.py
# 2. No explosion baseline (1 x i64 through phi, identical branches)
# Expected: 0 phi <4 x float>, 3 unique MFMA dst groups, 10 VGPRs
FLYDSL_DUMP_IR=1 python tests/kernels/test_ssa_phi_old.py
# 3. Factorial analysis (isolate which factor is necessary)
# Factor A (large phi + identical branches): LLVM merges branches, no explosion
# Factor B (small phi + different branches): phi survives but trivially coalesced
FLYDSL_DUMP_IR=1 python tests/kernels/test_ssa_phi_factors.py
# Inspect generated artifacts
cat ~/.flydsl/debug/ssa_phi_accu_kernel_0/16_llvm_ir.ll | grep "phi <4 x float>"
grep 'v_mfma' ~/.flydsl/debug/ssa_phi_accu_kernel_0/17_final_isa.s | \
awk '{for(i=1;i<=NF;i++) if($i~/^v\[/) print $i}' | sort -u | wc -l
grep 'vgpr_spill_count' ~/.flydsl/debug/ssa_phi_accu_kernel_0/17_final_isa.s
References
- LLVM PHIElimination.cpp: LowerPHINode() — creates COPY instructions from phi nodes
- LLVM RegisterCoalescer.cpp: joinVirtRegs() — attempts to merge phi copies; throttled by isHighCostLiveInterval() at 100 valnos / 256 visits (not the bottleneck)
- LLVM RegAllocEvictionAdvisor.cpp: EvictInterferenceCutoff = 10 — limits eviction attempts (not the bottleneck)
- LLVM RegAllocGreedy.cpp: greedy allocator processes longest intervals first, causing branch divergence
- LLVM Issue #131954: Large MFMA tiles + software pipelining spills (related AMDGPU register pressure issue)
- LLVM PR #145024: AMDGPU VGPR-to-AGPR MFMA rewrite pass
- FlyDSL report: mlir_llvm_ssa_phi_restriction.md (detailed ISA analysis of the MLA kernel)
SSA Phi Explosion: VGPR Waste and Spills When Large Accumulators Cross
scf.IfOpBoundariesTL;DR — When a large MFMA accumulator (32 x f32x4 = 128 VGPRs) is yielded through
scf.IfOpphi from structurally different branches, LLVM's register allocator assigns disjoint VGPR ranges per branch. This caused 71 unique MFMA dst groups (ideal: 32) and 170 scratch spills in a production MLA kernel. The hand-written HK implementation of the same algorithm has 0 spills because it pins the accumulator to fixed VGPRs. Neithermemref_alloca(Register)nor raising LLVM coalescer thresholds helps — the interference is fundamental. FlyDSL'smemref_alloca(Register)already provides the right user-facing API, but the LLVM lowering must be changed so the alloca survivesmem2regand maps to a physical register without creating phi nodes.1. Problem Observed in Production
The Kernel
kn_mla_fwd_decode_m16x8_fp8_fp8— an MLA (Multi-head Latent Attention) forward-decode kernel fornhead=128, fp8on gfx942 (MI300X), implemented in FlyDSL. The kernel uses 8 warps / 512 threads in a persistent-thread architecture with multi-tile KV processing viascf.ForOp.What Happened
The kernel's inner loop contains an
scf.IfOpfor boundary checking (tiles near the end of a sequence require masked loads). Both branches call_process_tile(), which runs GEMM1 (24 MFMAs) and GEMM2 (32 MFMAs), producing a 128-VGPR output accumulator (oaccu: 32 x f32x4). The fulloaccuis yielded throughscf.yieldfrom each branch, creating 32 phi nodes of type<4 x float>at the merge point.Measured Impact
With 6 branch instances, LLVM produced 71 unique MFMA destination register groups — more than twice the ideal 32 — and 170 scratch spills (132
scratch_load+ 65scratch_store). Each scratch access is a global memory operation costing hundreds of cycles, versus zero for a register move. Inside a hot inner loop, this was catastrophic for performance.By contrast, the hand-written HipKittens (HK) implementation of the same algorithm handles 12 branch instances with only 34 unique MFMA dst groups and zero spills. The programmer pins the accumulator to a fixed VGPR range (
v[128:255]) across all branches — something that SSA IR cannot express.Current Workaround
We restructured the kernel so that GEMM2 (the 128-VGPR accumulator producer) executes after the
scf.IfOpmerge, not inside the branches. Only 5 small values (7 VGPRs total) now cross the phi. This eliminated all spills.But this is a workaround, not a fix. It required significant manual code restructuring, limits how the kernel can be written, and is fragile — LLVM's tail duplication pass may undo it by pulling post-merge code back into the branches. The fundamental issue remains.
2. Background: SSA and Phi Nodes
What Is SSA?
Static Single Assignment (SSA) form is a naming discipline used in virtually all modern compiler IRs — LLVM IR, MLIR, GCC GIMPLE, V8 Turbofan — where every variable is defined exactly once. This single-definition property makes compiler analyses dramatically simpler: use-def chains are trivial, dead code elimination is easy, and constant propagation is straightforward.
But when a variable needs different values depending on which branch was taken, SSA has a problem — two definitions reach the merge point. SSA resolves this with phi functions:
The phi function
x4 = phi(x2, x3)is not a real machine instruction — it is a notational device that the compiler must eventually eliminate by inserting register copies or by coalescing registers. The phi is the only way to merge values from different control flow paths in SSA. There is no "mutation" or "pass-by-reference" for SSA registers.MLIR's
scf.IfOp: Structured SSA with RegionsMLIR uses
scf.IfOpwithscf.yieldto express if/else with value results. Values defined inside a region cannot escape that region — the only way to export a value is viascf.yield, and the yielded values become the results of thescf.ifoperation:When lowered to LLVM IR, each
scf.yieldresult becomes a phi instruction at the merge block. Thescf.ifresults are phi nodes — there is no way to avoid them within SSA.3. Root Cause: Why SSA Phi Causes the Explosion
The Mechanism (Step by Step)
Consider 32 x f32x4 (128 VGPRs) yielded through
scf.IfOpphi from two structurally different branches:Step 1 — PHI Elimination. LLVM's
PHIEliminationpass converts each phi to COPY instructions:All 32
%inc_iVRegs are live simultaneously at the branch exit (all copies execute before the branch), and they remain live through the merge block.Step 2 — Register Coalescing. The coalescer tries to merge the COPYs (e.g., assign
%val_a_0and%inc_0to the same physical register so the COPY can be deleted). It may succeed for some copies, but interference prevents complete coalescing:%val_a_0..31are live simultaneously within each branch (they must all survive to the copy point at the branch exit).%inc_0..31are live simultaneously (they survive from the copy point to the merge block).<4 x float>occupies a VGPR_128 tuple with stride-1 subreg indexing:v[0:3]andv[1:4]share register units v1, v2, v3. Two f32x4 VRegs that are live at the same time cannot occupy overlapping tuples, which means each f32x4 effectively blocks 4 consecutive VGPRs from any other simultaneous f32x4 allocation.The coalescer processes copies one at a time. When it processes
COPY %inc_0, %val_a_0, it tries to merge their live intervals. But%val_a_0's live range overlaps with%inc_1..31(all live simultaneously), so%val_a_0must be assigned to a physical register that doesn't conflict with any of the already-assigned%inc_jregisters. This constraint gets progressively harder to satisfy as more copies are processed: the interference graph becomes denser, and eventually the coalescer fails for many of the 32 copies.Step 3 — Greedy Allocation Divergence. The greedy register allocator processes VRegs by priority (longer intervals first). The 32
%inc_iintervals (spanning predecessor → merge) get assigned first, consumingv[0:3]throughv[124:127]. Then, within each branch, the MFMA outputs try to coalesce into these same registers. But the two branches have different live range contexts (different computation structures create different interference patterns), so the allocator makes independent choices per branch. Branch A's MFMAs may coalesce tov[0:3]..v[124:127], but Branch B's MFMAs — facing different interference — get assigned tov[128:131]..v[252:255].Result: Instead of the ideal 32 unique MFMA dst groups (both branches sharing the same VGPRs), we get 76 unique groups. Each branch uses largely disjoint register ranges, effectively doubling register consumption at the merge point.
Experimental Validation
We built three test kernels (committed on branch
jruan/ssa_phi) to isolate the necessary conditions via 2x2 factorial analysis:test_ssa_phi_old.py): LLVM merges identical branches viav_cndmaskpointer selection. Phi eliminated entirely.test_ssa_phi_factors.py, Factor B): Phi survives but trivially coalesced. Only 4 VGPRs cross the merge — no pressure.test_ssa_phi_factors.py, Factor A): LLVM again merges branches — identical structure lets it replace the branch with a pointer select. No phi.test_ssa_phi_divergence.py): Explosion. 32 phi<4 x float>, 76 unique MFMA dst groups, 134 VGPRs.Conclusion: Both conditions are necessary — large phi width AND structurally different branches. Structural difference prevents LLVM from merging branches (which would eliminate phi entirely), and large phi width amplifies the register waste beyond the allocator's ability to coalesce.
Why LLVM Thresholds Cannot Fix This
We identified two LLVM throttle mechanisms that appeared relevant:

- Coalescer throttle (`RegisterCoalescer.cpp`): `isHighCostLiveInterval()` refuses coalescing when a live interval has >= 100 value numbers AND has been visited >= 256 times.
- Eviction throttle (`RegAllocEvictionAdvisor.cpp`): `EvictInterferenceCutoff = 10` refuses eviction when >= 10 interfering VRegs exist.

We raised all three thresholds at runtime via FlyDSL's `llvm_options()`. Result: identical output — 76 unique MFMA dst groups, 134 VGPRs. The throttles are not the bottleneck. The coalescer is genuinely trying to coalesce and failing because of true register interference from stride-1 tuple aliasing, not because it gave up early.
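For reference, these throttles are exposed as LLVM `cl::opt` flags. The overrides look roughly like the following sketch (the flag names are the upstream LLVM option names backing the thresholds above; the raised values are illustrative, and the exact `llvm_options()` invocation is omitted here):

```
-large-interval-size-threshold=100000            # isHighCostLiveInterval: value-number cap (default 100)
-large-interval-freq-threshold=100000            # isHighCostLiveInterval: visit-count cap
-regalloc-eviction-max-interference-cutoff=1000  # EvictInterferenceCutoff (default 10)
```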
4. Why `memref_alloca(Register)` and Mutable Variables Don't Work

One natural idea is to sidestep phi entirely by using mutable storage — write to a "register variable" from each branch and read it after the merge.
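In LLVM IR terms, the pattern looks like the following simplified sketch (one `<4 x float>` lane of the 32-lane accumulator; function and value names are illustrative):

```llvm
; Mutable "register variable": one alloca, a store in each branch,
; a single load after the merge.
define <4 x float> @mutable_acc(i1 %cond, <4 x float> %a, <4 x float> %b) {
entry:
  %acc = alloca <4 x float>, align 16, addrspace(5)   ; the "register variable"
  br i1 %cond, label %then, label %else
then:
  store <4 x float> %a, ptr addrspace(5) %acc         ; branch A writes
  br label %merge
else:
  store <4 x float> %b, ptr addrspace(5) %acc         ; branch B writes
  br label %merge
merge:
  %v = load <4 x float>, ptr addrspace(5) %acc        ; single read after merge
  ret <4 x float> %v
}
```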
This avoids phi in the source IR, but does not solve the problem:
Case 1: `mem2reg` / SROA Succeeds (Most Common)

LLVM's `mem2reg` and `SROA` passes recognize the textbook pattern — one alloca, stores in predecessor blocks, a load at the merge — and promote it back to SSA with phi. The mutable storage was a source-level illusion that the compiler erased before register allocation. The exact same phi explosion occurs.
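Concretely, `mem2reg` rewrites the alloca pattern into a phi at the merge block (sketch; block and value names illustrative):

```llvm
; After mem2reg: the alloca, both stores, and the load are gone;
; the branch values meet in a phi exactly as if scf.yield had been used.
merge:
  %v = phi <4 x float> [ %a, %then ], [ %b, %else ]
```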
Case 2: `mem2reg` Fails

If promotion fails (e.g., the alloca escapes, is accessed with volatile operations, or has mismatched access types), the value stays in actual memory — scratch memory on AMDGPU. Every "read" is then a global memory load (hundreds of cycles), and every "write" a global memory store. This is worse than phi copies.
FlyDSL's `memref_alloca(AddressSpace.Register)`

FlyDSL provides `memref_alloca(AddressSpace.Register)`, which lowers through `fly.make_ptr(register)` → `llvm.alloca` in address space 5 (AMDGPU private/scratch). We verified that `mem2reg` promotes this identically — the alloca, all stores, and the load disappear, and the same phi reappears.

Summary

Whether the accumulator is held in SSA values or behind `memref_alloca(Register)`, there is no escape from the phi problem within SSA. The issue is not syntactic (phi vs. alloca) but physical: at the hardware level, if two branches want to put different values in the same register, a resolution mechanism is needed at the merge point.
5. How PHI Elimination Works in LLVM and Why It Cannot Fix the Issue
The Pipeline
PHI Elimination (`PHIElimination.cpp`: `LowerPHINode`)

For each phi node, the pass:
- creates a fresh virtual register `%inc` (same register class as the phi result);
- inserts `COPY %inc, %src_from_this_branch` in each predecessor, before the terminator;
- inserts `COPY %dest, %inc` after all phis in the merge block, replacing uses of the phi result.

For 32 phi nodes with 2 predecessors, this creates:
- 32 fresh `%inc` VRegs
- 96 COPY instructions (64 in the predecessors, 32 at the merge)
- a window in which all 32 `%inc` VRegs are live simultaneously (they all survive from the predecessor to the merge block)

Register Coalescing: Why It Fails at Scale
The coalescer processes COPYs one at a time, trying to merge source and destination into the same physical register so the COPY can be deleted. For 1 phi of type `<4 x float>`, this works perfectly — there is only one `%inc` VReg, no interference, and all three values (`%val_a`, `%inc`, `%val_b`) get `v[0:3]`. Zero copies remain, 3 unique MFMA dst groups, 10 VGPRs.
%incVRegs are all live simultaneously, occupyingv[0:3]throughv[124:127]. When the coalescer tries to merge%val_a_iinto%inc_i's register,%val_a_i's live range (from MFMA def to the copy point) overlaps with other%inc_jvalues. As explained in Section 3, the dense interference from stride-1 VGPR_128 tuples makes many coalescing attempts fail. Uncoalesced MFMAs get fresh physical registers — producing 76 unique dst groups instead of 32, and 134 VGPRs instead of 10.The 1-phi case is 3 dst groups / 10 VGPRs. The 32-phi case is 76 dst groups / 134 VGPRs. The scaling is catastrophic.
PHI elimination is doing its job correctly — it faithfully translates phi semantics into copies. The problem is that the resulting register allocation problem is fundamentally intractable when 32 wide values are simultaneously live across a merge point.
6. Proposal
6.1 Current Workaround: Restructure to Avoid Phi on Large Values
Today, the only reliable solution is to restructure the kernel code so that large register groups (MFMA accumulators, etc.) never cross `scf.IfOp` boundaries. This workaround is effective but has significant drawbacks.
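A sketch of the restructuring (FlyDSL-style pseudocode; the helper names `make_masked_view` / `make_full_view` are hypothetical, shown only to illustrate the shape):

```
# Branch only on small values (addresses / masks), so the 128-VGPR oaccu
# never crosses the scf.IfOp and no wide phi is created.
if is_boundary_tile:
    kv_view, mask = make_masked_view(kv, tile)   # only small values are yielded
else:
    kv_view, mask = make_full_view(kv, tile)
oaccu = _process_tile(kv_view, mask, oaccu)      # single call outside the branch
```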
6.2 Proposed Fix: Mutable Register Semantics Through LLVM
The FlyDSL frontend side is straightforward — `memref_alloca(AddressSpace.Register)` already provides the right user-facing API (store/load on a register-address-space buffer). The problem is entirely in the LLVM lowering: today this lowers to an `llvm.alloca` in address space 5, which `mem2reg` promotes back to SSA with phi, recreating the exact problem we're trying to avoid. The real work is making LLVM honor the "mutable register" semantics — keeping the value in a physical register across branches without creating phi nodes.

What We Need From LLVM
In hand-written assembly (HipKittens), the programmer pins accumulators to fixed VGPR ranges:
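Illustrative only — this is not the actual HK source, and the MFMA opcode and operand ranges are assumptions for gfx942:

```
; Both branches target the same fixed accumulator group v[0:3],
; so no merge-time resolution is ever needed.
v_mfma_f32_16x16x32_fp8_fp8 v[0:3], v[32:33], v[34:35], v[0:3]   ; branch A
; ...
v_mfma_f32_16x16x32_fp8_fp8 v[0:3], v[36:37], v[38:39], v[0:3]   ; branch B
```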
We need the same guarantee but without hardcoding the register number. The allocator is free to choose which register — it just cannot choose different registers per branch. The key constraint is: the same logical accumulator must map to the same physical register across all branches, and both branches can mutate it in place without creating new SSA definitions.
What This Looks Like at Each Level
- FlyDSL user code: no API changes needed — reuse `memref_alloca`.
- Today's LLVM lowering: broken — `mem2reg` undoes it.
- What we need: the alloca must survive `mem2reg` and be lowered to a register after SSA optimizations but before register allocation.

Possible LLVM Implementation Strategies
- Opaque alloca: an `alloca` carrying metadata/an attribute that `mem2reg` skips; a late pass lowers it to a register.
- Predication: lower the `scf.IfOp` to predicated instructions (both paths execute, results masked by `exec`).
- Tied MFMA intrinsics: tie the accumulator input to the output so the allocator must keep it in one register group.

The opaque alloca approach is the most pragmatic starting point: it requires no upstream LLVM changes and can be implemented within FlyDSL's lowering pipeline. The key question is how to prevent `mem2reg` from promoting it. Options include:

- Attach `!fly.no_promote` metadata to the alloca. Requires a small `mem2reg` patch to check for it.
- Mark the accesses `volatile` to prevent promotion (LLVM already respects this). A late FlyDSL pass strips the `volatile` flag and replaces the accesses with register ops. Drawback: `volatile` also prevents instruction reordering during optimization.
- Emit a form `mem2reg` does not recognize as promotable, then lower to a register in a late pass.

The tied MFMA intrinsic is the cleanest long-term solution. LLVM already has precedent:
`int_amdgcn_wmma_f16_16x16x16_f16_tied` ties the accumulator input to the output for WMMA instructions. Extending this to all MFMA variants would let the register allocator directly enforce same-register constraints without any alloca machinery.

Impact on Kernel Development
With mutable register semantics, the MLA kernel could be written naturally:
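For example (FlyDSL-style pseudocode; the exact signatures of `memref_alloca`, `_process_tile`, and `is_boundary` are assumptions, shown only to illustrate the shape):

```
# oaccu is a mutable register buffer; both branches mutate it in place,
# so nothing is yielded through the scf.IfOp and no phi is created.
oaccu = memref_alloca((32,), T.f32x4, AddressSpace.Register)
for tile in kv_tiles:
    if is_boundary(tile):
        _process_tile(oaccu, tile, masked=True)    # mutate in place
    else:
        _process_tile(oaccu, tile, masked=False)   # same physical registers
```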
Both branches write to the same physical registers. No phi nodes, no register divergence, no spills. The compiler is free to schedule instructions and choose registers — it just cannot create separate register assignments per branch for the accumulator.
This matches the mental model that GPU kernel programmers already have: accumulators are long-lived, mutable state that persists across branches and loop iterations. SSA's single-definition rule is the wrong abstraction for GPU MFMA accumulators that span 128 VGPRs.
Reproduction
Test kernels are committed on branch `jruan/ssa_phi` of the FlyDSL repo: `test_ssa_phi_old.py`, `test_ssa_phi_factors.py`, and `test_ssa_phi_divergence.py`.

References
- `PHIElimination.cpp`: `LowerPHINode()` — creates COPY instructions from phi nodes
- `RegisterCoalescer.cpp`: `joinVirtRegs()` — attempts to merge phi copies; throttled by `isHighCostLiveInterval()` at 100 valnos / 256 visits (not the bottleneck)
- `RegAllocEvictionAdvisor.cpp`: `EvictInterferenceCutoff = 10` — limits eviction attempts (not the bottleneck)
- `RegAllocGreedy.cpp`: greedy allocator processes longest intervals first, causing branch divergence
- `mlir_llvm_ssa_phi_restriction.md` (detailed ISA analysis of the MLA kernel)