From c5a95e128f1c6dd50a3f1cf0840e2db5190663d1 Mon Sep 17 00:00:00 2001
From: Thomas Smith <thomsmit@google.com>
Date: Sat, 20 Jun 2026 21:27:36 -0400
Subject: [PATCH 1/3] docs: explain why subgroup emulation slows scan

---
 docs/scan-and-reduce.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/scan-and-reduce.md b/docs/scan-and-reduce.md
index 1c28bd8..8a7c83d 100644
--- a/docs/scan-and-reduce.md
+++ b/docs/scan-and-reduce.md
@@ -49,7 +49,7 @@ The carry chain in a chained scan stores the inclusive scan of the reduction of
 
 ### Leveraging Subgroups
 
-WebGPU's optional [subgroups](https://www.w3.org/TR/webgpu/#subgroups) feature enables WebGPU programs to use SIMD instructions within a workgroup. These can deliver significant performance gains for several reasons: they leverage custom hardware for the computation itself; they do not have to route data through workgroup memory; they require one hardware instruction to do what would take many hardware instructions in emulation; and they require fewer barriers. Gridwise has some, but incompletely deployed, [support for emulating SIMD instructions](../subgroup-strategy/). Our initial performance testing indicated that scan was 2.5x slower using subgroup emulation vs. using subgroup hardware.
+WebGPU's optional [subgroups](https://www.w3.org/TR/webgpu/#subgroups) feature enables WebGPU programs to use SIMD instructions within a workgroup. These deliver significant performance gains primarily by utilizing register shuffling, bypassing the setup and latency of shared memory routing. Consequently, our initial performance testing indicated that scan operations are 2.5x slower using Gridwise's  [subgroup emulation](../subgroup-strategy/) compared to native hardware. While both emulated and hardware-backed approaches theoretically share the same minimal algorithmic depth, the depth incurred during a native subgroup scan utilizes register shuffling and requires no barriers, whereas the emulated loop incurs a barrier at each iteration. Therefore, while the theoretical depth remains constant, emulated scans—or native execution on smaller subgroup sizes—tend to perform worse in practice due to the increased relative cost of synchronization and the overhead of additional iterations in the main loop.
 
 It is possible that the fastest scan and reduce implementations in the absence of subgroup hardware are not chained ones. We have not investigated this at all. In general, in Gridwise, we expect that subgroup operations are available, and we use them to reduce across subgroups and workgroups, to broadcast information from one thread to others, and to compute local subgroup-sized scans as part of workgroup scans.
 

From 20b56bc6a194aa12ab242c005c0e52a7c8e560ec Mon Sep 17 00:00:00 2001
From: Thomas Smith <thomsmit@google.com>
Date: Sun, 21 Jun 2026 13:00:54 -0400
Subject: [PATCH 2/3] docs: remove subgroupAny case study, trim trailing whitespace

---
 docs/subgroup-strategy.md | 51 +--------------------------------------
 1 file changed, 1 insertion(+), 50 deletions(-)

diff --git a/docs/subgroup-strategy.md b/docs/subgroup-strategy.md
index 17599a1..21791c9 100644
--- a/docs/subgroup-strategy.md
+++ b/docs/subgroup-strategy.md
@@ -116,7 +116,7 @@ First, we know that using hw subgroup operations will deliver better performance
 
 Recall that WebGPU does not specify a subgroup size (in hw), although it does specify a minimum and maximum subgroup size. (In fact, some WebGPU-capable hardware may use different subgroup sizes across different kernels in the same application.) WebGPU developers must thus write their code assuming any subgroup size between the minimum and the maximum. Since our kernels already have to handle a range of subgroup sizes, we have some flexibility to choose a subgroup size in emu.
 
-We choose to emulate virtual subgroups of size **32** (partitioning the workgroup's flat array and tracking thread subgroup IDs via `lidx % 32u`). 
+We choose to emulate virtual subgroups of size **32** (partitioning the workgroup's flat array and tracking thread subgroup IDs via `lidx % 32u`).
 
 ### Why 32-thread virtual subgroups? (VS Workgroup-sized subgroups)
 
@@ -254,55 +254,6 @@ Instead of trying to emulate subgroup built-ins line-by-line in a shared kernel,
 * **Cons**:
   * Increases codebase maintenance overhead as developers must write, test, and maintain two versions of every primitive.
 
-## Case Study: Control Flow Divergence in subgroupAny (OneSweep Sort)
-
-In the implementation of the OneSweep Radix Sort lookback loop, a deadlock was discovered when running under software subgroup emulation. The original code compiled and ran correctly on native hardware subgroup platforms but hung indefinitely under emulation.
-
-### The Problem: Divergent Control Flow
-
-The lookback loop spins waiting for preceding tiles to publish their histograms. The original kernel structure executed `subgroupAny` inside a thread-divergent conditional block:
-
-```wgsl
-if (!sgLookbackComplete) {
-  if (!lookbackComplete) { // Thread-divergent branch (per-lane status)
-    while (spinCount < MAX_SPIN_COUNT) {
-      flagPayload = atomicLoad(&passHist[...]);
-      if ((flagPayload & FLAG_MASK) > FLAG_NOT_READY) { break; }
-      spinCount++;
-    }
-    // subgroupAny is called ONLY by threads that have NOT completed lookback
-    if (subgroupAny(spinCount == MAX_SPIN_COUNT) && (sgid == 0)) {
-      wg_incomplete = 1;
-    }
-  }
-}
-```
-
-* **On Hardware**: Native hardware subgroups use execution masks to dynamically disable inactive lanes. Threads that have already completed lookback (`lookbackComplete == true`) simply bypass the branch, and the hardware evaluates `subgroupAny` correctly using only the active participating lanes.
-* **On Emulation**: Software emulation simulates subgroup barriers using workgroup shared memory barriers and transaction counters. Every thread in the virtual subgroup must execute the helper function uniformly. Because lanes that completed early bypassed the `if (!lookbackComplete)` block, they never reached the barrier inside the emulated `subgroupAny`, causing the participating threads to deadlock waiting for them.
-
-### The Fix: Uniform Execution
-
-To make the kernel emulation-friendly, the divergent `subgroupAny` call was hoisted out of the thread-divergent block while keeping it within the subgroup-uniform block. A subgroup-uniform variable `didSpinTimeout` is initialized and updated inside the branch, then passed to `subgroupAny` uniformly:
-
-```wgsl
-if (!sgLookbackComplete) { // Subgroup-uniform branch
-  var didSpinTimeout = false;
-  
-  if (!lookbackComplete) { // Thread-divergent branch
-    while (spinCount < MAX_SPIN_COUNT) { ... }
-    didSpinTimeout = (spinCount == MAX_SPIN_COUNT);
-  }
-  
-  // Hoisted: Every thread in the subgroup now executes subgroupAny uniformly
-  if (subgroupAny(didSpinTimeout) && (sgid == 0)) {
-    wg_incomplete = 1;
-  }
-}
-```
-
-Now, all threads in the subgroup execute `subgroupAny` in lockstep. Completed threads participate by passing `didSpinTimeout = false`, and active threads pass their actual spin status. This prevents barrier misalignment and resolves the emulation deadlock.
-
 ### Principles for Writing Emulation-Friendly Subgroup Kernels
 
 If you are writing WebGPU kernels intended to be portable to devices without hardware subgroup support (running via a software emulation layer), follow these guidelines:

From f290cb8f2b459df16887bab6373f6fd26fe6c5c1 Mon Sep 17 00:00:00 2001
From: Thomas Smith <thomsmit@google.com>
Date: Sun, 21 Jun 2026 14:06:25 -0400
Subject: [PATCH 3/3] docs: polish subgroup emulation explanation in scan-and-reduce

---
 docs/scan-and-reduce.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/scan-and-reduce.md b/docs/scan-and-reduce.md
index 8a7c83d..90d38ba 100644
--- a/docs/scan-and-reduce.md
+++ b/docs/scan-and-reduce.md
@@ -49,7 +49,7 @@ The carry chain in a chained scan stores the inclusive scan of the reduction of
 
 ### Leveraging Subgroups
 
-WebGPU's optional [subgroups](https://www.w3.org/TR/webgpu/#subgroups) feature enables WebGPU programs to use SIMD instructions within a workgroup. These deliver significant performance gains primarily by utilizing register shuffling, bypassing the setup and latency of shared memory routing. Consequently, our initial performance testing indicated that scan operations are 2.5x slower using Gridwise's  [subgroup emulation](../subgroup-strategy/) compared to native hardware. While both emulated and hardware-backed approaches theoretically share the same minimal algorithmic depth, the depth incurred during a native subgroup scan utilizes register shuffling and requires no barriers, whereas the emulated loop incurs a barrier at each iteration. Therefore, while the theoretical depth remains constant, emulated scans—or native execution on smaller subgroup sizes—tend to perform worse in practice due to the increased relative cost of synchronization and the overhead of additional iterations in the main loop.
+WebGPU's optional [subgroups](https://www.w3.org/TR/webgpu/#subgroups) feature enables WebGPU programs to use SIMD instructions within a workgroup. These deliver significant performance gains primarily because lanes exchange data through register shuffles, minimizing the amount of work that must be routed through workgroup memory. Our initial performance testing indicated that scan operations are 2.5x slower using Gridwise's [subgroup emulation](../subgroup-strategy/) than using native subgroup hardware. Both the emulated and hardware-backed approaches have the same algorithmic depth in theory, but the native scan requires no barriers for the steps performed via shuffling, whereas the emulated loop must perform all work through workgroup memory and incurs a barrier at each iteration. Emulated scans, and native scans on hardware with smaller subgroup sizes, therefore tend to perform worse in practice, because a larger share of their steps pays the synchronization cost of workgroup memory.
 
 It is possible that the fastest scan and reduce implementations in the absence of subgroup hardware are not chained ones. We have not investigated this at all. In general, in Gridwise, we expect that subgroup operations are available, and we use them to reduce across subgroups and workgroups, to broadcast information from one thread to others, and to compute local subgroup-sized scans as part of workgroup scans.