Merged
39 changes: 37 additions & 2 deletions oink/README.md
@@ -24,7 +24,7 @@ Recommended env vars:

```bash
export PYTORCH_ALLOC_CONF=expandable_segments:True
-export CUTE_DSL_ARCH=sm_103a # GB300 / SM103
+export CUTE_DSL_ARCH=sm_103 # GB300 / SM103 on the current CuTeDSL host
# export CUTE_DSL_ARCH=sm_100a # GB200/B200 / SM100
```

@@ -88,7 +88,7 @@ Reported numbers are correctness-gated against PyTorch references before timing.

Current GB300 / SM103 setup:

-- NVIDIA GB300, capability `(10, 3)`, `CUTE_DSL_ARCH=sm_103a`
+- NVIDIA GB300, capability `(10, 3)`, `CUTE_DSL_ARCH=sm_103`
- `torch==2.11.0+cu130`, CUDA `13.0`
- `nvidia-cutlass-dsl==4.4.2`, `cuda-python==13.2.0`
- measured BF16 STREAM-like roof: **7.140 TB/s**
@@ -113,6 +113,41 @@ Historical plots remain under `benchmarks/media/`:
- `gb300_bf16_qk_norm_oink_vs_cutedsl_roofline.svg`: historical GB300 Q/K-norm
harness, separate from the Quack-suite table above.

### GB300 (SM103) LayerNorm backward results

Oink's LayerNorm backward path is self-contained in this repo. The OSS
benchmark compares Oink against ATen's native LayerNorm backward reference.

Measured on **GB300 (SM103)** in the `cute` Conda environment with torch
`2.11.0+cu130`, CUDA `13.0`, and `CUTE_DSL_ARCH=sm_103`, using CUDA graph warm
replay (`--cuda-graph`), bf16 activations/gradients, same-dtype LayerNorm
weights, and no bias. Correctness was checked before timing against a chunked
fp32 PyTorch formula for `dx` / `dweight`; the timed `ref` column uses
`torch.ops.aten.native_layer_norm_backward.default` with the same precomputed
`mean` and `rstd`.
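
As a sketch of the timed reference call (shapes are illustrative, much smaller
than the benchmark's sweeps):

```python
import torch

# Hypothetical small shapes for illustration only.
M, N = 32, 64
x = torch.randn(M, N, dtype=torch.bfloat16)
dout = torch.randn_like(x)
weight = torch.randn(N, dtype=torch.bfloat16)

# Run the forward once to get mean/rstd in the dtypes ATen expects,
# mirroring the benchmark's precomputed stats.
_, mean, rstd = torch.ops.aten.native_layer_norm.default(x, [N], weight, None, 1e-5)

# Timed reference: ATen's native LayerNorm backward with precomputed stats,
# no bias, and dbias masked out (the benchmark runs LayerNorm without bias).
dx, dweight, _ = torch.ops.aten.native_layer_norm_backward.default(
    dout, x, [N], mean, rstd, weight, None, [True, True, False]
)
```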

The OSS Quack package installed in this environment exposes LayerNorm forward but
not a `quack.rmsnorm.layernorm_bwd` API, so the benchmark reports Quack as
unavailable and omits Quack timing columns. If a Quack build with
`layernorm_bwd` is installed, the same command will add `quack_ms` and
`Oink/Quack` columns.

The throughput columns use a logical useful-IO model for no-bias LayerNorm
backward: read `x`, read `dout`, write `dx`, read `weight`, write `dweight`, and
read fp32 `mean` + `rstd`. This excludes implementation-specific scratch traffic,
so the values are a useful-bandwidth roofline view rather than physical HBM bytes.
Full DSv3/DSv4 tables are in [`benchmarks/README.md`](benchmarks/README.md).
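
In code form, that model is (a sketch; the helper name is illustrative, and the
example reuses the DSv3 `M = 65536, N = 8192` measurement from the full tables):

```python
def layernorm_bwd_logical_bytes(M: int, N: int, elem_size: int = 2) -> int:
    """Logical useful-IO bytes for no-bias LayerNorm backward (elem_size=2 for bf16)."""
    activations = 3 * M * N * elem_size  # read x, read dout, write dx
    params = 2 * N * elem_size           # read weight, write dweight
    stats = 2 * M * 4                    # read fp32 mean and rstd
    return activations + params + stats

# Example: the DSv3 M=65536, N=8192 row at 0.7372 ms lands at ~4.37 TB/s.
tbps = layernorm_bwd_logical_bytes(65536, 8192) / 0.7372e-3 / 1e12
```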


Reproduce with:

```bash
env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \
conda run -n cute python -u oink/benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \
--dtype bf16 --weight-dtype same --dsv4 --iters 80 --warmup-ms 10 --cuda-graph \
--json /tmp/oink_layernorm_bwd_sm103_dsv4_cuda_graph_seq.json
```

## Links

| What | Link |
57 changes: 50 additions & 7 deletions oink/benchmarks/README.md
@@ -18,8 +18,8 @@ Recommended env vars:

```bash
export PYTORCH_ALLOC_CONF=expandable_segments:True
-# GB300 / SM103:
-export CUTE_DSL_ARCH=sm_103a
+# GB300 / SM103 on the current CuTeDSL host:
+export CUTE_DSL_ARCH=sm_103
# GB200/B200 / SM100 historical runs:
# export CUTE_DSL_ARCH=sm_100a
```
@@ -42,6 +42,7 @@ conda run -n cute python -m pip install 'git+https://github.com/Dao-AILab/quack.
with `hidden = 4096` so `M = batch * seq`, `N = 4096`.
- **DeepSeek-V3-like (DSv3)**
- RMSNorm / LayerNorm / Softmax: `M ∈ {4096, 16384, 65536}`, `N ∈ {6144, 7168, 8192}`
- LayerNorm backward's `--dsv3` suite uses `N ∈ {6144, 8192}`; use `--dsv4` for the `N = 7168` hidden-state sweep.
- Cross-entropy: `M ∈ {4096, 16384, 65536}`, `N ∈ {3072, 6144, 8192, 12288}`
- **DeepSeek-V4-Flash norm shapes (DSv4)** from `deepseek-ai/DeepSeek-V4-Flash/inference/model.py`
- hidden-state RMSNorm / LayerNorm: `M ∈ {4096, 16384, 65536}`, `N = 7168`
@@ -73,7 +74,7 @@ Current measured GB300 BF16 STREAM-like roof used in the README:
Regenerate on the current machine:

```bash
-conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103a \
+conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
python benchmarks/benchmark/benchmark_hbm_roofline_sm100.py --dtype bf16 --op both --gb 1 \
--json /tmp/oink_sm103_hbm_roofline_bf16_current.json'
```
@@ -94,11 +95,11 @@ Run the full Quack-suite + DSv3 set (Oink vs Quack) and write all JSON artifacts
to a timestamped directory:

```bash
-conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103a \
+conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
python benchmarks/readme/run_sm100_suite.py --dtype bf16'

# Include DeepSeek-V4-Flash norm workloads:
-conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103a \
+conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
python benchmarks/readme/run_sm100_suite.py --dtype bf16 --include-dsv4 \
--out-dir /tmp/oink_sm103_suite_bf16_current'
```
@@ -162,14 +163,14 @@ outputs.

```bash
# DeepSeek-V3 hidden-size sweep
-PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103a \
+PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
python benchmarks/benchmark/benchmark_fused_add_rmsnorm_sm100.py \
--dtype bf16 --dsv3 --iters 80 --warmup-ms 15 \
--quack-baseline kernel_inplace \
--json /tmp/oink_sm103_fused_add_rmsnorm_dsv3_bf16.json

# DeepSeek-V4-Flash hidden-state sweep (N=7168)
-PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103a \
+PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
python benchmarks/benchmark/benchmark_fused_add_rmsnorm_sm100.py \
--dtype bf16 --dsv4 --iters 80 --warmup-ms 15 \
--quack-baseline kernel_inplace \
@@ -245,6 +246,48 @@ python benchmarks/benchmark/benchmark_layernorm_sm100.py --dtype bf16 --dsv3 --i
--json /tmp/oink_layernorm_fwd_dsv3.json
```

### LayerNorm backward

This benchmark compares Oink against ATen's native LayerNorm backward reference
and, when the installed OSS Quack package exposes `quack.rmsnorm.layernorm_bwd`,
against Quack's LayerNorm backward. Each available backend is validated against a
chunked fp32 PyTorch formula before timing. Current table numbers use CUDA graph
warm replay (`--cuda-graph`). The local Quack package used for these runs exposes
LayerNorm forward but not `layernorm_bwd`, so the Quack timing columns are omitted.
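
An un-chunked fp32 sketch of that correctness reference (the benchmark's actual
checker chunks over rows to bound memory; the helper name is illustrative):

```python
import torch

def layernorm_bwd_ref_fp32(x, dout, weight, eps=1e-5):
    """fp32 dx/dweight for no-bias LayerNorm, via the standard backward formula."""
    xf, doutf, wf = x.float(), dout.float(), weight.float()
    mean = xf.mean(dim=-1, keepdim=True)
    rstd = (xf.var(dim=-1, unbiased=False, keepdim=True) + eps).rsqrt()
    xhat = (xf - mean) * rstd
    g = doutf * wf  # dL/dxhat
    dx = rstd * (g - g.mean(dim=-1, keepdim=True)
                 - xhat * (g * xhat).mean(dim=-1, keepdim=True))
    dweight = (doutf * xhat).sum(dim=0)
    return dx, dweight
```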

DSv3 CUDA-graph replay results (`N ∈ {6144,8192}`):

| M | N | Oink ms | Oink TB/s | ATen ref ms | Oink/ref |
|---:|---:|---:|---:|---:|---:|
| 4096 | 6144 | 0.0548 | 2.7574 | 0.0777 | 1.4190x |
| 4096 | 8192 | 0.0611 | 3.2951 | 0.0970 | 1.5873x |
| 16384 | 6144 | 0.1840 | 3.2833 | 0.2794 | 1.5183x |
| 16384 | 8192 | 0.2093 | 3.8480 | 0.3387 | 1.6183x |
| 65536 | 6144 | 0.6896 | 3.5043 | 1.0652 | 1.5447x |
| 65536 | 8192 | 0.7372 | 4.3705 | 1.3138 | 1.7823x |

DSv4 hidden LayerNorm CUDA-graph replay results (`N = 7168`):

| M | N | Oink ms | Oink TB/s | ATen ref ms | Oink/ref |
|---:|---:|---:|---:|---:|---:|
| 4096 | 7168 | 0.0591 | 2.9800 | 0.0858 | 1.4503x |
| 16384 | 7168 | 0.1990 | 3.5425 | 0.3077 | 1.5467x |
| 65536 | 7168 | 0.7467 | 3.7753 | 1.1711 | 1.5684x |
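
As a spot check (a sketch), the `M = 65536` row's TB/s follows from the no-bias
useful-IO model assumed for these tables (3·M·N activation traffic, weight and
dweight, fp32 mean and rstd):

```python
M, N, elem = 65536, 7168, 2                # bf16 elements
logical_bytes = 3 * M * N * elem           # read x, read dout, write dx
logical_bytes += 2 * N * elem              # read weight, write dweight
logical_bytes += 2 * M * 4                 # read fp32 mean and rstd
tbps = logical_bytes / 0.7467e-3 / 1e12    # ~3.775 TB/s, matching the table row
```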

```bash
# DeepSeek-V4-Flash hidden LayerNorm shape sweep (N=7168)
env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \
conda run -n cute python -u benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \
--dtype bf16 --weight-dtype same --dsv4 --iters 80 --warmup-ms 10 --cuda-graph \
--json /tmp/oink_layernorm_bwd_sm103_dsv4_cuda_graph_seq.json

# DeepSeek-V3 shape sweep (N in {6144,8192})
env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \
conda run -n cute python -u benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \
--dtype bf16 --weight-dtype same --dsv3 --iters 80 --warmup-ms 10 --cuda-graph \
--json /tmp/oink_layernorm_bwd_sm103_dsv3_cuda_graph_seq.json
```

## Notes

- These scripts intentionally avoid importing any external Oink checkout so the
15 changes: 13 additions & 2 deletions oink/benchmarks/benchmark/bench_utils.py
@@ -75,7 +75,9 @@ def ensure_blackwell_arch_env(device: Optional[torch.device] = None) -> str:

Benchmarks often run outside the Oink/vLLM plugin path, so they don't
benefit from the plugin's device-capability-based `CUTE_DSL_ARCH` setup.
-On GB300 we want `sm_103a` instead of the older hard-coded `sm_100a`.
+On this GB300/CuTeDSL 4.4.2 host, LayerNorm backward compiles reliably
+with `sm_103`; callers may still pin an `a` arch explicitly if their local
+CuTeDSL build requires it.
"""
pinned = os.environ.get("CUTE_DSL_ARCH")
if pinned:
@@ -86,7 +88,9 @@ def ensure_blackwell_arch_env(device: Optional[torch.device] = None) -> str:
if device is None:
device = torch.device("cuda")
major, minor = torch.cuda.get_device_capability(device)
-if int(major) == 10:
+if int(major) == 10 and int(minor) == 3:
+arch = "sm_103"
+elif int(major) == 10:
arch = f"sm_{int(major)}{int(minor)}a"
os.environ["CUTE_DSL_ARCH"] = arch
return arch
@@ -115,6 +119,13 @@ def do_bench_triton(
return float(triton_do_bench(fn, warmup=warmup_ms, rep=rep_ms, return_mode="mean"))


def do_bench_cuda_graph(fn: Callable[[], Any], *, rep_ms: int = 100) -> float:
"""CUDA-graph replay timing via Triton's cudagraph benchmark helper."""
from triton.testing import do_bench_cudagraph

return float(do_bench_cudagraph(fn, rep=rep_ms, return_mode="mean"))


def parse_dtype(s: str) -> torch.dtype:
s = s.lower()
if s == "bf16":