Merged
39 changes: 37 additions & 2 deletions oink/README.md
@@ -24,7 +24,7 @@ Recommended env vars:

```bash
export PYTORCH_ALLOC_CONF=expandable_segments:True
-export CUTE_DSL_ARCH=sm_103a # GB300 / SM103
+export CUTE_DSL_ARCH=sm_103 # GB300 / SM103 on the current CuTeDSL host
# export CUTE_DSL_ARCH=sm_100a # GB200/B200 / SM100
```

@@ -88,7 +88,7 @@ Reported numbers are correctness-gated against PyTorch references before timing.

Current GB300 / SM103 setup:

-- NVIDIA GB300, capability `(10, 3)`, `CUTE_DSL_ARCH=sm_103a`
+- NVIDIA GB300, capability `(10, 3)`, `CUTE_DSL_ARCH=sm_103`
- `torch==2.11.0+cu130`, CUDA `13.0`
- `nvidia-cutlass-dsl==4.4.2`, `cuda-python==13.2.0`
- measured BF16 STREAM-like roof: **7.140 TB/s**
@@ -113,6 +113,41 @@ Historical plots remain under `benchmarks/media/`:
- `gb300_bf16_qk_norm_oink_vs_cutedsl_roofline.svg`: historical GB300 Q/K-norm
harness, separate from the Quack-suite table above.

### GB300 (SM103) LayerNorm backward results

Oink's LayerNorm backward path is self-contained in this repo. The OSS
benchmark compares Oink against ATen's native LayerNorm backward reference.

Measured on **GB300 (SM103)** in the `cute` Conda environment with torch
`2.11.0+cu130`, CUDA `13.0`, and `CUTE_DSL_ARCH=sm_103`, using CUDA graph warm
replay (`--cuda-graph`), bf16 activations/gradients, same-dtype LayerNorm
weights, and no bias. Correctness was checked before timing against a chunked
fp32 PyTorch formula for `dx` / `dweight`; the timed `ref` column uses
`torch.ops.aten.native_layer_norm_backward.default` with the same precomputed
`mean` and `rstd`.
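
As a sketch of the timed reference call (shapes are illustrative, much smaller
than the benchmark's sweeps):

```python
import torch

# Hypothetical small shapes for illustration only.
M, N = 32, 64
x = torch.randn(M, N, dtype=torch.bfloat16)
dout = torch.randn_like(x)
weight = torch.randn(N, dtype=torch.bfloat16)

# Run the forward once to get mean/rstd in the dtypes ATen expects,
# mirroring the benchmark's precomputed stats.
_, mean, rstd = torch.ops.aten.native_layer_norm.default(x, [N], weight, None, 1e-5)

# Timed reference: ATen's native LayerNorm backward with precomputed stats,
# no bias, and dbias masked out (the benchmark runs LayerNorm without bias).
dx, dweight, _ = torch.ops.aten.native_layer_norm_backward.default(
    dout, x, [N], mean, rstd, weight, None, [True, True, False]
)
```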

The OSS Quack package installed in this environment exposes LayerNorm forward but
not a `quack.rmsnorm.layernorm_bwd` API, so the benchmark reports Quack as
unavailable and omits Quack timing columns. If a Quack build with
`layernorm_bwd` is installed, the same command will add `quack_ms` and
`Oink/Quack` columns.

The throughput columns use a logical useful-IO model for no-bias LayerNorm
backward: read `x`, read `dout`, write `dx`, read `weight`, write `dweight`, and
read fp32 `mean` + `rstd`. This excludes implementation-specific scratch traffic,
so the values are a useful-bandwidth roofline view rather than physical HBM bytes.
Full DSv3/DSv4 tables are in [`benchmarks/README.md`](benchmarks/README.md).
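
In code form, that model is (a sketch; the helper name is illustrative, and the
example reuses the DSv3 `M = 65536, N = 8192` measurement from the full tables):

```python
def layernorm_bwd_logical_bytes(M: int, N: int, elem_size: int = 2) -> int:
    """Logical useful-IO bytes for no-bias LayerNorm backward (elem_size=2 for bf16)."""
    activations = 3 * M * N * elem_size  # read x, read dout, write dx
    params = 2 * N * elem_size           # read weight, write dweight
    stats = 2 * M * 4                    # read fp32 mean and rstd
    return activations + params + stats

# Example: the DSv3 M=65536, N=8192 row at 0.7372 ms lands at ~4.37 TB/s.
tbps = layernorm_bwd_logical_bytes(65536, 8192) / 0.7372e-3 / 1e12
```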


Reproduce with:

```bash
env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \
conda run -n cute python -u oink/benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \
--dtype bf16 --weight-dtype same --dsv4 --iters 80 --warmup-ms 10 --cuda-graph \
--json /tmp/oink_layernorm_bwd_sm103_dsv4_cuda_graph_seq.json
```

## Links

| What | Link |
57 changes: 50 additions & 7 deletions oink/benchmarks/README.md
@@ -18,8 +18,8 @@ Recommended env vars:

```bash
export PYTORCH_ALLOC_CONF=expandable_segments:True
-# GB300 / SM103:
-export CUTE_DSL_ARCH=sm_103a
+# GB300 / SM103 on the current CuTeDSL host:
+export CUTE_DSL_ARCH=sm_103
# GB200/B200 / SM100 historical runs:
# export CUTE_DSL_ARCH=sm_100a
```
@@ -42,6 +42,7 @@ conda run -n cute python -m pip install 'git+https://github.com/Dao-AILab/quack.
with `hidden = 4096` so `M = batch * seq`, `N = 4096`.
- **DeepSeek-V3-like (DSv3)**
- RMSNorm / LayerNorm / Softmax: `M ∈ {4096, 16384, 65536}`, `N ∈ {6144, 7168, 8192}`
- LayerNorm backward's `--dsv3` suite uses `N ∈ {6144, 8192}`; use `--dsv4` for the `N = 7168` hidden-state sweep.
- Cross-entropy: `M ∈ {4096, 16384, 65536}`, `N ∈ {3072, 6144, 8192, 12288}`
- **DeepSeek-V4-Flash norm shapes (DSv4)** from `deepseek-ai/DeepSeek-V4-Flash/inference/model.py`
- hidden-state RMSNorm / LayerNorm: `M ∈ {4096, 16384, 65536}`, `N = 7168`
@@ -73,7 +74,7 @@ Current measured GB300 BF16 STREAM-like roof used in the README:
Regenerate on the current machine:

```bash
-conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103a \
+conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
python benchmarks/benchmark/benchmark_hbm_roofline_sm100.py --dtype bf16 --op both --gb 1 \
--json /tmp/oink_sm103_hbm_roofline_bf16_current.json'
```
@@ -94,11 +95,11 @@ Run the full Quack-suite + DSv3 set (Oink vs Quack) and write all JSON artifacts
to a timestamped directory:

```bash
-conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103a \
+conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
python benchmarks/readme/run_sm100_suite.py --dtype bf16'

# Include DeepSeek-V4-Flash norm workloads:
-conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103a \
+conda run -n cute bash -lc 'PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
python benchmarks/readme/run_sm100_suite.py --dtype bf16 --include-dsv4 \
--out-dir /tmp/oink_sm103_suite_bf16_current'
```
@@ -162,14 +163,14 @@ outputs.

```bash
# DeepSeek-V3 hidden-size sweep
-PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103a \
+PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
python benchmarks/benchmark/benchmark_fused_add_rmsnorm_sm100.py \
--dtype bf16 --dsv3 --iters 80 --warmup-ms 15 \
--quack-baseline kernel_inplace \
--json /tmp/oink_sm103_fused_add_rmsnorm_dsv3_bf16.json

# DeepSeek-V4-Flash hidden-state sweep (N=7168)
-PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103a \
+PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 \
python benchmarks/benchmark/benchmark_fused_add_rmsnorm_sm100.py \
--dtype bf16 --dsv4 --iters 80 --warmup-ms 15 \
--quack-baseline kernel_inplace \
@@ -245,6 +246,48 @@ python benchmarks/benchmark/benchmark_layernorm_sm100.py --dtype bf16 --dsv3 --i
--json /tmp/oink_layernorm_fwd_dsv3.json
```

### LayerNorm backward

This benchmark compares Oink against ATen's native LayerNorm backward reference
and, when the installed OSS Quack package exposes `quack.rmsnorm.layernorm_bwd`,
against Quack's LayerNorm backward. Each available backend is validated against a
chunked fp32 PyTorch formula before timing. Current table numbers use CUDA graph
warm replay (`--cuda-graph`). The local Quack package used for these runs exposes
LayerNorm forward but not `layernorm_bwd`, so the Quack timing columns are omitted.
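
An un-chunked fp32 sketch of that correctness reference (the benchmark's actual
checker chunks over rows to bound memory; the helper name is illustrative):

```python
import torch

def layernorm_bwd_ref_fp32(x, dout, weight, eps=1e-5):
    """fp32 dx/dweight for no-bias LayerNorm, via the standard backward formula."""
    xf, doutf, wf = x.float(), dout.float(), weight.float()
    mean = xf.mean(dim=-1, keepdim=True)
    rstd = (xf.var(dim=-1, unbiased=False, keepdim=True) + eps).rsqrt()
    xhat = (xf - mean) * rstd
    g = doutf * wf  # dL/dxhat
    dx = rstd * (g - g.mean(dim=-1, keepdim=True)
                 - xhat * (g * xhat).mean(dim=-1, keepdim=True))
    dweight = (doutf * xhat).sum(dim=0)
    return dx, dweight
```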

DSv3 CUDA-graph replay results (`N ∈ {6144,8192}`):

| M | N | Oink ms | Oink TB/s | ATen ref ms | Oink/ref |
|---:|---:|---:|---:|---:|---:|
| 4096 | 6144 | 0.0548 | 2.7574 | 0.0777 | 1.4190x |
| 4096 | 8192 | 0.0611 | 3.2951 | 0.0970 | 1.5873x |
| 16384 | 6144 | 0.1840 | 3.2833 | 0.2794 | 1.5183x |
| 16384 | 8192 | 0.2093 | 3.8480 | 0.3387 | 1.6183x |
| 65536 | 6144 | 0.6896 | 3.5043 | 1.0652 | 1.5447x |
| 65536 | 8192 | 0.7372 | 4.3705 | 1.3138 | 1.7823x |

DSv4 hidden LayerNorm CUDA-graph replay results (`N = 7168`):

| M | N | Oink ms | Oink TB/s | ATen ref ms | Oink/ref |
|---:|---:|---:|---:|---:|---:|
| 4096 | 7168 | 0.0591 | 2.9800 | 0.0858 | 1.4503x |
| 16384 | 7168 | 0.1990 | 3.5425 | 0.3077 | 1.5467x |
| 65536 | 7168 | 0.7467 | 3.7753 | 1.1711 | 1.5684x |
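
As a spot check (a sketch), the `M = 65536` row's TB/s follows from the no-bias
useful-IO model assumed for these tables (3·M·N activation traffic, weight and
dweight, fp32 mean and rstd):

```python
M, N, elem = 65536, 7168, 2                # bf16 elements
logical_bytes = 3 * M * N * elem           # read x, read dout, write dx
logical_bytes += 2 * N * elem              # read weight, write dweight
logical_bytes += 2 * M * 4                 # read fp32 mean and rstd
tbps = logical_bytes / 0.7467e-3 / 1e12    # ~3.775 TB/s, matching the table row
```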

```bash
# DeepSeek-V4-Flash hidden LayerNorm shape sweep (N=7168)
env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \
conda run -n cute python -u benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \
--dtype bf16 --weight-dtype same --dsv4 --iters 80 --warmup-ms 10 --cuda-graph \
--json /tmp/oink_layernorm_bwd_sm103_dsv4_cuda_graph_seq.json

# DeepSeek-V3 shape sweep (N in {6144,8192})
env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \
conda run -n cute python -u benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \
--dtype bf16 --weight-dtype same --dsv3 --iters 80 --warmup-ms 10 --cuda-graph \
--json /tmp/oink_layernorm_bwd_sm103_dsv3_cuda_graph_seq.json
```

## Notes

- These scripts intentionally avoid importing any external Oink checkout so the
15 changes: 13 additions & 2 deletions oink/benchmarks/benchmark/bench_utils.py
@@ -75,7 +75,9 @@ def ensure_blackwell_arch_env(device: Optional[torch.device] = None) -> str:

Benchmarks often run outside the Oink/vLLM plugin path, so they don't
benefit from the plugin's device-capability-based `CUTE_DSL_ARCH` setup.
-On GB300 we want `sm_103a` instead of the older hard-coded `sm_100a`.
+On this GB300/CuTeDSL 4.4.2 host, LayerNorm backward compiles reliably
+with `sm_103`; callers may still pin an `a` arch explicitly if their local
+CuTeDSL build requires it.
"""
pinned = os.environ.get("CUTE_DSL_ARCH")
if pinned:
@@ -86,7 +88,9 @@ def ensure_blackwell_arch_env(device: Optional[torch.device] = None) -> str:
if device is None:
device = torch.device("cuda")
major, minor = torch.cuda.get_device_capability(device)
-if int(major) == 10:
+if int(major) == 10 and int(minor) == 3:
+arch = "sm_103"
+elif int(major) == 10:
arch = f"sm_{int(major)}{int(minor)}a"
os.environ["CUTE_DSL_ARCH"] = arch
return arch
@@ -115,6 +119,13 @@ def do_bench_triton(
return float(triton_do_bench(fn, warmup=warmup_ms, rep=rep_ms, return_mode="mean"))


def do_bench_cuda_graph(fn: Callable[[], Any], *, rep_ms: int = 100) -> float:
"""CUDA-graph replay timing via Triton's cudagraph benchmark helper."""
from triton.testing import do_bench_cudagraph

return float(do_bench_cudagraph(fn, rep=rep_ms, return_mode="mean"))


def parse_dtype(s: str) -> torch.dtype:
s = s.lower()
if s == "bf16":