Clean up and optimize self-contained LayerNorm backward for GB300 #133

Merged
Laurawly merged 1 commit into main from laura/layernorm
May 4, 2026

@Laurawly commented May 4, 2026

Primary public benchmark mode is now CUDA graph warm replay:

  env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \
    conda run -n cute python -u oink/benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \
      --dtype bf16 --weight-dtype same --dsv3 --iters 80 --warmup-ms 10 --cuda-graph \
      --json /tmp/oink_layernorm_bwd_sm103_dsv3_cuda_graph_seq.json

  env PYTHONNOUSERSITE=1 CUTE_DSL_ARCH=sm_103 PYTORCH_ALLOC_CONF=expandable_segments:True \
    conda run -n cute python -u oink/benchmarks/benchmark/benchmark_layernorm_bwd_sm100.py \
      --dtype bf16 --weight-dtype same --dsv4 --iters 80 --warmup-ms 10 --cuda-graph \
      --json /tmp/oink_layernorm_bwd_sm103_dsv4_cuda_graph_seq.json

  DSv3 LayerNorm backward, bf16/same/no-bias, CUDA graph replay

       M      N   Oink ms   Oink TB/s   ATen ref ms   Oink/ref
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    4096   6144    0.0548      2.7574        0.0777    1.4190x
    4096   8192    0.0611      3.2951        0.0970    1.5873x
   16384   6144    0.1840      3.2833        0.2794    1.5183x
   16384   8192    0.2093      3.8480        0.3387    1.6183x
   65536   6144    0.6896      3.5043        1.0652    1.5447x
   65536   8192    0.7372      4.3705        1.3138    1.7823x

  DSv4 hidden LayerNorm backward, bf16/same/no-bias, CUDA graph replay

       M      N   Oink ms   Oink TB/s   ATen ref ms   Oink/ref
  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
    4096   7168    0.0591      2.9800        0.0858    1.4503x
   16384   7168    0.1990      3.5425        0.3077    1.5467x
   65536   7168    0.7467      3.7753        1.1711    1.5684x
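
The two derived columns in the tables above can be sanity-checked from the timings alone. The sketch below is not taken from the benchmark script; it assumes that memory traffic is dominated by three M×N bf16 tensors (reading `x` and `dy`, writing `dx`) and that the `Oink/ref` column is the ATen reference time divided by the Oink time. Under those assumptions the recomputed values agree with every reported row to well within 1%. The names `ROWS`, `estimated_tbps`, and `speedup` are illustrative only.

```python
# Rows copied from the DSv3 and DSv4 tables above:
# (M, N, oink_ms, reported_tbps, ref_ms, reported_ratio)
ROWS = [
    (4096, 6144, 0.0548, 2.7574, 0.0777, 1.4190),
    (4096, 8192, 0.0611, 3.2951, 0.0970, 1.5873),
    (16384, 6144, 0.1840, 3.2833, 0.2794, 1.5183),
    (16384, 8192, 0.2093, 3.8480, 0.3387, 1.6183),
    (65536, 6144, 0.6896, 3.5043, 1.0652, 1.5447),
    (65536, 8192, 0.7372, 4.3705, 1.3138, 1.7823),
    (4096, 7168, 0.0591, 2.9800, 0.0858, 1.4503),
    (16384, 7168, 0.1990, 3.5425, 0.3077, 1.5467),
    (65536, 7168, 0.7467, 3.7753, 1.1711, 1.5684),
]

BYTES_PER_ELEM = 2   # bf16
NUM_MN_TENSORS = 3   # assumption: read x and dy, write dx


def estimated_tbps(m: int, n: int, ms: float) -> float:
    """Approximate bandwidth in TB/s, ignoring the small weight/stat tensors."""
    bytes_moved = NUM_MN_TENSORS * m * n * BYTES_PER_ELEM
    return bytes_moved / (ms * 1e-3) / 1e12


def speedup(oink_ms: float, ref_ms: float) -> float:
    """Assumed meaning of the 'Oink/ref' column: reference time over Oink time."""
    return ref_ms / oink_ms


if __name__ == "__main__":
    for m, n, oink_ms, tbps, ref_ms, ratio in ROWS:
        print(
            f"{m:6d} {n:5d}  "
            f"TB/s est {estimated_tbps(m, n, oink_ms):.4f} (reported {tbps:.4f})  "
            f"speedup est {speedup(oink_ms, ref_ms):.4f}x (reported {ratio:.4f}x)"
        )
```

The small residual differences are consistent with the timings being rounded to four decimal places and with the estimate omitting the per-row statistics and the weight and weight-gradient vectors.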

@meta-cla (bot) added the CLA Signed label May 4, 2026
@Laurawly force-pushed the laura/layernorm branch from f9853c9 to cf5cc14 May 4, 2026 22:28
@Laurawly force-pushed the laura/layernorm branch from cf5cc14 to af69e4d May 4, 2026 23:04
@Laurawly merged commit a3001c7 into main May 4, 2026
9 checks passed