
fix misleading benchmarking for fp8 gemm #1980

Open
shunting314 wants to merge 1 commit into shunting314/stack/35 from shunting314/stack/37

Conversation

Contributor

@shunting314 commented Apr 7, 2026

Stacked PRs:


fix misleading benchmarking for fp8 gemm

The fp8 gemm benchmark is very misleading. Before the fix it showed Helion getting a 5.23x speedup on H100 for the 1024/1024/1024 shape, but that is too good to be true: the baseline latency is much slower than expected. It turns out that two things make the baseline slow:

  1. We allocate the scale tensors inside the kernel; they should be passed in.
  2. We transpose matrix B inside the kernel; it should be pre-transposed outside of the kernel.

The baseline latency now changes from 0.0814 ms to 0.0097 ms (an 8.4x difference).

```python
x_fp8 = x.to(torch.float8_e4m3fn)
y_fp8 = y.to(torch.float8_e4m3fn)                   # before
y_fp8 = y.to(torch.float8_e4m3fn).T.contiguous().T  # after: column-major copy
run_example(fp8_gemm, reference_fp8_gemm_pytorch, (x_fp8, y_fp8))
```
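The `.T.contiguous().T` idiom keeps the logical shape and values but changes the memory layout to column-major, which is what the fp8 gemm kernel expects for B. A small illustration with plain float tensors, just to show the strides:

```python
import torch

y = torch.randn(4, 8)        # row-major: stride (8, 1)
y_col = y.T.contiguous().T   # same shape and values, column-major: stride (1, 4)

print(y.shape, y.stride())
print(y_col.shape, y_col.stride())
```

The first transpose makes an 8x4 view, `contiguous()` materializes it row-major, and the second transpose views that buffer back as 4x8, leaving the data laid out column by column.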
Contributor

@jansel Apr 8, 2026


What happens if you remove this line? This patch is changing the layout used by the non-reference versions as well, so it is not apples-to-apples with the prior version.

Contributor Author


Removing this line will cause torch._scaled_mm to fail; the kernel requires matrix B to be column-major.

I think the transpose call should usually be fused with preceding ops in practice? But I can check how it looks in vLLM.

Contributor


There are two different kernels:

  1. fp8 gemm with both args contiguous
  2. fp8 gemm with second arg transposed

The issue is that eager mode has a kernel for 2 but no kernel for 1. If we are measuring 1, then you don't get to pre-compute anything: doing it in eager takes two kernels, and we should measure both.
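The two-kernels-vs-one distinction can be sketched as follows. This is a stand-in using plain fp32 matmul instead of torch._scaled_mm (so it runs without an fp8-capable GPU); the function names are hypothetical.

```python
import torch

def gemm_case2(x, y_col):
    # Case 2: second arg already column-major -- eager runs one kernel.
    return x @ y_col

def gemm_case1(x, y):
    # Case 1: second arg row-major -- eager needs a layout-conversion
    # kernel first, so an honest benchmark times both kernels together.
    y_col = y.T.contiguous().T
    return x @ y_col

x = torch.randn(128, 128)
y = torch.randn(128, 128)
out1 = gemm_case1(x, y)
out2 = gemm_case2(x, y.T.contiguous().T)
```

Whether the layout copy belongs inside or outside the measured region depends on which of the two cases the benchmark claims to measure.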

stack-info: PR: #1980, branch: shunting314/stack/37
@shunting314 marked this pull request as draft April 8, 2026 18:16
@shunting314 changed the base branch from shunting314/stack/35 to main April 8, 2026 18:16
@shunting314 force-pushed the shunting314/stack/37 branch from 6be5e4d to 1a8097e April 8, 2026 18:16
@shunting314 changed the base branch from main to shunting314/stack/35 April 8, 2026 18:16
@shunting314 marked this pull request as ready for review April 8, 2026 18:16

Labels

CLA Signed This label is managed by the Meta Open Source bot.


3 participants