
fix misleading benchmarking for fp8 gemm #1980

Open
shunting314 wants to merge 1 commit into shunting314/stack/35 from shunting314/stack/37

Conversation

Contributor

@shunting314 commented Apr 7, 2026

Stacked PRs:


fix misleading benchmarking for fp8 gemm

The fp8 gemm benchmark is very misleading. Before the fix it showed Helion getting a 5.23x speedup on H100 for the 1024/1024/1024 shape, but that is too good to be true: the baseline latency is much slower than expected. It turns out that two things make the baseline slow:

  1. We allocate the scale tensors inside the kernel; they should be passed in.
  2. We transpose matrix B inside the kernel; it should be pre-transposed outside of the kernel.

The baseline latency now changes from 0.0814 ms to 0.0097 ms (an 8.4x difference).

```python
x_fp8 = x.to(torch.float8_e4m3fn)
y_fp8 = y.to(torch.float8_e4m3fn)                   # before
y_fp8 = y.to(torch.float8_e4m3fn).T.contiguous().T  # after: column-major copy
run_example(fp8_gemm, reference_fp8_gemm_pytorch, (x_fp8, y_fp8))
```
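The `.T.contiguous().T` idiom keeps the logical shape and values but changes the memory layout to column-major, which is what the fp8 gemm kernel expects for B. A small illustration with plain float tensors, just to show the strides:

```python
import torch

y = torch.randn(4, 8)        # row-major: stride (8, 1)
y_col = y.T.contiguous().T   # same shape and values, column-major: stride (1, 4)

print(y.shape, y.stride())
print(y_col.shape, y_col.stride())
```

The first transpose makes an 8x4 view, `contiguous()` materializes it row-major, and the second transpose views that buffer back as 4x8, leaving the data laid out column by column.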
Contributor

@jansel Apr 8, 2026


What happens if you remove this line? This patch is changing the layout used by the non-reference versions as well, so it is not apples-to-apples with the prior version.

Contributor Author


Removing this line will cause torch._scaled_mm to fail; the kernel requires matrix B to be column-major.

I think the transpose call should usually be fused with preceding ops in practice? But I can check how it looks in vLLM.

Contributor


There are two different kernels:

  1. fp8 gemm with both args contiguous
  2. fp8 gemm with second arg transposed

The issue is that eager mode has a kernel for 2 but no kernel for 1. If we are measuring 1, then you don't get to pre-compute anything: doing it in eager takes two kernels, and we should measure both.
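The two-kernels-vs-one distinction can be sketched as follows. This is a stand-in using plain fp32 matmul instead of torch._scaled_mm (so it runs without an fp8-capable GPU); the function names are hypothetical.

```python
import torch

def gemm_case2(x, y_col):
    # Case 2: second arg already column-major -- eager runs one kernel.
    return x @ y_col

def gemm_case1(x, y):
    # Case 1: second arg row-major -- eager needs a layout-conversion
    # kernel first, so an honest benchmark times both kernels together.
    y_col = y.T.contiguous().T
    return x @ y_col

x = torch.randn(128, 128)
y = torch.randn(128, 128)
out1 = gemm_case1(x, y)
out2 = gemm_case2(x, y.T.contiguous().T)
```

Whether the layout copy belongs inside or outside the measured region depends on which of the two cases the benchmark claims to measure.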

stack-info: PR: #1980, branch: shunting314/stack/37
@shunting314 marked this pull request as draft April 8, 2026 18:16
@shunting314 changed the base branch from shunting314/stack/35 to main April 8, 2026 18:16
@shunting314 force-pushed the shunting314/stack/37 branch from 6be5e4d to 1a8097e April 8, 2026 18:16
@shunting314 changed the base branch from main to shunting314/stack/35 April 8, 2026 18:16
@shunting314 marked this pull request as ready for review April 8, 2026 18:16

Labels

CLA Signed This label is managed by the Meta Open Source bot.


3 participants