fix misleading benchmarking for fp8 gemm #1980
shunting314 wants to merge 1 commit into shunting314/stack/35
Conversation
stack-info: PR: #1980, branch: shunting314/stack/37
Force-pushed from c038250 to 6be5e4d
```python
x_fp8 = x.to(torch.float8_e4m3fn)
y_fp8 = y.to(torch.float8_e4m3fn)
run_example(fp8_gemm, reference_fp8_gemm_pytorch, (x_fp8, y_fp8))
y_fp8 = y.to(torch.float8_e4m3fn).T.contiguous().T
```
What happens if you remove this line? This patch is changing the layout used by the non-reference versions as well, so it is not apples-to-apples with the prior version.
Removing this line will cause torch._scaled_mm to fail. The kernel requires matrix B to be column-major.
I think the transpose call would usually be fused with preceding ops in practice? But I can check how it looks in vLLM.
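The `.T.contiguous().T` idiom keeps the tensor's logical shape and values but materializes it in column-major memory layout, which is what the column-major requirement above is about. A minimal sketch of the same layout trick using NumPy strides for illustration (the benchmark itself uses torch, whose `.contiguous()` plays the role of `.copy()` here):

```python
import numpy as np

y = np.arange(12, dtype=np.float32).reshape(3, 4)  # row-major (C order)

# Transpose, materialize the transpose, transpose back:
# logically unchanged, but now column-major in memory.
y_col = y.T.copy().T

assert np.array_equal(y_col, y)          # same values, same shape
assert y.flags['C_CONTIGUOUS']           # original: row-major
assert y_col.flags['F_CONTIGUOUS']       # converted: column-major
```

The values never change; only the strides do, so a kernel that insists on a column-major second operand can consume `y_col` directly.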
There are two different kernels:
- fp8 gemm with both args contiguous
- fp8 gemm with second arg transposed
The issue is that eager mode has a kernel for (2) but no kernel for (1). If we are measuring (1), you don't get to pre-compute anything: in eager mode it takes two kernels, and we should measure both.
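The point above is that a fair eager baseline for case (1) must time both the layout conversion and the gemm, not just the gemm on a pre-converted operand. A hedged sketch of the two measurement strategies, with NumPy matmul standing in for the fp8 kernels and hypothetical function names:

```python
import numpy as np

def baseline_case1(x, y):
    # Case (1): both args start contiguous. Eager has no single kernel
    # for this, so the timed region must include BOTH steps:
    y_col = y.T.copy().T   # extra kernel: layout conversion to column-major
    return x @ y_col       # gemm kernel expecting a column-major second arg

def baseline_case2(x, y_col):
    # Case (2): second arg already column-major. Timing only this call
    # while converting y outside the timed region understates the
    # baseline cost for case (1) -- the misleading measurement.
    return x @ y_col

x = np.random.rand(256, 256).astype(np.float32)
y = np.random.rand(256, 256).astype(np.float32)

# Both paths compute the same product; they differ only in what gets timed.
assert np.allclose(baseline_case1(x, y), baseline_case2(x, y.T.copy().T))
```

Under this framing, benchmarking case (1) with case (2)'s kernel quietly drops the conversion kernel from the baseline, inflating the reported speedup.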
Force-pushed from 6be5e4d to 1a8097e
Stacked PRs:
fix misleading benchmarking for fp8 gemm
The fp8 gemm benchmarking is very misleading. Before the fix it showed Helion getting a 5.23x speedup on H100 for the 1024/1024/1024 shape. But that's too good to be true: the baseline latency was much slower than expected. It turns out that two things make the baseline slow.
The baseline latency now changes from 0.0814 ms to 0.0097 ms (an 8.4x difference).
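A quick check of the reported ratio, using the two latency figures from the description:

```python
# Baseline latency before and after the fix (values from the PR description).
before_ms = 0.0814
after_ms = 0.0097

ratio = before_ms / after_ms
print(round(ratio, 1))  # prints 8.4, matching the stated difference
```

The corrected, much faster baseline is what shrinks the previously reported 5.23x speedup down to a realistic figure.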