FlashMaskV3 Single-node Speed Optimization #119

Merged
GuoxiaWang merged 4 commits into PaddlePaddle:main from Enigmatisms:new_optim
Apr 14, 2026

@Enigmatisms

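This PR consists of three parts:

  • For the global sliding window mask, fix the register spilling in the backward pass caused by multiple for loops and inline lambdas, so that a large tile size can be applied. This resolves the backward performance bottleneck for global sliding window.
  • Adjust how the scheduler barrier is used, which can improve forward performance for hdim128.
  • @xxyux's earlier PR, Optimize fwd hdim64 #90: register allocation optimization and tile size adjustment for hdim64.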

  } else {
-     if ((params.seqlen_q >= 1024 || params.seqlen_k >= 1024) && !(Has_lt_end && Has_ut_start)) {
+     if (params.seqlen_q >= 1024 || params.seqlen_k >= 1024) {
          run_mha_bwd_dispatch<Arch, T, 64, 128, 128, Is_causal, Is_local, Has_softcap, Is_flashmask_, Has_lt_end, Has_ut_start, Deterministic, Is_blockmask_, 2, 2, true, false, true, 2, 1, 2, 1, false>(params, stream);
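
The hunk above appears to widen the condition under which the hdim64 backward pass takes the `128, 128` configuration: the version with `!(Has_lt_end && Has_ut_start)` reads as the old line, and the version without it as the new one. A minimal sketch of that reading follows. It assumes the two `128` template arguments are the tile dimensions and that `Has_lt_end && Has_ut_start` marks the global sliding window case. `TileShape`, both `pick_bwd_tile_*` helpers, and the `{64, 64}` fallback are hypothetical names and values used only for illustration; the real code selects the configuration through template arguments, and the actual fallback branch is not shown in this hunk.

```cpp
// Illustrative sketch only: TileShape, the pick_bwd_tile_* helpers and the
// {64, 64} fallback are hypothetical and do not appear in the PR.
struct TileShape {
    int block_m;
    int block_n;
};

// Sequence-length test shared by both versions of the condition.
inline bool is_long_sequence(int seqlen_q, int seqlen_k) {
    return seqlen_q >= 1024 || seqlen_k >= 1024;
}

// Old condition: masks with both lt_end and ut_start were kept off the
// large tile, even for long sequences.
inline TileShape pick_bwd_tile_before(int seqlen_q, int seqlen_k,
                                      bool has_lt_end, bool has_ut_start) {
    if (is_long_sequence(seqlen_q, seqlen_k) && !(has_lt_end && has_ut_start)) {
        return {128, 128};
    }
    return {64, 64};  // assumed fallback, not taken from the diff
}

// New condition: the mask flags no longer matter, only sequence length does.
inline TileShape pick_bwd_tile_after(int seqlen_q, int seqlen_k) {
    if (is_long_sequence(seqlen_q, seqlen_k)) {
        return {128, 128};
    }
    return {64, 64};  // assumed fallback, not taken from the diff
}
```

Under these assumptions, a 4096 x 4096 problem with both flags set falls back to the small tile in the old version and takes the large tile in the new one, which matches the description's point that the large tile size can now be applied to global sliding window.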

@umiswing
Member

LGTM

@GuoxiaWang GuoxiaWang merged commit a44cf15 into PaddlePaddle:main Apr 14, 2026

4 participants