
Add fused GatedDeltaNet decode Triton kernel #18865

Draft
Gasoonjia wants to merge 1 commit into cuda-graph from fused-deltanet-decode

Conversation

@Gasoonjia
Contributor

Fuse Q/K/V split, L2 normalization, head repeat, gating computation, and delta-rule recurrent state update into a single Triton kernel for decode (T=1). Replaces ~6 small AOTI-generated kernels with one, reducing GatedDeltaNet kernel time by ~62% and improving end-to-end decode throughput by ~2% (106 -> 108.5 tok/s on A100).
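For reviewers who want a reference for what the fused kernel covers, the five stages listed above can be sketched as an unfused NumPy decode step. This is not the Triton kernel from this PR, only a rough reference under stated assumptions: the function name `gated_delta_decode_step`, the packed `qkv` layout, the gating formulas (`g = -exp(A_log) * softplus(a + dt_bias)`, `beta = sigmoid(b)`), the `d_k ** -0.5` query scale, and the interleaved head repeat follow a commonly used GatedDeltaNet formulation and may differ from the model code here.

```python
import numpy as np


def l2norm(x, eps=1e-6):
    # Normalize each head vector to (approximately) unit length.
    return x / np.sqrt((x * x).sum(axis=-1, keepdims=True) + eps)


def gated_delta_decode_step(qkv, a, b, A_log, dt_bias, state,
                            num_k_heads, num_v_heads, d_k, d_v):
    """One decode step (T=1) for a single sequence. Hypothetical reference.

    qkv:   [2 * num_k_heads * d_k + num_v_heads * d_v], packed as Q | K | V (assumed)
    a, b, A_log, dt_bias: [num_v_heads] gating inputs/parameters (assumed names)
    state: [num_v_heads, d_k, d_v] recurrent state
    """
    # 1) Q/K/V split
    q_end = num_k_heads * d_k
    k_end = 2 * q_end
    q = qkv[:q_end].reshape(num_k_heads, d_k)
    k = qkv[q_end:k_end].reshape(num_k_heads, d_k)
    v = qkv[k_end:].reshape(num_v_heads, d_v)

    # 2) L2 normalization (query scale is an assumption)
    q = l2norm(q) * d_k ** -0.5
    k = l2norm(k)

    # 3) Head repeat so Q/K match the number of V heads (interleaved, assumed)
    rep = num_v_heads // num_k_heads
    q = np.repeat(q, rep, axis=0)
    k = np.repeat(k, rep, axis=0)

    # 4) Gating: log-space decay g <= 0 and write strength beta in (0, 1)
    g = -np.exp(A_log) * np.logaddexp(0.0, a + dt_bias)  # softplus via logaddexp
    beta = 1.0 / (1.0 + np.exp(-b))

    # 5) Delta-rule recurrent state update
    state = state * np.exp(g)[:, None, None]          # decay the old state
    pred = np.einsum("hk,hkv->hv", k, state)          # what the state predicts for k
    delta = (v - pred) * beta[:, None]                # gated prediction error
    state = state + k[:, :, None] * delta[:, None, :]  # rank-1 write
    out = np.einsum("hk,hkv->hv", q, state)           # read with the query
    return out, state


if __name__ == "__main__":
    Hk, Hv, dk, dv = 2, 4, 4, 3
    rng = np.random.default_rng(0)
    qkv = rng.standard_normal(2 * Hk * dk + Hv * dv)
    zeros = np.zeros(Hv)
    out, new_state = gated_delta_decode_step(
        qkv, zeros, zeros, zeros, zeros, np.zeros((Hv, dk, dv)), Hk, Hv, dk, dv)
    print(out.shape, new_state.shape)
```

Each numbered stage reads and writes small intermediate tensors, which is why running them as separate kernels is launch-bound at T=1; a fused kernel can keep those intermediates in registers and touch the state only once per step.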

@pytorch-bot

pytorch-bot bot commented Apr 14, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/18865

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEVs

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 6 New Failures, 102 Cancelled Jobs, 51 Unrelated Failures

As of commit c19d43e with merge base 2eaa16c (image):

NEW FAILURES - The following jobs have failed:

CANCELLED JOBS - The following jobs were cancelled. Please retry:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

BROKEN TRUNK - The following jobs failed but were present on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the CLA Signed label Apr 14, 2026
@Gasoonjia Gasoonjia temporarily deployed to upload-benchmark-results April 14, 2026 07:39 — with GitHub Actions Inactive

Labels

ciflow/cuda, CLA Signed
