Commit 1c1e5ec
[None][perf] triton paged attention: non-pow2 head_dim, decode speedup, logit cap, FP8 KV cache support
- HEAD_DIM_PADDED: pad non-power-of-2 head dims to the next power of 2 for Triton efficiency
- WRITE_DIRECT optimization: skip the intermediate buffer and write output directly
- KV offset hoisting: precompute loop-invariant KV offsets outside the inner loop
- SW-aware splits: sliding-window-aware chunking for context attention
- Logit cap support: per-head logit softcapping (Gemma4 alt-attention)
- Always-Triton threshold: route all seq_lens above the threshold to the Triton path
- Two-chunk gather: efficient stage2 gather for non-pow2 head dims
- FP8 KV cache casting: cast K/V to the query dtype at all load sites (decode SW path, context block_ptr loads, stage2, fallback SDPA path)

Hedged sketches of each item follow; helper names, signatures, and constants are illustrative, not the commit's actual code.
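A minimal sketch of the HEAD_DIM_PADDED idea, assuming a non-pow2 head dim of 80 (the helper name is illustrative, not the commit's kernel): pad the extent up to the next power of 2 so `tl.arange` is legal, and mask off the padding lanes on load.

```python
import triton
import triton.language as tl

HEAD_DIM = 80                                        # non-pow2 head dim
HEAD_DIM_PADDED = triton.next_power_of_2(HEAD_DIM)   # -> 128

@triton.jit
def _load_head_padded(ptr, HEAD_DIM: tl.constexpr, HEAD_DIM_PADDED: tl.constexpr):
    offs = tl.arange(0, HEAD_DIM_PADDED)  # tl.arange requires a pow2 extent
    mask = offs < HEAD_DIM                # silence the padding lanes
    return tl.load(ptr + offs, mask=mask, other=0.0)
```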
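WRITE_DIRECT, read as a compile-time branch on the store path (a sketch under that assumption; the signature is hypothetical): when the result needs no stage2 reduction, store straight to the output tensor instead of the intermediate split buffer.

```python
import triton.language as tl

@triton.jit
def _store_result(acc, out_ptr, tmp_ptr, offs, mask, WRITE_DIRECT: tl.constexpr):
    if WRITE_DIRECT:
        # final result: write directly, skipping the intermediate buffer
        # and the later reduction pass over it
        tl.store(out_ptr + offs, acc, mask=mask)
    else:
        tl.store(tmp_ptr + offs, acc, mask=mask)
```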
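KV offset hoisting is standard loop-invariant code motion; a sketch with hypothetical strides: the batch/head component of the address is computed once, and the inner loop adds only the per-block term.

```python
import triton.language as tl

@triton.jit
def _sum_kv_blocks(k_ptr, stride_b, stride_h, stride_blk, off_b, off_h,
                   NUM_BLOCKS: tl.constexpr, BLOCK: tl.constexpr):
    kv_base = off_b * stride_b + off_h * stride_h   # hoisted: loop-invariant
    offs = tl.arange(0, BLOCK)
    acc = tl.zeros((BLOCK,), dtype=tl.float32)
    for blk in range(NUM_BLOCKS):
        # only the per-block offset is recomputed inside the loop
        acc += tl.load(k_ptr + kv_base + blk * stride_blk + offs).to(tl.float32)
    return acc
```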
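One plausible reading of SW-aware splits, as host-side Python (entirely illustrative): with a sliding window only the last `window` positions matter, so chunk boundaries start there instead of at 0.

```python
def sw_aware_splits(seq_len: int, window: int, chunk: int) -> list[tuple[int, int]]:
    # keys outside the sliding window can never be attended, so skip them
    start = max(0, seq_len - window)
    return [(s, min(s + chunk, seq_len)) for s in range(start, seq_len, chunk)]

# e.g. seq_len=4096, window=1024, chunk=256 -> 4 chunks covering [3072, 4096)
```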
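The logit cap is the usual tanh softcap, qk = cap * tanh(qk / cap), which bounds logits to (-cap, +cap); the helper name and the cap<=0-means-disabled convention are assumptions.

```python
import triton.language as tl

@triton.jit
def _apply_logit_cap(qk, LOGIT_CAP: tl.constexpr):
    if LOGIT_CAP > 0:  # compile-time branch; 0 means "no cap"
        qk = LOGIT_CAP * tl.math.tanh(qk / LOGIT_CAP)
    return qk
```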
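Host-side routing for the always-Triton threshold could look like the following; the cutoff value and both function names are hypothetical, not the commit's API.

```python
ALWAYS_TRITON_SEQ_LEN = 1024  # hypothetical cutoff

def pick_attention_backend(seq_len: int, triton_preferred: bool) -> str:
    # above the threshold the Triton kernel is assumed to win outright,
    # so route there regardless of the usual per-shape heuristics
    if seq_len >= ALWAYS_TRITON_SEQ_LEN:
        return "triton"
    return "triton" if triton_preferred else "sdpa"
```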
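The stage2 two-chunk gather can be pictured as splitting head_dim 80 into pow2 chunks 64 + 16, so each `tl.arange` extent is a power of 2 without padding all the way to 128; a sketch under that assumption.

```python
import triton.language as tl

@triton.jit
def _gather_two_chunks(ptr, CHUNK0: tl.constexpr, CHUNK1: tl.constexpr):
    # e.g. head_dim 80 -> CHUNK0=64, CHUNK1=16; both extents are pow2,
    # and no lanes are wasted on padding
    lo = tl.load(ptr + tl.arange(0, CHUNK0))
    hi = tl.load(ptr + CHUNK0 + tl.arange(0, CHUNK1))
    return lo, hi
```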
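The FP8 KV-cache change amounts to casting at each load site; a sketch assuming an FP8 cache read back into the query's fp16/bf16 dtype (the signature is illustrative).

```python
import triton.language as tl

@triton.jit
def _load_kv_as_q_dtype(k_ptrs, v_ptrs, mask, Q_DTYPE: tl.constexpr):
    # cast K/V to the query dtype right at the load, so dot products and
    # the softmax accumulation never operate on raw FP8 values
    k = tl.load(k_ptrs, mask=mask, other=0.0).to(Q_DTYPE)
    v = tl.load(v_ptrs, mask=mask, other=0.0).to(Q_DTYPE)
    return k, v
```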
Signed-off-by: Chenghao Zhang <211069071+nvchenghaoz@users.noreply.github.com>
1 parent 72dc7ec
2 files changed: 1158 additions & 514 deletions
File tree
- tensorrt_llm/_torch/auto_deploy/custom_ops/attention
- tests/unittest/auto_deploy/singlegpu/custom_ops/attention