hexagon: flash attention rework (optimizations, accuracy improvements, etc) by max-krasnyansky · Pull Request #25085 · ggml-org/llama.cpp

max-krasnyansky · 2026-06-27T23:12:54Z

Overview

The main focus here is to update FLASH_ATTN_EXT in ggml-hexagon, similar to what I did in #24954:

Combining HMX and HVX into htp/flash-atten-ops.c and removing duplication
Updating op tracing instrumentation (added a few new events)
Moving inner kernels into htp/{hvx,hmx}-fa-kernels.h
Moving kernel-params computation into the host (cached as part of the graph caching)
Improving pipelining and most of the kernels
Updating softmax to use FP32 accumulators, similar to #24389 (hexagon: store HMX flash-attention softmax accumulators in FP32)

The PR also includes some MUL_MAT updates as a followup to #24954.
As I was looking at Flash Attention op-traces I noticed a few more things to improve in MUL_MAT.
Technically, I could split that up, but I did extensive testing of the combined update.

Additional information

Tested on all supported devices.
Newer devices show higher gains, but there are steady improvements across the board.

Details

## gemma-4-E2B_q4_0

Ventuno-Q
   prompt eval time = 1466.41 ms / 786 tokens (  1.87 ms per token,  536.00 tokens per second) (vs 507.80 master)
          eval time = 4114.31 ms /  63 runs   ( 65.31 ms per token,   15.31 tokens per second) (vs  15.04 master)

S24U
   prompt eval time =  848.53 ms / 741 tokens (  1.15 ms per token,  873.28 tokens per second) (vs 815.58 master)
          eval time = 2816.80 ms /  63 runs   ( 44.71 ms per token,   22.37 tokens per second) (vs  19.29 master)

S25+
   prompt eval time =  497.93 ms / 741 tokens (  0.67 ms per token, 1488.15 tokens per second) (vs 1201.47 master)
          eval time = 2324.59 ms /  63 runs   ( 36.90 ms per token,   27.10 tokens per second) (vs   25.24 master)

S26+
   prompt eval time =  403.29 ms / 741 tokens (  0.54 ms per token, 1837.38 tokens per second) (vs 1646.45 master)
          eval time = 1983.48 ms /  63 runs   ( 31.48 ms per token,   31.76 tokens per second) (vs   29.45 master)

X2-Elite
   prompt eval time =  376.89 ms / 741 tokens (  0.51 ms per token, 1966.10 tokens per second) (vs 1688.41 master)
          eval time = 1465.07 ms /  63 runs   ( 23.26 ms per token,   43.00 tokens per second) (vs   34.15 master)

## Qwen3.5-2B-Q4_0.gguf

S24U
   prompt eval time = 1249.01 ms / 742 tokens (  1.68 ms per token,  594.07 tokens per second) (vs 581.78 master)
          eval time = 2847.99 ms /  63 runs   ( 45.21 ms per token,   22.12 tokens per second) (vs  21.59 master)

S25+
   prompt eval time =  753.36 ms / 742 tokens (  1.02 ms per token,  984.92 tokens per second) (vs 969.03 master)
          eval time = 2232.04 ms /  63 runs   ( 35.43 ms per token,   28.23 tokens per second) (vs  27.29 master)

S26+
   prompt eval time =  570.42 ms / 742 tokens (  0.77 ms per token, 1300.80 tokens per second) (vs 1282.19 master)
          eval time = 2015.43 ms /  63 runs   ( 31.99 ms per token,   31.26 tokens per second) (vs   30.79 master)

X2-Elite
   prompt eval time =  568.55 ms / 742 tokens (  0.77 ms per token, 1305.07 tokens per second) (vs 1258.02 master)
          eval time = 1779.24 ms /  63 runs   ( 28.24 ms per token,   35.41 tokens per second) (vs   37.17 master)

## Llama-3.2-1B

S26+
   prompt eval time =  190.18 ms / 766 tokens (  0.25 ms per token, 4027.72 tokens per second) (vs 2855.28 master)
          eval time = 1161.91 ms /  63 runs   ( 18.44 ms per token,   54.22 tokens per second) (vs   42.41 master)

## Llama-3.2-3B

S26+
   prompt eval time =  421.11 ms / 766 tokens (  0.55 ms per token, 1819.01 tokens per second) (vs 1431.30 master)
          eval time = 2519.05 ms /  63 runs   ( 39.98 ms per token,   25.01 tokens per second) (vs   21.45 master)
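
To turn the "tokens per second (vs master)" pairs above into relative gains, a small helper like the following can be used. This is only an illustrative sketch: `gain_pct` and `print_gain` are hypothetical names, and the numbers passed in are the ones from the listing.

```c
#include <stdio.h>

// Relative change, in percent, of the PR throughput versus master.
static double gain_pct(double pr_tps, double master_tps) {
    return (pr_tps / master_tps - 1.0) * 100.0;
}

// Print one comparison row: label, both throughputs, and the signed change.
static void print_gain(const char * label, double pr_tps, double master_tps) {
    printf("%-28s %8.2f vs %8.2f t/s  %+6.1f%%\n",
           label, pr_tps, master_tps, gain_pct(pr_tps, master_tps));
}
```

For example, `print_gain("Llama-3.2-1B S26+ prompt", 4027.72, 2855.28)` reports a gain of roughly 41%, while `print_gain("Qwen3.5-2B X2-Elite eval", 35.41, 37.17)` comes out at roughly -4.7%, the one entry in the listing where the PR number is below master.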

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: Yes, lots of help from Antigravity with exploring different ideas and with refactoring.

max-krasnyansky added 30 commits June 27, 2026 10:36

hex-mm: fold mm quant tasks into the main matmul threads

448f4e0

hex-mm: minor formatting fixes

1ace6e1

hex-mm: cleanup is_quant checks in dma dispatch

8c7dc2c

hex-mm: fix dst-spad alignment

15f6802

hex-mm: move fp kernels in the hvx-mm-kernels header

4e4abe9

hex-mm: fuse with ADD

158da44

hex-fa: factor out ukernels into separate headers and unify the rest

ca01fdf

hex-fa: move kernel-params compute into the host

21e26d1

hex-fa: refactor vtcm alloc for consistency

5a6f64f

hex-fa: add support for FA_SELECT

bedeabb

hex-fa: update tracing insrumentation to cover all functions

3a26a9a

hex-fa: update hvx fallback thresholds to recover t/g regressions

dab2414

hex-fa: update tracing instrumentation

abf8ae1

hex-fa: improved tracing with additional events

3d242aa

hex-fa: optimize mask processing (fastdiv, etc)

3dc8d7b

hex-fa: improve mask dma caching

0b262e6

hmx-fa: change loop order to maximize mask cache hits

537e208

hex-fa: remove over instrumentation

ac5a388

hex-fa: breakdown QKV prep trace events

1e086aa

hmx-fa: further mask proc optimizations

dd1809a

hex-fa: mask broadcast is the common case, optimize for that

bdbb71c

hex-fa: use aligned loads where possible

6c957b0

hex-fa: update loops to use uint32_t indices

73d9381

hmx-fa: fold vtcm init into q prep task

f1d803c

hex-fa: update rest of the hmx funcs to use uint32_t

4cb1dd2

hmx-fa: fold build_d into the main softmax loop

fac9a26

hmx-fa: start kv dmas earlier

c1dfc29

hmx-fa: start mask dma a bit earlier

5b0306c

hex-fa: precompute rows per task to avoid divs

f0a8030

hmx-fa: specialize fa_o_store for f16 and f32

dfc1a2c
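
Several commits above mention replacing divisions in the mask path ("fastdiv", "precompute rows per task to avoid divs"). As background, the usual way to do this is to precompute a multiplier and a shift for a fixed divisor once, then replace each division with a multiply-high, an add, and a shift. The sketch below shows that standard technique in portable C. The names `fastdiv_t`, `fastdiv_init`, `fastdiv`, and `fastmod` are hypothetical and are not claimed to match the helpers used in this PR.

```c
#include <stdint.h>

// Hypothetical helper type: precomputed constants for dividing by a fixed d.
typedef struct {
    uint32_t mp;  // magic multiplier
    uint32_t l;   // shift amount, ceil(log2(d))
} fastdiv_t;

// Precompute once per divisor (for example on the host). d must be nonzero.
static fastdiv_t fastdiv_init(uint32_t d) {
    uint32_t l = 0;
    while (l < 32 && ((uint32_t) 1 << l) < d) {
        l++;
    }
    fastdiv_t f;
    f.mp = (uint32_t) ((((uint64_t) 1 << 32) * (((uint64_t) 1 << l) - d)) / d + 1);
    f.l  = l;
    return f;
}

// n / d without a divide: high half of n * mp, plus n, shifted right by l.
// The sum is done in 64 bits so it cannot overflow for large n.
static inline uint32_t fastdiv(uint32_t n, fastdiv_t f) {
    const uint64_t hi = ((uint64_t) n * f.mp) >> 32;
    return (uint32_t) ((hi + n) >> f.l);
}

// n % d derived from the quotient.
static inline uint32_t fastmod(uint32_t n, uint32_t d, fastdiv_t f) {
    return n - fastdiv(n, f) * d;
}
```

This pays off when the same divisor (a tensor dimension, a broadcast factor) is reused for every row or tile, since the setup cost is paid once and the per-element cost no longer includes an integer divide.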

max-krasnyansky added 13 commits June 27, 2026 10:39

hmx-fa: prelim support for Sinks

294e92d

hmx-fa: keep softmax accumulators in fp32

2c7469d

hex-fa: add tanh_f16 and exp2_f16 and use that in FA

09deb72

hex-fa: use fp16 math in the hvx kernel

e73eee3

hex-fa: avoid expensive float -> __fp16 cast for slopes and softcap

d4e5e60

hex-fa: replace most vec_exp_f32 with vec_exp2_f16

860c22c

hmx-fa: vectorize sinks update

3b20dee

hex-fa: minor formatting

42f6e12

hmx-fa: fold softcap loop into the tile load

29b49bb

hmx-fa: use vectoralias to populate sinks

dce26e3

hex-fa: remove redudant check

2d1f270

hex-fa: fix vtcm size compute to use fp32 for accumulators

d270599

hex-fa: make lto happy

6ff88f3

github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Hexagon labels Jun 27, 2026

hex-mm: fix trailing spaces

10d8e6f

max-krasnyansky wants to merge 44 commits into ggml-org:master from qualcomm:hexagon-fa-rework
