hexagon: flash attention rework (optimizations, accuracy improvements, etc) #25085

Draft
max-krasnyansky wants to merge 44 commits into ggml-org:master from qualcomm:hexagon-fa-rework

@max-krasnyansky

Member

Overview

The main focus here is to update FLASH_ATTN_EXT in ggml-hexagon, similar to what I did in #24954:

  • Combining HMX and HVX into htp/flash-atten-ops.c and removing duplication
  • Updating op tracing instrumentation (added a few new events)
  • Moving inner kernels into htp/{hvx,hmx}-fa-kernels.h
  • Moving kernel-params computation into the host (cached as part of the graph caching)
  • Improving pipelining and most of the kernels
  • Updating softmax to use FP32 accumulators,
    similar to "hexagon: store HMX flash-attention softmax accumulators in FP32" (#24389)

The PR also includes some MUL_MAT updates as a followup to #24954.
While looking at the Flash Attention op traces, I noticed a few more things to improve in MUL_MAT.
Technically, I could split those changes out, but I did extensive testing of the combined update.

Additional information

Tested on all supported devices.
Newer devices show higher gains, but there are steady improvements across the board.

Details
## gemma-4-E2B_q4_0

Ventuno-Q
   prompt eval time = 1466.41 ms / 786 tokens (  1.87 ms per token,  536.00 tokens per second) (vs 507.80 master)
          eval time = 4114.31 ms /  63 runs   ( 65.31 ms per token,   15.31 tokens per second) (vs  15.04 master)

S24U
   prompt eval time =  848.53 ms / 741 tokens (  1.15 ms per token,  873.28 tokens per second) (vs 815.58 master)
          eval time = 2816.80 ms /  63 runs   ( 44.71 ms per token,   22.37 tokens per second) (vs  19.29 master)

S25+
   prompt eval time =  497.93 ms / 741 tokens (  0.67 ms per token, 1488.15 tokens per second) (vs 1201.47 master)
          eval time = 2324.59 ms /  63 runs   ( 36.90 ms per token,   27.10 tokens per second) (vs   25.24 master)

S26+
   prompt eval time =  403.29 ms / 741 tokens (  0.54 ms per token, 1837.38 tokens per second) (vs 1646.45 master)
          eval time = 1983.48 ms /  63 runs   ( 31.48 ms per token,   31.76 tokens per second) (vs   29.45 master)

X2-Elite
   prompt eval time =  376.89 ms / 741 tokens (  0.51 ms per token, 1966.10 tokens per second) (vs 1688.41 master)
          eval time = 1465.07 ms /  63 runs   ( 23.26 ms per token,   43.00 tokens per second) (vs   34.15 master)

## Qwen3.5-2B-Q4_0.gguf

S24U
   prompt eval time = 1249.01 ms / 742 tokens (  1.68 ms per token,  594.07 tokens per second) (vs 581.78 master)
          eval time = 2847.99 ms /  63 runs   ( 45.21 ms per token,   22.12 tokens per second) (vs  21.59 master)

S25+
   prompt eval time =  753.36 ms / 742 tokens (  1.02 ms per token,  984.92 tokens per second) (vs 969.03 master)
          eval time = 2232.04 ms /  63 runs   ( 35.43 ms per token,   28.23 tokens per second) (vs  27.29 master)

S26+
   prompt eval time =  570.42 ms / 742 tokens (  0.77 ms per token, 1300.80 tokens per second) (vs 1282.19 master)
          eval time = 2015.43 ms /  63 runs   ( 31.99 ms per token,   31.26 tokens per second) (vs   30.79 master)

X2-Elite
   prompt eval time =  568.55 ms / 742 tokens (  0.77 ms per token, 1305.07 tokens per second) (vs 1258.02 master)
          eval time = 1779.24 ms /  63 runs   ( 28.24 ms per token,   35.41 tokens per second) (vs   37.17 master)

## Llama-3.2-1B

S26+
   prompt eval time =  190.18 ms / 766 tokens (  0.25 ms per token, 4027.72 tokens per second) (vs 2855.28 master)
          eval time = 1161.91 ms /  63 runs   ( 18.44 ms per token,   54.22 tokens per second) (vs   42.41 master)

## Llama-3.2-3B

S26+
   prompt eval time =  421.11 ms / 766 tokens (  0.55 ms per token, 1819.01 tokens per second) (vs 1431.30 master)
          eval time = 2519.05 ms /  63 runs   ( 39.98 ms per token,   25.01 tokens per second) (vs   21.45 master)
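
To compare the figures above as relative changes rather than raw tokens per second, a small helper like the following can be used. It is a sketch only: `gain_pct` is a hypothetical name, not something from this PR or from llama.cpp.

```c
#include <assert.h>

/* Relative change, in percent, of this PR's throughput versus master.
 * Positive means the PR is faster; negative means it is slower. */
static double gain_pct(double pr_tps, double master_tps) {
    assert(master_tps > 0.0);
    return (pr_tps / master_tps - 1.0) * 100.0;
}
```

For example, `gain_pct(1488.15, 1201.47)` (gemma-4-E2B prompt eval on S25+) comes out at roughly 23.9%, `gain_pct(536.00, 507.80)` (Ventuno-Q prompt eval) at roughly 5.6%, and `gain_pct(35.41, 37.17)` (Qwen3.5-2B eval on X2-Elite) at roughly -4.7%.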

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, lots of help from Antigravity with exploring different ideas and with refactoring.

github-actions bot added the ggml (changes relating to the ggml tensor library for machine learning) and Hexagon labels on Jun 27, 2026