
ET backend generalization and performance uplift#8

Open
marty1885 wants to merge 455 commits into aifoundry-org:et from marty1885:backend-dev-2

Conversation


@marty1885 marty1885 commented Apr 15, 2026

This PR contains some general changes:

  • Sync with upstream llama.cpp

And ET backend-specific changes:

  • Optimization for MUL_MAT F32 @ Q8_0
  • Added support for WKV6, WKV7, IM2COL, GATED_DELTA_NET, SSM_CONV, SSM_SCAN, GROUP_NORM, RMS_NORM, NORM, SQR, SOLVE_TRI, FILL, TRI, PAD operators
  • Flash Attention support for ET backend (baseline f32 and f16 via matrix engine)
  • Fused RMS_NORM_ADD operator
  • Cache-line-oriented parallelization (instead of row-oriented) for el_map, get/set_rows, and the set operator
  • Barrier semantics support
  • Shire-wide SOFT_MAX parallelization across single row
  • Multi ET device support
  • Full GLU operator support
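
The cache-line-oriented parallelization listed above can be illustrated with a minimal sketch (names like `CACHE_LINE` and `chunks_for_thread` are mine, not the PR's API): instead of assigning whole rows to threads, the flat byte range is walked in cache-line strides so each thread touches disjoint cache lines and no two threads false-share a line.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative only: round-robin assignment of cache-line-sized chunks of a
// flat buffer to threads. Each thread gets every nthreads-th cache line, so
// threads never write into the same cache line (avoiding false sharing).
constexpr size_t CACHE_LINE = 64;

// Returns the [begin, end) byte ranges handled by thread `tid` of `nthreads`.
static std::vector<std::pair<size_t, size_t>> chunks_for_thread(
        size_t nbytes, size_t tid, size_t nthreads) {
    std::vector<std::pair<size_t, size_t>> out;
    for (size_t off = tid * CACHE_LINE; off < nbytes; off += nthreads * CACHE_LINE) {
        out.emplace_back(off, std::min(off + CACHE_LINE, nbytes));
    }
    return out;
}
```

For element-wise operators such as el_map this keeps per-thread work contiguous within each chunk while still balancing load across threads, even when rows are much smaller or larger than a cache line.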

General GGML fix:

  • Enforce that views live on the same backend as their backing tensor (fixes TTS on the ET backend)
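
The view-placement constraint above can be sketched with simplified stand-in structs (these are not the real ggml types): a tensor that views another tensor must resolve to the same backing buffer as its source.

```cpp
#include <cstddef>

// Simplified stand-ins for illustration; the real types are ggml_tensor and
// ggml_backend_buffer.
struct buffer_t { int id; };
struct tensor_t {
    buffer_t * buffer;
    tensor_t * view_src; // non-null if this tensor is a view of another
};

// A non-view tensor is always fine; a view must share its source's buffer.
static bool view_placement_ok(const tensor_t & t) {
    if (t.view_src == nullptr) {
        return true;
    }
    return t.buffer == t.view_src->buffer;
}
```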

Practically:

  • LLaMA 3.2 1B performance uplift 14 -> 17.4 tok/s
  • New model family support (verified with smaller versions): Qwen 3.5, Gemma 3/4, RWKV v7, Mamba 1
  • TTS models supported

marty1885 and others added 30 commits March 29, 2026 13:38
* hex-fa: add simple dma cache for Mask

I noticed that we were refetching the mask rows over and over.
This simple cache avoids that.
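
A single-entry cache like the one described can be sketched as follows (the names `mask_row_cache`, `get`, and the simulated-DMA copy are illustrative, not the PR's actual code): before issuing a fetch, check whether the requested row is already resident.

```cpp
#include <vector>

// Illustrative single-entry cache for DMA-fetched mask rows. A repeated
// request for the same row is served from the cached copy instead of
// triggering another transfer.
struct mask_row_cache {
    int cached_row = -1;              // row index currently held, -1 = empty
    std::vector<float> data;
    int fetches = 0;                  // counts simulated DMA transfers

    const std::vector<float> & get(const std::vector<std::vector<float>> & mask, int row) {
        if (row != cached_row) {
            data = mask[row];         // stands in for the DMA transfer
            cached_row = row;
            fetches++;
        }
        return data;
    }
};
```

Even one entry helps here because flash-attention inner loops tend to reuse the same mask row across consecutive tiles.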

* hex-dma: unset in-order desc bit which caused significant perf regression

We don't rely on true in order processing of the DMA descriptors anywhere.
Turns out this mode caused significant regression of around 3-4 TPS during token gen.
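
As a hedged illustration of what "unset the in-order desc bit" means (the real Hexagon DMA descriptor layout is not shown in this PR, so `DESC_ORDER_BIT` and its position are assumptions): the fix amounts to clearing one flag in the descriptor control word.

```cpp
#include <cstdint>

// Assumed bit position, for illustration only; consult the Hexagon DMA
// documentation for the real descriptor layout.
constexpr uint32_t DESC_ORDER_BIT = 1u << 30;

// Clear the hypothetical in-order-completion flag so the engine may retire
// descriptors out of order.
static uint32_t clear_in_order(uint32_t desc_word) {
    return desc_word & ~DESC_ORDER_BIT;
}
```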

* hex-rope: update comment to clarify that we don't need in-order DMA completions
* Optimize MOE GEMV kernel for BS > 1.

The previous MoE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst) with very little work per block: a block of (32, 4) threads computed the inner dot product for a single row.

The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case, with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared-memory sync).

This change doesn't increase compilation time, as only a single template instance is needed per type. It also simplifies the original GEMV kernel and removes the `is_multi_token_id` specialization.
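
The launch-geometry change above can be made concrete with a small sketch comparing block counts under the two schemes (using the dimensions from the commit message; `rpb` is rows per block, which the text says is two rows per warp):

```cpp
#include <cstdint>

// Illustrative only: compares the number of thread blocks launched by the
// old MoE GEMV geometry (nrows_x, nchannels_dst, ncols_dst) with the new
// one (ceildiv(nrows_x, rpb), nchannels_dst).
static uint64_t ceildiv(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

static uint64_t blocks_old(uint64_t nrows_x, uint64_t nchannels_dst, uint64_t ncols_dst) {
    return nrows_x * nchannels_dst * ncols_dst;
}

static uint64_t blocks_new(uint64_t nrows_x, uint64_t nchannels_dst, uint64_t rpb) {
    return ceildiv(nrows_x, rpb) * nchannels_dst;
}
```

For example, with nrows_x = 4096, nchannels_dst = 8, ncols_dst = 4, and rpb = 2, the old scheme launches 131072 blocks while the new one launches 16384, each block doing correspondingly more work.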

* Remove em-dashes

* Cherry-pick changes from @am17an PR ggml-org#20885 to enable small_k optimization only for cases where it benefits

Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8

* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* server: wrap headers for mcp proxy

* Update tools/server/server-cors-proxy.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix build

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* fix incorrect type ignore comments

* bump ty to 0.0.26
…l-org#20978)

* llama-model-loader: use pinned memory for tensor overrides

* change to warning
* fix: Branching logic + small refactor

* chore: update webui build output
When RPC is running with a remote backend that doesn't implement the init_tensor
function (like CPU and Metal), the server log fills up with error messages
saying that init_tensor is being called with a null buffer, which is
incorrect. This patch fixes that.
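
A minimal sketch of the guard described above (the struct and function names here are illustrative, not the actual RPC server code): skip the call when the backend did not provide an init_tensor hook, and treat that as a no-op rather than an error.

```cpp
#include <cstddef>

// Simplified stand-in for a backend buffer interface whose init_tensor hook
// is optional (CPU and Metal leave it null).
struct backend_buffer_iface {
    void (*init_tensor)(void * tensor); // may be nullptr
};

// Returns true if init_tensor ran, false if it was absent (not an error).
static bool maybe_init_tensor(const backend_buffer_iface & iface, void * tensor) {
    if (iface.init_tensor == nullptr) {
        return false;
    }
    iface.init_tensor(tensor);
    return true;
}
```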
…l-org#21181)

* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.

Fixes ggml-org#21162

* Reduce nrows in test case to 256, don't need 768
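
The off-by-one above is easy to see numerically: a segmented sort's offset array has nrows + 1 entries (the extra one is the past-the-end sentinel), so the number of blocks needed to initialize it must be computed from nrows + 1. A minimal sketch:

```cpp
#include <cstddef>

// ceildiv(nrows, block_size) vs ceildiv(nrows + 1, block_size): when
// nrows % block_size == 0 the two differ by exactly one block, and the
// buggy sizing leaves offsets[nrows] uninitialized.
static size_t ceildiv(size_t a, size_t b) { return (a + b - 1) / b; }
```

With nrows = 256 and block_size = 256, the buggy formula launches 1 block (covering entries 0..255) while the correct one launches 2, so entry 256, the sentinel, is only written under the fixed sizing.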