
ET backend generalization and performance uplift#8

Open
marty1885 wants to merge 455 commits into aifoundry-org:et from marty1885:backend-dev-2

Conversation


@marty1885 marty1885 commented Apr 15, 2026

This PR contains some general changes:

  • Sync with upstream llama.cpp

And ET backend-specific changes:

  • Optimization for MUL_MAT F32 @ Q8_0
  • Added support for WKV6, WKV7, IM2COL, GATED_DELTA_NET, SSM_CONV, SSM_SCAN, GROUP_NORM, RMS_NORM, NORM, SQR, SOLVE_TRI, FILL, TRI, PAD operators
  • Flash Attention support for ET backend (baseline f32 and f16 via matrix engine)
  • Fused RMS_NORM_ADD operator
  • Cache-line-oriented parallelization (instead of row-oriented) for el_map, get/set_rows, and the set operator
  • Barrier semantics support
  • Shire-wide SOFT_MAX parallelization across single row
  • Multi ET device support
  • Full GLU operator support
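
The cache-line-oriented parallelization listed above can be illustrated with a minimal sketch (names like `CACHE_LINE` and `chunks_for_thread` are mine, not the PR's API): instead of assigning whole rows to threads, the flat byte range is walked in cache-line strides so each thread touches disjoint cache lines and no two threads false-share a line.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Illustrative only: round-robin assignment of cache-line-sized chunks of a
// flat buffer to threads. Each thread gets every nthreads-th cache line, so
// threads never write into the same cache line (avoiding false sharing).
constexpr size_t CACHE_LINE = 64;

// Returns the [begin, end) byte ranges handled by thread `tid` of `nthreads`.
static std::vector<std::pair<size_t, size_t>> chunks_for_thread(
        size_t nbytes, size_t tid, size_t nthreads) {
    std::vector<std::pair<size_t, size_t>> out;
    for (size_t off = tid * CACHE_LINE; off < nbytes; off += nthreads * CACHE_LINE) {
        out.emplace_back(off, std::min(off + CACHE_LINE, nbytes));
    }
    return out;
}
```

For element-wise operators such as el_map this keeps per-thread work contiguous within each chunk while still balancing load across threads, even when rows are much smaller or larger than a cache line.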

General GGML fix:

  • Enforce that views live on the same backend as their backing tensor (fixes TTS on the ET backend)
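
The view-placement constraint above can be sketched with simplified stand-in structs (these are not the real ggml types): a tensor that views another tensor must resolve to the same backing buffer as its source.

```cpp
#include <cstddef>

// Simplified stand-ins for illustration; the real types are ggml_tensor and
// ggml_backend_buffer.
struct buffer_t { int id; };
struct tensor_t {
    buffer_t * buffer;
    tensor_t * view_src; // non-null if this tensor is a view of another
};

// A non-view tensor is always fine; a view must share its source's buffer.
static bool view_placement_ok(const tensor_t & t) {
    if (t.view_src == nullptr) {
        return true;
    }
    return t.buffer == t.view_src->buffer;
}
```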

Practically:

  • LLaMA 3.2 1B performance uplift 14 -> 17.4 tok/s
  • New model family support (verified with smaller versions): Qwen 3.5, Gemma 3/4, RWKV v7, Mamba 1
  • TTS models supported

marty1885 and others added 30 commits March 29, 2026 13:38
* hex-fa: add simple dma cache for Mask

I noticed that we were refetching the mask rows over and over.
This simple cache avoids that.
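
A single-entry cache like the one described can be sketched as follows (the names `mask_row_cache`, `get`, and the simulated-DMA copy are illustrative, not the PR's actual code): before issuing a fetch, check whether the requested row is already resident.

```cpp
#include <vector>

// Illustrative single-entry cache for DMA-fetched mask rows. A repeated
// request for the same row is served from the cached copy instead of
// triggering another transfer.
struct mask_row_cache {
    int cached_row = -1;              // row index currently held, -1 = empty
    std::vector<float> data;
    int fetches = 0;                  // counts simulated DMA transfers

    const std::vector<float> & get(const std::vector<std::vector<float>> & mask, int row) {
        if (row != cached_row) {
            data = mask[row];         // stands in for the DMA transfer
            cached_row = row;
            fetches++;
        }
        return data;
    }
};
```

Even one entry helps here because flash-attention inner loops tend to reuse the same mask row across consecutive tiles.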

* hex-dma: unset in-order desc bit which caused significant perf regression

We don't rely on true in order processing of the DMA descriptors anywhere.
Turns out this mode caused significant regression of around 3-4 TPS during token gen.
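
As a hedged illustration of what "unset the in-order desc bit" means (the real Hexagon DMA descriptor layout is not shown in this PR, so `DESC_ORDER_BIT` and its position are assumptions): the fix amounts to clearing one flag in the descriptor control word.

```cpp
#include <cstdint>

// Assumed bit position, for illustration only; consult the Hexagon DMA
// documentation for the real descriptor layout.
constexpr uint32_t DESC_ORDER_BIT = 1u << 30;

// Clear the hypothetical in-order-completion flag so the engine may retire
// descriptors out of order.
static uint32_t clear_in_order(uint32_t desc_word) {
    return desc_word & ~DESC_ORDER_BIT;
}
```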

* hex-rope: update comment to clarify that we don't need in-order DMA completions
* Optimize MOE GEMV kernel for BS > 1.

The previous MoE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst) with very little work per block: a block of (32, 4) threads computed the inner dot product for a single row.

The new mul_mat_vec_q_moe kernel is dedicated to the MoE multi-token case, with grid (ceil(nrows_x/rpb), nchannels_dst) and block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared-memory sync).

This change doesn't increase compilation time, as only a single template instance is needed per type. It also simplifies the original GEMV kernel and removes the `is_multi_token_id` specialization.
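
The launch-geometry change above can be made concrete with a small sketch comparing block counts under the two schemes (using the dimensions from the commit message; `rpb` is rows per block, which the text says is two rows per warp):

```cpp
#include <cstdint>

// Illustrative only: compares the number of thread blocks launched by the
// old MoE GEMV geometry (nrows_x, nchannels_dst, ncols_dst) with the new
// one (ceildiv(nrows_x, rpb), nchannels_dst).
static uint64_t ceildiv(uint64_t a, uint64_t b) { return (a + b - 1) / b; }

static uint64_t blocks_old(uint64_t nrows_x, uint64_t nchannels_dst, uint64_t ncols_dst) {
    return nrows_x * nchannels_dst * ncols_dst;
}

static uint64_t blocks_new(uint64_t nrows_x, uint64_t nchannels_dst, uint64_t rpb) {
    return ceildiv(nrows_x, rpb) * nchannels_dst;
}
```

For example, with nrows_x = 4096, nchannels_dst = 8, ncols_dst = 4, and rpb = 2, the old scheme launches 131072 blocks while the new one launches 16384, each block doing correspondingly more work.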

* Remove em-dashes

* Cherry-pick changes from @am17an PR ggml-org#20885 to enable small_k optimization only for cases where it benefits

Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8

* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* server: wrap headers for mcp proxy

* Update tools/server/server-cors-proxy.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix build

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* fix incorrect type ignore comments

* bump ty to 0.0.26
…l-org#20978)

* llama-model-loader: use pinned memory for tensor overrides

* change to warning
* fix: Branching logic + small refactor

* chore: update webui build output
When RPC is running with a remote backend that doesn't implement the init_tensor
function (like CPU and Metal), the server log fills up with error messages
saying that init_tensor is being called with a null buffer, which is
incorrect. This patch fixes that.
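
A minimal sketch of the guard described above (the struct and function names here are illustrative, not the actual RPC server code): skip the call when the backend did not provide an init_tensor hook, and treat that as a no-op rather than an error.

```cpp
#include <cstddef>

// Simplified stand-in for a backend buffer interface whose init_tensor hook
// is optional (CPU and Metal leave it null).
struct backend_buffer_iface {
    void (*init_tensor)(void * tensor); // may be nullptr
};

// Returns true if init_tensor ran, false if it was absent (not an error).
static bool maybe_init_tensor(const backend_buffer_iface & iface, void * tensor) {
    if (iface.init_tensor == nullptr) {
        return false;
    }
    iface.init_tensor(tensor);
    return true;
}
```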
…l-org#21181)

* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1

We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.

Fixes ggml-org#21162

* Reduce nrows in test case to 256, don't need 768
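
The off-by-one above is easy to see numerically: a segmented sort's offset array has nrows + 1 entries (the extra one is the past-the-end sentinel), so the number of blocks needed to initialize it must be computed from nrows + 1. A minimal sketch:

```cpp
#include <cstddef>

// ceildiv(nrows, block_size) vs ceildiv(nrows + 1, block_size): when
// nrows % block_size == 0 the two differ by exactly one block, and the
// buggy sizing leaves offsets[nrows] uninitialized.
static size_t ceildiv(size_t a, size_t b) { return (a + b - 1) / b; }
```

With nrows = 256 and block_size = 256, the buggy formula launches 1 block (covering entries 0..255) while the correct one launches 2, so entry 256, the sentinel, is only written under the fixed sizing.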