Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
193 commits
Select commit Hold shift + click to select a range
edb1a11
feat(paged): vLLM-parity KV block manager (Phase 0, CPU-first prototype)
mudler Jun 19, 2026
c6698dd
feat(paged): Phase 1 - ggml paged write/gather mechanism (CPU)
mudler Jun 19, 2026
5a5d3df
feat(paged): Phase 2 core - attention over paged KV matches reference
mudler Jun 19, 2026
ddace5f
feat(paged): paged-bench - measure capacity & prefix-sharing wins
mudler Jun 19, 2026
3ed3279
docs(paged): status + integration map for in-model Gate 0
mudler Jun 19, 2026
bbc84a9
feat(paged): Gate 0 in-model - token-identical generation with paged …
mudler Jun 19, 2026
7aa61d4
docs(paged): DGX Blackwell gap analysis + lever plan (living doc)
mudler Jun 19, 2026
aba0bfd
feat(backend): auto-default physical batch to 2048 on Blackwell GPUs
mudler Jun 19, 2026
9f16a90
docs(paged): Lever 3 profiled + Q4/MXFP4 findings, auto-ubatch shipped
mudler Jun 19, 2026
1449b80
docs(paged): Lever-3 + paged-attention implementation plans + upstrea…
mudler Jun 19, 2026
b142146
docs(paged): Lever-3 phase-1 nwarps tweak = dead end (constants coupled)
mudler Jun 19, 2026
62f0ae1
docs(paged): upstream survey - no FP4 MoE GEMM to patch in; phase 3 i…
mudler Jun 19, 2026
ba3fa5a
build(paged): stacking patch-series scaffolding for llama.cpp paged a…
mudler Jun 19, 2026
ce48cc0
patch(paged) 0001: vendor PagedKVManager into llama.cpp src
mudler Jun 19, 2026
61ff738
patch(paged) 0002: LLAMA_KV_PAGED block placement, Gate 0 token-ident…
mudler Jun 19, 2026
c4b4f3a
docs(paged): series status 0001/0002 done+verified; honest parity note
mudler Jun 19, 2026
145e45b
docs(paged): exact executable plan for 0003 gather-read
mudler Jun 19, 2026
48fbb93
docs(paged): refine 0003 plan - used-cell gather, per-ubatch rebuild,…
mudler Jun 19, 2026
2a500c3
bench(paged): fresh GB10 head-to-head vs vLLM - two distinct gaps
mudler Jun 19, 2026
cb28ded
bench(paged): decode profile overturns 'engine-addressable' - decode …
mudler Jun 19, 2026
b7b2e82
kernel(fp4-grouped-moe): scaffold the FP4 grouped-GEMM MoE dispatch (…
mudler Jun 19, 2026
37cbc08
bench(dense): Qwen3-32B dense parity - dense has the kernel gap too (…
mudler Jun 20, 2026
ce60737
kernel(doc): dense scope resolved - two FP4 kernels (dense first, the…
mudler Jun 20, 2026
19742ae
bench(dense): FORCE_CUBLAS no-op for dense too (720.8 vs 721.8) - eve…
mudler Jun 20, 2026
d2651c8
bench(dense): root-cause the W4A4 NVFP4 hang; W4A16 vs Q4 is the head…
mudler Jun 20, 2026
f5e9cae
kernel: reframed Blackwell kernel-gap map (research + profiles)
mudler Jun 20, 2026
14e3da2
kernel: dense MXFP4 test = free 1.44x (765->1153) but FP4-MMA untuned…
mudler Jun 20, 2026
122df1c
analysis: vLLM throughput gap decomposed - spec-dec is the per-user l…
mudler Jun 20, 2026
76cc0b6
docs(paged): phased plan to make llama.cpp a viable vLLM alternative
mudler Jun 20, 2026
13e6ee8
kernel: validate cuBLAS dead-end (sm_80 fallback) + W4A16 Marlin impl…
mudler Jun 20, 2026
dae2679
kernel(P0): parity harness established + baseline (test-backend-ops 1…
mudler Jun 20, 2026
d291e15
kernel(P0): record precise op-level baseline (q4_K n=512 = 47 TFLOPS,…
mudler Jun 20, 2026
718b31d
kernel(P1): W4A16 dispatch seam (gated, byte-identical fallback to MMQ)
mudler Jun 20, 2026
9a71e81
kernel: written subagent dispatch briefs for P3/P4/P5
mudler Jun 20, 2026
4de0c3b
feat(cuda): W4A16 P2 correctness-first BF16 GEMM kernel
mudler Jun 20, 2026
9973fa9
feat(w4a16): P3 step 1 - block-tiled multi-warp Marlin GEMM (GB10)
mudler Jun 20, 2026
2f648dc
feat(w4a16): conflict-free skew-pad ldmatrix + BM128/8w tile (q4_K +2…
mudler Jun 21, 2026
2b79083
feat(w4a16): grow tile to BN128/16w (q4_K +17%, pp512 148->178)
mudler Jun 21, 2026
fc589b3
analysis: vLLM GB10 advantage is the SCHEDULER, not the kernel (pivot)
mudler Jun 21, 2026
07985ba
analysis: measured llama.cpp aggregate vs vLLM - already ~75-80% at n…
mudler Jun 21, 2026
fdb7f56
docs(llama-cpp): scope chunked prefill + n_batch/n_ubatch decouple
mudler Jun 21, 2026
92e93df
analysis: paged KV gives ZERO benefit on GB10 (measured) - not the lever
mudler Jun 21, 2026
d6c91b7
analysis: finalize PR #22569 paged-KV eval (full detail + compute-bou…
mudler Jun 21, 2026
40ee9cd
docs(paged): evaluate llama.cpp PR #17004 (GPU/backend sampling) on GB10
mudler Jun 21, 2026
1887385
analysis: MXFP4-dense fails quality check (~27% worse PPL than Q4_K) …
mudler Jun 21, 2026
037ad82
docs(paged): MXFP4-dense vs Q4_K quality gate on GB10 (do not recommend)
mudler Jun 21, 2026
aaf7b41
test(llama-cpp): NVFP4-dense FP4 quality+speed eval on GB10
mudler Jun 21, 2026
6e0b910
analysis: decode gap is GPU/kernel-bound, NOT host overhead (corrects…
mudler Jun 21, 2026
faeb5b4
analysis: NVFP4 closes the decode gap too (547->619, ~93% of vLLM)
mudler Jun 21, 2026
0337505
docs(paged): measure paged KV at high concurrency (LLAMA_MAX_SEQ=2048…
mudler Jun 21, 2026
931793a
feat(paged): target-readiness for 2xH200 - correctness PASS, load-gen…
mudler Jun 21, 2026
84d59e6
docs(paged): additive "hook, don't edit" layout for the patch series
mudler Jun 22, 2026
d9d846e
feat(paged): patch 0003 gather-read - Gate 0 green, token-identical, …
mudler Jun 22, 2026
37e0e1e
paged-attn 0003: lift gather-read to multi-stream
mudler Jun 22, 2026
4968cd8
paged-attn 0004: on-demand KV block allocation
mudler Jun 22, 2026
04e3d04
build(llama-cpp): isolate paged patches in patches/paged/ behind LLAM…
mudler Jun 22, 2026
667a21c
feat(llama-cpp): expose paged KV cache as a per-server option (patch …
mudler Jun 22, 2026
67c6208
feat(llama-cpp/paged): cross-request prefix caching patch 0006
mudler Jun 22, 2026
ecffd4b
feat(llama-cpp/paged): engine-level prefix recompute-skip (patch 0007)
mudler Jun 22, 2026
d1ba327
docs(paged): record GPU correctness + CUDA backend-build verification
mudler Jun 22, 2026
9537726
fix(llama-cpp/paged): stop double-applying the paged patches in prepa…
mudler Jun 22, 2026
0dd45f0
docs(llama-cpp/paged): GPU 0007 re-run + shared-prefix benchmark results
mudler Jun 22, 2026
f347f7c
docs(paged): stock GPU batch-shape determinism + vLLM shared-prefix c…
mudler Jun 22, 2026
52f0f7b
docs(paged): apples-to-apples paged llama.cpp vs vLLM (batched+NVFP4+…
mudler Jun 22, 2026
80e0c1a
feat(paged): wire cross-request prefix share into llama-server (patch…
mudler Jun 22, 2026
4dcbcfc
docs(paged): decode-step gap study vs vLLM on GB10
mudler Jun 22, 2026
ee13a94
paged: in-kernel decode read patch 0009 (kill the gather regression)
mudler Jun 22, 2026
2c5adda
feat(paged): tile in-kernel decode read + dispatch guard (patch 0010)
mudler Jun 22, 2026
e983919
feat(paged): route GQA-grouped tile kernel by default for paged decod…
mudler Jun 22, 2026
ba6bd94
feat(paged): assert mask-pad invariant for the paged tile route (patc…
mudler Jun 23, 2026
4bc2b4a
feat(paged): add patch 0013 decoupled per-step prefill-token budget
mudler Jun 23, 2026
dd6a442
feat(llama-cpp): per-model max_prefill_tokens option (chunked-prefill…
mudler Jun 23, 2026
a3abd60
docs(paged): GB10 head-to-head server sweep (llama-server vs vLLM)
mudler Jun 23, 2026
8925c00
docs(paged): scope durable grouped FP4-MMA MoE GEMM port for GB10
mudler Jun 23, 2026
010067d
feat(paged): mirror patch 0014 - expert-aware MoE token-tile cap
mudler Jun 23, 2026
acb22a6
feat(paged): mirror MoE token-tile density-aware auto-select (patch 0…
mudler Jun 23, 2026
ee78ae4
docs(paged): Qwen3.6 NVFP4 h2h bench doc - MoE llama.cpp table
mudler Jun 23, 2026
2975a74
docs(paged): Qwen3.6 NVFP4 apples-to-apples scorecard (llama vs vLLM,…
mudler Jun 23, 2026
c8b1f16
docs(paged): dense NVFP4 fair re-run with max_prefill_tokens budget s…
mudler Jun 23, 2026
c7075fb
docs(paged): MoE 35B-A3B NVFP4 fair re-run with max_prefill_tokens bu…
mudler Jun 23, 2026
362eea9
docs(paged): fair re-run verdict - synthesize NVFP4 llama vs vLLM sco…
mudler Jun 23, 2026
ed17fc8
docs(paged): scope token-granular continuous-batch scheduler for llam…
mudler Jun 23, 2026
5a38dd3
docs(paged): adversarial review of the continuous-batch scheduler scope
mudler Jun 23, 2026
fccbb40
docs(paged): ground vLLM 0.23.0 eager-decode architecture vs llama.cpp
mudler Jun 24, 2026
24ce7d0
feat(llama-cpp/paged): dynamic decode-first prefill budget (patch 001…
mudler Jun 24, 2026
f7500df
docs(paged): staggered-arrival evaluation of patch 0016 dynamic budget
mudler Jun 24, 2026
e4c6317
docs(paged): verify llama.cpp GDN decode is O(1)-in-context, not a 2.…
mudler Jun 24, 2026
ea634ee
docs(paged): scope track B - FP4-MMA decode-GEMM roofline + parity go…
mudler Jun 24, 2026
c1d7f33
docs(paged): enrich track-B scope with code-level FP4-GEMM inefficien…
mudler Jun 24, 2026
7434d64
docs(paged): build-ready track-B FP4-GEMM scope - kernel decision + p…
mudler Jun 24, 2026
39e16cc
docs(paged): adversarial review of track-B FP4-GEMM parity go/no-go
mudler Jun 24, 2026
40f019e
docs(paged): mirror FP4 decode-GEMM track-B P0 gate + P1 kill-gate re…
mudler Jun 24, 2026
da67fd8
docs(paged): A.2 CUDA-graph decode lever measurement and gap diagnosis
mudler Jun 24, 2026
2dd5d68
docs(paged): A.2 Phase 2 - locate the real decode lever (gated-DeltaN…
mudler Jun 24, 2026
34cadb6
docs(paged): A.2 final synthesis - CUDA-graph decode verdict
mudler Jun 24, 2026
5ce2f1d
feat(paged): qwen35 gated-DeltaNet in-place SSM state write-back (pat…
mudler Jun 24, 2026
6f0792c
feat(paged): qwen35 SSM decode fused recurrent-state gather (patch 0019)
mudler Jun 24, 2026
ee13fd1
docs(paged): profile-both-engines post-SSM ground-truth decode decomp…
mudler Jun 25, 2026
c0e0ed3
docs(paged): synthesize decode-parity exploration - the o_proj MMVQ l…
mudler Jun 25, 2026
b895f4d
feat(paged): qwen35 gated-DeltaNet o_proj MMVQ->MMQ reshape (patch 0020)
mudler Jun 25, 2026
e597a8a
docs(paged): vLLM GDN decode = 2 fused kernels under CUDA graph vs ll…
mudler Jun 25, 2026
2b57997
docs(paged): cudagraph-coverage - GDN serial chain IS graph-covered a…
mudler Jun 25, 2026
a723852
docs(paged): decisive node-level decode timeline gap - bubbles refuted
mudler Jun 25, 2026
5825b07
docs(paged): SYNTHESIS - validated decode-parity picture, ranked plan…
mudler Jun 25, 2026
fd4332e
docs(paged): GDN recurrence byte-gate SETTLED - re-stream ~1.0x, buil…
mudler Jun 25, 2026
2a8103c
docs(paged): FINAL DECISION - NO-BUILD fused recurrence, BUILD conv f…
mudler Jun 25, 2026
1785573
docs(paged): bf16 SSM-state build plan (PART C synthesis: edits, KL g…
mudler Jun 25, 2026
5cec1a6
docs(paged): bitexact-vs-vLLM verdict + verified f32 GDN-state correc…
mudler Jun 25, 2026
8f8777e
feat(paged): qwen35 decode conv-state in-place fusion (patch 0021)
mudler Jun 25, 2026
3c1ed67
feat(paged): qwen35 gated-DeltaNet decode occupancy/coalescing retune…
mudler Jun 25, 2026
02cbae5
feat(paged): qwen35moe NVFP4 activation-quantize de-dup (patch 0023)
mudler Jun 25, 2026
64766ec
Merge branch 'master' into worktree-feat+paged-attention
mudler Jun 25, 2026
634c0e5
docs(paged): rms_norm->fp4 fold analysis - bit-exact decode ceiling a…
mudler Jun 25, 2026
24833f0
docs(paged): bf16 SSM-state NO-SHIP - fails f32 KL gate (= vLLM's own…
mudler Jun 26, 2026
7c45447
docs(paged): FUTURE_LEVERS - parked decode-parity exploration trail
mudler Jun 26, 2026
aaaa90a
bench(paged): final apples-to-apples NVFP4 decode benchmark (0023 vs …
mudler Jun 26, 2026
ae0042f
docs(paged): publish NVFP4 decode benchmark - plot-ready CSV + decode…
mudler Jun 26, 2026
7dd3431
docs(paged): promote TTFT/prefill + paged-pool burst-degradation bug …
mudler Jun 26, 2026
00f9265
docs(paged): correct vLLM recurrent-state precision (f32, not bf16)
mudler Jun 26, 2026
001d833
docs(paged): f16/bf16 glue probe - dense decode residual ceiling
mudler Jun 26, 2026
89e62fc
docs(paged): finalize f16 glue probe - cost analysis + build verdict
mudler Jun 26, 2026
b061e4a
docs(paged): OTHER_PATHS investigation - rank 4 post-0023 paths, pick…
mudler Jun 26, 2026
125d10a
feat(paged): paged-pool burst-reclaim (truncate + defrag + slot relea…
mudler Jun 26, 2026
167768c
feat(backend): llama-cpp-localai-paged variant + NVFP4 Qwen3.6 gallery
mudler Jun 26, 2026
30a2b59
Merge branch 'master' into worktree-feat+paged-attention (llama.cpp p…
mudler Jun 26, 2026
ec7c1b1
feat(paged): pin-sync patchset to llama.cpp 9d5d882d (re-export 4 pat…
mudler Jun 26, 2026
4d3fecd
docs(paged): MoE decode re-graph lever (patch 0025) + speedup-hunt B …
mudler Jun 26, 2026
6bfca14
docs(paged): speedup-hunt C section + final RANK + PLAN synthesis
mudler Jun 26, 2026
fe5bd3f
feat(paged): qwen35 hybrid per-head f32/bf16 SSM state (patch 0026)
mudler Jun 26, 2026
33dfe7f
feat(paged): qwen35 hybrid per-head f32/bf16 SSM state - carry fix + …
mudler Jun 26, 2026
1f857f1
docs(paged): B-2 down_proj act-quant retune RESULT - negative (no hea…
mudler Jun 26, 2026
9c1c2a6
docs(paged): B-3 mmq_y-down warp-remap NEGATIVE - bit-exact MoE ceili…
mudler Jun 26, 2026
b3d3323
feat(paged): wire ssm_bf16_tau model option for hybrid SSM-state fast…
mudler Jun 26, 2026
3b59571
docs(paged): both-engine MoE decode decomposition - the 15% is NOT th…
mudler Jun 26, 2026
6c6a925
docs(paged): MoE-vs-vLLM DECIDE synthesis - reject W4A16 Marlin, the …
mudler Jun 26, 2026
b1667b4
feat(paged): qwen35 recurrent-state gather fusion (patch 0028)
mudler Jun 26, 2026
bf9b4fa
feat(gallery): NVFP4-MTP Qwen3.6 entries for the LocalAI paged backend
mudler Jun 26, 2026
79edfd2
feat(gallery): -paged suffix rename + qwopus NVFP4-MTP paged variants
mudler Jun 26, 2026
6dd8a3d
docs(gallery): NVFP4 GGUFs published to mudler/ - update header note
mudler Jun 26, 2026
c1f1d1e
Merge remote-tracking branch 'origin/master' into worktree-feat+paged…
mudler Jun 26, 2026
62c407e
docs(paged): lever1 gather-fusion bench landed - checkpoint + attribu…
mudler Jun 26, 2026
9a1be79
docs(paged): lever-4 scope - NVFP4 the still-bf16 MoE GDN/attn projec…
mudler Jun 26, 2026
e3f8149
docs(paged): lever-4 KL-gate FAIL - NVFP4 MoE projections cost ~6% PP…
mudler Jun 26, 2026
9b0e4e5
docs(paged): residual-assess FINAL - MoE at bit-exact ceiling, hunt DONE
mudler Jun 27, 2026
db6ebc5
feat(paged): block-table within-step host cache (patch 0029)
mudler Jun 27, 2026
683e225
docs(paged): arch-generality audit - build-targeting (CUDA arch fan +…
mudler Jun 27, 2026
34abf39
docs(paged): ARCH audit - NVFP4 GGUF off-Blackwell portability + gall…
mudler Jun 27, 2026
5667dfe
docs(paged): arch-generality audit - optimization classification (001…
mudler Jun 27, 2026
2a2de1d
docs(paged): patch-arch-safety classification for patches 0018-0029
mudler Jun 27, 2026
87cfd1f
docs(paged): quant-generality audit - SSM/serving opts are quant-agno…
mudler Jun 27, 2026
af6e133
docs(paged): cross-arch synthesis - ship verdict + minimum non-Blackw…
mudler Jun 27, 2026
2332587
fix(gallery): scope NVFP4-paged entries to Blackwell + consistent tags
mudler Jun 27, 2026
621a20d
feat(paged): backend-gate fused GDN/discriminated SSM_CONV emission (…
mudler Jun 27, 2026
202a29f
feat(paged): Metal/darwin build availability for llama-cpp-localai-paged
mudler Jun 27, 2026
400930d
Merge remote-tracking branch 'origin/master' into worktree-feat+paged…
mudler Jun 27, 2026
e160041
chore(paged): decouple paged llama.cpp pin from the nightly auto-bumper
mudler Jun 27, 2026
2bee7a5
ci(paged): add early-warning canary for vendored llama.cpp paged patches
mudler Jun 27, 2026
7e1832b
fix(paged): strip stray dev-doc hunks so patch series applies on a cl…
mudler Jun 27, 2026
a5a5b2a
feat(paged): bump llama.cpp pin 9d5d882d -> c299a92c (bit-exact verif…
mudler Jun 27, 2026
fb2dc33
docs(paged): consolidate the dev-trail docs into one canonical README
mudler Jun 27, 2026
78fac9a
refactor(paged): stock llama-cpp is patch-free; paged backend owns it…
mudler Jun 27, 2026
4a9a1dd
docs(paged): Mac stock-vs-patched bench + Vulkan note + cross-backend…
mudler Jun 27, 2026
984c8fc
docs(paged): Layer-2 upstream scope for native fused-GDN kernels (Met…
mudler Jun 27, 2026
9115c2c
docs(paged): correct Vulkan/SYCL note (GDN op IS upstream) + CUDA-onl…
mudler Jun 27, 2026
a4e7309
feat(paged): restrict llama-cpp-localai-paged to CUDA-only build targets
mudler Jun 27, 2026
db14006
docs(agents): add paged-backend maintenance + vLLM-parity methodology…
mudler Jun 27, 2026
08b754f
chore(paged): keep patches/ patch-only; README to backend root, docs …
mudler Jun 27, 2026
53f66a6
fix(paged): revert pin to 9d5d882d (== stock); c299a92c broke grpc-se…
mudler Jun 27, 2026
ed5eb70
docs(paged): drop moot PIN_SYNC_c299a92c record, repoint to README sec 7
mudler Jun 27, 2026
3466094
docs(paged): re-measure DGX benchmarks on one harness (stock/patched/…
mudler Jun 27, 2026
266fcc7
docs(agents): fix A/B-bench gotcha - env-toggle != stock for compiled…
mudler Jun 27, 2026
1431f72
docs(paged): regenerate decode plots (3-way) from re-measured data + …
mudler Jun 27, 2026
0b84fda
docs(paged): add the bf16-tau opt-in line to the decode plots
mudler Jun 27, 2026
9bb8994
chore(paged): drop CUDA-12 variants of llama-cpp-localai-paged, keep …
mudler Jun 28, 2026
23b11a5
paged-kv-manager.h: add missing <cstddef> for size_t
mudler Jun 28, 2026
4da769c
paged headers: self-include <cstddef>/<cstdint> for size_t/uintN_t (f…
mudler Jun 28, 2026
1f3e5ba
fix(paged): serialize both SSM partitions in hybrid bf16-tau state sa…
mudler Jun 28, 2026
ea72a56
Merge origin/master + pin-sync paged backend to 0ed235ea
mudler Jun 28, 2026
c51ff4c
docs(paged): scope porting the portable benefits to Metal/SYCL/Vulkan…
mudler Jun 28, 2026
2c59805
fix(paged): rpc cmake target renamed rpc-server -> ggml-rpc-server at…
mudler Jun 28, 2026
4cd90bf
paged: drop bf16-tau (patch 0026), subsumed by decode fusions (tau=10…
mudler Jun 28, 2026
11128cb
docs(paged): scope the large-M NVFP4 prefill GEMM lever (design only)
mudler Jun 28, 2026
e610347
feat(paged): chunked parallel-scan GDN prefill kernel (patch 0031)
mudler Jun 28, 2026
9a28f23
docs(paged): scope the continuous-serving decode gap (host-bound, des…
mudler Jun 28, 2026
4bdd26a
docs(paged): scope tensor-core (mma) chunked GDN prefill kernel
mudler Jun 28, 2026
0007053
feat(paged): FP4 prefill large-M dequant->bf16 cuBLAS scaffold (patch…
mudler Jun 28, 2026
d706980
feat(paged): close the continuous-serving decode gap (S1+S3, patches …
mudler Jun 28, 2026
2fa8ef8
fix(paged): make patch 0031 apply on the 0001-0030 base; default S3 o…
mudler Jun 28, 2026
b028c81
docs(paged): record padded/fixed-slot decode shape as tested-and-reje…
mudler Jun 28, 2026
f1c98ff
fix(paged): revert S3 decode-stable scheduler to default-OFF (A/B reg…
mudler Jun 29, 2026
c4058eb
feat(paged): tail-fusion (0042) + full-step decode CUDA graph default…
mudler Jun 29, 2026
042deab
docs(paged): vLLM-parity lever map + tensor-core GDN build plan (both…
mudler Jun 29, 2026
7b38c6b
feat(paged): GDN M5 tensor-core chunked-scan prefill, default-on unde…
mudler Jun 29, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
109 changes: 109 additions & 0 deletions .agents/llama-cpp-localai-paged-backend.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,109 @@
# llama-cpp-localai-paged Backend (paged attention + Blackwell NVFP4 decode)

`llama-cpp-localai-paged` is LocalAI's **CUDA-only** paged-attention variant of the
llama.cpp backend. It targets high-concurrency decode for the Qwen3.6 hybrid
gated-DeltaNet (SSM) models on Blackwell (GB10 / DGX Spark). It reuses the stock
`llama-cpp` backend's sources and applies a vendored patch series on top at build
time. It is **not** a fork: a source-only `*.patch` stack plus one canonical doc.

**Canonical reference:** `backend/cpp/llama-cpp-localai-paged/README.md`
(architecture, the patch series 0001-0030, benchmarks, dev notes, generality,
pin/canary policy). Read it for any technical detail; this guide is the maintenance
how-to.

## Where things live

- `backend/cpp/llama-cpp-localai-paged/Makefile` - the thin wrapper. It copies the
stock `backend/cpp/llama-cpp/` build infra into a build dir, clones llama.cpp at
this backend's **own** pin (`LLAMA_VERSION`), applies the paged series via the
`apply-paged-patches` define (strict `git apply`), then builds `grpc-server`.
- `backend/cpp/llama-cpp-localai-paged/patches/paged/` - the source-only `.patch`
series (0001-0030), nothing else.
- `backend/cpp/llama-cpp-localai-paged/README.md` - the canonical doc. The
operational docs (`PAGED_BITEXACT_NOTE.md`, `UPSTREAM_LAYER2_SCOPE.md`) and
dev artifacts live in
`backend/cpp/llama-cpp-localai-paged/docs/`.
- `backend/Dockerfile.llama-cpp-localai-paged`, `.docker/llama-cpp-localai-paged-compile.sh`
- the CUDA build entry points.
- `backend/cpp/llama-cpp/` - the **stock** backend, pure upstream. It carries no
paged patches.

## Invariants (do not break these)

- **Stock stays pure.** The paged patches live ONLY in this backend. Never add a
`patches/paged/` dir or `LLAMA_PAGED` logic to `backend/cpp/llama-cpp/`.
- **CUDA-only.** Ship cublas/cuda targets only. Off-CUDA the fusions are gated off
(patch 0030) and NVFP4 falls back to dequant, so the backend is neutral-to-
slightly-negative there - non-CUDA users use the stock `llama-cpp`. Do not add
cpu/vulkan/sycl/metal rows for this backend in `.github/backend-matrix.yml`.
(Those builds also fail to link `grpc-server` on darwin/arm64 against upstream
`stream_*` server symbols - another reason it is CUDA-only.)
- **Source-only patches.** A `.patch` may touch only llama.cpp source - never a
dev doc or `*.md`. Strict `git apply` on a clean checkout must reach exit 0. (A
stray `SSM_DECODE_FIX_RESULTS.md` hunk in patch 0019 once broke the CI build.)
- **Bit-exact by default.** Every shipped patch is byte-identical to the f32
baseline. (The one opt-in precision trade, `ssm_bf16_tau` / patch 0026, was
DROPPED: it went flat once the decode fusions landed - forcing all gated-DeltaNet
heads to bf16 gave 780.6 vs 780.0 t/s, zero benefit - so the series is now
bit-exact end to end. Do not reintroduce a per-head SSM-precision lever; see the
rejected-levers note in the backend README section 5.)

## Maintaining the pin against new llama.cpp

The pin (`LLAMA_VERSION` in the wrapper Makefile) is advanced ONLY by the manual
pin-sync. It is deliberately **excluded from the nightly auto-bumper**
(`bump_deps.yaml`): a naive bump would shift the tree out from under the patches
and break `git apply` at build time.

1. **The canary tells you when to sync.** `.github/workflows/llama-cpp-paged-canary.yml`
runs weekly: it applies + builds the series against the latest upstream tip and
goes **red** when upstream drifts past the patches. Canary red -> run a pin-sync.
2. **The pin-sync** (recorded in the README section 7 and git history): rebase the series onto the new
tip (resolve conflicts; re-export **source-only** with a pathspec like
`-- src/ ggml/ common/ include/ tools/ tests/ cmake/`), rebuild on a CUDA box,
pass the bit-exact gate on **every** path + `test-backend-ops`, **and confirm
the full grpc-server build/link is green on CI**, then bump `LLAMA_VERSION`.

**Hard constraint: keep the pin == the stock `llama-cpp` pin.** `grpc-server.cpp`
is shared with the stock backend and tracks the stock pin. A paged pin that
diverges PAST an upstream server-API refactor breaks the grpc-server LINK even
when the patches are byte-for-byte bit-exact - the bit-exact gate alone does NOT
catch it. The `c299a92c` bump did exactly this (patches applied + greedy-md5
bit-exact, but `grpc-server.cpp` failed to link with undefined `stream_*` server
helpers the refactor pulled into its headers), so it was reverted to `9d5d882d`.
A pin bump is shippable only once the full CI grpc-server build is green, which in
practice means moving in lockstep with the stock pin (or vendoring a
pin-matched grpc-server.cpp, which we deliberately do not, to keep stock pure).

## The bit-exact gate (run for every change)

- greedy md5: `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1 </dev/null | md5sum`,
paged paths prefixed `LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for paged
MoE). Must match the recorded baseline. Redirect stdin from `/dev/null` or
`llama-completion` hangs in conversation mode.
- `test-backend-ops` (CUDA0 vs CPU oracle) for every touched op (`SSM_CONV*`,
`GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`).
- **The gate is per-path.** The paged-MoE md5 differs from the non-paged md5 - a
benign, KL-validated FP-accumulation-order difference (see `docs/PAGED_BITEXACT_NOTE.md`).
Compare a paged-MoE change to the **paged** reference, not the non-paged one.

## Encapsulating your work

- When you change a patch, regenerate the `.patch` (source-only) and keep the dev
tree and this worktree byte-identical. Commit both with sign-off.
- New optimization -> next patch number (gaps 0005/0027 are intentional). Update
the README's patch table and dev notes - keep the README the single doc; do not
scatter `*_RESULTS.md` files.
- Record rejected/flat levers in the README too (they stop the next person from
re-running dead ends).

## Follow-ups (Metal / SYCL / Vulkan)

The decode fusions are implemented for **CUDA + CPU only**. The base
gated-DeltaNet + SSM_CONV ops already exist upstream on Metal, SYCL, and Vulkan,
so the models **run** there via the non-fused path - what is missing is the
fusion speedup. Porting it (strictly mirroring the CUDA kernels, since we have no
Metal/SYCL/Vulkan hardware to test on here) is scoped in `docs/UPSTREAM_LAYER2_SCOPE.md`
(recommended order: Metal, then SYCL, then Vulkan; ops-first upstream PR, then one
PR per backend, each gated by `test-backend-ops` on the target hardware). The
methodology for that work is in [.agents/vllm-parity-methodology.md](vllm-parity-methodology.md).
101 changes: 101 additions & 0 deletions .agents/vllm-parity-methodology.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,101 @@
# Methodology: Closing the vLLM Decode-Throughput Gap in llama.cpp

This is the playbook that took the paged backend
([.agents/llama-cpp-localai-paged-backend.md](llama-cpp-localai-paged-backend.md))
from ~38% of vLLM decode to **parity-to-ahead on dense** (and a proven, honest
ceiling on MoE) on GB10. Use it for any "make llama.cpp match or beat engine X on
accelerator Y" effort. The *levers* are model- and hardware-specific; the
*discipline* below is not. The worked example, with all numbers, is the paged
backend README.

## The core loop

1. **Establish a bit-exact baseline and gate FIRST.** Record the greedy md5 (per
path) and an f32 reference. Every optimization must stay byte-identical to it -
or ship as an explicit, default-off precision opt-in. This is what lets you
optimize aggressively without silently regressing quality. Gate two ways:
greedy md5, and `test-backend-ops` against the CPU oracle.

2. **Profile - do not assume.** nsys the steady-state decode step, broken down per
*kernel* AND per *memcpy*. Find the dominant cost. "It's the GEMM" was wrong
here: on hybrid gated-DeltaNet models the bottleneck was the recurrent-state
**plumbing** (state memcpy + gathers, ~67% of the step), not the weight GEMM.
Also sanity-check GPU-busy %: an early "low utilization" reading was a profiling
window artifact (decode was 96-99% GPU-busy), not real idle.

3. **Ground-truth BOTH engines.** Decompose *your* decode step AND the
competitor's, side by side, per bucket, and compute the per-bucket delta. This
tells you WHERE the gap actually is - not where you would guess. It overturned
premises here: e.g. vLLM does NOT run the GDN/attn projections as NVFP4 (it
keeps them bf16, same as us); the MoE expert GEMM was a llama *win*, not the gap.

4. **Per-lever discipline.** For each candidate: implement -> bit-exact gate ->
same-harness A/B bench. Use a runtime env-toggle (flag off vs on) ONLY for
levers that are actually runtime-gated; a lever **compiled into** the binary
(e.g. the SSM decode fusions here) is NOT isolated by a runtime flag, so measure
it build-vs-build. The full-patchset "stock" baseline likewise needs a
**separately-built unpatched binary at the same pin** - toggling the runtime
flag on the patched binary does not reproduce stock (it measures only the gated
part; here that was ~neutral, which is exactly how this gotcha hides). Bank only
what lifts AND gates. **Record every rejected or flat lever with the reason** -
over time this is the most valuable part: it stops the next person re-running
dead ends.

5. **Name the structural floor.** Prove the bit-exact ceiling exhaustively (every
lever measured, not assumed). What remains is physical - the memory-bandwidth
floor, the irreducible serial-SSM host loop (sampling can't start until logits
land). Name it; do not claim more than you measured.

## Hard rules learned

- **Apples-to-apples, or label it.** Stock-vs-patched on the SAME harness
(`llama-batched-bench`) is exact - lead with it. But "stock" must be a
separately-built unpatched binary at the SAME pin, NOT the patched binary with
the runtime flag off (compiled-in wins survive the toggle). Cross-engine "% of vLLM"
(batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
and config (context length alone shifted the MoE figure 76% <-> 86%).
- **Re-measure a "win" after later levers land - it may evaporate.** bf16 SSM
state (the `ssm_bf16_tau` lever) benched +12% early and failed the f32 KL gate
(vLLM keeps f32 too), so it was kept default-off opt-in. Once the decode fusions
(recurrent-state gather-fusion + block-table cache) landed, a clean re-measure
forcing ALL gated-DeltaNet heads to bf16 (`tau=100000`) went **flat** - 780.6 vs
780.0 t/s. The "+12%" was subsumed by the fusions: the lever bought nothing, so
it was **dropped** (precision trade + bug surface + extra CUDA template-instantiation
compile cost, zero benefit). A win measured before the rest of the series is not a
win after it.
- **Reject the obvious-but-wrong, with evidence.** A faster kernel that is off the
critical path benches FLAT (the freed time becomes idle). Quantizing the bf16
projections to NVFP4 cost ~6% PPL - and vLLM keeps them bf16 for the same reason.
Always measure before believing; a plausible mechanism is not a result.
- **The gate can be per-path.** Paged vs non-paged attention legitimately produces
different (equivalent) FP-reduction orders; validate the difference is benign
(KLD to f32) and then gate each path against its own reference.

## Orchestration (multi-agent)

- **One GPU profiler/bencher at a time** (the GPU-contention rule). Parallel
design/analysis/read agents are fine; concurrent GPU benches pollute each other's
numbers.
- **Adversarial verify.** Before banking a finding, spawn skeptics prompted to
*refute* it; majority-refute kills it. Prevents plausible-but-wrong results.
- **Anti-punt.** Use foreground, blocking ssh loops with short benches and a
progress-file checkpoint. Agents that background work and "wait for the monitor
event" stall - forbid that pattern.
- **GPU coexistence.** On a shared host, stop the user's deployments for a clean
benchmark window (with their OK) and ALWAYS restore them (wrap the bench so a
failure cannot strand them).

## What generalizes (and what doesn't)

The *speedups* may be hardware-specific (here: CUDA/Blackwell - the SSM fusions,
NVFP4 FP4-MMA, the occupancy tune), which is why other accelerators did not
benefit. But the *findings* often generalize and are worth upstreaming: the
"decode is plumbing-bound, not GEMM-bound" insight and the bit-exact, CPU-mirrored
fusion ops help any backend running these models. Separate "ship our tuned backend"
from "upstream the portable op" - they are different deliverables.

## The closing record

Write up the result HONESTLY: the shipped wins, the rejected levers (with reasons),
the structural ceiling, and the cross-backend / cross-quant generality. Negative
results are as valuable as wins. The paged backend README is the template.
39 changes: 39 additions & 0 deletions .docker/llama-cpp-localai-paged-compile.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
#!/usr/bin/env bash
# Shared compile logic for backend/Dockerfile.llama-cpp-localai-paged.
# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.

set -euxo pipefail

export CCACHE_DIR=/root/.ccache
ccache --max-size=5G || true
ccache -z || true

export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"

if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
rm -rf /LocalAI/backend/cpp/llama-cpp-localai-paged-*-build
fi

cd /LocalAI/backend/cpp/llama-cpp-localai-paged

if [ -z "${BUILD_TYPE:-}" ]; then
# Pure CPU image: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries.
# arm64: the armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme).
if [ "${TARGETARCH}" = "arm64" ]; then
apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
export CC=gcc-14 CXX=g++-14
fi
make llama-cpp-localai-paged-cpu-all
else
# GPU build (cublas/hipblas/sycl/vulkan/...): single fallback CPU build, the accelerator
# does the compute. Keeps the GPU compile from also building the CPU variant matrix and
# avoids the gcc-14 apt step on GPU base images such as nvidia l4t.
make llama-cpp-localai-paged-fallback
fi
make llama-cpp-localai-paged-grpc
make llama-cpp-localai-paged-rpc-server

ccache -s || true
33 changes: 33 additions & 0 deletions .github/backend-matrix.yml
Original file line number Diff line number Diff line change
Expand Up @@ -5177,6 +5177,39 @@ include:
dockerfile: "./backend/Dockerfile.golang"
context: "./"
ubuntu-version: '2404'
# llama-cpp-localai-paged: the LocalAI paged-attention llama.cpp variant. Each
# row mirrors the corresponding llama-cpp row with backend/dockerfile/tag-suffix
# swapped; builder-base-image is left UNCHANGED so these reuse the same
# base-grpc-* prebuilt bases (same gRPC + same toolchain), needing no new
# base-images.yml variant.
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/amd64'
tag-latest: 'auto'
tag-suffix: '-gpu-nvidia-cuda-13-llama-cpp-localai-paged'
builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
runs-on: 'bigger-runner'
base-image: "ubuntu:24.04"
skip-drivers: 'false'
backend: "llama-cpp-localai-paged"
dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
context: "./"
ubuntu-version: '2404'
- build-type: 'cublas'
cuda-major-version: "13"
cuda-minor-version: "0"
platforms: 'linux/arm64'
skip-drivers: 'false'
tag-latest: 'auto'
tag-suffix: '-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged'
builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-arm64'
base-image: "ubuntu:24.04"
runs-on: 'ubuntu-24.04-arm'
ubuntu-version: '2404'
backend: "llama-cpp-localai-paged"
dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
context: "./"

# Darwin matrix (consumed by backend-jobs-darwin).
includeDarwin:
Expand Down
77 changes: 77 additions & 0 deletions .github/scripts/paged-canary-apply.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
#!/usr/bin/env bash
#
# paged-canary-apply.sh - apply the vendored paged-attention patch series
# (backend/cpp/llama-cpp-localai-paged/patches/paged/0001-0030) to a llama.cpp checkout, the
# same way the build does, but tolerating the ONE known-benign pre-existing
# quirk in the series. Used by the early-warning canary
# (.github/workflows/llama-cpp-paged-canary.yml) so it only goes red on a REAL
# upstream break, never on that quirk.
#
# Usage: paged-canary-apply.sh <llama.cpp-checkout-dir> <patches-dir>
# <patches-dir> is normally backend/cpp/llama-cpp-localai-paged/patches (it holds the
# top-level base series 0*.patch, currently empty, and the paged/ subseries).
#
# Exit 0 = the whole series applied -> patches still fit upstream.
# Exit !=0 = a patch failed to apply = the red signal: an upstream change moved
# the tree out from under the patches, so it is time to run a PIN_SYNC.
#
# Apply method MIRRORS backend/cpp/llama-cpp/Makefile's `llama.cpp` target:
# plain `git apply --verbose`, which natively tolerates @@ line-number offsets
# but NOT context-line changes. Matching the build's method is the point - the
# canary's apply result is exactly what the real build's apply would do.
#
# The ONLY tolerance, and it is path-scoped (not a blanket `|| true`): patch
# 0019 carries a stray *modify* hunk against the dev-only doc
# SSM_DECODE_FIX_RESULTS.md, a file that exists only on the DGX dev tree and is
# absent from any clean upstream checkout. `git apply` is atomic, so that single
# missing-file hunk rejects the whole patch - and because 0021/0022/0026/0028
# build on 0019's code, the rejection cascades to them too. This is a
# PRE-EXISTING shipped-series defect, present identically on every pin, NOT an
# upstream break (see backend/cpp/llama-cpp-localai-paged/README.md section 7,
# "Pin + maintenance policy"). We exclude ONLY that dev-doc path and still
# apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019
# still fails the canary. prepare.sh tolerates the same hunk via
# `patch ... || true`; this mirrors that tolerance precisely.

set -euo pipefail

CHECKOUT="${1:?usage: paged-canary-apply.sh <llama.cpp-checkout> <patches-dir>}"
PATCHES="${2:?usage: paged-canary-apply.sh <llama.cpp-checkout> <patches-dir>}"

# The lone tolerated dev-doc, and the only patch allowed to carry it.
DEVDOC_GLOB='*SSM_DECODE_FIX_RESULTS.md'
DEVDOC_PATCH='0019-qwen35-ssm-decode-fused-gather.patch'

# Resolve to absolute paths so the apply works after we cd into the checkout.
PATCHES="$(cd "$PATCHES" && pwd)"
cd "$CHECKOUT"

shopt -s nullglob

apply_one() {
local p="$1"; shift
echo "paged-canary: applying $(basename "$p")"
if ! git apply --verbose "$@" "$p"; then
echo "::error::paged patch no longer applies to the upstream llama.cpp tip: $(basename "$p")"
echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (see backend/cpp/llama-cpp-localai-paged/README.md section 7, Pin + maintenance policy), do NOT bump the pin blindly"
exit 1
fi
}

# Base series first (parity with the build: patches/0*.patch before
# patches/paged/0*.patch). Currently empty; nullglob makes this a no-op.
for p in "$PATCHES"/0*.patch; do
apply_one "$p"
done

# Paged series, in order.
for p in "$PATCHES"/paged/0*.patch; do
if [ "$(basename "$p")" = "$DEVDOC_PATCH" ]; then
# Apply 0019's real code hunks; exclude ONLY the benign dev-doc hunk.
apply_one "$p" --exclude="$DEVDOC_GLOB"
else
apply_one "$p"
fi
done

echo "paged-canary: the full paged patch series applied cleanly to the upstream tip"
Loading