mudler · localai-bot · Jun 19, 2026
diff --git a/.agents/llama-cpp-localai-paged-backend.md b/.agents/llama-cpp-localai-paged-backend.md
@@ -0,0 +1,109 @@
+# llama-cpp-localai-paged Backend (paged attention + Blackwell NVFP4 decode)
+
+`llama-cpp-localai-paged` is LocalAI's **CUDA-only** paged-attention variant of the
+llama.cpp backend. It targets high-concurrency decode for the Qwen3.6 hybrid
+gated-DeltaNet (SSM) models on Blackwell (GB10 / DGX Spark). It reuses the stock
+`llama-cpp` backend's sources and applies a vendored patch series on top at build
+time. It is **not** a fork: it is a source-only `*.patch` stack plus one canonical doc.
+
+**Canonical reference:** `backend/cpp/llama-cpp-localai-paged/README.md`
+(architecture, the patch series 0001-0030, benchmarks, dev notes, generality,
+pin/canary policy). Read it for any technical detail; this guide is the maintenance
+how-to.
+
+## Where things live
+
+- `backend/cpp/llama-cpp-localai-paged/Makefile` - the thin wrapper. It copies the
+  stock `backend/cpp/llama-cpp/` build infra into a build dir, clones llama.cpp at
+  this backend's **own** pin (`LLAMA_VERSION`), applies the paged series via the
+  `apply-paged-patches` define (strict `git apply`), then builds `grpc-server`.
+- `backend/cpp/llama-cpp-localai-paged/patches/paged/` - the source-only `.patch`
+  series (0001-0030), nothing else.
+- `backend/cpp/llama-cpp-localai-paged/README.md` - the canonical doc. The
+  operational docs (`PAGED_BITEXACT_NOTE.md`, `UPSTREAM_LAYER2_SCOPE.md`) and
+  dev artifacts live in
+  `backend/cpp/llama-cpp-localai-paged/docs/`.
+- `backend/Dockerfile.llama-cpp-localai-paged`, `.docker/llama-cpp-localai-paged-compile.sh`
+  - the CUDA build entry points.
+- `backend/cpp/llama-cpp/` - the **stock** backend, pure upstream. It carries no
+  paged patches.
+
+## Invariants (do not break these)
+
+- **Stock stays pure.** The paged patches live ONLY in this backend. Never add a
+  `patches/paged/` dir or `LLAMA_PAGED` logic to `backend/cpp/llama-cpp/`.
+- **CUDA-only.** Ship cublas/cuda targets only. Off-CUDA the fusions are gated off
+  (patch 0030) and NVFP4 falls back to dequant, so the backend is neutral-to-
+  slightly-negative there - non-CUDA users use the stock `llama-cpp`. Do not add
+  cpu/vulkan/sycl/metal rows for this backend in `.github/backend-matrix.yml`.
+  (Those builds also fail to link `grpc-server` on darwin/arm64 against upstream
+  `stream_*` server symbols - another reason it is CUDA-only.)
+- **Source-only patches.** A `.patch` may touch only llama.cpp source - never a
+  dev doc or `*.md`. Strict `git apply` on a clean checkout must reach exit 0. (A
+  stray `SSM_DECODE_FIX_RESULTS.md` hunk in patch 0019 once broke the CI build.)
+- **Bit-exact by default.** Every shipped patch keeps output byte-identical to the f32
+  baseline. (The one opt-in precision trade, `ssm_bf16_tau` / patch 0026, was
+  DROPPED: it went flat once the decode fusions landed - forcing all gated-DeltaNet
+  heads to bf16 gave 780.6 vs 780.0 t/s, zero benefit - so the series is now
+  bit-exact end to end. Do not reintroduce a per-head SSM-precision lever; see the
+  rejected-levers note in the backend README section 5.)
+
+## Maintaining the pin against new llama.cpp
+
+The pin (`LLAMA_VERSION` in the wrapper Makefile) is advanced ONLY by the manual
+pin-sync. It is deliberately **excluded from the nightly auto-bumper**
+(`bump_deps.yaml`): a naive bump would shift the tree out from under the patches
+and break `git apply` at build time.
+
+1. **The canary tells you when to sync.** `.github/workflows/llama-cpp-paged-canary.yml`
+   runs weekly: it applies + builds the series against the latest upstream tip and
+   goes **red** when upstream drifts past the patches. Canary red -> run a pin-sync.
+2. **The pin-sync** (recorded in the README section 7 and git history): rebase the
+   series onto the new tip (resolve conflicts; re-export **source-only** with a
+   pathspec like `-- src/ ggml/ common/ include/ tools/ tests/ cmake/`), rebuild on
+   a CUDA box, pass the bit-exact gate on **every** path + `test-backend-ops`, **and
+   confirm the full grpc-server build/link is green on CI**, then bump `LLAMA_VERSION`.
+
+**Hard constraint: keep the pin == the stock `llama-cpp` pin.** `grpc-server.cpp`
+is shared with the stock backend and tracks the stock pin. A paged pin that
+diverges PAST an upstream server-API refactor breaks the grpc-server LINK even
+when the patches are byte-for-byte bit-exact - the bit-exact gate alone does NOT
+catch it. The `c299a92c` bump did exactly this (patches applied + greedy-md5
+bit-exact, but `grpc-server.cpp` failed to link with undefined `stream_*` server
+helpers the refactor pulled into its headers), so it was reverted to `9d5d882d`.
+A pin bump is shippable only once the full CI grpc-server build is green, which in
+practice means moving in lockstep with the stock pin (or vendoring a
+pin-matched grpc-server.cpp, which we deliberately do not, to keep stock pure).
+
+## The bit-exact gate (run for every change)
+
+- greedy md5: `llama-completion -m MODEL -ngl 99 -fa on -p "The capital of France is" -n 48 --temp 0 --seed 1 </dev/null | md5sum`,
+  paged paths prefixed `LLAMA_KV_PAGED=1` (+ `LLAMA_MOE_FORCE_GRAPHS=1` for paged
+  MoE). Must match the recorded baseline. Redirect stdin from `/dev/null` or
+  `llama-completion` hangs in conversation mode.
+- `test-backend-ops` (CUDA0 vs CPU oracle) for every touched op (`SSM_CONV*`,
+  `GATED_DELTA_NET`, `MUL_MAT`, `MUL_MAT_ID`).
+- **The gate is per-path.** The paged-MoE md5 differs from the non-paged md5 - a
+  benign, KL-validated FP-accumulation-order difference (see `docs/PAGED_BITEXACT_NOTE.md`).
+  Compare a paged-MoE change to the **paged** reference, not the non-paged one.
+
+## Encapsulating your work
+
+- When you change a patch, regenerate the `.patch` (source-only) and keep the dev
+  tree and this worktree byte-identical. Commit both with sign-off.
+- New optimization -> next patch number (gaps 0005/0027 are intentional). Update
+  the README's patch table and dev notes - keep the README the single doc; do not
+  scatter `*_RESULTS.md` files.
+- Record rejected/flat levers in the README too (they stop the next person from
+  re-running dead ends).
+
+## Follow-ups (Metal / SYCL / Vulkan)
+
+The decode fusions are implemented for **CUDA + CPU only**. The base
+gated-DeltaNet + SSM_CONV ops already exist upstream on Metal, SYCL, and Vulkan,
+so the models **run** there via the non-fused path - what is missing is the
+fusion speedup. Porting it (strictly mirroring the CUDA kernels, since we have no
+Metal/SYCL/Vulkan hardware to test on here) is scoped in `docs/UPSTREAM_LAYER2_SCOPE.md`
+(recommended order: Metal, then SYCL, then Vulkan; ops-first upstream PR, then one
+PR per backend, each gated by `test-backend-ops` on the target hardware). The
+methodology for that work is in [.agents/vllm-parity-methodology.md](vllm-parity-methodology.md).
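The source-only invariant in the guide above lends itself to a mechanical check before a patch is committed. The following is a minimal sketch, not part of the repository: `patch_offending_paths` and `patch_is_source_only` are hypothetical helper names, and the allow-list is assumed to be the pathspec quoted in the pin-sync step. It only inspects `+++ b/<path>` headers, so a hunk that deletes a file outright is not covered.

```shell
# Hypothetical helpers (not in the repo): flag paths a .patch touches that are
# outside the llama.cpp source directories, or that are Markdown files.
# Assumption: the allow-list equals the pathspec quoted in the pin-sync step.
ALLOWED_PREFIXES='src/ ggml/ common/ include/ tools/ tests/ cmake/'

patch_offending_paths() {
  local patch="$1" path prefix ok
  # "+++ b/<path>" names the post-image of each file in a git-style patch.
  sed -n 's|^+++ b/||p' "$patch" | while IFS= read -r path; do
    ok=0
    for prefix in $ALLOWED_PREFIXES; do
      case "$path" in "$prefix"*) ok=1 ;; esac
    done
    # A *.md file is never allowed, whatever directory it sits in.
    case "$path" in *.md) ok=0 ;; esac
    [ "$ok" -eq 1 ] || printf '%s\n' "$path"
  done
}

# Succeeds only when the patch touches no offending path.
patch_is_source_only() {
  [ -z "$(patch_offending_paths "$1")" ]
}
```

A pre-commit loop could call `patch_is_source_only` on each file under `patches/paged/` and print the output of `patch_offending_paths` for any that fail. This complements, and does not replace, the strict `git apply` on a clean checkout.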
diff --git a/.agents/vllm-parity-methodology.md b/.agents/vllm-parity-methodology.md
@@ -0,0 +1,101 @@
+# Methodology: Closing the vLLM Decode-Throughput Gap in llama.cpp
+
+This is the playbook that took the paged backend
+([.agents/llama-cpp-localai-paged-backend.md](llama-cpp-localai-paged-backend.md))
+from ~38% of vLLM decode throughput to **parity-to-ahead on dense** (and a proven, honest
+ceiling on MoE) on GB10. Use it for any "make llama.cpp match or beat engine X on
+accelerator Y" effort. The *levers* are model- and hardware-specific; the
+*discipline* below is not. The worked example, with all numbers, is the paged
+backend README.
+
+## The core loop
+
+1. **Establish a bit-exact baseline and gate FIRST.** Record the greedy md5 (per
+   path) and an f32 reference. Every optimization must stay byte-identical to it -
+   or ship as an explicit, default-off precision opt-in. This is what lets you
+   optimize aggressively without silently regressing quality. Gate two ways:
+   greedy md5, and `test-backend-ops` against the CPU oracle.
+
+2. **Profile - do not assume.** nsys the steady-state decode step, broken down per
+   *kernel* AND per *memcpy*. Find the dominant cost. "It's the GEMM" was wrong
+   here: on hybrid gated-DeltaNet models the bottleneck was the recurrent-state
+   **plumbing** (state memcpy + gathers, ~67% of the step), not the weight GEMM.
+   Also sanity-check GPU-busy %: an early "low utilization" reading was a profiling
+   window artifact (decode was 96-99% GPU-busy), not real idle.
+
+3. **Ground-truth BOTH engines.** Decompose *your* decode step AND the
+   competitor's, side by side, per bucket, and compute the per-bucket delta. This
+   tells you WHERE the gap actually is - not where you would guess. It overturned
+   premises here: e.g. vLLM does NOT run the GDN/attn projections as NVFP4 (it
+   keeps them bf16, same as us); the MoE expert GEMM was a llama *win*, not the gap.
+
+4. **Per-lever discipline.** For each candidate: implement -> bit-exact gate ->
+   same-harness A/B bench. Use a runtime env-toggle (flag off vs on) ONLY for
+   levers that are actually runtime-gated; a lever **compiled into** the binary
+   (e.g. the SSM decode fusions here) is NOT isolated by a runtime flag, so measure
+   it build-vs-build. The full-patchset "stock" baseline likewise needs a
+   **separately-built unpatched binary at the same pin** - toggling the runtime
+   flag on the patched binary does not reproduce stock (it measures only the gated
+   part; here that was ~neutral, which is exactly how this gotcha hides). Bank only
+   what lifts AND passes the gate. **Record every rejected or flat lever with the reason** -
+   over time this is the most valuable part: it stops the next person re-running
+   dead ends.
+
+5. **Name the structural floor.** Prove the bit-exact ceiling exhaustively (every
+   lever measured, not assumed). What remains is physical - the memory-bandwidth
+   floor, the irreducible serial-SSM host loop (sampling can't start until logits
+   land). Name it; do not claim more than you measured.
+
+## Hard rules learned
+
+- **Apples-to-apples, or label it.** Stock-vs-patched on the SAME harness
+  (`llama-batched-bench`) is exact - lead with it. But "stock" must be a
+  separately-built unpatched binary at the SAME pin, NOT the patched binary with
+  the runtime flag off (compiled-in wins survive the toggle). Cross-engine "% of vLLM"
+  (batched-bench vs vLLM server+client) is *indicative*; always caveat the harness
+  and config (context length alone shifted the MoE figure 76% <-> 86%).
+- **Re-measure a "win" after later levers land - it may evaporate.** bf16 SSM
+  state (the `ssm_bf16_tau` lever) benched +12% early but failed the f32 KL gate
+  (vLLM keeps f32 too), so it was kept default-off opt-in. Once the decode fusions
+  (recurrent-state gather-fusion + block-table cache) landed, a clean re-measure
+  forcing ALL gated-DeltaNet heads to bf16 (`tau=100000`) went **flat** - 780.6 vs
+  780.0 t/s. The "+12%" was subsumed by the fusions: the lever bought nothing, so
+  it was **dropped** (precision trade + bug surface + extra CUDA template-instantiation
+  compile cost, zero benefit). A win measured before the rest of the series is not a
+  win after it.
+- **Reject the obvious-but-wrong, with evidence.** A faster kernel that is off the
+  critical path benches FLAT (the freed time becomes idle). Quantizing the bf16
+  projections to NVFP4 cost ~6% PPL - and vLLM keeps them bf16 for the same reason.
+  Always measure before believing; a plausible mechanism is not a result.
+- **The gate can be per-path.** Paged vs non-paged attention legitimately produces
+  different (equivalent) FP-reduction orders; validate the difference is benign
+  (KLD to f32) and then gate each path against its own reference.
+
+## Orchestration (multi-agent)
+
+- **One GPU profiler/bencher at a time** (the GPU-contention rule). Parallel
+  design/analysis/read agents are fine; concurrent GPU benches pollute each other's
+  numbers.
+- **Adversarial verify.** Before banking a finding, spawn skeptics prompted to
+  *refute* it; majority-refute kills it. Prevents plausible-but-wrong results.
+- **Anti-punt.** Use foreground, blocking ssh loops with short benches and a
+  progress-file checkpoint. Agents that background work and "wait for the monitor
+  event" stall - forbid that pattern.
+- **GPU coexistence.** On a shared host, stop the user's deployments for a clean
+  benchmark window (with their OK) and ALWAYS restore them (wrap the bench so a
+  failure cannot strand them).
+
+## What generalizes (and what doesn't)
+
+The *speedups* may be hardware-specific (here: CUDA/Blackwell - the SSM fusions,
+NVFP4 FP4-MMA, the occupancy tune), which is why other accelerators did not
+benefit. But the *findings* often generalize and are worth upstreaming: the
+"decode is plumbing-bound, not GEMM-bound" insight and the bit-exact, CPU-mirrored
+fusion ops help any backend running these models. Separate "ship our tuned backend"
+from "upstream the portable op" - they are different deliverables.
+
+## The closing record
+
+Write up the result HONESTLY: the shipped wins, the rejected levers (with reasons),
+the structural ceiling, and the cross-backend / cross-quant generality. Negative
+results are as valuable as wins. The paged backend README is the template.
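The per-lever A/B step in the methodology above reduces to a small calculation: the relative change between two same-harness throughput numbers, compared against a noise band. A minimal sketch follows, under stated assumptions: `lift_pct` and `classify_lever` are hypothetical names, and the default 1.0% noise band is an illustrative placeholder, not a figure from the source. Measure the real run-to-run noise on the target harness before trusting any threshold.

```shell
# Hypothetical helpers for a same-harness A/B comparison; not part of the repo.

# lift_pct BASE CAND: relative change of CAND over BASE, in percent.
lift_pct() {
  awk -v b="$1" -v c="$2" 'BEGIN {
    if (b <= 0) { print "baseline must be > 0" > "/dev/stderr"; exit 2 }
    printf "%.2f\n", (c - b) / b * 100
  }'
}

# classify_lever BASE CAND [NOISE_PCT]: prints "lift", "flat" or "regression".
# NOISE_PCT is an assumed run-to-run noise band in percent (default 1.0).
classify_lever() {
  awk -v b="$1" -v c="$2" -v t="${3:-1.0}" 'BEGIN {
    if (b <= 0) { print "baseline must be > 0" > "/dev/stderr"; exit 2 }
    d = (c - b) / b * 100
    if (d >= t) {
      print "lift"
    } else if (d <= -t) {
      print "regression"
    } else {
      print "flat"
    }
  }'
}
```

With the re-measured `ssm_bf16_tau` figures quoted above, `lift_pct 780.0 780.6` prints `0.08` and `classify_lever 780.0 780.6` prints `flat`. Whatever the verdict, the lever and its reason still belong in the record.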
diff --git a/.docker/llama-cpp-localai-paged-compile.sh b/.docker/llama-cpp-localai-paged-compile.sh
@@ -0,0 +1,39 @@
+#!/usr/bin/env bash
+# Shared compile logic for backend/Dockerfile.llama-cpp-localai-paged.
+# Sourced (via bind mount) from both builder-fromsource and builder-prebuilt stages.
+
+set -euxo pipefail
+
+export CCACHE_DIR=/root/.ccache
+ccache --max-size=5G || true
+ccache -z || true
+
+export CMAKE_ARGS="${CMAKE_ARGS:-} -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache"
+
+if [[ -n "${CUDA_DOCKER_ARCH:-}" ]]; then
+  CUDA_ARCH_ESC="${CUDA_DOCKER_ARCH//;/\\;}"
+  export CMAKE_ARGS="${CMAKE_ARGS} -DCMAKE_CUDA_ARCHITECTURES=${CUDA_ARCH_ESC}"
+  echo "CMAKE_ARGS(env) = ${CMAKE_ARGS}"
+  rm -rf /LocalAI/backend/cpp/llama-cpp-localai-paged-*-build
+fi
+
+cd /LocalAI/backend/cpp/llama-cpp-localai-paged
+
+if [ -z "${BUILD_TYPE:-}" ]; then
+  # Pure CPU image: one ggml CPU_ALL_VARIANTS build replaces the per-microarch binaries.
+  # arm64: the armv9.2 SME variants need gcc-14 (gcc-13 rejects +sme).
+  if [ "${TARGETARCH:-}" = "arm64" ]; then
+    apt-get update -qq && apt-get install -y -qq gcc-14 g++-14
+    export CC=gcc-14 CXX=g++-14
+  fi
+  make llama-cpp-localai-paged-cpu-all
+else
+  # GPU build (cublas/hipblas/sycl/vulkan/...): single fallback CPU build, the accelerator
+  # does the compute. Keeps the GPU compile from also building the CPU variant matrix and
+  # avoids the gcc-14 apt step on GPU base images such as nvidia l4t.
+  make llama-cpp-localai-paged-fallback
+fi
+make llama-cpp-localai-paged-grpc
+make llama-cpp-localai-paged-rpc-server
+
+ccache -s || true
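The `CUDA_ARCH_ESC` line in the script above backslash-escapes every semicolon in the architecture list before it is appended to `CMAKE_ARGS`, presumably so that the list separators survive a later unquoted expansion. The sketch below isolates that one transformation. `escape_cmake_list` is a hypothetical name, and it uses `sed` as a portable equivalent of the script's bash expansion `${CUDA_DOCKER_ARCH//;/\\;}`; the architecture values in the usage note are arbitrary.

```shell
# escape_cmake_list LIST: print LIST with every ';' turned into '\;'.
# Hypothetical helper; equivalent to the bash expansion ${LIST//;/\\;}
# used by the compile script, written with sed so it also runs under plain sh.
escape_cmake_list() {
  printf '%s\n' "$1" | sed 's/;/\\;/g'
}
```

For example, `escape_cmake_list '80;90'` prints `80\;90`, and a value without semicolons passes through unchanged.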
diff --git a/.github/backend-matrix.yml b/.github/backend-matrix.yml
@@ -5177,6 +5177,39 @@ include:
     dockerfile: "./backend/Dockerfile.golang"
     context: "./"
     ubuntu-version: '2404'
+  # llama-cpp-localai-paged: the LocalAI paged-attention llama.cpp variant. Each
+  # row mirrors the corresponding llama-cpp row with backend/dockerfile/tag-suffix
+  # swapped; builder-base-image is left UNCHANGED so these reuse the same
+  # base-grpc-* prebuilt bases (same gRPC + same toolchain), needing no new
+  # base-images.yml variant.
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/amd64'
+    tag-latest: 'auto'
+    tag-suffix: '-gpu-nvidia-cuda-13-llama-cpp-localai-paged'
+    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-amd64'
+    runs-on: 'bigger-runner'
+    base-image: "ubuntu:24.04"
+    skip-drivers: 'false'
+    backend: "llama-cpp-localai-paged"
+    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
+    context: "./"
+    ubuntu-version: '2404'
+  - build-type: 'cublas'
+    cuda-major-version: "13"
+    cuda-minor-version: "0"
+    platforms: 'linux/arm64'
+    skip-drivers: 'false'
+    tag-latest: 'auto'
+    tag-suffix: '-nvidia-l4t-cuda-13-arm64-llama-cpp-localai-paged'
+    builder-base-image: 'quay.io/go-skynet/ci-cache:base-grpc-cuda-13-arm64'
+    base-image: "ubuntu:24.04"
+    runs-on: 'ubuntu-24.04-arm'
+    ubuntu-version: '2404'
+    backend: "llama-cpp-localai-paged"
+    dockerfile: "./backend/Dockerfile.llama-cpp-localai-paged"
+    context: "./"
 
 # Darwin matrix (consumed by backend-jobs-darwin).
 includeDarwin:
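The CUDA-only invariant from the backend guide ("do not add cpu/vulkan/sycl/metal rows") can be linted against the matrix file. This is a rough sketch, assuming the flat `key: value` row layout shown in the hunk above, with no inline comments after values and no multi-line scalars. `paged_non_cuda_rows` is a hypothetical helper, not an existing CI script.

```shell
# paged_non_cuda_rows FILE: print the build-type of every matrix row whose
# backend is llama-cpp-localai-paged but whose build-type is not cublas.
# Hypothetical lint; assumes one "key: value" pair per line, as in the hunk.
paged_non_cuda_rows() {
  awk '
    function emit_row() {
      if (backend == "llama-cpp-localai-paged" && build != "cublas") {
        print (build == "" ? "(none)" : build)
      }
      backend = ""; build = ""
    }
    function yaml_value(line) {
      sub(/^[^:]*:[ \t]*/, "", line)   # drop everything up to "key:"
      gsub(/["\047]/, "", line)        # drop double and single quotes
      sub(/[ \t]+$/, "", line)         # drop trailing whitespace
      return line
    }
    /^[ \t]*- /                  { emit_row() }   # a new list item ends the previous row
    /^[ \t]*(- )?build-type:/    { build = yaml_value($0) }
    /^[ \t]*(- )?backend:/       { backend = yaml_value($0) }
    END                          { emit_row() }
  ' "$1"
}
```

An empty result means every paged row is a cublas row. Any printed value names a build type that would violate the invariant.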

diff --git a/.github/scripts/paged-canary-apply.sh b/.github/scripts/paged-canary-apply.sh
@@ -0,0 +1,77 @@
+#!/usr/bin/env bash
+#
+# paged-canary-apply.sh - apply the vendored paged-attention patch series
+# (backend/cpp/llama-cpp-localai-paged/patches/paged/0001-0030) to a llama.cpp checkout, the
+# same way the build does, but tolerating the ONE known-benign pre-existing
+# quirk in the series. Used by the early-warning canary
+# (.github/workflows/llama-cpp-paged-canary.yml) so it only goes red on a REAL
+# upstream break, never on that quirk.
+#
+# Usage: paged-canary-apply.sh <llama.cpp-checkout-dir> <patches-dir>
+#   <patches-dir> is normally backend/cpp/llama-cpp-localai-paged/patches (it holds the
+#   top-level base series 0*.patch, currently empty, and the paged/ subseries).
+#
+# Exit 0  = the whole series applied -> patches still fit upstream.
+# Exit !=0 = a patch failed to apply  = the red signal: an upstream change moved
+#            the tree out from under the patches, so it is time to run a PIN_SYNC.
+#
+# Apply method MIRRORS backend/cpp/llama-cpp/Makefile's `llama.cpp` target:
+# plain `git apply --verbose`, which natively tolerates @@ line-number offsets
+# but NOT context-line changes. Matching the build's method is the point - the
+# canary's apply result is exactly what the real build's apply would do.
+#
+# The ONLY tolerance, and it is path-scoped (not a blanket `|| true`): patch
+# 0019 carries a stray *modify* hunk against the dev-only doc
+# SSM_DECODE_FIX_RESULTS.md, a file that exists only on the DGX dev tree and is
+# absent from any clean upstream checkout. `git apply` is atomic, so that single
+# missing-file hunk rejects the whole patch - and because 0021/0022/0026/0028
+# build on 0019's code, the rejection cascades to them too. This is a
+# PRE-EXISTING shipped-series defect, present identically on every pin, NOT an
+# upstream break (see backend/cpp/llama-cpp-localai-paged/README.md section 7,
+# "Pin + maintenance policy"). We exclude ONLY that dev-doc path and still
+# apply 0019's real code hunks atomically, so a genuine code-hunk break in 0019
+# still fails the canary. prepare.sh tolerates the same hunk via
+# `patch ... || true`; this mirrors that tolerance precisely.
+
+set -euo pipefail
+
+CHECKOUT="${1:?usage: paged-canary-apply.sh <llama.cpp-checkout> <patches-dir>}"
+PATCHES="${2:?usage: paged-canary-apply.sh <llama.cpp-checkout> <patches-dir>}"
+
+# The lone tolerated dev-doc, and the only patch allowed to carry it.
+DEVDOC_GLOB='*SSM_DECODE_FIX_RESULTS.md'
+DEVDOC_PATCH='0019-qwen35-ssm-decode-fused-gather.patch'
+
+# Resolve to absolute paths so the apply works after we cd into the checkout.
+PATCHES="$(cd "$PATCHES" && pwd)"
+cd "$CHECKOUT"
+
+shopt -s nullglob
+
+apply_one() {
+  local p="$1"; shift
+  echo "paged-canary: applying $(basename "$p")"
+  if ! git apply --verbose "$@" "$p"; then
+    echo "::error::paged patch no longer applies to the upstream llama.cpp tip: $(basename "$p")"
+    echo "::error::upstream drifted past the vendored paged series - run a PIN_SYNC (see backend/cpp/llama-cpp-localai-paged/README.md section 7, Pin + maintenance policy), do NOT bump the pin blindly"
+    exit 1
+  fi
+}
+
+# Base series first (parity with the build: patches/0*.patch before
+# patches/paged/0*.patch). Currently empty; nullglob makes this a no-op.
+for p in "$PATCHES"/0*.patch; do
+  apply_one "$p"
+done
+
+# Paged series, in order.
+for p in "$PATCHES"/paged/0*.patch; do
+  if [ "$(basename "$p")" = "$DEVDOC_PATCH" ]; then
+    # Apply 0019's real code hunks; exclude ONLY the benign dev-doc hunk.
+    apply_one "$p" --exclude="$DEVDOC_GLOB"
+  else
+    apply_one "$p"
+  fi
+done
+
+echo "paged-canary: the full paged patch series applied cleanly to the upstream tip"
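The path-scoped tolerance in the script above relies on two `git apply` behaviours: a patch is applied all-or-nothing, and `--exclude=<pattern>` skips files that match the pattern. The sketch below reproduces both on a throwaway directory. It is an illustration only: `demo_exclude`, `code.c` and `NOTES.md` are made-up names standing in for the real series, and it needs `git` on `PATH`.

```shell
# demo_exclude: show that one missing file rejects a whole patch, and that
# --exclude lets the remaining hunks apply. All names here are hypothetical.
demo_exclude() {
  local work
  work="$(mktemp -d)"
  (
    cd "$work" || exit 1
    git init -q . >/dev/null 2>&1
    printf 'old\n' > code.c          # the only file in this "clean checkout"
    # One patch, two files: code.c exists here, NOTES.md does not.
    printf '%s\n' \
      'diff --git a/code.c b/code.c' \
      '--- a/code.c' \
      '+++ b/code.c' \
      '@@ -1 +1 @@' \
      '-old' \
      '+new' \
      'diff --git a/NOTES.md b/NOTES.md' \
      '--- a/NOTES.md' \
      '+++ b/NOTES.md' \
      '@@ -1 +1 @@' \
      '-draft' \
      '+final' > series.patch
    # Plain apply: the NOTES.md hunk cannot apply, so nothing is written.
    if git apply series.patch 2>/dev/null; then
      echo "plain: applied"
    else
      echo "plain: rejected"
    fi
    # Path-scoped tolerance: skip only the doc, apply the code hunk.
    if git apply --exclude='*NOTES.md' series.patch; then
      echo "exclude: applied"
    fi
    printf 'code.c: %s\n' "$(cat code.c)"
  )
}
```

With git installed, calling `demo_exclude` should print `plain: rejected`, `exclude: applied` and `code.c: new`. A genuine context mismatch in `code.c` would still fail the second apply, which is the property the canary depends on.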