
Commit 84ea97c

unamedkr and claude committed
bench: prefill throughput script + 40× gap discovery
Adds scripts/test_prefill.sh and updates the throughput report with the
single biggest gap to llama.cpp: prompt prefill. Today's quant.cpp does
prompt processing one token at a time through the same single-token
forward path as decode. llama.cpp uses batched matrix-matrix matmul
during prefill, which is 30-50× faster.

Concrete numbers (M1 Pro, 8 threads, ~450 prompt tokens):

| Model | quant.cpp (tok/s) | llama.cpp (tok/s) | Gap |
|---|---:|---:|---:|
| Llama-3.2-1B Q8 | 10 | 359 | 35× |
| Llama-3.2-3B Q8 | 3 | 130 | 41× |
| Phi-3.5 Q4_K_M | 2 | 91 | 48× |
| Qwen3.5-4B Q4_K | 2 | 88 | 44× |

User-visible impact: a 1000-token prompt to Phi-3.5-mini takes ~10
minutes today. A batched-prefill path should make it under 15 seconds.
Marked as the next major engineering project for the engine.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e82ddd7 commit 84ea97c
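The latency claim in the commit message follows directly from the measured Phi-3.5 rates; a quick back-of-envelope check (plain arithmetic, rates copied from the table above):

```python
# Sanity-check the commit message's user-visible claim using the
# measured Phi-3.5 Q4_K_M prefill rates from the table (tok/s).
prompt_tokens = 1000
quant_pp = 1.9      # quant.cpp prefill rate (measured)
llama_pp = 90.8     # llama.cpp prefill rate (measured)

quant_seconds = prompt_tokens / quant_pp   # ~526 s, i.e. ~8.8 min
llama_seconds = prompt_tokens / llama_pp   # ~11 s

print(f"quant.cpp: {quant_seconds / 60:.1f} min, llama.cpp: {llama_seconds:.1f} s")
```

So "~10 minutes today, under 15 seconds with batched prefill" is consistent with the benchmarked rates.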

File tree

2 files changed: +97, −0 lines


bench/results/2026-04-15_throughput_vs_llamacpp.md

Lines changed: 23 additions & 0 deletions
@@ -32,6 +32,29 @@ llama-bench -m <model> -p 0 -n 64 -t 8 -ngl 0
- **vs CPU (apples-to-apples)**: we're at **23-71%** of llama.cpp's pure-CPU speed depending on model. Phi-3.5 Q8_0 at 71% is competitive.
- **Smaller models close the gap**: 1B Q8 at 52% vs 3B/4B Q4_K_M at 23-25% suggests our Q4_K dispatch (raw GGUF path) is the largest remaining gap. The Q4-converted path (3B Llama, 1B Llama) is more competitive.

## Prefill (prompt processing) — biggest remaining gap

Generation speed is what gets benchmarked, but for any RAG/long-context
workload the user actually waits on **prefill**: running the prompt
through the model to populate the KV cache. quant.cpp currently calls
the same single-token forward path for every prompt token, so prefill
runs at roughly the same speed as decode. llama.cpp uses batched
matrix-matrix matmul during prefill, which is 30-50× faster.

Reproduce: `bash scripts/test_prefill.sh` and `llama-bench -m <model> -p 512 -n 0 -ngl 0`.

| Model | quant.cpp pp~450 (tok/s) | llama.cpp pp512 (tok/s) | Ratio |
|---|---:|---:|---:|
| Llama-3.2-1B Q8_0 | 10.2 | 358.7 | **35× behind** |
| Llama-3.2-3B Q8_0 | 3.2 | 130.1 | **41× behind** |
| Phi-3.5 Q4_K_M | 1.9 | 90.8 | **48× behind** |
| Qwen3.5-4B Q4_K_M | 2.0 | 88.1 | **44× behind** |

User-visible impact on a 16GB Mac: feeding a 1000-token prompt to
Phi-3.5-mini takes ~10 minutes today. With a batched-prefill path it
should be under 15 seconds. **This is the single biggest user-facing
gap** — and the next major engineering project for the engine.
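The mechanism behind the gap can be seen even outside the engine: per-token prefill repeats a matrix-vector product for every prompt token, re-walking the same weights each time, while batched prefill is a single matrix-matrix product over the stacked activations. A minimal numpy sketch of the two shapes of work (illustrative only: sizes are made up, and quant.cpp's real kernels are quantized C++, not numpy):

```python
import time
import numpy as np

d, T = 1024, 256                                # hidden size, prompt length (made-up sizes)
W = np.random.rand(d, d).astype(np.float32)     # one weight matrix
X = np.random.rand(T, d).astype(np.float32)     # activations for T prompt tokens

# Per-token path (what quant.cpp's prefill does today): T separate matvecs.
t0 = time.perf_counter()
per_token = np.stack([W @ X[i] for i in range(T)])
t1 = time.perf_counter()

# Batched path (what llama.cpp does): one matrix-matrix product.
batched = X @ W.T
t2 = time.perf_counter()

assert np.allclose(per_token, batched, atol=1e-2)  # same math, different schedule
print(f"per-token: {t1 - t0:.4f}s  batched: {t2 - t1:.4f}s")
```

With an optimized BLAS the batched form is typically much faster because the weights are streamed through cache once instead of T times; the same reuse argument is why a batched-prefill path should close most of the measured gap.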
## Session improvements (2026-04-15)

Compared to the same hardware before this session:

scripts/test_prefill.sh

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
#!/usr/bin/env bash
# test_prefill.sh — measure prompt prefill throughput.
#
# Why: generation throughput (tok/s during decode) is what user-facing
# benchmarks usually report, but for any RAG/long-context workload the
# user waits on PREFILL — running the prompt through the model to build
# the KV cache. quant.cpp currently does prefill one token at a time
# through the same forward path as generation, so prefill ≈ gen rate
# instead of the typical ~10-100× speedup that batched matmul gives.
#
# This script makes the gap measurable. Compare the printed pp_tps
# against `llama-bench -p 512 -n 0` for the same model.
#
# Usage: bash scripts/test_prefill.sh [models_dir]

set -u
MODELS_DIR="${1:-models}"
QUANT_BIN="${QUANT_BIN:-./build/quant}"

if [[ ! -x "$QUANT_BIN" ]]; then
  echo "ERROR: $QUANT_BIN not built." >&2
  exit 1
fi

# Portable sub-second timestamp: BSD date (macOS) does not support %N,
# so use perl's Time::HiRes instead.
now() { perl -MTime::HiRes=time -e 'printf "%.6f", time'; }

# Build a prompt by repeating a 9-word phrase n_reps times.
# Phi-3.5/Qwen tokenize it at ~1 token per word, so tokens ≈ 9 * n_reps.
make_prompt() {
  local n_reps=$1
  local out=""
  for ((i=0; i<n_reps; i++)); do
    out+="The quick brown fox jumps over the lazy dog. "
  done
  echo -n "$out"
}

bench_prefill() {
  local model="$1"
  local n_reps="$2"
  if [[ ! -f "$MODELS_DIR/$model" ]]; then
    printf "  %-40s %4dx [SKIP]\n" "$model" "$n_reps"
    return
  fi
  local prompt
  prompt=$(make_prompt "$n_reps")
  local prompt_chars=${#prompt}

  local t0 t1 elapsed
  t0=$(now)
  "$QUANT_BIN" "$MODELS_DIR/$model" -p "$prompt" -n 1 -T 0 > /dev/null 2>&1
  t1=$(now)
  elapsed=$(echo "$t1 - $t0" | bc -l)
  # Approx token count: ~5 chars per token for English
  local approx_toks=$(( prompt_chars / 5 ))
  local rate=$(echo "scale=1; $approx_toks / $elapsed" | bc -l)
  printf "  %-40s %4dx %6.1fs (~%d tok) pp_tps≈%s\n" \
    "$model" "$n_reps" "$elapsed" "$approx_toks" "$rate"
}

echo "=== Prefill throughput (TQ_NO_METAL=1) ==="
echo "Note: pp_tps is approximate (chars/5). Compare to llama-bench -p N -n 0."
echo ""

export TQ_NO_METAL=1

# Two prompt sizes per model: small (~90 tok) and medium (~450 tok).
# The 1000+ token sweep takes 10+ minutes per model — uncomment to run.
for model in \
  Llama-3.2-1B-Instruct-Q8_0.gguf \
  Llama-3.2-3B-Instruct-Q8_0.gguf \
  Phi-3.5-mini-instruct-Q4_K_M.gguf \
  Qwen3.5-4B-Q4_K_M.gguf; do
  bench_prefill "$model" 10    # ~90 tokens
  bench_prefill "$model" 50    # ~450 tokens
  # bench_prefill "$model" 120  # ~1080 tokens — very slow today
done
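The chars/5 token estimate the script relies on lines up with its own prompt construction; a quick sanity check using only the script's constants:

```python
# The script repeats this exact 9-word phrase (trailing space included)
# and estimates tokens as characters / 5.
phrase = "The quick brown fox jumps over the lazy dog. "

chars_per_rep = len(phrase)          # 45 characters per repetition
est_tokens_per_rep = chars_per_rep / 5   # 9.0, i.e. ~1 token per word
reps_for_450 = 450 / est_tokens_per_rep  # 50 repetitions gives the "pp~450" runs

print(chars_per_rep, est_tokens_per_rep, reps_for_450)
```

This is why the 50-repetition runs in the script correspond to the "~450 prompt tokens" quoted in the report.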
