
Commit 84ea97c

unamedkr and claude committed
bench: prefill throughput script + 40× gap discovery
Adds scripts/test_prefill.sh and updates the throughput report with the
single biggest gap to llama.cpp: prompt prefill. Today's quant.cpp does
prompt processing one token at a time through the same single-token
forward path as decode. llama.cpp uses batched matrix-matrix matmul
during prefill, which is 30-50× faster.

Concrete numbers (M1 Pro, 8 threads, ~450 prompt tokens):

| Model | quant.cpp (tok/s) | llama.cpp (tok/s) | Gap |
|---|---:|---:|---:|
| Llama-3.2-1B Q8 | 10 | 359 | 35× |
| Llama-3.2-3B Q8 | 3 | 130 | 41× |
| Phi-3.5 Q4_K_M | 2 | 91 | 48× |
| Qwen3.5-4B Q4_K | 2 | 88 | 44× |

User-visible impact: a 1000-token prompt to Phi-3.5-mini takes ~10
minutes today. A batched-prefill path should make it under 15 seconds.
Marked as the next major engineering project for the engine.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent e82ddd7 commit 84ea97c
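The latency claim in the commit message follows directly from the measured Phi-3.5 rates; a quick back-of-envelope check (plain arithmetic, rates copied from the table above):

```python
# Sanity-check the commit message's user-visible claim using the
# measured Phi-3.5 Q4_K_M prefill rates from the table (tok/s).
prompt_tokens = 1000
quant_pp = 1.9      # quant.cpp prefill rate (measured)
llama_pp = 90.8     # llama.cpp prefill rate (measured)

quant_seconds = prompt_tokens / quant_pp   # ~526 s, i.e. ~8.8 min
llama_seconds = prompt_tokens / llama_pp   # ~11 s

print(f"quant.cpp: {quant_seconds / 60:.1f} min, llama.cpp: {llama_seconds:.1f} s")
```

So "~10 minutes today, under 15 seconds with batched prefill" is consistent with the benchmarked rates.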

File tree

2 files changed: +97, −0 lines


bench/results/2026-04-15_throughput_vs_llamacpp.md

Lines changed: 23 additions & 0 deletions
@@ -32,6 +32,29 @@ llama-bench -m <model> -p 0 -n 64 -t 8 -ngl 0
- **vs CPU (apples-to-apples)**: we're at **23-71%** of llama.cpp's pure-CPU speed depending on model. Phi-3.5 Q8_0 at 71% is competitive.
- **Smaller models close the gap**: 1B Q8 at 52% vs 3B/4B Q4_K_M at 23-25% suggests our Q4_K dispatch (raw GGUF path) is the largest remaining gap. The Q4-converted path (3B Llama, 1B Llama) is more competitive.

## Prefill (prompt processing) — biggest remaining gap

Generation speed is what gets benchmarked, but for any RAG/long-context
workload the user actually waits on **prefill**: running the prompt
through the model to populate the KV cache. quant.cpp currently calls
the same single-token forward path for every prompt token, so prefill
runs at roughly the same speed as decode. llama.cpp uses batched
matrix-matrix matmul during prefill, which is 30-50× faster.

Reproduce: `bash scripts/test_prefill.sh` and `llama-bench -m <model> -p 512 -n 0 -ngl 0`.

| Model | quant.cpp pp~450 (tok/s) | llama.cpp pp512 (tok/s) | Ratio |
|---|---:|---:|---:|
| Llama-3.2-1B Q8_0 | 10.2 | 358.7 | **35× behind** |
| Llama-3.2-3B Q8_0 | 3.2 | 130.1 | **41× behind** |
| Phi-3.5 Q4_K_M | 1.9 | 90.8 | **48× behind** |
| Qwen3.5-4B Q4_K_M | 2.0 | 88.1 | **44× behind** |

User-visible impact on a 16GB Mac: feeding a 1000-token prompt to
Phi-3.5-mini takes ~10 minutes today. With a batched-prefill path it
should be under 15 seconds. **This is the single biggest user-facing
gap** — and the next major engineering project for the engine.
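The mechanism behind the gap can be seen even outside the engine: per-token prefill repeats a matrix-vector product for every prompt token, re-walking the same weights each time, while batched prefill is a single matrix-matrix product over the stacked activations. A minimal numpy sketch of the two shapes of work (illustrative only: sizes are made up, and quant.cpp's real kernels are quantized C++, not numpy):

```python
import time
import numpy as np

d, T = 1024, 256                                # hidden size, prompt length (made-up sizes)
W = np.random.rand(d, d).astype(np.float32)     # one weight matrix
X = np.random.rand(T, d).astype(np.float32)     # activations for T prompt tokens

# Per-token path (what quant.cpp's prefill does today): T separate matvecs.
t0 = time.perf_counter()
per_token = np.stack([W @ X[i] for i in range(T)])
t1 = time.perf_counter()

# Batched path (what llama.cpp does): one matrix-matrix product.
batched = X @ W.T
t2 = time.perf_counter()

assert np.allclose(per_token, batched, atol=1e-2)  # same math, different schedule
print(f"per-token: {t1 - t0:.4f}s  batched: {t2 - t1:.4f}s")
```

With an optimized BLAS the batched form is typically much faster because the weights are streamed through cache once instead of T times; the same reuse argument is why a batched-prefill path should close most of the measured gap.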
## Session improvements (2026-04-15)

Compared to the same hardware before this session:

scripts/test_prefill.sh

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
#!/usr/bin/env bash
# test_prefill.sh — measure prompt prefill throughput.
#
# Why: generation throughput (tok/s during decode) is what user-facing
# benchmarks usually report, but for any RAG/long-context workload the
# user waits on PREFILL — running the prompt through the model to build
# the KV cache. quant.cpp currently does prefill one token at a time
# through the same forward path as generation, so prefill ≈ gen rate
# instead of the typical ~10-100× speedup that batched matmul gives.
#
# This script makes the gap measurable. Compare the printed pp_tps
# against `llama-bench -p 512 -n 0` for the same model.
#
# Usage: bash scripts/test_prefill.sh [models_dir]

set -u
MODELS_DIR="${1:-models}"
QUANT_BIN="${QUANT_BIN:-./build/quant}"

if [[ ! -x "$QUANT_BIN" ]]; then
  echo "ERROR: $QUANT_BIN not built." >&2
  exit 1
fi

# Portable sub-second timestamp: BSD date (macOS) does not support %N,
# so use perl's Time::HiRes instead.
now() { perl -MTime::HiRes=time -e 'printf "%.6f", time'; }

# Build a prompt by repeating a 9-word phrase n_reps times.
# Phi-3.5/Qwen tokenize it at ~1 token per word, so tokens ≈ 9 * n_reps.
make_prompt() {
  local n_reps=$1
  local out=""
  for ((i=0; i<n_reps; i++)); do
    out+="The quick brown fox jumps over the lazy dog. "
  done
  echo -n "$out"
}

bench_prefill() {
  local model="$1"
  local n_reps="$2"
  if [[ ! -f "$MODELS_DIR/$model" ]]; then
    printf "  %-40s %4dx [SKIP]\n" "$model" "$n_reps"
    return
  fi
  local prompt
  prompt=$(make_prompt "$n_reps")
  local prompt_chars=${#prompt}

  local t0 t1 elapsed
  t0=$(now)
  "$QUANT_BIN" "$MODELS_DIR/$model" -p "$prompt" -n 1 -T 0 > /dev/null 2>&1
  t1=$(now)
  elapsed=$(echo "$t1 - $t0" | bc -l)
  # Approx token count: ~5 chars per token for English
  local approx_toks=$(( prompt_chars / 5 ))
  local rate=$(echo "scale=1; $approx_toks / $elapsed" | bc -l)
  printf "  %-40s %4dx %6.1fs (~%d tok) pp_tps≈%s\n" \
    "$model" "$n_reps" "$elapsed" "$approx_toks" "$rate"
}

echo "=== Prefill throughput (TQ_NO_METAL=1) ==="
echo "Note: pp_tps is approximate (chars/5). Compare to llama-bench -p N -n 0."
echo ""

export TQ_NO_METAL=1

# Two prompt sizes per model: small (~90 tok) and medium (~450 tok).
# The 1000+ token sweep takes 10+ minutes per model — uncomment to run.
for model in \
  Llama-3.2-1B-Instruct-Q8_0.gguf \
  Llama-3.2-3B-Instruct-Q8_0.gguf \
  Phi-3.5-mini-instruct-Q4_K_M.gguf \
  Qwen3.5-4B-Q4_K_M.gguf; do
  bench_prefill "$model" 10    # ~90 tokens
  bench_prefill "$model" 50    # ~450 tokens
  # bench_prefill "$model" 120  # ~1080 tokens — very slow today
done
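The chars/5 token estimate the script relies on lines up with its own prompt construction; a quick sanity check using only the script's constants:

```python
# The script repeats this exact 9-word phrase (trailing space included)
# and estimates tokens as characters / 5.
phrase = "The quick brown fox jumps over the lazy dog. "

chars_per_rep = len(phrase)          # 45 characters per repetition
est_tokens_per_rep = chars_per_rep / 5   # 9.0, i.e. ~1 token per word
reps_for_450 = 450 / est_tokens_per_rep  # 50 repetitions gives the "pp~450" runs

print(chars_per_rep, est_tokens_per_rep, reps_for_450)
```

This is why the 50-repetition runs in the script correspond to the "~450 prompt tokens" quoted in the report.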
