
Commit 6ea3215

unamedkr and claude committed
feat(default): switch to Qwen3-4B — 2.4x faster + best quality

Qwen3-4B Q4_K_M replaces Phi-3.5-mini as the recommended default.

## Why

Measured on Apple M3 (CPU, kv_compress=0):

| Model | tok/s | MMLU | File |
|---|---:|---:|---:|
| **Qwen3-4B Q4_K_M** | **4.5** | **73** | 2.5 GB |
| Phi-3.5-mini Q8_0 | 1.9 | ~65 | 3.8 GB |

Qwen3 is 2.4x faster AND has higher quality AND a smaller download.

The speed advantage: Qwen3's separate Q/K/V tensors enable load-time Q4
conversion → NEON Q4×Q8 fused dot (the fast path). Phi-3.5's fused QKV
blocks this optimization and falls through to the slower GGUF on-the-fly
dequant.

## Changes

- `_MODEL_REGISTRY`: Qwen3-4B as first entry (default)
- `MODEL_ALIASES`: `qwen3`, `qwen3:4b`, `qwen` → Qwen3-4B
- `cmd_chat_default`: Qwen3-4B
- Module docstring + `from_pretrained` example: Qwen3-4B
- README.md + README.ko.md quickstart: `quantcpp run qwen3`
- 35/35 unit tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 91606ce commit 6ea3215
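The fused-dot claim in the commit message can be illustrated with a toy model of the Q4×Q8 path: weights quantized to 4-bit blocks with a per-block scale, activations to 8-bit blocks, and the dot product done in integer arithmetic with one scale multiply per block. This NumPy sketch shows the arithmetic only — block size 32 matches GGUF's Q4 layout, but the function names and quantization details are illustrative, not quantcpp's actual kernels (which run NEON intrinsics):

```python
import numpy as np

BLOCK = 32  # GGUF Q4 blocks hold 32 weights per scale

def quantize_q4(w):
    """Per-block symmetric 4-bit quantization: ints in [-8, 7] plus one float scale."""
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int32)
    return q, scale

def quantize_q8(x):
    """Per-block symmetric 8-bit quantization: ints in [-127, 127] plus one float scale."""
    x = x.reshape(-1, BLOCK)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int32)
    return q, scale

def fused_dot(qw, sw, qx, sx):
    """Integer dot per block, then one scale multiply per block —
    no dequantized float weights are ever materialized."""
    int_dots = (qw * qx).sum(axis=1, keepdims=True)  # pure integer math
    return float((int_dots * sw * sx).sum())

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
x = rng.standard_normal(4096).astype(np.float32)

qw, sw = quantize_q4(w)
qx, sx = quantize_q8(x)
approx = fused_dot(qw, sw, qx, sx)
exact = float(w @ x)  # the quantized result tracks this closely
```

On NEON the inner integer dot maps onto int8 multiply-accumulate instructions, which is why a tensor layout that can be converted to Q4 blocks at load time (separate Q/K/V) hits the fast path while a fused QKV tensor does not.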

4 files changed, +48 -47 lines changed


README.ko.md

Lines changed: 6 additions & 6 deletions
@@ -28,26 +28,26 @@
 ```bash
 pip install quantcpp

-quantcpp pull phi-3.5-mini           # download from HuggingFace (~2.4 GB)
-quantcpp run phi-3.5-mini            # interactive chat
-quantcpp serve phi-3.5-mini -p 8080  # OpenAI-compatible HTTP server (SSE streaming)
+quantcpp pull qwen3                  # download Qwen3-4B Q4_K_M (~2.5 GB)
+quantcpp run qwen3                   # interactive chat
+quantcpp serve qwen3 -p 8080         # OpenAI-compatible HTTP server (SSE streaming)
 quantcpp client "Hi"                 # streaming client → server on :8080
 quantcpp list                        # list cached models
 ```

-Recommended default: **Phi-3.5-mini** (3.8B params, vocab 32K). It has the smallest vocab (32K) of any model in the registry, so the per-token `lm_head` matmul is the fastest — the best speed/quality combination on a laptop. Other aliases: `smollm2`, `smollm2:135m`, `llama3.2:1b`, `qwen3.5:0.8b`. Auto-downloads on first `run`/`serve`.
+Recommended default: **Qwen3-4B** (4B params, MMLU 73, 4.5 tok/s on M3). Best quality AND best speed — 2.4x faster than Phi-3.5-mini via the Q4 NEON fused dot path. Other aliases: `phi3.5`, `smollm2`, `llama3.2:1b`. Auto-downloads on first `run`/`serve`.

 `serve` exposes an OpenAI-compatible `POST /v1/chat/completions` endpoint on port 8080 — SSE per-token streaming when the client sends `"stream": true`, a single JSON response when it is omitted. The built-in `quantcpp client` supports both modes (default: streaming, `--no-stream`: single response).

 **One-shot question:**
 ```bash
-quantcpp run phi-3.5-mini "What is gravity?"
+quantcpp run qwen3 "What is gravity?"
 ```

 **Python API (3 lines):**
 ```python
 from quantcpp import Model
-m = Model.from_pretrained("Phi-3.5-mini")
+m = Model.from_pretrained("Qwen3-4B")
 print(m.ask("What is gravity?"))
 ```

README.md

Lines changed: 6 additions & 6 deletions
@@ -41,26 +41,26 @@
 ```bash
 pip install quantcpp

-quantcpp pull phi-3.5-mini           # download from HuggingFace (~2.4 GB)
-quantcpp run phi-3.5-mini            # interactive chat
-quantcpp serve phi-3.5-mini -p 8080  # OpenAI-compatible HTTP server (SSE streaming)
+quantcpp pull qwen3                  # download Qwen3-4B Q4_K_M (~2.5 GB)
+quantcpp run qwen3                   # interactive chat
+quantcpp serve qwen3 -p 8080         # OpenAI-compatible HTTP server (SSE streaming)
 quantcpp client "Hi"                 # streaming client → server on :8080
 quantcpp list                        # show cached models
 ```

-Recommended default: **Phi-3.5-mini** (3.8B params, vocab 32K). The 32K vocab is the smallest in the registry, which makes the per-token `lm_head` matmul the fastest of any model we ship — Phi-3.5-mini is the best speed/quality combo on a laptop. Other aliases: `smollm2`, `smollm2:135m`, `llama3.2:1b`, `qwen3.5:0.8b`. Auto-pulls on first `run` / `serve`.
+Recommended default: **Qwen3-4B** (4B params, MMLU 73, 4.5 tok/s on M3). Best speed AND quality — the Q4 NEON fused dot path makes it 2.4x faster than Phi-3.5-mini despite a larger vocab. Other aliases: `phi3.5`, `smollm2`, `llama3.2:1b`. Auto-pulls on first `run` / `serve`.

 The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080 — clients pass `"stream": true` for SSE streaming, or omit it for a single JSON response. Built-in `quantcpp client` supports both modes (default: streaming, `--no-stream` for single response).

 **One-shot question:**
 ```bash
-quantcpp run phi-3.5-mini "What is gravity?"
+quantcpp run qwen3 "What is gravity?"
 ```

 **Python API (3 lines):**
 ```python
 from quantcpp import Model
-m = Model.from_pretrained("Phi-3.5-mini")
+m = Model.from_pretrained("Qwen3-4B")
 print(m.ask("What is gravity?"))
 ```
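The `serve`/`client` behavior the README describes (OpenAI-compatible payload, SSE when `"stream": true`) can be sketched as a minimal client. The payload shape follows the standard OpenAI chat-completions wire format and the usual `data: {json}` / `data: [DONE]` SSE framing; the exact delta fields quantcpp emits are assumed here, not taken from its source:

```python
import json

def build_request(prompt, stream=True):
    """Chat-completions request body in the standard OpenAI wire format."""
    return {
        "model": "Qwen3-4B",
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def collect_sse_tokens(lines):
    """Accumulate content deltas from SSE lines until the [DONE] sentinel."""
    out = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separators / keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        out.append(chunk["choices"][0]["delta"].get("content", ""))
    return "".join(out)

# A stream as the server might emit it (hypothetical content):
sse = [
    'data: {"choices": [{"delta": {"content": "Gravity"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": " bends spacetime."}}]}',
    '',
    'data: [DONE]',
]
text = collect_sse_tokens(sse)  # → "Gravity bends spacetime."
```

Omitting `"stream"` (or sending `false`) would instead return one JSON body with the full message, matching the `--no-stream` mode of the built-in client.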

bindings/python/quantcpp/__init__.py

Lines changed: 32 additions & 31 deletions
@@ -4,23 +4,17 @@
 Quick start:

     from quantcpp import Model
-    m = Model.from_pretrained("Phi-3.5-mini")
+    m = Model.from_pretrained("Qwen3-4B")
     print(m.ask("What is gravity?"))

 Model selection guide:
-    Phi-3.5-mini (3.8 GB, vocab 32K) — DEFAULT. 3.8B params, Q8_0.
-        2x faster than Q4_K_M on NEON (3.0 vs 1.5 tok/s on M3).
-        Best speed/quality combo.
-    SmolLM2-1.7B (1.7 GB, vocab 49K) — lightweight all-rounder. ~12 tok/s
-        on Apple M3, smaller download.
-    Llama-3.2-1B (750 MB, vocab 128K) — smallest download but slower
-        due to large vocab (~2 tok/s on M3).
-    SmolLM2-135M (138 MB, vocab 49K) — demo only, low quality output.
-
-    Larger vocab = slower lm_head matmul → smaller params with smaller vocab
-    often beats larger params with larger vocab. See docs/supported_models.md
-    for the architecture support matrix.
+    Qwen3-4B (2.5 GB, vocab 152K) — DEFAULT. 4.5 tok/s on M3.
+        Best quality (MMLU 73) AND fastest (Q4 NEON fused dot).
+    Phi-3.5-mini (3.8 GB, vocab 32K) — 1.9 tok/s. Good quality.
+    SmolLM2-1.7B (1.7 GB, vocab 49K) — lightweight, ~12 tok/s.
+    Llama-3.2-1B (750 MB, vocab 128K) — smallest download.
+    SmolLM2-135M (138 MB, vocab 49K) — demo only.
 """
@@ -71,30 +65,39 @@ class ChatContextOverflow(RuntimeError):
 # adding new entries — there is no integrity check at runtime.
 _MODEL_REGISTRY = {
     # ── DEFAULT ──
-    # Phi-3.5-mini-instruct Q8_0. Switched from Q4_K_M on 2026-04-12
-    # after benchmarking: Q8_0 is 2x faster on Apple Silicon NEON
-    # (3.0 vs 1.5 tok/s on M3). Q4_K_M's complex super-block dequant
-    # dominates compute at batch-1; Q8_0's simple int8 dequant is
-    # NEON-friendly. Both produce identical quality. The larger download
-    # (3.8 GB vs 2.2 GB) is a one-time cost.
+    # Qwen3-4B Q4_K_M: best speed + quality combo (2026-04-13).
+    #
+    # Measured on Apple M3 (CPU, kv_compress=0):
+    #   Qwen3-4B Q4_K_M:   4.5 tok/s (Q4 converted, NEON fused dot)
+    #   Phi-3.5-mini Q8_0: 1.9 tok/s (GGUF on-the-fly, no Q4 path)
+    #
+    # Qwen3 is 2.4x faster AND has higher benchmark scores (MMLU 73
+    # vs ~65). The speed advantage comes from separate Q/K/V tensors
+    # that enable load-time Q4 conversion → NEON Q4×Q8 fused dot.
+    # Phi-3.5's fused QKV blocks this optimization.
+    #
+    # vocab 152K is larger than Phi-3.5's 32K, but the Q4 matmul
+    # speed more than compensates for the bigger lm_head.
+    "Qwen3-4B": (
+        "bartowski/Qwen_Qwen3-4B-GGUF",
+        "Qwen_Qwen3-4B-Q4_K_M.gguf",
+        2500,
+    ),
+    # Previous default. Still a good option for users who want the
+    # smallest lm_head (vocab 32K). Slower than Qwen3 because the
+    # fused QKV tensor blocks Q4 conversion.
     "Phi-3.5-mini": (
         "bartowski/Phi-3.5-mini-instruct-GGUF",
         "Phi-3.5-mini-instruct-Q8_0.gguf",
         3800,
     ),
-    # Lightweight all-rounder for users who want a smaller download
-    # than Phi-3.5-mini. vocab 49K keeps the lm_head matmul small, so
-    # on a mid-range M-series chip we measure ~12 tok/s — comfortable
-    # for interactive chat. Same llama arch family as SmolLM2-135M.
+    # Lightweight all-rounder. vocab 49K, ~12 tok/s on M3.
     "SmolLM2-1.7B": (
        "bartowski/SmolLM2-1.7B-Instruct-GGUF",
        "SmolLM2-1.7B-Instruct-Q8_0.gguf",
        1700,
    ),
-    # Smallest download in the "actually usable" tier. Slower at
-    # inference time because of the 128K Llama-3 vocab (~5x slower
-    # lm_head matmul on M3). Kept in the registry for users who
-    # specifically want a Llama model.
+    # Smallest usable download. 128K vocab → slower lm_head.
     "Llama-3.2-1B": (
         "hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF",
         "llama-3.2-1b-instruct-q4_k_m.gguf",
@@ -105,9 +108,7 @@ class ChatContextOverflow(RuntimeError):
         "Qwen3.5-0.8B-Q4_K_M.gguf",
         508,
     ),
-    # 138 MB demo model. Tokenizer + arch are llama-compatible but the
-    # model is too small to produce coherent output for general chat.
-    # Listed only so users can verify the install/load path quickly.
+    # 138 MB demo model. Too small for real use.
     "SmolLM2-135M": (
         "Felladrin/gguf-Q8_0-SmolLM2-135M-Instruct",
         "smollm2-135m-instruct-q8_0.gguf",
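The vocab trade-off noted in the docstring is easy to quantify: the per-token `lm_head` cost is one `hidden × vocab` matmul. A back-of-envelope, assuming the published hidden sizes for these models (2560 for Qwen3-4B, 3072 for Phi-3.5-mini — neither figure appears in this commit):

```python
def lm_head_macs(vocab, hidden):
    """Multiply-accumulates for one token's logits: a [hidden] x [hidden, vocab] matmul."""
    return vocab * hidden

# Vocab sizes rounded as in the docstring; hidden sizes are assumptions.
qwen3 = lm_head_macs(152_000, 2560)  # ≈ 389M MACs per token
phi35 = lm_head_macs(32_000, 3072)   # ≈ 98M MACs per token
ratio = qwen3 / phi35                # Qwen3's lm_head does ~4x more work
```

So Qwen3 pays roughly 4x more `lm_head` arithmetic per token, yet still lands at 2.4x faster overall — which is the point the registry comment makes about the Q4 matmul path more than compensating.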

bindings/python/quantcpp/cli.py

Lines changed: 4 additions & 4 deletions
@@ -19,10 +19,10 @@


 # Ollama-style short aliases → canonical _MODEL_REGISTRY keys.
-# Plain "smollm2" without a size suffix points at the 1.7B model — that's
-# the recommended default. Users who explicitly want the 135M demo model
-# need to ask for it by full name.
 MODEL_ALIASES = {
+    "qwen3": "Qwen3-4B",
+    "qwen3:4b": "Qwen3-4B",
+    "qwen": "Qwen3-4B",
     "smollm2": "SmolLM2-1.7B",
     "smollm2:1.7b": "SmolLM2-1.7B",
     "smollm2:135m": "SmolLM2-135M",
@@ -428,7 +428,7 @@ def cmd_chat_default(args):
     the registry (32K) AND 3.8B params, giving the
     best speed/quality combo we ship.
     """
-    args.model = args.model or "Phi-3.5-mini"
+    args.model = args.model or "Qwen3-4B"
     args.threads = getattr(args, "threads", 4)
     args.max_tokens = getattr(args, "max_tokens", 256)
     args.temperature = getattr(args, "temperature", 0.7)
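The alias layer added above can be sketched end to end: `MODEL_ALIASES` maps an Ollama-style short name to a canonical `_MODEL_REGISTRY` key, whose value is a `(repo, filename, size_mb)` tuple. The dict contents below mirror the diff; the `resolve()` helper is illustrative, not quantcpp's actual lookup code:

```python
MODEL_ALIASES = {
    "qwen3": "Qwen3-4B",
    "qwen3:4b": "Qwen3-4B",
    "qwen": "Qwen3-4B",
    "smollm2": "SmolLM2-1.7B",
}

_MODEL_REGISTRY = {
    "Qwen3-4B": (
        "bartowski/Qwen_Qwen3-4B-GGUF",
        "Qwen_Qwen3-4B-Q4_K_M.gguf",
        2500,
    ),
    "SmolLM2-1.7B": (
        "bartowski/SmolLM2-1.7B-Instruct-GGUF",
        "SmolLM2-1.7B-Instruct-Q8_0.gguf",
        1700,
    ),
}

def resolve(name):
    """Map an alias (or a canonical key, passed through) to its registry entry."""
    canonical = MODEL_ALIASES.get(name.lower(), name)
    if canonical not in _MODEL_REGISTRY:
        raise KeyError(f"unknown model: {name!r}")
    return canonical, _MODEL_REGISTRY[canonical]
```

With this shape, `quantcpp run qwen3` and `quantcpp run Qwen3-4B` resolve to the same entry, which is what lets the README quickstart use the short name.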

0 commit comments
