
Commit 6ea3215

unamedkr and claude committed
feat(default): switch to Qwen3-4B — 2.4x faster + best quality

Qwen3-4B Q4_K_M replaces Phi-3.5-mini as the recommended default.

## Why

Measured on Apple M3 (CPU, kv_compress=0):

| Model | tok/s | MMLU | File |
|---|---:|---:|---:|
| **Qwen3-4B Q4_K_M** | **4.5** | **73** | 2.5 GB |
| Phi-3.5-mini Q8_0 | 1.9 | ~65 | 3.8 GB |

Qwen3 is 2.4x faster AND has higher quality AND a smaller download.

The speed advantage: Qwen3's separate Q/K/V tensors enable load-time Q4
conversion → NEON Q4×Q8 fused dot (the fast path). Phi-3.5's fused QKV
blocks this optimization and falls through to the slower GGUF on-the-fly
dequant.

## Changes

- `_MODEL_REGISTRY`: Qwen3-4B as first entry (default)
- `MODEL_ALIASES`: `qwen3`, `qwen3:4b`, `qwen` → Qwen3-4B
- `cmd_chat_default`: Qwen3-4B
- Module docstring + `from_pretrained` example: Qwen3-4B
- README.md + README.ko.md quickstart: `quantcpp run qwen3`
- 35/35 unit tests pass

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 91606ce commit 6ea3215
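The fused-dot claim in the commit message can be illustrated with a toy model of the Q4×Q8 path: weights quantized to 4-bit blocks with a per-block scale, activations to 8-bit blocks, and the dot product done in integer arithmetic with one scale multiply per block. This NumPy sketch shows the arithmetic only — block size 32 matches GGUF's Q4 layout, but the function names and quantization details are illustrative, not quantcpp's actual kernels (which run NEON intrinsics):

```python
import numpy as np

BLOCK = 32  # GGUF Q4 blocks hold 32 weights per scale

def quantize_q4(w):
    """Per-block symmetric 4-bit quantization: ints in [-8, 7] plus one float scale."""
    w = w.reshape(-1, BLOCK)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int32)
    return q, scale

def quantize_q8(x):
    """Per-block symmetric 8-bit quantization: ints in [-127, 127] plus one float scale."""
    x = x.reshape(-1, BLOCK)
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int32)
    return q, scale

def fused_dot(qw, sw, qx, sx):
    """Integer dot per block, then one scale multiply per block —
    no dequantized float weights are ever materialized."""
    int_dots = (qw * qx).sum(axis=1, keepdims=True)  # pure integer math
    return float((int_dots * sw * sx).sum())

rng = np.random.default_rng(0)
w = rng.standard_normal(4096).astype(np.float32)
x = rng.standard_normal(4096).astype(np.float32)

qw, sw = quantize_q4(w)
qx, sx = quantize_q8(x)
approx = fused_dot(qw, sw, qx, sx)
exact = float(w @ x)  # the quantized result tracks this closely
```

On NEON the inner integer dot maps onto int8 multiply-accumulate instructions, which is why a tensor layout that can be converted to Q4 blocks at load time (separate Q/K/V) hits the fast path while a fused QKV tensor does not.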

4 files changed, +48 -47 lines changed


README.ko.md

Lines changed: 6 additions & 6 deletions
@@ -28,26 +28,26 @@
 ```bash
 pip install quantcpp

-quantcpp pull phi-3.5-mini           # download from HuggingFace (~2.4 GB)
-quantcpp run phi-3.5-mini            # interactive chat
-quantcpp serve phi-3.5-mini -p 8080  # OpenAI-compatible HTTP server (SSE streaming)
+quantcpp pull qwen3                  # download Qwen3-4B Q4_K_M (~2.5 GB)
+quantcpp run qwen3                   # interactive chat
+quantcpp serve qwen3 -p 8080         # OpenAI-compatible HTTP server (SSE streaming)
 quantcpp client "Hi"                 # streaming client → server on :8080
 quantcpp list                        # list cached models
 ```

-Recommended default: **Phi-3.5-mini** (3.8B params, vocab 32K). It has the smallest vocab (32K) of any model in the registry, so the per-token `lm_head` matmul is the fastest — the best speed/quality combination on a laptop. Other aliases: `smollm2`, `smollm2:135m`, `llama3.2:1b`, `qwen3.5:0.8b`. Auto-downloads on first `run`/`serve`.
+Recommended default: **Qwen3-4B** (4B params, MMLU 73, 4.5 tok/s on M3). Best quality AND best speed — 2.4x faster than Phi-3.5-mini via the Q4 NEON fused dot path. Other aliases: `phi3.5`, `smollm2`, `llama3.2:1b`. Auto-downloads on first `run`/`serve`.

 `serve` exposes an OpenAI-compatible `POST /v1/chat/completions` endpoint on port 8080 — SSE per-token streaming when the client sends `"stream": true`, a single JSON response when it is omitted. The built-in `quantcpp client` supports both modes (default: streaming, `--no-stream`: single response).

 **One-shot question:**
 ```bash
-quantcpp run phi-3.5-mini "What is gravity?"
+quantcpp run qwen3 "What is gravity?"
 ```

 **Python API (3 lines):**
 ```python
 from quantcpp import Model
-m = Model.from_pretrained("Phi-3.5-mini")
+m = Model.from_pretrained("Qwen3-4B")
 print(m.ask("What is gravity?"))
 ```

README.md

Lines changed: 6 additions & 6 deletions
@@ -41,26 +41,26 @@
 ```bash
 pip install quantcpp

-quantcpp pull phi-3.5-mini           # download from HuggingFace (~2.4 GB)
-quantcpp run phi-3.5-mini            # interactive chat
-quantcpp serve phi-3.5-mini -p 8080  # OpenAI-compatible HTTP server (SSE streaming)
+quantcpp pull qwen3                  # download Qwen3-4B Q4_K_M (~2.5 GB)
+quantcpp run qwen3                   # interactive chat
+quantcpp serve qwen3 -p 8080         # OpenAI-compatible HTTP server (SSE streaming)
 quantcpp client "Hi"                 # streaming client → server on :8080
 quantcpp list                        # show cached models
 ```

-Recommended default: **Phi-3.5-mini** (3.8B params, vocab 32K). The 32K vocab is the smallest in the registry, which makes the per-token `lm_head` matmul the fastest of any model we ship — Phi-3.5-mini is the best speed/quality combo on a laptop. Other aliases: `smollm2`, `smollm2:135m`, `llama3.2:1b`, `qwen3.5:0.8b`. Auto-pulls on first `run` / `serve`.
+Recommended default: **Qwen3-4B** (4B params, MMLU 73, 4.5 tok/s on M3). Best speed AND quality — the Q4 NEON fused dot path makes it 2.4x faster than Phi-3.5-mini despite a larger vocab. Other aliases: `phi3.5`, `smollm2`, `llama3.2:1b`. Auto-pulls on first `run` / `serve`.

 The `serve` subcommand exposes `POST /v1/chat/completions` (OpenAI-compatible) on port 8080 — clients pass `"stream": true` for SSE streaming, or omit it for a single JSON response. Built-in `quantcpp client` supports both modes (default: streaming, `--no-stream` for single response).

 **One-shot question:**
 ```bash
-quantcpp run phi-3.5-mini "What is gravity?"
+quantcpp run qwen3 "What is gravity?"
 ```

 **Python API (3 lines):**
 ```python
 from quantcpp import Model
-m = Model.from_pretrained("Phi-3.5-mini")
+m = Model.from_pretrained("Qwen3-4B")
 print(m.ask("What is gravity?"))
 ```
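The `serve`/`client` behavior the README describes (OpenAI-compatible payload, SSE when `"stream": true`) can be sketched as a minimal client. The payload shape follows the standard OpenAI chat-completions wire format and the usual `data: {json}` / `data: [DONE]` SSE framing; the exact delta fields quantcpp emits are assumed here, not taken from its source:

```python
import json

def build_request(prompt, stream=True):
    """Chat-completions request body in the standard OpenAI wire format."""
    return {
        "model": "Qwen3-4B",
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

def collect_sse_tokens(lines):
    """Accumulate content deltas from SSE lines until the [DONE] sentinel."""
    out = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip blank separators / keep-alives
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        out.append(chunk["choices"][0]["delta"].get("content", ""))
    return "".join(out)

# A stream as the server might emit it (hypothetical content):
sse = [
    'data: {"choices": [{"delta": {"content": "Gravity"}}]}',
    '',
    'data: {"choices": [{"delta": {"content": " bends spacetime."}}]}',
    '',
    'data: [DONE]',
]
text = collect_sse_tokens(sse)  # → "Gravity bends spacetime."
```

Omitting `"stream"` (or sending `false`) would instead return one JSON body with the full message, matching the `--no-stream` mode of the built-in client.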

bindings/python/quantcpp/__init__.py

Lines changed: 32 additions & 31 deletions
@@ -4,23 +4,17 @@
 Quick start:

     from quantcpp import Model
-    m = Model.from_pretrained("Phi-3.5-mini")
+    m = Model.from_pretrained("Qwen3-4B")
     print(m.ask("What is gravity?"))

 Model selection guide:
-    Phi-3.5-mini (3.8 GB, vocab 32K) — DEFAULT. 3.8B params, Q8_0.
-        2x faster than Q4_K_M on NEON (3.0 vs 1.5 tok/s on M3).
-        Best speed/quality combo.
-    SmolLM2-1.7B (1.7 GB, vocab 49K) — lightweight all-rounder. ~12 tok/s
-        on Apple M3, smaller download.
-    Llama-3.2-1B (750 MB, vocab 128K) — smallest download but slower
-        due to large vocab (~2 tok/s on M3).
-    SmolLM2-135M (138 MB, vocab 49K) — demo only, low quality output.
-
-    Larger vocab = slower lm_head matmul → smaller params with smaller vocab
-    often beats larger params with larger vocab. See docs/supported_models.md
-    for the architecture support matrix.
+    Qwen3-4B (2.5 GB, vocab 152K) — DEFAULT. 4.5 tok/s on M3.
+        Best quality (MMLU 73) AND fastest (Q4 NEON fused dot).
+    Phi-3.5-mini (3.8 GB, vocab 32K) — 1.9 tok/s. Good quality.
+    SmolLM2-1.7B (1.7 GB, vocab 49K) — lightweight, ~12 tok/s.
+    Llama-3.2-1B (750 MB, vocab 128K) — smallest download.
+    SmolLM2-135M (138 MB, vocab 49K) — demo only.
 """
@@ -71,30 +65,39 @@ class ChatContextOverflow(RuntimeError):
 # adding new entries — there is no integrity check at runtime.
 _MODEL_REGISTRY = {
     # ── DEFAULT ──
-    # Phi-3.5-mini-instruct Q8_0. Switched from Q4_K_M on 2026-04-12
-    # after benchmarking: Q8_0 is 2x faster on Apple Silicon NEON
-    # (3.0 vs 1.5 tok/s on M3). Q4_K_M's complex super-block dequant
-    # dominates compute at batch-1; Q8_0's simple int8 dequant is
-    # NEON-friendly. Both produce identical quality. The larger download
-    # (3.8 GB vs 2.2 GB) is a one-time cost.
+    # Qwen3-4B Q4_K_M: best speed + quality combo (2026-04-13).
+    #
+    # Measured on Apple M3 (CPU, kv_compress=0):
+    #   Qwen3-4B Q4_K_M:   4.5 tok/s (Q4 converted, NEON fused dot)
+    #   Phi-3.5-mini Q8_0: 1.9 tok/s (GGUF on-the-fly, no Q4 path)
+    #
+    # Qwen3 is 2.4x faster AND has higher benchmark scores (MMLU 73
+    # vs ~65). The speed advantage comes from separate Q/K/V tensors
+    # that enable load-time Q4 conversion → NEON Q4×Q8 fused dot.
+    # Phi-3.5's fused QKV blocks this optimization.
+    #
+    # vocab 152K is larger than Phi-3.5's 32K, but the Q4 matmul
+    # speed more than compensates for the bigger lm_head.
+    "Qwen3-4B": (
+        "bartowski/Qwen_Qwen3-4B-GGUF",
+        "Qwen_Qwen3-4B-Q4_K_M.gguf",
+        2500,
+    ),
+    # Previous default. Still a good option for users who want the
+    # smallest lm_head (vocab 32K). Slower than Qwen3 because the
+    # fused QKV tensor blocks Q4 conversion.
     "Phi-3.5-mini": (
         "bartowski/Phi-3.5-mini-instruct-GGUF",
         "Phi-3.5-mini-instruct-Q8_0.gguf",
         3800,
     ),
-    # Lightweight all-rounder for users who want a smaller download
-    # than Phi-3.5-mini. vocab 49K keeps the lm_head matmul small, so
-    # on a mid-range M-series chip we measure ~12 tok/s — comfortable
-    # for interactive chat. Same llama arch family as SmolLM2-135M.
+    # Lightweight all-rounder. vocab 49K, ~12 tok/s on M3.
     "SmolLM2-1.7B": (
        "bartowski/SmolLM2-1.7B-Instruct-GGUF",
        "SmolLM2-1.7B-Instruct-Q8_0.gguf",
        1700,
    ),
-    # Smallest download in the "actually usable" tier. Slower at
-    # inference time because of the 128K Llama-3 vocab (~5x slower
-    # lm_head matmul on M3). Kept in the registry for users who
-    # specifically want a Llama model.
+    # Smallest usable download. 128K vocab → slower lm_head.
     "Llama-3.2-1B": (
         "hugging-quants/Llama-3.2-1B-Instruct-Q4_K_M-GGUF",
         "llama-3.2-1b-instruct-q4_k_m.gguf",
@@ -105,9 +108,7 @@ class ChatContextOverflow(RuntimeError):
         "Qwen3.5-0.8B-Q4_K_M.gguf",
         508,
     ),
-    # 138 MB demo model. Tokenizer + arch are llama-compatible but the
-    # model is too small to produce coherent output for general chat.
-    # Listed only so users can verify the install/load path quickly.
+    # 138 MB demo model. Too small for real use.
     "SmolLM2-135M": (
         "Felladrin/gguf-Q8_0-SmolLM2-135M-Instruct",
         "smollm2-135m-instruct-q8_0.gguf",
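The vocab trade-off noted in the docstring is easy to quantify: the per-token `lm_head` cost is one `hidden × vocab` matmul. A back-of-envelope, assuming the published hidden sizes for these models (2560 for Qwen3-4B, 3072 for Phi-3.5-mini — neither figure appears in this commit):

```python
def lm_head_macs(vocab, hidden):
    """Multiply-accumulates for one token's logits: a [hidden] x [hidden, vocab] matmul."""
    return vocab * hidden

# Vocab sizes rounded as in the docstring; hidden sizes are assumptions.
qwen3 = lm_head_macs(152_000, 2560)  # ≈ 389M MACs per token
phi35 = lm_head_macs(32_000, 3072)   # ≈ 98M MACs per token
ratio = qwen3 / phi35                # Qwen3's lm_head does ~4x more work
```

So Qwen3 pays roughly 4x more `lm_head` arithmetic per token, yet still lands at 2.4x faster overall — which is the point the registry comment makes about the Q4 matmul path more than compensating.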

bindings/python/quantcpp/cli.py

Lines changed: 4 additions & 4 deletions
@@ -19,10 +19,10 @@


 # Ollama-style short aliases → canonical _MODEL_REGISTRY keys.
-# Plain "smollm2" without a size suffix points at the 1.7B model — that's
-# the recommended default. Users who explicitly want the 135M demo model
-# need to ask for it by full name.
 MODEL_ALIASES = {
+    "qwen3": "Qwen3-4B",
+    "qwen3:4b": "Qwen3-4B",
+    "qwen": "Qwen3-4B",
     "smollm2": "SmolLM2-1.7B",
     "smollm2:1.7b": "SmolLM2-1.7B",
     "smollm2:135m": "SmolLM2-135M",
@@ -428,7 +428,7 @@ def cmd_chat_default(args):
     the registry (32K) AND 3.8B params, giving the
     best speed/quality combo we ship.
     """
-    args.model = args.model or "Phi-3.5-mini"
+    args.model = args.model or "Qwen3-4B"
     args.threads = getattr(args, "threads", 4)
     args.max_tokens = getattr(args, "max_tokens", 256)
     args.temperature = getattr(args, "temperature", 0.7)
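The alias layer added above can be sketched end to end: `MODEL_ALIASES` maps an Ollama-style short name to a canonical `_MODEL_REGISTRY` key, whose value is a `(repo, filename, size_mb)` tuple. The dict contents below mirror the diff; the `resolve()` helper is illustrative, not quantcpp's actual lookup code:

```python
MODEL_ALIASES = {
    "qwen3": "Qwen3-4B",
    "qwen3:4b": "Qwen3-4B",
    "qwen": "Qwen3-4B",
    "smollm2": "SmolLM2-1.7B",
}

_MODEL_REGISTRY = {
    "Qwen3-4B": (
        "bartowski/Qwen_Qwen3-4B-GGUF",
        "Qwen_Qwen3-4B-Q4_K_M.gguf",
        2500,
    ),
    "SmolLM2-1.7B": (
        "bartowski/SmolLM2-1.7B-Instruct-GGUF",
        "SmolLM2-1.7B-Instruct-Q8_0.gguf",
        1700,
    ),
}

def resolve(name):
    """Map an alias (or a canonical key, passed through) to its registry entry."""
    canonical = MODEL_ALIASES.get(name.lower(), name)
    if canonical not in _MODEL_REGISTRY:
        raise KeyError(f"unknown model: {name!r}")
    return canonical, _MODEL_REGISTRY[canonical]
```

With this shape, `quantcpp run qwen3` and `quantcpp run Qwen3-4B` resolve to the same entry, which is what lets the README quickstart use the short name.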

0 commit comments
