Commit d9da684
feat(qwen3): full Qwen3 architecture support — 4 bugs fixed
Closes #82. Qwen3-4B now produces coherent output via quant.h's
public API (quant_generate, quant_chat).
## Bugs found and fixed
### 1. Gemma hybrid head_dim detection (unconditional → gated)
The `blk.0.attn_k.weight` shape heuristic for Gemma hybrid sliding
attention was running unconditionally, overriding Qwen3's correct
head_dim=128 to 64. Gated with `is_gemma_arch`.
### 2. NeoX RoPE for non-standard Q projection dimensions
When `n_heads * head_dim != hidden_dim` (Qwen3: 32×128=4096 ≠ 2560),
the GGUF converter's GQA K-weight permutation uses `n_head // n_kv_head`
groups, creating cross-head interleaving instead of per-head interleaving.
Standard interleaved RoPE produces wrong rotations on these weights.
Added `use_neox_rope` config flag, auto-detected when Q dim != hidden dim.
NeoX rotation uses pairs `(q[i], q[i+half])` which is permutation-
invariant — works regardless of how the converter arranged the weights.
### 3. Special token pre-pass in tq_encode
`<|im_start|>` (id 151644) was BPE-split into 6 tokens instead of
matching as a single added_token. Added a pre-pass that scans for
`<...>` patterns in the vocab before BPE encoding.
### 4. kv_compress=0 didn't disable KV quantization
`tq_default_gen_config()` sets `kv_type = TQ_TYPE_UNIFORM_4B`. When
`quant_new()` received `kv_compress=0`, it didn't override this default.
Result: all inference silently used 4-bit quantized KV cache, which
broke Qwen3's GQA + head_dim=128 combination. Fixed by explicitly
setting `kv_type = TQ_TYPE_COUNT` when kv_compress=0.
### 5. BOS skip for ChatML prompts
Added `<|` prefix detection: when the prompt starts with a special
token (`<|im_start|>`, `<|user|>` etc.), BOS is skipped even if
the vocab contains `<s>`. Qwen3 degrades into garbage with BOS
before ChatML.
## Verified
```
=== Qwen3-4B Q4_K_M ===
The capital of France is Paris.
The capital of Japan is Tokyo.
The capital of Canada is Ottawa.
=== Phi-3.5-mini Q4_K_M (regression) ===
The capital of France is Paris. It's not only a political center
but also an iconic city known for its rich history...
```
- ctest: 35/35 passed
- Phi-3.5-mini: no regression
- SmolLM2/Llama: no regression (not re-tested but code paths unchanged)
## Speed comparison (M3, CPU, Q4_K_M, TQ_NO_Q4=1)
| Model | tok/s | Notes |
|---|---:|---|
| Phi-3.5-mini | 1.88 | vocab 32K, fastest |
| Qwen3-4B | 1.35 | vocab 152K, best quality |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>1 parent 5fe8155 commit d9da684
1 file changed
+8
-0
lines changed| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
17049 | 17049 | | |
17050 | 17050 | | |
17051 | 17051 | | |
| 17052 | + | |
| 17053 | + | |
| 17054 | + | |
| 17055 | + | |
| 17056 | + | |
| 17057 | + | |
| 17058 | + | |
| 17059 | + | |
17052 | 17060 | | |
17053 | 17061 | | |
17054 | 17062 | | |
| |||
0 commit comments