Research: 32K-vocab English-optimized small model for quant.cpp #92

@unamedkr

Summary

Investigate creating or fine-tuning a small (1.7-3.8B) model with a 32K vocabulary optimized for English document QA on quant.cpp. This addresses the fundamental speed/quality tension discovered in our benchmarking.

The Vocab Size Dilemma

Most models released in 2025-2026 have moved to large vocabularies for multilingual support:

| Model | Year | Vocab | Our tok/s (M3, Q8) |
|---|---|---|---|
| Phi-3.5-mini | 2024 | 32K | 6.5 |
| SmolLM2-1.7B | 2024 | 49K | 23 |
| Qwen3-4B | 2025 | 152K | ~2 (est) |
| Phi-4-mini | 2025 | 200K | ~1 (est) |
| Gemma-3-4B | 2025 | 262K | ~0.8 (est) |

The industry trend (bigger vocab) is the opposite of what local CPU inference needs (smaller vocab).

Phi-3.5-mini is the last model in this list with a small, English-focused vocabulary (32K). Its benchmarks date from 2024 and are now outdated.

Options

Option A: Vocabulary pruning

Take Qwen3-4B (best quality) and prune its 152K vocab to ~32K English-only tokens. Re-train the embedding/lm_head layers.

  • Pro: Best underlying model quality
  • Con: Requires GPU training, may degrade quality
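The row-selection step of Option A can be sketched as below. This is a minimal illustration and not an actual Qwen3 pruning procedure: `prune_vocab`, the toy matrix, and the kept ids are all hypothetical, and choosing which ~32K tokens to keep (and re-training afterwards) is left out. The same selection would presumably apply to the lm_head rows.

```python
def prune_vocab(embedding, keep_ids):
    """Keep only the rows of `embedding` listed in `keep_ids`.

    Returns the pruned matrix and a dict mapping old token id -> new token id.
    (Hypothetical helper; real models store these rows as tensors.)
    """
    ordered = sorted(set(keep_ids))          # stable, duplicate-free order
    id_map = {old: new for new, old in enumerate(ordered)}
    pruned = [list(embedding[old]) for old in ordered]
    return pruned, id_map


# Toy example: 6-token vocab, hidden size 2; keep tokens 0, 2 and 5.
embedding = [[float(i), float(i) + 0.5] for i in range(6)]
pruned, id_map = prune_vocab(embedding, [5, 0, 2])
print(len(pruned), id_map)
```

The tokenizer would also need to be rebuilt so that it only emits the kept ids, remapped through `id_map`.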

Option B: Knowledge distillation

Distill Qwen3-4B's knowledge into a Phi-3.5-architecture student with 32K vocab.

  • Pro: Purpose-built architecture
  • Con: Significant training effort
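For reference, a common distillation objective is the temperature-softened KL divergence between teacher and student token distributions. The sketch below uses that standard loss as a stand-in; the issue does not specify a loss. It also assumes the teacher logits have already been restricted or mapped to the student's 32K vocabulary, which is a nontrivial step on its own. Function names and the temperature value are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Assumes both logit vectors are indexed by the same (student) vocabulary.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


same = distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
different = distillation_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(same, different)
```

The loss is zero when the student matches the teacher and grows as the two distributions diverge.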

Option C: Fine-tune Phi-3.5 on document QA

Keep Phi-3.5's 32K vocab but fine-tune on document QA tasks (SQuAD, NaturalQuestions, etc.).

  • Pro: No vocabulary changes, just quality improvement
  • Con: Limited by Phi-3.5's 2024-era pre-training
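Preparing the fine-tuning data for Option C could look roughly like this. The field names (`context`, `question`, `answers.text`) follow the layout commonly used for SQuAD and should be treated as an assumption, as should the prompt template and the `to_training_pair` helper; none of these come from quant.cpp or Phi-3.5.

```python
def to_training_pair(record):
    """Turn one SQuAD-style record into a (prompt, completion) pair."""
    answers = record["answers"]["text"]
    # Records without an answer (as in SQuAD v2) get a fixed fallback string.
    completion = answers[0] if answers else "unanswerable"
    prompt = (
        "Answer the question using only the document.\n\n"
        f"Document: {record['context']}\n"
        f"Question: {record['question']}\n"
        "Answer:"
    )
    return prompt, " " + completion


# Hypothetical record in the assumed layout.
record = {
    "context": "quant.cpp runs quantized language models on local CPUs.",
    "question": "Where does quant.cpp run models?",
    "answers": {"text": ["on local CPUs"], "answer_start": [42]},
}
prompt, completion = to_training_pair(record)
print(prompt + completion)
```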

Option D: Community model search

Monitor HuggingFace for new models with small vocabularies. Some research groups may release English-focused models.

  • Pro: Zero effort
  • Con: May never appear (industry trend is opposite)

Why This Matters

The speed formula for local inference is approximately:

tok/s ∝ 1 / (vocab_size × params^0.5 × quant_overhead)

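By this formula, a 3.8B model with a 32K vocab is roughly 6x faster than the same model with a 200K vocab (200K / 32K ≈ 6.25), in line with the measured 6.5 tok/s for Phi-3.5-mini versus the estimated ~1 tok/s for Phi-4-mini. This is not an optimization; it is a fundamental architectural advantage for the English-only use case.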
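The proportionality can be evaluated directly to compare two configurations. The sketch below takes the formula exactly as written, treats `quant_overhead` as an unspecified constant (1.0 here), and produces a unitless score rather than real tok/s. With parameter count and quantization held fixed, the ratio depends only on the two vocab sizes, so the formula as written yields 200K / 32K = 6.25 for this pair.

```python
import math


def relative_speed(vocab_size, params_billions, quant_overhead=1.0):
    """Unitless speed score: 1 / (vocab_size * sqrt(params) * quant_overhead)."""
    return 1.0 / (vocab_size * math.sqrt(params_billions) * quant_overhead)


small_vocab = relative_speed(32_000, 3.8)
large_vocab = relative_speed(200_000, 3.8)
speedup = small_vocab / large_vocab
print(f"{speedup:.2f}x")
```

The formula is only a rough guide: the measured numbers in the table (for example SmolLM2-1.7B at 23 tok/s) do not follow it exactly.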

Priority: P3

Long-term research direction. Immediate impact comes from #83 (KV cache) and #84 (coherence API).

Labels: enhancement