Paged-attention inference engine: KV cache, continuous batching, preemption. Educational, not production.
```
pip install smol-vllm
```

Real models (TinyLlama, Qwen2, etc.):

```
pip install smol-vllm[tinyllama-1.1b]
# or
pip install smol-vllm[qwen2-0.5b]
```

FakeModel (no extras):
```python
from smol_vllm import LLMEngine

engine = LLMEngine()
for token in engine.generate([1, 2, 3, 4, 5], max_tokens=20):
    print(token, end=" ")
```

CausalLM (needs [tinyllama-1.1b] or [qwen2-0.5b]):
```python
engine = LLMEngine(use_real_model=True)
tokenizer = engine.model.tokenizer
tokens = tokenizer.encode("Hello!", add_special_tokens=False)
for token in engine.generate(tokens, max_tokens=20):
    print(tokenizer.decode([token]), end="")
```

| Model | model_name |
|---|---|
| TinyLlama 1.1B | TinyLlama/TinyLlama-1.1B-Chat-v1.0 (default) |
| Qwen2 0.5B | Qwen/Qwen2-0.5B-Instruct |
| Phi-2 | microsoft/phi-2 |
| Llama 3.2 | meta-llama/Llama-3.2-1B-Instruct |
| Gemma 2 | google/gemma-2-2b-it |
| Mistral | mistralai/Mistral-7B-Instruct-v0.3 |
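
For example, to run Qwen2 0.5B instead of the default (requires the [qwen2-0.5b] extra; same API as the CausalLM snippet above):

```python
from smol_vllm import LLMEngine

# Pick a model from the table; any model_name value above works the same way.
engine = LLMEngine(use_real_model=True, model_name="Qwen/Qwen2-0.5B-Instruct")
tokenizer = engine.model.tokenizer
tokens = tokenizer.encode("Hello!", add_special_tokens=False)
for token in engine.generate(tokens, max_tokens=20):
    print(tokenizer.decode([token]), end="")
```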
Gated models (Llama, Gemma, etc.) need a HuggingFace token. Options:

1. Env var (recommended):

   ```
   export HF_TOKEN=hf_xxxxxxxxxxxx
   ```

2. In code:

   ```python
   LLMEngine(use_real_model=True, model_name="meta-llama/Llama-3.2-1B-Instruct", hf_token="hf_xxxx")
   ```

Get a token: huggingface.co/settings/tokens. Accept the model's license on its HF page first.
```
smol-vllm-demo
```

The demo covers:

- PagedAttention — block-based KV cache, ref counting (see the sketch after this list)
- Continuous batching — short jobs fill slots immediately
- Preemption & swapping — when memory runs low
- Prefill vs decode — compute-bound → memory-bound
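
To make the block-table bookkeeping concrete, here is a toy sketch of block allocation with ref counting and preemption. It is illustrative only, not smol-vllm's internals; the `BlockAllocator` class and its methods are hypothetical.

```python
# Toy sketch of block-based KV-cache bookkeeping (not smol-vllm's actual code).
class BlockAllocator:
    """Fixed pool of KV-cache blocks with reference counting."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # indices of unused blocks
        self.refs = [0] * num_blocks          # ref count per block

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks; caller should preempt a sequence")
        block = self.free.pop()
        self.refs[block] = 1
        return block

    def fork(self, block: int) -> None:
        # Sharing a prefix just bumps the ref count; no KV data is copied.
        self.refs[block] += 1

    def release(self, block: int) -> None:
        self.refs[block] -= 1
        if self.refs[block] == 0:
            self.free.append(block)


allocator = BlockAllocator(num_blocks=4)
seq_blocks = [allocator.alloc() for _ in range(3)]  # one sequence holds 3 blocks
allocator.fork(seq_blocks[0])                        # a second sequence shares a block

# Preemption under memory pressure: swap the sequence out by releasing its
# blocks (an engine would copy the KV data to CPU first, then restore later).
# Shared blocks stay resident until their last holder releases them.
for b in seq_blocks:
    allocator.release(b)
print("free blocks:", sorted(allocator.free))
```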
Workflow: run with FakeModel first (zero deps), then switch to CausalLM to compare.
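
A minimal sketch of that workflow using the API above (the timing wrapper is illustrative, not part of smol-vllm):

```python
import time

from smol_vllm import LLMEngine

def timed_generate(engine, tokens, label):
    # Drain the generator and report throughput for a rough comparison.
    start = time.perf_counter()
    out = list(engine.generate(tokens, max_tokens=20))
    dt = time.perf_counter() - start
    print(f"{label}: {len(out)} tokens in {dt:.2f}s ({len(out) / dt:.1f} tok/s)")

# FakeModel: zero extra dependencies, raw token ids in and out.
timed_generate(LLMEngine(), [1, 2, 3, 4, 5], "FakeModel")

# CausalLM: needs one of the model extras installed (see above).
engine = LLMEngine(use_real_model=True)
tokens = engine.model.tokenizer.encode("Hello!", add_special_tokens=False)
timed_generate(engine, tokens, "CausalLM")
```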
Step-level metrics: prefill/decode latency, tok/s, KV-cache utilization. A run summary and CSV logs are written to logs/.
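
The CSV schema isn't documented here, so this schema-agnostic peek (standard library only) just lists whatever columns the logs contain:

```python
import csv
import glob

# Print the columns and row count of each CSV log without assuming a schema.
for path in sorted(glob.glob("logs/*.csv")):
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    columns = list(rows[0].keys()) if rows else []
    print(f"{path}: {len(rows)} rows, columns={columns}")
```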
MIT