Paged-attention inference engine: KV cache, continuous batching, preemption. Educational, not production.
```
pip install smol-vllm
```

Real models (TinyLlama, Qwen2, etc.):

```
pip install smol-vllm[tinyllama-1.1b]
# or
pip install smol-vllm[qwen2-0.5b]
```

FakeModel (no extras):
```python
from smol_vllm import LLMEngine

engine = LLMEngine()
for token in engine.generate([1, 2, 3, 4, 5], max_tokens=20):
    print(token, end=" ")
```

CausalLM (needs [tinyllama-1.1b] or [qwen2-0.5b]):
```python
engine = LLMEngine(use_real_model=True)
tokenizer = engine.model.tokenizer
tokens = tokenizer.encode("Hello!", add_special_tokens=False)
for token in engine.generate(tokens, max_tokens=20):
    print(tokenizer.decode([token]), end="")
```

| Model | model_name |
|---|---|
| TinyLlama 1.1B | TinyLlama/TinyLlama-1.1B-Chat-v1.0 (default) |
| Qwen2 0.5B | Qwen/Qwen2-0.5B-Instruct |
| Phi-2 | microsoft/phi-2 |
| Llama 3.2 | meta-llama/Llama-3.2-1B-Instruct |
| Gemma 2 | google/gemma-2-2b-it |
| Mistral | mistralai/Mistral-7B-Instruct-v0.3 |
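
For example, to run Qwen2 0.5B instead of the default (requires the [qwen2-0.5b] extra; same API as the CausalLM snippet above):

```python
from smol_vllm import LLMEngine

# Pick a model from the table; any model_name value above works the same way.
engine = LLMEngine(use_real_model=True, model_name="Qwen/Qwen2-0.5B-Instruct")
tokenizer = engine.model.tokenizer
tokens = tokenizer.encode("Hello!", add_special_tokens=False)
for token in engine.generate(tokens, max_tokens=20):
    print(tokenizer.decode([token]), end="")
```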
Gated models (Llama, Gemma, etc.) need a HuggingFace token. Options:

1. Env var (recommended):

   ```
   export HF_TOKEN=hf_xxxxxxxxxxxx
   ```

2. In code:

   ```python
   LLMEngine(use_real_model=True, model_name="meta-llama/Llama-3.2-1B-Instruct", hf_token="hf_xxxx")
   ```

Get a token: huggingface.co/settings/tokens. Accept the model's license on its HF page first.
```
smol-vllm-demo
```

The demo covers:

- PagedAttention — block-based KV cache, ref counting (see the sketch after this list)
- Continuous batching — short jobs fill slots immediately
- Preemption & swapping — when memory runs low
- Prefill vs decode — compute-bound → memory-bound
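
To make the block-table bookkeeping concrete, here is a toy sketch of block allocation with ref counting and preemption. It is illustrative only, not smol-vllm's internals; the `BlockAllocator` class and its methods are hypothetical.

```python
# Toy sketch of block-based KV-cache bookkeeping (not smol-vllm's actual code).
class BlockAllocator:
    """Fixed pool of KV-cache blocks with reference counting."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # indices of unused blocks
        self.refs = [0] * num_blocks          # ref count per block

    def alloc(self) -> int:
        if not self.free:
            raise MemoryError("no free KV blocks; caller should preempt a sequence")
        block = self.free.pop()
        self.refs[block] = 1
        return block

    def fork(self, block: int) -> None:
        # Sharing a prefix just bumps the ref count; no KV data is copied.
        self.refs[block] += 1

    def release(self, block: int) -> None:
        self.refs[block] -= 1
        if self.refs[block] == 0:
            self.free.append(block)


allocator = BlockAllocator(num_blocks=4)
seq_blocks = [allocator.alloc() for _ in range(3)]  # one sequence holds 3 blocks
allocator.fork(seq_blocks[0])                        # a second sequence shares a block

# Preemption under memory pressure: swap the sequence out by releasing its
# blocks (an engine would copy the KV data to CPU first, then restore later).
# Shared blocks stay resident until their last holder releases them.
for b in seq_blocks:
    allocator.release(b)
print("free blocks:", sorted(allocator.free))
```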
Workflow: run with FakeModel first (zero deps), then switch to CausalLM to compare.
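
A minimal sketch of that workflow using the API above (the timing wrapper is illustrative, not part of smol-vllm):

```python
import time

from smol_vllm import LLMEngine

def timed_generate(engine, tokens, label):
    # Drain the generator and report throughput for a rough comparison.
    start = time.perf_counter()
    out = list(engine.generate(tokens, max_tokens=20))
    dt = time.perf_counter() - start
    print(f"{label}: {len(out)} tokens in {dt:.2f}s ({len(out) / dt:.1f} tok/s)")

# FakeModel: zero extra dependencies, raw token ids in and out.
timed_generate(LLMEngine(), [1, 2, 3, 4, 5], "FakeModel")

# CausalLM: needs one of the model extras installed (see above).
engine = LLMEngine(use_real_model=True)
tokens = engine.model.tokenizer.encode("Hello!", add_special_tokens=False)
timed_generate(engine, tokens, "CausalLM")
```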
Step-level metrics: prefill/decode latency, tok/s, KV-cache utilization. A run summary and CSV logs are written to logs/.
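
The CSV schema isn't documented here, so this schema-agnostic peek (standard library only) just lists whatever columns the logs contain:

```python
import csv
import glob

# Print the columns and row count of each CSV log without assuming a schema.
for path in sorted(glob.glob("logs/*.csv")):
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    columns = list(rows[0].keys()) if rows else []
    print(f"{path}: {len(rows)} rows, columns={columns}")
```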
MIT