# smol-vllm

Available on PyPI. Requires Python 3.10+. MIT license.

A paged-attention inference engine built for learning: block-based KV cache, continuous batching, and preemption. Educational, not production-ready.

## Install

```bash
pip install smol-vllm
```

For real models (TinyLlama, Qwen2, etc.), install a model extra:

```bash
pip install smol-vllm[tinyllama-1.1b]
# or
pip install smol-vllm[qwen2-0.5b]
```

## Quick Start

FakeModel (no extras):

```python
from smol_vllm import LLMEngine

engine = LLMEngine()
for token in engine.generate([1, 2, 3, 4, 5], max_tokens=20):
    print(token, end=" ")
```

CausalLM (needs the `[tinyllama-1.1b]` or `[qwen2-0.5b]` extra):

```python
engine = LLMEngine(use_real_model=True)
tokenizer = engine.model.tokenizer
tokens = tokenizer.encode("Hello!", add_special_tokens=False)
for token in engine.generate(tokens, max_tokens=20):
    print(tokenizer.decode([token]), end="")
```

## Models

| Model | `model_name` |
| --- | --- |
| TinyLlama 1.1B | `TinyLlama/TinyLlama-1.1B-Chat-v1.0` (default) |
| Qwen2 0.5B | `Qwen/Qwen2-0.5B-Instruct` |
| Phi-2 | `microsoft/phi-2` |
| Llama 3.2 | `meta-llama/Llama-3.2-1B-Instruct` |
| Gemma 2 | `google/gemma-2-2b-it` |
| Mistral | `mistralai/Mistral-7B-Instruct-v0.3` |

Gated models (Llama, Gemma, etc.) require a Hugging Face token. Two ways to provide it:

1. Environment variable (recommended):

```bash
export HF_TOKEN=hf_xxxxxxxxxxxx
```

2. In code:

```python
LLMEngine(use_real_model=True, model_name="meta-llama/Llama-3.2-1B-Instruct", hf_token="hf_xxxx")
```

Get a token: huggingface.co/settings/tokens. Accept the model's license on its HF page first.

## Demo

(demo animation: smol-vllm-demo)

## What It Teaches

- **PagedAttention**: block-based KV cache with reference counting
- **Continuous batching**: short jobs fill freed batch slots immediately
- **Preemption & swapping**: what happens when KV memory runs low
- **Prefill vs decode**: the shift from compute-bound to memory-bound work
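The first idea, a block pool with reference counting, can be sketched in a few lines. This is a conceptual illustration only, not smol-vllm's actual implementation; the class and method names below are invented for the example:

```python
# Toy block-based KV-cache allocator with reference counting.
# Names here (BlockAllocator, fork, free_block) are illustrative, not the repo's API.

class BlockAllocator:
    """Fixed pool of KV-cache blocks; blocks can be shared via ref counts."""

    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))   # indices of free blocks
        self.ref_counts = [0] * num_blocks

    def allocate(self) -> int:
        block = self.free.pop()
        self.ref_counts[block] = 1
        return block

    def fork(self, block: int) -> int:
        # Sharing a prefix (e.g. for beam search) just bumps the ref count,
        # instead of copying the block.
        self.ref_counts[block] += 1
        return block

    def free_block(self, block: int) -> None:
        self.ref_counts[block] -= 1
        if self.ref_counts[block] == 0:
            self.free.append(block)   # last holder gone: block returns to pool


alloc = BlockAllocator(num_blocks=4)
b = alloc.allocate()      # one sequence owns the block
alloc.fork(b)             # a second sequence shares it
alloc.free_block(b)       # first sequence finishes: block still held
alloc.free_block(b)       # second finishes: block is reclaimed
print(len(alloc.free))    # -> 4
```

The point of the ref count is that freeing is cheap and copy-free: a block is only reclaimed when its last holder releases it.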

Workflow: run with FakeModel first (zero deps), then switch to CausalLM to compare.
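The continuous-batching idea ("short jobs fill slots immediately") can be sketched as a toy scheduler loop. This is purely illustrative, not smol-vllm's scheduler:

```python
# Toy continuous batching: when a sequence finishes, a waiting request takes
# its slot on the very next step, rather than waiting for the batch to drain.

from collections import deque

def run(requests, batch_size):
    """requests: list of (name, num_decode_steps). Returns completion order."""
    waiting = deque(requests)
    running = {}            # name -> remaining decode steps
    finished = []
    while waiting or running:
        # Fill any free slots immediately (the "continuous" part).
        while waiting and len(running) < batch_size:
            name, steps = waiting.popleft()
            running[name] = steps
        # One decode step for every running sequence.
        for name in list(running):
            running[name] -= 1
            if running[name] == 0:
                del running[name]
                finished.append(name)
    return finished

order = run([("long", 5), ("short", 1), ("tiny", 1)], batch_size=2)
print(order)   # -> ['short', 'tiny', 'long']
```

Note that `tiny` starts as soon as `short` finishes, even though `long` is still running; with static batching it would have waited for the whole batch.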

## Metrics

Step-level metrics: prefill/decode latency, tokens/s, and KV-cache utilization. A summary plus CSV logs are written to `logs/`.
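Once you have a CSV log, summarizing it needs only the standard library. The column names below (`phase`, `tokens_per_s`) are assumptions for illustration; check the header of the actual files in `logs/`:

```python
# Sketch: averaging decode throughput from a step-level metrics CSV.
# Column names are hypothetical; adapt them to the real log header.

import csv
import io
import statistics

# Stand-in for open("logs/metrics.csv") so the example is self-contained.
sample = """phase,tokens_per_s
prefill,812.5
decode,41.2
decode,39.8
"""

rows = list(csv.DictReader(io.StringIO(sample)))
decode_tps = [float(r["tokens_per_s"]) for r in rows if r["phase"] == "decode"]
print(f"mean decode tok/s: {statistics.mean(decode_tps):.1f}")   # -> 40.5
```

Separating prefill from decode rows before averaging matters: prefill throughput is much higher and would skew a combined mean.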

## License

MIT
