
vllm-mlx

Read this in other languages: English · Español · Français · 中文

Continuous batching + OpenAI + Anthropic APIs in one server. Native Apple Silicon inference.

What is vllm-mlx?

A vLLM-style inference server for Apple Silicon Macs. Unlike Ollama or raw mlx-lm, it ships continuous batching, a paged KV cache, prefix caching, and an SSD-tiered cache, and it exposes both the OpenAI /v1/* and Anthropic /v1/messages APIs from a single process. Run LLMs, vision models, audio, and embeddings on Metal with unified memory, with no conversion step.

Quick start (30 seconds)

pip install vllm-mlx
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

OpenAI SDK:

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
r = client.chat.completions.create(model="default", messages=[{"role": "user", "content": "Hi!"}])
print(r.choices[0].message.content)

Anthropic SDK / Claude Code:

export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
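
The same /v1/messages endpoint works from the Anthropic Python SDK as well. A minimal sketch (using model="default" here mirrors the OpenAI example above and is an assumption; max_tokens is required by the Messages API):

from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")
msg = client.messages.create(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hi!"}],
)
print(msg.content[0].text)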

Features

APIs

  • OpenAI-compatible: /v1/chat/completions, /v1/completions, /v1/embeddings, /v1/rerank, /v1/responses
  • Anthropic-compatible: /v1/messages (streaming, tool use, system prompts)
  • MCP Tool Calling: 12 parsers (OpenAI, Anthropic, Gemini, Qwen, DeepSeek, Gemma, and more); a tool-calling sketch follows this list
  • Structured output: JSON Schema via response_format (lm-format-enforcer)
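
Tool calling goes through the standard OpenAI tools parameter, with the server-side parser presumably chosen per model family. A minimal sketch (client as created in the quick start; the get_weather tool is hypothetical, for illustration only):

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]
r = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Weather in Cupertino?"}],
    tools=tools,
)
print(r.choices[0].message.tool_calls)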

Throughput & memory

  • Continuous batching: high throughput for concurrent requests
  • Paged KV cache: memory-efficient with prefix sharing
  • SSD-tiered KV cache: spill the prefix cache to disk for long-context agents (--ssd-cache-dir)
  • Warm prompts: preload popular prefixes at startup (--warm-prompts) for a 1.3-2.25x TTFT speedup
  • Prefix cache: trie-based, shared across requests; a combined serve sketch follows this list
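
For a long-context agent, the caching flags combine at serve time. A sketch (the cache path is illustrative, and passing a prompts file to --warm-prompts is an assumption about its argument format):

vllm-mlx serve mlx-community/Qwen3-8B-4bit --continuous-batching \
  --ssd-cache-dir ~/.cache/vllm-mlx-ssd \
  --warm-prompts prompts.txt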

Multimodal

  • Text + image + video + audio from one server
  • Vision models: Gemma 3, Gemma 4, Qwen3-VL, Pixtral, Llama vision
  • Audio input in chat (audio_url content blocks)
  • Native TTS: 11 voices, 15+ languages (Kokoro, Chatterbox, VibeVoice, VoxCPM)
  • STT: Whisper family with RTF up to 197x on M4 Max

Reasoning & advanced

  • Reasoning extraction: Qwen3, DeepSeek-R1 (--reasoning-parser)
  • MoE expert reduction: --moe-top-k for a 7-16% throughput gain on Qwen3-30B-A3B; sketched after this list
  • Speculative decoding: --mtp for Qwen3-Next
  • Sparse prefill: attention-based --spec-prefill for TTFT reduction
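
A serve-time sketch of expert reduction (the top-k value 6 is illustrative, not a tuned recommendation):

vllm-mlx serve mlx-community/Qwen3-30B-A3B-4bit --continuous-batching --moe-top-k 6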

Observability

  • Prometheus metrics: /metrics endpoint with --metrics
  • Built-in benchmarker: vllm-mlx bench-serve for prompt sweeps with CSV/JSON output

Native GPU acceleration

  • Apple Silicon only (M1, M2, M3, M4) with Metal kernels via MLX
  • Unified memory, no model conversion

Performance

LLM decode (M4 Max, 128 GB, greedy, single stream):

Model                        Tok/s   Memory
Qwen3-0.6B-8bit              417.9   0.7 GB
Llama-3.2-3B-Instruct-4bit   205.6   1.8 GB
Qwen3-30B-A3B-4bit           127.7   ~18 GB

Audio speech-to-text (M4 Max; RTF = real-time factor, audio duration divided by transcription time, so 55x turns an hour of audio into about 65 seconds of work):

Model                    RTF    Use case
whisper-tiny             197x   Real-time / low latency
whisper-large-v3-turbo   55x    Quality + speed
whisper-large-v3         24x    Highest accuracy

See docs/benchmarks/ for continuous-batching results, KV-cache quantization (4-bit / 8-bit / fp16), and MoE top-k sweeps.

Examples

Anthropic API (Claude Code, OpenCode)

vllm-mlx serve mlx-community/Qwen3-8B-4bit --port 8000
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude

Reasoning models (Qwen3, DeepSeek-R1)

vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3
r = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
)
print("Thinking:", r.choices[0].message.reasoning)
print("Answer:",   r.choices[0].message.content)

Multimodal (image + text)

vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000
r = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "What is in this image?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
    ]}],
)
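
Local files can be sent the same way as data URLs, assuming the server accepts them like the upstream OpenAI API does (a sketch; cat.jpg is a placeholder path):

import base64

with open("cat.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

r = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": [
        {"type": "text", "text": "Describe this image."},
        {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
    ]}],
)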

Structured output (JSON Schema)

r = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "List 3 colors."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "schema": {"type": "object", "properties": {"colors": {"type": "array", "items": {"type": "string"}}}}
        },
    },
)
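
Because the schema is enforced server-side (via lm-format-enforcer), the reply parses directly:

import json

colors = json.loads(r.choices[0].message.content)["colors"]
print(colors)  # e.g. ["red", "green", "blue"]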

Reranking (/v1/rerank)

curl http://localhost:8000/v1/rerank -H 'Content-Type: application/json' -d '{
  "model": "default",
  "query": "apple silicon inference",
  "documents": ["MLX is Apples framework", "Metal kernels on M-series", "CUDA on NVIDIA"]
}'
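
The same request from Python:

import requests

resp = requests.post("http://localhost:8000/v1/rerank", json={
    "model": "default",
    "query": "apple silicon inference",
    "documents": ["MLX is the Apple ML framework", "Metal kernels on M-series", "CUDA on NVIDIA"],
})
print(resp.json())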

Embeddings

vllm-mlx serve <llm-model> --embedding-model mlx-community/all-MiniLM-L6-v2-4bit
emb = client.embeddings.create(model="mlx-community/all-MiniLM-L6-v2-4bit", input=["Hello", "World"])
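
The OpenAI SDK returns the standard embeddings shape, so the two vectors can be compared directly. A minimal cosine-similarity check:

import math

a = emb.data[0].embedding
b = emb.data[1].embedding
dot = sum(x * y for x, y in zip(a, b))
norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
print(f"cosine similarity: {dot / norms:.3f}")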

Audio (TTS / STT)

pip install vllm-mlx[audio]
brew install espeak-ng        # macOS, needed for non-English TTS

python examples/tts_example.py "Hello, how are you?" --play
python examples/tts_multilingual.py "Hola mundo" --lang es --play

Built-in benchmarking

vllm-mlx bench-serve --url http://localhost:8000 --concurrency 5 --prompts prompts.txt --output results.csv

# Product-style workload with quality checks and metrics deltas
vllm-mlx bench-serve --url http://localhost:8000 --workload workload.json --repetitions 5 --output results.json

# Append workload rows into SQLite for longitudinal comparisons
vllm-mlx bench-serve --url http://localhost:8000 --workload workload.json --repetitions 5 --format sqlite --output bench.db

Model acquisition and conversion

# Inspect repo metadata, file sizes, config, and rough fit before downloading weights
vllm-mlx model inspect mlx-community/Llama-3.2-3B-Instruct-4bit

# Acquire with resumable Hugging Face transfer and write a local artifact manifest
vllm-mlx model acquire mlx-community/Llama-3.2-3B-Instruct-4bit --target-dir ./models/llama-3b-4bit

# Wrap mlx-lm conversion and record the exact recipe in the converted artifact
vllm-mlx model convert meta-llama/Llama-3.2-3B-Instruct --output ./models/llama-3b-mlx-q4 --quantize --q-bits 4 --q-group-size 64 --q-mode affine

Prometheus metrics

vllm-mlx serve <model> --metrics
curl http://localhost:8000/metrics

Installation

Using uv (recommended):

uv tool install vllm-mlx                 # CLI, system-wide
# or in a project
uv pip install vllm-mlx

Using pip:

pip install vllm-mlx

# Audio extras
pip install vllm-mlx[audio]
brew install espeak-ng
python -m spacy download en_core_web_sm

From source:

git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .

See Installation Guide for full options.

Documentation

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           vllm-mlx Server                               │
│   OpenAI /v1/*  ·  Anthropic /v1/messages  ·  /v1/rerank  ·  /metrics   │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│  Continuous batching · Paged KV cache · Prefix cache · SSD tiering      │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
        ┌─────────────┬────────────┴────────────┬─────────────┐
        ▼             ▼                         ▼             ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│    mlx-lm     │ │   mlx-vlm     │ │   mlx-audio   │ │mlx-embeddings │
│    (LLMs)     │ │  (Vision)     │ │  (TTS + STT)  │ │ (Embeddings)  │
└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                   MLX · Metal kernels · Unified memory                  │
└─────────────────────────────────────────────────────────────────────────┘

Contributing

Bug fixes, perf work, docs, and benchmarks on different Apple Silicon chips are all welcome. See the Contributing Guide.

License

Apache 2.0. See LICENSE.

Citation

@software{vllm_mlx2025,
  author = {Barrios, Wayner},
  title  = {vllm-mlx: Apple Silicon MLX Backend for vLLM},
  year   = {2025},
  url    = {https://github.com/waybarrios/vllm-mlx},
  note   = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}

Acknowledgments

  • MLX. Apple's ML framework.
  • mlx-lm. LLM inference library.
  • mlx-vlm. Vision-language models.
  • mlx-audio. Text-to-Speech and Speech-to-Text.
  • mlx-embeddings. Text embeddings.
  • Rapid-MLX. Community fork of vllm-mlx.
  • vLLM. High-throughput LLM serving. vllm-mlx is inspired by vLLM and adopts its continuous-batching and paged KV-cache design for Apple Silicon via MLX.

Star history



If vllm-mlx helped you, please star the repo. It helps more Apple Silicon devs find it.
