| Documentation | Blog | Paper | Twitter/X | User Forum | Developer Slack |
🔥 We have built a vLLM website to help you get started with vLLM. Please visit vllm.ai to learn more. For events, please visit vllm.ai/events to join us.
This fork is patched to run vLLM on NVIDIA Tesla P40 / Pascal (sm_61) and
has been validated with Qwen/Qwen3-ASR-1.7B,
mistralai/Voxtral-Mini-4B-Realtime-2602, and
opendatalab/MinerU2.5-2509-1.2B.
Tested environment:

- OS: Ubuntu 24.04 / Linux x86_64
- GPU: Tesla P40 (sm_61)
- Python: 3.12.3
- Virtual environment: `.venv-p40`
- PyTorch: 2.5.1+cu121
- CUDA runtime from PyTorch wheel: 12.1
- CUDA toolkit used for local source build: 12.2 (`/usr/local/cuda-12.2`)
- Host compiler used for nvcc: gcc-12 / g++-12
- System default GCC on this machine: 13.3.0
- Recommended install mode: in a dedicated venv, then `pip install -e . --no-build-isolation`
Validated models on this fork:

- Qwen/Qwen3-ASR-1.7B
- mistralai/Voxtral-Mini-4B-Realtime-2602
- opendatalab/MinerU2.5-2509-1.2B
The commands below are the reproducible path used to build and run this fork on P40.
This machine runs Ubuntu 24.04, where nvcc 12.2 does not accept the default GCC 13 toolchain. Install and use gcc-12 / g++-12.
```bash
sudo apt update
sudo apt install -y \
  python3.12-venv \
  gcc-12 \
  g++-12 \
  cmake \
  ninja-build
```

```bash
git clone https://github.com/uaysk/vllm-pascal.git
cd vllm-pascal
```

```bash
python3.12 -m venv .venv-p40
source .venv-p40/bin/activate
python -m pip install --upgrade pip
```

```bash
python -m pip install \
  "setuptools>=77,<81" \
  wheel \
  packaging \
  cmake \
  ninja \
  jinja2 \
  regex \
  protobuf
```

The fork has been tested with CUDA 12.1 wheels. This is separate from the local CUDA toolkit used to compile extensions.
```bash
python -m pip install \
  --index-url https://download.pytorch.org/whl/cu121 \
  torch==2.5.1 \
  torchvision==0.20.1 \
  torchaudio==2.5.1
```

Verify that PyTorch sees the P40 correctly:
```bash
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("capability:", torch.cuda.get_device_capability(0))
PY
```

Expected capability on Tesla P40:

```
(6, 1)
```
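The capability tuple maps directly onto the `TORCH_CUDA_ARCH_LIST` value used later in the build. A small illustrative helper (the function name is ours, not part of vLLM or PyTorch):

```python
def arch_from_capability(capability):
    """Format a (major, minor) compute capability tuple the way
    TORCH_CUDA_ARCH_LIST expects it, e.g. (6, 1) -> "6.1"."""
    major, minor = capability
    return f"{major}.{minor}"

# On a Tesla P40, torch.cuda.get_device_capability(0) returns (6, 1):
print(arch_from_capability((6, 1)))  # -> 6.1
```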
```bash
export CUDA_HOME=/usr/local/cuda-12.2
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
export CC=/usr/bin/gcc-12
export CXX=/usr/bin/g++-12
export CMAKE_ARGS="-DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12"
```

Confirm the compiler and toolkit before building:

```bash
nvcc --version
g++-12 --version | head -n 1
```

```bash
export VLLM_TARGET_DEVICE=cuda
export TORCH_CUDA_ARCH_LIST="6.1"
python -m pip install -e . --no-build-isolation
```

If you want to reduce parallel build pressure on an older host:

```bash
export MAX_JOBS=1
```

and then run the same install command again.
If you build without g++-12, nvcc 12.2 can fail early with an
unsupported GNU version error on Ubuntu 24.04 because the system default is
GCC 13.
```bash
python -m pip install librosa soundfile pillow "mineru-vl-utils[vllm]"
```

For P40, the stable runtime path is eager execution with conservative scheduler settings:
```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
vllm serve Qwen/Qwen3-ASR-1.7B \
  --host 0.0.0.0 \
  --port 8010 \
  --served-model-name qwen3-asr-rt \
  --hf-overrides '{"architectures":["Qwen3ASRRealtimeGeneration"]}' \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.85 \
  --limit-mm-per-prompt '{"audio":1}' \
  --enforce-eager \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --no-async-scheduling
```

If you already have the model in a local Hugging Face snapshot directory, replace Qwen/Qwen3-ASR-1.7B with that local path.
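For example, one way to pre-download a snapshot is with `huggingface-cli` (the target directory below is illustrative; any local path works):

```shell
# Download the model into a local directory (path is an example)
huggingface-cli download Qwen/Qwen3-ASR-1.7B --local-dir ./models/qwen3-asr-1.7b

# Then point vllm serve at the local path, keeping the same flags as above
vllm serve ./models/qwen3-asr-1.7b --served-model-name qwen3-asr-rt ...
```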
opendatalab/MinerU2.5-2509-1.2B uses the existing Qwen2-VL model path in
vLLM. The official model card recommends a logits processor for
no_repeat_ngram_size; this fork includes that processor built in, so you do
not need an extra --logits-processors flag.
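For reference, the core idea of `no_repeat_ngram_size` is to mask any token that would complete an n-gram already present in the generated sequence. A minimal sketch of that logic (illustrative only, not the fork's actual implementation):

```python
def banned_tokens(generated, n):
    """Return the set of token IDs that would complete an n-gram
    already present in `generated` (the no_repeat_ngram_size rule)."""
    if n <= 0 or len(generated) < n - 1:
        return set()
    # The last n-1 tokens form the prefix of the candidate n-gram.
    prefix = tuple(generated[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

# With n=3 and history [1, 2, 3, 1, 2], emitting 3 next would repeat
# the trigram (1, 2, 3), so 3 is banned:
print(banned_tokens([1, 2, 3, 1, 2], 3))  # -> {3}
```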
The model card is published with torch_dtype=bfloat16, but Tesla P40 does not
support BF16. Use --dtype half explicitly for Pascal:
```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
vllm serve opendatalab/MinerU2.5-2509-1.2B \
  --host 0.0.0.0 \
  --port 8011 \
  --served-model-name mineru25-p40 \
  --dtype half \
  --max-model-len 4096 \
  --limit-mm-per-prompt '{"image":1}' \
  --gpu-memory-utilization 0.85 \
  --enforce-eager \
  --max-num-seqs 1 \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --no-async-scheduling
```

If you already downloaded the model to a local Hugging Face snapshot directory, replace opendatalab/MinerU2.5-2509-1.2B with that local path.
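The dtype rule above can be expressed as a tiny helper: BF16 requires compute capability 8.0 (Ampere) or newer, so Pascal's (6, 1) must fall back to FP16. (The function below is illustrative, not part of vLLM.)

```python
def dtype_for_capability(capability):
    """BF16 requires compute capability >= (8, 0); older GPUs such as
    Pascal (6, 1) should use FP16 ("half") instead."""
    return "bfloat16" if capability >= (8, 0) else "half"

print(dtype_for_capability((6, 1)))  # -> half  (Tesla P40)
print(dtype_for_capability((8, 6)))  # -> bfloat16
```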
Minimal extraction example with mineru-vl-utils:

```bash
python - <<'PY'
from PIL import Image
from mineru_vl_utils import MinerUClient

client = MinerUClient(
    backend="http-client",
    server_url="http://127.0.0.1:8011",
)
image = Image.open("/path/to/page.png")
blocks = client.two_step_extract(image)
print(blocks[:3])
PY
```

Validated MinerU result on this P40 environment:
- Model path: local Hugging Face snapshot of opendatalab/MinerU2.5-2509-1.2B
- Test input: single document page image through `MinerUClient.two_step_extract`
- Observed cold load time: about 143 s
- Observed layout detect time: about 82 s
- Observed extract time: about 134 s on the first run
- Output quality check: document header, journal line, figure caption, and formula text were extracted correctly on the validation sample
Two warnings can still appear during layout detection on the validation sample:
```
Warning: The output was truncated due to length limit.
Warning: line does not match layout format: <|box_start|>844 119 958 124
```
In the validated P40 run these warnings did not prevent extraction; the model still returned 122 parsed blocks successfully.
Voxtral Mini 4B Realtime support is included in this fork for Pascal/P40.
Use the same conservative runtime settings as the validated path:
```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
  --host 0.0.0.0 \
  --port 8012 \
  --served-model-name voxtral-mini-rt \
  --enforce-eager \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.85 \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --no-async-scheduling
```

Client examples for the realtime WebSocket endpoint are available in:

- examples/online_serving/openai_realtime_client.py
- examples/online_serving/openai_realtime_microphone_client.py
- examples/online_serving/openai_realtime_upload_web_demo.py
```bash
curl http://127.0.0.1:8010/health
curl http://127.0.0.1:8010/v1/models
curl http://127.0.0.1:8011/health
curl http://127.0.0.1:8011/v1/models
curl http://127.0.0.1:8012/health
curl http://127.0.0.1:8012/v1/models
```

- This fork intentionally avoids FlashAttention/FlashInfer-style fast paths that are not usable on Pascal.
- For Qwen3-ASR on P40, use the serve flags shown above. Removing them can reintroduce crashes or unstable behavior.
- Voxtral realtime support is included for Pascal/P40 in this fork.
- On Pascal, Voxtral realtime uses safer attention/backend fallbacks instead of unstable Triton paths.
- For MinerU2.5 on P40, force `--dtype half` and keep `--max-num-seqs 1` unless you have verified a larger concurrency setting on your card.
- MinerU2.5 uses the existing Qwen2-VL implementation in this fork; Pascal uses safe pre-Volta attention fallbacks for the vision path.
- This fork includes built-in handling for MinerU's `no_repeat_ngram_size` extra arg, so `mineru-vl-utils` works without adding a separate logits processor plugin.
- The validated path is vLLM serving plus Qwen3-ASR transcription / realtime transcription on P40, Voxtral Mini 4B Realtime serving on P40, and MinerU2.5 document parsing on P40.
- Browser microphone capture still requires `localhost` or HTTPS; that is a browser security rule, not a vLLM limitation.
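The endpoint checks above can also be scripted; a minimal sketch using only the Python standard library (the function name and timeout are ours):

```python
import json
import urllib.request

def served_models(base_url):
    """Return the model IDs a vLLM server reports on /v1/models,
    or an empty list if the server is unreachable."""
    try:
        # /health answers 200 with an empty body when the server is up
        urllib.request.urlopen(f"{base_url}/health", timeout=5)
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            data = json.load(resp)
        return [m["id"] for m in data.get("data", [])]
    except OSError:
        return []

print(served_models("http://127.0.0.1:8010"))
```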
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
- Prefix caching support
- Multi-LoRA support
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)
Find the full list of supported models here.
Install vLLM with pip or from source:
```bash
pip install vllm
```

Visit our documentation to learn more.
We welcome and value any contributions and collaborations. Please check out Contributing to vLLM for how to get involved.
If you use vLLM for your research, please cite our paper:
```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```

- For technical questions and feature requests, please use GitHub Issues
- For discussing with fellow users, please use the vLLM Forum
- For coordinating contributions and development, please use Slack
- For security disclosures, please use GitHub's Security Advisories feature
- For collaborations and partnerships, please contact us at collaboration@vllm.ai
- If you wish to use vLLM's logo, please refer to our media kit repo