| Documentation | Blog | Paper | Twitter/X | User Forum | Developer Slack |
🔥 We have built a vLLM website to help you get started with vLLM. Please visit vllm.ai to learn more. For events, please visit vllm.ai/events to join us.
This fork is patched to run vLLM on NVIDIA Tesla P40 / Pascal (sm_61) and
has been validated with Qwen/Qwen3-ASR-1.7B,
mistralai/Voxtral-Mini-4B-Realtime-2602, and
opendatalab/MinerU2.5-2509-1.2B.
Tested environment:

- OS: Ubuntu 24.04 / Linux x86_64
- GPU: Tesla P40 (sm_61)
- Python: 3.12.3
- Virtual environment: `.venv-p40`
- PyTorch: 2.5.1+cu121
- CUDA runtime from PyTorch wheel: 12.1
- CUDA toolkit used for local source build: 12.2 (`/usr/local/cuda-12.2`)
- Host compiler used for nvcc: gcc-12 / g++-12
- System default GCC on this machine: 13.3.0
- Recommended install mode: in a dedicated venv, then `pip install -e . --no-build-isolation`
Validated models on this fork:

- Qwen/Qwen3-ASR-1.7B
- mistralai/Voxtral-Mini-4B-Realtime-2602
- opendatalab/MinerU2.5-2509-1.2B
The commands below are the reproducible path used to build and run this fork on P40.
This machine runs Ubuntu 24.04, where nvcc 12.2 does not accept the default GCC 13 toolchain. Install and use gcc-12 / g++-12.
```bash
sudo apt update
sudo apt install -y \
  python3.12-venv \
  gcc-12 \
  g++-12 \
  cmake \
  ninja-build
```

```bash
git clone https://github.com/uaysk/vllm-pascal.git
cd vllm-pascal
```

```bash
python3.12 -m venv .venv-p40
source .venv-p40/bin/activate
python -m pip install --upgrade pip
```

```bash
python -m pip install \
  "setuptools>=77,<81" \
  wheel \
  packaging \
  cmake \
  ninja \
  jinja2 \
  regex \
  protobuf
```

The fork has been tested with CUDA 12.1 wheels. This is separate from the local CUDA toolkit used to compile extensions.
```bash
python -m pip install \
  --index-url https://download.pytorch.org/whl/cu121 \
  torch==2.5.1 \
  torchvision==0.20.1 \
  torchaudio==2.5.1
```

Verify that PyTorch sees the P40 correctly:
```bash
python - <<'PY'
import torch
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("capability:", torch.cuda.get_device_capability(0))
PY
```

Expected capability on Tesla P40:

```
(6, 1)
```
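The capability tuple maps directly onto the `TORCH_CUDA_ARCH_LIST` value used later in the build. A small illustrative helper (the function name is ours, not part of vLLM or PyTorch):

```python
def arch_from_capability(capability):
    """Format a (major, minor) compute capability tuple the way
    TORCH_CUDA_ARCH_LIST expects it, e.g. (6, 1) -> "6.1"."""
    major, minor = capability
    return f"{major}.{minor}"

# On a Tesla P40, torch.cuda.get_device_capability(0) returns (6, 1):
print(arch_from_capability((6, 1)))  # -> 6.1
```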
```bash
export CUDA_HOME=/usr/local/cuda-12.2
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
export CC=/usr/bin/gcc-12
export CXX=/usr/bin/g++-12
export CMAKE_ARGS="-DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12"
```

Confirm the compiler and toolkit before building:

```bash
nvcc --version
g++-12 --version | head -n 1
```

```bash
export VLLM_TARGET_DEVICE=cuda
export TORCH_CUDA_ARCH_LIST="6.1"
python -m pip install -e . --no-build-isolation
```

If you want to reduce parallel build pressure on an older host:

```bash
export MAX_JOBS=1
```

and then run the same install command again.
If you build without g++-12, nvcc 12.2 can fail early with an
unsupported GNU version error on Ubuntu 24.04 because the system default is
GCC 13.
```bash
python -m pip install librosa soundfile pillow "mineru-vl-utils[vllm]"
```

For P40, the stable runtime path is eager execution with conservative scheduler settings:
```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
vllm serve Qwen/Qwen3-ASR-1.7B \
  --host 0.0.0.0 \
  --port 8010 \
  --served-model-name qwen3-asr-rt \
  --hf-overrides '{"architectures":["Qwen3ASRRealtimeGeneration"]}' \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.85 \
  --limit-mm-per-prompt '{"audio":1}' \
  --enforce-eager \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --no-async-scheduling
```

If you already have the model in a local Hugging Face snapshot directory, replace Qwen/Qwen3-ASR-1.7B with that local path.
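For example, one way to pre-download a snapshot is with `huggingface-cli` (the target directory below is illustrative; any local path works):

```shell
# Download the model into a local directory (path is an example)
huggingface-cli download Qwen/Qwen3-ASR-1.7B --local-dir ./models/qwen3-asr-1.7b

# Then point vllm serve at the local path, keeping the same flags as above
vllm serve ./models/qwen3-asr-1.7b --served-model-name qwen3-asr-rt ...
```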
opendatalab/MinerU2.5-2509-1.2B uses the existing Qwen2-VL model path in
vLLM. The official model card recommends a logits processor for
no_repeat_ngram_size; this fork includes that processor built in, so you do
not need an extra --logits-processors flag.
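For reference, the core idea of `no_repeat_ngram_size` is to mask any token that would complete an n-gram already present in the generated sequence. A minimal sketch of that logic (illustrative only, not the fork's actual implementation):

```python
def banned_tokens(generated, n):
    """Return the set of token IDs that would complete an n-gram
    already present in `generated` (the no_repeat_ngram_size rule)."""
    if n <= 0 or len(generated) < n - 1:
        return set()
    # The last n-1 tokens form the prefix of the candidate n-gram.
    prefix = tuple(generated[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(generated) - n + 1):
        if tuple(generated[i:i + n - 1]) == prefix:
            banned.add(generated[i + n - 1])
    return banned

# With n=3 and history [1, 2, 3, 1, 2], emitting 3 next would repeat
# the trigram (1, 2, 3), so 3 is banned:
print(banned_tokens([1, 2, 3, 1, 2], 3))  # -> {3}
```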
The model card is published with torch_dtype=bfloat16, but Tesla P40 does not
support BF16. Use --dtype half explicitly for Pascal:
```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
vllm serve opendatalab/MinerU2.5-2509-1.2B \
  --host 0.0.0.0 \
  --port 8011 \
  --served-model-name mineru25-p40 \
  --dtype half \
  --max-model-len 4096 \
  --limit-mm-per-prompt '{"image":1}' \
  --gpu-memory-utilization 0.85 \
  --enforce-eager \
  --max-num-seqs 1 \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --no-async-scheduling
```

If you already downloaded the model to a local Hugging Face snapshot directory, replace opendatalab/MinerU2.5-2509-1.2B with that local path.
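The dtype rule above can be expressed as a tiny helper: BF16 requires compute capability 8.0 (Ampere) or newer, so Pascal's (6, 1) must fall back to FP16. (The function below is illustrative, not part of vLLM.)

```python
def dtype_for_capability(capability):
    """BF16 requires compute capability >= (8, 0); older GPUs such as
    Pascal (6, 1) should use FP16 ("half") instead."""
    return "bfloat16" if capability >= (8, 0) else "half"

print(dtype_for_capability((6, 1)))  # -> half  (Tesla P40)
print(dtype_for_capability((8, 6)))  # -> bfloat16
```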
Minimal extraction example with mineru-vl-utils:

```bash
python - <<'PY'
from PIL import Image
from mineru_vl_utils import MinerUClient

client = MinerUClient(
    backend="http-client",
    server_url="http://127.0.0.1:8011",
)
image = Image.open("/path/to/page.png")
blocks = client.two_step_extract(image)
print(blocks[:3])
PY
```

Validated MinerU result on this P40 environment:
- Model path: local Hugging Face snapshot of opendatalab/MinerU2.5-2509-1.2B
- Test input: single document page image through `MinerUClient.two_step_extract`
- Observed cold load time: about 143 s
- Observed layout detect time: about 82 s
- Observed extract time: about 134 s on the first run
- Output quality check: document header, journal line, figure caption, and formula text were extracted correctly on the validation sample
Two warnings can still appear during layout detection on the validation sample:
```
Warning: The output was truncated due to length limit.
Warning: line does not match layout format: <|box_start|>844 119 958 124
```
In the validated P40 run these warnings did not prevent extraction; the model still returned 122 parsed blocks successfully.
Voxtral Mini 4B Realtime support is included in this fork for Pascal/P40.
Use the same conservative runtime settings as the validated path:
```bash
export VLLM_WORKER_MULTIPROC_METHOD=spawn
vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
  --host 0.0.0.0 \
  --port 8012 \
  --served-model-name voxtral-mini-rt \
  --enforce-eager \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.85 \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --no-async-scheduling
```

Client examples for the realtime WebSocket endpoint are available in:

- examples/online_serving/openai_realtime_client.py
- examples/online_serving/openai_realtime_microphone_client.py
- examples/online_serving/openai_realtime_upload_web_demo.py
```bash
curl http://127.0.0.1:8010/health
curl http://127.0.0.1:8010/v1/models
curl http://127.0.0.1:8011/health
curl http://127.0.0.1:8011/v1/models
curl http://127.0.0.1:8012/health
curl http://127.0.0.1:8012/v1/models
```

- This fork intentionally avoids FlashAttention/FlashInfer-style fast paths that are not usable on Pascal.
- For Qwen3-ASR on P40, use the serve flags shown above. Removing them can reintroduce crashes or unstable behavior.
- Voxtral realtime support is included for Pascal/P40 in this fork.
- On Pascal, Voxtral realtime uses safer attention/backend fallbacks instead of unstable Triton paths.
- For MinerU2.5 on P40, force `--dtype half` and keep `--max-num-seqs 1` unless you have verified a larger concurrency setting on your card.
- MinerU2.5 uses the existing Qwen2-VL implementation in this fork; Pascal uses safe pre-Volta attention fallbacks for the vision path.
- This fork includes built-in handling for MinerU's `no_repeat_ngram_size` extra arg, so `mineru-vl-utils` works without adding a separate logits processor plugin.
- The validated path is vLLM serving plus Qwen3-ASR transcription / realtime transcription on P40, Voxtral Mini 4B Realtime serving on P40, and MinerU2.5 document parsing on P40.
- Browser microphone capture still requires `localhost` or HTTPS; that is a browser security rule, not a vLLM limitation.
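The endpoint checks above can also be scripted; a minimal sketch using only the Python standard library (the function name and timeout are ours):

```python
import json
import urllib.request

def served_models(base_url):
    """Return the model IDs a vLLM server reports on /v1/models,
    or an empty list if the server is unreachable."""
    try:
        # /health answers 200 with an empty body when the server is up
        urllib.request.urlopen(f"{base_url}/health", timeout=5)
        with urllib.request.urlopen(f"{base_url}/v1/models", timeout=5) as resp:
            data = json.load(resp)
        return [m["id"] for m in data.get("data", [])]
    except OSError:
        return []

print(served_models("http://127.0.0.1:8010"))
```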
vLLM is a fast and easy-to-use library for LLM inference and serving.
Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.
vLLM is fast with:
- State-of-the-art serving throughput
- Efficient management of attention key and value memory with PagedAttention
- Continuous batching of incoming requests
- Fast model execution with CUDA/HIP graph
- Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
- Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
- Speculative decoding
- Chunked prefill
vLLM is flexible and easy to use with:
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
- Tensor, pipeline, data and expert parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server
- Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
- Prefix caching support
- Multi-LoRA support
vLLM seamlessly supports most popular open-source models on HuggingFace, including:
- Transformer-like LLMs (e.g., Llama)
- Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
- Embedding Models (e.g., E5-Mistral)
- Multi-modal LLMs (e.g., LLaVA)
Find the full list of supported models here.
Install vLLM with pip or from source:
```bash
pip install vllm
```

Visit our documentation to learn more.
We welcome and value any contributions and collaborations. Please check out Contributing to vLLM for how to get involved.
If you use vLLM for your research, please cite our paper:
```bibtex
@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}
```

- For technical questions and feature requests, please use GitHub Issues
- For discussing with fellow users, please use the vLLM Forum
- For coordinating contributions and development, please use Slack
- For security disclosures, please use GitHub's Security Advisories feature
- For collaborations and partnerships, please contact us at collaboration@vllm.ai
- If you wish to use vLLM's logo, please refer to our media kit repo