vLLM

Easy, fast, and cheap LLM serving for everyone

| Documentation | Blog | Paper | Twitter/X | User Forum | Developer Slack |

🔥 We have built a vllm website to help you get started with vllm. Please visit vllm.ai to learn more. For events, please visit vllm.ai/events to join us.


Tesla P40 Build And Run

This fork is patched to run vLLM on NVIDIA Tesla P40 / Pascal (sm_61) and has been validated with Qwen/Qwen3-ASR-1.7B, mistralai/Voxtral-Mini-4B-Realtime-2602, and opendatalab/MinerU2.5-2509-1.2B.

Tested environment:

  • OS: Ubuntu 24.04 / Linux x86_64
  • GPU: Tesla P40 (sm_61)
  • Python: 3.12.3
  • Virtual environment: .venv-p40
  • PyTorch: 2.5.1+cu121
  • CUDA runtime from PyTorch wheel: 12.1
  • CUDA toolkit used for local source build: 12.2 (/usr/local/cuda-12.2)
  • Host compiler used for nvcc: gcc-12 / g++-12
  • System default GCC on this machine: 13.3.0
  • Recommended install mode: in a dedicated venv, then pip install -e . --no-build-isolation

Validated models on this fork:

  • Qwen/Qwen3-ASR-1.7B
  • mistralai/Voxtral-Mini-4B-Realtime-2602
  • opendatalab/MinerU2.5-2509-1.2B

The commands below are the reproducible path used to build and run this fork on P40.

1. Install system packages required by the current build environment

On Ubuntu 24.04, nvcc 12.2 does not accept the system default GCC 13 toolchain. Install and use gcc-12 / g++-12 instead.

sudo apt update
sudo apt install -y \
  python3.12-venv \
  gcc-12 \
  g++-12 \
  cmake \
  ninja-build

2. Clone the fork

git clone https://github.com/uaysk/vllm-pascal.git
cd vllm-pascal

3. Create and activate a virtual environment

python3.12 -m venv .venv-p40
source .venv-p40/bin/activate
python -m pip install --upgrade pip

4. Install Python build tools

python -m pip install \
  "setuptools>=77,<81" \
  wheel \
  packaging \
  cmake \
  ninja \
  jinja2 \
  regex \
  protobuf

5. Install a P40-compatible PyTorch stack

The fork has been tested with CUDA 12.1 wheels. This is separate from the local CUDA toolkit used to compile extensions.

python -m pip install \
  --index-url https://download.pytorch.org/whl/cu121 \
  torch==2.5.1 \
  torchvision==0.20.1 \
  torchaudio==2.5.1

Verify that PyTorch sees the P40 correctly:

python - <<'PY'
import torch
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)
print("device:", torch.cuda.get_device_name(0))
print("capability:", torch.cuda.get_device_capability(0))
PY

Expected capability on Tesla P40:

(6, 1)

6. Configure the local CUDA 12.2 toolchain used for source builds

export CUDA_HOME=/usr/local/cuda-12.2
export PATH="$CUDA_HOME/bin:$PATH"
export LD_LIBRARY_PATH="$CUDA_HOME/lib64:${LD_LIBRARY_PATH:-}"
export CC=/usr/bin/gcc-12
export CXX=/usr/bin/g++-12
export CMAKE_ARGS="-DCMAKE_CUDA_HOST_COMPILER=/usr/bin/g++-12"

Confirm the compiler and toolkit before building:

nvcc --version
g++-12 --version | head -n 1

7. Build and install this fork for Pascal

export VLLM_TARGET_DEVICE=cuda
export TORCH_CUDA_ARCH_LIST="6.1"
python -m pip install -e . --no-build-isolation

If you want to reduce parallel build pressure on an older host:

export MAX_JOBS=1

and then run the same install command again.
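If you want to pick a value other than 1, the sketch below is a rough heuristic for sizing MAX_JOBS from available RAM; the figure of roughly 4 GB per parallel nvcc job is an assumption for illustration, not a number from this fork:

```python
def pick_max_jobs(ram_gb: int, cpus: int, gb_per_job: int = 4) -> int:
    """Rough heuristic: one nvcc job per ~gb_per_job GB of RAM, capped
    at the CPU count, and never below 1."""
    return max(1, min(cpus, ram_gb // gb_per_job))

# A 16 GB host with 8 cores -> 4 parallel jobs; a 4 GB host -> 1.
print(pick_max_jobs(16, 8))   # 4
print(pick_max_jobs(4, 16))   # 1
```

Export the result as MAX_JOBS before rerunning the pip install command.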

If you build without g++-12, nvcc 12.2 can fail early with an unsupported GNU version error on Ubuntu 24.04 because the system default is GCC 13.

8. Install runtime packages used for Qwen ASR, Voxtral, and MinerU validation

python -m pip install librosa soundfile pillow "mineru-vl-utils[vllm]"

9. Start the OpenAI-compatible API server for Qwen3-ASR on P40

For P40, the stable runtime path is eager execution with conservative scheduler settings:

export VLLM_WORKER_MULTIPROC_METHOD=spawn
vllm serve Qwen/Qwen3-ASR-1.7B \
  --host 0.0.0.0 \
  --port 8010 \
  --served-model-name qwen3-asr-rt \
  --hf-overrides '{"architectures":["Qwen3ASRRealtimeGeneration"]}' \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.85 \
  --limit-mm-per-prompt '{"audio":1}' \
  --enforce-eager \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --no-async-scheduling

If you already have the model in a local Hugging Face snapshot directory, replace Qwen/Qwen3-ASR-1.7B with that local path.
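If you prefer a scripted client over curl, a request body for the server above can be assembled as below. The audio_url content-part shape follows vLLM's OpenAI-compatible multimodal chat API; the placeholder bytes, prompt text, and max_tokens value are assumptions for illustration, while the model name matches the --served-model-name flag above:

```python
import base64
import json

def build_asr_request(audio_bytes: bytes, model: str = "qwen3-asr-rt") -> dict:
    """Build an OpenAI-style chat request carrying one base64 audio part.
    POST this as JSON to http://127.0.0.1:8010/v1/chat/completions."""
    audio_b64 = base64.b64encode(audio_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "audio_url",
                 "audio_url": {"url": f"data:audio/wav;base64,{audio_b64}"}},
                {"type": "text", "text": "Transcribe this audio."},
            ],
        }],
        "max_tokens": 512,
    }

payload = build_asr_request(b"\x00\x01")  # placeholder bytes, not real audio
print(json.dumps(payload)[:80])
```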

10. Start the OpenAI-compatible API server for MinerU2.5 on P40

opendatalab/MinerU2.5-2509-1.2B uses the existing Qwen2-VL model path in vLLM. The official model card recommends a logits processor for no_repeat_ngram_size; this fork includes that processor built in, so you do not need an extra --logits-processors flag.
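For reference, a simplified sketch of what a no_repeat_ngram_size processor computes (the set of tokens that would complete an n-gram already present in the output); this is an illustration of the technique, not the fork's implementation, which would additionally mask those tokens' logits:

```python
def banned_next_tokens(output_ids: list[int], n: int) -> set[int]:
    """Return tokens that would repeat an n-gram already in output_ids."""
    if n <= 0 or len(output_ids) < n - 1:
        return set()
    # The current (n-1)-token tail; any historical n-gram starting with
    # this tail bans its final token.
    prefix = tuple(output_ids[-(n - 1):]) if n > 1 else ()
    banned = set()
    for i in range(len(output_ids) - n + 1):
        if tuple(output_ids[i:i + n - 1]) == prefix:
            banned.add(output_ids[i + n - 1])
    return banned

# With n=3 and history [5, 6, 7, 5, 6], emitting 7 would repeat (5, 6, 7).
print(banned_next_tokens([5, 6, 7, 5, 6], 3))  # {7}
```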

The model card is published with torch_dtype=bfloat16, but Tesla P40 does not support BF16. Use --dtype half explicitly for Pascal:

export VLLM_WORKER_MULTIPROC_METHOD=spawn
vllm serve opendatalab/MinerU2.5-2509-1.2B \
  --host 0.0.0.0 \
  --port 8011 \
  --served-model-name mineru25-p40 \
  --dtype half \
  --max-model-len 4096 \
  --limit-mm-per-prompt '{"image":1}' \
  --gpu-memory-utilization 0.85 \
  --enforce-eager \
  --max-num-seqs 1 \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --no-async-scheduling

If you already downloaded the model to a local Hugging Face snapshot directory, replace opendatalab/MinerU2.5-2509-1.2B with that local path.
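The BF16 constraint can be encoded as a small guard. The (8, 0) cutoff reflects that BF16 requires Ampere (sm_80) or newer, which is why Pascal's (6, 1) must fall back to half:

```python
def pick_dtype(capability: tuple[int, int]) -> str:
    """Return a vLLM --dtype value for a given CUDA compute capability:
    'bfloat16' needs (8, 0)+ (Ampere); older cards such as the Tesla P40
    at (6, 1) get 'half'."""
    return "bfloat16" if capability >= (8, 0) else "half"

print(pick_dtype((6, 1)))  # half  (Tesla P40)
print(pick_dtype((8, 6)))  # bfloat16
```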

Minimal extraction example with mineru-vl-utils:

python - <<'PY'
from PIL import Image
from mineru_vl_utils import MinerUClient

client = MinerUClient(
    backend="http-client",
    server_url="http://127.0.0.1:8011",
)

image = Image.open("/path/to/page.png")
blocks = client.two_step_extract(image)
print(blocks[:3])
PY

Validated MinerU result on this P40 environment:

  • Model path: local Hugging Face snapshot of opendatalab/MinerU2.5-2509-1.2B
  • Test input: single document page image through MinerUClient.two_step_extract
  • Observed cold load time: about 143s
  • Observed layout detect time: about 82s
  • Observed extract time: about 134s on the first run
  • Output quality check: document header, journal line, figure caption, and formula text were extracted correctly on the validation sample

Two warnings can still appear during layout detection on the validation sample:

  • Warning: The output was truncated due to length limit.
  • Warning: line does not match layout format: <|box_start|>844 119 958 124

In the validated P40 run these warnings did not prevent extraction; the model still returned 122 parsed blocks successfully.

11. Start the OpenAI-compatible realtime API server for Voxtral Mini 4B on P40

Voxtral Mini 4B Realtime support is included in this fork for Pascal/P40. Use the same conservative runtime settings as the validated path:

export VLLM_WORKER_MULTIPROC_METHOD=spawn
vllm serve mistralai/Voxtral-Mini-4B-Realtime-2602 \
  --host 0.0.0.0 \
  --port 8012 \
  --served-model-name voxtral-mini-rt \
  --enforce-eager \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --gpu-memory-utilization 0.85 \
  --disable-log-stats \
  --no-enable-prefix-caching \
  --no-enable-chunked-prefill \
  --no-async-scheduling

Client examples for the realtime WebSocket endpoint are available in:

  • examples/online_serving/openai_realtime_client.py
  • examples/online_serving/openai_realtime_microphone_client.py
  • examples/online_serving/openai_realtime_upload_web_demo.py
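As a sketch of the kind of message those clients send over the WebSocket, the snippet below serializes one audio chunk as an OpenAI-realtime-style input_audio_buffer.append event. The exact event set the fork accepts is defined by the example clients above; the placeholder PCM bytes here are an assumption:

```python
import base64
import json

def audio_append_event(pcm_chunk: bytes) -> str:
    """Serialize one audio chunk as an OpenAI-realtime-style event:
    base64-encoded PCM under the 'audio' key."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })

event = audio_append_event(b"\x00\x00\x10\x00")  # placeholder PCM16 samples
decoded = json.loads(event)
print(decoded["type"])  # input_audio_buffer.append
```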

12. Verify basic health

curl http://127.0.0.1:8010/health
curl http://127.0.0.1:8010/v1/models
curl http://127.0.0.1:8011/health
curl http://127.0.0.1:8011/v1/models
curl http://127.0.0.1:8012/health
curl http://127.0.0.1:8012/v1/models
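The same checks can be scripted with only the standard library; any failure (connection refused, timeout, non-200 status) simply yields False:

```python
import urllib.error
import urllib.request

def check_health(base_url: str, timeout: float = 2.0) -> bool:
    """Return True if GET {base_url}/health answers 200, else False."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# Probe the three servers started in the sections above.
for port in (8010, 8011, 8012):
    print(port, check_health(f"http://127.0.0.1:{port}"))
```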

13. Known P40-specific runtime notes

  • This fork intentionally avoids FlashAttention/FlashInfer-style fast paths that are not usable on Pascal.
  • For Qwen3-ASR on P40, use the serve flags shown above. Removing them can reintroduce crashes or unstable behavior.
  • Voxtral realtime support is included for Pascal/P40 in this fork.
  • On Pascal, Voxtral realtime uses safer attention/backend fallbacks instead of unstable Triton paths.
  • For MinerU2.5 on P40, force --dtype half and keep --max-num-seqs 1 unless you have verified a larger concurrency setting on your card.
  • MinerU2.5 uses the existing Qwen2-VL implementation in this fork; Pascal uses safe pre-Volta attention fallbacks for the vision path.
  • This fork includes built-in handling for MinerU's no_repeat_ngram_size extra arg, so mineru-vl-utils works without adding a separate logits processor plugin.
  • The validated path on P40 covers vLLM serving with Qwen3-ASR transcription and realtime transcription, Voxtral Mini 4B Realtime serving, and MinerU2.5 document parsing.
  • Browser microphone capture still requires localhost or HTTPS; that is a browser security rule, not a vLLM limitation.

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

vLLM is fast with:

  • State-of-the-art serving throughput
  • Efficient management of attention key and value memory with PagedAttention
  • Continuous batching of incoming requests
  • Fast model execution with CUDA/HIP graph
  • Quantizations: GPTQ, AWQ, AutoRound, INT4, INT8, and FP8
  • Optimized CUDA kernels, including integration with FlashAttention and FlashInfer
  • Speculative decoding
  • Chunked prefill

vLLM is flexible and easy to use with:

  • Seamless integration with popular Hugging Face models
  • High-throughput serving with various decoding algorithms, including parallel sampling, beam search, and more
  • Tensor, pipeline, data and expert parallelism support for distributed inference
  • Streaming outputs
  • OpenAI-compatible API server
  • Support for NVIDIA GPUs, AMD CPUs and GPUs, Intel CPUs and GPUs, PowerPC CPUs, Arm CPUs, and TPU. Additionally, support for diverse hardware plugins such as Intel Gaudi, IBM Spyre and Huawei Ascend.
  • Prefix caching support
  • Multi-LoRA support

vLLM seamlessly supports most popular open-source models on HuggingFace, including:

  • Transformer-like LLMs (e.g., Llama)
  • Mixture-of-Expert LLMs (e.g., Mixtral, Deepseek-V2 and V3)
  • Embedding Models (e.g., E5-Mistral)
  • Multi-modal LLMs (e.g., LLaVA)

Find the full list of supported models here.

Getting Started

Install vLLM with pip or from source:

pip install vllm

Visit our documentation to learn more.

Contributing

We welcome and value any contributions and collaborations. Please check out Contributing to vLLM for how to get involved.

Citation

If you use vLLM for your research, please cite our paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

Contact Us

  • For technical questions and feature requests, please use GitHub Issues
  • For discussing with fellow users, please use the vLLM Forum
  • For coordinating contributions and development, please use Slack
  • For security disclosures, please use GitHub's Security Advisories feature
  • For collaborations and partnerships, please contact us at collaboration@vllm.ai

Media Kit
