Conversation
Pull request overview
This PR adds vLLM integration to provide high-performance LLM inference with an OpenAI-compatible API. vLLM offers superior throughput compared to existing alternatives (24x faster than HuggingFace, 2-4x faster than Ollama) while maintaining API compatibility. The integration complements existing ollama/ (for simpler setups) and litellm/ (which can proxy to vLLM) offerings.
Key changes:
- New installation script with GPU detection, systemd service setup, and configuration management
- Comprehensive documentation covering model selection, API usage, multi-GPU configuration, and troubleshooting
- Example scripts for testing (Python with OpenAI SDK, streaming, and curl)
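As a quick illustration of the OpenAI-compatible API described above, a smoke test along these lines should work once the service is running (a minimal sketch: port 8000 matches the `start.sh` shown later in this review, but the model name is a placeholder — substitute whatever `MODEL` is set to in `config.env`):

```bash
# Minimal smoke test against the OpenAI-compatible endpoint (port 8000 per start.sh).
# Replace the model name with the value of MODEL from config.env.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```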
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vllm/setup.sh | Installation script that detects GPU, creates Python venv, installs vLLM, configures systemd service, and generates example scripts |
| vllm/README.md | Comprehensive documentation with model selection guide, configuration examples, API usage patterns, and troubleshooting guidance |
Comments suppressed due to low confidence (1)
vllm/setup.sh:279
- [nitpick] The file ends with two blank lines. Repository convention appears to use a single blank line at EOF. Remove one of the trailing blank lines.
echo ""
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.
```bash
MAX_MODEL_LEN="4096"    # Maximum sequence length
MAX_NUM_SEQS="256"      # Maximum number of sequences

# Optional: Hugging Face token (needed for gated models like Llama)
```
Inconsistent spelling: "Hugging Face" here vs "HuggingFace" on line 117 and in the README (lines 65-66, 378, 380, 384). The official spelling is "Hugging Face" (two words). Consider using the official spelling consistently throughout the file.
```bash
# Activate venv
source "$HOME/vllm-server/venv/bin/activate"

# Set HuggingFace cache
```
Inconsistent spelling: "HuggingFace" here vs "Hugging Face" on line 102. The official spelling is "Hugging Face" (two words). Consider using the official spelling consistently throughout the file.
Suggested change:

```diff
-# Set HuggingFace cache
+# Set Hugging Face cache
```
```markdown
## Features

- **⚡ Fast** - 24x throughput vs HuggingFace, 2-4x faster than Ollama
```
Inconsistent spelling: "HuggingFace" is used throughout this file, but the official spelling is "Hugging Face" (two words). Consider using the official spelling consistently. Occurrences on lines 15, 65, 66, 378, 380, 384, 477, 482.
```bash
chown -R "$USER:$USER" "$VLLM_DIR"
chown -R "$USER:$USER" "$HOME/vllm-examples"
chown -R "$USER:$USER" "$HOME/.cache/huggingface" 2>/dev/null || true
```
[nitpick] Inconsistent quoting pattern: This script uses "$USER:$USER" with quotes, while other setup scripts in the repository (ollama/setup.sh:99, litellm/setup.sh:123) use $USER:$USER without quotes. For consistency with the codebase conventions, consider removing the quotes around $USER:$USER.
Suggested change:

```diff
-chown -R "$USER:$USER" "$VLLM_DIR"
-chown -R "$USER:$USER" "$HOME/vllm-examples"
-chown -R "$USER:$USER" "$HOME/.cache/huggingface" 2>/dev/null || true
+chown -R $USER:$USER "$VLLM_DIR"
+chown -R $USER:$USER "$HOME/vllm-examples"
+chown -R $USER:$USER "$HOME/.cache/huggingface" 2>/dev/null || true
```
```bash
MAX_NUM_SEQS="256"      # Maximum number of sequences

# Optional: Hugging Face token (needed for gated models like Llama)
# HF_TOKEN="hf_..."
```
Potential issue with HF_TOKEN environment variable: The config.env file defines HF_TOKEN but doesn't export it. When the start.sh script sources the config, the variable is set but may not be passed to the Python process. Consider adding export HF_TOKEN after sourcing the config in start.sh (around line 112), or change the config.env template to use export HF_TOKEN="hf_..." format.
Suggested change:

```diff
-# HF_TOKEN="hf_..."
+# export HF_TOKEN="hf_..."
```
````markdown
Edit service file:

```bash
sudo nano /etc/systemd/system/vllm.service
```

Add to ExecStart line:
```bash
--api-key "your-secret-key"
```

Reload and restart:
```bash
sudo systemctl daemon-reload
```
````
Misleading documentation: The instructions suggest adding --api-key to the ExecStart line in the systemd service file, but the service uses ExecStart=$VLLM_DIR/start.sh which calls a separate script. Users should be instructed to edit ~/vllm-server/start.sh and add the --api-key flag to the python command (around the existing flags on line 121-129 of setup.sh), not the service file itself.
Suggested change (replacing the section above with):

````markdown
Edit the start script to add your API key:

```bash
nano ~/vllm-server/start.sh
```

Find the line that starts the Python server (around line 121-129), e.g.:

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model $MODEL \
    # ... other flags ...
```

Add your API key flag:

```bash
    --api-key "your-secret-key" \
```

Save and exit, then restart the service:
````
| echo "1. Edit the model in config:" | ||
| echo " nano $VLLM_DIR/config.env" | ||
| echo "" | ||
| echo "2. For gated models (Llama, etc), add HuggingFace token:" |
Inconsistent spelling: "HuggingFace" here vs "Hugging Face" on line 254. The official spelling is "Hugging Face" (two words). Consider using the official spelling consistently throughout the file.
| echo "2. For gated models (Llama, etc), add HuggingFace token:" | |
| echo "2. For gated models (Llama, etc), add Hugging Face token:" |
```bash
    --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
    --max-model-len "$MAX_MODEL_LEN" \
    --max-num-seqs "$MAX_NUM_SEQS" \
    --trust-remote-code
```
Security consideration: The --trust-remote-code flag allows models to execute arbitrary code during loading. While this is necessary for some models, it poses a security risk if loading untrusted models. Consider adding a comment in the config.env file or README.md warning users about this flag and suggesting they only use it with trusted models, or make it configurable via the config.env file.
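If the flag is made configurable as suggested, one possible shape (a sketch only; the `TRUST_REMOTE_CODE` variable is hypothetical and not part of this PR):

```bash
# config.env (sketch): opt in explicitly; keep remote code execution off by default
TRUST_REMOTE_CODE="false"

# start.sh (sketch): append the flag only when the user has opted in
EXTRA_ARGS=()
if [ "$TRUST_REMOTE_CODE" = "true" ]; then
  EXTRA_ARGS+=(--trust-remote-code)
fi

python3 -m vllm.entrypoints.openai.api_server \
    --model "$MODEL" \
    --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
    --max-model-len "$MAX_MODEL_LEN" \
    --max-num-seqs "$MAX_NUM_SEQS" \
    "${EXTRA_ARGS[@]}"
```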
Add vLLM Integration
High-performance LLM inference engine with OpenAI-compatible API.
## Features

## What's Included

- `vllm/setup.sh` (279 lines) - Installation script with GPU detection, systemd setup, and example scripts
- `vllm/README.md` (643 lines) - Comprehensive docs with model selection guide, API examples, and troubleshooting

## Installation

```bash
cd vllm && bash setup.sh   # ~3-5 minutes
```

### Requirements

## Testing

## Default Setup

## Design

Complements existing `ollama/` (higher throughput) and `litellm/` (can proxy to vLLM) integrations.