Conversation
Pull request overview
This PR adds vLLM integration to provide high-performance LLM inference with an OpenAI-compatible API. vLLM offers superior throughput compared to existing alternatives (24x faster than HuggingFace, 2-4x faster than Ollama) while maintaining API compatibility. The integration complements existing ollama/ (for simpler setups) and litellm/ (which can proxy to vLLM) offerings.
Key changes:
- New installation script with GPU detection, systemd service setup, and configuration management
- Comprehensive documentation covering model selection, API usage, multi-GPU configuration, and troubleshooting
- Example scripts for testing (Python with OpenAI SDK, streaming, and curl)
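As a quick illustration of the OpenAI-compatible API described above, a smoke test along these lines should work once the service is running (a minimal sketch: port 8000 matches the `start.sh` shown later in this review, but the model name is a placeholder — substitute whatever `MODEL` is set to in `config.env`):

```bash
# Minimal smoke test against the OpenAI-compatible endpoint (port 8000 per start.sh).
# Replace the model name with the value of MODEL from config.env.
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "meta-llama/Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one short sentence."}],
        "max_tokens": 64
      }'
```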
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| vllm/setup.sh | Installation script that detects GPU, creates Python venv, installs vLLM, configures systemd service, and generates example scripts |
| vllm/README.md | Comprehensive documentation with model selection guide, configuration examples, API usage patterns, and troubleshooting guidance |
Comments suppressed due to low confidence (1)
vllm/setup.sh:279
- [nitpick] The file ends with two blank lines. Repository convention appears to use a single blank line at EOF. Remove one of the trailing blank lines.
echo ""
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Pull request overview
Copilot reviewed 2 out of 2 changed files in this pull request and generated 9 comments.
```bash
MAX_MODEL_LEN="4096"    # Maximum sequence length
MAX_NUM_SEQS="256"      # Maximum number of sequences

# Optional: Hugging Face token (needed for gated models like Llama)
```
Inconsistent spelling: "Hugging Face" here vs "HuggingFace" on line 117 and in the README (lines 65-66, 378, 380, 384). The official spelling is "Hugging Face" (two words). Consider using the official spelling consistently throughout the file.
```bash
# Activate venv
source "$HOME/vllm-server/venv/bin/activate"

# Set HuggingFace cache
```
Inconsistent spelling: "HuggingFace" here vs "Hugging Face" on line 102. The official spelling is "Hugging Face" (two words). Consider using the official spelling consistently throughout the file.
Suggested change:

```diff
-# Set HuggingFace cache
+# Set Hugging Face cache
```
```markdown
## Features

- **⚡ Fast** - 24x throughput vs HuggingFace, 2-4x faster than Ollama
```
Inconsistent spelling: "HuggingFace" is used throughout this file, but the official spelling is "Hugging Face" (two words). Consider using the official spelling consistently. Occurrences on lines 15, 65, 66, 378, 380, 384, 477, 482.
```bash
chown -R "$USER:$USER" "$VLLM_DIR"
chown -R "$USER:$USER" "$HOME/vllm-examples"
chown -R "$USER:$USER" "$HOME/.cache/huggingface" 2>/dev/null || true
```
[nitpick] Inconsistent quoting pattern: This script uses "$USER:$USER" with quotes, while other setup scripts in the repository (ollama/setup.sh:99, litellm/setup.sh:123) use $USER:$USER without quotes. For consistency with the codebase conventions, consider removing the quotes around $USER:$USER.
Suggested change:

```diff
-chown -R "$USER:$USER" "$VLLM_DIR"
-chown -R "$USER:$USER" "$HOME/vllm-examples"
-chown -R "$USER:$USER" "$HOME/.cache/huggingface" 2>/dev/null || true
+chown -R $USER:$USER "$VLLM_DIR"
+chown -R $USER:$USER "$HOME/vllm-examples"
+chown -R $USER:$USER "$HOME/.cache/huggingface" 2>/dev/null || true
```
```bash
MAX_NUM_SEQS="256"      # Maximum number of sequences

# Optional: Hugging Face token (needed for gated models like Llama)
# HF_TOKEN="hf_..."
```
Potential issue with HF_TOKEN environment variable: The config.env file defines HF_TOKEN but doesn't export it. When the start.sh script sources the config, the variable is set but may not be passed to the Python process. Consider adding export HF_TOKEN after sourcing the config in start.sh (around line 112), or change the config.env template to use export HF_TOKEN="hf_..." format.
Suggested change:

```diff
-# HF_TOKEN="hf_..."
+# export HF_TOKEN="hf_..."
```
````markdown
Edit service file:

```bash
sudo nano /etc/systemd/system/vllm.service
```

Add to ExecStart line:
```bash
--api-key "your-secret-key"
```

Reload and restart:
```bash
sudo systemctl daemon-reload
```
````
Misleading documentation: The instructions suggest adding --api-key to the ExecStart line in the systemd service file, but the service uses ExecStart=$VLLM_DIR/start.sh which calls a separate script. Users should be instructed to edit ~/vllm-server/start.sh and add the --api-key flag to the python command (around the existing flags on line 121-129 of setup.sh), not the service file itself.
Suggested change (replacing the section above with):

````markdown
Edit the start script to add your API key:

```bash
nano ~/vllm-server/start.sh
```

Find the line that starts the Python server (around line 121-129), e.g.:

```bash
python3 -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --model $MODEL \
    # ... other flags ...
```

Add your API key flag:

```bash
    --api-key "your-secret-key" \
```

Save and exit, then restart the service:
````
| echo "1. Edit the model in config:" | ||
| echo " nano $VLLM_DIR/config.env" | ||
| echo "" | ||
| echo "2. For gated models (Llama, etc), add HuggingFace token:" |
Inconsistent spelling: "HuggingFace" here vs "Hugging Face" on line 254. The official spelling is "Hugging Face" (two words). Consider using the official spelling consistently throughout the file.
| echo "2. For gated models (Llama, etc), add HuggingFace token:" | |
| echo "2. For gated models (Llama, etc), add Hugging Face token:" |
```bash
    --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
    --max-model-len "$MAX_MODEL_LEN" \
    --max-num-seqs "$MAX_NUM_SEQS" \
    --trust-remote-code
```
Security consideration: The --trust-remote-code flag allows models to execute arbitrary code during loading. While this is necessary for some models, it poses a security risk if loading untrusted models. Consider adding a comment in the config.env file or README.md warning users about this flag and suggesting they only use it with trusted models, or make it configurable via the config.env file.
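If the flag is made configurable as suggested, one possible shape (a sketch only; the `TRUST_REMOTE_CODE` variable is hypothetical and not part of this PR):

```bash
# config.env (sketch): opt in explicitly; keep remote code execution off by default
TRUST_REMOTE_CODE="false"

# start.sh (sketch): append the flag only when the user has opted in
EXTRA_ARGS=()
if [ "$TRUST_REMOTE_CODE" = "true" ]; then
  EXTRA_ARGS+=(--trust-remote-code)
fi

python3 -m vllm.entrypoints.openai.api_server \
    --model "$MODEL" \
    --gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
    --max-model-len "$MAX_MODEL_LEN" \
    --max-num-seqs "$MAX_NUM_SEQS" \
    "${EXTRA_ARGS[@]}"
```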
Add vLLM Integration
High-performance LLM inference engine with OpenAI-compatible API.
## Features

## What's Included

- `vllm/setup.sh` (279 lines) - Installation script with GPU detection, systemd setup, and example scripts
- `vllm/README.md` (643 lines) - Comprehensive docs with model selection guide, API examples, and troubleshooting

## Installation

```bash
cd vllm && bash setup.sh   # ~3-5 minutes
```

### Requirements

## Testing

## Default Setup

## Design

Complements existing `ollama/` (higher throughput) and `litellm/` (can proxy to vLLM) integrations.