This guide covers everything you need to know about developing plugins for vLLM, including best practices learned from the official documentation and community blog posts.
- Plugin System Overview
- Types of Plugins
- How vLLM Discovers Plugins
- Creating Your First Plugin
- Best Practices
- Advanced Topics
- Troubleshooting
vLLM's plugin system uses Python's standard `entry_points` mechanism to discover and load extensions. This enables:
- Clean modifications: Extend vLLM without forking the codebase
- Runtime activation: Plugins load automatically when vLLM starts
- Distributed compatibility: Plugins load in all processes (main, workers, etc.)
- Selective loading: Use environment variables to control which plugins activate
- Discovery: vLLM reads entry points from installed packages
- Loading: `load_general_plugins()` is called before engine initialization
- Registration: Plugin functions register models, patches, or other extensions
- Execution: Registered functionality is available throughout vLLM's runtime
Important: Plugin registration happens in every vLLM process including:
- Main process
- Worker processes
- GPU/CPU workers
- Auxiliary processes
This ensures consistent behavior across distributed deployments.
vLLM supports several plugin entry point groups:
| Entry Point Group | Purpose | Registration Target |
|---|---|---|
| `vllm.general_plugins` | General extensions, custom models, patches | Function that performs registration |
| `vllm.platform_plugins` | Hardware backend integrations | Function returning platform class if supported |
| `vllm.stat_logger_plugins` | Custom metrics/logging | Logger class (`StatLoggerBase` subclass) |
| `vllm.logits_processors` | Custom decoding strategies | `LogitsProcessor` subclass |
| `vllm.io_processor_plugins` | Input/output processing | IO processor implementation |
The most common plugin type. Use for:
- Registering custom model architectures
- Applying patches to vLLM classes
- Adding custom samplers or processors
```toml
[project.entry-points."vllm.general_plugins"]
my_plugin = "my_package.register:register"
```

For hardware backend integrations (NPU, custom accelerators):

```toml
[project.entry-points."vllm.platform_plugins"]
my_platform = "my_package.platform:register"
```

Requires implementing:

- `Platform` class
- `WorkerBase`
- `ModelRunnerBase`
- `AttentionBackend`
- `CommunicatorBase`
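The shape of the platform entry point can be sketched as follows. This is an illustration only: `my_npu_runtime` and the dotted class path are hypothetical placeholders for your hardware runtime and `Platform` subclass.

```python
import importlib.util

def register():
    """Entry point for vllm.platform_plugins (sketch).

    vLLM expects the dotted path of your Platform subclass when the
    hardware is usable, or None so the plugin is skipped.
    """
    # Hypothetical runtime package that is only present on supported hosts
    if importlib.util.find_spec("my_npu_runtime") is None:
        return None
    return "my_package.platform.MyPlatform"
```

Returning `None` on unsupported hosts lets the same wheel be installed everywhere without breaking CPU/GPU deployments.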
For custom metrics collection and export:
```toml
[project.entry-points."vllm.stat_logger_plugins"]
my_logger = "my_package.loggers:MyLoggerClass"
```

Note: The entry point should reference the class directly, not a registration function.
vLLM v1 Only - For custom decoding strategies that modify logits before sampling.
```toml
[project.entry-points."vllm.logits_processors"]
my_decoder = "my_package.processor:MyLogitsProcessor"
```

Important characteristics:
- Global Application: Plugins apply to ALL requests when installed
- No Per-Request Selection: vLLM v1 does NOT support per-request logits processor selection via the OpenAI API
- One Plugin Per Deployment: Install only ONE decoding strategy plugin per vLLM deployment
- Must Inherit Base Class: Your processor MUST inherit from `LogitsProcessor`
See Logits Processor Plugins (vLLM v1) below for a detailed implementation guide.
vLLM uses Python's `importlib.metadata.entry_points()` to discover plugins:

```python
# Simplified discovery logic
from importlib.metadata import entry_points

eps = entry_points(group='vllm.general_plugins')
for ep in eps:
    register_func = ep.load()
    register_func()
```

- `VLLM_PLUGINS`: comma-separated list of plugin names to load
- If not set, all discovered plugins are loaded
- Use it to enable plugins selectively: `VLLM_PLUGINS=my_plugin,other_plugin`
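The selective-loading behavior can be approximated with a small sketch (not vLLM's actual implementation; the function name is illustrative):

```python
import os

def select_plugins(discovered):
    """Return the subset of discovered plugin names that VLLM_PLUGINS allows.

    `discovered` is an iterable of entry-point names; when VLLM_PLUGINS is
    unset, everything is selected.
    """
    allowed = os.environ.get("VLLM_PLUGINS")
    if allowed is None:
        return list(discovered)
    wanted = {name.strip() for name in allowed.split(",") if name.strip()}
    return [name for name in discovered if name in wanted]
```

Note that an unset variable and an empty variable behave differently: unset loads everything, while `VLLM_PLUGINS=""` selects nothing.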
```
my-vllm-plugin/
├── pyproject.toml
├── src/
│   └── my_vllm_plugin/
│       ├── __init__.py
│       └── register.py
└── tests/
    └── test_plugin.py
```
```toml
# pyproject.toml
[project]
name = "my-vllm-plugin"
version = "0.1.0"
dependencies = ["vllm>=0.8.0"]

[project.entry-points."vllm.general_plugins"]
my_plugin = "my_vllm_plugin.register:register"

[build-system]
requires = ["setuptools>=61"]
build-backend = "setuptools.build_meta"
```

```python
# src/my_vllm_plugin/register.py
import logging

logger = logging.getLogger(__name__)

_registered = False

def register() -> None:
    """Register plugin with vLLM."""
    global _registered
    # Ensure re-entrancy
    if _registered:
        return

    logger.info("Registering my vLLM plugin")

    # Your registration logic here
    # Example: Register a custom model
    from vllm import ModelRegistry
    if "MyModel" not in ModelRegistry.get_supported_archs():
        ModelRegistry.register_model(
            "MyModel",
            "my_vllm_plugin.models:MyModelForCausalLM"
        )

    _registered = True
```

Install and verify:

```shell
pip install -e .
python -c "import vllm; print('Plugin loaded!')"
```

Your registration function must be safe to call multiple times:
```python
_registered = False

def register():
    global _registered
    if _registered:
        return  # Already registered
    # ... registration logic ...
    _registered = True
```

Why? vLLM may call your function in multiple processes.
Always specify and check vLLM version requirements:
```python
from packaging.version import Version
import vllm

def register():
    current = Version(vllm.__version__)
    required = Version("0.9.0")
    if current < required:
        logger.warning(f"Plugin requires vLLM >= 0.9.0, got {current}")
        return
    # ... registration logic ...
```

Or use the decorator pattern:

```python
@min_vllm_version("0.9.0")
class MyPatch(VLLMPatch[Scheduler]):
    pass
```

When patching vLLM classes:
- Do: Add single methods, override specific behavior
- Don't: Duplicate entire classes, make sweeping changes
```python
# Good: Minimal patch
class PriorityPatch(VLLMPatch[Scheduler]):
    def get_priority(self, request):
        return request.metadata.get("priority", 0)

# Bad: Reimplementing the entire class
class MyScheduler(Scheduler):
    # ... hundreds of lines ...
    ...
```

Use environment variables for runtime configuration:
```python
import os

def register():
    enabled = os.environ.get("MY_PLUGIN_ENABLED", "true").lower() == "true"
    if not enabled:
        return
    # ... registration logic ...
```

Handle missing dependencies gracefully:
```python
def register():
    try:
        from vllm import ModelRegistry
    except ImportError:
        logger.warning("vLLM not available, skipping registration")
        return
    # ... registration logic ...
```

Use Python's logging module for visibility:
```python
import logging

logger = logging.getLogger(__name__)

def register():
    logger.info("Starting plugin registration")
    # ...
    logger.info("Plugin registered successfully")
```

Always test:
- Re-entrancy (multiple calls)
- Without vLLM installed
- With different vLLM versions
```python
def test_register_is_reentrant():
    from my_plugin.register import register
    register()
    register()  # Should not raise

def test_handles_missing_vllm(monkeypatch):
    monkeypatch.setattr('builtins.__import__', mock_import_error)
    # Should not raise, just log a warning
```

For modifying existing vLLM classes without forking:
```python
from vllm.core.scheduler import Scheduler

class PrioritySchedulerPatch(VLLMPatch[Scheduler]):
    """Add priority-based scheduling."""

    def get_priority(self, request) -> int:
        """New method added to Scheduler."""
        return request.metadata.get("priority", 0)

    def schedule(self, waiting_queue):
        """Override existing method."""
        sorted_queue = sorted(
            waiting_queue,
            key=lambda r: self.get_priority(r),
            reverse=True
        )
        # Call the original via the _original_ prefix
        return self._original_schedule(sorted_queue)

# Apply at registration time
PrioritySchedulerPatch.apply()
```

Control patches via environment variables:
```shell
# Enable specific patches
VLLM_CUSTOM_PATCHES=PrioritySchedulerPatch,CustomSamplerPatch python app.py

# Enable all patches
VLLM_CUSTOM_PATCHES=* python app.py
```

Different models can enable different patches:
```python
def register():
    model_type = os.environ.get("MODEL_TYPE", "")
    if model_type == "priority":
        PrioritySchedulerPatch.apply()
    elif model_type == "batch":
        BatchOptimizationPatch.apply()
```

An example Dockerfile for deployment:

```dockerfile
FROM vllm/vllm-openai:latest

# Install plugins
COPY plugins/ /plugins/
RUN pip install /plugins/vllm-custom-models /plugins/vllm-patches

# Configure patches
ENV VLLM_CUSTOM_PATCHES=PrioritySchedulerPatch
```

Logits processors modify the model's output logits before sampling. vLLM v1 has a specific interface that must be followed.
Your processor MUST:

- Inherit from `LogitsProcessor` - not just implement the methods
- Use the exact constructor signature
- Implement all required methods
```python
from typing import Optional

import torch

from vllm.v1.sample.logits_processor.interface import LogitsProcessor

# TYPE_CHECKING imports to avoid circular dependencies
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from vllm.config import VllmConfig
    from vllm.sampling_params import SamplingParams
    from vllm.v1.sample.logits_processor.interface import BatchUpdate


class MyLogitsProcessor(LogitsProcessor):
    """Custom logits processor for vLLM v1."""

    def __init__(
        self,
        vllm_config: "VllmConfig",
        device: torch.device,
        is_pin_memory: bool
    ):
        """Initialize the processor.

        Args:
            vllm_config: vLLM configuration object
            device: Target device for tensors (cuda:0, cpu, etc.)
            is_pin_memory: Whether to use pinned memory for CPU tensors
        """
        self.device = device
        self.is_pin_memory = is_pin_memory
        self.batch_size = 0
        # Load configuration from environment variables
        import os
        self.my_param = float(os.environ.get("MY_PROCESSOR_PARAM", "1.0"))

    def is_argmax_invariant(self) -> bool:
        """Return whether this processor preserves the argmax.

        Returns:
            True: Processor never changes which token has the highest logit
                (can be skipped during greedy/beam search)
            False: Processor may change the argmax
                (must always be applied)
        """
        return False  # Most custom processors should return False

    def update_state(self, batch_update: Optional["BatchUpdate"]) -> None:
        """Update internal state when batch composition changes.

        Called at the start of each engine step BEFORE apply().

        Args:
            batch_update: Info about added, removed, and moved requests.
                None if the batch is unchanged.

        The BatchUpdate contains:
            - batch_size: Current number of requests
            - added: List of (index, SamplingParams, output_tok_ids, req_id)
            - removed: List of removed request indices
            - moved: List of (from_idx, to_idx, direction) for reordered requests
        """
        if batch_update:
            self.batch_size = batch_update.batch_size

    def apply(self, logits: torch.Tensor) -> torch.Tensor:
        """Apply the logits processing.

        Args:
            logits: Tensor of shape (batch_size, vocab_size)

        Returns:
            Modified logits tensor with the same shape
        """
        if logits.size(0) == 0:
            return logits
        # Your processing logic here
        modified_logits = logits / self.my_param
        return modified_logits

    @classmethod
    def validate_params(cls, sampling_params: "SamplingParams") -> None:
        """Validate sampling parameters at request creation time.

        Args:
            sampling_params: The sampling parameters to validate

        Raises:
            ValueError: If parameters are invalid
        """
        # Validate any custom parameters in sampling_params.extra_args
        pass
```

| Mistake | Error Message | Fix |
|---|---|---|
| Not inheriting from the base class | `must be a subclass of LogitsProcessor` | Add `(LogitsProcessor)` to the class definition |
| Missing `is_argmax_invariant()` | `has no attribute 'is_argmax_invariant'` | Add the method; return `False` |
| Missing `update_state()` | `has no attribute 'update_state'` | Add the method; track `batch_size` |
| Wrong constructor signature | Various init errors | Use `(vllm_config, device, is_pin_memory)` |
| Using `__call__` instead of `apply` | Processor not called | Rename to `apply()` |
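Most of the mistakes in the table above can be caught before deployment with a quick structural self-check. This is a sketch that only inspects the class shape (it does not verify inheritance from `LogitsProcessor` or runtime behavior; the helper name is illustrative):

```python
def check_processor_class(cls):
    """Return a list of interface problems found on a logits-processor class."""
    problems = []
    # The three methods vLLM v1 calls on every processor
    for method in ("is_argmax_invariant", "update_state", "apply"):
        if not callable(getattr(cls, method, None)):
            problems.append(f"missing method: {method}")
    # Catch the common __call__-instead-of-apply mistake
    if "__call__" in cls.__dict__ and "apply" not in cls.__dict__:
        problems.append("defines __call__ instead of apply")
    return problems
```

Run it in a unit test so a renamed or forgotten method fails CI instead of failing inside the engine.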
Since processors are instantiated by vLLM (not by your code), you cannot pass custom constructor parameters. Use environment variables instead:
```python
import os

class MyProcessor(LogitsProcessor):
    def __init__(self, vllm_config, device, is_pin_memory):
        # Configuration from environment
        self.temperature = float(os.environ.get("MY_PROCESSOR_TEMP", "0.8"))
        self.threshold = float(os.environ.get("MY_PROCESSOR_THRESHOLD", "0.5"))
```

```shell
# Configure at runtime
MY_PROCESSOR_TEMP=0.7 MY_PROCESSOR_THRESHOLD=0.3 python -m vllm.entrypoints.openai.api_server ...
```

If you need per-request configuration, use the `BatchUpdate` in `update_state()`:
```python
def update_state(self, batch_update: Optional["BatchUpdate"]) -> None:
    if not batch_update:
        return
    # Track per-request state
    for index, params, output_tokens, req_id in batch_update.added:
        # params.extra_args contains custom per-request parameters
        threshold = params.extra_args.get("my_threshold", 0.5) if params.extra_args else 0.5
        self.req_state[index] = {"threshold": threshold}
    for index in batch_update.removed:
        self.req_state.pop(index, None)
```

```toml
# pyproject.toml
[project.entry-points."vllm.logits_processors"]
my_decoder = "my_package.processor:MyLogitsProcessor"
```

The entry point name (`my_decoder`) is used for identification but cannot be selected per-request in vLLM v1.
- Check installation: `pip list | grep your-plugin`
- Check entry points: `python -c "from importlib.metadata import entry_points; print(list(entry_points(group='vllm.general_plugins')))"`
- Check the `VLLM_PLUGINS` env var: it may be filtering out your plugin
- Check logs: look for registration messages
- Test the import: `python -c "from your_plugin.register import register; register()"`
- Check the vLLM version: ensure compatibility
- Check `VLLM_CUSTOM_PATCHES`: it must include your patch name
- Check the version decorator: it may be blocking due to a version mismatch
- Check the `PatchManager`: `PatchManager.is_applied("YourPatch")`
If plugin works locally but not in distributed mode:
- Ensure re-entrancy
- Check all workers have plugin installed
- Verify environment variables propagate to workers
Your class must inherit from the base class:
```python
# Wrong
class MyProcessor:
    ...

# Correct
from vllm.v1.sample.logits_processor.interface import LogitsProcessor

class MyProcessor(LogitsProcessor):
    ...
```

Add the required method:
```python
def is_argmax_invariant(self) -> bool:
    return False
```

Add the required method:
```python
def update_state(self, batch_update):
    if batch_update:
        self.batch_size = batch_update.batch_size
```

Ensure you're using `apply()`, not `__call__`:
```python
# Wrong
def __call__(self, logits):
    ...

# Correct
def apply(self, logits: torch.Tensor) -> torch.Tensor:
    ...
```

vLLM v1 does NOT support per-request logits processor selection via the OpenAI API. Processors apply globally to all requests. To use different strategies:
- Deploy separate vLLM instances with different plugins
- Use different Docker images per strategy
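The two-deployment approach might look like this with Docker; the image names, model, and ports are placeholders, and each image is assumed to bake in a different logits-processor plugin:

```shell
# One container per decoding strategy (hypothetical image names)
docker run -d -p 8000:8000 my-registry/vllm-strategy-a:latest \
    --model meta-llama/Llama-3.1-8B-Instruct
docker run -d -p 8001:8000 my-registry/vllm-strategy-b:latest \
    --model meta-llama/Llama-3.1-8B-Instruct
```

A router or gateway in front of the two ports can then direct each request to the strategy it needs, which restores per-request selection at the deployment level.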