feat: Add LoRA distribution across worker endpoints (#2)
LoRA adapters share a tokenizer/config with their base model, but the frontend was trying to download configs from HuggingFace using the LoRA's display_name (e.g., "Qwen/Qwen2.5-1.5B-Instruct:guanaco"), which is not a valid HuggingFace model ID, causing 401 errors. This fix:
- Adds a `base_model_name` field to ModelDeploymentCard
- Propagates the base model name through register_llm -> LocalModelBuilder -> MDC
- Updates download_config() to use base_model_name for HF downloads

When a LoRA is registered, the base model path is now stored in the MDC. The frontend uses this to download the correct tokenizer/config files. This fixes LoRA registration failures in distributed deployments where the worker and frontend run on different nodes with separate filesystems.
Implement selective LoRA distribution to control which worker pods load
each LoRA adapter. This enables horizontal scaling of LoRAs across
multiple vLLM workers without exceeding per-worker memory/max-loras limits.
Changes:
- Add DistributionSpec to DynamoModel CRD with strategies: all, fixed, percentage
- Add TargetEndpoints/AvailableEndpoints to status for observability
- Implement deterministic endpoint selection using consistent hashing
- Fix endpoint deduplication bug in discovery (same pod appearing twice)
- Controller now loads LoRA only on selected target endpoints
- Cleanup (FinalizeResource) unloads only from target endpoints
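The endpoint deduplication fix can be sketched as an order-preserving filter over discovered addresses; the function name and string-based endpoint type here are illustrative assumptions, not the controller's actual code:

```go
package main

import "fmt"

// dedupeEndpoints removes duplicate endpoint addresses while preserving
// discovery order, so a pod that appears twice in discovery (e.g. via two
// watch events) has its LoRA loaded only once.
func dedupeEndpoints(endpoints []string) []string {
	seen := make(map[string]struct{}, len(endpoints))
	out := make([]string, 0, len(endpoints))
	for _, ep := range endpoints {
		if _, dup := seen[ep]; dup {
			continue
		}
		seen[ep] = struct{}{}
		out = append(out, ep)
	}
	return out
}

func main() {
	fmt.Println(dedupeEndpoints([]string{"worker-0:8000", "worker-1:8000", "worker-0:8000"}))
	// prints [worker-0:8000 worker-1:8000]
}
```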
The default strategy "all" preserves backward compatibility - existing
DynamoModels continue to load on all endpoints as before.
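One way to implement the deterministic endpoint selection described above is rendezvous (highest-random-weight) hashing: every endpoint is ranked by a hash of the model/endpoint pair, so each controller replica computes the same target set without coordination. This is an illustrative sketch under stated assumptions; the function name mirrors the `SelectTargetEndpoints()` helper this PR mentions, but the signature and hashing details are not taken from the actual implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

// selectTargetEndpoints deterministically picks n endpoints for a model.
// Each endpoint's rank depends only on hash(modelName, endpoint), so the
// result is stable across controller restarts and input orderings, and
// different models spread across different subsets of workers.
func selectTargetEndpoints(modelName string, endpoints []string, n int) []string {
	if n >= len(endpoints) {
		out := append([]string(nil), endpoints...)
		sort.Strings(out)
		return out
	}
	type ranked struct {
		endpoint string
		score    uint64
	}
	rankedEndpoints := make([]ranked, 0, len(endpoints))
	for _, ep := range endpoints {
		sum := sha256.Sum256([]byte(modelName + "|" + ep))
		rankedEndpoints = append(rankedEndpoints, ranked{ep, binary.BigEndian.Uint64(sum[:8])})
	}
	// Highest score wins; ties broken by name for full determinism.
	sort.Slice(rankedEndpoints, func(i, j int) bool {
		if rankedEndpoints[i].score != rankedEndpoints[j].score {
			return rankedEndpoints[i].score > rankedEndpoints[j].score
		}
		return rankedEndpoints[i].endpoint < rankedEndpoints[j].endpoint
	})
	out := make([]string, n)
	for i := range out {
		out[i] = rankedEndpoints[i].endpoint
	}
	return out
}

func main() {
	eps := []string{"worker-0:8000", "worker-1:8000", "worker-2:8000"}
	fmt.Println(selectTargetEndpoints("guanaco", eps, 2))
}
```

Compared with keeping an explicit assignment table in status, hashing needs no stored state: the target set is recomputed identically on every reconcile.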
Example usage with fixed replicas:
```yaml
spec:
distribution:
strategy: fixed
replicas: 2
```
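The `percentage` strategy is presumably configured the same way; the exact field name below is an assumption for illustration, not taken from the CRD:

```yaml
spec:
  distribution:
    strategy: percentage
    percentage: 50   # hypothetical field name: load on ~50% of available workers
```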
Summary
Implements selective LoRA distribution to control which worker pods load each LoRA adapter. This enables horizontal scaling of LoRAs across multiple vLLM workers without exceeding per-worker memory/max-loras limits.
Problem
Current behavior loads ALL LoRAs on ALL vLLM workers. This doesn't scale when:
- the number of LoRAs exceeds the `--max-loras` capacity per worker

Solution
- Add `DistributionSpec` to the DynamoModel CRD with strategies: `all`, `fixed`, `percentage`

Changes
- Add `DistributionSpec`, `TargetEndpoints`, `AvailableEndpoints`, and helper methods
- Add `SelectTargetEndpoints()` for selective loading

Usage Example
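The usage example (lost in the page extraction) matches the one given in the commit message above:

```yaml
spec:
  distribution:
    strategy: fixed
    replicas: 2
```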
Backward Compatibility
- Default strategy `all`: existing DynamoModels work unchanged
- Selective loading applies only when a `distribution` spec is explicitly set

Test Plan
E2E Testing (3 workers)
Verified