feat: Add LoRA distribution across worker endpoints (#2)
LoRA adapters share a tokenizer/config with their base model, but the frontend was trying to download configs from HuggingFace using the LoRA's display_name (e.g., "Qwen/Qwen2.5-1.5B-Instruct:guanaco"), which is not a valid HuggingFace model ID, causing 401 errors. This fix:
- Adds a `base_model_name` field to ModelDeploymentCard
- Propagates the base model name through register_llm -> LocalModelBuilder -> MDC
- Updates download_config() to use base_model_name for HF downloads

When a LoRA is registered, the base model path is now stored in the MDC. The frontend uses this to download the correct tokenizer/config files. This fixes LoRA registration failures in distributed deployments where the worker and frontend run on different nodes with separate filesystems.
Implement selective LoRA distribution to control which worker pods load
each LoRA adapter. This enables horizontal scaling of LoRAs across
multiple vLLM workers without exceeding per-worker memory/max-loras limits.
Changes:
- Add DistributionSpec to DynamoModel CRD with strategies: all, fixed, percentage
- Add TargetEndpoints/AvailableEndpoints to status for observability
- Implement deterministic endpoint selection using consistent hashing
- Fix endpoint deduplication bug in discovery (same pod appearing twice)
- Controller now loads LoRA only on selected target endpoints
- Cleanup (FinalizeResource) unloads only from target endpoints
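The endpoint deduplication fix can be sketched as an order-preserving filter over discovered addresses; the function name and string-based endpoint type here are illustrative assumptions, not the controller's actual code:

```go
package main

import "fmt"

// dedupeEndpoints removes duplicate endpoint addresses while preserving
// discovery order, so a pod that appears twice in discovery (e.g. via two
// watch events) has its LoRA loaded only once.
func dedupeEndpoints(endpoints []string) []string {
	seen := make(map[string]struct{}, len(endpoints))
	out := make([]string, 0, len(endpoints))
	for _, ep := range endpoints {
		if _, dup := seen[ep]; dup {
			continue
		}
		seen[ep] = struct{}{}
		out = append(out, ep)
	}
	return out
}

func main() {
	fmt.Println(dedupeEndpoints([]string{"worker-0:8000", "worker-1:8000", "worker-0:8000"}))
	// prints [worker-0:8000 worker-1:8000]
}
```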
The default strategy "all" preserves backward compatibility - existing
DynamoModels continue to load on all endpoints as before.
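One way to implement the deterministic endpoint selection described above is rendezvous (highest-random-weight) hashing: every endpoint is ranked by a hash of the model/endpoint pair, so each controller replica computes the same target set without coordination. This is an illustrative sketch under stated assumptions; the function name mirrors the `SelectTargetEndpoints()` helper this PR mentions, but the signature and hashing details are not taken from the actual implementation:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
	"sort"
)

// selectTargetEndpoints deterministically picks n endpoints for a model.
// Each endpoint's rank depends only on hash(modelName, endpoint), so the
// result is stable across controller restarts and input orderings, and
// different models spread across different subsets of workers.
func selectTargetEndpoints(modelName string, endpoints []string, n int) []string {
	if n >= len(endpoints) {
		out := append([]string(nil), endpoints...)
		sort.Strings(out)
		return out
	}
	type ranked struct {
		endpoint string
		score    uint64
	}
	rankedEndpoints := make([]ranked, 0, len(endpoints))
	for _, ep := range endpoints {
		sum := sha256.Sum256([]byte(modelName + "|" + ep))
		rankedEndpoints = append(rankedEndpoints, ranked{ep, binary.BigEndian.Uint64(sum[:8])})
	}
	// Highest score wins; ties broken by name for full determinism.
	sort.Slice(rankedEndpoints, func(i, j int) bool {
		if rankedEndpoints[i].score != rankedEndpoints[j].score {
			return rankedEndpoints[i].score > rankedEndpoints[j].score
		}
		return rankedEndpoints[i].endpoint < rankedEndpoints[j].endpoint
	})
	out := make([]string, n)
	for i := range out {
		out[i] = rankedEndpoints[i].endpoint
	}
	return out
}

func main() {
	eps := []string{"worker-0:8000", "worker-1:8000", "worker-2:8000"}
	fmt.Println(selectTargetEndpoints("guanaco", eps, 2))
}
```

Compared with keeping an explicit assignment table in status, hashing needs no stored state: the target set is recomputed identically on every reconcile.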
Example usage with fixed replicas:
```yaml
spec:
distribution:
strategy: fixed
replicas: 2
```
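The `percentage` strategy is presumably configured the same way; the exact field name below is an assumption for illustration, not taken from the CRD:

```yaml
spec:
  distribution:
    strategy: percentage
    percentage: 50   # hypothetical field name: load on ~50% of available workers
```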
Summary
Implements selective LoRA distribution to control which worker pods load each LoRA adapter. This enables horizontal scaling of LoRAs across multiple vLLM workers without exceeding per-worker memory/max-loras limits.
Problem
Current behavior loads ALL LoRAs on ALL vLLM workers. This doesn't scale when:
- the number of LoRAs exceeds the `--max-loras` capacity per worker

Solution
- Add `DistributionSpec` to the DynamoModel CRD with strategies: `all`, `fixed`, `percentage`

Changes
- Add `DistributionSpec`, `TargetEndpoints`, `AvailableEndpoints`, and helper methods
- Add `SelectTargetEndpoints()` for selective loading

Usage Example
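The usage example (lost in the page extraction) matches the one given in the commit message above:

```yaml
spec:
  distribution:
    strategy: fixed
    replicas: 2
```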
Backward Compatibility
- Default strategy `all`: existing DynamoModels work unchanged
- Selective loading applies only when a `distribution` spec is explicitly set

Test Plan
E2E Testing (3 workers)
Verified