feat: Add LoRA distribution across worker endpoints #2

Open
AmeenP wants to merge 2 commits into main from feat/lora-distribution-load-balancing

Conversation


@AmeenP AmeenP commented Dec 15, 2025

Summary

Implements selective LoRA distribution to control which worker pods load each LoRA adapter. This enables horizontal scaling of LoRAs across multiple vLLM workers without exceeding per-worker memory/max-loras limits.

Problem

The current behavior loads ALL LoRAs on ALL vLLM workers. This doesn't scale when:

  1. Number of LoRAs exceeds --max-loras capacity per worker
  2. Memory becomes constrained with many LoRAs loaded

Solution

  • Add DistributionSpec to DynamoModel CRD with strategies: all, fixed, percentage
  • Implement deterministic endpoint selection using rendezvous hashing (HRW)
  • Controller loads LoRA only on selected target endpoints
  • Cleanup unloads only from target endpoints
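The rendezvous (highest-random-weight) selection described above can be sketched as follows. This is a minimal illustration of the technique, not the actual selector.go; the function and endpoint names are made up here:

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hrwSelect picks n endpoints for a model using rendezvous (HRW) hashing:
// each (model, endpoint) pair gets an FNV-1a score, and the n highest-scoring
// endpoints win. Recomputing on every reconciliation yields the same winners
// as long as the endpoint set is unchanged, so selection is deterministic.
func hrwSelect(model string, endpoints []string, n int) []string {
	type scored struct {
		ep    string
		score uint64
	}
	scores := make([]scored, 0, len(endpoints))
	for _, ep := range endpoints {
		h := fnv.New64a()
		h.Write([]byte(model))
		h.Write([]byte("/"))
		h.Write([]byte(ep))
		scores = append(scores, scored{ep, h.Sum64()})
	}
	// Highest score first; ties are impossible in practice with 64-bit hashes.
	sort.Slice(scores, func(i, j int) bool { return scores[i].score > scores[j].score })
	if n > len(scores) {
		n = len(scores)
	}
	out := make([]string, n)
	for i := range out {
		out[i] = scores[i].ep
	}
	return out
}

func main() {
	eps := []string{"worker-0", "worker-1", "worker-2"}
	fmt.Println(hrwSelect("my-lora", eps, 2))
}
```

A nice property of HRW over modulo-based hashing: when a worker leaves, only the LoRAs it held are reassigned; the rest keep their targets.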

Changes

  • dynamo_model_types.go - Added DistributionSpec, TargetEndpoints, AvailableEndpoints, helper methods
  • dynamo_model_controller.go - Integrated SelectTargetEndpoints() for selective loading
  • discovery.go - Fixed endpoint deduplication bug (same pod appearing twice from multiple EndpointSlices)
  • selector.go (new) - Deterministic endpoint selection using FNV-1a consistent hashing
  • selector_test.go (new) - Comprehensive unit tests
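The deduplication fix in discovery.go can be sketched like this: key discovered endpoints on pod identity so that a pod appearing in multiple EndpointSlices is only counted once. The types and names below are illustrative, not the controller's actual ones:

```go
package main

import "fmt"

// Endpoint is a minimal stand-in for a discovered worker endpoint record.
type Endpoint struct {
	PodName string
	Address string
}

// dedupeEndpoints keeps the first record seen for each pod. The same pod can
// surface in more than one EndpointSlice, and without deduplication the
// controller would issue duplicate load/unload calls to that worker.
func dedupeEndpoints(eps []Endpoint) []Endpoint {
	seen := make(map[string]bool, len(eps))
	out := make([]Endpoint, 0, len(eps))
	for _, ep := range eps {
		if seen[ep.PodName] {
			continue // same pod reported by another EndpointSlice
		}
		seen[ep.PodName] = true
		out = append(out, ep)
	}
	return out
}

func main() {
	eps := []Endpoint{
		{"worker-0", "10.0.0.1:8000"},
		{"worker-1", "10.0.0.2:8000"},
		{"worker-0", "10.0.0.1:8000"}, // duplicate from a second slice
	}
	fmt.Println(len(dedupeEndpoints(eps)))
}
```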

Usage Example

apiVersion: nvidia.com/v1alpha1
kind: DynamoModel
metadata:
  name: my-lora
spec:
  modelName: Qwen/Qwen2.5-1.5B-Instruct:my-adapter
  baseModelName: Qwen/Qwen2.5-1.5B-Instruct
  modelType: lora
  source:
    uri: s3://bucket/path/to/lora
  distribution:
    strategy: fixed
    replicas: 2  # Load on exactly 2 workers

Backward Compatibility

  • Default strategy is all - existing DynamoModels work unchanged
  • Only changes behavior when distribution spec is explicitly set

Test Plan

E2E Testing (3 workers)

| DynamoModel | Strategy | Config | Loaded On | Expected Result |
|---|---|---|---|---|
| guanaco-lora | all | - | All 3 workers | All 3 |
| test-distribution-fixed | fixed | replicas=2 | 2 workers | 2 of 3 |
| test-distribution-percentage | percentage | 34% | 1 worker | 1 of 3 |
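One way to resolve a strategy to a target count, consistent with the table above (34% of 3 workers selects 1). The rounding rule max(1, floor(pct*total/100)) is an assumption for illustration, not necessarily the controller's exact formula:

```go
package main

import "fmt"

// replicaCount resolves a distribution strategy to a target worker count.
// ASSUMPTION: percentage uses max(1, floor(pct*total/100)); this matches the
// E2E table (34% of 3 -> 1) but the real implementation may round differently.
func replicaCount(strategy string, replicas, percentage, total int) int {
	switch strategy {
	case "fixed":
		if replicas > total {
			return total // cannot load on more workers than exist
		}
		return replicas
	case "percentage":
		n := percentage * total / 100 // integer division floors
		if n < 1 {
			n = 1 // always load on at least one worker
		}
		return n
	default: // "all" — the backward-compatible default
		return total
	}
}

func main() {
	fmt.Println(replicaCount("percentage", 0, 34, 3))
}
```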

Verified

  • Distribution correctly limits LoRA loading to N of M workers
  • Deterministic selection stable across reconciliations
  • Inference works on distributed LoRAs
  • Cleanup only unloads from target endpoints
  • Deduplication fix working (no duplicate endpoints)
  • Unit tests pass

LoRA adapters share tokenizer/config with their base model, but the
frontend was trying to download configs from HuggingFace using the
LoRA's display_name (e.g., "Qwen/Qwen2.5-1.5B-Instruct:guanaco") which
is not a valid HuggingFace model ID, causing 401 errors.

This fix:
- Adds `base_model_name` field to ModelDeploymentCard
- Propagates base model name through register_llm -> LocalModelBuilder -> MDC
- Updates download_config() to use base_model_name for HF downloads

When a LoRA is registered, the base model path is now stored in the MDC.
The frontend uses this to download the correct tokenizer/config files.

Fixes LoRA registration failures in distributed deployments where
worker and frontend run on different nodes with separate filesystems.
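The fallback described above amounts to a simple preference when choosing the HuggingFace download ID. This Go sketch only illustrates the decision; the real logic lives in the Rust frontend's download_config(), and the field names here are illustrative:

```go
package main

import "fmt"

// hfDownloadID returns the model ID used for tokenizer/config downloads.
// A LoRA's display name carries an ":adapter" suffix and is not a valid
// HuggingFace model ID, so when the MDC records a base model name we use
// that instead; base models fall through to their own name.
func hfDownloadID(displayName, baseModelName string) string {
	if baseModelName != "" {
		return baseModelName // LoRA shares the base model's tokenizer/config
	}
	return displayName
}

func main() {
	fmt.Println(hfDownloadID("Qwen/Qwen2.5-1.5B-Instruct:guanaco", "Qwen/Qwen2.5-1.5B-Instruct"))
}
```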

Implement selective LoRA distribution to control which worker pods load
each LoRA adapter. This enables horizontal scaling of LoRAs across
multiple vLLM workers without exceeding per-worker memory/max-loras limits.

Changes:
- Add DistributionSpec to DynamoModel CRD with strategies: all, fixed, percentage
- Add TargetEndpoints/AvailableEndpoints to status for observability
- Implement deterministic endpoint selection using consistent hashing
- Fix endpoint deduplication bug in discovery (same pod appearing twice)
- Controller now loads LoRA only on selected target endpoints
- Cleanup (FinalizeResource) unloads only from target endpoints

The default strategy "all" preserves backward compatibility - existing
DynamoModels continue to load on all endpoints as before.

Example usage with fixed replicas:
```yaml
spec:
  distribution:
    strategy: fixed
    replicas: 2
```