docs: KEP-2839 Dynamic LLM Trainer Framework — ConfigTrainer for BaseTrainer hierarchy #3263
Open
NarayanaSabari wants to merge 9 commits into kubeflow:master from
Conversation
- Strip all Low-Level Design content (code interfaces, strategies, Dockerfile, runtime YAML, Helm chart details)
- Fix 10 technical inaccuracies found during audit:
  - TRL CLI entry point (`trl sft`, not `python -m trl`)
  - Multi-node env vars (standard + PET variants)
  - Correct `enforceTorchTunePolicy` inline location
  - `dependsOn` YAML format, volume handling pattern
  - `TRLTrainerType` enum values (SFT/DPO/KTO/GRPO)
  - Container name 'node', not 'trainer'
  - PET env var naming conventions
- KEP now covers: Summary, Goals, Non-Goals, Current State Analysis, High-Level Design, Test Plan, Risks, Phases
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
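As a quick illustration of the corrected entry point noted above: TRL ships a console script, so a config-driven trainer composes a `trl sft ...` command line rather than `python -m trl`. A minimal sketch; the helper name and flag set here are illustrative, not the actual SDK API:

```python
def build_trl_command(model: str, dataset: str) -> list[str]:
    # TRL exposes a console script, so the entry point is `trl sft ...`
    # rather than `python -m trl`. Flag names below are illustrative,
    # not an exhaustive TRL CLI reference.
    return [
        "trl", "sft",
        f"--model_name_or_path={model}",
        f"--dataset_name={dataset}",
    ]
```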
…P-285 alignment

- Remove `@register_backend` decorator and backend registry (YAGNI with 2 backends)
- Change `to_command()` method to `command: ClassVar[tuple[str, ...]]`
- Move `num_nodes`/`resources_per_node` to `LLMBackend` base class
- Add "Relationship to KEP-285" section for config-driven vs function-based trainers
- Simplify `KubernetesBackend` integration (no `hasattr` checks)
- Remove stale Phase 1/Phase 2 references from Risks table
- Goals reduced from 7 to 5
Replace the standalone `LLMBackend` ABC with `ConfigTrainer`, which integrates into KEP-285's `BaseTrainer` type hierarchy, directly answering open questions from maintainers about how config-driven trainers fit alongside function-based trainers.

Key changes:
- `LLMBackend` → `ConfigTrainer(BaseTrainer)` for a unified type hierarchy
- `LLMBackendStrategy` → `FrameworkStrategy` (matches the framework label convention)
- `TorchTuneConfig` → `TorchTuneTrainer` with a backward-compatible alias
- `TRLConfig` → `TRLTrainer` with runtime auto-discovery support
- Added detailed KEP-285 relationship section with maintainer references
- Added implementation history and KEP.yaml-style metadata

Tracking issue: kubeflow#2839
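The backward-compatible alias mentioned above could be kept as a thin deprecation shim so existing user code continues to work. A sketch under assumed field names (the `recipe` field and warning text are illustrative, not the actual SDK):

```python
import warnings
from dataclasses import dataclass


@dataclass
class TorchTuneTrainer:
    """Renamed from TorchTuneConfig; the field here is illustrative."""
    recipe: str = "full_finetune_distributed"


def TorchTuneConfig(*args, **kwargs):
    """Deprecated alias kept so existing imports keep working."""
    warnings.warn(
        "TorchTuneConfig is deprecated; use TorchTuneTrainer instead.",
        DeprecationWarning,
        stacklevel=2,
    )
    return TorchTuneTrainer(*args, **kwargs)
```

A callable shim (rather than a bare `TorchTuneConfig = TorchTuneTrainer` assignment) lets the SDK emit a `DeprecationWarning` at the call site before the old name is removed.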
added 3 commits (March 31, 2026 16:05)
Based on mentor feedback, ConfigTrainer is now a standalone ABC rather than a subclass of KEP-285's BaseTrainer. This avoids LSP violations (dead `get_train_func()` methods) and allows both hierarchies to evolve independently.

Key architectural changes:
- `ConfigTrainer` and `BaseTrainer` are separate ABCs for separate patterns (config-driven vs function-based)
- Both are accepted through the same `TrainerClient.train(trainer=...)` parameter for a flat, unified user experience
- No inheritance relationship — clean separation of concerns

Also adds an "Alternatives Considered" section documenting the unified-hierarchy option and why it was rejected.
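The separation described above can be sketched as two unrelated ABCs accepted through one parameter. This is a minimal illustration, not the actual SDK: the class and method names follow this PR's text (`ConfigTrainer`, `BaseTrainer`, `TrainerClient.train`, the `command` ClassVar), but `get_args`, the field names, and the dispatch body are assumptions:

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import ClassVar, Union


class BaseTrainer(ABC):
    """KEP-285 function-based trainers: the user supplies a train function."""

    @abstractmethod
    def get_train_func(self): ...


class ConfigTrainer(ABC):
    """Config-driven trainers: a declarative config rendered into a CLI command."""

    command: ClassVar[tuple[str, ...]]  # e.g. ("tune", "run") or ("trl", "sft")

    @abstractmethod
    def get_args(self) -> list[str]: ...


@dataclass
class TRLTrainer(ConfigTrainer):
    command: ClassVar[tuple[str, ...]] = ("trl", "sft")
    model: str = "Qwen/Qwen2-0.5B"
    extra_args: dict = field(default_factory=dict)

    def get_args(self) -> list[str]:
        args = [f"--model_name_or_path={self.model}"]
        args += [f"--{k}={v}" for k, v in sorted(self.extra_args.items())]
        return args


class TrainerClient:
    def train(self, trainer: Union[BaseTrainer, ConfigTrainer]) -> list[str]:
        # Both ABCs flow through the same parameter; dispatch is by isinstance,
        # not by a shared base class, so neither hierarchy constrains the other.
        if isinstance(trainer, ConfigTrainer):
            return [*trainer.command, *trainer.get_args()]
        raise NotImplementedError("function-based path elided in this sketch")
```

Because `Union[BaseTrainer, ConfigTrainer]` is the parameter type, the user-facing API stays flat even though the hierarchies share no ancestor.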
Add three diagrams:
- SDK type hierarchy showing LLMTrainer and BaseTrainer as separate ABCs
- End-to-end system architecture from the Python SDK to Kubernetes pods
- Go Torch plugin strategy dispatch flow
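The strategy dispatch the third diagram covers can be sketched as a lookup on the framework label, replacing the old command-sniffing. The real plugin is Go; this Python sketch keeps one language across the examples. The strategy and label names follow this PR's text, while the method signature and env var names are illustrative assumptions:

```python
from abc import ABC, abstractmethod

FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"


class FrameworkStrategy(ABC):
    @abstractmethod
    def env(self, num_nodes: int) -> dict[str, str]: ...


class TorchTuneStrategy(FrameworkStrategy):
    def env(self, num_nodes: int) -> dict[str, str]:
        return {"NNODES": str(num_nodes)}  # placeholder for the existing logic


class TRLStrategy(FrameworkStrategy):
    def env(self, num_nodes: int) -> dict[str, str]:
        # accelerate-compatible variables, as the commit message describes;
        # the exact variable name here is illustrative
        return {"ACCELERATE_NUM_MACHINES": str(num_nodes)}


STRATEGIES: dict[str, FrameworkStrategy] = {
    "torchtune": TorchTuneStrategy(),
    "trl": TRLStrategy(),
}


def dispatch(labels: dict[str, str]) -> FrameworkStrategy:
    # Route purely on the framework label instead of sniffing the command line.
    return STRATEGIES[labels[FRAMEWORK_LABEL]]
```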
## Summary
Redesigned KEP-2839: Dynamic LLM Trainer Framework proposal.
This KEP introduces a pluggable config-driven trainer framework for LLM fine-tuning, aligned with KEP-285 (Specialized Trainer Abstractions).
## What This KEP Proposes

### SDK (Python)

- `ConfigTrainer` base class within KEP-285's `BaseTrainer` hierarchy for config-driven trainers
- `TorchTuneTrainer(ConfigTrainer)` — refactored from `TorchTuneConfig` with a backward-compatible alias
- `TRLTrainer(ConfigTrainer)` — new backend with SFT/DPO/KTO/GRPO support
- `trainer.kubeflow.org/framework` label
- `BuiltinTrainer.config` type widened from `TorchTuneConfig` to `ConfigTrainer`

### Go Control Plane

- `FrameworkStrategy` interface replacing hardcoded TorchTune command-sniffing in the Torch plugin
- `TorchTuneStrategy` (wraps existing logic unchanged)
- `TRLStrategy` (accelerate-compatible env var injection)
- `trainer.kubeflow.org/framework`

### Infrastructure

- (`cmd/trainers/trl/`)
- `ClusterTrainingRuntime` manifests

### Relationship to KEP-285

This KEP answers the open question from the KEP-285 review about how config-driven trainers fit into the `BaseTrainer` hierarchy:

### Non-Goals

- `BuiltinTrainer` or `CustomTrainer`

### Test Plan
Builds on KEP-2401 and the community consensus on "Plan 3" from #2752.
Tracking issue: #2839
/cc @Electronic-Waste @andreyvelich @tariq-hasan @szaher