docs: KEP-2839 Dynamic LLM Trainer Framework — ConfigTrainer for BaseTrainer hierarchy#3263

Open
NarayanaSabari wants to merge 9 commits into kubeflow:master from NarayanaSabari:kep-2839-dynamic-llm-trainer
Conversation

@NarayanaSabari NarayanaSabari commented Feb 27, 2026

Summary

This PR redesigns the KEP-2839: Dynamic LLM Trainer Framework proposal.

The KEP introduces a pluggable, config-driven trainer framework for LLM fine-tuning, aligned with KEP-285 (Specialized Trainer Abstractions).

What This KEP Proposes

SDK (Python)

  • ConfigTrainer base class within KEP-285's BaseTrainer hierarchy for config-driven trainers
  • TorchTuneTrainer(ConfigTrainer) — refactored from TorchTuneConfig with backward-compatible alias
  • TRLTrainer(ConfigTrainer) — new backend with SFT/DPO/KTO/GRPO support
  • Runtime auto-discovery via trainer.kubeflow.org/framework label
  • BuiltinTrainer.config type widened from TorchTuneConfig to ConfigTrainer
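The SDK surface above can be sketched in Python as follows. This is an illustrative outline only, not the KEP's actual interface: the field names (`config_path`, `trainer_type`) are assumptions, and only the class names and the backward-compatible alias come from this summary.

```python
from abc import ABC
from dataclasses import dataclass

class ConfigTrainer(ABC):
    """Base class for config-driven trainers (sketch; see the KEP for the real interface)."""

@dataclass
class TorchTuneTrainer(ConfigTrainer):
    config_path: str = ""  # hypothetical field, for illustration only

# Backward-compatible alias, as the summary describes
TorchTuneConfig = TorchTuneTrainer

@dataclass
class TRLTrainer(ConfigTrainer):
    trainer_type: str = "SFT"  # one of SFT/DPO/KTO/GRPO per this summary
```

Under this sketch, `BuiltinTrainer.config` accepting any `ConfigTrainer` means both trainers (and future backends) pass through the same field.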

Go Control Plane

  • FrameworkStrategy interface replacing hardcoded TorchTune command-sniffing in the Torch plugin
  • TorchTuneStrategy (wraps existing logic unchanged)
  • TRLStrategy (accelerate-compatible env var injection)
  • Label-based dispatch via trainer.kubeflow.org/framework
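The label-based dispatch can be illustrated schematically in Python (the real implementation is a Go plugin; every name below except the `trainer.kubeflow.org/framework` label key is a stand-in):

```python
FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"

class FrameworkStrategy:
    """Schematic stand-in for the Go FrameworkStrategy interface."""
    def apply(self, job: dict) -> dict:
        raise NotImplementedError

class TorchTuneStrategy(FrameworkStrategy):
    def apply(self, job: dict) -> dict:
        # Wraps the existing TorchTune logic unchanged
        return job

class TRLStrategy(FrameworkStrategy):
    def apply(self, job: dict) -> dict:
        job = dict(job)
        job.setdefault("env", {})  # accelerate-compatible env vars injected here
        return job

STRATEGIES: dict[str, FrameworkStrategy] = {
    "torchtune": TorchTuneStrategy(),
    "trl": TRLStrategy(),
}

def dispatch(runtime_labels: dict) -> FrameworkStrategy:
    """Select a strategy from the runtime's framework label instead of command-sniffing."""
    return STRATEGIES[runtime_labels[FRAMEWORK_LABEL]]
```

The point of the sketch is the selection mechanism: the strategy is chosen from a declared label, so adding a backend means registering one entry rather than extending command-sniffing conditionals.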

Infrastructure

  • TRL container image (cmd/trainers/trl/)
  • TRL ClusterTrainingRuntime manifests
  • Helm chart additions for TRL runtimes

Relationship to KEP-285

This KEP answers the open question from the KEP-285 review about how config-driven trainers fit into the BaseTrainer hierarchy:

                    BaseTrainer (ABC)                ← KEP-285
                         │
         ┌───────────────┼───────────────┐
         │               │               │
   TorchTrainer     JAXTrainer     ConfigTrainer      ← This KEP
   (func-based)     (func-based)   (config-driven)
                                        │
                         ┌──────────────┼──────────────┐
                         │              │              │
                  TorchTuneTrainer  TRLTrainer    (future backends)

Non-Goals

  • Unsloth or LlamaFactory backends (future work)
  • CRD schema changes
  • Deprecating BuiltinTrainer or CustomTrainer

Test Plan

  • Unit tests for ConfigTrainer interface, backend registry, TRL arg generation, Go strategy dispatch
  • Integration tests for TRL TrainJob reconciliation
  • E2E tests for TRL SFT/DPO on GPU and TorchTune regression

Builds on KEP-2401 and the community consensus on "Plan 3" from #2752.

Tracking issue: #2839

/cc @Electronic-Waste @andreyvelich @tariq-hasan @szaher

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

- Strip all Low-Level Design content (code interfaces, strategies,
  Dockerfile, runtime YAML, Helm chart details)
- Fix 10 technical inaccuracies found during audit:
  - TRL CLI entry point (trl sft, not python -m trl)
  - Multi-node env vars (standard + PET variants)
  - Correct enforceTorchTunePolicy inline location
  - dependsOn YAML format, volume handling pattern
  - TRLTrainerType enum values (SFT/DPO/KTO/GRPO)
  - Container name 'node' not 'trainer'
  - PET env var naming conventions
- KEP now covers: Summary, Goals, Non-Goals, Current State
  Analysis, High-Level Design, Test Plan, Risks, Phases
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
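The multi-node env-var fix listed in the commit above can be sketched as follows. This is a sketch under the assumption that torchrun reads `PET_`-prefixed variants of its rendezvous settings while other frameworks read the standard names; the helper name and exact variable set are illustrative, not the KEP's.

```python
def multi_node_env(node_rank: int, master_addr: str, master_port: int) -> dict:
    """Emit both standard and PET-prefixed rendezvous variables (illustrative)."""
    standard = {
        "NODE_RANK": str(node_rank),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }
    # PET_ variants are the env-var equivalents of torchrun's CLI flags
    pet = {f"PET_{key}": value for key, value in standard.items()}
    return {**standard, **pet}
```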
…P-285 alignment

- Remove @register_backend decorator and backend registry (YAGNI with 2 backends)
- Change to_command() method to command: ClassVar[tuple[str, ...]]
- Move num_nodes/resources_per_node to LLMBackend base class
- Add Relationship to KEP-285 section for config-driven vs function-based trainers
- Simplify KubernetesBackend integration (no hasattr checks)
- Remove stale Phase 1/Phase 2 references from Risks table
- Goals reduced from 7 to 5
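The `to_command()` → `command: ClassVar[...]` change can be sketched like this (class names are from this PR; the tuple value follows the `trl sft` entry point, and everything else is illustrative):

```python
from typing import ClassVar

class ConfigTrainer:
    # Declarative class-level command, replacing a per-instance to_command() method
    command: ClassVar[tuple[str, ...]] = ()

class TRLTrainer(ConfigTrainer):
    command: ClassVar[tuple[str, ...]] = ("trl", "sft")
```

A `ClassVar` makes the command a static property of the backend class, inspectable without constructing an instance.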
Replace standalone LLMBackend ABC with ConfigTrainer that integrates into
KEP-285's BaseTrainer type hierarchy, directly answering open questions from
maintainers about how config-driven trainers fit alongside function-based
trainers.

Key changes:
- LLMBackend → ConfigTrainer(BaseTrainer) for unified type hierarchy
- LLMBackendStrategy → FrameworkStrategy (matches framework label convention)
- TorchTuneConfig → TorchTuneTrainer with backward-compatible alias
- TRLConfig → TRLTrainer with runtime auto-discovery support
- Added detailed KEP-285 relationship section with maintainer references
- Added implementation history and KEP.yaml-style metadata

Tracking issue: kubeflow#2839
@NarayanaSabari NarayanaSabari changed the title docs: add KEP-2839 Dynamic LLM Trainer Framework proposal docs: KEP-2839 Dynamic LLM Trainer Framework — ConfigTrainer for BaseTrainer hierarchy Mar 28, 2026
@NarayanaSabari NarayanaSabari marked this pull request as ready for review March 28, 2026 16:11
Sabari added 3 commits March 31, 2026 16:05
Based on mentor feedback, ConfigTrainer is now a standalone ABC rather than
a subclass of KEP-285's BaseTrainer. This avoids LSP violations (dead
get_train_func() methods) and allows both hierarchies to evolve independently.

Key architectural change:
- ConfigTrainer and BaseTrainer are separate ABCs for separate patterns
  (config-driven vs function-based)
- Both accepted through same TrainerClient.train(trainer=...) parameter
  for flat, unified user experience
- No inheritance relationship — clean separation of concerns

Also adds Alternatives Considered section documenting the unified hierarchy
option and why it was rejected.
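The separation described in this commit might look like the following sketch. `TrainerClient` internals and the return values are assumptions for illustration; only the two-ABC structure and the shared `train(trainer=...)` parameter come from the commit message.

```python
from abc import ABC, abstractmethod
from typing import Union

class BaseTrainer(ABC):
    """Function-based trainers (KEP-285)."""
    @abstractmethod
    def get_train_func(self): ...

class ConfigTrainer(ABC):
    """Config-driven trainers — deliberately no get_train_func(), so no dead methods."""

class TorchTrainer(BaseTrainer):
    def get_train_func(self):
        return lambda: None

class TRLTrainer(ConfigTrainer):
    pass

class TrainerClient:
    def train(self, trainer: Union[BaseTrainer, ConfigTrainer]) -> str:
        # Both hierarchies flow through the same parameter for a flat UX
        if isinstance(trainer, ConfigTrainer):
            return "config-driven"
        return "function-based"
```

Because the two ABCs share no base class, neither is forced to stub out the other's methods, which is the LSP concern the commit message cites.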
Add three diagrams:
- SDK type hierarchy showing LLMTrainer and BaseTrainer as separate ABCs
- End-to-end system architecture from Python SDK to Kubernetes pods
- Go Torch plugin strategy dispatch flow
