docs: KEP-2839 Dynamic LLM Trainer Framework — ConfigTrainer for BaseTrainer hierarchy#3263

Open
NarayanaSabari wants to merge 9 commits into kubeflow:master from NarayanaSabari:kep-2839-dynamic-llm-trainer
Conversation

@NarayanaSabari NarayanaSabari commented Feb 27, 2026

Summary

This PR redesigns the KEP-2839: Dynamic LLM Trainer Framework proposal.

The KEP introduces a pluggable, config-driven trainer framework for LLM fine-tuning, aligned with KEP-285 (Specialized Trainer Abstractions).

What This KEP Proposes

SDK (Python)

  • ConfigTrainer base class within KEP-285's BaseTrainer hierarchy for config-driven trainers
  • TorchTuneTrainer(ConfigTrainer) — refactored from TorchTuneConfig with backward-compatible alias
  • TRLTrainer(ConfigTrainer) — new backend with SFT/DPO/KTO/GRPO support
  • Runtime auto-discovery via trainer.kubeflow.org/framework label
  • BuiltinTrainer.config type widened from TorchTuneConfig to ConfigTrainer
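The SDK surface above can be sketched in Python as follows. This is an illustrative outline only, not the KEP's actual interface: the field names (`config_path`, `trainer_type`) are assumptions, and only the class names and the backward-compatible alias come from this summary.

```python
from abc import ABC
from dataclasses import dataclass

class ConfigTrainer(ABC):
    """Base class for config-driven trainers (sketch; see the KEP for the real interface)."""

@dataclass
class TorchTuneTrainer(ConfigTrainer):
    config_path: str = ""  # hypothetical field, for illustration only

# Backward-compatible alias, as the summary describes
TorchTuneConfig = TorchTuneTrainer

@dataclass
class TRLTrainer(ConfigTrainer):
    trainer_type: str = "SFT"  # one of SFT/DPO/KTO/GRPO per this summary
```

Under this sketch, `BuiltinTrainer.config` accepting any `ConfigTrainer` means both trainers (and future backends) pass through the same field.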

Go Control Plane

  • FrameworkStrategy interface replacing hardcoded TorchTune command-sniffing in the Torch plugin
  • TorchTuneStrategy (wraps existing logic unchanged)
  • TRLStrategy (accelerate-compatible env var injection)
  • Label-based dispatch via trainer.kubeflow.org/framework
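The label-based dispatch can be illustrated schematically in Python (the real implementation is a Go plugin; every name below except the `trainer.kubeflow.org/framework` label key is a stand-in):

```python
FRAMEWORK_LABEL = "trainer.kubeflow.org/framework"

class FrameworkStrategy:
    """Schematic stand-in for the Go FrameworkStrategy interface."""
    def apply(self, job: dict) -> dict:
        raise NotImplementedError

class TorchTuneStrategy(FrameworkStrategy):
    def apply(self, job: dict) -> dict:
        # Wraps the existing TorchTune logic unchanged
        return job

class TRLStrategy(FrameworkStrategy):
    def apply(self, job: dict) -> dict:
        job = dict(job)
        job.setdefault("env", {})  # accelerate-compatible env vars injected here
        return job

STRATEGIES: dict[str, FrameworkStrategy] = {
    "torchtune": TorchTuneStrategy(),
    "trl": TRLStrategy(),
}

def dispatch(runtime_labels: dict) -> FrameworkStrategy:
    """Select a strategy from the runtime's framework label instead of command-sniffing."""
    return STRATEGIES[runtime_labels[FRAMEWORK_LABEL]]
```

The point of the sketch is the selection mechanism: the strategy is chosen from a declared label, so adding a backend means registering one entry rather than extending command-sniffing conditionals.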

Infrastructure

  • TRL container image (cmd/trainers/trl/)
  • TRL ClusterTrainingRuntime manifests
  • Helm chart additions for TRL runtimes

Relationship to KEP-285

This KEP answers the open question from the KEP-285 review about how config-driven trainers fit into the BaseTrainer hierarchy:

                    BaseTrainer (ABC)                ← KEP-285
                         │
         ┌───────────────┼───────────────┐
         │               │               │
   TorchTrainer     JAXTrainer     ConfigTrainer      ← This KEP
   (func-based)     (func-based)   (config-driven)
                                        │
                         ┌──────────────┼──────────────┐
                         │              │              │
                  TorchTuneTrainer  TRLTrainer    (future backends)

Non-Goals

  • Unsloth or LlamaFactory backends (future work)
  • CRD schema changes
  • Deprecating BuiltinTrainer or CustomTrainer

Test Plan

  • Unit tests for ConfigTrainer interface, backend registry, TRL arg generation, Go strategy dispatch
  • Integration tests for TRL TrainJob reconciliation
  • E2E tests for TRL SFT/DPO on GPU and TorchTune regression

Builds on KEP-2401 and the community consensus on "Plan 3" from #2752.

Tracking issue: #2839

/cc @Electronic-Waste @andreyvelich @tariq-hasan @szaher

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign terrytangyuan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

- Strip all Low-Level Design content (code interfaces, strategies,
  Dockerfile, runtime YAML, Helm chart details)
- Fix 10 technical inaccuracies found during audit:
  - TRL CLI entry point (trl sft, not python -m trl)
  - Multi-node env vars (standard + PET variants)
  - Correct enforceTorchTunePolicy inline location
  - dependsOn YAML format, volume handling pattern
  - TRLTrainerType enum values (SFT/DPO/KTO/GRPO)
  - Container name 'node' not 'trainer'
  - PET env var naming conventions
- KEP now covers: Summary, Goals, Non-Goals, Current State
  Analysis, High-Level Design, Test Plan, Risks, Phases
Signed-off-by: Sabari Narayana <sabarinarayanakg@proton.me>
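The multi-node env-var fix listed in the commit above can be sketched as follows. This is a sketch under the assumption that torchrun reads `PET_`-prefixed variants of its rendezvous settings while other frameworks read the standard names; the helper name and exact variable set are illustrative, not the KEP's.

```python
def multi_node_env(node_rank: int, master_addr: str, master_port: int) -> dict:
    """Emit both standard and PET-prefixed rendezvous variables (illustrative)."""
    standard = {
        "NODE_RANK": str(node_rank),
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
    }
    # PET_ variants are the env-var equivalents of torchrun's CLI flags
    pet = {f"PET_{key}": value for key, value in standard.items()}
    return {**standard, **pet}
```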
…P-285 alignment

- Remove @register_backend decorator and backend registry (YAGNI with 2 backends)
- Change to_command() method to command: ClassVar[tuple[str, ...]]
- Move num_nodes/resources_per_node to LLMBackend base class
- Add Relationship to KEP-285 section for config-driven vs function-based trainers
- Simplify KubernetesBackend integration (no hasattr checks)
- Remove stale Phase 1/Phase 2 references from Risks table
- Goals reduced from 7 to 5
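The `to_command()` → `command: ClassVar[...]` change can be sketched like this (class names are from this PR; the tuple value follows the `trl sft` entry point, and everything else is illustrative):

```python
from typing import ClassVar

class ConfigTrainer:
    # Declarative class-level command, replacing a per-instance to_command() method
    command: ClassVar[tuple[str, ...]] = ()

class TRLTrainer(ConfigTrainer):
    command: ClassVar[tuple[str, ...]] = ("trl", "sft")
```

A `ClassVar` makes the command a static property of the backend class, inspectable without constructing an instance.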
Replace standalone LLMBackend ABC with ConfigTrainer that integrates into
KEP-285's BaseTrainer type hierarchy, directly answering open questions from
maintainers about how config-driven trainers fit alongside function-based
trainers.

Key changes:
- LLMBackend → ConfigTrainer(BaseTrainer) for unified type hierarchy
- LLMBackendStrategy → FrameworkStrategy (matches framework label convention)
- TorchTuneConfig → TorchTuneTrainer with backward-compatible alias
- TRLConfig → TRLTrainer with runtime auto-discovery support
- Added detailed KEP-285 relationship section with maintainer references
- Added implementation history and KEP.yaml-style metadata

Tracking issue: kubeflow#2839
@NarayanaSabari NarayanaSabari changed the title docs: add KEP-2839 Dynamic LLM Trainer Framework proposal docs: KEP-2839 Dynamic LLM Trainer Framework — ConfigTrainer for BaseTrainer hierarchy Mar 28, 2026
@NarayanaSabari NarayanaSabari marked this pull request as ready for review March 28, 2026 16:11
Sabari added 3 commits March 31, 2026 16:05
Based on mentor feedback, ConfigTrainer is now a standalone ABC rather than
a subclass of KEP-285's BaseTrainer. This avoids LSP violations (dead
get_train_func() methods) and allows both hierarchies to evolve independently.

Key architectural change:
- ConfigTrainer and BaseTrainer are separate ABCs for separate patterns
  (config-driven vs function-based)
- Both accepted through same TrainerClient.train(trainer=...) parameter
  for flat, unified user experience
- No inheritance relationship — clean separation of concerns

Also adds Alternatives Considered section documenting the unified hierarchy
option and why it was rejected.
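The separation described in this commit might look like the following sketch. `TrainerClient` internals and the return values are assumptions for illustration; only the two-ABC structure and the shared `train(trainer=...)` parameter come from the commit message.

```python
from abc import ABC, abstractmethod
from typing import Union

class BaseTrainer(ABC):
    """Function-based trainers (KEP-285)."""
    @abstractmethod
    def get_train_func(self): ...

class ConfigTrainer(ABC):
    """Config-driven trainers — deliberately no get_train_func(), so no dead methods."""

class TorchTrainer(BaseTrainer):
    def get_train_func(self):
        return lambda: None

class TRLTrainer(ConfigTrainer):
    pass

class TrainerClient:
    def train(self, trainer: Union[BaseTrainer, ConfigTrainer]) -> str:
        # Both hierarchies flow through the same parameter for a flat UX
        if isinstance(trainer, ConfigTrainer):
            return "config-driven"
        return "function-based"
```

Because the two ABCs share no base class, neither is forced to stub out the other's methods, which is the LSP concern the commit message cites.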
Add three diagrams:
- SDK type hierarchy showing LLMTrainer and BaseTrainer as separate ABCs
- End-to-end system architecture from Python SDK to Kubernetes pods
- Go Torch plugin strategy dispatch flow
