feat(docs): add dedicated documentation website by Sridhar1030 · Pull Request #3308 · kubeflow/trainer

Sridhar1030 · 2026-03-11T08:53:39Z

What this PR does / why we need it:

Adds a dedicated Sphinx documentation website for Kubeflow Trainer, as proposed in #3255.

Built with Sphinx (Furo theme) + MyST Markdown, designed for Read the Docs hosting.

Status: Work in Progress -- documentation content is still being completed.

What's done:

Sphinx project setup with Furo theme and MyST Markdown
ReadTheDocs integration (.readthedocs.yaml)
Site structure: Overview, Getting Started, User Guides, Operator Guides,
Contributor Guides, API Reference, Legacy v1
Makefile targets (make docs, make docs-serve, make docs-clean)
Completed the page contents

WebsiteRecording.mov

What's remaining:

Review the page contents
API Reference pages (pending SDK integration)
Contributor guide content
Legacy v1 documentation content
Final review of all user/operator guides

How to test locally:

cd docs
pip install -r requirements.txt
make html
python3 -m http.server 8000 --directory _build/html
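
For anyone who prefers to preview the built site from a script, the last step above can be approximated with the standard library. This is only a sketch: the function names `make_handler` and `serve` are made up for illustration, and unlike the `http.server` command line (which listens on all interfaces by default) it binds to 127.0.0.1 only.

```python
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def make_handler(directory: str = "_build/html"):
    """Return a handler factory that serves files from `directory`."""
    # SimpleHTTPRequestHandler accepts a `directory` keyword (Python 3.7+).
    return partial(SimpleHTTPRequestHandler, directory=directory)


def serve(directory: str = "_build/html", port: int = 8000) -> None:
    """Serve the built site on localhost until interrupted with Ctrl+C."""
    with ThreadingHTTPServer(("127.0.0.1", port), make_handler(directory)) as httpd:
        print(f"Serving {directory} at http://127.0.0.1:{port}/")
        httpd.serve_forever()
```

Calling `serve()` from inside `docs/` after `make html` should then expose the site at http://127.0.0.1:8000/.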

You can access the deployed docs at:
https://trainer-doc-website.readthedocs.io/en/latest/

… Kubeflow Trainer - Introduced .readthedocs.yaml for ReadTheDocs configuration. - Created Makefile for building and serving documentation. - Added Sphinx configuration in docs/conf.py. - Established index.rst as the main entry point for documentation. - Developed user guides for various frameworks including PyTorch, JAX, and DeepSpeed. - Implemented custom CSS for documentation styling. - Included a distributed data cache guide and a DeepSpeed integration guide. This commit sets up the foundational documentation for the Kubeflow Trainer, enhancing accessibility and usability for users.

…flow Trainer - Renamed and reorganized user guides to better serve different audiences: AI practitioners, cluster operators, and contributors. - Added new sections for documentation on the Kubeflow Training Operator v1 and legacy guides. - Enhanced descriptions to clarify the purpose and content of each guide, improving overall accessibility and usability.

…flow Trainer - Introduced API reference documentation for Python SDK and Kubernetes CRD types, including TrainJob, TrainingRuntime, and ClusterTrainingRuntime. - Created contributor guides covering architecture, community, and contributing processes. - Added legacy v1 documentation structure with sections for installation and user guides for various frameworks. - Enhanced local execution documentation with updated examples. This commit establishes a comprehensive documentation framework to support users and contributors of the Kubeflow Trainer.

- Revised user guides for distributed training with Kubeflow Trainer, including JAX, PyTorch, MLX, DeepSpeed, and the data cache feature. - Enhanced clarity and structure of documentation to improve user experience and accessibility. - Added detailed instructions for using TrainJob with various frameworks, emphasizing configuration-driven training and runtime packages. - Removed outdated content and streamlined sections for better readability. This commit enhances the documentation framework, making it easier for users to implement distributed training solutions.

…ation layout - Replaced the old container structure with HTML for doc cards in index.rst, enhancing the visual presentation of quick links. - Updated CSS styles for doc cards to improve interactivity and appearance, including hover effects and text decoration. - Added an API Reference section to the documentation layout for better accessibility to technical details. These changes aim to enhance the user experience and accessibility of the Kubeflow Trainer documentation.

…l link handling - Added JavaScript functionality for a collapsible sidebar to improve navigation on larger screens. - Implemented external link handling to open links in a new tab for better user experience. - Updated CSS to accommodate new sidebar features and reduce whitespace in the content area. - Included Sphinx documentation build artifacts in .gitignore for cleaner repository management. These changes aim to improve the usability and accessibility of the Kubeflow Trainer documentation.

google-oss-prow · 2026-03-11T08:53:47Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-03-11T08:53:52Z

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Slack: Join our #kubeflow-trainer Slack channel.
Meetings: Attend the Kubeflow AutoML and Training Working Group bi-weekly meetings.

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

… layout - Deleted the API reference section and related files to streamline the documentation. - Updated index.rst to remove references to the API documentation. - Enhanced CSS styles for doc cards to improve visual consistency and interactivity. These changes aim to simplify the documentation structure and improve user navigation within the Kubeflow Trainer documentation.

… configuration - Eliminated the CRD API reference generation from the Makefile and .readthedocs.yaml to simplify the build process. - Updated Sphinx configuration to disable fail_on_warning, allowing builds to proceed despite warnings. - Removed AutoAPI configuration from conf.py and related dependencies from requirements.txt to streamline documentation setup. These changes aim to enhance the documentation build process and reduce complexity in the project structure.

- Updated the JAX user guide to include comprehensive instructions for creating and monitoring JAX training jobs using the JAXJob custom resource. - Enhanced the MPI user guide with additional metrics and Docker image building instructions, improving clarity and usability. - Introduced a new section in the multi-cluster guide detailing the `MultiKueue` feature for efficient management of MPI jobs across clusters. These changes aim to provide users with clearer guidance and enhance the overall documentation for distributed training frameworks in Kubeflow.

…ents - Introduced a grid layout for the BuiltinTrainer and local execution user guides, improving visual organization and accessibility. - Added new sections for overview and backend options in the local execution guide, detailing how to run TrainJobs with different backends. - Updated CSS to make the sidebar sticky during scrolling, enhancing navigation on larger screens. These changes aim to improve the overall user experience and clarity of the documentation for Kubeflow Trainer.

Copilot

Pull request overview

This PR introduces a dedicated Sphinx-based documentation website for Kubeflow Trainer (Furo theme + MyST), configured for Read the Docs hosting, and adds an initial set of user/operator/contributor guides plus legacy v1 documentation.

Changes:

Add Sphinx documentation project under docs/ (theme, MyST config, static assets, build Makefile, Python requirements).
Add Read the Docs build configuration (.readthedocs.yaml) and repo Makefile targets for building/serving/link-checking docs.
Add initial documentation content for Overview/Getting Started/User Guides/Operator Guides/Contributor Guides and Legacy v1.

Reviewed changes

Copilot reviewed 61 out of 68 changed files in this pull request and generated 8 comments.

File	Description
Makefile	Adds top-level `docs*` targets to build/serve/clean/linkcheck documentation.
.readthedocs.yaml	Configures RTD to build Sphinx docs from `docs/conf.py` with Python deps.
.gitignore	Ignores Sphinx build artifacts and a few additional local/dev directories.
docs/Makefile	Adds docs-local build/linkcheck/serve/clean targets for Sphinx.
docs/requirements.txt	Pins Sphinx + theme/extensions required to build the docs site.
docs/conf.py	Sphinx config: theme, MyST, Mermaid, linkcheck settings, and static assets.
docs/index.rst	Root docs landing page + navigation structure.
docs/_static/css/custom.css	Kubeflow/Furo styling, card layout, and sidebar UX tweaks.
docs/_static/js/external-links.js	Forces external links to open in a new tab.
docs/_static/js/sidebar-toggle.js	Adds a desktop sidebar collapse toggle with persistence.
docs/overview/index.md	New overview page positioning Trainer and personas.
docs/getting-started/index.md	Getting started walkthrough + distributed PyTorch example.
docs/user-guides/index.md	User guide landing page with framework/local-dev navigation.
docs/user-guides/pytorch.md	PyTorch usage guide and examples.
docs/user-guides/jax.md	JAX usage guide and examples.
docs/user-guides/deepspeed.md	DeepSpeed usage guide and examples.
docs/user-guides/mlx.md	MLX usage guide and examples.
docs/user-guides/data-cache.md	Distributed data cache guide + install and initializer usage.
docs/user-guides/builtin-trainer/index.md	Builtin trainer landing page.
docs/user-guides/builtin-trainer/overview.md	Explains BuiltinTrainer vs CustomTrainer.
docs/user-guides/builtin-trainer/torchtune.md	TorchTune BuiltinTrainer guide and walkthrough.
docs/user-guides/local-execution/index.md	Local execution landing page (process/docker/podman).
docs/user-guides/local-execution/overview.md	Cross-backend local execution overview and common ops.
docs/user-guides/local-execution/docker.md	Docker backend guide for local container execution.
docs/user-guides/local-execution/podman.md	Podman backend guide for local container execution.
docs/operator-guides/index.md	Operator guide landing page.
docs/operator-guides/installation.md	Install guidance (kubectl + Helm).
docs/operator-guides/migration.md	Migration doc from v1 CRDs to v2 TrainJob.
docs/operator-guides/runtime.md	Runtime concepts and examples (TrainingRuntime/ClusterTrainingRuntime).
docs/operator-guides/ml-policy.md	MLPolicy overview and examples (PlainML/Torch/MPI).
docs/operator-guides/job-template.md	Job template concepts and ancestor label requirements.
docs/operator-guides/pod-template.md	PodTemplateOverrides guide and restrictions.
docs/operator-guides/extension-framework.md	Extension framework phases and extension points.
docs/operator-guides/job-scheduling/index.md	Job scheduling landing page + toctree.
docs/operator-guides/job-scheduling/coscheduling.md	Coscheduling plugin guidance.
docs/operator-guides/job-scheduling/kueue.md	Links out to Kueue TrainJob docs.
docs/operator-guides/job-scheduling/volcano.md	Volcano integration guidance and examples.
docs/contributor-guides/index.md	Contributor guides landing page.
docs/contributor-guides/contributing.md	Contributor workflow/testing guidance (sourced from CONTRIBUTING.md).
docs/contributor-guides/community.md	Community links and resources (sourced from README).
docs/legacy-v1/index.md	Legacy v1 doc section landing page.
docs/legacy-v1/overview.md	Legacy v1 overview (with v2 redirect pointers).
docs/legacy-v1/installation.md	Legacy v1 installation guide.
docs/legacy-v1/getting-started.md	Legacy v1 getting started.
docs/legacy-v1/user-guides/index.md	Legacy v1 user guide index/toctree.
docs/legacy-v1/user-guides/fine-tuning.md	Legacy v1 fine-tuning guide.
docs/legacy-v1/user-guides/multi-cluster.md	Legacy v1 multi-cluster guidance.
docs/legacy-v1/user-guides/pytorch.md	Legacy v1 PyTorchJob guide.
docs/legacy-v1/user-guides/tensorflow.md	Legacy v1 TFJob guide.
docs/legacy-v1/user-guides/paddlepaddle.md	Legacy v1 PaddleJob guide.
docs/legacy-v1/user-guides/xgboost.md	Legacy v1 XGBoostJob guide.
docs/legacy-v1/user-guides/jax.md	Legacy v1 JAXJob guide.
docs/legacy-v1/user-guides/job-scheduling.md	Legacy v1 gang scheduling guide.
docs/legacy-v1/user-guides/mpi.md	Legacy v1 MPIJob guide.
docs/legacy-v1/user-guides/monitoring.md	Legacy v1 Prometheus monitoring guide.
docs/legacy-v1/reference/index.md	Legacy v1 reference index/toctree.
docs/legacy-v1/reference/architecture.md	Legacy v1 architecture reference.
docs/legacy-v1/reference/distributed-training.md	Legacy v1 distributed training reference.
docs/legacy-v1/reference/fine-tuning.md	Legacy v1 fine-tuning architecture reference.
docs/legacy-v1/explanation/index.md	Legacy v1 explanation index/toctree.
docs/legacy-v1/explanation/fine-tuning.md	Legacy v1 fine-tuning rationale/explanation.

docs/user-guides/local-execution/overview.md

docs/user-guides/data-cache.md

docs/user-guides/builtin-trainer/torchtune.md

docs/getting-started/index.md

docs/conf.py

docs/user-guides/jax.md

docs/user-guides/local-execution/docker.md

docs/user-guides/local-execution/podman.md

jeffspahr

This is the right overall direction for #3255: the docs are colocated, the section layout is sensible, and the Sphinx/Furo setup builds locally.

I don't think it meets the dedicated-site intent yet, though, because a few of the new docs still ship broken references or still depend on the old kubeflow.org site.

Blocking items from local verification:

python -m sphinx -n -W --keep-going -b html . _build/html reports unresolved internal refs in the new legacy docs
.readthedocs.yaml is configured with fail_on_warning: false, so those dead links can still be published
python -m sphinx -b linkcheck . _build/linkcheck also reports broken pkg.go.dev / GitHub blob anchor references in the new guides
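
The two verification commands quoted above could be wrapped in a small script so that they run the same way locally and in CI. The sketch below is an assumption about how one might do that, not something included in the PR: the helper names `sphinx_cmd` and `run_checks` are hypothetical, and it uses `sys.executable` in place of a bare `python`.

```python
import subprocess
import sys
from pathlib import Path


def sphinx_cmd(builder: str) -> list[str]:
    """Return the Sphinx command line from the review for the given builder."""
    cmd = [sys.executable, "-m", "sphinx"]
    if builder == "html":
        # -n: nitpicky mode, -W: turn warnings into errors,
        # --keep-going: report every warning instead of stopping at the first.
        cmd += ["-n", "-W", "--keep-going"]
    cmd += ["-b", builder, ".", f"_build/{builder}"]
    return cmd


def run_checks(docs_dir: Path) -> int:
    """Run the strict HTML build and linkcheck; return 0 only if both pass."""
    codes = [
        subprocess.run(sphinx_cmd(builder), cwd=docs_dir).returncode
        for builder in ("html", "linkcheck")
    ]
    return 0 if all(code == 0 for code in codes) else 1
```

A caller would run `run_checks(Path("docs"))` from the repository root and treat a non-zero result as a failed docs check.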

.readthedocs.yaml

docs/conf.py

docs/operator-guides/pod-template.md

jeffspahr · 2026-03-22T20:25:35Z

docs/overview/index.md

+
+The platform features **distributed data caching** using Apache Arrow and Apache DataFusion for zero-copy tensor streaming directly to GPU nodes, maximizing training performance.
+
+![Kubeflow Trainer Tech Stack](https://www.kubeflow.org/docs/components/trainer/images/trainer-tech-stack.drawio.svg)


One of the explicit motivations in #3255 was to co-locate the docs with the source. This overview page still hotlinks the hero diagrams from kubeflow.org, so the new site depends on the old one for core content. I would vendor these diagrams into docs/images and reference them locally before merging the dedicated site.

I initially hotlinked these while scaffolding the content, but you're right: they should be vendored locally so the new site is fully self-contained. I'll download the images into docs/images/.
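
The hotlinking concern in this thread can also be checked mechanically. The following is a minimal sketch, assuming the docs use standard Markdown image syntax; the function name `find_hotlinked_images` and the second sample line are hypothetical and only illustrate the idea.

```python
import re

# Markdown image syntax is ![alt](target); only http(s) targets count as hotlinks.
REMOTE_IMAGE = re.compile(r"!\[[^\]]*\]\((https?://[^)\s]+)")


def find_hotlinked_images(markdown_text: str) -> list[str]:
    """Return the remote image URLs referenced in a Markdown document."""
    return REMOTE_IMAGE.findall(markdown_text)


sample = (
    "![Kubeflow Trainer Tech Stack]"
    "(https://www.kubeflow.org/docs/components/trainer/images/trainer-tech-stack.drawio.svg)\n"
    "![Local diagram](../images/example.svg)\n"  # hypothetical vendored image
)
print(find_hotlinked_images(sample))
```

Run over every `.md` file under `docs/`, a check like this would list the images that still need to be copied into `docs/images/`.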

docs/user-guides/builtin-trainer/torchtune.md

…links in user guides - Clarified that each rank in a multi-node TrainJob must download the dataset independently. - Updated links in the local execution and Podman user guides to point directly to the relevant sections in the overview. - Corrected the path for accessing the fine-tuned model in the BuiltinTrainer guide. These changes enhance the clarity and usability of the documentation for Kubeflow Trainer. (resolved the Copilot reviews) Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>

Sridhar1030 · 2026-03-23T07:33:11Z

Thanks for the review @jeffspahr
I will address them asap

…ntation - Added architecture diagrams to the Kubeflow Trainer Extension Framework guide for better visualization. - Updated terminology in the index to reflect the transition from Pod Templates to Runtime Patches. - Removed the outdated PodTemplate documentation to streamline the guides and focus on current practices. These changes aim to enhance clarity and usability in the documentation for the Kubeflow Trainer. Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>

- Enabled fail_on_warning in Sphinx configuration to ensure builds fail on warnings, enhancing documentation quality. - Updated links in user guides to point to the correct paths, improving navigation and clarity for users. These changes aim to enhance the reliability and usability of the documentation for Kubeflow Trainer. Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>

Sridhar1030 · 2026-03-31T08:53:12Z

Hey @jeffspahr, I've resolved most of your review comments apart from the docs/images one

Here's what was addressed:

.readthedocs.yaml: Set fail_on_warning: true
docs/conf.py: Added myst_heading_anchors = 4 to fix all unresolved internal heading refs
docs/conf.py: Added medium.com to linkcheck_ignore (returns 403 to bots) and MLX anchor to linkcheck_anchors_ignore
docker.md / podman.md: Fixed cross-refs from overview# to overview.md# (4 occurrences)
torchtune.md: Removed #Lx-Ly GitHub line-range anchors from SDK links that fail linkcheck
jax.md: Fixed "Getting Started" link to point to the correct page

The strict Sphinx build (python -m sphinx -n -W --keep-going -b html . _build/html) now passes with zero warnings.
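
As a rough illustration of the docs/conf.py changes listed above, the relevant fragment might look like the following. `myst_heading_anchors` is a MyST-Parser option and `linkcheck_ignore` / `linkcheck_anchors_ignore` are Sphinx options, but the regex values here are assumptions: the actual patterns used in the PR are not quoted in this comment.

```python
# docs/conf.py (fragment) -- illustrative values, not the PR's actual diff.

# MyST: generate anchors for headings down to level 4 so that
# links such as page.md#some-heading resolve during the build.
myst_heading_anchors = 4

# linkcheck: skip hosts that reject automated requests (HTTP 403 to bots).
# The pattern below is an assumed regex for medium.com URLs.
linkcheck_ignore = [
    r"https?://(www\.)?medium\.com/.*",
]

# linkcheck: skip anchor verification for anchors matching these regexes.
# "example-mlx-anchor" is a hypothetical placeholder for the MLX anchor.
linkcheck_anchors_ignore = [
    r"^example-mlx-anchor$",
]
```

Both linkcheck options take lists of regular expressions, so each entry should be kept as narrow as possible to avoid hiding links that are genuinely broken.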

Could you review again and let me know if the PR requires any more changes?

Sridhar1030 added 9 commits March 8, 2026 14:27

feat(docs): add comprehensive operator guides for Kubeflow Trainer

d6ccb26

feat(docs): enhance operator guides for Kubeflow Trainer

7c2c686

fixed user-guides

ff7f734

google-oss-prow bot added the do-not-merge/work-in-progress label Mar 11, 2026

google-oss-prow bot requested review from akshaychitneni and kuizhiqing March 11, 2026 08:53

google-oss-prow bot added the size/XXL label Mar 11, 2026

Sridhar1030 mentioned this pull request Mar 11, 2026

Create a Dedicated Website for Kubeflow Trainer #3255

Open

Sridhar1030 added 5 commits March 16, 2026 12:44

added leagcy training docs

69189cf

Sridhar1030 mentioned this pull request Mar 19, 2026

KEP: Kubeflow MCP Server - AI-Powered Training Interface kubeflow/community#936

Open

Sridhar1030 marked this pull request as ready for review March 19, 2026 17:01

Copilot AI review requested due to automatic review settings March 19, 2026 17:01

google-oss-prow bot removed the do-not-merge/work-in-progress label Mar 19, 2026

Copilot started reviewing on behalf of Sridhar1030 March 19, 2026 17:02

Copilot AI reviewed Mar 19, 2026

jeffspahr suggested changes Mar 22, 2026

google-oss-prow bot assigned jeffspahr Mar 22, 2026

Sridhar1030 added 3 commits March 23, 2026 14:43

docs: update overview and add diagrams

102c066


3 participants
