feat(docs): add dedicated documentation website #3308

Open
Sridhar1030 wants to merge 18 commits into kubeflow:master from Sridhar1030:doc-website

Sridhar1030 commented Mar 11, 2026

What this PR does / why we need it:

Adds a dedicated Sphinx documentation website for Kubeflow Trainer, as proposed in #3255.

Built with Sphinx (Furo theme) + MyST Markdown, designed for Read the Docs hosting.

Status: Work in Progress -- documentation content is still being completed.

What's done:

  • Sphinx project setup with Furo theme and MyST Markdown
  • ReadTheDocs integration (.readthedocs.yaml)
  • Site structure: Overview, Getting Started, User Guides, Operator Guides,
    Contributor Guides, API Reference, Legacy v1
  • Makefile targets (make docs, make docs-serve, make docs-clean)
  • Completed the page contents

Attachment: WebsiteRecording.mov

What's remaining:

  • Review the page contents
  • API Reference pages (pending SDK integration)
  • Contributor guide content
  • Legacy v1 documentation content
  • Final review of all user/operator guides

How to test locally:

cd docs
pip install -r requirements.txt
make html
python3 -m http.server 8000 --directory _build/html

You can access the deployed docs at:
https://trainer-doc-website.readthedocs.io/en/latest/

… Kubeflow Trainer

- Introduced .readthedocs.yaml for ReadTheDocs configuration.
- Created Makefile for building and serving documentation.
- Added Sphinx configuration in docs/conf.py.
- Established index.rst as the main entry point for documentation.
- Developed user guides for various frameworks including PyTorch, JAX, and DeepSpeed.
- Implemented custom CSS for documentation styling.
- Included a distributed data cache guide and a DeepSpeed integration guide.

This commit sets up the foundational documentation for the Kubeflow Trainer, enhancing accessibility and usability for users.
…flow Trainer

- Renamed and reorganized user guides to better serve different audiences: AI practitioners, cluster operators, and contributors.
- Added new sections for documentation on the Kubeflow Training Operator v1 and legacy guides.
- Enhanced descriptions to clarify the purpose and content of each guide, improving overall accessibility and usability.
…flow Trainer

- Introduced API reference documentation for Python SDK and Kubernetes CRD types, including TrainJob, TrainingRuntime, and ClusterTrainingRuntime.
- Created contributor guides covering architecture, community, and contributing processes.
- Added legacy v1 documentation structure with sections for installation and user guides for various frameworks.
- Enhanced local execution documentation with updated examples.

This commit establishes a comprehensive documentation framework to support users and contributors of the Kubeflow Trainer.

- Revised user guides for distributed training with Kubeflow Trainer, including JAX, PyTorch, MLX, DeepSpeed, and the data cache feature.
- Enhanced clarity and structure of documentation to improve user experience and accessibility.
- Added detailed instructions for using TrainJob with various frameworks, emphasizing configuration-driven training and runtime packages.
- Removed outdated content and streamlined sections for better readability.

This commit enhances the documentation framework, making it easier for users to implement distributed training solutions.
…ation layout

- Replaced the old container structure with HTML for doc cards in index.rst, enhancing the visual presentation of quick links.
- Updated CSS styles for doc cards to improve interactivity and appearance, including hover effects and text decoration.
- Added an API Reference section to the documentation layout for better accessibility to technical details.

These changes aim to enhance the user experience and accessibility of the Kubeflow Trainer documentation.
…l link handling

- Added JavaScript functionality for a collapsible sidebar to improve navigation on larger screens.
- Implemented external link handling to open links in a new tab for better user experience.
- Updated CSS to accommodate new sidebar features and reduce whitespace in the content area.
- Included Sphinx documentation build artifacts in .gitignore for cleaner repository management.

These changes aim to improve the usability and accessibility of the Kubeflow Trainer documentation.
@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

… layout

- Deleted the API reference section and related files to streamline the documentation.
- Updated index.rst to remove references to the API documentation.
- Enhanced CSS styles for doc cards to improve visual consistency and interactivity.

These changes aim to simplify the documentation structure and improve user navigation within the Kubeflow Trainer documentation.
… configuration

- Eliminated the CRD API reference generation from the Makefile and .readthedocs.yaml to simplify the build process.
- Updated Sphinx configuration to disable fail_on_warning, allowing builds to proceed despite warnings.
- Removed AutoAPI configuration from conf.py and related dependencies from requirements.txt to streamline documentation setup.

These changes aim to enhance the documentation build process and reduce complexity in the project structure.

- Updated the JAX user guide to include comprehensive instructions for creating and monitoring JAX training jobs using the JAXJob custom resource.
- Enhanced the MPI user guide with additional metrics and Docker image building instructions, improving clarity and usability.
- Introduced a new section in the multi-cluster guide detailing the `MultiKueue` feature for efficient management of MPI jobs across clusters.

These changes aim to provide users with clearer guidance and enhance the overall documentation for distributed training frameworks in Kubeflow.
…ents

- Introduced a grid layout for the BuiltinTrainer and local execution user guides, improving visual organization and accessibility.
- Added new sections for overview and backend options in the local execution guide, detailing how to run TrainJobs with different backends.
- Updated CSS to make the sidebar sticky during scrolling, enhancing navigation on larger screens.

These changes aim to improve the overall user experience and clarity of the documentation for Kubeflow Trainer.

Copilot AI left a comment

Pull request overview

This PR introduces a dedicated Sphinx-based documentation website for Kubeflow Trainer (Furo theme + MyST), configured for Read the Docs hosting, and adds an initial set of user/operator/contributor guides plus legacy v1 documentation.

Changes:

  • Add Sphinx documentation project under docs/ (theme, MyST config, static assets, build Makefile, Python requirements).
  • Add Read the Docs build configuration (.readthedocs.yaml) and repo Makefile targets for building/serving/link-checking docs.
  • Add initial documentation content for Overview/Getting Started/User Guides/Operator Guides/Contributor Guides and Legacy v1.

Reviewed changes

Copilot reviewed 61 out of 68 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| Makefile | Adds top-level docs* targets to build/serve/clean/linkcheck documentation. |
| .readthedocs.yaml | Configures RTD to build Sphinx docs from docs/conf.py with Python deps. |
| .gitignore | Ignores Sphinx build artifacts and a few additional local/dev directories. |
| docs/Makefile | Adds docs-local build/linkcheck/serve/clean targets for Sphinx. |
| docs/requirements.txt | Pins Sphinx + theme/extensions required to build the docs site. |
| docs/conf.py | Sphinx config: theme, MyST, Mermaid, linkcheck settings, and static assets. |
| docs/index.rst | Root docs landing page + navigation structure. |
| docs/_static/css/custom.css | Kubeflow/Furo styling, card layout, and sidebar UX tweaks. |
| docs/_static/js/external-links.js | Forces external links to open in a new tab. |
| docs/_static/js/sidebar-toggle.js | Adds a desktop sidebar collapse toggle with persistence. |
| docs/overview/index.md | New overview page positioning Trainer and personas. |
| docs/getting-started/index.md | Getting started walkthrough + distributed PyTorch example. |
| docs/user-guides/index.md | User guide landing page with framework/local-dev navigation. |
| docs/user-guides/pytorch.md | PyTorch usage guide and examples. |
| docs/user-guides/jax.md | JAX usage guide and examples. |
| docs/user-guides/deepspeed.md | DeepSpeed usage guide and examples. |
| docs/user-guides/mlx.md | MLX usage guide and examples. |
| docs/user-guides/data-cache.md | Distributed data cache guide + install and initializer usage. |
| docs/user-guides/builtin-trainer/index.md | Builtin trainer landing page. |
| docs/user-guides/builtin-trainer/overview.md | Explains BuiltinTrainer vs CustomTrainer. |
| docs/user-guides/builtin-trainer/torchtune.md | TorchTune BuiltinTrainer guide and walkthrough. |
| docs/user-guides/local-execution/index.md | Local execution landing page (process/docker/podman). |
| docs/user-guides/local-execution/overview.md | Cross-backend local execution overview and common ops. |
| docs/user-guides/local-execution/docker.md | Docker backend guide for local container execution. |
| docs/user-guides/local-execution/podman.md | Podman backend guide for local container execution. |
| docs/operator-guides/index.md | Operator guide landing page. |
| docs/operator-guides/installation.md | Install guidance (kubectl + Helm). |
| docs/operator-guides/migration.md | Migration doc from v1 CRDs to v2 TrainJob. |
| docs/operator-guides/runtime.md | Runtime concepts and examples (TrainingRuntime/ClusterTrainingRuntime). |
| docs/operator-guides/ml-policy.md | MLPolicy overview and examples (PlainML/Torch/MPI). |
| docs/operator-guides/job-template.md | Job template concepts and ancestor label requirements. |
| docs/operator-guides/pod-template.md | PodTemplateOverrides guide and restrictions. |
| docs/operator-guides/extension-framework.md | Extension framework phases and extension points. |
| docs/operator-guides/job-scheduling/index.md | Job scheduling landing page + toctree. |
| docs/operator-guides/job-scheduling/coscheduling.md | Coscheduling plugin guidance. |
| docs/operator-guides/job-scheduling/kueue.md | Links out to Kueue TrainJob docs. |
| docs/operator-guides/job-scheduling/volcano.md | Volcano integration guidance and examples. |
| docs/contributor-guides/index.md | Contributor guides landing page. |
| docs/contributor-guides/contributing.md | Contributor workflow/testing guidance (sourced from CONTRIBUTING.md). |
| docs/contributor-guides/community.md | Community links and resources (sourced from README). |
| docs/legacy-v1/index.md | Legacy v1 doc section landing page. |
| docs/legacy-v1/overview.md | Legacy v1 overview (with v2 redirect pointers). |
| docs/legacy-v1/installation.md | Legacy v1 installation guide. |
| docs/legacy-v1/getting-started.md | Legacy v1 getting started. |
| docs/legacy-v1/user-guides/index.md | Legacy v1 user guide index/toctree. |
| docs/legacy-v1/user-guides/fine-tuning.md | Legacy v1 fine-tuning guide. |
| docs/legacy-v1/user-guides/multi-cluster.md | Legacy v1 multi-cluster guidance. |
| docs/legacy-v1/user-guides/pytorch.md | Legacy v1 PyTorchJob guide. |
| docs/legacy-v1/user-guides/tensorflow.md | Legacy v1 TFJob guide. |
| docs/legacy-v1/user-guides/paddlepaddle.md | Legacy v1 PaddleJob guide. |
| docs/legacy-v1/user-guides/xgboost.md | Legacy v1 XGBoostJob guide. |
| docs/legacy-v1/user-guides/jax.md | Legacy v1 JAXJob guide. |
| docs/legacy-v1/user-guides/job-scheduling.md | Legacy v1 gang scheduling guide. |
| docs/legacy-v1/user-guides/mpi.md | Legacy v1 MPIJob guide. |
| docs/legacy-v1/user-guides/monitoring.md | Legacy v1 Prometheus monitoring guide. |
| docs/legacy-v1/reference/index.md | Legacy v1 reference index/toctree. |
| docs/legacy-v1/reference/architecture.md | Legacy v1 architecture reference. |
| docs/legacy-v1/reference/distributed-training.md | Legacy v1 distributed training reference. |
| docs/legacy-v1/reference/fine-tuning.md | Legacy v1 fine-tuning architecture reference. |
| docs/legacy-v1/explanation/index.md | Legacy v1 explanation index/toctree. |
| docs/legacy-v1/explanation/fine-tuning.md | Legacy v1 fine-tuning rationale/explanation. |

jeffspahr left a comment

This is the right overall direction for #3255: the docs are colocated, the section layout is sensible, and the Sphinx/Furo setup builds locally.

I don't think it meets the dedicated-site intent yet, though, because a few of the new docs still ship broken references or still depend on the old kubeflow.org site.

Blocking items from local verification:

  • python -m sphinx -n -W --keep-going -b html . _build/html reports unresolved internal refs in the new legacy docs
  • .readthedocs.yaml is configured with fail_on_warning: false, so those dead links can still be published
  • python -m sphinx -b linkcheck . _build/linkcheck also reports broken pkg.go.dev / GitHub blob anchor references in the new guides
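
The two Sphinx invocations above can be bundled into a single check, for example as a pre-merge step. The following is only a sketch and not part of the PR: `sphinx_command` and `run_checks` are hypothetical names, and it assumes Sphinx is installed and the script is run from the docs/ directory. The flags are the ones quoted in the review.

```python
import subprocess
import sys


def sphinx_command(builder: str, strict: bool = False) -> list[str]:
    """Build the argv for a Sphinx run from the current directory."""
    cmd = [sys.executable, "-m", "sphinx"]
    if strict:
        # -n: nitpicky mode, -W: turn warnings into errors,
        # --keep-going: report every warning before failing.
        cmd += ["-n", "-W", "--keep-going"]
    cmd += ["-b", builder, ".", f"_build/{builder}"]
    return cmd


def run_checks() -> int:
    """Run the strict HTML build, then linkcheck; return 0 only if both pass."""
    failed = False
    for cmd in (sphinx_command("html", strict=True), sphinx_command("linkcheck")):
        print("running:", " ".join(cmd))
        if subprocess.run(cmd).returncode != 0:
            failed = True
    return 1 if failed else 0
```

Calling `sys.exit(run_checks())` from docs/ would run both builders even if the first one fails, so broken internal refs and broken external links are reported together.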


The platform features **distributed data caching** using Apache Arrow and Apache DataFusion for zero-copy tensor streaming directly to GPU nodes, maximizing training performance.

![Kubeflow Trainer Tech Stack](https://www.kubeflow.org/docs/components/trainer/images/trainer-tech-stack.drawio.svg)

One of the explicit motivations in #3255 was to co-locate the docs with the source. This overview page still hotlinks the hero diagrams from kubeflow.org, so the new site depends on the old one for core content. I would vendor these diagrams into docs/images and reference them locally before merging the dedicated site.
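
One way to find every remaining hotlinked image is to scan the Markdown sources for image references with absolute URLs. This is a minimal sketch, assuming the pages use standard Markdown image syntax `![alt](url)`; `hotlinked_images` and `scan_docs` are hypothetical helpers, not part of the PR.

```python
import re
from pathlib import Path

# Markdown image syntax: ![alt text](target); the group captures the target.
IMAGE_RE = re.compile(r"!\[[^\]]*\]\(\s*([^)\s]+)")


def hotlinked_images(markdown: str) -> list[str]:
    """Return image targets in `markdown` that point at a remote http(s) URL."""
    return [
        target
        for target in IMAGE_RE.findall(markdown)
        if target.startswith(("http://", "https://"))
    ]


def scan_docs(root: str = "docs") -> dict[str, list[str]]:
    """Map each .md file under `root` to the remote images it references."""
    hits = {}
    for path in sorted(Path(root).rglob("*.md")):
        found = hotlinked_images(path.read_text(encoding="utf-8"))
        if found:
            hits[str(path)] = found
    return hits
```

An empty result from `scan_docs()` would indicate that no Markdown page still depends on an external host for its images. Images embedded through raw HTML or reStructuredText directives are outside the scope of this sketch.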

Author

I initially hotlinked these while scaffolding the content, but you're right: they should be vendored locally so the new site is fully self-contained. I'll download the images into docs/images/.

…links in user guides

- Clarified that each rank in a multi-node TrainJob must download the dataset independently.
- Updated links in the local execution and Podman user guides to point directly to the relevant sections in the overview.
- Corrected the path for accessing the fine-tuned model in the BuiltinTrainer guide.

These changes enhance the clarity and usability of the documentation for Kubeflow Trainer. (resolved the Copilot reviews)

Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>
@Sridhar1030
Author

Thanks for the review, @jeffspahr.
I will address the comments ASAP.

…ntation

- Added architecture diagrams to the Kubeflow Trainer Extension Framework guide for better visualization.
- Updated terminology in the index to reflect the transition from Pod Templates to Runtime Patches.
- Removed the outdated PodTemplate documentation to streamline the guides and focus on current practices.

These changes aim to enhance clarity and usability in the documentation for the Kubeflow Trainer.

Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>

- Enabled fail_on_warning in Sphinx configuration to ensure builds fail on warnings, enhancing documentation quality.
- Updated links in user guides to point to the correct paths, improving navigation and clarity for users.

These changes aim to enhance the reliability and usability of the documentation for Kubeflow Trainer.

Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>

Sridhar1030 commented Mar 31, 2026

Hey @jeffspahr, I've resolved most of your review comments, apart from the docs/images one.

Here's what was addressed:

  • .readthedocs.yaml: Set fail_on_warning: true
  • docs/conf.py: Added myst_heading_anchors = 4 to fix all unresolved internal heading refs
  • docs/conf.py: Added medium.com to linkcheck_ignore (returns 403 to bots) and MLX anchor to linkcheck_anchors_ignore
  • docker.md / podman.md: Fixed cross-refs from overview# to overview.md# (4 occurrences)
  • torchtune.md: Removed #Lx-Ly GitHub line-range anchors from SDK links that fail linkcheck
  • jax.md: Fixed "Getting Started" link to point to the correct page
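
For reference, the docs/conf.py settings named in the list above might look roughly like the fragment below. It is a sketch based only on the description in this comment, not the PR's actual file: the regular expressions are assumptions, and the MLX anchor pattern is a placeholder because the comment does not name the anchor.

```python
# Sketch of the relevant docs/conf.py settings (patterns are assumptions).

# MyST-Parser: generate heading anchors for heading levels 1-4 so that
# links such as overview.md#some-heading resolve during the build.
myst_heading_anchors = 4

# Sphinx linkcheck: URLs matching these regexes are skipped entirely.
# medium.com answers automated requests with HTTP 403.
linkcheck_ignore = [
    r"https?://([\w-]+\.)?medium\.com/",
]

# Sphinx linkcheck: anchors matching these regexes are not verified.
# Placeholder pattern; the real MLX anchor is not given in the comment.
linkcheck_anchors_ignore = [
    r"^mlx-placeholder-anchor$",
]
```

Note that assigning `linkcheck_anchors_ignore` replaces Sphinx's default value for that option rather than extending it.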

The strict Sphinx build (python -m sphinx -n -W --keep-going -b html . _build/html) now passes with zero warnings.
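
The torchtune.md fix listed above (dropping `#Lx-Ly` line-range anchors from GitHub links) can also be applied mechanically. This is a minimal sketch, assuming the anchors follow GitHub's usual `#L10` or `#L10-L20` form on blob URLs; `strip_line_anchors` is a hypothetical helper, not part of the PR.

```python
import re

# A GitHub blob URL followed by a line anchor such as #L10 or #L10-L20.
LINE_ANCHOR_RE = re.compile(
    r"(https://github\.com/[^\s)#]+/blob/[^\s)#]+)#L\d+(?:-L\d+)?"
)


def strip_line_anchors(text: str) -> str:
    """Remove #Lx / #Lx-Ly anchors from GitHub blob links in `text`."""
    return LINE_ANCHOR_RE.sub(r"\1", text)
```

Only GitHub blob links are touched; heading anchors and links to other hosts are left unchanged.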

Could you review again and let me know if the PR requires any more changes?
