feat(docs): add dedicated documentation website #3308
Sridhar1030 wants to merge 18 commits into kubeflow:master
… Kubeflow Trainer
- Introduced .readthedocs.yaml for ReadTheDocs configuration.
- Created Makefile for building and serving documentation.
- Added Sphinx configuration in docs/conf.py.
- Established index.rst as the main entry point for documentation.
- Developed user guides for various frameworks including PyTorch, JAX, and DeepSpeed.
- Implemented custom CSS for documentation styling.
- Included a distributed data cache guide and a DeepSpeed integration guide.
This commit sets up the foundational documentation for the Kubeflow Trainer, enhancing accessibility and usability for users.

…flow Trainer
- Renamed and reorganized user guides to better serve different audiences: AI practitioners, cluster operators, and contributors.
- Added new sections for documentation on the Kubeflow Training Operator v1 and legacy guides.
- Enhanced descriptions to clarify the purpose and content of each guide, improving overall accessibility and usability.

…flow Trainer
- Introduced API reference documentation for Python SDK and Kubernetes CRD types, including TrainJob, TrainingRuntime, and ClusterTrainingRuntime.
- Created contributor guides covering architecture, community, and contributing processes.
- Added legacy v1 documentation structure with sections for installation and user guides for various frameworks.
- Enhanced local execution documentation with updated examples.
This commit establishes a comprehensive documentation framework to support users and contributors of the Kubeflow Trainer.

- Revised user guides for distributed training with Kubeflow Trainer, including JAX, PyTorch, MLX, DeepSpeed, and the data cache feature.
- Enhanced clarity and structure of documentation to improve user experience and accessibility.
- Added detailed instructions for using TrainJob with various frameworks, emphasizing configuration-driven training and runtime packages.
- Removed outdated content and streamlined sections for better readability.
This commit enhances the documentation framework, making it easier for users to implement distributed training solutions.

…ation layout
- Replaced the old container structure with HTML for doc cards in index.rst, enhancing the visual presentation of quick links.
- Updated CSS styles for doc cards to improve interactivity and appearance, including hover effects and text decoration.
- Added an API Reference section to the documentation layout for better accessibility to technical details.
These changes aim to enhance the user experience and accessibility of the Kubeflow Trainer documentation.

…l link handling
- Added JavaScript functionality for a collapsible sidebar to improve navigation on larger screens.
- Implemented external link handling to open links in a new tab for better user experience.
- Updated CSS to accommodate new sidebar features and reduce whitespace in the content area.
- Included Sphinx documentation build artifacts in .gitignore for cleaner repository management.
These changes aim to improve the usability and accessibility of the Kubeflow Trainer documentation.
[APPROVALNOTIFIER] This PR is NOT APPROVED.
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
… layout
- Deleted the API reference section and related files to streamline the documentation.
- Updated index.rst to remove references to the API documentation.
- Enhanced CSS styles for doc cards to improve visual consistency and interactivity.
These changes aim to simplify the documentation structure and improve user navigation within the Kubeflow Trainer documentation.

… configuration
- Eliminated the CRD API reference generation from the Makefile and .readthedocs.yaml to simplify the build process.
- Updated Sphinx configuration to disable fail_on_warning, allowing builds to proceed despite warnings.
- Removed AutoAPI configuration from conf.py and related dependencies from requirements.txt to streamline documentation setup.
These changes aim to enhance the documentation build process and reduce complexity in the project structure.

- Updated the JAX user guide to include comprehensive instructions for creating and monitoring JAX training jobs using the JAXJob custom resource.
- Enhanced the MPI user guide with additional metrics and Docker image building instructions, improving clarity and usability.
- Introduced a new section in the multi-cluster guide detailing the `MultiKueue` feature for efficient management of MPI jobs across clusters.
These changes aim to provide users with clearer guidance and enhance the overall documentation for distributed training frameworks in Kubeflow.

…ents
- Introduced a grid layout for the BuiltinTrainer and local execution user guides, improving visual organization and accessibility.
- Added new sections for overview and backend options in the local execution guide, detailing how to run TrainJobs with different backends.
- Updated CSS to make the sidebar sticky during scrolling, enhancing navigation on larger screens.
These changes aim to improve the overall user experience and clarity of the documentation for Kubeflow Trainer.
Pull request overview
This PR introduces a dedicated Sphinx-based documentation website for Kubeflow Trainer (Furo theme + MyST), configured for Read the Docs hosting, and adds an initial set of user/operator/contributor guides plus legacy v1 documentation.
Changes:
- Add Sphinx documentation project under `docs/` (theme, MyST config, static assets, build Makefile, Python requirements).
- Add Read the Docs build configuration (`.readthedocs.yaml`) and repo Makefile targets for building/serving/link-checking docs.
- Add initial documentation content for Overview/Getting Started/User Guides/Operator Guides/Contributor Guides and Legacy v1.
Reviewed changes
Copilot reviewed 61 out of 68 changed files in this pull request and generated 8 comments.
Show a summary per file
| File | Description |
|---|---|
| Makefile | Adds top-level docs* targets to build/serve/clean/linkcheck documentation. |
| .readthedocs.yaml | Configures RTD to build Sphinx docs from docs/conf.py with Python deps. |
| .gitignore | Ignores Sphinx build artifacts and a few additional local/dev directories. |
| docs/Makefile | Adds docs-local build/linkcheck/serve/clean targets for Sphinx. |
| docs/requirements.txt | Pins Sphinx + theme/extensions required to build the docs site. |
| docs/conf.py | Sphinx config: theme, MyST, Mermaid, linkcheck settings, and static assets. |
| docs/index.rst | Root docs landing page + navigation structure. |
| docs/_static/css/custom.css | Kubeflow/Furo styling, card layout, and sidebar UX tweaks. |
| docs/_static/js/external-links.js | Forces external links to open in a new tab. |
| docs/_static/js/sidebar-toggle.js | Adds a desktop sidebar collapse toggle with persistence. |
| docs/overview/index.md | New overview page positioning Trainer and personas. |
| docs/getting-started/index.md | Getting started walkthrough + distributed PyTorch example. |
| docs/user-guides/index.md | User guide landing page with framework/local-dev navigation. |
| docs/user-guides/pytorch.md | PyTorch usage guide and examples. |
| docs/user-guides/jax.md | JAX usage guide and examples. |
| docs/user-guides/deepspeed.md | DeepSpeed usage guide and examples. |
| docs/user-guides/mlx.md | MLX usage guide and examples. |
| docs/user-guides/data-cache.md | Distributed data cache guide + install and initializer usage. |
| docs/user-guides/builtin-trainer/index.md | Builtin trainer landing page. |
| docs/user-guides/builtin-trainer/overview.md | Explains BuiltinTrainer vs CustomTrainer. |
| docs/user-guides/builtin-trainer/torchtune.md | TorchTune BuiltinTrainer guide and walkthrough. |
| docs/user-guides/local-execution/index.md | Local execution landing page (process/docker/podman). |
| docs/user-guides/local-execution/overview.md | Cross-backend local execution overview and common ops. |
| docs/user-guides/local-execution/docker.md | Docker backend guide for local container execution. |
| docs/user-guides/local-execution/podman.md | Podman backend guide for local container execution. |
| docs/operator-guides/index.md | Operator guide landing page. |
| docs/operator-guides/installation.md | Install guidance (kubectl + Helm). |
| docs/operator-guides/migration.md | Migration doc from v1 CRDs to v2 TrainJob. |
| docs/operator-guides/runtime.md | Runtime concepts and examples (TrainingRuntime/ClusterTrainingRuntime). |
| docs/operator-guides/ml-policy.md | MLPolicy overview and examples (PlainML/Torch/MPI). |
| docs/operator-guides/job-template.md | Job template concepts and ancestor label requirements. |
| docs/operator-guides/pod-template.md | PodTemplateOverrides guide and restrictions. |
| docs/operator-guides/extension-framework.md | Extension framework phases and extension points. |
| docs/operator-guides/job-scheduling/index.md | Job scheduling landing page + toctree. |
| docs/operator-guides/job-scheduling/coscheduling.md | Coscheduling plugin guidance. |
| docs/operator-guides/job-scheduling/kueue.md | Links out to Kueue TrainJob docs. |
| docs/operator-guides/job-scheduling/volcano.md | Volcano integration guidance and examples. |
| docs/contributor-guides/index.md | Contributor guides landing page. |
| docs/contributor-guides/contributing.md | Contributor workflow/testing guidance (sourced from CONTRIBUTING.md). |
| docs/contributor-guides/community.md | Community links and resources (sourced from README). |
| docs/legacy-v1/index.md | Legacy v1 doc section landing page. |
| docs/legacy-v1/overview.md | Legacy v1 overview (with v2 redirect pointers). |
| docs/legacy-v1/installation.md | Legacy v1 installation guide. |
| docs/legacy-v1/getting-started.md | Legacy v1 getting started. |
| docs/legacy-v1/user-guides/index.md | Legacy v1 user guide index/toctree. |
| docs/legacy-v1/user-guides/fine-tuning.md | Legacy v1 fine-tuning guide. |
| docs/legacy-v1/user-guides/multi-cluster.md | Legacy v1 multi-cluster guidance. |
| docs/legacy-v1/user-guides/pytorch.md | Legacy v1 PyTorchJob guide. |
| docs/legacy-v1/user-guides/tensorflow.md | Legacy v1 TFJob guide. |
| docs/legacy-v1/user-guides/paddlepaddle.md | Legacy v1 PaddleJob guide. |
| docs/legacy-v1/user-guides/xgboost.md | Legacy v1 XGBoostJob guide. |
| docs/legacy-v1/user-guides/jax.md | Legacy v1 JAXJob guide. |
| docs/legacy-v1/user-guides/job-scheduling.md | Legacy v1 gang scheduling guide. |
| docs/legacy-v1/user-guides/mpi.md | Legacy v1 MPIJob guide. |
| docs/legacy-v1/user-guides/monitoring.md | Legacy v1 Prometheus monitoring guide. |
| docs/legacy-v1/reference/index.md | Legacy v1 reference index/toctree. |
| docs/legacy-v1/reference/architecture.md | Legacy v1 architecture reference. |
| docs/legacy-v1/reference/distributed-training.md | Legacy v1 distributed training reference. |
| docs/legacy-v1/reference/fine-tuning.md | Legacy v1 fine-tuning architecture reference. |
| docs/legacy-v1/explanation/index.md | Legacy v1 explanation index/toctree. |
| docs/legacy-v1/explanation/fine-tuning.md | Legacy v1 fine-tuning rationale/explanation. |
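
For readers unfamiliar with Sphinx, the `docs/conf.py` row in the table ("theme, MyST, Mermaid, linkcheck settings, and static assets") might correspond to something like the sketch below. This is an illustration only, not the file from this PR: the option names are standard Sphinx settings, but every value here (the project name, the timeout, the ignore pattern) is an assumption.

```python
# Hypothetical docs/conf.py sketch. The option names are standard Sphinx
# settings; the values are assumptions, not the contents of this PR's file.
project = "Kubeflow Trainer"

extensions = [
    "myst_parser",            # MyST Markdown support
    "sphinxcontrib.mermaid",  # Mermaid diagrams
]

# Furo theme plus the static assets listed in the table above.
html_theme = "furo"
html_static_path = ["_static"]
html_css_files = ["css/custom.css"]
html_js_files = ["js/external-links.js", "js/sidebar-toggle.js"]

# Accept both reStructuredText (index.rst) and Markdown sources.
source_suffix = {".rst": "restructuredtext", ".md": "markdown"}

# Link checker settings (values assumed for illustration).
linkcheck_timeout = 30
linkcheck_retries = 2
linkcheck_ignore = [r"http://localhost:\d+/?"]
```

The paths in `html_css_files` and `html_js_files` are resolved relative to the directories in `html_static_path`, which is why they omit the `_static/` prefix.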
jeffspahr left a comment
This is the right overall direction for #3255: the docs are colocated, the section layout is sensible, and the Sphinx/Furo setup builds locally.
I don't think it meets the dedicated-site intent yet, though, because a few of the new docs still ship broken references or still depend on the old kubeflow.org site.
Blocking items from local verification:
- `python -m sphinx -n -W --keep-going -b html . _build/html` reports unresolved internal refs in the new legacy docs
- `.readthedocs.yaml` is configured with `fail_on_warning: false`, so those dead links can still be published
- `python -m sphinx -b linkcheck . _build/linkcheck` also reports broken `pkg.go.dev` / GitHub blob anchor references in the new guides
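
The two Sphinx invocations quoted above could be wrapped in a small script so that the strict build and the link check run together. The sketch below is one possible way to do that, not part of the PR: the helper names `sphinx_command` and `run_checks` are hypothetical, and it assumes Sphinx is installed in the current Python environment and that the docs live in a `docs` directory.

```python
import subprocess
import sys
from typing import List


def sphinx_command(builder: str, out_dir: str, strict: bool = False) -> List[str]:
    """Build the argv for `python -m sphinx` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "sphinx"]
    if strict:
        # -n: nitpicky mode, -W: turn warnings into errors,
        # --keep-going: report every warning before failing.
        cmd += ["-n", "-W", "--keep-going"]
    cmd += ["-b", builder, ".", out_dir]
    return cmd


def run_checks(docs_dir: str = "docs") -> int:
    """Run the strict HTML build and linkcheck; return the number of failures."""
    commands = [
        sphinx_command("html", "_build/html", strict=True),
        sphinx_command("linkcheck", "_build/linkcheck"),
    ]
    failures = 0
    for cmd in commands:
        # A non-zero exit code means warnings (strict build) or broken links.
        if subprocess.run(cmd, cwd=docs_dir).returncode != 0:
            failures += 1
    return failures
```

Calling `run_checks()` from the repository root and exiting non-zero when it returns a positive count would make the same failures visible locally that a strict Read the Docs build would surface.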
docs/overview/index.md (Outdated)

The platform features **distributed data caching** using Apache Arrow and Apache DataFusion for zero-copy tensor streaming directly to GPU nodes, maximizing training performance.


One of the explicit motivations in #3255 was to co-locate the docs with the source. This overview page still hotlinks the hero diagrams from kubeflow.org, so the new site depends on the old one for core content. I would vendor these diagrams into docs/images and reference them locally before merging the dedicated site.
I initially hotlinked these while scaffolding the content, but you're right: they should be vendored locally so the new site is fully self-contained. I'll download the images into docs/images/.
…links in user guides

- Clarified that each rank in a multi-node TrainJob must download the dataset independently.
- Updated links in the local execution and Podman user guides to point directly to the relevant sections in the overview.
- Corrected the path for accessing the fine-tuned model in the BuiltinTrainer guide.

These changes enhance the clarity and usability of the documentation for Kubeflow Trainer. (resolved the Copilot reviews)
Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>
Thanks for the review @jeffspahr
…ntation
- Added architecture diagrams to the Kubeflow Trainer Extension Framework guide for better visualization.
- Updated terminology in the index to reflect the transition from Pod Templates to Runtime Patches.
- Removed the outdated PodTemplate documentation to streamline the guides and focus on current practices.
These changes aim to enhance clarity and usability in the documentation for the Kubeflow Trainer.
Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>

- Enabled fail_on_warning in Sphinx configuration to ensure builds fail on warnings, enhancing documentation quality.
- Updated links in user guides to point to the correct paths, improving navigation and clarity for users.
These changes aim to enhance the reliability and usability of the documentation for Kubeflow Trainer.
Signed-off-by: Sridhar1030 <sridharpillai75@gmail.com>
Hey @jeffspahr, I've resolved most of your review comments apart from the …

Here's what was addressed:

The strict Sphinx build ( …

Could you review again and let me know if the PR requires any more changes?
What this PR does / why we need it:
Adds a dedicated Sphinx documentation website for Kubeflow Trainer, as proposed in #3255.
Built with Sphinx (Furo theme) + MyST Markdown, designed for Read the Docs hosting.
What's done:
- … .readthedocs.yaml)
- … Contributor Guides, API Reference, Legacy v1
- … make docs, make docs-serve, make docs-clean)

WebsiteRecording.mov
What's remaining:
How to test locally:

    cd docs
    pip install -r requirements.txt
    make html
    python3 -m http.server 8000 --directory _build/html

You can access the deployed docs at:
https://trainer-doc-website.readthedocs.io/en/latest/