1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -147,6 +147,7 @@ Getting Involved
train/custom-training
train/distributed
train/runtimes
train/initializers
train/options
train/api

27 changes: 27 additions & 0 deletions docs/source/train/api.rst
@@ -23,6 +23,33 @@ Trainers
:members:
:show-inheritance:

Initializers
------------

.. autoclass:: kubeflow.trainer.Initializer
:members:
:show-inheritance:

.. autoclass:: kubeflow.trainer.HuggingFaceDatasetInitializer
:members:
:show-inheritance:

.. autoclass:: kubeflow.trainer.S3DatasetInitializer
:members:
:show-inheritance:

.. autoclass:: kubeflow.trainer.DataCacheInitializer
:members:
:show-inheritance:

.. autoclass:: kubeflow.trainer.HuggingFaceModelInitializer
:members:
:show-inheritance:

.. autoclass:: kubeflow.trainer.S3ModelInitializer
:members:
:show-inheritance:

Backend Configurations
----------------------

6 changes: 6 additions & 0 deletions docs/source/train/index.rst
@@ -67,6 +67,12 @@ Guides

Understand pre-configured environments for PyTorch, TensorFlow, etc.

.. grid-item-card:: Data and Model Initializers
:link: initializers
:link-type: doc

Download datasets and pre-trained models before training starts.

Common Patterns
---------------

201 changes: 201 additions & 0 deletions docs/source/train/initializers.rst
@@ -0,0 +1,201 @@
Data and Model Initializers
===========================

Initializers are pre-training containers that download datasets and pre-trained
models before your training job starts. You declare *what* to fetch; the SDK
runs the download as a separate step and makes the data available to your
training container.

.. note::

Initializers are supported on the **Container backend** and the
**Kubernetes backend**. They have no effect on ``LocalProcessBackend``.
``DataCacheInitializer`` is only supported on the **Kubernetes backend**.

Available Initializers
----------------------

.. list-table::
   :header-rows: 1
   :widths: 20 20 60

   * - Kind
     - Source
     - Class
   * - Dataset
     - HuggingFace Hub
     - ``HuggingFaceDatasetInitializer``
   * - Dataset
     - S3-compatible
     - ``S3DatasetInitializer``
   * - Dataset
     - Distributed cache
     - ``DataCacheInitializer`` *(Kubernetes only)*
   * - Model
     - HuggingFace Hub
     - ``HuggingFaceModelInitializer``
   * - Model
     - S3-compatible
     - ``S3ModelInitializer``

Wrap them in ``Initializer`` and pass it to ``client.train()``. When both
``dataset`` and ``model`` are set, they download **in parallel**, so the total
wait is roughly that of the slower download rather than the sum of the two.
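
The timing claim can be illustrated with a small standard-library sketch. It
does not use the SDK: ``fake_download`` and ``timed_parallel`` are hypothetical
stand-ins that only sleep, assuming two independent downloads that do not
compete for bandwidth.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_download(seconds: float) -> float:
    """Hypothetical stand-in for an initializer download; it only sleeps."""
    time.sleep(seconds)
    return seconds


def timed_parallel(durations: list[float]) -> float:
    """Run all fake downloads concurrently and return the wall-clock time."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=len(durations)) as pool:
        # Consume the iterator so every task has finished before timing stops.
        list(pool.map(fake_download, durations))
    return time.perf_counter() - start


# "dataset" takes 0.3 s and "model" takes 0.2 s in this toy setup.
elapsed = timed_parallel([0.3, 0.2])
print(f"elapsed: {elapsed:.2f}s")  # close to the slower task, not the 0.5 s sum
```

Real downloads share network and disk, so the actual saving may be smaller than
this idealized picture suggests.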

Dataset Initializers
--------------------

**HuggingFace Hub:**

.. code-block:: python

    from kubeflow.trainer import TrainerClient, CustomTrainer
    from kubeflow.trainer import Initializer, HuggingFaceDatasetInitializer
    from kubeflow.trainer.backends.container.types import ContainerBackendConfig

    client = TrainerClient(backend_config=ContainerBackendConfig())
    client.train(
        initializer=Initializer(
            dataset=HuggingFaceDatasetInitializer(
                storage_uri="hf://username/my-dataset",
                access_token="hf_...",  # required for private repos
            )
        ),
        # ``train`` is your own training function, defined elsewhere.
        trainer=CustomTrainer(func=train),
    )

The dataset is available inside the training container at ``/workspace/dataset``.
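
Inside the training function, a quick way to confirm what was downloaded is to
walk that directory. The helper below is a minimal sketch using only the
standard library; ``list_dataset_files`` is a hypothetical name, and the layout
of files under ``/workspace/dataset`` depends on the source repository, so no
particular structure is assumed.

```python
from pathlib import Path


def list_dataset_files(root: str = "/workspace/dataset") -> list[str]:
    """Return the sorted relative paths of all files under ``root``.

    Returns an empty list if the directory does not exist, for example
    when no dataset initializer was configured.
    """
    base = Path(root)
    if not base.is_dir():
        return []
    return sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()
    )


# Printing the listing at the start of training makes a missing or
# incomplete download visible in the job logs.
print(list_dataset_files())
```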

**S3-compatible storage:**

.. code-block:: python

    from kubeflow.trainer import Initializer, S3DatasetInitializer

    client.train(
        initializer=Initializer(
            dataset=S3DatasetInitializer(
                storage_uri="s3://my-bucket/datasets/my-dataset",
                endpoint="https://minio.example.com",  # omit for AWS S3
                access_key_id="...",
                secret_access_key="...",
                region="us-east-1",
            )
        ),
        trainer=CustomTrainer(func=train),
    )
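
To keep credentials out of source code, the keyword arguments can be collected
from environment variables first. This is a sketch, not part of the SDK:
``s3_settings_from_env`` is a hypothetical helper, the keys mirror the
parameter names used in the example above, and ``S3_ENDPOINT_URL`` is an
assumed variable name (the ``AWS_*`` names follow the usual AWS CLI
convention).

```python
import os
from collections.abc import Mapping


def s3_settings_from_env(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect S3 settings from the environment, dropping unset or empty ones."""
    candidates = {
        "access_key_id": env.get("AWS_ACCESS_KEY_ID"),
        "secret_access_key": env.get("AWS_SECRET_ACCESS_KEY"),
        "region": env.get("AWS_DEFAULT_REGION"),
        # Assumed variable name for S3-compatible stores such as MinIO.
        "endpoint": env.get("S3_ENDPOINT_URL"),
    }
    return {key: value for key, value in candidates.items() if value}


settings = s3_settings_from_env(
    {"AWS_ACCESS_KEY_ID": "example-id", "AWS_SECRET_ACCESS_KEY": "example-secret"}
)
print(sorted(settings))
```

The resulting dictionary could then be unpacked into the initializer, for
example ``S3DatasetInitializer(storage_uri=..., **s3_settings_from_env())``.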

**Distributed cache (Kubernetes only):**

.. code-block:: python

    from kubeflow.trainer import Initializer, DataCacheInitializer

    client.train(
        initializer=Initializer(
            dataset=DataCacheInitializer(
                storage_uri="cache://my_schema/my_table",
                metadata_loc="s3://my-bucket/iceberg/my_table/metadata/v1.metadata.json",
                num_data_nodes=4,  # must be > 1
                iam_role="arn:aws:iam::123456789012:role/my-role",  # optional
            )
        ),
        trainer=CustomTrainer(func=train),
    )

.. note::

``DataCacheInitializer`` requires the **Kubernetes backend**. The
``storage_uri`` must follow the ``cache://<SCHEMA_NAME>/<TABLE_NAME>``
format and ``num_data_nodes`` must be greater than 1.
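
The two constraints in the note can be checked before a job is submitted. The
function below is a hypothetical client-side check written for illustration; it
is not the SDK's own validation, and it assumes that schema and table names
contain no ``/`` characters.

```python
import re

# cache://<SCHEMA_NAME>/<TABLE_NAME>, with exactly two non-empty segments.
_CACHE_URI = re.compile(r"^cache://(?P<schema>[^/]+)/(?P<table>[^/]+)$")


def check_cache_settings(storage_uri: str, num_data_nodes: int) -> tuple[str, str]:
    """Validate the settings and return ``(schema, table)`` if they are acceptable."""
    match = _CACHE_URI.match(storage_uri)
    if match is None:
        raise ValueError(
            f"storage_uri must look like cache://<SCHEMA_NAME>/<TABLE_NAME>, "
            f"got {storage_uri!r}"
        )
    if num_data_nodes <= 1:
        raise ValueError(f"num_data_nodes must be greater than 1, got {num_data_nodes}")
    return match["schema"], match["table"]


schema, table = check_cache_settings("cache://my_schema/my_table", num_data_nodes=4)
print(schema, table)
```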

Model Initializers
------------------

**HuggingFace Hub:**

.. code-block:: python

    from kubeflow.trainer import Initializer, HuggingFaceModelInitializer

    client.train(
        initializer=Initializer(
            model=HuggingFaceModelInitializer(
                storage_uri="hf://meta-llama/Llama-3.2-1B",
                access_token="hf_...",
            )
        ),
        trainer=CustomTrainer(func=fine_tune),
    )

Model weights are available at ``/workspace/model``. By default,
redundant formats (``*.msgpack``, ``*.h5``, ``*.bin``, ``*.pt``, ``*.pth``)
are skipped. Pass ``ignore_patterns=[]`` to download everything.
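
To see which files such patterns would exclude, the filtering can be
approximated with Python's ``fnmatch``. This sketch assumes shell-style glob
matching; the initializer's actual matching logic is not shown here, and
``files_to_download`` and the file names are hypothetical.

```python
from fnmatch import fnmatch

# The default patterns listed above.
DEFAULT_IGNORE = ["*.msgpack", "*.h5", "*.bin", "*.pt", "*.pth"]


def files_to_download(filenames, ignore_patterns=DEFAULT_IGNORE):
    """Keep only the names that match none of the ignore patterns."""
    return [
        name
        for name in filenames
        if not any(fnmatch(name, pattern) for pattern in ignore_patterns)
    ]


repo_files = ["config.json", "model.safetensors", "pytorch_model.bin", "tf_model.h5"]
kept = files_to_download(repo_files)
everything = files_to_download(repo_files, ignore_patterns=[])
print(kept)
print(everything)
```

With the default patterns only the JSON and safetensors files remain; an empty
pattern list keeps every file.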

**S3-compatible storage:**

.. code-block:: python

    from kubeflow.trainer import Initializer, S3ModelInitializer

    client.train(
        initializer=Initializer(
            model=S3ModelInitializer(
                storage_uri="s3://my-models/llama-3.2-1b",
                access_key_id="...",
                secret_access_key="...",
                region="us-east-1",
            )
        ),
        trainer=CustomTrainer(func=fine_tune),
    )

Using Both Together
-------------------

.. code-block:: python

    from kubeflow.trainer import (
        Initializer,
        HuggingFaceDatasetInitializer,
        HuggingFaceModelInitializer,
    )

    client.train(
        initializer=Initializer(
            dataset=HuggingFaceDatasetInitializer(storage_uri="hf://tatsu-lab/alpaca"),
            model=HuggingFaceModelInitializer(
                storage_uri="hf://meta-llama/Llama-3.2-1B",
                access_token="hf_...",
            ),
        ),
        trainer=CustomTrainer(func=fine_tune),
    )

Container Backend Configuration
-------------------------------

Override default images or increase the timeout via ``ContainerBackendConfig``:

.. code-block:: python

    from kubeflow.trainer import TrainerClient
    from kubeflow.trainer.backends.container.types import ContainerBackendConfig

    client = TrainerClient(backend_config=ContainerBackendConfig(
        dataset_initializer_image="ghcr.io/kubeflow/trainer/dataset-initializer:latest",
        model_initializer_image="ghcr.io/kubeflow/trainer/model-initializer:latest",
        initializer_timeout=1800,  # seconds, default 600
    ))
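
A rough way to choose ``initializer_timeout`` is to estimate the transfer time
from the artifact size and the expected bandwidth. The helper below is a
back-of-the-envelope sketch: ``estimate_timeout`` is a hypothetical name, the
sizes and bandwidths are made-up inputs, and the safety factor of 1.5 is an
arbitrary assumption.

```python
import math


def estimate_timeout(size_gb: float, mbit_per_s: float, safety_factor: float = 1.5) -> int:
    """Estimate a timeout in seconds for downloading ``size_gb`` gigabytes.

    Uses decimal units: 1 GB = 8000 Mbit.
    """
    transfer_seconds = size_gb * 8000 / mbit_per_s
    return math.ceil(transfer_seconds * safety_factor)


# 10 GB at 200 Mbit/s is 400 s of transfer, so the padded estimate is 600 s.
print(estimate_timeout(10, 200))
# 30 GB at the same bandwidth needs 1800 s with the same padding.
print(estimate_timeout(30, 200))
```

Under these assumptions, anything much larger than about 10 GB on a
200 Mbit/s link would exceed the 600-second default.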

Debugging
---------

Fetch logs from a specific initializer step:

.. code-block:: python

    for line in client.get_job_logs(job_name, step="dataset-initializer"):
        print(line)

    for line in client.get_job_logs(job_name, step="model-initializer"):
        print(line)