34 changes: 34 additions & 0 deletions docs/source/train/options.rst
@@ -60,3 +60,37 @@ Options Reference
 .. autoclass:: kubeflow.trainer.options.ContainerPatch
     :members:
     :show-inheritance:
+
+Using options with TrainerClient
+================================
+
+The ``options`` parameter in ``TrainerClient`` allows users to customize runtime behavior
+and backend-specific configurations for training jobs.
+
+It provides flexibility to control how training jobs are executed depending on the
+selected backend (e.g., Kubernetes, local, container).
+
+Example
+-------
+
+.. code-block:: python
+
+    from kubeflow.trainer import TrainerClient, CustomTrainer
+
+    def train_fn():
+        print("Training...")
+
+    client = TrainerClient()
+
+    job_id = client.train(
+        trainer=CustomTrainer(func=train_fn),
+        options={
+            "epochs": 10,
+            "batch_size": 32
+        }
+    )
+
+    client.wait_for_job_status(job_id)
+
+The ``options`` dictionary can include different parameters depending on the backend
+and runtime configuration.
30 changes: 30 additions & 0 deletions kubeflow/trainer/api/trainer_client.py
@@ -265,6 +265,36 @@ def wait_for_job_status(
             polling_interval=polling_interval,
             callbacks=callbacks,
         )
+    def get_job_progress(self, name: str) -> dict:
+        """Get progress of a TrainJob.
+
+        Args:
+            name: Name of the TrainJob.
+
+        Returns:
+            Dictionary containing job status and progress.
+        """
+
+        # Get job details
+        job = self.get_job(name=name)
+
+        status = job.status if hasattr(job, "status") else "Unknown"
+
+        if status == "Running":
+            progress = "In Progress"
+        elif status in ["Complete", "Succeeded"]:
Copilot AI Mar 20, 2026

Line 285 checks for status "Succeeded", but the TrainJob status constants only define "Created", "Running", "Complete", and "Failed". This check will never match. Use "Complete" instead if checking for successful completion.

Suggested change
-        elif status in ["Complete", "Succeeded"]:
+        elif status == "Complete":

+            progress = "100%"
+        elif status == "Failed":
+            progress = "Error"
+        else:
+            progress = "Unknown"
+
+        return {
+            "job_id": name,
+            "status": status,
+            "progress": progress,
+        }
Comment on lines +268 to +296
Copilot AI Mar 20, 2026

The PR description states it fixes issue #367 by implementing the proposed update_runtime_status() API, but this PR adds get_job_progress() instead. The PR description does not mention this method. Clarify whether this is the intended implementation for issue #367 or if it is a separate enhancement.
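For reference, the status-to-progress mapping that `get_job_progress()` introduces can be sketched as a standalone function, restricted to the four status values the first review comment lists ("Created", "Running", "Complete", "Failed"). This is only an illustration under stated assumptions: the constant names and `describe_progress` are hypothetical stand-ins, not part of the SDK, and the sketch keeps the diff's behaviour of reporting "Unknown" for any status without an explicit mapping.

```python
from typing import Optional

# Hypothetical stand-ins for the TrainJob status constants named in the
# review comment; the real definitions live in the SDK's constants module.
TRAINJOB_CREATED = "Created"
TRAINJOB_RUNNING = "Running"
TRAINJOB_COMPLETE = "Complete"
TRAINJOB_FAILED = "Failed"

# Lookup table replacing the if/elif chain in the diff. "Created" has no
# entry, so it falls through to "Unknown", as it does in the diff.
_PROGRESS_BY_STATUS = {
    TRAINJOB_RUNNING: "In Progress",
    TRAINJOB_COMPLETE: "100%",
    TRAINJOB_FAILED: "Error",
}


def describe_progress(name: str, status: Optional[str]) -> dict:
    """Return the same dictionary shape as get_job_progress() in the diff.

    `describe_progress` is an illustrative name; it takes the status
    directly instead of fetching the job, so it runs without a backend.
    """
    # Mirrors the diff's hasattr() guard: a missing status becomes "Unknown".
    resolved = status if status is not None else "Unknown"
    return {
        "job_id": name,
        "status": resolved,
        "progress": _PROGRESS_BY_STATUS.get(resolved, "Unknown"),
    }
```

With this shape, a status such as "Succeeded" that is not among the defined constants simply maps to "Unknown" instead of needing its own branch.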



Comment on lines +297 to 298
Copilot AI Mar 20, 2026

Line 297 contains trailing whitespace. Remove to follow code style conventions.

Suggested change

     def delete_job(self, name: str):
         """Delete the TrainJob.
23 changes: 14 additions & 9 deletions kubeflow/trainer/backends/localprocess/backend.py
@@ -264,17 +264,22 @@ def __get_job_status(self, job: LocalBackendJobs) -> str:
         if not job.steps:
             return constants.TRAINJOB_CREATED
         statuses = [_step.job.status for _step in job.steps]
-        # if status is running or failed will take precedence over completed
+
+    # Priority: Failed > Running > Created > Complete
         if constants.TRAINJOB_FAILED in statuses:
-            status = constants.TRAINJOB_FAILED
-        elif constants.TRAINJOB_RUNNING in statuses:
-            status = constants.TRAINJOB_RUNNING
-        elif constants.TRAINJOB_CREATED in statuses:
-            status = constants.TRAINJOB_CREATED
-        else:
-            status = constants.TRAINJOB_COMPLETE
+            return constants.TRAINJOB_FAILED
+
+        if constants.TRAINJOB_RUNNING in statuses:
+            return constants.TRAINJOB_RUNNING
+        if constants.TRAINJOB_CREATED in statuses:
+            return constants.TRAINJOB_CREATED
+
+    # ✅ NEW FIX: Ensure all steps are actually complete
+        if all(status == constants.TRAINJOB_COMPLETE for status in statuses):
+            return constants.TRAINJOB_COMPLETE

-        return status
+    # fallback (safety)
+        return constants.TRAINJOB_RUNNING
Comment on lines +268 to +282
Copilot AI Mar 20, 2026

Incorrect indentation: comment lines 268, 277, and 281 are indented with 4 spaces, but should be indented with 8 spaces (aligned with the subsequent if statements). This will cause a Python IndentationError when the file is parsed.
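As a point of comparison, the aggregation logic in this hunk can be sketched as a standalone function with the comments aligned to the code they describe. The constant names and `aggregate_status` below are hypothetical stand-ins, and the function takes a plain list of step statuses instead of a `LocalBackendJobs` object. One caveat on the comment above: Python ignores comment-only lines when it computes indentation, so misaligned comments are normally a style problem and not a parse error, though aligning them is still the cleaner fix.

```python
from typing import List

# Hypothetical stand-ins for constants.TRAINJOB_* used in the diff.
TRAINJOB_CREATED = "Created"
TRAINJOB_RUNNING = "Running"
TRAINJOB_COMPLETE = "Complete"
TRAINJOB_FAILED = "Failed"


def aggregate_status(statuses: List[str]) -> str:
    """Collapse per-step statuses into one job status.

    Priority: Failed > Running > Created > Complete.
    """
    # A job with no steps has not started yet.
    if not statuses:
        return TRAINJOB_CREATED

    # Any failed step fails the whole job.
    if TRAINJOB_FAILED in statuses:
        return TRAINJOB_FAILED
    if TRAINJOB_RUNNING in statuses:
        return TRAINJOB_RUNNING
    if TRAINJOB_CREATED in statuses:
        return TRAINJOB_CREATED

    # Report completion only when every step is complete.
    if all(status == TRAINJOB_COMPLETE for status in statuses):
        return TRAINJOB_COMPLETE

    # Fallback for unrecognised step statuses, mirroring the diff.
    return TRAINJOB_RUNNING
```

The early returns make the priority order explicit, and the final fallback is reached only if a step reports a status outside the four known values.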


     def __register_job(
         self,