35 changes: 20 additions & 15 deletions kubeflow/trainer/api/trainer_client.py
@@ -186,28 +186,33 @@ def get_job_logs(
step: str = constants.NODE + "-0",
follow: bool | None = False,
) -> Iterator[str]:
-"""Get logs from a specific step of a TrainJob.
+"""
+Retrieve logs from a specific step of a TrainJob.
Comment on lines +189 to +190
Copilot AI Mar 30, 2026

The docstring starts with a blank line (the opening triple-quote is on its own line), which introduces an empty first line in generated docs/help(); align with the rest of this file by putting the one-line summary immediately after the opening """.

Suggested change
-"""
-Retrieve logs from a specific step of a TrainJob.
+"""Retrieve logs from a specific step of a TrainJob.

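A minimal sketch of the effect described in this comment, using two hypothetical stand-in functions rather than the real `get_job_logs`. It suggests that the raw `__doc__` string does start with an empty line when the summary sits below the opening quotes, while tools that normalize docstrings through `inspect.cleandoc()` (for example `inspect.getdoc()`) drop leading blank lines, so the difference is mostly visible to consumers that read `__doc__` directly.

```python
import inspect


def summary_on_first_line():
    """Retrieve logs from a specific step of a TrainJob."""


def summary_on_second_line():
    """
    Retrieve logs from a specific step of a TrainJob.
    """


# The raw __doc__ attribute keeps the newline that follows the opening quotes.
raw_first = summary_on_first_line.__doc__.splitlines()[0]
raw_second = summary_on_second_line.__doc__.splitlines()[0]

# inspect.getdoc() applies inspect.cleandoc(), which removes leading blank lines.
clean_first = inspect.getdoc(summary_on_first_line).splitlines()[0]
clean_second = inspect.getdoc(summary_on_second_line).splitlines()[0]

print(repr(raw_first), repr(raw_second))
print(repr(clean_first), repr(clean_second))
```

Either way, keeping the summary on the opening line matches the rest of the file, which is the main point of the suggestion.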

+This method allows you to fetch logs either as a batch or stream them
+in real-time using the `follow` parameter.
Comment on lines +190 to +193
Copilot AI Mar 30, 2026

The docstring says logs are retrieved "from a specific step", but several backends treat the default step==node-0 as a sentinel to return logs from multiple steps/nodes (not just node-0); please clarify the default/step semantics so users know what output to expect.

-You can watch for the logs in realtime as follows:
-```python
-from kubeflow.trainer import TrainerClient
+Example:
+from kubeflow.trainer import TrainerClient

-for logline in TrainerClient().get_job_logs(name="s8d44aa4fb6d", follow=True):
-    print(logline)
-```
+client = TrainerClient()
+
+# Stream logs in real-time
+for line in client.get_job_logs(name="job-id", follow=True):
+    print(line)

Args:
-name: Name of the TrainJob.
-step: Step of the TrainJob to collect logs from, like dataset-initializer or node-0.
-follow: Whether to stream logs in realtime as they are produced.
+name (str): Name of the TrainJob.
+step (str): Step of the TrainJob to collect logs from
+(e.g., dataset-initializer or node-0).
+follow (bool, optional): If True, streams logs in real-time.
+Defaults to False.
Comment on lines +205 to +209
Copilot AI Mar 30, 2026

The Args section switches to "name (str):" style, but other TrainerClient methods in this file consistently use Google-style "name:" without repeating types; please keep the Args formatting consistent within the module (either revert here or update the surrounding methods for the same style).

Suggested change
-name (str): Name of the TrainJob.
-step (str): Step of the TrainJob to collect logs from
-(e.g., dataset-initializer or node-0).
-follow (bool, optional): If True, streams logs in real-time.
-Defaults to False.
+name: Name of the TrainJob.
+step: Step of the TrainJob to collect logs from
+(e.g., dataset-initializer or node-0).
+follow: If True, streams logs in real-time. Defaults to False.

Returns:
-Iterator of log lines.
+Iterator[str]: An iterator over log lines.

Copilot AI Mar 30, 2026

This docstring removed the Raises section, but get_job_logs can still raise (e.g., Kubernetes backend surfaces TimeoutError/RuntimeError via get_job, and LocalProcess backend raises ValueError when the job name is unknown); please document the relevant exceptions again so callers know what to handle.

Suggested change
+Raises:
+ValueError: The TrainJob with the given name does not exist (e.g., LocalProcess backend).
+TimeoutError: Timeout while retrieving the TrainJob or its logs (e.g., Kubernetes backend).
+RuntimeError: Failed to retrieve the TrainJob or its logs (e.g., Kubernetes backend).
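To illustrate why callers benefit from a documented Raises section, here is a hedged sketch of handling the three exceptions named above. `read_logs_safely` and `fake_backend` are hypothetical names introduced only for this example; `get_logs` stands in for `TrainerClient().get_job_logs` so the snippet runs without a cluster. Whether a real backend raises at call time or on first iteration is an assumption that depends on its implementation, so the loop is kept inside the `try` block to cover both cases.

```python
from collections.abc import Callable, Iterator


def read_logs_safely(get_logs: Callable[..., Iterator[str]], name: str) -> list[str]:
    """Collect log lines, turning each documented error into a message line."""
    lines: list[str] = []
    try:
        # Keep the loop inside the try block: a generator-based backend
        # would raise during iteration rather than on the call itself.
        for line in get_logs(name=name, follow=False):
            lines.append(line)
    except ValueError as exc:
        lines.append(f"unknown job: {exc}")
    except TimeoutError as exc:
        lines.append(f"timed out: {exc}")
    except RuntimeError as exc:
        lines.append(f"failed: {exc}")
    return lines


def fake_backend(name: str, follow: bool = False) -> Iterator[str]:
    """Hypothetical stand-in: yields one line, then fails like a timeout."""
    yield "epoch 1 done"
    raise TimeoutError(f"no response for {name}")


print(read_logs_safely(fake_backend, "job-id"))
```

Lines received before the failure are kept, which is usually what a caller wants when a stream is cut short.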

-Raises:
-TimeoutError: Timeout to get a TrainJob.
-RuntimeError: Failed to get a TrainJob.
+Note:
+If no logs are available, an empty iterator may be returned.
"""
return self.backend.get_job_logs(name=name, follow=follow, step=step)
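The new Note says an empty iterator may be returned. An iterator cannot be tested for emptiness with `if not logs`, so a caller who needs to distinguish "no logs" from "some logs" has to pull the first item. The sketch below shows one way to do that; `peek_logs` is a hypothetical helper, not part of the Kubeflow SDK.

```python
from collections.abc import Iterator
from itertools import chain


def peek_logs(logs: Iterator[str]) -> tuple[bool, Iterator[str]]:
    """Return (has_logs, iterator) without losing the first line."""
    sentinel = object()
    first = next(logs, sentinel)
    if first is sentinel:
        return False, iter(())
    # Put the consumed line back in front of the remaining ones.
    return True, chain([first], logs)


has_logs, lines = peek_logs(iter(["step 1", "step 2"]))
print(has_logs, list(lines))
```

With `follow=True` the call to `next()` would presumably block until the first line arrives, so this pattern fits batch retrieval better than streaming.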
