Fix/335 clusterruntime retrieval by debojit11070 · Pull Request #423 · kubeflow/sdk

debojit11070 · 2026-03-24T15:15:59Z

feat(trainer): Add utility for TrainJob progress reporting (#367)

Description

This PR implements a user-friendly progress reporting utility for TrainJob monitoring, addressing issue #367. Users can now easily track training job progress with completion percentages, step details, and pod health status.

Approach: Hybrid implementation combining a structured dataclass (Approach C) for data representation with an instance method (Approach B) for convenient API access.

What's Changed

New Types (`kubeflow/trainer/types/types.py`)

Added JobProgress dataclass with the following metrics:
- job_name: The name of the TrainJob
- overall_status: Current job status (Running, Complete, Failed, etc.)
- total_steps / completed_steps: Step progress tracking
- running_steps / failed_steps: Detailed step breakdown
- healthy_pods / total_pods: Pod health status
- completion_percentage: Overall progress as percentage (0-100)
Added from_job() class method to extract progress from TrainJob objects
Added __str__() method for human-readable progress summaries

New API (`kubeflow/trainer/api/trainer_client.py`)

Added TrainerClient.get_job_progress(name: str) -> JobProgress method
Provides convenient access to progress information with full docstring and usage examples

Tests

types_test.py: 7 parametrized tests covering various job states
- All steps running
- Some steps completed, some running
- All steps completed
- Some steps failed
- Empty job (no steps)
- String representation test
trainer_client_test.py: 2 integration tests
- Progress object creation and field validation
- Human-readable output formatting

Usage Example

from kubeflow.trainer import TrainerClient

client = TrainerClient()

# Get current job progress
progress = client.get_job_progress("my-training-job")
print(progress)

# Output:
# Job: my-training-job
# Status: Running
# Progress: 50.0% (1/2 steps)
# Pods: 2/2 healthy
# Running steps: data-loading, training
# Failed steps: validation

…ubeflow#400) - Changed polling_interval > timeout to >= across all three backends - Kubernetes: Fixed validation to reject equal values - Container: Added completely missing polling_interval validation - LocalProcess: Fixed validation to reject equal values - Container tests: Added test case for polling_interval >= timeout scenario Fixes: kubeflow#400

…timeout - Validates polling_interval > 0 and timeout > 0 in LocalProcess backend - Validates polling_interval > 0 and timeout > 0 in Kubernetes backend - Validates polling_interval > 0 and timeout > 0 in Container backend - Uses pytest.raises() in container backend tests to properly assert exceptions This prevents ZeroDivisionError and tight loops when non-positive values are provided.

…kubeflow#335) - Enhanced mock for get_cluster_custom_object_response to properly handle 404 and 403 errors - Updated test case for non-existent runtimes to expect RuntimeError instead of success - Added test case for cluster-only runtime retrieval (exists in cluster, not in namespace) - Better simulation of real Kubernetes API behavior for cluster-scoped resources

…untimes (kubeflow#335) - Add RuntimeScope enum with NAMESPACE and CLUSTER values - Update Runtime dataclass to include scope field with default NAMESPACE value - Update KubernetesBackend to set scope based on resource type (TrainingRuntime vs ClusterTrainingRuntime) - Update test helpers and test cases to verify scope is correctly set - Resolves issue where users couldn't distinguish between TrainingRuntime and ClusterTrainingRuntime

) - Add JobProgress dataclass to types.py with from_job() factory method - Implement progress calculation: completion_percentage, running/failed steps, pod health - Add TrainerClient.get_job_progress(name) method for easy API access - Include __str__() method for human-readable progress summaries - Add comprehensive unit tests covering various job states - All tests pass, code passes ruff linting and formatting checks Example usage: client = TrainerClient() progress = client.get_job_progress('my-job') print(progress) # Output: # Job: my-job # Status: Running # Progress: 50.0% (1/2 steps) # Pods: 2/2 healthy # Running steps: training

google-oss-prow · 2026-03-24T15:16:06Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

github-actions · 2026-03-24T15:16:12Z

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Slack: Join our #kubeflow-ml-experience and #kubeflow-trainer Slack channels
Meetings: Attend the Kubeflow SDK and ML Experience bi-weekly meetings

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copilot

Pull request overview

This PR adds a progress-reporting data model/API for TrainJob monitoring and refines runtime retrieval behavior by distinguishing namespace-scoped vs cluster-scoped runtimes.

Changes:

Introduce JobProgress (computed from a TrainJob) and TrainerClient.get_job_progress() for user-facing progress summaries.
Add RuntimeScope and plumb Runtime.scope through Kubernetes runtime parsing/tests to reflect namespaced vs cluster runtimes.
Tighten wait_for_job_status() parameter validation (e.g., polling_interval >= timeout, positive values) across backends and adjust tests accordingly.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
`kubeflow/trainer/types/types.py`	Adds `RuntimeScope`, extends `Runtime` with `scope`, and introduces `JobProgress`.
`kubeflow/trainer/types/types_test.py`	Adds unit tests for `JobProgress.from_job()` and string formatting.
`kubeflow/trainer/api/trainer_client.py`	Adds `TrainerClient.get_job_progress()` convenience API.
`kubeflow/trainer/api/trainer_client_test.py`	Adds tests validating `get_job_progress()` output/format.
`kubeflow/trainer/backends/kubernetes/backend.py`	Adds polling/timeout validation and sets `Runtime.scope` based on CR kind.
`kubeflow/trainer/backends/kubernetes/backend_test.py`	Expands runtime retrieval tests (404/403 cases; cluster-only runtime; scope assertions).
`kubeflow/trainer/backends/localprocess/backend.py`	Adds polling/timeout validation in `wait_for_job_status()`.
`kubeflow/trainer/backends/container/backend.py`	Adds polling/timeout validation in `wait_for_job_status()`.
`kubeflow/trainer/backends/container/backend_test.py`	Adds a new validation test case and refactors `wait_for_job_status` tests.
`.gitignore`	Ignores local development notes file `ISSUES_TO_SOLVE.md`.

Copilot · 2026-03-24T15:20:52Z

kubeflow/trainer/backends/container/backend_test.py

+        container_backend.wait_for_job_status(
+            job_name, status={test_case.config["wait_status"]}, timeout=5, polling_interval=1
+        )


In the "job fails" test case, wait_for_job_status() is invoked without asserting the expected RuntimeError, so the test will error out instead of validating the failure path; wrap this call in pytest.raises(test_case.expected_error) (and optionally assert the message).

Suggested change

container_backend.wait_for_job_status(

job_name, status={test_case.config["wait_status"]}, timeout=5, polling_interval=1

)

with pytest.raises(test_case.expected_error):

container_backend.wait_for_job_status(

job_name,

status={test_case.config["wait_status"]},

timeout=5,

polling_interval=1,

)

Copilot · 2026-03-24T15:20:53Z

kubeflow/trainer/backends/kubernetes/backend.py

+                f"must both be positive"
            )

        for _ in range(round(timeout / polling_interval)):


wait_for_job_status() uses range(round(timeout / polling_interval)), which can both undershoot (timeout=5,polling_interval=4 -> only 1 poll) and overshoot the requested timeout; prefer a time.monotonic()-based loop (like ContainerBackend) or compute iterations with math.ceil and track remaining time explicitly.

Copilot · 2026-03-24T15:20:54Z

kubeflow/trainer/backends/localprocess/backend.py

+                f"must both be positive"
            )

        for _ in range(round(timeout / polling_interval)):


wait_for_job_status() uses range(round(timeout / polling_interval)), which can both undershoot and overshoot the requested timeout (e.g., timeout=5,polling_interval=4 -> only 1 poll); consider using a time.monotonic()-based loop or math.ceil with explicit remaining-time checks.

Copilot · 2026-03-24T15:20:54Z

kubeflow/trainer/api/trainer_client.py

+    def get_job_progress(self, name: str) -> types.JobProgress:
+        """Get progress information for a TrainJob.
+
+        Provides a human-readable summary of the job's progress including
+        completion percentage, running steps, and pod health status.


The PR description/linked issue focuses on a runtime-side utility to report trainerStatus (e.g., update_runtime_status() POSTing updates), but this change adds a client-side retrieval API (get_job_progress) plus runtime scoping; either update the PR description to match the delivered functionality or add the reporting utility described in #367.

…SOLVE.md to repository - Remove Issue kubeflow#403 (active_deadline_seconds) from ISSUES_TO_SOLVE.md as it won't be worked on for now - Remove ISSUES_TO_SOLVE.md from .gitignore to make it a tracked file in the repository - Keep completed issues (kubeflow#400, kubeflow#335, kubeflow#367) documented for reference

debojit11070 added 5 commits March 24, 2026 00:08

Copilot AI review requested due to automatic review settings March 24, 2026 15:16

google-oss-prow bot requested review from astefanutti, kramaranya and szaher March 24, 2026 15:16

google-oss-prow bot added the size/XL label Mar 24, 2026

Copilot started reviewing on behalf of debojit11070 March 24, 2026 15:16 View session

Copilot AI reviewed Mar 24, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix/335 clusterruntime retrieval#423

Fix/335 clusterruntime retrieval#423
debojit11070 wants to merge 6 commits intokubeflow:mainfrom
debojit11070:fix/335-clusterruntime-retrieval

debojit11070 commented Mar 24, 2026

Uh oh!

google-oss-prow bot commented Mar 24, 2026

Uh oh!

github-actions bot commented Mar 24, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Mar 24, 2026

Uh oh!

Copilot AI Mar 24, 2026

Uh oh!

Copilot AI Mar 24, 2026

Uh oh!

Copilot AI Mar 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

debojit11070 commented Mar 24, 2026

feat(trainer): Add utility for TrainJob progress reporting (#367)

Description

What's Changed

New Types (kubeflow/trainer/types/types.py)

New API (kubeflow/trainer/api/trainer_client.py)

Tests

Usage Example

Uh oh!

google-oss-prow bot commented Mar 24, 2026

Uh oh!

github-actions bot commented Mar 24, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

New Types (`kubeflow/trainer/types/types.py`)

New API (`kubeflow/trainer/api/trainer_client.py`)