Conversation
…ubeflow#400) - Changed polling_interval > timeout to >= across all three backends - Kubernetes: Fixed validation to reject equal values - Container: Added completely missing polling_interval validation - LocalProcess: Fixed validation to reject equal values - Container tests: Added test case for polling_interval >= timeout scenario Fixes: kubeflow#400
…timeout - Validates polling_interval > 0 and timeout > 0 in LocalProcess backend - Validates polling_interval > 0 and timeout > 0 in Kubernetes backend - Validates polling_interval > 0 and timeout > 0 in Container backend - Uses pytest.raises() in container backend tests to properly assert exceptions This prevents ZeroDivisionError and tight loops when non-positive values are provided.
…kubeflow#335) - Enhanced mock for get_cluster_custom_object_response to properly handle 404 and 403 errors - Updated test case for non-existent runtimes to expect RuntimeError instead of success - Added test case for cluster-only runtime retrieval (exists in cluster, not in namespace) - Better simulation of real Kubernetes API behavior for cluster-scoped resources
…untimes (kubeflow#335) - Add RuntimeScope enum with NAMESPACE and CLUSTER values - Update Runtime dataclass to include scope field with default NAMESPACE value - Update KubernetesBackend to set scope based on resource type (TrainingRuntime vs ClusterTrainingRuntime) - Update test helpers and test cases to verify scope is correctly set - Resolves issue where users couldn't distinguish between TrainingRuntime and ClusterTrainingRuntime
) - Add JobProgress dataclass to types.py with from_job() factory method - Implement progress calculation: completion_percentage, running/failed steps, pod health - Add TrainerClient.get_job_progress(name) method for easy API access - Include __str__() method for human-readable progress summaries - Add comprehensive unit tests covering various job states - All tests pass, code passes ruff linting and formatting checks Example usage: client = TrainerClient() progress = client.get_job_progress('my-job') print(progress) # Output: # Job: my-job # Status: Running # Progress: 50.0% (1/2 steps) # Pods: 2/2 healthy # Running steps: training
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Pull request overview
This PR adds a progress-reporting data model/API for TrainJob monitoring and refines runtime retrieval behavior by distinguishing namespace-scoped vs cluster-scoped runtimes.
Changes:
- Introduce
JobProgress(computed from aTrainJob) andTrainerClient.get_job_progress()for user-facing progress summaries. - Add
RuntimeScopeand plumbRuntime.scopethrough Kubernetes runtime parsing/tests to reflect namespaced vs cluster runtimes. - Tighten
wait_for_job_status()parameter validation (e.g.,polling_interval >= timeout, positive values) across backends and adjust tests accordingly.
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
kubeflow/trainer/types/types.py |
Adds RuntimeScope, extends Runtime with scope, and introduces JobProgress. |
kubeflow/trainer/types/types_test.py |
Adds unit tests for JobProgress.from_job() and string formatting. |
kubeflow/trainer/api/trainer_client.py |
Adds TrainerClient.get_job_progress() convenience API. |
kubeflow/trainer/api/trainer_client_test.py |
Adds tests validating get_job_progress() output/format. |
kubeflow/trainer/backends/kubernetes/backend.py |
Adds polling/timeout validation and sets Runtime.scope based on CR kind. |
kubeflow/trainer/backends/kubernetes/backend_test.py |
Expands runtime retrieval tests (404/403 cases; cluster-only runtime; scope assertions). |
kubeflow/trainer/backends/localprocess/backend.py |
Adds polling/timeout validation in wait_for_job_status(). |
kubeflow/trainer/backends/container/backend.py |
Adds polling/timeout validation in wait_for_job_status(). |
kubeflow/trainer/backends/container/backend_test.py |
Adds a new validation test case and refactors wait_for_job_status tests. |
.gitignore |
Ignores local development notes file ISSUES_TO_SOLVE.md. |
| container_backend.wait_for_job_status( | ||
| job_name, status={test_case.config["wait_status"]}, timeout=5, polling_interval=1 | ||
| ) |
There was a problem hiding this comment.
In the "job fails" test case, wait_for_job_status() is invoked without asserting the expected RuntimeError, so the test will error out instead of validating the failure path; wrap this call in pytest.raises(test_case.expected_error) (and optionally assert the message).
| container_backend.wait_for_job_status( | |
| job_name, status={test_case.config["wait_status"]}, timeout=5, polling_interval=1 | |
| ) | |
| with pytest.raises(test_case.expected_error): | |
| container_backend.wait_for_job_status( | |
| job_name, | |
| status={test_case.config["wait_status"]}, | |
| timeout=5, | |
| polling_interval=1, | |
| ) |
| f"must both be positive" | ||
| ) | ||
|
|
||
| for _ in range(round(timeout / polling_interval)): |
There was a problem hiding this comment.
wait_for_job_status() uses range(round(timeout / polling_interval)), which can both undershoot (timeout=5,polling_interval=4 -> only 1 poll) and overshoot the requested timeout; prefer a time.monotonic()-based loop (like ContainerBackend) or compute iterations with math.ceil and track remaining time explicitly.
| f"must both be positive" | ||
| ) | ||
|
|
||
| for _ in range(round(timeout / polling_interval)): |
There was a problem hiding this comment.
wait_for_job_status() uses range(round(timeout / polling_interval)), which can both undershoot and overshoot the requested timeout (e.g., timeout=5,polling_interval=4 -> only 1 poll); consider using a time.monotonic()-based loop or math.ceil with explicit remaining-time checks.
| def get_job_progress(self, name: str) -> types.JobProgress: | ||
| """Get progress information for a TrainJob. | ||
|
|
||
| Provides a human-readable summary of the job's progress including | ||
| completion percentage, running steps, and pod health status. |
There was a problem hiding this comment.
The PR description/linked issue focuses on a runtime-side utility to report trainerStatus (e.g., update_runtime_status() POSTing updates), but this change adds a client-side retrieval API (get_job_progress) plus runtime scoping; either update the PR description to match the delivered functionality or add the reporting utility described in #367.
…SOLVE.md to repository - Remove Issue kubeflow#403 (active_deadline_seconds) from ISSUES_TO_SOLVE.md as it won't be worked on for now - Remove ISSUES_TO_SOLVE.md from .gitignore to make it a tracked file in the repository - Keep completed issues (kubeflow#400, kubeflow#335, kubeflow#367) documented for reference
feat(trainer): Add utility for TrainJob progress reporting (#367)
Description
This PR implements a user-friendly progress reporting utility for TrainJob monitoring, addressing issue #367. Users can now easily track training job progress with completion percentages, step details, and pod health status.
Approach: Hybrid implementation combining a structured dataclass (Approach C) for data representation with an instance method (Approach B) for convenient API access.
What's Changed
New Types (
kubeflow/trainer/types/types.py)Added
JobProgressdataclass with the following metrics:job_name: The name of the TrainJoboverall_status: Current job status (Running, Complete, Failed, etc.)total_steps/completed_steps: Step progress trackingrunning_steps/failed_steps: Detailed step breakdownhealthy_pods/total_pods: Pod health statuscompletion_percentage: Overall progress as percentage (0-100)Added
from_job()class method to extract progress from TrainJob objectsAdded
__str__()method for human-readable progress summariesNew API (
kubeflow/trainer/api/trainer_client.py)TrainerClient.get_job_progress(name: str) -> JobProgressmethodTests
types_test.py: 7 parametrized tests covering various job states
trainer_client_test.py: 2 integration tests
Usage Example