fix(localprocess): ensure __get_job_status returns Complete correctly#408
fix(localprocess): ensure __get_job_status returns Complete correctly#408Harshithavigna wants to merge 2 commits intokubeflow:mainfrom
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
There was a problem hiding this comment.
Pull request overview
This PR attempts to fix an issue where LocalProcessBackend.__get_job_status may not correctly return Complete when all steps have finished successfully. However, the PR has discrepancies between its description and implementation. The PR description claims to fix issue #367 which proposes adding an update_runtime_status() utility function for progress reporting, but the actual changes include a fix to __get_job_status, a new get_job_progress() method in the TrainerClient API, and documentation about options usage.
Changes:
- Refactors
__get_job_statuslogic to ensure proper status priority and correct handling of the Complete status - Adds
get_job_progress()method to TrainerClient API for retrieving job progress information - Adds documentation example showing how to use options with TrainerClient
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| kubeflow/trainer/backends/localprocess/backend.py | Refactors job status logic to prioritize Failed > Running > Created > Complete, with explicit check that all steps are Complete before returning Complete status |
| kubeflow/trainer/api/trainer_client.py | Adds new get_job_progress() method to retrieve job status and progress information |
| docs/source/train/options.rst | Adds documentation section with example of using options parameter with TrainerClient |
| # Priority: Failed > Running > Created > Complete | ||
| if constants.TRAINJOB_FAILED in statuses: | ||
| status = constants.TRAINJOB_FAILED | ||
| elif constants.TRAINJOB_RUNNING in statuses: | ||
| status = constants.TRAINJOB_RUNNING | ||
| elif constants.TRAINJOB_CREATED in statuses: | ||
| status = constants.TRAINJOB_CREATED | ||
| else: | ||
| status = constants.TRAINJOB_COMPLETE | ||
| return constants.TRAINJOB_FAILED | ||
|
|
||
| if constants.TRAINJOB_RUNNING in statuses: | ||
| return constants.TRAINJOB_RUNNING | ||
| if constants.TRAINJOB_CREATED in statuses: | ||
| return constants.TRAINJOB_CREATED | ||
|
|
||
| # ✅ NEW FIX: Ensure all steps are actually complete | ||
| if all(status == constants.TRAINJOB_COMPLETE for status in statuses): | ||
| return constants.TRAINJOB_COMPLETE | ||
|
|
||
| return status | ||
| # fallback (safety) | ||
| return constants.TRAINJOB_RUNNING |
There was a problem hiding this comment.
Incorrect indentation: comment lines 268, 277, and 281 are indented with 4 spaces, but should be indented with 8 spaces (aligned with the subsequent if statements). This will cause a Python IndentationError when the file is parsed.
| def get_job_progress(self, name: str) -> dict: | ||
| """Get progress of a TrainJob. | ||
|
|
||
| Args: | ||
| name: Name of the TrainJob. | ||
|
|
||
| Returns: | ||
| Dictionary containing job status and progress. | ||
| """ | ||
|
|
||
| # Get job details | ||
| job = self.get_job(name=name) | ||
|
|
||
| status = job.status if hasattr(job, "status") else "Unknown" | ||
|
|
||
| if status == "Running": | ||
| progress = "In Progress" | ||
| elif status in ["Complete", "Succeeded"]: | ||
| progress = "100%" | ||
| elif status == "Failed": | ||
| progress = "Error" | ||
| else: | ||
| progress = "Unknown" | ||
|
|
||
| return { | ||
| "job_id": name, | ||
| "status": status, | ||
| "progress": progress, | ||
| } |
There was a problem hiding this comment.
The PR description states it fixes issue #367 by implementing the proposed update_runtime_status() API, but this PR adds get_job_progress() instead. The PR description does not mention this method. Clarify whether this is the intended implementation for issue #367 or if it is a separate enhancement.
|
|
||
| if status == "Running": | ||
| progress = "In Progress" | ||
| elif status in ["Complete", "Succeeded"]: |
There was a problem hiding this comment.
Line 285 checks for status "Succeeded", but the TrainJob status constants only define "Created", "Running", "Complete", and "Failed". This check will never match. Use "Complete" instead if checking for successful completion.
| elif status in ["Complete", "Succeeded"]: | |
| elif status == "Complete": |
|
|
||
|
|
There was a problem hiding this comment.
Line 297 contains trailing whitespace. Remove to follow code style conventions.
What this PR does / why we need it:
Fixes an issue where
LocalProcessBackend.__get_job_statusmay not correctly returnCompleteeven when all steps have finished successfully.This change ensures that the job status is marked as
Completeonly when all step statuses areTRAINJOB_COMPLETE.Which issue(s) this PR fixes:
Fixes #367
Checklist:
uv run pytest kubeflow/trainer -q)