Skip to content

fix(localprocess): ensure __get_job_status returns Complete correctly#408

Open
Harshithavigna wants to merge 2 commits intokubeflow:mainfrom
Harshithavigna:fix-localprocess-status
Open

fix(localprocess): ensure __get_job_status returns Complete correctly#408
Harshithavigna wants to merge 2 commits intokubeflow:mainfrom
Harshithavigna:fix-localprocess-status

Conversation

@Harshithavigna
Copy link
Copy Markdown

What this PR does / why we need it:
Fixes an issue where LocalProcessBackend.__get_job_status may not correctly return Complete even when all steps have finished successfully.

This change ensures that the job status is marked as Complete only when all step statuses are TRAINJOB_COMPLETE.

Which issue(s) this PR fixes:
Fixes #367

Checklist:

  • Tests pass locally (uv run pytest kubeflow/trainer -q)
  • No breaking changes introduced
  • Follows existing code style and structure

Copilot AI review requested due to automatic review settings March 20, 2026 09:18
@google-oss-prow
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Copy link
Copy Markdown
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR attempts to fix an issue where LocalProcessBackend.__get_job_status may not correctly return Complete when all steps have finished successfully. However, the PR has discrepancies between its description and implementation. The PR description claims to fix issue #367 which proposes adding an update_runtime_status() utility function for progress reporting, but the actual changes include a fix to __get_job_status, a new get_job_progress() method in the TrainerClient API, and documentation about options usage.

Changes:

  • Refactors __get_job_status logic to ensure proper status priority and correct handling of the Complete status
  • Adds get_job_progress() method to TrainerClient API for retrieving job progress information
  • Adds documentation example showing how to use options with TrainerClient

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
kubeflow/trainer/backends/localprocess/backend.py Refactors job status logic to prioritize Failed > Running > Created > Complete, with explicit check that all steps are Complete before returning Complete status
kubeflow/trainer/api/trainer_client.py Adds new get_job_progress() method to retrieve job status and progress information
docs/source/train/options.rst Adds documentation section with example of using options parameter with TrainerClient

Comment on lines +268 to +282
# Priority: Failed > Running > Created > Complete
if constants.TRAINJOB_FAILED in statuses:
status = constants.TRAINJOB_FAILED
elif constants.TRAINJOB_RUNNING in statuses:
status = constants.TRAINJOB_RUNNING
elif constants.TRAINJOB_CREATED in statuses:
status = constants.TRAINJOB_CREATED
else:
status = constants.TRAINJOB_COMPLETE
return constants.TRAINJOB_FAILED

if constants.TRAINJOB_RUNNING in statuses:
return constants.TRAINJOB_RUNNING
if constants.TRAINJOB_CREATED in statuses:
return constants.TRAINJOB_CREATED

# ✅ NEW FIX: Ensure all steps are actually complete
if all(status == constants.TRAINJOB_COMPLETE for status in statuses):
return constants.TRAINJOB_COMPLETE

return status
# fallback (safety)
return constants.TRAINJOB_RUNNING
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect indentation: comment lines 268, 277, and 281 are indented with 4 spaces, but should be indented with 8 spaces (aligned with the subsequent if statements). This will cause a Python IndentationError when the file is parsed.

Copilot uses AI. Check for mistakes.
Comment on lines +268 to +296
def get_job_progress(self, name: str) -> dict:
"""Get progress of a TrainJob.

Args:
name: Name of the TrainJob.

Returns:
Dictionary containing job status and progress.
"""

# Get job details
job = self.get_job(name=name)

status = job.status if hasattr(job, "status") else "Unknown"

if status == "Running":
progress = "In Progress"
elif status in ["Complete", "Succeeded"]:
progress = "100%"
elif status == "Failed":
progress = "Error"
else:
progress = "Unknown"

return {
"job_id": name,
"status": status,
"progress": progress,
}
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description states it fixes issue #367 by implementing the proposed update_runtime_status() API, but this PR adds get_job_progress() instead. The PR description does not mention this method. Clarify whether this is the intended implementation for issue #367 or if it is a separate enhancement.

Copilot uses AI. Check for mistakes.

if status == "Running":
progress = "In Progress"
elif status in ["Complete", "Succeeded"]:
Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 285 checks for status "Succeeded", but the TrainJob status constants only define "Created", "Running", "Complete", and "Failed". This check will never match. Use "Complete" instead if checking for successful completion.

Suggested change
elif status in ["Complete", "Succeeded"]:
elif status == "Complete":

Copilot uses AI. Check for mistakes.
Comment on lines +297 to 298


Copy link

Copilot AI Mar 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 297 contains trailing whitespace. Remove to follow code style conventions.

Suggested change

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add utility for TrainJob progress reporting

2 participants