Skip to content

fix(trainer): raise RuntimeError instead of ValueError in LocalProcess and Container backends#434

Open
Vartika222 wants to merge 1 commit intokubeflow:mainfrom
Vartika222:fix/backend-valueerror-to-runtimeerror
Open

fix(trainer): raise RuntimeError instead of ValueError in LocalProcess and Container backends#434
Vartika222 wants to merge 1 commit intokubeflow:mainfrom
Vartika222:fix/backend-valueerror-to-runtimeerror

Conversation

@Vartika222
Copy link
Copy Markdown

Fixes #433

LocalProcess and Container backends were raising ValueError for
operational failures (job not found, runtime not found), inconsistent
with the Kubernetes backend which correctly raises RuntimeError.

Changes:

  • localprocess/backend.py: get_runtime, get_job, get_job_logs,
    wait_for_job_status, delete_job
  • container/backend.py: _get_job_containers, __get_trainjob_from_containers
    (updated raise statements and docstrings)

Input validation errors remain ValueError as intended.

Copilot AI review requested due to automatic review settings March 29, 2026 17:43
@github-actions
Copy link
Copy Markdown
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@google-oss-prow
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

…s and Container backends

Signed-off-by: Vartika Manish <vartikamanish03@gmail.com>
@Vartika222 Vartika222 force-pushed the fix/backend-valueerror-to-runtimeerror branch from 946bb24 to 4706df1 Compare March 29, 2026 17:45
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aligns exception semantics across Trainer backends by raising RuntimeError (instead of ValueError) for operational failures (e.g., job/runtime not found) in the LocalProcess and Container backends, matching the Kubernetes backend’s documented behavior.

Changes:

  • LocalProcess backend: switch job/runtime “not found” errors from ValueError to RuntimeError.
  • Container backend: switch job/runtime/network lookup operational failures from ValueError to RuntimeError.
  • Update some related docstrings/raises to reflect the new exception type (incomplete in current diff).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
kubeflow/trainer/backends/localprocess/backend.py Changes several operational failure paths to raise RuntimeError (but also changes one input-validation path incorrectly).
kubeflow/trainer/backends/container/backend.py Changes job/container/runtime/network operational failure paths to raise RuntimeError, but leaves docstrings inconsistent and tests need updates.

_job = next((j for j in self.__local_jobs if j.name == name), None)
if _job is None:
raise ValueError(f"No TrainJob with name {name}")
raise RuntimeError(f"No TrainJob with name {name}")
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change updates job-not-found from ValueError to RuntimeError, but kubeflow/trainer/backends/localprocess/backend_test.py still asserts ValueError for missing jobs/logs/waits/deletes; update those tests to expect RuntimeError so they match the backend contract.

Copilot uses AI. Check for mistakes.
Comment on lines 503 to +510
Raises:
ValueError: If no containers found for the job
"""
filters = {"label": [f"{self.label_prefix}/trainjob-name={name}"]}
containers = self._adapter.list_containers(filters=filters)

if not containers:
raise ValueError(f"No TrainJob with name {name}")
raise RuntimeError(f"No TrainJob with name {name}")
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _get_job_containers docstring still documents ValueError when no containers are found, but the implementation now raises RuntimeError; update the docstring (and any referenced expectations) to keep documentation consistent with behavior.

Copilot uses AI. Check for mistakes.

if not containers:
raise ValueError(f"No TrainJob with name {name}")
raise RuntimeError(f"No TrainJob with name {name}")
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change updates job-not-found to raise RuntimeError, but kubeflow/trainer/backends/container/backend_test.py still expects ValueError for a nonexistent job; update the test to match the new exception type so it validates the intended contract.

Copilot uses AI. Check for mistakes.
@@ -672,28 +672,27 @@
ValueError: If network metadata is missing or runtime not found
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

__get_trainjob_from_containers now raises RuntimeError for missing network metadata/runtime, but the docstring still lists ValueError; update the Raises section to reflect the actual exception type(s).

Suggested change
ValueError: If network metadata is missing or runtime not found
RuntimeError: If no containers are found, network metadata or network is missing,
or the training runtime cannot be resolved for the job.

Copilot uses AI. Check for mistakes.

if polling_interval > timeout:
raise ValueError(
raise RuntimeError(
Copy link

Copilot AI Mar 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

polling_interval > timeout is an input validation error (and KubernetesBackend raises ValueError for this), so this should remain ValueError rather than RuntimeError to preserve the documented contract for invalid arguments.

Suggested change
raise RuntimeError(
raise ValueError(

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(trainer): LocalProcess and Container backends raise ValueError instead of RuntimeError for job-not-found

2 participants