fix(trainer): raise RuntimeError instead of ValueError in LocalProcess and Container backends#434
Conversation
|
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification! |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
…s and Container backends Signed-off-by: Vartika Manish <vartikamanish03@gmail.com>
946bb24 to
4706df1
Compare
There was a problem hiding this comment.
Pull request overview
This PR aligns exception semantics across Trainer backends by raising RuntimeError (instead of ValueError) for operational failures (e.g., job/runtime not found) in the LocalProcess and Container backends, matching the Kubernetes backend’s documented behavior.
Changes:
- LocalProcess backend: switch job/runtime “not found” errors from
ValueErrortoRuntimeError. - Container backend: switch job/runtime/network lookup operational failures from
ValueErrortoRuntimeError. - Update some related docstrings/raises to reflect the new exception type (incomplete in current diff).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| kubeflow/trainer/backends/localprocess/backend.py | Changes several operational failure paths to raise RuntimeError (but also changes one input-validation path incorrectly). |
| kubeflow/trainer/backends/container/backend.py | Changes job/container/runtime/network operational failure paths to raise RuntimeError, but leaves docstrings inconsistent and tests need updates. |
| _job = next((j for j in self.__local_jobs if j.name == name), None) | ||
| if _job is None: | ||
| raise ValueError(f"No TrainJob with name {name}") | ||
| raise RuntimeError(f"No TrainJob with name {name}") |
There was a problem hiding this comment.
This change updates job-not-found from ValueError to RuntimeError, but kubeflow/trainer/backends/localprocess/backend_test.py still asserts ValueError for missing jobs/logs/waits/deletes; update those tests to expect RuntimeError so they match the backend contract.
| Raises: | ||
| ValueError: If no containers found for the job | ||
| """ | ||
| filters = {"label": [f"{self.label_prefix}/trainjob-name={name}"]} | ||
| containers = self._adapter.list_containers(filters=filters) | ||
|
|
||
| if not containers: | ||
| raise ValueError(f"No TrainJob with name {name}") | ||
| raise RuntimeError(f"No TrainJob with name {name}") |
There was a problem hiding this comment.
The _get_job_containers docstring still documents ValueError when no containers are found, but the implementation now raises RuntimeError; update the docstring (and any referenced expectations) to keep documentation consistent with behavior.
|
|
||
| if not containers: | ||
| raise ValueError(f"No TrainJob with name {name}") | ||
| raise RuntimeError(f"No TrainJob with name {name}") |
There was a problem hiding this comment.
This change updates job-not-found to raise RuntimeError, but kubeflow/trainer/backends/container/backend_test.py still expects ValueError for a nonexistent job; update the test to match the new exception type so it validates the intended contract.
| @@ -672,28 +672,27 @@ | |||
| ValueError: If network metadata is missing or runtime not found | |||
There was a problem hiding this comment.
__get_trainjob_from_containers now raises RuntimeError for missing network metadata/runtime, but the docstring still lists ValueError; update the Raises section to reflect the actual exception type(s).
| ValueError: If network metadata is missing or runtime not found | |
| RuntimeError: If no containers are found, network metadata or network is missing, | |
| or the training runtime cannot be resolved for the job. |
|
|
||
| if polling_interval > timeout: | ||
| raise ValueError( | ||
| raise RuntimeError( |
There was a problem hiding this comment.
polling_interval > timeout is an input validation error (and KubernetesBackend raises ValueError for this), so this should remain ValueError rather than RuntimeError to preserve the documented contract for invalid arguments.
| raise RuntimeError( | |
| raise ValueError( |
Fixes #433
LocalProcess and Container backends were raising ValueError for
operational failures (job not found, runtime not found), inconsistent
with the Kubernetes backend which correctly raises RuntimeError.
Changes:
wait_for_job_status, delete_job
(updated raise statements and docstrings)
Input validation errors remain ValueError as intended.