What happened?
`TrainerClient.get_job()`, `delete_job()`, `get_job_logs()`, and `wait_for_job_status()` document `RuntimeError` for operational failures such as job-not-found. The Kubernetes backend follows this contract, but the LocalProcess and Container backends raise `ValueError` instead.
This means callers writing `except RuntimeError` to handle missing jobs will not catch the actual exception when using the local or container backends.
Steps to Reproduce
```python
from kubeflow.trainer.backends.localprocess.backend import LocalProcessBackend

class Cfg:
    auto_remove = False
    execution_dir = "/tmp"

backend = LocalProcessBackend(Cfg())

try:
    backend.get_job("nonexistent")
except RuntimeError:
    print("caught")  # never reached
except ValueError as e:
    print(f"BUG: {e}")  # "No TrainJob with name nonexistent"
```
The same happens with `delete_job`, `get_job_logs`, and `wait_for_job_status`.
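Until the backends are aligned, callers that need to work across all backends can catch both exception types. A minimal sketch, where `get_job` is a hypothetical stand-in mimicking the LocalProcess behavior (not the real SDK call):

```python
# Hypothetical stand-in for the backend call: mimics the LocalProcess
# behavior of raising ValueError instead of the documented RuntimeError.
def get_job(name):
    raise ValueError(f"No TrainJob with name {name}")

def get_job_safely(name):
    # Backend-agnostic handling: catch both exception types until the
    # backends converge on RuntimeError.
    try:
        return get_job(name)
    except (RuntimeError, ValueError) as e:
        return f"job lookup failed: {e}"

print(get_job_safely("nonexistent"))
```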
Affected locations
LocalProcess backend (`localprocess/backend.py`):
- `get_job` (line 172): `ValueError("No TrainJob with name {name}")`
- `get_job_logs` (line 201): same
- `wait_for_job_status` (line 227): same
- `delete_job` (line 256): same
- `get_runtime` (line 60): `ValueError("Runtime '{name}' not found.")`
Container backend (`container/backend.py`):
- `_get_job_containers` (line 509): `ValueError("No TrainJob with name {name}")`
- `__get_trainjob_from_containers` (lines 669, 674, 678, 687, 690): various `ValueError`s for metadata/network/runtime lookup failures
Kubernetes backend: correctly raises `RuntimeError` in all of these cases.
What did you expect to happen?
Operational failures (job not found, runtime not found) should raise `RuntimeError` consistently across all backends, matching the documented API contract.
Input validation errors (e.g. "CustomTrainer must be set", "polling_interval >= timeout") should remain `ValueError`.
Proposed Fix
Change `raise ValueError("No TrainJob with name ...")` to `raise RuntimeError(...)` in the affected methods across both backends. Keep `ValueError` for genuine input validation.
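A minimal sketch of the distinction the fix draws, using hypothetical names (`LocalProcessBackendSketch`, `_jobs`, and `create_job` are illustrations, not the real backend code):

```python
class LocalProcessBackendSketch:
    """Illustrative sketch only, not the real backend implementation."""

    def __init__(self):
        self._jobs = {}  # hypothetical internal job registry

    def create_job(self, name, spec):
        # Input validation: a malformed argument stays ValueError.
        if not name:
            raise ValueError("job name must be non-empty")
        self._jobs[name] = spec

    def get_job(self, name):
        if name not in self._jobs:
            # Before: raise ValueError(f"No TrainJob with name {name}")
            # After: operational failure, so RuntimeError per the contract.
            raise RuntimeError(f"No TrainJob with name {name}")
        return self._jobs[name]
```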
/kind bug
/area local