Skip to content

bug(trainer): LocalProcess and Container backends raise ValueError instead of RuntimeError for job-not-found #433

@YassinNouh21

Description

@YassinNouh21

What happened?

TrainerClient.get_job(), delete_job(), get_job_logs(), and wait_for_job_status() document RuntimeError for operational failures like job-not-found. The Kubernetes backend follows this contract, but LocalProcess and Container backends raise ValueError instead.

This means callers writing except RuntimeError to handle missing jobs will not catch the actual exception when using local or container backends.

Steps to Reproduce

from kubeflow.trainer.backends.localprocess.backend import LocalProcessBackend

class Cfg:
    auto_remove = False
    execution_dir = "/tmp"

backend = LocalProcessBackend(Cfg())

try:
    backend.get_job("nonexistent")
except RuntimeError:
    print("caught")  # never reached
except ValueError as e:
    print(f"BUG: {e}")  # "No TrainJob with name nonexistent"

Same happens with delete_job, get_job_logs, wait_for_job_status.

Affected locations

LocalProcess backend (localprocess/backend.py):

  • get_job (line 172): ValueError("No TrainJob with name {name}")
  • get_job_logs (line 201): same
  • wait_for_job_status (line 227): same
  • delete_job (line 256): same
  • get_runtime (line 60): ValueError("Runtime '{name}' not found.")

Container backend (container/backend.py):

  • _get_job_containers (line 509): ValueError("No TrainJob with name {name}")
  • __get_trainjob_from_containers (lines 669, 674, 678, 687, 690): various ValueError for metadata/network/runtime lookup failures

Kubernetes backend: correctly raises RuntimeError for all of these.

What did you expect to happen?

Operational failures (job not found, runtime not found) should raise RuntimeError consistently across all backends, matching the documented API contract.

Input validation errors (e.g. "CustomTrainer must be set", "polling_interval >= timeout") should remain ValueError.

Proposed Fix

Change raise ValueError("No TrainJob with name ...") to raise RuntimeError(...) in the affected methods across both backends. Keep ValueError for actual input validation.

/kind bug
/area local

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions