Skip to content

Fix/335 clusterruntime retrieval#423

Open
debojit11070 wants to merge 6 commits intokubeflow:mainfrom
debojit11070:fix/335-clusterruntime-retrieval
Open

Fix/335 clusterruntime retrieval#423
debojit11070 wants to merge 6 commits intokubeflow:mainfrom
debojit11070:fix/335-clusterruntime-retrieval

Conversation

@debojit11070
Copy link
Copy Markdown

feat(trainer): Add utility for TrainJob progress reporting (#367)

Description

This PR implements a user-friendly progress reporting utility for TrainJob monitoring, addressing issue #367. Users can now easily track training job progress with completion percentages, step details, and pod health status.

Approach: Hybrid implementation combining a structured dataclass (Approach C) for data representation with an instance method (Approach B) for convenient API access.

What's Changed

New Types (kubeflow/trainer/types/types.py)

  • Added JobProgress dataclass with the following metrics:

    • job_name: The name of the TrainJob
    • overall_status: Current job status (Running, Complete, Failed, etc.)
    • total_steps / completed_steps: Step progress tracking
    • running_steps / failed_steps: Detailed step breakdown
    • healthy_pods / total_pods: Pod health status
    • completion_percentage: Overall progress as percentage (0-100)
  • Added from_job() class method to extract progress from TrainJob objects

  • Added __str__() method for human-readable progress summaries

New API (kubeflow/trainer/api/trainer_client.py)

  • Added TrainerClient.get_job_progress(name: str) -> JobProgress method
  • Provides convenient access to progress information with full docstring and usage examples

Tests

  • types_test.py: 7 parametrized tests covering various job states

    • All steps running
    • Some steps completed, some running
    • All steps completed
    • Some steps failed
    • Empty job (no steps)
    • String representation test
  • trainer_client_test.py: 2 integration tests

    • Progress object creation and field validation
    • Human-readable output formatting

Usage Example

from kubeflow.trainer import TrainerClient

client = TrainerClient()

# Get current job progress
progress = client.get_job_progress("my-training-job")
print(progress)

# Output:
# Job: my-training-job
# Status: Running
# Progress: 50.0% (1/2 steps)
# Pods: 2/2 healthy
# Running steps: data-loading, training
# Failed steps: validation

…ubeflow#400)

- Changed polling_interval > timeout to >= across all three backends
- Kubernetes: Fixed validation to reject equal values
- Container: Added completely missing polling_interval validation
- LocalProcess: Fixed validation to reject equal values
- Container tests: Added test case for polling_interval >= timeout scenario

Fixes: kubeflow#400
…timeout

- Validates polling_interval > 0 and timeout > 0 in LocalProcess backend
- Validates polling_interval > 0 and timeout > 0 in Kubernetes backend
- Validates polling_interval > 0 and timeout > 0 in Container backend
- Uses pytest.raises() in container backend tests to properly assert exceptions

This prevents ZeroDivisionError and tight loops when non-positive values are provided.
…kubeflow#335)

- Enhanced mock for get_cluster_custom_object_response to properly handle 404 and 403 errors
- Updated test case for non-existent runtimes to expect RuntimeError instead of success
- Added test case for cluster-only runtime retrieval (exists in cluster, not in namespace)
- Better simulation of real Kubernetes API behavior for cluster-scoped resources
…untimes (kubeflow#335)

- Add RuntimeScope enum with NAMESPACE and CLUSTER values
- Update Runtime dataclass to include scope field with default NAMESPACE value
- Update KubernetesBackend to set scope based on resource type (TrainingRuntime vs ClusterTrainingRuntime)
- Update test helpers and test cases to verify scope is correctly set
- Resolves issue where users couldn't distinguish between TrainingRuntime and ClusterTrainingRuntime
)

- Add JobProgress dataclass to types.py with from_job() factory method
- Implement progress calculation: completion_percentage, running/failed steps, pod health
- Add TrainerClient.get_job_progress(name) method for easy API access
- Include __str__() method for human-readable progress summaries
- Add comprehensive unit tests covering various job states
- All tests pass, code passes ruff linting and formatting checks

Example usage:
  client = TrainerClient()
  progress = client.get_job_progress('my-job')
  print(progress)
  # Output:
  # Job: my-job
  # Status: Running
  # Progress: 50.0% (1/2 steps)
  # Pods: 2/2 healthy
  # Running steps: training
Copilot AI review requested due to automatic review settings March 24, 2026 15:16
@google-oss-prow
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign andreyvelich for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Copy link
Copy Markdown
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds a progress-reporting data model/API for TrainJob monitoring and refines runtime retrieval behavior by distinguishing namespace-scoped vs cluster-scoped runtimes.

Changes:

  • Introduce JobProgress (computed from a TrainJob) and TrainerClient.get_job_progress() for user-facing progress summaries.
  • Add RuntimeScope and plumb Runtime.scope through Kubernetes runtime parsing/tests to reflect namespaced vs cluster runtimes.
  • Tighten wait_for_job_status() parameter validation (e.g., polling_interval >= timeout, positive values) across backends and adjust tests accordingly.

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
kubeflow/trainer/types/types.py Adds RuntimeScope, extends Runtime with scope, and introduces JobProgress.
kubeflow/trainer/types/types_test.py Adds unit tests for JobProgress.from_job() and string formatting.
kubeflow/trainer/api/trainer_client.py Adds TrainerClient.get_job_progress() convenience API.
kubeflow/trainer/api/trainer_client_test.py Adds tests validating get_job_progress() output/format.
kubeflow/trainer/backends/kubernetes/backend.py Adds polling/timeout validation and sets Runtime.scope based on CR kind.
kubeflow/trainer/backends/kubernetes/backend_test.py Expands runtime retrieval tests (404/403 cases; cluster-only runtime; scope assertions).
kubeflow/trainer/backends/localprocess/backend.py Adds polling/timeout validation in wait_for_job_status().
kubeflow/trainer/backends/container/backend.py Adds polling/timeout validation in wait_for_job_status().
kubeflow/trainer/backends/container/backend_test.py Adds a new validation test case and refactors wait_for_job_status tests.
.gitignore Ignores local development notes file ISSUES_TO_SOLVE.md.

Comment on lines +941 to +943
container_backend.wait_for_job_status(
job_name, status={test_case.config["wait_status"]}, timeout=5, polling_interval=1
)
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the "job fails" test case, wait_for_job_status() is invoked without asserting the expected RuntimeError, so the test will error out instead of validating the failure path; wrap this call in pytest.raises(test_case.expected_error) (and optionally assert the message).

Suggested change
container_backend.wait_for_job_status(
job_name, status={test_case.config["wait_status"]}, timeout=5, polling_interval=1
)
with pytest.raises(test_case.expected_error):
container_backend.wait_for_job_status(
job_name,
status={test_case.config["wait_status"]},
timeout=5,
polling_interval=1,
)

Copilot uses AI. Check for mistakes.
f"must both be positive"
)

for _ in range(round(timeout / polling_interval)):
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait_for_job_status() uses range(round(timeout / polling_interval)), which can both undershoot (timeout=5,polling_interval=4 -> only 1 poll) and overshoot the requested timeout; prefer a time.monotonic()-based loop (like ContainerBackend) or compute iterations with math.ceil and track remaining time explicitly.

Copilot uses AI. Check for mistakes.
f"must both be positive"
)

for _ in range(round(timeout / polling_interval)):
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wait_for_job_status() uses range(round(timeout / polling_interval)), which can both undershoot and overshoot the requested timeout (e.g., timeout=5,polling_interval=4 -> only 1 poll); consider using a time.monotonic()-based loop or math.ceil with explicit remaining-time checks.

Copilot uses AI. Check for mistakes.
Comment on lines +233 to +237
def get_job_progress(self, name: str) -> types.JobProgress:
"""Get progress information for a TrainJob.

Provides a human-readable summary of the job's progress including
completion percentage, running steps, and pod health status.
Copy link

Copilot AI Mar 24, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR description/linked issue focuses on a runtime-side utility to report trainerStatus (e.g., update_runtime_status() POSTing updates), but this change adds a client-side retrieval API (get_job_progress) plus runtime scoping; either update the PR description to match the delivered functionality or add the reporting utility described in #367.

Copilot uses AI. Check for mistakes.
…SOLVE.md to repository

- Remove Issue kubeflow#403 (active_deadline_seconds) from ISSUES_TO_SOLVE.md as it won't be worked on for now
- Remove ISSUES_TO_SOLVE.md from .gitignore to make it a tracked file in the repository
- Keep completed issues (kubeflow#400, kubeflow#335, kubeflow#367) documented for reference
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants