Skip to content

fix(trainer): validate polling_interval strictly less than timeout (#400)#417

Open
debojit11070 wants to merge 2 commits intokubeflow:mainfrom
debojit11070:main
Open

fix(trainer): validate polling_interval strictly less than timeout (#400)#417
debojit11070 wants to merge 2 commits intokubeflow:mainfrom
debojit11070:main

Conversation

@debojit11070
Copy link
Copy Markdown

Fixes #400

Changes

  • Changed polling_interval > timeout to >= across all three backends
  • Kubernetes: Fixed validation to reject equal values
  • Container: Added completely missing polling_interval validation
  • LocalProcess: Fixed validation to reject equal values
  • Container tests: Added test case for polling_interval >= timeout scenario

Testing

  • ✅ Ruff linting: All checks passed
  • ✅ Ruff formatting: All files properly formatted
  • ✅ Tests: New test case added

…ubeflow#400)

- Changed polling_interval > timeout to >= across all three backends
- Kubernetes: Fixed validation to reject equal values
- Container: Added completely missing polling_interval validation
- LocalProcess: Fixed validation to reject equal values
- Container tests: Added test case for polling_interval >= timeout scenario

Fixes: kubeflow#400
Copilot AI review requested due to automatic review settings March 23, 2026 18:10
@google-oss-prow
Copy link
Copy Markdown
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign electronic-waste for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions
Copy link
Copy Markdown
Contributor

🎉 Welcome to the Kubeflow SDK! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards
  • Our team will review your PR soon! cc @kubeflow/kubeflow-sdk-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes an inconsistency in wait_for_job_status() validation across Trainer backends by enforcing that polling_interval must be strictly less than timeout, preventing “single poll then exit” behavior when the values are equal.

Changes:

  • Tightened validation from polling_interval > timeout to polling_interval >= timeout in Kubernetes and LocalProcess backends.
  • Added missing polling_interval >= timeout validation in the Container backend.
  • Added a Container backend test case for the polling_interval >= timeout error path.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
kubeflow/trainer/backends/localprocess/backend.py Enforces polling_interval < timeout and updates the error message.
kubeflow/trainer/backends/kubernetes/backend.py Enforces polling_interval < timeout and updates the error message.
kubeflow/trainer/backends/container/backend.py Adds missing polling_interval < timeout validation to container polling loop.
kubeflow/trainer/backends/container/backend_test.py Adds a test case covering polling_interval >= timeout.
.gitignore Ignores a local development notes file.

Comment on lines +229 to 235
if polling_interval >= timeout:
raise ValueError(
f"Polling interval {polling_interval} must be less than timeout: {timeout}"
f"polling_interval ({polling_interval}) must be strictly less than "
f"timeout ({timeout})"
)

for _ in range(round(timeout / polling_interval)):
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new polling_interval >= timeout guard still allows polling_interval <= 0 (and/or timeout <= 0), which will lead to a ZeroDivisionError in round(timeout / polling_interval) or nonsensical waiting behavior; validate both values are positive before doing the division and sleeping.

Copilot uses AI. Check for mistakes.
Comment on lines +463 to 469
if polling_interval >= timeout:
raise ValueError(
f"Polling interval {polling_interval} must be less than timeout: {timeout}"
f"polling_interval ({polling_interval}) must be strictly less than "
f"timeout ({timeout})"
)

for _ in range(round(timeout / polling_interval)):
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation should also reject polling_interval <= 0 (and likely timeout <= 0) because the subsequent round(timeout / polling_interval) will raise ZeroDivisionError or behave incorrectly; add an explicit positive-value check before the division/sleep loop.

Copilot uses AI. Check for mistakes.
Comment on lines +819 to 825
if polling_interval >= timeout:
raise ValueError(
f"polling_interval ({polling_interval}) must be strictly less than "
f"timeout ({timeout})"
)

end = time.time() + timeout
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider validating polling_interval > 0 (and timeout > 0) here as well: polling_interval=0 currently causes a tight loop (no sleep) until timeout, which can hammer the API and CPU; raising ValueError for non-positive values avoids this.

Copilot uses AI. Check for mistakes.
Comment on lines +945 to +950
elif test_case.name == "polling interval >= timeout error":
container_backend.wait_for_job_status(
job_name,
timeout=test_case.config["timeout"],
polling_interval=test_case.config["polling_interval"],
)
Copy link

Copilot AI Mar 23, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this test (and the other error branches), if wait_for_job_status unexpectedly does not raise, the test will still pass; use pytest.raises(...) (or an explicit pytest.fail after the call) to ensure the expected exception is actually raised.

Copilot uses AI. Check for mistakes.
…timeout

- Validates polling_interval > 0 and timeout > 0 in LocalProcess backend
- Validates polling_interval > 0 and timeout > 0 in Kubernetes backend
- Validates polling_interval > 0 and timeout > 0 in Container backend
- Uses pytest.raises() in container backend tests to properly assert exceptions

This prevents ZeroDivisionError and tight loops when non-positive values are provided.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug(trainer): wait_for_job_status silently allows polling_interval == timeout

2 participants