fix(trainer): validate polling_interval strictly less than timeout (#400) #417
debojit11070 wants to merge 2 commits into kubeflow:main
Conversation
…ubeflow#400)
- Changed `polling_interval > timeout` to `>=` across all three backends
- Kubernetes: fixed validation to reject equal values
- Container: added the previously missing `polling_interval` validation
- LocalProcess: fixed validation to reject equal values
- Container tests: added a test case for the `polling_interval >= timeout` scenario

Fixes: kubeflow#400
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files. Approvers can indicate their approval by writing
🎉 Welcome to the Kubeflow SDK! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀 Here's what happens next:
Join the community:
Feel free to ask questions in the comments if you need any help or clarification!
Pull request overview
This PR fixes an inconsistency in wait_for_job_status() validation across Trainer backends by enforcing that polling_interval must be strictly less than timeout, preventing “single poll then exit” behavior when the values are equal.
Changes:
- Tightened validation from `polling_interval > timeout` to `polling_interval >= timeout` in the Kubernetes and LocalProcess backends.
- Added the missing `polling_interval >= timeout` validation in the Container backend.
- Added a Container backend test case for the `polling_interval >= timeout` error path.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| kubeflow/trainer/backends/localprocess/backend.py | Enforces polling_interval < timeout and updates the error message. |
| kubeflow/trainer/backends/kubernetes/backend.py | Enforces polling_interval < timeout and updates the error message. |
| kubeflow/trainer/backends/container/backend.py | Adds missing polling_interval < timeout validation to container polling loop. |
| kubeflow/trainer/backends/container/backend_test.py | Adds a test case covering polling_interval >= timeout. |
| .gitignore | Ignores a local development notes file. |
```diff
 if polling_interval >= timeout:
     raise ValueError(
-        f"Polling interval {polling_interval} must be less than timeout: {timeout}"
+        f"polling_interval ({polling_interval}) must be strictly less than "
+        f"timeout ({timeout})"
     )

 for _ in range(round(timeout / polling_interval)):
```
The new polling_interval >= timeout guard still allows polling_interval <= 0 (and/or timeout <= 0), which will lead to a ZeroDivisionError in round(timeout / polling_interval) or nonsensical waiting behavior; validate both values are positive before doing the division and sleeping.
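A minimal sketch of the guard this comment suggests; the helper name and message wording are illustrative, not taken from the SDK:

```python
def validate_polling_args(polling_interval: float, timeout: float) -> None:
    """Reject values that would break round(timeout / polling_interval)
    or produce nonsensical waiting behavior."""
    if polling_interval <= 0:
        raise ValueError(f"polling_interval ({polling_interval}) must be positive")
    if timeout <= 0:
        raise ValueError(f"timeout ({timeout}) must be positive")
    if polling_interval >= timeout:
        raise ValueError(
            f"polling_interval ({polling_interval}) must be strictly less than "
            f"timeout ({timeout})"
        )
```

Checking positivity before the relational comparison also gives the caller a more specific error message for each failure mode.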
```diff
 if polling_interval >= timeout:
     raise ValueError(
-        f"Polling interval {polling_interval} must be less than timeout: {timeout}"
+        f"polling_interval ({polling_interval}) must be strictly less than "
+        f"timeout ({timeout})"
     )

 for _ in range(round(timeout / polling_interval)):
```
This validation should also reject polling_interval <= 0 (and likely timeout <= 0) because the subsequent round(timeout / polling_interval) will raise ZeroDivisionError or behave incorrectly; add an explicit positive-value check before the division/sleep loop.
```diff
+if polling_interval >= timeout:
+    raise ValueError(
+        f"polling_interval ({polling_interval}) must be strictly less than "
+        f"timeout ({timeout})"
+    )

 end = time.time() + timeout
```
Consider validating polling_interval > 0 (and timeout > 0) here as well: polling_interval=0 currently causes a tight loop (no sleep) until timeout, which can hammer the API and CPU; raising ValueError for non-positive values avoids this.
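One way the time-based Container loop could incorporate this suggestion; the function and `check` callback are illustrative stand-ins for the backend's status probe, not the SDK's actual API:

```python
import time


def wait_until(check, timeout: float, polling_interval: float) -> bool:
    """Poll check() until it returns True or timeout elapses.

    Validating both values as positive up front guarantees the loop
    sleeps between probes instead of spinning on the CPU when
    polling_interval is 0 (or looping forever for negative values).
    """
    if polling_interval <= 0 or timeout <= 0:
        raise ValueError("polling_interval and timeout must be positive")
    if polling_interval >= timeout:
        raise ValueError("polling_interval must be strictly less than timeout")
    end = time.time() + timeout
    while time.time() < end:
        if check():
            return True
        time.sleep(polling_interval)
    return False
```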
```python
elif test_case.name == "polling interval >= timeout error":
    container_backend.wait_for_job_status(
        job_name,
        timeout=test_case.config["timeout"],
        polling_interval=test_case.config["polling_interval"],
    )
```
In this test (and the other error branches), if wait_for_job_status unexpectedly does not raise, the test will still pass; use pytest.raises(...) (or an explicit pytest.fail after the call) to ensure the expected exception is actually raised.
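A sketch of the pattern this comment asks for; the backend method is replaced here by a trivial stand-in, since only the assertion shape matters:

```python
import pytest


def wait_for_job_status(job_name, timeout=600, polling_interval=2):
    """Stand-in for the real backend method, used only to show the pattern."""
    if polling_interval >= timeout:
        raise ValueError("polling_interval must be strictly less than timeout")


def test_polling_interval_ge_timeout_raises():
    # pytest.raises fails the test if no ValueError is raised,
    # unlike a bare call that silently passes when nothing is thrown.
    with pytest.raises(ValueError, match="strictly less"):
        wait_for_job_status("job", timeout=30, polling_interval=30)
```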
…timeout
- Validates polling_interval > 0 and timeout > 0 in the LocalProcess backend
- Validates polling_interval > 0 and timeout > 0 in the Kubernetes backend
- Validates polling_interval > 0 and timeout > 0 in the Container backend
- Uses pytest.raises() in container backend tests to properly assert exceptions

This prevents ZeroDivisionError and tight loops when non-positive values are provided.
Fixes #400
Changes
Testing