fix(examples): stabilize flaky E2E notebook completion #3366
neeraj542 wants to merge 4 commits into kubeflow:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED. Needs approval from an approver in each of these files. The full list of commands accepted by this bot can be found here.
🎉 Welcome to the Kubeflow Trainer! 🎉 Thanks for opening your first PR! We're happy to have you as part of our community 🚀
Feel free to ask questions in the comments if you need any help or clarification!
Pull request overview
This PR increases timeout values for job status checks and makes Kubernetes debug commands non-fatal in CI environments to handle slower cluster performance during CI runs. The changes consist of updating e2e test script behavior and example notebooks to provide more time for job completion.
Changes:
- Modified the e2e test script to add error handling (`|| true`) to kubectl commands and increased the kubectl wait timeout from 3 seconds to 60 seconds, with an explanatory comment
- Updated job status check timeouts in 7 example notebooks from 20 seconds to 300 seconds (5 minutes) to accommodate slower CI environments
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| hack/e2e-run-notebook.sh | Added error handling to kubectl commands and increased timeout from 3s to 60s for slower CI clusters |
| examples/xgboost/distributed-training/xgboost-distributed.ipynb | Increased wait_for_job_status timeout from 20s to 300s |
| examples/pytorch/speech-recognition/speech-recognition.ipynb | Increased wait_for_job_status timeout from 20s to 300s |
| examples/pytorch/question-answering/fine-tune-distilbert.ipynb | Increased wait_for_job_status timeout from 20s to 300s |
| examples/pytorch/image-classification/mnist.ipynb | Increased wait_for_job_status timeout from 20s to 300s |
| examples/pytorch/audio-classification/audio-classification.ipynb | Increased wait_for_job_status timeout from 20s to 300s |
| examples/local/local-training-mnist.ipynb | Increased wait_for_job_status timeout from 20s to 300s |
| examples/local/local-container-mnist.ipynb | Increased wait_for_job_status timeout from 20s to 300s |
  "outputs": [],
  "source": [
-    "client.wait_for_job_status(name=k8s_job_name, timeout=20)"
+    "client.wait_for_job_status(name=k8s_job_name, timeout=300)"
Signed-off-by: neeraj542 <mneeraj2133@gmail.com>
0e5de77 to
31bbef9
Compare
Hi @kuizhiqing, I'd really appreciate a review when you have time. I increased the completion waits in the notebooks used by test-e2e.yaml and also made the notebook debug kubectl wait non-fatal and less strict.
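The non-fatal pattern being described here (appending `|| true` so a failing debug command cannot abort a `set -e` script) can be illustrated with a small Python sketch; this is illustrative only, the real change lives in plain shell in hack/e2e-run-notebook.sh:

```python
import subprocess

# Sketch of the non-fatal debug pattern: `cmd || true` makes the whole
# pipeline succeed even when `cmd` fails, so a `set -e` script keeps
# running past a flaky kubectl debug command.
swallowed = subprocess.run(["sh", "-c", "false || true"])
print("non-fatal:", swallowed.returncode)  # 0: failure swallowed by `|| true`

# Without the guard, the same command returns non-zero, which would
# abort a `set -e` script at that line.
fatal = subprocess.run(["sh", "-c", "false"])
print("fatal under set -e:", fatal.returncode)  # 1
```

The same reasoning applies in reverse for the later review suggestion to use `|| exit 1` where a failure genuinely should stop the run.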
@neeraj542 I noticed we're increasing the timeout from 20s to 300s, which helps with slow clusters, but a few things stood out:
Curious what you think.
@Goku2099 Thanks for the feedback. I kept this PR intentionally simple to fix the flaky test quickly with minimal risk:
I agree the broader consistency updates and retry/polling improvements are good ideas, and I can do them in a follow-up PR.
I checked the failing 1.35.0 job logs; the failure comes from local-container-mnist.ipynb due to a transient/corrupted FashionMNIST download (RuntimeError: File not found or corrupted), not from the TrainJob completion timeout change. I'll add retry+cleanup logic around the dataset download in the local notebook example(s) to make this path robust.
Signed-off-by: neeraj542 <mneeraj2133@gmail.com>
I investigated the failing E2E Test (1.35.0) logs and found the failure in the local-container-mnist notebook: a transient/corrupted FashionMNIST download (RuntimeError: File not found or corrupted), followed by a DDP barrier failure. I pushed a targeted fix in this PR:
This keeps the scope minimal and directly addresses the observed 1.35.0 failure mode.
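The retry+cleanup approach described above could look roughly like this; `download_with_retry` and its signature are assumptions for illustration, not the notebook's exact code:

```python
import shutil
import time
from pathlib import Path

def download_with_retry(load_fn, data_dir="./data", attempts=3, delay=0.5):
    """Retry a dataset download, wiping the partially written directory
    between attempts so a corrupted archive is never reused.
    (Hypothetical helper sketching the notebook fix.)"""
    last_err = None
    for attempt in range(attempts):
        try:
            return load_fn()
        except RuntimeError as err:  # e.g. "File not found or corrupted"
            last_err = err
            # Drop partial/corrupted files before the next attempt.
            shutil.rmtree(Path(data_dir), ignore_errors=True)
            if attempt < attempts - 1:
                time.sleep(delay)
    raise last_err
```

The cleanup step matters as much as the retry: torchvision-style loaders skip re-downloading when files already exist, so a corrupted partial download would otherwise fail every subsequent attempt.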
Hi @andreyvelich, could you please review this PR again?
/ok-to-test
Hi @astefanutti, thanks for taking a look.
Hi @neeraj542, thanks for the PR #3366. I pulled the same root cause from the failing notebook artifacts earlier yesterday for the RuntimeError; the subsequent dist.barrier() errors are a cascade from the rank crash. One remaining hardening step I think is still relevant: the download guard is currently `if local_rank == 0:`. In this container backend, that may not be equivalent to "exactly one container downloads" when num_nodes > 1. A safer follow-up would be to gate downloads on `dist.get_rank() == 0` (global rank) to avoid concurrent writes to the same ./data path.
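The trade-off between the two guards can be sketched with the environment variables that torchrun-style launchers set (`LOCAL_RANK`, `RANK`); `should_download` is a hypothetical helper for illustration, not part of the Kubeflow Trainer SDK or the notebook:

```python
import os

def should_download(scope="node"):
    """Pick which process downloads the dataset.
    scope="node":   one downloader per node (local_rank == 0), appropriate
                    when each pod writes to its own ./data volume.
    scope="global": exactly one downloader overall (global rank == 0), only
                    safe when all ranks share the same storage path (e.g. a
                    shared PVC); otherwise non-zero nodes never get the data.
    (Illustrative helper, assumes torchrun-style env vars.)"""
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    global_rank = int(os.environ.get("RANK", "0"))
    return local_rank == 0 if scope == "node" else global_rank == 0
```

With per-pod storage, gating on global rank 0 would leave every other node without a dataset, which is why the discussion below settles on the `local_rank == 0` guard for this container backend.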
jiayuzhao05
left a comment
If this change just aims to reduce random failures of these specific notebooks on CI, I don't think there is a problem. If it is meant to systematically fix the flakiness of the notebook e2e tests, I don't think that has been fundamentally solved, because not all notebooks with the same timeout=20 have been updated uniformly.
@jiayuzhao05: changing LGTM is restricted to collaborators. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thanks all for the feedback. For this PR, I kept the scope focused on stabilizing the flaky notebook path seen in CI, and the full E2E matrix is now passing (including 1.35.0). I agree there are broader hardening options as well:
I can include these in this PR if maintainers prefer, or open a follow-up PR to keep this fix minimal and unblock the merge. cc: @astefanutti, @kuizhiqing
Hi @andreyvelich, I saw #3366 mentioned in #3325. Is there any dependency between the two?
Sridhar1030
left a comment
Looks good for the scoped fix. One thing I noticed in the local-container-mnist diff: the original code applied Normalize((0.1307,), (0.3081,)) only on the rank-0 download path, while the actual training dataset used bare ToTensor(). After this refactor both go through load_fashion_mnist_with_retry with the same transform, so this is also a quiet correctness fix worth noting in the PR description.
+1 to @jiayuzhao05 's point that the remaining 6 notebooks still on timeout=20 (mlx, jax, deepspeed, torchtune, data-cache) should be updated for consistency, either here or in a follow-up.
Signed-off-by: neeraj542 <mneeraj2133@gmail.com>
Pushed updates addressing the open review feedback in this PR:
  "outputs": [],
  "source": [
-    "TrainerClient().wait_for_job_status(name=job_id, timeout=20)"
+    "TrainerClient().wait_for_job_status(name=job_id, timeout=300)"
Do we still need to have such long timeout here?
Since we have this before, the TrainJob should be Complete by that cell:
for logline in TrainerClient().get_job_logs(job_id, follow=True):
print(logline)
> Do we still need to have such long timeout here? Since we have this before, the TrainJob should be Complete by that cell: `for logline in TrainerClient().get_job_logs(job_id, follow=True): print(logline)`

I reverted this notebook back to timeout=20 as suggested.
  " transforms.Normalize((0.1307,), (0.3081,)),\n",
  " ])\n",
  "\n",
  " def load_fashion_mnist_with_retry(train: bool, download: bool):\n",
Do we see that FashionMNIST is constantly failing and we require retries?
> Do we see that FashionMNIST is constantly failing and we require retries?

Not constantly; this was seen in the failing 1.35.0 CI run artifacts, and I kept the retry only in local-container-mnist as a targeted stabilization for that failure mode.
But we use the FashionMNIST dataset in other examples too; do we need to introduce retries to all of them?

> But we use the FashionMNIST dataset in other examples too; do we need to introduce retries to all of them?

I agree; I haven't found strong evidence of repeated FashionMNIST failures yet, so I'll remove the retries here.
  " transforms.Normalize((0.1307,), (0.3081,)),\n",
  " ]),\n",
  " )\n",
  " if dist.get_rank() == 0:\n",
Why did you check if dist.get_rank() == 0? Since we don't use a sharable PVC in these TrainJobs, we should download the dataset on every Kubernetes pod where local_rank == 0.

> Why did you check if dist.get_rank() == 0? Since we don't use a sharable PVC in these TrainJobs, we should download the dataset on every Kubernetes pod where local_rank == 0.

Updated back to local_rank == 0 for this container backend scenario.
hack/e2e-run-notebook.sh
Outdated
kubectl logs -n kubeflow-system -l app.kubernetes.io/name=trainer || true
# CI clusters can be a bit slower; debug output should not fail the whole run.
kubectl wait trainjob --for=condition=Complete --all --timeout 60s || true
Shall we error on failure?

Suggested change:
- kubectl logs -n kubeflow-system -l app.kubernetes.io/name=trainer || true
- # CI clusters can be a bit slower; debug output should not fail the whole run.
- kubectl wait trainjob --for=condition=Complete --all --timeout 60s || true
+ kubectl logs -n kubeflow-system -l app.kubernetes.io/name=trainer || exit 1
+ # CI clusters can be a bit slower; debug output should not fail the whole run.
+ kubectl wait trainjob --for=condition=Complete --all --timeout 60s || exit 1
> Shall we error on failure?

Applied; I updated both commands to fail explicitly with exit 1.
Hi guys, any updates here?

Hi @andreyvelich, I'm working on this and will push the required changes shortly.
Signed-off-by: neeraj542 <mneeraj2133@gmail.com>
Fixes #3365
What/Why
Notebook E2E tests were flaky because the notebooks verify completion using a short fixed timeout, which CI cluster scheduling/runtime variability can exceed.
This PR:
Test plan
Check JSON...........................................(no files to check)Skipped
Fix End of Files.....................................(no files to check)Skipped
Trim Trailing Whitespace.............................(no files to check)Skipped
isort................................................(no files to check)Skipped
black................................................(no files to check)Skipped
flake8...............................................(no files to check)Skipped
cargo fmt (data_cache)...............................(no files to check)Skipped
cargo check (data_cache).............................(no files to check)Skipped

Ran pre-commit on the modified files (CI gate equivalent for formatting/JSON checks).