
fix(examples): stabilize flaky E2E notebook completion #3366

Open
neeraj542 wants to merge 4 commits into kubeflow:master from neeraj542:fix/notebook-e2e-flaky-completion

Conversation

@neeraj542

@neeraj542 neeraj542 commented Mar 18, 2026

Fixes #3365

What/Why

Notebook E2E tests in test-e2e.yaml were flaky because the notebooks verify completion using a short fixed timeout (timeout=20), which CI cluster scheduling/runtime variability can exceed.

This PR:

  • Increases the "verify TrainJob completion" timeout from 20s to 300s in all notebooks executed by test-e2e.yaml.
  • Makes the post-Papermill debug kubectl wait non-fatal (and increases its timeout from 3s to 60s) so slow clusters don't fail the whole run during log collection.

Test plan

  • Ran the pre-commit checks on the modified files (CI gate equivalent for formatting/JSON checks); every hook reported "(no files to check) Skipped":

    Check Yaml...........................................(no files to check)Skipped
    Check JSON...........................................(no files to check)Skipped
    Fix End of Files.....................................(no files to check)Skipped
    Trim Trailing Whitespace.............................(no files to check)Skipped
    isort................................................(no files to check)Skipped
    black................................................(no files to check)Skipped
    flake8...............................................(no files to check)Skipped
    cargo fmt (data_cache)...............................(no files to check)Skipped
    cargo check (data_cache).............................(no files to check)Skipped

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@neeraj542 neeraj542 force-pushed the fix/notebook-e2e-flaky-completion branch from 90ee3ee to 0e5de77 Compare March 18, 2026 18:41
@neeraj542 neeraj542 marked this pull request as ready for review March 18, 2026 18:42
Copilot AI review requested due to automatic review settings March 18, 2026 18:42
Contributor

Copilot AI left a comment


Pull request overview

This PR increases timeout values for job status checks and makes Kubernetes debug commands non-fatal in CI environments to handle slower cluster performance during CI runs. The changes consist of updating e2e test script behavior and example notebooks to provide more time for job completion.

Changes:

  • Modified e2e test script to add error handling (|| true) to kubectl commands and increased kubectl wait timeout from 3 seconds to 60 seconds with explanatory comment
  • Updated job status check timeouts in 7 example notebooks from 20 seconds to 300 seconds (5 minutes) to accommodate slower CI environments

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Summary per file:

  • hack/e2e-run-notebook.sh: Added error handling to kubectl commands and increased timeout from 3s to 60s for slower CI clusters
  • examples/xgboost/distributed-training/xgboost-distributed.ipynb: Increased wait_for_job_status timeout from 20s to 300s
  • examples/pytorch/speech-recognition/speech-recognition.ipynb: Increased wait_for_job_status timeout from 20s to 300s
  • examples/pytorch/question-answering/fine-tune-distilbert.ipynb: Increased wait_for_job_status timeout from 20s to 300s
  • examples/pytorch/image-classification/mnist.ipynb: Increased wait_for_job_status timeout from 20s to 300s
  • examples/pytorch/audio-classification/audio-classification.ipynb: Increased wait_for_job_status timeout from 20s to 300s
  • examples/local/local-training-mnist.ipynb: Increased wait_for_job_status timeout from 20s to 300s
  • examples/local/local-container-mnist.ipynb: Increased wait_for_job_status timeout from 20s to 300s


"outputs": [],
"source": [
"client.wait_for_job_status(name=k8s_job_name, timeout=20)"
"client.wait_for_job_status(name=k8s_job_name, timeout=300)"
Signed-off-by: neeraj542 <mneeraj2133@gmail.com>
@neeraj542 neeraj542 force-pushed the fix/notebook-e2e-flaky-completion branch from 0e5de77 to 31bbef9 Compare March 18, 2026 18:47
@neeraj542
Author

neeraj542 commented Mar 18, 2026

Hi @kuizhiqing, I’d really appreciate a review when you have time.
I made this PR to fix #3365 because the notebook E2E checks were flaky: the notebooks used a very short fixed timeout for “TrainJob completion”, and CI clusters can be slower occasionally.

I increased those completion waits in the notebooks used by test-e2e.yaml and also made the notebook debug kubectl wait non-fatal/less strict.
If you can, could you please check whether the timeouts are now robust and that the change is safe for the E2E runs? Thanks!

@Goku2099
Contributor

@neeraj542
Nice improvement addressing CI flakiness!

I noticed we're increasing the timeout from 20s to 300s, which helps with slow clusters, but a few things stood out:

  1. There are still other notebooks using wait_for_job_status(..., timeout=20). Should we update them as well for consistency?
  2. Using a fixed 300s timeout works, but would a retry/polling mechanism be more reliable instead of waiting the full duration?
  3. In e2e-run-notebook.sh, making kubectl wait non-fatal (|| true) avoids CI failures, but it might hide real issues. Maybe we can log a warning when it times out?

Curious what you think.
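On the retry/polling question: wait_for_job_status already returns as soon as the job reaches the target status, so timeout=300 is only an upper bound, not a fixed wait. If per-attempt logging is the goal, a generic poller is easy to sketch; wait_until and check below are hypothetical names, not part of the Trainer SDK:

```python
import time


def wait_until(check, timeout=300.0, interval=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `check()` until it returns True or `timeout` seconds elapse.

    Returns True on success and False on timeout, so a caller can log a
    warning per attempt instead of failing opaquely after one long wait.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if check():
            return True
        # Never sleep past the deadline.
        sleep(min(interval, max(0.0, deadline - clock())))
    return False
```

A caller could pass `check=lambda: client.get_job(name).status == "Complete"` (names illustrative) and log on each False result, which addresses the "silent long wait" concern without changing the overall timeout.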

@neeraj542
Author

neeraj542 commented Mar 19, 2026

@Goku2099 thanks for the feedback! I kept this PR intentionally simple to fix the flaky test quickly with minimal risk:

  • updated only the notebooks used by test-e2e.yaml,
  • increased the completion wait from 20s to 300s, and
  • made the debug kubectl wait non-fatal so temporary cluster slowness does not fail the whole E2E run.

I agree the broader consistency updates and retry/polling improvement are good ideas, and can do them in a follow-up PR.

@neeraj542
Author

neeraj542 commented Mar 19, 2026

I have checked the failing 1.35.0 job logs; the failure comes from local-container-mnist.ipynb due to a transient/corrupted FashionMNIST download (RuntimeError: File not found or corrupted), not from the TrainJob completion timeout change. I'll add retry+cleanup logic around the dataset download in the local notebook example(s) to make this path robust.

Signed-off-by: neeraj542 <mneeraj2133@gmail.com>
@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Mar 19, 2026
@neeraj542
Author

I investigated the failing E2E Test (1.35.0) logs and found that the failure occurs in the local-container-mnist notebook due to a transient/corrupted FashionMNIST download (RuntimeError: File not found or corrupted), followed by a DDP barrier failure.

I pushed a targeted fix in this PR:

  • added retry logic (3 attempts) for FashionMNIST download in examples/local/local-container-mnist.ipynb
  • cleanup of ./data/FashionMNIST/raw before retry when corruption is detected

This keeps the scope minimal and directly addresses the observed 1.35.0 failure mode.
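The retry+cleanup approach described above can be sketched generically. Here download_fn and cleanup_dir are stand-ins (hypothetical names, not the notebook's exact code) for the torchvision FashionMNIST download call and the ./data/FashionMNIST/raw directory:

```python
import shutil
import time
from pathlib import Path


def download_with_retry(download_fn, cleanup_dir, attempts=3,
                        delay=2.0, sleep=time.sleep):
    """Call `download_fn()` up to `attempts` times.

    On RuntimeError (the error torchvision raises for a missing or
    corrupted archive), wipe `cleanup_dir` so the next attempt starts
    from a clean slate instead of re-reading a partial download.
    """
    last_err = None
    for attempt in range(1, attempts + 1):
        try:
            return download_fn()
        except RuntimeError as err:
            last_err = err
            # Remove any partially written files before retrying.
            shutil.rmtree(Path(cleanup_dir), ignore_errors=True)
            if attempt < attempts:
                sleep(delay)
    raise last_err
```

Usage in the notebook would look roughly like `download_with_retry(lambda: FashionMNIST("./data", download=True), "./data/FashionMNIST/raw")`, with the cleanup step being what distinguishes this from a plain retry loop.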

@neeraj542
Author

Hi @andreyvelich, could you please review this PR again?
If it looks good, I’d appreciate your help to move it toward merge. Thanks!

@astefanutti
Contributor

/ok-to-test

@neeraj542
Author

Hi @astefanutti, thanks for taking a look.
CI is now green for the notebooks E2E (including 1.35.0) and the PR is ready for your final review/approval; could you please reply with your review or /approve?

@xikronz

xikronz commented Mar 20, 2026

hi @neeraj542, thanks for PR #3366. I pulled the same root cause for the RuntimeError from the failing notebook artifacts yesterday; the subsequent dist.barrier() errors are a cascade from the rank crash.

One remaining hardening I think is still relevant: the download guard is currently if local_rank == 0:. In this container backend, that may not be equivalent to “exactly one container downloads” when num_nodes > 1. A safer follow-up would be to gate downloads on dist.get_rank() == 0 (global rank) to avoid concurrent writes to the same ./data path.
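The gating choice boils down to a small predicate. should_download below is a hypothetical helper (not in the notebook) that captures both policies under discussion: global-rank-0 when storage is shared across pods, local-rank-0 per node when each pod has its own ./data directory:

```python
def should_download(global_rank, local_rank, shared_storage):
    """Decide whether this process performs the dataset download.

    - shared_storage=True: exactly one process in the whole job
      (global rank 0) downloads, avoiding concurrent writes to a
      shared path such as a mounted PVC.
    - shared_storage=False: one process per node (local rank 0)
      downloads, since each pod has its own local ./data directory
      and every pod needs its own copy.
    """
    if shared_storage:
        return global_rank == 0
    return local_rank == 0
```

In either case the non-downloading ranks would block on a barrier (e.g. torch.distributed.barrier()) until the download finishes, so the gate only decides who writes, not who waits.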


@jiayuzhao05 jiayuzhao05 left a comment


If this change is meant to reduce random failures of these specific notebooks in CI, I see no problem. But if it is meant to systematically fix the notebook E2E flakiness, it is not fundamentally solved, because the notebooks with the same timeout=20 were not all updated uniformly.

@google-oss-prow

@jiayuzhao05: changing LGTM is restricted to collaborators

Details

In response to this:

If this change is meant to reduce random failures of these specific notebooks in CI, I see no problem. But if it is meant to systematically fix the notebook E2E flakiness, it is not fundamentally solved, because the notebooks with the same timeout=20 were not all updated uniformly.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@neeraj542
Author

neeraj542 commented Mar 20, 2026

Thanks all for the feedback.

For this PR, I kept the scope focused on stabilizing the flaky notebook path seen in CI, and the full E2E matrix is now passing (including 1.35.0).

I agree there are broader hardening options too:

  • replacing local_rank == 0 with dist.get_rank() == 0 for safer global download gating
  • reviewing other notebooks still using short completion timeouts

I can include these in this PR if maintainers prefer, or open a follow-up PR to keep this fix minimal and unblock merge.
@andreyvelich please advise your preferred direction.

cc: @astefanutti, @kuizhiqing

@neeraj542
Author

Hi @andreyvelich, I saw #3366 mentioned in #3325. Is there any dependency between the two?


@Sridhar1030 Sridhar1030 left a comment


Looks good for the scoped fix. One thing I noticed in the local-container-mnist diff: the original code applied Normalize((0.1307,), (0.3081,)) only on the rank-0 download path, while the actual training dataset used bare ToTensor(). After this refactor both go through load_fashion_mnist_with_retry with the same transform, so this is also a quiet correctness fix worth noting in the PR description.

+1 to @jiayuzhao05's point that the remaining 6 notebooks still on timeout=20 (mlx, jax, deepspeed, torchtune, data-cache) should be updated for consistency, either here or in a follow-up.
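For context on the transform in question: Normalize((0.1307,), (0.3081,)) standardizes each pixel as (x - mean) / std with the mean/std values used in the notebook. A minimal pure-Python sketch of the same arithmetic (the notebook itself uses torchvision.transforms):

```python
def normalize(pixels, mean=0.1307, std=0.3081):
    """Mirror torchvision's Normalize((0.1307,), (0.3081,)) on a flat
    list of pixel values: each value becomes (x - mean) / std."""
    return [(x - mean) / std for x in pixels]
```

The correctness point above is that training with and without this standardization yields differently scaled inputs, so applying the same transform on both code paths matters.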

Signed-off-by: neeraj542 <mneeraj2133@gmail.com>
@neeraj542
Author

neeraj542 commented Mar 22, 2026

Pushed updates addressing the open review feedback in this PR:

  • @xikronz: gated FashionMNIST download with dist.get_rank() == 0 instead of local_rank == 0 in examples/local/local-container-mnist.ipynb.
  • @jiayuzhao05 / @Sridhar1030: bumped TrainJob completion wait_for_job_status(..., timeout=20) to timeout=300 in the remaining example notebooks that still used 20s (data-cache, jax, deepspeed, mlx x2, torchtune llama3_2 alpaca notebook).

"outputs": [],
"source": [
"TrainerClient().wait_for_job_status(name=job_id, timeout=20)"
"TrainerClient().wait_for_job_status(name=job_id, timeout=300)"
Member

Do we still need to have such long timeout here?
Since we have this before, the TrainJob should be Complete by that cell:

for logline in TrainerClient().get_job_logs(job_id, follow=True):
    print(logline)

Author

Do we still need to have such long timeout here? Since we have this before, the TrainJob should be Complete by that cell:

for logline in TrainerClient().get_job_logs(job_id, follow=True):
    print(logline)

I reverted this notebook back to timeout=20 as suggested.

" transforms.Normalize((0.1307,), (0.3081,)),\n",
" ])\n",
"\n",
" def load_fashion_mnist_with_retry(train: bool, download: bool):\n",
Member

Do we see that FashionMNIST is constantly failing and we require retries?

Author

Do we see that FashionMNIST is constantly failing and we require retries?

Not constantly; this was seen in the failing 1.35.0 CI run artifacts, and I kept the retry only in local-container-mnist as a targeted stabilization for that failure mode.

Member

But we use FashionMNIST dataset in other examples too, do we need to introduce retries to all of them?

Author

@neeraj542 neeraj542 Mar 26, 2026

But we use FashionMNIST dataset in other examples too, do we need to introduce retries to all of them?

I agree; I haven't found strong evidence of repeated FashionMNIST failures yet, so I'll remove the retries here.

" transforms.Normalize((0.1307,), (0.3081,)),\n",
" ]),\n",
" )\n",
" if dist.get_rank() == 0:\n",
Member

Why did you check if dist.get_rank() == 0 ? Since we don't use sharable PVC in these TrainJobs, we should download dataset of every Kubernetes pod where local_rank == 0

Author

Why did you check if dist.get_rank() == 0 ? Since we don't use sharable PVC in these TrainJobs, we should download dataset of every Kubernetes pod where local_rank == 0

Updated back to local_rank == 0 for this container backend scenario.

Comment on lines +48 to +50
kubectl logs -n kubeflow-system -l app.kubernetes.io/name=trainer || true
# CI clusters can be a bit slower; debug output should not fail the whole run.
kubectl wait trainjob --for=condition=Complete --all --timeout 60s || true
Member

Shall we error on failure ?

Suggested change
kubectl logs -n kubeflow-system -l app.kubernetes.io/name=trainer || true
# CI clusters can be a bit slower; debug output should not fail the whole run.
kubectl wait trainjob --for=condition=Complete --all --timeout 60s || true
kubectl logs -n kubeflow-system -l app.kubernetes.io/name=trainer || exit 1
# CI clusters can be a bit slower; debug output should not fail the whole run.
kubectl wait trainjob --for=condition=Complete --all --timeout 60s || exit 1

Author

Shall we error on failure ?

Applied; updated both commands to fail explicitly with exit 1.

@andreyvelich
Member

Hi Guys, any updates here?

@neeraj542
Author

Hi @andreyvelich, I'm working on this and will push the required changes soon.

Signed-off-by: neeraj542 <mneeraj2133@gmail.com>

Development

Successfully merging this pull request may close these issues.

Fix flaky E2E test in Trainer Notebook examples
