
fix(examples): stabilize flaky E2E notebook completion #3366

Open
neeraj542 wants to merge 4 commits into kubeflow:master from neeraj542:fix/notebook-e2e-flaky-completion

Conversation

@neeraj542

@neeraj542 neeraj542 commented Mar 18, 2026

Fixes #3365

What/Why

Notebook E2E tests in test-e2e.yaml were flaky because the notebooks verify completion using a short fixed timeout (timeout=20), which CI cluster scheduling/runtime variability can exceed.

This PR:

  • Increases the "verify TrainJob completion" timeout from 20s to 300s in all notebooks executed by test-e2e.yaml.
  • Makes the post-Papermill debug kubectl wait non-fatal (and increases its timeout from 3s to 60s) so slow clusters don't fail the whole run during log collection.

Test plan

  • Ran the pre-commit checks on the modified files (CI gate equivalent for formatting/JSON checks); every hook reported "(no files to check) Skipped":

    Check Yaml...........................................(no files to check)Skipped
    Check JSON...........................................(no files to check)Skipped
    Fix End of Files.....................................(no files to check)Skipped
    Trim Trailing Whitespace.............................(no files to check)Skipped
    isort................................................(no files to check)Skipped
    black................................................(no files to check)Skipped
    flake8...............................................(no files to check)Skipped
    cargo fmt (data_cache)...............................(no files to check)Skipped
    cargo check (data_cache).............................(no files to check)Skipped

@review-notebook-app

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@google-oss-prow

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign jeffwan for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details: Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@github-actions

🎉 Welcome to the Kubeflow Trainer! 🎉

Thanks for opening your first PR! We're happy to have you as part of our community 🚀

Here's what happens next:

  • If you haven't already, please check out our Contributing Guide for repo-specific guidelines and the Kubeflow Contributor Guide for general community standards.
  • Our team will review your PR soon! cc @kubeflow/kubeflow-trainer-team

Join the community:

Feel free to ask questions in the comments if you need any help or clarification!
Thanks again for contributing to Kubeflow! 🙏

@neeraj542 neeraj542 force-pushed the fix/notebook-e2e-flaky-completion branch from 90ee3ee to 0e5de77 Compare March 18, 2026 18:41
@neeraj542 neeraj542 marked this pull request as ready for review March 18, 2026 18:42
Copilot AI review requested due to automatic review settings March 18, 2026 18:42
Contributor

Copilot AI left a comment


Pull request overview

This PR increases timeout values for job status checks and makes Kubernetes debug commands non-fatal in CI environments to handle slower cluster performance during CI runs. The changes consist of updating e2e test script behavior and example notebooks to provide more time for job completion.

Changes:

  • Modified e2e test script to add error handling (|| true) to kubectl commands and increased kubectl wait timeout from 3 seconds to 60 seconds with explanatory comment
  • Updated job status check timeouts in 7 example notebooks from 20 seconds to 300 seconds (5 minutes) to accommodate slower CI environments

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Summary per file:

  • hack/e2e-run-notebook.sh: Added error handling to kubectl commands and increased timeout from 3s to 60s for slower CI clusters
  • examples/xgboost/distributed-training/xgboost-distributed.ipynb: Increased wait_for_job_status timeout from 20s to 300s
  • examples/pytorch/speech-recognition/speech-recognition.ipynb: Increased wait_for_job_status timeout from 20s to 300s
  • examples/pytorch/question-answering/fine-tune-distilbert.ipynb: Increased wait_for_job_status timeout from 20s to 300s
  • examples/pytorch/image-classification/mnist.ipynb: Increased wait_for_job_status timeout from 20s to 300s
  • examples/pytorch/audio-classification/audio-classification.ipynb: Increased wait_for_job_status timeout from 20s to 300s
  • examples/local/local-training-mnist.ipynb: Increased wait_for_job_status timeout from 20s to 300s
  • examples/local/local-container-mnist.ipynb: Increased wait_for_job_status timeout from 20s to 300s


"outputs": [],
"source": [
"client.wait_for_job_status(name=k8s_job_name, timeout=20)"
"client.wait_for_job_status(name=k8s_job_name, timeout=300)"
Signed-off-by: neeraj542 <mneeraj2133@gmail.com>
@neeraj542 neeraj542 force-pushed the fix/notebook-e2e-flaky-completion branch from 0e5de77 to 31bbef9 Compare March 18, 2026 18:47
@neeraj542
Author

neeraj542 commented Mar 18, 2026

Hi @kuizhiqing, I’d really appreciate a review when you have time.
I made this PR to fix #3365 because the notebook E2E checks were flaky: the notebooks used a very short fixed timeout for “TrainJob completion”, and CI clusters can be slower occasionally.

I increased those completion waits in the notebooks used by test-e2e.yaml and also made the notebook debug kubectl wait non-fatal/less strict.
If you can, could you please check whether the timeouts are now robust and that the change is safe for the E2E runs? Thanks!

@Goku2099
Contributor

@neeraj542
Nice improvement addressing CI flakiness!

I noticed we're increasing the timeout from 20s to 300s, which helps with slow clusters, but a few things stood out:

  1. There are still other notebooks using wait_for_job_status(..., timeout=20). Should we update them as well for consistency?
  2. Using a fixed 300s timeout works, but would a retry/polling mechanism be more reliable instead of waiting the full duration?
  3. In e2e-run-notebook.sh, making kubectl wait non-fatal (|| true) avoids CI failures, but it might hide real issues. Maybe we can log a warning when it times out?

Curious what you think.
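On the retry/polling question: wait_for_job_status already returns as soon as the job reaches the target status, so timeout=300 is only an upper bound, not a fixed wait. If per-attempt logging is the goal, a generic poller is easy to sketch; wait_until and check below are hypothetical names, not part of the Trainer SDK:

```python
import time


def wait_until(check, timeout=300.0, interval=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll `check()` until it returns True or `timeout` seconds elapse.

    Returns True on success and False on timeout, so a caller can log a
    warning per attempt instead of failing opaquely after one long wait.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if check():
            return True
        # Never sleep past the deadline.
        sleep(min(interval, max(0.0, deadline - clock())))
    return False
```

A caller could pass `check=lambda: client.get_job(name).status == "Complete"` (names illustrative) and log on each False result, which addresses the "silent long wait" concern without changing the overall timeout.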

@neeraj542
Author

neeraj542 commented Mar 19, 2026

@Goku2099 thanks for the feedback! I kept this PR intentionally simple to fix the flaky test quickly with minimal risk:

  • updated only the notebooks used by test-e2e.yaml,
  • increased the completion wait from 20s to 300s, and
  • made the debug kubectl wait non-fatal so temporary cluster slowness does not fail the whole E2E run.

I agree the broader consistency updates and retry/polling improvement are good ideas, and can do them in a follow-up PR.

@neeraj542
Author

neeraj542 commented Mar 19, 2026

I have checked the failing 1.35.0 job logs; the failure comes from local-container-mnist.ipynb due to a transient/corrupted FashionMNIST download (RuntimeError: File not found or corrupted), not from the TrainJob completion timeout change. I'll add retry+cleanup logic around the dataset download in the local notebook example(s) to make this path robust.

Signed-off-by: neeraj542 <mneeraj2133@gmail.com>
@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Mar 19, 2026
@neeraj542
Author

I investigated the failing E2E Test (1.35.0) logs and found that the failure occurs in the local-container-mnist notebook due to a transient/corrupted FashionMNIST download (RuntimeError: File not found or corrupted), followed by a DDP barrier failure.

I pushed a targeted fix in this PR:

  • added retry logic (3 attempts) for FashionMNIST download in examples/local/local-container-mnist.ipynb
  • cleanup of ./data/FashionMNIST/raw before retry when corruption is detected

This keeps the scope minimal and directly addresses the observed 1.35.0 failure mode.
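The retry+cleanup approach described above can be sketched generically. Here download_fn and cleanup_dir are stand-ins (hypothetical names, not the notebook's exact code) for the torchvision FashionMNIST download call and the ./data/FashionMNIST/raw directory:

```python
import shutil
import time
from pathlib import Path


def download_with_retry(download_fn, cleanup_dir, attempts=3,
                        delay=2.0, sleep=time.sleep):
    """Call `download_fn()` up to `attempts` times.

    On RuntimeError (the error torchvision raises for a missing or
    corrupted archive), wipe `cleanup_dir` so the next attempt starts
    from a clean slate instead of re-reading a partial download.
    """
    last_err = None
    for attempt in range(1, attempts + 1):
        try:
            return download_fn()
        except RuntimeError as err:
            last_err = err
            # Remove any partially written files before retrying.
            shutil.rmtree(Path(cleanup_dir), ignore_errors=True)
            if attempt < attempts:
                sleep(delay)
    raise last_err
```

Usage in the notebook would look roughly like `download_with_retry(lambda: FashionMNIST("./data", download=True), "./data/FashionMNIST/raw")`, with the cleanup step being what distinguishes this from a plain retry loop.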

@neeraj542
Author

Hi @andreyvelich, could you please review this PR again?
If it looks good, I’d appreciate your help to move it toward merge. Thanks!

@astefanutti
Contributor

/ok-to-test

@neeraj542
Author

Hi @astefanutti, thanks for taking a look.
CI is now green for the notebooks E2E (including 1.35.0) and the PR is ready for your final review/approval; could you please reply with your review or /approve?

@xikronz

xikronz commented Mar 20, 2026

hi @neeraj542, thanks for PR #3366. I pulled the same root cause for the RuntimeError from the failing notebook artifacts yesterday; the subsequent dist.barrier() errors are a cascade from the rank crash.

One remaining hardening I think is still relevant: the download guard is currently if local_rank == 0:. In this container backend, that may not be equivalent to “exactly one container downloads” when num_nodes > 1. A safer follow-up would be to gate downloads on dist.get_rank() == 0 (global rank) to avoid concurrent writes to the same ./data path.
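The gating choice boils down to a small predicate. should_download below is a hypothetical helper (not in the notebook) that captures both policies under discussion: global-rank-0 when storage is shared across pods, local-rank-0 per node when each pod has its own ./data directory:

```python
def should_download(global_rank, local_rank, shared_storage):
    """Decide whether this process performs the dataset download.

    - shared_storage=True: exactly one process in the whole job
      (global rank 0) downloads, avoiding concurrent writes to a
      shared path such as a mounted PVC.
    - shared_storage=False: one process per node (local rank 0)
      downloads, since each pod has its own local ./data directory
      and every pod needs its own copy.
    """
    if shared_storage:
        return global_rank == 0
    return local_rank == 0
```

In either case the non-downloading ranks would block on a barrier (e.g. torch.distributed.barrier()) until the download finishes, so the gate only decides who writes, not who waits.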


@jiayuzhao05 jiayuzhao05 left a comment


If this change is meant to reduce random failures of these specific notebooks in CI, I see no problem. But if it is meant to systematically fix the notebook E2E flakiness, it is not fundamentally solved, because the notebooks with the same timeout=20 were not all updated uniformly.

@google-oss-prow

@jiayuzhao05: changing LGTM is restricted to collaborators

Details

In response to this:

If this change is meant to reduce random failures of these specific notebooks in CI, I see no problem. But if it is meant to systematically fix the notebook E2E flakiness, it is not fundamentally solved, because the notebooks with the same timeout=20 were not all updated uniformly.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@neeraj542
Author

neeraj542 commented Mar 20, 2026

Thanks all for the feedback.

For this PR, I kept the scope focused on stabilizing the flaky notebook path seen in CI, and the full E2E matrix is now passing (including 1.35.0).

I agree there are broader hardening options too:

  • replacing local_rank == 0 with dist.get_rank() == 0 for safer global download gating
  • reviewing other notebooks still using short completion timeouts

I can include these in this PR if maintainers prefer, or open a follow-up PR to keep this fix minimal and unblock merge.
@andreyvelich please advise your preferred direction.

cc: @astefanutti, @kuizhiqing

@neeraj542
Author

Hi @andreyvelich, I saw #3366 mentioned in #3325. Is there any dependency between the two?


@Sridhar1030 Sridhar1030 left a comment


Looks good for the scoped fix. One thing I noticed in the local-container-mnist diff: the original code applied Normalize((0.1307,), (0.3081,)) only on the rank-0 download path, while the actual training dataset used bare ToTensor(). After this refactor both go through load_fashion_mnist_with_retry with the same transform, so this is also a quiet correctness fix worth noting in the PR description.

+1 to @jiayuzhao05's point that the remaining 6 notebooks still on timeout=20 (mlx, jax, deepspeed, torchtune, data-cache) should be updated for consistency, either here or in a follow-up.
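For context on the transform in question: Normalize((0.1307,), (0.3081,)) standardizes each pixel as (x - mean) / std with the mean/std values used in the notebook. A minimal pure-Python sketch of the same arithmetic (the notebook itself uses torchvision.transforms):

```python
def normalize(pixels, mean=0.1307, std=0.3081):
    """Mirror torchvision's Normalize((0.1307,), (0.3081,)) on a flat
    list of pixel values: each value becomes (x - mean) / std."""
    return [(x - mean) / std for x in pixels]
```

The correctness point above is that training with and without this standardization yields differently scaled inputs, so applying the same transform on both code paths matters.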

Signed-off-by: neeraj542 <mneeraj2133@gmail.com>
@neeraj542
Author

neeraj542 commented Mar 22, 2026

Pushed updates addressing the open review feedback in this PR:

  • @xikronz: gated FashionMNIST download with dist.get_rank() == 0 instead of local_rank == 0 in examples/local/local-container-mnist.ipynb.
  • @jiayuzhao05 / @Sridhar1030: bumped TrainJob completion wait_for_job_status(..., timeout=20) to timeout=300 in the remaining example notebooks that still used 20s (data-cache, jax, deepspeed, mlx x2, torchtune llama3_2 alpaca notebook).

"outputs": [],
"source": [
"TrainerClient().wait_for_job_status(name=job_id, timeout=20)"
"TrainerClient().wait_for_job_status(name=job_id, timeout=300)"
Member

Do we still need to have such long timeout here?
Since we have this before, the TrainJob should be Complete by that cell:

for logline in TrainerClient().get_job_logs(job_id, follow=True):
    print(logline)

Author

Do we still need to have such long timeout here? Since we have this before, the TrainJob should be Complete by that cell:

for logline in TrainerClient().get_job_logs(job_id, follow=True):
    print(logline)

I reverted this notebook back to timeout=20 as suggested.

" transforms.Normalize((0.1307,), (0.3081,)),\n",
" ])\n",
"\n",
" def load_fashion_mnist_with_retry(train: bool, download: bool):\n",
Member

Do we see that FashionMNIST is constantly failing and we require retries?

Author

Do we see that FashionMNIST is constantly failing and we require retries?

Not constantly; this was seen in the failing 1.35.0 CI run artifacts, and I kept the retry only in local-container-mnist as a targeted stabilization for that failure mode.

Member

But we use FashionMNIST dataset in other examples too, do we need to introduce retries to all of them?

Author

@neeraj542 neeraj542 Mar 26, 2026

But we use FashionMNIST dataset in other examples too, do we need to introduce retries to all of them?

I agree; I haven't found strong evidence of repeated FashionMNIST failures yet, so I'll remove the retries here.

" transforms.Normalize((0.1307,), (0.3081,)),\n",
" ]),\n",
" )\n",
" if dist.get_rank() == 0:\n",
Member

Why did you check if dist.get_rank() == 0 ? Since we don't use sharable PVC in these TrainJobs, we should download dataset of every Kubernetes pod where local_rank == 0

Author

Why did you check if dist.get_rank() == 0 ? Since we don't use sharable PVC in these TrainJobs, we should download dataset of every Kubernetes pod where local_rank == 0

Updated back to local_rank == 0 for this container backend scenario.

Comment on lines +48 to +50
kubectl logs -n kubeflow-system -l app.kubernetes.io/name=trainer || true
# CI clusters can be a bit slower; debug output should not fail the whole run.
kubectl wait trainjob --for=condition=Complete --all --timeout 60s || true
Member

Shall we error on failure ?

Suggested change
kubectl logs -n kubeflow-system -l app.kubernetes.io/name=trainer || true
# CI clusters can be a bit slower; debug output should not fail the whole run.
kubectl wait trainjob --for=condition=Complete --all --timeout 60s || true
kubectl logs -n kubeflow-system -l app.kubernetes.io/name=trainer || exit 1
# CI clusters can be a bit slower; debug output should not fail the whole run.
kubectl wait trainjob --for=condition=Complete --all --timeout 60s || exit 1

Author

Shall we error on failure ?

Applied; updated both commands to fail explicitly with exit 1.

@andreyvelich
Member

Hi Guys, any updates here?

@neeraj542
Author

Hi @andreyvelich, I'm working on this and will push the required changes soon.

Signed-off-by: neeraj542 <mneeraj2133@gmail.com>

Development

Successfully merging this pull request may close these issues.

Fix flaky E2E test in Trainer Notebook examples
