
[rocm-jaxlib-v0.8.2] Backporting CI Benchmark related changes and fixes #760

Open
mmakevic-amd wants to merge 8 commits into rocm-jaxlib-v0.8.2 from mmakevic/v0.8.2-backport-ci-benchmark-fix

Conversation


@mmakevic-amd mmakevic-amd commented Mar 27, 2026

Motivation

Currently, CI benchmarks are failing on v0.8.2. This PR fixes them by backporting changes from #691 and #730.

Note: #622 will be closed in favour of this PR

Test Plan

I will manually trigger the CI check before merging.

Test Result

The workflow ran as expected, but gemma2 failed because its device time was above the threshold: https://github.com/ROCm/xla/actions/runs/23635202015/job/68844227492
After unsetting the HLO arg mode (it was previously set to uninitialized), the job runs as expected: https://github.com/ROCm/xla/actions/runs/24037623275/job/70190205343

Submission Checklist

@mmakevic-amd mmakevic-amd added the labels cherry-pick-candidate (Mark a PR to be cherry-picked into the next ROCm JAX. Remove IIF the latest upstream contain the PR.) and rocm-jaxlib-v0.8.2 Mar 27, 2026
@mmakevic-amd mmakevic-amd mentioned this pull request Mar 27, 2026
@mmakevic-amd mmakevic-amd requested a review from hsharsha March 27, 2026 08:17
@mmakevic-amd
Author

@hsharsha gemma2 is exhibiting some run-to-run variations (from ~12ms to ~600ms GPU device time), so the job sometimes fails. Workflow itself works as expected.
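
To illustrate why a single-sample threshold check becomes flaky under this much run-to-run variation, here is a minimal sketch. The threshold value, the sample timings, and the function names are all hypothetical and are not taken from the workflow; the real check may be implemented differently.

```python
from statistics import median

# Hypothetical threshold; the real value is defined by the benchmark workflow.
THRESHOLD_MS = 100.0


def single_run_passes(device_time_ms, threshold_ms=THRESHOLD_MS):
    """Pass/fail decision based on one measured GPU device time."""
    return device_time_ms <= threshold_ms


def median_passes(samples_ms, threshold_ms=THRESHOLD_MS):
    """Pass/fail decision based on the median of several runs (one possible alternative)."""
    return median(samples_ms) <= threshold_ms


# Made-up samples that mimic the reported spread (~12 ms to ~600 ms).
samples = [12.0, 13.5, 600.0, 12.4, 11.9]

# Each run judged on its own: the outlier run fails, the others pass.
single_results = [single_run_passes(s) for s in samples]

# The same runs judged together by their median.
median_result = median_passes(samples)
```

With these made-up numbers, only the outlier run fails the single-sample check, which matches the "sometimes fails" behaviour described above. Whether aggregating over several runs is appropriate for this benchmark is a separate decision.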

@mmakevic-amd mmakevic-amd requested a review from i-chaochen March 27, 2026 10:00
Collaborator

@i-chaochen i-chaochen left a comment


Once this PR is merged, are we going to have the benchmark CI as a presubmit check? I don't see the benchmark CI among the presubmit checks on the 0.9.1 branch, for example on PR #756. Does that PR need a rebase, or shall we create a new PR to check on 0.9.1?

@mmakevic-amd
Author

> Once this PR is merged, are we going to have the benchmark CI as a presubmit check? I don't see the benchmark CI among the presubmit checks on the 0.9.1 branch, for example on PR #756. Does that PR need a rebase, or shall we create a new PR to check on 0.9.1?

No, only as a postsubmit. I can enable it as a presubmit with no problem, but the whole workflow takes ~40 min, as you can see at https://github.com/ROCm/xla/actions/workflows/postsubmit_benchmark.yml, so I'm not sure we want that.

@mmakevic-amd
Author

One can trigger it manually before merging if that's necessary

@i-chaochen
Collaborator

i-chaochen commented Mar 27, 2026

> Once this PR is merged, are we going to have the benchmark CI as a presubmit check? I don't see the benchmark CI among the presubmit checks on the 0.9.1 branch, for example on PR #756. Does that PR need a rebase, or shall we create a new PR to check on 0.9.1?

> No, only as a postsubmit. I can enable it as a presubmit with no problem, but the whole workflow takes ~40 min, as you can see at https://github.com/ROCm/xla/actions/workflows/postsubmit_benchmark.yml, so I'm not sure we want that.

I think we can use a label to activate the benchmark CI, just like we do for the Claude code review and TSAN/ASAN. It would be best to let it run on presubmit.

cc @nurmukhametov @alekstheod ?

@nurmukhametov
Member

> Once this PR is merged, are we going to have the benchmark CI as a presubmit check? I don't see the benchmark CI among the presubmit checks on the 0.9.1 branch, for example on PR #756. Does that PR need a rebase, or shall we create a new PR to check on 0.9.1?

> No, only as a postsubmit. I can enable it as a presubmit with no problem, but the whole workflow takes ~40 min, as you can see at https://github.com/ROCm/xla/actions/workflows/postsubmit_benchmark.yml, so I'm not sure we want that.

IMO, 40 mins is fine. However, I don't understand two things:

  • Why does the checkout take up to 15 min sometimes?
  • Why don't we hit caches for builds?

@mmakevic-amd mmakevic-amd requested a review from i-chaochen April 7, 2026 06:05
