Skip to content

Update to mam4xx that eliminates a Kokkos::single bottleneck.#8317

Merged
jeff-cohere merged 4 commits into
masterfrom
overfelt/eamxx/refactor-gas_washout
May 15, 2026
Merged

Update to mam4xx that eliminates a Kokkos::single bottleneck.#8317
jeff-cohere merged 4 commits into
masterfrom
overfelt/eamxx/refactor-gas_washout

Conversation

@overfelt
Copy link
Copy Markdown
Contributor

@overfelt overfelt commented Apr 23, 2026

The function gas_washout() in mam4xx had a Kokkos::single surrounding a double loop over the column data. With a little rearranging of the code this was re-written as a single loop over a Kokkos::parallel loop. This is in mam4xx hash tag: 6a5c9bf2a6b06e7c3a5f85ea00c059b0d56ba575

[BFB]

@overfelt overfelt added BFB PR leaves answers BFB Performance MAM4xx MAM4xx related changes labels Apr 23, 2026
@overfelt overfelt marked this pull request as draft April 23, 2026 21:06
@overfelt overfelt requested review from jeff-cohere April 23, 2026 22:04
@overfelt overfelt self-assigned this Apr 23, 2026
@overfelt overfelt requested review from odiazib April 23, 2026 22:04
@overfelt overfelt marked this pull request as ready for review April 23, 2026 22:05
@odiazib
Copy link
Copy Markdown
Contributor

odiazib commented Apr 23, 2026

@meng630, could you please help with the performance plots?

@meng630
Copy link
Copy Markdown
Contributor

meng630 commented Apr 24, 2026

It shows a big improvement for sethet kernel (0.545 s -> 0.142 s wall clock time). Since this kernel is pretty small, we don't see much improvement in MAM4.
image
image

@jeff-cohere
Copy link
Copy Markdown
Contributor

This isn't quite BFB, right? We can't merge non-BFB PRs today, but I'll merge to next at the next opportunity if everyone's happy with this. (I'm travelling to Argonne today.)

@odiazib
Copy link
Copy Markdown
Contributor

odiazib commented Apr 27, 2026

This PR should be BFB.

@overfelt
Copy link
Copy Markdown
Contributor Author

Since the change revered the order of those two loops, I think this should be BFB up to order of operations. Given how simple the calculation is, there is a good chance that it is BFB in most cases.

@jeff-cohere
Copy link
Copy Markdown
Contributor

Why don't we wait till tomorrow just to be safe? :-)

jeff-cohere added a commit that referenced this pull request Apr 28, 2026
The function gas_washout() in mam4xx had a Kokkos::single surrounding a
double loop over the column data. With a little rearranging of the code
this was re-written as a single loop over a Kokkos::parallel loop. This
is in mam4xx hash tag: 6a5c9bf2a6b06e7c3a5f85ea00c059b0d56ba575

[BFB]
@jeff-cohere
Copy link
Copy Markdown
Contributor

Merged to next.

@jeff-cohere
Copy link
Copy Markdown
Contributor

We're still waiting for a report from Aurora's next branch for today. The filesystem on Aurora was having pretty serious issues earlier in the week, so we decided to wait for another report. The most recent report does show two failing mam4xx-related tests (DIFFs). It's possible that this PR is not BFB.

@jeff-cohere
Copy link
Copy Markdown
Contributor

We're still seeing these test failures, and this PR is one of a handful of candidates for the DIFFs that cause the failures. I started to do a bisection on Aurora to pick out the commit responsible, but it turns out I don't have the right access permissions to do this.

I'm on vacation next week and returning 5/12, so if someone else like @tcclevenger (hey Conrad! Was great to hang out with you at the ALCF Hackathon!) could take over in my absence, I'd be most appreciative.

@jeff-cohere
Copy link
Copy Markdown
Contributor

So it seems that this PR is not BFB on Aurora. Is this an issue, or should we just rebless these test results? @odiazib , do you have a sense for how big the DIFFs are?

@overfelt
Copy link
Copy Markdown
Contributor Author

@jeff-cohere , Yes the non-BFB is an issue on Aurora. See eagles-project/mam4xx#549 which should fix this.

@jeff-cohere
Copy link
Copy Markdown
Contributor

@overfelt , if you update the mam4xx submodule here, I can re-merge it to next so we can see what aurora thinks of it on its next pass.

@overfelt overfelt force-pushed the overfelt/eamxx/refactor-gas_washout branch from d1af184 to 8176220 Compare May 13, 2026 17:59
@jeff-cohere
Copy link
Copy Markdown
Contributor

jeff-cohere commented May 13, 2026

Okay, the resulting build error is related to the merge of mam4xx PR 547. Sorry about that--I should have waited. Can you rewind the submodule to a2117f054609d6625194dda929c448f5f9240fe1? Then I can create a separate PR for the changes related to those static arrays.

jeff-cohere added a commit that referenced this pull request May 13, 2026
The function gas_washout() in mam4xx had a Kokkos::single surrounding a
double loop over the column data. With a little rearranging of the code
this was re-written as a single loop over a Kokkos::parallel loop. This
is in mam4xx hash tag: 6a5c9bf2a6b06e7c3a5f85ea00c059b0d56ba575

This is a re-merge of the previous merge of PR 8317 to next.

[BFB]
@jeff-cohere
Copy link
Copy Markdown
Contributor

Merged to next.

@jeff-cohere
Copy link
Copy Markdown
Contributor

@bartgol reminded me that EAMxx-only PRs can be merged to master without waiting for overnight testing to finish (but after merging to next). @odiazib and @overfelt , do you think it's worth merging this to master today? And the same question applies to #8385.

@odiazib
Copy link
Copy Markdown
Contributor

odiazib commented May 13, 2026

@bartgol reminded me that EAMxx-only PRs can be merged to master without waiting for overnight testing to finish (but after merging to next). @odiazib and @overfelt , do you think it's worth merging this to master today? And the same question applies to #8385.

If all tests are okay after merging into next, I think it is okay to merge.

@jeff-cohere
Copy link
Copy Markdown
Contributor

Do you mean tomorrow? Or are you saying that we should wait till next week so the CDash board can be cleaned up?

@odiazib
Copy link
Copy Markdown
Contributor

odiazib commented May 13, 2026

Do you mean tomorrow? Or are you saying that we should wait till next week so the CDash board can be cleaned up?

I meant after merging to the next branch.

@odiazib
Copy link
Copy Markdown
Contributor

odiazib commented May 13, 2026

Do you mean tomorrow? Or are you saying that we should wait till next week so the CDash board can be cleaned up?

After merging into the next branch, a couple of test suites run on that branch, for example on Aurora and Polaris. Thus, if those sets on CDash are passing, I think we are fine to merge.
This PR was a good example: CI on both E3SM and MAM4xx passed, but the issue was reported by CDash in the Aurora test suite.

@jeff-cohere
Copy link
Copy Markdown
Contributor

There's always the possibility that the mam4xx and EAMxx tests won't catch all of the issues related to a PR. Do you think this means we should follow the non-EAMxx testing guidelines for mam4xx? In other words:

  1. Merge to next
  2. Wait for overnight testing
  3. If tests pass, merge to master, else fix tests and repeat steps 1-2.

This is what we've been doing so far, as opposed to the EAMxx workflow:

  1. Merge to next
  2. Merge to master
  3. If tests related to the PR fail in master, submit a bugfix PR.

Luca prefers the EAMxx workflow because the additional CI testing reduces the likelihood of issues creeping into master. But it still can happen.

@odiazib
Copy link
Copy Markdown
Contributor

odiazib commented May 13, 2026

There's always the possibility that the mam4xx and EAMxx tests won't catch all of the issues related to a PR. Do you think this means we should follow the non-EAMxx testing guidelines for mam4xx? In other words:

  1. Merge to next
  2. Wait for overnight testing
  3. If tests pass, merge to master, else fix tests and repeat steps 1-2.

This is what we've been doing so far, as opposed to the EAMxx workflow:

  1. Merge to next
  2. Merge to master
  3. If tests related to the PR fail in master, submit a bugfix PR.

Luca prefers the EAMxx workflow because the additional CI testing reduces the likelihood of issues creeping into master. But it still can happen.

I like the approach of merging to next, waiting, and then merging to master.
I guess the purpose of next is to reduce the likelihood of issues creeping into master.

@jeff-cohere
Copy link
Copy Markdown
Contributor

jeff-cohere commented May 15, 2026

It looks like Aurora and the other ALCF systems are being taken down for a few days. If we want to use next as a screening process for mam4xx changes, I think we need to run next on a system that is a bit more available than Aurora has been.

I recommend we merge this PR, since it's an improvement on the status quo that fixes issues that should have manifested on our other systems anyway. If it turns out to have problems, we can always circle back to it.

@odiazib
Copy link
Copy Markdown
Contributor

odiazib commented May 15, 2026

We can merge this PR. @overfelt tested it on Aurora, and the previously failing tests are now passing.

@jeff-cohere jeff-cohere merged commit 586a534 into master May 15, 2026
16 of 17 checks passed
@jeff-cohere jeff-cohere deleted the overfelt/eamxx/refactor-gas_washout branch May 15, 2026 16:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

BFB PR leaves answers BFB MAM4xx MAM4xx related changes Performance

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants