Update to mam4xx that eliminates a Kokkos::single bottleneck.#8317
Conversation
|
@meng630, could you please help with the performance plots? |
|
This isn't quite BFB, right? We can't merge non-BFB PRs today, but I'll merge to |
|
This PR should be BFB. |
|
Since the change revered the order of those two loops, I think this should be BFB up to order of operations. Given how simple the calculation is, there is a good chance that it is BFB in most cases. |
|
Why don't we wait till tomorrow just to be safe? :-) |
The function gas_washout() in mam4xx had a Kokkos::single surrounding a double loop over the column data. With a little rearranging of the code this was re-written as a single loop over a Kokkos::parallel loop. This is in mam4xx hash tag: 6a5c9bf2a6b06e7c3a5f85ea00c059b0d56ba575 [BFB]
|
Merged to |
|
We're still waiting for a report from Aurora's |
|
We're still seeing these test failures, and this PR is one of a handful of candidates for the DIFFs that cause the failures. I started to do a bisection on Aurora to pick out the commit responsible, but it turns out I don't have the right access permissions to do this. I'm on vacation next week and returning 5/12, so if someone else like @tcclevenger (hey Conrad! Was great to hang out with you at the ALCF Hackathon!) could take over in my absence, I'd be most appreciative. |
|
So it seems that this PR is not BFB on Aurora. Is this an issue, or should we just rebless these test results? @odiazib , do you have a sense for how big the DIFFs are? |
|
@jeff-cohere , Yes the non-BFB is an issue on Aurora. See eagles-project/mam4xx#549 which should fix this. |
|
@overfelt , if you update the mam4xx submodule here, I can re-merge it to |
d1af184 to
8176220
Compare
|
Okay, the resulting build error is related to the merge of mam4xx PR 547. Sorry about that--I should have waited. Can you rewind the submodule to |
The function gas_washout() in mam4xx had a Kokkos::single surrounding a double loop over the column data. With a little rearranging of the code this was re-written as a single loop over a Kokkos::parallel loop. This is in mam4xx hash tag: 6a5c9bf2a6b06e7c3a5f85ea00c059b0d56ba575 This is a re-merge of the previous merge of PR 8317 to next. [BFB]
|
Merged to |
If all tests are okay after merging into next, I think it is okay to merge. |
|
Do you mean tomorrow? Or are you saying that we should wait till next week so the CDash board can be cleaned up? |
I meant after merging to the next branch. |
After merging into the next branch, a couple of test suites run on that branch, for example on Aurora and Polaris. Thus, if those sets on CDash are passing, I think we are fine to merge. |
|
There's always the possibility that the mam4xx and EAMxx tests won't catch all of the issues related to a PR. Do you think this means we should follow the non-EAMxx testing guidelines for mam4xx? In other words:
This is what we've been doing so far, as opposed to the EAMxx workflow:
Luca prefers the EAMxx workflow because the additional CI testing reduces the likelihood of issues creeping into |
I like the approach of merging to next, waiting, and then merging to master. |
|
It looks like Aurora and the other ALCF systems are being taken down for a few days. If we want to use I recommend we merge this PR, since it's an improvement on the status quo that fixes issues that should have manifested on our other systems anyway. If it turns out to have problems, we can always circle back to it. |
|
We can merge this PR. @overfelt tested it on Aurora, and the previously failing tests are now passing. |


The function gas_washout() in mam4xx had a Kokkos::single surrounding a double loop over the column data. With a little rearranging of the code this was re-written as a single loop over a Kokkos::parallel loop. This is in mam4xx hash tag: 6a5c9bf2a6b06e7c3a5f85ea00c059b0d56ba575
[BFB]