Muphys jax mlir opt#7

Open
dganellari wants to merge 170 commits into muphys_bug_fix from muphys_jax_mlir_opt

Conversation

@dganellari
Owner

No description provided.

dganellari and others added 30 commits January 24, 2026 15:28
Re-structuring of the (experiments and grids) serialized data generation
and download_extraction

---------

Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
Co-authored-by: Hannes Vogt <hannes@havogt.de>
Porting of advection least-squares coefficients for both sphere and torus,
and implemented `setup_program` throughout advection

---------

Co-authored-by: Hannes Vogt <hannes@havogt.de>
Co-authored-by: Rico Haeuselmann <r.haeuselmann@gmx.ch>
Co-authored-by: Jacopo Canton <jacopo.canton@gmail.com>
…TOFFSET_DSL fields with torus grids (C2SM#1045)

They were using the hardcoded config values for `thslp_zdiffu` and
`thhgtd_zdiffu`. These values can now be customized in the
`MetricsFieldsFactory`, and have been updated in the test definitions
for GAUSS3D and WEISMAN_KLEMP. The xfails are removed from
`test_metrics_factory.py`. The previously failing tests in
`test_compute_diffusion_metrics.py` now pass.

---------

Co-authored-by: Jacopo Canton <jacopo.canton@gmail.com>
dganellari and others added 29 commits March 20, 2026 15:51
Co-authored-by: Edoardo Paone <edoardo.paone@cscs.ch>
Co-authored-by: Mikael Simberg <mikael.simberg@cscs.ch>
Since tests are currently serialized anyway, I think we don't benefit
from running across the full node. This uses MPS to run all four ranks
on a single GPU. This is based on C2SM#819.
…C2SM#1119)

Combines C2SM#1118 and C2SM#1115, because both require `v3` serialized data:
- Move computation of `vertoffset_gradp` from `vertidx_gradp` to Python
(index2offset)
- Move transpose (and decomposition) of `rbf_vec_coeff_v` and
`rbf_vec_coeff_e` to Python
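
The index-to-offset conversion mentioned in the first bullet can be sketched roughly as follows. This is only an illustration: the name `index2offset` comes from the commit message, but the signature, the 0-based indexing, and the assumption that the vertical dimension is the last axis are all hypothetical and not taken from the icon4py code.

```python
import numpy as np


def index2offset(vertidx: np.ndarray) -> np.ndarray:
    """Turn absolute vertical indices into offsets relative to each level.

    Assumption (not from the source): indices are 0-based and the last
    axis of `vertidx` is the vertical (k) dimension.
    """
    k = np.arange(vertidx.shape[-1], dtype=vertidx.dtype)
    # Broadcasting subtracts the level index k from every leading entry.
    return vertidx - k


# Two hypothetical edges with three vertical levels each.
vertidx = np.array([[0, 2, 2], [1, 1, 3]], dtype=np.int32)
offsets = index2offset(vertidx)
```

Storing offsets instead of absolute indices keeps the field independent of where the level sits in the column, which is presumably why the conversion is done once in Python.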

Additional:
- Add missing halo exchange to `compute_zdiff_gradp` (test failure
triggered by new serialized data, most likely unrelated to the other
changes)
- Refactor `kflip_wgtfacq` to a more general `flip` on fields
- set default log level in py2fgen runtime to WARNING
- add `ICON4PY_WAIT_FOR_COMPILATION` option to wait until granules inits
finished jit compiling
- remove an unused parameter
Reduce blanket type ignores at the price of adding a handful specific
ones.

---------

Co-authored-by: Hannes Vogt <vogt@hey.com>
Make usage of `setup_graupel()` consistent across multiple places.
Adds gtfn_gpu backend to the distributed CI pipeline. dace_gpu is still
left out because compilation takes too long.

The base image is upgraded because it's possible, but not strictly
necessary. The CPU-only version of the pipeline needed 25.04 (24.04 and
25.10 did not work for various reasons). However, since OpenMPI and
libfabric are now built manually in the container the base image version
is less of a constraint. 24.04 doesn't have matching GCC/CUDA versions
and 26.04 doesn't exist yet, but the pipeline should eventually use
26.04.

OpenMPI and libfabric are built manually for slingshot support because
getting the Ubuntu repository packages to work with GPU support did not
seem possible/easy. The installation is based on
https://github.com/eth-cscs/cray-network-stack.

GHEX needs an upgrade, because there's a bug in how strides are
calculated for GPU buffers. @philip-paul-mueller has already fixed this
in ghex-org/GHEX#190 but we should wait for that
to be merged (and probably test in icon-exclaim first).

This also fixes a few cupy/numpy incompatibilities.
`revert_repeated_index_to_invalid` was updated to only deal with numpy
for now as the connectivities are always numpy arrays.
`test_halo_exchange_for_sparse_field` is marked `embedded_only`. The
non-MPI test was already marked embedded-only.

This does not try to unify the default and distributed CI pipeline
definitions. That should, however, be done sooner or later as well.

---------

Co-authored-by: Jacopo Canton <jacopo.canton@gmail.com>
Co-authored-by: Nicoletta Farabullini <41536517+nfarabullini@users.noreply.github.com>
Plan is to tag this version and then branch to make further v0.1.x
releases from a branch with selected commits. Greenline work will
continue in main, blueline will stay on `v0.1.x` until we feel
comfortable to get to the next version with all changes from main.
[PR#980](C2SM#980) introduced streams
into the halo exchanges. For this, `DEFAULT_STREAM` was also added, which models the
default stream and implements the [CUDA Stream Protocol](https://nvidia.github.io/cuda-python/cuda-core/latest/interoperability.html#cuda-stream-protocol). However, the original
implementation identified as protocol version `1` instead of version `0`.
Because of a related bug in [GHEX](ghex-org/GHEX#202),
this error was hidden.

This PR fixes the Python implementation and also updates GHEX.
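
As a rough illustration of the version fix, an object following the CUDA Stream Protocol exposes a `__cuda_stream__` method that returns a `(version, handle)` pair. The class and helper below are hypothetical stand-ins, not the icon4py or GHEX code; the use of handle `0` for the default stream is an assumption made for this sketch.

```python
class DefaultStream:
    """Hypothetical stand-in for an object modelling the default CUDA stream."""

    def __cuda_stream__(self) -> tuple[int, int]:
        # The protocol returns (protocol version, stream handle as an int).
        # Version 0 is the value the fix switches to; handle 0 is assumed
        # here to denote the default stream.
        return (0, 0)


def stream_handle(obj) -> int:
    """Hypothetical consumer that rejects unknown protocol versions."""
    version, handle = obj.__cuda_stream__()
    if version != 0:
        raise ValueError(f"unsupported CUDA stream protocol version: {version}")
    return handle


handle = stream_handle(DefaultStream())
```

A consumer that checks the version strictly, as `stream_handle` does, would have surfaced the wrong version number immediately.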
The orchestration is not used and not tested. Moreover, the
`orchestration.decorator` imports `mpi.MPI`, which triggers an `MPI_Init` (e.g.
when generating bindings with py2fgen).
- delete tools/common (there was only py2fgen left which had its own
setup_logger)
- default setup_logger is WARNING
…1171)

`test_diffusion.f90` and `test_dycore.f90` in
`tools/tests/tools/py2fgen/fortran_samples/` are unused — only
referenced by permanently-skipped tests that require connectivity data
never passed from Fortran.

- **Deleted files:** `test_diffusion.f90` (384 lines), `test_dycore.f90`
(851 lines)
- **Removed 4 skipped tests** from `test_cli.py`:
`test_py2fgen_compilation_and_execution_{diffusion,diffusion_gpu,dycore,dycore_gpu}`
- **Kept:** `test_square.f90` and all active tests that use it


---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jcanton <5622559+jcanton@users.noreply.github.com>
The `graupel` SDFG looks like the following:
<img width="1889" height="1051" alt="image"
src="https://github.com/user-attachments/assets/2c88af89-1b2d-40f5-9928-aa6e6698449b"
/>
In both maps there are outputs whose values depend on if-statements
that check whether one or more masks are active. If they are not,
the outputs are simply set to the input values
without any change.
Since we know that the inputs and outputs are the same pointers, we can
improve this pattern by removing the copies in the false branches of the
if-statements and replacing the intermediate temporary `AccessNode`s
with the global `AccessNode`s that are used as outputs of the program.
To be more specific, the `AccessNode`s where this is applied are:
- `q_in_2` -> `q_out_2`
- `q_in_3` -> `q_out_3`
- `q_in_4` -> `q_out_4`
- `q_in_5` -> `q_out_5`
- `te` -> `t_out`
This is the updated SDFG:
<img width="1766" height="1136" alt="image"
src="https://github.com/user-attachments/assets/3827fe87-10f0-4c33-98e3-12e78d9bbfed"
/>
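
The idea behind the transformation can be illustrated outside DaCe with a small NumPy sketch. This is not the SDFG transformation itself, only an analogy under the assumption that input and output refer to the same buffer; the function names are hypothetical.

```python
import numpy as np


def update_with_copy(q_in, new_values, mask):
    # Out-of-place variant: where the mask is False, the input is copied
    # into a fresh output array.
    return np.where(mask, new_values, q_in)


def update_in_place(q, new_values, mask):
    # In-place variant: `q` is both input and output, so only the masked
    # entries are written and the "false branch" copy disappears.
    np.copyto(q, new_values, where=mask)
    return q


q = np.array([1.0, 2.0, 3.0])
new_values = np.array([10.0, 20.0, 30.0])
mask = np.array([True, False, True])

expected = update_with_copy(q, new_values, mask)
result = update_in_place(q, new_values, mask)
```

Both variants produce the same values; the in-place one avoids touching the entries where the mask is inactive.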

---------

Co-authored-by: Edoardo Paone <edoardo.paone@cscs.ch>
Co-authored-by: Hannes Vogt <hannes.vogt@cscs.ch>
Co-authored-by: Philip Mueller, CSCS <philip.mueller@cscs.ch>
rayleigh_coeff
divdamp_trans_start
divdamp_trans_end

and also remove nudging_decay_rate in DiffusionConfig
Co-authored-by: Edoardo Paone <edoardo.paone@cscs.ch>
- Removed duplicate `timeloop_diffusion_savepoint_exit` (driver) and
`timeloop_diffusion_savepoint_exit_standalone` (standalone_driver)
fixtures that were identical to the shared
`savepoint_diffusion_exit` in `datatest.py`
- Added a small `linit` fixture alias in both driver and
standalone_driver to bridge the parametrized
`timeloop_diffusion_linit_exit` name
to the `linit` name expected by the shared fixture
…re solver

Profiled vertically_implicit_solver_at_predictor_step on MI300A (Beverin, gfx942).
Individual kernels achieve 93% of HBM peak bandwidth. Enable fuse_tasklets for the
solver stencil, giving ~7% improvement (0.82ms -> 0.76ms). Added per-kernel roofline
script, C2E scatter analysis, and HIP/CUDA bandwidth benchmarks for cross-platform
comparison. See amd_scripts/PROFILING_RESULTS.md for detailed findings.
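
The quoted figures can be cross-checked with a few lines of arithmetic. The helper names below are hypothetical and unrelated to the scripts in `amd_scripts`; the peak bandwidth is left as a parameter because the source does not state the value used.

```python
def relative_improvement(before_ms: float, after_ms: float) -> float:
    """Fraction of runtime saved when going from `before_ms` to `after_ms`."""
    return (before_ms - after_ms) / before_ms


def bandwidth_fraction(bytes_moved: float, runtime_s: float, peak_bytes_per_s: float) -> float:
    """Achieved memory bandwidth as a fraction of the peak (roofline-style)."""
    return bytes_moved / runtime_s / peak_bytes_per_s


# Runtimes quoted in the commit message: 0.82 ms before, 0.76 ms after.
improvement = relative_improvement(0.82, 0.76)
print(f"improvement: {improvement:.1%}")
```

With the quoted runtimes this comes out at roughly 7%, consistent with the figure stated above.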
`single_node_default` is ambiguous—it sits next to
`single_node_reductions` in `definitions.py` but doesn't convey that
it's an exchange runtime. Renamed to `single_node_exchange` to match its
type (`SingleNodeExchange`) and mirror the naming of its sibling.

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: jcanton <5622559+jcanton@users.noreply.github.com>
The actual Fortran bindings of diffusion and dycore were part of the
tools/py2fgen package. However, py2fgen is actually a standalone tool.

We introduce a new package `icon4py.bindings` which depends on py2fgen
and the atmosphere packages that it's generating bindings for.

Longer term it might be better to make the bindings part of their
respective packages as optionals.
@github-actions

Mandatory Tests

Please make sure you run these tests via comment before you merge!

  • cscs-ci run default
  • cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

  • cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

  • cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

  • cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.
