Muphys jax mlir opt by dganellari · Pull Request #7 · dganellari/icon4py

dganellari · 2026-03-12T12:10:27Z

No description provided.


porting of advection least square coefficients for both sphere and torus and implemented `setup_program` throughout advection --------- Co-authored-by: Hannes Vogt <hannes@havogt.de> Co-authored-by: Rico Haeuselmann <r.haeuselmann@gmx.ch> Co-authored-by: Jacopo Canton <jacopo.canton@gmail.com>

…TOFFSET_DSL fields with torus grids (C2SM#1045) They were using the hardcoded config values for `thslp_zdiffu` and `thhgtd_zdiffu`. These values can now be customized in the `MetricsFieldsFactory`, and have been updated in the test definitions for GAUSS3D and WEISMAN_KLEMP. The xfails are removed from `test_metrics_factory.py`. The previously failing tests in `test_compute_diffusion_metrics.py` now pass. --------- Co-authored-by: Jacopo Canton <jacopo.canton@gmail.com>




…C2SM#1119) Combines C2SM#1118 and C2SM#1115, because both require `v3` serialized data:

- Move computation of `vertoffset_gradp` from `vertidx_gradp` to Python (index2offset)
- Move transpose (and decomposition) of `rbf_vec_coeff_v` and `rbf_vec_coeff_e` to Python

Additional:

- Add missing halo exchange to `compute_zdiff_gradp` (test failure triggered by new serialized data, most likely unrelated to the other changes)
- Refactor `kflip_wgtfacq` to a more general `flip` on fields
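The index-to-offset conversion mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the code from the PR: the function name `index2offset` is taken from the description, but its signature, the 0-based indexing, and the layout with the vertical axis last are all assumptions.

```python
import numpy as np


def index2offset(vert_idx: np.ndarray) -> np.ndarray:
    """Turn absolute vertical indices into offsets relative to each level.

    Assumes 0-based indices and that the last axis is the vertical one
    (both are assumptions of this sketch).
    """
    num_levels = vert_idx.shape[-1]
    k = np.arange(num_levels, dtype=vert_idx.dtype)
    # Broadcasting subtracts the level index k along the last axis.
    return vert_idx - k


# Hypothetical absolute indices for 2 cells and 4 levels.
vertidx = np.array([[0, 2, 2, 4], [1, 1, 3, 3]])
vertoffset = index2offset(vertidx)
```

An offset field of this kind can be used directly for relative vertical shifts, which is presumably why the description groups the conversion with the other Python-side preprocessing.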




@philip-paul-mueller

Adds gtfn_gpu backend to the distributed CI pipeline. dace_gpu is still left out because compilation takes too long.

The base image is upgraded because it's possible, but not strictly necessary. The CPU-only version of the pipeline needed 25.04 (24.04 and 25.10 did not work for various reasons). However, since OpenMPI and libfabric are now built manually in the container, the base image version is less of a constraint. 24.04 doesn't have matching GCC/CUDA versions and 26.04 doesn't exist yet, but the pipeline should eventually use 26.04.

OpenMPI and libfabric are built manually for Slingshot support because getting the Ubuntu repository packages to work with GPU support did not seem possible/easy. The installation is based on https://github.com/eth-cscs/cray-network-stack.

GHEX needs an upgrade, because there's a bug in how strides are calculated for GPU buffers. @philip-paul-mueller has already fixed this in ghex-org/GHEX#190 but we should wait for that to be merged (and probably test in icon-exclaim first).

This also fixes a few cupy/numpy incompatibilities. `revert_repeated_index_to_invalid` was updated to only deal with numpy for now as the connectivities are always numpy arrays. `test_halo_exchange_for_sparse_field` is marked `embedded_only`. The non-MPI test was already marked embedded-only.

This does not try to unify the default and distributed CI pipeline definitions. That should, however, be done sooner or later as well.

--------- Co-authored-by: Jacopo Canton <jacopo.canton@gmail.com> Co-authored-by: Nicoletta Farabullini <41536517+nfarabullini@users.noreply.github.com>

Plan is to tag this version and then branch to make further v0.1.x releases from a branch with selected commits. Greenline work will continue in main, blueline will stay on `v0.1.x` until we feel comfortable to get to the next version with all changes from main.

[PR#980](C2SM#980) introduced streams into the halo exchanges. For this it also added `DEFAULT_STREAM`, which models the default stream and implements the [CUDA Stream Protocol](https://nvidia.github.io/cuda-python/cuda-core/latest/interoperability.html#cuda-stream-protocol). However, the original implementation identified itself as protocol version `1` instead of version `0`. Because of a related bug in [GHEX](ghex-org/GHEX#202) this error was hidden. This PR fixes the Python implementation and also updates GHEX.
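For readers unfamiliar with the protocol, here is a minimal sketch of an object that exposes `__cuda_stream__` and reports protocol version `0`. The class name `DefaultStream` is hypothetical and this is not the icon4py implementation; using handle `0` to denote the default (null) stream is an assumption of the sketch.

```python
class DefaultStream:
    """Hypothetical stand-in for an object modelling the default CUDA stream."""

    def __cuda_stream__(self) -> tuple[int, int]:
        # The protocol returns (version, stream handle).
        # Version must be 0; handle 0 is assumed here to mean the default stream.
        return (0, 0)


DEFAULT_STREAM = DefaultStream()
version, handle = DEFAULT_STREAM.__cuda_stream__()
```

A consumer that validates the version field would reject an object reporting `1`, which is the kind of mismatch the fix above addresses.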



…1171) `test_diffusion.f90` and `test_dycore.f90` in `tools/tests/tools/py2fgen/fortran_samples/` are unused: they are only referenced by permanently-skipped tests that require connectivity data never passed from Fortran.

- **Deleted files:** `test_diffusion.f90` (384 lines), `test_dycore.f90` (851 lines)
- **Removed 4 skipped tests** from `test_cli.py`: `test_py2fgen_compilation_and_execution_{diffusion,diffusion_gpu,dycore,dycore_gpu}`
- **Kept:** `test_square.f90` and all active tests that use it

--------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: jcanton <5622559+jcanton@users.noreply.github.com>

The `graupel` SDFG looks like the following:

<img width="1889" height="1051" alt="image" src="https://github.com/user-attachments/assets/2c88af89-1b2d-40f5-9928-aa6e6698449b" />

In both maps there are outputs whose values are determined by if-statements that check whether one or more masks are activated. If they are not, the outputs are set to the inputs without any change. Since we know that the inputs and outputs are the same pointers, we can improve this pattern by removing the copies in the false branches of the if-statements and replacing the intermediate temporary `AccessNode`s with the global `AccessNode`s that are used as outputs of the program. To be more specific, the `AccessNode`s where this is applied are:

- `q_in_2` -> `q_out_2`
- `q_in_3` -> `q_out_3`
- `q_in_4` -> `q_out_4`
- `q_in_5` -> `q_out_5`
- `te` -> `t_out`

This is the updated SDFG:

<img width="1766" height="1136" alt="image" src="https://github.com/user-attachments/assets/3827fe87-10f0-4c33-98e3-12e78d9bbfed" />

--------- Co-authored-by: Edoardo Paone <edoardo.paone@cscs.ch> Co-authored-by: Hannes Vogt <hannes.vogt@cscs.ch> Co-authored-by: Philip Mueller, CSCS <philip.mueller@cscs.ch>
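The idea behind removing the false-branch copies can be illustrated with plain NumPy. This is a conceptual sketch only, not DaCe code and not the actual transformation: the function names and the doubling used as the "update" are placeholders.

```python
import numpy as np


def update_with_copy(q_in: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Out-of-place variant: the false branch copies the input unchanged."""
    q_out = np.empty_like(q_in)
    q_out[mask] = q_in[mask] * 2.0   # true branch: placeholder update
    q_out[~mask] = q_in[~mask]       # false branch: plain copy
    return q_out


def update_in_place(q: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """In-place variant: input and output alias, so the false branch is a no-op."""
    q[mask] *= 2.0                   # only masked entries are written
    return q


q = np.array([1.0, 2.0, 3.0, 4.0])
mask = np.array([True, False, True, False])

expected = update_with_copy(q, mask)
result = update_in_place(q.copy(), mask)
```

Both variants produce the same values, but the in-place one skips the reads and writes for unmasked entries, which is the saving the description is after when inputs and outputs share the same memory.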



- Removed duplicate `timeloop_diffusion_savepoint_exit` (driver) and `timeloop_diffusion_savepoint_exit_standalone` (standalone_driver) fixtures that were identical to the shared `savepoint_diffusion_exit` in `datatest.py`
- Added a small `linit` fixture alias in both driver and standalone_driver to bridge the parametrized `timeloop_diffusion_linit_exit` name to the `linit` name expected by the shared fixture

…re solver Profiled vertically_implicit_solver_at_predictor_step on MI300A (Beverin, gfx942). Individual kernels achieve 93% of HBM peak bandwidth. Enable fuse_tasklets for the solver stencil, giving ~7% improvement (0.82ms -> 0.76ms). Added per-kernel roofline script, C2E scatter analysis, and HIP/CUDA bandwidth benchmarks for cross-platform comparison. See amd_scripts/PROFILING_RESULTS.md for detailed findings.
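As a quick check of the quoted figure, the relative improvement can be recomputed from the two timings given above. The helper name `relative_improvement` is introduced here only for illustration.

```python
def relative_improvement(before_ms: float, after_ms: float) -> float:
    """Fraction of runtime saved when going from before_ms to after_ms."""
    return (before_ms - after_ms) / before_ms


# Timings quoted in the commit message: 0.82 ms -> 0.76 ms.
improvement = relative_improvement(0.82, 0.76)
improvement_percent = round(improvement * 100, 1)
```

The result is about 7.3%, consistent with the "~7%" stated in the message.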

`single_node_default` is ambiguous—it sits next to `single_node_reductions` in `definitions.py` but doesn't convey that it's an exchange runtime. Renamed to `single_node_exchange` to match its type (`SingleNodeExchange`) and mirror the naming of its sibling. Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: jcanton <5622559+jcanton@users.noreply.github.com>

The actual Fortran bindings of diffusion and dycore were part of the tools/py2fgen package. However, py2fgen is actually a standalone tool. We introduce a new package `icon4py.bindings` which depends on py2fgen and the atmosphere packages that it's generating bindings for. Longer term it might be better to make the bindings part of their respective packages as optionals.


github-actions · 2026-04-15T13:39:55Z

Mandatory Tests

Please make sure you run these tests via comment before you merge!

cscs-ci run default
cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.

dganellari and others added 30 commits January 24, 2026 15:28

merged with muphus bug fix

5274ffb

update gt4py version

be4cee4

switch gt4py branch

6ecff32

update uv lock

1c9c744

edit import metrics

a1e753f

switch gt4py branch

b45b9b1

edit import metrics

517d122

edit import metrics

672b4f0

Move ci-mpi-wrapper.sh script to ci subdirectory (C2SM#1024)

4384db7

Merge branch 'main' into update_dace_version

9b2662d

Update serialized data (C2SM#1004)

8fe545a

Re-structuring of the (experiments and grids) serialized data generation and download_extraction --------- Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi> Co-authored-by: Hannes Vogt <hannes@havogt.de>

Update DaCe version

532c125

Update the gt4py commit

991b6b8

Initial amd notes and scripts

f194d83

Pre-compilation fix with_backend

1eb4708

Fixes to the notes

30fe86c

Additional comments in the scripts

4d13d82

Fix gtx_metrics

81e7a24

Clean up setup script

47e5e48

Move scripts in amd_scripts and renamed instructions' file

cfc5d89

Added quickstart guide

adae364

Added goals section

d7a6aa2

Added note about scratch directory

8ed9403

Use revised with_compilation_option naming

ffc0d51

Merge remote-tracking branch 'origin/update_dace_version_pre_compile_…

6ded3a9

…fix' into amd_profiling

Cleaned up scripts

634ddfe

Edited notes of instructions

31271fe

Fix GT4PY_BUILD_CACHE_DIR in solver script

f450589

dganellari and others added 29 commits March 20, 2026 15:51

update log

ebbd129

update log

5821c4e

update log

c6026d4

cleanup

286e170

Add --skip-compilation flag to py2fgen (C2SM#1120)

62c7171

CI: Use node sharing (C2SM#819)

a60f2db

Co-authored-by: Edoardo Paone <edoardo.paone@cscs.ch> Co-authored-by: Mikael Simberg <mikael.simberg@cscs.ch>

Use only a single GPU/numa node for distributed tests (C2SM#1121)

4a3ecaa

Since tests are currently serialized anyway, I think we don't benefit from running across the full node. This uses MPS to run all four ranks on a single GPU. This is based on C2SM#819.

blueline: default log level, wait_for_compilation, cleanups (C2SM#1122)

fb1e5ad

- set default log level in py2fgen runtime to WARNING
- add `ICON4PY_WAIT_FOR_COMPILATION` option to wait until granule inits have finished JIT compiling
- remove an unused parameter

Update to GT4Py v1.1.8: adapt type hints (and ignores) (C2SM#1096)

ef5717a

Reduce blanket type ignores at the price of adding a handful of specific ones. --------- Co-authored-by: Hannes Vogt <vogt@hey.com>

Make iau runtime (C2SM#972)

c18cbe0

Muphys: Refactor setup of graupel program (C2SM#1124)

699c127

Make usage of `setup_graupel()` consistent across multiple places.

Remove dace orchestration from diffusion (C2SM#1162)

a851180

The orchestration is not used and not tested. Moreover, the orchestration.decorator imports mpi.MPI, which does an MPI_Init (e.g. when generating bindings with py2fgen).

blueline: more cleanups (C2SM#1131)

1660d0e

- delete tools/common (there was only py2fgen left which had its own setup_logger)
- default setup_logger is WARNING

cleanup 3 unused vars from solve_nh_init (C2SM#1169)

f7f2f48

Removes `rayleigh_coeff`, `divdamp_trans_start`, and `divdamp_trans_end`; also removes `nudging_decay_rate` in `DiffusionConfig`.

Update GT4Py to v1.1.9 (C2SM#1187)

8fafac6

Co-authored-by: Edoardo Paone <edoardo.paone@cscs.ch>

merged with origin

2266c6f

add docs

34611ee

Merge pull request #12 from dganellari/amd_profiling

a834b24

Amd profiling

Merge branch 'main' into muphys_jax_mlir_opt

b5c1e40


dganellari wants to merge 170 commits into muphys_bug_fix from muphys_jax_mlir_opt


17 participants
