Add persistent gt4py cache for CSCS CI #1218

Merged
msimberg merged 34 commits into
C2SM:main from
msimberg:ci-persistent-cache
May 19, 2026

Conversation

@msimberg
Contributor

@msimberg msimberg commented Apr 24, 2026

Fixes #1133.

Sets up a persistent cache for the CSCS CI pipelines. There's a weekly cache shared across all jobs: each week a new cache directory is populated, named after the year and week number. The idea is that we occasionally trigger compilation from scratch into a new cache, but not for every job. The cache is shared between branches that have the same uv.lock, to avoid having too many files on scratch from every branch writing a separate cache, as in #1220. The caches are separated by uv.lock hash to avoid sharing a cache when gt4py or another important dependency changes. The caches are also separated by job because the .venvs are different for each nox session; this avoids issues with hardcoded paths in compile_commands.json.
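
A minimal sketch of how a cache directory separated by week, uv.lock hash, and job could be derived. This is an illustration only, not the script added in this PR: the function name, base path, job name, and the choice of SHA-256 with a 16-character prefix are all assumptions.

```python
import datetime
import hashlib
from pathlib import Path


def gt4py_cache_dir(base: Path, lock_contents: bytes, job_name: str,
                    today: datetime.date) -> Path:
    """Build a cache path separated by ISO week, uv.lock hash, and job (hypothetical layout)."""
    iso = today.isocalendar()  # (ISO year, ISO week number, weekday)
    week_tag = f"{iso[0]}-W{iso[1]:02d}"
    # A short hash prefix keeps the path readable; the length is an arbitrary choice.
    lock_tag = hashlib.sha256(lock_contents).hexdigest()[:16]
    return base / week_tag / lock_tag / job_name


example = gt4py_cache_dir(
    Path("/scratch/gt4py-cache"),   # hypothetical base directory
    b"example uv.lock contents",    # in CI this would be the bytes of uv.lock
    "test-model-dycore",            # hypothetical job name
    datetime.date(2026, 4, 24),
)
print(example)
```

With a layout like this, a new week or a changed uv.lock yields a path that does not exist yet, so the first job to use it compiles from scratch and later jobs reuse the result.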

@msimberg msimberg changed the title from "Ci persistent cache" to "Reuse gt4py cache in CI" Apr 24, 2026
@msimberg
Contributor Author

The GitHub Actions cache for gt4py seems to work nicely. A first run of test-model (3.11, gtfn_cpu, dycore) without a cache took 27 minutes; the second run, with a cache, took 10 minutes.

I've yet to try the CSCS CI cache.

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg msimberg changed the title from "Reuse gt4py cache in CI" to "Add persistent gt4py cache for CSCS CI" Apr 27, 2026
Comment thread ci/default.yml Outdated
Comment thread ci/default.yml Outdated
Comment thread ci/distributed.yml Outdated
@msimberg
Contributor Author

This should benefit significantly from GridTools/gt4py#2565. With that change, the number of files on scratch becomes much less of a concern, and it might even allow per-branch caches (this still needs to be measured).

Collaborator

@philip-paul-mueller philip-paul-mueller left a comment

What I am a bit concerned about is the case where you work on GT4Py, i.e. your PR does not touch the ICON4Py stencil code but still affects it.
I know the solution is simply to disable the cache, but there should be an explanation somewhere of how to do that.

Comment thread ci/scripts/gt4py-cache.sh Outdated
@havogt
Contributor

havogt commented Apr 30, 2026

> What I am a bit concerned about is the case where you work on GT4Py, i.e. your PR does not touch the ICON4Py stencil code but still affects it. I know the solution is simply to disable the cache, but there should be an explanation somewhere of how to do that.

I agree. A simple solution might be to add the gt4py version to the key. PRs not touching gt4py will use the same cache as main, while PRs touching gt4py will get their own directory. But we need to double-check that the gt4py version is correctly resolved (i.e. with the dev suffix if it's not a release version).
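
A rough sketch of what folding the installed version into the key might look like. The helper names and the example version string are hypothetical, and the dev-suffix check is a simple string heuristic based on PEP 440 (`.devN` segments and `+local` parts), not something gt4py provides. Whether the version reported for a non-release install actually carries such a suffix is exactly what would need checking.

```python
import re
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed distribution version, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def is_dev_version(version: str) -> bool:
    """Heuristic: PEP 440 development releases contain '.dev', local versions contain '+'."""
    return ".dev" in version or "+" in version


def version_key(version: str) -> str:
    """Make a version string safe to use as a single path component."""
    return re.sub(r"[^A-Za-z0-9._-]", "_", version)


# Hypothetical version string, used only to show the transformation.
print(version_key("1.1.0.dev3+g1234abc"))
# In a job, the real value would come from installed_version("gt4py").
```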

@msimberg
Contributor Author

> What I am a bit concerned about is the case where you work on GT4Py, i.e. your PR does not touch the ICON4Py stencil code but still affects it. I know the solution is simply to disable the cache, but there should be an explanation somewhere of how to do that.

Yeah, that's a valid concern. How well do you think gt4py itself will invalidate caches when it itself is changed? The "weekly fresh cache" slightly helps protect against this. A per-branch cache might help a bit more? But yeah, if one really wants a fresh cache then it would (at the moment) mean disabling the scratch cache for your PR, or we can think about also allowing a specific variable to be set to disable the cache (cscs-ci run allows passing extra env vars that will be forwarded to the job). Ideally this would look as close as possible to what the GitHub Actions caching does (those caches can also be explicitly deleted, manually). File count was my main concern, but it might not even be a real concern.
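
As an illustration of the opt-out idea, a job setup step could leave the persistent cache alone whenever the forwarded environment already asks for a session-lifetime cache. This is a sketch under assumptions, not the behaviour of the script in this PR: the function name is hypothetical, and it assumes `GT4PY_BUILD_CACHE_LIFETIME` takes the values `session` and `persistent`.

```python
from typing import Mapping


def gt4py_cache_settings(env: Mapping[str, str], persistent_dir: str) -> dict:
    """Return the cache variables a job would export (hypothetical helper)."""
    if env.get("GT4PY_BUILD_CACHE_LIFETIME", "").lower() == "session":
        # Opt-out requested via a forwarded env var: do not point at scratch.
        return {"GT4PY_BUILD_CACHE_LIFETIME": "session"}
    return {
        "GT4PY_BUILD_CACHE_LIFETIME": "persistent",
        "GT4PY_BUILD_CACHE_DIR": persistent_dir,
    }


# Hypothetical scratch path, for illustration only.
print(gt4py_cache_settings({}, "/scratch/gt4py-cache/2026-W19"))
print(gt4py_cache_settings({"GT4PY_BUILD_CACHE_LIFETIME": "session"},
                           "/scratch/gt4py-cache/2026-W19"))
```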

@msimberg
Contributor Author

msimberg commented May 7, 2026

cscs-ci run default;GT4PY_BUILD_CACHE_LIFETIME=session;ICON4PY_CI_WIPE_GT4PY_CACHE=true

@msimberg
Contributor Author

msimberg commented May 7, 2026

cscs-ci run default

@msimberg msimberg force-pushed the ci-persistent-cache branch from 375ebec to 247f818 May 8, 2026 13:57
@msimberg
Contributor Author

msimberg commented May 8, 2026

cscs-ci run default

@msimberg msimberg force-pushed the ci-persistent-cache branch from 247f818 to 6cd8b9f May 8, 2026 14:05
@msimberg
Contributor Author

msimberg commented May 8, 2026

cscs-ci run default

Comment thread ci/scripts/gt4py-cache.sh Outdated

echo "Using GT4PY_BUILD_CACHE_DIR=${GT4PY_BUILD_CACHE_DIR}"

# TODO: This is here just for debugging, probably remove?
Contributor

I don't know. On the one hand, I find it useful to be able to easily wipe the build cache (which I guess needs to be done by the CI user) without relying on the scratch cleanup policy. On the other hand, we would expose a variable that deletes stuff from the filesystem, so we introduce the risk of CI users deleting folders by mistake.

Contributor Author

Yeah, I'm undecided. I think ideally, if one needs a fresh cache, one should just set the cache lifetime to session, but I'm sure we'll sooner or later bump into a situation where we just need to start over with the cache. It's easy to add again later. In the current version I've removed it again.
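
If a wipe option is added back later, one way to limit the risk of deleting the wrong folder is to refuse any path that is not strictly inside a known cache root. A minimal sketch, with hypothetical names and paths; it is not taken from the removed debugging code.

```python
import shutil
import tempfile
from pathlib import Path


def wipe_cache(cache_dir: Path, allowed_root: Path) -> bool:
    """Delete cache_dir only if it lies strictly inside allowed_root."""
    target = cache_dir.resolve()
    root = allowed_root.resolve()
    if target == root or root not in target.parents:
        return False  # refuse the root itself and anything outside it
    shutil.rmtree(target, ignore_errors=True)
    return True


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "gt4py-cache"
    job_cache = root / "2026-W19" / "some-job"  # hypothetical layout
    job_cache.mkdir(parents=True)
    print(wipe_cache(job_cache, root))  # inside the root, so it is removed
    print(wipe_cache(Path(tmp), root))  # outside the root, so it is refused
```

Resolving both paths first means a symlink or a `..` segment in the configured directory cannot point the deletion outside the root.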

@msimberg
Contributor Author

cscs-ci run default

@msimberg
Contributor Author

cscs-ci run default

@msimberg
Contributor Author

cscs-ci run distributed

@github-actions

Mandatory Tests

Please make sure you run these tests via comment before you merge!

  • cscs-ci run default
  • cscs-ci run distributed

Optional Tests

To run benchmarks you can use:

  • cscs-ci run benchmark-bencher

To run tests and benchmarks with the DaCe backend you can use:

  • cscs-ci run dace

To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:

  • cscs-ci run extra

For more detailed information please look at CI in the EXCLAIM universe.

@msimberg
Contributor Author

cscs-ci run distributed

@msimberg msimberg marked this pull request as ready for review May 18, 2026 16:03
@msimberg
Contributor Author

cscs-ci run default

@msimberg msimberg requested a review from edopao May 19, 2026 07:09
@msimberg
Contributor Author

@edopao I'd probably go ahead with the current version to give it some real-world time on other PRs. If you think it'd still be useful to have an explicit "wipe the cache" option I don't mind adding that. Of course let me know if you have other concerns.

For reference, the iopsstor scratch of svc_cwci02_cicd_ext now has ~100G space and 2M files used:

| /iopsstor/scratch/cscs/svc_cwci02_cicd_ext | LUSTRE |  97.1G |    - |      - |            - | 2160954 |    - |    -     |           - |

It's quite a lot of files, but there's no quota on either space or file count at the moment. If I remember correctly, cleanup happens after 2 or 3 weeks, which is good enough for our use case (we wipe the cache after a week anyway).
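
For measuring a single cache directory rather than the whole scratch area, file count and total size can be gathered with a plain directory walk. This is a generic sketch with a hypothetical function name, not the tool that produced the numbers above, and walking millions of files on a parallel filesystem can be slow.

```python
import os
import tempfile
from pathlib import Path


def cache_usage(root: Path) -> tuple:
    """Return (file count, total bytes) for the files below root."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size
            except OSError:
                continue  # file vanished while walking, e.g. a concurrent job
            count += 1
    return count, total


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sub").mkdir()
    (Path(tmp) / "a.bin").write_bytes(b"12345")
    (Path(tmp) / "sub" / "b.bin").write_bytes(b"123")
    print(cache_usage(Path(tmp)))
```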

Contributor

@edopao edopao left a comment

LGTM.

assert pytest.approx(field) == field_ref
else:
if model_backends.is_cpu_backend(backend) and test_utils.is_dace(backend):
if test_utils.is_dace(backend) and (
Contributor

I was not aware of this special case. We need to follow up on this next time we upgrade the gt4py version.

@msimberg msimberg merged commit 9daa9db into C2SM:main May 19, 2026
54 checks passed
@msimberg msimberg deleted the ci-persistent-cache branch May 19, 2026 14:10
jcanton added a commit that referenced this pull request May 20, 2026
* main:
  Add persistent gt4py cache for CSCS CI (#1218)
  Populate uv cache in CSCS CI checkout job (#1271)
  Remove defaulted exchange parameters (#1261)
jcanton added a commit that referenced this pull request May 20, 2026
* origin/main:
  Add persistent gt4py cache for CSCS CI (#1218)
  Populate uv cache in CSCS CI checkout job (#1271)
  Remove defaulted exchange parameters (#1261)
  Fix `DEFAULT_RBF_KERNEL` to use `InterpolationKernel` values (#1265)
  Fix gtfn codegen for `_compute_rayleigh_w` by using int literals (#1266)
  Clean up global grid parameters and grid shape (#1183)
  Check completion marker on downloads before taking lock (#1267)

# Conflicts:
#	model/common/src/icon4py/model/common/interpolation/interpolation_factory.py
#	model/common/src/icon4py/model/common/metrics/metrics_factory.py
#	model/common/tests/common/interpolation/unit_tests/test_interpolation_fields.py
#	model/testing/src/icon4py/model/testing/definitions.py
#	model/testing/src/icon4py/model/testing/fixtures/benchmark.py

Development

Successfully merging this pull request may close these issues.

Set up a persistent cache across jobs for CI

4 participants