Add persistent gt4py cache for CSCS CI #1218
|
The GitHub Actions cache for gt4py seems to work nicely. I've yet to try the CSCS CI cache. |
|
cscs-ci run distributed |
|
cscs-ci run distributed |
|
cscs-ci run distributed |
|
This should benefit significantly from GridTools/gt4py#2565. The number of files on scratch should become a much smaller concern (and it possibly allows using per-branch caches, though that still needs measurement). |
philip-paul-mueller
left a comment
What I am a bit concerned about is the case where you work on GT4Py, i.e. your PR does not touch the ICON4Py stencil code but still affects it.
I know the solution is simply to disable the cache, but there should be an explanation somewhere of how to do that.
I agree. A simple solution might be to add the gt4py version to the key. PRs not touching gt4py will use the same cache as main, while PRs touching gt4py will get their own directory. But we need to double-check that the gt4py version is correctly resolved (i.e. with the dev suffix if it's not a release version). |
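A minimal sketch of the "gt4py version in the key" idea, in Python. The function name `gt4py_key_component` and the example version strings are hypothetical and not taken from this PR. The sketch only shows how a version string (release or dev) could be turned into a directory-safe key component, so that a dev version ends up in its own directory.

```python
import re


def gt4py_key_component(version: str) -> str:
    """Turn a version string into a directory-safe cache key component.

    Hypothetical helper: characters outside [A-Za-z0-9._-] (for example the
    '+' in a local/dev version) are replaced by '_'.
    """
    return "gt4py-" + re.sub(r"[^A-Za-z0-9._-]", "_", version)


# Made-up version strings, for illustration only.
release_key = gt4py_key_component("1.0.10")
dev_key = gt4py_key_component("1.0.11.dev5+g1a2b3c4")
print(release_key)
print(dev_key)
```

In a real job the version string could presumably come from `importlib.metadata.version("gt4py")` inside the job's environment. Whether that reports the dev suffix correctly is exactly the point that still needs double-checking.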
Yeah, that's a valid concern. How well do you think gt4py itself will invalidate caches when gt4py is changed? The "weekly fresh cache" slightly helps protect against this. A per-branch cache might help a bit more? But yeah, if one really wants a fresh cache then it would (at the moment) mean disabling the scratch cache for your PR. Alternatively, we can think about allowing a specific variable to be set that disables the cache (cscs-ci run allows passing extra env vars that will be forwarded to the job). Ideally this would work as similarly as possible to the GitHub Actions caching (those caches can also be explicitly deleted, manually). File count was my main concern, but it might not even be a real concern. |
|
cscs-ci run default;GT4PY_BUILD_CACHE_LIFETIME=session;ICON4PY_CI_WIPE_GT4PY_CACHE=true |
|
cscs-ci run default |
375ebec to 247f818
|
cscs-ci run default |
247f818 to 6cd8b9f
|
cscs-ci run default |
|
|
||
| echo "Using GT4PY_BUILD_CACHE_DIR=${GT4PY_BUILD_CACHE_DIR}" | ||
|
|
||
| # TODO: This is here just for debugging, probably remove? |
I don't know. On the one hand, I find it useful to be able to easily wipe the build cache (which I guess needs to be done by the CI user) without relying on the scratch cleanup policy. On the other hand, we would expose a variable that deletes things from the filesystem, so we introduce the risk of CI users deleting folders by mistake.
Yeah, I'm undecided. I think ideally if one needs a fresh cache then one should just set cache lifetime to session, but I'm sure we'll sooner or later bump into a situation where we just need to start over with the cache. It's easily added again later. In the current version it's removed again.
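If an explicit wipe option is added again later, the risk of deleting the wrong folder could be reduced with a guard like the one sketched below. This is a Python illustration under assumptions, not the CI script from this PR. The names `wipe_cache` and `allowed_root` are hypothetical. The idea is to refuse to delete anything that is not strictly inside an expected cache root.

```python
import shutil
from pathlib import Path


def wipe_cache(cache_dir: Path, allowed_root: Path) -> bool:
    """Delete cache_dir only if it lies strictly inside allowed_root.

    Returns True if a directory was removed, False if there was nothing
    to remove. Raises ValueError for paths outside the allowed root.
    """
    cache = cache_dir.resolve()
    root = allowed_root.resolve()
    # Refuse the root itself and anything that is not below it.
    if cache == root or root not in cache.parents:
        raise ValueError(f"refusing to delete {cache}: not inside {root}")
    if not cache.is_dir():
        return False
    shutil.rmtree(cache)
    return True
```

Resolving both paths first means that a value such as `root/../somewhere_else` is rejected rather than followed.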
|
cscs-ci run default |
|
cscs-ci run default |
|
cscs-ci run distributed |
|
Mandatory Tests Please make sure you run these tests via comment before you merge!
Optional Tests To run benchmarks you can use:
To run tests and benchmarks with the DaCe backend you can use:
To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:
For more detailed information please look at CI in the EXCLAIM universe. |
|
cscs-ci run distributed |
|
cscs-ci run default |
|
@edopao I'd probably go ahead with the current version to give it some real-world time on other PRs. If you think it'd still be useful to have an explicit "wipe the cache" option I don't mind adding that. Of course let me know if you have other concerns. For reference, the iopsstor scratch of svc_cwci02_cicd_ext now has ~100G space and 2M files used. It's quite a lot of files, but there's no quota on either at the moment. If I remember correctly the cleanup policy is 2 or 3 weeks, which is good enough for our use case (it's wiped after a week by us anyway). |
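To keep an eye on the file count and size of a cache directory over time, a small measurement helper could look like the Python sketch below. The function name `cache_usage` is hypothetical. It simply walks a directory tree and counts regular files and their sizes. On a filesystem with millions of files this walk will itself be slow, so it is meant for occasional checks only.

```python
import os
from pathlib import Path


def cache_usage(root: Path) -> tuple[int, int]:
    """Return (number of files, total size in bytes) below root."""
    n_files = 0
    n_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks so that link targets are not counted twice.
            if os.path.islink(path):
                continue
            n_files += 1
            n_bytes += os.path.getsize(path)
    return n_files, n_bytes
```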
| assert pytest.approx(field) == field_ref | ||
| else: | ||
| if model_backends.is_cpu_backend(backend) and test_utils.is_dace(backend): | ||
| if test_utils.is_dace(backend) and ( |
I was not aware of this special case. We need to follow up on this next time we upgrade the gt4py version.
* origin/main:
  Add persistent gt4py cache for CSCS CI (#1218)
  Populate uv cache in CSCS CI checkout job (#1271)
  Remove defaulted exchange parameters (#1261)
  Fix `DEFAULT_RBF_KERNEL` to use `InterpolationKernel` values (#1265)
  Fix gtfn codegen for `_compute_rayleigh_w` by using int literals (#1266)
  Clean up global grid parameters and grid shape (#1183)
  Check completion marker on downloads before taking lock (#1267)
# Conflicts:
#   model/common/src/icon4py/model/common/interpolation/interpolation_factory.py
#   model/common/src/icon4py/model/common/metrics/metrics_factory.py
#   model/common/tests/common/interpolation/unit_tests/test_interpolation_fields.py
#   model/testing/src/icon4py/model/testing/definitions.py
#   model/testing/src/icon4py/model/testing/fixtures/benchmark.py
Fixes #1133.
Sets up a persistent cache for CSCS CI pipelines. There's a weekly cache shared across all jobs: each week a new cache directory, named after the year and week number, is populated. The idea is that compilation from scratch into a new cache is triggered occasionally, but not for every job.
The cache is shared between branches that have the same uv.lock, to avoid having too many files on scratch from every branch writing a separate cache, as in #1220.
The caches are separated by uv.lock hash to avoid sharing a cache when gt4py or another important dependency changes.
The caches are separated by job because .venvs are different for each nox session. This avoids issues with hardcoded paths in compile_commands.json.
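The key scheme described above (week, uv.lock hash, job) can be illustrated with a short Python sketch. This is not the actual CI configuration from this PR. The function name `weekly_cache_dir`, the directory layout, the use of the ISO year and week, and the truncated SHA-256 hash are all assumptions made for the example.

```python
import datetime
import hashlib
from pathlib import Path


def weekly_cache_dir(
    base: Path, uv_lock: Path, job_name: str, today: datetime.date
) -> Path:
    """Build a cache path of the form base/<year>-W<week>/<lock hash>/<job>."""
    iso = today.isocalendar()
    # Assumption: ISO year and ISO week number identify "the week".
    year, week = iso[0], iso[1]
    # Any change to uv.lock (e.g. a gt4py bump) yields a different directory.
    lock_hash = hashlib.sha256(uv_lock.read_bytes()).hexdigest()[:16]
    # One directory per job, because each nox session has its own .venv.
    return base / f"{year}-W{week:02d}" / lock_hash / job_name
```

Two runs of the same job in the same week with the same uv.lock resolve to the same directory. A new week, a changed uv.lock, or a different job name each give a fresh directory.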