Implement a NUMERICALEARTH_DATA mode for constructing data manifests that can be downloaded by simone-silvestri · Pull Request #278 · NumericalEarth/NumericalEarth.jl

simone-silvestri · 2026-05-22T11:43:09Z

This PR should resolve #143.

The idea is that an environment variable, NUMERICALEARTH_DATA, controls the execution mode of a script. Three modes determine how data downloads are handled:

auto: same as now; any dataset that is not present locally is downloaded when it is needed. This is one of the "production" modes.
existing: introduces a guard that errors if the data is not available locally. This is the second "production" mode, meant for clusters where downloading from servers is not possible.
build:/path/to/DataManifest.toml: this is a "dry run" mode. The script is parsed instruction by instruction, and /path/to/DataManifest.toml is populated with all the information on the data required to run the script. The data specified in /path/to/DataManifest.toml can then be downloaded with download_datasets("/path/to/DataManifest.toml").
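
As a rough illustration of the three modes, here is a minimal sketch of how the value of NUMERICALEARTH_DATA could be turned into a mode object. The type names (AutoMode, ExistingMode, BuildMode) and the functions parse_data_mode and current_data_mode are hypothetical and are not taken from the PR; only the variable name and the three accepted values come from the description above.

```julia
# Hypothetical sketch: these types and functions are illustrative, not the PR's API.
abstract type DataMode end

struct AutoMode <: DataMode end       # download missing data on demand
struct ExistingMode <: DataMode end   # error if data is not available locally
struct BuildMode <: DataMode          # dry run that records required data
    manifest_path :: String
end

function parse_data_mode(value::AbstractString)
    value = strip(value)

    # An empty value is treated like "auto" (an assumption of this sketch).
    (isempty(value) || value == "auto") && return AutoMode()
    value == "existing" && return ExistingMode()

    if startswith(value, "build:")
        # Split only once so that paths containing ':' survive intact.
        path = String(last(split(value, ':'; limit=2)))
        isempty(path) && throw(ArgumentError("build mode needs a manifest path"))
        return BuildMode(path)
    end

    throw(ArgumentError("unrecognized NUMERICALEARTH_DATA value: \"$value\""))
end

# Read the mode from the environment, falling back to "auto" when unset.
current_data_mode() = parse_data_mode(get(ENV, "NUMERICALEARTH_DATA", "auto"))
```

Representing each mode as its own type would let the download code dispatch on the mode rather than compare strings, but that is a design choice of this sketch, not something the PR states.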

The auto and existing modes are fairly straightforward; the bulk of the implementation deals with the "dry run" mode.
I have tried to keep the implementation as general as possible and to touch the existing source code as little as possible, so that we will need no bespoke features when implementing new datasets. The only contract that a new dataset has to obey is a registration step in its init method (see the changes for all the specific datasets).

The new module DataModes holds basically all of the implementation. It parses through a script, recursing into include statements, and wraps all calls in try ... catch statements. The script executes even without the data being downloaded, because every catch is forwarded to a bespoke DryRunValue object that survives most operations. Once a Metadata (or MetadataSet) statement is reached, the observe_metadata method in the inner constructor records the use of the metadata in /path/to/DataManifest.toml.
By default the manifest is built from scratch with overwrite_existing=true. It is possible to append to an existing .toml file with the overwrite_existing=false kwarg.
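
To make the mechanism concrete, here is a simplified sketch of the idea, not the PR's code. The name DryRunValue comes from the description above, but its behavior here is assumed; guard and dry_run are hypothetical helpers. The sketch wraps only top-level statements (the PR wraps all calls) and does not recurse into include statements.

```julia
# Assumed behavior: calling, indexing, or accessing a property of the
# placeholder yields another placeholder, so later statements keep running.
struct DryRunValue end

(::DryRunValue)(args...; kwargs...) = DryRunValue()
Base.getproperty(::DryRunValue, ::Symbol) = DryRunValue()
Base.getindex(::DryRunValue, args...) = DryRunValue()

# Hypothetical helper: wrap one top-level expression in try/catch.
function guard(ex)
    # `using` and `import` must remain plain top-level statements.
    ex isa Expr && ex.head in (:using, :import) && return ex

    # Hoist simple assignments out of the `try` so the variable is assigned
    # in the enclosing (global) scope rather than inside the try block.
    if ex isa Expr && ex.head == :(=) && ex.args[1] isa Symbol
        lhs, rhs = ex.args
        return quote
            $lhs = try
                $rhs
            catch
                $(DryRunValue)()
            end
        end
    end

    return quote
        try
            $ex
        catch
            $(DryRunValue)()
        end
    end
end

# Hypothetical driver: parse a script and evaluate each guarded statement
# in a fresh module.
function dry_run(code::AbstractString, mod::Module = Module())
    ast = Meta.parseall(code)
    for ex in ast.args
        ex isa LineNumberNode && continue
        Core.eval(mod, guard(ex))
    end
    return mod
end
```

For example, dry_run("a = error(\"no data\")\nb = a.grid[1]") leaves both a and b bound to a DryRunValue in the returned module instead of aborting at the first statement.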

An example is:

(base) simonesilvestri@Mac NumericalEarth.jl %   NUMERICALEARTH_DATA="build:MyDataManifest.toml" julia --project examples/one_degree_simulation.jl
ERROR: LoadError: ArgumentError: We cannot make a GPU with the CUDA backend:
a CUDA GPU was not found!
Stacktrace:
 [1] GPU()
   @ OceananigansCUDAExt ~/.julia/packages/Oceananigans/8cYE4/ext/OceananigansCUDAExt.jl:56
 [2] top-level scope
   @ ~/development/temp/NumericalEarth.jl/examples/one_degree_simulation.jl:24
 [3] include(mod::Module, _path::String)
   @ Base ./Base.jl:306
 [4] exec_options(opts::Base.JLOptions)
   @ Base ./client.jl:317
 [5] _start()
   @ Base ./client.jl:550
in expression starting at /Users/simonesilvestri/development/temp/NumericalEarth.jl/examples/one_degree_simulation.jl:24
[ Info: Loading cached bathymetry from /Users/simonesilvestri/.julia/scratchspaces/904d977b-046a-4731-8b86-9235c0d1ef02/bathymetry_cache/bathymetry_360x180_0.0_360.0_-85.22387615721567_90.0_b841e80e.jld2
[ Info: We've built an ocean simulation with model:
ocean.model = DryRunValue()
┌ Info: NUMERICALEARTH_DATA=build: wrote manifest via AST trace
│   path = "MyDataManifest.toml"
└   script = "/Users/simonesilvestri/development/temp/NumericalEarth.jl/examples/one_degree_simulation.jl"

with the MyDataManifest.toml containing

[[metadatum]]
filename = "ETOPO_2022_v1_60s_N90W180_surface.nc"
variable_name = "bottom_height"
dataset = "ETOPO2022"
[[metadatum]]
filename = "THETA_1993_01.nc"
date = 1993-01-01T00:00:00.000Z
variable_name = "temperature"
dataset = "ECCO4Monthly"
[[metadatum]]
filename = "SALT_1993_01.nc"
date = 1993-01-01T00:00:00.000Z
variable_name = "salinity"
dataset = "ECCO4Monthly"
[[metadatum]]
filename = "SIheff_1993_01.nc"
date = 1993-01-01T00:00:00.000Z
variable_name = "sea_ice_thickness"
dataset = "ECCO4Monthly"
[[metadatum]]
filename = "SIarea_1993_01.nc"
date = 1993-01-01T00:00:00.000Z
variable_name = "sea_ice_concentration"
dataset = "ECCO4Monthly"

[[metadataset]]
date = 1993-01-01T00:00:00.000Z
variable_names = ["temperature", "salinity", "sea_ice_thickness", "sea_ice_concentration"]
dataset = "ECCO4Monthly"

[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.uas.1990_1991.nc"
variable_name = "eastward_velocity"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.vas.1990_1991.nc"
variable_name = "northward_velocity"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.tas.1990_1991.nc"
variable_name = "temperature"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.huss.1990_1991.nc"
variable_name = "specific_humidity"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.psl.1990_1991.nc"
variable_name = "sea_level_pressure"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.prra.1990_1991.nc"
variable_name = "rain_freshwater_flux"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.prsn.1990_1991.nc"
variable_name = "snow_freshwater_flux"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.rsds.1990_1991.nc"
variable_name = "downwelling_shortwave_radiation"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.rlds.1990_1991.nc"
variable_name = "downwelling_longwave_radiation"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T00:00:00.000Z
filename = "RYF.friver.1990_1991.nc"
variable_name = "river_freshwater_flux"
dataset = "RepeatYearJRA55"

simone-silvestri · 2026-05-22T11:46:34Z

I think that, to solve the test issue, we can trace a DataManifest.toml in the test folder and add a CI pipeline that dry runs the runtests.jl file and makes sure that the DataManifest.toml captures all the data that is needed. If it does not, the test errors and requests that the missing metadata be added. It can also check whether the fallbacks are available in NumericalEarthArtifacts and, if not, request that they be added. What do you think @glwagner @giordano?

glwagner · 2026-05-22T11:51:04Z

@simone-silvestri a side note: I would like to change the prescribed atmospheres to use MetadataSet

simone-silvestri · 2026-05-22T11:52:11Z

A flaw in this design is that even in "dry mode" the script is executed, and unless we hit an error it will keep executing every command it can. So if all the data is already stored locally, a "dry mode" run is actually a production run and will never be as fast as we want.

simone-silvestri · 2026-05-22T11:52:26Z

> @simone-silvestri a side note: I would like to change the prescribed atmospheres to use MetadataSet

Yeah that would be great

glwagner · 2026-05-22T11:53:06Z

> MyDataManifest.toml

Should we call this DataManifest.toml and forbid custom names, like Project.toml etc? That way there is one data manifest per directory.

Also in terms of file --- should we organize by dataset rather than the outer object (metadata, metadatum, metadataset)? I think this might be better. Ultimately it does not matter how the data is requested
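
To illustrate what organizing by dataset could look like, here is a small sketch that regroups manifest entries by their dataset key, regardless of whether they were recorded under metadatum or metadata. The helper group_by_dataset is hypothetical, the resulting layout is only one possibility, and the embedded manifest is a trimmed copy of the example above with the date fields left out.

```julia
using TOML

# Trimmed copy of the example manifest (dates omitted for brevity).
manifest = TOML.parse("""
[[metadatum]]
filename = "ETOPO_2022_v1_60s_N90W180_surface.nc"
variable_name = "bottom_height"
dataset = "ETOPO2022"

[[metadatum]]
filename = "THETA_1993_01.nc"
variable_name = "temperature"
dataset = "ECCO4Monthly"

[[metadata]]
filename = "RYF.uas.1990_1991.nc"
variable_name = "eastward_velocity"
dataset = "RepeatYearJRA55"
""")

# Hypothetical helper: collect filenames per dataset, ignoring the outer
# object type (metadata, metadatum, metadataset) that requested them.
function group_by_dataset(manifest)
    grouped = Dict{String, Vector{String}}()
    for entries in values(manifest), entry in entries
        haskey(entry, "filename") || continue   # e.g. metadataset entries
        push!(get!(grouped, entry["dataset"], String[]), entry["filename"])
    end
    return grouped
end

# Print one key per dataset, in sorted order.
TOML.print(stdout, group_by_dataset(manifest); sorted=true)
```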

simone-silvestri · 2026-05-22T11:55:31Z

> Should we call this DataManifest.toml and forbid custom names, like Project.toml etc? That way there is one data manifest per directory.

We should definitely forbid Julia names that might conflict. I was thinking of allowing custom names so that each script, even if it lives in the same directory as others, can have its own data manifest.

> Also in terms of file --- should we organize by dataset rather than the outer object (metadata, metadatum, metadataset)? I think this might be better. Ultimately it does not matter how the data is requested

I'll give it a shot

codecov · 2026-05-22T11:56:09Z

Codecov Report

❌ Patch coverage is 17.46032% with 312 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...DataWrangling/DataModes/data_manifest_wrangling.jl | 3.52% | 137 Missing ⚠️ |
| ...ataWrangling/DataModes/parse_and_rewrite_script.jl | 0.00% | 79 Missing ⚠️ |
| src/DataWrangling/DataModes/dry_run_value.jl | 0.00% | 51 Missing ⚠️ |
| src/DataWrangling/DataModes/DataModes.jl | 33.33% | 16 Missing ⚠️ |
| src/NumericalEarth.jl | 47.82% | 12 Missing ⚠️ |
| src/DataWrangling/DataWrangling.jl | 42.85% | 4 Missing ⚠️ |
| src/DataWrangling/metadata.jl | 75.00% | 4 Missing ⚠️ |
| src/Bathymetry/regrid_bathymetry.jl | 0.00% | 2 Missing ⚠️ |
| src/DataWrangling/ERA5/ERA5_pressure_levels.jl | 0.00% | 2 Missing ⚠️ |
| ...taWrangling/OSPapa/OSPapa_prescribed_atmosphere.jl | 0.00% | 1 Missing ⚠️ |
| ... and 4 more | | |

glwagner · 2026-05-22T12:00:23Z

> I was thinking to allow custom names so each script even if living in the same directory can have its own data manifest.

A benefit to having a single name is that it will stay up to date with whatever you are running automatically. I think probably the same reason why we cannot have multiple Project.toml also applies to this situation?

I think having more than one manifest in a directory would get messy. How do you identify the manifest with the script?

glwagner · 2026-05-22T12:05:18Z

So I suggest the following updates:

always produce a DataManifest.toml, even if we aren't doing a dry run
use NUMERICALEARTH_DATA=pregenerate for a dry run mode
have NUMERICALEARTH_DATA=strict which errors if data does not already exist
have NUMERICALEARTH_DATA='' (unset)

glwagner · 2026-05-22T12:05:43Z

We can call it NumericalEarthDataManifest.toml if we want to make it specific to here

simone-silvestri · 2026-05-22T12:08:58Z

I was thinking users could just specify the names, like

NUMERICALEARTH_DATA=build:data_for_script julia --project script.jl

I am ok with just one manifest though. If we see that there is scope for multiple manifests, we can always revisit. What about calling it NumericalEarthDataManifest.toml to connect it to the repo?

simone-silvestri · 2026-05-22T12:09:09Z

Ah you wrote before me :)

simone-silvestri · 2026-05-22T12:13:09Z

> always produce a DataManifest.toml, even if we aren't doing a dry run

Ok then probably we want overwrite_existing = false by default to avoid rewriting constantly

glwagner · 2026-05-22T12:16:41Z

> > always produce a DataManifest.toml, even if we aren't doing a dry run
>
> Ok then probably we want overwrite_existing = false by default to avoid rewriting constantly

What would you use this for?

simone-silvestri · 2026-05-22T12:22:08Z

If we want to append to the current manifest or delete it and start from scratch. How are we envisioning this manifest? A package-wide toml that holds all the variables, or a thin toml that records the needed data per script or per project?
I was thinking more of the second. In that case we might want to add some data to an existing manifest, and then we need overwrite_existing = false. If instead we care only about one script's data, we set overwrite_existing = true so we do not bring in all the rest of the data that the particular script we are running does not need (in case we keep changing scripts).

glwagner · 2026-05-22T12:22:59Z

> If we want to append to the current manifest or delete it and start from scratch. How are we envisioning this manifest? A package-wide toml that holds all the variables, or a thin toml that records the needed data per script or per project? I was thinking more of the second. In that case we might want to add some data to an existing manifest, and then we need overwrite_existing = false. If instead we care only about one script's data, we set overwrite_existing = true so we do not bring in all the rest of the data that the particular script we are running does not need (in case we keep changing scripts).

It is per-directory, so it lives alongside the script that generates it.

glwagner · 2026-05-22T12:25:40Z

Ah ok, so per-environment. I think that could work too.

glwagner · 2026-05-22T12:27:22Z

I'm not sure I grasp the concept of custom names. How would you use a manifest with a script? It wouldn't work automatically?

I'm envisioning a situation where you want to share a setup with someone, so you give them a script, a package environment, and the data manifest, and then julia --project script.jl will run the code.

simone-silvestri · 2026-05-22T12:50:03Z

Ah ok, like that. I was thinking of having

NumericalEarth.download_dataset("mymanifest.toml")

at the beginning of a script.
But this works too. So each script has its own manifest, which is detected and downloaded automatically in the __init__ if present?

simone-silvestri · 2026-05-22T14:03:18Z

I have also committed a NumericalEarthDataManifest.toml to the test directory and added a test that checks that it has not changed (that is, no new datasets have been added to the tests). If that test fails, it means we are adding new data that should be backed up on NumericalEarthArtifacts.

giordano · 2026-05-22T15:16:11Z

Catching up with this PR now. While reading the initial discussion I was going to suggest adding "NumericalEarth" to the file name, but apparently everybody else had the same idea before me 😂

giordano · 2026-05-22T15:19:23Z

+    if m.dates isa AbstractVector
+        d["start_date"] = first(m.dates)
+        d["end_date"]   = last(m.dates)


Are we 100% sure m.dates is always sorted?
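
If sortedness cannot be guaranteed, one option is to take the minimum and maximum rather than the first and last elements. A minimal sketch follows; date_range is a hypothetical helper that is not part of the PR, and the unsorted dates vector is made up for illustration.

```julia
using Dates

# Hypothetical helper: the earliest and latest entries, independent of ordering.
function date_range(dates::AbstractVector)
    start_date, end_date = extrema(dates)
    return start_date, end_date
end

# Deliberately unsorted input: first/last would give the wrong range here.
dates = [DateTime(1990, 6, 1), DateTime(1990, 1, 1), DateTime(1990, 12, 31, 21)]

start_date, end_date = date_range(dates)
```

The trade-off is a full pass over the vector instead of two constant-time lookups, which should be negligible for the date ranges shown in the manifest.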

glwagner · 2026-05-22T15:23:37Z

> Ah ok, like that, I was thinking having in the script
> NumericalEarth.download_dataset("mymanifest.toml")
> at the beginning of a script.

This works, but is more manual labor and kind of annoying (you have to remember the right name, this annoying line has to always appear in the script; you can't add the line until you add the data too so it creates a circular dependency, blah blah). I think it could be more powerful if it functions more like a package environment. What do you think?

> but this works also, so each script has its own manifest that is detected and downloaded automatically in the __init__ if present?

I'm not sure about the mechanics of how to achieve it. I can look into that though (hopefully it is possible)

glwagner · 2026-05-22T15:25:26Z

> Catching up with this PR now. While reading the initial discussion I was going to suggest adding "NumericalEarth" to the file name, but apparently everybody else had the same idea before me 😂

yay!

Maybe just NumericalEarthData.toml is enough.

first commit

3bae680

simone-silvestri requested review from giordano and glwagner May 22, 2026 11:43

simone-silvestri added 5 commits May 22, 2026 15:00

more changes

31147b3

change manifest

d17146c

add a test manifest

719e63c

add a freshness test

1b7197d

remove extra comments

6cb77ad

update toml file and clean up a bit

71daf6c

giordano reviewed May 22, 2026

navidcy added the data wrangling 🗃️ JRA55, ECCO, ERA5, and friends label May 22, 2026


4 participants
