Implement a NUMERICALEARTH_DATA mode for constructing data manifests that can be downloaded #278

Open
simone-silvestri wants to merge 7 commits into main from ss/data-modes

@simone-silvestri
Member

@simone-silvestri simone-silvestri commented May 22, 2026

This PR is an implementation that should resolve #143

The idea is that an environment variable NUMERICALEARTH_DATA controls the execution mode of a script. There are three modes that inform how to deal with data downloading:

  • auto: same as now; whenever a dataset is needed and is not present locally, it is downloaded. This is one of the "production" modes.
  • existing: introduces a guard that errors if the data is not available locally. This is the second "production" mode, meant for clusters where downloading from servers is not possible.
  • build:/path/to/DataManifest.toml: this is a "dry run" mode. The script is parsed instruction by instruction, and /path/to/DataManifest.toml is populated with all the information on the data required to run the script. The data specified in /path/to/DataManifest.toml can then be downloaded with download_datasets("/path/to/DataManifest.toml").
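
As a rough illustration of how the three values could be told apart, here is a minimal sketch. The names parse_data_mode and current_data_mode are hypothetical and are not taken from the PR; only the NUMERICALEARTH_DATA variable and its three values come from the description above.

```julia
# Hypothetical helper (not the PR's implementation): turn the value of
# NUMERICALEARTH_DATA into a mode symbol plus an optional manifest path.
function parse_data_mode(value::AbstractString)
    value = strip(value)
    if isempty(value) || value == "auto"
        return (mode = :auto, manifest = nothing)
    elseif value == "existing"
        return (mode = :existing, manifest = nothing)
    elseif startswith(value, "build:")
        # Split only on the first colon so the path itself may contain colons
        path = String(last(split(value, ':'; limit = 2)))
        isempty(path) && throw(ArgumentError("build mode needs a manifest path, e.g. build:DataManifest.toml"))
        return (mode = :build, manifest = path)
    else
        throw(ArgumentError("unknown NUMERICALEARTH_DATA value: $value"))
    end
end

# Assumption: an unset variable falls back to the current behavior ("auto")
current_data_mode() = parse_data_mode(get(ENV, "NUMERICALEARTH_DATA", "auto"))
```

Treating an unset or empty variable as auto is an assumption made here so that existing scripts keep their current behavior.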

The auto and existing modes are pretty much straightforward; the bulk of the implementation deals with the "dry run" mode.
I have tried to keep the implementation as general as possible and to interact with the previous source code as little as possible, so that we will need no bespoke features when implementing new datasets. The only contract that a new dataset has to obey is a registration step in their respective init methods (see the changes for all the specific datasets).

The new module DataModes holds basically all the implementation: it parses through a script, recursing into include statements, and wraps all calls in try ... catch statements. The script executes even without downloading data because every catch is forwarded to a bespoke DryRunValue placeholder, which survives most operations. Once a Metadata (or MetadataSet) statement is reached, the observe_metadata method in the inner constructor records the use of the metadata in the /path/to/Manifest.
By default the manifest is built from scratch with overwrite_existing=true. It is possible to append to an existing .toml file with the overwrite_existing=false kwarg.
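
To make the placeholder idea concrete, here is a minimal sketch of how such a value could absorb failures. It assumes DryRunValue is a field-less struct; the helper dry_call and the particular set of overloads are illustrative only and are not taken from the PR.

```julia
# Sketch of a placeholder that "survives" common operations.
# Assumption: a field-less struct; the real DryRunValue may differ.
struct DryRunValue end

# Property access, indexing, and calls all yield another placeholder,
# so chains like `ocean.model.grid[1]` keep working during a dry run.
Base.getproperty(::DryRunValue, ::Symbol) = DryRunValue()
Base.getindex(::DryRunValue, inds...) = DryRunValue()
(::DryRunValue)(args...; kwargs...) = DryRunValue()

# Hypothetical wrapper: run `f`, and replace any error with a placeholder
# instead of aborting the script.
function dry_call(f, args...; kwargs...)
    try
        return f(args...; kwargs...)
    catch
        return DryRunValue()
    end
end

ok     = dry_call(+, 1, 2)            # a successful call passes through
failed = dry_call(error, "no GPU")    # a failing call becomes a placeholder
chain  = failed.model.grid[1]         # still a placeholder
```

A successful call returns its normal result, which is why a dry run still does real work whenever nothing fails.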

An example is:

(base) simonesilvestri@Mac NumericalEarth.jl %   NUMERICALEARTH_DATA="build:MyDataManifest.toml" julia --project examples/one_degree_simulation.jl
ERROR: LoadError: ArgumentError: We cannot make a GPU with the CUDA backend:
a CUDA GPU was not found!
Stacktrace:
 [1] GPU()
   @ OceananigansCUDAExt ~/.julia/packages/Oceananigans/8cYE4/ext/OceananigansCUDAExt.jl:56
 [2] top-level scope
   @ ~/development/temp/NumericalEarth.jl/examples/one_degree_simulation.jl:24
 [3] include(mod::Module, _path::String)
   @ Base ./Base.jl:306
 [4] exec_options(opts::Base.JLOptions)
   @ Base ./client.jl:317
 [5] _start()
   @ Base ./client.jl:550
in expression starting at /Users/simonesilvestri/development/temp/NumericalEarth.jl/examples/one_degree_simulation.jl:24
[ Info: Loading cached bathymetry from /Users/simonesilvestri/.julia/scratchspaces/904d977b-046a-4731-8b86-9235c0d1ef02/bathymetry_cache/bathymetry_360x180_0.0_360.0_-85.22387615721567_90.0_b841e80e.jld2
[ Info: We've built an ocean simulation with model:
ocean.model = DryRunValue()
┌ Info: NUMERICALEARTH_DATA=build: wrote manifest via AST trace
│   path = "MyDataManifest.toml"
└   script = "/Users/simonesilvestri/development/temp/NumericalEarth.jl/examples/one_degree_simulation.jl"

with the MyDataManifest.toml containing

[[metadatum]]
filename = "ETOPO_2022_v1_60s_N90W180_surface.nc"
variable_name = "bottom_height"
dataset = "ETOPO2022"
[[metadatum]]
filename = "THETA_1993_01.nc"
date = 1993-01-01T00:00:00.000Z
variable_name = "temperature"
dataset = "ECCO4Monthly"
[[metadatum]]
filename = "SALT_1993_01.nc"
date = 1993-01-01T00:00:00.000Z
variable_name = "salinity"
dataset = "ECCO4Monthly"
[[metadatum]]
filename = "SIheff_1993_01.nc"
date = 1993-01-01T00:00:00.000Z
variable_name = "sea_ice_thickness"
dataset = "ECCO4Monthly"
[[metadatum]]
filename = "SIarea_1993_01.nc"
date = 1993-01-01T00:00:00.000Z
variable_name = "sea_ice_concentration"
dataset = "ECCO4Monthly"

[[metadataset]]
date = 1993-01-01T00:00:00.000Z
variable_names = ["temperature", "salinity", "sea_ice_thickness", "sea_ice_concentration"]
dataset = "ECCO4Monthly"

[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.uas.1990_1991.nc"
variable_name = "eastward_velocity"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.vas.1990_1991.nc"
variable_name = "northward_velocity"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.tas.1990_1991.nc"
variable_name = "temperature"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.huss.1990_1991.nc"
variable_name = "specific_humidity"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.psl.1990_1991.nc"
variable_name = "sea_level_pressure"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.prra.1990_1991.nc"
variable_name = "rain_freshwater_flux"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.prsn.1990_1991.nc"
variable_name = "snow_freshwater_flux"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.rsds.1990_1991.nc"
variable_name = "downwelling_shortwave_radiation"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T21:00:00.000Z
filename = "RYF.rlds.1990_1991.nc"
variable_name = "downwelling_longwave_radiation"
dataset = "RepeatYearJRA55"
[[metadata]]
start_date = 1990-01-01T00:00:00.000Z
end_date = 1990-12-31T00:00:00.000Z
filename = "RYF.friver.1990_1991.nc"
variable_name = "river_freshwater_flux"
dataset = "RepeatYearJRA55"

@simone-silvestri
Member Author

I think to solve the test issue, we can trace a DataManifest.toml in the test folder and add a CI pipeline that dry runs the runtests.jl file and makes sure that the DataManifest.toml captures all the data that is needed. If not, the test errors and requests adding the missing metadata. It can also check whether the fallbacks are available in NumericalEarthArtifacts and, if not, request adding them. What do you think @glwagner @giordano?

@glwagner
Member

@simone-silvestri a side note is that I would like to change the prescribed atmospheres to use MetadataSet

@simone-silvestri
Member Author

A flaw in this design is that even in "dry run" mode it executes the script and, unless we hit an error, it continues executing all the commands that it can. So if all the data is already stored locally, a "dry run" is actually a production run and will never be as fast as we want.

@simone-silvestri
Member Author

@simone-silvestri a side note is that I would like to change the prescribed atmospheres to use MetadataSet

Yeah that would be great

@glwagner
Member

MyDataManifest.toml

Should we call this DataManifest.toml and forbid custom names, like Project.toml etc? That way there is one data manifest per directory.

Also in terms of file --- should we organize by dataset rather than the outer object (metadata, metadatum, metadataset)? I think this might be better. Ultimately it does not matter how the data is requested
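
One possible shape for a dataset-keyed layout, as a minimal sketch using Julia's TOML standard library. The function group_by_dataset and the flat entry format are assumptions made for illustration; the entries reuse fields from the manifest shown earlier in the thread.

```julia
using TOML

# Flat entries, as they might be recorded regardless of whether they came
# from a Metadatum, Metadata, or MetadataSet (assumed format).
entries = [
    Dict("dataset" => "ECCO4Monthly", "variable_name" => "temperature", "filename" => "THETA_1993_01.nc"),
    Dict("dataset" => "ECCO4Monthly", "variable_name" => "salinity", "filename" => "SALT_1993_01.nc"),
    Dict("dataset" => "ETOPO2022", "variable_name" => "bottom_height", "filename" => "ETOPO_2022_v1_60s_N90W180_surface.nc"),
]

# Hypothetical helper: regroup entries under their dataset name and drop
# the now-redundant "dataset" key from each entry.
function group_by_dataset(entries)
    grouped = Dict{String, Vector{Dict{String, Any}}}()
    for entry in entries
        dataset = entry["dataset"]
        rest = Dict{String, Any}(k => v for (k, v) in entry if k != "dataset")
        push!(get!(grouped, dataset, Dict{String, Any}[]), rest)
    end
    return grouped
end

# Prints one array of tables per dataset, e.g. [[ECCO4Monthly]] blocks
TOML.print(group_by_dataset(entries); sorted = true)
```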

@simone-silvestri
Member Author

Should we call this DataManifest.toml and forbid custom names, like Project.toml etc? That way there is one data manifest per directory.

Definitely we should forbid Julia names that might conflict. I was thinking of allowing custom names so that each script, even if living in the same directory, can have its own data manifest.

Also in terms of file --- should we organize by dataset rather than the outer object (metadata, metadatum, metadataset)? I think this might be better. Ultimately it does not matter how the data is requested

I'll give it a shot

@glwagner
Member

glwagner commented May 22, 2026

I was thinking to allow custom names so each script even if living in the same directory can have its own data manifest.

A benefit to having a single name is that it will stay up to date with whatever you are running automatically. I think probably the same reason why we cannot have multiple Project.toml also applies to this situation?

I think having more than one manifest in a directory would get messy. How do you identify the manifest with the script?

@glwagner
Member

glwagner commented May 22, 2026

So I suggest the following updates:

  1. always produce a DataManifest.toml, even if we aren't doing a dry run
  2. use NUMERICALEARTH_DATA=pregenerate for a dry run mode
  3. have NUMERICALEARTH_DATA=strict which errors if data does not already exist
  4. have NUMERICALEARTH_DATA='' (unset)

@glwagner
Member

We can call it NumericalEarthDataManifest.toml if we want to make it specific to here

@simone-silvestri
Member Author

I was thinking users could just specify the names, like

NUMERICALEARTH_DATA=build:data_for_script julia --project script.jl

I am ok with just one manifest though, if we see that there is scope for multiple we can always revisit. What about calling it NumericalEarthDataManifest.toml to connect it to the repo?

@simone-silvestri
Member Author

Ah you wrote before me :)

@simone-silvestri
Member Author

always produce a DataManifest.toml, even if we aren't doing a dry run

Ok then probably we want overwrite_existing = false by default to avoid rewriting constantly

@glwagner
Member

always produce a DataManifest.toml, even if we aren't doing a dry run

Ok then probably we want overwrite_existing = false by default to avoid rewriting constantly

What would you use this for?

@simone-silvestri
Member Author

If we want to append to the current manifest or delete it and start from scratch. How are we envisioning this manifest? A package-wide toml that holds all the variables, or a thin toml that records the needed data per script or per project?
I was thinking more of the second; in this case we might want to add some data to an existing manifest, and then we need overwrite_existing = false. If instead we care only about one script's data, we set overwrite_existing = true so we do not bring in all the rest of the data we do not need for the particular script we are running (in case we keep changing the script).

@glwagner
Member

If we want to append to the current manifest or delete it and start from scratch. How are we envisioning this manifest? A package-wide toml that holds all the variables, or a thin toml that records the needed data per script or per project? I was thinking more of the second; in this case we might want to add some data to an existing manifest, and then we need overwrite_existing = false. If instead we care only about one script's data, we set overwrite_existing = true so we do not bring in all the rest of the data we do not need for the particular script we are running (in case we keep changing the script).

It is per-directory, so it lives alongside the script that generates it.

@glwagner
Member

Ah ok, so per-environment. I think that could work too.

@glwagner
Member

I'm not sure I grasp the concept of custom names. How would you use a manifest with a script? It wouldn't work automatically?

I'm envisioning a situation where you want to share a setup with someone, so you give them both a script, a package environment, and the data manifest, and then julia --project script.jl will run the code.

@simone-silvestri
Member Author

Ah ok, like that, I was thinking having in the script

NumericalEarth.download_dataset("mymanifest.toml")

at the beginning of a script.
but this works also, so each script has its own manifest that is detected and downloaded automatically in the __init__ if present?

@simone-silvestri
Member Author

I have also committed a NumericalEarthDataManifest.toml to the test directory and added a test that checks that it has not changed (no new datasets are added to the tests). If that test fails, it means that we are adding new data that should be backed up on NumericalEarthArtifacts.
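
The kind of comparison such a test could perform, as a minimal sketch. It assumes manifest entries are dictionaries identified by their dataset and filename; the function missing_entries and the sample data are hypothetical and not the test added in this PR.

```julia
# Hypothetical check: which entries required by a fresh dry run are absent
# from the committed manifest? Entries are identified by (dataset, filename);
# entries without a filename use an empty string (an assumption).
function missing_entries(required, committed)
    key(entry) = (entry["dataset"], get(entry, "filename", ""))
    return setdiff(Set(key.(required)), Set(key.(committed)))
end

committed = [
    Dict("dataset" => "ECCO4Monthly", "filename" => "THETA_1993_01.nc"),
]

required = [
    Dict("dataset" => "ECCO4Monthly", "filename" => "THETA_1993_01.nc"),
    Dict("dataset" => "ECCO4Monthly", "filename" => "SALT_1993_01.nc"),
]

new_data = missing_entries(required, committed)
# A test would fail here, asking for SALT_1993_01.nc to be backed up
isempty(new_data) || @warn "New data not in the committed manifest" new_data
```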

@giordano
Member

Catching up with this PR now. While reading the initial discussion I was going to suggest adding "NumericalEarth" to the file name, but apparently everybody else had the same idea before me 😂

Comment on lines +104 to +106
if m.dates isa AbstractVector
d["start_date"] = first(m.dates)
d["end_date"] = last(m.dates)
Member

Are we 100% sure m.dates is always sorted?
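
If the ordering cannot be guaranteed, one option is to compute the bounds without relying on it. A minimal sketch, where date_bounds is a hypothetical helper and the sample dates are made up for illustration:

```julia
using Dates

# Order-independent alternative to first(dates) / last(dates):
# extrema returns the (minimum, maximum) pair in a single pass.
function date_bounds(dates::AbstractVector)
    start_date, end_date = extrema(dates)
    return start_date, end_date
end

# Deliberately unsorted input
dates = [DateTime(1990, 3, 1), DateTime(1990, 1, 1), DateTime(1990, 12, 31, 21)]
start_date, end_date = date_bounds(dates)
```

For sorted input this gives the same result as first and last, at the cost of scanning the whole vector.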

@glwagner
Member

glwagner commented May 22, 2026

Ah ok, like that, I was thinking having in the script

NumericalEarth.download_dataset("mymanifest.toml")

at the beginning of a script.

This works, but is more manual labor and kind of annoying (you have to remember the right name, this annoying line has to always appear in the script; you can't add the line until you add the data too so it creates a circular dependency, blah blah). I think it could be more powerful if it functions more like a package environment. What do you think?

but this works also, so each script has its own manifest that is detected and downloaded automatically in the __init__ if present?

I'm not sure the mechanics of how to achieve it. I can look into that though (hopefully it is possible)

@glwagner
Member

Catching up with this PR now. While reading the initial discussion I was going to suggest adding "NumericalEarth" to the file name, but apparently everybody else had the same idea before me 😂

yay!

Maybe just NumericalEarthData.toml is enough.

@navidcy navidcy added the data wrangling 🗃️ JRA55, ECCO, ERA5, and friends label May 22, 2026
Development

Successfully merging this pull request may close these issues.

Method to enforce "predownloading" of data, when desired

4 participants