Implement a NUMERICALEARTH_DATA mode for constructing data manifests that can be downloaded #278
simone-silvestri wants to merge 7 commits into
I think to solve the test issue, we can trace a DataManifest.toml in the test folder and add a CI pipeline that dry runs the
@simone-silvestri a side note: I would like to change the prescribed atmospheres to use MetadataSet
A flaw in this design is that even in "dry mode" it executes the script, and unless we hit an error it will continue executing all the commands that it can. So if all the data is already stored locally, a "dry mode" run is actually a production run and will never be as fast as we want.
Yeah that would be great
Should we call this Also in terms of file: should we organize by dataset rather than the outer object (metadata, metadatum, metadataset)? I think this might be better. Ultimately it does not matter how the data is requested
Definitely we should forbid Julia names that might conflict. I was thinking to allow custom names so each script, even if living in the same directory, can have its own data manifest.
I'll give it a shot
A benefit to having a single name is that it will stay up to date with whatever you are running automatically. I think probably the same reason why we cannot have multiple I think having more than one manifest in a directory would get messy. How do you identify the manifest with the script?
So I suggest the following updates:
We can call it
I was thinking users could just specify the names, like I am ok with just one manifest though; if we see that there is scope for multiple, we can always revisit. What about calling it
Ah you wrote before me :)
Ok then probably we want
What would you use this for? |
If we want to append to the current manifest or delete it and start from scratch. How are we envisioning this manifest? A package-wide toml that holds all the variables, or a thin toml that records the needed data per script or per project?
It is per-directory, so it lives alongside the script that generates it.
Ah ok, so per-environment. I think that could work too.
I'm not sure I grasp the concept of custom names. How would you use a manifest with a script? It wouldn't work automatically? I'm envisioning a situation where you want to share a setup with someone, so you give them a script, a package environment, and the data manifest, and then
Ah ok, like that, I was thinking having in the script at the beginning of a script.
I have also committed a
Catching up with this PR now. While reading the initial discussion I was going to suggest adding "NumericalEarth" to the file name, but apparently everybody else had the same idea before me 😂
    if m.dates isa AbstractVector
        d["start_date"] = first(m.dates)
        d["end_date"] = last(m.dates)
Are we 100% sure m.dates is always sorted?
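If sortedness cannot be guaranteed, one option is to take the extremes rather than the first and last elements. A minimal sketch, assuming m.dates is a vector of comparable date values (the dates below are made up for illustration and stand in for m.dates):

```julia
using Dates

# Made-up, deliberately unsorted dates standing in for m.dates
dates = [DateTime(2020, 3, 1), DateTime(2020, 1, 1), DateTime(2020, 2, 1)]

# extrema returns (minimum, maximum) regardless of the input ordering
start_date, end_date = extrema(dates)
```

This costs a single pass over the vector and gives the same result as first/last whenever the dates are already sorted.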
This works, but is more manual labor and kind of annoying (you have to remember the right name, this annoying line has to always appear in the script; you can't add the line until you add the data too so it creates a circular dependency, blah blah). I think it could be more powerful if it functions more like a package environment. What do you think?
I'm not sure about the mechanics of how to achieve it. I can look into that though (hopefully it is possible)
yay! Maybe just
This PR is an implementation that should resolve #143
The idea is that an environment variable NUMERICALEARTH_DATA controls the execution mode of a script. There are three modes that inform how to deal with data downloading: auto, existing, and a "dry run" mode. In the dry run mode, a /path/to/DataManifest.toml is populated with all the information on the data required to run the script. The data specified in /path/to/DataManifest.toml can be downloaded with download_datasets("/path/to/DataManifest.toml").
The auto and existing modes are pretty much straightforward; the bulk of the implementation deals with the "dry run" mode.
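As a rough illustration of how a script could branch on the variable, here is a minimal sketch. The mode strings, the default, and the helper name data_mode are assumptions made for this example; the description above only names an auto, an existing, and a "dry run" mode, and the actual values accepted by the PR may differ.

```julia
# Hypothetical mode spellings; not taken from the PR
const ASSUMED_MODES = ("auto", "existing", "dry")

# Hypothetical helper: read the mode from an environment-like dictionary,
# falling back to "auto" (an assumed default) when the variable is unset.
function data_mode(env = ENV)
    mode = lowercase(get(env, "NUMERICALEARTH_DATA", "auto"))
    mode in ASSUMED_MODES || throw(ArgumentError("unknown NUMERICALEARTH_DATA mode: $mode"))
    return mode
end
```

Passing the environment as an argument keeps the helper easy to exercise, for example data_mode(Dict("NUMERICALEARTH_DATA" => "dry")), without touching the real process environment.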
I have tried to keep the implementation as general as possible and to interact with the previous source code as little as possible, so that we will need no bespoke features when implementing new datasets. The only contract that a new dataset has to obey is a registration step in its init method (see the changes for all the specific datasets).
The new module DataModes holds basically all the implementation. It parses through a script, recursing into include statements, and wraps all calls in try; .. catch; statements. The script will execute even without data download by forwarding all the catches to a bespoke DryRunValue method which survives most of the operations. Once a Metadata (or MetadataSet) statement is reached, the observe_metadata method in the inner constructor will record the use of the metadata in the /path/to/Manifest.
By default the manifest is built from scratch with overwrite_existing=true. It is possible to append to an existing .toml file with the overwrite_existing=false kwarg.
An example is:
with the MyDataManifest.toml containing