EAMxx: add conditional geo data output (io_output_if_dim_exists)#8349
EAMxx: add conditional geo data output (io_output_if_dim_exists)#8349Copilot wants to merge 4 commits into
Conversation
bartgol
left a comment
There was a problem hiding this comment.
@copilot I think I would like a different design in the OutputManager class. I would like this approach:
- during setup, for the geo data, we only gather the fields we want to output.
- during setup_file, if the geo data streams have already been created, just run them. Otherwise, proceed to create them
- when creating the geo streams, we can parse the output file (as output is registered BEFORE geo data in setup_file), so we can already create them with the correct fields. Namely, while parsing the list of fields to add to geo data stream, if a field is conditional on a dim being present, we can immedaiately check if the dim is in fact there.
This approach should allow to not have to create cond_streams every time setup_file runs. This should make the logic for saving geo data more contained in one place, and simpler to maintain.
I want you to:
- consider my argument above, analyzing its pros/cons
- convert my argument above into a plan
- analyze and refine the plan with any additional tweaks/fixes you can think of
- implement the plan
Redesigned in commit 5528df6. The new approach:
The logic for saving geo data is now self-contained in |
5528df6 to
7b7bddc
Compare
7b7bddc to
1d1259c
Compare
…s in the file This allows, e.g., to avoid adding radiation's sw/lw bands variables if no radiation-related vars are in the file.
1d1259c to
a107896
Compare
|
The PR should remove some fields, but tests pass. However, looking at the cprnc output of, e.g., eamxx-prod CI test, we see I wish cprnc did NOT report IDENTICAL when vars/dims are missing... Anyhow, this is good to go, if @brhillman is ok with it. |
|
@crterai @AaronDonahue @mahf708 and others. Discussing with @brhillman we were wondering if we actually need the logic in this PR or if it should done automatically for all geo data. To recap, this PR marks the bands geo data (stored in the grid) as "conditional to the output file having 1+ var that needs the band dim (sw or lw)". This is to avoid having a few sw/lw bands vars in the nc file that are pointless, as there are no rad fields in the file. We achieve this by using the extra data feature of the Field class, with radiation using a specific key to mark the dim as "conditional to...". But Ben wondered: do we need this extra data logic? Why not having IO always check if the geo data is needed? It may be uncommon, but you could imagine someone having a stream containing ONLY zonal_avg diagnostics. For such stream, NO variable would have the "ncol" dimension, so having the So, the proposal would be to do something like this instead: when processing the grid geo data, check if ANY of the geo data field dims is already registered in the file. If yes, write the geo data to file, otherwise skip it, since no var share even one dim tag with this geo data field (we could make an exception for rank-0 geo data, just in case we ever have some). This would remove the hard-coded extra-data string for detecting these "conditional" geo data, making the IO layer handling it more automatically. Thoughts? |
Thoughtful response: Unhinged response: |
|
I think we should ask the E3SM-Diags folks (@chengzhuzhang ) to comment on what geo fields,if any, are expected in any output stream. |
To be clear so the E3SM-Diags folks can understand: the question is "are there geo fields you ALWAYS want in the output file, even if they are completely irrelevant for the vars stored in the file?". This PR was tackling the sw/lw bands and bands_bounds vars when there are NO vars with sw/lw bands dim. But the natural question is "should we take this forward, and apply the same idea to ALL geo-data/coord fileds?". |
|
I think it's fine and helps clean up output files to only output the dims that are relevant for the variables that exist in the file. I'm curious about what happens to regridded or subsampled data. And for the example you gave where we do zonal outputs, would we just not get the latitude corresponding to the zonal output? Or would we get a latitude variable that corresponds to the zonal bands? |
Thanks for the contexts! Re: "are there geo fields you ALWAYS want in the output file, even if they are completely irrelevant for the vars stored in the file?". My answer is no, but I'd like @czender to chime in since NCO is the first downstream tool to consume the native model output. One possibly related case, though maybe not relevant to this change: we always want PS/ps for vertically regridding a 3D variables on hybrid grid. But I assume this PR won’t affect that. |
|
NCO does not need variables or dimensions that are unrelated to those it is processing. Unless told otherwise it just copies the non-processed variables into the output. As Jill mentioned some variables that are ancillary to the processed variables (e.g., |
Add support for conditionally writing geometry data to output files, controlled by the new "io_output_if_dim_exists" extra_data key on Field headers.
[non BFB] only for output files with bands dims but no bands vars.
Problem
The EAMxx IO layer currently writes all grid geometry data (including
swband,lwband,swband_bounds,lwband_bounds) to every output file, regardless of whether any output variable uses those dimensions. This wastes space and creates misleading output files.Solution
Introduce an
"io_output_if_dim_exists"extra_data flag onFieldheaders. When set to a dimension name string, that geometry data field is written to the output file only if the named dimension is already registered in the file (i.e., at least one regular output variable uses that dimension).Changes
eamxx_output_manager.hppm_conditional_geo_datawithm_geo_data_fields: astd::vector<GeoDataEntry>where each entry holds the cloned field, its grid, and an optionalrequired_dimstring (empty for unconditional fields).eamxx_output_manager.cppsetup(): collect all candidate geo data fields (both regular and conditional) intom_geo_data_fields. NoAtmosphereOutputstreams are created here.setup_file(): lazily createm_geo_data_streamson first call (when empty). Since regular output streams have already registered their dims at this point, conditional dim checks (scorpio::has_dim()) are done inline during stream creation. Subsequentsetup_file()calls (new files in a multi-file run) skip creation and reuse the existing streams.eamxx_rrtmgp_process_interface.cppswband,swband_bounds,lwband,lwband_bounds) with"io_output_if_dim_exists"→"swband"/"lwband".src/share/io/tests/io_cond_geo.cpp(new)swbanddim →swband/swband_boundsgeo vars must not appear in the file.swbanddim →swband/swband_boundsmust appear.src/share/io/tests/CMakeLists.txtio_cond_geoas a new unit test with labelsio, MPI ranks 1..max.