Skip to content

EAMxx: add conditional geo data output (io_output_if_dim_exists)#8349

Open
Copilot wants to merge 4 commits into
masterfrom
copilot/add-conditional-geo-data-output
Open

EAMxx: add conditional geo data output (io_output_if_dim_exists)#8349
Copilot wants to merge 4 commits into
masterfrom
copilot/add-conditional-geo-data-output

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented May 1, 2026

Add support for conditionally writing geometry data to output files, controlled by the new "io_output_if_dim_exists" extra_data key on Field headers.

[non BFB] only for output files with bands dims but no bands vars.


Problem

The EAMxx IO layer currently writes all grid geometry data (including swband, lwband, swband_bounds, lwband_bounds) to every output file, regardless of whether any output variable uses those dimensions. This wastes space and creates misleading output files.

Solution

Introduce an "io_output_if_dim_exists" extra_data flag on Field headers. When set to a dimension name string, that geometry data field is written to the output file only if the named dimension is already registered in the file (i.e., at least one regular output variable uses that dimension).

Changes

eamxx_output_manager.hpp

  • Replace m_conditional_geo_data with m_geo_data_fields: a std::vector<GeoDataEntry> where each entry holds the cloned field, its grid, and an optional required_dim string (empty for unconditional fields).

eamxx_output_manager.cpp

  • In setup(): collect all candidate geo data fields (both regular and conditional) into m_geo_data_fields. No AtmosphereOutput streams are created here.
  • In setup_file(): lazily create m_geo_data_streams on first call (when empty). Since regular output streams have already registered their dims at this point, conditional dim checks (scorpio::has_dim()) are done inline during stream creation. Subsequent setup_file() calls (new files in a multi-file run) skip creation and reuse the existing streams.

eamxx_rrtmgp_process_interface.cpp

  • Tag the four swband/lwband geometry data fields (swband, swband_bounds, lwband, lwband_bounds) with "io_output_if_dim_exists""swband" / "lwband".

src/share/io/tests/io_cond_geo.cpp (new)

  • Unit test with two scenarios:
    1. Output a field without the swband dim → swband / swband_bounds geo vars must not appear in the file.
    2. Output a field with the swband dim → swband / swband_bounds must appear.

src/share/io/tests/CMakeLists.txt

  • Register io_cond_geo as a new unit test with labels io, MPI ranks 1..max.

Copy link
Copy Markdown
Contributor

@bartgol bartgol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@copilot I think I would like a different design in the OutputManager class. I would like this approach:

  1. during setup, for the geo data, we only gather the fields we want to output.
  2. during setup_file, if the geo data streams have already been created, just run them. Otherwise, proceed to create them
  3. when creating the geo streams, we can parse the output file (as output is registered BEFORE geo data in setup_file), so we can already create them with the correct fields. Namely, while parsing the list of fields to add to geo data stream, if a field is conditional on a dim being present, we can immedaiately check if the dim is in fact there.

This approach should allow to not have to create cond_streams every time setup_file runs. This should make the logic for saving geo data more contained in one place, and simpler to maintain.

I want you to:

  • consider my argument above, analyzing its pros/cons
  • convert my argument above into a plan
  • analyze and refine the plan with any additional tweaks/fixes you can think of
  • implement the plan

Copy link
Copy Markdown
Contributor Author

Copilot AI commented May 1, 2026

@copilot I think I would like a different design in the OutputManager class. I would like this approach:

  1. during setup, for the geo data, we only gather the fields we want to output....

Redesigned in commit 5528df6.

The new approach:

  1. setup() only collects candidate geo data fields into m_geo_data_fields (a vector of GeoDataEntry{field, grid, required_dim}) — no AtmosphereOutput streams are created here. The m_conditional_geo_data map is gone.

  2. setup_file() lazily creates m_geo_data_streams on first call (when they are empty). Since regular output streams have already called setup_output_file() at that point, scorpio::has_dim() can be used directly to filter conditional fields. Subsequent setup_file() calls (new files in a multi-file run) skip creation and reuse the same streams.

The logic for saving geo data is now self-contained in setup_file().

Copilot finished work on behalf of bartgol May 1, 2026 17:17
Copilot AI requested a review from bartgol May 1, 2026 17:17
@bartgol bartgol force-pushed the copilot/add-conditional-geo-data-output branch from 5528df6 to 7b7bddc Compare May 1, 2026 18:36
@bartgol bartgol marked this pull request as ready for review May 1, 2026 18:36
@bartgol bartgol force-pushed the copilot/add-conditional-geo-data-output branch from 7b7bddc to 1d1259c Compare May 1, 2026 20:19
@bartgol bartgol force-pushed the copilot/add-conditional-geo-data-output branch from 1d1259c to a107896 Compare May 4, 2026 17:55
@bartgol bartgol requested a review from brhillman May 4, 2026 23:51
@bartgol
Copy link
Copy Markdown
Contributor

bartgol commented May 5, 2026

The PR should remove some fields, but tests pass. However, looking at the cprnc output of, e.g., eamxx-prod CI test, we see

 Could not find match for file 2 dimension lwband
 Could not find match for file 2 dimension swband
 Could not find match for file2 variable lwband in file1
 Could not find match for file2 variable lwband_bounds in file1
 Could not find match for file2 variable swband in file1
 Could not find match for file2 variable swband_bounds in file1

I wish cprnc did NOT report IDENTICAL when vars/dims are missing...

Anyhow, this is good to go, if @brhillman is ok with it.

@bartgol bartgol requested a review from AaronDonahue May 6, 2026 20:50
@bartgol bartgol changed the title Add conditional geo data output (io_output_if_dim_exists) EAMxx: add conditional geo data output (io_output_if_dim_exists) May 6, 2026
@bartgol bartgol added BFB PR leaves answers BFB EAMxx C++ based E3SM atmosphere model (aka SCREAM) non-BFB PR makes roundoff changes to answers. and removed BFB PR leaves answers BFB labels May 6, 2026
@bartgol bartgol added the non-BFB PR makes roundoff changes to answers. label May 6, 2026
@bartgol
Copy link
Copy Markdown
Contributor

bartgol commented May 7, 2026

@crterai @AaronDonahue @mahf708 and others. Discussing with @brhillman we were wondering if we actually need the logic in this PR or if it should done automatically for all geo data.

To recap, this PR marks the bands geo data (stored in the grid) as "conditional to the output file having 1+ var that needs the band dim (sw or lw)". This is to avoid having a few sw/lw bands vars in the nc file that are pointless, as there are no rad fields in the file. We achieve this by using the extra data feature of the Field class, with radiation using a specific key to mark the dim as "conditional to...".

But Ben wondered: do we need this extra data logic? Why not having IO always check if the geo data is needed? It may be uncommon, but you could imagine someone having a stream containing ONLY zonal_avg diagnostics. For such stream, NO variable would have the "ncol" dimension, so having the lat,lon, and area vars in the file may be questionably useful...

So, the proposal would be to do something like this instead: when processing the grid geo data, check if ANY of the geo data field dims is already registered in the file. If yes, write the geo data to file, otherwise skip it, since no var share even one dim tag with this geo data field (we could make an exception for rank-0 geo data, just in case we ever have some). This would remove the hard-coded extra-data string for detecting these "conditional" geo data, making the IO layer handling it more automatically.

Thoughts?

@mahf708
Copy link
Copy Markdown
Contributor

mahf708 commented May 7, 2026

@crterai @AaronDonahue @mahf708 and others. Discussing with @brhillman we were wondering if we actually need the logic in this PR or if it should done automatically for all geo data.

To recap, this PR marks the bands geo data (stored in the grid) as "conditional to the output file having 1+ var that needs the band dim (sw or lw)". This is to avoid having a few sw/lw bands vars in the nc file that are pointless, as there are no rad fields in the file. We achieve this by using the extra data feature of the Field class, with radiation using a specific key to mark the dim as "conditional to...".

But Ben wondered: do we need this extra data logic? Why not having IO always check if the geo data is needed? It may be uncommon, but you could imagine someone having a stream containing ONLY zonal_avg diagnostics. For such stream, NO variable would have the "ncol" dimension, so having the lat,lon, and area vars in the file may be questionably useful...

So, the proposal would be to do something like this instead: when processing the grid geo data, check if ANY of the geo data field dims is already registered in the file. If yes, write the geo data to file, otherwise skip it, since no var share even one dim tag with this geo data field (we could make an exception for rank-0 geo data, just in case we ever have some). This would remove the hard-coded extra-data string for detecting these "conditional" geo data, making the IO layer handling it more automatically.

Thoughts?

Thoughtful response:
Tricky business... customarily, they dumped all this info into all netcdf files; e.g., just look at a random eamv3 netcdf file and see the hell in there. I think the first thing to wonder about: did they do that for a (legit) reason other than convenience/laziness? I'd ask @rljacob who might know.

Unhinged response:
I like your proposal because I really hate confines and hierarchies, and so I just want to be able to invent new dimensions and coords on the fly without having to be stuck with global(ist)-state controls

@AaronDonahue
Copy link
Copy Markdown
Contributor

I think we should ask the E3SM-Diags folks (@chengzhuzhang ) to comment on what geo fields,if any, are expected in any output stream.

@bartgol
Copy link
Copy Markdown
Contributor

bartgol commented May 7, 2026

I think we should ask the E3SM-Diags folks (@chengzhuzhang ) to comment on what geo fields,if any, are expected in any output stream.

To be clear so the E3SM-Diags folks can understand: the question is "are there geo fields you ALWAYS want in the output file, even if they are completely irrelevant for the vars stored in the file?". This PR was tackling the sw/lw bands and bands_bounds vars when there are NO vars with sw/lw bands dim. But the natural question is "should we take this forward, and apply the same idea to ALL geo-data/coord fileds?".

@crterai
Copy link
Copy Markdown
Contributor

crterai commented May 7, 2026

I think it's fine and helps clean up output files to only output the dims that are relevant for the variables that exist in the file. I'm curious about what happens to regridded or subsampled data. And for the example you gave where we do zonal outputs, would we just not get the latitude corresponding to the zonal output? Or would we get a latitude variable that corresponds to the zonal bands?

@chengzhuzhang
Copy link
Copy Markdown
Contributor

To be clear so the E3SM-Diags folks can understand: the question is "are there geo fields you ALWAYS want in the output file, even if they are completely irrelevant for the vars stored in the file?". This PR was tackling the sw/lw bands and bands_bounds vars when there are NO vars with sw/lw bands dim. But the natural question is "should we take this forward, and apply the same idea to ALL geo-data/coord fileds?".

Thanks for the contexts! Re: "are there geo fields you ALWAYS want in the output file, even if they are completely irrelevant for the vars stored in the file?". My answer is no, but I'd like @czender to chime in since NCO is the first downstream tool to consume the native model output.

One possibly related case, though maybe not relevant to this change: we always want PS/ps for vertically regridding a 3D variables on hybrid grid. But I assume this PR won’t affect that.

@czender
Copy link
Copy Markdown
Contributor

czender commented May 7, 2026

NCO does not need variables or dimensions that are unrelated to those it is processing. Unless told otherwise it just copies the non-processed variables into the output. As Jill mentioned some variables that are ancillary to the processed variables (e.g., ps, hyai, landfrac) do need to be accessible. This PR would presumably not affect those ancillary variables. In particular, NCO has no need for the [sw/lw]bands coordinate variables and bounds unless data variables with those dimensions are present.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

EAMxx C++ based E3SM atmosphere model (aka SCREAM) non-BFB PR makes roundoff changes to answers.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants