Skip to content

Clean up Data handeling methods in adf_dataset.py and fix AMWG table formatting#427

Draft
justin-richling wants to merge 8 commits into
NCAR:mainfrom
justin-richling:adf-dataset-cleanup
Draft

Clean up Data handeling methods in adf_dataset.py and fix AMWG table formatting#427
justin-richling wants to merge 8 commits into
NCAR:mainfrom
justin-richling:adf-dataset-cleanup

Conversation

@justin-richling
Copy link
Copy Markdown
Collaborator

This PR will clean up and organize the calls to read files, datasets, and data arrays depending on the file type, ie time series, climo, regridded climo.

The main changes here are to:

  1. Get all the file type functionality consistent. (probably be able to simplify all these down to a couple of functions still)
  2. Remove baseline/ref name from any of these methods since there will always be only one case, just get it in the methods
  3. Add function in adf_dataset.py to gather new unit info update_unit and call that in load_da. This could be useful as an external method, similarly to get_value_converters
  4. Update amwg_table.py to implement these changes

Important changes:

  1. The modifications for scale factor, offset, new units, and time bound adjustment (for time series) will always happen if the data is drawn using these methods in adf_dataset.py. That was the issue with the tables being improperly formatted, see AMWG tables are not getting proper data cleaning #423

Now you can load data sets and arrays generically any where in the ADF by calling load_dataset(fils) and/or load_da(<list of files>, <variable name>, <case name>)

or through the three file types:
  load_reference_timeseries_dataset/load_timeseries_dataset
  load_reference_climo_dataset/load_climo_dataset
  load_reference_regrid_dataset/load_regrid_dataset

  load_<file type>_dataset will inherently call load_dataset and
  load_<file type>_da will inherently call load_da

calling load_dataset has a flag for time series files so that it can check the time bounds to ensure proper time alignment. This is only an issue for using older CAM/CESM files but this won't modify anything if the newer files are used.

This change will now force the ADF to figure out the modifications at the loading data array stage

Potential issues:
  user must make sure of the scale/factor, units if using any premade file; using files that aren't created via ADF from the history files level.
  Remedy: The user just has to make sure there are no values for these arguments in the config yaml file if their datasets have been already modified, or use even more generic xarray load via load_dataset from adf_utils
  Not likely but a definite issue for edge case scenarios.

Closes #423 and #424

Remove the dependency of the baseline case name since there will only be one baseline
This will now grab any scale factors and new units for time series files
Now if these methods are called it will apply the scale factors/offset/units and time bounds fixes (for time series)
Make `new_unit` not Tex formatting (mostly for AMWG table viewing) but move it to mpl args for plotting
@justin-richling justin-richling added bug Something isn't working code clean-up Made code simpler and/or easier to read. analysis Related to data analysis and statistics data handling Related to handling of data, ie file i/o, data workflow, etc labels Nov 14, 2025
@brianpm
Copy link
Copy Markdown
Collaborator

brianpm commented May 14, 2026

Hi @justin-richling -- Here's another analysis thanks to Claude. Looks like this one needs a little more attention, but the proposed changes are good.

PR #427 Complete Analysis & Review

PR: Clean up Data handling methods in adf_dataset.py and fix AMWG table formatting
Status: Open (Draft)
Created: November 14, 2025
Files Modified: 8 files
Net Changes: +184 lines, -597 lines (primarily config cleanup)


Executive Summary

PR #427 is a significant refactoring that improves data loading architecture and fixes AMWG table formatting issues. The work reorganizes how test and reference data are loaded, creating consistent wrapper methods for different data types. However, the PR is stale (6+ months old), marked as draft, has merge conflicts with the current main branch, and needs updates to pass CI.


1. MERGE CONFLICT ANALYSIS

Conflict Found: YES

File with Conflicts: lib/adf_dataset.py

Location: Lines 164-186

Nature of Conflict:

  • PR Version (HEAD): Adds new methods load_climo_dataset() and load_climo_da() with enhanced error handling and data loading logic
  • Main Branch: Simplified version that returns sorted files from glob() directly

Root Cause:
The PR attempts to refactor data loading methods while main branch simplified the get_climo_file() return logic. These are competing approaches to the same functionality.

Conflict Block:

# PR Version adds:
def load_climo_dataset(self, case, variablename):
    """Return Dataset for test climo of field"""
    fils = self.get_climo_file(case, variablename)
    if not fils:
        warnings.warn(...)
        return None
    return self.load_dataset(fils)

# vs. Main simplified to:
fils = sorted(model_cl_loc.glob(...))
return fils

Severity: Medium - Resolvable but requires careful consideration of design intent


2. CHANGES OVERVIEW

lib/adf_dataset.py (+112, -104)

Purpose: Refactor data loading methods for consistency and clarity

Key Changes:

  1. New Methods Added:

    • load_timeseries_dataset() - Load test case timeseries as Dataset
    • load_reference_timeseries_dataset() - Load reference timeseries as Dataset
    • load_climo_dataset() - Load test case climatology as Dataset
    • load_reference_climo_dataset() - Load reference climatology as Dataset
  2. API Modifications:

    • Removed case parameter from reference loading methods (consistency improvement)
    • Added update_unit() helper method for unit conversions
    • Enhanced load_da() to accept case parameter and handle unit conversions
  3. Error Handling:

    • More informative warning messages when files are missing
    • Null checks before accessing var_obs_dict
    • Consistent return patterns (None on failure)

Assessment:Good refactoring - Creates consistent API, reduces code duplication, improves maintainability


lib/adf_variable_defaults.yaml (+59, -482)

Purpose: Configuration cleanup and modernization

Changes:

  1. Removed deprecated settings:

    • pct_diff_contour_levels (hundreds of entries)
    • pct_diff_colormap entries across all variable categories
  2. Standardized units:

    • Converted LaTeX notation → Simple notation
    • Example: "g m$^{-2}$""g/m2"
  3. Added matplotlib colorbar configs:

    • For variables requiring special formatting

Assessment:Beneficial cleanup - Removes dead code, reduces config file size, modernizes unit notation


scripts/analysis/amwg_table.py (+7, -5)

Changes:

  • Added baseline_name variable initialization
  • Replaced manual file globbing with new ADF methods
  • Updated to use new load_da() API

Assessment:Proper adaptation - Uses new data loading interface consistently


Plotting Scripts (adf_histogram.py, aod_latlon.py, global_latlon_map.py, polar_map.py, zonal_mean.py)

Changes: Removed case/base_name/case_name parameters from reference data loading calls

Assessment:Consistent API - Reflects the API cleanup in adf_dataset.py


3. RELATED ISSUES

Closes Issues:

Note: Current status of these issues should be verified


4. CODE QUALITY ASSESSMENT

Strengths ✅

  • Architecture: Excellent separation of concerns - distinct methods for each data type
  • Consistency: Unified interface across different data loading patterns
  • Error Handling: Better null checks and informative warnings
  • Code Reduction: Net -413 lines (mostly config cleanup)
  • Documentation: Clear docstrings on new methods

Weaknesses ⚠️

  • Unit Conversion Logic: update_unit() implementation should be reviewed for edge cases
  • Missing Type Hints: No type annotations on new methods (minor)
  • Test Coverage: Verify tests exist for new methods

5. RECENT CHANGES TO MAIN THAT AFFECT THIS PR

Since PR was opened (Nov 2025), the following commits affected overlapping code:

  1. PR Fix None for fils in load_reference_climo_da #448 - "missing-ref-aod-climo-fix"
  2. PR Bring in climo file hanging during creation fix #453 - "climo-hang-fix"
    • Modified load_reference_climo_da() handling
  3. Commit 4eba549 - "Bring in @brianpm's climo file fix"
  4. Commit 1cdccf6 - "Fix None for fils in load_reference_climo_da"

Impact: These concurrent fixes to climatology loading overlap with PR #427's refactoring of the same methods. The merge conflict in adf_dataset.py is a direct result.


6. NEXT STEPS & RECOMMENDATIONS

Required Actions (Before Merge):

  1. Resolve Merge Conflict

    • Choose: embrace PR's wrapper method approach OR maintain main's simpler return logic
    • Recommendation: Adopt PR's approach - the wrapper methods with error handling are superior
  2. Reconcile with Recent Climatology Fixes

  3. Update CI/Tests

    • PR currently fails CI checks (draft status)
    • Verify all test suites pass after merge conflict resolution
    • Add tests for new load_climo_dataset(), load_timeseries_dataset() methods
  4. Code Review Points

    • Verify update_unit() handles all unit conversion cases correctly
    • Check that unit conversions previously done downstream still work with early application
    • Ensure backwards compatibility if old API is still used elsewhere
    • Review AMWG table formatting fix - does it address original issue completely?
  5. Documentation

    • Update any docstrings in adf_dataset.py that reference old method signatures
    • Document the new data loading workflow

Timeline Considerations:

Recommendation: PROCEED WITH MERGE


7. POTENTIAL ISSUES TO WATCH

  1. Unit Conversion Timing: Moving unit conversions earlier in pipeline could affect downstream analysis
  2. Backwards Compatibility: Any old code still using removed method signatures will break
  3. AMWG Table Formatting: Verify the table formatting fix actually resolves the visual/data issues
  4. Performance: Ensure new wrapper methods don't add overhead vs. direct file loading

Files Summary for Reviewer

File Type Risk Level Notes
lib/adf_dataset.py Core 🟡 Medium Has merge conflict, needs reconciliation with recent fixes
lib/adf_variable_defaults.yaml Config 🟢 Low Straightforward cleanup, no conflicts
scripts/analysis/amwg_table.py Script 🟢 Low Simple API update
adf_histogram.py Script 🟢 Low Simple API update
aod_latlon.py Script 🟢 Low Simple API update
global_latlon_map.py Script 🟢 Low Simple API update
polar_map.py Script 🟢 Low Simple API update
zonal_mean.py Script 🟢 Low Simple API update

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

analysis Related to data analysis and statistics bug Something isn't working code clean-up Made code simpler and/or easier to read. data handling Related to handling of data, ie file i/o, data workflow, etc

Projects

None yet

Development

Successfully merging this pull request may close these issues.

AMWG tables are not getting proper data cleaning

2 participants