Clean up Data handeling methods in `adf_dataset.py` and fix AMWG table formatting by justin-richling · Pull Request #427 · NCAR/ADF

justin-richling · 2025-11-14T19:50:14Z

This PR will clean up and organize the calls to read files, datasets, and data arrays depending on the file type, ie time series, climo, regridded climo.

The main changes here are to:

Get all the file type functionality consistent. (probably be able to simplify all these down to a couple of functions still)
Remove baseline/ref name from any of these methods since there will always be only one case, just get it in the methods
Add function in adf_dataset.py to gather new unit info update_unit and call that in load_da. This could be useful as an external method, similarly to get_value_converters
Update amwg_table.py to implement these changes

Important changes:

The modifications for scale factor, offset, new units, and time bound adjustment (for time series) will always happen if the data is drawn using these methods in adf_dataset.py. That was the issue with the tables being improperly formatted, see AMWG tables are not getting proper data cleaning #423

Now you can load data sets and arrays generically any where in the ADF by calling load_dataset(fils) and/or load_da(<list of files>, <variable name>, <case name>)

or through the three file types:
  load_reference_timeseries_dataset/load_timeseries_dataset
  load_reference_climo_dataset/load_climo_dataset
  load_reference_regrid_dataset/load_regrid_dataset

load_<file type>_dataset will inherently call load_dataset and
load_<file type>_da will inherently call load_da

calling load_dataset has a flag for time series files so that it can check the time bounds to ensure proper time alignment. This is only an issue for using older CAM/CESM files but this won't modify anything if the newer files are used.

This change will now force the ADF to figure out the modifications at the loading data array stage

Potential issues:
  user must make sure of the scale/factor, units if using any premade file; using files that aren't created via ADF from the history files level.
  Remedy: The user just has to make sure there are no values for these arguments in the config yaml file if their datasets have been already modified, or use even more generic xarray load via load_dataset from adf_utils
  Not likely but a definite issue for edge case scenarios.

Closes #423 and #424

Remove the dependency of the baseline case name since there will only be one baseline

This will now grab any scale factors and new units for time series files

Now if these methods are called it will apply the scale factors/offset/units and time bounds fixes (for time series)

Make `new_unit` not Tex formatting (mostly for AMWG table viewing) but move it to mpl args for plotting

brianpm · 2026-05-14T16:39:57Z

Hi @justin-richling -- Here's another analysis thanks to Claude. Looks like this one needs a little more attention, but the proposed changes are good.

PR #427 Complete Analysis & Review

PR: Clean up Data handling methods in adf_dataset.py and fix AMWG table formatting
Status: Open (Draft)
Created: November 14, 2025
Files Modified: 8 files
Net Changes: +184 lines, -597 lines (primarily config cleanup)

Executive Summary

PR #427 is a significant refactoring that improves data loading architecture and fixes AMWG table formatting issues. The work reorganizes how test and reference data are loaded, creating consistent wrapper methods for different data types. However, the PR is stale (6+ months old), marked as draft, has merge conflicts with the current main branch, and needs updates to pass CI.

1. MERGE CONFLICT ANALYSIS

Conflict Found: YES ❌

File with Conflicts: lib/adf_dataset.py

Location: Lines 164-186

Nature of Conflict:

PR Version (HEAD): Adds new methods load_climo_dataset() and load_climo_da() with enhanced error handling and data loading logic
Main Branch: Simplified version that returns sorted files from glob() directly

Root Cause:
The PR attempts to refactor data loading methods while main branch simplified the get_climo_file() return logic. These are competing approaches to the same functionality.

Conflict Block:

# PR Version adds:
def load_climo_dataset(self, case, variablename):
    """Return Dataset for test climo of field"""
    fils = self.get_climo_file(case, variablename)
    if not fils:
        warnings.warn(...)
        return None
    return self.load_dataset(fils)

# vs. Main simplified to:
fils = sorted(model_cl_loc.glob(...))
return fils

Severity: Medium - Resolvable but requires careful consideration of design intent

2. CHANGES OVERVIEW

lib/adf_dataset.py (+112, -104)

Purpose: Refactor data loading methods for consistency and clarity

Key Changes:

New Methods Added:
- load_timeseries_dataset() - Load test case timeseries as Dataset
- load_reference_timeseries_dataset() - Load reference timeseries as Dataset
- load_climo_dataset() - Load test case climatology as Dataset
- load_reference_climo_dataset() - Load reference climatology as Dataset
API Modifications:
- Removed case parameter from reference loading methods (consistency improvement)
- Added update_unit() helper method for unit conversions
- Enhanced load_da() to accept case parameter and handle unit conversions
Error Handling:
- More informative warning messages when files are missing
- Null checks before accessing var_obs_dict
- Consistent return patterns (None on failure)

Assessment: ✅ Good refactoring - Creates consistent API, reduces code duplication, improves maintainability

lib/adf_variable_defaults.yaml (+59, -482)

Purpose: Configuration cleanup and modernization

Changes:

Removed deprecated settings:
- pct_diff_contour_levels (hundreds of entries)
- pct_diff_colormap entries across all variable categories
Standardized units:
- Converted LaTeX notation → Simple notation
- Example: "g m$^{-2}$" → "g/m2"
Added matplotlib colorbar configs:
- For variables requiring special formatting

Assessment: ✅ Beneficial cleanup - Removes dead code, reduces config file size, modernizes unit notation

scripts/analysis/amwg_table.py (+7, -5)

Changes:

Added baseline_name variable initialization
Replaced manual file globbing with new ADF methods
Updated to use new load_da() API

Assessment: ✅ Proper adaptation - Uses new data loading interface consistently

Plotting Scripts (adf_histogram.py, aod_latlon.py, global_latlon_map.py, polar_map.py, zonal_mean.py)

Changes: Removed case/base_name/case_name parameters from reference data loading calls

Assessment: ✅ Consistent API - Reflects the API cleanup in adf_dataset.py

3. RELATED ISSUES

Closes Issues:

AMWG tables are not getting proper data cleaning #423 - Related to data handling cleanup
Clean up data set and data array calls from adf_dataset.py #424 - Related to table formatting

Note: Current status of these issues should be verified

4. CODE QUALITY ASSESSMENT

Strengths ✅

Architecture: Excellent separation of concerns - distinct methods for each data type
Consistency: Unified interface across different data loading patterns
Error Handling: Better null checks and informative warnings
Code Reduction: Net -413 lines (mostly config cleanup)
Documentation: Clear docstrings on new methods

Weaknesses ⚠️

Unit Conversion Logic: update_unit() implementation should be reviewed for edge cases
Missing Type Hints: No type annotations on new methods (minor)
Test Coverage: Verify tests exist for new methods

5. RECENT CHANGES TO MAIN THAT AFFECT THIS PR

Since PR was opened (Nov 2025), the following commits affected overlapping code:

PR Fix None for fils in load_reference_climo_da #448 - "missing-ref-aod-climo-fix"
PR Bring in climo file hanging during creation fix #453 - "climo-hang-fix"
- Modified load_reference_climo_da() handling
Commit 4eba549 - "Bring in @brianpm's climo file fix"
Commit 1cdccf6 - "Fix None for fils in load_reference_climo_da"

Impact: These concurrent fixes to climatology loading overlap with PR #427's refactoring of the same methods. The merge conflict in adf_dataset.py is a direct result.

6. NEXT STEPS & RECOMMENDATIONS

Required Actions (Before Merge):

Resolve Merge Conflict
- Choose: embrace PR's wrapper method approach OR maintain main's simpler return logic
- Recommendation: Adopt PR's approach - the wrapper methods with error handling are superior
Reconcile with Recent Climatology Fixes
- PR Bring in climo file hanging during creation fix #453 and commits 4eba549/1cdccf6 fixed issues in load_reference_climo_da()
- Ensure PR Clean up Data handeling methods in adf_dataset.py and fix AMWG table formatting #427's refactoring preserves these fixes
- May need to port fixes into PR's wrapper methods
Update CI/Tests
- PR currently fails CI checks (draft status)
- Verify all test suites pass after merge conflict resolution
- Add tests for new load_climo_dataset(), load_timeseries_dataset() methods
Code Review Points
- Verify update_unit() handles all unit conversion cases correctly
- Check that unit conversions previously done downstream still work with early application
- Ensure backwards compatibility if old API is still used elsewhere
- Review AMWG table formatting fix - does it address original issue completely?
Documentation
- Update any docstrings in adf_dataset.py that reference old method signatures
- Document the new data loading workflow

Timeline Considerations:

PR has been open for ~6 months (since Nov 2025)
Multiple other PRs (add 1degree version of variable defaults yaml #437, Add 'workflow_dispatch' trigger to Github Action workflows. #438, Handle empty dict if no git info is available #444, Fix None for fils in load_reference_climo_da #448, Bring in climo file hanging during creation fix #453) have merged in the meantime
This creates technical debt - the longer it stays open, the more conflicts will accumulate

Recommendation: PROCEED WITH MERGE ✅

The refactoring is sound and well-designed
Recent climatology fixes should be incorporated into the refactored methods
This PR will improve code quality and fix real issues (AMWG tables are not getting proper data cleaning #423, Clean up data set and data array calls from adf_dataset.py #424)
Allocate ~2-3 hours for conflict resolution and testing

7. POTENTIAL ISSUES TO WATCH

Unit Conversion Timing: Moving unit conversions earlier in pipeline could affect downstream analysis
Backwards Compatibility: Any old code still using removed method signatures will break
AMWG Table Formatting: Verify the table formatting fix actually resolves the visual/data issues
Performance: Ensure new wrapper methods don't add overhead vs. direct file loading

Files Summary for Reviewer

File	Type	Risk Level	Notes
lib/adf_dataset.py	Core	🟡 Medium	Has merge conflict, needs reconciliation with recent fixes
lib/adf_variable_defaults.yaml	Config	🟢 Low	Straightforward cleanup, no conflicts
scripts/analysis/amwg_table.py	Script	🟢 Low	Simple API update
adf_histogram.py	Script	🟢 Low	Simple API update
aod_latlon.py	Script	🟢 Low	Simple API update
global_latlon_map.py	Script	🟢 Low	Simple API update
polar_map.py	Script	🟢 Low	Simple API update
zonal_mean.py	Script	🟢 Low	Simple API update

justin-richling added 8 commits November 13, 2025 13:52

Update all scripts for new reference data load

5344f1b

Remove the dependency of the baseline case name since there will only be one baseline

Use adf_dataset method to get time series data

419e68b

This will now grab any scale factors and new units for time series files

Expand on helper functions for data loading

83ff9aa

Now if these methods are called it will apply the scale factors/offset/units and time bounds fixes (for time series)

Update adf_variable_defaults.yaml

59e2814

Make `new_unit` not Tex formatting (mostly for AMWG table viewing) but move it to mpl args for plotting

Update adf_dataset.py

6aa204b

Update adf_dataset.py

a72b781

Update adf_dataset.py

ad7e7ee

Update adf_dataset.py

bd5b5b1

justin-richling added bug Something isn't working code clean-up Made code simpler and/or easier to read. analysis Related to data analysis and statistics data handling Related to handling of data, ie file i/o, data workflow, etc labels Nov 14, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean up Data handeling methods in `adf_dataset.py` and fix AMWG table formatting#427

Clean up Data handeling methods in `adf_dataset.py` and fix AMWG table formatting#427
justin-richling wants to merge 8 commits into
NCAR:mainfrom
justin-richling:adf-dataset-cleanup

justin-richling commented Nov 14, 2025

Uh oh!

brianpm commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

justin-richling commented Nov 14, 2025

Uh oh!

brianpm commented May 14, 2026

PR #427 Complete Analysis & Review

Executive Summary

1. MERGE CONFLICT ANALYSIS

Conflict Found: YES ❌

2. CHANGES OVERVIEW

lib/adf_dataset.py (+112, -104)

lib/adf_variable_defaults.yaml (+59, -482)

scripts/analysis/amwg_table.py (+7, -5)

Plotting Scripts (adf_histogram.py, aod_latlon.py, global_latlon_map.py, polar_map.py, zonal_mean.py)

3. RELATED ISSUES

4. CODE QUALITY ASSESSMENT

Strengths ✅

Weaknesses ⚠️

5. RECENT CHANGES TO MAIN THAT AFFECT THIS PR

6. NEXT STEPS & RECOMMENDATIONS

Required Actions (Before Merge):

Timeline Considerations:

Recommendation: PROCEED WITH MERGE ✅

7. POTENTIAL ISSUES TO WATCH

Files Summary for Reviewer

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants