Restructure the library #409

@dfulu

Description

The current repo structure is a bit strange. We inherited some of the layout from ocf_datapipes, some of it is legacy from when we had separate UK and site datasets, and some of it just grew organically without top-down thought.

I think we should restructure the repo as part of the breaking changes in the next big release (#400).

Possible Implementation

Here's a draft suggestion of a new repo structure for this library.

.
├── README.md
├── LICENSE
├── pyproject.toml
├── uv.lock
├── src/
│   └── ocf_data_sampler/
│       ├── __init__.py
│       │
│       ├── common/          # shared low level functionality
│       │   ├── __init__.py
│       │   ├── indexing.py          # (select/get_indices_in_sorted_unique.py + some from load/utils.py)
│       │   ├── lightarray.py        # (lightarray.py) lightweight array wrapper over NumPy / TensorStore-backed arrays
│       │   ├── datetime.py          # (time_utils.py) generic datetime utilities
│       │   └── types.py             # (partially from common_types.py) reusable shared type aliases
│       │
│       ├── spatial/         # coordinate systems and location semantics
│       │   ├── __init__.py
│       │   ├── location.py           # (select/location.py) Location object and coordinate-projection bookkeeping
│       │   ├── transforms.py         # (select/geospatial.py) coordinate transform functions
│       │   └── coords.py             # find_coord_system and related coord helpers
│       │
│       ├── config/          # keep as-is
│       │   ├── __init__.py
│       │   ├── load.py
│       │   ├── model.py
│       │   └── save.py
│       │
│       ├── load/            # raw data opening; returns validated xarray objects
│       │   ├── __init__.py
│       │   ├── generation.py      # open + validate generation data
│       │   ├── satellite.py       # open + validate satellite data
│       │   ├── tensorstore.py     # (load/open_xarray_tensorstore.py)
│       │   ├── validation.py      # xarray validation helpers from load/utils.py
│       │   └── [nwp/ or nwp.py]          # open + validate NWP data. If possible, we should refactor this into a generic loader so this submodule isn't needed
│       │       ├── __init__.py
│       │       ├── registry.py           # provider dispatch currently living in load/nwp/nwp.py
│       │       └── readers/     
│       │           ├── __init__.py
│       │           ├── cloudcasting.py   
│       │           ├── ecmwf.py
│       │           ├── gdm.py
│       │           ├── gfs.py            # Consider dropping this file. It's forcing us to keep the dask backend around
│       │           ├── icon.py
│       │           ├── ukv.py
│       │           └── utils.py          # Optional: this file should be dissolved into other files and kept only if shared logic remains
│       │
│       ├── select/          # selection logic over already-opened arrays
│       │   ├── __init__.py
│       │   ├── time_periods.py    # (find_contiguous_time_periods.py + fill_time_periods.py) Find contiguous periods of data and available t0s
│       │   ├── dropout.py         # dropout application
│       │   ├── spatial_slice.py   # (select/select_spatial_slice.py) slice around Location
│       │   └── time_slice.py      # (select/select_time_slice.py) slice around t0
│       │
│       ├── features/        # feature engineering
│       │   ├── __init__.py
│       │   ├── solar.py           # (numpy_sample/sun_position.py) solar azimuth/elevation features
│       │   └── datetime.py        # (numpy_sample/datetime_features.py) datetime encodings + t0 embedding
│       │
│       └── datasets/              # (torch_datasets) dataset-facing orchestration layer
│           ├── __init__.py
│           └── pvnet/
│               ├── __init__.py
│               ├── README.md
│               ├── dataset.py            # (torch_datasets/pvnet_dataset.py)
│               ├── loading.py            # (torch_datasets/load/load_dataset.py)  build datasets_dict from config
│               ├── materialize.py        # (utils.py) nested dict I/O helpers
│               ├── *normalization.py     # (torch_datasets/utils/config_normalization_values_to_dicts.py) extract/apply normalization stats
│               ├── preprocess.py         # (torch_datasets/utils/diff_nwp_data.py + select/diff_channels.py + torch_datasets/utils/fill_nans.py + value scaling (apply mean, std, clip etc)) Channel diff, scaling, and filling nans
│               ├── *projections.py       # (torch_datasets/utils/add_alterate_coordinate_projections.py) add alternate coordinate projections to Location objects
│               ├── sample.py             # (numpy_sample/convert.py + the sun- and datetime-to-sample wrapping)
│               ├── slicing.py            # (torch_datasets/utils/[spatial/time]_slice_for_dataset.py) dataset-level spatial and time slicing
│               └── valid_t0s.py          # (torch_datasets/utils/valid_time_periods.py) find the available t0 times for dataset
│
└── tests/    # mirror structure of src
        └── ...

* I'd like to try to refactor these modules so that the files marked with an asterisk are no longer needed.
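Since this would be a breaking change, downstream code will need its imports updated. Here is a minimal sketch of how the old-to-new mapping implied by the tree could be captured and applied. It only covers a few moves whose old paths are spelled out in the tree, it assumes the old paths sit under the `ocf_data_sampler` package, and `MODULE_MOVES` and `rewrite_imports` are hypothetical names, not part of the library.

```python
import re

# Old dotted module path -> proposed new path, read off the tree annotations.
# Illustrative subset only; the final names may well change during the refactor.
MODULE_MOVES = {
    "ocf_data_sampler.select.location": "ocf_data_sampler.spatial.location",
    "ocf_data_sampler.select.geospatial": "ocf_data_sampler.spatial.transforms",
    "ocf_data_sampler.load.open_xarray_tensorstore": "ocf_data_sampler.load.tensorstore",
    "ocf_data_sampler.numpy_sample.sun_position": "ocf_data_sampler.features.solar",
    "ocf_data_sampler.numpy_sample.datetime_features": "ocf_data_sampler.features.datetime",
    "ocf_data_sampler.torch_datasets.pvnet_dataset": "ocf_data_sampler.datasets.pvnet.dataset",
}

# Longest paths first so a longer old path is never shadowed by a shorter one.
_OLD_PATHS = sorted(MODULE_MOVES, key=len, reverse=True)
_PATTERN = re.compile(r"\b(" + "|".join(re.escape(p) for p in _OLD_PATHS) + r")\b")


def rewrite_imports(source: str) -> str:
    """Replace fully dotted old module paths in `source` with their new paths."""
    return _PATTERN.sub(lambda match: MODULE_MOVES[match.group(1)], source)


if __name__ == "__main__":
    print(rewrite_imports("from ocf_data_sampler.select.location import Location"))
```

This only catches fully dotted paths. Imports written as `from ocf_data_sampler.select import location` would need separate handling.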

I think this new repo structure would be cleaner and more maintainable. The repo should still be recognisable and easy to edit by those already familiar with it.
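The tree has `tests/` mirroring the structure of `src`. One way this could be checked automatically is sketched below. The `test_<module>.py` naming convention and the helper name `missing_test_files` are assumptions for illustration, not something the repo currently defines.

```python
from pathlib import Path


def missing_test_files(package_dir: Path, tests_dir: Path) -> list[Path]:
    """List expected test files (relative to tests_dir) that do not exist.

    Assumes each module `a/b.py` under package_dir should have a
    counterpart `a/test_b.py` under tests_dir. `__init__.py` files are skipped.
    """
    missing = []
    for module in sorted(package_dir.rglob("*.py")):
        if module.name == "__init__.py":
            continue
        relative = module.relative_to(package_dir)
        expected = relative.with_name(f"test_{relative.name}")
        if not (tests_dir / expected).exists():
            missing.append(expected)
    return missing


if __name__ == "__main__":
    for path in missing_test_files(Path("src/ocf_data_sampler"), Path("tests")):
        print(f"missing: tests/{path}")
```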

An additional change I'd like to see is decomposing torch_datasets/pvnet_dataset.py a bit more. That module seems to have too many responsibilities, and it should really act as more of a thin wrapper around functions.
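To make the "wrapper of functions" idea concrete, here is a rough sketch of the shape I have in mind. It is not the current implementation: `ThinDataset`, `slice_data`, and `steps` are hypothetical names, and it assumes for simplicity that every (t0, location) pair is a valid sample. The class only does index bookkeeping, and each processing step is a plain function that could live in its own module.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

Sample = dict[str, Any]


@dataclass
class ThinDataset:
    """Hypothetical dataset that only orchestrates plain functions."""

    t0_times: Sequence[Any]
    locations: Sequence[Any]
    # (t0, location) -> raw sample, e.g. the spatial and time slicing
    slice_data: Callable[[Any, Any], Sample]
    # Applied in order, e.g. preprocessing, feature engineering, conversion
    steps: Sequence[Callable[[Sample], Sample]]

    def __len__(self) -> int:
        # Assumption: all (t0, location) combinations are valid samples
        return len(self.t0_times) * len(self.locations)

    def __getitem__(self, idx: int) -> Sample:
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        t0_idx, loc_idx = divmod(idx, len(self.locations))
        sample = self.slice_data(self.t0_times[t0_idx], self.locations[loc_idx])
        for step in self.steps:
            sample = step(sample)
        return sample
```

With this shape, each step can be unit tested on its own, and the dataset test only needs to check that the steps are called in order on the right (t0, location) pair.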


Labels: enhancement (New feature or request), ocf-internal (An issue to be addressed internally by Open Climate Fix and not suitable for external contributors)