Restructure the library

The current repo structure is a bit strange. We inherited some of the layout from `ocf_datapipes`, some of it is legacy from when we had separate UK and site datasets, and some of it just grew organically without top-down thought.

I think we should restructure the repo as part of the breaking changes planned for the next big release (#400).


## Possible Implementation

Here's a draft suggestion of a new repo structure for this library.

```
.
├── README.md
├── LICENSE
├── pyproject.toml
├── uv.lock
├── src/
│   └── ocf_data_sampler/
│       ├── __init__.py
│       │
│       ├── common/          # shared low level functionality
│       │   ├── __init__.py
│       │   ├── indexing.py          # (select/get_indices_in_sorted_unique.py + some from load/utils.py)
│       │   ├── lightarray.py        # (lightarray.py) lightweight array wrapper over NumPy / TensorStore-backed arrays
│       │   ├── datetime.py          # (time_utils.py) generic datetime utilities
│       │   └── types.py             # (partially from common_types.py) reusable shared type aliases
│       │
│       ├── spatial/         # coordinate systems and location semantics
│       │   ├── __init__.py
│       │   ├── location.py           # (select/location.py) Location object and coordinate-projection bookkeeping
│       │   ├── transforms.py         # (select/geospatial.py) coordinate transform functions
│       │   └── coords.py             # find_coord_system and related coord helpers
│       │
│       ├── config/          # keep as-is
│       │   ├── __init__.py
│       │   ├── load.py
│       │   ├── model.py
│       │   └── save.py
│       │
│       ├── load/            # raw data opening; returns validated xarray objects
│       │   ├── __init__.py
│       │   ├── generation.py      # open + validate generation data
│       │   ├── satellite.py       # open + validate satellite data
│       │   ├── tensorstore.py     # (load/open_xarray_tensorstore.py)
│       │   ├── validation.py      # xarray validation helpers from load/utils.py
│       │   └── [nwp/ or nwp.py]          # open + validate NWP data. If possible, we should refactor this into a generic loader so this submodule isn't needed
│       │       ├── __init__.py
│       │       ├── registry.py           # provider dispatch currently living in load/nwp/nwp.py
│       │       └── readers/     
│       │           ├── __init__.py
│       │           ├── cloudcasting.py   
│       │           ├── ecmwf.py
│       │           ├── gdm.py
│       │           ├── gfs.py            # Consider dropping this file. It's forcing us to keep the dask backend around
│       │           ├── icon.py
│       │           ├── ukv.py
│       │           └── utils.py          # Optional: this file should be dissolved into other files. Keep only if shared logic remains
│       │
│       ├── select/          # selection logic over already-opened arrays
│       │   ├── __init__.py
│       │   ├── time_periods.py    # (find_contiguous_time_periods.py + fill_time_periods.py) Find contiguous periods of data and available t0s
│       │   ├── dropout.py         # dropout application
│       │   ├── spatial_slice.py   # (select/select_spatial_slice.py) slice around Location
│       │   └── time_slice.py      # (select/select_time_slice.py) slice around t0
│       │
│       ├── features/        # feature engineering
│       │   ├── __init__.py
│       │   ├── solar.py           # (numpy_sample/sun_position.py) solar azimuth/elevation features
│       │   └── datetime.py        # (numpy_sample/datetime_features.py) datetime encodings + t0 embedding
│       │
│       └── datasets/              # (torch_datasets) dataset-facing orchestration layer
│           ├── __init__.py
│           └── pvnet/
│               ├── __init__.py
│               ├── README.md
│               ├── dataset.py            # (torch_datasets/pvnet_dataset.py)
│               ├── loading.py            # (torch_datasets/load/load_dataset.py)  build datasets_dict from config
│               ├── materialize.py        # (utils.py) nested dict I/O helpers
│               ├── *normalization.py     # (torch_datasets/utils/config_normalization_values_to_dicts.py) extract/apply normalization stats
│               ├── preprocess.py         # (torch_datasets/utils/diff_nwp_data.py + select/diff_channels.py + torch_datasets/utils/fill_nans.py + value scaling (apply mean, std, clip etc)) Channel diff, scaling, and filling nans
│               ├── *projections.py       # (torch_datasets/utils/add_alterate_coordinate_projections.py) add alternate coordinate projections to Location objects
│               ├── sample.py             # (numpy_sample/convert.py + the sun and datetime -to-sample wrapping)
│               ├── slicing.py            # (torch_datasets/utils/[spatial/time]_slice_for_dataset.py) dataset-level spatial and time slicing
│               └── valid_t0s.py          # (torch_datasets/utils/valid_time_periods.py) find the available t0 times for dataset
│
└── tests/    # mirror structure of src
    └── ...

* I'd like to refactor the modules marked with an asterisk so that these files are no longer needed
```

I think this new repo structure would be cleaner and more maintainable. The repo should still be recognisable and easy to edit by those already familiar with it. 
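To make the `registry.py` idea in the tree above more concrete, here is a minimal sketch of provider dispatch. All names in it (`register_reader`, `open_nwp`, the stub readers) are hypothetical and only illustrate the pattern; they are not the library's current API, and the real readers would open and validate xarray data rather than return a dict.

```python
from collections.abc import Callable
from typing import Any

# Hypothetical type alias: a reader takes a path and returns the opened data.
Reader = Callable[[str], Any]

# Hypothetical registry mapping a lower-cased provider name to its reader.
_READERS: dict[str, Reader] = {}


def register_reader(provider: str) -> Callable[[Reader], Reader]:
    """Return a decorator that registers a reader under a provider name."""

    def decorator(func: Reader) -> Reader:
        key = provider.lower()
        if key in _READERS:
            raise ValueError(f"Reader already registered for provider '{provider}'")
        _READERS[key] = func
        return func

    return decorator


def open_nwp(path: str, provider: str) -> Any:
    """Look up the reader for the provider (case-insensitive) and call it."""
    try:
        reader = _READERS[provider.lower()]
    except KeyError:
        known = ", ".join(sorted(_READERS))
        raise ValueError(
            f"Unknown NWP provider '{provider}'. Known providers: {known}"
        ) from None
    return reader(path)


# Stub readers standing in for readers/ukv.py and readers/ecmwf.py.
@register_reader("ukv")
def _open_ukv(path: str) -> dict[str, str]:
    return {"provider": "ukv", "path": path}


@register_reader("ecmwf")
def _open_ecmwf(path: str) -> dict[str, str]:
    return {"provider": "ecmwf", "path": path}
```

With this shape, adding a provider means adding one file under `readers/` and one decorator, and a generic loader could later replace the per-provider stubs without changing the callers of `open_nwp`.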

Additionally, I'd like to decompose `torch_datasets/pvnet_dataset.py` a bit more. That module seems to have too many responsibilities, and it should really act as more of a thin wrapper around functions defined elsewhere.
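As a rough illustration of what "a wrapper of functions" could look like, the sketch below keeps the dataset class down to orchestration and pushes all logic into plain functions. Everything here is an assumption made for illustration: `ThinDataset`, `load_stub`, `fill_nans`, and `scale` are hypothetical names, samples are plain dicts of floats, and a real implementation would subclass `torch.utils.data.Dataset` and work on arrays.

```python
import math
from collections.abc import Callable, Sequence
from dataclasses import dataclass

# Hypothetical sample type: a flat dict of floats keeps the sketch dependency-free.
Sample = dict[str, float]
Step = Callable[[Sample], Sample]


@dataclass
class ThinDataset:
    """Hypothetical dataset that only orchestrates; the steps hold the logic."""

    t0s: Sequence[int]
    load: Callable[[int], Sample]
    steps: Sequence[Step]

    def __len__(self) -> int:
        return len(self.t0s)

    def __getitem__(self, idx: int) -> Sample:
        # Load the raw sample for this t0, then apply each step in order.
        sample = self.load(self.t0s[idx])
        for step in self.steps:
            sample = step(sample)
        return sample


def load_stub(t0: int) -> Sample:
    """Stand-in loader: returns NaN for t0 == 0 to exercise fill_nans."""
    return {"generation": float("nan") if t0 == 0 else float(t0)}


def fill_nans(sample: Sample) -> Sample:
    """Replace NaN values with 0.0."""
    return {k: 0.0 if math.isnan(v) else v for k, v in sample.items()}


def scale(sample: Sample) -> Sample:
    """Divide every value by 100 (an arbitrary scale chosen for the example)."""
    return {k: v / 100.0 for k, v in sample.items()}


dataset = ThinDataset(t0s=[0, 50], load=load_stub, steps=[fill_nans, scale])
```

Each step can then live in its own module (for example `preprocess.py` or `slicing.py`) and be unit-tested without constructing a dataset.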
