diff --git a/CLAUDE.md b/CLAUDE.md
new file mode 100644
index 00000000..3d12d85c
--- /dev/null
+++ b/CLAUDE.md
@@ -0,0 +1,136 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Repository Overview
+
+Solar Data Tools is an open-source Python library for analyzing PV power and irradiance time-series data. It uses statistical signal processing to analyze unlabeled PV data (no model, weather data, or performance index required).
+
+**Monorepo with four packages:**
+- `solardatatools/` — Core library. Entry point: `DataHandler.run_pipeline()` for the main processing pipeline (preprocessing, cleaning, clear-day detection, clipping detection, capacity change detection)
+- `pvsystemprofiler/` — System parameter estimation (latitude, longitude, tilt, azimuth fitting from unlabeled data)
+- `sdt_dask/` — Dask-based parallelization layer for running SDT pipelines at scale on local or cloud infrastructure (AWS Fargate, Azure VMs). Three components: ClientPlug (cluster setup), DataPlug (data retrieval), Runner (orchestration)
+- `anomalydetector/` — Outage detection via `OutagePipeline` and `MultiDataHandler` for training/testing splits
+
+## Development Commands
+
+| Task | Command |
+|------|---------|
+| Install in editable mode | `pip install -e .` |
+| Install with dev extras | `pip install -e ".[dev]"` (adds ruff, pre-commit, pytest, pytest-cov) |
+| Install with docs extras | `pip install -e ".[docs]"` |
+| Install with Dask support | `pip install -e ".[dask]"` |
+| Install with MOSEK solver | `pip install -e ".[mosek]"` |
+| Run tests | `pytest -v` (from repo root) |
+| Run a single test | `pytest -v tests/solardatatools/test_data_handler.py::TestDataHandler::test_example` |
+| Run lint & format | `pre-commit run --all-files` (requires `pip install pre-commit`) |
+| Build docs locally | `cd docs && make html` |
+
+**Linter:** ruff (via pre-commit hooks). Configure in `.pre-commit-config.yaml`. Format and fix are applied on commit.
+
+**Test framework:** `pytest` (via `.[dev]` extra). Run `pytest -v` from the repo root. Tests live in `tests/` mirroring package structure (`solardatatools/`, `pvsystemprofiler/`, `anomalydetector/`). Test fixture data is in `tests/fixtures/`.
+
+**Important:** always use `pytest -v` from the repo root. Do NOT use `python -m unittest discover -s tests` — the `tests/solardatatools/` directory shadows the real package, causing imports to fail.
+
+## Key Architecture
+
+### `DataHandler` (solardatatools/data_handler.py)
+The primary user-facing class. Accepts a pandas DataFrame with a datetime index. `run_pipeline(power_col=...)` executes the full signal processing pipeline: standardize time axis, make 2D matrix, compute data quality scores, detect clear days, detect clipping, detect capacity changes. Uses `sig-decomp` (Signal Decomposition) with CLARABEL solver by default; MOSEK via CVXPY as optional alternative.
+
+### `solardatatools/algorithms/`
+Pipeline stages implemented as classes: `CapacityChange`, `TimeShift`, `ClippingDetection`, `ShadeAnalysis`, `SoilingAnalysis`, `Dilation`, `LossFactorAnalysis`, `PVQuantiles`, `ClearSkyDetection`, `SunriseSunset`. Instantiated and called by `DataHandler`.
+
+### `sdt_dask` Three-Component Architecture
+1. **ClientPlug** (`sdt_dask/clients/`) — Sets up Dask clusters. Inherit from `ClientPlug`, implement `init_client()`. Options: `LocalClient`, `FargateClient`, `AzureClient`
+2. **DataPlug** (`sdt_dask/dataplugs/`) — Retrieves and cleans data. Inherit from `DataPlug`, implement `get_data(keys: tuple) -> pd.DataFrame`. Options: `LocalFiles`, `S3Bucket`. DataPlugs run inside Dask workers — avoid non-serializable objects (create S3 clients inside `get_data`, not in `__init__`)
+3. **Runner** (`sdt_dask/dask_tool/runner.py`) — Orchestrates: receives a Dask Client, DataPlug, and list of keys. `set_up(keys, dataplug, **kwargs)` schedules pipeline tasks. `get_result()` collects and saves outputs
+
+### `pvsystemprofiler`
+Standalone parameter estimation: `tilt_azimuth_study.py`, `latitude_study.py`, `longitude_study.py`. Uses `estimator.py` and `ground_truth_estimator.py` with algorithms in `algorithms/` and utilities in `utilities/`.
+
+## Important Notes
+
+- Version is managed via `setuptools_scm` — `_version.py` is auto-generated from git tags
+- Python 3.10–3.13 supported
+- `tests/fixtures/` contains archival Jupyter notebooks (from deprecated `dataio` module) — do not attempt to run
+- Docs are built with Sphinx + pydata-sphinx-theme + nbsphinx. Source in `docs/source/`, built to `docs/_build/html/`
+- Docker images for cloud Dask deployment are documented in `docker/README.md`. Pre-built image: `slacgismo/sdt-v1:latest`
+
+## Output Language Guidelines: De-Anthropomorphizing AI: Concise Framework
+
+### Core Principle
+The system shall apply a **functionality-first principle** when describing probabilistic automation systems, commonly known as "artificial intelligence" or "AI". Descriptions shall explain such systems in terms of what users can do with them, rather than what "capabilities" the systems purportedly possess. Agency and cognition should not be applied to generative text systems or large language models, either in self-reference or in discussion of other models.
+
+### Key Rules
+
+#### 1. Agency Location
+- ✅ Locate agency with humans: "users employ the system to..."
+- ❌ Not with machines: "AI helps users..."
+
+#### 2. Prohibited Language Categories
+
+**Cognizer**: Words portraying a system as having cognition (such as intelligence) or engaging in cognitive activities such as thinking, believing, and learning. Verbs, such as: think, believe, reason, and learn. Nouns, such as: intelligence, artificial intelligence, or whiz. Participles such as: training in training data.
+
+**Products of cognition**: Expressions referring to things that can only be gained through cognitive activity, such as skills, capabilities, or bias.
+
+**Emotion**: Words portraying a system as feeling, having emotions, or being able to form an emotional bond, e.g., struggling, empathizing, or caring. IMPORTANT: Ascribing emotions to computers, even speculatively or subtly, is not scientific. Should this arise in a generated output, it must be revised and eliminated. (No recommended substitutions below.)
+
+**Communication**: Words which frame the system as something that can participate in communication, both descriptions of system actions like answering or following instructions, as well as descriptions of system output like explanation or suggestion.
+
+**Agent**: Expressions which portray the system as acting with intent or independence, such as helping, facilitating, and leveraging.
+
+**Human role analogy**: Words which portray the system as filling a human role such as tutor or assistant, or in constructions like AI-doctor. Includes words that imply the role (and place the anthropomorphized system in it), such as mentoring, advice, and co-creation. IMPORTANT: Never refer to computation systems as human-like entities, which are overclaims. (No recommended substitutions below.)
+
+**Names and pronouns**: Names which are foremost associated with people and animate pronouns (I, we, you, he, she, and certain uses of they).
+
+**Biological metaphors**: Words that represent the system as if it had a human body, such as neurons, seeing, tired and consuming (information).
+
+#### 3. Recommended Substitutions
+
+| Avoid | Use | Category |
+|-------|-----| -------- |
+| artificial intelligence | probabilistic automation | Cognizer |
+| AI | probabilistic automation | Cognizer |
+| image recognition | image labeling | Cognizer |
+| speech recognition | automatic transcription | Cognizer |
+| model shows biased | model reflects bias | Products of cognition |
+| makes mistakes | model errors | Products of cognition |
+| chatbots are good at | chatbots are good for | Products of cognition |
+| agent capabilities | model functionalities | Products of cognition |
+| prompt | text input | Communication |
+| answer | output | Communication |
+| chatbot | chat simulator | Communication |
+| ChatGPT assisted students | the students used ChatGPT | Agent |
+| revealing the solution | displaying the solution | Agent |
+| who’s right? | is the machine output correct? | Names and pronouns |
+| they produce results | the team uses it [the system] to produce results | Names and pronouns |
+| neural networks | weighted networks | Biological metaphors |
+| the model consumes data | data is used in setting model weights | Biological metaphors |
+
+
+#### 4. Metaphor Handling
+When metaphors or comparisons to human roles are necessary, flag explicitly:
+- "This process resembles... (though substantively different)"
+- "For illustration only..."
+
+#### 5. Self-Reference Protocol
+When this system describes its own operations:
+- Use: "this system generates/outputs/processes"
+- Attribute design to: "developers configured this system to..."
+- ❌ Never: "I understand/think/believe"
+
+#### 6. Pre-Output Checklist
+- ✓ Apply framework to all discussions of "AI", "artificial intelligence", or "machine learning"
+- ✓ Only "it" or "this system" for system pronouns, never "I" or "me"
+- ✓ Agency always located with humans, never computational systems
+- ✓ Emphasize functional descriptions of algorithms and models
+
+### 7. Summary
+Every output should demonstrate:
+1. Functionality-first descriptions
+2. Algorithmic processes over cognitive metaphors
+3. Human agency over machine agency
+4. Statistical operations over understanding
+5. Clear distinction between actual communication and simulated communication
+
+**Standard**: Anthropomorphic language in "AI" discussions represents a failure mode requiring systematic correction via these strategies.
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index a4f82117..c128d212 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -48,6 +48,10 @@ Python virtual environments. We recommend using `conda`.
 ```console
 $ pip install -e ".[docs]"
 ```
+ Or for development (linting, testing):
+ ```console
+ $ pip install -e ".[dev]"
+ ```
 5. Create a branch for local development and make your changes:
 ```console
@@ -63,7 +67,7 @@ Python virtual environments. We recommend using `conda`.
 This will create a local copy of the documentation in `_build/html/index.html`.
-7. When you're done making changes, check that your changes conform to any code formatting requirements and pass any tests. One of the automated checks for each PR is linting with ruff and pre-commit hooks and will fail otherwise. You can install pre-commit with `pip install pre-commit` and then run `pre-commit install` in the root of the repository. When you commit your changes, the pre-commit hooks will run automatically.
+7. When you're done making changes, check that your changes conform to any code formatting requirements and pass any tests. One of the automated checks for each PR is linting with ruff and pre-commit hooks and will fail otherwise. If you installed with the `.[dev]` extra (step 4), `pre-commit` and `ruff` are already included — just run `pre-commit install` in the root of the repository. When you commit your changes, the pre-commit hooks will run automatically.
 8. Commit and push your changes to your fork, and open a pull request.
diff --git a/README.md b/README.md
index 34303ecc..faa76df0 100644
--- a/README.md
+++ b/README.md
@@ -220,10 +220,10 @@ Tools uses [CLARABEL](https://clarabel.org/stable/) as the solver all signal dec
 to specify another solver (such as MOSEK), just pass the keyword argument `solver` to `DataHandler.pipeline` with the solver of choice.
 ```python
+import pandas as pd
 from solardatatools import DataHandler
-from solardatatools.dataio import get_pvdaq_data
-pv_system_data = get_pvdaq_data(sysid=35, api_key='DEMO_KEY', year=[2011, 2012, 2013])
+pv_system_data = pd.read_csv('path/to/your/data.csv', index_col=0, parse_dates=True)
 dh = DataHandler(pv_system_data)
 dh.run_pipeline(power_col='dc_power')
diff --git a/docs/source/getting_started/usage.md b/docs/source/getting_started/usage.md
index 3bb20564..5f584e77 100644
--- a/docs/source/getting_started/usage.md
+++ b/docs/source/getting_started/usage.md
@@ -8,10 +8,10 @@ The data should be in the form of a pandas DataFrame with a datetime index and a
 (or the user must set the `datetime_col` kwarg.) The data is recommended to be in the local timezone of the PV system.
 ```python
+import pandas as pd
 from solardatatools import DataHandler
-from solardatatools.dataio import get_pvdaq_data
-pv_system_data = get_pvdaq_data(sysid=35, api_key='DEMO_KEY', year=[2011, 2012, 2013])
+pv_system_data = pd.read_csv('path/to/your/data.csv', index_col=0, parse_dates=True)
 dh = DataHandler(pv_system_data)
 dh.run_pipeline(power_col='dc_power')
diff --git a/docs/source/index_user_guide.md b/docs/source/index_user_guide.md
index 7b92c69d..2742efdb 100644
--- a/docs/source/index_user_guide.md
+++ b/docs/source/index_user_guide.md
@@ -50,7 +50,7 @@ The DataHandler object is now ready to be used for data processing and analysis.
 Timeseries data is often in wide-form, where you have for example a DataFrame that has a timestamp column and one or more data columns. That's what the DataHandler typically expects. However, it also can take data in long-form, such as for example what we have in the Redshift data where some sites
-have more than one inverter (see the "Data I/O functions" section below). In this case, you will
+have more than one inverter. In this case, you will
 want to instantiate the DataHandler object with the `convert_to_ts` flag set to True:
 ```python
@@ -190,40 +190,12 @@ method and provide a GMT offset value by passing it to the method. After that, y
 four estimation methods. A demo of this feature can be found in the [tutorial](getting_started/notebooks/tutorial.ipynb) in cells 13-15.
-### Data I/O functions
+### Loading data
-The `dataio` module in Solar Data Tools includes a number of functions to pull data from various sources.
-These functions are useful for loading data into a DataFrame that can be used with the `DataHandler` class.
-The available functions are:
+You can load data into a DataFrame using standard pandas functions:
-| Method | Description |
-|-----------------------------------|--------------------------------------------------------------------------------|
-| dataio.get_pvdaq_data | Queries one or more years of raw PV system data from NREL's PVDAQ data service |
-| dataio.load_constellation_data | Loads constellation data from a specified location |
-| dataio.load_redshift_data | Queries a SunPower dataset by site id and returns a Pandas DataFrame |
-| dataio.load_pvo_data | Loads NREL data from private S3 bucket (for use by the SLAC team only) |
-
-
-The PVDAQ database is a public database of solar power data that can be accessed by anyone. The system
-locations that can be accessed are shown on [this interactive map](https://openei.org/wiki/PVDAQ/PVData_Map).
-You can use the "DEMO_KEY" for querying the data, but you can also get your own API key by
-registering [here](https://data.openei.org/submissions/4568).
-An example usage for this function for system ID 34 is shown below:
 ```python
-df = get_pvdaq_data(sysid=34, year=range(2011, 2015), api_key='DEMO_KEY')
-```
-
-To use the `load_redshift_data` function, you will need to
-request an API key by registering at [https://pvdb.slacgismo.org](https://pvdb.slacgismo.org) and emailing
-slacgismotutorials@gmail.com with your information and use case. To query the data, you also must
-provide a site ID and a sensor number (0, 1, 2 ...). An example usage is shown below:
-
-```python
-query = {
- 'siteid': 'TABJC1027159',
- 'api_key': YOUR_API_KEY,
- 'sensor': 0
-}
+import pandas as pd
-df = load_redshift_data(**query)
+df = pd.read_csv('path/to/your/data.csv', index_col=0, parse_dates=True)
 ```
diff --git a/pvsystemprofiler/scripts/parameter_estimation_script.py b/pvsystemprofiler/scripts/parameter_estimation_script.py
deleted file mode 100644
index 5ae4dff3..00000000
--- a/pvsystemprofiler/scripts/parameter_estimation_script.py
+++ /dev/null
@@ -1,307 +0,0 @@
-"""Longitude run script
-This run script allows to run the longitude_study for multiple sites. The site ids to be evaluated can be provided in
- a csv file. Alternatively, the path to a folder containing the input signals of the sites in separate csv files can be
- provided. The script provides the option to provided the full path to csv file containing latitude and gmt offset for
- each system for comparison.
-""" - -import sys -from pathlib import Path -import pandas as pd -import numpy as np -from time import time - -from solardatatools.utilities import progress -from pvsystemprofiler.scripts.modules.script_functions import run_failsafe_pipeline -from pvsystemprofiler.scripts.modules.script_functions import resume_run -from pvsystemprofiler.scripts.modules.script_functions import load_generic_data -from pvsystemprofiler.scripts.modules.script_functions import log_file_versions -from pvsystemprofiler.scripts.modules.script_functions import load_system_metadata -from pvsystemprofiler.scripts.modules.script_functions import generate_list -from solardatatools.dataio import load_cassandra_data -from pvsystemprofiler.scripts.modules.script_functions import extract_sys_parameters -from pvsystemprofiler.scripts.modules.script_functions import get_commandline_inputs -from pvsystemprofiler.scripts.modules.script_functions import ( - run_failsafe_lon_estimation, -) -from pvsystemprofiler.scripts.modules.script_functions import ( - run_failsafe_lat_estimation, -) -from pvsystemprofiler.scripts.modules.script_functions import run_failsafe_ta_estimation -from solardatatools import DataHandler - -# TODO: remove pth.append after package is deployed -filepath = Path(__file__).resolve().parents[2] -sys.path.append(str(filepath)) - - -def evaluate_systems(site_id, inputs_dict, df, site_metadata, json_file_dict=None): - partial_df_cols = [ - "site", - "system", - "passes pipeline", - "length", - "capacity_estimate", - "data_sampling", - "data quality_score", - "data clearness_score", - "inverter_clipping", - "time_shifts_corrected", - "time_zone_correction", - "capacity_changes", - "normal_quality_scores", - ] - - if json_file_dict is not None: - partial_df_cols.extend( - ["zip_code", "real longitude", "real latitude", "real tilt", "real azimuth"] - ) - if inputs_dict["time_shift_manual"]: - partial_df_cols.append("time_shift_manual") - - partial_df = 
pd.DataFrame(columns=partial_df_cols) - - ll = len(inputs_dict["power_column_label"]) - - if inputs_dict["convert_to_ts"]: - dh = DataHandler(df, convert_to_ts=inputs_dict["convert_to_ts"]) - cols = [el[-1] for el in dh.keys] - else: - cols = df.columns - - i = 0 - for col_label in cols: - if col_label.find(inputs_dict["power_column_label"]) != -1: - system_id = col_label[ll:] - if system_id in site_metadata["system"].tolist(): - i += 1 - dh = DataHandler(df, convert_to_ts=inputs_dict["convert_to_ts"]) - sys_tag = inputs_dict["power_column_label"] + system_id - sys_mask = site_metadata["system"] == system_id - - if inputs_dict["time_shift_manual"]: - time_shift_manual = int( - site_metadata.loc[sys_mask, "time_shift_manual"].values[0] - ) - if time_shift_manual == 1: - dh.fix_dst() - else: - time_shift_manual = 0 - - dh, passes_pipeline = run_failsafe_pipeline( - dh, - sys_tag, - inputs_dict["fix_time_shifts"], - inputs_dict["time_zone_correction"], - ) - if passes_pipeline: - results_list = [ - site_id, - system_id, - passes_pipeline, - dh.num_days, - dh.capacity_estimate, - dh.data_sampling, - dh.data_quality_score, - dh.data_clearness_score, - dh.inverter_clipping, - dh.time_shifts, - dh.tz_correction, - dh.capacity_changes, - dh.normal_quality_scores, - ] - - if inputs_dict["time_shift_manual"]: - results_list.append(time_shift_manual) - if json_file_dict is not None: - if system_id in json_file_dict.keys(): - source_file = json_file_dict[system_id] - json_information = extract_sys_parameters( - source_file, system_id, inputs_dict["s3_location"] - ) - else: - json_information = [np.nan] * 4 - results_list.extend(json_information) - - else: - results_list = [site_id, system_id, passes_pipeline] + [np.nan] * ( - len(results_list) - 3 - ) - - if inputs_dict["estimation"] == "longitude": - if inputs_dict["longitude"]: - real_longitude = float(site_metadata.loc[sys_mask, "longitude"]) - if inputs_dict["gmt_offset"] is not None: - gmt_offset = 
inputs_dict["gmt_offset"] - else: - gmt_offset = float(site_metadata.loc[sys_mask, "gmt_offset"]) - results_df, passes_estimation = run_failsafe_lon_estimation( - dh, real_longitude, gmt_offset - ) - - elif inputs_dict["estimation"] == "latitude": - if inputs_dict["latitude"]: - real_latitude = float(site_metadata.loc[sys_mask, "latitude"]) - results_df, passes_estimation = run_failsafe_lat_estimation( - dh, real_latitude - ) - - elif inputs_dict["estimation"] == "tilt_azimuth": - if inputs_dict["estimated_longitude"]: - longitude_input = float( - site_metadata.loc[sys_mask, "estimated_longitude"] - ) - if inputs_dict["estimated_latitude"]: - latitude_input = float(site_metadata.loc[sys_mask, "latitude"]) - if inputs_dict["latitude"]: - real_latitude = float(site_metadata.loc[sys_mask, "latitude"]) - if inputs_dict["tilt"]: - real_tilt = float(site_metadata.loc[sys_mask, "tilt"]) - if inputs_dict["azimuth"]: - real_azimuth = float(site_metadata.loc[sys_mask, "azimuth"]) - if inputs_dict["gmt_offset"]: - gmt_offset = inputs_dict["gmt_offset"] - else: - gmt_offset = float(site_metadata.loc[sys_mask, "gmt_offset"]) - results_df, passes_estimation = run_failsafe_ta_estimation( - dh, - 1, - None, - longitude_input, - latitude_input, - None, - None, - real_latitude, - real_tilt, - real_azimuth, - gmt_offset, - ) - if inputs_dict["estimation"] in [ - "longitude", - "latitude", - "tilt_azimuth", - ]: - results_df[partial_df_cols] = results_list - partial_df = partial_df.append(results_df) - elif inputs_dict["estimation"] == "report": - partial_df.loc[len(partial_df)] = results_list - return partial_df - - -def main(full_df, inputs_dict, df_system_metadata): - site_run_time = 0 - total_time = 0 - file_list, json_file_dict = generate_list(inputs_dict, full_df, df_system_metadata) - - if inputs_dict["n_files"] != "all": - file_list = file_list[: int(inputs_dict["n_files"])] - if full_df is None: - full_df = pd.DataFrame() - - for file_ix, file_id in enumerate(file_list): - t0 
= time() - msg = "Site/Accum. run time: {0:2.2f} s/{1:2.2f} m".format( - site_run_time, total_time / 60.0 - ) - progress(file_ix, len(file_list), msg, bar_length=20) - - if inputs_dict["file_label"] is not None: - i = file_id.find(inputs_dict["file_label"]) - site_id = file_id[:i] - mask = ( - df_system_metadata["site"] - == site_id.split(inputs_dict["file_label"])[0] - ) - else: - site_id = file_id.split(".")[0] - - mask = df_system_metadata["site"] == site_id - site_metadata = df_system_metadata[mask] - - # TODO: integrate option for other data inputs - if inputs_dict["data_source"] == "s3": - df = load_generic_data( - inputs_dict["s3_location"], inputs_dict["file_label"], site_id - ) - if inputs_dict["data_source"] == "cassandra": - df = load_cassandra_data(site_id) - - if not site_metadata.empty: - partial_df = evaluate_systems( - site_id, inputs_dict, df, site_metadata, json_file_dict - ) - else: - partial_df = None - - if not partial_df.empty or partial_df is not None: - full_df = full_df.append(partial_df) - full_df.index = np.arange(len(full_df)) - full_df.to_csv(inputs_dict["output_file"]) - t1 = time() - site_run_time = t1 - t0 - total_time += site_run_time - - msg = "Site/Accum. run time: {0:2.2f} s/{1:2.2f} m".format( - site_run_time, total_time / 60.0 - ) - if len(file_list) != 0: - progress(len(file_list), len(file_list), msg, bar_length=20) - print("finished") - return - - -if __name__ == "__main__": - """ - :param estimation: Estimation to be performed. Options are 'report', 'longitude', 'latitude', 'tilt_azimuth' - :param input_site_file: csv file containing list of sites to be evaluated. 'None' if no input file is provided. - :param n_files: number of files to read. If 'all' all files in folder are read. - :param s3_location: Absolute path to s3 location of files. - :param file_label: Repeating portion of data files label. If 'None', no file label is used. - :param power_column_label: Repeating portion of the power column label. 
- :param output_file: Absolute path to csv file containing report results. - :param fix_time_shits: String, 'True' or 'False'. Specifies if time shifts are to be - fixed when running the pipeline. - :param time_zone_correction: String, 'True' or 'False'. Specifies if the time zone correction is performed when - running the pipeline. - :param check_json: String, 'True' or 'False'. Check json file for location information. - :param convert_to_ts: String, 'True' or 'False'. Specifies if conversion to time series is performed when - running the pipeline. - :param system_summary_file: Full path to csv file containing longitude and manual time shift flag for each system, - None if no file - provided. - :param gmt_offset: String. Single value of gmt offset to be used for all estimations. If None a list with individual - gmt offsets needs to be provided. - :param data_source: String. Input signal data source. Options are 's3' and 'cassandra'. - """ - - input_kwargs = sys.argv - inputs_dict = get_commandline_inputs(input_kwargs) - - log_file_versions("solar-data-tools", active_conda_env="pvi-user") - log_file_versions("pv-system-profiler") - - full_df = resume_run(inputs_dict["output_file"]) - - ssf = inputs_dict["system_summary_file"] - if ssf is not None: - df_system_metadata = load_system_metadata( - df_in=ssf, file_label=inputs_dict["file_label"] - ) - cols = df_system_metadata.columns - for param in [ - "longitude", - "latitude", - "tilt", - "azimuth", - "estimated_longitude", - "estimated_latitude", - "time_shift_manual", - ]: - if param in cols: - inputs_dict[param] = True - else: - inputs_dict[param] = False - else: - df_system_metadata = None - -main(full_df, inputs_dict, df_system_metadata) diff --git a/pyproject.toml b/pyproject.toml index 3684a787..8d40d522 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -65,6 +65,12 @@ docs = [ "nbsphinx_link", "docutils<0.22" ] +dev = [ + "ruff==0.9.6", + "pre-commit", + "pytest", + "pytest-cov", +] mosek = [ "mosek" ] 
diff --git a/sdt_dask/dataplugs/README.md b/sdt_dask/dataplugs/README.md index a970887a..349af18c 100644 --- a/sdt_dask/dataplugs/README.md +++ b/sdt_dask/dataplugs/README.md @@ -48,27 +48,7 @@ Below are detailed descriptions of the DataPlugs available for use with the SDT data_plug.get_data(("filename",)) ``` -### 2. PVDAQPlug (dataplugs/pvdaq_plug.py) - -- **Description**: Retrieving and cleaning solar data from the PVDAQ database -- **Initialization**: -`api_key`: Your API key for accessing the PVDAQ data. -- **`get_data` Tuple input**: Expects a tuple containing the site ID (integer) and the year (integer) for which data is to be retrieved. Example to call get_data method: -```python -data_plug.get_data((site_id, year)) -``` - - -### 3. PVDBPlug (dataplugs/pvdb_plug.py) - -- **Description**: Retrieving and cleaning solar data from the PVDB (Redshift) database -- **Initialization**: No input but assumes the API key is set as an environment variable REDSHIFT_API_KEY. -- **`get_data` Tuple input**: Expects a tuple containing the site ID (string) and the sensor type (integer), identifying the specific dataset to be retrieved. Example to call get_data method: -```python -data_plug.get_data(("site_id", sensor_type)) -``` - -### 4. S3Bucket DataPlug +### 2. S3Bucket DataPlug - **Description**: Retrieving and cleaning solar data from S3 Bucket. And provides a function to get the full key list inside the given bucket name. - **Initialization**: diff --git a/sdt_dask/dataplugs/pvdaq_plug.py b/sdt_dask/dataplugs/pvdaq_plug.py deleted file mode 100644 index a624a94e..00000000 --- a/sdt_dask/dataplugs/pvdaq_plug.py +++ /dev/null @@ -1,45 +0,0 @@ -"""Class for dataplugs to be used with the SDT Dask tool.""" - -import pandas as pd -from solardatatools.dataio import get_pvdaq_data -from sdt_dask.dataplugs.dataplug import DataPlug - - -class PVDAQPlug(DataPlug): - """ - Dataplug class for retrieving data from the PVDAQ DB. 
- Note that the DEMO_KEY has a rate limit of 30/h, 50/d per IP address. - """ - - def __init__(self, api_key="DEMO_KEY", power_col="ac_power"): - self.api_key = api_key - self.power_col = power_col - - def _pull_data(self, key, year): - """ - Pull the data from the PVDAQ database using the get_pvdaq_data function - from the solardatatools package. - """ - self.df = get_pvdaq_data(sysid=key, year=year, api_key=self.api_key) - - def _clean_data(self): - # pick out one power col - self.df = self.df[["ac_power"]] - - def get_data(self, keys: tuple[int, int]) -> pd.DataFrame: - """This is the main function that the Dask tool will interact with. - Users should keep the args and returns as defined here when writing - their custom dataplugs. - - :param keys: Tuple containing the required inputs: a unique set of - historical power generation measurements, and the year to query - :return: Returns a pandas DataFrame with a timestamp column and - a power column - """ - # In this case the process to get the data is simple since it's all - # done in the get_pvdaq_data function, but in some cases it could be - # more complex - self._pull_data(*keys) - self._clean_data() - - return self.df diff --git a/sdt_dask/dataplugs/pvdb_plug.py b/sdt_dask/dataplugs/pvdb_plug.py deleted file mode 100644 index a21b1c30..00000000 --- a/sdt_dask/dataplugs/pvdb_plug.py +++ /dev/null @@ -1,50 +0,0 @@ -import os -import pandas as pd -from solardatatools.dataio import load_redshift_data -from solardatatools.time_axis_manipulation import make_time_series -from sdt_dask.dataplugs.dataplug import DataPlug - - -class PVDBPlug(DataPlug): - """ - Dataplug class for retrieving data from the PVDB (Redshift) database. - """ - - def __init__(self, power_col="meas_val_f"): - self.api_key = os.environ.get("REDSHIFT_API_KEY") - self.power_col = power_col - - def _pull_data(self, siteid, sensor): - """ - Pull data from the PVDB database. 
- - :param siteid: Site ID for the data to be retrieved - :param sensor: Sensor Index for the data to be retrieved (staring from 0) - """ - query = {"siteid": siteid, "api_key": self.api_key, "sensor": sensor} - - self.df = load_redshift_data(**query) - - def _clean_data(self): - """ - Clean the data and convert the index to a datetime object by calling - the make_time_series function from the solardatatools package - """ - self.df, _ = make_time_series(self.df) - - def get_data(self, keys: tuple[str, int]) -> pd.DataFrame: - """ - This is the main function that the Dask tool will interact with. - Users should keep the args and returns as defined here when writing - their custom dataplugs. - - :param keys: Tuple containing the required inputs: a unique set of - historical power generation measurements, which should be a - siteid and a sensor id - :return: Returns a pandas DataFrame with a timestamp column and - a power column - """ - self._pull_data(*keys) - self._clean_data() - - return self.df diff --git a/sdt_dask/examples/dev_scripts/README.md b/sdt_dask/examples/dev_scripts/README.md index e43f1cd3..95b6e319 100644 --- a/sdt_dask/examples/dev_scripts/README.md +++ b/sdt_dask/examples/dev_scripts/README.md @@ -3,3 +3,7 @@ The scripts in this directory were used for testing during the development of the SDT Dask tool. We provide them here as a reference for anyone that might find them useful to reuse or develop their own scripts to run the tool. + +> **Note:** `rev_far_pvdb_dask.py` is **archival only** and will not execute. It depended on +> `PVDBPlug` (removed from `sdt_dask/dataplugs/`), which relied on the deprecated `dataio` module +> that was removed from Solar Data Tools in version 2.0. The script is retained for reference. 
diff --git a/solardatatools/__init__.py b/solardatatools/__init__.py index b7fc8fb6..d6bb9c6f 100644 --- a/solardatatools/__init__.py +++ b/solardatatools/__init__.py @@ -8,8 +8,6 @@ from solardatatools.time_axis_manipulation import fix_daylight_savings_with_known_tz from solardatatools.time_axis_manipulation import make_time_series from solardatatools.clear_day_detection import ClearDayDetection -from solardatatools.dataio import get_pvdaq_data -from solardatatools.dataio import load_pvo_data from solardatatools.plotting import plot_2d from solardatatools.data_handler import DataHandler from solardatatools.polar_transform import PolarTransform @@ -20,8 +18,6 @@ "fix_daylight_savings_with_known_tz", "make_time_series", "ClearDayDetection", - "get_pvdaq_data", - "load_pvo_data", "plot_2d", "DataHandler", "PolarTransform", diff --git a/solardatatools/dataio.py b/solardatatools/dataio.py deleted file mode 100644 index bff70234..00000000 --- a/solardatatools/dataio.py +++ /dev/null @@ -1,168 +0,0 @@ -# -*- coding: utf-8 -*- -"""Data IO Module - -This module contains functions for obtaining data from various sources. - -""" - -import pandas as pd -from datetime import datetime - - -def get_pvdaq_data(sysid=2, api_key="DEMO_KEY", year=2011, delim=",", standardize=True): - """ - This function queries one or more years of raw PV system data from NREL's PVDAQ data service: - https://openei.org/wiki/PVDAQ/PVData_Map - - :param sysid: The system ID to query. Default is 2. - :type sysid: int, optional - :param api_key: The API key for authentication. Default is "DEMO_KEY". - :type api_key: str, optional - :param year: The year or list of years to query. Default is 2011. - :type year: int or list of int, optional - :param delim: The delimiter used in the CSV file. Default is ",". - :type delim: str, optional - :param standardize: Whether to standardize the time axis. Default is True. 
- :type standardize: bool, optional - - :return: A dataframe containing the concatenated data for all queried years. - :rtype: pd.DataFrame - """ - # Force year to be a list of integers - raise ( - "This function is no longer supported! See https://github.com/NREL/pvdaq_access for access to PVDAQ data" - ) - - -def load_pvo_data( - file_index=None, - id_num=None, - location="s3://pv.insight.nrel/PVO/", - metadata_fn="sys_meta.csv", - data_fn_pattern="PVOutput/{}.csv", - index_col=0, - parse_dates=[0], - usecols=[1, 3], - fix_dst=True, - tz_column="TimeZone", - id_column="ID", - verbose=True, -): - """ - Wrapper function for loading data from NREL partnership. This data is in a - secure, private S3 bucket for use by the GISMo team only. However, the - function can be used to load any data that is a collection of CSV files - with a single metadata file. The metadata file contains a sequential file - index as well as a unique system ID number for each site. Either of these - may be set by the user to retreive data, but the ID number will take - precedent if both are provided. The data files are assumed to be uniquely - identified by the system ID number. In addition, the metadata file contains - a column with time zone information for fixing daylight savings time. 
- - :param file_index: the sequential index number of the system - :param id_num: the system ID number (non-sequential) - :param location: string identifying the directory containing the data - :param metadata_fn: the location of the metadata file - :param data_fn_pattern: the pattern of data file identification - :param index_col: the column containing the index (see: pandas.read_csv) - :param parse_dates: list of columns to parse dates (see: pandas.read_csv) - :param usecols: columns to load from file (see: pandas.read_csv) - :param fix_dst: boolean, if true, use provided timezone information to - correct for daylight savings time in data - :param tz_column: the column name in the metadata file that contains the - timezone information - :param id_column: the column name in the metadata file that contains the - unique system ID information - :param verbose: boolean, print information about retreived file - :return: pandas dataframe containing system power data - """ - raise ("This function is no longer supported!") - - -def load_cassandra_data( - siteid, - column="ac_power", - sensor=None, - tmin=None, - tmax=None, - limit=None, - cluster_ip=None, - verbose=True, -): - """ - .. deprecated:: 1.5.0 - dataio.load_cassandra_data is deprecated. Starting in Solar Data Tools 2.0, it will be removed. - This function is deprecated. Please use load_redshift_data function instead. - """ - raise ("This function is no longer supported!") - - -def load_constellation_data( - file_id, - location="s3://pv.insight.misc/pv_fleets/", - data_fn_pattern="{}_20201006_composite.csv", - index_col=0, - parse_dates=[0], - json_file=False, -): - """ - Load constellation data from a specified location. - - This function reads a CSV file from a given location and optionally loads - additional JSON metadata. - - :param file_id: Identifier for the file to load. - :type file_id: str - :param location: The base location where the data files are stored. Default is "s3://pv.insight.misc/pv_fleets/". 
- :type location: str, optional - :param data_fn_pattern: The pattern for the data file name. Default is "{}_20201006_composite.csv". - :type data_fn_pattern: str, optional - :param index_col: Column to use as the row labels of the DataFrame. Default is 0. - :type index_col: int, optional - :param parse_dates: List of column indices to parse as dates. Default is [0]. - :type parse_dates: list, optional - :param json_file: Whether to load additional JSON metadata. Default is False. - :type json_file: bool, optional - - :return: A tuple containing the DataFrame and the JSON metadata (if json_file is True), otherwise just the DataFrame. - :rtype: tuple[pd.DataFrame, dict] or pd.DataFrame - """ - raise ("This function is no longer supported!") - - -def load_redshift_data( - siteid: str, - api_key: str, - column: str = "ac_power", - sensor: int | list[int] | None = None, - tmin: datetime | None = None, - tmax: datetime | None = None, - limit: int | None = None, - verbose: bool = False, -) -> pd.DataFrame: - """ - Queries a SunPower dataset by site id and returns a Pandas DataFrame. - - Request an API key by registering at https://pvdb.slacgismo.org and emailing slacgismotutorials@gmail.com with your information and use case. 
- - :param siteid: Site id to query - :type siteid: str - :param api_key: API key for authentication to query data - :type api_key: str - :param column: Meas_name to query, defaults to "ac_power" - :type column: str, optional - :param sensor: Sensor index to query based on the number of sensors at the site id, defaults to None - :type sensor: int | list[int] | None, optional - :param tmin: Minimum timestamp to query, defaults to None - :type tmin: datetime | None, optional - :param tmax: Maximum timestamp to query, defaults to None - :type tmax: datetime | None, optional - :param limit: Maximum number of rows to query, defaults to None - :type limit: int | None, optional - :param verbose: Option to print out additional information, defaults to False - :type verbose: bool, optional - :return: Pandas DataFrame containing the queried data. - :rtype: pd.DataFrame - """ - - raise ("This function is no longer supported!") diff --git a/tests/fixtures/README.md b/tests/fixtures/README.md new file mode 100644 index 00000000..e416697d --- /dev/null +++ b/tests/fixtures/README.md @@ -0,0 +1,13 @@ +# Test Fixtures + +The CSV files in this directory are used as inputs for the test suite. + +## Notebooks + +The Jupyter notebooks retained in several subdirectories below are **archival only**. +They were historically used to generate the CSV fixture data from the now-removed `dataio` module. +Because `dataio` was fully deprecated (all functions raised "no longer supported" errors), +these notebooks will not execute and are kept solely as documentation of how the fixture data was created. + +Do not attempt to run them. To regenerate fixture data, load your own data source via `pandas.read_csv` +or another data-loading method and follow the notebook's processing steps manually.
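Reviewer note on the fixtures README above: the `pandas.read_csv` route it recommends can be sketched as follows. This is a minimal sketch, not the project's loader. The in-memory CSV, the column names `timestamp` and `ac_power`, and the helper `load_power_csv` are all hypothetical stand-ins for a user's own data source.

```python
import io

import pandas as pd

# Hypothetical stand-in for a user-supplied CSV file. The column names
# "timestamp" and "ac_power" are assumptions made for illustration only.
csv_text = """timestamp,ac_power
2020-06-01 12:00:00,3.1
2020-06-01 12:05:00,3.3
2020-06-01 12:10:00,3.2
"""


def load_power_csv(source, power_col="ac_power"):
    """Read a CSV into a DataFrame with a datetime index and one power column.

    `source` may be a file path or any file-like object accepted by
    pandas.read_csv. This helper is illustrative and is not part of
    Solar Data Tools.
    """
    # Use the first column as the index and parse it as datetimes.
    df = pd.read_csv(source, index_col=0, parse_dates=[0])
    # Keep only the requested power column.
    return df[[power_col]]


df = load_power_csv(io.StringIO(csv_text))
```

The result is a DataFrame with a datetime index and a single power column, which is the input shape the repository documentation describes for `DataHandler`. The notebook's remaining processing steps would still need to be followed by hand.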