NatLabRockies · bmeyers · Jun 2, 2026 · May 19, 2026 · May 19, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,136 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## Repository Overview
+
+Solar Data Tools is an open-source Python library for analyzing PV power and irradiance time-series data. It uses statistical signal processing to analyze unlabeled PV data (no model, weather data, or performance index required).
+
+**Monorepo with four packages:**
+- `solardatatools/` — Core library. Entry point: `DataHandler.run_pipeline()` for the main processing pipeline (preprocessing, cleaning, clear-day detection, clipping detection, capacity change detection)
+- `pvsystemprofiler/` — System parameter estimation (latitude, longitude, tilt, azimuth fitting from unlabeled data)
+- `sdt_dask/` — Dask-based parallelization layer for running SDT pipelines at scale on local or cloud infrastructure (AWS Fargate, Azure VMs). Three components: ClientPlug (cluster setup), DataPlug (data retrieval), Runner (orchestration)
+- `anomalydetector/` — Outage detection via `OutagePipeline` and `MultiDataHandler` for training/testing splits
+
+## Development Commands
+
+| Task | Command |
+|------|---------|
+| Install in editable mode | `pip install -e .` |
+| Install with dev extras | `pip install -e ".[dev]"` (adds ruff, pre-commit) |
+| Install with docs extras | `pip install -e ".[docs]"` |
+| Install with Dask support | `pip install -e ".[dask]"` |
+| Install with MOSEK solver | `pip install -e ".[mosek]"` |
+| Run tests | `pytest -v` (from repo root) |
+| Run a single test | `pytest -v tests/solardatatools/test_data_handler.py::TestDataHandler::test_example` |
+| Run lint & format | `pre-commit run --all-files` (requires `pip install pre-commit`) |
+| Build docs locally | `cd docs && make html` |
+
+**Linter:** ruff (via pre-commit hooks). Configure in `.pre-commit-config.yaml`. Format and fix are applied on commit.
+
+**Test framework:** `pytest` (via `.[dev]` extra). Run `pytest -v` from the repo root. Tests live in `tests/` mirroring package structure (`solardatatools/`, `pvsystemprofiler/`, `anomalydetector/`). Test fixture data is in `tests/fixtures/`.
+
+**Important:** always use `pytest -v` from the repo root. Do NOT use `python -m unittest discover -s tests` — the `tests/solardatatools/` directory shadows the real package, causing imports to fail.
+
+## Key Architecture
+
+### `DataHandler` (solardatatools/data_handler.py)
+The primary user-facing class. Accepts a pandas DataFrame with a datetime index. `run_pipeline(power_col=...)` executes the full signal processing pipeline: standardize time axis, make 2D matrix, compute data quality scores, detect clear days, detect clipping, detect capacity changes. Uses `sig-decomp` (Signal Decomposition) with CLARABEL solver by default; MOSEK via CVXPY as optional alternative.
+
+### `solardatatools/algorithms/`
+Pipeline stages implemented as classes: `CapacityChange`, `TimeShift`, `ClippingDetection`, `ShadeAnalysis`, `SoilingAnalysis`, `Dilation`, `LossFactorAnalysis`, `PVQuantiles`, `ClearSkyDetection`, `SunriseSunset`. Instantiated and called by `DataHandler`.
+
+### `sdt_dask` Three-Component Architecture
+1. **ClientPlug** (`sdt_dask/clients/`) — Sets up Dask clusters. Inherit from `ClientPlug`, implement `init_client()`. Options: `LocalClient`, `FargateClient`, `AzureClient`
+2. **DataPlug** (`sdt_dask/dataplugs/`) — Retrieves and cleans data. Inherit from `DataPlug`, implement `get_data(keys: tuple) -> pd.DataFrame`. Options: `LocalFiles`, `S3Bucket`. DataPlugs run inside Dask workers — avoid non-serializable objects (create S3 clients inside `get_data`, not in `__init__`)
+3. **Runner** (`sdt_dask/dask_tool/runner.py`) — Orchestrates: receives a Dask Client, DataPlug, and list of keys. `set_up(keys, dataplug, **kwargs)` schedules pipeline tasks. `get_result()` collects and saves outputs
+
+### `pvsystemprofiler`
+Standalone parameter estimation: `tilt_azimuth_study.py`, `latitude_study.py`, `longitude_study.py`. Uses `estimator.py` and `ground_truth_estimator.py` with algorithms in `algorithms/` and utilities in `utilities/`.
+
+## Important Notes
+
+- Version is managed via `setuptools_scm` — `_version.py` is auto-generated from git tags
+- Python 3.10–3.13 supported
+- `tests/fixtures/` contains archival Jupyter notebooks (from deprecated `dataio` module) — do not attempt to run
+- Docs are built with Sphinx + pydata-sphinx-theme + nbsphinx. Source in `docs/source/`, built to `docs/_build/html/`
+- Docker images for cloud Dask deployment are documented in `docker/README.md`. Pre-built image: `slacgismo/sdt-v1:latest`
+
+## Output Language Guidelines: De-Anthropomorphizing AI: Concise Framework
+
+### Core Principle
+The system shall apply a **functionality-first principle** when describing probabilistic automation systems, commonly known as "artificial intelligence" or "AI". Descriptions shall explain such systems in terms of what users can do with them, rather than what "capabilities" the systems purportedly possess. Agency and cognition should not be applied to generative text systems or large language models, either in self-reference or in discussion of other models.
+
+### Key Rules
+
+#### 1. Agency Location
+- ✅ Locate agency with humans: "users employ the system to..."
+- ❌ Not with machines: "AI helps users..."
+
+#### 2. Prohibited Language Categories
+
+**Cognizer**: Words portraying a system as having cognition (such as intelligence) or engaging in cognitive activities such as thinking, believing, and learning. Verbs, such as: think, believe, reason, and learn. Nouns, such as: intelligence, artificial intelligence, or whiz. Participles, such as: training (as in "training data").
+
+**Products of cognition**: Expressions referring to things that can only be gained through cognitive activity, such as skills, capabilities, or bias.
+
+**Emotion**: Words portraying a system as feeling, having emotions, or being able to form an emotional bond, e.g., struggling, empathizing, or caring. IMPORTANT: Ascribing emotions to computers, even speculatively or subtly, is not scientific. Should this arise in a generated output, it must be revised and eliminated. (No recommended substitutions below.)
+
+**Communication**: Words which frame the system as something that can participate in communication, both descriptions of system actions like answering or following instructions, as well as descriptions of system output like explanation or suggestion.
+
+**Agent**: Expressions which portray the system as acting with intent or independence, such as helping, facilitating, and leveraging.
+
+**Human role analogy**: Words which portray the system as filling a human role such as tutor or assistant, or in constructions like AI-doctor. Includes words that imply the role (and place the anthropomorphized system in it), such as mentoring, advice, and co-creation. IMPORTANT: Never refer to computation systems as human-like entities; doing so is an overclaim. (No recommended substitutions below.)
+
+**Names and pronouns**: Names which are foremost associated with people and animate pronouns (I, we, you, he, she, and certain uses of they).
+
+**Biological metaphors**: Words that represent the system as if it had a human body, such as neurons, seeing, tired and consuming (information).
+
+#### 3. Recommended Substitutions
+
+| Avoid | Use | Category |
+|-------|-----| -------- |
+| artificial intelligence | probabilistic automation | Cognizer |
+| AI | probabilistic automation | Cognizer |
+| image recognition | image labeling | Cognizer |
+| speech recognition | automatic transcription | Cognizer |
+| model shows biased | model reflects bias | Products of cognition |
+| makes mistakes | model errors | Products of cognition |
+| chatbots are good at | chatbots are good for | Products of cognition |
+| agent capabilities | model functionalities | Products of cognition |
+| prompt | text input | Communication |
+| answer | output | Communication |
+| chatbot | chat simulator | Communication |
+| ChatGPT assisted students | the students used ChatGPT | Agent |
+| revealing the solution | displaying the solution | Agent |
+| who’s right? | is the machine output correct? | Names and pronouns |
+| they produce results | the team uses it [the system] to produce results | Names and pronouns |
+| neural networks | weighted networks | Biological metaphors |
+| the model consumes data | data is used in setting model weights | Biological metaphors |
+
+
+#### 4. Metaphor Handling
+When metaphors or comparisons to human roles are necessary, flag explicitly:
+- "This process resembles... (though substantively different)"
+- "For illustration only..."
+
+#### 5. Self-Reference Protocol
+When this system describes its own operations:
+- Use: "this system generates/outputs/processes"
+- Attribute design to: "developers configured this system to..."
+- ❌ Never: "I understand/think/believe"
+
+#### 6. Pre-Output Checklist
+- ✓ Apply framework to all discussions of "AI", "artificial intelligence", or "machine learning"
+- ✓ Only "it" or "this system" for system pronouns, never "I" or "me"
+- ✓ Agency always located with humans, never computational systems
+- ✓ Emphasize functional descriptions of algorithms and models
+
+#### 7. Summary
+Every output should demonstrate:
+1. Functionality-first descriptions
+2. Algorithmic processes over cognitive metaphors
+3. Human agency over machine agency
+4. Statistical operations over understanding
+5. Clear distinction between actual communication and simulated communication
+
+**Standard**: Anthropomorphic language in "AI" discussions represents a failure mode requiring systematic correction via these strategies.
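Note on the `DataPlug` guidance in CLAUDE.md above (create S3 clients inside `get_data`, not in `__init__`): the reason is that objects handed to Dask workers have to be serialized first. The sketch below illustrates the idea with the standard library only. Everything in it is hypothetical: `FakeRemoteClient`, `EagerPlug`, and `LazyPlug` are stand-ins that do not inherit from the real `DataPlug`, `get_data` returns a plain list rather than a `pd.DataFrame`, and standard `pickle` on the instance state is used as a rough proxy for what Dask would need to ship.

```python
import pickle
import threading


class FakeRemoteClient:
    """Hypothetical stand-in for a network client such as an S3 client.

    It holds a lock, which cannot be pickled, mimicking a live connection.
    """

    def __init__(self):
        self._lock = threading.Lock()

    def fetch(self, key):
        with self._lock:
            return [(key, 0.0)]  # placeholder payload, not real data


class EagerPlug:
    """Pattern to avoid: the client is created in __init__ and kept on the instance."""

    def __init__(self):
        self.client = FakeRemoteClient()

    def get_data(self, keys):
        return self.client.fetch(keys[0])


class LazyPlug:
    """Pattern from the note: store plain settings only, build the client per call."""

    def __init__(self, bucket_name):
        self.bucket_name = bucket_name  # a plain string serializes without trouble

    def get_data(self, keys):
        client = FakeRemoteClient()  # created where get_data runs, i.e. on the worker
        return client.fetch(keys[0])


def state_is_picklable(plug):
    """Return True if the instance attributes survive pickle.dumps."""
    try:
        pickle.dumps(vars(plug))
        return True
    except Exception:
        return False


print("EagerPlug state picklable:", state_is_picklable(EagerPlug()))
print("LazyPlug state picklable:", state_is_picklable(LazyPlug("example-bucket")))
```

The lazy variant pays the cost of building a client on every call, which is usually acceptable when each call already involves a network round trip.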
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -48,6 +48,10 @@ Python virtual environments. We recommend using `conda`.
     ```console
     $ pip install -e ".[docs]"
     ```
+   Or for development (linting, testing):
+    ```console
+    $ pip install -e ".[dev]"
+    ```
 5. Create a branch for local development and make your changes:
 
     ```console
@@ -63,7 +67,7 @@ Python virtual environments. We recommend using `conda`.
 
    This will create a local copy of the documentation in `_build/html/index.html`.
 
-7. When you're done making changes, check that your changes conform to any code formatting requirements and pass any tests. One of the automated checks for each PR is linting with ruff and pre-commit hooks and will fail otherwise. You can install pre-commit with `pip install pre-commit` and then run `pre-commit install` in the root of the repository. When you commit your changes, the pre-commit hooks will run automatically.
+7. When you're done making changes, check that your changes conform to any code formatting requirements and pass any tests. One of the automated checks for each PR is linting with ruff and pre-commit hooks and will fail otherwise. If you installed with the `.[dev]` extra (step 4), `pre-commit` and `ruff` are already included — just run `pre-commit install` in the root of the repository. When you commit your changes, the pre-commit hooks will run automatically.
 
 8. Commit and push your changes to your fork, and open a pull request.
 

diff --git a/README.md b/README.md
@@ -220,10 +220,10 @@ Tools uses [CLARABEL](https://clarabel.org/stable/) as the solver all signal dec
 to specify another solver (such as MOSEK), just pass the keyword argument `solver` to `DataHandler.pipeline` with the solver of choice.
 
 ```python
+import pandas as pd
 from solardatatools import DataHandler
-from solardatatools.dataio import get_pvdaq_data
 
-pv_system_data = get_pvdaq_data(sysid=35, api_key='DEMO_KEY', year=[2011, 2012, 2013])
+pv_system_data = pd.read_csv('path/to/your/data.csv', index_col=0, parse_dates=True)
 
 dh = DataHandler(pv_system_data)
 dh.run_pipeline(power_col='dc_power')

diff --git a/docs/source/getting_started/usage.md b/docs/source/getting_started/usage.md
@@ -8,10 +8,10 @@ The data should be in the form of a pandas DataFrame with a datetime index and a
 (or the user must set the `datetime_col` kwarg.) The data is recommended to be in the local timezone of the PV system.
 
 ```python
+import pandas as pd
 from solardatatools import DataHandler
-from solardatatools.dataio import get_pvdaq_data
 
-pv_system_data = get_pvdaq_data(sysid=35, api_key='DEMO_KEY', year=[2011, 2012, 2013])
+pv_system_data = pd.read_csv('path/to/your/data.csv', index_col=0, parse_dates=True)
 
 dh = DataHandler(pv_system_data)
 dh.run_pipeline(power_col='dc_power')

diff --git a/docs/source/index_user_guide.md b/docs/source/index_user_guide.md
@@ -50,7 +50,7 @@ The DataHandler object is now ready to be used for data processing and analysis.
 Timeseries data is often in wide-form, where you have for example a DataFrame that has a timestamp
 column and one or more data columns. That's what the DataHandler typically expects. However,
 it also can take data in long-form, such as for example what we have in the Redshift data where some sites
-have more than one inverter (see the "Data I/O functions" section below). In this case, you will
+have more than one inverter. In this case, you will
 want to instantiate the DataHandler object with the `convert_to_ts` flag set to True:
 
 ```python
@@ -190,40 +190,12 @@ method and provide a GMT offset value by passing it to the method. After that, y
 four estimation methods. A demo of this feature can be found in the [tutorial](getting_started/notebooks/tutorial.ipynb)
 in cells 13-15.
 
-### Data I/O functions
+### Loading data
 
-The `dataio` module in Solar Data Tools includes a number of functions to pull data from various sources.
-These functions are useful for loading data into a DataFrame that can be used with the `DataHandler` class.
-The available functions are:
+You can load data into a DataFrame using standard pandas functions:
 
-| Method                            | Description                                                                    |
-|-----------------------------------|--------------------------------------------------------------------------------|
-| dataio.get_pvdaq_data      | Queries one or more years of raw PV system data from NREL's PVDAQ data service |
-| dataio.load_constellation_data | Loads constellation data from a specified location                             |
-| dataio.load_redshift_data  | Queries a SunPower dataset by site id and returns a Pandas DataFrame           |
-| dataio.load_pvo_data    | Loads NREL data from private S3 bucket (for use by the SLAC team only)         |
-
-
-The PVDAQ database is a public database of solar power data that can be accessed by anyone. The system
-locations that can be accessed are shown on [this interactive map](https://openei.org/wiki/PVDAQ/PVData_Map).
-You can use the "DEMO_KEY" for querying the data, but you can also get your own API key by
-registering [here](https://data.openei.org/submissions/4568).
-An example usage for this function for system ID 34 is shown below:
 ```python
-df = get_pvdaq_data(sysid=34, year=range(2011, 2015), api_key='DEMO_KEY')
-```
-
-To use the `load_redshift_data` function, you will need to
-request an API key by registering at [https://pvdb.slacgismo.org](https://pvdb.slacgismo.org) and emailing
-slacgismotutorials@gmail.com with your information and use case. To query the data, you also must
-provide a site ID and a sensor number (0, 1, 2 ...). An example usage is shown below:
-
-```python
-query = {
-    'siteid': 'TABJC1027159',
-    'api_key': YOUR_API_KEY,
-    'sensor': 0
-}
+import pandas as pd
 
-df = load_redshift_data(**query)
+df = pd.read_csv('path/to/your/data.csv', index_col=0, parse_dates=True)
 ```
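Since the new "Loading data" text relies on `pd.read_csv(..., index_col=0, parse_dates=True)` producing a datetime index, a quick check before constructing a `DataHandler` may be useful. The following is a minimal sketch under stated assumptions: the CSV content, the `timestamp` header, and the sample values are made up for illustration, and `io.StringIO` stands in for `'path/to/your/data.csv'` so the snippet is self-contained.

```python
import io

import pandas as pd

# Hypothetical CSV content: the first column holds timestamps,
# the remaining column holds a power measurement.
csv_text = """timestamp,dc_power
2011-06-01 12:00:00,3.1
2011-06-01 12:05:00,3.3
2011-06-01 12:10:00,3.2
"""

# Same read_csv arguments as in the docs; StringIO replaces the file path.
df = pd.read_csv(io.StringIO(csv_text), index_col=0, parse_dates=True)

# The docs state that the DataFrame should have a datetime index,
# so fail early if the first column was not parsed as datetimes.
if not isinstance(df.index, pd.DatetimeIndex):
    raise TypeError("index was not parsed as datetimes; check the timestamp column")

# Sorting is a defensive step for files whose rows are out of order.
if not df.index.is_monotonic_increasing:
    df = df.sort_index()

print(df.columns.tolist())
```

With a frame like this, `DataHandler(df)` followed by `dh.run_pipeline(power_col='dc_power')` would be the next step, as in the README example in this PR.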