
Conversation

@FBumann
Member

@FBumann FBumann commented Jan 16, 2026

Summary

Performance optimizations using faster data access patterns throughout the codebase.

Changes

File Function Optimization Speedup
transform_accessor.py _build_typical_das Numpy reshape instead of pandas .loc 4.4x
transform_accessor.py _build_reduced_dataset Use ds.variables + numpy slicing 12%
io.py _add_clustering_to_dataset Batch assignment with ds.assign() 24x
io.py various functions Use ds.variables pattern -
structure.py from_dataset Use coord_cache pattern -
core.py drop_constant_arrays Use ds.variables + numpy -
clustering/base.py various functions Use ds.variables pattern -

Key Optimization: Numpy reshape for segmented clustering

# Before: 451,856 slow pandas .loc calls (93.7s)
# (`data` is a preallocated (n_clusters, n_time_points) array for the current column)
for col in typical_df.columns:
    for cluster_id in range(actual_n_clusters):
        cluster_data = typical_df.loc[cluster_id, col]  # SLOW!
        data[cluster_id, :] = cluster_data.values

# After: Single numpy reshape (21.1s)
# Assumes typical_df rows are ordered cluster-major with n_time_points rows per cluster
all_values = typical_df.values  # (n_clusters * n_time_points, n_columns)
reshaped = all_values.reshape(actual_n_clusters, n_time_points, -1)

for col_idx, col in enumerate(columns):
    # reshaped[:, :, col_idx] selects all clusters, all time points, single column
    # Result shape: (n_clusters, n_time_points)
    data = reshaped[:, :, col_idx]

Benchmarks

Operation Before After Speedup
cluster() with segments (50 clusters, 4 segments) 93.7s 21.1s 4.4x
cluster() without segments (12 clusters) 10.1s 8.9s 12%
to_dataset() with clustering 2600ms 109ms 24x
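
The 24x to_dataset() speedup comes from the batch assignment noted in the Changes table. A minimal sketch of the idea, with illustrative names rather than the actual _add_clustering_to_dataset code:

import xarray as xr

def attach_arrays_slow(ds: xr.Dataset, new_arrays: dict[str, xr.DataArray]) -> xr.Dataset:
    # Per-item assignment re-runs xarray's merge/alignment checks for every variable
    for name, da in new_arrays.items():
        ds[name] = da
    return ds

def attach_arrays_fast(ds: xr.Dataset, new_arrays: dict[str, xr.DataArray]) -> xr.Dataset:
    # A single batched call merges all new variables at once
    return ds.assign(new_arrays)

Collapsing many individual assignments into one ds.assign() call pays off when a large number of clustering arrays is attached to the dataset.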

Testing

  • All IO tests pass
  • All 185 clustering tests pass

Optimize several functions by using ds.variables instead of iterating
over data_vars.items() or accessing ds[name], which triggers slow
_construct_dataarray calls.
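
As a rough illustration of that pattern (a minimal sketch with illustrative helper names, not the exact flixopt code):

import xarray as xr

def iter_data_values_slow(ds: xr.Dataset):
    # data_vars.items() constructs a full DataArray (coords + indexes) per variable
    for name, da in ds.data_vars.items():
        yield name, da.values

def iter_data_values_fast(ds: xr.Dataset):
    # ds.variables hands back low-level Variable objects, skipping _construct_dataarray
    coord_names = set(ds.coords)
    for name, var in ds.variables.items():
        if name not in coord_names:  # keep only data variables
            yield name, var.values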

Changes:
- io.py: save_dataset_to_netcdf, load_dataset_from_netcdf, _reduce_constant_arrays
- structure.py: from_dataset (use coord_cache pattern)
- core.py: drop_constant_arrays (use numpy operations)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Jan 16, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Adds multidimensional clustering (Clustering / ClusteringResults), VariableCategory-based variable classification, segmented-timesteps support (RangeIndex), a centralized FlowSystem ↔ xarray I/O layer (FlowSystemDatasetIO), updated TransformAccessor clustering API (ClusterConfig/ExtremeConfig/SegmentConfig), and an IO benchmark script.

Changes

Cohort / File(s) Summary
Clustering core
flixopt/clustering/__init__.py, flixopt/clustering/base.py, flixopt/clustering/intercluster_helpers.py
Replace legacy ClusterResult/ClusterStructure with Clustering and ClusteringResults; multi-dimensional (period, scenario) results, IO (to_json/from_json), plotting accessor, rename cluster_order → cluster_assignments.
Transform & clustering ops
flixopt/transform_accessor.py
New cluster API (data_vars, cluster: ClusterConfig, extremes: ExtremeConfig, segments: SegmentConfig), clustering_data(), apply_clustering(), builders for weights/metrics/typical periods, segmentation-aware expand and mapping helpers.
FlowSystem I/O
flixopt/io.py, flixopt/flow_system.py
Add FlowSystemDatasetIO and module wrappers (flow_system_to_dataset/restore_flow_system_from_dataset); delegate to_dataset/from_dataset; add include_original_data flag; robust netCDF save/load helpers.
Variable categorization
flixopt/structure.py, flixopt/components.py, flixopt/elements.py, flixopt/features.py, flixopt/effects.py, flixopt/modeling.py
Introduce VariableCategory enum and EXPAND_* groups; track variable_categories on FlowSystemModel; Submodel.add_variables(..., category=...) records mappings; annotate many model variables and investment sizing with categories.
Segmentation & timesteps
flixopt/flow_system.py
Support RangeIndex timesteps (segmented systems), optional timestep_duration, is_segmented property, variable_categories propagation, get_variables_by_category API and segmentation-aware metadata/expansion.
I/O internals & perf
flixopt/io.py, benchmarks/benchmark_io_performance.py
Dataset I/O refactor to use ds.variables, faster stack/unstack/reduce helpers; new benchmark script measuring to_dataset(include_solution=True) and FlowSystem.from_dataset(ds) roundtrip timings.
Statistics & utils
flixopt/statistics_accessor.py, flixopt/core.py
Statistics selection switched to category-based lookups; drop_constant_arrays() adds atol and uses NumPy-based constancy detection for performance.
Modeling & features
flixopt/modeling.py, flixopt/features.py, flixopt/components.py
Add size_category parameter to InvestmentModel; propagate categories to size/investment/status/duration and related variables.
Docs & notebooks
docs/notebooks/*, docs/user-guide/optimization/*
Notebook metadata updates and docs rewritten to use ExtremeConfig, ClusterConfig, new clustering results access patterns, segmentation docs and examples expanded.
Tests
tests/... (many files)
Tests updated to new clustering API (Clustering/ClusteringResults), cluster_assignments usage, data_vars/extremes/segmentation coverage, and IO roundtrip assertions.
Project manifest
pyproject.toml
Replace tsam pinned versions with a git-based dependency tsam @ git+https://github.com/FBumann/tsam.git@v3-rebased.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant FS as FlowSystem
    participant TA as TransformAccessor
    participant CL as Clustering
    participant IO as FlowSystemDatasetIO
    participant XR as xarray.Dataset

    Note over User,FS: create / configure system
    User->>FS: create(timesteps, periods, components)
    FS->>FS: init (DatetimeIndex | RangeIndex), record variable_categories

    Note over User,TA: clustering workflow
    User->>TA: cluster(n_clusters, data_vars, extremes=ExtremeConfig(...))
    TA->>TA: clustering_data() → prepare inputs
    TA->>CL: call tsam.aggregate & build ClusteringResults
    CL->>CL: build cluster_assignments, timesteps_per_cluster, metrics
    TA->>FS: produce reduced FlowSystem with Clustering attached

    Note over User,FS: solve & attach solution
    User->>FS: solve()
    FS->>FS: store solution, propagate variable_categories

    Note over User,FS: expand & IO
    User->>TA: expand(reduced_solution)
    TA->>CL: expand_data(aggregated, segmentation-aware)
    CL->>FS: produce expanded solution

    User->>FS: to_dataset(include_solution=True, include_original_data=True)
    FS->>IO: flow_system_to_dataset(...)
    IO->>XR: construct xr.Dataset (solution, clustering, categories...)

    User->>XR: save/load dataset
    XR->>IO: restore_flow_system_from_dataset(ds)
    IO->>CL: reconstruct Clustering from serialized data
    IO->>FS: return FlowSystem with clustering and variable_categories

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Poem

🐰 I hopped through timesteps, tagged each flow with care,

Clusters curled like cozy naps, then stretched back to the air,
RangeIndex and Datetime now share one sunny burrow,
Datasets saved and loaded — crunchy carrots, not sorrow,
I timed the bytes and twitched my nose: IO faster, let's go!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 91.37% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title 'perf: Optimize clustering and I/O (4.4x faster segmented clustering)' clearly identifies the main performance optimization focus with specific metrics, directly matching the changeset's primary objective.
Description check ✅ Passed The description provides a comprehensive summary with change details, optimization strategies, before/after code examples, and benchmark results. However, it lacks explicit Type of Change selection and Testing checklist completion as specified in the template.


Use ds.variables for faster access in clustering/base.py:
- _create_reference_structure: original_data and metrics iteration
- compare plot: duration_curve generation with direct numpy indexing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🤖 Fix all issues with AI agents
In `@benchmarks/benchmark_io_performance.py`:
- Around line 138-146: The benchmark currently calls
fs_clustered.optimize(fx.solvers.GurobiSolver()) which will fail at solve time
if Gurobi is not available; change this to detect/handle solver availability by
wrapping the optimize call in a try/except that catches solver runtime errors
and falls back to fx.solvers.HighsSolver(), or programmatically choose the
solver before calling fs_clustered.optimize by attempting to use
fx.solvers.GurobiSolver() and on failure selecting fx.solvers.HighsSolver();
update the call sites referenced (create_large_flow_system,
fs.transform.cluster, fs_clustered.optimize, fx.solvers.GurobiSolver) so the
benchmark either falls back to HighsSolver or logs a clear error and exits.

In `@docs/notebooks/08c2-clustering-storage-modes.ipynb`:
- Around line 174-175: Remove the invalid import "from tsam.config import
ExtremeConfig" and replace any usage of ExtremeConfig (e.g.,
extremes=ExtremeConfig(...)) with the TimeSeriesAggregation API by setting the
extremePeriodMethod parameter (one of "None", "append", "new_cluster_center",
"replace_cluster_center") and the corresponding flags like addPeakMin,
addPeakMax, addMeanMin, addMeanMax on the TimeSeriesAggregation instance; ensure
any code that passed extremes=ExtremeConfig(...) is updated to configure extreme
handling via TimeSeriesAggregation.extremePeriodMethod and the addPeak*/addMean*
arguments instead.

In `@docs/notebooks/08d-clustering-multiperiod.ipynb`:
- Around line 283-285: The printed header uses hardcoded "period=2024,
scenario=High" which can diverge from the indices used; retrieve the actual
labels from the dataset/coords used to produce cluster_assignments (e.g., the
period and scenario coordinate arrays referenced when you computed
cluster_assignments) and use those values in the header instead of fixed text,
then keep the existing loop over cluster_assignments and day_names to print each
day's cluster; reference the variables/objects cluster_assignments and day_names
and the same period/scenario coord variables you used earlier to select the
slice so the header is always derived from the data.

In `@flixopt/components.py`:
- Around line 1507-1512: The decay calculation uses mean timestep duration which
underestimates total hours per cluster; update the reduction on
self._model.timestep_duration (used when computing decay_n) to compute the total
hours for the cluster (e.g., use
_scalar_safe_reduce(self._model.timestep_duration, 'time', 'sum') or equivalent)
so decay_n = (1 - self.element.relative_loss_per_hour) **
total_hours_per_cluster correctly reflects summed duration while preserving
DataArray semantics and usage of _scalar_safe_reduce and relative_loss_per_hour.

In `@flixopt/flow_system.py`:
- Around line 210-218: The code currently allows self.timestep_duration to be
None when timesteps is a RangeIndex (segmented system), which breaks downstream
logic (e.g., weights['time'], temporal_weight); add an early validation in the
constructor or initializer where timestep_duration and
computed_timestep_duration are handled: detect the segmented case (timesteps is
a RangeIndex or computed_timestep_duration is None) and raise a clear ValueError
if timestep_duration is not provided, mentioning that RangeIndex requires an
explicit timestep_duration; keep existing logic for assigning
self.timestep_duration (including using fit_to_model_coords('timestep_duration',
computed_timestep_duration) when available) but enforce this guard before
downstream computation.

In `@flixopt/io.py`:
- Around line 1587-1603: _fast_get_dataarray currently only includes coordinates
whose names match variable.dims, dropping auxiliary coordinates whose dims are a
subset of the variable's dims; update the coords build in _fast_get_dataarray to
include any coord_cache entry where coord.dims is a subset of variable.dims (not
just keys equal to dims), so auxiliary coords are preserved when constructing
the xr.DataArray (use variable.dims and each coord.dims to decide inclusion and
pass the resulting coords dict to xr.DataArray).

In `@flixopt/transform_accessor.py`:
- Around line 1379-1385: Add a guard to detect when drop_constant_arrays removes
all variables before calling to_dataframe(): after computing
temporaly_changing_ds_for_clustering =
drop_constant_arrays(ds_slice_for_clustering, dim='time') check whether
temporaly_changing_ds_for_clustering is empty (e.g., has no data_vars or no
variables) and raise a clear ValueError mentioning the input filtering
(data_vars/selector) and that there is no varying data to cluster, instead of
calling temporaly_changing_ds_for_clustering.to_dataframe() and allowing a
cryptic indexing error; update the code paths that assign df_for_clustering
accordingly so downstream logic only runs when df_for_clustering is valid.
- Around line 1902-1914: The current indexing misuses timestep_mapping modulo
n_segments for time positions when clustering.is_segmented, causing wrong
time_indices and bad SOC interpolation; change the time index calculation so
time_indices = timestep_mapping % clustering.timesteps_per_cluster (always use
timesteps_per_cluster for the time position), keep cluster_indices =
timestep_mapping // time_dim_size (where time_dim_size remains
clustering.n_segments when segmented for clustering into clusters), and then use
those cluster_indices/time_indices to index segment_assignments,
position_within_segment and segment_durations (variables: timestep_mapping,
clustering.is_segmented, clustering.n_segments,
clustering.timesteps_per_cluster, cluster_indices, time_indices,
segment_assignments, position_within_segment, segment_durations, seg_indices,
positions, durations).

In `@pyproject.toml`:
- Around line 67-68: The pyproject.toml currently lists a git URL dependency
"tsam @ git+https://github.com/FBumann/tsam.git@v3-rebased" in the [full] and
[dev] extras which will break PyPI publication; remove that git-based
requirement from both the [full] and [dev] extras and either (a) replace it with
a proper PyPI version specifier (e.g., "tsam >= 3.0.0, < 4") once tsam is
published, (b) omit tsam from those extras and add a note in your README/INSTALL
docs instructing users to install tsam from source, or (c) vendor/include tsam
in the package; finally, if this is intended only for development, update the
release workflow (the verify-pypi job) to skip PyPI verification or otherwise
allow dev-only dependencies so the release will pass.
🧹 Nitpick comments (2)
flixopt/structure.py (1)

901-908: Avoid eager .values on potentially lazy arrays.
array.values will fully materialize dask-backed data. If datasets can be opened lazily, this can blow memory and negate the IO perf gains. Consider guarding for lazy arrays or using a reduction that doesn’t materialize the full array.

🔧 Lazy-safe null check (example)
-        has_nulls = (np.issubdtype(array.dtype, np.floating) and np.any(np.isnan(array.values))) or (
-            array.dtype == object and pd.isna(array.values).any()
-        )
+        data = array.data
+        if hasattr(data, "compute"):
+            has_nulls = bool(array.isnull().any().compute())
+        else:
+            has_nulls = (np.issubdtype(array.dtype, np.floating) and np.any(np.isnan(array.values))) or (
+                array.dtype == object and pd.isna(array.values).any()
+            )

To confirm whether lazy loading is possible in your IO path, you can locate the NetCDF loader implementation:

#!/bin/bash
rg -n "def load_dataset_from_netcdf" -g '*.py'
rg -n "open_dataset|load_dataset" -g '*.py'
flixopt/clustering/base.py (1)

528-534: Preserve dimension order in _get_dim_values to keep .isel() predictable.

sorted(set(...)) reorders period/scenario values and can make .isel() indices differ from the original order (e.g., FlowSystem period/scenario lists). Consider preserving first-seen order from the results keys.

♻️ Proposed refactor
-        return sorted(set(k[idx] for k in self._results.keys()))
+        values: list = []
+        seen = set()
+        for key in self._results.keys():
+            val = key[idx]
+            if val not in seen:
+                values.append(val)
+                seen.add(val)
+        return values
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 846dd57 and 5604845.

📒 Files selected for processing (39)
  • benchmarks/benchmark_io_performance.py
  • docs/notebooks/01-quickstart.ipynb
  • docs/notebooks/02-heat-system.ipynb
  • docs/notebooks/03-investment-optimization.ipynb
  • docs/notebooks/04-operational-constraints.ipynb
  • docs/notebooks/05-multi-carrier-system.ipynb
  • docs/notebooks/06a-time-varying-parameters.ipynb
  • docs/notebooks/06b-piecewise-conversion.ipynb
  • docs/notebooks/06c-piecewise-effects.ipynb
  • docs/notebooks/08a-aggregation.ipynb
  • docs/notebooks/08c-clustering.ipynb
  • docs/notebooks/08c2-clustering-storage-modes.ipynb
  • docs/notebooks/08d-clustering-multiperiod.ipynb
  • docs/notebooks/08e-clustering-internals.ipynb
  • docs/notebooks/08f-clustering-segmentation.ipynb
  • docs/notebooks/09-plotting-and-data-access.ipynb
  • docs/notebooks/10-transmission.ipynb
  • docs/user-guide/optimization/clustering.md
  • docs/user-guide/optimization/index.md
  • flixopt/clustering/__init__.py
  • flixopt/clustering/base.py
  • flixopt/clustering/intercluster_helpers.py
  • flixopt/components.py
  • flixopt/core.py
  • flixopt/effects.py
  • flixopt/elements.py
  • flixopt/features.py
  • flixopt/flow_system.py
  • flixopt/io.py
  • flixopt/modeling.py
  • flixopt/statistics_accessor.py
  • flixopt/structure.py
  • flixopt/transform_accessor.py
  • pyproject.toml
  • tests/deprecated/examples/03_Optimization_modes/example_optimization_modes.py
  • tests/test_cluster_reduce_expand.py
  • tests/test_clustering/test_base.py
  • tests/test_clustering/test_integration.py
  • tests/test_clustering_io.py
🧰 Additional context used
🧬 Code graph analysis (12)
tests/deprecated/examples/03_Optimization_modes/example_optimization_modes.py (2)
flixopt/flow_system.py (2)
  • copy (964-989)
  • transform (1681-1699)
flixopt/transform_accessor.py (1)
  • cluster (1194-1478)
benchmarks/benchmark_io_performance.py (2)
flixopt/flow_system.py (6)
  • FlowSystem (61-2604)
  • transform (1681-1699)
  • to_dataset (686-724)
  • from_dataset (727-749)
  • solution (1461-1478)
  • solution (1481-1484)
flixopt/io.py (2)
  • to_dataset (1785-1828)
  • from_dataset (1528-1563)
flixopt/modeling.py (1)
flixopt/structure.py (3)
  • VariableCategory (82-129)
  • add_variables (204-220)
  • add_variables (1740-1768)
flixopt/clustering/__init__.py (1)
flixopt/clustering/base.py (2)
  • Clustering (636-1346)
  • ClusteringResults (161-633)
flixopt/structure.py (1)
flixopt/flow_system.py (1)
  • variable_categories (1487-1493)
flixopt/statistics_accessor.py (2)
flixopt/structure.py (3)
  • VariableCategory (82-129)
  • solution (290-338)
  • solution (1288-1302)
flixopt/flow_system.py (3)
  • get_variables_by_category (1495-1547)
  • solution (1461-1478)
  • solution (1481-1484)
tests/test_cluster_reduce_expand.py (1)
flixopt/clustering/base.py (14)
  • n_clusters (316-318)
  • n_clusters (672-674)
  • cluster_assignments (338-349)
  • cluster_assignments (778-784)
  • timesteps_per_cluster (321-323)
  • timesteps_per_cluster (677-679)
  • sel (235-254)
  • sel (714-761)
  • clusters (1603-1706)
  • dims (217-219)
  • dims (697-699)
  • timestep_mapping (814-823)
  • coords (227-233)
  • coords (702-712)
flixopt/features.py (3)
flixopt/structure.py (9)
  • FlowSystemModel (188-470)
  • VariableCategory (82-129)
  • add_variables (204-220)
  • add_variables (1740-1768)
  • get_coords (415-450)
  • get_coords (1818-1823)
  • dims (350-352)
  • label_full (1284-1285)
  • label_full (1849-1850)
flixopt/clustering/base.py (4)
  • coords (227-233)
  • coords (702-712)
  • dims (217-219)
  • dims (697-699)
flixopt/elements.py (1)
  • label_full (638-639)
flixopt/flow_system.py (3)
flixopt/structure.py (9)
  • CompositeContainerMixin (1537-1713)
  • Element (1241-1341)
  • FlowSystemModel (188-470)
  • Interface (473-1238)
  • VariableCategory (82-129)
  • dims (350-352)
  • to_dataset (1038-1061)
  • to_netcdf (1063-1090)
  • copy (1218-1230)
flixopt/clustering/base.py (5)
  • coords (227-233)
  • coords (702-712)
  • dims (217-219)
  • dims (697-699)
  • is_segmented (764-770)
flixopt/io.py (4)
  • to_dataset (1785-1828)
  • flow_system_to_dataset (1945-1968)
  • restore_flow_system_from_dataset (1927-1942)
  • save_dataset_to_netcdf (532-586)
flixopt/transform_accessor.py (2)
flixopt/structure.py (3)
  • VariableCategory (82-129)
  • values (1663-1668)
  • values (1969-1970)
flixopt/clustering/base.py (13)
  • Clustering (636-1346)
  • n_clusters (316-318)
  • n_clusters (672-674)
  • is_segmented (764-770)
  • values (304-306)
  • values (1321-1328)
  • n_segments (331-333)
  • n_segments (773-775)
  • timesteps_per_cluster (321-323)
  • timesteps_per_cluster (677-679)
  • sel (235-254)
  • sel (714-761)
  • results (1236-1247)
flixopt/components.py (3)
flixopt/structure.py (4)
  • FlowSystemModel (188-470)
  • VariableCategory (82-129)
  • add_variables (204-220)
  • add_variables (1740-1768)
flixopt/clustering/base.py (7)
  • n_clusters (316-318)
  • n_clusters (672-674)
  • timesteps_per_cluster (321-323)
  • timesteps_per_cluster (677-679)
  • n_original_clusters (687-689)
  • cluster_assignments (338-349)
  • cluster_assignments (778-784)
flixopt/modeling.py (1)
  • _scalar_safe_reduce (57-76)
flixopt/effects.py (1)
flixopt/structure.py (3)
  • Submodel (1716-1914)
  • VariableCategory (82-129)
  • register_class_for_io (148-157)
🔇 Additional comments (91)
docs/notebooks/06b-piecewise-conversion.ipynb (1)

208-217: Notebook metadata standardization looks good.

The added metadata fields (codemirror_mode, file_extension, mimetype, nbconvert_exporter, pygments_lexer) are standard Jupyter notebook fields that improve editor compatibility and export behavior.

Note: The Python version was changed from 3.12.7 to 3.11.11. If this is intentional (e.g., to match your CI/dev environment), this is fine. Just verify it aligns with project requirements.

docs/notebooks/06c-piecewise-effects.ipynb (2)

315-323: Standard Jupyter metadata additions look good.

The added metadata fields (codemirror_mode, file_extension, mimetype, nbconvert_exporter, pygments_lexer) are standard Jupyter notebook metadata that enable proper syntax highlighting and notebook export functionality.


324-324: Verify the Python version downgrade.

The Python version was changed from "3.12.7" to "3.11.11". Please confirm whether this downgrade is intentional (e.g., for CI/environment standardization) or an artifact of re-executing the notebook in a different environment.

If this notebook was simply re-run in your current environment, consider whether committing the version change is necessary, especially if the project supports multiple Python versions.

docs/notebooks/04-operational-constraints.ipynb (1)

475-487: LGTM!

Standard Jupyter notebook metadata enrichment adding language_info fields for editor integration and export functionality. This aligns with the metadata updates across other notebooks in the repository.

docs/notebooks/05-multi-carrier-system.ipynb (1)

544-554: LGTM!

Consistent metadata enrichment matching the pattern applied to other notebooks. No functional changes.

docs/notebooks/10-transmission.ipynb (1)

636-646: LGTM!

Standard notebook metadata update consistent with the other notebooks in this PR.

docs/notebooks/09-plotting-and-data-access.ipynb (1)

834-844: LGTM!

Consistent notebook metadata update aligning with repository standards.

pyproject.toml (1)

86-86: Consistent with [full] extras dependency.

The [dev] dependency mirrors the git reference used in [full], maintaining consistency across installation profiles.

docs/notebooks/06a-time-varying-parameters.ipynb (1)

311-324: LGTM!

The metadata enrichment with language_info (Python 3.11.11) aligns with the consistent notebook metadata pattern across the repository.

docs/notebooks/08f-clustering-segmentation.ipynb (5)

152-166: LGTM!

Clear demonstration of clustering with segmentation. The parameters (n_clusters, cluster_duration, segments, extremes) are well-documented and the output logging provides useful feedback.


185-193: LGTM!

Good exploration of the clustering object properties after segmentation. The is_segmented, n_segments, n_clusters, and timesteps_per_cluster properties provide comprehensive insight into the segmentation structure.


317-325: LGTM!

The expansion workflow is clearly demonstrated with timing metrics and verification of objective preservation.


625-643: LGTM!

Complete notebook metadata with kernelspec and language_info consistent with other notebooks in the repository.


147-147: Import path is correct and consistent with existing codebase usage.

The import from tsam.config import ExtremeConfig, SegmentConfig follows the same pattern used throughout the repository in flixopt/transform_accessor.py and test files, confirming the tsam API is properly integrated.

docs/notebooks/08c-clustering.ipynb (8)

135-147: LGTM!

Clean migration to the new ExtremeConfig-based peak forcing API. The import and usage pattern are consistent with the updated clustering interface.


184-186: LGTM!

Updated access path to fs_clustered.clustering.results aligns with the new ClusteringResults container architecture.


226-258: LGTM!

Excellent addition documenting the clustering_data() method for inspecting time-varying data before clustering. This helps users understand which variables influence the clustering algorithm.


260-303: LGTM!

Good documentation of the data_vars parameter for selective clustering. The explanation of "cluster based on subset, apply to all data" is clear and the use case examples are practical.


337-347: LGTM!

Clear example of using ClusterConfig to specify alternative clustering algorithms (kmeans vs hierarchical).


381-400: LGTM!

The apply_clustering() documentation provides a useful pattern for design variant comparison with consistent cluster structures.


623-645: LGTM!

Comprehensive API reference update with renamed properties (cluster_assignments, cluster_occurrences) and the new ClusteringResults xarray-like interface documentation.


730-748: LGTM!

Updated notebook metadata consistent with repository-wide Python 3.11.11 standardization.

flixopt/effects.py (3)

20-28: LGTM!

Clean import grouping with VariableCategory added from .structure alongside the existing structure imports.


383-389: LGTM!

Correctly annotates the total variable with VariableCategory.TOTAL, enabling proper classification for segmentation and IO handling. This follows the established pattern in Submodel.add_variables which stores the category in the model's variable_categories mapping.


401-407: LGTM!

Correctly annotates the total_over_periods variable with VariableCategory.TOTAL_OVER_PERIODS. Consistent with the categorization pattern applied to total above.

docs/notebooks/02-heat-system.ipynb (1)

383-395: LGTM!

Metadata enrichment with language_info (Python 3.11.11) consistent with the repository-wide notebook standardization.

docs/notebooks/08a-aggregation.ipynb (1)

391-403: LGTM!

The language_info metadata enrichment is consistent with standard Jupyter notebook conventions and aligns with the broader metadata standardization across the notebook collection.

flixopt/clustering/intercluster_helpers.py (3)

14-16: LGTM!

The docstring correctly reflects the updated API using cluster_assignments[d] instead of cluster_order[d], aligning with the broader clustering API refactor described in the PR.


74-154: LGTM!

The function correctly handles all capacity parameter types (InvestParameters, numeric, None) with appropriate fallback behavior and helpful warning messages when using default unbounded capacity.


157-201: LGTM!

The function correctly constructs boundary coordinates with proper handling of optional period and scenario dimensions from the FlowSystem.

docs/notebooks/01-quickstart.ipynb (1)

285-295: LGTM!

The metadata enrichment is consistent with the standardization effort across notebooks, adding proper Python runtime details and version information.

docs/notebooks/03-investment-optimization.ipynb (1)

432-444: LGTM!

The metadata addition follows the same standardization pattern applied across the notebook collection.

docs/notebooks/08c2-clustering-storage-modes.ipynb (1)

195-199: LGTM!

The migration from time_series_for_high_peaks to the ExtremeConfig-based API provides more explicit control over extreme period handling. The method='new_cluster' with max_value=PEAK_SERIES correctly specifies that peak demand periods should be captured as separate clusters.

tests/deprecated/examples/03_Optimization_modes/example_optimization_modes.py (1)

193-211: LGTM - Clean migration to ExtremeConfig API.

The migration from time_series_for_high_peaks/time_series_for_low_peaks to the new ExtremeConfig-based approach follows the tsam 3.0 API correctly. The variable names reference the correct component labels defined earlier in the file (Wärmelast at line 123, Stromlast at line 128).

docs/user-guide/optimization/index.md (1)

59-66: LGTM - Documentation correctly demonstrates the new ExtremeConfig API.

The example clearly shows how to use the new extremes parameter with ExtremeConfig, aligning with the updated transform.cluster() signature.

flixopt/core.py (2)

643-650: Potential edge case with all-NaN slices.

When data contains slices that are entirely NaN along the specified axis, np.nanmax - np.nanmin will return NaN, and the comparison ptp < atol will be False. This means all-NaN variables won't be dropped as "constant."

If all-NaN variables should be treated as constant (and dropped), consider:

-        if np.all(ptp < atol):
+        if np.all((ptp < atol) | np.isnan(ptp)):

However, if preserving all-NaN variables is intentional (e.g., for debugging), the current behavior is acceptable.


617-619: LGTM - Performance optimization with new tolerance parameter.

The addition of atol parameter with a sensible default (1e-10) allows for tolerance-based constancy detection, which is more robust for floating-point data than exact equality checks. The use of ds.variables for direct Variable access is a good performance optimization.
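
A standalone NumPy sketch of the tolerance-based constancy check described here (not the flixopt implementation; the function name is illustrative):

import numpy as np

def is_effectively_constant(values: np.ndarray, axis: int = 0, atol: float = 1e-10) -> bool:
    # Peak-to-peak range (nanmax - nanmin) below atol means the slice barely varies.
    # Note: an all-NaN slice yields a NaN range here, so it is *not* treated as constant
    # (the edge case discussed in the comment above).
    ptp = np.nanmax(values, axis=axis) - np.nanmin(values, axis=axis)
    return bool(np.all(ptp < atol))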

flixopt/elements.py (8)

19-25: LGTM - Clean import of VariableCategory.

The import is properly organized with related structure module imports.


671-677: LGTM - Proper categorization of flow_rate variable.

The category=VariableCategory.FLOW_RATE correctly identifies the main flow rate variable for downstream processing and filtering.


682-693: LGTM - Appropriate category for total_flow_hours tracking variable.

Using VariableCategory.TOTAL correctly identifies this as an aggregate/total variable.


709-724: LGTM - Correct categorization for flow_hours_over_periods.

The VariableCategory.TOTAL_OVER_PERIODS appropriately distinguishes this cross-period aggregate from per-period totals.


732-749: LGTM - Status variable correctly categorized.

The binary status variable is properly marked with VariableCategory.STATUS.


751-761: LGTM - Investment model receives size category.

Passing size_category=VariableCategory.FLOW_SIZE to InvestmentModel allows proper identification of flow sizing variables.


969-981: LGTM - Virtual flow variables properly categorized.

Both virtual_supply and virtual_demand are correctly marked with VariableCategory.VIRTUAL_FLOW, distinguishing them from physical flow variables.


1047-1052: LGTM - Component-level status variable categorized consistently.

The component's status variable uses the same VariableCategory.STATUS as flow-level status variables, maintaining consistency.

tests/test_clustering_io.py (8)

6-6: LGTM - Required import for new tests.

The xarray import is needed for xr.DataArray in the fixture and xr.testing.assert_equal in assertions.


82-93: LGTM - Stronger type validation for restored clustering.

The explicit isinstance(fs_restored.clustering, Clustering) check ensures the deserialization produces the correct type, not just a truthy value.


124-125: Good defensive testing for serialization edge case.

The check_names=False parameter appropriately handles the potential loss of index name during NetCDF serialization, which is a known xarray/NetCDF behavior.


543-586: Well-designed fixture for multi-dimensional testing.

The system_with_periods_and_scenarios fixture creates intentionally different demand patterns for each period/scenario combination to ensure cluster assignments differ. This is a thoughtful approach to testing multi-dimensional clustering behavior.


588-612: Good coverage of cluster assignment dimensionality.

These tests verify the critical requirement that cluster assignments have the correct dimensions and produce different results for different period/scenario combinations.


614-669: Comprehensive roundtrip and derived property tests.

The tests properly validate that cluster_assignments are exactly preserved after NetCDF roundtrip, and that derived properties like n_clusters, timesteps_per_cluster, and cluster_occurrences work correctly after deserialization.


671-703: Valuable test for apply_clustering with loaded clustering.

This test validates an important use case: loading a saved clustering and applying it to a fresh FlowSystem. The assertions verify both structural equivalence and assignment matching.


705-724: LGTM - End-to-end test for expand after load and optimize.

This test covers the complete workflow: cluster → optimize → save → load → expand, verifying that the solution is properly expanded to original timesteps.

flixopt/modeling.py (1)

9-9: Category propagation looks consistent and useful.
Tagging tracker and duration variables with VariableCategory aligns with Submodel.add_variables and will help downstream filtering.

Also applies to: 266-308, 371-378

flixopt/statistics_accessor.py (1)

34-34: Category-based variable selection is a solid improvement.
This aligns the statistics accessor with the new categorization API and simplifies label handling.

Also applies to: 527-533, 571-573, 580-582, 596-598

docs/notebooks/08d-clustering-multiperiod.ipynb (1)

176-188: Notebook updates align well with the new clustering API.
The ExtremeConfig-based examples and results access paths read clearly.

Also applies to: 241-249, 252-259, 262-281, 288-291, 296-317, 322-364, 368-418, 422-451, 455-475, 480-486, 490-520, 524-582, 593-602

docs/user-guide/optimization/clustering.md (1)

26-37: Docs are consistent and clearer with the new clustering options.
Examples and parameter tables read cleanly.

Also applies to: 54-58, 59-88, 89-105, 107-124, 125-134

tests/test_cluster_reduce_expand.py (5)

65-65: Updated tests correctly target the new clustering public API.
The move to n_clusters, cluster_assignments, and timesteps_per_cluster reads cleanly.

Also applies to: 123-127, 135-135, 294-295, 338-347, 352-353, 452-452, 536-538, 545-545


767-840: Extremes coverage looks solid.
These tests should guard the new ExtremeConfig integration well.


845-957: data_vars parameter tests are thorough and focused.
Nice coverage for validation, preservation, and optimization.


996-1377: Segmentation and IO round-trip coverage is excellent.
The assertions around mapping, duration, and expansion behavior are especially helpful.

Also applies to: 1379-1441


1443-1523: combine_slices tests are clear and useful.
They exercise shape, coords, attrs, and datetime handling well.

tests/test_clustering/test_integration.py (2)

125-213: Good coverage for clustering_data shape and period handling.
These tests exercise the expected Dataset return type and period selection behavior well.


288-335: ClusterConfig/preserve_column_means test updates look solid.
The assertions align with the new API surface and deterministic expectations.

tests/test_clustering/test_base.py (5)

12-62: Helper-function tests look solid.
They validate both occurrences and mapping logic clearly.


63-205: ClusteringResults coverage is thorough.
Good coverage of dims/coords/selectors and DataArray accessors.


207-307: Clustering wrapper tests are consistent with the new API.
Nice coverage of derived properties and repr.


309-390: Multi-dimensional selection tests are well scoped.


392-483: Plot accessor tests are clear and focused.

flixopt/clustering/__init__.py (1)

1-51: Public exports and docstring example match the new API.

benchmarks/benchmark_io_performance.py (2)

19-75: Benchmark fixture setup is clear and representative.


78-106: Benchmark harness is clean and reusable.

flixopt/structure.py (4)

82-143: VariableCategory and expansion groupings are well organized.
Clear taxonomy for segment handling.


197-203: Nice: model-level registry for variable categories.
This cleanly centralizes category metadata for downstream expansion logic.


1118-1129: ds.variables + coord_cache construction looks good.
Nice performance-oriented replacement for repeated DataArray construction.


1740-1767: Category registration on variable creation is clean.
This keeps categorization close to variable definition and avoids later bookkeeping.

flixopt/io.py (7)

564-575: Direct attr serialization loop looks solid.

Single-pass handling for data vars and coords keeps the save path consistent and efficient.


609-641: Constant-dimension reduction looks consistent.

The reduction loop maintains dimensional bookkeeping and preserves attrs on the reconstructed variables.


665-706: Stacking path is clear and preserves grouping semantics.

Grouping by dims and stacking into a dedicated dimension reads correctly.


721-748: Unstacking restores variables efficiently.

The slice-based reconstruction logic is straightforward and aligns with the stack layout.


774-785: Load-time attr restoration mirrors the save path.

Good symmetry between serialization and deserialization of variable attrs.


1527-1563: Restore pipeline is cohesive with coord caching.

The sequence from separation → fast array build → restore hooks is clear and consistent.


1831-1861: Solution namespacing and has_solution flagging are consistent.

The solution prefixing and presence flagging are clear and aligned with the restore path.

flixopt/components.py (2)

942-953: Storage variable categorization is a good addition.

Charge state/netto discharge categories and storage-size categorization align well with the new variable taxonomy.

Also applies to: 979-985


1316-1323: Intercluster categorization is consistent.

Size and SOC boundary variables are now correctly tagged for downstream handling.

Also applies to: 1373-1380

flixopt/features.py (4)

40-84: Size category plumbing is clean.

The new size_category parameter is applied consistently to size and invested variables.


200-258: Status variables are now consistently categorized.

Inactive, startup/shutdown, total, and startup_count tags fit the taxonomy well.


408-428: Piecewise variable categories are wired through cleanly.

Categorizing piece and zero_point variables improves downstream handling without changing behavior.

Also applies to: 519-524


642-706: Share allocation categories correctly distinguish temporal vs. scalar shares.

The SHARE, TOTAL, and PER_TIMESTEP tags align with how these variables are used.

docs/notebooks/08e-clustering-internals.ipynb (2)

14-239: Notebook updates align with the new Clustering API.

The expanded explanations and examples match the consolidated Clustering/ClusteringResults workflow.


341-509: IO workflow and file‑size guidance are clear and helpful.

The preservation/loss matrix and size‑reduction tips should prevent user confusion.


Comment on lines 138 to 146
# Create and prepare FlowSystem
print('\n1. Creating FlowSystem...')
fs = create_large_flow_system(n_timesteps, n_periods, n_components)
print(f' Components: {len(fs.components)}')

print('\n2. Clustering and solving...')
fs_clustered = fs.transform.cluster(n_clusters=n_clusters, cluster_duration='1D')
fs_clustered.optimize(fx.solvers.GurobiSolver())

Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd /tmp && find . -name "benchmark_io_performance.py" 2>/dev/null | head -5

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

git ls-files | grep -E "(benchmark|solver)" | head -20

Repository: flixOpt/flixopt

Length of output: 116


🏁 Script executed:

rg "GurobiSolver" --type py -B 2 -A 2 | head -50

Repository: flixOpt/flixopt

Length of output: 2612


🏁 Script executed:

cat -n benchmarks/benchmark_io_performance.py | sed -n '130,155p'

Repository: flixOpt/flixopt

Length of output: 1248


🏁 Script executed:

cat -n benchmarks/benchmark_io_performance.py | head -50

Repository: flixOpt/flixopt

Length of output: 1739


🏁 Script executed:

cat -n flixopt/solvers.py | head -100

Repository: flixOpt/flixopt

Length of output: 3688


🏁 Script executed:

rg "class.*Solver" flixopt/solvers.py -A 5

Repository: flixOpt/flixopt

Length of output: 596


🏁 Script executed:

rg "import.*Gurobi\|from.*gurobi" benchmarks/benchmark_io_performance.py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "def optimize" flixopt/flow_system.py -A 15

Repository: flixOpt/flixopt

Length of output: 589


🏁 Script executed:

rg "GurobiSolver\|gurobipy" flixopt/ --type py -l

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

cat -n flixopt/flow_system.py | grep -A 20 "def optimize"

Repository: flixOpt/flixopt

Length of output: 926


🏁 Script executed:

rg "class OptimizeAccessor" flixopt/ --type py -A 30

Repository: flixOpt/flixopt

Length of output: 1948


🏁 Script executed:

rg "def __call__" flixopt/ --type py -B 5 -A 20 | grep -A 20 "OptimizeAccessor"

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "gurobipy\|import Gurobi" flixopt/ --type py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

python3 << 'EOF'
# Check what happens when GurobiSolver() is instantiated
# by looking at the solver class definition
import sys
sys.path.insert(0, '/tmp/repo')

try:
    from flixopt.solvers import GurobiSolver
    # Try to create an instance (this will fail if gurobipy is not installed)
    solver = GurobiSolver()
    print("SUCCESS: GurobiSolver instantiated without error")
    print(f"Solver: {solver}")
except ImportError as e:
    print(f"ImportError when creating GurobiSolver: {e}")
except Exception as e:
    print(f"Exception when creating GurobiSolver ({type(e).__name__}): {e}")
EOF

Repository: flixOpt/flixopt

Length of output: 130


🏁 Script executed:

rg "gurobipy\|import.*Gurobi" . --type py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

cat -n flixopt/optimize_accessor.py | head -100

Repository: flixOpt/flixopt

Length of output: 3855


🏁 Script executed:

rg "def __call__" flixopt/optimize_accessor.py -A 50 | head -80

Repository: flixOpt/flixopt

Length of output: 1869


🏁 Script executed:

rg "def solve" flixopt/flow_system.py -A 40 | head -80

Repository: flixOpt/flixopt

Length of output: 1504


🏁 Script executed:

rg "def solve" flixopt/ --type py -l

Repository: flixOpt/flixopt

Length of output: 105


🏁 Script executed:

cat -n flixopt/flow_system.py | grep -B 5 -A 40 "def solve"

Repository: flixOpt/flixopt

Length of output: 1882


🏁 Script executed:

rg "def do_modeling\|model.solve" flixopt/ --type py -B 2 -A 15 | head -120

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "gurobipy\|import.*gurobi" . -i --type py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

cat -n flixopt/optimization.py | head -200

Repository: flixOpt/flixopt

Length of output: 8086


🏁 Script executed:

rg "solver_name.*gurobi\|if.*solver_name" flixopt/ --type py -B 3 -A 5

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

find . -name "structure.py" -o -name "*model*.py" | grep flixopt | head -5

Repository: flixOpt/flixopt

Length of output: 103


🏁 Script executed:

cat -n flixopt/structure.py | grep -A 50 "def solve\|solver_name" | head -100

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

wc -l flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 84


🏁 Script executed:

rg "def solve" flixopt/structure.py -A 50

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "GurobiSolver\|gurobi\|solver_name" flixopt/structure.py -i -B 2 -A 5

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "def solve" flixopt/ --type py -l

Repository: flixOpt/flixopt

Length of output: 105


🏁 Script executed:

rg "solver_name\|GurobiSolver" flixopt/ --type py -B 2 -A 10

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

ls -la flixopt/ | grep -E "\.py$"

Repository: flixOpt/flixopt

Length of output: 1912


🏁 Script executed:

grep -n "def solve" flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

cat -n flixopt/structure.py | sed -n '1,100p'

Repository: flixOpt/flixopt

Length of output: 3920


🏁 Script executed:

rg "class FlowSystemModel" flixopt/ --type py -A 100 | head -150

Repository: flixOpt/flixopt

Length of output: 6912


🏁 Script executed:

rg "\.solve\(" flixopt/structure.py -B 3 -A 10

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

python3 << 'EOF'
# Check linopy's solve method signature
import inspect
try:
    import linopy
    print("Linopy version:", linopy.__version__)
    print("\nLinopy Model.solve signature:")
    print(inspect.signature(linopy.Model.solve))
except Exception as e:
    print(f"Error: {e}")
EOF

Repository: flixOpt/flixopt

Length of output: 90


🏁 Script executed:

cat benchmarks/benchmark_io_performance.py | head -20

Repository: flixOpt/flixopt

Length of output: 488


🏁 Script executed:

cat -n benchmarks/benchmark_io_performance.py | sed -n '1,20p'

Repository: flixOpt/flixopt

Length of output: 628


🏁 Script executed:

python3 << 'EOF'
# Check if GurobiSolver() instantiation would fail without gurobipy
import sys
sys.path.insert(0, '/tmp/repo')

# The GurobiSolver class in solvers.py is just a dataclass
# Let's verify it doesn't import gurobipy at class definition or instantiation
from flixopt.solvers import GurobiSolver

try:
    solver = GurobiSolver()
    print("SUCCESS: GurobiSolver instantiated without importing gurobipy")
    print(f"Solver name: {solver.name}")
    print(f"Solver options: {solver.options}")
except Exception as e:
    print(f"FAILED: {type(e).__name__}: {e}")
EOF

Repository: flixOpt/flixopt

Length of output: 334


🏁 Script executed:

rg "def __call__" flixopt/optimize_accessor.py -A 60 | tail -50

Repository: flixOpt/flixopt

Length of output: 1799


🏁 Script executed:

cat -n flixopt/solvers.py

Repository: flixOpt/flixopt

Length of output: 3688


🏁 Script executed:

grep -r "gurobipy\|gurobi" benchmarks/ --include="*.py"

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "class GurobiSolver" flixopt/solvers.py -A 25

Repository: flixOpt/flixopt

Length of output: 788


🏁 Script executed:

rg "linopy.Model.solve\|xarray" flixopt/structure.py | head -20

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

cat -n flixopt/flow_system.py | sed -n '1402,1433p'

Repository: flixOpt/flixopt

Length of output: 1417


🏁 Script executed:

grep -n "model.solve" flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 89


🏁 Script executed:

cat -n flixopt/flow_system.py | sed -n '1429,1470p'

Repository: flixOpt/flixopt

Length of output: 1999


🏁 Script executed:

rg "solver_name.*gurobi\|HighsSolver\|HiGHS" flixopt/flow_system.py -i -B 2 -A 5

Repository: flixOpt/flixopt

Length of output: 41


Benchmark requires Gurobi; consider guarding the solve call or using a fallback solver.

The benchmark will fail without Gurobi, but not at GurobiSolver() instantiation—the error occurs later in model.solve() when linopy tries to invoke the actual solver. Wrapping the instantiation won't prevent the failure. Instead, either:

  • Guard the fs_clustered.optimize() call with exception handling for the actual solve step, or
  • Use HighsSolver() as a fallback when Gurobi is unavailable, or
  • Document Gurobi as a hard requirement.
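
A minimal sketch combining the first two options, building on the snippet quoted above (HighsSolver is assumed to be available under fx.solvers, as referenced in this review):

try:
    fs_clustered.optimize(fx.solvers.GurobiSolver())
except Exception as exc:  # a missing/unlicensed Gurobi surfaces here, at solve time
    print(f'Gurobi not usable ({exc!r}); falling back to HiGHS')
    fs_clustered.optimize(fx.solvers.HighsSolver())
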
🤖 Prompt for AI Agents
In `@benchmarks/benchmark_io_performance.py` around lines 138 - 146, The benchmark
currently calls fs_clustered.optimize(fx.solvers.GurobiSolver()) which will fail
at solve time if Gurobi is not available; change this to detect/handle solver
availability by wrapping the optimize call in a try/except that catches solver
runtime errors and falls back to fx.solvers.HighsSolver(), or programmatically
choose the solver before calling fs_clustered.optimize by attempting to use
fx.solvers.GurobiSolver() and on failure selecting fx.solvers.HighsSolver();
update the call sites referenced (create_large_flow_system,
fs.transform.cluster, fs_clustered.optimize, fx.solvers.GurobiSolver) so the
benchmark either falls back to HighsSolver or logs a clear error and exits.

Comment on lines 174 to 175
"from tsam.config import ExtremeConfig\n",
"\n",
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

tsam python package ExtremeConfig class configuration module

💡 Result:

There is no class named ExtremeConfig in the tsam package. Extreme-period handling is part of the TimeSeriesAggregation module — configure it with the extremePeriodMethod parameter (options: "None", "append", "new_cluster_center", "replace_cluster_center") and the addPeakMin / addPeakMax / addMeanMin / addMeanMax arguments to specify which series produce extreme periods. See the TimeSeriesAggregation docs and source for the implementation. [1][2]

Sources:

  • TimeSeriesAggregation docs (parameters & extremePeriodMethod). [1]
  • timeseriesaggregation source code ( _addExtremePeriods implementation ). [2]

Remove the ExtremeConfig import; it does not exist in the tsam package.

The class ExtremeConfig is not available in tsam.config. Extreme-period handling in tsam is configured via the TimeSeriesAggregation module using the extremePeriodMethod parameter (with options: "None", "append", "new_cluster_center", "replace_cluster_center") and arguments like addPeakMin, addPeakMax, addMeanMin, addMeanMax. The current import and API usage (extremes=ExtremeConfig(...)) will fail at runtime.
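
For reference, a hedged sketch of the TimeSeriesAggregation-style configuration that this query result describes (file name, column name, and period counts are illustrative):

import pandas as pd
import tsam.timeseriesaggregation as tsam

df = pd.read_csv('profiles.csv', index_col=0, parse_dates=True)  # illustrative input profiles

aggregation = tsam.TimeSeriesAggregation(
    timeSeries=df,
    noTypicalPeriods=8,
    hoursPerPeriod=24,
    extremePeriodMethod='new_cluster_center',  # or 'None', 'append', 'replace_cluster_center'
    addPeakMax=['heat_demand'],                # series whose maxima should force extreme periods
)
typical_periods = aggregation.createTypicalPeriods()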

🤖 Prompt for AI Agents
In `@docs/notebooks/08c2-clustering-storage-modes.ipynb` around lines 174 - 175,
Remove the invalid import "from tsam.config import ExtremeConfig" and replace
any usage of ExtremeConfig (e.g., extremes=ExtremeConfig(...)) with the
TimeSeriesAggregation API by setting the extremePeriodMethod parameter (one of
"None", "append", "new_cluster_center", "replace_cluster_center") and the
corresponding flags like addPeakMin, addPeakMax, addMeanMin, addMeanMax on the
TimeSeriesAggregation instance; ensure any code that passed
extremes=ExtremeConfig(...) is updated to configure extreme handling via
TimeSeriesAggregation.extremePeriodMethod and the addPeak*/addMean* arguments
instead.

Comment on lines 283 to 285
"print('\\nCluster assignments per day (period=2024, scenario=High):')\n",
"for i, cluster_id in enumerate(cluster_assignments):\n",
" print(f' {day_names[i]}: Cluster {cluster_id}')\n",
Contributor


⚠️ Potential issue | 🟡 Minor

Avoid hardcoded period/scenario labels in output.
You index by period=0, scenario=0 but print “period=2024, scenario=High”. Consider deriving labels from the coords to avoid mismatches when inputs change.
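
A hedged sketch of deriving the labels, assuming cluster_assignments is an xarray DataArray that still carries scalar period/scenario coords from the earlier selection (if those coords were dropped, they would need to be captured at selection time instead):

period_label = cluster_assignments.coords['period'].item()
scenario_label = cluster_assignments.coords['scenario'].item()
print(f'\nCluster assignments per day (period={period_label}, scenario={scenario_label}):')
for i, cluster_id in enumerate(cluster_assignments.values):
    print(f'  {day_names[i]}: Cluster {int(cluster_id)}')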

🤖 Prompt for AI Agents
In `@docs/notebooks/08d-clustering-multiperiod.ipynb` around lines 283 - 285, The
printed header uses hardcoded "period=2024, scenario=High" which can diverge
from the indices used; retrieve the actual labels from the dataset/coords used
to produce cluster_assignments (e.g., the period and scenario coordinate arrays
referenced when you computed cluster_assignments) and use those values in the
header instead of fixed text, then keep the existing loop over
cluster_assignments and day_names to print each day's cluster; reference the
variables/objects cluster_assignments and day_names and the same period/scenario
coord variables you used earlier to select the slice so the header is always
derived from the data.

Comment on lines 1507 to 1512
# relative_loss_per_hour is per-hour, so we need total hours per cluster
# Use sum over time to handle both regular and segmented systems
# Keep as DataArray to respect per-period/scenario values
rel_loss = _scalar_safe_reduce(self.element.relative_loss_per_hour, 'time', 'mean')
hours_per_cluster = timesteps_per_cluster * _scalar_safe_reduce(self._model.timestep_duration, 'time', 'mean')
hours_per_cluster = _scalar_safe_reduce(self._model.timestep_duration, 'time', 'mean')
decay_n = (1 - rel_loss) ** hours_per_cluster
Contributor


⚠️ Potential issue | 🟠 Major

Fix decay factor to use total hours per cluster.

The decay currently uses the mean timestep duration, which underestimates self‑discharge across a full cluster period. This skews SOC boundary linking. Use a summed duration (or otherwise compute total hours).

🔧 Proposed fix
-        hours_per_cluster = _scalar_safe_reduce(self._model.timestep_duration, 'time', 'mean')
+        hours_per_cluster = _scalar_safe_reduce(self._model.timestep_duration, 'time', 'sum')
🤖 Prompt for AI Agents
In `@flixopt/components.py` around lines 1507 - 1512, The decay calculation uses
mean timestep duration which underestimates total hours per cluster; update the
reduction on self._model.timestep_duration (used when computing decay_n) to
compute the total hours for the cluster (e.g., use
_scalar_safe_reduce(self._model.timestep_duration, 'time', 'sum') or equivalent)
so decay_n = (1 - self.element.relative_loss_per_hour) **
total_hours_per_cluster correctly reflects summed duration while preserving
DataArray semantics and usage of _scalar_safe_reduce and relative_loss_per_hour.

Comment on lines 210 to 218
# Use provided timestep_duration if given (for segmented systems), otherwise use computed value
# For RangeIndex (segmented systems), computed_timestep_duration is None
if timestep_duration is not None:
self.timestep_duration = timestep_duration
elif computed_timestep_duration is not None:
self.timestep_duration = self.fit_to_model_coords('timestep_duration', computed_timestep_duration)
else:
self.timestep_duration = None

Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and read the flow_system.py file to see the actual code
fd flow_system.py

Repository: flixOpt/flixopt

Length of output: 81


🏁 Script executed:

# Read the flow_system.py file around lines 210-218
wc -l flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 86


🏁 Script executed:

# Read lines around 210-218 to see the actual code
sed -n '200,225p' flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 1464


🏁 Script executed:

# Look for the _compute_time_metadata method to understand when computed_timestep_duration is None
sed -n '1,2604p' flixopt/flow_system.py | grep -n "_compute_time_metadata" | head -5

Repository: flixOpt/flixopt

Length of output: 327


🏁 Script executed:

# Read the _compute_time_metadata method
sed -n '433,500p' flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 3265


🏁 Script executed:

# Search for weights and temporal_weight properties/methods to see how they use timestep_duration
rg -n "temporal_weight|def weights" flixopt/flow_system.py | head -20

Repository: flixOpt/flixopt

Length of output: 239


🏁 Script executed:

# Read the temporal_weight property
sed -n '2075,2110p' flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 1451


🏁 Script executed:

# Read the weights property
sed -n '2209,2250p' flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 1931


🏁 Script executed:

# Check if timestep_duration being None causes any issues in weights/temporal_weight
# Let's look at the calculate_timestep_duration method
rg -n "calculate_timestep_duration" flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 180


🏁 Script executed:

# Read the calculate_timestep_duration method
sed -n '361,400p' flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 1832


🏁 Script executed:

# Let's check what happens when timestep_duration is None in the weights dict
# Search for any tests or usage of RangeIndex systems
rg -n "RangeIndex" flixopt/flow_system.py | head -15

Repository: flixOpt/flixopt

Length of output: 1116


🏁 Script executed:

# Check for test files that use RangeIndex to understand expected behavior
fd "test.*flow_system" -o "*test*flow*"

Repository: flixOpt/flixopt

Length of output: 200


🏁 Script executed:

# Search for test files related to flow_system
fd "test.*flow_system" tests/ 2>/dev/null || fd "flow_system" tests/ 2>/dev/null | head -10

Repository: flixOpt/flixopt

Length of output: 173


🏁 Script executed:

# Search for RangeIndex usage in test files
rg -n "RangeIndex" tests/ | head -20

Repository: flixOpt/flixopt

Length of output: 362


🏁 Script executed:

# Read the test file to understand RangeIndex usage
sed -n '1010,1100p' tests/test_cluster_reduce_expand.py

Repository: flixOpt/flixopt

Length of output: 3745


🏁 Script executed:

# Let's understand better - search for where timestep_duration might be set for RangeIndex systems
# Check if there are methods that set it after initialization
rg -n "self.timestep_duration =" flixopt/flow_system.py | head -20

Repository: flixOpt/flixopt

Length of output: 278


Add validation for RangeIndex systems requiring explicit timestep_duration.

When timesteps is a RangeIndex (segmented system) and timestep_duration is not provided, self.timestep_duration becomes None. This breaks downstream uses like weights['time'] and temporal_weight. Add early validation with a clear error message.

🐛 Proposed fix
-        # Use provided timestep_duration if given (for segmented systems), otherwise use computed value
+        # Segmented systems (RangeIndex) must provide timestep_duration explicitly
+        if isinstance(self.timesteps, pd.RangeIndex) and timestep_duration is None:
+            raise ValueError(
+                'timestep_duration must be provided when timesteps is a RangeIndex (segmented system).'
+            )
+        # Use provided timestep_duration if given (for segmented systems), otherwise use computed value
         # For RangeIndex (segmented systems), computed_timestep_duration is None
         if timestep_duration is not None:
             self.timestep_duration = timestep_duration
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Segmented systems (RangeIndex) must provide timestep_duration explicitly
if isinstance(self.timesteps, pd.RangeIndex) and timestep_duration is None:
    raise ValueError(
        'timestep_duration must be provided when timesteps is a RangeIndex (segmented system).'
    )
# Use provided timestep_duration if given (for segmented systems), otherwise use computed value
# For RangeIndex (segmented systems), computed_timestep_duration is None
if timestep_duration is not None:
    self.timestep_duration = timestep_duration
elif computed_timestep_duration is not None:
    self.timestep_duration = self.fit_to_model_coords('timestep_duration', computed_timestep_duration)
else:
    self.timestep_duration = None
🤖 Prompt for AI Agents
In `@flixopt/flow_system.py` around lines 210 - 218, The code currently allows
self.timestep_duration to be None when timesteps is a RangeIndex (segmented
system), which breaks downstream logic (e.g., weights['time'], temporal_weight);
add an early validation in the constructor or initializer where
timestep_duration and computed_timestep_duration are handled: detect the
segmented case (timesteps is a RangeIndex or computed_timestep_duration is None)
and raise a clear ValueError if timestep_duration is not provided, mentioning
that RangeIndex requires an explicit timestep_duration; keep existing logic for
assigning self.timestep_duration (including using
fit_to_model_coords('timestep_duration', computed_timestep_duration) when
available) but enforce this guard before downstream computation.
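As background on why the duration cannot be inferred in this case: a DatetimeIndex carries timestamps whose differences yield durations, whereas a RangeIndex carries only integer positions. A small standalone illustration (plain pandas/numpy, not flixopt code):

import numpy as np
import pandas as pd

dt_index = pd.date_range('2024-01-01', periods=4, freq='h')
print(np.diff(dt_index))      # timedeltas -> durations can be computed from the index

seg_index = pd.RangeIndex(4)  # segmented system: just positions 0..3
print(np.diff(seg_index))     # [1 1 1] -> dimensionless steps, no duration information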

Comment on lines 1587 to 1603
def _fast_get_dataarray(ds: xr.Dataset, name: str, coord_cache: dict[str, xr.DataArray]) -> xr.DataArray:
    """Construct DataArray from Variable without slow coordinate inference.

    This bypasses the slow _construct_dataarray method (~1.5ms -> ~0.1ms per var).

    Args:
        ds: Source dataset
        name: Variable name
        coord_cache: Pre-cached coordinate DataArrays

    Returns:
        Constructed DataArray
    """
    variable = ds.variables[name]
    coords = {k: coord_cache[k] for k in variable.dims if k in coord_cache}
    return xr.DataArray(variable, coords=coords, name=name)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file structure and context around the function
cd flixopt && wc -l io.py && head -50 io.py

Repository: flixOpt/flixopt

Length of output: 1413


🏁 Script executed:

# Find the _fast_get_dataarray function and surrounding context
rg -n "_fast_get_dataarray" flixopt/

Repository: flixOpt/flixopt

Length of output: 312


🏁 Script executed:

# Also search for _construct_dataarray to understand the slow path
rg -n "_construct_dataarray" flixopt/

Repository: flixOpt/flixopt

Length of output: 821


🏁 Script executed:

# Search for test files related to IO operations
fd -e "test_*.py" -o -e "*_test.py" | grep -i io

Repository: flixOpt/flixopt

Length of output: 175


🏁 Script executed:

# Let's read the actual function and surrounding context
sed -n '1530,1620p' flixopt/io.py

Repository: flixOpt/flixopt

Length of output: 3853


🏁 Script executed:

# Search for test files
find . -type f -name "test*.py" | head -20

Repository: flixOpt/flixopt

Length of output: 835


🏁 Script executed:

# Look for where coord_cache is created
rg -n "coord_cache" flixopt/io.py -B 2 -A 2

Repository: flixOpt/flixopt

Length of output: 1043


🏁 Script executed:

# Search for uses of auxiliary coordinates or non-dimension coordinates
rg -i "auxiliary|non.*dimension|coords\[" flixopt/io.py | head -20

Repository: flixOpt/flixopt

Length of output: 439


🏁 Script executed:

# Look for test files that test round-trip behavior
find . -name "*.py" -path "*/test*" -type f | xargs grep -l "from_dataset\|to_dataset" | head -5

Repository: flixOpt/flixopt

Length of output: 229


🏁 Script executed:

# Check if there are any references to xarray's behavior with auxiliary coords
rg "DataArray.*coords" flixopt/io.py | head -10

Repository: flixOpt/flixopt

Length of output: 122


🏁 Script executed:

# Search for how variables are written to understand what coords are preserved
sed -n '1800,1900p' flixopt/io.py

Repository: flixOpt/flixopt

Length of output: 3645


🏁 Script executed:

# Look at the main IO test file to understand what's tested
head -100 tests/test_io.py

Repository: flixOpt/flixopt

Length of output: 2842


🏁 Script executed:

# Search for any auxiliary coordinates being set in the codebase
rg "assign_coords|coords\[" --type py | grep -v test | head -20

Repository: flixOpt/flixopt

Length of output: 2031


🏁 Script executed:

# Check if there's any xarray usage pattern that might add auxiliary coords
rg "xr\.DataArray\|xr\.Dataset" flixopt/io.py -A 1 | head -30

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Look for where datasets are created initially (to_dataset method)
sed -n '1680,1750p' flixopt/io.py

Repository: flixOpt/flixopt

Length of output: 3000


🏁 Script executed:

# Let's examine the test more carefully to see what's being tested
cat tests/test_io.py | head -60

Repository: flixOpt/flixopt

Length of output: 1440


🏁 Script executed:

# Check what flow_system_base fixture includes and if it creates auxiliary coords
grep -n "flow_system_base\|@pytest.fixture" tests/conftest.py | head -30

Repository: flixOpt/flixopt

Length of output: 667


🏁 Script executed:

# Search for where DataArrays are created with auxiliary coordinates in to_dataset
sed -n '1400,1550p' flixopt/io.py | grep -A 5 -B 5 "DataArray\|xr\."

Repository: flixOpt/flixopt

Length of output: 906


🏁 Script executed:

# Check if xarray's construction of DataArray preserves all coords or just dimension coords
python3 << 'EOF'
import xarray as xr
import numpy as np

# Test how xarray handles auxiliary coordinates
ds = xr.Dataset({
    'data': (['x', 'y'], np.random.rand(3, 4))
})

# Add a dimension coordinate
ds = ds.assign_coords({'x': np.arange(3), 'y': np.arange(4)})

# Add an auxiliary coordinate with dims as a subset
ds = ds.assign_coords({'aux': (('x',), np.arange(3))})

print("Original dataset coords:", list(ds.coords))
print("Original dataset coord names and dims:")
for name, coord in ds.coords.items():
    print(f"  {name}: dims={coord.dims}")

# Now simulate what _fast_get_dataarray does
var = ds.variables['data']
print("\nVariable 'data' dims:", var.dims)

# Method 1: Only dimension coords (current implementation)
coords1 = {k: ds.coords[k] for k in var.dims if k in ds.coords}
print("\nCoords selected (current method):", list(coords1.keys()))

# Method 2: All coords (proposed fix)
coords2 = {k: c for k, c in ds.coords.items() if set(c.dims).issubset(var.dims)}
print("Coords selected (proposed method):", list(coords2.keys()))

# Check which coords are lost
lost_coords = set(ds.coords.keys()) - set(coords1.keys())
print("\nLost coords with current method:", lost_coords)
EOF

Repository: flixOpt/flixopt

Length of output: 177


🏁 Script executed:

# Install xarray and test coordinate handling
pip install xarray numpy pandas -q && python3 << 'EOF'
import xarray as xr
import numpy as np

# Test how xarray handles auxiliary coordinates
ds = xr.Dataset({
    'data': (['x', 'y'], np.random.rand(3, 4))
})

# Add dimension coordinates
ds = ds.assign_coords({'x': np.arange(3), 'y': np.arange(4)})

# Add an auxiliary coordinate with dims as a subset
ds = ds.assign_coords({'aux': (('x',), np.arange(3))})

print("Original dataset coords:", list(ds.coords))
print("Original dataset coord names and dims:")
for name, coord in ds.coords.items():
    print(f"  {name}: dims={coord.dims}")

# Now simulate what _fast_get_dataarray does
var = ds.variables['data']
print("\nVariable 'data' dims:", var.dims)

# Method 1: Only dimension coords (current implementation)
coords1 = {k: ds.coords[k] for k in var.dims if k in ds.coords}
print("\nCoords selected (current method):", list(coords1.keys()))

# Method 2: All coords (proposed fix)
coords2 = {k: c for k, c in ds.coords.items() if set(c.dims).issubset(var.dims)}
print("Coords selected (proposed method):", list(coords2.keys()))

# Check which coords are lost
lost_coords = set(ds.coords.keys()) - set(coords1.keys())
print("\nLost coords with current method:", lost_coords)
EOF

Repository: flixOpt/flixopt

Length of output: 613


🏁 Script executed:

# Now let's check if flixopt actually creates auxiliary coordinates anywhere
rg "assign_coords\(" flixopt/ --type py | grep -v test | grep -v "\.pyc"

Repository: flixOpt/flixopt

Length of output: 2851


🏁 Script executed:

# Check the structure.py file where auxiliary coords might be used
rg -n "assign_coords" flixopt/structure.py -B 2 -A 2

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Check if the test actually verifies coordinate preservation
rg -n "test_flow_system_io" tests/test_io.py -A 10

Repository: flixOpt/flixopt

Length of output: 354


🏁 Script executed:

# Look at the FlowSystem equality comparison to understand what's checked
rg -n "__eq__" flixopt/flow_system.py -A 15 | head -50

Repository: flixOpt/flixopt

Length of output: 692


🏁 Script executed:

# Search for where auxiliary coordinates are actually added to datasets in serialization
rg -n "to_dataset" flixopt/flow_system.py -B 2 -A 20 | head -100

Repository: flixOpt/flixopt

Length of output: 4885


🏁 Script executed:

# Check if any of the datasets created by to_dataset() actually have auxiliary coords
sed -n '1400,1650p' flixopt/io.py

Repository: flixOpt/flixopt

Length of output: 9490


🏁 Script executed:

# Let's check if the to_dataset method actually creates auxiliary coordinates
rg -n "assign_coords" flixopt/flow_system.py -B 3 -A 3

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Check what Structure.to_dataset does (parent class)
rg -n "class FlowSystem" flixopt/flow_system.py | head -3

Repository: flixOpt/flixopt

Length of output: 238


🏁 Script executed:

# Find the parent class to_dataset method
rg -n "def to_dataset" flixopt/ --type py | grep -v test | grep -v ".pyc"

Repository: flixOpt/flixopt

Length of output: 297


🏁 Script executed:

# Let's run the actual test to see if it passes
cd /repo && python -m pytest tests/test_io.py::test_flow_system_io -v 2>&1 | head -100

Repository: flixOpt/flixopt

Length of output: 114


🏁 Script executed:

# Check Structure.to_dataset (parent class)
sed -n '1038,1120p' flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 3615


🏁 Script executed:

# Let's verify if datasets created by to_dataset() have auxiliary coords
# by running Python code to test this
python3 << 'EOF'
import sys
sys.path.insert(0, '/repo')

# Try to import and test
try:
    import flixopt as fx
    
    # Create a simple flow system
    fs = fx.FlowSystem(
        timesteps=fx.utils.get_times('2020', '2020', freq='d')[0:5],
    )
    
    # Convert to dataset
    ds = fs.to_dataset()
    
    print("Dataset coordinates:")
    for name, coord in ds.coords.items():
        print(f"  {name}: dims={coord.dims}, shape={coord.shape}")
    
    print("\nDataset variables:")
    for name, var in ds.data_vars.items():
        print(f"  {name}: dims={var.dims}, shape={var.shape}")
    
    # Check for auxiliary coords
    print("\nAuxiliary coordinates (non-dimension coords):")
    aux_coords = [name for name, coord in ds.coords.items() 
                  if name not in ds.dims]
    print(f"  {aux_coords if aux_coords else 'None found'}")
    
except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()
EOF

Repository: flixOpt/flixopt

Length of output: 390


🏁 Script executed:

# Let's also check the structure.py to_dataset implementation
sed -n '1038,1150p' flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 4616


🏁 Script executed:

# Let's check where auxiliary coords are actually added and if they're part of the serialization flow
rg -n "assign_coords" flixopt/statistics_accessor.py -B 5 -A 5 | head -40

Repository: flixOpt/flixopt

Length of output: 1032


🏁 Script executed:

# Check if these auxiliary coords (timeframe, timestep) are ever serialized to dataset
rg -n "to_dataset\|to_netcdf" flixopt/statistics_accessor.py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Look at the actual line in Structure.from_dataset to see the pattern
sed -n '1080,1100p' flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 810


🏁 Script executed:

# Critical: Check if there's a difference between how to_dataset and from_dataset handle coordinates
# Let's compare them side by side
echo "=== TO_DATASET ===" && sed -n '1038,1055p' flixopt/structure.py && echo -e "\n=== FROM_DATASET ===" && sed -n '1090,1110p' flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 1767


🏁 Script executed:

# Let's directly check if datasets created by to_dataset actually have auxiliary coords
# by looking at what _create_reference_structure does
rg -n "_create_reference_structure" flixopt/structure.py -A 30 | head -80

Repository: flixOpt/flixopt

Length of output: 4114


🏁 Script executed:

# More importantly, let's see if the datasets that go through from_dataset/to_dataset 
# ever have auxiliary coordinates to begin with
rg -n "coord_cache\|assign_coords" flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Critical: Check if from_dataset in Structure uses the same pattern as _fast_get_dataarray
sed -n '1090,1115p' flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 975


🏁 Script executed:

# Key insight: Let's verify that xr.Dataset(dict_of_dataarrays) loses auxiliary coords
python3 << 'EOF'
import xarray as xr
import numpy as np

# Create DataArrays with auxiliary coordinates
data_da = xr.DataArray(
    np.random.rand(3, 4),
    dims=['x', 'y'],
    coords={'x': np.arange(3), 'y': np.arange(4)},
    name='data'
)

# Add auxiliary coordinate to the DataArray
data_da = data_da.assign_coords({'aux': (('x',), np.arange(3))})

print("DataArray coords before Dataset creation:")
print(f"  {list(data_da.coords.keys())}")

# Now create a Dataset from this DataArray (similar to to_dataset)
ds = xr.Dataset({'data': data_da})

print("\nDataset coords after Dataset creation:")
print(f"  {list(ds.coords.keys())}")

print("\nConclusion: Auxiliary coords from DataArray are lost when creating Dataset")
EOF

Repository: flixOpt/flixopt

Length of output: 256


🏁 Script executed:

# Let's verify the actual flow: to_dataset -> from_dataset
# Check if xr.Dataset constructor drops auxiliary coords
python3 << 'EOF'
import xarray as xr
import numpy as np

# Simulate what to_dataset does
da1 = xr.DataArray(np.random.rand(3, 4), dims=['x', 'y'], name='var1')
da1 = da1.assign_coords({'x': np.arange(3), 'y': np.arange(4)})
da1 = da1.assign_coords({'aux': (('x',), ['a', 'b', 'c'])})

da2 = xr.DataArray(np.random.rand(3), dims=['x'], name='var2')
da2 = da2.assign_coords({'x': np.arange(3)})

# What to_dataset does: creates Dataset from dict
extracted = {'var1': da1, 'var2': da2}
ds = xr.Dataset(extracted)

print("DataArray var1 coords:", list(da1.coords.keys()))
print("DataArray var2 coords:", list(da2.coords.keys()))
print("Dataset coords:", list(ds.coords.keys()))
print("\nAuxiliary 'aux' coord preserved in Dataset?", 'aux' in ds.coords)
EOF

Repository: flixOpt/flixopt

Length of output: 212


🏁 Script executed:

# Let's check if the datasets actually get auxiliary coords added anywhere
# by searching for patterns that would add non-dimension coords to the dataset
rg "assign_coords\|Dataset.*coords" flixopt/io.py | grep -A 2 -B 2 "1[45][0-9]{2}"

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Most importantly, verify that the _restored_ dataset from from_dataset doesn't have auxiliary coords
# Let's check the exact flow in from_dataset
sed -n '1548,1610p' flixopt/io.py

Repository: flixOpt/flixopt

Length of output: 2606


🏁 Script executed:

# Final verification: test the actual serialization round-trip to see if auxiliary coords would be lost
python3 << 'EOF'
import xarray as xr
import numpy as np

# Simulate the to_dataset flow - create a dataset with an auxiliary coordinate
print("=== SIMULATION OF TO_DATASET ===")
# Create a DataArray with both dimension and auxiliary coordinates
da = xr.DataArray(
    np.random.rand(3, 4),
    dims=['x', 'y'],
    coords={'x': np.arange(3), 'y': np.arange(4)},
    name='data'
)
da = da.assign_coords({'aux': (('x',), ['a', 'b', 'c'])})

print(f"Original DataArray coords: {list(da.coords.keys())}")

# Create dataset (like to_dataset does)
ds = xr.Dataset({'data': da})
print(f"Dataset after creation: {list(ds.coords.keys())}")

print("\n=== SIMULATION OF _FAST_GET_DATAARRAY ===")
# Now simulate _fast_get_dataarray
variable = ds.variables['data']
coord_cache = {k: ds.coords[k] for k in ds.coords}

print(f"Variable dims: {variable.dims}")
print(f"Available coords in cache: {list(coord_cache.keys())}")

# Current implementation
coords_current = {k: coord_cache[k] for k in variable.dims if k in coord_cache}
print(f"Coords selected (current method): {list(coords_current.keys())}")

# Proposed implementation
coords_proposed = {k: c for k, c in coord_cache.items() if set(c.dims).issubset(variable.dims)}
print(f"Coords selected (proposed method): {list(coords_proposed.keys())}")

print(f"\nAuxiliary coord 'aux' lost with current method? {('aux' in ds.coords) and ('aux' not in coords_current)}")
EOF

Repository: flixOpt/flixopt

Length of output: 444


Preserve auxiliary coordinates in fast DataArray construction.

_fast_get_dataarray currently only attaches dimension coordinates (those with names matching variable dims). Auxiliary coordinates—those with dimensions that are a subset of the variable's dimensions—are discarded during reconstruction. This will cause data loss if such coordinates are present in serialized datasets.

The proposed fix correctly handles both dimension and auxiliary coordinates:

Recommended fix
-        coords = {k: coord_cache[k] for k in variable.dims if k in coord_cache}
+        coords = {
+            k: c for k, c in coord_cache.items()
+            if set(c.dims).issubset(variable.dims)
+        }

While the round-trip tests currently pass (suggesting auxiliary coordinates aren't presently used in the codebase), this change makes the implementation robust for future use cases and aligns with xarray's coordinate handling semantics.

🤖 Prompt for AI Agents
In `@flixopt/io.py` around lines 1587 - 1603, _fast_get_dataarray currently only
includes coordinates whose names match variable.dims, dropping auxiliary
coordinates whose dims are a subset of the variable's dims; update the coords
build in _fast_get_dataarray to include any coord_cache entry where coord.dims
is a subset of variable.dims (not just keys equal to dims), so auxiliary coords
are preserved when constructing the xr.DataArray (use variable.dims and each
coord.dims to decide inclusion and pass the resulting coords dict to
xr.DataArray).

Comment on lines 1379 to 1385
# Select data for clustering (may be subset if data_vars specified)
ds_slice_for_clustering = (
    ds_for_clustering.sel(**selector, drop=True) if selector else ds_for_clustering
)
temporaly_changing_ds_for_clustering = drop_constant_arrays(ds_slice_for_clustering, dim='time')
df_for_clustering = temporaly_changing_ds_for_clustering.to_dataframe()

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find flixopt -name "transform_accessor.py" -type f

Repository: flixOpt/flixopt

Length of output: 88


🏁 Script executed:

cat -n flixopt/transform_accessor.py | sed -n '1370,1400p'

Repository: flixOpt/flixopt

Length of output: 2058


🏁 Script executed:

grep -n "def drop_constant_arrays" flixopt/transform_accessor.py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "def drop_constant_arrays" --type py

Repository: flixOpt/flixopt

Length of output: 100


🏁 Script executed:

rg "def drop_constant_arrays" -A 15 flixopt/core.py

Repository: flixOpt/flixopt

Length of output: 736


🏁 Script executed:

rg "def drop_constant_arrays" -A 30 flixopt/core.py

Repository: flixOpt/flixopt

Length of output: 1269


🏁 Script executed:

rg "def drop_constant_arrays" -A 45 flixopt/core.py

Repository: flixOpt/flixopt

Length of output: 1660


🏁 Script executed:

cat -n flixopt/transform_accessor.py | sed -n '1400,1415p'

Repository: flixOpt/flixopt

Length of output: 968


🏁 Script executed:

rg "tsam.aggregate" -B 5 -A 5 flixopt/transform_accessor.py | head -40

Repository: flixOpt/flixopt

Length of output: 1388


🏁 Script executed:

cat -n flixopt/transform_accessor.py | sed -n '1378,1398p'

Repository: flixOpt/flixopt

Length of output: 1517


🏁 Script executed:

cat -n flixopt/transform_accessor.py | sed -n '1374,1425p'

Repository: flixOpt/flixopt

Length of output: 3386


🏁 Script executed:

cat -n flixopt/transform_accessor.py | sed -n '1350,1375p'

Repository: flixOpt/flixopt

Length of output: 1326


🏁 Script executed:

rg "ds_for_clustering\s*=" flixopt/transform_accessor.py | head -20

Repository: flixOpt/flixopt

Length of output: 258


🏁 Script executed:

cat -n flixopt/transform_accessor.py | sed -n '1320,1365p'

Repository: flixOpt/flixopt

Length of output: 2334


🏁 Script executed:

python3 << 'EOF'
import xarray as xr
import numpy as np

# Test if empty dataset's data_vars attribute can be checked with 'if not'
ds_empty = xr.Dataset()
ds_with_vars = xr.Dataset({'var1': (['time'], [1, 2, 3])})

print(f"Empty dataset data_vars: {ds_empty.data_vars}")
print(f"Empty dataset bool test: {bool(ds_empty.data_vars)}")
print(f"not empty dataset.data_vars: {not ds_empty.data_vars}")

print(f"\nDataset with vars data_vars: {ds_with_vars.data_vars}")
print(f"Dataset with vars bool test: {bool(ds_with_vars.data_vars)}")
print(f"not dataset_with_vars.data_vars: {not ds_with_vars.data_vars}")

# Test to_dataframe on empty dataset
try:
    df_empty = ds_empty.to_dataframe()
    print(f"\nEmpty dataset to_dataframe() result:\n{df_empty}")
    print(f"Shape: {df_empty.shape}")
except Exception as e:
    print(f"Error: {e}")
EOF

Repository: flixOpt/flixopt

Length of output: 380


🏁 Script executed:

python3 << 'EOF'
import pandas as pd
import numpy as np

# Test what happens with tsam.aggregate on empty DataFrame
# First check if tsam is available
try:
    import tsam
    print(f"tsam version: {tsam.__version__}")
    
    # Create an empty dataframe
    df_empty = pd.DataFrame()
    print(f"Empty DataFrame shape: {df_empty.shape}")
    print(f"Empty DataFrame:\n{df_empty}")
    
    # Try to call aggregate (this may fail)
    try:
        result = tsam.aggregate(df_empty, n_clusters=3)
    except Exception as e:
        print(f"\ntsam.aggregate() error with empty DataFrame:")
        print(f"Error type: {type(e).__name__}")
        print(f"Error message: {e}")
except ImportError:
    print("tsam not available in sandbox environment")
EOF

Repository: flixOpt/flixopt

Length of output: 100


Add a guard when clustering input becomes empty.

When drop_constant_arrays() removes all selected variables (especially with data_vars filtering), calling to_dataframe() on an empty Dataset fails with a cryptic indexing error. Add an explicit check with a clear error message.

🐛 Proposed fix
                temporaly_changing_ds_for_clustering = drop_constant_arrays(ds_slice_for_clustering, dim='time')
+                if not temporaly_changing_ds_for_clustering.data_vars:
+                    raise ValueError(
+                        'No time-varying data left for clustering after drop_constant_arrays(). '
+                        'Check data_vars or inspect transform.clustering_data().'
+                    )
                df_for_clustering = temporaly_changing_ds_for_clustering.to_dataframe()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Select data for clustering (may be subset if data_vars specified)
ds_slice_for_clustering = (
    ds_for_clustering.sel(**selector, drop=True) if selector else ds_for_clustering
)
temporaly_changing_ds_for_clustering = drop_constant_arrays(ds_slice_for_clustering, dim='time')
if not temporaly_changing_ds_for_clustering.data_vars:
    raise ValueError(
        'No time-varying data left for clustering after drop_constant_arrays(). '
        'Check data_vars or inspect transform.clustering_data().'
    )
df_for_clustering = temporaly_changing_ds_for_clustering.to_dataframe()
🤖 Prompt for AI Agents
In `@flixopt/transform_accessor.py` around lines 1379 - 1385, Add a guard to
detect when drop_constant_arrays removes all variables before calling
to_dataframe(): after computing temporaly_changing_ds_for_clustering =
drop_constant_arrays(ds_slice_for_clustering, dim='time') check whether
temporaly_changing_ds_for_clustering is empty (e.g., has no data_vars or no
variables) and raise a clear ValueError mentioning the input filtering
(data_vars/selector) and that there is no varying data to cluster, instead of
calling temporaly_changing_ds_for_clustering.to_dataframe() and allowing a
cryptic indexing error; update the code paths that assign df_for_clustering
accordingly so downstream logic only runs when df_for_clustering is valid.
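A minimal, self-contained sketch of the failure mode and the proposed guard, using plain xarray rather than the flixopt helpers (drop_constant_arrays is emulated by hand here):

import numpy as np
import xarray as xr

ds = xr.Dataset({'price': ('time', np.full(8, 42.0))})  # every variable is constant over time

# Emulate drop_constant_arrays: keep only variables that vary along 'time'
constant = [name for name, var in ds.data_vars.items() if np.all(var.values == var.values.flat[0])]
varying = ds.drop_vars(constant)

# Guard before handing the data to the clustering backend
if not varying.data_vars:
    raise ValueError(
        'No time-varying data left for clustering after dropping constant arrays. '
        'Check the selected data_vars.'
    )
df = varying.to_dataframe()  # only reached when there is something to cluster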

Comment on lines 67 to 68
"tsam @ git+https://github.com/FBumann/tsam.git@v3-rebased", # Time series aggregation (unreleased)
"scipy >= 1.15.1, < 2", # Used by tsam. Prior versions have conflict with highspy. See https://github.com/scipy/scipy/issues/22257

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if v3-rebased branch exists and get its current HEAD commit
curl -s "https://api.github.com/repos/FBumann/tsam/branches/v3-rebased" | jq '{branch: .name, sha: .commit.sha}' 2>/dev/null || echo "Branch check failed or jq not available"

# Also test basic GitHub API access
curl -s -I "https://api.github.com/repos/FBumann/tsam" | head -5

Repository: flixOpt/flixopt

Length of output: 266


🌐 Web query:

Can PyPI packages depend on git URLs in pyproject.toml

💡 Result:

Short answer: No — a package published on PyPI must not rely on git/VCS URLs in its declared install-time dependencies because pip will refuse to install non-PyPI URL dependencies transitively when installing a package from PyPI. Build-system (build-time) requirements in pyproject.toml can be URL/VCS specifiers, but runtime dependencies that point at git URLs are not supported for packages distributed via PyPI.

Relevant sources:

  • pip docs: pyproject.toml build-system.requires may contain URL specifiers, but runtime dependency resolution for packages from PyPI blocks arbitrary URL dependencies. [1]
  • Historical/issue discussion explaining pip blocks PEP 508 URL requirements for packages installed from PyPI. [2]
  • Ongoing PEP work (PEP 725 / DepURLs) proposes better ways to reference external dependencies in pyproject.toml, but it is a separate proposal and not a current guarantee that PyPI/pip will accept git URL runtime dependencies. [3]
  • Community Q&A and examples showing the same pip behavior and common workarounds (vendor, publish to PyPI, or install from a VCS directly). [4]

If you need a workable solution: publish the dependency to PyPI, vendor or bundle it, instruct users to install from a VCS editable/direct URL themselves (i.e., not via PyPI), or follow future PEPs (like PEP 725) if/when tooling implements them.


🌐 Web query:

pip install git URL dependency from pyproject.toml production

💡 Result:

Yes — pyproject.toml (PEP 621) supports VCS/direct-URL dependencies and pip can install them. Short recipe and production recommendations:

  1. Add a VCS direct reference in pyproject.toml (PEP 621 / setuptools syntax):

    • Example (pin to tag/commit):
      [project]
      dependencies = [
      "somepkg @ git+https://github.com/owner/somepkg.git@v1.2.3",
      "otherpkg @ git+ssh://[email protected]:owner/otherpkg.git@abcdef123456",
      ]
      (pip/PEP 508 direct URL form). (setuptools.pypa.io)
  2. Install with pip

  3. Production best practices

    • Pin VCS deps to an exact commit or annotated tag (avoid floating branch names) to ensure reproducible installs. (setuptools.pypa.io)
    • For private repos, use deploy-friendly auth (CI/CD deploy tokens, SSH keys, or tokenized HTTPS URLs) rather than interactive credentials. Test auth in your CI environment. (pythontutorials.net)
    • Consider converting pyproject deps into a requirements.txt for pip-based deployment (tools like pyp2req or generating a pip constraints file), so your deployment step can be pip install -r requirements.txt --no-deps / --constraint ... if you need deterministic installs. (pypi.org)

If you want, tell me:

  • whether you use setuptools/poetry/pdm, and
  • whether the VCS repo is public or private — I’ll give the exact pyproject snippet and CI-ready install command.

🏁 Script executed:

#!/bin/bash
# Check if this project is configured for PyPI release
cat pyproject.toml | head -30

# Check for PyPI upload workflows
find . -name "*.yml" -o -name "*.yaml" | xargs grep -l "publish\|pypi\|release" 2>/dev/null | head -5

Repository: flixOpt/flixopt

Length of output: 1378


🏁 Script executed:

#!/bin/bash
# Check git log to see context around the tsam dependency change
git log --oneline -20 -- pyproject.toml | head -10

# Check if v3-rebased ref is mentioned elsewhere
rg "v3-rebased" --type toml --type py --type txt

Repository: flixOpt/flixopt

Length of output: 329


🏁 Script executed:

#!/bin/bash
# Check the release workflow to see if it uploads to PyPI
cat .github/workflows/release.yaml | head -60

# Also check for any PyPI token secrets
grep -r "pypi" .github/workflows/ | head -10

Repository: flixOpt/flixopt

Length of output: 2194


🏁 Script executed:

#!/bin/bash
# Check pyproject.toml for extras documentation and other dependencies
sed -n '50,150p' pyproject.toml

Repository: flixOpt/flixopt

Length of output: 3450


This configuration will fail PyPI publication.

Using a git URL for tsam (line 67 in [full] extras and line 86 in [dev] extras) is incompatible with PyPI distribution. pip explicitly blocks git URL dependencies for packages installed from PyPI — the verify-pypi job in your release workflow will fail when attempting to install from https://pypi.org/simple/.

Pinning to a commit hash does not resolve this. Before release, you must either:

  1. Wait for tsam v3 to be published to PyPI and use a version specifier (tsam >= 3.0.0, < 4)
  2. Remove tsam from the [full] and [dev] extras and document that users requiring tsam must install from source
  3. Vendor or bundle tsam as part of flixopt

Verify the release workflow will pass with the current dependencies, or update the workflow to skip PyPI publication if this is intentional for development-only use.

🤖 Prompt for AI Agents
In `@pyproject.toml` around lines 67 - 68, The pyproject.toml currently lists a
git URL dependency "tsam @ git+https://github.com/FBumann/tsam.git@v3-rebased"
in the [full] and [dev] extras which will break PyPI publication; remove that
git-based requirement from both the [full] and [dev] extras and either (a)
replace it with a proper PyPI version specifier (e.g., "tsam >= 3.0.0, < 4")
once tsam is published, (b) omit tsam from those extras and add a note in your
README/INSTALL docs instructing users to install tsam from source, or (c)
vendor/include tsam in the package; finally, if this is intended only for
development, update the release workflow (the verify-pypi job) to skip PyPI
verification or otherwise allow dev-only dependencies so the release will pass.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@flixopt/clustering/base.py`:
- Around line 619-633: The zip call that builds the selector silently truncates
mismatches (zip(self._dim_names, key, strict=False)); change it to use
strict=True so any length mismatch between self._dim_names and key raises
immediately, e.g., replace the zip invocation used when constructing selector in
the loop over self._results.items() (where selector = dict(zip(self._dim_names,
key, ...))) with zip(..., strict=True) to fail fast on dimension mismatches and
ensure selectors are built only when dimensions align.
🧹 Nitpick comments (5)
flixopt/clustering/base.py (5)

578-591: Type conversion in deserialization only handles int.

The _str_to_key method only attempts to convert parts to int. If period values are floats (e.g., 2024.5) or other numeric types, they would be stored as strings after round-trip serialization.

Consider extending the conversion logic:

🔧 Suggested enhancement
     `@staticmethod`
     def _str_to_key(key_str: str, dim_names: list[str]) -> tuple:
         """Convert string back to key tuple."""
         if key_str == '__single__':
             return ()
         parts = key_str.split('|')
-        # Try to convert to int if possible (for period years)
+        # Try to convert to numeric types if possible (for period years/floats)
         result = []
         for part in parts:
             try:
                 result.append(int(part))
             except ValueError:
-                result.append(part)
+                try:
+                    result.append(float(part))
+                except ValueError:
+                    result.append(part)
         return tuple(result)

1206-1206: Default empty DatetimeIndex may cause issues downstream.

When original_timesteps is None, it defaults to an empty pd.DatetimeIndex([]). Methods like expand_data and build_expansion_divisor default to using self.original_timesteps, which could lead to confusing behavior with zero-length expansions.

Consider validating that original_timesteps is not empty when it's actually needed, or document this edge case more explicitly.


1242-1254: Property raises ValueError without checking all paths.

The results property raises ValueError if neither _results_cache nor _aggregation_results is available. However, the __init__ method allows creating a Clustering with results=None and _aggregation_results=None, which would always fail when accessing results.

Consider adding validation in __init__ to ensure at least one of these is provided, or document this as an invalid state.


1425-1434: Variable shadowing: variables parameter is overwritten.

The local assignment variables = ds.variables at line 1426 shadows the function parameter variables from line 1369. While this doesn't cause a bug (the parameter was already consumed at line 1410), it's confusing and could lead to issues if the code is refactored.

🔧 Suggested fix: rename local variable
         if kind == 'duration_curve':
             sorted_vars = {}
-            # Use variables for faster access (avoids _construct_dataarray overhead)
-            variables = ds.variables
+            # Use ds_variables for faster access (avoids _construct_dataarray overhead)
+            ds_variables = ds.variables
             rep_values = ds.coords['representation'].values
             rep_idx = {rep: i for i, rep in enumerate(rep_values)}
             for var in ds.data_vars:
-                data = variables[var].values
+                data = ds_variables[var].values
                 for rep in rep_values:
                     # Direct numpy indexing instead of .sel()
                     values = np.sort(data[rep_idx[rep]].flatten())[::-1]

1665-1669: Consider direct numpy indexing for cluster labels.

The list comprehension at line 1668 calls .sel(cluster=c).values for each cluster, creating intermediate DataArrays. For consistency with the PR's optimization pattern, consider using direct indexing:

🔧 Suggested optimization
         if occ_extra_dims:
             cluster_labels = [f'Cluster {c}' for c in range(n_clusters)]
         else:
+            occ_values = cluster_occurrences.values
             cluster_labels = [
-                f'Cluster {c} (×{int(cluster_occurrences.sel(cluster=c).values)})' for c in range(n_clusters)
+                f'Cluster {c} (×{int(occ_values[c])})' for c in range(n_clusters)
             ]

Comment on lines 619 to 633
for key, cr in self._results.items():
    # Build selector for this key
    selector = dict(zip(self._dim_names, key, strict=False))

    # Select the slice for this (period, scenario)
    data_slice = data.sel(**selector, drop=True) if selector else data

    # Drop constant arrays and convert to DataFrame
    time_varying = drop_constant_arrays(data_slice, dim='time')
    df = time_varying.to_dataframe()

    # Apply clustering
    results[key] = cr.apply(df)

return Clustering._from_aggregation_results(results, self._dim_names)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file structure and find the relevant class/method
cd flixopt/clustering && wc -l base.py

Repository: flixOpt/flixopt

Length of output: 71


🏁 Script executed:

# Read the context around line 619-633 to understand the method
sed -n '600,650p' flixopt/clustering/base.py

Repository: flixOpt/flixopt

Length of output: 2146


🏁 Script executed:

# Find where self._dim_names is defined/set
rg -n "_dim_names" flixopt/clustering/base.py

Repository: flixOpt/flixopt

Length of output: 948


🏁 Script executed:

# Find where self._results is populated
rg -n "_results" flixopt/clustering/base.py -A 2

Repository: flixOpt/flixopt

Length of output: 5616


🏁 Script executed:

# Search for zip() usage with strict parameter across the codebase
rg -n "zip.*strict" flixopt/

Repository: flixOpt/flixopt

Length of output: 2078


🏁 Script executed:

# Find where AggregationResults is instantiated (look for __init__ and calls to it)
sed -n '200,230p' flixopt/clustering/base.py

Repository: flixOpt/flixopt

Length of output: 1155


🏁 Script executed:

# Find where AggregationResults is created/instantiated from
rg -n "AggregationResults(" flixopt/ -A 3 | head -80

Repository: flixOpt/flixopt

Length of output: 147


🏁 Script executed:

# Look at the _from_aggregation_results classmethod that's used at the end of apply()
sed -n '1257,1290p' flixopt/clustering/base.py

Repository: flixOpt/flixopt

Length of output: 1474


🏁 Script executed:

# Check how keys are structured in _results - look for patterns where results dict is built
rg -n "self._results\[" flixopt/clustering/base.py | head -20

Repository: flixOpt/flixopt

Length of output: 314


🏁 Script executed:

# Look at how _results dict is populated - find code that builds it
rg -n "results\[.*\].*=" flixopt/clustering/base.py | head -20

Repository: flixOpt/flixopt

Length of output: 285


🏁 Script executed:

# Search for places where AggregationResults is instantiated to see the pattern of key/dim_names
rg -n "ClusteringResults\(" flixopt/ -B 3 -A 3 | head -100

Repository: flixOpt/flixopt

Length of output: 2835


🏁 Script executed:

# Check the overall structure by looking at how keys are constructed
sed -n '535,575p' flixopt/clustering/base.py

Repository: flixOpt/flixopt

Length of output: 1831


🏁 Script executed:

# Look at where the apply method is called from to understand what _results contains
rg -n "\.apply\(" flixopt/clustering/base.py -B 2 -A 2

Repository: flixOpt/flixopt

Length of output: 874


🏁 Script executed:

# Look at git history to understand when strict=False was introduced
git log --oneline -p flixopt/clustering/base.py | grep -A 10 -B 5 "strict=False" | head -50

Repository: flixOpt/flixopt

Length of output: 690


🏁 Script executed:

# Check if there are any tests or validation that verify key/dim_names matching
rg -n "dim_names" flixopt/clustering/base.py | grep -i "test\|assert\|valid\|check"

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Look more carefully at how keys are created in _results dict
# Find where clustering is applied with period/scenario dimensions
sed -n '600,640p' flixopt/clustering/base.py

Repository: flixOpt/flixopt

Length of output: 1655


🏁 Script executed:

# Check if there's any code that constructs mismatched keys
rg -n "self._results\[" flixopt/clustering/base.py -B 2 -A 2

Repository: flixOpt/flixopt

Length of output: 1023


Use strict=True in zip() to catch dimension mismatches.

The zip(self._dim_names, key, strict=False) at line 621 will silently truncate if key and dim_names have mismatched lengths. Since key tuples should always match the structure of dim_names (both sourced from the same ClusteringResults initialization), using strict=True would catch any unexpected inconsistencies and fail fast rather than silently producing incomplete selectors. The empty case (key=(), dim_names=[]) works fine with strict=True since both are empty.

🤖 Prompt for AI Agents
In `@flixopt/clustering/base.py` around lines 619 - 633, The zip call that builds
the selector silently truncates mismatches (zip(self._dim_names, key,
strict=False)); change it to use strict=True so any length mismatch between
self._dim_names and key raises immediately, e.g., replace the zip invocation
used when constructing selector in the loop over self._results.items() (where
selector = dict(zip(self._dim_names, key, ...))) with zip(..., strict=True) to
fail fast on dimension mismatches and ensure selectors are built only when
dimensions align.

FBumann and others added 2 commits January 16, 2026 22:12
_add_clustering_to_dataset was slow due to 210 individual
ds[name] = arr assignments. Each triggers xarray's
expensive dataset_update_method.

Changed to batch assignment with ds.assign(dict):
- Before: ~2600ms for to_dataset with clustering
- After: ~109ms for to_dataset with clustering

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Avoided _construct_dataarray overhead by:
- Using ds.variables instead of ds.data_vars.items()
- Using numpy slicing instead of .isel()
- Passing attrs dict directly instead of DataArray

cluster() benchmark:
- Before: ~10.1s
- After: ~8.9s

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
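A rough, self-contained sketch of the access and assignment patterns these commits describe, on a synthetic dataset (variable names and sizes are illustrative only):

import numpy as np
import xarray as xr

ds = xr.Dataset({f'var_{i}': ('time', np.random.rand(1000)) for i in range(50)})
new_arrays = {f'extra_{i}': ('time', np.random.rand(1000)) for i in range(50)}

# Slow pattern: one full dataset update per assignment
slow = ds.copy()
for name, arr in new_arrays.items():
    slow[name] = arr

# Faster pattern: a single batch update
fast = ds.assign(new_arrays)

# Reading raw values via ds.variables skips per-variable DataArray construction
totals = {name: float(var.values.sum()) for name, var in ds.variables.items() if name in ds.data_vars}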
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@flixopt/transform_accessor.py`:
- Around line 293-313: The current shortcut returns a 1-D xr.Dataset when there
are multiple (period, scenario) slices but only one non-empty metrics DataFrame,
dropping period/scenario dims; change the condition so the single-slice fast
path is used only when there is truly one overall slice (i.e.,
len(clustering_metrics_all) == 1 and len(non_empty_metrics) == 1). If there are
multiple (period, scenario) keys but only one non-empty metrics_df, do not
return the 1-D Dataset from the block using first_key/metrics_df; instead let
execution fall through to the multi-dim assembly code (which should construct
the full Dataset with NaN-filled slices) so dimensions referenced by
Clustering._dim_names (periods/scenarios) are preserved.
♻️ Duplicate comments (2)
flixopt/transform_accessor.py (2)

1392-1397: Guard against empty clustering input after drop_constant_arrays().

Same issue as previously noted: if the selected variables become empty after drop_constant_arrays, to_dataframe() will fail with a cryptic error. Please add an explicit guard and a clear error message before converting to DataFrame.


1915-1923: Segmented SOC interpolation still uses wrong time index.

timestep_mapping % n_segments collapses to segment indices, not timestep positions within a cluster, which yields incorrect interpolation. Use modulo timesteps_per_cluster for time_indices while keeping cluster decoding as‑is.

🛠️ Proposed fix
-        if clustering.is_segmented and clustering.n_segments is not None:
-            time_dim_size = clustering.n_segments
-        else:
-            time_dim_size = clustering.timesteps_per_cluster
-        cluster_indices = timestep_mapping // time_dim_size
-        time_indices = timestep_mapping % time_dim_size
+        if clustering.is_segmented and clustering.n_segments is not None:
+            time_dim_size = clustering.n_segments
+        else:
+            time_dim_size = clustering.timesteps_per_cluster
+        cluster_indices = timestep_mapping // time_dim_size
+        # Use original timestep position within each cluster for segment lookup
+        timesteps_per_cluster = clustering.timesteps_per_cluster
+        time_positions = xr.DataArray(
+            np.arange(timestep_mapping.sizes['original_time']),
+            dims=['original_time'],
+            coords={'original_time': timestep_mapping.coords['original_time']},
+        ).broadcast_like(timestep_mapping)
+        time_indices = time_positions % timesteps_per_cluster

Comment on lines 293 to 313
non_empty_metrics = {k: v for k, v in clustering_metrics_all.items() if not v.empty}

if not non_empty_metrics:
    return xr.Dataset()

first_key = (periods[0], scenarios[0])

if len(non_empty_metrics) == 1 or len(clustering_metrics_all) == 1:
    metrics_df = non_empty_metrics.get(first_key)
    if metrics_df is None:
        metrics_df = next(iter(non_empty_metrics.values()))
    return xr.Dataset(
        {
            col: xr.DataArray(
                metrics_df[col].values,
                dims=['time_series'],
                coords={'time_series': metrics_df.index},
            )
            for col in metrics_df.columns
        }
    )

⚠️ Potential issue | 🟡 Minor

Preserve period/scenario dims when only one metrics slice is non‑empty.

If there are multiple (period, scenario) slices but only one non‑empty metrics DataFrame, the shortcut returns a 1‑D Dataset and drops period/scenario dimensions, which can misalign metrics with Clustering._dim_names. Consider only using the shortcut when there is truly a single slice, and otherwise fall through to the multi‑dim path with NaN‑filled slices.

🛠️ Proposed fix
-        if len(non_empty_metrics) == 1 or len(clustering_metrics_all) == 1:
+        if len(clustering_metrics_all) == 1:
             metrics_df = non_empty_metrics.get(first_key)
             if metrics_df is None:
                 metrics_df = next(iter(non_empty_metrics.values()))
             return xr.Dataset(
🤖 Prompt for AI Agents
In `@flixopt/transform_accessor.py` around lines 293 - 313, The current shortcut
returns a 1-D xr.Dataset when there are multiple (period, scenario) slices but
only one non-empty metrics DataFrame, dropping period/scenario dims; change the
condition so the single-slice fast path is used only when there is truly one
overall slice (i.e., len(clustering_metrics_all) == 1 and len(non_empty_metrics)
== 1). If there are multiple (period, scenario) keys but only one non-empty
metrics_df, do not return the 1-D Dataset from the block using
first_key/metrics_df; instead let execution fall through to the multi-dim
assembly code (which should construct the full Dataset with NaN-filled slices)
so dimensions referenced by Clustering._dim_names (periods/scenarios) are
preserved.

Eliminated 451,856 slow pandas .loc calls by using numpy reshape
for segmented clustering data instead of iterating per-cluster.

cluster() with segments benchmark (50 clusters, 4 segments):
- Before: ~93.7s
- After: ~21.1s
- Speedup: 4.4x

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@flixopt/transform_accessor.py`:
- Around line 533-565: The current branch treats any variable with
partially-missing typical slices as fully constant; change the logic in the loop
(around variables, ds_new_vars, typical_das, all_keys) so that only when name
not in typical_das you take the constant/no-time path; if name exists in
typical_das but its keys != all_keys, build the time-dependent DataArray by
iterating all_keys and for each missing (period,scenario) key fill that slice
with the appropriate constant aggregate (e.g., broadcast the available constant
array or a selected default from typical_das[name]), then concatenate/stack
these filled slices into the full (cluster, time, ...) shape and assign into
ds_new_vars with dims ['cluster','time',...] and coords new_coords; ensure you
use the same coord_cache, cluster_coords and time_coords when constructing the
filled/combined DataArray.
♻️ Duplicate comments (3)
flixopt/transform_accessor.py (3)

298-318: Preserve period/scenario dims in the metrics shortcut.
The len(non_empty_metrics) == 1 or len(clustering_metrics_all) == 1 fast path can drop period/scenario dims when multiple slices exist but only one is non-empty (Line 305). This produces a 1‑D Dataset and misaligns downstream dimension expectations. Consider requiring a true single-slice case only.

♻️ Proposed fix
-        if len(non_empty_metrics) == 1 or len(clustering_metrics_all) == 1:
+        if len(clustering_metrics_all) == 1 and len(non_empty_metrics) == 1:
             metrics_df = non_empty_metrics.get(first_key)
             if metrics_df is None:
                 metrics_df = next(iter(non_empty_metrics.values()))
             return xr.Dataset(

1397-1403: Guard against empty clustering input before to_dataframe().
After drop_constant_arrays, the dataset can be empty when all selected variables are constant (Line 1401–1402). to_dataframe() then fails with a cryptic indexing error. Add an explicit guard with a clear message.

🐛 Proposed fix
                 ds_slice_for_clustering = (
                     ds_for_clustering.sel(**selector, drop=True) if selector else ds_for_clustering
                 )
                 temporaly_changing_ds_for_clustering = drop_constant_arrays(ds_slice_for_clustering, dim='time')
+                if not temporaly_changing_ds_for_clustering.data_vars:
+                    raise ValueError(
+                        'No time-varying data left for clustering after drop_constant_arrays(). '
+                        'Check data_vars or inspect transform.clustering_data().'
+                    )
                 df_for_clustering = temporaly_changing_ds_for_clustering.to_dataframe()

1920-1932: Segmented charge_state interpolation uses the wrong time index basis.
For segmented clustering, timestep_mapping % n_segments gives the segment index, but segment_assignments/position_within_segment are keyed by timestep position within the cluster (Line 1926–1932). This mis-indexes segment assignments and produces incorrect interpolation. Use timestep positions mod timesteps_per_cluster for the time index while keeping cluster indices derived from timestep_mapping.

🐛 Proposed fix
         if clustering.is_segmented and clustering.n_segments is not None:
             time_dim_size = clustering.n_segments
         else:
             time_dim_size = clustering.timesteps_per_cluster
         cluster_indices = timestep_mapping // time_dim_size
-        time_indices = timestep_mapping % time_dim_size
+        # Use original timestep position within each cluster for segment lookup
+        timesteps_per_cluster = clustering.timesteps_per_cluster
+        time_positions = xr.DataArray(
+            np.arange(timestep_mapping.sizes['original_time']),
+            dims=['original_time'],
+            coords={'original_time': timestep_mapping.coords['original_time']},
+        ).broadcast_like(timestep_mapping)
+        time_indices = time_positions % timesteps_per_cluster
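
A small numeric sketch of the two index bases (made-up mapping and toy sizes, not the library's actual arrays): the segment index comes from the mapping, while segment assignments are keyed by the timestep's position within its cluster:

import numpy as np

timesteps_per_cluster = 4  # original timesteps per cluster
n_segments = 2             # segments per cluster after segmentation

# Hypothetical mapping of 8 original timesteps onto the segmented timeline:
# cluster 0 covers segment indices 0-1, cluster 1 covers segment indices 2-3.
timestep_mapping = np.array([0, 0, 1, 1, 2, 2, 3, 3])

cluster_indices = timestep_mapping // n_segments   # [0 0 0 0 1 1 1 1]
segment_indices = timestep_mapping % n_segments    # [0 0 1 1 0 0 1 1]

# Position of each original timestep within its cluster - the index that
# segment_assignments / position_within_segment are keyed by.
time_positions = np.arange(timestep_mapping.size) % timesteps_per_cluster
print(time_positions)                              # [0 1 2 3 0 1 2 3]
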

Comment on lines 533 to 567
        # Use ds.variables to avoid _construct_dataarray overhead
        variables = ds.variables
        coord_cache = {k: ds.coords[k].values for k in ds.coords}

        for name in ds.data_vars:
            var = variables[name]
            if 'time' not in var.dims:
                # No time dimension - wrap Variable in DataArray
                coords = {d: coord_cache[d] for d in var.dims if d in coord_cache}
                ds_new_vars[name] = xr.DataArray(var.values, dims=var.dims, coords=coords, attrs=var.attrs, name=name)
            elif name not in typical_das or set(typical_das[name].keys()) != all_keys:
                # Time-dependent but constant: reshape to (cluster, time, ...)
                # Use numpy slicing instead of .isel()
                time_idx = var.dims.index('time')
                slices = [slice(None)] * len(var.dims)
                slices[time_idx] = slice(0, n_reduced_timesteps)
                sliced_values = var.values[tuple(slices)]

                other_dims = [d for d in var.dims if d != 'time']
                other_shape = [var.sizes[d] for d in other_dims]
                new_shape = [actual_n_clusters, n_time_points] + other_shape
                reshaped = sliced_values.reshape(new_shape)
                new_coords = {'cluster': cluster_coords, 'time': time_coords}
                for dim in other_dims:
                    if dim in coord_cache:
                        new_coords[dim] = coord_cache[dim]
                ds_new_vars[name] = xr.DataArray(
                    reshaped,
                    dims=['cluster', 'time'] + other_dims,
                    coords=new_coords,
                    attrs=var.attrs,
                )
            else:

⚠️ Potential issue | 🟠 Major

Handle partially-missing typical slices without discarding valid aggregates.
The condition name not in typical_das or set(typical_das[name].keys()) != all_keys treats any variable missing one (period, scenario) slice as fully constant (Line 543). That discards valid aggregated data for slices where the variable does vary and reshapes raw data instead. This can corrupt reduced outputs for mixed-variability cases. Only use the constant path when the variable is absent altogether, and fill missing slices with constant values before combining.

🛠️ Suggested approach (fill missing slices)
-            elif name not in typical_das or set(typical_das[name].keys()) != all_keys:
+            elif name not in typical_das:
                 # Time-dependent but constant: reshape to (cluster, time, ...)
                 # Use numpy slicing instead of .isel()
                 time_idx = var.dims.index('time')
                 slices = [slice(None)] * len(var.dims)
                 slices[time_idx] = slice(0, n_reduced_timesteps)
                 sliced_values = var.values[tuple(slices)]
                 ...
                 ds_new_vars[name] = xr.DataArray(
                     reshaped,
                     dims=['cluster', 'time'] + other_dims,
                     coords=new_coords,
                     attrs=var.attrs,
                 )
             else:
+                # Fill missing (period, scenario) slices with constant values
+                slices = dict(typical_das[name])
+                missing = all_keys - set(slices.keys())
+                if missing:
+                    # TODO: extract constant-slice construction into a helper reused above
+                    for key in missing:
+                        slices[key] = _build_constant_slice_for_key(...)
+                da = self._combine_slices_to_dataarray_2d(
+                    slices=slices,
+                    attrs=var.attrs,
+                    periods=periods,
+                    scenarios=scenarios,
+                )
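
A rough standalone sketch of the fill-before-combine idea (toy shapes and values; _build_constant_slice_for_key in the diff above is only a placeholder, and this is not the library's actual code path):

import numpy as np

n_clusters, n_time_points = 2, 3
all_keys = {('2020', 'base'), ('2030', 'base')}

# Aggregated (cluster, time) slices exist only for some (period, scenario) keys
slices = {('2020', 'base'): np.arange(6, dtype=float).reshape(n_clusters, n_time_points)}

for key in all_keys - set(slices):
    # Fill the missing slice by broadcasting its constant value to the reduced
    # shape instead of treating the entire variable as constant.
    constant_value = 5.0  # in practice, taken from the raw data for that slice
    slices[key] = np.full((n_clusters, n_time_points), constant_value)

# Every key is present now, so the slices can be stacked/combined as usual
stacked = np.stack([slices[key] for key in sorted(all_keys)])
print(stacked.shape)  # (2, 2, 3)
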
🤖 Prompt for AI Agents
In `@flixopt/transform_accessor.py` around lines 533-565: The current branch
treats any variable with partially missing typical slices as fully constant.
Change the logic in the loop (around variables, ds_new_vars, typical_das,
all_keys) so that the constant/no-time path is taken only when name is not in
typical_das at all. If name exists in typical_das but its keys differ from
all_keys, build the time-dependent DataArray by iterating all_keys and, for each
missing (period, scenario) key, fill that slice with the appropriate constant
aggregate (e.g., broadcast the available constant array or a selected default
from typical_das[name]). Then concatenate/stack the filled slices into the full
(cluster, time, ...) shape and assign the result into ds_new_vars with dims
['cluster', 'time', ...] and coords new_coords, using the same coord_cache,
cluster_coords and time_coords when constructing the filled/combined DataArray.

@FBumann FBumann changed the title from perf: Use ds.variables to avoid _construct_dataarray overhead to perf: Optimize clustering and I/O (4.4x faster segmented clustering) Jan 16, 2026
@FBumann FBumann changed the base branch from main to feature/tsam-v3 January 16, 2026 21:46
@FBumann FBumann changed the base branch from feature/tsam-v3 to feature/tsam-v3+rework January 16, 2026 21:47
FBumann and others added 2 commits January 16, 2026 23:06
- benchmark_io_performance.py: Add Gurobi → HiGHS solver fallback
- components.py: Fix storage decay to use sum (not mean) for hours per cluster
- flow_system.py: Add RangeIndex validation requiring explicit timestep_duration
- io.py: Include auxiliary coordinates in _fast_get_dataarray
- transform_accessor.py: Add empty dataset guard after drop_constant_arrays
- transform_accessor.py: Fix timestep_mapping indexing for segmented clustering

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace data_vars.items() iteration with ds.variables pattern to avoid
slow _construct_dataarray calls (5502 calls × ~1.5ms each).

Before: 3.73s
After:  1.72s

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
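
As an illustration of the access pattern this commit describes, a minimal sketch with a toy dataset (invented variable names; actual savings depend on dataset size):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {f'var_{i}': ('time', np.random.rand(100)) for i in range(50)},
    coords={'time': np.arange(100)},
)

# Slower pattern: data_vars.items() constructs a full DataArray per variable
sums_slow = {name: float(da.values.sum()) for name, da in ds.data_vars.items()}

# Faster pattern: ds.variables returns lightweight Variable objects, so the
# per-variable DataArray construction (coords, indexes) is skipped
variables = ds.variables
sums_fast = {name: float(variables[name].values.sum()) for name in ds.data_vars}

assert sums_slow == sums_fast
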
@FBumann FBumann merged commit 7a4280d into feature/tsam-v3+rework Jan 17, 2026
8 checks passed