
Conversation

@FBumann
Member

@FBumann FBumann commented Jan 16, 2026

Summary

Performance optimizations using faster data access patterns throughout the codebase.

Changes

File Function Optimization Speedup
transform_accessor.py _build_typical_das Numpy reshape instead of pandas .loc 4.4x
transform_accessor.py _build_reduced_dataset Use ds.variables + numpy slicing 12%
io.py _add_clustering_to_dataset Batch assignment with ds.assign() 24x
io.py various functions Use ds.variables pattern -
structure.py from_dataset Use coord_cache pattern -
core.py drop_constant_arrays Use ds.variables + numpy -
clustering/base.py various functions Use ds.variables pattern -

Key Optimization: Numpy reshape for segmented clustering

# Before: 451,856 slow pandas .loc calls (93.7s)
# (`data` is a preallocated (n_clusters, n_time_points) array for the current column)
for col in typical_df.columns:
    for cluster_id in range(actual_n_clusters):
        cluster_data = typical_df.loc[cluster_id, col]  # SLOW!
        data[cluster_id, :] = cluster_data.values

# After: Single numpy reshape (21.1s)
# Assumes typical_df rows are ordered cluster-major with n_time_points rows per cluster
all_values = typical_df.values  # (n_clusters * n_time_points, n_columns)
reshaped = all_values.reshape(actual_n_clusters, n_time_points, -1)

for col_idx, col in enumerate(columns):
    # reshaped[:, :, col_idx] selects all clusters, all time points, single column
    # Result shape: (n_clusters, n_time_points)
    data = reshaped[:, :, col_idx]

Benchmarks

Operation Before After Speedup
cluster() with segments (50 clusters, 4 segments) 93.7s 21.1s 4.4x
cluster() without segments (12 clusters) 10.1s 8.9s 12%
to_dataset() with clustering 2600ms 109ms 24x
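
The 24x to_dataset() speedup comes from the batch assignment noted in the Changes table. A minimal sketch of the idea, with illustrative names rather than the actual _add_clustering_to_dataset code:

import xarray as xr

def attach_arrays_slow(ds: xr.Dataset, new_arrays: dict[str, xr.DataArray]) -> xr.Dataset:
    # Per-item assignment re-runs xarray's merge/alignment checks for every variable
    for name, da in new_arrays.items():
        ds[name] = da
    return ds

def attach_arrays_fast(ds: xr.Dataset, new_arrays: dict[str, xr.DataArray]) -> xr.Dataset:
    # A single batched call merges all new variables at once
    return ds.assign(new_arrays)

Collapsing many individual assignments into one ds.assign() call pays off when a large number of clustering arrays is attached to the dataset.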

Testing

  • All IO tests pass
  • All 185 clustering tests pass

Optimize several functions by using ds.variables instead of iterating
over data_vars.items() or accessing ds[name], which triggers slow
_construct_dataarray calls.
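
As a rough illustration of that pattern (a minimal sketch with illustrative helper names, not the exact flixopt code):

import xarray as xr

def iter_data_values_slow(ds: xr.Dataset):
    # data_vars.items() constructs a full DataArray (coords + indexes) per variable
    for name, da in ds.data_vars.items():
        yield name, da.values

def iter_data_values_fast(ds: xr.Dataset):
    # ds.variables hands back low-level Variable objects, skipping _construct_dataarray
    coord_names = set(ds.coords)
    for name, var in ds.variables.items():
        if name not in coord_names:  # keep only data variables
            yield name, var.values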

Changes:
- io.py: save_dataset_to_netcdf, load_dataset_from_netcdf, _reduce_constant_arrays
- structure.py: from_dataset (use coord_cache pattern)
- core.py: drop_constant_arrays (use numpy operations)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@coderabbitai
Contributor

coderabbitai bot commented Jan 16, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

Walkthrough

Adds multidimensional clustering (Clustering / ClusteringResults), VariableCategory-based variable classification, segmented-timesteps support (RangeIndex), a centralized FlowSystem ↔ xarray I/O layer (FlowSystemDatasetIO), updated TransformAccessor clustering API (ClusterConfig/ExtremeConfig/SegmentConfig), and an IO benchmark script.

Changes

Cohort / File(s) Summary
Clustering core
flixopt/clustering/__init__.py, flixopt/clustering/base.py, flixopt/clustering/intercluster_helpers.py
Replace legacy ClusterResult/ClusterStructure with Clustering and ClusteringResults; multi-dimensional (period, scenario) results, IO (to_json/from_json), plotting accessor, rename cluster_order → cluster_assignments.
Transform & clustering ops
flixopt/transform_accessor.py
New cluster API (data_vars, cluster: ClusterConfig, extremes: ExtremeConfig, segments: SegmentConfig), clustering_data(), apply_clustering(), builders for weights/metrics/typical periods, segmentation-aware expand and mapping helpers.
FlowSystem I/O
flixopt/io.py, flixopt/flow_system.py
Add FlowSystemDatasetIO and module wrappers (flow_system_to_dataset/restore_flow_system_from_dataset); delegate to_dataset/from_dataset; add include_original_data flag; robust netCDF save/load helpers.
Variable categorization
flixopt/structure.py, flixopt/components.py, flixopt/elements.py, flixopt/features.py, flixopt/effects.py, flixopt/modeling.py
Introduce VariableCategory enum and EXPAND_* groups; track variable_categories on FlowSystemModel; Submodel.add_variables(..., category=...) records mappings; annotate many model variables and investment sizing with categories.
Segmentation & timesteps
flixopt/flow_system.py
Support RangeIndex timesteps (segmented systems), optional timestep_duration, is_segmented property, variable_categories propagation, get_variables_by_category API and segmentation-aware metadata/expansion.
I/O internals & perf
flixopt/io.py, benchmarks/benchmark_io_performance.py
Dataset I/O refactor to use ds.variables, faster stack/unstack/reduce helpers; new benchmark script measuring to_dataset(include_solution=True) and FlowSystem.from_dataset(ds) roundtrip timings.
Statistics & utils
flixopt/statistics_accessor.py, flixopt/core.py
Statistics selection switched to category-based lookups; drop_constant_arrays() adds atol and uses NumPy-based constancy detection for performance.
Modeling & features
flixopt/modeling.py, flixopt/features.py, flixopt/components.py
Add size_category parameter to InvestmentModel; propagate categories to size/investment/status/duration and related variables.
Docs & notebooks
docs/notebooks/*, docs/user-guide/optimization/*
Notebook metadata updates and docs rewritten to use ExtremeConfig, ClusterConfig, new clustering results access patterns, segmentation docs and examples expanded.
Tests
tests/... (many files)
Tests updated to new clustering API (Clustering/ClusteringResults), cluster_assignments usage, data_vars/extremes/segmentation coverage, and IO roundtrip assertions.
Project manifest
pyproject.toml
Replace tsam pinned versions with a git-based dependency tsam @ git+https://github.com/FBumann/tsam.git@v3-rebased.

Sequence Diagram(s)

sequenceDiagram
    participant User as User
    participant FS as FlowSystem
    participant TA as TransformAccessor
    participant CL as Clustering
    participant IO as FlowSystemDatasetIO
    participant XR as xarray.Dataset

    Note over User,FS: create / configure system
    User->>FS: create(timesteps, periods, components)
    FS->>FS: init (DatetimeIndex | RangeIndex), record variable_categories

    Note over User,TA: clustering workflow
    User->>TA: cluster(n_clusters, data_vars, extremes=ExtremeConfig(...))
    TA->>TA: clustering_data() → prepare inputs
    TA->>CL: call tsam.aggregate & build ClusteringResults
    CL->>CL: build cluster_assignments, timesteps_per_cluster, metrics
    TA->>FS: produce reduced FlowSystem with Clustering attached

    Note over User,FS: solve & attach solution
    User->>FS: solve()
    FS->>FS: store solution, propagate variable_categories

    Note over User,FS: expand & IO
    User->>TA: expand(reduced_solution)
    TA->>CL: expand_data(aggregated, segmentation-aware)
    CL->>FS: produce expanded solution

    User->>FS: to_dataset(include_solution=True, include_original_data=True)
    FS->>IO: flow_system_to_dataset(...)
    IO->>XR: construct xr.Dataset (solution, clustering, categories...)

    User->>XR: save/load dataset
    XR->>IO: restore_flow_system_from_dataset(ds)
    IO->>CL: reconstruct Clustering from serialized data
    IO->>FS: return FlowSystem with clustering and variable_categories

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Poem

🐰 I hopped through timesteps, tagged each flow with care,

Clusters curled like cozy naps, then stretched back to the air,
RangeIndex and Datetime now share one sunny burrow,
Datasets saved and loaded — crunchy carrots, not sorrow,
I timed the bytes and twitched my nose: IO faster, let's go!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Docstring Coverage ✅ Passed Docstring coverage is 91.37% which is sufficient. The required threshold is 80.00%.
Title check ✅ Passed The title 'perf: Optimize clustering and I/O (4.4x faster segmented clustering)' clearly identifies the main performance optimization focus with specific metrics, directly matching the changeset's primary objective.
Description check ✅ Passed The description provides a comprehensive summary with change details, optimization strategies, before/after code examples, and benchmark results. However, it lacks explicit Type of Change selection and Testing checklist completion as specified in the template.


Use ds.variables for faster access in clustering/base.py:
- _create_reference_structure: original_data and metrics iteration
- compare plot: duration_curve generation with direct numpy indexing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 9

🤖 Fix all issues with AI agents
In `@benchmarks/benchmark_io_performance.py`:
- Around line 138-146: The benchmark currently calls
fs_clustered.optimize(fx.solvers.GurobiSolver()) which will fail at solve time
if Gurobi is not available; change this to detect/handle solver availability by
wrapping the optimize call in a try/except that catches solver runtime errors
and falls back to fx.solvers.HighsSolver(), or programmatically choose the
solver before calling fs_clustered.optimize by attempting to use
fx.solvers.GurobiSolver() and on failure selecting fx.solvers.HighsSolver();
update the call sites referenced (create_large_flow_system,
fs.transform.cluster, fs_clustered.optimize, fx.solvers.GurobiSolver) so the
benchmark either falls back to HighsSolver or logs a clear error and exits.

In `@docs/notebooks/08c2-clustering-storage-modes.ipynb`:
- Around line 174-175: Remove the invalid import "from tsam.config import
ExtremeConfig" and replace any usage of ExtremeConfig (e.g.,
extremes=ExtremeConfig(...)) with the TimeSeriesAggregation API by setting the
extremePeriodMethod parameter (one of "None", "append", "new_cluster_center",
"replace_cluster_center") and the corresponding flags like addPeakMin,
addPeakMax, addMeanMin, addMeanMax on the TimeSeriesAggregation instance; ensure
any code that passed extremes=ExtremeConfig(...) is updated to configure extreme
handling via TimeSeriesAggregation.extremePeriodMethod and the addPeak*/addMean*
arguments instead.

In `@docs/notebooks/08d-clustering-multiperiod.ipynb`:
- Around line 283-285: The printed header uses hardcoded "period=2024,
scenario=High" which can diverge from the indices used; retrieve the actual
labels from the dataset/coords used to produce cluster_assignments (e.g., the
period and scenario coordinate arrays referenced when you computed
cluster_assignments) and use those values in the header instead of fixed text,
then keep the existing loop over cluster_assignments and day_names to print each
day's cluster; reference the variables/objects cluster_assignments and day_names
and the same period/scenario coord variables you used earlier to select the
slice so the header is always derived from the data.

In `@flixopt/components.py`:
- Around line 1507-1512: The decay calculation uses mean timestep duration which
underestimates total hours per cluster; update the reduction on
self._model.timestep_duration (used when computing decay_n) to compute the total
hours for the cluster (e.g., use
_scalar_safe_reduce(self._model.timestep_duration, 'time', 'sum') or equivalent)
so decay_n = (1 - self.element.relative_loss_per_hour) **
total_hours_per_cluster correctly reflects summed duration while preserving
DataArray semantics and usage of _scalar_safe_reduce and relative_loss_per_hour.

In `@flixopt/flow_system.py`:
- Around line 210-218: The code currently allows self.timestep_duration to be
None when timesteps is a RangeIndex (segmented system), which breaks downstream
logic (e.g., weights['time'], temporal_weight); add an early validation in the
constructor or initializer where timestep_duration and
computed_timestep_duration are handled: detect the segmented case (timesteps is
a RangeIndex or computed_timestep_duration is None) and raise a clear ValueError
if timestep_duration is not provided, mentioning that RangeIndex requires an
explicit timestep_duration; keep existing logic for assigning
self.timestep_duration (including using fit_to_model_coords('timestep_duration',
computed_timestep_duration) when available) but enforce this guard before
downstream computation.

In `@flixopt/io.py`:
- Around line 1587-1603: _fast_get_dataarray currently only includes coordinates
whose names match variable.dims, dropping auxiliary coordinates whose dims are a
subset of the variable's dims; update the coords build in _fast_get_dataarray to
include any coord_cache entry where coord.dims is a subset of variable.dims (not
just keys equal to dims), so auxiliary coords are preserved when constructing
the xr.DataArray (use variable.dims and each coord.dims to decide inclusion and
pass the resulting coords dict to xr.DataArray).

In `@flixopt/transform_accessor.py`:
- Around line 1379-1385: Add a guard to detect when drop_constant_arrays removes
all variables before calling to_dataframe(): after computing
temporaly_changing_ds_for_clustering =
drop_constant_arrays(ds_slice_for_clustering, dim='time') check whether
temporaly_changing_ds_for_clustering is empty (e.g., has no data_vars or no
variables) and raise a clear ValueError mentioning the input filtering
(data_vars/selector) and that there is no varying data to cluster, instead of
calling temporaly_changing_ds_for_clustering.to_dataframe() and allowing a
cryptic indexing error; update the code paths that assign df_for_clustering
accordingly so downstream logic only runs when df_for_clustering is valid.
- Around line 1902-1914: The current indexing misuses timestep_mapping modulo
n_segments for time positions when clustering.is_segmented, causing wrong
time_indices and bad SOC interpolation; change the time index calculation so
time_indices = timestep_mapping % clustering.timesteps_per_cluster (always use
timesteps_per_cluster for the time position), keep cluster_indices =
timestep_mapping // time_dim_size (where time_dim_size remains
clustering.n_segments when segmented for clustering into clusters), and then use
those cluster_indices/time_indices to index segment_assignments,
position_within_segment and segment_durations (variables: timestep_mapping,
clustering.is_segmented, clustering.n_segments,
clustering.timesteps_per_cluster, cluster_indices, time_indices,
segment_assignments, position_within_segment, segment_durations, seg_indices,
positions, durations).

In `@pyproject.toml`:
- Around line 67-68: The pyproject.toml currently lists a git URL dependency
"tsam @ git+https://github.com/FBumann/tsam.git@v3-rebased" in the [full] and
[dev] extras which will break PyPI publication; remove that git-based
requirement from both the [full] and [dev] extras and either (a) replace it with
a proper PyPI version specifier (e.g., "tsam >= 3.0.0, < 4") once tsam is
published, (b) omit tsam from those extras and add a note in your README/INSTALL
docs instructing users to install tsam from source, or (c) vendor/include tsam
in the package; finally, if this is intended only for development, update the
release workflow (the verify-pypi job) to skip PyPI verification or otherwise
allow dev-only dependencies so the release will pass.
🧹 Nitpick comments (2)
flixopt/structure.py (1)

901-908: Avoid eager .values on potentially lazy arrays.
array.values will fully materialize dask-backed data. If datasets can be opened lazily, this can blow memory and negate the IO perf gains. Consider guarding for lazy arrays or using a reduction that doesn’t materialize the full array.

🔧 Lazy-safe null check (example)
-        has_nulls = (np.issubdtype(array.dtype, np.floating) and np.any(np.isnan(array.values))) or (
-            array.dtype == object and pd.isna(array.values).any()
-        )
+        data = array.data
+        if hasattr(data, "compute"):
+            has_nulls = bool(array.isnull().any().compute())
+        else:
+            has_nulls = (np.issubdtype(array.dtype, np.floating) and np.any(np.isnan(array.values))) or (
+                array.dtype == object and pd.isna(array.values).any()
+            )

To confirm whether lazy loading is possible in your IO path, you can locate the NetCDF loader implementation:

#!/bin/bash
rg -n "def load_dataset_from_netcdf" -g '*.py'
rg -n "open_dataset|load_dataset" -g '*.py'
flixopt/clustering/base.py (1)

528-534: Preserve dimension order in _get_dim_values to keep .isel() predictable.

sorted(set(...)) reorders period/scenario values and can make .isel() indices differ from the original order (e.g., FlowSystem period/scenario lists). Consider preserving first-seen order from the results keys.

♻️ Proposed refactor
-        return sorted(set(k[idx] for k in self._results.keys()))
+        values: list = []
+        seen = set()
+        for key in self._results.keys():
+            val = key[idx]
+            if val not in seen:
+                values.append(val)
+                seen.add(val)
+        return values
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 846dd57 and 5604845.

📒 Files selected for processing (39)
  • benchmarks/benchmark_io_performance.py
  • docs/notebooks/01-quickstart.ipynb
  • docs/notebooks/02-heat-system.ipynb
  • docs/notebooks/03-investment-optimization.ipynb
  • docs/notebooks/04-operational-constraints.ipynb
  • docs/notebooks/05-multi-carrier-system.ipynb
  • docs/notebooks/06a-time-varying-parameters.ipynb
  • docs/notebooks/06b-piecewise-conversion.ipynb
  • docs/notebooks/06c-piecewise-effects.ipynb
  • docs/notebooks/08a-aggregation.ipynb
  • docs/notebooks/08c-clustering.ipynb
  • docs/notebooks/08c2-clustering-storage-modes.ipynb
  • docs/notebooks/08d-clustering-multiperiod.ipynb
  • docs/notebooks/08e-clustering-internals.ipynb
  • docs/notebooks/08f-clustering-segmentation.ipynb
  • docs/notebooks/09-plotting-and-data-access.ipynb
  • docs/notebooks/10-transmission.ipynb
  • docs/user-guide/optimization/clustering.md
  • docs/user-guide/optimization/index.md
  • flixopt/clustering/__init__.py
  • flixopt/clustering/base.py
  • flixopt/clustering/intercluster_helpers.py
  • flixopt/components.py
  • flixopt/core.py
  • flixopt/effects.py
  • flixopt/elements.py
  • flixopt/features.py
  • flixopt/flow_system.py
  • flixopt/io.py
  • flixopt/modeling.py
  • flixopt/statistics_accessor.py
  • flixopt/structure.py
  • flixopt/transform_accessor.py
  • pyproject.toml
  • tests/deprecated/examples/03_Optimization_modes/example_optimization_modes.py
  • tests/test_cluster_reduce_expand.py
  • tests/test_clustering/test_base.py
  • tests/test_clustering/test_integration.py
  • tests/test_clustering_io.py
🧰 Additional context used
🧬 Code graph analysis (12)
tests/deprecated/examples/03_Optimization_modes/example_optimization_modes.py (2)
flixopt/flow_system.py (2)
  • copy (964-989)
  • transform (1681-1699)
flixopt/transform_accessor.py (1)
  • cluster (1194-1478)
benchmarks/benchmark_io_performance.py (2)
flixopt/flow_system.py (6)
  • FlowSystem (61-2604)
  • transform (1681-1699)
  • to_dataset (686-724)
  • from_dataset (727-749)
  • solution (1461-1478)
  • solution (1481-1484)
flixopt/io.py (2)
  • to_dataset (1785-1828)
  • from_dataset (1528-1563)
flixopt/modeling.py (1)
flixopt/structure.py (3)
  • VariableCategory (82-129)
  • add_variables (204-220)
  • add_variables (1740-1768)
flixopt/clustering/__init__.py (1)
flixopt/clustering/base.py (2)
  • Clustering (636-1346)
  • ClusteringResults (161-633)
flixopt/structure.py (1)
flixopt/flow_system.py (1)
  • variable_categories (1487-1493)
flixopt/statistics_accessor.py (2)
flixopt/structure.py (3)
  • VariableCategory (82-129)
  • solution (290-338)
  • solution (1288-1302)
flixopt/flow_system.py (3)
  • get_variables_by_category (1495-1547)
  • solution (1461-1478)
  • solution (1481-1484)
tests/test_cluster_reduce_expand.py (1)
flixopt/clustering/base.py (14)
  • n_clusters (316-318)
  • n_clusters (672-674)
  • cluster_assignments (338-349)
  • cluster_assignments (778-784)
  • timesteps_per_cluster (321-323)
  • timesteps_per_cluster (677-679)
  • sel (235-254)
  • sel (714-761)
  • clusters (1603-1706)
  • dims (217-219)
  • dims (697-699)
  • timestep_mapping (814-823)
  • coords (227-233)
  • coords (702-712)
flixopt/features.py (3)
flixopt/structure.py (9)
  • FlowSystemModel (188-470)
  • VariableCategory (82-129)
  • add_variables (204-220)
  • add_variables (1740-1768)
  • get_coords (415-450)
  • get_coords (1818-1823)
  • dims (350-352)
  • label_full (1284-1285)
  • label_full (1849-1850)
flixopt/clustering/base.py (4)
  • coords (227-233)
  • coords (702-712)
  • dims (217-219)
  • dims (697-699)
flixopt/elements.py (1)
  • label_full (638-639)
flixopt/flow_system.py (3)
flixopt/structure.py (9)
  • CompositeContainerMixin (1537-1713)
  • Element (1241-1341)
  • FlowSystemModel (188-470)
  • Interface (473-1238)
  • VariableCategory (82-129)
  • dims (350-352)
  • to_dataset (1038-1061)
  • to_netcdf (1063-1090)
  • copy (1218-1230)
flixopt/clustering/base.py (5)
  • coords (227-233)
  • coords (702-712)
  • dims (217-219)
  • dims (697-699)
  • is_segmented (764-770)
flixopt/io.py (4)
  • to_dataset (1785-1828)
  • flow_system_to_dataset (1945-1968)
  • restore_flow_system_from_dataset (1927-1942)
  • save_dataset_to_netcdf (532-586)
flixopt/transform_accessor.py (2)
flixopt/structure.py (3)
  • VariableCategory (82-129)
  • values (1663-1668)
  • values (1969-1970)
flixopt/clustering/base.py (13)
  • Clustering (636-1346)
  • n_clusters (316-318)
  • n_clusters (672-674)
  • is_segmented (764-770)
  • values (304-306)
  • values (1321-1328)
  • n_segments (331-333)
  • n_segments (773-775)
  • timesteps_per_cluster (321-323)
  • timesteps_per_cluster (677-679)
  • sel (235-254)
  • sel (714-761)
  • results (1236-1247)
flixopt/components.py (3)
flixopt/structure.py (4)
  • FlowSystemModel (188-470)
  • VariableCategory (82-129)
  • add_variables (204-220)
  • add_variables (1740-1768)
flixopt/clustering/base.py (7)
  • n_clusters (316-318)
  • n_clusters (672-674)
  • timesteps_per_cluster (321-323)
  • timesteps_per_cluster (677-679)
  • n_original_clusters (687-689)
  • cluster_assignments (338-349)
  • cluster_assignments (778-784)
flixopt/modeling.py (1)
  • _scalar_safe_reduce (57-76)
flixopt/effects.py (1)
flixopt/structure.py (3)
  • Submodel (1716-1914)
  • VariableCategory (82-129)
  • register_class_for_io (148-157)
🔇 Additional comments (91)
docs/notebooks/06b-piecewise-conversion.ipynb (1)

208-217: Notebook metadata standardization looks good.

The added metadata fields (codemirror_mode, file_extension, mimetype, nbconvert_exporter, pygments_lexer) are standard Jupyter notebook fields that improve editor compatibility and export behavior.

Note: The Python version was changed from 3.12.7 to 3.11.11. If this is intentional (e.g., to match your CI/dev environment), this is fine. Just verify it aligns with project requirements.

docs/notebooks/06c-piecewise-effects.ipynb (2)

315-323: Standard Jupyter metadata additions look good.

The added metadata fields (codemirror_mode, file_extension, mimetype, nbconvert_exporter, pygments_lexer) are standard Jupyter notebook metadata that enable proper syntax highlighting and notebook export functionality.


324-324: Verify the Python version downgrade.

The Python version was changed from "3.12.7" to "3.11.11". Please confirm whether this downgrade is intentional (e.g., for CI/environment standardization) or an artifact of re-executing the notebook in a different environment.

If this notebook was simply re-run in your current environment, consider whether committing the version change is necessary, especially if the project supports multiple Python versions.

docs/notebooks/04-operational-constraints.ipynb (1)

475-487: LGTM!

Standard Jupyter notebook metadata enrichment adding language_info fields for editor integration and export functionality. This aligns with the metadata updates across other notebooks in the repository.

docs/notebooks/05-multi-carrier-system.ipynb (1)

544-554: LGTM!

Consistent metadata enrichment matching the pattern applied to other notebooks. No functional changes.

docs/notebooks/10-transmission.ipynb (1)

636-646: LGTM!

Standard notebook metadata update consistent with the other notebooks in this PR.

docs/notebooks/09-plotting-and-data-access.ipynb (1)

834-844: LGTM!

Consistent notebook metadata update aligning with repository standards.

pyproject.toml (1)

86-86: Consistent with [full] extras dependency.

The [dev] dependency mirrors the git reference used in [full], maintaining consistency across installation profiles.

docs/notebooks/06a-time-varying-parameters.ipynb (1)

311-324: LGTM!

The metadata enrichment with language_info (Python 3.11.11) aligns with the consistent notebook metadata pattern across the repository.

docs/notebooks/08f-clustering-segmentation.ipynb (5)

152-166: LGTM!

Clear demonstration of clustering with segmentation. The parameters (n_clusters, cluster_duration, segments, extremes) are well-documented and the output logging provides useful feedback.


185-193: LGTM!

Good exploration of the clustering object properties after segmentation. The is_segmented, n_segments, n_clusters, and timesteps_per_cluster properties provide comprehensive insight into the segmentation structure.


317-325: LGTM!

The expansion workflow is clearly demonstrated with timing metrics and verification of objective preservation.


625-643: LGTM!

Complete notebook metadata with kernelspec and language_info consistent with other notebooks in the repository.


147-147: Import path is correct and consistent with existing codebase usage.

The import from tsam.config import ExtremeConfig, SegmentConfig follows the same pattern used throughout the repository in flixopt/transform_accessor.py and test files, confirming the tsam API is properly integrated.

docs/notebooks/08c-clustering.ipynb (8)

135-147: LGTM!

Clean migration to the new ExtremeConfig-based peak forcing API. The import and usage pattern are consistent with the updated clustering interface.


184-186: LGTM!

Updated access path to fs_clustered.clustering.results aligns with the new ClusteringResults container architecture.


226-258: LGTM!

Excellent addition documenting the clustering_data() method for inspecting time-varying data before clustering. This helps users understand which variables influence the clustering algorithm.


260-303: LGTM!

Good documentation of the data_vars parameter for selective clustering. The explanation of "cluster based on subset, apply to all data" is clear and the use case examples are practical.


337-347: LGTM!

Clear example of using ClusterConfig to specify alternative clustering algorithms (kmeans vs hierarchical).


381-400: LGTM!

The apply_clustering() documentation provides a useful pattern for design variant comparison with consistent cluster structures.


623-645: LGTM!

Comprehensive API reference update with renamed properties (cluster_assignments, cluster_occurrences) and the new ClusteringResults xarray-like interface documentation.


730-748: LGTM!

Updated notebook metadata consistent with repository-wide Python 3.11.11 standardization.

flixopt/effects.py (3)

20-28: LGTM!

Clean import grouping with VariableCategory added from .structure alongside the existing structure imports.


383-389: LGTM!

Correctly annotates the total variable with VariableCategory.TOTAL, enabling proper classification for segmentation and IO handling. This follows the established pattern in Submodel.add_variables which stores the category in the model's variable_categories mapping.


401-407: LGTM!

Correctly annotates the total_over_periods variable with VariableCategory.TOTAL_OVER_PERIODS. Consistent with the categorization pattern applied to total above.

docs/notebooks/02-heat-system.ipynb (1)

383-395: LGTM!

Metadata enrichment with language_info (Python 3.11.11) consistent with the repository-wide notebook standardization.

docs/notebooks/08a-aggregation.ipynb (1)

391-403: LGTM!

The language_info metadata enrichment is consistent with standard Jupyter notebook conventions and aligns with the broader metadata standardization across the notebook collection.

flixopt/clustering/intercluster_helpers.py (3)

14-16: LGTM!

The docstring correctly reflects the updated API using cluster_assignments[d] instead of cluster_order[d], aligning with the broader clustering API refactor described in the PR.


74-154: LGTM!

The function correctly handles all capacity parameter types (InvestParameters, numeric, None) with appropriate fallback behavior and helpful warning messages when using default unbounded capacity.


157-201: LGTM!

The function correctly constructs boundary coordinates with proper handling of optional period and scenario dimensions from the FlowSystem.

docs/notebooks/01-quickstart.ipynb (1)

285-295: LGTM!

The metadata enrichment is consistent with the standardization effort across notebooks, adding proper Python runtime details and version information.

docs/notebooks/03-investment-optimization.ipynb (1)

432-444: LGTM!

The metadata addition follows the same standardization pattern applied across the notebook collection.

docs/notebooks/08c2-clustering-storage-modes.ipynb (1)

195-199: LGTM!

The migration from time_series_for_high_peaks to the ExtremeConfig-based API provides more explicit control over extreme period handling. The method='new_cluster' with max_value=PEAK_SERIES correctly specifies that peak demand periods should be captured as separate clusters.

tests/deprecated/examples/03_Optimization_modes/example_optimization_modes.py (1)

193-211: LGTM - Clean migration to ExtremeConfig API.

The migration from time_series_for_high_peaks/time_series_for_low_peaks to the new ExtremeConfig-based approach follows the tsam 3.0 API correctly. The variable names reference the correct component labels defined earlier in the file (Wärmelast at line 123, Stromlast at line 128).

docs/user-guide/optimization/index.md (1)

59-66: LGTM - Documentation correctly demonstrates the new ExtremeConfig API.

The example clearly shows how to use the new extremes parameter with ExtremeConfig, aligning with the updated transform.cluster() signature.

flixopt/core.py (2)

643-650: Potential edge case with all-NaN slices.

When data contains slices that are entirely NaN along the specified axis, np.nanmax - np.nanmin will return NaN, and the comparison ptp < atol will be False. This means all-NaN variables won't be dropped as "constant."

If all-NaN variables should be treated as constant (and dropped), consider:

-        if np.all(ptp < atol):
+        if np.all((ptp < atol) | np.isnan(ptp)):

However, if preserving all-NaN variables is intentional (e.g., for debugging), the current behavior is acceptable.


617-619: LGTM - Performance optimization with new tolerance parameter.

The addition of atol parameter with a sensible default (1e-10) allows for tolerance-based constancy detection, which is more robust for floating-point data than exact equality checks. The use of ds.variables for direct Variable access is a good performance optimization.
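
A standalone NumPy sketch of the tolerance-based constancy check described here (not the flixopt implementation; the function name is illustrative):

import numpy as np

def is_effectively_constant(values: np.ndarray, axis: int = 0, atol: float = 1e-10) -> bool:
    # Peak-to-peak range (nanmax - nanmin) below atol means the slice barely varies.
    # Note: an all-NaN slice yields a NaN range here, so it is *not* treated as constant
    # (the edge case discussed in the comment above).
    ptp = np.nanmax(values, axis=axis) - np.nanmin(values, axis=axis)
    return bool(np.all(ptp < atol))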

flixopt/elements.py (8)

19-25: LGTM - Clean import of VariableCategory.

The import is properly organized with related structure module imports.


671-677: LGTM - Proper categorization of flow_rate variable.

The category=VariableCategory.FLOW_RATE correctly identifies the main flow rate variable for downstream processing and filtering.


682-693: LGTM - Appropriate category for total_flow_hours tracking variable.

Using VariableCategory.TOTAL correctly identifies this as an aggregate/total variable.


709-724: LGTM - Correct categorization for flow_hours_over_periods.

The VariableCategory.TOTAL_OVER_PERIODS appropriately distinguishes this cross-period aggregate from per-period totals.


732-749: LGTM - Status variable correctly categorized.

The binary status variable is properly marked with VariableCategory.STATUS.


751-761: LGTM - Investment model receives size category.

Passing size_category=VariableCategory.FLOW_SIZE to InvestmentModel allows proper identification of flow sizing variables.


969-981: LGTM - Virtual flow variables properly categorized.

Both virtual_supply and virtual_demand are correctly marked with VariableCategory.VIRTUAL_FLOW, distinguishing them from physical flow variables.


1047-1052: LGTM - Component-level status variable categorized consistently.

The component's status variable uses the same VariableCategory.STATUS as flow-level status variables, maintaining consistency.

tests/test_clustering_io.py (8)

6-6: LGTM - Required import for new tests.

The xarray import is needed for xr.DataArray in the fixture and xr.testing.assert_equal in assertions.


82-93: LGTM - Stronger type validation for restored clustering.

The explicit isinstance(fs_restored.clustering, Clustering) check ensures the deserialization produces the correct type, not just a truthy value.


124-125: Good defensive testing for serialization edge case.

The check_names=False parameter appropriately handles the potential loss of index name during NetCDF serialization, which is a known xarray/NetCDF behavior.


543-586: Well-designed fixture for multi-dimensional testing.

The system_with_periods_and_scenarios fixture creates intentionally different demand patterns for each period/scenario combination to ensure cluster assignments differ. This is a thoughtful approach to testing multi-dimensional clustering behavior.


588-612: Good coverage of cluster assignment dimensionality.

These tests verify the critical requirement that cluster assignments have the correct dimensions and produce different results for different period/scenario combinations.


614-669: Comprehensive roundtrip and derived property tests.

The tests properly validate that cluster_assignments are exactly preserved after NetCDF roundtrip, and that derived properties like n_clusters, timesteps_per_cluster, and cluster_occurrences work correctly after deserialization.


671-703: Valuable test for apply_clustering with loaded clustering.

This test validates an important use case: loading a saved clustering and applying it to a fresh FlowSystem. The assertions verify both structural equivalence and assignment matching.


705-724: LGTM - End-to-end test for expand after load and optimize.

This test covers the complete workflow: cluster → optimize → save → load → expand, verifying that the solution is properly expanded to original timesteps.

flixopt/modeling.py (1)

9-9: Category propagation looks consistent and useful.
Tagging tracker and duration variables with VariableCategory aligns with Submodel.add_variables and will help downstream filtering.

Also applies to: 266-308, 371-378

flixopt/statistics_accessor.py (1)

34-34: Category-based variable selection is a solid improvement.
This aligns the statistics accessor with the new categorization API and simplifies label handling.

Also applies to: 527-533, 571-573, 580-582, 596-598

docs/notebooks/08d-clustering-multiperiod.ipynb (1)

176-188: Notebook updates align well with the new clustering API.
The ExtremeConfig-based examples and results access paths read clearly.

Also applies to: 241-249, 252-259, 262-281, 288-291, 296-317, 322-364, 368-418, 422-451, 455-475, 480-486, 490-520, 524-582, 593-602

docs/user-guide/optimization/clustering.md (1)

26-37: Docs are consistent and clearer with the new clustering options.
Examples and parameter tables read cleanly.

Also applies to: 54-58, 59-88, 89-105, 107-124, 125-134

tests/test_cluster_reduce_expand.py (5)

65-65: Updated tests correctly target the new clustering public API.
The move to n_clusters, cluster_assignments, and timesteps_per_cluster reads cleanly.

Also applies to: 123-127, 135-135, 294-295, 338-347, 352-353, 452-452, 536-538, 545-545


767-840: Extremes coverage looks solid.
These tests should guard the new ExtremeConfig integration well.


845-957: data_vars parameter tests are thorough and focused.
Nice coverage for validation, preservation, and optimization.


996-1377: Segmentation and IO round-trip coverage is excellent.
The assertions around mapping, duration, and expansion behavior are especially helpful.

Also applies to: 1379-1441


1443-1523: combine_slices tests are clear and useful.
They exercise shape, coords, attrs, and datetime handling well.

tests/test_clustering/test_integration.py (2)

125-213: Good coverage for clustering_data shape and period handling.
These tests exercise the expected Dataset return type and period selection behavior well.


288-335: ClusterConfig/preserve_column_means test updates look solid.
The assertions align with the new API surface and deterministic expectations.

tests/test_clustering/test_base.py (5)

12-62: Helper-function tests look solid.
They validate both occurrences and mapping logic clearly.


63-205: ClusteringResults coverage is thorough.
Good coverage of dims/coords/selectors and DataArray accessors.


207-307: Clustering wrapper tests are consistent with the new API.
Nice coverage of derived properties and repr.


309-390: Multi-dimensional selection tests are well scoped.


392-483: Plot accessor tests are clear and focused.

flixopt/clustering/__init__.py (1)

1-51: Public exports and docstring example match the new API.

benchmarks/benchmark_io_performance.py (2)

19-75: Benchmark fixture setup is clear and representative.


78-106: Benchmark harness is clean and reusable.

flixopt/structure.py (4)

82-143: VariableCategory and expansion groupings are well organized.
Clear taxonomy for segment handling.


197-203: Nice: model-level registry for variable categories.
This cleanly centralizes category metadata for downstream expansion logic.


1118-1129: ds.variables + coord_cache construction looks good.
Nice performance-oriented replacement for repeated DataArray construction.


1740-1767: Category registration on variable creation is clean.
This keeps categorization close to variable definition and avoids later bookkeeping.

flixopt/io.py (7)

564-575: Direct attr serialization loop looks solid.

Single-pass handling for data vars and coords keeps the save path consistent and efficient.


609-641: Constant-dimension reduction looks consistent.

The reduction loop maintains dimensional bookkeeping and preserves attrs on the reconstructed variables.


665-706: Stacking path is clear and preserves grouping semantics.

Grouping by dims and stacking into a dedicated dimension reads correctly.


721-748: Unstacking restores variables efficiently.

The slice-based reconstruction logic is straightforward and aligns with the stack layout.


774-785: Load-time attr restoration mirrors the save path.

Good symmetry between serialization and deserialization of variable attrs.


1527-1563: Restore pipeline is cohesive with coord caching.

The sequence from separation → fast array build → restore hooks is clear and consistent.


1831-1861: Solution namespacing and has_solution flagging are consistent.

The solution prefixing and presence flagging are clear and aligned with the restore path.

flixopt/components.py (2)

942-953: Storage variable categorization is a good addition.

Charge state/netto discharge categories and storage-size categorization align well with the new variable taxonomy.

Also applies to: 979-985


1316-1323: Intercluster categorization is consistent.

Size and SOC boundary variables are now correctly tagged for downstream handling.

Also applies to: 1373-1380

flixopt/features.py (4)

40-84: Size category plumbing is clean.

The new size_category parameter is applied consistently to size and invested variables.


200-258: Status variables are now consistently categorized.

Inactive, startup/shutdown, total, and startup_count tags fit the taxonomy well.


408-428: Piecewise variable categories are wired through cleanly.

Categorizing piece and zero_point variables improves downstream handling without changing behavior.

Also applies to: 519-524


642-706: Share allocation categories correctly distinguish temporal vs. scalar shares.

The SHARE, TOTAL, and PER_TIMESTEP tags align with how these variables are used.

docs/notebooks/08e-clustering-internals.ipynb (2)

14-239: Notebook updates align with the new Clustering API.

The expanded explanations and examples match the consolidated Clustering/ClusteringResults workflow.


341-509: IO workflow and file‑size guidance are clear and helpful.

The preservation/loss matrix and size‑reduction tips should prevent user confusion.


Comment on lines 138 to 146
# Create and prepare FlowSystem
print('\n1. Creating FlowSystem...')
fs = create_large_flow_system(n_timesteps, n_periods, n_components)
print(f' Components: {len(fs.components)}')

print('\n2. Clustering and solving...')
fs_clustered = fs.transform.cluster(n_clusters=n_clusters, cluster_duration='1D')
fs_clustered.optimize(fx.solvers.GurobiSolver())

Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd /tmp && find . -name "benchmark_io_performance.py" 2>/dev/null | head -5

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

git ls-files | grep -E "(benchmark|solver)" | head -20

Repository: flixOpt/flixopt

Length of output: 116


🏁 Script executed:

rg "GurobiSolver" --type py -B 2 -A 2 | head -50

Repository: flixOpt/flixopt

Length of output: 2612


🏁 Script executed:

cat -n benchmarks/benchmark_io_performance.py | sed -n '130,155p'

Repository: flixOpt/flixopt

Length of output: 1248


🏁 Script executed:

cat -n benchmarks/benchmark_io_performance.py | head -50

Repository: flixOpt/flixopt

Length of output: 1739


🏁 Script executed:

cat -n flixopt/solvers.py | head -100

Repository: flixOpt/flixopt

Length of output: 3688


🏁 Script executed:

rg "class.*Solver" flixopt/solvers.py -A 5

Repository: flixOpt/flixopt

Length of output: 596


🏁 Script executed:

rg "import.*Gurobi\|from.*gurobi" benchmarks/benchmark_io_performance.py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "def optimize" flixopt/flow_system.py -A 15

Repository: flixOpt/flixopt

Length of output: 589


🏁 Script executed:

rg "GurobiSolver\|gurobipy" flixopt/ --type py -l

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

cat -n flixopt/flow_system.py | grep -A 20 "def optimize"

Repository: flixOpt/flixopt

Length of output: 926


🏁 Script executed:

rg "class OptimizeAccessor" flixopt/ --type py -A 30

Repository: flixOpt/flixopt

Length of output: 1948


🏁 Script executed:

rg "def __call__" flixopt/ --type py -B 5 -A 20 | grep -A 20 "OptimizeAccessor"

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "gurobipy\|import Gurobi" flixopt/ --type py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

python3 << 'EOF'
# Check what happens when GurobiSolver() is instantiated
# by looking at the solver class definition
import sys
sys.path.insert(0, '/tmp/repo')

try:
    from flixopt.solvers import GurobiSolver
    # Try to create an instance (this will fail if gurobipy is not installed)
    solver = GurobiSolver()
    print("SUCCESS: GurobiSolver instantiated without error")
    print(f"Solver: {solver}")
except ImportError as e:
    print(f"ImportError when creating GurobiSolver: {e}")
except Exception as e:
    print(f"Exception when creating GurobiSolver ({type(e).__name__}): {e}")
EOF

Repository: flixOpt/flixopt

Length of output: 130


🏁 Script executed:

rg "gurobipy\|import.*Gurobi" . --type py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

cat -n flixopt/optimize_accessor.py | head -100

Repository: flixOpt/flixopt

Length of output: 3855


🏁 Script executed:

rg "def __call__" flixopt/optimize_accessor.py -A 50 | head -80

Repository: flixOpt/flixopt

Length of output: 1869


🏁 Script executed:

rg "def solve" flixopt/flow_system.py -A 40 | head -80

Repository: flixOpt/flixopt

Length of output: 1504


🏁 Script executed:

rg "def solve" flixopt/ --type py -l

Repository: flixOpt/flixopt

Length of output: 105


🏁 Script executed:

cat -n flixopt/flow_system.py | grep -B 5 -A 40 "def solve"

Repository: flixOpt/flixopt

Length of output: 1882


🏁 Script executed:

rg "def do_modeling\|model.solve" flixopt/ --type py -B 2 -A 15 | head -120

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "gurobipy\|import.*gurobi" . -i --type py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

cat -n flixopt/optimization.py | head -200

Repository: flixOpt/flixopt

Length of output: 8086


🏁 Script executed:

rg "solver_name.*gurobi\|if.*solver_name" flixopt/ --type py -B 3 -A 5

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

find . -name "structure.py" -o -name "*model*.py" | grep flixopt | head -5

Repository: flixOpt/flixopt

Length of output: 103


🏁 Script executed:

cat -n flixopt/structure.py | grep -A 50 "def solve\|solver_name" | head -100

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

wc -l flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 84


🏁 Script executed:

rg "def solve" flixopt/structure.py -A 50

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "GurobiSolver\|gurobi\|solver_name" flixopt/structure.py -i -B 2 -A 5

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "def solve" flixopt/ --type py -l

Repository: flixOpt/flixopt

Length of output: 105


🏁 Script executed:

rg "solver_name\|GurobiSolver" flixopt/ --type py -B 2 -A 10

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

ls -la flixopt/ | grep -E "\.py$"

Repository: flixOpt/flixopt

Length of output: 1912


🏁 Script executed:

grep -n "def solve" flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

cat -n flixopt/structure.py | sed -n '1,100p'

Repository: flixOpt/flixopt

Length of output: 3920


🏁 Script executed:

rg "class FlowSystemModel" flixopt/ --type py -A 100 | head -150

Repository: flixOpt/flixopt

Length of output: 6912


🏁 Script executed:

rg "\.solve\(" flixopt/structure.py -B 3 -A 10

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

python3 << 'EOF'
# Check linopy's solve method signature
import inspect
try:
    import linopy
    print("Linopy version:", linopy.__version__)
    print("\nLinopy Model.solve signature:")
    print(inspect.signature(linopy.Model.solve))
except Exception as e:
    print(f"Error: {e}")
EOF

Repository: flixOpt/flixopt

Length of output: 90


🏁 Script executed:

cat benchmarks/benchmark_io_performance.py | head -20

Repository: flixOpt/flixopt

Length of output: 488


🏁 Script executed:

cat -n benchmarks/benchmark_io_performance.py | sed -n '1,20p'

Repository: flixOpt/flixopt

Length of output: 628


🏁 Script executed:

python3 << 'EOF'
# Check if GurobiSolver() instantiation would fail without gurobipy
import sys
sys.path.insert(0, '/tmp/repo')

# The GurobiSolver class in solvers.py is just a dataclass
# Let's verify it doesn't import gurobipy at class definition or instantiation
from flixopt.solvers import GurobiSolver

try:
    solver = GurobiSolver()
    print("SUCCESS: GurobiSolver instantiated without importing gurobipy")
    print(f"Solver name: {solver.name}")
    print(f"Solver options: {solver.options}")
except Exception as e:
    print(f"FAILED: {type(e).__name__}: {e}")
EOF

Repository: flixOpt/flixopt

Length of output: 334


🏁 Script executed:

rg "def __call__" flixopt/optimize_accessor.py -A 60 | tail -50

Repository: flixOpt/flixopt

Length of output: 1799


🏁 Script executed:

cat -n flixopt/solvers.py

Repository: flixOpt/flixopt

Length of output: 3688


🏁 Script executed:

grep -r "gurobipy\|gurobi" benchmarks/ --include="*.py"

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "class GurobiSolver" flixopt/solvers.py -A 25

Repository: flixOpt/flixopt

Length of output: 788


🏁 Script executed:

rg "linopy.Model.solve\|xarray" flixopt/structure.py | head -20

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

cat -n flixopt/flow_system.py | sed -n '1402,1433p'

Repository: flixOpt/flixopt

Length of output: 1417


🏁 Script executed:

grep -n "model.solve" flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 89


🏁 Script executed:

cat -n flixopt/flow_system.py | sed -n '1429,1470p'

Repository: flixOpt/flixopt

Length of output: 1999


🏁 Script executed:

rg "solver_name.*gurobi\|HighsSolver\|HiGHS" flixopt/flow_system.py -i -B 2 -A 5

Repository: flixOpt/flixopt

Length of output: 41


Benchmark requires Gurobi; consider guarding the solve call or using a fallback solver.

The benchmark will fail without Gurobi, but not at GurobiSolver() instantiation—the error occurs later in model.solve() when linopy tries to invoke the actual solver. Wrapping the instantiation won't prevent the failure. Instead, either:

  • Guard the fs_clustered.optimize() call with exception handling for the actual solve step, or
  • Use HighsSolver() as a fallback when Gurobi is unavailable, or
  • Document Gurobi as a hard requirement.
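
A minimal sketch combining the first two options, building on the snippet quoted above (HighsSolver is assumed to be available under fx.solvers, as referenced in this review):

try:
    fs_clustered.optimize(fx.solvers.GurobiSolver())
except Exception as exc:  # a missing/unlicensed Gurobi surfaces here, at solve time
    print(f'Gurobi not usable ({exc!r}); falling back to HiGHS')
    fs_clustered.optimize(fx.solvers.HighsSolver())
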
🤖 Prompt for AI Agents
In `@benchmarks/benchmark_io_performance.py` around lines 138 - 146, The benchmark
currently calls fs_clustered.optimize(fx.solvers.GurobiSolver()) which will fail
at solve time if Gurobi is not available; change this to detect/handle solver
availability by wrapping the optimize call in a try/except that catches solver
runtime errors and falls back to fx.solvers.HighsSolver(), or programmatically
choose the solver before calling fs_clustered.optimize by attempting to use
fx.solvers.GurobiSolver() and on failure selecting fx.solvers.HighsSolver();
update the call sites referenced (create_large_flow_system,
fs.transform.cluster, fs_clustered.optimize, fx.solvers.GurobiSolver) so the
benchmark either falls back to HighsSolver or logs a clear error and exits.

Comment on lines 174 to 175
"from tsam.config import ExtremeConfig\n",
"\n",
Contributor


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

tsam python package ExtremeConfig class configuration module

💡 Result:

There is no class named ExtremeConfig in the tsam package. Extreme-period handling is part of the TimeSeriesAggregation module — configure it with the extremePeriodMethod parameter (options: "None", "append", "new_cluster_center", "replace_cluster_center") and the addPeakMin / addPeakMax / addMeanMin / addMeanMax arguments to specify which series produce extreme periods. See the TimeSeriesAggregation docs and source for the implementation. [1][2]

Sources:

  • TimeSeriesAggregation docs (parameters & extremePeriodMethod). [1]
  • timeseriesaggregation source code ( _addExtremePeriods implementation ). [2]

Remove the ExtremeConfig import; it does not exist in the tsam package.

The class ExtremeConfig is not available in tsam.config. Extreme-period handling in tsam is configured via the TimeSeriesAggregation module using the extremePeriodMethod parameter (with options: "None", "append", "new_cluster_center", "replace_cluster_center") and arguments like addPeakMin, addPeakMax, addMeanMin, addMeanMax. The current import and API usage (extremes=ExtremeConfig(...)) will fail at runtime.
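
For reference, a hedged sketch of the TimeSeriesAggregation-style configuration that this query result describes (file name, column name, and period counts are illustrative):

import pandas as pd
import tsam.timeseriesaggregation as tsam

df = pd.read_csv('profiles.csv', index_col=0, parse_dates=True)  # illustrative input profiles

aggregation = tsam.TimeSeriesAggregation(
    timeSeries=df,
    noTypicalPeriods=8,
    hoursPerPeriod=24,
    extremePeriodMethod='new_cluster_center',  # or 'None', 'append', 'replace_cluster_center'
    addPeakMax=['heat_demand'],                # series whose maxima should force extreme periods
)
typical_periods = aggregation.createTypicalPeriods()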

🤖 Prompt for AI Agents
In `@docs/notebooks/08c2-clustering-storage-modes.ipynb` around lines 174 - 175,
Remove the invalid import "from tsam.config import ExtremeConfig" and replace
any usage of ExtremeConfig (e.g., extremes=ExtremeConfig(...)) with the
TimeSeriesAggregation API by setting the extremePeriodMethod parameter (one of
"None", "append", "new_cluster_center", "replace_cluster_center") and the
corresponding flags like addPeakMin, addPeakMax, addMeanMin, addMeanMax on the
TimeSeriesAggregation instance; ensure any code that passed
extremes=ExtremeConfig(...) is updated to configure extreme handling via
TimeSeriesAggregation.extremePeriodMethod and the addPeak*/addMean* arguments
instead.

Comment on lines 283 to 285
"print('\\nCluster assignments per day (period=2024, scenario=High):')\n",
"for i, cluster_id in enumerate(cluster_assignments):\n",
" print(f' {day_names[i]}: Cluster {cluster_id}')\n",
Contributor


⚠️ Potential issue | 🟡 Minor

Avoid hardcoded period/scenario labels in output.
You index by period=0, scenario=0 but print “period=2024, scenario=High”. Consider deriving labels from the coords to avoid mismatches when inputs change.
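
A hedged sketch of deriving the labels, assuming cluster_assignments is an xarray DataArray that still carries scalar period/scenario coords from the earlier selection (if those coords were dropped, they would need to be captured at selection time instead):

period_label = cluster_assignments.coords['period'].item()
scenario_label = cluster_assignments.coords['scenario'].item()
print(f'\nCluster assignments per day (period={period_label}, scenario={scenario_label}):')
for i, cluster_id in enumerate(cluster_assignments.values):
    print(f'  {day_names[i]}: Cluster {int(cluster_id)}')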

🤖 Prompt for AI Agents
In `@docs/notebooks/08d-clustering-multiperiod.ipynb` around lines 283 - 285, The
printed header uses hardcoded "period=2024, scenario=High" which can diverge
from the indices used; retrieve the actual labels from the dataset/coords used
to produce cluster_assignments (e.g., the period and scenario coordinate arrays
referenced when you computed cluster_assignments) and use those values in the
header instead of fixed text, then keep the existing loop over
cluster_assignments and day_names to print each day's cluster; reference the
variables/objects cluster_assignments and day_names and the same period/scenario
coord variables you used earlier to select the slice so the header is always
derived from the data.

Comment on lines 1507 to 1512
# relative_loss_per_hour is per-hour, so we need total hours per cluster
# Use sum over time to handle both regular and segmented systems
# Keep as DataArray to respect per-period/scenario values
rel_loss = _scalar_safe_reduce(self.element.relative_loss_per_hour, 'time', 'mean')
hours_per_cluster = timesteps_per_cluster * _scalar_safe_reduce(self._model.timestep_duration, 'time', 'mean')
hours_per_cluster = _scalar_safe_reduce(self._model.timestep_duration, 'time', 'mean')
decay_n = (1 - rel_loss) ** hours_per_cluster
Contributor


⚠️ Potential issue | 🟠 Major

Fix decay factor to use total hours per cluster.

The decay currently uses the mean timestep duration, which underestimates self‑discharge across a full cluster period. This skews SOC boundary linking. Use a summed duration (or otherwise compute total hours).

🔧 Proposed fix
-        hours_per_cluster = _scalar_safe_reduce(self._model.timestep_duration, 'time', 'mean')
+        hours_per_cluster = _scalar_safe_reduce(self._model.timestep_duration, 'time', 'sum')
🤖 Prompt for AI Agents
In `@flixopt/components.py` around lines 1507 - 1512, The decay calculation uses
mean timestep duration which underestimates total hours per cluster; update the
reduction on self._model.timestep_duration (used when computing decay_n) to
compute the total hours for the cluster (e.g., use
_scalar_safe_reduce(self._model.timestep_duration, 'time', 'sum') or equivalent)
so decay_n = (1 - self.element.relative_loss_per_hour) **
total_hours_per_cluster correctly reflects summed duration while preserving
DataArray semantics and usage of _scalar_safe_reduce and relative_loss_per_hour.

Comment on lines 210 to 218
# Use provided timestep_duration if given (for segmented systems), otherwise use computed value
# For RangeIndex (segmented systems), computed_timestep_duration is None
if timestep_duration is not None:
self.timestep_duration = timestep_duration
elif computed_timestep_duration is not None:
self.timestep_duration = self.fit_to_model_coords('timestep_duration', computed_timestep_duration)
else:
self.timestep_duration = None

Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's locate and read the flow_system.py file to see the actual code
fd flow_system.py

Repository: flixOpt/flixopt

Length of output: 81


🏁 Script executed:

# Read the flow_system.py file around lines 210-218
wc -l flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 86


🏁 Script executed:

# Read lines around 210-218 to see the actual code
sed -n '200,225p' flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 1464


🏁 Script executed:

# Look for the _compute_time_metadata method to understand when computed_timestep_duration is None
sed -n '1,2604p' flixopt/flow_system.py | grep -n "_compute_time_metadata" | head -5

Repository: flixOpt/flixopt

Length of output: 327


🏁 Script executed:

# Read the _compute_time_metadata method
sed -n '433,500p' flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 3265


🏁 Script executed:

# Search for weights and temporal_weight properties/methods to see how they use timestep_duration
rg -n "temporal_weight|def weights" flixopt/flow_system.py | head -20

Repository: flixOpt/flixopt

Length of output: 239


🏁 Script executed:

# Read the temporal_weight property
sed -n '2075,2110p' flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 1451


🏁 Script executed:

# Read the weights property
sed -n '2209,2250p' flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 1931


🏁 Script executed:

# Check if timestep_duration being None causes any issues in weights/temporal_weight
# Let's look at the calculate_timestep_duration method
rg -n "calculate_timestep_duration" flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 180


🏁 Script executed:

# Read the calculate_timestep_duration method
sed -n '361,400p' flixopt/flow_system.py

Repository: flixOpt/flixopt

Length of output: 1832


🏁 Script executed:

# Let's check what happens when timestep_duration is None in the weights dict
# Search for any tests or usage of RangeIndex systems
rg -n "RangeIndex" flixopt/flow_system.py | head -15

Repository: flixOpt/flixopt

Length of output: 1116


🏁 Script executed:

# Check for test files that use RangeIndex to understand expected behavior
fd "test.*flow_system" -o "*test*flow*"

Repository: flixOpt/flixopt

Length of output: 200


🏁 Script executed:

# Search for test files related to flow_system
fd "test.*flow_system" tests/ 2>/dev/null || fd "flow_system" tests/ 2>/dev/null | head -10

Repository: flixOpt/flixopt

Length of output: 173


🏁 Script executed:

# Search for RangeIndex usage in test files
rg -n "RangeIndex" tests/ | head -20

Repository: flixOpt/flixopt

Length of output: 362


🏁 Script executed:

# Read the test file to understand RangeIndex usage
sed -n '1010,1100p' tests/test_cluster_reduce_expand.py

Repository: flixOpt/flixopt

Length of output: 3745


🏁 Script executed:

# Let's understand better - search for where timestep_duration might be set for RangeIndex systems
# Check if there are methods that set it after initialization
rg -n "self.timestep_duration =" flixopt/flow_system.py | head -20

Repository: flixOpt/flixopt

Length of output: 278


Add validation for RangeIndex systems requiring explicit timestep_duration.

When timesteps is a RangeIndex (segmented system) and timestep_duration is not provided, self.timestep_duration becomes None. This breaks downstream uses like weights['time'] and temporal_weight. Add early validation with a clear error message.

🐛 Proposed fix
-        # Use provided timestep_duration if given (for segmented systems), otherwise use computed value
+        # Segmented systems (RangeIndex) must provide timestep_duration explicitly
+        if isinstance(self.timesteps, pd.RangeIndex) and timestep_duration is None:
+            raise ValueError(
+                'timestep_duration must be provided when timesteps is a RangeIndex (segmented system).'
+            )
+        # Use provided timestep_duration if given (for segmented systems), otherwise use computed value
         # For RangeIndex (segmented systems), computed_timestep_duration is None
         if timestep_duration is not None:
             self.timestep_duration = timestep_duration
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Segmented systems (RangeIndex) must provide timestep_duration explicitly
if isinstance(self.timesteps, pd.RangeIndex) and timestep_duration is None:
    raise ValueError(
        'timestep_duration must be provided when timesteps is a RangeIndex (segmented system).'
    )
# Use provided timestep_duration if given (for segmented systems), otherwise use computed value
# For RangeIndex (segmented systems), computed_timestep_duration is None
if timestep_duration is not None:
    self.timestep_duration = timestep_duration
elif computed_timestep_duration is not None:
    self.timestep_duration = self.fit_to_model_coords('timestep_duration', computed_timestep_duration)
else:
    self.timestep_duration = None
🤖 Prompt for AI Agents
In `@flixopt/flow_system.py` around lines 210 - 218, The code currently allows
self.timestep_duration to be None when timesteps is a RangeIndex (segmented
system), which breaks downstream logic (e.g., weights['time'], temporal_weight);
add an early validation in the constructor or initializer where
timestep_duration and computed_timestep_duration are handled: detect the
segmented case (timesteps is a RangeIndex or computed_timestep_duration is None)
and raise a clear ValueError if timestep_duration is not provided, mentioning
that RangeIndex requires an explicit timestep_duration; keep existing logic for
assigning self.timestep_duration (including using
fit_to_model_coords('timestep_duration', computed_timestep_duration) when
available) but enforce this guard before downstream computation.
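As background on why the duration cannot be inferred in this case: a DatetimeIndex carries timestamps whose differences yield durations, whereas a RangeIndex carries only integer positions. A small standalone illustration (plain pandas/numpy, not flixopt code):

import numpy as np
import pandas as pd

dt_index = pd.date_range('2024-01-01', periods=4, freq='h')
print(np.diff(dt_index))      # timedeltas -> durations can be computed from the index

seg_index = pd.RangeIndex(4)  # segmented system: just positions 0..3
print(np.diff(seg_index))     # [1 1 1] -> dimensionless steps, no duration information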

Comment on lines 1587 to 1603
def _fast_get_dataarray(ds: xr.Dataset, name: str, coord_cache: dict[str, xr.DataArray]) -> xr.DataArray:
    """Construct DataArray from Variable without slow coordinate inference.

    This bypasses the slow _construct_dataarray method (~1.5ms -> ~0.1ms per var).

    Args:
        ds: Source dataset
        name: Variable name
        coord_cache: Pre-cached coordinate DataArrays

    Returns:
        Constructed DataArray
    """
    variable = ds.variables[name]
    coords = {k: coord_cache[k] for k in variable.dims if k in coord_cache}
    return xr.DataArray(variable, coords=coords, name=name)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file structure and context around the function
cd flixopt && wc -l io.py && head -50 io.py

Repository: flixOpt/flixopt

Length of output: 1413


🏁 Script executed:

# Find the _fast_get_dataarray function and surrounding context
rg -n "_fast_get_dataarray" flixopt/

Repository: flixOpt/flixopt

Length of output: 312


🏁 Script executed:

# Also search for _construct_dataarray to understand the slow path
rg -n "_construct_dataarray" flixopt/

Repository: flixOpt/flixopt

Length of output: 821


🏁 Script executed:

# Search for test files related to IO operations
fd -e "test_*.py" -o -e "*_test.py" | grep -i io

Repository: flixOpt/flixopt

Length of output: 175


🏁 Script executed:

# Let's read the actual function and surrounding context
sed -n '1530,1620p' flixopt/io.py

Repository: flixOpt/flixopt

Length of output: 3853


🏁 Script executed:

# Search for test files
find . -type f -name "test*.py" | head -20

Repository: flixOpt/flixopt

Length of output: 835


🏁 Script executed:

# Look for where coord_cache is created
rg -n "coord_cache" flixopt/io.py -B 2 -A 2

Repository: flixOpt/flixopt

Length of output: 1043


🏁 Script executed:

# Search for uses of auxiliary coordinates or non-dimension coordinates
rg -i "auxiliary|non.*dimension|coords\[" flixopt/io.py | head -20

Repository: flixOpt/flixopt

Length of output: 439


🏁 Script executed:

# Look for test files that test round-trip behavior
find . -name "*.py" -path "*/test*" -type f | xargs grep -l "from_dataset\|to_dataset" | head -5

Repository: flixOpt/flixopt

Length of output: 229


🏁 Script executed:

# Check if there are any references to xarray's behavior with auxiliary coords
rg "DataArray.*coords" flixopt/io.py | head -10

Repository: flixOpt/flixopt

Length of output: 122


🏁 Script executed:

# Search for how variables are written to understand what coords are preserved
sed -n '1800,1900p' flixopt/io.py

Repository: flixOpt/flixopt

Length of output: 3645


🏁 Script executed:

# Look at the main IO test file to understand what's tested
head -100 tests/test_io.py

Repository: flixOpt/flixopt

Length of output: 2842


🏁 Script executed:

# Search for any auxiliary coordinates being set in the codebase
rg "assign_coords|coords\[" --type py | grep -v test | head -20

Repository: flixOpt/flixopt

Length of output: 2031


🏁 Script executed:

# Check if there's any xarray usage pattern that might add auxiliary coords
rg "xr\.DataArray\|xr\.Dataset" flixopt/io.py -A 1 | head -30

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Look for where datasets are created initially (to_dataset method)
sed -n '1680,1750p' flixopt/io.py

Repository: flixOpt/flixopt

Length of output: 3000


🏁 Script executed:

# Let's examine the test more carefully to see what's being tested
cat tests/test_io.py | head -60

Repository: flixOpt/flixopt

Length of output: 1440


🏁 Script executed:

# Check what flow_system_base fixture includes and if it creates auxiliary coords
grep -n "flow_system_base\|@pytest.fixture" tests/conftest.py | head -30

Repository: flixOpt/flixopt

Length of output: 667


🏁 Script executed:

# Search for where DataArrays are created with auxiliary coordinates in to_dataset
sed -n '1400,1550p' flixopt/io.py | grep -A 5 -B 5 "DataArray\|xr\."

Repository: flixOpt/flixopt

Length of output: 906


🏁 Script executed:

# Check if xarray's construction of DataArray preserves all coords or just dimension coords
python3 << 'EOF'
import xarray as xr
import numpy as np

# Test how xarray handles auxiliary coordinates
ds = xr.Dataset({
    'data': (['x', 'y'], np.random.rand(3, 4))
})

# Add a dimension coordinate
ds = ds.assign_coords({'x': np.arange(3), 'y': np.arange(4)})

# Add an auxiliary coordinate with dims as a subset
ds = ds.assign_coords({'aux': (('x',), np.arange(3))})

print("Original dataset coords:", list(ds.coords))
print("Original dataset coord names and dims:")
for name, coord in ds.coords.items():
    print(f"  {name}: dims={coord.dims}")

# Now simulate what _fast_get_dataarray does
var = ds.variables['data']
print("\nVariable 'data' dims:", var.dims)

# Method 1: Only dimension coords (current implementation)
coords1 = {k: ds.coords[k] for k in var.dims if k in ds.coords}
print("\nCoords selected (current method):", list(coords1.keys()))

# Method 2: All coords (proposed fix)
coords2 = {k: c for k, c in ds.coords.items() if set(c.dims).issubset(var.dims)}
print("Coords selected (proposed method):", list(coords2.keys()))

# Check which coords are lost
lost_coords = set(ds.coords.keys()) - set(coords1.keys())
print("\nLost coords with current method:", lost_coords)
EOF

Repository: flixOpt/flixopt

Length of output: 177


🏁 Script executed:

# Install xarray and test coordinate handling
pip install xarray numpy pandas -q && python3 << 'EOF'
import xarray as xr
import numpy as np

# Test how xarray handles auxiliary coordinates
ds = xr.Dataset({
    'data': (['x', 'y'], np.random.rand(3, 4))
})

# Add dimension coordinates
ds = ds.assign_coords({'x': np.arange(3), 'y': np.arange(4)})

# Add an auxiliary coordinate with dims as a subset
ds = ds.assign_coords({'aux': (('x',), np.arange(3))})

print("Original dataset coords:", list(ds.coords))
print("Original dataset coord names and dims:")
for name, coord in ds.coords.items():
    print(f"  {name}: dims={coord.dims}")

# Now simulate what _fast_get_dataarray does
var = ds.variables['data']
print("\nVariable 'data' dims:", var.dims)

# Method 1: Only dimension coords (current implementation)
coords1 = {k: ds.coords[k] for k in var.dims if k in ds.coords}
print("\nCoords selected (current method):", list(coords1.keys()))

# Method 2: All coords (proposed fix)
coords2 = {k: c for k, c in ds.coords.items() if set(c.dims).issubset(var.dims)}
print("Coords selected (proposed method):", list(coords2.keys()))

# Check which coords are lost
lost_coords = set(ds.coords.keys()) - set(coords1.keys())
print("\nLost coords with current method:", lost_coords)
EOF

Repository: flixOpt/flixopt

Length of output: 613


🏁 Script executed:

# Now let's check if flixopt actually creates auxiliary coordinates anywhere
rg "assign_coords\(" flixopt/ --type py | grep -v test | grep -v "\.pyc"

Repository: flixOpt/flixopt

Length of output: 2851


🏁 Script executed:

# Check the structure.py file where auxiliary coords might be used
rg -n "assign_coords" flixopt/structure.py -B 2 -A 2

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Check if the test actually verifies coordinate preservation
rg -n "test_flow_system_io" tests/test_io.py -A 10

Repository: flixOpt/flixopt

Length of output: 354


🏁 Script executed:

# Look at the FlowSystem equality comparison to understand what's checked
rg -n "__eq__" flixopt/flow_system.py -A 15 | head -50

Repository: flixOpt/flixopt

Length of output: 692


🏁 Script executed:

# Search for where auxiliary coordinates are actually added to datasets in serialization
rg -n "to_dataset" flixopt/flow_system.py -B 2 -A 20 | head -100

Repository: flixOpt/flixopt

Length of output: 4885


🏁 Script executed:

# Check if any of the datasets created by to_dataset() actually have auxiliary coords
sed -n '1400,1650p' flixopt/io.py

Repository: flixOpt/flixopt

Length of output: 9490


🏁 Script executed:

# Let's check if the to_dataset method actually creates auxiliary coordinates
rg -n "assign_coords" flixopt/flow_system.py -B 3 -A 3

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Check what Structure.to_dataset does (parent class)
rg -n "class FlowSystem" flixopt/flow_system.py | head -3

Repository: flixOpt/flixopt

Length of output: 238


🏁 Script executed:

# Find the parent class to_dataset method
rg -n "def to_dataset" flixopt/ --type py | grep -v test | grep -v ".pyc"

Repository: flixOpt/flixopt

Length of output: 297


🏁 Script executed:

# Let's run the actual test to see if it passes
cd /repo && python -m pytest tests/test_io.py::test_flow_system_io -v 2>&1 | head -100

Repository: flixOpt/flixopt

Length of output: 114


🏁 Script executed:

# Check Structure.to_dataset (parent class)
sed -n '1038,1120p' flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 3615


🏁 Script executed:

# Let's verify if datasets created by to_dataset() have auxiliary coords
# by running Python code to test this
python3 << 'EOF'
import sys
sys.path.insert(0, '/repo')

# Try to import and test
try:
    import flixopt as fx
    
    # Create a simple flow system
    fs = fx.FlowSystem(
        timesteps=fx.utils.get_times('2020', '2020', freq='d')[0:5],
    )
    
    # Convert to dataset
    ds = fs.to_dataset()
    
    print("Dataset coordinates:")
    for name, coord in ds.coords.items():
        print(f"  {name}: dims={coord.dims}, shape={coord.shape}")
    
    print("\nDataset variables:")
    for name, var in ds.data_vars.items():
        print(f"  {name}: dims={var.dims}, shape={var.shape}")
    
    # Check for auxiliary coords
    print("\nAuxiliary coordinates (non-dimension coords):")
    aux_coords = [name for name, coord in ds.coords.items() 
                  if name not in ds.dims]
    print(f"  {aux_coords if aux_coords else 'None found'}")
    
except Exception as e:
    print(f"Error: {e}")
    import traceback
    traceback.print_exc()
EOF

Repository: flixOpt/flixopt

Length of output: 390


🏁 Script executed:

# Let's also check the structure.py to_dataset implementation
sed -n '1038,1150p' flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 4616


🏁 Script executed:

# Let's check where auxiliary coords are actually added and if they're part of the serialization flow
rg -n "assign_coords" flixopt/statistics_accessor.py -B 5 -A 5 | head -40

Repository: flixOpt/flixopt

Length of output: 1032


🏁 Script executed:

# Check if these auxiliary coords (timeframe, timestep) are ever serialized to dataset
rg -n "to_dataset\|to_netcdf" flixopt/statistics_accessor.py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Look at the actual line in Structure.from_dataset to see the pattern
sed -n '1080,1100p' flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 810


🏁 Script executed:

# Critical: Check if there's a difference between how to_dataset and from_dataset handle coordinates
# Let's compare them side by side
echo "=== TO_DATASET ===" && sed -n '1038,1055p' flixopt/structure.py && echo -e "\n=== FROM_DATASET ===" && sed -n '1090,1110p' flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 1767


🏁 Script executed:

# Let's directly check if datasets created by to_dataset actually have auxiliary coords
# by looking at what _create_reference_structure does
rg -n "_create_reference_structure" flixopt/structure.py -A 30 | head -80

Repository: flixOpt/flixopt

Length of output: 4114


🏁 Script executed:

# More importantly, let's see if the datasets that go through from_dataset/to_dataset 
# ever have auxiliary coordinates to begin with
rg -n "coord_cache\|assign_coords" flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Critical: Check if from_dataset in Structure uses the same pattern as _fast_get_dataarray
sed -n '1090,1115p' flixopt/structure.py

Repository: flixOpt/flixopt

Length of output: 975


🏁 Script executed:

# Key insight: Let's verify that xr.Dataset(dict_of_dataarrays) loses auxiliary coords
python3 << 'EOF'
import xarray as xr
import numpy as np

# Create DataArrays with auxiliary coordinates
data_da = xr.DataArray(
    np.random.rand(3, 4),
    dims=['x', 'y'],
    coords={'x': np.arange(3), 'y': np.arange(4)},
    name='data'
)

# Add auxiliary coordinate to the DataArray
data_da = data_da.assign_coords({'aux': (('x',), np.arange(3))})

print("DataArray coords before Dataset creation:")
print(f"  {list(data_da.coords.keys())}")

# Now create a Dataset from this DataArray (similar to to_dataset)
ds = xr.Dataset({'data': data_da})

print("\nDataset coords after Dataset creation:")
print(f"  {list(ds.coords.keys())}")

print("\nConclusion: Auxiliary coords from DataArray are lost when creating Dataset")
EOF

Repository: flixOpt/flixopt

Length of output: 256


🏁 Script executed:

# Let's verify the actual flow: to_dataset -> from_dataset
# Check if xr.Dataset constructor drops auxiliary coords
python3 << 'EOF'
import xarray as xr
import numpy as np

# Simulate what to_dataset does
da1 = xr.DataArray(np.random.rand(3, 4), dims=['x', 'y'], name='var1')
da1 = da1.assign_coords({'x': np.arange(3), 'y': np.arange(4)})
da1 = da1.assign_coords({'aux': (('x',), ['a', 'b', 'c'])})

da2 = xr.DataArray(np.random.rand(3), dims=['x'], name='var2')
da2 = da2.assign_coords({'x': np.arange(3)})

# What to_dataset does: creates Dataset from dict
extracted = {'var1': da1, 'var2': da2}
ds = xr.Dataset(extracted)

print("DataArray var1 coords:", list(da1.coords.keys()))
print("DataArray var2 coords:", list(da2.coords.keys()))
print("Dataset coords:", list(ds.coords.keys()))
print("\nAuxiliary 'aux' coord preserved in Dataset?", 'aux' in ds.coords)
EOF

Repository: flixOpt/flixopt

Length of output: 212


🏁 Script executed:

# Let's check if the datasets actually get auxiliary coords added anywhere
# by searching for patterns that would add non-dimension coords to the dataset
rg "assign_coords\|Dataset.*coords" flixopt/io.py | grep -A 2 -B 2 "1[45][0-9]{2}"

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Most importantly, verify that the _restored_ dataset from from_dataset doesn't have auxiliary coords
# Let's check the exact flow in from_dataset
sed -n '1548,1610p' flixopt/io.py

Repository: flixOpt/flixopt

Length of output: 2606


🏁 Script executed:

# Final verification: test the actual serialization round-trip to see if auxiliary coords would be lost
python3 << 'EOF'
import xarray as xr
import numpy as np

# Simulate the to_dataset flow - create a dataset with an auxiliary coordinate
print("=== SIMULATION OF TO_DATASET ===")
# Create a DataArray with both dimension and auxiliary coordinates
da = xr.DataArray(
    np.random.rand(3, 4),
    dims=['x', 'y'],
    coords={'x': np.arange(3), 'y': np.arange(4)},
    name='data'
)
da = da.assign_coords({'aux': (('x',), ['a', 'b', 'c'])})

print(f"Original DataArray coords: {list(da.coords.keys())}")

# Create dataset (like to_dataset does)
ds = xr.Dataset({'data': da})
print(f"Dataset after creation: {list(ds.coords.keys())}")

print("\n=== SIMULATION OF _FAST_GET_DATAARRAY ===")
# Now simulate _fast_get_dataarray
variable = ds.variables['data']
coord_cache = {k: ds.coords[k] for k in ds.coords}

print(f"Variable dims: {variable.dims}")
print(f"Available coords in cache: {list(coord_cache.keys())}")

# Current implementation
coords_current = {k: coord_cache[k] for k in variable.dims if k in coord_cache}
print(f"Coords selected (current method): {list(coords_current.keys())}")

# Proposed implementation
coords_proposed = {k: c for k, c in coord_cache.items() if set(c.dims).issubset(variable.dims)}
print(f"Coords selected (proposed method): {list(coords_proposed.keys())}")

print(f"\nAuxiliary coord 'aux' lost with current method? {('aux' in ds.coords) and ('aux' not in coords_current)}")
EOF

Repository: flixOpt/flixopt

Length of output: 444


Preserve auxiliary coordinates in fast DataArray construction.

_fast_get_dataarray currently only attaches dimension coordinates (those with names matching variable dims). Auxiliary coordinates—those with dimensions that are a subset of the variable's dimensions—are discarded during reconstruction. This will cause data loss if such coordinates are present in serialized datasets.

The proposed fix correctly handles both dimension and auxiliary coordinates:

Recommended fix
-        coords = {k: coord_cache[k] for k in variable.dims if k in coord_cache}
+        coords = {
+            k: c for k, c in coord_cache.items()
+            if set(c.dims).issubset(variable.dims)
+        }

While the round-trip tests currently pass (suggesting auxiliary coordinates aren't presently used in the codebase), this change makes the implementation robust for future use cases and aligns with xarray's coordinate handling semantics.

🤖 Prompt for AI Agents
In `@flixopt/io.py` around lines 1587 - 1603, _fast_get_dataarray currently only
includes coordinates whose names match variable.dims, dropping auxiliary
coordinates whose dims are a subset of the variable's dims; update the coords
build in _fast_get_dataarray to include any coord_cache entry where coord.dims
is a subset of variable.dims (not just keys equal to dims), so auxiliary coords
are preserved when constructing the xr.DataArray (use variable.dims and each
coord.dims to decide inclusion and pass the resulting coords dict to
xr.DataArray).

Comment on lines 1379 to 1385
# Select data for clustering (may be subset if data_vars specified)
ds_slice_for_clustering = (
    ds_for_clustering.sel(**selector, drop=True) if selector else ds_for_clustering
)
temporaly_changing_ds_for_clustering = drop_constant_arrays(ds_slice_for_clustering, dim='time')
df_for_clustering = temporaly_changing_ds_for_clustering.to_dataframe()

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

find flixopt -name "transform_accessor.py" -type f

Repository: flixOpt/flixopt

Length of output: 88


🏁 Script executed:

cat -n flixopt/transform_accessor.py | sed -n '1370,1400p'

Repository: flixOpt/flixopt

Length of output: 2058


🏁 Script executed:

grep -n "def drop_constant_arrays" flixopt/transform_accessor.py

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

rg "def drop_constant_arrays" --type py

Repository: flixOpt/flixopt

Length of output: 100


🏁 Script executed:

rg "def drop_constant_arrays" -A 15 flixopt/core.py

Repository: flixOpt/flixopt

Length of output: 736


🏁 Script executed:

rg "def drop_constant_arrays" -A 30 flixopt/core.py

Repository: flixOpt/flixopt

Length of output: 1269


🏁 Script executed:

rg "def drop_constant_arrays" -A 45 flixopt/core.py

Repository: flixOpt/flixopt

Length of output: 1660


🏁 Script executed:

cat -n flixopt/transform_accessor.py | sed -n '1400,1415p'

Repository: flixOpt/flixopt

Length of output: 968


🏁 Script executed:

rg "tsam.aggregate" -B 5 -A 5 flixopt/transform_accessor.py | head -40

Repository: flixOpt/flixopt

Length of output: 1388


🏁 Script executed:

cat -n flixopt/transform_accessor.py | sed -n '1378,1398p'

Repository: flixOpt/flixopt

Length of output: 1517


🏁 Script executed:

cat -n flixopt/transform_accessor.py | sed -n '1374,1425p'

Repository: flixOpt/flixopt

Length of output: 3386


🏁 Script executed:

cat -n flixopt/transform_accessor.py | sed -n '1350,1375p'

Repository: flixOpt/flixopt

Length of output: 1326


🏁 Script executed:

rg "ds_for_clustering\s*=" flixopt/transform_accessor.py | head -20

Repository: flixOpt/flixopt

Length of output: 258


🏁 Script executed:

cat -n flixopt/transform_accessor.py | sed -n '1320,1365p'

Repository: flixOpt/flixopt

Length of output: 2334


🏁 Script executed:

python3 << 'EOF'
import xarray as xr
import numpy as np

# Test if empty dataset's data_vars attribute can be checked with 'if not'
ds_empty = xr.Dataset()
ds_with_vars = xr.Dataset({'var1': (['time'], [1, 2, 3])})

print(f"Empty dataset data_vars: {ds_empty.data_vars}")
print(f"Empty dataset bool test: {bool(ds_empty.data_vars)}")
print(f"not empty dataset.data_vars: {not ds_empty.data_vars}")

print(f"\nDataset with vars data_vars: {ds_with_vars.data_vars}")
print(f"Dataset with vars bool test: {bool(ds_with_vars.data_vars)}")
print(f"not dataset_with_vars.data_vars: {not ds_with_vars.data_vars}")

# Test to_dataframe on empty dataset
try:
    df_empty = ds_empty.to_dataframe()
    print(f"\nEmpty dataset to_dataframe() result:\n{df_empty}")
    print(f"Shape: {df_empty.shape}")
except Exception as e:
    print(f"Error: {e}")
EOF

Repository: flixOpt/flixopt

Length of output: 380


🏁 Script executed:

python3 << 'EOF'
import pandas as pd
import numpy as np

# Test what happens with tsam.aggregate on empty DataFrame
# First check if tsam is available
try:
    import tsam
    print(f"tsam version: {tsam.__version__}")
    
    # Create an empty dataframe
    df_empty = pd.DataFrame()
    print(f"Empty DataFrame shape: {df_empty.shape}")
    print(f"Empty DataFrame:\n{df_empty}")
    
    # Try to call aggregate (this may fail)
    try:
        result = tsam.aggregate(df_empty, n_clusters=3)
    except Exception as e:
        print(f"\ntsam.aggregate() error with empty DataFrame:")
        print(f"Error type: {type(e).__name__}")
        print(f"Error message: {e}")
except ImportError:
    print("tsam not available in sandbox environment")
EOF

Repository: flixOpt/flixopt

Length of output: 100


Add a guard when clustering input becomes empty.

When drop_constant_arrays() removes all selected variables (especially with data_vars filtering), calling to_dataframe() on an empty Dataset fails with a cryptic indexing error. Add an explicit check with a clear error message.

🐛 Proposed fix
                temporaly_changing_ds_for_clustering = drop_constant_arrays(ds_slice_for_clustering, dim='time')
+                if not temporaly_changing_ds_for_clustering.data_vars:
+                    raise ValueError(
+                        'No time-varying data left for clustering after drop_constant_arrays(). '
+                        'Check data_vars or inspect transform.clustering_data().'
+                    )
                df_for_clustering = temporaly_changing_ds_for_clustering.to_dataframe()
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Select data for clustering (may be subset if data_vars specified)
ds_slice_for_clustering = (
    ds_for_clustering.sel(**selector, drop=True) if selector else ds_for_clustering
)
temporaly_changing_ds_for_clustering = drop_constant_arrays(ds_slice_for_clustering, dim='time')
if not temporaly_changing_ds_for_clustering.data_vars:
    raise ValueError(
        'No time-varying data left for clustering after drop_constant_arrays(). '
        'Check data_vars or inspect transform.clustering_data().'
    )
df_for_clustering = temporaly_changing_ds_for_clustering.to_dataframe()
🤖 Prompt for AI Agents
In `@flixopt/transform_accessor.py` around lines 1379 - 1385, Add a guard to
detect when drop_constant_arrays removes all variables before calling
to_dataframe(): after computing temporaly_changing_ds_for_clustering =
drop_constant_arrays(ds_slice_for_clustering, dim='time') check whether
temporaly_changing_ds_for_clustering is empty (e.g., has no data_vars or no
variables) and raise a clear ValueError mentioning the input filtering
(data_vars/selector) and that there is no varying data to cluster, instead of
calling temporaly_changing_ds_for_clustering.to_dataframe() and allowing a
cryptic indexing error; update the code paths that assign df_for_clustering
accordingly so downstream logic only runs when df_for_clustering is valid.
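A minimal, self-contained sketch of the failure mode and the proposed guard, using plain xarray rather than the flixopt helpers (drop_constant_arrays is emulated by hand here):

import numpy as np
import xarray as xr

ds = xr.Dataset({'price': ('time', np.full(8, 42.0))})  # every variable is constant over time

# Emulate drop_constant_arrays: keep only variables that vary along 'time'
constant = [name for name, var in ds.data_vars.items() if np.all(var.values == var.values.flat[0])]
varying = ds.drop_vars(constant)

# Guard before handing the data to the clustering backend
if not varying.data_vars:
    raise ValueError(
        'No time-varying data left for clustering after dropping constant arrays. '
        'Check the selected data_vars.'
    )
df = varying.to_dataframe()  # only reached when there is something to cluster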

Comment on lines 67 to 68
"tsam @ git+https://github.com/FBumann/tsam.git@v3-rebased", # Time series aggregation (unreleased)
"scipy >= 1.15.1, < 2", # Used by tsam. Prior versions have conflict with highspy. See https://github.com/scipy/scipy/issues/22257

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check if v3-rebased branch exists and get its current HEAD commit
curl -s "https://api.github.com/repos/FBumann/tsam/branches/v3-rebased" | jq '{branch: .name, sha: .commit.sha}' 2>/dev/null || echo "Branch check failed or jq not available"

# Also test basic GitHub API access
curl -s -I "https://api.github.com/repos/FBumann/tsam" | head -5

Repository: flixOpt/flixopt

Length of output: 266


🌐 Web query:

Can PyPI packages depend on git URLs in pyproject.toml

💡 Result:

Short answer: No — a package published on PyPI must not rely on git/VCS URLs in its declared install-time dependencies because pip will refuse to install non-PyPI URL dependencies transitively when installing a package from PyPI. Build-system (build-time) requirements in pyproject.toml can be URL/VCS specifiers, but runtime dependencies that point at git URLs are not supported for packages distributed via PyPI.

Relevant sources:

  • pip docs: pyproject.toml build-system.requires may contain URL specifiers, but runtime dependency resolution for packages from PyPI blocks arbitrary URL dependencies. [1]
  • Historical/issue discussion explaining pip blocks PEP 508 URL requirements for packages installed from PyPI. [2]
  • Ongoing PEP work (PEP 725 / DepURLs) proposes better ways to reference external dependencies in pyproject.toml, but it is a separate proposal and not a current guarantee that PyPI/pip will accept git URL runtime dependencies. [3]
  • Community Q&A and examples showing the same pip behavior and common workarounds (vendor, publish to PyPI, or install from a VCS directly). [4]

If you need a workable solution: publish the dependency to PyPI, vendor or bundle it, instruct users to install from a VCS editable/direct URL themselves (i.e., not via PyPI), or follow future PEPs (like PEP 725) if/when tooling implements them.


🌐 Web query:

pip install git URL dependency from pyproject.toml production

💡 Result:

Yes — pyproject.toml (PEP 621) supports VCS/direct-URL dependencies and pip can install them. Short recipe and production recommendations:

  1. Add a VCS direct reference in pyproject.toml (PEP 621 / setuptools syntax):

    • Example (pin to tag/commit):
      [project]
      dependencies = [
      "somepkg @ git+https://github.com/owner/somepkg.git@v1.2.3",
      "otherpkg @ git+ssh://[email protected]:owner/otherpkg.git@abcdef123456",
      ]
      (pip/PEP 508 direct URL form). (setuptools.pypa.io)
  2. Install with pip

  3. Production best practices

    • Pin VCS deps to an exact commit or annotated tag (avoid floating branch names) to ensure reproducible installs. (setuptools.pypa.io)
    • For private repos, use deploy-friendly auth (CI/CD deploy tokens, SSH keys, or tokenized HTTPS URLs) rather than interactive credentials. Test auth in your CI environment. (pythontutorials.net)
    • Consider converting pyproject deps into a requirements.txt for pip-based deployment (tools like pyp2req or generating a pip constraints file), so your deployment step can be pip install -r requirements.txt --no-deps / --constraint ... if you need deterministic installs. (pypi.org)

If you want, tell me:

  • whether you use setuptools/poetry/pdm, and
  • whether the VCS repo is public or private — I’ll give the exact pyproject snippet and CI-ready install command.

🏁 Script executed:

#!/bin/bash
# Check if this project is configured for PyPI release
cat pyproject.toml | head -30

# Check for PyPI upload workflows
find . -name "*.yml" -o -name "*.yaml" | xargs grep -l "publish\|pypi\|release" 2>/dev/null | head -5

Repository: flixOpt/flixopt

Length of output: 1378


🏁 Script executed:

#!/bin/bash
# Check git log to see context around the tsam dependency change
git log --oneline -20 -- pyproject.toml | head -10

# Check if v3-rebased ref is mentioned elsewhere
rg "v3-rebased" --type toml --type py --type txt

Repository: flixOpt/flixopt

Length of output: 329


🏁 Script executed:

#!/bin/bash
# Check the release workflow to see if it uploads to PyPI
cat .github/workflows/release.yaml | head -60

# Also check for any PyPI token secrets
grep -r "pypi" .github/workflows/ | head -10

Repository: flixOpt/flixopt

Length of output: 2194


🏁 Script executed:

#!/bin/bash
# Check pyproject.toml for extras documentation and other dependencies
sed -n '50,150p' pyproject.toml

Repository: flixOpt/flixopt

Length of output: 3450


This configuration will fail PyPI publication.

Using a git URL for tsam (line 67 in [full] extras and line 86 in [dev] extras) is incompatible with PyPI distribution. pip explicitly blocks git URL dependencies for packages installed from PyPI — the verify-pypi job in your release workflow will fail when attempting to install from https://pypi.org/simple/.

Pinning to a commit hash does not resolve this. Before release, you must either:

  1. Wait for tsam v3 to be published to PyPI and use a version specifier (tsam >= 3.0.0, < 4)
  2. Remove tsam from the [full] and [dev] extras and document that users requiring tsam must install from source
  3. Vendor or bundle tsam as part of flixopt

Verify the release workflow will pass with the current dependencies, or update the workflow to skip PyPI publication if this is intentional for development-only use.

🤖 Prompt for AI Agents
In `@pyproject.toml` around lines 67 - 68, The pyproject.toml currently lists a
git URL dependency "tsam @ git+https://github.com/FBumann/tsam.git@v3-rebased"
in the [full] and [dev] extras which will break PyPI publication; remove that
git-based requirement from both the [full] and [dev] extras and either (a)
replace it with a proper PyPI version specifier (e.g., "tsam >= 3.0.0, < 4")
once tsam is published, (b) omit tsam from those extras and add a note in your
README/INSTALL docs instructing users to install tsam from source, or (c)
vendor/include tsam in the package; finally, if this is intended only for
development, update the release workflow (the verify-pypi job) to skip PyPI
verification or otherwise allow dev-only dependencies so the release will pass.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@flixopt/clustering/base.py`:
- Around line 619-633: The zip call that builds the selector silently truncates
mismatches (zip(self._dim_names, key, strict=False)); change it to use
strict=True so any length mismatch between self._dim_names and key raises
immediately, e.g., replace the zip invocation used when constructing selector in
the loop over self._results.items() (where selector = dict(zip(self._dim_names,
key, ...))) with zip(..., strict=True) to fail fast on dimension mismatches and
ensure selectors are built only when dimensions align.
🧹 Nitpick comments (5)
flixopt/clustering/base.py (5)

578-591: Type conversion in deserialization only handles int.

The _str_to_key method only attempts to convert parts to int. If period values are floats (e.g., 2024.5) or other numeric types, they would be stored as strings after round-trip serialization.

Consider extending the conversion logic:

🔧 Suggested enhancement
     `@staticmethod`
     def _str_to_key(key_str: str, dim_names: list[str]) -> tuple:
         """Convert string back to key tuple."""
         if key_str == '__single__':
             return ()
         parts = key_str.split('|')
-        # Try to convert to int if possible (for period years)
+        # Try to convert to numeric types if possible (for period years/floats)
         result = []
         for part in parts:
             try:
                 result.append(int(part))
             except ValueError:
-                result.append(part)
+                try:
+                    result.append(float(part))
+                except ValueError:
+                    result.append(part)
         return tuple(result)

1206-1206: Default empty DatetimeIndex may cause issues downstream.

When original_timesteps is None, it defaults to an empty pd.DatetimeIndex([]). Methods like expand_data and build_expansion_divisor default to using self.original_timesteps, which could lead to confusing behavior with zero-length expansions.

Consider validating that original_timesteps is not empty when it's actually needed, or document this edge case more explicitly.


1242-1254: Property raises ValueError without checking all paths.

The results property raises ValueError if neither _results_cache nor _aggregation_results is available. However, the __init__ method allows creating a Clustering with results=None and _aggregation_results=None, which would always fail when accessing results.

Consider adding validation in __init__ to ensure at least one of these is provided, or document this as an invalid state.


1425-1434: Variable shadowing: variables parameter is overwritten.

The local assignment variables = ds.variables at line 1426 shadows the function parameter variables from line 1369. While this doesn't cause a bug (the parameter was already consumed at line 1410), it's confusing and could lead to issues if the code is refactored.

🔧 Suggested fix: rename local variable
         if kind == 'duration_curve':
             sorted_vars = {}
-            # Use variables for faster access (avoids _construct_dataarray overhead)
-            variables = ds.variables
+            # Use ds_variables for faster access (avoids _construct_dataarray overhead)
+            ds_variables = ds.variables
             rep_values = ds.coords['representation'].values
             rep_idx = {rep: i for i, rep in enumerate(rep_values)}
             for var in ds.data_vars:
-                data = variables[var].values
+                data = ds_variables[var].values
                 for rep in rep_values:
                     # Direct numpy indexing instead of .sel()
                     values = np.sort(data[rep_idx[rep]].flatten())[::-1]

1665-1669: Consider direct numpy indexing for cluster labels.

The list comprehension at line 1668 calls .sel(cluster=c).values for each cluster, creating intermediate DataArrays. For consistency with the PR's optimization pattern, consider using direct indexing:

🔧 Suggested optimization
         if occ_extra_dims:
             cluster_labels = [f'Cluster {c}' for c in range(n_clusters)]
         else:
+            occ_values = cluster_occurrences.values
             cluster_labels = [
-                f'Cluster {c} (×{int(cluster_occurrences.sel(cluster=c).values)})' for c in range(n_clusters)
+                f'Cluster {c} (×{int(occ_values[c])})' for c in range(n_clusters)
             ]

Comment on lines 619 to 633
for key, cr in self._results.items():
    # Build selector for this key
    selector = dict(zip(self._dim_names, key, strict=False))

    # Select the slice for this (period, scenario)
    data_slice = data.sel(**selector, drop=True) if selector else data

    # Drop constant arrays and convert to DataFrame
    time_varying = drop_constant_arrays(data_slice, dim='time')
    df = time_varying.to_dataframe()

    # Apply clustering
    results[key] = cr.apply(df)

return Clustering._from_aggregation_results(results, self._dim_names)

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

# First, let's look at the file structure and find the relevant class/method
cd flixopt/clustering && wc -l base.py

Repository: flixOpt/flixopt

Length of output: 71


🏁 Script executed:

# Read the context around line 619-633 to understand the method
sed -n '600,650p' flixopt/clustering/base.py

Repository: flixOpt/flixopt

Length of output: 2146


🏁 Script executed:

# Find where self._dim_names is defined/set
rg -n "_dim_names" flixopt/clustering/base.py

Repository: flixOpt/flixopt

Length of output: 948


🏁 Script executed:

# Find where self._results is populated
rg -n "_results" flixopt/clustering/base.py -A 2

Repository: flixOpt/flixopt

Length of output: 5616


🏁 Script executed:

# Search for zip() usage with strict parameter across the codebase
rg -n "zip.*strict" flixopt/

Repository: flixOpt/flixopt

Length of output: 2078


🏁 Script executed:

# Find where AggregationResults is instantiated (look for __init__ and calls to it)
sed -n '200,230p' flixopt/clustering/base.py

Repository: flixOpt/flixopt

Length of output: 1155


🏁 Script executed:

# Find where AggregationResults is created/instantiated from
rg -n "AggregationResults(" flixopt/ -A 3 | head -80

Repository: flixOpt/flixopt

Length of output: 147


🏁 Script executed:

# Look at the _from_aggregation_results classmethod that's used at the end of apply()
sed -n '1257,1290p' flixopt/clustering/base.py

Repository: flixOpt/flixopt

Length of output: 1474


🏁 Script executed:

# Check how keys are structured in _results - look for patterns where results dict is built
rg -n "self._results\[" flixopt/clustering/base.py | head -20

Repository: flixOpt/flixopt

Length of output: 314


🏁 Script executed:

# Look at how _results dict is populated - find code that builds it
rg -n "results\[.*\].*=" flixopt/clustering/base.py | head -20

Repository: flixOpt/flixopt

Length of output: 285


🏁 Script executed:

# Search for places where AggregationResults is instantiated to see the pattern of key/dim_names
rg -n "ClusteringResults\(" flixopt/ -B 3 -A 3 | head -100

Repository: flixOpt/flixopt

Length of output: 2835


🏁 Script executed:

# Check the overall structure by looking at how keys are constructed
sed -n '535,575p' flixopt/clustering/base.py

Repository: flixOpt/flixopt

Length of output: 1831


🏁 Script executed:

# Look at where the apply method is called from to understand what _results contains
rg -n "\.apply\(" flixopt/clustering/base.py -B 2 -A 2

Repository: flixOpt/flixopt

Length of output: 874


🏁 Script executed:

# Look at git history to understand when strict=False was introduced
git log --oneline -p flixopt/clustering/base.py | grep -A 10 -B 5 "strict=False" | head -50

Repository: flixOpt/flixopt

Length of output: 690


🏁 Script executed:

# Check if there are any tests or validation that verify key/dim_names matching
rg -n "dim_names" flixopt/clustering/base.py | grep -i "test\|assert\|valid\|check"

Repository: flixOpt/flixopt

Length of output: 41


🏁 Script executed:

# Look more carefully at how keys are created in _results dict
# Find where clustering is applied with period/scenario dimensions
sed -n '600,640p' flixopt/clustering/base.py

Repository: flixOpt/flixopt

Length of output: 1655


🏁 Script executed:

# Check if there's any code that constructs mismatched keys
rg -n "self._results\[" flixopt/clustering/base.py -B 2 -A 2

Repository: flixOpt/flixopt

Length of output: 1023


Use strict=True in zip() to catch dimension mismatches.

The zip(self._dim_names, key, strict=False) at line 621 will silently truncate if key and dim_names have mismatched lengths. Since key tuples should always match the structure of dim_names (both sourced from the same ClusteringResults initialization), using strict=True would catch any unexpected inconsistencies and fail fast rather than silently producing incomplete selectors. The empty case (key=(), dim_names=[]) works fine with strict=True since both are empty.

🤖 Prompt for AI Agents
In `@flixopt/clustering/base.py` around lines 619 - 633, The zip call that builds
the selector silently truncates mismatches (zip(self._dim_names, key,
strict=False)); change it to use strict=True so any length mismatch between
self._dim_names and key raises immediately, e.g., replace the zip invocation
used when constructing selector in the loop over self._results.items() (where
selector = dict(zip(self._dim_names, key, ...))) with zip(..., strict=True) to
fail fast on dimension mismatches and ensure selectors are built only when
dimensions align.

FBumann and others added 2 commits January 16, 2026 22:12
_add_clustering_to_dataset was slow due to 210 individual
ds[name] = arr assignments. Each triggers xarray's
expensive dataset_update_method.

Changed to batch assignment with ds.assign(dict):
- Before: ~2600ms for to_dataset with clustering
- After: ~109ms for to_dataset with clustering

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Avoided _construct_dataarray overhead by:
- Using ds.variables instead of ds.data_vars.items()
- Using numpy slicing instead of .isel()
- Passing attrs dict directly instead of DataArray

cluster() benchmark:
- Before: ~10.1s
- After: ~8.9s

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
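A rough, self-contained sketch of the access and assignment patterns these commits describe, on a synthetic dataset (variable names and sizes are illustrative only):

import numpy as np
import xarray as xr

ds = xr.Dataset({f'var_{i}': ('time', np.random.rand(1000)) for i in range(50)})
new_arrays = {f'extra_{i}': ('time', np.random.rand(1000)) for i in range(50)}

# Slow pattern: one full dataset update per assignment
slow = ds.copy()
for name, arr in new_arrays.items():
    slow[name] = arr

# Faster pattern: a single batch update
fast = ds.assign(new_arrays)

# Reading raw values via ds.variables skips per-variable DataArray construction
totals = {name: float(var.values.sum()) for name, var in ds.variables.items() if name in ds.data_vars}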
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@flixopt/transform_accessor.py`:
- Around line 293-313: The current shortcut returns a 1-D xr.Dataset when there
are multiple (period, scenario) slices but only one non-empty metrics DataFrame,
dropping period/scenario dims; change the condition so the single-slice fast
path is used only when there is truly one overall slice (i.e.,
len(clustering_metrics_all) == 1 and len(non_empty_metrics) == 1). If there are
multiple (period, scenario) keys but only one non-empty metrics_df, do not
return the 1-D Dataset from the block using first_key/metrics_df; instead let
execution fall through to the multi-dim assembly code (which should construct
the full Dataset with NaN-filled slices) so dimensions referenced by
Clustering._dim_names (periods/scenarios) are preserved.
♻️ Duplicate comments (2)
flixopt/transform_accessor.py (2)

1392-1397: Guard against empty clustering input after drop_constant_arrays().

Same issue as previously noted: if the selected variables become empty after drop_constant_arrays, to_dataframe() will fail with a cryptic error. Please add an explicit guard and a clear error message before converting to DataFrame.


1915-1923: Segmented SOC interpolation still uses wrong time index.

timestep_mapping % n_segments collapses to segment indices, not timestep positions within a cluster, which yields incorrect interpolation. Use modulo timesteps_per_cluster for time_indices while keeping cluster decoding as‑is.

🛠️ Proposed fix
-        if clustering.is_segmented and clustering.n_segments is not None:
-            time_dim_size = clustering.n_segments
-        else:
-            time_dim_size = clustering.timesteps_per_cluster
-        cluster_indices = timestep_mapping // time_dim_size
-        time_indices = timestep_mapping % time_dim_size
+        if clustering.is_segmented and clustering.n_segments is not None:
+            time_dim_size = clustering.n_segments
+        else:
+            time_dim_size = clustering.timesteps_per_cluster
+        cluster_indices = timestep_mapping // time_dim_size
+        # Use original timestep position within each cluster for segment lookup
+        timesteps_per_cluster = clustering.timesteps_per_cluster
+        time_positions = xr.DataArray(
+            np.arange(timestep_mapping.sizes['original_time']),
+            dims=['original_time'],
+            coords={'original_time': timestep_mapping.coords['original_time']},
+        ).broadcast_like(timestep_mapping)
+        time_indices = time_positions % timesteps_per_cluster

Comment on lines 293 to 313
non_empty_metrics = {k: v for k, v in clustering_metrics_all.items() if not v.empty}

if not non_empty_metrics:
    return xr.Dataset()

first_key = (periods[0], scenarios[0])

if len(non_empty_metrics) == 1 or len(clustering_metrics_all) == 1:
    metrics_df = non_empty_metrics.get(first_key)
    if metrics_df is None:
        metrics_df = next(iter(non_empty_metrics.values()))
    return xr.Dataset(
        {
            col: xr.DataArray(
                metrics_df[col].values,
                dims=['time_series'],
                coords={'time_series': metrics_df.index},
            )
            for col in metrics_df.columns
        }
    )

⚠️ Potential issue | 🟡 Minor

Preserve period/scenario dims when only one metrics slice is non‑empty.

If there are multiple (period, scenario) slices but only one non‑empty metrics DataFrame, the shortcut returns a 1‑D Dataset and drops period/scenario dimensions, which can misalign metrics with Clustering._dim_names. Consider only using the shortcut when there is truly a single slice, and otherwise fall through to the multi‑dim path with NaN‑filled slices.

🛠️ Proposed fix
-        if len(non_empty_metrics) == 1 or len(clustering_metrics_all) == 1:
+        if len(clustering_metrics_all) == 1:
             metrics_df = non_empty_metrics.get(first_key)
             if metrics_df is None:
                 metrics_df = next(iter(non_empty_metrics.values()))
             return xr.Dataset(
🤖 Prompt for AI Agents
In `@flixopt/transform_accessor.py` around lines 293 - 313, The current shortcut
returns a 1-D xr.Dataset when there are multiple (period, scenario) slices but
only one non-empty metrics DataFrame, dropping period/scenario dims; change the
condition so the single-slice fast path is used only when there is truly one
overall slice (i.e., len(clustering_metrics_all) == 1 and len(non_empty_metrics)
== 1). If there are multiple (period, scenario) keys but only one non-empty
metrics_df, do not return the 1-D Dataset from the block using
first_key/metrics_df; instead let execution fall through to the multi-dim
assembly code (which should construct the full Dataset with NaN-filled slices)
so dimensions referenced by Clustering._dim_names (periods/scenarios) are
preserved.

Eliminated 451,856 slow pandas .loc calls by using numpy reshape
for segmented clustering data instead of iterating per-cluster.

cluster() with segments benchmark (50 clusters, 4 segments):
- Before: ~93.7s
- After: ~21.1s
- Speedup: 4.4x

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@flixopt/transform_accessor.py`:
- Around line 533-565: The current branch treats any variable with
partially-missing typical slices as fully constant; change the logic in the loop
(around variables, ds_new_vars, typical_das, all_keys) so that only when name
not in typical_das you take the constant/no-time path; if name exists in
typical_das but its keys != all_keys, build the time-dependent DataArray by
iterating all_keys and for each missing (period,scenario) key fill that slice
with the appropriate constant aggregate (e.g., broadcast the available constant
array or a selected default from typical_das[name]), then concatenate/stack
these filled slices into the full (cluster, time, ...) shape and assign into
ds_new_vars with dims ['cluster','time',...] and coords new_coords; ensure you
use the same coord_cache, cluster_coords and time_coords when constructing the
filled/combined DataArray.
♻️ Duplicate comments (3)
flixopt/transform_accessor.py (3)

298-318: Preserve period/scenario dims in the metrics shortcut.
The len(non_empty_metrics) == 1 or len(clustering_metrics_all) == 1 fast path can drop period/scenario dims when multiple slices exist but only one is non-empty (Line 305). This produces a 1‑D Dataset and misaligns downstream dimension expectations. Consider requiring a true single-slice case only.

♻️ Proposed fix
-        if len(non_empty_metrics) == 1 or len(clustering_metrics_all) == 1:
+        if len(clustering_metrics_all) == 1 and len(non_empty_metrics) == 1:
             metrics_df = non_empty_metrics.get(first_key)
             if metrics_df is None:
                 metrics_df = next(iter(non_empty_metrics.values()))
             return xr.Dataset(

1397-1403: Guard against empty clustering input before to_dataframe().
After drop_constant_arrays, the dataset can be empty when all selected variables are constant (Line 1401–1402). to_dataframe() then fails with a cryptic indexing error. Add an explicit guard with a clear message.

🐛 Proposed fix
                 ds_slice_for_clustering = (
                     ds_for_clustering.sel(**selector, drop=True) if selector else ds_for_clustering
                 )
                 temporaly_changing_ds_for_clustering = drop_constant_arrays(ds_slice_for_clustering, dim='time')
+                if not temporaly_changing_ds_for_clustering.data_vars:
+                    raise ValueError(
+                        'No time-varying data left for clustering after drop_constant_arrays(). '
+                        'Check data_vars or inspect transform.clustering_data().'
+                    )
                 df_for_clustering = temporaly_changing_ds_for_clustering.to_dataframe()

1920-1932: Segmented charge_state interpolation uses the wrong time index basis.
For segmented clustering, timestep_mapping % n_segments gives the segment index, but segment_assignments/position_within_segment are keyed by timestep position within the cluster (Line 1926–1932). This mis-indexes segment assignments and produces incorrect interpolation. Use timestep positions mod timesteps_per_cluster for the time index while keeping cluster indices derived from timestep_mapping.

🐛 Proposed fix
         if clustering.is_segmented and clustering.n_segments is not None:
             time_dim_size = clustering.n_segments
         else:
             time_dim_size = clustering.timesteps_per_cluster
         cluster_indices = timestep_mapping // time_dim_size
-        time_indices = timestep_mapping % time_dim_size
+        # Use original timestep position within each cluster for segment lookup
+        timesteps_per_cluster = clustering.timesteps_per_cluster
+        time_positions = xr.DataArray(
+            np.arange(timestep_mapping.sizes['original_time']),
+            dims=['original_time'],
+            coords={'original_time': timestep_mapping.coords['original_time']},
+        ).broadcast_like(timestep_mapping)
+        time_indices = time_positions % timesteps_per_cluster
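
A small numeric sketch of the two index bases (made-up mapping and toy sizes, not the library's actual arrays): the segment index comes from the mapping, while segment assignments are keyed by the timestep's position within its cluster:

import numpy as np

timesteps_per_cluster = 4  # original timesteps per cluster
n_segments = 2             # segments per cluster after segmentation

# Hypothetical mapping of 8 original timesteps onto the segmented timeline:
# cluster 0 covers segment indices 0-1, cluster 1 covers segment indices 2-3.
timestep_mapping = np.array([0, 0, 1, 1, 2, 2, 3, 3])

cluster_indices = timestep_mapping // n_segments   # [0 0 0 0 1 1 1 1]
segment_indices = timestep_mapping % n_segments    # [0 0 1 1 0 0 1 1]

# Position of each original timestep within its cluster - the index that
# segment_assignments / position_within_segment are keyed by.
time_positions = np.arange(timestep_mapping.size) % timesteps_per_cluster
print(time_positions)                              # [0 1 2 3 0 1 2 3]
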

Comment on lines 533 to 567
        # Use ds.variables to avoid _construct_dataarray overhead
        variables = ds.variables
        coord_cache = {k: ds.coords[k].values for k in ds.coords}

        for name in ds.data_vars:
            var = variables[name]
            if 'time' not in var.dims:
                # No time dimension - wrap Variable in DataArray
                coords = {d: coord_cache[d] for d in var.dims if d in coord_cache}
                ds_new_vars[name] = xr.DataArray(var.values, dims=var.dims, coords=coords, attrs=var.attrs, name=name)
            elif name not in typical_das or set(typical_das[name].keys()) != all_keys:
                # Time-dependent but constant: reshape to (cluster, time, ...)
                # Use numpy slicing instead of .isel()
                time_idx = var.dims.index('time')
                slices = [slice(None)] * len(var.dims)
                slices[time_idx] = slice(0, n_reduced_timesteps)
                sliced_values = var.values[tuple(slices)]

                other_dims = [d for d in var.dims if d != 'time']
                other_shape = [var.sizes[d] for d in other_dims]
                new_shape = [actual_n_clusters, n_time_points] + other_shape
                reshaped = sliced_values.reshape(new_shape)
                new_coords = {'cluster': cluster_coords, 'time': time_coords}
                for dim in other_dims:
                    if dim in coord_cache:
                        new_coords[dim] = coord_cache[dim]
                ds_new_vars[name] = xr.DataArray(
                    reshaped,
                    dims=['cluster', 'time'] + other_dims,
                    coords=new_coords,
                    attrs=var.attrs,
                )
            else:

⚠️ Potential issue | 🟠 Major

Handle partially-missing typical slices without discarding valid aggregates.
The condition name not in typical_das or set(typical_das[name].keys()) != all_keys treats any variable missing one (period, scenario) slice as fully constant (Line 543). That discards valid aggregated data for slices where the variable does vary and reshapes raw data instead. This can corrupt reduced outputs for mixed-variability cases. Only use the constant path when the variable is absent altogether, and fill missing slices with constant values before combining.

🛠️ Suggested approach (fill missing slices)
-            elif name not in typical_das or set(typical_das[name].keys()) != all_keys:
+            elif name not in typical_das:
                 # Time-dependent but constant: reshape to (cluster, time, ...)
                 # Use numpy slicing instead of .isel()
                 time_idx = var.dims.index('time')
                 slices = [slice(None)] * len(var.dims)
                 slices[time_idx] = slice(0, n_reduced_timesteps)
                 sliced_values = var.values[tuple(slices)]
                 ...
                 ds_new_vars[name] = xr.DataArray(
                     reshaped,
                     dims=['cluster', 'time'] + other_dims,
                     coords=new_coords,
                     attrs=var.attrs,
                 )
             else:
+                # Fill missing (period, scenario) slices with constant values
+                slices = dict(typical_das[name])
+                missing = all_keys - set(slices.keys())
+                if missing:
+                    # TODO: extract constant-slice construction into a helper reused above
+                    for key in missing:
+                        slices[key] = _build_constant_slice_for_key(...)
+                da = self._combine_slices_to_dataarray_2d(
+                    slices=slices,
+                    attrs=var.attrs,
+                    periods=periods,
+                    scenarios=scenarios,
+                )
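
A rough standalone sketch of the fill-before-combine idea (toy shapes and values; _build_constant_slice_for_key in the diff above is only a placeholder, and this is not the library's actual code path):

import numpy as np

n_clusters, n_time_points = 2, 3
all_keys = {('2020', 'base'), ('2030', 'base')}

# Aggregated (cluster, time) slices exist only for some (period, scenario) keys
slices = {('2020', 'base'): np.arange(6, dtype=float).reshape(n_clusters, n_time_points)}

for key in all_keys - set(slices):
    # Fill the missing slice by broadcasting its constant value to the reduced
    # shape instead of treating the entire variable as constant.
    constant_value = 5.0  # in practice, taken from the raw data for that slice
    slices[key] = np.full((n_clusters, n_time_points), constant_value)

# Every key is present now, so the slices can be stacked/combined as usual
stacked = np.stack([slices[key] for key in sorted(all_keys)])
print(stacked.shape)  # (2, 2, 3)
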
🤖 Prompt for AI Agents
In `@flixopt/transform_accessor.py` around lines 533-565: The current branch
treats any variable with partially missing typical slices as fully constant.
Change the logic in the loop (around variables, ds_new_vars, typical_das,
all_keys) so that the constant/no-time path is taken only when name is not in
typical_das at all. If name exists in typical_das but its keys differ from
all_keys, build the time-dependent DataArray by iterating all_keys and, for each
missing (period, scenario) key, fill that slice with the appropriate constant
aggregate (e.g., broadcast the available constant array or a selected default
from typical_das[name]). Then concatenate/stack the filled slices into the full
(cluster, time, ...) shape and assign the result into ds_new_vars with dims
['cluster', 'time', ...] and coords new_coords, using the same coord_cache,
cluster_coords and time_coords when constructing the filled/combined DataArray.

@FBumann FBumann changed the title from perf: Use ds.variables to avoid _construct_dataarray overhead to perf: Optimize clustering and I/O (4.4x faster segmented clustering) Jan 16, 2026
@FBumann FBumann changed the base branch from main to feature/tsam-v3 January 16, 2026 21:46
@FBumann FBumann changed the base branch from feature/tsam-v3 to feature/tsam-v3+rework January 16, 2026 21:47
FBumann and others added 2 commits January 16, 2026 23:06
- benchmark_io_performance.py: Add Gurobi → HiGHS solver fallback
- components.py: Fix storage decay to use sum (not mean) for hours per cluster
- flow_system.py: Add RangeIndex validation requiring explicit timestep_duration
- io.py: Include auxiliary coordinates in _fast_get_dataarray
- transform_accessor.py: Add empty dataset guard after drop_constant_arrays
- transform_accessor.py: Fix timestep_mapping indexing for segmented clustering

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace data_vars.items() iteration with ds.variables pattern to avoid
slow _construct_dataarray calls (5502 calls × ~1.5ms each).

Before: 3.73s
After:  1.72s

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
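
As an illustration of the access pattern this commit describes, a minimal sketch with a toy dataset (invented variable names; actual savings depend on dataset size):

import numpy as np
import xarray as xr

ds = xr.Dataset(
    {f'var_{i}': ('time', np.random.rand(100)) for i in range(50)},
    coords={'time': np.arange(100)},
)

# Slower pattern: data_vars.items() constructs a full DataArray per variable
sums_slow = {name: float(da.values.sum()) for name, da in ds.data_vars.items()}

# Faster pattern: ds.variables returns lightweight Variable objects, so the
# per-variable DataArray construction (coords, indexes) is skipped
variables = ds.variables
sums_fast = {name: float(variables[name].values.sum()) for name in ds.data_vars}

assert sums_slow == sums_fast
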
@FBumann FBumann merged commit 7a4280d into feature/tsam-v3+rework Jan 17, 2026
8 checks passed