Skip to content

Investigation: long-format (Polars) kernel behind linopy's xarray-style API #24

@FBumann

Description

@FBumann

Trigger

Upstream PyPSA#740 compares linopy against polar-high, a Polars-backed LP/MIP eDSL, and reports 2.3–3.5× faster builds with ~30 % less peak memory. That prompted a deeper question than the benchmark itself: is the xarray-based kernel the right foundation for linopy going forward?

This issue documents the investigation: what polar-high actually does, where linopy's costs really are, what a tabular kernel would mean, and what it would depend on. It is a research/tracking issue on my fork — nothing here is a decision.

What polar-high actually does

Everything is a long-format (tidy) Polars frame, sparse from the first moment:

Object Schema Evaluation
Var (*dims, col_id) — one row per LP column eager, built once
Param (*dims, value) lazy
expression term (*dims, col_id, coef) lazy

Arithmetic never materialises anything: Param * Var is a lazy join, Sum(expr, over=dims) is a lazy group_by, expr1 + expr2 is list concatenation of terms. Matrix assembly happens inside solve(): per-constraint-family COO triples → global dedup group_by → lexsort → CSC → highspy.passModel() (or streaming addRows per family, which bounds peak memory to one family + the growing HiGHS model).

Where the speed comes from — and what it is not:

  1. It is not Polars multi-threading — polar-high forces POLARS_MAX_THREADS=1 at import; their own benchmark shows single-thread is faster and leaner for this workload.
  2. Sparse-first representation. There is never a dense (dims × _term) array.
  3. No LP text file. Direct in-process matrix assembly into highspy. The benchmark in Benchmark comparison vs polar-high (Polars-backed LP/MIP eDSL on HiGHS) — thoughts? PyPSA/linopy#740 forces linopy through io_api="lp" (multi-GB text written, then re-parsed by HiGHS) and never tests linopy's io_api="direct".
  4. Lazy plans with bounded intermediates ("prune-down" joins keep every intermediate at O(constraint rows)).

Two claims in PyPSA#740 need correction: linopy's LP writer is already Polars-based since 0.7.0 (lp and lp-polars are the same code path), and the time gap is substantially an IO-path artifact, not purely architectural.

The real question: rectangular-dense vs long-format-sparse

"xarray vs Polars" is the wrong framing. The defining property of linopy's kernel is that an expression is a rectangular array: coeffs/vars with shape (*coord_dims, _term). Rectangularity is what xarray needs, and it is what costs us. The question is whether that shape is right — the container choice follows from the answer.

Where dense wins — homogeneous expressions. 2*x + y over (time=8760, bus=500): every cell genuinely has 2 terms, zero padding. Dense: ~140 MB with coords stored once. Long format: ~420 MB (coordinates repeated per row). Dense regular grids — the PyPSA core case — are ~3× more compact in xarray.

Where dense loses — incidence/balance constraints. flow.groupby(dst).sum() goes through unstack(group_dim, fill_value=...) in LinearExpressionGroupby.sum(): every group is padded to the size of the largest group. A network where the median node has 3 lines but one hub has 60:

  • dense: (10 000 nodes × 168 h × 60 terms) ≈ 1.6 GB, ~95 % padding
  • long format: one row per real (edge, time) entry ≈ 270 MB

The nodal balance is the central constraint of every energy model, and this is why the gap in polar-high's network benchmark (3.2×) is larger than in their dense one. It is real, not a methodology artifact.

Where dense also loses — masking. x.where(mask) keeps the full rectangular shape with -1 labels; a constraint over 5 % of components costs the same as over 100 %.

The two coherent architectures

A hybrid — xarray objects with Polars bolted on top/underneath mid-pipeline — was considered and rejected: the dense→long conversion doesn't disappear, it moves into the hot arithmetic path and gets paid per operation, and the library ends up with two mental models. The coherent end states are:

A — xarray through and through. Keep the kernel; do the fixes that don't require re-architecting: make io_api="direct" the default for HiGHS/Gurobi, fix the groupby padding (an implementation choice, not forced by the representation), accept the representation's limits.

B — tabular through and through, xarray only at the edges. All model objects (Variable, LinearExpression, Constraint) are Polars frames from birth. xarray appears in exactly two places: user input (bounds/coefficients/coords arrive as DataArrays → converted once at add_variables/add_constraints) and solution output (returned as DataArrays). The user-facing API vocabulary — named dims, coords, .sel(), broadcasting arithmetic, x - y >= rhs — is kept; the objects backing it are not xarray.

Is B feasible? Measured, not guessed

Inventory of every linopy API call in the PyPSA ecosystem (PyPSA master, PyPSA-Eur master, linopy's own test suite):

PyPSA core PyPSA-Eur
call sites ~500 ~150
trivial (arithmetic, add_*, .solution, .dual) 11 patterns 8 patterns
mechanical (sel/loc/groupby/merge/shift/rolling/where/reindex) 15 patterns 6 patterns
reaching into xarray internals 2 patterns (3 sites) 0

The three internal-reaching call sites in the entire ecosystem (all in PyPSA core, none in PyPSA-Eur):

  • (lhs.vars == -1).all("_term")pypsa/optimization/constraints.py:1166 (emptiness check; needs a public .empty)
  • soc.data.sel(snapshot=(p, sl)).roll(snapshot=1)constraints.py:1711 and :1895 (multi-index roll for storage cyclicity; needs a public multi-index-aware .roll())

No dask, no netcdf, no subclassing, no .coeffs/.nterm access anywhere in either consumer.

linopy's own test suite (~900–1000 tests): ~55–60 % behavioral spec (must pass with any kernel), ~25–30 % representation-coupled (assert on _term/Dataset structure; rewritten), ~15–20 % xarray-integration (stack/unstack/ffill forwarding, netcdf, dask — defines what is lost or needs a redesign decision).

The genuinely hard residue is small and concentrated:

  1. DataArray/Series-operand arithmetic semantics — what alignment means when users write xr.DataArray(p_max_pu) * dispatch. This is the silent-failure risk: different alignment = different coefficients, no error.
  2. Multi-index time structures — rolling/shifting over (period, timestep) investment-period indexes.
  3. The test-suite migration volume (~350–400 tests) and a kernel-agnostic assert_linequal.

The v1 convention is the prerequisite

Hard-spot #1 above is exactly what the v1 semantic convention (PyPSA#717) specifies. The convention's object-scope rule makes operand type irrelevant (pd.Series * xxr.DataArray * x ≡ constant-expression), §8 replaces implicit xarray alignment with exact-match-or-raise, §6/§13 define absence propagation and reductions. The unlabeled-operand pairing rule is specified but not yet implemented (PyPSA#736).

The dependency direction matters: a tabular kernel depends on v1, not the other way around.

  • Without v1, a tabular kernel must bug-for-bug replicate legacy's implicit xarray behaviors (positional fallbacks, silent inner joins, per-operator NaN fills). That is close to impossible and the failure mode is silent wrong coefficients.
  • With v1, the semantics are explicit and join-native. They map onto Polars primitives almost 1:1 — and in several places Polars provides natively what v1 must build and police on top of numpy/xarray:
v1 rule On numpy/xarray (today) On Polars
§1–§2 absence is first-class NaN in float fields + -1 sentinels in int fields, kept in sync by convention null for every dtype (Arrow validity bitmap) — one mechanism, engine-enforced
§5 user NaN ≠ absence, raises needed because NaN is overloaded NaN and null are different things; the ambiguity doesn't exist
§6 absence propagates implemented per operator null propagation is the arithmetic default
§13 reductions skip absent, sum of none = 0 skipna handling the default behavior of sum()/mean()
§8 exact-match alignment custom check join + verify-no-unmatched (~small wrapper)
§10 explicit join= custom the how= argument of join()

In short: v1 specifies relational missing-data semantics on top of an array stack that doesn't have them; Polars is built on those semantics. That is both why a tabular kernel would be faster and why it would be safer — fewer conventions to police, more invariants the storage physically cannot break.

One real design gap: v1's absence is per-slot, and in long format an absent slot is most naturally "no row at all" — which means the kernel must track the coordinate universe (index sets) separately from the data rows, e.g. for fillna() reviving slots and for constraint row counting. polar-high solves this by having Var frames born with their full index. Solved problem, but a decision to make, not a freebie.

Sequencing

Independent of any kernel decision (worth doing now):

The prerequisite chain for any kernel work:

Then, and only then, the kernel question becomes decidable:

References

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions