Split local benchmark analytics into a standalone generic package (benchkit)

_TODO (human): why we're doing this, in your words._

> [!NOTE]
> Drafted by AI (Claude) from our design discussion; decisions are the maintainers'.

## Goal

Split the **local-dev analytics** (plotting, cross-version sweep, the memray engine, snapshot/compare, the CLI) out of linopy into a **standalone, reusable** benchmarking package (`benchkit`, name TBD), so that:

- **linopy stays lean** — only the benchmark *content* (specs, phases, pytest drivers, conftest) + the CodSpeed workflows. (Maintainer ask: "ship CI first, CLI later".)
- the **local CLI survives** as `pip install benchkit` rather than being dropped.
- the tool is **reusable on other repos** (cross-repo interest from review).

## Status — foundation done (#34, draft)

Two structural changes have already landed on the foundation branch, both with **no behaviour change**:
- **Decoupled the CI baseline from analytics** — the CodSpeed baseline imports *zero* of `{memory, sweep, plotting, snapshot, bench, cli}`; the dependency arrow only points analytics → core.
- **One `measurements()` contract** — phases own `phase_cases(phase) -> Iterable[PhaseCase]` (each a setup→action→teardown context manager); the pytest drivers and the memray engine consume the same source. Node ids byte-identical; the id-alignment band-aid test is gone.
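
The first invariant above (the baseline imports none of the analytics modules) could be guarded by a small static check. This is a minimal sketch, not something #34 is known to contain: `forbidden_imports` is a hypothetical helper, and it assumes the analytics modules are importable under the bare names listed in the bullet.

```python
import ast

# Analytics module names from the bullet above; the baseline must import none of them.
ANALYTICS = {"memory", "sweep", "plotting", "snapshot", "bench", "cli"}


def forbidden_imports(source: str, banned: set[str] = ANALYTICS) -> set[str]:
    """Return the banned module names that `source` imports (hypothetical helper)."""
    found: set[str] = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            # `from pkg import sweep` names the module in `names`, not in `module`,
            # so both are checked. `module` is None for `from . import x`.
            names = [node.module or ""] + [alias.name for alias in node.names]
        else:
            continue
        for name in names:
            # Match on any dotted component so `benchmarks.sweep` is caught too.
            found |= set(name.split(".")) & banned
    return found
```

The check is a heuristic: it only sees direct imports in one file, and it would also flag an unrelated function that happens to be called `cli`. Transitive imports would need a runtime check instead.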

## Target shape — two repos, one CLI

```
benchkit/                 (new repo — generic, no linopy import)
  core:   Case, measure_peak, run_case, snapshot, plot, compare, sweep, cli   ── knows only Cases
  suite:  Subject(build=…), operation(node=…), iter_params, skip_reason, tiers ── optional convention → Cases

linopy/benchmarks/        (stays in linopy — the "content", thin)
  models/ patterns/   Subject(build=…) via benchkit.suite          ← linopy-specific
  phases.py           @operation(...) wrappers around to_file/…     ← linopy verbs
  test_*.py + conftest pytest drivers (parametrize benchkit cases)  ← stays for CodSpeed
  __main__.py         wires linopy's registry into benchkit.cli
  depends on benchkit
```

`python -m benchmarks …` stays the command; linopy's `__main__` registers its content into benchkit, then hands off to benchkit's CLI. Same UX, zero tooling code in linopy.
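
A rough sketch of that hand-off, under loud assumptions: benchkit does not exist yet, so `cli_main` below is a stand-in for whatever entry point `benchkit.cli` ends up exposing, and the `list` subcommand, `linopy_cases`, and the example id are all hypothetical. Cases are reduced to plain id strings to keep the wiring visible.

```python
import argparse
from collections.abc import Callable, Iterable, Sequence


# Stand-in for a hypothetical `benchkit.cli.main`; the real signature is undecided.
def cli_main(cases: Callable[[], Iterable[str]], argv: Sequence[str]) -> int:
    parser = argparse.ArgumentParser(prog="python -m benchmarks")
    sub = parser.add_subparsers(dest="command", required=True)
    sub.add_parser("list", help="print the registered case ids")
    args = parser.parse_args(argv)
    if args.command == "list":
        for case_id in cases():
            print(case_id)
    return 0


# What linopy/benchmarks/__main__.py could shrink to: supply content, delegate the rest.
def linopy_cases() -> Iterable[str]:
    # Placeholder id; the real ones would come from models/, patterns/ and phases.py.
    yield "example-spec-n=10"


def main(argv: Sequence[str]) -> int:
    return cli_main(linopy_cases, argv)


# In __main__.py this would end with: raise SystemExit(main(sys.argv[1:]))
```

The point of the shape is that linopy passes a callable rather than being imported by the tool, which keeps the dependency arrow pointing from content to benchkit only.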

## Primitives (the design question)

Primitives live in different layers: `measure` and `analyze` are generic and belong in benchkit, while `generate` (producing the cases) is the plugin each repo supplies.

- **The atom is `Case`, not `phase`.** `phase` bundles three linopy-specific things (an operation, a pytest node prefix, applicability) and assumes a "pipeline of phases" many repos don't have. Keep it in linopy's layer.

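  ```python
  from collections.abc import Callable, Mapping
  from contextlib import AbstractContextManager
  from dataclasses import dataclass

  @dataclass
  class Case:
      id: str                                              # stable key (== CodSpeed node id)
      dims: Mapping[str, str | int]                        # structured axes for analysis
      run: Callable[[], AbstractContextManager["Action"]]  # setup (untimed) → measured action → teardown
  ```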

- **`dims` is the primitive that makes analysis generic.** Today structure is smuggled into the id and recovered by regex (`<spec>-<axis>=<value>`). Promote it to an explicit field: plots facet/scale by *any* dim, no id-parsing; the id stays opaque (CodSpeed's key), `dims` is separate metadata in the local snapshot.

- **Required plugin contract is just `cases() -> Iterable[Case]`.**

- **`subject` / `operation` / `scale` are convention, not core** — the "subjects × operations × scale-dial" shape (build-once helper, registry, quick/full tiers) is useful but linopy-shaped, so it lives in `benchkit.suite` (opt-in), not core.

### Injection seams (already isolated by #34)
1. **Phases register into benchkit** (`@operation(node=…)`) instead of the engine importing linopy's `phases.py`.
2. **The sweep's install spec is a parameter** (`install_spec=lambda v: f"linopy=={v}"`) instead of a literal.

Everything else (`snapshot`, `plotting`, `compare`, `measure_peak`, CLI bodies, the spec dataclasses) is already domain-agnostic — pure move.
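
The two seams might look roughly like this. It is a sketch under assumptions, not the code in #34: the registry layout, `registered_operations`, `install_commands`, the use of `pip`, the node prefix, and the version strings are all placeholders.

```python
from collections.abc import Callable, Iterable

# Seam 1: the content fills a registry via a decorator, so the engine
# never imports linopy's phases.py.
_OPERATIONS: dict[str, Callable[..., object]] = {}


def operation(node: str) -> Callable[[Callable[..., object]], Callable[..., object]]:
    """Register a function under a pytest node prefix (hypothetical signature)."""

    def register(fn: Callable[..., object]) -> Callable[..., object]:
        if node in _OPERATIONS:
            raise ValueError(f"duplicate operation node: {node!r}")
        _OPERATIONS[node] = fn
        return fn

    return register


def registered_operations() -> dict[str, Callable[..., object]]:
    """Return a copy so callers cannot mutate the registry."""
    return dict(_OPERATIONS)


# Seam 2: the sweep receives the install spec as a parameter instead of a literal.
def install_commands(
    versions: Iterable[str], install_spec: Callable[[str], str]
) -> list[list[str]]:
    """Build one install command per version; running them is left to the sweep."""
    return [["python", "-m", "pip", "install", install_spec(v)] for v in versions]


# linopy side: register a verb under a placeholder node prefix.
@operation(node="to_file")
def to_file(model: object) -> None:
    """Stand-in for linopy's wrapper around `to_file`; does nothing here."""
```

A linopy sweep would then call `install_commands(versions, install_spec=lambda v: f"linopy=={v}")`, while another repo passes its own lambda and benchkit stays unaware of either package name.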

## CodSpeed unaffected

`test_*.py` + `conftest` stay in `linopy/benchmarks/`, so a linopy PR still runs `pytest benchmarks/ --quick --codspeed` against the dev linopy — ids and baselines unchanged. benchkit is a dev dependency.

## Open questions

1. `Case{id, dims, run}` as the atom — promote `dims` to first-class instead of id-parsing? (lean: yes)
2. Core-knows-only-Cases with subject/operation as opt-in **sugar**, or bake the subject/operation registry **into core**?
3. Selection (quick/full tiers): core (predicate over a numeric dim) or suite concern? (lean: suite)
4. Name: `benchkit` / `linopy-bench` / `xbench` / …?
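
For question 3, this is what the "core" option could amount to, shown only to make the trade-off visible (the current lean is the suite option). `at_most` and `select` are hypothetical names, and keeping cases that lack the dim is an assumption, not a decided behaviour.

```python
from collections.abc import Callable, Iterable, Mapping

Dims = Mapping[str, object]


def at_most(dim: str, limit: int) -> Callable[[Dims], bool]:
    """Predicate for a 'quick' tier: keep cases whose numeric dim is <= limit."""

    def keep(dims: Dims) -> bool:
        value = dims.get(dim)
        # Assumption: cases without the dim, or with a non-numeric value,
        # are kept rather than silently dropped.
        return not isinstance(value, int) or value <= limit

    return keep


def select(all_dims: Iterable[Dims], predicate: Callable[[Dims], bool]) -> list[Dims]:
    """Filter on dims alone; no id parsing and no subject/operation registry needed."""
    return [dims for dims in all_dims if predicate(dims)]
```

If selection needs no more than this, it depends only on `dims` and could live in core cheaply; if tiers need per-subject knowledge, that argues for keeping them in the suite layer.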

## Implementation sequence

1. Land #34 (foundation).
2. Scaffold `benchkit` locally (sibling package); move the domain-agnostic files verbatim.
3. Add the two seams (`operation` registration; `install_spec`).
4. Repoint linopy's `benchmarks/` at it; prove `python -m benchmarks` + the node-id diff still hold.
5. Publish the repo; trim #771 to the CI baseline (deletion last — reversible).
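
The node-id check in step 4 can be as simple as diffing two `pytest --collect-only -q` captures, one taken before the repoint and one after. A minimal sketch, assuming the captures are saved as text; `parse_collected` and `node_id_diff` are hypothetical helpers and the ids in the usage note are made up.

```python
from collections.abc import Iterable


def parse_collected(output: str) -> list[str]:
    """Extract node ids from `pytest --collect-only -q` output.

    Node ids contain '::'; the blank line and the trailing summary do not.
    """
    return [line.strip() for line in output.splitlines() if "::" in line]


def node_id_diff(before: Iterable[str], after: Iterable[str]) -> tuple[list[str], list[str]]:
    """Return (removed, added) node ids; both empty means the ids still hold."""
    before_set, after_set = set(before), set(after)
    return sorted(before_set - after_set), sorted(after_set - before_set)
```

Usage would be along the lines of `removed, added = node_id_diff(parse_collected(old_text), parse_collected(new_text))`, failing the step if either list is non-empty.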
