Benchmark selection rework + time/memory CLI unification by FBumann · Pull Request #31 · fluxopt/linopy

FBumann · 2026-06-06T14:39:07Z

TODO (human): one or two sentences on why.

Note

Generated by AI (Claude).

What

Unifies benchmark size/severity selection and the time-vs-memory CLI split.

Selection

Replace the quick_threshold size cap with an explicit per-spec quick_subset (first/middle/last of each spec's sweep). Fixes two latent bugs the threshold caused: models ran only their smallest size under --quick, and patterns silently skipped severity=100. Both now run a representative 3 points; patterns get all of (0, 50, 100).
Add manual --size / --severity (repeatable) overriding the tiers per axis.
Widen DEFAULT_SEVERITIES to (0, 25, 50, 75, 100); --quick still distills to (0, 50, 100), so CodSpeed isn't rebaselined.
One shared registry.skip_reason() is the single source of truth for selection, used by both pytest (conftest.maybe_skip) and the memory engine — guarded by the existing id-alignment test.

CLI: metric as a flag

run / sweep take --metric {time,memory,both}; compare / plot auto-detect memory vs timing snapshots. Retires the memory sub-app (memory save→run --metric memory, memory sweep→sweep --metric memory, memory compare→compare).
--quick/--long/--size/--severity now apply uniformly to both metrics.

Verify

ruff + mypy clean; harness unit tests pass (incl. memory↔pytest id alignment).
Walkthrough (CI-executed) re-runs clean; README + walkthrough updated.

…is selection Replace the quick_threshold size cap with quick_subset — the first/middle/last of each spec's sweep — so --quick runs three representative points instead of a threshold side-effect. This fixes two latent bugs the threshold caused: - models ran only their smallest size under --quick (most quick_threshold=10), so the per-PR memory signal was dominated by sub-MiB benchmarks; - patterns silently skipped severity=100 (quick_threshold=50), so the densest, peak-memory regime never ran in CI. Both now follow one rule (first/mid/last); patterns get all of (0, 50, 100). pypsa_scigrid opts out via quick_sizes=(). Add manual --size / --severity pytest options (repeatable) that override the tier flags per axis, so CI can pin exact values. Widen DEFAULT_SEVERITIES to (0, 25, 50, 75, 100) for finer default/--long/sweep resolution; --quick still distills to (0, 50, 100), so CodSpeed isn't rebaselined. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tect Make the metric (time vs memory) a flag on the workflow rather than a separate command tree — the first half of unifying the time/memory CLI split. - Add a Measure enum (time|memory|both) and `--metric` to `sweep`, dispatching to the timing or memory engine (or both, sequentially — memray overhead would skew wall-clock if run concurrently). `both` writes per-metric subdirs so the two linopy-<version>.json sets don't collide. Default snapshot dirs are now .benchmarks/{time,memory}/ (timing default moved from .benchmarks/sweep). - `compare` auto-detects memory snapshots (peak_mib key) and diffs them with a peak-RSS table; timing still goes through pytest-benchmark. Mixing errors out. Adds memory.compare_snapshots (path-based; compare() now delegates to it). The `memory` sub-app still works; Stage 2 retires it and adds `run --metric` + --long/--size/--severity parity on the memory engine. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ting Finish folding the time/memory split into a single command surface. Memory is now a --metric flag everywhere, not a parallel `memory` sub-app. - `run --metric {time,memory,both}` replaces `memory save`: prints results, saves only with --json (label = filename stem) — one rule for both metrics. Adds --quick/--size/--severity/--repeats so flags match across metrics. - Delete the `memory` sub-app (cli/memory.py, memory_app); `memory sweep` → `sweep --metric memory`, `memory compare` → `compare` (auto-detect, Stage 1). - Unify size/severity selection in registry.skip_reason(), shared by conftest.maybe_skip and memory.run_phase so the two engines can't drift. Threads --long/--size/--severity through the memory worker; the id-alignment test still passes. - Split memory.save() into measure() (returns the dict) + save() (writes), so `run --metric memory` can measure-then-maybe-write. - run_memory_sweep now invokes `run --metric memory --json <abs>` and writes straight to output_dir (no save-to-default-then-move dance); gains --long. - Update the walkthrough (CI-executed, re-runs clean) and benchmarks/README. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

compare crashed with a raw JSONDecodeError traceback (and plot printed a cryptic "Expecting value:" with no filename) when handed an empty or malformed snapshot. load_snapshot now raises a clear, file-named ValueError on unreadable JSON or an unrecognized shape; compare routes its metric detection through load_snapshot too, so both commands report e.g. /tmp/x.json: not a readable JSON snapshot (Expecting value: line 1 ...) The mixed memory/timing guard (compare's any/all, plot's _check_same_unit) was already correct and is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

codspeed-hq · 2026-06-06T15:04:43Z

Merging this PR will not alter performance

✅ 79 untouched benchmarks
🆕 98 new benchmarks
⏩ 495 skipped benchmarks¹

Performance Changes

	Mode	Benchmark	`BASE`	`HEAD`	Efficiency
🆕	Memory	`test_to_lp[kvl_cycles-severity=100]`	N/A	38.5 MB	N/A
🆕	Memory	`test_to_solver[highs-knapsack-n=10000]`	N/A	2.5 MB	N/A
🆕	Memory	`test_to_solver[gurobi-knapsack-n=10000]`	N/A	2.8 MB	N/A
🆕	Memory	`test_to_solver[highs-piecewise-n=5000]`	N/A	20.6 MB	N/A
🆕	Memory	`test_to_solver[gurobi-sos-n=1000]`	N/A	3.1 MB	N/A
🆕	Memory	`test_to_solver[highs-expression_arithmetic-n=250]`	N/A	34.9 MB	N/A
🆕	Memory	`test_to_solver[gurobi-piecewise-n=1000]`	N/A	5.1 MB	N/A
🆕	Memory	`test_to_solver[gurobi-qp-n=1000]`	N/A	831.1 KB	N/A
🆕	Memory	`test_to_solver[gurobi-qp-n=20000]`	N/A	11.2 MB	N/A
🆕	Memory	`test_to_solver[gurobi-sos-n=10000]`	N/A	29.2 MB	N/A
🆕	Memory	`test_build[masked-n=1000]`	N/A	70.7 MB	N/A
🆕	Memory	`test_to_solver[gurobi-kvl_cycles-severity=100]`	N/A	198.8 MB	N/A
🆕	Memory	`test_to_lp[nodal_balance-severity=100]`	N/A	17.9 MB	N/A
🆕	Memory	`test_to_lp[merge_balance-severity=100]`	N/A	17.6 MB	N/A
🆕	Memory	`test_to_solver[highs-storage-n=1000]`	N/A	150 MB	N/A
🆕	Memory	`test_build[milp-n=50]`	N/A	283.7 KB	N/A
🆕	Memory	`test_to_solver[highs-kvl_cycles-severity=100]`	N/A	217 MB	N/A
🆕	Memory	`test_build[masked-n=100]`	N/A	735.4 KB	N/A
🆕	Memory	`test_to_solver[highs-sparse_network-n=250]`	N/A	64.8 MB	N/A
🆕	Memory	`test_to_solver[gurobi-basic-n=1600]`	N/A	2.8 GB	N/A
...	...	...	...	...	...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.

_{Comparing benchmark/quick-subset (60fa372) with master (3fdb9b4)}

495 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports. ↩

FBumann and others added 4 commits June 6, 2026 13:18

FBumann merged commit 0dc9c0a into master Jun 6, 2026
17 of 18 checks passed

FBumann mentioned this pull request Jun 6, 2026

refactor(benchmarks): explicit QUICK_SIZES/LONG_SIZES per spec (+ fix 14-min memory CI) #33

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmark selection rework + time/memory CLI unification#31

Benchmark selection rework + time/memory CLI unification#31
FBumann merged 4 commits into
masterfrom
benchmark/quick-subset

FBumann commented Jun 6, 2026

Uh oh!

Uh oh!

codspeed-hq Bot commented Jun 6, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

FBumann commented Jun 6, 2026

What

Selection

CLI: metric as a flag

Verify

Uh oh!

Uh oh!

codspeed-hq Bot commented Jun 6, 2026

Merging this PR will not alter performance

Performance Changes

Footnotes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant