Skip to content

Benchmark selection rework + time/memory CLI unification#31

Merged
FBumann merged 4 commits into
masterfrom
benchmark/quick-subset
Jun 6, 2026
Merged

Benchmark selection rework + time/memory CLI unification#31
FBumann merged 4 commits into
masterfrom
benchmark/quick-subset

Conversation

@FBumann

@FBumann FBumann commented Jun 6, 2026

Copy link
Copy Markdown
Collaborator

TODO (human): one or two sentences on why.

Note

Generated by AI (Claude).

What

Unifies benchmark size/severity selection and the time-vs-memory CLI split.

Selection

  • Replace the quick_threshold size cap with an explicit per-spec quick_subset (first/middle/last of each spec's sweep). Fixes two latent bugs the threshold caused: models ran only their smallest size under --quick, and patterns silently skipped severity=100. Both now run a representative 3 points; patterns get all of (0, 50, 100).
  • Add manual --size / --severity (repeatable) overriding the tiers per axis.
  • Widen DEFAULT_SEVERITIES to (0, 25, 50, 75, 100); --quick still distills to (0, 50, 100), so CodSpeed isn't rebaselined.
  • One shared registry.skip_reason() is the single source of truth for selection, used by both pytest (conftest.maybe_skip) and the memory engine — guarded by the existing id-alignment test.

CLI: metric as a flag

  • run / sweep take --metric {time,memory,both}; compare / plot auto-detect memory vs timing snapshots. Retires the memory sub-app (memory saverun --metric memory, memory sweepsweep --metric memory, memory comparecompare).
  • --quick/--long/--size/--severity now apply uniformly to both metrics.

Verify

  • ruff + mypy clean; harness unit tests pass (incl. memory↔pytest id alignment).
  • Walkthrough (CI-executed) re-runs clean; README + walkthrough updated.

FBumann and others added 4 commits June 6, 2026 13:18
…is selection

Replace the quick_threshold size cap with quick_subset — the first/middle/last
of each spec's sweep — so --quick runs three representative points instead of a
threshold side-effect. This fixes two latent bugs the threshold caused:

- models ran only their smallest size under --quick (most quick_threshold=10),
  so the per-PR memory signal was dominated by sub-MiB benchmarks;
- patterns silently skipped severity=100 (quick_threshold=50), so the densest,
  peak-memory regime never ran in CI.

Both now follow one rule (first/mid/last); patterns get all of (0, 50, 100).
pypsa_scigrid opts out via quick_sizes=().

Add manual --size / --severity pytest options (repeatable) that override the
tier flags per axis, so CI can pin exact values.

Widen DEFAULT_SEVERITIES to (0, 25, 50, 75, 100) for finer default/--long/sweep
resolution; --quick still distills to (0, 50, 100), so CodSpeed isn't rebaselined.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tect

Make the metric (time vs memory) a flag on the workflow rather than a separate
command tree — the first half of unifying the time/memory CLI split.

- Add a Measure enum (time|memory|both) and `--metric` to `sweep`, dispatching
  to the timing or memory engine (or both, sequentially — memray overhead would
  skew wall-clock if run concurrently). `both` writes per-metric subdirs so the
  two linopy-<version>.json sets don't collide. Default snapshot dirs are now
  .benchmarks/{time,memory}/ (timing default moved from .benchmarks/sweep).
- `compare` auto-detects memory snapshots (peak_mib key) and diffs them with a
  peak-RSS table; timing still goes through pytest-benchmark. Mixing errors out.
  Adds memory.compare_snapshots (path-based; compare() now delegates to it).

The `memory` sub-app still works; Stage 2 retires it and adds `run --metric` +
--long/--size/--severity parity on the memory engine.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ting

Finish folding the time/memory split into a single command surface. Memory is
now a --metric flag everywhere, not a parallel `memory` sub-app.

- `run --metric {time,memory,both}` replaces `memory save`: prints results,
  saves only with --json (label = filename stem) — one rule for both metrics.
  Adds --quick/--size/--severity/--repeats so flags match across metrics.
- Delete the `memory` sub-app (cli/memory.py, memory_app); `memory sweep` →
  `sweep --metric memory`, `memory compare` → `compare` (auto-detect, Stage 1).
- Unify size/severity selection in registry.skip_reason(), shared by
  conftest.maybe_skip and memory.run_phase so the two engines can't drift.
  Threads --long/--size/--severity through the memory worker; the id-alignment
  test still passes.
- Split memory.save() into measure() (returns the dict) + save() (writes), so
  `run --metric memory` can measure-then-maybe-write.
- run_memory_sweep now invokes `run --metric memory --json <abs>` and writes
  straight to output_dir (no save-to-default-then-move dance); gains --long.
- Update the walkthrough (CI-executed, re-runs clean) and benchmarks/README.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
compare crashed with a raw JSONDecodeError traceback (and plot printed a
cryptic "Expecting value:" with no filename) when handed an empty or malformed
snapshot. load_snapshot now raises a clear, file-named ValueError on unreadable
JSON or an unrecognized shape; compare routes its metric detection through
load_snapshot too, so both commands report e.g.

  /tmp/x.json: not a readable JSON snapshot (Expecting value: line 1 ...)

The mixed memory/timing guard (compare's any/all, plot's _check_same_unit) was
already correct and is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@FBumann FBumann merged commit 0dc9c0a into master Jun 6, 2026
17 of 18 checks passed
@codspeed-hq

codspeed-hq Bot commented Jun 6, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 79 untouched benchmarks
🆕 98 new benchmarks
⏩ 495 skipped benchmarks1

Performance Changes

Mode Benchmark BASE HEAD Efficiency
🆕 Memory test_to_lp[kvl_cycles-severity=100] N/A 38.5 MB N/A
🆕 Memory test_to_solver[highs-knapsack-n=10000] N/A 2.5 MB N/A
🆕 Memory test_to_solver[gurobi-knapsack-n=10000] N/A 2.8 MB N/A
🆕 Memory test_to_solver[highs-piecewise-n=5000] N/A 20.6 MB N/A
🆕 Memory test_to_solver[gurobi-sos-n=1000] N/A 3.1 MB N/A
🆕 Memory test_to_solver[highs-expression_arithmetic-n=250] N/A 34.9 MB N/A
🆕 Memory test_to_solver[gurobi-piecewise-n=1000] N/A 5.1 MB N/A
🆕 Memory test_to_solver[gurobi-qp-n=1000] N/A 831.1 KB N/A
🆕 Memory test_to_solver[gurobi-qp-n=20000] N/A 11.2 MB N/A
🆕 Memory test_to_solver[gurobi-sos-n=10000] N/A 29.2 MB N/A
🆕 Memory test_build[masked-n=1000] N/A 70.7 MB N/A
🆕 Memory test_to_solver[gurobi-kvl_cycles-severity=100] N/A 198.8 MB N/A
🆕 Memory test_to_lp[nodal_balance-severity=100] N/A 17.9 MB N/A
🆕 Memory test_to_lp[merge_balance-severity=100] N/A 17.6 MB N/A
🆕 Memory test_to_solver[highs-storage-n=1000] N/A 150 MB N/A
🆕 Memory test_build[milp-n=50] N/A 283.7 KB N/A
🆕 Memory test_to_solver[highs-kvl_cycles-severity=100] N/A 217 MB N/A
🆕 Memory test_build[masked-n=100] N/A 735.4 KB N/A
🆕 Memory test_to_solver[highs-sparse_network-n=250] N/A 64.8 MB N/A
🆕 Memory test_to_solver[gurobi-basic-n=1600] N/A 2.8 GB N/A
... ... ... ... ... ...

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing benchmark/quick-subset (60fa372) with master (3fdb9b4)

Open in CodSpeed

Footnotes

  1. 495 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant