Benchmark selection rework + time/memory CLI unification#31
Merged
Conversation
…is selection Replace the quick_threshold size cap with quick_subset — the first/middle/last of each spec's sweep — so --quick runs three representative points instead of a threshold side-effect. This fixes two latent bugs the threshold caused: - models ran only their smallest size under --quick (most quick_threshold=10), so the per-PR memory signal was dominated by sub-MiB benchmarks; - patterns silently skipped severity=100 (quick_threshold=50), so the densest, peak-memory regime never ran in CI. Both now follow one rule (first/mid/last); patterns get all of (0, 50, 100). pypsa_scigrid opts out via quick_sizes=(). Add manual --size / --severity pytest options (repeatable) that override the tier flags per axis, so CI can pin exact values. Widen DEFAULT_SEVERITIES to (0, 25, 50, 75, 100) for finer default/--long/sweep resolution; --quick still distills to (0, 50, 100), so CodSpeed isn't rebaselined. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tect
Make the metric (time vs memory) a flag on the workflow rather than a separate
command tree — the first half of unifying the time/memory CLI split.
- Add a Measure enum (time|memory|both) and `--metric` to `sweep`, dispatching
to the timing or memory engine (or both, sequentially — memray overhead would
skew wall-clock if run concurrently). `both` writes per-metric subdirs so the
two linopy-<version>.json sets don't collide. Default snapshot dirs are now
.benchmarks/{time,memory}/ (timing default moved from .benchmarks/sweep).
- `compare` auto-detects memory snapshots (peak_mib key) and diffs them with a
peak-RSS table; timing still goes through pytest-benchmark. Mixing errors out.
Adds memory.compare_snapshots (path-based; compare() now delegates to it).
The `memory` sub-app still works; Stage 2 retires it and adds `run --metric` +
--long/--size/--severity parity on the memory engine.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ting
Finish folding the time/memory split into a single command surface. Memory is
now a --metric flag everywhere, not a parallel `memory` sub-app.
- `run --metric {time,memory,both}` replaces `memory save`: prints results,
saves only with --json (label = filename stem) — one rule for both metrics.
Adds --quick/--size/--severity/--repeats so flags match across metrics.
- Delete the `memory` sub-app (cli/memory.py, memory_app); `memory sweep` →
`sweep --metric memory`, `memory compare` → `compare` (auto-detect, Stage 1).
- Unify size/severity selection in registry.skip_reason(), shared by
conftest.maybe_skip and memory.run_phase so the two engines can't drift.
Threads --long/--size/--severity through the memory worker; the id-alignment
test still passes.
- Split memory.save() into measure() (returns the dict) + save() (writes), so
`run --metric memory` can measure-then-maybe-write.
- run_memory_sweep now invokes `run --metric memory --json <abs>` and writes
straight to output_dir (no save-to-default-then-move dance); gains --long.
- Update the walkthrough (CI-executed, re-runs clean) and benchmarks/README.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
compare crashed with a raw JSONDecodeError traceback (and plot printed a cryptic "Expecting value:" with no filename) when handed an empty or malformed snapshot. load_snapshot now raises a clear, file-named ValueError on unreadable JSON or an unrecognized shape; compare routes its metric detection through load_snapshot too, so both commands report e.g. /tmp/x.json: not a readable JSON snapshot (Expecting value: line 1 ...) The mixed memory/timing guard (compare's any/all, plot's _check_same_unit) was already correct and is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Merging this PR will not alter performance
Performance Changes
Comparing Footnotes
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
TODO (human): one or two sentences on why.
Note
Generated by AI (Claude).
What
Unifies benchmark size/severity selection and the time-vs-memory CLI split.
Selection
quick_thresholdsize cap with an explicit per-specquick_subset(first/middle/last of each spec's sweep). Fixes two latent bugs the threshold caused: models ran only their smallest size under--quick, and patterns silently skippedseverity=100. Both now run a representative 3 points; patterns get all of(0, 50, 100).--size/--severity(repeatable) overriding the tiers per axis.DEFAULT_SEVERITIESto(0, 25, 50, 75, 100);--quickstill distills to(0, 50, 100), so CodSpeed isn't rebaselined.registry.skip_reason()is the single source of truth for selection, used by both pytest (conftest.maybe_skip) and the memory engine — guarded by the existing id-alignment test.CLI: metric as a flag
run/sweeptake--metric {time,memory,both};compare/plotauto-detect memory vs timing snapshots. Retires thememorysub-app (memory save→run --metric memory,memory sweep→sweep --metric memory,memory compare→compare).--quick/--long/--size/--severitynow apply uniformly to both metrics.Verify