Why this exists
Several open issues independently propose pieces of the same design — more than one
memory metric, a way to record which one produced a number, a way to select it, and a guard
against comparing incomparable ones. Drafted at different times (mostly pre-#32), they overlap
and in one case conflict: #9 and #34 each add an "rss" number, with different engines
(psutil polling vs kernel ru_maxrss), different JSON tags (peak_backend vs mode), and
different comparability guards. Left separate they'd ship two different "rss" numbers wearing
one label.
This is the design hub. It owns the shared model; the linked issues become engines/consumers
that plug into it.
The model
Metrics (the what)
| metric | meaning | unit | precision |
|---|---|---|---|
| heap | allocator demand (memray) | bytes | byte-exact |
| rss | resident-page high-water | bytes | page/THP-quantized |
| allocated | total bytes allocated / count (churn); see #23 | bytes / count | byte-exact |
These are different quantities and must never share a plot axis. heap vs rss is
allocator-bytes vs resident-pages; see the accuracy caveats in #34.
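The rule that different metrics never share an axis can be checked mechanically before rows are combined. The following is a minimal sketch, not the plugin's actual API: the row layout (dicts with mode and engine keys), the function name check_comparable, the exception class, and the engine strings used in the usage note are all assumptions made for illustration.

```python
import warnings


class IncomparableMetricsError(ValueError):
    """Raised when rows measuring different metrics would share an axis."""


def check_comparable(rows):
    """Validate that rows can be compared or plotted together.

    Each row is assumed to be a mapping with a "mode" key (the metric,
    e.g. "heap" or "rss") and an "engine" key (what produced the number).
    Both key names are hypothetical.
    """
    rows = list(rows)
    modes = sorted({row["mode"] for row in rows})
    if len(modes) > 1:
        # Different metrics are different quantities: refuse outright.
        raise IncomparableMetricsError(
            f"cannot combine rows with differing modes: {modes}"
        )
    engines = sorted({row["engine"] for row in rows})
    if len(engines) > 1:
        # Same metric from different engines: allowed, but flagged.
        warnings.warn(f"rows were produced by differing engines: {engines}")
    return rows
```

Under these assumptions, two rss rows from the same engine pass silently, an rss row next to a heap row raises, and two rss rows from different engines pass with a warning.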
Engines (the how, platform-dispatched)
A metric can have more than one engine; the engine is a property of the environment, not the
test. The result records the concrete engine used.
- heap → memray (Linux/macOS), in-process or isolated (see #20).
- rss → ru_maxrss + fork on Linux/macOS (#34, default, accurate); psutil sampling on Windows only (#9, coarse fallback: no fork/ru_maxrss there). #9's "misses spikes" caveat is honest Windows small print, not a rival design.
One tag (resolves the #9/#34 collision)
A single field in extra_info["benchmem"] records what produced the number: mode (the
metric) plus the concrete engine. This supersedes #9's separate peak_backend/--peakbench-backend
scheme; everything funnels through one tag.
One selection surface
- --benchmark-memory[=heap|rss] (optional-value), --benchmark-memory-repeats=N (#19).
- @pytest.mark.benchmem(mode=..., repeats=...).
- benchmark_memory(fn, mode=...).
Scope (Step-2 decision): no OOM handling (no limit=/cgroup/RLIMIT_AS, no killed
field). If a benchmarked action dies or raises under measurement, it fails like any other
test. OOM-survival testing is out of scope.
One comparability guard
memory_from_pytest_benchmark / load_long_df carry mode+engine as facets;
compare/plot refuse to stack rows of differing mode (and warn on differing engine).
This supersedes both #9's mixed-backend guard and #34's co-plot refusal.
Shared infrastructure
- fork + os.wait4 (per-child), gc.freeze() before fork. Baseline is a forked no-op child's ru_maxrss (NOT the parent's current RSS, which over-subtracts inherited COW pages; proven in #34's Step-2 validation). Used by rss (#34) and by isolated-heap (#20).
- min-of-N repeats (#19), surfaced spread.
Children
- rss engine (posix ru_maxrss + fork + baseline). Reference design for the shared infra.
- #9: Windows engine for rss (psutil fallback). Rescoped: drop rival tagging, adopt this model.
- #20: isolated heap mode, reusing the shared subprocess/baseline infra.
- #19: repeats=N denoising knob (shared across metrics).
- allocated/count metric + --metric allocated.
- #26: --metric both (time + memory together) in compare/plot.
Note on naming
#9/#19/#20/#23/#26 predate #32 and reference stale names (peakbench, peak_mib,
measure_peak, --peakbench-backend). Refresh them to the current names (pytest_benchmem, peak_bytes,
measure_memory, extra_info["benchmem"]) as each issue is picked up.
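For reference, the subprocess measurement described under Shared infrastructure (fork + os.wait4, gc.freeze() before fork, a forked no-op child's ru_maxrss as the baseline) might look roughly like the sketch below. It is POSIX-only and is not the code from #34: the function names are hypothetical, and clamping the delta at zero and calling gc.unfreeze() in the parent are assumptions of this sketch.

```python
import gc
import os
import sys


def _child_peak_rss_bytes(fn):
    """Run fn in a forked child and return that child's ru_maxrss in bytes."""
    gc.freeze()  # park existing objects so the GC does not touch them after fork
    pid = os.fork()
    if pid == 0:
        # Child: run the action, then exit without running parent cleanup.
        status = 0
        try:
            fn()
        except BaseException:
            status = 1
        finally:
            os._exit(status)
    # Parent: wait4 returns resource usage for this specific child.
    _, wait_status, rusage = os.wait4(pid, 0)
    gc.unfreeze()  # assumption: restore normal GC tracking in the parent
    if not (os.WIFEXITED(wait_status) and os.WEXITSTATUS(wait_status) == 0):
        raise RuntimeError("measured action failed in the child process")
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    scale = 1 if sys.platform == "darwin" else 1024
    return rusage.ru_maxrss * scale


def measure_rss_delta(fn):
    """Peak RSS of fn minus the peak RSS of a forked no-op child."""
    baseline = _child_peak_rss_bytes(lambda: None)
    peak = _child_peak_rss_bytes(fn)
    return max(0, peak - baseline)
```

Both the baseline and the measured action run in freshly forked children, so the memory inherited from the parent appears in both readings and cancels in the subtraction.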