feat: add E2E perf tests for mlperf-endpoints #328
Open
viraatc wants to merge 3 commits into
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
Code Review
This pull request introduces end-to-end performance and correctness tests for the benchmark CLI, including roofline tests, low-QPS correctness tests, and a pytest terminal summary reporter. The review feedback identifies several critical improvements: fixing a logic bug in the Poisson binary search where the lowest target might be skipped, using os.cpu_count() for cross-platform core detection instead of parsing /proc/cpuinfo, defensively handling type conversion errors in the summary formatter to prevent pytest crashes, and adding headroom to --num-samples in the Poisson load pattern test to prevent premature termination.
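The two `conftest.py` suggestions in this review (cross-platform core detection and a summary formatter that cannot crash pytest) can be sketched roughly as below. This is an illustration only, not the PR's actual code: the function names, the `kind` parameter, and the returned dictionary keys are assumptions made for the example.

```python
import os
import platform


def fmt_cell(value, kind="str"):
    """Render one summary-table cell without ever raising (hypothetical helper)."""
    try:
        if kind == "int":
            return f"{int(value):,}"
        if kind == "float":
            return f"{float(value):,.1f}"
    except (TypeError, ValueError, OverflowError):
        # A malformed recorded value degrades to plain text instead of
        # aborting the end-of-session summary.
        pass
    return "-" if value is None else str(value)


def host_info():
    """Collect host name, CPU model and core count (hypothetical helper)."""
    cores = os.cpu_count() or 0  # works on Linux, macOS and Windows
    model = platform.processor() or "unknown"
    try:
        # /proc/cpuinfo is consulted only for a nicer model string on Linux.
        with open("/proc/cpuinfo", encoding="utf-8") as fh:
            for line in fh:
                if line.lower().startswith("model name"):
                    model = line.split(":", 1)[1].strip()
                    break
    except OSError:
        # Not Linux, or the file is unreadable: keep the fallback model string.
        pass
    return {"host": platform.node(), "cpu": model, "cores": cores}
```

With this shape, `fmt_cell("abc", "int")` falls back to the raw string and `fmt_cell(None, "float")` renders a placeholder, so one bad recorded value costs a single odd-looking cell rather than the whole table.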
Drive the cyclopts `inference-endpoint` app in-process against the existing MaxThroughputServer and VariableResponseServer stubs. Two families:

* Roofline (`@pytest.mark.performance`, CI-skipped): measures peak QPS for max_throughput, concurrency sweep, and binary-searches the largest 10k-multiple target_qps Poisson sustains. Reports numbers rather than asserting on them.
* Low-QPS correctness (`@pytest.mark.integration`, CI-included): 5 QPS Poisson against the realistic stub for 20s; asserts zero failed requests. Guards keep-alive / idle-pool / slow-response regressions that only surface when connections sit idle longer than TCP_KEEPIDLE.

A conftest.py captures each parameterized case via a `record_result` fixture and renders a unified summary table with host + CPU info at end of session, so cross-machine roofline runs are easy to compare.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* poisson binary search: switch to while lo<=hi + best_sustained so the LO boundary is actually tested. Old loop could converge to lo==hi==LO/STEP without running LO and report max_sustained=0.
* low_qps: 2x num-samples headroom over TARGET_QPS*DURATION so wall time, not sample count, caps the run despite Poisson variance.
* conftest._host_info: use os.cpu_count() for cores (cross-platform); keep /proc/cpuinfo only for the CPU model string. Document why the OSError except is silent.
* conftest._fmt_cell: wrap int/float conversions in try/except so a bad recorded value can't crash the end-of-session summary table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
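The `while lo<=hi` + `best_sustained` pattern described in this commit can be sketched as follows. This is a minimal illustration under stated assumptions, not the test's actual code: `sustains` is a hypothetical callback standing in for one benchmark run at a given `target_qps`, the bounds are assumed to be multiples of `step`, and sustainability is assumed to be monotone in the target.

```python
from typing import Callable


def max_sustained_qps(
    sustains: Callable[[int], bool], lo: int, hi: int, step: int
) -> int:
    """Return the largest multiple of `step` in [lo, hi] that sustains, else 0."""
    lo_i, hi_i = lo // step, hi // step  # search over step indices
    best_sustained = 0
    while lo_i <= hi_i:  # inclusive bounds, so the LO boundary is probed too
        mid = (lo_i + hi_i) // 2
        target = mid * step
        if sustains(target):
            best_sustained = target  # remember the best passing target
            lo_i = mid + 1  # try higher
        else:
            hi_i = mid - 1  # try lower
    return best_sustained


# Usage with a fake server that keeps up only at the lowest target.
probed = []


def fake_sustains(qps: int) -> bool:
    probed.append(qps)
    return qps <= 10_000


result = max_sustained_qps(fake_sustains, lo=10_000, hi=100_000, step=10_000)
```

Because the loop runs while `lo_i <= hi_i` and records every passing target, the case where only the lowest target is sustainable still ends with that target being executed and returned, rather than with 0.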
Low-QPS correctness was marked integration which would have it run in CI on every PR. These are long-running benchmark tests that aren't meant to gate merges; marking them all performance keeps CI fast and makes the file's policy uniform. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ed4f70a to b66712f
viraatc added a commit that referenced this pull request on Jun 22, 2026
Opt-in (`-vvv`) per-request lifecycle tracing for the benchmark client, rendered as a live `rich` dashboard alongside the run. Off by default with no measurable overhead when off (emit is a no-op binding); the worker hot path stays lock-free and allocation-free.

Pipeline:

- utils/trace.py: lock-free SPSC ring emitter (~190 ns/event) into a per-process 512 MiB anonymous mmap (pages fault in on write). Per-pid POSIX FIFO transport: non-blocking open with bounded retry (raises rather than hanging if the dashboard died), O_NONBLOCK writes that drop on EAGAIN with an adaptive sampler + cumulative self-healing drop counter. bootstrap spawns the dashboard, tracks its Popen, and cleanup() reaps it (kills a wedged one as a backstop). 17-byte `<BQQ` frames.
- utils/trace_dashboard.py: pure aggregation + render (unit-tested in isolation): lifecycle fold into HDR-histogram stages, heat-graded %E2E table, client/server/backpressure verdict, e2e bar, backpressure cause tree (pickup-ipc heats independently; encode+tcp-acquire fused), per-proc loop-lag panel, and an ESTABLISHED tcp-conn gauge. Cross-process deltas floored at 0.
- scripts/trace_dashboard.py: TUI subprocess: FIFO reader thread, ZMQ SUB to the aggregator PUB (sidecar fallback), and an off-render-thread /proc tcp-conn sampler. Reads to true FIFO EOF with an authoritative final frame.
- Worker / session / agentic-inference emit sites; warmup excluded via a PERF_START reset; PERF_END freezes the lifecycle/verdict/tree.

Perf (B200 bare metal, #328 roofline, 3 reps): trace-off within run-to-run noise of main; -vvv ~5-8% at the sub-ms stub ceiling (worst case).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
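To make the "17-byte `<BQQ` frames" detail concrete: in Python's `struct` notation, `<BQQ` is one unsigned byte followed by two unsigned 64-bit integers, little-endian with no padding, which gives 1 + 8 + 8 = 17 bytes. The sketch below packs and unpacks such frames. The field meanings (event kind, request id, nanosecond timestamp) and the helper names are assumptions for illustration; the commit message does not say what the three fields carry.

```python
import struct

# "<" selects little-endian standard sizes with no alignment padding.
FRAME = struct.Struct("<BQQ")


def encode_event(kind: int, request_id: int, t_ns: int) -> bytes:
    """Pack one event; the field semantics are assumed, not taken from the PR."""
    return FRAME.pack(kind, request_id, t_ns)


def decode_stream(buf: bytes) -> list:
    """Decode every complete frame in `buf`, ignoring a trailing partial frame."""
    usable = len(buf) - len(buf) % FRAME.size
    return [FRAME.unpack_from(buf, off) for off in range(0, usable, FRAME.size)]


# Two whole frames followed by three stray bytes of a truncated third frame.
stream = encode_event(1, 42, 1_000) + encode_event(2, 42, 2_500) + b"\x00\x01\x02"
events = decode_stream(stream)
```

A fixed-size frame keeps each record well under the POSIX pipe atomic-write limit (`PIPE_BUF`), so a reader only has to handle truncation at the end of the stream, which is what `decode_stream` does here.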
Drive the cyclopts `inference-endpoint` app in-process against the existing `MaxThroughputServer` and `VariableResponseServer` stubs. Two families, both marked `@pytest.mark.performance` (CI-skipped; run on demand):

* Roofline: measures peak QPS for `max_throughput`, `concurrency` sweep (1k / 4k / 16k), and binary-searches the largest 10k-multiple `target_qps` Poisson sustains. Reports numbers rather than asserting on them.
* Low-QPS correctness: 5 QPS Poisson against the realistic stub for ~20 s; asserts zero failed requests. Guards keep-alive / idle-pool / slow-response regressions that only surface when connections sit idle longer than `TCP_KEEPIDLE`.

A `conftest.py` captures each parameterized case via a `record_result` fixture and renders a unified summary table with host / CPU / core info at end of session, so cross-machine roofline runs are easy to compare.

WARNING: full run takes ~8–10 min wall.
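For the low-QPS case, the follow-up commit on this PR gives `--num-samples` 2x headroom over `TARGET_QPS * DURATION` so that wall time, not sample count, ends the run. The arithmetic behind that choice can be sketched as below; the helper name is hypothetical, and the only statistical fact used is that the number of arrivals of a Poisson process over a fixed window has variance equal to its mean.

```python
import math


def sample_headroom(target_qps: float, duration_s: float, factor: float = 2.0):
    """Return (num_samples, sigmas): the sample cap and how many standard
    deviations it sits above the expected Poisson arrival count."""
    mean = target_qps * duration_s  # expected arrivals in the window
    std = math.sqrt(mean)  # Poisson: variance equals the mean
    num_samples = math.ceil(factor * mean)
    sigmas = (num_samples - mean) / std
    return num_samples, sigmas


num_samples, sigmas = sample_headroom(target_qps=5, duration_s=20)
```

At 5 QPS for 20 s the expected count is 100 with a standard deviation of 10, so a cap of 200 samples sits 10 standard deviations above the mean and is very unlikely to end the run early.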
How to run
Example output — local dev box
What does this PR do?
Adds the smallest set of E2E parameterized perf tests that exercise all three online load patterns + low-QPS keep-alive guard, behind the existing `performance` marker so CI is unaffected.

Type of change
Related issues
Testing
Checklist