feat: add E2E perf tests for mlperf-endpoints by viraatc · Pull Request #328 · mlcommons/endpoints

viraatc · 2026-05-28T22:25:51Z

Drive the cyclopts inference-endpoint app in-process against the existing MaxThroughputServer and VariableResponseServer stubs. Two families, both marked @pytest.mark.performance (CI-skipped — run on demand):

Roofline: measures peak QPS for max_throughput and for a concurrency sweep (1k / 4k / 16k), and binary-searches for the largest 10k-multiple target_qps that the Poisson load pattern sustains. Reports numbers rather than asserting on them.
Low-QPS correctness: 5 QPS Poisson against the realistic stub for ~20 s; asserts zero failed requests. Guards against keep-alive / idle-pool / slow-response regressions that only surface when connections sit idle longer than TCP_KEEPIDLE.
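To make the low-QPS case concrete, here is a minimal sketch of how Poisson send times can be drawn from exponential inter-arrival gaps. It is an illustration only: `poisson_arrival_times` and its parameters are hypothetical names, and the load generator in this PR may schedule requests differently.

```python
import random


def poisson_arrival_times(target_qps: float, duration_s: float, seed: int = 0) -> list[float]:
    """Return send times (seconds from start) for a Poisson arrival process.

    Hypothetical helper: gaps between consecutive sends are exponentially
    distributed with mean 1 / target_qps.
    """
    rng = random.Random(seed)
    times: list[float] = []
    t = rng.expovariate(target_qps)
    while t < duration_s:
        times.append(t)
        t += rng.expovariate(target_qps)
    return times


times = poisson_arrival_times(target_qps=5, duration_s=20)
gaps = [later - earlier for earlier, later in zip(times, times[1:])]
print(f"{len(times)} requests, mean gap {sum(gaps) / len(gaps):.3f}s, "
      f"longest gap {max(gaps):.3f}s")
```

At 5 QPS the mean gap is 0.2 s, but individual gaps vary, and a pooled connection can go unused for far longer than any single gap. That is the idle-connection situation this test family is meant to exercise.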

A conftest.py captures each parameterized case via a record_result fixture and renders a unified summary table with host / CPU / core info at end of session, so cross-machine roofline runs are easy to compare.

WARNING: a full run takes ~8–10 min of wall-clock time.

How to run

uv run pytest -vs -m performance --no-cov tests/performance/commands/test_e2e_perf.py

Example output — local dev box

=========================== E2E Performance Summary ============================
Host:  nova-2    Arch: x86_64    Cores: 48
CPU:   AMD Ryzen Threadripper PRO 7965WX 24-Cores

Test                       Stream  QPS        Total       Elapsed   Failed
-------------------------  ------  ---------  ----------  --------  ------
max_throughput (2M burst)  off        42,929   2,000,000    46.59s     0
concurrency c=1,000        off        51,433     617,330    11.99s     0
concurrency c=4,000        off        51,819     622,182    11.98s     0
concurrency c=16,000       off        50,160     604,654    11.80s     0
poisson max_sustained      off        30,000  —           —            0
max_throughput (2M burst)  on         32,025   2,000,000    62.45s     0
concurrency c=1,000        on         30,337     364,982    12.00s     0
concurrency c=4,000        on         31,348     380,157    12.00s     0
concurrency c=16,000       on         31,495     393,876    12.00s     0
poisson max_sustained      on         30,000  —           —            0
low_qps target=5           off          4.99         200    40.07s     0
low_qps target=5           on           3.97         200    50.39s     0
========================= 12 passed, 2 warnings in 495.47s (0:08:15) =========================

What does this PR do?

Adds the smallest set of E2E parameterized perf tests that exercise all three online load patterns + low-QPS keep-alive guard, behind the existing performance marker so CI is unaffected.

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All 12 tests pass locally (~8 min)
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

github-actions · 2026-05-28T22:25:59Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces end-to-end performance and correctness tests for the benchmark CLI, including roofline tests, low-QPS correctness tests, and a pytest terminal summary reporter. The review feedback identifies several critical improvements: fixing a logic bug in the Poisson binary search where the lowest target might be skipped, using os.cpu_count() for cross-platform core detection instead of parsing /proc/cpuinfo, defensively handling type conversion errors in the summary formatter to prevent pytest crashes, and adding headroom to --num-samples in the Poisson load pattern test to prevent premature termination.
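Two of the review points, cross-platform core detection and defensive cell formatting, can be illustrated as follows. This is a sketch under assumptions: `host_info` and `fmt_cell` here are stand-ins written for this example, not the PR's `_host_info` / `_fmt_cell`, and the `kind` argument is invented for illustration.

```python
import os
import platform


def host_info() -> dict[str, str]:
    """Collect host facts; only the CPU model string depends on /proc/cpuinfo."""
    info = {
        "host": platform.node(),
        "arch": platform.machine(),
        "cores": str(os.cpu_count() or "?"),  # works on Linux, macOS, Windows
        "cpu": "unknown",
    }
    try:
        with open("/proc/cpuinfo", encoding="utf-8") as fh:
            for line in fh:
                if line.lower().startswith("model name"):
                    info["cpu"] = line.split(":", 1)[1].strip()
                    break
    except OSError:
        # No /proc/cpuinfo (non-Linux) or unreadable: keep the "unknown" default.
        pass
    return info


def fmt_cell(value, kind: str) -> str:
    """Format a table cell; fall back to str() instead of raising on bad input."""
    if value is None:
        return "-"
    try:
        if kind == "int":
            return f"{int(value):,}"
        if kind == "float":
            return f"{float(value):.2f}"
    except (TypeError, ValueError, OverflowError):
        return str(value)
    return str(value)


print(host_info())
print(fmt_cell(2_000_000, "int"), fmt_cell(4.987, "float"), fmt_cell("n/a", "int"))
```

The fallback matters because the summary is rendered once, at the end of the session: a single malformed recorded value should degrade one cell, not abort the whole report.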

Drive the cyclopts `inference-endpoint` app in-process against the existing MaxThroughputServer and VariableResponseServer stubs. Two families:

* Roofline (`@pytest.mark.performance`, CI-skipped): measures peak QPS for max_throughput, concurrency sweep, and binary-searches the largest 10k-multiple target_qps Poisson sustains. Reports numbers rather than asserting on them.
* Low-QPS correctness (`@pytest.mark.integration`, CI-included): 5 QPS Poisson against the realistic stub for 20s; asserts zero failed requests. Guards keep-alive / idle-pool / slow-response regressions that only surface when connections sit idle longer than TCP_KEEPIDLE.

A conftest.py captures each parameterized case via a `record_result` fixture and renders a unified summary table with host + CPU info at end of session, so cross-machine roofline runs are easy to compare.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* poisson binary search: switch to while lo<=hi + best_sustained so the LO boundary is actually tested. Old loop could converge to lo==hi==LO/STEP without running LO and report max_sustained=0.
* low_qps: 2x num-samples headroom over TARGET_QPS*DURATION so wall time, not sample count, caps the run despite Poisson variance.
* conftest._host_info: use os.cpu_count() for cores (cross-platform); keep /proc/cpuinfo only for the CPU model string. Document why the OSError except is silent.
* conftest._fmt_cell: wrap int/float conversions in try/except so a bad recorded value can't crash the end-of-session summary table.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
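The `while lo <= hi` plus best-so-far pattern described in the first bullet can be sketched as below. The function name, its arguments, and the stand-in predicate are hypothetical; the sketch also assumes the predicate is monotonic (if a rate is sustained, every lower rate is too), which a real load test only approximates.

```python
from typing import Callable


def max_sustained(sustains: Callable[[int], bool], lo: int, hi: int, step: int) -> int:
    """Return the largest multiple of `step` in [lo, hi] that `sustains` accepts.

    Returns 0 when no candidate is sustained. Every candidate that decides the
    outcome, including `lo` itself, is actually probed.
    """
    lo_i = -(-lo // step)  # ceil(lo / step): first candidate index
    hi_i = hi // step      # floor(hi / step): last candidate index
    best = 0
    while lo_i <= hi_i:
        mid = (lo_i + hi_i) // 2
        if sustains(mid * step):
            best = mid * step  # remember the success, then search higher
            lo_i = mid + 1
        else:
            hi_i = mid - 1     # too fast: search lower
    return best


# Stand-in predicate for illustration: pretend the client sustains up to 30k QPS.
print(max_sustained(lambda qps: qps <= 30_000, lo=10_000, hi=100_000, step=10_000))
```

Tracking `best` separately from the loop bounds is what lets the search report the lowest candidate when it is the only one that passes, instead of falling through to 0.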

Low-QPS correctness was marked integration which would have it run in CI on every PR. These are long-running benchmark tests that aren't meant to gate merges; marking them all performance keeps CI fast and makes the file's policy uniform. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Opt-in (`-vvv`) per-request lifecycle tracing for the benchmark client, rendered as a live `rich` dashboard alongside the run. Off by default with no measurable overhead when off (emit is a no-op binding); the worker hot path stays lock-free and allocation-free.

Pipeline:
- utils/trace.py: lock-free SPSC ring emitter (~190 ns/event) into a per-process 512 MiB anonymous mmap (pages fault in on write). Per-pid POSIX FIFO transport: non-blocking open with bounded retry (raises rather than hanging if the dashboard died), O_NONBLOCK writes that drop on EAGAIN with an adaptive sampler + cumulative self-healing drop counter. bootstrap spawns the dashboard, tracks its Popen, and cleanup() reaps it (kills a wedged one as a backstop). 17-byte <BQQ frames.
- utils/trace_dashboard.py: pure aggregation + render (unit-tested in isolation): lifecycle fold into HDR-histogram stages, heat-graded %E2E table, client/server/backpressure verdict, e2e bar, backpressure cause tree (pickup-ipc heats independently; encode+tcp-acquire fused), per-proc loop-lag panel, and an ESTABLISHED tcp-conn gauge. Cross-process deltas floored at 0.
- scripts/trace_dashboard.py: TUI subprocess: FIFO reader thread, ZMQ SUB to the aggregator PUB (sidecar fallback), and an off-render-thread /proc tcp-conn sampler. Reads to true FIFO EOF with an authoritative final frame.
- Worker / session / agentic-inference emit sites; warmup excluded via a PERF_START reset; PERF_END freezes the lifecycle/verdict/tree.

Perf (B200 bare metal, #328 roofline, 3 reps): trace-off within run-to-run noise of main; -vvv ~5-8% at the sub-ms stub ceiling (worst case).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
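The "17-byte <BQQ frames" detail follows directly from Python's `struct` format rules: little-endian with no padding, one unsigned byte plus two unsigned 64-bit integers. The sketch below shows packing and unpacking such frames; the field meanings (event code, request id, nanosecond timestamp) are assumptions made for illustration, since the commit message gives only the layout.

```python
import struct

# "<" = little-endian, no alignment padding; B = uint8, Q = uint64.
FRAME = struct.Struct("<BQQ")


def encode_frame(event: int, request_id: int, t_ns: int) -> bytes:
    """Pack one trace event. Field meanings are hypothetical."""
    return FRAME.pack(event, request_id, t_ns)


def decode_frames(buf: bytes) -> list[tuple[int, int, int]]:
    """Unpack every complete frame in `buf`, ignoring a trailing partial frame."""
    usable = len(buf) - len(buf) % FRAME.size
    return [FRAME.unpack_from(buf, off) for off in range(0, usable, FRAME.size)]


stream = encode_frame(1, 42, 1_000) + encode_frame(2, 42, 2_500)
print(FRAME.size, decode_frames(stream))
```

A fixed-size frame keeps the reader simple: it never has to parse lengths, and a short read at the end of a buffer can be detected by size alone.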

viraatc requested a review from a team May 28, 2026 22:25

github-actions Bot requested review from arekay-nv and nvzhihanj May 28, 2026 22:26

gemini-code-assist Bot reviewed May 28, 2026


Comment thread tests/performance/commands/test_e2e_perf.py Outdated

Comment thread tests/performance/commands/conftest.py

Comment thread tests/performance/commands/conftest.py

Comment thread tests/performance/commands/test_e2e_perf.py Outdated

github-code-quality Bot found potential problems May 28, 2026


Comment thread tests/performance/commands/conftest.py Fixed

viraatc changed the title ~~feat: add E2E perf tests for benchmark CLI~~ feat: add E2E perf tests for mlperf-endpoints Jun 1, 2026

viraatc mentioned this pull request Jun 1, 2026

Perf: Analayze the roofline of the inference endpoints #9

Open

viraatc and others added 3 commits June 8, 2026 19:46

viraatc force-pushed the viraat/e2e-perf-tests branch from ed4f70a to b66712f Compare June 9, 2026 02:46

viraatc mentioned this pull request Jun 10, 2026

feat: add -vvv trace mode dashboard for client overhead #334

Open

4 tasks

viraatc wants to merge 3 commits into main from viraat/e2e-perf-tests
