feat(metrics): add use_legacy_loadgen_qps_metrics for LoadGen-parity QPS/TPS by viraatc · Pull Request #372 · mlcommons/endpoints

viraatc · 2026-06-23T09:22:08Z

What

Adds use_legacy_loadgen_qps_metrics (bool, default True). When on, QPS/TPS use the MLPerf LoadGen Server "completed" definition:

QPS = (completed - 1) / T, TPS = tokens / T
T = first issued request -> completion of the last-issued request (LoadGen final_query_all_samples_done_time; mlcommons/inference loadgen/results.cc).

Disable with --no-use-legacy-loadgen-qps-metrics for endpoints-native completed/duration and tokens/duration (window to the last request to finish) -- byte-identical to prior behavior.

Notes

The metrics-aggregator subprocess emits one extra counter (loadgen_window_duration_ns); the timed load-generator path is unchanged (no new events, no per-query work).
Falls back to the native window when the LoadGen window is unavailable (snapshots predating this change, or <2 completions).
Intended for server/online runs.

~100 lines of source + tests; 261 unit tests pass, mypy/ruff/pre-commit clean.

github-actions · 2026-06-23T09:22:22Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces support for MLPerf LoadGen Server 'completed' window throughput metrics, including a configuration toggle (legacy_streaming_tps) to switch between legacy and LoadGen-aligned throughput calculations. However, several configuration templates were updated to use agentic_inference and agentic_inference_inline which are not yet supported by the schema in schema.py, leading to validation failures. Additionally, a potential state inconsistency was identified in metrics_table.py when handling out-of-bounds tracking block indices.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

viraatc · 2026-06-23T10:09:13Z

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.5, xhigh) + Claude | Depth: standard | Commit: 40dbf898

Found 4 findings (1 already fixed in this push). No critical/high blockers — the core numerator/denominator logic faithfully matches loadgen/results.cc (Server completed). Posting as a synthesized summary; items 2–3 are design-level decisions for the author rather than line bugs.

#	File	Severity	Category	Reviewer(s)	Finding
1	`config/templates/*.yaml`	medium	~~bug~~ fixed	Codex	Resolved in `40dbf898`. Templates had been regenerated by a system-python `pre-commit` run against a newer installed package (`agentic_inference`, `drain_timeout 300`) and omitted the new field — inconsistent with this branch's `multi_turn` enum. Regenerated from the branch's editable env: now `multi_turn` + `legacy_streaming_tps: true`, consistent.
2	`metrics_table.py` (`total_loadgen_window_ns` / block-start)	medium	bug	Claude	LoadGen window starts at `START_PERFORMANCE_TRACKING`, not first issue. For `poisson`/online, the first sample fires at `start + first_delay`, so the window includes one pre-issue inter-arrival gap → LoadGen QPS biased low by ~`1/(N-1)` (negligible at large N, material for short runs). `max_throughput`/`concurrency` unaffected. LoadGen schedules its first Server query at delta 0. For exact parity, anchor the window start at the first issued sample's `issued_ns`.
3	`metrics_table.py` (`_update_tracked_block`) + `schema.py` help	low	design	Codex + Opus	Failed/hung last-issued request: the window anchors to the last successfully-completed issued request (graceful) rather than rendering N/A. Deliberate (avoids fragile, non-reproducible N/A on a single tail failure; identical to LoadGen on clean runs). Note: the flag help says "last-issued" — should read "last successfully-completed issued request" to match the code.
4	`tests/`	low	testing	Claude	No aggregator-level `ISSUED→ERROR→COMPLETE` integration test exercising the `(completed-failed)` numerator and failed-last-issued exclusion end-to-end (unit tests cover `mark_sample_failed` and the window directly, but not the live event stream).

Disposition: #1 fixed. #2 (Poisson start bias) and #3 (failed-last-issued philosophy + help wording) are author calls — see PR thread. #4 is a nice-to-have follow-up.

viraatc · 2026-06-23T18:07:50Z

Review Council — all findings addressed ✅

Rebased onto current main (dad7d1a6); branch now at d44bb211 (clean rebase, no conflicts). 169 unit tests pass; mypy/ruff/pre-commit clean.

#	Finding	Resolution
1	Templates inconsistent with branch enum (`agentic_inference`) + missing the new field	Resolved — rebased onto current `main` and regenerated from the branch's editable env. Templates now add only `legacy_streaming_tps: true`, consistent with main's schema. (Original churn was a `language: system` pre-commit hook running system-python against a newer install.)
2	Poisson window-start bias (~`1/(N-1)` low)	Fixed — the LoadGen window now starts at the first issued request (LoadGen t=0 / delta 0), not the `START_PERFORMANCE_TRACKING` timestamp. `metrics_table.py`: `_loadgen_window_start_ns` captured on first tracked ISSUED.
3	Failed/hung last-issued handling + stale help text	Resolved — kept graceful anchoring to the last successfully-completed issued request (identical to LoadGen on clean runs, where every query completes; avoids fragile non-reproducible N/A on a single tail failure). The `legacy_streaming_tps` help text is corrected to describe the actual behavior (N/A only when no request completed successfully, or multi-phase runs).
4	No end-to-end aggregator test for the failed-last-issued path	Added — `test_failed_last_issued_excluded_from_window_end` drives the live `ISSUED→ERROR→COMPLETE` event stream through the aggregator and asserts (a) the window end anchors to the successful sample (failed last-issued excluded) and (b) `tracked_samples_completed - tracked_samples_failed == 1`.

Branch is rebased on top of origin/main; ready for review.

(Findings were posted as a single summary comment, so there are no inline threads to resolve via the UI — this comment records the resolution.)

viraatc · 2026-06-23T19:16:32Z

Update — finding #3 (error handling) revised

Per maintainer direction, LoadGen mode now handles errors exactly as endpoints already does — no special-casing:

Numerator (n_samples_completed - 1) — failed-but-completed requests count (same as legacy n_samples_completed); no failure subtraction.
Window anchors to the last-issued request that completed, failed or not.
Removed the SampleRow.failed flag + mark_sample_failed (net code reduction); n_samples_failed is still reported, just unused for QPS/TPS.

This also matches LoadGen itself — verified against loadgen source: QuerySamplesComplete (loadgen.cc) counts every reported-complete sample with no success/failure/status filtering, QuerySampleResponse has no error field, and sample_count / final_query_all_samples_done_time derive purely from reported completions (accuracy is judged only in the separate accuracy mode). A request that errors and never emits COMPLETE still does not extend the window (no COMPLETE → not in the done-set) — consistent with both endpoints and LoadGen.

Branch rebased on origin/main and updated.

viraatc · 2026-06-23T20:09:40Z

Redesigned per review: the boolean legacy_streaming_tps is replaced by an explicit throughput_report_type enum (loadgen default / endpoints). LoadGen vs native is now scenario-aware with a single consistent fallback (offline / multi-phase / missing-window / <2 completions all use the native window for both QPS and TPS). Per-phase token attribution was intentionally not added — async OSL tokenization cannot be cheaply attributed per phase, so multi-phase falls back to native (documented in the flag help); reachable runs have one perf phase so always get the exact LoadGen view. All accounting is in the aggregator subprocess; the timed load-generator path is untouched. See updated description.

viraatc · 2026-06-23T21:09:47Z

Simplified to a single boolean use_legacy_loadgen_qps_metrics (default True). The earlier enum / multi-phase / offline machinery was removed (net -414 lines vs the prior revision) — LoadGen window is one counter, and the report falls back to the native window when it is unavailable. See the updated description.

…QPS/TPS Adds `use_legacy_loadgen_qps_metrics` on the poisson load pattern (default True). When on, QPS/TPS use the legacy MLPerf LoadGen Server "completed" definition: (completed-1)/T and tokens/T, where T spans from the first issued request to the completion of the last-issued request (see mlcommons/inference loadgen/results.cc). Disable via --no-use-legacy-loadgen-qps-metrics for endpoints-native completed/duration over the full run. Scoped to poisson (serialized out of the offline/concurrency templates) and gated to poisson in execute.py, so other patterns stay native. The metrics-aggregator subprocess emits one extra counter (legacy_loadgen_window_duration_ns); the timed load-generator path is unchanged. Falls back to the native window when the legacy window is unavailable (snapshots predating this change, or <2 completions). TODO(vir) to deprecate once a formal tail-cutting mechanism lands. Ref: mlcommons/inference loadgen/results.cc (Server scenario, completed). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions Bot requested review from arekay-nv and nvzhihanj June 23, 2026 09:22

gemini-code-assist Bot reviewed Jun 23, 2026

View reviewed changes

viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch 2 times, most recently from 0751e69 to 40dbf89 Compare June 23, 2026 10:08

viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from 40dbf89 to d44bb21 Compare June 23, 2026 18:07

viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from d44bb21 to fec276f Compare June 23, 2026 19:15

viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch 2 times, most recently from 8bacc8a to 3f5e8b5 Compare June 23, 2026 20:08

viraatc changed the title ~~feat(metrics): add legacy_streaming_tps toggle for LoadGen-parity QPS/TPS~~ feat(metrics): add throughput_report_type for LoadGen-parity QPS/TPS Jun 23, 2026

viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from 3f5e8b5 to 39a99b2 Compare June 23, 2026 21:09

viraatc changed the title ~~feat(metrics): add throughput_report_type for LoadGen-parity QPS/TPS~~ feat(metrics): add use_legacy_loadgen_qps_metrics for LoadGen-parity QPS/TPS Jun 23, 2026

viraatc marked this pull request as ready for review June 24, 2026 00:13

viraatc requested review from a team and nv-alicheng June 24, 2026 00:13

nvzhihanj reviewed Jun 24, 2026

View reviewed changes

Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py Outdated

nvzhihanj reviewed Jun 24, 2026

View reviewed changes

Comment thread src/inference_endpoint/config/templates/concurrency_template_full.yaml Outdated

nvzhihanj reviewed Jun 24, 2026

View reviewed changes

Comment thread src/inference_endpoint/config/templates/offline_template_full.yaml Outdated

nvzhihanj reviewed Jun 24, 2026

View reviewed changes

Comment thread src/inference_endpoint/metrics/report.py Outdated

viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch 3 times, most recently from 75132b0 to 8c322de Compare June 24, 2026 07:53

viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from 8c322de to 53566b6 Compare June 24, 2026 07:55

viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from 53566b6 to 4a6ceda Compare June 24, 2026 10:03

Conversation

viraatc commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What

Notes

Uh oh!

github-actions Bot commented Jun 23, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

viraatc commented Jun 23, 2026

Review Council — Multi-AI Code Review

Uh oh!

viraatc commented Jun 23, 2026

Review Council — all findings addressed ✅

Uh oh!

viraatc commented Jun 23, 2026

Update — finding #3 (error handling) revised

Uh oh!

viraatc commented Jun 23, 2026

Uh oh!

viraatc commented Jun 23, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

viraatc commented Jun 23, 2026 •

edited

Loading