Skip to content

feat(metrics): add use_legacy_loadgen_qps_metrics for LoadGen-parity QPS/TPS#372

Open
viraatc wants to merge 1 commit into
mainfrom
feat/viraatc-loadgen-tps-parity
Open

feat(metrics): add use_legacy_loadgen_qps_metrics for LoadGen-parity QPS/TPS#372
viraatc wants to merge 1 commit into
mainfrom
feat/viraatc-loadgen-tps-parity

Conversation

@viraatc

@viraatc viraatc commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

What

Adds use_legacy_loadgen_qps_metrics (bool, default True). When on, QPS/TPS use the MLPerf LoadGen Server "completed" definition:

  • QPS = (completed - 1) / T, TPS = tokens / T
  • T = first issued request -> completion of the last-issued request (LoadGen final_query_all_samples_done_time; mlcommons/inference loadgen/results.cc).

Disable with --no-use-legacy-loadgen-qps-metrics for endpoints-native completed/duration and tokens/duration (window to the last request to finish) -- byte-identical to prior behavior.

Notes

  • The metrics-aggregator subprocess emits one extra counter (loadgen_window_duration_ns); the timed load-generator path is unchanged (no new events, no per-query work).
  • Falls back to the native window when the LoadGen window is unavailable (snapshots predating this change, or <2 completions).
  • Intended for server/online runs.

~100 lines of source + tests; 261 unit tests pass, mypy/ruff/pre-commit clean.

@github-actions github-actions Bot requested review from arekay-nv and nvzhihanj June 23, 2026 09:22
@github-actions

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for MLPerf LoadGen Server 'completed' window throughput metrics, including a configuration toggle (legacy_streaming_tps) to switch between legacy and LoadGen-aligned throughput calculations. However, several configuration templates were updated to use agentic_inference and agentic_inference_inline which are not yet supported by the schema in schema.py, leading to validation failures. Additionally, a potential state inconsistency was identified in metrics_table.py when handling out-of-bounds tracking block indices.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/inference_endpoint/config/templates/offline_template_full.yaml
Comment thread src/inference_endpoint/config/templates/offline_template_full.yaml
Comment thread src/inference_endpoint/config/templates/online_template_full.yaml
Comment thread src/inference_endpoint/config/templates/online_template_full.yaml
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/metrics_table.py Outdated
Comment thread src/inference_endpoint/config/templates/concurrency_template.yaml
Comment thread src/inference_endpoint/config/templates/online_template.yaml
@viraatc viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch 2 times, most recently from 0751e69 to 40dbf89 Compare June 23, 2026 10:08
@viraatc

viraatc commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Review Council — Multi-AI Code Review

Reviewed by: Codex (gpt-5.5, xhigh) + Claude | Depth: standard | Commit: 40dbf898

Found 4 findings (1 already fixed in this push). No critical/high blockers — the core numerator/denominator logic faithfully matches loadgen/results.cc (Server completed). Posting as a synthesized summary; items 2–3 are design-level decisions for the author rather than line bugs.

# File Severity Category Reviewer(s) Finding
1 config/templates/*.yaml medium bug fixed Codex Resolved in 40dbf898. Templates had been regenerated by a system-python pre-commit run against a newer installed package (agentic_inference, drain_timeout 300) and omitted the new field — inconsistent with this branch's multi_turn enum. Regenerated from the branch's editable env: now multi_turn + legacy_streaming_tps: true, consistent.
2 metrics_table.py (total_loadgen_window_ns / block-start) medium bug Claude LoadGen window starts at START_PERFORMANCE_TRACKING, not first issue. For poisson/online, the first sample fires at start + first_delay, so the window includes one pre-issue inter-arrival gap → LoadGen QPS biased low by ~1/(N-1) (negligible at large N, material for short runs). max_throughput/concurrency unaffected. LoadGen schedules its first Server query at delta 0. For exact parity, anchor the window start at the first issued sample's issued_ns.
3 metrics_table.py (_update_tracked_block) + schema.py help low design Codex + Opus Failed/hung last-issued request: the window anchors to the last successfully-completed issued request (graceful) rather than rendering N/A. Deliberate (avoids fragile, non-reproducible N/A on a single tail failure; identical to LoadGen on clean runs). Note: the flag help says "last-issued" — should read "last successfully-completed issued request" to match the code.
4 tests/ low testing Claude No aggregator-level ISSUED→ERROR→COMPLETE integration test exercising the (completed-failed) numerator and failed-last-issued exclusion end-to-end (unit tests cover mark_sample_failed and the window directly, but not the live event stream).

Disposition: #1 fixed. #2 (Poisson start bias) and #3 (failed-last-issued philosophy + help wording) are author calls — see PR thread. #4 is a nice-to-have follow-up.

@viraatc viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from 40dbf89 to d44bb21 Compare June 23, 2026 18:07
@viraatc

viraatc commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Review Council — all findings addressed ✅

Rebased onto current main (dad7d1a6); branch now at d44bb211 (clean rebase, no conflicts). 169 unit tests pass; mypy/ruff/pre-commit clean.

# Finding Resolution
1 Templates inconsistent with branch enum (agentic_inference) + missing the new field Resolved — rebased onto current main and regenerated from the branch's editable env. Templates now add only legacy_streaming_tps: true, consistent with main's schema. (Original churn was a language: system pre-commit hook running system-python against a newer install.)
2 Poisson window-start bias (~1/(N-1) low) Fixed — the LoadGen window now starts at the first issued request (LoadGen t=0 / delta 0), not the START_PERFORMANCE_TRACKING timestamp. metrics_table.py: _loadgen_window_start_ns captured on first tracked ISSUED.
3 Failed/hung last-issued handling + stale help text Resolved — kept graceful anchoring to the last successfully-completed issued request (identical to LoadGen on clean runs, where every query completes; avoids fragile non-reproducible N/A on a single tail failure). The legacy_streaming_tps help text is corrected to describe the actual behavior (N/A only when no request completed successfully, or multi-phase runs).
4 No end-to-end aggregator test for the failed-last-issued path Addedtest_failed_last_issued_excluded_from_window_end drives the live ISSUED→ERROR→COMPLETE event stream through the aggregator and asserts (a) the window end anchors to the successful sample (failed last-issued excluded) and (b) tracked_samples_completed - tracked_samples_failed == 1.

Branch is rebased on top of origin/main; ready for review.

(Findings were posted as a single summary comment, so there are no inline threads to resolve via the UI — this comment records the resolution.)

@viraatc viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from d44bb21 to fec276f Compare June 23, 2026 19:15
@viraatc

viraatc commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Update — finding #3 (error handling) revised

Per maintainer direction, LoadGen mode now handles errors exactly as endpoints already does — no special-casing:

  • Numerator (n_samples_completed - 1) — failed-but-completed requests count (same as legacy n_samples_completed); no failure subtraction.
  • Window anchors to the last-issued request that completed, failed or not.
  • Removed the SampleRow.failed flag + mark_sample_failed (net code reduction); n_samples_failed is still reported, just unused for QPS/TPS.

This also matches LoadGen itself — verified against loadgen source: QuerySamplesComplete (loadgen.cc) counts every reported-complete sample with no success/failure/status filtering, QuerySampleResponse has no error field, and sample_count / final_query_all_samples_done_time derive purely from reported completions (accuracy is judged only in the separate accuracy mode). A request that errors and never emits COMPLETE still does not extend the window (no COMPLETE → not in the done-set) — consistent with both endpoints and LoadGen.

Branch rebased on origin/main and updated.

@viraatc viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch 2 times, most recently from 8bacc8a to 3f5e8b5 Compare June 23, 2026 20:08
@viraatc viraatc changed the title feat(metrics): add legacy_streaming_tps toggle for LoadGen-parity QPS/TPS feat(metrics): add throughput_report_type for LoadGen-parity QPS/TPS Jun 23, 2026
@viraatc

viraatc commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Redesigned per review: the boolean legacy_streaming_tps is replaced by an explicit throughput_report_type enum (loadgen default / endpoints). LoadGen vs native is now scenario-aware with a single consistent fallback (offline / multi-phase / missing-window / <2 completions all use the native window for both QPS and TPS). Per-phase token attribution was intentionally not added — async OSL tokenization cannot be cheaply attributed per phase, so multi-phase falls back to native (documented in the flag help); reachable runs have one perf phase so always get the exact LoadGen view. All accounting is in the aggregator subprocess; the timed load-generator path is untouched. See updated description.

@viraatc viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from 3f5e8b5 to 39a99b2 Compare June 23, 2026 21:09
@viraatc viraatc changed the title feat(metrics): add throughput_report_type for LoadGen-parity QPS/TPS feat(metrics): add use_legacy_loadgen_qps_metrics for LoadGen-parity QPS/TPS Jun 23, 2026
@viraatc

viraatc commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator Author

Simplified to a single boolean use_legacy_loadgen_qps_metrics (default True). The earlier enum / multi-phase / offline machinery was removed (net -414 lines vs the prior revision) — LoadGen window is one counter, and the report falls back to the native window when it is unavailable. See the updated description.

@viraatc viraatc marked this pull request as ready for review June 24, 2026 00:13
@viraatc viraatc requested review from a team and nv-alicheng June 24, 2026 00:13
Comment thread src/inference_endpoint/async_utils/services/metrics_aggregator/aggregator.py Outdated
Comment thread src/inference_endpoint/config/templates/concurrency_template_full.yaml Outdated
Comment thread src/inference_endpoint/config/templates/offline_template_full.yaml Outdated
Comment thread src/inference_endpoint/metrics/report.py Outdated
@viraatc viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch 3 times, most recently from 75132b0 to 8c322de Compare June 24, 2026 07:53
…QPS/TPS

Adds `use_legacy_loadgen_qps_metrics` on the poisson load pattern (default True).
When on, QPS/TPS use the legacy MLPerf LoadGen Server "completed" definition:
(completed-1)/T and tokens/T, where T spans from the first issued request to the
completion of the last-issued request (see mlcommons/inference loadgen/results.cc).
Disable via --no-use-legacy-loadgen-qps-metrics for endpoints-native
completed/duration over the full run.

Scoped to poisson (serialized out of the offline/concurrency templates) and gated
to poisson in execute.py, so other patterns stay native. The metrics-aggregator
subprocess emits one extra counter (legacy_loadgen_window_duration_ns); the timed
load-generator path is unchanged. Falls back to the native window when the legacy
window is unavailable (snapshots predating this change, or <2 completions).
TODO(vir) to deprecate once a formal tail-cutting mechanism lands.

Ref: mlcommons/inference loadgen/results.cc (Server scenario, completed).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@viraatc viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from 8c322de to 53566b6 Compare June 24, 2026 07:55
@viraatc viraatc force-pushed the feat/viraatc-loadgen-tps-parity branch from 53566b6 to 4a6ceda Compare June 24, 2026 10:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants