Skip to content

Feat: Add required fields (TPS/QPS/seed) into results_summary.json#353

Merged
nvzhihanj merged 7 commits into
mainfrom
feat/slim-benchmark-output-artifacts
Jun 24, 2026
Merged

Feat: Add required fields (TPS/QPS/seed) into results_summary.json#353
nvzhihanj merged 7 commits into
mainfrom
feat/slim-benchmark-output-artifacts

Conversation

@nvzhihanj

@nvzhihanj nvzhihanj commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Narrowed scope: this PR now covers only the result_summary.json / report.txt change.

What & why

result_summary.json and report.txt are two renderings of the same Report. report.txt has no programmatic consumers, and result_summary.json was missing the computed QPS/TPS (callers had to recompute them from duration + counts).

Changes

  • result_summary.json is now self-complete. Report.to_json injects the derived qps/tps into the serialized JSON. They remain methods (not stored fields) to avoid duplicating data already in duration_ns / counters / OSL, so no Report constructor sites change.
  • Docs: docs/LOCAL_TESTING.mdresult_summary.json described as the primary, self-complete report; dropped report.txt and the never-written runtime_settings.json from the artifacts list.

Tests

  • result_summary.json serializes qps/tps (concrete value) and serializes them as null when there is no duration.
  • Full suite + pre-commit green.

Report.txt is not yet deprecated, we will decide later in the cycle.

🤖 Generated with Claude Code

@nvzhihanj nvzhihanj requested a review from a team as a code owner June 12, 2026 05:48
@github-actions

github-actions Bot commented Jun 12, 2026

Copy link
Copy Markdown

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@github-actions github-actions Bot requested a review from arekay-nv June 12, 2026 05:48

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new truncate-results CLI command to shrink large benchmark results files by keeping a few full responses and hashing the rest. It also optimizes artifact generation by only running the event logger and saving event logs or sample index maps when accuracy scoring is active or when explicitly requested via configuration flags. Additionally, derived throughput metrics (qps and tps) are now directly serialized into the JSON report. The review feedback suggests several robust improvements: implementing atomic file writes and better error handling in the truncation command, wrapping the tmpfs salvage process in a try-except block to ensure resource cleanup, and using msgspec.structs.asdict for a more efficient struct-to-dict conversion.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment thread src/inference_endpoint/commands/truncate_results.py Outdated
Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread src/inference_endpoint/commands/truncate_results.py Outdated
Comment thread src/inference_endpoint/commands/truncate_results.py Outdated
Comment thread src/inference_endpoint/metrics/report.py Outdated
@nvzhihanj nvzhihanj changed the title feat: slim benchmark output artifacts Draft: feat: slim benchmark output artifacts Jun 12, 2026
@nvzhihanj nvzhihanj marked this pull request as draft June 12, 2026 06:30
@nvzhihanj nvzhihanj force-pushed the feat/slim-benchmark-output-artifacts branch from 7a76518 to 8557e26 Compare June 12, 2026 06:42
@nvzhihanj nvzhihanj changed the title Draft: feat: slim benchmark output artifacts feat: make result_summary.json self-complete; drop report.txt Jun 12, 2026
@nvzhihanj nvzhihanj marked this pull request as ready for review June 12, 2026 06:44
@nvzhihanj nvzhihanj changed the title feat: make result_summary.json self-complete; drop report.txt Feat: Drop report.txt and use results_summary.json Jun 12, 2026
@nvzhihanj nvzhihanj marked this pull request as draft June 12, 2026 20:03
@nvzhihanj nvzhihanj changed the base branch from release/v0.5 to main June 22, 2026 06:35
@nvzhihanj nvzhihanj force-pushed the feat/slim-benchmark-output-artifacts branch from 0e42073 to e966271 Compare June 22, 2026 06:37
@nvzhihanj nvzhihanj changed the title Feat: Drop report.txt and use results_summary.json Feat: Add required fields into results_summary.json Jun 24, 2026
@nvzhihanj nvzhihanj changed the title Feat: Add required fields into results_summary.json Feat: Add required fields (TPS/QPS/seed) into results_summary.json Jun 24, 2026
@nvzhihanj nvzhihanj marked this pull request as ready for review June 24, 2026 05:42
nvzhihanj and others added 7 commits June 23, 2026 22:55
result_summary.json is now the single source of truth for the run summary.
Report.to_json injects the derived qps/tps so the JSON is self-complete and
consumers no longer recompute them from duration + counts. report.txt (a
human-readable dump of the same Report, with no programmatic consumers) is no
longer written — the same summary is still logged to the console.

Tests: result_summary serializes qps/tps (and null when there is no duration);
an end-to-end perf run confirms result_summary.json is self-complete and no
report.txt is produced.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Asserting that report.txt is *not* written guards nothing meaningful —
the code simply no longer writes it. Keep the assertions that matter
(result_summary.json carries qps/tps, i.e. it is self-complete) and
rename the test to test_result_summary_self_complete.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the `msgspec.json.decode(msgspec.json.encode(self))` round-trip in
Report.to_json with `msgspec.to_builtins(self)` — same wire-shaped dict
(honors the encoder config, so the JSON schema is unchanged) without the
encode→decode→encode detour. Addresses the reviewer note on this line;
note that the suggested `msgspec.structs.asdict` would NOT be schema-safe
(it uses Python attribute names rather than the encoder's field names).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Compute qps/tps once in Report.from_snapshot and store them as struct fields
(default None) so they serialize naturally — to_json is a plain encode again,
with no to_builtins/dict round-trip and no method-vs-JSON-key divergence
(msgspec.json.encode(report) now matches report.to_json()). qps()/tps()
methods are dropped; display() and execute.py read the attributes. Production
only builds Report via from_snapshot, so the fields are always computed there.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A run is only reproducible/validatable if its RNG seeds are recorded, so
surface them in the report. Adds a keyword-only `seeds` to
Report.from_snapshot (and a `seeds: dict[str,int] | None` field): the seeds
are config (RuntimeConfig.scheduler_random_seed / dataloader_random_seed),
not a measured metric, so the caller in finalize supplies them rather than
reading them from the metrics snapshot. Serialized into result_summary.json
and shown in display() as a 'Seeds:' line. Unit tests cover passthrough,
JSON round-trip, and the absent->null case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Review-council hardening of the self-complete report:
- Restore report.txt (the full human-readable histogram/percentile dump);
  result_summary.json stays the self-complete machine-readable report.
- Capture all three RNG seeds (scheduler/dataloader/warmup) so the report
  fully identifies a reproducible run; fix the now-inaccurate seeds comment.
- docs/load_generator/DESIGN.md: Report.qps is a field now, not a method
  (report.qps() would TypeError) -> use 'report.qps or 0'.
- Re-add report.txt to docs/LOCAL_TESTING.md artifacts; assert report.txt is
  written in the self-complete integration test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
msgpack 1.1.2 carries GHSA-6v7p-g79w-8964 (fixed in 1.2.1); the audit CI
job fails on it. msgpack is a transitive dev-only dep (pip-audit ->
cachecontrol -> msgpack) and unused at runtime, so this is a lockfile-only
bump. Verified: pip-audit reports no known vulnerabilities after the bump.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@nvzhihanj nvzhihanj force-pushed the feat/slim-benchmark-output-artifacts branch from 1fa52c0 to e537363 Compare June 24, 2026 06:09
@nvzhihanj nvzhihanj merged commit 8480f39 into main Jun 24, 2026
8 checks passed
@github-actions github-actions Bot locked and limited conversation to collaborators Jun 24, 2026
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants