feat(compliance): MLPerf TEST04 caching audit — extensible AuditTest framework by wu6u3tw · Pull Request #332 · mlcommons/endpoints

wu6u3tw · 2026-06-03T20:53:15Z

Summary

Adds an extensible MLPerf compliance-audit framework with TEST04 (caching detection) as the first test, driven by an audit: block in the benchmark YAML. This PR carries the full redesign: the approved design plan, the implementation, tests, and runnable WAN2.2 examples.

TEST04 issues one fixed sample for every query in an audit phase; if repeating an identical request makes the SUT meaningfully faster, it is serving from cache. Pass iff the audit run is at most 10% faster than the reference (matching upstream compliance/TEST04/verify_performance.py).
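The pass rule above reduces to a single comparison. A minimal sketch, assuming throughput is reported as queries per second; the function name `passes_caching_check` is hypothetical and not taken from the PR:

```python
def passes_caching_check(ref_qps: float, audit_qps: float, threshold: float = 0.10) -> bool:
    """Return True when the audit phase is at most `threshold` faster than the reference."""
    if ref_qps <= 0:
        # A non-positive reference throughput cannot be compared meaningfully.
        raise ValueError("reference QPS must be positive")
    return audit_qps <= ref_qps * (1.0 + threshold)


print(passes_caching_check(100.0, 105.0))  # 5% faster: within the allowance
print(passes_caching_check(100.0, 150.0))  # 50% faster: looks like cached responses
```

An audit phase that is slower than the reference always passes, since the check only bounds how much faster the repeated request may be.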

Design (the two axes)

Axis A — run modification: expressed as a generic typed SampleOrderSpec (WITHOUT_REPLACEMENT | SINGLE(index)) carried on a RunSpec. No test-specific knowledge leaks into the load generator.
Axis B — verification: a pure post-run check, AuditTest.verify(runs) -> AuditVerdict, registered per test.
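To make Axis A concrete, here is a rough sketch of a typed sample-order spec and a generator that honors it. Only the names SampleOrderSpec, WITHOUT_REPLACEMENT, and SINGLE come from the description; the `SampleOrderKind` enum, the field layout, the `sample_indices` helper, and the reshuffle-per-pass behavior are assumptions made for illustration:

```python
import random
from dataclasses import dataclass
from enum import Enum
from typing import Iterator, Optional


class SampleOrderKind(Enum):
    WITHOUT_REPLACEMENT = "without_replacement"
    SINGLE = "single"


@dataclass(frozen=True)
class SampleOrderSpec:
    kind: SampleOrderKind = SampleOrderKind.WITHOUT_REPLACEMENT
    index: Optional[int] = None  # only meaningful for SINGLE

    def __post_init__(self) -> None:
        if self.kind is SampleOrderKind.SINGLE and self.index is None:
            raise ValueError("SINGLE order requires an index")


def sample_indices(spec: SampleOrderSpec, dataset_size: int, n_queries: int,
                   seed: int = 0) -> Iterator[int]:
    """Yield the dataset index for each query, following the spec."""
    if dataset_size <= 0:
        raise ValueError("dataset must contain at least one sample")
    if spec.kind is SampleOrderKind.SINGLE:
        if not 0 <= spec.index < dataset_size:
            raise IndexError(f"index {spec.index} outside dataset of {dataset_size}")
        for _ in range(n_queries):
            yield spec.index  # the same sample for every query
        return
    rng = random.Random(seed)
    issued = 0
    while issued < n_queries:
        order = list(range(dataset_size))
        rng.shuffle(order)  # assumption: one fresh permutation per pass over the dataset
        for i in order:
            if issued == n_queries:
                return
            yield i
            issued += 1


print(list(sample_indices(SampleOrderSpec(SampleOrderKind.SINGLE, 3), 10, 4)))
```

The load generator only sees the spec, so it needs no knowledge of which audit test asked for a fixed sample.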

A generic orchestrator (commands/audit.py::run_audit) runs each RunSpec phase back-to-back via the existing setup_benchmark / run_benchmark_async path, then verifies and writes the verdict. Adding TEST01/06/07/09 later is a new registry entry, not cross-cutting edits.
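The shape of that orchestrator can be sketched as follows. This is an illustration under assumptions, not the code in commands/audit.py: the field names on `RunSpec`, `RunStats`, and `AuditVerdict`, the `register` decorator, the `Test04Sketch` class, and the injected `run_phase` callable (standing in for the setup_benchmark / run_benchmark_async path) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol, Sequence


@dataclass(frozen=True)
class RunSpec:
    name: str
    n_samples: int
    fixed_index: Optional[int] = None  # None means the normal sample order


@dataclass(frozen=True)
class RunStats:
    name: str
    qps: float


@dataclass(frozen=True)
class AuditVerdict:
    test: str
    passed: bool
    detail: str


class AuditTest(Protocol):
    def plan_runs(self) -> Sequence[RunSpec]: ...
    def verify(self, runs: Sequence[RunStats]) -> AuditVerdict: ...


REGISTRY: dict = {}  # test id -> zero-argument factory returning an AuditTest


def register(test_id: str):
    def decorator(cls):
        REGISTRY[test_id] = cls
        return cls
    return decorator


@register("test04")
class Test04Sketch:
    threshold = 0.10

    def plan_runs(self) -> Sequence[RunSpec]:
        return [RunSpec("reference", 64), RunSpec("audit", 64, fixed_index=3)]

    def verify(self, runs: Sequence[RunStats]) -> AuditVerdict:
        by_name = {r.name: r for r in runs}
        ref, audit = by_name["reference"], by_name["audit"]
        passed = audit.qps <= ref.qps * (1.0 + self.threshold)
        return AuditVerdict("test04", passed, f"ref={ref.qps:.2f} audit={audit.qps:.2f}")


def run_audit(test_id: str, run_phase: Callable[[RunSpec], RunStats]) -> AuditVerdict:
    """Run every planned phase back-to-back, then hand the stats to verify()."""
    test = REGISTRY[test_id]()
    runs = [run_phase(spec) for spec in test.plan_runs()]
    return test.verify(runs)


fake_phase = lambda spec: RunStats(spec.name, 100.0 if spec.fixed_index is None else 104.0)
print(run_audit("test04", fake_phase))
```

In this layout a further test would only need another `@register(...)` class; `run_audit` itself stays untouched.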

Config shape

audit:
  test: test04
  samples: 64         # reference phase query count
  audit_samples: 64   # audit (fixed-sample) phase count
  sample_index: 3     # MLCommons performance_issue_same_index
  threshold: 0.10     # audit qps must stay <= ref qps * (1 + threshold)

AuditConfig is a discriminated-union-ready sub-model on BenchmarkConfig (parallel to AccuracyConfig) — no DatasetType.AUDIT, no audit fields polluting Dataset, no test04 boolean in RuntimeSettings.
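As an illustration of how the audit: block might be validated once it is parsed from YAML, the sketch below uses a plain dataclass as a stand-in for the real sub-model. The class name `Test04ConfigSketch`, the helper `parse_audit_block`, the specific validation rules, and the defaults (copied from the example values above) are assumptions:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Test04ConfigSketch:
    samples: int = 64          # reference phase query count
    audit_samples: int = 64    # fixed-sample phase query count
    sample_index: int = 3      # which dataset sample to repeat
    threshold: float = 0.10    # allowed relative speed-up

    def __post_init__(self) -> None:
        if self.samples <= 0 or self.audit_samples <= 0:
            raise ValueError("samples and audit_samples must be positive")
        if self.sample_index < 0:
            raise ValueError("sample_index must be non-negative")
        if self.threshold < 0:
            raise ValueError("threshold must be non-negative")


def parse_audit_block(block: dict) -> Test04ConfigSketch:
    """Dispatch on the `test` key, then validate the remaining parameters."""
    params = dict(block)
    test = params.pop("test", None)
    if test != "test04":
        raise ValueError(f"unsupported audit test: {test!r}")
    return Test04ConfigSketch(**params)


cfg = parse_audit_block(
    {"test": "test04", "samples": 64, "audit_samples": 64, "sample_index": 3, "threshold": 0.10}
)
print(cfg)
```

Dispatching on the `test` key is what leaves room for a discriminated union: each future test id would map to its own parameter model.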

What's included

compliance/__init__.py — AuditTest protocol + RunSpec/RunStats/RunArtifacts + registry
compliance/verdict.py — AuditVerdict + atomic write_verdict (tmp → fsync → rename → fsync)
compliance/tests/test04.py — Test04Audit + verify_test04
commands/audit.py — generic run_audit orchestrator
config/schema.py — AuditTestId + Test04Config/AuditConfig + BenchmarkConfig.audit
load_generator — SampleOrderSpec + SingleSampleOrder + factory dispatch
Unit tests + e2e integration test (offline + single-stream) against the echo server
docs/compliance_audit_plan.md — the design plan
WAN2.2 submission examples: offline_wan22_submission.yaml, single_stream_wan22_submission.yaml
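The tmp → fsync → rename → fsync sequence listed for write_verdict is a common durable-write pattern. A minimal POSIX-oriented sketch, assuming a JSON payload; the function name `write_json_atomic` is hypothetical and this is not the PR's implementation:

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so readers see either the old file or the complete new one."""
    path = Path(path)
    directory = path.parent
    # 1. Write to a temp file in the same directory (same filesystem as the target).
    fd, tmp_name = tempfile.mkstemp(dir=directory, prefix=path.name + ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp, indent=2, sort_keys=True)
            tmp.flush()
            os.fsync(tmp.fileno())      # 2. Flush file contents to disk.
        os.replace(tmp_name, path)      # 3. Atomic rename over the target.
    except BaseException:
        try:
            os.unlink(tmp_name)         # Do not leave a stray temp file behind.
        except FileNotFoundError:
            pass
        raise
    # 4. Flush the directory entry so the rename itself survives a crash (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)


with tempfile.TemporaryDirectory() as workdir:
    target = Path(workdir) / "audit_verdict.json"
    write_json_atomic(target, {"test": "test04", "passed": True})
    print(target.read_text())
```

Creating the temp file next to the target matters because a rename is only atomic within one filesystem.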

Exit codes

benchmark from-config with an audit: block exits 0 (PASS) / 1 (FAIL); errors propagate via the standard handler using the repo-wide scheme (InputValidationError → 2, SetupError → 3, ExecutionError → 4). The on-disk audit_verdict.json is the durable record.
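The exit-code contract can be summarized in a few lines. In this sketch the three exception classes are local stand-ins named after the ones in the description, and `exit_code_for` is a hypothetical helper rather than the repo's actual handler:

```python
from typing import Callable


class InputValidationError(Exception):
    pass


class SetupError(Exception):
    pass


class ExecutionError(Exception):
    pass


# Error type -> process exit code, as listed in the description.
EXIT_CODES = {InputValidationError: 2, SetupError: 3, ExecutionError: 4}


def exit_code_for(run: Callable[[], bool]) -> int:
    """Map an audit run to an exit code: 0 for PASS, 1 for FAIL, 2-4 for errors."""
    try:
        passed = run()
    except (InputValidationError, SetupError, ExecutionError) as exc:
        for exc_type, code in EXIT_CODES.items():
            if isinstance(exc, exc_type):
                return code
        raise
    return 0 if passed else 1


def failing_setup() -> bool:
    raise SetupError("endpoint unreachable")


print(exit_code_for(lambda: True), exit_code_for(lambda: False), exit_code_for(failing_setup))
```

Keeping FAIL (1) distinct from the error codes lets a caller tell "the SUT appears to cache" apart from "the audit could not run".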

Testing

Unit + integration green; pre-commit run --all-files clean. The e2e test exercises the full audit: → run_audit → AuditVerdict flow for both max_throughput (offline) and concurrency=1 (single-stream).

🤖 Generated with Claude Code

github-actions · 2026-06-03T20:53:31Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request implements the MLPerf TEST04 compliance audit to detect result caching by repeatedly issuing a single fixed sample and comparing the throughput against a reference run. It introduces configuration options, validation guards, a SingleSampleOrder generator, and a compliance verification module with a CLI tool and tests. The review feedback focuses on improving the robustness of the compliance verifier, specifically by handling potential OSError exceptions during file writes, catching AttributeError when parsing non-dictionary JSON configurations, and gracefully handling malformed snapshot files during parsing.

…review

Address gemini-code-assist review on PR mlcommons#332:
- CLI catches OSError (PermissionError etc.) and write_verdict failures, not just FileNotFoundError/ValueError; all map to exit 2.
- _audit_marker tolerates non-dict results.json (isinstance guards) instead of raising AttributeError.
- _run_stats_from_dir rejects a non-dict snapshot with a clear ValueError.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

wu6u3tw · 2026-06-04T22:57:49Z

Update summary

All review feedback has been addressed. Here is what changed since the original submission:

Architecture (main concern)

audit: test04 now runs both phases in a single command — reference run then audit run back-to-back against the same endpoint, with automatic comparison and verdict output. No more 3-step manual workflow.

Config shape

type: audit dataset replaces the old settings.runtime.test04_sample_index and audit_n_samples runtime variables. Reference and audit sample counts are now independent and co-located with the dataset config — consistent with how type: accuracy datasets carry their own accuracy_config.

audit: "test04"

datasets:
  - name: wan22_prompts
    path: wan22_prompts.jsonl
    type: "performance"
    samples: 50          # reference phase query count (50–144)

  - name: wan22_audit
    path: wan22_prompts.jsonl
    type: "audit"
    samples: 25          # audit phase query count (25–50)
    audit_sample_index: 0

Robustness

Warning logged when audit: test04 is set but no type: audit dataset is present (previously silent fallback to index 0).
Phase failures (SetupError/ExecutionError) are caught and logged cleanly — no unhandled traceback, verdict not lost.
Report.from_snapshot wrapped in try/except in _run_stats_from_dir — malformed snapshots exit with code 2 instead of crashing.
Pre-flight audit_sample_index bounds check before dataset load.
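A pre-flight bounds check like the one in the last item could look roughly like this. It is a sketch under assumptions: the dataset is taken to be a JSONL file with one sample per non-blank line, and both helper names are hypothetical.

```python
import tempfile
from pathlib import Path


def count_jsonl_records(path: Path) -> int:
    """Count samples cheaply, without parsing or loading the dataset."""
    with Path(path).open("r", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def check_sample_index(sample_index: int, dataset_size: int) -> None:
    """Fail fast, before any phase runs, if the fixed sample does not exist."""
    if not 0 <= sample_index < dataset_size:
        raise ValueError(
            f"audit_sample_index={sample_index} is out of range "
            f"for a dataset of {dataset_size} samples"
        )


with tempfile.TemporaryDirectory() as workdir:
    dataset = Path(workdir) / "prompts.jsonl"
    dataset.write_text('{"prompt": "a"}\n{"prompt": "b"}\n{"prompt": "c"}\n', encoding="utf-8")
    size = count_jsonl_records(dataset)
    check_sample_index(2, size)  # valid: the last sample
    print(size)
```

Checking against the real record count, rather than the requested query counts, avoids spending a full reference run on an index that can never be issued.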

Testing

New e2e integration test (test_audit_test04_two_phase_flow) exercises the full run_benchmark → two-phase flow against the echo server and asserts both phase subdirs are created and the flow completes gracefully.

Example config

Renamed offline_wan22_test04.yaml → wan22_audit_test04.yaml per review suggestion.

wu6u3tw · 2026-06-05T21:26:19Z

All review feedback has been addressed. Here's a summary of what changed:

Architecture
audit: test04 now runs reference and audit phases in a single command back-to-back against the same endpoint — no more 3-step workflow, no endpoint-change risk. A single type: audit dataset entry drives both phases (carrying ref_samples, audit_samples, audit_sample_index).

Sample counts & index
ref_samples: 50, audit_samples: 25 — sized for WAN2.2 throughput. audit_sample_index: 3 — fixed per MLCommons audit.config (performance_issue_same_index=3 for WAN2.2).

SingleStream
Added wan22_single_stream_test04.yaml (concurrency=1, ref/audit samples=20 matching MLCommons min_query_count).

Durations
Perf configs: min=10min, max=4hr. Audit configs: min=10min, max=2hr. The 10-min minimum documents MLCommons compliance intent; counts take priority in the current session stop logic, with AND-semantics available as a future improvement.

Robustness fixes (Gemini)

write_verdict moved inside try-except in CLI
_audit_marker uses isinstance guards — no AttributeError possible
Report.from_snapshot wrapped in try/except (KeyError, TypeError) in _run_stats_from_dir
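The isinstance-guard approach from these fixes can be illustrated as follows. The JSON layout (a top-level object with a `config` object holding an `audit` string) and the function name `read_audit_marker` are assumptions for the example, not the actual results.json schema:

```python
import json
from typing import Any, Optional


def read_audit_marker(raw: str) -> Optional[str]:
    """Return the audit marker, or None for any input that is not shaped as expected."""
    try:
        data: Any = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):      # e.g. a top-level list or string
        return None
    config = data.get("config")
    if not isinstance(config, dict):    # missing or wrong type
        return None
    marker = config.get("audit")
    return marker if isinstance(marker, str) else None


print(read_audit_marker('{"config": {"audit": "test04"}}'))
print(read_audit_marker('[1, 2, 3]'))
```

Each guard replaces a place where calling `.get` on a non-dict would otherwise raise AttributeError.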

Cleanup

Test renamed to test_audit_test04.py
README.md removed from diff (rebased onto main)
Orphaned type: audit datasets in non-TEST04 configs now emit a warning; multiple audit datasets raise InputValidationError

nvzhihanj

Review Council — first-principles design review

Reviewed by: Claude (Codex review timed out on this 2046-line diff at xhigh reasoning) · Depth: thorough

Focus: design issues warranting re-design for a modular, extensible audit-test framework (TEST04 is the first of several). 11 findings; see the tiered summary comment. The ref_samples dead-write (#1) was independently verified against the source.

nvzhihanj · 2026-06-07T04:48:42Z

Review Council — Multi-AI Code Review (first-principles design review)

Reviewed by: Claude · Depth: thorough
Codex review timed out on this 2046-line diff at xhigh reasoning (the load-gen + compliance surface is large); this pass is Claude-led. The one HIGH bug below was independently verified against the source.

Framing: TEST04 is the first MLPerf compliance/audit test and is meant to become a modular, extensible framework. The findings below are design-led — what would adding the next audit (TEST01/05) cost, and where does TEST04-specific knowledge leak into general-purpose code. 11 findings, all posted inline.

🔴 Re-design / Must-fix

| # | File | Line | Cat | Why it needs a re-design |
|---|------|------|-----|--------------------------|
| 1 | `commands/benchmark/execute.py` | 1151 | bug | `ref_samples` is a dead write. `Dataset.samples` is consumed nowhere; `ref_config` never sets `n_samples_to_issue`, so the reference phase runs duration-driven and ignores `ref_samples` while the audit phase honors `audit_samples` → the compared phases run mismatched counts. Set `n_samples_to_issue=ref_samples`. |
| 2 | `compliance/__init__.py` | 18 | design | No `AuditTest` abstraction. `run_benchmark` hardcodes `if audit==TEST04`; package exports only `test04_*`. Adding TEST01/05 = cross-cutting edits everywhere. Introduce an `AuditTest` protocol (`plan_runs`+`verify`) registered by `AuditMode`. |
| 3 | `config/schema.py` | 82 | design | `DatasetType.AUDIT` is a fake dataset type the loader ignores, carrying test params on the shared `Dataset` model, then converted to `PERFORMANCE`. Move params to a structured `audit:` block; drop the fake type. |
| 4 | `config/runtime_settings.py` | 90 | design | `test04` boolean leaks into core load-gen. `RuntimeSettings.test04`/`test04_sample_index` + `create_sample_order`'s `if settings.test04`. Use a generic sample-order strategy selector, not a per-test flag. |

🟡 Should-fix

| # | File | Line | Cat | Summary |
|---|------|------|-----|---------|
| 5 | `commands/benchmark/execute.py` | 113 | design | `_OVERRIDE_TEST04_SAMPLE_INDEX` stringly-typed magic kwarg through `**runtime_overrides`; pass a typed `run_spec` instead. |
| 6 | `commands/benchmark/execute.py` | 1146 | design | Two-phase `model_copy` surgery is fragile (root cause of #1; ref phase also skips `_validate_audit_test04`). Use a declarative `RunSpec` + validate before any phase runs. |
| 7 | `tests/integration/commands/test_benchmark_command.py` | 209 | testing | `_run_benchmark_test04` has no unit test; the one integration test asserts `verdict OR error` with `min_duration_ms=0`, the regime that hides bug #1. |
| 8 | `config/schema.py` | 666 | design | `audit` bare top-level enum; params scattered, threshold hardcoded. Use a structured compliance sub-config (like `accuracy_config`). |
| 9 | `compliance/test04.py` | 206 | design | QPS compared across phases with different counts/contents (upstream TEST04 uses the same query set); completion guard only protects the FAIL direction. Extends the existing fairness thread; compounded by #1. |

🔵 Consider

| # | File | Line | Cat | Summary |
|---|------|------|-----|---------|
| 10 | `compliance/test04.py` | 175 | design | `verify_test04_dirs` vs `verify_test04_from_reports` duplication; dir-swap guard in one path only. Collapse to one core + thin adapters. |
| 11 | `commands/benchmark/execute.py` | 446 | bug | `audit_sample_index` bound-checked vs requested counts, not the loaded dataset size, until phase 2, so an out-of-range index wastes a full reference run. |

Through-line: #1, #5, #6, #7 are all symptoms of the same root cause — TEST04 is bolted onto run_benchmark via per-phase config surgery and untyped overrides instead of a first-class audit-test abstraction (#2). Fixing #2/#3/#4 (an AuditTest that emits typed RunSpecs + a generic ordering strategy) would dissolve most of the others structurally.

Dedup: none overlap existing inline comments except #9, which extends the maintainer's existing fairness thread with upstream-parity / guard-direction substance.

Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review. Replaces the bolted-on TEST04 with a first-class, extensible AuditTest abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a single shared sample count (fair comparison), and verify() produces a typed verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow requirement to where the design resolves it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… header

Review Council (Claude) findings on PR mlcommons#332:
- examples hardcoded num_workers despite §8 claiming it was dropped; remove it (use endpoint default, per viraatc's request) so the traceability row is true
- single-stream header said '(independent counts)' but uses equal 20/20; align to '(equal counts here)' matching the offline sibling and §5/§8 framing

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… PR) Review Council (Claude, 2nd pass) on PR mlcommons#332: three §8 traceability rows described the example YAMLs as future ('land/dropped at implementation', 'plan doc only'), but the PR now ships offline_wan22_submission.yaml and single_stream_wan22_submission.yaml. Reword to reflect they're included. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

viraatc

lgtm, looking forward to impl!

nvzhihanj · 2026-06-13T00:17:41Z

+            │
+            │ 3. verify(runs) ; 4. write_verdict (atomic)
+            ▼
+   verify_TEST04.txt  +  audit_verdict.json


If possible, unify the output as JSON so it's easier to parse.

And do we need two files? Ideally one JSON file should be enough (containing all the information). On the traditional MLPerf side we can always use scripts to make it pass.

MLCommons run_verification.py takes a verify_.txt file, which requires the line "Performance check pass: True". We can remove the text file if MLCommons' run_verification.py changes.
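If the JSON verdict were the single source of truth, the text artifact could be derived from it by a small script, along the lines suggested in this thread. A rough sketch: the `passed` key and the helper name `legacy_verify_text` are hypothetical, and only the "Performance check pass: True" line format is taken from the comment above.

```python
import json


def legacy_verify_text(verdict: dict) -> str:
    """Render the one line the legacy text artifact is said to require."""
    passed = bool(verdict["passed"])
    return f"Performance check pass: {passed}\n"


verdict_json = '{"test": "test04", "passed": true}'
print(legacy_verify_text(json.loads(verdict_json)), end="")
```

Whether the upstream script needs more than this one line is not established here, so treat the output format as incomplete until checked against run_verification.py.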

Brings the design doc (docs/compliance_audit_plan.md, rebased on the latest mlcommons#332 web edits) and the two WAN2.2 submission example configs (offline + single-stream, using the audit: block) onto the implementation branch so the redesign ships as one coherent PR: framework + tests + doc + runnable examples. The doc's exit-code contract and module layout are corrected to match this implementation: samples/audit_samples/sample_index/threshold, no standalone verifier CLI, and errors via the repo-wide handler (SetupError → 3, ExecutionError → 4) rather than a flat exit 2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

wu6u3tw

Review Council — Multi-AI Code Review

Reviewed by: Claude (Codex unavailable — bwrap sandbox blocked, no sudo to relax userns) · Depth: thorough

4 findings posted inline. Verdict-correctness finding (threshold) is the one to prioritize.

wu6u3tw · 2026-06-22T23:51:50Z

History squashed to 3 commits

The branch was force-pushed (89179d5) to reorganize the feature into three focused commits. No content changed vs the prior tip — only the commit layout. Heads-up: earlier inline review comments may now show as outdated since they pointed at the previous commits.

| # | Commit | Summary |
|---|--------|---------|
| 1 | `docs(compliance): output-caching audit (MLPerf TEST04) design + examples` | Design plan (`docs/compliance_audit_plan.md`), the compliance entry in `AGENTS.md`, and the WAN2.2 Offline/SingleStream submission example configs (perf + accuracy + audit in one `from-config` run). |
| 2 | `feat(compliance): output-caching audit (MLPerf TEST04) implementation` | Generic `AuditTest` framework (`compliance/`): protocol + `RunSpec`/`RunStats`/`RunArtifacts` + registry; `OutputCachingAudit` implements TEST04 caching detection (reference vs fixed-sample phase; fails if audit QPS > reference × (1 + threshold)). `run_audit` orchestrator runs phases back-to-back, validates unpaced load + `sample_index`, refuses to certify an incomplete phase, and atomically writes `verify_OUTPUT_CACHING_TEST.txt` + `audit_result.json`. Wired via the YAML `audit:` block and a generic `SampleOrderSpec`/`SingleSampleOrder` seam. |
| 3 | `test(compliance): unit + integration tests for the output-caching audit` | Unit tests for the verify core, `plan_runs`/`verify`, `RunStats.from_report`, the `run_audit` guards (load-pattern allow-list, incomplete-phase abort, interrupt-skips-audit), `SampleOrderSpec`/`SingleSampleOrder`, and the atomic result writer; plus the end-to-end `audit:` flow (offline + single-stream). |

Each commit independently passes pre-commit run --all-files (ruff, ruff-format, mypy, prettier, template regen, license, uv.lock).

Naming note: this version names the test output_caching_test (AuditTestId.OUTPUT_CACHING_TEST, OutputCachingTestConfig, compliance/result.py); the artifact is verify_OUTPUT_CACHING_TEST.txt. "TEST04" is retained in prose/comments as the upstream MLPerf test this re-implements.

wu6u3tw · 2026-06-23T01:04:40Z

@viraatc @arekay-nv can you review this? The implementation is already updated in this PR.

Design plan (docs/compliance_audit_plan.md, incl. an ASCII program-flow diagram showing every decision gate and its exit code), the compliance-module entry in AGENTS.md, and the WAN2.2 Offline/SingleStream submission example configs (perf + accuracy + output_caching_test audit in one from-config run). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Generic AuditTest framework (compliance/): AuditTest protocol + RunSpec/RunStats/RunArtifacts + registry; OutputCachingAudit (compliance/tests/output_caching_test.py) implements MLPerf TEST04 caching detection — reference vs fixed-sample phase, fails if audit QPS exceeds reference QPS by > threshold. run_audit orchestrator (commands/audit.py) runs phases back-to-back, validates unpaced load + sample_index, refuses to certify an incomplete phase, and writes verify_OUTPUT_CACHING_TEST.txt + audit_result.json atomically (compliance/result.py). Wired via the YAML audit: block (schema.py AuditTestId/OutputCachingTestConfig) and a generic SampleOrderSpec + SingleSampleOrder seam in the load generator. Also folds in the branch's incidental non-compliance changes that touch these files: the metrics-aggregator --ready-file flag, the service launcher ready-check timeout widening, and the aiohttp + msgpack==1.2.1 CVE bumps (uv.lock/pyproject; msgpack clears GHSA-6v7p-g79w-8964). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Unit tests for verify_output_caching, OutputCachingAudit.plan_runs/verify, RunStats.from_report, the run_audit guards (load-pattern, incomplete phase, interrupt-skips-audit), SampleOrderSpec/SingleSampleOrder, and the atomic result writer; plus the end-to-end audit: flow (offline + single-stream). Includes the metrics aggregator signal-handling ready-file test update. The rejected-load-pattern guard test derives its parametrization from the LoadPatternType enum (anything that isn't max_throughput/concurrency) so it stays correct regardless of which patterns exist on the base branch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
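The enum-derived parametrization mentioned in this commit message can be sketched like this. The `LoadPatternType` below is a local stand-in: max_throughput and concurrency come from the text, while the `POISSON` member and the `check_load_pattern` guard are hypothetical placeholders.

```python
from enum import Enum


class LoadPatternType(Enum):
    MAX_THROUGHPUT = "max_throughput"
    CONCURRENCY = "concurrency"
    POISSON = "poisson"  # hypothetical extra pattern, for illustration only


ALLOWED = frozenset({LoadPatternType.MAX_THROUGHPUT, LoadPatternType.CONCURRENCY})

# Derived from the enum, so a newly added pattern is covered automatically.
REJECTED = [pattern for pattern in LoadPatternType if pattern not in ALLOWED]


def check_load_pattern(pattern: LoadPatternType) -> None:
    """Allow-list guard: the audit only makes sense under unpaced load."""
    if pattern not in ALLOWED:
        raise ValueError(f"audit does not support load pattern {pattern.value!r}")


for pattern in REJECTED:
    try:
        check_load_pattern(pattern)
    except ValueError as exc:
        print(exc)
```

In a pytest suite, `REJECTED` would be the argument list handed to the parametrize decorator instead of a hand-written list of names.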

wu6u3tw · 2026-06-23T21:48:48Z

Per offline discussion with @nvzhihanj, I will change the layout of the dumped JSON file so that it is written under audit/output_caching_test.json.

wu6u3tw requested a review from a team June 3, 2026 20:53

gemini-code-assist Bot reviewed Jun 3, 2026

Comment thread src/inference_endpoint/compliance/__main__.py Outdated

Comment thread src/inference_endpoint/compliance/test04.py Outdated

Comment thread src/inference_endpoint/compliance/test04.py Outdated

wu6u3tw requested review from arekay-nv and nv-alicheng June 3, 2026 22:19

nvzhihanj reviewed Jun 4, 2026

Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated

nvzhihanj reviewed Jun 4, 2026

Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22_test04.yaml Outdated

nvzhihanj reviewed Jun 4, 2026

Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22_test04.yaml Outdated

wu6u3tw force-pushed the feat/test04-compliance branch from 9057190 to b547f1d Compare June 4, 2026 23:14

viraatc force-pushed the main branch from 19b20bb to 28f5ac1 Compare June 4, 2026 23:25

wu6u3tw force-pushed the feat/test04-compliance branch 2 times, most recently from cdbae64 to eae1234 Compare June 4, 2026 23:40

nvzhihanj reviewed Jun 5, 2026

Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated

nvzhihanj reviewed Jun 5, 2026

Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated

nvzhihanj reviewed Jun 5, 2026

Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated

nvzhihanj reviewed Jun 5, 2026

Comment thread tests/unit/compliance/test_audit_test04.py Outdated

nvzhihanj reviewed Jun 5, 2026

Comment thread README.md Outdated

wu6u3tw force-pushed the feat/test04-compliance branch 3 times, most recently from b190d21 to e0de06f Compare June 5, 2026 21:03

wu6u3tw commented Jun 5, 2026

Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22.yaml

wu6u3tw force-pushed the feat/test04-compliance branch 2 times, most recently from 385630c to c1e48bf Compare June 5, 2026 21:22

nvzhihanj reviewed Jun 6, 2026

Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated

nvzhihanj reviewed Jun 7, 2026

viraatc reviewed Jun 8, 2026

Comment thread examples/09_Wan22_VideoGen_Example/single_stream_wan22.yaml Outdated

wu6u3tw force-pushed the feat/test04-compliance branch from 4290b36 to d34459b Compare June 10, 2026 21:43

wu6u3tw force-pushed the feat/test04-compliance branch from d34459b to 642ec6c Compare June 10, 2026 22:20

wu6u3tw force-pushed the feat/test04-compliance branch from 642ec6c to b40b0ef Compare June 10, 2026 22:46

viraatc approved these changes Jun 12, 2026
