feat(compliance): MLPerf TEST04 caching audit — extensible AuditTest framework#332
feat(compliance): MLPerf TEST04 caching audit — extensible AuditTest framework#332wu6u3tw wants to merge 3 commits into
Conversation
|
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
There was a problem hiding this comment.
Code Review
This pull request implements the MLPerf TEST04 compliance audit to detect result caching by repeatedly issuing a single fixed sample and comparing the throughput against a reference run. It introduces configuration options, validation guards, a SingleSampleOrder generator, and a compliance verification module with a CLI tool and tests. The review feedback focuses on improving the robustness of the compliance verifier, specifically by handling potential OSError exceptions during file writes, catching AttributeError when parsing non-dictionary JSON configurations, and gracefully handling malformed snapshot files during parsing.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
…review Address gemini-code-assist review on PR mlcommons#332: - CLI catches OSError (PermissionError etc.) and write_verdict failures, not just FileNotFoundError/ValueError — all map to exit 2. - _audit_marker tolerates non-dict results.json (isinstance guards) instead of raising AttributeError. - _run_stats_from_dir rejects a non-dict snapshot with a clear ValueError. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Update summaryAll review feedback has been addressed. Here is what changed since the original submission: Architecture (main concern)
Config shape
audit: "test04"
datasets:
- name: wan22_prompts
path: wan22_prompts.jsonl
type: "performance"
samples: 50 # reference phase query count (50–144)
- name: wan22_audit
path: wan22_prompts.jsonl
type: "audit"
samples: 25 # audit phase query count (25–50)
audit_sample_index: 0Robustness
Testing
Example config
|
9057190 to
b547f1d
Compare
cdbae64 to
eae1234
Compare
b190d21 to
e0de06f
Compare
385630c to
c1e48bf
Compare
|
All review feedback has been addressed. Here's a summary of what changed: Architecture Sample counts & index SingleStream Durations Robustness fixes (Gemini)
Cleanup
|
nvzhihanj
left a comment
There was a problem hiding this comment.
Review Council — first-principles design review
Reviewed by: Claude (Codex review timed out on this 2046-line diff at xhigh reasoning) · Depth: thorough
Focus: design issues warranting re-design for a modular, extensible audit-test framework (TEST04 is the first of several). 11 findings; see the tiered summary comment. The ref_samples dead-write (#1) was independently verified against the source.
Review Council — Multi-AI Code Review (first-principles design review)Reviewed by: Claude · Depth: thorough Framing: TEST04 is the first MLPerf compliance/audit test and is meant to become a modular, extensible framework. The findings below are design-led — what would adding the next audit (TEST01/05) cost, and where does TEST04-specific knowledge leak into general-purpose code. 11 findings, all posted inline. 🔴 Re-design / Must-fix
🟡 Should-fix
🔵 Consider
Through-line: #1, #5, #6, #7 are all symptoms of the same root cause — TEST04 is bolted onto Dedup: none overlap existing inline comments except #9, which extends the maintainer's existing fairness thread with upstream-parity / guard-direction substance. |
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review. Replaces the bolted-on TEST04 with a first-class, extensible AuditTest abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a single shared sample count (fair comparison), and verify() produces a typed verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow requirement to where the design resolves it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
4290b36 to
d34459b
Compare
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review. Replaces the bolted-on TEST04 with a first-class, extensible AuditTest abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a single shared sample count (fair comparison), and verify() produces a typed verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow requirement to where the design resolves it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
d34459b to
642ec6c
Compare
Ground-up redesign of the compliance/audit framework after PR mlcommons#332 review. Replaces the bolted-on TEST04 with a first-class, extensible AuditTest abstraction: a generic orchestrator runs plan_runs() phases back-to-back at a single shared sample count (fair comparison), and verify() produces a typed verdict. Maps every PR mlcommons#332 design-review finding + maintainer workflow requirement to where the design resolves it. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
642ec6c to
b40b0ef
Compare
… header Review Council (Claude) findings on PR mlcommons#332: - examples hardcoded num_workers despite §8 claiming it was dropped; remove it (use endpoint default, per viraatc's request) so the traceability row is true - single-stream header said '(independent counts)' but uses equal 20/20; align to '(equal counts here)' matching the offline sibling and §5/§8 framing Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… PR) Review Council (Claude, 2nd pass) on PR mlcommons#332: three §8 traceability rows described the example YAMLs as future ('land/dropped at implementation', 'plan doc only'), but the PR now ships offline_wan22_submission.yaml and single_stream_wan22_submission.yaml. Reword to reflect they're included. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
viraatc
left a comment
There was a problem hiding this comment.
lgtm, looking forward to impl!
| │ | ||
| │ 3. verify(runs) ; 4. write_verdict (atomic) | ||
| ▼ | ||
| verify_TEST04.txt + audit_verdict.json |
There was a problem hiding this comment.
If possible, unify the output to be json so it's easier to parse.
And do we need 2 files? Ideally one json should be enough (containing all information). On traditional MLPerf side we can always use scripts to make it pass
There was a problem hiding this comment.
MLCommons run_verification.py takes verify_.txt that req a Performance check pass: True We can remove it if MLComm run_verification changed.
Brings the design doc (docs/compliance_audit_plan.md, rebased on the latest mlcommons#332 web edits) and the two WAN2.2 submission example configs (offline + single-stream, using the audit: block) onto the implementation branch so the redesign ships as one coherent PR: framework + tests + doc + runnable examples. The doc's exit-code contract and module layout are corrected to match this implementation: samples/audit_samples/sample_index/threshold, no standalone verifier CLI, and errors via the repo-wide handler (SetupError → 3, ExecutionError → 4) rather than a flat exit 2. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
1806fde to
ba09656
Compare
wu6u3tw
left a comment
There was a problem hiding this comment.
Review Council — Multi-AI Code Review
Reviewed by: Claude (Codex unavailable — bwrap sandbox blocked, no sudo to relax userns) · Depth: thorough
4 findings posted inline. Verdict-correctness finding (threshold) is the one to prioritize.
1dfaf3c to
89179d5
Compare
History squashed to 3 commitsThe branch was force-pushed (
Each commit independently passes Naming note: this version names the test |
6411c17 to
bb12443
Compare
|
@viraatc @arekay-nv can you review it with impl is already updated in this PR. |
Design plan (docs/compliance_audit_plan.md, incl. an ASCII program-flow diagram showing every decision gate and its exit code), the compliance-module entry in AGENTS.md, and the WAN2.2 Offline/SingleStream submission example configs (perf + accuracy + output_caching_test audit in one from-config run). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Generic AuditTest framework (compliance/): AuditTest protocol + RunSpec/RunStats/RunArtifacts + registry; OutputCachingAudit (compliance/tests/output_caching_test.py) implements MLPerf TEST04 caching detection — reference vs fixed-sample phase, fails if audit QPS exceeds reference QPS by > threshold. run_audit orchestrator (commands/audit.py) runs phases back-to-back, validates unpaced load + sample_index, refuses to certify an incomplete phase, and writes verify_OUTPUT_CACHING_TEST.txt + audit_result.json atomically (compliance/result.py). Wired via the YAML audit: block (schema.py AuditTestId/OutputCachingTestConfig) and a generic SampleOrderSpec + SingleSampleOrder seam in the load generator. Also folds in the branch's incidental non-compliance changes that touch these files: the metrics-aggregator --ready-file flag, the service launcher ready-check timeout widening, and the aiohttp + msgpack==1.2.1 CVE bumps (uv.lock/pyproject; msgpack clears GHSA-6v7p-g79w-8964). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Unit tests for verify_output_caching, OutputCachingAudit.plan_runs/verify, RunStats.from_report, the run_audit guards (load-pattern, incomplete phase, interrupt-skips-audit), SampleOrderSpec/SingleSampleOrder, and the atomic result writer; plus the end-to-end audit: flow (offline + single-stream). Includes the metrics aggregator signal-handling ready-file test update. The rejected-load-pattern guard test derives its parametrization from the LoadPatternType enum (anything that isn't max_throughput/concurrency) so it stays correct regardless of which patterns exist on the base branch. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
bb12443 to
5abd63b
Compare
|
Per offline discussion with @nvzhihanj, will modify the dump json file layout under |
Summary
Adds an extensible MLPerf compliance-audit framework with TEST04 (caching detection) as the first test, driven by an
audit:block in the benchmark YAML. This PR carries the full redesign: the approved design plan, the implementation, tests, and runnable WAN2.2 examples.TEST04 issues one fixed sample for every query in an audit phase; if repeating an identical request makes the SUT meaningfully faster, it is serving from cache. Pass iff the audit run is at most 10% faster than the reference (matching upstream
compliance/TEST04/verify_performance.py).Design (the two axes)
SampleOrderSpec(WITHOUT_REPLACEMENT | SINGLE(index)) carried on aRunSpec. No test-specific knowledge leaks into the load generator.AuditTest.verify(runs) -> AuditVerdict, registered per test.A generic orchestrator (
commands/audit.py::run_audit) runs eachRunSpecphase back-to-back via the existingsetup_benchmark/run_benchmark_asyncpath, then verifies and writes the verdict. Adding TEST01/06/07/09 later is a new registry entry, not cross-cutting edits.Config shape
AuditConfigis a discriminated-union-ready sub-model onBenchmarkConfig(parallel toAccuracyConfig) — noDatasetType.AUDIT, no audit fields pollutingDataset, notest04boolean inRuntimeSettings.What's included
compliance/__init__.py—AuditTestprotocol +RunSpec/RunStats/RunArtifacts+ registrycompliance/verdict.py—AuditVerdict+ atomicwrite_verdict(tmp → fsync → rename → fsync)compliance/tests/test04.py—Test04Audit+verify_test04commands/audit.py— genericrun_auditorchestratorconfig/schema.py—AuditTestId+Test04Config/AuditConfig+BenchmarkConfig.auditload_generator—SampleOrderSpec+SingleSampleOrder+ factory dispatchdocs/compliance_audit_plan.md— the design planoffline_wan22_submission.yaml,single_stream_wan22_submission.yamlExit codes
benchmark from-configwith anaudit:block exits 0 (PASS) / 1 (FAIL); errors propagate via the standard handler using the repo-wide scheme (InputValidationError→ 2,SetupError→ 3,ExecutionError→ 4). The on-diskaudit_verdict.jsonis the durable record.Testing
Unit + integration green;
pre-commit run --all-filesclean. The e2e test exercises the fullaudit:→run_audit→AuditVerdictflow for both max_throughput (offline) and concurrency=1 (single-stream).🤖 Generated with Claude Code