From c28e6ce7172bfb146f3ea6c43b36fc15043bb56b Mon Sep 17 00:00:00 2001 From: Tin-Yin Lai Date: Mon, 22 Jun 2026 18:18:20 -0700 Subject: [PATCH 1/3] docs(compliance): output-caching audit (MLPerf TEST04) design + examples Design plan (docs/compliance_audit_plan.md, incl. an ASCII program-flow diagram showing every decision gate and its exit code), the compliance-module entry in AGENTS.md, and the WAN2.2 Offline/SingleStream submission example configs (perf + accuracy + output_caching_test audit in one from-config run). Co-Authored-By: Claude Opus 4.8 --- AGENTS.md | 49 +- docs/compliance_audit_plan.md | 728 ++++++++++++++++++ .../offline_wan22_submission.yaml | 70 ++ .../single_stream_wan22_submission.yaml | 71 ++ 4 files changed, 902 insertions(+), 16 deletions(-) create mode 100644 docs/compliance_audit_plan.md create mode 100644 examples/09_Wan22_VideoGen_Example/offline_wan22_submission.yaml create mode 100644 examples/09_Wan22_VideoGen_Example/single_stream_wan22_submission.yaml diff --git a/AGENTS.md b/AGENTS.md index 4803c99a7..6dbbd1a9e 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -85,19 +85,20 @@ Dataset Manager --> Load Generator --> Endpoint Client --> External Endpoint ### Key Components -| Component | Location | Purpose | -| ---------------------- | ----------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| **Load Generator** | `src/inference_endpoint/load_generator/` | Central orchestrator: `BenchmarkSession` owns the lifecycle, `PhaseIssuer` drives per-phase execution, `TimedIssueStrategy`/`BurstStrategy`/`ConcurrencyStrategy` control timing. Emits `ERROR` before `COMPLETE` for failed queries (metrics aggregator depends on this order). | -| **Endpoint Client** | `src/inference_endpoint/endpoint_client/` | Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point | -| **Dataset Manager** | `src/inference_endpoint/dataset_manager/` | Loads JSONL, HuggingFace, CSV, JSON, Parquet datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface | -| **Metrics Aggregator** | `src/inference_endpoint/async_utils/services/metrics_aggregator/` | Subprocess. Subscribes to events, aggregates per-sample metrics into a `MetricsRegistry` (counters + HDR-histogram series + raw values), publishes `MetricsSnapshot` over IPC PUB at a configurable cadence (`SessionState`: `INITIALIZE` → `LIVE` → `DRAINING` → {`COMPLETE` \| `INTERRUPTED`}). Final snapshot is atomically written to `final_snapshot.json` as the **primary** Report source; the terminal pub/sub frame is a TUI "run finished" signal only. | -| **Report** | `src/inference_endpoint/metrics/report.py` | `Report.from_snapshot(dict)` — pure-function builder consuming the dict form (`snapshot_to_dict`). Reads `final_snapshot.json` directly via `json.loads` (no Struct decode). Plumbs `complete = (state == "complete" and n_pending_tasks == 0)`; renders an explicit warning for `INTERRUPTED` runs. | -| **Config** | `src/inference_endpoint/config/`, `endpoint_client/config.py` | Pydantic-based YAML schema (`schema.py`), `HTTPClientConfig` (single Pydantic model for CLI/YAML/runtime), `RuntimeSettings` | -| **CLI** | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py` | cyclopts-based, auto-generated from `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)` | -| **Async Utils** | `src/inference_endpoint/async_utils/` | `LoopManager` (uvloop + eager_task_factory), ZMQ transport layer, generic `MessageCodec[T]`-parametrized pub/sub, event publisher | -| **OpenAI/SGLang** | `src/inference_endpoint/openai/`, `sglang/` | Protocol adapters and response accumulators for different API formats. `openai_completions` adapter (`completions_adapter.py`) sends pre-tokenized token IDs to `/v1/completions`, bypassing the server chat template — required for gpt-oss-120b on vLLM. `sglang` adapter sends to `/generate` via `input_ids`. Both apply `Harmonize()` client-side. | -| **TensorRT-LLM** | `src/inference_endpoint/trtllm/` | Adapter for TensorRT-LLM endpoints. `TRTLLMAdapter` sends requests; `TRTLLMSSEAccumulator` handles SSE streaming responses. | -| **VideoGen** | `src/inference_endpoint/videogen/` | Adapter for video-generation endpoints (e.g. trtllm-serve `POST /v1/videos/generations`, used by MLPerf WAN2.2-T2V-A14B). Defaults to `response_format=video_path` (server saves video to shared storage and returns path) to avoid large byte payloads. Accuracy mode also runs on `video_path`: the adapter mirrors the path into `response_output` so the event log carries it to `VBenchScorer` (see `evaluation/scoring.py`), which scores videos via VBench from a sibling `uv` subproject at `examples/09_Wan22_VideoGen_Example/accuracy/` (vbench's `transformers==4.33.2` + `numpy<2` pins are incompatible with the parent env, so it runs out-of-process via `uv run --project`). Dataset is ingested via the generic JSONL loader. | +| Component | Location | Purpose | +| ---------------------- | ----------------------------------------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| **Load Generator** | `src/inference_endpoint/load_generator/` | Central orchestrator: `BenchmarkSession` owns the lifecycle, `PhaseIssuer` drives per-phase execution, `TimedIssueStrategy`/`BurstStrategy`/`ConcurrencyStrategy` control timing. Emits `ERROR` before `COMPLETE` for failed queries (metrics aggregator depends on this order). | +| **Endpoint Client** | `src/inference_endpoint/endpoint_client/` | Multi-process HTTP workers communicating via ZMQ IPC. `HTTPEndpointClient` is the main entry point | +| **Dataset Manager** | `src/inference_endpoint/dataset_manager/` | Loads JSONL, HuggingFace, CSV, JSON, Parquet datasets. `Dataset` base class with `load_sample()`/`num_samples()` interface | +| **Metrics Aggregator** | `src/inference_endpoint/async_utils/services/metrics_aggregator/` | Subprocess. Subscribes to events, aggregates per-sample metrics into a `MetricsRegistry` (counters + HDR-histogram series + raw values), publishes `MetricsSnapshot` over IPC PUB at a configurable cadence (`SessionState`: `INITIALIZE` → `LIVE` → `DRAINING` → {`COMPLETE` \| `INTERRUPTED`}). Final snapshot is atomically written to `final_snapshot.json` as the **primary** Report source; the terminal pub/sub frame is a TUI "run finished" signal only. | +| **Report** | `src/inference_endpoint/metrics/report.py` | `Report.from_snapshot(dict)` — pure-function builder consuming the dict form (`snapshot_to_dict`). Reads `final_snapshot.json` directly via `json.loads` (no Struct decode). Plumbs `complete = (state == "complete" and n_pending_tasks == 0)`; renders an explicit warning for `INTERRUPTED` runs. | +| **Config** | `src/inference_endpoint/config/`, `endpoint_client/config.py` | Pydantic-based YAML schema (`schema.py`), `HTTPClientConfig` (single Pydantic model for CLI/YAML/runtime), `RuntimeSettings` | +| **CLI** | `src/inference_endpoint/main.py`, `commands/benchmark/cli.py` | cyclopts-based, auto-generated from `schema.py` and `HTTPClientConfig` Pydantic models. Flat shorthands via `cyclopts.Parameter(alias=...)` | +| **Async Utils** | `src/inference_endpoint/async_utils/` | `LoopManager` (uvloop + eager_task_factory), ZMQ transport layer, generic `MessageCodec[T]`-parametrized pub/sub, event publisher | +| **OpenAI/SGLang** | `src/inference_endpoint/openai/`, `sglang/` | Protocol adapters and response accumulators for different API formats. `openai_completions` adapter (`completions_adapter.py`) sends pre-tokenized token IDs to `/v1/completions`, bypassing the server chat template — required for gpt-oss-120b on vLLM. `sglang` adapter sends to `/generate` via `input_ids`. Both apply `Harmonize()` client-side. | +| **TensorRT-LLM** | `src/inference_endpoint/trtllm/` | Adapter for TensorRT-LLM endpoints. `TRTLLMAdapter` sends requests; `TRTLLMSSEAccumulator` handles SSE streaming responses. | +| **VideoGen** | `src/inference_endpoint/videogen/` | Adapter for video-generation endpoints (e.g. trtllm-serve `POST /v1/videos/generations`, used by MLPerf WAN2.2-T2V-A14B). Defaults to `response_format=video_path` (server saves video to shared storage and returns path) to avoid large byte payloads. Accuracy mode also runs on `video_path`: the adapter mirrors the path into `response_output` so the event log carries it to `VBenchScorer` (see `evaluation/scoring.py`), which scores videos via VBench from a sibling `uv` subproject at `examples/09_Wan22_VideoGen_Example/accuracy/` (vbench's `transformers==4.33.2` + `numpy<2` pins are incompatible with the parent env, so it runs out-of-process via `uv run --project`). Dataset is ingested via the generic JSONL loader. | +| **Compliance** | `src/inference_endpoint/compliance/`, `commands/audit.py` | MLPerf compliance audits. `AuditTest` protocol + `RunSpec`/`RunStats`/`RunArtifacts` + test registry (`compliance/__init__.py`); `OutputCachingAudit` (`compliance/tests/output_caching_test.py`) implements the **output-caching** audit (MLPerf **TEST04**) caching detection — a reference phase over distinct samples vs an audit phase repeating one fixed sample (`SingleSampleOrder`), failing if audit QPS exceeds reference QPS by more than `threshold`. `commands/audit.py:run_audit` orchestrates the phases back-to-back, refuses to certify an incomplete phase, and writes `verify_OUTPUT_CACHING_TEST.txt` + `audit_result.json` atomically via `compliance/result.py`. Enabled by the YAML `audit:` block (`AuditConfig`/`OutputCachingTestConfig`, `AuditTestId.OUTPUT_CACHING_TEST` in `schema.py`); `run_benchmark` dispatches to it after the main run. Performance-only, unpaced loads only (`max_throughput`/`concurrency`). | ### Hot-Path Architecture @@ -154,6 +155,16 @@ Validation is layered: - `poisson`: Fixed QPS with Poisson arrival distribution - `concurrency`: Fixed concurrent requests +### Compliance Audits + +Orthogonal to the main run: a YAML-only `audit:` block (`show=False`, no CLI flag) on `BenchmarkConfig` selects an `AuditTest`. `run_benchmark` runs the main benchmark, then — if `audit:` is set — calls `commands/audit.py:run_audit`, which: + +1. Validates the load pattern (unpaced only: `max_throughput`/`concurrency`) and the configured `sample_index` bounds (reusing the first phase's loaded dataset — no extra load). +2. Runs each `RunSpec` phase (from `AuditTest.plan_runs`) back-to-back under its own `/