Loadgen++/endpoints integration with submission checker #2601
pgmpablo157321 wants to merge 8 commits into
MLCommons CLA bot: All contributors have signed the MLCommons CLA ✍️ ✅
7eeb1dc to 2d143ff
2d143ff to 788e9ed
        system=system, benchmark=benchmark, scenario=scenario)

    def load_single_log(self, path, log_type: Literal["Performance", "Accuracy",
                        "AccuracyResult", "AccuracyJSON", "Test", "System", "Measurements"]):
How about we also add "Endpoints" here?
Loading the endpoints result is very different, so it was moved to another function. Please let me know if you think we should integrate both.
        """
        log = None
        if os.path.exists(path):
            if log_type in ["Endpoints"]:
Would it be better to add a check that the path exists, as we have in the elif branch?
This was moved to another function as well.
            acc_json_path, "AccuracyJSON")
        measurements_json = self.load_single_log(
            measurements_path, "Measurements")
        if perf_log is None and acc_log is None:
My concern here is that if a user did an inference run and supplied a wrong path, the checker would print the endpoints-specific error log and also set is_endpoints_submittion to true:
Could not load Endpoints log from path/supplied, log type not recognized
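One way to avoid this misclassification is to decide the submission type from the files that actually exist and to treat "nothing found" as its own case. The following is only a minimal sketch of that idea: `SubmissionKind`, `classify_submission`, and its three path parameters are hypothetical names for illustration and are not taken from this PR or from the submission checker.

```python
import os
from enum import Enum


class SubmissionKind(Enum):
    """Hypothetical labels; not part of the submission checker."""
    LOADGEN = "loadgen"
    ENDPOINTS = "endpoints"
    MISSING = "missing"


def classify_submission(perf_log_path, acc_log_path, endpoints_result_path):
    """Classify a result directory by which files are present on disk.

    Assumption: a loadgen run leaves a performance or accuracy log, and an
    endpoints run leaves a separate result file. All parameters are
    illustrative placeholders.
    """
    if os.path.exists(perf_log_path) or os.path.exists(acc_log_path):
        return SubmissionKind.LOADGEN
    if os.path.exists(endpoints_result_path):
        return SubmissionKind.ENDPOINTS
    # Neither kind of log was found: report a bad path instead of
    # falling through to the endpoints branch.
    return SubmissionKind.MISSING
```

With a separate `MISSING` outcome, a mistyped path can be reported as a missing-log error, and the endpoints flag is only set when an endpoints result file was actually found.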
- constants.py: Fix percentile key format in ENDPOINTS_MAPPINGS — the
endpoints JSON uses float-format keys (e.g. "99.0") but the mappings
used integer strings ("99"), causing latency_check to receive None and
crash on the comparison. Updated all latency/ttft/tpot percentile keys
to use the .0 suffix (50.0, 90.0, 95.0, 99.0).
- performance_check.py: Fix llm_check for endpoints — the check gated
on the loadgen use_token_latencies flag which does not exist in
endpoints submissions, causing all LLM models to fail. Added an
endpoints-specific branch that checks TTFT/TPOT p99 values directly
from the result JSON. Also fix get_performance_metric_check to skip
RESULT_FIELD_BENCHMARK_OVERWRITE for endpoints (the tokens/sec field
is not present in endpoints result files; use QPS instead).
- endpoints_parser.py: Fix inferred QPS unit — the fallback QPS
calculation divided n_samples_issued by duration_ns directly, giving
~1e-8 instead of the correct value. Convert duration to seconds first.
Also guard against overwriting an already-resolved QPS value.
- README.md: Update submission checker documentation with current
version numbers, endpoints directory structure, endpoints-specific
checks, and accuracy_scores requirement.
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
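The percentile-key and QPS-unit fixes described in the commit message can be sketched as below. This is an illustration under stated assumptions, not the code from constants.py or endpoints_parser.py: `infer_qps`, `percentile_value`, `NS_PER_SECOND`, and the `resolved_qps` parameter are hypothetical names, and the flat dictionary passed to `percentile_value` is an assumed shape for the percentile section of the result JSON.

```python
NS_PER_SECOND = 1_000_000_000


def infer_qps(n_samples_issued, duration_ns, resolved_qps=None):
    """Fallback QPS: samples issued divided by the duration in seconds.

    Hypothetical helper. If a QPS value was already resolved, it is
    returned unchanged rather than overwritten.
    """
    if resolved_qps is not None:
        return resolved_qps
    if duration_ns <= 0:
        raise ValueError("duration_ns must be positive")
    # Convert nanoseconds to seconds before dividing; dividing by the raw
    # nanosecond count would give a value that is 1e9 times too small.
    duration_s = duration_ns / NS_PER_SECOND
    return n_samples_issued / duration_s


def percentile_value(percentiles, p):
    """Look up a percentile using float-format keys such as "99.0".

    Returns None when the key is absent, which is what happens when the
    data only has integer-string keys such as "99".
    """
    return percentiles.get(f"{float(p):.1f}")
```

For example, `infer_qps(1000, 2_000_000_000)` evaluates to 500.0, and `percentile_value({"99.0": 1.5}, 99)` returns 1.5, whereas a dictionary keyed by "99" yields None for the same lookup.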
@pgmpablo157321 Could you resolve the conflicts?
Documentation pending