Skip the bmk scores that were not run during the partial runtime of SPEC #20

zephyrols wants to merge 1 commit into shinezyy:graphmaker
Conversation
Pull request overview
Updates SPEC score aggregation to better handle partial benchmark runs by attempting to skip scoring for benchmarks that were not executed.
Changes:
- Adds a guard to skip scoring when a benchmark has `time == 0`.
- Computes the overall geometric mean from a filtered list of "valid" (non-zero) per-benchmark scores.
```python
score[bmk]['time'] = float(weighted_df.loc[bmk, 'time'])
score[bmk]['ref_time'] = float(reftime_js[bmk])
if score[bmk]['time'] == 0:
    warnings.warn(f'{bmk} has 0 time, skip scoring')
    print(f'{bmk} has 0 time, skip scoring')
    score[bmk]['score'] = 0
    score[bmk]['coverage'] = 0
    continue
```
`weighted_df` is reindexed to the full SPEC list earlier, so benchmarks that were not run will typically have NaN time (not 0). `float(np.nan)` yields `nan`, so this branch won't trigger and you'll compute `ref_time / nan` (propagating NaNs into downstream stats). Consider treating missing/non-finite times as "not run" (e.g., `pd.isna` / `np.isfinite` checks) and skipping scoring for those rows too.
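To illustrate the pitfall, a minimal sketch (the `weighted_df` contents below are hypothetical; a reindexed DataFrame fills missing rows with NaN):

```python
import numpy as np
import pandas as pd

# Hypothetical weighted_df reindexed to the full SPEC list:
# benchmarks that were not run come back as NaN, not 0.
weighted_df = pd.DataFrame({'time': [12.5, np.nan]},
                           index=['600.perlbench_s', '602.gcc_s'])

t = float(weighted_df.loc['602.gcc_s', 'time'])
print(t == 0)      # False -- a `time == 0` guard never fires for NaN
print(pd.isna(t))  # True  -- pd.isna / np.isfinite do catch the missing row
print(100.0 / t)   # nan   -- ref_time / nan propagates silently downstream
```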
Suggested change:

```diff
-score[bmk]['time'] = float(weighted_df.loc[bmk, 'time'])
-score[bmk]['ref_time'] = float(reftime_js[bmk])
-if score[bmk]['time'] == 0:
-    warnings.warn(f'{bmk} has 0 time, skip scoring')
-    print(f'{bmk} has 0 time, skip scoring')
-    score[bmk]['score'] = 0
-    score[bmk]['coverage'] = 0
-    continue
+raw_time = weighted_df.loc[bmk, 'time']
+# Treat missing or non-finite times as "not run" and skip scoring
+if pd.isna(raw_time) or not np.isfinite(float(raw_time)) or float(raw_time) == 0:
+    warnings.warn(f'{bmk} has invalid or zero time ({raw_time}), skip scoring')
+    print(f'{bmk} has invalid or zero time ({raw_time}), skip scoring')
+    score[bmk]['time'] = float(raw_time) if not pd.isna(raw_time) else float('nan')
+    score[bmk]['ref_time'] = float(reftime_js[bmk])
+    score[bmk]['score'] = 0
+    score[bmk]['coverage'] = 0
+    continue
+score[bmk]['time'] = float(raw_time)
+score[bmk]['ref_time'] = float(reftime_js[bmk])
```
```python
if score[bmk]['time'] == 0:
    warnings.warn(f'{bmk} has 0 time, skip scoring')
    print(f'{bmk} has 0 time, skip scoring')
    score[bmk]['score'] = 0
    score[bmk]['coverage'] = 0
    continue
```
Marking skipped benchmarks with `score = 0` will still include them in the later `intdf['score']` / `fpdf['score']` geometric means (lines 208+), producing 0 rather than the mean over executed benchmarks. To align with the PR goal ("only calculate executed benchmarks"), it's better to exclude skipped entries from those per-suite means (e.g., filter out non-positive/non-finite scores or drop the rows entirely).
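A small sketch of why a 0 score drags a geometric mean to 0 (the sample scores and the `geomean` helper here are hypothetical stand-ins for the repo's `geometric_mean`):

```python
import numpy as np

def geomean(values):
    # exp(mean(log(x))): log(0) -> -inf forces the result to 0,
    # and any NaN propagates into the result.
    return float(np.exp(np.mean(np.log(values))))

scores = np.array([5.2, 0.0, 4.8, np.nan])  # 0/NaN mark skipped benchmarks

# Restrict the per-suite mean to executed benchmarks only:
executed = scores[np.isfinite(scores) & (scores > 0)]
print(geomean(executed))  # geometric mean over [5.2, 4.8] only, ~4.996
```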
```diff
+valid_scores = [x[1]['score'] for x in score.items() if x[1]['score'] != 0]
 score['mean'] = {
     'time':0,
     'ref_time':0,
-    'score': geometric_mean([x[1]['score'] for x in score.items()]),
+    'score': geometric_mean(valid_scores) if valid_scores else 0,
```
`valid_scores` currently filters only `!= 0`, which will still include NaN scores (since `nan != 0` is True) and can make `geometric_mean(valid_scores)` return NaN. Filter for finite, positive values instead so the overall mean remains well-defined when some benchmarks are missing.
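A minimal illustration of the `!= 0` pitfall (the `score` dict below is hypothetical):

```python
import math

score = {
    '600.perlbench_s': {'score': 5.2},
    '602.gcc_s': {'score': float('nan')},  # not run
    '605.mcf_s': {'score': 0},             # skipped
}

# `!= 0` keeps NaN, since nan != 0 is True:
loose = [v['score'] for v in score.values() if v['score'] != 0]
# loose == [5.2, nan] -- the NaN would poison a geometric mean

# Filter for finite, positive values instead:
valid = [v['score'] for v in score.values()
         if math.isfinite(v['score']) and v['score'] > 0]
# valid == [5.2]
```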
Slightly modified the score statistics section so that, when SPEC benchmarks are only partially run, the scores are computed from the benchmarks that were actually executed.