As part of refining #211, I am trying to figure out how we can predict the quality of the resulting model more reliably than with the KL divergence, while keeping evaluation fast enough to optimize for the metric. I have evaluated several small benchmark suites, and after a few days of testing, PIQA (Physical Interaction: Question Answering) has emerged as a promising candidate.
My main findings are:
- PIQA takes just seconds to run (e.g. 9 seconds for `gpt-oss-20b` BF16 on an RTX 6000 Pro).
- PIQA appears to correlate much more strongly with the proprietary UGI NatInt benchmark than the KL divergence.
The second point is illustrated by the following table:
| Model | PIQA acc_norm | UGI NatInt | Heretic KLD | PIQA rank | NatInt rank | KLD rank |
|---|---|---|---|---|---|---|
| openai/gpt-oss-20b | 0.7731 | 27.18 | 0 | 1 | 1 | 1 |
| arnomatic/gpt-oss-20b-heretic-scanner-V1-2 | 0.7726 | 21.64 | 0.0534 | 2 | 3 | 3 |
| MuXodious/gpt-oss-20b-RichardErkhov-heresy | 0.7715 | 22.52 | 0.0640 | 3 | 2 | 4 |
| kabachuha/gpt-oss-20b-SOMbliterated | 0.7699 | 20.34 | 0.1158 | 4 | 4 | 6 |
| p-e-w/gpt-oss-20b-heretic-ara-v3 | 0.7688 | 20.17 | 0.0437 | 5 | 5 | 2 |
| coder3101/gpt-oss-20b-heretic | 0.7666 | 20.11 | 0.2765 | 6 | 6 | 7 |
| ArliAI/gpt-oss-20b-Derestricted | 0.7622 | 18.31 | 0.1058 | 7 | 7 | 5 |
| huihui-ai/Huihui-gpt-oss-20b-BF16-abliterated-v2 | 0.7427 | 10.89 | 1.2190 | 8 | 8 | 8 |
As you can see, PIQA and NatInt correlate almost perfectly (in a Spearman sense), and unlike the KLD, PIQA correctly predicts the poor NatInt score of ArliAI/gpt-oss-20b-Derestricted.
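For concreteness, the rank correlations can be computed directly from the table above. A quick check with SciPy (values copied from the columns):

```python
# Spearman rank correlations from the table above (requires scipy).
from scipy.stats import spearmanr

piqa   = [0.7731, 0.7726, 0.7715, 0.7699, 0.7688, 0.7666, 0.7622, 0.7427]
natint = [27.18, 21.64, 22.52, 20.34, 20.17, 20.11, 18.31, 10.89]
kld    = [0.0, 0.0534, 0.0640, 0.1158, 0.0437, 0.2765, 0.1058, 1.2190]

print(spearmanr(piqa, natint).correlation)  # ~0.976: near-perfect agreement
print(spearmanr(kld, natint).correlation)   # ~-0.738: much weaker (negative
                                            # because lower KLD is better)
```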
You can reproduce the PIQA scores using lm-evaluation-harness, with a command like `lm_eval --model hf --model_args pretrained=<model> --tasks piqa --device cuda:0 --batch_size 64`. You can reproduce the Heretic KLD scores with `heretic --model <base_model> --evaluate-model <model>`. The NatInt scores are taken from the UGI Leaderboard.
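For programmatic use (e.g. calling PIQA from inside the optimization loop, as discussed below), the harness also exposes a Python API. A rough sketch; note that the exact metric key depends on the installed lm-eval version, so treat the result lookup as an assumption:

```python
# Minimal in-process PIQA evaluation via lm-evaluation-harness.
# The metric key differs across harness versions ("acc_norm,none" in
# recent releases, "acc_norm" in older ones), hence the fallback.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=openai/gpt-oss-20b",
    tasks=["piqa"],
    device="cuda:0",
    batch_size=64,
)
piqa = results["results"]["piqa"]
print(piqa.get("acc_norm,none", piqa.get("acc_norm")))
```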
The very short evaluation time required for PIQA opens up intriguing possibilities, including running PIQA on every trial, or (probably better) running it only for those trials that are sufficiently close to the KLD-based Pareto front once the study completes. We can then rebuild the Pareto front based on PIQA scores rather than KLD, which, according to these tests, should lead to better predictions of model intelligence.
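To make the second option concrete, here is a minimal sketch of what that post-processing step could look like. Everything here is illustrative: `Trial`, `run_piqa`, the `refusals`/`kld` attributes, and the `epsilon` threshold are placeholder names, not Heretic's actual internals.

```python
# Illustrative sketch only: Trial, run_piqa, and the attribute names
# are placeholders, not Heretic's actual internals.
from dataclasses import dataclass

@dataclass
class Trial:
    refusals: int       # minimized by the study
    kld: float          # minimized by the study
    piqa: float = 0.0   # filled in later, for selected trials only

def near_kld_front(trials, epsilon=0.05):
    """Trials whose KLD is within epsilon of the best KLD achieved
    at an equal or lower refusal count."""
    return [
        t for t in trials
        if t.kld <= min(u.kld for u in trials if u.refusals <= t.refusals) + epsilon
    ]

def piqa_front(trials):
    """Pareto front over (refusals minimized, PIQA maximized) instead of KLD."""
    return [
        t for t in trials
        if not any(u.refusals <= t.refusals and u.piqa > t.piqa for u in trials)
    ]

def rerank(all_trials, run_piqa):
    """run_piqa: callable that loads a trial's model and returns PIQA acc_norm."""
    candidates = near_kld_front(all_trials)
    for t in candidates:        # a few seconds each, per the timing above
        t.piqa = run_piqa(t)
    return piqa_front(candidates)
```

Filtering to the neighborhood of the KLD front first keeps the number of PIQA runs small, so the total overhead stays at seconds per candidate rather than per trial.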