As part of refining #211, I am trying to figure out how we can predict the quality of the resulting model more reliably than with the KL divergence, while keeping evaluation fast enough to optimize for the metric. I have evaluated several small benchmark suites, and after a few days of testing, PIQA (Physical Interaction: Question Answering) has emerged as a promising candidate.
My main findings are:
- PIQA takes just seconds to run (e.g. 9 seconds for `gpt-oss-20b` BF16 on an RTX 6000 Pro).
- PIQA appears to correlate much more strongly with the proprietary UGI NatInt benchmark than the KL divergence.
The second point is illustrated by the following table:
| Model | PIQA acc_norm | UGI NatInt | Heretic KLD | PIQA rank | NatInt rank | KLD rank |
|---|---|---|---|---|---|---|
| openai/gpt-oss-20b | 0.7731 | 27.18 | 0 | 1 | 1 | 1 |
| arnomatic/gpt-oss-20b-heretic-scanner-V1-2 | 0.7726 | 21.64 | 0.0534 | 2 | 3 | 3 |
| MuXodious/gpt-oss-20b-RichardErkhov-heresy | 0.7715 | 22.52 | 0.0640 | 3 | 2 | 4 |
| kabachuha/gpt-oss-20b-SOMbliterated | 0.7699 | 20.34 | 0.1158 | 4 | 4 | 6 |
| p-e-w/gpt-oss-20b-heretic-ara-v3 | 0.7688 | 20.17 | 0.0437 | 5 | 5 | 2 |
| coder3101/gpt-oss-20b-heretic | 0.7666 | 20.11 | 0.2765 | 6 | 6 | 7 |
| ArliAI/gpt-oss-20b-Derestricted | 0.7622 | 18.31 | 0.1058 | 7 | 7 | 5 |
| huihui-ai/Huihui-gpt-oss-20b-BF16-abliterated-v2 | 0.7427 | 10.89 | 1.2190 | 8 | 8 | 8 |
As you can see, PIQA and NatInt correlate almost perfectly (in a Spearman sense), and unlike the KLD, PIQA correctly predicts the poor NatInt score of ArliAI/gpt-oss-20b-Derestricted.
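For concreteness, the rank correlations can be computed directly from the table above. A quick check with SciPy (values copied from the columns):

```python
# Spearman rank correlations from the table above (requires scipy).
from scipy.stats import spearmanr

piqa   = [0.7731, 0.7726, 0.7715, 0.7699, 0.7688, 0.7666, 0.7622, 0.7427]
natint = [27.18, 21.64, 22.52, 20.34, 20.17, 20.11, 18.31, 10.89]
kld    = [0.0, 0.0534, 0.0640, 0.1158, 0.0437, 0.2765, 0.1058, 1.2190]

print(spearmanr(piqa, natint).correlation)  # ~0.976: near-perfect agreement
print(spearmanr(kld, natint).correlation)   # ~-0.738: much weaker (negative
                                            # because lower KLD is better)
```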
You can reproduce the PIQA scores using lm-evaluation-harness, with a command like `lm_eval --model hf --model_args pretrained=<model> --tasks piqa --device cuda:0 --batch_size 64`. You can reproduce the Heretic KLD scores with `heretic --model <base_model> --evaluate-model <model>`. The NatInt scores are taken from the UGI Leaderboard.
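For programmatic use (e.g. calling PIQA from inside the optimization loop, as discussed below), the harness also exposes a Python API. A rough sketch; note that the exact metric key depends on the installed lm-eval version, so treat the result lookup as an assumption:

```python
# Minimal in-process PIQA evaluation via lm-evaluation-harness.
# The metric key differs across harness versions ("acc_norm,none" in
# recent releases, "acc_norm" in older ones), hence the fallback.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=openai/gpt-oss-20b",
    tasks=["piqa"],
    device="cuda:0",
    batch_size=64,
)
piqa = results["results"]["piqa"]
print(piqa.get("acc_norm,none", piqa.get("acc_norm")))
```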
The very short evaluation time required for PIQA opens up intriguing possibilities, including running PIQA on every trial, or (probably better) running it only for those trials that are sufficiently close to the KLD-based Pareto front once the study completes. We can then rebuild the Pareto front based on PIQA scores rather than KLD, which, according to these tests, should lead to better predictions of model intelligence.
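To make the second option concrete, here is a minimal sketch of what that post-processing step could look like. Everything here is illustrative: `Trial`, `run_piqa`, the `refusals`/`kld` attributes, and the `epsilon` threshold are placeholder names, not Heretic's actual internals.

```python
# Illustrative sketch only: Trial, run_piqa, and the attribute names
# are placeholders, not Heretic's actual internals.
from dataclasses import dataclass

@dataclass
class Trial:
    refusals: int       # minimized by the study
    kld: float          # minimized by the study
    piqa: float = 0.0   # filled in later, for selected trials only

def near_kld_front(trials, epsilon=0.05):
    """Trials whose KLD is within epsilon of the best KLD achieved
    at an equal or lower refusal count."""
    return [
        t for t in trials
        if t.kld <= min(u.kld for u in trials if u.refusals <= t.refusals) + epsilon
    ]

def piqa_front(trials):
    """Pareto front over (refusals minimized, PIQA maximized) instead of KLD."""
    return [
        t for t in trials
        if not any(u.refusals <= t.refusals and u.piqa > t.piqa for u in trials)
    ]

def rerank(all_trials, run_piqa):
    """run_piqa: callable that loads a trial's model and returns PIQA acc_norm."""
    candidates = near_kld_front(all_trials)
    for t in candidates:        # a few seconds each, per the timing above
        t.piqa = run_piqa(t)
    return piqa_front(candidates)
```

Filtering to the neighborhood of the KLD front first keeps the number of PIQA runs small, so the total overhead stays at seconds per candidate rather than per trial.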