
[RFC] PIQA: A potential complement to KL divergence that could improve the quality of Heretic models #236

@p-e-w


As part of refining #211, I am trying to figure out how we can predict the quality of the resulting model more reliably than with the KL divergence, while remaining fast enough to optimize for the metric. I have evaluated several small benchmark suites, and after a few days of testing, PIQA (Physical Interaction: Question Answering) has emerged as a promising candidate.

My main findings are:

  1. PIQA takes just seconds to run (e.g. 9 seconds for gpt-oss-20b BF16 on an RTX 6000 Pro).
  2. PIQA appears to correlate much more strongly with the proprietary UGI NatInt benchmark than the KL divergence does.


The second point is illustrated by the following table:

| Model | PIQA acc_norm | UGI NatInt | Heretic KLD | PIQA rank | NatInt rank | KLD rank |
|---|---|---|---|---|---|---|
| openai/gpt-oss-20b | 0.7731 | 27.18 | 0 | 1 | 1 | 1 |
| arnomatic/gpt-oss-20b-heretic-scanner-V1-2 | 0.7726 | 21.64 | 0.0534 | 2 | 3 | 3 |
| MuXodious/gpt-oss-20b-RichardErkhov-heresy | 0.7715 | 22.52 | 0.0640 | 3 | 2 | 4 |
| kabachuha/gpt-oss-20b-SOMbliterated | 0.7699 | 20.34 | 0.1158 | 4 | 4 | 6 |
| p-e-w/gpt-oss-20b-heretic-ara-v3 | 0.7688 | 20.17 | 0.0437 | 5 | 5 | 2 |
| coder3101/gpt-oss-20b-heretic | 0.7666 | 20.11 | 0.2765 | 6 | 6 | 7 |
| ArliAI/gpt-oss-20b-Derestricted | 0.7622 | 18.31 | 0.1058 | 7 | 7 | 5 |
| huihui-ai/Huihui-gpt-oss-20b-BF16-abliterated-v2 | 0.7427 | 10.89 | 1.2190 | 8 | 8 | 8 |

As you can see, PIQA and NatInt correlate almost perfectly (in a Spearman sense), and unlike the KLD, PIQA correctly predicts the poor NatInt score of ArliAI/gpt-oss-20b-Derestricted.
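
To make the Spearman claim concrete, here is a quick sketch that recomputes the rank correlations from the table above with scipy (the numbers are copied verbatim from the table; the printed coefficients are what the rank columns imply):

```python
# Spearman rank correlations for the table above.
from scipy.stats import spearmanr

piqa   = [0.7731, 0.7726, 0.7715, 0.7699, 0.7688, 0.7666, 0.7622, 0.7427]
natint = [27.18, 21.64, 22.52, 20.34, 20.17, 20.11, 18.31, 10.89]
kld    = [0.0, 0.0534, 0.0640, 0.1158, 0.0437, 0.2765, 0.1058, 1.2190]

rho_piqa, _ = spearmanr(piqa, natint)
rho_kld, _ = spearmanr(kld, natint)

print(rho_piqa)  # ~0.976: near-perfect monotonic agreement with NatInt
print(rho_kld)   # ~-0.738: weaker, and negative because lower KLD is better
```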

You can reproduce the PIQA scores using lm-evaluation-harness, with a command like `lm_eval --model hf --model_args pretrained=<model> --tasks piqa --device cuda:0 --batch_size 64`. You can reproduce the Heretic KLD scores with `heretic --model <base_model> --evaluate-model <model>`. The NatInt scores are taken from the UGI Leaderboard.
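
If you prefer scripting the comparison, the harness can also be driven from Python. A minimal sketch, assuming the `lm_eval.simple_evaluate` entry point and the `acc_norm,none` result key used by recent harness versions (key names have changed across releases, so inspect `results["results"]` if this doesn't match yours):

```python
# Sketch: scoring PIQA from Python via lm-evaluation-harness,
# mirroring the CLI invocation above.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=openai/gpt-oss-20b",
    tasks=["piqa"],
    device="cuda:0",
    batch_size=64,
)

# acc_norm is the metric reported in the table above.
print(results["results"]["piqa"]["acc_norm,none"])  # ~0.7731 for the base model
```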


The very short evaluation time required for PIQA opens intriguing possibilities, including running PIQA on every trial or, probably better, running it for all trials that are sufficiently close to the KLD-based Pareto front once the study completes. We could then rebuild the Pareto front based on PIQA scores rather than KLD, which, according to these tests, should lead to better predictions of model intelligence.
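
A minimal sketch of that re-ranking step. The `refusals`, `kld`, and `piqa` field names, the `run_piqa` callback, and the closeness threshold are all illustrative assumptions, not Heretic's actual internals:

```python
# Sketch: re-rank trials near the KLD-based Pareto front by PIQA.
# Field names and the `slack` threshold are hypothetical.

def pareto_front(trials, obj_b, b_higher_is_better):
    """Non-dominated trials on (refusals, obj_b); refusals: lower is better."""
    sign = -1.0 if b_higher_is_better else 1.0
    front = []
    for t in trials:
        dominated = any(
            o["refusals"] <= t["refusals"]
            and sign * o[obj_b] <= sign * t[obj_b]
            and (o["refusals"], sign * o[obj_b]) != (t["refusals"], sign * t[obj_b])
            for o in trials
        )
        if not dominated:
            front.append(t)
    return front

def rerank_with_piqa(trials, run_piqa, slack=0.10):
    """Score near-front trials with PIQA, then rebuild the front on PIQA."""
    kld_front = pareto_front(trials, "kld", b_higher_is_better=False)
    # Illustrative closeness criterion: a trial qualifies if its KLD is
    # within `slack` of a front member that refuses at least as often.
    near = [
        t for t in trials
        if any(f["refusals"] >= t["refusals"] and t["kld"] <= f["kld"] + slack
               for f in kld_front)
    ]
    for t in near:
        t["piqa"] = run_piqa(t)  # user-supplied PIQA evaluation callback
    return pareto_front(near, "piqa", b_higher_is_better=True)
```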
