I am trying to reproduce the SciQ results from the SC'23 paper using EleutherAI's lm-evaluation-harness.
These are my results:

| Model | SciQ | PIQA |
|---|---|---|
| forge-bio | 0.788 | |
| forge-che | 0.821 | |
| forge-eng | 0.793 | |
| forge-mat | 0.777 | |
| forge-phy | 0.761 | |
| forge-soc | 0.82 | |
| forge-s1 | 0.787 | |
| forge-s2 | 0.783 | |
| forge-s3 | 0.805 | |
| forge-s4 | 0.86 | |
| forge-m1 | 0.82 | |
| **forge-m2** | **0.574** | **0.5577** |
| **forge-l** | **0.242** | |
The bolded scores (forge-m2 and forge-l) are much lower than the others and than the values reported in Table 8 of the paper. A quick check of the evaluation logs (data/eval/forge-m2) suggests that the forge-m2 numbers roughly match the m2 checkpoint at iteration 1000, and the forge-l number probably corresponds to some very early checkpoint as well.
I downloaded the checkpoints from the links in the README.md, and I suspect that the Dropbox versions were somehow mixed up.
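To make it easier to confirm or rule out a mix-up, here is a minimal sketch that prints a sha256 digest for every weight file in the suspect checkpoints, which the maintainers could compare against the files currently behind the links. The directory names and file extensions are assumptions; point them at wherever the checkpoints were unpacked.

```python
# Sketch: hash the downloaded weight files so they can be compared against a
# fresh download. Directory names and extensions below are assumptions.
import hashlib
from pathlib import Path

def sha256_file(path, chunk=1 << 20):
    """Stream a file through sha256 to avoid loading multi-GB weights into memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

for name in ("forge-m2", "forge-l"):  # assumed local checkpoint directories
    for path in sorted(Path(name).rglob("*")):
        if path.is_file() and path.suffix in {".bin", ".safetensors", ".pt"}:
            print(f"{name}\t{path}\t{sha256_file(path)}")
```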
Command line:

```
lm_eval --model hf --model_args pretrained=forge-bio,parallelize=True --tasks sciq --device cuda
```
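For completeness, the same run can be driven from Python. This is a minimal sketch assuming lm-evaluation-harness v0.4+ (which exposes `simple_evaluate` and matches the CLI above) and that `forge-bio` is a local checkpoint directory; the metric key names in the result dict (e.g. `"acc,none"`) may differ between harness versions.

```python
# Sketch of the same evaluation via the harness's Python API
# (assumes lm-evaluation-harness v0.4+; `forge-bio` is a local checkpoint path).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=forge-bio,parallelize=True",
    tasks=["sciq"],
    device="cuda",
)

# Per-task metrics live under results["results"]; exact key names vary by version.
print(results["results"]["sciq"])
```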