Hi, thank you for the great work and for sharing your paper.
I noticed that your model is evaluated on RewardBench2, but I couldn’t find the detailed evaluation protocol or prompt format in either the paper or the released code. I’m trying to reproduce your results and compare against them fairly.
Could you please clarify how you evaluate your model on RewardBench2 in your setup?