Main repo for PRM-o1 research project
- Run an experiment on the steps produced by Llama 3.1 8B Instruct (how many exceed the max length, what the length distribution looks like, etc.)
- Implement an LLM step-expansion routine: generate N candidate steps, each under the max token budget, where the resulting paths are semantically novel relative to one another
- Implement A* search using the QVM and a PRM (maybe use the Skywork PRM?)
- Check out this: https://huggingface.co/Skywork/Skywork-o1-Open-Llama-3.1-8B, https://github.com/SkyworkAI/skywork-o1-prm-inference
	- they claim Q* and the like, but the reasoning doesn't look very impressive at first glance
	- however, their Qwen2.5-Math-7B-Instruct-based PRM paired with Llama 3.1 8B Instruct looks to be second best after Qwen2.5-Math-RM-72B (about as good as Maj@64 with Llama) - source
	- so we could probably use this PRM alongside Math Shepherd? I don't know how this one is trained, though
- another Math Shepherd-style model, but based on Llama 3.1 8B Instruct, so it may be better
- https://huggingface.co/RLHFlow/Llama3.1-8B-PRM-Deepseek-Data
	- ok, here is confirmation that this PRM outperforms Math Shepherd: https://github.com/lifan-yuan/ImplicitPRM/blob/main/figs/main_exp.png
	- $\implies$ Skywork-o1-Open-PRM-Qwen-2.5-7B > RLHFFlow-8B-DS-Data > Math Shepherd
- another PRM to look at: https://github.com/lifan-yuan/ImplicitPRM
	- the ImplicitPRM DPO variant seems to be (very slightly) better than RLHFFlow-8B-DS-Data for Llama 3.1 8B Instruct
	- Skywork-o1-Open-PRM-Qwen-2.5-7B is still the best, though
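The length experiment in the first TODO could start as a script like this. It is a minimal sketch: `step_token_lengths` is hypothetical sample data standing in for real tokenized Llama 3.1 8B Instruct outputs, and `MAX_STEP_TOKENS = 256` is an assumed per-step budget, not a value from the project.

```python
from statistics import mean, median

# Hypothetical sample data: per-step token counts; in practice these would
# come from tokenizing steps actually generated by Llama 3.1 8B Instruct.
step_token_lengths = [48, 96, 210, 33, 300, 257, 180, 75, 260, 41]
MAX_STEP_TOKENS = 256  # assumed per-step token budget

def length_stats(lengths, max_tokens):
    """Summarize step lengths and count how many exceed the budget."""
    over = [n for n in lengths if n > max_tokens]
    return {
        "n_steps": len(lengths),
        "n_over_limit": len(over),
        "frac_over_limit": len(over) / len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "max": max(lengths),
    }

stats = length_stats(step_token_lengths, MAX_STEP_TOKENS)
print(stats)  # {'n_steps': 10, 'n_over_limit': 3, 'frac_over_limit': 0.3, ...}
```

From here, plotting a histogram of `step_token_lengths` would give the distribution question in the same TODO.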
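The step-expansion TODO could be sketched like this. Everything here is an assumption: `sample_step` is a hypothetical stand-in for an LLM sampling call, the length check uses whitespace tokens instead of a real tokenizer, and Jaccard token overlap is a cheap stand-in for the semantic-novelty check (an embedding-similarity model would slot in at the same place).

```python
MAX_STEP_TOKENS = 256    # assumed per-step token budget
NOVELTY_THRESHOLD = 0.6  # assumed: reject candidates with >60% token overlap

def jaccard(a, b):
    """Token-overlap similarity; a cheap stand-in for embedding similarity."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def expand_node(sample_step, n_steps, max_tries=50):
    """Sample up to n_steps candidate next steps that fit the token budget
    and are pairwise novel. `sample_step` is a hypothetical callable that
    returns one sampled continuation (e.g. from Llama 3.1 8B Instruct)."""
    kept = []
    for _ in range(max_tries):
        if len(kept) == n_steps:
            break
        cand = sample_step()
        if len(cand.split()) > MAX_STEP_TOKENS:  # crude proxy; use a real tokenizer
            continue
        if any(jaccard(cand, k) > NOVELTY_THRESHOLD for k in kept):
            continue
        kept.append(cand)
    return kept

# Toy demo with a canned sampler instead of an actual LLM call.
pool = ["add 2 to both sides", "add 2 to both sides",
        "subtract x from both sides", "factor the quadratic"]
it = iter(pool)
steps = expand_node(lambda: next(it), n_steps=3, max_tries=4)
print(steps)  # the duplicate "add 2 to both sides" is filtered out
```

The `max_tries` cap matters in practice: with a high novelty threshold the sampler can fail to produce N distinct steps, and the routine should return fewer rather than loop forever.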
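The A*-search TODO could be prototyped as below. This is a sketch under stated assumptions: `prm_score` and `expand` are hypothetical toy stand-ins (a real run would call a PRM such as Skywork-o1-Open-PRM-Qwen-2.5-7B and sample children from the LLM), and since PRM scores are not an admissible heuristic, this is really best-first search ordered by PRM score rather than textbook A*.

```python
import heapq
from itertools import count

def prm_score(steps):
    """Hypothetical PRM stand-in: score a partial solution in [0, 1].
    A real run would call a trained PRM here."""
    good = ("isolate x", "divide by 2", "x = 3")
    return sum(s in good for s in steps) / max(len(steps), 1)

def expand(steps):
    """Canned child steps instead of LLM sampling (hypothetical)."""
    tree = {
        (): ["isolate x", "guess randomly"],
        ("isolate x",): ["divide by 2", "square both sides"],
        ("isolate x", "divide by 2"): ["x = 3"],
    }
    return tree.get(tuple(steps), [])

def is_goal(steps):
    return bool(steps) and steps[-1] == "x = 3"

def astar_prm(max_expansions=100):
    """Best-first search: pop the partial path with the highest PRM score
    (negated, since heapq is a min-heap)."""
    tiebreak = count()  # avoids comparing lists when scores tie
    frontier = [(-prm_score([]), next(tiebreak), [])]
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, steps = heapq.heappop(frontier)
        if is_goal(steps):
            return steps
        for child in expand(steps):
            new = steps + [child]
            heapq.heappush(frontier, (-prm_score(new), next(tiebreak), new))
    return None

print(astar_prm())  # ['isolate x', 'divide by 2', 'x = 3']
```

Swapping `prm_score` between the candidate PRMs listed above (Skywork, RLHFlow, ImplicitPRM) would then be a one-line change, which makes this loop a convenient harness for comparing them.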