Main repo for PRM-o1 research project
- Run an experiment on the steps produced by Llama 3.1 8B Instruct (how many exceed the max length, what the length distribution looks like, etc.)
- Implement an LLM step-expansion routine: generate N candidate steps, each under the max token budget, where the resulting paths are semantically novel relative to one another
- Implement A* search using the QVM and a PRM (maybe use the Skywork PRM?)
- Check out this: https://huggingface.co/Skywork/Skywork-o1-Open-Llama-3.1-8B, https://github.com/SkyworkAI/skywork-o1-prm-inference
	- they claim Q* and the like, but the reasoning doesn't look very impressive at first glance
	- however, their Qwen2.5-Math-7B-Instruct-based PRM paired with Llama 3.1 8B Instruct looks to be second best after Qwen2.5-Math-RM-72B (about as good as Maj@64 with Llama) - source
	- so we could probably use this PRM alongside Math Shepherd? I don't know how this one is trained, though
- another Math Shepherd-style model, but based on Llama 3.1 8B Instruct, so it may be better
- https://huggingface.co/RLHFlow/Llama3.1-8B-PRM-Deepseek-Data
	- ok, here is confirmation that this PRM outperforms Math Shepherd: https://github.com/lifan-yuan/ImplicitPRM/blob/main/figs/main_exp.png
	- $\implies$ Skywork-o1-Open-PRM-Qwen-2.5-7B > RLHFFlow-8B-DS-Data > Math Shepherd
- another PRM to look at: https://github.com/lifan-yuan/ImplicitPRM
	- the ImplicitPRM DPO variant seems to be (very slightly) better than RLHFFlow-8B-DS-Data for Llama 3.1 8B Instruct
	- Skywork-o1-Open-PRM-Qwen-2.5-7B is still the best, though
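The length experiment in the first TODO could start as a script like this. It is a minimal sketch: `step_token_lengths` is hypothetical sample data standing in for real tokenized Llama 3.1 8B Instruct outputs, and `MAX_STEP_TOKENS = 256` is an assumed per-step budget, not a value from the project.

```python
from statistics import mean, median

# Hypothetical sample data: per-step token counts; in practice these would
# come from tokenizing steps actually generated by Llama 3.1 8B Instruct.
step_token_lengths = [48, 96, 210, 33, 300, 257, 180, 75, 260, 41]
MAX_STEP_TOKENS = 256  # assumed per-step token budget

def length_stats(lengths, max_tokens):
    """Summarize step lengths and count how many exceed the budget."""
    over = [n for n in lengths if n > max_tokens]
    return {
        "n_steps": len(lengths),
        "n_over_limit": len(over),
        "frac_over_limit": len(over) / len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "max": max(lengths),
    }

stats = length_stats(step_token_lengths, MAX_STEP_TOKENS)
print(stats)  # {'n_steps': 10, 'n_over_limit': 3, 'frac_over_limit': 0.3, ...}
```

From here, plotting a histogram of `step_token_lengths` would give the distribution question in the same TODO.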
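The step-expansion TODO could be sketched like this. Everything here is an assumption: `sample_step` is a hypothetical stand-in for an LLM sampling call, the length check uses whitespace tokens instead of a real tokenizer, and Jaccard token overlap is a cheap stand-in for the semantic-novelty check (an embedding-similarity model would slot in at the same place).

```python
MAX_STEP_TOKENS = 256    # assumed per-step token budget
NOVELTY_THRESHOLD = 0.6  # assumed: reject candidates with >60% token overlap

def jaccard(a, b):
    """Token-overlap similarity; a cheap stand-in for embedding similarity."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def expand_node(sample_step, n_steps, max_tries=50):
    """Sample up to n_steps candidate next steps that fit the token budget
    and are pairwise novel. `sample_step` is a hypothetical callable that
    returns one sampled continuation (e.g. from Llama 3.1 8B Instruct)."""
    kept = []
    for _ in range(max_tries):
        if len(kept) == n_steps:
            break
        cand = sample_step()
        if len(cand.split()) > MAX_STEP_TOKENS:  # crude proxy; use a real tokenizer
            continue
        if any(jaccard(cand, k) > NOVELTY_THRESHOLD for k in kept):
            continue
        kept.append(cand)
    return kept

# Toy demo with a canned sampler instead of an actual LLM call.
pool = ["add 2 to both sides", "add 2 to both sides",
        "subtract x from both sides", "factor the quadratic"]
it = iter(pool)
steps = expand_node(lambda: next(it), n_steps=3, max_tries=4)
print(steps)  # the duplicate "add 2 to both sides" is filtered out
```

The `max_tries` cap matters in practice: with a high novelty threshold the sampler can fail to produce N distinct steps, and the routine should return fewer rather than loop forever.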
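The A*-search TODO could be prototyped as below. This is a sketch under stated assumptions: `prm_score` and `expand` are hypothetical toy stand-ins (a real run would call a PRM such as Skywork-o1-Open-PRM-Qwen-2.5-7B and sample children from the LLM), and since PRM scores are not an admissible heuristic, this is really best-first search ordered by PRM score rather than textbook A*.

```python
import heapq
from itertools import count

def prm_score(steps):
    """Hypothetical PRM stand-in: score a partial solution in [0, 1].
    A real run would call a trained PRM here."""
    good = ("isolate x", "divide by 2", "x = 3")
    return sum(s in good for s in steps) / max(len(steps), 1)

def expand(steps):
    """Canned child steps instead of LLM sampling (hypothetical)."""
    tree = {
        (): ["isolate x", "guess randomly"],
        ("isolate x",): ["divide by 2", "square both sides"],
        ("isolate x", "divide by 2"): ["x = 3"],
    }
    return tree.get(tuple(steps), [])

def is_goal(steps):
    return bool(steps) and steps[-1] == "x = 3"

def astar_prm(max_expansions=100):
    """Best-first search: pop the partial path with the highest PRM score
    (negated, since heapq is a min-heap)."""
    tiebreak = count()  # avoids comparing lists when scores tie
    frontier = [(-prm_score([]), next(tiebreak), [])]
    for _ in range(max_expansions):
        if not frontier:
            return None
        _, _, steps = heapq.heappop(frontier)
        if is_goal(steps):
            return steps
        for child in expand(steps):
            new = steps + [child]
            heapq.heappush(frontier, (-prm_score(new), next(tiebreak), new))
    return None

print(astar_prm())  # ['isolate x', 'divide by 2', 'x = 3']
```

Swapping `prm_score` between the candidate PRMs listed above (Skywork, RLHFlow, ImplicitPRM) would then be a one-line change, which makes this loop a convenient harness for comparing them.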