A UW Math AI Lab project.
Within the past few years, the ability of Large Language Models (LLMs) to generate formal mathematical proofs has improved drastically. We provide a comparison of various LLMs' effectiveness at producing successful proofs in the Lean 4 theorem prover on both the MiniF2F and MiniCTX datasets. Specifically, we compare three general-purpose LLMs (Gemini 3 Pro, ChatGPT 5.2, Claude 4.5 Opus) and three specialized LLMs (Kimina-72B, Goedel-Prover-32B, DeepSeek-Prover-V2-7B). We plan to test each model's effectiveness at Pass@4 and at multi-turn proof generation.
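As a sketch of how Pass@k scoring works, the snippet below implements the standard unbiased pass@k estimator (1 − C(n−c, k)/C(n, k), where n samples are drawn per theorem and c of them produce a proof that Lean accepts). This is one common way to score such experiments, not necessarily the exact procedure used here, and the example numbers are placeholders rather than our results.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: 1 - C(n-c, k) / C(n, k).

    n: proof attempts sampled per theorem
    c: attempts that Lean accepts as a complete proof
    k: budget being scored (e.g. k=4 for Pass@4)
    """
    if n - c < k:
        # Fewer failures than k slots: every size-k draw contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With n == k (e.g. exactly 4 attempts, scored at Pass@4), this reduces
# to "did any attempt succeed": 1.0 if c >= 1, else 0.0.
print(pass_at_k(n=4, c=1, k=4))   # 1.0
print(pass_at_k(n=10, c=3, k=1))  # 0.3
```

Sampling n > k attempts and using this estimator gives a lower-variance score than a single draw of k attempts.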
The following table tracks the cumulative number of MiniF2F problems that Sonnet 4.5 solved as it was allowed additional attempts per problem.
| Attempts allowed | Cumulative problems solved |
|---|---|
| 1 | 146 |
| 2 | 183 |
| 3 | 207 |
| 4 | 217 |
| 5 | 227 |
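To read these counts as success rates, divide by the benchmark size. A minimal sketch, assuming the standard 244-problem MiniF2F test split (adjust `TOTAL` if a different split was used):

```python
# Cumulative successes copied from the table above.
successes = {1: 146, 2: 183, 3: 207, 4: 217, 5: 227}
TOTAL = 244  # assumed size of the MiniF2F test split

for attempts, solved in successes.items():
    rate = solved / TOTAL
    print(f"{attempts} attempt(s): {solved}/{TOTAL} = {rate:.1%}")
# 1 attempt -> 59.8%, rising to 93.0% at 5 attempts under this assumption.
```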