Evan Wang, Simon Chess, Daniel Lee, Siyuan Ge, Ajit Mallavarapu, Vasily Ilin
|
Existing Lean datasets contain correct proofs. Models learn error correction with RL, that's expensive. We release a dataset of 260k erroneous Lean proofs, the compiler feedback, error explanation, proof repair reasoning trace, and the corrected proof.
|
|
@article{wang2026repairlean,
title = {Learning to Repair Lean Proofs from Compiler Feedback},
author = {Wang, Evan and Chess, Simon and Lee, Daniel and Ge, Siyuan and Mallavarapu, Ajit and Ilin, Vasily},
journal= {arXiv preprint arXiv:2602.02990},
year = {2026},
doi = {10.48550/arXiv.2602.02990},
url = {https://arxiv.org/abs/2602.02990}
}First, run the following commands
pip install -r requirements.txt
pip install -e .
For the full process, run these. Alternatively, you can run them in a slightly different order
python -m scripts.filter_files && \
python -m scripts.build_theorem_cache && \
python -m scripts.create_theorem_pairs && \
python -m scripts.expand_theorem_pairs && \
python -m scripts.create_incorrect_proofs && \
python -m scripts.annotate_proofs && \
python -m scripts.generate_explanations && \
python -m scripts.build_sft_dataset
For filtering into src_pass.jsonl (check to make sure no errors)
python -m scripts.filter_files
For generating errors by replacing theorems
python -m scripts.build_theorem_cache && \
python -m scripts.create_theorem_pairs && \
python -m scripts.expand_theorem_pairs && \
python -m scripts.create_incorrect_proofs && \
python -m scripts.annotate_proofs
TODO (alternative ways to generate errors)
python -m scripts.generate_explanations && \
python -m scripts.build_sft_dataset
