This project is a prototype framework for extracting training data for machine-learning-based proof assistants. Specifically, it extracts maximally rich geometric data from constructive mathematical proofs written in Lean.
For additional context, refer to main.pdf.
- Clone this repository.
- Initialize a Lean 4 project within the cloned repository and ensure Python is installed.
- Modify your
lakefileto include any required packages and Lean modules, ensuring they can be imported in~/Foo/Test.lean.
The module Export.lean was originally programmed by Mario Carneiro (see Lean4Export and Lean3 Export Format).
Minimal modifications were made to Main.lean, including an I/O operation at the end to write the output of Carneiro’s export process to a .txt file.
-
Write your Lean program (i.e., constructive mathematical theorem) in
~/Foo/Test.lean, ensuring it compiles. -
Alternatively, import all necessary Lean modules you'd like to export.
-
Execute the following command in a terminal within the same directory as
Main.lean:lake exe foo Foo.Test -- <full name of your Lean program>
This will export the fully elaborated Lean program to
Export.txt. -
To export all imported modules in
./Foo/Test.lean, use:lake exe foo Foo.Test
This process allows exporting entire GitHub repositories, modules, etc.
This component processes Lean exports into structured data.
Once Export.txt has been generated using the Lean.foo process, run:
python main.py Export.txtUpon successful execution, this Python script generates graph.csv in the same directory.
- A set of declarations forming a directed acyclic graph (DAG).
- An edge
d → d'exists ifd'is referenced withind. - Dependencies are labeled based on their position in the structure.
- Computational Perspective: The function computed by
main.pyis an isomorphism—it acts as an interpreter for Lean. - Data Representation: The exported data provides a compiler-invariant representation of fully elaborated Lean programs—constructive proofs of theorems. While the output format (
graph.csv) represents labeled directed edges, alternative representations (e.g., sets of vertices and edges) can be derived using the functions inbase.py. - Machine Learning Relevance: The exported data serves as the rawest form of positive training data for ML-based proof assistants.
To visualize the exported digraph, use Cosmograph:
- Upload
graph.csvto Cosmograph. - The image
Nat.gcd_self.pngin this repo is an example visualization.
A deeper understanding of the syntactic structure of each declaration will facilitate better insights into the exported digraph, enabling more effective transformations into training data.
This project leverages the LeanDojo (LeanDojo GitHub), LeanCopilot (LeanCopilot GitHub), and ReProver (ReProver GitHub) frameworks.
The PDF main.pdf is a preprint of a paper I am writing on this project. Look there for specifics.
This framework provides an alternative approach to Lean data extraction by focusing on fully elaborated declarations and dependency structures. The extracted data can be directly utilized for machine learning applications, particularly in proof assistant research.