Note
This monorepo hosts the dataset, packages, and experiment code accompanying FLARE: Verifying MILP Reformulations with LLM-Based Formal Proof Synthesis.
FLARE (Formulation-Level Automated Reformulation Evaluation) uses an
LLM-based agent and the Lean 4 proof assistant to verify mixed-integer linear
program (MILP) reformulations. FLARE is implemented in the milp-flare Python package and evaluated on the FormulationBench dataset using the formulation-bench Python package. This repository is a monorepo hosting the FormulationBench dataset, both Python packages, and all of the experimental code used to produce the paper's results.
| Project | Path | Description | Links |
|---|---|---|---|
| FormulationBench | dataset/ |
20 problems, 116 MILP formulations, 96 labelled reformulation pairs. | |
formulation-bench |
packages/formulation_bench/ |
Utilities for loading and working with the FormulationBench dataset. | |
milp-flare |
packages/milp_flare/ |
Official implementation of FLARE and FLARE-NL. | |
| Experiments | src/, experiments/, scripts/ |
Paper experiment code: alternative verifiers, prompt templates, and experiment/analysis scripts. | |
| Paper site | site/ |
Astro landing page for the paper, deployed to GitHub Pages on pushes to the site branch. |
Live site |
The two scripts in experiments/ reproduce every quantitative result.
- Install uv, then sync the workspace:
make install
- Build the
flare-agentDocker image (FLAREruns each agent in a Docker container):make -C packages/milp_flare build-image
- Populate all necessary API keys for the LLM-based verifiers (Anthropic, OpenAI, DeepSeek). The relevant secrets go in a top-level
.envfile (see.env.example). - Install a Gurobi license (required by the
executionbaseline and the dataset'ssolve.pyscripts). A free academic license works.
See the milp-flare installation guide for more details.
Runs execution, equivamap, and FLARE on every reformulation pair, 3
runs each, with results written under runs/<timestamp>/:
uv run python -m experiments.baseline -c experiments/configs/baseline.yamlSubsets and worker counts are overridable on the CLI:
uv run python -m experiments.baseline -c experiments/configs/baseline.yaml \
--problems 1,2,3 --workers 5 --runs 3Sweeps prompt variants and LLM models for FLARE-NL:
uv run python -m experiments.ablation -c experiments/configs/ablation.yamlFor Table 5 in the Appendix, use the ablation_p12.yaml configuration.
uv run python -m experiments.ablation -c experiments/configs/ablation_p12.yamlSweeps different agent harnesses for FLARE:
uv run python -m experiments.baseline -c experiments/configs/baseline_flare.yamlPer-instance and aggregated classification metrics for any run directory:
uv run python scripts/report.py runs/<timestamp> # summary
uv run python scripts/report.py runs/<timestamp> -i # per-instanceAdditional analysis scripts (cost/time plots, context analysis) live under
scripts/analysis/.
See AGENTS.md for development information.
TODO: arXiv paper.