FLARE

Note

This monorepo hosts the dataset, packages, and experiment code accompanying FLARE: Verifying MILP Reformulations with LLM-Based Formal Proof Synthesis.

FLARE (Formulation-Level Automated Reformulation Evaluation) uses an LLM-based agent and the Lean 4 proof assistant to verify mixed-integer linear program (MILP) reformulations. FLARE is implemented in the milp-flare Python package and evaluated on the FormulationBench dataset using the formulation-bench Python package. This repository is a monorepo hosting the FormulationBench dataset, both Python packages, and all of the experimental code used to produce the paper's results.

Sub-Projects

Project	Path	Description	Links
FormulationBench	`dataset/`	20 problems, 116 MILP formulations, 96 labelled reformulation pairs.
`formulation-bench`	`packages/formulation_bench/`	Utilities for loading and working with the FormulationBench dataset.
`milp-flare`	`packages/milp_flare/`	Official implementation of FLARE and FLARE-NL.
Experiments	`src/`, `experiments/`, `scripts/`	Paper experiment code: alternative verifiers, prompt templates, and experiment/analysis scripts.
Paper site	`site/`	Astro landing page for the paper, deployed to GitHub Pages on pushes to the `site` branch.	Live site

Reproducing Experimental Results

The two scripts in experiments/ reproduce every quantitative result.

Setup

Install uv, then sync the workspace:
```
make install
```
Build the flare-agent Docker image (FLARE runs each agent in a Docker container):
```
make -C packages/milp_flare build-image
```
Populate all necessary API keys for the LLM-based verifiers (Anthropic, OpenAI, DeepSeek). The relevant secrets go in a top-level .env file (see .env.example).
Install a Gurobi license (required by the execution baseline and the dataset's solve.py scripts). A free academic license works.

See the milp-flare installation guide for more details.

Baseline (Table 1, Table 2)

Runs execution, equivamap, and FLARE on every reformulation pair, 3 runs each, with results written under runs/<timestamp>/:

uv run python -m experiments.baseline -c experiments/configs/baseline.yaml

Subsets and worker counts are overridable on the CLI:

uv run python -m experiments.baseline -c experiments/configs/baseline.yaml \
    --problems 1,2,3 --workers 5 --runs 3

FLARE-NL Ablation Study (Table 3, Table 5)

Sweeps prompt variants and LLM models for FLARE-NL:

uv run python -m experiments.ablation -c experiments/configs/ablation.yaml

For Table 5 in the Appendix, use the ablation_p12.yaml configuration.

uv run python -m experiments.ablation -c experiments/configs/ablation_p12.yaml

FLARE Harness Evaluation (Table 6)

Sweeps different agent harnesses for FLARE:

uv run python -m experiments.baseline -c experiments/configs/baseline_flare.yaml

Aggregating results

Per-instance and aggregated classification metrics for any run directory:

uv run python scripts/report.py runs/<timestamp>           # summary
uv run python scripts/report.py runs/<timestamp> -i        # per-instance

Additional analysis scripts (cost/time plots, context analysis) live under scripts/analysis/.

Development

See AGENTS.md for development information.

Cite

TODO: arXiv paper.

License

MIT

Name		Name	Last commit message	Last commit date
Latest commit History 458 Commits
.claude		.claude
.github/workflows		.github/workflows
.vscode		.vscode
dataset		dataset
experiments		experiments
packages		packages
scripts		scripts
site		site
src		src
tests/verifiers		tests/verifiers
.env.example		.env.example
.git-blame-ignore-revs		.git-blame-ignore-revs
.gitignore		.gitignore
.mcp.json		.mcp.json
AGENTS.md		AGENTS.md
CLAUDE.md		CLAUDE.md
LICENSE.md		LICENSE.md
Makefile		Makefile
README.md		README.md
codecov.yml		codecov.yml
lake-manifest.json		lake-manifest.json
lakefile.toml		lakefile.toml
lean-toolchain		lean-toolchain
lefthook.yml		lefthook.yml
pyproject.toml		pyproject.toml
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

FLARE

Sub-Projects

Reproducing Experimental Results

Setup

Baseline (Table 1, Table 2)

FLARE-NL Ablation Study (Table 3, Table 5)

FLARE Harness Evaluation (Table 6)

Aggregating results

Development

Cite

License

About

Uh oh!

Releases

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

FLARE

Sub-Projects

Reproducing Experimental Results

Setup

Baseline (Table 1, Table 2)

FLARE-NL Ablation Study (Table 3, Table 5)

FLARE Harness Evaluation (Table 6)

Aggregating results

Development

Cite

License

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Contributors

Uh oh!

Languages