Skip to content

henryrobbins/flare

Repository files navigation

FLARE

CI codecov License: MIT Checked with mypy Ruff

Note

This monorepo hosts the dataset, packages, and experiment code accompanying FLARE: Verifying MILP Reformulations with LLM-Based Formal Proof Synthesis.

FLARE (Formulation-Level Automated Reformulation Evaluation) uses an LLM-based agent and the Lean 4 proof assistant to verify mixed-integer linear program (MILP) reformulations. FLARE is implemented in the milp-flare Python package and evaluated on the FormulationBench dataset using the formulation-bench Python package. This repository is a monorepo hosting the FormulationBench dataset, both Python packages, and all of the experimental code used to produce the paper's results.

Sub-Projects

Project Path Description Links
FormulationBench dataset/ 20 problems, 116 MILP formulations, 96 labelled reformulation pairs. Docs
formulation-bench packages/formulation_bench/ Utilities for loading and working with the FormulationBench dataset. PyPI codecov Docs
milp-flare packages/milp_flare/ Official implementation of FLARE and FLARE-NL. PyPI codecov Docs
Experiments src/, experiments/, scripts/ Paper experiment code: alternative verifiers, prompt templates, and experiment/analysis scripts. codecov
Paper site site/ Astro landing page for the paper, deployed to GitHub Pages on pushes to the site branch. Live site

Reproducing Experimental Results

The two scripts in experiments/ reproduce every quantitative result.

Setup

  1. Install uv, then sync the workspace:
    make install
  2. Build the flare-agent Docker image (FLARE runs each agent in a Docker container):
    make -C packages/milp_flare build-image
  3. Populate all necessary API keys for the LLM-based verifiers (Anthropic, OpenAI, DeepSeek). The relevant secrets go in a top-level .env file (see .env.example).
  4. Install a Gurobi license (required by the execution baseline and the dataset's solve.py scripts). A free academic license works.

See the milp-flare installation guide for more details.

Baseline (Table 1, Table 2)

Runs execution, equivamap, and FLARE on every reformulation pair, 3 runs each, with results written under runs/<timestamp>/:

uv run python -m experiments.baseline -c experiments/configs/baseline.yaml

Subsets and worker counts are overridable on the CLI:

uv run python -m experiments.baseline -c experiments/configs/baseline.yaml \
    --problems 1,2,3 --workers 5 --runs 3

FLARE-NL Ablation Study (Table 3, Table 5)

Sweeps prompt variants and LLM models for FLARE-NL:

uv run python -m experiments.ablation -c experiments/configs/ablation.yaml

For Table 5 in the Appendix, use the ablation_p12.yaml configuration.

uv run python -m experiments.ablation -c experiments/configs/ablation_p12.yaml

FLARE Harness Evaluation (Table 6)

Sweeps different agent harnesses for FLARE:

uv run python -m experiments.baseline -c experiments/configs/baseline_flare.yaml

Aggregating results

Per-instance and aggregated classification metrics for any run directory:

uv run python scripts/report.py runs/<timestamp>           # summary
uv run python scripts/report.py runs/<timestamp> -i        # per-instance

Additional analysis scripts (cost/time plots, context analysis) live under scripts/analysis/.

Development

See AGENTS.md for development information.

Cite

TODO: arXiv paper.

License

MIT

About

Official implementation of "FLARE: Verifying MILP Reformulations with LLM-Based Formal Proof Synthesis"

Resources

License

Stars

Watchers

Forks

Contributors