FormulaCode is a live benchmark for evaluating the holistic ability of LLM agents to optimize codebases. FormulaCode consists of two parts: a pipeline that constructs performance-optimization tasks, and an execution harness that connects a language model to our terminal sandbox. This repository contains the task generation pipeline.

`datasmith` is a Python package for automatically curating and managing FormulaCode tasks. After installation, it is designed to run as a monthly cron job that updates the FormulaCode dataset with new commits and repositories.
Install uv and set up the development environment:
```shell
# Install uv
$ curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone the repository.
$ git clone https://github.com/formula-code/datasmith.git
$ cd datasmith

# Install pre-commit hooks and dev dependencies.
$ make install

# Resolve initial formatting issues.
$ uv run pre-commit run -a
$ make check

# Run tests to verify installation.
$ make test
```

Make a `tokens.env` file with your environment variables. You'll need:
- A GitHub token with `repo` and `read:org` permissions to read repositories and commits.
- An OpenAI API key or a local LLM endpoint (recommended) for synthesizing reproducible containers and for performance classification.
- A DockerHub token with `write:packages` permissions if you want to publish images to DockerHub for easier sharing and distribution.
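datasmith's scripts read these values from the environment. If you want to source a `tokens.env`-style file from Python without extra dependencies, here is a minimal stdlib sketch (the `load_env_file` helper is hypothetical, not part of datasmith; it assumes the simple `KEY=value` format shown in the sample below):

```python
import os


def load_env_file(path: str) -> dict[str, str]:
    """Parse a simple KEY=value file, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    with open(path) as fh:
        for raw in fh:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            # Drop trailing inline comments like `HF_TOKEN=hf_xxx  # note`
            value = value.split("  #", 1)[0].strip()
            values[key.strip()] = value
    return values


def export(path: str) -> None:
    """Load a tokens.env file into the current process environment."""
    os.environ.update(load_env_file(path))
```

Tools like `python-dotenv` do the same job more robustly; the sketch just shows that nothing magical is involved.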
A sample `tokens.env` file:
```shell
# Cache and backup locations
CACHE_LOCATION=/home/???/formulacode/datasmith/scratch/artifacts/cache.db
BACKUP_DIR=/home/???/formulacode/backup/

# Scraping tokens
GH_TOKENS=github_pat_???,github_pat_???
CODECOV_TOKEN=54c6???

# LLM configuration for context synthesis
DSPY_MODEL_NAME=openai/meta-llama/Llama-3.3-70B-Instruct
DSPY_URL=http://localhost:30000/v1
DSPY_API_KEY=local
DSPY_TEMPERATURE=0.7

# For DockerHub publishing (dataset verification)
DOCKERHUB_NAMESPACE=formulacode  # Required for dataset verification
DOCKERHUB_USERNAME=myuser        # Required for dataset verification
DOCKERHUB_TOKEN=dckr_pat_xxxxx   # Required for dataset verification

# For Hugging Face dataset uploads
HF_TOKEN=hf_xxxxx                # Required for --upload-to-hf

# Depends on the system.
#DOCKER_USE_BUILDX=0
DOCKER_NETWORK_MODE=host
```

Set up a cron job that runs the `update_formulacode.py` script on the 25th of every month. This keeps the FormulaCode dataset up to date with new repositories and commits.
```shell
$ crontab -l
# Edit this file to introduce tasks to be run by cron.
#
# Each task to run has to be defined through a single line
# indicating with different fields when the task will be run
# and what command to run for the task
#
# To define the time you can provide concrete values for
# minute (m), hour (h), day of month (dom), month (mon),
# and day of week (dow) or use '*' in these fields (for 'any').
#
# Notice that tasks will be started based on the cron's system
# daemon's notion of time and timezones.
#
# Output of the crontab jobs (including errors) is sent through
# email to the user the crontab file belongs to (unless redirected).
#
# For example, you can run a backup of all your user accounts
# at 5 a.m every week with:
# 0 5 * * 1 tar -zcf /var/backups/home.tgz /home/
#
# For more information see the manual pages of crontab(5) and cron(8)
#
# m h dom mon dow command
# Clean up Docker containers every hour
0 * * * * /usr/bin/docker container prune -f
# Clean up dangling Docker images every week
0 0 * * 0 /usr/bin/docker image prune -f
# Run FormulaCode update script on the 25th day of every month at 2am
0 2 25 * * flock -n /tmp/update_formulacode.lock /home/???/formulacode/datasmith/.venv/bin/python /home/???/formulacode/datasmith/scratch/scripts/update_formulacode.py --start-date "$(date -d '-1 month' +\%Y-\%m-01)" --end-date "$(date +\%Y-\%m-01)" >> /home/???/formulacode/datasmith/scratch/logs/update_formulacode_$(date +\%Y\%m\%d).log 2>&1
$ crontab -e
# <Make the necessary edits>
```

```text
src/datasmith/
├── agents/      # LLM agent orchestration for build tasks and context synthesis
├── benchmark/   # Benchmark collection from ASV (Airspeed Velocity) configs
├── collation/   # Benchmark result collation and aggregation
├── core/        # Shared infra: API clients, caching, Git integration, data models
├── detection/   # Performance breakpoint and regression detection
├── docker/      # Docker image building, validation, DockerHub publishing
├── execution/   # Commit collection, filtering, and performance analysis
├── notebooks/   # Jupyter notebook utilities and context-registry updates
└── scrape/      # Web scraping, report building, LLM-based classification
```
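The cron entry above builds its `--start-date`/`--end-date` pair with `date -d '-1 month' +%Y-%m-01` and `date +%Y-%m-01`. The same first-of-previous-month / first-of-current-month window can be computed in Python; a stdlib sketch (the `month_window` helper is illustrative, not part of datasmith):

```python
from datetime import date


def month_window(today: date) -> tuple[str, str]:
    """Return (first day of previous month, first day of current month)
    as YYYY-MM-DD strings, mirroring `date -d '-1 month' +%Y-%m-01`."""
    end = today.replace(day=1)                            # first of the current month
    last_of_prev = date.fromordinal(end.toordinal() - 1)  # last day of previous month
    start = last_of_prev.replace(day=1)                   # first of the previous month
    return start.isoformat(), end.isoformat()
```

Going through the last day of the previous month avoids the edge cases of naive "subtract one from the month number" arithmetic at year boundaries.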
Pipeline scripts live in `scratch/scripts/`. The monthly orchestrator is `update_formulacode.py`; individual steps can also be run standalone.
The diagram below summarizes how `scratch/scripts/update_formulacode.py` orchestrates the monthly update pipeline.
```mermaid
sequenceDiagram
    participant U as update_formulacode.py
    participant ENV as env setup
    participant C1 as collect_and_filter_commits.py
    participant C2 as prepare_commits_for_building_reports.py
    participant C3 as collect_perf_commits.py
    participant C4 as synthesize_contexts.py
    participant C5 as build_and_publish_to_dockerhub.py
    participant C6 as merge_perfonly_commits_master.py
    participant C7 as prepare_formulacode_dataset.py
    participant CSV as repos_valid.csv
    participant GH as github or offline store
    participant TMP as temp repo dir
    participant MI as merge info
    participant FS as file system
    participant LLM as llm backends
    participant SQL as sqlite cache
    participant CR as context registry
    participant DOCK as docker build
    participant DH as dockerhub
    participant HF as hugging face

    %% Orchestrator setup (grey)
    rect rgb(230,230,230)
        Note over U,ENV: Prerequisite: collect_commits.py / scrape_repositories.py (one-time)
        U->>ENV: step 0 setup environment and logging
        ENV->>U: environment ready
    end

    %% Step 1 collect_and_filter_commits.py (green)
    rect rgb(210,245,220)
        U->>C1: step 1 collect and filter commits
        C1->>CSV: read repos_valid.csv
        C1->>TMP: clone repo into temp dir
        C1->>MI: collect merge shas and commit info
        C1->>FS: write merge_commits_filtered parquet
    end

    %% Step 2 prepare_commits_for_building_reports.py (yellow)
    rect rgb(255,250,210)
        U->>C2: step 2 prepare commits for reports
        C2->>FS: read merge_commits_filtered parquet
        C2->>C2: tokenize patches and crude perf filter
        C2->>C2: analyze commits in threads
        C2->>DOCK: make tasks with container names
        C2->>FS: optional get patch from diff url
        C2->>FS: write parquet with patch
    end

    %% Step 3 collect_perf_commits.py (red-ish)
    rect rgb(255,225,220)
        U->>C3: step 3 classify performance commits
        C3->>FS: read prepared parquet
        C3->>C3: report builder per row
        C3->>SQL: cache completion in sqlite
        C3->>LLM: call llm backends
        LLM->>C3: performance classification
        C3->>FS: write raw parquet
        C3->>FS: write perf only parquet
    end

    %% Step 4 synthesize_contexts.py (purple)
    rect rgb(235,220,255)
        U->>C4: step 4 synthesize contexts
        C4->>C4: configure agent backends
        C4->>FS: load perf only parquet
        C4->>CR: load context registry and update
        CR->>C4: context registry ready
        C4->>DOCK: build base image
        DOCK->>C4: base image built
        C4->>C4: prepare task list per repo and commit
        C4->>C4: agent build and validate in threads
        C4->>FS: write results jsonl and all files by image json
        C4->>CR: update context_registry json
    end

    %% Step 5 build_and_publish_to_dockerhub.py (teal)
    rect rgb(210,245,245)
        U->>C5: step 5 build and publish
        C5->>FS: read perf only parquet
        C5->>CR: load context registry
        C5->>DOCK: build images
        DOCK->>C5: built images
        C5->>DH: publish images to dockerhub
    end

    %% Step 6 merge_perfonly_commits_master.py (blue)
    rect rgb(210,225,255)
        U->>C6: step 6 merge into master parquet
        C6->>FS: read new perfonly parquet
        C6->>FS: read master parquet
        C6->>C6: deduplicate and merge
        C6->>FS: write updated master parquet
    end

    %% Step 7 prepare_formulacode_dataset.py (orange)
    rect rgb(255,235,210)
        U->>C7: step 7 enrich and upload to HF (optional)
        C7->>FS: read master parquet
        C7->>DH: resolve available images
        DH->>C7: image map
        C7->>C7: derive container names and normalize columns
        C7->>C7: filter to key columns
        C7->>FS: write enriched parquet
        C7->>HF: upload monthly and default configs
    end
```
Prerequisites (one-time setup)
Scrape GitHub for ASV-compatible repositories using `collect_commits.py` or `scrape_repositories.py`. These are not part of the monthly orchestrator; run them once to build `repos_valid.csv`.
```shell
$ python scratch/scripts/collect_commits.py \
    --repos scratch/artifacts/pipeflush/repos_valid.csv \
    --outfile scratch/artifacts/pipeflush/commits_all.jsonl \
    --max-pages 50
```

The output `repos_valid.csv` contains repositories that aren't forks/reuploads, have at least a minimum number of stars, and pass other sanity checks (~700 repositories).
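The sanity checks behind `repos_valid.csv` reduce to simple per-row filters. A sketch over a CSV with assumed `full_name`, `stars`, and `is_fork` columns (both the column names and the 50-star threshold are illustrative, not datasmith's actual criteria):

```python
import csv


def valid_repos(path: str, min_stars: int = 50) -> list[str]:
    """Keep non-fork repositories above a star threshold."""
    keep = []
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            if row["is_fork"].lower() == "true":   # drop forks/reuploads
                continue
            if int(row["stars"]) < min_stars:      # drop low-signal repos
                continue
            keep.append(row["full_name"])
    return keep
```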
Given the list of repositories, find merged PRs and filter out commits that only modified benchmarking files, were documentation-only, or could not be installed.
```shell
$ python scratch/scripts/collect_and_filter_commits.py \
    --filtered-benchmarks-pth scratch/artifacts/pipeflush/repos_valid.csv \
    --output-pth scratch/artifacts/pipeflush/merge_commits_filtered.parquet \
    --threads 8 \
    --procs 32 \
    --since 2025-10-01 \
    --until 2025-11-01
```

Tokenize patches, apply a crude performance filter, and optionally fetch full patch text from the GitHub diff API.
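The "crude performance filter" is a cheap lexical pass that runs before the LLM classification. A sketch of what such a filter might look like (the keyword list is illustrative, not datasmith's actual heuristic):

```python
import re

# Prefix matching (e.g. "optimi" covers optimize/optimised/optimization);
# false positives are fine because the LLM makes the final call downstream.
PERF_HINTS = re.compile(
    r"\b(speed[\s-]*up|faster|optimi[sz]|cach(e|ing)|vectoriz|"
    r"benchmark|regression|latency|throughput|performance)",
    re.IGNORECASE,
)


def looks_performance_related(commit_message: str, patch: str = "") -> bool:
    """Cheap lexical screen over the commit message and patch text."""
    return bool(PERF_HINTS.search(commit_message) or PERF_HINTS.search(patch))
```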
```shell
$ python scratch/scripts/prepare_commits_for_building_reports.py \
    --input scratch/artifacts/pipeflush/merge_commits_filtered.parquet \
    --output scratch/artifacts/pipeflush/merge_commits_filtered_with_patch.parquet \
    --max-workers 200 \
    --filter-repos \
    --fetch-patches
```

Build a structured report for each commit, call LLM backends for performance classification, and write a filtered parquet of performance-only commits.
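This step caches each LLM completion in SQLite so that reruns do not re-query the backend. A minimal sketch of such a cache (the class and schema are hypothetical, not datasmith's implementation):

```python
import hashlib
import sqlite3
from typing import Optional


class CompletionCache:
    """Tiny SQLite-backed cache keyed on a hash of (model, prompt)."""

    def __init__(self, path: str = ":memory:") -> None:
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS completions (key TEXT PRIMARY KEY, response TEXT)"
        )

    @staticmethod
    def _key(model: str, prompt: str) -> str:
        return hashlib.sha256(f"{model}\x00{prompt}".encode()).hexdigest()

    def get(self, model: str, prompt: str) -> Optional[str]:
        row = self.db.execute(
            "SELECT response FROM completions WHERE key = ?",
            (self._key(model, prompt),),
        ).fetchone()
        return row[0] if row else None

    def put(self, model: str, prompt: str, response: str) -> None:
        self.db.execute(
            "INSERT OR REPLACE INTO completions VALUES (?, ?)",
            (self._key(model, prompt), response),
        )
        self.db.commit()
```

Callers check `get` before issuing a request and `put` the response afterwards, so an interrupted run resumes where it left off.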
```shell
$ python scratch/scripts/collect_perf_commits.py \
    --commits scratch/artifacts/pipeflush/merge_commits_filtered_with_patch.parquet \
    --outfile scratch/artifacts/pipeflush/perfonly_commits \
    --max-workers -1
# Produces perfonly_commits.raw.parquet and perfonly_commits.parquet
```

Each context is a (repo, commit) pair with an associated `build_env.sh` script. Contexts that fail to build are filtered out. Common failure reasons:
- The commit couldn't be checked out
- The commit didn't have an `asv.conf.json` file
- The ASV environment could not be built
- A quick `asv run` did not succeed
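The first two failures can be pre-screened cheaply with git plumbing before any container is built; for example, checking whether `asv.conf.json` exists in the tree at a given commit (a sketch assuming a local clone; not datasmith's actual check):

```python
import subprocess


def has_asv_conf(repo_dir: str, sha: str) -> bool:
    """True if asv.conf.json exists anywhere in the tree at `sha`."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "ls-tree", "-r", "--name-only", sha],
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:  # commit cannot be resolved/checked out
        return False
    return any(p.endswith("asv.conf.json") for p in result.stdout.splitlines())
```

`git ls-tree` inspects the commit without touching the working tree, so this screen never needs a checkout at all.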
```shell
$ python scratch/scripts/synthesize_contexts.py \
    --commits scratch/artifacts/pipeflush/perfonly_commits.parquet \
    --output-dir scratch/artifacts/pipeflush/results_synthesis/ \
    --context-registry scratch/artifacts/pipeflush/context_registry.json \
    --max-workers 32 \
    --max-attempts 3 \
    --max-steps 10 \
    --max-similar-candidates 5 \
    --ignore-exhausted \
    --push-to-dockerhub
```

Rebuild Docker images and publish them to DockerHub. Credentials are read from the environment variables configured in `tokens.env` (see Installation).
```shell
$ python scratch/scripts/build_and_publish_to_dockerhub.py \
    --commits scratch/artifacts/pipeflush/perfonly_commits.parquet \
    --context-registry scratch/artifacts/pipeflush/context_registry.json \
    --namespace formulacode \
    --max-workers 5 \
    --skip-existing
```

Deduplicate and append this month's performance commits into the cumulative master parquet.
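The merge amounts to keeping the first occurrence of each (repo, commit) key while preserving row order. Sketched over plain dicts (the `repo`/`sha` column names are assumptions; the real script operates on parquet files):

```python
def merge_dedup(master: list[dict], new: list[dict]) -> list[dict]:
    """Append rows from `new` whose (repo, sha) key is not already in `master`."""
    seen = {(row["repo"], row["sha"]) for row in master}
    merged = list(master)
    for row in new:
        key = (row["repo"], row["sha"])
        if key not in seen:
            seen.add(key)
            merged.append(row)
    return merged
```

Because existing rows always win, re-running a month's update is idempotent: nothing already in the master parquet is duplicated or overwritten.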
```shell
$ python scratch/scripts/merge_perfonly_commits_master.py \
    --new-perfonly scratch/artifacts/pipeflush/perfonly_commits.parquet \
    --master scratch/artifacts/pipeflush/perfonly_commits_master.parquet
```

Derive container names, resolve Docker Hub images, normalize difficulty/classification, generate task IDs, and upload the enriched dataset to Hugging Face with monthly configs. Only key columns (task metadata, patch, instructions) are uploaded.
```shell
$ python scratch/scripts/prepare_formulacode_dataset.py \
    --input scratch/artifacts/pipeflush/perfonly_commits_master.parquet \
    --output scratch/artifacts/pipeflush/perfonly_enriched.parquet \
    --dockerhub-repository formulacode/all \
    --upload-to-hf formulacode/formulacode-all \
    --hf-verified-filter /path/to/valid_tasks.json
```

Requires `HF_TOKEN` in `tokens.env`. The upload creates `default`, `verified`, and per-month (`YYYY-MM`) configs on Hugging Face.
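Container names are derived from the repository and commit. Purely as an illustration (this naming scheme is an assumption, not datasmith's actual convention), a Docker-safe slug might be built like this:

```python
def container_name(repo: str, sha: str, namespace: str = "formulacode") -> str:
    """Build a hypothetical DockerHub image reference from a repo and commit.

    Docker repository names must be lowercase, and a truncated sha keeps
    the tag short while staying unique in practice.
    """
    slug = repo.lower().replace("/", "-")
    return f"{namespace}/{slug}:{sha[:12]}"
```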
Evaluation is done in FormulaCode's fork of the terminal-bench evaluation framework.
