A benchmarking framework that jointly evaluates predictive accuracy and carbon footprint of generative AI models across six scientific discovery tasks.
Key Finding: Simpler, specialized models frequently match or approach state-of-the-art accuracy while consuming 10–100× less compute.
| Category | Activity | CO₂ Emission |
|---|---|---|
| Everyday | Smartphone charge (iPhone 16 Pro Max) | ~9.7 g CO₂ eq/charge |
| | Driving a car (EU average) | ~170 g CO₂ eq/km |
| | Commercial aviation (Boeing 737) | ~15.8 kg CO₂ eq/km |
| LLM inference | Text generation (Claude 3.7 Sonnet) | ~2.12 g CO₂ eq/call |
| | Image generation (Stable Diffusion) | ~1.38 g CO₂ eq/image |
| Chemical simulation | Classical MD (force field) | 10 g CO₂ eq/1M steps |
| | Ab initio MD (PBE, 50 atoms) | 140.96 kg CO₂ eq/1M steps |
| Chemical synthesis | Organic synthesis (Letermovir) | 369 kg CO₂ eq/kg |
| | Material synthesis (UiO-66-NH₂) | 43 kg CO₂ eq/kg |
| | Battery synthesis (vanadium flow battery) | 37 kg CO₂ eq/MWh |
All tasks are benchmarked on the same hardware with full carbon tracking, unless a per-task note states otherwise.
Hardware: NVIDIA RTX 5000 Ada Generation (32 GB) · Intel Xeon Platinum 8558 (192 cores) · 503 GB RAM
Column definitions:
- CO₂/exp: total CO₂ for the full benchmark run (actual experiment)
- CO₂/job: CO₂ normalized to a fixed workload (see per-task note)
- Time/exp: total wall-clock time for the full benchmark run
- Time/job: wall-clock time normalized to the same fixed workload as CO₂/job
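The per-job normalization can be sketched as a simple linear rescaling of the full-run total. This is a minimal illustration, assuming cost grows linearly with workload size; the helper name `per_job` is hypothetical and not part of the repository, and the published tables may have been computed from unrounded measurements, so small deviations are expected.

```python
def per_job(value_per_exp: float, n_items: int, job_size: int) -> float:
    """Rescale a full-run total to a fixed workload (assumes linear scaling)."""
    if n_items <= 0:
        raise ValueError("n_items must be positive")
    return value_per_exp * job_size / n_items


# WBM table, CHGNet row: 1.50 g over N = 100 structures, job = 1,000 structures
wbm_chgnet = per_job(1.50, n_items=100, job_size=1_000)

# USPTO-MIT table, neuralsym row: 43.9 g over N = 40,029 reactions, job = 500 reactions
forward_neuralsym = per_job(43.9, n_items=40_029, job_size=500)

print(f"{wbm_chgnet:.1f} g, {forward_neuralsym:.2f} g")  # 15.0 g, 0.55 g
```

Both values match the CO₂/job entries reported for those rows. The same function applies to the time columns by passing Time/exp instead of CO₂/exp.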
Dataset: USPTO-50K · N = 5,007 reactions · Metric: Top-50 exact match · CO₂/job: per 500 reactions
| Year | Venue | Model | Architecture | Params | Top-10 | Top-50 | CO₂/exp (g) | CO₂/job (g) | Time/exp (s) | Time/job (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| 2017 | Chem. Eur. J. | neuralsym | MLP | 32.5M | 72.8% | 74.8% | 35.0 | 3.50 | 1,282 | 128 |
| 2021 | JCIM | MEGAN | GNN | 9.8M | 87.0% | 90.1% | 51.7 | 5.15 | 2,951 | 295 |
| 2021 | JACS Au | LocalRetro | GNN | 8.6M | 91.5% | 97.3% | 62.1 | 6.20 | 2,313 | 231 |
| 2022 | Chem. Sci. | RSMILES | LM | 44.6M | 89.6% | 93.0% | 1,083 | 108.25 | 44,142 | 4,401 |
| 2022 | ML:ST | Chemformer | LM | 44.7M | 62.8% | 64.0% | 2,570 | 256.65 | 85,055 | 8,482 |
| 2024 | COLM | LlaSMol | LLM | ~7.2B | 5.0% | 5.0% | 1,385 | 138.35 | 39,119 | 3,905 |
| 2024 | ICLR | RetroBridge | Flow Matching | 4.6M | 44.9% | 52.8% | 4,040 | 403.45 | 157,820 | 15,740 |
| 2025 | Nat. Commun. | RSGPT | LLM | ~1.6B | 96.6% | 97.8% | 2,512 | 250.80 | 79,090 | 7,887 |
Dataset: USPTO-MIT · N = 40,029 reactions · Metric: Top-3 exact match · CO₂/job: per 500 reactions
| Year | Venue | Model | Architecture | Params | Top-1 | Top-3 | CO₂/exp (g) | CO₂/job (g) | Time/exp (s) | Time/job (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| 2017 | Chem. Eur. J. | neuralsym | MLP | 98.1M | 49.5% | 50.6% | 43.9 | 0.55 | 2,732 | 34 |
| 2019 | ACS Cent. Sci. | MolecularTransformer | LM | 11.7M | 86.8% | 91.7% | 360.0 | 4.50 | 12,317 | 154 |
| 2021 | JCIM | MEGAN | GNN | 9.9M | 80.1% | 86.4% | 85.3 | 1.07 | 6,657 | 83 |
| 2021 | JCIM | Graph2SMILES | LM | 18M | 88.5% | 89.9% | 287.8 | 3.60 | 7,940 | 99 |
| 2022 | Nat. Mach. Intell. | LocalTransform | GNN | 9.1M | 90.4% | 94.1% | 141.4 | 1.77 | 8,799 | 110 |
| 2022 | ML:ST | Chemformer | LM | 44.7M | 89.0% | 89.8% | 580.0 | 7.25 | 45,288 | 566 |
| 2022 | Chem. Sci. | RSMILES | LM | 44.6M | 89.4% | 94.7% | 614.7 | 7.68 | 46,209 | 578 |
| 2024 | COLM | LlaSMol | LLM | ~7.2B | 3.8% | 5.9% | 1,413.8 | 17.67 | 104,960 | 1,312 |
Dataset: ChEMBL 28 · N = 10,000 molecules · Metric: VUN% · CO₂/job: per 10K molecules (= full exp)
| Year | Venue | Model | Architecture | Params | VUN (%) | VUNS (%) | CO₂/exp (g) | CO₂/job (g) | Time/exp (s) | Time/job (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| 2017 | J. Cheminf. | REINVENT | LM | 4.2M | 87.90 | 80.88 | 0.18 | 0.18 | 14 | 14 |
| 2018 | ICML | JT-VAE | VAE | 5.3M | 91.39 | 89.41 | 10.58 | 10.58 | 662 | 662 |
| 2020 | ICML | HierVAE | VAE | 8.0M | 92.10 | 88.89 | 11.97 | 11.97 | 757 | 757 |
| 2021 | J. Chem. Inf. Model. | MolGPT | LM | 9.5M | 77.15 | 76.65 | 1.07 | 1.07 | 37 | 37 |
| 2023 | ICML | DiGress | Diffusion | 16.2M | 82.45 | 81.18 | 175.35 | 175.35 | 5,201 | 5,201 |
| 2024 | J. Cheminf. | REINVENT4 | LM | 5.8M | 94.16 | 85.44 | 0.07 | 0.07 | 8 | 8 |
| 2025 | ICML | DeFoG | Flow Matching | 16.3M | 82.27 | 81.73 | 355.24 | 355.24 | 9,874 | 9,874 |
| 2026 | Nat. Comput. Sci. | SmileyLlama | LLM | 8.0B | 94.30 | 85.16 | 21.79 | 21.79 | 638 | 638 |
Dataset: MP-20 · N = 1,000 structures · Metric: mSUN% · CO₂/job: per 1K structures (= full exp)
| Year | Venue | Model | Architecture | Params | mSUN (%) | SUN (%) | CO₂/exp (g) | CO₂/job (g) | Time/exp (s) | Time/job (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| 2022 | ICLR | CDVAE | Diffusion | 4.9M | 22.6 | 3.2 | 270.4 | 270.40 | 25,764 | 25,764 |
| 2023 | NeurIPS | DiffCSP | Diffusion | 12.4M | 29.0 | 4.3 | 12.7 | 12.60 | 381 | 381 |
| 2024 | Nat. Commun. | CrystaLLM | LM | 25.9M | 16.4 | 3.5 | 19.3 | 19.20 | 942 | 942 |
| 2024 | ICML | FlowMM | Flow Matching | 28.3M | 23.9 | 4.3 | 12.8 | 12.80 | 547 | 547 |
| 2025 | arXiv | ChargeDIFF | Diffusion | 59.5M | 33.5 | 4.4 | 133.5 | 133.50 | 2,994 | 2,994 |
| 2025 | Nature | MatterGen | Diffusion | 44.6M | 33.4 | 5.2 | 248.1 | 248.10 | 8,079 | 8,079 |
| 2025 | ICML | ADiT | Diffusion | 231.9M | 29.6 | 5.5 | 112.5 | 112.50 | 10,512 | 10,512 |
| 2025 | ICML | CrystalFlow | Flow Matching | 20.9M | 21.7 | 3.0 | 1.5 | 1.50 | 43 | 43 |
System: WBM · N = 100 structures · Metric: CPS · CO₂/job: per 1,000 structures
| Year | Venue | Model | Architecture | Params | CPS | CO₂/exp (g) | CO₂/job (g) | Time/exp (s) | Time/job (s) |
|---|---|---|---|---|---|---|---|---|---|
| 2023 | Nat. Mach. Intell. | CHGNet | GNN | 413K | 0.343 | 1.50 | 15.0 | 88 | 884 |
| 2023 | arXiv | MACE | GNN | 4.69M | 0.637 | 3.57 | 35.7 | 208 | 2,083 |
| 2024 | J. Chem. Theory Comput. | SevenNet | GNN | 1.17M | 0.714 | 2.87 | 28.7 | 160 | 1,600 |
| 2024 | arXiv | ORB | GNN | 25.2M | 0.470 | 0.58 | 5.8 | 37 | 374 |
| 2025 | arXiv | NequIP | GNN | 9.6M | 0.733 | 1.07 | 10.7 | 52 | 519 |
| 2025 | arXiv | DPA3 | GNN | 4.81M | 0.718 | 6.71 | 67.1 | 380 | 3,798 |
| 2025 | arXiv | Nequix | GNN | 708K | 0.729 | 2.21 | 22.1 | 126 | 1,258 |
| 2025 | arXiv | eSEN | GNN | 30.1M | 0.797 | 13.86 | 138.6 | 763 | 7,629 |
System: LGPS · N = 75K steps × 3 seeds · Metric: MSD score · CO₂/job: per 1M steps
| Year | Venue | Model | Architecture | Params | MSD | CO₂/exp (g) | CO₂/job (g) | Time/exp (s) | Time/job (s) |
|---|---|---|---|---|---|---|---|---|---|
| 2023 | Nat. Mach. Intell. | CHGNet | GNN | 413K | 0.047 | 9.47 | 379 | 602 | 8,033 |
| 2023 | arXiv | MACE | GNN | 4.69M | 0.095 | 23.29 | 932 | 1,068 | 14,241 |
| 2024 | J. Chem. Theory Comput. | SevenNet | GNN | 1.17M | 0.531 | 16.21 | 648 | 790 | 10,529 |
| 2024 | arXiv | ORB | GNN | 25.2M | 0.385 | 3.87 | 155 | 210 | 2,795 |
| 2025 | arXiv | NequIP | GNN | 9.6M | 0.361 | 11.34 | 454 | 316 | 4,219 |
| 2025 | arXiv | DPA3 | GNN | 4.81M | 0.508 | 38.45 | 1,538 | 2,087 | 27,829 |
| 2025 | arXiv | Nequix | GNN | 708K | 0.203 | 17.13 | 685 | 736 | 9,809 |
| 2025 | arXiv | eSEN | GNN | 30.1M | 0.720 | 87.14 | 3,486 | 2,780 | 37,071 |
System: CASP15/CASP16 unique <1000-residue monomers · N = 45 targets · Metric: GDT-TS (%) · CO₂/job: per 45 targets
| Year | Venue | Model | Architecture | Params | GDT-TS (%) | lDDT-Cα | CO₂/exp (g) | CO₂/job (g) | Time/exp (s) | Time/job (s) |
|---|---|---|---|---|---|---|---|---|---|---|
| 2021 | Nature | AF2 | Evoformer + MSA | 93.2M | 59.15 | 0.868 | 46.73 | 2,103.0 | 1,729.6 | 77,832 |
| 2022 | Nat. Methods | ColabFold | Evoformer + MMseqs2 | 93.2M | 60.96 | 0.876 | 11.60 | 522.2 | 669.5 | 30,126 |
| 2022 | bioRxiv | OmegaFold | PLM + Geoformer | 795M | 47.18 | 0.770 | 4.00 | 180.1 | 123.0 | 5,535 |
| 2023 | Science | ESMFold | ESM-2 LM + folding | 693M | 52.36 | 0.811 | 5.27 | 237.0 | 363.2 | 16,345 |
| 2024 | bioRxiv | Chai-1 | Diffusion (AF3-like) | 316M | 48.79 | 0.798 | 19.83 | 892.4 | 1,416.4 | 63,738 |
| 2024 | Nat. Methods | OpenFold | Evoformer + MSA | 93.2M | 60.84 | 0.875 | 10.61 | 477.6 | 596.8 | 26,854 |
| 2025 | bioRxiv | Boltz-2 | Diffusion (AF3-like) | 521M | 51.82 | 0.765 | 17.70 | 796.5 | 1,180.8 | 53,137 |
| 2025 | bioRxiv | Protenix | Diffusion (AF3-like) | 368M | 57.50 | 0.871 | 9.83 | 442.3 | 612.3 | 27,555 |
GDT-TS (%) is the primary ranking metric; lDDT-Cα is the secondary metric. For this task, per-exp values are per single target and per-job values are totals over all 45 targets. Carbon is estimated with CodeCarbon in offline mode using the world-average carbon intensity (475 gCO₂/kWh) on 3 × NVIDIA RTX A5000 (24 GB). Boltz-2's training cutoff (2023-06-01) postdates the 33 CASP15 targets, so its results carry a potential data-leakage caveat.
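As a rough check on how the per-target and per-job columns relate in this table, the arithmetic can be sketched as below. This is only an illustration using the AF2 row and the 475 gCO₂/kWh intensity quoted in the note; the function names are hypothetical, and the implied energy figure is derived here, not reported by the benchmark.

```python
WORLD_AVG_G_PER_KWH = 475.0  # carbon intensity quoted in the note above
N_TARGETS = 45               # number of CASP targets in this task


def total_over_targets(per_target: float, n_targets: int = N_TARGETS) -> float:
    """Sum a per-target cost over the whole target set."""
    return per_target * n_targets


def implied_energy_kwh(co2_g: float, intensity: float = WORLD_AVG_G_PER_KWH) -> float:
    """Back out the energy that a CO2 figure implies at a fixed carbon intensity."""
    return co2_g / intensity


af2_per_target_g = 46.73                             # AF2 row, CO2/exp
af2_total_g = total_over_targets(af2_per_target_g)   # close to the 2,103.0 g CO2/job entry
af2_energy_kwh = implied_energy_kwh(af2_total_g)

print(f"{af2_total_g:.0f} g, {af2_energy_kwh:.2f} kWh")  # 2103 g, 4.43 kWh
```

Because the accounting uses a fixed world-average intensity, the CO₂ columns are proportional to measured energy and do not reflect the grid mix at any particular site.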
- Simpler models dominate the efficiency frontier: In every task, at least one GNN or small LM achieves near-SOTA accuracy at 10–100× lower CO₂ than the best-performing model
- Architecture drives cost more than parameter count: Diffusion models cost 10–100× more per sample than LM or GNN models due to iterative sampling, regardless of size
- LLMs underperform on narrow scientific tasks: LlaSMol (7B) scores 3.8% top-1 on Forward prediction vs. 89.4% for RSMILES (45M) — at 2× the carbon cost
- MLIP tasks are carbon-intensive by nature: A single 75K-step MD run costs 4–87 g CO₂; relaxing 100 WBM structures costs 0.6–14 g — both orders of magnitude more than per-molecule chemistry tasks
- 50–75% of models per task are Pareto-dominated: for each of these models, a cheaper and more accurate alternative exists, so the extra carbon was wasted
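The Pareto-dominance finding can be reproduced from the tables with a few lines of code. The sketch below uses the Top-50 and CO₂/exp columns of the USPTO-50K table and the standard dominance definition (no worse on both axes, strictly better on at least one), which is an assumption about how the benchmark defines it; the `Result` type and function names are hypothetical.

```python
from typing import NamedTuple


class Result(NamedTuple):
    model: str
    accuracy: float  # Top-50 (%), higher is better
    co2_g: float     # CO2/exp (g), lower is better


# Values copied from the USPTO-50K table
RETRO = [
    Result("neuralsym", 74.8, 35.0),
    Result("MEGAN", 90.1, 51.7),
    Result("LocalRetro", 97.3, 62.1),
    Result("RSMILES", 93.0, 1083.0),
    Result("Chemformer", 64.0, 2570.0),
    Result("LlaSMol", 5.0, 1385.0),
    Result("RetroBridge", 52.8, 4040.0),
    Result("RSGPT", 97.8, 2512.0),
]


def dominates(a: Result, b: Result) -> bool:
    """True if a is no worse than b on both axes and strictly better on one."""
    no_worse = a.accuracy >= b.accuracy and a.co2_g <= b.co2_g
    strictly_better = a.accuracy > b.accuracy or a.co2_g < b.co2_g
    return no_worse and strictly_better


def pareto_dominated(results: list[Result]) -> list[str]:
    """Names of models that some other model dominates, in input order."""
    return [r.model for r in results if any(dominates(o, r) for o in results)]


dominated = pareto_dominated(RETRO)
print(dominated)  # ['RSMILES', 'Chemformer', 'LlaSMol', 'RetroBridge']
print(f"{len(dominated) / len(RETRO):.0%} dominated")  # 50% dominated
```

On this table, four of eight models are dominated, which sits at the lower end of the 50–75% range stated above. The remaining four (neuralsym, MEGAN, LocalRetro, RSGPT) form the accuracy-versus-carbon frontier.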
See CONTRIBUTING.md for the full guide on adding new models and tasks.
Quick start:
```bash
git clone https://github.com/shuan4638/Carbon4Science.git
cd Carbon4Science
git checkout main && git pull
git checkout -b <your-name>/<task>-<model>
cp -r Example/ <YourTask>/   # copy the template
claude                       # Claude Code guides you through the rest
```

Your PR to main only needs: `results/<Task>/<model>_<N>.json` + a new row in this README.
```bibtex
@article{carbon2026,
  title={The Carbon Cost of AI for Science},
  author={...},
  journal={...},
  year={2026}
}
```

MIT; see LICENSE for details.