ViroGym: Realistic Large-Scale Benchmarks for Evaluating Viral Proteins

Overview

ViroGym is a comprehensive benchmark dataset for evaluating protein language models (pLMs) on viral protein sequences. It covers three core tasks: mutational effect prediction, antigenic diversity prediction, and pandemic forecasting — bridging controlled laboratory experiments and real-world viral evolution.

The dataset comprises:

79 deep mutational scanning (DMS) assays across 13 virus types, totalling 552,065 mutated amino acid sequences and 7 distinct phenotypic readouts
21 influenza A neutralisation assays for antigenic diversity prediction
GISAID-derived SARS-CoV-2 mutation frequency data spanning January 2020 to December 2025, for real-world pandemic prediction
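
Zero-shot pLM scores on DMS benchmarks of this kind are commonly compared with measured phenotypes by Spearman rank correlation, the primary metric in ProteinGym. This README does not state which metric ViroGym reports, so the following is only a minimal sketch of that common choice. The function names are hypothetical and are not part of the repository.

```python
from math import sqrt
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def spearman(measured: Sequence[float], predicted: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation computed on average ranks."""
    if len(measured) != len(predicted) or len(measured) < 2:
        raise ValueError("need two equally long sequences with at least 2 items")
    rank_m, rank_p = average_ranks(measured), average_ranks(predicted)
    mean_m = sum(rank_m) / len(rank_m)
    mean_p = sum(rank_p) / len(rank_p)
    cov = sum((a - mean_m) * (b - mean_p) for a, b in zip(rank_m, rank_p))
    var_m = sum((a - mean_m) ** 2 for a in rank_m)
    var_p = sum((b - mean_p) ** 2 for b in rank_p)
    if var_m == 0 or var_p == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_m * var_p)
```

Because the comparison uses ranks only, model scores and assay readouts do not need to share a scale, which matters when a benchmark mixes several phenotypic readouts.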

Motivation

Vaccines for rapidly mutating respiratory viruses (influenza, SARS-CoV-2, and others) are selected through a semi-annual WHO process that has remained largely unchanged for over 40 years. Seasonal influenza vaccine effectiveness has ranged from only 19% to 60% since 2009, and SARS-CoV-2 vaccine effectiveness dropped from 50.6% to 13.6% within weeks as new variants emerged. Manufacturers must commit to production months before the dominant strains are known.

pLMs trained on large protein sequence databases offer a potential path forward: they can estimate mutational fitness and antigenic novelty in zero-shot settings, without requiring prior experimental data on the virus at hand. However, systematic benchmarks for viral proteins have been lacking. Most existing benchmarks either focus on non-viral proteins (e.g. ProteinGym, where only 24 of 217 assays are viral) or cover a limited subset of viruses (e.g. EVEREST's 45 assays). ViroGym addresses this gap by providing the broadest and most clinically relevant viral pLM benchmark to date.

Inspiration

ViroGym builds on several lines of work:

ProteinGym established the blueprint for large-scale DMS-based pLM benchmarking and defines the data schema we follow. Of our 79 DMS assays, 23 overlap with ProteinGym; the remaining 56 are newly curated.
EVEREST demonstrated that current pLMs fail on a majority of viral DMS tasks, motivating the need for broader coverage and more phenotype diversity.
Dadonaite et al. (2024) showed that SARS-CoV-2 Spike DMS fitness scores correlate with subsequent real-world clade success, directly inspiring the pandemic prediction task.
Kikawa et al. (2025) demonstrated a similar DMS-to-evolution link for influenza, motivating the neutralisation benchmark.

A key insight motivating ViroGym is that DMS and pLM predictions are complementary: DMS assays measure fitness under a single controlled condition, while pLMs implicitly integrate broader evolutionary pressures from training on millions of sequences. Combining both signals — as this benchmark enables — may yield more robust predictions than either approach alone.
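
One simple way to combine the two signals is to rank-normalise each score over the mutants they share and take a weighted average. The sketch below illustrates that idea only. It is not the method used in the accompanying paper, the function names are hypothetical, and it assumes that a higher score means a fitter mutant for both inputs.

```python
from typing import Dict, List


def fractional_ranks(scores: Dict[str, float], keys: List[str]) -> Dict[str, float]:
    """Rank the given keys by score and rescale to [0, 1]; ties share their mean rank."""
    ordered = sorted(keys, key=lambda k: scores[k])
    n = len(ordered)
    ranks: Dict[str, float] = {}
    start = 0
    while start < n:
        end = start
        while end + 1 < n and scores[ordered[end + 1]] == scores[ordered[start]]:
            end += 1
        mean_position = (start + end) / 2
        for key in ordered[start:end + 1]:
            ranks[key] = mean_position / (n - 1) if n > 1 else 0.5
        start = end + 1
    return ranks


def combine_signals(
    dms: Dict[str, float], plm: Dict[str, float], dms_weight: float = 0.5
) -> Dict[str, float]:
    """Weighted average of rank-normalised DMS and pLM scores on shared mutants."""
    if not 0.0 <= dms_weight <= 1.0:
        raise ValueError("dms_weight must lie in [0, 1]")
    shared = sorted(set(dms) & set(plm))  # mutants scored by only one source are dropped
    dms_ranks = fractional_ranks(dms, shared)
    plm_ranks = fractional_ranks(plm, shared)
    return {
        m: dms_weight * dms_ranks[m] + (1.0 - dms_weight) * plm_ranks[m]
        for m in shared
    }


# Made-up values for illustration only; these are not ViroGym data.
example_dms = {"N501Y": 1.2, "E484K": 0.3, "D614G": 2.0}
example_plm = {"N501Y": -0.5, "E484K": -3.0, "D614G": -1.0, "K417N": -2.0}
combined = combine_signals(example_dms, example_plm)
```

Rank normalisation sidesteps the fact that assay readouts and model log-likelihoods live on different scales. The weight is a free parameter that would need to be tuned on held-out data.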

Repository Structure

viroGym/
├── data/
│   ├── DMS/
│   │   ├── benchmark.csv          # master index of all 79 DMS tasks
│   │   ├── cleaned_benchmark/     # processed DMS assays (one CSV per task)
│   │   ├── raw_csv/               # raw inputs by virus (not tracked by git; see data/DMS/benchmark.csv)
│   │   └── process/               # scripts to regenerate cleaned_benchmark from raw_csv
│   ├── GISAID/
│   │   ├── raw/                   # aggregated Spike mutation counts by year (not tracked by git)
│   │   ├── cleaned_benchmark/     # processed mutation frequency tables
│   │   └── process/               # script to regenerate cleaned_benchmark from raw
│   └── neutralization/
│       ├── benchmark.csv          # master index of 21 neutralisation tasks
│       ├── cleaned_benchmark/     # processed neutralisation assays (one CSV per task)
│       ├── raw_csv/               # raw titer and sequence inputs (not tracked by git)
│       └── process/               # scripts to regenerate cleaned_benchmark from raw_csv
├── analysis/                      # scripts to compute tables and figures from outputs/
└── outputs/                       # model predictions (not tracked by git; see Baselines)
    ├── dms_model_outputs/
    ├── neutralization_model_outputs/
    └── GISAID_outputs/
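
Given this layout, a task index and its per-task files can be read with the standard library alone. The column names of benchmark.csv are not documented in this README, so the sketch below makes no assumption about them and simply reports whatever header it finds. The helper names are hypothetical.

```python
import csv
from pathlib import Path
from typing import Dict, List


def load_index(index_path: Path) -> List[Dict[str, str]]:
    """Read a benchmark index CSV into a list of row dictionaries."""
    with index_path.open(newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))


def list_task_files(task_dir: Path) -> List[Path]:
    """Return the per-task CSV files in a cleaned_benchmark directory, sorted by name."""
    return sorted(task_dir.glob("*.csv"))


if __name__ == "__main__":
    dms_root = Path("data/DMS")  # path taken from the tree above
    index_path = dms_root / "benchmark.csv"
    if index_path.exists():
        rows = load_index(index_path)
        columns = list(rows[0].keys()) if rows else []
        print(f"{len(rows)} indexed tasks; columns: {columns}")
        print(f"{len(list_task_files(dms_root / 'cleaned_benchmark'))} task files")
    else:
        print(f"{index_path} not found; run this from the repository root")
```

The same two helpers apply unchanged to data/neutralization/, which follows the same benchmark.csv plus cleaned_benchmark/ convention.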

Data Sources

DMS assays were curated from published literature following the ProteinGym guidelines. Sources include assays for SARS-CoV-2, Influenza A (H1N1, H2N1, H3N2, H5N1), HIV, Hepatitis B, Hepatitis C, Zika, Rabies, Nipah, Lassa, Dengue, Coxsackievirus B3, and Adeno-associated virus. Full per-assay citations are provided in the accompanying paper.

Neutralisation assays are derived from two high-throughput sequence-based influenza studies (Kikawa et al. 2025; Loes et al. 2024), covering H1N1 and H3N2 strains with both ferret and human sera.

SARS-CoV-2 mutation frequency data are derived from the GISAID (Global Initiative on Sharing All Influenza Data) database.

Usage

All scripts are designed to be run from the repository root:

# Regenerate processed data from raw inputs
python data/DMS/process/__main__.py
python data/GISAID/process/reconstruct_gisaid_processed.py
python data/neutralization/process/export_neutralization_tasks.py
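
The three commands above can also be driven from one small Python script that checks the script paths first and stops at the first failure. This is a convenience sketch, not something shipped with the repository: the helper names are hypothetical, and it assumes the raw inputs (which are not tracked by git) are already in place.

```python
import subprocess
import sys
from pathlib import Path
from typing import List, Sequence

# Script paths exactly as listed in the commands above.
PROCESS_SCRIPTS = (
    "data/DMS/process/__main__.py",
    "data/GISAID/process/reconstruct_gisaid_processed.py",
    "data/neutralization/process/export_neutralization_tasks.py",
)


def missing_scripts(root: Path, scripts: Sequence[str] = PROCESS_SCRIPTS) -> List[str]:
    """Return the scripts that do not exist under the given repository root."""
    return [script for script in scripts if not (root / script).is_file()]


def run_all(root: Path, scripts: Sequence[str] = PROCESS_SCRIPTS) -> None:
    """Run each processing script from the repository root, stopping on the first error."""
    missing = missing_scripts(root, scripts)
    if missing:
        raise FileNotFoundError(f"not found under {root}: {', '.join(missing)}")
    for script in scripts:
        # check=True raises CalledProcessError if a script exits with a non-zero status.
        subprocess.run([sys.executable, script], cwd=root, check=True)


if __name__ == "__main__":
    try:
        run_all(Path.cwd())
    except FileNotFoundError as error:
        print(error)
```

Using sys.executable keeps the child processes on the same interpreter and environment as the driver.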
