VespaG: Expert-Guided Protein Language Models enable Accurate and Blazingly Fast Fitness Prediction

VespaG is a blazingly fast single amino acid variant effect predictor, leveraging embeddings of the protein language model ESM-2 (Lin et al. 2022) as input to a minimal deep learning model.

To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from a subset of the Human proteome, which we then annotated using predictions from the multiple sequence alignment-based effect predictor GEMME (Laine et al. 2019) as a proxy for experimental scores.

Assessed on the ProteinGym (Notin et al. 2023) benchmark, VespaG matches state-of-the-art methods while being several orders of magnitude faster, predicting the entire single-site mutational landscape for a human proteome in under a half hour on a consumer-grade laptop.

More details on VespaG can be found in the corresponding publication

Installation

create virtual environment
git clone https://github.com/jschlensok/vespag.git
pip install . or uv pip install .

Quick Start: Running Inference with VespaG

Run vespag predict with the following options: Required:

--input/-i: Path to FASTA-formatted file containing protein sequence(s). Optional:
--output/-o:Path for saving created CSV and/or H5 files. Defaults to ./output.
--embeddings/-e: Path to pre-computed ESM2 (esm2_t36_3B_UR50D) input embeddings. Embeddings will be generated from scratch if no path is provided and saved in ./output. Please note that embedding generation on CPU can be slow.
--mutation-file: CSV file specifying specific mutations to score. If not provided, the whole single-site mutational landscape of all input proteins will be scored.
--id-map: CSV file mapping embedding IDs (first column) to FASTA IDs (second column) if they're different. Does not have to cover cases with identical IDs.
--single-csv: Whether to return one CSV file for all proteins instead of a single file for each protein.
--no-csv: Whether no CSV output should be produced.
--h5-output: Whether a file containing predictions in HDF5 format should be created.
--zero-idx: Whether to enumerate protein sequences (both in- and output) starting at 0.
--transform: Whether to transform predicted scores to the same distribution as GEMME substitution scores, which fall into a narrower range than VespaG scores, to ease comparability
--normalize: Whether to transform predicted scores to the [0, 1] interval by applying a sigmoid

Examples

After installing the dependencies above and cloning the VespaG repo, you can try out the following examples:

Run VespaG without precomputed embeddings for the example fasta file with 3 sequences in data/example/example.fasta:
- vespag predict -i data/example/example.fasta. This will save a CSV file for each sequence in the folder ./output
Run VespaG with precomputed embeddings for the example fasta file with 3 sequences in data/example/example.fasta:
- vespag predict -i data/example/example.fasta -e output/esm2_embeddings.h5 --single-csv. This will save a single CSV file for all sequences in the folder ./output

Re-training VespaG

VespaG uses DVC for pipeline orchestration and WandB for experiment tracking.

Using WandB is optional; a username and project for WandB can be specified in params.yaml.

Using DVC is non-optional. There is a dvc.yaml file in place that contains stages for generating pLM embeddings from FASTA files, but you can also download pre-computed embeddings and GEMME scores from our Zenodo repository. Adjust paths in params.yaml to your context, and feel free to play around with model parameters. You can simply run a training run using dvc repro -s train@<model_type>-{esm2|prott5}-<dataset>, with <model_type> and <dataset> each corresponding to a named block in params.yaml.

Evaluation

You can reproduce our evaluation using the eval subcommand, which pre-processes data into a format usable by VespaG, runs predict, and computes performance metrics.

ProteinGym217

Based on the ProteinGym (Notin et al. 2023) DMS substitutions benchmark, dubbed ProteinGym217 by us. Run it with vespag eval proteingym, with the following options: Optional:

--reference-file: Path to ProteinGym reference file. Will download to data/test/proteingym217/reference.csv or data/test/proteingym87/reference.csv if not provided.
--dms-directory: Path to directory containing per-DMS score files in CSV format. Will download to data/test/proteingym217/raw_dms_files/ or data/test/proteingym87/raw_dms_files/ if not provided.
--output/-o:Path for saving created CSV with scores for all assays and variants as well as a CSV with Spearman correlation coefficients for each DMS. Defaults to ./output/proteingym217 or ./output/proteingym87.
--embeddings/-e, --id-map, --normalize-scores: identical to predict, used for the internal call to it.
--v1 if you want to get a result for the first iteration of ProteinGym with 87 assays.

How to cite

@article{10.1093/bioinformatics/btae621,
    author = {Marquet, Céline and Schlensok, Julius and Abakarova, Marina and Rost, Burkhard and Laine, Elodie},
    title = {Expert-guided protein language models enable accurate and blazingly fast fitness prediction},
    journal = {Bioinformatics},
    volume = {40},
    number = {11},
    pages = {btae621},
    year = {2024},
    month = {11},
    issn = {1367-4811},
    doi = {10.1093/bioinformatics/btae621},
    url = {https://doi.org/10.1093/bioinformatics/btae621},
    eprint = {https://academic.oup.com/bioinformatics/article-pdf/40/11/btae621/60811415/btae621.pdf},
}

Name		Name	Last commit message	Last commit date
Latest commit History 114 Commits
.github/workflows		.github/workflows
data		data
images		images
model_weights		model_weights
tests		tests
vespag		vespag
.dvcignore		.dvcignore
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
.safety-policy.yml		.safety-policy.yml
LICENSE		LICENSE
README.md		README.md
dvc.yaml		dvc.yaml
hubconf.py		hubconf.py
noxfile.py		noxfile.py
params.yaml		params.yaml
pyproject.toml		pyproject.toml
requirements-dev.txt		requirements-dev.txt
requirements.txt		requirements.txt
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

VespaG: Expert-Guided Protein Language Models enable Accurate and Blazingly Fast Fitness Prediction

Installation

Quick Start: Running Inference with VespaG

Examples

Re-training VespaG

Evaluation

ProteinGym217

How to cite

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

VespaG: Expert-Guided Protein Language Models enable Accurate and Blazingly Fast Fitness Prediction

Installation

Quick Start: Running Inference with VespaG

Examples

Re-training VespaG

Evaluation

ProteinGym217

How to cite

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages