VespaG is a blazingly fast single amino acid variant effect predictor, leveraging embeddings of the protein language model ESM-2 (Lin et al. 2022) as input to a minimal deep learning model.
To overcome the sparsity of experimental training data, we created a dataset of 39 million single amino acid variants from a subset of the human proteome, which we then annotated using predictions from the multiple sequence alignment-based effect predictor GEMME (Laine et al. 2019) as a proxy for experimental scores.
Assessed on the ProteinGym (Notin et al. 2023) benchmark, VespaG matches state-of-the-art methods while being several orders of magnitude faster, predicting the entire single-site mutational landscape of a human proteome in under half an hour on a consumer-grade laptop.
More details on VespaG can be found in the corresponding publication.
- Create a virtual environment.
- `git clone https://github.com/jschlensok/vespag.git`
- `pip install .` or `uv pip install .`
Run `vespag predict` with the following options:
Required:
- `--input/-i`: Path to FASTA-formatted file containing protein sequence(s).

Optional:
- `--output/-o`: Path for saving created CSV and/or H5 files. Defaults to `./output`.
- `--embeddings/-e`: Path to pre-computed ESM2 (`esm2_t36_3B_UR50D`) input embeddings. Embeddings will be generated from scratch if no path is provided and saved in `./output`. Please note that embedding generation on CPU can be slow.
- `--mutation-file`: CSV file specifying specific mutations to score. If not provided, the whole single-site mutational landscape of all input proteins will be scored.
- `--id-map`: CSV file mapping embedding IDs (first column) to FASTA IDs (second column) if they're different. Does not have to cover cases with identical IDs.
- `--single-csv`: Whether to return one CSV file for all proteins instead of a single file for each protein.
- `--no-csv`: Whether no CSV output should be produced.
- `--h5-output`: Whether a file containing predictions in HDF5 format should be created.
- `--zero-idx`: Whether to enumerate protein sequences (both in- and output) starting at 0.
- `--transform`: Whether to transform predicted scores to the same distribution as GEMME substitution scores, which fall into a narrower range than VespaG scores, to ease comparability.
- `--normalize`: Whether to transform predicted scores to the [0, 1] interval by applying a sigmoid.
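To make "single-site mutational landscape", `--zero-idx`, and `--normalize` more concrete, here is a minimal Python sketch. It is not VespaG's implementation: the `<wild type><position><mutant>` label format, the function names, and the use of the standard logistic function for the sigmoid are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def single_site_landscape(sequence, zero_idx=False):
    """List every single amino acid substitution of `sequence`.

    Labels use a hypothetical <wild type><position><mutant> format;
    positions start at 1 unless `zero_idx` is set.
    """
    offset = 0 if zero_idx else 1
    return [
        f"{wild_type}{position + offset}{mutant}"
        for position, wild_type in enumerate(sequence)
        for mutant in AMINO_ACIDS
        if mutant != wild_type
    ]


def sigmoid(score):
    """Squash a real-valued score into the [0, 1] interval (logistic function)."""
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    # Rearranged form for negative inputs to avoid overflow in exp().
    z = math.exp(score)
    return z / (1.0 + z)


variants = single_site_landscape("MKT")
print(len(variants))  # 3 positions x 19 substitutions = 57
print(variants[0], single_site_landscape("MKT", zero_idx=True)[0])
print(sigmoid(0.0))
```

A protein of length L therefore has 19 × L single-site variants, which is why scoring only the mutations listed via `--mutation-file` can be much cheaper than scoring the full landscape.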
After installing the dependencies above and cloning the VespaG repo, you can try out the following examples:
- Run VespaG without precomputed embeddings for the example FASTA file with 3 sequences in `data/example/example.fasta`: `vespag predict -i data/example/example.fasta`. This will save a CSV file for each sequence in the folder `./output`.
- Run VespaG with precomputed embeddings for the example FASTA file with 3 sequences in `data/example/example.fasta`: `vespag predict -i data/example/example.fasta -e output/esm2_embeddings.h5 --single-csv`. This will save a single CSV file for all sequences in the folder `./output`.
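If you would rather drive these examples from a Python script, the sketch below assembles the same command line from the options documented above. The helper `build_predict_command` is hypothetical and not part of VespaG; it only builds the argument list and leaves execution to you.

```python
import shlex


def build_predict_command(fasta, embeddings=None, output=None, single_csv=False):
    """Assemble a `vespag predict` call from options listed in this README."""
    command = ["vespag", "predict", "-i", str(fasta)]
    if embeddings is not None:
        command += ["-e", str(embeddings)]
    if output is not None:
        command += ["-o", str(output)]
    if single_csv:
        command.append("--single-csv")
    return command


command = build_predict_command(
    "data/example/example.fasta",
    embeddings="output/esm2_embeddings.h5",
    single_csv=True,
)
print(shlex.join(command))

# To actually run it (requires VespaG to be installed):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when paths contain spaces.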
VespaG uses DVC for pipeline orchestration and WandB for experiment tracking.
Using WandB is optional; a username and project for WandB can be specified in `params.yaml`.
Using DVC is mandatory. The `dvc.yaml` file contains stages for generating pLM embeddings from FASTA files, but you can also download pre-computed embeddings and GEMME scores from our Zenodo repository. Adjust the paths in `params.yaml` to your setup, and feel free to experiment with the model parameters. To start a training run, use `dvc repro -s train@<model_type>-{esm2|prott5}-<dataset>`, where `<model_type>` and `<dataset>` each correspond to a named block in `params.yaml`.
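The stage name passed to `dvc repro` follows a fixed pattern, which the following sketch spells out. The helper `train_target` and the placeholder names `my_model` and `my_dataset` are hypothetical; substitute the block names actually defined in your `params.yaml`.

```python
SUPPORTED_PLMS = {"esm2", "prott5"}  # the two embedding types named in the stage pattern


def train_target(model_type, plm, dataset):
    """Build the DVC stage name train@<model_type>-<plm>-<dataset>."""
    if plm not in SUPPORTED_PLMS:
        raise ValueError(f"plm must be one of {sorted(SUPPORTED_PLMS)}, got {plm!r}")
    return f"train@{model_type}-{plm}-{dataset}"


# Placeholder block names, for illustration only.
target = train_target("my_model", "esm2", "my_dataset")
print(f"dvc repro -s {target}")
```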
You can reproduce our evaluation using the `eval` subcommand, which pre-processes data into a format usable by VespaG, runs `predict`, and computes performance metrics.
The evaluation is based on the ProteinGym (Notin et al. 2023) DMS substitutions benchmark, which we refer to as ProteinGym217. Run it with `vespag eval proteingym`, with the following options:
Optional:
- `--reference-file`: Path to ProteinGym reference file. Will download to `data/test/proteingym217/reference.csv` or `data/test/proteingym87/reference.csv` if not provided.
- `--dms-directory`: Path to directory containing per-DMS score files in CSV format. Will download to `data/test/proteingym217/raw_dms_files/` or `data/test/proteingym87/raw_dms_files/` if not provided.
- `--output/-o`: Path for saving created CSV with scores for all assays and variants as well as a CSV with Spearman correlation coefficients for each DMS. Defaults to `./output/proteingym217` or `./output/proteingym87`.
- `--embeddings/-e`, `--id-map`, `--normalize-scores`: identical to `predict`, used for the internal call to it.
- `--v1`: Use this flag if you want to get a result for the first iteration of ProteinGym with 87 assays.
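For readers unfamiliar with the reported metric, the sketch below computes a Spearman correlation coefficient between experimental and predicted scores for one assay. It is a dependency-free illustration on made-up numbers, not the code VespaG's `eval` subcommand runs; in practice a library routine such as `scipy.stats.spearmanr` would be the usual choice.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the window over all entries tied with the one at `start`.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared_rank = (start + end) / 2 + 1
        for i in order[start:end + 1]:
            ranks[i] = shared_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equally long, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up scores for four variants of one assay.
experimental = [0.1, 0.4, 0.35, 0.8]
predicted = [-3.0, -1.0, -1.5, -0.2]
print(spearman(experimental, predicted))
```

Because Spearman correlation depends only on rank order, monotonic rescalings of the predictions, such as a sigmoid, leave the coefficient unchanged.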
@article{10.1093/bioinformatics/btae621,
author = {Marquet, Céline and Schlensok, Julius and Abakarova, Marina and Rost, Burkhard and Laine, Elodie},
title = {Expert-guided protein language models enable accurate and blazingly fast fitness prediction},
journal = {Bioinformatics},
volume = {40},
number = {11},
pages = {btae621},
year = {2024},
month = {11},
issn = {1367-4811},
doi = {10.1093/bioinformatics/btae621},
url = {https://doi.org/10.1093/bioinformatics/btae621},
eprint = {https://academic.oup.com/bioinformatics/article-pdf/40/11/btae621/60811415/btae621.pdf},
}
