S-PLM: Structure-aware Protein Language Model via Contrastive Learning between Sequence and Structure
This is the official implementation of the S-PLM paper (S-PLM V1). S-PLM is a 3D structure-aware protein language model (PLM) that enables sequence-based embeddings to carry structural information through multi-view contrastive learning.
This repository offers comprehensive guidance on utilizing pre-trained S-PLM models to generate structure-aware protein representations. Additionally, it provides a library of code for implementing lightweight tuning methods tailored for various downstream supervised learning tasks involving proteins.
The tasks include Enzyme Commission number (EC) prediction, Gene Ontology (GO) prediction, protein fold (fold) and enzyme reaction (ER) classification, and protein secondary structure (SS) prediction.
The lightweight tuning methods include fine-tuning the top layers, Adapter Tuning, and Low-rank adaptation (LoRA). Users can train a task-specific model with any of these tuning methods by modifying the configuration files provided in the configs directory.
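As background on the last option, LoRA keeps the pretrained weights frozen and trains only a small low-rank update alongside them. The PyTorch sketch below is a generic, minimal illustration of that idea, not the implementation or hyperparameters used in this repository; the rank `r` and scaling `alpha` shown are placeholder values.

```python
# Minimal, generic LoRA sketch in PyTorch (illustrative only; not this repository's implementation).
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable low-rank update: y = Wx + b + (alpha / r) * B(A x)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # pretrained weights stay frozen
        self.lora_A = nn.Linear(base.in_features, r, bias=False)   # down-projection
        self.lora_B = nn.Linear(r, base.out_features, bias=False)  # up-projection
        nn.init.zeros_(self.lora_B.weight)   # update starts at zero, so training begins from the base model
        self.scaling = alpha / r

    def forward(self, x):
        return self.base(x) + self.scaling * self.lora_B(self.lora_A(x))

# Only the LoRA parameters remain trainable:
layer = LoRALinear(nn.Linear(1280, 1280))
trainable = [p for p in layer.parameters() if p.requires_grad]
```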
To use an updated residue-level pre-trained model, refer to the model explanation.
We now provide a new GVP-based structure encoder. See SPLM-V2-GVP.
To use the S-PLM project, create an environment from the provided environment.yaml file, or follow install.sh to install the dependencies.
Using environment.yaml
- Create a new environment using the environment.yaml file: conda env create -f environment.yaml
- Activate the environment you have just created: conda activate splm
Alternatively, create a conda environment and use install.sh to install the required packages inside it. First, make the install.sh file executable by running the following command:
chmod +x install.sh
Then, run the following command to install the required packages inside the conda environment:
bash install.sh
Colab quickstart available: Use our minimal S-PLM v1 Colab example to set up dependencies, load the checkpoint, extract embeddings, and launch downstream training.
To use Accelerate features in your training code, such as distributed multi-GPU training, first set up the Accelerate configuration by running accelerate config on the command line.
Then, set the training settings and hyperparameters in the config file for your target task, configs/config_{task}.yaml.
Finally, start training a downstream task with a config file from configs and a pretrained S-PLM model by running
accelerate launch train_{task}.py --config_path configs/<config_name> --resume_path model/checkpoint_0520000.pth
accelerate launch train_go.py --config_path configs/bp_config_adapterH_adapterH.yaml --resume_path model/checkpoint_0520000.pth
accelerate launch train_go.py --config_path configs/cc_config_adapterH_adapterH.yaml --resume_path model/checkpoint_0520000.pth
accelerate launch train_go.py --config_path configs/mf_config_adapterH_adapterH.yaml --resume_path model/checkpoint_0520000.pth
accelerate launch train_fold.py --config_path configs/fold_config_adapterH_finetune.yaml --resume_path model/checkpoint_0520000.pth
accelerate launch train_ss.py --config_path configs/ss_config_adapterH_finetune.yaml --resume_path model/checkpoint_0520000.pth
If you just want to debug your script on a single GPU, you do not need Accelerate: after setting up the config.yaml file, simply run python train_{task}.py with the same arguments (for example, python train_ss.py --config_path configs/ss_config_adapterH_finetune.yaml --resume_path model/checkpoint_0520000.pth). Note that Accelerate supports both single-GPU and distributed training, so you can still use it for your final training runs.
There are two related scripts for generating protein sequence embeddings from a pre-trained S-PLM:
- extract_sequence_representation.py: intended for small-scale modifications or debugging, allowing you to quickly run embedding generation for a few proteins directly within the script.
- cli_seq_embed.py: designed for batch processing of protein sequences in a FASTA file. It reads multiple sequences and writes the embeddings to a pickle file. It supports both protein-level and residue-level representations.

Two pre-trained models can be used; refer to S-PLM model to download the pretrained weights for S-PLM1 and S-PLM2. To use S-PLM1, set the config file to ./configs/SPLM1_representation_config.yaml; to use S-PLM2, set it to ./configs/SPLM2_representation_config.yaml.
To generate embeddings, run:
python generate_seq_embedding.py --input_seq ./test.fasta --config_path /path/to/configfile.yaml --checkpoint_path /path/to/checkpoint.pth --result_path ./out

This produces a pickle file such as protein_embeddings.pkl containing a dictionary that maps protein IDs to NumPy embedding arrays.
| Argument | Description |
|---|---|
| `--input_seq` | Path to the input FASTA file containing protein sequences. |
| `--config_path` | Path to the model configuration YAML file. |
| `--checkpoint_path` | Optional path to the pretrained model checkpoint. |
| `--result_path` | Output directory for saving embeddings. |
| `--out_file` | Output file name; defaults to `protein_embeddings.pkl`. |
| `--residue_level` | Output residue-level embeddings (one vector per amino acid). |
| `--truncate_inference` | Enable (`1`) or disable (`0`) sequence truncation. |
| `--max_length_inference` | Maximum sequence length when truncation is enabled. |
| `--afterproject` | Output post-projection embeddings, if supported by the model. |
Protein-level embeddings
python generate_seq_embedding.py --input_seq sample.fasta -c /path/to/configfile.yaml --checkpoint_path checkpoints/model.pth --result_path ./out

Truncated inference
python generate_seq_embedding.py --input_seq sample.fasta -c /path/to/configfile.yaml --checkpoint_path checkpoints/model.pth --result_path ./out --truncate_inference 1 --max_length_inference 1022

Residue-level embeddings
python generate_seq_embedding.py --input_seq sample.fasta -c /path/to/configfile.yaml --checkpoint_path checkpoints/model.pth --result_path ./out --residue_level

The output pickle file contains a Python dictionary:
{
"protein_id_1": np.ndarray, # shape [embedding_dim] or [seq_len, embedding_dim]
"protein_id_2": np.ndarray,
...
}

Each value corresponds to either a protein-level or residue-level embedding, depending on your arguments.
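To consume the embeddings downstream, the pickle file can be loaded with the Python standard library. Below is a minimal sketch, assuming the default output name protein_embeddings.pkl under ./out:

```python
# Minimal sketch: load embeddings produced by the embedding script.
# Assumes the default output name protein_embeddings.pkl under ./out.
import pickle
import numpy as np

with open("./out/protein_embeddings.pkl", "rb") as fh:
    embeddings = pickle.load(fh)  # dict: protein ID -> NumPy array

for protein_id, emb in embeddings.items():
    # Protein-level output is a 1-D vector [embedding_dim];
    # with --residue_level it is a 2-D array [seq_len, embedding_dim].
    print(protein_id, np.asarray(emb).shape)
```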
Evaluate sequence embedding clustering quality on CATH and Kinase datasets (.fa).
All scripts save t-SNE figures and a scores.txt summary under the output folder.
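For orientation, the sketch below shows the general shape of such an evaluation. It is a generic illustration under stated assumptions (embeddings already loaded as an [N, D] NumPy array with matching integer class labels, scikit-learn and matplotlib installed), not the code of cath_with_seq.py or kinase_with_seq.py; the commands for the actual scripts follow the sketch.

```python
# Generic sketch of embedding-clustering evaluation (illustrative; not the repository's scripts).
# Assumes `embeddings` is an [N, D] NumPy array and `labels` is an [N] array of class IDs.
import matplotlib.pyplot as plt
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score

def evaluate_clustering(embeddings: np.ndarray, labels: np.ndarray, out_prefix: str = "cath") -> float:
    # 2-D t-SNE projection for visualization.
    coords = TSNE(n_components=2, random_state=0).fit_transform(embeddings)
    # A simple clustering-quality score computed in the original embedding space.
    score = silhouette_score(embeddings, labels)

    plt.figure(figsize=(6, 6))
    plt.scatter(coords[:, 0], coords[:, 1], c=labels, s=5, cmap="tab20")
    plt.title(f"t-SNE of S-PLM embeddings ({out_prefix})")
    plt.savefig(f"{out_prefix}_tsne.png", dpi=300)

    with open("scores.txt", "a") as fh:
        fh.write(f"{out_prefix}\tsilhouette={score:.4f}\n")
    return score
```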
python cath_with_seq.py \
--checkpoint_path /path/to/checkpoint.pth \
--config_path /path/to/config.yaml \
--cath_seq ./dataset/Rep_subfamily_basedon_S40pdb.fa
python kinase_with_seq.py \
--checkpoint_path /path/to/checkpoint.pth \
--config_path /path/to/config.yaml \
    --kinase_seq ./dataset/kinase_alllabels.fa

For advanced users who wish to pretrain S-PLM from scratch, please refer to the pretrain documentation.
If you use this code or the pretrained models, please cite the following paper:
[1] S-PLM V1: protein-level contrastive learning, using a Swin Transformer as the protein structure encoder.
Wang D, Pourmirzaei M, Abbas UL, Zeng S, Manshour N, Esmaili F, Poudel B, Jiang Y, Shao Q, Chen J, Xu D. S-PLM: Structure-Aware Protein Language Model via Contrastive Learning Between Sequence and Structure. Adv Sci (Weinh). 2025 Feb;12(5):e2404212. doi: 10.1002/advs.202404212. Epub 2024 Dec 12. PMID: 39665266; PMCID: PMC11791933.
@article {Zhang2025.04.23.650337,
author = {Zhang, Yichuan and Qin, Yongfang and Pourmirzaei, Mahdi and Shao, Qing and Wang, Duolin and Xu, Dong},
title = {Enhancing Structure-aware Protein Language Models with Efficient Fine-tuning for Various Protein Prediction Tasks},
elocation-id = {2025.04.23.650337},
year = {2025},
doi = {10.1101/2025.04.23.650337},
publisher = {Cold Spring Harbor Laboratory},
abstract = {Proteins are crucial in a wide range of biological and engineering processes. Large protein language models (PLMs) can significantly advance our understanding and engineering of proteins. However, the effectiveness of PLMs in prediction and design is largely based on the representations derived from protein sequences. Without incorporating the three-dimensional structures of proteins, PLMs would overlook crucial aspects of how proteins interact with other molecules, thereby limiting their predictive accuracy. To address this issue, we present S-PLM, a 3D structure-aware PLM that employs multi-view contrastive learning to align protein sequences with their 3D structures in a unified latent space. Previously, we utilized a contact map-based approach to encode structural information, applying the Swin-Transformer to contact maps derived from AlphaFold-predicted protein structures. This work introduces a new approach that leverages a Geometric Vector Perceptron (GVP) model to process 3D coordinates and obtain structural embeddings. We focus on the application of structure-aware models for protein-related tasks by utilizing efficient fine-tuning methods to achieve optimal performance without significant computational costs. Our results show that S-PLM outperforms sequence-only PLMs across all protein clustering and classification tasks, achieving performance on par with state-of-the-art methods that require both sequence and structure inputs. S-PLM and its tuning tools are available at https://github.com/duolinwang/S-PLM/.Competing Interest StatementThe authors have declared no competing interest.National Institutes of Health, , R35GM126985, R01LM014510National Science Foundation, , 2138259, 2138286, 2138307, 2137603, 2138296},
URL = {https://www.biorxiv.org/content/early/2025/04/26/2025.04.23.650337},
eprint = {https://www.biorxiv.org/content/early/2025/04/26/2025.04.23.650337.full.pdf},
journal = {bioRxiv}
}