
michele1993/Protein_design


Protein design

This repository aims to analyze, sanitize, and use a dataset of protein sequences with validated activities to design new alpha-amylase variants with improved activity. This is done in four main steps:

  1. Data sanitation: e.g., investigate/remove NaN entries and duplicate sequences, check that each sequence contains only natural amino acids, and investigate/remove sequences with out-of-distribution lengths.
  2. Fine-tune a pretrained model: the repository takes the pretrained ProtGPT2 base model and fine-tunes it with SFT on the entire (cleaned) alpha-amylase dataset (i.e., independently of activities).
  3. Model alignment: the repository uses DPO to align the model towards protein sequences with high activities.
  4. Generation: generate several sequences with the DPO-aligned model and pick the 'best' one based on a mixture of model perplexity and sequence length (see below).
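The sanitation step above can be sketched with pandas. The column names (`sequence`, `activity`) and the percentile-based length-outlier rule are assumptions for illustration, not necessarily the repository's exact choices:

```python
import pandas as pd

NATURAL_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 natural amino acids

def sanitize(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four sanitation filters to a (sequence, activity) table."""
    df = df.dropna(subset=["sequence", "activity"])   # remove NaN entries
    df = df.drop_duplicates(subset="sequence")        # remove duplicate sequences
    df = df[df["sequence"].apply(lambda s: set(s) <= NATURAL_AA)]  # natural AAs only
    lengths = df["sequence"].str.len()
    lo, hi = lengths.quantile([0.01, 0.99])           # drop length outliers
    return df[lengths.between(lo, hi)]

# Tiny illustrative table (the real CSV's columns may differ)
df = pd.DataFrame({
    "sequence": ["MKT", "MKT", "MKX", None, "MKV"],
    "activity": [1.0, 1.0, 2.0, 3.0, 4.0],
})
clean = sanitize(df)  # keeps "MKT" (once) and "MKV"
```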

In the figure below, I show that the DPO fine-tuned model has lower perplexity for training sequences with high activity. Therefore, I use the perplexity that the DPO fine-tuned model assigns to its generated sequences as a proxy for high activity. This also relies on the fact that ProtGPT2 appears to be well calibrated (see ProtGPT2).
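The selection rule in step 4 can be sketched as follows; the specific weighting (penalizing perplexity by relative distance from a target length) is a hypothetical choice for illustration, not the repository's exact formula:

```python
def select_best(candidates, target_len):
    """Pick the generated sequence with the best combined score:
    lower perplexity is better, and lengths far from a target
    length are penalized multiplicatively (illustrative weighting)."""
    def score(seq, ppl):
        length_penalty = abs(len(seq) - target_len) / target_len
        return ppl * (1.0 + length_penalty)
    return min(candidates, key=lambda c: score(c[0], c[1]))

# (sequence, perplexity) pairs; toy values, not real model outputs
candidates = [
    ("MKVL" * 100, 3.2),  # length 400, low perplexity
    ("MKVL" * 50,  2.9),  # length 200, lowest perplexity but short
    ("MKVL" * 110, 3.4),  # length 440
]
best_seq, best_ppl = select_best(candidates, target_len=400)
```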

Figure: Perplexity vs. activity of training sequences under the DPO fine-tuned model

Installation

Virtual Environment

To keep dependencies versioned and segregated from the rest of the system, I use a conda virtual environment called protein_design for this project.

conda create [-p /optional/prefix] -n protein_design

Installing python packages

First, I activate the newly created environment so that the packages get installed there,

conda activate protein_design

To avoid having to manually activate the environment every time, I use direnv (highly recommended!). Next, I install PyTorch with the latest CUDA release together with a compatible Python release.

conda install python=3.12.7
pip3 install torch torchvision torchaudio

Next, I install pandas to efficiently read the dataset, which is stored in a .csv file (note: the data file is not provided in the repository).

pip3 install pandas
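Loading the dataset is then a single call. Since the data file is not provided, this sketch parses an inline snippet instead; the file name and column names are assumptions:

```python
import io
import pandas as pd

# Stand-in for the real file, e.g. pd.read_csv("alpha_amylase.csv");
# the column names here are assumed for illustration.
csv_text = """sequence,activity
MKTAYIAKQR,1.8
MKVLWAALLV,0.4
"""
df = pd.read_csv(io.StringIO(csv_text))
```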

Generative protein sequence base model

In order to use the ProtGPT2 base model, I install the Hugging Face transformers package. However, I had to downgrade to an earlier version of it,

pip install transformers==4.45.2

due to a potential bug between the latest release and the DPOTrainer of the trl package.

To supervised fine-tune (SFT) ProtGPT2 on my dataset, I downloaded the run_clm.py script from the Hugging Face transformers repository.

wget https://raw.githubusercontent.com/huggingface/transformers/26a9443dae41737e665910fbb617173e17a0cd18/examples/pytorch/language-modeling/run_clm.py

Note: the run_clm.py script must be downloaded from the past commit corresponding to the transformers==4.45.2 release; otherwise it won't work with the version of transformers installed in the previous step.

Finally, I installed the latest version of trl to align ProtGPT2 with DPO.

pip install trl
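trl's DPOTrainer expects a dataset of preference pairs with "chosen" and "rejected" completions. One hypothetical way to derive such pairs from the activity labels — pairing high-activity sequences with low-activity ones, which is not necessarily the repository's exact scheme — is:

```python
def make_preference_pairs(records, threshold):
    """Split (sequence, activity) records at an activity threshold and
    pair each high-activity sequence with a low-activity one, in the
    {"prompt", "chosen", "rejected"} format DPOTrainer consumes."""
    high = [seq for seq, act in records if act >= threshold]
    low = [seq for seq, act in records if act < threshold]
    return [{"prompt": "", "chosen": h, "rejected": l}
            for h, l in zip(high, low)]

# Toy records; real sequences and activities come from the cleaned dataset
records = [("MKTA", 2.0), ("MKVL", 0.5), ("MQTA", 1.9), ("MAVL", 0.3)]
pairs = make_preference_pairs(records, threshold=1.0)
```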

I also use evaluate from Hugging Face to compute the model perplexity of each given sequence.

pip install evaluate
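For intuition, perplexity is the exponential of the average negative log-likelihood the model assigns to the tokens of a sequence; a toy illustration on made-up token probabilities (not real ProtGPT2 values):

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-likelihood over tokens."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that is uniformly uncertain over 4 options has perplexity 4
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```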

Additional requirements can be found in the requirements.txt file.
