This repository analyzes, sanitizes, and uses a dataset of protein sequences with validated activities to design new alpha-amylase variants with improved activity. This is done in four main steps:
- Data sanitization: investigate/remove NaN entries and duplicate sequences, check that each sequence contains only natural amino acids, and investigate/remove sequences with out-of-distribution lengths.
- Fine-tune a pretrained model: the repository takes the pretrained ProtGPT2 base model and fine-tunes it on the entire (cleaned) alpha-amylase dataset (i.e., independently of activities) with SFT.
- Model alignment: the repository uses DPO to align the model towards protein sequences with high activities.
- Generation: generate several sequences with the DPO-aligned model and pick the 'best' one based on a mixture of model perplexity and sequence length (see below).
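The sanitization step above can be sketched as follows. This is a minimal sketch, not the repository's actual code: the length cutoffs and the percentile-based outlier rule are assumptions for illustration.

```python
# Sketch of the sanitization step; the percentile cutoffs are assumptions.
NATURAL_AAS = set("ACDEFGHIKLMNPQRSTVWY")

def sanitize(sequences):
    """Drop empty/duplicate sequences, sequences with non-natural
    residues, and out-of-distribution lengths."""
    seen, clean = set(), []
    for seq in sequences:
        if not seq or seq in seen:
            continue  # skip NaN-like/empty entries and duplicates
        if set(seq) - NATURAL_AAS:
            continue  # skip sequences containing non-natural amino acids
        seen.add(seq)
        clean.append(seq)
    # remove out-of-distribution lengths via a simple percentile cut
    lengths = sorted(len(s) for s in clean)
    lo = lengths[len(lengths) // 20]            # ~5th percentile
    hi = lengths[-max(1, len(lengths) // 20)]   # ~95th percentile
    return [s for s in clean if lo <= len(s) <= hi]
```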
In the figure below, I show that the DPO fine-tuned model has lower perplexity for training sequences with high activity. Therefore, I use the perplexity the DPO fine-tuned model assigns to its generated sequences as a proxy for high activity. This also relies on the fact that ProtGPT2 seems to be well calibrated (see ProtGPT2).
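The selection rule can be sketched as below. The exact weighting between perplexity and sequence length is a hypothetical choice for illustration, not the repository's actual formula; only the perplexity definition (exponential of the mean negative log-likelihood) is standard.

```python
import math

def perplexity(mean_nll):
    """Perplexity of a sequence given its mean negative
    log-likelihood under the model."""
    return math.exp(mean_nll)

def pick_best(candidates, target_len=450, length_weight=0.01):
    """Rank generated (sequence, mean_nll) pairs by low perplexity,
    lightly penalizing deviation from a typical length.
    target_len and length_weight are illustrative assumptions."""
    def score(candidate):
        seq, mean_nll = candidate
        return perplexity(mean_nll) + length_weight * abs(len(seq) - target_len)
    return min(candidates, key=score)
```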
To keep things versioned and segregated from the rest of the system, I use a conda virtual environment called protein_design for this project.
```bash
conda create [-p /optional/prefix] -n protein_design
```
First, I activate the newly created environment so that the packages get installed there:
```bash
conda activate protein_design
```
To avoid having to manually activate the environment every time, I use direnv (highly recommended!). Next, I install PyTorch with the latest CUDA release together with a compatible Python release:
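As an aside, the direnv setup can be sketched with a minimal `.envrc` in the project root; this assumes a bash shell and that conda is on the PATH.

```bash
# .envrc — direnv runs this on entering the directory
# (assumes bash and that conda's shell hook is available)
eval "$(conda shell.bash hook)"
conda activate protein_design
```

After creating the file, run `direnv allow` once to approve it.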
```bash
conda install python=3.12.7
pip3 install torch torchvision torchaudio
```
Next, I install pandas to efficiently read the dataset, which is stored in a .csv file (note: the data file is not provided in the repository):
```bash
pip3 install pandas
```
To use the ProtGPT2 base model, I install the Hugging Face transformers package. However, I had to downgrade to an earlier version due to a potential bug between the latest release and the DPOTrainer of the trl package:
```bash
pip install transformers==4.45.2
```
To supervised-fine-tune ProtGPT2 on my dataset, I downloaded the run_clm.py script from the Hugging Face transformers repository:
```bash
wget https://raw.githubusercontent.com/huggingface/transformers/26a9443dae41737e665910fbb617173e17a0cd18/examples/pytorch/language-modeling/run_clm.py
```
Note: run_clm.py must be downloaded from the commit corresponding to the transformers==4.45.2 release, otherwise it won't work with the version of transformers installed in the previous step.
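A hypothetical SFT invocation might look as follows. The file names, output directory, and hyperparameters are placeholders, not the repository's actual settings; only the flags themselves are standard run_clm.py arguments.

```bash
# Hypothetical fine-tuning run; train/validation file names,
# hyperparameters, and output_dir are placeholders.
python run_clm.py \
  --model_name_or_path nferruz/ProtGPT2 \
  --train_file train_sequences.txt \
  --validation_file val_sequences.txt \
  --do_train --do_eval \
  --output_dir sft_output \
  --num_train_epochs 3 \
  --per_device_train_batch_size 1 \
  --learning_rate 1e-6
```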
Finally, I installed the latest version of trl to align ProtGPT2 with DPO:
```bash
pip install trl
```
I also use evaluate from Hugging Face to compute the model perplexity for each generated sequence:
```bash
pip install evaluate
```
Additional requirements can be found in the requirements.txt file.
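For the DPO alignment step, trl's DPOTrainer consumes records with `prompt`, `chosen`, and `rejected` fields. A sketch of building such pairs from activity-labeled sequences is shown below; pairing the highest-activity sequences against the lowest, and using an empty prompt for unconditional generation, are my assumptions about the approach, not the repository's confirmed logic.

```python
def make_preference_pairs(records):
    """Build DPOTrainer-style preference records: for each pair, the
    higher-activity sequence is 'chosen' and the lower one 'rejected'.
    The empty prompt reflects unconditional generation (an assumption)."""
    records = sorted(records, key=lambda r: r["activity"], reverse=True)
    half = len(records) // 2
    return [
        {"prompt": "", "chosen": hi["sequence"], "rejected": lo["sequence"]}
        for hi, lo in zip(records[:half], records[half:])
    ]
```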
