Protein Understanding and Design Using Guided Generation

Introduction

Guided generation lets external information steer the output of a generative model toward specific biological or functional goals in protein design and engineering. On their own, protein language models (PLMs) often struggle to generate sequences with desired properties that are not strongly represented in their training data.

This repository contains a Python-based framework for computational protein design that uses ESM3's guided generation capabilities to refine protein sequences, optimizing for structural stability as predicted by the FoldX energy function.

The primary goal of this tool is to take a wild-type protein structure and redesign a user-defined portion of its sequence to discover novel variants with enhanced stability ($ΔΔG$). The custom scoring function used for guided generation can be extended to include other protein properties.
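
For reference, the stability change is usually measured relative to the wild type. The definition below follows the convention commonly used with FoldX; whether this tool reports exactly this sign convention is an assumption here, not something stated in the repository.

$$
\Delta\Delta G = \Delta G_{\text{variant}} - \Delta G_{\text{wild type}}
$$

Under this convention, a negative $ΔΔG$ indicates a variant predicted to be more stable than the wild type.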

Folder Structure

ESM3-Guided-Generation-Based-Protein-Engineering
│
└─── data                  <- folder to keep the respective protein data bank and crystallographic information file       
│        │ 
│        └───cif           
│        └───pdb
│        
└───ESM_Cookbook           <- experimental notebooks provided by ESM for testing purposes
│
└───dist                   <- pypi build package                            
│
└───results                <- folder to store the plots and obtained results
│
└───foldx                  <- folder to store pdb files, foldx-generated repaired files, foldx binary, and rotabase.txt
│
└───logs                   <- folder to store the generated log from experiments, which includes all information and processes.                           
│
└───src                    <- main source folder
│    │ 
│    └───notebooks         <- includes file to convert cif to pdb, analyzing pdb files, and code to generate plots from log file                   
│    │      │ 
│    │      └───ciftopdb.ipynb
│    │      └───PDB_analysis.ipynb
│    │      └───plot.ipynb
│    │  
│    └───esm_foldx_guidedgeneration             <- installable Python package
│           │
│           └───guided_generation.py            <- derivative-free guided generation, parallel foldx run
│           └───main.py                         <- main python entry point
│           └───scoring_utils.py                <- pdb parsing, foldx call, foldx scorer
│           └───guided_generation.sh            <- sample batch script to run on an HPC cluster using Slurm
└───.gitignore
└───environment.yml
└───pyproject.toml
└───LICENSE
└───README.md
└───setup.py

Features

Guided Design: Leverages the state-of-the-art ESM3 protein language model to generate new sequence variants intelligently.
Stability Scoring: Uses the physically-grounded FoldX energy function to score the stability of each generated candidate.
Proportional Unmasking: Employs an adaptive unmasking schedule that makes large changes initially and fine-tunes the sequence in later steps.
Parallel Processing: Significantly accelerates the scoring of candidates by running multiple FoldX instances in parallel.
Automated Workflow: A single script handles PDB repair, sequence masking, iterative generation, scoring, and results visualization.
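
The proportional unmasking idea can be illustrated with a small sketch. This is not the schedule implemented in guided_generation.py; the function name `proportional_unmask_schedule` and the `fraction` parameter are hypothetical, and the sketch only assumes that each step unmasks a fixed share of the positions still masked, so that early steps change many residues and later steps change few.

```python
import math


def proportional_unmask_schedule(num_masked, num_steps, fraction=0.25):
    """Return the number of positions to unmask at each decoding step.

    Hypothetical schedule: every step unmasks `fraction` of the positions
    that are still masked (at least one), and the final step unmasks
    whatever is left so that no position stays masked.
    """
    if num_masked < 0 or num_steps <= 0:
        raise ValueError("num_masked must be >= 0 and num_steps must be > 0")

    schedule = []
    remaining = num_masked
    for step in range(num_steps):
        if remaining == 0:
            count = 0  # nothing left to unmask
        elif step == num_steps - 1:
            count = remaining  # last step: finish the sequence
        else:
            count = min(remaining, max(1, math.ceil(fraction * remaining)))
        schedule.append(count)
        remaining -= count
    return schedule


if __name__ == "__main__":
    # Example: 22 masked positions spread over 8 decoding steps.
    print(proportional_unmask_schedule(22, 8))
```

With very few decoding steps, the last step has to absorb most of the remaining positions, so a schedule of this kind only behaves as described when there are enough steps for the masked region.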

Methodology

The design process is an iterative, guided search that can be thought of as a Design-Build-Test cycle performed entirely in silico.
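
A minimal sketch of such a derivative-free guided search is shown below. It is a toy under stated assumptions: `propose` stands in for the ESM3 sampling step by filling masked positions at random, `toy_score` stands in for the FoldX scorer (lower is better), and all names are hypothetical rather than taken from the package. Each step samples several candidates (design and build), scores them (test), and keeps the best one as the starting point for the next step.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "_"


def propose(sequence, num_to_unmask, rng):
    """Stand-in for model sampling: fill some masked positions at random."""
    masked = [i for i, aa in enumerate(sequence) if aa == MASK]
    chosen = rng.sample(masked, min(num_to_unmask, len(masked)))
    residues = list(sequence)
    for i in chosen:
        residues[i] = rng.choice(AMINO_ACIDS)
    return "".join(residues)


def toy_score(sequence):
    """Stand-in for an energy-based scorer: lower is better.

    Hypothetical objective that simply rewards hydrophobic residues.
    """
    return -sum(sequence.count(aa) for aa in "AILMV")


def guided_search(sequence, num_steps, samples_per_step, score_fn, seed=0):
    """Iteratively unmask the sequence, keeping the best candidate per step."""
    rng = random.Random(seed)
    current = sequence
    for step in range(num_steps):
        remaining = current.count(MASK)
        if remaining == 0:
            break
        steps_left = num_steps - step
        # Ceiling division, so every position is unmasked by the last step.
        num_to_unmask = -(-remaining // steps_left)
        candidates = [
            propose(current, num_to_unmask, rng) for _ in range(samples_per_step)
        ]
        current = min(candidates, key=score_fn)
    return current


if __name__ == "__main__":
    start = "MK____LV____GA"  # fixed residues with two masked stretches
    print(guided_search(start, num_steps=4, samples_per_step=8, score_fn=toy_score))
```

Because the search only compares scores, the scorer can be any black-box function, which is what allows a physics-based tool to guide a language model without gradients.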

Installation and Setup

This framework is designed for a Linux-based environment with CPU/GPU acceleration.

To install the package directly from PyPI (https://pypi.org/project/esm-foldx-guidedgeneration/):

pip install esm_foldx_guidedgeneration

System Requirements

Operating System: Linux (tested on NERSC Perlmutter's custom Linux-based environment)
Processor: Modern multi-core CPU (8+ cores recommended for parallel scoring)
Memory (RAM): 64 GB or more recommended
GPU: NVIDIA GPU with CUDA support (16GB+ VRAM recommended for the 1.4B ESM3 model)

Dependencies

This project relies on several key pieces of software.

Python & Conda: Python 3.8+ is required. It is highly recommended to manage the environment using Conda.
FoldX Modeling Suite: This package calls the FoldX executable to perform stability calculations.
- Obtain a FoldX License: Request a free academic license from the FoldX website.
- Download FoldX: After receiving your license, download the Linux version of the FoldX executable.
- Set Up foldx Directory:
  - In the root of this repository, create a directory named foldx.
  - Place the foldx executable and the rotabase.txt file inside this foldx directory.
  - Place the pdb files inside the foldx directory.
Python Libraries: All required Python libraries and their specific versions are listed in the environment.yml file. Key dependencies include:
- pytorch
- esm
- pandas
- matplotlib & seaborn
Hugging Face Account: The ESM3 model is a gated model and requires a Hugging Face account for access.

Environment Setup

Log in to Hugging Face: Before setting up the environment, you must authenticate with Hugging Face.
- Go to the ESM3 model page and accept the terms of use.
- Go to your Hugging Face tokens page, generate a new read token, and copy it.
- In your terminal, run the login command and paste your token when prompted:
```
huggingface-cli login
```
- To access a larger ESM3 model through the Forge API:
```
from getpass import getpass

from esm.sdk import client

token = getpass("Token from Forge console: ")
model = client(model="esm3-medium-2024-08", url="https://forge.evolutionaryscale.ai", token=token)
```
Create the Conda Environment: You can recreate the necessary Conda environment using the provided file. This will install all the required Python packages.
```
# Create the conda environment from the file
conda env create -f environment.yml

# Activate the new environment
conda activate proteinenv
```

Local Package Installation

To make the scripts callable from anywhere, install the package in "editable" mode. From the root of the repository, run:

pip install -e .

Usage

The main script can be run from the command line. You must provide a PDB filename, chain ID, and masking percentage.

python -m esm_foldx_guidedgeneration.main --pdb_filename "1PGA.pdb" --chain_id "A" --masking_percentage 0.4 --num_decoding_steps 32 --num_samples_per_step 20 --num_workers 20

Choose masking_percentage based on the length of the protein: for longer proteins, use a smaller masking_percentage; for shorter proteins, 0.4-0.5 works well. Set num_decoding_steps and num_samples_per_step according to the number of optimization iterations you want. Set num_workers to the same value as num_samples_per_step so that all FoldX calls for a step run in parallel.
Results, including log files and plots of the $ΔΔG$ trajectory, will be saved in the logs/ and results/ directories.
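
The relationship between num_samples_per_step and num_workers can be sketched as follows. This is an illustration only, not the package's implementation: `score_candidates` and `placeholder_score` are hypothetical names, and the placeholder replaces the real FoldX call with simple arithmetic. The sketch uses a thread pool on the assumption that each scoring call mostly waits on an external process.

```python
from concurrent.futures import ThreadPoolExecutor


def placeholder_score(sequence):
    """Hypothetical stand-in for a FoldX stability score (lower is better)."""
    return float(len(sequence) - 2 * sequence.count("A"))


def score_candidates(candidates, score_fn, num_workers):
    """Score all candidates concurrently and return scores in input order."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(score_fn, candidates))


if __name__ == "__main__":
    candidates = ["AAA", "GGG", "AG"]
    # One worker per candidate, mirroring num_workers == num_samples_per_step.
    scores = score_candidates(candidates, placeholder_score, num_workers=len(candidates))
    best_score, best_candidate = min(zip(scores, candidates))
    print(best_candidate, best_score)
```

With fewer workers than candidates, the remaining candidates wait for a free worker, so a step takes roughly as long as several sequential scoring calls.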

To run on an HPC system such as Perlmutter, use the provided guided_generation.sh script:

sbatch guided_generation.sh

Make sure to change the #SBATCH --array=0-1 range to match the number of PDB files submitted in the job. The script is designed to take multiple protein .pdb files as input.
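
As a rough illustration of how an array range relates to the number of input files, the sketch below computes the range for N files and picks the file for a given task. How guided_generation.sh actually maps task IDs to files is not shown here, so the helper names, the sorted ordering, and the file name `example2.pdb` are assumptions; `SLURM_ARRAY_TASK_ID` is the environment variable Slurm sets for each array task.

```python
import os


def array_spec(num_pdb_files):
    """Return the --array range for a zero-based job array over N files."""
    if num_pdb_files < 1:
        raise ValueError("at least one PDB file is required")
    return f"0-{num_pdb_files - 1}"


def pdb_for_task(pdb_files, task_id=None):
    """Pick the PDB file for one array task (assumed: sorted order by name)."""
    if task_id is None:
        task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
    return sorted(pdb_files)[task_id]


if __name__ == "__main__":
    files = ["example2.pdb", "1PGA.pdb"]  # example2.pdb is a made-up name
    print("#SBATCH --array=" + array_spec(len(files)))
    print(pdb_for_task(files, task_id=0))
```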

To run after installing the package:

esm_foldx_guidedgeneration --pdb_filename "1PGA.pdb" --chain_id "A" --masking_percentage 0.4 --num_decoding_steps 32 --num_samples_per_step 20 --num_workers 20

Before running, make sure the foldx directory exists and contains the FoldX executable, the rotabase.txt file, and the PDB files you want to process.
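
A quick pre-flight check along these lines can catch a missing file before a long run starts. This helper is not part of the package; `check_foldx_dir` is a hypothetical name, and the executable file name is passed as a parameter because it may differ between FoldX releases.

```python
from pathlib import Path


def check_foldx_dir(foldx_dir, pdb_filename, executable_name="foldx"):
    """Return the names of required files missing from the foldx directory."""
    foldx_dir = Path(foldx_dir)
    required = [executable_name, "rotabase.txt", pdb_filename]
    return [name for name in required if not (foldx_dir / name).is_file()]


if __name__ == "__main__":
    missing = check_foldx_dir("foldx", "1PGA.pdb")
    if missing:
        print("Missing from foldx/: " + ", ".join(missing))
    else:
        print("foldx/ directory looks complete.")
```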

License

This project is licensed under the Apache License.

Acknowledgments

This work was carried out during a summer internship at NERSC from June 2025 to September 2025.

Perlmutter Supercomputer
Lawrence Berkeley National Laboratory
National Energy Research Scientific Computing Center
ai4protein group
University of California San Diego (Boolean Lab)
Anna Su (Yale University)

ESM Github (Code / Weights)

author = {{EvolutionaryScale Team}},
title = {evolutionaryscale/esm},
year = {2024},
publisher = {Zenodo},
doi = {10.5281/zenodo.14219303},
URL = {https://doi.org/10.5281/zenodo.14219303}}

Citation

If you use this work in your research, please cite it as follows:

@software{Nanda_esm_foldx_guidedgeneration_2025,
  author = {Amitash Nanda and Steven Farrell and Nabin Giri},
  title = {{ESM-FoldX Guided Generation: A Framework for Protein Understanding and Design Using Guided Generation}},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  url = {https://github.com/amitashnanda/ESM3-Guided-Generation-Based-Protein-Engineering}
}
