Guided generation for protein design and engineering uses external information to steer the output of a generative model towards specific biological or functional goals. Protein language models (PLMs) often struggle to generate sequences with specific, desired properties that are not strongly represented in their training data.
This repository contains a Python-based framework for computational protein design that uses ESM3's guided generation capabilities to refine protein sequences, optimizing for structural stability as predicted by the FoldX energy function.
The primary goal of this tool is to take a wild-type protein structure and redesign a user-defined portion of its sequence to discover novel variants with enhanced stability (
ESM3-Guided-Generation-Based-Protein-Engineering
│
└─── data <- folder to keep the respective protein data bank and crystallographic information file
│ │
│ └───cif
│ └───pdb
│
└───ESM_Cookbook <- experimental notebooks provided by ESM for testing purposes
│
└───dist <- pypi build package
│
└───result <- folder to store the plots and obtained results
│
└───foldx <- folder to store pdb files, foldx-generated repaired files, foldx binary, and rotabase.txt
│
└───logs <- folder to store the generated log from experiments, which includes all information and processes.
│
└───src <- main source folder
│ │
│ └───notebooks <- includes file to convert cif to pdb, analyzing pdb files, and code to generate plots from log file
│ │ │
│ │ └───ciftopdb.ipynb
│ │ └───PDB_analysis.ipynb
│ │ └───plot.ipynb
│ │
│ └───esm_foldx_guidedgeneration <- installable Python package
│ │
│ └───guided_generation.py <- derivative-free guided generation, parallel foldx run
│ └───main.py <- main python entry point
│ └───scoring_utils.py <- pdb parsing, foldx call, foldx scorer
│ └───guided_generation.sh <- sample batch script to run on an HPC cluster using Slurm
└───.gitignore
└───environment.yml
└───pyproject.toml
└───LICENSE
└───README.md
└───setup.py
- Guided Design: Leverages the state-of-the-art ESM3 protein language model to generate new sequence variants intelligently.
- Stability Scoring: Uses the physically-grounded FoldX energy function to score the stability of each generated candidate.
- Proportional Unmasking: Employs an adaptive unmasking schedule that makes large changes initially and fine-tunes the sequence in later steps.
- Parallel Processing: Significantly accelerates the scoring of candidates by running multiple FoldX instances in parallel.
- Automated Workflow: A single script handles PDB repair, sequence masking, iterative generation, scoring, and results visualization.
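The parallel scoring idea can be sketched as follows. This is a minimal illustration, not the code in guided_generation.py: `score_candidates` and `toy_score` are hypothetical names, and the toy scorer stands in for a real FoldX call. Threads are used here on the assumption that each real scoring call spends its time waiting on an external FoldX process.

```python
from concurrent.futures import ThreadPoolExecutor


def score_candidates(candidates, score_fn, num_workers=4):
    """Score every candidate concurrently and return scores in input order."""
    if not candidates:
        return []
    # Never start more workers than there are candidates to score.
    workers = max(1, min(num_workers, len(candidates)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map yields results in the same order as the inputs.
        return list(pool.map(score_fn, candidates))


def toy_score(sequence):
    # Hypothetical stand-in for a FoldX energy: counts glycines, lower is better.
    return float(sequence.count("G"))


candidates = ["MKTAYG", "MGGGYG", "MKTAYA"]
scores = score_candidates(candidates, toy_score, num_workers=3)
# Pick the candidate with the lowest (most favorable) score.
best = min(zip(scores, candidates))[1]
print(scores, best)
```

With one worker per candidate, the wall-clock time of a step is roughly that of the slowest single scoring call rather than the sum of all calls.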
The design process is an iterative, guided search that can be thought of as a Design-Build-Test cycle performed entirely in silico.
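A toy version of such a loop is sketched below. Everything in it is an assumption made for illustration: random residue proposals stand in for ESM3 sampling, `toy_energy` stands in for FoldX, and `num_to_unmask` is one plausible reading of a proportional schedule (unmask a fixed fraction of the still-masked positions each step, so early steps change more than late ones).

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
MASK = "_"  # assumed mask character for this sketch


def num_to_unmask(remaining, steps_left, fraction=0.5):
    """Unmask a fixed fraction of the masked positions; finish on the last step."""
    if steps_left <= 1:
        return remaining
    return min(remaining, max(1, math.ceil(fraction * remaining)))


def propose(sequence, k, rng):
    """Fill k randomly chosen masked positions with random residues.

    A real run would ask the generative model to fill these positions.
    """
    masked = [i for i, aa in enumerate(sequence) if aa == MASK]
    chars = list(sequence)
    for i in rng.sample(masked, k):
        chars[i] = rng.choice(AMINO_ACIDS)
    return "".join(chars)


def guided_search(masked_sequence, score_fn, num_steps, samples_per_step, seed=0):
    """Derivative-free search: sample candidates, score them, keep the best."""
    rng = random.Random(seed)
    current = masked_sequence
    for step in range(num_steps):
        remaining = current.count(MASK)
        if remaining == 0:
            break
        k = num_to_unmask(remaining, num_steps - step)
        candidates = [propose(current, k, rng) for _ in range(samples_per_step)]
        # Lower score is treated as better, as with an energy.
        current = min(candidates, key=score_fn)
    return current


def toy_energy(sequence):
    # Hypothetical scorer that tolerates masks; a physical scorer would
    # need a complete sequence or structure.
    return -sequence.count("A")


result = guided_search("MK____LV____", toy_energy, num_steps=4, samples_per_step=8)
print(result)
```

For 8 masked positions and 4 steps, this schedule unmasks 4, 2, 1, and 1 positions in turn.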
This framework is designed for a Linux-based environment with CPU/GPU acceleration.
To install the package directly from PyPI:
https://pypi.org/project/esm-foldx-guidedgeneration/
pip install esm_foldx_guidedgeneration
- Operating System: Linux (tested on NERSC Perlmutter, custom Linux-based kernel)
- Processor: Modern multi-core CPU (8+ cores recommended for parallel scoring)
- Memory (RAM): 64 GB or more recommended
- GPU: NVIDIA GPU with CUDA support (16GB+ VRAM recommended for the 1.4B ESM3 model)
This project relies on several key pieces of software.
- Python & Conda: Python 3.8+ is required. Managing the environment with Conda is highly recommended.
- FoldX Modeling Suite: This package calls the FoldX executable to perform stability calculations.
  - Obtain a FoldX License: Request a free academic license from the FoldX website.
  - Download FoldX: After receiving your license, download the Linux version of the FoldX executable.
  - Set Up foldx Directory:
    - In the root of this repository, create a directory named foldx.
    - Place the foldx executable and the rotabase.txt file inside this foldx directory.
    - Place the pdb files inside the foldx directory.
- Python Libraries: All required Python libraries and their specific versions are listed in the environment.yml file. Key dependencies include:
  - pytorch
  - esm
  - pandas
  - matplotlib & seaborn
- Hugging Face Account: The ESM3 model is a gated model and requires a Hugging Face account for access.
- Log in to Hugging Face:
Before setting up the environment, you must authenticate with Hugging Face.
  - Go to the ESM3 model page and accept the terms of use.
  - Go to your Hugging Face tokens page, generate a new read token, and copy it.
  - In your terminal, run the login command and paste your token when prompted:
huggingface-cli login
  - To access a larger ESM model through Forge:
from getpass import getpass
from esm.sdk import client
token = getpass("Token from Forge console: ")
model = client(model="esm3-medium-2024-08", url="https://forge.evolutionaryscale.ai", token=token)
- Create the Conda Environment:
You can recreate the necessary Conda environment using the provided file. This will install all the required Python packages.
# Create the conda environment from the file
conda env create -f environment.yml
# Activate the new environment
conda activate proteinenv
To make the scripts callable from anywhere, install the package in "editable" mode. From the root of the repository, run:
pip install -e .
The main script can be run from the command line. You must provide a PDB filename, chain ID, and masking percentage.
python -m esm_foldx_guidedgeneration.main --pdb_filename "1PGA.pdb" --chain_id "A" --masking_percentage 0.4 --num_decoding_steps 32 --num_samples_per_step 20 --num_workers 20
- Choose masking_percentage according to the size of the protein: for larger proteins, use a smaller masking_percentage; for smaller proteins, 0.4-0.5 works well. Set num_decoding_steps and num_samples_per_step according to the number of iterations you want for the optimization process. Set num_workers to the same value as num_samples_per_step so that all foldx calls for a step run in parallel.
- Results, including log files and plots of the $ΔΔG$ trajectory, will be saved in the logs/ and results/ directories.
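To make the effect of masking_percentage concrete, the sketch below masks a fraction of a sequence. It is an illustration under stated assumptions, not the package's own masking code: `mask_sequence` is a hypothetical helper, positions are chosen uniformly at random, the underscore is assumed as the mask character, and the 20-residue sequence is a made-up example.

```python
import random

MASK = "_"  # assumed mask character for this sketch


def mask_sequence(sequence, masking_percentage, seed=0):
    """Replace a fraction of residues with MASK; return (masked, positions)."""
    if not 0.0 <= masking_percentage <= 1.0:
        raise ValueError("masking_percentage must be between 0 and 1")
    n_mask = round(masking_percentage * len(sequence))
    rng = random.Random(seed)
    # Sample distinct positions so exactly n_mask residues are masked.
    positions = sorted(rng.sample(range(len(sequence)), n_mask))
    chars = list(sequence)
    for i in positions:
        chars[i] = MASK
    return "".join(chars), positions


toy_sequence = "ACDEFGHIKLMNPQRSTVWY"  # made-up 20-residue example
masked, positions = mask_sequence(toy_sequence, 0.4)
print(masked, len(positions))
```

With masking_percentage 0.4, 8 of the 20 residues are open for redesign; the same setting on a 500-residue protein would open 200 positions, which is why a smaller value is suggested for larger proteins.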
To run on an HPC system such as Perlmutter, use the provided guided_generation.sh file:
sbatch guided_generation.sh
Make sure to change #SBATCH --array=0-1 to match the number of pdb files submitted in the job. The script is designed to take multiple protein .pdb files as input.
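The relationship between the array range and the input files can be illustrated as follows. This sketch does not reproduce guided_generation.sh; it only assumes that each array task handles one file, and `pdb_for_task` and `array_spec` are hypothetical helpers. SLURM_ARRAY_TASK_ID is the environment variable Slurm sets for each task of a job array. The second filename is an illustrative placeholder.

```python
import os


def array_spec(num_files):
    """Return the --array range that gives one task per file."""
    if num_files < 1:
        raise ValueError("need at least one pdb file")
    return f"0-{num_files - 1}"


def pdb_for_task(pdb_files, task_id=None):
    """Pick the pdb file handled by this array task."""
    if task_id is None:
        # Slurm sets this variable for every task in a job array.
        task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    ordered = sorted(pdb_files)  # stable order across tasks
    if not 0 <= task_id < len(ordered):
        raise IndexError(f"task {task_id} has no matching pdb file")
    return ordered[task_id]


pdbs = ["2LZM.pdb", "1PGA.pdb"]  # illustrative input list
print(array_spec(len(pdbs)))
print(pdb_for_task(pdbs, task_id=0), pdb_for_task(pdbs, task_id=1))
```

Two files correspond to --array=0-1; for N files the range would be 0 to N-1.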
To run after installing the package:
esm_foldx_guidedgeneration --pdb_filename "1PGA.pdb" --chain_id "A" --masking_percentage 0.4 --num_decoding_steps 32 --num_samples_per_step 20 --num_workers 20
Make sure to create the foldx directory, add the necessary pdb files to it, and place the foldx executable and the rotabase.txt file inside it.
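A quick preflight check of that directory can catch setup mistakes before a long run. The sketch below is not part of the package: `check_foldx_dir` is a hypothetical helper, and it assumes the executable is named foldx exactly as described above. The demo runs against a temporary directory so it does not touch a real setup.

```python
import os
import tempfile
from pathlib import Path


def check_foldx_dir(foldx_dir, pdb_filename):
    """Return a list of problems found in the foldx directory (empty if none)."""
    root = Path(foldx_dir)
    problems = []
    exe = root / "foldx"  # assumed name of the FoldX binary
    if not exe.is_file():
        problems.append("missing foldx executable")
    elif not os.access(exe, os.X_OK):
        problems.append("foldx is not executable")
    if not (root / "rotabase.txt").is_file():
        problems.append("missing rotabase.txt")
    if not (root / pdb_filename).is_file():
        problems.append(f"missing {pdb_filename}")
    return problems


# Demo on a throwaway directory: empty first, then populated.
with tempfile.TemporaryDirectory() as tmp:
    before = check_foldx_dir(tmp, "1PGA.pdb")
    for name in ("foldx", "rotabase.txt", "1PGA.pdb"):
        (Path(tmp) / name).touch()
    (Path(tmp) / "foldx").chmod(0o755)  # mark the placeholder as executable
    after = check_foldx_dir(tmp, "1PGA.pdb")

print(before)
print(after)
```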
This project is licensed under the Apache License.
This work was carried out during a summer internship at NERSC from June 2025 to September 2025.
- Perlmutter Supercomputer
- Lawrence Berkeley National Laboratory
- National Energy Research Scientific Computing Center
- ai4protein group
- University of California San Diego (Boolean Lab)
- Anna Su (Yale University)
- ESM Github (Code / Weights)
author = {{EvolutionaryScale Team}}, title = {evolutionaryscale/esm}, year = {2024}, publisher = {Zenodo}, doi = {10.5281/zenodo.14219303}, URL = {https://doi.org/10.5281/zenodo.14219303}}
If you use this work in your research, please cite it as follows:
@software{Nanda_esm_foldx_guidedgeneration_2025,
author = {Amitash Nanda and Steven Farrell and Nabin Giri},
title = {{ESM-FoldX Guided Generation: A Framework for Protein Understanding and Design Using Guided Generation}},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
url = {https://github.com/amitashnanda/ESM3-Guided-Generation-Based-Protein-Engineering}
}