Skip to content

obrien-lab/cg_simtk_protein_folding

Repository files navigation

Coarse-grained Simulation Toolkit for Protein Folding

This is a package of scripts that are used to create coarse-grained (CG) models of proteins/ribosomes, optimize CG force field parameters, run MD simulations for protein co- and post-translational folding and analyze the folding structures. This toolkit is developed based on the original simulation protocols and scripts of O'Brien Lab, in which the simulations were performed by Charmm.

An example of the translation and folding process of the E. coli D-Ala-D-Ala Ligase B (DDLB, PDBid 4c5c) that was simulated by this toolkit is shown below.

All the scripts are ready to use on a Linux machine (have been tested on RHEL7 and Centos7) when users have added the directories (including all the child folders) in $PATH (add export PATH=${PATH}:/path/to/one/child/folder/ in your ~/.bashrc file) and have granted the execution permission (chmod -R +x ./cg_simtk_protein_folding/) for all the scripts.

⚠️ Some of the scripts need python modules (Python 3.X) and other softerware installed prior to use. Some of the scripts have an option to use CUDA to accelerate simulations (powered by openMM). OpenMM v7.4.1 with CUDA 9.0 and Python 3.7 have been tested. Please click Learn more in the following script instruction tables to find detailed instruction of usage and basic theory used in the script.

In summary, to use all scripts, the following Python modules and software are required:

Required Module Required Software
numpy
scipy
multiprocessing
lxml
openmm
parmed
mdtraj
pdbfixer
pyemma
msmtools
topoly
Python 3.X
Perl v5+
Stride
pywham
matlab
imagemagick
vmd
PD2
Pulchra
ChaperISM

The scripts in this toolkit have been applied in the studies reported in the following papers:

  1. Nissley, D.A., Jiang, Y., Trovato, F. et al. Universal protein misfolding intermediates can bypass the proteostasis network and remain soluble and less functional. Nature Communications 13, 3081 (2022). https://doi.org/10.1038/s41467-022-30548-5
  2. Jiang, Y., Neti, S.S., Sitarik, I. et al. How synonymous mutations alter enzyme structure and function over long timescales. Nature Chemistry 15, 308–318 (2023). https://doi.org/10.1038/s41557-022-01091-z
  3. Quyen V. Vu, Daniel A. Nissley, Jiang, Y. et al. Is Posttranslational Folding More Efficient Than Refolding from a Denatured State: A Computational Study. The Journal of Physical Chemistry B 127, 4761-4774 (2023). https://doi.org/10.1021/acs.jpcb.3c01694
  4. Jiang, Y., Deane, C.M., Morris, G.M., O’Brien, E.P. It is theoretically possible to avoid misfolding into non-covalent lasso entanglements using small molecule drugs. PLoS Computational Biology 20, e1011901 (2024). https://doi.org/10.1371/journal.pcbi.1011901
  5. Jiang, Y., Xia, Y., Sitarik, I., Sharma, P., Song, H., Fried, S. D., & O’Brien, E. P. Protein misfolding involving entanglements provides a structural explanation for the origin of stretched-exponential refolding kinetics. Science Advances, 11(11), eads7379. (2025). https://doi.org/10.1126/sciadv.ads7379

If you use these scripts in your study, please cite our work:

  1. Jiang, Y., Neti, S.S., Sitarik, I. et al. How synonymous mutations alter enzyme structure and function over long timescales. Nature Chemistry 15, 308–318 (2023). https://doi.org/10.1038/s41557-022-01091-z

Table of Contents

1. Create CG protein models and tune the force field parameters (nscal) for a given protein

  • The CG protein model used in the simulations of protein folding is a Gō-like Cα model. The force field can be found here. Another CG model (Cα-sidechain model) is used to do backmapping.

1.1. Tune nscal for a small single-domain protein that has experimental folding stability reported

  • For a small single-domain protein that has experimental folding stability reported, we use an enhanced sampling protocol i.e. replica exchange simulations to extensively sample the protein folding and unfolding and identify the nscal value to reproduce the experimental protein folding stability.
  • Scripts to be used in this section:
Scripts Instructions
CG_protein_parameterization/create_cg_protein_model.py Create the CG model .psf .top .cor and .prm file that can be used for MD simulations. Need to get the Stride software installed prior to use. (Learn more)
CG_protein_parameterization/parse_cg_prm.py Parse the parameters in .prm file and then create a .xml file for OpenMM use. (Learn more)
CG_protein_parameterization/parallel_temperature_REX.py Run parallel temperature replica exchange molecular dynamics (pt-REMD) simulation. This simulation is parallelized using multiple CPU processors. (Learn more)
CG_protein_parameterization/opt_temp.pl Optimize the temperature windows for pt-REMD simulation to ensure the good sampling quality around the melting temperature of the given protein. (Learn more)
CG_protein_parameterization/check_sampling.pl Check the sampling quality for pt-REMD simulation. Insufficient sampling will cause problems and inaccuracy in estimating the protein folding stability. (Learn more)
CG_protein_parameterization/dGns.pl Convert folding propability PN to folding stability ΔGUN at a given temperature T:
equation.
CG_protein_parameterization/analysis_folding_stability.pl Estimate the protein folding stability at a given temperature from pt-REMD data using WHAM. (Learn more)
CG_protein_parameterization/run_REX_LD.py A simple script to run a Langevin dynamics (LD) simulation. (Learn more)
CG_protein_parameterization/scan_nscal_REX.pl An automated script to call create_cg_protein_model.py, opt_temp.pl, parallel_temperature_REX.py and analysis_folding_stability.pl to scan the protein folding stability profile as changing nscal value. The protein folding stability profile will be used to find the optimized nscal value for CG model parameterization. (Learn more)
  • To tune nscal for a given protein on PSU ACI cluster, run scan_nscal_REX.pl with a series of nscal values. Use scan_nscal_REX.pl again to analyze the results and get the protein stability profile. The optimized nscal is the value that reproduces the experimental protein folding stability. ↩️

1.2. Tune nscal for a protein without experimental folding stability reported

  • For a protein without experimental folding stability reported, we use a stepwise optimization strategy to tune nscal according to 5 levels of nscal. The optimized nscal of an entire single-domain protein or one domain/interface of a multi-domain protein is identified as the lowest level that maintains the native structure.
  • Scripts to be used in this section:
Scripts Instructions
CG_protein_parameterization/opt_nscal.pl An automated script to find the optimized nscal for each domain/interface according to 5 levels of nscal values trained by running the protocol in Section 1.1 for 18 small single-domain proteins. The MD simulator is OpenMM, which is different from that in script opt_nscal_charmm.pl.pl. (Learn more)
CG_protein_parameterization/opt_nscal_charmm.pl.pl This script has the same function with opt_nscal.pl but the levels of nscal values used in this script was trained by using Charmm for the same protein set.
CG_protein_parameterization/opt_nscal.py An updated version of opt_nscal.pl, which is implemented with Python and can be run on both CPU and GPU platforms. (Learn more)
  • On PSU ACI cluster, to tune nscal that is compatible with OpenMM, run opt_nscal.pl; To tune nscal that is compatible with Charmm, run opt_nscal_charmm.pl. ↩️

2. Temperature quenching simulation

  • To estimate the folding rates for a given protein, you need to run temperature quenching simulation where the system is first heated to a very high temperature (usually 800 K) quickly to make protein totally unfolded and then cooled down to the physiological temperature to moniter the refolding of the protein.
  • Scripts to be used in this section:
Scripts Instructions
CG_protein_parameterization/temperature_quenching.py Run temperature quenching simulation from the CG native structure of a given protein. This simulation is parallelized using CPUs or GPUs. (Learn more)
CG_protein_parameterization/T_quench.pl An automated script to build CG model from a pdb file and then run temperature quenching simulations. (Learn more)
CG_protein_parameterization/analysis_Tq.pl Analyze the results of temperature quenching simulations, fit a single- or double- exponential function to the survival probability of the unfolded protein and then estimate the folding rate. (Learn more)
  • To estimate the protein folding rate on PSU ACI cluster, you need to run T_quench.pl with optimized nscal values obtained from Section 1 and then run analysis_Tq.pl to do the curve fitting. ↩️

3. Create CG ribosome model

  • The CG ribosome model is used to run continuous synthesis of a nascent chain that is parameterized with the CG protein model. The CG ribosome is usually fixed (not allow to move) during the simulation. The force field parameters thus only contain the nonbonding term. To speedup the computation, we usually truncate the ribosome to only contain the tails of P- and A-site tRNA molecules, the entire exit tunnel with a few atoms near the tunnel wall and the surface near the exit that may have contacts with the nascent chain.
  • Scripts to be used in this section:
Scripts Instructions
CG_ribosome_parameterization/gen_50S_pdb.py Get he necessary subunits from .cif file of the ribosome and create a pdb file contains those sbuunits. (Learn more)
CG_ribosome_parameterization/fix_orein_50S_pdb.py Add missing atoms and rotate/translate the subunits to a desired orientation for E. coli ribosome. (Learn more)
CG_ribosome_parameterization/fix_orein_60S_pdb.py Do the same thing with fix_orein_50S_pdb.py for S. cerevisiae ribosome. (Learn more)
CG_ribosome_parameterization/create_cg_ribosome_model.py Create the CG model for those re-orientated subunits, including .psf, .top and .cor files. (Learn more)
CG_ribosome_parameterization/truncate_ribosome.py Truncate the CG ribosome according to the centroid line of the exit tunnel. (Learn more)
CG_ribosome_parameterization/gen_ribosome_FF.py Train collision diameters of CG ribosome beads from given ribosome structures. (Learn more)
  • To create CG ribosome model, you need to download the .cif file of your ribosome from RCSB PDB. Use gen_50S_pdb.py to generate a pdb file that contains the assembly of subunits that you want to model. Note that the 50S in the script name doesn't restrict the scope of usage. You can use it to generate any assembly of subunits you want, no matter what the organism the ribosome belongs to. Then use fix_orein_50S_pdb.py for E. coli ribosome and fix_orein_60S_pdb.py for S. cerevisiae ribosome to get the re-orientated subunits, followed by using create_cg_ribosome_model.py to build the CG model for the assembly of subunits. Finally, use truncate_ribosome.py to generate the truncated CG ribosome .cor file. If you are interested in tunning force field parameters for CG ribosome model, use gen_ribosome_FF.py to train your parameters using some high-resolution crystal structures or Cryo-EM structures. ↩️

4. Simulation of co-translational folding

  • The simulation protocol for co-translational folding is called "Continuous Synthesis Protocol" (CSP). We add new amino acid on the A-site tRNA in the truncated CG ribosome and simulate the tRNA translocation and nascent chain elongation at a frequency based on in vivo codon translation time.
  • Scripts to be used in this section:
Scripts Instructions
Continuous_synthesis_protocol/continuous_synthesis_v6.py Run continuous synthesis of a CG protein on a CG ribosome. Both parallelizations on CPU and GPU are supported. (Learn more)
Scripts needed: Continuous_synthesis_protocol/ribosome_traffic and CG_protein_parameterization/parse_cg_prm.py.
Continuous_synthesis_protocol/continuous_synthesis_v7.py An updated version of continuous_synthesis_v6.py. Interactions between nascent chain and small molecule is enabled. (Learn more)
Continuous_synthesis_protocol/ribosome_traffic Estimate the real codon translation time by taking into account of the ribosome traffic effects. (Learn more)
Continuous_synthesis_protocol/visualize_cont_synth.py Generate movies of the continuous synthesis process. (Learn more)
Scripts needed: Backmapping/backmap.py, Continuous_synthesis_protocol/render_ecoli_RNC.tcl and Continuous_synthesis_protocol/render_yeast_RNC.tcl
Continuous_synthesis_protocol/render_ecoli_RNC.tcl Render the picture of E. coli ribosome-nascent-chain (RNC) complex in VMD.
Continuous_synthesis_protocol/render_yeast_RNC.tcl Render the picture of S. cerevisiae ribosome-nascent-chain (RNC) complex in VMD.
  • To run CSP, you need to prepare the CG protein model for your nascent chain according to Section 1 and prepare the CG ribosome model according to Section 3. All the .psf, .top, .cor and .prm files are required in initialization of CSP. In addition, users have to provide a table of intrinsic codon translation time and the mRNA sequence of the nascent chain.
  • Below is a video of the entire continuous synthesis process using continuous_synthesis_v6.py for synthesizing Firfly luciferase (550 residue long) on the E.coli ribosome and visualized using visualize_cont_synth.py (time is shown in the experimental timescale):
Example.video.of.synthesizing.Firfly.luciferase.on.the.E.coli.ribosome.mp4

The large subunit of ribosome is shown in silver; The tail of tRNA is shown in orange; The nascent chain is represented as Cartoon and colored according to the secondary structures; The flexible region of ribosomal protein L24 is shown in green. Only the last frame of the trajectory for each synthesis step at each NC length is shown in this video. ↩️

5. Simulation of post-translational folding

  • The simulation of post-translational folding is quite simple. The nascent chain structures that are disassociated from the ribosome will be obtained from CSP and run Langevin dynamics (LD) in implicit water environment at the physiological temperature.
  • Scripts to be used in this section:
Scripts Instructions
Post_translational_folding/post_trans_single_run.py Run a single trajectory of post-translational folding. User can specify a walltime or a threshold to control the termination of the post-translational folding. Compatible with the old CSP code continuous_synthesis_v6.py. (Learn more)
Post_translational_folding/post_trans_single_run_v2.py Updated version of post_trans_single_run.py. Compatible with the updated CSP code continuous_synthesis_v7.py. Interactions between nascent chain and other molecules (e.g. small ligand or other protein chain) is enabled. (Learn more)
Post_translational_folding/PTP_setup.pl Automated script to setup post-translation simulations after CSP. (Learn more)
Scripts needed: Post-translational_folding/post_trans_single_run.py
  • To setup and run post-translation simulations on PSU ACI cluster, use PTP_setup.pl. ↩️

6. Backmapping from coarse-grained model to all-atom model

  • CG model has a lot of benifits on saving computational costs and improving sampling efficiency. However, it loses the atomic-level accuracy of the molecular structure. A backmapping strategy presented here can rebuild the all-atom structure from the CG Cα model with high accuracy. The rebuilt all-atom structure can be furthure used for visualization (e.g., visualize_cont_synth.py) and simulation (e.g., Activation Energy Estimation Workflow) at all-atom level.
  • Scripts to be used in this section:
Scripts Instructions
Backmapping/backmap.py Backmap the CG Cα structure to its corresponding all-atom structure. (Learn more)
Scripts needed: Backmapping/parse_cg_cacb_prm.py and CG_protein_parameterization/create_cg_protein_model.py
Backmapping/parse_cg_cacb_prm.py Parse the CG Cα-sidechain model parameters and convert them into OpenMM .xml format. (Learn more)
  • To backmap your CG Cα structure, use backmap.py. Note that you need to install PD2 and Pulchra before use this script. ↩️

7. Analysis of protein folding trajectories

  • To analyze the protein folding process, we usually calculate the order parameters, such as the fraction of native contacts (equation) and the fraction of entangment changes (equation).
  • Some examples of misfolding protein structures with entanglements can be found here.
  • Scripts to be used in this section:
Scripts Instructions
Analysis_protocol/get_Ep_from_dcd.py Get potential energy from a dcd trajectory using OpenMM for a CG model. (Learn more)
Analysis_protocol/calc_native_contact_fraction.pl Calculate equation vs. time for a given trajectory. (Learn more)
Analysis_protocol/calc_entanglement_number.pl Calculate equation vs. time for a given trajectory. (Learn more)
Analysis_protocol/calc_entanglement_number_v2.pl Updated script to calculate equation vs. time for a given trajectory. (Learn more)
Analysis_protocol/calc_chirality_number.pl Calculate equation vs. time for a given trajectory. (Learn more)
Analysis_protocol/find_chg_ent.py Find changes in entanglements using a more time-consuming protocol including Topoly. (Learn more)
Analysis_protocol/find_representative_chg_ent.py Find representative changes in entanglements for a set of simulation structures. (Learn more)
Scripts needed: Backmapping/backmap.py
Analysis_protocol/calc_cont_synth_qbb_vs_T.py Automated script to calculate equation vs. time for CSP trajectoris. (Learn more)
Scripts needed: Analysis_protocol/calc_native_contact_fraction.pl
Analysis_protocol/calc_cont_synth_G_vs_T.py Automated script to calculate equation vs. time for CSP trajectoris. (Learn more)
Scripts needed: Analysis_protocol/calc_entanglement_number.pl
Analysis_protocol/calc_cont_synth_G_vs_T_v2.py Updated script to calculate equation vs. time for CSP trajectoris. (Learn more)
Scripts needed: Analysis_protocol/calc_entanglement_number_v2.pl
Analysis_protocol/mrna_silent_mutation.pl Do silent mutation for a given mRNA sequence and a mutation scheme, such as fastest translation, slowest translation and random. (Learn more)
Analysis_protocol/get_co_trans_order_parameters.py Collect the pre-calculated order parameters equation and equation and output .npy data files for the co-translation trajectories. (Learn more)
Analysis_protocol/build_co_trans_kinetic_model.py Assign the metastable states on the co-translational trajectories and generate the visualization of the representative structures. (Learn more)
Scripts needed: Backmapping/backmap.py
Analysis_protocol/co_trans_JS_divergence.py Estimate the Jensen-Shannon divergence (JSD) for the co-translational mirocstates or metastable states in the function of nascent chain length. (Learn more)
Analysis_protocol/get_post_trans_order_parameters.py Collect the pre-calculated order parameters equation and equation and output .npy data files for the post-translation trajectories. (Learn more)
Scripts needed: Analysis_protocol/calc_native_contact_fraction.pl and Analysis_protocol/calc_entanglement_number.pl
Analysis_protocol/build_post_trans_kinetic_model.py Assign the metastable states on the post-translational trajectories and generate the visualization of the representative structures. (Learn more)
Scripts needed: Backmapping/backmap.py
Analysis_protocol/post_trans_JS_divergence.py Estimate the Jensen-Shannon divergence (JSD) for the post-translational mirocstates or metastable states in the function of time. (Learn more)
Analysis_protocol/get_co_post_folding_pathways.py Combine the discrete metastable states trajectories and analyze the folding pathways. (Learn more)
Analysis_protocol/get_solubility.py Estimate the aggregation, degradation and Hsp70 binding propensities for the post-translational metastable states and estimate the percent of soluble protein in each metastable state. (Learn more)
Scripts needed: Backmapping/backmap.py
Analysis_protocol/get_state_probability.py Get the metastable state probabilities vs. post-translational time and run the bootstrapping resampling for error estimation. (Learn more)
Analysis_protocol/calc_specific_activity.py Estimates the specific activities, as well as the errors and p-value, by using the equation obtained from Activation Energy Estimation Workflow and the state probabilities obtained from get_state_probability.py. (Learn more)

↩️

8. General workflow

graph TD
    A[Protein] -->|opt_nscal.pl or scan_nscal_REX.pl| B(CG protein model)
    B --> |continuous_synthesis_v6.py| C(Protein synthesis trajectories)
    C ---> |post_trans_single_run.py| D(Post-translational folding trajectories)
    C --> |calc_cont_synth_qbb_vs_T.py| E(Co-translation Q vs. time)
    C --> |calc_cont_synth_G_vs_T.py| F(Co-translation G vs. time)
    D --> |calc_native_contact_fraction.pl| G(post-translation Q vs. time)
    D --> |calc_entanglement_number.pl| H(post-translation G vs. time)
    E --> |get_co_trans_order_parameters.py| I(Co-translation order parameters)
    F --> |get_co_trans_order_parameters.py| I
    I --> |build_co_trans_kinetic_model.py| J(Co-translation metastable states)
    J --> |co_trans_JS_divergence.py| K>Co-translation JSD]
    G --> |get_post_trans_order_parameters.py| L(Post-translation order parameters)
    H --> |get_post_trans_order_parameters.py| L
    L --> |build_post_trans_kinetic_model.py| M(Post-translation metastable states)
    J & M --> |get_co_post_folding_pathways.py| O>Folding pathways]
    M ---> |post_trans_JS_divergence.py| N>Post-translation JSD]
    M ----> |get_state_probability.py| P(State probability vs. post-translation time)
    M ----> |get_solubility.py| Q(Percent soluble protein in each state)
    M ----> |backmap & QM/MM simulations| R(Enzymatic reaction barrier for each state)
    P & Q & R --> |calc_specific_activity.py| S>Specific activity]

Loading

↩️

About

A toolkit for building CG model of protein/ribosome and MD simulation protocols for protein co- and post-translational folding

Topics

Resources

License

Stars

Watchers

Forks

Packages