Author: Dr. Yang Jiang
This is a package of scripts that are used to create coarse-grained (CG) models of proteins/ribosomes, optimize CG force field parameters, run MD simulations for protein co- and post-translational folding and analyze the folding structures. This toolkit is developed based on the original simulation protocols and scripts of O'Brien Lab, in which the simulations were performed by Charmm.
An example of the translation and folding process of the E. coli D-Ala-D-Ala Ligase B (DDLB, PDBid 4c5c) that was simulated by this toolkit is shown below.
All the scripts are ready to use on a Linux machine (have been tested on RHEL7 and Centos7) when users have added the directories (including all the child folders) in $PATH (add export PATH=${PATH}:/path/to/one/child/folder/ in your ~/.bashrc file) and have granted the execution permission (chmod -R +x ./cg_simtk_protein_folding/) for all the scripts.
Learn more in the following script instruction tables to find detailed instruction of usage and basic theory used in the script.
In summary, to use all scripts, the following Python modules and software are required:
| Required Module | Required Software |
|---|---|
numpyscipymultiprocessinglxmlopenmmparmedmdtrajpdbfixerpyemmamsmtoolstopoly |
Python 3.XPerl v5+StridepywhammatlabimagemagickvmdPD2PulchraChaperISM |
The scripts in this toolkit have been applied in the studies reported in the following papers:
- Nissley, D.A., Jiang, Y., Trovato, F. et al. Universal protein misfolding intermediates can bypass the proteostasis network and remain soluble and less functional. Nature Communications 13, 3081 (2022). https://doi.org/10.1038/s41467-022-30548-5
- Jiang, Y., Neti, S.S., Sitarik, I. et al. How synonymous mutations alter enzyme structure and function over long timescales. Nature Chemistry 15, 308–318 (2023). https://doi.org/10.1038/s41557-022-01091-z
- Quyen V. Vu, Daniel A. Nissley, Jiang, Y. et al. Is Posttranslational Folding More Efficient Than Refolding from a Denatured State: A Computational Study. The Journal of Physical Chemistry B 127, 4761-4774 (2023). https://doi.org/10.1021/acs.jpcb.3c01694
- Jiang, Y., Deane, C.M., Morris, G.M., O’Brien, E.P. It is theoretically possible to avoid misfolding into non-covalent lasso entanglements using small molecule drugs. PLoS Computational Biology 20, e1011901 (2024). https://doi.org/10.1371/journal.pcbi.1011901
- Jiang, Y., Xia, Y., Sitarik, I., Sharma, P., Song, H., Fried, S. D., & O’Brien, E. P. Protein misfolding involving entanglements provides a structural explanation for the origin of stretched-exponential refolding kinetics. Science Advances, 11(11), eads7379. (2025). https://doi.org/10.1126/sciadv.ads7379
If you use these scripts in your study, please cite our work:
- Jiang, Y., Neti, S.S., Sitarik, I. et al. How synonymous mutations alter enzyme structure and function over long timescales. Nature Chemistry 15, 308–318 (2023). https://doi.org/10.1038/s41557-022-01091-z
- 1. Create CG protein models and tune the force field parameters (nscal) for a given protein
- 2. Temperature quenching simulation
- 3. Create CG ribosome model
- 4. Simulation of co-translational folding
- 5. Simulation of post-translational folding
- 6. Backmapping from coarse-grained model to all-atom model
- 7. Analysis of protein folding trajectories
- 8. General workflow
- The CG protein model used in the simulations of protein folding is a Gō-like Cα model. The force field can be found here. Another CG model (Cα-sidechain model) is used to do backmapping.
- For a small single-domain protein that has experimental folding stability reported, we use an enhanced sampling protocol i.e. replica exchange simulations to extensively sample the protein folding and unfolding and identify the nscal value to reproduce the experimental protein folding stability.
- Scripts to be used in this section:
| Scripts | Instructions |
|---|---|
| CG_protein_parameterization/create_cg_protein_model.py | Create the CG model .psf .top .cor and .prm file that can be used for MD simulations. Need to get the Stride software installed prior to use. (Learn more) |
| CG_protein_parameterization/parse_cg_prm.py | Parse the parameters in .prm file and then create a .xml file for OpenMM use. (Learn more) |
| CG_protein_parameterization/parallel_temperature_REX.py | Run parallel temperature replica exchange molecular dynamics (pt-REMD) simulation. This simulation is parallelized using multiple CPU processors. (Learn more) |
| CG_protein_parameterization/opt_temp.pl | Optimize the temperature windows for pt-REMD simulation to ensure the good sampling quality around the melting temperature of the given protein. (Learn more) |
| CG_protein_parameterization/check_sampling.pl | Check the sampling quality for pt-REMD simulation. Insufficient sampling will cause problems and inaccuracy in estimating the protein folding stability. (Learn more) |
| CG_protein_parameterization/dGns.pl | Convert folding propability PN to folding stability ΔGUN at a given temperature T: |
| CG_protein_parameterization/analysis_folding_stability.pl | Estimate the protein folding stability at a given temperature from pt-REMD data using WHAM. (Learn more) |
| CG_protein_parameterization/run_REX_LD.py | A simple script to run a Langevin dynamics (LD) simulation. (Learn more) |
| CG_protein_parameterization/scan_nscal_REX.pl | An automated script to call create_cg_protein_model.py, opt_temp.pl, parallel_temperature_REX.py and analysis_folding_stability.pl to scan the protein folding stability profile as changing nscal value. The protein folding stability profile will be used to find the optimized nscal value for CG model parameterization. (Learn more) |
- To tune nscal for a given protein on PSU ACI cluster, run
scan_nscal_REX.plwith a series of nscal values. Usescan_nscal_REX.plagain to analyze the results and get the protein stability profile. The optimized nscal is the value that reproduces the experimental protein folding stability. ↩️
- For a protein without experimental folding stability reported, we use a stepwise optimization strategy to tune nscal according to 5 levels of nscal. The optimized nscal of an entire single-domain protein or one domain/interface of a multi-domain protein is identified as the lowest level that maintains the native structure.
- Scripts to be used in this section:
| Scripts | Instructions |
|---|---|
| CG_protein_parameterization/opt_nscal.pl | An automated script to find the optimized nscal for each domain/interface according to 5 levels of nscal values trained by running the protocol in Section 1.1 for 18 small single-domain proteins. The MD simulator is OpenMM, which is different from that in script opt_nscal_charmm.pl.pl. (Learn more) |
| CG_protein_parameterization/opt_nscal_charmm.pl.pl | This script has the same function with opt_nscal.pl but the levels of nscal values used in this script was trained by using Charmm for the same protein set. |
| CG_protein_parameterization/opt_nscal.py | An updated version of opt_nscal.pl, which is implemented with Python and can be run on both CPU and GPU platforms. (Learn more) |
- On PSU ACI cluster, to tune nscal that is compatible with OpenMM, run
opt_nscal.pl; To tune nscal that is compatible with Charmm, runopt_nscal_charmm.pl. ↩️
- To estimate the folding rates for a given protein, you need to run temperature quenching simulation where the system is first heated to a very high temperature (usually 800 K) quickly to make protein totally unfolded and then cooled down to the physiological temperature to moniter the refolding of the protein.
- Scripts to be used in this section:
| Scripts | Instructions |
|---|---|
| CG_protein_parameterization/temperature_quenching.py | Run temperature quenching simulation from the CG native structure of a given protein. This simulation is parallelized using CPUs or GPUs. (Learn more) |
| CG_protein_parameterization/T_quench.pl | An automated script to build CG model from a pdb file and then run temperature quenching simulations. (Learn more) |
| CG_protein_parameterization/analysis_Tq.pl | Analyze the results of temperature quenching simulations, fit a single- or double- exponential function to the survival probability of the unfolded protein and then estimate the folding rate. (Learn more) |
- To estimate the protein folding rate on PSU ACI cluster, you need to run
T_quench.plwith optimized nscal values obtained from Section 1 and then runanalysis_Tq.plto do the curve fitting. ↩️
- The CG ribosome model is used to run continuous synthesis of a nascent chain that is parameterized with the CG protein model. The CG ribosome is usually fixed (not allow to move) during the simulation. The force field parameters thus only contain the nonbonding term. To speedup the computation, we usually truncate the ribosome to only contain the tails of P- and A-site tRNA molecules, the entire exit tunnel with a few atoms near the tunnel wall and the surface near the exit that may have contacts with the nascent chain.
- Scripts to be used in this section:
| Scripts | Instructions |
|---|---|
| CG_ribosome_parameterization/gen_50S_pdb.py | Get he necessary subunits from .cif file of the ribosome and create a pdb file contains those sbuunits. (Learn more) |
| CG_ribosome_parameterization/fix_orein_50S_pdb.py | Add missing atoms and rotate/translate the subunits to a desired orientation for E. coli ribosome. (Learn more) |
| CG_ribosome_parameterization/fix_orein_60S_pdb.py | Do the same thing with fix_orein_50S_pdb.py for S. cerevisiae ribosome. (Learn more) |
| CG_ribosome_parameterization/create_cg_ribosome_model.py | Create the CG model for those re-orientated subunits, including .psf, .top and .cor files. (Learn more) |
| CG_ribosome_parameterization/truncate_ribosome.py | Truncate the CG ribosome according to the centroid line of the exit tunnel. (Learn more) |
| CG_ribosome_parameterization/gen_ribosome_FF.py | Train collision diameters of CG ribosome beads from given ribosome structures. (Learn more) |
- To create CG ribosome model, you need to download the .cif file of your ribosome from RCSB PDB. Use
gen_50S_pdb.pyto generate a pdb file that contains the assembly of subunits that you want to model. Note that the 50S in the script name doesn't restrict the scope of usage. You can use it to generate any assembly of subunits you want, no matter what the organism the ribosome belongs to. Then usefix_orein_50S_pdb.pyfor E. coli ribosome andfix_orein_60S_pdb.pyfor S. cerevisiae ribosome to get the re-orientated subunits, followed by usingcreate_cg_ribosome_model.pyto build the CG model for the assembly of subunits. Finally, usetruncate_ribosome.pyto generate the truncated CG ribosome .cor file. If you are interested in tunning force field parameters for CG ribosome model, usegen_ribosome_FF.pyto train your parameters using some high-resolution crystal structures or Cryo-EM structures. ↩️
- The simulation protocol for co-translational folding is called "Continuous Synthesis Protocol" (CSP). We add new amino acid on the A-site tRNA in the truncated CG ribosome and simulate the tRNA translocation and nascent chain elongation at a frequency based on in vivo codon translation time.
- Scripts to be used in this section:
| Scripts | Instructions |
|---|---|
| Continuous_synthesis_protocol/continuous_synthesis_v6.py | Run continuous synthesis of a CG protein on a CG ribosome. Both parallelizations on CPU and GPU are supported. (Learn more) Scripts needed: Continuous_synthesis_protocol/ribosome_traffic and CG_protein_parameterization/parse_cg_prm.py. |
| Continuous_synthesis_protocol/continuous_synthesis_v7.py | An updated version of continuous_synthesis_v6.py. Interactions between nascent chain and small molecule is enabled. (Learn more) |
| Continuous_synthesis_protocol/ribosome_traffic | Estimate the real codon translation time by taking into account of the ribosome traffic effects. (Learn more) |
| Continuous_synthesis_protocol/visualize_cont_synth.py | Generate movies of the continuous synthesis process. (Learn more) Scripts needed: Backmapping/backmap.py, Continuous_synthesis_protocol/render_ecoli_RNC.tcl and Continuous_synthesis_protocol/render_yeast_RNC.tcl |
| Continuous_synthesis_protocol/render_ecoli_RNC.tcl | Render the picture of E. coli ribosome-nascent-chain (RNC) complex in VMD. |
| Continuous_synthesis_protocol/render_yeast_RNC.tcl | Render the picture of S. cerevisiae ribosome-nascent-chain (RNC) complex in VMD. |
- To run CSP, you need to prepare the CG protein model for your nascent chain according to Section 1 and prepare the CG ribosome model according to Section 3. All the .psf, .top, .cor and .prm files are required in initialization of CSP. In addition, users have to provide a table of intrinsic codon translation time and the mRNA sequence of the nascent chain.
- Below is a video of the entire continuous synthesis process using
continuous_synthesis_v6.pyfor synthesizing Firfly luciferase (550 residue long) on the E.coli ribosome and visualized usingvisualize_cont_synth.py(time is shown in the experimental timescale):
Example.video.of.synthesizing.Firfly.luciferase.on.the.E.coli.ribosome.mp4
The large subunit of ribosome is shown in silver; The tail of tRNA is shown in orange; The nascent chain is represented as Cartoon and colored according to the secondary structures; The flexible region of ribosomal protein L24 is shown in green. Only the last frame of the trajectory for each synthesis step at each NC length is shown in this video. ↩️
- The simulation of post-translational folding is quite simple. The nascent chain structures that are disassociated from the ribosome will be obtained from CSP and run Langevin dynamics (LD) in implicit water environment at the physiological temperature.
- Scripts to be used in this section:
| Scripts | Instructions |
|---|---|
| Post_translational_folding/post_trans_single_run.py | Run a single trajectory of post-translational folding. User can specify a walltime or a threshold to control the termination of the post-translational folding. Compatible with the old CSP code continuous_synthesis_v6.py. (Learn more) |
| Post_translational_folding/post_trans_single_run_v2.py | Updated version of post_trans_single_run.py. Compatible with the updated CSP code continuous_synthesis_v7.py. Interactions between nascent chain and other molecules (e.g. small ligand or other protein chain) is enabled. (Learn more) |
| Post_translational_folding/PTP_setup.pl | Automated script to setup post-translation simulations after CSP. (Learn more) Scripts needed: Post-translational_folding/post_trans_single_run.py |
- To setup and run post-translation simulations on PSU ACI cluster, use
PTP_setup.pl. ↩️
- CG model has a lot of benifits on saving computational costs and improving sampling efficiency. However, it loses the atomic-level accuracy of the molecular structure. A backmapping strategy presented here can rebuild the all-atom structure from the CG Cα model with high accuracy. The rebuilt all-atom structure can be furthure used for visualization (e.g.,
visualize_cont_synth.py) and simulation (e.g., Activation Energy Estimation Workflow) at all-atom level. - Scripts to be used in this section:
| Scripts | Instructions |
|---|---|
| Backmapping/backmap.py | Backmap the CG Cα structure to its corresponding all-atom structure. (Learn more) Scripts needed: Backmapping/parse_cg_cacb_prm.py and CG_protein_parameterization/create_cg_protein_model.py |
| Backmapping/parse_cg_cacb_prm.py | Parse the CG Cα-sidechain model parameters and convert them into OpenMM .xml format. (Learn more) |
- To backmap your CG Cα structure, use
backmap.py. Note that you need to install PD2 and Pulchra before use this script. ↩️
- To analyze the protein folding process, we usually calculate the order parameters, such as the fraction of native contacts (
) and the fraction of entangment changes (
).
- Some examples of misfolding protein structures with entanglements can be found here.
- Scripts to be used in this section:
| Scripts | Instructions |
|---|---|
| Analysis_protocol/get_Ep_from_dcd.py | Get potential energy from a dcd trajectory using OpenMM for a CG model. (Learn more) |
| Analysis_protocol/calc_native_contact_fraction.pl | Calculate |
| Analysis_protocol/calc_entanglement_number.pl | Calculate |
| Analysis_protocol/calc_entanglement_number_v2.pl | Updated script to calculate |
| Analysis_protocol/calc_chirality_number.pl | Calculate |
| Analysis_protocol/find_chg_ent.py | Find changes in entanglements using a more time-consuming protocol including Topoly. (Learn more) |
| Analysis_protocol/find_representative_chg_ent.py | Find representative changes in entanglements for a set of simulation structures. (Learn more) Scripts needed: Backmapping/backmap.py |
| Analysis_protocol/calc_cont_synth_qbb_vs_T.py | Automated script to calculate Scripts needed: Analysis_protocol/calc_native_contact_fraction.pl |
| Analysis_protocol/calc_cont_synth_G_vs_T.py | Automated script to calculate Scripts needed: Analysis_protocol/calc_entanglement_number.pl |
| Analysis_protocol/calc_cont_synth_G_vs_T_v2.py | Updated script to calculate Scripts needed: Analysis_protocol/calc_entanglement_number_v2.pl |
| Analysis_protocol/mrna_silent_mutation.pl | Do silent mutation for a given mRNA sequence and a mutation scheme, such as fastest translation, slowest translation and random. (Learn more) |
| Analysis_protocol/get_co_trans_order_parameters.py | Collect the pre-calculated order parameters |
| Analysis_protocol/build_co_trans_kinetic_model.py | Assign the metastable states on the co-translational trajectories and generate the visualization of the representative structures. (Learn more) Scripts needed: Backmapping/backmap.py |
| Analysis_protocol/co_trans_JS_divergence.py | Estimate the Jensen-Shannon divergence (JSD) for the co-translational mirocstates or metastable states in the function of nascent chain length. (Learn more) |
| Analysis_protocol/get_post_trans_order_parameters.py | Collect the pre-calculated order parameters Scripts needed: Analysis_protocol/calc_native_contact_fraction.pl and Analysis_protocol/calc_entanglement_number.pl |
| Analysis_protocol/build_post_trans_kinetic_model.py | Assign the metastable states on the post-translational trajectories and generate the visualization of the representative structures. (Learn more) Scripts needed: Backmapping/backmap.py |
| Analysis_protocol/post_trans_JS_divergence.py | Estimate the Jensen-Shannon divergence (JSD) for the post-translational mirocstates or metastable states in the function of time. (Learn more) |
| Analysis_protocol/get_co_post_folding_pathways.py | Combine the discrete metastable states trajectories and analyze the folding pathways. (Learn more) |
| Analysis_protocol/get_solubility.py | Estimate the aggregation, degradation and Hsp70 binding propensities for the post-translational metastable states and estimate the percent of soluble protein in each metastable state. (Learn more) Scripts needed: Backmapping/backmap.py |
| Analysis_protocol/get_state_probability.py | Get the metastable state probabilities vs. post-translational time and run the bootstrapping resampling for error estimation. (Learn more) |
| Analysis_protocol/calc_specific_activity.py | Estimates the specific activities, as well as the errors and p-value, by using the get_state_probability.py. (Learn more) |
graph TD
A[Protein] -->|opt_nscal.pl or scan_nscal_REX.pl| B(CG protein model)
B --> |continuous_synthesis_v6.py| C(Protein synthesis trajectories)
C ---> |post_trans_single_run.py| D(Post-translational folding trajectories)
C --> |calc_cont_synth_qbb_vs_T.py| E(Co-translation Q vs. time)
C --> |calc_cont_synth_G_vs_T.py| F(Co-translation G vs. time)
D --> |calc_native_contact_fraction.pl| G(post-translation Q vs. time)
D --> |calc_entanglement_number.pl| H(post-translation G vs. time)
E --> |get_co_trans_order_parameters.py| I(Co-translation order parameters)
F --> |get_co_trans_order_parameters.py| I
I --> |build_co_trans_kinetic_model.py| J(Co-translation metastable states)
J --> |co_trans_JS_divergence.py| K>Co-translation JSD]
G --> |get_post_trans_order_parameters.py| L(Post-translation order parameters)
H --> |get_post_trans_order_parameters.py| L
L --> |build_post_trans_kinetic_model.py| M(Post-translation metastable states)
J & M --> |get_co_post_folding_pathways.py| O>Folding pathways]
M ---> |post_trans_JS_divergence.py| N>Post-translation JSD]
M ----> |get_state_probability.py| P(State probability vs. post-translation time)
M ----> |get_solubility.py| Q(Percent soluble protein in each state)
M ----> |backmap & QM/MM simulations| R(Enzymatic reaction barrier for each state)
P & Q & R --> |calc_specific_activity.py| S>Specific activity]
