Hello! Thank you for your amazing work on molecular representation learning!
I am interested in computing pocket representations with Uni-Mol for some experimental structures from the PDB.
As I understand from the paper (Appendix A), raw PDB data is first preprocessed: missing heavy atoms, hydrogen atoms, and water molecules are added.
While going through the repository, specifically the example for computing pocket representations, I could not find the part where such preprocessing is performed.
As far as I understand, this needs to be done as a prerequisite.
Could you please share the scripts that were used to preprocess raw protein data for the pocket encoder pretraining?
I also have a few other related questions about preprocessing:
- In Appendix C, it is stated that hydrogen atoms were removed from the pocket input structures during pretraining. However, the pretraining example does not use the `remove-hydrogen` flag, and the pocket pretraining dataset transformations also seem to retain hydrogens in the structure. Could you clarify this discrepancy?
- Do you remove heterogens (ions, cofactors) during raw data preprocessing?
- Were the positions of the added hydrogens and waters minimized with a force field, or were they placed from templates?
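For context, here is the stand-in I am currently using while waiting for the official scripts. It is my own sketch (the function name `strip_pdb_records` and the filtering choices are mine, not from the Uni-Mol repo): it drops waters, hydrogens, and heterogens from PDB-format records by parsing the fixed-width columns of `ATOM`/`HETATM` lines. Is this roughly equivalent to what your preprocessing does, minus the addition of missing heavy atoms?

```python
# My own stand-in for the preprocessing described in the paper, NOT the
# authors' code: strip waters, hydrogens, and heterogens from PDB records.

WATER_RESNAMES = {"HOH", "WAT", "DOD"}  # common water residue names

def strip_pdb_records(pdb_lines, remove_hydrogens=True, remove_heterogens=True):
    """Filter PDB-format lines, keeping only the surviving atom records."""
    kept = []
    for line in pdb_lines:
        if not line.startswith(("ATOM", "HETATM")):
            kept.append(line)  # pass non-atom records through unchanged
            continue
        resname = line[17:20].strip()  # residue name, PDB columns 18-20
        # element symbol, PDB columns 77-78; fall back to the atom name field
        element = line[76:78].strip() or line[12:16].strip()[:1]
        if resname in WATER_RESNAMES:
            continue  # drop water molecules
        if remove_hydrogens and element in ("H", "D"):
            continue  # drop hydrogens (and deuteriums)
        if remove_heterogens and line.startswith("HETATM"):
            continue  # drop ions, cofactors, and other heterogens
        kept.append(line)
    return kept

sample = [
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  H   ALA A   1      11.639   5.700  -5.764  1.00  0.00           H",
    "HETATM    3 ZN    ZN A 101      10.000   5.000  -6.000  1.00  0.00          ZN",
    "HETATM    4  O   HOH A 201       9.000   4.000  -5.000  1.00  0.00           O",
]
print(strip_pdb_records(sample))  # keeps only record 1 (the backbone nitrogen)
```

This obviously cannot add missing heavy atoms, which is why I am asking about the original scripts.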