Preprocessing of raw .pdb protein files for Uni-Mol pocket encoder #370

@alexander-telepov

Hello! Thank you for your amazing work on molecular representation learning!

I am interested in computing pocket representations with Uni-Mol for some experimental structures from the PDB.
As I understand from the paper (Appendix A), raw PDB data is first preprocessed: missing heavy atoms, hydrogen atoms, and water molecules are added.

I looked through the repository, in particular the example for computing pocket representations, but could not find where this preprocessing is performed.
As far as I understand, it has to be applied to the raw structures before they are passed to the pocket encoder.

Could you please share the scripts that were used to preprocess raw protein data for the pocket encoder pretraining?

I also have a few other related questions about preprocessing:

  1. In Appendix C, it is stated that hydrogen atoms were removed from the pocket input structures during pretraining. However, in the pretraining example, the remove-hydrogen flag is not used. It also seems that the pocket pretraining dataset transformations retain hydrogens in the structure. Could you clarify this discrepancy?
  2. Do you remove heterogens (ions, cofactors) during raw data preprocessing?
  3. Were hydrogen and water positions minimized with some force field or added using templates?
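
To make questions 1 and 2 concrete, here is roughly the kind of filtering I have in mind. It is only a minimal sketch in plain Python and is not taken from the Uni-Mol code: the function names `is_hydrogen` and `filter_pdb_lines`, the flags, and the list of water residue names are my own assumptions. It relies only on the fixed-column layout of ATOM/HETATM records in the PDB format.

```python
# Hypothetical helper, NOT part of Uni-Mol: filters ATOM/HETATM records
# from the text lines of a .pdb file using the fixed PDB column layout.

WATER_NAMES = {"HOH", "WAT", "H2O", "DOD"}  # assumed set of water residue names


def is_hydrogen(line: str) -> bool:
    """Heuristic hydrogen check for a single ATOM/HETATM record."""
    element = line[76:78].strip().upper()  # element symbol, columns 77-78
    if element:
        return element in ("H", "D")
    # Fallback when the element column is empty: use the atom name
    # (columns 13-16). This can misclassify heteroatoms such as HG (mercury).
    name = line[12:16].strip().lstrip("0123456789")
    return name.startswith("H")


def filter_pdb_lines(lines, drop_hydrogens=True, drop_waters=True,
                     drop_heterogens=False):
    """Return the lines that survive the selected filters.

    Records other than ATOM/HETATM are passed through unchanged.
    """
    kept = []
    for line in lines:
        record = line[:6]
        if record not in ("ATOM  ", "HETATM"):
            kept.append(line)
            continue
        resname = line[17:20].strip()  # residue name, columns 18-20
        is_water = resname in WATER_NAMES
        if drop_waters and is_water:
            continue
        # Heterogens here means non-water HETATM records (ions, cofactors).
        if drop_heterogens and record == "HETATM" and not is_water:
            continue
        if drop_hydrogens and is_hydrogen(line):
            continue
        kept.append(line)
    return kept
```

For example, `filter_pdb_lines(open("pocket.pdb").read().splitlines())` (with a placeholder file name) would drop hydrogens and waters but keep ions and cofactors. Whether this matches what was actually done for pretraining is exactly what I would like to confirm.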

Labels: documentation (Improvements or additions to documentation)