mmCIF files lack "_atom_site.occupancy" field #331

@JustATestHAHA

Hi!

I have used ESMFold2 to predict the structures of a few thousand proteins. I used Biopython's MMCIFParser (version 1.86) to convert the CIF files into structure objects for downstream analysis, but parsing failed with this error:

Traceback (most recent call last):
  File "/gpfs/work5/0/prjs1460/tools/MyTools/structure.py", line 900, in <module>
    s = Structure('/home/asanchez/chonky/tools/0.esmfold2.cif')
  File "/gpfs/work5/0/prjs1460/tools/MyTools/structure.py", line 47, in __init__
    self.structure = parser.get_structure(path.stem, path)
  File "/home/asanchez/chonky/miniconda3/envs/ml4mikc/lib/python3.10/site-packages/Bio/PDB/MMCIFParser.py", line 390, in get_structure
    self._build_structure(structure_id, handle)
  File "/home/asanchez/chonky/miniconda3/envs/ml4mikc/lib/python3.10/site-packages/Bio/PDB/MMCIFParser.py", line 452, in _build_structure
    occupancy_list = mmcif_dict["_atom_site.occupancy"]
KeyError: '_atom_site.occupancy'
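
The traceback comes down to one missing key in the parsed dictionary. A quick pre-check along these lines could flag affected files before parsing. This is only a sketch: `missing_fields` and `load_mmcif_dict` are hypothetical helper names, and the default list of required keys holds only the field from the traceback (other keys can be passed in).

```python
from collections.abc import Iterable, Mapping


def missing_fields(
    mmcif_dict: Mapping[str, object],
    required: Iterable[str] = ("_atom_site.occupancy",),
) -> list[str]:
    """Return the required mmCIF keys that are absent from mmcif_dict."""
    return [key for key in required if key not in mmcif_dict]


def load_mmcif_dict(cif_path: str):
    """Read a CIF file into a dict-like object (requires Biopython)."""
    # Imported lazily so missing_fields() stays usable without Biopython.
    from Bio.PDB.MMCIF2Dict import MMCIF2Dict

    return MMCIF2Dict(cif_path)


# Toy stand-in for a parsed file that has coordinates but no occupancy column.
example = {"_atom_site.id": ["1", "2"], "_atom_site.Cartn_x": ["0.0", "1.5"]}
print(missing_fields(example))
```

On a real file, `missing_fields(load_mmcif_dict(some_path))` would return an empty list when the occupancy column is present.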

This has an easy workaround (see below), but I thought it would be handy if the output CIF files included this field by default, or if there were an option to enable it. Additionally, I don't know whether other structure parsers raise the same error.

from pathlib import Path

from Bio.PDB.MMCIF2Dict import MMCIF2Dict


def write_fixed_cif(
    in_path: str | Path,
    output_path: str | Path
) -> None:
    '''
    Fixes a CIF file by adding a field that Biopython's MMCIFParser
    requires. Specifically, it adds the "_atom_site.occupancy" field
    if it is missing and fills it with a default value of "1.00".
    It does so by reading the CIF file into a dictionary, modifying
    that dictionary, and writing its "_atom_site" loop to a new CIF file,
    so that the structure can be parsed by Biopython without errors.

    Parameters
    ----------
    in_path : str | Path
        Path to the input CIF file.

    output_path : str | Path
        Path to the output fixed CIF file.
    '''
    # Read the CIF file into a dictionary
    mmcif_dict = MMCIF2Dict(in_path)
    n_atoms = len(mmcif_dict["_atom_site.id"])
    # Add the occupancy column if it is missing
    if "_atom_site.occupancy" not in mmcif_dict:
        mmcif_dict["_atom_site.occupancy"] = ["1.00"] * n_atoms
    # Write only the "_atom_site" loop; all other categories are dropped
    atom_keys = [k for k in mmcif_dict if k.startswith("_atom_site.")]
    with open(output_path, "w") as f:
        f.write("data_fixed\n#\nloop_\n")
        for key in atom_keys:
            f.write(key + "\n")
        for idx in range(n_atoms):
            # Values are joined as-is, so none may contain whitespace
            row = [mmcif_dict[key][idx] for key in atom_keys]
            f.write(" ".join(row) + "\n")
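
One caveat with joining row values by plain spaces: a value that contains whitespace, or that begins with a character CIF treats specially, would be read back as something else. The sketch below shows one way to quote such values before joining. `quote_cif_value` is a hypothetical helper, the set of special leading characters is a simplified reading of the CIF syntax rules, and mapping an empty value to `?` is an assumption.

```python
# Characters that make a bare CIF value ambiguous when they come first
# (simplified; reserved words such as "loop_" are not handled here).
_SPECIAL_START = ("_", "#", "$", "'", '"', "[", "]", ";")


def quote_cif_value(value: str) -> str:
    """Return value as a single whitespace-safe token for a CIF loop row."""
    if value == "":
        return "?"  # assumption: treat an empty value as "unknown"
    if "\n" in value or "\r" in value:
        raise ValueError("multi-line values need a CIF text field")
    needs_quotes = any(ch.isspace() for ch in value) or value.startswith(_SPECIAL_START)
    if not needs_quotes:
        return value
    # Prefer whichever quote character does not occur inside the value.
    if "'" not in value:
        return f"'{value}'"
    if '"' not in value:
        return f'"{value}"'
    raise ValueError(f"cannot quote {value!r} on a single line")


row = ["ATOM", "1", "N A", "O5'"]
print(" ".join(quote_cif_value(v) for v in row))
```

For typical protein `_atom_site` rows nothing needs quoting, so the output matches a plain `" ".join(row)`.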
