Hi!
I used ESMFold2 to predict the structures of a few thousand proteins. To convert the resulting CIF files into structure objects for downstream analysis, I used Biopython's MMCIFParser (version 1.86), but it raised an error:
Traceback (most recent call last):
  File "/gpfs/work5/0/prjs1460/tools/MyTools/structure.py", line 900, in <module>
    s = Structure('/home/asanchez/chonky/tools/0.esmfold2.cif')
  File "/gpfs/work5/0/prjs1460/tools/MyTools/structure.py", line 47, in __init__
    self.structure = parser.get_structure(path.stem, path)
  File "/home/asanchez/chonky/miniconda3/envs/ml4mikc/lib/python3.10/site-packages/Bio/PDB/MMCIFParser.py", line 390, in get_structure
    self._build_structure(structure_id, handle)
  File "/home/asanchez/chonky/miniconda3/envs/ml4mikc/lib/python3.10/site-packages/Bio/PDB/MMCIFParser.py", line 452, in _build_structure
    occupancy_list = mmcif_dict["_atom_site.occupancy"]
KeyError: '_atom_site.occupancy'
This is easy to work around (see the function below), but I think it would be handy if the parser handled a missing occupancy column by default, or offered an option to do so. I also don't know whether the other structure parsers raise the same error.
from pathlib import Path

from Bio.PDB.MMCIF2Dict import MMCIF2Dict


def write_fixed_cif(
    in_path: str | Path,
    output_path: str | Path,
) -> None:
    '''
    Fix a CIF file by adding fields that Biopython's MMCIFParser requires.

    Specifically, add the "_atom_site.occupancy" field if it is missing and
    fill it with a default value of 1.00 for every atom. The file is read
    into a dictionary with MMCIF2Dict, the dictionary is modified, and the
    "_atom_site" loop is written to a new CIF file that MMCIFParser can
    parse. Note that only the "_atom_site" category is written back; all
    other categories are dropped.

    Parameters
    ----------
    in_path : str | Path
        Path to the input CIF file.
    output_path : str | Path
        Path to the output fixed CIF file.
    '''
    # Read the CIF file into a dictionary of field name -> list of values
    mmcif_dict = MMCIF2Dict(in_path)
    n_atoms = len(mmcif_dict["_atom_site.id"])

    # Add the missing occupancy column with a default of 1.00
    if "_atom_site.occupancy" not in mmcif_dict:
        mmcif_dict["_atom_site.occupancy"] = ["1.00"] * n_atoms

    # Write the _atom_site loop, including the new column, to a new CIF file
    atom_keys = [k for k in mmcif_dict if k.startswith("_atom_site.")]
    with open(output_path, "w") as f:
        f.write("data_fixed\n#\nloop_\n")
        for key in atom_keys:
            f.write(key + "\n")
        for idx in range(n_atoms):
            row = [mmcif_dict[key][idx] for key in atom_keys]
            f.write(" ".join(row) + "\n")
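
With a few thousand files, it may be worth checking which ones actually lack the occupancy column before rewriting any of them. The following is a minimal sketch that uses only the standard library. The function names `lacks_occupancy` and `find_files_to_fix` are hypothetical, and the check assumes the item name appears as the first token on its own line, as it does in a typical `loop_` header. It is a plain text scan, not a real CIF parser.

```python
from __future__ import annotations

from pathlib import Path


def lacks_occupancy(cif_path: str | Path) -> bool:
    """Return True if no line in the file declares _atom_site.occupancy."""
    with open(cif_path, encoding="utf-8") as handle:
        for line in handle:
            # Assumption: the item name is the first token on its line,
            # either alone (loop_ header) or followed by a single value.
            tokens = line.split()
            # CIF data names are case-insensitive, so compare in lower case.
            if tokens and tokens[0].lower() == "_atom_site.occupancy":
                return False
    return True


def find_files_to_fix(directory: str | Path) -> list[Path]:
    """List the *.cif files in `directory` that lack the occupancy column."""
    return sorted(p for p in Path(directory).glob("*.cif") if lacks_occupancy(p))
```

Each path returned by `find_files_to_fix` could then be passed to `write_fixed_cif`, leaving files that already have the column untouched.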