A crucial part of automating metadata curation from NCBI is efficient XML parsing that requires no prior knowledge of the fields present and is unaffected by differences in XML structure within and between records.
At present, version 0.1.0 flattens only the top level and returns everything beneath it as an XML structure within a column.
In the future, we will use recursion to flatten every layer progressively, writing each flattened field to its own distinct column at the time of initial processing.
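To make the current behavior concrete, here is a minimal sketch of top-level-only flattening. It is not the actual 0.1.0 code: the helper name `flatten_top_level` and the sample record are assumptions made for illustration, and it uses only the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def flatten_top_level(xml_str):
    """Hypothetical helper: one column per direct child of the root element."""
    root = ET.fromstring(xml_str)
    row = {}
    for child in root:
        if len(child) == 0:
            # Leaf child: its text becomes the cell value
            row[child.tag] = (child.text or "").strip()
        else:
            # Nested child: the whole subtree stays serialized as XML
            row[child.tag] = ET.tostring(child, encoding="unicode").strip()
    return row


# Made-up record, for illustration only
record = (
    "<Record><Id>42</Id>"
    "<Attributes><Attribute name='host'>human</Attribute></Attributes>"
    "</Record>"
)
row = flatten_top_level(record)
print(row["Id"])
print(row["Attributes"])
```

In this sketch `Id` becomes an ordinary cell, while `Attributes` is still an XML string that needs a second parsing pass.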
While implementation details could vary, the following general conceptual approach is likely to work well here:
from collections import Counter

import pandas as pd
from lxml import etree


def flatten_dict(d, parent_key='', sep='_'):
    """Collapse any nested dicts into a single level, joining keys with `sep`."""
    items = []
    for k, v in d.items():
        new_key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            items.extend(flatten_dict(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)


def parse_xml(xml_str):
    """Parse one XML string into a flat dict keyed by element path."""

    def recurse_tree(element, parent_path=''):
        result = {}
        # Process attributes
        for name, value in element.attrib.items():
            attribute_path = f"{parent_path}_{name}" if parent_path else name
            result[attribute_path] = value
        # Process text if element contains text
        if element.text and element.text.strip():
            text_path = f"{parent_path}_text" if parent_path else "text"
            result[text_path] = element.text.strip()
        # Count each child tag so that repeated siblings get an index suffix
        # instead of overwriting one another
        totals = Counter(c.tag for c in element if isinstance(c.tag, str))
        seen = Counter()
        # Recursively process children
        for child in element:
            if not isinstance(child.tag, str):
                continue  # skip comments and processing instructions
            seen[child.tag] += 1
            tag = child.tag if totals[child.tag] == 1 else f"{child.tag}_{seen[child.tag]}"
            child_path = f"{parent_path}_{tag}" if parent_path else tag
            result.update(recurse_tree(child, child_path))
        return result

    # Parse from bytes: lxml rejects str input that carries an encoding declaration
    root = etree.fromstring(xml_str.encode("utf-8"))
    parsed_data = recurse_tree(root)
    # recurse_tree already returns a flat dict; flatten_dict is kept as a safeguard
    return flatten_dict(parsed_data)


# Assuming you have a DataFrame `df` with an 'xml_column'
df = pd.DataFrame({
    'xml_column': [
        '''<YourXML>...</YourXML>''',  # Replace with actual XML strings
    ]
})

# Apply the parsing function to each row and expand into a new DataFrame
expanded_data = df['xml_column'].apply(parse_xml)
expanded_df = pd.DataFrame(expanded_data.tolist())
print(expanded_df.head())
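The requirement that parsing should not depend on records sharing a structure can be illustrated with a small standalone sketch. It assumes two made-up records and a hypothetical helper named `flatten`, and uses only the standard library so it runs without lxml or pandas. Each record yields its own set of keys, and the table takes the union of those keys, leaving absent fields empty.

```python
import xml.etree.ElementTree as ET


def flatten(element, path=""):
    """Hypothetical helper: map each attribute and text node to a path-style key."""
    out = {}
    for name, value in element.attrib.items():
        out[f"{path}_{name}" if path else name] = value
    if element.text and element.text.strip():
        out[f"{path}_text" if path else "text"] = element.text.strip()
    for child in element:
        child_path = f"{path}_{child.tag}" if path else child.tag
        out.update(flatten(child, child_path))
    return out


# Two made-up records with different structures
records = [
    "<Sample><Organism>Homo sapiens</Organism></Sample>",
    "<Sample><Organism>Mus musculus</Organism>"
    "<Host><Age unit='weeks'>8</Age></Host></Sample>",
]
rows = [flatten(ET.fromstring(r)) for r in records]

# Union of keys in first-seen order; fields absent from a record become None
columns = list(dict.fromkeys(key for row in rows for key in row))
table = [[row.get(col) for col in columns] for row in rows]

print(columns)
for line in table:
    print(line)
```

Building a pandas DataFrame from a list of dicts behaves the same way: the columns are the union of all keys, and missing values are filled with NaN, so no schema has to be declared up front.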