Description
get_leaf_nodes() in pageindex/utils.py crashes with a KeyError when called on a tree structure that was built by the standard pipeline. The function accesses structure['nodes'] directly, but leaf nodes do not have a nodes key — it is explicitly deleted by clean_node() inside list_to_tree().
Affected File
pageindex/utils.py — get_leaf_nodes function
def get_leaf_nodes(structure):
    if isinstance(structure, dict):
        if not structure['nodes']:  # ← KeyError: 'nodes' key doesn't exist on leaf nodes
            structure_node = copy.deepcopy(structure)
            structure_node.pop('nodes', None)
            return [structure_node]
Root Cause
list_to_tree() calls clean_node() which deletes the nodes key from any node with no children:
def clean_node(node):
    if not node['nodes']:
        del node['nodes']  # ← key is removed entirely, not set to []
So when get_leaf_nodes later checks structure['nodes'] on one of these nodes, the key no longer exists and a KeyError is raised.
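The failure mode can be shown in isolation, without importing pageindex. This is a minimal sketch: the `leaf` dict is a hypothetical stand-in for a node after `clean_node()` has run, and `direct_access_fails` is a helper introduced here purely for illustration.

```python
# Hypothetical stand-in for a leaf node produced by the pipeline.
leaf = {'title': 'Chapter 1', 'start_index': 1, 'end_index': 5, 'nodes': []}

# Mirror what clean_node() does: drop the key when there are no children.
if not leaf['nodes']:
    del leaf['nodes']


def direct_access_fails(node):
    """Return True if node['nodes'] raises KeyError (illustrative helper)."""
    try:
        node['nodes']
    except KeyError:
        return True
    return False


raised = direct_access_fails(leaf)   # subscript access on the missing key
safe_value = leaf.get('nodes')       # .get() falls back to None instead
```

Subscript access raises because the key is gone, while `.get('nodes')` returns `None`, which is falsy and so still takes the leaf branch.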
Steps to Reproduce
from pageindex.utils import get_leaf_nodes, list_to_tree

flat = [
    {'structure': '1', 'title': 'Chapter 1', 'start_index': 1, 'end_index': 5},
    {'structure': '2', 'title': 'Chapter 2', 'start_index': 6, 'end_index': 10},
]
tree = list_to_tree(flat)  # builds tree; leaf nodes have no 'nodes' key
get_leaf_nodes(tree)       # ← KeyError: 'nodes'
Expected Behaviour
get_leaf_nodes should safely handle nodes without a nodes key and return them as leaf nodes.
Proposed Fix
Replace the direct key access with .get():
def get_leaf_nodes(structure):
    if isinstance(structure, dict):
        if not structure.get('nodes'):  # ← safe: returns None (falsy) if key absent
            structure_node = copy.deepcopy(structure)
            structure_node.pop('nodes', None)
            return [structure_node]
        else:
            leaf_nodes = []
            for key in list(structure.keys()):
                if 'nodes' in key:
                    leaf_nodes.extend(get_leaf_nodes(structure[key]))
            return leaf_nodes
This is a one-line fix, and it is consistent with how nodes is accessed elsewhere in the codebase (e.g. format_structure, is_leaf_node, print_tree all use .get('nodes')).
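A regression check for the fix could look roughly like the sketch below. It does not import pageindex: `get_leaf_nodes_sketch` is a standalone variant written for this issue, not the library's code. It assumes the tree is a list of root dicts, and it simplifies the key loop to iterate `structure['nodes']` directly. The sample `tree` is made up and mixes all three leaf shapes (key absent, key present but empty, nested).

```python
import copy


def get_leaf_nodes_sketch(structure):
    """Standalone variant of the proposed fix (illustrative, not library code)."""
    if isinstance(structure, dict):
        # .get() treats a missing 'nodes' key the same as an empty list.
        if not structure.get('nodes'):
            leaf = copy.deepcopy(structure)
            leaf.pop('nodes', None)
            return [leaf]
        leaves = []
        for child in structure['nodes']:
            leaves.extend(get_leaf_nodes_sketch(child))
        return leaves
    if isinstance(structure, list):
        # Assumed: the top level is a list of root nodes.
        leaves = []
        for item in structure:
            leaves.extend(get_leaf_nodes_sketch(item))
        return leaves
    return []


# Made-up tree: one leaf without the key, one with an empty list, one root leaf.
tree = [
    {'title': 'Chapter 1', 'nodes': [
        {'title': 'Section 1.1'},
        {'title': 'Section 1.2', 'nodes': []},
    ]},
    {'title': 'Chapter 2'},
]

leaves = get_leaf_nodes_sketch(tree)
leaf_titles = [node['title'] for node in leaves]
```

Every returned leaf should come back without a `nodes` key, and the input tree should be left unmodified because leaves are deep-copied before the key is popped.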
I would like to work on this fix if maintainers are open to it. Happy to submit a PR once confirmed. 🙂