Bug: get_leaf_nodes raises KeyError on leaf nodes due to missing nodes key #330

@ChiragB254

Description

get_leaf_nodes() in pageindex/utils.py crashes with a KeyError when called on a tree structure that was built by the standard pipeline. The function accesses structure['nodes'] directly, but leaf nodes do not have a nodes key — it is explicitly deleted by clean_node() inside list_to_tree().

Affected File

pageindex/utils.py, get_leaf_nodes function

def get_leaf_nodes(structure):
    if isinstance(structure, dict):
        if not structure['nodes']:   # ← KeyError: 'nodes' key doesn't exist on leaf nodes
            structure_node = copy.deepcopy(structure)
            structure_node.pop('nodes', None)
            return [structure_node]

Root Cause

list_to_tree() calls clean_node(), which deletes the nodes key from any node with no children:

def clean_node(node):
    if not node['nodes']:
        del node['nodes']   # ← key is removed entirely, not set to []

So when get_leaf_nodes later checks structure['nodes'] on one of these nodes, the key no longer exists and a KeyError is raised.
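The failure mode can be shown in isolation, without importing pageindex. The snippet below is a minimal sketch: the node dict is a hypothetical stand-in, and the deletion step only mirrors what the clean_node() excerpt above does.

```python
# Hypothetical node, shaped like the ones in this report (not real pipeline output).
node = {'title': 'Chapter 1', 'nodes': []}

# Same step as the clean_node() excerpt: the empty list is deleted, not kept as [].
if not node['nodes']:
    del node['nodes']

# Direct indexing now fails, which is what get_leaf_nodes runs into.
try:
    node['nodes']
    outcome = 'no error'
except KeyError as exc:
    outcome = f'KeyError: {exc}'

print(outcome)            # KeyError: 'nodes'
print(node.get('nodes'))  # None
```

The last line shows why .get() is enough here: a missing key yields None, which is falsy just like an empty list.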

Steps to Reproduce

from pageindex.utils import get_leaf_nodes, list_to_tree

flat = [
    {'structure': '1', 'title': 'Chapter 1', 'start_index': 1, 'end_index': 5},
    {'structure': '2', 'title': 'Chapter 2', 'start_index': 6, 'end_index': 10},
]
tree = list_to_tree(flat)   # builds tree; leaf nodes have no 'nodes' key
get_leaf_nodes(tree)         # ← KeyError: 'nodes'

Expected Behaviour

get_leaf_nodes should safely handle nodes without a nodes key and return them as leaf nodes.

Proposed Fix

Replace the direct key access with .get():

def get_leaf_nodes(structure):
    if isinstance(structure, dict):
        if not structure.get('nodes'):   # ← safe: returns None (falsy) if key absent
            structure_node = copy.deepcopy(structure)
            structure_node.pop('nodes', None)
            return [structure_node]
        else:
            leaf_nodes = []
            for key in list(structure.keys()):
                if 'nodes' in key:
                    leaf_nodes.extend(get_leaf_nodes(structure[key]))
            return leaf_nodes

This is a one-line fix and consistent with how nodes is accessed elsewhere in the codebase (e.g. format_structure, is_leaf_node, print_tree all use .get('nodes')).
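A possible regression check for the change, written as a self-contained sketch. get_leaf_nodes_patched is a hypothetical local copy of the proposed function rather than the code in pageindex/utils.py. The list branch and the final fallback are assumptions added so the sketch can walk a list of root nodes, and the sample tree is made up for illustration.

```python
import copy

def get_leaf_nodes_patched(structure):
    """Local sketch of the proposed fix; list handling is an assumption."""
    if isinstance(structure, dict):
        # .get() covers both a missing 'nodes' key and an empty list.
        if not structure.get('nodes'):
            leaf = copy.deepcopy(structure)
            leaf.pop('nodes', None)
            return [leaf]
        leaves = []
        for key in list(structure.keys()):
            if 'nodes' in key:
                leaves.extend(get_leaf_nodes_patched(structure[key]))
        return leaves
    if isinstance(structure, list):
        leaves = []
        for item in structure:
            leaves.extend(get_leaf_nodes_patched(item))
        return leaves
    return []

# Hypothetical tree mixing the three cases: children present, key missing, empty list.
tree = [
    {'title': 'Chapter 1', 'nodes': [
        {'title': 'Section 1.1'},
        {'title': 'Section 1.2', 'nodes': []},
    ]},
    {'title': 'Chapter 2'},
]

leaves = get_leaf_nodes_patched(tree)
titles = [leaf['title'] for leaf in leaves]
print(titles)  # ['Section 1.1', 'Section 1.2', 'Chapter 2']
```

Covering both the missing-key and the empty-list case in one test would guard against the bug returning if clean_node() is later changed to keep an empty list.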


I would like to work on this fix if maintainers are open to it. Happy to submit a PR once confirmed. 🙂
