Skip to content

sqlglot.lineage doesn't preserve the struct-field path on leaf nodes #7604

@treff7es

Description

@treff7es

Summary

sqlglot.lineage doesn't preserve the struct-field path on leaf nodes. For a query like SELECT MIN(widget.metric.a, widget.metric.b) FROM tbl, both leaves come back as tbl.widget — the metric.a and metric.b parts are gone after to_node returns. v30.7.0 added on_node and Node.payload for consumer-side enrichment, which is the right pattern, but on_node(leaf) doesn't have access to the source column where that information lives.

Filing as an issue first to agree on the shape before sending a PR.

Repro

import sqlglot
import sqlglot.lineage as l
import sqlglot.optimizer.qualify as q
import sqlglot.optimizer.scope as scp

sql = (
    "SELECT MIN(widget.metric.metric_a, widget.metric.metric_b) AS x "
    "FROM tbl"
)
parsed = sqlglot.parse_one(sql, dialect="bigquery")
qualified = q.qualify(
    parsed,
    dialect="bigquery",
    schema=None,
    validate_qualify_columns=False,
    identify=False,
)
node = l.lineage(
    "x",
    qualified,
    dialect="bigquery",
    scope=scp.build_scope(qualified),
    trim_selects=False,
    copy=False,
)

for leaf in node.walk():
    if not leaf.downstream:
        print(leaf.name)

Output:

tbl.widget

One leaf. The metric_a and metric_b paths are both gone, and they're also collapsed to a single leaf by the set(find_all_in_scope(...)) on line 355. We see this on Snowflake VARIANT, BigQuery STRUCT, Iceberg nested types, and Spark complex-type queries — anywhere users access struct fields.

Where the information goes missing

In to_node, when a leaf is built:

# v30.7.0 sqlglot/lineage.py around line 468
col_expr = col_source or exp.Placeholder()
leaf = Node(name=c.sql(comments=False), source=col_expr, expression=col_expr)
node.downstream.append(leaf)
if on_node:
    on_node(leaf)

The source column c carries the Dot ancestors (the .metric.metric_a chain) on c.parent, but only c.sql(comments=False) makes it into the leaf — and that renders just the column reference, not the parent Dots. After the function returns there's no way to tell from the leaf which struct field it referred to.

What we tried externally before opening this

We tried two approaches that don't require any sqlglot change:

1. Pre-walk the qualified SQL and build a column_ref → [subfields] map.

from sqlglot import exp
from sqlglot.optimizer.scope import find_all_in_scope

subfield_index = {}
for c in find_all_in_scope(qualified, exp.Column):
    chain, f = [], c
    while isinstance(f.parent, exp.Dot):
        f = f.parent
        chain.append(f.name)
    if chain:
        subfield_index.setdefault(c.sql(comments=False), []).append(".".join(chain))

For the repro query: {'tbl.widget': ['metric.metric_a', 'metric.metric_b']}. Useful, but the lineage tree has only one leaf named tbl.widget and there's no way to tell which entry in the list it should be paired with.

2. Override Column.__hash__ to identity-based hashing so the set() on line 355 stops collapsing duplicates.

This restores both leaves, but both are still named tbl.widget. From outside to_node, there's no way to know which leaf came from which source column.

Where this breaks down:

  • Same column ref appearing with different subfields (the repro above).
  • Joined queries where each struct subfield resolves to a different upstream table — the pre-walk knows the subfields but can't attribute them to the right upstream branch.
  • UDF / aggregate calls over multiple struct fields (COALESCE(t.s.a, t.s.b), STRUCT(t.s.x, t.s.y)).

For simple queries with one struct access per column ref, the external workaround is good enough. For the typical analytical workload — joins, aggregates, conditional logic over struct fields — it's lossy.

Why a hook fits

v30.7.0 added:

on_node: Callable[[Node], None] | None = None
payload: dict[str, t.Any] = field(default_factory=dict)  # on Node

The shape is right for consumer-side enrichment. The gap is that on_node(leaf) doesn't have the source column. Adding a second callback that does:

def to_node(
    ...
    on_node: Callable[[Node], None] | None = None,
    on_leaf: Callable[[Node, exp.Column], None] | None = None,  # new
) -> Node:
    ...
    col_expr = col_source or exp.Placeholder()
    leaf = Node(name=c.sql(comments=False), source=col_expr, expression=col_expr)
    node.downstream.append(leaf)
    if on_node:
        on_node(leaf)
    if on_leaf:
        on_leaf(leaf, c)

Same idea at the pivot leaf (around line 454) and star leaf (around line 349) — pass whatever source expression makes sense for each.

Consumer side, recovering the struct path:

def _capture_subfield(leaf: Node, c: exp.Column) -> None:
    chain, f = [], c
    while isinstance(f.parent, exp.Dot):
        f = f.parent
        chain.append(f.name)
    if chain:
        leaf.payload["subfield"] = ".".join(chain)

lineage(..., on_leaf=_capture_subfield)

Subfield lands in the existing Node.payload dict. No new fields on the dataclass.

Question

Does on_leaf(leaf, source_column) look right, or would you prefer a different shape — extending on_node with an optional second arg, populating payload automatically with the source, or something else?

Once the shape's settled, I can open a PR if you want.

Metadata

Metadata

Assignees

Labels

No labels
No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions