Summary
sqlglot.lineage doesn't preserve the struct-field path on leaf nodes. For a query like SELECT MIN(widget.metric.a, widget.metric.b) FROM tbl, both leaves come back as tbl.widget — the metric.a and metric.b parts are gone after to_node returns. v30.7.0 added on_node and Node.payload for consumer-side enrichment, which is the right pattern, but on_node(leaf) doesn't have access to the source column where that information lives.
Filing as an issue first to agree on the shape before sending a PR.
Repro
import sqlglot
import sqlglot.lineage as l
import sqlglot.optimizer.qualify as q
import sqlglot.optimizer.scope as scp
sql = (
"SELECT MIN(widget.metric.metric_a, widget.metric.metric_b) AS x "
"FROM tbl"
)
parsed = sqlglot.parse_one(sql, dialect="bigquery")
qualified = q.qualify(
parsed,
dialect="bigquery",
schema=None,
validate_qualify_columns=False,
identify=False,
)
node = l.lineage(
"x",
qualified,
dialect="bigquery",
scope=scp.build_scope(qualified),
trim_selects=False,
copy=False,
)
for leaf in node.walk():
if not leaf.downstream:
print(leaf.name)
Output:
One leaf. The metric_a and metric_b paths are both gone, and they're also collapsed to a single leaf by the set(find_all_in_scope(...)) on line 355. We see this on Snowflake VARIANT, BigQuery STRUCT, Iceberg nested types, and Spark complex-type queries — anywhere users access struct fields.
Where the information goes missing
In to_node, when a leaf is built:
# v30.7.0 sqlglot/lineage.py around line 468
col_expr = col_source or exp.Placeholder()
leaf = Node(name=c.sql(comments=False), source=col_expr, expression=col_expr)
node.downstream.append(leaf)
if on_node:
on_node(leaf)
The source column c carries the Dot ancestors (the .metric.metric_a chain) on c.parent, but only c.sql(comments=False) makes it into the leaf — and that renders just the column reference, not the parent Dots. After the function returns there's no way to tell from the leaf which struct field it referred to.
What we tried externally before opening this
We tried two approaches that don't require any sqlglot change:
1. Pre-walk the qualified SQL and build a column_ref → [subfields] map.
from sqlglot import exp
from sqlglot.optimizer.scope import find_all_in_scope
subfield_index = {}
for c in find_all_in_scope(qualified, exp.Column):
chain, f = [], c
while isinstance(f.parent, exp.Dot):
f = f.parent
chain.append(f.name)
if chain:
subfield_index.setdefault(c.sql(comments=False), []).append(".".join(chain))
For the repro query: {'tbl.widget': ['metric.metric_a', 'metric.metric_b']}. Useful, but the lineage tree has only one leaf named tbl.widget and there's no way to tell which entry in the list it should be paired with.
2. Override Column.__hash__ to identity-based hashing so the set() on line 355 stops collapsing duplicates.
This restores both leaves, but both are still named tbl.widget. From outside to_node, there's no way to know which leaf came from which source column.
Where this breaks down:
- Same column ref appearing with different subfields (the repro above).
- Joined queries where each struct subfield resolves to a different upstream table — the pre-walk knows the subfields but can't attribute them to the right upstream branch.
- UDF / aggregate calls over multiple struct fields (
COALESCE(t.s.a, t.s.b), STRUCT(t.s.x, t.s.y)).
For simple queries with one struct access per column ref, the external workaround is good enough. For the typical analytical workload — joins, aggregates, conditional logic over struct fields — it's lossy.
Why a hook fits
v30.7.0 added:
on_node: Callable[[Node], None] | None = None
payload: dict[str, t.Any] = field(default_factory=dict) # on Node
The shape is right for consumer-side enrichment. The gap is that on_node(leaf) doesn't have the source column. Adding a second callback that does:
def to_node(
...
on_node: Callable[[Node], None] | None = None,
on_leaf: Callable[[Node, exp.Column], None] | None = None, # new
) -> Node:
...
col_expr = col_source or exp.Placeholder()
leaf = Node(name=c.sql(comments=False), source=col_expr, expression=col_expr)
node.downstream.append(leaf)
if on_node:
on_node(leaf)
if on_leaf:
on_leaf(leaf, c)
Same idea at the pivot leaf (around line 454) and star leaf (around line 349) — pass whatever source expression makes sense for each.
Consumer side, recovering the struct path:
def _capture_subfield(leaf: Node, c: exp.Column) -> None:
chain, f = [], c
while isinstance(f.parent, exp.Dot):
f = f.parent
chain.append(f.name)
if chain:
leaf.payload["subfield"] = ".".join(chain)
lineage(..., on_leaf=_capture_subfield)
Subfield lands in the existing Node.payload dict. No new fields on the dataclass.
Question
Does on_leaf(leaf, source_column) look right, or would you prefer a different shape — extending on_node with an optional second arg, populating payload automatically with the source, or something else?
Once the shape's settled, I can open a PR if you want.
Summary
sqlglot.lineagedoesn't preserve the struct-field path on leaf nodes. For a query likeSELECT MIN(widget.metric.a, widget.metric.b) FROM tbl, both leaves come back astbl.widget— themetric.aandmetric.bparts are gone afterto_nodereturns. v30.7.0 addedon_nodeandNode.payloadfor consumer-side enrichment, which is the right pattern, buton_node(leaf)doesn't have access to the source column where that information lives.Filing as an issue first to agree on the shape before sending a PR.
Repro
Output:
One leaf. The
metric_aandmetric_bpaths are both gone, and they're also collapsed to a single leaf by theset(find_all_in_scope(...))on line 355. We see this on Snowflake VARIANT, BigQuery STRUCT, Iceberg nested types, and Spark complex-type queries — anywhere users access struct fields.Where the information goes missing
In
to_node, when a leaf is built:The source column
ccarries the Dot ancestors (the.metric.metric_achain) onc.parent, but onlyc.sql(comments=False)makes it into the leaf — and that renders just the column reference, not the parent Dots. After the function returns there's no way to tell from the leaf which struct field it referred to.What we tried externally before opening this
We tried two approaches that don't require any sqlglot change:
1. Pre-walk the qualified SQL and build a
column_ref → [subfields]map.For the repro query:
{'tbl.widget': ['metric.metric_a', 'metric.metric_b']}. Useful, but the lineage tree has only one leaf namedtbl.widgetand there's no way to tell which entry in the list it should be paired with.2. Override
Column.__hash__to identity-based hashing so theset()on line 355 stops collapsing duplicates.This restores both leaves, but both are still named
tbl.widget. From outsideto_node, there's no way to know which leaf came from which source column.Where this breaks down:
COALESCE(t.s.a, t.s.b),STRUCT(t.s.x, t.s.y)).For simple queries with one struct access per column ref, the external workaround is good enough. For the typical analytical workload — joins, aggregates, conditional logic over struct fields — it's lossy.
Why a hook fits
v30.7.0 added:
The shape is right for consumer-side enrichment. The gap is that
on_node(leaf)doesn't have the source column. Adding a second callback that does:Same idea at the pivot leaf (around line 454) and star leaf (around line 349) — pass whatever source expression makes sense for each.
Consumer side, recovering the struct path:
Subfield lands in the existing
Node.payloaddict. No new fields on the dataclass.Question
Does
on_leaf(leaf, source_column)look right, or would you prefer a different shape — extendingon_nodewith an optional second arg, populatingpayloadautomatically with the source, or something else?Once the shape's settled, I can open a PR if you want.