sqlglot.lineage doesn't preserve the struct-field path on leaf nodes

## Summary

`sqlglot.lineage` doesn't preserve the struct-field path on leaf nodes. For a query like `SELECT MIN(widget.metric.a, widget.metric.b) FROM tbl`, both leaves come back as `tbl.widget` — the `metric.a` and `metric.b` parts are gone after `to_node` returns. v30.7.0 added `on_node` and `Node.payload` for consumer-side enrichment, which is the right pattern, but `on_node(leaf)` doesn't have access to the source column where that information lives.

Filing as an issue first to agree on the shape before sending a PR.

## Repro

```python
import sqlglot
import sqlglot.lineage as l
import sqlglot.optimizer.qualify as q
import sqlglot.optimizer.scope as scp

sql = (
    "SELECT MIN(widget.metric.metric_a, widget.metric.metric_b) AS x "
    "FROM tbl"
)
parsed = sqlglot.parse_one(sql, dialect="bigquery")
qualified = q.qualify(
    parsed,
    dialect="bigquery",
    schema=None,
    validate_qualify_columns=False,
    identify=False,
)
node = l.lineage(
    "x",
    qualified,
    dialect="bigquery",
    scope=scp.build_scope(qualified),
    trim_selects=False,
    copy=False,
)

for leaf in node.walk():
    if not leaf.downstream:
        print(leaf.name)
```

Output:

```
tbl.widget
```

One leaf. The `metric_a` and `metric_b` paths are both gone, and they're also collapsed to a single leaf by the `set(find_all_in_scope(...))` on line 355. We see this on Snowflake VARIANT, BigQuery STRUCT, Iceberg nested types, and Spark complex-type queries — anywhere users access struct fields.

## Where the information goes missing

In `to_node`, when a leaf is built:

```python
# v30.7.0 sqlglot/lineage.py around line 468
col_expr = col_source or exp.Placeholder()
leaf = Node(name=c.sql(comments=False), source=col_expr, expression=col_expr)
node.downstream.append(leaf)
if on_node:
    on_node(leaf)
```

The source column `c` carries the Dot ancestors (the `.metric.metric_a` chain) on `c.parent`, but only `c.sql(comments=False)` makes it into the leaf — and that renders just the column reference, not the parent Dots. After the function returns there's no way to tell from the leaf which struct field it referred to.

## What we tried externally before opening this

We tried two approaches that don't require any sqlglot change:

**1. Pre-walk the qualified SQL and build a `column_ref → [subfields]` map.**

```python
from sqlglot import exp
from sqlglot.optimizer.scope import find_all_in_scope

subfield_index = {}
for c in find_all_in_scope(qualified, exp.Column):
    chain, f = [], c
    while isinstance(f.parent, exp.Dot):
        f = f.parent
        chain.append(f.name)
    if chain:
        subfield_index.setdefault(c.sql(comments=False), []).append(".".join(chain))
```

For the repro query: `{'tbl.widget': ['metric.metric_a', 'metric.metric_b']}`. Useful, but the lineage tree has only one leaf named `tbl.widget` and there's no way to tell which entry in the list it should be paired with.

**2. Override `Column.__hash__` to identity-based hashing so the `set()` on line 355 stops collapsing duplicates.**

This restores both leaves, but both are still named `tbl.widget`. From outside `to_node`, there's no way to know which leaf came from which source column.

**Where this breaks down:**
- Same column ref appearing with different subfields (the repro above).
- Joined queries where each struct subfield resolves to a different upstream table — the pre-walk knows the subfields but can't attribute them to the right upstream branch.
- UDF / aggregate calls over multiple struct fields (`COALESCE(t.s.a, t.s.b)`, `STRUCT(t.s.x, t.s.y)`).

For simple queries with one struct access per column ref, the external workaround is good enough. For the typical analytical workload — joins, aggregates, conditional logic over struct fields — it's lossy.

## Why a hook fits

v30.7.0 added:

```python
on_node: Callable[[Node], None] | None = None
payload: dict[str, t.Any] = field(default_factory=dict)  # on Node
```

The shape is right for consumer-side enrichment. The gap is that `on_node(leaf)` doesn't have the source column. Adding a second callback that does:

```python
def to_node(
    ...
    on_node: Callable[[Node], None] | None = None,
    on_leaf: Callable[[Node, exp.Column], None] | None = None,  # new
) -> Node:
    ...
    col_expr = col_source or exp.Placeholder()
    leaf = Node(name=c.sql(comments=False), source=col_expr, expression=col_expr)
    node.downstream.append(leaf)
    if on_node:
        on_node(leaf)
    if on_leaf:
        on_leaf(leaf, c)
```

Same idea at the pivot leaf (around line 454) and star leaf (around line 349) — pass whatever source expression makes sense for each.

Consumer side, recovering the struct path:

```python
def _capture_subfield(leaf: Node, c: exp.Column) -> None:
    chain, f = [], c
    while isinstance(f.parent, exp.Dot):
        f = f.parent
        chain.append(f.name)
    if chain:
        leaf.payload["subfield"] = ".".join(chain)

lineage(..., on_leaf=_capture_subfield)
```

Subfield lands in the existing `Node.payload` dict. No new fields on the dataclass.

## Question

Does `on_leaf(leaf, source_column)` look right, or would you prefer a different shape — extending `on_node` with an optional second arg, populating `payload` automatically with the source, or something else?

Once the shape's settled, I can open a PR if you want.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sqlglot.lineage doesn't preserve the struct-field path on leaf nodes #7604

Summary

Repro

Where the information goes missing

What we tried externally before opening this

Why a hook fits

Question

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

sqlglot.lineage doesn't preserve the struct-field path on leaf nodes #7604

Description

Summary

Repro

Where the information goes missing

What we tried externally before opening this

Why a hook fits

Question

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions