[SPARK-56032][SQL][FOLLOWUP] Skip FilterExec CSE codegen when the only common subexpression is a leaf by cloud-fan · Pull Request #56604 · apache/spark

cloud-fan · 2026-06-19T00:09:10Z

What changes were proposed in this pull request?

A follow-up of #54862 (which introduced subexpression elimination (CSE) in FilterExec
whole-stage codegen) and #56209 (which gated the CSE path on whether otherPreds contain a
common subexpression).

The #56209 gate takes the CSE path whenever otherPreds contain a common subexpression, where
"common subexpression" is anything EquivalentExpressions counts more than once -- which includes
cheap expressions such as bare columns. c BETWEEN lo AND hi lowers to c >= lo AND c <= hi, so
any BETWEEN (or a column referenced in several conjuncts) makes that column a common subexpression.

Taking the CSE path then emits the eager inputVarsEvalCode prologue, which evaluates every
column referenced by otherPreds at the top of the per-row loop. Caching a cheap load gains nothing
-- the non-CSE path already loads each column lazily into a variable on demand -- so when the only
common subexpressions are cheap, the prologue is pure overhead that defeats short-circuiting.

This PR requires a non-cheap common subexpression (per CollapseProject.isCheap) before taking
the CSE path. Filters with a genuine repeated computation (e.g. a + b) are unaffected and still
benefit from CSE.

CollapseProject.isCheap is the canonical "cheap to recompute" predicate (attributes, foldables,
Alias/ExtractValue of cheap children). The gate runs on predicates already bound for codegen, so
this PR also teaches isCheap that BoundReference -- the codegen-bound form of an Attribute, an
equally cheap slot read -- is cheap. This lets the gate reuse the single EquivalentExpressions
analysis it already shares with the CSE codegen, rather than re-analyzing the predicates.

Why are the changes needed?

TPC-DS q28 filters as
ss_quantity BETWEEN ... AND (ss_list_price BETWEEN ... OR ss_coupon_amt BETWEEN ... OR ss_wholesale_cost BETWEEN ...).
Its only repeated expressions are the bare columns, so the #56209 gate wrongly took the CSE path and
eagerly decoded the high-precision decimal columns on every row -- including rows the cheap
ss_quantity integer predicate would have rejected -- allocating a BigInteger/BigDecimal per
decoded decimal. On a 3TB run this showed up as a ~40% slowdown on q28, which this change removes
(the filter falls back to the lazy, short-circuiting path).

Does this PR introduce any user-facing change?

No. This is a codegen-only change; query results are unchanged.

How was this patch tested?

New unit test in WholeStageCodegenSuite asserting that, for the q28 BETWEEN shape (whose only
common subexpressions are cheap leaf columns), CSE-enabled generated code is identical to
CSE-disabled code -- i.e. it falls back to the lazy, short-circuiting non-CSE path. The existing
FilterExec CSE tests, which use genuine non-cheap common subexpressions, still exercise the CSE
path and pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude (Claude Code)

…y common subexpression is a leaf ### What changes were proposed in this pull request? A follow-up of the FilterExec whole-stage-codegen subexpression elimination (CSE) work. The CSE path is taken whenever otherPreds contain a common subexpression, where "common subexpression" is anything EquivalentExpressions counts more than once -- which includes bare leaf columns. `c BETWEEN lo AND hi` lowers to `c >= lo AND c <= hi`, so any BETWEEN (or a column used in several conjuncts) makes its column a common subexpression. Taking the CSE path then emits the eager inputVarsEvalCode prologue, which decodes every otherPred-referenced column at the top of the row loop. Caching a bare column load gains nothing (the non-CSE path already loads columns lazily), so this is pure overhead and defeats short-circuiting. This PR requires a non-leaf common subexpression before taking the CSE path. ### Why are the changes needed? TPC-DS q28 filters as `ss_quantity BETWEEN ... AND (ss_list_price BETWEEN ... OR ss_coupon_amt BETWEEN ... OR ss_wholesale_cost BETWEEN ...)`. Its only repeated expressions are the bare columns, so the gate wrongly took the CSE path and eagerly decoded the high-precision decimal columns on every row -- including rows the cheap ss_quantity integer predicate would reject -- allocating a BigInteger/BigDecimal per row. A 3TB q28 run showed ~40% slowdown that this change removes. ### Does this PR introduce any user-facing change? No. Codegen-only change; query results are unchanged. ### How was this patch tested? New unit test in WholeStageCodegenSuite asserting that, for the q28 BETWEEN shape (only leaf common subexpressions), CSE-enabled codegen is identical to CSE-disabled codegen (i.e. it falls back to the lazy, short-circuiting path). The existing FilterExec CSE tests, which use genuine non-leaf common subexpressions, still exercise the CSE path and pass. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude (Claude Code) Co-authored-by: Isaac

…sCheap Use the canonical CollapseProject.isCheap to decide whether a common subexpression is worth eliminating, instead of an ad-hoc leaf check. A repeated expression that is cheap to recompute (a bare column, a foldable, an Alias/ExtractValue of cheap children) gains nothing from CSE, so requiring a non-cheap common subexpression avoids the eager prologue for those. CollapseProject.isCheap matches Attribute; the gate runs on bound predicates, so teach isCheap that BoundReference (the codegen-bound form of Attribute) is equally cheap. This keeps the single EquivalentExpressions analysis shared by the gate and the CSE codegen. Co-authored-by: Isaac

cloud-fan · 2026-06-19T00:29:49Z

@LuciferYang @yaooqinn

LuciferYang

should we backport this to 4.2?

LuciferYang · 2026-06-19T03:11:58Z

+        val df = spark.createDataFrame(data, schema)
+        // The only repeated expressions are the bare columns q, p1, p2 (each referenced by the two
+        // halves of its BETWEEN). No non-leaf expression is shared.
+        val filtered = df.where(


Let me verify the validity of this test.

cloud-fan · 2026-06-19T03:45:31Z

Yea this is a 4.2 perf regression, cc @huaxingao

cloud-fan added 2 commits June 19, 2026 00:08

LuciferYang approved these changes Jun 19, 2026

View reviewed changes

LuciferYang reviewed Jun 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-56032][SQL][FOLLOWUP] Skip FilterExec CSE codegen when the only common subexpression is a leaf#56604

[SPARK-56032][SQL][FOLLOWUP] Skip FilterExec CSE codegen when the only common subexpression is a leaf#56604
cloud-fan wants to merge 2 commits into
apache:masterfrom
cloud-fan:filterexec-cse-leaf-gate

cloud-fan commented Jun 19, 2026 •

edited

Loading

Uh oh!

cloud-fan commented Jun 19, 2026

Uh oh!

LuciferYang left a comment

Uh oh!

LuciferYang Jun 19, 2026

Uh oh!

cloud-fan commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

cloud-fan commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

Uh oh!

cloud-fan commented Jun 19, 2026

Uh oh!

LuciferYang left a comment

Choose a reason for hiding this comment

Uh oh!

LuciferYang Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

cloud-fan commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cloud-fan commented Jun 19, 2026 •

edited

Loading