Skip to content

[SPARK-54593][SQL][FOLLOWUP] Only inject DPP for a materialized filtering side when it can reuse a broadcast#56603

Open
cloud-fan wants to merge 1 commit into
apache:masterfrom
cloud-fan:SPARK-54593-followup
Open

[SPARK-54593][SQL][FOLLOWUP] Only inject DPP for a materialized filtering side when it can reuse a broadcast#56603
cloud-fan wants to merge 1 commit into
apache:masterfrom
cloud-fan:SPARK-54593-followup

Conversation

@cloud-fan

@cloud-fan cloud-fan commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

What changes were proposed in this pull request?

Follow-up to #56071 (SPARK-54593), which made an already-materialized filtering side (a LocalRelation or a checkpoint() / localCheckpoint()-derived LogicalRDD) eligible for dynamic partition pruning (DPP).

This PR makes such a side obtain DPP only by reusing a broadcast (onlyInBroadcast), rather than also being injected as a standalone, always-applied subquery: insertPredicate now treats the filtering side as beneficial only when it carries a selective predicate, so a materialized side no longer gets hasBenefit = true from the fallback filtering ratio.

It also refactors the eligibility check for readability — the enumeration-named hasSelectivePredicateOrLocalOrCheckpointedInput is split into two intent-named predicates, hasSelectivePredicate and hasMaterializedInput, combined in hasPartitionPruningFilter.

Why are the changes needed?

pruningHasBenefit estimates the filtering ratio from column statistics, falling back to spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio (default 0.5) — a value documented as the ratio to use "when CBO stats are missing, but there is a predicate that is likely to be selective". A materialized filtering side carries no such predicate and typically has no column statistics, so it is assumed 50%-selective and treated as beneficial regardless of its actual selectivity.

Consequently, with spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly = false, a materialized side that covers all (or most) of the probe table's partitions — and therefore prunes nothing — is still injected as an always-applied subquery that re-executes the filtering side for no benefit. Before SPARK-54593 such a side was not DPP-eligible at all, so this is a regression in master.

The broadcast-reuse path (the feature's intended use, where the cost is negligible because the build side is already being broadcast for the join) is left unchanged; only the no-benefit standalone-subquery case is avoided. A materialized side carries no evidence of selectivity, so the conservative choice is to apply it only when it can ride an existing broadcast.

Does this PR introduce any user-facing change?

No. It only avoids planning a no-benefit dynamic partition pruning subquery for a materialized build side; query results are unchanged.

How was this patch tested?

New unit test in DynamicPartitionPruningSuite: with broadcast joins disabled and reuseBroadcastOnly = false, a materialized filtering side that covers every partition no longer triggers a DPP subquery. The existing positive tests for materialized filtering sides (which use broadcast reuse) continue to pass.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic Claude Opus)

…ring side when it can reuse a broadcast

### What changes were proposed in this pull request?

Follow-up to apache#56071 (SPARK-54593), which enabled DPP for already-materialized
filtering sides (LocalRelation / checkpoint-derived LogicalRDD). This makes
PartitionPruning consider a materialized filtering side for DPP only when the
filter can reuse a broadcast, instead of also injecting it as a standalone,
always-applied subquery: insertPredicate now treats the side as beneficial only
when it carries a selective predicate.

### Why are the changes needed?

pruningHasBenefit falls back to fallbackFilterRatio (0.5) "when CBO stats are
missing, but there is a predicate that is likely to be selective". A materialized
side carries no predicate and usually no column stats, so it is assumed
50%-selective and treated as beneficial regardless of its actual selectivity.
With reuseBroadcastOnly=false, a materialized side covering all probe partitions
(pruning nothing) is then injected as an always-applied subquery. Before
SPARK-54593 such a side was not DPP-eligible, so this is a regression in master.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New unit test in DynamicPartitionPruningSuite.

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Claude Code (Anthropic Claude Opus)
@cloud-fan cloud-fan force-pushed the SPARK-54593-followup branch from 372caba to eed6aff Compare June 18, 2026 23:49
@cloud-fan

Copy link
Copy Markdown
Contributor Author

cc @sunchao @MaxGekk

@sunchao

sunchao commented Jun 19, 2026

Copy link
Copy Markdown
Member

Thanks @cloud-fan ! will take a look.

FYI I also opened this follow-up earlier: #56535

@cloud-fan

Copy link
Copy Markdown
Contributor Author

It seems to fix a different issue, I think we need both followups.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants