[SPARK-54593][SQL][FOLLOWUP] Only inject DPP for a materialized filtering side when it can reuse a broadcast#56603
Open
cloud-fan wants to merge 1 commit into
Open
[SPARK-54593][SQL][FOLLOWUP] Only inject DPP for a materialized filtering side when it can reuse a broadcast#56603cloud-fan wants to merge 1 commit into
cloud-fan wants to merge 1 commit into
Conversation
…ring side when it can reuse a broadcast ### What changes were proposed in this pull request? Follow-up to apache#56071 (SPARK-54593), which enabled DPP for already-materialized filtering sides (LocalRelation / checkpoint-derived LogicalRDD). This makes PartitionPruning consider a materialized filtering side for DPP only when the filter can reuse a broadcast, instead of also injecting it as a standalone, always-applied subquery: insertPredicate now treats the side as beneficial only when it carries a selective predicate. ### Why are the changes needed? pruningHasBenefit falls back to fallbackFilterRatio (0.5) "when CBO stats are missing, but there is a predicate that is likely to be selective". A materialized side carries no predicate and usually no column stats, so it is assumed 50%-selective and treated as beneficial regardless of its actual selectivity. With reuseBroadcastOnly=false, a materialized side covering all probe partitions (pruning nothing) is then injected as an always-applied subquery. Before SPARK-54593 such a side was not DPP-eligible, so this is a regression in master. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New unit test in DynamicPartitionPruningSuite. ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code (Anthropic Claude Opus)
372caba to
eed6aff
Compare
Contributor
Author
Member
|
Thanks @cloud-fan ! will take a look. FYI I also opened this follow-up earlier: #56535 |
Contributor
Author
|
It seems to fix a different issue, I think we need both followups. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
Follow-up to #56071 (SPARK-54593), which made an already-materialized filtering side (a
LocalRelationor acheckpoint()/localCheckpoint()-derivedLogicalRDD) eligible for dynamic partition pruning (DPP).This PR makes such a side obtain DPP only by reusing a broadcast (
onlyInBroadcast), rather than also being injected as a standalone, always-applied subquery:insertPredicatenow treats the filtering side as beneficial only when it carries a selective predicate, so a materialized side no longer getshasBenefit = truefrom the fallback filtering ratio.It also refactors the eligibility check for readability — the enumeration-named
hasSelectivePredicateOrLocalOrCheckpointedInputis split into two intent-named predicates,hasSelectivePredicateandhasMaterializedInput, combined inhasPartitionPruningFilter.Why are the changes needed?
pruningHasBenefitestimates the filtering ratio from column statistics, falling back tospark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio(default0.5) — a value documented as the ratio to use "when CBO stats are missing, but there is a predicate that is likely to be selective". A materialized filtering side carries no such predicate and typically has no column statistics, so it is assumed 50%-selective and treated as beneficial regardless of its actual selectivity.Consequently, with
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly = false, a materialized side that covers all (or most) of the probe table's partitions — and therefore prunes nothing — is still injected as an always-applied subquery that re-executes the filtering side for no benefit. Before SPARK-54593 such a side was not DPP-eligible at all, so this is a regression in master.The broadcast-reuse path (the feature's intended use, where the cost is negligible because the build side is already being broadcast for the join) is left unchanged; only the no-benefit standalone-subquery case is avoided. A materialized side carries no evidence of selectivity, so the conservative choice is to apply it only when it can ride an existing broadcast.
Does this PR introduce any user-facing change?
No. It only avoids planning a no-benefit dynamic partition pruning subquery for a materialized build side; query results are unchanged.
How was this patch tested?
New unit test in
DynamicPartitionPruningSuite: with broadcast joins disabled andreuseBroadcastOnly = false, a materialized filtering side that covers every partition no longer triggers a DPP subquery. The existing positive tests for materialized filtering sides (which use broadcast reuse) continue to pass.Was this patch authored or co-authored using generative AI tooling?
Generated-by: Claude Code (Anthropic Claude Opus)