[SPARK-54593][SQL] Fix DPP eligibility for materialized filtering sides by sunchao · Pull Request #56535 · apache/spark

sunchao · 2026-06-16T03:07:10Z

Why are the changes needed?

PR #56071 allowed dynamic partition pruning (DPP) when the filtering side contained locally available rows (LocalRelation) or a checkpoint-derived LogicalRDD. That check was too broad: a materialized leaf does not guarantee that the operators above it return the same rows every time they run.

For example, consider a fact table partitioned by p and a filtering plan that applies stateful user code above a checkpoint:

// The fact table has rows in partitions p = 1 and p = 2.
val keys = checkpointedInput.mapPartitions { _ =>
  Iterator(counter.incrementAndGet())
}

fact.join(keys, Seq("p"))

The original query contains one evaluation of keys, but DPP may introduce another evaluation to decide which fact-table partitions to scan:

The DPP evaluation sees key 1 and prunes the fact table to partition p = 1.
The join evaluation sees key 2 and needs the row from partition p = 2.
Partition p = 2 has already been pruned, so Spark can incorrectly return no rows.

The same mismatch can happen when DPP binds to a matching sibling broadcast elsewhere in the physical plan. A subquery or other non-repeatable operator above the materialized leaf has the same fundamental problem.

There is a second issue with lazy checkpoints. The checkpoint marker records that a LogicalRDD came from checkpoint() or localCheckpoint(), but a lazy checkpoint is not actually materialized until its first action completes. Treating the marker alone as proof of materialization can therefore duplicate the original upstream computation before its lineage has been truncated.

This is a follow-up to #56071. The materialized-input approach originated in #53263 (SPARK-54554) and was extended to LocalRelation and LogicalRDD in #53324 (SPARK-54593). This follow-up credits @mc8max and @dwsmith1983 as co-authors, as requested in the attribution discussion on #56071.

What changes were proposed in this PR?

Require a checkpoint-derived LogicalRDD to be both provenance-marked and actually materialized according to RDD.isCheckpointed.
For the materialized-input eligibility path, require the complete filtering plan—not just one leaf—to be repeatable.
Use a deliberately narrow allowlist: materialized leaves may be composed through deterministic Catalyst Project, Filter, Union, and SubqueryAlias nodes.
Reject subqueries, user-defined or non-SQL expressions, generators, mixed materialized/non-materialized inputs, and unknown logical operators.
Preserve standalone DPP for safe local and checkpointed filtering plans instead of requiring broadcast reuse.
Add regression coverage for mixed materialization, non-repeatable mapPartitions, scalar subqueries, standalone DPP, lazy checkpoint materialization, and the sibling-broadcast wrong-result shape with adaptive execution both disabled and enabled.

This only narrows the materialized-input eligibility added on unreleased master. The older DPP path for plans with a selective Filter is unchanged.

Generated-by: OpenAI Codex

How was this PR tested?

build/sbt 'sql/testOnly org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOn' (82 passed, 2 ignored)
build/sbt 'sql/testOnly org.apache.spark.sql.DatasetSuite -- -z "Dataset.checkpoint() - basic"' (4 passed)
build/sbt sql/scalastyle sql/Test/scalastyle (0 errors and 0 warnings)

Co-authored-by: Tri Tam Hoang <tritam.hoang@gmail.com> Co-authored-by: Dustin Smith <Dustin.William.Smith@gmail.com>

sunchao · 2026-06-19T02:31:11Z

cc @cloud-fan @MaxGekk @viirya would appreciate reviews on this follow-up, especially alongside #56603.

Both address regressions from #56071, but at different layers:

[SPARK-54593][SQL] Fix DPP eligibility for materialized filtering sides #56535 is correctness-focused: it requires checkpoints to be materialized and the complete filtering plan to be repeatable, preventing DPP from pruning with results different from the join.
[SPARK-54593][SQL][FOLLOWUP] Only inject DPP for a materialized filtering side when it can reuse a broadcast #56603 is cost-focused: it avoids unnecessary standalone DPP when a materialized side lacks reliable selectivity evidence.

[SPARK-54593][SQL] Fix DPP eligibility for materialized filtering sides

3a75aca

Co-authored-by: Tri Tam Hoang <tritam.hoang@gmail.com> Co-authored-by: Dustin Smith <Dustin.William.Smith@gmail.com>

sunchao marked this pull request as ready for review June 19, 2026 00:14

sunchao mentioned this pull request Jun 19, 2026

[SPARK-54593][SQL][FOLLOWUP] Only inject DPP for a materialized filtering side when it can reuse a broadcast #56603

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-54593][SQL] Fix DPP eligibility for materialized filtering sides#56535

[SPARK-54593][SQL] Fix DPP eligibility for materialized filtering sides#56535
sunchao wants to merge 1 commit into
apache:masterfrom
sunchao:dev/chao/codex/dpp-materialized-input-correctness

sunchao commented Jun 16, 2026 •

edited

Loading

Uh oh!

sunchao commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sunchao commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why are the changes needed?

What changes were proposed in this PR?

How was this PR tested?

Uh oh!

sunchao commented Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

sunchao commented Jun 16, 2026 •

edited

Loading