Skip to content

[SPARK-54593][SQL] Fix DPP eligibility for materialized filtering sides#56535

Open
sunchao wants to merge 1 commit into
apache:masterfrom
sunchao:dev/chao/codex/dpp-materialized-input-correctness
Open

[SPARK-54593][SQL] Fix DPP eligibility for materialized filtering sides#56535
sunchao wants to merge 1 commit into
apache:masterfrom
sunchao:dev/chao/codex/dpp-materialized-input-correctness

Conversation

@sunchao

@sunchao sunchao commented Jun 16, 2026

Copy link
Copy Markdown
Member

Why are the changes needed?

PR #56071 allowed dynamic partition pruning (DPP) when the filtering side contained locally available rows (LocalRelation) or a checkpoint-derived LogicalRDD. That check was too broad: a materialized leaf does not guarantee that the operators above it return the same rows every time they run.

For example, consider a fact table partitioned by p and a filtering plan that applies stateful user code above a checkpoint:

// The fact table has rows in partitions p = 1 and p = 2.
val keys = checkpointedInput.mapPartitions { _ =>
  Iterator(counter.incrementAndGet())
}

fact.join(keys, Seq("p"))

The original query contains one evaluation of keys, but DPP may introduce another evaluation to decide which fact-table partitions to scan:

  1. The DPP evaluation sees key 1 and prunes the fact table to partition p = 1.
  2. The join evaluation sees key 2 and needs the row from partition p = 2.
  3. Partition p = 2 has already been pruned, so Spark can incorrectly return no rows.

The same mismatch can happen when DPP binds to a matching sibling broadcast elsewhere in the physical plan. A subquery or other non-repeatable operator above the materialized leaf has the same fundamental problem.

There is a second issue with lazy checkpoints. The checkpoint marker records that a LogicalRDD came from checkpoint() or localCheckpoint(), but a lazy checkpoint is not actually materialized until its first action completes. Treating the marker alone as proof of materialization can therefore duplicate the original upstream computation before its lineage has been truncated.

This is a follow-up to #56071. The materialized-input approach originated in #53263 (SPARK-54554) and was extended to LocalRelation and LogicalRDD in #53324 (SPARK-54593). This follow-up credits @mc8max and @dwsmith1983 as co-authors, as requested in the attribution discussion on #56071.

What changes were proposed in this PR?

  • Require a checkpoint-derived LogicalRDD to be both provenance-marked and actually materialized according to RDD.isCheckpointed.
  • For the materialized-input eligibility path, require the complete filtering plan—not just one leaf—to be repeatable.
  • Use a deliberately narrow allowlist: materialized leaves may be composed through deterministic Catalyst Project, Filter, Union, and SubqueryAlias nodes.
  • Reject subqueries, user-defined or non-SQL expressions, generators, mixed materialized/non-materialized inputs, and unknown logical operators.
  • Preserve standalone DPP for safe local and checkpointed filtering plans instead of requiring broadcast reuse.
  • Add regression coverage for mixed materialization, non-repeatable mapPartitions, scalar subqueries, standalone DPP, lazy checkpoint materialization, and the sibling-broadcast wrong-result shape with adaptive execution both disabled and enabled.

This only narrows the materialized-input eligibility added on unreleased master. The older DPP path for plans with a selective Filter is unchanged.

Generated-by: OpenAI Codex

How was this PR tested?

  • build/sbt 'sql/testOnly org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOff org.apache.spark.sql.DynamicPartitionPruningV1SuiteAEOn' (82 passed, 2 ignored)
  • build/sbt 'sql/testOnly org.apache.spark.sql.DatasetSuite -- -z "Dataset.checkpoint() - basic"' (4 passed)
  • build/sbt sql/scalastyle sql/Test/scalastyle (0 errors and 0 warnings)

Co-authored-by: Tri Tam Hoang <tritam.hoang@gmail.com>
Co-authored-by: Dustin Smith <Dustin.William.Smith@gmail.com>
@sunchao

sunchao commented Jun 19, 2026

Copy link
Copy Markdown
Member Author

cc @cloud-fan @MaxGekk @viirya would appreciate reviews on this follow-up, especially alongside #56603.

Both address regressions from #56071, but at different layers:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant