[SPARK-56573][SQL] Widen the default tablesample seed to reduce collisions by wilmerdooley · Pull Request #56608 · apache/spark

wilmerdooley · 2026-06-19T01:28:33Z

This PR addresses SPARK-56573.

What changes were proposed in this pull request?

When a sample or TABLESAMPLE runs without an explicit seed, Spark previously resolved the default seed via (math.random() * 1000).toLong, which can produce only 1000 distinct values (0 to 999). This change replaces that expression at both call sites with Utils.random.nextLong(), which draws from the full Long range:

sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala: SampleExec.resolvedSeed now defaults to Utils.random.nextLong() (and adds the Utils import).
sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: pushDownSample applies the same default so the two code paths stay consistent, and removes the now-stale TODO(SPARK-56573) comment above the call.

The explicit-seed path (Some(seed), including TABLESAMPLE ... REPEATABLE(n) and DataFrame.sample(seed = ...)) is unchanged, as are the seed type (Long) and the pushed SEED(...) explain text.
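The difference between the two defaults, and the unchanged explicit-seed path, can be sketched outside Spark as follows. This is a minimal illustration, not the patched code: the object SeedSketch, its method names, and the local java.util.Random instance (standing in for Spark's Utils.random) are assumptions made for this example.

```scala
import java.util.Random

object SeedSketch {
  // Assumption: a local generator standing in for Spark's shared Utils.random.
  private val rng = new Random()

  // Old default: scaling a Double in [0, 1) by 1000 and truncating
  // can only yield the integers 0 to 999.
  def oldDefaultSeed(): Long = (math.random() * 1000).toLong

  // New default: nextLong() returns values spread across the Long range.
  def newDefaultSeed(): Long = rng.nextLong()

  // Hypothetical helper: an explicit seed is used as-is; the default
  // is drawn only when no seed was supplied.
  def resolveSeed(explicit: Option[Long]): Long =
    explicit.getOrElse(newDefaultSeed())
}
```

With this shape, resolveSeed(Some(42L)) always returns 42, which mirrors why REPEATABLE(n) and sample(seed = ...) are unaffected by the change.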

Why are the changes needed?

A default-seed space of only 1000 values means that independent sample queries that do not set a seed frequently draw the same seed, which weakens the statistical independence expected of separate samples. Widening the default to the full Long range reduces those collisions. The in-tree TODO(SPARK-56573) already flagged this and asked for it to be fixed at both call sites.
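To put a rough number on the collision rate, the standard birthday-problem calculation can be applied. The sketch below assumes seeds are drawn independently and uniformly from n equally likely values, which is an idealization of both the old and the new default; the object and method names are hypothetical.

```scala
object SeedCollision {
  // Probability that at least two of k independent, uniform draws from
  // n equally likely seeds coincide: 1 - prod_{i=0}^{k-1} (1 - i/n).
  // The product is accumulated in log space (log1p/expm1) so the result
  // does not round to zero when n is very large.
  def collisionProbability(k: Int, n: Double): Double = {
    val logAllDistinct = (0 until k).map(i => java.lang.Math.log1p(-i / n)).sum
    -java.lang.Math.expm1(logAllDistinct)
  }
}
```

Under these assumptions, collisionProbability(38, 1000) is just above one half, so a few dozen unseeded queries are already more likely than not to include two with the same seed. With n = math.pow(2, 64) the same call returns a value that is negligible for any realistic number of queries.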

Does this PR introduce any user-facing change?

No. Behavior changes only for samples that do not specify a seed, where the default seed is now drawn from a wider range; results were already non-deterministic in that case. Explicit-seed and REPEATABLE behavior is unchanged.

How was this patch tested?

Existing sql/core tests that pin sample behavior with explicit seeds continue to pass. They were run with build/sbt -Phadoop-3 "sql/testOnly org.apache.spark.sql.connector.DataSourceV2TableSampleSuite org.apache.spark.sql.DataFrameStatSuite", which covers the DSv2 pushdown path and the SampleExec path. No new test asserts the default-seed range, since the default seed is non-deterministic by design and a distinct-count assertion would be flaky.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex (GPT-5.5)

Signed-off-by: wilmerdooley <wilmerdooley1@gmail.com>

SPARK-56573: Widen the default tablesample seed to reduce collisions

77e7b89

Signed-off-by: wilmerdooley <wilmerdooley1@gmail.com>

wilmerdooley marked this pull request as draft June 19, 2026 01:32

wilmerdooley marked this pull request as ready for review June 19, 2026 01:34

wilmerdooley wants to merge 1 commit into apache:master from wilmerdooley:oss/spark-56573
