[SPARK-56573][SQL] Widen the default tablesample seed to reduce collisions #56608

Open
wilmerdooley wants to merge 1 commit into apache:master from wilmerdooley:oss/spark-56573

@wilmerdooley wilmerdooley commented Jun 19, 2026

This PR addresses SPARK-56573.

What changes were proposed in this pull request?

When a sample or TABLESAMPLE runs without an explicit seed, Spark resolved the default seed via (math.random() * 1000).toLong, which produces only 1000 distinct values (0 to 999). This change replaces that expression at both call sites with Utils.random.nextLong(), which draws from the full Long range:

  • sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala: SampleExec.resolvedSeed now defaults to Utils.random.nextLong() (and adds the Utils import).
  • sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala: pushDownSample applies the same default so the two code paths stay consistent, and removes the now-stale TODO(SPARK-56573) comment above the call.

The explicit-seed path (Some(seed), including TABLESAMPLE ... REPEATABLE(n) and DataFrame.sample(seed = ...)) is unchanged, as are the seed type (Long) and the pushed SEED(...) explain text.
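A minimal sketch of the before and after defaults, for illustration only. It does not use Spark: java.util.Random stands in for Utils.random (an assumption), and the object name, the helper resolveSeed, and its Option[Long] parameter are hypothetical names that mirror the Some(seed) path described above rather than the actual Spark signatures.

```scala
import java.util.Random

object DefaultSeedSketch {
  // Stand-in for Spark's shared generator; an assumption for this sketch.
  private val rng = new Random()

  // Previous default: a Double in [0, 1000) truncated to a Long in 0..999.
  def oldDefaultSeed(): Long = (math.random() * 1000).toLong

  // New default: a draw from the generator's Long output.
  def newDefaultSeed(): Long = rng.nextLong()

  // Hypothetical helper: an explicit seed is returned untouched,
  // and only a missing seed falls back to the random default.
  def resolveSeed(explicit: Option[Long]): Long =
    explicit.getOrElse(newDefaultSeed())

  def main(args: Array[String]): Unit = {
    val oldSeeds = Seq.fill(10000)(oldDefaultSeed())
    val newSeeds = Seq.fill(10000)(newDefaultSeed())
    println(s"old: min=${oldSeeds.min} max=${oldSeeds.max} distinct=${oldSeeds.distinct.size}")
    println(s"new: distinct=${newSeeds.distinct.size}")
    println(s"explicit seed preserved: ${resolveSeed(Some(42L)) == 42L}")
  }
}
```

The old expression can never yield more than 1000 distinct seeds however many times it is called, while the explicit-seed branch is untouched by the change.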

Why are the changes needed?

A 1000-value default-seed space means that independent sample queries that do not set a seed frequently draw the same seed, which weakens the statistical independence expected of separate samples. Widening the default to the full Long range reduces those collisions. The in-tree TODO(SPARK-56573) already flagged this and asked for a fix at both call sites.
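To put a rough number on the collision rate, the sketch below computes the birthday-problem probability that at least two of n seedless queries share a default seed. It assumes seeds are independent and uniform over the seed space; treating the new default as uniform over 2^64 values is an idealization, and the object and function names are hypothetical.

```scala
object SeedCollisionSketch {
  // Probability that at least two of `draws` independent, uniform picks
  // from `space` equally likely seeds coincide (the birthday problem).
  def collisionProbability(draws: Int, space: Double): Double = {
    // log P(no collision) = sum over k in [0, draws) of log(1 - k / space)
    val logNoCollision =
      (0 until draws).map(k => java.lang.Math.log1p(-k / space)).sum
    // 1 - exp(x), computed via expm1 to stay accurate for tiny probabilities
    -java.lang.Math.expm1(logNoCollision)
  }

  def main(args: Array[String]): Unit = {
    val oldSpace = 1000.0                        // seeds 0..999
    val newSpace = java.lang.Math.pow(2.0, 64)   // idealized full Long range
    for (n <- Seq(10, 38, 100)) {
      val oldP = collisionProbability(n, oldSpace)
      val newP = collisionProbability(n, newSpace)
      println(f"$n%4d queries: old=$oldP%.4f new=$newP%.3e")
    }
  }
}
```

Under these assumptions, the 1000-value space passes a 50% collision chance at 38 seedless queries, while the same number of draws from a 64-bit space collides with a probability that is negligible in practice.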

Does this PR introduce any user-facing change?

No. Behavior changes only for samples that do not specify a seed, where the default seed is now drawn from a wider range; results were already non-deterministic in that case. Explicit-seed and REPEATABLE behavior is unchanged.

How was this patch tested?

Existing sql/core tests that pin sample behavior with explicit seeds continue to pass. They were run with build/sbt -Phadoop-3 "sql/testOnly org.apache.spark.sql.connector.DataSourceV2TableSampleSuite org.apache.spark.sql.DataFrameStatSuite", which covers the DSv2 pushdown path and the SampleExec path. No new test asserts the default-seed range, since the default seed is non-deterministic by design and a distinct-count assertion would be flaky.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: OpenAI Codex (GPT-5.5)

Signed-off-by: wilmerdooley <wilmerdooley1@gmail.com>
@wilmerdooley wilmerdooley marked this pull request as draft June 19, 2026 01:32
@wilmerdooley wilmerdooley marked this pull request as ready for review June 19, 2026 01:34