[SPARK-56573][SQL] Widen the default tablesample seed to reduce collisions#56608
Open
wilmerdooley wants to merge 1 commit into
Open
[SPARK-56573][SQL] Widen the default tablesample seed to reduce collisions#56608wilmerdooley wants to merge 1 commit into
wilmerdooley wants to merge 1 commit into
Conversation
Signed-off-by: wilmerdooley <wilmerdooley1@gmail.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This PR addresses SPARK-56573.
What changes were proposed in this pull request?
When a sample or
TABLESAMPLEruns without an explicit seed, Spark resolved the default seed via(math.random() * 1000).toLong, which only produces about 1000 distinct values (0 to 999). This change replaces that expression at both call sites withUtils.random.nextLong(), which draws from the fullLongrange:sql/core/src/main/scala/org/apache/spark/sql/execution/basicPhysicalOperators.scala:SampleExec.resolvedSeednow defaults toUtils.random.nextLong()(and adds theUtilsimport).sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala:pushDownSampleapplies the same default so the two code paths stay consistent, and removes the now-staleTODO(SPARK-56573)comment above the call.The explicit-seed path (
Some(seed), includingTABLESAMPLE ... REPEATABLE(n)andDataFrame.sample(seed = ...)) is unchanged, as are the seed type (Long) and the pushedSEED(...)explain text.Why are the changes needed?
A 1000-value default-seed space means independent sample queries that do not set a seed collide on the same seed often, which weakens the statistical independence expected of separate samples. Widening the default to the full
Longrange reduces those collisions. The in-treeTODO(SPARK-56573)already flagged this and asked for it to be fixed across both call sites.Does this PR introduce any user-facing change?
No. Behavior changes only for samples that do not specify a seed, where the default seed is now drawn from a wider range; results were already non-deterministic in that case. Explicit-seed and
REPEATABLEbehavior is unchanged.How was this patch tested?
Existing
sql/coretests that pin sample behavior with explicit seeds continue to pass, run withbuild/sbt -Phadoop-3 "sql/testOnly org.apache.spark.sql.connector.DataSourceV2TableSampleSuite org.apache.spark.sql.DataFrameStatSuite"(the DSv2 pushdown path and theSampleExecpath). No new test asserts the default-seed range, since the default seed is non-deterministic by design and a distinct-count assertion would be flaky.Was this patch authored or co-authored using generative AI tooling?
Generated-by: OpenAI Codex (GPT-5.5)