[Test] Add operator-level determinism matrix for TORCH_DETERMINISTIC#37
Open
Young-Leo wants to merge 1 commit into
Verify that high-risk torch operators relevant to LLM and video/diffusion models are bitwise-reproducible under use_deterministic_algorithms(True), exploiting the compositionality of determinism to validate at the operator layer rather than per model. For each operator, the contract test asserts that, with the flag on, N runs on identical inputs either produce bitwise-identical results or raise a RuntimeError (no deterministic implementation), never silent drift; ops without a deterministic kernel are additionally re-run under warn_only to report fallback reproducibility. A report-only probe also runs each operator with the flag off to record which operators the flag actually rescues. Cases: scatter/index_add/scatter_reduce/index_put, embedding and cross_entropy backward, interpolate/grid_sample backward, SDPA fwd/bwd, and matmul/cumsum/sort/topk references.
What
Add tests/feature_tests/test_torch_op_determinism_matrix.py: an operator-level determinism matrix verifying that high-risk torch operators relevant to LLM and video/diffusion models are bitwise-reproducible under torch.use_deterministic_algorithms(True).

Determinism is compositional, so the operator layer is validated directly rather than per model.
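A minimal sketch of how the flag could be scoped to a single check and then restored, so that one case does not leak its setting into the next. The helper name deterministic_mode is hypothetical and is not taken from the test file in this PR; it relies only on torch.use_deterministic_algorithms, torch.are_deterministic_algorithms_enabled, and torch.is_deterministic_algorithms_warn_only_enabled.

```python
import contextlib


@contextlib.contextmanager
def deterministic_mode(enabled: bool = True, warn_only: bool = False):
    """Temporarily set torch's deterministic-algorithms flag, then restore it.

    Hypothetical helper for illustration; not the PR's actual implementation.
    """
    import torch  # imported lazily so the helper can be defined without torch

    # Remember the global state so it can be restored afterwards.
    prev_enabled = torch.are_deterministic_algorithms_enabled()
    prev_warn_only = torch.is_deterministic_algorithms_warn_only_enabled()
    torch.use_deterministic_algorithms(enabled, warn_only=warn_only)
    try:
        yield
    finally:
        torch.use_deterministic_algorithms(prev_enabled, warn_only=prev_warn_only)
```

Usage would look like `with deterministic_mode(True): ...` for the strict run and `with deterministic_mode(True, warn_only=True): ...` for the fallback measurement.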
How
For each operator:
- Contract test: with the flag on (… workspace, fixed seed), N runs on identical inputs must be either bitwise identical or raise a RuntimeError (no deterministic implementation). Silent drift fails the test. Ops with no deterministic kernel are re-run under warn_only=True to report fallback reproducibility (measured, not skipped).
- Report-only probe: records which operators the flag actually rescues (hardware/build dependent, no assert).
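The contract described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the test file: the names classify_runs, assert_contract, and index_add_case are hypothetical, and matching the error by the substring "deterministic" in the RuntimeError message is an assumption about how a missing deterministic kernel would be recognized.

```python
from typing import Callable, List


def classify_runs(run_once: Callable[[], bytes], n_runs: int = 5) -> str:
    """Run the case n_runs times; return 'identical', 'no_deterministic_impl', or 'drift'."""
    outputs: List[bytes] = []
    for _ in range(n_runs):
        try:
            outputs.append(run_once())
        except RuntimeError as exc:
            # Assumption: a missing deterministic kernel mentions "deterministic".
            if "deterministic" in str(exc):
                return "no_deterministic_impl"
            raise  # any other RuntimeError is a genuine failure
    # Bitwise comparison: raw bytes must match the first run exactly.
    return "identical" if all(o == outputs[0] for o in outputs) else "drift"


def assert_contract(run_once: Callable[[], bytes], n_runs: int = 5) -> str:
    """Fail on silent drift; identical results or a deterministic-mode error both pass."""
    outcome = classify_runs(run_once, n_runs)
    assert outcome != "drift", "silent drift under deterministic mode"
    return outcome


def index_add_case() -> Callable[[], bytes]:
    """Hypothetical torch case (requires torch; uses CUDA when available)."""
    import torch

    device = "cuda" if torch.cuda.is_available() else "cpu"
    gen = torch.Generator().manual_seed(0)  # fixed seed, inputs built once
    index = torch.randint(0, 64, (4096,), generator=gen).to(device)
    source = torch.randn(4096, 32, generator=gen).to(device)

    def run_once() -> bytes:
        out = torch.zeros(64, 32, device=device)
        out.index_add_(0, index, source)  # many sources accumulate into each row
        return out.cpu().numpy().tobytes()

    return run_once
```

With torch installed, `assert_contract(index_add_case())` would be called after enabling torch.use_deterministic_algorithms(True). Returning raw bytes keeps the comparison strictly bitwise and avoids tolerance-based equality.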
Operators covered
scatter / index_add / scatter_reduce / index_put · embedding & cross_entropy
backward · interpolate / grid_sample backward · SDPA fwd+bwd · matmul / cumsum
/ sort / topk references.
Result (NVIDIA H100, PyTorch 2.9) — 28 passed
The flag rescues index_add, scatter_add, scatter_reduce_sum, and sdpa_backward (drift → identical). grid_sample_backward has no deterministic CUDA implementation: it raises under strict mode and is non-reproducible on fallback. All other operators are reproducible regardless of the flag.
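One way to read these results is as a mapping from a pair of outcomes (flag off, flag on) to a status. The sketch below is illustrative only: rescue_status and the outcome labels are hypothetical names, and the flag-off entry for grid_sample_backward is an assumption, since the description reports drift only for the warn_only fallback.

```python
def rescue_status(flag_off: str, flag_on: str) -> str:
    """Map outcomes ('identical', 'drift', 'raises') to a report status."""
    if flag_on == "raises":
        return "no deterministic kernel"
    if flag_on == "drift":
        return "contract violation"  # silent drift with the flag on
    if flag_off == "drift":
        return "rescued"  # drift -> identical
    return "reproducible regardless"


# Outcomes as (flag_off, flag_on), following the result summary above.
observed = {
    "index_add": ("drift", "identical"),
    "scatter_add": ("drift", "identical"),
    "scatter_reduce_sum": ("drift", "identical"),
    "sdpa_backward": ("drift", "identical"),
    # Assumption: flag-off drift is inferred from the warn_only fallback result.
    "grid_sample_backward": ("drift", "raises"),
    "matmul": ("identical", "identical"),
}

summary = {op: rescue_status(off, on) for op, (off, on) in observed.items()}
for op, status in summary.items():
    print(f"{op}: {status}")
```

Because the probe is hardware and build dependent, a table like this is best treated as a report rather than something to assert on.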
Scope
Eager operators only. Nondeterminism from the compiled path (fused
Inductor/Triton kernels) is out of scope for this layer.