[SPARK-57521][ML][CONNECT] Exclude parent from Model.estimatedSize to fix overcounting in ML cache#56584
Open
mkincaid wants to merge 1 commit into
Open
Conversation
… fix overcounting in ML cache
Author
|
Fixed actions configuration on my fork. Closing and reopening to trigger the checks to rerun |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What changes were proposed in this pull request?
This patch unsets
parentbefore calling theSizeEstimator.Why are the changes needed?
Currently
SizeEstimatorincludes the size of theSparkSessionbecause it traverses theparentobject which (in the case of many estimators that use DataFrame operations when fitting, likeStringIndexer) eventually refers to the session. The session is there anyway and its size isn't attributable to fitting this specific model (and this results in double-counting when more models are fit), so it shouldn't be included in the size estimate.The impact of the bug is largest when the
SparkSessionis large. For example, in Databricks, my testing shows that a 300-800M SparkSession is typical. In some configurations, like Databricks serverless, the size limit for a single model object might be 256M, so this bug causes such models to fail to train regardless of the state of the cache otherwise.The Jira ticket includes a simple script that reproduces the condition locally, though the session is much smaller in that case (maybe 300k).
Does this PR introduce any user-facing change?
Yes, a favorable one, in that the model cache would fill less quickly (and the reported sizes of cached models would be smaller, if they are among the affected models).
How was this patch tested?
A test is added: training a
StringIndexershould estimate at no larger than 50k, in the trivial test case with 3 strings. This test fails before the patch and passes after it. Another similar test is provided forMinMaxScaler. AModelSuiteis added to hold these since the bug is at theModellevel, not that of individual models (so theStringIndexerandMinMaxScalersuites aren't really the right place for these tests, although they are examples).Was this patch authored or co-authored using generative AI tooling?
Yes, the bug was discovered and initial patch/tests were created by pair programming with Claude. I wrote the bug/docs myself and validated the approach and final patch.
Generated-by: Claude Opus 4.6