[SPARK-57517][SQL] Fix schema_of_json to return proper error on non-string literal input #56582
jubins wants to merge 2 commits into
…tring literal input
|
LGTM overall — the fix is a faithful port of the SPARK-52234 change already in One thing to fix before merge: the golden files need to be regenerated. CI ( Please regenerate rather than hand-editing: Then double-check the diff — only the two |
Thanks for the review! Fixed, updated stopIndex from 24 to 25 in both results/json-functions.sql.out and analyzer-results/json-functions.sql.out. That was the only diff. |
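For readers checking the corrected value: a minimal sketch of the arithmetic, assuming the golden files record the query context as a 1-based, inclusive character range and that the reported fragment is `schema_of_json(42)` inside `select schema_of_json(42)`. Both are assumptions, not details quoted in this thread, and `StopIndexSketch` is a hypothetical helper, not part of Spark.

```scala
object StopIndexSketch {
  // Returns the 1-based, inclusive (start, stop) positions of `fragment` in `query`.
  // Assumes `fragment` occurs in `query`; indexOf would return -1 otherwise.
  def range(query: String, fragment: String): (Int, Int) = {
    val start = query.indexOf(fragment) + 1 // convert the 0-based offset to 1-based
    (start, start + fragment.length - 1)    // inclusive end position
  }

  def main(args: Array[String]): Unit = {
    println(range("select schema_of_json(42)", "schema_of_json(42)"))
  }
}
```

Under those assumptions the fragment spans positions 8 through 25, which agrees with the regenerated `stopIndex` of 25 rather than 24.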
|
@HyukjinKwon thanks for the approval. I don't have merge access, would you be able to help? |
|
+1, LGTM. Merging to master/4.x. |
…tring literal input

## What is the purpose of the change

Fixes [SPARK-57517](https://issues.apache.org/jira/browse/SPARK-57517): `schema_of_json` throws a `ClassCastException` during analysis when called with a non-string literal (e.g., `SELECT schema_of_json(42)`), instead of surfacing a clean `DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE` error.

The root cause is in `SchemaOfJson.checkInputDataTypes()`: it references a `lazy val json = child.eval().asInstanceOf[UTF8String]` before verifying that the child's type is `StringType`. For an integer literal, the `asInstanceOf[UTF8String]` cast throws `ClassCastException` at analysis time rather than producing a user-facing error.

The companion functions `schema_of_csv` and `schema_of_xml` were fixed for the same issue in SPARK-52234, but `schema_of_json` was missed. This PR applies the same fix: restructuring `checkInputDataTypes` to check `!foldable` → `eval() == null` → `dataType != StringType` in safe order, and removing the unsafe lazy val entirely.

## Brief change log

- `SchemaOfJson.checkInputDataTypes()`: removed the `lazy val json` that performed an unsafe `asInstanceOf[UTF8String]` cast; restructured the condition chain to check for non-foldable input, null input, and wrong type (adding a new `UNEXPECTED_INPUT_TYPE` branch) before delegating to `super.checkInputDataTypes()`
- Added `select schema_of_json(42)` to `json-functions.sql` input
- Added corresponding `DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE` expected entries to `analyzer-results/json-functions.sql.out` and `results/json-functions.sql.out`

## Verifying this change

This change is covered by golden file SQL query tests in `SQLQueryTestSuite`:

- `select schema_of_json(42)`: verifies that a non-string integer literal produces `DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE` at analysis time (previously threw `ClassCastException`)
- Existing tests for `schema_of_json(null)` and `schema_of_json(nonFoldableColumn)` continue to pass, confirming the null and non-foldable branches are unaffected

## Does this pull request potentially affect one of the following parts

- **Dependencies** (does it add or upgrade a dependency): no
- **The public API, i.e., is any changed class annotated with `Public(Evolving)`**: no; `SchemaOfJson` is an internal catalyst expression
- **The serializers**: no
- **The runtime per-record code paths (performance sensitive)**: no; only affects the analysis-time type check path
- **Anything that affects deployment or recovery**: no
- **The S3 file system connector**: no

## Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? not applicable

## Was generative AI tooling used to co-author this PR?

- [x] Yes: Claude Code was used as a pair-programming assistant. All code was written, understood, and verified by the author.

Generated-by: Claude Opus 4.8

Closes #56582 from jubins/j-SPARK-57517-fix-class-cast-exception.

Authored-by: Jubin Soni <jubinsoni27@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 7ce5ae7)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
|
@jubins Congratulations on your first contribution to Apache Spark! |
What is the purpose of the change

Fixes SPARK-57517: `schema_of_json` throws a `ClassCastException` during analysis when called with a non-string literal (e.g., `SELECT schema_of_json(42)`), instead of surfacing a clean `DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE` error.

The root cause is in `SchemaOfJson.checkInputDataTypes()`: it references a `lazy val json = child.eval().asInstanceOf[UTF8String]` before verifying that the child's type is `StringType`. For an integer literal, the `asInstanceOf[UTF8String]` cast throws `ClassCastException` at analysis time rather than producing a user-facing error.

The companion functions `schema_of_csv` and `schema_of_xml` were fixed for the same issue in SPARK-52234, but `schema_of_json` was missed. This PR applies the same fix: restructuring `checkInputDataTypes` to check `!foldable` → `eval() == null` → `dataType != StringType` in safe order, and removing the unsafe lazy val entirely.

Brief change log

- `SchemaOfJson.checkInputDataTypes()`: removed the `lazy val json` that performed an unsafe `asInstanceOf[UTF8String]` cast; restructured the condition chain to check for non-foldable input, null input, and wrong type (adding a new `UNEXPECTED_INPUT_TYPE` branch) before delegating to `super.checkInputDataTypes()`
- Added `select schema_of_json(42)` to `json-functions.sql` input
- Added corresponding `DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE` expected entries to `analyzer-results/json-functions.sql.out` and `results/json-functions.sql.out`

Verifying this change

This change is covered by golden file SQL query tests in `SQLQueryTestSuite`:

- `select schema_of_json(42)`: verifies that a non-string integer literal produces `DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE` at analysis time (previously threw `ClassCastException`)
- Existing tests for `schema_of_json(null)` and `schema_of_json(nonFoldableColumn)` continue to pass, confirming the null and non-foldable branches are unaffected

Does this pull request potentially affect one of the following parts

- Dependencies (does it add or upgrade a dependency): no
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no; `SchemaOfJson` is an internal catalyst expression
- The serializers: no
- The runtime per-record code paths (performance sensitive): no; only affects the analysis-time type check path
- Anything that affects deployment or recovery: no
- The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no
If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?

- [x] Yes: Claude Code was used as a pair-programming assistant. All code was written, understood, and verified by the author.

Generated-by: Claude Opus 4.8
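
The check ordering described in the PR can be illustrated with a small, self-contained sketch. This is not the Spark source: `Expr`, `CheckResult`, `Mismatch`, and both `check*` functions are hypothetical stand-ins, `String` stands in for `UTF8String`, and the shape of the old condition chain and the `NON_FOLDABLE_INPUT` and `UNEXPECTED_NULL` labels are assumptions made for illustration. Only `UNEXPECTED_INPUT_TYPE` is named in the PR text.

```scala
import scala.util.Try

// Simplified stand-ins for Catalyst types; none of these are Spark's real classes.
object SchemaOfJsonCheckSketch {
  sealed trait DataType
  case object StringType extends DataType
  case object IntegerType extends DataType

  // Hypothetical miniature expression: a type, a foldable flag, and a constant value.
  final case class Expr(dataType: DataType, foldable: Boolean, value: Any) {
    def eval(): Any = value
  }

  sealed trait CheckResult
  case object Success extends CheckResult
  final case class Mismatch(errorSubClass: String) extends CheckResult

  // Assumed old shape: the lazy val casts the evaluated child before its type is verified,
  // so a foldable non-string child fails with ClassCastException instead of a Mismatch.
  def checkUnsafe(child: Expr): CheckResult = {
    lazy val json: String = child.eval().asInstanceOf[String]
    if (child.foldable && json != null) Success
    else if (!child.foldable) Mismatch("NON_FOLDABLE_INPUT")
    else Mismatch("UNEXPECTED_NULL")
  }

  // New shape: non-foldable first, then null, then wrong type; no cast is involved.
  def checkSafe(child: Expr): CheckResult = {
    if (!child.foldable) Mismatch("NON_FOLDABLE_INPUT")
    else if (child.eval() == null) Mismatch("UNEXPECTED_NULL")
    else if (child.dataType != StringType) Mismatch("UNEXPECTED_INPUT_TYPE")
    else Success
  }

  def main(args: Array[String]): Unit = {
    val intLiteral = Expr(IntegerType, foldable = true, value = 42)
    println(Try(checkUnsafe(intLiteral))) // a Failure wrapping the ClassCastException
    println(checkSafe(intLiteral))        // a Mismatch value instead of an exception
  }
}
```

In this model the integer literal never reaches a cast in `checkSafe`, so the type mismatch comes back as an ordinary result rather than an exception, while the non-foldable and null branches return the same results as before.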