[SPARK-57517][SQL] Fix schema_of_json to return proper error on non-string literal input #56582

Closed
jubins wants to merge 2 commits into apache:master from jubins:j-SPARK-57517-fix-class-cast-exception

@jubins jubins commented Jun 18, 2026

Contributor

What is the purpose of the change

Fixes SPARK-57517: schema_of_json throws a ClassCastException during analysis when called with a non-string literal (e.g., SELECT schema_of_json(42)), instead of surfacing a clean DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE error.

The root cause is in SchemaOfJson.checkInputDataTypes(): it references a lazy val json = child.eval().asInstanceOf[UTF8String] before verifying that the child's type is StringType. For an integer literal, the asInstanceOf[UTF8String] cast throws ClassCastException at analysis time rather than producing a user-facing error.
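
The failure mode can be reproduced without Spark. The following is a minimal sketch, not Spark's code: `CastDemo`, `ToyLiteral`, and `UnsafeCheck` are hypothetical names, and a plain `String` stands in for `UTF8String`. It illustrates that a `lazy val` only defers the cast until first use, so any check that touches it before the type is verified can still throw.

```scala
object CastDemo {
  // Hypothetical stand-in for a foldable literal: eval() returns an untyped value.
  final case class ToyLiteral(value: Any) {
    def eval(): Any = value
  }

  // Mirrors the shape of the problem: the cast is deferred by `lazy val`,
  // but it runs (and can throw) as soon as `json` is first referenced.
  final class UnsafeCheck(child: ToyLiteral) {
    lazy val json: String = child.eval().asInstanceOf[String]

    // Referencing `json` here forces the cast before the type is verified.
    def inputIsNull: Boolean = json == null
  }

  // Returns the simple name of the exception thrown by the check, if any.
  def failureOf(child: ToyLiteral): Option[String] =
    try {
      new UnsafeCheck(child).inputIsNull
      None
    } catch {
      case e: ClassCastException => Some(e.getClass.getSimpleName)
    }
}
```

In this toy model, `CastDemo.failureOf(CastDemo.ToyLiteral(42))` reports a `ClassCastException`, while a string literal passes through without error.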

The companion functions schema_of_csv and schema_of_xml were fixed for the same issue in SPARK-52234, but schema_of_json was missed. This PR applies the same fix: restructuring checkInputDataTypes to check !foldable → eval() == null → dataType != StringType in safe order, and removing the unsafe lazy val entirely.
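
The safe ordering described above can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual catalyst implementation: `SafeOrderDemo`, `ToyExpr`, `ToyType`, and the result types are hypothetical, and the mismatch reasons other than `UNEXPECTED_INPUT_TYPE` are placeholder labels rather than Spark error classes.

```scala
object SafeOrderDemo {
  // Hypothetical, simplified types; Spark's real DataType hierarchy is much richer.
  sealed trait ToyType
  case object ToyString extends ToyType
  case object ToyInt extends ToyType

  // Hypothetical stand-in for a child expression.
  final case class ToyExpr(dataType: ToyType, foldable: Boolean, value: Any) {
    def eval(): Any = value
  }

  sealed trait CheckResult
  case object Ok extends CheckResult
  final case class Mismatch(reason: String) extends CheckResult

  // Each branch only inspects the child; nothing is cast along the way,
  // so an integer literal reaches the type branch instead of throwing.
  def check(child: ToyExpr): CheckResult =
    if (!child.foldable) Mismatch("non-foldable input")
    else if (child.eval() == null) Mismatch("null input")
    else if (child.dataType != ToyString) Mismatch("UNEXPECTED_INPUT_TYPE")
    else Ok
}
```

With this ordering, an integer literal such as `ToyExpr(ToyInt, foldable = true, value = 42)` yields `Mismatch("UNEXPECTED_INPUT_TYPE")` rather than an exception, and the null and non-foldable cases keep their own branches.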

Brief change log

  • SchemaOfJson.checkInputDataTypes(): removed the lazy val json that performed an unsafe asInstanceOf[UTF8String] cast; restructured the condition chain to check for non-foldable input, null input, and wrong type (adding a new UNEXPECTED_INPUT_TYPE branch) before delegating to super.checkInputDataTypes()
  • Added select schema_of_json(42) to json-functions.sql input
  • Added corresponding DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE expected entries to analyzer-results/json-functions.sql.out and results/json-functions.sql.out

Verifying this change

This change is covered by golden file SQL query tests in SQLQueryTestSuite:

  • select schema_of_json(42) — verifies that a non-string integer literal produces DATATYPE_MISMATCH.UNEXPECTED_INPUT_TYPE at analysis time (previously threw ClassCastException)
  • Existing tests for schema_of_json(null) and schema_of_json(nonFoldableColumn) continue to pass, confirming the null and non-foldable branches are unaffected

Does this pull request potentially affect one of the following parts

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no — SchemaOfJson is an internal catalyst expression
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no — only affects the analysis-time type check path
  • Anything that affects deployment or recovery: no
  • The S3 file system connector: no

Documentation

Does this pull request introduce a new feature? no

If yes, how is the feature documented? not applicable

Was generative AI tooling used to co-author this PR?

  • Yes — Claude Code was used as a pair-programming assistant. All code was written, understood, and verified by the author.

Generated-by: Claude Opus 4.8

@MaxGekk

MaxGekk commented Jun 18, 2026

Member

LGTM overall — the fix is a faithful port of the SPARK-52234 change already in schema_of_csv / schema_of_xml (the condition chain is character-identical to both siblings, removing the unsafe asInstanceOf[UTF8String] cast), and the added schema_of_json(42) test mirrors the csv/xml suites.

One thing to fix before merge: the golden files need to be regenerated. CI (sql - extended tests) fails on select schema_of_json(42) because the hardcoded queryContext offset is off by one — the files have "stopIndex" : 24, but the analyzer emits 25 (the closing ) of schema_of_json(42) is at column 25, and the fragment string already spans 8→25, so 24 is internally inconsistent). This is the usual symptom of hand-editing a golden file instead of regenerating it.
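
The offset arithmetic can be checked by hand or with a few lines of code. The sketch below assumes the query context uses 1-based, inclusive positions, which is what the 8→25 span in this comment implies; `QueryContextDemo.span` is a hypothetical helper, not a Spark API.

```scala
object QueryContextDemo {
  // Returns the 1-based, inclusive (start, stop) positions of the first
  // occurrence of `fragment` in `query`, or None if it does not occur.
  def span(query: String, fragment: String): Option[(Int, Int)] = {
    val zeroBased = query.indexOf(fragment)
    if (zeroBased < 0 || fragment.isEmpty) None
    else Some((zeroBased + 1, zeroBased + fragment.length))
  }
}
```

For `QueryContextDemo.span("select schema_of_json(42)", "schema_of_json(42)")` this gives `Some((8, 25))`: the fragment starts at position 8 and is 18 characters long, so it cannot end at 24.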

Please regenerate rather than hand-editing:

SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z json-functions.sql"

Then double-check the diff — only the two stopIndex values (24 → 25) in results/json-functions.sql.out and analyzer-results/json-functions.sql.out should change.

@jubins

jubins commented Jun 18, 2026

Contributor Author

Thanks for the review! Fixed, updated stopIndex from 24 to 25 in both results/json-functions.sql.out and analyzer-results/json-functions.sql.out. That was the only diff.

@jubins

jubins commented Jun 19, 2026

Contributor Author

@HyukjinKwon thanks for the approval. I don't have merge access, would you be able to help?

@MaxGekk

MaxGekk commented Jun 19, 2026

Member

+1, LGTM. Merging to master/4.x.
Thank you, @jubins and @HyukjinKwon for review.

@MaxGekk MaxGekk closed this in 7ce5ae7 Jun 19, 2026
MaxGekk pushed a commit that referenced this pull request Jun 19, 2026
…tring literal input

Closes #56582 from jubins/j-SPARK-57517-fix-class-cast-exception.

Authored-by: Jubin Soni <jubinsoni27@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
(cherry picked from commit 7ce5ae7)
Signed-off-by: Max Gekk <max.gekk@gmail.com>
@MaxGekk

MaxGekk commented Jun 19, 2026

Member

@jubins Congratulations on your first contribution to Apache Spark!
