Add support for output MapType[StringType, ArrayType[StringType]] in from_json SQL function #15134

Open
ttnghia wants to merge 2 commits into NVIDIA:main from ttnghia:from-json-array

Conversation

@ttnghia

@ttnghia ttnghia commented Jun 24, 2026

Collaborator

This adds support for the output type MapType[StringType, ArrayType[StringType]] in the from_json SQL function. Previously, it could only output MapType[StringType, StringType].

Depends on:

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Please provide the names of the existing tests in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

---------

Signed-off-by: Nghia Truong <nghiat@nvidia.com>
@ttnghia ttnghia self-assigned this Jun 24, 2026
Copilot AI review requested due to automatic review settings June 24, 2026 04:56
@ttnghia ttnghia added the feature request (New feature or request), SQL (part of the SQL/Dataframe plugin), and task (Work required that improves the product but is not user facing) labels Jun 24, 2026

Copilot AI left a comment

Contributor

Pull request overview

Adds GPU support for from_json when the target schema is MAP<STRING,ARRAY<STRING>>, extending the existing MAP<STRING,STRING> path. The implementation uses a “raw extraction” approach (no JSON unescaping/normalization) and documents known CPU/GPU divergences.

Changes:

  • Extend GpuJsonToStructs to extract MAP<STRING,ARRAY<STRING>> via JSONUtils.extractRawMapFromJsonString(..., MapValueType.ARRAY_OF_STRING).
  • Update JsonToStructs GPU override type support/messaging to include MAP<STRING,ARRAY<STRING>>.
  • Add integration tests (including corner cases + non-strict xfail probes) and update docs to describe the new supported map schema and its divergences.
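To make the "raw extraction" wording concrete, here is a minimal pure-Python sketch of the difference between keeping string elements as written and unescaping them with a standard JSON parser. It is only an illustration: the helper names are hypothetical, the regex handles flat arrays of string literals only, and none of this is the plugin's actual kernel.

```python
import json
import re

# Matches one JSON string literal, including its surrounding quotes.
_STRING_LITERAL = re.compile(r'"(?:[^"\\]|\\.)*"')


def raw_string_elements(array_text):
    """Hypothetical helper: string elements of a flat JSON array, escapes left as written."""
    return [m.group(0)[1:-1] for m in _STRING_LITERAL.finditer(array_text)]


def unescaped_string_elements(array_text):
    """The same elements after standard JSON unescaping."""
    return json.loads(array_text)


# The first element contains a backslash followed by 't' in the JSON text.
array_text = r'["tab\there", "plain"]'
raw = raw_string_elements(array_text)
unescaped = unescaped_string_elements(array_text)

print(raw)        # escape sequence kept as two characters
print(unescaped)  # escape sequence turned into a real tab
```

Elements without escape sequences come out identical either way, so any divergence of this kind would only show up on inputs that contain escapes.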

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala | Adds map-value-type dispatch for MAP<STRING,ARRAY<STRING>> vs MAP<STRING,STRING> raw extraction. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala | Expands from_json type signature + tagging logic and updates fallback message for the new map schema. |
| integration_tests/src/main/python/json_test.py | Adds coverage for MAP<STRING,ARRAY<STRING>> including fallback + corner/xfail probes. |
| docs/supported_ops.md | Updates the supported-ops note for from_json MAP output to include ARRAY<STRING> values. |
| docs/compatibility.md | Documents support for MAP<STRING,ARRAY<STRING>> and clarifies the “raw extraction” divergences. |

Comment on lines +67 to +74
// Raw extraction (no unescaping, duplicate keys kept, non-string elements as raw text) --
// see the class doc and docs/compatibility.md.
case MapType(StringType, ArrayType(StringType, _), _) =>
  JSONUtils.extractRawMapFromJsonString(input.getBase, cudfOptions,
    JSONUtils.MapValueType.ARRAY_OF_STRING)
case _: MapType =>
  JSONUtils.extractRawMapFromJsonString(input.getBase, cudfOptions,
    JSONUtils.MapValueType.STRING)
Comment on lines +691 to +696
json_string_gen = StringGen(
    r'{"a": (\[\]|\["[0-9]{0,5}"(, "[0-9]{0,3}"){0,2}\])(, "b": \["[A-Z]{0,5}"\])?}')
assert_gpu_and_cpu_are_equal_collect(
    lambda spark : unary_op_df(spark, json_string_gen) \
        .select(f.from_json(f.col('a'), 'MAP<STRING,ARRAY<STRING>>')),
    conf=_enable_all_types_conf)
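As a rough check of what the generator pattern in the snippet above can produce, the following stdlib-only sketch tests a few hand-written samples against the same regular expression and confirms that every accepted sample parses as an object whose values are arrays of strings. The sample strings and the helper `is_map_of_string_arrays` are made up for illustration, and the outer braces are escaped here for readability; it does not use `StringGen` or Spark.

```python
import json
import re

# Pattern copied from the test snippet, with the outer braces escaped for clarity.
PATTERN = re.compile(
    r'\{"a": (\[\]|\["[0-9]{0,5}"(, "[0-9]{0,3}"){0,2}\])(, "b": \["[A-Z]{0,5}"\])?\}')

samples = [
    '{"a": []}',
    '{"a": ["123"]}',
    '{"a": ["1", "22", "333"], "b": ["AB"]}',
    '{"a": ["1", "2", "3", "4"]}',  # four elements: one more than the pattern allows
    '{"a": "1"}',                   # value is a string, not an array
]

matches = [s for s in samples if PATTERN.fullmatch(s)]


def is_map_of_string_arrays(doc):
    """Hypothetical helper: True if doc is a JSON object of string -> list of strings."""
    parsed = json.loads(doc)
    return isinstance(parsed, dict) and all(
        isinstance(v, list) and all(isinstance(e, str) for e in v)
        for v in parsed.values())


all_well_formed = all(is_map_of_string_arrays(s) for s in matches)
print(matches)
print(all_well_formed)
```

Under this reading the pattern yields only well-formed inputs with at most three elements under "a" and an optional single-element array under "b", which fits a happy-path equality test.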
@greptile-apps

greptile-apps Bot commented Jun 24, 2026

Contributor

Greptile Summary

This PR extends from_json GPU support to emit MAP<STRING, ARRAY<STRING>> output in addition to the previously-supported MAP<STRING, STRING>. The dispatch in GpuJsonToStructs.doColumnar is updated to route the new map shape to JSONUtils.extractRawMapFromJsonString with the ARRAY_OF_STRING value-type enum, and a defensive IllegalArgumentException arm replaces the silent STRING fall-through for unrecognised map shapes.

  • Runtime dispatch: GpuJsonToStructs.doColumnar gets three ordered match arms — ARRAY value, STRING value, and a defensive throw for any other map value type — with the corresponding tagExprForGpu pattern in GpuOverrides extended to accept the new shape.
  • Documentation: docs/compatibility.md and docs/supported_ops.md are updated to describe the raw-extraction semantics, the two known divergences from Spark CPU (escape handling, object/nested-array element re-serialisation), and the cases that do match Spark (scalar array elements, whole-row null on non-array value, document-order duplicate keys).
  • Tests: Five new integration tests cover happy-path randomised data, explicit corner cases (malformed input, nulls, empty arrays, mixed key kinds), and xfail probes for the two documented divergences and the duplicate-key behaviour.

Confidence Score: 5/5

Safe to merge — the new map-array code path follows the same resource and dispatch conventions as the existing string-map path, and the defensive throw replaces the previous silent fall-through.

The change is well-scoped: a new match arm in doColumnar that mirrors the existing STRING arm, an updated TypeSig declaration, and matching tagExprForGpu guards. Known divergences from Spark CPU are documented in compatibility.md and covered by xfail tests. No resource leaks, no OOM-retry regressions relative to the existing code, and the fallback path for non-string array element types is verified by assert_gpu_fallback_collect tests.

No files require special attention.

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala Adds ARRAY_OF_STRING dispatch arm and a defensive IllegalArgumentException for unrecognised map value types; no resource-management or OOM-retry regressions relative to existing code.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala TypeSig updated to include ARRAY as a valid map value type; tagExprForGpu extended with a matching pattern arm and a fallback willNotWorkOnGpu message covering the new shape.
integration_tests/src/main/python/json_test.py Five new tests — randomised happy path, 40+ literal corner cases, fallback for non-string array element types, and xfail probes for the two documented divergences — all using assert_gpu_and_cpu_are_equal_collect or assert_gpu_fallback_collect.
docs/compatibility.md Extends the from_json compatibility note to cover MAP<STRING,ARRAY>, the raw-extraction semantics, the two divergences, and the duplicate-key policy difference across Spark versions.
docs/supported_ops.md Single-line update to the PS note for JsonToStructs describing the expanded map key/value type support.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["from_json(col, schema)"] --> B{TypeSig check\nGpuOverrides}
    B -->|"MAP nested STRING\nor ARRAY<STRING>"| C{tagExprForGpu\npattern match}
    B -->|"STRUCT"| C
    B -->|Other| Z["willNotWorkOnGpu\n→ CPU fallback"]

    C -->|"MapType(StringType, StringType, _)"| D["doColumnar dispatch"]
    C -->|"MapType(StringType, ArrayType(StringType,_), _)"| D
    C -->|"StructType"| D
    C -->|"Other map shape"| Z

    D -->|MapType STRING value| E["JSONUtils.extractRawMapFromJsonString\n(MapValueType.STRING)"]
    D -->|"MapType ARRAY<STRING> value"| F["JSONUtils.extractRawMapFromJsonString\n(MapValueType.ARRAY_OF_STRING)"]
    D -->|"Other MapType"| G["throw IllegalArgumentException\n(defensive guard)"]
    D -->|StructType| H["JSONUtils.fromJSONToStructs\n+ optional datetime conversion"]

    E --> R["cudf.ColumnVector result"]
    F --> R
    H --> R

Reviews (2): Last reviewed commit: "Apply from_json map-of-array code-review..."

Comment thread sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala Outdated
Make the GpuJsonToStructs map dispatch explicit: match MAP<STRING,STRING> directly and throw on any other (currently unreachable) map value type instead of silently extracting it as STRING, so widening the GpuOverrides map gate later fails loudly rather than producing wrong results.
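The explicit-dispatch idea described in this commit message can be sketched in a few lines of Python. This is an analogy only: the real code is a Scala pattern match in GpuJsonToStructs, and the type classes and the function `map_value_kind` below are hypothetical stand-ins, with `ValueError` playing the role of the defensive exception.

```python
from dataclasses import dataclass


# Minimal stand-ins for schema types; these are not Spark classes.
@dataclass(frozen=True)
class StringType:
    pass


@dataclass(frozen=True)
class IntegerType:
    pass


@dataclass(frozen=True)
class ArrayType:
    element_type: object


@dataclass(frozen=True)
class MapType:
    key_type: object
    value_type: object


def map_value_kind(schema):
    """Pick an extraction mode for a map schema; fail loudly on any other shape."""
    if schema.key_type == StringType():
        if schema.value_type == ArrayType(StringType()):
            return "ARRAY_OF_STRING"
        if schema.value_type == StringType():
            return "STRING"
    # No catch-all that silently treats unknown value types as STRING.
    raise ValueError(f"unsupported map schema: {schema!r}")


print(map_value_kind(MapType(StringType(), StringType())))
print(map_value_kind(MapType(StringType(), ArrayType(StringType()))))
```

With a catch-all STRING branch, a schema such as a map with integer-array values would be extracted with the wrong mode if an upstream gate were ever widened; with the explicit raise it surfaces as an error instead.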

Compare the array-valued-map equality test result via map_entries for deterministic ordering, parametrize the array fallback test over ARRAY<INT> and ARRAY<DOUBLE>, and rewrap the GpuJsonToStructs and GpuOverrides comments to the 100-column limit.
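One way to see why comparing entries rather than map objects can matter: a list of key/value pairs keeps order and duplicate keys visible, while a dictionary collapses them. The sketch below uses only Python's `json` module; the helper `entries_in_document_order` is hypothetical and does not claim to reproduce Spark's `map_entries` semantics.

```python
import json


def entries_in_document_order(doc):
    """Hypothetical helper: (key, value) pairs of a JSON object, duplicates kept."""
    return json.loads(doc, object_pairs_hook=list)


doc = '{"k": ["1"], "j": [], "k": ["2"]}'

entries = entries_in_document_order(doc)
as_dict = json.loads(doc)

print(entries)  # three pairs, in the order they appear in the text
print(as_dict)  # two keys; the later "k" overwrote the earlier one
```

Comparing the pair lists distinguishes results that would look equal, or be ordered arbitrarily, once collapsed into a map.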

---------

Signed-off-by: Nghia Truong <nghiat@nvidia.com>

Labels

feature request (New feature or request), SQL (part of the SQL/Dataframe plugin), task (Work required that improves the product but is not user facing)

2 participants