Add support for output MapType[StringType, ArrayType[StringType]] in from_json SQL function #15134
ttnghia wants to merge 2 commits
--------- Signed-off-by: Nghia Truong <nghiat@nvidia.com>
Pull request overview
Adds GPU support for from_json when the target schema is MAP<STRING,ARRAY<STRING>>, extending the existing MAP<STRING,STRING> path. The implementation uses a “raw extraction” approach (no JSON unescaping/normalization) and documents known CPU/GPU divergences.
Changes:
- Extend GpuJsonToStructs to extract MAP<STRING,ARRAY<STRING>> via JSONUtils.extractRawMapFromJsonString(..., MapValueType.ARRAY_OF_STRING).
- Update JsonToStructs GPU override type support/messaging to include MAP<STRING,ARRAY<STRING>>.
- Add integration tests (including corner cases + non-strict xfail probes) and update docs to describe the new supported map schema and its divergences.
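
To make the "raw extraction" wording above more concrete, here is a minimal plain-Python sketch. It does not call the plugin or JSONUtils, and it is not the GPU implementation. It only contrasts a normalizing parser (`json.loads`, which unescapes strings and keeps the last value for a repeated key) with a view that keeps every key/value pair. The helper name `pairs_keeping_duplicates` is hypothetical, and the actual divergences are the ones documented in docs/compatibility.md.

```python
import json

# Input text with a repeated key and an escape sequence (backslash + 't').
doc = '{"k": ["a\\tb"], "k": ["c"]}'


def pairs_keeping_duplicates(text):
    """Return the top-level key/value pairs in document order.

    Hypothetical helper for illustration: object_pairs_hook=list makes
    json.loads hand back every pair instead of collapsing repeated keys.
    """
    return json.loads(text, object_pairs_hook=list)


# Normalizing view: the later "k" replaces the earlier one.
normalized = json.loads(doc)

# Duplicate-preserving view: both "k" entries survive.
pairs = pairs_keeping_duplicates(doc)

# The source text still holds the two-character escape, while the parsed
# value holds a real tab. A raw extraction would keep the former.
raw_has_escape = "\\t" in doc
parsed_has_tab = "\t" in pairs[0][1][0]
```

The point of the contrast is only that "duplicate keys kept" and "no unescaping" are observable differences from what a normalizing parser returns, which is why the PR documents them rather than treating them as bugs.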
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| sql-plugin/src/main/scala/org/apache/spark/sql/rapids/GpuJsonToStructs.scala | Adds map-value-type dispatch for MAP<STRING,ARRAY<STRING>> vs MAP<STRING,STRING> raw extraction. |
| sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuOverrides.scala | Expands from_json type signature + tagging logic and updates fallback message for the new map schema. |
| integration_tests/src/main/python/json_test.py | Adds coverage for MAP<STRING,ARRAY<STRING>> including fallback + corner/xfail probes. |
| docs/supported_ops.md | Updates the supported-ops note for from_json MAP output to include ARRAY<STRING> values. |
| docs/compatibility.md | Documents support for MAP<STRING,ARRAY<STRING>> and clarifies the “raw extraction” divergences. |

    // Raw extraction (no unescaping, duplicate keys kept, non-string elements as raw text) --
    // see the class doc and docs/compatibility.md.
    case MapType(StringType, ArrayType(StringType, _), _) =>
      JSONUtils.extractRawMapFromJsonString(input.getBase, cudfOptions,
        JSONUtils.MapValueType.ARRAY_OF_STRING)
    case _: MapType =>
      JSONUtils.extractRawMapFromJsonString(input.getBase, cudfOptions,
        JSONUtils.MapValueType.STRING)

    json_string_gen = StringGen(
        r'{"a": (\[\]|\["[0-9]{0,5}"(, "[0-9]{0,3}"){0,2}\])(, "b": \["[A-Z]{0,5}"\])?}')
    assert_gpu_and_cpu_are_equal_collect(
        lambda spark : unary_op_df(spark, json_string_gen) \
            .select(f.from_json(f.col('a'), 'MAP<STRING,ARRAY<STRING>>')),
        conf=_enable_all_types_conf)

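As a rough illustration of the inputs this generator pattern describes, the sketch below checks a few hand-written strings against the same regular expression using Python's `re` module and parses them with `json.loads`. It does not run StringGen, Spark, or the plugin; `PATTERN` is copied from the test above, while `samples` and `matches` are made up for this example.

```python
import json
import re

# Same pattern as the StringGen in the test above.
PATTERN = (
    r'{"a": (\[\]|\["[0-9]{0,5}"(, "[0-9]{0,3}"){0,2}\])'
    r'(, "b": \["[A-Z]{0,5}"\])?}'
)

# Hand-written examples (not generator output) covering an empty array,
# a single element, and the optional "b" key with an empty-string element.
samples = [
    '{"a": []}',
    '{"a": ["123"]}',
    '{"a": ["12345", "1", ""], "b": ["XYZ"]}',
]


def matches(text):
    """True if the whole string fits the generator pattern."""
    return re.fullmatch(PATTERN, text) is not None


for s in samples:
    # Each sample is valid JSON whose values are arrays of strings,
    # i.e. the shape a MAP<STRING,ARRAY<STRING>> schema expects.
    print(matches(s), json.loads(s))
```

Every value the pattern can produce is an array of quoted digit or uppercase strings, so the test exercises only well-formed, string-element inputs; the corner cases mentioned in the review overview are covered by separate tests.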
Greptile Summary
This PR extends
Confidence Score: 5/5
Safe to merge: the new map-array code path follows the same resource and dispatch conventions as the existing string-map path, and the defensive throw replaces the previous silent fall-through. The change is well-scoped: a new match arm in doColumnar that mirrors the existing STRING arm, an updated TypeSig declaration, and matching tagExprForGpu guards. Known divergences from Spark CPU are documented in compatibility.md and covered by xfail tests. No resource leaks, no OOM-retry regressions relative to the existing code, and the fallback path for non-string array element types is verified by assert_gpu_fallback_collect tests. No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["from_json(col, schema)"] --> B{TypeSig check\nGpuOverrides}
B -->|"MAP nested STRING\nor ARRAY<STRING>"| C{tagExprForGpu\npattern match}
B -->|"STRUCT"| C
B -->|Other| Z["willNotWorkOnGpu\n→ CPU fallback"]
C -->|"MapType(StringType, StringType, _)"| D["doColumnar dispatch"]
C -->|"MapType(StringType, ArrayType(StringType,_), _)"| D
C -->|"StructType"| D
C -->|"Other map shape"| Z
D -->|MapType STRING value| E["JSONUtils.extractRawMapFromJsonString\n(MapValueType.STRING)"]
D -->|"MapType ARRAY<STRING> value"| F["JSONUtils.extractRawMapFromJsonString\n(MapValueType.ARRAY_OF_STRING)"]
D -->|"Other MapType"| G["throw IllegalArgumentException\n(defensive guard)"]
D -->|StructType| H["JSONUtils.fromJSONToStructs\n+ optional datetime conversion"]
E --> R["cudf.ColumnVector result"]
F --> R
H --> R
Reviews (2): Last reviewed commit: "Apply from_json map-of-array code-review..."
Make the GpuJsonToStructs map dispatch explicit: match MAP<STRING,STRING> directly and throw on any other (currently unreachable) map value type instead of silently extracting it as STRING, so widening the GpuOverrides map gate later fails loudly rather than producing wrong results.

Compare the array-valued-map equality test result via map_entries for deterministic ordering, parametrize the array fallback test over ARRAY<INT> and ARRAY<DOUBLE>, and rewrap the GpuJsonToStructs and GpuOverrides comments to the 100-column limit.

Signed-off-by: Nghia Truong <nghiat@nvidia.com>
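
The "explicit dispatch with a defensive throw" idea in this commit message can be sketched in a few lines of plain Python. This is an illustration of the pattern only, not the plugin's Scala code: the `StringT`, `IntT`, `ArrayT`, and `MapT` classes and the `pick_map_value_mode` function are hypothetical stand-ins for Spark's data types and the match in GpuJsonToStructs.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for Spark data types, for illustration only.
@dataclass(frozen=True)
class StringT:
    pass


@dataclass(frozen=True)
class IntT:
    pass


@dataclass(frozen=True)
class ArrayT:
    element: object


@dataclass(frozen=True)
class MapT:
    key: object
    value: object


def pick_map_value_mode(schema):
    """Choose an extraction mode for a map schema, or fail loudly.

    Each supported shape is matched explicitly. Anything else raises
    instead of quietly being treated as a string-valued map.
    """
    if schema == MapT(StringT(), StringT()):
        return "STRING"
    if schema == MapT(StringT(), ArrayT(StringT())):
        return "ARRAY_OF_STRING"
    raise ValueError(f"unsupported map schema: {schema!r}")
```

With a catch-all branch that returned "STRING", a schema such as `MapT(StringT(), ArrayT(IntT()))` would be processed with the wrong mode if an upstream check were ever relaxed. Raising turns that mistake into an immediate, visible error.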
This adds support for the output type MapType[StringType, ArrayType[StringType]] in the from_json SQL function. Previously, it could only output MapType[StringType, StringType].

Depends on:
- MAP<STRING, ARRAY<STRING>> when parsing JSON using extractRawMapFromJsonString (cudf-spark-jni#4741)

Checklists
Documentation
Testing
(Please provide the names of the existing tests in the PR description.)
Performance