Add auto-extract mode for spath command#5140
Add auto-extract mode for spath command#5140dai-chen wants to merge 6 commits intoopensearch-project:mainfrom
spath command#5140Conversation
Signed-off-by: Chen Dai <daichen@amazon.com>
…serve issue Signed-off-by: Chen Dai <daichen@amazon.com>
📝 WalkthroughSummary by CodeRabbit
WalkthroughMade SPath.path optional to enable an auto-extract mode; spath without a path now rewrites to json_extract_all and returns a flattened Map<String,String>. Changes touch core AST, json UDF behavior, parser/rewrite logic, tests, test data/mappings, and documentation. Changes
Sequence DiagramsequenceDiagram
participant User
participant Parser as PPL Parser
participant SPath as SPath AST
participant Rewriter as SPath Rewriter
participant JsonExtract as JsonExtractAll
participant Output as Result Map
User->>Parser: spath input=field (no path)
Parser->>SPath: Create SPath(input=field, path=null)
SPath->>Rewriter: rewriteAsEval(context)
Rewriter->>Rewriter: Detect path is null
Rewriter->>JsonExtract: Use json_extract_all(field)
JsonExtract->>JsonExtract: Parse JSON and flatten structure
JsonExtract->>JsonExtract: Stringify all values
JsonExtract->>Output: Return Map<String,String>
Output->>User: Flattened key-value pairs (strings)
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts (beta)
Comment |
Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@docs/user/ppl/cmd/spath.md`:
- Around line 41-43: There are two adjacent blockquotes ("Auto-extract mode
processes the entire input field..." and "Invalid or malformed JSON returns
partial results...") separated by a blank line which triggers MD028; remove the
blank line and either merge the two notes into a single blockquote or place
regular text between them so they are no longer consecutive blockquotes—update
the lines containing those two blockquote texts to be joined into one blockquote
or separated by non-blockquote content.
In
`@integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java`:
- Around line 40-48: Add three new test documents to the auto-extract fixtures
for CalcitePPLSpathCommandIT by creating additional Request objects (similar to
autoExtractDoc) and performing them via client().performRequest: (1) a document
where the target auto-extract field(s) are explicitly null to exercise null
handling, (2) a document where those fields contain empty JSON object "{}" and
an empty JSON array "[]" to cover boundary/empty-input behavior, and (3) a
document with a deliberately malformed JSON string (e.g., truncated or missing
quotes/braces) to trigger error paths; each should use the same PUT to the
/test_spath_auto/_doc/{id}?refresh=true pattern with unique ids and matching
JSON entity strings so tests validate null, empty, and invalid inputs for the
auto-extract mode.
- Around line 59-63: The testSpathAutoExtract test only asserts schema via
verifySchema(schema("doc","struct")) which is too weak; update
testSpathAutoExtract to also assert actual extracted values and multi-document
behavior by parsing the JSONObject returned from executeQuery("source=test_spath
| spath input=doc") and adding row-level assertions that specific keys/values
exist for each document (e.g., expected map entries inside the "doc" struct),
and/or use existing row-assert helper methods to assert the expected number of
rows and exact field values; keep the initial verifySchema call but add concrete
assertions that validate the map extraction across multiple documents.
docs/user/ppl/cmd/spath.md
Outdated
| > **Note**: Auto-extract mode processes the entire input field with no character limit. For large JSON payloads, consider using path-based extraction to target specific fields. | ||
|
|
||
| > **Note**: Invalid or malformed JSON returns partial results containing any fields successfully parsed before the error. Empty JSON object (`{}`) returns an empty map. |
There was a problem hiding this comment.
Minor markdown formatting issue.
There's a blank line between two blockquotes (lines 41-43), which triggers a markdown lint warning (MD028). Consider combining them into a single blockquote or separating them with regular text.
📝 Suggested fix
-> **Note**: Auto-extract mode processes the entire input field with no character limit. For large JSON payloads, consider using path-based extraction to target specific fields.
-
-> **Note**: Invalid or malformed JSON returns partial results containing any fields successfully parsed before the error. Empty JSON object (`{}`) returns an empty map.
+> **Note**: Auto-extract mode processes the entire input field with no character limit. For large JSON payloads, consider using path-based extraction to target specific fields.
+>
+> Invalid or malformed JSON returns partial results containing any fields successfully parsed before the error. Empty JSON object (`{}`) returns an empty map.📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| > **Note**: Auto-extract mode processes the entire input field with no character limit. For large JSON payloads, consider using path-based extraction to target specific fields. | |
| > **Note**: Invalid or malformed JSON returns partial results containing any fields successfully parsed before the error. Empty JSON object (`{}`) returns an empty map. | |
| > **Note**: Auto-extract mode processes the entire input field with no character limit. For large JSON payloads, consider using path-based extraction to target specific fields. | |
| > | |
| > Invalid or malformed JSON returns partial results containing any fields successfully parsed before the error. Empty JSON object (`{}`) returns an empty map. |
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)
[warning] 42-42: Blank line inside blockquote
(MD028, no-blanks-blockquote)
🤖 Prompt for AI Agents
In `@docs/user/ppl/cmd/spath.md` around lines 41 - 43, There are two adjacent
blockquotes ("Auto-extract mode processes the entire input field..." and
"Invalid or malformed JSON returns partial results...") separated by a blank
line which triggers MD028; remove the blank line and either merge the two notes
into a single blockquote or place regular text between them so they are no
longer consecutive blockquotes—update the lines containing those two blockquote
texts to be joined into one blockquote or separated by non-blockquote content.
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java
Outdated
Show resolved
Hide resolved
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java
Show resolved
Hide resolved
integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Chen Dai <daichen@amazon.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In
`@integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java`:
- Around line 102-111: Add a boundary test that verifies empty-array handling
for the spath array suffix case: create a new test (or extend
testSpathAutoExtractArraySuffix) that runs executeQuery with a document whose
array field is empty (e.g., "tags": []) and assert schema via
verifySchema("result","struct") and data via verifyDataRows expecting the
extracted key "tags{}" to stringify as "[]"; reference the existing test method
testSpathAutoExtractArraySuffix, the helpers executeQuery, verifySchema and
verifyDataRows, and ensure the expected JSONObject uses "tags{}":"[]".
| @Test | ||
| public void testSpathAutoExtractArraySuffix() throws IOException { | ||
| JSONObject result = | ||
| executeQuery( | ||
| "source=test_spath_auto | spath input=array_doc output=result | fields result"); | ||
|
|
||
| // Arrays use {} suffix: tags{} | ||
| verifySchema(result, schema("result", "struct")); | ||
| verifyDataRows(result, rows(new JSONObject("{\"tags{}\":\"[java, sql]\"}"))); | ||
| } |
There was a problem hiding this comment.
Add empty-array boundary test for array suffix handling.
Current coverage only validates non-empty arrays; empty arrays are a boundary case that can stringify differently (e.g., []).
💡 Suggested additions
- + " \"array_doc\": \"{\\\"tags\\\":[\\\"java\\\",\\\"sql\\\"]}\","
+ + " \"array_doc\": \"{\\\"tags\\\":[\\\"java\\\",\\\"sql\\\"]}\","
+ + " \"empty_array_doc\": \"{\\\"tags\\\":[]}\","+ `@Test`
+ public void testSpathAutoExtractEmptyArray() throws IOException {
+ JSONObject result =
+ executeQuery(
+ "source=test_spath_auto | spath input=empty_array_doc output=result | fields result");
+
+ verifySchema(result, schema("result", "struct"));
+ verifyDataRows(result, rows(new JSONObject("{\"tags{}\":\"[]\"}")));
+ }🤖 Prompt for AI Agents
In
`@integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java`
around lines 102 - 111, Add a boundary test that verifies empty-array handling
for the spath array suffix case: create a new test (or extend
testSpathAutoExtractArraySuffix) that runs executeQuery with a document whose
array field is empty (e.g., "tags": []) and assert schema via
verifySchema("result","struct") and data via verifyDataRows expecting the
extracted key "tags{}" to stringify as "[]"; reference the existing test method
testSpathAutoExtractArraySuffix, the helpers executeQuery, verifySchema and
verifyDataRows, and ensure the expected JSONObject uses "tags{}":"[]".
Description
As a follow-up after #5139, this PR implements the auto-extract mode for the
spathcommand per the proposal in #4307. Whenpathparameter is omitted,spathrewrites tojson_extract_all(input)function and returns amap<string, string>with flattened keys instead of expanding dynamic columns.Related Issues
Resolves #4307 (partially, ergonomic map access syntax are follow-ups)
Check List
--signoffor-s.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.