Add auto-extract mode for spath command#5140

Open
dai-chen wants to merge 6 commits into opensearch-project:main from dai-chen:add-auto-extract-mode-support-in-spath

dai-chen (Collaborator) commented Feb 12, 2026

Description

As a follow-up to #5139, this PR implements the auto-extract mode for the spath command per the proposal in #4307. When the path parameter is omitted, spath rewrites to the json_extract_all(input) function and returns a map<string, string> with flattened keys instead of expanding dynamic columns.
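
To make the flattened result concrete, here is a minimal Java sketch of the output shape described above. It is not the PR's JsonExtractAllFunctionImpl: the class and method names are hypothetical, it walks an already-parsed Map/List tree rather than a JSON string, and the dot separator for nested keys and the null handling are assumptions. The `{}` array suffix and the `[java, sql]` rendering follow the integration test quoted later in this conversation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; a stand-in for the flattening behavior, not the PR's code. */
final class SpathFlattenSketch {

  /** Flattens a parsed JSON object into string-valued entries with flattened keys. */
  static Map<String, String> flatten(Map<String, ?> parsed) {
    Map<String, List<Object>> collected = new LinkedHashMap<>();
    walk("", parsed, collected);

    Map<String, String> out = new LinkedHashMap<>();
    for (Map.Entry<String, List<Object>> e : collected.entrySet()) {
      List<Object> values = e.getValue();
      // Keys that pass through an array (or collide) keep list form; others are scalars.
      boolean listForm = e.getKey().contains("{}") || values.size() > 1;
      Object value = listForm ? values : values.get(0);
      // Assumption: a JSON null stays a Java null instead of the string "null".
      out.put(e.getKey(), value == null ? null : value.toString());
    }
    return out;
  }

  private static void walk(String prefix, Object node, Map<String, List<Object>> acc) {
    if (node instanceof Map) {
      for (Map.Entry<?, ?> e : ((Map<?, ?>) node).entrySet()) {
        String key = String.valueOf(e.getKey());
        // Assumption: nested object keys are joined with a dot.
        walk(prefix.isEmpty() ? key : prefix + "." + key, e.getValue(), acc);
      }
    } else if (node instanceof List) {
      String arrayKey = prefix + "{}";
      acc.computeIfAbsent(arrayKey, k -> new ArrayList<>()); // keeps empty arrays visible
      for (Object element : (List<?>) node) {
        walk(arrayKey, element, acc);
      }
    } else {
      acc.computeIfAbsent(prefix, k -> new ArrayList<>()).add(node);
    }
  }

  public static void main(String[] args) {
    Map<String, Object> user = new LinkedHashMap<>();
    user.put("name", "John");
    user.put("age", 30);
    Map<String, Object> doc = new LinkedHashMap<>();
    doc.put("user", user);
    doc.put("tags", List.of("java", "sql"));
    // prints {user.name=John, user.age=30, tags{}=[java, sql]}
    System.out.println(flatten(doc));
  }
}
```

Every value in the result is a string, which is the practical difference from expanding each key into its own dynamically typed column.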

Related Issues

Resolves #4307 (partially; ergonomic map access syntax is left as a follow-up)

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • New functionality has javadoc added.
  • New functionality has a user manual doc added.
  • New PPL command checklist all confirmed.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff or -s.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Chen Dai <daichen@amazon.com>
…serve issue

Signed-off-by: Chen Dai <daichen@amazon.com>
dai-chen self-assigned this Feb 12, 2026
dai-chen added the enhancement (New feature or request) and PPL (Piped processing language) labels Feb 12, 2026

coderabbitai bot (Contributor) commented Feb 12, 2026

📝 Walkthrough

Summary by CodeRabbit

  • New Features

    • spath supports auto-extract mode: omit the path to flatten and extract all JSON fields into key-value pairs.
  • Improvements

    • JSON extraction now returns all values as strings for consistent output across modes.
  • Documentation

    • Updated spath docs to cover path-based and auto-extract modes, behavior rules, examples, and performance notes.
  • Tests

    • Expanded and updated tests to cover auto-extract scenarios and adjusted expectations to stringified outputs.

Walkthrough

Made SPath.path optional to enable an auto-extract mode; spath without a path now rewrites to json_extract_all and returns a flattened Map<String,String>. Changes touch core AST, json UDF behavior, parser/rewrite logic, tests, test data/mappings, and documentation.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Core AST: core/src/main/java/org/opensearch/sql/ast/tree/SPath.java | Made path nullable and added conditional rewrite logic in rewriteAsEval to route pathless spath to auto-extract behavior. |
| JSON UDF Implementation: core/src/main/java/org/opensearch/sql/expression/function/jsonUDF/JsonExtractAllFunctionImpl.java, core/src/test/java/org/opensearch/sql/expression/function/jsonUDF/JsonExtractAllFunctionImplTest.java | Changed json_extract_all to produce stringified values (Map<String,String>), updated return type inference to VARCHAR, added stringifyMap helper, and adapted unit tests to expect string values. |
| PPL Parser & Rewrite: ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java, ppl/src/test/java/org/opensearch/sql/ppl/utils/SPathRewriteTest.java | Removed runtime validation that enforced a path; allow constructing SPath with null path and added tests for auto-extract rewrite behavior (with/without output). |
| PPL Parser Tests / Anonymizer: ppl/src/test/java/org/opensearch/sql/ppl/parser/AstBuilderTest.java, ppl/src/test/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizerTest.java | Added tests covering spath with no path and anonymization of pathless spath queries. |
| Calcite PPL Tests: ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLSpathTest.java | Reworked tests to fluent API and added both path-mode and auto-extract-mode tests (with output variants). |
| Integration Tests: integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java, integ-test/src/test/java/org/opensearch/sql/calcite/standalone/JsonExtractAllFunctionIT.java | Added multiple integration tests exercising auto-extract scenarios (nested fields, arrays, duplicate keys, stringify/null handling) and adapted helpers to Map<String,String>. |
| Docs & Test Data / Mappings: docs/user/ppl/cmd/spath.md, doctest/test_data/structured.json, doctest/test_mapping/structured.json | Docs updated to describe dual modes (path-based and auto-extract) with examples and flattening rules; test data and mapping gain doc_auto entries to exercise auto-extract. |

Sequence Diagram

sequenceDiagram
    participant User
    participant Parser as PPL Parser
    participant SPath as SPath AST
    participant Rewriter as SPath Rewriter
    participant JsonExtract as JsonExtractAll
    participant Output as Result Map

    User->>Parser: spath input=field (no path)
    Parser->>SPath: Create SPath(input=field, path=null)
    SPath->>Rewriter: rewriteAsEval(context)
    Rewriter->>Rewriter: Detect path is null
    Rewriter->>JsonExtract: Use json_extract_all(field)
    JsonExtract->>JsonExtract: Parse JSON and flatten structure
    JsonExtract->>JsonExtract: Stringify all values
    JsonExtract->>Output: Return Map<String,String>
    Output->>User: Flattened key-value pairs (strings)
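
The dispatch in the diagram can be sketched roughly as follows. This is an illustration, not the actual SPath.rewriteAsEval: the class name, the string-based output, and the constructor are hypothetical, and the path-mode branch (json_extract with the output defaulting to the path) is an assumption, since this conversation only describes the pathless branch. Defaulting the output to the input field in auto-extract mode is inferred from the review comment where `spath input=doc` yields a `doc` struct column.

```java
import java.util.Objects;

/** Hypothetical stand-in for the SPath AST node; renders the rewritten eval as text. */
final class SPathRewriteSketch {
  private final String inField;
  private final String outField; // optional, may be null
  private final String path;     // null selects auto-extract mode

  SPathRewriteSketch(String inField, String outField, String path) {
    this.inField = Objects.requireNonNull(inField, "input field is required");
    this.outField = outField;
    this.path = path;
  }

  String rewriteAsEval() {
    if (path == null) {
      // Auto-extract mode: no path, so extract everything into one map-valued field.
      String target = outField != null ? outField : inField;
      return "eval " + target + " = json_extract_all(" + inField + ")";
    }
    // Path mode (assumed shape): extract a single path into the output field.
    String target = outField != null ? outField : path;
    return "eval " + target + " = json_extract(" + inField + ", '" + path + "')";
  }

  public static void main(String[] args) {
    // prints eval doc = json_extract_all(doc)
    System.out.println(new SPathRewriteSketch("doc", null, null).rewriteAsEval());
    // prints eval result = json_extract_all(doc)
    System.out.println(new SPathRewriteSketch("doc", "result", null).rewriteAsEval());
  }
}
```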

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Suggested labels

calcite, feature

Suggested reviewers

  • ykmr1224
  • penghuo
  • anirudha
  • ps48
  • kavithacm
  • derek-ho
  • joshuali925
  • GumpacG
  • Swiddis
  • qianheng-aws
  • yuancu
🚥 Pre-merge checks | ✅ 4 | ❌ 2
❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 3.70% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Merge Conflict Detection ⚠️ Warning ❌ Merge conflicts detected (14 files):

⚔️ build.gradle (content)
⚔️ core/src/main/java/org/opensearch/sql/ast/tree/SPath.java (content)
⚔️ core/src/main/java/org/opensearch/sql/expression/function/jsonUDF/JsonExtractAllFunctionImpl.java (content)
⚔️ core/src/test/java/org/opensearch/sql/expression/function/jsonUDF/JsonExtractAllFunctionImplTest.java (content)
⚔️ docs/user/ppl/cmd/spath.md (content)
⚔️ doctest/test_data/structured.json (content)
⚔️ doctest/test_mapping/structured.json (content)
⚔️ integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java (content)
⚔️ integ-test/src/test/java/org/opensearch/sql/calcite/standalone/JsonExtractAllFunctionIT.java (content)
⚔️ ppl/src/main/java/org/opensearch/sql/ppl/parser/AstBuilder.java (content)
⚔️ ppl/src/test/java/org/opensearch/sql/ppl/calcite/CalcitePPLSpathTest.java (content)
⚔️ ppl/src/test/java/org/opensearch/sql/ppl/parser/AstBuilderTest.java (content)
⚔️ ppl/src/test/java/org/opensearch/sql/ppl/utils/PPLQueryDataAnonymizerTest.java (content)
⚔️ ppl/src/test/java/org/opensearch/sql/ppl/utils/SPathRewriteTest.java (content)

These conflicts must be resolved before merging into main.
Resolve conflicts locally and push changes to this branch.
✅ Passed checks (4 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title 'Add auto-extract mode for spath command' clearly and specifically summarizes the main change in the PR, which implements auto-extract mode for the spath command when the path parameter is omitted. |
| Description check | ✅ Passed | The description provides context about implementing auto-extract mode for spath, references the related issue #4307, and explains the key change: spath without path parameter now rewrites to json_extract_all and returns map<string, string>. |
| Linked Issues check | ✅ Passed | The PR implements the core requirement from issue #4307: dynamic column support for spath where omitting the path parameter enables auto-extract mode that parses JSON and extracts all attributes as a flattened map, addressing the linked objective. |
| Out of Scope Changes check | ✅ Passed | All changes align with the stated objective of adding auto-extract mode: SPath nullable path support, JsonExtractAllFunctionImpl return type change to VARCHAR, test updates, documentation, and parser/rewrite logic modifications directly support auto-extract functionality. |

Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>
Signed-off-by: Chen Dai <daichen@amazon.com>

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

🤖 Fix all issues with AI agents
In `@docs/user/ppl/cmd/spath.md`:
- Around line 41-43: There are two adjacent blockquotes ("Auto-extract mode
processes the entire input field..." and "Invalid or malformed JSON returns
partial results...") separated by a blank line which triggers MD028; remove the
blank line and either merge the two notes into a single blockquote or place
regular text between them so they are no longer consecutive blockquotes—update
the lines containing those two blockquote texts to be joined into one blockquote
or separated by non-blockquote content.

In
`@integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java`:
- Around line 40-48: Add three new test documents to the auto-extract fixtures
for CalcitePPLSpathCommandIT by creating additional Request objects (similar to
autoExtractDoc) and performing them via client().performRequest: (1) a document
where the target auto-extract field(s) are explicitly null to exercise null
handling, (2) a document where those fields contain empty JSON object "{}" and
an empty JSON array "[]" to cover boundary/empty-input behavior, and (3) a
document with a deliberately malformed JSON string (e.g., truncated or missing
quotes/braces) to trigger error paths; each should use the same PUT to the
/test_spath_auto/_doc/{id}?refresh=true pattern with unique ids and matching
JSON entity strings so tests validate null, empty, and invalid inputs for the
auto-extract mode.
- Around line 59-63: The testSpathAutoExtract test only asserts schema via
verifySchema(schema("doc","struct")) which is too weak; update
testSpathAutoExtract to also assert actual extracted values and multi-document
behavior by parsing the JSONObject returned from executeQuery("source=test_spath
| spath input=doc") and adding row-level assertions that specific keys/values
exist for each document (e.g., expected map entries inside the "doc" struct),
and/or use existing row-assert helper methods to assert the expected number of
rows and exact field values; keep the initial verifySchema call but add concrete
assertions that validate the map extraction across multiple documents.

Comment on lines 41 to 43
> **Note**: Auto-extract mode processes the entire input field with no character limit. For large JSON payloads, consider using path-based extraction to target specific fields.

> **Note**: Invalid or malformed JSON returns partial results containing any fields successfully parsed before the error. Empty JSON object (`{}`) returns an empty map.

⚠️ Potential issue | 🟡 Minor

Minor markdown formatting issue.

There's a blank line between two blockquotes (lines 41-43), which triggers a markdown lint warning (MD028). Consider combining them into a single blockquote or separating them with regular text.

📝 Suggested fix
-> **Note**: Auto-extract mode processes the entire input field with no character limit. For large JSON payloads, consider using path-based extraction to target specific fields.
-
-> **Note**: Invalid or malformed JSON returns partial results containing any fields successfully parsed before the error. Empty JSON object (`{}`) returns an empty map.
+> **Note**: Auto-extract mode processes the entire input field with no character limit. For large JSON payloads, consider using path-based extraction to target specific fields.
+>
+> Invalid or malformed JSON returns partial results containing any fields successfully parsed before the error. Empty JSON object (`{}`) returns an empty map.
🧰 Tools
🪛 markdownlint-cli2 (0.20.0)

[warning] 42-42: Blank line inside blockquote

(MD028, no-blanks-blockquote)

penghuo previously approved these changes Feb 13, 2026
Signed-off-by: Chen Dai <daichen@amazon.com>

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In
`@integ-test/src/test/java/org/opensearch/sql/calcite/remote/CalcitePPLSpathCommandIT.java`:
- Around line 102-111: Add a boundary test that verifies empty-array handling
for the spath array suffix case: create a new test (or extend
testSpathAutoExtractArraySuffix) that runs executeQuery with a document whose
array field is empty (e.g., "tags": []) and assert schema via
verifySchema("result","struct") and data via verifyDataRows expecting the
extracted key "tags{}" to stringify as "[]"; reference the existing test method
testSpathAutoExtractArraySuffix, the helpers executeQuery, verifySchema and
verifyDataRows, and ensure the expected JSONObject uses "tags{}":"[]".

Comment on lines +102 to +111
  @Test
  public void testSpathAutoExtractArraySuffix() throws IOException {
    JSONObject result =
        executeQuery(
            "source=test_spath_auto | spath input=array_doc output=result | fields result");

    // Arrays use {} suffix: tags{}
    verifySchema(result, schema("result", "struct"));
    verifyDataRows(result, rows(new JSONObject("{\"tags{}\":\"[java, sql]\"}")));
  }

⚠️ Potential issue | 🟡 Minor

Add empty-array boundary test for array suffix handling.
Current coverage only validates non-empty arrays; empty arrays are a boundary case that can stringify differently (e.g., []).

💡 Suggested additions
-            + " \"array_doc\": \"{\\\"tags\\\":[\\\"java\\\",\\\"sql\\\"]}\","
+            + " \"array_doc\": \"{\\\"tags\\\":[\\\"java\\\",\\\"sql\\\"]}\","
+            + " \"empty_array_doc\": \"{\\\"tags\\\":[]}\","
+  @Test
+  public void testSpathAutoExtractEmptyArray() throws IOException {
+    JSONObject result =
+        executeQuery(
+            "source=test_spath_auto | spath input=empty_array_doc output=result | fields result");
+
+    verifySchema(result, schema("result", "struct"));
+    verifyDataRows(result, rows(new JSONObject("{\"tags{}\":\"[]\"}")));
+  }
As per coding guidelines: “Include boundary condition tests (min/max values, empty inputs) for all new functions”.
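
The expected `"[]"` in the suggested test is consistent with the non-empty case: the `[java, sql]` value asserted above matches the format of Java's `List.toString()`. Assuming array values are stringified that way (the helper name below is hypothetical, and this conversation does not show the actual implementation), a quick standalone check of both cases looks like this:

```java
import java.util.List;

/** Standalone check of how java.util.List renders as a string; not the PR's code. */
final class ArrayStringifySketch {

  /** Hypothetical helper: stringifies collected array elements via List.toString(). */
  static String stringify(List<?> elements) {
    return String.valueOf(elements);
  }

  public static void main(String[] args) {
    System.out.println(stringify(List.of("java", "sql"))); // prints [java, sql]
    System.out.println(stringify(List.of()));              // prints []
  }
}
```

If the implementation serialized arrays as JSON text instead, the non-empty case would render with quoted elements while the empty case would still be `[]`, so the empty-array test on its own would not distinguish the two approaches.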

Labels

enhancement New feature or request PPL Piped processing language

Projects

None yet

Development

Successfully merging this pull request may close these issues.

(Calcite PPL) spath command dynamic columns support

2 participants