Skip to content

[BugFix] Fix dedup aggregation pushdown nullifying renamed fields (#5150)#27

Closed
qianheng-aws wants to merge 2 commits intorefactor/dedupe-reusable-workflowfrom
worktree-agent-a91850db
Closed

[BugFix] Fix dedup aggregation pushdown nullifying renamed fields (#5150)#27
qianheng-aws wants to merge 2 commits intorefactor/dedupe-reusable-workflowfrom
worktree-agent-a91850db

Conversation

@qianheng-aws
Copy link
Copy Markdown
Owner

Description

When a field is renamed via rename and a subsequent dedup operates on a different field, the aggregation-based DedupPushdownRule returns null for all renamed fields. This is because the top_hits response uses original index field names from _source, but the OpenSearchIndexEnumerator expects the renamed names from the logical plan's rowType.

Root cause: In AggregateAnalyzer.createTopHitsBuilder(), helper.inferNamedField() resolves RexInputRef fields to original index names (e.g., value), but the enumerator's field list uses renamed names (e.g., val). No mapping bridges this gap.

Fix: Build a rename mapping (originalName -> renamedName) in the LITERAL_AGG case of AggregateAnalyzer by comparing the resolved index field name with the project's field name. Pass this mapping to TopHitsParser, which applies it when constructing result maps from the _source response.

Related Issues

Resolves opensearch-project#5150

Check List

  • New functionality includes testing
  • Commits signed per DCO (-s)
  • spotlessCheck passed
  • Unit tests passed
  • Integration tests passed

…pensearch-project#5269) (opensearch-project#5293)

When querying across indices with conflicting field mappings (boolean vs
text), numeric values like 0 were not coerced to boolean, causing
"node must be a boolean, found NUMBER" error. Added numeric handling to
parseBooleanValue() consistent with ObjectContent.booleanValue().

Signed-off-by: Heng Qian <qianheng@amazon.com>
…ensearch-project#5150)

When a field is renamed via `rename` and a subsequent `dedup` operates on
a different field, the aggregation-based DedupPushdownRule returns null for
all renamed fields. The root cause is that the top_hits response uses
original index field names from _source, but the enumerator expects the
renamed names from the logical plan's rowType.

Fix: Build a rename mapping (originalName -> renamedName) in the LITERAL_AGG
case of AggregateAnalyzer and pass it to TopHitsParser. The parser applies
this mapping when constructing result maps, so field names match what the
enumerator expects.

Signed-off-by: Heng Qian <qianheng@amazon.com>
@qianheng-aws
Copy link
Copy Markdown
Owner Author

Decision Log

Root Cause: The DedupPushdownRule converts dedup to a composite aggregation with top_hits. When building the top_hits, AggregateAnalyzer.createTopHitsBuilder() uses helper.inferNamedField() to resolve RexInputRef fields to original index field names (e.g., value). The OpenSearch response returns _source with these original names. However, OpenSearchIndexEnumerator.current() resolves field values using the renamed names from dedup.getRowType() (e.g., val). This mismatch causes all renamed fields to resolve as MISSING/null.

Approach: Added a fieldRenameMapping (Map<String, String>) to TopHitsParser. In the LITERAL_AGG case of AggregateAnalyzer, we build this mapping by comparing each RexInputRef field's resolved original name with the project's field name. When the parser constructs result maps from the _source response, it applies this mapping to rename keys before they reach the enumerator. The fix is minimal and backward-compatible: when no rename exists, the mapping is empty and the behavior is unchanged.

Alternatives Rejected:

  1. Post-processing LogicalProject: Would add an extra plan node to rename fields after the scan. Rejected because it's more invasive and the scan already handles schema mapping via copyWithNewSchema.
  2. Modify inferNamedField to return renamed names: Would break the fetchField call which needs the original index name to fetch data from OpenSearch.
  3. Fix in OpenSearchIndexEnumerator: Would require the enumerator to maintain a name mapping, adding complexity to a general-purpose component for a dedup-specific issue.

Pitfalls:

  • The composite aggregation parser returns separate maps for bucket keys and metric values. The TopHitsParser rename mapping only affects the metric (top_hits) maps, not the bucket keys which already use renamed names from the composite key configuration.
  • Fields that are RexCall (eval expressions) or RexLiteral go through the ScriptField path which already uses the correct renamed name, so they are unaffected.

Things to Watch:

  • The TopHitsParser is also used for FIRST/LAST aggregations with returnSingleValue=true. The rename mapping only affects the else branch (multi-value/dedup path), so those aggregations are not impacted.
  • If future code adds a fetchSource with field-specific includes for the dedup path, the rename mapping would need to be applied to _source field names too (currently it handles both _source and fields response sections).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Dedup aggregation pushdown nullifies renamed non-dedup fields via top_hits response mapping

1 participant