MutableJsonIndexImpl: unpaired UTF-16 surrogate in JSON value corrupts doc ID mapping #18737

@spapin

Bug description

A JSON value containing an unpaired UTF-16 surrogate corrupts the JSON index of a consuming (mutable) segment.

Java strings are sequences of UTF-16 char values, so characters above U+FFFF (emoji, some CJK, etc.) are stored as a pair of char values called surrogates: a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF). An unpaired surrogate is one half of that pair with its partner missing; it is not valid text and cannot be encoded as UTF-8. It commonly appears when a string is truncated in the middle of such a character.
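To make the definition concrete, here is a minimal sketch of how an unpaired surrogate can be detected in a Java string. The class and method names (`SurrogateCheck`, `firstUnpairedSurrogate`) are hypothetical and are not part of Pinot; only the `java.lang.Character` helpers are real JDK calls.

```java
public final class SurrogateCheck {
  private SurrogateCheck() {}

  /** Returns the index of the first unpaired surrogate in s, or -1 if there is none. */
  public static int firstUnpairedSurrogate(String s) {
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)) {
        if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
          i++; // valid pair: skip the low half
        } else {
          return i; // high surrogate with no low partner
        }
      } else if (Character.isLowSurrogate(c)) {
        return i; // low surrogate with no preceding high partner
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    String emoji = "\uD83D\uDE00";            // one code point, two char values
    String truncated = emoji.substring(0, 1); // cut in the middle of the pair
    System.out.println(firstUnpairedSurrogate(emoji));     // -1
    System.out.println(firstUnpairedSurrogate(truncated)); // 0
  }
}
```

The `substring` call illustrates the truncation case mentioned above: cutting a string at a char index rather than a code point boundary can leave half a pair behind.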

Why it happens

For each document, MutableJsonIndexImpl.addFlattenedRecords does two steps:

  1. Appends one entry per flattened record to _docIdMapping.
  2. Loops over those records to build the posting lists, incrementing _nextFlattenedDocId as it goes.

Step 2 calls Utf8.encodedLength(...) only to track an approximate memory size — and that call throws on an unpaired surrogate. The throw aborts step 2 partway, after step 1 has already grown _docIdMapping. Result: _docIdMapping is left permanently one entry longer than _nextFlattenedDocId.
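The desynchronization can be modeled with a small, self-contained sketch. This is not Pinot's code: `DesyncSketch`, `strictUtf8Length`, `addSwallowing`, and `lookup` are hypothetical names, each document is assumed to flatten to a list of string values, and `strictUtf8Length` merely stands in for the `Utf8.encodedLength(...)` call by throwing on an unpaired surrogate. The point at which the real code throws relative to the posting-list update is also an assumption.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public final class DesyncSketch {
  final List<Integer> docIdMapping = new ArrayList<>();          // flattened doc id -> real doc id
  final Map<String, List<Integer>> postings = new HashMap<>();   // value -> flattened doc ids
  int nextFlattenedDocId = 0;
  int nextDocId = 0;

  /** Stand-in for the size-tracking call: counts UTF-8 bytes, throws on an unpaired surrogate. */
  static int strictUtf8Length(String s) {
    int bytes = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c < 0x80) {
        bytes += 1;
      } else if (c < 0x800) {
        bytes += 2;
      } else if (Character.isSurrogate(c)) {
        if (Character.isHighSurrogate(c) && i + 1 < s.length()
            && Character.isLowSurrogate(s.charAt(i + 1))) {
          bytes += 4;
          i++; // consume the low half of the pair
        } else {
          throw new IllegalArgumentException("Unpaired surrogate at index " + i);
        }
      } else {
        bytes += 3;
      }
    }
    return bytes;
  }

  public void add(List<String> flattenedValues) {
    int docId = nextDocId++;
    // Step 1: grow the mapping, one entry per flattened record.
    for (int i = 0; i < flattenedValues.size(); i++) {
      docIdMapping.add(docId);
    }
    // Step 2: build posting lists; the size check may throw partway through.
    for (String value : flattenedValues) {
      strictUtf8Length(value);
      postings.computeIfAbsent(value, k -> new ArrayList<>()).add(nextFlattenedDocId);
      nextFlattenedDocId++;
    }
  }

  /** Mimics a caller that swallows the indexing exception and moves on. */
  public void addSwallowing(List<String> flattenedValues) {
    try {
      add(flattenedValues);
    } catch (IllegalArgumentException e) {
      // ignored, as described for the consuming segment
    }
  }

  /** Resolves the first match for value to a real doc id, or -1 if absent. */
  public int lookup(String value) {
    List<Integer> ids = postings.get(value);
    return ids == null ? -1 : docIdMapping.get(ids.get(0));
  }
}
```

Feeding this model the values "first", a lone high surrogate, and "third" through `addSwallowing` leaves `docIdMapping` with three entries while `nextFlattenedDocId` is 2, and `lookup("third")` resolves to doc 1 instead of doc 2.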

Effect

With _docIdMapping and _nextFlattenedDocId out of sync, every later document's posting-list entries map back to the wrong real doc id, so json_match() on that segment returns rows shifted by one. MutableSegmentImpl swallows the exception, so the corruption is completely silent and only heals once the segment is committed and the immutable index is rebuilt from scratch.

Reproduction

Insert these three records in order into a JSON-indexed column on a consuming (realtime) segment:

{"name": "first"}
{"name": "\uD800"}   // unpaired high surrogate
{"name": "third"}

Expected results:

  • json_match(col, '"$.name" = ''first''') → doc 0
  • json_match(col, '"$.name" = ''\uD800''') → doc 1
  • json_match(col, '"$.name" = ''third''') → doc 2

Actual results after the corrupted document:

  • json_match(col, '"$.name" = ''third''') → doc 1 (off by one)
