Bug description
A JSON value containing an unpaired UTF-16 surrogate corrupts the JSON index of a consuming (mutable) segment.
Java represents characters above U+FFFF (emoji, some CJK, etc.) as a pair of char values called surrogates. An unpaired surrogate is one half of that pair with its partner missing — it's not valid text and not encodable as UTF-8. It commonly appears when a string is truncated in the middle of such a character.
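As a minimal illustration of the paragraph above, the sketch below uses only standard `java.lang.Character` methods to show how truncating a string in the middle of a surrogate pair leaves an unpaired surrogate behind. The class name `UnpairedSurrogateSketch` and the helper `hasUnpairedSurrogate` are hypothetical names chosen for this example; they are not part of Pinot.

```java
public class UnpairedSurrogateSketch {
  // Hypothetical helper: returns true if the text contains a high surrogate
  // that is not followed by a low surrogate, or a low surrogate on its own.
  static boolean hasUnpairedSurrogate(CharSequence s) {
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)) {
        if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
          i++; // well-formed pair: skip the low half
        } else {
          return true; // high surrogate with no partner
        }
      } else if (Character.isLowSurrogate(c)) {
        return true; // low surrogate with no preceding high surrogate
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String emoji = "\uD83D\uDE00"; // U+1F600: one code point, two char values
    String truncated = ("hi" + emoji).substring(0, 3); // cuts the pair in half

    System.out.println(emoji.length());                  // 2
    System.out.println(hasUnpairedSurrogate(emoji));     // false
    System.out.println(hasUnpairedSurrogate(truncated)); // true
  }
}
```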
Why it happens
For each document, MutableJsonIndexImpl.addFlattenedRecords does two steps:
- Step 1: appends one entry per flattened record to _docIdMapping.
- Step 2: loops over those records to build the posting lists, incrementing _nextFlattenedDocId as it goes.
Step 2 calls Utf8.encodedLength(...) only to track an approximate memory size — and that call throws on an unpaired surrogate. The throw aborts step 2 partway, after step 1 has already grown _docIdMapping. Result: _docIdMapping is left permanently one entry longer than _nextFlattenedDocId.
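The drift can be reproduced with a small stand-alone model of the two steps. This is a simplified sketch, not the actual Pinot code: the class `JsonIndexDriftSketch` is hypothetical, `strictUtf8Length` is a stand-in for `Utf8.encodedLength(...)` that is assumed to throw `IllegalArgumentException` on an unpaired surrogate, plain lists replace the real posting-list structure, and the `catch` in `add` stands in for the caller swallowing the exception.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class JsonIndexDriftSketch {
  // Flattened doc id -> real doc id (grown in step 1).
  final List<Integer> docIdMapping = new ArrayList<>();
  // "key=value" -> flattened doc ids (filled in step 2).
  final Map<String, List<Integer>> postingLists = new HashMap<>();
  int nextDocId = 0;
  int nextFlattenedDocId = 0;
  long bytesSize = 0;

  // Stand-in for a strict UTF-8 length calculation that rejects unpaired surrogates.
  static int strictUtf8Length(String s) {
    int length = 0;
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c) && i + 1 < s.length()
          && Character.isLowSurrogate(s.charAt(i + 1))) {
        length += 4; // a surrogate pair encodes to 4 UTF-8 bytes
        i++;
      } else if (Character.isSurrogate(c)) {
        throw new IllegalArgumentException("Unpaired surrogate at index " + i);
      } else {
        length += c < 0x80 ? 1 : (c < 0x800 ? 2 : 3);
      }
    }
    return length;
  }

  void addFlattenedRecords(List<Map<String, String>> records) {
    // Step 1: one mapping entry per flattened record.
    for (int i = 0; i < records.size(); i++) {
      docIdMapping.add(nextDocId);
    }
    // Step 2: build posting lists and advance the flattened doc id.
    for (Map<String, String> record : records) {
      for (Map.Entry<String, String> e : record.entrySet()) {
        String keyValue = e.getKey() + "=" + e.getValue();
        postingLists.computeIfAbsent(keyValue, k -> new ArrayList<>()).add(nextFlattenedDocId);
        bytesSize += strictUtf8Length(keyValue); // may throw after step 1 already ran
      }
      nextFlattenedDocId++; // skipped when the line above throws
    }
  }

  // Models a caller that swallows the failure and moves on to the next document.
  void add(Map<String, String> record) {
    try {
      addFlattenedRecords(List.of(record));
    } catch (IllegalArgumentException ignored) {
      // silently dropped, as described for the consuming segment
    } finally {
      nextDocId++;
    }
  }

  // Resolves matching flattened doc ids back to real doc ids.
  List<Integer> match(String key, String value) {
    List<Integer> result = new ArrayList<>();
    for (int flattenedId : postingLists.getOrDefault(key + "=" + value, List.of())) {
      result.add(docIdMapping.get(flattenedId));
    }
    return result;
  }

  public static void main(String[] args) {
    JsonIndexDriftSketch index = new JsonIndexDriftSketch();
    index.add(Map.of("name", "first"));
    index.add(Map.of("name", "\uD800")); // unpaired high surrogate
    index.add(Map.of("name", "third"));
    System.out.println(index.docIdMapping.size() + " vs " + index.nextFlattenedDocId); // 3 vs 2
    System.out.println(index.match("name", "third")); // [1], although the document is doc 2
  }
}
```

In this model the failing document leaves `docIdMapping` with three entries while `nextFlattenedDocId` is only 2, so the third document's posting-list entry reuses flattened id 1 and resolves to real doc id 1.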
Effect
With _docIdMapping and _nextFlattenedDocId out of sync, every later document's posting-list entries map back to the wrong real doc id, so json_match() on that segment returns rows shifted by one. MutableSegmentImpl swallows the exception, so the corruption is completely silent; it only heals once the segment is committed and the immutable index is rebuilt from scratch.
Reproduction
Insert these three records in order into a JSON-indexed column on a consuming (realtime) segment:
{"name": "first"}
{"name": "\uD800"} // unpaired high surrogate
{"name": "third"}
Expected results:
json_match(col, '"$.name" = ''first''') → doc 0
json_match(col, '"$.name" = ''\uD800''') → doc 1
json_match(col, '"$.name" = ''third''') → doc 2
Actual results after the corrupted document:
json_match(col, '"$.name" = ''third''') → doc 1 (off by one)