Skip to content

API: Harden variant binary parsing against malformed input#16568

Open
nssalian wants to merge 3 commits into
apache:mainfrom
nssalian:fix-variant-malformed-parsing
Open

API: Harden variant binary parsing against malformed input#16568
nssalian wants to merge 3 commits into
apache:mainfrom
nssalian:fix-variant-malformed-parsing

Conversation

@nssalian

@nssalian nssalian commented May 26, 2026

Copy link
Copy Markdown
Collaborator

Closes #16334
Addresses #16455

Summary

Validates malformed Variant binary input so adversarial buffers fail fast with IllegalArgumentException instead of OOM, StackOverflowError, or raw JVM exceptions. Adds bounds checks across header, count, offset table,
primitive payload, and short-string length, plus a 500-level nesting depth cap threaded through object/array recursive descent.

The approach mirrors Steve Loughran's parquet-java hardening work in apache/parquet-java#3562. Pulled in the changes from @steveloughran's #16335

Changes

  • SerializedMetadata, SerializedArray, SerializedObject: header / count / offset-table bounds, lazy per-element offset and field-id checks, long arithmetic to prevent int overflow.
  • SerializedPrimitive: payload bounds, BIN/STR size-field bounds, non-negative size.
  • SerializedShortString: length bounds.
  • VariantUtil: MAX_VARIANT_DEPTH = 500 and fromBuffer dispatch threading depth through recursive descent.
  • TestMalformedVariant: 20 attack-payload tests covering each new check.

Test plan

  • ./gradlew :iceberg-api:test --tests 'org.apache.iceberg.variants.*'

Co-authored-by: Steve Loughran <stevel@cloudera.com>
@github-actions github-actions Bot added the API label May 26, 2026
@nssalian nssalian marked this pull request as ready for review May 26, 2026 16:57
Comment thread api/src/main/java/org/apache/iceberg/variants/SerializedArray.java
Preconditions.checkArgument(
offsetTableEnd <= value.remaining(),
"Invalid variant array: element count %s exceeds buffer",
numElements);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've just realised that this and the parquet-java hardening don't worry about leftover data. "don't do that" is implicit the policy there, being as it is useless.

I wonder what the rust parquet reader does.

Comment thread api/src/main/java/org/apache/iceberg/variants/VariantUtil.java
@nssalian nssalian requested a review from steveloughran June 14, 2026 21:49

@laskoviymishka laskoviymishka left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice hardening here. TestMalformedVariant covers a good spread of crafted-buffer cases, and moving the offset-table checks to long is the right direction.

I’d hold on one bug and one design question.

The bug: in both SerializedArray and SerializedObject, we compute the offset-table bound in long, check it, then immediately assign dataOffset using plain int arithmetic. With 4-byte offsets this can still overflow around ~536M elements. The long guard passes, dataOffset wraps negative, and we get a ByteBuffer underflow instead of the IllegalArgumentException this PR is trying to guarantee.

The design question: MAX_VARIANT_DEPTH = 500 is a non-spec limit. Parquet Variant doesn’t cap nesting, and Iceberg defers to that. So a >500-level variant written by Spark or another client could read elsewhere but fail here at scan time. What did parquet-java#3562 settle on? I’d either match that or document why 500 is the right cap.

Once the overflow is fixed and the depth-limit decision is clear, happy to take another pass.

offsetTableEnd <= value.remaining(),
"Invalid variant array: element count %s exceeds buffer",
numElements);
this.dataOffset = offsetListOffset + ((1 + numElements) * offsetSize);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dataOffset here is assigned with plain-int arithmetic, but we just computed the same bound in long as offsetTableEnd and guarded against it.

For offsetSize = 4, (1 + numElements) * offsetSize overflows int around numElements ~536M — a valid 4-byte count. The long guard above still passes, so dataOffset wraps negative and we hit a ByteBuffer underflow at access time instead of the IllegalArgumentException this check is meant to produce.

I'd just reuse the value we already have: this.dataOffset = Math.toIntExact(offsetTableEnd). wdyt?

+ (long) numElements * fieldIdSize
+ ((long) numElements + 1L) * offsetSize;
Preconditions.checkArgument(
dataStart <= value.remaining(),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same int-overflow as the array side: offsetListOffset and dataOffset below are assigned in plain int while dataStart here is long. numElements * fieldIdSize and (1 + numElements) * offsetSize can overflow before the long guard's bound is ever applied to the int fields, wrapping the data offset negative for a large-but-valid count.

I'd derive both offsets from the long intermediates and Math.toIntExact them so the guard actually protects the value we use.

static VariantValue fromBuffer(VariantMetadata metadata, ByteBuffer value, int depth) {
Preconditions.checkArgument(
depth <= MAX_VARIANT_DEPTH,
"Invalid variant: nesting depth %s exceeds maximum %s",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Javadoc on MAX_VARIANT_DEPTH says a variant "may contain up to MAX_VARIANT_DEPTH nested levels," but depth <= MAX_VARIANT_DEPTH with the top-level at depth 0 actually admits 501 levels (depths 0 through 500). It's internally consistent, just off-by-one against the comment.

I'd either flip this to depth < MAX_VARIANT_DEPTH or update the Javadoc to state the counting model exactly. A depth >= 0 lower bound here would also close the direct-call path where depth + 1 could overflow past the cap.

* Variant may contain up to {@code MAX_VARIANT_DEPTH} nested levels.
*/
static final int MAX_VARIANT_DEPTH = 500;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a non-spec limit — the Parquet VariantEncoding spec doesn't cap nesting depth and Iceberg defers to it. A valid variant with >500 levels written by Spark or another client would read fine elsewhere but throw here at scan time, which reads as silent data loss rather than malformed-input rejection.

What did parquet-java#3562 settle on? If it picked a different value (or no cap), we'd diverge on file interop. I'd at minimum cross-reference that decision in a comment here and confirm 500 is high enough to never reject real data (it's below Jackson's default of 2000).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no real reason, except it was what was in org.apache.parquet.variant.VariantJsonParser

how about we discuss on parquet dev list?

I don't really care, just that there's some limit and that it's broadly consistent.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've started a discussion on the parquet dev list, please get involved. Looking at jackson, seems like 1000 is their limit right now, so copying that would be consistent. Let's see what the mailing list discussion outcome is though and we can all go with that, and add to the test dataset.
FasterXML/jackson-core#943

@steveloughran

Copy link
Copy Markdown
Contributor

claude looking at your tests after taking the ones from the suite that weren't in the parquet lib one and replicating

                                                                                                                                          
                                                                                                                                                                                                                                                                
  ┌────────────────────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────┐                                                                                                                 
  │               Local test               │                                         Why it's unique                                          │                                                                                                                 
  ├────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤                                                                                                                 
  │ testWellFormedRoundTrip                │ Happy-path baseline. Iceberg's file is malformed-only, so it has no positive control.            │                                                                                                                 
  ├────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤                                                                                                                 
  │ testEmptyMetadataRejected              │ Zero-byte metadata → IllegalArgumentException "empty". No empty-metadata case in the Iceberg 20. │                                                                                                                 
  ├────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤                                                                                                                 
  │ testSingleByteMetadataRejected         │ Lone version byte (below the 3-byte spec minimum) → "truncated". No analog.                      │                                                                                                                 
  ├────────────────────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────┤                                                                                                                 
  │ testUnsupportedMetadataVersionRejected │ Version bits ≠ 1 → UnsupportedOperationException. Iceberg has no version-check test.             │                                                                                                                 
  └────────────────────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────┘                                                                                                                 
                                                                                                                                                                                                                                                                

import java.nio.ByteOrder;
import org.junit.jupiter.api.Test;

public class TestMalformedVariant {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a test to see how a different version is handled. it MUST be rejected

* Variant may contain up to {@code MAX_VARIANT_DEPTH} nested levels.
*/
static final int MAX_VARIANT_DEPTH = 500;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've started a discussion on the parquet dev list, please get involved. Looking at jackson, seems like 1000 is their limit right now, so copying that would be consistent. Let's see what the mailing list discussion outcome is though and we can all go with that, and add to the test dataset.
FasterXML/jackson-core#943

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Harden Variant Reading

3 participants