Skip to content

GH-45946: [C++][Parquet] Variant decoding#50121

Open
qzyu999 wants to merge 2 commits into
apache:mainfrom
qzyu999:variant-decoding
Open

GH-45946: [C++][Parquet] Variant decoding#50121
qzyu999 wants to merge 2 commits into
apache:mainfrom
qzyu999:variant-decoding

Conversation

@qzyu999

@qzyu999 qzyu999 commented Jun 8, 2026

Copy link
Copy Markdown

PR Stack (merge in order):

  1. GH-45946: [C++][Parquet] Variant decoding #50121 YOU ARE HERE Variant decoding (this PR)
  2. GH-45947 : [C++][Parquet] Variant encoding #50122 Variant encoding (depends on this PR)
  3. GH-45948: [C++][Parquet] Variant shredding #50232 Variant shredding (depends on GH-45947 : [C++][Parquet] Variant encoding #50122)

All three PRs are part of the GH-45937 umbrella (Add variant support to C++ Parquet).

Rationale for this change

This is part of the GH-45937 umbrella (Add variant support to C++ Parquet). The Variant Encoding Spec defines a binary format for semi-structured data in Parquet. This PR adds the decoding (reading) side, which is a prerequisite for the encoder (GH-45947, PR #50122) and shredding support (GH-45948, PR #50232).

The implementation targets feature parity with the existing arrow-go parquet/variant package, adapted to idiomatic C++ patterns. Divergences from Go are deliberate and documented in code comments.

What changes are included in this PR?

Full Variant binary decoding per the Variant Encoding Spec. Adds variant_internal.h/.cc with:

Decoder (visitor/SAX-style traversal):

  • DecodeMetadata() parses the string dictionary from raw bytes
  • DecodeVariantValue() recursive traversal invoking a VariantVisitor for each value
  • All 21 primitive types, short strings, objects (with non-monotonic offset support), arrays
  • Recursion depth limit (kMaxNestingDepth = 128) security hardening for C++ stack semantics (Go doesn't need this due to goroutine stack growth)

Random-access utilities (for future Parquet reader integration):

  • ValueSize() compute byte size of a value without full decode
  • FindObjectField() lookup by field name (binary search for 32 fields, linear for small objects)
  • GetArrayElement() O(1) element access by index
  • GetObjectFieldAt() positional field access
  • FindMetadataKey() dictionary ID lookup (binary search if sorted)

Design choices (deliberate divergences from Go documented in code):

  • Visitor pattern (SAX-style) idiomatic Arrow C++ (TypeVisitor, ArrayVisitor precedent)
  • Reserved bit 5 enforcement in metadata header (Go does not check; we fail cleanly on future spec versions)
  • Object field offset bounds validation (Go does not check; defense-in-depth against malformed input)
  • No UTF-8 validation during decode (matches Go; documented for future follow-up)
  • FindObjectField binary search uses signed int32_t for lo/hi to avoid an unsigned underflow pattern present in Go's ObjectValue.ValueByKey() (separate bug report TBD)

Bug discovered in arrow-go during development:

Are these changes tested?

Yes. 165 tests pass with BUILD_WARNING_LEVEL=CHECKIN (warnings-as-errors):

  • Metadata parsing (15 tests including error cases, all offset sizes 1-4, reserved bit rejection)
  • All primitive types + boundary values (21 tests)
  • Short strings (4 tests)
  • Objects (5 tests including 3-byte offset/id sizes, non-monotonic offsets)
  • Arrays + is_large (4 tests)
  • Nesting + depth limit (5 tests)
  • Visitor early abort propagation (2 tests)
  • Spec-conformance with handcrafted byte sequences (6 tests)
  • ValueSize including regression test for the Go bug (9 tests)
  • Random access: FindObjectField, GetArrayElement, GetObjectFieldAt (8 tests)
  • FindMetadataKey sorted/unsorted (4 tests)
  • Binary search path for large objects 32 fields (4 tests)
  • Error cases: type mismatches, truncation, offset overflows, negative indices (8 tests)

Are there any user-facing changes?

No breaking changes. This adds new public API (arrow::extension::variant_internal namespace) that did not previously exist. The header variant_internal.h is installed "internal" in the filename refers to "binary encoding internals" not visibility.

Next in chain: The encoder PR (#50122) is stacked on this branch and should be reviewed/merged after this one. The shredding PR (#50232) depends on both.

AI Disclosure: AI coding assistants were used during development for scaffolding, test generation, and review iteration. All code has been reviewed, debugged, and verified by the author who owns and understands the changes.

@misiek1984 misiek1984 left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments.

/// Searches the field IDs in the object, resolving each against the
/// metadata dictionary. Per spec, field IDs are in lexicographic order
/// of their corresponding key names, enabling binary search for large
/// objects (>=32 fields). For smaller objects, linear scan is used.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How was the 32 threshold determined?


/// \brief Basic type codes from bits 0-1 of the value header byte.
///
/// Variant Encoding Spec §3: "Value encoding"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The current version of the spec does not contain paragraph §3 and §3.1. I would just add a link to the section with tables: https://github.com/apache/parquet-format/blob/master/VariantEncoding.md#encoding-types

/// Implements parsing logic per the Variant Encoding Spec:
/// https://github.com/apache/parquet-format/blob/master/VariantEncoding.md
///
/// The "internal" in the filename refers to the binary encoding internals

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion here. But maybe instead of explaining in the comment what "internal" means it would be better to rename a file e.g. to variant_binary_encoding, variant_internal_encoding etc.


class VariantIntegrationTest : public ::testing::Test {};

TEST_F(VariantIntegrationTest, FullRoundTrip) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also add more tests demonstrating how to use all these new functions together. For example, let's assume we have the following Variant:

{
  "name": "Alice",
  "age": 30,
  "addresses": {
    "postal": {
      "country": "USA",
      "city": "New York"
    },
    "billing": {
      "country": "USA",
      "city": "Chicago"
    }
  }
}

If we want to find the city for the postal address, we would first need to use FindObjectField to find "addresses", then "postal", and finally "city". After that, we would read the value of the "city" field.

/// the visitor. Returns the number of bytes consumed.
///
/// This is the core recursive function.
Status DecodeValueAt(const VariantMetadata& metadata, const uint8_t* data, int64_t length,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this function should be public. Let's assume I want to read the value of a specific nested field from a Variant using a path (e.g., field_1.field_2.field_3).

My current understanding is that I would first need to call FindObjectField to locate "field_1". If it exists, I then have to find "field_2", and finally "field_3". However, I have to implement the last step—reading the actual value—on my own because DecodeValueAt is not public, and DecodeVariantValue only allows for decoding the entire Variant.

/// \return Status::OK on success, Status::Invalid on malformed input
///
/// \note The data buffer must remain valid for the duration of the call.
ARROW_EXPORT Status DecodeVariantValue(const VariantMetadata& metadata,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you have a plan to also support reading/decoding shredded variants?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants