feat(variant): vortex.variant encode + Java-side read + shredding#53

Merged
dfa1 merged 7 commits into main from feat/variant-encode on Jun 18, 2026

dfa1 (Owner) commented Jun 18, 2026

Implements vortex.variant encode (and symmetric Java-side decode) for everything short of the vortex.parquet.variant physical encoding, which is deliberately deferred (see ADR 0014).

What

  • Layer A — constant columns (35438c72): encode an all-equal variant column as a single vortex.constant child under the canonical vortex.variant container. Adds the missing DType.Variant case to VortexWriter.serializeDType.
  • Layer B — row-varying (3b1be436): varying columns encode as core_storage = vortex.chunked (cumulative u64 run offsets + one vortex.constant per run of equal adjacent values). Reuses existing encodings — no new physical encoding. Mirrors the Rust reference.
  • ADR 0014 (6d5ad386): records the chunked-of-constants strategy and defers vortex.parquet.variant (efficient Apache Variant binary + real path-based shredding) until a concrete object-column need.
  • Java-side read (64662a27): ConstantEncodingDecoder + ChunkedEncodingDecoder now handle DType.Variant (materialise the inner-typed array); VariantEncodingDecoder wraps as VariantArray. Makes the reader symmetric with the writer.
  • Layer C — shredding (2d1b027a): VariantData can carry a row-aligned typed shredded column; the encoder emits it as the container's second child and records VariantMetadata.shredded_dtype. The existing decoder reads it.
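The Layer A / Layer B split above comes down to coalescing adjacent equal values into runs. The following is a minimal sketch of that step only, not the encoder in this PR: `RunCoalescer` and `Runs` are hypothetical names, values are compared with `Objects.equals`, and the offsets are assumed to start at 0 and end at the row count (one more entry than there are runs).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical helper, not part of the Vortex writer API.
final class RunCoalescer {

    // Result of coalescing: offsets has values.size() + 1 entries (assumed layout).
    static final class Runs<T> {
        final long[] offsets;   // cumulative run boundaries, starting at 0
        final List<T> values;   // one value per run of equal adjacent rows

        Runs(long[] offsets, List<T> values) {
            this.offsets = offsets;
            this.values = values;
        }

        // A single run corresponds to the constant-column case (Layer A).
        boolean isConstant() {
            return values.size() == 1;
        }
    }

    static <T> Runs<T> coalesce(List<T> rows) {
        List<T> values = new ArrayList<>();
        List<Long> offsets = new ArrayList<>();
        offsets.add(0L);
        for (int i = 0; i < rows.size(); i++) {
            T v = rows.get(i);
            boolean startsRun = values.isEmpty()
                    || !Objects.equals(values.get(values.size() - 1), v);
            if (startsRun) {
                if (i > 0) {
                    offsets.add((long) i); // close the previous run at row i
                }
                values.add(v);
            }
        }
        if (!rows.isEmpty()) {
            offsets.add((long) rows.size()); // close the final run
        }
        long[] out = offsets.stream().mapToLong(Long::longValue).toArray();
        return new Runs<>(out, values);
    }
}
```

Under these assumptions, rows `[1, 1, 2, 2, 2, 1]` yield three runs with boundaries at rows 0, 2, 5 and 6, while an all-equal column yields a single run and would take the constant path.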

Input API: VariantData(List<Scalar> values, Object shreddedData, DType shreddedDtype) with .constant(n, v) and .shredded(...) factories — each row is a typed inner scalar wrapped as a variant (Scalar::variant(inner)).

Verification

  • 16 writer unit tests (encode structure, metadata, validation, encode↔decode round-trip, shredded).
  • 500 reader unit tests green (ConstantEncodingDecoder dtype-threading refactor is behaviour-preserving).
  • 3 Rust (JNI) integration tests — constant, row-varying, and shredded variant columns all round-trip against the reference reader.
  • Full reactor build green, including Javadoc.

Not in scope (deferred — ADR 0014)

vortex.parquet.variant physical encoding: arbitrary nested JSON objects and real path-based shredding. The current model is typed-scalar-per-row, so shredding is mechanically real but semantically thin until that lands.

🤖 Generated with Claude Code

dfa1 and others added 6 commits June 18, 2026 21:54
VariantEncodingEncoder previously threw "encode not yet implemented".
Implement the container encode for constant variant columns: a single
vortex.constant child (core_storage) wrapping the variant scalar, no
shredded child, VariantMetadata with absent shredded_dtype.

Also add the missing DType.Variant case to VortexWriter.serializeDType
so the column's logical type is written into the file's DType blob.

Scope is Layer A only (every row holds the same value). Physical
variant-bytes storage (Layer B) and shredding (Layer C) are future work.
Java-side read of a constant variant is not yet supported; the round-trip
is proven against the Rust reference reader via JNI.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Extend VariantData to carry one inner scalar per row instead of a single
constant. The encoder coalesces adjacent equal values into runs:
- all rows equal -> a single vortex.constant child (Layer A, unchanged on the wire);
- varying values -> core_storage = vortex.chunked, child 0 the cumulative u64 run
  offsets, then one vortex.constant per run.

This mirrors the Rust reference, which represents a row-varying variant column
as a chunked array of constant variant scalars under the canonical variant array
(vortex.variant), with no new physical encoding. Efficient Apache Variant binary
storage (vortex.parquet.variant) and shredding remain future work.

VariantData.constant(length, value) preserves the constant-column ergonomics.
Verified end-to-end: the Rust (JNI) reader round-trips both a constant and a
row-varying Java-written variant column.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…rquet.variant later

Record the decision to encode variant columns via the canonical vortex.variant
container over existing encodings (constant / chunked-of-constants), and to defer
the vortex.parquet.variant physical encoding (and shredding of arbitrary objects)
until a real JSON-shaped object-column need arrives. Captures the Rust wire-format
references gathered while implementing the encoder.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ConstantEncodingDecoder now handles DType.Variant by unwrapping the variant
scalar to its typed inner scalar and materialising the inner-typed constant
array; the dispatch was refactored to thread the dtype explicitly so Extension
and Variant can recurse without re-reading the buffer. ChunkedEncodingDecoder
wraps Variant chunks under their inner dtype. VariantEncodingDecoder.accepts now
reports Variant.

Together with VariantEncodingDecoder this makes the Java reader symmetric with
the writer: a Java-encoded variant column round-trips through Java decode, with
core storage exposing each row's inner value (verified for constant and
row-varying columns).
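To illustrate what "exposing each row's inner value" means for a chunked-of-constants column, here is a hedged sketch of resolving a row index to its run. It is not the ChunkedEncodingDecoder from this PR: `RunLookup` is a hypothetical name, and the offsets are assumed to be strictly increasing, to start at 0, and to end at the total row count.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the Vortex reader API.
final class RunLookup {

    // Returns the value of the run that contains `row`.
    static <T> T valueAt(long[] offsets, List<T> runValues, long row) {
        long total = offsets[offsets.length - 1];
        if (row < 0 || row >= total) {
            throw new IndexOutOfBoundsException("row " + row + " not in [0, " + total + ")");
        }
        int pos = Arrays.binarySearch(offsets, row);
        // Exact hit: `row` is the first row of run `pos`.
        // Miss: binarySearch returns -(insertionPoint) - 1, and the
        // containing run is the one just before the insertion point.
        int run = pos >= 0 ? pos : -pos - 2;
        return runValues.get(run);
    }
}
```

A decoder that materialises the whole column would instead expand each run in order; the binary search is only useful for random access by row.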

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

VariantData can now carry a row-aligned shredded typed column (shreddedData +
shreddedDtype). When present, the encoder emits it as the container's second
child (encoded via the matching primitive/varbin/bool encoder) and records its
dtype in VariantMetadata.shredded_dtype; otherwise the container stays
single-child as before.

The existing VariantEncodingDecoder already reads the shredded child, so this
round-trips Java-side (decode exposes VariantArray.shredded()) and through the
Rust reference reader (JNI).

NOTE: with the current scalar-only variant model there are no object paths, so a
shredded child can only re-express the values as a typed column. Real path-based
shredding of nested objects needs vortex.parquet.variant (deferred, ADR 0014).
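The NOTE above can be made concrete with a small sketch: with one typed scalar per row, "shredding" amounts to copying the same values into a row-aligned typed column. The code below is illustrative only; `ShredSketch` and `shredAsLongs` are hypothetical names, plain `Long` objects stand in for the inner scalars, and null handling is left out because the PR text does not describe it.

```java
import java.util.List;

// Hypothetical helper, not part of the Vortex writer API.
final class ShredSketch {

    // Re-expresses per-row values as a primitive long column of the same length.
    // Rejects any row that is not a Long (including null), since a typed
    // column cannot represent it in this simplified model.
    static long[] shredAsLongs(List<?> rows) {
        long[] column = new long[rows.size()];
        for (int i = 0; i < rows.size(); i++) {
            Object v = rows.get(i);
            if (!(v instanceof Long)) {
                throw new IllegalArgumentException("row " + i + " is not a Long: " + v);
            }
            column[i] = (Long) v;
        }
        return column;
    }
}
```

Because the output has exactly one entry per input row, it stays row-aligned with the variant values, which is the property the shredded child relies on.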

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Update the compatibility matrix (encode now ✅; reframe the wire-format gap as
"arbitrary nested objects need parquet.variant"), the decode-shape table, the
EncodingId doc comment, and add a CHANGELOG entry under [Unreleased].

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

dfa1 force-pushed the feat/variant-encode branch from bd303e7 to d17a890 June 18, 2026 19:54

Variant encode + Java-side read + shredding are realized and verified
(unit + Rust JNI). parquet.variant remains deferred by design.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

dfa1 merged commit a663a99 into main Jun 18, 2026
6 checks passed
dfa1 deleted the feat/variant-encode branch June 18, 2026 20:08