feat(variant): vortex.variant encode + Java-side read + shredding #53
Merged
VariantEncodingEncoder previously threw "encode not yet implemented". Implement the container encode for constant variant columns: a single vortex.constant child (core_storage) wrapping the variant scalar, no shredded child, VariantMetadata with absent shredded_dtype. Also add the missing DType.Variant case to VortexWriter.serializeDType so the column's logical type is written into the file's DType blob. Scope is Layer A only (every row holds the same value). Physical variant-bytes storage (Layer B) and shredding (Layer C) are future work. Java-side read of a constant variant is not yet supported; the round-trip is proven against the Rust reference reader via JNI. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Extend VariantData to carry one inner scalar per row instead of a single constant. The encoder coalesces adjacent equal values into runs:
- all rows equal -> a single vortex.constant child (Layer A, unchanged on the wire);
- varying values -> core_storage = vortex.chunked, child 0 the cumulative u64 run offsets, then one vortex.constant per run.

This mirrors the Rust reference, which represents a row-varying variant column as a chunked array of constant variant scalars under the canonical variant array (vortex.variant), with no new physical encoding. Efficient Apache Variant binary storage (vortex.parquet.variant) and shredding remain future work. VariantData.constant(length, value) preserves the constant-column ergonomics. Verified end-to-end: the Rust (JNI) reader round-trips both a constant and a row-varying Java-written variant column. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
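The run coalescing described in this commit can be sketched roughly as follows. This is a minimal illustration, not the encoder from the PR: `RunCoalescer` and `Runs` are hypothetical names, values are plain `Object`s rather than variant scalars, and the offsets layout (a leading 0 followed by the end row of each run) is an assumption about how the cumulative offsets are arranged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Objects;

// Hypothetical helper for illustration; not the encoder from this PR.
final class RunCoalescer {
    /** One value per run plus cumulative row offsets. */
    static final class Runs {
        final List<Object> runValues; // each entry would back one vortex.constant chunk
        final long[] offsets;         // assumed layout: [0, end of run 0, end of run 1, ...]

        Runs(List<Object> runValues, long[] offsets) {
            this.runValues = runValues;
            this.offsets = offsets;
        }

        /** True when every row is equal, i.e. the single-constant (Layer A) case. */
        boolean isConstant() {
            return runValues.size() == 1;
        }
    }

    /** Coalesces adjacent equal values into runs. */
    static Runs coalesce(List<?> values) {
        List<Object> runValues = new ArrayList<>();
        List<Long> ends = new ArrayList<>();
        for (int i = 0; i < values.size(); i++) {
            Object v = values.get(i);
            if (i == 0 || !Objects.equals(v, values.get(i - 1))) {
                if (i > 0) {
                    ends.add((long) i); // row i starts a new run, so the previous run ends here
                }
                runValues.add(v);
            }
        }
        if (!values.isEmpty()) {
            ends.add((long) values.size()); // close the final run
        }
        // Java has no unsigned 64-bit type; a signed long stands in for u64 here.
        long[] offsets = new long[ends.size() + 1]; // offsets[0] stays 0
        for (int k = 0; k < ends.size(); k++) {
            offsets[k + 1] = ends.get(k);
        }
        return new Runs(runValues, offsets);
    }
}
```

Under these assumptions, the values `[1, 1, 2, 2, 2, 1]` coalesce into three runs with offsets `[0, 2, 5, 6]`, while `[7, 7, 7]` yields a single run and would take the constant path.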
…rquet.variant later Record the decision to encode variant columns via the canonical vortex.variant container over existing encodings (constant / chunked-of-constants), and to defer the vortex.parquet.variant physical encoding (and shredding of arbitrary objects) until a real JSON-shaped object-column need arrives. Captures the Rust wire-format references gathered while implementing the encoder. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
ConstantEncodingDecoder now handles DType.Variant by unwrapping the variant scalar to its typed inner scalar and materialising the inner-typed constant array; the dispatch was refactored to thread the dtype explicitly so Extension and Variant can recurse without re-reading the buffer. ChunkedEncodingDecoder wraps Variant chunks under their inner dtype. VariantEncodingDecoder.accepts now reports Variant. Together with VariantEncodingDecoder this makes the Java reader symmetric with the writer: a Java-encoded variant column round-trips through Java decode, with core storage exposing each row's inner value (verified for constant and row-varying columns). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
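One way to picture the dtype-threaded dispatch is the sketch below. Every type in it (`DType`, `Primitive`, `Extension`, `Variant`, `VariantScalar`) is a simplified, hypothetical stand-in rather than the project's real class, and the decoded array is modelled as a plain `Object[]`. The only point it illustrates is that passing the dtype as an argument lets the Extension and Variant cases recurse on an already-parsed scalar.

```java
import java.util.Arrays;

// Illustrative sketch only; all types are hypothetical stand-ins.
final class ConstantDecodeSketch {
    interface DType {}

    static final class Primitive implements DType {
        final String name;
        Primitive(String name) { this.name = name; }
    }

    static final class Extension implements DType {
        final DType storage; // the dtype the extension is stored as
        Extension(DType storage) { this.storage = storage; }
    }

    static final class Variant implements DType {}

    /** A variant scalar that carries its typed inner scalar. */
    static final class VariantScalar {
        final DType innerType;
        final Object innerValue;
        VariantScalar(DType innerType, Object innerValue) {
            this.innerType = innerType;
            this.innerValue = innerValue;
        }
    }

    /** Materialises a constant array; the dtype is an explicit argument so wrappers can recurse. */
    static Object[] decodeConstant(DType dtype, Object scalar, int length) {
        if (dtype instanceof Extension) {
            // Recurse on the storage dtype with the same scalar.
            return decodeConstant(((Extension) dtype).storage, scalar, length);
        }
        if (dtype instanceof Variant) {
            // Unwrap the variant scalar and recurse on its inner type.
            VariantScalar vs = (VariantScalar) scalar;
            return decodeConstant(vs.innerType, vs.innerValue, length);
        }
        // Base case: repeat the typed scalar for every row.
        Object[] out = new Object[length];
        Arrays.fill(out, scalar);
        return out;
    }
}
```

In this sketch, decoding a Variant constant holding an i64 value of 7 over three rows produces the same result as decoding a plain i64 constant of 7.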
VariantData can now carry a row-aligned shredded typed column (shreddedData + shreddedDtype). When present, the encoder emits it as the container's second child (encoded via the matching primitive/varbin/bool encoder) and records its dtype in VariantMetadata.shredded_dtype; otherwise the container stays single-child as before. The existing VariantEncodingDecoder already reads the shredded child, so this round-trips Java-side (decode exposes VariantArray.shredded()) and through the Rust reference reader (JNI). NOTE: with the current scalar-only variant model there are no object paths, so a shredded child can only re-express the values as a typed column. Real path-based shredding of nested objects needs vortex.parquet.variant (deferred, ADR 0014). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
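The optional second child can be illustrated with a small stand-in for the input type. `VariantDataSketch` below is hypothetical: it models values as plain objects and the dtype as a string, and the row-alignment check and the child names `core_storage` / `shredded` are assumptions made for the example, not a description of the real class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for VariantData; names and checks are illustrative only.
final class VariantDataSketch {
    final List<Object> values;  // one inner value per row
    final List<?> shreddedData; // row-aligned typed column, or null when absent
    final String shreddedDtype; // e.g. "i64", or null when absent

    private VariantDataSketch(List<Object> values, List<?> shreddedData, String shreddedDtype) {
        if ((shreddedData == null) != (shreddedDtype == null)) {
            throw new IllegalArgumentException("shredded data and dtype must be given together");
        }
        if (shreddedData != null && shreddedData.size() != values.size()) {
            throw new IllegalArgumentException("shredded column must be row-aligned");
        }
        this.values = values;
        this.shreddedData = shreddedData;
        this.shreddedDtype = shreddedDtype;
    }

    /** Every row holds the same value; no shredded column. */
    static VariantDataSketch constant(int length, Object value) {
        return new VariantDataSketch(new ArrayList<>(Collections.nCopies(length, value)), null, null);
    }

    /** Per-row values plus a typed column covering the same rows. */
    static VariantDataSketch shredded(List<?> values, List<?> data, String dtype) {
        return new VariantDataSketch(new ArrayList<>(values), data, dtype);
    }

    /** Mirrors the metadata field: present only when a shredded child is emitted. */
    Optional<String> metadataShreddedDtype() {
        return Optional.ofNullable(shreddedDtype);
    }

    /** Children the container would carry: one normally, two when shredded. */
    List<String> childNames() {
        List<String> children = new ArrayList<>();
        children.add("core_storage");
        if (shreddedData != null) {
            children.add("shredded");
        }
        return children;
    }
}
```

The sketch keeps the two cases symmetric: the metadata dtype is present exactly when the second child is, which matches the single-child fallback described above.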
Update the compatibility matrix (encode now ✅; reframe the wire-format gap as "arbitrary nested objects need parquet.variant"), the decode-shape table, the EncodingId doc comment, and add a CHANGELOG entry under [Unreleased]. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Variant encode + Java-side read + shredding are realized and verified (unit + Rust JNI). parquet.variant remains deferred by design. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Implements vortex.variant encode (and symmetric Java-side decode) for everything short of the vortex.parquet.variant physical encoding, which is deliberately deferred (see ADR 0014).

What

- 35438c72: encode an all-equal variant column as a single vortex.constant child under the canonical vortex.variant container. Adds the missing DType.Variant case to VortexWriter.serializeDType.
- 3b1be436: varying columns encode as core_storage = vortex.chunked (cumulative u64 run offsets + one vortex.constant per run of equal adjacent values). Reuses existing encodings, with no new physical encoding. Mirrors the Rust reference.
- 6d5ad386: records the chunked-of-constants strategy and defers vortex.parquet.variant (efficient Apache Variant binary + real path-based shredding) until a concrete object-column need.
- 64662a27: ConstantEncodingDecoder + ChunkedEncodingDecoder now handle DType.Variant (materialise the inner-typed array); VariantEncodingDecoder wraps as VariantArray. Makes the reader symmetric with the writer.
- 2d1b027a: VariantData can carry a row-aligned typed shredded column; the encoder emits it as the container's second child and records VariantMetadata.shredded_dtype. The existing decoder reads it.

Input API: VariantData(List<Scalar> values, Object shreddedData, DType shreddedDtype) with .constant(n, v) and .shredded(...) factories. Each row is a typed inner scalar wrapped as a variant (Scalar::variant(inner)).

Verification

Not in scope (deferred, ADR 0014)

- vortex.parquet.variant physical encoding: arbitrary nested JSON objects and real path-based shredding. The current model is typed-scalar-per-row, so shredding is mechanically real but semantically thin until that lands.

🤖 Generated with Claude Code