Merged
5 changes: 5 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,11 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- Writer: `vortex.variant` encoder. Encodes a variant column as the canonical `vortex.variant` container over `core_storage` — an all-equal column becomes a single `vortex.constant`, a row-varying column a `vortex.chunked` of per-run constants — with an optional row-aligned typed `shredded` child recorded in `VariantMetadata.shredded_dtype`. Input is `VariantData(List<Scalar>)` with `.constant(n, v)` / `.shredded(...)` factories. Java↔Rust (JNI) round-trip verified for constant, row-varying, and shredded columns. Scalar values only — arbitrary nested objects need `vortex.parquet.variant` (deferred, [ADR 0014](docs/adr/0014-variant-encoding-strategy.md)).
- Reader: variant columns now decode Java-side. `ConstantEncodingDecoder` and `ChunkedEncodingDecoder` handle `DType.Variant` (materialising the inner-typed array); `VariantEncodingDecoder` wraps the result as `VariantArray`, exposing `coreStorage()` and `shredded()`.

## [0.7.3] — 2026-06-17

Parquet ZSTD support, `vortex.patched` encoder, constant-encoding selection fix, Windows TUI raw-mode fix.
@@ -77,7 +77,7 @@ public enum EncodingId {
    VORTEX_MASKED("vortex.masked"),
    /// Patched encoding (not yet implemented; registered to prevent parse errors).
    VORTEX_PATCHED("vortex.patched"),
-   /// Variant encoding (not yet implemented; registered to prevent parse errors).
+   /// Variant logical encoding: canonical container over `core_storage` plus an optional shredded child.
    VORTEX_VARIANT("vortex.variant"),
    ;

90 changes: 90 additions & 0 deletions docs/adr/0014-variant-encoding-strategy.md
@@ -0,0 +1,90 @@
# ADR 0014: Variant encoding strategy — chunked constants now, parquet.variant later

- **Status:** Implemented
- **Date:** 2026-06-18
- **Deciders:** project maintainer
- **Supersedes:** —
- **Superseded by:** —

## Context

Java could already *decode* the canonical variant container (`vortex.variant`) but
`VariantEncodingEncoder` threw `"encode not yet implemented"`, so no Java-written file
could carry a variant column. Variant is the biggest external-facing gap: every
JSON-shaped ingest pipeline maps to it.

The Rust reference has two distinct encodings:

- **`vortex.variant`** — the canonical container. Purely structural: `nbuffers = 0`,
slots `[core_storage, shredded]`. It stores no bytes itself. `core_storage` must be a
`DType::Variant` array but "may be chunked, constant, or otherwise encoded". `shredded`
is an optional row-aligned typed tree for selected paths, its dtype recorded in
`VariantMetadata.shredded_dtype`.
- **`vortex.parquet.variant`** — a separate *physical* encoding (id literally
`"vortex.parquet.variant"`), slots `[validity, metadata, value, typed_value]`, where
`metadata` and `value` are opaque Binary columns holding the Apache Variant binary
(metadata dictionary + value bytes) per row. This is what Rust writes for real,
arbitrary JSON-shaped columns.

A row-varying variant column therefore has (at least) two valid physical layouts. The
Rust test suite itself builds a non-constant column *without* `parquet.variant`: a
`ChunkedArray` of one-row `ConstantArray`s of variant scalars, wrapped in the canonical
`VariantArray`.

## Decision

Encode variant columns using **only the canonical `vortex.variant` container over
existing encodings** — no new physical encoding for now.

`core_storage` is built from per-row inner scalars (`Scalar::variant(inner)`):

- all rows equal → a single `vortex.constant` child (the constant broadcasts);
- varying rows → `vortex.chunked`, child 0 = cumulative `u64` run offsets, then one
`vortex.constant` per run of equal adjacent values.

Adjacent-equal values are coalesced into runs so an all-equal column collapses to one
constant. Shredding is expressed through the container's `shredded` child plus
`VariantMetadata.shredded_dtype`.
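
The run coalescing described above can be sketched in plain Java. This is an illustration under stated assumptions, not the project's encoder: `RunCoalescer`, `Run`, `coalesce`, and `cumulativeOffsets` are hypothetical names, plain `Integer` values stand in for variant scalars, and the leading `0` in the offsets array is an assumption about the `chunk_offsets` layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

/** Hypothetical helper (not project API): groups equal adjacent values into runs. */
final class RunCoalescer {

    /** One run: a value and the number of consecutive rows that carry it. */
    record Run<T>(T value, long length) {}

    /** Coalesces adjacent-equal rows; an all-equal input yields exactly one run. */
    static <T> List<Run<T>> coalesce(List<T> rows) {
        List<Run<T>> runs = new ArrayList<>();
        for (T row : rows) {
            int last = runs.size() - 1;
            if (last >= 0 && Objects.equals(runs.get(last).value(), row)) {
                // Same value as the previous row: extend the current run.
                Run<T> prev = runs.get(last);
                runs.set(last, new Run<>(prev.value(), prev.length() + 1));
            } else {
                // Value changed (or first row): start a new run of length 1.
                runs.add(new Run<>(row, 1));
            }
        }
        return runs;
    }

    /** Cumulative run offsets [0, len0, len0 + len1, ...]; leading 0 is assumed. */
    static <T> long[] cumulativeOffsets(List<Run<T>> runs) {
        long[] offsets = new long[runs.size() + 1];
        for (int i = 0; i < runs.size(); i++) {
            offsets[i + 1] = offsets[i] + runs.get(i).length();
        }
        return offsets;
    }

    public static void main(String[] args) {
        List<Run<Integer>> runs = coalesce(List.of(7, 7, 10, 10, 10, 7));
        System.out.println(runs);
        System.out.println(Arrays.toString(cumulativeOffsets(runs)));
    }
}
```

For the rows `7, 7, 10, 10, 10, 7` the sketch produces three runs (lengths 2, 3, 1) and the offsets `0, 2, 5, 6`. Note that only adjacent duplicates merge: the trailing `7` starts a new run, which is why high-cardinality or interleaved columns end up with many constants.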

**Defer `vortex.parquet.variant`** (the efficient Apache-Variant-binary physical
encoding) until a concrete need for arbitrary JSON-shaped object columns arrives.

## Consequences

### Positive

- No new `EncodingId`, decoder, encoder, proto message, or reader array type. Reuses
`vortex.variant` + `vortex.chunked` + `vortex.constant`, all already round-tripped.
- Matches a layout the Rust reference produces and reads, so Java↔Rust interop holds
(verified via JNI: row count + Arrow schema for constant, row-varying, and shredded columns).
- Keeps the variant surface small and reviewable, and gives Java a working variant write path now.

### Negative

- Per-row values are limited to **typed scalars** wrapped as variants, not arbitrary
Apache Variant binary objects (nested JSON). Real object columns need
`parquet.variant`.
- One `vortex.constant` per distinct run is space-inefficient for high-cardinality
columns versus two packed Binary columns.

### Risks to manage

- Heterogeneous inner dtypes across rows: the chunked-of-constants read path assumes a
consistent inner dtype. Mixed-type variant columns are out of scope until
`parquet.variant`.
- When `parquet.variant` lands it becomes a second physical layout for the same logical
column; readers must handle both (the canonical container already abstracts this).
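
One way to surface the heterogeneous-dtype risk early is to reject mixed inner dtypes before encoding. The following is a sketch only, not the project's behaviour: `InnerDtypeCheck.requireUniform` is a hypothetical helper, and strings stand in for the project's dtype objects.

```java
import java.util.List;
import java.util.Objects;

/** Hypothetical guard (not project API): rejects variant rows with mixed inner dtypes. */
final class InnerDtypeCheck {

    /**
     * Returns the single inner dtype shared by all rows.
     * Throws if the list is empty or any row's dtype differs from row 0.
     */
    static <T> T requireUniform(List<T> innerDtypes) {
        if (innerDtypes.isEmpty()) {
            throw new IllegalArgumentException("variant column has no rows");
        }
        T first = innerDtypes.get(0);
        for (int row = 1; row < innerDtypes.size(); row++) {
            T current = innerDtypes.get(row);
            if (!Objects.equals(first, current)) {
                throw new IllegalArgumentException(
                        "mixed inner dtypes: row 0 is " + first + ", row " + row + " is " + current);
            }
        }
        return first;
    }

    public static void main(String[] args) {
        System.out.println(requireUniform(List.of("i32", "i32", "i32")));
        try {
            requireUniform(List.of("i32", "utf8"));
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing at write time keeps the error next to the offending input, rather than letting a reader that takes the inner dtype from the first chunk misinterpret later chunks.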

## Alternatives considered

- **Implement `vortex.parquet.variant` now.** Rejected for the first milestone: new
encoding id + decoder + encoder + `ParquetVariantMetadata{has_value, typed_value_dtype,
value_nullable}` proto (regeneration) + reader array type — a much larger surface, not
needed until arbitrary object columns are required. Tracked as the next step.

## References

- Rust: `vortex-array/src/arrays/variant/vtable/mod.rs` (`vortex.variant`),
`encodings/parquet-variant/src/vtable.rs` (`vortex.parquet.variant`),
`vortex-array/src/arrays/chunked/array.rs` (cumulative `chunk_offsets`).
- Commits `35438c72` (Layer A constant), `3b1be436` (Layer B chunked constants).
1 change: 1 addition & 0 deletions docs/adr/ADR.md
@@ -26,3 +26,4 @@ Each ADR is a Markdown file named `NNNN-short-title.md`. Use `template.md` as th
| 0011 | Writer zero-copy MemorySegment overload | Deferred |
| 0012 | Zero-copy layout decoding: lazy Chunked/Dict | Implemented |
| 0013 | Compute primitives: masks, kernels, no-materialise | Proposed |
| 0014 | Variant encoding: chunked constants now, parquet.variant later | Implemented |
6 changes: 3 additions & 3 deletions docs/compatibility.md
@@ -33,7 +33,7 @@ resolves only the standalone decoders in `reader`; no encoder class is loaded.
|------|------------|-------------|
| `DType::Union` (`fbs.DType.Type.Union = 12`) | Rust 0.71.0 | ❌ Decode throws `VortexException("unsupported DType typeType=12")`. No `DType.Union` variant in Java's sealed type. |
| `vortex.onpair` experimental string encoding | Rust 0.74.0 | ❌ Not registered. Files using it fail to decode unless `Registry.allowUnknown()` is enabled. |
- | `vortex.variant` write path | Rust 0.73.0 (`Allow writing Variant to files`, #7945) | ❌ Java decode works; Java encode throws `"encode not yet implemented"`. Java↔Rust round-trip not possible for Variant columns. |
+ | `vortex.variant` arbitrary nested objects | Rust (`vortex.parquet.variant`) | ⚠️ Java encodes/decodes variant columns of **typed scalar** values (constant / chunked-of-constants core, optional shredded child); Java↔Rust round-trip verified. Arbitrary nested JSON objects and real path-based shredding need the `vortex.parquet.variant` physical encoding — deferred ([ADR 0014](adr/0014-variant-encoding-strategy.md)). |
| Arrow extension array import affecting Variant shape | Rust 0.74.0 (#8125) | Untested. Re-run integration fixtures against v0.74.0 once published. |

## Encodings
@@ -72,7 +72,7 @@ resolves only the standalone decoders in `reader`; no encoder class is loaded.
| `fastlanes.for` | `FrameOfReferenceEncodingDecoder`| `FrameOfReferenceEncodingEncoder`| ✅ | ✅ | Integer PTypes |
| `fastlanes.rle` | `RleEncodingDecoder` | `RleEncodingEncoder` | ✅ | ✅ | Chunk-based RLE |
| `vortex.patched` | `PatchedEncodingDecoder` | `PatchedEncodingEncoder` | ✅ | ❌ | Primitive PTypes; encode not yet implemented |
- | `vortex.variant` | `VariantEncodingDecoder` | `VariantEncodingEncoder` | ✅ | ❌ | Decode (incl. shredded child); encode not yet implemented (Rust 0.73+) |
+ | `vortex.variant` | `VariantEncodingDecoder` | `VariantEncodingEncoder` | ✅ | ✅ | Canonical container; constant / chunked-of-constants core + optional shredded child. Typed-scalar values only — nested objects need `parquet.variant` (ADR 0014) |
| `vortex.onpair` | _none_ | _none_ | ❌ | ❌ | Experimental in Rust 0.74.0; not yet ported |

### Decode shape
@@ -124,7 +124,7 @@ decoder falls into one of three shapes:
| `fastlanes.for` | Lazy | Lazy | `LazyForXxxArray` (I8/U8/I16/U16/I32/U32/I64/U64), ADR 0010 + 0013 |
| `fastlanes.rle` | Lazy | Lazy | `LazyRleXxxArray`; validity → `OffsetBoolArray`; empty → `LazyConstantXxxArray`, ADR 0013 |
| `vortex.patched` | Materialized | Materialized | inner is full base + chunked patches (1024-elem blocks, lane-window-sorted); per-row access requires 2 laneOffsets reads + binary search inside the chunk window, so eager scatter wins for full scans |
- | `vortex.variant` | Materialized | TBD | shredded child reassembly |
+ | `vortex.variant` | Lazy | Lazy | container wraps constant/chunked core (inner-typed) + optional shredded child |
| `vortex.onpair` | n/a | n/a | not ported |

Decompression-style encodings (Bitpacked / Pco / Zstd / Fsst / Delta) stay Materialized by design
@@ -0,0 +1,122 @@
package io.github.dfa1.vortex.integration;

import dev.vortex.api.DataSource;
import dev.vortex.api.Session;
import dev.vortex.arrow.ArrowAllocation;
import dev.vortex.jni.NativeLoader;
import io.github.dfa1.vortex.core.DType;
import io.github.dfa1.vortex.core.PType;
import io.github.dfa1.vortex.proto.Primitive;
import io.github.dfa1.vortex.proto.Scalar;
import io.github.dfa1.vortex.proto.ScalarValue;
import io.github.dfa1.vortex.writer.VortexWriter;
import io.github.dfa1.vortex.writer.WriteOptions;
import io.github.dfa1.vortex.writer.encode.VariantData;
import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.vector.types.pojo.Schema;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.io.TempDir;

import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

import static org.assertj.core.api.Assertions.assertThat;

/// Cross-compatibility for `vortex.variant`: Java writes constant, row-varying, and
/// shredded variant columns; the Rust (JNI) reader opens each file and reports the
/// correct row count and schema. Arrow has no native variant type, so these tests assert
/// that the file parses end-to-end against the reference reader rather than decoding
/// per-row values.
class VariantJavaWritesRustReadsIntegrationTest {

    private static final Session SESSION = Session.create();
    private static final BufferAllocator ALLOCATOR = ArrowAllocation.rootAllocator();

    private static final DType.Struct VARIANT_SCHEMA = new DType.Struct(
            List.of("v"),
            List.of(new DType.Variant(false)),
            false);

    static {
        NativeLoader.loadJni();
    }

    @Test
    void javaWriter_jniReader_constantVariantColumn(@TempDir Path tmp) throws IOException {
        // Given — a constant variant column: every one of N rows is the i32 variant value 7.
        // Built as Rust's Scalar::variant(Scalar::primitive(7i32)): the inner Scalar carries
        // its own i32 dtype so the reference reader knows the wrapped value's type.
        Path file = tmp.resolve("java_variant.vtx");
        int rows = 5;
        VariantData data = VariantData.constant(rows, i32Variant(7L));

        try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             var sut = VortexWriter.create(ch, VARIANT_SCHEMA, WriteOptions.defaults())) {
            // When
            sut.writeChunk(Map.of("v", data));
        }

        // Then — the Rust reader parses the variant layout and agrees on row count + schema.
        DataSource ds = DataSource.open(SESSION, file.toAbsolutePath().toUri().toString());
        OptionalLong count = ds.rowCount().asOptional();
        assertThat(count).hasValue(rows);

        Schema schema = ds.arrowSchema(ALLOCATOR);
        assertThat(schema.getFields()).extracting(f -> f.getName()).contains("v");
    }

    @Test
    void javaWriter_jniReader_varyingVariantColumn(@TempDir Path tmp) throws IOException {
        // Given — a non-constant variant column: distinct per-row i32 values. The encoder
        // lays this out as core_storage = vortex.chunked of one vortex.constant per row,
        // exactly the representation the Rust reference uses for a row-varying variant array.
        Path file = tmp.resolve("java_variant_varying.vtx");
        List<Scalar> values = List.of(i32Variant(10L), i32Variant(20L), i32Variant(30L), i32Variant(40L));
        VariantData data = new VariantData(values);

        try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             var sut = VortexWriter.create(ch, VARIANT_SCHEMA, WriteOptions.defaults())) {
            // When
            sut.writeChunk(Map.of("v", data));
        }

        // Then — the Rust reader parses the chunked variant layout and agrees on row count + schema.
        DataSource ds = DataSource.open(SESSION, file.toAbsolutePath().toUri().toString());
        assertThat(ds.rowCount().asOptional()).hasValue(values.size());
        assertThat(ds.arrowSchema(ALLOCATOR).getFields()).extracting(f -> f.getName()).contains("v");
    }

    @Test
    void javaWriter_jniReader_shreddedVariantColumn(@TempDir Path tmp) throws IOException {
        // Given — a variant column with a row-aligned shredded i32 projection. The container
        // gains a second (shredded) typed child plus shredded_dtype in its metadata.
        Path file = tmp.resolve("java_variant_shredded.vtx");
        List<Scalar> values = List.of(i32Variant(10L), i32Variant(20L), i32Variant(30L));
        VariantData data = VariantData.shredded(
                values, new int[]{10, 20, 30}, new DType.Primitive(PType.I32, false));

        try (var ch = FileChannel.open(file, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             var sut = VortexWriter.create(ch, VARIANT_SCHEMA, WriteOptions.defaults())) {
            // When
            sut.writeChunk(Map.of("v", data));
        }

        // Then — the Rust reader parses the shredded variant layout and agrees on row count + schema.
        DataSource ds = DataSource.open(SESSION, file.toAbsolutePath().toUri().toString());
        assertThat(ds.rowCount().asOptional()).hasValue(values.size());
        assertThat(ds.arrowSchema(ALLOCATOR).getFields()).extracting(f -> f.getName()).contains("v");
    }

    private static Scalar i32Variant(long value) {
        // Inner typed scalar carrying its own i32 dtype, wrapped as a variant value
        // (mirrors Rust Scalar::variant(Scalar::primitive(value))).
        return new Scalar(
                io.github.dfa1.vortex.proto.DType.ofPrimitive(
                        new Primitive(io.github.dfa1.vortex.proto.PType.I32, false)),
                ScalarValue.ofInt64Value(value));
    }
}
@@ -86,6 +86,23 @@ private static Array wrap(List<Array> chunks, DType dtype, long totalRows) {
        if (dtype instanceof DType.Struct struct) {
            return wrapStruct(chunks, struct, totalRows);
        }
        if (dtype instanceof DType.Variant) {
            // Each chunk decoded as Variant materialises to its inner-typed constant array
            // (see ConstantEncodingDecoder). Wrap the chunks under that inner dtype; the
            // VariantArray container re-applies the logical Variant dtype.
            if (chunks.isEmpty()) {
                throw new VortexException(EncodingId.VORTEX_CHUNKED, "chunked variant has no chunks");
            }
            DType innerDtype = chunks.get(0).dtype();
            if (innerDtype instanceof DType.Primitive innerPt) {
                return wrapPrimitive(chunks, innerPt, innerDtype, totalRows);
            }
            if (innerDtype instanceof DType.Bool) {
                return ChunkedBoolArray.of(innerDtype, totalRows, chunks);
            }
            throw new VortexException(EncodingId.VORTEX_CHUNKED,
                    "chunked variant inner dtype not supported: " + innerDtype);
        }
        throw new VortexException(EncodingId.VORTEX_CHUNKED,
                "chunked not supported for dtype: " + dtype);
    }