Draft: feat(format): schema evolution for the Java row codec#3714
stevenschlansker wants to merge 36 commits into
stevenschlansker
commented
Jun 4, 2026
eabbce4 to
30c5ba3
Compare
added 9 commits
June 26, 2026 15:27
Opt in with `.withSchemaEvolution()` on any row, array, or map codec builder. Fields carry `@ForyVersion(since, until)`; removed fields are listed on a nested interface referenced from `@ForySchema(removedFields = ...)`, which preserves parameterized types like `List<String>`. Older payloads are dispatched at read time; nothing changes when the flag is off. Standard and compact formats supported; interface-typed beans included.
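To make the version model concrete, here is a minimal sketch of how `since`/`until` bounds select the fields that exist at a given schema version. `SketchVersion`, `PersonSketch`, and `VersionSketch` are hypothetical stand-ins written for this illustration, not the Fory annotations or classes. The sketch assumes `until` is an exclusive bound and that an omitted `until` means "still live".

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for @ForyVersion; not the real Fory annotation.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface SketchVersion {
  int since() default 1;

  int until() default Integer.MAX_VALUE; // assumed exclusive; MAX_VALUE means "never removed"
}

// Current shape of an interface-typed bean: "email" was added at version 2.
interface PersonSketch {
  @SketchVersion(since = 1)
  String name();

  @SketchVersion(since = 2)
  String email();

  // Removed fields live on a nested history interface, so a parameterized
  // type such as List<String> is written as ordinary Java.
  interface History {
    @SketchVersion(since = 1, until = 2)
    List<String> nicknames();
  }
}

final class VersionSketch {
  /** Returns the sorted names of annotated members that exist at the given version. */
  static List<String> fieldsAt(int version, Class<?>... sources) {
    List<String> names = new ArrayList<>();
    for (Class<?> source : sources) {
      for (Method m : source.getDeclaredMethods()) {
        SketchVersion v = m.getAnnotation(SketchVersion.class);
        if (v != null && v.since() <= version && version < v.until()) {
          names.add(m.getName());
        }
      }
    }
    Collections.sort(names);
    return names;
  }
}
```

Under these assumptions, version 1 of `PersonSketch` consists of `name` and `nicknames`, and version 2 consists of `email` and `name`.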
…p codecs BinaryArrayEncoder.encode(T) and BinaryMapEncoder.encode(T) previously composed the hash-prefixed payload through MemoryUtils.buffer + writeInt64 + writeBytes + getBytes, allocating three byte[] copies and a MemoryBuffer per call. Build the result directly into a single byte[]: wrap it to write the 8-byte hash header, then System.arraycopy the body in. The non-evolution paths are unchanged.
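The single-allocation approach described above can be sketched with plain JDK classes. This is an illustration only: `HashPrefixSketch` is a hypothetical name, and the little-endian byte order is an assumption, not something this commit message states.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

final class HashPrefixSketch {
  static final int HASH_BYTES = 8;

  /** Builds [8-byte hash][body] directly in one result array. */
  static byte[] withHashPrefix(long schemaHash, byte[] body, int bodyOffset, int bodyLength) {
    byte[] out = new byte[HASH_BYTES + bodyLength];
    // Wrap the result array only to write the header; the byte order is an assumption.
    ByteBuffer.wrap(out).order(ByteOrder.LITTLE_ENDIAN).putLong(schemaHash);
    // Copy the already-encoded body in after the header.
    System.arraycopy(body, bodyOffset, out, HASH_BYTES, bodyLength);
    return out;
  }

  /** Reads the hash header back, using the same assumed byte order. */
  static long readHash(byte[] payload) {
    return ByteBuffer.wrap(payload).order(ByteOrder.LITTLE_ENDIAN).getLong(0);
  }
}
```

The only array allocated per call is the result itself; the `ByteBuffer` wrapper is a small view object over that array rather than a copy of it.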
Adds RowFormatAllocationProbe, a thread-allocation harness that measures per-encode allocations for the evolution-enabled array/map row codecs so the one-allocation-per-encode property can be checked directly. (The compact-row layout caching this commit originally introduced is now provided by upstream's CompactRowLayout; only the probe remains.)
The strict schema hash already recurses through StructType, so two payloads whose inner-struct shapes differ produce different outer hashes. The implementation gap was in SchemaHistory.build, which only enumerated the outer bean's own version boundaries: projection codecs for "outer V=K with inner V=L" weren't generated, so older inner shapes failed to deserialize even though the hash distinguished them.

Implementation:
- SchemaHistory.build now recurses into nested-bean fields whose type carries schema-evolution annotations, builds each inner's history, and cross-products over inner versions when enumerating outer versions. Each VersionedSchema now carries a map of (nested bean class -> chosen inner version) so the codec builder can wire the right inner projection codec.
- RowCodecBuilder.evolvingBuildForWriter emits one projection codec class per cross-product combination, using a per-nested-bean-type suffix map passed down through Encoding/RowEncoderBuilder. BaseBinaryEncoderBuilder exposes a `nestedBeanSuffix(TypeRef)` hook that the projection builder overrides to look up each nested bean's right suffix.
- Inner projection classes are generated recursively from nestedSuffixesFor(), so a deeply-nested versioned bean produces the required class tree at outer-build time. Class-count complexity is O(product of versions across nesting), but each projection class is small (decode-only) and only those reachable from the outer's enumeration are generated.

Regression test nestedInnerEvolution_readerInnerNewerThanWriter and the two-axis crossOuterAndInnerEvolution both pass. 138 tests in fory-format green.
…decs Array and map evolution paths were generating per-outer-version projection classes named with only the outer version suffix and instantiated without an inner-version routing map. When the element bean contained a versioned nested bean, multiple cross-product entries collided on the codegen cache: the projection always read inner beans at whichever version was compiled first. The row codec already did this correctly; lift its suffix and nested- suffix logic into a shared ProjectionRouting helper and reuse it from ArrayCodecBuilder and MapCodecBuilder. Add array/map regression tests that fail before the fix and pass after.
… codecs The existing row test (evolutionFlagAsymmetryFailsLoud) had no array or map equivalent. Add both. The evolution-on consumer reading evolution-off bytes direction is loud (ClassNotCompatibleException); the reverse direction is undefined per the wire format but must not silently return a structurally plausible value. Rename isVersionedBeanElement/Value to isBeanElement/Value with a doc comment, since the predicate is just isBean — calling it "versioned" suggested the unversioned-bean case was excluded.
…ures collapse bySignature.putIfAbsent could store a non-all-current cross-product combination under the signature that build() later marks as the writer-side current. The stored VS's nestedBeanVersions would then misreport at least one inner bean as living at a non-current version, violating the documented contract on current().nestedBeanVersions(). Reachable only if two combinations canonicalize to the same outer signature, which today's inner-bySignature collapse prevents, but the contract should not depend on that. Add a contract test that asserts the invariant for a deeply nested versioned bean.
@ForyVersion declares RECORD_COMPONENT as a valid target but no test exercised the record path. Add three cases in fory-latest-jdk-tests: a record with a String field added at v2, a record with the @ForySchema-removed-field History interface, and a record with a primitive int field added at v2 (verifying the 0 default).
Tighten the row-format schema-evolution doc to reflect the actual flag-mismatch behavior (loud in one direction, undefined in the reverse for array/map) and add a note that the projection codec class count grows as the product of per-bean version counts in a composition, with retiring history entries as the way to bound it.
30c5ba3 to
bc25986
Compare
added 4 commits
June 26, 2026 16:50
Three small edits in the row-format schema-evolution section: name all primitive defaults (0, 0.0, false), fold the "parameterized types are expressed naturally" assertion into the lead-in to the removed-field example, and drop the trailing sentence that restated what the example already showed.
- Guard array/map evolution decode against payloads smaller than the 8-byte schema-hash prefix, failing with ClassNotCompatibleException instead of feeding a negative size into pointTo.
- Remove the dead 5-arg loadOrGenProjectionRowCodecClass overload; all callers pass the nested-suffix map.
- Replace fully-qualified java.util.* and Schema references with imports.
- Add tests covering the new too-small-payload guards.
Adds SchemaEvolutionSuite under benchmarks/java: encode plus current-version and older-version (projection) decode benchmarks for evolution-enabled row codecs. Run with the JMH gc profiler (-prof gc) for repeatable per-op allocation numbers, including evidence that the projection decode path allocates no more than the current-schema path (each projection holds its historical schema's cached row layout). Replaces the earlier hand-rolled allocation probe main(), which measured only the non-evolution path and was never run by CI.
Formatting-only: google-java-format line wrapping across the schema-evolution files. No logic changes.
b1a051a to
ecb433c
Compare
added 5 commits
June 26, 2026 17:54
…te per class Carry the chosen inner VersionedSchema (with its strict hash) through the cross-product instead of a bare version number, so nested projection routing identifies the correct inner subtree to arbitrary depth. Enumerate one cross-product dimension per nested bean class rather than per field: a writer writes one definition of a class, so all fields of that class share a version on the wire. This makes deep nesting and same-class-in-two-fields correct, and makes the projection-class count a product over distinct nested classes rather than over fields.
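As a rough illustration of the one-dimension-per-class enumeration, the sketch below builds the cross-product of version choices over distinct nested bean classes. `CrossProductSketch` and the string class names are hypothetical, and plain version numbers stand in for the `VersionedSchema` objects the commit actually carries.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CrossProductSketch {
  /** Enumerates every assignment of one version to each distinct nested bean class. */
  static List<Map<String, Integer>> combinations(Map<String, List<Integer>> versionsByClass) {
    List<Map<String, Integer>> result = new ArrayList<>();
    result.add(new LinkedHashMap<>()); // start from the single empty assignment
    for (Map.Entry<String, List<Integer>> dimension : versionsByClass.entrySet()) {
      List<Map<String, Integer>> next = new ArrayList<>();
      for (Map<String, Integer> partial : result) {
        for (Integer version : dimension.getValue()) {
          Map<String, Integer> extended = new LinkedHashMap<>(partial);
          extended.put(dimension.getKey(), version);
          next.add(extended);
        }
      }
      result = next;
    }
    return result;
  }
}
```

With three versions of one nested class and two of another, this yields 3 × 2 = 6 combinations no matter how many fields use each class, which is the growth bound the commit describes.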
… path Add the size<8 lower-bound guard to BinaryRowEncoder.decode so a truncated row payload fails with ClassNotCompatibleException like the array and map paths already do, instead of computing a negative body size. Swap the runtime projection lookup maps (row/array/map) from Map<Long,_> to the primitive-keyed LongMap to drop per-decode Long boxing on the historical-version path; the current-schema hot path is unaffected. Narrow the catch in SchemaHistory.isBeanWithVersioning from Exception to RuntimeException with an accurate comment, and remove a dead null-check in RowEncoderBuilder. Add tests for the removed-field @ForyVersion validation messages.
ArrayEncoderBuilder and MapEncoderBuilder divided the elapsed nanos by 1_000_000 (milliseconds) but logged the value with a "us" unit, so the logged number was 1000x smaller than the true microsecond value. Divide by 1000 so the logged value is microseconds, matching the unit label and RowEncoderBuilder.
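For reference, a small sketch of the unit conversion. The commit divides by 1000 directly; this illustration uses the JDK's `TimeUnit` instead, which makes the unit explicit at the call site. `ElapsedSketch` is a hypothetical name.

```java
import java.util.concurrent.TimeUnit;

final class ElapsedSketch {
  /** Converts an elapsed System.nanoTime() interval to whole microseconds. */
  static long elapsedMicros(long startNanos, long endNanos) {
    // Equivalent to (endNanos - startNanos) / 1000 for non-negative intervals.
    return TimeUnit.NANOSECONDS.toMicros(endNanos - startNanos);
  }
}
```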
… path Add evolution-off PersonV2 codecs (standard + compact) and four *NoEvolution benchmarks so the suite measures the steady-state cost of withSchemaEvolution() when reading and writing current-version data, not only projection parity.

Bounded JMH run (JDK 26, 2 forks x 4 iters, -prof gc), B/op = gc.alloc.rate.norm:

currentDecode                     17.6M ops/s   312 B/op
currentDecodeNoEvolution          16.6M ops/s   312 B/op
encode                            15.9M ops/s   152 B/op
encodeNoEvolution                 15.8M ops/s   152 B/op
compactCurrentDecode              16.4M ops/s   280 B/op
compactCurrentDecodeNoEvolution   16.3M ops/s   280 B/op
compactEncode                     16.4M ops/s   144 B/op
compactEncodeNoEvolution          15.5M ops/s   144 B/op
olderDecode                       24.5M ops/s   216 B/op
compactOlderDecode                24.9M ops/s   192 B/op

Enabling evolution adds zero allocation on the current path (B/op identical on/off across all four paths); throughput differences are within the bounded run's noise band. Projection (older) decode is not penalized versus current decode; it allocates less here because it reads the narrower V1 schema.
SchemaHistory.isBeanWithVersioning probed every nested field's raw type with Descriptor.getDescriptors to find @ForyVersion descriptors. TypeInference.inferField, the real encode/decode path, routes collection/map/array/enum field types away from getDescriptors (they are classified before the isBean branch), so a collection subclass that shadows a field name across its hierarchy round-trips fine even though getDescriptors rejects it for duplicate fields. The unguarded probe threw IllegalArgumentException and broke SchemaHistory.build for such a bean. Gate getDescriptors behind TypeUtils.isBean, matching inferField's classification, so only genuine bean field types are introspected. A class that truly cannot be a bean still surfaces its error through isBean, which fails identically on the real path. Add a MemoryBuffer streaming round-trip test through a projection hit, covering the sizeEmbedded int32-prefix framing the byte[] tests skip, and a reproducer (versionedBeanWithShadowedCollectionFieldBuilds) for the shadowed-collection regression.
ca39fe8 to
aba4338
Compare
added 18 commits
June 26, 2026 19:30
Move inferNamedField out from between the inferField overloads in TypeInference so OverloadMethodsDeclarationOrder no longer fails the Code Style Check job.
SchemaHistory.build discovered nested versioned beans only at a field's raw type, so a versioned bean appearing as a List element or Map value was never found: the outer's cross-product carried no dimension for the inner bean, and its history was never enumerated. A reader whose inner bean had evolved then had no projection matching an older payload's inner layout, and decode threw ClassNotCompatibleException. findVersionedBean now looks through list/array element and map key/value type refs to locate the versioned bean, mirroring TypeInference's element handling (component type for arrays, getElementType for iterables) and keeping the collection-first classification that lets a shadowed-field collection subclass short-circuit before any Descriptor.getDescriptors probe. The cross-product is keyed by the discovered bean class, preserving the one-dimension-per-class invariant. substituteNestedStruct rebuilds the list/map field with the chosen historical struct in the bean's slot, leaving the wrapper and its nullability exactly as inferNamedField produced them, so existing direct-field schemas and hashes are unchanged. Add evolvingBeanInCollectionField covering an inner bean evolved across a List and a Map value read by a newer codec.
…11 safety ElementType.RECORD_COMPONENT is a JDK 16 enum constant. fory-format compiles with source/target=11 (no --release), so a modern build JDK accepts it but the class fails at runtime on JDK 11 when @ForyVersion's @Target is materialized. Record components stay covered by FIELD+METHOD: the compiler propagates a record-component annotation to the backing field and accessor, where SchemaHistory.lookupForyVersion already reads it.

Add a nested-versioned-record evolution test (RecordRowTest) covering the cross-product enumeration path with record-component naming, and fix the stale comment that claimed @ForyVersion targets RECORD_COMPONENT.

Hoist the duplicated schema-history build (the compact-format sort transform plus SchemaHistory.build) from the row/map/array codec builders into BaseCodecBuilder.buildSchemaHistory, and extract an evolvingCodec(Class) helper in the schema-evolution tests to remove repeated builder boilerplate. No wire or behavior change.
A live field still exists as a Java member, so a finite until silently dropped it from the current schema (until extends the version set, so latestVersion >= until excludes the field) and the writer stopped serializing a field the bean still has, with no error. collectLiveFields now rejects a finite until on a live field and points the user at the @ForySchema.removedFields history class, which is the only place a removal should be declared. Mirrors the existing until==MAX_VALUE guard in collectRemovedFields.
findVersionedBean inspected map keys, so a row field typed Map<KeyBean, V> added a key-version dimension to the cross-product and generated one projection codec class per key version. Map keys carry no per-payload hash and are always read with the current schema (see row-format.md), so those key-version projections are never dispatched: dead classes plus inflated cross-product growth. Restrict findVersionedBean and substituteNestedStruct to the map value, matching the wire format's only routable nested-map position. Add a row-field Map<versionedKey, V> evolution test that exercises this path.
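The wrapper descent can be sketched with `java.lang.reflect.Type` alone. This is not `findVersionedBean` itself: `LeafBeanSketch` and `TypeCapture` are hypothetical helpers, and the sketch only finds the leaf class reachable through list/array elements and map values, without checking whether that class is versioned.

```java
import java.lang.reflect.GenericArrayType;
import java.lang.reflect.ParameterizedType;
import java.lang.reflect.Type;
import java.util.Map;

// Hypothetical helper that captures a generic type argument for the examples.
abstract class TypeCapture<T> {
  Type type() {
    ParameterizedType self = (ParameterizedType) getClass().getGenericSuperclass();
    return self.getActualTypeArguments()[0];
  }
}

final class LeafBeanSketch {
  /** Descends list/array elements and map values (never map keys) to the leaf class. */
  static Class<?> leafClass(Type type) {
    if (type instanceof Class) {
      Class<?> c = (Class<?>) type;
      return c.isArray() ? leafClass(c.getComponentType()) : c;
    }
    if (type instanceof GenericArrayType) {
      return leafClass(((GenericArrayType) type).getGenericComponentType());
    }
    if (type instanceof ParameterizedType) {
      ParameterizedType p = (ParameterizedType) type;
      Class<?> raw = (Class<?>) p.getRawType();
      Type[] args = p.getActualTypeArguments();
      if (Map.class.isAssignableFrom(raw) && args.length == 2) {
        return leafClass(args[1]); // value position only; the key is skipped
      }
      if (Iterable.class.isAssignableFrom(raw) && args.length == 1) {
        return leafClass(args[0]);
      }
      return raw;
    }
    return Object.class; // wildcards and type variables are out of scope for this sketch
  }
}
```

For example, `leafClass(new TypeCapture<Map<Integer, List<String>>>() {}.type())` resolves to `String.class`: the key type contributes nothing, which mirrors the restriction to the map value described above.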
…g builder The evolution build path rotated this.schema to the history-derived current version, and build() relied on reading it back after buildForWriter(). A reused builder, or a direct buildForWriter() caller such as Encoders.bean, would then observe the rotated schema. Bundle the resolved schema with the per-writer factory (RowEncoderFactory) so build() creates its writer from the factory's schema and the build no longer mutates builder state. No behavior change.
ProjectionCodecFactory.instantiate rebuilt a RowFactory per encoder instance, though it depends only on the historical schema and codec format, both fixed at build time. Under the documented one-encoder-per-thread usage this recomputed K row factories per thread. Build it once in the factory constructor; instantiate() now only rebuilds the generated codec, which binds the per-instance writer. The Map and Array projection factories allocate per-instance BinaryArrayWriters that genuinely bind per-encoder buffers, so they have no analogous hoistable work.
…ion bug A top-level Map<structKeyBean, versionedValue> codec with schema evolution corrupts the key when the value is read at a non-current version: the value's version suffix is applied to the key bean too, and a same-class key/value share one bean codec keyed by type rather than position, so the key decodes with the value's historical layout. The fix is position-scoped bean-codec registration in the map codegen and must activate during the lazy genCode of the value subtree; it spans shared codegen, so it is tracked separately. The reproducer is disabled to keep the suite green while documenting the failure precisely.
…ojection A schema-evolution map codec whose value reads at a historical version corrupted a struct key. The projection codec applied the value's version suffix to every nested bean via the type-blind nestedBeanSuffix, and the bean-codec registration maps were keyed by typeRef. When the key and value share a class (the reader side is effectively Map<Bean,Bean> with the value historical and the key current), both collapsed to one registration entry, so the key reused the value's historical row codec and decoded a current key row with the wrong field count. Map keys carry no per-payload version hash and are always read at the current schema, so route the key position to the current, unsuffixed codec under a distinct registration key. BaseBinaryEncoderBuilder gains a beanCodecKey(TypeRef) indirection (default identity, so row/array codecs are unchanged) and keys its bean maps by it. MapEncoderBuilder overrides nestedBeanSuffix and beanCodecKey for the key position, gated by an inKeyPosition flag. The flag is scoped around both expression construction and genCode of the key subtree, because the encode ForEach registers nested beans eagerly in its constructor while the decode lazy array registers them during genCode. Enables the previously-disabled SchemaEvolutionStressTest#mapStructKeyValueEvolution.
…ersioned bean A top-level array or map codec only took the schema-evolution path when its element/value type was directly a bean, so Collection<List<Bean>> and Map<K, List<Bean>> (or Map<K, Map<.., Bean>>) silently skipped projection: the writer emitted no strict-hash prefix and the reader decoded older payloads at the current layout, corrupting reads. Route both top-level builders through the versioned bean reachable through the element/value wrapper. SchemaHistory.evolutionBean descends list/map/array wrappers and returns the bean at the leaf (versioned or not, so an unversioned bean still emits the prefix and stays wire-compatible); projectThroughWrapper rebuilds the historical element/value field with the wrapper preserved around the projected struct, the same substitution the row-field path already uses for a versioned bean nested in a collection field. The generated projection codec already reads the wrapper from the container type, so no codegen change is needed. Covers the array-codec variant of the same bug as well as the reported map case.
collectLiveFields already rejects any finite until on a live field, so the subsequent since >= until check could only fire for since == Integer.MAX_VALUE and was dead for any real annotation. The reachable ordering check remains on the removed-field path in collectRemovedFields, where a finite until is valid.
Document that schema-evolution decode selects a layout from the 8-byte strict hash, and that a payload whose hash coincides with one of the reader's historical layouts is decoded against it. This is the same hash-based dispatch the row format has always used; the note makes the accepted trade-off explicit.
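A minimal sketch of hash-based dispatch with the too-small-payload guard, using only JDK types. `HashDispatchSketch` is hypothetical; the standard exceptions stand in for ClassNotCompatibleException, a boxed `HashMap` stands in for the primitive-keyed map, and the little-endian header is an assumption.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

final class HashDispatchSketch<T> {
  private static final int HASH_BYTES = 8;

  private final long currentHash;
  private final Function<ByteBuffer, T> currentDecoder;
  private final Map<Long, Function<ByteBuffer, T>> projections = new HashMap<>();

  HashDispatchSketch(long currentHash, Function<ByteBuffer, T> currentDecoder) {
    this.currentHash = currentHash;
    this.currentDecoder = currentDecoder;
  }

  /** Registers a decoder for one historical layout, keyed by that layout's hash. */
  void addProjection(long historicalHash, Function<ByteBuffer, T> decoder) {
    projections.put(historicalHash, decoder);
  }

  T decode(byte[] payload) {
    if (payload.length < HASH_BYTES) {
      // Guard first, so a negative body size is never computed.
      throw new IllegalArgumentException("payload shorter than the 8-byte schema hash");
    }
    ByteBuffer buf = ByteBuffer.wrap(payload).order(ByteOrder.LITTLE_ENDIAN);
    long hash = buf.getLong();
    ByteBuffer body = buf.slice().order(ByteOrder.LITTLE_ENDIAN);
    if (hash == currentHash) {
      return currentDecoder.apply(body); // hot path: no map lookup
    }
    Function<ByteBuffer, T> projection = projections.get(hash);
    if (projection == null) {
      throw new IllegalStateException("no layout registered for schema hash " + hash);
    }
    // Any payload whose hash equals a registered historical hash is decoded here.
    return projection.apply(body);
  }
}
```

The last branch is the trade-off the note makes explicit: selection depends only on the 8-byte hash, so a coinciding hash is indistinguishable from a genuine historical payload.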
…row evolution Every existing added-field evolution test defaults a scalar; defaulting an added struct or collection slot is a distinct projection path that was untested. Add a v1->v2 case where v2 introduces a nested struct and a list of structs absent from the v1 wire, and assert both read back as null. Also correct the RowFactory Javadoc: the layout is computed once only for the compact format, which captures a CompactRowLayout in the factory; the default format builds a BinaryRow per call, matching BinaryRowWriter#newRow.
…when nested in evolving beans Two related fixes let interface beans work as map keys/values in the row codec, both for plain inference and schema evolution:
- TypeUtils.isSupported dropped the TypeResolutionContext when recursing into map key/value types, calling the context-less overload that resets synthesizeInterfaces to false. An interface bean was therefore rejected as a map key or value even though the same type is supported as a direct field or list element (which thread the context). Thread ctx into both map key and value recursions, matching the iterable branch. The error surfaced as "Unsupported type <Outer>" because the failed map field made isBean(Outer) return false.
- SchemaHistory.isBeanWithVersioning probed for a nested versioned bean with the context-less TypeUtils.isBean, so a nested interface bean was never recognized as versioned. Its older versions were not enumerated into the outer cross-product, and an older inner payload had no matching projection, so decode failed with a schema-hash mismatch. Use the same synthesize-interfaces context as inferField and evolutionBean.

Tests: ImplementInterfaceTest#testMapValueInterface covers the plain-row map-value case; SchemaEvolutionTest#evolvingInterfaceBeanNestedInOuterBean covers a versioned interface bean nested as a field, list element, and map value across an evolution boundary.
…BeanCodec() The decode-time IllegalStateException claimed the encoder "should have be added in serializeForBean()", but this branch moved nested-bean codec registration into registerBeanCodec(), which serializeForBean() and the decode-only projection path both call. On the projection path serializeForBean() never runs, so the old message points a debugger at the wrong method. Name registerBeanCodec() and fix the "be added" grammar.
…jection fields isAccessorOfAbsentField matched a leftover interface method to an absent field's descriptor by name and return type alone. A parameterized method sharing that name and return type (e.g. a getScore(int) overload of a since=2 getScore() field) was therefore silenced into a default-value body during projection instead of throwing, returning wrong data. Guard on parameterCount() == 0 since an accessor is always no-arg; the live-member pass only ever removes the no-arg signature. Also document why SchemaHistory.build needs no cycle guard: inferField's checkNoCycle, run from RowCodecBuilder's constructor before build(), already rejects self-referential beans, so the nested-bean recursion is unreachable for a cycle.
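The no-arg guard can be illustrated with plain reflection. This is a sketch, not `isAccessorOfAbsentField`: `AccessorMatchSketch` and `ScoreBean` are hypothetical, and it uses `java.lang.reflect.Method.getParameterCount()` rather than whichever descriptor type the real code inspects.

```java
import java.lang.reflect.Method;

// Hypothetical bean interface with an accessor and a same-named overload.
interface ScoreBean {
  int getScore();

  int getScore(int bonus);
}

final class AccessorMatchSketch {
  /** True only for a no-arg method whose name and return type match the field. */
  static boolean isAccessorOf(Method method, String fieldName, Class<?> fieldType) {
    return method.getParameterCount() == 0
        && method.getName().equals(fieldName)
        && method.getReturnType() == fieldType;
  }

  /** Counts how many declared methods of the type qualify as the field's accessor. */
  static int countAccessors(Class<?> type, String fieldName, Class<?> fieldType) {
    int count = 0;
    for (Method m : type.getDeclaredMethods()) {
      if (isAccessorOf(m, fieldName, fieldType)) {
        count++;
      }
    }
    return count;
  }
}
```

Without the parameter-count check both `getScore` overloads would match; with it, only the no-arg accessor does.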
…e framing Add a soft warn-log in BaseCodecBuilder.buildSchemaHistory when a bean resolves more than 256 historical schemas, since each becomes a generated projection codec class and the count grows as the product of per-class version counts across nested versioned beans. The count is read from the already-materialized history, so tracking adds only one comparison. Correct the decode(byte[]) comments in the row/array/map encoders: they claimed encode writes no prefix, which is misleading now that the schema hash leads the body (always for rows, under evolution for arrays/maps). Rename the array/map decode body-length local from payloadSize to bodySize per the codec read-identifier naming rule.
collectLiveFields and collectRemovedFields read ann.since() without a lower-bound check, so since=0 (or negative) silently injected a schema version no writer can emit, unlike every other malformed annotation which fails fast at build. Validate since >= FIRST_VERSION on both paths. Also point the nested-bean decode lookup miss at its real cause: a beanCodecKey() miss means the decode ran outside the key/value position scope that registered the codec, so name that in the message and comment the coupling at the choke point instead of the generic registerBeanCodec hint.
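Taken together with the earlier `until` checks, the annotation validation rules can be sketched as follows. `VersionValidationSketch` is hypothetical, the value `FIRST_VERSION = 1` is an assumption, and the messages are illustrative rather than the ones the builder emits.

```java
final class VersionValidationSketch {
  static final int FIRST_VERSION = 1; // assumed value of the first emittable version

  /** A live field still exists as a Java member, so it may not declare a removal. */
  static void checkLiveField(String name, int since, int until) {
    checkSince(name, since);
    if (until != Integer.MAX_VALUE) {
      throw new IllegalArgumentException(
          name + ": a live field cannot declare a finite until; declare the removal"
              + " on the removedFields history interface instead");
    }
  }

  /** A removed field needs a finite until that lies after its since. */
  static void checkRemovedField(String name, int since, int until) {
    checkSince(name, since);
    if (until == Integer.MAX_VALUE) {
      throw new IllegalArgumentException(name + ": a removed field must declare a finite until");
    }
    if (since >= until) {
      throw new IllegalArgumentException(name + ": since must be less than until");
    }
  }

  private static void checkSince(String name, int since) {
    if (since < FIRST_VERSION) {
      throw new IllegalArgumentException(name + ": since must be >= " + FIRST_VERSION);
    }
  }
}
```

Failing at build time for every malformed combination keeps a schema version that no writer can emit out of the history.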
Opt in with `.withSchemaEvolution()` on any row, array, or map codec builder. Fields carry `@ForyVersion(since, until)`; removed fields are listed on a nested interface referenced from `@ForySchema(removedFields = ...)`. Older payloads are dispatched at read time; nothing changes when the flag is off. Standard and compact formats supported.
Why?
Currently, changing a row-format schema definition in any way invalidates all existing records.
What does this PR do?
Propose a new concept of format versions. Each succeeding version may add or remove fields from types, and the deserialization machinery picks the version based on the schema hash.
AI Contribution Checklist
AI Usage Disclosure
yes, I included a completed AI Contribution Checklist in this PR description and the required AI Usage Disclosure.
yes, my PR description includes the required ai_review summary and screenshot evidence of the final clean AI review results from both fresh reviewers on the current PR diff or current HEAD after the latest code changes.
Does this PR introduce any user-facing change?
New codec option: schema evolution. Some small annotations and a builder method.
Existing row format compatibility unchanged
Benchmark
`withSchemaEvolution()` is an opt-in feature that adds a new row-codec path; it does not modify any existing serialization hot path.
There is no `apache/main` baseline to compare against, because `SchemaEvolutionSuite` exercises `withSchemaEvolution()`, which does not exist on `main`. The benchmark therefore measures two things directly: the steady-state cost of enabling the flag, and projection-vs-current parity.
Bounded JMH run (JDK 26, 2 forks × 4 iterations, 1s each, `-prof gc`). `B/op` is `gc.alloc.rate.norm` (bytes allocated per operation).
Benchmarks: currentDecode, currentDecodeNoEvolution, encode, encodeNoEvolution, compactCurrentDecode, compactCurrentDecodeNoEvolution, compactEncode, compactEncodeNoEvolution, olderDecode, compactOlderDecode
Findings:
`B/op` is byte-identical between the evolution-on and `*NoEvolution` variants on every path (decode 312/312, encode 152/152, compact decode 280/280, compact encode 144/144).
Throughput differences are within the bounded run's noise band: the run has ~10% confidence intervals on the no-evolution variants, so the throughput claim is "no measurable difference," not a tight bound; allocation is exact.
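Projection (older) decode is not penalized versus current decode. It allocates less (216 vs 312 standard, 192 vs 280 compact) because it reads the narrower V1 schema, not because projection is inherently cheaper. Each projection codec holds its historical schema's precomputed row layout, so there is no per-decode rebuild.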
Limitations
Writer and reader must agree on the `withSchemaEvolution()` flag; the two framings are not wire-compatible. A flag-mismatched peer fails loudly with `ClassNotCompatibleException` (except evolution-off reading evolution-on bytes, which is undefined). Adopt by enabling the flag on both sides in a release that changes no schema, then evolve schemas once every peer is on the new build.
C++) cannot read them.
A versioned bean used as a map key is always read at the current schema and is never dispatched to a projection codec (map keys carry no per-payload hash).
The projection codec class count grows as the product of the per-class version counts of the distinct nested versioned bean classes. Retire entries from a bean's `History` interface once you no longer need to read payloads from that range to bound the growth.