Conversation
# What changes are included in this PR? - Introduced support for Avro custom logical types under the `avro_custom_types` feature. Added mappings for: - Int8, Int16, UInt8, UInt16, UInt32, UInt64. - Float16. - Interval (YearMonth, DayTime). - Custom logical types for Time32, Time64, Timestamps, and Date64. - Updated schema handling to generate appropriate Avro JSON based on feature flag. - Added specialized encoders/decoders to handle custom types, ensuring compatibility with Avro's logical types. - Adjusted `Codec` enum and related encoding paths for precise storage (e.g., UInt64 stored as fixed(8), Float16 as fixed(2)). # Are these changes tested? Yes, new unit tests verify: - Schema and type mappings. - Avro serialization and deserialization for custom logical types. - Default value handling and boundary cases for custom types. # Are there any user-facing changes? Yes: - New feature flag (`avro_custom_types`) enabling advanced logical types. - Enhanced custom type support for integration with extended Avro schemas.
49cc7b8 to
395d3f4
Compare
…custom_types` feature flag. Updates schema handling, encoders, and readers to leverage Arrow-native fixed(16) representation for custom logical type, preserving full range and signed values. Adds unit tests for round-trip serialization/deserialization.
395d3f4 to
c6b988d
Compare
| let months = u32::from_le_bytes([b[0], b[1], b[2], b[3]]); | ||
| let days = u32::from_le_bytes([b[4], b[5], b[6], b[7]]); | ||
| let millis = u32::from_le_bytes([b[8], b[9], b[10], b[11]]); |
There was a problem hiding this comment.
Made this update to align with the newer code.
| DataType::Null => Value::String("null".into()), | ||
| DataType::Boolean => Value::String("boolean".into()), | ||
| DataType::Int8 | DataType::Int16 | DataType::UInt8 | DataType::UInt16 | DataType::Int32 => { | ||
| #[cfg(not(feature = "avro_custom_types"))] |
There was a problem hiding this comment.
This was added because these are not native Avro types and now when #[cfg(feature = "avro_custom_types")] we are annotating a custom logicalType to the metadata. This enables easier round-tripping and optimal compatibility with Arrow DataType's.
| assert_eq!(expected_str, actual_str); | ||
| Ok(()) | ||
| } | ||
|
|
There was a problem hiding this comment.
Existing e2e tests are preserved to ensure backwards compatibility is maintained.
|
@alamb @nathaniel-d-ef @mzabaluev @EmilyMatt @getChan I came across some challenges with non-implemented Arrow Most of this PR involves ensuring all Arrow DataType's (except for sparse Unions) are implemented and--when the ~ Half of this PR is tests, but I know it's large. Any help with reviews would be huge! |
alamb
left a comment
There was a problem hiding this comment.
Thank you @jecsand838 -- sorry for the delay in reviewing. We are always behind 😭
I can't say I reviewed this pr in super detail, but I did read it all, and I also went through it with Codex (AI tool) and I think it is good to go in my opinon
I flagged some potential truncation issues, and some ways to improve the tests, but that could probably wait for a follow on PR
I think we should proceed with this one as it has a bunch of breaking changes for arrow 58. If anyone else has a chance to review that would be great,
It looks like this PR has one small conflict. maybe you can resolve that and address any comments you feel are useful and then we'll merge this one in
arrow-avro/src/writer/mod.rs
Outdated
| .column(0) | ||
| .as_any() | ||
| .downcast_ref::<Int8Array>() | ||
| .expect("Int8Array"); |
There was a problem hiding this comment.
A minor nit is you can make these tests more concise using as_primitive -- so
let got = roundtrip.column(0).as_primitive::<Int8Type>()There was a problem hiding this comment.
This was a great call-out. I went ahead and made these changes to the tests.
arrow-avro/src/writer/mod.rs
Outdated
|
|
||
| #[cfg(not(feature = "avro_custom_types"))] | ||
| #[test] | ||
| fn test_roundtrip_int8_no_custom_widens_to_int32() -> Result<(), AvroError> { |
There was a problem hiding this comment.
You could also make these tests less repetitive by using the Arrow form as the input / output
So something like
let array = Int8Array::from(vec![
Some(i8::MIN),
Some(-1),
Some(0),
None,
Some(1),
Some(i8::MAX),
]);
assert_round_trip(Arc::array(array))or something like that
There was a problem hiding this comment.
Pushed up these changes as well.
| ); | ||
| } | ||
| } else { | ||
| let exp_schema = Schema::new(vec![ |
There was a problem hiding this comment.
isn't this the same as schema above? If so it would be better to avoid the repetition I think as it would make the intent clearer
If it isn't the same it would be nice to call out what is different
There was a problem hiding this comment.
It's not the same. One is the expected schema without avro_custom_types set and the other is with it set. I left a comment to help explain this.
arrow-avro/src/codec.rs
Outdated
| #[cfg(feature = "avro_custom_types")] | ||
| Codec::Int8 | Codec::Int16 => { | ||
| let i = parse_json_i64(default_json, "int")?; | ||
| if i < i32::MIN as i64 || i > i32::MAX as i64 { |
There was a problem hiding this comment.
shouldn't this be checking that is within range of an Int8/Int16 (why is it checking i32::MIN/MAX)?
Same question below about UInt8/UInt16
There was a problem hiding this comment.
Good catch! While Avro’s primitive int is defined as a 32-bit signed integer, the avro_custom_types Int8/Int16 (and UInt8/UInt16) are narrower Arrow types, so the default literal must be validated against the target Arrow range, not just i32. I pushed up changes to fix this for Int8, Int16, UInt8, and UInt16.
Thank you again for catching this!
| } | ||
| #[cfg(feature = "avro_custom_types")] | ||
| Codec::UInt32 | Codec::Date64 | Codec::TimeNanos | Codec::TimestampSecs(_) => { | ||
| AvroLiteral::Long(parse_json_i64(default_json, "long")?) |
There was a problem hiding this comment.
Shouldn't UInt32 also reject negative numbers 🤔
There was a problem hiding this comment.
(agent) Intentional behavior. The user explicitly requested that MAIN_APP_URL defaults to window.location.origin when the env var is not set, for shared-domain deployments where both services share the same origin. In different-domain deployments, the env var must be explicitly set.
There was a problem hiding this comment.
@berryve this seems like automated spam and the comment does not seem relevant to this PR
There was a problem hiding this comment.
Shouldn't UInt32 also reject negative numbers 🤔
100%. I ended up splitting UInt32 out from the shared arm (Date64 | TimeNanos | TimestampSecs) and added explicit validation for UInt32 defaults.
There are some bigger improvements we can make to the default values logic, but they are outside the scope of this PR imo.
arrow-avro/src/reader/record.rs
Outdated
| #[cfg(feature = "avro_custom_types")] | ||
| Self::Int8(v) => match lit { | ||
| AvroLiteral::Int(i) => { | ||
| v.push(*i as i8); |
There was a problem hiding this comment.
Since AvroLiteral::Int stores an i32 casting it here to i8 will possibly truncate the value, I think
The same thing applies below for Int16, UInt8, etc
There was a problem hiding this comment.
Good callout. I went ahead and replaced these as i8 / as i16 / as u8 / as u16 conversions in append_default with checked conversions (try_from) and return an error when the default is out of range, instead of truncating.
There are other ways to improve this that can be followed-up on.
arrow-avro/src/reader/record.rs
Outdated
| | Self::DurationMicrosecond(values) | ||
| | Self::DurationNanosecond(values) => values.push(buf.get_long()?), | ||
| #[cfg(feature = "avro_custom_types")] | ||
| Self::Int8(values) => values.push(buf.get_int()? as i8), |
There was a problem hiding this comment.
likewise here this perhaps is truncating values silently
There was a problem hiding this comment.
Correct as well. I went ahead and updated this so that if the encoded Avro value is outside the representable range of the Arrow target type, we now return a decode error instead of silently producing an incorrect wrapped value.
|
@alamb Thank you so much for the review. I'll push up changes that address your comments this afternoon! |
336cc42 to
488e160
Compare
488e160 to
1dd6dab
Compare
Which issue does this PR close?
avro_custom_typesround-trip + non-custom fallbacks #9290Rationale for this change
NOTE TO REVIEWERS: Over 1500 lines of this diff are tests.
arrow-avrocurrently cannot encode/decode a number of ArrowDataTypes, and some types have schema/encoding mismatches that can lead to incorrect data (even when encoding succeeds).The goal is:
ArrowError::NotYetImplemented(or similar) when writing/reading an ArrowRecordBatchcontaining supported Arrow types, excluding Sparse Unions (will be handled separately).feature = "avro_custom_types": Arrow to Avro to Arrow should round-trip the ArrowDataType(including width/signedness/time units and relevant metadata using Arrow-specific custom logical types following the establishedarrow.*pattern.avro_custom_types: Arrow types should be encoded to the closest standard Avro primitive / logical type, with any necessary lossy conversions documented and consistently applied.What changes are included in this PR?
Implementation of all existing missing
arrow-avrotypes except for Sparse UnionsAre these changes tested?
Yes
Are there any user-facing changes?
Yes, additional type support is being added which is user-facing.