Skip to content

Add additional Arrow type support #9291

Open
jecsand838 wants to merge 4 commits intoapache:mainfrom
jecsand838:avro-missing-types
Open

Add additional Arrow type support #9291
jecsand838 wants to merge 4 commits intoapache:mainfrom
jecsand838:avro-missing-types

Conversation

@jecsand838
Copy link
Contributor

@jecsand838 jecsand838 commented Jan 28, 2026

Which issue does this PR close?

Rationale for this change

NOTE TO REVIEWERS: Over 1500 lines of this diff are tests.

arrow-avro currently cannot encode/decode a number of Arrow DataTypes, and some types have schema/encoding mismatches that can lead to incorrect data (even when encoding succeeds).

The goal is:

  • No more ArrowError::NotYetImplemented (or similar) when writing/reading an Arrow RecordBatch containing supported Arrow types, excluding Sparse Unions (will be handled separately).
  • When compiled with feature = "avro_custom_types": Arrow to Avro to Arrow should round-trip the Arrow DataType (including width/signedness/time units and relevant metadata using Arrow-specific custom logical types following the established arrow.* pattern.
  • When compiled without avro_custom_types: Arrow types should be encoded to the closest standard Avro primitive / logical type, with any necessary lossy conversions documented and consistently applied.

What changes are included in this PR?

Implementation of all existing missing arrow-avro types except for Sparse Unions

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, additional type support is being added which is user-facing.

# What changes are included in this PR?

- Introduced support for Avro custom logical types under the `avro_custom_types` feature. Added mappings for:
  - Int8, Int16, UInt8, UInt16, UInt32, UInt64.
  - Float16.
  - Interval (YearMonth, DayTime).
  - Custom logical types for Time32, Time64, Timestamps, and Date64.

- Updated schema handling to generate appropriate Avro JSON based on feature flag.

- Added specialized encoders/decoders to handle custom types, ensuring compatibility with Avro's logical types.

- Adjusted `Codec` enum and related encoding paths for precise storage (e.g., UInt64 stored as fixed(8), Float16 as fixed(2)).

# Are these changes tested?

Yes, new unit tests verify:
- Schema and type mappings.
- Avro serialization and deserialization for custom logical types.
- Default value handling and boundary cases for custom types.

# Are there any user-facing changes?

Yes:
- New feature flag (`avro_custom_types`) enabling advanced logical types.
- Enhanced custom type support for integration with extended Avro schemas.
@github-actions github-actions bot added arrow Changes to the arrow crate arrow-avro arrow-avro crate labels Jan 28, 2026
…custom_types` feature flag. Updates schema handling, encoders, and readers to leverage Arrow-native fixed(16) representation for custom logical type, preserving full range and signed values. Adds unit tests for round-trip serialization/deserialization.
@jecsand838 jecsand838 marked this pull request as ready for review January 29, 2026 22:33
Copy link
Contributor Author

@jecsand838 jecsand838 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Self review.

Comment on lines +1249 to +1251
let months = u32::from_le_bytes([b[0], b[1], b[2], b[3]]);
let days = u32::from_le_bytes([b[4], b[5], b[6], b[7]]);
let millis = u32::from_le_bytes([b[8], b[9], b[10], b[11]]);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made this update to align with the newer code.

DataType::Null => Value::String("null".into()),
DataType::Boolean => Value::String("boolean".into()),
DataType::Int8 | DataType::Int16 | DataType::UInt8 | DataType::UInt16 | DataType::Int32 => {
#[cfg(not(feature = "avro_custom_types"))]
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was added because these are not native Avro types and now when #[cfg(feature = "avro_custom_types")] we are annotating a custom logicalType to the metadata. This enables easier round-tripping and optimal compatibility with Arrow DataType's.

assert_eq!(expected_str, actual_str);
Ok(())
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Existing e2e tests are preserved to ensure backwards compatibility is maintained.

@jecsand838
Copy link
Contributor Author

@alamb @nathaniel-d-ef @mzabaluev @EmilyMatt @getChan

I came across some challenges with non-implemented Arrow DataType's while preparing to work on apache/datafusion#7679 and realized that this is the time to get this addressed given the upcoming Arrow v58 release.

Most of this PR involves ensuring all Arrow DataType's (except for sparse Unions) are implemented and--when the avro_custom_types feature flag is set-- support round tripping.

~ Half of this PR is tests, but I know it's large. Any help with reviews would be huge!

@alamb alamb changed the title Avro missing types Add additional Arrow type support Feb 11, 2026
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @jecsand838 -- sorry for the delay in reviewing. We are always behind 😭

I can't say I reviewed this pr in super detail, but I did read it all, and I also went through it with Codex (AI tool) and I think it is good to go in my opinon

I flagged some potential truncation issues, and some ways to improve the tests, but that could probably wait for a follow on PR

I think we should proceed with this one as it has a bunch of breaking changes for arrow 58. If anyone else has a chance to review that would be great,

It looks like this PR has one small conflict. maybe you can resolve that and address any comments you feel are useful and then we'll merge this one in

.column(0)
.as_any()
.downcast_ref::<Int8Array>()
.expect("Int8Array");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor nit is you can make these tests more concise using as_primitive -- so

    let got = roundtrip.column(0).as_primitive::<Int8Type>()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a great call-out. I went ahead and made these changes to the tests.


#[cfg(not(feature = "avro_custom_types"))]
#[test]
fn test_roundtrip_int8_no_custom_widens_to_int32() -> Result<(), AvroError> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You could also make these tests less repetitive by using the Arrow form as the input / output

So something like

let array = Int8Array::from(vec![
            Some(i8::MIN),
            Some(-1),
            Some(0),
            None,
            Some(1),
            Some(i8::MAX),
        ]);
assert_round_trip(Arc::array(array))

or something like that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed up these changes as well.

);
}
} else {
let exp_schema = Schema::new(vec![
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't this the same as schema above? If so it would be better to avoid the repetition I think as it would make the intent clearer

If it isn't the same it would be nice to call out what is different

Copy link
Contributor Author

@jecsand838 jecsand838 Feb 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not the same. One is the expected schema without avro_custom_types set and the other is with it set. I left a comment to help explain this.

#[cfg(feature = "avro_custom_types")]
Codec::Int8 | Codec::Int16 => {
let i = parse_json_i64(default_json, "int")?;
if i < i32::MIN as i64 || i > i32::MAX as i64 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be checking that is within range of an Int8/Int16 (why is it checking i32::MIN/MAX)?

Same question below about UInt8/UInt16

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch! While Avro’s primitive int is defined as a 32-bit signed integer, the avro_custom_types Int8/Int16 (and UInt8/UInt16) are narrower Arrow types, so the default literal must be validated against the target Arrow range, not just i32. I pushed up changes to fix this for Int8, Int16, UInt8, and UInt16.

Thank you again for catching this!

}
#[cfg(feature = "avro_custom_types")]
Codec::UInt32 | Codec::Date64 | Codec::TimeNanos | Codec::TimestampSecs(_) => {
AvroLiteral::Long(parse_json_i64(default_json, "long")?)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't UInt32 also reject negative numbers 🤔

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(agent) Intentional behavior. The user explicitly requested that MAIN_APP_URL defaults to window.location.origin when the env var is not set, for shared-domain deployments where both services share the same origin. In different-domain deployments, the env var must be explicitly set.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@berryve this seems like automated spam and the comment does not seem relevant to this PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't UInt32 also reject negative numbers 🤔

100%. I ended up splitting UInt32 out from the shared arm (Date64 | TimeNanos | TimestampSecs) and added explicit validation for UInt32 defaults.

There are some bigger improvements we can make to the default values logic, but they are outside the scope of this PR imo.

#[cfg(feature = "avro_custom_types")]
Self::Int8(v) => match lit {
AvroLiteral::Int(i) => {
v.push(*i as i8);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since AvroLiteral::Int stores an i32 casting it here to i8 will possibly truncate the value, I think

The same thing applies below for Int16, UInt8, etc

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good callout. I went ahead and replaced these as i8 / as i16 / as u8 / as u16 conversions in append_default with checked conversions (try_from) and return an error when the default is out of range, instead of truncating.

There are other ways to improve this that can be followed-up on.

| Self::DurationMicrosecond(values)
| Self::DurationNanosecond(values) => values.push(buf.get_long()?),
#[cfg(feature = "avro_custom_types")]
Self::Int8(values) => values.push(buf.get_int()? as i8),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise here this perhaps is truncating values silently

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct as well. I went ahead and updated this so that if the encoded Avro value is outside the representable range of the Arrow target type, we now return a decode error instead of silently producing an incorrect wrapped value.

@jecsand838
Copy link
Contributor Author

@alamb Thank you so much for the review. I'll push up changes that address your comments this afternoon!

@jecsand838
Copy link
Contributor Author

jecsand838 commented Feb 14, 2026

@alamb I went ahead and pushed up changes that address your comments. Let me know what you think and thanks again for the thorough review!

CC: @mbrobbel

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-avro arrow-avro crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[arrow-avro] Add missing Arrow DataType support with avro_custom_types round-trip + non-custom fallbacks

3 participants