Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
148 changes: 146 additions & 2 deletions arrow-array/src/array/mod.rs
Original file line number Diff line number Diff line change
Expand Up @@ -24,6 +24,7 @@ use arrow_buffer::{ArrowNativeType, Buffer, NullBuffer, OffsetBuffer, ScalarBuff
use arrow_data::ArrayData;
use arrow_schema::{DataType, IntervalUnit, TimeUnit};
use std::any::Any;
use std::borrow::Cow;
use std::sync::Arc;

pub use binary_array::*;
Expand Down Expand Up @@ -87,9 +88,13 @@ use crate::iterator::ArrayIter;
/// translate into panics or undefined behavior. For example, a value computed based on `len`
/// may be used as a direct index into memory regions without checks.
///
/// Additionally, implementations must ensure that implementations details of `Array::as_any()`,
/// `Array::data_type()`, `Array::into_data()` and [`downcast_array_ref`] are consistent with each
/// other -- and those details could change at any time.
///
/// Use it at your own risk knowing that this trait might be sealed in the future.
pub unsafe trait Array: std::fmt::Debug + Send + Sync {
/// Returns the array as [`Any`] so that it can be
/// Returns the underlying concrete array instance as [`Any`] so that it can be
/// downcasted to a specific implementation.
///
/// # Example:
Expand All @@ -113,6 +118,64 @@ pub unsafe trait Array: std::fmt::Debug + Send + Sync {
/// ```
fn as_any(&self) -> &dyn Any;

/// Attempts to recover an [`ArrayRef`] from `&dyn Array`, returning `None` if not Arc-backed.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this API makes sense to me as a user of the library, for what it is worth, and would save some allocations rather than going to/from ArrayData

///
/// This is useful when you have a `&dyn Array` that came from an [`ArrayRef`] and need to
/// recover the Arc. Automatically unwraps nested `Arc<dyn Array>` to the innermost Arc.
///
/// ```
/// # use std::sync::Arc;
/// # use arrow_array::{Array, ArrayRef, Int32Array};
/// let arc: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
///
/// // When called on ArrayRef directly, returns innermost Arc
/// let recovered = arc.as_array_ref_opt().unwrap();
/// assert!(Arc::ptr_eq(&arc, &recovered));
///
/// // Works when trait object references the Arc
/// let array_ref: &dyn Array = &arc;
/// assert!(array_ref.as_array_ref_opt().is_some());
///
/// // Returns None when trait object doesn't reference an Arc
/// let array_ref: &dyn Array = &*arc;
/// assert!(array_ref.as_array_ref_opt().is_none());
///
/// // Automatically unwraps nested Arcs
/// let nested: ArrayRef = Arc::new(arc.clone());
/// let innermost = nested.as_array_ref_opt().unwrap();
/// assert!(Arc::ptr_eq(&innermost, &arc));
/// ```
fn as_array_ref_opt(&self) -> Option<ArrayRef> {
None
}

/// Returns an [`ArrayRef`] for this array, cloning the array data if necessary.
///
/// If self is already an Arc (i.e., [`as_array_ref_opt`] returns `Some`), this is a cheap
/// Arc clone. Otherwise, it clones the underlying array data to create a new Arc-backed array.
///
/// # Example
///
/// ```
/// # use std::sync::Arc;
/// # use arrow_array::{Array, ArrayRef, Int32Array};
/// // Arc-backed array: cheap Arc clone
/// let arc: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
/// let same_arc = arc.to_array_ref();
/// assert!(Arc::ptr_eq(&arc, &same_arc));
///
/// // Stack-allocated array: creates a new Arc-backed array
/// let array = Int32Array::from(vec![1, 2, 3]);
/// let new_arc = array.to_array_ref();
/// assert_eq!(new_arc.len(), 3);
/// ```
///
/// [`as_array_ref_opt`]: Array::as_array_ref_opt
fn to_array_ref(&self) -> ArrayRef {
self.as_array_ref_opt()
.unwrap_or_else(|| make_array(self.to_data()))
}

/// Returns the underlying data of this array
fn to_data(&self) -> ArrayData;

Expand Down Expand Up @@ -350,12 +413,19 @@ pub unsafe trait Array: std::fmt::Debug + Send + Sync {
/// A reference-counted reference to a generic `Array`
pub type ArrayRef = Arc<dyn Array>;

/// Ergonomics: Allow use of an ArrayRef as an `&dyn Array`
/// Ergonomics: Allow use of an ArrayRef as an `&dyn Array`.
/// Use [`Array::as_array_ref_opt()`] to recover the original [`ArrayRef`] and
/// [`downcast_array_ref()`] to downcast to an Arc of some concrete array type.
unsafe impl Array for ArrayRef {
fn as_any(&self) -> &dyn Any {
self.as_ref().as_any()
}

fn as_array_ref_opt(&self) -> Option<ArrayRef> {
// Recursively unwrap nested Arcs to find the deepest one
Some(innermost_array_ref(Cow::Borrowed(self)))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am probably confused, but why not just Arc::clone(self)? This is impl ... for ArrayRef which is Arc<dyn Array>

So technically speaking as_array_ref_opt would be recovering the ArrayRef

Are there usecases when we have wrapped multiple levels of Arrays? I think this (code tries to handle the case of Arc<Arc<PrimitiveArary>> or something, which I don't think is common 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe, yes we absolutely do have Arc<Arc<...>> and there are even unit tests verifying they behave correctly.
That was a surprise to me when I started exploring the possibility to have Array: Any.

So this code might arc-clone self... but not if self harbors a deeper ArrayRef inside.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree it is possible, but why on earth would you do such a thing? Tbh I'm not sure why Array is implemented for Arc<dyn Array>...

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO both impl Array for Arc<dyn Array> and impl<T: Array> Array for &T were questionable (design?) decisions back whenever they showed up.

I expect it was intended to be ergonomic?

  • The former allows passing &ArrayRef directly to functions that expect &dyn Array, without needing to litter the code with explicit as_ref() calls.
  • The latter allows code that expects impl AsRef<dyn Array> to work with both owned and borrowed operands.

Part of me would love to just remove both of those, but that would definitely be a massive breaking change. As in, massive number of compilation failures (even if easily fixed).

Even within arrow-rs itself, the whole ArrayAccessor and ArrayIterator abstraction would break if we took away the blanket impl for &T, because ArrayAccessor::Item is owned for some array types (primitive, boolean, etc) and borrowed for others (GenericByteArray). I spent a few minutes trying to untangle it, without success.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried removing impl Array for Arc<dyn Array> and the biggest problem is that Scalar<ArrayRef> stops working. It's not immediately clear whether/how to fix that. In particular, there's no obvious replacement for Scalar::new(new_null_array(...)).

Other smaller messes include:

  • Quite a few call sites that call Arc::new(<expression returning ArrayRef>) as ArrayRef (there's that wrapping!)
  • A surprising number of calls to <ArrayRef as Array>::shrink_to_fit() (fake-mutable, no-op)
  • Lots of calls to <ArrayRef as Array>::into_data() (fake-mutable, equivalent to Array::to_data)
  • A lot of code that has an ArrayRef or &ArrayRef, and which lacks the necessary .as_ref() call. Mostly unit tests.
  • Missing impl Datum for ArrayRef

Just getting arrow-rs to compile without it was an impressive amount of churn:

 82 files changed, 883 insertions(+), 732 deletions(-)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If nothing else, we can at least clean up call sites that are needlessly creating Arc<Arc<dyn Array>>:

}

fn to_data(&self) -> ArrayData {
self.as_ref().to_data()
}
Expand Down Expand Up @@ -435,6 +505,10 @@ unsafe impl<T: Array> Array for &T {
T::as_any(self)
}

fn as_array_ref_opt(&self) -> Option<ArrayRef> {
T::as_array_ref_opt(self)
}

fn to_data(&self) -> ArrayData {
T::to_data(self)
}
Expand Down Expand Up @@ -500,6 +574,76 @@ unsafe impl<T: Array> Array for &T {
}
}

/// Returns the innermost `ArrayRef` from potentially nested `Arc<dyn Array>` wrappers.
fn innermost_array_ref(array: Cow<'_, ArrayRef>) -> ArrayRef {
// Peel away the Cow and the Arc, then recurse on the referent.
(**array)
.as_array_ref_opt()
.unwrap_or_else(|| array.into_owned())
}

/// Downcasts an [`ArrayRef`] to `Arc<T>`, or returns the original on type mismatch.
///
/// This provides zero-cost downcasting when the dynamic type matches, avoiding cloning
/// through [`ArrayData`]. Automatically handles nested `Arc<dyn Array>` wrappers.
///
/// If the Arc contains nested `Arc<dyn Array>`, this method unwraps the innermost Arc.
///
/// ```
/// # use std::sync::Arc;
/// # use arrow_array::{Array, ArrayRef, Int32Array, StringArray};
/// # use arrow_array::downcast_array_ref;
/// let array: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
///
/// let typed: Arc<Int32Array> = downcast_array_ref(array.clone()).unwrap();
/// assert_eq!(typed.len(), 3);
///
/// assert!(downcast_array_ref::<StringArray>(array).is_err());
/// ```
pub fn downcast_array_ref<T: Array + 'static>(array: ArrayRef) -> Result<Arc<T>, ArrayRef> {
// SAFETY STRATEGY:
//
// This function performs two checks to ensure it's safe to reinterpret
// Arc<dyn Array> as Arc<T>:
//
// 1. Type verification via Any::downcast_ref ensures the dynamic type is T
// 2. Pointer provenance check ensures the Arc was formed by unsized coercion
// from Arc<T>, not through a wrapper like Arc<&dyn Array> or Arc<Box<dyn Array>>
//
// NOTE: The second check is unsound if `Array::as_any()` returns a reference to any field of
// (any sub-object of) `self`. All canonical array types either return &self or dereference an
// internal ArrayRef, both of which satisfy the safety requirements of `Arc::from_raw`.
//
// Only if both checks pass do we reconstruct Arc<T> from the same pointer
// that Arc::into_raw gave us, which is safe because:
// - Type correctness is guaranteed by check #1
// - Same allocation/pointer is guaranteed by check #2
// - Arc::from_raw is only called once on this pointer

// Unwrap nested Arc<dyn Array> if present (zero clones if not nested)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this is code I am not qualified to review.

I did some more googling, and it seems like we should be able to do soemthing similar using a trick like this:

https://stackoverflow.com/questions/76222743/upcasting-an-arcdyn-trait-to-an-arcdyn-any

However, I couldn't make the trait impls all line up 🤔

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand correctly, the SO question is trying to go from Arc<dyn Foo> to Arc<dyn Any>. The two solutions are:

  • use trait upcasting
  • define an extension trait with a helper method that receives Arc<Self>

The former requires Foo: Any + 'static as well as the unstable trait-upcasting feature. Even if the feature stabilizes, we still have the problem that Array does not satisfy Any + 'static.

The latter I thought would make the extension trait not dyn-compatible... but the code in that accepted SO answer seems to compile... let me look into that and get back to you.

Copy link
Copy Markdown
Contributor Author

@scovich scovich Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah -- apparently the 'static constraint allows Arc<Self> methods to be dyn-compatible (TIL!).

Unfortunately, that would impose a transitive requirement of Array: 'static which is incompatible with a bunch of existing code, including the existing impl<T> Array for &T, trait ArrayAccessor: Array, and any code that does anything with a &'a dyn Array (which is ~all of them, if you account for anonymous/erased lifetimes).

let array = innermost_array_ref(Cow::Owned(array));

// Check 1: The underlying concrete type matches the requested type
let Some(concrete_ref) = array.as_any().downcast_ref::<T>() else {
return Err(array);
};

// Check #2: Detect wrapper types like Arc<&dyn Array>
// Note: This cannot detect all wrapper violations (e.g., fields at offset 0).
// Soundness relies on Array's safety contract forbidding as_any() field projection.
if !std::ptr::addr_eq(concrete_ref, array.as_ref()) {
// Pointer mismatch indicates a wrapper (e.g., Arc<&dyn Array>)
return Err(array);
}

// Both checks passed - safe to reconstruct Arc<T>
let ptr = Arc::into_raw(array).cast();

// SAFETY: The Arc owns a T at the same address as the trait object's data pointer, so we can
// reinterpret the Arc's dyn Array pointer to point to T instead. See above.
Ok(unsafe { Arc::from_raw(ptr) })
}

/// A generic trait for accessing the values of an [`Array`]
///
/// This trait helps write specialized implementations of algorithms for
Expand Down
Loading