Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 13 additions & 1 deletion python/pyarrow/src/arrow/python/python_to_arrow.cc
Original file line number Diff line number Diff line change
Expand Up @@ -908,8 +908,20 @@ class PyListConverter : public ListConverter<T, PyConverter, PyConverterTrait> {

Status AppendNdarray(PyObject* value) {
PyArrayObject* ndarray = reinterpret_cast<PyArrayObject*>(value);
OwnedRef flattened;
if (PyArray_NDIM(ndarray) != 1) {
Comment thread
aboderinsamuel marked this conversation as resolved.
return Status::Invalid("Can only convert 1-dimensional array values");
// GH-49644: a fixed-size list (e.g. the storage of a fixed-shape tensor)
// can be built from a multi-dimensional array by flattening it in C
// order. The total number of elements must still match the list size,
// which the builder validates below. 0-dimensional arrays and
// variable-sized lists remain restricted to 1-dimensional values.
Comment on lines +913 to +917

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be less verbose.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay sir 👍

if (PyArray_NDIM(ndarray) < 2 || this->list_type_->id() != Type::FIXED_SIZE_LIST) {
return Status::Invalid("Can only convert 1-dimensional array values");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably say something like:

Suggested change
return Status::Invalid("Can only convert 1-dimensional array values");
return Status::Invalid("Can only convert 1-dimensional array values to variable sized lists array");

}
flattened.reset(PyArray_Ravel(ndarray, NPY_CORDER));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you decided to use PyArray_Ravel? Are there any other alternatives?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyArray_Ravel(ndarray, NPY_CORDER) returns a 1-D, C-order view, which lets me reuse the existing 1-D append paths unchanged (the typed fast paths and the PySequence fallback). It's zero-copy when the input is already C-contiguous and only copies for non-contiguous/Fortran-order input, where a copy is unavoidable to get C-order data anyway, which also matches the fixed-shape-tensor's row-major storage.

Alternatives: PyArray_Flatten(NPY_CORDER) gives the same result but always copies; a manual NpyIter/buffer loop could avoid even the view allocation but would mean duplicating the typed-append fast paths. Ravel was the smallest change with the best reuse/performance balance.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyArray_Ravel over PyArray_Flatten seems ok. If I read the context correctly we're using a builder to append these arrays. When builder finishes it creates a new buffer with a copy of concatenated arrays. So PyArray_Ravel should be ok if the list of arrays is not deleted before appending is complete, which I expect won't happen. @AlenkaF am I reading this right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you have a look at PyArray_CheckFromAny with flag NPY_ARRAY_C_CONTIGUOUS? If I understand correctly it reshapes if necessary or gives access to the data/view directly. Then we can use PyArray_DATA as in other places in the codebase.

I am not sure which would fit better but am interested in what you think.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also to note here - PyArray_Ravel seems to default to C order. So we should document somewhere that no matter the input arrays order we will return C order. (@AlenkaF please doublecheck me here :) )

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please look into @AlenkaF proposal here, it might be better to have the explicit approach here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlenkaF PyArray_Ravel returns a 1-D C-order view so I could reuse the existing 1-D append paths, but per the discussion I'll switch to the explicit PyArray_CheckFromAny + NPY_ARRAY_C_CONTIGUOUS approach and read PyArray_DATA directly, it's more idiomatic here and lets me handle byte order in the same call.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rok Will do, switching to the explicit PyArray_CheckFromAny + NPY_ARRAY_C_CONTIGUOUS approach and reading PyArray_DATA directly, matching the existing pattern. That should also fold in the byte-order handling.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, I think having more control can be helpful. Thanks for looking into it!
Also, some other flags might be needed, see: https://numpy.org/doc/stable/reference/c-api/array.html#basic-array-flags

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you 🙏, i'm currently taking a look at the reference

RETURN_IF_PYERROR();
value = flattened.obj();
ndarray = reinterpret_cast<PyArrayObject*>(value);
}
if (PyArray_ISBYTESWAPPED(ndarray)) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how PyArray_Ravel behaves in big-endian situations? I imagine it would return the correct result but we can't really test anyway. Output would likely be PyArray_ISBYTESWAPPED(ndarray) == true so this would fail and the computation above would be done in vain.
@aboderinsamuel please move the if (PyArray_ISBYTESWAPPED(ndarray)) block above the if (PyArray_NDIM(ndarray) != 1) block.

// TODO
Expand Down
30 changes: 30 additions & 0 deletions python/pyarrow/tests/test_array.py
Original file line number Diff line number Diff line change
Expand Up @@ -2924,6 +2924,36 @@ def test_array_from_invalid_dim_raises():
pa.array(arr0d)


@pytest.mark.numpy
def test_fixed_size_list_from_multidim_ndarray():
# GH-49644: a fixed-size list can be built from multi-dimensional ndarray
# elements by flattening them in C order.
arr = pa.array([np.array([[1, 2, 3]], dtype=np.int64),
np.array([[4, 5, 6]], dtype=np.int64)],
type=pa.list_(pa.int64(), 3))
assert arr.type == pa.list_(pa.int64(), 3)
assert arr.to_pylist() == [[1, 2, 3], [4, 5, 6]]
Comment thread
aboderinsamuel marked this conversation as resolved.

# A non-trivial 2D shape confirms values are flattened in C (row-major) order
arr = pa.array([np.array([[1, 2], [3, 4]], dtype=np.int64)],
type=pa.list_(pa.int64(), 4))
assert arr.to_pylist() == [[1, 2, 3, 4]]

# The flattened length must still match the fixed size
with pytest.raises(pa.lib.ArrowInvalid):
pa.array([np.array([[1, 2], [3, 4]], dtype=np.int64)],
type=pa.list_(pa.int64(), 3))

# Variable-sized lists still require 1-dimensional values
with pytest.raises(pa.lib.ArrowInvalid, match="1-dimensional"):
pa.array([np.array([[1, 2, 3]], dtype=np.int64)],
type=pa.list_(pa.int64()))

# 0-dimensional arrays are still rejected (not flattened to length 1)
with pytest.raises(pa.lib.ArrowInvalid, match="1-dimensional"):
pa.array([np.array(1, dtype=np.int64)], type=pa.list_(pa.int64(), 1))


@pytest.mark.numpy
def test_array_from_strided_bool():
# ARROW-6325
Expand Down
42 changes: 42 additions & 0 deletions python/pyarrow/tests/test_extension_type.py
Original file line number Diff line number Diff line change
Expand Up @@ -1730,6 +1730,48 @@ def test_tensor_array_from_numpy(np_type_str):
pa.FixedShapeTensorArray.from_numpy_ndarray(arr, dim_names=[0, 1])


@pytest.mark.numpy
@pytest.mark.parametrize("np_type_str", ("int8", "int64", "float32"))
def test_tensor_array_from_list_of_ndarrays(np_type_str):
# GH-49644: build a fixed-shape-tensor array from a list of individual
# (multi-dimensional) ndarrays, not only from a single stacked ndarray.
np_dtype = np.dtype(np_type_str)
tensor_type = pa.fixed_shape_tensor(pa.from_numpy_dtype(np_dtype), (2, 3))

elements = [
np.arange(6, dtype=np_dtype).reshape(2, 3),
np.arange(6, 12, dtype=np_dtype).reshape(2, 3),
]
result = pa.array(elements, type=tensor_type)
assert isinstance(result, pa.FixedShapeTensorArray)
assert result.type == tensor_type
assert len(result) == 2

# Must match the existing from_numpy_ndarray path on the same data
expected = pa.FixedShapeTensorArray.from_numpy_ndarray(np.stack(elements))
assert result.storage.equals(expected.storage)

# Each element round-trips back to the original ndarray (with its shape)
for scalar, original in zip(result, elements):
np.testing.assert_array_equal(scalar.to_numpy(), original)

# Higher-dimensional tensors work too
tensor_3d = pa.fixed_shape_tensor(pa.from_numpy_dtype(np_dtype), (2, 2, 3))
elements_3d = [np.arange(12, dtype=np_dtype).reshape(2, 2, 3)]
result_3d = pa.array(elements_3d, type=tensor_3d)
assert result_3d.type == tensor_3d
np.testing.assert_array_equal(result_3d[0].to_numpy(), elements_3d[0])

# None elements are allowed
result_with_null = pa.array([elements[0], None], type=tensor_type)
assert result_with_null.null_count == 1
assert result_with_null[1].as_py() is None

# A flattened size that doesn't match the tensor shape is rejected
with pytest.raises(pa.lib.ArrowInvalid):
pa.array([np.arange(8, dtype=np_dtype).reshape(2, 4)], type=tensor_type)


@pytest.mark.numpy
@pytest.mark.parametrize("tensor_type", (
pa.fixed_shape_tensor(pa.int8(), [2, 2, 3]),
Expand Down
Loading