GH-49644: [Python] Support converting list of multi-dimensional arrays to FixedShapeTensor#50203
Conversation
|
|
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds support for constructing fixed-shape tensors and fixed-size lists from lists of multi-dimensional NumPy ndarrays by flattening values in C order (GH-49644).
Changes:
- Add tests covering tensor arrays built from lists of ndarrays (including nulls and shape mismatch).
- Add tests ensuring fixed-size lists accept multi-dimensional ndarray elements (and reject invalid cases).
- Update ndarray-to-list conversion to allow flattening for
FIXED_SIZE_LISTwhile keeping variable-sized lists restricted to 1D.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| python/pyarrow/tests/test_extension_type.py | Adds coverage for building FixedShapeTensorArray from a list of ndarrays. |
| python/pyarrow/tests/test_array.py | Adds a regression test for fixed-size list conversion from multi-dimensional ndarrays. |
| python/pyarrow/src/arrow/python/python_to_arrow.cc | Implements multi-dimensional ndarray flattening for fixed-size lists during conversion. |
… arrays to FixedShapeTensor
…cover C-order flatten
90c2ac4 to
e695d01
Compare
|
Thank you @aboderinsamuel! Also, am curious how you decided to use |
|
Hi @AlenkaF The macOS-intel failure is an intermittent dependency-install flake. Homebrew lock on aws-sdk-cpp, before the build/tests even run. It's hitting other PRs too (e.g. #50146), and #50195 is actively moving the macOS AWS-SDK dependency, so it looks like known infra rather than my change. Happy to keep re-running until it catches a clean runner. Everything else is green. @rok 🙂 |
| if (PyArray_NDIM(ndarray) < 2 || this->list_type_->id() != Type::FIXED_SIZE_LIST) { | ||
| return Status::Invalid("Can only convert 1-dimensional array values"); | ||
| } | ||
| flattened.reset(PyArray_Ravel(ndarray, NPY_CORDER)); |
There was a problem hiding this comment.
Why did you decided to use PyArray_Ravel? Are there any other alternatives?
There was a problem hiding this comment.
PyArray_Ravel(ndarray, NPY_CORDER) returns a 1-D, C-order view, which lets me reuse the existing 1-D append paths unchanged (the typed fast paths and the PySequence fallback). It's zero-copy when the input is already C-contiguous and only copies for non-contiguous/Fortran-order input, where a copy is unavoidable to get C-order data anyway, which also matches the fixed-shape-tensor's row-major storage.
Alternatives: PyArray_Flatten(NPY_CORDER) gives the same result but always copies; a manual NpyIter/buffer loop could avoid even the view allocation but would mean duplicating the typed-append fast paths. Ravel was the smallest change with the best reuse/performance balance.
There was a problem hiding this comment.
PyArray_Ravel over PyArray_Flatten seems ok. If I read the context correctly we're using a builder to append these arrays. When builder finishes it creates a new buffer with a copy of concatenated arrays. So PyArray_Ravel should be ok if the list of arrays is not deleted before appending is complete, which I expect won't happen. @AlenkaF am I reading this right?
There was a problem hiding this comment.
Could you have a look at PyArray_CheckFromAny with flag NPY_ARRAY_C_CONTIGUOUS? If I understand correctly it reshapes if necessary or gives access to the data/view directly. Then we can use PyArray_DATA as in other places in the codebase.
I am not sure which would fit better but am interested in what you think.
There was a problem hiding this comment.
Also to note here - PyArray_Ravel seems to default to C order. So we should document somewhere that no matter the input arrays order we will return C order. (@AlenkaF please doublecheck me here :) )
There was a problem hiding this comment.
Please look into @AlenkaF proposal here, it might be better to have the explicit approach here.
There was a problem hiding this comment.
@AlenkaF PyArray_Ravel returns a 1-D C-order view so I could reuse the existing 1-D append paths, but per the discussion I'll switch to the explicit PyArray_CheckFromAny + NPY_ARRAY_C_CONTIGUOUS approach and read PyArray_DATA directly, it's more idiomatic here and lets me handle byte order in the same call.
There was a problem hiding this comment.
@rok Will do, switching to the explicit PyArray_CheckFromAny + NPY_ARRAY_C_CONTIGUOUS approach and reading PyArray_DATA directly, matching the existing pattern. That should also fold in the byte-order handling.
There was a problem hiding this comment.
Great, I think having more control can be helpful. Thanks for looking into it!
Also, some other flags might be needed, see: https://numpy.org/doc/stable/reference/c-api/array.html#basic-array-flags
|
|
|
Hi all. I'd add two points for further review of current PR changes (thanks @aboderinsamuel !) import numpy as np
import pytest
import pyarrow as pa
np_dtype = np.dtype("int8")
tensor_type = pa.fixed_shape_tensor(pa.from_numpy_dtype(np_dtype), (2, 3))
pa.array([np.arange(6, dtype=np_dtype).reshape(3, 2)], type=tensor_type)
# As (3, 2) has the right number of elements (6), but the wrong shape for a (2, 3) tensor,
# shouldn't we get an error?Currently returns: np_dtype2 = np.dtype("float32")
elements = [
np.arange(6, dtype=np_dtype2).reshape(2, 3),
np.arange(6, 12, dtype=np_dtype2).reshape(2, 3),
]
permuted_type = pa.fixed_shape_tensor(pa.from_numpy_dtype(np_dtype2), (2, 3),
permutation=[1, 0])
pa.array(elements, type=permuted_type)
# For permuted tensor types built from multi-dim ndarrays, doesn't this store the wrong layout?Currently returns: |
| value = flattened.obj(); | ||
| ndarray = reinterpret_cast<PyArrayObject*>(value); | ||
| } | ||
| if (PyArray_ISBYTESWAPPED(ndarray)) { |
There was a problem hiding this comment.
I wonder how PyArray_Ravel behaves in big-endian situations? I imagine it would return the correct result but we can't really test anyway. Output would likely be PyArray_ISBYTESWAPPED(ndarray) == true so this would fail and the computation above would be done in vain.
@aboderinsamuel please move the if (PyArray_ISBYTESWAPPED(ndarray)) block above the if (PyArray_NDIM(ndarray) != 1) block.
| // which the builder validates below. 0-dimensional arrays and | ||
| // variable-sized lists remain restricted to 1-dimensional values. | ||
| if (PyArray_NDIM(ndarray) < 2 || this->list_type_->id() != Type::FIXED_SIZE_LIST) { | ||
| return Status::Invalid("Can only convert 1-dimensional array values"); |
There was a problem hiding this comment.
This should probably say something like:
| return Status::Invalid("Can only convert 1-dimensional array values"); | |
| return Status::Invalid("Can only convert 1-dimensional array values to variable sized lists array"); |
| // GH-49644: a fixed-size list (e.g. the storage of a fixed-shape tensor) | ||
| // can be built from a multi-dimensional array by flattening it in C | ||
| // order. The total number of elements must still match the list size, | ||
| // which the builder validates below. 0-dimensional arrays and | ||
| // variable-sized lists remain restricted to 1-dimensional values. |
| if (PyArray_NDIM(ndarray) < 2 || this->list_type_->id() != Type::FIXED_SIZE_LIST) { | ||
| return Status::Invalid("Can only convert 1-dimensional array values"); | ||
| } | ||
| flattened.reset(PyArray_Ravel(ndarray, NPY_CORDER)); |
There was a problem hiding this comment.
Also to note here - PyArray_Ravel seems to default to C order. So we should document somewhere that no matter the input arrays order we will return C order. (@AlenkaF please doublecheck me here :) )
|
Thanks @tadeja @rok @AlenkaF, these are really helpful, i'm sorry i didn't think of this earlier. I reproduced tadeja's cases and both are real: a wrong-shape array (e.g. (3,2) into a (2,3) tensor) is silently accepted, and permuted tensor types store the wrong layout (the to_numpy round-trip doesn't return the input). Root cause: array.pxi swaps the extension type for its storage_type before conversion and re-wraps with wrap_array after, so the C++ converter only ever sees the flat fixed_size_list, it can't know the tensor's shape or permutation. The flatten is correct for a plain fixed_size_list, but shape-validation and permutation-handling need to live in the Python/Cython layer where the FixedShapeTensorType is still intact. Plan:
Does this direction sound right before I work on it? |
Permuted output types would best fit into a follow-up issue indeed.
Yes! Please proceed. |
Rationale for this change
Constructing a fixed-shape-tensor array from a list of individual ndarrays only
worked when each element was 1-D; ≥2-D elements failed with
ArrowInvalid: Can only convert 1-dimensional array values. The only workaroundwas stacking the list into a single ndarray and using
FixedShapeTensorArray.from_numpy_ndarray.What changes are included in this PR?
The C++ list converter
PyListConverter::AppendNdarraynow acceptsmulti-dimensional ndarray elements for fixed-size lists (the storage of a
fixed-shape tensor) by flattening them in C order. The fixed-size-list builder
still validates that the flattened length matches the list width, so wrong sizes
error cleanly. Variable-sized lists remain restricted to 1-D values to avoid
ambiguity. As a side benefit, plain
fixed_size_listalso acceptsmulti-dimensional ndarray elements now.
Are these changes tested?
Yes:
test_tensor_array_from_list_of_ndarrays— construction from 2-D and 3-Dndarrays, null handling, storage parity with
from_numpy_ndarray, and thesize-mismatch error, across
int8/int64/float32.test_fixed_size_list_from_multidim_ndarray— plainfixed_size_listfrommulti-dim arrays, plus a check that variable-sized lists still reject 2-D.
Are there any user-facing changes?
Yes —
pa.array([multi-dim ndarrays], type=fixed_shape_tensor(...))(and thesame for
fixed_size_list) now works instead of raising. Existing 1-D behaviorand variable-sized-list behavior are unchanged.
Scoped to construction only; the reverse
to_numpyshape-preservation alsoraised in the issue is intentionally left as a separate follow-up.