Skip to content

GH-49644: [Python] Support converting list of multi-dimensional arrays to FixedShapeTensor#50203

Open
aboderinsamuel wants to merge 4 commits into
apache:mainfrom
aboderinsamuel:gh-49644-list-multidim-to-fixed-shape-tensor
Open

GH-49644: [Python] Support converting list of multi-dimensional arrays to FixedShapeTensor#50203
aboderinsamuel wants to merge 4 commits into
apache:mainfrom
aboderinsamuel:gh-49644-list-multidim-to-fixed-shape-tensor

Conversation

@aboderinsamuel

@aboderinsamuel aboderinsamuel commented Jun 17, 2026

Copy link
Copy Markdown

Rationale for this change

Constructing a fixed-shape-tensor array from a list of individual ndarrays only
worked when each element was 1-D; ≥2-D elements failed with
ArrowInvalid: Can only convert 1-dimensional array values. The only workaround
was stacking the list into a single ndarray and using
FixedShapeTensorArray.from_numpy_ndarray.

What changes are included in this PR?

The C++ list converter PyListConverter::AppendNdarray now accepts
multi-dimensional ndarray elements for fixed-size lists (the storage of a
fixed-shape tensor) by flattening them in C order. The fixed-size-list builder
still validates that the flattened length matches the list width, so wrong sizes
error cleanly. Variable-sized lists remain restricted to 1-D values to avoid
ambiguity. As a side benefit, plain fixed_size_list also accepts
multi-dimensional ndarray elements now.

Are these changes tested?

Yes:

  • test_tensor_array_from_list_of_ndarrays — construction from 2-D and 3-D
    ndarrays, null handling, storage parity with from_numpy_ndarray, and the
    size-mismatch error, across int8/int64/float32.
  • test_fixed_size_list_from_multidim_ndarray — plain fixed_size_list from
    multi-dim arrays, plus a check that variable-sized lists still reject 2-D.

Are there any user-facing changes?

Yes — pa.array([multi-dim ndarrays], type=fixed_shape_tensor(...)) (and the
same for fixed_size_list) now works instead of raising. Existing 1-D behavior
and variable-sized-list behavior are unchanged.

Scoped to construction only; the reverse to_numpy shape-preservation also
raised in the issue is intentionally left as a separate follow-up.

Copilot AI review requested due to automatic review settings June 17, 2026 09:21
@github-actions

Copy link
Copy Markdown

⚠️ GitHub issue #49644 has been automatically assigned in GitHub to PR creator.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds support for constructing fixed-shape tensors and fixed-size lists from lists of multi-dimensional NumPy ndarrays by flattening values in C order (GH-49644).

Changes:

  • Add tests covering tensor arrays built from lists of ndarrays (including nulls and shape mismatch).
  • Add tests ensuring fixed-size lists accept multi-dimensional ndarray elements (and reject invalid cases).
  • Update ndarray-to-list conversion to allow flattening for FIXED_SIZE_LIST while keeping variable-sized lists restricted to 1D.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

File Description
python/pyarrow/tests/test_extension_type.py Adds coverage for building FixedShapeTensorArray from a list of ndarrays.
python/pyarrow/tests/test_array.py Adds a regression test for fixed-size list conversion from multi-dimensional ndarrays.
python/pyarrow/src/arrow/python/python_to_arrow.cc Implements multi-dimensional ndarray flattening for fixed-size lists during conversion.

Comment thread python/pyarrow/tests/test_array.py Outdated
Comment thread python/pyarrow/tests/test_array.py
@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 17, 2026
@aboderinsamuel aboderinsamuel force-pushed the gh-49644-list-multidim-to-fixed-shape-tensor branch from 90c2ac4 to e695d01 Compare June 17, 2026 20:42
Copilot AI review requested due to automatic review settings June 17, 2026 20:42

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread python/pyarrow/src/arrow/python/python_to_arrow.cc
@AlenkaF

AlenkaF commented Jun 22, 2026

Copy link
Copy Markdown
Member

Thank you @aboderinsamuel!
Can you have a look at the linter error here: https://github.com/apache/arrow/actions/runs/27719407632/job/82001639136?pr=50203#step:5:98

Also, am curious how you decided to use PyArray_Ravel? Are there any other alternatives?

Copilot AI review requested due to automatic review settings June 22, 2026 17:04

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

@aboderinsamuel

aboderinsamuel commented Jun 22, 2026

Copy link
Copy Markdown
Author

Hi @AlenkaF The macOS-intel failure is an intermittent dependency-install flake. Homebrew lock on aws-sdk-cpp, before the build/tests even run. It's hitting other PRs too (e.g. #50146), and #50195 is actively moving the macOS AWS-SDK dependency, so it looks like known infra rather than my change. Happy to keep re-running until it catches a clean runner. Everything else is green. @rok 🙂

@AlenkaF AlenkaF left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Repeating a question :)

if (PyArray_NDIM(ndarray) < 2 || this->list_type_->id() != Type::FIXED_SIZE_LIST) {
return Status::Invalid("Can only convert 1-dimensional array values");
}
flattened.reset(PyArray_Ravel(ndarray, NPY_CORDER));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you decided to use PyArray_Ravel? Are there any other alternatives?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyArray_Ravel(ndarray, NPY_CORDER) returns a 1-D, C-order view, which lets me reuse the existing 1-D append paths unchanged (the typed fast paths and the PySequence fallback). It's zero-copy when the input is already C-contiguous and only copies for non-contiguous/Fortran-order input, where a copy is unavoidable to get C-order data anyway, which also matches the fixed-shape-tensor's row-major storage.

Alternatives: PyArray_Flatten(NPY_CORDER) gives the same result but always copies; a manual NpyIter/buffer loop could avoid even the view allocation but would mean duplicating the typed-append fast paths. Ravel was the smallest change with the best reuse/performance balance.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyArray_Ravel over PyArray_Flatten seems ok. If I read the context correctly we're using a builder to append these arrays. When builder finishes it creates a new buffer with a copy of concatenated arrays. So PyArray_Ravel should be ok if the list of arrays is not deleted before appending is complete, which I expect won't happen. @AlenkaF am I reading this right?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you have a look at PyArray_CheckFromAny with flag NPY_ARRAY_C_CONTIGUOUS? If I understand correctly it reshapes if necessary or gives access to the data/view directly. Then we can use PyArray_DATA as in other places in the codebase.

I am not sure which would fit better but am interested in what you think.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also to note here - PyArray_Ravel seems to default to C order. So we should document somewhere that no matter the input arrays order we will return C order. (@AlenkaF please doublecheck me here :) )

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please look into @AlenkaF proposal here, it might be better to have the explicit approach here.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AlenkaF PyArray_Ravel returns a 1-D C-order view so I could reuse the existing 1-D append paths, but per the discussion I'll switch to the explicit PyArray_CheckFromAny + NPY_ARRAY_C_CONTIGUOUS approach and read PyArray_DATA directly, it's more idiomatic here and lets me handle byte order in the same call.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rok Will do, switching to the explicit PyArray_CheckFromAny + NPY_ARRAY_C_CONTIGUOUS approach and reading PyArray_DATA directly, matching the existing pattern. That should also fold in the byte-order handling.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, I think having more control can be helpful. Thanks for looking into it!
Also, some other flags might be needed, see: https://numpy.org/doc/stable/reference/c-api/array.html#basic-array-flags

@rok rok changed the title GH-49644: [Python] Support converting list of multi-dimensional array… GH-49644: [Python] Support converting list of multi-dimensional arrays to FixedShapeTensor Jun 23, 2026
@github-actions

Copy link
Copy Markdown

⚠️ GitHub issue #49644 has been automatically assigned in GitHub to PR creator.

@tadeja

tadeja commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Hi all. I'd add two points for further review of current PR changes (thanks @aboderinsamuel !)
Example of wrong shape and example of permuted tensor currently produce results but might need to return errors instead:

import numpy as np
import pytest
import pyarrow as pa

np_dtype = np.dtype("int8")
tensor_type = pa.fixed_shape_tensor(pa.from_numpy_dtype(np_dtype), (2, 3))
pa.array([np.arange(6, dtype=np_dtype).reshape(3, 2)], type=tensor_type)
# As (3, 2) has the right number of elements (6), but the wrong shape for a (2, 3) tensor,
# shouldn't we get an error?

Currently returns:

<pyarrow.lib.FixedShapeTensorArray object at 0x12524ec20>
[
  [
    0,
    1,
    2,
    3,
    4,
    5
  ]
]
np_dtype2 = np.dtype("float32")
elements = [
  np.arange(6, dtype=np_dtype2).reshape(2, 3),
  np.arange(6, 12, dtype=np_dtype2).reshape(2, 3),
]
permuted_type = pa.fixed_shape_tensor(pa.from_numpy_dtype(np_dtype2), (2, 3),
                                    permutation=[1, 0])
pa.array(elements, type=permuted_type)
# For permuted tensor types built from multi-dim ndarrays, doesn't this store the wrong layout?

Currently returns:

<pyarrow.lib.FixedShapeTensorArray object at 0x104597ac0>
[
  [
    0,
    1,
    2,
    3,
    4,
    5
  ],
  [
    6,
    7,
    8,
    9,
    10,
    11
  ]
]

@github-actions github-actions Bot added awaiting changes Awaiting changes and removed awaiting committer review Awaiting committer review labels Jun 23, 2026
value = flattened.obj();
ndarray = reinterpret_cast<PyArrayObject*>(value);
}
if (PyArray_ISBYTESWAPPED(ndarray)) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder how PyArray_Ravel behaves in big-endian situations? I imagine it would return the correct result but we can't really test anyway. Output would likely be PyArray_ISBYTESWAPPED(ndarray) == true so this would fail and the computation above would be done in vain.
@aboderinsamuel please move the if (PyArray_ISBYTESWAPPED(ndarray)) block above the if (PyArray_NDIM(ndarray) != 1) block.

// which the builder validates below. 0-dimensional arrays and
// variable-sized lists remain restricted to 1-dimensional values.
if (PyArray_NDIM(ndarray) < 2 || this->list_type_->id() != Type::FIXED_SIZE_LIST) {
return Status::Invalid("Can only convert 1-dimensional array values");

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably say something like:

Suggested change
return Status::Invalid("Can only convert 1-dimensional array values");
return Status::Invalid("Can only convert 1-dimensional array values to variable sized lists array");

Comment on lines +913 to +917
// GH-49644: a fixed-size list (e.g. the storage of a fixed-shape tensor)
// can be built from a multi-dimensional array by flattening it in C
// order. The total number of elements must still match the list size,
// which the builder validates below. 0-dimensional arrays and
// variable-sized lists remain restricted to 1-dimensional values.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could be less verbose.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay sir 👍

if (PyArray_NDIM(ndarray) < 2 || this->list_type_->id() != Type::FIXED_SIZE_LIST) {
return Status::Invalid("Can only convert 1-dimensional array values");
}
flattened.reset(PyArray_Ravel(ndarray, NPY_CORDER));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also to note here - PyArray_Ravel seems to default to C order. So we should document somewhere that no matter the input arrays order we will return C order. (@AlenkaF please doublecheck me here :) )

@aboderinsamuel

aboderinsamuel commented Jun 23, 2026

Copy link
Copy Markdown
Author

Thanks @tadeja @rok @AlenkaF, these are really helpful, i'm sorry i didn't think of this earlier. I reproduced tadeja's cases and both are real: a wrong-shape array (e.g. (3,2) into a (2,3) tensor) is silently accepted, and permuted tensor types store the wrong layout (the to_numpy round-trip doesn't return the input).

Root cause: array.pxi swaps the extension type for its storage_type before conversion and re-wraps with wrap_array after, so the C++ converter only ever sees the flat fixed_size_list, it can't know the tensor's shape or permutation. The flatten is correct for a plain fixed_size_list, but shape-validation and permutation-handling need to live in the Python/Cython layer where the FixedShapeTensorType is still intact.

Plan:

  1. C++: switch from PyArray_Ravel to the explicit PyArray_CheckFromAny + NPY_ARRAY_C_CONTIGUOUS approach, reading PyArray_DATA directly (per @AlenkaF), this should also let me handle the byte-order case in the same call.
  2. Cython: validate each element's shape against the tensor's shape (so (3,2) into (2,3) errors), and reject permuted types with a clear error for now, with full permutation support as a follow-up, unless you'd prefer I handle the transpose here.
  3. Also tighten the comment + error message and document that we always output C order, per @rok.

Does this direction sound right before I work on it?

@rok

rok commented Jun 23, 2026

Copy link
Copy Markdown
Member
  1. Cython: validate each element's shape against the tensor's shape (so (3,2) into (2,3) errors), and reject permuted types with a clear error for now, with full permutation support as a follow-up, unless you'd prefer I handle the transpose here.

Permuted output types would best fit into a follow-up issue indeed.
As for non-C ordered input types - if we have a clean way to handle them (e.g. always convert them to C) and document that in the function docs I think that'd be ok. It doesn't seem likely we'll have requests for non-C orders anytime soon, but we should guard against misinterpretation of non-C ordered data.

Does this direction sound right before I work on it?

Yes! Please proceed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants