Skip to content

GH-38849: [C++][Parquet] Add support for list view and large list view#50160

Open
HuaHuaY wants to merge 6 commits into
apache:mainfrom
HuaHuaY:list_view_type
Open

GH-38849: [C++][Parquet] Add support for list view and large list view#50160
HuaHuaY wants to merge 6 commits into
apache:mainfrom
HuaHuaY:list_view_type

Conversation

@HuaHuaY

@HuaHuaY HuaHuaY commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Rationale for this change

Allow us to write an arrow list view array into a parquet file and read list view type from parquet.

What changes are included in this PR?

We will sort the list views based on the offsets before inserting the array in Parquet writer, and generate a size array in Parquet reader.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

@wgtmac wgtmac left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good to me. I've left some comments. cc @mapleFU @pitrou

Comment thread cpp/src/parquet/arrow/reader.cc
Comment thread cpp/src/parquet/arrow/reader.cc Outdated
Comment thread cpp/src/parquet/arrow/writer.cc
Comment thread cpp/src/parquet/arrow/arrow_reader_writer_test.cc Outdated
Comment thread cpp/src/parquet/arrow/arrow_reader_writer_test.cc Outdated
@github-actions github-actions Bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Jun 15, 2026
Comment thread cpp/src/parquet/properties.h Outdated
Comment thread cpp/src/parquet/arrow/path_internal.cc
auto table = Table::Make(
::arrow::schema({::arrow::field("root", array->type(), false)}), {array});

auto props_store_schema = ArrowWriterProperties::Builder().store_schema()->build();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also check the expect List result when store_schema is false? This would further validate that the actual Parquet contents are ok.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is tested by expected_array.

// Multiple leaf ranges can be produced when child values are
// skipped, such as null fixed-size-list slots, or when
// list-view ranges are non-contiguous. Concatenate the slices
// in logical write order.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this somehow fixing #35697?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes

@HuaHuaY HuaHuaY Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry that my previous reply was inaccurate. If this PR is merged, test cases in #35697 will not report NotImplemented: Lists with non-zero length null components are not supported but still report other issues. Null values ​​in a FixedSizeList also need to occupy space, but they do not appear in the Parquet data array. So round trip tests will fail. We can fix FixedSizeListReader in future PR.

arrays.push_back(result.leaf_array->Slice(range.start, range.Size()));
}
ARROW_ASSIGN_OR_RAISE(values_array,
::arrow::Concatenate(arrays, ctx->memory_pool));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of concatenating, would it be possible (and potentially more efficient) to call WriteArrow for each individual chunk?

@HuaHuaY HuaHuaY Jun 16, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's difficult to write data without constructing a new array because we don't know the definition level and repetition level ranges for each segment of the array. Maybe we can add two offsets in ElementRange to implement this.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes, that's a good point.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we can add two offsets in ElementRange to implement this.

We can, but perhaps that can be left as a later optimization.

(also, not sure it's a good idea if there are many ElementRanges in the first place)

@HuaHuaY

HuaHuaY commented Jun 23, 2026

Copy link
Copy Markdown
Contributor Author

@pitrou Sorry for the delay in updating. I have updated the code and comments. Please take a look again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants