GH-38849: [C++][Parquet] Add support for list view and large list view#50160
GH-38849: [C++][Parquet] Add support for list view and large list view#50160HuaHuaY wants to merge 6 commits into
Conversation
| auto table = Table::Make( | ||
| ::arrow::schema({::arrow::field("root", array->type(), false)}), {array}); | ||
|
|
||
| auto props_store_schema = ArrowWriterProperties::Builder().store_schema()->build(); |
There was a problem hiding this comment.
Can we also check the expect List result when store_schema is false? This would further validate that the actual Parquet contents are ok.
There was a problem hiding this comment.
I think this is tested by expected_array.
| // Multiple leaf ranges can be produced when child values are | ||
| // skipped, such as null fixed-size-list slots, or when | ||
| // list-view ranges are non-contiguous. Concatenate the slices | ||
| // in logical write order. |
There was a problem hiding this comment.
Sorry that my previous reply was inaccurate. If this PR is merged, test cases in #35697 will not report NotImplemented: Lists with non-zero length null components are not supported but still report other issues. Null values in a FixedSizeList also need to occupy space, but they do not appear in the Parquet data array. So round trip tests will fail. We can fix FixedSizeListReader in future PR.
| arrays.push_back(result.leaf_array->Slice(range.start, range.Size())); | ||
| } | ||
| ARROW_ASSIGN_OR_RAISE(values_array, | ||
| ::arrow::Concatenate(arrays, ctx->memory_pool)); |
There was a problem hiding this comment.
Instead of concatenating, would it be possible (and potentially more efficient) to call WriteArrow for each individual chunk?
There was a problem hiding this comment.
It's difficult to write data without constructing a new array because we don't know the definition level and repetition level ranges for each segment of the array. Maybe we can add two offsets in ElementRange to implement this.
There was a problem hiding this comment.
Maybe we can add two offsets in
ElementRangeto implement this.
We can, but perhaps that can be left as a later optimization.
(also, not sure it's a good idea if there are many ElementRanges in the first place)
|
@pitrou Sorry for the delay in updating. I have updated the code and comments. Please take a look again. |
Rationale for this change
Allow us to write an arrow list view array into a parquet file and read list view type from parquet.
What changes are included in this PR?
We will sort the list views based on the offsets before inserting the array in Parquet writer, and generate a size array in Parquet reader.
Are these changes tested?
Yes.
Are there any user-facing changes?
No.