[WIP] Support file splitting in ReadParquetPyarrowFS #1139
I do have a pretty strong preference on not adding keywords that interact with other things counterintuitively. One of the main complexity drivers of the old `read_parquet` implementation is that there is a plethora of options that disable each other, so I don't want to repeat this here.
Yeah, that makes sense. What do you think is the best way to deal with oversized files? I don't think the …
*Title changed: "blocksize and aggregate_files options in ReadParquetPyarrowFS" → "blocksize in ReadParquetPyarrowFS" → "[WIP] Support file splitting in ReadParquetPyarrowFS"*
Update: I moved away from using a `blocksize` keyword; see the revised proposal below.
Proposed Changes (Revised Sep 23, 2024)
Adds optimization-time ("tune up") support for large-file splitting.
I like how dask-expr currently "squashes" small files together at optimization time. This PR simply expands the functionality of `_tune_up` to split (rather than fuse) oversized parquet files. The splitting behavior is controlled by a new `"dataframe.parquet.maximum-partition-size"` config option (to complement the existing `"dataframe.parquet.minimum-partition-size"` config option that is already used to control fusion).
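For illustration, a minimal sketch of how the two config options could be set together. The size thresholds here are hypothetical, `"dataframe.parquet.maximum-partition-size"` is the key proposed by this PR, and this assumes `filesystem="arrow"` routes to the `ReadParquetPyarrowFS` code path:

```python
import dask
import dask.dataframe as dd

# Hypothetical thresholds: fuse inputs smaller than 64 MiB (existing
# config), split inputs larger than 256 MiB (config proposed here).
with dask.config.set(
    {
        "dataframe.parquet.minimum-partition-size": 64 * 1024**2,
        "dataframe.parquet.maximum-partition-size": 256 * 1024**2,
    }
):
    df = dd.read_parquet("s3://bucket/dataset/", filesystem="arrow")
```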
Background

The legacy `read_parquet` infrastructure is obviously a mess, and I'd like to avoid the need to keep maintaining it (in both dask and rapids). The logic in `ReadParquetPyarrowFS` is slightly arrow-specific, but is already very close to what I was planning to do in rapids. The only missing feature preventing it from supporting real-world use cases is the lack of support for splitting oversized files (a need we definitely run into a lot in the wild).
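To make the splitting idea concrete, here is a minimal sketch (not the PR's actual `_tune_up` logic) of how a file's row-group metadata could be used to break an oversized file into multiple partitions under a size cap. The function name and the uncompressed-byte-size heuristic are illustrative assumptions:

```python
import pyarrow.parquet as pq

def split_row_groups(path, max_partition_size):
    """Pack a file's row groups into chunks whose (uncompressed)
    byte size stays under ``max_partition_size``. Illustrative only."""
    metadata = pq.ParquetFile(path).metadata
    chunks, current, current_size = [], [], 0
    for i in range(metadata.num_row_groups):
        size = metadata.row_group(i).total_byte_size
        # Start a new chunk once adding this row group would
        # push the current chunk past the cap.
        if current and current_size + size > max_partition_size:
            chunks.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

# Each chunk of row-group indices would then back one output partition,
# e.g. pq.ParquetFile(path).read_row_groups(chunk)
```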