[WIP] Support file splitting in ReadParquetPyarrowFS #1139
I do have a pretty strong preference on not adding keywords that interact with other things counterintuitively. One of the main complexity drivers of the old `read_parquet` implementation is that there is a plethora of options that disable each other, so I don't want to repeat this here.
Yeah, that makes sense. What do you think is the best way to deal with oversized files? I don't think the …
*Title changed: "blocksize and aggregate_files options in ReadParquetPyarrowFS" → "blocksize in ReadParquetPyarrowFS" → "[WIP] Support file splitting in ReadParquetPyarrowFS"*
Update: I moved away from using a `blocksize` keyword; see the revised proposal below.
Proposed Changes (Revised Sep 23, 2024)
Adds optimization-time ("tune up") support for large-file splitting.
I like how dask-expr currently "squashes" small files together at optimization time. This PR simply expands the functionality of `_tune_up` to split (rather than fuse) oversized parquet files. The splitting behavior is controlled by a new `"dataframe.parquet.maximum-partition-size"` config option (to complement the existing `"dataframe.parquet.minimum-partition-size"` config option that is already used to control fusion).
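For illustration, a minimal sketch of how the two config options could be set together. The size thresholds here are hypothetical, `"dataframe.parquet.maximum-partition-size"` is the key proposed by this PR, and this assumes `filesystem="arrow"` routes to the `ReadParquetPyarrowFS` code path:

```python
import dask
import dask.dataframe as dd

# Hypothetical thresholds: fuse inputs smaller than 64 MiB (existing
# config), split inputs larger than 256 MiB (config proposed here).
with dask.config.set(
    {
        "dataframe.parquet.minimum-partition-size": 64 * 1024**2,
        "dataframe.parquet.maximum-partition-size": 256 * 1024**2,
    }
):
    df = dd.read_parquet("s3://bucket/dataset/", filesystem="arrow")
```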
Background

The legacy `read_parquet` infrastructure is obviously a mess, and I'd like to avoid the need to keep maintaining it (in both dask and rapids). The logic in `ReadParquetPyarrowFS` is slightly arrow-specific, but is already very close to what I was planning to do in rapids. The only missing feature preventing it from supporting real-world use cases is the lack of support for splitting oversized files (a need we definitely run into a lot in the wild).
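To make the splitting idea concrete, here is a minimal sketch (not the PR's actual `_tune_up` logic) of how a file's row-group metadata could be used to break an oversized file into multiple partitions under a size cap. The function name and the uncompressed-byte-size heuristic are illustrative assumptions:

```python
import pyarrow.parquet as pq

def split_row_groups(path, max_partition_size):
    """Pack a file's row groups into chunks whose (uncompressed)
    byte size stays under ``max_partition_size``. Illustrative only."""
    metadata = pq.ParquetFile(path).metadata
    chunks, current, current_size = [], [], 0
    for i in range(metadata.num_row_groups):
        size = metadata.row_group(i).total_byte_size
        # Start a new chunk once adding this row group would
        # push the current chunk past the cap.
        if current and current_size + size > max_partition_size:
            chunks.append(current)
            current, current_size = [], 0
        current.append(i)
        current_size += size
    if current:
        chunks.append(current)
    return chunks

# Each chunk of row-group indices would then back one output partition,
# e.g. pq.ParquetFile(path).read_row_groups(chunk)
```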