Skip to content

ParquetPushDecoder: expose the next row-group index that try_next_reader will yield #10148

Description

@zhuqi-lucas

Description

ParquetPushDecoder advertises is_at_row_group_boundary() and row_groups_remaining(), but a consumer that needs to track per-RG state in lock-step with the decoder has no way to ask which row-group index the next try_next_reader() call will produce a reader for. Today the call sites have to assume a 1:1 correspondence between the row-group list passed to with_row_groups(...) and the readers returned in order. That assumption breaks silently when page-index pruning eliminates every page of a row group: the decoder skips that RG internally and the next reader is for the RG after it, with no observable signal.

Concrete example

In DataFusion (apache/datafusion#22450) we maintain an rg_plan: VecDeque<RgEntry> parallel to the decoder's row-group list so we can consult per-RG metadata (statistics for runtime row-group pruning, fully-matched flag for skipping RowFilter evaluation, etc.) at each boundary. The current implementation pop_front()s the queue every time try_next_reader() yields a reader. With this query:

WHERE species > 'M' AND s >= 50 ORDER BY species LIMIT 3

against a 3-row-group file where every page of RG 1 is page-index-pruned, the trace is:

  • rg_plan = [RG1 (not fully-matched), RG2 (fully-matched), RG3 (not fully-matched)]
  • arrow-rs page-index-prunes RG 1 entirely
  • try_next_reader() returns the reader for RG 2; our code pops RG 1 from the queue
  • The next boundary check looks at the new head (RG 2), sees fully-matched, triggers a per-RG RowFilter toggle, calls into_builder().with_row_filter(empty).with_row_groups([2, 3]).build()
  • The rebuilt decoder dutifully reads RG 2 again — duplicate output

The final query result is still correct because TopK ranks the rows, but the scan emits 2x the rows it should and downstream operators do extra work. Any consumer that wants to drive per-RG state alongside the decoder (per-RG pruners, fully-matched filter toggles, custom statistics, etc.) hits this.

Proposed API

Add a peek_next_row_group(&self) -> Option<usize> accessor on ParquetPushDecoder (and the underlying RemainingRowGroups). When is_at_row_group_boundary() is true, this returns Some(idx) where idx is the file-level row-group index that the next try_next_reader() call will yield a reader for — i.e. after any page-index-driven skipping the decoder will perform internally. When the decoder is mid-row-group or finished, returns None.

An alternative shape would be remaining_row_group_indices(&self) -> &[usize] — the full sequence the decoder still intends to read. Either works for our use case.

Why this matters

This isn't specific to DataFusion — any push-decoder consumer that needs per-RG bookkeeping aligned with what the decoder is actually doing will run into this. The information already exists inside RemainingRowGroups; just no public accessor.

Happy to send a PR if there's interest.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions