ParquetPushDecoder: expose the next row-group index that try_next_reader will yield #10148

## Description

`ParquetPushDecoder` advertises `is_at_row_group_boundary()` and `row_groups_remaining()`, but a consumer that needs to track per-RG state in lock-step with the decoder has no way to ask **which row-group index** the next `try_next_reader()` call will produce a reader for. Today the call sites have to *assume* a 1:1 correspondence between the row-group list passed to `with_row_groups(...)` and the readers returned in order. That assumption breaks silently when page-index pruning eliminates every page of a row group: the decoder skips that RG internally and the next reader is for the RG after it, with no observable signal.

## Concrete example

In DataFusion ([apache/datafusion#22450](https://github.com/apache/datafusion/pull/22450)) we maintain an `rg_plan: VecDeque<RgEntry>` parallel to the decoder's row-group list so we can consult per-RG metadata (statistics for runtime row-group pruning, fully-matched flag for skipping `RowFilter` evaluation, etc.) at each boundary. The current implementation `pop_front()`s the queue every time `try_next_reader()` yields a reader. With this query:

```sql
WHERE species > 'M' AND s >= 50 ORDER BY species LIMIT 3
```

against a 3-row-group file where every page of RG 1 is page-index-pruned, the trace is:

- `rg_plan = [RG1 (not fully-matched), RG2 (fully-matched), RG3 (not fully-matched)]`
- arrow-rs page-index-prunes RG 1 entirely
- `try_next_reader()` returns the reader for **RG 2**; our code pops RG 1 from the queue
- The next boundary check looks at the new head (`RG 2`), sees `fully-matched`, triggers a per-RG `RowFilter` toggle, and calls `into_builder().with_row_filter(empty).with_row_groups([2, 3]).build()`
- The rebuilt decoder dutifully reads RG 2 again — duplicate output

The final query result is still correct because TopK ranks the rows, but the scan emits RG 2's rows twice and downstream operators do extra work. Any consumer that wants to drive per-RG state alongside the decoder (per-RG pruners, fully-matched filter toggles, custom statistics, etc.) hits this.
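The drift in the trace above can be reproduced with a toy model. This is only a sketch: `ToyDecoder` and `run_with_blind_pop` are hypothetical stand-ins written for this issue, not arrow-rs or DataFusion code, and the toy decoder models nothing except "skip a row group whose pages were all pruned".

```rust
use std::collections::VecDeque;

/// Toy stand-in for the push decoder (hypothetical, not the arrow-rs type).
/// It yields file-level row-group indices in order and silently skips any
/// row group listed in `fully_pruned`.
struct ToyDecoder {
    remaining: VecDeque<usize>,
    fully_pruned: Vec<usize>,
}

impl ToyDecoder {
    /// Returns the index of the row group the next reader covers.
    fn try_next_reader(&mut self) -> Option<usize> {
        while let Some(idx) = self.remaining.pop_front() {
            if !self.fully_pruned.contains(&idx) {
                return Some(idx);
            }
        }
        None
    }
}

/// Pops the parallel queue once per reader, as the current consumer does.
/// Returns (row group actually read, row group the consumer assumed).
fn run_with_blind_pop(row_groups: &[usize], fully_pruned: &[usize]) -> Vec<(usize, usize)> {
    let mut decoder = ToyDecoder {
        remaining: row_groups.iter().copied().collect(),
        fully_pruned: fully_pruned.to_vec(),
    };
    let mut rg_plan: VecDeque<usize> = row_groups.iter().copied().collect();
    let mut pairs = Vec::new();
    while let Some(actual) = decoder.try_next_reader() {
        let assumed = rg_plan
            .pop_front()
            .expect("plan has at least one entry per reader");
        pairs.push((actual, assumed));
    }
    pairs
}

fn main() {
    // Three row groups, RG 1 entirely page-index-pruned.
    for (actual, assumed) in run_with_blind_pop(&[1, 2, 3], &[1]) {
        println!("decoder read RG {}, consumer assumed RG {}", actual, assumed);
    }
}
```

With RG 1 pruned, every pair is off by one entry: the consumer attributes RG 2's reader to RG 1 and leaves RG 2 at the head of its queue, which is the state that leads to the rebuild in the trace.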

## Proposed API

Add a `peek_next_row_group(&self) -> Option<usize>` accessor on `ParquetPushDecoder` (and the underlying `RemainingRowGroups`). When `is_at_row_group_boundary()` is `true`, this returns `Some(idx)`, where `idx` is the file-level row-group index that the next `try_next_reader()` call will yield a reader for, that is, the index left after any page-index-driven skipping the decoder performs internally. When the decoder is mid-row-group or finished, it returns `None`.

An alternative shape would be `remaining_row_group_indices(&self) -> &[usize]` — the full sequence the decoder still intends to read. Either works for our use case.
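As a rough illustration of how a consumer might use the first shape, the sketch below adds a `peek_next_row_group` method to a toy decoder and uses it to drop skipped entries from the parallel queue before consuming the reader. Everything here is an assumption made for illustration: `ToyDecoder`, the fields of `RgEntry`, and `run_with_peek` are hypothetical, and the toy has no mid-row-group state, so it does not model the `None`-while-decoding case.

```rust
use std::collections::VecDeque;

/// Toy stand-in for the push decoder (hypothetical, not the arrow-rs type).
struct ToyDecoder {
    remaining: VecDeque<usize>,
    fully_pruned: Vec<usize>,
}

impl ToyDecoder {
    /// Sketch of the proposed accessor: the index the next reader will
    /// cover, after internal skipping, or `None` when finished.
    fn peek_next_row_group(&self) -> Option<usize> {
        self.remaining
            .iter()
            .copied()
            .find(|idx| !self.fully_pruned.contains(idx))
    }

    fn try_next_reader(&mut self) -> Option<usize> {
        while let Some(idx) = self.remaining.pop_front() {
            if !self.fully_pruned.contains(&idx) {
                return Some(idx);
            }
        }
        None
    }
}

/// Per-row-group bookkeeping kept by the consumer (fields are illustrative).
struct RgEntry {
    index: usize,
    fully_matched: bool,
}

/// Returns (row group read, fully-matched flag the consumer applied to it).
fn run_with_peek(plan: Vec<RgEntry>, fully_pruned: &[usize]) -> Vec<(usize, bool)> {
    let mut decoder = ToyDecoder {
        remaining: plan.iter().map(|e| e.index).collect(),
        fully_pruned: fully_pruned.to_vec(),
    };
    let mut rg_plan: VecDeque<RgEntry> = plan.into();
    let mut seen = Vec::new();
    while let Some(next) = decoder.peek_next_row_group() {
        // Discard plan entries for row groups the decoder skipped.
        while rg_plan.front().map_or(false, |e| e.index != next) {
            rg_plan.pop_front();
        }
        let entry = rg_plan.pop_front().expect("plan covers every row group");
        let actual = decoder.try_next_reader().expect("peek returned Some");
        assert_eq!(actual, entry.index);
        seen.push((entry.index, entry.fully_matched));
    }
    seen
}

fn main() {
    let plan = vec![
        RgEntry { index: 1, fully_matched: false },
        RgEntry { index: 2, fully_matched: true },
        RgEntry { index: 3, fully_matched: false },
    ];
    for (idx, fully_matched) in run_with_peek(plan, &[1]) {
        println!("RG {}: fully_matched = {}", idx, fully_matched);
    }
}
```

With the slice-returning alternative, the consumer could do the same reconciliation by comparing the head of its queue against the first element of the returned slice.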

## Why this matters

This isn't specific to DataFusion: any push-decoder consumer that needs per-RG bookkeeping aligned with what the decoder is actually doing will run into this. The information already exists inside `RemainingRowGroups`; there is just no public accessor.

Happy to send a PR if there's interest.
