parquet reader fails to read files with more than 32767 row groups when RowGroup.ordinal is absent #10129

Description

@jerryway42

Describe the bug

The parquet crate fails to read metadata for Parquet files with more than 32767 row groups when RowGroup.ordinal is absent.

The failure happens before decoding any data pages, inside ParquetRecordBatchReaderBuilder::try_new(...).

Error:

Parquet error: Row group ordinal 32768 exceeds i16 max value

The same files are readable by PyArrow. Since RowGroup.ordinal is optional Parquet metadata, the reader should not fail solely because it cannot synthesize an optional i16 ordinal for row groups beyond i16::MAX.
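The value in the error message is the first zero-based index that no longer fits in a signed 16-bit integer. Below is a minimal standalone sketch of that boundary using only the standard library. The helper name `synthetic_ordinal` is hypothetical and not part of the parquet crate, and the `i32` index type is an assumption about the Thrift list size.

```rust
// In the prelude from the 2021 edition on; imported explicitly for older editions.
use std::convert::TryFrom;

/// Hypothetical helper: try to represent a zero-based row group index as an
/// i16, mirroring the `try_into()` conversion that produces the error above.
fn synthetic_ordinal(index: i32) -> Option<i16> {
    i16::try_from(index).ok()
}

fn main() {
    // i16::MAX is 32767, so index 32767 is the last one that fits.
    assert_eq!(synthetic_ordinal(32_767), Some(i16::MAX));
    // Index 32768 cannot be represented, which matches the reported error.
    assert_eq!(synthetic_ordinal(32_768), None);
    println!("largest representable ordinal: {}", i16::MAX);
}
```

Read literally, the loop quoted in the next section would accept a file with exactly 32768 row groups (indices 0 through 32767) and fail from 32769 row groups onward. That is an inference from the snippet, not something verified against the crate.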

Current main still appears to contain this logic in parquet/src/file/metadata/thrift/mod.rs:

for ordinal in 0..list_ident.size {
    let ordinal: i16 = ordinal.try_into().map_err(|_| {
        ParquetError::General(format!(
            "Row group ordinal {ordinal} exceeds i16 max value",
        ))
    })?;
    let rg = read_row_group(&mut prot, schema_descr, options)?;
    rg_vec.push(assigner.ensure(ordinal, rg)?);
}

To Reproduce

Using:

arrow = "57"
parquet = "57"

Code:

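use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("transactions.parquet")?;
    // Fails here, while parsing the footer metadata, before any data page is decoded.
    let _builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    Ok(())
}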

This fails with:

Row group ordinal 32768 exceeds i16 max value

PyArrow can read the same file metadata:

import pyarrow.parquet as pq

pf = pq.ParquetFile("transactions.parquet")
print(pf.metadata.num_rows)
print(pf.metadata.num_row_groups)

Observed examples:

num_rows=26,303,646  num_row_groups=37,638
num_rows=31,308,590  num_row_groups=43,990
num_rows=27,487,443  num_row_groups=39,870
num_rows=25,255,685  num_row_groups=34,599

The row groups are unusually small, but the files are readable by PyArrow.
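To put a number on "unusually small", the average row group size can be computed from the figures above. This is only arithmetic on the reported counts; the function name `avg_rows_per_group` is made up for illustration.

```rust
/// Average rows per row group, rounded down (integer division).
fn avg_rows_per_group(num_rows: u64, num_row_groups: u64) -> u64 {
    num_rows / num_row_groups
}

fn main() {
    // (num_rows, num_row_groups) pairs copied from the observed examples.
    let files: [(u64, u64); 4] = [
        (26_303_646, 37_638),
        (31_308_590, 43_990),
        (27_487_443, 39_870),
        (25_255_685, 34_599),
    ];
    for &(rows, groups) in files.iter() {
        println!(
            "{} rows / {} row groups = about {} rows per group",
            rows,
            groups,
            avg_rows_per_group(rows, groups)
        );
    }
}
```

All four files average roughly 700 rows per row group.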

Expected behavior

ParquetRecordBatchReaderBuilder::try_new(...) should successfully read metadata for Parquet files with more than 32767 row groups.

If RowGroup.ordinal is absent, the reader should avoid failing just because a synthetic ordinal cannot fit in i16. It could leave RowGroupMetaData::ordinal as None for such row groups, or avoid assigning synthetic ordinals on the read path unless semantically required.

Additional context

This seems to be caused by converting the row group index to i16 before reading each row group. Since RowGroup.ordinal is optional metadata, this check prevents reading otherwise valid files with a large number of row groups.

A possible fix would be to read the row group first, then only assign/check ordinal when it is present or representable.
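The idea can be sketched in isolation. The code below is a standalone model, not a patch against the crate: `RowGroupMeta`, `assign_ordinal`, and `read_row_groups` are hypothetical stand-ins, and it assumes that the only purpose of the ordinal step is to fill in a missing value. Whatever else the crate's `assigner.ensure` does is not modeled here.

```rust
// In the prelude from the 2021 edition on; imported explicitly for older editions.
use std::convert::TryFrom;

/// Hypothetical stand-in for the crate's row group metadata; only the field
/// relevant to this issue is modeled.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupMeta {
    ordinal: Option<i16>,
}

/// Fill in a synthetic ordinal only when the file did not supply one and the
/// index fits in an i16; otherwise leave the field exactly as it was read.
fn assign_ordinal(index: usize, mut rg: RowGroupMeta) -> RowGroupMeta {
    if rg.ordinal.is_none() {
        rg.ordinal = i16::try_from(index).ok();
    }
    rg
}

/// Stand-in for the read loop: each row group is taken as already read, then
/// the ordinal is assigned on a best-effort basis, so a large count never errors.
fn read_row_groups(raw: Vec<RowGroupMeta>) -> Vec<RowGroupMeta> {
    raw.into_iter()
        .enumerate()
        .map(|(index, rg)| assign_ordinal(index, rg))
        .collect()
}

fn main() {
    // 40,000 row groups with no ordinal, similar in scale to the reported files.
    let raw = vec![RowGroupMeta { ordinal: None }; 40_000];
    let groups = read_row_groups(raw);
    assert_eq!(groups[32_767].ordinal, Some(i16::MAX));
    assert_eq!(groups[32_768].ordinal, None);
    println!("read {} row groups without error", groups.len());
}
```

With this shape, row groups past index 32767 simply keep `None`, which matches the "leave RowGroupMetaData::ordinal as None" option under Expected behavior. Whether downstream code tolerates a missing ordinal would still need to be checked in the crate.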
