Describe the bug
parquet fails to read metadata for Parquet files with more than 32767 row groups when RowGroup.ordinal is absent.
The failure happens before decoding any data pages, inside ParquetRecordBatchReaderBuilder::try_new(...).
Error:
Parquet error: Row group ordinal 32768 exceeds i16 max value
The same files are readable by PyArrow. Since RowGroup.ordinal is optional Parquet metadata, the reader should not fail solely because it cannot synthesize an optional i16 ordinal for row groups beyond i16::MAX.
Current main still appears to contain this logic in parquet/src/file/metadata/thrift/mod.rs:
for ordinal in 0..list_ident.size {
    let ordinal: i16 = ordinal.try_into().map_err(|_| {
        ParquetError::General(format!(
            "Row group ordinal {ordinal} exceeds i16 max value",
        ))
    })?;
    let rg = read_row_group(&mut prot, schema_descr, options)?;
    rg_vec.push(assigner.ensure(ordinal, rg)?);
}
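The boundary can be shown in isolation. The following is a minimal sketch, not code from the crate: `synthesize_ordinal` is a hypothetical helper, and it assumes the loop index is an `i32`. It mirrors the conversion above and produces the same message for index 32768.

```rust
/// Hypothetical helper mirroring the conversion in the loop above.
/// Assumption: the row group index arrives as an i32.
fn synthesize_ordinal(index: i32) -> Result<i16, String> {
    // i16::try_from fails for any value above i16::MAX (32767).
    i16::try_from(index)
        .map_err(|_| format!("Row group ordinal {index} exceeds i16 max value"))
}
```

Index 32767 converts successfully, while index 32768 (the 32769th row group) is the first to produce the error, before that row group is read.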
To Reproduce
Using:
arrow = "57"
parquet = "57"
Code:
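use std::fs::File;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
fn main() -> Result<(), Box<dyn std::error::Error>> {
    let file = File::open("transactions.parquet")?;
    // The error is returned here, while the footer metadata is being read.
    let _builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
    Ok(())
}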
This fails with:
Row group ordinal 32768 exceeds i16 max value
PyArrow can read the same file metadata:
import pyarrow.parquet as pq
pf = pq.ParquetFile("transactions.parquet")
print(pf.metadata.num_rows)
print(pf.metadata.num_row_groups)
Observed examples:
num_rows=26,303,646 num_row_groups=37,638
num_rows=31,308,590 num_row_groups=43,990
num_rows=27,487,443 num_row_groups=39,870
num_rows=25,255,685 num_row_groups=34,599
The row groups are unusually small, but the files are readable by PyArrow.
Expected behavior
ParquetRecordBatchReaderBuilder::try_new(...) should successfully read metadata for Parquet files with more than 32767 row groups.
If RowGroup.ordinal is absent, the reader should avoid failing just because a synthetic ordinal cannot fit in i16. It could leave RowGroupMetaData::ordinal as None for such row groups, or avoid assigning synthetic ordinals on the read path unless semantically required.
Additional context
This seems to be caused by converting the row group index to i16 before reading each row group. Since RowGroup.ordinal is optional metadata, this check prevents reading otherwise valid files with a large number of row groups.
A possible fix would be to read the row group first, then only assign/check ordinal when it is present or representable.
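A rough sketch of that idea, under stated assumptions: `RowGroup`, `with_lenient_ordinal`, and `read_all` are hypothetical stand-ins, not the crate's types or functions, and the behavior of the existing `assigner.ensure(...)` call is not modeled here. The sketch keeps an ordinal that the file provides and only synthesizes one when the index fits in i16.

```rust
/// Hypothetical stand-in for decoded row group metadata.
#[derive(Debug, Clone, PartialEq)]
struct RowGroup {
    num_rows: i64,
    /// None when the writer omitted the optional `ordinal` field.
    ordinal: Option<i16>,
}

/// Hypothetical helper: keep an ordinal read from the file; otherwise
/// synthesize one from the index only if it is representable as i16.
fn with_lenient_ordinal(index: usize, mut rg: RowGroup) -> RowGroup {
    if rg.ordinal.is_none() {
        // try_from fails for index > 32767; `.ok()` turns that into None
        // instead of an error, so the row group is still returned.
        rg.ordinal = i16::try_from(index).ok();
    }
    rg
}

/// Apply the lenient assignment to row groups that were already decoded.
fn read_all(row_groups: Vec<RowGroup>) -> Vec<RowGroup> {
    row_groups
        .into_iter()
        .enumerate()
        .map(|(index, rg)| with_lenient_ordinal(index, rg))
        .collect()
}
```

With this shape, a file with 40,000 row groups and no stored ordinals would yield ordinals for indices 0 through 32767 and None for the rest, rather than failing the whole metadata read.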