parquet reader fails to read files with more than 32767 row groups when `RowGroup.ordinal` is absent #10129

## Describe the bug

  `parquet` fails to read metadata for Parquet files with more than `32767` row groups when `RowGroup.ordinal` is absent.

  The failure happens before decoding any data pages, inside `ParquetRecordBatchReaderBuilder::try_new(...)`.

  Error:

  ```text
  Parquet error: Row group ordinal 32768 exceeds i16 max value
  ```

  The same files are readable by PyArrow. Since `RowGroup.ordinal` is optional Parquet metadata, the reader should not fail solely because it cannot synthesize an `i16` ordinal for row groups beyond `i16::MAX`.

  Current `main` still appears to contain this logic in `parquet/src/file/metadata/thrift/mod.rs`:

  ```rust
  for ordinal in 0..list_ident.size {
      let ordinal: i16 = ordinal.try_into().map_err(|_| {
          ParquetError::General(format!(
              "Row group ordinal {ordinal} exceeds i16 max value",
          ))
      })?;
      let rg = read_row_group(&mut prot, schema_descr, options)?;
      rg_vec.push(assigner.ensure(ordinal, rg)?);
  }
  ```
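
To make the boundary concrete, here is a minimal standalone sketch of the same conversion. `synthesize_ordinal` is a hypothetical helper, not part of the `parquet` crate, and the `i32` index type is an assumption about `list_ident.size`. It only shows that index `32767` (`i16::MAX`) converts while index `32768`, the 32,769th row group, is the first to fail:

```rust
/// Hypothetical helper mirroring the `try_into` conversion in the loop above.
/// Assumption: the loop index is an `i32`.
fn synthesize_ordinal(index: i32) -> Result<i16, String> {
    // `i16::try_from` fails for any value above `i16::MAX` (32767).
    i16::try_from(index)
        .map_err(|_| format!("Row group ordinal {index} exceeds i16 max value"))
}
```

Because the conversion runs before `read_row_group`, the error is raised whether or not the file actually stores an ordinal for that row group.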

  ## To Reproduce

  Using:

  ```toml
  arrow = "57"
  parquet = "57"
  ```

  Code:

  ```rust
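  use std::fs::File;
  use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

  fn main() -> Result<(), Box<dyn std::error::Error>> {
      let file = File::open("transactions.parquet")?;
      // Fails here, while reading the footer metadata, before any data pages are decoded.
      let _builder = ParquetRecordBatchReaderBuilder::try_new(file)?;
      Ok(())
  }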
  ```

  This fails with:

  ```text
  Row group ordinal 32768 exceeds i16 max value
  ```

  PyArrow can read the same file metadata:

  ```python
  import pyarrow.parquet as pq

  pf = pq.ParquetFile("transactions.parquet")
  print(pf.metadata.num_rows)
  print(pf.metadata.num_row_groups)
  ```

  Observed examples:

  ```text
  num_rows=26,303,646  num_row_groups=37,638
  num_rows=31,308,590  num_row_groups=43,990
  num_rows=27,487,443  num_row_groups=39,870
  num_rows=25,255,685  num_row_groups=34,599
  ```

  The row groups are unusually small, but the files are readable by PyArrow.

  ## Expected behavior

  `ParquetRecordBatchReaderBuilder::try_new(...)` should successfully read metadata for Parquet files with more than `32767` row groups.

  If `RowGroup.ordinal` is absent, the reader should avoid failing just because a synthetic ordinal cannot fit in `i16`. It could leave `RowGroupMetaData::ordinal` as `None` for such row groups, or avoid assigning synthetic ordinals on the read path unless semantically required.

  ## Additional context

  This seems to be caused by converting the row group index to `i16` before reading each row group. Since `RowGroup.ordinal` is optional metadata, this check prevents reading otherwise valid files with a large number of row groups.

  A possible fix would be to read the row group first, then only assign/check ordinal when it is present or representable.
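
A rough sketch of that idea, using stand-in types rather than the real reader internals: `RowGroupSketch` and `fill_ordinal` are hypothetical names, and the actual semantics of `assigner.ensure(...)` are not reproduced here. An ordinal already present in the file is kept, and a synthetic one is assigned only when the index fits in `i16`:

```rust
/// Stand-in for the real row group metadata; only the ordinal matters here.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupSketch {
    ordinal: Option<i16>,
}

/// Hypothetical replacement for the ordinal step, run after the row group
/// has been read. Never returns an error for a large index.
fn fill_ordinal(index: usize, mut rg: RowGroupSketch) -> RowGroupSketch {
    if rg.ordinal.is_none() {
        // `ok()` turns an out-of-range index into `None` instead of an error.
        rg.ordinal = i16::try_from(index).ok();
    }
    rg
}
```

With this shape, row groups past index `32767` simply end up with `ordinal == None`, which matches the "leave it as `None`" option under Expected behavior. Whether other code paths rely on the ordinal always being set would need to be checked separately.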
