
parquet files with dumb repetition/definition levels are not readable #417

@pprettysimpple

Description

@pprettysimpple

Describe the bug
Some parquet files are not readable by the current parquet reader. The Parquet spec states (here) that repetition/definition levels may be omitted after the data page header when they can be inferred from the schema.
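
For context, the rule the spec describes can be sketched in Python (a simplified illustration of the Dremel level rule, not OpenZL code): the max definition level counts optional/repeated ancestors on a column's path and the max repetition level counts repeated ancestors, so for a flat required column both are 0 and a writer may omit the level runs entirely.

```python
def max_levels(path):
    """path: repetition types from root to leaf, e.g. ["required"].

    Simplified rule: every optional or repeated field on the path raises
    the max definition level; every repeated field raises the max
    repetition level. When both are 0, level runs can be omitted.
    """
    max_def = sum(1 for rep in path if rep in ("optional", "repeated"))
    max_rep = sum(1 for rep in path if rep == "repeated")
    return max_def, max_rep

print(max_levels(["required"]))              # (0, 0) -> levels omitted
print(max_levels(["optional"]))              # (1, 0) -> definition levels only
print(max_levels(["optional", "repeated"]))  # (2, 1) -> both present
```

The reproducer below builds exactly the first case: flat, non-nullable uint64 columns.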

To Reproduce

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n = 100_000
v = np.full(n, np.uint64(0xFFFFFFFF), dtype=np.uint64)
arrays = [pa.array(v, type=pa.uint64()) for _ in range(10)]
schema = pa.schema([pa.field(str(i), pa.uint64(), nullable=False) for i in range(10)])
table = pa.Table.from_arrays(arrays, schema=schema)
pq.write_table(table, "example.parquet", compression=None, use_dictionary=False)

# make_canonical_parquet --input example.parquet --output canonical/
# zli compress -f canonical/example.parquet.canonical --profile parquet -o outc.zl

Expected behavior
The file produced by the reproducer should parse. I encountered this bug while evaluating this project for use at my job.

Screenshots and charts
Two hexdumps of parquet files: the first is a real parquet file I originally tried to compress; the second is from the reproducer plus the make-canonical CLI tool. Both screenshots show the cursor hovering over the start of the data section. (Not sure you actually need them.)

[screenshots omitted]

Also, here is the output when parsing the reproducer's file:

$ ./zli compress -f canonical/example.parquet.canonical --profile parquet -o outc.zl
OpenZL Library Exception:
        OpenZL error code: 55
OpenZL error string: Input does not respect conditions for this node
OpenZL error context: Code: Input does not respect conditions for this node
Message: Check `getRemaining(lexer) < size' failed where:
        lhs = (unknown) 8800466
        rhs = (unknown) 4294967295

Stack Trace:
        #0 lexPageHeader (/home/kandriianov/tpsrc/openzl/custom_parsers/parquet/parquet_lexer.cpp:136): Check `getRemaining(lexer) < size' failed where:
        lhs = (unknown) 8800466
        rhs = (unknown) 4294967295

        #1 ZL_ParquetLexer_lex (/home/kandriianov/tpsrc/openzl/custom_parsers/parquet/parquet_lexer.cpp:357): Forwarding error: 
        #2 parquetSegmenterInner (/home/kandriianov/tpsrc/openzl/custom_parsers/parquet/parquet_graph.c:211): Forwarding error: 
        #3 CCTX_startCompression (/home/kandriianov/tpsrc/openzl/src/openzl/compress/cctx.c:1400): Forwarding error: 
        #4 CCTX_compressInputs_withGraphSet_stage2 (/home/kandriianov/tpsrc/openzl/src/openzl/compress/compress2.c:119): Forwarding error: 
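
The rhs = 4294967295 in the failed check lines up with the column data: every value in the reproducer is 0xFFFFFFFF stored as a little-endian uint64, so a lexer that unconditionally reads the first four bytes of page data as a levels-length prefix gets exactly that number. A minimal stdlib-only illustration of the misread (my reconstruction, not OpenZL code):

```python
import struct

# The reproducer writes uint64 values of 0xFFFFFFFF; uncompressed PLAIN
# encoding lays them out little-endian, 8 bytes each.
page_data = struct.pack("<Q", 0xFFFFFFFF) * 3

# A lexer that unconditionally expects "4-byte length + level bytes"
# reads the first value's low word as the length prefix.
bogus_size = struct.unpack_from("<I", page_data, 0)[0]
print(bogus_size)  # 4294967295 -> far larger than the bytes remaining
```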

Desktop (please complete the following information):

  • Source commit: 87eba73c9f2c860a33524f144720319e546cce73
  • OS: Ubuntu 22.04
  • Compiler: GCC 13.1.0
  • Build flags: cmake .. -DOPENZL_BUILD_PARQUET_TOOLS=ON -DOPENZL_BUILD_MODE=dev-nosan -DCMAKE_BUILD_TYPE=Debug -GNinja
  • Build system: Ninja
  • Other relevant hardware specs: Nah

Additional context
Sooo, I looked at how this parser is written and made a hack that works for me.

diff --git a/custom_parsers/parquet/parquet_lexer.cpp b/custom_parsers/parquet/parquet_lexer.cpp
index fa3fad8..05c3928 100644
--- a/custom_parsers/parquet/parquet_lexer.cpp
+++ b/custom_parsers/parquet/parquet_lexer.cpp
@@ -131,18 +131,18 @@ ZL_Report lexPageHeader(
                 node_invalid_input);
         // Repetition and Definition levels
         ZL_ERR_IF_LT(getRemaining(lexer), 4, node_invalid_input);
-        auto size = ZL_readLE32(lexer->currPtr);
-        advance(4);
-        ZL_ERR_IF_LT(getRemaining(lexer), size, node_invalid_input);
-        advance(size);
-
-        // Adjust the expected data page bytes
-        ZL_ERR_IF_LT(
-
-                (size_t)lexer->pageHeader->numBytes,
-                size + 4,
-                node_invalid_input);
-        lexer->pageHeader->numBytes -= size + 4;
+        // auto size = ZL_readLE32(lexer->currPtr);
+        // advance(4);
+        // ZL_ERR_IF_LT(getRemaining(lexer), size, node_invalid_input);
+        // advance(size);
+
+        // // Adjust the expected data page bytes
+        // ZL_ERR_IF_LT(
+
+        //         (size_t)lexer->pageHeader->numBytes,
+        //         size + 4,
+        //         node_invalid_input);
+        // lexer->pageHeader->numBytes -= size + 4;
     }
 
     lexer->chunkLexed += out->size;

But you know, that's a really bad solution: it just skips the level handling entirely. There should be code that actually decides, from the schema, whether level runs are present and parses them accordingly...
