
parquet files with dumb repetition/definition levels are not readable #417

@pprettysimpple

Description

@pprettysimpple

Describe the bug
Some parquet files are not readable by the current parquet reader. The Parquet spec states (here) that repetition/definition levels may be omitted after the data page header when they can be inferred from the schema.
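
For context, the rule the spec describes can be sketched in Python (a simplified illustration of the Dremel level rule, not OpenZL code): the max definition level counts optional/repeated ancestors on a column's path and the max repetition level counts repeated ancestors, so for a flat required column both are 0 and a writer may omit the level runs entirely.

```python
def max_levels(path):
    """path: repetition types from root to leaf, e.g. ["required"].

    Simplified rule: every optional or repeated field on the path raises
    the max definition level; every repeated field raises the max
    repetition level. When both are 0, level runs can be omitted.
    """
    max_def = sum(1 for rep in path if rep in ("optional", "repeated"))
    max_rep = sum(1 for rep in path if rep == "repeated")
    return max_def, max_rep

print(max_levels(["required"]))              # (0, 0) -> levels omitted
print(max_levels(["optional"]))              # (1, 0) -> definition levels only
print(max_levels(["optional", "repeated"]))  # (2, 1) -> both present
```

The reproducer below builds exactly the first case: flat, non-nullable uint64 columns.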

To Reproduce

import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

n = 100_000
v = np.full(n, np.uint64(0xFFFFFFFF), dtype=np.uint64)
arrays = [pa.array(v, type=pa.uint64()) for _ in range(10)]
schema = pa.schema([pa.field(str(i), pa.uint64(), nullable=False) for i in range(10)])
table = pa.Table.from_arrays(arrays, schema=schema)
pq.write_table(table, "example.parquet", compression=None, use_dictionary=False)

# make_canonical_parquet --input example.parquet --output canonical/
# zli compress -f canonical/example.parquet.canonical --profile parquet -o outc.zl

Expected behavior
The file produced by the reproducer should parse. I encountered this bug while evaluating this project for use at my job.

Screenshots and charts
Two hexdumps of parquet files: the first is a real parquet file I originally tried to compress; the second is from the reproducer plus the make-canonical CLI tool. Both screenshots show the cursor hovering over the start of the data section. (Not sure you actually need them.)

[screenshots omitted]

Also, here is the output when parsing the reproducer's file:

$ ./zli compress -f canonical/example.parquet.canonical --profile parquet -o outc.zl
OpenZL Library Exception:
        OpenZL error code: 55
OpenZL error string: Input does not respect conditions for this node
OpenZL error context: Code: Input does not respect conditions for this node
Message: Check `getRemaining(lexer) < size' failed where:
        lhs = (unknown) 8800466
        rhs = (unknown) 4294967295

Stack Trace:
        #0 lexPageHeader (/home/kandriianov/tpsrc/openzl/custom_parsers/parquet/parquet_lexer.cpp:136): Check `getRemaining(lexer) < size' failed where:
        lhs = (unknown) 8800466
        rhs = (unknown) 4294967295

        #1 ZL_ParquetLexer_lex (/home/kandriianov/tpsrc/openzl/custom_parsers/parquet/parquet_lexer.cpp:357): Forwarding error: 
        #2 parquetSegmenterInner (/home/kandriianov/tpsrc/openzl/custom_parsers/parquet/parquet_graph.c:211): Forwarding error: 
        #3 CCTX_startCompression (/home/kandriianov/tpsrc/openzl/src/openzl/compress/cctx.c:1400): Forwarding error: 
        #4 CCTX_compressInputs_withGraphSet_stage2 (/home/kandriianov/tpsrc/openzl/src/openzl/compress/compress2.c:119): Forwarding error: 
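
The rhs = 4294967295 in the failed check lines up with the column data: every value in the reproducer is 0xFFFFFFFF stored as a little-endian uint64, so a lexer that unconditionally reads the first four bytes of page data as a levels-length prefix gets exactly that number. A minimal stdlib-only illustration of the misread (my reconstruction, not OpenZL code):

```python
import struct

# The reproducer writes uint64 values of 0xFFFFFFFF; uncompressed PLAIN
# encoding lays them out little-endian, 8 bytes each.
page_data = struct.pack("<Q", 0xFFFFFFFF) * 3

# A lexer that unconditionally expects "4-byte length + level bytes"
# reads the first value's low word as the length prefix.
bogus_size = struct.unpack_from("<I", page_data, 0)[0]
print(bogus_size)  # 4294967295 -> far larger than the bytes remaining
```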

Desktop (please complete the following information):

  • Source commit: 87eba73c9f2c860a33524f144720319e546cce73
  • OS: Ubuntu 22.04
  • Compiler: GCC 13.1.0
  • Build flags: cmake .. -DOPENZL_BUILD_PARQUET_TOOLS=ON -DOPENZL_BUILD_MODE=dev-nosan -DCMAKE_BUILD_TYPE=Debug -GNinja
  • Build system: Ninja
  • Other relevant hardware specs: Nah

Additional context
Sooo, I looked at how this parser is written and made a hack that works for me.

diff --git a/custom_parsers/parquet/parquet_lexer.cpp b/custom_parsers/parquet/parquet_lexer.cpp
index fa3fad8..05c3928 100644
--- a/custom_parsers/parquet/parquet_lexer.cpp
+++ b/custom_parsers/parquet/parquet_lexer.cpp
@@ -131,18 +131,18 @@ ZL_Report lexPageHeader(
                 node_invalid_input);
         // Repetition and Definition levels
         ZL_ERR_IF_LT(getRemaining(lexer), 4, node_invalid_input);
-        auto size = ZL_readLE32(lexer->currPtr);
-        advance(4);
-        ZL_ERR_IF_LT(getRemaining(lexer), size, node_invalid_input);
-        advance(size);
-
-        // Adjust the expected data page bytes
-        ZL_ERR_IF_LT(
-
-                (size_t)lexer->pageHeader->numBytes,
-                size + 4,
-                node_invalid_input);
-        lexer->pageHeader->numBytes -= size + 4;
+        // auto size = ZL_readLE32(lexer->currPtr);
+        // advance(4);
+        // ZL_ERR_IF_LT(getRemaining(lexer), size, node_invalid_input);
+        // advance(size);
+
+        // // Adjust the expected data page bytes
+        // ZL_ERR_IF_LT(
+
+        //         (size_t)lexer->pageHeader->numBytes,
+        //         size + 4,
+        //         node_invalid_input);
+        // lexer->pageHeader->numBytes -= size + 4;
     }
 
     lexer->chunkLexed += out->size;

But you know, that's a really bad solution: it just skips the level handling entirely. There should be code that actually decides, from the schema, whether level runs are present and parses them accordingly...
