Describe the bug
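Some Parquet files cannot be read by the current Parquet lexer. The Parquet spec states (here) that repetition/definition levels are not written after the data page header when they are already implied by the schema: definition levels are skipped for required columns, and repetition levels are skipped for non-nested columns.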
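To make the rule concrete, here is a minimal sketch of how the maximum levels follow from the schema. The `max_levels` helper and its list-of-strings path representation are made up for illustration; they are not part of OpenZL or pyarrow.

```python
# Sketch: derive maximum repetition/definition levels from a column path.
# Each entry is the repetition type of one schema node on the path from the
# root (exclusive) down to the leaf: "required", "optional", or "repeated".
# max_levels is a hypothetical helper, not an OpenZL or pyarrow API.

def max_levels(path):
    """Return (max_repetition_level, max_definition_level) for a column path."""
    # Only repeated nodes add a repetition level.
    max_rep = sum(1 for kind in path if kind == "repeated")
    # Every non-required node (optional or repeated) adds a definition level.
    max_def = sum(1 for kind in path if kind != "required")
    return max_rep, max_def

# A flat, non-nullable column, like the uint64 columns in the reproducer.
flat_required = max_levels(["required"])
# A flat, nullable column.
flat_optional = max_levels(["optional"])
print(flat_required, flat_optional)
```

When a maximum level is 0, the corresponding level block carries no information and is omitted from the page, which is the case the lexer currently does not handle.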
To Reproduce
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
n = 100_000
v = np.full(n, np.uint64(0xFFFFFFFF), dtype=np.uint64)
arrays = [pa.array(v, type=pa.uint64()) for _ in range(10)]
schema = pa.schema([pa.field(str(i), pa.uint64(), nullable=False) for i in range(10)])
table = pa.Table.from_arrays(arrays, schema=schema)
pq.write_table(table, "example.parquet", compression=None, use_dictionary=False)
# make_canonical_parquet --input example.parquet --output canonical/
# zli compress -f canonical/example.parquet.canonical --profile parquet -o outc.zl
Expected behavior
The file produced by the reproducer should be parsed successfully. I ran into this bug while evaluating this project for use at my job.
Screenshots and charts
Two hexdumps of Parquet files. The first one is a real Parquet file I originally tried to compress. The second one comes from the reproducer + the make_canonical_parquet CLI tool. In both screenshots the mouse is hovering over the start of the data. (Not sure if you actually need them.)
Also, here is the output when parsing the reproducer file:
$ ./zli compress -f canonical/example.parquet.canonical --profile parquet -o outc.zl
OpenZL Library Exception:
OpenZL error code: 55
OpenZL error string: Input does not respect conditions for this node
OpenZL error context: Code: Input does not respect conditions for this node
Message: Check `getRemaining(lexer) < size' failed where:
lhs = (unknown) 8800466
rhs = (unknown) 4294967295
Stack Trace:
#0 lexPageHeader (/home/kandriianov/tpsrc/openzl/custom_parsers/parquet/parquet_lexer.cpp:136): Check `getRemaining(lexer) < size' failed where:
lhs = (unknown) 8800466
rhs = (unknown) 4294967295
#1 ZL_ParquetLexer_lex (/home/kandriianov/tpsrc/openzl/custom_parsers/parquet/parquet_lexer.cpp:357): Forwarding error:
#2 parquetSegmenterInner (/home/kandriianov/tpsrc/openzl/custom_parsers/parquet/parquet_graph.c:211): Forwarding error:
#3 CCTX_startCompression (/home/kandriianov/tpsrc/openzl/src/openzl/compress/cctx.c:1400): Forwarding error:
#4 CCTX_compressInputs_withGraphSet_stage2 (/home/kandriianov/tpsrc/openzl/src/openzl/compress/compress2.c:119): Forwarding error:
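The rhs value in the failed check looks like the first data value being misread as a level length. A small sketch of that reading, assuming the column values are PLAIN-encoded as 8-byte little-endian integers (the reproducer disables dictionary encoding):

```python
import struct

# The reproducer fills every column with the uint64 value 0xFFFFFFFF.
value = 0xFFFFFFFF

# Assumption: with use_dictionary=False the values are PLAIN-encoded,
# i.e. stored as 8-byte little-endian integers.
first_value_bytes = struct.pack("<Q", value)

# The lexer reads a 4-byte little-endian length where it expects the
# repetition/definition levels. Here those 4 bytes belong to the first value.
misread_size = struct.unpack_from("<I", first_value_bytes, 0)[0]
print(misread_size)  # 4294967295, the rhs reported in the error above
```

So the lexer appears to treat the start of the column data as the length prefix of a level block that was never written.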
Desktop (please complete the following information):
- Source commit: 87eba73c9f2c860a33524f144720319e546cce73
- OS: Ubuntu
- Version: 22.04
- Compiler: GCC 13.1.0
- Flags: cmake .. -DOPENZL_BUILD_PARQUET_TOOLS=ON -DOPENZL_BUILD_MODE=dev-nosan -DCMAKE_BUILD_TYPE=Debug -GNinja
- Other relevant hardware specs: None
- Build system: Ninja
Additional context
So, I looked at how this lexer is written and made a hack that works for me:
diff --git a/custom_parsers/parquet/parquet_lexer.cpp b/custom_parsers/parquet/parquet_lexer.cpp
index fa3fad8..05c3928 100644
--- a/custom_parsers/parquet/parquet_lexer.cpp
+++ b/custom_parsers/parquet/parquet_lexer.cpp
@@ -131,18 +131,18 @@ ZL_Report lexPageHeader(
node_invalid_input);
// Repetition and Definition levels
ZL_ERR_IF_LT(getRemaining(lexer), 4, node_invalid_input);
- auto size = ZL_readLE32(lexer->currPtr);
- advance(4);
- ZL_ERR_IF_LT(getRemaining(lexer), size, node_invalid_input);
- advance(size);
-
- // Adjust the expected data page bytes
- ZL_ERR_IF_LT(
-
- (size_t)lexer->pageHeader->numBytes,
- size + 4,
- node_invalid_input);
- lexer->pageHeader->numBytes -= size + 4;
+ // auto size = ZL_readLE32(lexer->currPtr);
+ // advance(4);
+ // ZL_ERR_IF_LT(getRemaining(lexer), size, node_invalid_input);
+ // advance(size);
+
+ // // Adjust the expected data page bytes
+ // ZL_ERR_IF_LT(
+
+ // (size_t)lexer->pageHeader->numBytes,
+ // size + 4,
+ // node_invalid_input);
+ // lexer->pageHeader->numBytes -= size + 4;
}
lexer->chunkLexed += out->size;
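But that's a pretty bad solution: it skips the level parsing unconditionally, so it would break files whose pages do contain repetition/definition levels. There should be some code that decides from the schema whether the levels are present and parses them only in that case.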
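A rough sketch of what a proper fix could look like, written in Python for brevity rather than as a patch to parquet_lexer.cpp. It assumes DataPage V1 with RLE-encoded levels, where each level block that is present starts with a 4-byte little-endian length. `skip_v1_levels` and its parameters are hypothetical names, and the maximum levels would have to come from the file's schema.

```python
import struct

def skip_v1_levels(page, offset, max_rep, max_def):
    """Return the offset of the first value byte in a DataPage V1 body.

    Hypothetical helper: assumes RLE-encoded levels, each block prefixed
    with a 4-byte little-endian length. Repetition levels come first.
    """
    for max_level in (max_rep, max_def):
        if max_level == 0:
            # Levels are implied by the schema, so nothing was written.
            continue
        if len(page) - offset < 4:
            raise ValueError("truncated level length prefix")
        (size,) = struct.unpack_from("<I", page, offset)
        offset += 4
        if len(page) - offset < size:
            raise ValueError("level block exceeds page size")
        offset += size
    return offset

values = struct.pack("<Q", 0xFFFFFFFF) * 2

# Required, non-nested column: no level blocks, values start at offset 0.
required_start = skip_v1_levels(values, 0, max_rep=0, max_def=0)

# Optional, non-nested column: a 2-byte definition level block comes first.
optional_page = struct.pack("<I", 2) + b"\x04\x01" + values
optional_start = skip_v1_levels(optional_page, 0, max_rep=0, max_def=1)
print(required_start, optional_start)
```

The point is only that the length prefix is read when the corresponding maximum level is non-zero, instead of always or never.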