GH-50216: [C++][Parquet] Add RleBitPackedToBitmapDecoder#50217
GH-50216: [C++][Parquet] Add RleBitPackedToBitmapDecoder#50217AntoinePrv wants to merge 21 commits into
Conversation
|
|
There was a problem hiding this comment.
Pull request overview
Adds a specialized path for decoding Parquet mixed RLE / BitPacked streams with bit_width=1 directly into Arrow bitmaps, aiming to speed up validity/definition-level decoding.
Changes:
- Added bitmap-oriented decoders for
RleRunandBitPackedRun, plusRleBitPackedToBitmapDecoderfor mixed streams. - Refactored
RleBitPackedParserto provideParseWithCallable()and adjustedRleBitPackedDecoderto use it. - Added new bit utilities (
Get32Bits,CopyBits) and comprehensive unit tests for the new bitmap decoders.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| cpp/src/arrow/util/rle_encoding_internal.h | Adds RleRun::value(), moves callable-based parsing into the parser, and updates RleBitPackedDecoder accordingly. |
| cpp/src/arrow/util/rle_bitmap_internal.h | Introduces bitmap span + run-to-bitmap decoders and the new RleBitPackedToBitmapDecoder. |
| cpp/src/arrow/util/rle_bitmap_test.cc | Adds tests covering alignment, chunking, reset, and truncated input scenarios for bitmap decoding. |
| cpp/src/arrow/util/CMakeLists.txt | Registers the new test file in the util test target. |
| cpp/src/arrow/util/bit_util.h | Adds Get32Bits and CopyBits helpers used by the new bitmap decoder implementation. |
| cpp/src/arrow/util/bit_util_test.cc | Adds unit tests for Get32Bits and CopyBits. |
| void Reset(const RunType& run) noexcept { | ||
| values_left_ = run.values_count(); | ||
| if (run.value_little_endian() == 0) { | ||
| value_pattern_ = uint8_t{0}; | ||
| } else { | ||
| value_pattern_ = uint8_t{0xFF}; | ||
| } | ||
| } |
|
@pitrou this is ready |
| rle_size_t batch_size) -> rle_size_t { | ||
| using ControlFlow = RleBitPackedParser::ControlFlow; | ||
|
|
||
| if (ARROW_PREDICT_FALSE(batch_size == 0 || exhausted())) { |
There was a problem hiding this comment.
Is there a particular reason this became necessary?
There was a problem hiding this comment.
Came from the Copilot review, which I found to be relevant. I may not be a possible path in practice in all of Arrow, but better safe than sorry.
pitrou
left a comment
There was a problem hiding this comment.
Thanks for the update @AntoinePrv !
| /// Make a vector of `size` pseudo-random bytes, deterministic for a given `seed`. | ||
| std::vector<uint8_t> MakeRandomBytes(size_t size, uint32_t seed = 56) { | ||
| std::vector<uint8_t> bytes(size); | ||
| std::minstd_rand gen(seed); |
There was a problem hiding this comment.
Hmm, the default_random_engine might give better randomness, though not sure it's useful here.
There was a problem hiding this comment.
I picked this one because it is small and cheap and I assume sufficient here.
| void Reset(const RunType& run) noexcept { | ||
| values_left_ = run.values_count(); | ||
| if (run.value_little_endian() == 0) { | ||
| value_pattern_ = uint8_t{0}; | ||
| } else { | ||
| value_pattern_ = uint8_t{0xFF}; | ||
| } | ||
| } |
|
@pitrou this is ready |
|
@AntoinePrv Note there is a compilation error on MSVC. |
|
@github-actions crossbow submit -g cpp |
pitrou
left a comment
There was a problem hiding this comment.
+1, but need to fix the MSVC-based CI build.
|
Revision: 6fc9785 Submitted crossbow builds: ursacomputing/crossbow @ actions-fb93408670 |
|
Other related failures:
|
Rationale for this change
Add a
RleBitPackedToBitmapDecodercapable of decoding a Parquet mixed Rle / BitPacked byte stream directly into a bitmap. This is an optimization to be used when target and source bit_width = 1 that will improve decoding nullable columns in a follow-up PR.What changes are included in this PR?
New classes, that reuse previous Runs and parser.
Their API is slighly different from the existing one as it fits a partilcular case.
RleRunToBitmapDecoderBitPackedRunToBitmapDecoderRleRunToBitmapDecoderAre these changes tested?
Yes.
Are there any user-facing changes?
No.