feat: add DELTA_BINARY_PACKED decoder#64
Conversation
Add decode_delta_binary_packed for the Parquet DELTA_BINARY_PACKED encoding (enum 5), pyarrow's default for INT32/INT64 columns. It parses the stream header (block size, miniblocks per block, total value count, zigzag first value), then each block's min delta and per-miniblock bit widths, unpacks the LSB-first bit-packed miniblocks, and reconstructs values as value[i] = value[i-1] + min_delta + unpacked[i]. Reconstruction wraps to the column type width and reinterprets as signed, which is required: deltas between extreme values overflow the fixed-width type and the encoder relies on two's-complement wraparound. It emits exactly the header's total value count, ignoring trailing padding in the final miniblock and the unused trailing miniblocks of the last block, and cross-checks that count against the caller-supplied num_values. A DeltaBinaryPackedStats result exposes the encoder's structural choices that the values alone don't reveal — block/miniblock geometry, first value, and each block's min delta and miniblock bit widths — mirroring the DecodeStats inspection surface of the RLE primitive. Validated against real pyarrow-emitted bytes: writing parquet files with column_encoding="DELTA_BINARY_PACKED" for INT32 and INT64, extracting the data-page value stream, decoding it, and asserting the values equal the original column, across consecutive, single-value, all-equal, negative/mixed, type-boundary (wrapping), and multi-block cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
clee704
left a comment
There was a problem hiding this comment.
Reviewed this draft with correctness as the primary focus, grounding against the DELTA_BINARY_PACKED spec and the actual emitted bytes.
The decode logic is spec-correct, and I couldn't break it. Verified:
- Header parse — block size / miniblocks-per-block / total value count as ULEB128, first value as zigzag-ULEB128; per-block zigzag min-delta; one bit-width byte per declared miniblock; LSB-first bit unpacking; reconstruction
value[i] = value[i-1] + min_delta + unpacked[i]; exacttotal_value_countemission with final-miniblock padding and unused trailing miniblocks ignored — all match the spec. - Integer width / overflow — masking to the type width and re-interpreting the sign bit gives correct two's-complement wraparound for both INT32 and INT64, including extreme deltas (
INT32_MAX→INT32_MIN, etc.). - Truncation / mismatch — header, bit-width-byte, miniblock-body, and "needs another block but stream ended" paths all raise
ValueError; thenum_valuesvs headertotal_value_countcross-check raises on disagreement.
Independent grounding:
- The full decoder suite passes (22 delta tests), and
decoders.pyis at 99% coverage — the single uncovered line (309) predates this PR. - I ran a 400-case fuzz cross-check: random INT32/INT64 arrays (sizes 1–1000; uniform, small-step, clustered, ramp, and boundary/extreme distributions that overflow the type and force multiple blocks) round-tripped through pyarrow's
DELTA_BINARY_PACKEDwriter and decoded exactly in every case. - The ground-truth test is meaningful:
assert page["encoding"] == "DELTA_BINARY_PACKED"rules out a dictionary/PLAIN-fallback false positive, so a passing test really is decoding a delta page.
One non-correctness, API-consistency point inline. Everything else checks out.
|
|
||
|
|
||
| def decode_delta_binary_packed( | ||
| data: bytes, num_values: int, parquet_type: str |
There was a problem hiding this comment.
Parameter order is inconsistent with the rest of the module (consistency, not correctness).
Every other public decoder here takes num_values as its last positional argument, with the encoding-specific parameter immediately before it:
decode_plain(data, parquet_type, num_values, type_length=None)decode_rle_bitpacked_hybrid(data, bit_width, num_values)decode_levels(data, max_level, num_values)decode_v1_level_block(data, offset, max_level, num_values)
decode_delta_binary_packed(data, num_values, parquet_type) puts num_values in the middle and parquet_type last, making it the lone outlier. The PR description says it "takes a parquet_type argument (mirroring decode_plain)", but decode_plain's order is (data, parquet_type, num_values) — so the signature doesn't actually mirror it.
Since this is a new public function (it's exported in __all__), reordering to decode_delta_binary_packed(data, parquet_type, num_values) now — while it's still a draft and unreleased — keeps call sites uniform with the rest of the module and avoids a breaking signature change later.
There was a problem hiding this comment.
Fixed in d564fd5. Reordered to decode_delta_binary_packed(data, parquet_type, num_values) to match decode_plain / decode_levels / decode_v1_level_block (encoding-specific arg before num_values, num_values last). New function this PR, so no released callers affected; updated all 14 call sites in the tests. dev:check green.
…ype, num_values) Match the signature convention of the other public decoders (decode_plain, decode_levels, decode_v1_level_block), which take the encoding-specific argument before num_values and num_values last. The function is new in this PR, so no released callers are affected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
clee704
left a comment
There was a problem hiding this comment.
Reviewed this draft with correctness as the primary focus, grounded against the DELTA_BINARY_PACKED spec and real pyarrow-emitted bytes (not just trusting the prior pass).
Decode logic is spec-correct — independently confirmed.
- Ran the 22 delta tests: all green;
decoders.pyat 99% (the single uncovered line, 309, predates this PR). - Ran my own 500-case fuzz cross-check through pyarrow 24's
DELTA_BINARY_PACKEDwriter — INT32 and INT64, sizes 1–1000, across uniform/full-range, small-step, all-equal, ramp, boundary-clustered, and base-clustered distributions. Every case round-tripped exactly. Max miniblock bit widths of 32 (INT32) and 64 (INT64) were exercised, and 88 trials produced multiple blocks. Header parse, zigzag first-value/min-delta, LSB-first unpacking,value[i]=value[i-1]+min_delta+unpacked[i], exacttotal_value_countemission (final-miniblock padding + unused trailing miniblocks ignored), and the two's-complement wrap/sign-reinterpret for both widths all check out. - Truncation / mismatch paths raise
ValueErroras intended.
The param-order point from the earlier pass is already addressed in d564fd5. One robustness finding below — not a correctness bug on valid data, but it contradicts the module's stated fail-loudly-on-corruption stance for a tool whose job is inspecting potentially-malformed files.
| f"DELTA_BINARY_PACKED miniblock bit width {bit_width} " | ||
| f"exceeds the {type_bits}-bit {ptype} type width" | ||
| ) | ||
| unpacked = _unpack_bitpacked(data, pos, values_per_miniblock, bit_width) |
There was a problem hiding this comment.
Unbounded memory allocation from an unvalidated header field (robustness / resource-exhaustion on malformed input — Low/Medium).
block_size is taken from the stream header and validated only for positivity + divisibility (lines 910–915), never for a sane upper bound. values_per_miniblock = block_size // miniblocks_per_block then flows straight into _unpack_bitpacked(data, pos, values_per_miniblock, bit_width), which always materializes a list of values_per_miniblock entries regardless of how many values are actually still needed. For the bit_width == 0 branch (return [0] * count, line 811) there is no byte-availability guard at all, so the allocation is driven purely by the header.
A crafted/corrupt 9-byte stream demonstrates it — block_size huge, miniblocks_per_block=1, total_value_count=2 (so the num_values cross-check passes), min_delta=0, one bit_width=0 byte:
block_size= 1,000,000 input=8B num_values=2 -> peak_mem ~8 MB
block_size= 50,000,000 input=9B num_values=2 -> peak_mem ~400 MB
It scales linearly with block_size; near INT32_MAX this OOMs the process. The surrounding truncation guards already raise ValueError for bit_width > 0 bodies (the nbytes check at line 812 fires), so this bit_width == 0 path is the lone hole, and the outcome (multi-GB allocation) is inconsistent with the clean ValueError every other corruption path produces.
Suggested fix: unpack only what's left — pass min(values_per_miniblock, total_value_count - len(values)) as the count while still advancing pos by the full miniblock body size (values_per_miniblock * bit_width + 7) // 8. That preserves correctness (padding still skipped) and bounds the values list by num_values. Optionally also reject an unreasonable block_size in the header validation (the spec constrains it to a multiple of 128), alongside the existing positivity/divisibility checks.
There was a problem hiding this comment.
Fixed in e81aafc. The decoder now validates the full (padded) miniblock body length against the stream before unpacking — so a corrupt block_size with a non-zero bit width is reported as truncated rather than allocating — and materializes only the values still needed, which bounds the zero-bit-width case (no body to validate against) at num_values. The read cursor still advances by the full body, so valid streams are unaffected. Added two regression tests (huge block_size with bit width 1 → truncated; with bit width 0 → caps allocation, decodes correctly). dev:check green.
A malformed stream could declare a huge block_size and force the decoder to materialize block_size//miniblocks values per miniblock — a ~MB-scale allocation for a few-byte stream — instead of failing loudly on corrupt input. Validate the full (padded) miniblock body length against the stream before unpacking (so a corrupt block_size with a non-zero bit width is reported as truncated), and materialize only the values still needed (so a zero-bit-width miniblock, which has no body to bound it, can't allocate block_size zeros). The read cursor still advances past the full body, so valid streams decode unchanged. The now-redundant bounds check in the bit-unpack helper moves to this single validating caller. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Every delta-reconstructed value is masked to the column's bit width and
reinterpreted as signed two's-complement, but the leading seed value was
appended raw — so a corrupt out-of-range seed escaped the type while all
later values were normalized. Apply the same width reinterpretation to
the first value (a no-op for valid in-range data), keeping the whole
decoded series type-valid and internally consistent.
Also clarify the block-size validation message ("a positive multiple of
the N miniblocks per block").
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
What & why
Adds
decode_delta_binary_packed, the decoder for the ParquetDELTA_BINARY_PACKED encoding (enum 5). This is the highest-priority
remaining decoder: pyarrow has emitted DELTA_BINARY_PACKED as the default
for
INT32/INT64columns since Arrow 14, so most recent Parquet filesin the wild exercise it, and it is the building block for the
DELTA_LENGTH_BYTE_ARRAY(enum 6) andDELTA_BYTE_ARRAY(enum 7)decoders.
The function decodes a raw value stream (caller slices off any
definition/repetition-level block first, as with the other value
decoders):
ULEB128), and the zigzag-ULEB128 first value.
then the LSB-first bit-packed miniblock bodies. Values are reconstructed
as
value[i] = value[i-1] + min_delta + unpacked[i].Reconstruction wraps to the column type width (
INT32→ 32 bits,INT64→ 64 bits) and reinterprets as signed. This is required for correctness,
not cosmetic: deltas between extreme values (e.g.
INT32_MAX→INT32_MIN) overflow the fixed-width type, and the encoder relies ontwo's-complement wraparound — without the wrap the decoded values are
wrong. The decoder needs the column type for this, so it takes a
parquet_typeargument (mirroringdecode_plain).It emits exactly the header's
total value count, ignoring trailingpadding in the final used miniblock and the unused trailing miniblocks of
the last block, and cross-checks that count against the caller-supplied
num_values(raising on mismatch, consistent with the module'sfail-loudly-on-corruption stance).
Following the module's "the encoding-specific return shape is the value"
design rule (#12), the function returns
(values, DeltaBinaryPackedStats).The stats expose the encoder's structural choices that the reconstructed
values alone don't reveal — block/miniblock geometry, first value, and
each block's min delta and per-miniblock bit widths — mirroring the
DecodeStatsinspection surface of the RLE primitive. Bit widths arerecorded for every declared miniblock (the on-disk truth), matching how
DecodeStatsrecords encoder-declared run lengths.Closes #62. Part of #14.
Refactoring checkpoint
module's existing
_read_varinthelper for ULEB128 reads and followsthe established dataclass +
decode_*conventions; the one new privatehelper (
_unpack_bitpacked) is a focused LSB-first unpacker and everynew branch (header validation, type-width guard, two truncation guards,
the count cross-check) is covered by a test.
Dogfooding
pyarrow-emitted bytes (pyarrow 24.0.0). For each case I wrote a parquet
file with
pq.write_table(..., column_encoding="DELTA_BINARY_PACKED", use_dictionary=False), located the data page via the repo's ownParquetFilesegment walk, decompressed it, stripped the V1 def-levelblock, decoded the remaining DELTA stream, and asserted equality with
the original column. Cases: consecutive integers (
range(20)),single value, all-equal (zero deltas), negative/mixed deltas, INT32 and
INT64 type-boundary values that overflow the type and exercise the
wrap-to-signed path and max bit width, and a 400-value multi-block
stream with varied deltas (forcing multiple blocks and non-zero
miniblock bit widths). Both
INT32andINT64round-trip exactly.stream; the typed
(values, stats)return reads naturally alongside theexisting
decode_rle_bitpacked_hybridprimitive. Nothing to fix.Tests
hatch run dev:checkis green (format, lint, type-check, tests, per-module 95% coverage).decoders.pyat 99.7%.