feat: add DELTA_BINARY_PACKED decoder by clee704 · Pull Request #64 · clee704/parquet-analyzer

clee704 · 2026-06-11T04:07:37Z

What & why

Adds decode_delta_binary_packed, the decoder for the Parquet
DELTA_BINARY_PACKED encoding (enum 5). This is the highest-priority
remaining decoder: pyarrow has emitted DELTA_BINARY_PACKED as the default
for INT32 / INT64 columns since Arrow 14, so most recent Parquet files
in the wild exercise it, and it is the building block for the
DELTA_LENGTH_BYTE_ARRAY (enum 6) and DELTA_BYTE_ARRAY (enum 7)
decoders.

The function decodes a raw value stream (caller slices off any
definition/repetition-level block first, as with the other value
decoders):

Header — block size, miniblocks per block, total value count (all
ULEB128), and the zigzag-ULEB128 first value.
Each block — a zigzag min delta, one bit-width byte per miniblock,
then the LSB-first bit-packed miniblock bodies. Values are reconstructed
as value[i] = value[i-1] + min_delta + unpacked[i].

Reconstruction wraps to the column type width (INT32 → 32 bits, INT64
→ 64 bits) and reinterprets as signed. This is required for correctness,
not cosmetic: deltas between extreme values (e.g. INT32_MAX →
INT32_MIN) overflow the fixed-width type, and the encoder relies on
two's-complement wraparound — without the wrap the decoded values are
wrong. The decoder needs the column type for this, so it takes a
parquet_type argument (mirroring decode_plain).

It emits exactly the header's total value count, ignoring trailing
padding in the final used miniblock and the unused trailing miniblocks of
the last block, and cross-checks that count against the caller-supplied
num_values (raising on mismatch, consistent with the module's
fail-loudly-on-corruption stance).

Following the module's "the encoding-specific return shape is the value"
design rule (#12), the function returns (values, DeltaBinaryPackedStats).
The stats expose the encoder's structural choices that the reconstructed
values alone don't reveal — block/miniblock geometry, first value, and
each block's min delta and per-miniblock bit widths — mirroring the
DecodeStats inspection surface of the RLE primitive. Bit widths are
recorded for every declared miniblock (the on-disk truth), matching how
DecodeStats records encoder-declared run lengths.

Closes #62. Part of #14.

Refactoring checkpoint

I ran the refactoring checkpoint on the touched code and its design context.
Fixed in this PR: Nothing to clean up. The new code reuses the
module's existing _read_varint helper for ULEB128 reads and follows
the established dataclass + decode_* conventions; the one new private
helper (_unpack_bitpacked) is a focused LSB-first unpacker and every
new branch (header validation, type-width guard, two truncation guards,
the count cross-check) is covered by a test.
Deferred (issue links): none.

Dogfooding

I drove the change on realistic data and assessed performance + UX.
Scenarios exercised: Ground-truth validation against real
pyarrow-emitted bytes (pyarrow 24.0.0). For each case I wrote a parquet
file with pq.write_table(..., column_encoding="DELTA_BINARY_PACKED", use_dictionary=False), located the data page via the repo's own
ParquetFile segment walk, decompressed it, stripped the V1 def-level
block, decoded the remaining DELTA stream, and asserted equality with
the original column. Cases: consecutive integers (range(20)),
single value, all-equal (zero deltas), negative/mixed deltas, INT32 and
INT64 type-boundary values that overflow the type and exercise the
wrap-to-signed path and max bit width, and a 400-value multi-block
stream with varied deltas (forcing multiple blocks and non-zero
miniblock bit widths). Both INT32 and INT64 round-trip exactly.
Perf / UX findings: The decoder is a linear single pass over the
stream; the typed (values, stats) return reads naturally alongside the
existing decode_rle_bitpacked_hybrid primitive. Nothing to fix.
Fixed / deferred: none.

Tests

hatch run dev:check is green (format, lint, type-check, tests, per-module 95% coverage). decoders.py at 99.7%.
New / changed code paths assert observable behavior, not just execute lines.

Add decode_delta_binary_packed for the Parquet DELTA_BINARY_PACKED encoding (enum 5), pyarrow's default for INT32/INT64 columns. It parses the stream header (block size, miniblocks per block, total value count, zigzag first value), then each block's min delta and per-miniblock bit widths, unpacks the LSB-first bit-packed miniblocks, and reconstructs values as value[i] = value[i-1] + min_delta + unpacked[i]. Reconstruction wraps to the column type width and reinterprets as signed, which is required: deltas between extreme values overflow the fixed-width type and the encoder relies on two's-complement wraparound. It emits exactly the header's total value count, ignoring trailing padding in the final miniblock and the unused trailing miniblocks of the last block, and cross-checks that count against the caller-supplied num_values. A DeltaBinaryPackedStats result exposes the encoder's structural choices that the values alone don't reveal — block/miniblock geometry, first value, and each block's min delta and miniblock bit widths — mirroring the DecodeStats inspection surface of the RLE primitive. Validated against real pyarrow-emitted bytes: writing parquet files with column_encoding="DELTA_BINARY_PACKED" for INT32 and INT64, extracting the data-page value stream, decoding it, and asserting the values equal the original column, across consecutive, single-value, all-equal, negative/mixed, type-boundary (wrapping), and multi-block cases. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

clee704

Reviewed this draft with correctness as the primary focus, grounding against the DELTA_BINARY_PACKED spec and the actual emitted bytes.

The decode logic is spec-correct, and I couldn't break it. Verified:

Header parse — block size / miniblocks-per-block / total value count as ULEB128, first value as zigzag-ULEB128; per-block zigzag min-delta; one bit-width byte per declared miniblock; LSB-first bit unpacking; reconstruction value[i] = value[i-1] + min_delta + unpacked[i]; exact total_value_count emission with final-miniblock padding and unused trailing miniblocks ignored — all match the spec.
Integer width / overflow — masking to the type width and re-interpreting the sign bit gives correct two's-complement wraparound for both INT32 and INT64, including extreme deltas (INT32_MAX → INT32_MIN, etc.).
Truncation / mismatch — header, bit-width-byte, miniblock-body, and "needs another block but stream ended" paths all raise ValueError; the num_values vs header total_value_count cross-check raises on disagreement.

Independent grounding:

The full decoder suite passes (22 delta tests), and decoders.py is at 99% coverage — the single uncovered line (309) predates this PR.
I ran a 400-case fuzz cross-check: random INT32/INT64 arrays (sizes 1–1000; uniform, small-step, clustered, ramp, and boundary/extreme distributions that overflow the type and force multiple blocks) round-tripped through pyarrow's DELTA_BINARY_PACKED writer and decoded exactly in every case.
The ground-truth test is meaningful: assert page["encoding"] == "DELTA_BINARY_PACKED" rules out a dictionary/PLAIN-fallback false positive, so a passing test really is decoding a delta page.

One non-correctness, API-consistency point inline. Everything else checks out.

clee704 · 2026-06-11T04:28:27Z

+
+
+def decode_delta_binary_packed(
+    data: bytes, num_values: int, parquet_type: str


Parameter order is inconsistent with the rest of the module (consistency, not correctness).

Every other public decoder here takes num_values as its last positional argument, with the encoding-specific parameter immediately before it:

decode_plain(data, parquet_type, num_values, type_length=None)

decode_rle_bitpacked_hybrid(data, bit_width, num_values)

decode_levels(data, max_level, num_values)

decode_v1_level_block(data, offset, max_level, num_values)

decode_delta_binary_packed(data, num_values, parquet_type) puts num_values in the middle and parquet_type last, making it the lone outlier. The PR description says it "takes a parquet_type argument (mirroring decode_plain)", but decode_plain's order is (data, parquet_type, num_values) — so the signature doesn't actually mirror it.

Since this is a new public function (it's exported in __all__), reordering to decode_delta_binary_packed(data, parquet_type, num_values) now — while it's still a draft and unreleased — keeps call sites uniform with the rest of the module and avoids a breaking signature change later.

Fixed in d564fd5. Reordered to decode_delta_binary_packed(data, parquet_type, num_values) to match decode_plain / decode_levels / decode_v1_level_block (encoding-specific arg before num_values, num_values last). New function this PR, so no released callers affected; updated all 14 call sites in the tests. dev:check green.

…ype, num_values) Match the signature convention of the other public decoders (decode_plain, decode_levels, decode_v1_level_block), which take the encoding-specific argument before num_values and num_values last. The function is new in this PR, so no released callers are affected. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

clee704

Reviewed this draft with correctness as the primary focus, grounded against the DELTA_BINARY_PACKED spec and real pyarrow-emitted bytes (not just trusting the prior pass).

Decode logic is spec-correct — independently confirmed.

Ran the 22 delta tests: all green; decoders.py at 99% (the single uncovered line, 309, predates this PR).
Ran my own 500-case fuzz cross-check through pyarrow 24's DELTA_BINARY_PACKED writer — INT32 and INT64, sizes 1–1000, across uniform/full-range, small-step, all-equal, ramp, boundary-clustered, and base-clustered distributions. Every case round-tripped exactly. Max miniblock bit widths of 32 (INT32) and 64 (INT64) were exercised, and 88 trials produced multiple blocks. Header parse, zigzag first-value/min-delta, LSB-first unpacking, value[i]=value[i-1]+min_delta+unpacked[i], exact total_value_count emission (final-miniblock padding + unused trailing miniblocks ignored), and the two's-complement wrap/sign-reinterpret for both widths all check out.
Truncation / mismatch paths raise ValueError as intended.

The param-order point from the earlier pass is already addressed in d564fd5. One robustness finding below — not a correctness bug on valid data, but it contradicts the module's stated fail-loudly-on-corruption stance for a tool whose job is inspecting potentially-malformed files.

clee704 · 2026-06-11T04:55:45Z

+                    f"DELTA_BINARY_PACKED miniblock bit width {bit_width} "
+                    f"exceeds the {type_bits}-bit {ptype} type width"
+                )
+            unpacked = _unpack_bitpacked(data, pos, values_per_miniblock, bit_width)


Unbounded memory allocation from an unvalidated header field (robustness / resource-exhaustion on malformed input — Low/Medium).

block_size is taken from the stream header and validated only for positivity + divisibility (lines 910–915), never for a sane upper bound. values_per_miniblock = block_size // miniblocks_per_block then flows straight into _unpack_bitpacked(data, pos, values_per_miniblock, bit_width), which always materializes a list of values_per_miniblock entries regardless of how many values are actually still needed. For the bit_width == 0 branch (return [0] * count, line 811) there is no byte-availability guard at all, so the allocation is driven purely by the header.

A crafted/corrupt 9-byte stream demonstrates it — block_size huge, miniblocks_per_block=1, total_value_count=2 (so the num_values cross-check passes), min_delta=0, one bit_width=0 byte:

block_size= 1,000,000 input=8B num_values=2 -> peak_mem ~8 MB block_size= 50,000,000 input=9B num_values=2 -> peak_mem ~400 MB

It scales linearly with block_size; near INT32_MAX this OOMs the process. The surrounding truncation guards already raise ValueError for bit_width > 0 bodies (the nbytes check at line 812 fires), so this bit_width == 0 path is the lone hole, and the outcome (multi-GB allocation) is inconsistent with the clean ValueError every other corruption path produces.

Suggested fix: unpack only what's left — pass min(values_per_miniblock, total_value_count - len(values)) as the count while still advancing pos by the full miniblock body size (values_per_miniblock * bit_width + 7) // 8. That preserves correctness (padding still skipped) and bounds the values list by num_values. Optionally also reject an unreasonable block_size in the header validation (the spec constrains it to a multiple of 128), alongside the existing positivity/divisibility checks.

Fixed in e81aafc. The decoder now validates the full (padded) miniblock body length against the stream before unpacking — so a corrupt block_size with a non-zero bit width is reported as truncated rather than allocating — and materializes only the values still needed, which bounds the zero-bit-width case (no body to validate against) at num_values. The read cursor still advances by the full body, so valid streams are unaffected. Added two regression tests (huge block_size with bit width 1 → truncated; with bit width 0 → caps allocation, decodes correctly). dev:check green.

A malformed stream could declare a huge block_size and force the decoder to materialize block_size//miniblocks values per miniblock — a ~MB-scale allocation for a few-byte stream — instead of failing loudly on corrupt input. Validate the full (padded) miniblock body length against the stream before unpacking (so a corrupt block_size with a non-zero bit width is reported as truncated), and materialize only the values still needed (so a zero-bit-width miniblock, which has no body to bound it, can't allocate block_size zeros). The read cursor still advances past the full body, so valid streams decode unchanged. The now-redundant bounds check in the bit-unpack helper moves to this single validating caller. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Every delta-reconstructed value is masked to the column's bit width and reinterpreted as signed two's-complement, but the leading seed value was appended raw — so a corrupt out-of-range seed escaped the type while all later values were normalized. Apply the same width reinterpretation to the first value (a no-op for valid in-range data), keeping the whole decoded series type-valid and internally consistent. Also clarify the block-size validation message ("a positive multiple of the N miniblocks per block"). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

clee704 commented Jun 11, 2026

View reviewed changes

clee704 and others added 2 commits June 11, 2026 05:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add DELTA_BINARY_PACKED decoder#64

feat: add DELTA_BINARY_PACKED decoder#64
clee704 wants to merge 4 commits into
masterfrom
feat/delta-binary-packed

clee704 commented Jun 11, 2026

Uh oh!

clee704 left a comment

Uh oh!

clee704 Jun 11, 2026

Uh oh!

clee704 Jun 11, 2026

Uh oh!

clee704 left a comment

Uh oh!

clee704 Jun 11, 2026

Uh oh!

clee704 Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant



		def decode_delta_binary_packed(
		data: bytes, num_values: int, parquet_type: str

Conversation

clee704 commented Jun 11, 2026

What & why

Refactoring checkpoint

Dogfooding

Tests

Uh oh!

clee704 left a comment

Choose a reason for hiding this comment

Uh oh!

clee704 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

clee704 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

clee704 left a comment

Choose a reason for hiding this comment

Uh oh!

clee704 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

clee704 Jun 11, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant