Skip to content

feat: add DELTA_BINARY_PACKED decoder#64

Draft
clee704 wants to merge 4 commits into
masterfrom
feat/delta-binary-packed
Draft

feat: add DELTA_BINARY_PACKED decoder#64
clee704 wants to merge 4 commits into
masterfrom
feat/delta-binary-packed

Conversation

@clee704

@clee704 clee704 commented Jun 11, 2026

Copy link
Copy Markdown
Owner

What & why

Adds decode_delta_binary_packed, the decoder for the Parquet
DELTA_BINARY_PACKED encoding (enum 5). This is the highest-priority
remaining decoder: pyarrow has emitted DELTA_BINARY_PACKED as the default
for INT32 / INT64 columns since Arrow 14, so most recent Parquet files
in the wild exercise it, and it is the building block for the
DELTA_LENGTH_BYTE_ARRAY (enum 6) and DELTA_BYTE_ARRAY (enum 7)
decoders.

The function decodes a raw value stream (caller slices off any
definition/repetition-level block first, as with the other value
decoders):

  • Header — block size, miniblocks per block, total value count (all
    ULEB128), and the zigzag-ULEB128 first value.
  • Each block — a zigzag min delta, one bit-width byte per miniblock,
    then the LSB-first bit-packed miniblock bodies. Values are reconstructed
    as value[i] = value[i-1] + min_delta + unpacked[i].

Reconstruction wraps to the column type width (INT32 → 32 bits, INT64
→ 64 bits) and reinterprets as signed. This is required for correctness,
not cosmetic: deltas between extreme values (e.g. INT32_MAX
INT32_MIN) overflow the fixed-width type, and the encoder relies on
two's-complement wraparound — without the wrap the decoded values are
wrong. The decoder needs the column type for this, so it takes a
parquet_type argument (mirroring decode_plain).

It emits exactly the header's total value count, ignoring trailing
padding in the final used miniblock and the unused trailing miniblocks of
the last block, and cross-checks that count against the caller-supplied
num_values (raising on mismatch, consistent with the module's
fail-loudly-on-corruption stance).

Following the module's "the encoding-specific return shape is the value"
design rule (#12), the function returns (values, DeltaBinaryPackedStats).
The stats expose the encoder's structural choices that the reconstructed
values alone don't reveal — block/miniblock geometry, first value, and
each block's min delta and per-miniblock bit widths — mirroring the
DecodeStats inspection surface of the RLE primitive. Bit widths are
recorded for every declared miniblock (the on-disk truth), matching how
DecodeStats records encoder-declared run lengths.

Closes #62. Part of #14.

Refactoring checkpoint

  • I ran the refactoring checkpoint on the touched code and its design context.
  • Fixed in this PR: Nothing to clean up. The new code reuses the
    module's existing _read_varint helper for ULEB128 reads and follows
    the established dataclass + decode_* conventions; the one new private
    helper (_unpack_bitpacked) is a focused LSB-first unpacker and every
    new branch (header validation, type-width guard, two truncation guards,
    the count cross-check) is covered by a test.
  • Deferred (issue links): none.

Dogfooding

  • I drove the change on realistic data and assessed performance + UX.
  • Scenarios exercised: Ground-truth validation against real
    pyarrow-emitted bytes (pyarrow 24.0.0). For each case I wrote a parquet
    file with pq.write_table(..., column_encoding="DELTA_BINARY_PACKED", use_dictionary=False), located the data page via the repo's own
    ParquetFile segment walk, decompressed it, stripped the V1 def-level
    block, decoded the remaining DELTA stream, and asserted equality with
    the original column. Cases: consecutive integers (range(20)),
    single value, all-equal (zero deltas), negative/mixed deltas, INT32 and
    INT64 type-boundary values that overflow the type and exercise the
    wrap-to-signed path and max bit width, and a 400-value multi-block
    stream with varied deltas (forcing multiple blocks and non-zero
    miniblock bit widths). Both INT32 and INT64 round-trip exactly.
  • Perf / UX findings: The decoder is a linear single pass over the
    stream; the typed (values, stats) return reads naturally alongside the
    existing decode_rle_bitpacked_hybrid primitive. Nothing to fix.
  • Fixed / deferred: none.

Tests

  • hatch run dev:check is green (format, lint, type-check, tests, per-module 95% coverage). decoders.py at 99.7%.
  • New / changed code paths assert observable behavior, not just execute lines.

Add decode_delta_binary_packed for the Parquet DELTA_BINARY_PACKED
encoding (enum 5), pyarrow's default for INT32/INT64 columns. It parses
the stream header (block size, miniblocks per block, total value count,
zigzag first value), then each block's min delta and per-miniblock bit
widths, unpacks the LSB-first bit-packed miniblocks, and reconstructs
values as value[i] = value[i-1] + min_delta + unpacked[i]. Reconstruction
wraps to the column type width and reinterprets as signed, which is
required: deltas between extreme values overflow the fixed-width type and
the encoder relies on two's-complement wraparound.

It emits exactly the header's total value count, ignoring trailing
padding in the final miniblock and the unused trailing miniblocks of the
last block, and cross-checks that count against the caller-supplied
num_values.

A DeltaBinaryPackedStats result exposes the encoder's structural choices
that the values alone don't reveal — block/miniblock geometry, first
value, and each block's min delta and miniblock bit widths — mirroring
the DecodeStats inspection surface of the RLE primitive.

Validated against real pyarrow-emitted bytes: writing parquet files with
column_encoding="DELTA_BINARY_PACKED" for INT32 and INT64, extracting the
data-page value stream, decoding it, and asserting the values equal the
original column, across consecutive, single-value, all-equal,
negative/mixed, type-boundary (wrapping), and multi-block cases.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@clee704 clee704 left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed this draft with correctness as the primary focus, grounding against the DELTA_BINARY_PACKED spec and the actual emitted bytes.

The decode logic is spec-correct, and I couldn't break it. Verified:

  • Header parse — block size / miniblocks-per-block / total value count as ULEB128, first value as zigzag-ULEB128; per-block zigzag min-delta; one bit-width byte per declared miniblock; LSB-first bit unpacking; reconstruction value[i] = value[i-1] + min_delta + unpacked[i]; exact total_value_count emission with final-miniblock padding and unused trailing miniblocks ignored — all match the spec.
  • Integer width / overflow — masking to the type width and re-interpreting the sign bit gives correct two's-complement wraparound for both INT32 and INT64, including extreme deltas (INT32_MAXINT32_MIN, etc.).
  • Truncation / mismatch — header, bit-width-byte, miniblock-body, and "needs another block but stream ended" paths all raise ValueError; the num_values vs header total_value_count cross-check raises on disagreement.

Independent grounding:

  • The full decoder suite passes (22 delta tests), and decoders.py is at 99% coverage — the single uncovered line (309) predates this PR.
  • I ran a 400-case fuzz cross-check: random INT32/INT64 arrays (sizes 1–1000; uniform, small-step, clustered, ramp, and boundary/extreme distributions that overflow the type and force multiple blocks) round-tripped through pyarrow's DELTA_BINARY_PACKED writer and decoded exactly in every case.
  • The ground-truth test is meaningful: assert page["encoding"] == "DELTA_BINARY_PACKED" rules out a dictionary/PLAIN-fallback false positive, so a passing test really is decoding a delta page.

One non-correctness, API-consistency point inline. Everything else checks out.

Comment thread src/parquet_analyzer/decoders.py Outdated


def decode_delta_binary_packed(
data: bytes, num_values: int, parquet_type: str

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Parameter order is inconsistent with the rest of the module (consistency, not correctness).

Every other public decoder here takes num_values as its last positional argument, with the encoding-specific parameter immediately before it:

  • decode_plain(data, parquet_type, num_values, type_length=None)
  • decode_rle_bitpacked_hybrid(data, bit_width, num_values)
  • decode_levels(data, max_level, num_values)
  • decode_v1_level_block(data, offset, max_level, num_values)

decode_delta_binary_packed(data, num_values, parquet_type) puts num_values in the middle and parquet_type last, making it the lone outlier. The PR description says it "takes a parquet_type argument (mirroring decode_plain)", but decode_plain's order is (data, parquet_type, num_values) — so the signature doesn't actually mirror it.

Since this is a new public function (it's exported in __all__), reordering to decode_delta_binary_packed(data, parquet_type, num_values) now — while it's still a draft and unreleased — keeps call sites uniform with the rest of the module and avoids a breaking signature change later.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in d564fd5. Reordered to decode_delta_binary_packed(data, parquet_type, num_values) to match decode_plain / decode_levels / decode_v1_level_block (encoding-specific arg before num_values, num_values last). New function this PR, so no released callers affected; updated all 14 call sites in the tests. dev:check green.

…ype, num_values)

Match the signature convention of the other public decoders
(decode_plain, decode_levels, decode_v1_level_block), which take the
encoding-specific argument before num_values and num_values last. The
function is new in this PR, so no released callers are affected.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

@clee704 clee704 left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed this draft with correctness as the primary focus, grounded against the DELTA_BINARY_PACKED spec and real pyarrow-emitted bytes (not just trusting the prior pass).

Decode logic is spec-correct — independently confirmed.

  • Ran the 22 delta tests: all green; decoders.py at 99% (the single uncovered line, 309, predates this PR).
  • Ran my own 500-case fuzz cross-check through pyarrow 24's DELTA_BINARY_PACKED writer — INT32 and INT64, sizes 1–1000, across uniform/full-range, small-step, all-equal, ramp, boundary-clustered, and base-clustered distributions. Every case round-tripped exactly. Max miniblock bit widths of 32 (INT32) and 64 (INT64) were exercised, and 88 trials produced multiple blocks. Header parse, zigzag first-value/min-delta, LSB-first unpacking, value[i]=value[i-1]+min_delta+unpacked[i], exact total_value_count emission (final-miniblock padding + unused trailing miniblocks ignored), and the two's-complement wrap/sign-reinterpret for both widths all check out.
  • Truncation / mismatch paths raise ValueError as intended.

The param-order point from the earlier pass is already addressed in d564fd5. One robustness finding below — not a correctness bug on valid data, but it contradicts the module's stated fail-loudly-on-corruption stance for a tool whose job is inspecting potentially-malformed files.

Comment thread src/parquet_analyzer/decoders.py Outdated
f"DELTA_BINARY_PACKED miniblock bit width {bit_width} "
f"exceeds the {type_bits}-bit {ptype} type width"
)
unpacked = _unpack_bitpacked(data, pos, values_per_miniblock, bit_width)

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unbounded memory allocation from an unvalidated header field (robustness / resource-exhaustion on malformed input — Low/Medium).

block_size is taken from the stream header and validated only for positivity + divisibility (lines 910–915), never for a sane upper bound. values_per_miniblock = block_size // miniblocks_per_block then flows straight into _unpack_bitpacked(data, pos, values_per_miniblock, bit_width), which always materializes a list of values_per_miniblock entries regardless of how many values are actually still needed. For the bit_width == 0 branch (return [0] * count, line 811) there is no byte-availability guard at all, so the allocation is driven purely by the header.

A crafted/corrupt 9-byte stream demonstrates it — block_size huge, miniblocks_per_block=1, total_value_count=2 (so the num_values cross-check passes), min_delta=0, one bit_width=0 byte:

block_size=   1,000,000  input=8B  num_values=2  -> peak_mem ~8 MB
block_size=  50,000,000  input=9B  num_values=2  -> peak_mem ~400 MB

It scales linearly with block_size; near INT32_MAX this OOMs the process. The surrounding truncation guards already raise ValueError for bit_width > 0 bodies (the nbytes check at line 812 fires), so this bit_width == 0 path is the lone hole, and the outcome (multi-GB allocation) is inconsistent with the clean ValueError every other corruption path produces.

Suggested fix: unpack only what's left — pass min(values_per_miniblock, total_value_count - len(values)) as the count while still advancing pos by the full miniblock body size (values_per_miniblock * bit_width + 7) // 8. That preserves correctness (padding still skipped) and bounds the values list by num_values. Optionally also reject an unreasonable block_size in the header validation (the spec constrains it to a multiple of 128), alongside the existing positivity/divisibility checks.

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in e81aafc. The decoder now validates the full (padded) miniblock body length against the stream before unpacking — so a corrupt block_size with a non-zero bit width is reported as truncated rather than allocating — and materializes only the values still needed, which bounds the zero-bit-width case (no body to validate against) at num_values. The read cursor still advances by the full body, so valid streams are unaffected. Added two regression tests (huge block_size with bit width 1 → truncated; with bit width 0 → caps allocation, decodes correctly). dev:check green.

clee704 and others added 2 commits June 11, 2026 05:01
A malformed stream could declare a huge block_size and force the decoder
to materialize block_size//miniblocks values per miniblock — a ~MB-scale
allocation for a few-byte stream — instead of failing loudly on corrupt
input.

Validate the full (padded) miniblock body length against the stream
before unpacking (so a corrupt block_size with a non-zero bit width is
reported as truncated), and materialize only the values still needed
(so a zero-bit-width miniblock, which has no body to bound it, can't
allocate block_size zeros). The read cursor still advances past the full
body, so valid streams decode unchanged. The now-redundant bounds check
in the bit-unpack helper moves to this single validating caller.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Every delta-reconstructed value is masked to the column's bit width and
reinterpreted as signed two's-complement, but the leading seed value was
appended raw — so a corrupt out-of-range seed escaped the type while all
later values were normalized. Apply the same width reinterpretation to
the first value (a no-op for valid in-range data), keeping the whole
decoded series type-valid and internally consistent.

Also clarify the block-size validation message ("a positive multiple of
the N miniblocks per block").

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Decoder] decode_delta_binary_packed (DELTA_BINARY_PACKED, enum 5)

1 participant