Skip to content

Fix byte position accounting in cached batch output#15080

Open
fallintoplace wants to merge 1 commit into
NVIDIA:mainfrom
fallintoplace:fix-byte-array-output-position
Open

Fix byte position accounting in cached batch output#15080
fallintoplace wants to merge 1 commit into
NVIDIA:mainfrom
fallintoplace:fix-byte-array-output-position

Conversation

@fallintoplace

@fallintoplace fallintoplace commented Jun 14, 2026

Copy link
Copy Markdown

No issue filed.

Description

Fix byte position accounting in the cached batch parquet output stream. OutputStream.write(int) writes one byte, so the tracked position should advance by one byte.

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
    (Please provide the names of the existing tests in the PR description.)
  • Not required

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

@greptile-apps

greptile-apps Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

Fixes an off-by-factor-of-4 bug in the byte-position tracking of the ByteArrayOutputFile output stream used when serializing cached Parquet batches. OutputStream.write(int) writes exactly one byte (the low-order byte of the argument), so pos must advance by 1, not by Integer.BYTES (which is 4).

  • The bug caused getPos() to return positions 4× too large on every single-byte write, potentially corrupting offset accounting in the Parquet encoder.
  • The sibling overrides (write(byte[])b.length, write(byte[], int, int)len) were already correct; only the write(int) path was wrong.

Confidence Score: 5/5

Single-line fix correcting an obvious mis-accounting; all three write paths now agree and the change is trivially verifiable against the Java OutputStream contract.

The change is a one-character correction to a clearly wrong constant. OutputStream.write(int) is documented to write exactly one byte, so replacing Integer.BYTES (4) with 1 is unambiguously correct. The sibling overrides were already right; no other logic is touched.

No files require special attention.

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/parquet/ParquetCachedBatchSerializer.scala Corrects pos += Integer.BYTES to pos += 1 in the write(Int) override of the anonymous DelegatingPositionOutputStream; the fix is accurate and all three write overrides now consistently track bytes written.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant PE as Parquet Encoder
    participant POS as DelegatingPositionOutputStream
    participant BOS as ByteArrayOutputStream

    PE->>POS: write(b: Int)
    POS->>BOS: super.write(b) [writes 1 byte]
    Note over POS: pos += 1  ✅ (was Integer.BYTES = 4 ❌)
    PE->>POS: getPos(): Long
    POS-->>PE: pos (now accurate)

    PE->>POS: write(b: Array[Byte])
    POS->>BOS: super.write(b)
    Note over POS: pos += b.length ✅

    PE->>POS: write(b: Array[Byte], off: Int, len: Int)
    POS->>BOS: super.write(b, off, len)
    Note over POS: pos += len ✅
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant PE as Parquet Encoder
    participant POS as DelegatingPositionOutputStream
    participant BOS as ByteArrayOutputStream

    PE->>POS: write(b: Int)
    POS->>BOS: super.write(b) [writes 1 byte]
    Note over POS: pos += 1  ✅ (was Integer.BYTES = 4 ❌)
    PE->>POS: getPos(): Long
    POS-->>PE: pos (now accurate)

    PE->>POS: write(b: Array[Byte])
    POS->>BOS: super.write(b)
    Note over POS: pos += b.length ✅

    PE->>POS: write(b: Array[Byte], off: Int, len: Int)
    POS->>BOS: super.write(b, off, len)
    Note over POS: pos += len ✅
Loading

Reviews (1): Last reviewed commit: "Fix byte position accounting in cached b..." | Re-trigger Greptile

Signed-off-by: Minh Vu <vuhoangminh97@gmail.com>
@fallintoplace fallintoplace force-pushed the fix-byte-array-output-position branch from 2c3d942 to 5d0fa0f Compare June 14, 2026 10:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants