[VL] Native Delta DV bitmap aggregator aborts on a Long.MAX_VALUE sentinel row index during MERGE with deletion vectors (intermittent VeloxRuntimeError) #12377

Description

@felipepessoto

Backend

VL (Velox)

Bug description

Expected: A Delta MERGE INTO that writes deletion vectors (DVs) completes successfully, exactly as it does on vanilla Spark + Delta.

Actual: Under the Gluten Velox bundle the MERGE intermittently aborts with a native VeloxRuntimeError (INVALID_STATE) raised by Gluten's Delta DV bitmap aggregator:

Delta RoaringBitmapArray row index 9223372036854775807 exceeds max representable value 9223372030412324864

9223372036854775807 is exactly Long.MAX_VALUE (2^63 - 1). The target table in the failing test is tiny (a handful of rows), so this is not a real row index -- it is a sentinel / placeholder value that is leaking into the DV-write aggregation.

The aggregation that builds the per-file DV (PartialAggregation, function addSafe) packs each matched target row's index into a RoaringBitmapArray. RoaringBitmapArray::addSafe enforces value <= kMaxRepresentableValue (= 0x7ffffffe80000000 = 9223372030412324864, which the code comments say mirrors Delta JVM's RoaringBitmapArray.MAX_REPRESENTABLE_VALUE). Long.MAX_VALUE is 0x7fffffffffffffff: its upper 32 bits (0x7fffffff) are one above the ceiling's upper 32 bits (0x7ffffffe), so it lies above the ceiling, the check fails, and the whole stage aborts.
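The arithmetic above can be checked at compile time. This is a minimal standalone sketch, not the Gluten header: the constant names follow those quoted from RoaringBitmapArray.h, while kLongMax and highKey are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Values copied from the issue text. The names mirror the ones quoted from
// RoaringBitmapArray.h, but this file is a standalone sketch.
constexpr uint64_t kMaxHighKey = 0x7ffffffeULL;
constexpr uint64_t kMaxLowKeyForMaxHighKey = 0x80000000ULL;
constexpr uint64_t kMaxRepresentableValue =
    (kMaxHighKey << 32) | kMaxLowKeyForMaxHighKey;

// Long.MAX_VALUE on the JVM has the same bit pattern as INT64_MAX.
// (kLongMax is a hypothetical name, not taken from the Gluten sources.)
constexpr uint64_t kLongMax =
    static_cast<uint64_t>(std::numeric_limits<int64_t>::max());

// Hypothetical helper: extracts the upper 32 bits of a row index.
constexpr uint64_t highKey(uint64_t value) {
  return value >> 32;
}

static_assert(kMaxRepresentableValue == 9223372030412324864ULL,
              "ceiling matches the value in the error message");
static_assert(kLongMax == 9223372036854775807ULL,
              "sentinel matches the value in the error message");
static_assert(highKey(kLongMax) == kMaxHighKey + 1,
              "sentinel's high key is one above the maximum high key");
static_assert(kLongMax > kMaxRepresentableValue,
              "so value <= kMaxRepresentableValue cannot hold");
```

The sentinel therefore can never pass the ceiling check, whatever the size of the table.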

This is flaky / non-deterministic. A byte-for-byte identical bundle passed this test in one CI run and failed it in the next (see Logs). So whether the sentinel reaches the aggregator depends on runtime plan / scan / scheduling (split boundaries, batch composition, task distribution), not on a source change. It reproduces in the suite:

org.apache.spark.sql.delta.generatedsuites.MergeIntoExtendedSyntaxSQLPathBasedDVsPredPushOnSuite
  test: extended syntax - update + conditional insert - isPartitioned: true

(...DVsPredPushOn... = deletion vectors on, predicate pushdown on.)

Root cause analysis

  • The aggregator only skips SQL NULLs; it does not special-case the sentinel:
    • cpp/velox/operators/functions/delta/DeltaBitmapAggregator.cc:63-69 (addInput returns early only when !value.has_value()),
    • cpp/velox/operators/functions/delta/DeltaBitmapAggregator.cc:43-46 (addRowIndex checks only value >= 0, then calls bitmap.addSafe).
  • The ceiling and check:
    • cpp/velox/compute/delta/RoaringBitmapArray.cpp:91-98 (addSafe, VELOX_CHECK_LE(value, kMaxRepresentableValue, ...)),
    • cpp/velox/compute/delta/RoaringBitmapArray.h:51-56 (kMaxHighKey = 0x7ffffffe, kMaxLowKeyForMaxHighKey = 0x80000000, kMaxRepresentableValue = (kMaxHighKey << 32) | kMaxLowKeyForMaxHighKey; comment: "Matches Delta JVM RoaringBitmapArray.MAX_REPRESENTABLE_VALUE").
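The control flow described in the bullets above can be modelled in isolation. The sketch below is not the Gluten code: BitmapModel is a hypothetical stand-in for RoaringBitmapArray (a plain vector), standard exceptions replace VELOX_CHECK_LE, and rejecting negative indices with an exception is an assumption, since the issue only says that addRowIndex checks value >= 0.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>
#include <vector>

constexpr int64_t kMaxRepresentableValue = 0x7ffffffe80000000LL;

// Hypothetical stand-in for RoaringBitmapArray: it only records accepted
// values, but applies the same ceiling check as the described addSafe.
struct BitmapModel {
  std::vector<int64_t> values;

  void addSafe(int64_t value) {
    if (value > kMaxRepresentableValue) {
      throw std::out_of_range("row index exceeds max representable value");
    }
    values.push_back(value);
  }
};

// Models the described path: NULLs are skipped, then only value >= 0 is
// checked, then addSafe is called. Throwing on negatives is an assumption.
void addInput(BitmapModel& bitmap, std::optional<int64_t> value) {
  if (!value.has_value()) {
    return; // SQL NULL: skipped
  }
  if (*value < 0) {
    throw std::invalid_argument("negative row index");
  }
  bitmap.addSafe(*value); // Long.MAX_VALUE reaches this call and throws
}
```

In this model a NULL is dropped and a small index is stored, but Long.MAX_VALUE passes both early checks and fails only inside addSafe, which matches where the reported error surfaces.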

Open question for a maintainer with Velox + Delta DV-write context: Delta's own JVM RoaringBitmapArray uses the same MAX_REPRESENTABLE_VALUE, so vanilla Delta would reject Long.MAX_VALUE too. Since vanilla Delta passes this MERGE, it must either never produce the sentinel on the DV-write branch or filter it out before the bitmap is built. That suggests the real defect is upstream of the aggregator -- Gluten's native row-index materialization / DV-write plan is emitting (and not filtering) a Long.MAX_VALUE placeholder that vanilla Delta would have excluded. The addSafe check is just where it surfaces. Two possible fix directions:

  1. Stop the sentinel at the source (mirror Delta's filter so placeholder rows never reach the DV aggregation), or
  2. Make the aggregator skip the sentinel the same way it skips NULLs -- but only if that matches Delta's documented semantics (silently dropping a genuinely out-of-range index would corrupt the DV, so option 1 is preferred unless the sentinel is a contract).
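For discussion, option 2 might look roughly like the following sketch. It assumes the placeholder is exactly Long.MAX_VALUE (the only value seen in the failing log), and kRowIndexSentinel, RowIndexAction and classifyRowIndex are hypothetical names, not existing Gluten or Delta identifiers. Any other out-of-range index is still rejected, so a genuinely bad index cannot be dropped silently.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

constexpr int64_t kMaxRepresentableValue = 0x7ffffffe80000000LL;

// Assumption: the placeholder is exactly Long.MAX_VALUE.
constexpr int64_t kRowIndexSentinel = std::numeric_limits<int64_t>::max();

enum class RowIndexAction { kSkip, kAdd, kReject };

// Hypothetical decision function for the aggregator input path.
RowIndexAction classifyRowIndex(std::optional<int64_t> value) {
  if (!value.has_value()) {
    return RowIndexAction::kSkip; // existing behaviour: NULLs are skipped
  }
  if (*value == kRowIndexSentinel) {
    return RowIndexAction::kSkip; // option 2: treat the sentinel like NULL
  }
  if (*value < 0 || *value > kMaxRepresentableValue) {
    return RowIndexAction::kReject; // still fail loudly on real corruption
  }
  return RowIndexAction::kAdd;
}
```

Whether skipping is correct depends on what the sentinel means in Delta's DV-write plan, which is the open question above, so this only illustrates the shape of the change.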

This was written with the assistance of AI tooling.

Gluten version

main branch

Spark version

spark-4.0.x (actually Spark 4.1.0 -- Delta 4.2.0's default; the form has no 4.1 option)

Spark configurations

From the Delta-on-Gluten test harness (patched DeltaSQLCommandTest):

spark.plugins                = org.apache.gluten.GlutenPlugin
spark.shuffle.manager        = org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.memory.offHeap.enabled = true
spark.memory.offHeap.size    = 2g
Delta 4.2.0, Scala 2.13, JDK 17
(Delta defaults: deletion vectors enabled; predicate pushdown enabled.)

System information

CI runner: ubuntu-22.04 host, ~16 GB RAM, container apache/gluten:centos-9-jdk17. Not run via dev/info.sh (observed in CI).

Relevant logs

Delta Spark UT (Gluten) pipeline, apache/gluten run 28198677737, shard 1 (job 83536282846). The prior run 28148323203, on a byte-for-byte identical bundle, passed the same test (shard 1: 230 expected failures, 0 regressions) -- demonstrating the intermittency.

extended syntax - update + conditional insert - isPartitioned: true *** FAILED ***
org.apache.spark.SparkException: Job aborted due to stage failure:
  Task 0 in stage 1028.0 failed 1 times, most recent failure:
  Lost task 0.0 in stage 1028.0 (TID 843):
  org.apache.gluten.exception.GlutenException: ... Exception: VeloxRuntimeError
  Error Source: RUNTIME
  Error Code: INVALID_STATE
  Reason: (9223372036854775807 vs. 9223372030412324864)
          Delta RoaringBitmapArray row index 9223372036854775807
          exceeds max representable value 9223372030412324864
  Retriable: False
  Expression: value <= kMaxRepresentableValue
  Context: Operator: PartialAggregation[9] 9
  Function: addSafe
  File: /work/cpp/velox/compute/delta/RoaringBitmapArray.cpp
  Line: 92
  ...
  at org.apache.gluten.vectorized.ColumnarBatchOutIterator.nativeHasNext(Native Method)
  at org.apache.spark.shuffle.ColumnarShuffleWriter.internalWrite(ColumnarShuffleWriter.scala:135)
  at org.apache.spark.shuffle.ColumnarShuffleWriter.write(ColumnarShuffleWriter.scala:316)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:111)

Reproduction

  1. Build the Gluten Velox bundle (Spark 4.1 + Scala 2.13 + JDK 17, Delta profile).
  2. Run delta-io/delta v4.2.0 with the Gluten plugin enabled (spark.plugins=org.apache.gluten.GlutenPlugin), suite MergeIntoExtendedSyntaxSQLPathBasedDVsPredPushOnSuite, test "extended syntax - update + conditional insert - isPartitioned: true".

Impact / workaround

  • Intermittently fails any MERGE-with-DV workload, and makes the Delta-on-Gluten CI gate flaky (apache/gluten PR [VL][Delta] Delta CI pipeline #12278): the test is not in the known-failures baseline (it usually passes), so a run that hits the sentinel is reported as a regression and turns the gate red.
  • No good baseline workaround: because the failure is flaky, adding it to known-failures.txt would instead make the gate red on every run where it passes (the pipeline runs with DELTA_FAIL_ON_FIXED=true). A proper fix (or a dedicated flaky-quarantine list in the gate) is needed.
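The gate behaviour described in the two bullets above can be summarised as a small decision function. This is a hypothetical model of the pipeline's logic, not its actual implementation: judge, GateVerdict and the quarantine set are illustrative names, and failOnFixed stands for DELTA_FAIL_ON_FIXED=true.

```cpp
#include <cassert>
#include <set>
#include <string>

enum class GateVerdict { kOk, kRegression, kUnexpectedFix };

// Hypothetical model of the CI gate. `quarantine` is the proposed
// flaky-quarantine list; it does not exist in the pipeline today.
GateVerdict judge(
    const std::string& test,
    bool passed,
    const std::set<std::string>& knownFailures,
    const std::set<std::string>& quarantine,
    bool failOnFixed) {
  if (quarantine.count(test) > 0) {
    return GateVerdict::kOk; // flaky tests never decide the gate
  }
  const bool baselined = knownFailures.count(test) > 0;
  if (!passed && !baselined) {
    return GateVerdict::kRegression; // a run that hits the sentinel
  }
  if (passed && baselined && failOnFixed) {
    return GateVerdict::kUnexpectedFix; // baselined test that passes
  }
  return GateVerdict::kOk;
}
```

Without the quarantine set, a test that sometimes passes and sometimes fails turns the gate red in one of the two outcomes whether or not it is baselined.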
