[VL] Native memory OOMs the test/executor JVM when reading large (billions-of-rows) Delta tables with deletion vectors #12387

Description

@felipepessoto

Backend

VL (Velox)

Bug description

Expected: Reading from / deleting from a large Delta table that has deletion vectors (DVs) completes within a bounded, reasonable memory footprint. Vanilla Spark runs Delta's own "huge table" DV tests fine with a 1 GB test heap (-Xmx1024m).

Actual: Under the Gluten Velox bundle, the same reads grow the JVM's native (off-heap) memory monotonically until the kernel/cgroup OOM-kills the process. On Delta's synthetic 2-billion-row DV table the forked test JVM climbs to ~13 GB RSS even though its JVM heap is only -Xmx2G, i.e. ~11 GB is native (Velox), not heap. The growth tracks the duration of a single DV read over the huge table, which points at unbounded native materialization on the DV / metadata-row-index read path rather than normal query working set.

Concretely, two Delta tests reproduce it (suite org.apache.spark.sql.delta.deletionvectors.DeletionVectorsSuite):

  • huge table: read from tables of 2B rows with existing DV of many zeros
  • huge table: delete a small number of rows from tables of 2B rows with DVs

Both operate on the suite's 2B-row table5. The read test alone grew the fork from ~5.9 GB to ~13.3 GB over ~13 minutes before the OOM-kill.

Likely area: native row-index materialization on the DV read path. Delta DV reads use the metadata row index (spark.databricks.delta.deletionVectors.useMetadataRowIndex, default true), and Gluten offloads that path to Velox (apache/gluten #12269 only falls back DML DV scans when useMetadataRowIndex=false, so the default read path stays native). A maintainer with Velox memory-tracking context should confirm the exact allocation site and whether it can be bounded/spilled.

Gluten version

main branch

Spark version

spark-4.0.x (actually Spark 4.1.0 -- Delta 4.2.0's default; the form has no 4.1 option)

Spark configurations

From the Delta-on-Gluten test harness (patched DeltaSQLCommandTest):

spark.plugins                    = org.apache.gluten.GlutenPlugin
spark.shuffle.manager            = org.apache.spark.shuffle.sort.ColumnarShuffleManager
spark.memory.offHeap.enabled     = true
spark.memory.offHeap.size        = 2g
spark.gluten.sql.columnar.backend.velox... (default bundle config)
Delta 4.2.0, Scala 2.13, JDK 17

(The forked test JVM heap is -Xmx2G and off-heap is capped at 2g, yet the fork's RSS still reaches ~13 GB, far above the ~4 GB those two limits allow combined. The allocation appears to be untracked or not honoring the off-heap cap.)

System information

CI runner: ubuntu-22.04 host, ~16 GB RAM, container apache/gluten:centos-9-jdk17. Not run via dev/info.sh (observed in CI).

Relevant logs

Evidence from the Delta Spark UT (Gluten) pipeline, run 28071158711, shard 2 (job 83108337324). Per-minute memory profiler during the "read from tables of 2B rows with existing DV of many zeros" test (p1289 = forked test JVM with -Xmx2G; p382 = sbt launcher):

MEM cgroup=12.53G JVMs=[2664M(p382) 5869M(p1289)]
MEM cgroup=13.70G JVMs=[2664M(p382) 7777M(p1289)]
MEM cgroup=13.97G JVMs=[2664M(p382) 8431M(p1289)]
MEM cgroup=14.32G JVMs=[2623M(p382) 11629M(p1289)]
MEM cgroup=14.77G JVMs=[1815M(p382) 13122M(p1289)]   <- fork ~13.1G RSS, heap only 2G
MEM cgroup=14.91G JVMs=[1879M(p382) 13303M(p1289)]
Warning: Unable to read from client ...                 <- fork OOM-killed here
MEM cgroup=1.92G  JVMs=[1902M(p382)]                    <- fork gone; cgroup drops ~13G

After the kernel killed the fork, sbt wedged on the dead fork (no hs_err, no heap dump -- the signature of a kernel/cgroup OOM-kill rather than a JVM OOM), and a hang watchdog had to kill the shard after ~16 minutes of silence.
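The arithmetic behind the "~11 GB native" estimate can be checked directly from the profiler lines above. The following is a minimal sketch, not part of the CI tooling: the function names are hypothetical, the `M` suffix is assumed to mean MiB, and subtracting the `-Xmx` cap only gives a lower bound on non-heap memory (it also includes metaspace, thread stacks, and other JVM overhead, not just Velox allocations).

```python
import re

# One JVM entry inside a profiler line, e.g. "13303M(p1289)".
JVM_RE = re.compile(r"(\d+)M\(p(\d+)\)")

def parse_mem_line(line):
    """Return {pid: rss} for one 'MEM cgroup=...' line, or {} for other lines."""
    if not line.startswith("MEM "):
        return {}
    return {int(pid): int(rss) for rss, pid in JVM_RE.findall(line)}

def non_heap_lower_bound(rss, heap_cap):
    """RSS minus the heap cap: memory that cannot be Java heap."""
    return rss - heap_cap

# Profiler excerpt copied from this report.
LOG = """\
MEM cgroup=12.53G JVMs=[2664M(p382) 5869M(p1289)]
MEM cgroup=13.70G JVMs=[2664M(p382) 7777M(p1289)]
MEM cgroup=13.97G JVMs=[2664M(p382) 8431M(p1289)]
MEM cgroup=14.32G JVMs=[2623M(p382) 11629M(p1289)]
MEM cgroup=14.77G JVMs=[1815M(p382) 13122M(p1289)]   <- fork ~13.1G RSS, heap only 2G
MEM cgroup=14.91G JVMs=[1879M(p382) 13303M(p1289)]
Warning: Unable to read from client ...
MEM cgroup=1.92G  JVMs=[1902M(p382)]
"""

FORK_PID = 1289       # forked test JVM, per the report
HEAP_CAP = 2048       # -Xmx2G, assumed to be in the same unit as the samples

parsed = [parse_mem_line(line) for line in LOG.splitlines()]
samples = [p[FORK_PID] for p in parsed if FORK_PID in p]

growth = samples[-1] - samples[0]                         # 13303 - 5869
peak_non_heap = non_heap_lower_bound(max(samples), HEAP_CAP)  # 13303 - 2048

print(f"fork RSS samples: {samples}")
print(f"growth during the test: {growth}M")
print(f"non-heap lower bound at peak: {peak_non_heap}M")
```

On this excerpt the fork grows by 7434M, and at least 11255M of its peak RSS lies outside the Java heap, which matches the ~11 GB figure quoted in the description.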

Reproduction

  1. Build the Gluten Velox bundle (Spark 4.1 + Scala 2.13 + JDK 17, Delta profile).
  2. Run delta-io/delta v4.2.0 DeletionVectorsSuite with the Gluten plugin enabled (spark.plugins=org.apache.gluten.GlutenPlugin), e.g. the two "huge table ... 2B rows ... DV" tests above.
    • Equivalent minimal repro: with Gluten Velox enabled, run a count/sum scan over a Delta table of billions of rows that carries deletion vectors; watch native RSS grow without bound.
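To "watch native RSS grow" while the repro runs outside CI, one option is to sample `VmRSS` from `/proc/<pid>/status` on Linux. This is a rough sketch under stated assumptions: the helper names are hypothetical, it is not the profiler used in the pipeline, and it reports total RSS (heap plus native), so the heap cap still has to be subtracted by hand.

```python
import re
import time

def parse_vmrss_kb(status_text):
    """Extract the VmRSS value (kB, as printed by the kernel) from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None

def read_rss_mb(pid):
    """Read the current resident set size of a process, in units of 1024 kB."""
    with open(f"/proc/{pid}/status") as f:
        kb = parse_vmrss_kb(f.read())
    return None if kb is None else kb / 1024

def watch(pid, interval_s=60, max_samples=30):
    """Print one RSS sample per interval until the process exits or max_samples is reached."""
    for _ in range(max_samples):
        try:
            rss = read_rss_mb(pid)
        except (FileNotFoundError, ProcessLookupError):
            print(f"pid {pid} is gone (exited or was killed)")
            return
        print(f"{time.strftime('%H:%M:%S')} pid={pid} rss={rss}")
        time.sleep(interval_s)
```

Calling `watch(<executor or test JVM pid>)` in a separate terminal during the scan should show whether RSS keeps climbing past the configured heap and off-heap limits, as it did in the CI run.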

Impact / workaround

  • Makes large-table DV reads unusable under Gluten Velox (native memory blows up and the process is OOM-killed).
  • In the Delta CI pipeline (apache/gluten PR [VL][Delta] Delta CI pipeline #12278) these two tests are force-failed in setup-delta.sh to keep the shard from OOM-hanging. That workaround should be removed once this is fixed.

This was written with the assistance of AI tooling.

Relevant logs

https://github.com/apache/gluten/actions/runs/28071158711/job/83108337324

Labels: bug, triage