[VL][Delta] Add Delta Spark UT pipeline gated against a known-failures baseline by felipepessoto · Pull Request #12388 · apache/gluten

felipepessoto · 2026-06-28T07:33:38Z

What changes are proposed in this pull request?

Adds a CI pipeline that runs delta-io/delta's spark ScalaTest suite against the Gluten Velox bundle, so we can validate Gluten against a real Delta release and catch regressions over time.

Running the Delta UTs on Gluten produces many expected failures (Gluten does not yet offload every Delta code path, and falls back or behaves differently in places). A plain "red on any failure" gate would be useless. Instead, the pipeline keeps a committed baseline of known failures and gates each run against it:

regression -- a test fails that is not in the baseline -> the shard fails.
expected -- a failing test that is in the baseline -> ignored.
now-passing -- a baseline test that starts passing -> fails the shard (keeps the baseline honest), unless fail_on_fixed=false.
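
The three outcomes above reduce to set arithmetic over test identifiers. The following is a minimal sketch of that logic, not the actual compare-test-results.py; the names `GateResult`, `classify`, and `shard_fails` are hypothetical, and it assumes a baseline entry counts as now-passing only if the test actually ran in this shard.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    regressions: set   # failed, not in the baseline
    expected: set      # failed, listed in the baseline
    now_passing: set   # in the baseline, ran, and did not fail

    def shard_fails(self, fail_on_fixed: bool = True) -> bool:
        # A regression always fails the shard; a now-passing test fails it
        # only when fail_on_fixed is left enabled.
        return bool(self.regressions) or (fail_on_fixed and bool(self.now_passing))


def classify(ran: set, failed: set, baseline: set) -> GateResult:
    # Hypothetical helper: every argument is a set of test identifiers.
    return GateResult(
        regressions=failed - baseline,
        expected=failed & baseline,
        # Baseline entries that did not run in this shard are left alone.
        now_passing=(baseline & ran) - failed,
    )


result = classify(
    ran={"A", "B", "C", "D"},
    failed={"A", "B"},
    baseline={"B", "C", "Z"},
)
print(sorted(result.regressions), sorted(result.expected), sorted(result.now_passing))
```

Restricting now-passing to tests that ran keeps a sharded run from flagging baseline entries that belong to another shard.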

How it works

Runs as a reusable workflow (on: workflow_call) invoked from velox_backend_x86.yml, so it reuses the Velox native libs + Arrow jars that workflow already builds instead of duplicating the expensive native C++ build. It then assembles the gluten-velox-bundle fat jar (Spark 4.1 + Scala 2.13 + JDK 17, Delta profile). A workflow_dispatch trigger is kept for standalone manual runs (which build the native lib themselves).
Clones delta-io/delta at a release tag (currently v4.2.0), drops the bundle onto the spark project's test classpath, patches DeltaSQLCommandTest to register GlutenPlugin, and cherry-picks two merged upstream Delta test-only fixes (delta-io/delta#7104, "[Spark] [Test] Collect scans by FileSourceScanLike in ScanReportHelper", and its follow-up delta-io/delta#7105) that widen FileSourceScanExec checks to FileSourceScanLike so Gluten's transformed plan is recognized.
Runs sbt spark/test sharded by suite across 4 shards (4 forked test JVMs each, ~16-way parallelism), with ScalaTest's JUnit XML reporter enabled, then gates each shard with compare-test-results.py against known-failures.txt. A final job aggregates all shards into a single ready-to-commit baseline and flags stale entries.
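
The gate consumes the JUnit XML that ScalaTest writes. As a rough illustration of that parsing step using only the standard library, the sketch below collects the tests that ran and the tests that failed from one report. The function name `parse_junit`, the `classname :: name` identifier format, and the sample suite and test names are assumptions for illustration; the real script may key tests differently.

```python
import xml.etree.ElementTree as ET

# Hypothetical report in the common JUnit XML shape.
SAMPLE = """
<testsuite name="org.example.DemoSuite" tests="3" failures="1" errors="0">
  <testcase classname="org.example.DemoSuite" name="reads table" time="0.1"/>
  <testcase classname="org.example.DemoSuite" name="writes table" time="0.2">
    <failure message="boom" type="java.lang.AssertionError">stack trace</failure>
  </testcase>
  <testcase classname="org.example.DemoSuite" name="ignored case" time="0.0">
    <skipped/>
  </testcase>
</testsuite>
"""


def parse_junit(xml_text: str):
    """Return (ran, failed) sets of test identifiers from one JUnit XML report."""
    root = ET.fromstring(xml_text)
    ran, failed = set(), set()
    for case in root.iter("testcase"):
        # Assumed identifier format: "<classname> :: <test name>".
        test_id = f"{case.get('classname', '')} :: {case.get('name', '')}"
        if case.find("skipped") is not None:
            continue  # skipped tests neither ran nor failed
        ran.add(test_id)
        if case.find("failure") is not None or case.find("error") is not None:
            failed.add(test_id)
    return ran, failed


ran, failed = parse_junit(SAMPLE)
print(len(ran), "ran;", len(failed), "failed")
```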

Files

| File | Purpose |
|---|---|
| `.github/workflows/velox_backend_x86.yml` | Caller: builds the native lib once, uploads the native + Arrow artifacts, and invokes the reusable Delta workflow (reusing that build instead of duplicating it). |
| `.github/workflows/delta_spark_ut.yml` | The reusable Delta workflow (build bundle -> shard tests -> gate). |
| `.github/workflows/util/delta-spark-ut/setup-delta.sh` | Clones Delta, injects the Gluten bundle, patches `DeltaSQLCommandTest`, cherry-picks the upstream test fixes. |
| `.github/workflows/util/delta-spark-ut/compare-test-results.py` | Parses JUnit XML and enforces / seeds / aggregates against the baseline (stdlib only). |
| `.github/workflows/util/delta-spark-ut/known-failures.txt` | Committed baseline of currently-expected failures (`#` comments per line). |
| `.github/workflows/util/delta-spark-ut/README.md` | Documents the gate, bootstrapping, and baseline refresh. |
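
For readers unfamiliar with baseline files of this kind, here is a small sketch of loading one. The exact layout of known-failures.txt is not shown in this PR description, so the format below is an assumption: one test identifier per line, `#` starting a comment that runs to the end of the line, and identifiers that never contain `#`. The function name `load_baseline` and the sample entries are hypothetical.

```python
def load_baseline(text: str) -> set:
    """Parse baseline text into a set of test identifiers."""
    entries = set()
    for line in text.splitlines():
        # Assumption: everything from the first '#' onward is a comment.
        entry = line.split("#", 1)[0].strip()
        if entry:
            entries.add(entry)
    return entries


SAMPLE = """\
# hypothetical baseline contents
org.example.DemoSuite :: writes table   # tracked in an issue
org.example.OtherSuite :: merges rows

"""

print(sorted(load_baseline(SAMPLE)))
```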

Operational hardening

JDK 17 + Arrow/Netty: forked test JVMs get the --add-opens set plus -Dio.netty.tryReflectionSetAccessible=true (otherwise Arrow's allocator fails to initialize).
Heap tuning: forked-test heap and the sbt launcher's idle G1 behavior are tuned to keep the ~16 GB runner under the cgroup OOM threshold.
Hang watchdog: a per-shard watchdog dumps threads and kills a forked test JVM that has gone silent too long, so a wedged suite can't stall the whole job.
DeletionVectorsSuite 2B-row tests: two tests build/read/delete a 2-billion-row table and balloon the fork to ~13 GB of native memory (Velox row-index materialization), OOM-killing it and hanging the shard. They are force-failed (with a clear message) rather than silently ignored, so the gap stays visible until the native memory blow-up is fixed.
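
The hang watchdog can be pictured as a silence check followed by a thread dump and a kill. The sketch below is an illustration under stated assumptions, not the workflow's implementation: it assumes the fork's output goes to a log file whose modification time marks the last sign of life, that the JDK's `jstack` tool is on the PATH, and that it runs on Linux. The names `is_wedged` and `dump_and_kill` are hypothetical.

```python
import os
import signal
import subprocess
import time


def is_wedged(last_output_ts: float, now: float, timeout_s: float) -> bool:
    """True when the fork has produced no output for at least timeout_s seconds."""
    return (now - last_output_ts) >= timeout_s


def dump_and_kill(pid: int) -> None:
    """Print a thread dump for the JVM, then kill it."""
    try:
        dump = subprocess.run(
            ["jstack", str(pid)], capture_output=True, text=True, timeout=60
        )
        print(dump.stdout)
    except (FileNotFoundError, subprocess.TimeoutExpired) as exc:
        # The dump is best effort; a missing or stuck jstack must not block the kill.
        print(f"thread dump unavailable: {exc}")
    os.kill(pid, signal.SIGKILL)


def watch_once(pid: int, log_path: str, timeout_s: float) -> bool:
    """Check one fork; return True if it was killed."""
    if is_wedged(os.path.getmtime(log_path), time.time(), timeout_s):
        dump_and_kill(pid)
        return True
    return False
```

A caller would invoke `watch_once` periodically for each forked test JVM. The timeout value is a tuning choice that this description does not specify.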

Scope / known limitations

Velox backend, x86 only; Delta v4.2.0 / Spark 4.1 / Scala 2.13 / JDK 17.
The baseline reflects the current set of known Delta-on-Gluten failures; refresh it via a workflow_dispatch run with update_baseline=true.
Future work -- Delta 4.3.0: attempted, but the bundle (compiled against Delta 4.1.0) hits a binary-incompatible Delta change (the first parameter of IdentityColumn.logTableWrite changed from Snapshot to SnapshotDescriptor), so every write throws NoSuchMethodError. Supporting 4.3.0 requires building the bundle against 4.3.0; tracked as follow-up.

How was this patch tested?

This change is CI. The Delta suite runs as part of velox_backend_x86.yml -- on every PR/trigger that touches Velox/core/cpp or the Delta CI files -- and via manual workflow_dispatch. In the latest runs all shards pass against the committed baseline (failures limited to known-failures entries; no regressions).

19,073 Delta tests run (18,297 passed / 776 failed).

Main failures (776 baseline):

226 tests - Increment Metric: known issue [VL] IncrementMetric doesn't work in some cases #9003. Test with increment metric offload disabled
99 tests - VariantType - java.lang.UnsupportedOperationException: Unsupported data type: variant - Arrow throws (SparkArrowUtil.scala:60)
~47 tests - ClassCast ProjectExec -> WholeStageTransformer (Delta stats) - This will be addressed by [VL] Support TIMESTAMP_NTZ Type #11622 (comment) timestamp -> timestamp_ntz

Fixed since the first draft (#12371): the 187 MatchError List() DataSkipping-empty-stats failures (caused by a FileSourceScanExec match) were fixed by cherry-picking the merged Delta PRs 7104 + 7105 (FileSourceScanExec -> FileSourceScanLike) during test setup. That dropped the baseline from 963 to 776 known failures (187 now-passing removed, 0 regressions).

Delta Spark UT (Gluten) -- shard count vs test parallelism

Sharding is by suite (MurmurHash3(suiteName) % NUM_SHARDS), so total test work is fixed (~1250 fork-minutes). The runners are 4-core / ~16 GB. The committed config is 4 shards x 4 forks.

| Config | Runner jobs | Forks/shard | Max shard | Wall-clock | Billed job-hrs* | Outcome |
|---|---|---|---|---|---|---|
| 16 shards x 1 fork | 16 | 1 | ~110 min | ~130 min | ~29 | green |
| 4 shards x 4 forks | 4 | 4 | 158 min | 178 min | ~10.5 | green |
| 4 shards x 1 fork | 4 | 1 | 360 min (hit cap) | -- | -- | cancelled |
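
Suite-level sharding of the form `hash(suiteName) % NUM_SHARDS` can be sketched as follows. The pipeline uses MurmurHash3; the Python standard library has no MurmurHash3, so this sketch substitutes `zlib.crc32` as a stand-in. Its shard assignments will therefore differ from the real workflow's, and only the partitioning idea carries over. The function names are hypothetical, and the suite names are ones mentioned elsewhere in this PR.

```python
import zlib

NUM_SHARDS = 4


def shard_of(suite_name: str, num_shards: int = NUM_SHARDS) -> int:
    # Stand-in hash: crc32 instead of the MurmurHash3 used by the pipeline.
    return zlib.crc32(suite_name.encode("utf-8")) % num_shards


def suites_for_shard(suites, shard: int, num_shards: int = NUM_SHARDS):
    """Return the suites assigned to one shard, in a stable order."""
    return sorted(s for s in suites if shard_of(s, num_shards) == shard)


SUITES = [
    "DeletionVectorsSuite",
    "RowIdSuite",
    "OptimizeGeneratedColumnSuite",
    "HiveConvertToDeltaSuite",
    "BitmapAggregatorE2ESuite",
]

for shard in range(NUM_SHARDS):
    print(shard, suites_for_shard(SUITES, shard))
```

Because the assignment depends only on the suite name, every shard computes the same partition independently, and each suite runs in exactly one shard.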

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI

…es baseline

Run delta-io/delta's `spark` ScalaTest suite against a Gluten Velox bundle in CI and gate the results against a committed baseline so the many expected Delta-on-Gluten failures stay manageable and can be fixed incrementally without letting currently-passing tests silently regress.

What it adds (.github/workflows/util/delta-spark-ut/):

- delta_spark_ut.yml: builds the native lib + Gluten bundle, then runs the Delta spark suite sharded by suite into 4 shards x 4 forked test JVMs (~16-way), and gates each shard against the baseline.
- compare-test-results.py: the gate. Per shard, regressions (failed not in the baseline) fail the build; newly-passing baselined tests are flagged so the baseline can be tightened. Also supports seed/aggregate modes.
- known-failures.txt: the committed baseline of expected failures.
- setup-delta.sh: clones Delta, injects the Gluten bundle, patches DeltaSQLCommandTest, and force-fails the two DeletionVectorsSuite 2B-row tests whose native row-index materialization OOM-kills the runner and hangs the shard.
- README.md: how the pipeline, gating and baseline-refresh work.

The workflow also carries a hang watchdog that thread-dumps and kills a wedged fork, and tunes the per-fork heap (2G) and off-heap (2G) to fit the ~16G runner.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…line

Delta's data-skipping, limit-push-down, column-pruning and scan-metric tests collect file-source scans by matching the concrete `FileSourceScanExec` case class. Under the Gluten Velox bundle the scan is offloaded to DeltaScanTransformer, a sibling that implements the same `FileSourceScanLike` interface but is not FileSourceScanExec, so the match misses and the scan looks absent. This surfaced as `scala.MatchError: List()` (~56 DataSkipping*/DeltaLimitPushDown* tests), empty generated-column partition filters (~45 OptimizeGeneratedColumnSuite tests) and broken column-pruning / scan-metric checks across the Delete, Update, Merge, DeletionVectors and RowId suites and the TestsStatistics helper.

Gluten copies `partitionFilters` and the other accessors these tests read verbatim onto the offloaded scan, so results are identical to vanilla -- only the test's `case` match breaks. Fix it by cherry-picking the two merged upstream Delta commits that widen these matches to the shared `FileSourceScanLike` interface (behavior-preserving for vanilla, which also implements it):

* delta-io/delta#7104 -- ScanReportHelper.collectScans
* delta-io/delta#7105 -- the remaining 9 test sources, its follow-up

Both are merged on Delta master but land after the ref this workflow builds against (v4.2.0), so setup-delta.sh cherry-picks them onto the shallow checkout. Each fetches the fix commit at depth 2 (commit + parent) so cherry-pick can compute the parent->fix diff, and uses `cherry-pick -n` so no committer identity is required. Once the pinned DELTA_REF advances to include a commit its cherry-pick becomes a clean no-op and that block can be removed.

The cherry-picks run before the DeletionVectorsSuite 2B-row force-fail step: that step sed-injects fail() into DeletionVectorsSuite.scala, which delta-io/delta#7105 also edits, and git cherry-pick refuses to apply onto a working tree with uncommitted changes to a file it touches (exit 128).

Refresh known-failures.txt from run 28299900971 (the delta-spark-aggregate job output), which ran all 19073 tests across 16 shards: removes 187 now-passing tests with 0 regressions, 963 -> 776. ~147 come from the fixes above (DataSkipping*, DeltaLimitPushDown*, OptimizeGeneratedColumnSuite, MergeInto*, RowIdSuite); the remaining ~40 are other suites that now pass (e.g. HiveConvertToDeltaSuite, BitmapAggregatorE2ESuite). Verified against the per-shard ran/failed lists: every baseline entry was observed this run (0 stale), so nothing was dropped due to a crashed or incomplete shard.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
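
The refresh described above (union the per-shard results, rebuild the baseline, and check for stale entries) can be sketched with plain sets. This is an illustration of the idea rather than the script's aggregate mode; the function name `aggregate` and the per-shard dictionary shape are assumptions.

```python
def aggregate(shards, old_baseline: set) -> dict:
    """Combine per-shard results into a refreshed baseline.

    shards: iterable of {"ran": set, "failed": set} dictionaries (assumed shape).
    """
    shards = list(shards)
    ran = set().union(*(s["ran"] for s in shards))
    failed = set().union(*(s["failed"] for s in shards))
    return {
        # Every test that failed somewhere becomes the new baseline.
        "new_baseline": failed,
        # Baseline entries that ran and passed are dropped from the baseline.
        "now_passing": (old_baseline & ran) - failed,
        # Failures that the old baseline did not know about.
        "regressions": failed - old_baseline,
        # Baseline entries never observed, e.g. because a shard crashed.
        "stale": old_baseline - ran,
    }


report = aggregate(
    shards=[
        {"ran": {"a", "b"}, "failed": {"a"}},
        {"ran": {"c", "d"}, "failed": set()},
    ],
    old_baseline={"a", "c", "z"},
)
for key in ("new_baseline", "now_passing", "regressions", "stale"):
    print(key, sorted(report[key]))
```

Reporting stale entries separately matters because an entry that never ran says nothing about whether the test still fails; the refresh above was accepted only after confirming 0 stale entries.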

Make delta_spark_ut.yml a reusable workflow (on: workflow_call) and call it from velox_backend_x86.yml so the Delta tests reuse the native lib + arrow jars that workflow already builds, instead of duplicating the build-native-lib-centos-7 job. GitHub artifacts cannot be shared across workflows, so the only way to reuse the artifact is to run the Delta jobs in the same workflow run.

delta_spark_ut.yml keeps a workflow_dispatch trigger for standalone manual runs (its build-native-lib-centos-7 job is gated to that case and skipped when called); the pull_request trigger is removed so the suite no longer double-runs.

velox_backend_x86.yml gains an arrow-jars upload on its native build and a delta-spark-ut job that calls the reusable workflow. That job runs on every velox trigger like the other spark-test jobs, since core/velox/substrait/cpp changes can affect Delta query offload.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

This PR adds a Delta Lake spark module unit-test CI pipeline for the Velox backend, integrating it into the existing velox_backend_x86.yml workflow and gating results against a committed “known failures” baseline so regressions are detected without failing on expected gaps.

Changes:

Adds a reusable GitHub Actions workflow to build the Gluten Velox bundle and run sharded Delta ScalaTest suites, with baseline enforcement/aggregation.
Adds Delta setup + patching utilities (clone, inject bundle on test classpath, patch tests, apply upstream test fixes, and force-fail known OOM-inducing tests).
Adds a committed known-failures.txt baseline plus documentation for maintaining it.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file

| File | Description |
|---|---|
| `.github/workflows/velox_backend_x86.yml` | Invokes the reusable Delta UT workflow and uploads native + Arrow artifacts for reuse. |
| `.github/workflows/delta_spark_ut.yml` | Implements the reusable (and dispatchable) Delta UT pipeline: build bundle, shard tests, gate, aggregate. |
| `.github/workflows/util/delta-spark-ut/setup-delta.sh` | Prepares a Delta clone for testing with the Gluten bundle; applies targeted patches/cherry-picks. |
| `.github/workflows/util/delta-spark-ut/compare-test-results.py` | Parses JUnit XML and enforces/seeds/aggregates results against the known-failures baseline. |
| `.github/workflows/util/delta-spark-ut/known-failures.txt` | Baseline list of currently expected failing Delta tests under Gluten. |
| `.github/workflows/util/delta-spark-ut/README.md` | Documents baseline seeding, enforcement behavior, and refresh workflows. |

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

`jar=$(ls <glob> | head -n1)` aborts the step under `set -euo pipefail` when the glob matches nothing: `ls` exits non-zero, pipefail propagates it, and set -e exits before the explicit "jar not found" check can print an actionable error -- the log shows only a generic `ls: cannot access`. Make the lookup non-fatal (`2>/dev/null ... || true`) so the check runs, and add the missing check to the Clone-and-patch-Delta step. Addresses PR review feedback. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

felipepessoto and others added 3 commits June 27, 2026 09:12

github-actions Bot added INFRA DOCS labels Jun 28, 2026

Change order of steps

78ac085

felipepessoto wants to merge 5 commits into apache:main from felipepessoto:delta_pipeline
