[VL][Delta] Add Delta Spark UT pipeline gated against a known-failures baseline #12388

Open
felipepessoto wants to merge 5 commits into apache:main from felipepessoto:delta_pipeline

@felipepessoto

Fix #9296.

What changes are proposed in this pull request?

Adds a CI pipeline that runs delta-io/delta's spark ScalaTest suite against the Gluten Velox bundle, so we can validate Gluten against a real Delta release and catch regressions over time.

Running the Delta UTs on Gluten produces many expected failures (Gluten does not yet offload every Delta code path, and falls back or behaves differently in places). A plain "red on any failure" gate would be useless. Instead, the pipeline keeps a committed baseline of known failures and gates each run against it:

  • regression -- a test fails that is not in the baseline -> the shard fails.
  • expected -- a failing test that is in the baseline -> ignored.
  • now-passing -- a baseline test that starts passing -> fails the shard (keeps the baseline honest), unless fail_on_fixed=false.
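
The three outcomes above can be sketched with plain set arithmetic. This is a minimal illustration, not the code in compare-test-results.py: the names `GateResult` and `gate` are hypothetical, and test identifiers are assumed to be opaque strings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    regressions: set   # failed and NOT in the baseline
    expected: set      # failed and in the baseline
    now_passing: set   # in the baseline, ran in this shard, and passed

    def ok(self, fail_on_fixed: bool = True) -> bool:
        # Any regression fails the shard; a now-passing test fails it
        # only while fail_on_fixed is left at its default of True.
        if self.regressions:
            return False
        return not (fail_on_fixed and self.now_passing)


def gate(ran: set, failed: set, baseline: set) -> GateResult:
    """Classify one shard's results against the known-failures baseline."""
    passed = ran - failed
    return GateResult(
        regressions=failed - baseline,
        expected=failed & baseline,
        # Only tests that actually ran here count: a baseline entry that
        # belongs to another shard is neither passing nor failing here.
        now_passing=baseline & passed,
    )
```

Restricting now-passing to tests that ran in the shard matters because each shard sees only a slice of the suites, while the baseline covers all of them.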

How it works

  1. Runs as a reusable workflow (on: workflow_call) invoked from velox_backend_x86.yml, so it reuses the Velox native libs + Arrow jars that workflow already builds instead of duplicating the expensive native C++ build. It then assembles the gluten-velox-bundle fat jar (Spark 4.1 + Scala 2.13 + JDK 17, Delta profile). A workflow_dispatch trigger is kept for standalone manual runs (which build the native lib themselves).
  2. Clones delta-io/delta at a release tag (currently v4.2.0), drops the bundle onto the spark project's test classpath, patches DeltaSQLCommandTest to register GlutenPlugin, and cherry-picks two merged upstream Delta test-only fixes (delta-io/delta#7104, "[Spark] [Test] Collect scans by FileSourceScanLike in ScanReportHelper", and its follow-up delta-io/delta#7105) that widen FileSourceScanExec checks to FileSourceScanLike so Gluten's transformed plan is recognized.
  3. Runs sbt spark/test sharded by suite across 4 shards (4 forked test JVMs each, ~16-way parallelism), with ScalaTest's JUnit XML reporter enabled, then gates each shard with compare-test-results.py against known-failures.txt. A final job aggregates all shards into a single ready-to-commit baseline and flags stale entries.
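
As a rough illustration of the gating input in step 3, a JUnit XML report can be reduced to "ran" and "failed" sets with the standard library alone. This is a sketch under assumptions: `parse_junit` is a hypothetical name, and the `classname::name` identifier format is invented here, not taken from compare-test-results.py.

```python
import xml.etree.ElementTree as ET


def parse_junit(xml_text: str):
    """Return (ran, failed) sets of test ids from one JUnit XML report."""
    root = ET.fromstring(xml_text)
    ran, failed = set(), set()
    # iter() also works when <testsuite> elements are nested in <testsuites>.
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            continue  # skipped/ignored tests neither pass nor fail
        # Hypothetical id format: "<classname>::<test name>".
        test_id = f"{case.get('classname')}::{case.get('name')}"
        ran.add(test_id)
        if case.find("failure") is not None or case.find("error") is not None:
            failed.add(test_id)
    return ran, failed
```

Treating both `<failure>` and `<error>` children as failures keeps aborted tests from being counted as passes.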

Files

| File | Purpose |
| --- | --- |
| .github/workflows/velox_backend_x86.yml | Caller: builds the native lib once, uploads the native + Arrow artifacts, and invokes the reusable Delta workflow (reusing that build instead of duplicating it). |
| .github/workflows/delta_spark_ut.yml | The reusable Delta workflow (build bundle -> shard tests -> gate). |
| .github/workflows/util/delta-spark-ut/setup-delta.sh | Clones Delta, injects the Gluten bundle, patches DeltaSQLCommandTest, cherry-picks the upstream test fixes. |
| .github/workflows/util/delta-spark-ut/compare-test-results.py | Parses JUnit XML and enforces / seeds / aggregates against the baseline (stdlib only). |
| .github/workflows/util/delta-spark-ut/known-failures.txt | Committed baseline of currently-expected failures (# comments per line). |
| .github/workflows/util/delta-spark-ut/README.md | Documents the gate, bootstrapping, and baseline refresh. |

Operational hardening

  • JDK 17 + Arrow/Netty: forked test JVMs get the --add-opens set plus -Dio.netty.tryReflectionSetAccessible=true (otherwise Arrow's allocator fails to initialize).
  • Heap tuning: forked-test heap and the sbt launcher's idle G1 behavior are tuned to keep the ~16 GB runner under the cgroup OOM threshold.
  • Hang watchdog: a per-shard watchdog dumps threads and kills a forked test JVM that has gone silent too long, so a wedged suite can't stall the whole job.
  • DeletionVectorsSuite 2B-row tests: two tests build/read/delete a 2-billion-row table and balloon the fork to ~13 GB of native memory (Velox row-index materialization), OOM-killing it and hanging the shard. They are force-failed (with a clear message) rather than silently ignored, so the gap stays visible until the native memory blow-up is fixed.
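
One way a hang watchdog like the one described above could work is sketched below. It assumes, without confirmation from the workflow, that "gone silent" is judged by the modification time of a log file and that the thread dump is requested with SIGQUIT (which a HotSpot JVM answers by printing all thread stacks). `check_fork` and its parameters are hypothetical names, and the sketch is Unix-only.

```python
import os
import signal
import time


def check_fork(pid, log_path, silence_limit_s, now=None,
               kill=os.kill, sleep=time.sleep, dump_grace_s=5):
    """Kill the forked JVM `pid` if its log has been silent too long.

    Returns True if the fork was considered wedged and killed.
    """
    now = time.time() if now is None else now
    silent_for = now - os.path.getmtime(log_path)
    if silent_for <= silence_limit_s:
        return False
    kill(pid, signal.SIGQUIT)   # ask the JVM for a thread dump
    sleep(dump_grace_s)         # give it a moment to write the dump
    kill(pid, signal.SIGKILL)   # then terminate the wedged fork
    return True
```

The `kill` and `sleep` parameters are injectable only so the logic can be exercised without a real process; a real watchdog would call this periodically for each forked test JVM.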

Scope / known limitations

  • Velox backend, x86 only; Delta v4.2.0 / Spark 4.1 / Scala 2.13 / JDK 17.
  • The baseline reflects the current set of known Delta-on-Gluten failures; refresh it via a workflow_dispatch run with update_baseline=true.
  • Future work -- Delta 4.3.0: attempted, but the bundle (compiled against Delta 4.1.0) hits a binary-incompatible Delta change (IdentityColumn.logTableWrite first param Snapshot -> SnapshotDescriptor), which NoSuchMethodErrors on every write. Supporting 4.3.0 needs the bundle built against 4.3.0; tracked as follow-up.

How was this patch tested?

This change is CI. The Delta suite runs as part of velox_backend_x86.yml -- on every PR/trigger that touches Velox/core/cpp or the Delta CI files -- and via manual workflow_dispatch. In the latest runs all shards pass against the committed baseline (failures limited to known-failures entries; no regressions).

19,073 Delta tests run (18,297 passed / 776 failed).

Main failures (776 baseline):

Fixed since the first draft (#12371): failures caused by tests matching the concrete FileSourceScanExec (most visibly the MatchError List() DataSkipping empty-stats failures) were fixed by cherry-picking the merged Delta PRs delta-io/delta#7104 + delta-io/delta#7105 (FileSourceScanExec -> FileSourceScanLike) during test setup. The refreshed baseline dropped from 963 to 776 known failures (187 now-passing removed, 0 regressions); roughly 147 of the 187 come from the cherry-picks and the remaining ~40 are other suites that now pass.

Delta Spark UT (Gluten) -- shard count vs test parallelism

Sharding is by suite (MurmurHash3(suiteName) % NUM_SHARDS), so total test work is fixed (~1250 fork-minutes). The runners are 4-core / ~16 GB. The committed config is 4 shards x 4 forks.
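
A minimal sketch of suite-level sharding follows. The pipeline hashes with MurmurHash3; this example deliberately substitutes `zlib.crc32` as a stand-in, purely because it is deterministic and in the Python standard library, so the shard numbers it produces will not match the real pipeline's. `shard_of` and `suites_for_shard` are hypothetical names.

```python
import zlib

NUM_SHARDS = 4


def shard_of(suite_name: str, num_shards: int = NUM_SHARDS) -> int:
    # Stand-in hash (NOT MurmurHash3): any stable hash gives the same
    # property, namely that a suite always lands in the same shard.
    return zlib.crc32(suite_name.encode("utf-8")) % num_shards


def suites_for_shard(suites, shard: int, num_shards: int = NUM_SHARDS):
    """Suites that a given shard should run, in a stable order."""
    return sorted(s for s in suites if shard_of(s, num_shards) == shard)
```

Because assignment depends only on the suite name, every suite runs in exactly one shard regardless of how many suites exist, which is why the total test work stays fixed as the shard count changes.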

| Config | Runner jobs | Forks/shard | Max shard | Wall-clock | Billed job-hrs* | Outcome |
| --- | --- | --- | --- | --- | --- | --- |
| 16 shards x 1 fork | 16 | 1 | ~110 min | ~130 min | ~29 | green |
| 4 shards x 4 forks | 4 | 4 | 158 min | 178 min | ~10.5 | green |
| 4 shards x 1 fork | 4 | 1 | 360 min (hit cap) | -- | -- | cancelled |
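
The billed figures in the table are consistent with multiplying the number of runner jobs by the longest shard's duration. That formula is an inference from the numbers, not something the PR states, so the helper below (a hypothetical name) should be read as a back-of-the-envelope check only.

```python
def billed_job_hours(runner_jobs: int, max_shard_minutes: float) -> float:
    # Assumed estimate: every shard is billed as if it ran as long as
    # the slowest one, which makes this an upper bound.
    return runner_jobs * max_shard_minutes / 60


print(round(billed_job_hours(16, 110), 1))  # 16 shards x 1 fork
print(round(billed_job_hours(4, 158), 1))   # 4 shards x 4 forks
```

Under this assumption the 16-shard configuration comes out at about 29.3 job-hours and the 4x4 configuration at about 10.5, in line with the table.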

Was this patch authored or co-authored using generative AI tooling?

Generated-by: GitHub Copilot CLI

felipepessoto and others added 3 commits June 27, 2026 09:12
…es baseline

Run delta-io/delta's `spark` ScalaTest suite against a Gluten Velox bundle in CI
and gate the results against a committed baseline so the many expected Delta-on-
Gluten failures stay manageable and can be fixed incrementally without letting
currently-passing tests silently regress.

What it adds (.github/workflows/util/delta-spark-ut/):
- delta_spark_ut.yml: builds the native lib + Gluten bundle, then runs the Delta
  spark suite sharded by suite into 4 shards x 4 forked test JVMs (~16-way), and
  gates each shard against the baseline.
- compare-test-results.py: the gate. Per shard, regressions (failed not in the
  baseline) fail the build; newly-passing baselined tests are flagged so the
  baseline can be tightened. Also supports seed/aggregate modes.
- known-failures.txt: the committed baseline of expected failures.
- setup-delta.sh: clones Delta, injects the Gluten bundle, patches
  DeltaSQLCommandTest, and force-fails the two DeletionVectorsSuite 2B-row tests
  whose native row-index materialization OOM-kills the runner and hangs the shard.
- README.md: how the pipeline, gating and baseline-refresh work.

The workflow also carries a hang watchdog that thread-dumps and kills a wedged
fork, and tunes the per-fork heap (2G) and off-heap (2G) to fit the ~16G runner.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…line

Delta's data-skipping, limit-push-down, column-pruning and scan-metric tests
collect file-source scans by matching the concrete `FileSourceScanExec` case
class. Under the Gluten Velox bundle the scan is offloaded to
DeltaScanTransformer, a sibling that implements the same `FileSourceScanLike`
interface but is not FileSourceScanExec, so the match misses and the scan
looks absent. This surfaced as `scala.MatchError: List()` (~56
DataSkipping*/DeltaLimitPushDown* tests), empty generated-column partition
filters (~45 OptimizeGeneratedColumnSuite tests) and broken column-pruning /
scan-metric checks across the Delete, Update, Merge, DeletionVectors and
RowId suites and the TestsStatistics helper.

Gluten copies `partitionFilters` and the other accessors these tests read
verbatim onto the offloaded scan, so results are identical to vanilla -- only
the test's `case` match breaks. Fix it by cherry-picking the two merged
upstream Delta commits that widen these matches to the shared
`FileSourceScanLike` interface (behavior-preserving for vanilla, which also
implements it):

  * delta-io/delta#7104 -- ScanReportHelper.collectScans
  * delta-io/delta#7105 -- the remaining 9 test sources, its follow-up

Both are merged on Delta master but land after the ref this workflow builds
against (v4.2.0), so setup-delta.sh cherry-picks them onto the shallow
checkout. Each fetches the fix commit at depth 2 (commit + parent) so
cherry-pick can compute the parent->fix diff, and uses `cherry-pick -n` so no
committer identity is required. Once the pinned DELTA_REF advances to include
a commit, its cherry-pick becomes a clean no-op and that block can be removed.

The cherry-picks run before the DeletionVectorsSuite 2B-row force-fail step:
that step sed-injects fail() into DeletionVectorsSuite.scala, which
delta-io/delta#7105 also edits, and git cherry-pick refuses to apply onto a
working tree with uncommitted changes to a file it touches (exit 128).

Refresh known-failures.txt from run 28299900971 (the delta-spark-aggregate job
output), which ran all 19073 tests across 16 shards: removes 187 now-passing
tests with 0 regressions, 963 -> 776. ~147 come from the fixes above
(DataSkipping*, DeltaLimitPushDown*, OptimizeGeneratedColumnSuite, MergeInto*,
RowIdSuite); the remaining ~40 are other suites that now pass (e.g.
HiveConvertToDeltaSuite, BitmapAggregatorE2ESuite). Verified against the
per-shard ran/failed lists: every baseline entry was observed this run (0
stale), so nothing was dropped due to a crashed or incomplete shard.
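
The refresh and the stale check described above can be sketched as follows, assuming each shard reports a (ran, failed) pair of test-id sets. `aggregate` is a hypothetical name and this is not the aggregate mode of compare-test-results.py itself.

```python
def aggregate(shards, baseline):
    """Merge per-shard results into a refreshed baseline.

    shards:   iterable of (ran, failed) pairs, one per shard
    baseline: set of currently known failures
    Returns (new_baseline, now_passing, stale) as sorted lists.
    """
    shards = list(shards)
    ran = set().union(*(r for r, _ in shards))
    failed = set().union(*(f for _, f in shards))
    new_baseline = sorted(failed)
    # Baseline entries that ran and passed can be dropped safely.
    now_passing = sorted((baseline & ran) - failed)
    # Baseline entries never observed are stale: the test was renamed or
    # removed, or its shard crashed before reporting it.
    stale = sorted(baseline - ran)
    return new_baseline, now_passing, stale
```

Keeping stale entries separate from now-passing ones is what allows a refresh to be trusted: zero stale entries means no baseline entry disappeared merely because a shard was incomplete.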

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Make delta_spark_ut.yml a reusable workflow (on: workflow_call) and call it from
velox_backend_x86.yml so the Delta tests reuse the native lib + arrow jars that
workflow already builds, instead of duplicating the build-native-lib-centos-7
job. GitHub artifacts cannot be shared across workflows, so the only way to
reuse the artifact is to run the Delta jobs in the same workflow run.

delta_spark_ut.yml keeps a workflow_dispatch trigger for standalone manual runs
(its build-native-lib-centos-7 job is gated to that case and skipped when
called); the pull_request trigger is removed so the suite no longer double-runs.
velox_backend_x86.yml gains an arrow-jars upload on its native build and a
delta-spark-ut job that calls the reusable workflow. That job runs on every
velox trigger like the other spark-test jobs, since core/velox/substrait/cpp
changes can affect Delta query offload.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 28, 2026 07:33

Copilot AI left a comment

Pull request overview

This PR adds a Delta Lake spark module unit-test CI pipeline for the Velox backend, integrating it into the existing velox_backend_x86.yml workflow and gating results against a committed “known failures” baseline so regressions are detected without failing on expected gaps.

Changes:

  • Adds a reusable GitHub Actions workflow to build the Gluten Velox bundle and run sharded Delta ScalaTest suites, with baseline enforcement/aggregation.
  • Adds Delta setup + patching utilities (clone, inject bundle on test classpath, patch tests, apply upstream test fixes, and force-fail known OOM-inducing tests).
  • Adds a committed known-failures.txt baseline plus documentation for maintaining it.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| .github/workflows/velox_backend_x86.yml | Invokes the reusable Delta UT workflow and uploads native + Arrow artifacts for reuse. |
| .github/workflows/delta_spark_ut.yml | Implements the reusable (and dispatchable) Delta UT pipeline: build bundle, shard tests, gate, aggregate. |
| .github/workflows/util/delta-spark-ut/setup-delta.sh | Prepares a Delta clone for testing with the Gluten bundle; applies targeted patches/cherry-picks. |
| .github/workflows/util/delta-spark-ut/compare-test-results.py | Parses JUnit XML and enforces/seeds/aggregates results against the known-failures baseline. |
| .github/workflows/util/delta-spark-ut/known-failures.txt | Baseline list of currently expected failing Delta tests under Gluten. |
| .github/workflows/util/delta-spark-ut/README.md | Documents baseline seeding, enforcement behavior, and refresh workflows. |

Comment thread .github/workflows/delta_spark_ut.yml Outdated
Comment thread .github/workflows/util/delta-spark-ut/setup-delta.sh
Comment thread .github/workflows/util/delta-spark-ut/compare-test-results.py
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 28, 2026 09:02

Copilot AI left a comment

Pull request overview

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

Comment thread .github/workflows/delta_spark_ut.yml Outdated
Comment thread .github/workflows/delta_spark_ut.yml Outdated
Comment thread .github/workflows/util/delta-spark-ut/compare-test-results.py
`jar=$(ls <glob> | head -n1)` aborts the step under `set -euo pipefail`
when the glob matches nothing: `ls` exits non-zero, pipefail propagates
it, and set -e exits before the explicit "jar not found" check can print
an actionable error -- the log shows only a generic `ls: cannot access`.

Make the lookup non-fatal (`2>/dev/null ... || true`) so the check runs,
and add the missing check to the Clone-and-patch-Delta step. Addresses PR
review feedback.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

[Enhancement] [Build] Run Delta unit tests during PR validation
