[VL][Delta] Add Delta Spark UT pipeline gated against a known-failures baseline #12388
Open
felipepessoto wants to merge 5 commits into
…es baseline

Run delta-io/delta's `spark` ScalaTest suite against a Gluten Velox bundle in CI and gate the results against a committed baseline so the many expected Delta-on-Gluten failures stay manageable and can be fixed incrementally without letting currently-passing tests silently regress.

What it adds (.github/workflows/util/delta-spark-ut/):
- delta_spark_ut.yml: builds the native lib + Gluten bundle, then runs the Delta spark suite sharded by suite into 4 shards x 4 forked test JVMs (~16-way), and gates each shard against the baseline.
- compare-test-results.py: the gate. Per shard, regressions (failed not in the baseline) fail the build; newly-passing baselined tests are flagged so the baseline can be tightened. Also supports seed/aggregate modes.
- known-failures.txt: the committed baseline of expected failures.
- setup-delta.sh: clones Delta, injects the Gluten bundle, patches DeltaSQLCommandTest, and force-fails the two DeletionVectorsSuite 2B-row tests whose native row-index materialization OOM-kills the runner and hangs the shard.
- README.md: how the pipeline, gating and baseline-refresh work.

The workflow also carries a hang watchdog that thread-dumps and kills a wedged fork, and tunes the per-fork heap (2G) and off-heap (2G) to fit the ~16G runner.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
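The per-shard gate described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the contents of `compare-test-results.py`: the names `GateResult`, `load_baseline` and `gate`, the `Suite :: test` id format, and the "one entry per line, `#` starts a comment" baseline layout are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    regressions: frozenset    # failed now, not in the baseline
    newly_passing: frozenset  # in the baseline, but passed in this shard

    @property
    def ok(self) -> bool:
        # Only regressions fail the build; newly passing tests are just flagged.
        return not self.regressions


def load_baseline(text):
    """Parse baseline text (assumed layout: one test id per line, '#' comments)."""
    entries = set()
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            entries.add(line)
    return frozenset(entries)


def gate(ran, failed, baseline):
    """Compare one shard's results against the baseline.

    Baseline entries that did not run in this shard are ignored here,
    because they may belong to a different shard.
    """
    ran, failed = frozenset(ran), frozenset(failed)
    passed = ran - failed
    return GateResult(
        regressions=failed - baseline,
        newly_passing=baseline & passed,
    )


# Hypothetical test ids, for illustration only.
baseline = load_baseline("# expected failures\nSuiteA :: t1\nSuiteB :: t2\n")
result = gate(
    ran={"SuiteA :: t1", "SuiteA :: t3", "SuiteB :: t2"},
    failed={"SuiteA :: t1", "SuiteA :: t3"},
    baseline=baseline,
)
print("regressions:", sorted(result.regressions))
print("newly passing:", sorted(result.newly_passing))
```

Keeping regressions and newly passing tests as separate sets lets the build fail only on the former while still reporting the latter for a later baseline tightening.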
…line

Delta's data-skipping, limit-push-down, column-pruning and scan-metric tests collect file-source scans by matching the concrete `FileSourceScanExec` case class. Under the Gluten Velox bundle the scan is offloaded to DeltaScanTransformer, a sibling that implements the same `FileSourceScanLike` interface but is not FileSourceScanExec, so the match misses and the scan looks absent.

This surfaced as `scala.MatchError: List()` (~56 DataSkipping*/DeltaLimitPushDown* tests), empty generated-column partition filters (~45 OptimizeGeneratedColumnSuite tests) and broken column-pruning / scan-metric checks across the Delete, Update, Merge, DeletionVectors and RowId suites and the TestsStatistics helper. Gluten copies `partitionFilters` and the other accessors these tests read verbatim onto the offloaded scan, so results are identical to vanilla -- only the test's `case` match breaks.

Fix it by cherry-picking the two merged upstream Delta commits that widen these matches to the shared `FileSourceScanLike` interface (behavior-preserving for vanilla, which also implements it):
* delta-io/delta#7104 -- ScanReportHelper.collectScans
* delta-io/delta#7105 -- the remaining 9 test sources, its follow-up

Both are merged on Delta master but land after the ref this workflow builds against (v4.2.0), so setup-delta.sh cherry-picks them onto the shallow checkout. Each fetches the fix commit at depth 2 (commit + parent) so cherry-pick can compute the parent->fix diff, and uses `cherry-pick -n` so no committer identity is required. Once the pinned DELTA_REF advances to include a commit its cherry-pick becomes a clean no-op and that block can be removed.

The cherry-picks run before the DeletionVectorsSuite 2B-row force-fail step: that step sed-injects fail() into DeletionVectorsSuite.scala, which delta-io/delta#7105 also edits, and git cherry-pick refuses to apply onto a working tree with uncommitted changes to a file it touches (exit 128).
Refresh known-failures.txt from run 28299900971 (the delta-spark-aggregate job output), which ran all 19073 tests across 16 shards: removes 187 now-passing tests with 0 regressions, 963 -> 776. ~147 come from the fixes above (DataSkipping*, DeltaLimitPushDown*, OptimizeGeneratedColumnSuite, MergeInto*, RowIdSuite); the remaining ~40 are other suites that now pass (e.g. HiveConvertToDeltaSuite, BitmapAggregatorE2ESuite).

Verified against the per-shard ran/failed lists: every baseline entry was observed this run (0 stale), so nothing was dropped due to a crashed or incomplete shard.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
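The refresh and the "0 stale" check described above amount to a few set operations over the per-shard ran/failed lists. The sketch below is a hypothetical illustration of that bookkeeping; the function name `aggregate`, the dictionary layout and the test ids are assumptions, not the aggregate mode of the actual script.

```python
def aggregate(shards, baseline):
    """Merge per-shard results and compare them with the old baseline.

    shards:   iterable of dicts with 'ran' and 'failed' sets (assumed layout)
    baseline: set of test ids currently listed as known failures
    """
    ran, failed = set(), set()
    for shard in shards:
        ran |= set(shard["ran"])
        failed |= set(shard["failed"])

    baseline = set(baseline)
    return {
        # Entries never observed in any shard: possibly a renamed or removed
        # test, or a shard that crashed before reporting it.
        "stale": baseline - ran,
        # Entries that ran and passed: safe to drop from the baseline.
        "now_passing": (baseline & ran) - failed,
        # Failures that are not in the baseline.
        "regressions": failed - baseline,
        # Ready-to-commit replacement baseline.
        "new_baseline": sorted(failed),
    }


# Hypothetical two-shard run.
report = aggregate(
    shards=[
        {"ran": {"A :: t1", "A :: t2"}, "failed": {"A :: t1"}},
        {"ran": {"B :: t1", "B :: t2"}, "failed": set()},
    ],
    baseline={"A :: t1", "B :: t1", "C :: gone"},
)
for key in ("stale", "now_passing", "regressions"):
    print(key, sorted(report[key]))
print("new_baseline", report["new_baseline"])
```

Checking `stale` before committing the regenerated baseline is what guards against silently dropping entries when a shard crashes or is incomplete.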
Make delta_spark_ut.yml a reusable workflow (on: workflow_call) and call it from velox_backend_x86.yml so the Delta tests reuse the native lib + arrow jars that workflow already builds, instead of duplicating the build-native-lib-centos-7 job. GitHub artifacts cannot be shared across workflows, so the only way to reuse the artifact is to run the Delta jobs in the same workflow run.

delta_spark_ut.yml keeps a workflow_dispatch trigger for standalone manual runs (its build-native-lib-centos-7 job is gated to that case and skipped when called); the pull_request trigger is removed so the suite no longer double-runs.

velox_backend_x86.yml gains an arrow-jars upload on its native build and a delta-spark-ut job that calls the reusable workflow. That job runs on every velox trigger like the other spark-test jobs, since core/velox/substrait/cpp changes can affect Delta query offload.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR adds a Delta Lake spark module unit-test CI pipeline for the Velox backend, integrating it into the existing velox_backend_x86.yml workflow and gating results against a committed “known failures” baseline so regressions are detected without failing on expected gaps.
Changes:
- Adds a reusable GitHub Actions workflow to build the Gluten Velox bundle and run sharded Delta ScalaTest suites, with baseline enforcement/aggregation.
- Adds Delta setup + patching utilities (clone, inject bundle on test classpath, patch tests, apply upstream test fixes, and force-fail known OOM-inducing tests).
- Adds a committed `known-failures.txt` baseline plus documentation for maintaining it.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| `.github/workflows/velox_backend_x86.yml` | Invokes the reusable Delta UT workflow and uploads native + Arrow artifacts for reuse. |
| `.github/workflows/delta_spark_ut.yml` | Implements the reusable (and dispatchable) Delta UT pipeline: build bundle, shard tests, gate, aggregate. |
| `.github/workflows/util/delta-spark-ut/setup-delta.sh` | Prepares a Delta clone for testing with the Gluten bundle; applies targeted patches/cherry-picks. |
| `.github/workflows/util/delta-spark-ut/compare-test-results.py` | Parses JUnit XML and enforces/seeds/aggregates results against the known-failures baseline. |
| `.github/workflows/util/delta-spark-ut/known-failures.txt` | Baseline list of currently expected failing Delta tests under Gluten. |
| `.github/workflows/util/delta-spark-ut/README.md` | Documents baseline seeding, enforcement behavior, and refresh workflows. |
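As a rough illustration of the "parses JUnit XML" step in the table above, the following sketch collects the ran and failed test ids from one report using only the standard library. The function name `parse_junit`, the `classname :: name` id format, the decision to leave skipped tests out of both sets, and the sample report are assumptions for the example; the real script may differ.

```python
import xml.etree.ElementTree as ET


def parse_junit(xml_text):
    """Return (ran, failed) sets of test ids from one JUnit XML report."""
    root = ET.fromstring(xml_text)
    ran, failed = set(), set()
    # iter() also matches <testcase> elements nested under a <testsuites> root.
    for case in root.iter("testcase"):
        if case.find("skipped") is not None:
            continue  # assumption: skipped tests count as neither ran nor failed
        test_id = f'{case.get("classname")} :: {case.get("name")}'
        ran.add(test_id)
        if case.find("failure") is not None or case.find("error") is not None:
            failed.add(test_id)
    return ran, failed


# Hypothetical report; the test names are invented for illustration.
SAMPLE = """\
<testsuite name="RowIdSuite" tests="3">
  <testcase classname="RowIdSuite" name="passing test"/>
  <testcase classname="RowIdSuite" name="failing test">
    <failure message="boom"/>
  </testcase>
  <testcase classname="RowIdSuite" name="skipped test">
    <skipped/>
  </testcase>
</testsuite>
"""

ran, failed = parse_junit(SAMPLE)
print(sorted(ran))
print(sorted(failed))
```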
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
`jar=$(ls <glob> | head -n1)` aborts the step under `set -euo pipefail` when the glob matches nothing: `ls` exits non-zero, pipefail propagates it, and set -e exits before the explicit "jar not found" check can print an actionable error -- the log shows only a generic `ls: cannot access`. Make the lookup non-fatal (`2>/dev/null ... || true`) so the check runs, and add the missing check to the Clone-and-patch-Delta step.

Addresses PR review feedback.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fix #9296.
What changes are proposed in this pull request?
Adds a CI pipeline that runs delta-io/delta's `spark` ScalaTest suite against the Gluten Velox bundle, so we can validate Gluten against a real Delta release and catch regressions over time.

Running the Delta UTs on Gluten produces many expected failures (Gluten does not yet offload every Delta code path, and falls back or behaves differently in places). A plain "red on any failure" gate would be useless. Instead, the pipeline keeps a committed baseline of known failures and gates each run against it:

- … `fail_on_fixed=false`.

How it works
- … `on: workflow_call`) invoked from `velox_backend_x86.yml`, so it reuses the Velox native libs + Arrow jars that workflow already builds instead of duplicating the expensive native C++ build. It then assembles the `gluten-velox-bundle` fat jar (Spark 4.1 + Scala 2.13 + JDK 17, Delta profile). A `workflow_dispatch` trigger is kept for standalone manual runs (which build the native lib themselves).
- … `v4.2.0`), drops the bundle onto the `spark` project's test classpath, patches `DeltaSQLCommandTest` to register `GlutenPlugin`, and cherry-picks two merged upstream Delta test-only fixes (delta-io/delta#7104, "[Spark] [Test] Collect scans by FileSourceScanLike in ScanReportHelper", and its follow-up delta-io/delta#7105) that widen `FileSourceScanExec` checks to `FileSourceScanLike` so Gluten's transformed plan is recognized.
- … `sbt spark/test` sharded by suite across 4 shards (4 forked test JVMs each, ~16-way parallelism), with ScalaTest's JUnit XML reporter enabled, then gates each shard with `compare-test-results.py` against `known-failures.txt`. A final job aggregates all shards into a single ready-to-commit baseline and flags stale entries.

Files
- `.github/workflows/velox_backend_x86.yml`
- `.github/workflows/delta_spark_ut.yml`
- `.github/workflows/util/delta-spark-ut/setup-delta.sh`: … `DeltaSQLCommandTest`, cherry-picks the upstream test fixes.
- `.github/workflows/util/delta-spark-ut/compare-test-results.py`
- `.github/workflows/util/delta-spark-ut/known-failures.txt`: … `#` comments per line).
- `.github/workflows/util/delta-spark-ut/README.md`

Operational hardening
- … `--add-opens` set plus `-Dio.netty.tryReflectionSetAccessible=true` (otherwise Arrow's allocator fails to initialize).

Scope / known limitations
- … `v4.2.0` / Spark 4.1 / Scala 2.13 / JDK 17.
- … `workflow_dispatch` run with `update_baseline=true`.
- … `IdentityColumn.logTableWrite` first param `Snapshot` -> `SnapshotDescriptor`), which `NoSuchMethodError`s on every write. Supporting 4.3.0 needs the bundle built against 4.3.0; tracked as follow-up.

How was this patch tested?
This change is CI. The Delta suite runs as part of `velox_backend_x86.yml` -- on every PR/trigger that touches Velox/core/cpp or the Delta CI files -- and via manual `workflow_dispatch`. In the latest runs all shards pass against the committed baseline (failures limited to known-failures entries; no regressions).

19,073 Delta tests run (18,297 passed / 776 failed).
Main failures (776 baseline):
- … `timestamp` -> `timestamp_ntz`

Fixed since the first draft (#12371): the `MatchError: List()` and DataSkipping empty-stats failures (caused by a `FileSourceScanExec` match) were fixed by cherry-picking the merged Delta PRs 7104 + 7105 (`FileSourceScanExec` -> `FileSourceScanLike`) during test setup. Together with roughly 40 tests in other suites that now pass, that dropped the baseline from 963 to 776 known failures (187 now-passing removed, 0 regressions).

Delta Spark UT (Gluten) -- shard count vs test parallelism
Sharding is by suite (`MurmurHash3(suiteName) % NUM_SHARDS`), so total test work is fixed (~1250 fork-minutes). The runners are 4-core / ~16 GB. The committed config is 4 shards x 4 forks.

Was this patch authored or co-authored using generative AI tooling?
Generated-by: GitHub Copilot CLI
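
For reference, the by-suite sharding mentioned in the testing notes (`MurmurHash3(suiteName) % NUM_SHARDS`) can be sketched as below. Python's standard library has no MurmurHash3, so this example swaps in `zlib.crc32` as a stand-in stable hash; the resulting shard numbers therefore will not match the workflow's. The function names are hypothetical and the suite names are taken from this PR only as sample input.

```python
import zlib

NUM_SHARDS = 4  # the committed config uses 4 shards


def shard_of(suite_name, num_shards=NUM_SHARDS):
    """Map a suite name to a shard index with a stable hash.

    crc32 is a stand-in for MurmurHash3 here; any hash works as long as it
    gives the same value on every runner (unlike Python's salted hash()).
    """
    return zlib.crc32(suite_name.encode("utf-8")) % num_shards


def partition(suites, num_shards=NUM_SHARDS):
    """Assign every suite to exactly one shard."""
    shards = {i: [] for i in range(num_shards)}
    for name in sorted(suites):
        shards[shard_of(name, num_shards)].append(name)
    return shards


suites = [
    "DeletionVectorsSuite",
    "RowIdSuite",
    "OptimizeGeneratedColumnSuite",
    "HiveConvertToDeltaSuite",
    "BitmapAggregatorE2ESuite",
]
shards = partition(suites)
for index, members in shards.items():
    print(index, members)
```

Because the assignment depends only on the suite name, each shard job can compute its own subset independently, and changing the shard count redistributes suites without changing the total amount of test work.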