Delta pipeline fix tests by felipepessoto · Pull Request #12386 · apache/gluten

felipepessoto · 2026-06-27T09:10:40Z

What changes are proposed in this pull request?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

…es baseline

Run delta-io/delta's `spark` ScalaTest suite against a Gluten Velox bundle in CI and gate the results against a committed baseline so the many expected Delta-on-Gluten failures stay manageable and can be fixed incrementally without letting currently-passing tests silently regress.

What it adds (.github/workflows/util/delta-spark-ut/):

- delta_spark_ut.yml: builds the native lib + Gluten bundle, then runs the Delta spark suite sharded by suite into 4 shards x 4 forked test JVMs (~16-way), and gates each shard against the baseline.
- compare-test-results.py: the gate. Per shard, regressions (failed not in the baseline) fail the build; newly-passing baselined tests are flagged so the baseline can be tightened. Also supports seed/aggregate modes.
- known-failures.txt: the committed baseline of expected failures.
- setup-delta.sh: clones Delta, injects the Gluten bundle, patches DeltaSQLCommandTest, and force-fails the two DeletionVectorsSuite 2B-row tests whose native row-index materialization OOM-kills the runner and hangs the shard.
- README.md: how the pipeline, gating and baseline-refresh work.

The workflow also carries a hang watchdog that thread-dumps and kills a wedged fork, and tunes the per-fork heap (2G) and off-heap (2G) to fit the ~16G runner.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
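The per-shard gating rule described for compare-test-results.py can be sketched with plain set arithmetic. This is only an illustration, not the script in this PR: the function name `gate_shard`, the test identifiers, and the in-memory sets are assumptions, and the real script reads its inputs from test reports and known-failures.txt.

```python
def gate_shard(ran, failed, baseline):
    """Classify one shard's results against the known-failures baseline.

    Hypothetical helper for illustration only.
    """
    ran, failed, baseline = set(ran), set(failed), set(baseline)
    # A regression is a failure the baseline does not already expect.
    regressions = failed - baseline
    # A baselined test that ran in this shard and did not fail can be
    # dropped from the baseline. Tests that did not run here are ignored.
    newly_passing = (baseline & ran) - failed
    return sorted(regressions), sorted(newly_passing)


# Made-up test names, one shard's worth of results.
ran = {"SuiteA.t1", "SuiteA.t2", "SuiteB.t1"}
failed = {"SuiteA.t2", "SuiteB.t1"}
baseline = {"SuiteA.t1", "SuiteB.t1", "SuiteC.t9"}

regressions, newly_passing = gate_shard(ran, failed, baseline)
# Only regressions fail the build; newly-passing tests are just reported.
exit_code = 1 if regressions else 0
```

Restricting the newly-passing check to tests that actually ran in the shard matters because each shard sees only a slice of the suites, so a baselined test owned by another shard must not be reported as fixed.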

Velox has no Arrow representation for VariantType, so the native columnar write path -- which converts the incoming rows to Velox batches via RowToVeloxColumnarExec.toArrowSchema -- throws `UnsupportedOperationException: Unsupported data type: variant` at runtime. This broke every Delta write whose schema contains a variant column (INSERT, UPDATE, MERGE, OPTIMIZE/auto-compact, checkpoint-driven rewrites), since GlutenOptimisticTransaction.writeFiles always offloaded the write to the native writer (the now-removed code path built the Velox plan unconditionally).

Guard GlutenOptimisticTransaction.writeFiles: if the input schema contains a variant at any nesting level, delegate to super.writeFiles (the vanilla Delta write path) instead of offloading. Non-variant writes are unaffected. The check matches by type name so it stays source-compatible across Spark versions.

Adds GlutenDeltaVariantWriteSuite covering top-level, struct-nested, and UPDATE variant writes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
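The "variant at any nesting level, matched by type name" check is a recursive walk over the schema tree. The actual guard lives in Scala inside GlutenOptimisticTransaction.writeFiles and is not reproduced here; the following is a language-neutral sketch in Python over a hypothetical `DataType` stand-in (its `type_name` and `children` fields are assumptions made for illustration).

```python
from dataclasses import dataclass, field


@dataclass
class DataType:
    """Hypothetical stand-in for a schema node (not Spark's DataType)."""
    type_name: str
    children: list = field(default_factory=list)


def contains_variant(dt):
    # Compare by name rather than by class, mirroring the commit's
    # "matches by type name" approach.
    if dt.type_name == "variant":
        return True
    # Recurse into struct fields, array elements, map keys/values, etc.
    return any(contains_variant(child) for child in dt.children)


def choose_write_path(schema):
    # Fall back to the vanilla write when any variant is present.
    return "vanilla" if contains_variant(schema) else "native"


flat = DataType("struct", [DataType("long"), DataType("string")])
nested = DataType("struct", [
    DataType("long"),
    DataType("struct", [DataType("array", [DataType("variant")])]),
])

flat_path = choose_write_path(flat)
nested_path = choose_write_path(nested)
```

Here `flat_path` is `"native"` and `nested_path` is `"vanilla"`: a variant buried inside an array inside a struct still triggers the fallback, while schemas without one keep the offloaded path.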

This reverts commit 95ce39c.

…line

Delta's data-skipping, limit-push-down, column-pruning and scan-metric tests collect file-source scans by matching the concrete `FileSourceScanExec` case class. Under the Gluten Velox bundle the scan is offloaded to DeltaScanTransformer, a sibling that implements the same `FileSourceScanLike` interface but is not FileSourceScanExec, so the match misses and the scan looks absent. This surfaced as `scala.MatchError: List()` (~56 DataSkipping*/DeltaLimitPushDown* tests), empty generated-column partition filters (~45 OptimizeGeneratedColumnSuite tests) and broken column-pruning / scan-metric checks across the Delete, Update, Merge, DeletionVectors and RowId suites and the TestsStatistics helper.

Gluten copies `partitionFilters` and the other accessors these tests read verbatim onto the offloaded scan, so results are identical to vanilla -- only the test's `case` match breaks. Fix it by cherry-picking the two merged upstream Delta commits that widen these matches to the shared `FileSourceScanLike` interface (behavior-preserving for vanilla, which also implements it):

* delta-io/delta#7104 -- ScanReportHelper.collectScans
* delta-io/delta#7105 -- the remaining 9 test sources, its follow-up

Both are merged on Delta master but land after the ref this workflow builds against (v4.2.0), so setup-delta.sh cherry-picks them onto the shallow checkout. Each fetches the fix commit at depth 2 (commit + parent) so cherry-pick can compute the parent->fix diff, and uses `cherry-pick -n` so no committer identity is required. Once the pinned DELTA_REF advances to include a commit its cherry-pick becomes a clean no-op and that block can be removed.

The cherry-picks run before the DeletionVectorsSuite 2B-row force-fail step: that step sed-injects fail() into DeletionVectorsSuite.scala, which delta-io/delta#7105 also edits, and git cherry-pick refuses to apply onto a working tree with uncommitted changes to a file it touches (exit 128).

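The difference between matching the concrete scan class and matching the shared interface can be shown with a small analogue. The real tests are Scala pattern matches on Spark plan nodes; the Python classes below only borrow the names from the commit message, and the flat list standing in for a plan, the `Project` node, and `collect_scans` are assumptions made for illustration.

```python
class FileSourceScanLike:
    """Stand-in for the shared interface both scan nodes implement."""
    def __init__(self, partition_filters):
        self.partition_filters = partition_filters


class FileSourceScanExec(FileSourceScanLike):
    """Stand-in for the vanilla Spark scan node."""


class DeltaScanTransformer(FileSourceScanLike):
    """Stand-in for the scan node used after Gluten offload."""


class Project:
    """Some unrelated plan node."""


def collect_scans(plan_nodes, scan_type):
    # Keep only the nodes that are instances of the requested type.
    return [node for node in plan_nodes if isinstance(node, scan_type)]


vanilla_plan = [Project(), FileSourceScanExec(["part = 1"])]
gluten_plan = [Project(), DeltaScanTransformer(["part = 1"])]

# Narrow match: misses the offloaded scan, so the list comes back empty.
narrow_on_gluten = collect_scans(gluten_plan, FileSourceScanExec)
# Widened match: finds the scan in both plans.
wide_on_gluten = collect_scans(gluten_plan, FileSourceScanLike)
wide_on_vanilla = collect_scans(vanilla_plan, FileSourceScanLike)
```

The empty `narrow_on_gluten` list is the analogue of the `scala.MatchError: List()` failures: a test that destructures exactly one scan out of the result has nothing to bind. The widened match returns one scan for either plan, with the same `partition_filters`.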
Refresh known-failures.txt from run 28299900971 (the delta-spark-aggregate job output), which ran all 19073 tests across 16 shards: removes 187 now-passing tests with 0 regressions, 963 -> 776. ~147 come from the fixes above (DataSkipping*, DeltaLimitPushDown*, OptimizeGeneratedColumnSuite, MergeInto*, RowIdSuite); the remaining ~40 are other suites that now pass (e.g. HiveConvertToDeltaSuite, BitmapAggregatorE2ESuite).

Verified against the per-shard ran/failed lists: every baseline entry was observed this run (0 stale), so nothing was dropped due to a crashed or incomplete shard.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
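The refresh and its stale check amount to aggregating the per-shard ran/failed lists and subtracting the now-passing tests from the baseline, which is the same subtraction behind 963 - 187 = 776. A minimal sketch follows; `refresh_baseline`, the dictionary layout of a shard, and the test names are hypothetical and not taken from the aggregate mode of compare-test-results.py.

```python
def refresh_baseline(baseline, shards):
    """Aggregate shard results and tighten the baseline.

    Hypothetical helper; each shard is a dict with "ran" and "failed".
    """
    baseline = set(baseline)
    ran = set().union(*(shard["ran"] for shard in shards))
    failed = set().union(*(shard["failed"] for shard in shards))
    # Baseline entries never observed in any shard. These are kept, not
    # removed, because a crashed or incomplete shard would look the same.
    stale = baseline - ran
    # Baselined tests that ran somewhere and did not fail.
    now_passing = (baseline & ran) - failed
    # Failures outside the baseline; a refresh expects none.
    regressions = failed - baseline
    refreshed = baseline - now_passing
    return refreshed, now_passing, stale, regressions


# Made-up data: three baselined tests spread over two shards.
baseline = {"A.t1", "B.t1", "C.t1"}
shards = [
    {"ran": {"A.t1", "X.t1"}, "failed": {"A.t1"}},
    {"ran": {"B.t1", "C.t1", "Y.t1"}, "failed": {"C.t1"}},
]

refreshed, now_passing, stale, regressions = refresh_baseline(baseline, shards)
```

With this data `B.t1` is removed as now passing, `A.t1` and `C.t1` stay in the baseline, and both `stale` and `regressions` are empty, which is the condition the commit message reports before trusting the refreshed file.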

Make delta_spark_ut.yml a reusable workflow (on: workflow_call) and call it from velox_backend_x86.yml so the Delta tests reuse the native lib + arrow jars that workflow already builds, instead of duplicating the build-native-lib-centos-7 job. GitHub artifacts cannot be shared across workflows, so the only way to reuse the artifact is to run the Delta jobs in the same workflow run.

delta_spark_ut.yml keeps a workflow_dispatch trigger for standalone manual runs (its build-native-lib-centos-7 job is gated to that case and skipped when called); the pull_request trigger is removed so the suite no longer double-runs.

velox_backend_x86.yml gains an arrow-jars upload on its native build and a delta-spark-ut job that calls the reusable workflow. That job runs on every velox trigger like the other spark-test jobs, since core/velox/substrait/cpp changes can affect Delta query offload.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

github-actions Bot added VELOX INFRA DOCS labels Jun 27, 2026

felipepessoto and others added 2 commits June 27, 2026 09:37

Shards 16

6899d64

felipepessoto force-pushed the delta_pipeline_fix_tests branch from d39550f to 95ce39c Compare June 27, 2026 09:37

Revert "[VL] Fall back to vanilla Delta write for VariantType columns"

fbafb1a

This reverts commit 95ce39c.

github-actions Bot removed the VELOX label Jun 27, 2026

felipepessoto force-pushed the delta_pipeline_fix_tests branch 2 times, most recently from 154089e to 05e5156 Compare June 28, 2026 02:50

felipepessoto force-pushed the delta_pipeline_fix_tests branch from 05e5156 to b1fe046 Compare June 28, 2026 03:36

felipepessoto wants to merge 6 commits into apache:main from felipepessoto:delta_pipeline_fix_tests
