[GLUTEN-12280][VL] Fix Spark 4 Arrow Python UDF stream writer by ReemaAlzaid · Pull Request #12345 · apache/gluten

ReemaAlzaid · 2026-06-23T19:47:24Z

What changes are proposed in this pull request?

Fix Spark 4 Arrow Python UDF execution with the Velox backend by keeping the Arrow stream writer alive across input batches instead of reopening the IPC stream per batch.

Also adds a regression test for an Arrow Python UDF over a Parquet scan.

How was this patch tested?

Added ArrowEvalPythonExecSuite coverage.

Verified locally on Spark 4.0.2 / Scala 2.13 / Linux aarch64. The repro uses ColumnarArrowPythonRunner, returns max(ship_len) = 7, and no longer fails with "Invalid IPC stream".

liuneng1994

LGTM

Copilot

Pull request overview

Fixes Spark 4 Arrow Python UDF execution for the Velox backend by changing ColumnarArrowPythonRunner to keep a single Arrow IPC stream writer open across input batches (instead of effectively reopening the stream per batch), and adds a Spark-4-only regression test that exercises an Arrow-batched Python UDF over a Parquet scan.

Changes:

Update Spark 4 writeNextInputToStream path to reuse VectorSchemaRoot/VectorLoader/ArrowStreamWriter across batches and close at task completion.
Add a regression test for Arrow-batched Python UDF over Parquet scan + aggregation.
Add Spark 4 constructor handling in test UDF creation to require explicit return type.
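
To illustrate why reopening the stream per batch can confuse a reader, here is a minimal sketch using only the Java standard library. It is a toy model, not the Arrow IPC format or the Gluten code: the one-byte tags and the names `ToyStreamFraming`, `writeSingleStream`, `writeStreamPerBatch`, and `read` are all hypothetical. The reader accepts one header followed by batches and an end marker, which is the shape a single long-lived writer produces.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}
import scala.annotation.tailrec

object ToyStreamFraming {
  // Hypothetical one-byte tags; the real Arrow IPC framing is different.
  private val Header: Int = 'S'
  private val Batch: Int = 'B'
  private val End: Int = 'E'

  /** One writer for the whole task: header once, then every batch, then the end marker. */
  def writeSingleStream(batches: Seq[Int]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeByte(Header)
    batches.foreach { b => out.writeByte(Batch); out.writeInt(b) }
    out.writeByte(End)
    out.flush()
    bytes.toByteArray
  }

  /** A fresh "writer" per batch: the header is emitted again before every batch. */
  def writeStreamPerBatch(batches: Seq[Int]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    batches.foreach { b => out.writeByte(Header); out.writeByte(Batch); out.writeInt(b) }
    out.writeByte(End)
    out.flush()
    bytes.toByteArray
  }

  /** Reader that expects exactly one header, then batches, then the end marker. */
  def read(data: Array[Byte]): Either[String, Vector[Int]] = {
    val in = new DataInputStream(new ByteArrayInputStream(data))

    @tailrec
    def loop(acc: Vector[Int]): Either[String, Vector[Int]] =
      in.readUnsignedByte() match {
        case Batch => loop(acc :+ in.readInt())
        case End => Right(acc)
        case other => Left(s"unexpected tag ${other.toChar}")
      }

    if (in.readUnsignedByte() != Header) Left("missing header") else loop(Vector.empty)
  }
}
```

In this toy model, `ToyStreamFraming.read(ToyStreamFraming.writeSingleStream(Seq(1, 2, 3)))` yields all three batches, while the per-batch variant is rejected at the second header. The actual failure mode in Spark's Python worker may differ in detail; the sketch only shows the general hazard of repeating stream setup mid-stream.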

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| backends-velox/src/main/scala/org/apache/spark/api/python/ColumnarArrowEvalPythonExec.scala | Keep Arrow IPC stream writer alive across Spark 4 batch writes; close on completion. |
| backends-velox/src/test/scala/org/apache/gluten/execution/python/ArrowEvalPythonExecSuite.scala | Add Spark 4 regression test covering Arrow-batched Python UDF over Parquet scan. |

+      private def writeNextInputToStreamHelper(dataOut: DataOutputStream): Boolean = {
+        ensureNextInputWriter(dataOut)
+        if (!inputIterator.hasNext) {
+          closeNextInputWriter()
+          // See https://issues.apache.org/jira/browse/SPARK-44705:
+          // Starting from Spark 4.0, we should return false once the iterator is drained out,
+          // otherwise Spark won't stop calling this method repeatedly.
+          return false
+        }
+        val nextBatch = inputIterator.next()
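
The hunk above is truncated, so the following is only a sketch of the contract its comment describes: open the writer lazily, write one batch per call, and return false (closing the writer once) when the input is drained. `ToyInputWriter`, `writeNext`, `ensureOpen`, `closeOnce`, and the `log` buffer are hypothetical stand-ins, not the names or behavior of the actual Gluten or Spark classes.

```scala
import scala.collection.mutable.ArrayBuffer

// Toy stand-in for a per-task input writer; records lifecycle events in `log`.
final class ToyInputWriter(input: Iterator[Int]) {
  private var opened = false
  private var closed = false
  val log: ArrayBuffer[String] = ArrayBuffer.empty[String]

  // Open at most once, on first use.
  private def ensureOpen(): Unit =
    if (!opened) { opened = true; log += "open" }

  // Close at most once, even if the caller keeps polling after the drain.
  private def closeOnce(): Unit =
    if (opened && !closed) { closed = true; log += "close" }

  /** Writes one batch and returns true, or returns false once the input is drained. */
  def writeNext(): Boolean = {
    ensureOpen()
    if (!input.hasNext) {
      closeOnce()
      false
    } else {
      log += s"batch ${input.next()}"
      true
    }
  }
}
```

A caller that loops with `while (writer.writeNext()) {}` sees a single open, one entry per batch, and a single close. Opening before the drain check means an empty input still produces an opened and closed stream, which mirrors the order of calls in the hunk.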


ReemaAlzaid and others added 2 commits June 17, 2026 22:40

[GLUTEN-12280][VL] Fix Spark 4 Arrow Python UDF stream writer

9dea8f7

Merge branch 'main' into fix-pyarrow

f13fec8

github-actions Bot added the VELOX label Jun 23, 2026

Merge branch 'main' into fix-pyarrow

cdc2d0c

liuneng1994 requested a review from Copilot June 27, 2026 07:06

Copilot started reviewing on behalf of liuneng1994 June 27, 2026 07:06

liuneng1994 self-requested a review June 27, 2026 07:06

liuneng1994 approved these changes Jun 27, 2026

Copilot AI reviewed Jun 27, 2026

Potential fix for pull request finding

3bc8054

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

ReemaAlzaid wants to merge 4 commits into apache:main from ReemaAlzaid:fix-pyarrow

3 participants
