
[SPARK-56424][PYTHON] Add ASV benchmark for SQL_SCALAR_PANDAS_UDF#55316

Open
Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-55724/bench/scalar-pandas-udf

Conversation

@Yicong-Huang
Contributor

What changes were proposed in this pull request?

Add ASV microbenchmarks for SQL_SCALAR_PANDAS_UDF eval type in bench_eval_type.py.

The new _ScalarPandasBenchMixin follows the same pattern as the existing _ScalarArrowBenchMixin but uses Pandas Series operations. It measures the full Arrow-to-Pandas-to-Arrow round-trip that occurs in scalar Pandas UDFs.

- Scenarios (9): `sm_batch_few_col`, `sm_batch_many_col`, `lg_batch_few_col`, `lg_batch_many_col`, `pure_ints`, `pure_floats`, `pure_strings`, `pure_ts`, `mixed_types`
- UDFs (3): `identity_udf` (passthrough), `sort_udf` (`Series.sort_values`), `nullcheck_udf` (`Series.notna`)
- Benchmark classes (2): `ScalarPandasUDFTimeBench`, `ScalarPandasUDFPeakmemBench`
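To illustrate, here is a minimal sketch of the three UDF bodies as plain functions on pandas Series. The function names come from this PR's description; the exact bodies and signatures are assumptions for illustration, not the code in `bench_eval_type.py`:

```python
import pandas as pd


def identity_udf(s: pd.Series) -> pd.Series:
    # Passthrough: isolates the pure Arrow-to-Pandas-to-Arrow
    # conversion cost, with no per-batch compute.
    return s


def sort_udf(s: pd.Series) -> pd.Series:
    # Series.sort_values exercises a compute-heavy per-batch
    # operation on top of the conversion cost.
    return s.sort_values().reset_index(drop=True)


def nullcheck_udf(s: pd.Series) -> pd.Series:
    # Series.notna returns a boolean mask, so the output Arrow
    # type differs from the input type.
    return s.notna()
```

In the benchmark these would be wrapped as scalar Pandas UDFs (eval type `SQL_SCALAR_PANDAS_UDF`), so each batch makes the full Arrow-to-Pandas-to-Arrow round trip being measured.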

Why are the changes needed?

This is part of the PySpark Serializer & EvalType Refactor effort (SPARK-55724). We need baseline benchmarks for every eval type before refactoring the serialization path, so we can detect performance regressions. SQL_SCALAR_PANDAS_UDF is one of the most commonly used eval types and its Arrow-to-Pandas-to-Arrow conversion cost is a key metric for the refactor.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Ran all 54 benchmarks (27 time + 27 peakmem) locally with `python/asv run --python=same --quick`:

time_worker results (27/27 passed):
  sm_batch_few_col   identity_udf: 269ms, sort_udf: 385ms, nullcheck_udf: 300ms
  lg_batch_few_col   identity_udf: 836ms, sort_udf: 1.06s, nullcheck_udf: 834ms
  pure_ints          identity_udf: 165ms, sort_udf: 236ms, nullcheck_udf: 170ms
  pure_strings       identity_udf: 819ms, sort_udf: 1.17s, nullcheck_udf: 818ms
  mixed_types        identity_udf: 504ms, sort_udf: 551ms, nullcheck_udf: 510ms
  (+ 12 more scenarios all passing)

peakmem_worker results (27/27 passed):
  Range: 479M - 628M across all scenario/udf combinations

Was this patch authored or co-authored using generative AI tooling?

No.

