
[SPARK-56424][PYTHON] Add ASV benchmark for SQL_SCALAR_PANDAS_UDF#55316

Open
Yicong-Huang wants to merge 1 commit into apache:master from Yicong-Huang:SPARK-55724/bench/scalar-pandas-udf

Conversation

@Yicong-Huang
Contributor

What changes were proposed in this pull request?

Add ASV microbenchmarks for SQL_SCALAR_PANDAS_UDF eval type in bench_eval_type.py.

The new _ScalarPandasBenchMixin follows the same pattern as the existing _ScalarArrowBenchMixin but uses Pandas Series operations. It measures the full Arrow-to-Pandas-to-Arrow round-trip that occurs in scalar Pandas UDFs.

- Scenarios (9): `sm_batch_few_col`, `sm_batch_many_col`, `lg_batch_few_col`, `lg_batch_many_col`, `pure_ints`, `pure_floats`, `pure_strings`, `pure_ts`, `mixed_types`
- UDFs (3): `identity_udf` (passthrough), `sort_udf` (`Series.sort_values`), `nullcheck_udf` (`Series.notna`)
- Benchmark classes (2): `ScalarPandasUDFTimeBench`, `ScalarPandasUDFPeakmemBench`
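To illustrate, here is a minimal sketch of the three UDF bodies as plain functions on pandas Series. The function names come from this PR's description; the exact bodies and signatures are assumptions for illustration, not the code in `bench_eval_type.py`:

```python
import pandas as pd


def identity_udf(s: pd.Series) -> pd.Series:
    # Passthrough: isolates the pure Arrow-to-Pandas-to-Arrow
    # conversion cost, with no per-batch compute.
    return s


def sort_udf(s: pd.Series) -> pd.Series:
    # Series.sort_values exercises a compute-heavy per-batch
    # operation on top of the conversion cost.
    return s.sort_values().reset_index(drop=True)


def nullcheck_udf(s: pd.Series) -> pd.Series:
    # Series.notna returns a boolean mask, so the output Arrow
    # type differs from the input type.
    return s.notna()
```

In the benchmark these would be wrapped as scalar Pandas UDFs (eval type `SQL_SCALAR_PANDAS_UDF`), so each batch makes the full Arrow-to-Pandas-to-Arrow round trip being measured.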

Why are the changes needed?

This is part of the PySpark Serializer & EvalType Refactor effort (SPARK-55724). We need baseline benchmarks for every eval type before refactoring the serialization path, so we can detect performance regressions. SQL_SCALAR_PANDAS_UDF is one of the most commonly used eval types and its Arrow-to-Pandas-to-Arrow conversion cost is a key metric for the refactor.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Ran all 54 benchmarks (27 time + 27 peakmem) locally with `python/asv run --python=same --quick`:

time_worker results (27/27 passed):
  sm_batch_few_col   identity_udf: 269ms, sort_udf: 385ms, nullcheck_udf: 300ms
  lg_batch_few_col   identity_udf: 836ms, sort_udf: 1.06s, nullcheck_udf: 834ms
  pure_ints          identity_udf: 165ms, sort_udf: 236ms, nullcheck_udf: 170ms
  pure_strings       identity_udf: 819ms, sort_udf: 1.17s, nullcheck_udf: 818ms
  mixed_types        identity_udf: 504ms, sort_udf: 551ms, nullcheck_udf: 510ms
  (+ 12 more scenarios all passing)

peakmem_worker results (27/27 passed):
  Range: 479M - 628M across all scenario/udf combinations

Was this patch authored or co-authored using generative AI tooling?

No.

