### What is the problem the feature request solves?
We have some very useful microbenchmarks implemented in Scala and tightly integrated with Comet. I propose moving most of these to PySpark instead, which would make them easier to run on Spark clusters, and easier to run against different versions of Comet and against other Spark accelerators for comparison. It would also remove the compile step from running the benchmarks; the accelerator jar is provided via the COMET_JAR environment variable.
I already prototyped this, using Claude Code to automate the conversion.
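To make the shape of this concrete, here is a minimal sketch of how such a PySpark harness could pick up the accelerator jar. The `spark.plugins` and `spark.comet.*` settings are Comet's documented ones; everything else (app name, structure) is illustrative rather than the prototype's actual code:

```python
import os

from pyspark.sql import SparkSession

# The accelerator jar comes from the COMET_JAR environment variable, so the
# same script can be pointed at different Comet builds (or swapped out for
# another accelerator's jar) without any compile step.
comet_jar = os.environ["COMET_JAR"]

spark = (
    SparkSession.builder.appName("comet-microbenchmarks")
    .config("spark.jars", comet_jar)
    .config("spark.plugins", "org.apache.spark.CometPlugin")
    .config("spark.comet.enabled", "true")
    .config("spark.comet.exec.enabled", "true")
    .getOrCreate()
)
```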
Here are some sample outputs:
**Parquet Scans**

```
SQL Single DOUBLE Column Scan (13,421,772 rows)
------------------------------------------------------------------------------------
Case                          Best(ms)   Avg(ms)  Median  StdDev       Rate  vs base
------------------------------------------------------------------------------------
spark                             42.9      43.2    43.1     0.4   313.2M/s
gluten                            88.8      90.8    90.3     1.9   151.2M/s    0.48x
comet_native_datafusion           35.1      36.9    37.2     1.1   382.4M/s    1.22x
comet_native_iceberg_compat       36.5      37.5    37.6     0.7   367.9M/s    1.17x
------------------------------------------------------------------------------------

SQL Single Decimal(5,2) Column Scan (1,572,864 rows)
------------------------------------------------------------------------------------
Case                          Best(ms)   Avg(ms)  Median  StdDev       Rate  vs base
------------------------------------------------------------------------------------
spark                             19.0      21.8    21.1     2.1    82.8M/s
gluten                            66.8      70.4    69.6     3.1    23.5M/s    0.28x
comet_native_datafusion           23.8      24.5    24.6     0.5    66.1M/s    0.80x
comet_native_iceberg_compat       24.8      25.5    25.1     0.9    63.5M/s    0.77x
------------------------------------------------------------------------------------
```
**Expressions**

```
contains (102 rows)
------------------------------------------------------------------------------------
Case                          Best(ms)   Avg(ms)  Median  StdDev       Rate  vs base
------------------------------------------------------------------------------------
spark                             15.2      17.9    17.0     3.2     6.7K/s
gluten                            36.6      39.2    38.8     2.0     2.8K/s    0.42x
comet_native_datafusion           19.3      19.8    19.7     0.4     5.3K/s    0.79x
comet_native_iceberg_compat       18.8      19.2    19.0     0.5     5.4K/s    0.81x
------------------------------------------------------------------------------------

endswith (102 rows)
------------------------------------------------------------------------------------
Case                          Best(ms)   Avg(ms)  Median  StdDev       Rate  vs base
------------------------------------------------------------------------------------
spark                             15.7      16.3    16.4     0.4     6.5K/s
gluten                            35.9      37.7    37.2     1.9     2.8K/s    0.44x
comet_native_datafusion           17.9      18.6    18.4     0.6     5.7K/s    0.87x
comet_native_iceberg_compat       18.2      20.6    19.5     3.7     5.6K/s    0.86x
------------------------------------------------------------------------------------
```
**Joins**

```
Inner Join 20 cols (10,485 rows)
------------------------------------------------------------------------------------
Case                          Best(ms)   Avg(ms)  Median  StdDev       Rate  vs base
------------------------------------------------------------------------------------
spark                            267.5     283.9   285.8    11.4    39.2K/s
gluten                           297.8     315.7   318.9    15.4    35.2K/s    0.90x
gluten (hash join)               307.7     316.6   314.6     6.9    34.1K/s    0.87x
comet                            250.0     269.0   268.2    14.7    41.9K/s    1.07x
comet (hash join)                277.1     289.8   289.3    14.8    37.8K/s    0.97x
------------------------------------------------------------------------------------
```
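For context on the columns in these tables: Best/Avg/Median/StdDev fall out of a simple timing loop over each case, along the lines of the sketch below (the function and parameter names here are mine, not the prototype's actual API):

```python
import time
import statistics


def run_case(name: str, action, iterations: int = 5) -> None:
    """Time one benchmark case several times and print summary statistics.

    `action` should force execution of the query under test, e.g.
    `lambda: df.count()` or a collect of the result.
    """
    timings_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        action()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    print(
        f"{name}: best={min(timings_ms):.1f}ms "
        f"avg={statistics.mean(timings_ms):.1f}ms "
        f"median={statistics.median(timings_ms):.1f}ms "
        f"stdev={statistics.stdev(timings_ms):.1f}ms"
    )
```

Each case (spark, gluten, comet, ...) is then just the same query run against a differently configured SparkSession, and the Rate column can be derived from the row count divided by the best time.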
### Describe the potential solution

I already have this working, with most of the benchmarks ported over. If this approach is acceptable to the community, I can create a PR that adds the PySpark benchmarks and removes some of the existing Scala-based microbenchmarks, so we don't have duplicates to maintain.
### Additional context

_No response_