Skip to content

Proposal: Port most microbenchmarks to PySpark #3440

@andygrove

Description

@andygrove

What is the problem the feature request solves?

We have some very useful microbenchmarks implemented in Scala and tightly integrated with Comet. I would like to propose moving most of these to PySpark instead, making it easier to run them in Spark clusters, and easier to run with different versions of Comet and with other Spark accelerators for comparison.

It also removes the compile time from running the benchmarks. The accelerator jar is provided by COMET_JAR environment variable.

I already prototyped this, using Claude Code to automate the conversion.

Here are some sample outputs:

Parquet Scans

SQL Single DOUBLE Column Scan (13,421,772 rows)
---------------------------------------------------------------------------------------------------
Case                           Best(ms)     Avg(ms)      Median      StdDev          Rate   vs base
---------------------------------------------------------------------------------------------------
spark                              42.9        43.2        43.1         0.4      313.2M/s          
gluten                             88.8        90.8        90.3         1.9      151.2M/s     0.48x
comet_native_datafusion            35.1        36.9        37.2         1.1      382.4M/s     1.22x
comet_native_iceberg_compat        36.5        37.5        37.6         0.7      367.9M/s     1.17x
---------------------------------------------------------------------------------------------------

SQL Single Decimal(5,2) Column Scan (1,572,864 rows)
---------------------------------------------------------------------------------------------------
Case                           Best(ms)     Avg(ms)      Median      StdDev          Rate   vs base
---------------------------------------------------------------------------------------------------
spark                              19.0        21.8        21.1         2.1       82.8M/s          
gluten                             66.8        70.4        69.6         3.1       23.5M/s     0.28x
comet_native_datafusion            23.8        24.5        24.6         0.5       66.1M/s     0.80x
comet_native_iceberg_compat        24.8        25.5        25.1         0.9       63.5M/s     0.77x
---------------------------------------------------------------------------------------------------

Expressions

contains (102 rows)
---------------------------------------------------------------------------------------------------
Case                           Best(ms)     Avg(ms)      Median      StdDev          Rate   vs base
---------------------------------------------------------------------------------------------------
spark                              15.2        17.9        17.0         3.2        6.7K/s          
gluten                             36.6        39.2        38.8         2.0        2.8K/s     0.42x
comet_native_datafusion            19.3        19.8        19.7         0.4        5.3K/s     0.79x
comet_native_iceberg_compat        18.8        19.2        19.0         0.5        5.4K/s     0.81x
---------------------------------------------------------------------------------------------------

endswith (102 rows)
---------------------------------------------------------------------------------------------------
Case                           Best(ms)     Avg(ms)      Median      StdDev          Rate   vs base
---------------------------------------------------------------------------------------------------
spark                              15.7        16.3        16.4         0.4        6.5K/s          
gluten                             35.9        37.7        37.2         1.9        2.8K/s     0.44x
comet_native_datafusion            17.9        18.6        18.4         0.6        5.7K/s     0.87x
comet_native_iceberg_compat        18.2        20.6        19.5         3.7        5.6K/s     0.86x
---------------------------------------------------------------------------------------------------

Joins

Inner Join 20 cols (10,485 rows)
------------------------------------------------------------------------------------------
Case                  Best(ms)     Avg(ms)      Median      StdDev          Rate   vs base
------------------------------------------------------------------------------------------
spark                    267.5       283.9       285.8        11.4       39.2K/s          
gluten                   297.8       315.7       318.9        15.4       35.2K/s     0.90x
gluten (hash join)       307.7       316.6       314.6         6.9       34.1K/s     0.87x
comet                    250.0       269.0       268.2        14.7       41.9K/s     1.07x
comet (hash join)        277.1       289.8       289.3        14.8       37.8K/s     0.97x
------------------------------------------------------------------------------------------

Describe the potential solution

I already have this working, with most benchmarks ported over. If this approach is acceptable to the community then I can create a PR to add this and remove some of the existing Scala-based microbenchmarks, so we don't have duplicates to maintain.

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions