### What is the problem the feature request solves?
We have some very useful microbenchmarks implemented in Scala and tightly integrated with Comet. I propose moving most of these to PySpark instead, which would make them easier to run on Spark clusters, and easier to run against different versions of Comet and against other Spark accelerators for comparison. It would also remove the compile step from running the benchmarks; the accelerator jar is provided via the COMET_JAR environment variable.
I already prototyped this, using Claude Code to automate the conversion.
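To make the shape of this concrete, here is a minimal sketch of how such a PySpark harness could pick up the accelerator jar. The `spark.plugins` and `spark.comet.*` settings are Comet's documented ones; everything else (app name, structure) is illustrative rather than the prototype's actual code:

```python
import os

from pyspark.sql import SparkSession

# The accelerator jar comes from the COMET_JAR environment variable, so the
# same script can be pointed at different Comet builds (or swapped out for
# another accelerator's jar) without any compile step.
comet_jar = os.environ["COMET_JAR"]

spark = (
    SparkSession.builder.appName("comet-microbenchmarks")
    .config("spark.jars", comet_jar)
    .config("spark.plugins", "org.apache.spark.CometPlugin")
    .config("spark.comet.enabled", "true")
    .config("spark.comet.exec.enabled", "true")
    .getOrCreate()
)
```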
Here are some sample outputs:
**Parquet Scans**

```
SQL Single DOUBLE Column Scan (13,421,772 rows)
------------------------------------------------------------------------------------
Case                          Best(ms)   Avg(ms)  Median  StdDev       Rate  vs base
------------------------------------------------------------------------------------
spark                             42.9      43.2    43.1     0.4   313.2M/s
gluten                            88.8      90.8    90.3     1.9   151.2M/s    0.48x
comet_native_datafusion           35.1      36.9    37.2     1.1   382.4M/s    1.22x
comet_native_iceberg_compat       36.5      37.5    37.6     0.7   367.9M/s    1.17x
------------------------------------------------------------------------------------

SQL Single Decimal(5,2) Column Scan (1,572,864 rows)
------------------------------------------------------------------------------------
Case                          Best(ms)   Avg(ms)  Median  StdDev       Rate  vs base
------------------------------------------------------------------------------------
spark                             19.0      21.8    21.1     2.1    82.8M/s
gluten                            66.8      70.4    69.6     3.1    23.5M/s    0.28x
comet_native_datafusion           23.8      24.5    24.6     0.5    66.1M/s    0.80x
comet_native_iceberg_compat       24.8      25.5    25.1     0.9    63.5M/s    0.77x
------------------------------------------------------------------------------------
```
**Expressions**

```
contains (102 rows)
------------------------------------------------------------------------------------
Case                          Best(ms)   Avg(ms)  Median  StdDev       Rate  vs base
------------------------------------------------------------------------------------
spark                             15.2      17.9    17.0     3.2     6.7K/s
gluten                            36.6      39.2    38.8     2.0     2.8K/s    0.42x
comet_native_datafusion           19.3      19.8    19.7     0.4     5.3K/s    0.79x
comet_native_iceberg_compat       18.8      19.2    19.0     0.5     5.4K/s    0.81x
------------------------------------------------------------------------------------

endswith (102 rows)
------------------------------------------------------------------------------------
Case                          Best(ms)   Avg(ms)  Median  StdDev       Rate  vs base
------------------------------------------------------------------------------------
spark                             15.7      16.3    16.4     0.4     6.5K/s
gluten                            35.9      37.7    37.2     1.9     2.8K/s    0.44x
comet_native_datafusion           17.9      18.6    18.4     0.6     5.7K/s    0.87x
comet_native_iceberg_compat       18.2      20.6    19.5     3.7     5.6K/s    0.86x
------------------------------------------------------------------------------------
```
**Joins**

```
Inner Join 20 cols (10,485 rows)
------------------------------------------------------------------------------------
Case                          Best(ms)   Avg(ms)  Median  StdDev       Rate  vs base
------------------------------------------------------------------------------------
spark                            267.5     283.9   285.8    11.4    39.2K/s
gluten                           297.8     315.7   318.9    15.4    35.2K/s    0.90x
gluten (hash join)               307.7     316.6   314.6     6.9    34.1K/s    0.87x
comet                            250.0     269.0   268.2    14.7    41.9K/s    1.07x
comet (hash join)                277.1     289.8   289.3    14.8    37.8K/s    0.97x
------------------------------------------------------------------------------------
```
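For context on the columns in these tables: Best/Avg/Median/StdDev fall out of a simple timing loop over each case, along the lines of the sketch below (the function and parameter names here are mine, not the prototype's actual API):

```python
import time
import statistics


def run_case(name: str, action, iterations: int = 5) -> None:
    """Time one benchmark case several times and print summary statistics.

    `action` should force execution of the query under test, e.g.
    `lambda: df.count()` or a collect of the result.
    """
    timings_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        action()
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    print(
        f"{name}: best={min(timings_ms):.1f}ms "
        f"avg={statistics.mean(timings_ms):.1f}ms "
        f"median={statistics.median(timings_ms):.1f}ms "
        f"stdev={statistics.stdev(timings_ms):.1f}ms"
    )
```

Each case (spark, gluten, comet, ...) is then just the same query run against a differently configured SparkSession, and the Rate column can be derived from the row count divided by the best time.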
### Describe the potential solution

I already have this working, with most of the benchmarks ported over. If this approach is acceptable to the community, I can create a PR that adds the PySpark benchmarks and removes some of the existing Scala-based microbenchmarks, so we don't have duplicates to maintain.
### Additional context

_No response_