Continuous Sightglass benchmarking

We've wanted continuous Sightglass benchmarking for Cranelift and Wasmtime for a looong time; we are finally getting to the point where we can prioritize it. But first we need to agree on exactly what we measure, how we measure it (i.e. high-level consensus on architecture), and what we do with those measurements (i.e. how we display them, what kind of automatic alerting/issue-filing/etc. we do).

-------------------------

Benchmarking is hard. Lots of people have opinions on how it should best be done. Historically, this is a topic that has attracted a lot of drive-by opinions. **I ask that people kindly refrain from engaging on this issue unless you are regularly hacking on Wasmtime, Cranelift, or Sightglass.**

--------------------------

I think we have roughly two separate use cases for continuous benchmarking:

1. **Viewing long-term trends over time.** Basically, drawing arewefastyet.com-style graphs where the x axis is ~nightly Wasmtime builds and the y axis is some metric of performance. Gives us a rough sense of how performance today is compared to last week/month/year/etc.

2. **Catching and reporting performance regressions as quickly as we can.** Basically, do a statistical significance test between ~yesterday's `main` and today's `main`. If the difference is significant and performance got worse, then open an issue detailing the regression's commit range and effect size. Helps us avoid unwittingly regressing performance. This could also be triggered explicitly by Wasmtime/Cranelift maintainers for a particular PR, to compare its performance against `main`.

These two use cases have (or *can* have) different requirements:

* (1) must have either a dedicated machine or some other manner of making results recorded on different machines ~deterministic (e.g. [measure with `callgrind` rather than native cycles](https://github.com/bytecodealliance/sightglass/pull/314)) or else the trend over time is meaningless.
* However, (2) doesn't necessarily need dedicated hardware or even the ability to compare results recorded on different machines. It can build yesterday's and today's Wasmtime, run them both at the same time on the same machine, and do a statistical significance test between their results. It doesn't matter if this process ran on a different machine two days ago, because it isn't reusing that data.
  * But if it *did* have access to dedicated hardware, it could potentially reuse yesterday's data instead of re-running yesterday's Wasmtime.
* (1) requires saving the results (or at least a summary of the results) of historical benchmark runs
* (2) does not necessarily need that

I just point this out to make sure we don't conflate the two use cases and over-constrain one because of the requirements of the other.
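
To make the "statistical significance test" in use case (2) concrete, here is a minimal sketch. It uses a permutation test on the difference of means, which is my stand-in for illustration and not necessarily the test Sightglass itself applies. The function names, the `alpha` threshold, and the sample values are all hypothetical.

```python
import random
from statistics import mean


def permutation_test(baseline, candidate, rounds=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns an estimated p-value: the fraction of random relabelings whose
    mean difference is at least as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so reruns give the same answer
    observed = abs(mean(candidate) - mean(baseline))
    pooled = list(baseline) + list(candidate)
    n = len(baseline)
    hits = 0
    for _ in range(rounds):
        rng.shuffle(pooled)
        if abs(mean(pooled[n:]) - mean(pooled[:n])) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (hits + 1) / (rounds + 1)


def is_regression(baseline, candidate, alpha=0.01):
    """True if candidate is slower on average and the difference is significant."""
    slower = mean(candidate) > mean(baseline)
    return slower and permutation_test(baseline, candidate) < alpha


# Made-up cycle counts for yesterday's and today's builds, same machine.
yesterday = [100, 101, 99, 100, 102, 98, 100, 101]
today = [120, 121, 119, 120, 122, 118, 120, 121]
print(is_regression(yesterday, today))
```

Because both sample sets come from the same machine in the same run, nothing here depends on historical data or dedicated hardware.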

-----------------------

### Which measurements should we record?

Sightglass supports recording a variety of measurements:

* Instructions retired
* Cycles
* Wall time
* Various `perf` counters, notably cache accesses and misses
* Simulated (and therefore ~deterministic across machines) microarch details via `callgrind`:
  * instructions-retired
  * data-reads
  * data-writes
  * l1-icache-misses
  * l1-dcache-read-misses
  * l1-dcache-write-misses
  * ll-icache-misses
  * ll-dcache-read-misses
  * ll-dcache-write-misses
  * conditional-branches
  * conditional-branch-misses
  * indirect-branches
  * indirect-branch-misses
  * (We don't currently but we could also compute "virtual cycles" or "virtual wall time" based on this data, which should also be ~deterministic across machines)
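
As a rough sketch of what computing "virtual cycles" from these counters could look like: weight each simulated event by an assumed cost and sum. The per-event costs and the clock frequency below are placeholders I made up for illustration, not calibrated values, and the function names are hypothetical.

```python
# Hypothetical cost in cycles charged per simulated event. These weights are
# assumptions for illustration; real values would need calibration.
ASSUMED_COSTS = {
    "instructions-retired": 1,
    "l1-icache-misses": 10,
    "l1-dcache-read-misses": 10,
    "l1-dcache-write-misses": 10,
    "ll-icache-misses": 100,
    "ll-dcache-read-misses": 100,
    "ll-dcache-write-misses": 100,
    "conditional-branch-misses": 10,
    "indirect-branch-misses": 10,
}


def virtual_cycles(counts, costs=ASSUMED_COSTS):
    """Weighted sum of event counts; events missing from `counts` count as 0."""
    return sum(cost * counts.get(event, 0) for event, cost in costs.items())


def virtual_wall_time_ns(counts, assumed_ghz=3.0):
    """Convert virtual cycles to nanoseconds at an assumed clock frequency."""
    return virtual_cycles(counts) / assumed_ghz


example = {
    "instructions-retired": 1000,
    "l1-dcache-read-misses": 5,
    "ll-dcache-read-misses": 2,
    "conditional-branch-misses": 3,
}
print(virtual_cycles(example))  # 1000 + 5*10 + 2*100 + 3*10 = 1280
```

Since every input comes from the simulation, the result depends only on the chosen weights and not on the machine that ran the benchmark.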

Doing everything is going to be too noisy and give us information overload. It will also take more resources. So I think we really do want to narrow things down here, ideally to a single measure.

**Which should we use for use case (1), long-term trends over time?**

**Which should we use for use case (2), significance tests between today's and ~yesterday's Wasmtime to catch performance regressions quickly?**

### How do we measure it?

What's the high-level architecture? Dedicated hardware? How does it integrate into GitHub and GitHub Actions?

### What do we do with it? 

Once we've recorded some benchmark data, what actions do we take based on the results? Do we plot it and serve it on a web page? Do we file issues? Etc...

---------------------------------------------------------

### My Recommendation: Two separate jobs

#### Use case (1), long-term trends over time

* **Measure:** virtual cycles/wall-time based on callgrind simulation, so we can avoid dedicated hardware but still measure roughly the thing we really care about (wall time) as opposed to things that are too far-removed (like instructions retired)
* **How:**
  * make a `bytecodealliance/arewefastyet` repo (or other name)
  * daily GitHub Actions cron job to fetch the latest Wasmtime and run Sightglass
  * save the raw benchmark data as a file (it will be small since we only need ~3 iterations for callgrind because it is very low noise)
  * run a script to render/summarize all historical data, write it to a canonical file
  * `index.html` loads that summary and displays it
  * very simple; no dedicated hardware involved, no S3, no CDN, etc...
* **What:**
  * two scatter plots for each benchmark: compilation and execution
    * x axis: date of Wasmtime build
    * y axis: performance (virtual wall time)
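
The "render/summarize all historical data" step could be as small as the sketch below. The record layout (`date`, `benchmark`, `phase`, `virtual_time`) is a format I am assuming for illustration, not Sightglass's actual output format, and taking the median of each day's iterations is likewise just one reasonable choice.

```python
import json
from collections import defaultdict
from statistics import median


def summarize(records):
    """Collapse raw per-iteration records into one point per benchmark/phase/date.

    Returns {benchmark: {phase: [{"date": ..., "value": ...}, ...]}} with each
    series sorted by date (ISO dates sort correctly as strings).
    """
    samples = defaultdict(list)
    for r in records:
        key = (r["benchmark"], r["phase"], r["date"])
        samples[key].append(r["virtual_time"])

    summary = defaultdict(lambda: defaultdict(list))
    for bench, phase, date in sorted(samples):
        value = median(samples[(bench, phase, date)])
        summary[bench][phase].append({"date": date, "value": value})
    return {bench: dict(phases) for bench, phases in summary.items()}


# Made-up raw data: three iterations per day, as suggested above.
raw = [
    {"date": "2024-01-02", "benchmark": "bz2.wasm", "phase": "compilation", "virtual_time": v}
    for v in (9, 9, 10)
] + [
    {"date": "2024-01-01", "benchmark": "bz2.wasm", "phase": "compilation", "virtual_time": v}
    for v in (10, 12, 11)
]

summary = summarize(raw)
print(json.dumps(summary, indent=2))  # this is what index.html would load
```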

#### Use case (2), significance tests between today's and ~yesterday's Wasmtime

* **Measure:** cycles
* **How:**
  * GitHub Actions workflow in the Wasmtime repo
  * check out today's and yesterday's Wasmtimes
  * build both of their `libwasmtime_bench_api.so`s
  * run Sightglass, passing `-e yesterday.so -e today.so`
  * if there is any statistically significant regression, open an issue detailing the regression's commit range and effect size
* **What:**
  * No webpage, no graphs/plots
  * Just the issues opened on regression
    * Or a comment on a PR, if explicitly triggered for a particular PR (main vs PR branch, instead of today vs yesterday)
  * Issue/comment has an HTML table with a row for each regression, columns for benchmark, phase, and effect size, something like this:
    | benchmark | phase | effect size |
    |-----------|-------|-------------|
    | `bz2.wasm` | compilation | 1.23x +/- .03 more cycles |
  * Issue/comment also has info on how to run locally
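
For illustration, rendering the issue/comment body from a list of detected regressions could look roughly like this. The dictionary keys (`benchmark`, `phase`, `ratio`, `margin`) and the function name are hypothetical; only the table shape comes from the example above.

```python
def render_regressions(rows):
    """Render detected regressions as a Markdown table for an issue or PR comment."""
    lines = [
        "| benchmark | phase | effect size |",
        "|-----------|-------|-------------|",
    ]
    for r in rows:
        effect = f"{r['ratio']:.2f}x +/- {r['margin']:.2f} more cycles"
        lines.append(f"| `{r['benchmark']}` | {r['phase']} | {effect} |")
    return "\n".join(lines)


# Made-up regression matching the example row above.
table = render_regressions(
    [{"benchmark": "bz2.wasm", "phase": "compilation", "ratio": 1.23, "margin": 0.03}]
)
print(table)
```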

I could also be convinced we should reuse the virtual wall time data from use case 1, rather than rebenching, but that wouldn't play nice with the benchmark-this-particular-PR-against-main sub-use case.

----------------------------------------------------------

cc @alexcrichton @cfallin @ricochet 

Continuous Sightglass benchmarking #13576
