Continuous Sightglass benchmarking #13576

@fitzgen

We've wanted continuous Sightglass benchmarking for Cranelift and Wasmtime for a looong time; we are finally getting to the point where we can prioritize it. But first we need to agree on exactly what we measure, how we measure it (i.e. high-level consensus on architecture), and what we do with those measurements (i.e. how we display them, what kind of automatic alerting/issue-filing/etc. we do).


Benchmarking is hard. Lots of people have opinions on how it should best be done. Historically, this is a topic that has attracted a lot of drive-by opinions. I ask that people kindly refrain from engaging on this issue unless they are regularly hacking on Wasmtime, Cranelift, or Sightglass.


I think we have roughly two separate use cases for continuous benchmarking:

  1. Viewing long-term trends over time. Basically, drawing arewefastyet.com-style graphs where the x axis is ~nightly Wasmtime builds and the y axis is some metric of performance. Gives us a rough sense of how performance today is compared to last week/month/year/etc.

  2. Catching and reporting performance regressions as quickly as we can. Basically, do a statistical significance test between ~yesterday's main and today's main. If the difference is significant and performance got worse, then open an issue detailing the regression's commit range and effect size. Helps us avoid unwittingly regressing performance. This could also be triggered explicitly by Wasmtime/Cranelift maintainers for a particular PR, to compare its performance against main.

These two use cases have (or can have) different requirements:

  • (1) must have either a dedicated machine or some other manner of making results recorded on different machines ~deterministic (e.g. measure with callgrind rather than native cycles) or else the trend over time is meaningless.
  • However, (2) doesn't necessarily need dedicated hardware or even the ability to compare results recorded on different machines: it can build yesterday's and today's Wasmtime, run them both at the same time on the same machine, and then do a statistical significance test between their results. It doesn't matter if this process ran on a different machine two days ago, because it isn't reusing that data.
    • But if it did have access to dedicated hardware, it could potentially reuse yesterday's data, instead of re-running yesterday's Wasmtime
  • (1) requires saving the results (or at least a summary of the results) of historical benchmark runs
  • (2) does not necessarily need that

I just point this out to make sure we don't conflate the two use cases and over-constrain one because of the requirements of the other.
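To make the "statistical significance test" in use case (2) concrete, here is a minimal sketch of one way it could be done: a two-sided permutation test on the difference of mean cycle counts between two sets of samples recorded on the same machine. This is an illustration only, not what Sightglass actually implements; the function names, the `alpha` threshold, and the number of rounds are all assumptions.

```python
import random


def _mean(values):
    return sum(values) / len(values)


def permutation_test(baseline, candidate, rounds=10_000, seed=0):
    """Two-sided permutation test on the difference of means.

    Returns (observed_diff, p_value), where observed_diff is
    mean(candidate) - mean(baseline).
    """
    rng = random.Random(seed)
    observed = _mean(candidate) - _mean(baseline)
    pooled = list(baseline) + list(candidate)
    n = len(baseline)
    hits = 0
    for _ in range(rounds):
        # Randomly reassign samples to the two groups.
        rng.shuffle(pooled)
        diff = _mean(pooled[n:]) - _mean(pooled[:n])
        if abs(diff) >= abs(observed):
            hits += 1
    # Add-one smoothing so the estimated p-value is never exactly zero.
    return observed, (hits + 1) / (rounds + 1)


def is_regression(baseline, candidate, alpha=0.01):
    """True if candidate uses significantly more cycles than baseline."""
    observed, p_value = permutation_test(baseline, candidate)
    return observed > 0 and p_value < alpha
```

Here `baseline` would be the cycle counts from yesterday's build and `candidate` those from today's build, ideally collected in interleaved runs so both see the same machine conditions.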


Which measurements should we record?

Sightglass supports recording a variety of measurements:

  • Instructions retired
  • Cycles
  • Wall time
  • Various perf counters, notably cache accesses and misses
  • Simulated (and therefore ~deterministic across machines) microarch details via callgrind:
    • instructions-retired
    • data-reads
    • data-writes
    • l1-icache-misses
    • l1-dcache-read-misses
    • l1-dcache-write-misses
    • ll-icache-misses
    • ll-dcache-read-misses
    • ll-dcache-write-misses
    • conditional-branches
    • conditional-branch-misses
    • indirect-branches
    • indirect-branch-misses
    • (We don't currently but we could also compute "virtual cycles" or "virtual wall time" based on this data, which should also be ~deterministic across machines)
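As a rough sketch of what computing "virtual cycles" from the callgrind counters above could look like: a weighted sum of instructions and simulated miss events. The weights below are placeholders, loosely in the spirit of the cycle-estimation formula KCachegrind uses; they are assumptions for illustration and would need calibrating against real hardware before anyone trusts the numbers.

```python
# Hypothetical per-event costs in cycles. These are illustrative placeholders,
# not values that Sightglass or callgrind define.
INSTRUCTION_COST = 1
L1_MISS_COST = 10
LL_MISS_COST = 100
BRANCH_MISS_COST = 10

L1_MISS_EVENTS = (
    "l1-icache-misses",
    "l1-dcache-read-misses",
    "l1-dcache-write-misses",
)
LL_MISS_EVENTS = (
    "ll-icache-misses",
    "ll-dcache-read-misses",
    "ll-dcache-write-misses",
)
BRANCH_MISS_EVENTS = (
    "conditional-branch-misses",
    "indirect-branch-misses",
)


def virtual_cycles(counts):
    """Estimate cycles from a dict mapping counter name to event count.

    Counters that are absent from `counts` are treated as zero.
    """
    def total(names):
        return sum(counts.get(name, 0) for name in names)

    return (
        INSTRUCTION_COST * counts.get("instructions-retired", 0)
        + L1_MISS_COST * total(L1_MISS_EVENTS)
        + LL_MISS_COST * total(LL_MISS_EVENTS)
        + BRANCH_MISS_COST * total(BRANCH_MISS_EVENTS)
    )
```

Because every input is a simulated count, the result should be as deterministic across machines as the counters themselves.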

Doing everything is going to be too noisy and give us information overload. It will also take more resources. So I think we really do want to narrow things down here, ideally to a single measure.

Which should we use for use case (1), long-term trends over time?

Which should we use for use case (2), significance tests between today's and ~yesterday's Wasmtime to catch performance regressions quickly?

How do we measure it?

What's the high-level architecture? Dedicated hardware? How does it integrate into GitHub and GitHub Actions?

What do we do with it?

Once we've recorded some benchmark data, what actions do we take based on the results? Do we plot it and serve it on a web page? Do we file issues? Etc...


My Recommendation: Two separate jobs

Use case (1), long-term trends over time

  • Measure: virtual cycles/wall-time based on callgrind simulation, so we can avoid dedicated hardware but still measure roughly the thing we really care about (wall time) as opposed to things that are too far-removed (like instructions retired)
  • How:
    • make a bytecodealliance/arewefastyet repo (or other name)
    • github daily cron action to fetch the latest Wasmtime and run Sightglass
    • save the raw benchmark data as a file (it will be small since we only need ~3 iterations for callgrind because it is very low noise)
    • run a script to render/summarize all historical data, write it to a canonical file
    • index.html loads that summary and displays it
    • very simple; no dedicated hardware involved, no S3, no CDN, etc...
  • What:
    • two scatter plots for each benchmark: compilation and execution
      • x axis: date of wasmtime build
      • y axis: performance (virtual wall time)
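The "render/summarize all historical data" step above could be sketched roughly as follows. Everything about the on-disk layout is an assumption made for illustration: one raw file per day named `YYYY-MM-DD.json`, each holding a list of records with hypothetical `benchmark`, `phase`, and `virtual_cycles` fields. It is not Sightglass's actual output format.

```python
import json
from pathlib import Path
from statistics import median


def summarize(results_dir, summary_path):
    """Collapse per-day raw results into one time series per benchmark/phase.

    Writes the summary as JSON to `summary_path` and also returns it.
    """
    series = {}
    # ISO dates sort lexicographically, so sorted paths are in date order.
    for path in sorted(Path(results_dir).glob("*.json")):
        date = path.stem
        samples = {}
        for record in json.loads(path.read_text()):
            key = f'{record["benchmark"]}/{record["phase"]}'
            samples.setdefault(key, []).append(record["virtual_cycles"])
        for key, values in samples.items():
            # One point per day: the median of that day's few iterations.
            series.setdefault(key, []).append(
                {"date": date, "value": median(values)}
            )
    Path(summary_path).write_text(json.dumps(series, indent=2, sort_keys=True))
    return series
```

A call like `summarize("results", "summary.json")` would produce the canonical file that `index.html` loads; the summary is written outside the results directory so it is not picked up as raw data on the next run.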

Use case (2), significance tests between today's and ~yesterday's Wasmtime

  • Measure: cycles
  • How:
    • github action/workflow in the Wasmtime repo
    • check out today's and yesterday's Wasmtimes
    • build libwasmtime_bench_api.so for each of them
    • run sightglass, passing -e yesterday.so -e today.so
    • if there is any statistically significant regression, open an issue detailing the regression's commit range and effect size
  • What:
    • No webpage, no graphs/plots
    • Just the issues opened on regression
      • Or a comment on a PR, if explicitly triggered for a particular PR (main vs PR branch, instead of today vs yesterday)
    • Issue/comment has an HTML table with a row for each regression, columns for benchmark, phase, and effect size, something like this:
      benchmark    phase          effect size
      bz2.wasm     compilation    1.23x +/- .03 more cycles
    • Issue/comment also has info on how to run locally
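For the HTML table in the issue/comment body, a minimal sketch of the rendering step might look like this. The shape of each regression record (`benchmark`, `phase`, `ratio`, `margin`) is a hypothetical one chosen for this example; actually posting the issue or comment is left out.

```python
from html import escape


def render_regressions(regressions):
    """Render a list of regression records as an HTML table string.

    Each record is a dict with hypothetical keys: "benchmark", "phase",
    "ratio" (slowdown factor), and "margin" (half-width of its interval).
    """
    rows = ["<tr><th>benchmark</th><th>phase</th><th>effect size</th></tr>"]
    for reg in regressions:
        effect = f'{reg["ratio"]:.2f}x +/- {reg["margin"]:.2f} more cycles'
        rows.append(
            "<tr>"
            f"<td>{escape(reg['benchmark'])}</td>"
            f"<td>{escape(reg['phase'])}</td>"
            f"<td>{effect}</td>"
            "</tr>"
        )
    return "<table>\n" + "\n".join(rows) + "\n</table>"
```

The same function would serve both the scheduled job (issue body) and the explicitly triggered PR job (comment body), since only the surrounding text differs.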

I could also be convinced we should reuse the virtual wall time data from use case 1, rather than rebenching, but that wouldn't play nice with the benchmark-this-particular-PR-against-main sub-use case.


cc @alexcrichton @cfallin @ricochet
