Improve benchmarks 1374#1385
Conversation
Drops the base-vs-head comparison entirely. The benchmark workflow now runs once on the PR head, on a pinned `ubuntu-22.04` runner, and reports absolute log-density times plus gradient/log-density ratios in the posted comment. Output schema follows Mooncake's bench harness; readers compare against recent main-branch comments to spot regressions. Noise reduction in `run_ad`: per-sample incremental GC teardown and a full GC before each measurement keep accumulated garbage from triggering mid-sample collections that inflate individual samples. Adds a `benchmark_seconds` knob for tightening the median estimate. Also removes the synthetic reference timing that normalised eval times against a non-DPPL function. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inline `Models.jl` and `DynamicPPLBenchmarks.jl` into `benchmarks.jl` and convert `benchmarks/Project.toml` from a package to a flat environment, mirroring Mooncake.jl's `bench/run_benchmarks.jl` layout. Also: take dim from `length(r.params)` (run_ad already constructed the LDF) so models are no longer evaluated twice on the success path; switch results to NamedTuples so `print_results` reads `r.name`/`r.dim`/...; extract `transform_strategy(islinked)` helper; drop unused imports. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
DynamicPPL.jl documentation for PR #1385 is available at: |
Benchmark Report
Absolute log-density times and grad/log-density ratios are Computer InformationBenchmark ResultsGist: SmorgasbordFull table (68 rows) |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1385 +/- ##
==========================================
- Coverage 82.35% 82.26% -0.10%
==========================================
Files 50 50
Lines 3531 3535 +4
==========================================
Hits 2908 2908
- Misses 623 627 +4 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Run a full cross-product of the 9 model configs × 4 AD backends ×
{linked, unlinked} = 72 rows, ordered model → linked → backend so each
model's eight rows are adjacent for side-by-side inspection.
`:reversediff_compiled` is excluded because compiled tapes are
input-dependent and silently produce wrong gradients on
parameter-dependent control flow (see CLAUDE.md).
Per-row logging mirrors Mooncake's `bench/run_benchmarks.jl`: an
`(i / N, name, (linked = …))` header, the backend on its own line,
then `t(logdensity)` / `t(grad)` formatted with units. `model_dimension`
is now defensive (returns `missing` on init failures) and the table
formats `missing` dims as `err`, so combos that crash during dimension
lookup still produce a well-formed row instead of derailing the run.
Also: add a `setup` stage to `run_ad`'s Chairmarks pipeline that
deep-copies `params` per sample, matching Mooncake's harness — setup
runs before the timed window, so the copy is excluded from
measurements. Widen `combos` to a typed `Tuple{...}[]` so it accepts
models with non-default contexts (e.g. the `condition`-wrapped
`loop_univariate`/`multivariate` rows).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apsible full table LDA's discrete `Categorical` RVs make `linked = false` ill-defined for gradient-based AD, so all four backends previously errored on those rows. Skip them at combination time, leaving 68 rows. In markdown mode, emit a `### Gist: Smorgasbord` block with just that model's eight rows (Smorgasbord covers the broadest set of DPPL features, so it is the most informative single row band), then put the full 68-row table inside `<details><summary>` so it is collapsed by default in GitHub PR comments. Plain (non-markdown) output is unchanged. Drop the now-redundant `### Benchmark Results` heading from the workflow body since the script emits its own. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| - base branch: `${{ needs.benchmark-base.outputs.sha }}` | ||
| - this PR's head: `${{ github.event.pull_request.head.sha }}` | ||
|
|
||
| Absolute log-density times and grad/log-density ratios are |
There was a problem hiding this comment.
Removing the at PR time benchmarking of main vs change seems strictly worse and as far as I am aware doesn't follow how most benchmarking tools approach this?
Not clear what motivated this? We have been enjoying airspeedvelocity https://astroautomata.com/AirspeedVelocity.jl/stable/
There was a problem hiding this comment.
These models are extremely lightweight: a few models' log densities are a couple of nanoseconds. The main value of these benchmarks is for eyeballing obvious regressions (since the models are cheap, any instability or allocation will be caught). On the other hand, it does imply that their benchmarks (PR vs main) are quite noisy as well, so the main baseline was removed.
That said, #1386 added the main back, though I don't think it is very useful.
There was a problem hiding this comment.
I'm not sure I understand the motivation for removing then to be honest but it sounds like this has churned back around. In the other PR I noted that the main change is now to remove a clear ratio comparison which seems strictly worse?
No description provided.