Improve benchmarks 1374 #1385

Merged
yebai merged 4 commits into main from improve-benchmarks-1374 on May 4, 2026

yebai (Member) commented May 4, 2026

No description provided.

yebai and others added 2 commits May 4, 2026 22:07
Drops the base-vs-head comparison entirely. The benchmark workflow now
runs once on the PR head, on a pinned `ubuntu-22.04` runner, and reports
absolute log-density times plus gradient/log-density ratios in the
posted comment. Output schema follows Mooncake's bench harness; readers
compare against recent main-branch comments to spot regressions.

Noise reduction in `run_ad`: per-sample incremental GC teardown and a
full GC before each measurement keep accumulated garbage from triggering
mid-sample collections that inflate individual samples. Adds a
`benchmark_seconds` knob for tightening the median estimate. Also
removes the synthetic reference timing that normalised eval times
against a non-DPPL function.
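
The noise-reduction idea above (a full collection before measuring, a cheap collection after each sample, and a median over the samples) can be sketched as follows. This is a minimal Python illustration, not the PR's `run_ad` code: Python's `gc` module stands in for Julia's GC, and `median_time` and its `samples` parameter are hypothetical names.

```python
import gc
import statistics
import time


def median_time(fn, samples=25):
    """Median wall-clock seconds per call of fn, with GC kept out of the timed window."""
    times = []
    gc.collect()   # full collection before measurement starts
    gc.disable()   # no automatic collections while a sample is being timed
    try:
        for _ in range(samples):
            start = time.perf_counter()
            fn()
            times.append(time.perf_counter() - start)
            gc.collect(0)  # cheap youngest-generation sweep as per-sample teardown
    finally:
        gc.enable()
    return statistics.median(times)


elapsed = median_time(lambda: sum(range(1000)), samples=5)
```

Taking more samples (the role the commit gives to `benchmark_seconds`) tightens the median estimate at the cost of a longer run.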

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inline `Models.jl` and `DynamicPPLBenchmarks.jl` into `benchmarks.jl` and
convert `benchmarks/Project.toml` from a package to a flat environment,
mirroring Mooncake.jl's `bench/run_benchmarks.jl` layout.

Also: take dim from `length(r.params)` (run_ad already constructed the LDF)
so models are no longer evaluated twice on the success path; switch results
to NamedTuples so `print_results` reads `r.name`/`r.dim`/...; extract
`transform_strategy(islinked)` helper; drop unused imports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
github-actions (bot, Contributor) commented May 4, 2026

DynamicPPL.jl documentation for PR #1385 is available at:
https://TuringLang.github.io/DynamicPPL.jl/previews/PR1385/

github-actions (bot, Contributor) commented May 4, 2026

Benchmark Report

  • this PR's head: 952b7561b76144567249ba6d28348cbd9c13d392

Absolute log-density times and grad/log-density ratios are
reported. To judge whether a PR helps or hurts, compare against
the latest comment on a recent main-branch PR run.

Computer Information

Julia Version 1.11.9
Commit 53a02c0720c (2026-02-06 00:27 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Benchmark Results

Gist: Smorgasbord

┌─────────────┬─────┬─────────────┬────────┬───────────────┬───────────────────────┐
│       Model │ Dim │  AD Backend │ Linked │ t(logdensity) │ t(grad)/t(logdensity) │
├─────────────┼─────┼─────────────┼────────┼───────────────┼───────────────────────┤
│ Smorgasbord │ 201 │ forwarddiff │  false │       6.41 μs │                 66.20 │
│ Smorgasbord │ 201 │ reversediff │  false │       6.52 μs │                125.67 │
│ Smorgasbord │ 201 │    mooncake │  false │       6.55 μs │                  6.33 │
│ Smorgasbord │ 201 │      enzyme │  false │       6.56 μs │                  8.72 │
│ Smorgasbord │ 201 │ forwarddiff │   true │       8.86 μs │                 71.32 │
│ Smorgasbord │ 201 │ reversediff │   true │       8.86 μs │                124.82 │
│ Smorgasbord │ 201 │    mooncake │   true │       9.46 μs │                  5.10 │
│ Smorgasbord │ 201 │      enzyme │   true │       9.16 μs │                  5.87 │
└─────────────┴─────┴─────────────┴────────┴───────────────┴───────────────────────┘
Full table (68 rows)
┌───────────────────────┬───────┬─────────────┬────────┬───────────────┬───────────────────────┐
│                 Model │   Dim │  AD Backend │ Linked │ t(logdensity) │ t(grad)/t(logdensity) │
├───────────────────────┼───────┼─────────────┼────────┼───────────────┼───────────────────────┤
│ Simple assume observe │     1 │ forwarddiff │  false │       5.89 ns │                 10.11 │
│ Simple assume observe │     1 │ reversediff │  false │       7.52 ns │                960.30 │
│ Simple assume observe │     1 │    mooncake │  false │       5.88 ns │                 28.78 │
│ Simple assume observe │     1 │      enzyme │  false │       6.18 ns │                  5.95 │
│ Simple assume observe │     1 │ forwarddiff │   true │       23.8 ns │                  2.49 │
│ Simple assume observe │     1 │ reversediff │   true │       23.8 ns │                330.22 │
│ Simple assume observe │     1 │    mooncake │   true │       23.7 ns │                  7.15 │
│ Simple assume observe │     1 │      enzyme │   true │       24.1 ns │                  1.52 │
│           Smorgasbord │   201 │ forwarddiff │  false │       6.41 μs │                 66.20 │
│           Smorgasbord │   201 │ reversediff │  false │       6.52 μs │                125.67 │
│           Smorgasbord │   201 │    mooncake │  false │       6.55 μs │                  6.33 │
│           Smorgasbord │   201 │      enzyme │  false │       6.56 μs │                  8.72 │
│           Smorgasbord │   201 │ forwarddiff │   true │       8.86 μs │                 71.32 │
│           Smorgasbord │   201 │ reversediff │   true │       8.86 μs │                124.82 │
│           Smorgasbord │   201 │    mooncake │   true │       9.46 μs │                  5.10 │
│           Smorgasbord │   201 │      enzyme │   true │       9.16 μs │                  5.87 │
│    Loop univariate 1k │  1000 │ forwarddiff │  false │       20.7 μs │                913.93 │
│    Loop univariate 1k │  1000 │ reversediff │  false │       20.5 μs │                269.34 │
│    Loop univariate 1k │  1000 │    mooncake │  false │       20.6 μs │                  6.97 │
│    Loop univariate 1k │  1000 │      enzyme │  false │       20.1 μs │                  5.89 │
│    Loop univariate 1k │  1000 │ forwarddiff │   true │       21.8 μs │               1244.31 │
│    Loop univariate 1k │  1000 │ reversediff │   true │       21.8 μs │                258.07 │
│    Loop univariate 1k │  1000 │    mooncake │   true │       21.6 μs │                  6.66 │
│    Loop univariate 1k │  1000 │      enzyme │   true │       21.5 μs │                  5.52 │
│       Multivariate 1k │  1000 │ forwarddiff │  false │       23.7 μs │                376.80 │
│       Multivariate 1k │  1000 │ reversediff │  false │       24.2 μs │                 72.33 │
│       Multivariate 1k │  1000 │    mooncake │  false │       27.1 μs │                  8.21 │
│       Multivariate 1k │  1000 │      enzyme │  false │       27.8 μs │                  3.14 │
│       Multivariate 1k │  1000 │ forwarddiff │   true │       21.9 μs │                340.52 │
│       Multivariate 1k │  1000 │ reversediff │   true │       28.8 μs │                 61.44 │
│       Multivariate 1k │  1000 │    mooncake │   true │       28.0 μs │                  8.12 │
│       Multivariate 1k │  1000 │      enzyme │   true │       29.7 μs │                  3.14 │
│   Loop univariate 10k │ 10000 │ forwarddiff │  false │      204.0 μs │              10457.81 │
│   Loop univariate 10k │ 10000 │ reversediff │  false │      207.0 μs │                288.21 │
│   Loop univariate 10k │ 10000 │    mooncake │  false │      203.0 μs │                  7.24 │
│   Loop univariate 10k │ 10000 │      enzyme │  false │      203.0 μs │                  5.79 │
│   Loop univariate 10k │ 10000 │ forwarddiff │   true │      218.0 μs │              10903.04 │
│   Loop univariate 10k │ 10000 │ reversediff │   true │      219.0 μs │                274.70 │
│   Loop univariate 10k │ 10000 │    mooncake │   true │      217.0 μs │                  6.77 │
│   Loop univariate 10k │ 10000 │      enzyme │   true │      216.0 μs │                  5.45 │
│      Multivariate 10k │ 10000 │ forwarddiff │  false │      217.0 μs │               4956.73 │
│      Multivariate 10k │ 10000 │ reversediff │  false │      218.0 μs │                 80.17 │
│      Multivariate 10k │ 10000 │    mooncake │  false │      216.0 μs │                 10.40 │
│      Multivariate 10k │ 10000 │      enzyme │  false │      217.0 μs │                  2.15 │
│      Multivariate 10k │ 10000 │ forwarddiff │   true │      218.0 μs │               4710.61 │
│      Multivariate 10k │ 10000 │ reversediff │   true │      218.0 μs │                 81.30 │
│      Multivariate 10k │ 10000 │    mooncake │   true │      217.0 μs │                 10.42 │
│      Multivariate 10k │ 10000 │      enzyme │   true │      217.0 μs │                  2.12 │
│               Dynamic │    15 │ forwarddiff │  false │           err │                   err │
│               Dynamic │    15 │ reversediff │  false │       1.34 μs │                 45.96 │
│               Dynamic │    15 │    mooncake │  false │       1.35 μs │                 15.00 │
│               Dynamic │    15 │      enzyme │  false │       1.36 μs │                 11.76 │
│               Dynamic │    10 │ forwarddiff │   true │       1.93 μs │                  1.87 │
│               Dynamic │    10 │ reversediff │   true │       1.94 μs │                 58.30 │
│               Dynamic │    10 │    mooncake │   true │       1.99 μs │                 20.83 │
│               Dynamic │    10 │      enzyme │   true │       1.91 μs │                 20.60 │
│              Submodel │     1 │ forwarddiff │  false │       5.89 ns │                 10.17 │
│              Submodel │     1 │ reversediff │  false │        5.9 ns │               1340.52 │
│              Submodel │     1 │    mooncake │  false │       5.89 ns │                 28.89 │
│              Submodel │     1 │      enzyme │  false │       5.89 ns │                  6.24 │
│              Submodel │     1 │ forwarddiff │   true │       5.89 ns │                 10.14 │
│              Submodel │     1 │ reversediff │   true │       7.52 ns │               1117.21 │
│              Submodel │     1 │    mooncake │   true │       5.89 ns │                 28.97 │
│              Submodel │     1 │      enzyme │   true │        5.9 ns │                  6.23 │
│                   LDA │    12 │ forwarddiff │   true │       22.8 μs │                  0.46 │
│                   LDA │    12 │ reversediff │   true │       22.9 μs │                  2.02 │
│                   LDA │    12 │    mooncake │   true │       25.5 μs │                 31.60 │
│                   LDA │    12 │      enzyme │   true │           err │                   err │
└───────────────────────┴───────┴─────────────┴────────┴───────────────┴───────────────────────┘

codecov (bot) commented May 4, 2026

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.26%. Comparing base (10a3651) to head (952b756).

Files with missing lines | Patch % | Lines
src/test_utils/ad.jl | 0.00% | 6 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1385      +/-   ##
==========================================
- Coverage   82.35%   82.26%   -0.10%     
==========================================
  Files          50       50              
  Lines        3531     3535       +4     
==========================================
  Hits         2908     2908              
- Misses        623      627       +4     

yebai and others added 2 commits May 4, 2026 23:14
Run a full cross-product of the 9 model configs × 4 AD backends ×
{linked, unlinked} = 72 rows, ordered model → linked → backend so each
model's eight rows are adjacent for side-by-side inspection.
`:reversediff_compiled` is excluded because compiled tapes are
input-dependent and silently produce wrong gradients on
parameter-dependent control flow (see CLAUDE.md).

Per-row logging mirrors Mooncake's `bench/run_benchmarks.jl`: an
`(i / N, name, (linked = …))` header, the backend on its own line,
then `t(logdensity)` / `t(grad)` formatted with units. `model_dimension`
is now defensive (returns `missing` on init failures) and the table
formats `missing` dims as `err`, so combos that crash during dimension
lookup still produce a well-formed row instead of derailing the run.
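
A defensive dimension lookup of this kind might look like the sketch below. It is written in Python for illustration only: `None` stands in for Julia's `missing`, and `format_dim`, `failing`, and the example rows are hypothetical.

```python
def model_dimension(make_params):
    """Return the parameter count, or None if building the parameters fails."""
    try:
        return len(make_params())
    except Exception:
        return None  # stands in for Julia's `missing`


def format_dim(dim):
    """Render a missing dimension as 'err' so the row stays well-formed."""
    return "err" if dim is None else str(dim)


def failing():
    raise ValueError("init failed")


rows = [("ok model", lambda: [0.0] * 3), ("broken model", failing)]
table = [(name, format_dim(model_dimension(make))) for name, make in rows]
```

The failing combination still yields a row, so the remaining rows are produced instead of the whole run aborting.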

Also: add a `setup` stage to `run_ad`'s Chairmarks pipeline that
deep-copies `params` per sample, matching Mooncake's harness — setup
runs before the timed window, so the copy is excluded from
measurements. Widen `combos` to a typed `Tuple{...}[]` so it accepts
models with non-default contexts (e.g. the `condition`-wrapped
`loop_univariate`/`multivariate` rows).
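
The setup-stage pattern described above can be illustrated with a small Python sketch. It does not use Chairmarks; `time_with_setup` is a hypothetical helper showing only that the per-sample copy happens before the timer starts.

```python
import copy
import time


def time_with_setup(fn, params, samples=10):
    """Time fn on a fresh deep copy of params for every sample."""
    times = []
    for _ in range(samples):
        fresh = copy.deepcopy(params)  # setup: runs outside the timed window
        start = time.perf_counter()
        fn(fresh)                      # only this call is measured
        times.append(time.perf_counter() - start)
    return times


original = [3, 1, 2]
timings = time_with_setup(lambda p: p.sort(), original, samples=4)
```

Because each sample receives its own copy, a function that mutates its input sees the same starting state every time, and `original` is left untouched.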

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apsible full table

LDA's discrete `Categorical` RVs make `linked = false` ill-defined for
gradient-based AD, so all four backends previously errored on those
rows. Skip them at combination time, leaving 68 rows.
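
The row count can be checked with a short sketch. The model and backend names are taken from the benchmark table in this thread; the code itself is an illustrative Python stand-in, not the script's actual combination logic.

```python
from itertools import product

models = [
    "Simple assume observe", "Smorgasbord", "Loop univariate 1k",
    "Multivariate 1k", "Loop univariate 10k", "Multivariate 10k",
    "Dynamic", "Submodel", "LDA",
]
backends = ["forwarddiff", "reversediff", "mooncake", "enzyme"]

# Ordered model -> linked -> backend, so each model's rows are adjacent.
all_combos = list(product(models, [False, True], backends))

# Drop the unlinked LDA rows at combination time.
combos = [
    (model, linked, backend)
    for model, linked, backend in all_combos
    if not (model == "LDA" and not linked)
]
```

The full cross-product has 9 × 2 × 4 = 72 entries; removing the four unlinked LDA rows leaves 68.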

In markdown mode, emit a `### Gist: Smorgasbord` block with just that
model's eight rows (Smorgasbord covers the broadest set of DPPL
features, so it is the most informative single row band), then put the
full 68-row table inside `<details><summary>` so it is collapsed by
default in GitHub PR comments. Plain (non-markdown) output is
unchanged. Drop the now-redundant `### Benchmark Results` heading from
the workflow body since the script emits its own.
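
The markdown layout described above could be assembled roughly as follows. This is a hedged Python sketch: `render_markdown` and its arguments are hypothetical, and only the `### Gist: Smorgasbord` heading, the `<details><summary>` wrapper, and the "Full table (68 rows)" label come from this thread.

```python
def render_markdown(gist_table, full_table, n_rows):
    """Emit a gist section followed by a collapsed full table."""
    fence = "`" * 3  # built at runtime so this example contains no nested fence
    lines = [
        "### Gist: Smorgasbord",
        "",
        fence,
        gist_table,
        fence,
        "",
        f"<details><summary>Full table ({n_rows} rows)</summary>",
        "",  # blank line so GitHub renders the fenced block inside <details>
        fence,
        full_table,
        fence,
        "",
        "</details>",
    ]
    return "\n".join(lines)


comment = render_markdown("gist rows here", "all rows here", 68)
```

GitHub renders the `<details>` element collapsed by default, so the PR comment shows only the Smorgasbord rows until a reader expands the full table.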

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yebai yebai marked this pull request as ready for review May 4, 2026 23:14
@yebai yebai merged commit 2691e7c into main May 4, 2026
18 of 20 checks passed
@yebai yebai deleted the improve-benchmarks-1374 branch May 4, 2026 23:14
- base branch: `${{ needs.benchmark-base.outputs.sha }}`
- this PR's head: `${{ github.event.pull_request.head.sha }}`

Absolute log-density times and grad/log-density ratios are

Removing the PR-time benchmarking of main vs. the change seems strictly worse and, as far as I am aware, doesn't follow how most benchmarking tools approach this.

It's not clear what motivated this. We have been enjoying AirspeedVelocity.jl: https://astroautomata.com/AirspeedVelocity.jl/stable/

Reply from the PR author (Member):

These models are extremely lightweight: a few models' log-density evaluations take only a couple of nanoseconds. The main value of these benchmarks is for eyeballing obvious regressions (since the models are cheap, any instability or allocation will be caught). On the other hand, this also means that the PR-vs-main comparisons are quite noisy, so the main baseline was removed.

That said, #1386 added the main baseline back, though I don't think it is very useful.


To be honest, I'm not sure I understand the motivation for removing it, but it sounds like this has churned back around. In the other PR I noted that the main change is now to remove a clear ratio comparison, which seems strictly worse?


2 participants