Improve benchmarks 1374 by yebai · Pull Request #1385 · TuringLang/DynamicPPL.jl

yebai · 2026-05-04T22:01:14Z

No description provided.

Drops the base-vs-head comparison entirely. The benchmark workflow now runs once on the PR head, on a pinned `ubuntu-22.04` runner, and reports absolute log-density times plus gradient/log-density ratios in the posted comment. Output schema follows Mooncake's bench harness; readers compare against recent main-branch comments to spot regressions.

Noise reduction in `run_ad`: per-sample incremental GC teardown and a full GC before each measurement keep accumulated garbage from triggering mid-sample collections that inflate individual samples. Adds a `benchmark_seconds` knob for tightening the median estimate.

Also removes the synthetic reference timing that normalised eval times against a non-DPPL function.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
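The noise-reduction idea described above can be illustrated outside Julia. What follows is a minimal Python sketch, not the PR's `run_ad` code: it forces a full collection before each timed sample, keeps the collector off during the timed window, and reports a median time plus a gradient/log-density ratio. The names `sample_times`, `logdensity`, `gradient`, and `make_params` are hypothetical stand-ins, and Python's `gc` module only approximates the behaviour of Julia's garbage collector.

```python
import gc
import statistics
import time


def sample_times(fn, make_input, n_samples=50):
    """Return one wall-clock time (in seconds) per sample of fn(make_input())."""
    times = []
    for _ in range(n_samples):
        arg = make_input()   # build the input outside the timed window
        gc.collect()         # full collection before each measurement
        gc.disable()         # keep the collector out of the timed window
        try:
            start = time.perf_counter()
            fn(arg)
            times.append(time.perf_counter() - start)
        finally:
            gc.enable()      # per-sample teardown: restore the collector
    return times


def logdensity(xs):
    # Hypothetical stand-in for a model's log density.
    return -0.5 * sum(x * x for x in xs)


def gradient(xs):
    # Hypothetical stand-in for the gradient of the log density.
    return [-x for x in xs]


def make_params():
    return [0.1 * i for i in range(200)]


t_logdensity = statistics.median(sample_times(logdensity, make_params))
t_grad = statistics.median(sample_times(gradient, make_params))
ratio = t_grad / t_logdensity
print(f"t(logdensity) = {t_logdensity * 1e6:.2f} us, t(grad)/t(logdensity) = {ratio:.2f}")
```

Taking the median rather than the mean means a single sample inflated by an unrelated pause has little effect on the reported number; collecting more samples (the role the commit message gives to `benchmark_seconds`) tightens that estimate further.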

Inline `Models.jl` and `DynamicPPLBenchmarks.jl` into `benchmarks.jl` and convert `benchmarks/Project.toml` from a package to a flat environment, mirroring Mooncake.jl's `bench/run_benchmarks.jl` layout.

Also:
- take dim from `length(r.params)` (`run_ad` already constructed the LDF) so models are no longer evaluated twice on the success path;
- switch results to NamedTuples so `print_results` reads `r.name`/`r.dim`/...;
- extract `transform_strategy(islinked)` helper;
- drop unused imports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-04T22:06:57Z

DynamicPPL.jl documentation for PR #1385 is available at:
https://TuringLang.github.io/DynamicPPL.jl/previews/PR1385/

github-actions · 2026-05-04T22:07:08Z

Benchmark Report

this PR's head: 952b7561b76144567249ba6d28348cbd9c13d392

Absolute log-density times and grad/log-density ratios are
reported. To judge whether a PR helps or hurts, compare against
the latest comment on a recent main-branch PR run.

Computer Information

Julia Version 1.11.9
Commit 53a02c0720c (2026-02-06 00:27 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

Benchmark Results

Gist: Smorgasbord

┌─────────────┬─────┬─────────────┬────────┬───────────────┬───────────────────────┐
│       Model │ Dim │  AD Backend │ Linked │ t(logdensity) │ t(grad)/t(logdensity) │
├─────────────┼─────┼─────────────┼────────┼───────────────┼───────────────────────┤
│ Smorgasbord │ 201 │ forwarddiff │  false │       6.41 μs │                 66.20 │
│ Smorgasbord │ 201 │ reversediff │  false │       6.52 μs │                125.67 │
│ Smorgasbord │ 201 │    mooncake │  false │       6.55 μs │                  6.33 │
│ Smorgasbord │ 201 │      enzyme │  false │       6.56 μs │                  8.72 │
│ Smorgasbord │ 201 │ forwarddiff │   true │       8.86 μs │                 71.32 │
│ Smorgasbord │ 201 │ reversediff │   true │       8.86 μs │                124.82 │
│ Smorgasbord │ 201 │    mooncake │   true │       9.46 μs │                  5.10 │
│ Smorgasbord │ 201 │      enzyme │   true │       9.16 μs │                  5.87 │
└─────────────┴─────┴─────────────┴────────┴───────────────┴───────────────────────┘

Full table (68 rows)

┌───────────────────────┬───────┬─────────────┬────────┬───────────────┬───────────────────────┐
│                 Model │   Dim │  AD Backend │ Linked │ t(logdensity) │ t(grad)/t(logdensity) │
├───────────────────────┼───────┼─────────────┼────────┼───────────────┼───────────────────────┤
│ Simple assume observe │     1 │ forwarddiff │  false │       5.89 ns │                 10.11 │
│ Simple assume observe │     1 │ reversediff │  false │       7.52 ns │                960.30 │
│ Simple assume observe │     1 │    mooncake │  false │       5.88 ns │                 28.78 │
│ Simple assume observe │     1 │      enzyme │  false │       6.18 ns │                  5.95 │
│ Simple assume observe │     1 │ forwarddiff │   true │       23.8 ns │                  2.49 │
│ Simple assume observe │     1 │ reversediff │   true │       23.8 ns │                330.22 │
│ Simple assume observe │     1 │    mooncake │   true │       23.7 ns │                  7.15 │
│ Simple assume observe │     1 │      enzyme │   true │       24.1 ns │                  1.52 │
│           Smorgasbord │   201 │ forwarddiff │  false │       6.41 μs │                 66.20 │
│           Smorgasbord │   201 │ reversediff │  false │       6.52 μs │                125.67 │
│           Smorgasbord │   201 │    mooncake │  false │       6.55 μs │                  6.33 │
│           Smorgasbord │   201 │      enzyme │  false │       6.56 μs │                  8.72 │
│           Smorgasbord │   201 │ forwarddiff │   true │       8.86 μs │                 71.32 │
│           Smorgasbord │   201 │ reversediff │   true │       8.86 μs │                124.82 │
│           Smorgasbord │   201 │    mooncake │   true │       9.46 μs │                  5.10 │
│           Smorgasbord │   201 │      enzyme │   true │       9.16 μs │                  5.87 │
│    Loop univariate 1k │  1000 │ forwarddiff │  false │       20.7 μs │                913.93 │
│    Loop univariate 1k │  1000 │ reversediff │  false │       20.5 μs │                269.34 │
│    Loop univariate 1k │  1000 │    mooncake │  false │       20.6 μs │                  6.97 │
│    Loop univariate 1k │  1000 │      enzyme │  false │       20.1 μs │                  5.89 │
│    Loop univariate 1k │  1000 │ forwarddiff │   true │       21.8 μs │               1244.31 │
│    Loop univariate 1k │  1000 │ reversediff │   true │       21.8 μs │                258.07 │
│    Loop univariate 1k │  1000 │    mooncake │   true │       21.6 μs │                  6.66 │
│    Loop univariate 1k │  1000 │      enzyme │   true │       21.5 μs │                  5.52 │
│       Multivariate 1k │  1000 │ forwarddiff │  false │       23.7 μs │                376.80 │
│       Multivariate 1k │  1000 │ reversediff │  false │       24.2 μs │                 72.33 │
│       Multivariate 1k │  1000 │    mooncake │  false │       27.1 μs │                  8.21 │
│       Multivariate 1k │  1000 │      enzyme │  false │       27.8 μs │                  3.14 │
│       Multivariate 1k │  1000 │ forwarddiff │   true │       21.9 μs │                340.52 │
│       Multivariate 1k │  1000 │ reversediff │   true │       28.8 μs │                 61.44 │
│       Multivariate 1k │  1000 │    mooncake │   true │       28.0 μs │                  8.12 │
│       Multivariate 1k │  1000 │      enzyme │   true │       29.7 μs │                  3.14 │
│   Loop univariate 10k │ 10000 │ forwarddiff │  false │      204.0 μs │              10457.81 │
│   Loop univariate 10k │ 10000 │ reversediff │  false │      207.0 μs │                288.21 │
│   Loop univariate 10k │ 10000 │    mooncake │  false │      203.0 μs │                  7.24 │
│   Loop univariate 10k │ 10000 │      enzyme │  false │      203.0 μs │                  5.79 │
│   Loop univariate 10k │ 10000 │ forwarddiff │   true │      218.0 μs │              10903.04 │
│   Loop univariate 10k │ 10000 │ reversediff │   true │      219.0 μs │                274.70 │
│   Loop univariate 10k │ 10000 │    mooncake │   true │      217.0 μs │                  6.77 │
│   Loop univariate 10k │ 10000 │      enzyme │   true │      216.0 μs │                  5.45 │
│      Multivariate 10k │ 10000 │ forwarddiff │  false │      217.0 μs │               4956.73 │
│      Multivariate 10k │ 10000 │ reversediff │  false │      218.0 μs │                 80.17 │
│      Multivariate 10k │ 10000 │    mooncake │  false │      216.0 μs │                 10.40 │
│      Multivariate 10k │ 10000 │      enzyme │  false │      217.0 μs │                  2.15 │
│      Multivariate 10k │ 10000 │ forwarddiff │   true │      218.0 μs │               4710.61 │
│      Multivariate 10k │ 10000 │ reversediff │   true │      218.0 μs │                 81.30 │
│      Multivariate 10k │ 10000 │    mooncake │   true │      217.0 μs │                 10.42 │
│      Multivariate 10k │ 10000 │      enzyme │   true │      217.0 μs │                  2.12 │
│               Dynamic │    15 │ forwarddiff │  false │           err │                   err │
│               Dynamic │    15 │ reversediff │  false │       1.34 μs │                 45.96 │
│               Dynamic │    15 │    mooncake │  false │       1.35 μs │                 15.00 │
│               Dynamic │    15 │      enzyme │  false │       1.36 μs │                 11.76 │
│               Dynamic │    10 │ forwarddiff │   true │       1.93 μs │                  1.87 │
│               Dynamic │    10 │ reversediff │   true │       1.94 μs │                 58.30 │
│               Dynamic │    10 │    mooncake │   true │       1.99 μs │                 20.83 │
│               Dynamic │    10 │      enzyme │   true │       1.91 μs │                 20.60 │
│              Submodel │     1 │ forwarddiff │  false │       5.89 ns │                 10.17 │
│              Submodel │     1 │ reversediff │  false │        5.9 ns │               1340.52 │
│              Submodel │     1 │    mooncake │  false │       5.89 ns │                 28.89 │
│              Submodel │     1 │      enzyme │  false │       5.89 ns │                  6.24 │
│              Submodel │     1 │ forwarddiff │   true │       5.89 ns │                 10.14 │
│              Submodel │     1 │ reversediff │   true │       7.52 ns │               1117.21 │
│              Submodel │     1 │    mooncake │   true │       5.89 ns │                 28.97 │
│              Submodel │     1 │      enzyme │   true │        5.9 ns │                  6.23 │
│                   LDA │    12 │ forwarddiff │   true │       22.8 μs │                  0.46 │
│                   LDA │    12 │ reversediff │   true │       22.9 μs │                  2.02 │
│                   LDA │    12 │    mooncake │   true │       25.5 μs │                 31.60 │
│                   LDA │    12 │      enzyme │   true │           err │                   err │
└───────────────────────┴───────┴─────────────┴────────┴───────────────┴───────────────────────┘

codecov · 2026-05-04T22:12:43Z

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.26%. Comparing base (10a3651) to head (952b756).

Files with missing lines	Patch %	Lines
src/test_utils/ad.jl	0.00%	6 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1385      +/-   ##
==========================================
- Coverage   82.35%   82.26%   -0.10%     
==========================================
  Files          50       50              
  Lines        3531     3535       +4     
==========================================
  Hits         2908     2908              
- Misses        623      627       +4


Run a full cross-product of the 9 model configs × 4 AD backends × {linked, unlinked} = 72 rows, ordered model → linked → backend so each model's eight rows are adjacent for side-by-side inspection. `:reversediff_compiled` is excluded because compiled tapes are input-dependent and silently produce wrong gradients on parameter-dependent control flow (see CLAUDE.md).

Per-row logging mirrors Mooncake's `bench/run_benchmarks.jl`: an `(i / N, name, (linked = …))` header, the backend on its own line, then `t(logdensity)` / `t(grad)` formatted with units.

`model_dimension` is now defensive (returns `missing` on init failures) and the table formats `missing` dims as `err`, so combos that crash during dimension lookup still produce a well-formed row instead of derailing the run.

Also: add a `setup` stage to `run_ad`'s Chairmarks pipeline that deep-copies `params` per sample, matching Mooncake's harness; setup runs before the timed window, so the copy is excluded from measurements. Widen `combos` to a typed `Tuple{...}[]` so it accepts models with non-default contexts (e.g. the `condition`-wrapped `loop_univariate`/`multivariate` rows).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
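The row ordering and the `err` fallback described in this commit message can be sketched in a few lines. This is an illustrative Python sketch under stated assumptions, not the Julia code in `benchmarks.jl`: the model and backend names are taken from the benchmark table in this PR, `format_dim` is a hypothetical helper, and `None` stands in for Julia's `missing`.

```python
from itertools import product

# Names as they appear in the benchmark table posted on this PR.
MODELS = [
    "Simple assume observe", "Smorgasbord", "Loop univariate 1k",
    "Multivariate 1k", "Loop univariate 10k", "Multivariate 10k",
    "Dynamic", "Submodel", "LDA",
]
BACKENDS = ["forwarddiff", "reversediff", "mooncake", "enzyme"]
LINKED = [False, True]

# product() varies its rightmost argument fastest, which gives the
# model -> linked -> backend ordering: each model's eight rows are adjacent.
combos = list(product(MODELS, LINKED, BACKENDS))


def format_dim(dim):
    """Render a dimension, using 'err' when the lookup failed (None here)."""
    return "err" if dim is None else str(dim)


print(len(combos))        # 9 * 2 * 4 rows
print(combos[0])          # first row of the first model
print(format_dim(None))   # a failed dimension lookup still yields a cell
```

Because a failed lookup produces a printable cell rather than an exception, one broken combination costs a row of `err` values instead of the whole run.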

…apsible full table

LDA's discrete `Categorical` RVs make `linked = false` ill-defined for gradient-based AD, so all four backends previously errored on those rows. Skip them at combination time, leaving 68 rows.

In markdown mode, emit a `### Gist: Smorgasbord` block with just that model's eight rows (Smorgasbord covers the broadest set of DPPL features, so it is the most informative single row band), then put the full 68-row table inside `<details><summary>` so it is collapsed by default in GitHub PR comments. Plain (non-markdown) output is unchanged.

Drop the now-redundant `### Benchmark Results` heading from the workflow body since the script emits its own.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
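A rough Python sketch of the two steps in this commit message (skipping the unlinked LDA rows, then emitting a gist block followed by a collapsed full table) might look like the following. It is an assumption-laden illustration, not the script's actual output code: `is_supported`, `table`, and `render_markdown` are hypothetical names, and the pipe-table layout is a simplification of the box-drawn tables the real report uses.

```python
from itertools import product

MODELS = [
    "Simple assume observe", "Smorgasbord", "Loop univariate 1k",
    "Multivariate 1k", "Loop univariate 10k", "Multivariate 10k",
    "Dynamic", "Submodel", "LDA",
]
BACKENDS = ["forwarddiff", "reversediff", "mooncake", "enzyme"]


def is_supported(model, linked):
    # Per the commit message, unlinked LDA rows are skipped at combination time.
    return not (model == "LDA" and not linked)


combos = [
    (m, l, b)
    for m, l, b in product(MODELS, [False, True], BACKENDS)
    if is_supported(m, l)
]


def table(rows):
    """Render rows as a simple markdown pipe table (timings omitted)."""
    lines = ["| Model | AD Backend | Linked |", "| --- | --- | --- |"]
    lines += [f"| {m} | {b} | {str(l).lower()} |" for m, l, b in rows]
    return "\n".join(lines)


def render_markdown(rows, gist_model="Smorgasbord"):
    """Gist block first, then the full table collapsed inside <details>."""
    gist = [r for r in rows if r[0] == gist_model]
    return "\n".join([
        f"### Gist: {gist_model}",
        "",
        table(gist),
        "",
        "<details>",
        f"<summary>Full table ({len(rows)} rows)</summary>",
        "",
        table(rows),
        "",
        "</details>",
    ])


report = render_markdown(combos)
print(report.splitlines()[0])
print(len(combos))
```

The blank lines around the table inside `<details>` matter on GitHub: without them the markdown inside the HTML block is typically rendered as literal text rather than as a table.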

seabbs · 2026-05-06T09:39:23Z

-            - base branch: `${{ needs.benchmark-base.outputs.sha }}`
+            - this PR's head: `${{ github.event.pull_request.head.sha }}`
+
+            Absolute log-density times and grad/log-density ratios are


Removing the PR-time benchmarking of main vs. the change seems strictly worse and, as far as I am aware, doesn't follow how most benchmarking tools approach this?

It's not clear what motivated this. We have been enjoying AirspeedVelocity.jl: https://astroautomata.com/AirspeedVelocity.jl/stable/

yebai · 2026-05-06

These models are extremely lightweight: the log densities of a few models take only a few nanoseconds. The main value of these benchmarks is for eyeballing obvious regressions (since the models are cheap, any instability or allocation will be caught). On the other hand, it also means that the PR-vs-main comparisons are quite noisy, so the main baseline was removed.

That said, #1386 added the main baseline back, though I don't think it is very useful.

seabbs · 2026-05-06

I'm not sure I understand the motivation for removing it then, to be honest, but it sounds like this has churned back around. In the other PR I noted that the main change is now to remove a clear ratio comparison, which seems strictly worse?

yebai and others added 2 commits May 4, 2026 22:07

github-actions Bot assigned yebai May 4, 2026

yebai and others added 2 commits May 4, 2026 23:14

yebai marked this pull request as ready for review May 4, 2026 23:14

yebai merged commit 2691e7c into main May 4, 2026
18 of 20 checks passed

yebai deleted the improve-benchmarks-1374 branch May 4, 2026 23:14

This was referenced May 5, 2026

feat: report absolute eval time in nanoseconds instead of relative to reference #1383

Closed

Improve DynamicPPL benchmarks #1374

Closed

seabbs reviewed May 6, 2026

yebai merged 4 commits into main from improve-benchmarks-1374

2 participants
