Skip to content

Use AbstractPPL AD interface#1363

Open
yebai wants to merge 53 commits into
mainfrom
adproblems-interface
Open

Use AbstractPPL AD interface#1363
yebai wants to merge 53 commits into
mainfrom
adproblems-interface

Conversation

@yebai
Copy link
Copy Markdown
Member

@yebai yebai commented Apr 21, 2026

Migrates DynamicPPL's AD plumbing to the new AbstractPPL.prepare / value_and_gradient!! interface introduced in AbstractPPL 0.15, retiring DynamicPPL's direct dependency on DifferentiationInterface.

The LogDensityFunction constructor now hands AD preparation to AbstractPPL.prepare(adtype, logdensity_internal, x; context=..., check_dims=false) and LogDensityProblems.logdensity_and_gradient simply forwards to AbstractPPL.value_and_gradient!!. The internal _use_closure heuristic that picked between Base.Fix1 closures and DI.Constant arguments per backend is removed — that decision now lives in AbstractPPL. logdensity_at is renamed to logdensity_internal (with a const alias for compatibility), and LogDensityAt becomes a deprecation shim that returns an AbstractPPL.Evaluators.VectorEvaluator.

Notes

  • The primal and gradient times are identical to main except for tiny models, where this branch beats main for logdensity evaluation while gradient remains comparable.

Replacing #1354. Tests currently fail due to #1364

Tiny primal time drop in `Simple assume observe`


The apparent AD regression in the tiny linked row is mostly caused by a much faster
primal denominator, not by slower raw gradients.

Reported row:

main:    primal = 23.6 ns, Mooncake ratio = 7.16   -> raw grad ~= 169 ns
branch:  primal = 4.63 ns, Mooncake ratio = 33.43 -> raw grad ~= 155 ns

So Mooncake is slightly faster in raw time for this row. The ratio looks worse because
t(logdensity) became tiny.

Root cause

On Julia 1.11, old main's optimised primal still contains a call boundary:

invoke DynamicPPL.tilde_assume!!(...)

Current branch inlines through the same path, so the one-dimensional linked model reduces
to loading params[1] and evaluating two scalar Normal logpdfs.

Local Julia 1.11.9 checks:

old main e2d4b5da:
  primal ~= 9.4 ns

old main + monkey-patched @inline tilde_assume!!:
  primal ~= 3.2 ns

current branch:
  primal ~= 2.5 ns

Adding @inline only to LogDensityProblems.logdensity did not materially change old
main. The important boundary is the assume/init path.

- remove DifferentiationInterface from Project.toml deps and compat
- replace DI gradient preparation/evaluation with AbstractPPL.prepare and AbstractPPL.value_and_gradient
- reuse LogDensityAt and closures without local AbstractPPL.prepare piracy
- guard the Mooncake precompile workload until AbstractPPLMooncakeExt is loaded
- pin AbstractPPL to the evaluator-interface branch in [sources] for this environment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
# Conflicts:
#	Project.toml
#	src/logdensityfunction.jl
@yebai yebai force-pushed the adproblems-interface branch from 2340748 to e1cac2f Compare April 21, 2026 10:51
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 21, 2026

Codecov Report

❌ Patch coverage is 73.68421% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 81.55%. Comparing base (d2052a1) to head (3bdeb74).

Files with missing lines Patch % Lines
src/logdensityfunction.jl 54.54% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1363      +/-   ##
==========================================
- Coverage   82.30%   81.55%   -0.75%     
==========================================
  Files          50       50              
  Lines        3543     3535       -8     
==========================================
- Hits         2916     2883      -33     
- Misses        627      652      +25     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

yebai and others added 2 commits April 23, 2026 19:28
Fix #1364 

TODO 
- [x] use `main` as base, as otherwise benchmarking would fail, need to
change back to #1364 before merging.
- ~~Julia 1.10’s Pkg does not reliably apply the root [sources] override
during build/test, so min CI fails.~~

This change switches `LogDensityFunction` AD preparation to pass a
structural `LogDensityAt` problem directly into
`AbstractPPL.prepare(...)` instead of wrapping it in backend-specific
anonymous closures. It also removes the old `_use_closure` machinery,
since the structural problem object now provides the stable one-argument
evaluator shape that AD backends need.

The accompanying test coverage adds a focused `ReverseDiff` check to
confirm that `AutoReverseDiff(; compile=true)` still retains a compiled
tape and can be reused across repeated `logdensity_and_gradient` calls.

Benchmarks 


```
  +----------------------+------------+--------------+--------+--------------+
  | Backend              | PR grad μs | main grad μs |  Ratio | Summary      |
  +----------------------+------------+--------------+--------+--------------+
  | ForwardDiff          |      0.322 |        0.322 |  1.001x | 0.1% slower  |
  | Forward Mooncake     |      0.900 |        2.601 |  0.346x | 2.89x faster |
  | Forward Enzyme       |      0.789 |        1.051 |  0.751x | 24.9% faster |
  | ReverseDiff          |     14.239 |       15.225 |  0.935x | 6.5% faster  |
  | ReverseDiff compiled |      5.355 |        5.872 |  0.912x | 8.8% faster  |
  | Reverse Mooncake     |      0.892 |        0.928 |  0.961x | 3.9% faster  |
  | Reverse Enzyme       |      1.253 |        1.268 |  0.988x | 1.2% faster  |
  +----------------------+------------+--------------+--------+--------------+
```

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

DynamicPPL.jl documentation for PR #1363 is available at:
https://TuringLang.github.io/DynamicPPL.jl/previews/PR1363/

yebai and others added 15 commits April 26, 2026 21:55
Update LogDensityFunction gradient evaluation to match the AbstractPPL evaluator interface and make the floattypes environment resolve that source branch explicitly.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Drops the base-vs-head comparison entirely. The benchmark workflow now
runs once on the PR head, on a pinned `ubuntu-22.04` runner, and reports
absolute log-density times plus gradient/log-density ratios in the
posted comment. Output schema follows Mooncake's bench harness; readers
compare against recent main-branch comments to spot regressions.

Noise reduction in `run_ad`: per-sample incremental GC teardown and a
full GC before each measurement keep accumulated garbage from triggering
mid-sample collections that inflate individual samples. Adds a
`benchmark_seconds` knob for tightening the median estimate. Also
removes the synthetic reference timing that normalised eval times
against a non-DPPL function.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Inline `Models.jl` and `DynamicPPLBenchmarks.jl` into `benchmarks.jl` and
convert `benchmarks/Project.toml` from a package to a flat environment,
mirroring Mooncake.jl's `bench/run_benchmarks.jl` layout.

Also: take dim from `length(r.params)` (run_ad already constructed the LDF)
so models are no longer evaluated twice on the success path; switch results
to NamedTuples so `print_results` reads `r.name`/`r.dim`/...; extract
`transform_strategy(islinked)` helper; drop unused imports.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	benchmarks/Project.toml
#	benchmarks/src/DynamicPPLBenchmarks.jl
Run a full cross-product of the 9 model configs × 4 AD backends ×
{linked, unlinked} = 72 rows, ordered model → linked → backend so each
model's eight rows are adjacent for side-by-side inspection.
`:reversediff_compiled` is excluded because compiled tapes are
input-dependent and silently produce wrong gradients on
parameter-dependent control flow (see CLAUDE.md).

Per-row logging mirrors Mooncake's `bench/run_benchmarks.jl`: an
`(i / N, name, (linked = …))` header, the backend on its own line,
then `t(logdensity)` / `t(grad)` formatted with units. `model_dimension`
is now defensive (returns `missing` on init failures) and the table
formats `missing` dims as `err`, so combos that crash during dimension
lookup still produce a well-formed row instead of derailing the run.

Also: add a `setup` stage to `run_ad`'s Chairmarks pipeline that
deep-copies `params` per sample, matching Mooncake's harness — setup
runs before the timed window, so the copy is excluded from
measurements. Widen `combos` to a typed `Tuple{...}[]` so it accepts
models with non-default contexts (e.g. the `condition`-wrapped
`loop_univariate`/`multivariate` rows).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…apsible full table

LDA's discrete `Categorical` RVs make `linked = false` ill-defined for
gradient-based AD, so all four backends previously errored on those
rows. Skip them at combination time, leaving 68 rows.

In markdown mode, emit a `### Gist: Smorgasbord` block with just that
model's eight rows (Smorgasbord covers the broadest set of DPPL
features, so it is the most informative single row band), then put the
full 68-row table inside `<details><summary>` so it is collapsed by
default in GitHub PR comments. Plain (non-markdown) output is
unchanged. Drop the now-redundant `### Benchmark Results` heading from
the workflow body since the script emits its own.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Required as a direct dep so the benchmarks project resolves cleanly
without a manual `Pkg.resolve()` after `Pkg.instantiate()`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yebai yebai force-pushed the adproblems-interface branch from 976a8eb to 49b1837 Compare May 4, 2026 22:52
yebai and others added 5 commits May 5, 2026 10:53
The Benchmarking comment now reads top-down as: SHA in the heading,
Smorgasbord gist as a level-3 section, explanatory paragraph, full
table as a level-3 section, and computer info as a foldable <details>
with an inline <b>-styled summary (avoids markdown inside <details>,
which renders inconsistently). The two integration jobs added to
CI.yml were still on setup-julia@v2; bumped to @V3 for consistency
with the main job.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Splits Benchmarking.yml into three jobs (benchmark-pr, benchmark-main,
post-comment). The PR-comment now shows PR-head numbers up top and
main's numbers in a foldout below, with a column legend and labelled
SHAs. If benchmark-main fails (e.g. transitionally before this branch
lands on main, since main's bench script does not yet support markdown
mode), post-comment still posts PR-head numbers and notes the main job
result inline.

Workflow hardening: concurrency group cancels superseded PR runs,
60min timeout per bench job, explicit pull-requests: write permission
so fork PRs can post comments.

Body assembly moved out of the YAML literal block scalar into a shell
heredoc using `body-path:` (peter-evans pattern), avoiding env-var
multi-line interpolation. Bench script no longer prints the obsolete
"compare against main-branch PR run" paragraph -- main is now in the
same comment.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	.github/workflows/Benchmarking.yml
#	benchmarks/Project.toml
#	benchmarks/benchmarks.jl
#	src/test_utils/ad.jl
The 68-row table dominates the PR comment vertically. Wrap it in a
<details><summary> block so it is collapsed by default, leaving the
Smorgasbord gist as the at-a-glance view.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Long-form table (one row per (model, linked, backend)) becomes a
pivoted table where each (Model, Dim, Linked) row spans all four AD
backends as columns. 68 rows collapse to 17, the gist/full-table split
goes away (single unified table), and the gist constant + foldout
wrapper in benchmarks/benchmarks.jl are dropped.

`t(logdensity)` does not depend on the AD backend, so the four primal
samples per group are noise around a common value — take the minimum
as the most stable estimate. Per-backend `err` cells render
independently of the primal column.

Workflow comment restructure to match: SHA goes inline in the H2
title, "PR head"/"Main" preamble lines drop, main-branch results live
under their own H3 with a "Click to see" foldout (and a fallback H3
note if benchmark-main failed), Computer Information is its own H3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
yebai and others added 7 commits May 5, 2026 23:07
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Registered Bijectors 0.15.17+ caps AbstractPPL below 0.15, so
the main package and the integration/floattypes/docs envs need
the replace-di-with-abstractppl branch to resolve alongside
the AbstractPPL evaluator-interface branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The [sources] override for Bijectors requires Bijectors to be
listed in [deps]; Pkg rejects sources for packages not present
in deps or extras.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 13, 2026

Benchmarks @ 3bdeb74

Performance

Performance Ratio:
Ratio of time to compute gradient and time to compute log-density.
Warning: results are very approximate! See benchmark notes for more context.

===================================================================================================
                                               eval                       gradient                 
                                            ----------  -------------------------------------------
Model                        dim    linked      primal     FwdDiff    RvsDiff    Mooncake    Enzyme
---------------------------------------------------------------------------------------------------
Simple assume observe*         1     false     4.63 ns       12.97    1454.51       32.42      8.34
Simple assume observe*         1      true     4.63 ns       13.33    1653.88       32.14      8.02
Smorgasbord                  201     false     5.95 μs       71.03     141.44        6.92      9.60
Smorgasbord                  201      true      7.6 μs       75.45     142.00        6.30      6.97
Loop univariate 1k          1000     false     17.8 μs      894.81     301.80        8.11      6.40
Loop univariate 1k          1000      true     19.0 μs     1339.89     277.68        7.50      6.19
Multivariate 1k             1000     false     22.9 μs      388.41      75.70        9.62      2.71
Multivariate 1k             1000      true     22.0 μs      319.24      62.18       10.15      2.55
Loop univariate 10k        10000     false    174.0 μs    11638.11     322.86        8.37      6.62
Loop univariate 10k        10000      true    186.0 μs    11352.05     300.29        8.39      6.20
Multivariate 10k           10000     false    196.0 μs     6017.45      89.18       11.53      2.14
Multivariate 10k           10000      true    198.0 μs     5115.96      89.15       11.40      2.20
Dynamic                       15     false     1.36 μs         err      45.82       14.79     12.20
Dynamic                       10      true      1.9 μs        2.05      59.25       13.31     22.38
Submodel*                      1     false     4.63 ns       12.95    1703.98       32.45      8.47
Submodel*                      1      true     4.64 ns       12.98    1835.70       34.38      8.11
LDA                           12      true     23.6 μs        0.59       1.98       31.29       err
===================================================================================================

Rows marked * have t(logdensity) below about 100 ns; their ratios can be dominated by timer floor, fixed overhead, and run-to-run variation. For those rows, raw t(grad) is more meaningful than t(grad)/t(logdensity).

Main @ d2052a1
===================================================================================================
                                               eval                       gradient                 
                                            ----------  -------------------------------------------
Model                        dim    linked      primal     FwdDiff    RvsDiff    Mooncake    Enzyme
---------------------------------------------------------------------------------------------------
Simple assume observe*         1     false     5.87 ns        9.77    1212.24       30.22      6.79
Simple assume observe*         1      true     24.2 ns        2.48     322.39        7.01      1.66
Smorgasbord                  201     false     6.49 μs       66.37     127.50        6.25      8.74
Smorgasbord                  201      true     8.82 μs       63.36     122.08        5.23      5.96
Loop univariate 1k          1000     false     19.5 μs      922.57     277.98        7.17      5.79
Loop univariate 1k          1000      true     20.9 μs     1332.31     263.78        6.89      5.58
Multivariate 1k             1000     false     22.8 μs      366.92      75.48        9.48      2.68
Multivariate 1k             1000      true     25.7 μs      257.11      60.87        9.68      2.60
Loop univariate 10k        10000     false    192.0 μs    11023.94     303.51        7.37      5.90
Loop univariate 10k        10000      true    205.0 μs    11011.95     287.47        7.08      5.68
Multivariate 10k           10000     false    201.0 μs     5186.72      87.03       11.32      2.15
Multivariate 10k           10000      true    200.0 μs     4982.16      88.44       11.41      2.17
Dynamic                       15     false     1.43 μs         err      40.33       15.54     10.41
Dynamic                       10      true     1.96 μs        2.08      58.58       21.14     19.07
Submodel*                      1     false     5.87 ns       10.26    1325.65       30.05      6.70
Submodel*                      1      true     5.87 ns       10.26    1464.17       28.92      6.81
LDA                           12      true     22.7 μs        0.49       1.92       32.48       err
===================================================================================================
Environment
Julia Version 1.11.9
Commit 53a02c0720c (2026-02-06 00:27 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 4 × AMD EPYC 7763 64-Core Processor
  WORD_SIZE: 64
  LLVM: libLLVM-16.0.6 (ORCJIT, znver3)
Threads: 1 default, 0 interactive, 1 GC (on 4 virtual cores)

@TuringLang TuringLang deleted a comment from github-actions Bot May 13, 2026
yebai and others added 6 commits May 13, 2026 15:25
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
AbstractPPL's `prepare(::AbstractADType, ...)` method lives in its
DifferentiationInterface extension, so test envs that exercise the AD
path with ForwardDiff/Enzyme need DI present and imported.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Some backends (notably Enzyme) return gradients as non-Vector types
(e.g. Enzyme.TupleArray). ADResult expects Vector, so collect() before
storing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reading from `problem`'s fields keeps the AD prep target in sync if
`LogDensityAt`'s shape ever changes, instead of duplicating the 5-tuple.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wsmoses

This comment was marked as spam.

Pass the LogDensityFunction state as a constant `context` tuple to
`AbstractPPL.prepare(adtype, logdensity_internal, x; ...)` instead of
building a `LogDensityAt` problem object and routing through the
Mooncake-only `raw_gradient_target` keyword. This restores a single
prep entry point that both the DI and Mooncake AbstractPPL extensions
can handle.

Rename `logdensity_at` to `logdensity_internal` to reflect that it is
the implementation behind `LogDensityProblems.logdensity(ldf, params)`
rather than a user-facing wrapper. Keep `logdensity_at` as a const
alias so existing references still resolve.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yebai yebai mentioned this pull request May 18, 2026
yebai and others added 8 commits May 19, 2026 12:32
Switch the `evaluator-interface` revision overrides in every Project.toml
to `main` now that the evaluator API has landed there. Also drop redundant
parens in a `@model` Type default in benchmarks/benchmarks.jl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the LogDensityAt struct with a deprecation shim that emits
`Base.depwarn` and returns an `AbstractPPL.Evaluators.VectorEvaluator`
wrapping a closure over `logdensity_internal`. The new path is the
sanctioned one now that AD prep flows through `AbstractPPL.prepare`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replace the closure-over-state form with the new `context::Tuple` kwarg
on `AbstractPPL.prepare`, which threads constants through to the problem
function. Equivalent semantics, no closure construction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pass `check_dims=false` to match the pre-deprecation `LogDensityAt`'s
unchecked call behavior; callers that relied on it shouldn't suddenly
hit a `DimensionMismatch`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Move test/integration/{enzyme,reversediff}/ to test/ext/{DynamicPPLEnzymeCoreExt,DynamicPPLReverseDiffExt}/ and convert the MarginalLogDensities split (subdir env + sibling script) to a single test/ext/DynamicPPLMarginalLogDensitiesExt/{Project.toml,main.jl}.

Each integration env now follows the same layout. CI.yml updated accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Remove the `AbstractPPL = {url = ..., rev = "main"}` overrides from all
eight Project.toml files. Compat is already at 0.15, so the resolver
picks up the registered release.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
pull Bot pushed a commit to Stars1233/DynamicPPL.jl that referenced this pull request May 19, 2026
Fresh-buffer benchmark setup
- In `DynamicPPL.TestUtils.AD.run_ad`, use `setup = deepcopy($params)`
for both the primal and gradient `@be ` calls so each Chairmarks sample
starts from a fresh input buffer instead of reusing the same vector
across samples. Matches Mooncake's bench harness; setup runs outside the
timed window, so the copy isn't measured.

Plus some minor tweaks. 


~~This appears to have a noticeable impact on log-density evaluation for
tiny models, first noticed in TuringLang#1363~~: e.g., `simple assume
observe(linked = true)` logdensity improves from ~20 ns to ~4 ns. In
practise, these tiny nanosecond overheads won't matter at all, so the
improvement here is only for perfectionists.

EDIT: likely `LogDensityAt` -> `logdensity_internal` reduced
logdensity's overhead by 20 ns in TuringLang#1363

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@yebai yebai marked this pull request as ready for review May 21, 2026 19:54
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants