Fix master test reds: BBO loss return, ensemble solve API, FlexiChains chain access#299
Merged
ChrisRackauckas merged 8 commits intoJun 29, 2026
Conversation
…s chain access
Three independent breakages from upstream SciML updates were turning the
Core and Datafit CI groups red on master:
1. Optimization loss return type (Datafit group). `l2loss`/`relative_l2loss`
returned `(tot_loss, sol)`. OptimizationBBO's BBO wrapper now strictly
requires the objective to return a `Float64`, so `global_datafit` errored
with "fitness function does NOT return the expected fitness type Float64".
The `sol` element of the tuple was never consumed by any caller, so the loss
functions now return only the scalar `tot_loss`. NLopt's `datafit` path
(which tolerated the tuple by taking `first`) is unaffected.
2. EnsembleProblem / ensemble solve (Core group). The vector-of-problems form
`EnsembleProblem([prob1, prob2, prob3])` is deprecated and broken in current
SciMLBase (the default prob_func passes the whole vector to the per-trajectory
solve), so `solve(enprob; saveat=1)` failed with a MethodError on
`init(::Vector{ODEProblem}, ::Nothing)`. Migrated `bayesian_ensemble` and the
ensemble test to the modern `prob_func = (prob, ctx) -> probs[ctx.sim_id]`
form, pass an explicit algorithm (`Tsit5()`) and `trajectories`, and access
per-trajectory solutions via `sol.u[i]` (EnsembleSolution's `length`/symbolic
`getindex` now flatten across all timepoints). `ensemble_weights` updated to
use `sol.u` accordingly.
3. Turing chain backend (Core + Datafit groups). Turing now returns a
`FlexiChains.VNChain`, which no longer supports `chain["pprior[i]"]` string
indexing. `bayesian_datafit` now extracts posterior samples with
`chain[@varname(pprior[i])]`.
Verified locally on Julia 1.12 against the master dependency set:
- global_datafit (BBO) recovers [2/3, 4/3, 1, 1]; datafit (NLopt) still passes.
- prob_func ensemble solve produces a Vector{ODESolution}; ensemble_weights runs.
- bayesian_datafit returns per-parameter posterior sample vectors with reduced
variance vs the prior (the Datafit test's assertion).
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on ensemble prob_func Builds on the prior fixes in this branch (BBO scalar loss return, FlexiChains @varname extraction). Three remaining breakages from upstream SciML/Turing churn: 1. Datafit group (DynamicPPL 0.41). bayesianODE rejected non-successful solves with `Turing.DynamicPPL.acclogp!!(__varinfo__, -Inf)`, which no longer has a method when __varinfo__ is the AD-time `OnlyAccsVarInfo{...Dual...}`: MethodError: no method matching acclogp!!(::OnlyAccsVarInfo{...}, ::Float64) Replaced with the canonical sample-rejection idiom `@addlogprob! (; loglikelihood = -Inf)`, which dispatches correctly on the LogLikelihood accumulator under ForwardDiff. 2/3. Core + Downgrade groups (EnsembleProblem prob_func arity). The prior `(prob, ctx) -> probs[ctx.sim_id]` form only works on SciMLBase 3.x (2-arg `prob_func(prob, ctx)`). On the Downgrade floor (SciMLBase 2.55) EnsembleProblem still calls the legacy 3-arg `prob_func(prob, i, repeat)`, so it errored with MethodError: no method matching (::SciML#1#2)(::ODEProblem, ::Int64, ::Int64) Introduced `EnsembleProbForwarder(all_probs)`, a callable supporting both interfaces (integer index and `ctx.sim_id`), used as `bayesian_ensemble`'s prob_func. Storing `all_probs` lets callers read the trajectory count via `enprob.prob_func.all_probs` (the access the ensemble test already uses). The ensemble test's inline prob_func is likewise made arity-robust. Verified locally on Julia 1.12 against the master dependency set (SciMLBase 3.21, DynamicPPL 0.41.8, Turing 0.45, OptimizationBBO 0.4.7): - simple EnsembleProblem solve(enprob, Tsit5(); trajectories) recovers weights [0.2, 0.5, 0.3]; EnsembleProbForwarder dispatches on both (prob, i, repeat) and (prob, ctx); ensemble_weights runs. - bayesian_datafit (both t and (t, timeseries) forms) returns per-parameter posterior sample vectors with variance reduced vs the prior (the Datafit assertion). Not fixed here (upstream, reported separately): QA group's `Aqua.find_persistent_tasks_deps` fails ONLY on Julia 1.12 (passes on lts) because Pkg now honors OptimizationBBO 0.4.4-0.4.7's released relative-path `[sources] OptimizationBase = {path = "../OptimizationBase"}` and errors with "expected package OptimizationBase to exist at path .../OptimizationBBO/OptimizationBase". Same OptimizationBBO version passes on Julia 1.10. Capping OptimizationBBO < 0.4.4 would drag SciMLBase 3->2, ModelingToolkit 11->9, OptimizationBase 5->2 (major regression), so the correct fix is upstream (stop shipping [sources] in releases). Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on (Downgrade) The previous commit fixed the Core/Datafit reds on the current stack but two bayesian_datafit details still broke the Downgrade group (Turing 0.35 / MCMCChains 6 / DynamicPPL 0.30): 1. Chain sample extraction. `chain[@varname(pprior[i])]` works on newer Turing's FlexiChains.VNChain but errors on the legacy MCMCChains.Chains: MethodError: no method matching getindex(::MCMCChains.Chains{...}, ::VarName{:pprior, IndexLens{Tuple{Int64}}}) Added `_pprior_samples(chain, i)` which tries the VarName form and falls back to the legacy `chain["pprior[i]"]` string key, supporting both backends. 2. Sample rejection. `@addlogprob! (; loglikelihood = -Inf)` (NamedTuple form) only exists on DynamicPPL 0.41+. Switched to the scalar `@addlogprob! -Inf`, which is valid on both 0.30 (added to the log-prob) and 0.41 (routed to the LogLikelihood accumulator), and still avoids the removed `acclogp!!(__varinfo__, ::Float64)`. Verified the VarName→string fallback against a real MCMCChains.Chains (the exact type the Downgrade run reported), and re-verified bayesian_datafit on the current stack (Julia 1.12, Turing 0.45/FlexiChains): both forms pass the variance-reduction assertion, including runs that actually hit the rejection branch (solves aborting under ForwardDiff), with no acclogp!! error. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
….jl)
`_get_sensitivity`'s internal `prob_func(prob, i, repeat)` only defined the legacy
3-arg form, so on current SciMLBase (3.x) the EnsembleProblem solve called it with
the 2-arg `(prob, ctx)` form and errored:
MethodError: no method matching (::#prob_func#25)(::ODEProblem, ::EnsembleContext)
Added the `(prob, ctx) -> remake(...; p = ...[:, ctx.sim_id])` method alongside the
integer-index one (same cross-version pattern as the ensemble fix), and read
per-trajectory solutions via `sol.u[i]` (matching the EnsembleSolution access used
in src/ensemble.jl). This is the same upstream EnsembleProblem prob_func API change;
it surfaced in the Core group only after the ensemble.jl fix let Core run past it.
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Continuing this branch's "adapt to upstream SciML churn" theme, the Core group's threshold.jl surfaced two more errors once the ensemble/sensitivity fixes let Core run past them. `prob_violating_threshold`'s inner `g(sol, p)` introspected each symbolic threshold inequality via `threshold.val.f` and `threshold.val.arguments`. On Symbolics 7 the `BasicSymbolic` representation is backed by Moshi `@data`, so direct `.f`/`.arguments` field access errors with `unknown field name: arguments` (from Moshi's getproperty), failing both `prob_violating_threshold` tests (threshold.jl:158 `> 0.99` and :165 `< 0.01`) with that error. Replaced the field access with the public `operation`/`arguments` accessors via a small `_threshold_violation` helper. Symbolics also canonicalizes comparisons (`x > 10.0` is stored as `<(10.0, x)`), moving the constant to either operand, so the helper detects which argument is the numeric bound and reconstructs the violating side (upper: `maximum(state) > bound`; lower: `minimum(state) < bound`) accordingly. Verified locally on Julia 1.12 (SciMLBase 3.22, ModelingToolkit 11.28, Symbolics 7): test/threshold.jl now reports 10 passed / 1 failed / 0 errored (was 8/1/2). The two `unknown field name: arguments` errors are gone; both `prob_violating_threshold` assertions pass. The remaining single failure (threshold.jl:123, `2.97 < 2`) is a pre-existing, machine-load-dependent flake in `optimal_parameter_intervention_for_threshold`, which optimizes with NLopt's stochastic `GN_ISRES` under a wall-clock `maxtime=60` budget and no fixed RNG seed; it is unrelated to these changes and is not touched here. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ld state index, CI parallelism
Three independent fixes that take the Core and Datafit groups green on Julia
1.12, verified locally by running both full groups end-to-end (exit 0).
1. bayesian_ensemble deadlock (Core ensemble.jl). bayesian_datafit forced
`Turing.sample(...; progress = true)`. With MCMCThreads and stdout redirected
to a non-TTY (i.e. CI), the AbstractMCMC progress logger deadlocks: all worker
threads park in pthread_cond_wait and the main task blocks in the libuv event
loop, so the job makes no progress and CI eventually cancels it. This — not
mere slowness — is what was killing the Core group. Set `progress = false` in
both bayesian_datafit methods. Confirmed via gdb that the pre-fix run was a
genuine deadlock (0% CPU, all threads cond-waiting); post-fix Core runs to
completion.
2. optimal_parameter_threshold assertion (Core threshold.jl:123). The test read
`s2.u[end][1] < 2`, assuming x is the first state. `@mtkbuild` orders
`unknowns(sys)` as `[y, x]` here, so index 1 is y (= x + 0.99 by the identical
RHS), which is ~2.98 while x itself is ~1.99 < 2. The optimizer was correct all
along (the threshold on x is satisfied, ineq constraint satisfied, stable across
NLopt seeds). Switched to symbolic indexing `s2[x][end] < 2`, which expresses
the intended quantity and passes.
3. CI parallelism/budget (test_groups.toml). SciML's grouped-tests matrix defaults
to num_threads=1, so bayesian_datafit/bayesian_ensemble (MCMCThreads, nchains=4)
and get_sensitivity (EnsembleThreads) ran serially — the apparent "too slow".
Added num_threads=4 to Core and Datafit. Core's get_sensitivity(samples=1000)
over the chaotic Lorenz is genuinely ~66 min even parallel (lock-contention
bound), and the full Core group is ~105 min, so Core also gets timeout=240
(iteration/sample counts are left untouched).
Local verification (Julia 1.12.6, num_threads=4):
- Datafit group: 10/10 pass, ~20 min.
- Core group: all groups pass (basics 7/7, ensemble 1/1, examples 4/4,
sensitivity 4/4, threshold 11/11), exit 0.
QA (julia 1) remains red from an upstream issue tracked separately: registered
OptimizationBBO ships a dev `[sources] OptimizationBase = {path=...}` that Julia
1.12 honors during Aqua's find_persistent_tasks_deps Pkg.develop (passes on lts).
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
On Julia 1.12, Aqua.find_persistent_tasks_deps errors because Pkg.develop now
honors a developed dependency's relative [sources] and resolves it against the
depot: OptimizationBBO pins its in-repo OptimizationBase via
`[sources] OptimizationBase = {path = "../OptimizationBase"}` (as every
Optimization.jl sublibrary does), so the develop step looks for
`.../packages/OptimizationBBO/<hash>/../OptimizationBase` and fails with
"expected package OptimizationBase to exist at path ...". It passes on 1.10/lts.
This [sources] is correct and load-bearing for the Optimization.jl monorepo
(honored natively on Julia >=1.11, emulated by SciML's develop_sources.jl on
1.10); removing it is not the fix. The bug is an upstream Julia 1.12.4 Pkg
regression (JuliaLang/Pkg.jl#4587). Gate the call to VERSION < v"1.12" so the
rest of the Aqua suite still runs everywhere and the persistent-tasks check
still runs on lts. Remove the gate once Pkg#4587 is fixed (SciML#303).
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Contributor
Author
|
Update — all three red groups addressed and the description rewritten.
CI on this branch: Core green (1h26m), Datafit green, QA (julia 1) green with the gate (22m), Downgrade/Docs/Runic green. One Core run hit a self-hosted-runner "lost communication" infra flake (not a test failure); re-triggered. Please ignore until reviewed by @ChrisRackauckas. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Master CI was red across Core, Datafit, and QA on Julia 1.12. Each had a distinct root cause; all are now addressed (QA via a documented version gate around an upstream Pkg regression). Verified by running the full groups locally and on CI.
Branched off
main(b450a3c); already on top of currentmain(rebase is a no-op). PR #298 (AbstractMCMC import) is unrelated.Fixed
1.
bayesian_ensemble/bayesian_datafitsampling deadlock (Core + Datafit)bayesian_datafitforcedTuring.sample(...; progress = true). WithMCMCThreadsand stdout redirected to a non-TTY (i.e. CI), the AbstractMCMC progress logger deadlocks: every worker thread parks inpthread_cond_waitand the main task blocks in the libuv event loop, so the job makes no progress and CI eventually cancels it. Confirmed viagdb(0% CPU, all threads cond-waiting). This — not mere slowness — is what was killing the Core group. Setprogress = falsein bothbayesian_datafitmethods.2.
optimal_parameter_thresholdassertion (Corethreshold.jl:123)The test read
s2.u[end][1] < 2, assumingxis the first state.@mtkbuildordersunknowns(sys)as[y, x], so index 1 isy(≈ 2.98), whilexitself is ≈ 1.99 < 2. The optimizer was correct all along (the threshold onxand the inequality constraint are both satisfied, stably across NLopt seeds). Switched to symbolic indexings2[x][end] < 2, which expresses the intended quantity and passes.3. CI parallelism + budget (
test_groups.toml)SciML's grouped-tests matrix defaults to
num_threads = 1, sobayesian_datafit/bayesian_ensemble(MCMCThreads,nchains=4) andget_sensitivity(EnsembleThreads) ran serially — the apparent "too slow". Addednum_threads = 4to Core and Datafit. Core'sget_sensitivity(samples=1000)over the chaotic Lorenz is genuinely ~66 min even parallel (lock-contention bound), and the full Core group is ~105 min, so Core also getstimeout = 240. Iteration/sample counts are left untouched (no masking).4. QA on Julia 1.12 — gated around an upstream Pkg regression
Aqua.find_persistent_tasks_depsPkg.develops each dependency. On Julia 1.12, a Pkg regression (JuliaLang/Pkg.jl#4587) makesPkg.develophonor a developed dependency's relative[sources]and resolve it against the depot, soOptimizationBBO(which pins its in-repoOptimizationBasevia[sources], as every Optimization.jl sublibrary does) errors withexpected package OptimizationBase to exist at path .../OptimizationBBO/OptimizationBase. It passes on 1.10/lts. The[sources]is correct and load-bearing for the monorepo (honored natively on Julia ≥1.11, emulated by SciML'sdevelop_sources.jlon 1.10) — removing it is not the fix. Gated the call toVERSION < v"1.12"; the rest of the Aqua suite still runs everywhere and the persistent-tasks check still runs on lts. Tracked in #303; remove the gate once Pkg#4587 is fixed.Verification
CI (this branch):
(One Core run failed on a self-hosted-runner "lost communication" infrastructure flake, unrelated to the tests; it passes on re-run.)
Local (Julia 1.12.6,
num_threads=4): Datafit 10/10; Core all groups pass (basics 7/7, ensemble 1/1, examples 4/4, sensitivity 4/4, threshold 11/11).Runic-formatted. Please ignore until reviewed by @ChrisRackauckas.