Fix master test reds: BBO loss return, ensemble solve API, FlexiChains chain access by ChrisRackauckas-Claude · Pull Request #299 · SciML/EasyModelAnalysis.jl

ChrisRackauckas-Claude · 2026-06-20T10:28:50Z

Master CI was red across Core, Datafit, and QA on Julia 1.12. Each had a distinct root cause; all are now addressed (QA via a documented version gate around an upstream Pkg regression). Verified by running the full groups locally and on CI.

Branched off main (b450a3c); already on top of current main (rebase is a no-op). PR #298 (AbstractMCMC import) is unrelated.

Fixed

1. `bayesian_ensemble`/`bayesian_datafit` sampling deadlock (Core + Datafit)

bayesian_datafit forced Turing.sample(...; progress = true). With MCMCThreads and stdout redirected to a non-TTY (i.e. CI), the AbstractMCMC progress logger deadlocks: every worker thread parks in pthread_cond_wait and the main task blocks in the libuv event loop, so the job makes no progress and CI eventually cancels it. Confirmed via gdb (0% CPU, all threads cond-waiting). This — not mere slowness — is what was killing the Core group. Set progress = false in both bayesian_datafit methods.

2. `optimal_parameter_threshold` assertion (Core `threshold.jl:123`)

The test read s2.u[end][1] < 2, assuming x is the first state. @mtkbuild orders unknowns(sys) as [y, x], so index 1 is y (≈ 2.98), while x itself is ≈ 1.99 < 2. The optimizer was correct all along (the threshold on x and the inequality constraint are both satisfied, stably across NLopt seeds). Switched to symbolic indexing s2[x][end] < 2, which expresses the intended quantity and passes.

3. CI parallelism + budget (`test_groups.toml`)

SciML's grouped-tests matrix defaults to num_threads = 1, so bayesian_datafit/bayesian_ensemble (MCMCThreads, nchains=4) and get_sensitivity (EnsembleThreads) ran serially — the apparent "too slow". Added num_threads = 4 to Core and Datafit. Core's get_sensitivity(samples=1000) over the chaotic Lorenz is genuinely ~66 min even parallel (lock-contention bound), and the full Core group is ~105 min, so Core also gets timeout = 240. Iteration/sample counts are left untouched (no masking).

4. QA on Julia 1.12 — gated around an upstream Pkg regression

Aqua.find_persistent_tasks_deps Pkg.develops each dependency. On Julia 1.12, a Pkg regression (JuliaLang/Pkg.jl#4587) makes Pkg.develop honor a developed dependency's relative [sources] and resolve it against the depot, so OptimizationBBO (which pins its in-repo OptimizationBase via [sources], as every Optimization.jl sublibrary does) errors with expected package OptimizationBase to exist at path .../OptimizationBBO/OptimizationBase. It passes on 1.10/lts. The [sources] is correct and load-bearing for the monorepo (honored natively on Julia ≥1.11, emulated by SciML's develop_sources.jl on 1.10) — removing it is not the fix. Gated the call to VERSION < v"1.12"; the rest of the Aqua suite still runs everywhere and the persistent-tasks check still runs on lts. Tracked in #303; remove the gate once Pkg#4587 is fixed.

Verification

CI (this branch):

Core: green (1h26m).
Datafit: green (~38–45m).
QA (julia 1): green with the gate (~22m).
Downgrade, Documentation, Runic, Spell Check: green.

(One Core run failed on a self-hosted-runner "lost communication" infrastructure flake, unrelated to the tests; it passes on re-run.)

Local (Julia 1.12.6, num_threads=4): Datafit 10/10; Core all groups pass (basics 7/7, ensemble 1/1, examples 4/4, sensitivity 4/4, threshold 11/11).

Runic-formatted. Please ignore until reviewed by @ChrisRackauckas.

@varname

…s chain access Three independent breakages from upstream SciML updates were turning the Core and Datafit CI groups red on master: 1. Optimization loss return type (Datafit group). `l2loss`/`relative_l2loss` returned `(tot_loss, sol)`. OptimizationBBO's BBO wrapper now strictly requires the objective to return a `Float64`, so `global_datafit` errored with "fitness function does NOT return the expected fitness type Float64". The `sol` element of the tuple was never consumed by any caller, so the loss functions now return only the scalar `tot_loss`. NLopt's `datafit` path (which tolerated the tuple by taking `first`) is unaffected. 2. EnsembleProblem / ensemble solve (Core group). The vector-of-problems form `EnsembleProblem([prob1, prob2, prob3])` is deprecated and broken in current SciMLBase (the default prob_func passes the whole vector to the per-trajectory solve), so `solve(enprob; saveat=1)` failed with a MethodError on `init(::Vector{ODEProblem}, ::Nothing)`. Migrated `bayesian_ensemble` and the ensemble test to the modern `prob_func = (prob, ctx) -> probs[ctx.sim_id]` form, pass an explicit algorithm (`Tsit5()`) and `trajectories`, and access per-trajectory solutions via `sol.u[i]` (EnsembleSolution's `length`/symbolic `getindex` now flatten across all timepoints). `ensemble_weights` updated to use `sol.u` accordingly. 3. Turing chain backend (Core + Datafit groups). Turing now returns a `FlexiChains.VNChain`, which no longer supports `chain["pprior[i]"]` string indexing. `bayesian_datafit` now extracts posterior samples with `chain[@varname(pprior[i])]`. Verified locally on Julia 1.12 against the master dependency set: - global_datafit (BBO) recovers [2/3, 4/3, 1, 1]; datafit (NLopt) still passes. - prob_func ensemble solve produces a Vector{ODESolution}; ensemble_weights runs. - bayesian_datafit returns per-parameter posterior sample vectors with reduced variance vs the prior (the Datafit test's assertion). Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@varname

…on ensemble prob_func Builds on the prior fixes in this branch (BBO scalar loss return, FlexiChains @varname extraction). Three remaining breakages from upstream SciML/Turing churn: 1. Datafit group (DynamicPPL 0.41). bayesianODE rejected non-successful solves with `Turing.DynamicPPL.acclogp!!(__varinfo__, -Inf)`, which no longer has a method when __varinfo__ is the AD-time `OnlyAccsVarInfo{...Dual...}`: MethodError: no method matching acclogp!!(::OnlyAccsVarInfo{...}, ::Float64) Replaced with the canonical sample-rejection idiom `@addlogprob! (; loglikelihood = -Inf)`, which dispatches correctly on the LogLikelihood accumulator under ForwardDiff. 2/3. Core + Downgrade groups (EnsembleProblem prob_func arity). The prior `(prob, ctx) -> probs[ctx.sim_id]` form only works on SciMLBase 3.x (2-arg `prob_func(prob, ctx)`). On the Downgrade floor (SciMLBase 2.55) EnsembleProblem still calls the legacy 3-arg `prob_func(prob, i, repeat)`, so it errored with MethodError: no method matching (::SciML#1#2)(::ODEProblem, ::Int64, ::Int64) Introduced `EnsembleProbForwarder(all_probs)`, a callable supporting both interfaces (integer index and `ctx.sim_id`), used as `bayesian_ensemble`'s prob_func. Storing `all_probs` lets callers read the trajectory count via `enprob.prob_func.all_probs` (the access the ensemble test already uses). The ensemble test's inline prob_func is likewise made arity-robust. Verified locally on Julia 1.12 against the master dependency set (SciMLBase 3.21, DynamicPPL 0.41.8, Turing 0.45, OptimizationBBO 0.4.7): - simple EnsembleProblem solve(enprob, Tsit5(); trajectories) recovers weights [0.2, 0.5, 0.3]; EnsembleProbForwarder dispatches on both (prob, i, repeat) and (prob, ctx); ensemble_weights runs. - bayesian_datafit (both t and (t, timeseries) forms) returns per-parameter posterior sample vectors with variance reduced vs the prior (the Datafit assertion). Not fixed here (upstream, reported separately): QA group's `Aqua.find_persistent_tasks_deps` fails ONLY on Julia 1.12 (passes on lts) because Pkg now honors OptimizationBBO 0.4.4-0.4.7's released relative-path `[sources] OptimizationBase = {path = "../OptimizationBase"}` and errors with "expected package OptimizationBase to exist at path .../OptimizationBBO/OptimizationBase". Same OptimizationBBO version passes on Julia 1.10. Capping OptimizationBBO < 0.4.4 would drag SciMLBase 3->2, ModelingToolkit 11->9, OptimizationBase 5->2 (major regression), so the correct fix is upstream (stop shipping [sources] in releases). Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@varname

…on (Downgrade) The previous commit fixed the Core/Datafit reds on the current stack but two bayesian_datafit details still broke the Downgrade group (Turing 0.35 / MCMCChains 6 / DynamicPPL 0.30): 1. Chain sample extraction. `chain[@varname(pprior[i])]` works on newer Turing's FlexiChains.VNChain but errors on the legacy MCMCChains.Chains: MethodError: no method matching getindex(::MCMCChains.Chains{...}, ::VarName{:pprior, IndexLens{Tuple{Int64}}}) Added `_pprior_samples(chain, i)` which tries the VarName form and falls back to the legacy `chain["pprior[i]"]` string key, supporting both backends. 2. Sample rejection. `@addlogprob! (; loglikelihood = -Inf)` (NamedTuple form) only exists on DynamicPPL 0.41+. Switched to the scalar `@addlogprob! -Inf`, which is valid on both 0.30 (added to the log-prob) and 0.41 (routed to the LogLikelihood accumulator), and still avoids the removed `acclogp!!(__varinfo__, ::Float64)`. Verified the VarName→string fallback against a real MCMCChains.Chains (the exact type the Downgrade run reported), and re-verified bayesian_datafit on the current stack (Julia 1.12, Turing 0.45/FlexiChains): both forms pass the variance-reduction assertion, including runs that actually hit the rejection branch (solves aborting under ForwardDiff), with no acclogp!! error. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

….jl) `_get_sensitivity`'s internal `prob_func(prob, i, repeat)` only defined the legacy 3-arg form, so on current SciMLBase (3.x) the EnsembleProblem solve called it with the 2-arg `(prob, ctx)` form and errored: MethodError: no method matching (::#prob_func#25)(::ODEProblem, ::EnsembleContext) Added the `(prob, ctx) -> remake(...; p = ...[:, ctx.sim_id])` method alongside the integer-index one (same cross-version pattern as the ensemble fix), and read per-trajectory solutions via `sol.u[i]` (matching the EnsembleSolution access used in src/ensemble.jl). This is the same upstream EnsembleProblem prob_func API change; it surfaced in the Core group only after the ensemble.jl fix let Core run past it. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Continuing this branch's "adapt to upstream SciML churn" theme, the Core group's threshold.jl surfaced two more errors once the ensemble/sensitivity fixes let Core run past them. `prob_violating_threshold`'s inner `g(sol, p)` introspected each symbolic threshold inequality via `threshold.val.f` and `threshold.val.arguments`. On Symbolics 7 the `BasicSymbolic` representation is backed by Moshi `@data`, so direct `.f`/`.arguments` field access errors with `unknown field name: arguments` (from Moshi's getproperty), failing both `prob_violating_threshold` tests (threshold.jl:158 `> 0.99` and :165 `< 0.01`) with that error. Replaced the field access with the public `operation`/`arguments` accessors via a small `_threshold_violation` helper. Symbolics also canonicalizes comparisons (`x > 10.0` is stored as `<(10.0, x)`), moving the constant to either operand, so the helper detects which argument is the numeric bound and reconstructs the violating side (upper: `maximum(state) > bound`; lower: `minimum(state) < bound`) accordingly. Verified locally on Julia 1.12 (SciMLBase 3.22, ModelingToolkit 11.28, Symbolics 7): test/threshold.jl now reports 10 passed / 1 failed / 0 errored (was 8/1/2). The two `unknown field name: arguments` errors are gone; both `prob_violating_threshold` assertions pass. The remaining single failure (threshold.jl:123, `2.97 < 2`) is a pre-existing, machine-load-dependent flake in `optimal_parameter_intervention_for_threshold`, which optimizes with NLopt's stochastic `GN_ISRES` under a wall-clock `maxtime=60` budget and no fixed RNG seed; it is unrelated to these changes and is not touched here. Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ld state index, CI parallelism Three independent fixes that take the Core and Datafit groups green on Julia 1.12, verified locally by running both full groups end-to-end (exit 0). 1. bayesian_ensemble deadlock (Core ensemble.jl). bayesian_datafit forced `Turing.sample(...; progress = true)`. With MCMCThreads and stdout redirected to a non-TTY (i.e. CI), the AbstractMCMC progress logger deadlocks: all worker threads park in pthread_cond_wait and the main task blocks in the libuv event loop, so the job makes no progress and CI eventually cancels it. This — not mere slowness — is what was killing the Core group. Set `progress = false` in both bayesian_datafit methods. Confirmed via gdb that the pre-fix run was a genuine deadlock (0% CPU, all threads cond-waiting); post-fix Core runs to completion. 2. optimal_parameter_threshold assertion (Core threshold.jl:123). The test read `s2.u[end][1] < 2`, assuming x is the first state. `@mtkbuild` orders `unknowns(sys)` as `[y, x]` here, so index 1 is y (= x + 0.99 by the identical RHS), which is ~2.98 while x itself is ~1.99 < 2. The optimizer was correct all along (the threshold on x is satisfied, ineq constraint satisfied, stable across NLopt seeds). Switched to symbolic indexing `s2[x][end] < 2`, which expresses the intended quantity and passes. 3. CI parallelism/budget (test_groups.toml). SciML's grouped-tests matrix defaults to num_threads=1, so bayesian_datafit/bayesian_ensemble (MCMCThreads, nchains=4) and get_sensitivity (EnsembleThreads) ran serially — the apparent "too slow". Added num_threads=4 to Core and Datafit. Core's get_sensitivity(samples=1000) over the chaotic Lorenz is genuinely ~66 min even parallel (lock-contention bound), and the full Core group is ~105 min, so Core also gets timeout=240 (iteration/sample counts are left untouched). Local verification (Julia 1.12.6, num_threads=4): - Datafit group: 10/10 pass, ~20 min. - Core group: all groups pass (basics 7/7, ensemble 1/1, examples 4/4, sensitivity 4/4, threshold 11/11), exit 0. QA (julia 1) remains red from an upstream issue tracked separately: registered OptimizationBBO ships a dev `[sources] OptimizationBase = {path=...}` that Julia 1.12 honors during Aqua's find_persistent_tasks_deps Pkg.develop (passes on lts). Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>

On Julia 1.12, Aqua.find_persistent_tasks_deps errors because Pkg.develop now honors a developed dependency's relative [sources] and resolves it against the depot: OptimizationBBO pins its in-repo OptimizationBase via `[sources] OptimizationBase = {path = "../OptimizationBase"}` (as every Optimization.jl sublibrary does), so the develop step looks for `.../packages/OptimizationBBO/<hash>/../OptimizationBase` and fails with "expected package OptimizationBase to exist at path ...". It passes on 1.10/lts. This [sources] is correct and load-bearing for the Optimization.jl monorepo (honored natively on Julia >=1.11, emulated by SciML's develop_sources.jl on 1.10); removing it is not the fix. The bug is an upstream Julia 1.12.4 Pkg regression (JuliaLang/Pkg.jl#4587). Gate the call to VERSION < v"1.12" so the rest of the Aqua suite still runs everywhere and the persistent-tasks check still runs on lts. Remove the gate once Pkg#4587 is fixed (SciML#303). Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>

ChrisRackauckas-Claude · 2026-06-29T16:15:00Z

Update — all three red groups addressed and the description rewritten.

Core/Datafit "too slow" was really a deadlock: bayesian_datafit forced progress=true, which deadlocks AbstractMCMC under MCMCThreads + non-TTY output (confirmed via gdb: all threads in pthread_cond_wait). Set progress=false. Also added num_threads=4 (the matrix defaults to 1, so the chains ran serially) and timeout=240 for Core (its get_sensitivity(samples=1000) over the chaotic Lorenz is ~66 min even parallel). No iteration/sample counts were cut.
threshold.jl:123 was a test bug, not an optimizer regression: s2.u[end][1] read state y (≈2.98) because @mtkbuild orders unknowns as [y, x]; the threshold is on x (≈1.99 < 2). Fixed to symbolic s2[x][end].
QA on Julia 1.12 is the upstream Pkg regression Breaking change: cannot develop app with relative paths in Julia 1.12.4 JuliaLang/Pkg.jl#4587 (Pkg.develop honoring a dev'd dependency's relative [sources]), triggered by OptimizationBBO's monorepo [sources]. Removing [sources] is not viable (load-bearing for the Optimization.jl monorepo). Gated find_persistent_tasks_deps to VERSION < v"1.12"; tracked in Re-enable Aqua.find_persistent_tasks_deps on Julia 1.12 once Pkg.jl#4587 is fixed #303 to revert once Pkg is fixed.

CI on this branch: Core green (1h26m), Datafit green, QA (julia 1) green with the gate (22m), Downgrade/Docs/Runic green. One Core run hit a self-hosted-runner "lost communication" infra flake (not a test failure); re-triggered.

Please ignore until reviewed by @ChrisRackauckas.

ChrisRackauckas and others added 8 commits June 20, 2026 06:27

Re-trigger CI (Core self-hosted runner lost communication on prior run)

5865fea

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>

ChrisRackauckas marked this pull request as ready for review June 29, 2026 17:44

ChrisRackauckas merged commit fafca77 into SciML:main Jun 29, 2026
9 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Fix master test reds: BBO loss return, ensemble solve API, FlexiChains chain access#299

Fix master test reds: BBO loss return, ensemble solve API, FlexiChains chain access#299
ChrisRackauckas merged 8 commits into
SciML:mainfrom
ChrisRackauckas-Claude:fix-ensemble-datafit-master-reds

ChrisRackauckas-Claude commented Jun 20, 2026 •

edited

Loading

Uh oh!

ChrisRackauckas-Claude commented Jun 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Uh oh!

Conversation

ChrisRackauckas-Claude commented Jun 20, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Fixed

1. bayesian_ensemble/bayesian_datafit sampling deadlock (Core + Datafit)

2. optimal_parameter_threshold assertion (Core threshold.jl:123)

3. CI parallelism + budget (test_groups.toml)

4. QA on Julia 1.12 — gated around an upstream Pkg regression

Verification

Uh oh!

ChrisRackauckas-Claude commented Jun 29, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ChrisRackauckas-Claude commented Jun 20, 2026 •

edited

Loading

1. `bayesian_ensemble`/`bayesian_datafit` sampling deadlock (Core + Datafit)

2. `optimal_parameter_threshold` assertion (Core `threshold.jl:123`)

3. CI parallelism + budget (`test_groups.toml`)