Skip to content

Fix master test reds: BBO loss return, ensemble solve API, FlexiChains chain access#299

Merged
ChrisRackauckas merged 8 commits into
SciML:mainfrom
ChrisRackauckas-Claude:fix-ensemble-datafit-master-reds
Jun 29, 2026
Merged

Fix master test reds: BBO loss return, ensemble solve API, FlexiChains chain access#299
ChrisRackauckas merged 8 commits into
SciML:mainfrom
ChrisRackauckas-Claude:fix-ensemble-datafit-master-reds

Conversation

@ChrisRackauckas-Claude

@ChrisRackauckas-Claude ChrisRackauckas-Claude commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

Master CI was red across Core, Datafit, and QA on Julia 1.12. Each had a distinct root cause; all are now addressed (QA via a documented version gate around an upstream Pkg regression). Verified by running the full groups locally and on CI.

Branched off main (b450a3c); already on top of current main (rebase is a no-op). PR #298 (AbstractMCMC import) is unrelated.

Fixed

1. bayesian_ensemble/bayesian_datafit sampling deadlock (Core + Datafit)

bayesian_datafit forced Turing.sample(...; progress = true). With MCMCThreads and stdout redirected to a non-TTY (i.e. CI), the AbstractMCMC progress logger deadlocks: every worker thread parks in pthread_cond_wait and the main task blocks in the libuv event loop, so the job makes no progress and CI eventually cancels it. Confirmed via gdb (0% CPU, all threads cond-waiting). This — not mere slowness — is what was killing the Core group. Set progress = false in both bayesian_datafit methods.

2. optimal_parameter_threshold assertion (Core threshold.jl:123)

The test read s2.u[end][1] < 2, assuming x is the first state. @mtkbuild orders unknowns(sys) as [y, x], so index 1 is y (≈ 2.98), while x itself is ≈ 1.99 < 2. The optimizer was correct all along (the threshold on x and the inequality constraint are both satisfied, stably across NLopt seeds). Switched to symbolic indexing s2[x][end] < 2, which expresses the intended quantity and passes.

3. CI parallelism + budget (test_groups.toml)

SciML's grouped-tests matrix defaults to num_threads = 1, so bayesian_datafit/bayesian_ensemble (MCMCThreads, nchains=4) and get_sensitivity (EnsembleThreads) ran serially — the apparent "too slow". Added num_threads = 4 to Core and Datafit. Core's get_sensitivity(samples=1000) over the chaotic Lorenz is genuinely ~66 min even parallel (lock-contention bound), and the full Core group is ~105 min, so Core also gets timeout = 240. Iteration/sample counts are left untouched (no masking).

4. QA on Julia 1.12 — gated around an upstream Pkg regression

Aqua.find_persistent_tasks_deps Pkg.develops each dependency. On Julia 1.12, a Pkg regression (JuliaLang/Pkg.jl#4587) makes Pkg.develop honor a developed dependency's relative [sources] and resolve it against the depot, so OptimizationBBO (which pins its in-repo OptimizationBase via [sources], as every Optimization.jl sublibrary does) errors with expected package OptimizationBase to exist at path .../OptimizationBBO/OptimizationBase. It passes on 1.10/lts. The [sources] is correct and load-bearing for the monorepo (honored natively on Julia ≥1.11, emulated by SciML's develop_sources.jl on 1.10) — removing it is not the fix. Gated the call to VERSION < v"1.12"; the rest of the Aqua suite still runs everywhere and the persistent-tasks check still runs on lts. Tracked in #303; remove the gate once Pkg#4587 is fixed.

Verification

CI (this branch):

  • Core: green (1h26m).
  • Datafit: green (~38–45m).
  • QA (julia 1): green with the gate (~22m).
  • Downgrade, Documentation, Runic, Spell Check: green.

(One Core run failed on a self-hosted-runner "lost communication" infrastructure flake, unrelated to the tests; it passes on re-run.)

Local (Julia 1.12.6, num_threads=4): Datafit 10/10; Core all groups pass (basics 7/7, ensemble 1/1, examples 4/4, sensitivity 4/4, threshold 11/11).

Runic-formatted. Please ignore until reviewed by @ChrisRackauckas.

ChrisRackauckas and others added 8 commits June 20, 2026 06:27
…s chain access

Three independent breakages from upstream SciML updates were turning the
Core and Datafit CI groups red on master:

1. Optimization loss return type (Datafit group). `l2loss`/`relative_l2loss`
   returned `(tot_loss, sol)`. OptimizationBBO's BBO wrapper now strictly
   requires the objective to return a `Float64`, so `global_datafit` errored
   with "fitness function does NOT return the expected fitness type Float64".
   The `sol` element of the tuple was never consumed by any caller, so the loss
   functions now return only the scalar `tot_loss`. NLopt's `datafit` path
   (which tolerated the tuple by taking `first`) is unaffected.

2. EnsembleProblem / ensemble solve (Core group). The vector-of-problems form
   `EnsembleProblem([prob1, prob2, prob3])` is deprecated and broken in current
   SciMLBase (the default prob_func passes the whole vector to the per-trajectory
   solve), so `solve(enprob; saveat=1)` failed with a MethodError on
   `init(::Vector{ODEProblem}, ::Nothing)`. Migrated `bayesian_ensemble` and the
   ensemble test to the modern `prob_func = (prob, ctx) -> probs[ctx.sim_id]`
   form, pass an explicit algorithm (`Tsit5()`) and `trajectories`, and access
   per-trajectory solutions via `sol.u[i]` (EnsembleSolution's `length`/symbolic
   `getindex` now flatten across all timepoints). `ensemble_weights` updated to
   use `sol.u` accordingly.

3. Turing chain backend (Core + Datafit groups). Turing now returns a
   `FlexiChains.VNChain`, which no longer supports `chain["pprior[i]"]` string
   indexing. `bayesian_datafit` now extracts posterior samples with
   `chain[@varname(pprior[i])]`.

Verified locally on Julia 1.12 against the master dependency set:
- global_datafit (BBO) recovers [2/3, 4/3, 1, 1]; datafit (NLopt) still passes.
- prob_func ensemble solve produces a Vector{ODESolution}; ensemble_weights runs.
- bayesian_datafit returns per-parameter posterior sample vectors with reduced
  variance vs the prior (the Datafit test's assertion).

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on ensemble prob_func

Builds on the prior fixes in this branch (BBO scalar loss return, FlexiChains
@varname extraction). Three remaining breakages from upstream SciML/Turing churn:

1. Datafit group (DynamicPPL 0.41). bayesianODE rejected non-successful solves
   with `Turing.DynamicPPL.acclogp!!(__varinfo__, -Inf)`, which no longer has a
   method when __varinfo__ is the AD-time `OnlyAccsVarInfo{...Dual...}`:
       MethodError: no method matching acclogp!!(::OnlyAccsVarInfo{...}, ::Float64)
   Replaced with the canonical sample-rejection idiom `@addlogprob! (; loglikelihood = -Inf)`,
   which dispatches correctly on the LogLikelihood accumulator under ForwardDiff.

2/3. Core + Downgrade groups (EnsembleProblem prob_func arity). The prior
   `(prob, ctx) -> probs[ctx.sim_id]` form only works on SciMLBase 3.x (2-arg
   `prob_func(prob, ctx)`). On the Downgrade floor (SciMLBase 2.55) EnsembleProblem
   still calls the legacy 3-arg `prob_func(prob, i, repeat)`, so it errored with
       MethodError: no method matching (::SciML#1#2)(::ODEProblem, ::Int64, ::Int64)
   Introduced `EnsembleProbForwarder(all_probs)`, a callable supporting both
   interfaces (integer index and `ctx.sim_id`), used as `bayesian_ensemble`'s
   prob_func. Storing `all_probs` lets callers read the trajectory count via
   `enprob.prob_func.all_probs` (the access the ensemble test already uses). The
   ensemble test's inline prob_func is likewise made arity-robust.

Verified locally on Julia 1.12 against the master dependency set (SciMLBase 3.21,
DynamicPPL 0.41.8, Turing 0.45, OptimizationBBO 0.4.7):
- simple EnsembleProblem solve(enprob, Tsit5(); trajectories) recovers weights
  [0.2, 0.5, 0.3]; EnsembleProbForwarder dispatches on both (prob, i, repeat) and
  (prob, ctx); ensemble_weights runs.
- bayesian_datafit (both t and (t, timeseries) forms) returns per-parameter
  posterior sample vectors with variance reduced vs the prior (the Datafit assertion).

Not fixed here (upstream, reported separately): QA group's
`Aqua.find_persistent_tasks_deps` fails ONLY on Julia 1.12 (passes on lts) because
Pkg now honors OptimizationBBO 0.4.4-0.4.7's released relative-path
`[sources] OptimizationBase = {path = "../OptimizationBase"}` and errors with
"expected package OptimizationBase to exist at path .../OptimizationBBO/OptimizationBase".
Same OptimizationBBO version passes on Julia 1.10. Capping OptimizationBBO < 0.4.4
would drag SciMLBase 3->2, ModelingToolkit 11->9, OptimizationBase 5->2 (major
regression), so the correct fix is upstream (stop shipping [sources] in releases).

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on (Downgrade)

The previous commit fixed the Core/Datafit reds on the current stack but two
bayesian_datafit details still broke the Downgrade group (Turing 0.35 /
MCMCChains 6 / DynamicPPL 0.30):

1. Chain sample extraction. `chain[@varname(pprior[i])]` works on newer Turing's
   FlexiChains.VNChain but errors on the legacy MCMCChains.Chains:
       MethodError: no method matching getindex(::MCMCChains.Chains{...}, ::VarName{:pprior, IndexLens{Tuple{Int64}}})
   Added `_pprior_samples(chain, i)` which tries the VarName form and falls back to
   the legacy `chain["pprior[i]"]` string key, supporting both backends.

2. Sample rejection. `@addlogprob! (; loglikelihood = -Inf)` (NamedTuple form) only
   exists on DynamicPPL 0.41+. Switched to the scalar `@addlogprob! -Inf`, which is
   valid on both 0.30 (added to the log-prob) and 0.41 (routed to the LogLikelihood
   accumulator), and still avoids the removed `acclogp!!(__varinfo__, ::Float64)`.

Verified the VarName→string fallback against a real MCMCChains.Chains (the exact
type the Downgrade run reported), and re-verified bayesian_datafit on the current
stack (Julia 1.12, Turing 0.45/FlexiChains): both forms pass the variance-reduction
assertion, including runs that actually hit the rejection branch (solves aborting
under ForwardDiff), with no acclogp!! error.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
….jl)

`_get_sensitivity`'s internal `prob_func(prob, i, repeat)` only defined the legacy
3-arg form, so on current SciMLBase (3.x) the EnsembleProblem solve called it with
the 2-arg `(prob, ctx)` form and errored:
    MethodError: no method matching (::#prob_func#25)(::ODEProblem, ::EnsembleContext)
Added the `(prob, ctx) -> remake(...; p = ...[:, ctx.sim_id])` method alongside the
integer-index one (same cross-version pattern as the ensemble fix), and read
per-trajectory solutions via `sol.u[i]` (matching the EnsembleSolution access used
in src/ensemble.jl). This is the same upstream EnsembleProblem prob_func API change;
it surfaced in the Core group only after the ensemble.jl fix let Core run past it.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Continuing this branch's "adapt to upstream SciML churn" theme, the Core group's
threshold.jl surfaced two more errors once the ensemble/sensitivity fixes let Core
run past them.

`prob_violating_threshold`'s inner `g(sol, p)` introspected each symbolic threshold
inequality via `threshold.val.f` and `threshold.val.arguments`. On Symbolics 7 the
`BasicSymbolic` representation is backed by Moshi `@data`, so direct `.f`/`.arguments`
field access errors with `unknown field name: arguments` (from Moshi's getproperty),
failing both `prob_violating_threshold` tests (threshold.jl:158 `> 0.99` and :165
`< 0.01`) with that error.

Replaced the field access with the public `operation`/`arguments` accessors via a
small `_threshold_violation` helper. Symbolics also canonicalizes comparisons
(`x > 10.0` is stored as `<(10.0, x)`), moving the constant to either operand, so the
helper detects which argument is the numeric bound and reconstructs the violating
side (upper: `maximum(state) > bound`; lower: `minimum(state) < bound`) accordingly.

Verified locally on Julia 1.12 (SciMLBase 3.22, ModelingToolkit 11.28, Symbolics 7):
test/threshold.jl now reports 10 passed / 1 failed / 0 errored (was 8/1/2). The two
`unknown field name: arguments` errors are gone; both `prob_violating_threshold`
assertions pass. The remaining single failure (threshold.jl:123, `2.97 < 2`) is a
pre-existing, machine-load-dependent flake in `optimal_parameter_intervention_for_threshold`,
which optimizes with NLopt's stochastic `GN_ISRES` under a wall-clock `maxtime=60`
budget and no fixed RNG seed; it is unrelated to these changes and is not touched here.

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ld state index, CI parallelism

Three independent fixes that take the Core and Datafit groups green on Julia
1.12, verified locally by running both full groups end-to-end (exit 0).

1. bayesian_ensemble deadlock (Core ensemble.jl). bayesian_datafit forced
   `Turing.sample(...; progress = true)`. With MCMCThreads and stdout redirected
   to a non-TTY (i.e. CI), the AbstractMCMC progress logger deadlocks: all worker
   threads park in pthread_cond_wait and the main task blocks in the libuv event
   loop, so the job makes no progress and CI eventually cancels it. This — not
   mere slowness — is what was killing the Core group. Set `progress = false` in
   both bayesian_datafit methods. Confirmed via gdb that the pre-fix run was a
   genuine deadlock (0% CPU, all threads cond-waiting); post-fix Core runs to
   completion.

2. optimal_parameter_threshold assertion (Core threshold.jl:123). The test read
   `s2.u[end][1] < 2`, assuming x is the first state. `@mtkbuild` orders
   `unknowns(sys)` as `[y, x]` here, so index 1 is y (= x + 0.99 by the identical
   RHS), which is ~2.98 while x itself is ~1.99 < 2. The optimizer was correct all
   along (the threshold on x is satisfied, ineq constraint satisfied, stable across
   NLopt seeds). Switched to symbolic indexing `s2[x][end] < 2`, which expresses
   the intended quantity and passes.

3. CI parallelism/budget (test_groups.toml). SciML's grouped-tests matrix defaults
   to num_threads=1, so bayesian_datafit/bayesian_ensemble (MCMCThreads, nchains=4)
   and get_sensitivity (EnsembleThreads) ran serially — the apparent "too slow".
   Added num_threads=4 to Core and Datafit. Core's get_sensitivity(samples=1000)
   over the chaotic Lorenz is genuinely ~66 min even parallel (lock-contention
   bound), and the full Core group is ~105 min, so Core also gets timeout=240
   (iteration/sample counts are left untouched).

Local verification (Julia 1.12.6, num_threads=4):
- Datafit group: 10/10 pass, ~20 min.
- Core group: all groups pass (basics 7/7, ensemble 1/1, examples 4/4,
  sensitivity 4/4, threshold 11/11), exit 0.

QA (julia 1) remains red from an upstream issue tracked separately: registered
OptimizationBBO ships a dev `[sources] OptimizationBase = {path=...}` that Julia
1.12 honors during Aqua's find_persistent_tasks_deps Pkg.develop (passes on lts).

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
On Julia 1.12, Aqua.find_persistent_tasks_deps errors because Pkg.develop now
honors a developed dependency's relative [sources] and resolves it against the
depot: OptimizationBBO pins its in-repo OptimizationBase via
`[sources] OptimizationBase = {path = "../OptimizationBase"}` (as every
Optimization.jl sublibrary does), so the develop step looks for
`.../packages/OptimizationBBO/<hash>/../OptimizationBase` and fails with
"expected package OptimizationBase to exist at path ...". It passes on 1.10/lts.

This [sources] is correct and load-bearing for the Optimization.jl monorepo
(honored natively on Julia >=1.11, emulated by SciML's develop_sources.jl on
1.10); removing it is not the fix. The bug is an upstream Julia 1.12.4 Pkg
regression (JuliaLang/Pkg.jl#4587). Gate the call to VERSION < v"1.12" so the
rest of the Aqua suite still runs everywhere and the persistent-tasks check
still runs on lts. Remove the gate once Pkg#4587 is fixed (SciML#303).

Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
Co-Authored-By: Chris Rackauckas <accounts@chrisrackauckas.com>
@ChrisRackauckas-Claude

Copy link
Copy Markdown
Contributor Author

Update — all three red groups addressed and the description rewritten.

  • Core/Datafit "too slow" was really a deadlock: bayesian_datafit forced progress=true, which deadlocks AbstractMCMC under MCMCThreads + non-TTY output (confirmed via gdb: all threads in pthread_cond_wait). Set progress=false. Also added num_threads=4 (the matrix defaults to 1, so the chains ran serially) and timeout=240 for Core (its get_sensitivity(samples=1000) over the chaotic Lorenz is ~66 min even parallel). No iteration/sample counts were cut.
  • threshold.jl:123 was a test bug, not an optimizer regression: s2.u[end][1] read state y (≈2.98) because @mtkbuild orders unknowns as [y, x]; the threshold is on x (≈1.99 < 2). Fixed to symbolic s2[x][end].
  • QA on Julia 1.12 is the upstream Pkg regression Breaking change: cannot develop app with relative paths in Julia 1.12.4 JuliaLang/Pkg.jl#4587 (Pkg.develop honoring a dev'd dependency's relative [sources]), triggered by OptimizationBBO's monorepo [sources]. Removing [sources] is not viable (load-bearing for the Optimization.jl monorepo). Gated find_persistent_tasks_deps to VERSION < v"1.12"; tracked in Re-enable Aqua.find_persistent_tasks_deps on Julia 1.12 once Pkg.jl#4587 is fixed #303 to revert once Pkg is fixed.

CI on this branch: Core green (1h26m), Datafit green, QA (julia 1) green with the gate (22m), Downgrade/Docs/Runic green. One Core run hit a self-hosted-runner "lost communication" infra flake (not a test failure); re-triggered.

Please ignore until reviewed by @ChrisRackauckas.

@ChrisRackauckas ChrisRackauckas marked this pull request as ready for review June 29, 2026 17:44
@ChrisRackauckas ChrisRackauckas merged commit fafca77 into SciML:main Jun 29, 2026
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants