fix: pin_module_config no longer silently destroys pipelines (audit wave 3)#116
Conversation
Highest-value, lowest-risk findings from the codebase audit. Release readiness (dsprrr-w0e): - Lower the ellmer requirement to the released >= 0.4.1 and drop the CRAN-forbidden "Remotes: tidyverse/ellmer" entry. - Remove unused Suggests: future and lifecycle. - Add NEWS.md. Correctness: - print.dsprrr_batch_result now counts failed items. It read the error from item$error / the output object, but create_error_result() stores it in metadata$error with output = NA, so failed batches always printed "All items completed successfully" (dsprrr-8l0). - Unknown or misspelled signature types now error with a closest-match suggestion instead of silently becoming string fields. validate_type_string() was fully written but never called (dsprrr-47p). - module_trials() warns and returns an inspectable summary when every trial fails, instead of a cryptic "attempt to select less than one element" (dsprrr-hew). - .cache is accepted and validated by all module types and pipelines, not only PredictModule, where it previously fell into ... and was dropped with a spurious "unknown input" warning (dsprrr-jup). Test isolation: - Add tests/testthat/setup.R plus a DSPRRR_CACHE_PATH env var so the suite no longer writes the disk cache into the package source tree (dsprrr-63v). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mean (wave 2) BootstrapFewShot harvests demos with field-aware metrics (dsprrr-s3b): - The single-module compile path passed the bare cell value as `expected`, so a field-aware metric (e.g. metric_exact_match(field = "answer"), the documented default) errored inside extract_field(), was scored NA, and bootstrapped ZERO demos. It now passes the full training row when the metric declares a field, mirroring the pipeline path and evaluate(). Tracks metric_field separately from output_col so the distinction survives. evaluate() counts failed rows as 0 in mean_score (dsprrr-tn1): - mean_score used na.rm = TRUE, so failed rows dropped out of the number that drives all optimizer selection — a config correct on the 2 rows it answers could outrank one robust across 10. Failures now count as 0 (matching DSPy). The raw `scores` vector keeps NA and n_errors still reports failures. Both fixes have regression tests verified to fail without the change. Full suite: 3009 passing, no new failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rollout_id was fully implemented in the cache layer but no caller ever passed it. With caching enabled (the default), BestOfN/Refine/Assert made byte-identical attempts: attempt 1 missed the cache, attempts 2..N hit it and returned the same response, so effective N was always 1. Thread rollout_id end to end: - call_llm_request() and PredictModule's private call_llm() / forward() now accept rollout_id and pass it to cached_chat_structured(), which already folds it into the cache key. - BestOfN, Refine, and Assert pass rollout_id = i on each attempt, so every attempt gets a distinct cache key. Regression tests (verified to fail without the fix): - forward() threads rollout_id into the cache key (repeat id hits cache, new id misses). - BestOfN and Refine pass 1,2,3 to successive attempts. Note: a per-attempt temperature override (also in dsprrr-pcd) and optimizer-side diversity (SIMBA/GEPA/COPRO/bootstrap teacher) remain follow-ups. Full suite: 3015 passing, no new failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…07u) module_kind() collapsed every unrecognised module class to "predict", so a PipelineModule (and EnsembleModule, RAGModule, ...) passed the v2-persistence allow-list and serialised as a bare PredictModule. A pinned pipeline restored as a single empty PredictModule, silently dropping every step and its bootstrapped per-step demos. - module_kind() now falls back to the actual class name instead of "predict", so unsupported types are correctly rejected by the persistence allow-list. - serialize_module_config_v2() detects PipelineModule explicitly and aborts with a clear "pipeline persistence not yet supported" message pointing users to pin each step individually. Regression test: pin_module_config() on a pipeline now errors instead of silently losing data. Full suite: 3016 passing, no new failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. |
There was a problem hiding this comment.
Pull request overview
This PR fixes a critical persistence bug where pin_module_config() could silently serialize unsupported module types (notably PipelineModule) as a bare PredictModule, causing pipelines (and their step demos) to be lost on restore. It tightens module-kind detection and adds an explicit guardrail to prevent pipeline persistence until it’s properly supported.
Changes:
- Update
module_kind()to fall back to the actual module class name (instead of"predict") for unrecognized module classes, preventing unsupported types from passing the v2 persistence allow-list. - Add an explicit
PipelineModulecheck inserialize_module_config_v2()that aborts with a clear, user-actionable error message. - Add a regression test ensuring pipelines cannot be pinned silently.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/testthat/test-orchestration.R | Adds a regression test asserting pin_module_config() errors on pipelines rather than silently degrading them. |
| R/run.R | Fixes module_kind() fallback behavior to avoid misclassifying unknown module classes as "predict". |
| R/orchestration.R | Explicitly rejects PipelineModule persistence with a clear cli_abort() message before serialization proceeds. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Codex/Copilot review of PR #114: - Module$run() now validates .cache via validate_cache_arg(), so callers reaching the R6 method directly get the same loud error as the run() generic instead of a silently-forwarded malformed value. - default_disk_cache_path() treats an empty DSPRRR_CACHE_PATH as unset (nzchar guard) so the disk cache never resolves to "". - Test teardown restores the cache default with an explicit disk_path, independent of LIFO env-var cleanup ordering. - Reword the .cache NEWS bullet to scope it accurately: it is validated and plumbed everywhere, but direct-call modules don't cache yet (dsprrr-aa2). - Move the BootstrapFewShot (s3b) and evaluate (tn1) NEWS bullets out of wave 1 -- their code lands in wave 2 -- so each stacked PR's changelog matches its own diff. - Comment the batch-print test to explain why cli_fmt() is the correct capture (cli writes to stdout; capture.output(type="message") would miss it). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Regression from dsprrr-pcd: BestOfN/Refine/Assert passed rollout_id = i
explicitly while also spreading ..., so a nested wrapper
(refine(best_of_n(mod)), best_of_n(refine(mod)),
with_assertions(best_of_n(mod))) handed the inner forward() rollout_id
twice -> "formal argument matched by multiple actual arguments".
- Each wrapper forward() now takes an explicit rollout_id = NULL formal, so
an inherited id is consumed out of ... instead of being re-spread.
- New compose_rollout_id() helper (R/utils.R) folds the inherited id into
the per-attempt index ("2" at top level, "2.1" when nested), keeping every
(outer, inner) attempt's cache key unique.
- Document @param rollout_id on the wrapper and PredictModule forward()
methods (addresses Copilot review on PR #115).
- Add regression tests for all three nesting combinations; update the two
existing pcd tests for the now-character ids.
- Add the s3b, tn1, and pcd/wx6 NEWS bullets here (their code lives in
wave 2), so PR #115's changelog matches its diff (dsprrr-6qo).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The pipeline-persistence guard (dsprrr-07u) had no changelog entry. Add it here so PR #116's NEWS matches its own diff. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
No actionable inline comments here (Copilot found none; Codex hit its usage limit). The wave-1 and wave-2 review fixes were merged forward into this branch, and this PR adds the |
…ave3 # Conflicts: # NEWS.md
Third wave from the codebase audit. Stacked on #115 (wave 2).
pin_module_config silently destroyed pipelines (
dsprrr-07u)module_kind()collapsed every unrecognised module class to"predict", so aPipelineModulepassed the v2-persistence allow-list and serialised as a barePredictModule. A pinned pipeline restored as a single emptyPredictModule, silently dropping every step and its bootstrapped per-step demos. The same hole affected other unsupported types (Ensemble, RAG, …).module_kind()now falls back to the actual class name instead of"predict", so the persistence allow-list correctly rejects unsupported types.serialize_module_config_v2()detectsPipelineModuleexplicitly and aborts with a clear "pipeline persistence not yet supported" message.Regression test verified to reproduce the silent data loss before the fix. Full suite: 3016 passing, no new failures.
air format --check .passes.🤖 Generated with Claude Code