Skip to content

fix: pin_module_config no longer silently destroys pipelines (audit wave 3)#116

Merged
JamesHWade merged 10 commits into
mainfrom
feature/audit-fixes-wave3
Jun 15, 2026
Merged

fix: pin_module_config no longer silently destroys pipelines (audit wave 3)#116
JamesHWade merged 10 commits into
mainfrom
feature/audit-fixes-wave3

Conversation

@JamesHWade

Copy link
Copy Markdown
Owner

Third wave from the codebase audit. Stacked on #115 (wave 2).

pin_module_config silently destroyed pipelines (dsprrr-07u)

module_kind() collapsed every unrecognised module class to "predict", so a PipelineModule passed the v2-persistence allow-list and serialised as a bare PredictModule. A pinned pipeline restored as a single empty PredictModule, silently dropping every step and its bootstrapped per-step demos. The same hole affected other unsupported types (Ensemble, RAG, …).

  • module_kind() now falls back to the actual class name instead of "predict", so the persistence allow-list correctly rejects unsupported types.
  • serialize_module_config_v2() detects PipelineModule explicitly and aborts with a clear "pipeline persistence not yet supported" message.

Regression test verified to reproduce the silent data loss before the fix. Full suite: 3016 passing, no new failures. air format --check . passes.

🤖 Generated with Claude Code

JamesHWade and others added 4 commits June 14, 2026 18:00
Highest-value, lowest-risk findings from the codebase audit.

Release readiness (dsprrr-w0e):
- Lower the ellmer requirement to the released >= 0.4.1 and drop the
  CRAN-forbidden "Remotes: tidyverse/ellmer" entry.
- Remove unused Suggests: future and lifecycle.
- Add NEWS.md.

Correctness:
- print.dsprrr_batch_result now counts failed items. It read the error from
  item$error / the output object, but create_error_result() stores it in
  metadata$error with output = NA, so failed batches always printed
  "All items completed successfully" (dsprrr-8l0).
- Unknown or misspelled signature types now error with a closest-match
  suggestion instead of silently becoming string fields. validate_type_string()
  was fully written but never called (dsprrr-47p).
- module_trials() warns and returns an inspectable summary when every trial
  fails, instead of a cryptic "attempt to select less than one element"
  (dsprrr-hew).
- .cache is accepted and validated by all module types and pipelines, not
  only PredictModule, where it previously fell into ... and was dropped with
  a spurious "unknown input" warning (dsprrr-jup).

Test isolation:
- Add tests/testthat/setup.R plus a DSPRRR_CACHE_PATH env var so the suite
  no longer writes the disk cache into the package source tree (dsprrr-63v).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mean (wave 2)

BootstrapFewShot harvests demos with field-aware metrics (dsprrr-s3b):
- The single-module compile path passed the bare cell value as `expected`, so
  a field-aware metric (e.g. metric_exact_match(field = "answer"), the
  documented default) errored inside extract_field(), was scored NA, and
  bootstrapped ZERO demos. It now passes the full training row when the metric
  declares a field, mirroring the pipeline path and evaluate(). Tracks
  metric_field separately from output_col so the distinction survives.

evaluate() counts failed rows as 0 in mean_score (dsprrr-tn1):
- mean_score used na.rm = TRUE, so failed rows dropped out of the number that
  drives all optimizer selection — a config correct on the 2 rows it answers
  could outrank one robust across 10. Failures now count as 0 (matching DSPy).
  The raw `scores` vector keeps NA and n_errors still reports failures.

Both fixes have regression tests verified to fail without the change.
Full suite: 3009 passing, no new failures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rollout_id was fully implemented in the cache layer but no caller ever passed
it. With caching enabled (the default), BestOfN/Refine/Assert made byte-identical
attempts: attempt 1 missed the cache, attempts 2..N hit it and returned the same
response, so effective N was always 1.

Thread rollout_id end to end:
- call_llm_request() and PredictModule's private call_llm() / forward() now
  accept rollout_id and pass it to cached_chat_structured(), which already folds
  it into the cache key.
- BestOfN, Refine, and Assert pass rollout_id = i on each attempt, so every
  attempt gets a distinct cache key.

Regression tests (verified to fail without the fix):
- forward() threads rollout_id into the cache key (repeat id hits cache,
  new id misses).
- BestOfN and Refine pass 1,2,3 to successive attempts.

Note: a per-attempt temperature override (also in dsprrr-pcd) and optimizer-side
diversity (SIMBA/GEPA/COPRO/bootstrap teacher) remain follow-ups.

Full suite: 3015 passing, no new failures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…07u)

module_kind() collapsed every unrecognised module class to "predict", so a
PipelineModule (and EnsembleModule, RAGModule, ...) passed the v2-persistence
allow-list and serialised as a bare PredictModule. A pinned pipeline restored
as a single empty PredictModule, silently dropping every step and its
bootstrapped per-step demos.

- module_kind() now falls back to the actual class name instead of "predict",
  so unsupported types are correctly rejected by the persistence allow-list.
- serialize_module_config_v2() detects PipelineModule explicitly and aborts
  with a clear "pipeline persistence not yet supported" message pointing users
  to pin each step individually.

Regression test: pin_module_config() on a pipeline now errors instead of
silently losing data. Full suite: 3016 passing, no new failures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 14, 2026 22:20
@chatgpt-codex-connector

Copy link
Copy Markdown

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes a critical persistence bug where pin_module_config() could silently serialize unsupported module types (notably PipelineModule) as a bare PredictModule, causing pipelines (and their step demos) to be lost on restore. It tightens module-kind detection and adds an explicit guardrail to prevent pipeline persistence until it’s properly supported.

Changes:

  • Update module_kind() to fall back to the actual module class name (instead of "predict") for unrecognized module classes, preventing unsupported types from passing the v2 persistence allow-list.
  • Add an explicit PipelineModule check in serialize_module_config_v2() that aborts with a clear, user-actionable error message.
  • Add a regression test ensuring pipelines cannot be pinned silently.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
tests/testthat/test-orchestration.R Adds a regression test asserting pin_module_config() errors on pipelines rather than silently degrading them.
R/run.R Fixes module_kind() fallback behavior to avoid misclassifying unknown module classes as "predict".
R/orchestration.R Explicitly rejects PipelineModule persistence with a clear cli_abort() message before serialization proceeds.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

JamesHWade and others added 5 commits June 15, 2026 17:55
Codex/Copilot review of PR #114:
- Module$run() now validates .cache via validate_cache_arg(), so callers
  reaching the R6 method directly get the same loud error as the run()
  generic instead of a silently-forwarded malformed value.
- default_disk_cache_path() treats an empty DSPRRR_CACHE_PATH as unset
  (nzchar guard) so the disk cache never resolves to "".
- Test teardown restores the cache default with an explicit disk_path,
  independent of LIFO env-var cleanup ordering.
- Reword the .cache NEWS bullet to scope it accurately: it is validated
  and plumbed everywhere, but direct-call modules don't cache yet (dsprrr-aa2).
- Move the BootstrapFewShot (s3b) and evaluate (tn1) NEWS bullets out of
  wave 1 -- their code lands in wave 2 -- so each stacked PR's changelog
  matches its own diff.
- Comment the batch-print test to explain why cli_fmt() is the correct
  capture (cli writes to stdout; capture.output(type="message") would miss it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Regression from dsprrr-pcd: BestOfN/Refine/Assert passed rollout_id = i
explicitly while also spreading ..., so a nested wrapper
(refine(best_of_n(mod)), best_of_n(refine(mod)),
with_assertions(best_of_n(mod))) handed the inner forward() rollout_id
twice -> "formal argument matched by multiple actual arguments".

- Each wrapper forward() now takes an explicit rollout_id = NULL formal, so
  an inherited id is consumed out of ... instead of being re-spread.
- New compose_rollout_id() helper (R/utils.R) folds the inherited id into
  the per-attempt index ("2" at top level, "2.1" when nested), keeping every
  (outer, inner) attempt's cache key unique.
- Document @param rollout_id on the wrapper and PredictModule forward()
  methods (addresses Copilot review on PR #115).
- Add regression tests for all three nesting combinations; update the two
  existing pcd tests for the now-character ids.
- Add the s3b, tn1, and pcd/wx6 NEWS bullets here (their code lives in
  wave 2), so PR #115's changelog matches its diff (dsprrr-6qo).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The pipeline-persistence guard (dsprrr-07u) had no changelog entry. Add it
here so PR #116's NEWS matches its own diff.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@JamesHWade

Copy link
Copy Markdown
Owner Author

No actionable inline comments here (Copilot found none; Codex hit its usage limit). The wave-1 and wave-2 review fixes were merged forward into this branch, and this PR adds the pin_module_config (dsprrr-07u) NEWS bullet so its changelog matches its diff. CI is green.

@JamesHWade JamesHWade changed the base branch from feature/audit-fixes-wave2 to main June 15, 2026 22:41
@JamesHWade JamesHWade merged commit e0a91ed into main Jun 15, 2026
16 checks passed
@JamesHWade JamesHWade deleted the feature/audit-fixes-wave3 branch June 15, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants