fix: optimizer correctness — bootstrap demos, failure-blind mean, rollout_id diversity (audit wave 2) by JamesHWade · Pull Request #115 · JamesHWade/dsprrr

JamesHWade · 2026-06-14T22:16:31Z

Second wave from the codebase audit: the core optimizer-correctness cluster. Stacked on #114 (wave 1) — review/merge that first, or this diff will show both.

Each fix has a regression test verified to fail without the change.

BootstrapFewShot harvests demos with field-aware metrics (`dsprrr-s3b`)

The flagship optimizer used in its own roxygen example (metric_exact_match(field = "answer")) bootstrapped zero demos. The single-module compile path passed the bare cell value as expected, so the field-aware metric errored inside extract_field(), was scored NA, and never cleared the threshold. It now passes the full training row when the metric declares a field — mirroring the pipeline path and evaluate() — and tracks metric_field separately from output_col.

evaluate() counts failed rows as 0 in mean_score (`dsprrr-tn1`)

mean_score used na.rm = TRUE, dropping failed rows from the number that drives all optimizer selection. A config correct on the 2 rows it answers could outrank one robust across 10. Failures now count as 0 (matching DSPy); the raw scores vector keeps NA and n_errors still reports failures.

rollout_id threads through so retries actually explore (`dsprrr-pcd`)

rollout_id was fully implemented in the cache layer but no caller passed it. With caching on (the default), BestOfN/Refine/Assert made byte-identical attempts — attempt 1 missed the cache, attempts 2..N hit it — so effective N was always 1. Now threaded end-to-end (call_llm_request → call_llm → forward → cached_chat_structured), and BestOfN/Refine/Assert pass rollout_id = i per attempt.

Follow-up filed as dsprrr-vnv: per-attempt temperature override and optimizer-side diversity (SIMBA/GEPA/COPRO/bootstrap teacher).

Testing

Full suite: 3015 passing, no new failures (the 9 pre-existing print-capture failures are unrelated, verified on a clean tree). air format --check . passes.

🤖 Generated with Claude Code

Highest-value, lowest-risk findings from the codebase audit. Release readiness (dsprrr-w0e): - Lower the ellmer requirement to the released >= 0.4.1 and drop the CRAN-forbidden "Remotes: tidyverse/ellmer" entry. - Remove unused Suggests: future and lifecycle. - Add NEWS.md. Correctness: - print.dsprrr_batch_result now counts failed items. It read the error from item$error / the output object, but create_error_result() stores it in metadata$error with output = NA, so failed batches always printed "All items completed successfully" (dsprrr-8l0). - Unknown or misspelled signature types now error with a closest-match suggestion instead of silently becoming string fields. validate_type_string() was fully written but never called (dsprrr-47p). - module_trials() warns and returns an inspectable summary when every trial fails, instead of a cryptic "attempt to select less than one element" (dsprrr-hew). - .cache is accepted and validated by all module types and pipelines, not only PredictModule, where it previously fell into ... and was dropped with a spurious "unknown input" warning (dsprrr-jup). Test isolation: - Add tests/testthat/setup.R plus a DSPRRR_CACHE_PATH env var so the suite no longer writes the disk cache into the package source tree (dsprrr-63v). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…mean (wave 2) BootstrapFewShot harvests demos with field-aware metrics (dsprrr-s3b): - The single-module compile path passed the bare cell value as `expected`, so a field-aware metric (e.g. metric_exact_match(field = "answer"), the documented default) errored inside extract_field(), was scored NA, and bootstrapped ZERO demos. It now passes the full training row when the metric declares a field, mirroring the pipeline path and evaluate(). Tracks metric_field separately from output_col so the distinction survives. evaluate() counts failed rows as 0 in mean_score (dsprrr-tn1): - mean_score used na.rm = TRUE, so failed rows dropped out of the number that drives all optimizer selection — a config correct on the 2 rows it answers could outrank one robust across 10. Failures now count as 0 (matching DSPy). The raw `scores` vector keeps NA and n_errors still reports failures. Both fixes have regression tests verified to fail without the change. Full suite: 3009 passing, no new failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

rollout_id was fully implemented in the cache layer but no caller ever passed it. With caching enabled (the default), BestOfN/Refine/Assert made byte-identical attempts: attempt 1 missed the cache, attempts 2..N hit it and returned the same response, so effective N was always 1. Thread rollout_id end to end: - call_llm_request() and PredictModule's private call_llm() / forward() now accept rollout_id and pass it to cached_chat_structured(), which already folds it into the cache key. - BestOfN, Refine, and Assert pass rollout_id = i on each attempt, so every attempt gets a distinct cache key. Regression tests (verified to fail without the fix): - forward() threads rollout_id into the cache key (repeat id hits cache, new id misses). - BestOfN and Refine pass 1,2,3 to successive attempts. Note: a per-attempt temperature override (also in dsprrr-pcd) and optimizer-side diversity (SIMBA/GEPA/COPRO/bootstrap teacher) remain follow-ups. Full suite: 3015 passing, no new failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6adf9d1a3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

chatgpt-codex-connector · 2026-06-14T22:19:28Z

+              rollout_id = i,
+              ...


Consume inherited rollout_id before delegating

When this wrapper is itself nested under another retrying wrapper, the outer wrapper now supplies rollout_id through ...; this call then adds a second rollout_id before replaying ..., so the eventual PredictModule$forward() receives the same formal twice and the composed wrapper fails before making any attempts (for example, refine(best_of_n(mod)) or best_of_n(refine(mod))). Please make wrapper forward() methods consume a rollout_id formal or strip it from ... before delegating; the same call shape was added to Refine and Assert as well.

Useful? React with 👍 / 👎.

Copilot

Pull request overview

This PR addresses optimizer correctness issues uncovered in audit wave 2 by fixing (1) BootstrapFewShot demo harvesting when using field-aware metrics, (2) evaluate()’s mean_score so failures count as 0, and (3) cache-partitioning for BestOfN/Refine/Assert retries via end-to-end rollout_id threading.

Changes:

Fix BootstrapFewShot’s single-module compile path to pass the full training row to field-aware metrics (instead of the bare expected cell).
Change evaluate()’s mean_score computation so NA (failed rows) contributes 0 to the mean while preserving scores/n_errors diagnostics.
Thread rollout_id through the PredictModule LLM call stack and have wrapper modules send distinct per-attempt rollout IDs; add regression tests.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file

File	Description
tests/testthat/test-teleprompter-bootstrap.R	Adds regression test ensuring BootstrapFewShot harvests demos with field-aware metrics.
tests/testthat/test-module-wrapper.R	Adds tests asserting BestOfN/Refine pass distinct `rollout_id` values per attempt.
tests/testthat/test-compile.R	Adds regression test asserting failed rows count as 0 in `mean_score`.
tests/testthat/test-cache.R	Adds regression test verifying `rollout_id` affects the cache key (distinct attempts don’t collapse to one cached response).
R/teleprompter-bootstrap.R	Fixes expected-value passing for field-aware metrics in BootstrapFewShot single-module compile flow.
R/run.R	Extends `call_llm_request()` to accept/forward `rollout_id` into cached structured calls.
R/module-wrapper.R	Passes attempt-specific `rollout_id` in BestOfN/Refine wrapper retries.
R/module-predict.R	Adds `rollout_id` param to PredictModule `forward()` and threads it into LLM calls.
R/module-assert.R	Passes attempt-specific `rollout_id` in AssertModule retries.
R/evaluate.R	Updates `mean_score` to treat failures as 0 instead of dropping them via `na.rm = TRUE`.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+              batch,
+              .llm = .llm,
+              trace = FALSE,
+              rollout_id = i,
+              ...


+              current_batch,
+              .llm = .llm,
+              trace = FALSE,
+              rollout_id = i,
+              ...


+              current_batch,
+              .llm = .llm,
+              trace = FALSE,
+              rollout_id = i,
+              ...


    #'   uses global cache configuration. If TRUE, attempts to use cache (no effect
    #'   if caching globally disabled). If FALSE, bypasses cache for this call.
    #' @param ... Additional arguments
    #' @return Tibble with result, .chat, .metadata columns
-    forward = function(batch, .llm = NULL, trace = TRUE, .cache = NULL, ...) {
+    forward = function(


Codex/Copilot review of PR #114: - Module$run() now validates .cache via validate_cache_arg(), so callers reaching the R6 method directly get the same loud error as the run() generic instead of a silently-forwarded malformed value. - default_disk_cache_path() treats an empty DSPRRR_CACHE_PATH as unset (nzchar guard) so the disk cache never resolves to "". - Test teardown restores the cache default with an explicit disk_path, independent of LIFO env-var cleanup ordering. - Reword the .cache NEWS bullet to scope it accurately: it is validated and plumbed everywhere, but direct-call modules don't cache yet (dsprrr-aa2). - Move the BootstrapFewShot (s3b) and evaluate (tn1) NEWS bullets out of wave 1 -- their code lands in wave 2 -- so each stacked PR's changelog matches its own diff. - Comment the batch-print test to explain why cli_fmt() is the correct capture (cli writes to stdout; capture.output(type="message") would miss it). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Regression from dsprrr-pcd: BestOfN/Refine/Assert passed rollout_id = i explicitly while also spreading ..., so a nested wrapper (refine(best_of_n(mod)), best_of_n(refine(mod)), with_assertions(best_of_n(mod))) handed the inner forward() rollout_id twice -> "formal argument matched by multiple actual arguments". - Each wrapper forward() now takes an explicit rollout_id = NULL formal, so an inherited id is consumed out of ... instead of being re-spread. - New compose_rollout_id() helper (R/utils.R) folds the inherited id into the per-attempt index ("2" at top level, "2.1" when nested), keeping every (outer, inner) attempt's cache key unique. - Document @param rollout_id on the wrapper and PredictModule forward() methods (addresses Copilot review on PR #115). - Add regression tests for all three nesting combinations; update the two existing pcd tests for the now-character ids. - Add the s3b, tn1, and pcd/wx6 NEWS bullets here (their code lives in wave 2), so PR #115's changelog matches its diff (dsprrr-6qo). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

JamesHWade · 2026-06-15T22:04:46Z

Thanks for the reviews. Addressed in commit 36ec81c.

Nested wrappers crashed on rollout_id (Codex module-wrapper.R:137; Copilot wrapper:137/:664, assert:163) — confirmed and fixed. The root cause was exactly as described: each wrapper passed rollout_id = i explicitly while also spreading ..., so refine(best_of_n(mod)), best_of_n(refine(mod)), and with_assertions(best_of_n(mod)) handed the inner forward() rollout_id twice.

Fix:

Each wrapper forward() now takes an explicit rollout_id = NULL formal, so an inherited id is consumed out of ... instead of being re-spread.
New compose_rollout_id() helper folds the inherited id into the per-attempt index — "2" at the top level, "2.1" when nested — so every (outer, inner) attempt keeps a unique cache key.
Added regression tests for all three nesting combinations (they error pre-fix, pass post-fix). The two existing pcd tests were updated for the now-character ids.

I used the formal-parameter approach rather than do.call(): it consumes the inherited id, reads cleanly, and composes correctly for arbitrary nesting depth.

Missing @param rollout_id (Copilot module-predict.R:57) — documented on PredictModule$forward() and on all three wrapper forward() methods.

NEWS — this PR's changelog now carries the s3b, tn1, and pcd/wx6 bullets (moved here from PR #114, since their code lives in this wave).

…ave2 # Conflicts: # NEWS.md # tests/testthat/test-cache.R

JamesHWade and others added 3 commits June 14, 2026 18:00

Copilot AI review requested due to automatic review settings June 14, 2026 22:16

Copilot started reviewing on behalf of JamesHWade June 14, 2026 22:16 View session

chatgpt-codex-connector Bot reviewed Jun 14, 2026

View reviewed changes

Copilot AI reviewed Jun 14, 2026

View reviewed changes

JamesHWade mentioned this pull request Jun 14, 2026

fix: pin_module_config no longer silently destroys pipelines (audit wave 3) #116

Merged

JamesHWade and others added 3 commits June 15, 2026 17:55

Merge branch 'feature/audit-fixes-wave1' into feature/audit-fixes-wave2

e3db7a6

JamesHWade mentioned this pull request Jun 15, 2026

fix: release-unblock + silent-failure correctness fixes (audit wave 1) #114

Merged

JamesHWade changed the base branch from feature/audit-fixes-wave1 to main June 15, 2026 22:23

Merge remote-tracking branch 'origin/main' into feature/audit-fixes-w…

64578da

…ave2 # Conflicts: # NEWS.md # tests/testthat/test-cache.R

JamesHWade merged commit 6017844 into main Jun 15, 2026
9 checks passed

JamesHWade deleted the feature/audit-fixes-wave2 branch June 15, 2026 22:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: optimizer correctness — bootstrap demos, failure-blind mean, rollout_id diversity (audit wave 2)#115

fix: optimizer correctness — bootstrap demos, failure-blind mean, rollout_id diversity (audit wave 2)#115
JamesHWade merged 7 commits into
mainfrom
feature/audit-fixes-wave2

JamesHWade commented Jun 14, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

JamesHWade commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

JamesHWade commented Jun 14, 2026

BootstrapFewShot harvests demos with field-aware metrics (dsprrr-s3b)

evaluate() counts failed rows as 0 in mean_score (dsprrr-tn1)

rollout_id threads through so retries actually explore (dsprrr-pcd)

Testing

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot Jun 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

JamesHWade commented Jun 15, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

BootstrapFewShot harvests demos with field-aware metrics (`dsprrr-s3b`)

evaluate() counts failed rows as 0 in mean_score (`dsprrr-tn1`)

rollout_id threads through so retries actually explore (`dsprrr-pcd`)