fix: optimizer correctness — bootstrap demos, failure-blind mean, rollout_id diversity (audit wave 2)#115
Conversation
Highest-value, lowest-risk findings from the codebase audit. Release readiness (dsprrr-w0e): - Lower the ellmer requirement to the released >= 0.4.1 and drop the CRAN-forbidden "Remotes: tidyverse/ellmer" entry. - Remove unused Suggests: future and lifecycle. - Add NEWS.md. Correctness: - print.dsprrr_batch_result now counts failed items. It read the error from item$error / the output object, but create_error_result() stores it in metadata$error with output = NA, so failed batches always printed "All items completed successfully" (dsprrr-8l0). - Unknown or misspelled signature types now error with a closest-match suggestion instead of silently becoming string fields. validate_type_string() was fully written but never called (dsprrr-47p). - module_trials() warns and returns an inspectable summary when every trial fails, instead of a cryptic "attempt to select less than one element" (dsprrr-hew). - .cache is accepted and validated by all module types and pipelines, not only PredictModule, where it previously fell into ... and was dropped with a spurious "unknown input" warning (dsprrr-jup). Test isolation: - Add tests/testthat/setup.R plus a DSPRRR_CACHE_PATH env var so the suite no longer writes the disk cache into the package source tree (dsprrr-63v). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mean (wave 2) BootstrapFewShot harvests demos with field-aware metrics (dsprrr-s3b): - The single-module compile path passed the bare cell value as `expected`, so a field-aware metric (e.g. metric_exact_match(field = "answer"), the documented default) errored inside extract_field(), was scored NA, and bootstrapped ZERO demos. It now passes the full training row when the metric declares a field, mirroring the pipeline path and evaluate(). Tracks metric_field separately from output_col so the distinction survives. evaluate() counts failed rows as 0 in mean_score (dsprrr-tn1): - mean_score used na.rm = TRUE, so failed rows dropped out of the number that drives all optimizer selection — a config correct on the 2 rows it answers could outrank one robust across 10. Failures now count as 0 (matching DSPy). The raw `scores` vector keeps NA and n_errors still reports failures. Both fixes have regression tests verified to fail without the change. Full suite: 3009 passing, no new failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rollout_id was fully implemented in the cache layer but no caller ever passed it. With caching enabled (the default), BestOfN/Refine/Assert made byte-identical attempts: attempt 1 missed the cache, attempts 2..N hit it and returned the same response, so effective N was always 1. Thread rollout_id end to end: - call_llm_request() and PredictModule's private call_llm() / forward() now accept rollout_id and pass it to cached_chat_structured(), which already folds it into the cache key. - BestOfN, Refine, and Assert pass rollout_id = i on each attempt, so every attempt gets a distinct cache key. Regression tests (verified to fail without the fix): - forward() threads rollout_id into the cache key (repeat id hits cache, new id misses). - BestOfN and Refine pass 1,2,3 to successive attempts. Note: a per-attempt temperature override (also in dsprrr-pcd) and optimizer-side diversity (SIMBA/GEPA/COPRO/bootstrap teacher) remain follow-ups. Full suite: 3015 passing, no new failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c6adf9d1a3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| rollout_id = i, | ||
| ... |
There was a problem hiding this comment.
Consume inherited rollout_id before delegating
When this wrapper is itself nested under another retrying wrapper, the outer wrapper now supplies rollout_id through ...; this call then adds a second rollout_id before replaying ..., so the eventual PredictModule$forward() receives the same formal twice and the composed wrapper fails before making any attempts (for example, refine(best_of_n(mod)) or best_of_n(refine(mod))). Please make wrapper forward() methods consume a rollout_id formal or strip it from ... before delegating; the same call shape was added to Refine and Assert as well.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
This PR addresses optimizer correctness issues uncovered in audit wave 2 by fixing (1) BootstrapFewShot demo harvesting when using field-aware metrics, (2) evaluate()’s mean_score so failures count as 0, and (3) cache-partitioning for BestOfN/Refine/Assert retries via end-to-end rollout_id threading.
Changes:
- Fix BootstrapFewShot’s single-module compile path to pass the full training row to field-aware metrics (instead of the bare expected cell).
- Change
evaluate()’smean_scorecomputation soNA(failed rows) contributes 0 to the mean while preservingscores/n_errorsdiagnostics. - Thread
rollout_idthrough the PredictModule LLM call stack and have wrapper modules send distinct per-attempt rollout IDs; add regression tests.
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/testthat/test-teleprompter-bootstrap.R | Adds regression test ensuring BootstrapFewShot harvests demos with field-aware metrics. |
| tests/testthat/test-module-wrapper.R | Adds tests asserting BestOfN/Refine pass distinct rollout_id values per attempt. |
| tests/testthat/test-compile.R | Adds regression test asserting failed rows count as 0 in mean_score. |
| tests/testthat/test-cache.R | Adds regression test verifying rollout_id affects the cache key (distinct attempts don’t collapse to one cached response). |
| R/teleprompter-bootstrap.R | Fixes expected-value passing for field-aware metrics in BootstrapFewShot single-module compile flow. |
| R/run.R | Extends call_llm_request() to accept/forward rollout_id into cached structured calls. |
| R/module-wrapper.R | Passes attempt-specific rollout_id in BestOfN/Refine wrapper retries. |
| R/module-predict.R | Adds rollout_id param to PredictModule forward() and threads it into LLM calls. |
| R/module-assert.R | Passes attempt-specific rollout_id in AssertModule retries. |
| R/evaluate.R | Updates mean_score to treat failures as 0 instead of dropping them via na.rm = TRUE. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| batch, | ||
| .llm = .llm, | ||
| trace = FALSE, | ||
| rollout_id = i, | ||
| ... |
| current_batch, | ||
| .llm = .llm, | ||
| trace = FALSE, | ||
| rollout_id = i, | ||
| ... |
| current_batch, | ||
| .llm = .llm, | ||
| trace = FALSE, | ||
| rollout_id = i, | ||
| ... |
| #' uses global cache configuration. If TRUE, attempts to use cache (no effect | ||
| #' if caching globally disabled). If FALSE, bypasses cache for this call. | ||
| #' @param ... Additional arguments | ||
| #' @return Tibble with result, .chat, .metadata columns | ||
| forward = function(batch, .llm = NULL, trace = TRUE, .cache = NULL, ...) { | ||
| forward = function( |
Codex/Copilot review of PR #114: - Module$run() now validates .cache via validate_cache_arg(), so callers reaching the R6 method directly get the same loud error as the run() generic instead of a silently-forwarded malformed value. - default_disk_cache_path() treats an empty DSPRRR_CACHE_PATH as unset (nzchar guard) so the disk cache never resolves to "". - Test teardown restores the cache default with an explicit disk_path, independent of LIFO env-var cleanup ordering. - Reword the .cache NEWS bullet to scope it accurately: it is validated and plumbed everywhere, but direct-call modules don't cache yet (dsprrr-aa2). - Move the BootstrapFewShot (s3b) and evaluate (tn1) NEWS bullets out of wave 1 -- their code lands in wave 2 -- so each stacked PR's changelog matches its own diff. - Comment the batch-print test to explain why cli_fmt() is the correct capture (cli writes to stdout; capture.output(type="message") would miss it). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Regression from dsprrr-pcd: BestOfN/Refine/Assert passed rollout_id = i
explicitly while also spreading ..., so a nested wrapper
(refine(best_of_n(mod)), best_of_n(refine(mod)),
with_assertions(best_of_n(mod))) handed the inner forward() rollout_id
twice -> "formal argument matched by multiple actual arguments".
- Each wrapper forward() now takes an explicit rollout_id = NULL formal, so
an inherited id is consumed out of ... instead of being re-spread.
- New compose_rollout_id() helper (R/utils.R) folds the inherited id into
the per-attempt index ("2" at top level, "2.1" when nested), keeping every
(outer, inner) attempt's cache key unique.
- Document @param rollout_id on the wrapper and PredictModule forward()
methods (addresses Copilot review on PR #115).
- Add regression tests for all three nesting combinations; update the two
existing pcd tests for the now-character ids.
- Add the s3b, tn1, and pcd/wx6 NEWS bullets here (their code lives in
wave 2), so PR #115's changelog matches its diff (dsprrr-6qo).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the reviews. Addressed in commit 36ec81c. Nested wrappers crashed on Fix:
I used the formal-parameter approach rather than Missing NEWS — this PR's changelog now carries the |
…ave2 # Conflicts: # NEWS.md # tests/testthat/test-cache.R
Second wave from the codebase audit: the core optimizer-correctness cluster. Stacked on #114 (wave 1) — review/merge that first, or this diff will show both.
Each fix has a regression test verified to fail without the change.
BootstrapFewShot harvests demos with field-aware metrics (
dsprrr-s3b)The flagship optimizer used in its own roxygen example (
metric_exact_match(field = "answer")) bootstrapped zero demos. The single-module compile path passed the bare cell value asexpected, so the field-aware metric errored insideextract_field(), was scoredNA, and never cleared the threshold. It now passes the full training row when the metric declares a field — mirroring the pipeline path andevaluate()— and tracksmetric_fieldseparately fromoutput_col.evaluate() counts failed rows as 0 in mean_score (
dsprrr-tn1)mean_scoreusedna.rm = TRUE, dropping failed rows from the number that drives all optimizer selection. A config correct on the 2 rows it answers could outrank one robust across 10. Failures now count as 0 (matching DSPy); the rawscoresvector keepsNAandn_errorsstill reports failures.rollout_id threads through so retries actually explore (
dsprrr-pcd)rollout_idwas fully implemented in the cache layer but no caller passed it. With caching on (the default), BestOfN/Refine/Assert made byte-identical attempts — attempt 1 missed the cache, attempts 2..N hit it — so effective N was always 1. Now threaded end-to-end (call_llm_request→call_llm→forward→cached_chat_structured), and BestOfN/Refine/Assert passrollout_id = iper attempt.Follow-up filed as
dsprrr-vnv: per-attempt temperature override and optimizer-side diversity (SIMBA/GEPA/COPRO/bootstrap teacher).Testing
Full suite: 3015 passing, no new failures (the 9 pre-existing
print-capture failures are unrelated, verified on a clean tree).air format --check .passes.🤖 Generated with Claude Code