Skip to content

fix: optimizer correctness — bootstrap demos, failure-blind mean, rollout_id diversity (audit wave 2)#115

Merged
JamesHWade merged 7 commits into
mainfrom
feature/audit-fixes-wave2
Jun 15, 2026
Merged

fix: optimizer correctness — bootstrap demos, failure-blind mean, rollout_id diversity (audit wave 2)#115
JamesHWade merged 7 commits into
mainfrom
feature/audit-fixes-wave2

Conversation

@JamesHWade

Copy link
Copy Markdown
Owner

Second wave from the codebase audit: the core optimizer-correctness cluster. Stacked on #114 (wave 1) — review/merge that first, or this diff will show both.

Each fix has a regression test verified to fail without the change.

BootstrapFewShot harvests demos with field-aware metrics (dsprrr-s3b)

The flagship optimizer used in its own roxygen example (metric_exact_match(field = "answer")) bootstrapped zero demos. The single-module compile path passed the bare cell value as expected, so the field-aware metric errored inside extract_field(), was scored NA, and never cleared the threshold. It now passes the full training row when the metric declares a field — mirroring the pipeline path and evaluate() — and tracks metric_field separately from output_col.

evaluate() counts failed rows as 0 in mean_score (dsprrr-tn1)

mean_score used na.rm = TRUE, dropping failed rows from the number that drives all optimizer selection. A config correct on the 2 rows it answers could outrank one robust across 10. Failures now count as 0 (matching DSPy); the raw scores vector keeps NA and n_errors still reports failures.

rollout_id threads through so retries actually explore (dsprrr-pcd)

rollout_id was fully implemented in the cache layer but no caller passed it. With caching on (the default), BestOfN/Refine/Assert made byte-identical attempts — attempt 1 missed the cache, attempts 2..N hit it — so effective N was always 1. Now threaded end-to-end (call_llm_requestcall_llmforwardcached_chat_structured), and BestOfN/Refine/Assert pass rollout_id = i per attempt.

Follow-up filed as dsprrr-vnv: per-attempt temperature override and optimizer-side diversity (SIMBA/GEPA/COPRO/bootstrap teacher).

Testing

Full suite: 3015 passing, no new failures (the 9 pre-existing print-capture failures are unrelated, verified on a clean tree). air format --check . passes.

🤖 Generated with Claude Code

JamesHWade and others added 3 commits June 14, 2026 18:00
Highest-value, lowest-risk findings from the codebase audit.

Release readiness (dsprrr-w0e):
- Lower the ellmer requirement to the released >= 0.4.1 and drop the
  CRAN-forbidden "Remotes: tidyverse/ellmer" entry.
- Remove unused Suggests: future and lifecycle.
- Add NEWS.md.

Correctness:
- print.dsprrr_batch_result now counts failed items. It read the error from
  item$error / the output object, but create_error_result() stores it in
  metadata$error with output = NA, so failed batches always printed
  "All items completed successfully" (dsprrr-8l0).
- Unknown or misspelled signature types now error with a closest-match
  suggestion instead of silently becoming string fields. validate_type_string()
  was fully written but never called (dsprrr-47p).
- module_trials() warns and returns an inspectable summary when every trial
  fails, instead of a cryptic "attempt to select less than one element"
  (dsprrr-hew).
- .cache is accepted and validated by all module types and pipelines, not
  only PredictModule, where it previously fell into ... and was dropped with
  a spurious "unknown input" warning (dsprrr-jup).

Test isolation:
- Add tests/testthat/setup.R plus a DSPRRR_CACHE_PATH env var so the suite
  no longer writes the disk cache into the package source tree (dsprrr-63v).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mean (wave 2)

BootstrapFewShot harvests demos with field-aware metrics (dsprrr-s3b):
- The single-module compile path passed the bare cell value as `expected`, so
  a field-aware metric (e.g. metric_exact_match(field = "answer"), the
  documented default) errored inside extract_field(), was scored NA, and
  bootstrapped ZERO demos. It now passes the full training row when the metric
  declares a field, mirroring the pipeline path and evaluate(). Tracks
  metric_field separately from output_col so the distinction survives.

evaluate() counts failed rows as 0 in mean_score (dsprrr-tn1):
- mean_score used na.rm = TRUE, so failed rows dropped out of the number that
  drives all optimizer selection — a config correct on the 2 rows it answers
  could outrank one robust across 10. Failures now count as 0 (matching DSPy).
  The raw `scores` vector keeps NA and n_errors still reports failures.

Both fixes have regression tests verified to fail without the change.
Full suite: 3009 passing, no new failures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rollout_id was fully implemented in the cache layer but no caller ever passed
it. With caching enabled (the default), BestOfN/Refine/Assert made byte-identical
attempts: attempt 1 missed the cache, attempts 2..N hit it and returned the same
response, so effective N was always 1.

Thread rollout_id end to end:
- call_llm_request() and PredictModule's private call_llm() / forward() now
  accept rollout_id and pass it to cached_chat_structured(), which already folds
  it into the cache key.
- BestOfN, Refine, and Assert pass rollout_id = i on each attempt, so every
  attempt gets a distinct cache key.

Regression tests (verified to fail without the fix):
- forward() threads rollout_id into the cache key (repeat id hits cache,
  new id misses).
- BestOfN and Refine pass 1,2,3 to successive attempts.

Note: a per-attempt temperature override (also in dsprrr-pcd) and optimizer-side
diversity (SIMBA/GEPA/COPRO/bootstrap teacher) remain follow-ups.

Full suite: 3015 passing, no new failures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 14, 2026 22:16

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6adf9d1a3

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread R/module-wrapper.R Outdated
Comment on lines +136 to +137
rollout_id = i,
...

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Consume inherited rollout_id before delegating

When this wrapper is itself nested under another retrying wrapper, the outer wrapper now supplies rollout_id through ...; this call then adds a second rollout_id before replaying ..., so the eventual PredictModule$forward() receives the same formal twice and the composed wrapper fails before making any attempts (for example, refine(best_of_n(mod)) or best_of_n(refine(mod))). Please make wrapper forward() methods consume a rollout_id formal or strip it from ... before delegating; the same call shape was added to Refine and Assert as well.

Useful? React with 👍 / 👎.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses optimizer correctness issues uncovered in audit wave 2 by fixing (1) BootstrapFewShot demo harvesting when using field-aware metrics, (2) evaluate()’s mean_score so failures count as 0, and (3) cache-partitioning for BestOfN/Refine/Assert retries via end-to-end rollout_id threading.

Changes:

  • Fix BootstrapFewShot’s single-module compile path to pass the full training row to field-aware metrics (instead of the bare expected cell).
  • Change evaluate()’s mean_score computation so NA (failed rows) contributes 0 to the mean while preserving scores/n_errors diagnostics.
  • Thread rollout_id through the PredictModule LLM call stack and have wrapper modules send distinct per-attempt rollout IDs; add regression tests.

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
tests/testthat/test-teleprompter-bootstrap.R Adds regression test ensuring BootstrapFewShot harvests demos with field-aware metrics.
tests/testthat/test-module-wrapper.R Adds tests asserting BestOfN/Refine pass distinct rollout_id values per attempt.
tests/testthat/test-compile.R Adds regression test asserting failed rows count as 0 in mean_score.
tests/testthat/test-cache.R Adds regression test verifying rollout_id affects the cache key (distinct attempts don’t collapse to one cached response).
R/teleprompter-bootstrap.R Fixes expected-value passing for field-aware metrics in BootstrapFewShot single-module compile flow.
R/run.R Extends call_llm_request() to accept/forward rollout_id into cached structured calls.
R/module-wrapper.R Passes attempt-specific rollout_id in BestOfN/Refine wrapper retries.
R/module-predict.R Adds rollout_id param to PredictModule forward() and threads it into LLM calls.
R/module-assert.R Passes attempt-specific rollout_id in AssertModule retries.
R/evaluate.R Updates mean_score to treat failures as 0 instead of dropping them via na.rm = TRUE.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread R/module-wrapper.R
Comment on lines +133 to +137
batch,
.llm = .llm,
trace = FALSE,
rollout_id = i,
...
Comment thread R/module-wrapper.R
Comment on lines +660 to +664
current_batch,
.llm = .llm,
trace = FALSE,
rollout_id = i,
...
Comment thread R/module-assert.R
Comment on lines +159 to +163
current_batch,
.llm = .llm,
trace = FALSE,
rollout_id = i,
...
Comment thread R/module-predict.R
Comment on lines 53 to +57
#' uses global cache configuration. If TRUE, attempts to use cache (no effect
#' if caching globally disabled). If FALSE, bypasses cache for this call.
#' @param ... Additional arguments
#' @return Tibble with result, .chat, .metadata columns
forward = function(batch, .llm = NULL, trace = TRUE, .cache = NULL, ...) {
forward = function(
JamesHWade and others added 3 commits June 15, 2026 17:55
Codex/Copilot review of PR #114:
- Module$run() now validates .cache via validate_cache_arg(), so callers
  reaching the R6 method directly get the same loud error as the run()
  generic instead of a silently-forwarded malformed value.
- default_disk_cache_path() treats an empty DSPRRR_CACHE_PATH as unset
  (nzchar guard) so the disk cache never resolves to "".
- Test teardown restores the cache default with an explicit disk_path,
  independent of LIFO env-var cleanup ordering.
- Reword the .cache NEWS bullet to scope it accurately: it is validated
  and plumbed everywhere, but direct-call modules don't cache yet (dsprrr-aa2).
- Move the BootstrapFewShot (s3b) and evaluate (tn1) NEWS bullets out of
  wave 1 -- their code lands in wave 2 -- so each stacked PR's changelog
  matches its own diff.
- Comment the batch-print test to explain why cli_fmt() is the correct
  capture (cli writes to stdout; capture.output(type="message") would miss it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Regression from dsprrr-pcd: BestOfN/Refine/Assert passed rollout_id = i
explicitly while also spreading ..., so a nested wrapper
(refine(best_of_n(mod)), best_of_n(refine(mod)),
with_assertions(best_of_n(mod))) handed the inner forward() rollout_id
twice -> "formal argument matched by multiple actual arguments".

- Each wrapper forward() now takes an explicit rollout_id = NULL formal, so
  an inherited id is consumed out of ... instead of being re-spread.
- New compose_rollout_id() helper (R/utils.R) folds the inherited id into
  the per-attempt index ("2" at top level, "2.1" when nested), keeping every
  (outer, inner) attempt's cache key unique.
- Document @param rollout_id on the wrapper and PredictModule forward()
  methods (addresses Copilot review on PR #115).
- Add regression tests for all three nesting combinations; update the two
  existing pcd tests for the now-character ids.
- Add the s3b, tn1, and pcd/wx6 NEWS bullets here (their code lives in
  wave 2), so PR #115's changelog matches its diff (dsprrr-6qo).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@JamesHWade

Copy link
Copy Markdown
Owner Author

Thanks for the reviews. Addressed in commit 36ec81c.

Nested wrappers crashed on rollout_id (Codex module-wrapper.R:137; Copilot wrapper:137/:664, assert:163) — confirmed and fixed. The root cause was exactly as described: each wrapper passed rollout_id = i explicitly while also spreading ..., so refine(best_of_n(mod)), best_of_n(refine(mod)), and with_assertions(best_of_n(mod)) handed the inner forward() rollout_id twice.

Fix:

  • Each wrapper forward() now takes an explicit rollout_id = NULL formal, so an inherited id is consumed out of ... instead of being re-spread.
  • New compose_rollout_id() helper folds the inherited id into the per-attempt index — "2" at the top level, "2.1" when nested — so every (outer, inner) attempt keeps a unique cache key.
  • Added regression tests for all three nesting combinations (they error pre-fix, pass post-fix). The two existing pcd tests were updated for the now-character ids.

I used the formal-parameter approach rather than do.call(): it consumes the inherited id, reads cleanly, and composes correctly for arbitrary nesting depth.

Missing @param rollout_id (Copilot module-predict.R:57) — documented on PredictModule$forward() and on all three wrapper forward() methods.

NEWS — this PR's changelog now carries the s3b, tn1, and pcd/wx6 bullets (moved here from PR #114, since their code lives in this wave).

@JamesHWade JamesHWade changed the base branch from feature/audit-fixes-wave1 to main June 15, 2026 22:23
…ave2

# Conflicts:
#	NEWS.md
#	tests/testthat/test-cache.R
@JamesHWade JamesHWade merged commit 6017844 into main Jun 15, 2026
9 checks passed
@JamesHWade JamesHWade deleted the feature/audit-fixes-wave2 branch June 15, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants