fix: release-unblock + silent-failure correctness fixes (audit wave 1)#114
Conversation
Highest-value, lowest-risk findings from the codebase audit. Release readiness (dsprrr-w0e): - Lower the ellmer requirement to the released >= 0.4.1 and drop the CRAN-forbidden "Remotes: tidyverse/ellmer" entry. - Remove unused Suggests: future and lifecycle. - Add NEWS.md. Correctness: - print.dsprrr_batch_result now counts failed items. It read the error from item$error / the output object, but create_error_result() stores it in metadata$error with output = NA, so failed batches always printed "All items completed successfully" (dsprrr-8l0). - Unknown or misspelled signature types now error with a closest-match suggestion instead of silently becoming string fields. validate_type_string() was fully written but never called (dsprrr-47p). - module_trials() warns and returns an inspectable summary when every trial fails, instead of a cryptic "attempt to select less than one element" (dsprrr-hew). - .cache is accepted and validated by all module types and pipelines, not only PredictModule, where it previously fell into ... and was dropped with a spurious "unknown input" warning (dsprrr-jup). Test isolation: - Add tests/testthat/setup.R plus a DSPRRR_CACHE_PATH env var so the suite no longer writes the disk cache into the package source tree (dsprrr-63v). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 95e2756272
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| .progress = .progress, | ||
| .return_format = .return_format | ||
| .return_format = .return_format, | ||
| .cache = .cache |
There was a problem hiding this comment.
Honor
.cache before accepting it for all modules
Capturing .cache here removes it from ... for every non-Predict module, but several of those modules still never consume it: for example, RAGModule$forward() calls its private helper without .cache, and that helper invokes llm$chat_structured() directly instead of cached_chat_structured(). In those contexts run(rag_mod, ..., .cache = TRUE/FALSE) now validates and appears accepted, but the setting is silently ignored, so the advertised all-module/pipeline cache control still fails.
Useful? React with 👍 / 👎.
| * Evaluation now treats a failed prediction as a score of 0 rather than | ||
| dropping it from the mean, so optimizers no longer prefer configurations | ||
| that fail on most of the data (#dsprrr-tn1). |
There was a problem hiding this comment.
Remove the unfixed evaluation claim
This changelog entry is not true for this commit: evaluate.Module() still stores failed metric/prediction cases as NA_real_ and computes mean_score <- mean(scores, na.rm = TRUE), so a configuration that succeeds on only one row can still get a perfect mean and be selected by optimize_grid(). Either implement the promised zero scoring in evaluate() or drop this release note so users are not told the optimizer corruption case is fixed.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Pull request overview
This PR implements a first wave of release-readiness and correctness fixes: it updates package metadata for CRAN submission, adds a development changelog, and addresses several previously silent failure modes in signature parsing, batch-result printing, optimization trial summarization, and cache parameter forwarding/test isolation.
Changes:
- Update CRAN-facing package metadata (dependency constraints, Removes
Remotes, prune unusedSuggests) and addNEWS.md. - Wire in strict type validation for signature parsing (with closest-match suggestions) and improve batch print error counting +
module_trials()behavior when all scores areNA. - Make
.cachea first-classrun()argument across module entry points and isolate disk cache usage during tests viaDSPRRR_CACHE_PATH.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/testthat/test-signature-parser.R | Adds regression tests ensuring unknown/misspelled types error (and valid types still parse). |
| tests/testthat/test-run.R | Adds regression tests for batch-result printing error counts. |
| tests/testthat/test-optimize.R | Adds regression test for module_trials() when all trials fail (all-NA scores). |
| tests/testthat/test-cache.R | Adds regression test that .cache is accepted/validated for non-PredictModule modules. |
| tests/testthat/setup.R | Introduces suite-wide cache isolation via env var + cleanup. |
| R/signature-parser.R | Calls validate_type_string() during type parsing and adds closest-match suggestions. |
| R/run.R | Adds shared .cache validation helper; plumbs .cache through run.Module(). |
| R/optimize.R | Makes module_trials() warn + return an inspectable summary when all scores are NA. |
| R/module-base.R | Extends base $run() to accept/forward .cache. |
| R/cache.R | Adds DSPRRR_CACHE_PATH-aware default disk cache path and uses it in cache config defaults. |
| NEWS.md | Adds a development changelog for the release/unblock + fixes. |
| man/configure_cache.Rd | Updates the documented default for disk_path. |
| DESCRIPTION | Lowers ellmer minimum version, removes CRAN-forbidden Remotes, removes unused Suggests. |
Files not reviewed (1)
- man/configure_cache.Rd: Generated file
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| class = c("dsprrr_batch_result", "list") | ||
| ) | ||
|
|
||
| out <- cli::cli_fmt(print(result)) |
| ), | ||
| class = c("dsprrr_batch_result", "list") | ||
| ) | ||
| out <- cli::cli_fmt(print(result)) |
| .cache = NULL | ||
| ) { | ||
| inputs <- list(...) |
| default_disk_cache_path <- function() { | ||
| Sys.getenv("DSPRRR_CACHE_PATH", unset = ".dsprrr_cache") | ||
| } |
| dsprrr_test_cache_dir <- file.path(tempdir(), "dsprrr-test-cache") | ||
| dir.create(dsprrr_test_cache_dir, showWarnings = FALSE, recursive = TRUE) | ||
|
|
||
| withr::local_envvar( | ||
| DSPRRR_CACHE_PATH = dsprrr_test_cache_dir, | ||
| .local_envir = testthat::teardown_env() | ||
| ) | ||
|
|
||
| dsprrr::configure_cache( | ||
| enable_disk = FALSE, | ||
| disk_path = dsprrr_test_cache_dir | ||
| ) | ||
|
|
||
| withr::defer( | ||
| { | ||
| dsprrr::clear_cache() | ||
| unlink(dsprrr_test_cache_dir, recursive = TRUE) | ||
| dsprrr::configure_cache() | ||
| }, | ||
| envir = testthat::teardown_env() | ||
| ) |
| * `BootstrapFewShot` now harvests demonstrations when the metric targets a | ||
| specific output field (e.g. `metric_exact_match(field = "answer")`). | ||
| Previously the single-module path passed a bare value to the metric, which | ||
| errored internally and silently bootstrapped zero demos (#dsprrr-s3b). |
| * Evaluation now treats a failed prediction as a score of 0 rather than | ||
| dropping it from the mean, so optimizers no longer prefer configurations | ||
| that fail on most of the data (#dsprrr-tn1). |
Codex/Copilot review of PR #114: - Module$run() now validates .cache via validate_cache_arg(), so callers reaching the R6 method directly get the same loud error as the run() generic instead of a silently-forwarded malformed value. - default_disk_cache_path() treats an empty DSPRRR_CACHE_PATH as unset (nzchar guard) so the disk cache never resolves to "". - Test teardown restores the cache default with an explicit disk_path, independent of LIFO env-var cleanup ordering. - Reword the .cache NEWS bullet to scope it accurately: it is validated and plumbed everywhere, but direct-call modules don't cache yet (dsprrr-aa2). - Move the BootstrapFewShot (s3b) and evaluate (tn1) NEWS bullets out of wave 1 -- their code lands in wave 2 -- so each stacked PR's changelog matches its own diff. - Comment the batch-print test to explain why cli_fmt() is the correct capture (cli writes to stdout; capture.output(type="message") would miss it). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the reviews. Addressed in the latest commit (71116f9): Fixed
NEWS accuracy (Codex NEWS:30, Copilot NEWS:10/:30) — the
|
First wave of fixes from a full codebase audit. These are the highest-value, lowest-risk findings: they unblock a CRAN release and stop four silent failures, with no changes to optimizer behavior (that's coming in a separate, more carefully validated wave).
Release readiness (
dsprrr-w0e)ellmerrequirement to the released>= 0.4.1and drop the CRAN-forbiddenRemotes: tidyverse/ellmerentry. The package could not be submitted to CRAN with either in place; ellmer 0.4.1 exports every symbol used.Suggests:futureandlifecycle(zero references anywhere).NEWS.md.Correctness
print.dsprrr_batch_resultalways reported success on failed batches (dsprrr-8l0). The error predicate readitem$error/inherits(item$output, "error"), butcreate_error_result()stores the message inmetadata$errorwithoutput = NA, son_errorswas always 0.dsprrr-47p).validate_type_string()was fully written (with a closest-match suggestion) but never called. Nowsignature("q -> a: interger")errors and suggestsinteger.module_trials()crashed with a cryptic base-R error when all trials failed (dsprrr-hew).which.max()on all-NAscores returnsinteger(0), so indexing threw "attempt to select less than one element". Now it warns and returns an inspectable summary..cachewas silently dropped for every non-PredictModulemodule and pipelines (dsprrr-jup). Onlyrun.PredictModuledeclared.cache; elsewhere it fell into..., was flagged as an unknown signature input (spurious warning), and ignored. Nowrun.Moduleand the base$run()accept, validate, and forward it.Test isolation (
dsprrr-63v)tests/testthat/setup.Rand aDSPRRR_CACHE_PATHenv var so the disk cache is redirected to a temp dir during the suite. Previously tests wrote.dsprrr_cacheinto the package source tree (which CRAN flags) and leaked state across runs.Testing
print-output-capture tests are unrelated — they capture the message stream while cli writes to stdout in this local config; verified identical on a clean tree.)air format --check .passes.Follow-up
Filed as beads issues from the same audit, in priority order: the optimizer-correctness cluster (
dsprrr-s3bBootstrapFewShot zero-demos,dsprrr-tn1failure-blind mean,dsprrr-pcdper-attempt diversity), silent-failure handling inevaluate()/batch (dsprrr-hqp,dsprrr-lr7), and ~30 more across API consistency, tests, parser, metrics, and packaging.🤖 Generated with Claude Code