Skip to content

fix: release-unblock + silent-failure correctness fixes (audit wave 1)#114

Merged
JamesHWade merged 2 commits into
mainfrom
feature/audit-fixes-wave1
Jun 15, 2026
Merged

fix: release-unblock + silent-failure correctness fixes (audit wave 1)#114
JamesHWade merged 2 commits into
mainfrom
feature/audit-fixes-wave1

Conversation

@JamesHWade

Copy link
Copy Markdown
Owner

First wave of fixes from a full codebase audit. These are the highest-value, lowest-risk findings: they unblock a CRAN release and stop four silent failures, with no changes to optimizer behavior (that's coming in a separate, more carefully validated wave).

Release readiness (dsprrr-w0e)

  • Lower the ellmer requirement to the released >= 0.4.1 and drop the CRAN-forbidden Remotes: tidyverse/ellmer entry. The package could not be submitted to CRAN with either in place; ellmer 0.4.1 exports every symbol used.
  • Remove unused Suggests: future and lifecycle (zero references anywhere).
  • Add NEWS.md.

Correctness

  • print.dsprrr_batch_result always reported success on failed batches (dsprrr-8l0). The error predicate read item$error / inherits(item$output, "error"), but create_error_result() stores the message in metadata$error with output = NA, so n_errors was always 0.
  • Unknown/misspelled signature types silently became string fields (dsprrr-47p). validate_type_string() was fully written (with a closest-match suggestion) but never called. Now signature("q -> a: interger") errors and suggests integer.
  • module_trials() crashed with a cryptic base-R error when all trials failed (dsprrr-hew). which.max() on all-NA scores returns integer(0), so indexing threw "attempt to select less than one element". Now it warns and returns an inspectable summary.
  • .cache was silently dropped for every non-PredictModule module and pipelines (dsprrr-jup). Only run.PredictModule declared .cache; elsewhere it fell into ..., was flagged as an unknown signature input (spurious warning), and ignored. Now run.Module and the base $run() accept, validate, and forward it.

Test isolation (dsprrr-63v)

  • Add tests/testthat/setup.R and a DSPRRR_CACHE_PATH env var so the disk cache is redirected to a temp dir during the suite. Previously tests wrote .dsprrr_cache into the package source tree (which CRAN flags) and leaked state across runs.

Testing

  • Full suite: 3008 passing, no new failures. (9 pre-existing failures in print-output-capture tests are unrelated — they capture the message stream while cli writes to stdout in this local config; verified identical on a clean tree.)
  • New regression tests for each fix.
  • air format --check . passes.

Follow-up

Filed as beads issues from the same audit, in priority order: the optimizer-correctness cluster (dsprrr-s3b BootstrapFewShot zero-demos, dsprrr-tn1 failure-blind mean, dsprrr-pcd per-attempt diversity), silent-failure handling in evaluate()/batch (dsprrr-hqp, dsprrr-lr7), and ~30 more across API consistency, tests, parser, metrics, and packaging.

🤖 Generated with Claude Code

Highest-value, lowest-risk findings from the codebase audit.

Release readiness (dsprrr-w0e):
- Lower the ellmer requirement to the released >= 0.4.1 and drop the
  CRAN-forbidden "Remotes: tidyverse/ellmer" entry.
- Remove unused Suggests: future and lifecycle.
- Add NEWS.md.

Correctness:
- print.dsprrr_batch_result now counts failed items. It read the error from
  item$error / the output object, but create_error_result() stores it in
  metadata$error with output = NA, so failed batches always printed
  "All items completed successfully" (dsprrr-8l0).
- Unknown or misspelled signature types now error with a closest-match
  suggestion instead of silently becoming string fields. validate_type_string()
  was fully written but never called (dsprrr-47p).
- module_trials() warns and returns an inspectable summary when every trial
  fails, instead of a cryptic "attempt to select less than one element"
  (dsprrr-hew).
- .cache is accepted and validated by all module types and pipelines, not
  only PredictModule, where it previously fell into ... and was dropped with
  a spurious "unknown input" warning (dsprrr-jup).

Test isolation:
- Add tests/testthat/setup.R plus a DSPRRR_CACHE_PATH env var so the suite
  no longer writes the disk cache into the package source tree (dsprrr-63v).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 14, 2026 22:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 95e2756272

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment thread R/run.R
.progress = .progress,
.return_format = .return_format
.return_format = .return_format,
.cache = .cache

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor .cache before accepting it for all modules

Capturing .cache here removes it from ... for every non-Predict module, but several of those modules still never consume it: for example, RAGModule$forward() calls its private helper without .cache, and that helper invokes llm$chat_structured() directly instead of cached_chat_structured(). In those contexts run(rag_mod, ..., .cache = TRUE/FALSE) now validates and appears accepted, but the setting is silently ignored, so the advertised all-module/pipeline cache control still fails.

Useful? React with 👍 / 👎.

Comment thread NEWS.md Outdated
Comment on lines +28 to +30
* Evaluation now treats a failed prediction as a score of 0 rather than
dropping it from the mean, so optimizers no longer prefer configurations
that fail on most of the data (#dsprrr-tn1).

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Remove the unfixed evaluation claim

This changelog entry is not true for this commit: evaluate.Module() still stores failed metric/prediction cases as NA_real_ and computes mean_score <- mean(scores, na.rm = TRUE), so a configuration that succeeds on only one row can still get a perfect mean and be selected by optimize_grid(). Either implement the promised zero scoring in evaluate() or drop this release note so users are not told the optimizer corruption case is fixed.

Useful? React with 👍 / 👎.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements a first wave of release-readiness and correctness fixes: it updates package metadata for CRAN submission, adds a development changelog, and addresses several previously silent failure modes in signature parsing, batch-result printing, optimization trial summarization, and cache parameter forwarding/test isolation.

Changes:

  • Update CRAN-facing package metadata (dependency constraints, Removes Remotes, prune unused Suggests) and add NEWS.md.
  • Wire in strict type validation for signature parsing (with closest-match suggestions) and improve batch print error counting + module_trials() behavior when all scores are NA.
  • Make .cache a first-class run() argument across module entry points and isolate disk cache usage during tests via DSPRRR_CACHE_PATH.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
tests/testthat/test-signature-parser.R Adds regression tests ensuring unknown/misspelled types error (and valid types still parse).
tests/testthat/test-run.R Adds regression tests for batch-result printing error counts.
tests/testthat/test-optimize.R Adds regression test for module_trials() when all trials fail (all-NA scores).
tests/testthat/test-cache.R Adds regression test that .cache is accepted/validated for non-PredictModule modules.
tests/testthat/setup.R Introduces suite-wide cache isolation via env var + cleanup.
R/signature-parser.R Calls validate_type_string() during type parsing and adds closest-match suggestions.
R/run.R Adds shared .cache validation helper; plumbs .cache through run.Module().
R/optimize.R Makes module_trials() warn + return an inspectable summary when all scores are NA.
R/module-base.R Extends base $run() to accept/forward .cache.
R/cache.R Adds DSPRRR_CACHE_PATH-aware default disk cache path and uses it in cache config defaults.
NEWS.md Adds a development changelog for the release/unblock + fixes.
man/configure_cache.Rd Updates the documented default for disk_path.
DESCRIPTION Lowers ellmer minimum version, removes CRAN-forbidden Remotes, removes unused Suggests.
Files not reviewed (1)
  • man/configure_cache.Rd: Generated file

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread tests/testthat/test-run.R
class = c("dsprrr_batch_result", "list")
)

out <- cli::cli_fmt(print(result))
Comment thread tests/testthat/test-run.R
),
class = c("dsprrr_batch_result", "list")
)
out <- cli::cli_fmt(print(result))
Comment thread R/module-base.R
Comment on lines +87 to 89
.cache = NULL
) {
inputs <- list(...)
Comment thread R/cache.R
Comment on lines +21 to +23
default_disk_cache_path <- function() {
Sys.getenv("DSPRRR_CACHE_PATH", unset = ".dsprrr_cache")
}
Comment thread tests/testthat/setup.R
Comment on lines +12 to +32
dsprrr_test_cache_dir <- file.path(tempdir(), "dsprrr-test-cache")
dir.create(dsprrr_test_cache_dir, showWarnings = FALSE, recursive = TRUE)

withr::local_envvar(
DSPRRR_CACHE_PATH = dsprrr_test_cache_dir,
.local_envir = testthat::teardown_env()
)

dsprrr::configure_cache(
enable_disk = FALSE,
disk_path = dsprrr_test_cache_dir
)

withr::defer(
{
dsprrr::clear_cache()
unlink(dsprrr_test_cache_dir, recursive = TRUE)
dsprrr::configure_cache()
},
envir = testthat::teardown_env()
)
Comment thread NEWS.md Outdated
Comment on lines +7 to +10
* `BootstrapFewShot` now harvests demonstrations when the metric targets a
specific output field (e.g. `metric_exact_match(field = "answer")`).
Previously the single-module path passed a bare value to the metric, which
errored internally and silently bootstrapped zero demos (#dsprrr-s3b).
Comment thread NEWS.md Outdated
Comment on lines +28 to +30
* Evaluation now treats a failed prediction as a score of 0 rather than
dropping it from the mean, so optimizers no longer prefer configurations
that fail on most of the data (#dsprrr-tn1).
Codex/Copilot review of PR #114:
- Module$run() now validates .cache via validate_cache_arg(), so callers
  reaching the R6 method directly get the same loud error as the run()
  generic instead of a silently-forwarded malformed value.
- default_disk_cache_path() treats an empty DSPRRR_CACHE_PATH as unset
  (nzchar guard) so the disk cache never resolves to "".
- Test teardown restores the cache default with an explicit disk_path,
  independent of LIFO env-var cleanup ordering.
- Reword the .cache NEWS bullet to scope it accurately: it is validated
  and plumbed everywhere, but direct-call modules don't cache yet (dsprrr-aa2).
- Move the BootstrapFewShot (s3b) and evaluate (tn1) NEWS bullets out of
  wave 1 -- their code lands in wave 2 -- so each stacked PR's changelog
  matches its own diff.
- Comment the batch-print test to explain why cli_fmt() is the correct
  capture (cli writes to stdout; capture.output(type="message") would miss it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@JamesHWade

Copy link
Copy Markdown
Owner Author

Thanks for the reviews. Addressed in the latest commit (71116f9):

Fixed

  • Module$run() didn't validate .cache (Copilot, module-base.R:89) — now calls validate_cache_arg(), so reaching the R6 method directly fails as loudly as the run() generic.
  • Empty DSPRRR_CACHE_PATH (Copilot, cache.R:23) — default_disk_cache_path() now treats "" as unset via nzchar().
  • Teardown LIFO ordering (Copilot, setup.R:32) — teardown restores defaults with an explicit disk_path = ".dsprrr_cache", independent of when local_envvar() cleanup runs.

NEWS accuracy (Codex NEWS:30, Copilot NEWS:10/:30) — the s3b (BootstrapFewShot) and tn1 (evaluate) bullets described code that lands in wave 2, so they've moved to PR #115's changelog. Each stacked PR's NEWS now matches its own diff.

.cache not honored by all modules (Codex run.R:125) — correct, and a good catch. RAGModule/ReActModule/RLMModule/ProgramOfThought/MultiChain/CodeAct call llm$chat_structured() directly and never route through the cache, so .cache is accepted but inert for them. I reworded the NEWS bullet to scope this accurately rather than claim universal support, and filed dsprrr-aa2 to wire caching through those modules (it's a behavior change with recursion/diversity hazards — e.g. MultiChain's N chains would collapse to one cached response without per-chain rollout_id — so it belongs in its own PR with a test audit).

cli_fmt capture (Copilot test-run.R:1085/:1099) — keeping cli::cli_fmt() here. print.dsprrr_batch_result() emits only via cli, and cli_fmt() captures that reliably. capture.output(..., type = "message") is actually the fragile pattern: cli writes to stdout, so that idiom captures nothing (it's the cause of a separate set of pre-existing local-only failures, tracked in dsprrr-6nf). Added a comment so nobody "fixes" it back.

@JamesHWade JamesHWade merged commit c0d2823 into main Jun 15, 2026
9 checks passed
@JamesHWade JamesHWade deleted the feature/audit-fixes-wave1 branch June 15, 2026 22:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants