Add persisted follow-up questions and skillgym explain for failed runs #27

@V3RON

Description

When a benchmark run fails today, skillgym can tell the user what assertion failed, but it cannot help them ask the original runner why it chose one behavior over another. That makes skillgym strong as a tester, but weak as a debugging and tuning loop for skill authors.

This is especially painful for users who are iterating on SKILL.md files or benchmarking agent behavior across runners. A failing assertion like 'expected one matching command, got four' or 'expected this skill/file read, but it never happened' often points to a strategic decision made by the agent, not just a bad final answer. If skillgym can persist follow-up questions at failure time and later resume the original runner session to ask them, users get a much tighter loop for understanding and improving their skills.

Observed Findings

  • SessionReport already persists runner session identifiers when available via report.sessionId, plus normalized events and raw artifact paths. See src/domain/session-report.ts.
  • RunnerResult already preserves artifactDir, report, and serialized errors, and executeRunner writes report.json for both passing and failing runs. See src/domain/result.ts and src/runner/execute-runner.ts.
  • The reporter stack already knows how to resolve a user-facing assertion source location from an error stack via extractUserStackFrame, returning filePath, line, and column. See src/reporters/stack-frame.ts.
  • The current custom assertion surface is rich enough to support explainable follow-up prompts for many cases: skills.*, commands.*, fileReads.*, toolCalls.*, and output.*. See src/assertions/types.ts and the assertion implementations under src/assertions/.
  • Hard assertions stop at the first failure. Soft assertions collect multiple AssertionErrors and then merge them into one aggregate error at the end of the test case or before a later hard assertion failure. Today the soft collector keeps only AssertionError[], which is sufficient for CLI output but lossy for persisted follow-up questions. See src/assertions/soft.ts.
  • The current adapters already preserve enough runner-specific context to make resumed explanation plausible, but support is uneven:
    • OpenCode stores sessionId and uses an isolated XDG runtime under the artifact directory. The CLI supports opencode run --session <id> --format json --thinking <message>.
    • Codex runs non-interactively with codex exec --json ..., and the CLI supports codex exec resume <sessionId> <prompt> --json. The current adapter does not appear to persist a session id for resumed use.
    • Claude Code supports claude -p -r <sessionId> --output-format stream-json <prompt>, but the current adapter launches with --no-session-persistence, which makes later resume impossible.
    • Cursor Agent exposes --resume [chatId] and --continue, and the current adapter already extracts a sessionId from records.
  • We explicitly want to start with a deferred command (skillgym explain) rather than inline --explain-on-failure execution.
  • We only want to write explain.json when there is at least one follow-up question.

Suggested Behavior

When an assertion failure occurs, skillgym should optionally derive one or more follow-up questions from that failure and persist them into the failing run's artifact directory as explain.json.

Question generation rules:

  • Hard assertion failure: produce at most one question, because execution stops at the first hard failure.
  • Soft assertion failures: persist every explainable question collected before the test case ends or before a later hard assertion failure interrupts execution.
  • Do not impose a cap on the number of collected questions.
  • If an assertion is unsupported for automatic question generation and does not define a custom question, skip it silently.

Question source metadata:

  • Every persisted question should include the source location of the originating assertion in the user test file: filePath, line, column.
  • This should reuse the same stack-frame extraction behavior already used by reporters.
  • We do not need to persist a status field.
  • We do not need to persist assertion family or method metadata if the question text and source location are already stored.

Question persistence format:

  • Write artifactDir/explain.json only when at least one question exists.
  • Keep the file minimal and stable enough to support deferred explanation later.
  • A reasonable shape is:
{
  "suitePath": "/abs/path/to/suite.ts",
  "caseId": "skill-selection",
  "runnerId": "open-main",
  "sessionId": "ses_123",
  "questions": [
    {
      "question": "You were expected to read SKILL.md before acting. Why did you proceed without it?",
      "source": {
        "filePath": "/abs/path/to/suite.ts",
        "line": 14,
        "column": 15
      }
    }
  ]
}
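For reference, the proposed shape can be mirrored as a TypeScript type (a sketch; the type and field names beyond the JSON above are hypothetical, and line/column are numeric to match what stack-frame extraction returns):

```typescript
// Hypothetical types mirroring the proposed explain.json shape.
interface QuestionSource {
  filePath: string;
  line: number;
  column: number;
}

interface PersistedQuestion {
  question: string;
  source: QuestionSource;
}

interface ExplainFile {
  suitePath: string;
  caseId: string;
  runnerId: string;
  sessionId?: string; // may be absent when the runner never reported one
  questions: PersistedQuestion[];
}

// Example instance matching the JSON above.
const example: ExplainFile = {
  suitePath: "/abs/path/to/suite.ts",
  caseId: "skill-selection",
  runnerId: "open-main",
  sessionId: "ses_123",
  questions: [
    {
      question:
        "You were expected to read SKILL.md before acting. Why did you proceed without it?",
      source: { filePath: "/abs/path/to/suite.ts", line: 14, column: 15 },
    },
  ],
};
```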

Assertion API additions:

  • Extend assertion options so test authors can define their own follow-up question instead of relying entirely on skillgym to generate one.
  • This should apply to both AssertionOptions and SkillAssertionOptions.
  • The first version can support:
interface ExplainOptions {
  question:
    | string
    | ((ctx: {
        report: SessionReport;
        expected?: unknown;
        actual?: unknown;
        observed?: unknown;
      }) => string | undefined);
}
  • If explain.question is defined on a failing assertion, use that question.
  • If no custom question is defined, use a built-in question generator for supported assertion kinds.
  • If neither path applies, do not create a question for that failure.
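The three-way resolution above can be sketched as follows. This is illustrative, not the actual skillgym API: resolveQuestion and BuiltinGenerator are hypothetical names, SessionReport is stubbed, and ExplainOptions is restated so the sketch is self-contained.

```typescript
// Stub standing in for the real SessionReport type.
type SessionReport = Record<string, unknown>;

interface ExplainContext {
  report: SessionReport;
  expected?: unknown;
  actual?: unknown;
  observed?: unknown;
}

interface ExplainOptions {
  question: string | ((ctx: ExplainContext) => string | undefined);
}

type BuiltinGenerator = (ctx: ExplainContext) => string | undefined;

function resolveQuestion(
  kind: string,
  explain: ExplainOptions | undefined,
  builtins: Map<string, BuiltinGenerator>,
  ctx: ExplainContext,
): string | undefined {
  // 1. A custom question defined on the failing assertion wins.
  if (explain) {
    const q =
      typeof explain.question === "function" ? explain.question(ctx) : explain.question;
    if (q !== undefined) return q;
  }
  // 2. Otherwise, fall back to a built-in generator for supported kinds.
  const generator = builtins.get(kind);
  if (generator) return generator(ctx);
  // 3. Neither path applies: no question for this failure.
  return undefined;
}
```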

Assertion families that are good candidates for built-in question generation in v1:

  • skills.*
  • commands.*
  • fileReads.*
  • toolCalls.*

output.* can be supported later or treated as lower priority because output mismatches are often less informative than behavioral mismatches.
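To make the built-in path concrete, here is a hypothetical generator for one skills.*-style failure. The function name and context shape are assumptions; the point is that expected vs. observed behavior carries enough signal to phrase a useful question.

```typescript
// Hypothetical built-in question generator for a "skill file was not read" failure.
function skillNotReadQuestion(ctx: {
  expected?: unknown;
  observed?: unknown;
}): string | undefined {
  // Only generate a question when we know which file was expected.
  if (typeof ctx.expected !== "string") return undefined;
  const seen =
    Array.isArray(ctx.observed) && ctx.observed.length > 0
      ? `you read ${ctx.observed.join(", ")} instead`
      : "you never read it";
  return `You were expected to read ${ctx.expected} before acting, but ${seen}. Why did you proceed without it?`;
}
```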

CLI behavior:

  • Add a new deferred command:
skillgym explain <artifactDir>
  • The command should:

    • load report.json
    • load explain.json
    • verify that the underlying runner execution is resumable
    • resume the original runner session
    • ask every persisted question in order
    • persist the answers separately, for example in artifactDir/explanations.json
  • skillgym explain should fail clearly when:

    • explain.json is missing
    • sessionId is missing
    • the runner does not support resume
    • the run was configured in a way that makes resume impossible, such as Claude runs launched with --no-session-persistence
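The up-front checks could be sketched like this. A minimal sketch, assuming the proposed file layout: validateExplainable is a hypothetical name, and the error messages are illustrative (the real runner-configuration check, e.g. for --no-session-persistence, would live in the adapter layer).

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch of the validation `skillgym explain` could run before resuming.
function validateExplainable(
  artifactDir: string,
  resumableRunners: Set<string>,
): { explain: { runnerId: string; sessionId?: string; questions: unknown[] } } {
  const explainPath = path.join(artifactDir, "explain.json");
  if (!fs.existsSync(explainPath)) {
    throw new Error(`No explain.json in ${artifactDir}: this run produced no follow-up questions.`);
  }
  const explain = JSON.parse(fs.readFileSync(explainPath, "utf8"));
  if (!explain.sessionId) {
    throw new Error("explain.json has no sessionId: the original session cannot be resumed.");
  }
  if (!resumableRunners.has(explain.runnerId)) {
    throw new Error(`Runner ${explain.runnerId} does not support resume.`);
  }
  return { explain };
}
```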

Runner-specific expectations for deferred explanation:

  • OpenCode: likely reuse the preserved runtime state plus session id, not only the exported transcript.
  • Codex: persist whatever session identifier is required for codex exec resume and add a resume path in the adapter layer.
  • Claude Code: remove or gate --no-session-persistence for runs that are intended to be explainable later.
  • Cursor Agent: reuse the extracted chat/session id with a headless resume path.

Plan

  1. Add explain metadata to assertion option types.

    • Extend AssertionOptions and SkillAssertionOptions in src/assertions/types.ts with an optional explain field.
    • Document that it is optional and only used when the assertion fails.
  2. Introduce a small internal question model.

    • Add an internal type for a follow-up question candidate with:
      • rendered question
      • source.filePath
      • source.line
      • source.column
    • Keep this internal unless there is a strong reason to export it.
  3. Capture follow-up question metadata at assertion-failure time.

    • Add helpers that can build a question candidate from:
      • a custom explain.question
      • or a built-in generator for supported assertion kinds
    • Reuse extractUserStackFrame logic for source locations.
    • Make sure the question is rendered at failure time, not reconstructed later from source code.
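Steps 2 and 3 together could look like the sketch below. buildQuestionCandidate and firstUserFrame are hypothetical names; firstUserFrame is a deliberately simplified stand-in for the existing extractUserStackFrame, which would be reused in the real implementation.

```typescript
// Internal follow-up question candidate (step 2), built at failure time (step 3).
interface QuestionCandidate {
  question: string; // rendered at failure time, never reconstructed later
  source: { filePath: string; line: number; column: number };
}

// Simplified stand-in for extractUserStackFrame (src/reporters/stack-frame.ts):
// pull the first absolute-path frame out of an Error stack.
function firstUserFrame(stack: string): QuestionCandidate["source"] | undefined {
  const m = /\((\/[^)]+):(\d+):(\d+)\)/.exec(stack);
  return m ? { filePath: m[1], line: Number(m[2]), column: Number(m[3]) } : undefined;
}

function buildQuestionCandidate(
  error: Error,
  question: string | undefined,
): QuestionCandidate | undefined {
  // Unsupported assertion without a custom question: skip silently.
  if (question === undefined) return undefined;
  const source = error.stack ? firstUserFrame(error.stack) : undefined;
  if (!source) return undefined;
  return { question, source };
}
```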
  4. Preserve structured failures for soft assertions.

    • Replace the current AssertionError[]-only soft collector with a richer structure that can keep both the original error and its optional question candidate.
    • Keep the aggregate AssertionError behavior for reporters and failure output unchanged.
    • Ensure hard-failure-after-soft-failures still preserves all previously collected questions plus the final hard failure question when available.
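A richer soft collector along these lines would satisfy step 4 (names are illustrative; plain Error stands in for AssertionError):

```typescript
interface QuestionCandidate {
  question: string;
  source: { filePath: string; line: number; column: number };
}

// Each soft failure keeps the original error plus its optional question.
interface SoftFailure {
  error: Error; // the original AssertionError
  candidate?: QuestionCandidate; // present only for explainable failures
}

class SoftCollector {
  private failures: SoftFailure[] = [];

  record(error: Error, candidate?: QuestionCandidate): void {
    this.failures.push({ error, candidate });
  }

  // Reporter-facing view: the aggregate-error behavior stays unchanged.
  errors(): Error[] {
    return this.failures.map((f) => f.error);
  }

  // Persistence-facing view: all collected questions, in order.
  questions(): QuestionCandidate[] {
    return this.failures.flatMap((f) => (f.candidate ? [f.candidate] : []));
  }
}
```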
  5. Persist explain.json for failed runs.

    • During failure result writing in the runner pipeline, write artifactDir/explain.json only if at least one question candidate was collected for that run.
    • Include suitePath, caseId, runnerId, sessionId, and questions.
    • Do not write the file for passing runs or failures with zero questions.
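The persistence rule in step 5 reduces to a small guard. A sketch, assuming the file shape proposed earlier; writeExplainFile is a hypothetical name:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface QuestionCandidate {
  question: string;
  source: { filePath: string; line: number; column: number };
}

// Write artifactDir/explain.json only when at least one question was collected.
// Returns true when the file was written.
function writeExplainFile(
  artifactDir: string,
  meta: { suitePath: string; caseId: string; runnerId: string; sessionId?: string },
  questions: QuestionCandidate[],
): boolean {
  if (questions.length === 0) return false; // passing runs and question-less failures write nothing
  const file = path.join(artifactDir, "explain.json");
  fs.writeFileSync(file, JSON.stringify({ ...meta, questions }, null, 2));
  return true;
}
```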
  6. Add test coverage for question persistence.

    • Hard assertion with custom question -> one question in explain.json.
    • Soft assertions with multiple explainable failures -> multiple questions persisted.
    • Unsupported assertion without custom question -> no question emitted.
    • Mixed soft + hard failure path -> all collected questions persisted.
    • Source location points at the user test file assertion line.
    • explain.json omitted when no questions exist.
  7. Add an explanation/resume abstraction for runners.

    • Extend the adapter layer with a runner-specific resume/explain capability rather than trying to force everything through run().
    • This can be an explicit explain(...)/resume(...) adapter method or a dedicated capability object, but it should clearly separate initial execution from deferred explanation.
    • Surface clear unsupported errors for runners or executions that cannot be resumed.
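One possible shape for that capability, as a sketch (interface and method names are assumptions, not the existing adapter API):

```typescript
// Optional runner capability that separates initial execution from
// deferred explanation.
interface ExplainCapable {
  // Resume the original session and ask one question; resolves with the answer.
  resumeAndAsk(sessionId: string, question: string): Promise<string>;
}

interface RunnerAdapter {
  id: string;
  explain?: ExplainCapable; // absent when the runner cannot be resumed
}

// Surfaces a clear error for runners that cannot be resumed.
function requireExplainCapability(adapter: RunnerAdapter): ExplainCapable {
  if (!adapter.explain) {
    throw new Error(`Runner ${adapter.id} does not support deferred explanation.`);
  }
  return adapter.explain;
}
```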
  8. Implement skillgym explain <artifactDir>.

    • Resolve and validate the artifact directory.
    • Load report.json and explain.json.
    • Determine the runner and resumability prerequisites.
    • Resume the original session and ask all questions in order.
    • Persist answers in artifactDir/explanations.json.
    • Print useful CLI output that shows which question is being asked and where it came from.
  9. Add runner-specific explain support incrementally.

    • Start with the runner that has the cleanest non-interactive resume path.
    • OpenCode and Codex look like the best initial targets.
    • Claude Code likely needs a config/adapter change first because current runs disable persistence.
    • Cursor support should be added only after validating its resumed headless behavior carefully.
  10. Document the new workflow.

    • Add docs for custom assertion follow-up questions.
    • Add docs for skillgym explain <artifactDir>.
    • Explain that explain.json is written only when at least one question exists.
    • Call out that deferred explanation depends on runner-specific session persistence and is not guaranteed for every historical run.

Resolution Summary

SkillGym should be able to persist assertion-derived follow-up questions into explain.json for failed runs and later resume the original runner session via skillgym explain <artifactDir> to ask those questions. The implementation should support custom per-assertion questions, preserve source locations for each question, skip unsupported assertions that do not define a custom question, and only emit explain.json when at least one question exists.
