Add persisted follow-up questions and skillgym explain for failed runs #27

@V3RON

Description

When a benchmark run fails today, skillgym can tell the user what assertion failed, but it cannot help them ask the original runner why it chose one behavior over another. That makes skillgym strong as a tester, but weak as a debugging and tuning loop for skill authors.

This is especially painful for users who are iterating on SKILL.md files or benchmarking agent behavior across runners. A failing assertion like 'expected one matching command, got four' or 'expected this skill/file read, but it never happened' often points to a strategic decision made by the agent, not just a bad final answer. If skillgym can persist follow-up questions at failure time and later resume the original runner session to ask them, users get a much tighter loop for understanding and improving their skills.

Observed Findings

  • SessionReport already persists runner session identifiers when available via report.sessionId, plus normalized events and raw artifact paths. See src/domain/session-report.ts.
  • RunnerResult already preserves artifactDir, report, and serialized errors, and executeRunner writes report.json for both passing and failing runs. See src/domain/result.ts and src/runner/execute-runner.ts.
  • The reporter stack already knows how to resolve a user-facing assertion source location from an error stack via extractUserStackFrame, returning filePath, line, and column. See src/reporters/stack-frame.ts.
  • The current custom assertion surface is rich enough to support explainable follow-up prompts for many cases: skills.*, commands.*, fileReads.*, toolCalls.*, and output.*. See src/assertions/types.ts and the assertion implementations under src/assertions/.
  • Hard assertions stop at the first failure. Soft assertions collect multiple AssertionErrors and then merge them into one aggregate error at the end of the test case or before a later hard assertion failure. Today the soft collector keeps only AssertionError[], which is sufficient for CLI output but lossy for persisted follow-up questions. See src/assertions/soft.ts.
  • The current adapters already preserve enough runner-specific context to make resumed explanation plausible, but support is uneven:
    • OpenCode stores sessionId and uses an isolated XDG runtime under the artifact directory. The CLI supports opencode run --session <id> --format json --thinking <message>.
    • Codex runs non-interactively with codex exec --json ..., and the CLI supports codex exec resume <sessionId> <prompt> --json. The current adapter does not appear to persist a session id for resumed use.
    • Claude Code supports claude -p -r <sessionId> --output-format stream-json <prompt>, but the current adapter launches with --no-session-persistence, which makes later resume impossible.
    • Cursor Agent exposes --resume [chatId] and --continue, and the current adapter already extracts a sessionId from records.
  • We explicitly want to start with a deferred command (skillgym explain) rather than inline --explain-on-failure execution.
  • We only want to write explain.json when there is at least one follow-up question.

Suggested Behavior

When an assertion failure occurs, skillgym should optionally derive one or more follow-up questions from that failure and persist them into the failing run's artifact directory as explain.json.

Question generation rules:

  • Hard assertion failure: produce at most one question, because execution stops at the first hard failure.
  • Soft assertion failures: persist every explainable question collected before the test case ends or before a later hard assertion failure interrupts execution.
  • Do not impose a cap on the number of collected questions.
  • If an assertion is unsupported for automatic question generation and does not define a custom question, skip it silently.

Question source metadata:

  • Every persisted question should include the source location of the originating assertion in the user test file: filePath, line, column.
  • This should reuse the same stack-frame extraction behavior already used by reporters.
  • We do not need to persist a status field.
  • We do not need to persist assertion family or method metadata if the question text and source location are already stored.

Question persistence format:

  • Write artifactDir/explain.json only when at least one question exists.
  • Keep the file minimal and stable enough to support deferred explanation later.
  • A reasonable shape is:
{
  "suitePath": "/abs/path/to/suite.ts",
  "caseId": "skill-selection",
  "runnerId": "open-main",
  "sessionId": "ses_123",
  "questions": [
    {
      "question": "You were expected to read SKILL.md before acting. Why did you proceed without it?",
      "source": {
        "filePath": "/abs/path/to/suite.ts",
        "line": 14,
        "column": 15
      }
    }
  ]
}
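For reference, the proposed shape can be mirrored as a TypeScript type (a sketch; the type and field names beyond the JSON above are hypothetical, and line/column are numeric to match what stack-frame extraction returns):

```typescript
// Hypothetical types mirroring the proposed explain.json shape.
interface QuestionSource {
  filePath: string;
  line: number;
  column: number;
}

interface PersistedQuestion {
  question: string;
  source: QuestionSource;
}

interface ExplainFile {
  suitePath: string;
  caseId: string;
  runnerId: string;
  sessionId?: string; // may be absent when the runner never reported one
  questions: PersistedQuestion[];
}

// Example instance matching the JSON above.
const example: ExplainFile = {
  suitePath: "/abs/path/to/suite.ts",
  caseId: "skill-selection",
  runnerId: "open-main",
  sessionId: "ses_123",
  questions: [
    {
      question:
        "You were expected to read SKILL.md before acting. Why did you proceed without it?",
      source: { filePath: "/abs/path/to/suite.ts", line: 14, column: 15 },
    },
  ],
};
```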

Assertion API additions:

  • Extend assertion options so test authors can define their own follow-up question instead of relying entirely on skillgym to generate one.
  • This should apply to both AssertionOptions and SkillAssertionOptions.
  • The first version can support:
interface ExplainOptions {
  question:
    | string
    | ((ctx: {
        report: SessionReport;
        expected?: unknown;
        actual?: unknown;
        observed?: unknown;
      }) => string | undefined);
}
  • If explain.question is defined on a failing assertion, use that question.
  • If no custom question is defined, use a built-in question generator for supported assertion kinds.
  • If neither path applies, do not create a question for that failure.
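The three-way resolution above can be sketched as follows. This is illustrative, not the actual skillgym API: resolveQuestion and BuiltinGenerator are hypothetical names, SessionReport is stubbed, and ExplainOptions is restated so the sketch is self-contained.

```typescript
// Stub standing in for the real SessionReport type.
type SessionReport = Record<string, unknown>;

interface ExplainContext {
  report: SessionReport;
  expected?: unknown;
  actual?: unknown;
  observed?: unknown;
}

interface ExplainOptions {
  question: string | ((ctx: ExplainContext) => string | undefined);
}

type BuiltinGenerator = (ctx: ExplainContext) => string | undefined;

function resolveQuestion(
  kind: string,
  explain: ExplainOptions | undefined,
  builtins: Map<string, BuiltinGenerator>,
  ctx: ExplainContext,
): string | undefined {
  // 1. A custom question defined on the failing assertion wins.
  if (explain) {
    const q =
      typeof explain.question === "function" ? explain.question(ctx) : explain.question;
    if (q !== undefined) return q;
  }
  // 2. Otherwise, fall back to a built-in generator for supported kinds.
  const generator = builtins.get(kind);
  if (generator) return generator(ctx);
  // 3. Neither path applies: no question for this failure.
  return undefined;
}
```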

Assertion families that are good candidates for built-in question generation in v1:

  • skills.*
  • commands.*
  • fileReads.*
  • toolCalls.*

output.* can be supported later or treated as lower priority because output mismatches are often less informative than behavioral mismatches.
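To make the built-in path concrete, here is a hypothetical generator for one skills.*-style failure. The function name and context shape are assumptions; the point is that expected vs. observed behavior carries enough signal to phrase a useful question.

```typescript
// Hypothetical built-in question generator for a "skill file was not read" failure.
function skillNotReadQuestion(ctx: {
  expected?: unknown;
  observed?: unknown;
}): string | undefined {
  // Only generate a question when we know which file was expected.
  if (typeof ctx.expected !== "string") return undefined;
  const seen =
    Array.isArray(ctx.observed) && ctx.observed.length > 0
      ? `you read ${ctx.observed.join(", ")} instead`
      : "you never read it";
  return `You were expected to read ${ctx.expected} before acting, but ${seen}. Why did you proceed without it?`;
}
```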

CLI behavior:

  • Add a new deferred command:
skillgym explain <artifactDir>
  • The command should:

    • load report.json
    • load explain.json
    • verify that the underlying runner execution is resumable
    • resume the original runner session
    • ask every persisted question in order
    • persist the answers separately, for example in artifactDir/explanations.json
  • skillgym explain should fail clearly when:

    • explain.json is missing
    • sessionId is missing
    • the runner does not support resume
    • the run was configured in a way that makes resume impossible, such as Claude runs launched with --no-session-persistence
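The up-front checks could be sketched like this. A minimal sketch, assuming the proposed file layout: validateExplainable is a hypothetical name, and the error messages are illustrative (the real runner-configuration check, e.g. for --no-session-persistence, would live in the adapter layer).

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Sketch of the validation `skillgym explain` could run before resuming.
function validateExplainable(
  artifactDir: string,
  resumableRunners: Set<string>,
): { explain: { runnerId: string; sessionId?: string; questions: unknown[] } } {
  const explainPath = path.join(artifactDir, "explain.json");
  if (!fs.existsSync(explainPath)) {
    throw new Error(`No explain.json in ${artifactDir}: this run produced no follow-up questions.`);
  }
  const explain = JSON.parse(fs.readFileSync(explainPath, "utf8"));
  if (!explain.sessionId) {
    throw new Error("explain.json has no sessionId: the original session cannot be resumed.");
  }
  if (!resumableRunners.has(explain.runnerId)) {
    throw new Error(`Runner ${explain.runnerId} does not support resume.`);
  }
  return { explain };
}
```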

Runner-specific expectations for deferred explanation:

  • OpenCode: likely reuse the preserved runtime state plus session id, not only the exported transcript.
  • Codex: persist whatever session identifier is required for codex exec resume and add a resume path in the adapter layer.
  • Claude Code: remove or gate --no-session-persistence for runs that are intended to be explainable later.
  • Cursor Agent: reuse the extracted chat/session id with a headless resume path.

Plan

  1. Add explain metadata to assertion option types.

    • Extend AssertionOptions and SkillAssertionOptions in src/assertions/types.ts with an optional explain field.
    • Document that it is optional and only used when the assertion fails.
  2. Introduce a small internal question model.

    • Add an internal type for a follow-up question candidate with:
      • rendered question
      • source.filePath
      • source.line
      • source.column
    • Keep this internal unless there is a strong reason to export it.
  3. Capture follow-up question metadata at assertion-failure time.

    • Add helpers that can build a question candidate from:
      • a custom explain.question
      • or a built-in generator for supported assertion kinds
    • Reuse extractUserStackFrame logic for source locations.
    • Make sure the question is rendered at failure time, not reconstructed later from source code.
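Steps 2 and 3 together could look like the sketch below. buildQuestionCandidate and firstUserFrame are hypothetical names; firstUserFrame is a deliberately simplified stand-in for the existing extractUserStackFrame, which would be reused in the real implementation.

```typescript
// Internal follow-up question candidate (step 2), built at failure time (step 3).
interface QuestionCandidate {
  question: string; // rendered at failure time, never reconstructed later
  source: { filePath: string; line: number; column: number };
}

// Simplified stand-in for extractUserStackFrame (src/reporters/stack-frame.ts):
// pull the first absolute-path frame out of an Error stack.
function firstUserFrame(stack: string): QuestionCandidate["source"] | undefined {
  const m = /\((\/[^)]+):(\d+):(\d+)\)/.exec(stack);
  return m ? { filePath: m[1], line: Number(m[2]), column: Number(m[3]) } : undefined;
}

function buildQuestionCandidate(
  error: Error,
  question: string | undefined,
): QuestionCandidate | undefined {
  // Unsupported assertion without a custom question: skip silently.
  if (question === undefined) return undefined;
  const source = error.stack ? firstUserFrame(error.stack) : undefined;
  if (!source) return undefined;
  return { question, source };
}
```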
  4. Preserve structured failures for soft assertions.

    • Replace the current AssertionError[]-only soft collector with a richer structure that can keep both the original error and its optional question candidate.
    • Keep the aggregate AssertionError behavior for reporters and failure output unchanged.
    • Ensure hard-failure-after-soft-failures still preserves all previously collected questions plus the final hard failure question when available.
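A richer soft collector along these lines would satisfy step 4 (names are illustrative; plain Error stands in for AssertionError):

```typescript
interface QuestionCandidate {
  question: string;
  source: { filePath: string; line: number; column: number };
}

// Each soft failure keeps the original error plus its optional question.
interface SoftFailure {
  error: Error; // the original AssertionError
  candidate?: QuestionCandidate; // present only for explainable failures
}

class SoftCollector {
  private failures: SoftFailure[] = [];

  record(error: Error, candidate?: QuestionCandidate): void {
    this.failures.push({ error, candidate });
  }

  // Reporter-facing view: the aggregate-error behavior stays unchanged.
  errors(): Error[] {
    return this.failures.map((f) => f.error);
  }

  // Persistence-facing view: all collected questions, in order.
  questions(): QuestionCandidate[] {
    return this.failures.flatMap((f) => (f.candidate ? [f.candidate] : []));
  }
}
```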
  5. Persist explain.json for failed runs.

    • During failure result writing in the runner pipeline, write artifactDir/explain.json only if at least one question candidate was collected for that run.
    • Include suitePath, caseId, runnerId, sessionId, and questions.
    • Do not write the file for passing runs or failures with zero questions.
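The persistence rule in step 5 reduces to a small guard. A sketch, assuming the file shape proposed earlier; writeExplainFile is a hypothetical name:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface QuestionCandidate {
  question: string;
  source: { filePath: string; line: number; column: number };
}

// Write artifactDir/explain.json only when at least one question was collected.
// Returns true when the file was written.
function writeExplainFile(
  artifactDir: string,
  meta: { suitePath: string; caseId: string; runnerId: string; sessionId?: string },
  questions: QuestionCandidate[],
): boolean {
  if (questions.length === 0) return false; // passing runs and question-less failures write nothing
  const file = path.join(artifactDir, "explain.json");
  fs.writeFileSync(file, JSON.stringify({ ...meta, questions }, null, 2));
  return true;
}
```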
  6. Add test coverage for question persistence.

    • Hard assertion with custom question -> one question in explain.json.
    • Soft assertions with multiple explainable failures -> multiple questions persisted.
    • Unsupported assertion without custom question -> no question emitted.
    • Mixed soft + hard failure path -> all collected questions persisted.
    • Source location points at the user test file assertion line.
    • explain.json omitted when no questions exist.
  7. Add an explanation/resume abstraction for runners.

    • Extend the adapter layer with a runner-specific resume/explain capability rather than trying to force everything through run().
    • This can be an explicit explain(...)/resume(...) adapter method or a dedicated capability object, but it should clearly separate initial execution from deferred explanation.
    • Surface clear unsupported errors for runners or executions that cannot be resumed.
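One possible shape for that capability, as a sketch (interface and method names are assumptions, not the existing adapter API):

```typescript
// Optional runner capability that separates initial execution from
// deferred explanation.
interface ExplainCapable {
  // Resume the original session and ask one question; resolves with the answer.
  resumeAndAsk(sessionId: string, question: string): Promise<string>;
}

interface RunnerAdapter {
  id: string;
  explain?: ExplainCapable; // absent when the runner cannot be resumed
}

// Surfaces a clear error for runners that cannot be resumed.
function requireExplainCapability(adapter: RunnerAdapter): ExplainCapable {
  if (!adapter.explain) {
    throw new Error(`Runner ${adapter.id} does not support deferred explanation.`);
  }
  return adapter.explain;
}
```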
  8. Implement skillgym explain <artifactDir>.

    • Resolve and validate the artifact directory.
    • Load report.json and explain.json.
    • Determine the runner and resumability prerequisites.
    • Resume the original session and ask all questions in order.
    • Persist answers in artifactDir/explanations.json.
    • Print useful CLI output that shows which question is being asked and where it came from.
  9. Add runner-specific explain support incrementally.

    • Start with the runner that has the cleanest non-interactive resume path.
    • OpenCode and Codex look like the best initial targets.
    • Claude Code likely needs a config/adapter change first because current runs disable persistence.
    • Cursor support should be added only after validating its resumed headless behavior carefully.
  10. Document the new workflow.

    • Add docs for custom assertion follow-up questions.
    • Add docs for skillgym explain <artifactDir>.
    • Explain that explain.json is written only when at least one question exists.
    • Call out that deferred explanation depends on runner-specific session persistence and is not guaranteed for every historical run.

Resolution Summary

SkillGym should be able to persist assertion-derived follow-up questions into explain.json for failed runs and later resume the original runner session via skillgym explain <artifactDir> to ask those questions. The implementation should support custom per-assertion questions, preserve source locations for each question, skip unsupported assertions that do not define a custom question, and only emit explain.json when at least one question exists.
