When a benchmark run fails today, skillgym can tell the user what assertion failed, but it cannot help them ask the original runner why it chose one behavior over another. That makes skillgym strong as a tester, but weak as a debugging and tuning loop for skill authors.
This is especially painful for users who are iterating on `SKILL.md` files or benchmarking agent behavior across runners. A failing assertion like "expected one matching command, got four" or "expected this skill/file read, but it never happened" often points to a strategic decision made by the agent, not just a bad final answer. If skillgym can persist follow-up questions at failure time and later resume the original runner session to ask them, users get a much tighter loop for understanding and improving their skills.
## Observed Findings
- `SessionReport` already persists runner session identifiers when available via `report.sessionId`, plus normalized events and raw artifact paths. See `src/domain/session-report.ts`.
- `RunnerResult` already preserves `artifactDir`, `report`, and serialized errors, and `executeRunner` writes `report.json` for both passing and failing runs. See `src/domain/result.ts` and `src/runner/execute-runner.ts`.
- The reporter stack already knows how to resolve a user-facing assertion source location from an error stack via `extractUserStackFrame`, returning `filePath`, `line`, and `column`. See `src/reporters/stack-frame.ts`.
- The current custom assertion surface is rich enough to support explainable follow-up prompts for many cases: `skills.*`, `commands.*`, `fileReads.*`, `toolCalls.*`, and `output.*`. See `src/assertions/types.ts` and the assertion implementations under `src/assertions/`.
- Hard assertions stop at the first failure. Soft assertions collect multiple `AssertionError`s and then merge them into one aggregate error at the end of the test case or before a later hard assertion failure. Today the soft collector keeps only `AssertionError[]`, which is sufficient for CLI output but lossy for persisted follow-up questions. See `src/assertions/soft.ts`.
- The current adapters already preserve enough runner-specific context to make resumed explanation plausible, but support is uneven:
  - OpenCode stores `sessionId` and uses an isolated XDG runtime under the artifact directory. The CLI supports `opencode run --session <id> --format json --thinking <message>`.
  - Codex runs non-interactively with `codex exec --json ...`, and the CLI supports `codex exec resume <sessionId> <prompt> --json`. The current adapter does not appear to persist a session id for resumed use.
  - Claude Code supports `claude -p -r <sessionId> --output-format stream-json <prompt>`, but the current adapter launches with `--no-session-persistence`, which makes later resume impossible.
  - Cursor Agent exposes `--resume [chatId]` and `--continue`, and the current adapter already extracts a `sessionId` from records.
- We explicitly want to start with a deferred command (`skillgym explain`) rather than inline `--explain-on-failure` execution.
- We only want to write `explain.json` when there is at least one follow-up question.
## Suggested Behavior
When an assertion failure occurs, skillgym should optionally derive one or more follow-up questions from that failure and persist them into the failing run's artifact directory as `explain.json`.

Question generation rules:
- Hard assertion failure: produce at most one question, because execution stops at the first hard failure.
- Soft assertion failures: persist every explainable question collected before the test case ends or before a later hard assertion failure interrupts execution.
- Do not impose a cap on the number of collected questions.
- If an assertion is unsupported for automatic question generation and does not define a custom question, skip it silently.

Question source metadata:
- Every persisted question should include the source location of the originating assertion in the user test file: `filePath`, `line`, `column`.
- This should reuse the same stack-frame extraction behavior already used by reporters.
- We do not need to persist a `status` field.
- We do not need to persist assertion family or method metadata if the question text and source location are already stored.

Question persistence format:
- Write `artifactDir/explain.json` only when at least one question exists.
- Keep the file minimal and stable enough to support deferred explanation later.
- A reasonable shape is:

```json
{
  "suitePath": "/abs/path/to/suite.ts",
  "caseId": "skill-selection",
  "runnerId": "open-main",
  "sessionId": "ses_123",
  "questions": [
    {
      "question": "You were expected to read SKILL.md before acting. Why did you proceed without it?",
      "source": {
        "filePath": "/abs/path/to/suite.ts",
        "line": 14,
        "column": 15
      }
    }
  ]
}
```

Note that `line` and `column` are numbers, matching what the stack-frame extraction returns.
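For reference, this file shape and the write-only-when-non-empty rule can be mirrored by a small TypeScript sketch. The names below (`ExplainFile`, `ExplainQuestion`, `shouldWriteExplainFile`) are illustrative, not existing skillgym exports:

```typescript
// Hypothetical types mirroring the proposed explain.json shape.
interface QuestionSource {
  filePath: string;
  line: number;
  column: number;
}

interface ExplainQuestion {
  question: string;
  source: QuestionSource;
}

interface ExplainFile {
  suitePath: string;
  caseId: string;
  runnerId: string;
  sessionId?: string;
  questions: ExplainQuestion[];
}

// Sketch of the rule: only emit the file when at least one question exists.
function shouldWriteExplainFile(questions: ExplainQuestion[]): boolean {
  return questions.length > 0;
}
```

Keeping `sessionId` optional in the type acknowledges that some runners may fail to surface one; the later `skillgym explain` step can then fail with a clear error instead of a parse failure.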
Assertion API additions:
- Extend assertion options so test authors can define their own follow-up question instead of relying entirely on skillgym to generate one.
- This should apply to both `AssertionOptions` and `SkillAssertionOptions`.
- The first version can support:

```typescript
interface ExplainOptions {
  question:
    | string
    | ((ctx: {
        report: SessionReport;
        expected?: unknown;
        actual?: unknown;
        observed?: unknown;
      }) => string | undefined);
}
```

- If `explain.question` is defined on a failing assertion, use that question.
- If no custom question is defined, use a built-in question generator for supported assertion kinds.
- If neither path applies, do not create a question for that failure.
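That three-way resolution order (custom question first, then the built-in generator, then nothing) can be sketched as a small helper. `resolveQuestion` and `ExplainCtx` are hypothetical names; only the `ExplainOptions` shape mirrors the interface proposed above:

```typescript
// Stand-in for the context passed to custom question functions.
// `report` stands in for SessionReport to keep the sketch self-contained.
interface ExplainCtx {
  report: unknown;
  expected?: unknown;
  actual?: unknown;
  observed?: unknown;
}

interface ExplainOptions {
  question: string | ((ctx: ExplainCtx) => string | undefined);
}

// Resolution order: explain.question wins; otherwise fall back to a
// built-in generator for the assertion kind; otherwise emit no question.
function resolveQuestion(
  explain: ExplainOptions | undefined,
  ctx: ExplainCtx,
  builtin?: (ctx: ExplainCtx) => string | undefined,
): string | undefined {
  if (explain) {
    return typeof explain.question === "function"
      ? explain.question(ctx)
      : explain.question;
  }
  return builtin?.(ctx);
}
```

A function-valued `question` returning `undefined` counts as "no question", which keeps the skip-silently rule available even when a custom hook is present.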
Assertion families that are good candidates for built-in question generation in v1:
- `skills.*`
- `commands.*`
- `fileReads.*`
- `toolCalls.*`

`output.*` can be supported later or treated as lower priority because output mismatches are often less informative than behavioral mismatches.
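As a sketch of what a built-in generator for the `skills.*` family might produce, assuming the failure context exposes the expected skill and the observed reads (`skillReadQuestion` is a hypothetical helper, not part of the current codebase):

```typescript
// Hypothetical built-in generator for a "skill was not read" failure:
// turn the expected/observed mismatch into a follow-up question for the agent.
function skillReadQuestion(
  expectedSkill: string,
  observedReads: string[],
): string | undefined {
  if (observedReads.includes(expectedSkill)) return undefined; // no mismatch, no question
  const observed = observedReads.join(", ") || "nothing";
  return (
    `You were expected to read ${expectedSkill} before acting, ` +
    `but this run read: ${observed}. Why did you proceed without it?`
  );
}
```

Generators for `commands.*`, `fileReads.*`, and `toolCalls.*` would follow the same pattern: describe the expectation, describe what was observed, then ask why.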
CLI behavior:
- Add a new deferred command: `skillgym explain <artifactDir>`
- The command should:
  - load `report.json` and `explain.json`
  - resume the original runner session and ask the persisted questions
  - persist answers in `artifactDir/explanations.json`
- `skillgym explain` should fail clearly when:
  - `explain.json` is missing
  - the `sessionId` is missing
  - the run was executed with `--no-session-persistence`
Runner-specific expectations for deferred explanation:
- OpenCode: likely reuse the preserved runtime state plus session id, not only the exported transcript.
- Codex: persist whatever session identifier is required for `codex exec resume` and add a resume path in the adapter layer.
- Claude Code: remove or gate `--no-session-persistence` for runs that are intended to be explainable later.
- Cursor Agent: reuse the extracted chat/session id with a headless resume path.
## Plan
- Add explain metadata to assertion option types.
  - Extend `AssertionOptions` and `SkillAssertionOptions` in `src/assertions/types.ts` with an optional `explain` field.
  - Document that it is optional and only used when the assertion fails.
- Introduce a small internal question model.
  - Add an internal type for a follow-up question candidate with:
    - the rendered `question`
    - `source.filePath`
    - `source.line`
    - `source.column`
  - Keep this internal unless there is a strong reason to export it.
- Capture follow-up question metadata at assertion-failure time.
  - Add helpers that can build a question candidate from:
    - a custom `explain.question`
    - or a built-in generator for supported assertion kinds
  - Reuse `extractUserStackFrame` logic for source locations.
  - Make sure the question is rendered at failure time, not reconstructed later from source code.
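The source-location capture can be pictured with a simplified stack-frame parser. The real logic lives in `extractUserStackFrame` (`src/reporters/stack-frame.ts`); this regex-based stand-in only illustrates pulling `filePath`, `line`, and `column` out of a V8-style frame:

```typescript
interface SourceLocation {
  filePath: string;
  line: number;
  column: number;
}

// Simplified stand-in for extractUserStackFrame: parse the trailing
// "path:line:column" (optionally parenthesized) from one stack-trace line.
function parseStackFrame(stackLine: string): SourceLocation | undefined {
  const m = /\(?([^()\s]+):(\d+):(\d+)\)?$/.exec(stackLine.trim());
  if (!m) return undefined;
  return { filePath: m[1], line: Number(m[2]), column: Number(m[3]) };
}
```

The important behavioral point from the plan is unchanged by the parsing details: the question text must be rendered and paired with this location at failure time, so `explain.json` never needs to re-read the test source.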
- Preserve structured failures for soft assertions.
  - Replace the current `AssertionError[]`-only soft collector with a richer structure that can keep both the original error and its optional question candidate.
  - Keep the aggregate `AssertionError` behavior for reporters and failure output unchanged.
  - Ensure hard-failure-after-soft-failures still preserves all previously collected questions plus the final hard failure question when available.
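One possible shape for that richer collector keeps the existing error-list view intact while adding a question view. All names here are illustrative sketches of the change to `src/assertions/soft.ts`, not its current API:

```typescript
interface QuestionCandidate {
  question: string;
}

// Each soft failure keeps its error plus an optional question candidate.
interface SoftFailure {
  error: Error;
  candidate?: QuestionCandidate;
}

class SoftCollector {
  private failures: SoftFailure[] = [];

  add(error: Error, candidate?: QuestionCandidate): void {
    this.failures.push({ error, candidate });
  }

  // What reporters already consume: just the errors, for the aggregate output.
  errors(): Error[] {
    return this.failures.map((f) => f.error);
  }

  // What explain.json persistence needs: every collected question, in order.
  questions(): QuestionCandidate[] {
    return this.failures.flatMap((f) => (f.candidate ? [f.candidate] : []));
  }
}
```

Because `errors()` still yields a plain error list, the aggregate-error merge for reporters can stay byte-for-byte identical while `questions()` feeds the new persistence path.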
- Persist `explain.json` for failed runs.
  - During failure result writing in the runner pipeline, write `artifactDir/explain.json` only if at least one question candidate was collected for that run.
  - Include `suitePath`, `caseId`, `runnerId`, `sessionId`, and `questions`.
  - Do not write the file for passing runs or failures with zero questions.
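A minimal sketch of that write rule, assuming the writer receives the run metadata and collected candidates. `writeExplainFile` is a hypothetical name; `node:fs` and `node:path` are the real Node built-ins:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

interface Question {
  question: string;
  source: { filePath: string; line: number; column: number };
}

// Write artifactDir/explain.json only when there is something to ask later.
// Returns the written path, or undefined when no file was created.
function writeExplainFile(
  artifactDir: string,
  meta: { suitePath: string; caseId: string; runnerId: string; sessionId?: string },
  questions: Question[],
): string | undefined {
  if (questions.length === 0) return undefined; // zero questions: no file at all
  const target = path.join(artifactDir, "explain.json");
  fs.writeFileSync(target, JSON.stringify({ ...meta, questions }, null, 2));
  return target;
}
```

Returning the path (or `undefined`) gives the runner pipeline an easy way to log whether an explainable artifact was produced for the run.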
- Add test coverage for question persistence.
  - Hard assertion with custom question -> one question in `explain.json`.
  - Soft assertions with multiple explainable failures -> multiple questions persisted.
  - Unsupported assertion without custom question -> no question emitted.
  - Mixed soft + hard failure path -> all collected questions persisted.
  - Source location points at the user test file assertion line.
  - `explain.json` omitted when no questions exist.
- Add an explanation/resume abstraction for runners.
  - Extend the adapter layer with a runner-specific resume/explain capability rather than trying to force everything through `run()`.
  - This can be an explicit `explain(...)`/`resume(...)` adapter method or a dedicated capability object, but it should clearly separate initial execution from deferred explanation.
  - Surface clear unsupported errors for runners or executions that cannot be resumed.
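The capability-object variant could look like the following. The `RunnerAdapter` and `ExplainCapability` shapes are hypothetical; the point is that resumability is opt-in per adapter and unsupported runners fail with a clear error instead of a broken resume attempt:

```typescript
// Optional capability: only adapters that can resume a session implement it.
interface ExplainCapability {
  resumeAndAsk(sessionId: string, question: string): Promise<string>;
}

interface RunnerAdapter {
  id: string;
  explain?: ExplainCapability; // absent => deferred explanation unsupported
}

// Shared entry point: fail fast and clearly for non-resumable runners.
function askRunner(
  adapter: RunnerAdapter,
  sessionId: string,
  question: string,
): Promise<string> {
  if (!adapter.explain) {
    throw new Error(`runner '${adapter.id}' does not support deferred explanation`);
  }
  return adapter.explain.resumeAndAsk(sessionId, question);
}
```

Each adapter's `resumeAndAsk` would wrap its own CLI resume path (`opencode run --session`, `codex exec resume`, and so on) behind this one interface.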
- Implement `skillgym explain <artifactDir>`.
  - Resolve and validate the artifact directory.
  - Load `report.json` and `explain.json`.
  - Determine the runner and resumability prerequisites.
  - Resume the original session and ask all questions in order.
  - Persist answers in `artifactDir/explanations.json`.
  - Print useful CLI output that shows which question is being asked and where it came from.
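The validation step can be sketched as the prerequisite checks the command would run before attempting any resume. `loadExplainFile` and its error messages are illustrative, not an existing function:

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Check that this artifact directory is actually explainable before resuming:
// explain.json must exist and must carry the session id needed for resume.
function loadExplainFile(artifactDir: string): { sessionId?: string; questions: unknown[] } {
  const file = path.join(artifactDir, "explain.json");
  if (!fs.existsSync(file)) {
    throw new Error(`no explain.json in ${artifactDir}; nothing to explain`);
  }
  const parsed = JSON.parse(fs.readFileSync(file, "utf8"));
  if (!parsed.sessionId) {
    throw new Error("explain.json has no sessionId; the original session cannot be resumed");
  }
  return parsed;
}
```

Doing these checks up front keeps the failure modes from the CLI behavior section (missing `explain.json`, missing `sessionId`) as immediate, specific errors rather than runner-level crashes mid-resume.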
- Add runner-specific explain support incrementally.
  - Start with the runner that has the cleanest non-interactive resume path.
  - OpenCode and Codex look like the best initial targets.
  - Claude Code likely needs a config/adapter change first because current runs disable persistence.
  - Cursor support should be added only after validating its resumed headless behavior carefully.
- Document the new workflow.
  - Add docs for custom assertion follow-up questions.
  - Add docs for `skillgym explain <artifactDir>`.
  - Explain that `explain.json` is written only when at least one question exists.
  - Call out that deferred explanation depends on runner-specific session persistence and is not guaranteed for every historical run.
## Resolution Summary
SkillGym should be able to persist assertion-derived follow-up questions into `explain.json` for failed runs and later resume the original runner session via `skillgym explain <artifactDir>` to ask those questions. The implementation should support custom per-assertion questions, preserve source locations for each question, skip unsupported assertions that do not define a custom question, and only emit `explain.json` when at least one question exists.