SkillGym should support repeated executions of the same case x runner pair so skill creators can measure stability across nondeterministic agent runs without having to rerun the whole suite manually. Today a single lucky pass can hide flakiness, while a single transient wobble can fail a benchmark run with too little context about whether the skill is fundamentally unstable or just hit one noisy model response. Users need a built-in way to ask for multiple successful repetitions, tolerate a limited number of transient failures, and still keep detailed execution data for debugging and analysis.
Observed Findings
- The current execution model expands the suite into one logical execution per case x runner pair: src/runner/execute-suite.ts.
- The current retry behavior is retryFailed, which retries only after a failed execution and produces one final RunnerResult with nested attempts: src/runner/execute-suite.ts, src/domain/result.ts.
- Reporters assume one visible terminal result per case x runner and already collapse retry details into a compact presentation: src/reporters/standard.ts, src/reporters/github-actions.ts, src/reporters/json-summary.ts.
- Configuration and CLI currently expose retryFailed but do not expose a notion of proactive repeated executions for stability sampling: src/config.ts, src/cli.ts, src/cli/run.ts, src/cli/help.ts.
- The current artifact layout uses the runner directory as the root execution directory and attempt-* for retry attempts, which would collide conceptually with repeated baseline runs if repetitions reused the same naming scheme: src/runner/execute-suite.ts.
- The README documents retryFailed as rerunning failed executions and describes one execution per case x runner pair, but does not describe repeated stability runs: README.md.
Suggested Behavior
Add first-class repeated stability runs with these semantics:
- repeat means the target number of successful repetitions for each case x runner pair. Default: 1.
- repeatFailure means the number of extra retries allowed for the current repetition after a failure is detected. Default: 0.
- Keep retryFailed temporarily supported as a backward-compatibility alias for repeatFailure during the transition between SkillGym versions.
- If both repeatFailure and retryFailed are configured, prefer repeatFailure.
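The defaults and alias precedence above could be resolved in one small function. This is a minimal sketch, not the actual config loader: the RawRunOptions and ResolvedRunOptions types and the resolveRunOptions name are hypothetical, and the integer check is an added assumption on top of the repeat >= 1 and repeatFailure >= 0 rules listed in the implementation plan.

```typescript
// Hypothetical input shape. Only the three field names come from the proposal.
interface RawRunOptions {
  repeat?: number;
  repeatFailure?: number;
  retryFailed?: number; // legacy alias for repeatFailure
}

interface ResolvedRunOptions {
  repeat: number;
  repeatFailure: number;
}

function resolveRunOptions(raw: RawRunOptions): ResolvedRunOptions {
  const repeat = raw.repeat ?? 1;
  // Prefer the new field, then the legacy alias, then the default of 0.
  const repeatFailure = raw.repeatFailure ?? raw.retryFailed ?? 0;

  // Assumption: both values must be whole numbers.
  if (!Number.isInteger(repeat) || repeat < 1) {
    throw new Error(`repeat must be an integer >= 1, got ${repeat}`);
  }
  if (!Number.isInteger(repeatFailure) || repeatFailure < 0) {
    throw new Error(`repeatFailure must be an integer >= 0, got ${repeatFailure}`);
  }
  return { repeat, repeatFailure };
}
```

With this shape, a config that sets only retryFailed: 2 resolves to repeatFailure 2, while one that sets both fields keeps the repeatFailure value.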
Execution semantics:
- SkillGym should keep running repetitions while each repetition eventually passes.
- When a repetition fails, SkillGym should retry only that repetition up to repeatFailure additional times.
- If a retry recovers the repetition, SkillGym should continue toward the remaining repeat target.
- If a repetition still fails after exhausting repeatFailure, SkillGym should stop that case x runner execution immediately and mark it failed.
- The overall case x runner result should pass only if it reaches the full repeat target successfully.
Reporting semantics:
- Terminal-facing reporters should keep showing one aggregated visible result per case x runner.
- On success, reporters should show success plus averaged token and duration metrics across the final outcomes of completed repetitions.
- On failure, reporters should stay compact and report only aggregate information such as where the run stopped, without dumping per-repetition failure details.
- Detailed repetition and retry data should be preserved in results.json and the JSON summary reporter.
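A compact terminal line along these lines might look like the sketch below. AggregateSummary and formatSummaryLine are hypothetical, the PASS/FAIL wording is invented for illustration, and the sketch assumes that the first number in "failed at 4/10" is the repetition that failed, not the count of repetitions that succeeded.

```typescript
// Hypothetical aggregate view handed to a terminal-facing reporter.
interface AggregateSummary {
  passed: boolean;
  repeatTarget: number;
  successfulRepetitions: number;
  avgDurationMs?: number;
  avgTotalTokens?: number;
}

function formatSummaryLine(label: string, s: AggregateSummary): string {
  if (!s.passed) {
    // Assumption: the failing repetition is the one after the last success.
    const failedAt = s.successfulRepetitions + 1;
    return `FAIL ${label} (failed at ${failedAt}/${s.repeatTarget})`;
  }
  const duration = Math.round(s.avgDurationMs ?? 0);
  const tokens = Math.round(s.avgTotalTokens ?? 0);
  return `PASS ${label} (${s.successfulRepetitions}/${s.repeatTarget}, avg ${duration} ms, avg ${tokens} tokens)`;
}
```

The failure branch deliberately ignores per-repetition details, which stay in the machine-readable outputs.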
Artifact semantics:
- Repetitions should have their own directories, for example repeat-1, repeat-2, etc.
- Retries within a repetition should continue using attempt-* inside the repetition directory.
- The aggregate RunnerResult.artifactDir should continue to point at the stable runner root directory for that case.
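The directory layout can be expressed as three small path helpers. This is a sketch under assumptions: the function names are hypothetical, plain "/" joining stands in for whatever path utility the runner actually uses, and the proposal does not say whether the first attempt of a repetition gets its own attempt-1 directory, so the sketch simply builds whatever attempt number it is given.

```typescript
// Stable root for one case x runner pair. The aggregate artifactDir points here.
function runnerRootDir(outputDir: string, caseId: string, runnerPathKey: string): string {
  return `${outputDir}/${caseId}/${runnerPathKey}`;
}

// One directory per repetition: repeat-1, repeat-2, ...
function repetitionDir(runnerRoot: string, repetition: number): string {
  return `${runnerRoot}/repeat-${repetition}`;
}

// Retries stay nested inside the repetition they belong to.
function attemptDir(runnerRoot: string, repetition: number, attempt: number): string {
  return `${repetitionDir(runnerRoot, repetition)}/attempt-${attempt}`;
}
```

For example, the second attempt of the fourth repetition resolves to a path ending in repeat-4/attempt-2, which keeps retry artifacts visibly separate from baseline repetitions.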
Implementation Plan
Favor a small, encapsulated implementation rather than changing the scheduler or reworking reporter contracts.
- Add new run options.
  - Introduce run.repeat and run.repeatFailure in config.
  - Add CLI flags --repeat <n> and --repeat-failure <n>.
  - Validation rules:
    - repeat >= 1
    - repeatFailure >= 0
  - Keep retryFailed supported as a compatibility alias for repeatFailure in both config and CLI.
  - Precedence:
    - use repeatFailure when present
    - otherwise fall back to retryFailed
- Keep scheduling unchanged.
  - Do not make the scheduler aware of repetitions.
  - Keep one scheduled execution per case x runner pair.
  - Implement the new behavior inside executePlannedExecution in src/runner/execute-suite.ts.
  - Reasoning:
    - This keeps the change localized.
    - It avoids spreading repetition semantics across planning, scheduling, and reporters.
- Implement nested execution flow.
  - Add an outer loop for repetitions 1..repeat.
  - Add an inner loop for retries of the current repetition 1..(repeatFailure + 1).
  - Behavior:
    - continue while each repetition eventually passes
    - on failure, retry only the current repetition
    - if retry recovers it, continue to the next repetition
    - if retries are exhausted, stop immediately and produce a failed aggregate result
- Separate repetition artifacts from retry artifacts.
  - Use the runner root as the aggregate artifact directory:
    output/<caseId>/<runnerPathKey>
  - Store repetitions under:
    .../repeat-1
    .../repeat-2
  - Store retries inside a repetition under:
    .../repeat-4/attempt-2
  - Reasoning:
    - attempt-* already means retry today.
    - Reusing attempt-* for repetitions would blur the two concepts and make debugging harder.
- Extend result types minimally.
  - Keep the existing top-level CaseResult and RunnerResult shape so reporters do not need a redesign.
  - Extend RunnerResult with aggregate repetition metadata such as:
    - repeatTarget
    - completedRepetitions
    - successfulRepetitions
    - failedRepetitions
    - repetitions
  - Add a RepetitionResult type for final per-repetition outcomes.
  - Keep attempts meaning “retries within one repetition”.
  - Use repetitions to represent baseline stability runs.
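One possible shape for these type extensions is sketched below. Only the field names repeatTarget, completedRepetitions, successfulRepetitions, failedRepetitions, repetitions, and attempts, plus the RepetitionResult type name, come from the plan. Everything else (Outcome, AttemptResult, RunnerResultRepetitionFields, and the per-result fields) is a hypothetical stand-in, since the existing RunnerResult definition in src/domain/result.ts is not shown here.

```typescript
type Outcome = "passed" | "failed";

// Hypothetical record for one retry attempt inside a repetition.
interface AttemptResult {
  attempt: number;
  outcome: Outcome;
  durationMs: number;
  artifactDir: string;
}

// Final outcome of one baseline repetition, with its nested attempts.
interface RepetitionResult {
  repetition: number;
  outcome: Outcome;
  durationMs: number;
  artifactDir: string;
  attempts: AttemptResult[];
}

// Fields that would be added to the existing RunnerResult.
interface RunnerResultRepetitionFields {
  repeatTarget: number;
  completedRepetitions: number;
  successfulRepetitions: number;
  failedRepetitions: number;
  repetitions: RepetitionResult[];
}

// Example: target of 3, the run stopped when repetition 2 exhausted its retries.
const exampleFields: RunnerResultRepetitionFields = {
  repeatTarget: 3,
  completedRepetitions: 2,
  successfulRepetitions: 1,
  failedRepetitions: 1,
  repetitions: [
    { repetition: 1, outcome: "passed", durationMs: 900, artifactDir: "repeat-1", attempts: [] },
    { repetition: 2, outcome: "failed", durationMs: 1100, artifactDir: "repeat-2", attempts: [] },
  ],
};
```

The example treats a repetition that ran to a final failed outcome as completed. That reading of completedRepetitions is an assumption.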
- Aggregate the final visible RunnerResult.
  - Keep one visible RunnerResult per case x runner.
  - Pass if all target repetitions eventually pass.
  - Fail on the first repetition that still fails after exhausting repeatFailure.
  - Average durationMs and usage metrics across the final outcomes of completed repetitions only.
  - Do not include discarded retry attempts in those averages.
  - On aggregate failure, propagate the terminal failed repetition’s failure metadata as the final failure details.
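The aggregation rules above can be illustrated with a small pure function. This is a sketch, not the proposed implementation: FinalOutcome, Aggregate, and aggregateRepetitions are hypothetical, totalTokens stands in for whatever usage metrics the runner records, and the sketch assumes the terminal failed repetition counts as completed for averaging purposes.

```typescript
// Hypothetical final outcome of one repetition. Discarded retry attempts are
// never passed in, so they cannot influence the averages.
interface FinalOutcome {
  passed: boolean;
  durationMs: number;
  totalTokens: number;
  failureMessage?: string;
}

interface Aggregate {
  passed: boolean;
  avgDurationMs: number;
  avgTotalTokens: number;
  failureMessage?: string;
}

function aggregateRepetitions(repeatTarget: number, finals: FinalOutcome[]): Aggregate {
  const count = finals.length;
  const sum = (pick: (f: FinalOutcome) => number): number =>
    finals.reduce((total, f) => total + pick(f), 0);

  // Pass only when the full target was reached and every repetition passed.
  const passed = count === repeatTarget && finals.every((f) => f.passed);
  // The run stops at the first exhausted repetition, so a failure is always last.
  const terminal = count > 0 ? finals[count - 1] : undefined;

  return {
    passed,
    avgDurationMs: count === 0 ? 0 : sum((f) => f.durationMs) / count,
    avgTotalTokens: count === 0 ? 0 : sum((f) => f.totalTokens) / count,
    failureMessage: terminal && !terminal.passed ? terminal.failureMessage : undefined,
  };
}
```

Keeping this step as a pure function over final outcomes would make the "averages use only final repetition outcomes" test straightforward to write.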
- Preserve expected-failure behavior.
  - Continue applying existing expected-failure classification at the repetition level.
  - Derive aggregate pass/fail from classified repetition results rather than raw execution outcomes.
  - Avoid redesigning expected-failure behavior globally.
- Keep reporter changes small.
  - standard reporter:
    - still show one row per case x runner
    - success stays compact with averaged metrics
    - failure stays aggregate-only, for example failed at 4/10
    - no per-repetition detail dump in terminal output
  - github-actions reporter:
    - annotate only the aggregate failure
    - include compact repetition counts rather than enumerating all sub-failures
  - json-summary reporter:
    - include repetition details because it is machine-facing
  - results.json should remain the main detailed artifact
- Add focused tests.
  - Config tests:
    - parse repeat
    - parse repeatFailure
    - preserve retryFailed alias behavior
    - validate precedence when both old and new fields are present
  - CLI tests:
    - --repeat passthrough
    - --repeat-failure passthrough
    - legacy --retry-failed alias passthrough
    - precedence when both new and legacy flags are present
  - Runner tests:
    - all repetitions pass
    - a repetition fails once and retry recovers it
    - a repetition fails and exhausts retries, stopping early
    - results.json includes all repetitions and nested attempts
    - artifact layout uses repeat-N/attempt-M
    - averages use only final repetition outcomes
  - Reporter tests:
    - success remains compact
    - failure remains aggregate-only
    - JSON summary includes repetition details
- Update docs and migration messaging.
  - Document repeat and repeatFailure as the preferred API.
  - Mention retryFailed is temporarily supported as a compatibility alias.
  - Update README examples and CLI help.
  - Explain pass/fail semantics, artifact layout, and averaging behavior.
Resolution Summary
SkillGym should let skill creators require multiple successful runs for each case x runner pair, tolerate a limited number of transient failures for the current repetition, and keep full repetition-level detail in machine-readable outputs while preserving compact human-facing reports. The implementation should stay centered in execute-suite.ts, the config/CLI parsing layer, and minimal result-type and reporter extensions rather than refactoring the scheduler or redesigning the app’s reporting model.