Add repeated stability runs with recovery retries

SkillGym should support repeated executions of the same case x runner pair so skill creators can measure stability across nondeterministic agent runs without having to rerun the whole suite manually. Today a single lucky pass can hide flakiness, while a single transient wobble can fail a benchmark run with too little context about whether the skill is fundamentally unstable or just hit one noisy model response. Users need a built-in way to ask for multiple successful repetitions, tolerate a limited number of transient failures, and still keep detailed execution data for debugging and analysis.

## Observed Findings

- The current execution model expands the suite into one logical execution per case x runner pair: `src/runner/execute-suite.ts`.
- The current retry behavior is `retryFailed`, which retries only after a failed execution and produces one final `RunnerResult` with nested `attempts`: `src/runner/execute-suite.ts`, `src/domain/result.ts`.
- Reporters assume one visible terminal result per case x runner and already collapse retry details into a compact presentation: `src/reporters/standard.ts`, `src/reporters/github-actions.ts`, `src/reporters/json-summary.ts`.
- Configuration and CLI currently expose `retryFailed` but do not expose a notion of proactive repeated executions for stability sampling: `src/config.ts`, `src/cli.ts`, `src/cli/run.ts`, `src/cli/help.ts`.
- The current artifact layout uses the runner directory as the root execution directory and `attempt-*` for retry attempts, which would collide conceptually with repeated baseline runs if repetitions reused the same naming scheme: `src/runner/execute-suite.ts`.
- The README documents `retryFailed` as rerunning failed executions and describes one execution per case x runner pair, but does not describe repeated stability runs: `README.md`.

## Suggested Behavior

Add first-class repeated stability runs with these semantics:

- `repeat` means the target number of successful repetitions for each case x runner pair. Default: `1`.
- `repeatFailure` means the number of extra retries allowed for the current repetition after a failure is detected. Default: `0`.
- Keep `retryFailed` temporarily supported as a backward-compatibility alias for `repeatFailure` during the transition between SkillGym versions.
- If both `repeatFailure` and `retryFailed` are configured, prefer `repeatFailure`.

Execution semantics:

- SkillGym should keep running repetitions until the `repeat` target is reached, as long as each repetition eventually passes.
- When a repetition fails, SkillGym should retry only that repetition up to `repeatFailure` additional times.
- If a retry recovers the repetition, SkillGym should continue toward the remaining `repeat` target.
- If a repetition still fails after exhausting `repeatFailure`, SkillGym should stop that case x runner execution immediately and mark it failed.
- The overall case x runner result should pass only if it reaches the full `repeat` target successfully.

Reporting semantics:

- Terminal-facing reporters should keep showing one aggregated visible result per case x runner.
- On success, reporters should show success plus averaged token and duration metrics across the final outcomes of completed repetitions.
- On failure, reporters should stay compact and report only aggregate information such as where the run stopped, without dumping per-repetition failure details.
- Detailed repetition and retry data should be preserved in `results.json` and the JSON summary reporter.

Artifact semantics:

- Repetitions should have their own directories, for example `repeat-1`, `repeat-2`, etc.
- Retries within a repetition should continue using `attempt-*` inside the repetition directory.
- The aggregate `RunnerResult.artifactDir` should continue to point at the stable runner root directory for that case.

## Implementation Plan

Favor a small, encapsulated implementation rather than changing the scheduler or reworking reporter contracts.

1. Add new run options.
- Introduce `run.repeat` and `run.repeatFailure` in config.
- Add CLI flags `--repeat <n>` and `--repeat-failure <n>`.
- Validation rules:
  - `repeat >= 1`
  - `repeatFailure >= 0`
- Keep `retryFailed` supported as a compatibility alias for `repeatFailure` in both config and CLI.
- Precedence:
  - use `repeatFailure` when present
  - otherwise fall back to `retryFailed`
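
The defaults, validation rules, and precedence above could be resolved in one place. The following is a minimal sketch, not the actual `src/config.ts` code: `RawRunOptions`, `ResolvedRunOptions`, and `resolveRunOptions` are hypothetical names, and it assumes `retryFailed` is a numeric count and that both values must be integers.

```typescript
// Hypothetical raw shape; only the field names `repeat`, `repeatFailure`,
// and `retryFailed` come from this issue.
interface RawRunOptions {
  repeat?: number;
  repeatFailure?: number;
  retryFailed?: number; // legacy alias (assumed to be a numeric count)
}

interface ResolvedRunOptions {
  repeat: number;
  repeatFailure: number;
}

function resolveRunOptions(raw: RawRunOptions): ResolvedRunOptions {
  const repeat = raw.repeat ?? 1;
  // Prefer the new field when present, then the legacy alias, then the default.
  const repeatFailure = raw.repeatFailure ?? raw.retryFailed ?? 0;

  // The integer requirement is an assumption; the issue only states the bounds.
  if (!Number.isInteger(repeat) || repeat < 1) {
    throw new Error(`repeat must be an integer >= 1, got ${repeat}`);
  }
  if (!Number.isInteger(repeatFailure) || repeatFailure < 0) {
    throw new Error(`repeatFailure must be an integer >= 0, got ${repeatFailure}`);
  }
  return { repeat, repeatFailure };
}
```

Using `??` rather than `||` means an explicit `repeatFailure: 0` still wins over a legacy `retryFailed` value.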

2. Keep scheduling unchanged.
- Do not make the scheduler aware of repetitions.
- Keep one scheduled execution per case x runner pair.
- Implement the new behavior inside `executePlannedExecution` in `src/runner/execute-suite.ts`.

Reasoning:
- This keeps the change localized.
- It avoids spreading repetition semantics across planning, scheduling, and reporters.

3. Implement nested execution flow.
- Add an outer loop for repetitions `1..repeat`.
- Add an inner loop for attempts of the current repetition `1..(repeatFailure + 1)`: one initial attempt plus up to `repeatFailure` retries.
- Behavior:
  - continue to the next repetition as long as the current one eventually passes
  - on failure, retry only the current repetition
  - if retry recovers it, continue to the next repetition
  - if retries are exhausted, stop immediately and produce a failed aggregate result
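
The nested flow can be sketched as two loops with an early exit. This is an illustration only, not the body of `executePlannedExecution`: `runRepetitions`, `runOnce`, and the record types are hypothetical, and the sketch is synchronous for brevity even though a real agent run would presumably be asynchronous.

```typescript
type Outcome = "passed" | "failed";

interface AttemptRecord {
  repetition: number;
  attempt: number;
  outcome: Outcome;
}

interface LoopResult {
  passed: boolean;
  successfulRepetitions: number;
  attempts: AttemptRecord[];
}

// `runOnce` is a hypothetical stand-in for a single agent execution.
function runRepetitions(
  repeat: number,
  repeatFailure: number,
  runOnce: (repetition: number, attempt: number) => Outcome,
): LoopResult {
  const attempts: AttemptRecord[] = [];
  let successfulRepetitions = 0;

  // Outer loop: repetitions 1..repeat.
  for (let repetition = 1; repetition <= repeat; repetition++) {
    let recovered = false;
    // Inner loop: one initial attempt plus up to `repeatFailure` retries.
    for (let attempt = 1; attempt <= repeatFailure + 1; attempt++) {
      const outcome = runOnce(repetition, attempt);
      attempts.push({ repetition, attempt, outcome });
      if (outcome === "passed") {
        recovered = true;
        break;
      }
    }
    if (!recovered) {
      // Retries exhausted: stop immediately with a failed aggregate.
      return { passed: false, successfulRepetitions, attempts };
    }
    successfulRepetitions++;
  }
  return { passed: true, successfulRepetitions, attempts };
}
```

Because the whole loop runs inside one scheduled execution, the scheduler still sees a single unit of work per case x runner pair.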

4. Separate repetition artifacts from retry artifacts.
- Use the runner root as the aggregate artifact directory:
  - `output/<caseId>/<runnerPathKey>`
- Store repetitions under:
  - `.../repeat-1`
  - `.../repeat-2`
- Store retries inside a repetition under:
  - `.../repeat-4/attempt-2`

Reasoning:
- `attempt-*` already means retry today.
- Reusing `attempt-*` for repetitions would blur the two concepts and make debugging harder.
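
A small path helper can keep the two directory levels distinct. The sketch below uses hypothetical function names and plain `/` joins for brevity (real code would likely use Node's `path` module). Whether the first attempt writes directly into the repetition directory or into its own `attempt-1` directory is not specified here, so the helper leaves that choice to the caller.

```typescript
// Root shared by all repetitions: output/<caseId>/<runnerPathKey>
function runnerRootDir(outputDir: string, caseId: string, runnerPathKey: string): string {
  return [outputDir, caseId, runnerPathKey].join("/");
}

// Directory for one baseline stability run, e.g. .../repeat-2
function repetitionDir(runnerRoot: string, repetition: number): string {
  return `${runnerRoot}/repeat-${repetition}`;
}

// Directory for one retry attempt inside a repetition, e.g. .../repeat-4/attempt-2
function attemptDir(runnerRoot: string, repetition: number, attempt: number): string {
  return `${repetitionDir(runnerRoot, repetition)}/attempt-${attempt}`;
}
```

For example, `attemptDir(runnerRootDir("output", "case-a", "runner-x"), 4, 2)` yields `output/case-a/runner-x/repeat-4/attempt-2`, while the aggregate `artifactDir` stays at `output/case-a/runner-x`.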

5. Extend result types minimally.
- Keep the existing top-level `CaseResult` and `RunnerResult` shape so reporters do not need a redesign.
- Extend `RunnerResult` with aggregate repetition metadata such as:
  - `repeatTarget`
  - `completedRepetitions`
  - `successfulRepetitions`
  - `failedRepetitions`
  - `repetitions`
- Add a `RepetitionResult` type for final per-repetition outcomes.
- Keep `attempts` meaning “retries within one repetition”.
- Use `repetitions` to represent baseline stability runs.
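
One possible shape for these additions is sketched below. The five metadata field names come from the list above; everything else (`AttemptResult`, the fields on `RepetitionResult`, and `summarizeRepetitions`) is a simplified placeholder, since the real definitions in `src/domain/result.ts` are not shown in this issue. The sketch also assumes a terminally failed repetition counts toward `completedRepetitions`.

```typescript
// Simplified placeholder for an existing per-attempt record.
interface AttemptResult {
  passed: boolean;
  durationMs: number;
}

// Final outcome of one baseline stability run.
interface RepetitionResult {
  index: number; // 1-based repetition number
  passed: boolean; // final outcome after any retries
  attempts: AttemptResult[]; // attempts made within this repetition
}

// Aggregate fields that would be added to RunnerResult.
interface RepetitionMetadata {
  repeatTarget: number;
  completedRepetitions: number;
  successfulRepetitions: number;
  failedRepetitions: number;
  repetitions: RepetitionResult[];
}

function summarizeRepetitions(
  repeatTarget: number,
  repetitions: RepetitionResult[],
): RepetitionMetadata {
  const successfulRepetitions = repetitions.filter((r) => r.passed).length;
  return {
    repeatTarget,
    // Assumption: every repetition with a final outcome counts as completed.
    completedRepetitions: repetitions.length,
    successfulRepetitions,
    failedRepetitions: repetitions.length - successfulRepetitions,
    repetitions,
  };
}
```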

6. Aggregate the final visible `RunnerResult`.
- Keep one visible `RunnerResult` per case x runner.
- Pass if all target repetitions eventually pass.
- Fail on the first repetition that still fails after exhausting `repeatFailure`.
- Average `durationMs` and usage metrics across the final outcomes of completed repetitions only.
- Do not include discarded retry attempts in those averages.
- On aggregate failure, propagate the terminal failed repetition’s failure metadata as the final failure details.
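
The averaging rule can be illustrated with a short sketch. The types and the `totalTokens` field are hypothetical stand-ins for the real usage metrics, and the sketch assumes "completed repetitions" means every repetition that reached a final outcome.

```typescript
interface Outcome {
  passed: boolean;
  durationMs: number;
  totalTokens: number; // hypothetical usage metric
}

interface Repetition {
  final: Outcome; // outcome of the last attempt for this repetition
  discardedAttempts: Outcome[]; // earlier failed attempts, excluded from averages
}

interface Aggregate {
  passed: boolean;
  durationMs: number;
  totalTokens: number;
}

function mean(values: number[]): number {
  if (values.length === 0) return 0;
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

function aggregateRepetitions(repeatTarget: number, repetitions: Repetition[]): Aggregate {
  // Only final outcomes feed the averages; discarded attempts are ignored.
  const finals = repetitions.map((r) => r.final);
  return {
    // Pass only when the full target was reached and every final outcome passed.
    passed: finals.length === repeatTarget && finals.every((f) => f.passed),
    durationMs: mean(finals.map((f) => f.durationMs)),
    totalTokens: mean(finals.map((f) => f.totalTokens)),
  };
}
```

With two repetitions whose final outcomes took 100 ms and 300 ms, the averaged `durationMs` is 200 regardless of how long any discarded attempt ran.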

7. Preserve expected-failure behavior.
- Continue applying existing expected-failure classification at the repetition level.
- Derive aggregate pass/fail from classified repetition results rather than raw execution outcomes.
- Avoid redesigning expected-failure behavior globally.

8. Keep reporter changes small.
- `standard` reporter:
  - still show one row per case x runner
  - success stays compact with averaged metrics
  - failure stays aggregate-only, for example `failed at 4/10`
  - no per-repetition detail dump in terminal output
- `github-actions` reporter:
  - annotate only the aggregate failure
  - include compact repetition counts rather than enumerating all sub-failures
- `json-summary` reporter:
  - include repetition details because it is machine-facing
- `results.json` should remain the main detailed artifact
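
As a rough illustration of the compact terminal output, a status formatter might look like the sketch below. Only the `failed at 4/10` form comes from this issue; the success wording, the `AggregateView` shape, and its field names are assumptions rather than the existing reporter contract.

```typescript
// Hypothetical view of the aggregate result as a reporter might receive it.
interface AggregateView {
  passed: boolean;
  repeatTarget: number;
  stoppedAtRepetition: number; // repetition that was running when the run ended
  avgDurationMs: number;
  avgTotalTokens: number;
}

function formatStatus(view: AggregateView): string {
  if (!view.passed) {
    // Aggregate-only failure text: where the run stopped, no per-repetition dump.
    return `failed at ${view.stoppedAtRepetition}/${view.repeatTarget}`;
  }
  const duration = Math.round(view.avgDurationMs);
  const tokens = Math.round(view.avgTotalTokens);
  return `passed ${view.repeatTarget}/${view.repeatTarget} (avg ${duration}ms, ${tokens} tokens)`;
}
```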

9. Add focused tests.
- Config tests:
  - parse `repeat`
  - parse `repeatFailure`
  - preserve `retryFailed` alias behavior
  - verify precedence when both old and new fields are present
- CLI tests:
  - `--repeat` passthrough
  - `--repeat-failure` passthrough
  - legacy `--retry-failed` alias passthrough
  - precedence when both new and legacy flags are present
- Runner tests:
  - all repetitions pass
  - a repetition fails once and retry recovers it
  - a repetition fails and exhausts retries, stopping early
  - `results.json` includes all repetitions and nested attempts
  - artifact layout uses `repeat-N/attempt-M`
  - averages use only final repetition outcomes
- Reporter tests:
  - success remains compact
  - failure remains aggregate-only
  - JSON summary includes repetition details

10. Update docs and migration messaging.
- Document `repeat` and `repeatFailure` as the preferred API.
- Mention `retryFailed` is temporarily supported as a compatibility alias.
- Update README examples and CLI help.
- Explain pass/fail semantics, artifact layout, and averaging behavior.

## Resolution Summary

SkillGym should let skill creators require multiple successful runs for each case x runner pair, tolerate a limited number of transient failures for the current repetition, and keep full repetition-level detail in machine-readable outputs while preserving compact human-facing reports. The implementation should stay centered in `execute-suite.ts`, the config/CLI parsing layer, and minimal result-type and reporter extensions rather than refactoring the scheduler or redesigning the app’s reporting model.
