Add repeated stability runs with recovery retries #26

@V3RON

Description

SkillGym should support repeated executions of the same case x runner pair so skill creators can measure stability across nondeterministic agent runs without having to rerun the whole suite manually. Today a single lucky pass can hide flakiness, while a single transient wobble can fail a benchmark run with too little context about whether the skill is fundamentally unstable or just hit one noisy model response. Users need a built-in way to ask for multiple successful repetitions, tolerate a limited number of transient failures, and still keep detailed execution data for debugging and analysis.

Observed Findings

  • The current execution model expands the suite into one logical execution per case x runner pair: src/runner/execute-suite.ts.
  • The current retry behavior is retryFailed, which retries only after a failed execution and produces one final RunnerResult with nested attempts: src/runner/execute-suite.ts, src/domain/result.ts.
  • Reporters assume one visible terminal result per case x runner and already collapse retry details into a compact presentation: src/reporters/standard.ts, src/reporters/github-actions.ts, src/reporters/json-summary.ts.
  • Configuration and CLI currently expose retryFailed but do not expose a notion of proactive repeated executions for stability sampling: src/config.ts, src/cli.ts, src/cli/run.ts, src/cli/help.ts.
  • The current artifact layout uses the runner directory as the root execution directory and attempt-* for retry attempts, which would collide conceptually with repeated baseline runs if repetitions reused the same naming scheme: src/runner/execute-suite.ts.
  • The README documents retryFailed as rerunning failed executions and describes one execution per case x runner pair, but does not describe repeated stability runs: README.md.

Suggested Behavior

Add first-class repeated stability runs with these semantics:

  • repeat means the target number of successful repetitions for each case x runner pair. Default: 1.
  • repeatFailure means the number of extra retries allowed for the current repetition after a failure is detected. Default: 0.
  • Keep retryFailed temporarily supported as a backward-compatibility alias for repeatFailure during the transition between SkillGym versions.
  • If both repeatFailure and retryFailed are configured, prefer repeatFailure.
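
A minimal sketch of how these defaults, validation rules, and the alias precedence could be resolved in one place. The names `RawRunOptions`, `ResolvedRunOptions`, and `resolveRunOptions` are hypothetical, and the sketch assumes `retryFailed` is a numeric option. The integer check is also an assumption: the proposal only states the lower bounds.

```typescript
// Hypothetical raw option shape; only the field names repeat, repeatFailure,
// and retryFailed come from the proposal.
interface RawRunOptions {
  repeat?: number;
  repeatFailure?: number;
  retryFailed?: number; // legacy alias, assumed numeric
}

interface ResolvedRunOptions {
  repeat: number;
  repeatFailure: number;
}

// Assumption: values must be integers; the proposal only gives the bounds.
function assertIntegerAtLeast(name: string, value: number, min: number): void {
  if (!Number.isInteger(value) || value < min) {
    throw new Error(`${name} must be an integer >= ${min}, got ${value}`);
  }
}

function resolveRunOptions(raw: RawRunOptions): ResolvedRunOptions {
  // repeat defaults to 1.
  const repeat = raw.repeat ?? 1;
  // Prefer repeatFailure, fall back to the retryFailed alias, then default to 0.
  const repeatFailure = raw.repeatFailure ?? raw.retryFailed ?? 0;
  assertIntegerAtLeast("repeat", repeat, 1);
  assertIntegerAtLeast("repeatFailure", repeatFailure, 0);
  return { repeat, repeatFailure };
}
```

Resolving the alias once at the parsing layer would let the rest of the code read only `repeatFailure`.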

Execution semantics:

  • SkillGym should keep running repetitions until the repeat target is reached, provided each repetition eventually passes.
  • When a repetition fails, SkillGym should retry only that repetition up to repeatFailure additional times.
  • If a retry recovers the repetition, SkillGym should continue toward the remaining repeat target.
  • If a repetition still fails after exhausting repeatFailure, SkillGym should stop that case x runner execution immediately and mark it failed.
  • The overall case x runner result should pass only if it reaches the full repeat target successfully.
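
The execution semantics above can be sketched as two nested loops. This is an illustration, not the existing `executePlannedExecution` code: `runWithRepetitions`, `AttemptOutcome`, `RepetitionOutcome`, and `AggregateOutcome` are hypothetical names, and the sketch is synchronous for brevity, whereas the real runner is presumably asynchronous.

```typescript
// Hypothetical types for illustration only.
interface AttemptOutcome {
  passed: boolean;
}

interface RepetitionOutcome {
  index: number; // 1-based repetition number
  passed: boolean;
  attempts: AttemptOutcome[]; // first try plus any retries
}

interface AggregateOutcome {
  passed: boolean;
  repeatTarget: number;
  repetitions: RepetitionOutcome[];
}

function runWithRepetitions(
  repeat: number,
  repeatFailure: number,
  runOnce: (repetition: number, attempt: number) => AttemptOutcome,
): AggregateOutcome {
  const repetitions: RepetitionOutcome[] = [];

  // Outer loop: repetitions 1..repeat.
  for (let repetition = 1; repetition <= repeat; repetition++) {
    const attempts: AttemptOutcome[] = [];
    let passed = false;

    // Inner loop: one initial try plus up to repeatFailure retries.
    for (let attempt = 1; attempt <= repeatFailure + 1; attempt++) {
      const outcome = runOnce(repetition, attempt);
      attempts.push(outcome);
      if (outcome.passed) {
        passed = true;
        break; // the repetition recovered; stop retrying it
      }
    }

    repetitions.push({ index: repetition, passed, attempts });

    // Retries exhausted: stop this case x runner execution immediately.
    if (!passed) {
      return { passed: false, repeatTarget: repeat, repetitions };
    }
  }

  // Every repetition eventually passed, so the full target was reached.
  return { passed: true, repeatTarget: repeat, repetitions };
}
```

Note that the retry budget in this sketch is per repetition: a recovered repetition does not reduce the retries available to later ones.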

Reporting semantics:

  • Terminal-facing reporters should keep showing one aggregated visible result per case x runner.
  • On success, reporters should show success plus averaged token and duration metrics across the final outcomes of completed repetitions.
  • On failure, reporters should stay compact and report only aggregate information such as where the run stopped, without dumping per-repetition failure details.
  • Detailed repetition and retry data should be preserved in results.json and the JSON summary reporter.
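
As a rough illustration of the averaging rule, the sketch below takes the last attempt of each repetition as its final outcome and ignores earlier, discarded attempts. The names `AttemptMetrics`, `RepetitionRecord`, and `averageFinalOutcomes` are hypothetical, and `totalTokens` is a stand-in for whatever usage fields SkillGym actually records.

```typescript
// Hypothetical metric shape; totalTokens stands in for the real usage fields.
interface AttemptMetrics {
  durationMs: number;
  totalTokens: number;
}

interface RepetitionRecord {
  attempts: AttemptMetrics[]; // in execution order; the last one is the final outcome
}

function averageFinalOutcomes(
  repetitions: RepetitionRecord[],
): AttemptMetrics | undefined {
  // Keep only the final attempt of each repetition; earlier attempts are
  // discarded retries and must not influence the averages.
  const finals = repetitions
    .map((r) => r.attempts[r.attempts.length - 1])
    .filter((a): a is AttemptMetrics => a !== undefined);

  if (finals.length === 0) return undefined;

  const sum = finals.reduce(
    (acc, a) => ({
      durationMs: acc.durationMs + a.durationMs,
      totalTokens: acc.totalTokens + a.totalTokens,
    }),
    { durationMs: 0, totalTokens: 0 },
  );

  return {
    durationMs: sum.durationMs / finals.length,
    totalTokens: sum.totalTokens / finals.length,
  };
}
```

For example, if one repetition needed a retry, only the recovering attempt counts toward the averages, so a slow failed attempt would not inflate the reported duration.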

Artifact semantics:

  • Repetitions should have their own directories, for example repeat-1, repeat-2, etc.
  • Retries within a repetition should continue using attempt-* inside the repetition directory.
  • The aggregate RunnerResult.artifactDir should continue to point at the stable runner root directory for that case.
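
One way the directory layout could be derived, shown as a small sketch. The helpers `repetitionDir` and `attemptDir` are hypothetical, and the sketch assumes every attempt (including the first) gets its own `attempt-N` directory, which the proposal does not state explicitly. Paths are joined with "/" to keep the example free of platform-specific imports.

```typescript
// Hypothetical helpers; runnerRoot is the stable per-runner directory,
// e.g. output/<caseId>/<runnerPathKey>.
function repetitionDir(runnerRoot: string, repetition: number): string {
  return `${runnerRoot}/repeat-${repetition}`;
}

// Assumption: every attempt, including the first, lives under attempt-N.
function attemptDir(
  runnerRoot: string,
  repetition: number,
  attempt: number,
): string {
  return `${repetitionDir(runnerRoot, repetition)}/attempt-${attempt}`;
}

// The aggregate artifactDir stays at the runner root, unchanged.
const runnerRoot = "output/case-a/runner-x"; // illustrative values
const secondRetryOfFourth = attemptDir(runnerRoot, 4, 2);
```

Keeping `repeat-*` and `attempt-*` as separate path segments preserves the existing meaning of `attempt-*` as a retry.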

Implementation Plan

Favor a small, encapsulated implementation rather than changing the scheduler or reworking reporter contracts.

  1. Add new run options.
  • Introduce run.repeat and run.repeatFailure in config.
  • Add CLI flags --repeat <n> and --repeat-failure <n>.
  • Validation rules:
    • repeat >= 1
    • repeatFailure >= 0
  • Keep retryFailed supported as a compatibility alias for repeatFailure in both config and CLI.
  • Precedence:
    • use repeatFailure when present
    • otherwise fall back to retryFailed
  2. Keep scheduling unchanged.
  • Do not make the scheduler aware of repetitions.
  • Keep one scheduled execution per case x runner pair.
  • Implement the new behavior inside executePlannedExecution in src/runner/execute-suite.ts.

Reasoning:

  • This keeps the change localized.
  • It avoids spreading repetition semantics across planning, scheduling, and reporters.
  3. Implement nested execution flow.
  • Add an outer loop for repetitions 1..repeat.
  • Add an inner loop for retries of the current repetition 1..(repeatFailure + 1).
  • Behavior:
    • continue to the next repetition as long as the current one eventually passes
    • on failure, retry only the current repetition
    • if retry recovers it, continue to the next repetition
    • if retries are exhausted, stop immediately and produce a failed aggregate result
  4. Separate repetition artifacts from retry artifacts.
  • Use the runner root as the aggregate artifact directory:
    • output/<caseId>/<runnerPathKey>
  • Store repetitions under:
    • .../repeat-1
    • .../repeat-2
  • Store retries inside a repetition under:
    • .../repeat-4/attempt-2

Reasoning:

  • attempt-* already means retry today.
  • Reusing attempt-* for repetitions would blur the two concepts and make debugging harder.
  5. Extend result types minimally.
  • Keep the existing top-level CaseResult and RunnerResult shape so reporters do not need a redesign.
  • Extend RunnerResult with aggregate repetition metadata such as:
    • repeatTarget
    • completedRepetitions
    • successfulRepetitions
    • failedRepetitions
    • repetitions
  • Add a RepetitionResult type for final per-repetition outcomes.
  • Keep attempts meaning “retries within one repetition”.
  • Use repetitions to represent baseline stability runs.
  6. Aggregate the final visible RunnerResult.
  • Keep one visible RunnerResult per case x runner.
  • Pass if all target repetitions eventually pass.
  • Fail on the first repetition that still fails after exhausting repeatFailure.
  • Average durationMs and usage metrics across the final outcomes of completed repetitions only.
  • Do not include discarded retry attempts in those averages.
  • On aggregate failure, propagate the terminal failed repetition’s failure metadata as the final failure details.
  7. Preserve expected-failure behavior.
  • Continue applying existing expected-failure classification at the repetition level.
  • Derive aggregate pass/fail from classified repetition results rather than raw execution outcomes.
  • Avoid redesigning expected-failure behavior globally.
  8. Keep reporter changes small.
  • standard reporter:
    • still show one row per case x runner
    • success stays compact with averaged metrics
    • failure stays aggregate-only, for example failed at 4/10
    • no per-repetition detail dump in terminal output
  • github-actions reporter:
    • annotate only the aggregate failure
    • include compact repetition counts rather than enumerating all sub-failures
  • json-summary reporter:
    • include repetition details because it is machine-facing
  • results.json should remain the main detailed artifact
  9. Add focused tests.
  • Config tests:
    • parse repeat
    • parse repeatFailure
    • preserve retryFailed alias behavior
    • validate precedence when both old and new fields are present
  • CLI tests:
    • --repeat passthrough
    • --repeat-failure passthrough
    • legacy --retry-failed alias passthrough
    • precedence when both new and legacy flags are present
  • Runner tests:
    • all repetitions pass
    • a repetition fails once and retry recovers it
    • a repetition fails and exhausts retries, stopping early
    • results.json includes all repetitions and nested attempts
    • artifact layout uses repeat-N/attempt-M
    • averages use only final repetition outcomes
  • Reporter tests:
    • success remains compact
    • failure remains aggregate-only
    • JSON summary includes repetition details
  10. Update docs and migration messaging.
  • Document repeat and repeatFailure as the preferred API.
  • Mention retryFailed is temporarily supported as a compatibility alias.
  • Update README examples and CLI help.
  • Explain pass/fail semantics, artifact layout, and averaging behavior.
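
To make the result-type and reporter steps of the plan more concrete, here is a rough sketch of the repetition metadata and a compact status line. The field names `repeatTarget`, `completedRepetitions`, `successfulRepetitions`, `failedRepetitions`, and `repetitions` come from the plan; the `RepetitionResult` fields, the helper names, and the exact success wording are assumptions. Only the `failed at 4/10` form is taken from the plan.

```typescript
// RepetitionResult is named in the plan; its fields here are assumptions.
interface RepetitionResult {
  index: number; // 1-based repetition number
  passed: boolean; // final classified outcome of the repetition
  attemptCount: number; // first try plus retries
}

// Aggregate metadata that could be added to RunnerResult.
interface RepetitionSummary {
  repeatTarget: number;
  completedRepetitions: number;
  successfulRepetitions: number;
  failedRepetitions: number;
  repetitions: RepetitionResult[];
}

function summarizeRepetitions(
  repeatTarget: number,
  repetitions: RepetitionResult[],
): RepetitionSummary {
  const successfulRepetitions = repetitions.filter((r) => r.passed).length;
  return {
    repeatTarget,
    completedRepetitions: repetitions.length,
    successfulRepetitions,
    failedRepetitions: repetitions.length - successfulRepetitions,
    repetitions,
  };
}

// Compact, aggregate-only status for terminal-facing reporters.
function formatStatus(summary: RepetitionSummary): string {
  const passed =
    summary.failedRepetitions === 0 &&
    summary.successfulRepetitions === summary.repeatTarget;
  return passed
    ? `passed ${summary.successfulRepetitions}/${summary.repeatTarget}` // assumed wording
    : `failed at ${summary.completedRepetitions}/${summary.repeatTarget}`;
}
```

With a target of 10, three passing repetitions followed by one that exhausts its retries would be reported as `failed at 4/10`, while the full `repetitions` array remains available to machine-facing outputs.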

Resolution Summary

SkillGym should let skill creators require multiple successful runs for each case x runner pair, tolerate a limited number of transient failures for the current repetition, and keep full repetition-level detail in machine-readable outputs while preserving compact human-facing reports. The implementation should stay centered in execute-suite.ts, the config/CLI parsing layer, and minimal result-type and reporter extensions rather than refactoring the scheduler or redesigning the app’s reporting model.
