Merged
7 changes: 7 additions & 0 deletions README.md
@@ -104,6 +104,13 @@ View CLI help:
npx skillgym help
```

List bundled agent-facing skills and read the main one:

```bash
npx skillgym skills list
npx skillgym skills get core
```

By default, `skillgym` uses the built-in `standard` reporter.

TypeScript config, suite, and reporter modules work out of the box on Node `>=22.18.0` using Node's built-in TypeScript stripping.
3 changes: 2 additions & 1 deletion package.json
@@ -26,7 +26,8 @@
},
"files": [
"dist",
"bin.js"
"bin.js",
"skills"
],
"type": "module",
"main": "./dist/index.js",
97 changes: 97 additions & 0 deletions skills/assertions.md
@@ -0,0 +1,97 @@
---
name: assertions
description: Assertion authoring for Skillgym benchmark suites. Covers hard and soft assertions, grouped helpers, skill detection checks, command matching, and failure classification.
---

# skillgym assertions

Use this skill when you are writing or debugging `assert(report, ctx)` logic.

## What Skillgym gives you

`skillgym` exports a root `assert` object that combines Node strict assertions with benchmark-focused helpers.

Available groups:

- `assert.soft.*`
- `assert.skills.*`
- `assert.commands.*`
- `assert.fileReads.*`
- `assert.toolCalls.*`
- `assert.output.*`

## Typical patterns

```ts
import { assert, commandMatcher } from "skillgym";

assert.skills.has(report, "find-skills");
assert.commands.includes(report, commandMatcher("pnpm").arg("test"));
assert.fileReads.includes(report, /README\.md$/);
assert.toolCalls.includes(report, { tool: /skill/i });
assert.match(ctx.finalOutput(), /expo/i);
```

## Hard vs soft assertions

- Use hard assertions when one failure should stop the case immediately.
- Use `assert.soft.*` when you want one run to report multiple mismatches together.

```ts
assert.soft.skills.has(report, "find-skills");
assert.soft.commands.includes(report, "npx skills find");
assert.soft.output.notEmpty(report);
```

## Skill assertions

Use `assert.skills.*` against `report.detectedSkills`.

Most useful methods:

- `has`
- `notHas`
- `includes`
- `count`
- `exactlyOne`
- `only`

Confidence can be filtered with `minConfidence`:

```ts
assert.skills.has(report, "find-skills", { minConfidence: "strong" });
```

## Command assertions

Use raw strings or `RegExp` for quick checks. Use `commandMatcher(...)` for stable matching on executable, args, and options.

```ts
assert.commands.includes(report, "pnpm test");
assert.commands.before(report, /skills find/, /pnpm install/);
assert.commands.includes(report, commandMatcher("pnpm").arg("test").option("--filter", "unit"));
```

## Output assertions

Use output assertions when the final agent response matters directly. Use them together with normalized report assertions, not instead of them.
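As an illustrative sketch of why both layers matter, here is a hand-rolled pairing of a normalized-report check with a wording check. The `Report` shape and function name below are invented for the example, not real Skillgym types:

```ts
// Illustrative only: the Report shape here is an assumption, not the
// real Skillgym type. It pairs a normalized-report check with a
// user-visible wording check on the final output.
interface Report {
  detectedSkills: string[];
  finalOutput: string;
}

function checkInstallCase(report: Report): string[] {
  const failures: string[] = [];
  // Normalized check: the agent must have used the expected skill.
  if (!report.detectedSkills.includes("find-skills")) {
    failures.push("missing skill: find-skills");
  }
  // Wording check: the final answer must actually mention installing.
  if (!/install/i.test(report.finalOutput)) {
    failures.push("final output never mentions install");
  }
  return failures;
}
```

A run that passes the wording check but fails the normalized check still fails, which is the point: prose can look right while the workflow was wrong.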

## Failure classification

Use `assert.classify(...)` when multiple failures should collapse into one stable machine-readable cause.

```ts
assert.classify({ id: "wrong-cli-alias", label: "Wrong CLI alias" }, () => {
assert.doesNotMatch(ctx.finalOutput(), /\bcursr\b/i);
});
```

This improves grouped reporting across runs.
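One way to picture what classification buys you is this toy re-implementation; it is a sketch of the idea, not Skillgym's actual `assert.classify`:

```ts
// Toy sketch of the classification idea: run a block of checks and, if
// any throws, report one stable machine-readable cause instead of the
// raw error. Not the real assert.classify implementation.
interface FailureClass {
  id: string;
  label: string;
}

function classifySketch(
  meta: FailureClass,
  checks: () => void,
): FailureClass | null {
  try {
    checks();
    return null; // all checks passed
  } catch {
    return meta; // collapse any failure into the stable cause
  }
}
```

Because every run that trips the same checks yields the same `id`, failures can be grouped and counted across many runs instead of producing one unique error string per run.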

## Good assertion style

- assert the smallest behavior that proves the benchmark intent
- prefer normalized report fields over fragile prose matching
- use command and tool-call assertions for workflow checks
- use output assertions for user-visible wording checks
- classify repeated failure modes with stable ids
103 changes: 103 additions & 0 deletions skills/core.md
@@ -0,0 +1,103 @@
---
name: core
description: Core Skillgym workflow for agents. Read this first before writing or debugging suites. Covers how to structure a suite, run it, interpret results, and when to read deeper feature-specific skills.
---

# skillgym core

Skillgym benchmarks coding-agent behavior by running real agent sessions and asserting on the normalized execution report.

Read this skill first. It gives you the default workflow, the minimum suite shape, and the map to deeper feature skills.

## Core loop

```bash
skillgym skills get core
skillgym run <suite.ts>
skillgym run <suite.ts> --case <id>
skillgym run <suite.ts> --runner <runner-id>
```

Typical agent loop:

1. Read the target suite or create one.
2. Write or refine prompts and assertions.
3. Run one suite, case, or runner slice.
4. Inspect failures from the output directory and normalized report.
5. Tighten assertions or workspace setup until the benchmark captures the intended behavior.

## Minimum suite shape

```ts
import { assert, type TestCase } from "skillgym";

const suite: TestCase[] = [
{
id: "uses-correct-skill",
prompt: "Find the right skill and explain how to install it.",
assert(report) {
assert.skills.has(report, "find-skills");
assert.match(report.finalOutput, /install/i);
},
},
];

export default suite;
```

Use stable `id` values. Keep prompts exact. Put benchmark intent in assertions, not in prose comments.

## Primary commands

```bash
skillgym help
skillgym skills list
skillgym skills get <name>

skillgym run <suite.ts>
skillgym run <suite.ts> --case <id>
skillgym run <suite.ts> --tag <tag>
skillgym run <suite.ts> --runner <runner-id>
skillgym run <suite.ts> --reporter json-summary
skillgym run <suite.ts> --schedule parallel --max-parallel 4
skillgym run <suite.ts> --update-snapshots
```

## Mental model

- A configured runner is one agent target.
- Each selected case runs once per selected runner.
- Assertions evaluate the normalized session report after the run finishes.
- Output artifacts are written under the configured `outputDir`.
- Expected assertion failures can be benchmark-successful; infrastructure failures are still real failures.
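The case-times-runner fan-out above can be sketched in a few lines; the case and runner ids here are made up for illustration:

```ts
// Sketch of the fan-out: every selected case runs once per selected
// runner. Ids below are illustrative placeholders.
const caseIds = ["uses-correct-skill", "explains-install"];
const runnerIds = ["runner-a", "runner-b"];

const runs = caseIds.flatMap((caseId) =>
  runnerIds.map((runnerId) => ({ caseId, runnerId })),
);
// runs.length === caseIds.length * runnerIds.length
```

This is why narrowing with `--case` or `--runner` matters: without filters, the run count is the full product of cases and runners.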

## When to read deeper skills

Read the focused skills only when the task needs them:

- `skillgym skills get test-cases`
Use when creating or reshaping suite files, tags, expected failures, or per-case timeouts.
- `skillgym skills get assertions`
Use when writing pass/fail logic against skills, commands, tool calls, output, or failure classes.
- `skillgym skills get workspaces`
Use when the agent needs isolated filesystem state, template repos, or bootstrap commands.
- `skillgym skills get snapshots`
Use when benchmarking token regressions or updating snapshot baselines.
- `skillgym skills get reporters`
Use when choosing built-in reporters or wiring a custom reporter.

## Suggested authoring order

1. Start with one small case and one runner.
2. Make the assertion explicit and narrow.
3. Add tags or expected-failure behavior only after the baseline case works.
4. Add workspace isolation when shared state can affect the benchmark.
5. Add snapshots when behavior is stable enough to guard token regressions.

## Common mistakes

- asserting on vague output instead of checking the normalized report
- trying to select runners inside `TestCase` instead of config plus CLI filters
- using shared workspaces for stateful tasks that need isolation
- treating snapshot mismatches like functional failures instead of cost regressions
- writing one huge suite before proving one small representative case
64 changes: 64 additions & 0 deletions skills/reporters.md
@@ -0,0 +1,64 @@
---
name: reporters
description: Skillgym reporter selection and customization. Covers built-in reporters, CLI/config selection, lifecycle hooks, and when to use machine-readable output.
---

# skillgym reporters

Use this skill when choosing how benchmark results should be rendered or consumed.

## Built-in reporters

- `standard`
- `json`
- `json-summary`
- `github-actions`

## Main commands

```bash
skillgym run <suite.ts> --reporter standard
skillgym run <suite.ts> --reporter json
skillgym run <suite.ts> --reporter json-summary
skillgym run <suite.ts> --reporter github-actions
skillgym run <suite.ts> --reporter ./path/to/custom-reporter.ts
```

## Selection rules

- omitting `--reporter` uses the built-in `standard` reporter
- CLI `--reporter` overrides config `run.reporter`
- relative custom reporter paths resolve from `process.cwd()` on CLI input
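The precedence above amounts to a simple fallback chain. A sketch (the function name is illustrative, not a Skillgym export):

```ts
// Illustrative precedence chain: the CLI flag beats the config value,
// and the config value beats the built-in default.
function resolveReporter(cliFlag?: string, configValue?: string): string {
  return cliFlag ?? configValue ?? "standard";
}
```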

## When to use each built-in reporter

- `standard`: default interactive CLI output for humans
- `json`: full aggregated result on stdout for machine consumers
- `json-summary`: trimmed result for post-processing or LLM consumption
- `github-actions`: CI annotations and job summary output

## Custom reporter shape

```ts
import type { BenchmarkReporter } from "skillgym";

const reporter: BenchmarkReporter = {
onSuiteFinish(event) {
console.log(event.result.outputDir);
},
};

export default reporter;
```

## Reporter lifecycle hooks

- `onSuiteStart`
- `onCaseStart`
- `onRunnerStart`
- `onRunnerFinish`
- `onCaseFinish`
- `onSuiteFinish`
- `onError`
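A reporter that uses a couple of these hooks to tally results might look like the sketch below; the event payload shape is a simplified assumption, not the real Skillgym event type:

```ts
// Sketch only: the CaseFinishEvent shape here is an assumption made
// for illustration, not the real Skillgym event type.
interface CaseFinishEvent {
  caseId: string;
  passed: boolean;
}

function makeTallyReporter() {
  const totals = { passed: 0, failed: 0 };
  return {
    totals,
    onCaseFinish(event: CaseFinishEvent) {
      if (event.passed) totals.passed++;
      else totals.failed++;
    },
    onSuiteFinish() {
      console.log(`passed=${totals.passed} failed=${totals.failed}`);
    },
  };
}
```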

Use `json-summary` when another agent or tool needs a smaller result than the full session report.
52 changes: 52 additions & 0 deletions skills/snapshots.md
@@ -0,0 +1,52 @@
---
name: snapshots
description: Token regression snapshots in Skillgym. Covers baseline creation, tolerance checks, update flows, metrics, and when snapshots should be used.
---

# skillgym snapshots

Use this skill when the benchmark should guard token usage regressions.

## Purpose

Snapshots compare the current run against a stored baseline for each `caseId + runner.id` pair.

This is useful for catching regressions caused by:

- skill changes that make the agent do more work
- tool changes that return too much data
- model behavior changes that increase token usage

## Important behavior

- default metric is `totalTokens`
- missing snapshot entries are created automatically
- the run fails when usage exceeds the configured tolerance
- snapshots are cost guards, not functional assertions
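The tolerance check reduces to one comparison. A sketch, assuming percent-over-baseline semantics (the actual tolerance semantics may differ):

```ts
// Sketch: fail when the current metric exceeds baseline by more than
// the tolerance. Percent-over-baseline semantics are an assumption.
function exceedsTolerance(
  baseline: number,
  current: number,
  tolerancePct: number,
): boolean {
  return current > baseline * (1 + tolerancePct / 100);
}
```

With a 5% tolerance, a baseline of 1000 tokens allows up to 1050 before the run fails.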

## Main commands

```bash
skillgym run <suite.ts> --update-snapshots
skillgym run <suite.ts> --snapshots ./skillgym.snapshots.json
```

## Supported metrics

- `totalTokens`
- `inputTokens`
- `outputTokens`
- `reasoningTokens`
- `cacheTokens`

## When to add snapshots

- after the benchmark behavior is already stable
- when you want to catch prompt or tooling cost regressions
- when the selected runner reports the needed token metric reliably

## When not to rely on snapshots alone

- when the case still lacks functional assertions
- when the run is flaky for reasons unrelated to token usage
- when you are still exploring prompt shape and workflow