Merged
7 changes: 7 additions & 0 deletions README.md
@@ -104,6 +104,13 @@ View CLI help:
npx skillgym help
```

List bundled agent-facing skills and read the main one:

```bash
npx skillgym skills list
npx skillgym skills get core
```

By default, `skillgym` uses the built-in `standard` reporter.

TypeScript config, suite, and reporter modules work out of the box on Node `>=22.18.0` using Node's built-in TypeScript stripping.
3 changes: 2 additions & 1 deletion package.json
@@ -26,7 +26,8 @@
},
"files": [
"dist",
"bin.js"
"bin.js",
"skills"
],
"type": "module",
"main": "./dist/index.js",
97 changes: 97 additions & 0 deletions skills/assertions.md
@@ -0,0 +1,97 @@
---
name: assertions
description: Assertion authoring for Skillgym benchmark suites. Covers hard and soft assertions, grouped helpers, skill detection checks, command matching, and failure classification.
---

# skillgym assertions

Use this skill when you are writing or debugging `assert(report, ctx)` logic.

## What Skillgym gives you

`skillgym` exports a root `assert` object that combines Node strict assertions with benchmark-focused helpers.

Available groups:

- `assert.soft.*`
- `assert.skills.*`
- `assert.commands.*`
- `assert.fileReads.*`
- `assert.toolCalls.*`
- `assert.output.*`

## Typical patterns

```ts
import { assert, commandMatcher } from "skillgym";

assert.skills.has(report, "find-skills");
assert.commands.includes(report, commandMatcher("pnpm").arg("test"));
assert.fileReads.includes(report, /README\.md$/);
assert.toolCalls.includes(report, { tool: /skill/i });
assert.match(ctx.finalOutput(), /expo/i);
```

## Hard vs soft assertions

- Use hard assertions when one failure should stop the case immediately.
- Use `assert.soft.*` when you want one run to report multiple mismatches together.

```ts
assert.soft.skills.has(report, "find-skills");
assert.soft.commands.includes(report, "npx skills find");
assert.soft.output.notEmpty(report);
```

## Skill assertions

Use `assert.skills.*` against `report.detectedSkills`.

Most useful methods:

- `has`
- `notHas`
- `includes`
- `count`
- `exactlyOne`
- `only`

Confidence can be filtered with `minConfidence`:

```ts
assert.skills.has(report, "find-skills", { minConfidence: "strong" });
```

## Command assertions

Use raw strings or `RegExp` for quick checks. Use `commandMatcher(...)` for stable matching on executable, args, and options.

```ts
assert.commands.includes(report, "pnpm test");
assert.commands.before(report, /skills find/, /pnpm install/);
assert.commands.includes(report, commandMatcher("pnpm").arg("test").option("--filter", "unit"));
```

## Output assertions

Use output assertions when the final agent response matters directly. Use them together with normalized report assertions, not instead of them.
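As an illustrative sketch of why both layers matter, here is a hand-rolled pairing of a normalized-report check with a wording check. The `Report` shape and function name below are invented for the example, not real Skillgym types:

```ts
// Illustrative only: the Report shape here is an assumption, not the
// real Skillgym type. It pairs a normalized-report check with a
// user-visible wording check on the final output.
interface Report {
  detectedSkills: string[];
  finalOutput: string;
}

function checkInstallCase(report: Report): string[] {
  const failures: string[] = [];
  // Normalized check: the agent must have used the expected skill.
  if (!report.detectedSkills.includes("find-skills")) {
    failures.push("missing skill: find-skills");
  }
  // Wording check: the final answer must actually mention installing.
  if (!/install/i.test(report.finalOutput)) {
    failures.push("final output never mentions install");
  }
  return failures;
}
```

A run that passes the wording check but fails the normalized check still fails, which is the point: prose can look right while the workflow was wrong.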

## Failure classification

Use `assert.classify(...)` when multiple failures should collapse into one stable machine-readable cause.

```ts
assert.classify({ id: "wrong-cli-alias", label: "Wrong CLI alias" }, () => {
assert.doesNotMatch(ctx.finalOutput(), /\bcursr\b/i);
});
```

This improves grouped reporting across runs.
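One way to picture what classification buys you is this toy re-implementation; it is a sketch of the idea, not Skillgym's actual `assert.classify`:

```ts
// Toy sketch of the classification idea: run a block of checks and, if
// any throws, report one stable machine-readable cause instead of the
// raw error. Not the real assert.classify implementation.
interface FailureClass {
  id: string;
  label: string;
}

function classifySketch(
  meta: FailureClass,
  checks: () => void,
): FailureClass | null {
  try {
    checks();
    return null; // all checks passed
  } catch {
    return meta; // collapse any failure into the stable cause
  }
}
```

Because every run that trips the same checks yields the same `id`, failures can be grouped and counted across many runs instead of producing one unique error string per run.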

## Good assertion style

- assert the smallest behavior that proves the benchmark intent
- prefer normalized report fields over fragile prose matching
- use command and tool-call assertions for workflow checks
- use output assertions for user-visible wording checks
- classify repeated failure modes with stable ids
103 changes: 103 additions & 0 deletions skills/core.md
@@ -0,0 +1,103 @@
---
name: core
description: Core Skillgym workflow for agents. Read this first before writing or debugging suites. Covers how to structure a suite, run it, interpret results, and when to read deeper feature-specific skills.
---

# skillgym core

Skillgym benchmarks coding-agent behavior by running real agent sessions and asserting on the normalized execution report.

Read this skill first. It gives you the default workflow, the minimum suite shape, and the map to deeper feature skills.

## Core loop

```bash
skillgym skills get core
skillgym run <suite.ts>
skillgym run <suite.ts> --case <id>
skillgym run <suite.ts> --runner <runner-id>
```

Typical agent loop:

1. Read the target suite or create one.
2. Write or refine prompts and assertions.
3. Run one suite, case, or runner slice.
4. Inspect failures from the output directory and normalized report.
5. Tighten assertions or workspace setup until the benchmark captures the intended behavior.

## Minimum suite shape

```ts
import { assert, type TestCase } from "skillgym";

const suite: TestCase[] = [
{
id: "uses-correct-skill",
prompt: "Find the right skill and explain how to install it.",
assert(report) {
assert.skills.has(report, "find-skills");
assert.match(report.finalOutput, /install/i);
},
},
];

export default suite;
```

Use stable `id` values. Keep prompts exact. Put benchmark intent in assertions, not in prose comments.

## Primary commands

```bash
skillgym help
skillgym skills list
skillgym skills get <name>

skillgym run <suite.ts>
skillgym run <suite.ts> --case <id>
skillgym run <suite.ts> --tag <tag>
skillgym run <suite.ts> --runner <runner-id>
skillgym run <suite.ts> --reporter json-summary
skillgym run <suite.ts> --schedule parallel --max-parallel 4
skillgym run <suite.ts> --update-snapshots
```

## Mental model

- A configured runner is one agent target.
- Each selected case runs once per selected runner.
- Assertions evaluate the normalized session report after the run finishes.
- Output artifacts are written under the configured `outputDir`.
- Expected assertion failures can be benchmark-successful; infrastructure failures are still real failures.
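The case-times-runner fan-out above can be sketched in a few lines; the case and runner ids here are made up for illustration:

```ts
// Sketch of the fan-out: every selected case runs once per selected
// runner. Ids below are illustrative placeholders.
const caseIds = ["uses-correct-skill", "explains-install"];
const runnerIds = ["runner-a", "runner-b"];

const runs = caseIds.flatMap((caseId) =>
  runnerIds.map((runnerId) => ({ caseId, runnerId })),
);
// runs.length === caseIds.length * runnerIds.length
```

This is why narrowing with `--case` or `--runner` matters: without filters, the run count is the full product of cases and runners.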

## When to read deeper skills

Read the focused skills only when the task needs them:

- `skillgym skills get test-cases`
Use when creating or reshaping suite files, tags, expected failures, or per-case timeouts.
- `skillgym skills get assertions`
Use when writing pass/fail logic against skills, commands, tool calls, output, or failure classes.
- `skillgym skills get workspaces`
Use when the agent needs isolated filesystem state, template repos, or bootstrap commands.
- `skillgym skills get snapshots`
Use when benchmarking token regressions or updating snapshot baselines.
- `skillgym skills get reporters`
Use when choosing built-in reporters or wiring a custom reporter.

## Suggested authoring order

1. Start with one small case and one runner.
2. Make the assertion explicit and narrow.
3. Add tags or expected-failure behavior only after the baseline case works.
4. Add workspace isolation when shared state can affect the benchmark.
5. Add snapshots when behavior is stable enough to guard token regressions.

## Common mistakes

- asserting on vague output instead of checking the normalized report
- trying to select runners inside `TestCase` instead of config plus CLI filters
- using shared workspaces for stateful tasks that need isolation
- treating snapshot mismatches like functional failures instead of cost regressions
- writing one huge suite before proving one small representative case
64 changes: 64 additions & 0 deletions skills/reporters.md
@@ -0,0 +1,64 @@
---
name: reporters
description: Skillgym reporter selection and customization. Covers built-in reporters, CLI/config selection, lifecycle hooks, and when to use machine-readable output.
---

# skillgym reporters

Use this skill when choosing how benchmark results should be rendered or consumed.

## Built-in reporters

- `standard`
- `json`
- `json-summary`
- `github-actions`

## Main commands

```bash
skillgym run <suite.ts> --reporter standard
skillgym run <suite.ts> --reporter json
skillgym run <suite.ts> --reporter json-summary
skillgym run <suite.ts> --reporter github-actions
skillgym run <suite.ts> --reporter ./path/to/custom-reporter.ts
```

## Selection rules

- omitting `--reporter` uses the built-in `standard` reporter
- CLI `--reporter` overrides config `run.reporter`
- relative custom reporter paths resolve from `process.cwd()` on CLI input
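The precedence above amounts to a simple fallback chain. A sketch (the function name is illustrative, not a Skillgym export):

```ts
// Illustrative precedence chain: the CLI flag beats the config value,
// and the config value beats the built-in default.
function resolveReporter(cliFlag?: string, configValue?: string): string {
  return cliFlag ?? configValue ?? "standard";
}
```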

## When to use each built-in reporter

- `standard`: default interactive CLI output for humans
- `json`: full aggregated result on stdout for machine consumers
- `json-summary`: trimmed result for post-processing or LLM consumption
- `github-actions`: CI annotations and job summary output

## Custom reporter shape

```ts
import type { BenchmarkReporter } from "skillgym";

const reporter: BenchmarkReporter = {
onSuiteFinish(event) {
console.log(event.result.outputDir);
},
};

export default reporter;
```

## Reporter lifecycle hooks

- `onSuiteStart`
- `onCaseStart`
- `onRunnerStart`
- `onRunnerFinish`
- `onCaseFinish`
- `onSuiteFinish`
- `onError`
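A reporter that uses a couple of these hooks to tally results might look like the sketch below; the event payload shape is a simplified assumption, not the real Skillgym event type:

```ts
// Sketch only: the CaseFinishEvent shape here is an assumption made
// for illustration, not the real Skillgym event type.
interface CaseFinishEvent {
  caseId: string;
  passed: boolean;
}

function makeTallyReporter() {
  const totals = { passed: 0, failed: 0 };
  return {
    totals,
    onCaseFinish(event: CaseFinishEvent) {
      if (event.passed) totals.passed++;
      else totals.failed++;
    },
    onSuiteFinish() {
      console.log(`passed=${totals.passed} failed=${totals.failed}`);
    },
  };
}
```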

Use `json-summary` when another agent or tool needs a smaller result than the full session report.
52 changes: 52 additions & 0 deletions skills/snapshots.md
@@ -0,0 +1,52 @@
---
name: snapshots
description: Token regression snapshots in Skillgym. Covers baseline creation, tolerance checks, update flows, metrics, and when snapshots should be used.
---

# skillgym snapshots

Use this skill when the benchmark should guard token usage regressions.

## Purpose

Snapshots compare the current run against a stored baseline for each `caseId + runner.id` pair.

This is useful for catching regressions caused by:

- skill changes that make the agent do more work
- tool changes that return too much data
- model behavior changes that increase token usage

## Important behavior

- default metric is `totalTokens`
- missing snapshot entries are created automatically
- the run fails when usage exceeds the configured tolerance
- snapshots are cost guards, not functional assertions
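The tolerance check reduces to one comparison. A sketch, assuming percent-over-baseline semantics (the actual tolerance semantics may differ):

```ts
// Sketch: fail when the current metric exceeds baseline by more than
// the tolerance. Percent-over-baseline semantics are an assumption.
function exceedsTolerance(
  baseline: number,
  current: number,
  tolerancePct: number,
): boolean {
  return current > baseline * (1 + tolerancePct / 100);
}
```

With a 5% tolerance, a baseline of 1000 tokens allows up to 1050 before the run fails.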

## Main commands

```bash
skillgym run <suite.ts> --update-snapshots
skillgym run <suite.ts> --snapshots ./skillgym.snapshots.json
```

## Supported metrics

- `totalTokens`
- `inputTokens`
- `outputTokens`
- `reasoningTokens`
- `cacheTokens`

## When to add snapshots

- after the benchmark behavior is already stable
- when you want to catch prompt or tooling cost regressions
- when the selected runner reports the needed token metric reliably

## When not to rely on snapshots alone

- when the case still lacks functional assertions
- when the run is flaky for reasons unrelated to token usage
- when you are still exploring prompt shape and workflow