skills-evals

Ablation evals for the Trigger.dev agent skills that ship bundled in the trigger.dev CLI. The harness measures whether installing the skills actually changes what an AI coding agent writes, and guards against regressions when the skills (or the SDK they describe) change.

Why

The skills exist to make AI assistants write correct Trigger.dev code. This harness checks that claim empirically by running a headless agent with and without the skills installed and comparing the output. It tests the published package (a pkg.pr.new preview or a release), not the source tree — which is how it caught a loader bug that made the entire skills feature silently no-op from the installed CLI while every source-run e2e passed.

Organizing principle: the skill is the spec

Each skill enumerates its own "Common mistakes". Those become the eval's assertions, so the eval stays in sync with the skill: add a mistake to a skill, add a rule here. Rules cross-reference the skill mistake they map to (e.g. authoring-tasks#2).
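
To make the skill-to-rule mapping concrete, here is a minimal sketch of what an anti-pattern rule could look like. The AntiPatternRule shape, the findViolations helper, and the rule id are hypothetical and not taken from the harness; the skillRef value is a placeholder, since the real mistake number lives in the skill. The flagged import path is the @trigger.dev/sdk/v3 slip described under Findings.

```typescript
// Hypothetical rule shape: each rule points back at the skill mistake it encodes.
interface AntiPatternRule {
  id: string;
  skillRef: string; // "<skill>#<mistake number>", in the style of "authoring-tasks#2"
  pattern: RegExp;
  message: string;
}

const exampleRules: AntiPatternRule[] = [
  {
    id: "no-sdk-v3-import", // illustrative id, not from the harness
    skillRef: "getting-started#N", // placeholder: N is whatever number the skill assigns
    pattern: /from\s+["']@trigger\.dev\/sdk\/v3["']/,
    message: "unexpected import from @trigger.dev/sdk/v3",
  },
];

interface Violation {
  ruleId: string;
  skillRef: string;
  line: number; // 1-based line number in the scanned file
}

// Scan a source file line by line and report every rule that matches.
function findViolations(source: string, rules: AntiPatternRule[]): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    for (const rule of rules) {
      if (rule.pattern.test(text)) {
        violations.push({ ruleId: rule.id, skillRef: rule.skillRef, line: index + 1 });
      }
    }
  });
  return violations;
}
```

Keeping skillRef on the rule means a failing assertion can name the exact skill entry it was derived from, which is what keeps the eval and the skill in sync.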

How it works

For each (scenario × arm × sample):

Isolate — copy the scenario fixture into a temp dir.
Provision — npm i the trigger.dev CLI + @trigger.dev/sdk from the configured preview, in both arms. The only variable between arms is whether the withskills arm then runs trigger skills.
Run — claude -p "<prompt>" --permission-mode acceptEdits in the temp dir.
Grade — static assertions (structural checks + anti-pattern rules) and cost (turns / tokens / $, parsed from claude -p --output-format json).
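
The (scenario × arm × sample) grid above can be sketched as a simple enumeration. This is an illustration only: the Cell type and the helper names are hypothetical, the arm name "baseline" is an assumption (the text only names withskills), and numbering samples from 1 is a guess. The <arm>-<sample> directory naming comes from the Usage section.

```typescript
// One unit of work: a single agent run for one scenario, arm, and sample index.
interface Cell {
  scenario: string;
  arm: string;
  sample: number;
}

// Expand the full grid; every scenario is run in every arm, `samples` times.
function enumerateCells(scenarios: string[], arms: string[], samples: number): Cell[] {
  const cells: Cell[] = [];
  for (const scenario of scenarios) {
    for (const arm of arms) {
      for (let sample = 1; sample <= samples; sample++) {
        cells.push({ scenario, arm, sample });
      }
    }
  }
  return cells;
}

// Run directories are named <arm>-<sample>.
function runDirName(cell: Cell): string {
  return `${cell.arm}-${cell.sample}`;
}

// Inverse of runDirName, useful when re-grading existing run dirs.
// Returns null for names that do not end in "-<digits>".
function parseRunDirName(dir: string): { arm: string; sample: number } | null {
  const match = /^(.+)-(\d+)$/.exec(dir);
  return match ? { arm: match[1], sample: Number(match[2]) } : null;
}
```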

The headline is the ablation delta: the change in pass_rate, anti-pattern count, and cost between the two arms. Agents are nondeterministic, so each (scenario, arm) cell is run samples times (the sample count set in evals.config.ts) and its results are averaged.
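
As a rough sketch of that aggregation, the following averages the runs in each arm and subtracts the baseline from the with-skills arm. The RunResult and ArmSummary field names are assumptions for illustration; the harness's real result schema may differ.

```typescript
// Hypothetical per-run result; field names are illustrative.
interface RunResult {
  passed: boolean;
  antiPatterns: number;
  turns: number;
  costUsd: number;
}

// Per-arm averages over all samples of one scenario.
interface ArmSummary {
  passRate: number;
  antiPatterns: number;
  turns: number;
  costUsd: number;
}

function mean(values: number[]): number {
  return values.reduce((sum, value) => sum + value, 0) / values.length;
}

function summarize(runs: RunResult[]): ArmSummary {
  if (runs.length === 0) throw new Error("cannot summarize an empty cell");
  return {
    passRate: mean(runs.map((run) => (run.passed ? 1 : 0))),
    antiPatterns: mean(runs.map((run) => run.antiPatterns)),
    turns: mean(runs.map((run) => run.turns)),
    costUsd: mean(runs.map((run) => run.costUsd)),
  };
}

// Delta = with-skills minus baseline; negative turns/cost means the skills arm was cheaper.
function ablationDelta(base: ArmSummary, skills: ArmSummary): ArmSummary {
  return {
    passRate: skills.passRate - base.passRate,
    antiPatterns: skills.antiPatterns - base.antiPatterns,
    turns: skills.turns - base.turns,
    costUsd: skills.costUsd - base.costUsd,
  };
}
```

With only a handful of samples per cell, a plain mean is sensitive to a single timed-out run, so the deltas are best read as directional.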

In practice the clearest signal is often cost, not pass/fail: a strong agent with the real SDK already in node_modules will frequently match on correctness, but with the skill installed it needs substantially fewer turns and tokens (early runs: ~40% fewer) because it doesn't have to reverse-engineer the API from node_modules. Note also that skills load on demand, so a task spanning two skills' domains can still slip a mistake in the sub-domain whose skill the agent didn't load.

Findings (early)

Small N (2-4 samples), so these are directional, not tight numbers. The consistent story:

1. The skills' measurable value is cost and reliability, not pass/fail. A strong agent with the real SDK already in node_modules ties on correctness, but with skills installed it does the same work in markedly fewer turns and tokens:

Scenario	Pass (base → skills)	Mean turns (base → skills)	Mean cost (base → skills)
`getting-started`	2/2 → 2/2	17.5 → 11.0 (~37% less)	$0.80 → $0.47 (~41% less)
`authoring-chat-agent`	1/2 → 1/2	31.0 → 22.0 (~29% less)	$1.85 → $1.05 (~43% less)
`cross-skill-setup` (N=4)	3/4 → 4/4	27.7 → 18.0 (~35% less)	$1.76 → $1.10 (~38% less)
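
The "% less" figures in the table are relative reductions against the baseline arm. A small sketch of that arithmetic, using the getting-started row (percentLess is a hypothetical helper, not part of the harness):

```typescript
// Relative reduction of the with-skills arm against the baseline, as a whole percent.
function percentLess(base: number, withSkills: number): number {
  if (base <= 0) throw new Error("base must be positive");
  return Math.round(((base - withSkills) / base) * 100);
}

// getting-started row: turns 17.5 -> 11.0, cost $0.80 -> $0.47
const gettingStartedTurns = percentLess(17.5, 11.0); // 37
const gettingStartedCost = percentLess(0.8, 0.47); // 41
```

Because the table shows already-rounded means, recomputing a percentage from the displayed values can land a point away from the published figure.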

Reliability too: in cross-skill-setup a baseline run hit the timeout and failed, while all with-skills runs finished. Pass/fail alone showed a near-zero delta, which would have suggested the skills don't matter; cost shows they clearly do.

2. The on-demand-loading gap is incidental-only. When a task explicitly asks for config setup, both arms write the correct import (0/8 runs in cross-skill-setup slipped to @trigger.dev/sdk/v3). The slip only happens when config is incidental to a chat-only prompt: the agent doesn't load getting-started for a sub-task the prompt never mentions, and then both arms slip equally. So installing skills doesn't help with work the agent never thinks to load a skill for. Product signal: add a stronger always-on pointer, or fold config basics into more skills.

Usage

npm install

# run all scenarios
npm run eval

# run specific scenarios
npm run eval -- getting-started authoring-tasks

# re-grade already-produced run dirs (named <arm>-<sample>) without re-running agents
npm run eval -- grade getting-started /path/to/runs

npm test          # unit tests for the grading rules
npm run typecheck

Configure the package under test, sample count, and arms in evals.config.ts. Point packageVersion at the pkg.pr.new commit sha (PR preview) or a published version whose skills you want to evaluate.
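
For orientation, a config along these lines might look like the sketch below. Only packageVersion, the sample count, and the arms are mentioned in this README; the exact field names, types, the "baseline" arm name, and the validateConfig helper are assumptions, and the real evals.config.ts presumably exports its config rather than declaring a bare constant.

```typescript
// Hypothetical shape for evals.config.ts; check the real file for the actual fields.
interface EvalsConfig {
  packageVersion: string; // pkg.pr.new commit sha (PR preview) or a published version
  samples: number; // how many times each (scenario, arm) cell is run
  arms: string[]; // which arms to run
}

const evalsConfig: EvalsConfig = {
  packageVersion: "<commit-sha-or-version>", // placeholder, fill in before running
  samples: 2,
  arms: ["baseline", "withskills"],
};

// Minimal sanity check; returns a list of human-readable problems.
function validateConfig(config: EvalsConfig): string[] {
  const problems: string[] = [];
  if (config.packageVersion.trim() === "") problems.push("packageVersion is empty");
  if (!Number.isInteger(config.samples) || config.samples < 1) {
    problems.push("samples must be a positive integer");
  }
  if (config.arms.length === 0) problems.push("at least one arm is required");
  return problems;
}
```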

Scenarios

One per skill (scenarios/<id>/scenario.ts), each pairing a fixture, a prompt, and a set of assertions:

Scenario	Skill	Status
`getting-started`	bootstrap Trigger.dev into a project	full
`authoring-tasks`	write tasks (retries, triggering, results)	full
`realtime-and-frontend`	live run status in React	WIP starter assertions
`authoring-chat-agent`	durable `chat.agent`	WIP starter assertions
`chat-agent-advanced`	HITL / sub-agents / sessions	WIP starter assertions
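
A scenario pairs a fixture, a prompt, and assertions. The sketch below shows one plausible way to model that; the Scenario and ScenarioAssertion types, the gradeScenario helper, the fixture path, and the example prompt are all hypothetical and not copied from scenarios/<id>/scenario.ts.

```typescript
// Files written by the agent, keyed by path relative to the run dir.
type WorkspaceFiles = Record<string, string>;

// Hypothetical assertion: a named structural check over the resulting files.
interface ScenarioAssertion {
  id: string;
  check: (files: WorkspaceFiles) => boolean;
}

interface Scenario {
  id: string;
  fixture: string; // directory copied into the temp dir before the run
  prompt: string; // passed to the headless agent
  assertions: ScenarioAssertion[];
}

const exampleScenario: Scenario = {
  id: "getting-started",
  fixture: "fixtures/getting-started", // illustrative path
  prompt: "Set up Trigger.dev in this project.", // illustrative prompt
  assertions: [
    {
      id: "has-trigger-config",
      check: (files) => "trigger.config.ts" in files,
    },
  ],
};

// A run passes only if every assertion holds; failed ids are reported for debugging.
function gradeScenario(scenario: Scenario, files: WorkspaceFiles): { passed: boolean; failed: string[] } {
  const failed = scenario.assertions
    .filter((assertion) => !assertion.check(files))
    .map((assertion) => assertion.id);
  return { passed: failed.length === 0, failed };
}
```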

Scope and known limitations (v1)

Static only. v1 grades by typecheck-shape and anti-pattern scanning. It does not run trigger dev or execute tasks (the "boots" / "executes" tiers). Those need a seeded test project + a DEV key, which is the same headless-auth problem tracked against the remote MCP / agent-auth work.
--permission-mode acceptEdits is used deliberately: sufficient for file-writing scenarios, keeps every other gate on, and enforces the "don't run commands" constraint. --dangerously-skip-permissions is intentionally avoided.
Not yet hermetic. Runs inherit the user-global ~/.claude/CLAUDE.md. It is constant across arms (so it does not bias the delta), but a future version should isolate the agent's config for full reproducibility.
