Ablation evals for the Trigger.dev agent skills that ship bundled in the `trigger.dev`
CLI. The harness measures whether installing the skills actually changes what an AI coding
agent writes, and guards against regressions when the skills (or the SDK they describe) change.
The skills exist to make AI assistants write correct Trigger.dev code. This harness checks
that claim empirically by running a headless agent with and without the skills
installed and comparing the output. It tests the published package (a pkg.pr.new
preview or a release), not the source tree — which is how it caught a loader bug that made
the entire skills feature silently no-op from the installed CLI while every source-run e2e
passed.
Each skill enumerates its own "Common mistakes". Those become the eval's assertions, so
the eval stays in sync with the skill: add a mistake to a skill, add a rule here. Rules
cross-reference the skill mistake they map to (e.g. authoring-tasks#2).
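A rule of this kind might look roughly like the sketch below. The `AntiPatternRule` shape, the `scan` helper, and the `getting-started#1` reference number are all hypothetical, chosen for illustration; only the `@trigger.dev/sdk/v3` import slip itself is taken from this README.

```typescript
// Hypothetical rule shape; the harness's real types may differ.
interface AntiPatternRule {
  id: string;
  // "<skill>#<n>": which "Common mistakes" entry this rule maps to.
  skillRef: string;
  pattern: RegExp;
  message: string;
}

interface Violation {
  ruleId: string;
  skillRef: string;
  line: number; // 1-based line number in the scanned source
  message: string;
}

const rules: AntiPatternRule[] = [
  {
    id: "no-v3-import",
    skillRef: "getting-started#1", // placeholder number, not read from the real skill
    pattern: /from\s+["']@trigger\.dev\/sdk\/v3["']/,
    message: "imports from @trigger.dev/sdk/v3",
  },
];

// Scan one source file line by line and report every rule that matches.
function scan(source: string, ruleSet: AntiPatternRule[] = rules): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    for (const rule of ruleSet) {
      if (rule.pattern.test(text)) {
        violations.push({
          ruleId: rule.id,
          skillRef: rule.skillRef,
          line: index + 1,
          message: rule.message,
        });
      }
    }
  });
  return violations;
}
```

Keeping `skillRef` on every violation means a failing run points straight back at the skill entry it contradicts.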
For each (scenario × arm × sample):
- Isolate: copy the scenario fixture into a temp dir.
- Provision: `npm i` the `trigger.dev` CLI + `@trigger.dev/sdk` from the configured preview, in both arms. The only variable between arms is whether the `withskills` arm then runs `trigger skills`.
- Run: `claude -p "<prompt>" --permission-mode acceptEdits` in the temp dir.
- Grade: static assertions (structural checks + anti-pattern rules) and cost (turns / tokens / $, parsed from `claude -p --output-format json`).
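The cost half of the Grade step could be sketched as below. The JSON field names (`num_turns`, `total_cost_usd`, `usage.input_tokens`, `usage.output_tokens`) are assumptions about the CLI's output and should be checked against the installed version; `parseCost` and `RunCost` are hypothetical names.

```typescript
interface RunCost {
  turns: number;
  tokens: number;
  usd: number;
}

// Treat anything that is not a finite number as 0 so a missing field
// degrades the cost report instead of crashing the grader.
function num(value: unknown): number {
  return typeof value === "number" && Number.isFinite(value) ? value : 0;
}

// Parse the stdout of a headless agent run into turns / tokens / $.
// Field names are assumed, not confirmed by this README.
function parseCost(stdout: string): RunCost {
  const parsed = JSON.parse(stdout) as Record<string, unknown>;
  const usage = (parsed["usage"] ?? {}) as Record<string, unknown>;
  return {
    turns: num(parsed["num_turns"]),
    tokens: num(usage["input_tokens"]) + num(usage["output_tokens"]),
    usd: num(parsed["total_cost_usd"]),
  };
}
```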
The headline is the ablation delta: the change in `pass_rate`, anti-pattern count, and
cost between arms. Agents are nondeterministic, so each cell runs `samples` times and
reports an average.
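As a rough illustration of that aggregation, the sketch below averages each cell's samples and subtracts the baseline arm from the with-skills arm. The type and function names are hypothetical and the harness may aggregate differently.

```typescript
interface Sample {
  passed: boolean;
  antiPatterns: number;
  turns: number;
  usd: number;
}

interface CellSummary {
  passRate: number;
  antiPatterns: number;
  turns: number;
  usd: number;
}

const mean = (xs: number[]): number =>
  xs.length === 0 ? 0 : xs.reduce((a, b) => a + b, 0) / xs.length;

// Average one (scenario, arm) cell over its samples.
function summarize(samples: Sample[]): CellSummary {
  return {
    passRate: mean(samples.map((s) => (s.passed ? 1 : 0))),
    antiPatterns: mean(samples.map((s) => s.antiPatterns)),
    turns: mean(samples.map((s) => s.turns)),
    usd: mean(samples.map((s) => s.usd)),
  };
}

// Ablation delta: with-skills minus baseline, so negative cost values mean "cheaper".
function delta(base: CellSummary, skills: CellSummary): CellSummary {
  return {
    passRate: skills.passRate - base.passRate,
    antiPatterns: skills.antiPatterns - base.antiPatterns,
    turns: skills.turns - base.turns,
    usd: skills.usd - base.usd,
  };
}

// Relative saving, as a percentage of the baseline value.
function percentLess(base: number, withSkills: number): number {
  return base === 0 ? 0 : ((base - withSkills) / base) * 100;
}
```

For example, `percentLess(17.5, 11.0)` comes to about 37, which is how a drop from 17.5 to 11.0 average turns reads as "~37% less".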
In practice the clearest signal is often cost, not pass/fail: a strong agent with the
real SDK already in `node_modules` will frequently match on correctness, but the skill
still cuts turns and tokens substantially (early runs: ~40% fewer) because the agent doesn't
have to reverse-engineer the API from `node_modules`. Note also that skills load on demand,
so a task spanning two skills' domains can still slip a mistake in the sub-domain whose
skill the agent didn't load.
The results so far come from small N (2-4 samples), so treat them as directional rather than tight numbers. The consistent story:
1. The skills' measurable value is cost and reliability, not pass/fail. A strong agent
with the real SDK already in node_modules ties on correctness, but with skills installed
it does the same work in markedly fewer turns and tokens:
| Scenario | pass (base → skills) | turns | cost |
|---|---|---|---|
| getting-started | 2/2 → 2/2 | 17.5 → 11.0 (~37% less) | $0.80 → $0.47 (~41% less) |
| authoring-chat-agent | 1/2 → 1/2 | 31.0 → 22.0 (~29% less) | $1.85 → $1.05 (~43% less) |
| cross-skill-setup (N=4) | 3/4 → 4/4 | 27.7 → 18.0 (~35% less) | $1.76 → $1.10 (~38% less) |
Reliability too: in cross-skill-setup a baseline run hit the timeout and failed, while all
with-skills runs finished. Pass/fail alone showed ~zero delta and would have concluded the
skills don't matter; cost shows they clearly do.
2. The on-demand-loading gap is incidental-only. When a task explicitly asks for config
setup, both arms write the correct import (0/8 slipped @trigger.dev/sdk/v3 in
cross-skill-setup). The slip only happens when config is incidental to a chat-only prompt
(the agent doesn't load getting-started for an unmentioned sub-task) and then both arms
slip equally. So installing skills doesn't help work the agent never thinks to load a skill
for. Product signal: a stronger always-on pointer, or folding config basics into more skills.
```bash
npm install
# run all scenarios
npm run eval
# run specific scenarios
npm run eval -- getting-started authoring-tasks
# re-grade already-produced run dirs (named <arm>-<sample>) without re-running agents
npm run eval -- grade getting-started /path/to/runs
npm test   # unit tests for the grading rules
npm run typecheck
```
Configure the package under test, sample count, and arms in `evals.config.ts`.
Point `packageVersion` at the pkg.pr.new commit sha (PR preview) or a published version
whose skills you want to evaluate.
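A config along these lines is one plausible shape. Only `packageVersion` is named in this README; the `samples` and `arms` field names, the `baseline` arm label, and the `EvalsConfig` type are assumptions, and the version string is a placeholder to fill in.

```typescript
// evals.config.ts (illustrative sketch, not the real file)
interface EvalsConfig {
  // pkg.pr.new commit sha (PR preview) or a published version.
  packageVersion: string;
  // How many times each (scenario, arm) cell runs.
  samples: number;
  // Arms to compare; "baseline" is an assumed name for the no-skills arm.
  arms: string[];
}

const config: EvalsConfig = {
  packageVersion: "<commit-sha-or-version>", // placeholder, replace before running
  samples: 2,
  arms: ["baseline", "withskills"],
};

export default config;
```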
There is one scenario per skill (`scenarios/<id>/scenario.ts`), each pairing a fixture, a
prompt, and a set of assertions:
| Scenario | Skill | Status |
|---|---|---|
| getting-started | bootstrap Trigger.dev into a project | full |
| authoring-tasks | write tasks (retries, triggering, results) | full |
| realtime-and-frontend | live run status in React | WIP starter assertions |
| authoring-chat-agent | durable chat.agent | WIP starter assertions |
| chat-agent-advanced | HITL / sub-agents / sessions | WIP starter assertions |
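To make the fixture / prompt / assertions pairing concrete, a scenario file might be sketched as follows. The `Scenario` and `Assertion` types, the fixture path, the prompt text, and the `trigger.config.ts` check are all hypothetical; the real `scenario.ts` files may be structured differently.

```typescript
// Relative path -> file contents, as left in the temp dir after the agent run.
type Files = Record<string, string>;

interface Assertion {
  name: string;
  check: (files: Files) => boolean;
}

interface Scenario {
  id: string;
  fixture: string; // directory copied into the temp dir
  prompt: string; // passed to the headless agent
  assertions: Assertion[];
}

const scenario: Scenario = {
  id: "getting-started",
  fixture: "scenarios/getting-started/fixture", // assumed layout
  prompt: "Set up Trigger.dev in this project.", // placeholder prompt
  assertions: [
    {
      // Structural check: the agent created a config file (assumed file name).
      name: "has trigger.config.ts",
      check: (files) => "trigger.config.ts" in files,
    },
    {
      // Anti-pattern check: no file uses the /v3 import path.
      name: "no @trigger.dev/sdk/v3 import",
      check: (files) =>
        Object.values(files).every((src) => !src.includes("@trigger.dev/sdk/v3")),
    },
  ],
};

// A sample passes only if every assertion holds; failures are listed by name.
function grade(s: Scenario, files: Files): { passed: boolean; failures: string[] } {
  const failures = s.assertions.filter((a) => !a.check(files)).map((a) => a.name);
  return { passed: failures.length === 0, failures };
}
```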
- Static only. v1 grades by typecheck-shape and anti-pattern scanning. It does not run `trigger dev` or execute tasks (the "boots" / "executes" tiers). Those need a seeded test project + a DEV key, which is the same headless-auth problem tracked against the remote MCP / agent-auth work.
- `--permission-mode acceptEdits` is used deliberately: sufficient for file-writing scenarios, keeps every other gate on, and enforces the "don't run commands" constraint. `--dangerously-skip-permissions` is intentionally avoided.
- Not yet hermetic. Runs inherit the user-global `~/.claude/CLAUDE.md`. It is constant across arms (so it does not bias the delta), but a future version should isolate the agent's config for full reproducibility.