triggerdotdev/skills-evals

skills-evals

Ablation evals for the Trigger.dev agent skills that ship bundled in the trigger.dev CLI. It measures whether installing the skills actually changes what an AI coding agent writes, and guards against regressions when the skills (or the SDK they describe) change.

Why

The skills exist to make AI assistants write correct Trigger.dev code. This harness checks that claim empirically by running a headless agent with and without the skills installed and comparing the output. It tests the published package (a pkg.pr.new preview or a release), not the source tree — which is how it caught a loader bug that made the entire skills feature silently no-op from the installed CLI while every source-run e2e passed.

Organizing principle: the skill is the spec

Each skill enumerates its own "Common mistakes". Those become the eval's assertions, so the eval stays in sync with the skill: add a mistake to a skill, add a rule here. Rules cross-reference the skill mistake they map to (e.g. authoring-tasks#2).
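A minimal sketch of what a rule tied to a skill mistake could look like. The AntiPatternRule shape, the field names, and the mistake number in skillRef are assumptions for illustration, not the repo's actual types; the @trigger.dev/sdk/v3 import is the slip discussed under Findings.

```typescript
// Hypothetical rule shape: every name here is illustrative, not the harness's real API.
interface AntiPatternRule {
  id: string;
  skillRef: string; // "<skill>#<n>", pointing at the skill's "Common mistakes" entry
  pattern: RegExp;  // no "g" flag, so .test() stays stateless across calls
  message: string;
}

const deprecatedV3Import: AntiPatternRule = {
  id: "no-sdk-v3-import",
  skillRef: "getting-started#1", // assumed number, for illustration only
  pattern: /from\s+["']@trigger\.dev\/sdk\/v3["']/,
  message: "Import from @trigger.dev/sdk instead of @trigger.dev/sdk/v3",
};

// Return every rule whose pattern appears in the given source text.
function findViolations(source: string, rules: AntiPatternRule[]): AntiPatternRule[] {
  return rules.filter((rule) => rule.pattern.test(source));
}
```

Keeping skillRef on the rule is what lets a failing assertion point back at the exact mistake the skill documents.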

How it works

For each (scenario × arm × sample):

  1. Isolate: copy the scenario fixture into a temp dir.
  2. Provision: npm i the trigger.dev CLI + @trigger.dev/sdk from the configured preview, in both arms. The only variable between arms is whether the withskills arm then runs trigger skills.
  3. Run: claude -p "<prompt>" --permission-mode acceptEdits in the temp dir.
  4. Grade: static assertions (structural checks + anti-pattern rules) and cost (turns / tokens / $, parsed from claude -p --output-format json).
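The (scenario × arm × sample) grid can be sketched as a plain enumeration. This is an illustration only: the Cell type and enumerateCells function are hypothetical, and "baseline" is an assumed name for the arm without skills (the text only names withskills).

```typescript
// "withskills" comes from the README; "baseline" is an assumed name for the other arm.
type Arm = "baseline" | "withskills";

interface Cell {
  scenario: string;
  arm: Arm;
  sample: number; // 1-based sample index within the (scenario, arm) cell
}

// Expand the grid in a stable order: scenario, then arm, then sample.
function enumerateCells(scenarios: string[], arms: Arm[], samples: number): Cell[] {
  const cells: Cell[] = [];
  for (const scenario of scenarios) {
    for (const arm of arms) {
      for (let sample = 1; sample <= samples; sample++) {
        cells.push({ scenario, arm, sample });
      }
    }
  }
  return cells;
}
```

Each resulting cell would then go through the isolate, provision, run, and grade steps independently, in its own temp dir.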

The headline is the ablation delta: the change in pass_rate, anti-pattern count, and cost between arms. Agents are nondeterministic, so each cell runs samples times (the sample count set in evals.config.ts) and reports an average.
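As a rough sketch of the arithmetic, averaging per arm and then differencing might look like the following. The RunResult and ArmSummary shapes and all function names are assumptions, not the harness's actual code.

```typescript
// Hypothetical per-run record and per-arm summary; field names are illustrative.
interface RunResult { passed: boolean; antiPatterns: number; turns: number; costUsd: number; }
interface ArmSummary { passRate: number; antiPatterns: number; turns: number; costUsd: number; }

function mean(values: number[]): number {
  if (values.length === 0) throw new Error("cannot average zero samples");
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Average every metric over the samples of one arm.
function summarize(runs: RunResult[]): ArmSummary {
  return {
    passRate: mean(runs.map((r) => (r.passed ? 1 : 0))),
    antiPatterns: mean(runs.map((r) => r.antiPatterns)),
    turns: mean(runs.map((r) => r.turns)),
    costUsd: mean(runs.map((r) => r.costUsd)),
  };
}

// Ablation delta: with-skills minus baseline, metric by metric.
function ablationDelta(baseline: ArmSummary, withSkills: ArmSummary): ArmSummary {
  return {
    passRate: withSkills.passRate - baseline.passRate,
    antiPatterns: withSkills.antiPatterns - baseline.antiPatterns,
    turns: withSkills.turns - baseline.turns,
    costUsd: withSkills.costUsd - baseline.costUsd,
  };
}

// Relative change, e.g. -0.37 means 37% less than the baseline value.
function relativeChange(before: number, after: number): number {
  return (after - before) / before;
}
```

For example, relativeChange(17.5, 11.0) is about -0.37, which is how a drop from 17.5 to 11.0 average turns reads as roughly 37% fewer.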

In practice the clearest signal is often cost, not pass/fail: a strong agent with the real SDK already in node_modules will frequently match on correctness, but the skill still cuts turns and tokens substantially (early runs: ~40% fewer) because the agent doesn't have to reverse-engineer the API from the installed package. Note also that skills load on demand, so a task spanning two skills' domains can still slip a mistake in the sub-domain whose skill the agent didn't load.
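A hedged sketch of extracting the cost metrics from the agent's JSON output. The field names read here (num_turns, total_cost_usd, usage.input_tokens, usage.output_tokens) are assumptions about the claude -p --output-format json result object and should be checked against the CLI version in use; CostMetrics and parseCost are hypothetical names.

```typescript
interface CostMetrics { turns: number; inputTokens: number; outputTokens: number; costUsd: number; }

// Treat anything that is not a number as 0 rather than failing the whole grade.
function asNumber(value: unknown): number {
  return typeof value === "number" ? value : 0;
}

// Parse the single JSON object printed by the agent run.
// The keys below are assumed, not confirmed by this README.
function parseCost(stdout: string): CostMetrics {
  const parsed: unknown = JSON.parse(stdout);
  if (typeof parsed !== "object" || parsed === null) {
    throw new Error("expected a JSON object from the agent run");
  }
  const record = parsed as Record<string, unknown>;
  const usage = (typeof record.usage === "object" && record.usage !== null
    ? record.usage
    : {}) as Record<string, unknown>;
  return {
    turns: asNumber(record.num_turns),
    inputTokens: asNumber(usage.input_tokens),
    outputTokens: asNumber(usage.output_tokens),
    costUsd: asNumber(record.total_cost_usd),
  };
}
```

Defaulting missing fields to 0 keeps a run gradeable on its static assertions even when the cost payload is incomplete, at the price of hiding a schema change; a stricter variant could throw instead.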

Findings (early)

Small N (2-4 samples), so these are directional, not tight numbers. The consistent story:

1. The skills' measurable value is cost and reliability, not pass/fail. A strong agent with the real SDK already in node_modules ties on correctness, but with skills installed it does the same work in markedly fewer turns and tokens:

| Scenario | Pass (base → skills) | Turns | Cost |
| --- | --- | --- | --- |
| getting-started | 2/2 → 2/2 | 17.5 → 11.0 (~37% less) | $0.80 → $0.47 (~41% less) |
| authoring-chat-agent | 1/2 → 1/2 | 31.0 → 22.0 (~29% less) | $1.85 → $1.05 (~43% less) |
| cross-skill-setup (N=4) | 3/4 → 4/4 | 27.7 → 18.0 (~35% less) | $1.76 → $1.10 (~38% less) |

Reliability too: in cross-skill-setup a baseline run hit the timeout and failed, while all with-skills runs finished. Pass/fail alone showed ~zero delta and would have concluded the skills don't matter; cost shows they clearly do.

2. The on-demand-loading gap is incidental-only. When a task explicitly asks for config setup, both arms write the correct import (0/8 runs slipped to @trigger.dev/sdk/v3 in cross-skill-setup). The slip only happens when config is incidental to a chat-only prompt (the agent doesn't load getting-started for an unmentioned sub-task), and then both arms slip equally. So installing skills doesn't help with work the agent never thinks to load a skill for. Product signal: a stronger always-on pointer, or folding config basics into more skills.

Usage

npm install

# run all scenarios
npm run eval

# run specific scenarios
npm run eval -- getting-started authoring-tasks

# re-grade already-produced run dirs (named <arm>-<sample>) without re-running agents
npm run eval -- grade getting-started /path/to/runs

npm test          # unit tests for the grading rules
npm run typecheck
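Re-grading relies on run dirs being named <arm>-<sample>. A minimal sketch of parsing that convention, assuming the sample is a trailing integer and the arm is everything before the last hyphen (parseRunDirName is a hypothetical helper, not part of the repo):

```typescript
// Split "<arm>-<sample>" on the last hyphen; return null for names that don't fit.
function parseRunDirName(name: string): { arm: string; sample: number } | null {
  const match = /^(.+)-(\d+)$/.exec(name);
  if (match === null) return null;
  const [, arm, sample] = match;
  if (arm === undefined || sample === undefined) return null;
  return { arm, sample: Number(sample) };
}
```

Returning null instead of throwing lets a grader skip unrelated directories that happen to sit next to the run dirs.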

Configure the package under test, sample count, and arms in evals.config.ts. Point packageVersion at the pkg.pr.new commit sha (PR preview) or a published version whose skills you want to evaluate.
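For orientation, a config along these lines would cover the three knobs mentioned above. Only packageVersion is named in this README; the EvalsConfig type, the samples and arms field names, the "baseline" arm name, and the placeholder value are assumptions, so check the real evals.config.ts before copying.

```typescript
// Hypothetical config shape; the actual evals.config.ts may differ.
interface EvalsConfig {
  packageVersion: string; // pkg.pr.new commit sha (PR preview) or a published version
  samples: number;        // runs per (scenario, arm) cell
  arms: string[];         // arms to compare
}

const config: EvalsConfig = {
  packageVersion: "<commit-sha-or-version>", // placeholder, fill in the package under test
  samples: 2,
  arms: ["baseline", "withskills"], // "baseline" is an assumed arm name
};
```

With small sample counts the results stay directional, so raising samples is the main lever for tighter numbers, at proportionally higher agent cost.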

Scenarios

One per skill (scenarios/<id>/scenario.ts), each pairing a fixture, a prompt, and a set of assertions:

| Scenario | Skill | Status |
| --- | --- | --- |
| getting-started | bootstrap Trigger.dev into a project | full |
| authoring-tasks | write tasks (retries, triggering, results) | full |
| realtime-and-frontend | live run status in React | WIP starter assertions |
| authoring-chat-agent | durable chat.agent | WIP starter assertions |
| chat-agent-advanced | HITL / sub-agents / sessions | WIP starter assertions |
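To make the fixture / prompt / assertions pairing concrete, here is a sketch of what a scenario definition and its static grading could look like. The Scenario and Assertion types, the fixture path, the prompt wording, and the grade function are all hypothetical; only the scenario id, the trigger.config.ts file name, and the /v3 import slip come from Trigger.dev and this README.

```typescript
type Files = Record<string, string>; // relative path -> file contents after the agent run

interface Assertion { id: string; check: (files: Files) => boolean; }

interface Scenario {
  id: string;
  fixture: string; // directory copied into the temp dir (assumed layout)
  prompt: string;  // passed to the headless agent
  assertions: Assertion[];
}

const gettingStarted: Scenario = {
  id: "getting-started",
  fixture: "scenarios/getting-started/fixture", // assumed path
  prompt: "Add Trigger.dev to this project.",   // placeholder wording
  assertions: [
    // Structural check: the config file exists.
    { id: "has-trigger-config", check: (files) => "trigger.config.ts" in files },
    // Anti-pattern rule: no file mentions the /v3 entry point.
    {
      id: "no-sdk-v3-import",
      check: (files) =>
        Object.keys(files).every((path) => !/@trigger\.dev\/sdk\/v3/.test(files[path] ?? "")),
    },
  ],
};

// Static grading: a run passes only if every assertion holds.
function grade(scenario: Scenario, files: Files): { passed: boolean; failed: string[] } {
  const failed = scenario.assertions.filter((a) => !a.check(files)).map((a) => a.id);
  return { passed: failed.length === 0, failed };
}
```

Because grading only reads files, the same function can serve both a fresh run and the re-grade path over existing run dirs.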

Scope and known limitations (v1)

  • Static only. v1 grades by typecheck-shape and anti-pattern scanning. It does not run trigger dev or execute tasks (the "boots" / "executes" tiers). Those need a seeded test project + a DEV key, which is the same headless-auth problem tracked against the remote MCP / agent-auth work.
  • --permission-mode acceptEdits is used deliberately: sufficient for file-writing scenarios, keeps every other gate on, and enforces the "don't run commands" constraint. --dangerously-skip-permissions is intentionally avoided.
  • Not yet hermetic. Runs inherit the user-global ~/.claude/CLAUDE.md. It is constant across arms (so it does not bias the delta), but a future version should isolate the agent's config for full reproducibility.
