feat(skills): native agentskills.io support #1

Closed
darkrishabh wants to merge 2 commits into master from cursor/agent-skills-standard-6cb3

Conversation

@darkrishabh darkrishabh commented Apr 30, 2026

Owner

Summary

  • Add native agentskills.io support via the new @darkrishabh/bench-ai/skills entrypoint: loadSkill, discoverSkills, runEval, gradeOutputs, writeRunArtifacts, buildBenchmark, ensureIterationDir, and evaluateSkills.
  • Add @darkrishabh/bench-ai/providers for provider imports and extend the Provider contract with optional capabilities and completeChat while preserving complete(prompt) compatibility.
  • Add bench-ai skills <root> CLI support, spec-shaped artifact writing, baseline mode, README examples, a fixture skill, tests, and a minor version bump to 1.1.0.
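The backward-compatible Provider extension described above could look roughly like the sketch below. Only complete(prompt), completeChat, and capabilities are named in this PR; the ChatMessage and ProviderCapabilities shapes, the chat flag, and the flattenMessages and chatOrFallback helpers are assumptions made for illustration and may not match the package's actual types.

```typescript
// Hypothetical shapes for illustration only; the real types exported from
// @darkrishabh/bench-ai/providers may differ.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

interface ProviderCapabilities {
  chat?: boolean; // assumed flag, not confirmed by the PR
}

interface Provider {
  // Existing contract: a single-prompt completion.
  complete(prompt: string): Promise<string>;
  // New optional members, so existing providers keep compiling unchanged.
  capabilities?: ProviderCapabilities;
  completeChat?(messages: ChatMessage[]): Promise<string>;
}

// Flatten a chat transcript into one prompt for providers without chat support.
function flattenMessages(messages: ChatMessage[]): string {
  return messages.map((m) => `${m.role}: ${m.content}`).join("\n\n");
}

// Prefer completeChat when the provider implements it; otherwise fall back
// to the original complete(prompt) call.
async function chatOrFallback(
  provider: Provider,
  messages: ChatMessage[],
): Promise<string> {
  if (provider.completeChat) {
    return provider.completeChat(messages);
  }
  return provider.complete(flattenMessages(messages));
}

// A provider written against the old contract still satisfies the interface.
const legacyProvider: Provider = {
  complete: async (prompt) => `legacy:${prompt}`,
};
```

Keeping the new members optional is what lets callers feature-detect chat support at runtime instead of requiring every provider to be updated at once.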

Spec: https://agentskills.io/skill-creation/evaluating-skills

Testing

  • npm test
  • npm run type-check
  • npm run build
  • node ./bench-ai.mjs skills --help
  • node --input-type=module -e 'import { evaluateSkills } from "./dist/skills/index.js"; import { OpenAICompatibleProvider } from "./dist/providers/index.js"; console.log(typeof evaluateSkills, typeof OpenAICompatibleProvider);'

vercel Bot commented Apr 30, 2026


The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| bench-ai | Ready | Preview, Comment | Apr 30, 2026 7:08pm |

cursor Bot force-pushed the cursor/agent-skills-standard-6cb3 branch from bf427ee to 045dc91 on April 30, 2026 19:07
cursor Bot changed the title from "Add Agent Skills eval support" to "feat(skills): native agentskills.io support" on Apr 30, 2026
…op mode, HTML report

- Add structured SkillsEvent stream (suite/eval start/end) so consumers
  can build rich UIs and reporters without re-parsing logs.
- Bundle a default consoleReporter with color-coded headers, prompt and
  output snippets, per-assertion verdicts with evidence, and suite-level
  pass-rate delta across with_skill / without_skill modes.
- Default workspace layout is now flat: <workspace>/<skill-slug>/...,
  overwritten on each run (CI-friendly, easy to diff).
- Opt-in loop mode (loop: true) snapshots each run into
  <workspace>/.history/iteration-N/<skill-slug>/... for skill-improvement
  workflows, returning historyIteration in the result.
- Persist meta.json (skill metadata) and prompts.json (system, user, and
  judge prompts actually sent) alongside the spec-mandated artifacts so
  reports and debugging surface exactly what each model saw.
- Add a self-contained static HTML report (report.ts) at
  <workspace>/report/index.html with overall stats, per-skill drill-down,
  side-by-side with_skill / without_skill output, and a collapsible
  judge prompt section.
- gradeOutputs now returns { grading, judgePrompt, judgeResponse } so the
  judge prompt can be exposed in events and reports.
- evaluate-skills emits events, defaults to consoleReporter when no
  onEvent/onLog is provided, generates the HTML report unless
  report: false, and returns historyIteration + reportPath in
  EvaluateSkillsResult.
- src/cli/skills.ts: update CLI JSON output to the new result shape.
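As a rough illustration of how a consumer might use the structured event stream to compute the with_skill / without_skill pass-rate delta, consider the sketch below. The event type names, the field names (skill, mode, passed, total), and the createDeltaCollector helper are hypothetical; the commit message only states that suite and eval start/end events exist and that an onEvent callback can receive them.

```typescript
// Hypothetical event shape for illustration; the actual SkillsEvent union
// in the package may use different names and fields.
type Mode = "with_skill" | "without_skill";

type SkillsEvent =
  | { type: "suite_start"; skills: number }
  | { type: "eval_start"; skill: string; mode: Mode }
  | { type: "eval_end"; skill: string; mode: Mode; passed: number; total: number }
  | { type: "suite_end" };

interface ModeTotals {
  passed: number;
  total: number;
}

// Accumulates assertion results per mode and reports the pass-rate delta.
function createDeltaCollector() {
  const totals: Record<Mode, ModeTotals> = {
    with_skill: { passed: 0, total: 0 },
    without_skill: { passed: 0, total: 0 },
  };
  // Pass rate for one mode; defined as 0 when no assertions were recorded.
  const rate = (t: ModeTotals): number => (t.total === 0 ? 0 : t.passed / t.total);

  return {
    // Intended to be passed as an onEvent-style callback.
    onEvent(event: SkillsEvent): void {
      if (event.type === "eval_end") {
        totals[event.mode].passed += event.passed;
        totals[event.mode].total += event.total;
      }
    },
    // Positive values mean the skill improved the pass rate over the baseline.
    delta(): number {
      return rate(totals.with_skill) - rate(totals.without_skill);
    },
  };
}
```

A reporter built this way never has to parse log lines: it only reacts to the eval-end events and ignores the rest.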
