Add benchflow evaluation framework #2139
BenchFlow is the evaluation framework backing SkillsBench (arXiv:2602.12670). It runs containerized agent trials with paired with-skills / without-skills configurations, exposing the lift skills give a fixed agent + model as the headline metric.
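One way to read the headline metric is sketched below. The description does not say how lift is computed, so this assumes it is the with-skills pass rate minus the without-skills pass rate for one fixed agent + model. The `Trial` shape, the function names, and the toy data are all hypothetical and are not taken from BenchFlow.

```typescript
// Hypothetical trial record; field names are illustrative, not BenchFlow's schema.
type Condition = "with-skills" | "without-skills";

interface Trial {
  task: string;
  condition: Condition;
  passed: boolean;
}

// Fraction of trials that passed under one condition.
function passRate(trials: Trial[], condition: Condition): number {
  const subset = trials.filter((t) => t.condition === condition);
  if (subset.length === 0) {
    throw new Error(`no trials recorded for condition: ${condition}`);
  }
  return subset.filter((t) => t.passed).length / subset.length;
}

// ASSUMPTION: lift = with-skills pass rate minus without-skills pass rate.
function skillLift(trials: Trial[]): number {
  return passRate(trials, "with-skills") - passRate(trials, "without-skills");
}

// Toy data, invented for illustration only (not SkillsBench results).
const toyTrials: Trial[] = [
  { task: "task-a", condition: "with-skills", passed: true },
  { task: "task-b", condition: "with-skills", passed: true },
  { task: "task-a", condition: "without-skills", passed: true },
  { task: "task-b", condition: "without-skills", passed: false },
];

const toyLift = skillLift(toyTrials);
```

Pairing both conditions on the same tasks keeps the agent and model fixed, so any difference in pass rate can be attributed to the skills rather than to the task mix.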
End-to-end demo that the new framework key threads through every consumer surface.
1. Built artefacts carry the entry
2. Downstream
Hey @xdotli! Thanks for the submission. The bench looks super interesting; measuring how well models can use skills is fundamental right now.
Hi @NathanHB, sorry about the late reply! We just released version 1.1 of SkillsBench with these open models. Let me know what you think!
Updating with the full SkillsBench v1.1 board.
Open models we evaluated
SkillsBench measures the paired with-skills vs. without-skills lift for a fixed agent + model. The v1.1 board (recomputed 2026-06-11; 87 tasks × 3 trials × 2 conditions, full 261/261 coverage per mode) includes 6 open-weight configurations across 4 families, each with a public HF repo, ready to carry
The full board also covers the current closed frontier (GPT-5.5, Claude Opus 4.8 / 4.7 / Sonnet 4.6, Gemini 3.1 Pro / 3.5 Flash / 3.1 Flash Lite, Grok 4.3) — 15 models / 18 model×agent configs / 8 families:
Traction
"SkillsBench" already appears in recent open-model eval tables
Several 2026 flagship open-model releases report a benchmark named SkillsBench in their agentic/coding evaluation tables:
All of the above report it with the methodology footnote: "SkillsBench: Evaluated via OpenCode on 78 tasks (self-contained subset, excluding API-dependent tasks); avg of 5 runs."
Already wired on the Hub
This PR is the one remaining gate. Once
@NathanHB @krampstudio Thanks for your comments! Here are the latest leaderboard plots from our latest version of SkillsBench, run with BenchFlow as the eval harness. The SkillsBench v1.1 arXiv paper also came out today! Please check it out here: https://arxiv.org/pdf/2602.12670 And our HuggingFace dataset page is here: https://huggingface.co/datasets/benchflow/skillsbench



Summary
Adds `benchflow` to the `EVALUATION_FRAMEWORKS` constant in `packages/tasks/src/eval.ts`.
BenchFlow is the evaluation framework backing SkillsBench (arXiv:2602.12670). It runs containerized agent trials with paired with-skills / without-skills configurations, exposing the lift skills give a fixed agent + model as the headline metric.
This entry follows the same shape as the recent `claw-eval` (#2129), `parsebench` (#2096), and `exgentic` (#2079) additions: 6 lines added, no behaviour or data-flow changes. It just registers the framework key so `eval.yaml` files in benchmark datasets can reference it.
Traction
SkillsBench is an active benchmark with material adoption:
Tested locally
Followup
After this lands, `benchflow/skillsbench` will publish its `eval.yaml` and request allow-list inclusion via `huggingface/hub-docs` (per the registering-a-benchmark doc).
Note
Low Risk
Metadata-only registry entry with no changes to evaluation pipelines or data handling.
Overview
Registers `benchflow` in `EVALUATION_FRAMEWORKS` so benchmark datasets can declare it in `eval.yaml`, matching other framework entries (name, description, GitHub URL).
The description positions BenchFlow as a skill-aware agent evaluation harness (SkillsBench) with containerized trials and paired with-skills / without-skills runs. No runtime or validation logic changes; only the allow-list metadata is touched.
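A registry entry of this kind might look roughly like the sketch below. This is not the actual diff: the `EvaluationFramework` interface, the `url` field name, the placeholder URL, and the `isKnownFramework` helper are assumptions for illustration, based only on the fields named above (name, description, GitHub URL). The real type in `packages/tasks/src/eval.ts` may differ.

```typescript
// Hypothetical entry shape inferred from "name, description, GitHub URL";
// the real definition in packages/tasks/src/eval.ts may use other field names.
interface EvaluationFramework {
  name: string;
  description: string;
  url: string;
}

// Sketch of an allow-list keyed by framework id (only one entry shown).
const EVALUATION_FRAMEWORKS: Record<string, EvaluationFramework> = {
  benchflow: {
    name: "benchflow",
    description:
      "Skill-aware agent evaluation harness (SkillsBench): containerized " +
      "trials with paired with-skills / without-skills runs.",
    url: "https://example.com/benchflow", // placeholder, not the real repository URL
  },
};

// Hypothetical consumer-side check: is a framework key registered?
// An own-property check is used so inherited keys like "toString" are rejected.
function isKnownFramework(key: string): boolean {
  return Object.prototype.hasOwnProperty.call(EVALUATION_FRAMEWORKS, key);
}
```

Because the change is metadata only, a consumer like the hypothetical `isKnownFramework` above would start accepting `benchflow` without any change to its own logic.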
Reviewed by Cursor Bugbot for commit 4521c15. Bugbot is set up for automated code reviews on this repo.