Add benchflow evaluation framework by xdotli · Pull Request #2139 · huggingface/huggingface.js

xdotli · 2026-05-05T07:44:45Z

Summary

Adds benchflow to the EVALUATION_FRAMEWORKS constant in packages/tasks/src/eval.ts.

BenchFlow is the evaluation framework backing SkillsBench (arXiv:2602.12670). It runs containerized agent trials with paired with-skills / without-skills configurations, exposing the lift skills give a fixed agent + model as the headline metric.

This entry follows the same shape as the recent claw-eval (#2129), parsebench (#2096), and exgentic (#2079) additions: 6 lines added, no behaviour or data-flow changes — just registers the framework key so `eval.yaml` files in benchmark datasets can reference it.

Traction

SkillsBench is an active benchmark with material adoption:

- Paper: arXiv:2602.12670, with 40 co-authors from academia and industry.
- Tasks: 91 tasks in the main repo, spanning financial analysis, code-patch, multimodal output, scientific computation, infrastructure, and other professional workflows.
- Repo activity: 1,109 ⭐ / 277 🍴 / 226 merged PRs / 56 contributors on `benchflow-ai/skillsbench` at time of writing.
- Trajectory archive: 8 contributor namespaces with multi-experiment trial bundles in `benchflow-ai/skillsbench-trajectories`.
- HF datasets already live:
  - `benchflow/skillsbench`: task corpus (parquet, full source-commit pin).
  - `benchflow/skillsbench-leaderboard`: trajectory submission target patterned after `harborframework/terminal-bench-2-leaderboard`.

Tested locally

- `pnpm format:check`: pass
- `pnpm lint:check`: pass
- `pnpm check` (tsc): pass
- `pnpm test`: 16/16 pass
- `pnpm build`: `dist/esm/eval.js`, `dist/commonjs/eval.js`, and `eval.d.ts` all carry the new entry.

Followup

After this lands, `benchflow/skillsbench` will publish its `eval.yaml` and request allow-list inclusion via `huggingface/hub-docs` (per the registering-a-benchmark doc).

Note

Low Risk
Metadata-only registry entry with no changes to evaluation pipelines or data handling.

Overview
Registers benchflow in EVALUATION_FRAMEWORKS so benchmark datasets can declare it in eval.yaml, matching other framework entries (name, description, GitHub URL).

The description positions BenchFlow as a skill-aware agent evaluation harness (SkillsBench) with containerized trials and paired with-skills / without-skills runs. No runtime or validation logic changes—only the allow-list metadata.

Reviewed by Cursor Bugbot for commit 4521c15. Bugbot is set up for automated code reviews on this repo.

xdotli · 2026-05-05T07:59:31Z

End-to-end demo that the new framework key threads through every consumer surface.

1. Built artefacts carry the entry

```
$ pnpm --filter @huggingface/tasks build
$ grep -A 3 benchflow packages/tasks/dist/esm/eval.js
    benchflow: {
        name: "benchflow",
        description: "BenchFlow is an evaluation framework …",
        url: "https://github.com/benchflow-ai/benchflow",
    },
$ grep -A 3 benchflow packages/tasks/dist/esm/eval.d.ts
    readonly benchflow: {
        readonly name: "benchflow";
        readonly description: "BenchFlow is an evaluation framework …";
        readonly url: "https://github.com/benchflow-ai/benchflow";
    };
```

2. Downstream `eval.yaml` validates against the spec

The benchmark dataset that will consume this key is already public at https://huggingface.co/datasets/benchflow/skillsbench. Its `eval.yaml` is drafted and validates against `docs/hub/eval-results.md`:

```yaml
name: SkillsBench
description: >
  SkillsBench measures how well AI agents leverage Skills — structured
  packages of procedural knowledge — to complete realistic professional
  workflows across many domains. The headline metric is the with-skills
  vs. without-skills delta. See arXiv:2602.12670.
evaluation_framework: benchflow
tasks:
  - id: skillsbench
    config: default
    split: train
```

Validation transcript:

```
$ python validate.py /tmp/skillsbench-eval.yaml
PASS — eval.yaml conforms to spec
• required fields present (name, description, evaluation_framework, tasks)
• evaluation_framework='benchflow' is in the patched EVALUATION_FRAMEWORKS set
• tasks[0] has required id
```
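The script behind this transcript is not shown in the thread. As a rough illustration of the three checks it reports, here is a minimal sketch that works on an already-parsed document. The function name, the error strings, and the `KNOWN_FRAMEWORKS` set are assumptions made for the example (the set lists only frameworks named in this PR, with guessed key spellings), not the actual `validate.py` or the real `EVALUATION_FRAMEWORKS` contents.

```python
from typing import Any

# Assumption: stand-in for the keys of EVALUATION_FRAMEWORKS. Only frameworks
# mentioned in this PR are listed, and their exact key spellings are guessed.
KNOWN_FRAMEWORKS = {"benchflow", "claw-eval", "parsebench", "exgentic"}

REQUIRED_FIELDS = ("name", "description", "evaluation_framework", "tasks")


def validate_eval_yaml(doc: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the document passes."""
    errors = [f"missing required field: {f}" for f in REQUIRED_FIELDS if f not in doc]

    # The framework key must be one of the registered names.
    framework = doc.get("evaluation_framework")
    if framework is not None and framework not in KNOWN_FRAMEWORKS:
        errors.append(f"unknown evaluation_framework: {framework!r}")

    # tasks must be a non-empty list whose items each carry an id.
    tasks = doc.get("tasks")
    if tasks is not None:
        if not isinstance(tasks, list) or not tasks:
            errors.append("tasks must be a non-empty list")
        else:
            for i, task in enumerate(tasks):
                if not isinstance(task, dict) or "id" not in task:
                    errors.append(f"tasks[{i}] is missing required id")
    return errors


# The eval.yaml from this comment, written out as the dict a YAML parser would produce
# (description shortened for the example).
skillsbench = {
    "name": "SkillsBench",
    "description": "SkillsBench measures how well AI agents leverage Skills.",
    "evaluation_framework": "benchflow",
    "tasks": [{"id": "skillsbench", "config": "default", "split": "train"}],
}
print(validate_eval_yaml(skillsbench))
```

Returning a list of problems instead of raising on the first one lets a caller print every failed check in one pass, which is the shape the bulleted transcript suggests.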

3. Model-repo `.eval_results/skillsbench.yaml` round-trips

```yaml
- dataset:
    id: benchflow/skillsbench
    task_id: skillsbench
  value: 0.42
  date: "2026-05-05"
  source:
    url: https://huggingface.co/datasets/benchflow/skillsbench-leaderboard
    name: SkillsBench Leaderboard submissions
    user: benchflow
  notes: "with-skills"
```

```
$ python validate_eval_results.py /tmp/skillsbench-model-eval-result.yaml
PASS — 1 entry valid against eval-results spec
• dataset.id resolves to a Hub dataset
• dataset.task_id matches the eval.yaml's tasks[].id
• value is numeric, date is ISO-8601, source.url is set
```
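The checks in this second transcript can be approximated offline as well. The sketch below is hypothetical and is not the `validate_eval_results.py` used above: the function name and error strings are invented, and it skips the "resolves to a Hub dataset" check because that needs a network call.

```python
from datetime import date
from numbers import Real
from typing import Any


def validate_entry(entry: dict[str, Any], known_task_ids: set[str]) -> list[str]:
    """Check one .eval_results entry; an empty list means it passes."""
    errors = []

    dataset = entry.get("dataset") or {}
    if not dataset.get("id"):
        errors.append("dataset.id is missing")
    # task_id must match one of the tasks[].id values declared in eval.yaml.
    if dataset.get("task_id") not in known_task_ids:
        errors.append("dataset.task_id does not match any tasks[].id")

    # bool is a subclass of int in Python, so reject it explicitly.
    value = entry.get("value")
    if isinstance(value, bool) or not isinstance(value, Real):
        errors.append("value is not numeric")

    # date.fromisoformat raises ValueError for anything that is not an ISO date.
    try:
        date.fromisoformat(str(entry.get("date")))
    except ValueError:
        errors.append("date is not ISO-8601")

    if not (entry.get("source") or {}).get("url"):
        errors.append("source.url is missing")
    return errors


entry = {
    "dataset": {"id": "benchflow/skillsbench", "task_id": "skillsbench"},
    "value": 0.42,
    "date": "2026-05-05",
    "source": {"url": "https://huggingface.co/datasets/benchflow/skillsbench-leaderboard"},
    "notes": "with-skills",
}
print(validate_entry(entry, known_task_ids={"skillsbench"}))
```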

4. Allow-list request already filed

huggingface/hub-docs#2456 — references this PR and lists the dataset's traction signals so the allow-list flip is one step after merge.

Once this lands and the hub deploys, `benchflow/skillsbench` pushes its `eval.yaml` and the chain is live.

NathanHB · 2026-05-05T11:37:34Z

Hey @xdotli! Thanks for the submission. The bench looks super interesting; measuring how well models can use skills is fundamental right now.
Which open models did you run the benchmark on, and is it reported in any recent model release?

xdotli · 2026-06-15T21:05:38Z

Hi @NathanHB, sorry about the late reply! We just released version 1.1 of SkillsBench with these open models. Let me know what you think!

bingran-you · 2026-06-15T21:50:39Z

Updating with the full SkillsBench v1.1 board.

Open models we evaluated

SkillsBench measures the paired with-skills vs. without-skills lift for a fixed agent + model. The v1.1 board (recomputed 2026-06-11; 87 tasks × 3 trials × 2 conditions, full 261/261 coverage per mode) includes 6 open-weight configurations across 4 families, each with a public HF repo — ready to carry .eval_results the moment this key is accepted:

| Model | HF repo | with-skills | without-skills | skill-lift |
|---|---|---|---|---|
| GLM 5.1 | `zai-org/GLM-5.1` | 58.4 | 32.7 | +25.7 |
| Kimi K2.6 | `moonshotai/Kimi-K2.6` | 54.0 | 33.4 | +20.6 |
| MiniMax M3 | `MiniMaxAI/MiniMax-M3` | 53.0 | 29.7 | +23.3 |
| DeepSeek V4 Pro | `deepseek-ai/DeepSeek-V4-Pro` | 50.1 | 26.9 | +23.2 |
| DeepSeek V4 Flash | `deepseek-ai/DeepSeek-V4-Flash` | 44.7 | 27.5 | +17.2 |
| MiniMax M2.7 | `MiniMaxAI/MiniMax-M2.7` | 34.9 | 18.1 | +16.8 |
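
The skill-lift column in the table above is the with-skills score minus the without-skills score for the same agent + model. The short sketch below recomputes it from the table rows and also checks the 261-trials-per-mode figure (87 tasks × 3 trials). The function and variable names are illustrative and are not taken from the BenchFlow codebase.

```python
# Rows copied from the v1.1 table: (model, with-skills, without-skills, reported lift).
BOARD = [
    ("GLM 5.1", 58.4, 32.7, 25.7),
    ("Kimi K2.6", 54.0, 33.4, 20.6),
    ("MiniMax M3", 53.0, 29.7, 23.3),
    ("DeepSeek V4 Pro", 50.1, 26.9, 23.2),
    ("DeepSeek V4 Flash", 44.7, 27.5, 17.2),
    ("MiniMax M2.7", 34.9, 18.1, 16.8),
]


def skill_lift(with_skills: float, without_skills: float) -> float:
    """Paired difference, rounded to one decimal like the published board."""
    return round(with_skills - without_skills, 1)


# Every reported lift should equal the recomputed difference.
mismatches = [
    model for model, w, wo, reported in BOARD if skill_lift(w, wo) != reported
]
print(mismatches)

# Coverage per condition: 87 tasks x 3 trials each.
trials_per_mode = 87 * 3
print(trials_per_mode)
```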

The full board also covers the current closed frontier (GPT-5.5, Claude Opus 4.8 / 4.7 / Sonnet 4.6, Gemini 3.1 Pro / 3.5 Flash / 3.1 Flash Lite, Grok 4.3) — 15 models / 18 model×agent configs / 8 families:

Leaderboard JSON: https://huggingface.co/datasets/benchflow/skillsbench-leaderboard/raw/main/leaderboard/skillsbench/v1.1/official.json
Selection manifest (rule + per-config coverage): https://huggingface.co/datasets/benchflow/skillsbench-leaderboard/raw/main/analysis/skillsbench/v1.1/official-selected/manifest.json

Traction

Paper: arXiv:2602.12670 · HF paper page.
Repo: 1.35k★ / 317 forks / 67 contributors / 310 merged PRs on benchflow-ai/skillsbench (up from 1.1k★ / 56 contributors / 226 merged PRs when this PR opened).
Covers the live frontier under a paired-skills protocol that directly measures the "can models use skills" capability you flagged — every score is release-aligned with audited trajectories in benchflow/skillsbench-leaderboard.

"SkillsBench" already appears in recent open-model eval tables

Several 2026 flagship open-model releases report a benchmark named SkillsBench in their agentic/coding evaluation tables:

Qwen3.6-27B — model card · release blog: SkillsBench Avg5 row scoring 48.2 (vs. Qwen3.5-397B-A17B = 30.0).
Qwen3.6-35B-A3B — model card: same SkillsBench Avg5 row.
Doubao Seed 2.0 (ByteDance) — seed.bytedance.com: SkillsBench listed in its evaluation table.
Downstream redistributions carry the same row — e.g. cyankiwi/Qwen3.6-27B-AWQ-INT4, unsloth/Qwen3.6-35B-A3B-GGUF.

All of the above report it with the methodology footnote: "SkillsBench: Evaluated via OpenCode on 78 tasks (self-contained subset, excluding API-dependent tasks); avg of 5 runs."

Already wired on the Hub

Dataset + root eval.yaml (evaluation_framework: benchflow): https://huggingface.co/datasets/benchflow/skillsbench
Leaderboard + trajectory archive: https://huggingface.co/datasets/benchflow/skillsbench-leaderboard

This PR is the one remaining gate. Once benchflow is in EVALUATION_FRAMEWORKS and deployed, we'll submit .eval_results/skillsbench.yaml to the 6 open-weight repos above to populate the Hub leaderboard. Happy to add more open-model coverage first if you'd like.

bingran-you · 2026-06-16T01:15:44Z

@NathanHB @krampstudio Thanks for your comments! Here are the latest leaderboard result plots from the newest version of SkillsBench, run with BenchFlow as the eval harness.

Also, the SkillsBench v1.1 arXiv paper came out today. Please check it out here: https://arxiv.org/pdf/2602.12670

And our HuggingFace dataset page is here: https://huggingface.co/datasets/benchflow/skillsbench

Add benchflow evaluation framework

578f61b

xdotli requested review from SBrandeis, Wauplin, gary149, julien-c, ngxson and pcuenca as code owners May 5, 2026 07:44

xdotli mentioned this pull request May 5, 2026

Request benchmark allow-list — benchflow/skillsbench huggingface/hub-docs#2456

Open

Merge branch 'main' into add-benchflow-evaluation-framework

b7523dd

Merge branch 'main' into add-benchflow-evaluation-framework

4521c15

xdotli requested review from NathanHB and krampstudio as code owners June 15, 2026 21:05

xdotli wants to merge 3 commits into huggingface:main from xdotli:add-benchflow-evaluation-framework

3 participants
