Add benchflow evaluation framework #2139

Open
xdotli wants to merge 3 commits into huggingface:main from xdotli:add-benchflow-evaluation-framework

xdotli commented May 5, 2026

Summary

  • Adds benchflow to the EVALUATION_FRAMEWORKS constant in packages/tasks/src/eval.ts.

BenchFlow is the evaluation framework backing SkillsBench (arXiv:2602.12670). It runs containerized agent trials in paired with-skills / without-skills configurations and reports, as its headline metric, the lift that skills give a fixed agent + model.

This entry follows the same shape as the recent claw-eval (#2129), parsebench (#2096), and exgentic (#2079) additions: 6 lines added, no behaviour or data-flow changes — just registers the framework key so `eval.yaml` files in benchmark datasets can reference it.
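
For readers unfamiliar with the registry, the change can be pictured as one more key in a constant object. The sketch below is only an illustration: the field names (`name`, `description`, `url`) and the elided description string come from the build output quoted further down in this thread, the helper `isEvaluationFramework` is a hypothetical addition, and the real `eval.ts` may be organized differently.

```typescript
// Illustrative stand-in for the registry; not copied from packages/tasks/src/eval.ts.
const EVALUATION_FRAMEWORKS = {
  benchflow: {
    name: "benchflow",
    // The description is elided in the thread, so it stays elided here.
    description: "BenchFlow is an evaluation framework …",
    url: "https://github.com/benchflow-ai/benchflow",
  },
} as const;

type EvaluationFrameworkKey = keyof typeof EVALUATION_FRAMEWORKS;

// Hypothetical helper: narrows an arbitrary string to a registered key.
// hasOwnProperty is used so inherited names such as "toString" do not match.
function isEvaluationFramework(key: string): key is EvaluationFrameworkKey {
  return Object.prototype.hasOwnProperty.call(EVALUATION_FRAMEWORKS, key);
}
```

Under these assumptions, a consumer reading `evaluation_framework: benchflow` from an `eval.yaml` could call `isEvaluationFramework("benchflow")` before looking up the entry.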

Traction

SkillsBench is an active benchmark with material adoption:

  • Paper: arXiv:2602.12670 — 40 co-authors across academic and industry contributors.
  • Tasks: 91 tasks in the main repo, spanning financial analysis, code-patch, multimodal output, scientific computation, infrastructure, and other professional workflows.
  • Repo activity: 1,109 ⭐ / 277 🍴 / 226 merged PRs / 56 contributors on `benchflow-ai/skillsbench` at time of writing.
  • Trajectory archive: 8 contributor namespaces with multi-experiment trial bundles in `benchflow-ai/skillsbench-trajectories`.
  • HF datasets already live:

Tested locally

  • `pnpm format:check` — pass
  • `pnpm lint:check` — pass
  • `pnpm check` (tsc) — pass
  • `pnpm test` — 16/16 pass
  • `pnpm build` — `dist/esm/eval.js`, `dist/commonjs/eval.js`, and `eval.d.ts` all carry the new entry.

Followup

After this lands, `benchflow/skillsbench` will publish its `eval.yaml` and request allow-list inclusion via `huggingface/hub-docs` (per the registering-a-benchmark doc).


Note

Low Risk
Metadata-only registry entry with no changes to evaluation pipelines or data handling.

Overview
Registers benchflow in EVALUATION_FRAMEWORKS so benchmark datasets can declare it in eval.yaml, matching other framework entries (name, description, GitHub URL).

The description positions BenchFlow as a skill-aware agent evaluation harness (SkillsBench) with containerized trials and paired with-skills / without-skills runs. No runtime or validation logic changes—only the allow-list metadata.

Reviewed by Cursor Bugbot for commit 4521c15. Bugbot is set up for automated code reviews on this repo.


xdotli commented May 5, 2026

Author

End-to-end demo that the new framework key threads through every consumer surface.

1. Built artefacts carry the entry

```
$ pnpm --filter @huggingface/tasks build
$ grep -A 3 benchflow packages/tasks/dist/esm/eval.js
    benchflow: {
        name: "benchflow",
        description: "BenchFlow is an evaluation framework …",
        url: "https://github.com/benchflow-ai/benchflow",
    },
$ grep -A 3 benchflow packages/tasks/dist/esm/eval.d.ts
    readonly benchflow: {
        readonly name: "benchflow";
        readonly description: "BenchFlow is an evaluation framework …";
        readonly url: "https://github.com/benchflow-ai/benchflow";
    };
```

2. Downstream eval.yaml validates against the spec

The benchmark dataset that will consume this key is already public at https://huggingface.co/datasets/benchflow/skillsbench. Its `eval.yaml` is drafted and validates against `docs/hub/eval-results.md`:

```yaml
name: SkillsBench
description: >
  SkillsBench measures how well AI agents leverage Skills — structured
  packages of procedural knowledge — to complete realistic professional
  workflows across many domains. The headline metric is the with-skills
  vs. without-skills delta. See arXiv:2602.12670.
evaluation_framework: benchflow
tasks:
  - id: skillsbench
    config: default
    split: train
```

Validation transcript:

```
$ python validate.py /tmp/skillsbench-eval.yaml
PASS — eval.yaml conforms to spec
• required fields present (name, description, evaluation_framework, tasks)
• evaluation_framework='benchflow' is in the patched EVALUATION_FRAMEWORKS set
• tasks[0] has required id
```
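
The three checks listed in the transcript can be sketched as follows. This is not the author's `validate.py` (which is not shown in the thread); it is a hypothetical TypeScript re-creation that works on an already-parsed document, and `KNOWN_FRAMEWORKS`, `EvalYaml`, and `validateEvalYaml` are names invented for the illustration.

```typescript
// Assumed stand-in for the patched registry keys; the real list lives in eval.ts.
const KNOWN_FRAMEWORKS: ReadonlySet<string> = new Set(["benchflow"]);

interface EvalTask {
  id?: string;
  config?: string;
  split?: string;
}

interface EvalYaml {
  name?: string;
  description?: string;
  evaluation_framework?: string;
  tasks?: EvalTask[];
}

// Returns a list of problems; an empty list means the document passed.
function validateEvalYaml(doc: EvalYaml): string[] {
  const problems: string[] = [];
  const required = ["name", "description", "evaluation_framework", "tasks"] as const;
  for (const field of required) {
    if (doc[field] === undefined) problems.push(`missing required field: ${field}`);
  }
  const framework = doc.evaluation_framework;
  if (framework !== undefined && !KNOWN_FRAMEWORKS.has(framework)) {
    problems.push(`unknown evaluation_framework: ${framework}`);
  }
  (doc.tasks ?? []).forEach((task, i) => {
    if (!task.id) problems.push(`tasks[${i}] is missing id`);
  });
  return problems;
}

// Parsed form of the eval.yaml above (description abbreviated for brevity).
const skillsbenchEval: EvalYaml = {
  name: "SkillsBench",
  description: "SkillsBench measures how well AI agents leverage Skills.",
  evaluation_framework: "benchflow",
  tasks: [{ id: "skillsbench", config: "default", split: "train" }],
};

const evalYamlProblems = validateEvalYaml(skillsbenchEval);
```

Returning a list of problems rather than throwing on the first one lets a single run report every failed check, which mirrors the bullet-per-check layout of the transcript.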

3. Model-repo `.eval_results/skillsbench.yaml` round-trips

```yaml

```

```
$ python validate_eval_results.py /tmp/skillsbench-model-eval-result.yaml
PASS — 1 entry valid against eval-results spec
• dataset.id resolves to a Hub dataset
• dataset.task_id matches the eval.yaml's tasks[].id
• value is numeric, date is ISO-8601, source.url is set
```
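
The offline part of these checks can be sketched as below. The field names (`dataset.id`, `dataset.task_id`, `value`, `date`, `source.url`) are taken from the transcript; everything else, including the function name, the calendar-date simplification of ISO-8601, and the placeholder entry, is an assumption. The "resolves to a Hub dataset" check needs a network call and is left out.

```typescript
interface EvalResultEntry {
  dataset: { id: string; task_id: string };
  value: unknown;
  date: string;
  source?: { url?: string };
}

// Matches a calendar date such as 2026-01-01; a simplification of full ISO-8601.
const ISO_DATE = /^\d{4}-\d{2}-\d{2}$/;

// Returns a list of problems; an empty list means the entry passed.
function validateEvalResult(entry: EvalResultEntry, knownTaskIds: readonly string[]): string[] {
  const problems: string[] = [];
  if (knownTaskIds.indexOf(entry.dataset.task_id) === -1) {
    problems.push(`unknown task_id: ${entry.dataset.task_id}`);
  }
  if (typeof entry.value !== "number" || !Number.isFinite(entry.value)) {
    problems.push("value must be a finite number");
  }
  if (!ISO_DATE.test(entry.date) || Number.isNaN(Date.parse(entry.date))) {
    problems.push(`date is not ISO-8601: ${entry.date}`);
  }
  if (!entry.source?.url) problems.push("source.url is not set");
  return problems;
}

// Placeholder entry: the value, date, and URL are dummies, not reported results.
const placeholderEntry: EvalResultEntry = {
  dataset: { id: "benchflow/skillsbench", task_id: "skillsbench" },
  value: 0,
  date: "2026-01-01",
  source: { url: "https://example.com/run" },
};

const evalResultProblems = validateEvalResult(placeholderEntry, ["skillsbench"]);
```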

4. Allow-list request already filed

huggingface/hub-docs#2456 — references this PR and lists the dataset's traction signals so the allow-list flip is one step after merge.

Once this lands and the hub deploys, `benchflow/skillsbench` pushes its `eval.yaml` and the chain is live.

NathanHB commented May 5, 2026

Member

Hey @xdotli! Thanks for the submission. The bench looks super interesting; measuring how well models can use skills is fundamental right now.
On which open models did you run the benchmark, and is it reported in any recent model release?

xdotli commented Jun 15, 2026

Author

> Hey @xdotli! Thanks for the submission. The bench looks super interesting; measuring how well models can use skills is fundamental right now. On which open models did you run the benchmark, and is it reported in any recent model release?

Hi @NathanHB, sorry about the late reply! We just released version 1.1 of SkillsBench with these open models. Let me know what you think!

bingran-you commented Jun 15, 2026


Updating with the full SkillsBench v1.1 board.

Open models we evaluated

SkillsBench measures the paired with-skills vs. without-skills lift for a fixed agent + model. The v1.1 board (recomputed 2026-06-11; 87 tasks × 3 trials × 2 conditions, full 261/261 coverage per mode) includes 6 open-weight configurations across 4 families, each with a public HF repo — ready to carry .eval_results the moment this key is accepted:

| Model | HF repo | with-skills | without-skills | skill-lift |
| --- | --- | --- | --- | --- |
| GLM 5.1 | zai-org/GLM-5.1 | 58.4 | 32.7 | +25.7 |
| Kimi K2.6 | moonshotai/Kimi-K2.6 | 54.0 | 33.4 | +20.6 |
| MiniMax M3 | MiniMaxAI/MiniMax-M3 | 53.0 | 29.7 | +23.3 |
| DeepSeek V4 Pro | deepseek-ai/DeepSeek-V4-Pro | 50.1 | 26.9 | +23.2 |
| DeepSeek V4 Flash | deepseek-ai/DeepSeek-V4-Flash | 44.7 | 27.5 | +17.2 |
| MiniMax M2.7 | MiniMaxAI/MiniMax-M2.7 | 34.9 | 18.1 | +16.8 |
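
As a quick consistency check, the skill-lift column is the with-skills score minus the without-skills score, and 87 tasks × 3 trials gives the 261 trials per condition quoted above. The sketch below recomputes both from the table; the type and function names are made up for the illustration, and rounding to one decimal place is an assumption that matches the precision shown.

```typescript
interface BoardRow {
  model: string;
  withSkills: number;
  withoutSkills: number;
}

// Scores copied from the v1.1 open-model table above.
const boardRows: BoardRow[] = [
  { model: "GLM 5.1", withSkills: 58.4, withoutSkills: 32.7 },
  { model: "Kimi K2.6", withSkills: 54.0, withoutSkills: 33.4 },
  { model: "MiniMax M3", withSkills: 53.0, withoutSkills: 29.7 },
  { model: "DeepSeek V4 Pro", withSkills: 50.1, withoutSkills: 26.9 },
  { model: "DeepSeek V4 Flash", withSkills: 44.7, withoutSkills: 27.5 },
  { model: "MiniMax M2.7", withSkills: 34.9, withoutSkills: 18.1 },
];

// Lift = with-skills minus without-skills, rounded to one decimal place
// to hide floating-point noise.
function skillLift(row: BoardRow): number {
  return Math.round((row.withSkills - row.withoutSkills) * 10) / 10;
}

// Trials per condition: 87 tasks × 3 trials.
const trialsPerCondition = 87 * 3; // 261

// Yields 25.7, 20.6, 23.3, 23.2, 17.2, 16.8, matching the skill-lift column.
const lifts = boardRows.map((row) => skillLift(row));
```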

The full board also covers the current closed frontier (GPT-5.5, Claude Opus 4.8 / 4.7 / Sonnet 4.6, Gemini 3.1 Pro / 3.5 Flash / 3.1 Flash Lite, Grok 4.3) — 15 models / 18 model×agent configs / 8 families:

Traction

  • Paper: arXiv:2602.12670 · HF paper page.
  • Repo: 1.35k★ / 317 forks / 67 contributors / 310 merged PRs on benchflow-ai/skillsbench (up from 1.1k★ / 56 contributors / 226 merged PRs when this PR opened).
  • Covers the live frontier under a paired-skills protocol that directly measures the "can models use skills" capability you flagged — every score is release-aligned with audited trajectories in benchflow/skillsbench-leaderboard.

"SkillsBench" already appears in recent open-model eval tables

Several 2026 flagship open-model releases report a benchmark named SkillsBench in their agentic/coding evaluation tables:

  • Qwen3.6-27B (model card · release blog): SkillsBench Avg5 row scoring 48.2 (vs. Qwen3.5-397B-A17B = 30.0).
  • Qwen3.6-35B-A3B (model card): same SkillsBench Avg5 row.
  • Doubao Seed 2.0 (ByteDance) — seed.bytedance.com: SkillsBench listed in its evaluation table.
  • Downstream redistributions carry the same row — e.g. cyankiwi/Qwen3.6-27B-AWQ-INT4, unsloth/Qwen3.6-35B-A3B-GGUF.

All of the above report it with the methodology footnote: "SkillsBench: Evaluated via OpenCode on 78 tasks (self-contained subset, excluding API-dependent tasks); avg of 5 runs."

Already wired on the Hub

This PR is the one remaining gate. Once benchflow is in EVALUATION_FRAMEWORKS and deployed, we'll submit .eval_results/skillsbench.yaml to the 6 open-weight repos above to populate the Hub leaderboard. Happy to add more open-model coverage first if you'd like.

bingran-you commented Jun 16, 2026


@NathanHB @krampstudio Thanks for your comments! Here are the latest leaderboard plots from the newest version of SkillsBench, run with BenchFlow as the eval harness.

The SkillsBench v1.1 arXiv paper also came out today. Please check it out here: https://arxiv.org/pdf/2602.12670

And our HuggingFace dataset page is here: https://huggingface.co/datasets/benchflow/skillsbench

[Three images: leaderboard result plots]
