FalconerAI/falconer-benchmarks

Falconer benchmarks

Open evaluation data comparing Falconer against the assistants teams use to find answers — Notion AI, Atlassian Rovo (Confluence), Claude Code, and Codex — across two scenarios. This repo contains the complete receipts: every question, every assistant's full answer, and every LLM-judge score. Nothing is summarized away.

Scenarios

| Folder | Scenario | Questions | Source |
| --- | --- | --- | --- |
| wix/ | Doc-grounded customer support (find the right help article, answer it completely) | 100 | WixQA / Wix Help Center |
| spark/ | Technical engineering (code, configs, debugging over the Apache Spark repo) | 100 | Stack Overflow apache-spark |

Results

Head-to-head win rate = share of decisive verdicts Falconer won (ties excluded), under the weighted-sum rule described below. Falconer leads every matchup in both scenarios.

Doc-grounded customer support (wix/)

| Falconer vs | Win rate | Wins / Losses / Ties | Verdicts |
| --- | --- | --- | --- |
| Atlassian Rovo (Confluence) | 88% | 503 / 66 / 31 | 600 |
| Notion AI | 71% | 316 / 132 / 116 | 564 |
| Codex | 63% | 314 / 186 / 100 | 600 |
| Claude Code | 53% | 213 / 192 / 195 | 600 |

Technical engineering (spark/)

| Falconer vs | Win rate | Wins / Losses / Ties | Verdicts |
| --- | --- | --- | --- |
| Atlassian Rovo (Confluence) | 97% | 561 / 17 / 10 | 588 |
| Codex | 74% | 340 / 118 / 141 | 599 |
| Notion AI | 58% | 244 / 179 / 177 | 600 |
| Claude Code | 56% | 215 / 168 / 217 | 600 |

Read each rate alongside its tie count: the Claude Code pairs are tie-heavy, so 53% / 56% is a narrow-but-consistent edge on the decisive verdicts. (Verdict counts below 600 are coverage gaps, not zeros — see Caveats.)
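As a quick check on how a rate and its tie count relate, the sketch below recomputes one row of the tables above from its published counts. The function names are illustrative and not part of the repo.

```python
def win_rate(wins: int, losses: int) -> float:
    """Share of decisive verdicts won; ties are excluded from the denominator."""
    decisive = wins + losses
    if decisive == 0:
        raise ValueError("no decisive verdicts")
    return wins / decisive


def tie_share(wins: int, losses: int, ties: int) -> float:
    """Share of all verdicts that were ties."""
    return ties / (wins + losses + ties)


# Falconer vs Claude Code on wix/, counts copied from the results table.
wins, losses, ties = 213, 192, 195
rate = win_rate(wins, losses)
ties_pct = tie_share(wins, losses, ties)
print(f"win rate {rate:.1%}, ties {ties_pct:.1%} of {wins + losses + ties} verdicts")
```

For this row, roughly a third of all verdicts are ties, so the 53% headline rests on the remaining 405 decisive verdicts.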

How questions were selected

The selection is blind to results — we did not pick questions Falconer happens to win:

  • Questions come from public, third-party sources we did not author (WixQA; the most-engaged Apache Spark questions on Stack Overflow).
  • Each scenario uses a fixed, pre-defined set of 100, chosen for question quality and community votes before any answers were generated.
  • Every assistant answered the identical set, with the same wording and the same grading.
  • We publish all answers, including the ones Falconer lost — so the selection is auditable rather than asserted.

How answers were judged

  • Three frontier judges — GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.8 — score each answer pair. We are not the judge.
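  • Every pair is judged in both orderings (Falconer first, then opponent first) to cancel position bias. Each (question, judge, ordering) row yields its own verdict, so a fully covered matchup has 100 × 3 × 2 = 600 verdicts.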
  • Judges score four axes 0–10: faithfulness, helpfulness, completeness, relevance. Citation formatting is explicitly ignored — no one wins on style.
  • Weighted-sum verdict: 0.35·faithfulness + 0.35·helpfulness + 0.20·completeness + 0.10·relevance. A side wins a verdict only if its weighted sum is ahead by more than 0.25 (on the 0–10 scale); otherwise it's a tie.
  • Head-to-head % = wins / (wins + losses), ties excluded.
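
A worked instance of the weighted-sum rule may help. The scores below are invented for illustration and are not taken from the judgment files; f, h, c, r stand for faithfulness, helpfulness, completeness, and relevance.

```latex
\begin{align*}
S &= 0.35\,f + 0.35\,h + 0.20\,c + 0.10\,r,
  \qquad 0.35 + 0.35 + 0.20 + 0.10 = 1 \\
S_{\text{Falconer}} &= 0.35(9) + 0.35(8) + 0.20(7) + 0.10(9) = 8.25 \\
S_{\text{opponent}} &= 0.35(8) + 0.35(8) + 0.20(7) + 0.10(9) = 7.90 \\
\Delta &= 8.25 - 7.90 = 0.35 > 0.25 \;\Rightarrow\; \text{Falconer wins this verdict}
\end{align*}
```

Because the weights sum to 1, S stays on the same 0–10 scale as the individual axes. With whole-number scores, a one-point lead on faithfulness or helpfulness alone (0.35) clears the 0.25 margin, while a one-point lead on completeness (0.20) or relevance (0.10) alone is a tie.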

File layout

SOURCES.md                      attribution & licenses
<scenario>/
  questions.json                the 100 questions + human reference answers
  answers-<agent>.jsonl         one assistant's answers (falconer, notion, confluence, claude-code, codex)
  judgments-falconer-vs-<x>.csv per-judge, per-ordering scores for each head-to-head pair

questions.json — array of { id, question, expected_answer, … }. wix/ adds article_ids (the Help Center articles the reference answer is grounded in); spark/ adds source_url (a link to the original Stack Overflow post).

answers-<agent>.jsonl — one JSON object per line: { index, id, question, expected_answer, "<agent>": { output, duration_ms, first_token_ms, …provenance } }. output is the assistant's full answer. Provenance records what actually ran (model, effort / thinking mode, reasoning level, sources scope). A note field appears on the few rows where an assistant timed out or its answer was recovered from a dump.
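
A minimal sketch of reading an answers file, assuming the nested key matches the agent name used in the filename (for example `falconer`). The sample line is made up to match the documented shape; it is not real benchmark data.

```python
import json


def read_answers(lines, agent):
    """Yield (id, output, duration_ms) for each non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        answer = row[agent]  # assumption: key equals the agent name
        yield row["id"], answer["output"], answer.get("duration_ms")


# Hypothetical line in the documented shape; with the real data you would
# pass an open file handle such as open("wix/answers-falconer.jsonl").
sample_lines = [
    '{"index": 0, "id": "q-001", "question": "Example question?", '
    '"expected_answer": "Example reference.", '
    '"falconer": {"output": "Example answer text.", '
    '"duration_ms": 1200, "first_token_ms": 300}}',
]
results = list(read_answers(sample_lines, "falconer"))
print(results)
```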

judgments-falconer-vs-<x>.csv — one row per (question, judge, ordering): question_id, judge, ordering, a_source, b_source, winner, a_{faithfulness,completeness,helpfulness,relevance}, b_{…}, reasoning, error. a_* / b_* are scores by position; a_source / b_source say which side was Falconer in that ordering.
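
Since scores are stored by position, each row has to be re-oriented before the two sides can be compared. The sketch below shows one way to do that. It assumes the `a_{…}` shorthand expands to columns such as `a_faithfulness`, and that Falconer is labelled `falconer` in `a_source` / `b_source`; both are assumptions, as are all values in the sample row.

```python
import csv
import io

AXES = ("faithfulness", "completeness", "helpfulness", "relevance")


def orient(row, falconer_label="falconer"):
    """Return (falconer_scores, opponent_scores) for one judgment row."""
    if row["a_source"] == falconer_label:
        mine, theirs = "a", "b"
    elif row["b_source"] == falconer_label:
        mine, theirs = "b", "a"
    else:
        raise ValueError(f"no side labelled {falconer_label!r} in row")

    def pick(side):
        return {axis: float(row[f"{side}_{axis}"]) for axis in AXES}

    return pick(mine), pick(theirs)


# Hypothetical row: Falconer sits in position b for this ordering.
SAMPLE = (
    "question_id,judge,ordering,a_source,b_source,winner,"
    "a_faithfulness,a_completeness,a_helpfulness,a_relevance,"
    "b_faithfulness,b_completeness,b_helpfulness,b_relevance,reasoning,error\n"
    "q-001,judge-1,opponent_first,codex,falconer,b,7,7,8,9,9,7,8,9,made-up row,\n"
)
rows = list(csv.DictReader(io.StringIO(SAMPLE)))
falconer, opponent = orient(rows[0])
print(falconer, opponent)
```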

Reproducing the win rates

For each judgment row, compute the weighted sum for each side and compare:

score(side) = 0.35*faithfulness + 0.35*helpfulness + 0.20*completeness + 0.10*relevance
diff = score(falconer) - score(opponent)     # use a_source/b_source to find which side is Falconer
verdict = win  if diff >  0.25
          loss if diff < -0.25
          tie  otherwise
win_rate = wins / (wins + losses)             # over all rows for a pair, ties excluded

Note: a few verdicts land at exactly |diff| = 0.25. Because of floating-point summation order, an independent reimplementation can classify 1–2 of these differently, shifting a headline by at most ~1 percentage point. It does not change any conclusion.
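
One way to make a reimplementation deterministic at the boundary is exact decimal arithmetic, sketched below. This is a swapped-in technique, not the method used to produce the published counts, so it may classify those 1–2 boundary rows differently from the headline tables. The score dictionaries are invented examples.

```python
from collections import Counter
from decimal import Decimal

WEIGHTS = {
    "faithfulness": Decimal("0.35"),
    "helpfulness": Decimal("0.35"),
    "completeness": Decimal("0.20"),
    "relevance": Decimal("0.10"),
}
MARGIN = Decimal("0.25")


def exact_score(scores):
    """Weighted sum computed in exact decimal arithmetic."""
    return sum(
        (w * Decimal(str(scores[axis])) for axis, w in WEIGHTS.items()),
        Decimal(0),
    )


def exact_verdict(falconer, opponent):
    """Classify one row from Falconer's side; |diff| == 0.25 is a tie."""
    diff = exact_score(falconer) - exact_score(opponent)
    if diff > MARGIN:
        return "win"
    if diff < -MARGIN:
        return "loss"
    return "tie"


# Invented scores: the first pair differs by exactly 0.25 (7.25 vs 7.00).
boundary_f = {"faithfulness": 8, "helpfulness": 7, "completeness": 7, "relevance": 6}
boundary_o = {"faithfulness": 7, "helpfulness": 7, "completeness": 7, "relevance": 7}
clear_f = {"faithfulness": 9, "helpfulness": 7, "completeness": 7, "relevance": 6}

pairs = [(boundary_f, boundary_o), (clear_f, boundary_o)]
tally = Counter(exact_verdict(f, o) for f, o in pairs)
print(dict(tally))
```

With binary floats, the same boundary pair can come out as a win or a tie depending on the order in which the four terms are added, which is the effect the note above describes.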

Caveats

  • Agent answers are model outputs, published as-is for transparency — not authoritative documentation. They may contain errors.
  • Coverage gaps (verdicts < 600) are dropped rows, not zero-scores: Notion returned no response on ~6% of wix/ questions (a known flakiness of that agent); Rovo timed out on 2 spark/ questions; and one spark/ Codex verdict was excluded because of a transient judge error. Missing answers drop out of the denominator rather than scoring 0.
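
The coverage figures can be cross-checked against the verdict counts in the results tables. The sketch below assumes a missing answer removes all six (judge, ordering) rows for that question; the dictionary keys are illustrative labels.

```python
QUESTIONS, JUDGES, ORDERINGS = 100, 3, 2
EXPECTED = QUESTIONS * JUDGES * ORDERINGS  # 600 verdicts per matchup
ROWS_PER_QUESTION = JUDGES * ORDERINGS     # 6 rows per question

# Verdict counts below 600, copied from the results tables.
observed = {
    ("wix", "notion"): 564,
    ("spark", "confluence"): 588,
    ("spark", "codex"): 599,
}

gaps = {}
for pair, n in observed.items():
    missing = EXPECTED - n
    # Whole questions dropped, plus any leftover single rows.
    gaps[pair] = divmod(missing, ROWS_PER_QUESTION)
    print(pair, "missing rows:", missing, "-> (questions, extra rows):", gaps[pair])
```

Under that assumption the counts line up with the caveat: 36 missing rows are 6 of 100 wix/ questions for Notion, 12 missing rows are 2 spark/ questions for Rovo, and the single missing Codex row is the one excluded verdict.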

Attribution & license

See SOURCES.md. In short: WixQA is MIT; Spark questions/answers are from Stack Overflow (CC BY-SA) with a source_url to each origin post, and cite Apache Spark docs (Apache-2.0). The agent answers and judge scores are original to this benchmark.
