Falconer benchmarks

Open evaluation data comparing Falconer against the assistants teams use to find answers — Notion AI, Atlassian Rovo (Confluence), Claude Code, and Codex — across two scenarios. This repo contains the complete receipts: every question, every assistant's full answer, and every LLM-judge score. Nothing is summarized away.

Scenarios

| Folder | Scenario | Questions | Source |
| --- | --- | --- | --- |
| `wix/` | Doc-grounded customer support (find the right help article, answer it completely) | 100 | WixQA / Wix Help Center |
| `spark/` | Technical engineering (code, configs, debugging over the Apache Spark repo) | 100 | Stack Overflow `apache-spark` |

Results

Head-to-head win rate = share of decisive verdicts Falconer won (ties excluded), under the weighted-sum rule described below. Falconer leads every matchup in both scenarios.

Doc-grounded customer support (wix/)

| Falconer vs | Win rate | Wins / Losses / Ties | Verdicts |
| --- | --- | --- | --- |
| Atlassian Rovo (Confluence) | 88% | 503 / 66 / 31 | 600 |
| Notion AI | 71% | 316 / 132 / 116 | 564 |
| Codex | 63% | 314 / 186 / 100 | 600 |
| Claude Code | 53% | 213 / 192 / 195 | 600 |

Technical engineering (spark/)

| Falconer vs | Win rate | Wins / Losses / Ties | Verdicts |
| --- | --- | --- | --- |
| Atlassian Rovo (Confluence) | 97% | 561 / 17 / 10 | 588 |
| Codex | 74% | 340 / 118 / 141 | 599 |
| Notion AI | 58% | 244 / 179 / 177 | 600 |
| Claude Code | 56% | 215 / 168 / 217 | 600 |

Read each rate alongside its tie count: the Claude Code pairs are tie-heavy, so 53% / 56% is a narrow-but-consistent edge on the decisive verdicts. (Verdict counts below 600 are coverage gaps, not zeros — see Caveats.)
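As a quick sanity check, the headline percentages can be recomputed from the win / loss / tie counts in the two tables. The sketch below copies those counts verbatim; the names `RESULTS`, `win_rate`, and `tie_share` are illustrative and are not part of the repo.

```python
# (wins, losses, ties) copied from the results tables above.
RESULTS = {
    "wix": {
        "Atlassian Rovo (Confluence)": (503, 66, 31),
        "Notion AI": (316, 132, 116),
        "Codex": (314, 186, 100),
        "Claude Code": (213, 192, 195),
    },
    "spark": {
        "Atlassian Rovo (Confluence)": (561, 17, 10),
        "Codex": (340, 118, 141),
        "Notion AI": (244, 179, 177),
        "Claude Code": (215, 168, 217),
    },
}


def win_rate(wins, losses):
    """Share of decisive verdicts Falconer won; ties are excluded."""
    return wins / (wins + losses)


def tie_share(wins, losses, ties):
    """Share of all verdicts that ended in a tie."""
    return ties / (wins + losses + ties)


for scenario, pairs in RESULTS.items():
    for opponent, (w, l, t) in pairs.items():
        print(
            f"{scenario:5} {opponent:28} "
            f"win rate {win_rate(w, l):.0%}  "
            f"ties {tie_share(w, l, t):.0%}  "
            f"verdicts {w + l + t}"
        )
```

Printing the tie share next to each win rate makes the tie-heavy Claude Code matchups easy to spot.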

How questions were selected

The selection is blind to results — we did not pick questions Falconer happens to win:

Questions come from public, third-party sources we did not author (WixQA; the most-engaged Apache Spark questions on Stack Overflow).
Each scenario uses a fixed, pre-defined set of 100, chosen for question quality and community votes before any answers were generated.
Every assistant answered the identical set, with the same wording and the same grading.
We publish all answers, including the ones Falconer lost — so the selection is auditable rather than asserted.

How answers were judged

Three frontier judges — GPT-5.5, Gemini 3.1 Pro, and Claude Opus 4.8 — score each answer pair. We are not the judge.
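Every pair is judged in both orderings (Falconer first, then opponent first), and each ordering counts as its own verdict, so position bias cancels out in the aggregate: 100 questions × 3 judges × 2 orderings = 600 verdicts per matchup.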
Judges score four axes 0–10: faithfulness, helpfulness, completeness, relevance. Citation formatting is explicitly ignored — no one wins on style.
Weighted-sum verdict: 0.35·faithfulness + 0.35·helpfulness + 0.20·completeness + 0.10·relevance. A side wins a verdict only if ahead by > 0.25; otherwise it's a tie.
Head-to-head % = wins / (wins + losses), ties excluded.
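The verdict rule above fits in a few lines. This is a minimal sketch, not the harness that produced the published numbers: the weights and the 0.25 margin come from the rule as stated, while the function names and the example scores are made up for illustration.

```python
# Weights and margin as stated in the weighted-sum rule.
WEIGHTS = {
    "faithfulness": 0.35,
    "helpfulness": 0.35,
    "completeness": 0.20,
    "relevance": 0.10,
}
MARGIN = 0.25


def weighted_score(scores):
    """Collapse the four 0-10 axis scores into one weighted sum."""
    return sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS)


def verdict(falconer, opponent):
    """Return 'win', 'loss', or 'tie' from Falconer's point of view."""
    diff = weighted_score(falconer) - weighted_score(opponent)
    if diff > MARGIN:
        return "win"
    if diff < -MARGIN:
        return "loss"
    return "tie"


# Hypothetical scores, for illustration only.
strong = {"faithfulness": 9, "helpfulness": 8, "completeness": 7, "relevance": 9}
weaker = {"faithfulness": 8, "helpfulness": 8, "completeness": 7, "relevance": 9}

print(verdict(strong, weaker))  # one faithfulness point is worth 0.35, above the margin
print(verdict(strong, strong))  # identical scores always tie
```

A one-point lead on faithfulness or helpfulness alone (0.35) clears the margin, whereas a one-point lead on completeness (0.20) or relevance (0.10) alone does not.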

File layout

SOURCES.md                      attribution & licenses
<scenario>/
  questions.json                the 100 questions + human reference answers
  answers-<agent>.jsonl         one assistant's answers (falconer, notion, confluence, claude-code, codex)
  judgments-falconer-vs-<x>.csv per-judge, per-ordering scores for each head-to-head pair

questions.json — array of { id, question, expected_answer, … }. wix/ adds article_ids (the Help Center articles the reference answer is grounded in); spark/ adds source_url (a link to the original Stack Overflow post).
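A small loader can check that a questions file has the documented shape before anything else is run. The sketch below assumes only the field names listed above; the sample record, its id, and the helper name `index_questions` are hypothetical.

```python
import json

# Fields every question carries, plus the one extra field per scenario.
REQUIRED = ("id", "question", "expected_answer")
EXTRA = {"wix": "article_ids", "spark": "source_url"}

# Hypothetical record in the documented shape; all values are made up.
SAMPLE = """[
  {"id": "spark-001",
   "question": "Example question text",
   "expected_answer": "Example reference answer",
   "source_url": "https://example.com/questions/1"}
]"""


def index_questions(text, scenario):
    """Parse a questions.json payload and return {id: question record}."""
    by_id = {}
    for q in json.loads(text):
        wanted = REQUIRED + (EXTRA[scenario],)
        missing = [key for key in wanted if key not in q]
        if missing:
            raise ValueError(f"question {q.get('id')!r} is missing {missing}")
        by_id[q["id"]] = q
    return by_id


questions = index_questions(SAMPLE, "spark")
print(len(questions), "question(s) loaded")
```

With the real data, the text would come from reading `spark/questions.json` or `wix/questions.json` instead of `SAMPLE`.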

answers-<agent>.jsonl — one JSON object per line: { index, id, question, expected_answer, "<agent>": { output, duration_ms, first_token_ms, …provenance } }. output is the assistant's full answer. Provenance records what actually ran (model, effort / thinking mode, reasoning level, sources scope). A note field appears on the few rows where an assistant timed out or its answer was recovered from a dump.
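Reading an answers file is a line-by-line JSON parse with one nested lookup. The sketch below assumes the nested key matches the agent name used in the filename (for example `falconer`), and that `note` may sit either on the row or inside the agent object; the description above does not pin either detail down, so treat both as assumptions. The sample line is made up.

```python
import json

# Hypothetical line in the documented shape; all values are made up.
SAMPLE_LINE = json.dumps({
    "index": 0,
    "id": "wix-001",
    "question": "Example question text",
    "expected_answer": "Example reference answer",
    "falconer": {
        "output": "Example answer text.",
        "duration_ms": 4200,
        "first_token_ms": 900,
    },
})


def read_answers(lines, agent):
    """Return {question id: answer summary} for one agent's JSONL lines."""
    answers = {}
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        row = json.loads(line)
        run = row[agent]  # assumption: the key equals the agent name
        answers[row["id"]] = {
            "output": run["output"],
            "duration_ms": run.get("duration_ms"),
            "first_token_ms": run.get("first_token_ms"),
            # assumption: the note lives on the row or on the agent object
            "note": row.get("note", run.get("note")),
        }
    return answers


answers = read_answers([SAMPLE_LINE], "falconer")
print(answers["wix-001"]["output"])
```

For a real file, pass an open file handle as `lines`, for example `open("wix/answers-falconer.jsonl", encoding="utf-8")`.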

judgments-falconer-vs-<x>.csv — one row per (question, judge, ordering): question_id, judge, ordering, a_source, b_source, winner, a_{faithfulness,completeness,helpfulness,relevance}, b_{…}, reasoning, error. a_* / b_* are scores by position; a_source / b_source say which side was Falconer in that ordering.
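Because scores are stored by position, the first step when reading a judgments file is to re-orient each row so that Falconer's scores and the opponent's scores can be compared directly. The sketch below assumes the `a_{…}` shorthand expands to columns such as `a_faithfulness`, that `a_source` / `b_source` hold the literal string `falconer` for Falconer's side, and that a non-empty `error` marks a row to skip. The sample rows, including the judge, ordering, and winner values, are invented.

```python
import csv
import io

AXES = ("faithfulness", "completeness", "helpfulness", "relevance")

# Hypothetical rows: the same pair of answers judged in both orderings.
SAMPLE_CSV = """question_id,judge,ordering,a_source,b_source,winner,a_faithfulness,a_completeness,a_helpfulness,a_relevance,b_faithfulness,b_completeness,b_helpfulness,b_relevance,reasoning,error
q1,judge-1,1,falconer,codex,a,9,7,8,9,8,7,8,9,example,
q1,judge-1,2,codex,falconer,b,8,7,8,9,9,7,8,9,example,
"""


def orient(row, us="falconer"):
    """Return (falconer_scores, opponent_scores) regardless of position."""
    if row["a_source"] == us:
        ours, theirs = "a", "b"
    elif row["b_source"] == us:
        ours, theirs = "b", "a"
    else:
        raise ValueError(f"{us!r} is on neither side of row {row['question_id']}")

    def side(prefix):
        return {axis: float(row[f"{prefix}_{axis}"]) for axis in AXES}

    return side(ours), side(theirs)


rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
oriented = [orient(row) for row in rows if not row["error"]]

for falconer, opponent in oriented:
    print(falconer["faithfulness"], opponent["faithfulness"])
```

Both sample rows resolve to the same (Falconer, opponent) score pair, which is the property that makes per-ordering verdicts comparable.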

Reproducing the win rates

For each judgment row, compute the weighted sum for each side and compare:

score(side) = 0.35*faithfulness + 0.35*helpfulness + 0.20*completeness + 0.10*relevance
diff = score(falconer) - score(opponent)     # use a_source/b_source to find which side is Falconer
verdict = win  if diff >  0.25
          loss if diff < -0.25
          tie  otherwise
win_rate = wins / (wins + losses)             # over all rows for a pair, ties excluded

Note: a few verdicts land at exactly |diff| = 0.25. Because of floating-point summation order, an independent reimplementation can classify 1–2 of these differently, shifting a headline by at most ~1 percentage point. It does not change any conclusion.
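One way to make the boundary cases in the note above deterministic in a reimplementation is to use exact rational arithmetic instead of floats. The sketch below does that with Python's `fractions.Fraction`. It is an alternative approach, not the method used for the published counts, so a row sitting at exactly |diff| = 0.25 is always a tie here and may be classified differently from the headline numbers. The example scores are hypothetical.

```python
from fractions import Fraction

# The same weights and margin as the rule above, held as exact rationals.
WEIGHTS = {
    "faithfulness": Fraction(35, 100),
    "helpfulness": Fraction(35, 100),
    "completeness": Fraction(20, 100),
    "relevance": Fraction(10, 100),
}
MARGIN = Fraction(1, 4)


def exact_score(scores):
    """Weighted sum with no rounding; str() keeps decimals like 8.5 exact."""
    return sum(WEIGHTS[axis] * Fraction(str(scores[axis])) for axis in WEIGHTS)


def exact_verdict(falconer, opponent):
    """'win', 'loss', or 'tie' from Falconer's side, with an exact boundary."""
    diff = exact_score(falconer) - exact_score(opponent)
    if diff > MARGIN:
        return "win"
    if diff < -MARGIN:
        return "loss"
    return "tie"


# Hypothetical boundary case: +1 faithfulness (0.35) and -1 relevance (0.10)
# put the difference at exactly 0.25.
ours = {"faithfulness": 9, "helpfulness": 8, "completeness": 7, "relevance": 8}
theirs = {"faithfulness": 8, "helpfulness": 8, "completeness": 7, "relevance": 9}

print(exact_score(ours) - exact_score(theirs))
print(exact_verdict(ours, theirs))
```

The trade-off is speed for determinism: rational arithmetic is slower than floats, which is unlikely to matter at a few hundred rows per matchup.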

Caveats

Agent answers are model outputs, published as-is for transparency — not authoritative documentation. They may contain errors.
Coverage gaps (verdicts < 600) are dropped rows, not zero scores: Notion returned no response on ~6% of wix/ questions (a known reliability issue with that agent); Rovo timed out on 2 spark/ questions; and one spark/ Codex verdict was excluded because of a transient judge error. Missing answers drop out of the denominator rather than scoring 0.
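The verdict counts in the results tables are consistent with these gaps, given one verdict per (question, judge, ordering). The arithmetic below is a sketch of that check; the figure of 94 answered Notion questions is inferred from 564 / 6 and the "~6%" above, not stated directly, and the helper name is illustrative.

```python
JUDGES = 3      # three judges score every pair
ORDERINGS = 2   # each pair is judged in both orderings


def expected_verdicts(questions_answered, excluded_rows=0):
    """Verdict rows for a matchup: one per (question, judge, ordering)."""
    return questions_answered * JUDGES * ORDERINGS - excluded_rows


# Full coverage: 100 questions.
print(expected_verdicts(100))
# wix/ vs Notion AI: 94 answered questions (inferred) matches the 564 in the table.
print(expected_verdicts(94))
# spark/ vs Rovo: 2 timed-out questions leave 98.
print(expected_verdicts(98))
# spark/ vs Codex: all 100 questions, one verdict row excluded.
print(expected_verdicts(100, excluded_rows=1))
```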

Attribution & license

See SOURCES.md. In short: WixQA is MIT; Spark questions/answers are from Stack Overflow (CC BY-SA) with a source_url to each origin post, and cite Apache Spark docs (Apache-2.0). The agent answers and judge scores are original to this benchmark.
