feat(compare): add normalized gain metric #1101

Merged
christso merged 3 commits into main from feat/1100-normalized-gain on Apr 15, 2026

Conversation

christso (Collaborator) commented Apr 14, 2026

Summary

Adds Hake's normalized gain (`g`) to `agentv compare` output, measuring improvement relative to remaining headroom.

The metric

```
g = (score_candidate − score_baseline) / (1 − score_baseline)
```

Raw delta (`Δ`) tells you how much scores changed. Normalized gain tells you how much of the available improvement was captured:

| Baseline | Candidate | Δ     | g     | Interpretation                             |
|----------|-----------|-------|-------|--------------------------------------------|
| 0.10     | 0.55      | +0.45 | 0.50  | Captured 50% of remaining headroom         |
| 0.90     | 0.95      | +0.05 | 0.50  | Same proportional gain, despite smaller Δ  |
| 0.50     | 0.25      | −0.25 | −0.50 | Regression: lost 50% of headroom           |
| 1.00     | 1.00      | 0.00  | null  | No headroom, metric undefined              |

Returns `null` when baseline is already 1.0 (perfect score). Null values are excluded from mean computation.
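
In TypeScript terms, a minimal sketch of that behavior (the real `computeNormalizedGain()` lives in `apps/cli/src/commands/compare/index.ts` and may differ in detail; the `meanNormalizedGain` computation below is a guess at the shape behind the summary field of the same name):

```ts
// Minimal sketch: not the actual implementation from
// apps/cli/src/commands/compare/index.ts, which may differ in detail.
function computeNormalizedGain(
  baseline: number,
  candidate: number,
): number | null {
  // A baseline at the ceiling leaves no headroom, so g is undefined.
  if (baseline >= 1) return null;
  return (candidate - baseline) / (1 - baseline);
}

// Guess at the null-skipping mean behind the meanNormalizedGain summary field.
function meanNormalizedGain(gains: Array<number | null>): number | null {
  const defined = gains.filter((g): g is number => g !== null);
  if (defined.length === 0) return null;
  return defined.reduce((sum, g) => sum + g, 0) / defined.length;
}
```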

Where it appears

  • Table output: `g: +0.256` in summary line
  • Matrix pairwise: `g +0.256` alongside existing `Δ`
  • JSON output: `mean_normalized_gain` in summary, `normalized_gain` per matched result

Red/Green E2E

Before (main — no `g`):
```
Summary: 2 wins, 1 loss, 0 ties | Mean Δ: +0.267 | Status: improved
```

After (this branch):
```
Summary: 2 wins, 1 loss, 0 ties | Mean Δ: +0.267 | g: +0.256 | Status: improved
```

JSON output now includes `normalized_gain` per test and `mean_normalized_gain` in summary.
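
For reference, an illustrative sketch of the serialized shape (only the `normalized_gain` and `mean_normalized_gain` keys come from this PR; the interface names and `delta` fields are assumptions):

```ts
// Illustrative only: the keys normalized_gain and mean_normalized_gain are
// from this PR; everything else here is an assumed name.
interface CompareJsonSummary {
  mean_delta: number;                  // assumed name for the existing mean Δ
  mean_normalized_gain: number | null; // new: null when no pair has headroom
}

interface CompareJsonResult {
  delta: number;                       // assumed name for the existing per-test Δ
  normalized_gain: number | null;      // new: null when baseline is 1.0
}
```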

Changes

  • `apps/cli/src/commands/compare/index.ts` — `computeNormalizedGain()`, `normalizedGain` on `MatchedResult`, `meanNormalizedGain` in summary, display in `formatTable` and `formatMatrix`
  • `apps/cli/test/commands/compare/compare.test.ts` — 10 new tests covering all cases
  • `apps/web/src/content/docs/docs/tools/compare.mdx` — docs updated with formula, interpretation table, updated output examples

Test plan

  • 50/50 tests pass (10 new tests added; a sketch of the style follows this list)
  • Typecheck, lint, build all pass (pre-push hook)
  • Red/green CLI e2e verified
  • JSON output verified: `normalized_gain` per test, `mean_normalized_gain` in summary
  • Docs updated
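
A sketch of the style of the new cases, mapped onto the interpretation table above (a vitest-style runner is assumed; the real suite lives in `apps/cli/test/commands/compare/compare.test.ts`):

```ts
// Sketch only: a vitest-style runner is assumed, and computeNormalizedGain
// is re-sketched inline so the example stands alone.
import { describe, expect, it } from "vitest";

function computeNormalizedGain(
  baseline: number,
  candidate: number,
): number | null {
  if (baseline >= 1) return null;
  return (candidate - baseline) / (1 - baseline);
}

describe("computeNormalizedGain", () => {
  it("captures half the remaining headroom", () => {
    expect(computeNormalizedGain(0.1, 0.55)).toBeCloseTo(0.5);
  });

  it("returns null when the baseline is already perfect", () => {
    expect(computeNormalizedGain(1.0, 1.0)).toBeNull();
  });

  it("reports a regression as negative gain", () => {
    expect(computeNormalizedGain(0.5, 0.25)).toBeCloseTo(-0.5);
  });
});
```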

Closes #1100

🤖 Generated with Claude Code

Add Hake's normalized gain (g) to compare output, measuring improvement
relative to remaining headroom rather than raw absolute delta.

Formula: g = (score_candidate − score_baseline) / (1 − score_baseline)

This separates genuine scaffolding improvements from ceiling effects — a +5pp gain
from a 90% baseline (g=0.5) is proportionally much larger than +5pp
from a 10% baseline (g=0.056).

Shown as "Norm. gain" in table output and "g" in matrix pairwise summary.
Available as mean_normalized_gain in JSON output. Returns null when
baseline is 1.0 (perfect score, no headroom).

Closes #1100

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

cloudflare-workers-and-pages Bot commented Apr 14, 2026

Deploying agentv with Cloudflare Pages

Latest commit: ea1e60a
Status: ✅ Deploy successful!
Preview URL: https://38a450aa.agentv.pages.dev
Branch Preview URL: https://feat-1100-normalized-gain.agentv.pages.dev

Use 'g' consistently in both table summary and matrix pairwise output,
matching the standard notation from Hake (1998) and SkillsBench paper.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Add normalized gain (g) to compare docs: formula, interpretation table,
updated table/JSON output examples, and tips section.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
christso merged commit b37834e into main on Apr 15, 2026
4 checks passed
christso deleted the feat/1100-normalized-gain branch on April 15, 2026 at 00:22