Add page health distribution to eval harness #96
New fast-suite evaluator that breaks down the corpus-level health score into per-page scores with tier bucketing:
- Reuses existing lint rules (`runAllLintRules`) and scoring model (`deductionFor`): no new lint rules, no weight changes.
- Tiers: healthy (>=90), adequate (70-89), needs_work (50-69), broken (<50).
- Returns `perPage` (all pages sorted worst-first), `worstPages` (configurable count, default 5), and `distribution` counts.
- Integrated into the eval pipeline alongside existing metrics.
- Terminal output shows tier distribution and worst pages.

11 tests covering empty wiki, tier boundaries, `worstPages` ordering, and parameterized count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
ethanj
left a comment
Thanks for this. The core evaluator is close: the per-page scoring idea fits the issue #78 direction, the tier-boundary tests are useful, and reusing the existing health deductions is the right direction.
I’d like to get a few issues fixed before merging:
- `src/eval/report.ts` is based on a stale copy and reverts recently merged formatter fixes. Current `main` has an ANSI-aware/truncating `line()` helper and renders source-inventory warnings via `warningRows()`. This PR replaces that with the older `padEnd` implementation, removes the terminal warning output, and deletes the regression tests that pinned both behaviors. Please rebase this file onto current `main`, preserve the existing `line()`/`warningRows()` behavior and tests, then add the Page Health section on top.
- `pageHealthDistribution` is required on `EvalReport`, which breaks rendering old eval history. `llmwiki eval report` re-renders stored reports, and reports written before this PR won't have the new field, so `formatPageHealthDistribution()` can throw when it reads `perPage`. Please make the field optional/backward-compatible, like `citationSupport`, and skip the section when it's absent.
- The new metric does not include the current freshness lint rule. `llmwiki lint` now runs `checkStalePages`, but `runAllLintRules()` in `src/eval/health.ts` maintains a separate lint list and omits it. That means a stale page can show as healthy in the new Page Health distribution. Please either delegate to the canonical lint orchestrator or update the shared lint path so eval health and page health include the same rules as `llmwiki lint`.
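For the backward-compatibility point, a minimal sketch of what an optional field plus a guarded formatter could look like. The interfaces here are trimmed-down, hypothetical stand-ins (the real `EvalReport` has more fields, and the real `formatPageHealthDistribution()` signature and output may differ):

```typescript
// Hypothetical, reduced shapes for illustration only.
interface PageHealthEntry {
  slug: string;
  score: number;
}

interface PageHealthDistribution {
  perPage: PageHealthEntry[];
}

interface EvalReport {
  healthScore: number;
  // Optional, so reports stored before this PR still type-check and render.
  pageHealthDistribution?: PageHealthDistribution;
}

// Returns the lines of the Page Health section, or no lines when the
// stored report predates the field. Signature is an assumption.
function formatPageHealthDistribution(report: EvalReport): string[] {
  const dist = report.pageHealthDistribution;
  if (dist === undefined) {
    return []; // old report: skip the section instead of throwing
  }
  return ["Page Health", ...dist.perPage.map((p) => `${p.slug}: ${p.score}`)];
}

// An old stored report (no field) and a new one.
const oldReport: EvalReport = { healthScore: 82 };
const newReport: EvalReport = {
  healthScore: 82,
  pageHealthDistribution: { perPage: [{ slug: "intro", score: 95 }] },
};
```

With this shape, re-rendering `oldReport` yields no Page Health lines, while `newReport` renders the section as usual.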
A couple of should-fix items while you’re in there:
- The Page Health terminal rows are too wide for the 49-char box. Even with truncation restored, the distribution row and worst-page rows will lose useful information. A more compact layout, for example slug + score on one line and issues on a second dimmed line, would be easier to read.
- The eval pipeline now runs the lint rules twice, once for corpus health and once for page health. If it stays this way for PR size, that's survivable, but since this is all internal plumbing it would be cleaner to run the lint pass once in `runEval()` and feed the same results to both consumers.
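To illustrate the compact two-line layout suggested above, here is a rough sketch. It does not use the project's `line()` helper; `fit`, `formatWorstPage`, the `WorstPage` shape, and the issue names are all hypothetical, and treating 49 as the usable inner width is an assumption:

```typescript
const BOX_WIDTH = 49; // assumed usable width of the report box
const DIM = "\x1b[2m"; // ANSI SGR "faint"
const RESET = "\x1b[0m"; // ANSI SGR reset

// Truncate plain text to `width` characters, marking the cut with an ellipsis.
function fit(text: string, width: number): string {
  return text.length <= width ? text : text.slice(0, width - 1) + "…";
}

// Hypothetical element shape for a worst-page row.
interface WorstPage {
  slug: string;
  score: number;
  issues: string[];
}

function formatWorstPage(page: WorstPage, width: number = BOX_WIDTH): string[] {
  const score = String(page.score).padStart(3);
  // Line 1: slug on the left, score right-aligned.
  const slugWidth = width - score.length - 1;
  const first = fit(page.slug, slugWidth).padEnd(slugWidth) + " " + score;
  // Line 2: indented issue list. Truncate before adding the ANSI codes so
  // the escape sequences do not count toward the visible width.
  const second = DIM + fit("  " + page.issues.join(", "), width) + RESET;
  return [first, second];
}

const rows = formatWorstPage({
  slug: "a".repeat(80),
  score: 42,
  issues: ["broken-link", "missing-citation"],
});
```

Splitting the row this way keeps the score visible even when the slug is truncated, and lets the issue list use the full width on its own line.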
Nits: please use named constants for the 90/70/50 tier cutoffs, and consider avoiding the needs_work vs needsWork mismatch in the JSON shape if it’s still easy to adjust.
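As an illustration of both nits, a small sketch with named cutoffs and a single tier label type reused as the distribution key, so the JSON shape has one spelling. The constant and function names are hypothetical; only the 90/70/50 boundaries and the four tier names come from the PR:

```typescript
// Named cutoffs instead of inline 90/70/50 literals.
const HEALTHY_MIN = 90;
const ADEQUATE_MIN = 70;
const NEEDS_WORK_MIN = 50;

// One label type used for both the per-page tier and the distribution keys,
// which avoids a needs_work vs needsWork mismatch.
type Tier = "healthy" | "adequate" | "needs_work" | "broken";

function tierFor(score: number): Tier {
  if (score >= HEALTHY_MIN) return "healthy";
  if (score >= ADEQUATE_MIN) return "adequate";
  if (score >= NEEDS_WORK_MIN) return "needs_work";
  return "broken";
}

// Count how many scores fall into each tier.
function countTiers(scores: number[]): Record<Tier, number> {
  const counts: Record<Tier, number> = {
    healthy: 0,
    adequate: 0,
    needs_work: 0,
    broken: 0,
  };
  for (const score of scores) {
    counts[tierFor(score)] += 1;
  }
  return counts;
}
```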
The evaluator core looks salvageable; the main thing is to preserve current main behavior and make the new report field backward-compatible.
Per-page health distribution evaluator, as discussed in #78.
Reuses the existing lint/eval results (`runAllLintRules` + `deductionFor`) so per-page scores stay consistent with the overall health score: a page's score is the health score it would get if it were the only page in the wiki.
No new lint rules, no weight changes, overall health score semantics unchanged.
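The "only page in the wiki" idea can be sketched roughly as follows. This is not the project's code: the `LintIssue` shape, the `sketchDeduction` weights (a stand-in for `deductionFor`), and the "start at 100, subtract, clamp at 0" model are all assumptions made for illustration:

```typescript
// Hypothetical issue shape; the real lint result type may differ.
interface LintIssue {
  file: string;
  rule: string;
  severity: "error" | "warning";
}

// Stand-in for the project's deductionFor; these weights are invented.
function sketchDeduction(issue: LintIssue): number {
  return issue.severity === "error" ? 10 : 3;
}

// Assumed scoring model: start at 100, subtract deductions, clamp at 0.
function scoreIssues(issues: LintIssue[]): number {
  const total = issues.reduce((sum, issue) => sum + sketchDeduction(issue), 0);
  return Math.max(0, 100 - total);
}

// Score each page as if its own issues were the whole corpus.
// Pages are keyed by file path, so pages with the same slug stay separate.
function scorePerPage(files: string[], issues: LintIssue[]): Map<string, number> {
  const byFile = new Map<string, LintIssue[]>();
  for (const file of files) {
    byFile.set(file, []); // pages without issues still get a score
  }
  for (const issue of issues) {
    const bucket = byFile.get(issue.file);
    if (bucket !== undefined) {
      bucket.push(issue);
    }
  }
  const scores = new Map<string, number>();
  byFile.forEach((pageIssues, file) => {
    scores.set(file, scoreIssues(pageIssues));
  });
  return scores;
}
```

Because the same scoring function is applied to the whole issue list for the corpus score and to each page's subset for the page score, the two stay consistent by construction.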
Returns `perPage` (all pages, sorted worst-first), `worstPages` (configurable count, default 5), and `distribution` counts. `perPage` and `worstPages` share the same element shape.

Two notes for review:
- `runAllLintRules` runs independently here (so lint runs twice per eval). This is pure file I/O, no LLM calls, and an intentional trade-off to keep `evaluateHealth`'s API unchanged. A shared cache can be added later if needed.
- Pages are keyed by file path so same-slug pages aren't merged, but their display names can repeat. Happy to switch to a path-relative name if you'd prefer.
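On the first note, one possible way to run lint once without breaking existing callers is an optional precomputed-issues argument. The sketch below is hypothetical throughout: the `...Sketch` functions, the synchronous `LintRunner` type, and the issue-count "scores" are placeholders, not the real `evaluateHealth` or `runEval` signatures:

```typescript
// Hypothetical issue shape and lint runner type.
interface LintIssue {
  file: string;
  rule: string;
}
type LintRunner = (wikiDir: string) => LintIssue[];

// The optional third argument keeps existing two-argument callers working:
// when no precomputed issues are passed, the function lints on its own.
function evaluateHealthSketch(
  wikiDir: string,
  runLint: LintRunner,
  precomputed?: LintIssue[],
): number {
  const issues = precomputed !== undefined ? precomputed : runLint(wikiDir);
  return issues.length; // placeholder for the real corpus scoring
}

// Page-level consumer that only ever sees already-computed issues.
function evaluatePageHealthSketch(issues: LintIssue[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const issue of issues) {
    counts.set(issue.file, (counts.get(issue.file) ?? 0) + 1);
  }
  return counts;
}

// Run the lint pass once and feed the same results to both consumers.
function runEvalSketch(wikiDir: string, runLint: LintRunner) {
  const issues = runLint(wikiDir);
  return {
    corpusIssues: evaluateHealthSketch(wikiDir, runLint, issues),
    perPageIssues: evaluatePageHealthSketch(issues),
  };
}
```

This also guarantees that both metrics are computed from the same rule set, since they consume one shared result.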