Add page health distribution to eval harness by dohu012 · Pull Request #96 · atomicstrata/llm-wiki-compiler

dohu012 · 2026-06-09T09:53:22Z

Per-page health distribution evaluator, as discussed in #78.

Reuses the existing lint/eval results (runAllLintRules + deductionFor)
so per-page scores stay consistent with the overall health score — a page's
score is the health score it would get if it were the only page in the wiki.
No new lint rules, no weight changes, overall health score semantics unchanged.

- Tiers: healthy (>=90), adequate (70-89), needs_work (50-69), broken (<50)
- Returns perPage (all pages, sorted worst-first), worstPages (configurable count, default 5), and distribution counts
- perPage and worstPages share the same element shape
- Terminal output shows tier counts and worst pages with their top issues
- 11 tests: empty wiki, tier boundaries, worst-page ordering, parameterized count
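
For readers skimming the thread, the bucketing described above can be sketched roughly as follows. This is an illustration, not the code in this PR. The `LintIssue` shape, the pre-computed `deduction` field, the rule names, and the `100 - sum of deductions` formula are all assumptions, since the real `runAllLintRules` and `deductionFor` signatures are not shown here. Only the tier cutoffs, the worst-first ordering, the default count of 5, and grouping by full path come from the description.

```typescript
// Hypothetical shapes: the real lint result and report types are not shown in this PR.
type Tier = "healthy" | "adequate" | "needs_work" | "broken";

interface LintIssue {
  file: string;      // full file path, used as the grouping key
  rule: string;      // illustrative rule name
  deduction: number; // assumed to be pre-computed by something like deductionFor
}

interface PageHealth {
  file: string;
  name: string; // display name: file basename
  score: number;
  tier: Tier;
  issues: string[];
}

// Tier cutoffs from the PR description.
const HEALTHY_MIN = 90;
const ADEQUATE_MIN = 70;
const NEEDS_WORK_MIN = 50;

function tierFor(score: number): Tier {
  if (score >= HEALTHY_MIN) return "healthy";
  if (score >= ADEQUATE_MIN) return "adequate";
  if (score >= NEEDS_WORK_MIN) return "needs_work";
  return "broken";
}

function pageHealthDistribution(pages: string[], issues: LintIssue[], worstCount = 5) {
  // Group by full path so pages that share a basename are never merged.
  const byFile = new Map<string, LintIssue[]>();
  for (const page of pages) byFile.set(page, []);
  for (const issue of issues) {
    const list = byFile.get(issue.file) ?? [];
    list.push(issue);
    byFile.set(issue.file, list);
  }

  const perPage: PageHealth[] = [...byFile.entries()].map(([file, list]) => {
    const total = list.reduce((sum, issue) => sum + issue.deduction, 0);
    const score = Math.max(0, 100 - total); // assumed scoring formula
    return {
      file,
      name: file.split("/").pop() ?? file,
      score,
      tier: tierFor(score),
      issues: list.map((issue) => issue.rule),
    };
  });

  // Worst-first, with the path as a stable tie-breaker.
  perPage.sort((a, b) => a.score - b.score || a.file.localeCompare(b.file));

  const distribution: Record<Tier, number> = { healthy: 0, adequate: 0, needs_work: 0, broken: 0 };
  for (const page of perPage) distribution[page.tier] += 1;

  return { perPage, worstPages: perPage.slice(0, worstCount), distribution };
}
```

Because `worstPages` is a slice of `perPage`, both lists share one element shape by construction.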

Two notes for review:

1. runAllLintRules runs independently here (so lint runs twice per eval). This is pure file I/O with no LLM calls, an intentional trade-off to keep evaluateHealth's API unchanged. A shared cache can be added later if needed.
2. Per-page display name uses the file basename. Pages are grouped by full file path so same-slug pages aren't merged, but their display names can repeat. Happy to switch to a path-relative name if you'd prefer.
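
One possible shape for the shared cache mentioned in the first note is a small run-once wrapper around the lint pass. This is a sketch under assumptions: `once`, `LintResult`, and the stand-in lint function are hypothetical names, and the real `runAllLintRules` may be asynchronous (the wrapper would then cache the promise).

```typescript
// Hypothetical result shape; the real lint result type is not shown in this PR.
interface LintResult {
  file: string;
  rule: string;
}

// Wraps a function so it runs at most once; later calls reuse the first value.
// If T is a Promise, the same promise (including a rejection) is shared.
// If run() throws synchronously, nothing is cached and the next call retries.
function once<T>(run: () => T): () => T {
  let done = false;
  let value: T | undefined;
  return () => {
    if (!done) {
      value = run();
      done = true;
    }
    return value as T;
  };
}

// Stand-in for the real lint pass, counting how often it actually executes.
let lintRuns = 0;
const runLint = once((): LintResult[] => {
  lintRuns += 1;
  return [{ file: "wiki/a.md", rule: "example-rule" }];
});

// Both consumers ask for lint results, but the rules execute only once.
const corpusInput = runLint();
const pageInput = runLint();
```

A wrapper like this would leave evaluateHealth's signature untouched, which is the trade-off the note is trying to preserve. The cache would need to be scoped to a single eval run so a later run does not see stale results.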

New fast-suite evaluator that breaks down the corpus-level health score into per-page scores with tier bucketing:

- Reuses existing lint rules (runAllLintRules) and scoring model (deductionFor): no new lint rules, no weight changes.
- Tiers: healthy (>=90), adequate (70-89), needs_work (50-69), broken (<50).
- Returns perPage (all pages sorted worst-first), worstPages (configurable count, default 5), and distribution counts.
- Integrated into the eval pipeline alongside existing metrics.
- Terminal output shows tier distribution and worst pages.

11 tests covering empty wiki, tier boundaries, worstPage ordering, and parameterized count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ethanj

Thanks for this. The core evaluator is close: the per-page scoring idea fits the issue #78 direction, the tier-boundary tests are useful, and reusing the existing health deductions is the right direction.

I’d like to get a few issues fixed before merging:

1. src/eval/report.ts is based on a stale copy and reverts recently merged formatter fixes. Current main has an ANSI-aware/truncating line() helper and renders source-inventory warnings via warningRows(). This PR replaces that with the older padEnd implementation, removes the terminal warning output, and deletes the regression tests that pinned both behaviors. Please rebase this file onto current main, preserve the existing line() / warningRows() behavior and tests, then add the Page Health section on top.
2. pageHealthDistribution is required on EvalReport, which breaks rendering old eval history. llmwiki eval report re-renders stored reports, and reports written before this PR won’t have the new field, so formatPageHealthDistribution() can throw when it reads perPage. Please make the field optional/backward-compatible, like citationSupport, and skip the section when it’s absent.
3. The new metric does not include the current freshness lint rule. llmwiki lint now runs checkStalePages, but runAllLintRules() in src/eval/health.ts maintains a separate lint list and omits it. That means a stale page can show as healthy in the new Page Health distribution. Please either delegate to the canonical lint orchestrator or update the shared lint path so eval health and page health include the same rules as llmwiki lint.
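
To make the backward-compatibility request concrete, here is a minimal sketch of an optional field plus a guard that skips the section. The interfaces are trimmed, hypothetical stand-ins (the real EvalReport has more fields, and `healthScore` is only a placeholder), and `formatPageHealthSection` is an illustrative name rather than the formatter in this PR.

```typescript
// Trimmed, hypothetical stand-ins for the real report types.
interface PageHealthEntry {
  name: string;
  score: number;
  issues: string[];
}

interface PageHealthDistribution {
  perPage: PageHealthEntry[];
  worstPages: PageHealthEntry[];
}

interface EvalReport {
  healthScore: number; // placeholder for the existing fields
  pageHealthDistribution?: PageHealthDistribution; // optional: absent in older stored reports
}

// Returns the rows for the Page Health section, or no rows for old reports.
function formatPageHealthSection(report: EvalReport): string[] {
  const dist = report.pageHealthDistribution;
  if (!dist) return []; // stored before the field existed: skip instead of throwing

  const rows = ["Page Health"];
  for (const page of dist.worstPages) {
    rows.push(`${page.name} ${page.score}`);
  }
  return rows;
}

// A report as it might be re-read from eval history written before this change.
const oldReport = JSON.parse('{"healthScore": 80}') as EvalReport;
const oldRows = formatPageHealthSection(oldReport);
```

With the guard in place, re-rendering an old report produces an empty section rather than an exception on `perPage`.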

A couple of should-fix items while you’re in there:

- The Page Health terminal rows are too wide for the 49-char box. Even with truncation restored, the distribution row and worst-page rows will lose useful information. A more compact layout, for example slug + score on one line and issues on a second dimmed line, would be easier to read.
- The eval pipeline now runs the lint rules twice, once for corpus health and once for page health. If it stays this way for PR size, that’s survivable, but since this is all internal plumbing it would be cleaner to run the lint pass once in runEval() and feed the same results to both consumers.
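
The compact two-line layout suggested in the first item could look roughly like this. It is a sketch only: it does not use the existing line() helper from main (whose signature is not shown in this thread), it assumes 49 is the usable content width, and `truncate` and `pageRows` are hypothetical names. The dim and reset sequences are the standard ANSI SGR codes.

```typescript
const BOX_WIDTH = 49; // from the review; assumed here to be the usable content width
const DIM = "\x1b[2m";  // ANSI SGR "faint"
const RESET = "\x1b[0m"; // ANSI SGR reset

// Cuts plain text to at most `width` characters, marking the cut with an ellipsis.
function truncate(text: string, width: number): string {
  if (text.length <= width) return text;
  return width <= 1 ? text.slice(0, width) : text.slice(0, width - 1) + "…";
}

// Row 1: page name left-aligned, score right-aligned. Row 2: issues, dimmed.
// Truncation happens before the ANSI codes are added, so escape bytes never
// count against the visible width.
function pageRows(name: string, score: number, issues: string[], width = BOX_WIDTH): [string, string] {
  const scoreText = String(score).padStart(3);
  const nameWidth = width - scoreText.length - 1;
  const first = truncate(name, nameWidth).padEnd(nameWidth) + " " + scoreText;
  const second = DIM + truncate("  " + issues.join(", "), width) + RESET;
  return [first, second];
}
```

Splitting each page across two rows keeps the score visible even when the name is truncated, and lets the issue list use the full width on its own line.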

Nits: please use named constants for the 90/70/50 tier cutoffs, and consider avoiding the needs_work vs needsWork mismatch in the JSON shape if it’s still easy to adjust.

The evaluator core looks salvageable; the main thing is to preserve current main behavior and make the new report field backward-compatible.

dohu012 mentioned this pull request Jun 9, 2026

Proposal: extend eval harness with new quality dimensions #78

Open

ethanj requested changes Jun 10, 2026

dohu012 wants to merge 1 commit into atomicstrata:main from dohu012:feature/page-health-distribution
