Add page health distribution to eval harness #96

Open: dohu012 wants to merge 1 commit into atomicstrata:main from dohu012:feature/page-health-distribution
dohu012 commented Jun 9, 2026

Contributor

Per-page health distribution evaluator, as discussed in #78.

Reuses the existing lint/eval results (runAllLintRules + deductionFor)
so per-page scores stay consistent with the overall health score — a page's
score is the health score it would get if it were the only page in the wiki.
No new lint rules, no weight changes, overall health score semantics unchanged.

  • Tiers: healthy (>=90), adequate (70-89), needs_work (50-69), broken (<50)
  • Returns perPage (all pages, sorted worst-first), worstPages
    (configurable count, default 5), and distribution counts
  • perPage and worstPages share the same element shape
  • Terminal output shows tier counts and worst pages with their top issues
  • 11 tests: empty wiki, tier boundaries, worst-page ordering, parameterized count
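
As a rough illustration of the bucketing and ordering described in the list above, here is a minimal sketch. The names (`tierFor`, `summarize`, `PageScore`, the constant names) are hypothetical and not taken from the PR; only the cutoffs, the worst-first ordering, and the default count of 5 come from the description. The alphabetical tie-break is an assumption.

```typescript
// Hypothetical names throughout; the PR's real types and functions may differ.
type Tier = "healthy" | "adequate" | "needs_work" | "broken";

interface PageScore {
  page: string;
  score: number; // 0-100
}

// Tier cutoffs from the PR description, as named constants.
const HEALTHY_MIN = 90;
const ADEQUATE_MIN = 70;
const NEEDS_WORK_MIN = 50;

function tierFor(score: number): Tier {
  if (score >= HEALTHY_MIN) return "healthy";
  if (score >= ADEQUATE_MIN) return "adequate";
  if (score >= NEEDS_WORK_MIN) return "needs_work";
  return "broken";
}

function summarize(pages: PageScore[], worstCount = 5) {
  // Sort a copy worst-first; ties broken by page name (an assumption).
  const perPage = pages
    .map((p) => ({ page: p.page, score: p.score, tier: tierFor(p.score) }))
    .sort((a, b) => a.score - b.score || a.page.localeCompare(b.page));

  const distribution: Record<Tier, number> = {
    healthy: 0,
    adequate: 0,
    needs_work: 0,
    broken: 0,
  };
  for (const p of perPage) distribution[p.tier] += 1;

  // worstPages is a prefix of perPage, so both share one element shape.
  return { perPage, worstPages: perPage.slice(0, worstCount), distribution };
}
```

Taking `worstPages` as a slice of `perPage` is one simple way to guarantee the two lists share an element shape.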

Two notes for review:

  • runAllLintRules runs independently here (so lint runs twice per eval).
    This is pure file I/O, no LLM calls — an intentional trade-off to keep
    evaluateHealth's API unchanged. A shared cache can be added later if needed.
  • Per-page display name uses the file basename. Pages are grouped by full
    file path so same-slug pages aren't merged, but their display names can
    repeat. Happy to switch to a path-relative name if you'd prefer.
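
To make the grouping note concrete, here is a small sketch of keying pages by full path while displaying only the basename. Everything in it is an assumption for illustration: the `LintResult` shape, the stand-in `deductionFor` weights, and the "100 minus total deductions, floored at 0" formula are invented, since the real scoring model is not shown in this thread.

```typescript
// Hypothetical lint result shape; the real type is not shown in this PR text.
interface LintResult {
  file: string; // full path to the page
  rule: string;
  severity: "error" | "warning";
}

// Stand-in for the existing deductionFor; these weights are invented.
function deductionFor(result: LintResult): number {
  return result.severity === "error" ? 10 : 3;
}

// Basename without relying on node:path, handling both separators.
function displayName(filePath: string): string {
  const parts = filePath.split(/[\\/]/);
  return parts[parts.length - 1] ?? filePath;
}

function scorePages(allPages: string[], results: LintResult[]) {
  const totals: Record<string, number> = {};
  const order: string[] = [];

  const add = (file: string, amount: number): void => {
    if (!(file in totals)) order.push(file);
    totals[file] = (totals[file] ?? 0) + amount;
  };

  // Seed every page so clean pages still appear with a full score.
  for (const file of allPages) add(file, 0);
  // Group by full path, so same-slug pages in different folders stay separate.
  for (const r of results) add(r.file, deductionFor(r));

  return order.map((file) => ({
    file,
    name: displayName(file),
    // Assumed formula: start at 100, subtract deductions, floor at 0.
    score: Math.max(0, 100 - (totals[file] ?? 0)),
  }));
}
```

With this shape, `wiki/a/intro.md` and `wiki/b/intro.md` produce two rows that both display as `intro.md`, which is the repeated-name case the note describes.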

New fast-suite evaluator that breaks down the corpus-level health
score into per-page scores with tier bucketing:

- Reuses existing lint rules (runAllLintRules) and scoring model
  (deductionFor) — no new lint rules, no weight changes.
- Tiers: healthy (>=90), adequate (70-89), needs_work (50-69),
  broken (<50).
- Returns perPage (all pages sorted worst-first), worstPages
  (configurable count, default 5), and distribution counts.
- Integrated into the eval pipeline alongside existing metrics.
- Terminal output shows tier distribution and worst pages.

11 tests covering empty wiki, tier boundaries, worstPage ordering,
and parameterized count.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

ethanj left a comment

Contributor

Thanks for this. The core evaluator is close: the per-page scoring idea fits the issue #78 direction, the tier-boundary tests are useful, and reusing the existing health deductions is the right direction.

I’d like to get a few issues fixed before merging:

  1. src/eval/report.ts is based on a stale copy and reverts recently merged formatter fixes. Current main has an ANSI-aware/truncating line() helper and renders source-inventory warnings via warningRows(). This PR replaces that with the older padEnd implementation, removes the terminal warning output, and deletes the regression tests that pinned both behaviors. Please rebase this file onto current main, preserve the existing line() / warningRows() behavior and tests, then add the Page Health section on top.

  2. pageHealthDistribution is required on EvalReport, which breaks rendering old eval history. llmwiki eval report re-renders stored reports, and reports written before this PR won’t have the new field, so formatPageHealthDistribution() can throw when it reads perPage. Please make the field optional/backward-compatible, like citationSupport, and skip the section when it’s absent.

  3. The new metric does not include the current freshness lint rule. llmwiki lint now runs checkStalePages, but runAllLintRules() in src/eval/health.ts maintains a separate lint list and omits it. That means a stale page can show as healthy in the new Page Health distribution. Please either delegate to the canonical lint orchestrator or update the shared lint path so eval health and page health include the same rules as llmwiki lint.
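
For item 2, a backward-compatible shape might look roughly like the sketch below. The field names inside `PageHealthDistribution` and the `string[]` return type of `formatPageHealthDistribution` are assumptions made for illustration; only the field name `pageHealthDistribution`, the function name, and the "optional, skip when absent" behavior come from the review.

```typescript
// Assumed shape; the real PageHealthDistribution type may differ.
interface PageHealthDistribution {
  distribution: {
    healthy: number;
    adequate: number;
    needs_work: number;
    broken: number;
  };
  perPage: Array<{ page: string; score: number }>;
}

interface EvalReport {
  // ...other report fields elided
  // Optional so reports stored before this PR still type-check and render.
  pageHealthDistribution?: PageHealthDistribution;
}

// Returns the section's lines, or nothing when the stored report predates the field.
function formatPageHealthDistribution(report: EvalReport): string[] {
  const ph = report.pageHealthDistribution;
  if (!ph) return []; // old report: skip the section instead of throwing

  const d = ph.distribution;
  return [
    "Page Health",
    `healthy ${d.healthy}  adequate ${d.adequate}  needs_work ${d.needs_work}  broken ${d.broken}`,
  ];
}
```

The early return keeps the guard in one place, so the caller can append the result unconditionally.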

A couple of should-fix items while you’re in there:

  • The Page Health terminal rows are too wide for the 49-char box. Even with truncation restored, the distribution row and worst-page rows will lose useful information. A more compact layout, for example slug + score on one line and issues on a second dimmed line, would be easier to read.
  • The eval pipeline now runs the lint rules twice, once for corpus health and once for page health. If it stays this way for PR size, that’s survivable, but since this is all internal plumbing it would be cleaner to run the lint pass once in runEval() and feed the same results to both consumers.
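
The single-lint-pass suggestion could be sketched as follows. This is only one possible arrangement: the stub `runAllLintRules`, the `...From` helper names, and the synchronous signatures are hypothetical, and the real `runEval()` and `evaluateHealth` APIs are not shown here.

```typescript
// Hypothetical lint result shape and stub lint pass, for illustration only.
interface LintResult {
  file: string;
  rule: string;
}

let lintRuns = 0; // counts passes so the sharing is observable

function runAllLintRules(root: string): LintResult[] {
  lintRuns += 1;
  return [
    { file: `${root}/a.md`, rule: "broken-link" },
    { file: `${root}/a.md`, rule: "missing-title" },
    { file: `${root}/b.md`, rule: "broken-link" },
  ];
}

// Both consumers take precomputed results instead of running lint themselves.
function evaluateHealthFrom(results: LintResult[]) {
  return { issueCount: results.length };
}

function evaluatePageHealthFrom(results: LintResult[]) {
  const seen: Record<string, boolean> = {};
  for (const r of results) seen[r.file] = true;
  return { pagesWithIssues: Object.keys(seen).length };
}

function runEval(root: string) {
  const lintResults = runAllLintRules(root); // the only lint pass
  return {
    health: evaluateHealthFrom(lintResults),
    pageHealth: evaluatePageHealthFrom(lintResults),
  };
}
```

A thin wrapper that calls `runAllLintRules` and then `evaluateHealthFrom` would let the existing `evaluateHealth` entry point keep its current signature for other callers.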

Nits: please use named constants for the 90/70/50 tier cutoffs, and consider avoiding the needs_work vs needsWork mismatch in the JSON shape if it’s still easy to adjust.

The evaluator core looks salvageable; the main thing is to preserve current main behavior and make the new report field backward-compatible.
