
Anti-slop eval harness + candidate gates (fork-internal review)#1

Draft
adewale wants to merge 2 commits into main from claude/code-quality-evals-sMWBI
@adewale adewale commented May 21, 2026

Owner

Scope note (read first)

This is a fork-internal review PR — base is adewale/hallmark:main, not upstream Nutlope/hallmark. It is a sandbox to review the diff and run CI on the fork. It is not intended to be sent upstream as-is: at ~2,600 insertions across 30 files spanning three concerns, it would be a "slop PR" by the good-PR rubric. The upstream-worthy slice is much smaller (see below).

Summary

An eval-driven hillclimb that measures Hallmark against an external slop standard (Impeccable's 37 patterns) and improves the skill where the eval found gaps.

  • evals/detector.mjs — deterministic detector for the CLI-checkable subset of the 37 patterns + existing Hallmark gates (v1 = 37 rules, v2 = 43).
  • evals/run.mjs — merges detector + craft-judge sidecars, computes a cross-fixture order parameter, snapshots each cycle.
  • evals/audit-site.mjs — inlines a page's linked CSS so the detector can score real shipped pages.
  • references/slop-test.md — 15 candidate gates (70–84) added.
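To make "deterministic detector" concrete, here is a minimal sketch of what a CLI-checkable rule table might look like. It is not the code in evals/detector.mjs: the `RULES` table, the `detect` function, and both regexes are assumptions made for illustration. Only the gate numbers (72 for justified text, 79 for glassmorphism) come from this PR.

```javascript
// Hypothetical rule table. The real rules in evals/detector.mjs are not shown
// in this PR description; these two patterns are illustrative guesses.
const RULES = [
  { id: "justified-text", gate: 72, pattern: /text-align\s*:\s*justify/i },
  { id: "glassmorphism", gate: 79, pattern: /backdrop-filter\s*:\s*[^;}]*blur\(/i },
];

// Return the rules whose pattern appears anywhere in the page source.
// The regexes carry no `g` flag, so `test` keeps no state between calls.
function detect(source) {
  return RULES
    .filter((rule) => rule.pattern.test(source))
    .map(({ id, gate }) => ({ id, gate }));
}

console.log(detect(".card { backdrop-filter: saturate(1.2) blur(12px); }"));
```

Because each rule is a pure pattern match, the same input always yields the same findings, which is what lets scores be compared across cycles.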

Result

10-cycle hillclimb. On the v1 eval the fixtures climb from 74.2 to 98.3. The eval is then upgraded to v2 (per "Your Evals Will Break"), which drops the score to 76.4 by exposing tells v1 was blind to (notably hero-float / gate 54). The score then climbs back to 98.7. Full table in evals/results/history.md.

How to run

cd evals
node detector.mjs fixtures/pulse.html --eval v2     # score one page
node run.mjs --cycle 10 --eval v2 --label "..."      # score a cycle
node audit-site.mjs site/examples/bananastudio/index.html  # audit a real page

Honest caveats (good-PR self-review)

  • Self-judged tests. Fixtures are author-written and craft scores are author-assigned — not independent ground truth. The deterministic detector is the trustworthy half.
  • Regex detector. No real CSSOM/contrast engine; known false-positive rate, partly hardened. Multi-theme/component-library stylesheets are flagged low-confidence.
  • Gate overlaps to dedupe before any upstreaming. Gate 84 duplicates existing 9/22/34; gate 80 duplicates 3; gate 83 overlaps 27; gate 71 is largely covered by 2 + 58.
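The regex-detector caveat is easier to weigh with one hardened rule in hand. The sketch below illustrates the kind of narrowing the second commit describes for placeholder-names: match only literal placeholder strings rather than a broad word list. The `PLACEHOLDERS` list and the `placeholderHits` function are hypothetical names, and the actual patterns in the detector may differ.

```javascript
// Hypothetical narrowed rule: flag only literal placeholder names, so ordinary
// marketing words such as "seamless" or "unleash" no longer trip it.
const PLACEHOLDERS = [/\bJane Doe\b/i, /\bAcme\b/i, /\blorem ipsum\b/i];

// Return the first match of each placeholder pattern found in the text,
// in the order the patterns are listed above.
function placeholderHits(text) {
  return PLACEHOLDERS.flatMap((re) => {
    const match = text.match(re);
    return match ? [match[0]] : [];
  });
}

console.log(placeholderHits("Trusted by Acme and Jane Doe"));
```

A word-boundary match like this still cannot tell a real company called Acme from a placeholder, which is one reason the detector's output is best read as a signal rather than a verdict.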

If a slice ever goes upstream

A focused PR of only the genuinely-novel, gap-filling gates the project's own examples trip — glassmorphism (79), justified text (72), skipped headings (73), dark-mode reflex (78), button hierarchy (74) — with no eval framework, no fixtures, no result snapshots. Evidence: bananastudio ships glass + gradient text, hyperlane uses Inter.
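As an illustration of why these gates suit a small, framework-free PR, the skipped-headings check (gate 73) can be sketched in a few lines. This is an assumption about how such a check could work, not the logic in evals/detector.mjs, and `skippedHeadings` is a hypothetical name.

```javascript
// Hypothetical check for gate 73: report every place where a heading jumps
// more than one level deeper than the heading before it (e.g. h1 -> h3).
function skippedHeadings(html) {
  // Collect opening heading tags only; closing tags start with "</" and do not match.
  const levels = [...html.matchAll(/<h([1-6])\b/gi)].map((m) => Number(m[1]));
  const skips = [];
  for (let i = 1; i < levels.length; i++) {
    if (levels[i] > levels[i - 1] + 1) {
      skips.push({ from: levels[i - 1], to: levels[i] });
    }
  }
  return skips;
}

console.log(skippedHeadings("<h1>Title</h1><h3>Detail</h3>"));
```

Moving back up the hierarchy (h3 to h2) is not reported, since only a jump to a deeper level skips a heading.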


Generated by Claude Code

claude added 2 commits May 21, 2026 10:32
Build a deterministic slop detector grounded in Impeccable's 37 patterns
plus Hallmark's own gates, score self-contained fixtures across genres, and
run a 10-cycle eval-driven hillclimb.

Phase 1 (v1, cycles 1-5): close gaps the detector found, adding gates 70-77
to references/slop-test.md; fixtures climb 74.2 -> 98.3.

Cycle 6: per "Your Evals Will Break", upgrade the eval to v2 -- six new
detector rules (incl. hero-float/gate 54 the v1-perfect fixtures had been
violating), a cross-fixture order parameter (macrostructure reuse), and two
adversarial fixtures. Score honestly drops to 76.4.

Phase 2 (v2, cycles 7-10): add gates 78-84 and climb back to 98.7, resisting
a dark/neon/metric-hero brief. The skill gained 15 gates motivated by what
the eval could measure. Full curve in evals/results/history.md.
Auditing the in-repo Hallmark corpus (homepage + examples) surfaced a real
false-positive rate. Fix the worst offenders so the signal is trustworthy:

- placeholder-names: only flag actual placeholder names (Jane Doe, Acme,
  lorem ipsum), not ordinary words like "seamless"/"unleash" in prose.
- ai-palette: require the violet->cyan *ramp*, not a single deliberate brand
  hue, so a midnight-violet brand is no longer flagged.
- font counting: count a monospace family toward the budget only when used
  outside code (per gate 39); stop counting unused --font-mono tokens.
- multi-theme scoping: resolve tokens from the active [data-theme] only, and
  label 22-theme / component-library stylesheets low-confidence instead of
  scoring them as one page.

evals/audit-site.mjs inlines a page's linked stylesheets so the detector can
score real shipped pages. Fixtures unchanged (all still 5.00/5 on v1 and v2);
true positives (Inter in hyperlane, gradient text in bananastudio, "Acme" in
tally) are retained while the false positives are removed.
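The inlining step described above can be sketched as a pure string transform. This is a minimal sketch under stated assumptions, not the contents of evals/audit-site.mjs: `inlineStylesheets` is a hypothetical name, and the `stylesheets` Map (href to CSS text) stands in for whatever file reading the real script does.

```javascript
// Hypothetical helper: replace each <link rel="stylesheet"> tag with a <style>
// block holding that stylesheet's text, so a single-file detector sees the CSS.
// `stylesheets` is a Map from href to CSS text the caller has already loaded.
function inlineStylesheets(html, stylesheets) {
  return html.replace(/<link\b[^>]*>/gi, (tag) => {
    // Leave non-stylesheet links (icons, preloads) untouched.
    if (!/rel\s*=\s*["']?stylesheet/i.test(tag)) return tag;
    const href = /href\s*=\s*["']([^"']+)["']/i.exec(tag);
    // Leave the tag in place when the stylesheet was not provided.
    if (!href || !stylesheets.has(href[1])) return tag;
    return `<style>\n${stylesheets.get(href[1])}\n</style>`;
  });
}

const page = '<link rel="stylesheet" href="a.css"><link rel="icon" href="f.ico">';
console.log(inlineStylesheets(page, new Map([["a.css", "p{color:red}"]])));
```

Using a replacer function rather than a replacement string means any `$` characters in the CSS are inserted literally.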