Anti-slop eval harness + candidate gates (fork-internal review) by adewale · Pull Request #1 · adewale/hallmark

adewale · 2026-05-21T16:59:53Z

Scope note (read first)

This is a fork-internal review PR — base is adewale/hallmark:main, not upstream Nutlope/hallmark. It is a sandbox to review the diff and run CI on the fork. It is not intended to be sent upstream as-is: at ~2,600 insertions across 30 files spanning three concerns, it would be a "slop PR" by the good-PR rubric. The upstream-worthy slice is much smaller (see below).

Summary

An eval-driven hillclimb that measures Hallmark against an external slop standard (Impeccable's 37 patterns) and improves the skill where the eval found gaps.

evals/detector.mjs — deterministic detector for the CLI-checkable subset of the 37 patterns + existing Hallmark gates (v1 = 37 rules, v2 = 43).
evals/run.mjs — merges detector + craft-judge sidecars, computes a cross-fixture order parameter, snapshots each cycle.
evals/audit-site.mjs — inlines a page's linked CSS so the detector can score real shipped pages.
references/slop-test.md — 15 candidate gates (70–84) added.
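The inlining step described for evals/audit-site.mjs can be pictured roughly as follows. This is a minimal sketch, not the script's actual code: the function name `inlineLinkedCss`, the `readCss` callback, and the `data-inlined-from` attribute are all assumptions made for illustration.

```javascript
// Hypothetical sketch, NOT the real evals/audit-site.mjs.
// Replaces <link rel="stylesheet" href="..."> tags with inline <style> blocks
// so a single-file detector can see the page's CSS.
// `readCss` is an assumed callback that maps an href to stylesheet text (or null).
function inlineLinkedCss(html, readCss) {
  const linkTag = /<link\b[^>]*>/gi;
  return html.replace(linkTag, (tag) => {
    const isStylesheet = /\brel\s*=\s*["']?stylesheet["']?/i.test(tag);
    const href = /\bhref\s*=\s*["']([^"']+)["']/i.exec(tag);
    // Leave icons, preloads, and anything without an href untouched.
    if (!isStylesheet || !href) return tag;
    const css = readCss(href[1]);
    // If the stylesheet cannot be resolved, keep the original tag.
    if (css == null) return tag;
    return `<style data-inlined-from="${href[1]}">\n${css}\n</style>`;
  });
}

// Usage with an in-memory stand-in for the file system.
const page =
  '<head><link rel="icon" href="f.ico"><link rel="stylesheet" href="site.css"></head>';
const sheets = { 'site.css': 'body { text-align: justify; }' };
const inlined = inlineLinkedCss(page, (href) => sheets[href] ?? null);
console.log(inlined);
```

In a real run, `readCss` would presumably read the file relative to the page's directory; the in-memory map here only keeps the sketch self-contained.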

Result

10-cycle hillclimb: v1 fixtures 74.2 → 98.3, then the eval is upgraded to v2 (per "Your Evals Will Break"), which drops the score to 76.4 by exposing tells v1 was blind to (notably hero-float / gate 54), then climbs back to 98.7. Full table in evals/results/history.md.

How to run

cd evals
node detector.mjs fixtures/pulse.html --eval v2     # score one page
node run.mjs --cycle 10 --eval v2 --label "..."      # score a cycle
node audit-site.mjs site/examples/bananastudio/index.html  # audit a real page

Honest caveats (good-PR self-review)

Self-judged tests. Fixtures are author-written and craft scores are author-assigned — not independent ground truth. The deterministic detector is the trustworthy half.
Regex detector. No real CSSOM/contrast engine; known false-positive rate, partly hardened. Multi-theme/component-library stylesheets are flagged low-confidence.
Gate overlaps to dedupe before any upstreaming. Gate 84 duplicates existing 9/22/34; gate 80 duplicates 3; gate 83 overlaps 27; gate 71 is largely covered by 2 + 58.

If a slice ever goes upstream

A focused PR of only the genuinely-novel, gap-filling gates the project's own examples trip — glassmorphism (79), justified text (72), skipped headings (73), dark-mode reflex (78), button hierarchy (74) — with no eval framework, no fixtures, no result snapshots. Evidence: bananastudio ships glass + gradient text, hyperlane uses Inter.
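Two of the gates named above, skipped headings (73) and justified text (72), lend themselves to deterministic checks. The sketch below shows one way such rules could look; the function names, rule ids, and finding shape are assumptions and are not taken from evals/detector.mjs. Like the detector described in the caveats, it is regex-based and will miss cases a real HTML or CSS parser would catch.

```javascript
// Hypothetical rule sketches; names and return shapes are illustrative only.

// Flags a heading that jumps more than one level deeper than the previous one
// (for example h1 followed directly by h3). Moving back up is allowed.
function findSkippedHeadings(html) {
  const findings = [];
  let previous = 0; // 0 means "no heading seen yet"
  for (const match of html.matchAll(/<h([1-6])\b/gi)) {
    const level = Number(match[1]);
    if (previous !== 0 && level > previous + 1) {
      findings.push({ rule: 'skipped-heading', from: previous, to: level, index: match.index });
    }
    previous = level;
  }
  return findings;
}

// Flags every `text-align: justify` declaration in a stylesheet string.
function findJustifiedText(css) {
  return [...css.matchAll(/text-align\s*:\s*justify\b/gi)].map((m) => ({
    rule: 'justified-text',
    index: m.index,
  }));
}

const headingFindings = findSkippedHeadings('<h1>A</h1><h3>B</h3><h4>C</h4><h2>D</h2>');
const justifyFindings = findJustifiedText('p { text-align: justify; } h1 { text-align: left; }');
console.log(headingFindings, justifyFindings);
```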

Generated by Claude Code

Build a deterministic slop detector grounded in Impeccable's 37 patterns plus Hallmark's own gates, score self-contained fixtures across genres, and run a 10-cycle eval-driven hillclimb.

Phase 1 (v1, cycles 1-5): close gaps the detector found, adding gates 70-77 to references/slop-test.md; fixtures climb 74.2 -> 98.3.

Cycle 6: per "Your Evals Will Break", upgrade the eval to v2 -- six new detector rules (incl. hero-float/gate 54 the v1-perfect fixtures had been violating), a cross-fixture order parameter (macrostructure reuse), and two adversarial fixtures. Score honestly drops to 76.4.

Phase 2 (v2, cycles 7-10): add gates 78-84 and climb back to 98.7, resisting a dark/neon/metric-hero brief.

The skill gained 15 gates motivated by what the eval could measure. Full curve in evals/results/history.md.
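The commit message does not say how the cross-fixture order parameter is computed. One plausible reading of "macrostructure reuse" is the fraction of fixture pairs that share the same top-level section sequence. The sketch below illustrates that reading only; the `sections` field, the function name, and the metric itself are assumptions, not the definition used in evals/run.mjs.

```javascript
// Hypothetical metric: share of fixture pairs with an identical section sequence.
// 0 means every fixture has its own macrostructure; 1 means all fixtures reuse one.
function macrostructureReuse(fixtures) {
  const signatures = fixtures.map((f) => f.sections.join('>'));
  let pairs = 0;
  let same = 0;
  for (let i = 0; i < signatures.length; i++) {
    for (let j = i + 1; j < signatures.length; j++) {
      pairs++;
      if (signatures[i] === signatures[j]) same++;
    }
  }
  return pairs === 0 ? 0 : same / pairs;
}

// Illustrative input; the fixture names and section labels are made up.
const reuse = macrostructureReuse([
  { name: 'a', sections: ['hero', 'features', 'pricing', 'footer'] },
  { name: 'b', sections: ['hero', 'features', 'pricing', 'footer'] },
  { name: 'c', sections: ['masthead', 'essay', 'footer'] },
]);
console.log(reuse); // one matching pair out of three
```

A metric of this kind penalises a set of fixtures that each pass individually but all share one template, which single-page rules cannot see.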

Auditing the in-repo Hallmark corpus (homepage + examples) surfaced a real false-positive rate. Fix the worst offenders so the signal is trustworthy:
- placeholder-names: only flag actual placeholder names (Jane Doe, Acme, lorem ipsum), not ordinary words like "seamless"/"unleash" in prose.
- ai-palette: require the violet->cyan *ramp*, not a single deliberate brand hue, so a midnight-violet brand is no longer flagged.
- font counting: count a monospace family toward the budget only when used outside code (per gate 39); stop counting unused --font-mono tokens.
- multi-theme scoping: resolve tokens from the active [data-theme] only, and label 22-theme / component-library stylesheets low-confidence instead of scoring them as one page.

evals/audit-site.mjs inlines a page's linked stylesheets so the detector can score real shipped pages. Fixtures are unchanged (all still 5.00/5 on v1 and v2); true positives (Inter in hyperlane, gradient text in bananastudio, "Acme" in tally) are retained while the false positives are removed.
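The narrowed placeholder-names rule can be illustrated with a small allow-list of literal placeholders. This is a sketch under stated assumptions: the list holds only the three examples named in the commit message, and the function name and return shape are hypothetical, not the detector's actual code.

```javascript
// Hypothetical sketch of the narrowed rule: match only literal placeholder
// names, so marketing words such as "seamless" or "unleash" are not flagged.
const PLACEHOLDER_NAMES = [
  { label: 'Jane Doe', pattern: /\bjane doe\b/i },
  { label: 'Acme', pattern: /\bacme\b/i },
  { label: 'lorem ipsum', pattern: /\blorem ipsum\b/i },
];

// Returns the labels of every placeholder found in the visible text.
function findPlaceholderNames(text) {
  return PLACEHOLDER_NAMES.filter(({ pattern }) => pattern.test(text)).map(({ label }) => label);
}

const flagged = findPlaceholderNames('Trusted by Acme and Jane Doe.');
const clean = findPlaceholderNames('A seamless way to unleash your team.');
console.log(flagged, clean);
```

The patterns deliberately omit the `g` flag so that repeated `test` calls carry no `lastIndex` state between pages.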

claude added 2 commits May 21, 2026 10:32

adewale wants to merge 2 commits into main from claude/code-quality-evals-sMWBI

2 participants