Add eval-driven quality harness for Hallmark outputs by adewale · Pull Request #12 · Nutlope/hallmark

adewale · 2026-06-01T10:24:38Z

What

Adds an eval-driven quality harness for Hallmark outputs, plus the slop-test gates that came out of the eval hillclimb.

Included pieces:

deterministic detector: evals/detector.mjs
shared scoring core: evals/core.mjs
non-mutating local check: node evals/check.mjs
snapshot-writing runner: evals/run.mjs
real-site audit adapter: evals/audit-site.mjs
v1 and v2 fixture sets with frozen judge sidecars
eval results/history documenting the hillclimb
eval-derived slop-test gates 70-84
ignored cache path for real-site audit snapshots: evals/.site-cache/

Why

Hallmark's promise is that generated UI should look made, not generated. The existing checklist is useful, but it is hard to know whether new guidance actually improves outputs without a repeatable measurement loop.

This PR adds that loop. v1 covers deterministic checks mapped to the existing anti-slop standard. v2 adds adversarial fixtures and a cross-fixture structure/order-parameter check so the eval does not only reward one-page cleanliness while missing template reuse across outputs.
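One way to picture the cross-fixture check: reduce each output to an ordered signature of its top-level sections, then measure how much of that ordering is shared between outputs. The sketch below is an illustration only. `macroSignature`, `structureReuse`, and the bigram/Jaccard measure are assumptions made for this example, not the contents of evals/core.mjs.

```javascript
// Hypothetical sketch of a cross-fixture "order parameter".
// None of these names are taken from the harness itself.

// Reduce a page to the ordered list of its landmark sections,
// e.g. ["header.hero", "section.features", "footer"].
function macroSignature(html) {
  const tagRe = /<(header|nav|main|section|article|aside|footer)\b([^>]*)>/gi;
  const signature = [];
  for (const match of html.matchAll(tagRe)) {
    const tag = match[1].toLowerCase();
    const classMatch = /class="([^"]*)"/i.exec(match[2]);
    const firstClass = classMatch ? classMatch[1].trim().split(/\s+/)[0] : "";
    signature.push(firstClass ? `${tag}.${firstClass}` : tag);
  }
  return signature;
}

// Adjacent pairs capture ordering, not just which sections exist.
function bigrams(signature) {
  const grams = new Set();
  for (let i = 0; i < signature.length - 1; i++) {
    grams.add(`${signature[i]}>${signature[i + 1]}`);
  }
  return grams;
}

// Jaccard similarity of two sets; two empty sets count as no overlap.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  for (const item of a) {
    if (b.has(item)) shared++;
  }
  return shared / (a.size + b.size - shared);
}

// Mean pairwise similarity across all outputs: 0 = all distinct, 1 = one template.
function structureReuse(pages) {
  const grams = pages.map((html) => bigrams(macroSignature(html)));
  let total = 0;
  let pairs = 0;
  for (let i = 0; i < grams.length; i++) {
    for (let j = i + 1; j < grams.length; j++) {
      total += jaccard(grams[i], grams[j]);
      pairs++;
    }
  }
  return pairs === 0 ? 0 : total / pairs;
}
```

A value near 1 means the outputs share the same section ordering (template reuse); a value near 0 means each page is structured differently. How the harness maps its own measure onto the 5-point structure score is not shown here.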

How to read the eval

The detector is a guardrail, not an oracle.
The judge sidecars are frozen JSON inputs; the check command does not call an LLM.
node evals/check.mjs is non-mutating and intended for local/CI regression checks.
node evals/run.mjs ... is the mutating command that writes cycle snapshots/history.
node evals/audit-site.mjs ... is heuristic; it inlines local linked CSS, follows local @imports, and writes ignored snapshots under evals/.site-cache/.
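The "follows local @imports" step can be sketched as a recursive text substitution. This is a minimal illustration under stated assumptions, not the code in evals/audit-site.mjs: `inlineImports`, `isRemote`, and the injected `readText` callback are hypothetical names, and the regular expression covers only the common `@import "x.css";` and `@import url(x.css);` forms.

```javascript
// Hypothetical sketch: inline local @import rules, leave remote ones alone.

// Treat scheme-qualified or protocol-relative references as remote.
function isRemote(ref) {
  return /^(?:[a-z][a-z0-9+.-]*:)?\/\//i.test(ref);
}

// fileUrl: absolute URL of the stylesheet (e.g. "file:///site/main.css").
// readText: caller-supplied function returning the CSS text for a URL.
// seen: URLs already inlined, so import cycles terminate.
function inlineImports(fileUrl, readText, seen = new Set()) {
  if (seen.has(fileUrl)) return "";
  seen.add(fileUrl);

  const importRe = /@import\s+(?:url\(\s*)?["']?([^"')\s;]+)["']?\s*\)?[^;]*;/g;
  const css = readText(fileUrl);

  return css.replace(importRe, (statement, ref) => {
    if (isRemote(ref)) return statement; // keep remote imports as written
    const target = new URL(ref, fileUrl).href; // resolve relative to the importing file
    return inlineImports(target, readText, seen);
  });
}
```

In Node, `readText` could wrap `fs.readFileSync(new URL(url), "utf8")`; injecting it keeps the function free of filesystem access and easy to exercise against an in-memory map.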

Results

The committed history documents the hillclimb:

v1 baseline to saturation: 74.2 -> 98.3
v2 adversarial break: 98.3 -> 76.4
v2 recovery: 76.4 -> 98.7

Current non-mutating check:

PASS v1: 98.5/100   rules 37   fixtures 3
PASS v2: 98.7/100   rules 43   fixtures 5   structure 5.00/5

Testing

node --check evals/detector.mjs
node --check evals/core.mjs
node --check evals/check.mjs
node --check evals/run.mjs
node --check evals/audit-site.mjs
node evals/check.mjs

Risk

Detector rules are heuristic and can produce false positives.
The eval fixtures are not a substitute for visual review.
The slop-test gates make Hallmark more opinionated.

Mitigations:

The check is local and deterministic.
Real-site audit output is advisory.
The gates are documented as pre-emit questions, not runtime enforcement.

Build a deterministic slop detector grounded in Impeccable's 37 patterns plus Hallmark's own gates, score self-contained fixtures across genres, and run a 10-cycle eval-driven hillclimb.

Phase 1 (v1, cycles 1-5): close gaps the detector found, adding gates 70-77 to references/slop-test.md; fixtures climb 74.2 -> 98.3.

Cycle 6: per "Your Evals Will Break", upgrade the eval to v2 -- six new detector rules (incl. hero-float/gate 54, which the v1-perfect fixtures had been violating), a cross-fixture order parameter (macrostructure reuse), and two adversarial fixtures. Score honestly drops to 76.4.

Phase 2 (v2, cycles 7-10): add gates 78-84 and climb back to 98.7, resisting a dark/neon/metric-hero brief.

The skill gained 15 gates motivated by what the eval could measure. Full curve in evals/results/history.md.

Auditing the in-repo Hallmark corpus (homepage + examples) surfaced a real false-positive rate. Fix the worst offenders so the signal is trustworthy:

- placeholder-names: only flag actual placeholder names (Jane Doe, Acme, lorem ipsum), not ordinary words like "seamless"/"unleash" in prose.
- ai-palette: require the violet->cyan *ramp*, not a single deliberate brand hue, so a midnight-violet brand is no longer flagged.
- font counting: count a monospace family toward the budget only when used outside code (per gate 39); stop counting unused --font-mono tokens.
- multi-theme scoping: resolve tokens from the active [data-theme] only, and label 22-theme / component-library stylesheets low-confidence instead of scoring them as one page.

evals/audit-site.mjs inlines a page's linked stylesheets so the detector can score real shipped pages. Fixtures unchanged (all still 5.00/5 on v1 and v2); true positives (Inter in hyperlane, gradient text in bananastudio, "Acme" in tally) are retained while the false positives are removed.
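The first two fixes narrow what counts as a hit. A rough sketch of the idea follows, assuming hypothetical names (`findPlaceholderNames`, `isAiPaletteRamp`) and assumed hue bands; the actual rules and thresholds in evals/detector.mjs are not shown in this PR description.

```javascript
// Hypothetical sketch of two tightened detector rules.

// placeholder-names: match only the literal placeholder names cited above,
// so ordinary marketing words in prose are never flagged by this rule.
const PLACEHOLDER_PATTERNS = [/\bJane Doe\b/i, /\bAcme\b/i, /\blorem ipsum\b/i];

function findPlaceholderNames(text) {
  const hits = [];
  for (const pattern of PLACEHOLDER_PATTERNS) {
    const match = pattern.exec(text);
    if (match) hits.push(match[0]);
  }
  return hits;
}

// ai-palette: flag only when a gradient spans both a violet and a cyan hue.
// Hue bands are in degrees and are assumptions made for this example.
const VIOLET_BAND = [250, 290];
const CYAN_BAND = [170, 200];

function inBand(hue, [low, high]) {
  return hue >= low && hue <= high;
}

function isAiPaletteRamp(gradientHues) {
  const hasViolet = gradientHues.some((hue) => inBand(hue, VIOLET_BAND));
  const hasCyan = gradientHues.some((hue) => inBand(hue, CYAN_BAND));
  return hasViolet && hasCyan;
}
```

Under these assumptions a single violet brand hue passes, while a gradient running from violet to cyan is still reported.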

vercel · 2026-06-01T10:24:42Z

@adewale is attempting to deploy a commit to the Together AI Team on Vercel.

A member of the Team first needs to authorize it.

claude and others added 3 commits May 21, 2026 10:32

Add non-mutating eval check command

c7ea09a

adewale wants to merge 3 commits into Nutlope:main from adewale:pr/eval-quality-harness