Add eval-driven quality harness for Hallmark outputs #12
Open
adewale wants to merge 3 commits into
Build a deterministic slop detector grounded in Impeccable's 37 patterns plus Hallmark's own gates, score self-contained fixtures across genres, and run a 10-cycle eval-driven hillclimb.

Phase 1 (v1, cycles 1-5): close gaps the detector found, adding gates 70-77 to references/slop-test.md; fixtures climb 74.2 -> 98.3.

Cycle 6: per "Your Evals Will Break", upgrade the eval to v2 -- six new detector rules (incl. hero-float/gate 54, which the v1-perfect fixtures had been violating), a cross-fixture order parameter (macrostructure reuse), and two adversarial fixtures. Score honestly drops to 76.4.

Phase 2 (v2, cycles 7-10): add gates 78-84 and climb back to 98.7, resisting a dark/neon/metric-hero brief.

The skill gained 15 gates motivated by what the eval could measure. Full curve in evals/results/history.md.
Auditing the in-repo Hallmark corpus (homepage + examples) surfaced a real false-positive rate. Fix the worst offenders so the signal is trustworthy:
- placeholder-names: only flag actual placeholder names (Jane Doe, Acme, lorem ipsum), not ordinary words like "seamless"/"unleash" in prose.
- ai-palette: require the violet->cyan *ramp*, not a single deliberate brand hue, so a midnight-violet brand is no longer flagged.
- font counting: count a monospace family toward the budget only when used outside code (per gate 39); stop counting unused --font-mono tokens.
- multi-theme scoping: resolve tokens from the active [data-theme] only, and label 22-theme / component-library stylesheets low-confidence instead of scoring them as one page.

evals/audit-site.mjs inlines a page's linked stylesheets so the detector can score real shipped pages.

Fixtures are unchanged (all still 5.00/5 on v1 and v2). True positives (Inter in hyperlane, gradient text in bananastudio, "Acme" in tally) are retained while the false positives are removed.
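The two narrowed rules above (placeholder-names and ai-palette) can be sketched roughly as follows. This is a minimal illustration, not the code in evals/detector.mjs: the function names `findPlaceholderNames` and `hasVioletCyanRamp`, the pattern list, and the hue bands are all assumptions made for the example.

```javascript
// Hypothetical sketch; names and thresholds are illustrative assumptions,
// not taken from evals/detector.mjs.

// Only literal placeholder names are flagged. Ordinary marketing words
// such as "seamless" or "unleash" are deliberately absent from this list.
const PLACEHOLDER_PATTERNS = [/\bJane Doe\b/i, /\bAcme\b/i, /\blorem ipsum\b/i];

function findPlaceholderNames(text) {
  const hits = [];
  for (const pattern of PLACEHOLDER_PATTERNS) {
    const match = text.match(pattern);
    if (match) hits.push(match[0]);
  }
  return hits;
}

// Assumed hue bands in degrees on the HSL wheel; the real bounds may differ.
const VIOLET_BAND = [250, 290];
const CYAN_BAND = [170, 200];

const inBand = (hue, [lo, hi]) => hue >= lo && hue <= hi;

// A ramp needs BOTH ends present, so a single violet brand hue passes.
function hasVioletCyanRamp(gradientStopHues) {
  const hasViolet = gradientStopHues.some((h) => inBand(h, VIOLET_BAND));
  const hasCyan = gradientStopHues.some((h) => inBand(h, CYAN_BAND));
  return hasViolet && hasCyan;
}

console.log(findPlaceholderNames("Acme helps teams ship."));           // [ 'Acme' ]
console.log(findPlaceholderNames("A seamless way to unleash ideas.")); // []
console.log(hasVioletCyanRamp([270]));      // false
console.log(hasVioletCyanRamp([270, 185])); // true
```

Requiring both ends of the ramp is what keeps a deliberate single-hue brand out of the findings while still catching the gradient.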
@adewale is attempting to deploy a commit to the Together AI Team on Vercel. A member of the Team first needs to authorize it.
This was referenced Jun 1, 2026
What
Adds an eval-driven quality harness for Hallmark outputs, plus the slop-test gates that came out of the eval hillclimb.
Included pieces:
- evals/detector.mjs
- evals/core.mjs
- node evals/check.mjs
- evals/run.mjs
- evals/audit-site.mjs
- evals/.site-cache/

Why
Hallmark's promise is that generated UI should look made, not generated. The existing checklist is useful, but it is hard to know whether new guidance actually improves outputs without a repeatable measurement loop.
This PR adds that loop. v1 covers deterministic checks mapped to the existing anti-slop standard. v2 adds adversarial fixtures and a cross-fixture structure/order-parameter check so the eval does not only reward one-page cleanliness while missing template reuse across outputs.
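One way to picture a cross-fixture order parameter is as the share of fixture pairs that reuse the same section sequence. The sketch below assumes each fixture has already been reduced to an ordered list of section labels; the function name `macrostructureReuse`, the labels, and the pairwise definition are hypothetical and may differ from what evals/core.mjs actually computes.

```javascript
// Hypothetical sketch of a macrostructure-reuse measure.
// Input: one array of section labels per fixture, in page order.
// Output: fraction of fixture pairs whose section sequence is identical
// (0 = every page is structured differently, 1 = one template everywhere).
function macrostructureReuse(fixtures) {
  let pairs = 0;
  let identical = 0;
  for (let i = 0; i < fixtures.length; i++) {
    for (let j = i + 1; j < fixtures.length; j++) {
      pairs += 1;
      if (fixtures[i].join(">") === fixtures[j].join(">")) identical += 1;
    }
  }
  return pairs === 0 ? 0 : identical / pairs;
}

// Illustrative labels only; real fixtures would be parsed from HTML.
const outputs = [
  ["hero", "features", "pricing", "footer"],
  ["hero", "features", "pricing", "footer"],
  ["masthead", "article", "footer"],
];
console.log(macrostructureReuse(outputs)); // 1 of 3 pairs match
```

A per-page score cannot see this kind of repetition, since each page may be clean on its own while the set as a whole shares one template.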
How to read the eval
- node evals/check.mjs is non-mutating and intended for local/CI regression checks.
- node evals/run.mjs ... is the mutating command that writes cycle snapshots/history.
- node evals/audit-site.mjs ... is heuristic; it inlines local linked CSS, follows local @imports, and writes ignored snapshots under evals/.site-cache/.

Results
The committed history documents the hillclimb:
- Phase 1 (v1, cycles 1-5): 74.2 -> 98.3
- Cycle 6 (upgrade to v2): 98.3 -> 76.4
- Phase 2 (v2, cycles 7-10): 76.4 -> 98.7

Current non-mutating check:
Testing
Risk
Mitigations: