Improve Najm example against eval audit by adewale · Pull Request #14 · Nutlope/hallmark

adewale · 2026-06-01T10:25:35Z

Draft stacked PR after #12 and #13.

Because I do not have write access to create stack base branches in Nutlope/hallmark, this PR targets main. Until the parent PRs merge, GitHub's default Files changed view will include parent diffs. The intended child diff is:

adewale/hallmark@pr/tally-eval-audit...pr/najm-eval-audit

What

Improves the Najm example using the eval audit from #12 while preserving the Moroccan fashion/drop direction.

Targeted changes:

- switches body copy away from overused Inter to Source Sans 3 while keeping Bricolage Grotesque for display
- removes glassmorphism from the sticky nav and product tag
- tokenizes raw OKLCH gradients in product/editorial surfaces
- fixes low-contrast accent-filled badges/buttons detected by the audit
- removes emoji/star glyphs used as decorative feature markers
- changes the cart drawer from a modal dialog role to a labelled region, matching its non-focus-trapping drawer behavior
- fixes skipped footer heading levels
- adds root overflow-x: clip
- keeps announcement content visible on mobile instead of hiding it
- removes rote hover-scale transforms
- adds disabled-state coverage
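
To make one of these findings concrete, here is a minimal sketch of how a skipped-heading check along the lines of layout-skipped-heading could work. It is not the detector in evals/: the function name findSkippedHeadings, the regex-based scan, and the sample footer markup are all assumptions made for illustration.

```javascript
// Hypothetical helper, not part of the Hallmark eval harness.
// Scans an HTML string for <h1>..<h6> opening tags and reports every heading
// that jumps more than one level deeper than the heading before it.
function findSkippedHeadings(html) {
  const levels = [...html.matchAll(/<h([1-6])\b/gi)].map((m) => Number(m[1]));
  const skips = [];
  for (let i = 1; i < levels.length; i++) {
    if (levels[i] > levels[i - 1] + 1) {
      skips.push({ from: levels[i - 1], to: levels[i], index: i });
    }
  }
  return skips;
}

// Illustrative markup only; the real Najm footer may differ.
const footerBefore = "<h2>Shop</h2><footer><h4>Help</h4><h4>Legal</h4></footer>";
const footerAfter = "<h2>Shop</h2><footer><h3>Help</h3><h3>Legal</h3></footer>";

console.log(findSkippedHeadings(footerBefore).length); // 1 (h2 followed by h4)
console.log(findSkippedHeadings(footerAfter).length); // 0
```

A regex scan is enough for a sketch like this because only the order of heading levels matters. A production check would more plausibly walk a parsed DOM.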

Why

Najm was another low-scoring real example in the audit. This PR is the second proof-of-value pass: the eval harness can identify concrete, fixable issues in existing examples without requiring a wholesale redesign.

Audit evidence

Before this PR:

examples/najm/  v1 3.12  v2 3.27
findings: type-overused-font, color-token-discipline, color-ink-on-ink,
visual-glassmorphism, layout-center-everything, layout-arbitrary-spacing,
layout-skipped-heading, motion-hover-scale, interaction-emoji-icon,
interaction-modal-reflex, responsive-overflow-clip,
responsive-feature-amputation, general-state-coverage

After this PR:

examples/najm/  v1 5.00  v2 5.00
findings: —

Testing

node evals/check.mjs
node evals/audit-site.mjs site/examples/najm/index.html

Observed:

PASS v1: 98.5/100   rules 37   fixtures 3
PASS v2: 98.7/100   rules 43   fixtures 5   structure 5.00/5

examples/najm/  v1 5.00  v2 5.00   —

Risk

This touches a visual example, so browser review is still needed. The largest semantic change is the cart drawer role: the drawer was labelled as modal, but the implementation does not trap focus, so role="region" better matches the current behavior.
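
The role decision can be sketched as a small function. This is an illustration under stated assumptions, not code from the Najm example: drawerAttributes, trapsFocus, and the "Cart" label are hypothetical names chosen for the sketch.

```javascript
// Hypothetical helper, not taken from site/examples/najm.
// Picks ARIA attributes for a drawer based on what the implementation does.
function drawerAttributes({ label, trapsFocus }) {
  if (trapsFocus) {
    // Only claim modality when focus is actually contained in the drawer.
    return { role: "dialog", "aria-modal": "true", "aria-label": label };
  }
  // A non-trapping drawer is exposed as a named landmark region instead.
  return { role: "region", "aria-label": label };
}

console.log(drawerAttributes({ label: "Cart", trapsFocus: false }));
// { role: 'region', 'aria-label': 'Cart' }
```

If the drawer later gains a focus trap and dismiss-on-Escape handling, switching back to a modal dialog role would be the matching change.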

Build a deterministic slop detector grounded in Impeccable's 37 patterns plus Hallmark's own gates, score self-contained fixtures across genres, and run a 10-cycle eval-driven hillclimb.

Phase 1 (v1, cycles 1-5): close gaps the detector found, adding gates 70-77 to references/slop-test.md; fixtures climb 74.2 -> 98.3.

Cycle 6: per "Your Evals Will Break", upgrade the eval to v2 -- six new detector rules (incl. hero-float/gate 54 the v1-perfect fixtures had been violating), a cross-fixture order parameter (macrostructure reuse), and two adversarial fixtures. Score honestly drops to 76.4.

Phase 2 (v2, cycles 7-10): add gates 78-84 and climb back to 98.7, resisting a dark/neon/metric-hero brief.

The skill gained 15 gates motivated by what the eval could measure. Full curve in evals/results/history.md.

Auditing the in-repo Hallmark corpus (homepage + examples) surfaced a real false-positive rate. Fix the worst offenders so the signal is trustworthy:

- placeholder-names: only flag actual placeholder names (Jane Doe, Acme, lorem ipsum), not ordinary words like "seamless"/"unleash" in prose.
- ai-palette: require the violet->cyan *ramp*, not a single deliberate brand hue, so a midnight-violet brand is no longer flagged.
- font counting: count a monospace family toward the budget only when used outside code (per gate 39); stop counting unused --font-mono tokens.
- multi-theme scoping: resolve tokens from the active [data-theme] only, and label 22-theme / component-library stylesheets low-confidence instead of scoring them as one page.

evals/audit-site.mjs inlines a page's linked stylesheets so the detector can score real shipped pages.

Fixtures unchanged (all still 5.00/5 on v1 and v2); true positives (Inter in hyperlane, gradient text in bananastudio, "Acme" in tally) are retained while the false positives are removed.
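
As a rough illustration of the narrowed placeholder-names rule, the sketch below flags only the three placeholder names listed in this commit message and ignores ordinary marketing words. The real rule in the detector may use a different list and matching strategy; PLACEHOLDER_NAMES and findPlaceholderNames are hypothetical names.

```javascript
// Hypothetical sketch, not the actual rule in the Hallmark detector.
// Matches only the placeholder names cited in the commit message,
// case-insensitively and on word boundaries.
const PLACEHOLDER_NAMES = /\b(?:jane doe|acme|lorem ipsum)\b/gi;

function findPlaceholderNames(text) {
  // String.prototype.match with a global regex returns every match, or null.
  return text.match(PLACEHOLDER_NAMES) ?? [];
}

console.log(findPlaceholderNames("A seamless way to unleash your team."));
// []
console.log(findPlaceholderNames("Trusted by Acme. Jane Doe, CEO"));
// [ 'Acme', 'Jane Doe' ]
```

An explicit allowlist of known placeholder strings keeps the rule from firing on prose that merely sounds generic, which is the false positive this commit describes.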

vercel · 2026-06-01T10:25:40Z

@adewale is attempting to deploy a commit to the Together AI Team on Vercel.

A member of the Team first needs to authorize it.

claude and others added 5 commits May 21, 2026 10:32

Add non-mutating eval check command

c7ea09a

Improve Tally example against eval audit

e188d00

Improve Najm example against eval audit

e8fa5c6

This was referenced Jun 1, 2026

Add eval-driven quality harness for Hallmark outputs adewale/hallmark#2

Closed

Improve Tally example against eval audit adewale/hallmark#3

Closed

Improve Najm example against eval audit adewale/hallmark#4

Closed

adewale wants to merge 5 commits into Nutlope:main from adewale:pr/najm-eval-audit
