Improve Tally example against eval audit #13

Draft
adewale wants to merge 4 commits into Nutlope:main from adewale:pr/tally-eval-audit

adewale commented Jun 1, 2026

Draft stacked PR after #12.

Because I do not have write access to create stack base branches in Nutlope/hallmark, this PR targets main. Until #12 merges, GitHub's default Files changed view will include the parent eval-harness diff. The intended child diff is:

adewale/hallmark@pr/eval-quality-harness...pr/tally-eval-audit

What

Improves the Tally example using the eval audit from #12 while keeping the same page concept and macrostructure.

Changes are targeted cleanup rather than a redesign:

  • removes the glassmorphism nav surface
  • replaces raw color/shadow literals with token-based values
  • replaces placeholder Acme copy in invoice examples
  • switches away from overused Geist/Geist Mono registers
  • fixes skipped heading levels in the console and footer
  • adds root overflow-x: clip
  • varies equal 3-column grids to avoid the rote card-grid fingerprint
  • removes a layout-property transition
  • keeps the page detector-clean under the audit adapter
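As an illustration of what a finding like layout-skipped-heading looks for, here is a minimal sketch of a heading-level check. `findSkippedHeadings` is a hypothetical helper written for this note, not the detector shipped in evals/; it assumes headings appear as plain `<h1>` to `<h6>` tags and only looks at jumps deeper than one level.

```javascript
// Hypothetical sketch, not the rule in evals/: report headings that jump
// more than one level deeper than the previous heading (for example h2 -> h4).
// Moving back up (h4 -> h2) is fine, and the first heading is never flagged.
function findSkippedHeadings(html) {
  // Collect heading levels in document order from opening tags only.
  const levels = [...html.matchAll(/<h([1-6])\b/gi)].map((m) => Number(m[1]));
  const skips = [];
  for (let i = 1; i < levels.length; i++) {
    if (levels[i] - levels[i - 1] > 1) {
      skips.push({ from: levels[i - 1], to: levels[i] });
    }
  }
  return skips;
}
```

Calling it on a footer that goes from an `h2` straight to an `h4` would return one entry with `from: 2` and `to: 4`; an empty array means no skipped levels were found.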

Why

The new audit harness should improve real Hallmark examples, not only curated fixtures. Tally was one of the lower-scoring examples in the real-site audit and had mostly high-confidence findings that could be fixed without changing the product direction.

Audit evidence

Before this PR:

examples/tally/  v1 3.59  v2 3.71
findings: type-overused-font, mono-as-shorthand, color-token-discipline,
color-ink-on-ink, visual-glassmorphism, layout-center-everything,
layout-three-col-cards, layout-arbitrary-spacing, layout-skipped-heading,
motion-layout-animation, interaction-placeholder-names, responsive-overflow-clip

After this PR:

examples/tally/  v1 5.00  v2 5.00
findings: —

Testing

node evals/check.mjs
node evals/audit-site.mjs site/examples/tally/index.html

Observed:

PASS v1: 98.5/100   rules 37   fixtures 3
PASS v2: 98.7/100   rules 43   fixtures 5   structure 5.00/5

examples/tally/  v1 5.00  v2 5.00   —

Risk

This touches a visual example, so browser review is still needed. The intent was to keep the existing Tally direction intact and limit the diff to audit-backed cleanup.

claude and others added 4 commits May 21, 2026 10:32
Build a deterministic slop detector grounded in Impeccable's 37 patterns
plus Hallmark's own gates, score self-contained fixtures across genres, and
run a 10-cycle eval-driven hillclimb.

Phase 1 (v1, cycles 1-5): close gaps the detector found, adding gates 70-77
to references/slop-test.md; fixtures climb 74.2 -> 98.3.

Cycle 6: per "Your Evals Will Break", upgrade the eval to v2 -- six new
detector rules (incl. hero-float/gate 54 the v1-perfect fixtures had been
violating), a cross-fixture order parameter (macrostructure reuse), and two
adversarial fixtures. Score honestly drops to 76.4.

Phase 2 (v2, cycles 7-10): add gates 78-84 and climb back to 98.7, resisting
a dark/neon/metric-hero brief. The skill gained 15 gates motivated by what
the eval could measure. Full curve in evals/results/history.md.

Auditing the in-repo Hallmark corpus (homepage + examples) surfaced a real
false-positive rate. Fix the worst offenders so the signal is trustworthy:

- placeholder-names: only flag actual placeholder names (Jane Doe, Acme,
  lorem ipsum), not ordinary words like "seamless"/"unleash" in prose.
- ai-palette: require the violet->cyan *ramp*, not a single deliberate brand
  hue, so a midnight-violet brand is no longer flagged.
- font counting: count a monospace family toward the budget only when used
  outside code (per gate 39); stop counting unused --font-mono tokens.
- multi-theme scoping: resolve tokens from the active [data-theme] only, and
  label 22-theme / component-library stylesheets low-confidence instead of
  scoring them as one page.
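
To make the placeholder-names change concrete, here is a minimal sketch of a matcher that flags only literal placeholder names. `findPlaceholderNames` and its pattern list are assumptions for illustration, limited to the three names mentioned above; the real rule may match more names or work differently.

```javascript
// Hypothetical sketch, not the rule in evals/: flag literal placeholder names
// only. Ordinary marketing words such as "seamless" or "unleash" are ignored
// because they are simply not in the pattern list.
function findPlaceholderNames(text) {
  const patterns = [/\bJane Doe\b/gi, /\bAcme\b/gi, /\blorem ipsum\b/gi];
  // String.prototype.match with a global regex returns all matches or null.
  return patterns.flatMap((re) => text.match(re) ?? []);
}
```

Under this sketch, invoice copy that mentions "Acme Corp" would still be reported, while prose that only uses words like "seamless" would return an empty array.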

evals/audit-site.mjs inlines a page's linked stylesheets so the detector can
score real shipped pages. Fixtures unchanged (all still 5.00/5 on v1 and v2);
true positives (Inter in hyperlane, gradient text in bananastudio, "Acme" in
tally) are retained while the false positives are removed.
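
The inlining step can be pictured roughly as follows. This is a sketch under stated assumptions, not the code in evals/audit-site.mjs: `inlineStylesheets` and the `readCss` callback are hypothetical names, and the regex-based tag matching assumes simple, quoted `href` attributes.

```javascript
// Hypothetical sketch of stylesheet inlining. `readCss(href)` is an assumed
// callback that returns the stylesheet text, or null/undefined when the file
// cannot be resolved, in which case the original <link> tag is kept.
function inlineStylesheets(html, readCss) {
  return html.replace(/<link\b[^>]*>/gi, (tag) => {
    // Leave non-stylesheet links (icons, preloads) untouched.
    if (!/\brel=["']?stylesheet["']?/i.test(tag)) return tag;
    const href = tag.match(/\bhref=["']([^"']+)["']/i);
    if (!href) return tag;
    const css = readCss(href[1]);
    return css == null ? tag : `<style>\n${css}\n</style>`;
  });
}
```

In a Node script, `readCss` could wrap a file read relative to the page's directory; keeping it as a parameter lets the function run without touching the filesystem.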
vercel Bot commented Jun 1, 2026

@adewale is attempting to deploy a commit to the Together AI Team on Vercel.

A member of the Team first needs to authorize it.
