Skip to content

feat(model-eval-ingest): sync promoted bench evals#3258

Open
lambertjosh wants to merge 11 commits into
mainfrom
rogue-timbale
Open

feat(model-eval-ingest): sync promoted bench evals#3258
lambertjosh wants to merge 11 commits into
mainfrom
rogue-timbale

Conversation

@lambertjosh
Copy link
Copy Markdown
Contributor

Summary

  • Add a dedicated Cloudflare model-eval-ingest puller that reads promoted evals from kilo-bench over a Service Binding, stores append-only audit rows, and recomputes public modelStats.benchmarks.kiloBench caches.
  • Extend cloud persistence, migrations, GDPR anonymization, and admin-only operations for sync-now / ingest history while keeping bench URLs and promoter email out of public model stats JSON.
  • Preserve fractional benchmark scores and weighted aggregate semantics, including latest-promotion-wins recompute behavior per model/task tuple.

Verification

  • Ran the bench dashboard Worker and cloud model-eval-ingest Worker locally, seeded a promoted swift-falcon-local-e2e bench eval, triggered the scheduled sync endpoint, and verified Postgres stored the ingest row and public kiloBench cache.
  • Re-triggered the local scheduled sync to verify the ingest remained idempotent, then inspected the cache payload to confirm it excluded bench URLs and promoter email while retaining ISO lastPromotedAt.

Visual Changes

N/A

Reviewer Notes

  • The new worker expects a BENCH_DASHBOARD Service Binding plus the shared internal API secret; the web admin trigger uses MODEL_EVAL_INGEST_URL to call /internal/sync.
  • model_eval_ingest.promoted_by_email is retained for audit but anonymized by the account soft-delete flow.
  • The generated Drizzle migration introduces the append-only ingest table and indexes needed for idempotency and cache recompute queries.

Comment thread services/model-eval-ingest/src/sync.ts
Comment thread services/model-eval-ingest/src/sync.ts Outdated
Comment thread apps/web/src/lib/model-eval-ingest-client.ts Outdated
Comment thread packages/db/src/schema.ts
Comment thread apps/web/src/routers/admin-model-eval-ingest-router.ts
@kilo-code-bot
Copy link
Copy Markdown
Contributor

kilo-code-bot Bot commented May 15, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Executive Summary

The latest commit (6469ebbf) renames the admin UI surface label from "Model Eval Ingest" to "Model Benchmarks" in the sidebar, page heading, and breadcrumb — a cosmetic-only change with no logic, security, or data impact. All prior issues remain resolved.

All Previous Issues Resolved (click to expand)
File Issue Status
services/model-eval-ingest/src/sync.ts N+1 DB query pattern fixed — findModelStatsTargets and insertPromotions are now batched ✅ Resolved
apps/web/src/lib/model-eval-ingest-client.ts as cast replaced with Zod schema validation ✅ Resolved
services/model-eval-ingest/src/sync.ts Cache recompute intent now documented with explanatory comment ✅ Resolved
packages/db/src/schema.ts Expression index on LOWER(promoted_by_email) added for GDPR soft-delete ✅ Resolved
apps/web/src/routers/admin-model-eval-ingest-router.ts repullPromotion is now wired up in the admin UI ✅ Resolved
apps/web/src/app/admin/model-eval-ingest/ModelEvalIngestContent.tsx Toast message now uses correct pluralization via formatCount helper ✅ Resolved
packages/db/src/migrations/ Two migrations squashed into 0129_friendly_iron_patriot.sql; CREATE INDEX now runs on the newly-created (empty) table in the same migration — no lock concern on a populated table ✅ Resolved
Files Reviewed (25 files)
  • .env.local.example - 0 issues
  • apps/web/.env.development.local.example - 0 issues
  • apps/web/src/app/admin/components/AppSidebar.tsx - 0 issues
  • apps/web/src/app/admin/model-eval-ingest/ModelEvalIngestContent.tsx - 0 issues
  • apps/web/src/app/admin/model-eval-ingest/page.tsx - 0 issues
  • apps/web/src/lib/config.server.ts - 0 issues
  • apps/web/src/lib/model-eval-ingest-client.ts - 0 issues
  • apps/web/src/lib/user.test.ts - 0 issues
  • apps/web/src/lib/user.ts - 0 issues
  • apps/web/src/routers/admin-model-eval-ingest-router.ts - 0 issues
  • apps/web/src/routers/admin-router.ts - 0 issues
  • packages/db/src/migrations/0129_friendly_iron_patriot.sql - 0 issues
  • packages/db/src/schema.ts - 0 issues
  • services/model-eval-ingest/.dev.vars.example - 0 issues
  • services/model-eval-ingest/README.md - 0 issues
  • services/model-eval-ingest/src/index.ts - 0 issues
  • services/model-eval-ingest/src/sync.ts - 0 issues
  • services/model-eval-ingest/src/types.ts - 0 issues
  • services/model-eval-ingest/worker-configuration.d.ts - 0 issues
  • services/model-eval-ingest/wrangler.jsonc - 0 issues
  • dev/local/services.ts - 0 issues
  • DEVELOPMENT.md - 0 issues

Reviewed by claude-sonnet-4.6 · 315,716 tokens

Review guidance: REVIEW.md from base branch main

Comment thread packages/db/src/migrations/0130_numerous_marvel_boy.sql Outdated
Comment thread apps/web/src/app/admin/model-eval-ingest/ModelEvalIngestContent.tsx Outdated
@lambertjosh
Copy link
Copy Markdown
Contributor Author

image image image

@lambertjosh
Copy link
Copy Markdown
Contributor Author

Note that if/when this is merged, we will then need to:

  1. Wait for CI to deploy model-eval-ingest
  2. Capture its deployed URL - i.e. workers.dev
  3. Set MODEL_EVAL_INGEST_URL in Vercel production to value.
  4. Redeploy Vercel / let the next web deploy pick it up

@lambertjosh lambertjosh requested a review from jrf0110 May 15, 2026 14:47
Comment thread packages/db/src/schema.ts Outdated
Comment thread packages/db/src/schema.ts Outdated
Comment thread packages/db/src/schema.ts Outdated
scale: 6,
mode: 'number',
}),
avg_execution_ms: decimal('avg_execution_ms', { precision: 16, scale: 6, mode: 'number' }),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need floating point precision for a value measured in ms?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No! Ugh. Fixing.

Comment thread packages/db/src/schema.ts Outdated
Comment on lines +3015 to +3022
avg_cost_usd: decimal('avg_cost_usd', { precision: 14, scale: 8, mode: 'number' }),
avg_input_tokens: decimal('avg_input_tokens', { precision: 16, scale: 6, mode: 'number' }),
avg_output_tokens: decimal('avg_output_tokens', { precision: 16, scale: 6, mode: 'number' }),
avg_cache_read_tokens: decimal('avg_cache_read_tokens', {
precision: 16,
scale: 6,
mode: 'number',
}),
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue all of these should be integers

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, they are averages but we should just round. We don't really need this level of precision.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All integers now with the exception of microdollar which is bigint, as otherwise the limit would be an avg cost of ~2000usd. Since that is only a factor of 100 off from results we have seen, I bumped it to be safe.

Comment thread packages/db/src/schema.ts Outdated
@lambertjosh
Copy link
Copy Markdown
Contributor Author

@jrf0110 - ready for re-review I think. Thanks for the review so far.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants