fix(showcase-ecommerce): align query runners with authors for realism by alexsku · Pull Request #202 · datahub-project/static-assets

alexsku · 2026-06-04T23:04:00Z

Summary

Makes the synthetic query usage on the showcase-ecommerce datapack realistic by aligning each query's runners (topUsersLast30Days) with its author. Stacks on top of pr-198 (which added 05-query-usage.json and the queryUsageFeatures aspects); this change only touches the runner identities.

Why

The seeded usage attributed runners essentially at random relative to authors:

- only 43/312 queries (14%) had the author anywhere in topUsersLast30Days
- 3 of 8 authors never ran a single one of their own queries

That isn't how analyst usage looks — people overwhelmingly run the queries they write, and are usually the heaviest runner early in a query's life. A demo viewer who opens a query's creator and then its usage tab would otherwise see two unrelated people.

What changed (and what didn't)

Only runner identities change. Selection is seeded by query URN, so it's deterministic and reproducible:

- ~85% of queries: the author is the top runner, plus 0–N colleagues
- single-runner queries: the lone runner is the author (a personal query), now 206/239
- ~15%: author absent, the realistic "inherited / departed-owner" long tail
- the 4 queries authored by the bare EMP006 placeholder use the author-absent pattern, so only the 10 real (email) corpusers ever appear as runners
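
The selection rule above can be sketched roughly as follows. This is an illustration only, not the script used for this PR: the function name `pick_runners`, its parameters, and the SHA-256-based seeding are all assumptions. The only points taken from the description are that the choice is seeded by the query URN, that the runner count is preserved, and that the author is either the top runner or absent.

```python
import hashlib
import random


def pick_runners(query_urn, author, users, runner_count, author_absent_rate=0.15):
    """Return runner_count runner URNs, deterministic for a given query URN.

    Hypothetical helper. Pass author=None for placeholder authors (such as
    the bare EMP006 entity) to force the author-absent pattern.
    Assumes runner_count is no larger than the number of eligible users.
    """
    if runner_count == 0:
        return []
    # Seed from the URN so reruns reproduce the same choice. hashlib is used
    # because Python's built-in hash() for str is salted per process.
    digest = hashlib.sha256(query_urn.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    others = [u for u in users if u != author]
    if author is None or rng.random() < author_absent_rate:
        # "Inherited / departed-owner" tail: the author does not appear.
        return rng.sample(others, runner_count)
    # Author first (top runner), then colleagues to keep the original count.
    return [author] + rng.sample(others, runner_count - 1)
```

Because the seed depends only on the URN, calling the function twice with the same arguments yields the same list, which is what makes a regenerated file diffable against the previous one.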

Preserved unchanged: runner counts, queryCountLast30Days / queryCountTotal / lastExecutedAt, SQL, authors, subjects, and 02-data.json. Applied identically to 04-queries.json (queryUsageFeatures) and the standalone 05-query-usage.json so the two stay in sync.

Result

- author present in runners: 263/312 (84%), up from 14%
- every runner resolves to a real corpuser defined in 02-data.json
- diff is 576 lines, every one a topUsersLast30Days value; no other fields touched
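
Numbers like these could be re-checked with a small script along the following lines. It is a sketch under assumptions: each query is reduced to a hypothetical record with `author` and `runners` keys, whereas the real datapack stores these in separate aspects that would need to be joined first.

```python
def author_coverage(queries):
    """Return (queries whose author is among the runners, total queries)."""
    present = sum(1 for q in queries if q["author"] in q["runners"])
    return present, len(queries)


def unknown_runners(queries, known_users):
    """Return runner URNs that do not resolve to a known corpuser."""
    seen = {r for q in queries for r in q["runners"]}
    return seen - set(known_users)


# Hypothetical, simplified records for illustration only.
queries = [
    {"author": "urn:li:corpuser:a@example.com",
     "runners": ["urn:li:corpuser:a@example.com", "urn:li:corpuser:b@example.com"]},
    {"author": "urn:li:corpuser:b@example.com",
     "runners": ["urn:li:corpuser:a@example.com"]},
]
known = ["urn:li:corpuser:a@example.com", "urn:li:corpuser:b@example.com"]

present, total = author_coverage(queries)
print(f"author present in runners: {present}/{total} ({present / total:.0%})")
print("unresolved runners:", unknown_runners(queries, known))
```

On the real data, the first figure should come out as 263/312 and the second set should be empty.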

Purely a corpus-realism change; anchor-pipeline thresholds were not the motivation and are unaffected.

🤖 Generated with Claude Code

Regenerate `topUsersLast30Days` on all 312 showcase-ecommerce queries so the people shown running each query are consistent with who wrote it. Applied identically to `04-queries.json` (queryUsageFeatures) and the standalone `05-query-usage.json` so the two stay in sync.

Why
---

The seeded usage attributed runners essentially at random relative to authors: only 43/312 queries (14%) had the author anywhere in `topUsersLast30Days`, and 3 of 8 authors never ran a single one of their own queries. That is not how analyst usage looks: people overwhelmingly run the queries they author, and are usually the heaviest runner early in a query's life. A demo viewer who opened a query's creator and then its usage tab would see two unrelated people.

What changed (and what did not)
-------------------------------

Only the runner identities change. Each query's runner *count* is preserved, so execution stats stay coherent, and `queryCountLast30Days`, `queryCountTotal`, and `lastExecutedAt` are untouched. SQL, authors, subjects, the corpuser set, and `02-data.json` are all unchanged. Selection is seeded by query URN, so the result is deterministic and reproducible.

Distribution:

- ~85% of queries: the author is the top runner, plus 0-N colleagues.
- single-runner queries: the lone runner is the author (a personal query), now 206/239, where before the lone runner was almost always someone else.
- ~15%: author absent, the realistic "inherited / departed-owner" long tail.
- the 4 queries authored by the bare `EMP006` placeholder use the author-absent pattern, so only the 10 real (email) corpusers ever appear as runners; authorship and the EMP006 entity are left as-is.

Result: the author now appears among the runners on 263/312 queries (84%, up from 14%); every runner resolves to a real corpuser in `02-data.json`.

This is purely about corpus realism: anchor-pipeline thresholds were never the motivation and are unaffected.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

vercel · 2026-06-04T23:04:05Z

@alexsku is attempting to deploy a commit to the Acryl Data Team on Vercel.

A member of the Team first needs to authorize it.

05-query-usage.json was added (in pr-198) without the GenericAspect contentType field, so `datahub datapack load` rejected all 312 MCPs in File 5/5:

'Datum {value: ...} cannot be parsed as union schema [null, GenericAspect{value: bytes, contentType: string}]'

Every aspect serialized as a JSON string needs "contentType": "application/json" (the same envelope 04-queries.json already carries; --dry-run misses it because it only checks aspectName schema-compat, not the envelope shape). Added to all 312 queryUsageFeatures aspects; the value payloads (runners, counts, timestamps) are unchanged.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

alexsku · 2026-06-04T23:24:18Z

Added a second commit (a94ef77) that fixes a load-blocking bug surfaced while test-loading this build: 05-query-usage.json was missing the GenericAspect contentType on all 312 aspects, so datahub datapack load rejected the whole file (File 5/5):

Datum {'value': '...'} cannot be parsed as union schema ["null", GenericAspect{value: bytes, contentType: string}]

Added "contentType": "application/json" to all 312 aspects (matching 04-queries.json); values unchanged. Note --dry-run does not catch this (it only checks aspectName schema-compat, not the envelope).
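
For reference, a fix of this kind might look like the sketch below. It assumes the file is a JSON array of MCP objects, each with an `aspectName` and an `aspect` envelope holding the serialized `value`. The helper names `add_content_type` and `fix_file` are hypothetical, and this is not the script that produced commit a94ef77.

```python
import json


def add_content_type(mcps, aspect_name="queryUsageFeatures"):
    """Add the missing contentType to matching aspect envelopes.

    Leaves the serialized value untouched and returns how many
    envelopes were changed.
    """
    changed = 0
    for mcp in mcps:
        aspect = mcp.get("aspect")
        if mcp.get("aspectName") != aspect_name or not isinstance(aspect, dict):
            continue
        if "contentType" not in aspect:
            aspect["contentType"] = "application/json"
            changed += 1
    return changed


def fix_file(path):
    """Rewrite a JSON array of MCPs in place (assumed file layout)."""
    with open(path, encoding="utf-8") as f:
        mcps = json.load(f)
    changed = add_content_type(mcps)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(mcps, f, indent=2)
    return changed
```

Running it a second time changes nothing, since envelopes that already carry `contentType` are skipped.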

FYI for whoever owns pr-198: queryUsageFeatures is now duplicated across 04-queries.json and 05-query-usage.json (identical payloads). Loading both is harmless (idempotent UPSERT), but you may want to pick one home for usage — happy to drop it from either side.

nwadams · 2026-06-04T23:30:34Z

> FYI for whoever owns pr-198: queryUsageFeatures is now duplicated across 04-queries.json and 05-query-usage.json (identical payloads). Loading both is harmless (idempotent UPSERT), but you may want to pick one home for usage — happy to drop it from either side.

@alexsku let's drop it from the queries file, since query-usage makes more sense?

alexsku wants to merge 2 commits into datahub-project:pr-198 from alexsku:pr-198-realistic-runners
