fix(showcase-ecommerce): align query runners with authors for realism #202
alexsku wants to merge 2 commits into
Regenerate `topUsersLast30Days` on all 312 showcase-ecommerce queries so the people shown running each query are consistent with who wrote it. Applied identically to `04-queries.json` (queryUsageFeatures) and the standalone `05-query-usage.json` so the two stay in sync.

Why
---
The seeded usage attributed runners essentially at random relative to authors: only 43/312 queries (14%) had the author anywhere in `topUsersLast30Days`, and 3 of 8 authors never ran a single one of their own queries. That is not how analyst usage looks: people overwhelmingly run the queries they author, and are usually the heaviest runner early in a query's life. A demo viewer who opened a query's creator and then its usage tab would see two unrelated people.

What changed (and what did not)
-------------------------------
Only the runner identities change. Each query's runner *count* is preserved, so execution stats stay coherent, and `queryCountLast30Days`, `queryCountTotal`, and `lastExecutedAt` are untouched. SQL, authors, subjects, the corpuser set, and `02-data.json` are all unchanged. Selection is seeded by query URN, so the result is deterministic and reproducible.

Distribution:
- ~85% of queries: the author is the top runner, plus 0-N colleagues.
- single-runner queries: the lone runner is the author (a personal query). This now holds for 206/239, where before the lone runner was almost always someone else.
- ~15%: author absent, the realistic "inherited / departed-owner" long tail.
- the 4 queries authored by the bare `EMP006` placeholder use the author-absent pattern, so only the 10 real (email) corpusers ever appear as runners; authorship and the EMP006 entity are left as-is.

Result: the author now appears among the runners on 263/312 queries (84%, up from 14%); every runner resolves to a real corpuser in `02-data.json`.

This is purely about corpus realism; anchor-pipeline thresholds were never the motivation and are unaffected.
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
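For anyone who wants to re-check the before/after coverage numbers quoted above, here is a minimal sketch of the measurement. It assumes each query has already been flattened into a dict with hypothetical `author` and `runners` keys; the real datapack files use MCP envelopes, so the extraction step is left out, and the function names are illustrative only.

```python
def author_coverage(queries):
    """Return (hits, total): how many queries list their author among the runners.

    `queries` is a list of dicts with hypothetical keys:
      - "author":  the query author's URN (str)
      - "runners": the user URNs taken from topUsersLast30Days (list of str)
    """
    hits = sum(1 for q in queries if q["author"] in q["runners"])
    return hits, len(queries)


def authors_never_running_own(queries):
    """Return the set of authors who do not appear as a runner on any of their own queries."""
    authors = {q["author"] for q in queries}
    self_runners = {q["author"] for q in queries if q["author"] in q["runners"]}
    return authors - self_runners
```

Running `author_coverage` over the flattened queries before and after the regeneration should give the two ratios compared in the commit message, provided the flattening matches how the counts were originally taken.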
05-query-usage.json was added (in pr-198) without the GenericAspect
contentType field, so `datahub datapack load` rejected all 312 MCPs in
File 5/5: 'Datum {value: ...} cannot be parsed as union schema [null,
GenericAspect{value: bytes, contentType: string}]'.
Every aspect serialized as a JSON string needs
"contentType": "application/json" (the same envelope 04-queries.json
already carries; --dry-run misses it because it only checks aspectName
schema-compat, not the envelope shape). Added to all 312 queryUsageFeatures
aspects; the value payloads (runners, counts, timestamps) are unchanged.
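
A minimal sketch of the kind of patch described above, for illustration only. It assumes each MCP is a dict whose `aspect` key holds the `{value, contentType}` envelope named in the error message; the `aspect` key name and the helper name `add_content_type` are assumptions, not taken from the actual fix.

```python
import json


def add_content_type(mcps):
    """Add the JSON contentType to aspect envelopes that lack it.

    Only envelopes whose "value" is a string (a serialized JSON payload) are
    touched, and the payload itself is left unchanged.
    Returns the number of envelopes patched.
    """
    patched = 0
    for mcp in mcps:
        aspect = mcp.get("aspect")  # assumed envelope key
        if (
            isinstance(aspect, dict)
            and isinstance(aspect.get("value"), str)
            and "contentType" not in aspect
        ):
            aspect["contentType"] = "application/json"
            patched += 1
    return patched


# Example: one envelope missing contentType, one already correct.
mcps = [
    {"aspectName": "queryUsageFeatures",
     "aspect": {"value": json.dumps({"queryCountLast30Days": 5})}},
    {"aspectName": "queryUsageFeatures",
     "aspect": {"value": "{}", "contentType": "application/json"}},
]
patched_count = add_content_type(mcps)
```

The function is idempotent, so re-running it over an already-patched file changes nothing.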
Added a second commit ( Added FYI for whoever owns |
@alexsku let's drop from the queries since query-usage makes more sense?
Summary
Makes the synthetic query usage on the showcase-ecommerce datapack realistic by aligning each query's runners (`topUsersLast30Days`) with its author. Stacks on top of pr-198 (which added `05-query-usage.json` and the `queryUsageFeatures` aspects); this change only touches the runner identities.

Why
The seeded usage attributed runners essentially at random relative to authors:
- only 43/312 queries (14%) had the author anywhere in `topUsersLast30Days`
- 3 of 8 authors never ran a single one of their own queries

That isn't how analyst usage looks: people overwhelmingly run the queries they write, and are usually the heaviest runner early in a query's life. A demo viewer who opens a query's creator and then its usage tab would otherwise see two unrelated people.

What changed (and what didn't)
Only runner identities change. Selection is seeded by query URN, so it's deterministic and reproducible:
- ~85% of queries: the author is the top runner, plus 0-N colleagues
- single-runner queries: the lone runner is the author (now 206/239)
- ~15%: author absent, the "inherited / departed-owner" long tail
- the 4 queries authored by the bare `EMP006` placeholder use the author-absent pattern, so only the 10 real (email) corpusers ever appear as runners

Preserved unchanged: runner counts, `queryCountLast30Days` / `queryCountTotal` / `lastExecutedAt`, SQL, authors, subjects, and `02-data.json`. Applied identically to `04-queries.json` (queryUsageFeatures) and the standalone `05-query-usage.json` so the two stay in sync.

Result
- the author now appears among the runners on 263/312 queries (84%, up from 14%)
- every runner resolves to a real corpuser in `02-data.json`
- only the `topUsersLast30Days` value changes; no other fields touched

Purely a corpus-realism change; anchor-pipeline thresholds were not the motivation and are unaffected.
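
For reviewers, a rough sketch of what "seeded by query URN" can look like. This is not the script used for this PR: the function name, parameters, and the exact sampling scheme are assumptions, and the sketch only shows the general idea of deriving a stable RNG from the URN, keeping the runner count fixed, and making the author absent for a configurable share of queries.

```python
import hashlib
import random


def pick_runners(query_urn, author, corpusers, runner_count, author_absent_rate=0.15):
    """Deterministically choose `runner_count` runners for one query.

    The RNG is seeded from a SHA-256 digest of the URN rather than Python's
    built-in hash(), which is salted per process for strings and so would
    not be reproducible across runs.
    """
    if runner_count <= 0:
        return []
    digest = hashlib.sha256(query_urn.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    # Sort the pool so the result does not depend on input ordering.
    pool = sorted(u for u in corpusers if u != author)

    if rng.random() < author_absent_rate:
        # "Inherited / departed-owner" case: the author does not appear at all.
        return rng.sample(pool, min(runner_count, len(pool)))

    # Common case: the author is the top runner, followed by colleagues.
    others = rng.sample(pool, min(runner_count - 1, len(pool)))
    return [author] + others
```

Because the seed depends only on the URN, re-running the generation over the same inputs yields the same runners for every query, which is what keeps the two JSON files in sync when both are produced from the same function.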