fix(showcase-ecommerce): align query runners with authors for realism #202

Open
alexsku wants to merge 2 commits into datahub-project:pr-198 from alexsku:pr-198-realistic-runners
alexsku commented Jun 4, 2026

Contributor

Summary

Makes the synthetic query usage on the showcase-ecommerce datapack realistic by aligning each query's runners (topUsersLast30Days) with its author. Stacks on top of pr-198 (which added 05-query-usage.json and the queryUsageFeatures aspects); this change only touches the runner identities.

Why

The seeded usage attributed runners essentially at random relative to authors:

  • only 43/312 queries (14%) had the author anywhere in topUsersLast30Days
  • 3 of 8 authors never ran a single one of their own queries

That isn't how analyst usage looks — people overwhelmingly run the queries they write, and are usually the heaviest runner early in a query's life. A demo viewer who opens a query's creator and then its usage tab would otherwise see two unrelated people.

What changed (and what didn't)

Only runner identities change. Selection is seeded by query URN, so it's deterministic and reproducible:

  • ~85% of queries: the author is the top runner, plus zero or more colleagues (up to the query's preserved runner count)
  • single-runner queries: the lone runner is the author (a personal query) — now 206/239
  • ~15%: author absent — the realistic "inherited / departed-owner" long tail
  • the 4 queries authored by the bare EMP006 placeholder use the author-absent pattern, so only the 10 real (email) corpusers ever appear as runners

Preserved unchanged: runner counts, queryCountLast30Days / queryCountTotal / lastExecutedAt, SQL, authors, subjects, and 02-data.json. Applied identically to 04-queries.json (queryUsageFeatures) and the standalone 05-query-usage.json so the two stay in sync.
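
A minimal sketch of how URN-seeded selection of this kind could work, for readers who want to reproduce the idea. This is not the script used for this PR: the function name `pick_runners`, its parameters, and the 0.85 rate are illustrative assumptions based on the distribution described above.

```python
import hashlib
import random


def pick_runners(query_urn, author, real_users, runner_count, author_present_rate=0.85):
    """Pick `runner_count` runners for one query, deterministically per URN.

    Hypothetical helper: names and the rate are assumptions, not the PR's code.
    """
    # Derive the seed from a stable hash of the URN. Python's built-in hash()
    # is salted per process, so it would not be reproducible across runs.
    digest = hashlib.sha256(query_urn.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))

    others = [u for u in real_users if u != author]
    # Placeholder authors (not in real_users) always take the author-absent path,
    # so only real corpusers can appear as runners.
    author_is_real = author in real_users

    if author_is_real and rng.random() < author_present_rate:
        # Author first (top runner), then enough colleagues to keep the count.
        return [author] + rng.sample(others, runner_count - 1)
    # Author-absent pattern: colleagues only, same runner count.
    return rng.sample(others, runner_count)
```

Because the random generator is rebuilt from the URN on every call, rerunning the selection over the same inputs yields the same runners, and the runner count per query is preserved by construction.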

Result

  • author present in runners: 263/312 (84%), up from 14%
  • every runner resolves to a real corpuser defined in 02-data.json
  • diff is 576 lines, every one a topUsersLast30Days value — no other fields touched
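
The two headline checks above (author present among runners, every runner resolving to a known corpuser) could be recomputed with something like the following sketch. It assumes each query has already been reduced to a dict with hypothetical `author` and `runners` keys; extracting those from the datapack files is not shown.

```python
def author_coverage(queries, known_users):
    """Return (queries with author among runners, total queries, unknown runners).

    `queries` is a list of dicts with assumed keys "author" and "runners".
    """
    present = sum(1 for q in queries if q["author"] in q["runners"])
    unknown = sorted(
        {r for q in queries for r in q["runners"] if r not in known_users}
    )
    return present, len(queries), unknown


# Tiny illustrative input, not data from the datapack.
sample = [
    {"author": "a@example.com", "runners": ["a@example.com", "b@example.com"]},
    {"author": "b@example.com", "runners": ["c@example.com"]},
]
present, total, unknown = author_coverage(
    sample, {"a@example.com", "b@example.com", "c@example.com"}
)
print(f"author present in runners: {present}/{total}; unknown runners: {unknown}")
```

An empty `unknown` list corresponds to the claim that every runner resolves to a defined corpuser.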

Purely a corpus-realism change; anchor-pipeline thresholds were not the motivation and are unaffected.

🤖 Generated with Claude Code

Regenerate `topUsersLast30Days` on all 312 showcase-ecommerce queries so the
people shown running each query are consistent with who wrote it. Applied
identically to `04-queries.json` (queryUsageFeatures) and the standalone
`05-query-usage.json` so the two stay in sync.

Why
---
The seeded usage attributed runners essentially at random relative to authors:
only 43/312 queries (14%) had the author anywhere in `topUsersLast30Days`, and
3 of 8 authors never ran a single one of their own queries. That is not how
analyst usage looks — people overwhelmingly run the queries they author, and
are usually the heaviest runner early in a query's life. A demo viewer who
opened a query's creator and then its usage tab would see two unrelated people.

What changed (and what did not)
-------------------------------
Only the runner identities change. Each query's runner *count* is preserved, so
execution stats stay coherent, and `queryCountLast30Days`, `queryCountTotal`,
and `lastExecutedAt` are untouched. SQL, authors, subjects, the corpuser set,
and `02-data.json` are all unchanged. Selection is seeded by query URN, so the
result is deterministic and reproducible.

Distribution:
- ~85% of queries: the author is the top runner, plus 0-N colleagues.
- single-runner queries: the lone runner is the author (a personal query) —
  now 206/239, where before the lone runner was almost always someone else.
- ~15%: author absent — the realistic "inherited / departed-owner" long tail.
- the 4 queries authored by the bare `EMP006` placeholder use the
  author-absent pattern, so only the 10 real (email) corpusers ever appear as
  runners; authorship and the EMP006 entity are left as-is.

Result: the author now appears among the runners on 263/312 queries (84%, up
from 14%); every runner resolves to a real corpuser in `02-data.json`. This is
purely about corpus realism — anchor-pipeline thresholds were never the
motivation and are unaffected.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

vercel Bot commented Jun 4, 2026


@alexsku is attempting to deploy a commit to the Acryl Data Team on Vercel.

A member of the Team first needs to authorize it.

05-query-usage.json was added (in pr-198) without the GenericAspect
contentType field, so `datahub datapack load` rejected all 312 MCPs in
File 5/5: 'Datum {value: ...} cannot be parsed as union schema [null,
GenericAspect{value: bytes, contentType: string}]'.

Every aspect serialized as a JSON string needs
"contentType": "application/json" (the same envelope 04-queries.json
already carries; --dry-run misses it because it only checks aspectName
schema-compat, not the envelope shape). Added to all 312 queryUsageFeatures
aspects; the value payloads (runners, counts, timestamps) are unchanged.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

alexsku commented Jun 4, 2026

Contributor Author

Added a second commit (a94ef77) that fixes a load-blocking bug surfaced while test-loading this build: 05-query-usage.json was missing the GenericAspect contentType on all 312 aspects, so datahub datapack load rejected the whole file (File 5/5):

Datum {'value': '...'} cannot be parsed as union schema ["null", GenericAspect{value: bytes, contentType: string}]

Added "contentType": "application/json" to all 312 aspects (matching 04-queries.json); values unchanged. Note --dry-run does not catch this (it only checks aspectName schema-compat, not the envelope).
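
For illustration, a fix of this shape can be sketched as below. The record layout (an `aspect` object holding a JSON-string `value`) is an assumption inferred from the error message, and `add_content_type` is a hypothetical helper, not part of the DataHub CLI.

```python
import json


def add_content_type(mcps, content_type="application/json"):
    """Add a missing contentType to each aspect envelope; return how many were patched.

    Assumes each record looks like {"aspectName": ..., "aspect": {"value": "<json string>"}}.
    The value payload itself is never modified.
    """
    patched = 0
    for mcp in mcps:
        aspect = mcp.get("aspect")
        if isinstance(aspect, dict) and "value" in aspect and "contentType" not in aspect:
            aspect["contentType"] = content_type
            patched += 1
    return patched


# Illustrative records only; the URN and payload are made up.
records = [
    {
        "entityUrn": "urn:li:query:example",
        "aspectName": "queryUsageFeatures",
        "aspect": {"value": json.dumps({"queryCountLast30Days": 3})},
    }
]
print(add_content_type(records))
```

Running the helper a second time over the same records patches nothing, so it is safe to reapply.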

FYI for whoever owns pr-198: queryUsageFeatures is now duplicated across 04-queries.json and 05-query-usage.json (identical payloads). Loading both is harmless (idempotent UPSERT), but you may want to pick one home for usage — happy to drop it from either side.
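
A quick way to confirm that the duplicated payloads really are identical is sketched below. It assumes both files have been parsed into lists of records with `entityUrn`, `aspectName`, and a JSON-string `aspect.value`; that layout and the helper names are assumptions for illustration.

```python
import json


def usage_by_urn(mcps, aspect_name="queryUsageFeatures"):
    """Map entity URN to the decoded payload for one aspect name."""
    return {
        m["entityUrn"]: json.loads(m["aspect"]["value"])
        for m in mcps
        if m.get("aspectName") == aspect_name
    }


def diff_usage(mcps_a, mcps_b):
    """Return the sorted URNs whose payload differs or exists on only one side."""
    a, b = usage_by_urn(mcps_a), usage_by_urn(mcps_b)
    return sorted(urn for urn in a.keys() | b.keys() if a.get(urn) != b.get(urn))
```

An empty result means the two sources agree on every query, regardless of key order or whitespace inside the serialized values.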

nwadams commented Jun 4, 2026

Collaborator
> FYI for whoever owns pr-198: queryUsageFeatures is now duplicated across 04-queries.json and 05-query-usage.json (identical payloads). Loading both is harmless (idempotent UPSERT), but you may want to pick one home for usage — happy to drop it from either side.

@alexsku let's drop it from the queries file, since query-usage makes more sense?
