A single error-analytics group (Observe grouping hash 16864652937831232783) is acting as a catch-all for essentially every GraphQL / Admin-API client error in the CLI, across four unrelated product areas. Over the last 10 days it held ~1,170 events spanning app, theme, store, and hydrogen. This makes the group un-routable, un-actionable, and a permanent source of false P1 escalations. We should make the CLI emit a meaningful groupingHash so distinct failures land in distinct buckets.
Context: this surfaced from resiliency issue Vault 31608 / shop/issues#32995 ("Access denied for themes field… read_themes"). That specific error was fixed in CLI 4.2.0 (#7652) and is confirmed gone on 4.2.0 — but the resiliency item stays hot because ~94% of the bucket is unrelated errors sharing the same hash.
What's actually in the bucket (last 10 days, ~1,170 events)
401 authentication failures (~390; slice store/theme, e.g. store info, store execute) — invalid/expired session token; 163 are literally Service is not valid for authentication. Owner: CLI auth/identity.
Missing access scope ACCESS_DENIED (166 = read_themes 75 + write_themes 91; slice theme): custom-app token missing scope. Already fixed in 4.2.0 (#7652, "Improve missing theme scope error message"); it now surfaces as a clean AbortError and is aging out as 4.2.0 adoption grows.
5xx server errors (~150; Admin + App-Management HTTP 500) — server-side API reliability, not a CLI bug. Owner: the respective API teams.
THROTTLED rate limiting (46; slice theme) — expected/transient; should be retried or suppressed, not crash-reported.
Hydrogen 403/404 (~40; hydrogen deploy/link) — separate surface again.
Root cause
In packages/cli-kit/src/public/node/error-handler.ts (sendErrorToBugsnag):
Every error is rebuilt as reportableError = new Error(error.message), so the error class is always the generic Error.
No event.groupingHash is ever set, so the backend falls back to stack-trace grouping (cf. the stack_frame_grouping_hash column).
Stack frame paths are aggressively normalized (cleanStackFrameFilePath), and there are really two message shapes thrown from the same request site (graphql-request's GraphQL Error (Code: NNN): {…} for theme/store, and The Admin/App Management GraphQL API responded unsuccessfully with the HTTP status NNN … for app/hydrogen).
Net: same class + same normalized stack + ignored message ⇒ one bucket for all of them.
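The class-erasure step can be reproduced in isolation. The snippet below is a minimal sketch, not the code in error-handler.ts: `ClientError` here is a local stand-in for whatever error class the request layer throws, and only the `new Error(error.message)` rebuild is taken from the description above.

```typescript
// Local stand-in for a specific error class thrown by the request layer (assumption).
class ClientError extends Error {
  constructor(message: string) {
    super(message)
    this.name = 'ClientError'
  }
}

const original = new ClientError('GraphQL Error (Code: 401)')

// The rebuild described above: only the message survives, the class does not.
const reportableError = new Error(original.message)

const originalClass = original.name
const reportedClass = reportableError.name

console.log(originalClass, '->', reportedClass)
```

Once every report carries the same class, the class contributes nothing to grouping, and any distinction between failures has to come from the stack or from an explicit grouping key.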
Why we should track these separately
Different owners. This group is assigned to one team, but it contains errors owned by App Management, Hydrogen, Storefront, CLI auth, and the server-side API teams. It cannot be routed as a single item.
Different root causes and fixes. Scope→UX messaging, 401→token refresh, 403→app permissions, THROTTLED→backoff/suppress, 5xx→server reliability. One issue can't carry five fixes, one assignee, or one fix-due date.
Expected vs. real regressions get mixed. Most of the volume is expected user-config / transient throttles; a genuine regression in any single family (e.g. a 401 or 5xx spike from a real bug) is invisible inside the aggregate.
It manufactures false P1s. Severity escalates on aggregate volume, so the group oscillates P1↔P3 independent of whether any underlying problem is bad — and a shipped fix (like Improve missing theme scope error message #7652) can never turn it green.
Proposed fix
Set a meaningful grouping key in the Bugsnag eventHandler, reusing the existing analytics taxonomy (categorizeError + formatErrorMessage in packages/cli-kit/src/private/node/analytics/error-categorizer.ts, already used by storage.ts to emit error:${category}:${signature} events):
Include slice_name so app / theme / store / hydrogen split immediately.
Trim the message at the first : { before categorizing — the raw GraphQL ClientError dumps the full request/response JSON, and the literal "request" substring currently mis-routes errors into the network category. Trimming makes categories semantically correct (scope→Permission, 401→Authentication, THROTTLED→RateLimit) and keeps signatures stable.
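The grouping key described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed patch: `categorize` and `signature` are hypothetical stand-ins for `categorizeError` and `formatErrorMessage` (whose real signatures are not shown here), the category names are made up for the example, and `ReportableEvent` models only the two Bugsnag event fields the sketch touches.

```typescript
// Hypothetical category names; the real taxonomy lives in error-categorizer.ts.
type Category = 'permission' | 'authentication' | 'rate_limit' | 'server' | 'network' | 'unknown'

// Drop the request/response JSON that follows the first ": {" in the message.
function trimMessage(message: string): string {
  const cut = message.indexOf(': {')
  return cut === -1 ? message : message.slice(0, cut)
}

// Stand-in for categorizeError (assumption): simple keyword and status matching.
function categorize(message: string): Category {
  const text = message.toLowerCase()
  if (text.includes('throttled')) return 'rate_limit'
  if (text.includes('access denied') || /\b403\b/.test(text)) return 'permission'
  if (/\b401\b/.test(text)) return 'authentication'
  if (/\b5\d\d\b/.test(text)) return 'server'
  if (text.includes('request')) return 'network'
  return 'unknown'
}

// Stand-in for formatErrorMessage (assumption): a short, stable slug of the message.
function signature(message: string): string {
  return message
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-+|-+$/g, '')
    .slice(0, 50)
}

// slice_name first, so app / theme / store / hydrogen never share a bucket.
function groupingKey(sliceName: string, rawMessage: string): string {
  const message = trimMessage(rawMessage)
  return `${sliceName}:${categorize(message)}:${signature(message)}`
}

// Subset of a Bugsnag event used by this sketch.
interface ReportableEvent {
  groupingHash?: string
  errors: {errorMessage: string}[]
}

// What the event handler would do before the report is sent.
function applyGrouping(event: ReportableEvent, sliceName: string): void {
  const message = event.errors[0]?.errorMessage ?? ''
  event.groupingHash = groupingKey(sliceName, message)
}
```

With this shape, a theme-slice message such as `Throttled: {"request":{}}` and a store-slice message such as `GraphQL Error (Code: 401): {"request":{}}` produce different keys, because slice, category, and signature all differ.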
Complementary, higher-leverage: stop reporting known transient/user errors (THROTTLED, and arguably the 401 "Service is not valid for authentication") as unexpected — same pattern as Improve missing theme scope error message #7652 — so they leave crash reporting entirely rather than just getting relabeled.
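A rough sketch of that suppression, assuming a simple message-based predicate: `EXPECTED_PATTERNS`, `isExpectedError`, and `shouldReport` are hypothetical names for illustration only, and the actual change would more likely convert these failures into the CLI's existing expected-error path, as #7652 did for the scope error.

```typescript
// Messages treated as expected (transient or user-side) rather than crashes.
// The list is illustrative and mirrors the two cases named above.
const EXPECTED_PATTERNS: RegExp[] = [
  /throttled/i,
  /Service is not valid for authentication/i,
]

// Match only the lead of the message so the appended request/response JSON
// cannot trigger a pattern by accident.
function isExpectedError(message: string): boolean {
  const head = message.split(': {')[0] ?? message
  return EXPECTED_PATTERNS.some((pattern) => pattern.test(head))
}

// Gate in front of crash reporting: expected errors are skipped entirely.
function shouldReport(error: Error): boolean {
  return !isExpectedError(error.message)
}
```

Matching on the trimmed lead keeps the predicate consistent with the trimming used for categorization, so an error is either suppressed or grouped on the same text.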
Considerations
Setting groupingHash universally re-buckets all CLI errors (a one-time grouping migration across every error dashboard), not just these. That's an improvement (grouping finally matches the analytics taxonomy) but should be a conscious, coordinated change.
… packages/cli-kit/src/public/node/error-handler.test.ts.
Bucket breakdown by slice (last 10 days, ~1,170 events)
app 618 · theme 385 · store 105 · hydrogen 40 · unknown 18 · cli 4.
Handled split: 708 unhandled / 462 handled.
Additional family (… app, app deploy): caller lacks permission/membership. Owner: App Management.
References
Observe grouping hash: 16864652937831232783
packages/cli-kit/src/public/node/error-handler.ts