split observe analytics group for cleaner resiliency tracking #7891

Description

@stephanie-shopify

Summary

A single error-analytics group (Observe grouping hash 16864652937831232783) is acting as a catch-all for essentially every GraphQL / Admin-API client error in the CLI, across four unrelated product areas. Over the last 10 days it held ~1,170 events spanning app, theme, store, and hydrogen. This makes the group un-routable, un-actionable, and a permanent source of false P1 escalations. We should make the CLI emit a meaningful groupingHash so distinct failures land in distinct buckets.

Context: this surfaced from resiliency issue Vault 31608 / shop/issues#32995 ("Access denied for themes field… read_themes"). That specific error was fixed in CLI 4.2.0 (#7652) and is confirmed gone on 4.2.0 — but the resiliency item stays hot because ~94% of the bucket is unrelated errors sharing the same hash.

What's actually in the bucket (last 10 days, ~1,170 events)

By slice: app 618 · theme 385 · store 105 · hydrogen 40 · unknown 18 · cli 4.
Handled split: 708 unhandled / 462 handled.

Distinct families lumped together:

  • App Management 403 "Unauthorized" (~316; slice app, app deploy) — caller lacks permission/membership. Owner: App Management.
  • 401 authentication failures (~390; slice store/theme, e.g. store info, store execute) — invalid/expired session token; 163 are literally "Service is not valid for authentication". Owner: CLI auth/identity.
  • Missing access scope ACCESS_DENIED (166 = read_themes 75 + write_themes 91; slice theme) — custom-app token missing scope. Already fixed in 4.2.0 (Improve missing theme scope error message #7652); now a clean AbortError, aging out as adoption rolls.
  • 5xx server errors (~150; Admin + App-Management HTTP 500) — server-side API reliability, not a CLI bug. Owner: the respective API teams.
  • THROTTLED rate limiting (46; slice theme) — expected/transient; should be retried or suppressed, not crash-reported.
  • Hydrogen 403/404 (~40; hydrogen deploy/link) — separate surface again.
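As a sanity check on the numbers, the sketch below recomputes the totals and shares from the counts listed above. The "~" counts are approximate, so the derived percentages are too. Reading the original Vault item as covering only the 75 read_themes events is an assumption; it happens to reproduce the ~94% "unrelated" figure quoted in the Context.

```typescript
// Counts copied from the breakdown above; "~" values are approximate.
const total = 1170

const slices: Record<string, number> = {
  app: 618, theme: 385, store: 105, hydrogen: 40, unknown: 18, cli: 4,
}

const families: Record<string, number> = {
  'App Management 403': 316,
  '401 authentication': 390,
  'Missing access scope': 166,
  '5xx server errors': 150,
  'THROTTLED': 46,
  'Hydrogen 403/404': 40,
}

const sum = (counts: Record<string, number>): number =>
  Object.values(counts).reduce((acc, n) => acc + n, 0)

// Percentage of the whole bucket, rounded to one decimal place.
const share = (count: number): number => Math.round((count / total) * 1000) / 10

const sliceTotal = sum(slices) // 1170, matches the bucket total
const listed = sum(families) // 1108, so about 62 events fall outside the listed families

for (const [name, count] of Object.entries(families)) {
  console.log(`${name}: ${share(count)}%`)
}

// Assumption: only the 75 read_themes events match the original Vault item.
const unrelated = share(total - 75) // 93.6
console.log({sliceTotal, listed, unlisted: total - listed, unrelated})
```

No single family exceeds about a third of the bucket, which is why no one owner can act on it as a whole.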

Root cause

In packages/cli-kit/src/public/node/error-handler.ts (sendErrorToBugsnag):

  1. Every error is rebuilt as reportableError = new Error(error.message), so the error class is always the generic Error.
  2. No event.groupingHash is ever set, so the backend falls back to stack-trace grouping (cf. the stack_frame_grouping_hash column).
  3. Stack frame paths are aggressively normalized (cleanStackFrameFilePath), and in practice only two message shapes are thrown from the same request site: graphql-request's GraphQL Error (Code: NNN): {…} for theme/store, and The Admin/App Management GraphQL API responded unsuccessfully with the HTTP status NNN … for app/hydrogen.

Net: same class + same normalized stack + ignored message ⇒ one bucket for all of them.
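The collapse can be reproduced in isolation. The sketch below is not the CLI's code: makeError, rebuild, and stackKey are hypothetical helpers, the messages are illustrative, and stackKey is only a rough approximation of stack-trace grouping (class name plus frames, message ignored), not the backend's actual algorithm.

```typescript
// Hypothetical helper: builds an error with a custom class name, standing in
// for the distinct error types a client might throw.
function makeError(name: string, message: string): Error {
  const error = new Error(message)
  error.name = name
  return error
}

// Mirrors step 1 above: only the message survives, so the class is always "Error".
function rebuild(error: Error): Error {
  return new Error(error.message)
}

// Rough approximation of stack-trace grouping (an assumption): class name plus
// stack frames, with the message line ignored.
function stackKey(error: Error): string {
  const frames = (error.stack ?? '')
    .split('\n')
    .filter((line) => line.trim().startsWith('at '))
  return [error.name, ...frames].join('|')
}

const originals = [
  makeError('ThrottledError', 'Throttled'),
  makeError('AuthenticationError', 'Service is not valid for authentication'),
]

// Both errors pass through the same call site, as in a shared error handler.
const reported = originals.map((error) => rebuild(error))
const keys = new Set(reported.map((error) => stackKey(error)))

console.log(originals.map((error) => error.name)) // two distinct classes in
console.log(keys.size) // 1: a single grouping key out
```

Two unrelated failures go in and one key comes out, because the only thing that differs between them (the message) is not part of the key.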

Why we should track these separately

  • Different owners. This group is assigned to one team, but it contains errors owned by App Management, Hydrogen, Storefront, CLI auth, and the server-side API teams. It cannot be routed as a single item.
  • Different root causes and fixes. Scope→UX messaging, 401→token refresh, 403→app permissions, THROTTLED→backoff/suppress, 5xx→server reliability. One issue, with one assignee and one fix-due date, can't carry five separate fixes.
  • Expected vs. real regressions get mixed. Most of the volume is expected user-config / transient throttles; a genuine regression in any single family (e.g. a 401 or 5xx spike from a real bug) is invisible inside the aggregate.
  • It manufactures false P1s. Severity escalates on aggregate volume, so the group oscillates P1↔P3 independent of whether any underlying problem is bad — and a shipped fix (like Improve missing theme scope error message #7652) can never turn it green.

Proposed fix

Set a meaningful grouping key in the Bugsnag eventHandler, reusing the existing analytics taxonomy (categorizeError + formatErrorMessage in packages/cli-kit/src/private/node/analytics/error-categorizer.ts, already used by storage.ts to emit error:${category}:${signature} events):

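import {categorizeError, formatErrorMessage} from '../../private/node/analytics/error-categorizer.js'

// Inside the Bugsnag eventHandler; sliceName is assumed to be resolved earlier in sendErrorToBugsnag.
const category = categorizeError(error)
// <slice>:<category>:<signature>, mirroring the error:${category}:${signature} analytics events.
event.groupingHash = `${sliceName}:${category.toLowerCase()}:${formatErrorMessage(error, category)}`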

Notes:

  • Include slice_name so app / theme / store / hydrogen split immediately.
  • Trim the message at the first ": {" before categorizing — the raw GraphQL ClientError dumps the full request/response JSON, and the literal "request" substring currently mis-routes errors into the network category. Trimming makes categories semantically correct (scope→Permission, 401→Authentication, THROTTLED→RateLimit) and keeps signatures stable.
  • Complementary, higher-leverage: stop reporting known transient/user errors (THROTTLED, and arguably the 401 "Service is not valid for authentication") as unexpected — same pattern as Improve missing theme scope error message #7652 — so they leave crash reporting entirely rather than just getting relabeled.
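A minimal sketch of the trimming and key-building notes above, under stated assumptions: trimPayload, categorize, signature, and groupingKey are hypothetical stand-ins (not the real categorizeError / formatErrorMessage), the keyword rules are illustrative, and the sample messages are made up in the "<summary>: {json}" shape.

```typescript
// All names in this sketch are hypothetical; the keyword rules are illustrative only.
type Category = 'Network' | 'RateLimit' | 'Permission' | 'Authentication' | 'Unknown'

// Drop everything from the first ": {" onward (the request/response JSON dump).
function trimPayload(message: string): string {
  const cut = message.indexOf(': {')
  return cut === -1 ? message : message.slice(0, cut)
}

// Naive keyword categorizer. The "request" rule is checked first to mimic the
// mis-routing described above.
function categorize(message: string): Category {
  const text = message.toLowerCase()
  if (text.includes('request')) return 'Network'
  if (text.includes('throttled')) return 'RateLimit'
  if (text.includes('access denied')) return 'Permission'
  if (text.includes('authentication')) return 'Authentication'
  return 'Unknown'
}

// Stable, lowercase, dash-separated signature built from the trimmed message.
function signature(message: string): string {
  return trimPayload(message)
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')
    .replace(/^-|-$/g, '')
}

function groupingKey(sliceName: string, message: string): string {
  const category = categorize(trimPayload(message))
  return `${sliceName}:${category.toLowerCase()}:${signature(message)}`
}

// Made-up messages in the "<summary>: {json}" shape.
const throttled = 'Throttled: {"response":{},"request":{}}'
const unauthenticated = 'Service is not valid for authentication: {"response":{},"request":{}}'

console.log(categorize(throttled)) // Network (untrimmed: the JSON contains "request")
console.log(groupingKey('theme', throttled)) // theme:ratelimit:throttled
console.log(groupingKey('store', unauthenticated)) // store:authentication:service-is-not-valid-for-authentication
```

Trimming before categorizing is what keeps the JSON payload out of both the category decision and the signature, so the same failure yields the same key regardless of the request body.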

Considerations

  • Setting groupingHash universally re-buckets all CLI errors (a one-time grouping migration across every error dashboard), not just these. That's an improvement (grouping finally matches the analytics taxonomy) but should be a conscious, coordinated change.
  • Update packages/cli-kit/src/public/node/error-handler.test.ts.
