release: To Prod #1581

Merged: suisuss merged 6 commits into prod from staging on Jun 18, 2026
@suisuss suisuss commented Jun 18, 2026

No description provided.

joelorzet and others added 4 commits June 17, 2026 13:24
The Runs tab now shows a 'v{n}' chip on each run, resolved server-side from
the run's executed_workflow_hash to its workflow_history version (timestamp-
aware so a reverted content hash maps to the version in effect at run time).
Clicking the chip opens the History tab and highlights that version (via
?version=N, which History already rings/expands/scrolls to).
feat(runs): show the workflow version each run executed, linkable to History
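
The timestamp-aware lookup described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: only `executed_workflow_hash`, `workflow_history`, and `?version=N` come from the commit message, while `HistoryEntry`, its field names, `resolve_version`, and `history_link` are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class HistoryEntry:
    """One row of workflow_history (field names are assumptions)."""
    version: int
    content_hash: str
    created_at: datetime  # when this version took effect


def resolve_version(
    history: List[HistoryEntry],
    executed_workflow_hash: str,
    run_started_at: datetime,
) -> Optional[int]:
    """Return the newest version with a matching hash that already existed
    when the run started, or None if no such version is found."""
    candidates = [
        entry
        for entry in history
        if entry.content_hash == executed_workflow_hash
        and entry.created_at <= run_started_at
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda entry: entry.created_at).version


def history_link(version: int) -> str:
    """Query string the chip would link to (History reads ?version=N)."""
    return f"?version={version}"


# v3 reverts the content back to the hash that v1 had.
history = [
    HistoryEntry(1, "aaa", datetime(2026, 6, 1)),
    HistoryEntry(2, "bbb", datetime(2026, 6, 5)),
    HistoryEntry(3, "aaa", datetime(2026, 6, 10)),
]
early_run = resolve_version(history, "aaa", datetime(2026, 6, 3))
late_run = resolve_version(history, "aaa", datetime(2026, 6, 12))
```

Matching on the hash alone would be ambiguous after a revert, since v1 and v3 share a hash; filtering on the run's start time is what separates them in this sketch.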
keeperhub_workflow_errors_by_workflow was an all-time cumulative count
filtered only on status='error'. At prod scale that query matched every
error execution across all orgs and joined them before narrowing to
managed orgs, tripping the 8s metrics statement_timeout (Postgres 57014)
on most scrapes. The catch returned [], so reset() emptied the gauge and
it flapped (present ~9% of scrapes), which (a) starved the managed-client
alert of data and (b) combined with the alert's offset-delta guard to
read the full cumulative count as a 1h delta -> false-positive pages.
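
A rough sketch of the failure mode described above, assuming a collector that swallows the timeout and then resets the gauge. `Gauge`, `collect`, `scrape`, and `guarded_delta` are hypothetical stand-ins; in particular, the real alert expression is not shown in this PR, so `guarded_delta` is only one plausible shape for an offset-delta guard.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Tuple[str, int]  # (workflow id, error count)


class Gauge:
    """Tiny stand-in for a labelled metrics gauge."""

    def __init__(self) -> None:
        self.samples: Dict[str, int] = {}

    def reset(self) -> None:
        self.samples.clear()

    def set(self, workflow_id: str, value: int) -> None:
        self.samples[workflow_id] = value


def collect(query: Callable[[], List[Row]]) -> List[Row]:
    """Run the metrics query; a timeout is swallowed and becomes []."""
    try:
        return query()
    except TimeoutError:
        return []


def scrape(gauge: Gauge, query: Callable[[], List[Row]]) -> None:
    rows = collect(query)
    gauge.reset()  # with rows == [], this leaves the gauge with no series
    for workflow_id, count in rows:
        gauge.set(workflow_id, count)


def guarded_delta(current: int, an_hour_ago: Optional[int]) -> int:
    """Assumed guard: a missing sample from 1h ago is treated as 0."""
    return current - (an_hour_ago if an_hour_ago is not None else 0)
```

Under these assumptions, a scrape that times out removes every series, and a later successful scrape compared against that gap reports the whole cumulative count as if it had accrued in one hour.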

Change the gauge to a rolling-1h count: add
completed_at >= now() - interval '1 hour' to the query, backed by a new
partial index idx_workflow_executions_error_completed_at (status='error')
so the lookup is an index range scan that finishes in ms and never times
out. Metric name and labels are unchanged; only the value semantics move
from all-time cumulative to last-hour count, which matches the alert's
rolling-60-minute definition and lets the alert read the gauge directly
(the companion infra change drops the offset/guard math).
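
To make the change in value semantics concrete, here is a small sketch that uses an in-memory SQLite database as a stand-in for Postgres. The table name `workflow_executions`, the `workflow_id` column, and the indexed column list are assumptions inferred from the index name; the cutoff is passed as a parameter because SQLite has no `now() - interval '1 hour'`.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflow_executions ("
    "id INTEGER PRIMARY KEY, workflow_id TEXT, status TEXT, completed_at TEXT)"
)
# Partial index: only rows with status='error' are indexed.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_workflow_executions_error_completed_at "
    "ON workflow_executions (completed_at) WHERE status = 'error'"
)

now = datetime(2026, 6, 18, 12, 0, tzinfo=timezone.utc)
rows = [
    ("wf-a", "error", now - timedelta(minutes=10)),
    ("wf-a", "error", now - timedelta(days=30)),   # old error, outside the window
    ("wf-a", "success", now - timedelta(minutes=5)),
    ("wf-b", "error", now - timedelta(minutes=59)),
]
conn.executemany(
    "INSERT INTO workflow_executions (workflow_id, status, completed_at) "
    "VALUES (?, ?, ?)",
    [(wf, status, ts.isoformat()) for wf, status, ts in rows],
)


def errors_by_workflow(conn, since=None):
    """Count error executions per workflow, optionally only since a cutoff."""
    sql = (
        "SELECT workflow_id, COUNT(*) FROM workflow_executions "
        "WHERE status = 'error'"
    )
    params = []
    if since is not None:
        sql += " AND completed_at >= ?"
        params.append(since.isoformat())
    sql += " GROUP BY workflow_id ORDER BY workflow_id"
    return dict(conn.execute(sql, params).fetchall())


cumulative = errors_by_workflow(conn)                               # old semantics
last_hour = errors_by_workflow(conn, since=now - timedelta(hours=1))  # new semantics
```

With this data the month-old error for `wf-a` counts toward the cumulative value but not the windowed one, which is the difference the alert's rolling-60-minute definition cares about.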

Migration is @requires-db-prep: an operator applies the index
CONCURRENTLY out-of-band; the in-file CREATE INDEX IF NOT EXISTS is a
no-op on prod and a fast build on dev/PR DBs.
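
The two-path rollout can be sketched like this, again with SQLite standing in for Postgres. The DDL strings use standard Postgres syntax, but the table name and column list are assumptions, and `CONCURRENTLY` is stripped for the stand-in because SQLite does not support it.

```python
import sqlite3

INDEX_NAME = "idx_workflow_executions_error_completed_at"

# What an operator might run out-of-band on prod (Postgres syntax, assumed columns).
OPERATOR_DDL = (
    f"CREATE INDEX CONCURRENTLY IF NOT EXISTS {INDEX_NAME} "
    "ON workflow_executions (completed_at) WHERE status = 'error'"
)
# What the migration file would contain: same index, no CONCURRENTLY.
MIGRATION_DDL = OPERATOR_DDL.replace("CONCURRENTLY ", "")


def new_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workflow_executions ("
        "id INTEGER PRIMARY KEY, status TEXT, completed_at TEXT)"
    )
    return conn


def index_count(conn: sqlite3.Connection) -> int:
    row = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'index' AND name = ?",
        (INDEX_NAME,),
    ).fetchone()
    return row[0]


# "Prod": the operator has already built the index, so the migration is a no-op.
prod = new_db()
prod.execute(OPERATOR_DDL.replace("CONCURRENTLY ", ""))  # SQLite stand-in
prod.execute(MIGRATION_DDL)

# "Dev/PR DB": no prep was done, so the migration builds the index itself.
dev = new_db()
dev.execute(MIGRATION_DDL)
```

Either way each database ends up with exactly one copy of the index; the difference is only who builds it and whether the build holds a lock on a large table.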

Refs TECH-48
The per-workflow managed-client error gauge was never added to
METRICS_REFERENCE.md when it was introduced. Document it with the
windowed (last-1h, not cumulative) semantics and its role in the
Sky/Ajna managed-client alerts.

Refs TECH-48
chong-techops and others added 2 commits June 18, 2026 11:38
Comment cited a non-existent idx_workflow_executions_status_completed_at;
the migration creates idx_workflow_executions_error_completed_at. Align
the comment with the actual index name (PR #272 review, item 4).

Refs TECH-48
…tric

fix(metrics): window per-workflow error gauge to last 1h
@suisuss suisuss added labels on Jun 18, 2026:
db-prepped-prod: Operator applied lock-free DDL to prod DB; safe to merge
metrics-db-reviewed: Reviewer sign-off: metrics aggregate queries optimised + tables indexed (KEEP-680)
@suisuss suisuss merged commit 49ce37f into prod Jun 18, 2026
58 of 60 checks passed