release: to prod by chong-techops · Pull Request #1567 · KeeperHub/keeperhub

chong-techops · 2026-06-17T06:37:05Z

Summary

Promote the following merged PRs from staging to prod:

- #1566 fix(metrics): source per-workflow error metric from DB
- #1565 ci: free runner disk space before image builds
- #1564 fix: disambiguate Sepolia chain label fallback
- #1463 feat(security): security audit trail, API-key alerts, and workflow version history

Risk callouts

- Database migrations: drizzle/0113_keep_671_audit_system.sql (+ drizzle/meta/_journal.json) adds the new audit-system tables from #1463 (feat(security): security audit trail, API-key alerts, and workflow version history). The file-based migration runs on deploy; verify it applies cleanly.
- Deploy values / secrets: deploy/keeperhub/prod/values.yaml, deploy/keeperhub/staging/values.yaml (2 lines each).
- Dependency changes: package.json (+3), pnpm-lock.yaml (+17).

Post-deploy verification

- deploy-keeperhub workflow finishes green
- curl -fsS https://app.keeperhub.com/api/health returns 200
- DB migration 0113_keep_671_audit_system applied (no "relation does not exist" errors)
- Metric (#1566): keeperhub_workflow_errors_by_workflow registered on cluster="techops-prod"; series appear for managed orgs as errors occur (verify via gcx)
- Security/audit features (#1463): audit trail writes, API-key alerts, and workflow version history behave as expected
- Watch Sentry / logs for ~10 minutes after the rollout

deep-diff backs the security audit log's before/after change records. Its npm "latest" tag is broken and the only release (1.0.2, 2018) trips the repo minimum-release-age gate, so it is added to the .npmrc exclude list alongside the existing pinned-legacy packages.

Introduce a durable, queryable record of sensitive account actions and wire API-key create/revoke into it, alongside an out-of-band email alert.
- security_audit_log table storing actor, org, action, resource, a deep-diff of before/after state, and request metadata (ip, country, user agent); composite indexes for org, actor, resource, and action timelines, each trailing created_at for filter-then-newest queries
- recordAuditEvent() helper that computes the diff and writes the row best-effort, so a logging failure never breaks the user action
- sendApiKeyChangeEmail() out-of-band notice on key create/revoke
- POST/DELETE api-keys routes emit both the email and an audit event
- migration 0098 (hand-authored; drizzle-kit generate is blocked by a pre-existing snapshot-chain collision)
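The best-effort contract described for recordAuditEvent() can be sketched roughly as below. This is an illustration under stated assumptions, not the repository's code: the AuditEvent shape, the injected write callback, and the shallow changedKeys() stand-in for the deep-diff step are all hypothetical.

```typescript
// Hypothetical event shape; the real security_audit_log columns are only
// summarized in the commit message.
type AuditEvent = {
  actorId: string | null;
  orgId: string | null;
  action: string;
  resource: string;
  before?: Record<string, unknown>;
  after?: Record<string, unknown>;
};

type AuditRow = AuditEvent & { changedKeys: string[]; createdAt: Date };

// Assumed persistence hook (for example a DB insert), injected so the sketch
// stays self-contained.
type AuditWriter = (row: AuditRow) => Promise<void>;

// Shallow stand-in for the deep-diff step: top-level keys whose values differ.
function changedKeys(
  before: Record<string, unknown> = {},
  after: Record<string, unknown> = {},
): string[] {
  const keys = Array.from(new Set(Object.keys(before).concat(Object.keys(after))));
  return keys
    .filter((key) => JSON.stringify(before[key]) !== JSON.stringify(after[key]))
    .sort();
}

// Best-effort: a failed write is logged and swallowed so the user action
// that triggered it still succeeds.
async function recordAuditEvent(event: AuditEvent, write: AuditWriter): Promise<boolean> {
  try {
    await write({
      ...event,
      changedKeys: changedKeys(event.before, event.after),
      createdAt: new Date(),
    });
    return true;
  } catch (error) {
    console.error("audit write failed", error);
    return false;
  }
}
```

Returning a boolean rather than rethrowing keeps the caller's success path free of audit-specific error handling while still letting tests observe whether the row landed.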

Make the execution audit trail durable and reconstructable.
- triggered_by_credential_type / triggered_by_credential_label on workflow_executions capture which credential triggered a run (webhook_key | org_api_key | oauth | session | internal, plus a non-secret handle). These survive key revocation, unlike the existing triggered_by_*_api_key_id FKs which are nulled when a key is deleted
- executed_workflow_hash stamps the sha256 of the nodes+edges that ran, tying a run to the exact definition that produced it and joining to workflow_history.content_hash to resolve the stored snapshot
- hashWorkflowDefinition() shared content-hash helper
- buildAttribution() extended; execute and webhook routes populate the new fields
- migration 0099 (hand-authored)
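A content hash like the one stamped into executed_workflow_hash might look like the sketch below. The commit only states that it is a sha256 over nodes+edges; the key-sorting canonicalization step and the function signature here are assumptions added so that logically identical definitions hash identically.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted, so property order does not affect the hash.
// (Assumption: the real helper may canonicalize differently or not at all.)
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  if (Array.isArray(value)) {
    return "[" + value.map((item) => canonicalize(item)).join(",") + "]";
  }
  const record: { [key: string]: Json } = value;
  const parts: string[] = [];
  for (const key of Object.keys(record).sort()) {
    parts.push(JSON.stringify(key) + ":" + canonicalize(record[key]));
  }
  return "{" + parts.join(",") + "}";
}

// Hypothetical signature: hex-encoded sha256 over the canonical nodes+edges.
function hashWorkflowDefinition(nodes: Json[], edges: Json[]): string {
  return createHash("sha256").update(canonicalize({ nodes, edges })).digest("hex");
}
```

With a stable hash on both workflow_executions and workflow_history, resolving the definition a run used becomes an equality join rather than a timestamp comparison.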

Bring org-scoped (kh_) API keys to parity with user webhook keys: both the create and revoke paths now send the out-of-band email alert and write a security audit event with org context, so every long-lived credential mint/revoke is recorded the same way regardless of scope.

GET /api/security/audit returns the active org's audit trail. Sensitive forensic data, so it is session-gated and restricted to org owners and admins, and always scoped to the caller's organization. Filterable by action, resource, and actor with a created_at cursor for pagination, all served by the composite indexes on security_audit_log.

Wire the account-takeover-relevant actions into the security audit log so the trail covers more than API keys:
- password change and password reset (reset records the requesting IP)
- email change (captures the before/after address)
- account deactivation
- session revocation
- TOTP enroll/disable and backup-code regeneration

Each writes a best-effort audit event with actor, resource, and request metadata at its success point.

Wire workflow lifecycle and Marketplace listing mutations into the security audit log:
- workflow.created / workflow.updated / workflow.deleted
- workflow.listed / workflow.unlisted / workflow.listing_updated

workflow.updated records scalar fields plus a content hash of the definition rather than the full nodes/edges, keeping the audit row small; the full snapshot and structural diff remain the job of the workflow change-history table.

Audit billing in two layers, matching where the state actually lives:
- Authoritative transitions are recorded in the Stripe webhook handler (handle-billing-event.ts), the source of truth: subscription.plan_changed on a price change and subscription.canceled on deletion. Actor is the provider webhook (system).
- The checkout and cancel routes record the user-initiated intent (subscription.change_requested / subscription.cancel_requested) so the trail keeps which user triggered it, which the webhook does not carry.

- org_wallet.created on Turnkey org-wallet provisioning (actor = the creating user)
- agentic_wallet.hmac_rotated on HMAC secret rotation, recording the key-version bump; actor is the wallet sub-org (HMAC-authenticated)

Add the workflow_history store powering change history, version load, and restore.
- workflow_history table: one row per version with the full snapshot (incl. edges, which are structural), a deep-diff vs the previous version, a content hash, and the same actor capture (who/when) as the audit log. Per-workflow version counter (unique with workflow_id).
- recordWorkflowSnapshot() helper, best-effort like recordAuditEvent, hooked into the workflow create and update chokepoints.
- content-hash + diff now normalize the definition: node identity/type/data and edge connectivity are tracked, cosmetic ReactFlow state (position, selection, size, edge styling) is stripped, so dragging a node does not create a version but a connection or config change does.
- migration 0100 (hand-authored).

Listing-only metadata changes stay audit-log-only (they don't alter the definition).

- GET /api/workflows/[id]/history: admin/owner-gated version timeline with per-version diff and actor name/email enrichment, cursor-paginated.
- GET /api/workflows/[id]?version=N: returns a historical snapshot in the same shape as the live row, so the editor can load a past version.
- Enrich the security audit read endpoint with actor name/email too.
- Shared lib/security/org-role.ts (getOrgRole / isOrgAdmin) gating these reads to organization owners and admins.

Restore is performed client-side by loading a version and saving it back through the normal update path, which reuses all existing validation, schedule sync, history, and audit wiring rather than duplicating it.

Surface workflow versions in context, in the editor:
- A History button in the workflow toolbar, shown only to org admins/owners (useActiveMember), opens a version-history overlay.
- The overlay lists versions (who/when/source) and shows a read-only Monaco side-by-side JSON diff of the selected version against its predecessor.
- Restore writes the chosen snapshot back through the normal save path (creating a new version + audit event) and syncs the live canvas.
- api-client: getById(id, { version }) and getHistory(id); CodeDiffEditor wraps Monaco's DiffEditor with the shared theme.

On-canvas read-only preview of a past version is deferred; the diff view already shows what changed.

On the Organisation tab, each org API key gets a History toggle (admin/owner only) that lists its create/revoke events with actor and timestamp, read from the org-scoped security audit trail via api.security.getAudit. Reuses the existing audit endpoint and actor-name enrichment -- no new backend. User (wfb_) webhook keys are personal and their audit events carry no org, so the org-scoped reader does not surface them; personal-key history is a separate follow-up.

Replace the raw Monaco JSON diff (which surfaced noisy node-position changes and was hard to read) with a human semantic diff and a clearer UX.
- computeVersionDiff(): compares snapshots by node id and edge connectivity, ignoring cosmetic canvas state (position, selection, edge styling). Reports added/removed/changed nodes (with node type and field-level before/after, e.g. renamed "A" to "B", config keys changed) and added/removed connections. Drops the Monaco diff editor entirely.
- Version-history overlay redesigned: timeline with author + relative time + Current badge; change list uses Plus/Minus/Pencil and ArrowRight icons (no glyph arrows) instead of JSON.
- "View on canvas": load a version read-only via a new previewVersionAtom that suppresses autosave (so previewing can't clobber the live workflow); a banner offers Restore / Exit preview. The atom is reset on editor mount/unmount.
- API-key history moved from a per-item expander (a key is only ever created/revoked) to a section-level activity log on the Organisation tab, capturing create + revoke across all keys, including revoked ones.

Adds version-diff unit tests.
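The semantic comparison described for computeVersionDiff() can be approximated as follows. This is a simplified sketch: the Snapshot and VersionDiff shapes are hypothetical, and it reports only changed data keys rather than the field-level before/after values the real helper is said to produce.

```typescript
type SnapshotNode = {
  id: string;
  type: string;
  data: Record<string, unknown>;
  position?: { x: number; y: number }; // cosmetic canvas state, ignored by the diff
};
type SnapshotEdge = { source: string; target: string };
type Snapshot = { nodes: SnapshotNode[]; edges: SnapshotEdge[] };

type VersionDiff = {
  addedNodes: string[];
  removedNodes: string[];
  changedNodes: { id: string; keys: string[] }[];
  addedConnections: string[];
  removedConnections: string[];
};

// Edges are compared by connectivity only, not by id or styling.
const connectionKey = (edge: SnapshotEdge): string => `${edge.source}->${edge.target}`;

function changedDataKeys(before: SnapshotNode, after: SnapshotNode): string[] {
  const keys = Array.from(new Set(Object.keys(before.data).concat(Object.keys(after.data))));
  const changed = keys.filter(
    (key) => JSON.stringify(before.data[key]) !== JSON.stringify(after.data[key]),
  );
  return before.type === after.type ? changed : ["type"].concat(changed);
}

function computeVersionDiff(prev: Snapshot, next: Snapshot): VersionDiff {
  const prevById = new Map<string, SnapshotNode>();
  for (const node of prev.nodes) prevById.set(node.id, node);
  const nextIds = new Set(next.nodes.map((node) => node.id));

  const diff: VersionDiff = {
    addedNodes: [],
    removedNodes: [],
    changedNodes: [],
    addedConnections: [],
    removedConnections: [],
  };

  for (const node of next.nodes) {
    const before = prevById.get(node.id);
    if (before === undefined) {
      diff.addedNodes.push(node.id);
      continue;
    }
    const keys = changedDataKeys(before, node);
    if (keys.length > 0) diff.changedNodes.push({ id: node.id, keys });
  }
  for (const node of prev.nodes) {
    if (!nextIds.has(node.id)) diff.removedNodes.push(node.id);
  }

  const prevEdges = new Set(prev.edges.map(connectionKey));
  const nextEdges = new Set(next.edges.map(connectionKey));
  diff.addedConnections = Array.from(nextEdges).filter((key) => !prevEdges.has(key));
  diff.removedConnections = Array.from(prevEdges).filter((key) => !nextEdges.has(key));
  return diff;
}
```

Because position never enters the comparison, a drag-only edit yields an empty diff, which is the property that lets a caller skip recording changeless versions.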

Workflows that persisted the read/write-contract function arguments under `args` (the canonical field is `functionArgs`) failed action-config validation with INVALID_ACTION_CONFIG, so autosave rejected every save -- even a fully configured node or a layout-only change. Add `args -> functionArgs` to LEGACY_FIELD_ALIASES, matching the existing `functionName -> abiFunction` alias. Validation-only and non-breaking; the runtime already reads functionArgs.

read/write-contract nodes persist abiFunctionKey (the resolved function signature used for overloaded-function disambiguation), but it is not a declared config field, so strict action-config validation rejected the save with INVALID_ACTION_CONFIG -- autosave failed on every read-contract node. The runtime recomputes the key and never reads it from config, so add it to LEGACY_IGNORED_FIELDS for read/write-contract. Validation-only, non-breaking.
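How an alias map and an ignore list can feed strict validation is sketched below. The alias pairs and the ignored field come from the two commit messages above; the container shapes, the normalizeLegacyConfig() and undeclaredFields() helpers, and the declared field list used in the usage note are hypothetical.

```typescript
// Alias pairs named in the commit messages; the real map's shape is an assumption.
const LEGACY_FIELD_ALIASES = new Map<string, string>([
  ["functionName", "abiFunction"],
  ["args", "functionArgs"],
]);

// Persisted by the editor but recomputed at runtime, so validation skips it.
const LEGACY_IGNORED_FIELDS = new Set<string>(["abiFunctionKey"]);

// Rewrite legacy keys to their canonical names and drop ignored keys.
function normalizeLegacyConfig(config: Record<string, unknown>): Record<string, unknown> {
  const normalized: Record<string, unknown> = {};
  for (const key of Object.keys(config)) {
    if (LEGACY_IGNORED_FIELDS.has(key)) continue;
    const canonical = LEGACY_FIELD_ALIASES.get(key) ?? key;
    // If both the legacy and the canonical key are present, the canonical value wins.
    if (canonical !== key && canonical in config) continue;
    normalized[canonical] = config[key];
  }
  return normalized;
}

// Strict validation step: any key left over that is not declared is an error.
function undeclaredFields(config: Record<string, unknown>, declared: string[]): string[] {
  return Object.keys(normalizeLegacyConfig(config)).filter((key) => !declared.includes(key));
}
```

For example, a config persisted as `{ functionName, args, abiFunctionKey }` validates cleanly against a declared list of `abiFunction` and `functionArgs`, while an unrelated stray key is still reported.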

# Conflicts:
#	app/api/user/totp/enroll/route.ts
#	app/api/workflows/[workflowId]/webhook/route.ts
#	drizzle/meta/_journal.json
#	lib/email.ts

Redesign the history/activity surfaces from a cramped modal into a clean, shared visual language modeled on Google Docs version history.
- Shared building blocks: groupByDate (Today/Yesterday/This week/month), describeAuditAction (action -> phrase + add/remove/change kind), ActorAvatar (initials fallback), and an ActivityFeed (date-grouped, avatars, kind icons, relative time, load-more) over the security audit endpoint.
- Workflow version history is now a right-docked panel anchored in the editor (replaces the modal): date-grouped timeline with author + relative time + Current badge; selecting a version previews it live on the canvas; the selected entry expands to a readable semantic change list; Restore at the bottom. Closing exits the preview.
- API-keys overlay activity and a new admin/owner Settings > Activity tab (org-wide feed) reuse ActivityFeed.

Tokens only (token-audit clean on changed files); type-check + lint clean.

Make the version-history panel behave like the right-docked node-config panel so switching between them is seamless:
- Share width via a new rightPanelWidthPctAtom (config panel + version panel read/write the same value), so they are the same size and resizing either keeps them in sync.
- Match the surface: bg-background + border-l, and slide open/close with transition-transform (no instant pop); the panel stays mounted so it animates.
- Add a left-edge drag handle to resize, mirroring the config panel, plus its collapse chevron affordance in place of the header X.
- On the current version, clicking a node closes the panel and reveals the config panel underneath so the node can be edited; previewing a historical version keeps the panel open.

Add a reusable cursor-pagination module (lib/pagination.ts) and adopt it on the audit and workflow-history endpoints so large histories page properly.
- CursorPage<T> = { items, _links } with self/next/prev hrefs; bidirectional cursors (?cursor= older, ?before= newer) over a stable monotonic key, no COUNT, stable under concurrent inserts.
- parseCursorRequest + buildCursorPage centralize limit clamping, page slicing, boundary-cursor extraction, and link building; each route only supplies the column ordering and the cursor predicate.
- Client follows links via api.followPage(href) instead of reconstructing cursors; the activity feed and version-history panel now load-more through _links.next.
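The slicing and link-building half of this design might look like the sketch below. It uses the common "fetch limit + 1 rows" trick to detect an older page without a COUNT; whether lib/pagination.ts does the same is an assumption, as are the clampLimit() helper, the limit ceiling, and the exact query-string layout.

```typescript
type CursorLinks = { self: string; next?: string; prev?: string };
type CursorPage<T> = { items: T[]; _links: CursorLinks };

// Clamp a client-supplied limit; the default and ceiling are illustrative values.
function clampLimit(raw: string | null, fallback = 50, max = 100): number {
  const parsed = Number.parseInt(raw ?? "", 10);
  if (Number.isNaN(parsed) || parsed < 1) return fallback;
  return Math.min(parsed, max);
}

// `rows` is expected to hold up to limit + 1 rows, newest first. The extra row
// only signals that an older page exists and is never returned to the client.
function buildCursorPage<T>(
  rows: T[],
  limit: number,
  basePath: string,
  cursorOf: (row: T) => string,
  cursor?: string,
): CursorPage<T> {
  const items = rows.slice(0, limit);
  const hasOlder = rows.length > limit;
  const href = (name: string, value: string): string =>
    `${basePath}?limit=${limit}&${name}=${encodeURIComponent(value)}`;

  const links: CursorLinks = {
    self: cursor === undefined ? `${basePath}?limit=${limit}` : href("cursor", cursor),
  };
  if (hasOlder && items.length > 0) {
    // Older page: rows strictly after the last item on this page.
    links.next = href("cursor", cursorOf(items[items.length - 1]));
  }
  if (cursor !== undefined && items.length > 0) {
    // Newer page: rows strictly before the first item on this page.
    links.prev = href("before", cursorOf(items[0]));
  }
  return { items, _links: links };
}
```

Because the client only follows the returned hrefs, the cursor encoding can change on the server without touching any consumer.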

Switch the shared pagination to offset-based so list views get a real numbered pager (< 1 2 3 ... >) with total count and total pages.
- lib/pagination.ts: Page<T> = { items, meta (total/page/pageSize/totalPages), _links (self/first/prev/next/last) }. parsePageRequest + buildPage centralize offset/limit parsing, clamping, and link building; routes run a COUNT plus an OFFSET/LIMIT slice. Audit and workflow-history endpoints adopt it.
- Reusable <Pager> (first/last + current +/-1 with ellipses) showing the total.
- usePaginatedResource hook owns page/items/meta/loading/error for any Page<T> endpoint, so components don't reimplement pagination state; the activity feed and version-history panel both consume it.
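An offset-based counterpart could be sketched like this. The Page<T> field names follow the commit message; the query parameter names (page, pageSize), the maximum page size, and the pagerWindow() helper that models the "first/last + current +/-1 with ellipses" rule are assumptions.

```typescript
type PageMeta = { total: number; page: number; pageSize: number; totalPages: number };
type PageLinks = { self: string; first: string; last: string; prev?: string; next?: string };
type Page<T> = { items: T[]; meta: PageMeta; _links: PageLinks };

// Parse and clamp paging parameters; invalid or missing values fall back to defaults.
function parsePageRequest(params: URLSearchParams, defaultSize = 50, maxSize = 100) {
  const page = Math.max(1, Number.parseInt(params.get("page") ?? "", 10) || 1);
  const requested = Number.parseInt(params.get("pageSize") ?? "", 10) || defaultSize;
  const pageSize = Math.min(maxSize, Math.max(1, requested));
  return { page, pageSize, offset: (page - 1) * pageSize };
}

// `items` is the OFFSET/LIMIT slice and `total` the COUNT result for the same filter.
function buildPage<T>(
  items: T[],
  total: number,
  page: number,
  pageSize: number,
  basePath: string,
): Page<T> {
  const totalPages = Math.max(1, Math.ceil(total / pageSize));
  const href = (n: number): string => `${basePath}?page=${n}&pageSize=${pageSize}`;
  const links: PageLinks = { self: href(page), first: href(1), last: href(totalPages) };
  if (page > 1) links.prev = href(page - 1);
  if (page < totalPages) links.next = href(page + 1);
  return { items, meta: { total, page, pageSize, totalPages }, _links: links };
}

// Page numbers to render: first, last, and current +/- 1. A 0 marks an ellipsis gap.
function pagerWindow(page: number, totalPages: number): number[] {
  const wanted = [1, page - 1, page, page + 1, totalPages].filter(
    (n) => n >= 1 && n <= totalPages,
  );
  const unique = Array.from(new Set(wanted)).sort((a, b) => a - b);
  const out: number[] = [];
  for (const n of unique) {
    if (out.length > 0 && n - out[out.length - 1] > 1) out.push(0);
    out.push(n);
  }
  return out;
}
```

The trade-off against the cursor variant is the extra COUNT query and possible drift under concurrent inserts, in exchange for jump-to-page navigation and a visible total.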

The panel stays mounted (parked off-screen) to animate, but its left-edge resize handle and collapse button are translated half outside the panel, so when closed they protruded at the viewport's right edge. Render the handle only while the panel is open.

The panel hardcoded a 60px top offset while the config panel uses a responsive 6rem/lg:60px offset. Below the lg breakpoint the toolbar is taller, so the panel started above it -- overlapping the toolbar and breaking the top border alignment. Use the same responsive offset so the panels line up.

- Store the semantic diff (computeVersionDiff) on each history row at write time instead of a raw deep-diff, so the timeline shows what every version changed inline -- no per-row snapshot fetch. Selecting a version now only loads its snapshot for the live canvas preview. Rows recorded before this format are detected and skipped via a shape guard.
- Increase the autosave debounce from 1s to 2.5s so rapid edits don't each save and spawn a version.
- Add the panel's missing top border and soften the "changed" icon color (amber-500 -> amber-400) in the panel and activity feed.

- The panel requested the default page size (50), so 19 versions fit on one page and the numbered pager never appeared. Request 10 per page so it paginates.
- Autosave recorded a version on every save, including position-only/no-op edits whose semantic diff is empty -- filling the timeline with changeless rows. Skip recording when an update produces no meaningful change (the first version still always lands).

- Each version row owns its collapse state (atomic); clicking expands its change list without touching the canvas. Paging remounts rows so they collapse. "View on canvas" is an explicit per-row action.
- Remove the panel's bottom Restore (it duplicated the preview banner's); restore lives only in the banner now.
- New useVersionPreview hook is the single source of truth for preview / exit / restore and syncs previewVersionAtom with a `?version=` query param, so a previewed version is shareable and reopens read-only by URL (the editor applies `?version=` on load and opens the history panel).
- Restyle the preview banner from the loud amber pill to the app's segmented navbar pill (bg-secondary, rounded-md, divider segments).

The version diff only captured which config keys changed, so the timeline could only say "configuration changed (key)". Capture each changed field's before/after value (truncated) and render one row per field as `<node> <key>: <before> -> <after>`. Versions recorded before this detail existed fall back to the key-name summary.

Config changes were shown with raw keys and raw `{{@nodeid:...}}` template refs, which don't match the editor UI. Now:
- map each config key to its editor field label via the action registry (findActionById + flattenConfigFields), falling back to the key,
- strip the node-id from template refs (`{{@id:Manual.triggeredAt}}` -> `{{Manual.triggeredAt}}`),
- render each change stacked -- "<node> · <Field label>" then the before/after values as muted mono chips with an arrow -- instead of a cramped inline line.
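The template-ref cleanup can be illustrated with a single regular expression. The cleanTemplateRefs() name and the exact pattern are assumptions; the sketch only reproduces the before/after example given above.

```typescript
// Rewrites {{@<node-id>:<path>}} to {{<path>}}; refs without a node id are untouched.
function cleanTemplateRefs(value: string): string {
  return value.replace(/\{\{@[^:}]+:([^}]+)\}\}/g, "{{$1}}");
}
```

For instance, `cleanTemplateRefs("{{@id:Manual.triggeredAt}}")` yields `{{Manual.triggeredAt}}`, and a string containing several refs has each one rewritten in place.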

The flat gray value chips read as disabled controls. Render config diffs the way a diff should look: the old value in red, the new value in green (each a subtle tinted, ringed chip), and an absent value as a faint "empty" placeholder instead of a chip. Tighten the row layout so the node/field label sits above the before -> after values.

- Resolve real node names via getActionLabel (plugins + legacy aliases; system actions and triggers are self-labeled); unconfigured nodes read as "Action". Connection endpoints and config rows use these names.
- Distinct change types/icons: connect (link) / disconnect (unlink) / enable / disable, separate from node add/remove; config keys map to their field labels and the action id shows as "Action".
- Config diffs use semantic before/after chips (red old, green new); clean template refs ({{@id:Manual.x}} -> {{Manual.x}}).
- Collapsible rows are atomic (per-row state) with a grid-rows open animation; softer expanded card, name . time on one line, proper spacing/padding.
- usePaginatedResource gains optional silent background polling; the panel refreshes every 30s while open.

ultracite 6.5.1 (bumped in 3805adb without a matching @biomejs/biome bump) rejects the `noIncrementDecrement` nursery key during config validation, so `pnpm check` errors out before linting on local installs. The key was set to "off" (a no-op for an experimental nursery rule), so removing it changes no lint behavior while letting the wrapper run again.

The backdoor PATCH /api/workflows/[workflowId] re-ran every publish-time gate (write-action, bare-@, input-schema) except the slug check, so isListed=true could be persisted with a null listedSlug. Such a row is discoverable in the marketplace catalog yet uncallable: external agents invoke a listing by slug at /api/mcp/workflows/<slug>/call, and there is no slug to address. For a paid workflow it is advertised but can never be called or settled.
- Add the slug gate to the PATCH willBeListed block. Listing, or staying listed, with a null/blank final slug now returns 422 SLUG_REQUIRED. Honors the existing field-touched + isListed=false unlist-bypass + sticky-slug conventions.
- Make the curator route (lib/mcp/listing.ts) return the same SLUG_REQUIRED code (was a generic INVALID_INPUT/400) so both listing surfaces reject identically.
- Migration 0113 unlists existing is_listed=true AND listed_slug IS NULL rows (data-only; no schema change; slug stays nullable, global unique index intact).
- Tests: list-without-slug -> 422, list-with-slug -> 200, sticky-slug edit while listed -> 200, unlist orphaned row -> 200, curator-route parity, plus an e2e full-HTTP-path roundtrip. Four existing transition tests updated to supply a slug (they had encoded the now-rejected slugless-listing state).

KEEP-494

Review found the new gate (which trims) was stricter than the curator path and the cleanup migration, which only handled null/falsy slugs. A listed row with an empty or whitespace-only slug could survive and then be falsely rejected on any later edit (and the slug-immutability gate blocks fixing it in place).
- lib/mcp/listing.ts: trim the curator slug before validating and persisting, so a blank/whitespace slug is refused (not stored) and matches the PATCH gate.
- PATCH route: persist the trimmed slug so the stored value equals what was validated (no leading/trailing whitespace).
- Migration 0113: also unlist is_listed=true rows whose slug is empty or whitespace (btrim(listed_slug) = ''), matching the gate's accept-shape.
- Tests: curator whitespace slug -> SLUG_REQUIRED, curator padded slug trimmed, PATCH padded slug trimmed before persist.
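The combined behaviour of the two slug commits (reject a null or blank final slug when listed, persist the trimmed value) can be sketched as a pure function. The checkSlugGate() name, its input shape, and the choice to store a blank unlisted slug as null are assumptions; the 422 status and SLUG_REQUIRED code come from the commit messages.

```typescript
type SlugGateInput = {
  willBeListed: boolean;
  currentSlug: string | null;
  // undefined means the request did not touch the slug (sticky-slug convention).
  incomingSlug?: string | null;
};

type SlugGateResult =
  | { ok: true; slug: string | null }
  | { ok: false; status: 422; code: "SLUG_REQUIRED" };

function checkSlugGate(input: SlugGateInput): SlugGateResult {
  const candidate = input.incomingSlug === undefined ? input.currentSlug : input.incomingSlug;
  // Trim before validating so the stored value equals the validated value.
  const trimmed = (candidate ?? "").trim();
  if (input.willBeListed && trimmed === "") {
    return { ok: false, status: 422, code: "SLUG_REQUIRED" };
  }
  return { ok: true, slug: trimmed === "" ? null : trimmed };
}
```

Sharing one function like this between the PATCH route and the curator route is one way to keep both listing surfaces rejecting identically.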

Make the organization activity feed informative and actionable:
- Resolve and display resource names server-side for workflows, integrations (with type), and personal + org API keys.
- Workflow events open the editor History tab deep-linked to the exact version they produced. The History tab is URL-driven (historyPage, version), highlights and scrolls to the target version, and each version has a copy-link button; 'current' tracks the latest version.
- Integration and API key events open their management modal with the matching row highlighted and scrolled into view.
- The version preview banner shows the current version in green with no Restore action; historical versions keep Restore.

…-alerts feat(security): security audit trail, API-key alerts, and workflow version history

The fallback chain-name map labeled 11155111 as just 'Sepolia', which is ambiguous alongside Base/Arbitrum/Optimism Sepolia. Use the canonical 'Ethereum Sepolia' (matching the chains table) so the version-history diff and other fallbacks name the network unambiguously.

…abel fix: disambiguate Sepolia chain label fallback

The app/migrator/workflow-runner bake in build-images.yml and the executor bake in deploy-executor.yaml run on ubuntu-latest, whose ~14GB disk fills while BuildKit unpacks node_modules, failing with 'no space left on device'. Reclaim ~20GB+ by removing preinstalled Android/.NET/Haskell toolchains and large apt packages before the build. Keep the hosted tool cache (tool-cache: false) so CodeQL and other cached tools are untouched. Gated by each job's existing skip conditions so reruns that skip the build also skip the cleanup.

ci: free runner disk space before image builds

keeperhub_workflow_execution_errors_by_workflow_total was a per-pod counter incremented only on the kickoff/reaper/MCP finalization paths, never on the main logWorkflowCompleteDb path where normal workflow errors finalize. It therefore returned no data in prod and the managed-client alert numerator that reads it stayed empty during real error bursts. Per-pod counters emitted from the workflow runner are also dropped before Prometheus scrapes the short-lived process. Replace it with keeperhub_workflow_errors_by_workflow, a DB-sourced gauge computed on each /api/metrics/db scrape from workflow_executions (status='error'), grouped by workflow_id/org_slug/error_type and scoped to managed orgs to bound workflow_id cardinality. Being DB-sourced it counts every finalized error row regardless of which code path wrote it. The old counter is left in place; a follow-up removes it once the infra alert is switched to the new gauge. Refs TECH-48
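The shape of a DB-sourced gauge can be sketched as an aggregate followed by a render step. Here an in-memory groupErrors() stands in for the SQL aggregate over workflow_executions, and renderGauge() emits Prometheus text exposition lines. The metric name is from the commit message; the helper names, row shape, and label names (taken from the grouping columns) are assumptions.

```typescript
type ErrorExecution = { workflowId: string; orgSlug: string; errorType: string };
type ErrorSeries = ErrorExecution & { count: number };

const METRIC = "keeperhub_workflow_errors_by_workflow";

// In-memory stand-in for grouping status='error' rows by
// workflow_id / org_slug / error_type on each scrape.
function groupErrors(rows: ErrorExecution[]): ErrorSeries[] {
  const byKey = new Map<string, ErrorSeries>();
  for (const row of rows) {
    const key = JSON.stringify([row.workflowId, row.orgSlug, row.errorType]);
    const existing = byKey.get(key);
    if (existing) {
      existing.count += 1;
    } else {
      byKey.set(key, { ...row, count: 1 });
    }
  }
  return Array.from(byKey.values());
}

// Escape backslash, double quote, and newline in label values.
const escapeLabel = (value: string): string =>
  value.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");

function renderGauge(series: ErrorSeries[]): string {
  const lines = [`# TYPE ${METRIC} gauge`];
  for (const s of series) {
    const labels =
      `workflow_id="${escapeLabel(s.workflowId)}",` +
      `org_slug="${escapeLabel(s.orgSlug)}",` +
      `error_type="${escapeLabel(s.errorType)}"`;
    lines.push(`${METRIC}{${labels}} ${s.count}`);
  }
  return lines.join("\n") + "\n";
}
```

Since the value is recomputed from stored rows on every scrape, it does not depend on which process finalized the error or whether that process was still alive when Prometheus scraped.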

…rs-by-workflow-db-metric fix(metrics): source per-workflow error metric from DB

…ketplace-polish-nullable-slugs-server-side-tag-search # Conflicts: # drizzle/meta/_journal.json

…lish-nullable-slugs-server-side-tag-search fix(marketplace): require a slug before listing a workflow

joelorzet added 30 commits June 4, 2026 07:24

feat(security): audit wallet creation and HMAC rotation

998ee00

Merge branch 'staging' into feat/keep-671-api-token-email-alerts

2b743a0

eskp and others added 12 commits June 16, 2026 22:49

style(test): apply biome formatting to the slug-gate e2e test

2a8f1a9

Merge pull request #1463 from KeeperHub/feat/keep-671-api-token-email…

42e83b9

…-alerts feat(security): security audit trail, API-key alerts, and workflow version history

Merge pull request #1564 from KeeperHub/fix/ambiguous-sepolia-chain-l…

f6d4ccd

…abel fix: disambiguate Sepolia chain label fallback

Merge pull request #1565 from KeeperHub/ci/free-runner-disk-image-build

1a82fa1

ci: free runner disk space before image builds

Merge pull request #1566 from KeeperHub/feature/tech-48-workflow-erro…

dab00eb

…rs-by-workflow-db-metric fix(metrics): source per-workflow error metric from DB

chong-techops temporarily deployed to staging June 17, 2026 06:37 — with GitHub Actions Inactive

Merge remote-tracking branch 'origin/staging' into simon/keep-494-mar…

3113278

…ketplace-polish-nullable-slugs-server-side-tag-search # Conflicts: # drizzle/meta/_journal.json

chong-techops requested review from a team, OleksandrUA, eskp, joelorzet and suisuss and removed request for a team June 17, 2026 06:38

Merge pull request #1562 from KeeperHub/simon/keep-494-marketplace-po…

4f3127c

…lish-nullable-slugs-server-side-tag-search fix(marketplace): require a slug before listing a workflow

eskp temporarily deployed to staging June 17, 2026 08:13 — with GitHub Actions Inactive

eskp added the metrics-db-reviewed Reviewer sign-off: metrics aggregate queries optimised + tables indexed (KEEP-680) label Jun 17, 2026

eskp temporarily deployed to staging June 17, 2026 08:14 — with GitHub Actions Inactive

eskp temporarily deployed to staging June 17, 2026 08:29 — with GitHub Actions Inactive

eskp merged commit 4173ad1 into prod Jun 17, 2026
45 of 46 checks passed

eskp merged 97 commits into prod from staging
