
release: to prod #1567

Merged
eskp merged 97 commits into prod from staging
Jun 17, 2026

Conversation

@chong-techops

Summary

Promote the following merged PRs from staging to prod:

Risk callouts

  • Database migrations: drizzle/0113_keep_671_audit_system.sql (+ drizzle/meta/_journal.json), which adds the audit-system tables from #1463 (feat(security): security audit trail, API-key alerts, and workflow version history). This file-based migration runs on deploy; verify that it applies cleanly.
  • Deploy values / secrets: deploy/keeperhub/prod/values.yaml, deploy/keeperhub/staging/values.yaml (2 lines each).
  • Dependency changes: package.json (+3), pnpm-lock.yaml (+17).

Post-deploy verification

joelorzet added 30 commits June 4, 2026 07:24
deep-diff backs the security audit log's before/after change records. Its
npm "latest" tag is broken and the only release (1.0.2, 2018) trips the
repo minimum-release-age gate, so it is added to the .npmrc exclude list
alongside the existing pinned-legacy packages.
Introduce a durable, queryable record of sensitive account actions and
wire API-key create/revoke into it, alongside an out-of-band email alert.

- security_audit_log table storing actor, org, action, resource, a
  deep-diff of before/after state, and request metadata (ip, country,
  user agent); composite indexes for org, actor, resource, and action
  timelines, each trailing created_at for filter-then-newest queries
- recordAuditEvent() helper that computes the diff and writes the row
  best-effort, so a logging failure never breaks the user action
- sendApiKeyChangeEmail() out-of-band notice on key create/revoke
- POST/DELETE api-keys routes emit both the email and an audit event
- migration 0098 (hand-authored; drizzle-kit generate is blocked by a
  pre-existing snapshot-chain collision)
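
The best-effort contract of recordAuditEvent() can be sketched as follows. This is a minimal synchronous illustration, not the repository's implementation: the AuditEvent shape, the shallowDiff stand-in for deep-diff, and the injected writeRow callback are all assumptions made for the example (the real helper presumably writes to the database asynchronously).

```typescript
// Hypothetical shapes: the real security_audit_log row has more columns.
type AuditEvent = {
  actorId: string;
  orgId: string | null;
  action: string;
  resourceId: string;
  before?: Record<string, unknown>;
  after?: Record<string, unknown>;
};

type FieldChange = { key: string; before: unknown; after: unknown };
type AuditRow = AuditEvent & { diff: FieldChange[] };

// Shallow stand-in for deep-diff: reports top-level keys whose values differ.
function shallowDiff(
  before: Record<string, unknown> = {},
  after: Record<string, unknown> = {},
): FieldChange[] {
  return Object.keys({ ...before, ...after })
    .sort()
    .filter((key) => JSON.stringify(before[key]) !== JSON.stringify(after[key]))
    .map((key) => ({ key, before: before[key], after: after[key] }));
}

// Best-effort write: a failing writer is logged and swallowed, never rethrown,
// so the caller's user-facing action still completes.
function recordAuditEvent(event: AuditEvent, writeRow: (row: AuditRow) => void): boolean {
  try {
    writeRow({ ...event, diff: shallowDiff(event.before, event.after) });
    return true;
  } catch (error) {
    console.error("audit log write failed", error);
    return false;
  }
}
```

Returning a boolean instead of throwing keeps the decision local: a route can ignore the result entirely, which is what "never breaks the user action" requires.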
Make the execution audit trail durable and reconstructable.

- triggered_by_credential_type / triggered_by_credential_label on
  workflow_executions capture which credential triggered a run
  (webhook_key | org_api_key | oauth | session | internal, plus a
  non-secret handle). These survive key revocation, unlike the existing
  triggered_by_*_api_key_id FKs which are nulled when a key is deleted
- executed_workflow_hash stamps the sha256 of the nodes+edges that ran,
  tying a run to the exact definition that produced it and joining to
  workflow_history.content_hash to resolve the stored snapshot
- hashWorkflowDefinition() shared content-hash helper
- buildAttribution() extended; execute and webhook routes populate the
  new fields
- migration 0099 (hand-authored)
Bring org-scoped (kh_) API keys to parity with user webhook keys: both
the create and revoke paths now send the out-of-band email alert and
write a security audit event with org context, so every long-lived
credential mint/revoke is recorded the same way regardless of scope.
GET /api/security/audit returns the active org's audit trail. The data is
sensitive forensic material, so the endpoint is session-gated, restricted to
org owners and admins, and always scoped to the caller's organization. Results
are filterable by action, resource, and actor, with a created_at cursor for
pagination, all served by the composite indexes on security_audit_log.
Wire the account-takeover-relevant actions into the security audit log so
the trail covers more than API keys:

- password change and password reset (reset records the requesting IP)
- email change (captures the before/after address)
- account deactivation
- session revocation
- TOTP enroll/disable and backup-code regeneration

Each writes a best-effort audit event with actor, resource, and request
metadata at its success point.
Wire workflow lifecycle and Marketplace listing mutations into the
security audit log:

- workflow.created / workflow.updated / workflow.deleted
- workflow.listed / workflow.unlisted / workflow.listing_updated

workflow.updated records scalar fields plus a content hash of the
definition rather than the full nodes/edges, keeping the audit row small;
the full snapshot and structural diff remain the job of the workflow
change-history table.
Audit billing in two layers, matching where the state actually lives:

- Authoritative transitions are recorded in the Stripe webhook handler
  (handle-billing-event.ts), the source of truth: subscription.plan_changed
  on a price change and subscription.canceled on deletion. Actor is the
  provider webhook (system).
- The checkout and cancel routes record the user-initiated intent
  (subscription.change_requested / subscription.cancel_requested) so the
  trail keeps which user triggered it, which the webhook does not carry.
- org_wallet.created on Turnkey org-wallet provisioning (actor = the
  creating user)
- agentic_wallet.hmac_rotated on HMAC secret rotation, recording the
  key-version bump; actor is the wallet sub-org (HMAC-authenticated)
Add the workflow_history store powering change history, version load, and
restore.

- workflow_history table: one row per version with the full snapshot
  (incl. edges, which are structural), a deep-diff vs the previous version,
  a content hash, and the same actor capture (who/when) as the audit log.
  Per-workflow version counter (unique with workflow_id).
- recordWorkflowSnapshot() helper, best-effort like recordAuditEvent, hooked
  into the workflow create and update chokepoints.
- content-hash + diff now normalize the definition: node identity/type/data
  and edge connectivity are tracked, cosmetic ReactFlow state (position,
  selection, size, edge styling) is stripped, so dragging a node does not
  create a version but a connection or config change does.
- migration 0100 (hand-authored).

Listing-only metadata changes stay audit-log-only (they don't alter the
definition).
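
The normalization step described above can be illustrated with a small sketch. The FlowNode and FlowEdge shapes and the normalizeDefinition name are assumptions for the example; the real helper would feed a string like this into sha256 to produce the content hash.

```typescript
// Hypothetical node/edge shapes, loosely modeled on ReactFlow state.
type FlowNode = {
  id: string;
  type?: string;
  data?: unknown;
  position?: { x: number; y: number };
  selected?: boolean;
  width?: number;
  height?: number;
};

type FlowEdge = {
  source: string;
  target: string;
  sourceHandle?: string | null;
  targetHandle?: string | null;
  style?: unknown;
  animated?: boolean;
};

// Keep only identity/type/data for nodes and connectivity for edges, then
// serialize in a stable order. Cosmetic fields (position, selection, size,
// edge styling) are dropped, so they cannot influence the result.
// Note: key order inside `data` is NOT canonicalized in this sketch.
function normalizeDefinition(nodes: FlowNode[], edges: FlowEdge[]): string {
  const keptNodes = nodes
    .map((n) => ({ id: n.id, type: n.type ?? null, data: n.data ?? null }))
    .sort((a, b) => (a.id < b.id ? -1 : a.id > b.id ? 1 : 0));
  const keptEdges = edges
    .map((e) => [e.source, e.sourceHandle ?? "", e.target, e.targetHandle ?? ""].join("->"))
    .sort();
  return JSON.stringify({ nodes: keptNodes, edges: keptEdges });
}
```

Two definitions that differ only by a dragged node serialize identically, so they hash identically and no new version is recorded; a changed `data` field or a new edge changes the string.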
- GET /api/workflows/[id]/history: admin/owner-gated version timeline with
  per-version diff and actor name/email enrichment, cursor-paginated.
- GET /api/workflows/[id]?version=N: returns a historical snapshot in the
  same shape as the live row, so the editor can load a past version.
- Enrich the security audit read endpoint with actor name/email too.
- Shared lib/security/org-role.ts (getOrgRole / isOrgAdmin) gating these
  reads to organization owners and admins.

Restore is performed client-side by loading a version and saving it back
through the normal update path, which reuses all existing validation,
schedule sync, history, and audit wiring rather than duplicating it.
Surface workflow versions in context, in the editor:

- A History button in the workflow toolbar, shown only to org admins/owners
  (useActiveMember), opens a version-history overlay.
- The overlay lists versions (who/when/source) and shows a read-only Monaco
  side-by-side JSON diff of the selected version against its predecessor.
- Restore writes the chosen snapshot back through the normal save path
  (creating a new version + audit event) and syncs the live canvas.
- api-client: getById(id, { version }) and getHistory(id); CodeDiffEditor
  wraps Monaco's DiffEditor with the shared theme.

On-canvas read-only preview of a past version is deferred; the diff view
already shows what changed.
On the Organisation tab, each org API key gets a History toggle (admin/owner
only) that lists its create/revoke events with actor and timestamp, read
from the org-scoped security audit trail via api.security.getAudit. Reuses
the existing audit endpoint and actor-name enrichment -- no new backend.

User (wfb_) webhook keys are personal and their audit events carry no org,
so the org-scoped reader does not surface them; personal-key history is a
separate follow-up.
Replace the raw Monaco JSON diff (which surfaced noisy node-position changes
and was hard to read) with a human-readable semantic diff and a clearer UX.

- computeVersionDiff(): compares snapshots by node id and edge connectivity,
  ignoring cosmetic canvas state (position, selection, edge styling). Reports
  added/removed/changed nodes (with node type and field-level before/after,
  e.g. renamed "A" to "B", config keys changed) and added/removed
  connections. Drops the Monaco diff editor entirely.
- Version-history overlay redesigned: timeline with author + relative time +
  Current badge; change list uses Plus/Minus/Pencil and ArrowRight icons
  (no glyph arrows) instead of JSON.
- "View on canvas": load a version read-only via a new previewVersionAtom
  that suppresses autosave (so previewing can't clobber the live workflow);
  a banner offers Restore / Exit preview. The atom is reset on editor
  mount/unmount.
- API-key history moved from a per-item expander (a key is only ever
  created/revoked) to a section-level activity log on the Organisation tab,
  capturing create + revoke across all keys, including revoked ones.

Adds version-diff unit tests.
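
A semantic diff of this kind can be sketched as below. It is a simplified illustration under assumed types (Snapshot, SnapNode, VersionDiff are hypothetical), comparing nodes by id and edges by source/target only; the real computeVersionDiff also reports node types and before/after values.

```typescript
type SnapNode = {
  id: string;
  type?: string;
  position?: { x: number; y: number };
  data?: { label?: string; config?: Record<string, unknown> };
};
type SnapEdge = { source: string; target: string };
type Snapshot = { nodes: SnapNode[]; edges: SnapEdge[] };

type VersionDiff = {
  addedNodes: string[];
  removedNodes: string[];
  changedNodes: { id: string; changedKeys: string[] }[];
  addedConnections: string[];
  removedConnections: string[];
};

function edgeKey(e: SnapEdge): string {
  return e.source + "->" + e.target;
}

function indexById(nodes: SnapNode[]): Record<string, SnapNode> {
  const out: Record<string, SnapNode> = Object.create(null);
  for (const n of nodes) out[n.id] = n;
  return out;
}

// Compares label and config only; position and other canvas state are ignored.
function computeVersionDiff(prev: Snapshot, next: Snapshot): VersionDiff {
  const before = indexById(prev.nodes);
  const after = indexById(next.nodes);
  const diff: VersionDiff = {
    addedNodes: [], removedNodes: [], changedNodes: [],
    addedConnections: [], removedConnections: [],
  };
  for (const id of Object.keys(after)) {
    if (!(id in before)) { diff.addedNodes.push(id); continue; }
    const a: Record<string, unknown> = before[id].data?.config ?? {};
    const b: Record<string, unknown> = after[id].data?.config ?? {};
    const changed = Object.keys({ ...a, ...b })
      .filter((k) => JSON.stringify(a[k]) !== JSON.stringify(b[k]));
    if (before[id].data?.label !== after[id].data?.label) changed.unshift("label");
    if (changed.length > 0) diff.changedNodes.push({ id, changedKeys: changed });
  }
  for (const id of Object.keys(before)) {
    if (!(id in after)) diff.removedNodes.push(id);
  }
  const prevEdges = prev.edges.map(edgeKey);
  const nextEdges = next.edges.map(edgeKey);
  diff.addedConnections = nextEdges.filter((k) => prevEdges.indexOf(k) === -1);
  diff.removedConnections = prevEdges.filter((k) => nextEdges.indexOf(k) === -1);
  return diff;
}
```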
Workflows that persisted the read/write-contract function arguments under
`args` (the canonical field is `functionArgs`) failed action-config
validation with INVALID_ACTION_CONFIG, so autosave rejected every save --
even a fully configured node or a layout-only change. Add `args ->
functionArgs` to LEGACY_FIELD_ALIASES, matching the existing
`functionName -> abiFunction` alias. Validation-only and non-breaking; the
runtime already reads functionArgs.
read/write-contract nodes persist abiFunctionKey (the resolved function
signature used for overloaded-function disambiguation), but it is not a
declared config field, so strict action-config validation rejected the save
with INVALID_ACTION_CONFIG -- autosave failed on every read-contract node.
The runtime recomputes the key and never reads it from config, so add it to
LEGACY_IGNORED_FIELDS for read/write-contract. Validation-only, non-breaking.
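
The two validation fixes above follow the same pattern: normalize the persisted config before checking it against the declared fields. A minimal sketch, assuming flat tables (the real LEGACY_FIELD_ALIASES and LEGACY_IGNORED_FIELDS are scoped per action and hold more entries, and normalizeForValidation and unknownFields are hypothetical names):

```typescript
// Illustrative tables containing only the entries named in the commit messages.
const LEGACY_FIELD_ALIASES: Record<string, string> = {
  functionName: "abiFunction",
  args: "functionArgs",
};
const LEGACY_IGNORED_FIELDS: string[] = ["abiFunctionKey"];

function hasOwn(obj: object, key: string): boolean {
  return Object.prototype.hasOwnProperty.call(obj, key);
}

// Produces the view of the config that strict validation should see:
// ignored fields are dropped and legacy names are mapped to canonical ones.
function normalizeForValidation(config: Record<string, unknown>): Record<string, unknown> {
  const out: Record<string, unknown> = {};
  for (const key of Object.keys(config)) {
    if (LEGACY_IGNORED_FIELDS.indexOf(key) !== -1) continue;
    const canonical = hasOwn(LEGACY_FIELD_ALIASES, key) ? LEGACY_FIELD_ALIASES[key] : key;
    // If the canonical field is already present, it wins over its alias.
    if (canonical !== key && hasOwn(config, canonical)) continue;
    out[canonical] = config[key];
  }
  return out;
}

// Fields that would still trip strict validation after normalization.
function unknownFields(config: Record<string, unknown>, declared: string[]): string[] {
  return Object.keys(normalizeForValidation(config)).filter((k) => declared.indexOf(k) === -1);
}
```

Because only the validator's view changes, the stored config is untouched, which matches the "validation-only, non-breaking" claim.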
# Conflicts:
#	app/api/user/totp/enroll/route.ts
#	app/api/workflows/[workflowId]/webhook/route.ts
#	drizzle/meta/_journal.json
#	lib/email.ts
Redesign the history/activity surfaces from a cramped modal into a clean,
shared visual language modeled on Google Docs version history.

- Shared building blocks: groupByDate (Today/Yesterday/This week/month),
  describeAuditAction (action -> phrase + add/remove/change kind), ActorAvatar
  (initials fallback), and an ActivityFeed (date-grouped, avatars, kind icons,
  relative time, load-more) over the security audit endpoint.
- Workflow version history is now a right-docked panel anchored in the editor
  (replaces the modal): date-grouped timeline with author + relative time +
  Current badge; selecting a version previews it live on the canvas; the
  selected entry expands to a readable semantic change list; Restore at the
  bottom. Closing exits the preview.
- API-keys overlay activity and a new admin/owner Settings > Activity tab
  (org-wide feed) reuse ActivityFeed.

Tokens only (token-audit clean on changed files); type-check + lint clean.
Make the version-history panel behave like the right-docked node-config
panel so switching between them is seamless:

- Share width via a new rightPanelWidthPctAtom (config panel + version panel
  read/write the same value), so they are the same size and resizing either
  keeps them in sync.
- Match the surface: bg-background + border-l, and slide open/close with
  transition-transform (no instant pop); the panel stays mounted so it
  animates.
- Add a left-edge drag handle to resize, mirroring the config panel, plus its
  collapse chevron affordance in place of the header X.
- On the current version, clicking a node closes the panel and reveals the
  config panel underneath so the node can be edited; previewing a historical
  version keeps the panel open.
Add a reusable cursor-pagination module (lib/pagination.ts) and adopt it on
the audit and workflow-history endpoints so large histories page properly.

- CursorPage<T> = { items, _links } with self/next/prev hrefs; bidirectional
  cursors (?cursor= older, ?before= newer) over a stable monotonic key, no
  COUNT, stable under concurrent inserts.
- parseCursorRequest + buildCursorPage centralize limit clamping, page
  slicing, boundary-cursor extraction, and link building; each route only
  supplies the column ordering and the cursor predicate.
- Client follows links via api.followPage(href) instead of reconstructing
  cursors; the activity feed and version-history panel now load-more through
  _links.next.
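
A cursor page of this shape can be sketched as follows. The example is an assumption-laden illustration: rows are held in memory and keyed by a numeric id, whereas a real route would push the predicate and LIMIT into the query. Only the older direction (?cursor=) is implemented; the prev link is emitted but ?before= parsing is left out.

```typescript
type CursorPage<T> = {
  items: T[];
  _links: { self: string; next: string | null; prev: string | null };
};

type KeyedRow = { id: number };

// `rows` must be sorted newest-first (descending id).
function buildCursorPage<T extends KeyedRow>(
  rows: T[],
  basePath: string,
  opts: { cursor?: number; limit?: number },
): CursorPage<T> {
  const limit = Math.min(Math.max(opts.limit ?? 50, 1), 100); // clamp to 1..100
  const cursor = opts.cursor;

  // Collect limit + 1 matches: the extra row only signals that more exist,
  // which avoids a COUNT query.
  const matching: T[] = [];
  for (const row of rows) {
    if (cursor === undefined || row.id < cursor) matching.push(row);
    if (matching.length > limit) break;
  }
  const hasMore = matching.length > limit;
  const items = matching.slice(0, limit);
  const first = items[0];
  const last = items[items.length - 1];

  const self = cursor === undefined
    ? `${basePath}?limit=${limit}`
    : `${basePath}?cursor=${cursor}&limit=${limit}`;
  return {
    items,
    _links: {
      self,
      next: hasMore && last ? `${basePath}?cursor=${last.id}&limit=${limit}` : null,
      prev: cursor !== undefined && first ? `${basePath}?before=${first.id}&limit=${limit}` : null,
    },
  };
}
```

Because each link embeds the boundary key rather than a page number, rows inserted while a client is paging cannot shift or duplicate items on later pages.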
Switch the shared pagination to offset-based so list views get a real
numbered pager (< 1 2 3 ... >) with total count and total pages.

- lib/pagination.ts: Page<T> = { items, meta (total/page/pageSize/totalPages),
  _links (self/first/prev/next/last) }. parsePageRequest + buildPage centralize
  offset/limit parsing, clamping, and link building; routes run a COUNT plus an
  OFFSET/LIMIT slice. Audit and workflow-history endpoints adopt it.
- Reusable <Pager> (first/last + current +/-1 with ellipses) showing the total.
- usePaginatedResource hook owns page/items/meta/loading/error for any
  Page<T> endpoint, so components don't reimplement pagination state; the
  activity feed and version-history panel both consume it.
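
The offset arithmetic and the pager window can be sketched in a few lines. The function names (buildMeta, pageOffset, pagerItems) and the 100-row cap are assumptions for the example, and inputs are assumed to be finite numbers.

```typescript
type PageMeta = { total: number; page: number; pageSize: number; totalPages: number };

function clamp(value: number, min: number, max: number): number {
  return Math.min(Math.max(Math.floor(value), min), max);
}

// Clamp the requested page and size, then derive totalPages from the COUNT.
function buildMeta(total: number, requestedPage: number, requestedSize: number): PageMeta {
  const pageSize = clamp(requestedSize, 1, 100);
  const totalPages = Math.max(1, Math.ceil(total / pageSize));
  const page = clamp(requestedPage, 1, totalPages);
  return { total, page, pageSize, totalPages };
}

// OFFSET for the slice query that accompanies the COUNT.
function pageOffset(meta: PageMeta): number {
  return (meta.page - 1) * meta.pageSize;
}

// Pager window: first page, last page, and current +/- 1, with "..." for gaps.
function pagerItems(page: number, totalPages: number): (number | "...")[] {
  const wanted: number[] = [];
  for (let p = 1; p <= totalPages; p++) {
    if (p === 1 || p === totalPages || Math.abs(p - page) <= 1) wanted.push(p);
  }
  const out: (number | "...")[] = [];
  let previous = 0;
  for (const p of wanted) {
    if (p - previous > 1) out.push("...");
    out.push(p);
    previous = p;
  }
  return out;
}
```

For example, `pagerItems(5, 10)` yields `[1, "...", 4, 5, 6, "...", 10]`. The trade-off against the cursor design is the extra COUNT and the possibility of rows shifting between pages under concurrent inserts, in exchange for a total and direct page jumps.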
The panel stays mounted (parked off-screen) to animate, but its left-edge
resize handle and collapse button are translated half outside the panel, so
when closed they protruded at the viewport's right edge. Render the handle
only while the panel is open.
The panel hardcoded a 60px top offset while the config panel uses a
responsive 6rem/lg:60px offset. Below the lg breakpoint the toolbar is taller,
so the panel started above it -- overlapping the toolbar and breaking the top
border alignment. Use the same responsive offset so the panels line up.
- Store the semantic diff (computeVersionDiff) on each history row at write
  time instead of a raw deep-diff, so the timeline shows what every version
  changed inline -- no per-row snapshot fetch. Selecting a version now only
  loads its snapshot for the live canvas preview. Rows recorded before this
  format are detected and skipped via a shape guard.
- Increase the autosave debounce from 1s to 2.5s so rapid edits don't each
  save and spawn a version.
- Add the panel's missing top border and soften the "changed" icon color
  (amber-500 -> amber-400) in the panel and activity feed.
- The panel requested the default page size (50), so 19 versions fit on one
  page and the numbered pager never appeared. Request 10 per page so it
  paginates.
- Autosave recorded a version on every save, including position-only/no-op
  edits whose semantic diff is empty -- filling the timeline with changeless
  rows. Skip recording when an update produces no meaningful change (the first
  version still always lands).
- Each version row owns its collapse state (atomic); clicking expands its
  change list without touching the canvas. Paging remounts rows so they
  collapse. "View on canvas" is an explicit per-row action.
- Remove the panel's bottom Restore (it duplicated the preview banner's);
  restore lives only in the banner now.
- New useVersionPreview hook is the single source of truth for preview /
  exit / restore and syncs previewVersionAtom with a `?version=` query param,
  so a previewed version is shareable and reopens read-only by URL (the editor
  applies `?version=` on load and opens the history panel).
- Restyle the preview banner from the loud amber pill to the app's segmented
  navbar pill (bg-secondary, rounded-md, divider segments).
The version diff only captured which config keys changed, so the timeline
could only say "configuration changed (key)". Capture each changed field's
before/after value (truncated) and render one row per field as
`<node> <key>: <before> -> <after>`. Versions recorded before this detail
existed fall back to the key-name summary.
Config changes were shown with raw keys and raw `{{@nodeid:...}}` template
refs, which don't match the editor UI. Now:
- map each config key to its editor field label via the action registry
  (findActionById + flattenConfigFields), falling back to the key,
- strip the node-id from template refs (`{{@id:Manual.triggeredAt}}` ->
  `{{Manual.triggeredAt}}`),
- render each change stacked -- "<node> · <Field label>" then the before/after
  values as muted mono chips with an arrow -- instead of a cramped inline line.
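
The template-ref cleanup amounts to a single substitution. A minimal sketch, assuming refs always have the form `{{@<nodeId>:<Name>.<path>}}` with no colon inside the node id (stripNodeIds is a hypothetical name):

```typescript
// Rewrites {{@<nodeId>:<rest>}} to {{<rest>}}; refs without "@" are left alone.
function stripNodeIds(value: string): string {
  return value.replace(/\{\{@[^:{}]+:([^{}]+)\}\}/g, "{{$1}}");
}
```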
The flat gray value chips read as disabled controls. Render config diffs the
way a diff should look: the old value in red, the new value in green (each a
subtle tinted, ringed chip), and an absent value as a faint "empty"
placeholder instead of a chip. Tighten the row layout so the node/field label
sits above the before -> after values.
- Resolve real node names via getActionLabel (plugins + legacy aliases;
  system actions and triggers are self-labeled); unconfigured nodes read as
  "Action". Connection endpoints and config rows use these names.
- Distinct change types/icons: connect (link) / disconnect (unlink) /
  enable / disable, separate from node add/remove; config keys map to their
  field labels and the action id shows as "Action".
- Config diffs use semantic before/after chips (red old, green new); clean
  template refs ({{@id:Manual.x}} -> {{Manual.x}}).
- Collapsible rows are atomic (per-row state) with a grid-rows open animation;
  softer expanded card, name · time on one line, proper spacing/padding.
- usePaginatedResource gains optional silent background polling; the panel
  refreshes every 30s while open.
eskp and others added 12 commits June 16, 2026 22:49
ultracite 6.5.1 (bumped in 3805adb without a matching @biomejs/biome bump)
rejects the `noIncrementDecrement` nursery key during config validation, so
`pnpm check` errors out before linting on local installs. The key was set to
"off" (a no-op for an experimental nursery rule), so removing it changes no
lint behavior while letting the wrapper run again.
The backdoor PATCH /api/workflows/[workflowId] re-ran every publish-time gate
(write-action, bare-@, input-schema) except the slug check, so isListed=true
could be persisted with a null listedSlug. Such a row is discoverable in the
marketplace catalog yet uncallable: external agents invoke a listing by slug at
/api/mcp/workflows/<slug>/call, and there is no slug to address. For a paid
workflow it is advertised but can never be called or settled.

- Add the slug gate to the PATCH willBeListed block. Listing, or staying listed,
  with a null/blank final slug now returns 422 SLUG_REQUIRED. Honors the existing
  field-touched + isListed=false unlist-bypass + sticky-slug conventions.
- Make the curator route (lib/mcp/listing.ts) return the same SLUG_REQUIRED code
  (was a generic INVALID_INPUT/400) so both listing surfaces reject identically.
- Migration 0113 unlists existing is_listed=true AND listed_slug IS NULL rows
  (data-only; no schema change; slug stays nullable, global unique index intact).
- Tests: list-without-slug -> 422, list-with-slug -> 200, sticky-slug edit while
  listed -> 200, unlist orphaned row -> 200, curator-route parity, plus an e2e
  full-HTTP-path roundtrip. Four existing transition tests updated to supply a
  slug (they had encoded the now-rejected slugless-listing state).

KEEP-494
Review found the new gate (which trims) was stricter than the curator path
and the cleanup migration, which only handled null/falsy slugs. A listed row
with an empty or whitespace-only slug could survive and then be falsely
rejected on any later edit (and the slug-immutability gate blocks fixing it
in place).

- lib/mcp/listing.ts: trim the curator slug before validating and persisting,
  so a blank/whitespace slug is refused (not stored) and matches the PATCH gate.
- PATCH route: persist the trimmed slug so the stored value equals what was
  validated (no leading/trailing whitespace).
- Migration 0113: also unlist is_listed=true rows whose slug is empty or
  whitespace (btrim(listed_slug) = ''), matching the gate's accept-shape.
- Tests: curator whitespace slug -> SLUG_REQUIRED, curator padded slug trimmed,
  PATCH padded slug trimmed before persist.
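
The gate's accept-shape, after the trimming fix, can be sketched as a pure function. The types and the checkSlugGate name are assumptions for illustration, and the sketch leaves out the slug-immutability gate and the other publish-time checks that the real PATCH route runs.

```typescript
type ListingState = { isListed: boolean; listedSlug: string | null };
type ListingPatch = { isListed?: boolean; listedSlug?: string | null };
type GateResult =
  | { ok: true; listedSlug: string | null }
  | { ok: false; status: 422; code: "SLUG_REQUIRED" };

function checkSlugGate(existing: ListingState, patch: ListingPatch): GateResult {
  // Listing state after the patch: an untouched field keeps its stored value.
  const willBeListed = patch.isListed ?? existing.isListed;
  // Sticky slug: the stored slug applies unless the patch touches the field.
  const rawSlug = patch.listedSlug !== undefined ? patch.listedSlug : existing.listedSlug;
  const trimmed = rawSlug === null ? "" : rawSlug.trim();

  // Null, empty, and whitespace-only slugs are all treated as missing.
  if (willBeListed && trimmed === "") {
    return { ok: false, status: 422, code: "SLUG_REQUIRED" };
  }
  // Persist the trimmed value so the stored slug equals what was validated.
  // Storing null for a blank slug on an unlisted row is an assumption here.
  return { ok: true, listedSlug: trimmed === "" ? null : trimmed };
}
```

Setting isListed=false always passes, which is the unlist bypass that lets an orphaned listed-without-slug row be repaired.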
Make the organization activity feed informative and actionable:

- Resolve and display resource names server-side for workflows,
  integrations (with type), and personal + org API keys.
- Workflow events open the editor History tab deep-linked to the exact
  version they produced. The History tab is URL-driven (historyPage,
  version), highlights and scrolls to the target version, and each
  version has a copy-link button; 'current' tracks the latest version.
- Integration and API key events open their management modal with the
  matching row highlighted and scrolled into view.
- The version preview banner shows the current version in green with no
  Restore action; historical versions keep Restore.
…-alerts

feat(security): security audit trail, API-key alerts, and workflow version history
The fallback chain-name map labeled 11155111 as just 'Sepolia', which is
ambiguous alongside Base/Arbitrum/Optimism Sepolia. Use the canonical
'Ethereum Sepolia' (matching the chains table) so the version-history
diff and other fallbacks name the network unambiguously.
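
A fallback map of this kind might look like the sketch below. Only the 11155111 entry comes from this change; the other entries, the map name, and the `Chain <id>` default are assumptions added to show why the bare 'Sepolia' label was ambiguous.

```typescript
// Fallback labels used when the chains table has no row for an id.
const FALLBACK_CHAIN_NAMES: Record<number, string> = {
  11155111: "Ethereum Sepolia", // previously just "Sepolia"
  84532: "Base Sepolia",
  421614: "Arbitrum Sepolia",
  11155420: "Optimism Sepolia",
};

function chainLabel(chainId: number): string {
  return Object.prototype.hasOwnProperty.call(FALLBACK_CHAIN_NAMES, chainId)
    ? FALLBACK_CHAIN_NAMES[chainId]
    : `Chain ${chainId}`;
}
```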
…abel

fix: disambiguate Sepolia chain label fallback
The app/migrator/workflow-runner bake in build-images.yml and the executor
bake in deploy-executor.yaml run on ubuntu-latest, whose ~14GB disk fills
while BuildKit unpacks node_modules, failing with 'no space left on device'.

Reclaim ~20GB+ by removing preinstalled Android/.NET/Haskell toolchains and
large apt packages before the build. Keep the hosted tool cache (tool-cache:
false) so CodeQL and other cached tools are untouched. Gated by each job's
existing skip conditions so reruns that skip the build also skip the cleanup.
ci: free runner disk space before image builds
keeperhub_workflow_execution_errors_by_workflow_total was a per-pod
counter incremented only on the kickoff/reaper/MCP finalization paths,
never on the main logWorkflowCompleteDb path where normal workflow
errors finalize. It therefore returned no data in prod and the
managed-client alert numerator that reads it stayed empty during real
error bursts. Per-pod counters emitted from the workflow runner are also
dropped before Prometheus scrapes the short-lived process.

Replace it with keeperhub_workflow_errors_by_workflow, a DB-sourced
gauge computed on each /api/metrics/db scrape from workflow_executions
(status='error'), grouped by workflow_id/org_slug/error_type and scoped
to managed orgs to bound workflow_id cardinality. Being DB-sourced it
counts every finalized error row regardless of which code path wrote it.

The old counter is left in place; a follow-up removes it once the infra
alert is switched to the new gauge.
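
The shape of the DB-sourced gauge can be sketched as below. This is an in-memory illustration only: the ExecutionRow type and renderErrorGauge name are assumptions, the real endpoint would presumably aggregate with a grouped SQL query, and label values are not escaped here.

```typescript
type ExecutionRow = {
  workflowId: string;
  orgSlug: string;
  status: string;
  errorType: string | null;
};

// Counts finalized error rows per (workflow_id, org_slug, error_type) for
// managed orgs only, and renders them in Prometheus text exposition format.
function renderErrorGauge(rows: ExecutionRow[], managedOrgs: string[]): string {
  const counts: Record<string, number> = Object.create(null);
  for (const row of rows) {
    if (row.status !== "error") continue;
    if (managedOrgs.indexOf(row.orgSlug) === -1) continue; // bounds cardinality
    const labels =
      `workflow_id="${row.workflowId}",org_slug="${row.orgSlug}",` +
      `error_type="${row.errorType ?? "unknown"}"`;
    counts[labels] = (counts[labels] ?? 0) + 1;
  }
  const lines = ["# TYPE keeperhub_workflow_errors_by_workflow gauge"];
  for (const labels of Object.keys(counts).sort()) {
    lines.push(`keeperhub_workflow_errors_by_workflow{${labels}} ${counts[labels]}`);
  }
  return lines.join("\n");
}
```

Because the value is recomputed from the table on every scrape, it does not depend on which process or code path finalized the row, and nothing is lost when a short-lived runner exits.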

Refs TECH-48
…rs-by-workflow-db-metric

fix(metrics): source per-workflow error metric from DB
…ketplace-polish-nullable-slugs-server-side-tag-search

# Conflicts:
#	drizzle/meta/_journal.json
chong-techops requested review from a team, OleksandrUA, eskp, joelorzet, and suisuss, and removed the request for a team, June 17, 2026 06:38
…lish-nullable-slugs-server-side-tag-search

fix(marketplace): require a slug before listing a workflow
eskp added the metrics-db-reviewed label (Reviewer sign-off: metrics aggregate queries optimised + tables indexed, KEEP-680) on Jun 17, 2026
eskp merged commit 4173ad1 into prod on Jun 17, 2026
45 of 46 checks passed

4 participants