Skip to content

release: To Prod#1388

Merged
suisuss merged 31 commits into
prodfrom
staging
May 27, 2026
Merged

release: To Prod#1388
suisuss merged 31 commits into
prodfrom
staging

Conversation

@suisuss

@suisuss suisuss commented May 27, 2026

Copy link
Copy Markdown

No description provided.

joelorzet and others added 30 commits May 26, 2026 21:48
Reset-password form: the authenticator code input is no longer
conditional on a server retry. The fields are now ordered
reset code -> authenticator code -> new password -> confirm
password, with helper copy that explains the authenticator field
is optional for accounts without 2FA. Users who have TOTP enrolled
no longer need a round-trip with mfa_code_required to learn the
field exists.

Step indicator: the current and done bubbles use bg-keeperhub-green
instead of the primary navy. The pending state drops the muted
treatment so the digit and label sit at the default foreground
weight, making the upcoming step legible instead of looking
disabled. The digit uses dark:text-background so the dark navy
shows on the bright green bubble in dark mode.

Same three changes apply to the sign-in dual-factor stepper, the
shared DualFactorSteps component, and the TOTP setup dialog so
the whole MFA surface stays consistent.
interceptOauthCallback had six silent-return branches plus two
success paths and no way to tell which one a failing user took.
Each branch now emits a structured warn with provider + status +
the relevant identifying fields so Sentry / Loki surface the path.

Each log line is keyed on a stable error_type so we can group:
  - non_redirect_response (Better Auth returned 4xx/5xx)
  - no_session_token_in_set_cookie (interceptor saw no session cookie)
  - session_row_not_found (cookie was set but row missing)
  - user_email_missing (row found but no email to deliver an OTP to)
  - pending_oauth_mfa_path (TOTP-enrolled user, defer to /verify-mfa)
  - pending_signup_mfa_path (no TOTP, defer to /enroll-mfa)
Defense-in-depth backstop for the deactivated-user auth bypass. Adds a
BEFORE INSERT trigger on sessions that rejects any insert whose owning
user has deactivated_at set, covering paths that bypass the app-layer
gate in lib/auth-deactivation-guard.ts (future OAuth providers, direct
db.insert(sessions) call sites, hook regressions).

The rejection raises a dedicated SQLSTATE KH001 plus a fixed message so a
future detection layer can classify it by error code alone. Shipped as a
hand-authored versioned migration (0090) since drizzle-kit generate is
broken on this repo; no snapshot, matching 0086-0088.

Tests: static drift assertions on the migration SQL (tests/unit) and a
runtime KH001 contract test against a real Postgres (tests/e2e/vitest).
The scheduler selected due workflows filtering only enabled=true, never
excluding soft-deleted workflows or those whose owner is deactivated, so
stale-but-enabled workflows were dispatched and only stopped by the
block_executions INSERT trigger. The five execution entry points (scheduler
select, executor dispatch, HTTP execute, webhook, agent-call lookup) each
gated lifecycle state differently and had drifted.

Introduce lib/workflow/executable.ts as the single source of truth:
- workflowExecutableConditions(): SQL WHERE fragment for the select sites
  (scheduler, agent-call lookup), joining users for the owner-active clause.
- getWorkflowExecutability(): in-memory discriminated predicate for the
  fetch-then-gate sites (executor, execute, webhook), returning a reason so
  callers preserve their HTTP semantics (webhook 410 disabled vs 404 gone).

Both encode the same enabled + not-deleted + owner-active columns. Org/
subscription status stays in the billing layer; the block_executions DB
trigger remains a hard backstop.

Behavior changes: agent-call lookup now also requires enabled=true; a
webhook workflow that is both deleted and disabled now returns 404 (gone)
rather than 410.
The runtime test reads the SQLSTATE from err.cause.code (drizzle wraps the
postgres-js error in DrizzleQueryError); the docstring claimed err.code.
Also corrected the skip note: tests/setup.ts always sets a default
DATABASE_URL, so the no-URL skip branch rarely fires - SKIP_INFRA_TESTS is
the real opt-out.
The trigger reads committed deactivated_at, so a session insert racing a
not-yet-committed deactivation can slip through. Document that this is a
backstop behind the app-layer gate, not an airtight guarantee, so no future
reader assumes it closes the concurrent-deactivation window on its own.
Qualify the table as public.users so the trigger always reads the real
table regardless of the caller's search_path (the function runs SECURITY
INVOKER). Removes the search_path-shadowing ambiguity of the unqualified
reference. Update the drift test's matcher to expect the qualified name.
Add a self-contained e2e case: a user deactivated then reactivated
(deactivated_at back to NULL) can have a session created again. Proves the
trigger evaluates the live deactivated_at rather than a one-time state.
The execute route serves both automated dispatch (internal service auth) and
the interactive editor "Run" button (session/OAuth). Workflows default to
enabled=false, so gating enabled unconditionally would 404 every test run of a
not-yet-enabled workflow during authoring.

Apply the enabled gate only on the internal/automated branch; the interactive
branch still blocks soft-deleted and deactivated-owner workflows but allows a
disabled one through so it can be test-run. Also drop the now-redundant
access.isDeleted checks - deletion is handled uniformly by the executability
gate.
…ated-trigger

feat: add sessions INSERT trigger rejecting deactivated owners
The executor fetched the workflow, then issued a second query for the owner's
deactivatedAt. Fold it into a single leftJoin so each dispatched message costs
one round-trip instead of two. The route sites keep their separate lookup -
the app db does not register the workflows relation and a join there would
force a large test-harness rework for a single indexed lookup.
The forgot-password route generated its own 6-digit code and wrote
it to the verifications table verbatim, sitting next to the
better-auth emailOTP rows that storeOTP "encrypted" has been
hardening for months. A DB read alone was therefore enough to
recover a live reset code, the exact threat the encrypted store
was put in to close.

Same primitives as the strict-signin path: symmetricEncrypt under
BETTER_AUTH_SECRET, stored as <ciphertext>:0 to match the emailOTP
suffix shape; constant-time decrypt-and-compare on verify.

Also drop the 500 on send failure. The row is already minted and
the rate limiter has already debited, so escalating a SendGrid
hiccup also leaks the "this email exists" bit that the
user-not-found and OAuth-only branches go out of their way to
suppress. sendVerificationOTP keeps the failure log path. UI copy
shifts from "If an account exists ..." to a confident "Check your
inbox for the reset code." for the same reason.
Reset-password form: the "(if 2FA is enabled)" hint and helper
paragraph are gone; the field is just labelled "Authenticator code"
since the server already knows whether it needs that input. The
now-dead forgotNeedsMfa state and the mfa_code_required reveal
branch in the client are deleted; the field is always present so
there is nothing to reveal. Success toast switches to "Check your
inbox for the reset code." to match the new server copy.

MFA stepper across the sign-in, reset, /verify-mfa and /enroll-mfa
surfaces:
  - "Email code" and "Authenticator code" input labels picked up
    the keeperhub-green-dark accent so they read as "the thing to
    fill in," not just another label.
  - Pending step number and label drop the muted treatment so
    upcoming steps stay legible instead of looking disabled.
  - The TOTP setup dialog's "Skip for now" button on the
    download-codes step is renamed to "Skip" and made the primary
    action, themed bg-keeperhub-green with the same dark-on-green
    contrast pattern the bubbles use.
Org API keys are admin/owner-only at /api/keys via
requireAdminOrOwnerWithMfa, but the overlay let any member click
"New API Key" and the per-row trash button. The server rejection
landed as a generic 403 toast which read as a bug.

Members now see the list as before but the create action and the
per-row delete are disabled, with a banner above the list and a
tooltip on the trash button explaining the requirement. Personal
webhook (User) keys stay fully writable for every user since those
are user-scoped, not org-scoped.

Role state is read from useActiveMember, which is built on
authClient.useActiveOrganization, so the gate flips automatically
when the user switches orgs without a refresh.
Drives the actual GET /api/internal/schedules handler against a real database
and asserts it returns only the healthy workflow's schedule while excluding
soft-deleted, disabled, and deactivated-owner ones, then re-includes a schedule
after the owner is reactivated. Unmocks the globally-stubbed @/lib/db so the
route runs against seeded rows; skips when no database is available.
The module gates five execution entry points, not four.
The execute and webhook routes carried near-verbatim copies of the background
kick-off (start() + runId persist + error classification). Extract a single
lib/workflow/execute-in-background.ts that both call, parameterized by a log
prefix and endpoint, so the two cannot drift (the webhook copy had already lost
the credential-handling SECURITY comment).
The K8s runner and in-process executor re-checked enabled and deleted_at as a
defensive backstop but not owner deactivation, so they could drift from the
dispatcher's gate if the owner was deactivated between dispatch and execution.
Route both through getWorkflowExecutability so all three lifecycle conditions
are covered, cancelling the execution rather than running it.
The executor created execution rows as "pending" (then runner/in-process flip
to "running"), but the execute and webhook routes created them as "running",
so API-triggered executions skipped the pending phase. Create rows as "pending"
in the routes too and flip to "running" in executeWorkflowInBackground, guarded
on the current status so a fast completion is never clobbered back to running.
Every entry point now shares pending -> running -> terminal. ("pending" is
already the workflow_executions schema default.)
…n status

executeWorkflow writes the final execution status itself - via its
_workflowComplete step into logWorkflowCompleteDb, which carries the
error->success reconciliation, the no-overwrite guard, and richer fields
(transactionHashes, error classification, duration). The K8s runner and
in-process executor then called updateExecutionStatus again after it returned,
overwriting that authoritative row with a simpler one and defeating the
reconciliation (verified: executeWorkflow alone leaves the row at "success"
with completedAt set, no updateExecutionStatus needed).

Drop the post-run updateExecutionStatus(success/error) calls from both paths
(schedule bookkeeping and the catch-block fallback stay), and guard
updateExecutionStatus so its remaining callers (running, cancelled, fatal
fallback) can never overwrite a row that already reached success/cancelled -
matching the engine's own KEEP-431 guard. Execution-row output on these paths
now matches the API/DevKit path.
The running-execution concurrency cap was only checked on the API route, so
schedule/block/event runs dispatched to k8s-job or in-process targets bypassed
it entirely. Extract the check into a server-only-free lib/workflow/concurrency
that both the route wrapper and the standalone executor share, and enforce it in
the executor before creating the execution row. On breach the executor throws so
the SQS message is redelivered after the visibility timeout (back-pressure)
rather than dropping the scheduled run.
…s guards

The executor's billing-guard reimplemented the monthly execution-limit check
that lib/billing/plans-server.ts already does - the same count query and the
same allow/overage/block branch order, maintained by hand on both sides of the
server-only boundary. Extract a server-only-free lib/billing/execution-limit-core
(countMonthlyExecutions, decideExecutionLimit, effectiveExecutionLimit) that both
call, so the branch logic and count can no longer drift. Each side keeps its own
result shape and its own debt source: the route's is Stripe-aware and server-only,
the executor's is a plain active-debt sum, so those cannot be merged.

Feature gating is already shared via lib/features/workflow-validator
(validateWorkflowFeatures), so only the inherently db-bound plan lookup differs
there - left as is.

Minor: the executor's active-debt block result now reports the actual monthly
used count instead of 0 (it now counts before deciding, like the route).
…flow call

The in-process executor and K8s runner call executeWorkflow directly rather than
start() from workflow/api, which the DevKit editor plugin flags. That is
deliberate: these are standalone processes with no DevKit run-processor, so the
pod is the execution boundary and runs the workflow synchronously. Document the
rationale and the durability tradeoff (no checkpoint/resume; with backoffLimit:0
a crashed pod leaves the row "running" until a sweeper closes it) at both call
sites. Routing through start() would persist runs nothing executes, so it is not
the fix; the stuck-running gap is tracked separately.
The two success-path diagnostics for the OAuth interceptor were
logSystemWarn, which routes through captureException and surfaces
in Sentry as a warning. Every legitimate OAuth login by a member
without TOTP would therefore page Sentry with
pending_signup_mfa_path, drowning the actual signals we added the
logs for.

Switch the two success branches to console.info with the same
structured payload. The four genuine-failure branches
(non_redirect_response, no_session_token_in_set_cookie,
session_row_not_found, user_email_missing) stay on logSystemWarn
because those should page.
…fa-stepper-green

fix(auth): reset-password UX, MFA stepper polish, API-keys role gate
The runner and in-process executors no longer relied on their own write to
close the execution row, deferring entirely to executeWorkflow's internal
_workflowComplete write. That write is best-effort (caught and logged, never
rethrown), so if it was lost the row stayed stuck running with no recovery.

Tighten updateExecutionStatus to only transition a non-terminal row (exclude
success, error, and cancelled), then re-issue the success/error write from the
runner and in-process success branches. The guard makes the re-issued write a
no-op once the engine's own terminal write landed - so it cannot clobber the
reconciled status or richer fields - and closes the row only when that write
was lost.
countMonthlyExecutions sums two COUNT(*) subqueries; postgres-js serializes
bigint as a string, so used came back as a string for the executor billing
guard (the query builder it replaced cast to ::int and returned a number).
Cast each COUNT to ::int so both the Next billing path and the executor get a
numeric used.
fallbackCompleteExecution was defined but never called anywhere, and its two
required env vars (EVENTS_SERVICE_API_KEY, NEXT_PUBLIC_APP_URL) are not wired
into the executor, so it would no-op even if invoked. It also wrote via an
endpoint guarded only against cancelled, so it could clobber a terminal status.
The guarded same-handle backstop now covers the lost-write case; drop the dead
indirection and its test.
The agent-call lookup gated on workflowExecutableConditions (which folds in
enabled), so a listed-but-disabled workflow fell out of the result set and
collapsed to a generic 404 - misreporting a paused workflow as gone, and
discarding the disabled-vs-gone distinction the rest of the lifecycle work
preserves.

Split the predicate: workflowReachableConditions gates the hard-gone states
(soft-deleted, deactivated owner) in SQL as 404, and the route checks enabled
in-memory on the loaded row, returning 503 temporarily-unavailable for a
disabled workflow. No info leak since the listing is already public.
workflowExecutableConditions now composes on top of the reachable predicate so
the scheduler select is unchanged.
Real-DB test proving the guard that makes the runner/in-process re-issued
write a safe backstop: it closes a still-running row, and never overwrites a
row already in a terminal state (success, error, cancelled). The error case
exercises the ne(status,'error') clause specifically, so a re-issued write
cannot clobber the engine's reconciled terminal status.
…cle-filter

fix: gate workflow execution on a shared lifecycle predicate
@suisuss suisuss merged commit ca30461 into prod May 27, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants