Skip to content

release: To Prod#1359

Merged
eskp merged 87 commits into
prodfrom
staging
May 26, 2026
Merged

release: To Prod#1359
eskp merged 87 commits into
prodfrom
staging

Conversation

@suisuss

@suisuss suisuss commented May 25, 2026

Copy link
Copy Markdown

No description provided.

suisuss and others added 30 commits May 22, 2026 13:26
The scheduler dispatcher polls every 60s, so the effective firing
resolution is 60s. A row with `interval_seconds = 30` doesn't fire
every 30s -- it fires every poll cycle (every 60s), because
shouldTriggerInterval's `elapsedMs % intervalMs < 60_000` predicate is
always true when intervalMs < 60_000. Silent under-firing, not a
crash. Same class of bug as the original KEEP-575 "UI says X, schedule
does Y" mismatch.

The UI minimum is 1 minute so UI traffic never hits this path. Only
API / DB-direct writes can land sub-60s values; reject them so the
config falls back to cron (or no-op) at sync time instead of being
stored and silently miscadenced.

Three new boundary tests via extractScheduleConfig:
- interval=30s -> cron fallback
- interval=59s -> cron fallback (one below the floor)
- interval=60s -> interval mode accepted (the exact floor)
…er script

scripts/scheduler/schedule-dispatcher.ts is a standalone variant that
parses cron_expression directly. Since KEEP-575 PR #1282 commit
2510ed5, interval-mode rows store a fixed `0 0 1 1 *` sentinel in
that column ("00:00 on Jan 1"); if this script parsed it, it would
"fire" the schedule once a year.

Add intervalSeconds to the SELECT, filter out non-null values, and log
the skipped count so the omission is visible rather than appearing as
an unexplained drop in dispatched schedules. The production
dispatcher in keeperhub-scheduler/schedule-dispatcher/dispatch.ts
already handles both modes correctly; this fix only narrows the
standalone variant to its actual capability.
updateScheduleStatus in workflow-runner-profiled.ts computes
next_run_at by parsing cron_expression. For interval-mode rows the
column holds a `0 0 1 1 *` sentinel; parsing it would write
"Jan 1 next year" to next_run_at -- visibly wrong for an "every N
minutes" schedule.

The production updateScheduleStatus in lib/schedule-service.ts (and
its peer in keeperhub-executor/lib/db-helpers.ts) handles interval
mode correctly via computeNextIntervalRunTime. This profiling runner
is invoked manually and isn't on the critical timing path, so log and
skip instead of pulling the interval helpers in.
…ent demotion

The original parser-level guardrail (commit 373299c) returned null for
sub-60s values, causing extractScheduleConfig to silently fall through to
the cron branch. The save endpoint then returned 200 while either
demoting the schedule to cron mode (if a cron expression was set) or
deleting the schedule entirely (if not). The dispatcher stopped
miscadencing, but the API contract still lied to the caller.

Promote the parser-level guardrail to a typed throw, propagate it
through the service layer as a distinguishable result code, and have
the workflow save routes return 400 SCHEDULE_INTERVAL_TOO_SMALL.

- parseIntervalSeconds throws IntervalTooSmallError (carries raw + minimum)
  for sub-60s values. Absent / empty / non-numeric inputs still return
  null -- only out-of-range numeric values throw.
- syncWorkflowSchedule catches the throw and returns
  { synced: false, error, code: "interval_too_small" }. Other sync
  failures (timezone, cron) keep their existing warn-and-continue
  behaviour; only this one is promoted to a hard 400.
- PATCH /api/workflows/[workflowId] runs extractScheduleConfig BEFORE
  the DB update so a rejected sub-60s value never lands as persisted
  nodes paired with a missing or miscadenced schedule row.
- Admin enable-workflow returns 400 on the same code for consistency,
  in case a pre-validation-era row reaches it.

Tests:
- 3 boundary tests in schedule-service.test.ts now assert
  IntervalTooSmallError is thrown instead of asserting cron fallback,
  plus a new test verifying error.raw and error.minimum are populated
  so callers can render structured error bodies.
- New integration test tests/integration/workflow-schedule-validation-route.test.ts
  covers the PATCH route: 30s / 59s reject, no partial DB write on
  reject, 60s accept, manual trigger unaffected.
- tests/integration/workflow-listing-route.test.ts mock switched to
  vi.importActual + override because the route now imports
  extractScheduleConfig alongside syncWorkflowSchedule.

DB enforcement (a CHECK constraint on workflow_schedules.interval_seconds)
is deliberately not in this commit -- it's a separate concern and a
separate change.
The 60s floor is used by parseIntervalSeconds today and will be the
source of truth for any future code that needs to express "the
minimum interval" (UI validators, zod schemas at API boundaries, a
future DB CHECK constraint). Keeping it as a private module const
guarantees drift -- re-declared 60s here, hard-coded 60s in a
validator there.

Pure export change; no behaviour difference.
…andalone dispatcher

The profiling runner skip in
scripts/runtime/workflow_runtime_analysis/workflow-runner-profiled.ts
uses console.warn for the same "this script only handles cron mode,
skipping the interval-mode row" event. The standalone dispatcher was
using console.log for the equivalent skip, which is inconsistent and
buries the omission in info-level noise.

Align both to warn.
The existing boundary tests (in tests/unit/schedule-service.test.ts)
exercise the parser transitively via extractScheduleConfig. Their
assertions describe the downstream function's behaviour ("falls back
to cron when ...") rather than the parser's own contract, so they
couple to caller semantics that may change.

Add 19 direct tests pinning the parser's three observable outcomes:

  absent input (undefined / null / empty / NaN / non-numeric)   -> null
  unusable input (zero / negative / Infinity / objects)         -> null
  numeric < MIN_INTERVAL_SECONDS                                -> throws
  numeric >= MIN_INTERVAL_SECONDS                               -> floored int

Includes the boundary cases (59, 60), the fractional edges (59.9
floors to 59 and throws; 60.7 floors to 60 and passes), and a
populated-error assertion verifying raw and minimum are accessible
on the thrown IntervalTooSmallError so callers can render structured
400 bodies.
…one dispatcher

The filter `(s) => s.intervalSeconds === null` is intentionally strict
rather than `!= null && >= 60`. A row with `interval_seconds = 0`
falls into the "skip as interval-mode" bucket rather than the cron
bucket -- which is the safe choice, because the cron path would
parse the `0 0 1 1 *` sentinel and fire the schedule once a year.

The case isn't reachable through the API today (parseIntervalSeconds
returns null for zero) and would be impossible if a DB CHECK
constraint enforcing `interval_seconds IS NULL OR >= 60` lands. But
the script's filter is the only defense against a 0 value reaching
this code path from a script, psql, or future migration that
bypasses the service layer, so call it out explicitly rather than
leaving it as an implicit consequence of the `=== null` choice.

No behaviour change; comment-only.
CHECK (interval_seconds IS NULL OR interval_seconds >= 60) on
workflow_schedules. The application-layer rejection added in 47a6c5f
covers normal API traffic, but anything that bypasses the service
layer (scripts, psql, future migrations, refactors that drop the
throw-handling) can still write sub-60s values. The CHECK is the
defense-in-depth layer that makes the floor impossible to violate at
the storage level.

Constraint name workflow_schedules_interval_seconds_min_60 matches
the project's existing convention (table_column_kind pattern visible
in workflow_schedules_workflow_id_workflows_id_fk etc.).

Pre-flight (2026-05-21):
  - staging: SELECT COUNT(*) FROM workflow_schedules
             WHERE interval_seconds IS NOT NULL AND interval_seconds < 60
             -> 0 rows
  - prod:    same query -> 0 rows

Zero existing violators means the migration lands clean in a single
ALTER TABLE; no NOT VALID / VALIDATE two-step and no backfill needed.
The table is small (single-digit thousands of rows), so the brief
ACCESS EXCLUSIVE lock during the validation scan is functionally
instant.

The literal 60 in SQL mirrors MIN_INTERVAL_SECONDS in
lib/cron-utils.ts -- SQL can't import from TS, so the two must be
kept in sync if the floor ever changes. A comment on both sides
flags this.
…t migration

Pre-flight on 2026-05-22 against staging (39 rows) and prod (396
rows). Both well under the ~100k threshold where ADD CONSTRAINT ...
CHECK's ACCESS EXCLUSIVE validation scan would meaningfully block
concurrent traffic. Single-statement ALTER TABLE is correct at this
size; the validation scan is microsecond-fast.

Documented the verified row counts and the threshold at which a
future maintainer should switch to the NOT VALID + VALIDATE pattern
so the decision is reproducible.

Comment-only; migration SQL is unchanged.
…riminator

The previous shape had code as an optional field on the failure
branch:

  | { synced: false; error: string; code?: "interval_too_small" };

That left call sites free to ignore code without TS complaint. If a
future failure category lands (e.g. an "invalid_webhook_url" hard
fail), existing call sites won't be forced to handle it -- they'll
silently fall into the soft path.

Replace with a required kind discriminator and split hard/soft into
two distinct variants:

  | { synced: false; kind: "soft"; error: string }
  | { synced: false; kind: "hard"; code: "interval_too_small"; error: string }

Hard failures MUST be surfaced as a 400; soft failures (timezone,
cron) are warn-logged and the save still succeeds. Adding a new
hard-kind code now requires updating every call site's narrowing,
caught by TS exhaustiveness checks.

Call sites updated:

- app/api/admin/test/enable-workflow/route.staging.ts: switches the
  branch from `syncResult.code === "interval_too_small"` to
  `syncResult.kind === "hard"`.
- app/api/workflows/[workflowId]/route.ts: handlePostUpdateSideEffects
  reads `syncResult.error` only when `!syncResult.synced`, unchanged.

No behaviour change for working paths; new shape just enforces
exhaustive handling at the type level.
The try/catch around extractScheduleConfig in syncWorkflowSchedule
only handles IntervalTooSmallError -- any other thrown error
rethrows. This contract was documented in code (the `throw error`
line) but not pinned by a test.

Three tests:

  1. parseIntervalSeconds throws generic Error
     -> syncWorkflowSchedule rejects with the same Error.
  2. parseIntervalSeconds throws a custom subclass that is NOT
     IntervalTooSmallError -> rejects, preserves subclass identity.
  3. Sanity: parseIntervalSeconds throws IntervalTooSmallError
     -> syncWorkflowSchedule returns { synced: false, kind: "hard",
     code: "interval_too_small" } and does NOT rethrow.

The third test guards against a refactor that deletes the catch
block entirely (which would make tests 1 and 2 still pass for the
wrong reason -- any error would rethrow because there's no catch).

Implementation note: parseIntervalSeconds is the only thing inside
extractScheduleConfig that throws today, so it's the natural injection
point. Mocked via vi.importActual + override to keep
IntervalTooSmallError real (instanceof checks in
syncWorkflowSchedule still work).
Two new tests for the route's sync-result handling:

  1. kind:hard (interval_too_small) -> 400 SCHEDULE_INTERVAL_TOO_SMALL
     with the error message echoed in `message`. Direct coverage of
     the branch added in 98959c9, which had no test until now.
  2. kind:soft (invalid timezone) -> 200, schedule sync result is
     included in the response body. Pins the "warn-and-continue"
     behaviour so it can't silently flip to 400 in a future refactor
     of ScheduleSyncResult.

The happy-path test (sync succeeds -> 200) was already present and
unchanged. The mocked sync return values use the new discriminated
shape from 33a3955 (kind: "hard" | "soft") to assert call-site
behaviour rather than the looser code? field that existed before.
App-layer signup defenses ahead of retiring the emergency signup
whitelist trigger from migration 0084.

- BetterAuth captcha plugin (Cloudflare Turnstile) gated to
  /sign-up/email only. Production fails fast at module load if the
  secret key is missing; CI/test and admin-test-endpoint modes skip
  the plugin so existing Playwright signup tests keep working.
- Per-IP signup rate limit (5/hour) added to rateLimit.customRules,
  declared before the existing /* bypass rule so first-match wins.
  Chains through rateLimitBypassRule so the X-Test-API-Key bypass
  continues to work for E2E.
- AuthDialog mounts the Turnstile widget in the signup view when
  NEXT_PUBLIC_TURNSTILE_SITE_KEY is set, passes the token via
  signUp.email({ fetchOptions: { headers: { "x-captcha-response" } } }),
  resets the widget on view-leave and every error path so single-use
  tokens never get reused.
- .env.example documents both env vars plus Cloudflare's published
  always-pass / always-fail test keys for local dev.
- New tests/unit/signup-defenses.test.ts asserts plugin config shape,
  the four skip predicates, the production-throw, and rate-limit
  rule precedence + bypass behavior.

Verification:
- pnpm vitest tests/unit/signup-defenses.test.ts -> 8/8 pass
- pnpm type-check -> 0 errors
- pnpm check -> remaining errors are pre-existing in unrelated files

OAuth (GitHub/Google) signup is intentionally not captcha-gated; the
provider already authenticates the user.
next build evaluates route modules during "Collecting page data"
with NODE_ENV=production but no runtime secrets injected. The hard
assertion at module load was therefore crashing the Docker build
even though it correctly fires at server boot.

Skip the assertion when NEXT_PHASE === "phase-production-build".
The same check still fires under phase-production-server (next
start) and under any custom server that doesn't set NEXT_PHASE,
so production runtime stays fail-fast.

New unit test covers the build-phase skip.
Retires the block_user_signup_security_2026_05_21 trigger + function
installed via migration 0084 during the 2026-05-21 incident. App-layer
signup defenses (Cloudflare Turnstile gated on /sign-up/email plus
per-IP signup rate limit) replace it.

Identifiers in the DROP statements are explicit. The sibling
block_executions_security_2026_05_21 trigger (migration 0082) and the
cascade-deactivation triggers (migration 0085) are untouched.

Verified locally against a fresh DB:
  After 0086: only block_executions_security_2026_05_21 remains.
  block_user_signup_security_2026_05_21 and its function are gone.

Merge gates:
  - KEEP-621 (app-layer defenses) live in prod.
  - Detection layer live.
  - Staging soak shows acceptable Turnstile success rate.
DROP TRIGGER IF EXISTS matches by name. If the trigger was renamed
ad-hoc on prod (the same drift pattern TECH-11 is reconciling), the
drop silently no-ops and the signup gate stays live with nobody
noticing. The pg NOTICE forces the mismatch into deploy logs so
on-call sees it during the rollout.

Doesn't block the migration - this is observability, not a gate.
Catches future edits that would widen the DROP target, remove the
drift NOTICE, or accidentally touch the sibling triggers that must
stay in place (block_executions_security_2026_05_21,
block_database_integrations_security_2026_05_21 - though 0083 already
retired the latter, the assertion is defense against
re-introduction).

Strips SQL line-comments before checking identifier names so the
migration's prose can legitimately reference sibling triggers when
explaining why they're left alone.
Migration 0085 adds a single cascade_user_deactivation trigger, not
a family of "cascade-deactivation triggers". Comment now names the
trigger explicitly so a future reader grepping for it lands in the
right place.

Test also gains a structural assertion that the cascade trigger
isn't named in any DDL statement - same defense as for the
executions kill-switch.
Defends the OAuth + email-OTP re-entry vector. New sessions whose
login-time country (CF-IPCountry attested) differs from the user's
recent session history are flagged in sessions.risk_flags_json; if
the user has TOTP enrolled the session is minted with
requires_mfa=true and getDualAuthContext returns a mfa_required code
on every subsequent request until the user passes the step-up at
/verify-mfa.

Server:
- twoFactor() Better Auth plugin wired with TOTP + backup codes.
  Email-OTP-as-second-factor is intentionally left without a sender
  since email is already our primary login factor.
- assessLoginRisk reads CF-Connecting-IP / CF-IPCountry and compares
  against the user's last 25 session attestations. New-country logins
  flag the session; null geo (local dev / direct origin) is treated
  as inconclusive.
- session.create.before computes the signal and merges
  requires_mfa + risk_flags_json into the session insert.
- Custom enrollment endpoints bypass the plugin's password gate so
  OAuth / email-OTP users can enroll: /api/user/totp/setup,
  /enroll (atomic verify + backup-code mint, forwards rotated session
  cookie), /backup-codes (TOTP-gated regenerate), /verify-stepup,
  /disable, /status. Anonymous users (name "Anonymous" / temp- email)
  are refused.
- two_factor.user_id UNIQUE + INSERT ... ON CONFLICT DO UPDATE keeps
  enrollment idempotent under concurrent /setup calls.

Client:
- TwoFactorSection in Settings -> Security shows compact status row.
- TotpSetupDialog: linear three-step wizard (scan -> verify -> save
  codes) with horizontal numbered StepIndicator matching the deploy-
  safe wizard. The save-codes step locks every close path (X, ESC,
  outside-click, beforeunload) until copy + download + retype-confirm
  pass.
- TotpManageDialog: name, enrolled date, regenerate codes (TOTP-
  gated), disable (TOTP-gated).
- /verify-mfa page redirects out unless the current session has
  requires_mfa=true; on success clears the flag and stamps
  mfa_verified_at.

Schema (migration 0086):
- two_factor (id, user_id UNIQUE, secret, backup_codes nullable,
  name, enrolled_at) with secrets and codes encrypted by
  BETTER_AUTH_SECRET.
- users.two_factor_enabled flag.
- sessions.requires_mfa / mfa_verified_at / risk_flags_json.

Tests: tests/unit/login-risk.test.ts covers no-CF-headers,
first-attestation, known-country, new-country, malformed JSON, and
current-country exclusion paths.
Three layered checks on the two highest-risk wallet actions:
  - role = "owner" in the active org (admins are no longer accepted)
  - users.two_factor_enabled = true (action refused if not enrolled)
  - sessions.requires_mfa = false (step-up cleared on this session)

All three combine in lib/middleware/owner-mfa-guard.ts and return
distinct error codes (not_owner / mfa_not_enrolled / mfa_pending) so
the client can branch UX without parsing strings.

Routes updated:
  - /api/user/wallet/withdraw
  - /api/user/wallet/export-key/request
  - /api/user/wallet/export-key/verify (mirrors the gate on /request
    so an attacker who somehow obtains the wallet-inbox OTP can't
    skip straight to /verify)

The wallet-creator and wallet-inbox-email-OTP checks on the export
flow remain in place; this PR adds the owner+MFA layer on top.
Adds two helpers alongside the existing requireOwnerWithMfa:
  - requireAdminOrOwnerWithMfa: same shape, accepts either role.
  - requireMfaEnrolled: self-scoped (no org role check), gates only
    on users.two_factor_enabled and sessions.requires_mfa.

Switches all three helpers to a discriminated GuardOk | GuardError
return so callers can narrow with `if (!guard.ok)` instead of an
`"error" in guard` shape that didn't survive TypeScript narrowing
on the success branch.

Updates the three existing callsites (withdraw, export-key request,
export-key verify) to the new narrowing. The actions they protect
are unchanged from the previous commit — owner-only + MFA enrolled
+ step-up cleared — only the narrowing syntax changes.
Protects:
  - POST /api/keys           (mint a new organization API key)
  - DELETE /api/keys/[keyId] (revoke / soft-delete an organization
                              API key)

Both routes now require role = admin or owner in the active org,
users.two_factor_enabled = true, and sessions.requires_mfa = false
before the action is honored. Returns 403 with code one of
not_admin_or_owner / mfa_not_enrolled / mfa_pending.

Why both legs symmetrically:
  - Minting an org API key produces a long-lived credential that
    bypasses any future session MFA.
  - Revoking is the inverse; without the same gate an attacker on
    a session could rotate keys (revoke + mint) to lock owners
    out and substitute their own.
Protects:
  - POST /api/api-keys           (mint a new user-scoped API key)
  - DELETE /api/api-keys/[keyId] (delete a user-scoped API key)

User-scoped keys (wfb_ prefix) live on the apiKeys table keyed by
userId, not organizationId, so the owner-role gate does not apply.
Both legs now require users.two_factor_enabled = true and
sessions.requires_mfa = false via requireMfaEnrolled. Returns 403
with code mfa_not_enrolled or mfa_pending.

User-scoped keys are the highest-leverage MFA-immune credential a
stolen session can produce — once minted, the key authenticates
out-of-band of the session entirely. Gating creation is critical;
gating deletion keeps the create/delete pair symmetric so a
compromised session cannot rotate the user's keys.
Protects:
  - POST /api/agentic-wallet/[id]/approve (authorize a pending
                                            fund-moving operation)
  - POST /api/agentic-wallet/[id]/reject  (cancel a pending operation)

Existing checks remain in place: the caller must be the wallet's
linkedUserId (proves wallet ownership). On top of that, both legs
now require users.two_factor_enabled = true and
sessions.requires_mfa = false via requireMfaEnrolled.

Approve is the obvious target — it commits a fund movement. Reject
gets the same gate because a stolen session could otherwise DoS
legitimate requests by pre-emptively cancelling them, and keeping
the pair on identical authorization rules prevents skew if either
endpoint changes later.

Out of scope here: /api/agentic-wallet/link. That route already
verifies HMAC proof-of-possession of the wallet's provisioning
secret in addition to the session; HMAC is a stronger factor than
TOTP for that flow.
Adds lib/client/handle-guard-error.ts. Reads a 403 response from any
of the routes gated by lib/middleware/owner-mfa-guard.ts, parses the
discriminated GuardError body, and dispatches by `code`:

  - mfa_not_enrolled  -> calls onEnrollMfa (callsite opens Settings)
  - mfa_pending       -> calls onPendingMfa(currentPath) (callsite
                          navigates to /verify-mfa?next=<currentPath>)
  - not_owner         -> toast with role explanation
  - not_admin_or_owner -> same

Returns true when handled so the caller stops without surfacing a
generic error toast on top. Returns false for non-403 responses and
for 403s without a recognized code, leaving the caller's existing
error path intact.

Callsites are wired in subsequent commits.
Protected route this UI calls:
  - POST /api/user/wallet/withdraw

Before: a 403 from the owner+MFA gate surfaced as a raw "Withdrawal
failed" toast carrying the server's error string. The user had no
path forward.

Now: routes the response through handleGuardError. If the server
returned a recognized GuardError code, the helper handles it and
the modal returns to the input state instead of dropping into the
red "error" state. Specifically:

  - mfa_not_enrolled  -> opens Settings overlay so the user can enable
                          TOTP without leaving the wallet flow
  - mfa_pending       -> redirects to /verify-mfa?next=<current path>
  - not_owner         -> toast explaining only owners can withdraw

Non-guard 403s and other failures still drop through to the
existing error branch.
Protected routes this UI calls:
  - POST /api/user/wallet/export-key/request (send wallet-inbox OTP)
  - POST /api/user/wallet/export-key/verify  (consume OTP + return key)

Both legs are gated by requireOwnerWithMfa on the server, so either
can return a 403 with code not_owner / mfa_not_enrolled / mfa_pending.
Routes both fetch calls through handleGuardError with a shared
options object:

  - mfa_not_enrolled  -> closes the export dialog, opens Settings so
                          the user can enable TOTP
  - mfa_pending       -> closes the dialog, redirects to /verify-mfa
                          with the current URL as `next`
  - not_owner         -> toast explaining only owners can export

On the /request leg the dialog falls back to "idle". On /verify it
falls back to "otp" so the user can retry with a refreshed code if
the gate failure was about step-up (which they can clear in another
tab) rather than enrollment.
Protected routes this UI calls:
  - POST   /api/keys            (create org API key)
  - DELETE /api/keys/[keyId]    (revoke org API key)
  - POST   /api/api-keys        (create user API key)
  - DELETE /api/api-keys/[keyId] (delete user API key)

The same overlay handles both tabs (Webhook = user-scoped,
Organisation = org-scoped). Each underlying endpoint is gated by
requireAdminOrOwnerWithMfa or requireMfaEnrolled on the server.

Routes both create and delete through handleGuardError. On a
recognized 403:
  - mfa_not_enrolled  -> closes the overlay stack and opens Settings
                          so the user can enroll TOTP
  - mfa_pending       -> closes the stack and redirects to /verify-mfa
                          with the current URL as `next`
  - not_admin_or_owner -> toast explaining the role requirement

Non-guard errors keep their existing toast paths intact.
Replaces the wallet-inbox email-OTP factor on the export-key flow
with a TOTP fresh challenge using the same authenticator the user
enrolled for KEEP-619. The owner+MFA gate (requireOwnerWithMfa)
stays in place; this is purely swapping the second factor.

Server:
  - DELETED /api/user/wallet/export-key/request
      No more email send, no more keyExportCodes write.
  - REWROTE  /api/user/wallet/export-key/verify
      Accepts { code: <TOTP> }. Verifies via auth.api.verifyTOTP
      (the same primitive used by /verify-mfa and the manage
      dialog). Wallet-creator check + Turnkey export unchanged.

UI: components/overlays/wallet/export-private-key-button.tsx
  - Removed the "requesting" step that fired the email.
  - Dialog now opens directly to the TOTP code input.
  - Step states reduced from 5 to 4 (idle/totp/verifying/done).

Why drop the wallet-inbox factor:
  - Single source of truth for second factors. Users were trained
    on TOTP via the KEEP-619 enrollment UI; the wallet-inbox flow
    was a separate mental model with its own UX.
  - The wallet-inbox email was often the user's own account
    address anyway, so the factor was effectively single-channel
    (compromise the user's mailbox -> compromise both).
  - TOTP secret is encrypted by BETTER_AUTH_SECRET and bound to
    a physical device; email OTPs traverse SMTP and live in an
    inbox the attacker can already read after most compromises.

The owner+MFA passive gate (role=owner, MFA enrolled, step-up
cleared) plus this fresh TOTP challenge gives the export flow
four checks in total: 1) authenticated session, 2) owner role,
3) MFA enrolled on the account, 4) fresh code from authenticator.

The integration test suite that exercised the email-OTP path
(tests/integration/wallet-export-key-route.test.ts) is removed;
the new endpoint is a thin wrapper around auth.api.verifyTOTP
which is itself covered upstream.
joelorzet added 12 commits May 25, 2026 14:40
Adds /api/auth/strict-signin/{start,complete} so credential sign-in for
TOTP-enrolled users verifies password, email OTP, and TOTP atomically
before any session row is minted. Blocks Better Auth's /sign-in/email-otp
for TOTP users so the session-on-OTP path can't be used as a TOTP bypass.

Dialog now drives the three-step flow (password -> email OTP -> TOTP) and
hard-reloads on success so the session cookie is picked up. Alert dialog
rewritten as a thin Dialog wrapper to avoid the Turbopack issue with the
Radix umbrella export. user-menu no longer unmounts AuthDialog mid-flow
during session refetch.
… screen

/strict-signin/start now checks users.two_factor_enabled. If false it
signs the user in directly via auth.api.signInEmail and forwards the
session cookies. The dialog reads signedIn=true and hard-reloads,
skipping the email-OTP + TOTP steps. Users with TOTP enrolled continue
through the existing dual-factor flow.

Forced-enrollment surface no longer redirects on onEnrolled (that fires
right after verify, before the save-codes phase). The dialog stays open
through the save-codes phase; on close, the page swaps to a success
card with a 10s auto-redirect and a Continue button so the redirect is
acknowledged rather than abrupt.
Extracts the dual-factor (authenticator + email OTP) prompt into a
single component and the response-code handling into a hook, replacing
six inlined copies across withdraw, export private key, api keys
(create + revoke), settings (email change), change password, and
deactivate account.

The hook owns totpCode / emailOtp / awaitingEmailOtp state plus a
handleResponse that translates the server's factors_required /
mfa_code_invalid / email_code_invalid codes into the correct local
state transitions. Consumers pass an onError sink (toast or local
setError) so call sites stay terse.

Also strips the diagnostic console.logs that were added during the
strict-signin wiring; logSystemError calls for real failures are kept.
…itive-actions

feat(auth): owner+MFA gate on withdraw and key export
# Conflicts:
#	components/auth/dialog.tsx
#	drizzle/meta/_journal.json
…factor

Replaces the string-equality compare of the decrypted email OTP with
crypto.timingSafeEqual in both the strict-signin endpoint and the
requireDualFactor helper. The 6-digit alphabet made the timing channel
narrow, but the canonical primitive is constant-time.

Adds a sliding-window counter (10 attempts per 15 minutes, per (user,
action)) in front of requireDualFactor. It covers both empty-body
floods that mint email OTPs at the victim's inbox and brute-force
guesses against a live verifications row. The counter is wiped on a
successful both-factor verify so a legitimately confused user does
not get locked out after a few typos.

In-memory per pod, matching lib/mcp/rate-limit.ts; effective limit is
LIMIT * num_replicas. Migrate to Redis when replica count starts to
matter.
CodeQL js/request-forgery (critical) on the strict-signin/start
endpoint: the OTP-send call used new URL(request.url).origin to
build the target, which reflects the request's Host header. An
attacker setting Host: evil.com would have caused the server to
POST the sign-in OTP request to the attacker's domain.

Replaces the self-fetch with a direct auth.api.sendVerificationOTP
call. Better Auth's plugin pipeline runs in-process so the
encrypted sign-in-otp-<email> verifications row is still seeded
the same way the HTTP endpoint would, with no host-controlled
URL construction in the loop.
…DualFactorInput

Disabling TOTP from the manage dialog now runs through requireDualFactor
with action totp_disable, matching the gate on withdraw / export-key /
api-keys / password-change. A stolen session plus one factor can no
longer turn off MFA. The manage dialog Disable view swaps the single
authenticator-code input for the shared DualFactorInput so the user
sees both factors required.

DualFactorInput stops hiding the email field. Both inputs render from
the start; the email input is disabled with a "we will email a code
after you submit" hint until the server emits factors_required, at
which point it becomes editable and the hint switches to "enter the
code we just emailed". Surfacing both factors up front matches the
server-side contract and stops the "wait, where is the email field?"
confusion that the lazy-reveal pattern caused.
Adds a shared two-step DualFactorSteps wizard (email then authenticator)
mirroring TotpSetupDialog's numbered indicator pattern, and threads it
through every dual-factor surface: withdraw, export-key, api-keys
(create + revoke), settings deactivate, change-password, totp-manage
disable, and the new /verify-mfa step-up dialog. Each consumer owns a
prereq phase (label, DEACTIVATE confirmation, etc.) and hands off to
the wizard for the code steps. The wizard pre-fetches the email-OTP
silently on mount so the inbox lands the code before the user types
anything; a Resend control bypasses the one-shot guard.

Fixes a class of bugs around the requires_mfa flag. Without declaring
session.additionalFields in betterAuth({...}), Better Auth was
silently dropping the requiresMfa / mfaVerifiedAt / riskFlagsJson
fields from session inserts so the proxy step-up gate never fired,
meaning TOTP-enrolled OAuth users sailed past /verify-mfa entirely.
Declaring the fields restores the contract.

verify-stepup now runs requireDualFactor with action session_stepup so
the step-up demands both an authenticator code and an emailed code,
symmetric with the credential strict-signin atomic chain.

OAuth-pre-MFA sessions get a 10-min TTL so a stolen cookie expires
before the legitimate user finishes step-up. Follow-up will refactor
OAuth callback to defer session creation until both codes verify.

Other fixes:
- handleSignIn: missing setLoading(false) on the 401 branch was
  leaving the form permanently busy on a wrong-password attempt
- strict-signin/route.ts: email-OTP row is now consumed AFTER
  signInEmail+verifyTOTP succeed instead of before, so a transient
  Better Auth flake does not lock the user out of retries
- totp-disable runs through requireDualFactor + a step wizard so
  disabling MFA can no longer be done with a stolen session and a
  single TOTP code
When a TOTP-enrolled user finishes the provider OAuth dance, the
callback wrapper deletes the session row Better Auth just minted,
clears the session cookie, and issues a signed pending_oauth_mfa
cookie instead. The cookie carries only userId/email/redirect under
an HMAC; on its own it has no auth power. /api/auth/oauth-mfa-finalize
verifies it alongside both codes via requireDualFactor (action
oauth_signin) and only then writes a fresh session row with the
hashed token plus sets the real session cookie. /verify-mfa detects
the cookie and routes the wizard to the finalize endpoint instead
of the existing-session step-up path.

The shared layout now drops the navigation sidebar + workflow
toolbar on /verify-mfa and /enroll-mfa so no auth-aware UI renders
on those routes, regardless of whether a session row exists. Matches
the security posture of the strict-signin atomic chain: no usable
session exists for the user-agent until both factors verify.
Better Auth's verifyTOTP flips users.two_factor_enabled = true AND
rotates the session. The new (rotated) session row then runs through
session.create.before, which now reads two_factor_enabled = true and
sets requires_mfa = true. Result: the user enrolled, but the very
next request bounces to /verify-mfa, even though they just proved
TOTP in the same enroll call.

Fix: after the verifyTOTP success and backup-codes persist, update
every session belonging to this user to requires_mfa = false and
mfa_verified_at = now. The freshest possible TOTP proof is the one
we just verified inside this request, so the step-up gate has been
satisfied for those sessions.
…n-step-up-mfa

feat(auth): risk-based step-up MFA with TOTP
joelorzet added 4 commits May 25, 2026 18:59
Credential and OAuth signups previously minted a session immediately
on email verification and relied on the proxy gate to bounce the
user to /enroll-mfa for forced enrollment. Between the verify and
the enroll step a usable session existed for an account that had no
second factor.

This branch closes that window by deferring the session until TOTP
is enrolled. The flow is now symmetric with the strict-signin
atomic chain: no usable session for the user-agent until both
verification steps complete.

  - Credential signup: after emailOtp.verifyEmail succeeds, the
    dialog POSTs /api/auth/finish-credential-signup instead of
    signIn.email. That endpoint re-verifies the password, confirms
    users.email_verified, and issues a signed pending_signup_mfa
    cookie. No session row is created.

  - OAuth signup: interceptOauthCallback in /api/auth/[...all]
    branches on users.two_factor_enabled. Users who never enrolled
    take the new pending-signup path: the session Better Auth just
    minted is deleted, the session cookie is cleared, the
    pending_signup_mfa cookie is set, and the browser is redirected
    to /enroll-mfa. Already-enrolled users keep the existing
    pending_oauth_mfa flow that defers to /verify-mfa.

  - /enroll-mfa detects which auth shape is on the request (session
    vs pending_signup_mfa) and renders the wizard in the matching
    mode. The setup/enroll endpoints accept both auth shapes via a
    new resolveEnrollMfaCaller helper.

  - /api/user/totp/enroll on the pending-signup path verifies the
    TOTP code directly against the encrypted secret in two_factor
    (no session to chain through verifyTOTP with), flips
    users.two_factor_enabled, persists backup codes, and mints the
    first session for the account in one transaction. The session
    cookie returned is the FIRST usable session for the user.

pending_signup_mfa is HMAC-signed under BETTER_AUTH_SECRET with a
30-minute TTL. It carries only userId / email / provider /
redirect. The cookie alone has no auth power: it only unlocks
/enroll-mfa and the setup/enroll endpoints, both of which still
require generating + verifying a TOTP code via the user's own
authenticator app.
… enroll

Better Auth signs the better-auth.session_token cookie via hono's
setSignedCookie: HMAC-SHA256(rawToken, BETTER_AUTH_SECRET) base64,
joined to the raw token with a literal "." separator. getSession
verifies that signature and rejects any value that does not match.

The two routes that mint sessions manually (oauth-mfa-finalize and
the pending-signup branch of totp/enroll) were writing unsigned raw
tokens, so the cookie looked valid client-side but was treated as
absent on every subsequent request and the user landed anonymous
despite the sessions row existing. New helper
signSessionCookieValue in lib/auth-session-token-hash.ts produces
the same format hono emits; both routes use it now.

Also the enrollment success card's Continue button used to race the
browser's cookie commit on a fast click: window.location.assign
fired before the cookie was sent on the next request. The button
now hits /api/auth/get-session first to confirm the session is
live; if not yet visible, a 500ms grace plus the SSR-on-target
fallback catches the race.
Two follow-ups on the no-session-until-MFA path inside /api/user/totp/enroll:

1. Rate limit the TOTP verify. The pending-signup branch is reached
   while the caller is unauthenticated by session, so Better Auth's
   per-route plugin rate limiter does not protect it. Without a
   counter, an attacker holding a stolen pending_signup_mfa cookie
   could brute the 6-digit code over the full 30-minute cookie TTL.
   Use the same sliding-window primitive that gates requireDualFactor,
   keyed on (userId, "totp_enroll"). The counter is wiped on a
   successful verify so a typo burst does not lock out a legitimate
   user.

2. Make the four writes that finalize enrollment a single
   transaction (backup-codes update + users.two_factor_enabled flip
   + sessions insert). A mid-flight failure previously could leave
   the account with two_factor_enabled = true but no session row,
   while the pending cookie was about to be cleared, bricking the
   user.
…-email-verify

fix(auth): no session until MFA is enrolled on new signups
@eskp eskp merged commit c658d73 into prod May 26, 2026
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants