Skip to content

release: to prod#1389

Merged
suisuss merged 38 commits into
prodfrom
staging
May 28, 2026
Merged

release: to prod#1389
suisuss merged 38 commits into
prodfrom
staging

Conversation

@joelorzet

Copy link
Copy Markdown

No description provided.

joelorzet added 23 commits May 25, 2026 19:40
…token

Schema:
  - sessions.requires_ip_verification + sessions.pending_ip columns
    (reserved for forensic capture; the active flow is atomic and
    never writes a session row for an untrusted IP).
  - user_trusted_ips table, unique on (user_id, ip), records the
    countries the user has verified from.

lib/security/login-risk.ts:
  - New assessIpTrust(userId): returns trust + ip + country.
  - First-IP-free rule mirrors the existing first-geo-attestation
    rule. A user with zero trusted IPs gets the first one auto-
    trusted, so a brand-new account doesn't deadlock on /verify-ip
    before TOTP is enrolled.

lib/auth.ts session.create.before:
  - Upserts the request IP into user_trusted_ips when the IP is
    trusted (idempotent via the unique constraint). Refreshes
    last_seen_at on repeat sign-in.
  - Declares requires_ip_verification + pending_ip in
    session.additionalFields so Better Auth doesn't filter them
    out of inserts.

lib/ip-verify-token.ts:
  - HMAC-signed URL token to embed in the verify-ip email, bound
    to (userId, sessionId, ip) with a 15-min TTL.

Gate wiring (proxy redirect, /verify-ip page, /api/user/verify-ip,
email send) comes in follow-up commits on this branch.
When a fully-MFA-verified sign-in arrives from an IP that the user
has never signed in from, defer session minting and route the user
to /verify-ip for a second dual-factor exchange against the same
IP. No session row exists between the original attempt and the
verify-ip completion; a stolen pending_ip_verify cookie cannot
satisfy the gate from a different network because the IP attested
at issue time is pinned in the payload and rechecked.

strict-signin (credential flow):
  - After password + email OTP + TOTP all pass, call assessIpTrust.
    On reason="unknown" consume the email-OTP row, mint a signed
    pending_ip_verify cookie, return {redirect:/verify-ip} without
    chaining signInEmail/verifyTOTP.
  - Also clear requires_mfa + stamp mfa_verified_at on the freshly
    minted session in the trusted-IP path so the proxy gate does
    not bounce the user back to /verify-mfa for a redundant step-up
    after they already satisfied all three factors atomically.

/api/user/verify-ip:
  - Reads + verifies the pending_ip_verify cookie. Rejects when the
    current request IP does not match the cookie's pinned IP.
  - Empty-codes POST mints + emails a fresh OTP under
    mfa:verify_ip:<userId>, identical contract to requireDualFactor.
  - Full POST verifies email OTP (decrypt + constant-time compare)
    and TOTP (verifyUserTotp against the encrypted secret, since the
    caller has no session for auth.api.verifyTOTP), then mints a
    fully-MFA-verified session in a transaction with the
    user_trusted_ips upsert.

/verify-ip page:
  - Locked-open Dialog, reuses DualFactorSteps wizard + the
    useDualFactorState hook for the OTP send / handleResponse
    contract.
  - Shows the resolved IP (and CF country when available) as a
    click-to-copy chip so the user can sanity-check the source.

proxy.ts:
  - Exempts /verify-ip and /api/user/verify-ip from the MFA gate;
    callers arrive without a session cookie so the gate already
    skips them, but the explicit allow-list keeps the routing
    predictable if a stale session is also present.

dev-only login-risk fallback:
  - resolveLoginIp falls back to X-Forwarded-For / X-Real-IP when
    not in production so VPN / NAT changes during local testing
    still exercise the gate. Production reads only CF-Connecting-IP.

lib/ip-verify-token.ts is removed; the URL-link variant in the
original plan is superseded by the cookie-gated dual-factor exchange.
Better Auth's built-in rateLimit config only covers routes handled
by the Better Auth catch-all, not the custom /api/auth/strict-signin
Next.js route. Without a per-request limit, a caller with a leaked
password can brute force TOTP (1M code space, +/-1 window) at
arbitrary rate; ~333K attempts on average is well within reach for
a botnet.

Reuses checkDualFactorRateLimit with the lowercased email as the
key (the user-id lookup hasn't happened yet) and action
"strict_signin". On success at either branch (IP-trusted full
mint, or /verify-ip handoff) resetDualFactor wipes the window so a
user with a few typos who finally lands the right codes does not
get locked out.

Sibling /verify-ip endpoint was already rate-limited via the same
helper under action "verify_ip".
When this feature ships, user_trusted_ips is empty for every
existing user, so the first-IP-free rule in assessIpTrust also
covers their first post-deploy sign-in: whatever IP they are on
gets auto-trusted and /verify-ip only fires for subsequent IPs.
Comment now spells out why we are not backfilling from
sessions.ip_address (raw, no CF attestation, null on older rows).
Better Auth's `/sign-up/email` returns a synthetic 200 + fake
user object when the email already belongs to an account
(anti-enumeration: shouldReturnGenericDuplicateResponse). When
`emailOTP.sendVerificationOnSignUp` is true the plugin
unconditionally fires an OTP off that synthetic response, so the
OTP lands in the inbox of the existing user. Anyone with mailbox
access could then submit the OTP to `/email-otp/verify-email`,
which updates the existing user's `emailVerified` flag and fires
`afterEmailVerification` against their row.

Move the OTP dispatch out of the plugin and into the
`databaseHooks.user.create.after` hook. That hook fires only on
an actual user-row insert, so the duplicate-email path skips OTP
delivery entirely (no row, no hook, no OTP). OAuth signups have
`emailVerified = true` at insert time (provider attested) so the
`!user.emailVerified` guard isolates the credential path.

The hook calls `auth.api.sendVerificationOTP` with type
"email-verification", which goes through the same encrypted
verifications row + sendVerificationOTP callback path as the
plugin's auto-send. Email delivery, OTP storage and the
`/email-otp/verify-email` verifier are unchanged; only the entry
point shifts.

Existing isFreshSignup guards on the OAuth notification path
and on `emailVerification.afterEmailVerification` keep their
behaviour: the underlying Better Auth inconsistency (OTP verify
firing the hook on already-verified users) is still present
upstream, but with no OTP delivered to existing inboxes the
attack chain has no first step.
# Conflicts:
#	app/api/auth/strict-signin/route.ts
The previous version cleared the gate with WHERE user_id, which
silently lifted requires_mfa on every other session the user owns.
A stale unverified session on another browser would be unblocked
by an unrelated successful strict-signin elsewhere.

Parse the raw session token Better Auth wrote into the verifyTOTP
Set-Cookie, hash it the same way the adapter does, and target the
update by sessions.token. The new session has exactly one row that
matches the hash; every other session's gate state is untouched.

If we cannot find the session_token cookie in the response (should
not happen but guard anyway) log it instead of running a wide
update. Either way the proxy gate either sees the cleared flag or
keeps the user at /verify-mfa, which is the safer fail direction.
The atomic flow in strict-signin / oauth-mfa-finalize / verify-ip
defers session minting whenever IP trust is unknown -- nothing
writes to these two columns, ever. They were originally planned
as a stepup-style fallback that we no longer need.

Remove from sessions table, from session.additionalFields in
Better Auth config, from the migration that introduced them, and
from the mock fixture in the rpc-preferences route test that was
mirroring the field shape. user_trusted_ips and the IP-trust
upsert stay intact.

Also: ip-trust upsert failure now goes through logSystemError so
diagnostics surface in the standard error pipeline instead of via
bare console.error.
Settings -> Security gains an "Active sessions" section listing
every session row owned by the caller: device label parsed from
the user-agent, IP address, attested country (when CF passed
one), creation and last-active timestamps. The session that
matches the caller's cookie is tagged "This device" and offers no
revoke action; sign-out covers that case and avoids a
"why-did-the-dialog-close" surprise.

Each other row carries a Revoke button. Clicking it opens the
shared DualFactorSteps wizard targeting
/api/user/sessions/:sessionId/revoke, which:
  - requires the caller to be signed in and not anonymous,
  - refuses to revoke the current session,
  - gates the delete behind requireDualFactor (action
    "session_revoke") so a stolen cookie alone cannot nuke a
    user's other devices,
  - deletes the session row scoped to (id, userId) and returns
    404 if the row does not exist or belongs to someone else.

The list endpoint /api/user/sessions returns every active session
row for the caller (expires_at > now()) with the country decoded
out of risk_flags_json, sorted by last-active timestamp.

UI follows the existing Card-section pattern used by
TwoFactorSection and ChangePasswordSection. Relative timestamps
via Intl.RelativeTimeFormat so "Signed in 3 hours ago" reads
right in the user's locale.
The active sessions panel previously showed at most a 2-letter
country code derived from CF-IPCountry. For sessions that didn't
go through Cloudflare (local dev, VPN tunnels, XX/T1 attestations)
the country slot was empty and the user couldn't tell where a
listed session signed in from. Tightening the panel beyond "you
recognise this IP, right?" needs a real label.

Add an IP-to-location helper that hits ipapi.co/<ip>/json/ (no key
on the free tier) and caches per-IP for 24h with a 2-second
timeout. CF-IPCountry stays authoritative for country when
present; the external lookup layers region + city on top, and
serves as the full fallback when CF is absent.

risk_flags_json gains region + city alongside country, so the
data is captured at session-creation time and the panel reads it
straight from sessions without any per-request fetch. Older rows
without those fields decode as null, matching the panel's
existing "no location" fallback. user_trusted_ips.country is
unchanged.

The sessions list endpoint returns a formatted location string
("San Francisco, CA, US" / "Buenos Aires, BA, AR" / just "AR" if
only country resolved). The panel renders that under the IP
instead of the bare country code.

Test fixtures updated for the new LoginRiskSignal shape; existing
unit + integration tests still pass.
Two correctness bugs in the IP / session-trust path.

1. Better Auth signs the session cookie via hono's setSignedCookie
   (cookies/index.mjs:128), so the wire value is
   `<rawToken>.<base64HmacSignature>`. The cookie extractor in
   strict-signin and in the OAuth-callback interceptor was hashing
   the whole signed string, then comparing against sessions.token
   which stores hashSessionToken(rawToken) without the signature.
   The lookup never matched: strict-signin's requires_mfa clear ran
   against zero rows so the proxy bounced the freshly-signed-in
   user to /verify-mfa for a redundant step-up, and the OAuth
   interceptor's session-delete branch could miss the row and
   leave a real session cookie in place when it meant to defer.
   Strip the signature suffix by `lastIndexOf('.')` before hashing
   so we get the raw token back. Comment cites the upstream source
   in case Better Auth changes the cookie format.

2. assessIpTrust used to compare textual IPs byte-for-byte. CF
   zeroes the lower 64 bits of any IPv6 in CF-Connecting-IP for
   privacy, so the same user produces a different address every
   SLAAC reshuffle. A first sign-in stored as a full IPv6 (e.g.
   from a direct request before CF entered the path) would never
   match a subsequent CF-fronted sign-in from the same network,
   and /verify-ip would fire on every visit. Bucket IPv6 to the
   /64 prefix at both write (user_trusted_ips, pending_ip cookie
   payload, session.ip_address) and read (lookup in assessIpTrust)
   sides. IPv4 unchanged. New helper lib/security/ip-normalize.ts
   pads each prefix group to 4 hex chars and zeros the host part,
   so a stored row and a live lookup produce the same canonical
   string regardless of whether CF anonymized or not.

Also: rename the IpTrust reason "unknown" -> "untrusted". The
address isn't unknown; we know exactly what it is. It just isn't
in this user's trust list yet. The new label reads correctly in
diagnostic logs and matches what /verify-ip is actually gating.
Three small additions on top of the location-enrichment landing.

1. /verify-ip session insert now writes risk_flags_json with the
   resolved country/region/city from the pending cookie + IP
   lookup. Without it the row stayed null because /verify-ip
   inserts via Drizzle directly and the session.create.before
   path that normally populates risk flags never runs for that
   code path. Result: the active-sessions panel can label
   /verify-ip-minted sessions the same way it labels normal
   sign-ins.

2. Sessions list endpoint runs formatIpForDisplay on each row.
   /64-normalized IPv6 reads `2803:a920:289b:d6c0` in the panel
   instead of `:0000:0000:0000:0000`. /verify-ip page passes the
   chip through the same formatter so the cookie's pinned IP
   reads cleanly when the dialog displays it.

3. Active-sessions row IP is now a button. Click copies the IP
   to clipboard, flashes emerald for two seconds, toasts. Mirrors
   the existing chip behavior on the /verify-ip dialog. Useful
   when an operator wants to drop the IP into a WHOIS lookup or
   internal abuse tool without retyping a 39-char IPv6.

4. /verify-ip dialog copy softened from "We don't recognise the
   network..." to "We noticed you're signing in from a new
   network. Please confirm it's you...". Same security gate,
   less accusatory for the legitimate user it almost always
   actually is.
Two concurrency hardenings on lib/security/resolve-country.ts.

1. In-flight dedup. N concurrent resolveLocationFromIp calls for
   the same IP previously triggered N fetches at ipapi.co. The
   free-tier limit is ~1000/day per pod and an unlucky multi-tab
   sign-in burst could chew through it for one address. Track
   in-flight promises in a parallel Map; concurrent callers for
   the same IP coalesce onto the existing promise and the
   response settles once.

2. Cache TTL jitter. Every entry was set with the same exact
   24h expiry, so a burst of distinct IPs resolved during onboarding
   or a load test would all expire together a day later and
   stampede ipapi.co at the same instant. Apply +/- 10% random
   jitter on every write so subsequent expirations spread across
   ~4.8h. Math.random is fine here: the goal is decorrelation,
   not unpredictability.

The fetch + parse logic split into a small `fetchLocation` helper
so the cache + dedup orchestration in `resolveLocationFromIp`
reads top-to-bottom.
verify-ip route was inlining a JSON.stringify of the risk_flags
shape plus a direct resolveLocationFromIp call. Two problems:
the JSON shape drifts from the Better Auth path that uses
serializeRiskFlags, and any callsite that wants the shape has to
know which provider sits behind resolveLocationFromIp.

New buildRiskFlagsJsonForIp(ip, attestedCountry?) in login-risk.ts
returns a Promise<string> typed in terms of session.risk_flags_json,
fully encapsulating both the location-resolution provider AND the
serialized JSON layout. Swapping ipapi.co for MaxMind, an internal
service, or any other source only requires touching
resolve-country.ts; verify-ip and any future callsite stay the
same. JSON shape changes go through serializeRiskFlags and stay
consistent across every row writer.
…on-email-link

# Conflicts:
#	drizzle/meta/_journal.json
feat(auth): atomic IP-verification gate for unknown networks
The OAuth interceptor was building MFA redirect targets from new URL(req.url).origin. In Next.js standalone mode bound to HOSTNAME=0.0.0.0, that resolves to http://0.0.0.0:3000 when the proxy chain doesn't reliably forward Host, sending the browser to an unreachable address after OAuth callback.

Resolve the public origin from BETTER_AUTH_URL (with NEXT_PUBLIC_APP_URL and req.url as fallbacks) so the redirect target matches the configured baseURL Better Auth uses for its own redirects.
Today the trust list is populated only when a session is created. A user
who signs in from home, then opens a laptop on a coffee-shop network
without signing out, never adds the coffee-shop IP to their trust list.
When they next sign in cold from that coffee shop, the verify-ip gate
fires even though they have been actively using the network.

Add recordIpTrust(userId) called from proxy.ts on every authenticated
request that passes the gates. The unique (user_id, ip) constraint keeps
the upsert idempotent; an in-pod Set deduplicates (user, ip) tuples so
the DB only sees one write per pod lifetime per pair. Hot path uses the
CF-attested country only and skips the external IP-to-location lookup.

Behaviour:
  - Active session + any request from new IP -> IP added to trust list.
  - Fresh sign-in from IP already in trust list -> no /verify-ip.
  - Fresh sign-in from IP never seen authenticated -> /verify-ip fires.
Replace the request-time auto-add recorder with a read-only IP gate.
Authenticated requests now check the user's user_trusted_ips list on
every protected route; an unknown IP gets redirected to /verify-ip
with a signed pending_ip_verify cookie (or 403 ip_verification_required
for API callers). The trust list only grows via explicit sign-in events
(session.create.before first-attestation, /verify-ip OTP confirmation),
so a stolen session cookie cannot be reused from a different network
without the email-OTP step.

The first time we see a (user, ip) pair the legitimate account owner
receives a notification email naming the IP, country, and device. The
notification is deduped per pod lifetime so a single new network does
not flood the inbox while the user is mid-verification.

Sessions on already-trusted IPs continue to work in parallel; the new
IP joins the trust list once /verify-ip succeeds, so the user can
operate from both networks afterwards.
fix(auth): close existing-account OTP leak on credential signup
@joelorzet joelorzet requested review from a team, OleksandrUA, eskp and suisuss and removed request for a team May 27, 2026 21:28
A single page navigation in the app fans out into ~20 parallel API
requests. The new IP gate was running one DB query per request, so a
single nav cost 20 round-trips, and on an untrusted-IP burst every
fetch returned 403 in parallel.

Cache the gate result per pod for 60s when trusted and 5s when
untrusted. Trusted IPs collapse a page nav to one DB hit and the
short untrusted TTL absorbs a redirect stampede without delaying
post-verification recovery longer than necessary. Same-pod
invalidation happens in /api/user/verify-ip after a successful
upsert so the next request picks up the new trust immediately;
cross-pod stale entries age out on their own within the 5s window.

Notification email dedup was already in place per (user, ip) per
pod lifetime, so inbox flooding was never the failure mode — only
DB load and 403 UI breakage were. This commit handles both.
joelorzet and others added 6 commits May 27, 2026 20:28
1. Step indicator highlighted both steps green once the user advanced
   from email to authenticator, because the 'done' status reused the
   'current' bubble style. Restrict the green bubble + green connector
   to the active step only; completed steps revert to the neutral
   pending style so only the current step is visually emphasised.

2. The email-code input carried autoComplete="one-time-code", which
   1Password / Bitwarden / LastPass / Dashlane all interpret as 'fill
   with the stored TOTP for this site'. Since the email phase mounts
   first with only the email field in the DOM, the password manager
   stuffed the user's TOTP code into the email-code box and the user
   had to clear it manually. Opt the email field out via vendor-
   specific data-* attributes plus autoComplete=off, keeping the
   one-time-code hint on the authenticator field where the stored
   TOTP fill is actually desired.
Real Turnstile site keys are bound to a hostname allowlist in the
Cloudflare dashboard. PR environments get ephemeral hostnames that are
not on staging's allowlist, so the real widget fails with a domain
mismatch and signup breaks. Switch PR deploys to Cloudflare's public
always-pass test keys, which are domain-agnostic:

- Bake the test site key into the client bundle at build time
  (deploy-pr-environment.yaml build-args).
- Serve the test site key and always-pass test secret at runtime via kv
  literals instead of the staging SSM params (values.template.yaml).

TURNSTILE_ENFORCE stays true so the captcha flow is still exercised.
The Turnkey env assertion runs before the health server starts listening,
so any failure here prevents the pod from ever becoming ready and fails the
whole PR deploy under Helm --atomic. Bound the wallet query with a 10s
timeout and, in ephemeral PR environments (K8S_NAMESPACE starts with pr-),
downgrade both a query failure and missing Turnkey env from a throw to a
warning. Staging and production keep the hard failure.
The executor satellite intermittently failed PR deploys with context
deadline exceeded. Its pod hard-references 8 parameterStore-backed
secrets; external-secrets reconciles them asynchronously, so the pod
sits in CreateContainerConfigError until they sync. A 5m --atomic window
rolled back before that completed (~3.5m observed). Raise the shared
satellite timeout to 10m, matching the app, so ESO can catch up.
…the client

The build-args were built with format('...\n...'), but GitHub Actions does
not interpret \n as a newline in expression strings. docker/build-push-action
parses build-args line by line, so every NEXT_PUBLIC_* collapsed into the value
of the first arg (NEXT_PUBLIC_AUTH_PROVIDERS) and the rest - including
NEXT_PUBLIC_TURNSTILE_SITE_KEY - were never set. Next inlines NEXT_PUBLIC_* at
build time, so the Turnstile site key was undefined in the bundle and the
signup captcha never rendered, blocking account creation while
TURNSTILE_ENFORCE=true required a token.

Emit one build-arg per line via a YAML block scalar (same pattern as tags/
cache-from in this step). Blank lines for the migrator image are ignored.
fix(auth): clean up dual-factor verify form UX
chore: harden PR-env deploys (turnstile test keys + external-secrets deploy timeout)
suisuss added 4 commits May 28, 2026 13:41
…rks in pr-N envs

Manual workflow execution on every open PR deploy crashes inside the
runner Job pod with "SANDBOX_BACKEND must be 'remote' and SANDBOX_URL
must be set in production" (plugins/code/steps/run-code.ts module-load
assertion). The PR-deploy templates were missing both SANDBOX_BACKEND
and SANDBOX_URL on the executor, and the executor's RUNNER_SYSTEM_ENV_VARS
allowlist (cd8ad17) had nothing to forward to runner pods.

Pointing PR pods at the staging sandbox URL is not viable: the staging
sandbox NetworkPolicy only allows ingress from kubernetes.io/metadata.name=
keeperhub. Cross-namespace SYNs from pr-N get dropped by the VPC CNI
Network Policy Agent. Each PR therefore gets its own sandbox release in
its own namespace, mirroring the staging shape exactly.

Adds four new templates under deploy/pr-environment/:
- sandbox.template.yaml: Helm values cloned from staging
  (deploy/keeperhub-sandbox/staging/values.yaml). Uses sandbox-latest;
  per-PR sandbox source builds can come later as a label-driven job
  matching the deploy-pr-executor pattern.
- sandbox-sa.template.yaml: scrubbed ServiceAccount, no IRSA, no
  RoleBindings, automountServiceAccountToken: false at the SA level.
- sandbox-networkpolicy.template.yaml: ingress restricted to this PR's
  own namespace; egress default-deny except kube-dns and public TCP/443
  with private CIDRs excepted. Cross-PR sandbox traffic is blocked.
- sandbox-pod-hardening-patch.template.yaml: pod-level
  automountServiceAccountToken: false + emptyDir.medium: Memory overlay
  for fields the techops-services/common@0.2.1 chart does not expose.

Wires the sandbox into .github/workflows/deploy-pr-environment.yaml:
- Apply SA + NetworkPolicy alongside postgres/localstack/redis in the
  infrastructure-parallel step, so the chart's serviceAccount.create=false
  finds the SA and ingress policy is in place before the Pod schedules.
- Render sandbox.template.yaml in the Prepare-Helm-values step.
- helm upgrade --install + kubectl patch + rollout status alongside
  scheduler/block/event/executor in the satellites-parallel step.

Sets SANDBOX_BACKEND=remote and SANDBOX_URL=
http://keeperhub-sandbox-pr-<N>-common.pr-<N>.svc.cluster.local:8787
on both executor.template.yaml (load-bearing - forwarded to runner
Jobs via RUNNER_SYSTEM_ENV_VARS) and values.template.yaml (mirrors
staging shape; web pod's NODE_ENV=development keeps the assertion
dormant, but the var is set so any code path that does honor it works).

Staging sandbox config is intentionally not touched - the threat model
in deploy/keeperhub-sandbox/staging/networkpolicy.yaml keeps
cross-namespace ingress denied for staging.
…low-runner-<sha>

The PR-deploy build pipeline does not push a workflow-runner-<sha> tag.
docker-bake.hcl defines a workflow-runner target, but the build-images
matrix in deploy-pr-environment.yaml only invokes app and migrator. A
prior commit explicitly trimmed workflow-runner from the matrix with
the stated intent "satellite services use pre-built staging images" -
but executor.template.yaml was not updated to match, so every PR has
been pointing runner Job pods at a tag that does not exist in ECR.

Symptom: kubectl events show ImagePullBackOff -> ContainerCreating ->
DeadlineExceeded on every workflow-* Job in pr-N. No workflow execution
on any PR has ever completed since the matrix trim landed; the only
reason this surfaced now is that the SANDBOX assertion (KEEP-653 first
commit) used to crash before pod-pull was even attempted.

Mirrors the executor's own behaviour: executor falls back to
executor-latest unless deploy-pr-executor label triggers a build. Same
shape now applies to the runner. If a future PR needs to test runner
code changes, follow the deploy-pr-executor pattern: add a
deploy-pr-runner label and a build-runner-image job that pushes
workflow-runner-<sha>, then template that tag.
… deploys

The strategic-merge patch in sandbox-pod-hardening-patch.template.yaml
references the chart-rendered tmp volume by its hardcoded name
`keeperhub-sandbox-pr-${PR_NUMBER}-common-tmp-volume`. Strategic merge
keys on `name`; if the techops-services/common chart's fullname template
ever changes (chart bump, nameOverride introduction, trunc-63 collision
on long PR numbers, etc.) the patch silently APPENDS an orphan volume
rather than overlaying `medium: Memory` on the chart's one. The
chart-rendered volumeMount keeps pointing at the original
emptyDir-with-only-sizeLimit on node disk, /tmp is no longer tmpfs, and
the "no ELF outlives the Pod" property the patch's own comment promises
is silently lost. Rollout status still succeeds because the container is
healthy.

Staging's deploy-sandbox.yaml has a WORK-03 smoke step that runs
`kubectl exec -- sh -c 'test -e .../token'` after patch, but our
sandbox image is distroless (no /bin/sh) so the exec always fails;
the existing staging check effectively false-passes on every deploy.

This commit adds two kubectl-get jsonpath assertions inside the sandbox
satellite subshell, after `kubectl rollout status`:

1. `.spec.template.spec.automountServiceAccountToken` must be "false".
   Catches a patch that didn't take effect at all.
2. `.spec.template.spec.volumes[?(@.name=='...')].emptyDir.medium` must
   be "Memory". An empty result here is the smoking gun for the
   silent-append failure mode - the patch ran, kubectl returned 0, but
   the chart's volume kept its original disk-backed medium because
   strategic-merge couldn't find a matching name.

Both assertions read the Deployment spec directly so they work against
the distroless image where exec is unavailable. Errors are emitted via
`::error::` so they surface as GHA annotations and explain the likely
root cause (chart fullname-template drift).

Adds an inline comment in sandbox-pod-hardening-patch.template.yaml
warning future editors that the volume-name string is load-bearing and
must move in lockstep with the verification assertion if the chart is
upgraded.

Tested against the live pr-1392 deploy:
- automountServiceAccountToken=false (positive case, check passes)
- medium=Memory (positive case, check passes)
- A bad volume name returns empty -> medium="" -> the check would fire
  with the diagnostic error message.

Out of scope: the same brittleness exists in deploy/keeperhub-sandbox/
staging/pod-hardening-patch.yaml plus staging's exec-based WORK-03
check has the false-pass bug noted above. Both worth a follow-up;
neither is regressed by this commit.
…x-env

fix: deploy per-PR sandbox so manual workflow execution works in pr-N envs
@suisuss suisuss merged commit aa19fbb into prod May 28, 2026
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants