release: promote staging to prod #1458

Merged
OleksandrUA merged 10 commits into prod from staging
Jun 4, 2026
@OleksandrUA

Release: staging -> prod

Promotes the current staging HEAD to prod (10 commits, 4 PRs).

Included

| PR | Change | Type |
|---|---|---|
| #1442 | TECH-6484 - cut over /api/metrics/db to the collector + PR-env wiring | functional |
| #1452 | KEEP-676 - resolve client IP via CF-Connecting-IP on direct session writes | fix |
| #1456 | KEEP-713 - align staging keeperhub-common CPU request with prod | chore |
| #1457 | TECH-29 - comment noting runner UID 1000 is coupled to the node base image | docs (no runtime change) |

Notes for the reviewer

Deploy

Merging this triggers the prod CI pipeline (build -> deploy) on the prod branch.

chong-techops and others added 10 commits June 3, 2026 10:30
…ing (TECH-6484)

Stage 4 (cutover):
- Gate the app's /api/metrics/db route to 404 via METRICS_DB_OFFLOADED so the
  heavy aggregate scan never runs on the request-serving pods.
- Remove the db-metrics ServiceMonitor from deploy/keeperhub/{staging,prod} and
  set METRICS_DB_OFFLOADED=true. /api/metrics/api is unchanged.
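
A minimal sketch of the Stage 4 gate, assuming the flag is compared against the string "true" and that the route can be modelled as a plain function. `handleDbMetrics`, `RouteResult`, and the `collect` callback are hypothetical names for illustration, not the app's actual handler:

```typescript
// Environment is passed in explicitly so the sketch has no runtime dependencies.
type Env = Readonly<Record<string, string | undefined>>;

interface RouteResult {
  status: number;
  body: string;
}

// Assumption: the flag counts as enabled only when it is exactly "true".
function isDbMetricsOffloaded(env: Env): boolean {
  return env.METRICS_DB_OFFLOADED === "true";
}

// Hypothetical stand-in for the /api/metrics/db handler.
// `collect` represents the heavy aggregate scan.
function handleDbMetrics(env: Env, collect: () => string): RouteResult {
  if (isDbMetricsOffloaded(env)) {
    // Return before calling collect(), so the scan never runs on this pod.
    return { status: 404, body: "Not Found" };
  }
  return { status: 200, body: collect() };
}
```

The check sits ahead of any database work, which is what keeps the scan off the request-serving pods once the flag is set.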

Stage 5 (PR-env wiring + docs):
- deploy/pr-environment/metrics-collector.template.yaml (single replica, PR DB,
  ServiceMonitor off).
- deploy-pr-environment.yaml: opt-in deploy-pr-metrics label -> build-collector-image
  job + a gated deploy step. Default off, so existing PR envs are unaffected.
- METRICS_REFERENCE.md note on the collector + offload.

Depends on the collector being live and verified in staging (PR #1439). The
cutover must merge only after that; otherwise DB metrics will have a gap.
Consistency with the #1439 review (#3): rely on the Dockerfile CMD, no helm
command/args override. Matches deploy/metrics-collector/{staging,prod}.
…or lands

Found during B-now validation: adding deploy-pr-metrics to an already-deployed
PR built the collector image but did not deploy it - the collector deploy step
sits inside the should-deploy-gated deploy job. Set should-deploy=true on the
metrics-only path (mirroring deploy-pr-executor) so the deploy re-runs and the
collector step executes. The both-labels path was already correct.
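
The gating described above can be modelled roughly as follows. This is a sketch of the decision logic only, not the workflow YAML: `planPrDeploy`, `DeployPlan`, `appDeployRequested`, and `metricsPathSetsShouldDeploy` are hypothetical names, and only the `deploy-pr-metrics` label and the should-deploy gate come from the commit message.

```typescript
const METRICS_LABEL = "deploy-pr-metrics";

interface DeployPlan {
  buildCollectorImage: boolean;
  shouldDeploy: boolean;
  deployCollector: boolean;
}

// appDeployRequested: some other path already asked for a deploy (hypothetical input).
// metricsPathSetsShouldDeploy: false models the behaviour before the fix, true after it.
function planPrDeploy(
  labels: readonly string[],
  appDeployRequested: boolean,
  metricsPathSetsShouldDeploy: boolean,
): DeployPlan {
  const wantsMetrics = labels.some((label) => label === METRICS_LABEL);
  const shouldDeploy =
    appDeployRequested || (wantsMetrics && metricsPathSetsShouldDeploy);
  return {
    buildCollectorImage: wantsMetrics,
    shouldDeploy,
    // The collector step lives inside the deploy job, so it needs both conditions.
    deployCollector: shouldDeploy && wantsMetrics,
  };
}
```

With `metricsPathSetsShouldDeploy` set to false, a metrics-only run builds the image but skips the collector step, which is the behaviour the fix removes.
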
The OAuth-MFA finalize and TOTP enrollment routes mint sessions directly
and derived ip_address from the leftmost X-Forwarded-For hop. That value
is caller-controlled and can be rewritten by intermediate hops, so a
subset of sessions stored an unreliable address rather than the real
client IP. Better Auth's own session writes already resolve
CF-Connecting-IP, but these direct writes bypassed that.

Extract the existing CF-aware resolver in login-risk into a shared
resolveClientIpFromHeaders helper and use it in both routes. In
production only CF-Connecting-IP is trusted; outside production
X-Forwarded-For then X-Real-IP stay as local-dev fallbacks. Sessions now
store the attested client IP or null.
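
A rough sketch of the resolution order described above. The helper name comes from the commit message, but the signature is an assumption: lower-cased header keys in a plain object, an explicit `isProduction` flag, and CF-Connecting-IP checked first outside production too.

```typescript
// Assumption: header names are already lower-cased by the caller.
type HeaderBag = Readonly<Record<string, string | undefined>>;

// Returns the first comma-separated entry, trimmed, or null if it is empty.
function firstEntry(value: string | undefined): string | null {
  if (value === undefined) return null;
  const first = value.split(",")[0]?.trim() ?? "";
  return first.length > 0 ? first : null;
}

function resolveClientIpFromHeaders(
  headers: HeaderBag,
  isProduction: boolean,
): string | null {
  const cfIp = firstEntry(headers["cf-connecting-ip"]);
  if (cfIp !== null) return cfIp;

  // In production nothing else is trusted: store null rather than a
  // caller-controlled value.
  if (isProduction) return null;

  // Local-dev fallbacks only: X-Forwarded-For, then X-Real-IP.
  return (
    firstEntry(headers["x-forwarded-for"]) ?? firstEntry(headers["x-real-ip"])
  );
}
```

Returning null in production when CF-Connecting-IP is absent matches the stated outcome that sessions store either the attested client IP or null.
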
…utover

feat(metrics): cut over /api/metrics/db to the collector + PR-env wiring (TECH-6484)
…ip-better-auth-sessions

fix: resolve client IP via CF-Connecting-IP on direct session writes
Staging requested 1m CPU while prod requests 100m. The Grafana alert
"KeeperHub High CPU Usage (Staging)" evaluates cpu_usage / cpu_request > 2
per container, so a 1m request made idle pods (~15-25m CPU) sit
permanently at 15-25x and re-fire the P3 on every rollout.

14d per-pod usage (5m-rate): avg 19m, p95 29m, p99 48m, max 109m.
Setting the request to 100m matches prod, covers p99 with headroom, and
moves the alert bar to 2x = 200m (above the 109m observed max).
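
The alert arithmetic above can be checked with a small sketch. The rule shape (usage / request > 2) and the millicore figures come from the commit message; the function names are hypothetical and this is not the Grafana query itself.

```typescript
// Threshold from the alert rule: cpu_usage / cpu_request > 2.
const ALERT_RATIO_THRESHOLD = 2;

// Both arguments are in millicores (m).
function usageToRequestRatio(usageMilli: number, requestMilli: number): number {
  return usageMilli / requestMilli;
}

function alertFires(usageMilli: number, requestMilli: number): boolean {
  return usageToRequestRatio(usageMilli, requestMilli) > ALERT_RATIO_THRESHOLD;
}

// Usage level (in millicores) above which the alert fires for a given request.
function alertBarMilli(requestMilli: number): number {
  return ALERT_RATIO_THRESHOLD * requestMilli;
}

// Old staging request of 1m: even idle usage of 15m is far over the bar.
const idleWithOldRequest = alertFires(15, 1);
// New request of 100m: the observed 14d max of 109m stays under the bar.
const maxWithNewRequest = alertFires(109, 100);
```

With a 1m request the bar sits at 2m, so any realistic idle usage trips it; with 100m the bar moves to 200m.
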
…mmon-cpu-request

fix: align staging keeperhub-common CPU request with prod
The runner Job pins runAsUser/Group to 1000 (RUNNER_UID/RUNNER_GID in
keeperhub-executor/k8s-job.ts), which only works because node:*-alpine
ships a "node" user at 1000 and the copied app files are world-readable.
That coupling spans two files that change independently: the Dockerfile
picks the base image, the executor hardcodes the UID. Add cross-
referencing comments in both so swapping the runner base image triggers a
check of UID 1000 (or an update to RUNNER_UID/RUNNER_GID). Comment-only,
no behavior change.
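
For illustration, the executor-side half of such a cross-reference might look like the sketch below. `RUNNER_UID`, `RUNNER_GID`, and the value 1000 come from the commit message; the comment wording, the `PodSecurityContext` shape, and `runnerSecurityContext` are assumptions, not the contents of k8s-job.ts.

```typescript
// COUPLING: 1000 is the UID/GID of the "node" user shipped by node:*-alpine.
// If the runner Dockerfile switches base image, re-check that UID 1000 still
// exists there (or update these constants to match).
const RUNNER_UID = 1000;
const RUNNER_GID = 1000;

// Minimal subset of the Kubernetes securityContext fields used here.
interface PodSecurityContext {
  runAsUser: number;
  runAsGroup: number;
}

// Hypothetical helper: builds the securityContext the runner Job would pin.
function runnerSecurityContext(): PodSecurityContext {
  return { runAsUser: RUNNER_UID, runAsGroup: RUNNER_GID };
}
```

A matching comment next to the base image line in the Dockerfile would point back at these constants, so a change on either side prompts a check of the other.
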
…ling-note

docs(executor): note runner UID 1000 is coupled to the node base image
OleksandrUA added the metrics-db-reviewed label (Reviewer sign-off: metrics aggregate queries optimised + tables indexed (KEEP-680)) Jun 4, 2026
OleksandrUA merged commit 8dc708f into prod Jun 4, 2026
37 of 38 checks passed
Labels

metrics-db-reviewed Reviewer sign-off: metrics aggregate queries optimised + tables indexed (KEEP-680)

3 participants