Skip to content

release: to prod#1455

Merged
joelorzet merged 22 commits into
prodfrom
staging
Jun 3, 2026
Merged

release: to prod#1455
joelorzet merged 22 commits into
prodfrom
staging

Conversation

@joelorzet

Copy link
Copy Markdown

No description provided.

joelorzet and others added 22 commits June 2, 2026 21:27
Raw fetch() in plugin step and connection-test files bypassed the SSRF
guard in lib/safe-fetch.ts, so a user- or attacker-controlled destination
could reach link-local (169.254.169.254) and RFC1918 hosts even with
SAFE_FETCH_ENFORCE on. Replace every bare fetch() under plugins/ with
safeFetch(url, { plugin }) so the guard validates each request and blocks
attribute to the calling plugin.

Prioritized the two user-controlled-host paths: the discord webhookUrl and
the blockscout custom-instance URL. Also harden the discord webhook check:
the old substring match on "discord.com/api/webhooks/" passed a URL that
carried that string in its path while pointing the host at an internal
address (e.g. http://169.254.169.254/discord.com/api/webhooks/x). Validate
by parsed hostname over https with an /api/webhooks/ path instead.

Update blockscout-steps unit tests to mock safeFetch, and add a discord
send-message test covering the bypass URLs and the safeFetch wiring.
Add a lint-job step that git-greps for bare fetch()/axios/http.request
under plugins/ (excluding *.test.ts, *.md, *.txt) and fails the build,
so plugin egress cannot regress off safeFetch. Pattern avoids \b so it
runs the same on the BSD and GNU regex engines.
Update the plugin generator template, authoring guides, and agent/command
definitions to use safeFetch(url, { plugin }) instead of raw fetch(), so
newly created plugins route egress through the SSRF guard and pass the
new CI egress check by default.
The build failed because lib/safe-fetch.ts is "server-only" and the
connection-test files (plugins/*/test.ts) are reachable from the
client-bundled plugin registry (plugins/index.ts), so importing safeFetch
into them pulled server-only into the client graph.

Connection tests run server-side but cannot import the server-only guard
from the client-reachable registry. Revert the nine test.ts files to the
raw fetch global, exclude plugins/*/test.ts from the egress CI guard, and
note the step-file vs connection-test distinction in the scaffolding and
authoring guides. Workflow step files (the SSRF surface this ticket
targets) remain routed through safeFetch and are unaffected.
Connection-test files (plugins/*/test.ts) are reachable from the
client-bundled plugin registry, so they cannot import the server-only SSRF
guard and fetch user-supplied instance URLs (e.g. blockscout's
BLOCKSCOUT_API_URL) with raw fetch. A user could point that at
169.254.169.254 or an RFC1918 host and use the Test Connection result as a
reachability oracle.

Validate every user-supplied url-type field on the server in
handlePluginTest, before the test runs, via assertUrlIsPublic -- mirroring
the assertConnectionUrlIsPublic pre-flight already used for database
connection tests. Generic across all plugins; no change to the
client-reachable test.ts files.
Document, in the handlePluginTest guard comment and the plugin authoring
guide, that assertUrlIsPublic is always-on (it does not honor
SAFE_FETCH_ENFORCE), so Test Connection blocks a localhost/private instance
URL in local dev even though workflow execution against it works under
safeFetch shadow mode. Also normalize prose to avoid double hyphens.
…workpolicy comments

This is a public repo. The networkpolicy.yaml comments described the staging/prod
VPC ranges, DB-subnet CIDRs by AZ, internal file paths, env var names, and the
egress threat model - unnecessary network-recon detail for a public file. Remove
all comments; the policy spec is unchanged (the applied object is byte-identical),
so there is no functional or deploy impact. Rationale now lives only in the
private infra runbook.
…omments

chore(deploy): removing comments from documentation references
Runner Job pods previously ran with serviceAccountName: default, a
mounted SA token, an empty securityContext (root, gid=0), and high-value
secrets passed as literal env values. This widened blast radius if a
workflow step was compromised.

Changes (TECH-29):
- Dedicated zero-RBAC ServiceAccount keeperhub-workflow-runner with no
  RoleBinding and automountServiceAccountToken: false (manifests applied
  by deploy-executor.yaml, mirroring the sandbox SA pattern).
- Pod securityContext: runAsNonRoot, runAsUser/Group 1000, fsGroup 1000,
  seccompProfile RuntimeDefault. Container securityContext:
  allowPrivilegeEscalation false, readOnlyRootFilesystem true,
  capabilities drop ALL. Writable /tmp emptyDir + TMPDIR so tsx/esbuild
  still works under the read-only root.
- All 12 high-value credentials now reach the runner via
  valueFrom.secretKeyRef against the executor's External Secrets-synced
  Secrets instead of literal values; secret-valued vars are no longer
  relayed as plaintext through the controller env. DATABASE_URL and
  INTEGRATION_ENCRYPTION_KEY are required refs, the rest optional.
The PR environment is a different topology than staging: the executor is
helm release executor-pr-<N> and DATABASE_URL is a kv literal pointing at
the ephemeral per-PR Postgres. Without extra wiring the hardened runner
cannot come up there:
- the dedicated keeperhub-workflow-runner SA does not exist in pr-<N>
  (staging creates it via deploy-executor.yaml, never run for PR envs)
- the runner's secret refs target keeperhub-executor-common-<slug>, but
  the PR executor's secrets are executor-pr-<N>-common-<slug>
- DATABASE_URL has no chart-managed Secret since it is kv in the PR env

Wiring added:
- runner-sa.template.yaml: per-PR zero-RBAC runner SA, applied in the
  deploy job's prerequisite block
- executor.template.yaml: RUNNER_SERVICE_ACCOUNT + RUNNER_SECRET_PREFIX
  (executor-pr-<N>-common) so the runner resolves the PR namespace's SA
  and secrets
- deploy job creates executor-pr-<N>-common-db-url from the same
  ephemeral connection string, so the runner's required DATABASE_URL
  secretKeyRef resolves
…r-job-pods

feat(executor): harden workflow-runner Job pods (dedicated SA, no token, non-root, secret refs)
Adds an ioredis-backed client for the app tier as a cross-replica coordination layer. getRedis() returns null when REDIS_HOST or REDIS_PASSWORD is unset and never opens an unauthenticated connection, so callers degrade to local behaviour and a Redis outage cannot block a request. Also adds a central Redis key registry.
… replicas

The IP gate emailed the account owner the first time a request arrived from an untrusted network, deduped by a per-pod in-memory set. With more than one replica each pod deduped independently, so a single new network produced one email per pod. The dedup now uses a Redis SET NX claim with a 7 day TTL so exactly one replica sends per network. The notification runs fire-and-forget and never blocks the gate; when Redis is unavailable it skips the courtesy email rather than risk one send per request.
A per-pod cache of untrusted gate results outlived the moment another replica recorded the IP as trusted via verify-ip, so a verified user was bounced back through verification on a stale pod. Untrusted is now read from the database on every request, the shared source of truth, so any replica sees a freshly trusted IP immediately. Trusted results stay cached per pod because trust only ever grows.
Sources REDIS_HOST, REDIS_PORT, and REDIS_PASSWORD from the keeperhub SSM namespace for staging, prod, and PR environments.
staging and every PR environment share one Redis instance, so keys are prefixed with a per-deployment REDIS_KEY_PREFIX to keep them isolated. NODE_ENV cannot serve this since staging and PR envs are both development. Adds a deploymentKey helper that every key builder uses, with unit coverage for the helper and the prefix.
… key prefix

Reuses the redis-host, redis-port, and redis-password parameters already provisioned under the keeperhub-events SSM namespace, so no new parameters are needed. Sets REDIS_KEY_PREFIX per deployment (staging, prod, pr-PR_NUMBER).
Updates the verify-ip comment that referenced the removed UNTRUSTED_TTL_MS constant, and removes the freshUser test helper that guarded against a per-pod dedup set the notification module no longer has.
…ess-through-safefetch

fix(security): route all plugin egress through safeFetch
fix(security): make new-sign-in-location email and IP gate replica-safe
Replaces the per-pod in-memory TRUST_CACHE with a read-through Redis cache keyed per deployment. The gate reads Redis first and falls back to Postgres, the durable source of truth, on a miss, re-caching the result so the next request on any replica is a hit. /verify-ip warms the cache after the upsert so a just-verified IP is trusted fleet-wide immediately. Redis is best-effort: a missing or failing Redis falls straight through to Postgres, so the gate never fails and no one is locked out. Untrusted is still never cached. Removing the per-pod Map ends the cross-replica inconsistency where one pod cached a result the next pod did not have.
refactor(security): back the IP-trust gate with a shared Redis cache
@joelorzet joelorzet requested review from a team, OleksandrUA, eskp and suisuss and removed request for a team June 3, 2026 22:26
@joelorzet joelorzet merged commit cdfa2dd into prod Jun 3, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants