Skip to content

release: To Prod#1337

Merged
joelorzet merged 15 commits into
prodfrom
staging
May 21, 2026
Merged

release: To Prod#1337
joelorzet merged 15 commits into
prodfrom
staging

Conversation

@suisuss

@suisuss suisuss commented May 21, 2026

Copy link
Copy Markdown

Summary

Release staging -> prod. Notable security-critical content in this release:

Security incident 2026-05-21 follow-up (PR #1335 merged into staging)

  • NetworkPolicy on workflow runner pods: kubernetes egress block to 169.254.0.0/16, fe80::/10, fc00::/7 (IMDS + link-local + ULA).
  • AI Prompt off in prod: gated behind NEXT_PUBLIC_AI_PROMPT_ENABLED (defaults false). UI + /api/ai/generate both gated.
  • DB-level kill switch (drizzle migration 0082):
    • block_database_integrations_security_2026_05_21 declared in migrations (the trigger was already ad-hoc on prod).
    • NEW block_executions_security_2026_05_21 rejects INSERT into workflow_executions when the workflow owner is deactivated_at IS NOT NULL or the workflow is deleted_at IS NOT NULL. Raises 42501.
  • Both triggers are idempotent (CREATE OR REPLACE FUNCTION + DROP TRIGGER IF EXISTS / CREATE TRIGGER).
  • This closes the live attacker leak where deactivated accounts continued running workflows.

Other content

  • KEEP-603 SSRF guard parity across HTTP + DB plugins, sandbox tests, blocklist as JSON
  • chore: executor unit tests in CI
  • Misc fixes (Blockscout chain selector, validator template address)

Test plan

  • CI green on PR hotfix(security): defence-in-depth - NetworkPolicy, AI Prompt off, DB kill switch #1335 before merge to staging
  • Staging deploy succeeded (run 26234015878)
  • CI green on this PR before merge to prod
  • After merge to prod: verify both triggers present on prod DB (SELECT tgname FROM pg_trigger WHERE tgname LIKE '%security_2026_05_21%')
  • Sample insert into workflow_executions for a soft-deleted workflow on prod returns 42501 error
  • No spike in legitimate execution errors after deploy (Joel's canaries, Dumitru's Sky monitoring continue running)

suisuss added 7 commits May 21, 2026 21:04
Extracts the SSRF blocklist data (CIDR ranges, hostname patterns,
NAT64 well-known prefix) to a new lib/ssrf-blocklist.ts module. Both
consumers read from this one file:

  - lib/safe-fetch.ts imports the data and builds its node:net
    BlockList and Set<string> at module-load time.
  - lib/sandbox/child-source.ts interpolates the data into the
    grandchild source template via JSON.stringify, so the inlined
    literals in the rendered "node -e" string are automatically in
    sync with the main-app blocklist.

Closes the drift this PR was originally responding to. Before this
commit the sandbox guard was a hand-maintained copy that stopped at
KEEP-314 - missing the four IPv6 ranges added in the KEEP-603 PR
(64:ff9b:1::/48 site-local NAT64, 2001::/32 Teredo, 2002::/16 6to4,
2001:db8::/32 documentation) and missing the entire pre-DNS hostname
denylist (localhost, *.local, *.internal, *.svc.cluster.local,
*.pod.cluster.local). It now picks up the same set automatically.

The sandbox docblock is updated: imports are now allowed in this
module, but restricted to pure data consumed at template-render time;
runtime behaviour still cannot cross into the grandchild because the
grandchild has no access to npm or to the rest of the codebase.

Validation: pnpm vitest passes (189/189). Lint and type-check clean
on the changed files. The rendered SANDBOX_CHILD_SOURCE string was
piped through node --check to confirm the interpolated template is
syntactically valid JavaScript, and the key CIDR and host-pattern
markers are present in the rendered output.

No public API change. The sandbox guard remains always-enforce; no
SAFE_FETCH_ENFORCE wiring is needed (child-source.ts does not read
the env var).
Fixes the standalone @keeperhub/sandbox package's tsc build, which
failed on the previous commit's @/-prefixed import because that
package's tsconfig maps `@/*` to its own `./src/*` (not to the
keeperhub workspace root). The new import in child-source.ts is
switched to a relative path that resolves identically from both build
contexts; an inline comment notes the asymmetry with safe-fetch.ts.

Adds tests/unit/sandbox-child-source.test.ts that asserts every entry
from lib/ssrf-blocklist.ts appears inline in the rendered
SANDBOX_CHILD_SOURCE string, plus a parse-only vm.Script check. This
locks the SoT contract: a future regression that reverts to hardcoded
literals in child-source.ts would fail the test, not slip through.

Validation: sandbox tsc build now emits dist/lib/ssrf-blocklist.js
alongside dist/lib/sandbox/child-source.js, with the compiled require
target resolving correctly at runtime. sandbox package tests 25/25
pass; keeperhub touched suites 207/207 pass (was 197 before this
file). Lint clean on changed files; no new type errors.
…ontext

The sandbox Dockerfile's build stage takes a "copy only what is needed"
approach and explicitly named `lib/sandbox/child-source.ts` as the one
shared file pulled in from the keeperhub workspace. After this branch's
SoT extraction, child-source.ts also imports `../ssrf-blocklist`, but
that file wasn't being copied into the build context - so the sandbox
tsc step failed with "Cannot find module '../ssrf-blocklist'" inside
the Docker build (local pnpm builds passed because the full workspace
tree was available).

Adds the file as an explicit COPY alongside the existing child-source.ts
copy, with a comment explaining the dependency. The sandbox tsc emits
dist/lib/ssrf-blocklist.js next to dist/lib/sandbox/child-source.js;
the latter's compiled `require("../ssrf-blocklist")` resolves at
runtime.
CI's test-unit-sandbox-remote still failed after the Dockerfile COPY
addition: the build stage now succeeded but the runtime container hit

  Error [ERR_MODULE_NOT_FOUND]: Cannot find module
  '/app/dist/lib/ssrf-blocklist' imported from
  /app/dist/lib/sandbox/child-source.js

The standalone @keeperhub/sandbox package is `"type": "module"` and
runs its compiled output via plain `node`, whose strict ESM loader
requires explicit file extensions in import specifiers. The extension-
less `"../ssrf-blocklist"` import worked at tsc time (sandbox tsconfig
uses moduleResolution: "bundler") and on the keeperhub side (Next.js
bundler) but blew up at sandbox runtime.

Switches the import to `"../ssrf-blocklist.js"` - the NodeNext / strict
ESM convention. This matches how `sandbox/src/run-code.ts` already
imports `"../../lib/sandbox/child-source.js"`. TS still resolves to the
matching .ts source at compile time; the emitted output keeps the .js
suffix verbatim so Node's resolver finds the compiled file.

Validated locally: sandbox tsc emits the .js suffix into
dist/lib/sandbox/child-source.js, and `node dist/sandbox/src/index.js`
launches without module-resolution errors. Keeperhub side: 197/197
tests pass; type-check unchanged.
… import

CI surfaced an irreducible conflict between the two consumers of the
shared SSRF blocklist:

  - lib/safe-fetch.ts is bundled by Next.js Turbopack, which resolves
    extension-less relative imports to .ts source files but refuses
    .js-suffixed imports back to the .ts source ("Module not found:
    Can't resolve '../ssrf-blocklist.js'" during the keeperhub build).
  - lib/sandbox/child-source.ts is also compiled by the standalone
    @keeperhub/sandbox package, whose strict ESM runtime requires
    explicit .js extensions and rejects extension-less imports with
    ERR_MODULE_NOT_FOUND at startup.

A .ts data file is unreachable from one consumer or the other with any
single import specifier. Switches the data file to .json, which both
bundlers resolve identically.

  - lib/ssrf-blocklist.json owns the data.
  - lib/ssrf-blocklist.ts is now a thin typed wrapper that imports the
    JSON and re-exports the values with proper Cidr tuple types. The
    keeperhub side imports paths are unchanged; safe-fetch.ts and
    connection-host-guard.ts keep their existing module specifiers.
  - lib/sandbox/child-source.ts imports the JSON directly with
    `with { type: "json" }`, skipping the TS wrapper entirely - the
    sandbox doesn't need TS types because the values are interpolated
    into the template literal as JSON.stringify(...).
  - sandbox/Dockerfile copies the .json (not the .ts) into the build
    context.
  - sandbox/package.json's build script copies the .json into dist
    after tsc, because tsc does not emit .json files even with
    resolveJsonModule enabled.

Validated locally: sandbox tsc + JSON copy emits dist/lib/ssrf-blocklist.json
alongside dist/lib/sandbox/child-source.js; `node dist/sandbox/src/index.js`
starts without module-resolution errors; keeperhub unit suites 197/197
pass; lint clean; sandbox typecheck clean.
The workflow execution path that actually runs user step code (and
therefore calls safeFetch) is the ephemeral workflow-* runner pod
spawned by the executor controller via createWorkflowJob. These pods
do not read Helm values directly - their env is constructed in
keeperhub-executor/k8s-job.ts from a hardcoded explicit list plus the
RUNNER_SYSTEM_ENV_VARS allowlist.

Pre-fix posture: SAFE_FETCH_ENFORCE was on neither path, so runner
pods got the env unset, isShadowMode() returned true, and the SSRF
guard logged blocks but let requests through. The IMDS-401 case
originally reported in production was almost certainly a runner pod -
the controller pod can be hardened via Helm (PR #1331) but that does
not propagate to its children.

Hardcodes `{ name: "SAFE_FETCH_ENFORCE", value: "true" }` in the
explicit list rather than adding to RUNNER_SYSTEM_ENV_VARS. The
hardcoded form makes the runner enforce regardless of controller
config drift; flipping it back to shadow mode now requires a code
change here, which is the intended posture for an SSRF guard in an
incident-response context.

Tests added:
- runner pod gets SAFE_FETCH_ENFORCE=true when controller env is unset
- runner pod still gets SAFE_FETCH_ENFORCE=true (hardcoded value wins)
  when controller env says "false", and the entry appears exactly once
…etch-forward

fix: force-enable SAFE_FETCH_ENFORCE on workflow runner pods
suisuss and others added 3 commits May 21, 2026 22:58
The six test files under keeperhub-executor/ (k8s-job.test.ts,
runner-env.test.ts, startup-checks.test.ts, api-execute.test.ts,
lib/metrics-shipping.test.ts, lib/ship-metrics.test.ts - 44 tests
total) have not been running in CI. They are discoverable by the
root vitest config, but the pr-checks test-unit job invokes
`pnpm test:unit` which is `vitest run tests/unit` - a positional
path filter that excludes everything outside tests/unit/.

Adds a new `test:executor` package.json script (`vitest run
keeperhub-executor`) and a follow-on step in the test-unit job
that invokes it. The new step runs in the same job as the
existing unit tests so it reuses the Node + pnpm setup and does
not add a fresh runner; failure attribution stays clean because
GitHub Actions reports each step separately.

Verified locally: `pnpm test:executor` discovers and passes all
44 tests in ~10s.

Why not a separate workflow file like pr-checks-scheduler.yml or
pr-checks-events.yml: keeperhub-scheduler and keeperhub-events
are separate pnpm workspaces with their own lockfiles, tsconfigs,
and vitest configs. keeperhub-executor is part of the root
workspace and shares dependencies, so a follow-on step in the
existing job is the lighter-weight match.

Why hardcoded as a separate script rather than expanding
test:unit: `test:unit` is reused by the sandbox-remote job
(`pnpm test:unit tests/unit/code-run-code.test.ts`) which appends
a path arg to narrow the run. Expanding test:unit's base path to
include keeperhub-executor would broaden that job too. Keeping
the executor invocation in its own script preserves the existing
test:unit semantics.
Mirrors the pr-checks-events.yml pattern: a standalone workflow file
that fires only when keeperhub-executor/** (or the workflow itself)
change, instead of an extra step on every PR's main test-unit job.

The previous commit's approach (a follow-on step in pr-checks.yml
test-unit) ran the executor suite on every PR regardless of whether
any executor code changed, which wastes runner time and clutters the
status checks UI. The dedicated workflow is cheap to skip when nothing
matches the path filter.

Differences from pr-checks-events.yml: events is a separate pnpm
workspace with its own lockfile and setup-events-deps action; the
executor lives in the root workspace, so the existing setup-node-pnpm
action is reused. Executor tests are pure unit (no docker, no anvil),
so only a single test-unit job is needed - no test-integration mirror.

Verified locally: pnpm test:executor still passes 44/44. The
test:executor package.json script added in the previous commit is
retained because it is the entry point this workflow invokes.
…ntain permissions'

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>
…n-ci

ci: run keeperhub-executor unit tests in pr-checks
joelorzet added 2 commits May 21, 2026 11:13
Spawn the grandchild via node -e and assert await fetch(...) is
rejected for every blocklist category (IMDSv2, RFC 1918, loopback,
ULA, link-local, Teredo, 6to4, doc range, site-local NAT64,
multicast, and pre-DNS hostname patterns). Existing data-only tests
proved the list was inlined but not that the guard actually fires.

Also tighten CIDR data tests to assert the full tuple, add a guard
that locks child-source.ts to a single data-only import, and list
the JSON dependency in sandbox/tsconfig.json.
…-parity

hotfix: single source of truth for SSRF blocklist
@joelorzet joelorzet merged commit 28f8c91 into prod May 21, 2026
37 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants