Skip to content

release: to prod#1432

Merged
joelorzet merged 25 commits into
prodfrom
staging
Jun 1, 2026
Merged

release: to prod#1432
joelorzet merged 25 commits into
prodfrom
staging

Conversation

@joelorzet

Copy link
Copy Markdown

No description provided.

joelorzet and others added 25 commits May 13, 2026 18:44
Parallel For Each iterations fail with "Step ... exceeded max retries
(1 retry)" because the Workflow DevKit drops step_completed events
under heavy fan-in and re-fires steps that already succeeded. The
existing KEEP-398/KEEP-431 recovery only runs for top-level nodes;
loop-body nodes hit for-each-body-runner.ts's catch and record the
failure instead.

Make the step-success authority iteration-aware and apply the same
recovery pattern inside the body runner.

- step-success-tracker: composite key nodeId::forEachId::iter so
  parallel iterations of the same body node don't clobber each other.
  New getStepSuccess(); getSuccessfulSteps() preserved for existing
  cold-pod degradation callers.
- step-handler: pass iteration key to recordStepSuccess when the
  step ran inside a For Each.
- get-completed-step-output: new fetchCompletedStepOutputAtIterationStep
  query that matches the iteration row exactly instead of filtering
  iteration rows out. Iteration-aware inflight cache key.
- for-each-body-runner: optional resolveSpuriousRecovery /
  onSpuriousRecovery hooks on RunBodyContext. The catch consults the
  resolver when the error matches the spurious pattern. Extracted
  routeAfterSuccess helper so the recovered step's downstream still
  fires through the same routing as the normal success path.
- executor.workflow: wire the resolver via iteration-keyed
  getCompletedStepOutput. Emit
  workflow.executor.spurious_recovery.total{source="body_runner"}.

Schema unchanged: iteration_index and for_each_node_id columns
already exist on workflow_execution_logs.
…k-and-execute

Every POST /api/execute/* path previously reserved USDC and broadcast
on the first try. Builders had no way to validate inputs against the
real chain without spending real funds. Five hackathon teams reported
this as a P1 / P2.

Adds `simulate: true` on the request body (and `?simulate=true` query
flag) to all three execute endpoints. When set:

- Validate the same inputs (wallet, ABI, args, recipient).
- Run a read-only path that resolves the org's from-address, encodes
  the call, and combines `provider.estimateGas` + `provider.call` to
  return a gas estimate, the decoded return value, and the would-be
  revert reason. No transaction signed, no transaction broadcast.
- Skip checkAndReserveExecution and the directExecutions row insert
  entirely; there is no funds reservation and no row to audit.
- Respond with the new shape `{ success, status: "simulated", from,
  to, value, gasEstimate, simulatedReturnValue, wouldRevert,
  revertReason? }`. Reverts return HTTP 400 with the decoded reason;
  successful simulations return 200.

check-and-execute simulate evaluates the condition unchanged (already
read-only) and only swaps the action's write for a simulation.

Known limitation: from-address is resolved via
getOrganizationWalletAddress, so orgs that route writes through a Safe
will see a simulation that reflects the EOA sending the call, not the
Safe. Most config bugs (bad ABI, bad args, insufficient balance,
allowance mismatches) still surface; Safe-routed msg.sender semantics
do not. Documented in the simulate module header.

Out of scope (future tickets):
- simulate_workflow MCP tool for full-workflow dry-runs.
- Safe-aware from-address in the simulator.
Unit tests for lib/execute/simulate.ts (tests/unit/execute-simulate):

- Happy path for simulateContractCall: estimateGas + provider.call
  succeed, decoded return value is serialised to JSON
- Revert path: synthetic Error(string) revert data is decoded via
  decodeRevertReason and propagated as revertReason / error
- Input-validation paths return wouldRevert without touching the
  network: bad ABI JSON, bad ABI shape, missing functionName, bad
  functionArgs JSON, non-array functionArgs, malformed value
- Native transfer: happy path returns gasEstimate, throw path
  surfaces the message, malformed amount short-circuits
- Token transfer: ERC-20 ABI is encoded and delegated to
  simulateContractCall; bad amount short-circuits

Route guard tests for /api/execute/* (tests/integration/
execute-simulate-route):

- contract-call: simulate=true (body) AND ?simulate=true (query)
  both route to the simulator and DO NOT call
  checkAndReserveExecution, markRunning, or writeContractCore;
  HTTP 400 on wouldRevert
- transfer: simulate=true with no token field routes to
  simulateNativeTransfer; with a tokenAddress to simulateTokenTransfer;
  a token simulate without an address returns 400 instead of guessing
- check-and-execute: condition is still read (readContractCore), then
  the action is simulated instead of being broadcast; no reserve, no
  writeContractCore

19 tests, 19 passing.
1. RPC failover. Every chain call in lib/execute/simulate.ts now
   routes through rpcManager.executeWithFailover (operation type
   "read" for estimateGas + provider.call, "preflight" for the
   on-chain decimals lookup) so a primary-RPC blip falls over to the
   chain's configured fallback. The prior version bypassed failover
   by calling getProvider().call directly, in violation of the
   plugins read-path convention.

2. Strict-boolean simulate flag. body.simulate must now be a literal
   boolean. Strings ("true", "false"), numbers, objects, and null are
   all rejected with a structured 400 carrying field: "simulate". The
   query-string variant (?simulate=true) is removed -- the endpoint
   has exactly one shape for the input so a mistyped string flag
   cannot silently fall through to a real broadcast. Logic extracted
   to app/api/execute/_lib/simulate-flag.ts (parseSimulateFlag) and
   reused by all three routes.

3. Token decimals. simulateTokenTransfer no longer trusts a default
   of 18 when the caller omits decimals. It fetches decimals() from
   the token contract on-chain (with RPC failover) before parseUnits
   so USDC / USDT / etc. produce correct estimates without the
   caller having to know the token's decimals up front. Explicit
   decimals are still honoured.

4. Token resolution parity. simulateTokenTransfer delegates token
   address resolution to parseTokenAddress from
   plugins/web3/steps/transfer-token-core, the same helper the
   broadcast path uses. This adds support for supportedTokenId /
   customToken / legacy tokenConfig shapes that the previous
   handwritten extractor missed.

5. DRY. The ethers.Interface for the contract call is created once
   and shared between encoding and decoding. The simulate-flag block
   in each of the three routes is now a one-liner that calls the
   shared parseSimulateFlag helper.

Tests:

- tests/unit/execute-simulate-flag.test.ts: 7 cases asserting strict-
  boolean semantics (true, false, undefined, "true", 1, object, null).
- tests/unit/execute-simulate.test.ts: rewritten to mock the
  executeWithFailover boundary plus parseTokenAddress. New cases
  cover the failover wiring (verifies operation type is "read"),
  the decimals preflight order on tokenTransfer, the
  decimals-provided fast path, and the parseTokenAddress reject
  path.
- tests/integration/execute-simulate-route.test.ts: replaces the
  ?simulate=true (query) test with a coverage triple: string flag
  rejected with 400, number flag rejected with 400, query-string
  ignored (one shape per input). Token-transfer test reworked to
  prove the route hands tokenConfig + omitted-decimals through to
  the simulator without 400ing on shape alone.

30 tests, 30 passing.
…ncurrency-cap

fix(executor): recover spurious step retries inside For Each parallel iterations
iOS Safari in Private Browsing / Lockdown Mode throws
SecurityError when localStorage is accessed. The mount effect read
storage unguarded, surfacing an unhandled error on every page for
those users. Wrap both reads and writes in try/catch and fail open.
Credential signup mints no Better Auth session until the pending-signup
TOTP enroll step, which inserts the session row directly and bypasses
the session.create.before hook that normally records the source IP in
user_trusted_ips. As a result the network a user signs up from was never
trusted, so a later sign-in from that same network was treated as new
and bounced to /verify-ip.

Record the source IP when the pending-signup session is minted, mirroring
the hook's first-attestation behavior. Extract the trusted-IP upsert into
a shared upsertTrustedIp helper and add recordTrustedIpFromRequest so the
hook and the enroll path stay in sync.

Email verification and TOTP enrollment remain mandatory; only the
redundant IP gate for the signup network is removed. A genuinely new
network still hits /verify-ip.
fix(auth): trust signup IP at TOTP enrollment
…ge-safari

fix: guard localStorage access in mobile warning dialog
The workflow-runner-block-imds NetworkPolicy only blocked link-local
(169.254/16), leaving runner pods able to reach all of RFC1918, CGNAT and
in-VPC services (RDS, Redis, other pods) over egress. A compromised runner
(e.g. SSRF via an http-request step that slips past SAFE_FETCH_ENFORCE)
could pivot to internal hosts.

Replace the single permissive rule with a union of explicit allows and
deny-the-rest:

- DNS to kube-dns only
- keeperhub-sandbox :8787 (delegated code execution)
- keeperhub-executor :3080 (metrics ingest)
- RDS Postgres :5432, scoped to the dedicated DB subnets
- public internet (all ports) with 10/8, 172.16/12, 192.168/16, 169.254/16,
  100.64/10 and 127/8 stripped; IPv6 keeps NAT64 while blocking link-local
  and ULA

In-cluster Services are matched by podSelector because the VPC CNI policy
agent evaluates egress on the post-DNAT backend pod IP. Public egress is
left port-unrestricted so non-443 RPC/webhook endpoints keep working.

Validated both manifests with kubectl apply --dry-run=server against the
live API servers.
…er-egress-networkpolicy

fix(deploy): restrict workflow-runner egress to required destinations
The sandbox sheds load with 429 + Retry-After when its per-pod concurrency
cap is reached, but the client treated any non-200 as terminal, so a
momentary overshoot surfaced as a hard error on the Code node.

Honor the Retry-After with bounded, jittered, abort-aware retries (default
3, via SANDBOX_MAX_RETRIES). Only capacity 429s are retried since the run
never started; all other failures stay terminal. Also adds a scaling notes
spec for the separate pod-sizing discussion.
Retry-After is read off an HTTP response, so feeding it straight into
setTimeout let a buggy or hostile value pin a workflow step on a long
timer (CodeQL: resource exhaustion). Clamp the backoff to a 5s ceiling at
both the parse site and the timer sink.
CodeQL flagged feeding the Retry-After header into setTimeout as a
resource-exhaustion sink, and a Math.min clamp did not break the taint.
Drop the header-derived duration entirely and use a fixed, capped,
exponential backoff schedule. The 429 is still the retry trigger; only the
wait magnitude changes (the sandbox always signals a 1s backoff anyway).
fix: retry sandbox capacity 429 before surfacing to caller
The previous shortcut of `estimateGas`-only was correct for the
textbook EOA-to-EOA case but dropped the return bytes for contract /
precompile recipients (fallback handlers, the 0x01-0x0a precompiles,
etc.) -- simulatedReturnValue was always null, asymmetric with the
contract-call simulator.

Now runs both via a single executeWithFailover, so a contract recipient
surfaces its return bytes and an EOA recipient still serialises as
null (provider.call returns "0x"). One round-trip, no real cost.
Three small follow-ups from review:

1. simulateConditionalWrite in check-and-execute now sets
   executed: !result.wouldRevert. A reverted simulate means a real
   broadcast would have reverted too, so executed: true would have
   been misleading.

2. simulate operation type passed to executeWithFailover is now
   "preflight" (was "read"). The simulate path is structurally a
   pre-broadcast check, not a regular eth_call, so it sits next to
   the existing decimals preflight in ops dashboards.

3. decodeRevertReason unwraps the standard Error(string) selector
   (0x08c379a0) so the revertReason field carries just the message
   ("ERC20: transfer amount exceeds balance") instead of the awkward
   Error(...) wrapper ("Error(ERC20: transfer amount exceeds
   balance)"). Custom errors and Panic(uint256) are untouched.

Plus a Dry-Run Simulation section in docs/api/direct-execution.md
covering the simulate request shape, strict-boolean requirement,
success / would-revert response shapes, token-transfer specifics
(on-chain decimals + tokenConfig parity), the check-and-execute
envelope, and the Safe-routed from-address limitation.

Tests still 31/31; the operation-type assertions in
tests/unit/execute-simulate.test.ts are updated to "preflight".
The previous commit stripped the Error(...) wrapper from the standard
Solidity Error(string) selector. That was a UX improvement but a
breaking shape change: the existing tests pinned the wrapped form
("we keep its shape for callers that depend on it") and downstream
consumers may parse the Error(...) format. Reverting the decoder and
the docs section that promised the unwrapped output. The other two
follow-ups (executed: !wouldRevert, preflight op type) stay.
…e-dry-run

feat(execute): simulate/dry-run mode on contract-call, transfer, check-and-execute
@joelorzet joelorzet requested review from a team, OleksandrUA, eskp and suisuss and removed request for a team June 1, 2026 20:55
@joelorzet joelorzet merged commit 285b262 into prod Jun 1, 2026
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants