release: to prod by joelorzet · Pull Request #1432 · KeeperHub/keeperhub

joelorzet · 2026-06-01T20:55:42Z

No description provided.

Parallel For Each iterations fail with "Step ... exceeded max retries (1 retry)" because the Workflow DevKit drops step_completed events under heavy fan-in and re-fires steps that already succeeded. The existing KEEP-398/KEEP-431 recovery only runs for top-level nodes; loop-body nodes hit for-each-body-runner.ts's catch and record the failure instead. Make the step-success authority iteration-aware and apply the same recovery pattern inside the body runner. - step-success-tracker: composite key nodeId::forEachId::iter so parallel iterations of the same body node don't clobber each other. New getStepSuccess(); getSuccessfulSteps() preserved for existing cold-pod degradation callers. - step-handler: pass iteration key to recordStepSuccess when the step ran inside a For Each. - get-completed-step-output: new fetchCompletedStepOutputAtIterationStep query that matches the iteration row exactly instead of filtering iteration rows out. Iteration-aware inflight cache key. - for-each-body-runner: optional resolveSpuriousRecovery / onSpuriousRecovery hooks on RunBodyContext. The catch consults the resolver when the error matches the spurious pattern. Extracted routeAfterSuccess helper so the recovered step's downstream still fires through the same routing as the normal success path. - executor.workflow: wire the resolver via iteration-keyed getCompletedStepOutput. Emit workflow.executor.spurious_recovery.total{source="body_runner"}. Schema unchanged: iteration_index and for_each_node_id columns already exist on workflow_execution_logs.

…k-and-execute Every POST /api/execute/* path previously reserved USDC and broadcast on the first try. Builders had no way to validate inputs against the real chain without spending real funds. Five hackathon teams reported this as a P1 / P2. Adds `simulate: true` on the request body (and `?simulate=true` query flag) to all three execute endpoints. When set: - Validate the same inputs (wallet, ABI, args, recipient). - Run a read-only path that resolves the org's from-address, encodes the call, and combines `provider.estimateGas` + `provider.call` to return a gas estimate, the decoded return value, and the would-be revert reason. No transaction signed, no transaction broadcast. - Skip checkAndReserveExecution and the directExecutions row insert entirely; there is no funds reservation and no row to audit. - Respond with the new shape `{ success, status: "simulated", from, to, value, gasEstimate, simulatedReturnValue, wouldRevert, revertReason? }`. Reverts return HTTP 400 with the decoded reason; successful simulations return 200. check-and-execute simulate evaluates the condition unchanged (already read-only) and only swaps the action's write for a simulation. Known limitation: from-address is resolved via getOrganizationWalletAddress, so orgs that route writes through a Safe will see a simulation that reflects the EOA sending the call, not the Safe. Most config bugs (bad ABI, bad args, insufficient balance, allowance mismatches) still surface; Safe-routed msg.sender semantics do not. Documented in the simulate module header. Out of scope (future tickets): - simulate_workflow MCP tool for full-workflow dry-runs. - Safe-aware from-address in the simulator.

Unit tests for lib/execute/simulate.ts (tests/unit/execute-simulate): - Happy path for simulateContractCall: estimateGas + provider.call succeed, decoded return value is serialised to JSON - Revert path: synthetic Error(string) revert data is decoded via decodeRevertReason and propagated as revertReason / error - Input-validation paths return wouldRevert without touching the network: bad ABI JSON, bad ABI shape, missing functionName, bad functionArgs JSON, non-array functionArgs, malformed value - Native transfer: happy path returns gasEstimate, throw path surfaces the message, malformed amount short-circuits - Token transfer: ERC-20 ABI is encoded and delegated to simulateContractCall; bad amount short-circuits Route guard tests for /api/execute/* (tests/integration/ execute-simulate-route): - contract-call: simulate=true (body) AND ?simulate=true (query) both route to the simulator and DO NOT call checkAndReserveExecution, markRunning, or writeContractCore; HTTP 400 on wouldRevert - transfer: simulate=true with no token field routes to simulateNativeTransfer; with a tokenAddress to simulateTokenTransfer; a token simulate without an address returns 400 instead of guessing - check-and-execute: condition is still read (readContractCore), then the action is simulated instead of being broadcast; no reserve, no writeContractCore 19 tests, 19 passing.

1. RPC failover. Every chain call in lib/execute/simulate.ts now routes through rpcManager.executeWithFailover (operation type "read" for estimateGas + provider.call, "preflight" for the on-chain decimals lookup) so a primary-RPC blip falls over to the chain's configured fallback. The prior version bypassed failover by calling getProvider().call directly, in violation of the plugins read-path convention. 2. Strict-boolean simulate flag. body.simulate must now be a literal boolean. Strings ("true", "false"), numbers, objects, and null are all rejected with a structured 400 carrying field: "simulate". The query-string variant (?simulate=true) is removed -- the endpoint has exactly one shape for the input so a mistyped string flag cannot silently fall through to a real broadcast. Logic extracted to app/api/execute/_lib/simulate-flag.ts (parseSimulateFlag) and reused by all three routes. 3. Token decimals. simulateTokenTransfer no longer trusts a default of 18 when the caller omits decimals. It fetches decimals() from the token contract on-chain (with RPC failover) before parseUnits so USDC / USDT / etc. produce correct estimates without the caller having to know the token's decimals up front. Explicit decimals are still honoured. 4. Token resolution parity. simulateTokenTransfer delegates token address resolution to parseTokenAddress from plugins/web3/steps/transfer-token-core, the same helper the broadcast path uses. This adds support for supportedTokenId / customToken / legacy tokenConfig shapes that the previous handwritten extractor missed. 5. DRY. The ethers.Interface for the contract call is created once and shared between encoding and decoding. The simulate-flag block in each of the three routes is now a one-liner that calls the shared parseSimulateFlag helper. Tests: - tests/unit/execute-simulate-flag.test.ts: 7 cases asserting strict- boolean semantics (true, false, undefined, "true", 1, object, null). - tests/unit/execute-simulate.test.ts: rewritten to mock the executeWithFailover boundary plus parseTokenAddress. New cases cover the failover wiring (verifies operation type is "read"), the decimals preflight order on tokenTransfer, the decimals-provided fast path, and the parseTokenAddress reject path. - tests/integration/execute-simulate-route.test.ts: replaces the ?simulate=true (query) test with a coverage triple: string flag rejected with 400, number flag rejected with 400, query-string ignored (one shape per input). Token-transfer test reworked to prove the route hands tokenConfig + omitted-decimals through to the simulator without 400ing on shape alone. 30 tests, 30 passing.

…ncurrency-cap fix(executor): recover spurious step retries inside For Each parallel iterations

iOS Safari in Private Browsing / Lockdown Mode throws SecurityError when localStorage is accessed. The mount effect read storage unguarded, surfacing an unhandled error on every page for those users. Wrap both reads and writes in try/catch and fail open.

Credential signup mints no Better Auth session until the pending-signup TOTP enroll step, which inserts the session row directly and bypasses the session.create.before hook that normally records the source IP in user_trusted_ips. As a result the network a user signs up from was never trusted, so a later sign-in from that same network was treated as new and bounced to /verify-ip. Record the source IP when the pending-signup session is minted, mirroring the hook's first-attestation behavior. Extract the trusted-IP upsert into a shared upsertTrustedIp helper and add recordTrustedIpFromRequest so the hook and the enroll path stay in sync. Email verification and TOTP enrollment remain mandatory; only the redundant IP gate for the signup network is removed. A genuinely new network still hits /verify-ip.

fix(auth): trust signup IP at TOTP enrollment

…ge-safari fix: guard localStorage access in mobile warning dialog

The workflow-runner-block-imds NetworkPolicy only blocked link-local (169.254/16), leaving runner pods able to reach all of RFC1918, CGNAT and in-VPC services (RDS, Redis, other pods) over egress. A compromised runner (e.g. SSRF via an http-request step that slips past SAFE_FETCH_ENFORCE) could pivot to internal hosts. Replace the single permissive rule with a union of explicit allows and deny-the-rest: - DNS to kube-dns only - keeperhub-sandbox :8787 (delegated code execution) - keeperhub-executor :3080 (metrics ingest) - RDS Postgres :5432, scoped to the dedicated DB subnets - public internet (all ports) with 10/8, 172.16/12, 192.168/16, 169.254/16, 100.64/10 and 127/8 stripped; IPv6 keeps NAT64 while blocking link-local and ULA In-cluster Services are matched by podSelector because the VPC CNI policy agent evaluates egress on the post-DNAT backend pod IP. Public egress is left port-unrestricted so non-443 RPC/webhook endpoints keep working. Validated both manifests with kubectl apply --dry-run=server against the live API servers.

…er-egress-networkpolicy fix(deploy): restrict workflow-runner egress to required destinations

The sandbox sheds load with 429 + Retry-After when its per-pod concurrency cap is reached, but the client treated any non-200 as terminal, so a momentary overshoot surfaced as a hard error on the Code node. Honor the Retry-After with bounded, jittered, abort-aware retries (default 3, via SANDBOX_MAX_RETRIES). Only capacity 429s are retried since the run never started; all other failures stay terminal. Also adds a scaling notes spec for the separate pod-sizing discussion.

Retry-After is read off an HTTP response, so feeding it straight into setTimeout let a buggy or hostile value pin a workflow step on a long timer (CodeQL: resource exhaustion). Clamp the backoff to a 5s ceiling at both the parse site and the timer sink.

CodeQL flagged feeding the Retry-After header into setTimeout as a resource-exhaustion sink, and a Math.min clamp did not break the taint. Drop the header-derived duration entirely and use a fixed, capped, exponential backoff schedule. The 429 is still the retry trigger; only the wait magnitude changes (the sandbox always signals a 1s backoff anyway).

fix: retry sandbox capacity 429 before surfacing to caller

The previous shortcut of `estimateGas`-only was correct for the textbook EOA-to-EOA case but dropped the return bytes for contract / precompile recipients (fallback handlers, the 0x01-0x0a precompiles, etc.) -- simulatedReturnValue was always null, asymmetric with the contract-call simulator. Now runs both via a single executeWithFailover, so a contract recipient surfaces its return bytes and an EOA recipient still serialises as null (provider.call returns "0x"). One round-trip, no real cost.

…ute-simulate-dry-run

Three small follow-ups from review: 1. simulateConditionalWrite in check-and-execute now sets executed: !result.wouldRevert. A reverted simulate means a real broadcast would have reverted too, so executed: true would have been misleading. 2. simulate operation type passed to executeWithFailover is now "preflight" (was "read"). The simulate path is structurally a pre-broadcast check, not a regular eth_call, so it sits next to the existing decimals preflight in ops dashboards. 3. decodeRevertReason unwraps the standard Error(string) selector (0x08c379a0) so the revertReason field carries just the message ("ERC20: transfer amount exceeds balance") instead of the awkward Error(...) wrapper ("Error(ERC20: transfer amount exceeds balance)"). Custom errors and Panic(uint256) are untouched. Plus a Dry-Run Simulation section in docs/api/direct-execution.md covering the simulate request shape, strict-boolean requirement, success / would-revert response shapes, token-transfer specifics (on-chain decimals + tokenConfig parity), the check-and-execute envelope, and the Safe-routed from-address limitation. Tests still 31/31; the operation-type assertions in tests/unit/execute-simulate.test.ts are updated to "preflight".

The previous commit stripped the Error(...) wrapper from the standard Solidity Error(string) selector. That was a UX improvement but a breaking shape change: the existing tests pinned the wrapped form ("we keep its shape for callers that depend on it") and downstream consumers may parse the Error(...) format. Reverting the decoder and the docs section that promised the unwrapped output. The other two follow-ups (executed: !wouldRevert, preflight op type) stay.

…e-dry-run feat(execute): simulate/dry-run mode on contract-call, transfer, check-and-execute

joelorzet and others added 25 commits May 13, 2026 18:44

style: apply ultracite/biome formatting

b9634a9

merge: bring staging into KEEP-543 branch

c4e1bf1

Merge branch 'staging' into fix/KEEP-543-loop-parallel-concurrency-cap

f74c8a9

Merge branch 'staging' into fix/KEEP-543-loop-parallel-concurrency-cap

7180af5

Merge pull request #1324 from KeeperHub/fix/KEEP-543-loop-parallel-co…

c9f6069

…ncurrency-cap fix(executor): recover spurious step retries inside For Each parallel iterations

Merge pull request #1429 from KeeperHub/fix/record-trusted-ip-on-signup

0801e79

fix(auth): trust signup IP at TOTP enrollment

Merge pull request #1428 from KeeperHub/fix/mobile-warning-localstora…

1fc75eb

…ge-safari fix: guard localStorage access in mobile warning dialog

Merge pull request #1430 from KeeperHub/TECH-30-tighten-workflow-runn…

03fbc02

…er-egress-networkpolicy fix(deploy): restrict workflow-runner egress to required destinations

docs: simplify sandbox scaling notes for readability

b48ab0b

Merge pull request #1431 from KeeperHub/fix/sandbox-429-retry

254d57d

fix: retry sandbox capacity 429 before surfacing to caller

Merge remote-tracking branch 'origin/staging' into feat/KEEP-471-exec…

5c445a5

…ute-simulate-dry-run

Merge pull request #1400 from KeeperHub/feat/KEEP-471-execute-simulat…

59cf1ce

…e-dry-run feat(execute): simulate/dry-run mode on contract-call, transfer, check-and-execute

joelorzet requested review from a team, OleksandrUA, eskp and suisuss and removed request for a team June 1, 2026 20:55

joelorzet temporarily deployed to staging June 1, 2026 20:55 — with GitHub Actions Inactive

joelorzet merged commit 285b262 into prod Jun 1, 2026
25 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release: to prod#1432

release: to prod#1432
joelorzet merged 25 commits into
prodfrom
staging

joelorzet commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

joelorzet commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants