Conversation
Parallel For Each iterations fail with "Step ... exceeded max retries
(1 retry)" because the Workflow DevKit drops step_completed events
under heavy fan-in and re-fires steps that already succeeded. The
existing KEEP-398/KEEP-431 recovery only runs for top-level nodes;
loop-body nodes hit for-each-body-runner.ts's catch and record the
failure instead.
Make the step-success authority iteration-aware and apply the same
recovery pattern inside the body runner.
- step-success-tracker: composite key nodeId::forEachId::iter so
parallel iterations of the same body node don't clobber each other.
New getStepSuccess(); getSuccessfulSteps() preserved for existing
cold-pod degradation callers.
- step-handler: pass iteration key to recordStepSuccess when the
step ran inside a For Each.
- get-completed-step-output: new fetchCompletedStepOutputAtIterationStep
query that matches the iteration row exactly instead of filtering
iteration rows out. Iteration-aware inflight cache key.
- for-each-body-runner: optional resolveSpuriousRecovery /
onSpuriousRecovery hooks on RunBodyContext. The catch consults the
resolver when the error matches the spurious pattern. Extracted
routeAfterSuccess helper so the recovered step's downstream still
fires through the same routing as the normal success path.
- executor.workflow: wire the resolver via iteration-keyed
getCompletedStepOutput. Emit
workflow.executor.spurious_recovery.total{source="body_runner"}.
Schema unchanged: iteration_index and for_each_node_id columns
already exist on workflow_execution_logs.
…k-and-execute
Every POST /api/execute/* path previously reserved USDC and broadcast
on the first try. Builders had no way to validate inputs against the
real chain without spending real funds. Five hackathon teams reported
this as a P1 / P2.
Adds `simulate: true` on the request body (and `?simulate=true` query
flag) to all three execute endpoints. When set:
- Validate the same inputs (wallet, ABI, args, recipient).
- Run a read-only path that resolves the org's from-address, encodes
the call, and combines `provider.estimateGas` + `provider.call` to
return a gas estimate, the decoded return value, and the would-be
revert reason. No transaction signed, no transaction broadcast.
- Skip checkAndReserveExecution and the directExecutions row insert
entirely; there is no funds reservation and no row to audit.
- Respond with the new shape `{ success, status: "simulated", from,
to, value, gasEstimate, simulatedReturnValue, wouldRevert,
revertReason? }`. Reverts return HTTP 400 with the decoded reason;
successful simulations return 200.
check-and-execute simulate evaluates the condition unchanged (already
read-only) and only swaps the action's write for a simulation.
Known limitation: from-address is resolved via
getOrganizationWalletAddress, so orgs that route writes through a Safe
will see a simulation that reflects the EOA sending the call, not the
Safe. Most config bugs (bad ABI, bad args, insufficient balance,
allowance mismatches) still surface; Safe-routed msg.sender semantics
do not. Documented in the simulate module header.
Out of scope (future tickets):
- simulate_workflow MCP tool for full-workflow dry-runs.
- Safe-aware from-address in the simulator.
Unit tests for lib/execute/simulate.ts (tests/unit/execute-simulate): - Happy path for simulateContractCall: estimateGas + provider.call succeed, decoded return value is serialised to JSON - Revert path: synthetic Error(string) revert data is decoded via decodeRevertReason and propagated as revertReason / error - Input-validation paths return wouldRevert without touching the network: bad ABI JSON, bad ABI shape, missing functionName, bad functionArgs JSON, non-array functionArgs, malformed value - Native transfer: happy path returns gasEstimate, throw path surfaces the message, malformed amount short-circuits - Token transfer: ERC-20 ABI is encoded and delegated to simulateContractCall; bad amount short-circuits Route guard tests for /api/execute/* (tests/integration/ execute-simulate-route): - contract-call: simulate=true (body) AND ?simulate=true (query) both route to the simulator and DO NOT call checkAndReserveExecution, markRunning, or writeContractCore; HTTP 400 on wouldRevert - transfer: simulate=true with no token field routes to simulateNativeTransfer; with a tokenAddress to simulateTokenTransfer; a token simulate without an address returns 400 instead of guessing - check-and-execute: condition is still read (readContractCore), then the action is simulated instead of being broadcast; no reserve, no writeContractCore 19 tests, 19 passing.
1. RPC failover. Every chain call in lib/execute/simulate.ts now
routes through rpcManager.executeWithFailover (operation type
"read" for estimateGas + provider.call, "preflight" for the
on-chain decimals lookup) so a primary-RPC blip falls over to the
chain's configured fallback. The prior version bypassed failover
by calling getProvider().call directly, in violation of the
plugins read-path convention.
2. Strict-boolean simulate flag. body.simulate must now be a literal
boolean. Strings ("true", "false"), numbers, objects, and null are
all rejected with a structured 400 carrying field: "simulate". The
query-string variant (?simulate=true) is removed -- the endpoint
has exactly one shape for the input so a mistyped string flag
cannot silently fall through to a real broadcast. Logic extracted
to app/api/execute/_lib/simulate-flag.ts (parseSimulateFlag) and
reused by all three routes.
3. Token decimals. simulateTokenTransfer no longer trusts a default
of 18 when the caller omits decimals. It fetches decimals() from
the token contract on-chain (with RPC failover) before parseUnits
so USDC / USDT / etc. produce correct estimates without the
caller having to know the token's decimals up front. Explicit
decimals are still honoured.
4. Token resolution parity. simulateTokenTransfer delegates token
address resolution to parseTokenAddress from
plugins/web3/steps/transfer-token-core, the same helper the
broadcast path uses. This adds support for supportedTokenId /
customToken / legacy tokenConfig shapes that the previous
handwritten extractor missed.
5. DRY. The ethers.Interface for the contract call is created once
and shared between encoding and decoding. The simulate-flag block
in each of the three routes is now a one-liner that calls the
shared parseSimulateFlag helper.
Tests:
- tests/unit/execute-simulate-flag.test.ts: 7 cases asserting strict-
boolean semantics (true, false, undefined, "true", 1, object, null).
- tests/unit/execute-simulate.test.ts: rewritten to mock the
executeWithFailover boundary plus parseTokenAddress. New cases
cover the failover wiring (verifies operation type is "read"),
the decimals preflight order on tokenTransfer, the
decimals-provided fast path, and the parseTokenAddress reject
path.
- tests/integration/execute-simulate-route.test.ts: replaces the
?simulate=true (query) test with a coverage triple: string flag
rejected with 400, number flag rejected with 400, query-string
ignored (one shape per input). Token-transfer test reworked to
prove the route hands tokenConfig + omitted-decimals through to
the simulator without 400ing on shape alone.
30 tests, 30 passing.
…ncurrency-cap fix(executor): recover spurious step retries inside For Each parallel iterations
iOS Safari in Private Browsing / Lockdown Mode throws SecurityError when localStorage is accessed. The mount effect read storage unguarded, surfacing an unhandled error on every page for those users. Wrap both reads and writes in try/catch and fail open.
Credential signup mints no Better Auth session until the pending-signup TOTP enroll step, which inserts the session row directly and bypasses the session.create.before hook that normally records the source IP in user_trusted_ips. As a result the network a user signs up from was never trusted, so a later sign-in from that same network was treated as new and bounced to /verify-ip. Record the source IP when the pending-signup session is minted, mirroring the hook's first-attestation behavior. Extract the trusted-IP upsert into a shared upsertTrustedIp helper and add recordTrustedIpFromRequest so the hook and the enroll path stay in sync. Email verification and TOTP enrollment remain mandatory; only the redundant IP gate for the signup network is removed. A genuinely new network still hits /verify-ip.
fix(auth): trust signup IP at TOTP enrollment
…ge-safari fix: guard localStorage access in mobile warning dialog
The workflow-runner-block-imds NetworkPolicy only blocked link-local (169.254/16), leaving runner pods able to reach all of RFC1918, CGNAT and in-VPC services (RDS, Redis, other pods) over egress. A compromised runner (e.g. SSRF via an http-request step that slips past SAFE_FETCH_ENFORCE) could pivot to internal hosts. Replace the single permissive rule with a union of explicit allows and deny-the-rest: - DNS to kube-dns only - keeperhub-sandbox :8787 (delegated code execution) - keeperhub-executor :3080 (metrics ingest) - RDS Postgres :5432, scoped to the dedicated DB subnets - public internet (all ports) with 10/8, 172.16/12, 192.168/16, 169.254/16, 100.64/10 and 127/8 stripped; IPv6 keeps NAT64 while blocking link-local and ULA In-cluster Services are matched by podSelector because the VPC CNI policy agent evaluates egress on the post-DNAT backend pod IP. Public egress is left port-unrestricted so non-443 RPC/webhook endpoints keep working. Validated both manifests with kubectl apply --dry-run=server against the live API servers.
…er-egress-networkpolicy fix(deploy): restrict workflow-runner egress to required destinations
The sandbox sheds load with 429 + Retry-After when its per-pod concurrency cap is reached, but the client treated any non-200 as terminal, so a momentary overshoot surfaced as a hard error on the Code node. Honor the Retry-After with bounded, jittered, abort-aware retries (default 3, via SANDBOX_MAX_RETRIES). Only capacity 429s are retried since the run never started; all other failures stay terminal. Also adds a scaling notes spec for the separate pod-sizing discussion.
Retry-After is read off an HTTP response, so feeding it straight into setTimeout let a buggy or hostile value pin a workflow step on a long timer (CodeQL: resource exhaustion). Clamp the backoff to a 5s ceiling at both the parse site and the timer sink.
CodeQL flagged feeding the Retry-After header into setTimeout as a resource-exhaustion sink, and a Math.min clamp did not break the taint. Drop the header-derived duration entirely and use a fixed, capped, exponential backoff schedule. The 429 is still the retry trigger; only the wait magnitude changes (the sandbox always signals a 1s backoff anyway).
fix: retry sandbox capacity 429 before surfacing to caller
The previous shortcut of `estimateGas`-only was correct for the textbook EOA-to-EOA case but dropped the return bytes for contract / precompile recipients (fallback handlers, the 0x01-0x0a precompiles, etc.) -- simulatedReturnValue was always null, asymmetric with the contract-call simulator. Now runs both via a single executeWithFailover, so a contract recipient surfaces its return bytes and an EOA recipient still serialises as null (provider.call returns "0x"). One round-trip, no real cost.
…ute-simulate-dry-run
Three small follow-ups from review:
1. simulateConditionalWrite in check-and-execute now sets
executed: !result.wouldRevert. A reverted simulate means a real
broadcast would have reverted too, so executed: true would have
been misleading.
2. simulate operation type passed to executeWithFailover is now
"preflight" (was "read"). The simulate path is structurally a
pre-broadcast check, not a regular eth_call, so it sits next to
the existing decimals preflight in ops dashboards.
3. decodeRevertReason unwraps the standard Error(string) selector
(0x08c379a0) so the revertReason field carries just the message
("ERC20: transfer amount exceeds balance") instead of the awkward
Error(...) wrapper ("Error(ERC20: transfer amount exceeds
balance)"). Custom errors and Panic(uint256) are untouched.
Plus a Dry-Run Simulation section in docs/api/direct-execution.md
covering the simulate request shape, strict-boolean requirement,
success / would-revert response shapes, token-transfer specifics
(on-chain decimals + tokenConfig parity), the check-and-execute
envelope, and the Safe-routed from-address limitation.
Tests still 31/31; the operation-type assertions in
tests/unit/execute-simulate.test.ts are updated to "preflight".
The previous commit stripped the Error(...) wrapper from the standard
Solidity Error(string) selector. That was a UX improvement but a
breaking shape change: the existing tests pinned the wrapped form
("we keep its shape for callers that depend on it") and downstream
consumers may parse the Error(...) format. Reverting the decoder and
the docs section that promised the unwrapped output. The other two
follow-ups (executed: !wouldRevert, preflight op type) stay.
…e-dry-run feat(execute): simulate/dry-run mode on contract-call, transfer, check-and-execute
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.