
fix: bound retries for empty-bytecode RPC responses instead of halting #128

Closed · flyq wants to merge 2 commits into main from liquan/opt_get_code


Conversation

@flyq (Member) commented Apr 29, 2026

Summary

Fixes a validator halt triggered when the upstream RPC node returns empty bytecode (0x) for a non-empty codehash. This is a common condition during rpc-node restart / cold start, where mega-reth's eth_getCodeByHash calls unwrap_or_default() on a missing-from-state lookup and silently returns empty bytes. Previously this surfaced as VerificationFailure (keccak of 0x ≠ the requested hash) and halted the validator. It is now a typed BytecodeUnavailable variant that the RPC client retries in place against the rpc_retry schedule, bounded by a 5-minute budget. If upstream still returns empty bytes once the budget is exhausted, the validator halts: a longer outage should be operator-visible, not hidden behind silent retries.

Changes

crates/stateless-common/src/rpc_client.rs

  • New CodeFetchError::BytecodeUnavailable { requested } variant — distinct from VerificationFailure (non-empty wrong bytes, real divergence).
  • RpcClientConfig gains bytecode_unavailable_retry_budget: Duration (default 5 min). Total cap on the in-place retry loop for empty-bytes responses; tunable per-test for fast assertions.
  • get_codes_with_deadline now wraps each per-hash fetch in a retry loop:
    • Transport errors → unbounded retry inside get_code_with_deadline (unchanged).
    • Empty bytes for non-empty hash → sleep on the rpc_retry exponential schedule (500ms → 30s cap), retry until the configured budget elapses, then surface BytecodeUnavailable.
    • Non-empty wrong bytes → immediate VerificationFailure, never retried.
    • Caller-supplied deadline (when present) tightens the unavailable-retry cutoff.
  • Doc comments on CodeFetchError and get_codes_with_deadline rewritten: all variants are now caller-visible halt signals; the transient "upstream not synced yet" condition is absorbed inside the client.

bin/stateless-validator/src/chain_sync.rs

  • ValidatorProcessor maps BytecodeUnavailable to transient=false (halt) — same as VerificationFailure. No more cycle restart for this condition; if it surfaces here, the 5-minute budget is already exhausted.
  • Comment updated to explain the new error contract from get_codes.

bin/debug-trace-server/src/data_provider.rs

  • From<CodeFetchError> for DataProviderError extended to handle BytecodeUnavailable (folded into the same eyre-wrapped path as VerificationFailure; trace server has no halt-vs-retry split).

Test coverage

  • New test_get_codes_verify_empty_bytecode_signals_unavailable in rpc_client.rs: starts a mock RPC that returns empty bytes for any codehash, drives get_codes(verify=true) with a 10ms budget, asserts the result is BytecodeUnavailable (not VerificationFailure).

Test plan

  • cargo test --workspace (passes locally — 17 tests in stateless-common, all stateless-validator / debug-trace-server lib tests green)
  • cargo clippy --workspace --all-targets --all-features (clean)
  • cargo fmt --all --check (clean)
  • Manual: point a validator at a fresh / restarting rpc-node, confirm it rides through the warm-up period without halting and only halts if upstream stays unavailable past 5 minutes.

@claude (Bot) commented Apr 29, 2026

The PR currently has no labels, but based on the diff and description it should have the bug label.

The PR title uses the fix: prefix and the summary explicitly describes a validator halt triggered by empty bytecode responses from the upstream RPC node — this is a bug fix, not an enhancement or new feature. The bug label ("Something isn't working") is the appropriate match.

@flyq flyq closed this Apr 29, 2026

Labels

None yet
