Proposal: RPC Refactor + Retry Queue Worker (Rescanner Replacement) #69

@vietddude

Description

1) Summary

Refactor the RPC infrastructure into a cleaner package layout, and replace the rescanner with a retry worker that reuses the same queue engine as the manual worker, backed by a dedicated retry queue.

2) Why We Need This

Current pain points in main:

  1. RPC construction and provider wiring are duplicated across chain builder functions in internal/worker/factory.go.
  2. Failover, transport, auth, and chain-specific logic are too tightly coupled in internal/rpc.
  3. Rescanner depends on a shared failedChan fan-in/fan-out model.
  4. failedChan is non-blocking and drops events when full (logged as failedChan full, dropping block event), which can delay or silently lose retries.
  5. FailedBlockEvent includes Chain, but the rescanner listener consumes from the shared channel without chain filtering, creating a cross-chain contamination risk.

3) Goals

  1. Standardize RPC composition (transport + failover + chain manager).
  2. Remove channel-driven rescanner path and replace with deterministic queue-driven retry.
  3. Reuse one queue worker implementation for manual and retry.
  4. Preserve operational behavior for regular/catchup/manual/mempool workers.
  5. Provide safe migration from legacy failed-block storage.

4) Non-Goals

  1. No change to business event payload schema.
  2. No change to chain parser semantics.
  3. No hard cutover requiring immediate config rewrite by operators.

5) Proposed Architecture

5.1 RPC Package Refactor

Target package layout:

  1. pkg/rpc/failover
  2. pkg/rpc/transport/httpx
  3. pkg/rpc/transport/jsonrpc
  4. pkg/rpc/{evm,tron,bitcoin,solana,sui,cosmos,aptos,ton}
  5. pkg/rpc/bootstrap

Key design:

  1. Each chain package exposes NewProviderManager(chainName, chainCfg).
  2. bootstrap has generic builders:
    • BuildHTTPFailover(...)
    • BuildGRPCFailover(...)
  3. ProviderManager[T] owns provider selection, retries, blacklisting, and metrics.
  4. Transport concerns (auth, request/response handling, JSON-RPC batching) are isolated from failover policy.

Notes:

  1. Branch refactor-rpc currently uses internal/rpc/* with this architecture already applied.
  2. We can land the behavior refactor first under internal/rpc, then move to pkg/rpc in a follow-up rename if we want a lower-risk rollout.

5.2 Worker Refactor: Rescanner -> Retry Queue

Replace rescanner with a queue worker model:

  1. Keep one generic queue worker runtime:
    • internal/worker/queue/worker.go
  2. ManualWorker and RetryWorker both wrap queue worker core.
  3. Two queue stores:
    • Manual queue: missing_blocks:*
    • Retry queue: retry_blocks:*
  4. Retry queue uses small range granularity (MaxBlocksPerRange = 1) for per-block retries.
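The two-store layout and range granularity above can be made concrete with a pair of helpers. The key prefixes and AddRange shape follow the proposal; the manual-queue range size and function names are illustrative assumptions, not the real blockrangestore API.

```go
package main

import "fmt"

// Hypothetical queue-key helpers showing how manual and retry share one
// engine but use distinct Redis key namespaces and range granularity.
const (
	manualPrefix      = "missing_blocks"
	retryPrefix       = "retry_blocks"
	retryMaxPerRange  = 1   // per-block retries, per the proposal
	manualMaxPerRange = 100 // illustrative default, not from the proposal
)

func queueKey(prefix, network string) string {
	return fmt.Sprintf("%s:%s", prefix, network)
}

// splitRange chunks [from, to] into ranges of at most maxPerRange blocks,
// so the retry queue with maxPerRange=1 yields one range per failed block.
func splitRange(from, to, maxPerRange uint64) [][2]uint64 {
	var out [][2]uint64
	for start := from; start <= to; start += maxPerRange {
		end := start + maxPerRange - 1
		if end > to {
			end = to
		}
		out = append(out, [2]uint64{start, end})
	}
	return out
}

func main() {
	fmt.Println(queueKey(retryPrefix, "evm"))                // retry_blocks:evm
	fmt.Println(len(splitRange(100, 102, retryMaxPerRange))) // 3
}
```

Keeping the granularity as a per-queue constant means the same worker core can serve wide manual backfills and single-block retries without branching logic.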

Block failure flow:

  1. Regular/Catchup/Manual processing hits block error.
  2. Runtime processor enqueues result.Number to retry queue (AddRange(network, n, n)).
  3. Retry worker consumes queue and reprocesses.
  4. On success, queue progress is updated and range removed.

Result:

  1. Remove failedChan and FailedBlockEvent from worker runtime path.
  2. Eliminate shared channel race/cross-chain misrouting.
  3. Retry behavior becomes observable and deterministic through Redis queue state.

6) Data Migration and Compatibility

6.1 Legacy Failed Block Migration

At manager bootstrap (per chain):

  1. Read legacy failed blocks from blockStore.GetFailedBlocks(internalCode).
  2. Enqueue each block into retry queue.
  3. Remove migrated entries from legacy failed block store.
  4. Log migrated count.
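The four bootstrap steps above can be sketched as one migration function. GetFailedBlocks matches the name in the proposal; ClearFailedBlocks, the retryQueue stand-in, and the in-memory stores are assumptions for illustration.

```go
package main

import "fmt"

// legacyStore fakes the old failed-block storage keyed by internal code.
type legacyStore struct{ failed map[string][]uint64 }

func (s *legacyStore) GetFailedBlocks(internalCode string) []uint64 { return s.failed[internalCode] }
func (s *legacyStore) ClearFailedBlocks(internalCode string)        { delete(s.failed, internalCode) }

// retryQueue fakes the new Redis-backed retry queue.
type retryQueue struct{ blocks []uint64 }

func (q *retryQueue) AddRange(network string, from, to uint64) { q.blocks = append(q.blocks, from) }

// migrateFailedBlocks enqueues legacy failed blocks as one-block ranges,
// deletes the legacy entries, and returns the count for logging.
func migrateFailedBlocks(s *legacyStore, q *retryQueue, internalCode, network string) int {
	blocks := s.GetFailedBlocks(internalCode)
	for _, n := range blocks {
		q.AddRange(network, n, n)
	}
	s.ClearFailedBlocks(internalCode)
	return len(blocks)
}

func main() {
	s := &legacyStore{failed: map[string][]uint64{"evm:1": {10, 11}}}
	q := &retryQueue{}
	fmt.Printf("migrated %d failed blocks\n", migrateFailedBlocks(s, q, "evm:1", "ethereum"))
}
```

Deleting the legacy entries in the same pass is what closes the duplicate-retry window called out in the risks section.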

6.2 Config Compatibility

Short-term compatibility policy:

  1. Keep parsing services.worker.rescanner.enabled.
  2. Mark it deprecated and ignore at runtime.
  3. Add services.worker.retry.enabled (recommended) or keep retry always-on (if queue idle cost is acceptable).

Recommended:

  1. Introduce explicit retry config flag with default true.
  2. Keep the rescanner key accepted for at least 2 release cycles, with warning logs on use.
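The recommended compatibility shim can be sketched as a small resolver. The struct, field tags, and function name are hypothetical; the policy they encode (legacy key parsed but ignored with a warning, retry defaulting to true) follows the recommendation above.

```go
package main

import "fmt"

// workerConfig is a hypothetical slice of the config tree; pointer fields
// distinguish "unset" from an explicit false.
type workerConfig struct {
	RescannerEnabled *bool `yaml:"rescanner_enabled"` // deprecated, ignored
	RetryEnabled     *bool `yaml:"retry_enabled"`
}

// resolveRetryEnabled applies the default and returns a deprecation
// warning when the legacy rescanner key is still present.
func resolveRetryEnabled(c workerConfig) (enabled bool, warning string) {
	if c.RescannerEnabled != nil {
		warning = "services.worker.rescanner.enabled is deprecated and ignored; use services.worker.retry.enabled"
	}
	if c.RetryEnabled != nil {
		return *c.RetryEnabled, warning
	}
	return true, warning // default true, per the recommendation above
}

func main() {
	legacy := true
	enabled, warn := resolveRetryEnabled(workerConfig{RescannerEnabled: &legacy})
	fmt.Println(enabled, warn != "") // true true: retry stays on, warning emitted
}
```

Using pointers for both flags is the detail that lets an old production config (rescanner key set, retry key absent) boot with a warning only, satisfying acceptance criterion 6.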

6.3 CLI / UX

  1. Change user-facing wording from "rescanner" to "automatic retry worker".
  2. No extra flag required in phase 1 if retry is default-enabled.

7) Rollout Plan

Phase 0: Preparation

  1. Add queue store abstraction for retry (pkg/store/blockrangestore).
  2. Add metrics names for retry queue depth/throughput.

Phase 1: RPC Layer Refactor (No Behavior Change)

  1. Introduce provider-manager builders per chain.
  2. Keep existing indexer behavior untouched.
  3. Add unit tests for failover manager and transport helpers.

Phase 2: Queue Runtime Unification

  1. Introduce runtime core + queue worker.
  2. Switch manual worker to queue worker implementation.
  3. Add retry worker using same queue worker engine.

Phase 3: Rescanner Deprecation

  1. Remove failedChan writes and listeners from worker flow.
  2. Migrate legacy failed blocks at startup.
  3. Keep config backward compatibility warning for rescanner.

Phase 4: Cleanup

  1. Remove dead rescanner code paths.
  2. Finalize docs and runbook updates.
  3. Optionally rename internal/rpc to pkg/rpc if not done yet.

8) Testing Strategy

  1. Unit tests:
    • Queue add/merge/claim/remove semantics.
    • Retry enqueue on block error in processor.
    • Retry worker range progress + removal behavior.
    • Failover provider switching and blacklist recovery.
  2. Integration tests:
    • Multi-chain run with injected RPC failures to verify no cross-chain retry pollution.
    • Migration test from legacy failed blocks into retry queue.
  3. Regression tests:
    • Event emission path unchanged for successful blocks.
    • Catchup/manual behavior unchanged.
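The add/merge semantics in unit-test point 1 can be pinned down against a pure function like the one below. mergeRanges is an illustrative stand-in for the queue store's merge behavior, assuming sorted input where adjacent or overlapping ranges collapse.

```go
package main

import "fmt"

// mergeRanges collapses sorted, possibly adjacent or overlapping block
// ranges; a queue unit test can assert these semantics directly.
func mergeRanges(rs [][2]uint64) [][2]uint64 {
	if len(rs) == 0 {
		return rs
	}
	out := [][2]uint64{rs[0]}
	for _, r := range rs[1:] {
		last := &out[len(out)-1]
		if r[0] <= last[1]+1 { // overlaps or touches the previous range
			if r[1] > last[1] {
				last[1] = r[1]
			}
			continue
		}
		out = append(out, r)
	}
	return out
}

func main() {
	merged := mergeRanges([][2]uint64{{1, 3}, {4, 6}, {10, 12}})
	fmt.Println(merged) // [[1 6] [10 12]]
}
```

Testing merge logic as a pure function keeps the Redis-backed integration tests focused on claim/remove and cross-chain isolation rather than arithmetic.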

9) Observability

Add or expose:

  1. Retry queue depth per chain.
  2. Retry processed/success/failure counts.
  3. Retry enqueue rate by source worker mode (regular/catchup/manual).
  4. Failover metrics snapshot per chain/provider.
  5. Warning count for deprecated rescanner config usage.
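A minimal per-chain shape for the counters above might look like the sketch below. The type and method names are assumptions; a real deployment would back these with the project's metrics library rather than plain atomics.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// retryMetrics is a hypothetical per-chain counter set covering items
// 1-3 above: queue depth plus processed/success/failure counts.
type retryMetrics struct {
	queueDepth int64
	processed  int64
	succeeded  int64
	failed     int64
}

type metricsRegistry struct{ perChain map[string]*retryMetrics }

// chain lazily allocates the counter set for a chain name.
func (r *metricsRegistry) chain(name string) *retryMetrics {
	m, ok := r.perChain[name]
	if !ok {
		m = &retryMetrics{}
		r.perChain[name] = m
	}
	return m
}

// recordRetry bumps processed plus exactly one of succeeded/failed.
func (r *metricsRegistry) recordRetry(chain string, ok bool) {
	m := r.chain(chain)
	atomic.AddInt64(&m.processed, 1)
	if ok {
		atomic.AddInt64(&m.succeeded, 1)
	} else {
		atomic.AddInt64(&m.failed, 1)
	}
}

func main() {
	reg := &metricsRegistry{perChain: map[string]*retryMetrics{}}
	reg.recordRetry("evm", true)
	reg.recordRetry("evm", false)
	m := reg.chain("evm")
	fmt.Println(m.processed, m.succeeded, m.failed) // 2 1 1
}
```

Keying everything by chain name mirrors the queue layout, so queue depth and retry throughput line up per chain on a dashboard.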

10) Risks and Mitigations

  1. Risk: Duplicate retries if legacy store and retry queue are both active.
    • Mitigation: one-time migration then delete legacy entries.
  2. Risk: Redis queue growth under persistent RPC outage.
    • Mitigation: queue depth alerting + provider failover tuning + retry backoff.
  3. Risk: Behavior drift during package move (internal/rpc -> pkg/rpc).
    • Mitigation: split into separate PRs (behavior first, path rename second).

11) Acceptance Criteria

  1. No worker path writes to or depends on failedChan.
  2. Failed blocks are retried only through retry queue.
  3. Cross-chain retry contamination is impossible by design.
  4. Manual and retry workers share one queue runtime implementation.
  5. RPC builders are standardized by chain manager and failover abstraction.
  6. Existing production config still boots with deprecation warnings only.
