1) Summary
Refactor RPC infrastructure into a cleaner package layout, and replace the rescanner with a retry worker that reuses the same queue engine as the manual worker, backed by a dedicated retry queue.
2) Why We Need This
Current pain points in main:
- RPC construction and provider wiring are duplicated across chain builder functions in `internal/worker/factory.go`.
- Failover, transport, auth, and chain-specific logic are too tightly coupled in `internal/rpc`.
- Rescanner depends on a shared `failedChan` fan-in/fan-out model.
  - `failedChan` is non-blocking and drops when full (`failedChan full, dropping block event`), which can delay retries.
  - `FailedBlockEvent` includes `Chain`, but the rescanner listener currently consumes from the shared channel without chain filtering, creating cross-chain contamination risk.
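The two `failedChan` problems can be reproduced in a few lines. The sketch below is illustrative only: `FailedBlockEvent` and its `Chain` field are named in this issue, but the `Number` field and the helpers `trySend` and `drainUnfiltered` are hypothetical and not taken from the codebase.

```go
package main

// FailedBlockEvent mirrors the event named in this issue.
// Only Chain is confirmed by the issue text; Number is an assumed field.
type FailedBlockEvent struct {
	Chain  string
	Number uint64
}

// trySend performs a non-blocking send. It reports false when the buffer
// is full, which is the case where the event is dropped.
func trySend(ch chan<- FailedBlockEvent, ev FailedBlockEvent) bool {
	select {
	case ch <- ev:
		return true
	default:
		// Buffer full: the event is lost and the block is not retried
		// until something else rediscovers it.
		return false
	}
}

// drainUnfiltered models a listener that reads from the shared channel
// without checking ev.Chain, so it receives events for every chain.
func drainUnfiltered(ch <-chan FailedBlockEvent) []FailedBlockEvent {
	var out []FailedBlockEvent
	for {
		select {
		case ev := <-ch:
			out = append(out, ev)
		default:
			return out
		}
	}
}
```

With a buffer of size 1, a second `trySend` returns `false` while the first event is still queued, and a listener started for one chain can end up holding an event that belongs to another.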
3) Goals
- Standardize RPC composition (`transport` + `failover` + chain manager).
- Remove channel-driven rescanner path and replace with deterministic queue-driven retry.
- Reuse one queue worker implementation for `manual` and `retry`.
- Preserve operational behavior for regular/catchup/manual/mempool workers.
- Provide safe migration from legacy failed-block storage.
4) Non-Goals
- No change to business event payload schema.
- No change to chain parser semantics.
- No hard cutover requiring immediate config rewrite by operators.
5) Proposed Architecture
5.1 RPC Package Refactor
Target package layout:
- `pkg/rpc/failover`
- `pkg/rpc/transport/httpx`
- `pkg/rpc/transport/jsonrpc`
- `pkg/rpc/{evm,tron,bitcoin,solana,sui,cosmos,aptos,ton}`
- `pkg/rpc/bootstrap`
Key design:
- Each chain package exposes `NewProviderManager(chainName, chainCfg)`.
- `bootstrap` has generic builders:
  - `BuildHTTPFailover(...)`
  - `BuildGRPCFailover(...)`
- `ProviderManager[T]` owns provider selection, retries, blacklisting, and metrics.
- Transport concerns (auth, request/response handling, JSON-RPC batching) are isolated from failover policy.
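To make the separation between failover policy and transport concrete, here is a minimal sketch of what a generic `ProviderManager[T]` could look like. Everything besides the `ProviderManager[T]` name is an assumption for illustration: the `Provider` type, the `NewManager` constructor, the `Do` method, and the time-based blacklist are hypothetical and do not describe the `refactor-rpc` branch. The sketch is not safe for concurrent use and omits metrics.

```go
package main

import (
	"errors"
	"time"
)

// ErrNoHealthyProvider is returned when every provider is blacklisted.
var ErrNoHealthyProvider = errors.New("no healthy provider available")

// Provider pairs a name with a chain-specific client of type T.
// T is whatever the transport layer produces (HTTP client, gRPC stub, ...).
type Provider[T any] struct {
	Name             string
	Client           T
	blacklistedUntil time.Time
}

// ProviderManager owns selection and blacklisting. It knows nothing about
// auth, batching, or wire format; those stay inside T.
type ProviderManager[T any] struct {
	providers []*Provider[T]
	cooldown  time.Duration
}

// NewManager is a hypothetical constructor; cooldown is how long a failing
// provider is skipped before it is tried again.
func NewManager[T any](cooldown time.Duration, providers ...*Provider[T]) *ProviderManager[T] {
	return &ProviderManager[T]{providers: providers, cooldown: cooldown}
}

// Do runs call against the first non-blacklisted provider that succeeds and
// returns that provider's name. A provider that fails is blacklisted for the
// cooldown period and the next one is tried.
func (m *ProviderManager[T]) Do(call func(client T) error) (string, error) {
	lastErr := ErrNoHealthyProvider
	for _, p := range m.providers {
		if time.Now().Before(p.blacklistedUntil) {
			continue // still cooling down
		}
		if err := call(p.Client); err != nil {
			p.blacklistedUntil = time.Now().Add(m.cooldown)
			lastErr = err
			continue
		}
		return p.Name, nil
	}
	return "", lastErr
}
```

Under this shape, a chain package would only need to supply its client type and a list of endpoints, while retry and blacklist behavior would be defined once.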
Notes:
- Branch `refactor-rpc` currently uses `internal/rpc/*` with this architecture already applied.
- We can land behavior refactor first under `internal/rpc`, then move to `pkg/rpc` in a follow-up rename if we want a lower-risk rollout.
5.2 Worker Refactor: Rescanner -> Retry Queue
Replace rescanner with a queue worker model:
- Keep one generic queue worker runtime: `internal/worker/queue/worker.go`
- `ManualWorker` and `RetryWorker` both wrap queue worker core.
- Two queue stores:
  - Manual queue: `missing_blocks:*`
  - Retry queue: `retry_blocks:*`
- Retry queue uses small range granularity (`MaxBlocksPerRange = 1`) for per-block retries.
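The effect of `MaxBlocksPerRange` on the two queues can be sketched with an in-memory stand-in for the Redis-backed store. The `missing_blocks` and `retry_blocks` prefixes, `AddRange`, and `MaxBlocksPerRange` come from this issue; the `RangeQueue` type, the `prefix:network` key format, and the chunking logic are assumptions, and range merging is left out.

```go
package main

import "fmt"

// BlockRange is an inclusive range of block numbers.
type BlockRange struct{ From, To uint64 }

// RangeQueue is a hypothetical in-memory stand-in for a queue store.
type RangeQueue struct {
	Prefix            string // e.g. "missing_blocks" or "retry_blocks"
	MaxBlocksPerRange uint64 // upper bound on blocks per stored range
	ranges            map[string][]BlockRange
}

func NewRangeQueue(prefix string, maxBlocks uint64) *RangeQueue {
	return &RangeQueue{
		Prefix:            prefix,
		MaxBlocksPerRange: maxBlocks,
		ranges:            make(map[string][]BlockRange),
	}
}

// Key builds the per-network key. The "prefix:network" layout is assumed.
func (q *RangeQueue) Key(network string) string {
	return fmt.Sprintf("%s:%s", q.Prefix, network)
}

// AddRange splits [from, to] into chunks of at most MaxBlocksPerRange blocks.
func (q *RangeQueue) AddRange(network string, from, to uint64) error {
	if from > to {
		return fmt.Errorf("invalid range %d-%d", from, to)
	}
	if q.MaxBlocksPerRange == 0 {
		return fmt.Errorf("MaxBlocksPerRange must be positive")
	}
	key := q.Key(network)
	for start := from; ; {
		end := start + q.MaxBlocksPerRange - 1
		if end > to || end < start { // clamp, and guard against overflow
			end = to
		}
		q.ranges[key] = append(q.ranges[key], BlockRange{From: start, To: end})
		if end == to {
			return nil
		}
		start = end + 1
	}
}

// Ranges returns the stored ranges for a network.
func (q *RangeQueue) Ranges(network string) []BlockRange {
	return q.ranges[q.Key(network)]
}
```

With `MaxBlocksPerRange = 1`, every stored range covers exactly one block, so a single stubborn block cannot hold back progress on its neighbors.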
Block failure flow:
- Regular/Catchup/Manual processing hits block error.
- Runtime processor enqueues `result.Number` to retry queue (`AddRange(network, n, n)`).
- Retry worker consumes queue and reprocesses.
- On success, queue progress is updated and range removed.
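The flow above can be sketched end to end with an in-memory queue. `AddRange(network, n, n)` and `result.Number` are taken from this issue; `BlockResult`, `HandleResult`, `MemQueue`, and `Drain` are hypothetical names used only to show the shape of the flow.

```go
package main

import "errors"

// ErrBlockFailed is a placeholder error for the example.
var ErrBlockFailed = errors.New("block processing failed")

// RetryQueue is the only thing the processor needs to know about retries.
type RetryQueue interface {
	AddRange(network string, from, to uint64) error
}

// BlockResult is an assumed shape for a processing result.
type BlockResult struct {
	Number uint64
	Err    error
}

// HandleResult enqueues a failed block as a single-block range.
// Successful blocks are left alone.
func HandleResult(q RetryQueue, network string, result BlockResult) error {
	if result.Err == nil {
		return nil
	}
	return q.AddRange(network, result.Number, result.Number)
}

// MemQueue is an in-memory stand-in keyed by network, so entries for one
// chain are never visible to a worker for another chain.
type MemQueue struct{ pending map[string][]uint64 }

func NewMemQueue() *MemQueue { return &MemQueue{pending: make(map[string][]uint64)} }

func (q *MemQueue) AddRange(network string, from, to uint64) error {
	if from > to {
		return errors.New("invalid range")
	}
	for n := from; ; n++ {
		q.pending[network] = append(q.pending[network], n)
		if n == to {
			return nil
		}
	}
}

// Drain reprocesses every queued block for one network. Blocks that succeed
// are removed; blocks that fail again stay queued. It returns the number of
// blocks still pending.
func (q *MemQueue) Drain(network string, process func(n uint64) error) int {
	var failed []uint64
	for _, n := range q.pending[network] {
		if err := process(n); err != nil {
			failed = append(failed, n)
		}
	}
	q.pending[network] = failed
	return len(failed)
}
```

Because the queue is keyed by network, a retry worker for one chain cannot pick up another chain's blocks, which is the property the shared channel did not provide.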
Result:
- Remove `failedChan` and `FailedBlockEvent` from worker runtime path.
- Eliminate shared channel race/cross-chain misrouting.
- Retry behavior becomes observable and deterministic through Redis queue state.
6) Data Migration and Compatibility
6.1 Legacy Failed Block Migration
At manager bootstrap (per chain):
- Read legacy failed blocks from `blockStore.GetFailedBlocks(internalCode)`.
- Enqueue each block into retry queue.
- Remove migrated entries from legacy failed block store.
- Log migrated count.
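A possible shape for the bootstrap migration is sketched below. Only `GetFailedBlocks(internalCode)` and `AddRange` are named in this issue; their exact signatures, the `RemoveFailedBlocks` method, and the in-memory `MemLegacyStore` and `MemRetryQueue` types are assumptions made so the example is self-contained.

```go
package main

import (
	"fmt"
	"log"
)

// LegacyFailedStore is an assumed view of the legacy failed-block store.
type LegacyFailedStore interface {
	GetFailedBlocks(internalCode string) ([]uint64, error)
	RemoveFailedBlocks(internalCode string, blocks []uint64) error // hypothetical
}

// RetryQueue is the target queue.
type RetryQueue interface {
	AddRange(network string, from, to uint64) error
}

// MigrateFailedBlocks moves one chain's legacy failed blocks into the retry
// queue. Only blocks that were enqueued successfully are removed from the
// legacy store, so a partial failure can be resumed on the next startup.
func MigrateFailedBlocks(legacy LegacyFailedStore, retry RetryQueue, internalCode string) (int, error) {
	blocks, err := legacy.GetFailedBlocks(internalCode)
	if err != nil {
		return 0, fmt.Errorf("read legacy failed blocks for %s: %w", internalCode, err)
	}
	migrated := make([]uint64, 0, len(blocks))
	var enqueueErr error
	for _, n := range blocks {
		if enqueueErr = retry.AddRange(internalCode, n, n); enqueueErr != nil {
			break
		}
		migrated = append(migrated, n)
	}
	if len(migrated) > 0 {
		if err := legacy.RemoveFailedBlocks(internalCode, migrated); err != nil {
			return len(migrated), fmt.Errorf("remove migrated blocks for %s: %w", internalCode, err)
		}
	}
	log.Printf("migrated %d legacy failed blocks for chain %s", len(migrated), internalCode)
	return len(migrated), enqueueErr
}

// MemLegacyStore and MemRetryQueue are in-memory stand-ins for the example.
type MemLegacyStore map[string][]uint64

func (s MemLegacyStore) GetFailedBlocks(code string) ([]uint64, error) {
	return append([]uint64(nil), s[code]...), nil
}

func (s MemLegacyStore) RemoveFailedBlocks(code string, blocks []uint64) error {
	drop := make(map[uint64]bool, len(blocks))
	for _, n := range blocks {
		drop[n] = true
	}
	var kept []uint64
	for _, n := range s[code] {
		if !drop[n] {
			kept = append(kept, n)
		}
	}
	s[code] = kept
	return nil
}

type MemRetryQueue map[string][]uint64

func (q MemRetryQueue) AddRange(network string, from, to uint64) error {
	if from > to {
		return fmt.Errorf("invalid range %d-%d", from, to)
	}
	for n := from; ; n++ {
		q[network] = append(q[network], n)
		if n == to {
			return nil
		}
	}
}
```

If removal from the legacy store fails after enqueueing, the same blocks would be enqueued again on the next startup, so the real queue's add/merge semantics would need to tolerate duplicates.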
6.2 Config Compatibility
Short-term compatibility policy:
- Keep parsing `services.worker.rescanner.enabled`.
- Mark it deprecated and ignore at runtime.
- Add `services.worker.retry.enabled` (recommended) or keep retry always-on (if queue idle cost is acceptable).
Recommended:
- Introduce explicit `retry` config flag with default `true`.
- Keep `rescanner` key accepted for at least 2 release cycles with warning logs.
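The recommended policy could be resolved at startup along these lines. The config keys are the ones named above; the `WorkerConfig` struct, its pointer fields (used to tell "unset" apart from `false`), and `ResolveRetry` are hypothetical and not tied to the project's actual config loader.

```go
package main

// WorkerConfig holds the two flags after parsing. A nil pointer means the
// key was absent from the operator's config file.
type WorkerConfig struct {
	RescannerEnabled *bool // services.worker.rescanner.enabled (deprecated)
	RetryEnabled     *bool // services.worker.retry.enabled
}

// ResolveRetry returns whether the retry worker should run, plus any
// deprecation warnings to log. The rescanner key is accepted but has no
// effect on the result.
func ResolveRetry(cfg WorkerConfig) (enabled bool, warnings []string) {
	if cfg.RescannerEnabled != nil {
		warnings = append(warnings,
			"services.worker.rescanner.enabled is deprecated and ignored; use services.worker.retry.enabled")
	}
	enabled = true // default when the retry key is absent
	if cfg.RetryEnabled != nil {
		enabled = *cfg.RetryEnabled
	}
	return enabled, warnings
}
```

An existing config that only sets the rescanner key would still boot, get retry enabled by default, and produce one warning.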
6.3 CLI / UX
- Replace user-facing wording from `rescanner` to `automatic retry worker`.
- No extra flag required in phase 1 if retry is default-enabled.
7) Rollout Plan
Phase 0: Preparation
- Add queue store abstraction for retry (`pkg/store/blockrangestore`).
- Add metrics names for retry queue depth/throughput.
Phase 1: RPC Layer Refactor (No Behavior Change)
- Introduce provider-manager builders per chain.
- Keep existing indexer behavior untouched.
- Add unit tests for failover manager and transport helpers.
Phase 2: Queue Runtime Unification
- Introduce runtime core + queue worker.
- Switch manual worker to queue worker implementation.
- Add retry worker using same queue worker engine.
Phase 3: Rescanner Deprecation
- Remove `failedChan` writes and listeners from worker flow.
- Migrate legacy failed blocks at startup.
- Keep config backward compatibility warning for `rescanner`.
Phase 4: Cleanup
- Remove dead rescanner code paths.
- Finalize docs and runbook updates.
- Optionally rename `internal/rpc` to `pkg/rpc` if not done yet.
8) Testing Strategy
- Unit tests:
  - Queue add/merge/claim/remove semantics.
  - Retry enqueue on block error in processor.
  - Retry worker range progress + removal behavior.
  - Failover provider switching and blacklist recovery.
- Integration tests:
  - Multi-chain run with injected RPC failures to verify no cross-chain retry pollution.
  - Migration test from legacy failed blocks into retry queue.
- Regression tests:
  - Event emission path unchanged for successful blocks.
  - Catchup/manual behavior unchanged.
9) Observability
Add or expose:
- Retry queue depth per chain.
- Retry processed/success/failure counts.
- Retry enqueue rate by worker mode source.
- Failover metrics snapshot per chain/provider.
- Warning count for deprecated rescanner config usage.
10) Risks and Mitigations
- Risk: Duplicate retries if legacy store and retry queue are both active.
  - Mitigation: one-time migration then delete legacy entries.
- Risk: Redis queue growth under persistent RPC outage.
  - Mitigation: queue depth alerting + provider failover tuning + retry backoff.
- Risk: Behavior drift during package move (`internal/rpc` -> `pkg/rpc`).
  - Mitigation: split into separate PRs (behavior first, path rename second).
11) Acceptance Criteria
- No worker path writes to or depends on `failedChan`.
- Failed blocks are retried only through retry queue.
- Cross-chain retry contamination is impossible by design.
- Manual and retry workers share one queue runtime implementation.
- RPC builders are standardized by chain manager and failover abstraction.
- Existing production config still boots with deprecation warnings only.