
feat: add missed block gap fill worker #60

Merged
tac0turtle merged 5 commits into main from pthmas/missed-blocks-refetch-worker
Apr 24, 2026

Conversation

@pthmas
Collaborator

@pthmas pthmas commented Apr 14, 2026

Summary

  • add a background gap-fill worker that retries missed blocks from failed_blocks with backoff
  • make recovered block writes clear failed_blocks rows atomically and refresh the missing-blocks metric
  • add focused integration coverage for recovery, retry backoff, and missing-block metric updates

Summary by CodeRabbit

  • New Features

    • Background gap-fill worker launched at startup to automatically retry and re-ingest missing blocks with rate-limited RPC fetches and exponential backoff.
  • Tests

    • New integration tests covering recovery, RPC errors, backoff behavior, and metrics; test RPC mocking added via Wiremock.
  • Bug Fixes

    • Missing-block metrics now refresh when recoveries occur.

Closes #57

@coderabbitai

coderabbitai Bot commented Apr 14, 2026

📝 Walkthrough

Adds a new GapFillWorker background service that scans and retries rows in failed_blocks with exponential backoff, rate-limited RPC fetches, partition checks, and batch writes; refactors indexer internals for reuse; registers and spawns the worker at startup; adds integration tests and a dev wiremock dependency.
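A minimal sketch of that control flow, purely for orientation: GapFillWorker, run, and process_batch are names from the PR, while the constants, fields, and the (attempted, recovered) return shape shown here are illustrative assumptions, not the PR's exact code.

```rust
use std::time::Duration;
use anyhow::Result;

// Illustrative value; the PR's actual constant may differ.
const IDLE_SLEEP: Duration = Duration::from_secs(300);

pub struct GapFillWorker {
    pool: sqlx::PgPool,
    // RPC client, metrics handle, and broadcast sender omitted here.
}

impl GapFillWorker {
    /// Claim and retry one batch of failed_blocks rows per cycle; sleep only
    /// when nothing was eligible, not when every retry failed.
    pub async fn run(&self) -> Result<()> {
        loop {
            let (attempted, recovered) = self.process_batch().await?;
            tracing::info!(attempted, recovered, "gap-fill cycle finished");
            if attempted == 0 {
                tokio::time::sleep(IDLE_SLEEP).await;
            }
        }
    }

    /// Placeholder for the real logic: claim rows (CTE + FOR UPDATE SKIP
    /// LOCKED), rate-limit RPC fetches, rebuild a BlockBatch, write it and
    /// clear the failed_blocks row, or bump retry_count on failure.
    async fn process_batch(&self) -> Result<(usize, usize)> {
        let _ = &self.pool;
        Ok((0, 0))
    }
}
```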

Changes

  • Dependencies (backend/Cargo.toml, backend/crates/atlas-server/Cargo.toml):
    Added wiremock = "0.6" as a workspace dependency and as a dev-dependency for the atlas-server crate.
  • GapFillWorker Implementation (backend/crates/atlas-server/src/indexer/gap_fill_worker.rs):
    New public GapFillWorker with new, run, and process_batch that claims eligible failed_blocks (CTE + FOR UPDATE SKIP LOCKED), updates last_failed_at, rate-limits RPC fetches, rebuilds BlockBatch, verifies partitions, writes/clears failed rows or increments retries, updates the missing-blocks metric, and broadcasts recoveries.
  • Indexer Refactor & Helpers (backend/crates/atlas-server/src/indexer/indexer.rs):
    Introduced write_batch_internal with pub(crate) wrappers write_batch and write_batch_and_clear_failed_block; extracted the ensure_partitions_exist free function; changed the partition sentinel to UNKNOWN_MAX_PARTITION and adjusted the fast path; made helper methods pub(crate) and added tests for sentinel behavior.
  • Module & Startup Integration (backend/crates/atlas-server/src/indexer/mod.rs, backend/crates/atlas-server/src/main.rs):
    Registered and re-exported gap_fill_worker; main::run() now constructs and spawns the GapFillWorker in a retry-wrapped Tokio task using a cloned broadcast sender and metrics.
  • Integration Tests (backend/crates/atlas-server/tests/integration/gap_fill.rs, backend/crates/atlas-server/tests/integration/main.rs):
    Added integration tests using wiremock to mock JSON-RPC and a global mutex to serialize DB-affecting tests; they cover successful recovery, metric refresh, an RPC error incrementing retry_count, and backoff/no-op behavior when last_failed_at is recent.

Sequence Diagram

sequenceDiagram
    participant GW as GapFillWorker
    participant DB as Database (PgPool)
    participant RPC as RPC Server
    participant IDX as Indexer
    participant MET as Metrics
    participant BC as Broadcast Channel

    loop continuous run()
        GW->>DB: SELECT eligible failed_blocks FOR UPDATE SKIP LOCKED (backoff filter, LIMIT)
        alt rows claimed
            GW->>DB: UPDATE last_failed_at = NOW() for claimed rows
            GW->>RPC: Fetch block data (rate-limited)
            alt RPC success with block
                GW->>IDX: build BlockBatch & call write_batch_and_clear_failed_block(batch, block_number)
                IDX->>DB: ensure_partitions_exist(pool, block_number)
                IDX->>DB: write batch (COPY / tx) and DELETE failed_blocks row
                GW->>DB: SELECT COUNT(*) FROM failed_blocks
                GW->>MET: update missing_blocks gauge
                GW->>BC: broadcast recovery event
            else RPC missing / error or write failure
                GW->>DB: increment retry_count, set next last_failed_at (exponential backoff)
            end
        else no eligible rows
            GW->>GW: sleep IDLE_SLEEP
        end
    end
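The backoff filter in the claim query (quoted in the review comments below) waits roughly 2 * 2^min(retry_count, 10) minutes between attempts; a small Rust equivalent, only for illustration:

```rust
use std::time::Duration;

/// Mirrors the SQL filter
/// `last_failed_at < NOW() - make_interval(mins => (2 * power(2, LEAST(retry_count, 10)))::int)`.
fn backoff_interval(retry_count: u32) -> Duration {
    let minutes = 2u64 * 2u64.pow(retry_count.min(10));
    Duration::from_secs(minutes * 60)
}

// retry_count 0 → 2 min, 3 → 16 min, 10 and above → 2048 min (~34 h).
```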

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • Use base fee fallback in gas price chart #44: Changes to collect_block/BlockBatch handling that overlap with visibility and collection adjustments used by the worker.
  • fix: stale ERC20 total supply #45: Modifies indexer batch-writing/collection logic that the GapFillWorker reuses.
  • Issue #57: feat: gap-fill sync mode — design and behavior align closely with this worker’s goals and control flow.

Suggested reviewers

  • tac0turtle

Poem

🐰 I hop through gaps that night forgot,
I fetch the blocks and stitch each plot,
With careful waits and retries mild,
I mend the ledger — small and wild,
Metrics hum — the chain is whole, huzzah! 🎉

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 71.05%, below the required 80.00% threshold. Resolution: write docstrings for the functions that are missing them.
  • Linked Issues check (❓ Inconclusive): The PR implements retry of failed_blocks with exponential backoff (#57), refreshes the missing-blocks metric, and clears failed_blocks atomically on successful writes, fulfilling key objectives, but it omits gap detection (window-function scan), adaptive throttling, multi-worker concurrency, and contiguous-watermark tracking. Resolution: clarify whether this PR is an incremental deliverable or whether the remaining scope will be addressed separately.
✅ Passed checks (3 passed)
  • Description Check (✅ Passed): Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check (✅ Passed): The title 'feat: add missed block gap fill worker' clearly and concisely describes the main change, introducing a background worker that retries missed blocks, which aligns with the PR's core objective.
  • Out of Scope Changes check (✅ Passed): All changes directly support the gap-fill worker implementation: dependency additions (wiremock), refactored batch-write methods for atomic failed_blocks clearing, partition initialization refactoring, the worker implementation, and integration tests.




@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (2)
backend/crates/atlas-server/src/indexer/gap_fill_worker.rs (2)

103-111: Consider persisting copy_client across batches for connection reuse.

Currently, connect_copy_client is called on every process_batch() invocation, establishing a new TCP connection each time. For a worker that runs infrequently (5-minute idle sleep), this overhead is acceptable, but persisting the connection in the struct would reduce latency and resource usage.

This is a minor optimization that could be addressed in a follow-up if gap-fill volume increases.
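A sketch of that reuse pattern, with stub types standing in for the crate's real CopyClient and Indexer (the names come from the comment; the signatures here are assumptions):

```rust
use anyhow::Result;

// Stubs for illustration only.
struct CopyClient;
struct Indexer;
impl Indexer {
    async fn connect_copy_client() -> Result<CopyClient> {
        Ok(CopyClient)
    }
}

struct GapFillWorker {
    copy_client: Option<CopyClient>,
}

impl GapFillWorker {
    async fn process_batch(&mut self) -> Result<()> {
        // Connect once and keep the client for later cycles; on a connection
        // error, set self.copy_client = None so the next cycle reconnects.
        if self.copy_client.is_none() {
            self.copy_client = Some(Indexer::connect_copy_client().await?);
        }
        let _copy_client = self.copy_client.as_ref().expect("initialized above");
        // ... batch writes using the cached client ...
        Ok(())
    }
}
```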

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/indexer/gap_fill_worker.rs` around lines 103
- 111, Persist the copy client instead of creating it every process_batch call:
add a field (e.g., copy_client: Option<CopyClient> or Arc<CopyClient>) to the
worker struct used in gap_fill_worker.rs, initialize it lazily by calling
Indexer::connect_copy_client(...) once (e.g., in the worker constructor or on
first process_batch run) and reuse that instance in subsequent process_batch
invocations, and handle reconnects by detecting connection errors from the
existing copy_client and re-calling Indexer::connect_copy_client to replace it;
reference the existing connect_copy_client function and the local variable
copy_client in the diff to locate where to change.

177-180: Consider using a more efficient query for frequently accessed metric.

SELECT COUNT(*) FROM failed_blocks is O(n), but since failed_blocks is expected to be small (only unrecoverable blocks), this is acceptable. If the table grows large, consider using pg_class.reltuples as done elsewhere in the codebase.
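A sketch of the estimate-first variant, assuming sqlx; the helper name and the 10,000-row threshold are illustrative, not the codebase's actual values.

```rust
use sqlx::PgPool;

/// Prefer the planner's estimate for large tables; fall back to an exact
/// COUNT(*) when the table is small or the estimate is unavailable.
async fn missing_block_count(pool: &PgPool) -> Result<i64, sqlx::Error> {
    let estimate: Option<f32> = sqlx::query_scalar(
        "SELECT reltuples FROM pg_class WHERE relname = 'failed_blocks'",
    )
    .fetch_optional(pool)
    .await?;

    match estimate {
        Some(est) if est > 10_000.0 => Ok(est as i64),
        _ => {
            sqlx::query_scalar("SELECT COUNT(*) FROM failed_blocks")
                .fetch_one(pool)
                .await
        }
    }
}
```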

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/indexer/gap_fill_worker.rs` around lines 177
- 180, The current code calls get_missing_block_count() after successful
gap-fill which likely runs a COUNT(*) on failed_blocks (used by
set_indexer_missing_blocks); replace or guard that expensive COUNT query with a
cheaper estimate using pg_class.reltuples like elsewhere in the codebase when
table size is large: update the get_missing_block_count() implementation (or add
a new helper) to prefer querying pg_class.reltuples for relname =
'failed_blocks' and fall back to SELECT COUNT(*) only for small tables or when a
precise value is required, and ensure the call sites (including where succeeded
> 0 and set_indexer_missing_blocks is invoked) use the updated helper.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: e20ebd0e-64ce-40b7-a000-9cd68f12f743

📥 Commits

Reviewing files that changed from the base of the PR and between e1b3e31 and c6d38a5.

📒 Files selected for processing (8)
  • backend/Cargo.toml
  • backend/crates/atlas-server/Cargo.toml
  • backend/crates/atlas-server/src/indexer/gap_fill_worker.rs
  • backend/crates/atlas-server/src/indexer/indexer.rs
  • backend/crates/atlas-server/src/indexer/mod.rs
  • backend/crates/atlas-server/src/main.rs
  • backend/crates/atlas-server/tests/integration/gap_fill.rs
  • backend/crates/atlas-server/tests/integration/main.rs


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/crates/atlas-server/src/main.rs (1)

388-406: ⚠️ Potential issue | 🟠 Major

Avoid starting gap-fill while reindex truncation can still be running.

Indexer::run() performs the reindex truncate after the indexer task starts, but gap-fill is spawned immediately afterward. With --reindex, stale failed_blocks rows can be selected and written concurrently with the truncate, reintroducing old data or failing on table locks. Gate gap-fill during reindex startup, or add a startup barrier so it begins only after the reset completes.
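One possible shape for that barrier, using tokio::sync::oneshot; the function and channel names are illustrative, and `gap_fill` stands in for run_with_retry(|| gap_fill_worker.run()).

```rust
use tokio::sync::oneshot;

/// Spawn the gap-fill task but let it start only after the indexer reports
/// that any --reindex truncation has finished.
fn spawn_gap_fill_gated<F>(gap_fill: F) -> oneshot::Sender<()>
where
    F: std::future::Future<Output = ()> + Send + 'static,
{
    let (reindex_done_tx, reindex_done_rx) = oneshot::channel::<()>();
    tokio::spawn(async move {
        // Indexer::run() would call reindex_done_tx.send(()) right after the
        // truncate completes; if the sender is dropped, skip gap-fill entirely.
        if reindex_done_rx.await.is_ok() {
            gap_fill.await;
        }
    });
    reindex_done_tx
}
```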

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/main.rs` around lines 388 - 406, The gap-fill
worker is being spawned immediately while Indexer::run() may still be performing
the reindex truncate, causing races with stale failed_blocks; fix by introducing
a startup barrier (e.g., a tokio::sync::oneshot or watch channel) so gap-fill
only starts after the indexer signals the reindex/truncate is complete: create a
oneshot sender/receiver before constructing GapFillWorker (referencing
GapFillWorker::new and gap_fill_worker.run()), pass the receiver into the
gap-fill worker (or hold off spawning it), and modify Indexer::run() to
send/notify on the sender right after the truncate completes; then spawn
tokio::spawn(async move { run_with_retry(|| gap_fill_worker.run()).await }) only
after receiving the readiness signal (or after Indexer::run() has signaled).
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/crates/atlas-server/src/indexer/gap_fill_worker.rs`:
- Around line 79-87: The run() loop treats a zero recovered count as idle
because process_batch() currently returns only successful recoveries; change
process_batch() to return both attempted/selected count and recovered count
(e.g., -> Result<(usize /*attempted*/, usize /*recovered*/)> or a small struct)
and update run() to use the attempted value to decide sleeping (sleep only when
attempted == 0). Update calls and uses of process_batch() (including the other
loop that also calls process_batch()) to unpack the two values and log both
attempted and recovered counts, keeping the IDLE_SLEEP behavior tied to no
attempts rather than no recoveries.
- Around line 40-44: The SELECT_READY_SQL currently reads rows without claiming
them and orders ASC; change the selection to atomically claim rows and pick
newest gaps first by: open a transaction, run a SELECT block_number FROM
failed_blocks WHERE last_failed_at < ... ORDER BY last_failed_at DESC LIMIT $1
FOR UPDATE SKIP LOCKED (use the SELECT_READY_SQL symbol), then immediately
UPDATE those block_number(s) to mark them claimed (e.g., set last_failed_at =
NOW() or a dedicated claimed/claimed_at column and increment retry_count) within
the same transaction so other workers skip them, and use the UPDATE ...
RETURNING block_number for processing; ensure the logic that uses
SELECT_READY_SQL and any subsequent update/processing code handles the
transaction and returned rows atomically.
- Around line 191-203: The increment_retry function currently swallows DB update
errors; change increment_retry to return Result<(), sqlx::Error> (or a crate
error type) and return Err(e) when the sqlx::query().execute(...) fails (still
logging with tracing::warn! before returning), so callers can react to DB
failures; update all call sites (e.g., places that currently do
self.increment_retry(block_number).await;) to use
self.increment_retry(block_number).await?; and adjust the signatures of
run_with_retry (and any upstream async functions) to propagate the error (return
Result) so a failed UPDATE to the failed_blocks retry_count is not silently
ignored.

In `@backend/crates/atlas-server/src/indexer/indexer.rs`:
- Around line 1053-1069: The SELECT MAX(...) aggregate can return a NULL row, so
change the query result type from Option<(i64,)> to Option<(Option<i64>,)>
(i.e., make the column nullable) and update the match to handle Some((None,)) as
"no partitions" and Some((Some(max),)) to set current_max.store(max as u64,
Ordering::Relaxed) and perform the existing partition_num comparison; keep the
None => 0 fallback logic by treating NULL the same as None (or use .and_then()
to flatten Option<(Option<i64>,)> into Option<i64> before matching). Target the
code around the sqlx::query_as(...) call and the match that uses current_max,
partition_num and pool in indexer.rs.

In `@backend/crates/atlas-server/tests/integration/gap_fill.rs`:
- Around line 11-13: The current module-local SERIALIZER mutex and the use of
.ok() hide failures and don't prevent other integration tests from losing rows
when reset_failed_blocks() truncates failed_blocks; change to a
cross-suite/global locking strategy (e.g., a process-wide static lock or a DB
advisory lock) so cleanup is serialized across all test modules, and remove
swallowing of errors by replacing .ok() with fail-fast handling (use expect with
descriptive message or propagate error with ? in reset_failed_blocks() and any
callers); update references to SERIALIZER and reset_failed_blocks() to use the
new global/advisory lock and to surface errors immediately.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 80e3d7c3-a5d0-4405-b093-272818379c04

📥 Commits

Reviewing files that changed from the base of the PR and between c6d38a5 and 098f3c3.

📒 Files selected for processing (6)
  • backend/Cargo.toml
  • backend/crates/atlas-server/Cargo.toml
  • backend/crates/atlas-server/src/indexer/gap_fill_worker.rs
  • backend/crates/atlas-server/src/indexer/indexer.rs
  • backend/crates/atlas-server/src/main.rs
  • backend/crates/atlas-server/tests/integration/gap_fill.rs
✅ Files skipped from review due to trivial changes (2)
  • backend/Cargo.toml
  • backend/crates/atlas-server/Cargo.toml

Comment thread backend/crates/atlas-server/src/indexer/gap_fill_worker.rs Outdated
Comment thread backend/crates/atlas-server/src/indexer/gap_fill_worker.rs
Comment thread backend/crates/atlas-server/src/indexer/gap_fill_worker.rs Outdated
Comment thread backend/crates/atlas-server/src/indexer/indexer.rs Outdated
Comment on lines +11 to +13
/// Serializes gap-fill tests: process_batch queries all eligible failed_blocks,
/// so tests must not run concurrently or they'll pick up each other's rows.
static SERIALIZER: LazyLock<Mutex<()>> = LazyLock::new(|| Mutex::new(()));

⚠️ Potential issue | 🟡 Minor

Make the destructive cleanup globally serialized and fail-fast.

SERIALIZER only protects this module, but reset_failed_blocks() deletes the entire failed_blocks table. Other integration tests can still run concurrently and lose rows; .ok() also hides cleanup failures. Prefer a shared cross-suite lock/advisory lock strategy, and use expect/? for these deletes.

Also applies to: 69-79
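A sketch of the advisory-lock approach, assuming sqlx; the lock key is an arbitrary constant that every suite touching failed_blocks would need to share.

```rust
use sqlx::{pool::PoolConnection, PgPool, Postgres};

/// Serialize destructive failed_blocks cleanup across all integration suites.
async fn lock_failed_blocks(pool: &PgPool) -> PoolConnection<Postgres> {
    let mut conn = pool.acquire().await.expect("acquire connection");
    sqlx::query("SELECT pg_advisory_lock($1)")
        .bind(0xA71A5_i64)
        .execute(&mut *conn)
        .await
        .expect("take advisory lock");
    // Keep this connection alive for the whole test; release the lock on the
    // same connection afterwards with `SELECT pg_advisory_unlock($1)`.
    conn
}
```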

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/tests/integration/gap_fill.rs` around lines 11 -
13, The current module-local SERIALIZER mutex and the use of .ok() hide failures
and don't prevent other integration tests from losing rows when
reset_failed_blocks() truncates failed_blocks; change to a cross-suite/global
locking strategy (e.g., a process-wide static lock or a DB advisory lock) so
cleanup is serialized across all test modules, and remove swallowing of errors
by replacing .ok() with fail-fast handling (use expect with descriptive message
or propagate error with ? in reset_failed_blocks() and any callers); update
references to SERIALIZER and reset_failed_blocks() to use the new
global/advisory lock and to surface errors immediately.

- Use FOR UPDATE SKIP LOCKED CTE to atomically claim failed_blocks rows,
  preventing double-processing by concurrent workers; order DESC
- process_batch returns (attempted, succeeded); run() sleeps only when
  attempted == 0 (no eligible work), not when all retries fail
- increment_retry returns Result<()> and propagates errors with ?;
  anyhow::ensure! guards rows_affected == 1
- Fix NULL MAX aggregate bug in ensure_partitions_exist: change
  Option<(i64,)> to Option<(Option<i64>,)> so an empty pg_class result
  decodes correctly instead of panicking on NULL
- Update integration test cleanup to use .expect() instead of .ok()
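For the nullable-aggregate fix in the fourth bullet, the decode shape looks roughly like this (assuming sqlx; the query text and helper name are illustrative):

```rust
use sqlx::PgPool;

/// MAX(...) over zero rows returns one row containing NULL, so the column
/// must be decoded as nullable rather than as a plain i64.
async fn current_max_partition(pool: &PgPool) -> Result<Option<i64>, sqlx::Error> {
    let row: Option<(Option<i64>,)> = sqlx::query_as(
        r"SELECT MAX(substring(relname FROM 'blocks_p(\d+)')::bigint)
          FROM pg_class WHERE relname LIKE 'blocks_p%'",
    )
    .fetch_optional(pool)
    .await?;

    // Treat "no row" and "NULL aggregate" the same way: no partitions yet.
    Ok(row.and_then(|(max,)| max))
}
```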

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
backend/crates/atlas-server/src/indexer/indexer.rs (1)

282-336: ⚠️ Potential issue | 🟠 Major

Don’t advance last_indexed_block after writing a sparse batch.

This batch can skip failed blocks into failed_blocks, but Line 336 still updates indexer_state.last_indexed_block from batch.last_block. That makes the stored watermark non-contiguous and can report/restart past a hole; either keep a separate non-contiguous head or update the persisted contiguous watermark only through the last block before the first failure.
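The rule the comment describes, as a small illustrative helper (names follow the comment, not the PR's code):

```rust
/// Advance the persisted watermark only to the block just before the first
/// failure in the batch; with no failures the whole batch is contiguous.
fn contiguous_watermark(batch_last_block: u64, failed_blocks: &[u64]) -> u64 {
    match failed_blocks.iter().copied().min() {
        Some(first_failed) => first_failed.saturating_sub(1),
        None => batch_last_block,
    }
}

// e.g. contiguous_watermark(120, &[117, 119]) == 116, while
//      contiguous_watermark(120, &[]) == 120.
```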

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/indexer/indexer.rs` around lines 282 - 336,
The code currently advances the persisted watermark
(indexer_state.last_indexed_block) to batch.last_block even when some blocks in
the batch failed (failed_blocks), producing a non-contiguous persisted
watermark; change the logic after write_batch to only advance the persisted
contiguous watermark up to the last block BEFORE the first failure (i.e., if
failed_blocks.is_empty() keep using batch.last_block, otherwise compute
contiguous_last = first_failed_block - 1 and persist that), while still allowing
head/tracking state (what you publish via publish_committed_batch /
committed_blocks) to reflect the non-contiguous head; update the code paths that
set indexer_state.last_indexed_block (and any persisted watermark writes) to use
this contiguous_last value and leave failed_blocks and any in-memory head
variables alone.
♻️ Duplicate comments (1)
backend/crates/atlas-server/tests/integration/gap_fill.rs (1)

11-13: ⚠️ Potential issue | 🟡 Minor

Keep the destructive cleanup isolated across integration suites.

SERIALIZER only protects this module, but reset_failed_blocks() still deletes the entire failed_blocks table. Any other integration test using that table can lose rows while these tests run; use a cross-suite DB advisory lock held for the full test or make the worker testable against scoped rows.

Also applies to: 69-79

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/tests/integration/gap_fill.rs` around lines 11 -
13, The SERIALIZER Mutex only serializes this module but reset_failed_blocks()
still truncates the entire failed_blocks table and can interfere with other
suites; change the test synchronization to use a cross-suite database advisory
lock (e.g., acquire a Postgres advisory_lock before running gap-fill tests and
hold it until they finish) or modify reset_failed_blocks() to operate on scoped
rows (delete only rows belonging to the test's tenant/job id) instead of
deleting the whole table; update references to SERIALIZER and calls to
reset_failed_blocks() so the tests acquire/release the DB advisory lock (or pass
a scoped identifier) to ensure cleanup is isolated across integration suites.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@backend/crates/atlas-server/src/indexer/gap_fill_worker.rs`:
- Around line 45-56: The current SELECT_READY_SQL only retries rows from
failed_blocks and doesn't detect missing block ranges; implement a separate fast
gap-scanner that performs a descending windowed/fast-count scan over the blocks
table to detect contiguous holes and emit gap_start/gap_end ranges
(newest-first) to the same queue that SELECT_READY_SQL feeds; add a new SQL
constant (e.g. SELECT_GAPS_SQL) and integrate it into gap_fill_worker.rs's
scheduling loop so the worker alternates or merges results from SELECT_GAPS_SQL
(gap detection) and SELECT_READY_SQL (failed_rows), and ensure the emitted items
include block range bounds and priority/newest ordering so the queue receives
both failed_rows and newly-discovered ranges for processing.

In `@backend/crates/atlas-server/src/indexer/indexer.rs`:
- Around line 1035-1071: The code treats 0 as both “unknown” and a valid
partition index causing fresh DBs with no blocks_p0 to short-circuit; change the
logic to use an explicit sentinel (e.g. UNKNOWN_MAX_PARTITION) instead of 0 for
uninitialized current_max, update the current_max initializers in Indexer and
GapFillWorker to AtomicU64::new(UNKNOWN_MAX_PARTITION), and adjust
ensure_partitions_exist to treat UNKNOWN_MAX_PARTITION as “need discovery”
(perform the pg_class MAX lookup and initialize current_max accordingly) so
partition 0 can be created; also add a unit test in a #[cfg(test)] mod tests
block in the same file that simulates a fresh DB and verifies partition-0
creation and correct cache initialization.


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9a8af734-9b28-48dc-9e00-903d53d352da

📥 Commits

Reviewing files that changed from the base of the PR and between 098f3c3 and b09212e.

📒 Files selected for processing (3)
  • backend/crates/atlas-server/src/indexer/gap_fill_worker.rs
  • backend/crates/atlas-server/src/indexer/indexer.rs
  • backend/crates/atlas-server/tests/integration/gap_fill.rs

Comment thread backend/crates/atlas-server/src/indexer/gap_fill_worker.rs
Comment thread backend/crates/atlas-server/src/indexer/indexer.rs
…cache from partition 0

Using 0 as the initial value for current_max_partition caused a silent
bug: when block 0 was processed on a fresh DB (no blocks_p0 yet),
partition_num == 0 <= current_max_val == 0 triggered the fast path and
returned Ok(()) without ever creating the partition.

Replace the initializer with u64::MAX (UNKNOWN_MAX_PARTITION), update the
fast-path guard to skip when the sentinel is present, and use the same
sentinel in GapFillWorker. Add unit tests covering the sentinel value and
the no-DB-needed fast path.
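A condensed sketch of that sentinel fast path; UNKNOWN_MAX_PARTITION, the current-max cache, and the guard come from the commit, while the surrounding scaffolding is illustrative.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// "Max partition not discovered yet". A real partition index can be 0, so 0
/// must not double as the unknown marker.
const UNKNOWN_MAX_PARTITION: u64 = u64::MAX;

static CURRENT_MAX_PARTITION: AtomicU64 = AtomicU64::new(UNKNOWN_MAX_PARTITION);

/// Take the no-DB fast path only when the cache is initialized and already
/// covers the requested partition; otherwise fall through to the pg_class
/// lookup and partition creation.
fn fast_path_hit(partition_num: u64) -> bool {
    let current = CURRENT_MAX_PARTITION.load(Ordering::Relaxed);
    current != UNKNOWN_MAX_PARTITION && partition_num <= current
}
```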
@pthmas
Collaborator Author

pthmas commented Apr 23, 2026

@coderabbitai review

@coderabbitai

coderabbitai Bot commented Apr 23, 2026

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.


@coderabbitai coderabbitai Bot left a comment


♻️ Duplicate comments (1)
backend/crates/atlas-server/src/indexer/gap_fill_worker.rs (1)

45-56: ⚠️ Potential issue | 🟠 Major

Add the actual gap scanner and preserve newest-first order after claiming.

SELECT_READY_SQL still only sources failed_blocks, so holes without a failed-row entry remain undiscovered. Also, the candidate ORDER BY block_number DESC is not guaranteed to be preserved by the final UPDATE ... RETURNING; wrap the update in a claimed CTE and select from it with an explicit final order.

Suggested SQL shape for the ordering fix
 const SELECT_READY_SQL: &str = "
     WITH candidates AS (
         SELECT block_number FROM failed_blocks
         WHERE last_failed_at < NOW() - make_interval(mins => (2 * power(2, LEAST(retry_count, 10)))::int)
         ORDER BY block_number DESC
         LIMIT $1
         FOR UPDATE SKIP LOCKED
-    )
-    UPDATE failed_blocks SET last_failed_at = NOW()
-    FROM candidates
-    WHERE failed_blocks.block_number = candidates.block_number
-    RETURNING failed_blocks.block_number";
+    ),
+    claimed AS (
+        UPDATE failed_blocks SET last_failed_at = NOW()
+        FROM candidates
+        WHERE failed_blocks.block_number = candidates.block_number
+        RETURNING failed_blocks.block_number
+    )
+    SELECT block_number FROM claimed
+    ORDER BY block_number DESC";
Note: PostgreSQL does not guarantee row order for UPDATE ... RETURNING even when the source CTE has an ORDER BY, which is why the final SELECT needs its own ORDER BY.
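For the gap scanner itself, a window-function query of roughly this shape could feed the same claim queue; the blocks table name, the block_number column, and the limit are assumptions for illustration.

```rust
// Hypothetical gap-detection query: compare each block with its predecessor
// and emit the hole between them, newest gaps first.
const SELECT_GAPS_SQL: &str = "
    SELECT prev + 1 AS gap_start, block_number - 1 AS gap_end
    FROM (
        SELECT block_number,
               LAG(block_number) OVER (ORDER BY block_number) AS prev
        FROM blocks
    ) t
    WHERE prev IS NOT NULL
      AND block_number - prev > 1
    ORDER BY gap_end DESC
    LIMIT 10";
```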
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/indexer/gap_fill_worker.rs` around lines 45 -
56, SELECT_READY_SQL currently only reads from failed_blocks and its ORDER BY
can be lost by UPDATE ... RETURNING; change it to first produce a candidates set
that includes missing blocks (e.g., scanning the block range/gaps rather than
only failed_rows) and then wrap the UPDATE in a claimed CTE: perform UPDATE ...
FROM candidates ... RETURNING into claimed, then SELECT block_number FROM
claimed ORDER BY block_number DESC to guarantee newest-first. Update the
constant SELECT_READY_SQL to (1) source gaps as well as failed_blocks (the gap
scanner logic) into candidates and (2) use a claimed CTE and final SELECT ...
ORDER BY block_number DESC to preserve ordering when returning claimed blocks.
🧹 Nitpick comments (1)
backend/crates/atlas-server/src/indexer/gap_fill_worker.rs (1)

168-168: Coalesce recovered-block notifications per cycle.

Line 168 broadcasts once per recovered historical block; with BATCH_SIZE = 50, this can wake every SSE subscriber dozens of times in one cycle. Consider sending once after the loop when succeeded > 0, alongside the missing-block metric refresh.

Proposed refactor
-                    let _ = self.block_events_tx.send(());
                     tracing::info!(block = block_num, "gap-fill: block recovered");
                     succeeded += 1;
@@
         if succeeded > 0 {
+            let _ = self.block_events_tx.send(());
             self.metrics
                 .set_indexer_missing_blocks(self.get_missing_block_count().await?);
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@backend/crates/atlas-server/src/indexer/gap_fill_worker.rs` at line 168, The
code currently calls self.block_events_tx.send(()) inside the per-block recovery
loop, causing many SSE wakeups; change gap_fill_worker.rs so
block_events_tx.send(()) is moved out of the inner loop and invoked once per
cycle when at least one block was recovered (e.g., after the loop when succeeded
> 0 and alongside the missing-block metric refresh), ensuring you still handle
the send result (ignore or log the error) and keep BATCH_SIZE behavior
unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9631c8c7-e5fe-4b4c-b7ac-5c5ff0618d76

📥 Commits

Reviewing files that changed from the base of the PR and between b09212e and cb32d53.

📒 Files selected for processing (2)
  • backend/crates/atlas-server/src/indexer/gap_fill_worker.rs
  • backend/crates/atlas-server/src/indexer/indexer.rs

@pthmas pthmas self-assigned this Apr 24, 2026
@tac0turtle tac0turtle merged commit 3b11f28 into main Apr 24, 2026
10 checks passed
@tac0turtle tac0turtle deleted the pthmas/missed-blocks-refetch-worker branch April 24, 2026 09:50

Development

Successfully merging this pull request may close these issues.

feat: gap-fill sync mode

2 participants