fix(sensor): bound fillRange so a Datastore timeout can't hot-loop#84

Merged
minhd-vu merged 1 commit into main from fix/sensor-fillrange-hot-loop on Jun 10, 2026

Conversation

@minhd-vu

Collaborator

Description

When iter.Next() returned a non-Done error (here, a Datastore DeadlineExceeded), fillRange executed continue with no break, backoff, or retry cap. Because ctx is context.Background(), there was also no deadline to abort the loop. A terminal iterator error therefore turned into an unbounded hot loop: fillRange re-issued the same failing query forever, logging one "Failed to get next block" warning per iteration.

On 2026-06-10 this turned a single transient Datastore latency spike on the shared amoy database into an ~8-hour stale-data outage and a ~100k logs/sec storm, in both dev and prod simultaneously (both read the same Datastore, so each instance's query flood kept the DB slow for the other).

  • fillRange now breaks on a persistent iterator error, matching rpc.go and heimdall.go, so the provider goroutine returns to normal polling instead of wedging.
  • The backfill range is now bounded to blockBufferSize blocks behind the head. After a freeze/recovery the head can jump far ahead of prevBlockNumber; querying more blocks than the buffer can hold is wasted work (the oldest are evicted anyway). The 512-block buffer cap is now a named constant used in both places, and the clamp logs how many blocks were skipped so gaps stay visible.
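The clamp arithmetic might look roughly like the sketch below. It assumes the backfill covers the inclusive range from prevBlockNumber+1 to the head; clampBackfillStart and its parameter names are hypothetical, and only the blockBufferSize name and the 512 value come from the description above.

```go
package main

import "fmt"

// blockBufferSize mirrors the 512-block buffer cap named in the description.
const blockBufferSize = 512

// clampBackfillStart is a hypothetical helper. Given the last block already
// seen (prev) and the current head, it returns the first block to backfill
// and how many older blocks were skipped to stay within the buffer.
func clampBackfillStart(prev, head uint64) (start, skipped uint64) {
	start = prev + 1
	if head < start {
		return start, 0 // nothing to backfill
	}
	// The range [start, head] holds head-start+1 blocks; clamp once that
	// count would exceed blockBufferSize.
	if head-start >= blockBufferSize {
		clamped := head - blockBufferSize + 1
		return clamped, clamped - start
	}
	return start, 0
}

func main() {
	start, skipped := clampBackfillStart(100, 10_000)
	if skipped > 0 {
		// Logging the skipped count keeps the gap visible to operators.
		fmt.Printf("backfill clamped: start=%d skipped=%d\n", start, skipped)
	}
}
```

For prev=100 and head=10000 this yields start=9489 with 9388 blocks skipped, so exactly 512 blocks (9489 through 10000) are queried.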

Jira / Linear Tickets

Testing

  • Test A
  • Test B

@minhd-vu minhd-vu merged commit ccb12c9 into main Jun 10, 2026
3 of 4 checks passed
@minhd-vu minhd-vu deleted the fix/sensor-fillrange-hot-loop branch June 10, 2026 18:05