Skip to content

fix: prevent Kafka consumer rebalance during long BigQuery writes#66

Merged
anaselmhamdi merged 1 commit into
mainfrom
fix/kafka-consumer-stalling
Apr 3, 2026
Merged

fix: prevent Kafka consumer rebalance during long BigQuery writes#66
anaselmhamdi merged 1 commit into
mainfrom
fix/kafka-consumer-stalling

Conversation

@anaselmhamdi
Copy link
Copy Markdown
Collaborator

Summary

  • Overlaps Kafka consume() with BigQuery writes using a single background thread, so the main thread keeps polling and heartbeats stay alive during long destination writes
  • Adds max.poll.interval.ms=600000 (10 min) to default Kafka consumer config as a safety net
  • Backpressure via future.result() ensures at most 2 batches in memory (one writing, one being consumed)

Context

With a single Kafka consumer handling multiple topics, BigQuery writes block the main thread. During writes, no consume() calls happen, so no heartbeats are sent. Kafka evicts the consumer → rebalance → all topics stall reading 0 messages for minutes. Increasing max.poll.interval.ms alone just delays the stall; reducing batch_size kills throughput with few partitions.

Test plan

  • Deploy to staging and verify no more 0-message stall periods after large writes
  • Verify errors still propagate (not silently swallowed) — future.result() re-raises
  • Verify Datadog metrics still report correctly
  • Monitor memory — should stay bounded (no more than 2 batches in flight)

🤖 Generated with Claude Code

Overlap consume(N+1) with write(N) using a background thread so
heartbeats stay alive during destination writes. Backpressure ensures
at most 2 batches in memory. Also adds max.poll.interval.ms=600s default.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@anaselmhamdi anaselmhamdi merged commit a012213 into main Apr 3, 2026
2 checks passed
anaselmhamdi added a commit that referenced this pull request Apr 7, 2026
PR #66 (v0.3.8) wrapped destination writes in a ThreadPoolExecutor with
the goal of overlapping consume(N+1) with write(N) to keep Kafka polls
alive during long BigQuery writes. The implementation waits for
pending_future.result() *before* calling source.get(), so consume always
runs after the previous write finishes — no actual overlap, just thread
overhead, a one-iteration commit delay, and a destination-object thread
safety footgun.

This restores the simple sequential loop. The max.poll.interval.ms=600000
Kafka consumer default introduced in 0.3.8 is intentionally retained as a
safety net against rebalance during long writes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant