fix: revert streaming runner overlap to fix Kafka lag (v0.3.9)#67
Merged
Conversation
PR #66 (v0.3.8) wrapped destination writes in a ThreadPoolExecutor with the goal of overlapping consume(N+1) with write(N) to keep Kafka polls alive during long BigQuery writes. The implementation waits for pending_future.result() *before* calling source.get(), so consume always runs after the previous write finishes — no actual overlap, just thread overhead, a one-iteration commit delay, and a destination-object thread safety footgun. This restores the simple sequential loop. The max.poll.interval.ms=600000 Kafka consumer default introduced in 0.3.8 is intentionally retained as a safety net against rebalance during long writes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Reverts the
ThreadPoolExecutor-based streaming runner changes from #66 / v0.3.8. Bumps version to 0.3.9. Keepsmax.poll.interval.ms=600000from #66 — only the runner threading is reverted.Why
#66 was meant to overlap
consume(N+1)withwrite(N)so Kafka polls stay alive during long BigQuery writes. The implementation does not actually overlap them — the loop body waits onpending_future.result()before callingsource.get():So consume(N+1) always runs after write(N) finishes — same wall-clock as the old sequential code, plus:
destination.destination_id(latent thread-safety footgun)executor.shutdown(wait=True)was unreachable in production (nomax_iterations)Combined with the "stabilization paradox" — #60 / #62 / #63 / #64 / #65 fixed several pipeline-crashing bugs in the same window, removing the crash-and-restart cycle that previously masked how slow steady-state throughput is — Kafka pipelines now run continuously at their (slow) baseline rate, so lag is finally visible.
This PR restores the simple sequential
consume → write → commitloop that was in place pre-#66. It does not attempt to design proper consume-during-write overlap — that's a separate, larger conversation that wants real production write-time metrics first.What's kept from v0.3.8
max.poll.interval.ms=600000(10 min) inbizon/connectors/sources/kafka/src/config.py. Still useful as a belt-and-braces safety net against rebalance during long writes, even though the runner no longer claims to overlap.What's reverted
bizon/engine/runner/adapters/streaming.py(ThreadPoolExecutorimport,pending_future/should_commitlocals,do_writesclosure, the deferred-commit logic, the unreachableexecutor.shutdown).Files changed
bizon/engine/runner/adapters/streaming.py— revertrun()body to pre-fix: prevent Kafka consumer rebalance during long BigQuery writes #66 form (-66 / +25)pyproject.toml—0.3.8→0.3.9uv.lock— picks up the version bumpCHANGELOG.md—[0.3.9] - 2026-04-07entry under### FixedTest plan
uv run ruff format+ruff checkonstreaming.py— cleanuv run pytest tests/connectors/sources/kafka/— 27 passeduv run pytest tests/connectors/sources/kafka/test_kafka_decode.py— 21 passedThreadPoolExecutor/pending_future/should_commitsymbols inStreamingRunner.run9ad0ed5forstreaming.pykafka_consumer_lagfor 30–60 min vs the v0.3.8 baseline before promoting(Pre-existing test failures unrelated to this change: missing optional deps
pika,datadog,psycopg2, and live BQ tests requiring credentials.)Out of scope (deliberate)
destination.py:183). It's correct and load-bearing for Debezium CDC; revisit only if metrics later prove it's the bottleneck.batch_size/consumer_timeoutdefaults.🤖 Generated with Claude Code