Fix transactional topic reader committing stale offsets after reconnect#834
Merged
Conversation
There was a problem hiding this comment.
Pull request overview
This PR fixes a transactional topic reader edge case where offsets could be committed for a partition session that became stale after a reader reconnect, leading to server-side “Gap” errors and a consumer that appears healthy while making no progress.
Changes:
- Added a transactional commit guard to detect and reject committing batches whose partition session no longer belongs to the current reader stream.
- Added an asyncio test that forces a reconnect between
receive_batch_with_tx()andtx.commit()and asserts the commit fails and no offset-update RPC is sent.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
ydb/_topic_reader/topic_reader_asyncio.py |
Adds stale-partition-session detection for transactional offset commits and fails the tx early when detected. |
tests/topics/test_topic_transactions.py |
Adds a regression test covering tx commit after reconnect to ensure stale offsets aren’t committed. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
If the reader reconnects between receive_batch_with_tx() and commit, the batch's partition session is dead; mirror the non-tx commit guard and fail the tx (retriable) instead of sending a gapped UpdateOffsetsInTransaction.
b02923b to
fe00ecb
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The non-transactional
ReaderStream.commit()guards against committing a batch whosepartition session belongs to an already-reconnected stream (raises
PublicTopicReaderPartitionExpiredError). The transactional path(
receive_batch_with_tx→_commit_batches_with_tx→UpdateOffsetsInTransaction)had no such guard.
So if the reader reconnects between
receive_batch_with_tx()andtx.commit(), thetransaction commits offsets of a dead partition session. On the server this surfaces as
a
Gap(issue_code 2011), and the SDK does not raise it as a tx failure — the offsetsilently never advances and the consumer looks healthy while making no progress.
Fix
Mirror the non-tx guard in
_commit_batches_with_tx: if any batch's partition sessionis no longer the current stream's, set a retriable external error on the tx and skip the
server call (leaving the current stream untouched). The session pool then retries and
re-reads from the committed offset.
Test
test_tx_commit_after_reconnect_does_not_commit_stale_offsets: reads a batch in a tx,forces a reconnect, and asserts the commit fails loudly, no
UpdateOffsetsInTransactionis sent, and the consumer offset does not advance.