sync-diff-inspector: use limit iterator to replace random iterator for TiDB source by joechenrh · Pull Request #12518 · pingcap/tiflow

joechenrh · 2026-02-05T10:10:06Z

What problem does this PR solve?

Issue Number: close #12492

What is changed and how it works?

Change to use limit iterator(which is not used previously) instead of random iterator for TiDB source, which can:

Precisely control the size of each chunk.
Limit the total boundaries stored in memory.

This PR also make below refactors:

Extract the index selection logic from each splitter, to remove some duplicated codes
If we are restarting from checkpoint, skip the first check and reuse the previous type/index.

Check List

Tests

Unit test
Integration test
Manual test (add detailed scripts or steps below)
No code

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

1. Extract common logic of index selection from each iterator 2. Make index-fields order‑sensitive 3. Minor changes to limit iterator, and choose it when there are available indexes. Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

ti-chi-bot · 2026-02-05T10:10:14Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign 3aceshowhand for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

gemini-code-assist · 2026-02-05T10:10:51Z

Summary of Changes

Hello @joechenrh, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the chunk splitting mechanism for TiDB sources within sync-diff-inspector. It introduces a new LimitIterator designed to provide more precise control over chunk sizes and optimize memory usage by limiting the total boundaries stored. A key improvement is the centralization of index selection logic into a ChooseSplitType function, which intelligently selects the most appropriate splitting strategy (Bucket, Limit, or Random) based on database type, table configuration, and available statistics. Furthermore, the system now more effectively leverages checkpoint information, allowing for seamless restarts by reusing previously determined splitter types and indices. These changes aim to improve performance and stability, especially when dealing with large tables and resuming operations from checkpoints.

Highlights

New Limit Iterator for TiDB Sources: Introduced a new LimitIterator for TiDB sources, replacing the previous random iterator for better chunk size control and optimized memory management by limiting the total boundaries stored.
Centralized Index Selection Logic: Refactored index selection logic into a centralized ChooseSplitType function, which intelligently selects the most appropriate splitting strategy (Bucket, Limit, or Random) based on database type, table configuration, and available statistics, reducing code duplication across different splitter implementations.
Improved Checkpointing: Enhanced checkpointing by allowing the system to reuse the previous splitter type and index when restarting, skipping initial checks and ensuring seamless resumption of operations.
Batch Fetching for Limit Iterator: The LimitIterator now fetches bounds in batches, improving efficiency and reducing memory usage, especially for large tables.
Configurable Check Thread Count: Added CheckThreadCount to TableDiff configuration, allowing dynamic adjustment of the batch size for the LimitIterator to fine-tune performance.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

pkg/dbutil/buckets.go
- Refactored to prevent duplicate warnings for missing index/column buckets.
- Simplified index column type assignment by removing a conditional check.
sync_diff_inspector/chunk/chunk.go
- Added a String() method for the Type enum, including the new Limit type, for better logging and debugging.
sync_diff_inspector/source/chunks_iter.go
- Replaced direct calls to getCurTableIndexID with the new chunkIter.GetIndexID() interface method.
- Removed the now-obsolete getCurTableIndexID helper function.
sync_diff_inspector/source/common/table_diff.go
- Added a CheckThreadCount field to the TableDiff struct to configure parallel processing.
sync_diff_inspector/source/mysql_shard.go
- Modified AnalyzeSplitter to utilize the new splitter.ChooseSplitType function.
- Ensured that chunk.Random type is explicitly used for MySQL shards.
- Updated NewRandomIteratorWithCheckpoint calls to include the candidate parameter.
sync_diff_inspector/source/source.go
- Integrated the CheckThreadCount configuration into the TableDiff initialization process.
sync_diff_inspector/source/source_test.go
- Added a GetIndexID() method to MockChunkIterator to align with the updated interface.
- Renamed TestFallbackToRandomIfRangeIsSet to TestUseLimitIfRangeIsSet and updated assertions to expect LimitIterator.
- Added new mock queries to support the LimitIterator in tests.
sync_diff_inspector/source/tidb.go
- Reworked AnalyzeSplitter to dynamically choose between Bucket, Random, and Limit iterators based on the result of splitter.ChooseSplitType.
- Updated NewBucketIterator calls to pass the candidate parameter.
sync_diff_inspector/splitter/bucket.go
- Renamed NewBucketIterator to NewBucketIteratorForTest and NewBucketIteratorWithCheckpoint to NewBucketIterator.
- Modified NewBucketIterator to accept an IndexCandidate, centralizing index selection logic.
- Removed the check for table.Range triviality from the bucket iterator initialization.
sync_diff_inspector/splitter/choose_splitter_test.go
- Added a new file containing comprehensive unit tests for the ChooseSplitType function, covering various scenarios like non-TiDB, range-set, bucket availability, and checkpoint fallbacks.
sync_diff_inspector/splitter/index_fields.go
- Removed the sort import and the sortColsInPlace function.
- Modified indexFieldsFromConfigString to directly find columns by name.
- Updated MatchesIndex to compare columns without requiring prior sorting.
sync_diff_inspector/splitter/index_fields_test.go
- Added TestIndexFieldsOrderSensitive to verify the order-sensitive behavior of index field matching.
sync_diff_inspector/splitter/limit.go
- Added failpoint and ast imports.
- Refactored LimitIterator struct to include chunkSize, batchSize, indexColumnNames, and logger, and replaced chunksCh/errCh with resultCh.
- Updated NewLimitIterator and NewLimitIteratorWithCheckpoint to accept an IndexCandidate.
- Rewrote produceChunks logic for batch fetching of bounds and improved checkpoint handling.
- Replaced getLimitRow with batchGetBounds to fetch multiple bounds efficiently.
- Renamed generateLimitQueryTemplate to generateBoundQueryTemplate and updated it to use ROW_NUMBER() OVER (ORDER BY ...) and MOD(rn, %d) = 0 for chunking, including an index hint.
sync_diff_inspector/splitter/random.go
- Removed the strings import.
- Added an IndexID field to the RandomIterator struct.
- Updated NewRandomIterator and NewRandomIteratorWithCheckpoint to accept an IndexCandidate.
- Removed internal index selection logic, now relying on the provided candidate.
- Removed the GetSplitFields function.
- Added a GetIndexID method to the RandomIterator.
sync_diff_inspector/splitter/splitter.go
- Added numerous imports for context, database operations, sorting, error handling, and logging.
- Extended the ChunkIterator interface with a GetIndexID() method.
- Introduced iteratorResult struct for consistent iterator output.
- Defined IndexCandidate struct to encapsulate index information for splitting.
- Implemented ChooseSplitType to intelligently select the appropriate splitter type and index candidate.
- Added BuildFakeCandidateForRandom for scenarios where no suitable real index exists.
- Created pickCandidateFromCheckpoint to retrieve index information from checkpoint data.
- Developed BuildIndexCandidates to find and sort potential indices based on user fields and database statistics.
- Included isValidIndex helper function for index validation.
sync_diff_inspector/splitter/splitter_test.go
- Added dbutil import.
- Changed chunkResult.args type to []any.
- Introduced buildIndexCandidatesForTest and buildCandidateForTest helper functions for test setup.
- Updated calls to various iterator constructors (NewRandomIterator, NewBucketIterator, NewLimitIterator) to pass the new candidate parameter.
- Modified TestBucketSpliter checkpoint test to use NewBucketIterator with candidate and added a failpoint for getRowCount.
- Updated createFakeResultForLimitSplit to mock COUNT queries and handle batch fetching.
- Adjusted TestChunkSize and TestRandomSpliterHint to reflect changes in LimitIterator query generation and RandomIterator initialization.
- Updated regular expressions for mock.ExpectQuery in createFakeResultForBucketSplit and createFakeResultForBucketIterator.
sync_diff_inspector/tests/sync_diff_inspector/checkpoint/config_base_limit.toml
- Added a new configuration file specifically for testing the limit iterator with checkpointing capabilities.
sync_diff_inspector/tests/sync_diff_inspector/checkpoint/run.sh
- Added new shell script test cases to validate the limit iterator's checkpointing functionality, including the use of failpoints to simulate partial runs and verify resume logic.
sync_diff_inspector/tests/sync_diff_inspector/run.sh
- Updated an assertion to check for 'type=Limit' instead of 'split range by random' in the log after table analysis, reflecting the new default splitter type.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a new LimitIterator for TiDB sources and refactors the splitter selection logic, aiming for significant improvements in performance and memory usage. A security audit identified two SQL injection vulnerabilities: a high-severity issue where optimizer hints are constructed without sanitizing for the comment terminator */, allowing SQL injection via malicious database object names, and a medium-severity issue where the collation setting from the configuration file is used in SQL queries without proper escaping. Additionally, issues were found in the new LimitIterator implementation related to handling table.Range, which could lead to incorrect behavior when a range filter is used.

gemini-code-assist · 2026-02-05T10:14:21Z

sync_diff_inspector/splitter/limit.go

+		where, args := tagChunk.ToString(lmt.table.Collation)
 		query := fmt.Sprintf(lmt.queryTmpl, where)


The query for fetching chunk boundaries does not incorporate table.Range. This is a correctness issue as it can lead to splitting chunks based on data outside the specified range, potentially causing the diff to check incorrect data. The table.Range must be included in the WHERE clause of the boundary query.

where, args := tagChunk.ToString(lmt.table.Collation) if !utils.IsRangeTrivial(lmt.table.Range) { where = fmt.Sprintf("(%s) AND (%s)", where, lmt.table.Range) } query := fmt.Sprintf(lmt.queryTmpl, where)

sync_diff_inspector/splitter/limit.go

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

ti-chi-bot · 2026-03-16T23:37:17Z

@joechenrh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-error-log-review	`2ffd47d`	link	true	`/test pull-error-log-review`
pull-syncdiff-integration-test	`2ffd47d`	link	true	`/test pull-syncdiff-integration-test`
pull-dm-integration-test	`2ffd47d`	link	true	`/test pull-dm-integration-test`
pull-cdc-integration-storage-test	`2ffd47d`	link	true	`/test pull-cdc-integration-storage-test`

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

joechenrh added 4 commits February 5, 2026 01:31

refine iterator

becbd4c

1. Extract common logic of index selection from each iterator 2. Make index-fields order‑sensitive 3. Minor changes to limit iterator, and choose it when there are available indexes. Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

reduce batch size

3db27b1

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

refine

153accf

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

refine

b7acae5

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

ti-chi-bot bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 5, 2026

ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 5, 2026

gemini-code-assist bot reviewed Feb 5, 2026

View reviewed changes

joechenrh added 8 commits February 5, 2026 09:53

update

804a4f2

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

fix test

a76ee05

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

fix test

769b3c5

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

fix test

af8eeff

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

fixup

5acfd6a

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

fix test

d40643c

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

fix test

e763551

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

change config

2ffd47d

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>

alastori mentioned this pull request Feb 24, 2026

sync-diff-inspector: Random iterator pre-computes all chunk boundaries, producing uneven chunks on large tables #12492

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sync-diff-inspector: use limit iterator to replace random iterator for TiDB source#12518

sync-diff-inspector: use limit iterator to replace random iterator for TiDB source#12518
joechenrh wants to merge 12 commits intopingcap:masterfrom
joechenrh:limit-iterator

joechenrh commented Feb 5, 2026

Uh oh!

ti-chi-bot bot commented Feb 5, 2026

Uh oh!

gemini-code-assist bot commented Feb 5, 2026

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

gemini-code-assist bot Feb 5, 2026

Uh oh!

Uh oh!

Uh oh!

ti-chi-bot bot commented Mar 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

		where, args := tagChunk.ToString(lmt.table.Collation)
		query := fmt.Sprintf(lmt.queryTmpl, where)

Conversation

joechenrh commented Feb 5, 2026

What problem does this PR solve?

What is changed and how it works?

Check List

Tests

Questions

Will it cause performance regression or break compatibility?

Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Uh oh!

ti-chi-bot bot commented Feb 5, 2026

Uh oh!

gemini-code-assist bot commented Feb 5, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Feb 5, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

ti-chi-bot bot commented Mar 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ti-chi-bot bot commented Mar 16, 2026 •

edited

Loading