Skip to content

sync-diff-inspector: use limit iterator to replace random iterator for TiDB source#12518

Open
joechenrh wants to merge 12 commits intopingcap:masterfrom
joechenrh:limit-iterator
Open

sync-diff-inspector: use limit iterator to replace random iterator for TiDB source#12518
joechenrh wants to merge 12 commits intopingcap:masterfrom
joechenrh:limit-iterator

Conversation

@joechenrh
Copy link
Contributor

What problem does this PR solve?

Issue Number: close #12492

What is changed and how it works?

Change to use limit iterator(which is not used previously) instead of random iterator for TiDB source, which can:

  • Precisely control the size of each chunk.
  • Limit the total boundaries stored in memory.

This PR also make below refactors:

  • Extract the index selection logic from each splitter, to remove some duplicated codes
  • If we are restarting from checkpoint, skip the first check and reuse the previous type/index.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Please refer to [Release Notes Language Style Guide](https://pingcap.github.io/tidb-dev-guide/contribute-to-tidb/release-notes-style-guide.html) to write a quality release note.

If you don't think this PR needs a release note then fill it with `None`.

1. Extract common logic of index selection from each iterator
2. Make index-fields order‑sensitive
3. Minor changes to limit iterator, and choose it when there are
   available indexes.

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
@ti-chi-bot ti-chi-bot bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Feb 5, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign 3aceshowhand for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Feb 5, 2026
@gemini-code-assist
Copy link

Summary of Changes

Hello @joechenrh, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the chunk splitting mechanism for TiDB sources within sync-diff-inspector. It introduces a new LimitIterator designed to provide more precise control over chunk sizes and optimize memory usage by limiting the total boundaries stored. A key improvement is the centralization of index selection logic into a ChooseSplitType function, which intelligently selects the most appropriate splitting strategy (Bucket, Limit, or Random) based on database type, table configuration, and available statistics. Furthermore, the system now more effectively leverages checkpoint information, allowing for seamless restarts by reusing previously determined splitter types and indices. These changes aim to improve performance and stability, especially when dealing with large tables and resuming operations from checkpoints.

Highlights

  • New Limit Iterator for TiDB Sources: Introduced a new LimitIterator for TiDB sources, replacing the previous random iterator for better chunk size control and optimized memory management by limiting the total boundaries stored.
  • Centralized Index Selection Logic: Refactored index selection logic into a centralized ChooseSplitType function, which intelligently selects the most appropriate splitting strategy (Bucket, Limit, or Random) based on database type, table configuration, and available statistics, reducing code duplication across different splitter implementations.
  • Improved Checkpointing: Enhanced checkpointing by allowing the system to reuse the previous splitter type and index when restarting, skipping initial checks and ensuring seamless resumption of operations.
  • Batch Fetching for Limit Iterator: The LimitIterator now fetches bounds in batches, improving efficiency and reducing memory usage, especially for large tables.
  • Configurable Check Thread Count: Added CheckThreadCount to TableDiff configuration, allowing dynamic adjustment of the batch size for the LimitIterator to fine-tune performance.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • pkg/dbutil/buckets.go
    • Refactored to prevent duplicate warnings for missing index/column buckets.
    • Simplified index column type assignment by removing a conditional check.
  • sync_diff_inspector/chunk/chunk.go
    • Added a String() method for the Type enum, including the new Limit type, for better logging and debugging.
  • sync_diff_inspector/source/chunks_iter.go
    • Replaced direct calls to getCurTableIndexID with the new chunkIter.GetIndexID() interface method.
    • Removed the now-obsolete getCurTableIndexID helper function.
  • sync_diff_inspector/source/common/table_diff.go
    • Added a CheckThreadCount field to the TableDiff struct to configure parallel processing.
  • sync_diff_inspector/source/mysql_shard.go
    • Modified AnalyzeSplitter to utilize the new splitter.ChooseSplitType function.
    • Ensured that chunk.Random type is explicitly used for MySQL shards.
    • Updated NewRandomIteratorWithCheckpoint calls to include the candidate parameter.
  • sync_diff_inspector/source/source.go
    • Integrated the CheckThreadCount configuration into the TableDiff initialization process.
  • sync_diff_inspector/source/source_test.go
    • Added a GetIndexID() method to MockChunkIterator to align with the updated interface.
    • Renamed TestFallbackToRandomIfRangeIsSet to TestUseLimitIfRangeIsSet and updated assertions to expect LimitIterator.
    • Added new mock queries to support the LimitIterator in tests.
  • sync_diff_inspector/source/tidb.go
    • Reworked AnalyzeSplitter to dynamically choose between Bucket, Random, and Limit iterators based on the result of splitter.ChooseSplitType.
    • Updated NewBucketIterator calls to pass the candidate parameter.
  • sync_diff_inspector/splitter/bucket.go
    • Renamed NewBucketIterator to NewBucketIteratorForTest and NewBucketIteratorWithCheckpoint to NewBucketIterator.
    • Modified NewBucketIterator to accept an IndexCandidate, centralizing index selection logic.
    • Removed the check for table.Range triviality from the bucket iterator initialization.
  • sync_diff_inspector/splitter/choose_splitter_test.go
    • Added a new file containing comprehensive unit tests for the ChooseSplitType function, covering various scenarios like non-TiDB, range-set, bucket availability, and checkpoint fallbacks.
  • sync_diff_inspector/splitter/index_fields.go
    • Removed the sort import and the sortColsInPlace function.
    • Modified indexFieldsFromConfigString to directly find columns by name.
    • Updated MatchesIndex to compare columns without requiring prior sorting.
  • sync_diff_inspector/splitter/index_fields_test.go
    • Added TestIndexFieldsOrderSensitive to verify the order-sensitive behavior of index field matching.
  • sync_diff_inspector/splitter/limit.go
    • Added failpoint and ast imports.
    • Refactored LimitIterator struct to include chunkSize, batchSize, indexColumnNames, and logger, and replaced chunksCh/errCh with resultCh.
    • Updated NewLimitIterator and NewLimitIteratorWithCheckpoint to accept an IndexCandidate.
    • Rewrote produceChunks logic for batch fetching of bounds and improved checkpoint handling.
    • Replaced getLimitRow with batchGetBounds to fetch multiple bounds efficiently.
    • Renamed generateLimitQueryTemplate to generateBoundQueryTemplate and updated it to use ROW_NUMBER() OVER (ORDER BY ...) and MOD(rn, %d) = 0 for chunking, including an index hint.
  • sync_diff_inspector/splitter/random.go
    • Removed the strings import.
    • Added an IndexID field to the RandomIterator struct.
    • Updated NewRandomIterator and NewRandomIteratorWithCheckpoint to accept an IndexCandidate.
    • Removed internal index selection logic, now relying on the provided candidate.
    • Removed the GetSplitFields function.
    • Added a GetIndexID method to the RandomIterator.
  • sync_diff_inspector/splitter/splitter.go
    • Added numerous imports for context, database operations, sorting, error handling, and logging.
    • Extended the ChunkIterator interface with a GetIndexID() method.
    • Introduced iteratorResult struct for consistent iterator output.
    • Defined IndexCandidate struct to encapsulate index information for splitting.
    • Implemented ChooseSplitType to intelligently select the appropriate splitter type and index candidate.
    • Added BuildFakeCandidateForRandom for scenarios where no suitable real index exists.
    • Created pickCandidateFromCheckpoint to retrieve index information from checkpoint data.
    • Developed BuildIndexCandidates to find and sort potential indices based on user fields and database statistics.
    • Included isValidIndex helper function for index validation.
  • sync_diff_inspector/splitter/splitter_test.go
    • Added dbutil import.
    • Changed chunkResult.args type to []any.
    • Introduced buildIndexCandidatesForTest and buildCandidateForTest helper functions for test setup.
    • Updated calls to various iterator constructors (NewRandomIterator, NewBucketIterator, NewLimitIterator) to pass the new candidate parameter.
    • Modified TestBucketSpliter checkpoint test to use NewBucketIterator with candidate and added a failpoint for getRowCount.
    • Updated createFakeResultForLimitSplit to mock COUNT queries and handle batch fetching.
    • Adjusted TestChunkSize and TestRandomSpliterHint to reflect changes in LimitIterator query generation and RandomIterator initialization.
    • Updated regular expressions for mock.ExpectQuery in createFakeResultForBucketSplit and createFakeResultForBucketIterator.
  • sync_diff_inspector/tests/sync_diff_inspector/checkpoint/config_base_limit.toml
    • Added a new configuration file specifically for testing the limit iterator with checkpointing capabilities.
  • sync_diff_inspector/tests/sync_diff_inspector/checkpoint/run.sh
    • Added new shell script test cases to validate the limit iterator's checkpointing functionality, including the use of failpoints to simulate partial runs and verify resume logic.
  • sync_diff_inspector/tests/sync_diff_inspector/run.sh
    • Updated an assertion to check for 'type=Limit' instead of 'split range by random' in the log after table analysis, reflecting the new default splitter type.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a new LimitIterator for TiDB sources and refactors the splitter selection logic, aiming for significant improvements in performance and memory usage. A security audit identified two SQL injection vulnerabilities: a high-severity issue where optimizer hints are constructed without sanitizing for the comment terminator */, allowing SQL injection via malicious database object names, and a medium-severity issue where the collation setting from the configuration file is used in SQL queries without proper escaping. Additionally, issues were found in the new LimitIterator implementation related to handling table.Range, which could lead to incorrect behavior when a range filter is used.

Comment on lines +208 to 209
where, args := tagChunk.ToString(lmt.table.Collation)
query := fmt.Sprintf(lmt.queryTmpl, where)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

The query for fetching chunk boundaries does not incorporate table.Range. This is a correctness issue as it can lead to splitting chunks based on data outside the specified range, potentially causing the diff to check incorrect data. The table.Range must be included in the WHERE clause of the boundary query.

        where, args := tagChunk.ToString(lmt.table.Collation)
        if !utils.IsRangeTrivial(lmt.table.Range) {
            where = fmt.Sprintf("(%s) AND (%s)", where, lmt.table.Range)
        }
        query := fmt.Sprintf(lmt.queryTmpl, where)

Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 16, 2026

@joechenrh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-error-log-review 2ffd47d link true /test pull-error-log-review
pull-syncdiff-integration-test 2ffd47d link true /test pull-syncdiff-integration-test
pull-dm-integration-test 2ffd47d link true /test pull-dm-integration-test
pull-cdc-integration-storage-test 2ffd47d link true /test pull-cdc-integration-storage-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sync-diff-inspector: Random iterator pre-computes all chunk boundaries, producing uneven chunks on large tables

1 participant