
sync-diff-inspector: calculate checksum only with export-fix-sql = false #12531

Open
joechenrh wants to merge 38 commits into pingcap:master from joechenrh:checksum-only

Conversation

@joechenrh
Contributor

@joechenrh joechenrh commented Mar 5, 2026

What problem does this PR solve?

Issue Number: ref #12492

In scenarios where export-fix-sql = false (for example, B/G upgrade verification), we only need final table-level equality (count + checksum), not row-level fix SQL generation. This PR introduces a checksum-only path for TiDB-to-TiDB data checks while keeping the existing chunk-mode checkpoint behavior intact.

Basic rationale:

  • The current chunk-wise comparison for non-clustered PK tables often requires index lookups, which are expensive.
  • For configs with export-fix-sql = false, we use _tidb_rowid as the split field to get a scan-friendly split path.
  • Since rowid may differ between the two sides even when the logical table records are the same, we cannot use the regular one-side-range-for-both-sides model, which could worsen chunk-size imbalance on the other side.
  • The branch also includes a follow-up stabilization of the checkpoint integration test, because resumed bucket/random checkpoint runs can select different valid resumed chunks in CI, which made the old assertion flaky.

What is changed and how it works?

  • Add global checksum mode gating:
    • use it only when both sources are TiDB and export-fix-sql is disabled.
  • Add a checksum-only iterator path for TiDB:
    • clustered table: split by PK columns.
    • non-clustered table: fall back to _tidb_rowid splitting.
  • In checksum-only mode, each side builds its own iterator and scans independently; the final table-level aggregates (count + checksum) are then compared.
  • Keep the checksum-only flow separate from the regular chunk diff flow to reduce coupling and risk.
  • Add a checksum checkpoint contract to support resume:
    • persist the table index, per-source accumulated count/checksum, the last range, and a done flag.
  • Keep checksum execution parallel with deterministic, ordered checkpoint accumulation.
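The gating in the first bullet can be sketched as follows. This is a minimal stand-in, not the actual implementation: the real shouldUseGlobalChecksum in sync_diff_inspector/diff/global_checksum.go inspects the concrete source types and config, and the Config struct here is a hypothetical simplification.

```go
package main

import "fmt"

// Config mirrors only the one field relevant to mode selection
// (hypothetical simplification of the real sync-diff-inspector config).
type Config struct {
	ExportFixSQL bool
}

// shouldUseGlobalChecksum sketches the gating described above: the
// checksum-only path is taken only when both sources are TiDB and
// fix-SQL export is disabled.
func shouldUseGlobalChecksum(cfg *Config, upstreamIsTiDB, downstreamIsTiDB bool) bool {
	if cfg.ExportFixSQL {
		// Row-level fix SQL requires the regular chunk-diff path.
		return false
	}
	return upstreamIsTiDB && downstreamIsTiDB
}

func main() {
	fmt.Println(shouldUseGlobalChecksum(&Config{ExportFixSQL: false}, true, true))  // true
	fmt.Println(shouldUseGlobalChecksum(&Config{ExportFixSQL: true}, true, true))   // false
	fmt.Println(shouldUseGlobalChecksum(&Config{ExportFixSQL: false}, true, false)) // false
}
```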

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (details below)

Manual benchmark (same cluster used as both upstream and downstream to simplify setup):

  • Cluster: 1 TiDB + 3 TiKV
  • Data size: 262.1 GiB (1,000,000,000 rows)
PK layout in table KV                  master    this PR
ordered                                30 min    25-30 min
partially ordered (5% out-of-order)    49 min    25-30 min
random                                 ~5 h      25-30 min

Note:

  1. “order” refers to the physical order of the PK columns in table KV.
  2. I only tested once for this PR; since the speed should be similar across scenarios, I give a rough range instead of exact numbers.

Questions

Will it cause performance regression or break compatibility?

Checkpoints from an older version can't be used with the new version of sync-diff-inspector.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

sync-diff-inspector: optimize TiDB-to-TiDB checksum-only verification for non-clustered PK tables by using handle/rowid-based splitting with resumable checksum checkpoints.

@ti-chi-bot ti-chi-bot bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 5, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zhaoxinyu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 5, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a specialized, high-performance checksum verification path for sync-diff-inspector when comparing two TiDB clusters where row-level fix SQL generation is not required. The primary goal is to drastically reduce the time and resource consumption of data consistency checks on tables with non-clustered primary keys, which previously suffered from inefficient chunking. By implementing intelligent splitting logic and resumable, parallel checksum processing, the tool can now perform these verifications much more efficiently.

Highlights

  • Optimized Checksum for TiDB-to-TiDB Verification: Introduced a new checksum-only mode for sync-diff-inspector specifically for TiDB-to-TiDB data checks when export-fix-sql is disabled. This significantly improves performance for tables with non-clustered primary keys by avoiding expensive chunk-by-chunk checksums.
  • Intelligent Chunk Splitting: Implemented optimized chunk splitting strategies for the checksum-only mode: using primary key columns for PKIsHandle tables, primary index columns for IsCommonHandle tables, and falling back to _tidb_rowid for non-clustered tables.
  • Resumable Checkpoints for Checksum-Only Mode: Added a robust checkpointing mechanism for the checksum-only flow, allowing the process to resume from the last saved state, including accumulated count, checksum, and the last processed range for both upstream and downstream sources.
  • Parallel Checksum Execution: Enabled parallel execution of checksum tasks across multiple chunks, with deterministic ordered checkpoint accumulation, further enhancing the performance of the checksum-only verification.
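The splitting strategy in the second highlight can be sketched roughly as below. The tableMeta type is a hypothetical stand-in for the relevant bits of TiDB's model.TableInfo; the real getChecksumSplitFields lives in sync_diff_inspector/source/tidb.go.

```go
package main

import "fmt"

// tableMeta is a simplified stand-in for TiDB's model.TableInfo,
// keeping only the fields this sketch needs (hypothetical).
type tableMeta struct {
	PKIsHandle     bool     // clustered single-integer PK
	IsCommonHandle bool     // clustered multi-column PK
	PKColumns      []string // columns of the primary key, if any
}

// checksumSplitFields sketches the selection logic described above:
// clustered tables split on their handle (PK) columns, everything
// else falls back to the hidden _tidb_rowid column.
func checksumSplitFields(t *tableMeta) []string {
	switch {
	case t.PKIsHandle, t.IsCommonHandle:
		return t.PKColumns
	default:
		return []string{"_tidb_rowid"}
	}
}

func main() {
	fmt.Println(checksumSplitFields(&tableMeta{PKIsHandle: true, PKColumns: []string{"id"}})) // [id]
	fmt.Println(checksumSplitFields(&tableMeta{}))                                           // [_tidb_rowid]
}
```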


Changelog
  • sync_diff_inspector/checkpoints/checkpoints.go
    • Added new data structures ChecksumSourceState and ChecksumState to store checksum-only checkpoint progress.
    • Updated SavedState to include a Checksum field for persisting checksum-only checkpoints.
    • Implemented NewChecksumState and an init method for ChecksumState to ensure proper initialization.
    • Added SaveChecksumState and LoadChecksumState functions to handle the persistence and retrieval of checksum-only checkpoint data.
  • sync_diff_inspector/checkpoints/checkpoints_test.go
    • Added TestSaveLoadChecksumState to verify the correct saving and loading of the new checksum-only checkpoint state.
  • sync_diff_inspector/diff/diff.go
    • Added checksumCheckpoint and checksumCheckpointMu fields to manage the state of checksum-only operations.
    • Modified initCheckpoint to conditionally initialize the checksum checkpoint if the global checksum mode is active.
    • Refactored the Equal method to dispatch to either equalByGlobalChecksum or equalByChunkChecksum based on configuration.
  • sync_diff_inspector/diff/global_checksum.go
    • Added a new file to encapsulate the core logic for the global checksum feature.
    • Implemented shouldUseGlobalChecksum to determine when the optimized checksum path should be used.
    • Defined equalByGlobalChecksum to orchestrate the checksum-only verification process, including checkpoint handling and parallel task execution.
    • Introduced produceChecksumTasks, runChecksumWorker, and drainOrderedChecksumTasks to manage the concurrent processing and ordered accumulation of checksum results.
    • Added getSourceGlobalChecksum to fetch checksums from a source, supporting resumption from checkpoints.
    • Implemented buildChecksumOnlyIterator to create specialized iterators for checksum-only mode.
  • sync_diff_inspector/diff/global_checksum_checkpoint.go
    • Added a new file to manage checkpointing specifically for the global checksum mode.
    • Implemented initChecksumCheckpoint to load existing checksum checkpoints on startup.
    • Implemented flushChecksumCheckpoint to periodically save the current checksum progress to disk.
  • sync_diff_inspector/diff/global_checksum_test.go
    • Added a new file containing unit tests for the global checksum logic.
    • Included mock implementations (mockChecksumIterator, mockChecksumSource) to facilitate testing.
    • Provided tests like TestGetSourceGlobalChecksumKeepsCheckpointOrder and TestGetSourceGlobalChecksumResumeFromLastRange to validate the parallel execution and checkpoint resumption.
  • sync_diff_inspector/source/tidb.go
    • Added GetChecksumOnlyIterator to provide a specialized chunk iterator for TiDB sources in checksum-only mode.
    • Implemented getChecksumSplitFields to dynamically determine the optimal splitting fields (PK, common handle, or _tidb_rowid) based on the table's structure.
  • sync_diff_inspector/source/tidb_checksum_test.go
    • Added a new file with unit tests for the getChecksumSplitFields function, covering various table primary key configurations.
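For orientation, the checkpoint contract described in the changelog can be sketched as below. Field names and types are simplified guesses based on the description (table index, per-source count/checksum, last range, done flag); the real ChecksumState and ChecksumSourceState live in sync_diff_inspector/checkpoints/checkpoints.go.

```go
package main

import "fmt"

// ChecksumSourceState is a sketch of the per-source progress that the
// checksum-only checkpoint persists (simplified field types).
type ChecksumSourceState struct {
	Count     int64  // rows accumulated so far on this source
	Checksum  uint64 // running aggregate checksum
	LastRange string // last fully processed range, serialized
}

// ChecksumState is a sketch of the table-level checkpoint record.
type ChecksumState struct {
	TableIndex int // which table the run is currently on
	Upstream   ChecksumSourceState
	Downstream ChecksumSourceState
	Done       bool // whether the current table finished
}

func main() {
	st := ChecksumState{TableIndex: 2}
	st.Upstream = ChecksumSourceState{Count: 1000, Checksum: 42, LastRange: "(id > 500)"}
	fmt.Printf("%+v\n", st)
}
```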
Activity
  • The author performed manual tests and benchmarks on a cluster with 1 TiDB and 3 TiKV nodes, using 262.1 GiB of data (1,000,000,000 rows).
  • Benchmarks showed significant performance improvements for tables with partially ordered (from 49 min to 25 min) and random (from ~5 h to 25 min) physical PK order in KV, matching the performance of ordered PK tables.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant optimization for checksum-only verification in sync-diff-inspector by adding a new execution path for TiDB-to-TiDB comparisons when export-fix-sql is disabled. The changes are well-structured, separating the new logic into new files and functions, which is good for maintainability. The implementation correctly handles parallel execution, checkpointing, and resumption. I've found one high-severity issue regarding a goroutine leak that should be addressed.

@joechenrh
Contributor Author

/retest


Contributor

@D3Hunter D3Hunter left a comment


AI-generated review based on @D3Hunter's standards; manual review will be performed after all comments are resolved.

Summary

  • Total findings: 6
  • Inline comments: 6
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

🚨 [Blocker] (2)

  1. Global-checksum checkpoint is backward-incompatible with chunk-mode restore (sync_diff_inspector/diff/diff.go:192)
  2. Checkpoint flush can persist done-state before table result update (sync_diff_inspector/diff/global_checksum.go:131)

⚠️ [Major] (3)

  1. Checksum split-field logic can nil-deref when ignored columns remove PK metadata (sync_diff_inspector/source/tidb.go:129)
  2. Out-of-order checksum result buffering is unbounded (sync_diff_inspector/diff/global_checksum.go:324)
  3. Cancellation is handled as per-table error and loop continues (sync_diff_inspector/diff/global_checksum.go:178)

🟡 [Minor] (1)

  1. Checkpoint ordering test depends on wall-clock timing (sync_diff_inspector/diff/global_checksum_test.go:80)

Contributor

@D3Hunter D3Hunter left a comment


AI-generated review based on @D3Hunter's standards; manual review will be performed after all comments are resolved.

Summary

  • Total findings: 1
  • Inline comments: 1
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

⚠️ [Major] (1)

  1. Checkpoint mode mismatch can panic on resume (sync_diff_inspector/diff/diff.go:206)

@joechenrh
Contributor Author

/retest

Contributor

@D3Hunter D3Hunter left a comment


AI-generated review based on @D3Hunter's standards; manual review will be performed after all comments are resolved.

Summary

  • Total findings: 9
  • Inline comments: 9
  • Summary-only findings (no inline anchor): 0

Findings (highest risk first)

🚨 [Blocker] (2)

  1. Out-of-order checksum buffering is unbounded and can exhaust memory on skewed chunk latency (sync_diff_inspector/diff/global_checksum.go:331)
  2. Checkpoint schema change is not rollback-safe for N-1 binaries (sync_diff_inspector/checkpoints/checkpoints.go:188, sync_diff_inspector/checkpoints/checkpoints.go:295, sync_diff_inspector/diff/global_checksum_checkpoint.go:30, sync_diff_inspector/diff/diff.go:204)

⚠️ [Major] (3)

  1. Recovered global-checksum retries keep stale error state and can report false failure (sync_diff_inspector/diff/global_checksum.go:178, sync_diff_inspector/diff/global_checksum.go:181, sync_diff_inspector/report/report.go:331, sync_diff_inspector/report/report.go:357)
  2. Chunk-mode restart after checksum-mode checkpoint skips fix-SQL cleanup and can crash on duplicate filenames (sync_diff_inspector/diff/diff.go:204, sync_diff_inspector/diff/diff.go:217, sync_diff_inspector/diff/diff.go:814)
  3. Checksum-only mode forces random splitting that precomputes all chunks, creating O(chunk_count) startup memory pressure (sync_diff_inspector/source/tidb.go:130)

🟡 [Minor] (3)

  1. equalByGlobalChecksum combines orchestration, per-table policy, concurrency wiring, and checkpoint/report side effects in one block (sync_diff_inspector/diff/global_checksum.go:58)
  2. getChecksumSplitFields has hidden schema mutation behind a getter-style API (sync_diff_inspector/source/tidb.go:134)
  3. Upgrade from legacy chunk checkpoint to new checksum mode silently drops resume state (sync_diff_inspector/diff/diff.go:192, sync_diff_inspector/diff/global_checksum_checkpoint.go:32)

🧹 [Nit] (1)

  1. Mock-adapter comment references a non-existent interface name (sync_diff_inspector/diff/global_checksum_test.go:131)


// SavedState contains the information of the latest checked chunk and state of `report`
// When sync-diff start from the checkpoint, it will load this information and continue running
type SavedState struct {
Contributor


🚨 [Blocker] Checkpoint schema change is not rollback-safe for N-1 binaries

Why
The checkpoint payload now supports a checksum-only shape that omits chunk-info, but it is still written to the same checkpoint file name. Older binaries that only understand chunk checkpoints can read this file and hit a nil chunk path during resume, making downgrade/resume unsafe.

Scope
sync_diff_inspector/checkpoints/checkpoints.go:188, sync_diff_inspector/checkpoints/checkpoints.go:295, sync_diff_inspector/diff/global_checksum_checkpoint.go:30, sync_diff_inspector/diff/diff.go:204

Risk if unchanged
Rolling back from this version after an interrupted checksum-only run can fail at startup or require manual checkpoint deletion, which breaks expected rollback safety and complicates incident recovery.

Evidence
SavedState now has optional chunk-info and optional checksum-info, and SaveChecksumState persists only checksum-info into sync_diff_checkpoints.pb. The new chunk-mode loader explicitly added a nil-node branch (node == nil) for this mode-switch case, which indicates resume files may legitimately lack chunk state.

Change request
is this backward compatible with N-1 binaries? Please version the checkpoint format (or use a separate checksum checkpoint file) and add an explicit downgrade guard with an actionable error, plus a downgrade-resume compatibility test.
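One possible shape for the requested guard, assuming a version field is added to the checkpoint envelope. This is a sketch of the reviewer's suggestion, not the actual fix; savedStateHeader, currentFormatVersion, and validateCheckpoint are all hypothetical names.

```go
package main

import "fmt"

// savedStateHeader sketches a versioned checkpoint envelope
// (hypothetical; the real SavedState has no version field yet).
type savedStateHeader struct {
	FormatVersion int  // bumped whenever the payload shape changes
	HasChunkInfo  bool // false for checksum-only checkpoints
}

const currentFormatVersion = 2 // assumed: v2 introduces checksum-only state

// validateCheckpoint refuses to resume from a checkpoint this binary
// cannot interpret, returning an actionable error instead of risking
// a nil chunk dereference during resume.
func validateCheckpoint(h savedStateHeader) error {
	if h.FormatVersion > currentFormatVersion {
		return fmt.Errorf("checkpoint format v%d is newer than supported v%d; "+
			"delete the checkpoint file or use a matching sync-diff-inspector binary",
			h.FormatVersion, currentFormatVersion)
	}
	if !h.HasChunkInfo && h.FormatVersion < 2 {
		return fmt.Errorf("checkpoint lacks chunk info but predates the checksum-only format")
	}
	return nil
}

func main() {
	fmt.Println(validateCheckpoint(savedStateHeader{FormatVersion: 3}))
}
```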

Contributor Author


Positive, added export-fix-sql into the checksum hash in bf01a44.

return nil
})

if err := eg.Wait(); err != nil {
Contributor


⚠️ [Major] Recovered global-checksum retries keep stale error state and can report false failure

Why
The table-level error path persists MeetError, DataEqual=false, and UpCount/DownCount += -1 before checkpoint flush. If a restart resumes the same table and the retry succeeds, the later success path never clears these earlier fields, so the report can stay in Error/failed state despite a successful recheck.

Scope
sync_diff_inspector/diff/global_checksum.go:178, sync_diff_inspector/diff/global_checksum.go:181, sync_diff_inspector/report/report.go:331, sync_diff_inspector/report/report.go:357

Risk if unchanged
A transient checksum/read error can become a sticky false negative after resume: final output may continue to report table failure/error and incorrect row counts even after data compares cleanly on retry.

Evidence
In the global checksum error branch, code calls SetTableMeetError(...) and SetTableDataCheckResult(..., false, 0, 0, -1, -1, ...) before flushing checkpoint state. On later success, SetTableDataCheckResult only mutates DataEqual when equal == false, and it always accumulates counts (+=), while SetTableMeetError has no corresponding clear path.

Change request
add a case for retry-after-error resume: please add logic so a successful recheck can clear stale table error/failure markers (or avoid persisting failure markers until the table is definitively unrecoverable).

Contributor Author


Positive, the error path shouldn't write DataEqual = false to the checkpoint; fixed in 23d8b0c.

}
originTable.Fields = fields

return splitter.NewRandomIteratorWithCheckpoint(
Contributor


⚠️ [Major] Checksum-only mode forces random splitting that precomputes all chunks, creating O(chunk_count) startup memory pressure

Why
The new TiDB checksum-only iterator always chooses the random splitter, and that splitter materializes the full chunk slice before processing starts. For very large tables or aggressive chunk sizing, this creates high startup latency and memory amplification before any checksum work is drained.

Scope
sync_diff_inspector/source/tidb.go:130

Risk if unchanged
High-cardinality tables can suffer long pre-processing stalls and elevated memory usage, increasing p99 runtime and raising failure risk in constrained environments.

Evidence
Checksum-only iterator unconditionally calls NewRandomIteratorWithCheckpoint (sync_diff_inspector/source/tidb.go:130). The random iterator computes all chunks eagerly via splitRangeByRandom and stores them in chunks (sync_diff_inspector/splitter/random.go:177, sync_diff_inspector/splitter/random.go:192).

Change request
seems no need to force random splitting for every checksum-only table; can we prefer a lazy/bucket path when eligible, or otherwise chunk-generate incrementally to keep memory bounded?

Contributor Author


Known issue, acceptable for now.

return upstreamIsTiDB && downstreamIsTiDB
}

func (df *Diff) equalByGlobalChecksum(ctx context.Context) error {
Contributor


🟡 [Minor] equalByGlobalChecksum combines orchestration, per-table policy, concurrency wiring, and checkpoint/report side effects in one block

Why
The function currently owns table-loop orchestration, missing-table handling, per-source checksum worker lifecycle, periodic checkpoint flushing, and report/progress mutation. This packs multiple responsibilities into one control-flow surface, which raises maintenance cost and makes behavior-specific tests harder to target.

Scope
sync_diff_inspector/diff/global_checksum.go:58

Risk if unchanged
Future changes to one path (for example checkpoint semantics or mismatch handling) are more likely to regress another path because they are tightly interleaved inside one large function.

Evidence
equalByGlobalChecksum spans the full workflow from table dispatch to completion (sync_diff_inspector/diff/global_checksum.go:58 through sync_diff_inspector/diff/global_checksum.go:209), with distinct branches for ignore-data-check, lack-table handling, concurrent upstream/downstream checksum execution, and final report/checkpoint updates.

Change request
Can we extract the per-table flows into focused helpers (for example ignore/lack-table fast paths, concurrent checksum run, finalize-and-persist) and leave equalByGlobalChecksum as a thin orchestrator?

Contributor Author


Why I would not overreact:

  • the code is not conceptually huge
  • excessive extraction can make control flow jumpy if helpers are too small
  • after the recent cleanup, the concurrency block is already better than before

ctx, dbutil.TableName(table.Schema, table.Table), &originTable, s.dbConn, startRange)
}

func getChecksumSplitFields(tableInfo *model.TableInfo) (string, error) {
Contributor


🟡 [Minor] getChecksumSplitFields has hidden schema mutation behind a getter-style API

Why
The helper name suggests a pure field-selection operation, but in the default branch it also mutates tableInfo.Columns by appending _tidb_rowid. This mixes query and mutation responsibilities and makes call-site expectations less explicit.

Scope
sync_diff_inspector/source/tidb.go:134

Risk if unchanged
A future caller may reuse the same tableInfo across calls and unintentionally accumulate pseudo-columns or rely on behavior that is hard to infer from the function name.

Evidence
In sync_diff_inspector/source/tidb.go:153, getChecksumSplitFields appends a synthetic _tidb_rowid column before returning the field name string, so the helper both mutates schema state and computes a split key.

Change request
Can we separate responsibilities by either renaming this to a mutation-explicit helper (or splitting into prepare... plus get...) so callers can see the side effect at the API boundary?

Contributor Author

@joechenrh joechenrh Mar 12, 2026


Renamed to not use a getter style API name, 2346ddf


func (m *mockChecksumSource) Close() {}

// GetChecksumOnlyIterator adapts the mock to checksumOnlyIteratorSource.
Contributor


🧹 [Nit] Mock-adapter comment references a non-existent interface name

Why
The new comment names checksumOnlyIteratorSource, but the actual interface introduced in this change is checksumOnlySource.

Scope
sync_diff_inspector/diff/global_checksum_test.go:131

Risk if unchanged
Readers may search for a non-existent type and misread which contract the mock is intended to satisfy.

Evidence
GetChecksumOnlyIterator comment says it adapts to checksumOnlyIteratorSource, while sync_diff_inspector/diff/global_checksum.go:41 defines checksumOnlySource as the real interface.

Change request
Prefer checksumOnlySource; current name is confusing. Update the comment so it matches the actual interface name.

Contributor Author


Renamed in aaaf85e

taskWg.Wait()
close(resultCh)
return nil
})
Contributor


ditto on the WaitGroup + ErrGroup usage. This time we have

  1. 1 producer, writing to taskCh and closing it at the end
  2. N checksum workers, consuming taskCh and writing to resultCh
  3. 1 "closer", which waits until all N checksum workers are done and then closes resultCh
  4. 1 consumer (main thread), which reads from resultCh and generates the total.

I'd prefer

  1. Keep the producer and workers in the eg
  2. Move the consumer to a native goroutine, listening on <-egCtx.Done() instead of close(resultCh) (use a for/select loop instead of a for range loop). (Perhaps use a channel to retrieve the total to avoid shared state.)
  3. Get rid of the "closer"

Contributor Author


Fixed

Comment on lines +152 to +158
default:
tableInfo.Columns = append(tableInfo.Columns, &model.ColumnInfo{
Name: ast.NewCIStr("_tidb_rowid"),
Offset: len(tableInfo.Columns),
FieldType: *types.NewFieldType(mysql.TypeLonglong),
})
return "_tidb_rowid", nil
Contributor


[P2] Preserve chunk splitting for heap tables in checksum-only mode

When a TiDB table has neither a clustered primary key nor any secondary indexes, this new fallback picks _tidb_rowid as the split field, but it never adds a corresponding usable index to table.Info. NewRandomIteratorWithCheckpoint still treats len(table.Info.Indices) == 0 as a no-index table and collapses the work to a single chunk, so checksum-only mode degenerates into one full-table checksum query on exactly the large heap tables it is meant to speed up. This is a meaningful performance regression for non-indexed tables.

[...]

  • _tidb_rowid does get selected as the split field in the new checksum-only path.
  • But for no-index tables, that does not currently buy automatic multi-chunk splitting unless the user also sets chunk-size, because the old random-splitter sizing rule is unchanged.

So I would retract the “regression” wording. At most, this is an implementation gap relative to the new comment in tidb.go (line 160) saying checksum-only mode “forces to use _tidb_rowid or clustered PK to split chunks.”

Note:

  1. "heap table" here means "a non-clustered table with no PK, no UK"
  2. the relevant code is this piece:

if len(table.Info.Indices) != 0 {
chunkSize = utils.CalculateChunkSize(cnt)
} else {
// no index
// will use table scan
// so we use one chunk
// plus 1 to avoid chunkSize is 0
// while chunkCnt = (2cnt)/(cnt+1) <= 1
chunkSize = cnt + 1
}

  3. I'm not sure if we want to care about comparing no-index tables efficiently. So feel free to ignore this.

Contributor Author


Good catch, added a synthetic index in dc31f9c

Contributor

@D3Hunter D3Hunter left a comment


AI-generated review based on @D3Hunter's standards; manual review will be performed after all comments are resolved.

Summary

  • Total findings: 13
  • Inline comments: 13
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

🚨 [Blocker] (2)

  1. Checksum-mode checkpoint uses a shared unversioned file contract (sync_diff_inspector/checkpoints/checkpoints.go:188)
  2. Out-of-range checksum checkpoint can skip all table checks without error (sync_diff_inspector/diff/global_checksum.go:61)

⚠️ [Major] (4)

  1. Global checksum table loop carries multiple responsibilities and high control-flow complexity (sync_diff_inspector/diff/global_checksum.go:58)
  2. Mode-switch recovery silently falls back instead of failing fast (sync_diff_inspector/diff/diff.go:204)
  3. Ordered checkpoint drain can accumulate unbounded pending results under latency skew (sync_diff_inspector/diff/global_checksum.go:311)
  4. Resumed global-checksum runs cannot clear prior table failure state (sync_diff_inspector/diff/global_checksum.go:178)

🟡 [Minor] (6)

  1. Checkpoint compatibility and restart-failure paths are not covered by deterministic tests (sync_diff_inspector/diff/global_checksum_test.go:149)
  2. Global-checksum error branch does not finalize per-table progress state (sync_diff_inspector/diff/global_checksum.go:178)
  3. Missing-but-empty tables can be reported as equal in data-only runs (sync_diff_inspector/diff/global_checksum.go:89)
  4. Mode selection is coupled to concrete source types instead of capability (sync_diff_inspector/diff/global_checksum.go:49)
  5. SavedState comment is stale after adding checksum-mode payload (sync_diff_inspector/checkpoints/checkpoints.go:186)
  6. Unscoped Done field obscures checkpoint completion semantics (sync_diff_inspector/checkpoints/checkpoints.go:158)

🧹 [Nit] (1)

  1. Test comment references a non-existent interface name (sync_diff_inspector/diff/global_checksum_test.go:131)

ChunkCnt: 1,
}

if !common.AllTableExist(tableDiff.TableLack) {
Contributor


🟡 [Minor] Missing-but-empty tables can be reported as equal in data-only runs

Why
For TableLack tables, global-checksum mode defines equality as upCount == downCount; when both are zero, it reports equal even though one side lacks the table object.

Scope
sync_diff_inspector/diff/global_checksum.go:89

Risk if unchanged
With check-data-only plus skip-non-existing-table, an environment can report pass for a missing table if it is empty, masking topology/schema drift in data-only verification workflows.

Evidence
In global-checksum TableLack handling, isEqual := upCount == downCount (sync_diff_inspector/diff/global_checksum.go:98). When check-data-only is enabled, struct check is skipped (sync_diff_inspector/main.go:115), so there is no separate structural failure marker. The legacy chunk path always marks lack-table data as unequal (sync_diff_inspector/diff/diff.go:466).

Change request
Please preserve the existing missing-table invariant in global-checksum mode: for !AllTableExist, always mark the data unequal (or explicitly persist TableLack into the report state when the struct check is skipped), and add a test case for a missing-but-empty table under check-data-only.

Contributor Author

Updated in 0b85aeb

}
}

// SavedState contains the information of the latest checked chunk and state of `report`
Contributor

🟡 [Minor] SavedState comment is stale after adding checksum-mode payload

Why
The struct-level comment still describes only chunk/report checkpoint content, but the struct now also carries checksum-mode state.

Scope
sync_diff_inspector/checkpoints/checkpoints.go:186

Risk if unchanged
Maintainers can misinterpret mode compatibility and miss that checkpoint files may legitimately contain checksum-info without chunk-info.

Evidence
SavedState includes Checksum *ChecksumState at sync_diff_inspector/checkpoints/checkpoints.go:190, while the comment at sync_diff_inspector/checkpoints/checkpoints.go:186 still states it contains the latest checked chunk and report state.

Change request
Please add a short why-comment for this struct describing both checkpoint modes and that chunk/checksum payloads are mode-dependent.

Contributor Author
Fixed

LastRange *chunk.Range `json:"last-range,omitempty"`
Checksum uint64 `json:"checksum"`
Count int64 `json:"count"`
Done bool `json:"done"`
Contributor

🟡 [Minor] Unscoped Done field obscures checkpoint completion semantics

Why
Done is persisted in checkpoint state but does not encode completion scope (per-source/per-table vs full run), while resume control flow depends on that exact meaning.

Scope
sync_diff_inspector/checkpoints/checkpoints.go:158

Risk if unchanged
Future resume logic changes can misread this flag and accidentally skip or reprocess tables when checkpoint behavior evolves.

Evidence
ChecksumSourceState defines Done bool at sync_diff_inspector/checkpoints/checkpoints.go:158, and resume advancement checks both source flags in sync_diff_inspector/diff/global_checksum.go:63 and sync_diff_inspector/diff/global_checksum_checkpoint.go:43.

Change request
Prefer a scoped name (for example TableDone) or an explicit semantics comment near the field; the current name is ambiguous for a persisted contract.

Contributor Author
Done is already scoped by its type and container (ChecksumSourceState), so its actual meaning is already “this source finished the checksum for the current TableIndex”. At call sites it does not read as “full run done”.

@joechenrh joechenrh changed the title sync-diff-inspector: calculate checksum only with export-fix-sql = false all: stabilize sync_diff_inspector checkpoint test Mar 13, 2026
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-triage-completed and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 13, 2026
@joechenrh joechenrh changed the title all: stabilize sync_diff_inspector checkpoint test sync-diff-inspector: calculate checksum only with export-fix-sql = false Mar 13, 2026
@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Mar 13, 2026
@joechenrh
Contributor Author

/retest

@joechenrh
Contributor Author

/retest


@ingress-bot ingress-bot left a comment


This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.

Summary

  • Total findings: 2
  • Inline comments: 2
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

⚠️ [Major] (2)

  1. Resume state can skip a table that previously hit checksum execution error (checksum-only resume flow in equalByGlobalChecksum)
  2. Global checksum mode hard-fails ignored-handle tables instead of falling back (checksum-only iterator setup for TiDB sources with ignored handle columns)

@kennytm
Contributor

kennytm commented Mar 17, 2026

oh a new one? how many review bots have we got 😅

@joechenrh
Contributor Author

oh a new one? how many review bots have we got 😅

The same one as #12531 (review), but with a dedicated account 🤓

@joechenrh
Contributor Author

/retest

if checksumState != nil {
	df.checksumCheckpoint = checksumState
	if reportInfo != nil {
		df.report.LoadReport(reportInfo)


⚠️ [Major] Checkpoint resume can drop already-confirmed mismatches and later report success

Impact
If an earlier table was already confirmed unequal before the process stopped, a resumed run can still finish the remaining tables and print the top-level success branch, which hides the earlier mismatch behind a false overall pass.

Scope
checksum checkpoint restore via initChecksumCheckpoint and report.LoadReport

Evidence
This restore path reloads reportInfo, but report.LoadReport() only copies TableResults/duration/size and never rebuilds Report.Result. Report.Result itself is non-serialized (json:"-"), so a pre-restart mismatch can disappear from the final top-level outcome after resume. This bug also exists in the old chunk-checkpoint flow, because that path reuses the same report.LoadReport() behavior.

Change request
Recompute Report.Result from the restored table states after loading checkpointed TableResults, while still treating transient MeetError as non-durable.

Contributor Author

Done in joechenrh@667d2e2.

Good catch. report.LoadReport() now recomputes the aggregate Report.Result from restored TableResults instead of relying on the non-persisted Result field.

That keeps already-confirmed mismatches visible after resume while still treating transient MeetError as non-durable.

This was not introduced by the checksum-only path though. The old chunk-checkpoint flow used the same report.LoadReport() behavior, so the same loss of top-level fail state already existed there too.

progress.FailTable(progressID)
progress.UpdateTotal(progressID, 0, true)
progress.Inc(progressID)
df.report.SetTableMeetError(schema, table, err)


🟡 [Medium] Retry-success path never clears stale table error from report state

Impact
A retryable checksum error can remain attached to the table in MeetError, so a later successful retry still leaves an expired error in the report and future real failures can be rendered together with stale diagnostics.

Scope
checksum execution error handling in equalByGlobalChecksum plus shared report state

Evidence
This branch records SetTableMeetError(...), but the later success path only updates equality/count state and never clears MeetError. report.Print() later emits every remaining table error. This is also not unique to the new checksum-only flow: the old chunk path uses the same shared report state and has the same stale-error behavior after retry-success.

Change request
Clear per-table MeetError when the same table is successfully finalized, or otherwise reset transient table error state before success output is emitted.

Contributor Author

Done in joechenrh@667d2e2.

I added ClearTableMeetError and call it when the same table is successfully finalized, so a retry-success no longer leaves an expired transient error in the final report.

This is also an old shared-report issue rather than something unique to the checksum-only flow. The previous chunk path used the same MeetError state model, so stale retry-success errors could accumulate there as well.

joechenrh and others added 22 commits March 18, 2026 04:32
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
…failure

Add failpoints to simulate checksum failure after partial chunk progress:
- checksum-skip-chunk: skips seq 2 in drainOrderedChecksumTasks, so only
  chunks 0-1 are checkpointed before the gap halts further processing.
- checksum-error-source: errors one source after all results are collected.

Also fix the result goroutine to use `for range resultCh` with explicit
close after eg.Wait(), replacing the racy single select that could miss
buffered results when egCtx was cancelled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace export/unset pattern with inline GO_FAILPOINTS=... prefix
so the env var is scoped to just the command invocation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Remove ExportFixSQL from TaskConfig struct and pass it as a parameter
to Init() and ComputeConfigHash() instead, avoiding field duplication
and unintended JSON serialization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gcap#5)

1. Restore nil-guard on LoadChunk result to prevent panic when
   checkpoint file contains no chunk info.
2. Extract ChecksumCapableSource sub-interface from Source to avoid
   widening the core interface with checksum-only methods. Remove
   dead PreferGlobalChecksum/GetChecksumOnlyIterator from MySQLSources.
3. Rename PreferGlobalChecksum to type-assertion based capability check
   via ChecksumCapableSource, removing the misleading "Prefer" naming.
4. Add log.Warn before break in equalByGlobalChecksum error path so
   operators can see why remaining tables were skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The TestComputeConfigHashIgnoresTLSName test from master called
ComputeConfigHash() without the exportFixSQL parameter added in
this branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move exportFixSQL from a function parameter to a field on TaskConfig
tagged with json:"-" to prevent serialization. This simplifies the
ComputeConfigHash() and Init() signatures so callers that don't care
about export-fix-sql (e.g. TLS tests) no longer need to pass it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment on lines +64 to +67
cfg := NewConfig()
require.NoError(t, os.RemoveAll("/tmp/output/config"))
require.NoError(t, cfg.Parse([]string{"--config", "config.toml"}))
require.NoError(t, cfg.Init())
Contributor

wait, wut

please override the config's output-dir to t.TempDir() or a subdirectory of it instead of using the hard-coded "/tmp/output/config", which may be inaccessible or may already contain important resources on the test machine.

Contributor Author

Oops, shouldn't have hard-coded that 🫠 Switched to t.TempDir() and dropped the manual os.RemoveAll.

6a9564b

@codecov

codecov bot commented Mar 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.3233%. Comparing base (e0c6d64) to head (a698a49).

❌ Your project check has failed because the head coverage (57.3233%) is below the target coverage (60.0000%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
| Components | Coverage Δ |
|---|---|
| cdc | 57.3233% <ø> (∅) |
| dm | ∅ <ø> (∅) |
| engine | ∅ <ø> (∅) |

| Flag | Coverage Δ |
|---|---|
| cdc | 57.3233% <ø> (?) |
| unit | 57.3233% <ø> (+8.2153%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

@@               Coverage Diff                @@
##             master     #12531        +/-   ##
================================================
+ Coverage   49.1079%   57.3233%   +8.2153%     
================================================
  Files           273        524       +251     
  Lines         52967      69183     +16216     
================================================
+ Hits          26011      39658     +13647     
- Misses        24763      26670      +1907     
- Partials       2193       2855       +662     

…test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment on lines 25 to 27
cfg := NewConfig()
require.NoError(t, os.RemoveAll("/tmp/output/config"))
require.Nil(t, cfg.Parse([]string{"-L", "info", "--config", "config.toml"}))
Contributor

ditto

Contributor Author

Done in db7859b

Removed output-dir from both config.toml and config_sharding.toml entirely, so Parse no longer overwrites it. The test sets cfg.Task.OutputDir = tmpDir once before the first Parse and it sticks across all subsequent calls 👍

@joechenrh
Contributor Author

/retest

@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 19, 2026

@joechenrh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cdc-integration-kafka-test | 6a9564b | link | true | /test pull-cdc-integration-kafka-test |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants