
sync-diff-inspector: calculate checksum only with export-fix-sql = false #12531

Open
joechenrh wants to merge 38 commits into pingcap:master from joechenrh:checksum-only

Conversation

@joechenrh
Contributor

@joechenrh joechenrh commented Mar 5, 2026

What problem does this PR solve?

Issue Number: ref #12492

In scenarios where export-fix-sql = false (for example, B/G upgrade verification), we only need final table-level equality (count + checksum), not row-level fix SQL generation. This PR introduces a checksum-only path for TiDB-to-TiDB data checks while keeping the existing chunk-mode checkpoint behavior intact.

Basic rationale:

  • The current chunk-wise comparison for non-clustered PK tables often requires index lookups, which are expensive.
  • For configs with export-fix-sql = false, we use _tidb_rowid as the split field to get a scan-friendly split path.
  • Since rowid may differ between the two sides even when the logical table records are the same, we cannot use the regular one-side-range-for-both-sides model, which could worsen chunk-size imbalance on the other side.
  • The branch also includes a follow-up stabilization of the checkpoint integration test, because resumed bucket/random checkpoint runs can select different valid resumed chunks in CI, which made the old assertion flaky.

What is changed and how it works?

  • Add global checksum mode gating:
    • use it only when both sources are TiDB and export-fix-sql is disabled.
  • Add a checksum-only iterator path for TiDB:
    • clustered table: split by PK columns.
    • non-clustered table: fall back to _tidb_rowid splitting.
  • In checksum-only mode, each side builds its own iterator and scans independently; the final table-level aggregates (count + checksum) are then compared.
  • Keep the checksum-only flow separate from the regular chunk diff flow to reduce coupling and risk.
  • Add a checksum checkpoint contract to support resume:
    • persist the table index, per-source accumulated count/checksum, the last range, and a done flag.
  • Keep checksum execution parallel with deterministic, ordered checkpoint accumulation.
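The gating in the first bullet can be sketched as follows. This is a minimal stand-in, not the actual implementation: the real shouldUseGlobalChecksum in sync_diff_inspector/diff/global_checksum.go inspects the concrete source types and config, and the Config struct here is a hypothetical simplification.

```go
package main

import "fmt"

// Config mirrors only the one field relevant to mode selection
// (hypothetical simplification of the real sync-diff-inspector config).
type Config struct {
	ExportFixSQL bool
}

// shouldUseGlobalChecksum sketches the gating described above: the
// checksum-only path is taken only when both sources are TiDB and
// fix-SQL export is disabled.
func shouldUseGlobalChecksum(cfg *Config, upstreamIsTiDB, downstreamIsTiDB bool) bool {
	if cfg.ExportFixSQL {
		// Row-level fix SQL requires the regular chunk-diff path.
		return false
	}
	return upstreamIsTiDB && downstreamIsTiDB
}

func main() {
	fmt.Println(shouldUseGlobalChecksum(&Config{ExportFixSQL: false}, true, true))  // true
	fmt.Println(shouldUseGlobalChecksum(&Config{ExportFixSQL: true}, true, true))   // false
	fmt.Println(shouldUseGlobalChecksum(&Config{ExportFixSQL: false}, true, false)) // false
}
```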

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (details below)

Manual benchmark (same cluster used as both upstream and downstream to simplify setup):

  • Cluster: 1 TiDB + 3 TiKV
  • Data size: 262.1 GiB (1,000,000,000 rows)
PK layout in table KV                  master    this PR
ordered                                30 min    25-30 min
partially ordered (5% out-of-order)    49 min    25-30 min
random                                 ~5 h      25-30 min

Note:

  1. “order” refers to the physical order of the PK columns in table KV.
  2. I only tested once for this PR; since the speed should be similar across scenarios, I give a rough range instead of exact numbers.

Questions

Will it cause performance regression or break compatibility?

Checkpoints from an older version can't be used with the new version of sync-diff-inspector.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

sync-diff-inspector: optimize TiDB-to-TiDB checksum-only verification for non-clustered PK tables by using handle/rowid-based splitting with resumable checksum checkpoints.

@ti-chi-bot ti-chi-bot bot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Mar 5, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 5, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign zhaoxinyu for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Mar 5, 2026
@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a specialized, high-performance checksum verification path for sync-diff-inspector when comparing two TiDB clusters where row-level fix SQL generation is not required. The primary goal is to drastically reduce the time and resource consumption of data consistency checks on tables with non-clustered primary keys, which previously suffered from inefficient chunking. By implementing intelligent splitting logic and resumable, parallel checksum processing, the tool can now perform these verifications much more efficiently.

Highlights

  • Optimized Checksum for TiDB-to-TiDB Verification: Introduced a new checksum-only mode for sync-diff-inspector specifically for TiDB-to-TiDB data checks when export-fix-sql is disabled. This significantly improves performance for tables with non-clustered primary keys by avoiding expensive chunk-by-chunk checksums.
  • Intelligent Chunk Splitting: Implemented optimized chunk splitting strategies for the checksum-only mode: using primary key columns for PKIsHandle tables, primary index columns for IsCommonHandle tables, and falling back to _tidb_rowid for non-clustered tables.
  • Resumable Checkpoints for Checksum-Only Mode: Added a robust checkpointing mechanism for the checksum-only flow, allowing the process to resume from the last saved state, including accumulated count, checksum, and the last processed range for both upstream and downstream sources.
  • Parallel Checksum Execution: Enabled parallel execution of checksum tasks across multiple chunks, with deterministic ordered checkpoint accumulation, further enhancing the performance of the checksum-only verification.
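The splitting strategy in the second highlight can be sketched roughly as below. The tableMeta type is a hypothetical stand-in for the relevant bits of TiDB's model.TableInfo; the real getChecksumSplitFields lives in sync_diff_inspector/source/tidb.go.

```go
package main

import "fmt"

// tableMeta is a simplified stand-in for TiDB's model.TableInfo,
// keeping only the fields this sketch needs (hypothetical).
type tableMeta struct {
	PKIsHandle     bool     // clustered single-integer PK
	IsCommonHandle bool     // clustered multi-column PK
	PKColumns      []string // columns of the primary key, if any
}

// checksumSplitFields sketches the selection logic described above:
// clustered tables split on their handle (PK) columns, everything
// else falls back to the hidden _tidb_rowid column.
func checksumSplitFields(t *tableMeta) []string {
	switch {
	case t.PKIsHandle, t.IsCommonHandle:
		return t.PKColumns
	default:
		return []string{"_tidb_rowid"}
	}
}

func main() {
	fmt.Println(checksumSplitFields(&tableMeta{PKIsHandle: true, PKColumns: []string{"id"}})) // [id]
	fmt.Println(checksumSplitFields(&tableMeta{}))                                           // [_tidb_rowid]
}
```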


Changelog
  • sync_diff_inspector/checkpoints/checkpoints.go
    • Added new data structures ChecksumSourceState and ChecksumState to store checksum-only checkpoint progress.
    • Updated SavedState to include a Checksum field for persisting checksum-only checkpoints.
    • Implemented NewChecksumState and an init method for ChecksumState to ensure proper initialization.
    • Added SaveChecksumState and LoadChecksumState functions to handle the persistence and retrieval of checksum-only checkpoint data.
  • sync_diff_inspector/checkpoints/checkpoints_test.go
    • Added TestSaveLoadChecksumState to verify the correct saving and loading of the new checksum-only checkpoint state.
  • sync_diff_inspector/diff/diff.go
    • Added checksumCheckpoint and checksumCheckpointMu fields to manage the state of checksum-only operations.
    • Modified initCheckpoint to conditionally initialize the checksum checkpoint if the global checksum mode is active.
    • Refactored the Equal method to dispatch to either equalByGlobalChecksum or equalByChunkChecksum based on configuration.
  • sync_diff_inspector/diff/global_checksum.go
    • Added a new file to encapsulate the core logic for the global checksum feature.
    • Implemented shouldUseGlobalChecksum to determine when the optimized checksum path should be used.
    • Defined equalByGlobalChecksum to orchestrate the checksum-only verification process, including checkpoint handling and parallel task execution.
    • Introduced produceChecksumTasks, runChecksumWorker, and drainOrderedChecksumTasks to manage the concurrent processing and ordered accumulation of checksum results.
    • Added getSourceGlobalChecksum to fetch checksums from a source, supporting resumption from checkpoints.
    • Implemented buildChecksumOnlyIterator to create specialized iterators for checksum-only mode.
  • sync_diff_inspector/diff/global_checksum_checkpoint.go
    • Added a new file to manage checkpointing specifically for the global checksum mode.
    • Implemented initChecksumCheckpoint to load existing checksum checkpoints on startup.
    • Implemented flushChecksumCheckpoint to periodically save the current checksum progress to disk.
  • sync_diff_inspector/diff/global_checksum_test.go
    • Added a new file containing unit tests for the global checksum logic.
    • Included mock implementations (mockChecksumIterator, mockChecksumSource) to facilitate testing.
    • Provided tests like TestGetSourceGlobalChecksumKeepsCheckpointOrder and TestGetSourceGlobalChecksumResumeFromLastRange to validate the parallel execution and checkpoint resumption.
  • sync_diff_inspector/source/tidb.go
    • Added GetChecksumOnlyIterator to provide a specialized chunk iterator for TiDB sources in checksum-only mode.
    • Implemented getChecksumSplitFields to dynamically determine the optimal splitting fields (PK, common handle, or _tidb_rowid) based on the table's structure.
  • sync_diff_inspector/source/tidb_checksum_test.go
    • Added a new file with unit tests for the getChecksumSplitFields function, covering various table primary key configurations.
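For orientation, the checkpoint contract described in the changelog can be sketched as below. Field names and types are simplified guesses based on the description (table index, per-source count/checksum, last range, done flag); the real ChecksumState and ChecksumSourceState live in sync_diff_inspector/checkpoints/checkpoints.go.

```go
package main

import "fmt"

// ChecksumSourceState is a sketch of the per-source progress that the
// checksum-only checkpoint persists (simplified field types).
type ChecksumSourceState struct {
	Count     int64  // rows accumulated so far on this source
	Checksum  uint64 // running aggregate checksum
	LastRange string // last fully processed range, serialized
}

// ChecksumState is a sketch of the table-level checkpoint record.
type ChecksumState struct {
	TableIndex int // which table the run is currently on
	Upstream   ChecksumSourceState
	Downstream ChecksumSourceState
	Done       bool // whether the current table finished
}

func main() {
	st := ChecksumState{TableIndex: 2}
	st.Upstream = ChecksumSourceState{Count: 1000, Checksum: 42, LastRange: "(id > 500)"}
	fmt.Printf("%+v\n", st)
}
```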
Activity
  • The author performed manual tests and benchmarks on a cluster with 1 TiDB and 3 TiKV nodes, using 262.1 GiB of data (1,000,000,000 rows).
  • Benchmarks showed significant performance improvements for tables with partially ordered (from 49 min to 25 min) and random (from ~5 h to 25 min) physical PK order in KV, matching the performance of ordered PK tables.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant optimization for checksum-only verification in sync-diff-inspector by adding a new execution path for TiDB-to-TiDB comparisons when export-fix-sql is disabled. The changes are well-structured, separating the new logic into new files and functions, which is good for maintainability. The implementation correctly handles parallel execution, checkpointing, and resumption. I've found one high-severity issue regarding a goroutine leak that should be addressed.

@joechenrh
Contributor Author

/retest


Contributor

@D3Hunter D3Hunter left a comment


AI-generated review based on @D3Hunter's standards; manual review will be performed after all comments are resolved.

Summary

  • Total findings: 6
  • Inline comments: 6
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

🚨 [Blocker] (2)

  1. Global-checksum checkpoint is backward-incompatible with chunk-mode restore (sync_diff_inspector/diff/diff.go:192)
  2. Checkpoint flush can persist done-state before table result update (sync_diff_inspector/diff/global_checksum.go:131)

⚠️ [Major] (3)

  1. Checksum split-field logic can nil-deref when ignored columns remove PK metadata (sync_diff_inspector/source/tidb.go:129)
  2. Out-of-order checksum result buffering is unbounded (sync_diff_inspector/diff/global_checksum.go:324)
  3. Cancellation is handled as per-table error and loop continues (sync_diff_inspector/diff/global_checksum.go:178)

🟡 [Minor] (1)

  1. Checkpoint ordering test depends on wall-clock timing (sync_diff_inspector/diff/global_checksum_test.go:80)

Contributor

@D3Hunter D3Hunter left a comment


AI-generated review based on @D3Hunter's standards; manual review will be performed after all comments are resolved.

Summary

  • Total findings: 1
  • Inline comments: 1
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

⚠️ [Major] (1)

  1. Checkpoint mode mismatch can panic on resume (sync_diff_inspector/diff/diff.go:206)

@joechenrh
Contributor Author

/retest

Contributor

@D3Hunter D3Hunter left a comment


AI-generated review based on @D3Hunter's standards; manual review will be performed after all comments are resolved.

Summary

  • Total findings: 9
  • Inline comments: 9
  • Summary-only findings (no inline anchor): 0

Findings (highest risk first)

🚨 [Blocker] (2)

  1. Out-of-order checksum buffering is unbounded and can exhaust memory on skewed chunk latency (sync_diff_inspector/diff/global_checksum.go:331)
  2. Checkpoint schema change is not rollback-safe for N-1 binaries (sync_diff_inspector/checkpoints/checkpoints.go:188, sync_diff_inspector/checkpoints/checkpoints.go:295, sync_diff_inspector/diff/global_checksum_checkpoint.go:30, sync_diff_inspector/diff/diff.go:204)

⚠️ [Major] (3)

  1. Recovered global-checksum retries keep stale error state and can report false failure (sync_diff_inspector/diff/global_checksum.go:178, sync_diff_inspector/diff/global_checksum.go:181, sync_diff_inspector/report/report.go:331, sync_diff_inspector/report/report.go:357)
  2. Chunk-mode restart after checksum-mode checkpoint skips fix-SQL cleanup and can crash on duplicate filenames (sync_diff_inspector/diff/diff.go:204, sync_diff_inspector/diff/diff.go:217, sync_diff_inspector/diff/diff.go:814)
  3. Checksum-only mode forces random splitting that precomputes all chunks, creating O(chunk_count) startup memory pressure (sync_diff_inspector/source/tidb.go:130)

🟡 [Minor] (3)

  1. equalByGlobalChecksum combines orchestration, per-table policy, concurrency wiring, and checkpoint/report side effects in one block (sync_diff_inspector/diff/global_checksum.go:58)
  2. getChecksumSplitFields has hidden schema mutation behind a getter-style API (sync_diff_inspector/source/tidb.go:134)
  3. Upgrade from legacy chunk checkpoint to new checksum mode silently drops resume state (sync_diff_inspector/diff/diff.go:192, sync_diff_inspector/diff/global_checksum_checkpoint.go:32)

🧹 [Nit] (1)

  1. Mock-adapter comment references a non-existent interface name (sync_diff_inspector/diff/global_checksum_test.go:131)


// SavedState contains the information of the latest checked chunk and state of `report`
// When sync-diff start from the checkpoint, it will load this information and continue running
type SavedState struct {
Contributor


🚨 [Blocker] Checkpoint schema change is not rollback-safe for N-1 binaries

Why
The checkpoint payload now supports a checksum-only shape that omits chunk-info, but it is still written to the same checkpoint file name. Older binaries that only understand chunk checkpoints can read this file and hit a nil chunk path during resume, making downgrade/resume unsafe.

Scope
sync_diff_inspector/checkpoints/checkpoints.go:188, sync_diff_inspector/checkpoints/checkpoints.go:295, sync_diff_inspector/diff/global_checksum_checkpoint.go:30, sync_diff_inspector/diff/diff.go:204

Risk if unchanged
Rolling back from this version after an interrupted checksum-only run can fail at startup or require manual checkpoint deletion, which breaks expected rollback safety and complicates incident recovery.

Evidence
SavedState now has optional chunk-info and optional checksum-info, and SaveChecksumState persists only checksum-info into sync_diff_checkpoints.pb. The new chunk-mode loader explicitly added a nil-node branch (node == nil) for this mode-switch case, which indicates resume files may legitimately lack chunk state.

Change request
is this backward compatible with N-1 binaries? Please version the checkpoint format (or use a separate checksum checkpoint file) and add an explicit downgrade guard with an actionable error, plus a downgrade-resume compatibility test.
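One possible shape for the requested guard, assuming a version field is added to the checkpoint envelope. This is a sketch of the reviewer's suggestion, not the actual fix; savedStateHeader, currentFormatVersion, and validateCheckpoint are all hypothetical names.

```go
package main

import "fmt"

// savedStateHeader sketches a versioned checkpoint envelope
// (hypothetical; the real SavedState has no version field yet).
type savedStateHeader struct {
	FormatVersion int  // bumped whenever the payload shape changes
	HasChunkInfo  bool // false for checksum-only checkpoints
}

const currentFormatVersion = 2 // assumed: v2 introduces checksum-only state

// validateCheckpoint refuses to resume from a checkpoint this binary
// cannot interpret, returning an actionable error instead of risking
// a nil chunk dereference during resume.
func validateCheckpoint(h savedStateHeader) error {
	if h.FormatVersion > currentFormatVersion {
		return fmt.Errorf("checkpoint format v%d is newer than supported v%d; "+
			"delete the checkpoint file or use a matching sync-diff-inspector binary",
			h.FormatVersion, currentFormatVersion)
	}
	if !h.HasChunkInfo && h.FormatVersion < 2 {
		return fmt.Errorf("checkpoint lacks chunk info but predates the checksum-only format")
	}
	return nil
}

func main() {
	fmt.Println(validateCheckpoint(savedStateHeader{FormatVersion: 3}))
}
```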

Contributor Author


Positive, added export-fix-sql into the checksum hash in bf01a44.

return nil
})

if err := eg.Wait(); err != nil {
Contributor


⚠️ [Major] Recovered global-checksum retries keep stale error state and can report false failure

Why
The table-level error path persists MeetError, DataEqual=false, and UpCount/DownCount += -1 before checkpoint flush. If a restart resumes the same table and the retry succeeds, the later success path never clears these earlier fields, so the report can stay in Error/failed state despite a successful recheck.

Scope
sync_diff_inspector/diff/global_checksum.go:178, sync_diff_inspector/diff/global_checksum.go:181, sync_diff_inspector/report/report.go:331, sync_diff_inspector/report/report.go:357

Risk if unchanged
A transient checksum/read error can become a sticky false negative after resume: final output may continue to report table failure/error and incorrect row counts even after data compares cleanly on retry.

Evidence
In the global checksum error branch, code calls SetTableMeetError(...) and SetTableDataCheckResult(..., false, 0, 0, -1, -1, ...) before flushing checkpoint state. On later success, SetTableDataCheckResult only mutates DataEqual when equal == false, and it always accumulates counts (+=), while SetTableMeetError has no corresponding clear path.

Change request
add a case for retry-after-error resume: please add logic so a successful recheck can clear stale table error/failure markers (or avoid persisting failure markers until the table is definitively unrecoverable).

Contributor Author


Positive, the error path shouldn't write DataEqual = false to the checkpoint; fixed in 23d8b0c.

}
originTable.Fields = fields

return splitter.NewRandomIteratorWithCheckpoint(
Contributor


⚠️ [Major] Checksum-only mode forces random splitting that precomputes all chunks, creating O(chunk_count) startup memory pressure

Why
The new TiDB checksum-only iterator always chooses the random splitter, and that splitter materializes the full chunk slice before processing starts. For very large tables or aggressive chunk sizing, this creates high startup latency and memory amplification before any checksum work is drained.

Scope
sync_diff_inspector/source/tidb.go:130

Risk if unchanged
High-cardinality tables can suffer long pre-processing stalls and elevated memory usage, increasing p99 runtime and raising failure risk in constrained environments.

Evidence
Checksum-only iterator unconditionally calls NewRandomIteratorWithCheckpoint (sync_diff_inspector/source/tidb.go:130). The random iterator computes all chunks eagerly via splitRangeByRandom and stores them in chunks (sync_diff_inspector/splitter/random.go:177, sync_diff_inspector/splitter/random.go:192).

Change request
seems no need to force random splitting for every checksum-only table; can we prefer a lazy/bucket path when eligible, or otherwise chunk-generate incrementally to keep memory bounded?

Contributor Author


Known issue, acceptable for now.

return upstreamIsTiDB && downstreamIsTiDB
}

func (df *Diff) equalByGlobalChecksum(ctx context.Context) error {
Contributor


🟡 [Minor] equalByGlobalChecksum combines orchestration, per-table policy, concurrency wiring, and checkpoint/report side effects in one block

Why
The function currently owns table-loop orchestration, missing-table handling, per-source checksum worker lifecycle, periodic checkpoint flushing, and report/progress mutation. This packs multiple responsibilities into one control-flow surface, which raises maintenance cost and makes behavior-specific tests harder to target.

Scope
sync_diff_inspector/diff/global_checksum.go:58

Risk if unchanged
Future changes to one path (for example checkpoint semantics or mismatch handling) are more likely to regress another path because they are tightly interleaved inside one large function.

Evidence
equalByGlobalChecksum spans the full workflow from table dispatch to completion (sync_diff_inspector/diff/global_checksum.go:58 through sync_diff_inspector/diff/global_checksum.go:209), with distinct branches for ignore-data-check, lack-table handling, concurrent upstream/downstream checksum execution, and final report/checkpoint updates.

Change request
Can we extract the per-table flows into focused helpers (for example ignore/lack-table fast paths, concurrent checksum run, finalize-and-persist) and leave equalByGlobalChecksum as a thin orchestrator?

Contributor Author


Why I would not overreact:

  • the code is not conceptually huge
  • excessive extraction can make control flow jumpy if helpers are too small
  • after the recent cleanup, the concurrency block is already better than before

ctx, dbutil.TableName(table.Schema, table.Table), &originTable, s.dbConn, startRange)
}

func getChecksumSplitFields(tableInfo *model.TableInfo) (string, error) {
Contributor


🟡 [Minor] getChecksumSplitFields has hidden schema mutation behind a getter-style API

Why
The helper name suggests a pure field-selection operation, but in the default branch it also mutates tableInfo.Columns by appending _tidb_rowid. This mixes query and mutation responsibilities and makes call-site expectations less explicit.

Scope
sync_diff_inspector/source/tidb.go:134

Risk if unchanged
A future caller may reuse the same tableInfo across calls and unintentionally accumulate pseudo-columns or rely on behavior that is hard to infer from the function name.

Evidence
In sync_diff_inspector/source/tidb.go:153, getChecksumSplitFields appends a synthetic _tidb_rowid column before returning the field name string, so the helper both mutates schema state and computes a split key.

Change request
Can we separate responsibilities by either renaming this to a mutation-explicit helper (or splitting into prepare... plus get...) so callers can see the side effect at the API boundary?

Contributor Author

@joechenrh joechenrh Mar 12, 2026


Renamed to not use a getter style API name, 2346ddf


func (m *mockChecksumSource) Close() {}

// GetChecksumOnlyIterator adapts the mock to checksumOnlyIteratorSource.
Contributor


🧹 [Nit] Mock-adapter comment references a non-existent interface name

Why
The new comment names checksumOnlyIteratorSource, but the actual interface introduced in this change is checksumOnlySource.

Scope
sync_diff_inspector/diff/global_checksum_test.go:131

Risk if unchanged
Readers may search for a non-existent type and misread which contract the mock is intended to satisfy.

Evidence
GetChecksumOnlyIterator comment says it adapts to checksumOnlyIteratorSource, while sync_diff_inspector/diff/global_checksum.go:41 defines checksumOnlySource as the real interface.

Change request
Prefer checksumOnlySource; current name is confusing. Update the comment so it matches the actual interface name.

Contributor Author


Renamed in aaaf85e

taskWg.Wait()
close(resultCh)
return nil
})
Contributor


ditto on the WaitGroup + ErrGroup usage. This time we have

  1. 1 producer, writing to taskCh and closing it at the end
  2. N checksum workers, consuming taskCh and writing to resultCh
  3. 1 "closer", which waits until all N checksum workers are done and then closes resultCh
  4. 1 consumer (main thread), which reads from resultCh and generates the total.

I'd prefer

  1. Keep the producer and workers in the eg
  2. Move the consumer to a native goroutine, listening on <-egCtx.Done() instead of close(resultCh) (use a for/select loop instead of a for range loop). (Perhaps use a channel to retrieve the total to avoid shared state.)
  3. Get rid of the "closer"

Contributor Author


Fixed

Comment on lines +152 to +158
default:
tableInfo.Columns = append(tableInfo.Columns, &model.ColumnInfo{
Name: ast.NewCIStr("_tidb_rowid"),
Offset: len(tableInfo.Columns),
FieldType: *types.NewFieldType(mysql.TypeLonglong),
})
return "_tidb_rowid", nil
Contributor


[P2] Preserve chunk splitting for heap tables in checksum-only mode

When a TiDB table has neither a clustered primary key nor any secondary indexes, this new fallback picks _tidb_rowid as the split field, but it never adds a corresponding usable index to table.Info. NewRandomIteratorWithCheckpoint still treats len(table.Info.Indices) == 0 as a no-index table and collapses the work to a single chunk, so checksum-only mode degenerates into one full-table checksum query on exactly the large heap tables it is meant to speed up. This is a meaningful performance regression for non-indexed tables.

[...]

  • _tidb_rowid does get selected as the split field in the new checksum-only path.
  • But for no-index tables, that does not currently buy automatic multi-chunk splitting unless the user also sets chunk-size, because the old random-splitter sizing rule is unchanged.

So I would retract the “regression” wording. At most, this is an implementation gap relative to the new comment in tidb.go (line 160) saying checksum-only mode “forces to use _tidb_rowid or clustered PK to split chunks.”

Note:

  1. "heap table" here means "a non-clustered table with no PK, no UK"
  2. the relevant code is this piece:

if len(table.Info.Indices) != 0 {
chunkSize = utils.CalculateChunkSize(cnt)
} else {
// no index
// will use table scan
// so we use one chunk
// plus 1 to avoid chunkSize is 0
// while chunkCnt = (2cnt)/(cnt+1) <= 1
chunkSize = cnt + 1
}

  3. I'm not sure if we want to care about comparing no-index tables efficiently. So feel free to ignore this.

Contributor Author


Good catch, added a synthetic index in dc31f9c

Contributor

@D3Hunter D3Hunter left a comment


AI-generated review based on @D3Hunter's standards; manual review will be performed after all comments are resolved.

Summary

  • Total findings: 13
  • Inline comments: 13
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

🚨 [Blocker] (2)

  1. Checksum-mode checkpoint uses a shared unversioned file contract (sync_diff_inspector/checkpoints/checkpoints.go:188)
  2. Out-of-range checksum checkpoint can skip all table checks without error (sync_diff_inspector/diff/global_checksum.go:61)

⚠️ [Major] (4)

  1. Global checksum table loop carries multiple responsibilities and high control-flow complexity (sync_diff_inspector/diff/global_checksum.go:58)
  2. Mode-switch recovery silently falls back instead of failing fast (sync_diff_inspector/diff/diff.go:204)
  3. Ordered checkpoint drain can accumulate unbounded pending results under latency skew (sync_diff_inspector/diff/global_checksum.go:311)
  4. Resumed global-checksum runs cannot clear prior table failure state (sync_diff_inspector/diff/global_checksum.go:178)

🟡 [Minor] (6)

  1. Checkpoint compatibility and restart-failure paths are not covered by deterministic tests (sync_diff_inspector/diff/global_checksum_test.go:149)
  2. Global-checksum error branch does not finalize per-table progress state (sync_diff_inspector/diff/global_checksum.go:178)
  3. Missing-but-empty tables can be reported as equal in data-only runs (sync_diff_inspector/diff/global_checksum.go:89)
  4. Mode selection is coupled to concrete source types instead of capability (sync_diff_inspector/diff/global_checksum.go:49)
  5. SavedState comment is stale after adding checksum-mode payload (sync_diff_inspector/checkpoints/checkpoints.go:186)
  6. Unscoped Done field obscures checkpoint completion semantics (sync_diff_inspector/checkpoints/checkpoints.go:158)

🧹 [Nit] (1)

  1. Test comment references a non-existent interface name (sync_diff_inspector/diff/global_checksum_test.go:131)

ChunkCnt: 1,
}

if !common.AllTableExist(tableDiff.TableLack) {
Contributor


🟡 [Minor] Missing-but-empty tables can be reported as equal in data-only runs

Why
For TableLack tables, global-checksum mode defines equality as upCount == downCount; when both are zero, it reports equal even though one side lacks the table object.

Scope
sync_diff_inspector/diff/global_checksum.go:89

Risk if unchanged
With check-data-only plus skip-non-existing-table, an environment can report pass for a missing table if it is empty, masking topology/schema drift in data-only verification workflows.

Evidence
In global-checksum TableLack handling, isEqual := upCount == downCount (sync_diff_inspector/diff/global_checksum.go:98). When check-data-only is enabled, struct check is skipped (sync_diff_inspector/main.go:115), so there is no separate structural failure marker. The legacy chunk path always marks lack-table data as unequal (sync_diff_inspector/diff/diff.go:466).

Change request
Please preserve the existing missing-table invariant in global-checksum mode: for !AllTableExist, always mark the data unequal (or explicitly persist TableLack into the report state when the struct check is skipped), and add a test case for a missing-but-empty table under check-data-only.

Contributor Author

Updated in 0b85aeb

}
}

// SavedState contains the information of the latest checked chunk and state of `report`
Contributor

🟡 [Minor] SavedState comment is stale after adding checksum-mode payload

Why
The struct-level comment still describes only chunk/report checkpoint content, but the struct now also carries checksum-mode state.

Scope
sync_diff_inspector/checkpoints/checkpoints.go:186

Risk if unchanged
Maintainers can misinterpret mode compatibility and miss that checkpoint files may legitimately contain checksum-info without chunk-info.

Evidence
SavedState includes Checksum *ChecksumState at sync_diff_inspector/checkpoints/checkpoints.go:190, while the comment at sync_diff_inspector/checkpoints/checkpoints.go:186 still states it contains the latest checked chunk and report state.

Change request
Please add a short why-comment for this struct describing both checkpoint modes and that chunk/checksum payloads are mode-dependent.

Contributor Author
Fixed

LastRange *chunk.Range `json:"last-range,omitempty"`
Checksum uint64 `json:"checksum"`
Count int64 `json:"count"`
Done bool `json:"done"`
Contributor

🟡 [Minor] Unscoped Done field obscures checkpoint completion semantics

Why
Done is persisted in checkpoint state but does not encode completion scope (per-source/per-table vs full run), while resume control flow depends on that exact meaning.

Scope
sync_diff_inspector/checkpoints/checkpoints.go:158

Risk if unchanged
Future resume logic changes can misread this flag and accidentally skip or reprocess tables when checkpoint behavior evolves.

Evidence
ChecksumSourceState defines Done bool at sync_diff_inspector/checkpoints/checkpoints.go:158, and resume advancement checks both source flags in sync_diff_inspector/diff/global_checksum.go:63 and sync_diff_inspector/diff/global_checksum_checkpoint.go:43.

Change request
Prefer a scoped name (for example TableDone) or an explicit semantics comment near the field; the current name is ambiguous for a persisted contract.

Contributor Author
Done is already scoped by its type and container (ChecksumSourceState), so its actual meaning is already “this source finished the checksum for the current TableIndex”. At call sites it does not read as “full run done”.

@joechenrh joechenrh changed the title sync-diff-inspector: calculate checksum only with export-fix-sql = false all: stabilize sync_diff_inspector checkpoint test Mar 13, 2026
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-triage-completed and removed release-note Denotes a PR that will be considered when it comes time to generate release notes. labels Mar 13, 2026
@joechenrh joechenrh changed the title all: stabilize sync_diff_inspector checkpoint test sync-diff-inspector: calculate checksum only with export-fix-sql = false Mar 13, 2026
@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. and removed release-note-none Denotes a PR that doesn't merit a release note. labels Mar 13, 2026
@joechenrh
Contributor Author

/retest

@joechenrh
Contributor Author

/retest


@ingress-bot ingress-bot left a comment


This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.

Summary

  • Total findings: 2
  • Inline comments: 2
  • Summary-only findings (no inline anchor): 0
Findings (highest risk first)

⚠️ [Major] (2)

  1. Resume state can skip a table that previously hit checksum execution error (checksum-only resume flow in equalByGlobalChecksum)
  2. Global checksum mode hard-fails ignored-handle tables instead of falling back (checksum-only iterator setup for TiDB sources with ignored handle columns)

@kennytm
Contributor

kennytm commented Mar 17, 2026

oh a new one? how many review bots have we got 😅

@joechenrh
Contributor Author

oh a new one? how many review bots have we got 😅

The same one as #12531 (review), but with a dedicated account 🤓

@joechenrh
Contributor Author

/retest

if checksumState != nil {
	df.checksumCheckpoint = checksumState
	if reportInfo != nil {
		df.report.LoadReport(reportInfo)


⚠️ [Major] Checkpoint resume can drop already-confirmed mismatches and later report success

Impact
If an earlier table was already confirmed unequal before the process stopped, a resumed run can still finish the remaining tables and print the top-level success branch, which hides the earlier mismatch behind a false overall pass.

Scope
checksum checkpoint restore via initChecksumCheckpoint and report.LoadReport

Evidence
This restore path reloads reportInfo, but report.LoadReport() only copies TableResults/duration/size and never rebuilds Report.Result. Report.Result itself is non-serialized (json:"-"), so a pre-restart mismatch can disappear from the final top-level outcome after resume. This bug also exists in the old chunk-checkpoint flow, because that path reuses the same report.LoadReport() behavior.

Change request
Recompute Report.Result from the restored table states after loading checkpointed TableResults, while still treating transient MeetError as non-durable.

Contributor Author

Done in joechenrh@667d2e2.

Good catch. report.LoadReport() now recomputes the aggregate Report.Result from restored TableResults instead of relying on the non-persisted Result field.

That keeps already-confirmed mismatches visible after resume while still treating transient MeetError as non-durable.

This was not introduced by the checksum-only path though. The old chunk-checkpoint flow used the same report.LoadReport() behavior, so the same loss of top-level fail state already existed there too.

progress.FailTable(progressID)
progress.UpdateTotal(progressID, 0, true)
progress.Inc(progressID)
df.report.SetTableMeetError(schema, table, err)


🟡 [Medium] Retry-success path never clears stale table error from report state

Impact
A retryable checksum error can remain attached to the table in MeetError, so a later successful retry still leaves an expired error in the report and future real failures can be rendered together with stale diagnostics.

Scope
checksum execution error handling in equalByGlobalChecksum plus shared report state

Evidence
This branch records SetTableMeetError(...), but the later success path only updates equality/count state and never clears MeetError. report.Print() later emits every remaining table error. This is also not unique to the new checksum-only flow: the old chunk path uses the same shared report state and has the same stale-error behavior after retry-success.

Change request
Clear per-table MeetError when the same table is successfully finalized, or otherwise reset transient table error state before success output is emitted.

Contributor Author

Done in joechenrh@667d2e2.

I added ClearTableMeetError and call it when the same table is successfully finalized, so a retry-success no longer leaves an expired transient error in the final report.

This is also an old shared-report issue rather than something unique to the checksum-only flow. The previous chunk path used the same MeetError state model, so stale retry-success errors could accumulate there as well.

joechenrh and others added 22 commits March 18, 2026 04:32
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
…failure

Add failpoints to simulate checksum failure after partial chunk progress:
- checksum-skip-chunk: skips seq 2 in drainOrderedChecksumTasks, so only
  chunks 0-1 are checkpointed before the gap halts further processing.
- checksum-error-source: errors one source after all results are collected.

Also fix the result goroutine to use `for range resultCh` with explicit
close after eg.Wait(), replacing the racy single select that could miss
buffered results when egCtx was cancelled.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace export/unset pattern with inline GO_FAILPOINTS=... prefix
so the env var is scoped to just the command invocation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Remove ExportFixSQL from TaskConfig struct and pass it as a parameter
to Init() and ComputeConfigHash() instead, avoiding field duplication
and unintended JSON serialization.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gcap#5)

1. Restore nil-guard on LoadChunk result to prevent panic when
   checkpoint file contains no chunk info.
2. Extract ChecksumCapableSource sub-interface from Source to avoid
   widening the core interface with checksum-only methods. Remove
   dead PreferGlobalChecksum/GetChecksumOnlyIterator from MySQLSources.
3. Rename PreferGlobalChecksum to type-assertion based capability check
   via ChecksumCapableSource, removing the misleading "Prefer" naming.
4. Add log.Warn before break in equalByGlobalChecksum error path so
   operators can see why remaining tables were skipped.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The TestComputeConfigHashIgnoresTLSName test from master called
ComputeConfigHash() without the exportFixSQL parameter added in
this branch.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move exportFixSQL from a function parameter to a field on TaskConfig
tagged with json:"-" to prevent serialization. This simplifies the
ComputeConfigHash() and Init() signatures so callers that don't care
about export-fix-sql (e.g. TLS tests) no longer need to pass it.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment on lines +64 to +67
cfg := NewConfig()
require.NoError(t, os.RemoveAll("/tmp/output/config"))
require.NoError(t, cfg.Parse([]string{"--config", "config.toml"}))
require.NoError(t, cfg.Init())
Contributor

wait, wut

please override the config's output-dir to t.TempDir() or a subdirectory of it instead of using the hard-coded "/tmp/output/config", which may be inaccessible or may already contain important resources on the test machine.

Contributor Author

Oops, shouldn't have hard-coded that 🫠 Switched to t.TempDir() and dropped the manual os.RemoveAll.

6a9564b

@codecov

codecov bot commented Mar 18, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 57.3233%. Comparing base (e0c6d64) to head (a698a49).

❌ Your project check has failed because the head coverage (57.3233%) is below the target coverage (60.0000%). You can increase the head coverage or adjust the target coverage.

Additional details and impacted files
| Components | Coverage Δ |
|---|---|
| cdc | 57.3233% <ø> (∅) |
| dm | ∅ <ø> (∅) |
| engine | ∅ <ø> (∅) |

| Flag | Coverage Δ |
|---|---|
| cdc | 57.3233% <ø> (?) |
| unit | 57.3233% <ø> (+8.2153%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

@@               Coverage Diff                @@
##             master     #12531        +/-   ##
================================================
+ Coverage   49.1079%   57.3233%   +8.2153%     
================================================
  Files           273        524       +251     
  Lines         52967      69183     +16216     
================================================
+ Hits          26011      39658     +13647     
- Misses        24763      26670      +1907     
- Partials       2193       2855       +662     

…test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment on lines 25 to 27
cfg := NewConfig()
require.NoError(t, os.RemoveAll("/tmp/output/config"))
require.Nil(t, cfg.Parse([]string{"-L", "info", "--config", "config.toml"}))
Contributor

ditto

Contributor Author

Done in db7859b

Removed output-dir from both config.toml and config_sharding.toml entirely, so Parse no longer overwrites it. The test sets cfg.Task.OutputDir = tmpDir once before the first Parse and it sticks across all subsequent calls 👍

@joechenrh
Contributor Author

/retest

@ti-chi-bot
Contributor

ti-chi-bot bot commented Mar 19, 2026

@joechenrh: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
|---|---|---|---|---|
| pull-cdc-integration-kafka-test | 6a9564b | link | true | /test pull-cdc-integration-kafka-test |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.


Labels

release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants