sync-diff-inspector: calculate checksum only with export-fix-sql = false#12531
sync-diff-inspector: calculate checksum only with export-fix-sql = false#12531joechenrh wants to merge 38 commits intopingcap:masterfrom
export-fix-sql = false#12531Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a specialized, high-performance checksum verification path for Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces a significant optimization for checksum-only verification in sync-diff-inspector by adding a new execution path for TiDB-to-TiDB comparisons when export-fix-sql is disabled. The changes are well-structured, separating the new logic into new files and functions, which is good for maintainability. The implementation correctly handles parallel execution, checkpointing, and resumption. I've found one high-severity issue regarding a goroutine leak that should be addressed.
|
/retest |
D3Hunter
left a comment
There was a problem hiding this comment.
AI-generated review based on @D3Hunter's standards; manual review will be performed after all comments are resolved.
Summary
- Total findings: 6
- Inline comments: 6
- Summary-only findings (no inline anchor): 0
Findings (highest risk first)
🚨 [Blocker] (2)
- Global-checksum checkpoint is backward-incompatible with chunk-mode restore (
sync_diff_inspector/diff/diff.go:192) - Checkpoint flush can persist done-state before table result update (
sync_diff_inspector/diff/global_checksum.go:131)
⚠️ [Major] (3)
- Checksum split-field logic can nil-deref when ignored columns remove PK metadata (
sync_diff_inspector/source/tidb.go:129) - Out-of-order checksum result buffering is unbounded (
sync_diff_inspector/diff/global_checksum.go:324) - Cancellation is handled as per-table error and loop continues (
sync_diff_inspector/diff/global_checksum.go:178)
🟡 [Minor] (1)
- Checkpoint ordering test depends on wall-clock timing (
sync_diff_inspector/diff/global_checksum_test.go:80)
D3Hunter
left a comment
There was a problem hiding this comment.
AI-generated review based on @D3Hunter's standards; manual review will be performed after all comments are resolved.
Summary
- Total findings: 1
- Inline comments: 1
- Summary-only findings (no inline anchor): 0
Findings (highest risk first)
⚠️ [Major] (1)
- Checkpoint mode mismatch can panic on resume (
sync_diff_inspector/diff/diff.go:206)
|
/retest |
D3Hunter
left a comment
There was a problem hiding this comment.
AI-generated review based on @D3Hunter's standards; manual review will be performed after all comments are resolved.
Summary
- Total findings: 9
- Inline comments: 9
- Summary-only findings (no inline anchor): 0
Findings (highest risk first)
🚨 [Blocker] (2)
- Out-of-order checksum buffering is unbounded and can exhaust memory on skewed chunk latency (
sync_diff_inspector/diff/global_checksum.go:331) - Checkpoint schema change is not rollback-safe for N-1 binaries (
sync_diff_inspector/checkpoints/checkpoints.go:188, sync_diff_inspector/checkpoints/checkpoints.go:295, sync_diff_inspector/diff/global_checksum_checkpoint.go:30, sync_diff_inspector/diff/diff.go:204)
⚠️ [Major] (3)
- Recovered global-checksum retries keep stale error state and can report false failure (
sync_diff_inspector/diff/global_checksum.go:178, sync_diff_inspector/diff/global_checksum.go:181, sync_diff_inspector/report/report.go:331, sync_diff_inspector/report/report.go:357) - Chunk-mode restart after checksum-mode checkpoint skips fix-SQL cleanup and can crash on duplicate filenames (
sync_diff_inspector/diff/diff.go:204, sync_diff_inspector/diff/diff.go:217, sync_diff_inspector/diff/diff.go:814) - Checksum-only mode forces random splitting that precomputes all chunks, creating O(chunk_count) startup memory pressure (
sync_diff_inspector/source/tidb.go:130)
🟡 [Minor] (3)
equalByGlobalChecksumcombines orchestration, per-table policy, concurrency wiring, and checkpoint/report side effects in one block (sync_diff_inspector/diff/global_checksum.go:58)getChecksumSplitFieldshas hidden schema mutation behind a getter-style API (sync_diff_inspector/source/tidb.go:134)- Upgrade from legacy chunk checkpoint to new checksum mode silently drops resume state (
sync_diff_inspector/diff/diff.go:192, sync_diff_inspector/diff/global_checksum_checkpoint.go:32)
🧹 [Nit] (1)
- Mock-adapter comment references a non-existent interface name (
sync_diff_inspector/diff/global_checksum_test.go:131)
|
|
||
| // SavedState contains the information of the latest checked chunk and state of `report` | ||
| // When sync-diff start from the checkpoint, it will load this information and continue running | ||
| type SavedState struct { |
There was a problem hiding this comment.
🚨 [Blocker] Checkpoint schema change is not rollback-safe for N-1 binaries
Why
The checkpoint payload now supports a checksum-only shape that omits chunk-info, but it is still written to the same checkpoint file name. Older binaries that only understand chunk checkpoints can read this file and hit a nil chunk path during resume, making downgrade/resume unsafe.
Scope
sync_diff_inspector/checkpoints/checkpoints.go:188, sync_diff_inspector/checkpoints/checkpoints.go:295, sync_diff_inspector/diff/global_checksum_checkpoint.go:30, sync_diff_inspector/diff/diff.go:204
Risk if unchanged
Rolling back from this version after an interrupted checksum-only run can fail at startup or require manual checkpoint deletion, which breaks expected rollback safety and complicates incident recovery.
Evidence
SavedState now has optional chunk-info and optional checksum-info, and SaveChecksumState persists only checksum-info into sync_diff_checkpoints.pb. The new chunk-mode loader explicitly added a nil-node branch (node == nil) for this mode-switch case, which indicates resume files may legitimately lack chunk state.
Change request
is this backward compatible with N-1 binaries? Please version the checkpoint format (or use a separate checksum checkpoint file) and add an explicit downgrade guard with an actionable error, plus a downgrade-resume compatibility test.
There was a problem hiding this comment.
Positive, add export-fix-sql into checksum hash bf01a44.
| return nil | ||
| }) | ||
|
|
||
| if err := eg.Wait(); err != nil { |
There was a problem hiding this comment.
⚠️ [Major] Recovered global-checksum retries keep stale error state and can report false failure
Why
The table-level error path persists MeetError, DataEqual=false, and UpCount/DownCount += -1 before checkpoint flush. If a restart resumes the same table and the retry succeeds, the later success path never clears these earlier fields, so the report can stay in Error/failed state despite a successful recheck.
Scope
sync_diff_inspector/diff/global_checksum.go:178, sync_diff_inspector/diff/global_checksum.go:181, sync_diff_inspector/report/report.go:331, sync_diff_inspector/report/report.go:357
Risk if unchanged
A transient checksum/read error can become a sticky false negative after resume: final output may continue to report table failure/error and incorrect row counts even after data compares cleanly on retry.
Evidence
In the global checksum error branch, code calls SetTableMeetError(...) and SetTableDataCheckResult(..., false, 0, 0, -1, -1, ...) before flushing checkpoint state. On later success, SetTableDataCheckResult only mutates DataEqual when equal == false, and it always accumulates counts (+=), while SetTableMeetError has no corresponding clear path.
Change request
add a case for retry-after-error resume: please add logic so a successful recheck can clear stale table error/failure markers (or avoid persisting failure markers until the table is definitively unrecoverable).
There was a problem hiding this comment.
Positive, error path shouldn't write DataEqual = false to checkpoint, fixed in 23d8b0c.
| } | ||
| originTable.Fields = fields | ||
|
|
||
| return splitter.NewRandomIteratorWithCheckpoint( |
There was a problem hiding this comment.
⚠️ [Major] Checksum-only mode forces random splitting that precomputes all chunks, creating O(chunk_count) startup memory pressure
Why
The new TiDB checksum-only iterator always chooses the random splitter, and that splitter materializes the full chunk slice before processing starts. For very large tables or aggressive chunk sizing, this creates high startup latency and memory amplification before any checksum work is drained.
Scope
sync_diff_inspector/source/tidb.go:130
Risk if unchanged
High-cardinality tables can suffer long pre-processing stalls and elevated memory usage, increasing p99 runtime and raising failure risk in constrained environments.
Evidence
Checksum-only iterator unconditionally calls NewRandomIteratorWithCheckpoint (sync_diff_inspector/source/tidb.go:130). The random iterator computes all chunks eagerly via splitRangeByRandom and stores them in chunks (sync_diff_inspector/splitter/random.go:177, sync_diff_inspector/splitter/random.go:192).
Change request
seems no need to force random splitting for every checksum-only table; can we prefer a lazy/bucket path when eligible, or otherwise chunk-generate incrementally to keep memory bounded?
There was a problem hiding this comment.
Known issue, acceptable for now.
| return upstreamIsTiDB && downstreamIsTiDB | ||
| } | ||
|
|
||
| func (df *Diff) equalByGlobalChecksum(ctx context.Context) error { |
There was a problem hiding this comment.
🟡 [Minor] equalByGlobalChecksum combines orchestration, per-table policy, concurrency wiring, and checkpoint/report side effects in one block
Why
The function currently owns table-loop orchestration, missing-table handling, per-source checksum worker lifecycle, periodic checkpoint flushing, and report/progress mutation. This packs multiple responsibilities into one control-flow surface, which raises maintenance cost and makes behavior-specific tests harder to target.
Scope
sync_diff_inspector/diff/global_checksum.go:58
Risk if unchanged
Future changes to one path (for example checkpoint semantics or mismatch handling) are more likely to regress another path because they are tightly interleaved inside one large function.
Evidence
equalByGlobalChecksum spans the full workflow from table dispatch to completion (sync_diff_inspector/diff/global_checksum.go:58 through sync_diff_inspector/diff/global_checksum.go:209), with distinct branches for ignore-data-check, lack-table handling, concurrent upstream/downstream checksum execution, and final report/checkpoint updates.
Change request
Can we extract the per-table flows into focused helpers (for example ignore/lack-table fast paths, concurrent checksum run, finalize-and-persist) and leave equalByGlobalChecksum as a thin orchestrator?
There was a problem hiding this comment.
Why I would not overreact:
- the code is not conceptually huge
- excessive extraction can make control flow jumpy if helpers are too small
- after the recent cleanup, the concurrency block is already better than before
sync_diff_inspector/source/tidb.go
Outdated
| ctx, dbutil.TableName(table.Schema, table.Table), &originTable, s.dbConn, startRange) | ||
| } | ||
|
|
||
| func getChecksumSplitFields(tableInfo *model.TableInfo) (string, error) { |
There was a problem hiding this comment.
🟡 [Minor] getChecksumSplitFields has hidden schema mutation behind a getter-style API
Why
The helper name suggests a pure field-selection operation, but in the default branch it also mutates tableInfo.Columns by appending _tidb_rowid. This mixes query and mutation responsibilities and makes call-site expectations less explicit.
Scope
sync_diff_inspector/source/tidb.go:134
Risk if unchanged
A future caller may reuse the same tableInfo across calls and unintentionally accumulate pseudo-columns or rely on behavior that is hard to infer from the function name.
Evidence
In sync_diff_inspector/source/tidb.go:153, getChecksumSplitFields appends a synthetic _tidb_rowid column before returning the field name string, so the helper both mutates schema state and computes a split key.
Change request
Can we separate responsibilities by either renaming this to a mutation-explicit helper (or splitting into prepare... plus get...) so callers can see the side effect at the API boundary?
There was a problem hiding this comment.
Renamed to not use a getter style API name, 2346ddf
|
|
||
| func (m *mockChecksumSource) Close() {} | ||
|
|
||
| // GetChecksumOnlyIterator adapts the mock to checksumOnlyIteratorSource. |
There was a problem hiding this comment.
🧹 [Nit] Mock-adapter comment references a non-existent interface name
Why
The new comment names checksumOnlyIteratorSource, but the actual interface introduced in this change is checksumOnlySource.
Scope
sync_diff_inspector/diff/global_checksum_test.go:131
Risk if unchanged
Readers may search for a non-existent type and misread which contract the mock is intended to satisfy.
Evidence
GetChecksumOnlyIterator comment says it adapts to checksumOnlyIteratorSource, while sync_diff_inspector/diff/global_checksum.go:41 defines checksumOnlySource as the real interface.
Change request
Prefer checksumOnlySource; current name is confusing. Update the comment so it matches the actual interface name.
| taskWg.Wait() | ||
| close(resultCh) | ||
| return nil | ||
| }) |
There was a problem hiding this comment.
ditto on the WaitGroup + ErrGroup usage. This time we have
- 1 producer, writing to
taskChand closing it at the end - N checksum workers, consuming
taskChand writing toresultCh - 1 "closer", which waits until all N checksum workers is done and close
resultCh - 1 consumer (main thread), which read from
resultChand generate the total.
I'd prefer
- Keep the producer and workers in the
eg - Move the consumer to a native goroutine, listen on
<-egCtx.Done()instead ofclose(resultCh)(use afor/selectloop instead offor rangeloop). (Perhaps use a channel to retrieve the total to avoid shared state.) - Get rid of the "closer"
sync_diff_inspector/source/tidb.go
Outdated
| default: | ||
| tableInfo.Columns = append(tableInfo.Columns, &model.ColumnInfo{ | ||
| Name: ast.NewCIStr("_tidb_rowid"), | ||
| Offset: len(tableInfo.Columns), | ||
| FieldType: *types.NewFieldType(mysql.TypeLonglong), | ||
| }) | ||
| return "_tidb_rowid", nil |
There was a problem hiding this comment.
[P2] Preserve chunk splitting for heap tables in checksum-only mode
When a TiDB table has neither a clustered primary key nor any secondary indexes, this new fallback picks
_tidb_rowidas the split field, but it never adds a corresponding usable index totable.Info.NewRandomIteratorWithCheckpointstill treatslen(table.Info.Indices) == 0as a no-index table and collapses the work to a single chunk, so checksum-only mode degenerates into one full-table checksum query on exactly the large heap tables it is meant to speed up.This is a meaningful performance regression for non-indexed tables.[...]
_tidb_rowiddoes get selected as the split field in the new checksum-only path.- But for no-index tables, that does not currently buy automatic multi-chunk splitting unless the user also sets
chunk-size, because the old random-splitter sizing rule is unchanged.So I would retract the “regression” wording. At most, this is an implementation gap relative to the new comment in
tidb.go (line 160)saying checksum-only mode “forces to use_tidb_rowidor clustered PK to split chunks.”
Note:
- "heap table" here means "a non-clustered table with no PK, no UK"
- the relevant code is this piece:
tiflow/sync_diff_inspector/splitter/random.go
Lines 156 to 165 in bcbe018
- I'm not sure if we want to care about comparing no-index-tables efficiently. So feel free to ignore this.
There was a problem hiding this comment.
Good catch, added a synthetic index in dc31f9c
D3Hunter
left a comment
There was a problem hiding this comment.
AI-generated review based on @D3Hunter's standards; manual review will be performed after all comments are resolved.
Summary
- Total findings: 13
- Inline comments: 13
- Summary-only findings (no inline anchor): 0
Findings (highest risk first)
🚨 [Blocker] (2)
- Checksum-mode checkpoint uses a shared unversioned file contract (
sync_diff_inspector/checkpoints/checkpoints.go:188) - Out-of-range checksum checkpoint can skip all table checks without error (
sync_diff_inspector/diff/global_checksum.go:61)
⚠️ [Major] (4)
- Global checksum table loop carries multiple responsibilities and high control-flow complexity (
sync_diff_inspector/diff/global_checksum.go:58) - Mode-switch recovery silently falls back instead of failing fast (
sync_diff_inspector/diff/diff.go:204) - Ordered checkpoint drain can accumulate unbounded pending results under latency skew (
sync_diff_inspector/diff/global_checksum.go:311) - Resumed global-checksum runs cannot clear prior table failure state (
sync_diff_inspector/diff/global_checksum.go:178)
🟡 [Minor] (6)
- Checkpoint compatibility and restart-failure paths are not covered by deterministic tests (
sync_diff_inspector/diff/global_checksum_test.go:149) - Global-checksum error branch does not finalize per-table progress state (
sync_diff_inspector/diff/global_checksum.go:178) - Missing-but-empty tables can be reported as equal in data-only runs (
sync_diff_inspector/diff/global_checksum.go:89) - Mode selection is coupled to concrete source types instead of capability (
sync_diff_inspector/diff/global_checksum.go:49) - SavedState comment is stale after adding checksum-mode payload (
sync_diff_inspector/checkpoints/checkpoints.go:186) - Unscoped
Donefield obscures checkpoint completion semantics (sync_diff_inspector/checkpoints/checkpoints.go:158)
🧹 [Nit] (1)
- Test comment references a non-existent interface name (
sync_diff_inspector/diff/global_checksum_test.go:131)
| ChunkCnt: 1, | ||
| } | ||
|
|
||
| if !common.AllTableExist(tableDiff.TableLack) { |
There was a problem hiding this comment.
🟡 [Minor] Missing-but-empty tables can be reported as equal in data-only runs
Why
For TableLack tables, global-checksum mode defines equality as upCount == downCount; when both are zero, it reports equal even though one side lacks the table object.
Scope
sync_diff_inspector/diff/global_checksum.go:89
Risk if unchanged
With check-data-only plus skip-non-existing-table, an environment can report pass for a missing table if it is empty, masking topology/schema drift in data-only verification workflows.
Evidence
In global-checksum TableLack handling, isEqual := upCount == downCount (sync_diff_inspector/diff/global_checksum.go:98). When check-data-only is enabled, struct check is skipped (sync_diff_inspector/main.go:115), so there is no separate structural failure marker. The legacy chunk path always marks lack-table data as unequal (sync_diff_inspector/diff/diff.go:466).
Change request
Please preserve the existing missing-table invariant in global-checksum mode: for !AllTableExist, always mark data unequal (or explicitly persist TableLack to report state when struct check is skipped). Add a case for missing-empty-table under check-data-only.
| } | ||
| } | ||
|
|
||
| // SavedState contains the information of the latest checked chunk and state of `report` |
There was a problem hiding this comment.
🟡 [Minor] SavedState comment is stale after adding checksum-mode payload
Why
The struct-level comment still describes only chunk/report checkpoint content, but the struct now also carries checksum-mode state.
Scope
sync_diff_inspector/checkpoints/checkpoints.go:186
Risk if unchanged
Maintainers can misinterpret mode compatibility and miss that checkpoint files may legitimately contain checksum-info without chunk-info.
Evidence
SavedState includes Checksum *ChecksumState at sync_diff_inspector/checkpoints/checkpoints.go:190, while the comment at sync_diff_inspector/checkpoints/checkpoints.go:186 still states it contains the latest checked chunk and report state.
Change request
Please add a short why-comment for this struct describing both checkpoint modes and that chunk/checksum payloads are mode-dependent.
| LastRange *chunk.Range `json:"last-range,omitempty"` | ||
| Checksum uint64 `json:"checksum"` | ||
| Count int64 `json:"count"` | ||
| Done bool `json:"done"` |
There was a problem hiding this comment.
🟡 [Minor] Unscoped Done field obscures checkpoint completion semantics
Why
Done is persisted in checkpoint state but does not encode completion scope (per-source/per-table vs full run), while resume control flow depends on that exact meaning.
Scope
sync_diff_inspector/checkpoints/checkpoints.go:158
Risk if unchanged
Future resume logic changes can misread this flag and accidentally skip or reprocess tables when checkpoint behavior evolves.
Evidence
ChecksumSourceState defines Done bool at sync_diff_inspector/checkpoints/checkpoints.go:158, and resume advancement checks both source flags in sync_diff_inspector/diff/global_checksum.go:63 and sync_diff_inspector/diff/global_checksum_checkpoint.go:43.
Change request
Prefer a scoped name (for example TableDone) or an explicit semantics comment near the field; current name is confusing for a persisted contract.
There was a problem hiding this comment.
Done is already scoped by the type and container. So the actual meaning is already “this source finished checksum for the current TableIndex”. It does not look like “full run done” at call sites.
export-fix-sql = falseexport-fix-sql = false
sync_diff_inspector/tests/sync_diff_inspector/checksum_only/run.sh
Outdated
Show resolved
Hide resolved
|
/retest |
1 similar comment
|
/retest |
ingress-bot
left a comment
There was a problem hiding this comment.
This review was generated by AI and should be verified by a human reviewer.
Manual follow-up is recommended before merge.
Summary
- Total findings: 2
- Inline comments: 2
- Summary-only findings (no inline anchor): 0
Findings (highest risk first)
⚠️ [Major] (2)
- Resume state can skip a table that previously hit checksum execution error (checksum-only resume flow in
equalByGlobalChecksum) - Global checksum mode hard-fails ignored-handle tables instead of falling back (checksum-only iterator setup for TiDB sources with ignored handle columns)
|
oh a new one? how many review bots have we got 😅 |
The same one as #12531 (review), but with a dedicated account 🤓 |
|
/retest |
| if checksumState != nil { | ||
| df.checksumCheckpoint = checksumState | ||
| if reportInfo != nil { | ||
| df.report.LoadReport(reportInfo) |
There was a problem hiding this comment.
⚠️ [Major] Checkpoint resume can drop already-confirmed mismatches and later report success
Impact
If an earlier table was already confirmed unequal before the process stopped, a resumed run can still finish the remaining tables and print the top-level success branch, which hides the earlier mismatch behind a false overall pass.
Scope
checksum checkpoint restore via initChecksumCheckpoint and report.LoadReport
Evidence
This restore path reloads reportInfo, but report.LoadReport() only copies TableResults/duration/size and never rebuilds Report.Result. Report.Result itself is non-serialized (json:"-"), so a pre-restart mismatch can disappear from the final top-level outcome after resume. This bug also exists in the old chunk-checkpoint flow, because that path reuses the same report.LoadReport() behavior.
Change request
Recompute Report.Result from the restored table states after loading checkpointed TableResults, while still treating transient MeetError as non-durable.
There was a problem hiding this comment.
Done in joechenrh@667d2e2 .
Good catch. report.LoadReport() now recomputes the aggregate Report.Result from restored TableResults instead of relying on the non-persisted Result field.
That keeps already-confirmed mismatches visible after resume while still treating transient MeetError as non-durable.
This was not introduced by the checksum-only path though. The old chunk-checkpoint flow used the same report.LoadReport() behavior, so the same loss of top-level fail state already existed there too.
| progress.FailTable(progressID) | ||
| progress.UpdateTotal(progressID, 0, true) | ||
| progress.Inc(progressID) | ||
| df.report.SetTableMeetError(schema, table, err) |
There was a problem hiding this comment.
🟡 [Medium] Retry-success path never clears stale table error from report state
Impact
A retryable checksum error can remain attached to the table in MeetError, so a later successful retry still leaves an expired error in the report and future real failures can be rendered together with stale diagnostics.
Scope
checksum execution error handling in equalByGlobalChecksum plus shared report state
Evidence
This branch records SetTableMeetError(...), but the later success path only updates equality/count state and never clears MeetError. report.Print() later emits every remaining table error. This is also not unique to the new checksum-only flow: the old chunk path uses the same shared report state and has the same stale-error behavior after retry-success.
Change request
Clear per-table MeetError when the same table is successfully finalized, or otherwise reset transient table error state before success output is emitted.
There was a problem hiding this comment.
Done in joechenrh@667d2e2 .
I added ClearTableMeetError and call it when the same table is successfully finalized, so a retry-success no longer leaves an expired transient error in the final report.
This is also an old shared-report issue rather than something unique to the checksum-only flow. The previous chunk path used the same MeetError state model, so stale retry-success errors could accumulate there as well.
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
…failure Add failpoints to simulate checksum failure after partial chunk progress: - checksum-skip-chunk: skips seq 2 in drainOrderedChecksumTasks, so only chunks 0-1 are checkpointed before the gap halts further processing. - checksum-error-source: errors one source after all results are collected. Also fix the result goroutine to use `for range resultCh` with explicit close after eg.Wait(), replacing the racy single select that could miss buffered results when egCtx was cancelled. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace export/unset pattern with inline GO_FAILPOINTS=... prefix so the env var is scoped to just the command invocation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Ruihao Chen <joechenrh@gmail.com>
Remove ExportFixSQL from TaskConfig struct and pass it as a parameter to Init() and ComputeConfigHash() instead, avoiding field duplication and unintended JSON serialization. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…gcap#5) 1. Restore nil-guard on LoadChunk result to prevent panic when checkpoint file contains no chunk info. 2. Extract ChecksumCapableSource sub-interface from Source to avoid widening the core interface with checksum-only methods. Remove dead PreferGlobalChecksum/GetChecksumOnlyIterator from MySQLSources. 3. Rename PreferGlobalChecksum to type-assertion based capability check via ChecksumCapableSource, removing the misleading "Prefer" naming. 4. Add log.Warn before break in equalByGlobalChecksum error path so operators can see why remaining tables were skipped. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The TestComputeConfigHashIgnoresTLSName test from master called ComputeConfigHash() without the exportFixSQL parameter added in this branch. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
c44066e to
aa19e10
Compare
Move exportFixSQL from a function parameter to a field on TaskConfig tagged with json:"-" to prevent serialization. This simplifies the ComputeConfigHash() and Init() signatures so callers that don't care about export-fix-sql (e.g. TLS tests) no longer need to pass it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| cfg := NewConfig() | ||
| require.NoError(t, os.RemoveAll("/tmp/output/config")) | ||
| require.NoError(t, cfg.Parse([]string{"--config", "config.toml"})) | ||
| require.NoError(t, cfg.Init()) |
There was a problem hiding this comment.
wait, wut
please override the config's output-dir to t.TempDir() or its subdirectory instead of using the hard-coded "/tmp/output/config", which may not be accessible or even contain important resources on the test machine.
There was a problem hiding this comment.
Oops, shouldn't have hard-coded that 🫠 Switched to t.TempDir() and dropped the manual os.RemoveAll.
Codecov Report✅ All modified and coverable lines are covered by tests. ❌ Your project check has failed because the head coverage (57.3233%) is below the target coverage (60.0000%). You can increase the head coverage or adjust the target coverage. Additional details and impacted files
Flags with carried forward coverage won't be shown. Click here to find out more. @@ Coverage Diff @@
## master #12531 +/- ##
================================================
+ Coverage 49.1079% 57.3233% +8.2153%
================================================
Files 273 524 +251
Lines 52967 69183 +16216
================================================
+ Hits 26011 39658 +13647
- Misses 24763 26670 +1907
- Partials 2193 2855 +662 🚀 New features to boost your workflow:
|
…test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
| cfg := NewConfig() | ||
| require.NoError(t, os.RemoveAll("/tmp/output/config")) | ||
| require.Nil(t, cfg.Parse([]string{"-L", "info", "--config", "config.toml"})) |
There was a problem hiding this comment.
Done in db7859b
Removed output-dir from both config.toml and config_sharding.toml entirely, so Parse no longer overwrites it. The test sets cfg.Task.OutputDir = tmpDir once before the first Parse and it sticks across all subsequent calls 👍
|
/retest |
|
@joechenrh: The following test failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
What problem does this PR solve?
Issue Number: ref #12492
In scenarios where
export-fix-sql = false(for example B/G upgrade verification), we only need final table-level equality (count+checksum), not row-level fix SQL generation. And this PR introduces a checksum-only path for TiDB-to-TiDB data check while keeping existing chunk-mode checkpoint behavior intact.Basic rationale:
export-fix-sql = false, we use_tidb_rowidas split field to get a scan-friendly split path.What is changed and how it works?
export-fix-sqlis disabled._tidb_rowidsplitting.count+checksum).table-index, per-source accumulatedcount/checksum,last-range, anddone.Check List
Tests
Manual benchmark (same cluster used as both upstream and downstream to simplify setup):
Note:
Questions
Will it cause performance regression or break compatibility?
Checkpoint from older version can't be used with new version of sync-diff-inspectpr.
Do you need to update user documentation, design documentation or monitoring documentation?
No.
Release note