Skip to content

sink(ticdc): retry later when meet unexpected DDL state (#12483)#12515

Open
ti-chi-bot wants to merge 1 commit intopingcap:release-8.5from
ti-chi-bot:cherry-pick-12483-to-release-8.5
Open

sink(ticdc): retry later when meet unexpected DDL state (#12483)#12515
ti-chi-bot wants to merge 1 commit intopingcap:release-8.5from
ti-chi-bot:cherry-pick-12483-to-release-8.5

Conversation

@ti-chi-bot
Copy link
Member

This is an automated cherry-pick of #12483

What problem does this PR solve?

Issue Number: close #12482

What is changed and how it works?

When meeting the unexpected DDL state, instead of immediately failing, CDC now attempts to retry the DDL. This change aims to avoid CDC restart.

Check List

Tests

  • Unit test
  • Integration test

Questions

Will it cause performance regression or break compatibility?
Do you need to update user documentation, design documentation or monitoring documentation?

Release note

Avoid CDC restart when meeting lots of slow DDL.

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files. type/cherry-pick-for-release-8.5 This PR is cherry-picked to release-8.5 from a source PR. labels Feb 4, 2026
@ti-chi-bot
Copy link
Member Author

@wk989898 This PR has conflicts, I have hold it.
Please resolve them or ask others to resolve them, then comment /unhold to remove the hold label.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 4, 2026

This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must be LGTMed and approved by the reviewers firstly.
  2. For pull requests to TiDB-x branches, it must have no failed tests.
  3. AFTER it has lgtm and approved labels, please wait for the cherry-pick merging approval from triage owners.
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 4, 2026

@ti-chi-bot: ## If you want to know how to resolve it, please read the guide in TiDB Dev Guide.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the ti-community-infra/tichi repository.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Feb 4, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign july2993 for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@gemini-code-assist
Copy link

Summary of Changes

Hello @ti-chi-bot, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness of DDL replication in TiCDC's MySQL sink. By implementing a retry and polling mechanism for DDL operations that encounter unexpected states, the system can now gracefully handle transient issues or long-running DDLs without resorting to full CDC restarts. This change improves the overall stability and resilience of the data synchronization process, particularly in environments with complex or slow DDL executions.

Highlights

  • DDL Retry Mechanism: Introduced a new mechanism to retry DDL operations when an unexpected DDL state is encountered in the downstream database, preventing immediate failures and CDC restarts.
  • DDL State Polling: Added a waitDDLDone function in the MySQL DDL sink to periodically query the downstream TiDB's information_schema.ddl_jobs table to monitor the status of DDL operations.
  • Retryable Error Handling: Updated the IsRetryableDDLError utility function to include ErrDDLStateNotFound, ensuring that the system attempts to retry when a DDL job's state cannot be immediately determined.
  • Integration Test: A new integration test (ddl_wait) has been added to simulate scenarios with long-running DDLs, specifically ALTER TABLE ADD INDEX, to validate the new retry and polling logic.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • cdc/sink/ddlsink/mysql/mysql_ddl_sink.go
    • Added waitDDLDone function to poll DDL job status from information_schema.ddl_jobs in TiDB.
    • Implemented logic to handle various DDL job states (Done, Synced, Cancelled, Rollingback, Running, Queueing) and retry on unexpected states.
    • Added logging for DDL progress and state changes.
  • pkg/errorutil/util.go
    • Modified IsRetryableDDLError to recognize cerror.ErrDDLStateNotFound as a retryable error.
  • pkg/errorutil/util_test.go
    • Added a test case to TestIsRetryableDDLError for cerror.ErrDDLStateNotFound.
  • tests/integration_tests/ddl_wait/run.sh
    • Added a new integration test script to simulate and verify the handling of long-running DDLs, specifically ALTER TABLE ADD INDEX.
Activity
  • This is an automated cherry-pick of sink(ticdc): retry later when meet unexpected DDL state #12483.
  • The problem addressed is to prevent CDC restarts when encountering unexpected DDL states.
  • The solution involves retrying DDLs instead of failing immediately.
  • Unit and integration tests are included to validate the changes.
  • No performance regression or compatibility issues are anticipated.
  • No updates to user, design, or monitoring documentation are required.
  • A release note states: 'Avoid CDC restart when meeting lots of slow DDL.'
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@ti-chi-bot ti-chi-bot bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Feb 4, 2026
Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to improve TiCDC's resilience to slow DDLs by introducing a retry mechanism for unexpected DDL states, preventing changefeed failures. However, it introduces a critical SQL injection vulnerability in the getDDLStateFromTiDB function due to constructing a SQL query with untrusted input (ddl.Query) via string formatting. This must be fixed by using parameterized queries. Additionally, the file cdc/sink/ddlsink/mysql/mysql_ddl_sink.go contains unresolved merge conflicts and duplicated code, which will prevent compilation.

Comment on lines +278 to +339
<<<<<<< HEAD
=======
func (m *DDLSink) waitDDLDone(ctx context.Context, ddl *model.DDLEvent, ddlCreateTime string) error {
ticker := time.NewTicker(5 * time.Second)
ticker1 := time.NewTicker(10 * time.Minute)
defer ticker.Stop()
defer ticker1.Stop()
for {
select {
case <-ctx.Done():
return ctx.Err()
case <-ticker.C:
case <-ticker1.C:
log.Info("DDL is still running downstream, it blocks other DDL or DML events", zap.String("ddl", ddl.Query), zap.String("ddlCreateTime", ddlCreateTime))
}

state, err := getDDLStateFromTiDB(ctx, m.db, ddl.Query, ddlCreateTime)
if err != nil {
log.Error("Error when getting DDL state from TiDB", zap.Error(err))
}
switch state {
case timodel.JobStateDone, timodel.JobStateSynced:
log.Info("DDL replicate success", zap.String("ddl", ddl.Query), zap.String("ddlCreateTime", ddlCreateTime))
return nil
case timodel.JobStateCancelled, timodel.JobStateRollingback, timodel.JobStateRollbackDone, timodel.JobStateCancelling:
return errors.ErrExecDDLFailed.GenWithStackByArgs(ddl.Query)
case timodel.JobStateRunning, timodel.JobStateQueueing:
switch ddl.Type {
// returned immediately if not block dml
case timodel.ActionAddIndex:
log.Info("DDL is running downstream", zap.String("ddl", ddl.Query), zap.String("ddlCreateTime", ddlCreateTime), zap.Any("ddlState", state))
return nil
}
default:
log.Warn("Unexpected DDL state, may not be found downstream, retry later", zap.String("ddl", ddl.Query), zap.String("ddlCreateTime", ddlCreateTime), zap.Any("ddlState", state))
return errors.ErrDDLStateNotFound.GenWithStackByArgs(state)
}
}
}

// WriteCheckpointTs does nothing.
func (m *DDLSink) WriteCheckpointTs(_ context.Context, _ uint64, _ []*model.TableInfo) error {
// Only for RowSink for now.
return nil
}

// Close closes the database connection.
func (m *DDLSink) Close() {
if m.statistics != nil {
m.statistics.Close()
}
if m.db != nil {
if err := m.db.Close(); err != nil {
log.Warn("MySQL ddl sink close db wit error",
zap.String("namespace", m.id.Namespace),
zap.String("changefeed", m.id.ID),
zap.Error(err))
}
}
}

>>>>>>> 1da8b3e23a (sink(ticdc): retry later when meet unexpected DDL state (#12483))

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

critical

This section of the code contains unresolved merge conflict markers (<<<<<<< HEAD, =======, >>>>>>>) and also introduces duplicated definitions for WriteCheckpointTs and Close functions, which are already defined elsewhere in the file. This will cause a compilation failure. Please resolve the merge conflict and remove the duplicated code.

Comment on lines +398 to +399
showJobs := fmt.Sprintf(`SELECT JOB_ID, JOB_TYPE, SCHEMA_STATE, SCHEMA_ID, TABLE_ID, STATE, QUERY FROM information_schema.ddl_jobs
WHERE CREATE_TIME >= "%s" AND QUERY = "%s";`, createTime, ddl)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

The function getDDLStateFromTiDB constructs a SQL query by directly embedding the ddl and createTime strings into a query string using fmt.Sprintf with double quotes. This leads to a SQL injection vulnerability, as ddl can contain special characters or be manipulated, potentially allowing arbitrary SQL execution, especially since MultiStatements is enabled. To remediate this, use parameterized queries instead of string formatting. After changing the query string as suggested, you'll also need to update the db.QueryContext call on line 403 to pass createTime and ddl as arguments: db.QueryContext(ctx, showJobs, createTime, ddl).

showJobs := `SELECT JOB_ID, JOB_TYPE, SCHEMA_STATE, SCHEMA_ID, TABLE_ID, STATE, QUERY FROM information_schema.ddl_jobs	WHERE CREATE_TIME >= ? AND QUERY = ?;`

var jobsResults [][]string
err := retry.Do(ctx, func() error {
//nolint:rowserrcheck
jobsRows, err := db.QueryContext(ctx, showJobs)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

security-high high

When using parameterized queries, the db.QueryContext call must also be updated to pass the arguments.

		jobsRows, err := db.QueryContext(ctx, showJobs, createTime, ddl)

Comment on lines +295 to +297
if err != nil {
log.Error("Error when getting DDL state from TiDB", zap.Error(err))
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The error returned from getDDLStateFromTiDB is logged as an Error but then effectively ignored. The function proceeds to return ErrDDLStateNotFound, which is always retryable. This hides the original error and could lead to excessive retries for what might be a non-retryable issue. It would be better to return the original error and let the caller's retry logic, which uses IsRetryableDDLError, decide whether to retry. This would also provide better error propagation. Consider also logging the error at a Warn level before returning.

if err != nil {
	return err
}

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 17, 2026

@ti-chi-bot: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-verify 41e7b15 link true /test pull-verify
pull-dm-integration-test 41e7b15 link true /test pull-dm-integration-test
pull-dm-compatibility-test 41e7b15 link true /test pull-dm-compatibility-test
pull-syncdiff-integration-test 41e7b15 link true /test pull-syncdiff-integration-test
pull-cdc-integration-kafka-test 41e7b15 link true /test pull-cdc-integration-kafka-test
pull-cdc-integration-mysql-test 41e7b15 link true /test pull-cdc-integration-mysql-test
pull-cdc-integration-storage-test 41e7b15 link true /test pull-cdc-integration-storage-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/cherry-pick-not-approved do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm release-note Denotes a PR that will be considered when it comes time to generate release notes. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/cherry-pick-for-release-8.5 This PR is cherry-picked to release-8.5 from a source PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants