Skip to content

fix(dm): resolve flaky TestWorker panic on unexpected Resume() call#12546

Open
Copilot wants to merge 3 commits intorelease-8.5from
copilot/fix-flaky-test-worker-panics
Open

fix(dm): resolve flaky TestWorker panic on unexpected Resume() call#12546
Copilot wants to merge 3 commits intorelease-8.5from
copilot/fix-flaky-test-worker-panics

Conversation

Copy link

Copilot AI commented Mar 10, 2026

TestWorker intermittently panicked when checkAndAutoResume dispatched an unexpected Resume() call on the mock before the test registered that expectation.

Root cause

newDMWorker() sets autoResume.LatestResumeTime = time.Now() at creation time. On slow CI, the SQLite Init() call can take >1s, meaning by the time the first StageError tick runs, time.Since(LatestResumeTime) ≥ backoff.Current() (initial backoff = DefaultBackoffMin = 1s). This causes CheckResumeSubtask to return ResumeDispatch on the first error tick — before time.Sleep(time.Second) — triggering Resume() on a mock with no registered expectation → panic.

Fix

Reset dmWorker.autoResume.LatestResumeTime = time.Now() immediately before the auto-resume test section. This makes the timing deterministic: the first error tick reliably gets ResumeSkip, and after the time.Sleep(time.Second) the second tick gets ResumeDispatch as intended.

// Reset LatestResumeTime to make auto-resume timing deterministic regardless of test execution speed.
// Without this reset, slow CI environments (e.g., a DB insert taking >1s) may cause the first
// StageError tick to dispatch an unexpected Resume() call before time.Sleep(time.Second) below.
dmWorker.autoResume.LatestResumeTime = time.Now()

// auto resume error
unitHolder.On("Stage").Return(metadata.StageError, ...).Twice()
require.NoError(t, dmWorker.Tick(context.Background()))
time.Sleep(time.Second)
// now Resume() is expected
unitHolder.On("Resume").Return(errors.New("resume error")).Once()
require.EqualError(t, dmWorker.Tick(context.Background()), "resume error")

Close #12545

Check List

Tests

  • Unit test

Questions

Will it cause performance regression or break compatibility?

No — test-only change.

Do you need to update user documentation, design documentation or monitoring documentation?

No.

Release note

None
Original prompt

This section details on the original issue you should resolve

<issue_title>Flaky: engine/executor/dm TestWorker panics on unexpected mock Resume()</issue_title>
<issue_description>### Go package
github.com/pingcap/tiflow/engine/executor/dm (dir: engine/executor/dm)

Test case

TestWorker (engine/executor/dm/worker_test.go)

Flaky symptom

CI intermittently panics due to an unexpected testify/mock call (Resume()), likely triggered by the auto-resume path in (*dmWorker).checkAndAutoResume.

Failure log / stacktrace

=== Failed

=== FAIL: engine/executor/dm TestWorker (1.35s)

[2026/03/10 10:20:58.178 +08:00] [INFO] [broker.go:104] ["Create new resource broker"] [executor-id=executor-1] [config="{\"local\":{\"base-dir\":\"/tmp/2686613752-localfiles\"},\"s3\":{\"endpoint\":\"\",\"region\":\"\",\"storage-class\":\"\",\"sse\":\"\",\"sse-kms-key-id\":\"\",\"acl\":\"\",\"access-key\":\"\",\"secret-access-key\":\"\",\"session-token\":\"\",\"provider\":\"\",\"force-path-style\":false,\"use-accelerate-endpoint\":false,\"role-arn\":\"\",\"external-id\":\"\",\"object-lock-enabled\":false,\"bucket\":\"\",\"prefix\":\"\"},\"gcs\":{\"endpoint\":\"\",\"storage-class\":\"\",\"predefined-acl\":\"\",\"credentials-file\":\"\",\"bucket\":\"\",\"prefix\":\"\"}}"]

[2026/03/10 10:20:58.178 +08:00] [INFO] [broker.go:133] ["broker will not use s3/gcs as external storage since s3/gcs are both not configured"]

[2026/03/10 10:20:58.187 +08:00] [INFO] [worker.go:149] ["initializing the dm worker"] [job_id=master-id] [worker_id=worker-id] [task-id=task-id]

[2026/03/10 10:20:59.459 +08:00] [WARN] [log.go:102] ["slow log"] [component=gorm] [elapsed=1.272378614s] [sql="INSERT INTO `worker_statuses` (`created_at`,`updated_at`,`project_id`,`job_id`,`id`,`type`,`state`,`epoch`,`error_message`,`extend_bytes`) VALUES (\"2026-03-10 10:20:58.189\",\"2026-03-10 10:20:58.189\",\"\",\"master-id\",\"worker-id\",7,0,2,\"\",\"\") ON CONFLICT (`id`,`job_id`) DO UPDATE SET `updated_at`=`excluded`.`updated_at`,`project_id`=`excluded`.`project_id`,`job_id`=`excluded`.`job_id`,`id`=`excluded`.`id`,`type`=`excluded`.`type`,`state`=`excluded`.`state`,`epoch`=`excluded`.`epoch`,`error_message`=`excluded`.`error_message`,`extend_bytes`=`excluded`.`extend_bytes` RETURNING `seq_id`"] [affected-rows=1]

[2026/03/10 10:20:59.461 +08:00] [INFO] [worker.go:252] ["task stage changed"] [job_id=master-id] [worker_id=worker-id] [task-id=task-id] [from=Initing] [to=Running]

[2026/03/10 10:20:59.461 +08:00] [INFO] [worker.go:257] ["update status"] [job_id=master-id] [worker_id=worker-id] [task-id=task-id] [status="{\"Unit\":\"DMDumpTask\",\"Task\":\"task-id\",\"Stage\":\"Running\",\"CfgModRevision\":0,\"StageUpdatedTime\":\"2026-03-10T10:20:59.461856048+08:00\"}"]

[2026/03/10 10:20:59.463 +08:00] [ERROR] [worker.go:354] ["task runs with error"] [job_id=master-id] [worker_id=worker-id] [task-id=task-id] ["error msg"="[{}]"] [stack="github.com/pingcap/tiflow/engine/executor/dm.(*dmWorker).checkAndAutoResume\n\t/home/jenkins/agent/workspace/pingcap/tiflow/release-8.5/ghpr_verify/tiflow/engine/executor/dm/worker.go:354\ngithub.com/pingcap/tiflow/engine/executor/dm.(*dmWorker).Tick\n\t/home/jenkins/agent/workspace/pingcap/tiflow/release-8.5/ghpr_verify/tiflow/engine/executor/dm/worker.go:174\ngithub.com/pingcap/tiflow/engine/executor/dm.TestWorker\n\t/home/jenkins/agent/workspace/pingcap/tiflow/release-8.5/ghpr_verify/tiflow/engine/executor/dm/worker_test.go:143\ntesting.tRunner\n\t/usr/local/go/src/testing/testing.go:1934"]

[2026/03/10 10:20:59.463 +08:00] [INFO] [worker.go:355] ["got auto resume strategy"] [job_id=master-id] [worker_id=worker-id] [task-id=task-id] [strategy="dispatch auto resume"]

[2026/03/10 10:20:59.464 +08:00] [INFO] [worker.go:359] ["dispatch auto resume task"] [job_id=master-id] [worker_id=worker-id] [task-id=task-id]

panic: 

\tassert: mock: I don't know what to return because the method call was unexpected.

\t\tEither do Mock.On(\"Resume\").Return(...) first, or remove the Resume() call.

\t\tThis method was unexpected:

\t\t\tResume()

\t\t

\tat: [/home/jenkins/agent/workspace/pingcap/tiflow/release-8.5/ghpr_verify/tiflow/engine/executor/dm/unitholder_test.go:309 /home/jenkins/agent/workspace/pingcap/tiflow/release-8.5/ghpr_verify/tiflow/engine/executor/dm/worker.go:360 /home/jenkins/agent/workspace/pingcap/tiflow/release-8.5/ghpr_verify/tiflow/engine/executor/dm/worker.go:174 /home/jenkins/agent/workspace/pingcap/tiflow/release-8.5/ghpr_verify/tiflow/engine/executor/dm/worker_test.go:143] [recovered, repanicked]



goroutine 400 [running]:

testing.tRunner.func1.2({0x9eafd00, 0xc000daa7d0})

\t/usr/local/go/src/testing/testing.go:1872 +0x419

testing.tRunner.func1()

\t/usr/local/go/src/testing/testing.go:1875 +0x683

panic({0x9eafd00?, 0xc000daa7d0?})

\t/usr/local/go/src/runtime/panic.go:783 +0x132

github.com/stretchr/testify/mock.(*Mock).fail(0xc003569028, {0xac45cb2, 0xc1}, {0xc002c37a80, 0x4, 0x...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes pingcap/tiflow#12545

<!-- START COPILOT CODING AGENT TIPS -->
---

💬 We'd love your input! Share your thoughts on Copilot coding agent in our [2 minute survey](https://gh.io/copilot-coding-agent-survey).

@ti-chi-bot ti-chi-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/cherry-pick-not-approved labels Mar 10, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 10, 2026

This cherry pick PR is for a release branch and has not yet been approved by triage owners.
Adding the do-not-merge/cherry-pick-not-approved label.

To merge this cherry pick:

  1. It must be LGTMed and approved by the reviewers firstly.
  2. For pull requests to TiDB-x branches, it must have no failed tests.
  3. AFTER it has lgtm and approved labels, please wait for the cherry-pick merging approval from triage owners.
Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. label Mar 10, 2026
@ti-chi-bot ti-chi-bot bot added contribution This PR is from a community contributor. first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Mar 10, 2026
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 10, 2026

Hi @Copilot. Thanks for your PR.

I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. label Mar 10, 2026
…section

Co-authored-by: wuhuizuo <2574558+wuhuizuo@users.noreply.github.com>
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 10, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign d3hunter for approval. For more information see the Code Review Process.
Please ensure that each of them provides their approval before proceeding.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the area/engine Issues or PRs related to Dataflow Engine. label Mar 10, 2026
Copilot AI changed the title [WIP] Fix flaky TestWorker panics on unexpected mock Resume call fix(dm): resolve flaky TestWorker panic on unexpected Resume() call Mar 10, 2026
@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Mar 10, 2026
@wuhuizuo
Copy link
Contributor

/test ?

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 10, 2026

@wuhuizuo: The following commands are available to trigger required jobs:

/test pull-cdc-integration-kafka-test
/test pull-cdc-integration-mysql-test
/test pull-cdc-integration-pulsar-test
/test pull-cdc-integration-storage-test
/test pull-dm-compatibility-test
/test pull-dm-integration-test
/test pull-syncdiff-integration-test
/test pull-verify

Use /test all to run all jobs.

Details

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@wuhuizuo
Copy link
Contributor

/test pull-verify

@codecov
Copy link

codecov bot commented Mar 10, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (release-8.5@95dba8c). Learn more about missing BASE report.

Additional details and impacted files
Components Coverage Δ
cdc 57.5928% <0.0000%> (?)
dm ∅ <0.0000%> (?)
engine ∅ <0.0000%> (?)
Flag Coverage Δ
cdc 57.5928% <ø> (?)
unit 57.5928% <ø> (?)

Flags with carried forward coverage won't be shown. Click here to find out more.

@@               Coverage Diff                @@
##             release-8.5     #12546   +/-   ##
================================================
  Coverage               ?   57.5928%           
================================================
  Files                  ?        520           
  Lines                  ?      67406           
  Branches               ?          0           
================================================
  Hits                   ?      38821           
  Misses                 ?      25818           
  Partials               ?       2767           
🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@wuhuizuo wuhuizuo marked this pull request as ready for review March 10, 2026 06:27
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 10, 2026
@wuhuizuo
Copy link
Contributor

/ok-to-test

@ti-chi-bot ti-chi-bot bot added ok-to-test Indicates a PR is ready to be tested. and removed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Mar 10, 2026
@wuhuizuo
Copy link
Contributor

/test ?

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 10, 2026

@wuhuizuo: The following commands are available to trigger required jobs:

/test pull-cdc-integration-kafka-test
/test pull-cdc-integration-mysql-test
/test pull-cdc-integration-pulsar-test
/test pull-cdc-integration-storage-test
/test pull-dm-compatibility-test
/test pull-dm-integration-test
/test pull-syncdiff-integration-test
/test pull-verify

Use /test all to run all jobs.

Details

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@wuhuizuo
Copy link
Contributor

/retest

@wuhuizuo
Copy link
Contributor

/test ?

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 10, 2026

@wuhuizuo: The following commands are available to trigger required jobs:

/test pull-cdc-integration-kafka-test
/test pull-cdc-integration-mysql-test
/test pull-cdc-integration-pulsar-test
/test pull-cdc-integration-storage-test
/test pull-dm-compatibility-test
/test pull-dm-integration-test
/test pull-syncdiff-integration-test
/test pull-verify

Use /test all to run all jobs.

Details

In response to this:

/test ?

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Mar 13, 2026

@Copilot: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-syncdiff-integration-test 95579d3 link true /test pull-syncdiff-integration-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/engine Issues or PRs related to Dataflow Engine. contribution This PR is from a community contributor. do-not-merge/cherry-pick-not-approved first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants