
codec(ticdc): improve error logging for Debezium encoding failures #12484

Merged
ti-chi-bot[bot] merged 1 commit into pingcap:master from takaidohigasi:improve-debezium-error-logging
Feb 4, 2026

Conversation

@takaidohigasi
Contributor

@takaidohigasi takaidohigasi commented Jan 8, 2026

What problem does this PR solve?

Issue Number: close #12485

Summary

  • Add detailed context to error logs when Debezium encoding fails
  • Include schema, table, column name, and value in writeDebeziumFieldValues error log
  • Include namespace, changefeed ID, schema, table, and commitTs in runEncoder error log

This helps debug encoding issues such as invalid enum values. Previously only the error message was logged, with no context about which row or column caused the failure.

The commitTs in the log can be used to skip problematic events using cdc cli changefeed resume --overwrite-checkpoint-ts.

issue: #12485

PII Consideration

This change logs the column value (zap.Any("value", col.Value)) when encoding fails. This may expose sensitive/PII data in logs. However:

  • This is consistent with existing behavior in MySQL sink (mysql.go:229, mysql.go:713) which already logs values
  • The value is only logged at ERROR level when encoding actually fails, not during normal operation
  • This information is essential for debugging encoding issues (e.g., identifying what invalid value caused the failure)

Users handling sensitive data should ensure appropriate log access controls are in place.

Check List

Tests

Unit Test: takaidohigasi#2

Questions

  • Will it cause performance regression or break compatibility?
    The error log format changes slightly, but this should not cause compatibility problems.
    The change only affects the error path, so it should not cause a performance regression.

  • Do you need to update user documentation, design documentation or monitoring documentation?
    no

Test plan

  • Existing tests pass (go test ./pkg/sink/codec/debezium/...)
  • Manual test with invalid enum value to verify improved logging

Release note

Add more context to CDC error logs when the Debezium sink fails to encode an event

🤖 Generated with Claude Code

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jan 8, 2026
@takaidohigasi takaidohigasi marked this pull request as draft January 8, 2026 04:25
@ti-chi-bot ti-chi-bot bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. contribution This PR is from a community contributor. first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. labels Jan 8, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Jan 8, 2026

Hi @takaidohigasi. Thanks for your PR.

I'm waiting for a pingcap member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@ti-chi-bot ti-chi-bot bot added the needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. label Jan 8, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Jan 8, 2026

Welcome @takaidohigasi!

It looks like this is your first PR to pingcap/tiflow 🎉.

I'm the bot that helps you request reviewers, add labels, and more. See available commands.

We want to make sure your contribution gets all the attention it needs!



Thank you, and welcome to pingcap/tiflow. 😃

@gemini-code-assist

Summary of Changes

Hello @takaidohigasi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on improving the diagnostic capabilities of the Debezium encoder by enriching error logs with more contextual information. The changes ensure that when encoding failures occur, such as with invalid enum values, the logs provide specific details about the schema, table, column, and value involved, as well as the changefeed and commit timestamp for broader event-level errors. This enhancement aims to streamline the debugging process for encoding-related issues.

Highlights

  • Enhanced Error Logging for Debezium Encoding: Detailed context has been added to error logs when Debezium encoding encounters failures, significantly improving debuggability.
  • Contextual Field Value Errors: The writeDebeziumFieldValues function now includes schema, table, column name, and the problematic value in its error logs, pinpointing the exact data causing issues.
  • Contextual Encoder Run Errors: The runEncoder function's error logs now contain the namespace, changefeed ID, schema, table, and commit timestamp, providing a clearer picture of encoding failures at the event level.


@pingcap-cla-assistant

pingcap-cla-assistant bot commented Jan 8, 2026

CLA assistant check
All committers have signed the CLA.

@ti-chi-bot ti-chi-bot bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 8, 2026

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly improves error logging for Debezium encoding failures by adding valuable context to log messages. The changes in writeDebeziumFieldValues and runEncoder will greatly aid in debugging. The implementation is clear and effectively meets the PR's objectives. I have a couple of suggestions to make the logging even more clear and robust.

Add detailed context to error logs when Debezium encoding fails,
including schema, table, column name, value, changefeed ID, and
commitTs. This helps debugging encoding issues like invalid enum values.

- Fix log message grammar for clarity
- Truncate large values (>1024 bytes/chars) to avoid log flooding

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@takaidohigasi takaidohigasi force-pushed the improve-debezium-error-logging branch from 5edbb2b to 41ae67e Compare January 8, 2026 05:09
@ti-chi-bot ti-chi-bot bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/needs-linked-issue labels Jan 8, 2026
@takaidohigasi
Contributor Author

testing on my laptop

@wk989898
Collaborator

wk989898 commented Jan 8, 2026

/ok-to-test

@ti-chi-bot ti-chi-bot bot added ok-to-test Indicates a PR is ready to be tested. and removed needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Jan 8, 2026
@takaidohigasi
Contributor Author

/retest

takaidohigasi added a commit to takaidohigasi/tiflow that referenced this pull request Jan 8, 2026
Add tests to verify the fix for issue pingcap#12474 where enum/set columns
with DEFAULT values receive string type values instead of uint64.

These tests demonstrate:
1. The improved error logging (PR pingcap#12484) shows schema, table, column, value
2. The fix (PR pingcap#12475) correctly handles string type enum/set values

Test output with reverted code:
[ERROR] ["failed to write Debezium field value"]
  [schema=test] [table=t_enum] [column=status] [value=active]
  [error="unexpected column value type string for enum column status"]

Related:
- Issue: pingcap#12474
- Fix: pingcap#12475
- Logging: pingcap#12484

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
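To make the failure mode in the commit message above concrete, the sketch below shows one way an encoder could accept an enum column value that arrives either as a `uint64` index or as a `string` element name. It is a hypothetical illustration only: `enumName` and its behavior are assumptions, not the code from pingcap#12475, and it simply rejects index 0.

```go
package main

import "fmt"

// enumName is a hypothetical helper that resolves an enum column value to
// its element name. elems holds the enum elements in declaration order;
// a uint64 value is treated as a 1-based index into elems.
func enumName(value any, elems []string) (string, error) {
	switch v := value.(type) {
	case uint64:
		// This sketch treats index 0 and anything past the end as invalid.
		if v == 0 || v > uint64(len(elems)) {
			return "", fmt.Errorf("enum index %d out of range", v)
		}
		return elems[v-1], nil
	case string:
		// A string value is accepted only if it names a declared element.
		for _, e := range elems {
			if e == v {
				return v, nil
			}
		}
		return "", fmt.Errorf("unknown enum element %q", v)
	default:
		return "", fmt.Errorf("unexpected column value type %T for enum column", value)
	}
}

func main() {
	elems := []string{"active", "inactive"}
	fmt.Println(enumName(uint64(2), elems))
	fmt.Println(enumName("active", elems))
	fmt.Println(enumName(3.14, elems))
}
```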
takaidohigasi added a commit to takaidohigasi/tiflow that referenced this pull request Jan 8, 2026
Revert enum/set handling to demonstrate the bug from issue pingcap#12474 where
enum/set columns with DEFAULT values receive string type values instead
of uint64, causing encoding failures.

This branch contains:
1. Reverted codec.go that does NOT handle string type enum/set values
2. Unit tests that verify the improved error logging (PR pingcap#12484)

Test output shows the improved error logging:
[ERROR] ["failed to write Debezium field value"]
  [schema=test] [table=t_enum] [column=status] [value=active]
  [error="unexpected column value type string for enum column status"]

Related:
- Issue: pingcap#12474
- Fix: pingcap#12475
- Logging: pingcap#12484

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@takaidohigasi takaidohigasi marked this pull request as ready for review January 8, 2026 07:52
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 8, 2026
@takaidohigasi
Contributor Author

The playground with TiCDC does not work on my laptop, so I added the unit tests on a separate branch.

@takaidohigasi
Contributor Author

takaidohigasi#2

@ti-chi-bot ti-chi-bot bot added the approved label Jan 9, 2026
@wk989898
Collaborator

/retest

@wk989898
Collaborator

/retest

@ti-chi-bot
Contributor

ti-chi-bot bot commented Jan 16, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: 3AceShowHand, wk989898

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:
  • OWNERS [3AceShowHand,wk989898]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jan 16, 2026
@ti-chi-bot
Contributor

ti-chi-bot bot commented Jan 16, 2026

[LGTM Timeline notifier]

Timeline:

  • 2026-01-09 04:36:24.487413074 +0000 UTC m=+936140.305721506: ☑️ agreed by wk989898.
  • 2026-01-16 09:41:58.084299421 +0000 UTC m=+141345.698256277: ☑️ agreed by 3AceShowHand.

@wk989898
Collaborator

/retest

@takaidohigasi
Contributor Author

/retest

@takaidohigasi
Contributor Author

#12480

was merged

@takaidohigasi
Contributor Author

++ jq -r .info

+ info='Please check whether PD is online and TiKV Regions are all available. If PD is offline or some TiKV regions are not available, it means that the data syncing process is complete. To check whether TiKV regions are all available, you can view '\''TiKV-Details'\'' > '\''Resolved-Ts'\'' > '\''Max Leader Resolved TS gap'\'' on Grafana. If the gap is large, such as a few minutes, it means that some regions in TiKV are unavailable. Otherwise, if the gap is small and PD is online, it means the data syncing is incomplete, so please wait'

+ target_message='Please check whether PD is online and TiKV Regions are all available. If PD is offline or some TiKV regions are not available, it means that the data syncing process is complete. To check whether TiKV regions are all available, you can view '\''TiKV-Details'\'' > '\''Resolved-Ts'\'' > '\''Max Leader Resolved TS gap'\'' on Grafana. If the gap is large, such as a few minutes, it means that some regions in TiKV are unavailable. Otherwise, if the gap is small and PD is online, it means the data syncing is incomplete, so please wait'

+ '[' 'Please check whether PD is online and TiKV Regions are all available. If PD is offline or some TiKV regions are not available, it means that the data syncing process is complete. To check whether TiKV regions are all available, you can view '\''TiKV-Details'\'' > '\''Resolved-Ts'\'' > '\''Max Leader Resolved TS gap'\'' on Grafana. If the gap is large, such as a few minutes, it means that some regions in TiKV are unavailable. Otherwise, if the gap is small and PD is online, it means the data syncing is incomplete, so please wait' '!=' 'Please check whether PD is online and TiKV Regions are all available. If PD is offline or some TiKV regions are not available, it means that the data syncing process is complete. To check whether TiKV regions are all available, you can view '\''TiKV-Details'\'' > '\''Resolved-Ts'\'' > '\''Max Leader Resolved TS gap'\'' on Grafana. If the gap is large, such as a few minutes, it means that some regions in TiKV are unavailable. Otherwise, if the gap is small and PD is online, it means the data syncing is incomplete, so please wait' ']'

+ cleanup_process cdc.test

wait process cdc.test exit for 1-th time...

wait process cdc.test exit for 2-th time...

cdc.test: no process found

wait process cdc.test exit for 3-th time...

process cdc.test already exit

+ stop_tidb_cluster

+ run_case_with_unavailable_tidb conf/changefeed-redo.toml

+ rm -rf /tmp/tidb_cdc_test/synced_status_with_redo

+ mkdir -p /tmp/tidb_cdc_test/synced_status_with_redo

+ start_tidb_cluster --workdir /tmp/tidb_cdc_test/synced_status_with_redo

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

chdir: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

The 1 times to try to start tidb cluster...

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

chdir: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

chdir: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

++ stop_tidb_cluster

shell-init: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

chdir: error retrieving current directory: getcwd: cannot access parent directories: No such file or directory

script returned exit code 143

Timeout has been exceeded

?

@wk989898
Collaborator

/retest

@takaidohigasi
Contributor Author

/retest

@takaidohigasi
Contributor Author

takaidohigasi commented Jan 30, 2026

All the tests are failing for page 4
https://github.com/pingcap/tiflow/pulls

@wk989898
Collaborator

wk989898 commented Feb 3, 2026

/retest

@takaidohigasi
Contributor Author

this PR test succeeded ...
#12483

@takaidohigasi
Contributor Author

/test pull-cdc-integration-kafka-test

@takaidohigasi
Contributor Author

/retest

@takaidohigasi
Contributor Author

finally test passed!

@wk989898
Collaborator

wk989898 commented Feb 4, 2026

/release-note-none

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Feb 4, 2026
@ti-chi-bot ti-chi-bot bot merged commit 57f55e3 into pingcap:master Feb 4, 2026
24 checks passed
@takaidohigasi
Contributor Author

/test pull-verify

@takaidohigasi
Contributor Author

thanks so much for your help

@ti-chi-bot
Contributor

ti-chi-bot bot commented Feb 4, 2026

@takaidohigasi: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
pull-verify 41ae67e link unknown /test pull-verify
pull-cdc-integration-storage-test 41ae67e link unknown /test pull-cdc-integration-storage-test

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@wk989898 wk989898 added the needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. label Feb 5, 2026
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #12517.


Labels

approved contribution This PR is from a community contributor. first-time-contributor Indicates that the PR was contributed by an external member and is a first-time contributor. lgtm needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CDC Debezium sink does not have enough information when errored

4 participants