mysql: transcode non-utf8 string/ENUM/SET columns to UTF-8 on CDC path#4436
mysql: transcode non-utf8 string/ENUM/SET columns to UTF-8 on CDC path#4436dtunikov wants to merge 6 commits into
Conversation
Binlog row events carry strings as raw bytes in each column's own charset, and unlike the snapshot path (SET NAMES utf8mb4) the server does no transcoding. PeerDB cast those bytes directly to Go strings, so any non-utf8 column (latin1 was the MySQL default through 5.7, plus gbk/sjis/etc.) reached the destination as invalid-UTF-8 mojibake on CDC while the snapshot was correct -- the same source value landed in two encodings. Resolve a decoder per column from the TABLE_MAP collation metadata (CollationMap / EnumSetCollationMap) and transcode string values and ENUM/SET labels to UTF-8 via golang.org/x/text. utf8/utf8mb3/utf8mb4/ascii/binary keep the original zero-cost passthrough; unknown/unsupported charsets warn once and fall back to raw bytes. Collation -> charset is sourced from information_schema.COLLATIONS, cached on the connector. Adds unit tests for the transcoders and a latin1/gbk e2e asserting snapshot and CDC rows are byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR fixes MySQL CDC string decoding so binlog row-event values for non-UTF8 STRING/ENUM/SET columns are transcoded to UTF-8 using per-column collation metadata (matching the snapshot path’s SET NAMES utf8mb4 behavior), preventing CDC/snapshot encoding divergence (DBI-810).
Changes:
- Resolve a per-column charset decoder from
TABLE_MAPcollation metadata and apply it when converting CDC row values (including ENUM/SET label tables). - Add connector-side caching for
collation_id -> CHARACTER_SET_NAMEfrominformation_schema.COLLATIONS, plus warning deduplication. - Add unit + e2e tests to ensure snapshot and CDC output are consistent for latin1/gbk columns.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| flow/go.mod | Makes golang.org/x/text a direct dependency for charset decoding. |
| flow/go.sum | Updates sums to reflect module dependency changes. |
| flow/connectors/mysql/qvalue_convert.go | Adds per-column decoder input and applies transcoding for string/enum values in CDC conversion. |
| flow/connectors/mysql/mysql.go | Adds cached collation→charset map and warning dedupe state to the connector. |
| flow/connectors/mysql/charset.go | Implements collation→charset resolution and byte/string decoding helpers using x/text. |
| flow/connectors/mysql/charset_test.go | Adds unit tests for decoding and CDC conversion behavior. |
| flow/connectors/mysql/cdc.go | Resolves per-column encodings from TABLE_MAP metadata and transcodes ENUM/SET labels + row values. |
| flow/e2e/clickhouse_mysql_test.go | Adds an e2e regression test asserting snapshot/CDC byte-identical results for latin1/gbk columns. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. |
❌ 1 Tests Failed:
View the top 1 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
🔄 Flaky Test DetectedAnalysis: TestMySQLSSHKeepaliveTunnelDown failed on a timing-sensitive, toxiproxy-based network-failure simulation (keepalive detection succeeded, but the blocked SELECT SLEEP(60) didn't error within the hard-coded 10s window) — a race unrelated to the PR's charset-transcoding changes and failing on only one matrix variant. ✅ Automatically retrying the workflow |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
| // does no transcoding here. Resolve a decoder per column from the TABLE_MAP | ||
| // collation metadata so non-utf8 columns (e.g. latin1/gbk/sjis) reach the | ||
| // destination as valid UTF-8 instead of mojibake. nil = already UTF-8, no-op. | ||
| colEncodings := make([]encoding.Encoding, len(ev.Table.ColumnType)) |
There was a problem hiding this comment.
so, here we alloc a slice for every row in cdc
which is not ideal, ofc
we could maybe cache it with smht like map[columnName] -> encoding.Encoding
in that case though we would need to handle column collation changes, need to check if we can do that reliably
i don't know though if introduced complexity will be worth it
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| rdsAuth *utils.RDSAuth | ||
| serverVersion string | ||
| collationCharset atomic.Pointer[map[uint64]string] | ||
| warnedCharsets sync.Map |
There was a problem hiding this comment.
to avoid spam in logs for unknown charsets
| return nil, nil | ||
| } | ||
|
|
||
| func (c *MySqlConnector) charsetForCollation(ctx context.Context, collationID uint64) (string, error) { |
There was a problem hiding this comment.
convert int collation to string value which later gets mapped into encoding.Encoding
Problem
Linear: DBI-810
Binlog row events carry string/ENUM/SET data as raw bytes in each column's own character set. Unlike the snapshot path — which runs
SET NAMES utf8mb4so the server transcodes before sending — the CDC path got the bytes verbatim and cast them straight to Go strings (qvalue_convert.go).Result: for any non-utf8 column (
latin1was the MySQL server default through 5.7, plusgbk/sjis/etc.), every non-ASCII character reached the destination as invalid-UTF-8 mojibake on CDC while the snapshot was correct — the same source value landing in two different encodings. ENUM/SET labels were equally raw. The collation needed to fix this was already present inTABLE_MAPmetadata but unused for decoding.Fix
x/textdecoder per column from theTABLE_MAPcollation metadata (CollationMap()for character columns,EnumSetCollationMap()for enum/set).utf8/utf8mb3/utf8mb4/ascii/binaryresolve to a nil decoder → original zero-cost passthrough, so the common case is unaffected.information_schema.COLLATIONSand cached on the connector (covers MySQL + MariaDB without a hardcoded ~300-entry table).Note: MySQL's
latin1is Windows-1252 (not ISO-8859-1) — handled explicitly.Tests
charset_test.go: unit tests for the transcoders (latin1/gbk/sjis/euckr + passthrough), and a before/after assertion that a latin1 value was mojibake and is now correctly decoded.Test_MySQL_Charset_Consistency(e2e): latin1 + gbk columns, asserts snapshot and CDC rows are byte-identical and correctly decoded.🤖 Generated with Claude Code