mysql: transcode non-utf8 string/ENUM/SET columns to UTF-8 on CDC path by dtunikov · Pull Request #4436 · PeerDB-io/peerdb

dtunikov · 2026-06-16T14:24:20Z

Problem

Binlog row events carry string/ENUM/SET data as raw bytes in each column's own character set. Unlike the snapshot path — which runs SET NAMES utf8mb4 so the server transcodes before sending — the CDC path got the bytes verbatim and cast them straight to Go strings (qvalue_convert.go).

Result: for any non-utf8 column (latin1 was the MySQL server default through 5.7, plus gbk/sjis/etc.), every non-ASCII character reached the destination as invalid-UTF-8 mojibake on CDC while the snapshot was correct — the same source value landing in two different encodings. ENUM/SET labels were equally raw. The collation needed to fix this was already present in TABLE_MAP metadata but unused for decoding.

Fix

Resolve an x/text decoder per column from the TABLE_MAP collation metadata (CollationMap() for character columns, EnumSetCollationMap() for enum/set).
Transcode string values and ENUM/SET label tables to UTF-8.
utf8/utf8mb3/utf8mb4/ascii/binary resolve to a nil decoder → original zero-cost passthrough, so the common case is unaffected.
Collation id → charset name is sourced from information_schema.COLLATIONS and cached on the connector (covers MySQL + MariaDB without a hardcoded ~300-entry table).
Graceful degradation: unknown collation / unsupported charset / missing binlog row metadata → warn once and pass bytes through (no worse than today), rather than halting the mirror.

Note: MySQL's latin1 is Windows-1252 (not ISO-8859-1) — handled explicitly.

Tests

charset_test.go: unit tests for the transcoders (latin1/gbk/sjis/euckr + passthrough), and a before/after assertion that a latin1 value was mojibake and is now correctly decoded.
Test_MySQL_Charset_Consistency (e2e): latin1 + gbk columns, asserts snapshot and CDC rows are byte-identical and correctly decoded.

🤖 Generated with Claude Code

Binlog row events carry strings as raw bytes in each column's own charset, and unlike the snapshot path (SET NAMES utf8mb4) the server does no transcoding. PeerDB cast those bytes directly to Go strings, so any non-utf8 column (latin1 was the MySQL default through 5.7, plus gbk/sjis/etc.) reached the destination as invalid-UTF-8 mojibake on CDC while the snapshot was correct -- the same source value landed in two encodings. Resolve a decoder per column from the TABLE_MAP collation metadata (CollationMap / EnumSetCollationMap) and transcode string values and ENUM/SET labels to UTF-8 via golang.org/x/text. utf8/utf8mb3/utf8mb4/ascii/binary keep the original zero-cost passthrough; unknown/unsupported charsets warn once and fall back to raw bytes. Collation -> charset is sourced from information_schema.COLLATIONS, cached on the connector. Adds unit tests for the transcoders and a latin1/gbk e2e asserting snapshot and CDC rows are byte-identical. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR fixes MySQL CDC string decoding so binlog row-event values for non-UTF8 STRING/ENUM/SET columns are transcoded to UTF-8 using per-column collation metadata (matching the snapshot path’s SET NAMES utf8mb4 behavior), preventing CDC/snapshot encoding divergence (DBI-810).

Changes:

Resolve a per-column charset decoder from TABLE_MAP collation metadata and apply it when converting CDC row values (including ENUM/SET label tables).
Add connector-side caching for collation_id -> CHARACTER_SET_NAME from information_schema.COLLATIONS, plus warning deduplication.
Add unit + e2e tests to ensure snapshot and CDC output are consistent for latin1/gbk columns.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file

File	Description
flow/go.mod	Makes `golang.org/x/text` a direct dependency for charset decoding.
flow/go.sum	Updates sums to reflect module dependency changes.
flow/connectors/mysql/qvalue_convert.go	Adds per-column decoder input and applies transcoding for string/enum values in CDC conversion.
flow/connectors/mysql/mysql.go	Adds cached collation→charset map and warning dedupe state to the connector.
flow/connectors/mysql/charset.go	Implements collation→charset resolution and byte/string decoding helpers using `x/text`.
flow/connectors/mysql/charset_test.go	Adds unit tests for decoding and CDC conversion behavior.
flow/connectors/mysql/cdc.go	Resolves per-column encodings from TABLE_MAP metadata and transcodes ENUM/SET labels + row values.
flow/e2e/clickhouse_mysql_test.go	Adds an e2e regression test asserting snapshot/CDC byte-identical results for latin1/gbk columns.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

claude · 2026-06-16T14:42:02Z

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

codecov · 2026-06-16T14:42:03Z

❌ 1 Tests Failed:

Tests completed	Failed	Passed	Skipped
2306	1	2305	197

View the top 1 failed test(s) by shortest run time

github.com/PeerDB-io/peerdb/flow/connectors/mysql::TestMySQLSSHKeepaliveTunnelDown

Stack Traces | 40.1s run time

=== RUN   TestMySQLSSHKeepaliveTunnelDown
=== PAUSE TestMySQLSSHKeepaliveTunnelDown
=== CONT  TestMySQLSSHKeepaliveTunnelDown
2026/06/16 14:29:13 INFO Setting up SSH connection x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} Server=localhost:2222
    ssh_keepalive_test.go:134: Disabling proxy to simulate network failure during long-running query
2026/06/16 14:29:15 INFO syncer is closing... x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
    ssh_keepalive_test.go:134: Waiting for SSH keepalive failure detection...
2026/06/16 14:29:43 ERROR Failed to close SSH client x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} error="close tcp [::1]:41876->[::1]:12001: use of closed network connection"
    ssh_keepalive_test.go:134: SSH keepalive failure detected successfully
2026/06/16 14:29:43 ERROR Previous keepalive request still pending, marking tunnel as bad x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/06/16 14:29:43 INFO SSH keepalive failed, closing connection x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
    ssh_keepalive_test.go:134: Long-running query should have failed after SSH tunnel broke
2026/06/16 14:29:53 INFO Closing MySQL connector x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/06/16 14:29:53 ERROR [mysql] failed to close SSH tunnel x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} error="close tcp [::1]:41876->[::1]:12001: use of closed network connection"
--- FAIL: TestMySQLSSHKeepaliveTunnelDown (40.08s)

To view more test analytics, go to the Test Analytics Dashboard
_{📋 Got 3 mins? Take this short survey to help us improve Test Analytics.}

github-actions · 2026-06-16T14:47:26Z

🔄 Flaky Test Detected

Analysis: TestMySQLSSHKeepaliveTunnelDown failed on a timing-sensitive, toxiproxy-based network-failure simulation (keepalive detection succeeded, but the blocked SELECT SLEEP(60) didn't error within the hard-coded 10s window) — a race unrelated to the PR's charset-transcoding changes and failing on only one matrix variant.
Confidence: 0.85

✅ Automatically retrying the workflow

View workflow run

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

dtunikov · 2026-06-16T14:51:10Z

+				// does no transcoding here. Resolve a decoder per column from the TABLE_MAP
+				// collation metadata so non-utf8 columns (e.g. latin1/gbk/sjis) reach the
+				// destination as valid UTF-8 instead of mojibake. nil = already UTF-8, no-op.
+				colEncodings := make([]encoding.Encoding, len(ev.Table.ColumnType))


so, here we alloc a slice for every row in cdc
which is not ideal, ofc
we could maybe cache it with smht like map[columnName] -> encoding.Encoding
in that case though we would need to handle column collation changes, need to check if we can do that reliably
i don't know though if introduced complexity will be worth it

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dtunikov · 2026-06-17T12:50:50Z

 	rdsAuth               *utils.RDSAuth
 	serverVersion         string
+	collationCharset      atomic.Pointer[map[uint64]string]
+	warnedCharsets        sync.Map


to avoid spam in logs for unknown charsets

dtunikov · 2026-06-17T12:52:56Z

+	return nil, nil
+}
+
+func (c *MySqlConnector) charsetForCollation(ctx context.Context, collationID uint64) (string, error) {


convert int collation to string value which later gets mapped into encoding.Encoding

dtunikov requested a review from a team as a code owner June 16, 2026 14:24

mysql: reorder MySqlConnector fields to satisfy fieldalignment

598b2be

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dtunikov requested a review from Copilot June 16, 2026 14:34

Copilot started reviewing on behalf of dtunikov June 16, 2026 14:34 View session

Copilot AI reviewed Jun 16, 2026

View reviewed changes

Comment thread flow/connectors/mysql/cdc.go Outdated

Potential fix for pull request finding

56c1818

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

dtunikov commented Jun 16, 2026

View reviewed changes

mysql: fix import ordering to satisfy gci

016aece

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

ilidemi reviewed Jun 16, 2026

View reviewed changes

Comment thread flow/e2e/clickhouse_mysql_test.go Outdated

dtunikov and others added 2 commits June 16, 2026 19:44

cleanup comments

3f41047

mysql: fix import ordering in mysql.go (gci)

1a7571e

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

dtunikov commented Jun 17, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mysql: transcode non-utf8 string/ENUM/SET columns to UTF-8 on CDC path#4436

mysql: transcode non-utf8 string/ENUM/SET columns to UTF-8 on CDC path#4436
dtunikov wants to merge 6 commits into
mainfrom
dbi-810-cdc-charset-transcoding

dtunikov commented Jun 16, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

claude Bot commented Jun 16, 2026

Uh oh!

codecov Bot commented Jun 16, 2026

Uh oh!

github-actions Bot commented Jun 16, 2026

Uh oh!

dtunikov Jun 16, 2026 •

edited

Loading

Uh oh!

Uh oh!

dtunikov Jun 17, 2026

Uh oh!

dtunikov Jun 17, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

dtunikov commented Jun 16, 2026

Problem

Fix

Tests

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

claude Bot commented Jun 16, 2026

Code review

Uh oh!

codecov Bot commented Jun 16, 2026

❌ 1 Tests Failed:

Uh oh!

github-actions Bot commented Jun 16, 2026

🔄 Flaky Test Detected

Uh oh!

dtunikov Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

dtunikov Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

dtunikov Jun 17, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dtunikov Jun 16, 2026 •

edited

Loading