Skip to content

mysql: transcode non-utf8 string/ENUM/SET columns to UTF-8 on CDC path#4436

Open
dtunikov wants to merge 6 commits into
mainfrom
dbi-810-cdc-charset-transcoding
Open

mysql: transcode non-utf8 string/ENUM/SET columns to UTF-8 on CDC path#4436
dtunikov wants to merge 6 commits into
mainfrom
dbi-810-cdc-charset-transcoding

Conversation

@dtunikov

Copy link
Copy Markdown
Collaborator

Problem

Linear: DBI-810

Binlog row events carry string/ENUM/SET data as raw bytes in each column's own character set. Unlike the snapshot path — which runs SET NAMES utf8mb4 so the server transcodes before sending — the CDC path got the bytes verbatim and cast them straight to Go strings (qvalue_convert.go).

Result: for any non-utf8 column (latin1 was the MySQL server default through 5.7, plus gbk/sjis/etc.), every non-ASCII character reached the destination as invalid-UTF-8 mojibake on CDC while the snapshot was correct — the same source value landing in two different encodings. ENUM/SET labels were equally raw. The collation needed to fix this was already present in TABLE_MAP metadata but unused for decoding.

Fix

  • Resolve an x/text decoder per column from the TABLE_MAP collation metadata (CollationMap() for character columns, EnumSetCollationMap() for enum/set).
  • Transcode string values and ENUM/SET label tables to UTF-8.
  • utf8/utf8mb3/utf8mb4/ascii/binary resolve to a nil decoder → original zero-cost passthrough, so the common case is unaffected.
  • Collation id → charset name is sourced from information_schema.COLLATIONS and cached on the connector (covers MySQL + MariaDB without a hardcoded ~300-entry table).
  • Graceful degradation: unknown collation / unsupported charset / missing binlog row metadata → warn once and pass bytes through (no worse than today), rather than halting the mirror.

Note: MySQL's latin1 is Windows-1252 (not ISO-8859-1) — handled explicitly.

Tests

  • charset_test.go: unit tests for the transcoders (latin1/gbk/sjis/euckr + passthrough), and a before/after assertion that a latin1 value was mojibake and is now correctly decoded.
  • Test_MySQL_Charset_Consistency (e2e): latin1 + gbk columns, asserts snapshot and CDC rows are byte-identical and correctly decoded.

🤖 Generated with Claude Code

Binlog row events carry strings as raw bytes in each column's own charset,
and unlike the snapshot path (SET NAMES utf8mb4) the server does no transcoding.
PeerDB cast those bytes directly to Go strings, so any non-utf8 column
(latin1 was the MySQL default through 5.7, plus gbk/sjis/etc.) reached the
destination as invalid-UTF-8 mojibake on CDC while the snapshot was correct --
the same source value landed in two encodings.

Resolve a decoder per column from the TABLE_MAP collation metadata
(CollationMap / EnumSetCollationMap) and transcode string values and ENUM/SET
labels to UTF-8 via golang.org/x/text. utf8/utf8mb3/utf8mb4/ascii/binary keep
the original zero-cost passthrough; unknown/unsupported charsets warn once and
fall back to raw bytes. Collation -> charset is sourced from
information_schema.COLLATIONS, cached on the connector.

Adds unit tests for the transcoders and a latin1/gbk e2e asserting snapshot
and CDC rows are byte-identical.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@dtunikov dtunikov requested a review from a team as a code owner June 16, 2026 14:24
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR fixes MySQL CDC string decoding so binlog row-event values for non-UTF8 STRING/ENUM/SET columns are transcoded to UTF-8 using per-column collation metadata (matching the snapshot path’s SET NAMES utf8mb4 behavior), preventing CDC/snapshot encoding divergence (DBI-810).

Changes:

  • Resolve a per-column charset decoder from TABLE_MAP collation metadata and apply it when converting CDC row values (including ENUM/SET label tables).
  • Add connector-side caching for collation_id -> CHARACTER_SET_NAME from information_schema.COLLATIONS, plus warning deduplication.
  • Add unit + e2e tests to ensure snapshot and CDC output are consistent for latin1/gbk columns.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
flow/go.mod Makes golang.org/x/text a direct dependency for charset decoding.
flow/go.sum Updates sums to reflect module dependency changes.
flow/connectors/mysql/qvalue_convert.go Adds per-column decoder input and applies transcoding for string/enum values in CDC conversion.
flow/connectors/mysql/mysql.go Adds cached collation→charset map and warning dedupe state to the connector.
flow/connectors/mysql/charset.go Implements collation→charset resolution and byte/string decoding helpers using x/text.
flow/connectors/mysql/charset_test.go Adds unit tests for decoding and CDC conversion behavior.
flow/connectors/mysql/cdc.go Resolves per-column encodings from TABLE_MAP metadata and transcodes ENUM/SET labels + row values.
flow/e2e/clickhouse_mysql_test.go Adds an e2e regression test asserting snapshot/CDC byte-identical results for latin1/gbk columns.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread flow/connectors/mysql/cdc.go Outdated
@claude

claude Bot commented Jun 16, 2026

Copy link
Copy Markdown

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

@codecov

codecov Bot commented Jun 16, 2026

Copy link
Copy Markdown

❌ 1 Tests Failed:

Tests completed Failed Passed Skipped
2306 1 2305 197
View the top 1 failed test(s) by shortest run time
github.com/PeerDB-io/peerdb/flow/connectors/mysql::TestMySQLSSHKeepaliveTunnelDown
Stack Traces | 40.1s run time
=== RUN   TestMySQLSSHKeepaliveTunnelDown
=== PAUSE TestMySQLSSHKeepaliveTunnelDown
=== CONT  TestMySQLSSHKeepaliveTunnelDown
2026/06/16 14:29:13 INFO Setting up SSH connection x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} Server=localhost:2222
    ssh_keepalive_test.go:134: Disabling proxy to simulate network failure during long-running query
2026/06/16 14:29:15 INFO syncer is closing... x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
    ssh_keepalive_test.go:134: Waiting for SSH keepalive failure detection...
2026/06/16 14:29:43 ERROR Failed to close SSH client x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} error="close tcp [::1]:41876->[::1]:12001: use of closed network connection"
    ssh_keepalive_test.go:134: SSH keepalive failure detected successfully
2026/06/16 14:29:43 ERROR Previous keepalive request still pending, marking tunnel as bad x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/06/16 14:29:43 INFO SSH keepalive failed, closing connection x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
    ssh_keepalive_test.go:134: Long-running query should have failed after SSH tunnel broke
2026/06/16 14:29:53 INFO Closing MySQL connector x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN}
2026/06/16 14:29:53 ERROR [mysql] failed to close SSH tunnel x-peerdb-additional-metadata={Operation:FLOW_OPERATION_UNKNOWN} error="close tcp [::1]:41876->[::1]:12001: use of closed network connection"
--- FAIL: TestMySQLSSHKeepaliveTunnelDown (40.08s)

To view more test analytics, go to the Test Analytics Dashboard
📋 Got 3 mins? Take this short survey to help us improve Test Analytics.

@github-actions

Copy link
Copy Markdown
Contributor

🔄 Flaky Test Detected

Analysis: TestMySQLSSHKeepaliveTunnelDown failed on a timing-sensitive, toxiproxy-based network-failure simulation (keepalive detection succeeded, but the blocked SELECT SLEEP(60) didn't error within the hard-coded 10s window) — a race unrelated to the PR's charset-transcoding changes and failing on only one matrix variant.
Confidence: 0.85

✅ Automatically retrying the workflow

View workflow run

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
// does no transcoding here. Resolve a decoder per column from the TABLE_MAP
// collation metadata so non-utf8 columns (e.g. latin1/gbk/sjis) reach the
// destination as valid UTF-8 instead of mojibake. nil = already UTF-8, no-op.
colEncodings := make([]encoding.Encoding, len(ev.Table.ColumnType))

@dtunikov dtunikov Jun 16, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so, here we alloc a slice for every row in cdc
which is not ideal, ofc
we could maybe cache it with smht like map[columnName] -> encoding.Encoding
in that case though we would need to handle column collation changes, need to check if we can do that reliably
i don't know though if introduced complexity will be worth it

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread flow/e2e/clickhouse_mysql_test.go Outdated
dtunikov and others added 2 commits June 16, 2026 19:44
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
rdsAuth *utils.RDSAuth
serverVersion string
collationCharset atomic.Pointer[map[uint64]string]
warnedCharsets sync.Map

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to avoid spam in logs for unknown charsets

return nil, nil
}

func (c *MySqlConnector) charsetForCollation(ctx context.Context, collationID uint64) (string, error) {

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

convert int collation to string value which later gets mapped into encoding.Encoding

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants