fix(mysql): handle transaction payload event#4433
Conversation
❌ 1 Tests Failed:
View the top 1 failed test(s) by shortest run time
To view more test analytics, go to the Test Analytics Dashboard |
🔄 Flaky Test DetectedAnalysis: The failures are flaky: the ClickHouse S3 test TestIAMRoleCanIssueSelectFromS3 returned COUNT 0 instead of 3 in one job but passed in the sibling job on the same commit, and the second job's mass e2e failures are uniform ~61s timeouts ("stuck in snapshot somehow") — all in ClickHouse paths unrelated to this MySQL-focused PR. ✅ Automatically retrying the workflow |
❌ Test FailureAnalysis: The flow/e2e package hit a clean 1200s package-level timeout (no assertion failures) on only the mysql-gtid shard, which is exactly the code path modified by this MySQL transaction-payload-event branch — a CDC-stall-shaped hang that points to a likely real regression rather than generic flakiness. |
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. |
cf3826e to
bf92b75
Compare
🔄 Flaky Test DetectedAnalysis: The entire flow/e2e package hit the hard 1200s wall-clock timeout on only the mysql-gtid cell (sibling mysql-pos and maria cells passed) with no specific assertion failure, panic, or race — the classic signature of a high-concurrency e2e timeout flake, though a GTID-path hang regression can't be fully ruled out given the MySQL-GTID PR. ✅ Automatically retrying the workflow |
There was a problem hiding this comment.
Pull request overview
This PR improves MySQL/MariaDB CDC robustness by handling MySQL’s compressed transaction binlog format (TRANSACTION_PAYLOAD_EVENT), detecting binlog incident events (e.g. LOST_EVENTS) as an invalid-binlog/resync condition, and covering additional GTID event variants (MariaDB GTID + MySQL tagged GTIDs).
Changes:
- Add CDC support for
TRANSACTION_PAYLOAD_EVENTby processing inner events and correctly advancing checkpoints (GTID and file/pos). - Detect
INCIDENT_EVENTand surface a dedicated “binlog invalid / resync required” error, including alert classification. - Add MySQL version gating + an e2e test that enables
binlog_transaction_compressionand asserts a payload event was actually emitted.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| flow/shared/exceptions/mysql.go | Adds MySQLBinlogIncidentError for resync-required incident events. |
| flow/pkg/mysql/validation.go | Adds minimum MySQL version constant for binlog transaction compression support. |
| flow/e2e/mysql.go | Adds helper to scan binlog for TRANSACTION_PAYLOAD_EVENT to validate test coverage. |
| flow/e2e/clickhouse_mysql_test.go | Adds e2e test exercising compressed transaction payload CDC path (MySQL-only, version-gated). |
| flow/connectors/mysql/cdc.go | Adds decoding/processing for TransactionPayloadEvent, accumulates GTID set, and detects incident events. |
| flow/alerting/classifier.go | Classifies MySQLBinlogIncidentError as “binlog invalid” for alerting. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
🔄 Flaky Test DetectedAnalysis: The flow/e2e package hit its 20-minute timeout because TestApiPg/TestDropMissing (11m24s) and TestApiMy/TestDropMissing (5m14s) hung waiting on a Temporal drop-flow workflow — an infra/flaky hang unrelated to this MySQL branch, and the same suite passed on the other two identical matrix shards. ✅ Automatically retrying the workflow |
❌ Test FailureAnalysis: The e2e suite hung to a hard 20-minute package timeout (with MySQL-connector goroutines stuck on channel receives) exclusively in the mysql-gtid matrix variant while mysql-pos and maria passed, which correlates directly with this PR's change to MySQL GTID-mode transaction-payload event handling and indicates a real CDC deadlock/hang regression rather than flakiness. |
| // StartSyncGTID keeps the passed set by reference (as the syncer's prevGset) and mutates it | ||
| // from the syncer goroutine as GTIDs arrive. Give it an independent clone so PullRecords can | ||
| // accumulate GTIDs into `gset` (advanceGset) and read it without racing the syncer on the same map. | ||
| stream, err := syncer.StartSyncGTID(gset.Clone()) |
There was a problem hiding this comment.
otherwise we get:
fatal error: concurrent map read and map write
goroutine 11211 [running]:
internal/runtime/maps.fatal({0x5a6f5d5?, 0x5da7890cb860?})
/usr/local/go/src/runtime/panic.go:1181 +0x20
github.com/go-mysql-org/go-mysql/mysql.(*MysqlGTIDSet).AddGTIDWithTag(0x5da78691b040, {0xdc, 0x3b, 0xb4, 0x9c, 0x5e, 0x84, 0x11, 0xf1, 0xa0, ...}, ...)
because both go-mysql binlogsyncer and pullRecords read/write the same gset object (before that we never hit this case because we always read a clone and never modified it gset = ev.GSet)
14976f7 to
5c8b6ae
Compare
❌ Test FailureAnalysis: The flow/e2e package hung to its 1200s timeout only in the mysql-gtid combo (while mysql-pos and maria passed) and reproduced on a re-run, indicating a real transaction-event-handling hang in exactly the code path this MySQL PR modifies — not a flake. |
Uh oh!
There was an error while loading. Please reload this page.