Under high write throughput in a sharded cluster, read transactions are not closing and report errors every second until manual intervention:
[error]: Read transaction detected that has been open too long (over one minute)
Txn {
address: 140427120637888,
timerTracked: true,
refCount: 1,
renewingRefCount: 1, ← stuck
notCurrent: true,
openTimer: 58
}
The renewingRefCount: 1, notCurrent: true combination suggests the transaction renewal cycle is failing to complete — the transaction knows it's stale but can't close itself because of a pending renewer.
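To make the suspected failure mode concrete, here is a minimal sketch of the state described above. It is a hypothetical model, not Harper's actual implementation: the `ReadTxnState` type, `canClose`, and `completeRenewal` are invented for illustration, and the close rule (stale and no pending renewer) is an assumption inferred from the log fields.

```typescript
// Hypothetical model of the fields in the logged Txn object.
// None of these names are taken from Harper's source.
interface ReadTxnState {
  refCount: number;         // active users of the transaction
  renewingRefCount: number; // renewals that have started but not completed
  notCurrent: boolean;      // true once the transaction's snapshot is stale
}

// Assumed close rule: a stale transaction may close only when no renewal is pending.
function canClose(txn: ReadTxnState): boolean {
  return txn.notCurrent && txn.renewingRefCount === 0;
}

// Assumed renewal completion: releases one pending renewal, never going below zero.
function completeRenewal(txn: ReadTxnState): void {
  if (txn.renewingRefCount > 0) {
    txn.renewingRefCount -= 1;
  }
}

// The state from the log: stale, with one renewal that never completes.
const stuck: ReadTxnState = { refCount: 1, renewingRefCount: 1, notCurrent: true };
console.log(canClose(stuck)); // false: blocked by the pending renewer

completeRenewal(stuck);
console.log(canClose(stuck)); // true: closable once the renewal completes
```

If the real close path has a guard of this shape, a renewal that is dropped or never resolves under write pressure would leave the transaction permanently unclosable, which matches the repeating error.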
Reproduction context
- 5-node sharded cluster, 50k req/s write throughput
- 1 KB record size with blobs
- Seen on 4.6.0-alpha.3
Related
- HarperFast/harper PR #304 — "make read txn timeout configurable and set default to 1min" (in review) adds a configurable timeout, but doesn't appear to address the stuck-renewer root cause.
Acceptance criteria
- Read transactions close or are forcibly recycled even when renewingRefCount stays positive.
- Under high write load, the "open too long" error storm doesn't recur.
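One possible shape for the first criterion is a watchdog that force-recycles a stale transaction once it passes the timeout, regardless of a pending renewer. This is only a sketch under stated assumptions: `TrackedTxn`, `recycleIfStuck`, and `MAX_OPEN_MS` are hypothetical names, and the one-minute default simply mirrors the threshold in the error message.

```typescript
// Hypothetical tracking record; field names echo the log but are not Harper's API.
interface TrackedTxn {
  openedAt: number;         // millisecond timestamp when the read txn was opened
  renewingRefCount: number; // pending renewals
  notCurrent: boolean;      // true once the snapshot is stale
  closed: boolean;
}

// Assumed default, matching the "over one minute" threshold in the error.
const MAX_OPEN_MS = 60_000;

// Force-close a stale transaction that has been open too long, even if a
// renewal is still pending. Returns true if the transaction was recycled.
function recycleIfStuck(
  txn: TrackedTxn,
  now: number,
  maxOpenMs: number = MAX_OPEN_MS,
): boolean {
  if (txn.closed) return false;
  const openTooLong = now - txn.openedAt > maxOpenMs;
  if (openTooLong && txn.notCurrent) {
    // Abandon the pending renewal instead of waiting on it indefinitely.
    txn.renewingRefCount = 0;
    txn.closed = true;
    return true;
  }
  return false;
}

// Usage: the stuck state from the log, checked 61 seconds after opening.
const txn: TrackedTxn = { openedAt: 0, renewingRefCount: 1, notCurrent: true, closed: false };
console.log(recycleIfStuck(txn, 61_000)); // true
```

A real fix would also need the abandoned renewer to tolerate finding its transaction already closed; that part is not sketched here.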
🤖 Filed by Claude on behalf of Kris.