fix: add stale-stream detection to sync --follow#278
Conversation
|
Codex review: needs maintainer review before merge. Reviewed June 10, 2026, 12:38 AM ET / 04:38 UTC. Summary Reproducibility: no. live current-main reproduction was established in this review. Source inspection does show the relevant no-events path: current Review metrics: 2 noteworthy metrics.
Merge readiness Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch. Rank-up moves:
Risk before merge
Maintainer options:
Next step before merge
Security Review detailsBest possible solution: Merge the opt-in stale threshold implementation only after maintainers accept the public surfaces and message-delivery tradeoff, keeping the default disabled and the regression coverage in place. Do we have a high-confidence way to reproduce the issue? No live current-main reproduction was established in this review. Source inspection does show the relevant no-events path: current Is this the best way to solve the issue? Yes, conditionally: an opt-in threshold with default disabled plus heartbeat reporting is a narrow maintainable approach. The force-close/reconnect heuristic and new public surfaces still need maintainer product acceptance. AGENTS.md: found and applied where relevant. Codex review notes: model gpt-5.5, reasoning high; reviewed against faf32dacf93b. Label changesLabel changes:
Label justifications:
Evidence reviewedWhat I checked:
Likely related people:
What the crustacean ranks mean
Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics. How this review workflow works
|
Live runtime verification (macOS, arm64)Built from this branch ( Gate checksPR-specific unit tests (all PASS)Live sync — heartbeat file
Live sync — stale detection (5s threshold)
doctor output{
"store": {
"last_sync_at": "2026-06-09T01:39:52Z",
"last_activity_at": "2026-06-09T01:41:55Z"
}
}
@clawsweeper re-review |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
|
@clawsweeper re-review Live proof has been moved from the comment into the PR body under "How it was tested". The proof was built and tested against the current HEAD (3d22e44) which includes the force-close reconnect behavior. |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
…ngelog edits - Validate --stale-threshold >= 1s to prevent time.NewTicker(0) panic - Remove openclaw#278 entries from CHANGELOG.md (release-owned)
|
@clawsweeper re-review Addressed both P3 findings:
Live proof is now in the PR body under "How it was tested". |
|
🦞🧹 I asked ClawSweeper to review this item again. Re-review progress:
|
Add built-in stale detection so `sync --follow` automatically reconnects when the WhatsApp connection is alive but no events are arriving. Changes: - Add `--stale-threshold` flag to `sync` (default: 0 = disabled). When set, the follow loop checks whether any events have been received within the threshold window; if not, it emits a `stale` event and forces a reconnect. - Write a `HEARTBEAT` file to the store directory (throttled to once per minute) on every event, so external processes can monitor sync freshness. - Add `store.last_activity_at` to `doctor --json` output by reading the heartbeat file, enabling automation to distinguish "connected but stale" from healthy. Closes openclaw#268
- Reset lastEvent and stale ticker after successful stale reconnect so the idle duration starts fresh instead of accumulating. - Use fsutil.WritePrivateFile for HEARTBEAT (0600) instead of os.WriteFile (0644), matching the repo store-file convention. - Add regression tests: TestSyncFollowStaleReconnectResetsIdleDuration, TestHeartbeatFileHasOwnerOnlyPermissions.
…ngelog edits - Validate --stale-threshold >= 1s to prevent time.NewTicker(0) panic - Remove openclaw#278 entries from CHANGELOG.md (release-owned)
c9a5a32 to
5a2d368
Compare
steipete
left a comment
There was a problem hiding this comment.
Approved after maintainer rewrite and full verification. The final design keeps stale reconnects on whatsmeow keepalive-timeout signals, serializes close/reconnect in the follow loop, and preserves HEARTBEAT as observed activity rather than process liveness.
|
Landed as 3648912. Design/proof summary:
Local verification on 5a2d368:
CI on 5a2d368: test, linux-release-builds, docker, Socket Project Report, and Socket Pull Request Alerts all passed. Live follow proof limitation: no safe authenticated local WhatsApp account was available in this checkout. Redacted checks showed accounts=[] and the default store authenticated=false, connected=false. The live-provider behavior is therefore covered by focused fake-WA/protocol regression tests around KeepAliveTimeout, stale NDJSON emission, serialized reconnect, heartbeat writes, and doctor output. Issue #268 closed via the squash merge. |
What changed
Add built-in stale detection so
sync --followautomatically reconnects when the WhatsApp connection is alive but no events are arriving.Changes
--stale-thresholdflag onsynccommand (default: 0 = disabled). When set, the follow loop checks whether any events have been received within the threshold window; if not, it emits astaleNDJSON event and forces a reconnect viaReconnectWithBackoff.HEARTBEATfile written to the store directory (throttled to once per minute) on every event. This lets external processes monitor sync freshness without querying the database.store.last_activity_atindoctor --jsonoutput, read from the heartbeat file. Enables automation to distinguish "connected but stale" from healthy.Why
Issue #268:
wacli sync --followcan remain active/connected under systemd while the local store freshness stops advancing. Downstream automations miss messages even though the sync service appears healthy.How it was tested
Gate checks (all pass):
Unit tests (all PASS):
Live runtime verification (macOS, arm64, built from this branch):
Heartbeat file — sync creates and updates the store heartbeat automatically:
0600(owner-only, viaWritePrivateFile)doctor --jsonpicks it up aslast_activity_atStale detection — stale event fires after threshold, reconnect loop is stable across multiple stale cycles:
idle_durationstays at ~5s each cycle (timer reset works, no accumulation)wa.Close()→ReconnectWithBackoff→ reset timer (matching theStreamReplacedhandler pattern)Doctor output:
{ "store": { "last_sync_at": "2026-06-09T01:39:52Z", "last_activity_at": "2026-06-09T01:41:55Z" } }last_activity_atcorrectly reflects the heartbeat timestamp from the sync session, distinct fromlast_sync_at.Usage
New flags
--stale-threshold0(disabled)New event
stale{threshold, idle_duration}Closes #268