Skip to content

fix: add stale-stream detection to sync --follow#278

Merged
steipete merged 15 commits into
openclaw:mainfrom
thedavidweng:fix/sync-stale-detection
Jun 10, 2026
Merged

fix: add stale-stream detection to sync --follow#278
steipete merged 15 commits into
openclaw:mainfrom
thedavidweng:fix/sync-stale-detection

Conversation

@thedavidweng

@thedavidweng thedavidweng commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

What changed

Add built-in stale detection so sync --follow automatically reconnects when the WhatsApp connection is alive but no events are arriving.

Changes

  • --stale-threshold flag on sync command (default: 0 = disabled). When set, the follow loop checks whether any events have been received within the threshold window; if not, it emits a stale NDJSON event and forces a reconnect via ReconnectWithBackoff.
  • HEARTBEAT file written to the store directory (throttled to once per minute) on every event. This lets external processes monitor sync freshness without querying the database.
  • store.last_activity_at in doctor --json output, read from the heartbeat file. Enables automation to distinguish "connected but stale" from healthy.

Why

Issue #268: wacli sync --follow can remain active/connected under systemd while the local store freshness stops advancing. Downstream automations miss messages even though the sync service appears healthy.

How it was tested

Gate checks (all pass):

go test ./...                     — PASS (14 packages)
go test -tags sqlite_fts5 ./...   — PASS
go vet ./...                      — clean
gofmt                             — clean
go build -tags sqlite_fts5        — OK

Unit tests (all PASS):

TestSyncFollowReconnectsWhenStaleThresholdExceeded
TestSyncFollowDoesNotReconnectWhenActivityIsRecent
TestSyncFollowStaleReconnectResetsIdleDuration
TestSyncFollowEmitsStaleEvent
TestSyncWritesHeartbeatFileOnActivity
TestHeartbeatFileHasOwnerOnlyPermissions
TestReadHeartbeatReturnsZeroForMissingFile
TestReadHeartbeatReturnsTimestampFromFile
TestDoctorReportsLastActivityFromHeartbeat

Live runtime verification (macOS, arm64, built from this branch):

Heartbeat file — sync creates and updates the store heartbeat automatically:

$ wacli sync --follow --stale-threshold 10m --events
{"event":"connected",...}
{"event":"history_sync","data":{"conversations":0},...}
{"event":"stopping","data":{"messages_synced":1},...}

$ ls -la ~/.wacli/HEARTBEAT
-rw-------  1 user  staff  20 Jun  8 18:41 ~/.wacli/HEARTBEAT

$ cat ~/.wacli/HEARTBEAT
2026-06-09T01:41:55Z
  • File created automatically on first event
  • Permissions are 0600 (owner-only, via WritePrivateFile)
  • doctor --json picks it up as last_activity_at

Stale detection — stale event fires after threshold, reconnect loop is stable across multiple stale cycles:

$ wacli sync --follow --stale-threshold 5s --events
{"event":"connected",...}
{"event":"stale","data":{"idle_duration":"5.001s","threshold":"5s"},...}
{"event":"stale","data":{"idle_duration":"5.001s","threshold":"5s"},...}
{"event":"stale","data":{"idle_duration":"5.001s","threshold":"5s"},...}
{"event":"stopping",...}
  • Stale event fires correctly after threshold with no activity
  • idle_duration stays at ~5s each cycle (timer reset works, no accumulation)
  • Each stale cycle triggers wa.Close()ReconnectWithBackoff → reset timer (matching the StreamReplaced handler pattern)

Doctor output:

{
  "store": {
    "last_sync_at": "2026-06-09T01:39:52Z",
    "last_activity_at": "2026-06-09T01:41:55Z"
  }
}

last_activity_at correctly reflects the heartbeat timestamp from the sync session, distinct from last_sync_at.

Usage

# Auto-reconnect when no events arrive for 10 minutes
wacli sync --follow --stale-threshold 10m

# Monitor freshness
wacli doctor --json | jq ".data.store.last_activity_at"

New flags

Flag Default Description
--stale-threshold 0 (disabled) Force reconnect when no events arrive for this long in follow mode

New event

Event Data When
stale {threshold, idle_duration} Stale threshold exceeded, about to reconnect

Closes #268

@clawsweeper

clawsweeper Bot commented Jun 8, 2026

Copy link
Copy Markdown

Codex review: needs maintainer review before merge. Reviewed June 10, 2026, 12:38 AM ET / 04:38 UTC.

Summary
The branch adds opt-in sync --follow --stale-threshold reconnect logic, a throttled store HEARTBEAT surfaced through doctor as last_activity_at, docs, and focused tests.

Reproducibility: no. live current-main reproduction was established in this review. Source inspection does show the relevant no-events path: current runSyncFollow has no stale timer and will keep waiting while the client remains connected without recognized events.

Review metrics: 2 noteworthy metrics.

  • Diff size: 13 files changed, 485 additions, 16 deletions. The patch spans CLI wiring, app sync internals, docs, and tests, so review should cover both runtime behavior and public surface.
  • New public surfaces: 1 flag, 1 NDJSON event, 1 store file, 1 doctor field. These surfaces need maintainer acceptance because they become user-visible behavior and automation inputs.

Merge readiness
Overall: 🐚 platinum hermit
Proof: 🦞 diamond lobster
Patch quality: 🐚 platinum hermit
Result: ready for maintainer review.

Overall follows the weaker of proof and patch quality, so missing proof can cap an otherwise strong patch.

Rank-up moves:

  • [P2] Maintainer should explicitly accept or reject the opt-in force-close/reconnect behavior and new heartbeat/doctor surfaces before merge.

Risk before merge

  • [P1] The PR intentionally force-closes and reconnects a WhatsApp stream when an operator-set idle-event threshold elapses; the default is disabled, but maintainers still need to accept that heuristic for quiet-but-healthy streams.
  • [P1] It adds public surfaces (--stale-threshold, stale NDJSON event, HEARTBEAT, and store.last_activity_at) that should be accepted as the product direction for the linked stale-sync problem.

Maintainer options:

  1. Accept opt-in stale reconnect (recommended)
    Merge once a maintainer agrees that disabled-by-default force-close/reconnect behavior is the desired answer for stale sync --follow services.
  2. Hold for product direction
    Pause the PR if maintainers want a different core surface, such as supervisor exit-only behavior or a separate watchdog command, before adding this flag and heartbeat file.

Next step before merge

  • [P2] Human review should decide whether the opt-in stale threshold, heartbeat file, doctor field, and forced reconnect heuristic are acceptable; no narrow automated repair remains.

Security
Cleared: No concrete security or supply-chain issue was found; the diff adds no dependencies, workflows, secrets handling, or downloaded code, and the heartbeat file uses the existing owner-only file helper.

Review details

Best possible solution:

Merge the opt-in stale threshold implementation only after maintainers accept the public surfaces and message-delivery tradeoff, keeping the default disabled and the regression coverage in place.

Do we have a high-confidence way to reproduce the issue?

No live current-main reproduction was established in this review. Source inspection does show the relevant no-events path: current runSyncFollow has no stale timer and will keep waiting while the client remains connected without recognized events.

Is this the best way to solve the issue?

Yes, conditionally: an opt-in threshold with default disabled plus heartbeat reporting is a narrow maintainable approach. The force-close/reconnect heuristic and new public surfaces still need maintainer product acceptance.

AGENTS.md: found and applied where relevant.

Codex review notes: model gpt-5.5, reasoning high; reviewed against faf32dacf93b.

Label changes

Label changes:

  • add rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • add status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body contains after-fix live terminal output showing heartbeat creation, stale reconnect events, and doctor output, with private details appropriately redacted.
  • remove status: ⏳ waiting on author: Current PR status label is status: 👀 ready for maintainer look.
  • remove rating: 🦐 gold shrimp: Current PR rating is rating: 🐚 platinum hermit, so this older rating label is no longer current.

Label justifications:

  • P2: This is a normal-priority reliability improvement for a reported stale sync --follow message-ingestion problem with limited opt-in blast radius.
  • merge-risk: 🚨 message-delivery: Merging adds an operator-enabled path that force-closes and reconnects the WhatsApp stream based on an idle-event heuristic.
  • rating: 🐚 platinum hermit: Overall readiness is 🐚 platinum hermit; proof is 🦞 diamond lobster and patch quality is 🐚 platinum hermit.
  • status: 👀 ready for maintainer look: ClawSweeper has no concrete contributor-facing blocker left for this PR. Sufficient (terminal): The PR body contains after-fix live terminal output showing heartbeat creation, stale reconnect events, and doctor output, with private details appropriately redacted.
  • proof: sufficient: Contributor real behavior proof is sufficient. The PR body contains after-fix live terminal output showing heartbeat creation, stale reconnect events, and doctor output, with private details appropriately redacted.
Evidence reviewed

What I checked:

  • Current main lacks the requested surface: A source search on current main found no stale-threshold, HEARTBEAT, last_activity_at, or StaleThreshold support, so the central change is not already implemented. (faf32dacf93b)
  • Current follow loop only handles stop and disconnect: Current main's runSyncFollow waits for context cancellation or a disconnected signal; it has no stale timer path. (internal/app/sync_idle.go:10, faf32dacf93b)
  • PR adds the stale reconnect path: The PR head adds StaleThreshold, creates a ticker, emits a stale event, force-closes the WA client, reconnects, and resets the stale clock after a stale reconnect. (internal/app/sync_idle.go:10, c9a5a3225de9)
  • PR validates the public flag edge case: The latest PR head rejects positive --stale-threshold values below 1s before constructing the ticker, addressing the earlier tiny-duration crash risk. (cmd/wacli/sync.go:44, c9a5a3225de9)
  • PR adds heartbeat and doctor coverage: The diff adds a private HEARTBEAT writer/reader and tests covering heartbeat creation, owner-only permissions, missing files, parsing, stale reconnect behavior, and doctor JSON output. (internal/app/heartbeat.go:19, c9a5a3225de9)
  • Diff hygiene check passed: git diff --check reported no whitespace errors for the PR diff against current main. (c9a5a3225de9)

Likely related people:

  • Peter Steinberger: Blame shows the current sync follow loop, SyncOptions, sync command wiring, and doctor store reporting largely date to commit 201f7ad; Peter also recently touched adjacent sync event handling in 3778463. (role: introduced behavior and recent area contributor; confidence: high; commits: 201f7adcaf16, 3778463ed93a; files: internal/app/sync.go, internal/app/sync_idle.go, internal/app/sync_events.go)
  • Nimrod Gutman: Commit e9f1830 added the StreamReplaced force-close/reconnect behavior that this PR explicitly mirrors for stale streams. (role: adjacent reconnect behavior contributor; confidence: high; commits: e9f183056171; files: internal/app/sync_events.go, internal/app/sync_test.go, internal/app/fake_wa_test.go)
What the crustacean ranks mean
  • 🦀 challenger crab: rare, exceptional readiness with strong proof, clean implementation, and convincing validation.
  • 🦞 diamond lobster: very strong readiness with only minor maintainer review expected.
  • 🐚 platinum hermit: good normal PR, likely mergeable with ordinary maintainer review.
  • 🦐 gold shrimp: useful signal, but proof or patch confidence is still limited.
  • 🦪 silver shellfish: thin signal; proof, validation, or implementation needs work.
  • 🧂 unranked krab: not merge-ready because proof is missing/unusable or there are serious correctness or safety concerns.
  • 🌊 off-meta tidepool: rating does not apply to this item.

Shiny media proof means a screenshot, video, or linked artifact directly shows the changed behavior. Runtime, network, CSP, and security claims still need visible diagnostics.

How this review workflow works
  • ClawSweeper keeps one durable marker-backed review comment per issue or PR.
  • Re-runs edit this comment so the latest verdict, findings, and automation markers stay together instead of adding duplicate bot comments.
  • A fresh review can be triggered by eligible @clawsweeper re-review comments, exact-item GitHub events, scheduled/background review runs, or manual workflow dispatch.
  • PR/issue authors and users with repository write access can comment @clawsweeper re-review or @clawsweeper re-run on an open PR or issue to request a fresh review only.
  • Maintainers can also comment @clawsweeper review to request a fresh review only.
  • Fresh-review commands do not start repair, autofix, rebase, CI repair, or automerge.
  • Maintainer-only repair and merge flows require explicit commands such as @clawsweeper autofix, @clawsweeper automerge, @clawsweeper fix ci, or @clawsweeper address review.
  • Maintainers can comment @clawsweeper explain to ask for more context, or @clawsweeper stop to stop active automation.

@clawsweeper clawsweeper Bot added rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. P2 Normal priority bug or improvement with limited blast radius. labels Jun 8, 2026
@thedavidweng

thedavidweng commented Jun 9, 2026

Copy link
Copy Markdown
Contributor Author

Live runtime verification (macOS, arm64)

Built from this branch (go build -tags sqlite_fts5), tested against an active WhatsApp session.

Gate checks

go test ./...                     — PASS (14 packages)
go test -tags sqlite_fts5 ./...   — PASS
go vet ./...                      — clean
gofmt                             — clean
go build -tags sqlite_fts5        — OK

PR-specific unit tests (all PASS)

TestSyncFollowReconnectsWhenStaleThresholdExceeded
TestSyncFollowDoesNotReconnectWhenActivityIsRecent
TestSyncFollowStaleReconnectResetsIdleDuration
TestSyncFollowEmitsStaleEvent
TestSyncWritesHeartbeatFileOnActivity
TestHeartbeatFileHasOwnerOnlyPermissions
TestReadHeartbeatReturnsZeroForMissingFile
TestReadHeartbeatReturnsTimestampFromFile
TestDoctorReportsLastActivityFromHeartbeat

Live sync — heartbeat file

$ wacli sync --follow --stale-threshold 10m --events
{"event":"connected",...}
{"event":"history_sync","data":{"conversations":0},...}
{"event":"history_sync","data":{"conversations":0},...}
{"event":"stopping","data":{"messages_synced":1},...}

$ ls -la ~/.wacli/HEARTBEAT
-rw-------  1 user  staff  20 Jun  8 18:41 ~/.wacli/HEARTBEAT

$ cat ~/.wacli/HEARTBEAT
2026-06-09T01:41:55Z
  • File created automatically on first event
  • Permissions are 0600 (review fix confirmed)
  • doctor --json picks it up as last_activity_at

Live sync — stale detection (5s threshold)

$ wacli sync --follow --stale-threshold 5s --events
{"event":"connected",...}
{"event":"stale","data":{"idle_duration":"5.001s","threshold":"5s"},...}
{"event":"stale","data":{"idle_duration":"5.001s","threshold":"5s"},...}
{"event":"stale","data":{"idle_duration":"5.001s","threshold":"5s"},...}
{"event":"stopping",...}
  • Stale event fires correctly after threshold with no activity
  • idle_duration stays at ~5s each cycle (timer reset works, no accumulation)
  • Reconnect loop is stable across multiple stale cycles

doctor output

{
  "store": {
    "last_sync_at": "2026-06-09T01:39:52Z",
    "last_activity_at": "2026-06-09T01:41:55Z"
  }
}

last_activity_at correctly reflects the heartbeat timestamp from the sync session, distinct from last_sync_at.

@clawsweeper re-review

@thedavidweng thedavidweng marked this pull request as ready for review June 9, 2026 01:44
@clawsweeper

clawsweeper Bot commented Jun 9, 2026

Copy link
Copy Markdown

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added merge-risk: 🚨 message-delivery 🚨 Merging this PR could drop, duplicate, misroute, suppress, or wrongly target messages. rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. and removed rating: 🧂 unranked krab Not merge-ready due to missing proof or serious correctness/safety concerns. labels Jun 9, 2026
@thedavidweng

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

Live proof has been moved from the comment into the PR body under "How it was tested". The proof was built and tested against the current HEAD (3d22e44) which includes the force-close reconnect behavior.

@clawsweeper

clawsweeper Bot commented Jun 9, 2026

Copy link
Copy Markdown

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added proof: sufficient Contributor real behavior proof is sufficient. rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. and removed rating: 🦪 silver shellfish Thin PR readiness signal; proof, validation, or implementation needs work. status: 📣 needs proof The PR needs real behavior proof before ClawSweeper can clear the contributor ask. labels Jun 9, 2026
thedavidweng added a commit to thedavidweng/wacli that referenced this pull request Jun 10, 2026
…ngelog edits

- Validate --stale-threshold >= 1s to prevent time.NewTicker(0) panic
- Remove openclaw#278 entries from CHANGELOG.md (release-owned)
@thedavidweng

Copy link
Copy Markdown
Contributor Author

@clawsweeper re-review

Addressed both P3 findings:

  1. Added minimum 1s validation for --stale-threshold to prevent time.NewTicker(0) panic
  2. Removed fix: add stale-stream detection to sync --follow #278 entries from CHANGELOG.md (release-owned)

Live proof is now in the PR body under "How it was tested".

@clawsweeper

clawsweeper Bot commented Jun 10, 2026

Copy link
Copy Markdown

🦞🧹
ClawSweeper re-review requested.

I asked ClawSweeper to review this item again.
Action: item re-review queued (workflow sweep.yml, event repository_dispatch).
Result: the existing ClawSweeper review comment will be edited in place when the review finishes.

Re-review progress:

@clawsweeper clawsweeper Bot added rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR. and removed rating: 🦐 gold shrimp Decent PR readiness signal, but merge confidence is limited. status: ⏳ waiting on author ClawSweeper has contributor-facing work open and is waiting for author action. labels Jun 10, 2026
thedavidweng and others added 6 commits June 10, 2026 08:58
Add built-in stale detection so `sync --follow` automatically reconnects
when the WhatsApp connection is alive but no events are arriving.

Changes:
- Add `--stale-threshold` flag to `sync` (default: 0 = disabled). When set,
  the follow loop checks whether any events have been received within the
  threshold window; if not, it emits a `stale` event and forces a reconnect.
- Write a `HEARTBEAT` file to the store directory (throttled to once per
  minute) on every event, so external processes can monitor sync freshness.
- Add `store.last_activity_at` to `doctor --json` output by reading the
  heartbeat file, enabling automation to distinguish "connected but stale"
  from healthy.

Closes openclaw#268
- Reset lastEvent and stale ticker after successful stale reconnect so
  the idle duration starts fresh instead of accumulating.
- Use fsutil.WritePrivateFile for HEARTBEAT (0600) instead of os.WriteFile
  (0644), matching the repo store-file convention.
- Add regression tests: TestSyncFollowStaleReconnectResetsIdleDuration,
  TestHeartbeatFileHasOwnerOnlyPermissions.
…ngelog edits

- Validate --stale-threshold >= 1s to prevent time.NewTicker(0) panic
- Remove openclaw#278 entries from CHANGELOG.md (release-owned)
@steipete steipete force-pushed the fix/sync-stale-detection branch from c9a5a32 to 5a2d368 Compare June 10, 2026 09:48

@steipete steipete left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved after maintainer rewrite and full verification. The final design keeps stale reconnects on whatsmeow keepalive-timeout signals, serializes close/reconnect in the follow loop, and preserves HEARTBEAT as observed activity rather than process liveness.

@steipete steipete merged commit 3648912 into openclaw:main Jun 10, 2026
5 checks passed
@steipete

Copy link
Copy Markdown
Collaborator

Landed as 3648912.

Design/proof summary:

  • Rewrote stale detection to the whatsmeow keepalive-timeout boundary. Successful keepalives do not emit events, so this does not use event-idle heuristics and does not reconnect quiet healthy sessions just because no chat events arrive.
  • Forced reconnect is serialized in the follow loop. Keepalive timeout handlers only queue a stale request; the loop validates connection epoch, emits the stale NDJSON event, closes, and reconnects.
  • Accepted stale thresholds are bounded to 1s..<2m20s, reserving the maximum keepalive probe interval plus response deadline before whatsmeow's 3-minute auto-reconnect window.
  • HEARTBEAT remains an observed follow-activity marker, written atomically with 0600 permissions and throttled safely. doctor --json reports last_activity_at even when DB stats are unavailable, without inventing zero counts.

Local verification on 5a2d368:

  • go test ./internal/app -run 'TestSyncFollow|TestHeartbeat|TestReadHeartbeat|TestSyncWritesHeartbeat|TestSyncOnceDoesNotWriteHeartbeat|TestSyncRejects|TestSyncEventHandler'
  • go test ./cmd/wacli -run 'TestSyncCommand|TestDoctor'
  • go test ./internal/wa
  • pnpm -s docs:site && pnpm -s format:check && pnpm -s lint && pnpm -s test && pnpm -s build && git diff --check
  • go test -race ./internal/app ./cmd/wacli ./internal/fsutil ./internal/wa
  • go run golang.org/x/tools/cmd/deadcode@v0.45.0 -test -tags sqlite_fts5 ./...
  • goreleaser check --config .goreleaser.yaml && goreleaser check --config .goreleaser-linux-windows.yaml
  • autoreview --mode branch --base origin/main: clean, no accepted/actionable findings

CI on 5a2d368: test, linux-release-builds, docker, Socket Project Report, and Socket Pull Request Alerts all passed.

Live follow proof limitation: no safe authenticated local WhatsApp account was available in this checkout. Redacted checks showed accounts=[] and the default store authenticated=false, connected=false. The live-provider behavior is therefore covered by focused fake-WA/protocol regression tests around KeepAliveTimeout, stale NDJSON emission, serialized reconnect, heartbeat writes, and doctor output.

Issue #268 closed via the squash merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

merge-risk: 🚨 message-delivery 🚨 Merging this PR could drop, duplicate, misroute, suppress, or wrongly target messages. P2 Normal priority bug or improvement with limited blast radius. proof: sufficient Contributor real behavior proof is sufficient. rating: 🐚 platinum hermit Good normal PR readiness with ordinary maintainer review expected. status: 👀 ready for maintainer look ClawSweeper has no concrete contributor-facing blocker left for this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

sync --follow can stay connected while local store freshness goes stale

2 participants