Skip to content

fix(mqtt): re-subscribe on every connack to survive auto-reconnects#60

Merged
ben-miru merged 13 commits into
release/v0.8from
fix/mqtt-resubscribe-on-reconnect
May 8, 2026
Merged

fix(mqtt): re-subscribe on every connack to survive auto-reconnects#60
ben-miru merged 13 commits into
release/v0.8from
fix/mqtt-resubscribe-on-reconnect

Conversation

@ben-miru

@ben-miru ben-miru commented May 8, 2026

Copy link
Copy Markdown
Contributor

Summary

The agent silently stops receiving MQTT command messages after a transient broker disconnect, and stays in that broken state until the agent process is restarted. Field log evidence shows a v0.7.0 device was MQTT-deaf for 6 days after a single auto-reconnect, kept barely alive by the 12-hour poller fallback.

This PR ships the fix on the v0.8 line for a v0.8.1 hotfix release. Same change will be cherry-picked to main as a separate PR.

Symptom

  • User makes a config change in the dashboard. Backend sends an MQTT sync notification.
  • Device shows online and connected; agent's outbound MQTT publishes still succeed.
  • The deployment stays yellow for several minutes — sometimes effectively indefinitely.
  • Restarting the miru-agent process makes the deployment go green almost instantly.

Root cause

Subscriptions only happen at startup. The agent calls mqtt::device::subscribe_sync and mqtt::device::subscribe_ping exactly once, inside init_client (agent/src/workers/mqtt.rs). After a network blip, rumqttc::EventLoop::poll() reconnects internally and yields a fresh ConnAck, but the agent never re-issues the SUBSCRIBE packets.

Net effect: TCP/TLS connection is healthy, agent logs Established connection to mqtt broker, broker considers the device "connected with zero subscriptions," and inbound cmd/devices/{device_id}/sync and v1/cmd/devices/{device_id}/ping messages are silently dropped at the broker. The 12-hour poller is the only thing that recovers the device until the process is restarted.

Evidence (live incident)

Six days of journalctl from live device (running v0.7.0, plenty of upstream connectivity):

  • May 1 14:49:24 — first Established connection to mqtt broker (initial init_client → subscribed → working).
  • May 1 14:50:10 — second Established connection to mqtt broker, 46 s later (rumqttc auto-reconnect after a transient blip).
  • May 1 14:50:10 → May 7 17:38:43zero further reconnects, zero "received ping request" lines, zero MQTT-driven syncs. Every "successfully synced with backend" lands at exact 12 h intervals = the poller, never MQTT.
  • May 7 17:38:43 — operator restarts the agent (new PID); MQTT works again immediately.

Network-level errors during this window logged at debug! only, so INFO-level field logs show no evidence of the disconnect — making the issue invisible without broker-side correlation.

Fix

fix(mqtt): re-subscribe on every successful ConnAckhandle_event's Incoming::ConnAck(Success) arm now calls subscribe_sync and subscribe_ping, mirroring init_client's pattern verbatim. Subscribe failures are logged and swallowed (do not change err_streak). Idempotent: re-subscribing on an existing subscription is a no-op for the broker. Closes the bug for all reconnect paths regardless of broker session behaviour.

Affected versions

Bug exists in v0.6.0, v0.7.0, v0.8.0, and current main. This PR fixes v0.8.x. Backport to main will follow as a separate PR.

Test plan

  • All existing tests pass.
  • New test cases (above) cover the changed paths.
  • ./scripts/preflight.sh reports clean (audit, fmt, clippy, coverage).
  • CI verifies the same on the runner.
  • Manual reconnect smoke test before tagging `v0.8.1`: agent against a local rumqttd (or staging EMQX) — confirm connection #1 log on first connect, force-disconnect via broker admin, confirm connection #2 log appears with no agent restart, then trigger a sync command and confirm the agent processes it.
  • Spot-check on EMQX dashboard for one or two production devices: after the v0.8.1 deploy, subscriptions_cnt for the device should remain at 2 across observed reconnect cycles.

🤖 Generated with Claude Code

ben-miru added 4 commits May 7, 2026 17:01
Adds the ExecPlan for fixing the silent MQTT-deafness bug observed on a
v0.7.0 device: agent loses topic subscriptions after rumqttc auto-reconnect
because subscribes only happen in init_client and clean_session defaults
to true. Plan covers re-subscribe on every successful ConnAck, setting
clean_session=false, and adding a connection counter to make reconnects
visible in INFO logs.
Tightens redundancy and fixes one defect in the validation section. Two
review iterations: 6 conciseness fixes (cherry-pick exclusion stated once,
trim duplicated rumqttc rationale, collapse milestone ordering rationale,
defer to AGENTS.md for import-order rule, specify mod placement for the
options test, replace tautological validation items with behavioral
observables) plus one correctness fix (validation item 1 referenced an
init_client + handle_event flow that's not testable because init_client
is private and constructs a real mqtt::Client; rewritten to match what
the M1 tests actually assert and call out the second-connect behavior
as a code-audit observable).
@ben-miru ben-miru force-pushed the fix/mqtt-resubscribe-on-reconnect branch from f5d5f41 to dbe55b6 Compare May 8, 2026 01:46
ben-miru added 2 commits May 7, 2026 20:03
Both init_client and the ConnAck Success arm of handle_event were issuing
subscribe_sync followed by subscribe_ping with identical error handling.
Bundle the pair into mqtt::device::subscribe_all so the worker calls it
in one line. Adds three unit tests covering the helper's happy path,
short-circuit-on-sync-error, and error-propagation-on-ping-error
behaviors.
The connection counter added in 'feat(mqtt): log mqtt reconnects with a
connection counter' threaded a &mut u32 through handle_event purely so
the ConnAck Success log could include 'connection #{n}'. The signature
churn outweighs the diagnostic value: a future incident-responder can
just grep -c the log line, and EMQX-side connection-event logging gives
the same signal at the broker. Remove the field, parameter, and counter
plumbing; revert the log line to its original form.
@ben-miru ben-miru merged commit f47547e into release/v0.8 May 8, 2026
5 checks passed
@ben-miru ben-miru deleted the fix/mqtt-resubscribe-on-reconnect branch May 8, 2026 03:35
ben-miru added a commit that referenced this pull request May 8, 2026
)

Forward-port of the v0.8.1 hotfix landing as #60 on `release/v0.8`. Read
the original PR #60 for more information.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant