fix(mqtt): re-subscribe on every connack to survive auto-reconnects#60
Merged
Conversation
Adds the ExecPlan for fixing the silent MQTT-deafness bug observed on a v0.7.0 device: agent loses topic subscriptions after rumqttc auto-reconnect because subscribes only happen in init_client and clean_session defaults to true. Plan covers re-subscribe on every successful ConnAck, setting clean_session=false, and adding a connection counter to make reconnects visible in INFO logs.
Tightens redundancy and fixes one defect in the validation section. Two review iterations: 6 conciseness fixes (cherry-pick exclusion stated once, trim duplicated rumqttc rationale, collapse milestone ordering rationale, defer to AGENTS.md for import-order rule, specify mod placement for the options test, replace tautological validation items with behavioral observables) plus one correctness fix (validation item 1 referenced an init_client + handle_event flow that's not testable because init_client is private and constructs a real mqtt::Client; rewritten to match what the M1 tests actually assert and call out the second-connect behavior as a code-audit observable).
f5d5f41 to
dbe55b6
Compare
Both init_client and the ConnAck Success arm of handle_event were issuing subscribe_sync followed by subscribe_ping with identical error handling. Bundle the pair into mqtt::device::subscribe_all so the worker calls it in one line. Adds three unit tests covering the helper's happy path, short-circuit-on-sync-error, and error-propagation-on-ping-error behaviors.
The connection counter added in 'feat(mqtt): log mqtt reconnects with a
connection counter' threaded a &mut u32 through handle_event purely so
the ConnAck Success log could include 'connection #{n}'. The signature
churn outweighs the diagnostic value: a future incident-responder can
just grep -c the log line, and EMQX-side connection-event logging gives
the same signal at the broker. Remove the field, parameter, and counter
plumbing; revert the log line to its original form.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The agent silently stops receiving MQTT command messages after a transient broker disconnect, and stays in that broken state until the agent process is restarted. Field log evidence shows a
v0.7.0device was MQTT-deaf for 6 days after a single auto-reconnect, kept barely alive by the 12-hour poller fallback.This PR ships the fix on the
v0.8line for av0.8.1hotfix release. Same change will be cherry-picked tomainas a separate PR.Symptom
miru-agentprocess makes the deployment go green almost instantly.Root cause
Subscriptions only happen at startup. The agent calls
mqtt::device::subscribe_syncandmqtt::device::subscribe_pingexactly once, insideinit_client(agent/src/workers/mqtt.rs). After a network blip,rumqttc::EventLoop::poll()reconnects internally and yields a freshConnAck, but the agent never re-issues theSUBSCRIBEpackets.Net effect: TCP/TLS connection is healthy, agent logs
Established connection to mqtt broker, broker considers the device "connected with zero subscriptions," and inboundcmd/devices/{device_id}/syncandv1/cmd/devices/{device_id}/pingmessages are silently dropped at the broker. The 12-hour poller is the only thing that recovers the device until the process is restarted.Evidence (live incident)
Six days of
journalctlfrom live device (runningv0.7.0, plenty of upstream connectivity):May 1 14:49:24— firstEstablished connection to mqtt broker(initialinit_client→ subscribed → working).May 1 14:50:10— secondEstablished connection to mqtt broker, 46 s later (rumqttc auto-reconnect after a transient blip).May 1 14:50:10 → May 7 17:38:43— zero further reconnects, zero "received ping request" lines, zero MQTT-driven syncs. Every "successfully synced with backend" lands at exact 12 h intervals = the poller, never MQTT.May 7 17:38:43— operator restarts the agent (new PID); MQTT works again immediately.Network-level errors during this window logged at
debug!only, so INFO-level field logs show no evidence of the disconnect — making the issue invisible without broker-side correlation.Fix
fix(mqtt): re-subscribe on every successful ConnAck—handle_event'sIncoming::ConnAck(Success)arm now callssubscribe_syncandsubscribe_ping, mirroringinit_client's pattern verbatim. Subscribe failures are logged and swallowed (do not changeerr_streak). Idempotent: re-subscribing on an existing subscription is a no-op for the broker. Closes the bug for all reconnect paths regardless of broker session behaviour.Affected versions
Bug exists in
v0.6.0,v0.7.0,v0.8.0, and currentmain. This PR fixesv0.8.x. Backport tomainwill follow as a separate PR.Test plan
./scripts/preflight.shreports clean (audit, fmt, clippy, coverage).rumqttd(or staging EMQX) — confirmconnection #1log on first connect, force-disconnect via broker admin, confirmconnection #2log appears with no agent restart, then trigger a sync command and confirm the agent processes it.subscriptions_cntfor the device should remain at 2 across observed reconnect cycles.🤖 Generated with Claude Code