fix: don't flap a lock_offline repair during integration startup (#1257) #1260
Merged
…reached
Closes the repair-issue flapping in #1257: after a Home Assistant restart, LCM polls/writes before the lock's integration has finished starting up (Matter surfaces this as `InvalidState: Not connected`). Those failures feed the lock breaker, and at POLL_FAILURE_ALERT_THRESHOLD (12) the coordinator raises the `lock_offline` repair -- which is then auto-cleared the instant the integration finishes loading. The repair is created and dismissed entirely within the startup window.
A lock that has never been reached is not "offline" -- "offline" presupposes it was once online. Track `_reached_once` (set on the first successful poll/push via `_reset_backoff`) and only raise `lock_offline` once the lock has actually been reached. A lock that is reached and then drops still alerts normally; a lock still coming up at startup no longer flaps a repair.
#1258 already routes the underlying transient Matter startup errors (`unknown(133)`, `InvalidState: Not connected`) to the retry path, so they no longer disable/suspend slots; this closes the remaining startup-window repair they fed into via the connectivity breaker.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #1260 +/- ##
=======================================
Coverage 97.15% 97.15%
=======================================
Files 54 54
Lines 6434 6437 +3
Branches 461 461
=======================================
+ Hits 6251 6254 +3
Misses 183 183
Review follow-up: a successful drift-detection hard refresh is a genuine contact with the lock, but it did not flow through `_reset_backoff`, so it never set `_reached_once`. Set it on drift success too, so a lock whose only successful contact was via drift can still raise `lock_offline` on a later outage.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
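The gating described in these two commits can be illustrated with a minimal, self-contained sketch. It is not the coordinator's actual code: the class `LockHealthSketch`, its methods, the `repair_log` list, and the `simulate` helper are hypothetical names invented for illustration, and the `>=` threshold comparison is an assumption. Only `POLL_FAILURE_ALERT_THRESHOLD` (12), the `_reached_once` idea, and the `lock_offline` repair come from the PR text.

```python
from dataclasses import dataclass, field

# Value quoted in the PR description.
POLL_FAILURE_ALERT_THRESHOLD = 12


@dataclass
class LockHealthSketch:
    """Hypothetical stand-in for the coordinator's connectivity bookkeeping."""

    reached_once: bool = False          # mirrors the PR's `_reached_once` flag
    consecutive_failures: int = 0
    offline_repair_active: bool = False
    repair_log: list = field(default_factory=list)  # "raised" / "cleared" events

    def record_success(self) -> None:
        """Any genuine contact: a poll, a push, or a drift hard refresh."""
        self.reached_once = True
        self.consecutive_failures = 0
        if self.offline_repair_active:
            self.offline_repair_active = False
            self.repair_log.append("cleared")

    def record_failure(self) -> None:
        """Count a failure; alert only if the lock was reached at some point."""
        self.consecutive_failures += 1
        if (
            self.consecutive_failures >= POLL_FAILURE_ALERT_THRESHOLD
            and self.reached_once
            and not self.offline_repair_active
        ):
            self.offline_repair_active = True
            self.repair_log.append("raised")


def simulate(events):
    """Replay a sequence of booleans (True = successful contact)."""
    health = LockHealthSketch()
    for ok in events:
        if ok:
            health.record_success()
        else:
            health.record_failure()
    return health.repair_log


# Startup window: failures pile up before the first contact, so nothing is raised.
startup_log = simulate([False] * 15 + [True])
# Reach, then drop: the repair is raised at the threshold and cleared on recovery.
drop_log = simulate([True] + [False] * POLL_FAILURE_ALERT_THRESHOLD + [True])
```

In this sketch every kind of successful contact goes through one method, which is the point of the review follow-up: a path that reaches the lock without setting the flag would leave a later outage unable to alert.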
Proposed change
Closes the repair-issue flapping in #1257: after a Home Assistant restart, Lock Code Manager begins polling/writing before the lock's underlying integration has finished starting up. For Matter this surfaces as transient `InvalidState: Not connected` errors. Those failures feed the lock circuit breaker, and at `POLL_FAILURE_ALERT_THRESHOLD` (12) consecutive failures the coordinator raises the `lock_offline` repair, which is then auto-cleared the instant the integration finishes loading. The repair is created and dismissed entirely within the startup window, so the user sees a repair appear and vanish for a lock that was never actually offline.
Fix
A lock that has never been reached is not "offline": "offline" presupposes it was once online. The coordinator now tracks `_reached_once` (set on the first successful poll/push, in `_reset_backoff`) and only raises `lock_offline` once the lock has actually been reached:
- Never reached (still starting up): no `lock_offline` repair is raised, so there is nothing to flap.
- Reached, then drops: `_reached_once` is `True`, so a sustained failure raises `lock_offline` exactly as it does today, and recovery clears it.
Relationship to #1258 / #1257
#1258 (released in 4.0.6) already reclassified the underlying transient Matter startup errors (`unknown(133)`, `InvalidState: Not connected`) from fatal (`CodeRejectedError` → disable slot / unexpected error → suspend slot) to the retry path (`LockDisconnected`). That stopped the `slot_disabled` / `slot_suspended` repairs. But it rerouted those same failures into the connectivity breaker, which is what feeds `lock_offline`, so the flap moved there. This PR closes that remaining startup-window repair.
This is unrelated to the verified-credential lifecycle work (#1259) and branches off `main`.
Type of change
Additional information
- `lock_offline` is not created when the lock was never reached (even past the threshold); it is created normally after a reach-then-drop; `push_update` marks the lock reached. The four existing `lock_offline` threshold tests were updated to establish a prior reach, reflecting the new semantics.
- `_reached_once` is per-coordinator in-memory state. A lock that is dead across an HA-restart boundary and is never reached afterward will not raise `lock_offline` (it was never "online" this session). This is intentional (it is the startup-flap fix), and the condition is surfaced instead by the lock entity being unavailable and slots not syncing.
🤖 Generated with Claude Code