Skip to content

Fix: Improve MQTT reconnection reliability with active reconnection#26

Merged
eman merged 4 commits intomainfrom
fix/mqtt-reconnection-reliability
Oct 24, 2025
Merged

Fix: Improve MQTT reconnection reliability with active reconnection#26
eman merged 4 commits intomainfrom
fix/mqtt-reconnection-reliability

Conversation

@eman
Copy link
Copy Markdown
Owner

@eman eman commented Oct 24, 2025

Summary

Fixes MQTT reconnection failures after network interruptions by implementing active reconnection instead of relying on unreliable passive waiting for AWS IoT SDK auto-reconnection.

Problem

The MQTT client was experiencing persistent connection failures:

  • AWS_ERROR_MQTT_UNEXPECTED_HANGUP: Connection unexpectedly closed
  • Failed to reconnect after 10 attempts. Manual reconnection required
  • Periodic device status requests being skipped due to disconnection
  • Slow coordinator updates (48.80s timeouts)

The root cause was that the reconnection handler only passively waited for AWS IoT SDK to automatically reconnect, which is unreliable with WebSocket connections.

Solution

Implemented active reconnection that always recreates the MQTT connection:

Changes Made

1. mqtt_reconnection.py (50 lines changed)

  • Breaking Internal Change: reconnect_func parameter is now required (not Optional)
  • Removed unreliable passive fallback - always uses active reconnection
  • Checks if AWS SDK auto-reconnected during delay (optimization)
  • Properly emits reconnection_failed event when max attempts reached
  • Enhanced error handling with full stack traces

2. mqtt_client.py (67 lines added)

  • Added _active_reconnect() method that:
    • Validates authentication tokens are valid
    • Recreates MQTT connection manager with fresh connection
    • Updates all internal connection references
    • Preserves existing subscriptions and configuration

3. CHANGELOG.rst (19 lines added)

  • Documented fix for version 3.1.3 (2025-10-24)

4. .github/copilot-instructions.md (2 lines added)

  • Added instruction to use date +\"%Y-%m-%d\" for dates

Impact

Public API Unchanged: NavienMqttClient works exactly as before - no user changes needed
⚠️ Internal Breaking Change: MqttReconnectionHandler now requires reconnect_func parameter
Compatible: Works with existing auto-recovery examples

Testing

  • ✅ All 40 tests passing
  • ✅ Type checking: No errors (23 files)
  • ✅ Linting: All checks passed
  • ✅ Successfully imports and initializes

Benefits

  • Reliable reconnection instead of hoping AWS SDK will reconnect
  • No unreliable code paths or fallbacks
  • Cleaner architecture with single reconnection strategy
  • Better recovery from WebSocket connection interruptions
  • Proper event emission for application-level recovery
  • Improved logging and error visibility

Fixes connection issues that were causing coordinator timeouts and reconnection failures in Home Assistant integrations.

- Implement active reconnection that recreates MQTT connection on interruption
- Remove unreliable passive fallback to AWS IoT SDK automatic reconnection
- Make reconnect_func required parameter in MqttReconnectionHandler
- Add _active_reconnect() method to NavienMqttClient
- Emit reconnection_failed event when max attempts are exhausted
- Improve error handling and logging during reconnection
- Resolves AWS_ERROR_MQTT_UNEXPECTED_HANGUP connection failures

Breaking Internal Change:
- MqttReconnectionHandler now requires reconnect_func parameter
- Public API (NavienMqttClient) unchanged - no user impact

Fixes connection issues where reconnection would fail after network
interruptions, causing 'Failed to reconnect after 10 attempts' errors.
@eman eman requested a review from Copilot October 24, 2025 21:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes MQTT reconnection failures by replacing passive waiting for AWS IoT SDK auto-reconnection with active reconnection that always recreates the MQTT connection. The changes ensure reliable recovery from network interruptions like AWS_ERROR_MQTT_UNEXPECTED_HANGUP.

Key Changes:

  • Made reconnect_func parameter required (breaking internal change) in MqttReconnectionHandler
  • Added _active_reconnect() method to NavienMqttClient that recreates MQTT connections with fresh authentication
  • Enhanced reconnection handler to emit reconnection_failed events and check connection state during retry attempts

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
src/nwp500/mqtt_reconnection.py Updated reconnection handler to require reconnect_func, implement active reconnection checks, and emit failure events
src/nwp500/mqtt_client.py Added _active_reconnect() method and wired it to reconnection handler with event emission
CHANGELOG.rst Documented the fix for version 3.1.3
.github/copilot-instructions.md Added instruction to use date command for getting current dates

config: "MqttConnectionConfig",
is_connected_func: Callable[[], bool],
schedule_coroutine_func: Callable[[Any], None],
reconnect_func: Callable[[], Any],
Copy link

Copilot AI Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type hint Callable[[], Any] should be more specific. Based on the usage in _active_reconnect() which is async, this should be Callable[[], Awaitable[None]] to accurately reflect that it's an async function returning None. Update the docstring to clarify this is an async callable.

Copilot uses AI. Check for mistakes.

# Update subscription manager with new connection
if self._subscription_manager and self._connection:
self._subscription_manager._connection = (
Copy link

Copilot AI Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Directly accessing the private _connection attribute of _subscription_manager violates encapsulation. Consider adding a public method like update_connection() or a setter property to _subscription_manager to handle connection updates properly.

Suggested change
self._subscription_manager._connection = (
self._subscription_manager.update_connection(

Copilot uses AI. Check for mistakes.
_logger.info("Active reconnection successful")
else:
# Restore old connection manager
self._connection_manager = old_connection_manager
Copy link

Copilot AI Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When restoring the old connection manager on failure, the _connection reference is not restored. This could leave self._connection pointing to a failed connection while self._connection_manager is the old one, creating an inconsistent state. Add self._connection = old_connection_manager.connection after this line.

Suggested change
self._connection_manager = old_connection_manager
self._connection_manager = old_connection_manager
self._connection = old_connection_manager.connection

Copilot uses AI. Check for mistakes.
When restoring old connection manager on failure, also restore the
_connection reference to maintain consistent state. Prevents
self._connection from pointing to a failed connection while
self._connection_manager points to the old one.
@eman
Copy link
Copy Markdown
Owner Author

eman commented Oct 24, 2025

Fixed inconsistent state issue

Good catch on the connection reference inconsistency!

Issue: When reconnection failed, we restored self._connection_manager but didn't restore self._connection, which could leave the connection reference pointing to a failed connection while the manager pointed to the old one.

Fix: Now restoring both references to maintain consistent state:

# Restore old connection manager and connection reference
self._connection_manager = old_connection_manager
self._connection = old_connection_manager.connection

All checks still passing:

  • ✅ Type checking: No errors
  • ✅ Linting: All checks passed
  • ✅ Tests: 40/40 passing

- Change reconnect_func type from Callable[[], Any] to Callable[[], Awaitable[None]]
- Change emit_event_func type to Callable[..., Awaitable[Any]] to match EventEmitter.emit signature
- Add import for Awaitable from collections.abc
- Update docstrings to clarify these are async callables
- More accurate type hints for better type safety and IDE support
@eman
Copy link
Copy Markdown
Owner Author

eman commented Oct 24, 2025

Improved type hints for async callables

Made the type hints more specific and accurate:

Changes:

  • reconnect_func: Callable[[], Any]Callable[[], Awaitable[None]]
  • emit_event_func: Callable[[str, Any], Any]Callable[..., Awaitable[Any]]
  • Added import for Awaitable from collections.abc
  • Updated docstrings to clarify these are async callables

Benefits:

  • More accurate type hints reflect that these are async functions
  • Better IDE support and autocomplete
  • Catches more type errors at development time
  • emit_event_func now properly matches EventEmitter.emit signature (which returns int)

All checks passing:

  • ✅ Type checking: No errors
  • ✅ Linting: All checks passed
  • ✅ Tests: 40/40 passing

…ion manager

- Add update_connection() public method to MqttSubscriptionManager
- Replace direct access to _connection attribute with proper method call
- Improves encapsulation and maintainability
- Provides clear API for updating connection after reconnection
- Includes documentation explaining the method's purpose
@eman
Copy link
Copy Markdown
Owner Author

eman commented Oct 24, 2025

Fixed encapsulation violation

Added proper public method instead of directly accessing private attributes.

Issue: Direct access to _subscription_manager._connection violated encapsulation principles.

Solution:

  • Added update_connection(connection) public method to MqttSubscriptionManager
  • Updated mqtt_client.py to use the new public method
  • Provides clear, documented API for updating connection references

Benefits:

  • Better encapsulation and separation of concerns
  • Easier to maintain and modify in the future
  • Clear API contract for connection updates
  • Properly documented method with usage notes

All checks passing:

  • ✅ Type checking: No errors
  • ✅ Linting: All checks passed
  • ✅ Tests: 40/40 passing

@eman eman merged commit 17fa1ee into main Oct 24, 2025
10 checks passed
@eman eman deleted the fix/mqtt-reconnection-reliability branch October 24, 2025 22:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants