Skip to content

Queued offline updates#1430

Open
rwalker777 wants to merge 10 commits into
esphome:mainfrom
rwalker777:queued-offline-updates
Open

Queued offline updates#1430
rwalker777 wants to merge 10 commits into
esphome:mainfrom
rwalker777:queued-offline-updates

Conversation

@rwalker777

@rwalker777 rwalker777 commented Jun 13, 2026

Copy link
Copy Markdown

What does this implement/fix?

Add detection when device is offline to local build and add a queue flag that triggers an OTA when the device wakes.

Related issue or feature (if applicable):

  • fixes

Types of changes

  • Bugfix (non-breaking change which fixes an issue) — bugfix
  • New feature (non-breaking change which adds functionality) — new-feature
  • Enhancement to an existing feature — enhancement
  • Breaking change (fix or feature that would cause existing functionality to not work as expected) — breaking-change
  • Refactor (no behaviour change) — refactor
  • Documentation only — docs
  • Maintenance / chore — maintenance
  • CI / workflow change — ci
  • Dependencies bump — dependencies

Frontend coordination

  • No frontend change needed
  • [X ] Companion frontend PR: esphome/device-builder-frontend#

Checklist

  • The code change is tested and works locally.
  • Pre-commit hooks pass (ruff, codespell, yaml/json/python checks).
  • Tests have been added or updated under tests/ where applicable.
  • components.index.json / definitions/components/*.json have not been hand-edited (regenerate via script/sync_components.py if a sync is needed).
  • Architecture-level changes are reflected in docs/ARCHITECTURE.md and/or docs/API.md.

rwalker777 and others added 5 commits June 12, 2026 14:38
…e devices. When a firmware job completes successfully for a device that is currently offline, we set a new `queued_update` flag on the device. This indicates that the compiled firmware is ready and waiting to be flashed via OTA the next time the device checks in via mDNS.
@rwalker777

Copy link
Copy Markdown
Author

Will submit frontend to show queued status and option to cancel. Feedback appreciated.

@codspeed-hq

codspeed-hq Bot commented Jun 13, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 27 untouched benchmarks


Comparing rwalker777:queued-offline-updates (5688f18) with main (8f22a7b)

Open in CodSpeed

@esphbot

esphbot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

@rwalker777: Will submit frontend to show queued status and option to cancel. Feedback appreciated.

Core gap: on_queued_update_ready unwired. No caller passes it. Wake never flashes. Also queued_update never reset — flapping device re-installs forever. Persist flag via state_callbacks path, not direct device.queued_update = True in runner (lost on restart). Fire DeviceEventData, not raw {"device": device}. Two tests reimplement monitor logic inline — drive real apply() instead.

@esphbot

esphbot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

PR Review — Queued offline updates

Promising feature, but incomplete and not safely landable as-is.

  • Blocking: the on_queued_update_ready callback is never wired and queued_update is never cleared — offline→online never triggers the OTA, and a flapping device would re-install indefinitely once it is wired.
  • Blocking: DEVICE_UPDATED is fired with a raw dict instead of DeviceEventData, and the device field is mutated in-memory only (not persisted to the sidecar, lost on restart) from the wrong layer (runner reaching into another controller's Device).
  • Blocking: two of the four tests reimplement the monitor logic inline and assert on the copy — they pass regardless of production code; the install test never hits the intercept it claims to cover.
  • Minor: cargo-cult except (ImportError, AttributeError) around bus.fire, duplicated device-lookup loop, string-literal vs DeviceState enum inconsistency, and an offline install that silently degrades with no user-facing signal.
  • The CodSpeed -10% on test_device_list_to_dict_fleet[50] is the new field in Device.to_dict; acknowledge or confirm it's noise.

🔴 Blocking

1. Queued-update callback is never wired, and the flag is never cleared (`esphome_device_builder/controllers/_device_state_monitor/controller.py`, L248-267)

This adds the on_queued_update_ready parameter and the trigger logic in apply(), but the diff never wires it up. No caller passes on_queued_update_ready= when constructing DeviceStateMonitor, so the callback is always None and the branch is dead — a device coming back online never actually triggers the OTA upload. The feature's payoff path is missing.

Even once wired, two problems remain:

  • queued_update is never reset to False. Nothing clears it after the upload (or anywhere). A flapping device (offline→online→offline→online) re-fires the callback on every ONLINE transition, re-installing repeatedly.
  • The callback only carries name; the receiving controller must re-resolve the device and decide whether the queued compile artifact is still valid (config unchanged since compile) before flashing.

Please include the consuming side (the devices controller wiring + the flag reset on successful upload) in this PR, plus a test that drives apply() end-to-end.

if trigger_queued_update and self._on_queued_update_ready:
    self._on_queued_update_ready(name)
2. Wrong event payload shape and unpersisted direct device mutation (`esphome_device_builder/controllers/firmware/runner.py`, L218-237)

Two convention violations here:

  1. Raw dict payload. This codebase fires DEVICE_UPDATED with the DeviceEventData TypedDict, never a bare dict. Compare controllers/devices/controller.py:843: self._db.bus.fire(EventType.DEVICE_UPDATED, DeviceEventData(device=device)). A raw {"device": device} bypasses the typed-payload contract (tests/test_event_payload_contracts.py pins this) and may not match what subscribers destructure.

  2. Direct, unpersisted mutation. device.queued_update = True mutates the in-memory Device only — it is never written to the metadata sidecar, so the flag is lost on restart (the device is rebuilt fresh by the scanner). Field writes that need to survive go through the persist+fire pattern in controllers/devices/state_callbacks.py (write field → persist to store → fire DEVICE_UPDATED). The runner reaching into another controller's Device and firing its event directly is a layering break; route this through a devices-controller method instead.

device.queued_update = True
try:
    controller._db.bus.fire(EventType.DEVICE_UPDATED, {"device": device})
except (ImportError, AttributeError) as e:
3. Tests reimplement production logic instead of exercising it (`tests/controllers/firmware/test_offline_queue.py`, L64-116)

test_queued_update_callback_triggered and test_online_device_without_queued_update_ignored copy the monitor's trigger loop into the test body and assert on the copy:

for d in [mock_device]:
    if d.state != DeviceState.ONLINE and getattr(d, "queued_update", False):
        trigger_queued_update = True

This tests the duplicated code, not DeviceStateMonitor.apply(). It passes even if apply() is broken or deleted. test_queued_update_callback_triggered also calls callback(...) itself and asserts it ran — proving only that calling a function calls it.

Drive the real apply(): construct a monitor with a captured on_queued_update_ready, seed a device with queued_update=True and a non-ONLINE state, call apply(name, DeviceState.ONLINE, ...), and assert the callback fired with the right name (and that it does not fire when queued_update is False or the device is already online).

Also note firmware_controller sets controller._db.devices = mock_device (a single device, not a devices controller), so self._db.devices.get_devices() returns a non-iterable MagicMock — the install-intercept loop in test_install_queues_for_offline_device never actually finds the device and falls through to the normal path, so that test isn't validating the intercept either.

trigger_queued_update = False
if mock_device.state == DeviceState.ONLINE:
    for d in [mock_device]:
        if d.state != DeviceState.ONLINE and getattr(d, "queued_update", False):
            trigger_queued_update = True

🟡 Important

1. Cargo-cult exception handling around bus.fire (`esphome_device_builder/controllers/firmware/runner.py`, L234-236)

bus.fire(...) does not raise ImportError, and catching AttributeError to silently debug-log hides real wiring bugs (e.g. _db.bus being None). If fire can't run, that's a programming error that should surface, not be swallowed. Remove the try/except.

except (ImportError, AttributeError) as e:
    _LOGGER.debug("Could not fire DEVICE_UPDATED event: %s", e)
2. Duplicated device lookup; inconsistent enum-vs-string state checks (`esphome_device_builder/controllers/firmware/controller.py`, L217-237)

The configuration→device lookup loop is duplicated verbatim here and in runner.py. Extract a helper (e.g. on the devices controller, get_by_configuration(configuration)) and reuse it — the scanner already maintains keyed indexes, so a linear get_devices() scan per install is also avoidable.

Style: device.state in ("offline", "unknown") compares against string literals while the monitor code you added uses DeviceState.ONLINE. It happens to work because DeviceState is a StrEnum, but it's inconsistent within the same PR. Use device.state in (DeviceState.OFFLINE, DeviceState.UNKNOWN) everywhere.

if device and device.state in ("offline", "unknown"):
3. Offline install silently changes semantics with no user signal (`esphome_device_builder/controllers/firmware/controller.py`, L224-237)

When the device is offline, install quietly degrades to a compile-only job and returns. From the caller/frontend's perspective this looks like a normal install request, but no upload is scheduled and (per the missing wiring above) nothing flashes on wake. Until the queued-update round-trip is complete and surfaced in the UI, a user who clicks Install on an offline device gets a compile that appears to do nothing.

Make the queued state explicit in the job/response so the frontend can show "queued, will flash when online" rather than a stalled install. The author notes a companion frontend PR — ensure the WS response shape conveys this state, not just the new queued_update field on the device snapshot.

job = self._create_job(
    configuration,
    JobType.COMPILE,
    build_source=build_source,
)
return await self._enqueue(job)

Checklist

  • No hardcoded secrets
  • Feature is functionally complete (callback wired, flag lifecycle closed) — critical #1
  • Event payloads use TypedDict convention — critical #2
  • State mutations persisted through proper layer — critical #2
  • No silent error swallowing — warning #3
  • Tests verify observable behavior, not reimplemented logic — critical #4
  • Consistent enum usage — warning #5
  • No untested branches introduced — critical #4

To rebase specific severity levels, mention me: @esphbot rebase critical (fixes 🔴 only), @esphbot rebase important (fixes 🔴 + 🟡), or just @esphbot rebase for all.


Silent Failure Analysis

🟠 **HIGH** — swallowed exception (`esphome_device_builder/controllers/firmware/runner.py:234-237`)

Risk: The flag is mutated in RAM but the event announcing it is caught and logged only at DEBUG (invisible in production), so a misconfigured bus or a typo'd EventType.DEVICE_UPDATED member silently drops the notification and the frontend never learns the firmware is queued.

device.queued_update = True
try:
    controller._db.bus.fire(EventType.DEVICE_UPDATED, {"device": device})
except (ImportError, AttributeError) as e:
    _LOGGER.debug("Could not fire DEVICE_UPDATED event: %s", e)

Fix: Let the event fire unguarded (or log at WARNING/ERROR and re-raise) so a broken bus/event-type surfaces instead of being swallowed; the event is load-bearing, not best-effort cleanup.

🟡 **MEDIUM** — unpersisted state hidden behind in-RAM flag (`esphome_device_builder/controllers/firmware/runner.py:230-232`)

Risk: queued_update is set only on the in-RAM Device and is not written to the metadata sidecar, so a backend restart before the device returns online silently loses the queued flash and the compiled firmware is never installed.

if device and device.state in ("offline", "unknown"):
    device.queued_update = True

Fix: Persist queued_update through the device metadata Store (or reconcile against pending COMPILE jobs on startup) so the queued install survives a restart.

🟡 **MEDIUM** — discarded failure result (`esphome_device_builder/controllers/firmware/runner.py:222-228`)

Risk: queued_update is set whenever the compile job succeeded, but the surrounding branch only inspects success for COMPILE jobs without distinguishing a compile that produced no installable artifact, so a 'successful' generate-only run can flag a non-flashable update that silently no-ops when the device returns.

if device and device.state in ("offline", "unknown"):
    _LOGGER.info(...)
    device.queued_update = True

Fix: Gate the flag on the artifact actually existing (binary present / job type that produces a flashable image), not solely on the subprocess exit code.

🟡 **MEDIUM** — optional masking via getattr default (`esphome_device_builder/controllers/_device_state_monitor/controller.py:256-261`)

Risk: queued_update is now a declared Device field, so the getattr-with-default defensively swallows a future rename/typo of the attribute into a permanent silent False, meaning queued updates would stop firing with no error.

for d in devices:
    if d.state != DeviceState.ONLINE and getattr(d, "queued_update", False):
        trigger_queued_update = True
        break

Fix: Access d.queued_update directly so a renamed/removed field raises instead of silently disabling the trigger.


Automated review by Kōan (Claude) HEAD=3ff8cee 3 min 9s

@esphbot esphbot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see the review comment above.

@github-actions github-actions Bot added the new-feature New feature label Jun 15, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

new-feature New feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants