Skip to content

Ensure that message IDs are used in the correct order#42

Merged
saml-dev merged 7 commits intosaml-dev:mainfrom
mhagger:locked-conn
Dec 10, 2025
Merged

Ensure that message IDs are used in the correct order#42
saml-dev merged 7 commits intosaml-dev:mainfrom
mhagger:locked-conn

Conversation

@mhagger
Copy link
Copy Markdown
Contributor

@mhagger mhagger commented Dec 7, 2025

Home Assistant insists that message IDs be monotonically increasing. But with Go's concurrency, that wasn't being guaranteed. It could happen that

  1. Goroutine A allocates an internal.NextID() equal to 42.
  2. Goroutine B allocates an internal.NextID() equal to 43.
  3. Goroutine B sends a message with message ID 43.
  4. Goroutine A tries to send a message with message ID 42, but it fails, because Home Assistant has already seen message ID 43.

I was indeed occasionally seeing problems like this.

To fix the problem, a connection-level lock has to be held from the time that the message ID is allocated until the message is sent, to prevent another thread from allocating its own message ID between those two actions. Ensure that the locking is done correctly using a trick similar to that used by syscall.RawConn:

  • Instead of using a global function, allocate message IDs in websocket.Conn. This is the correct scope, because different websocket connections wouldn't have a problem about using overlapping message IDs.
  • Add a new interface LockedConn. Move methods NextMessageID() and SendMessage() from Conn to LockedConn. These are the methods that are critical with respect to message ID ordering.
  • Add a new method (*Conn).Send(), which locks the Conn, creates a temporary LockedConn, and passes it to a callback. The callback thus has exclusive access to the message IDs while it is running, and (via the LockedConn interface) has access to the critical methods.

The hope is that this (somewhat subtle) mechanism only needs to be used in the implementation of App, and not by client code. In that spirit, add methods App.Call() and App.FireEvent(), so that even the service definitions don't need to know about LockedConn.

This PR is orthogonal to #41 and they don't conflict.

Caveat: I still can't run my own home setup on this branch, because it relies on other changes in #21 that haven't been re-rolled yet. So, while every commit compiles and example_live_test.go runs correctly at the tip of this branch, I can't do a lot more testing than that.

/cc @saml-dev

Previously, some were named `a` and some were named `app`. Name them
all `app` for consistency.
The functions `SubscribeToEventType()` and
`SubscribeToStateChangedEvents()` didn't use their `ctx` parameters,
and they don't really make sense. So remove them.
For now, it only includes a single method, `WriteMessage()`, provided
by `*Conn`. But that will change soon.
There's soon going to be another lock, so give this one a more
specific name.
Use new `App` methods to call HA services and fire HA events.

This is not super interesting yet, but the work done by `Call()` will
increase soon, making this helper method more of a win.
Instead of using a global function `internal.GetId()` to generate
message IDs, make this a feature of `websocket.Conn`. This makes more
logical sense, and it will also make it possible to fix a race in the
next commit.
The websocket API insists that message IDs be monotonically
increasing (i.e., not just unique). That means that we need to hold a
lock the whole time from when we allocate the message ID until we send
the message. Change the interface for sending messages to make this
safe:

* Move the "write" methods of `Conn` to a new type, `LockedConn`.

* Add a method `Conn.Send()`, which takes a callback function as an
  argument. The callback gets a `LockedConn` as an argument. (This is
  the only way to get a hold of a `LockedConn`.) The callback is
  invoked while `Conn.writeLock` is held.

The callback can thus allocate new message IDs and send messages with
those IDs without fear that another thread interferes with it.
@saml-dev
Copy link
Copy Markdown
Owner

Thanks for calling out the caveat. I will keep updating and running my home automations on the latest master as I merge your PRs, and wait to cut a release until later. That way we can get some testing in.

@saml-dev saml-dev merged commit 8825dc1 into saml-dev:main Dec 10, 2025
@mhagger mhagger deleted the locked-conn branch December 10, 2025 11:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants