Remove custom till_deferred_has_result(...) in favor of HomeserverTestCase.wait_on_thread(...) to drive async Rust (Tokio runtime/thread pool)#19867
Conversation
| events.USE_FROZEN_DICTS = False | ||
|
|
||
| def wait_on_thread(self, deferred: Deferred, timeout: int = 10) -> None: | ||
| def wait_on_thread( |
There was a problem hiding this comment.
wait_on_thread(...) was first introduced in matrix-org/synapse#5475. It's unclear from that PR itself but I'm guessing this is because of the defer_to_thread(...) usage in the media repo.
| def wait_on_thread(self, deferred: Deferred, timeout: int = 10) -> None: | ||
| def wait_on_thread( | ||
| self, | ||
| awaitable: Awaitable[TV], |
There was a problem hiding this comment.
Changed this to accept any awaitable (Deferred or Coroutine)
| time.sleep(0.01) | ||
| # Advance the Twisted reactor as the thread may have scheduled something on | ||
| # the reactor to run (like `reactor.callFromThread(...)`) | ||
| self.reactor.advance(0) |
There was a problem hiding this comment.
Changed this from advancing time by 0.01 each iteration to 0. We shouldn't need to advance the Twisted reactor time at all bolstered by the fact that the tests still pass.
| # Advance the Twisted reactor as the thread may have scheduled something on | ||
| # the reactor to run (like `reactor.callFromThread(...)`) | ||
| self.reactor.advance(0) |
There was a problem hiding this comment.
Changed the order to advance the Twisted reactor after we give some time to other threads to make some progress. The thinking is that the thread can do some work, potentially scheduling some things on reactor, and then we unblock that.
Since we're iterating in a loop, the order probably doesn't matter that much but perhaps this makes more sense.
| requester = self.get_success( | ||
| self.till_deferred_has_result( | ||
| self._auth.get_user_by_access_token("some_token") | ||
| ) | ||
| # We have to wait for the async Rust HTTP client (running on the Tokio | ||
| # thread pool) to do its thing (see `create_deferred(...)` usage) | ||
| self.wait_on_thread(self._auth.get_user_by_access_token("some_token")) | ||
| ) |
There was a problem hiding this comment.
Overall, not very satisfied with this API shape. I think it's good enough as another iteration. Perhaps a better name?
Ideally, we could just do self.get_success(self._auth.get_user_by_access_token("some_token"))) which would drive things to completion regardless of the kind of work necessary (Python or Rust).
But adding real sleeps to get_success(...) for everything would result in slowing down the entire test suite.
We could potentially add a new attribute to HomeserverTestCase like drive_work_on_threads (like the existing needs_threadpool) which would conditionally sleep for real in get_success(...). This is half-decent but having to realize this obscure detail makes things work kinda sucks.
For example, this is how needs_threadpool is defined on a test case basis:
synapse/tests/media/test_media_storage.py
Lines 71 to 72 in d3fc819
Or even better if we could automatically detect when there is work to be done on the Tokio thread pool and do the real sleep loop. Probably have to detect this by using the Tokio RuntimeMetrics. I think it would be better to explore this as a follow-up though.
There was a problem hiding this comment.
So I think that we can actually change the sleep to be time.sleep(0), since that is apparently enough to signal that other threads should run. I've tried it locally and the tests tests/handlers/test_oauth_delegation.py tests/synapse_rust/test_http_client.py seem to pass?
At which point this could be added to get_success I think?
There was a problem hiding this comment.
time.sleep(0) does seem to work 👍. If I look at the docs for time.sleep(...) ("Suspend execution of the calling thread [...]"), it also mentions that you can use os.sched_yield() ("Voluntarily relinquish the CPU.") which also works.
https://discuss.python.org/t/time-sleep-0-yield-behaviour/27185/5 mentions that time.sleep(0) releases the GIL while os.sched_yield() doesn't. But I think that is now fixed (looks like it was part of Python 3.10): python/cpython#96078
It's unclear what's better, constant context switching or just allowing some time. I would have assumed that constant context switching would have some impact but the results from #19871 show that it doesn't seem to slow things down in a noticeable way.
| # We have to wait for the async Rust HTTP client (running on the Tokio | ||
| # thread pool) to do its thing (see `create_deferred(...)` usage) | ||
| self.wait_on_thread(self._auth.get_user_by_access_token("some_token")) |
There was a problem hiding this comment.
The comment is repetitive but it's nice to know why the wait_on_thread(...) complication is being used here.
…s not subscriptable`
```
File "/home/runner/work/synapse/synapse/tests/unittest.py", line 481, in HomeserverTestCase
) -> Deferred[TV]:
builtins.TypeError: 'type' object is not subscriptable
```
till_deferred_has_result(...) in favor of HomeserverTestCase.wait_on_thread(...)till_deferred_has_result(...) in favor of HomeserverTestCase.wait_on_thread(...) to drive async Rust (Tokio runtime/thread pool)
|
Closing in favor of #19871 |
…nc Rust (Tokio runtime/thread pool) (#19871) This means you can use `get_success(...)` anywhere regardless of what kind of work needs to be done. Spawning from adding some more async Rust things in #19846 and wanting something more standard instead of the custom `till_deferred_has_result(...)` that has crept in to a few files. Alternative to #19867 spurred on by [this comment](#19867 (comment)) from @erikjohnston ### How does this work? Previously, `get_success(...)` just ran in a hot-loop advancing the Twisted reactor clock which didn't give any time for other threads to do some work or acquire the GIL if necessary (whenever there is a hand-off from Rust to Python, we need the GIL). Now, `get_success(...)` loops until we see a result (until we hit the ~0.1s real-time timeout). In the loop, we call [`time.sleep(0)`](https://docs.python.org/3/library/time.html#time.sleep) which will "Suspend execution of the calling thread [...]" (CPU and GIL) to allow other threads to do some work. Then like before, we advance the Twisted reactor clock to run any scheduled callbacks which includes anything the other threads may have scheduled. ### Does this slow down the entire test suite? Seems just as fast as before. There is minutes variance in what we had before and after but both are within the same range of each other. (see PR for actual before/after timings)
Remove custom
till_deferred_has_result(...)in favor ofHomeserverTestCase.wait_on_thread(...)to drive async Rust (Tokio runtime/thread pool).Spawning from adding some more async Rust things in #19846 and noticing that we have an existing pattern to use instead of the custom
till_deferred_has_result(...)that has crept in to a few files.Dev notes
time.sleep(0)("Suspend execution of the calling thread [...]")os.sched_yield()("Voluntarily relinquish the CPU.")Pull Request Checklist
EventStoretoEventWorkerStore.".code blocks.