Conversation
Mesa Description

Overview

This PR refactors the …

Testing

Ran locally + against a few unikernels in staging. I would connect and disconnect a bunch of tabs to simulate flaky connections (this was a pretty consistent repro of the original issue).

Example logs from staging:

Description generated by Mesa.
Performed full review of 46e2fc7...e9293ae
Analysis
- Despite the architectural improvements, the PR doesn't address potential race conditions in the new state-based design that could occur during concurrent session state transitions.
- The simplified boolean flag `disabledScaleToZero` may lack granularity compared to the counter approach, potentially making it difficult to handle partial or in-progress scaling operations (see the sketch after this list).
- The centralized `manage()` method creates a potential single point of failure and may become a bottleneck as system complexity grows.
- The PR appears to focus on fixing symptoms rather than addressing the root cause of event deduplication failures, which could resurface in other components.
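To make the flag-versus-counter discussion concrete, here is a minimal sketch of an edge-triggered design of the kind the review describes. This is not the neko code: the `session` and `manager` types and their fields are hypothetical, and only the `disabledScaleToZero` flag and the `manage()` name are taken from the review itself.

```go
package main

import "fmt"

// session and manager are hypothetical types for illustration; only the
// disabledScaleToZero flag and the manage() name come from the review above.
type session struct{ connected bool }

type manager struct {
	sessions            []session
	disabledScaleToZero bool // single edge-tracked flag instead of a per-event counter
}

// manage recomputes the desired scale-to-zero state from the current session
// list, so duplicate or dropped connect/disconnect events cannot skew a counter.
func (m *manager) manage() {
	connected := 0
	for _, s := range m.sessions {
		if s.connected {
			connected++
		}
	}

	shouldDisable := connected > 0
	if shouldDisable != m.disabledScaleToZero { // act only on the 0 <-> >0 edge
		m.disabledScaleToZero = shouldDisable
		fmt.Printf("scale-to-zero disabled=%v (connected=%d)\n", shouldDisable, connected)
	}
}

func main() {
	m := &manager{}
	m.sessions = []session{{connected: true}}
	m.manage() // edge: 0 -> 1 connected, disables scale to zero
	m.manage() // no edge, nothing happens
	m.sessions = nil
	m.manage() // edge: 1 -> 0 connected, re-enables scale to zero
}
```

Because the state is recomputed from the session list each time, re-running `manage()` is idempotent; only genuine transitions between "no connected sessions" and "some connected sessions" flip the flag.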
This review was generated by Mesa.
    connectedSessions := 0
    for _, s := range m.sessions.List() {
        if s.State().IsConnected {
            connectedSessions++
Is there a world where a session is disconnected (`IsConnected == false`) but is trying to reconnect, and by not counting it here we might scale to zero, preventing it from reconnecting?
Great question! There's a grace period after disconnects for reconnection: https://github.com/onkernel/neko/blob/e9293ae098832ae0c4b63d092a9bd370e3c0ab12/server/internal/session/session.go#L145-L213
Especially wrt scale to zero, that delay + this buffer seems reasonable. For new connections at our proxy layer we'd be "waking up" the VM externally regardless ^^
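For readers who don't follow the link, here is a rough sketch of what a disconnect grace period can look like. It is a hypothetical illustration, not the code in `session.go`: the `gracefulSession` type, the `InUse` method, and the 2-second window are all made up for the example.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// gracefulSession is a hypothetical illustration of a disconnect grace period;
// it is not the implementation linked above.
type gracefulSession struct {
	mu          sync.Mutex
	connected   bool
	graceExpiry time.Time
	grace       time.Duration
}

// Disconnect marks the session as disconnected but starts a grace window
// during which it should still count as "in use" for scale-to-zero decisions.
func (s *gracefulSession) Disconnect() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.connected = false
	s.graceExpiry = time.Now().Add(s.grace)
}

// Reconnect marks the session as connected again, e.g. after a flaky tab recovers.
func (s *gracefulSession) Reconnect() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.connected = true
}

// InUse reports whether the session is connected or still inside its
// reconnect grace window.
func (s *gracefulSession) InUse() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.connected || time.Now().Before(s.graceExpiry)
}

func main() {
	s := &gracefulSession{grace: 2 * time.Second}
	s.Reconnect()
	s.Disconnect()
	fmt.Println("in use right after disconnect:", s.InUse()) // true: inside grace window

	time.Sleep(3 * time.Second)
	fmt.Println("in use after grace expires:", s.InUse()) // false: safe to scale to zero
}
```

In a design like this, a briefly disconnected tab still counts as in use until the window expires, which is the kind of buffer the comment above refers to.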
Pulls in kernel/neko#8 kernel/neko#9

---

> [!NOTE]
> Update chromium-headful Dockerfile to use neko base image 3.0.8-v1.3.0.
>
> - **Docker**:
>   - Bump base image in `images/chromium-headful/Dockerfile` from `ghcr.io/onkernel/neko/base:3.0.8-v1.1.0` to `3.0.8-v1.3.0`.

Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit e6afb32.
Overview
We noticed issues where the implementation would end up with < 0 pending. The naive assumption that events would be deduped and delivered exactly once wasn't reliable. Instead, swap to a model where we only disable/enable around edges rather than on every event.
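As a toy illustration of that failure mode (not the actual implementation), a counter that is incremented and decremented per event drifts as soon as a disconnect is delivered twice:

```go
package main

import "fmt"

func main() {
	// Hypothetical counter-based approach the overview describes moving away from:
	// every connect event increments pending, every disconnect event decrements it.
	pending := 0

	// One real connect/disconnect pair, plus a duplicate disconnect delivery.
	events := []string{"connect", "disconnect", "disconnect"}

	for _, e := range events {
		switch e {
		case "connect":
			pending++
		case "disconnect":
			pending--
		}
	}

	// Without exactly-once delivery the counter drifts below zero.
	fmt.Println("pending =", pending) // pending = -1
}
```

Tracking the edge (whether any session is connected at all) instead of summing events sidesteps this, since recomputing the state is idempotent.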
Testing
Ran locally + against a few unikernels in staging. I would connect and disconnect a bunch of tabs to simulate flaky connections (this was a pretty consistent repro of the original issue).
Example logs from staging: