Skip to content

feat: enable parallel iOS simulator execution with per-device session tracking#3138

Open
qwertey6 wants to merge 1 commit intomobile-dev-inc:mainfrom
ReverentPeer:pr/1-parallel-ios-execution
Open

feat: enable parallel iOS simulator execution with per-device session tracking#3138
qwertey6 wants to merge 1 commit intomobile-dev-inc:mainfrom
ReverentPeer:pr/1-parallel-ios-execution

Conversation

@qwertey6
Copy link
Copy Markdown

@qwertey6 qwertey6 commented Apr 4, 2026

Proposed changes

Parallel execution on iOS is completely broken. Two maestro test processes targeting different iOS simulators collide on the XCTest HTTP server port and interfere with each other's sessions. This makes --shard-split unusable on iOS and forces sequential execution.

Impact

  • Unblocks iOS parallel testing: --shard-split and multi-process execution now work on iOS
  • Resolves 5 open issues and supersedes 2 stale open PRs
  • Removes the Thread.sleep(1000) hack from Fix Android Parallel/Shards Race Condition #1867 — the per-device session check eliminates the race condition it was working around

Root cause

Two bugs compound (same analysis as #2339 by @avinash-bharti):

  1. Port collision: iOS simulators share the host's localhost. Two Maestro processes both bind to the same XCTest server port.
  2. Session interference: hasActiveSessions() checks if any session exists on the same platform — regardless of device. Process B sees process A's session and skips driver.open(), connecting to the wrong device.

What this PR does differently from #2339 and #2821

Builds on the same architectural approach (per-device session tracking + --driver-host-port), with additional correctness fixes discovered during testing:

Fix #2339 #2821 This PR
Per-device session tracking Yes No Yes
--driver-host-port flag Yes Yes Yes
Session self-detection bug fix No No Yes — heartbeat at initialDelay=0 fires before session check, causing process to find itself and skip driver.open(). Broke Android completely.
Cross-process file locking No No Yes~/.maestro/sessions had no FileLock, concurrent processes could corrupt session data
Per-device shutdown No No YesshouldCloseSession(platform, deviceId) replaces global activeSessions().isEmpty()
Port availability check No No YesisPortAvailable() with correct check-then-claim ordering
Debug log directory race No No Yes — PID appended to prevent parallel collision
Unit tests No No Yes — 8 tests, fakes not mocks, Java 8 compatible

Changes

  • SessionStore.kt — Refactored from singleton to injectable class. Key format: "{platform}_{deviceId}_{sessionId}". New methods: hasActiveSessionForDevice(), shouldCloseSession(), activeSessionsForDevice()
  • MaestroSessionManager.kt — Uses SessionStore.default, per-device heartbeat/delete/close
  • KeyValueStore.kt — Added FileLock for cross-process safety
  • App.kt + TestCommand.kt--driver-host-port CLI option
  • TestCommand.ktselectPort() with isPortAvailable() check
  • SocketUtils.ktisPortAvailable() function
  • DebugLogStore.kt — PID in log directory name (Java 8 compatible via ManagementFactory)
  • Command files — Pass parent?.driverHostPort instead of null

Verification

Scenario Result
3 iOS sims parallel, cold start Pass
3 iOS sims parallel, warm start Pass
Android emulator, single device Pass
--driver-host-port explicit ports Pass
--shard-split=2 with 2 flows Pass
Solo + shard simultaneous (3 devices) Pass

Testing

  • 8 new unit tests in SessionStoreTest (per-device isolation, key format, lifecycle, shutdown)
  • useJUnitPlatform() added to maestro-cli/build.gradle.kts (was missing)
  • Uses fakes, not mocks (per CONTRIBUTING.md)
  • Java 8 compatible

Note: This is PR 1 of 4 in a stacked series. Subsequent PRs build on this:

  • PR 2: ~4x faster startup (skip unnecessary driver reinstall)
  • PR 3: Version safety (driver hash checks prevent stale drivers)
  • PR 4: Fix waitToSettleTimeoutMs on iOS (settle timeout was silently broken)

Issues fixed

Closes #1853, #2556, #1485, #2082, #2703

Supersedes #2339 and #2821 — credit to @avinash-bharti and @omnarayan for the root cause analysis and architectural direction.

@qwertey6
Copy link
Copy Markdown
Author

qwertey6 commented Apr 4, 2026

CI note: The E2E failures on Android and iOS are pre-existing and unrelated to this PR:

  • Android (killApp test): 44/45 flows pass, 1 fails with "Form Test" is not visible after a killApp command. App didn't return to expected state — flaky assertion.
  • iOS (ios-advanced-flow): Wikipedia app crashes during clearState + assertVisible "Next" in launch-clearstate-ios.yaml.

Both failures reproduce on other unrelated PRs in the repo (env-file-support, pr/contains-child-fix, pr/skill, etc.) — every recent E2E run on this workflow shows the same pattern.

Build, unit tests, web E2E, and iOS XCTest runner tests all pass.

@vaikesh07

This comment was marked as off-topic.

@Fishbowler
Copy link
Copy Markdown
Contributor

Fishbowler commented Apr 21, 2026

Is the description of this PR correct?

This makes --shard-split unusable on iOS and forces sequential execution.

I think sharding is a different use case from two parallel unrelated maestro test operation, and isn't currently problematic?

I've tidied up a couple of the linked issues - clearly duplicates. Oops.

@qwertey6
Copy link
Copy Markdown
Author

Good question — you're right that --shard-split and two separate maestro test processes are different use cases. Let me clarify:

Two separate maestro test processes: Both port collision (hardcoded 7001) and session interference (hasActiveSessions matching cross-device). Both broken.

--shard-split: Port collision is handled within the process by the existing usedPorts ConcurrentHashMap (each shard gets a unique port). But session interference IS a problem — hasActiveSessions(sessionId, platform) checks if any session exists for the platform, so shard B's heartbeat causes shard A's connectToExistingSession check to return true, skipping driver.open() on shard A's device. The Thread.sleep(1000) hack in #1867 masked this by delaying the heartbeat past the session check window, but it's a race condition that fails under load.

I'll update the PR description to be more precise about which use case each fix addresses. Thanks for the nudge — and for cleaning up the dupes.

@qwertey6 qwertey6 force-pushed the pr/1-parallel-ios-execution branch from 803cfba to 7cd8f12 Compare April 24, 2026 18:37
@qwertey6
Copy link
Copy Markdown
Author

@Leland-Takamine — I know you're heads-down on a lot, so here's the quick version: these 7 PRs close 9 open issues (some since 2023), supersede 2 stale PRs, and deliver 4-5x faster iOS startup (52s→10s) + 2x faster flows with tuned settle timeouts. They complement your #3171 by covering the iOS parallel execution side.

I've rebased onto current main and restructured so 4 of the 7 PRs are now fully independent — reviewable and mergeable in any order:

Stacked (1→2→3):

Independent:

The independent PRs are small — #3166 is 3 constants, #3165 is 10 lines. Could be quick wins.

Also — sent you a connect on LinkedIn. Would love to chat about where Maestro is headed.

Copy link
Copy Markdown
Contributor

@proksh proksh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tackling this @qwertey6 - parallel iOS is a real pain point and the three collision points are correctly identified.

Looking at the PR, every change is coordination machinery layered around three shared resources (the XCTest port on localhost, ~/.maestro/sessions, and the debug log directory) so that two processes can use them simultaneously without lying to each other. Each fix is mechanically correct but downstream of one underlying design choice: that those three resources are shared resources processes need to negotiate over. So the PR reads as a stack of small workarounds rather than one architectural change.

I'd suggest looking at a simpler architecture that removes most of this:

  1. Ask the OS for a port. ServerSocket(0).use { it.localPort } - no scanning, no 7001..7128 range, no usedPorts map, no --driver-host-port flag. Same TOCTOU exposure as the current isPortAvailable check (microseconds, ephemeral range), dramatically simpler mechanism.

  2. Stop sharing session state across processes - and consider deleting SessionStore entirely (not "make it in-memory" - delete it). Within one Maestro JVM, two sessions never share a device: --shard-split distributes shards 1:1 onto distinct devices (runShardSuite reads deviceIds[shardIndex], and validateDevices enforces devices >= shards), so the warm-reuse check is always false within one process. The shared file at ~/.maestro/sessions exists only to enable warm reuse across separately-launched CLI invocations on the same device - a CLI-iteration optimization worth a few seconds of cold-start. If warm reuse ever turns out to matter in practice, the right shape is a long-running daemon that owns drivers, not a shared file scribbled on by every CLI invocation.

The PID-suffixed log dir should stay - its unrelated to the workaround pattern.

Happy to talk through this if I'm missing a use case.

@qwertey6 qwertey6 force-pushed the pr/1-parallel-ios-execution branch from 7cd8f12 to 58eacc2 Compare April 29, 2026 22:12
Copy link
Copy Markdown
Author

@qwertey6 qwertey6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@proksh — thanks for the detailed review. I agree with point 1 and disagree with point 2. Here's my reasoning:

Point 1: ServerSocket(0) — done

I've updated selectPort() to use ServerSocket(0). The 7001-7128 range, usedPorts ConcurrentHashMap, and isPortAvailable scanning are all removed. Much simpler. The --driver-host-port flag stays for the explicit-port use case (it still validates via isPortAvailable before using the user-specified port).

Point 2: Deleting SessionStore — I'd push back here

The SessionStore, connectToExistingSession, heartbeat loop, and shared ~/.maestro/sessions file are existing Maestro infrastructure that my changes fix — they don't add it. The per-device key format, self-exclusion bug, and file locking are bug fixes to code that was already shipping and already broken.

Deleting SessionStore would require rearchitecting two cross-process behaviors that currently depend on it:

1. Shutdown coordination (MaestroSessionManager line 129):

if (SessionStore.activeSessions().isEmpty()) {
    session.close()
}

Process A checks whether any other process is still using a device before closing. Without this, process A's shutdown hook calls session.close()IOSDriver.close(), potentially killing a driver that process B is actively using. Removing SessionStore means either:

  • Every process kills its driver on exit → breaks parallel execution (process A kills B's XCTest runner)
  • No process kills its driver on exit → leaks orphaned xcodebuild/runner processes indefinitely
  • A daemon owns driver lifecycle → correct, but significantly larger scope

2. Warm reuse across CLI invocations (connectToExistingSession):
Without it, every maestro test invocation pays the full cold-start penalty: extract build products, run xcodebuild test-without-building, poll for ~15-20s until the HTTP server responds. Warm reuse cuts this from ~15s to ~8s per device. For local iteration (run flow, tweak, run again), that's the difference between responsive and sluggish.

A long-running daemon is the right eventual architecture for warm reuse, but it's significantly larger scope than fixing the existing mechanism. My changes make the existing shared-state approach correct (per-device keys, self-exclusion, file locking) without introducing new architectural concepts.

Summary

I've adopted ServerSocket(0) for port selection (pushed). I'd keep SessionStore with the per-device fixes — it's existing infrastructure that serves real purposes, and deleting it introduces new bugs that would require a daemon to solve properly.

@nabettu
Copy link
Copy Markdown

nabettu commented Apr 30, 2026

+1 on this. We're running parallel Maestro tests across multiple iOS simulators and hitting the exact port collision issue described here. Currently forced to run sequentially as a workaround, which significantly slows down our CI pipeline. Would love to see this merged!

@proksh
Copy link
Copy Markdown
Contributor

proksh commented Apr 30, 2026

@qwertey6 Thanks for the pushback and for shipping ServerSocket(0) - that part looks good.

I dug into SessionStore / MaestroSessionManager to make sure I understood what's actually there today. To summarize: ~/.maestro/sessions is a single file (not a directory) where KeyValueStore writes key=value lines. Each process writes a {platform}_{sessionId} key, refreshes its heartbeat timestamp every 5s, prunes entries older than 21s, and on shutdown checks activeSessions().isEmpty() before closing the driver. The heartbeat, the Thread.sleep(1000) race hack, the new FileLock, the per-device key, and the self-exclusion logic all exist to make that shared keyvalue file safe for N concurrent processes.

I agree warm reuse is worth keeping and a daemon is out of scope. What I'd still push on is the shape of the cross-process state — the shared ~/.maestro/sessions keyvalue file with FileLock is the fragile part. Shared mutable state across N processes with negotiated access tends to accumulate edge cases (stale entries from crashed processes, heartbeat skew, lock contention, NFS home dirs).

Alternative: turn ~/.maestro/sessions into a directory with one file per device, named by deviceId, reusing the existing KeyValueStore line format:
~/.maestro/drivers/<deviceId>.json{ "port": 50231, "pid": 12345, "startedAt": "..." }

  • A process starting a driver writes its own device's file; no other process writes to it.
  • For warm reuse: read the file for the target device, check PID is alive and port responds, connect or start fresh (overwriting).
  • Stale files detected by liveness check at read time — no heartbeat, no prune logic.
  • No FileLock, no shared keyvalue store, no self-exclusion, no activeSessions().isEmpty() shutdown coordination. Each process owns its own driver's lifecycle.

Migration is trivial — ~/.maestro/sessions is transient state, just delete the old file if it exists and mkdir the directory.

This addresses the underlying concern (processes shouldn't negotiate over shared mutable state) without losing warm reuse or scoping in a daemon. Should be smaller than what's in the PR today — SessionStore collapses to a thin per-device read/write, the KeyValueStore FileLock change drops out, and the heartbeat thread + Thread.sleep(1000) go away.

One open question: am I missing a case where two processes legitimately need to share the same driver on the same device simultaneously? `--shard-split~ distributes 1:1 onto distinct devices within a JVM, and across CLI invocations the second one can just connect to the first's HTTP server without host-side coordination. If there's a case I'm missing, I'll drop the objection.

@Fishbowler
Copy link
Copy Markdown
Contributor

I run maestro hierarchy against a device connected to Studio. Is that 2 sessions?

@proksh
Copy link
Copy Markdown
Contributor

proksh commented Apr 30, 2026

I run maestro hierarchy against a device connected to Studio. Is that 2 sessions?

Yes that will be 2 sessions.

@qwertey6
Copy link
Copy Markdown
Author

@proksh — I like the direction of per-device files. In fact, that's exactly what PR #3167 already implements — XCTestPortStore writes ~/.maestro/xctest-ports/<deviceId> with just the port number. On the next invocation, it reads the saved port, probes it with Socket.connect(), and reuses if something responds. No PID, no heartbeat, no prune logic.

One concern with the {port, pid, startedAt} JSON shape you proposed: PID liveness checks aren't cross-platform. ProcessHandle.of(pid) is Java 9+ (Maestro targets Java 8), and there's no reliable equivalent on Windows without JNA. The port probe alone is sufficient — if the port responds to /status, the runner is alive regardless of PID. If it doesn't, start fresh.

So the question is whether this per-device port file (already in #3167) can fully replace SessionStore, or whether there are cases where the heartbeat-based shared state is still needed. I believe it can replace it — the only cross-process coordination SessionStore provides is:

  1. Warm reuse: "is a driver already running on this device?" → port probe handles this
  2. Shutdown coordination: "should I close this driver?" → per-device file check handles this

Two paths forward:

I'd lean toward A — ship the bug fixes now, iterate on the architecture next — but happy to go either way.

Copy link
Copy Markdown
Contributor

@proksh proksh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One concern with the {port, pid, startedAt} JSON shape you proposed: PID liveness checks aren't cross-platform. ProcessHandle.of(pid) is Java 9+ (Maestro targets Java 8), and there's no reliable equivalent on Windows without JNA. The port probe alone is sufficient — if the port responds to /status, the runner is alive regardless of PID. If it doesn't, start fresh.

So the question is whether this per-device port file (already in #3167) can fully replace SessionStore, or whether there are cases where the heartbeat-based shared state is still needed. I believe it can replace it — the only cross-process coordination SessionStore provides is:

Warm reuse: "is a driver already running on this device?" → port probe handles this
Shutdown coordination: "should I close this driver?" → per-device file check handles this

Got it on PID / Java 8 / port probe + /status. That makes sense.

Re: replacing SessionStore with the per-device port approach in #3167: I’m aligned that warm reuse and shutdown hints can be driven by probe + on-disk port per deviceId, without heartbeats, as long as we define the shutdown cases clearly.

Two paths forward:

Option A: Merge this PR as-is (fixes the bugs in the existing SessionStore), then land #3167 next and use it as the foundation to refactor SessionStore out in a follow-up
Option B: Collapse #3138 and #3167 into a single PR that replaces SessionStore with per-device port files from the start
I'd lean toward A — ship the bug fixes now, iterate on the architecture next — but happy to go either way.

Lets go with option A. Prefer not to combine #3167 with this PR. This solution is already fixing a lot of edge cases we have.

Left a few minor comments.

}

fun activeSessionsForDevice(platform: Platform, deviceId: String? = null): List<String> {
val devicePrefix = "${platform}_${deviceId ?: "unknown"}_"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should avoid fallback to "unknown" here. This can cause 2 unrelated runs under one key.

Why is this deviceId nullable? Can we require non-null device key before writing session state.

}

@Test
fun `shouldCloseSession returns true when no sessions remain for device even if other devices have sessions`() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a false case for this.

Two Android heartbeats on emulator-5554, delete one → shouldCloseSession(ANDROID, "emulator-5554") is false.


val deviceASessions = sessionStore.activeSessionsForDevice(Platform.IOS, "device-A")

assertThat(deviceASessions).hasSize(1)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also validate the key here -
assertThat(deviceASessions.single()).isEqualTo(expectedKey)

For future proofing activeSessionsForDevice.

// DESIRED: should close because no Android sessions remain on this device
val shouldClose = sessionStore.shouldCloseSession(Platform.ANDROID, "emulator-5554")
assertThat(shouldClose).isTrue()
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a test for pruneInactiveSessions?

@proksh
Copy link
Copy Markdown
Contributor

proksh commented May 1, 2026

I run maestro hierarchy against a device connected to Studio. Is that 2 sessions?

To answer this more clearly @Fishbowler:

Yes - two newSessions (two processes). Second one may reuse the same XCTest runner by skipping driver.open() when connectToExistingSession is set. So two sessions, often one driver. Correct me if I am wrong @qwertey6

@Fishbowler
Copy link
Copy Markdown
Contributor

Thanks! I'd certainly like that capability to be retained.

@proksh
Copy link
Copy Markdown
Contributor

proksh commented May 6, 2026

@qwertey6 Let me know if you need anything from us.

iOS simulators share the host's localhost, causing port collisions when
multiple Maestro processes target different sims simultaneously. Session
tracking was per-platform, so two processes on different devices would
interfere with each other's sessions.

Changes:
- Per-device session tracking: SessionStore keys are now
  "{platform}_{deviceId}_{sessionId}" instead of "{platform}_{sessionId}"
- Add --driver-host-port CLI flag for explicit XCTest server port
- Auto-select available ports with isPortAvailable() check
- Refactor SessionStore from singleton to injectable class (DI)
- Add shouldCloseSession(platform, deviceId) for per-device shutdown
  instead of global activeSessions().isEmpty()
- Add cross-process file locking to KeyValueStore (~/.maestro/sessions)
- Append PID to debug log directory to prevent parallel race
- Enable useJUnitPlatform() in maestro-cli (was missing)
- Add SessionStoreTest with 8 tests covering isolation and lifecycle

Verified: 3 iOS simulators + Android emulator running simultaneously,
all passing. Both --driver-host-port (explicit) and auto-port-selection
work correctly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@qwertey6 qwertey6 force-pushed the pr/1-parallel-ios-execution branch from 58eacc2 to ab8ca1e Compare May 6, 2026 20:35
@qwertey6
Copy link
Copy Markdown
Author

qwertey6 commented May 6, 2026

@proksh — addressed all four comments:

  1. deviceId non-null: Removed all String? = null defaults from SessionStore. The "unknown" fallback is gone. When no device ID is available (e.g., --host mode), MaestroSessionManager falls back to the session UUID as a unique device key.

  2. shouldCloseSession false case: Added test — two sessions on emulator-5554, delete one, shouldCloseSession returns false.

  3. Key validation in activeSessionsForDevice: Added assertThat(deviceASessions.single()).isEqualTo("IOS_device-A_s1").

  4. pruneInactiveSessions test: Added — writes a stale entry (22s old) directly to KV store, triggers heartbeat (which prunes internally), verifies only the fresh entry remains.

All pushed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unable to run parallel/sharded tests on Android

5 participants