Skip to content

feat: add integrated device leasing#890

Merged
thymikee merged 16 commits into
mainfrom
codex/integrated-device-leasing
Jun 26, 2026
Merged

feat: add integrated device leasing#890
thymikee merged 16 commits into
mainfrom
codex/integrated-device-leasing

Conversation

@thymikee

@thymikee thymikee commented Jun 26, 2026

Copy link
Copy Markdown
Member

Summary

Add integrated device leasing for shared remote/proxy access, so multiple agents can use one long-lived agent-device proxy without fighting over the same simulator/device or surfacing low-level runner ownership errors.

Before this PR, remote/proxy clients could connect to the same host daemon but ownership was effectively enforced late by platform runner/helper locks. In practice, two sandboxed agents targeting the same iOS device could trip over errors like “iOS runner is already owned by another agent-device daemon”, and recovery often required manual cleanup or restarting the proxy.

After this PR, remote access has an explicit lease lifecycle:

  • connect stores the remote/proxy profile and stable client identity.
  • open lazily resolves the target device and acquires a provider-aware device lease.
  • Commands after open heartbeat the lease under the daemon request lock.
  • close releases the active session/device lease; inactive proxy leases expire after five minutes.
  • Contention fails early with DEVICE_LEASE_BUSY, including backend/provider/device metadata, instead of leaking into runner ownership failures.

Example direct-proxy usage:

# Device host Mac: start one long-lived proxy
agent-device proxy --port 4310
# Expose it through a tunnel
cloudflared tunnel --url http://127.0.0.1:4310
# or: ngrok http 4310
# Agent/client A
agent-device connect proxy \
  --daemon-base-url https://example.trycloudflare.com \
  --daemon-auth-token <token>
agent-device open Settings --platform ios --device "iPhone 17 Pro"
agent-device snapshot -i
agent-device click @e2
agent-device close
agent-device disconnect
# Agent/client B can share the same proxy
agent-device connect proxy \
  --daemon-base-url https://example.trycloudflare.com/agent-device \
  --daemon-auth-token <token>
agent-device open Maps --platform ios --device "iPhone 17 Pro Max"
agent-device snapshot -i
agent-device close
agent-device disconnect

If client B targets the exact device currently leased by client A, it gets a device-lease busy error and can retry after A closes or after the inactivity TTL expires. The proxy process itself is intended to stay up.

Implementation scope spans remote connection CLI/profile handling, daemon lease admission/session lifecycle, provider-aware device contention, generated non-secret proxy profiles, iOS runner lease diagnostics, docs, and focused regression coverage.

Touched-file count: 61.

Validation

Verified with focused lease/proxy coverage, static gates, and build:

  • pnpm exec vitest run src/daemon/__tests__/lease-context.test.ts src/daemon/__tests__/lease-registry.test.ts src/core/__tests__/lease-scope.test.ts src/__tests__/remote-connection.test.ts src/daemon/__tests__/lease-lifecycle.test.ts src/daemon/__tests__/request-execution-scope.test.ts src/platforms/ios/__tests__/runner-session.test.ts passed: 7 files / 142 tests.
  • pnpm exec vitest run src/__tests__/remote-connection.test.ts src/daemon/__tests__/lease-registry.test.ts passed after the final cleanup: 2 files / 49 tests.
  • pnpm format passed.
  • pnpm check:quick passed.
  • pnpm check:fallow --base origin/main passed: no issues in 61 changed files.
  • pnpm build passed.
  • pnpm check:unit unit phase passed: 277 files / 2719 tests. The aggregate still exits in test/integration/smoke-daemon-clean.test.ts:40 on the known daemon PID liveness assertion, outside the changed lease/proxy paths.

Manual proxy verification exercised one host proxy with multiple sandboxed clients opening stock apps on separate iOS simulators, confirmed same-device contention returns DEVICE_LEASE_BUSY, and confirmed close/reacquire works without restarting the proxy.

@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown
PR Preview Action v1.8.1
Preview removed because the pull request was closed.
2026-06-26 17:37 UTC

@github-actions

github-actions Bot commented Jun 26, 2026

Copy link
Copy Markdown

Size Report

Metric Base Current Diff
JS raw 1.3 MB 1.4 MB +19.9 kB
JS gzip 437.8 kB 444.2 kB +6.4 kB
npm tarball 576.1 kB 582.3 kB +6.2 kB
npm unpacked 1.9 MB 2.0 MB +21.2 kB

Startup median (7 runs, lower is better):

Scenario Base Current Diff
CLI --version 26.7 ms 27.0 ms +0.2 ms
CLI --help 47.2 ms 47.7 ms +0.5 ms

Top changed chunks:

Chunk Raw diff Gzip diff
dist/src/session.js +7.6 kB +2.4 kB
dist/src/cli.js +8.4 kB +2.2 kB
dist/src/9722.js -5.0 kB -1.8 kB
dist/src/2948.js +2.4 kB +503 B
dist/src/cli-help.js -747 B -176 B

@thymikee

Copy link
Copy Markdown
Member Author

Review finding on the latest green #890 head:

Blocking: src/cli/commands/connection-runtime.ts:115 and src/cli/commands/connection-runtime.ts:187 leave a newly allocated lease stranded if local remote-connection state cannot be written. materializeRemoteConnectionForCommand allocates or refreshes the remote lease, updates nextState with the lease id, then writes the local state file. If writeRemoteConnectionState throws, the catch only stops the newly prepared Metro cleanup and rethrows. The just-acquired lease remains active in the remote daemon, no session was opened/bound yet, and this client has no persisted state to release it later. For proxy/device-aware leases that means the selected device can stay busy until the 5 minute TTL.

Please track when this materialization acquired a new lease and release that lease in the write-failure cleanup path, passing the same tenant/run/provider/client/device metadata. The existing write-failure test at src/tests/remote-connection.test.ts:1338 already covers Metro cleanup; it should also assert the newly allocated lease is released. Existing leases that were merely heartbeated should not be released.

@thymikee

Copy link
Copy Markdown
Member Author

Still seeing the lease cleanup blocker on c8b99d6. The write-state failure path in materializeRemoteConnectionForCommand still catches writeRemoteConnectionState errors by stopping prepared Metro cleanup and rethrowing, but it does not release a fresh proxy lease returned by allocateOrReuseLease. So if allocation succeeds and the state write fails, the lease remains active until TTL. The new hardening commit updates selector propagation/scope checks, but I do not see a release-on-write-failure path or a regression test for that failure.

@thymikee thymikee force-pushed the codex/integrated-device-leasing branch from cd62dd1 to f66360a Compare June 26, 2026 14:41

Copy link
Copy Markdown
Member Author

Thermo-nuclear code-quality review

Reviewed the full diff (68 files, +3790/-595). The behavior and test coverage look solid, and a couple of extractions are genuinely good (generated-remote-config.ts centralizing secret-strip + persist for both cloud and proxy; session-teardown.ts pulling the close-time teardown steps out of the handler; the runAdmitted admission-under-lock split). File sizes are fine — nothing crosses 1k lines.

But this is the strict pass, and there are several structural issues I'd want addressed before merge. They mostly stem from one root cause: the "lease identity" concept is modeled ~10 different ways and re-stitched field-by-field at every layer, and the few helpers built to centralize that don't actually reach the boundaries that need them.

1. Blocking — the lease scope is one concept modeled as ~10 near-identical shapes, remapped by hand everywhere

We now have LeaseScope, SessionLease, DeviceLease, RunnerLogicalLeaseContext, LeaseDiagnosticsContext, RemoteConnectionRequestMetadata, plus the registry's AllocateLeaseRequest / HeartbeatLeaseRequest / ReleaseLeaseRequest / AdmissionRequest / LeaseScopeMatchRequest — all carrying essentially the same fields (tenantId/runId/leaseId/(lease)provider/deviceKey/clientId/backend/ttl) with slightly different names and optionality.

core/lease-scope.ts tries to centralize the mapping with ~8 leaseScopeToX functions, each of which manually re-lists those same fields. But the registry boundary uses a different field shape (backend instead of leaseBackend, and accepts both provider and leaseProvider), so the leaseScope* helpers don't reach it — and callers go back to hand-remapping the scope into the registry request object every single time:

  • daemon/handlers/lease.ts — 3× (allocate/heartbeat/release)
  • daemon/request-admission.ts — 2× (assertLeaseAdmission + heartbeatLease)
  • daemon/lease-lifecycle.ts releaseSessionLease
  • cli/commands/connection.ts disconnect
  • cli/commands/connection-runtime.ts allocateOrReuseLease / heartbeatOrAllocateLease

The abstraction exists but doesn't earn its keep — we pay for the mapping layer and still hand-write the same object literal at each call site. Code-judo here: pick one canonical LeaseScope/LeaseIdentity type and make LeaseRegistry accept it directly (align backendleaseBackend, drop the dual provider — see #2). Then assertLeaseAdmission/heartbeat/release/allocate take the scope as-is and the per-call-site remap literals disappear. That deletes complexity rather than relocating it.

2. Blocking — DeviceLease stores provider and leaseProvider, always equal

...(provider ? { leaseProvider: provider, provider } : {}),

The two fields are written from the same normalized value, and every read is leaseProvider ?? provider (connection-runtime.ts:131, lease-scope.ts:80, lease-registry.ts:160/439). I checked — the stored provider is never read independently of leaseProvider. It's vestigial legacy-compat leaking into the canonical stored model and then forcing ?? provider defensiveness through ~19 files.

normalizeLeaseProviderFields already collapses provider/leaseProvider to a single value at each registry entry point — so accept both at the input boundary if you must, but store and read only leaseProvider. Drop provider from DeviceLease, the four registry request types, and leaseScopeToLeaseRpcParams (which currently emits both keys). That removes a whole class of ?? provider branches.

3. Spaghetti growth — proxy special-cases woven into the shared materialize flow

Proxy-specific behavior is bolted into the general connection path as scattered leaseProvider === 'proxy' checks rather than living behind one boundary:

  • shouldAllocateLeaseForCommand(command, state)if (state.leaseProvider === 'proxy' && command === 'devices') return false
  • resolveProxyLeaseState(...) — an entire proxy-only branch inside materializeRemoteConnectionForCommand
  • leaseTtlMsForConnectionstate.leaseProvider === 'proxy' ? PROXY_REMOTE_LEASE_TTL_MS : undefined

This is feature logic leaking into a shared path. The cleaner shape is a small per-provider policy (cloud vs proxy): TTL, whether devices needs a lease, and the device-resolution step become methods/fields on the provider strategy, and materializeRemoteConnectionForCommand stops branching on the literal 'proxy'.

4. The complexity linter already flagged this — and the PR suppressed it 5×

New // fallow-ignore-next-line complexity were added to five production functions: materializeRemoteConnectionForCommand, connectCommand, prepareOpenDispatchSession, LeaseRegistry.assertOptionalScopeMatch, and isRemoteConnectionState. That rule is flagging exactly the growth this review cares about; suppressing it instead of decomposing is the smell. assertOptionalScopeMatch in particular mixes re-validation (normalizeTenantId/normalizeRunId/...) with comparison — the normFalse/throw and the equality checks want to be separate. connectCommand's nested remoteConfig ? : provider==='proxy' ? : cloud ternary should be a small provider dispatch instead.

5. Atomicity — the open lease-leak on write-failure (already raised, still applies)

@thymikee's standing comment about materializeRemoteConnectionForCommand leaking a freshly-allocated lease when writeRemoteConnectionState throws is, structurally, a non-atomic update: we acquire the remote lease and persist the local handle as two unsequenced steps, and the failure path only cleans up Metro. Beyond the targeted fix (track "freshly allocated this call" and release it on any failure before persist), this is the strongest argument for #1/#3 — once acquisition and persistence are a single provider-owned step with one cleanup path, the leak window is structural to close rather than a special-case catch.


Verdict: not approving. No correctness regression and no 1k-line explosions, but #1 and #2 are real structural regressions (concept duplicated ~10×; a vestigial stored field), #3 scatters feature logic across a shared flow, and #4 is the linter telling you the same thing. The good news is these collapse together: one canonical lease-scope type accepted at the registry boundary + a per-provider policy object would let most of the remap boilerplate, the ?? provider defensiveness, and several of the suppressed-complexity branches all disappear at once. Worth doing before this lands.


Generated by Claude Code

@thymikee

Copy link
Copy Markdown
Member Author

E2E proxy leasing verification update (2026-06-26):

  • Started local proxy at http://127.0.0.1:4310/agent-device with isolated host state dir /private/tmp/ad-e2e-host-state.
  • Verified generic direct proxy connect now materializes a proxy connection: connect --daemon-base-url .../agent-device --daemon-auth-token ... succeeded and saved the token for later open/snapshot calls.
  • Client A opened Settings on iPhone 17 Pro, then snapshot -i succeeded through the proxy. This previously failed with foreign iOS runner ownership.
  • Client B concurrently tried Maps on the same iPhone 17 Pro and received clean DEVICE_LEASE_BUSY with leaseProvider=proxy and the canonical deviceKey.
  • Client B opened Maps on iPhone 17 Pro Max while A owned the Pro; snapshot -i and @ref click both succeeded.
  • After close, B reacquired the previously held iPhone 17 Pro successfully.
  • Three gpt-5.4-mini sandbox clients exercised the same proxy: one got expected DEVICE_LEASE_BUSY, two completed open -> snapshot -i -> click -> close on the available simulator.
  • Cleanup completed: session list through the proxy returned [], saved test connections disconnected, proxy stopped.

Fixes pushed in 131a793:

  • persist proxy daemon auth token in saved connection state so follow-up commands work without repeating --daemon-auth-token
  • route connect --daemon-base-url .../agent-device into the proxy profile shortcut instead of cloud profile JSON parsing
  • allow an admitted proxy device lease to reclaim a live foreign iOS runner lease for the same canonical device key

Validation:

  • pnpm build
  • pnpm format
  • pnpm exec vitest run src/platforms/ios/tests/runner-session.test.ts src/tests/remote-connection.test.ts
  • pnpm check:quick
  • pnpm check:fallow --base origin/main
  • pnpm check:unit: unit phase passed (277 files / 2718 tests); aggregate still exits on the known unrelated smoke-daemon-clean PID liveness assertion at test/integration/smoke-daemon-clean.test.ts:40.

@thymikee

Copy link
Copy Markdown
Member Author

Addressed the latest lease review in a9d393013:

  • Canonicalized registry lease requests on leaseBackend/leaseProvider; DeviceLease no longer stores legacy provider.
  • Kept legacy provider accepted only as input compatibility at option/request projection boundaries; lease RPC params and client lease output no longer emit it.
  • Added lease-scope projection helpers for allocate/heartbeat/release and reused them in handlers, admission, and lifecycle cleanup instead of hand-remapping objects at each call site.
  • Moved proxy-specific materialization behavior behind a connection lease policy object: TTL, devices lease deferral, and proxy device resolution are policy-owned.
  • Removed the reviewed complexity suppressions by decomposing materializeRemoteConnectionForCommand, connectCommand, registry scope matching, remote connection state validation, and the reviewed session-open helper.
  • Fixed the lease leak: freshly allocated leases are now best-effort released if local connection-state persistence fails. Existing heartbeated leases are not released. Added the regression assertion to the existing write-failure test.
  • Tightened runner lease diagnostics so leaseBackend no longer projects as leaseProvider.

Validation:

  • pnpm exec vitest run src/daemon/__tests__/lease-context.test.ts src/daemon/__tests__/lease-registry.test.ts src/core/__tests__/lease-scope.test.ts src/__tests__/remote-connection.test.ts src/daemon/__tests__/lease-lifecycle.test.ts src/daemon/__tests__/request-execution-scope.test.ts src/platforms/ios/__tests__/runner-session.test.ts passed: 7 files / 142 tests.
  • pnpm format passed.
  • pnpm check:quick passed.
  • pnpm build passed.
  • pnpm check:fallow --base origin/main passed: no issues in 61 changed files.
  • pnpm check:unit unit phase passed on rerun: 277 files / 2719 tests. Aggregate still exits in test/integration/smoke-daemon-clean.test.ts:40 on the known PID liveness assertion; the previously transient Android mocked-ADB failures passed in isolation (src/core/__tests__/dispatch-rotate.test.ts + src/platforms/android/__tests__/index.test.ts, 94 tests).

@thymikee

Copy link
Copy Markdown
Member Author

Critical cleanup follow-up pushed in 289286039 after re-auditing the latest review points.

What I double-checked:

  • Stored lease model: DeviceLease has no legacy provider; client lease output also does not emit it.
  • Boundary compatibility: legacy provider?: string remains only as input compatibility in client options / lease scope normalization.
  • Registry boundary: requests use leaseBackend / leaseProvider; JSON-RPC wire payloads still keep backend for compatibility.
  • Proxy special-casing: materialization uses the provider policy object; the remaining proxy checks are connect-provider parsing and proxy lease-scope/domain behavior.
  • Complexity suppressions from the review: removed for the reviewed lease/materialization/state-validation functions. Remaining suppressions are pre-existing or unrelated command handlers.
  • Dead/obsolete code: removed the no-op LeaseRegistry.scopeMatchRequest wrapper and centralized duplicate remote connection lease-release mapping into releaseRemoteConnectionLease.
  • Coverage: added leaseBackend assertion to the proxy disconnect release test.

Validation after the cleanup:

  • pnpm exec vitest run src/__tests__/remote-connection.test.ts src/daemon/__tests__/lease-registry.test.ts passed: 2 files / 49 tests.
  • Earlier focused full lease/proxy suite passed: 7 files / 142 tests.
  • pnpm format passed.
  • pnpm check:quick passed.
  • pnpm check:fallow --base origin/main passed: no issues in 61 changed files.
  • pnpm build passed.
  • pnpm check:unit: unit phase passed, 277 files / 2719 tests; aggregate still exits only on the known test/integration/smoke-daemon-clean.test.ts:40 PID liveness assertion.

@thymikee

Copy link
Copy Markdown
Member Author

Addressed the agent-device-cloud review feedback in b3928fa31:

  • Cloud connect now supplies upstream fallback lease identity when the bridge profile omits it: leaseProvider: "cloud", stable clientId, and a fallback runId only when the bridge does not provide one. This keeps existing bridge-provided run/profile semantics, but upgrading agent-device alone no longer leaves cloud clients without a stable client identity.
  • Lease allocation now reports DEVICE_LEASE_BUSY when the same tenant/run/provider/device is already leased by a different client. Previously the run binding path hit client scope mismatch first, producing UNAUTHORIZED instead of device contention.
  • Added focused coverage for generated cloud profile identity and same-run/different-client device contention.

Important boundary: upstream still cannot invent a hosted/cloud deviceKey if the bridge/control plane does not provide or forward it. The agent-device-cloud follow-up to pass leaseProvider, clientId, and deviceKey through its command meta / lease RPC path is still the right cloud-side work for full hosted device-aware ownership.

Validation:

  • pnpm exec vitest run src/__tests__/cloud-connect-profile.test.ts src/daemon/__tests__/lease-registry.test.ts passed: 2 files / 18 tests.
  • pnpm format passed.
  • pnpm check:quick passed.
  • pnpm check:fallow --base origin/main passed.
  • pnpm build passed.

thymikee and others added 14 commits June 26, 2026 18:49
The proxy connect profile is written to disk as a non-secret remote config,
but it unconditionally copied `metroBearerToken` into that file, leaking the
secret at rest. Mirror the cloud path, which keeps `daemonAuthToken` in-memory
only: the token still flows through this connect via the returned flags, and
later commands re-supply it via AGENT_DEVICE_METRO_BEARER_TOKEN. Extend the
non-secret-profile test to assert the bearer token is absent from disk.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VPa5Z9GBkeqoxVctC85N7e
releaseSessionLease + sessionStore.delete ran only on the happy path, after
several awaits (app-log/perf/snapshot teardown, platform close dispatch,
runner stop) that can throw. A failed close therefore stranded the device
lease until the inactivity expiry. Wrap teardown in try/finally so ownership
is always freed; the original error still propagates after finally.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Claude-Session: https://claude.ai/code/session_01VPa5Z9GBkeqoxVctC85N7e
@thymikee

Copy link
Copy Markdown
Member Author

#890 is currently merge-conflicted after latest main (#893 landed). Please rebase/resolve conflicts first; after the branch is mergeable and the full check suite reruns, reply with what changed and I’ll re-review the latest head.

@thymikee thymikee force-pushed the codex/integrated-device-leasing branch from 18963dd to 386abe7 Compare June 26, 2026 16:53
@thymikee

Copy link
Copy Markdown
Member Author

Latest-head re-review on 386abe7: I do not see a remaining blocker.

Checked the lease ADR and the production route: connect persists non-secret connection/identity metadata, proxy open resolves a concrete device before lease allocation, lease allocate/heartbeat/release carry tenant/run/provider/client/device scope, request admission and heartbeat run under the daemon request lock, open stores the admitted lease on the session, and close/disconnect release or clean up best-effort. The branch is clean against main and checks are green. The earlier local proxy E2E contention smoke covers the device-facing path I would require here.

Residual risk: this is still broad routing/lifecycle work, so I would merge it with the existing proxy E2E evidence as the main confidence signal rather than relying on unit tests alone.

@thymikee thymikee merged commit a822325 into main Jun 26, 2026
21 checks passed
@thymikee thymikee deleted the codex/integrated-device-leasing branch June 26, 2026 17:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants