fix(cloud-sdk): reclaim builder sandbox on cancellation#727

Open
jare1686 wants to merge 1 commit into
tensorlakeai:mainfrom
jare1686:fix/sandbox-build-cancellation-cleanup

@jare1686 jare1686 commented Jun 6, 2026

Contributor

Summary

build_sandbox_image (crates/cloud-sdk/src/sandbox_images.rs) creates a remote builder
sandbox and reclaims it with a sequential delete after the build's awaited block. If the
future is cancelled during the build (by a caller timeout, a losing select! branch, or client
shutdown), that delete never runs and the builder sandbox leaks until the server-side
timeout reaps it. The keepalive task compounds this: it is aborted only on the success path, so
on cancellation its JoinHandle is dropped and the task is detached, leaving a background loop
pinging the orphaned sandbox.
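
The hazard can be reproduced without an async runtime. The following is a minimal sketch, not the SDK's code: `build`, `Never`, `NoopWake`, and `delete_ran_after_cancel` are hypothetical names, and an `AtomicBool` stands in for the remote delete. The future is polled once so that it is suspended at its await point, then dropped, which is what a timeout or a losing select! branch does to it.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; enough to poll a future by hand.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Future that never completes (stand-in for a long-running build).
struct Never;

impl Future for Never {
    type Output = ();

    fn poll(self: Pin<&mut Self>, _cx: &mut Context<'_>) -> Poll<()> {
        Poll::Pending
    }
}

/// Hypothetical shape of the old code path: the delete runs only if
/// execution continues past the awaited block.
async fn build(deleted: Arc<AtomicBool>) {
    Never.await; // build in progress
    deleted.store(true, Ordering::SeqCst); // sequential "delete"
}

/// Polls the build once, cancels it by dropping the future, and reports
/// whether the sequential delete ran.
fn delete_ran_after_cancel() -> bool {
    let deleted = Arc::new(AtomicBool::new(false));
    let mut fut = Box::pin(build(deleted.clone()));

    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    assert!(fut.as_mut().poll(&mut cx).is_pending());

    drop(fut); // cancellation: nothing after the await point executes
    deleted.load(Ordering::SeqCst)
}
```

`delete_ran_after_cancel()` returns `false`: the code after the await point is never reached once the future is dropped, so any cleanup placed there is skipped.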

This change ties cleanup to ownership. A small BuilderSandboxCleanup guard captures the
runtime handle at creation and, on drop, reclaims the sandbox via a spawned (detached) delete,
so cancellation before or during cleanup cannot suppress it. The normal path still awaits the
delete and emits the existing warning, and a 404 is treated as success. The keepalive handle
is wrapped so that it aborts on drop. The change adds no new dependencies and no public API
surface.
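
The ownership idea can be sketched with the standard library alone. This is an illustration under stated assumptions, not the guard in this PR: `CleanupGuard`, `FakeClient`, and the helper functions are hypothetical, `std::thread::spawn` stands in for spawning on the captured runtime handle, and the "delete" only records the sandbox ID on a channel.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};
use std::thread;

/// Hypothetical stand-in for the remote delete call: it records the ID of
/// each deleted sandbox on a channel instead of contacting a server.
#[derive(Clone)]
struct FakeClient {
    deleted: Sender<String>,
}

impl FakeClient {
    fn delete_sandbox(&self, id: &str) {
        // Ignore send errors: the receiver may already be gone.
        let _ = self.deleted.send(id.to_string());
    }
}

/// Guard that owns responsibility for reclaiming one sandbox.
struct CleanupGuard {
    client: FakeClient,
    // `None` once the sandbox has been reclaimed, which disarms `Drop`.
    sandbox_id: Option<String>,
}

impl CleanupGuard {
    fn new(client: FakeClient, sandbox_id: &str) -> Self {
        Self { client, sandbox_id: Some(sandbox_id.to_string()) }
    }

    /// Normal path: delete inline and disarm the guard, so the drop that
    /// follows does not issue a second delete.
    fn reclaim(mut self) {
        if let Some(id) = self.sandbox_id.take() {
            self.client.delete_sandbox(&id);
        }
    }
}

impl Drop for CleanupGuard {
    fn drop(&mut self) {
        // Cancellation path: the guard was dropped without `reclaim`.
        if let Some(id) = self.sandbox_id.take() {
            let client = self.client.clone();
            // Detached on purpose: the thread's JoinHandle is discarded.
            thread::spawn(move || client.delete_sandbox(&id));
        }
    }
}

/// Collects every recorded delete; ends once all senders are dropped.
fn recorded(rx: Receiver<String>) -> Vec<String> {
    rx.iter().collect()
}

/// Drops the guard without reclaiming it, as cancellation would.
fn deletes_after_drop() -> Vec<String> {
    let (tx, rx) = channel();
    drop(CleanupGuard::new(FakeClient { deleted: tx }, "sbx-1"));
    recorded(rx)
}

/// Takes the normal path and lets the disarmed guard drop afterwards.
fn deletes_after_reclaim() -> Vec<String> {
    let (tx, rx) = channel();
    CleanupGuard::new(FakeClient { deleted: tx }, "sbx-2").reclaim();
    recorded(rx)
}
```

In this sketch both paths record exactly one delete: `deletes_after_drop()` yields `["sbx-1"]` through the detached thread, and `deletes_after_reclaim()` yields `["sbx-2"]` with no second delete from `Drop`. Storing the ID in an `Option` and taking it is one simple way to make the two paths mutually exclusive.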

This is best-effort SDK-side cleanup once the sandbox ID is known; guaranteed reclamation across
process exit or runtime shutdown is a server-side concern (a sandbox lease/TTL or a reaper) and
is out of scope here.

Validation

Tests cover delete-on-drop, awaited reclaim (no double delete), cancellation during cleanup,
404-as-success, non-404 error propagation, successful delete, and keepalive abort-on-drop.

  • cargo nextest run -p tensorlake
  • cargo fmt -p tensorlake --check
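
For reference, the status handling that the 404-as-success, non-404 error propagation, and successful-delete tests exercise might look roughly like the sketch below. `DeleteError` and `classify_delete` are hypothetical names chosen for illustration; the SDK's actual error type and function signatures may differ.

```rust
/// Hypothetical error type for a failed delete.
#[derive(Debug, PartialEq)]
struct DeleteError {
    status: u16,
}

/// Maps the HTTP status of a delete response to a cleanup outcome.
fn classify_delete(status: u16) -> Result<(), DeleteError> {
    match status {
        // The sandbox was deleted by this request.
        200..=299 => Ok(()),
        // The sandbox is already gone, so cleanup has nothing left to do.
        404 => Ok(()),
        // Anything else is a real failure and is propagated to the caller.
        other => Err(DeleteError { status: other }),
    }
}
```

Treating 404 as success keeps cleanup idempotent: if the server-side timeout or an earlier delete has already removed the sandbox, a repeated delete does not surface as a spurious error.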

Related: #528

diptanu commented Jun 7, 2026

Contributor

@jare1686 Thanks for the contribution! Can you please update the PRs so that lint passes and rebase from main?

jare1686 force-pushed the fix/sandbox-build-cancellation-cleanup branch from af16889 to a3d0d7e on June 7, 2026 17:05

jare1686 commented Jun 7, 2026

Contributor Author

Done! My pleasure.

2 participants