feat: auto-delete stopped workspaces after configurable TTL #916

Merged
simple-agent-manager[bot] merged 7 commits into main from sam/stopped-workspace-auto-delete on May 6, 2026

simple-agent-manager Bot commented May 6, 2026

Summary

  • Stopped workspaces are automatically deleted after a configurable TTL (default 5 minutes) to prevent disk exhaustion on nodes
  • The NodeLifecycle Durable Object schedules a deletion alarm when a workspace stops; when the alarm fires, it calls the VM agent DELETE endpoint and updates D1
  • A cron safety-net sweep catches workspaces whose DO alarm failed to fire, using a grace period of 2x the TTL
  • All four workspace-stop paths schedule deletion: lifecycle stop route, task-runner cleanupTaskRun, task-runner cleanupOnFailure, and idle-cleanup timeout. The lifecycle restart route cancels a pending deletion
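
The schedule, cancel, and fire cycle summarized above can be sketched as plain bookkeeping over a map of pending deletions. This is a minimal illustration under assumptions, not the code in node-lifecycle.ts: `PendingDeletions`, `scheduleDeletion`, `cancelDeletion`, and `takeExpired` are hypothetical names, and the real Durable Object persists its state and calls the VM agent instead of returning IDs.

```typescript
// Hypothetical shape: workspaceId -> epoch ms at which deletion becomes due.
type PendingDeletions = Record<string, number>;

// 5 minute default, matching the value stated in this PR.
const DEFAULT_WORKSPACE_STOPPED_TTL_MS = 5 * 60 * 1000;

// Called when a workspace stops: record when it should be deleted.
function scheduleDeletion(
  pending: PendingDeletions,
  workspaceId: string,
  now: number,
  ttlMs: number = DEFAULT_WORKSPACE_STOPPED_TTL_MS,
): PendingDeletions {
  return { ...pending, [workspaceId]: now + ttlMs };
}

// Called when a workspace restarts before the TTL expires.
function cancelDeletion(
  pending: PendingDeletions,
  workspaceId: string,
): PendingDeletions {
  const rest = { ...pending };
  delete rest[workspaceId];
  return rest;
}

// Called when the alarm fires: split entries into due and not-yet-due.
function takeExpired(
  pending: PendingDeletions,
  now: number,
): { expired: string[]; remaining: PendingDeletions } {
  const expired: string[] = [];
  const remaining: PendingDeletions = {};
  for (const [id, dueAt] of Object.entries(pending)) {
    if (dueAt <= now) {
      expired.push(id);
    } else {
      remaining[id] = dueAt;
    }
  }
  return { expired, remaining };
}
```

Keeping the bookkeeping pure makes the restart case easy to reason about: a cancelled workspace simply never appears in `expired`, so no delete call is issued for it.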

Changes

  • packages/shared: Added DEFAULT_WORKSPACE_STOPPED_TTL_MS constant (5 min)
  • apps/api/src/env.ts: Added WORKSPACE_STOPPED_TTL_MS env var
  • apps/api/src/durable-objects/node-lifecycle.ts: Extended with scheduleWorkspaceDeletion, cancelWorkspaceDeletion, unified recalculateAlarm (warm + deletion), deleteWorkspace via shared deleteWorkspaceOnNode helper
  • apps/api/src/routes/workspaces/lifecycle.ts: Stop → schedule deletion, Restart → cancel deletion
  • apps/api/src/services/task-runner.ts: cleanupTaskRun schedules deletion
  • apps/api/src/durable-objects/task-runner/state-machine.ts: cleanupOnFailure schedules deletion
  • apps/api/src/durable-objects/project-data/index.ts: Idle cleanup schedules deletion
  • apps/api/src/scheduled/node-cleanup.ts: Safety-net cron sweep for stale stopped workspaces (with TOCTOU-safe status guard)
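
The TOCTOU-safe status guard mentioned for the cron sweep addresses a race: the sweep first selects stale stopped workspaces and only later writes the update, so a workspace may have been restarted in between. The sketch below models the guard with an in-memory map standing in for the D1 table. The row shape, the `deleted` status value, and the SQL in the comment are assumptions for illustration, not the actual schema.

```typescript
// Hypothetical status values and row shape.
type WorkspaceStatus = "running" | "stopped" | "deleted";

interface WorkspaceRow {
  id: string;
  status: WorkspaceStatus;
}

// Models a guarded update along the lines of (table/column names assumed):
//   UPDATE workspaces SET status = 'deleted'
//   WHERE id = ? AND status = 'stopped'
// Returns true only if the row was still stopped at write time.
function markDeletedIfStillStopped(
  rows: Map<string, WorkspaceRow>,
  workspaceId: string,
): boolean {
  const row = rows.get(workspaceId);
  if (!row || row.status !== "stopped") {
    return false; // restarted, already deleted, or unknown: leave untouched
  }
  row.status = "deleted";
  return true;
}
```

Because the status check and the write happen in the same statement, a workspace that was restarted after the sweep's initial query is left untouched rather than being marked deleted.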

Test plan

  • 42 source-contract tests for workspace deletion feature
  • Updated existing warm-node-pooling integration tests for recalculateAlarm
  • Updated node-cleanup unit test for new stoppedWorkspacesDeleted counter
  • Full test suite passes (4422/4423 — 1 pre-existing unrelated timeout)
  • Lint, typecheck, build all pass
  • Staging deploy succeeded and app verified via Playwright

Specialist Review Evidence

| Reviewer | Status | Outcome |
| --- | --- | --- |
| cloudflare-specialist | ADDRESSED | 2 HIGH (VM agent auth, storage reads) + 2 MEDIUM (TOCTOU guard, alarm bypass) fixed |
| task-completion-validator | ADDRESSED | Idle-cleanup caller site wired up |

Agent Preflight (Required)

  • Preflight completed before code changes

Classification

  • external-api-change
  • cross-component-change
  • business-logic-change
  • public-surface-change
  • docs-sync-change
  • security-sensitive-change
  • ui-change
  • infra-change

External References

N/A: Pure internal feature using existing codebase patterns (NodeLifecycle DO alarm, deleteWorkspaceOnNode helper, cron sweep). No external APIs or new dependencies introduced.

Codebase Impact Analysis

  • packages/shared/src/constants/node-pooling.ts — new DEFAULT_WORKSPACE_STOPPED_TTL_MS constant
  • apps/api/src/env.ts — new WORKSPACE_STOPPED_TTL_MS optional env var
  • apps/api/src/durable-objects/node-lifecycle.ts — workspace deletion scheduling, unified recalculateAlarm, deleteWorkspace method using shared helper
  • apps/api/src/routes/workspaces/lifecycle.ts — stop/restart hooks into NodeLifecycle DO
  • apps/api/src/services/task-runner.ts — cleanupTaskRun schedules deletion
  • apps/api/src/durable-objects/task-runner/state-machine.ts — cleanupOnFailure schedules deletion
  • apps/api/src/durable-objects/project-data/index.ts — idle cleanup schedules deletion after stop
  • apps/api/src/scheduled/node-cleanup.ts — safety-net cron for stale stopped workspaces with TOCTOU guard

Documentation & Specs

N/A: No public-facing API surface or documentation changes. The feature is purely internal (DO alarm + cron) with no user-visible UI. CLAUDE.md already documents NodeLifecycle DO and warm pool concepts.

Constitution & Risk Check

Principle XI (No Hardcoded Values): TTL is configurable via WORKSPACE_STOPPED_TTL_MS env var with DEFAULT_WORKSPACE_STOPPED_TTL_MS constant fallback. No hardcoded timeouts or limits. The cron sweep uses a 2x multiplier on the configurable TTL for its grace buffer.
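
The fallback and the 2x grace buffer described above could look roughly like the following. This is a sketch under assumptions: the env var is treated as an optional string, invalid or non-positive values are assumed to fall back to the default, and `resolveStoppedTtlMs`, `cronSweepCutoff`, and `CRON_GRACE_MULTIPLIER` are hypothetical names.

```typescript
// Default from packages/shared as stated in this PR (5 minutes).
const DEFAULT_WORKSPACE_STOPPED_TTL_MS = 5 * 60 * 1000;

// Hypothetical name for the 2x multiplier the cron sweep applies.
const CRON_GRACE_MULTIPLIER = 2;

// Assumption: the env var arrives as an optional string.
interface TtlEnv {
  WORKSPACE_STOPPED_TTL_MS?: string;
}

// Use the env var when it parses to a positive number, else the default.
function resolveStoppedTtlMs(env: TtlEnv): number {
  const parsed = Number(env.WORKSPACE_STOPPED_TTL_MS);
  return Number.isFinite(parsed) && parsed > 0
    ? parsed
    : DEFAULT_WORKSPACE_STOPPED_TTL_MS;
}

// Workspaces stopped before this timestamp are eligible for the sweep.
function cronSweepCutoff(now: number, ttlMs: number): number {
  return now - CRON_GRACE_MULTIPLIER * ttlMs;
}
```

With the default TTL, the sweep would only consider workspaces that have been stopped for more than 10 minutes, which leaves the DO alarm a full extra TTL to do the work first.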

Staging Verification

  • Staging deployment green (run 25416009846)
  • Live app verified via Playwright (login 200, dashboard loads, settings loads)
  • API health check passes
  • No UI changes — feature is purely backend alarm/cron logic

🤖 Generated with Claude Code

raphaeltm and others added 7 commits May 6, 2026 03:24
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Extend NodeLifecycle DO to schedule workspace deletion 5 minutes after
a workspace stops. This prevents disk exhaustion on nodes caused by
accumulated stopped workspaces (Docker volumes with git repos, node_modules).

- Add DEFAULT_WORKSPACE_STOPPED_TTL_MS (300000ms) to shared constants
- Add scheduleWorkspaceDeletion/cancelWorkspaceDeletion to NodeLifecycle DO
- Alarm handler processes expired deletions (VM agent DELETE + D1 update)
- recalculateAlarm picks earliest of warm timeout and pending deletions
- Schedule deletion from: stop route, cleanupTaskRun, cleanupOnFailure
- Cancel deletion from: restart route (before TTL expires)
- Cron safety-net in node-cleanup.ts for missed DO alarms
- Configurable via WORKSPACE_STOPPED_TTL_MS env var

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
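
A Durable Object can have only one alarm scheduled at a time, which is why the warm timeout and the pending deletions have to share it. The "earliest of" rule from the commit above can be sketched as a pure function. `nextAlarmTime` and its parameter shapes are hypothetical; the real recalculateAlarm reads these values from DO storage.

```typescript
// Returns the epoch ms at which the single DO alarm should fire,
// or null when nothing is pending and the alarm can be cleared.
function nextAlarmTime(
  warmTimeoutAt: number | null,
  pendingDeletions: Record<string, number>,
): number | null {
  const candidates: number[] = Object.values(pendingDeletions);
  if (warmTimeoutAt !== null) {
    candidates.push(warmTimeoutAt);
  }
  return candidates.length > 0 ? Math.min(...candidates) : null;
}
```

In a Durable Object, a non-null result would typically be passed to `storage.setAlarm(...)`, and a null result would lead to `storage.deleteAlarm()`.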
… counter

- warm-node-pooling integration test: setAlarm → recalculateAlarm, deleteAlarm → recalculateAlarm(null)
- node-cleanup unit test: add stoppedWorkspacesDeleted to expected result, mock node-agent and project-data

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- [HIGH] Use deleteWorkspaceOnNode shared helper with proper JWT auth
  instead of raw IP fetch with X-User-ID header
- [HIGH] Hoist getStoredState() before the deletion loop to avoid N
  redundant DO storage reads
- [MEDIUM] Add status='stopped' guard to cron sweep D1 update (TOCTOU)
- [MEDIUM] Use recalculateAlarm in destroying branch so pending workspace
  deletions are not delayed by retry window

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
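
The storage-read fix in the commit above amounts to loading the DO state once before iterating. A minimal sketch, assuming a hypothetical `StoredState` shape and callback parameters in place of the real getStoredState and deleteWorkspaceOnNode helpers; collecting failed IDs is also an assumption about error handling:

```typescript
// Hypothetical subset of the DO's stored state.
interface StoredState {
  nodeId: string;
}

// Deletes each expired workspace, reading stored state once up front
// rather than once per workspace. Returns the IDs that failed.
async function deleteExpired(
  expiredIds: string[],
  getStoredState: () => Promise<StoredState>,
  deleteOnNode: (state: StoredState, workspaceId: string) => Promise<void>,
): Promise<string[]> {
  const state = await getStoredState(); // hoisted: a single storage read
  const failed: string[] = [];
  for (const workspaceId of expiredIds) {
    try {
      await deleteOnNode(state, workspaceId);
    } catch {
      failed.push(workspaceId); // keep going; one failure should not block the rest
    }
  }
  return failed;
}
```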
The processExpiredCleanups idle timeout path stops workspaces but
wasn't scheduling automatic deletion. Workspaces stopped via idle
timeout now get deletion scheduled via the NodeLifecycle DO, matching
the lifecycle route and task-runner paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
sonarqubecloud Bot commented May 6, 2026

Quality Gate failed

Failed conditions
13.5% Duplication on New Code (required ≤ 3%)

See analysis details on SonarQube Cloud

simple-agent-manager Bot merged commit 54e3ed8 into main on May 6, 2026
18 of 19 checks passed