[codex] Preserve running VMs across controller restarts #454

Open
fkorotkov-oai wants to merge 1 commit into main from dev/fkorotkov/controller-restart-grace

@fkorotkov-oai

Collaborator

Summary

  • delay offline-worker VM failure checks for one workerOfflineTimeout after scheduler startup
  • keep stale workers excluded from new scheduling while allowing their existing VMs time to reconnect
  • add deterministic grace-boundary coverage and a process-level controller restart integration test

Root cause

Worker heartbeats are persisted in the controller database. After a controller outage longer than workerOfflineTimeout, the restarted scheduler could observe the stale heartbeat and mark assigned VMs as failed before their worker had time to reconnect, even though those VMs continued running throughout the outage.

Impact

Workers now get one configured workerOfflineTimeout after scheduler startup to reconnect before a stale heartbeat can fail their VMs. Steady-state worker failure detection and the exclusion of offline workers from scheduling are unchanged. The trade-off is that, after a controller restart, the VMs of a genuinely unavailable worker can take up to one additional timeout to be marked failed.

Validation

  • go test ./internal/controller/... -count=1
  • go test ./internal/tests -run '^TestControllerRestart(DoesNotFailRunningVMs|HelperProcess)$' -count=1
  • golangci-lint run ./internal/controller/scheduler ./internal/tests

@fkorotkov-oai fkorotkov-oai requested a review from edi-oai June 23, 2026 14:22
@fkorotkov-oai fkorotkov-oai marked this pull request as ready for review June 23, 2026 14:22