[codex] Preserve running VMs across controller restarts by fkorotkov-oai · Pull Request #454 · openai/orchard

fkorotkov-oai · 2026-06-23T14:21:48Z

Summary

- Delay offline-worker VM failure checks for one workerOfflineTimeout after scheduler startup.
- Keep stale workers excluded from new scheduling while allowing their existing VMs time to reconnect.
- Add deterministic grace-boundary coverage and a process-level controller restart integration test.

Root cause

Worker heartbeats are persisted in the controller database. After a controller outage longer than workerOfflineTimeout, the restarted scheduler could observe the stale heartbeat and mark assigned VMs as failed before their worker had time to reconnect, even though those VMs continued running throughout the outage.

Impact

Workers now receive one configured offline timeout after scheduler startup to reconnect before stale heartbeats can fail their VMs. Normal steady-state worker failure detection and offline-worker scheduling exclusion remain unchanged. A genuinely unavailable worker can take up to one additional timeout after a controller restart to have its VMs marked failed.

Validation

go test ./internal/controller/... -count=1
go test ./internal/tests -run '^TestControllerRestart(DoesNotFailRunningVMs|HelperProcess)$' -count=1
golangci-lint run ./internal/controller/scheduler ./internal/tests

Preserve running VMs across controller restarts

835c71d

fkorotkov-oai requested a review from edi-oai June 23, 2026 14:22

fkorotkov-oai marked this pull request as ready for review June 23, 2026 14:22

fkorotkov-oai wants to merge 1 commit into main from dev/fkorotkov/controller-restart-grace
