fix: re-read object in completePlan/failPlan to avoid stale resourceVersion#82

Merged
bdchatham merged 1 commit into main from fix/stale-statusbase-in-complete-fail-plan on Apr 13, 2026

@bdchatham
Collaborator

Summary

Fixes deployments permanently stuck in Upgrading phase after plan completion.

Root cause

The plan executor patches status mid-reconcile (marking tasks and plans complete), bumping resourceVersion on the API server. completePlan and failPlan then tried to patch status using the statusBase computed at the start of Reconcile, which carried the old resourceVersion. This caused guaranteed optimistic lock conflicts on every reconcile, creating an infinite retry loop.

This is a structural bug, not specific to the InPlace migration. Any plan that completes tasks would hit this.

Fix

completePlan and failPlan now re-read the object via r.Get before building their patch base, ensuring a fresh resourceVersion. The stale statusBase parameter is removed from both functions and drivePlan.
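A minimal sketch of the re-read pattern, using a toy in-memory store rather than the real client. `Store`, `Object`, `Patch`, `reconcile`, and the `Ready` phase are hypothetical names for illustration, and `s.Get()` only stands in for the `r.Get` call mentioned above. The actual signatures of `completePlan` and `failPlan` are not shown in this PR description, so the one below is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrConflict stands in for the API server's conflict response (assumption).
var ErrConflict = errors.New("conflict: stale resourceVersion")

// Object is a hypothetical, stripped-down resource.
type Object struct {
	ResourceVersion int
	Phase           string
	TasksDone       int
}

// Store is a toy in-memory stand-in for the API server.
type Store struct{ current Object }

// Get returns a copy of the stored object.
func (s *Store) Get() Object { return s.current }

// Patch applies mutate only if base carries the current ResourceVersion.
func (s *Store) Patch(base Object, mutate func(*Object)) error {
	if base.ResourceVersion != s.current.ResourceVersion {
		return ErrConflict
	}
	mutate(&s.current)
	s.current.ResourceVersion++
	return nil
}

// completePlan takes no base parameter: it re-reads the object so the
// patch base carries whatever version the executor's patches left behind.
func completePlan(s *Store) error {
	fresh := s.Get()
	return s.Patch(fresh, func(o *Object) { o.Phase = "Ready" })
}

// reconcile runs one executor patch and then completes the plan.
func reconcile(s *Store) error {
	if err := s.Patch(s.Get(), func(o *Object) { o.TasksDone++ }); err != nil {
		return err
	}
	return completePlan(s)
}

func main() {
	s := &Store{current: Object{ResourceVersion: 1, Phase: "Upgrading"}}
	err := reconcile(s)
	fmt.Printf("err=%v phase=%s\n", err, s.Get().Phase)
}
```

Dropping the base parameter, rather than refreshing it at the call site, makes it impossible for a caller to hand in a base that predates the executor's patches.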

Also adds a TODO on detectDeploymentNeeded to guard against an empty incumbentNodes list, since a plan with an empty node list completes as a no-op.

Test plan

  • Updated TestCompletePlan_ClearsRolloutInProgress — no longer passes statusBase
  • Updated TestFailPlan_ClearsRolloutInProgress — no longer passes statusBase
  • Full test suite passes (make test)
  • Lint clean (make lint)

🤖 Generated with Claude Code

…ersion

The plan executor patches status mid-reconcile (task/plan completion),
bumping resourceVersion. completePlan/failPlan then tried to patch
status using the statusBase from the start of Reconcile, which carried
the old resourceVersion. This caused guaranteed optimistic lock
conflicts, leaving deployments permanently stuck in Upgrading phase.

Fix: completePlan and failPlan now re-read the object via r.Get before
building their patch base, ensuring a fresh resourceVersion. The stale
statusBase parameter is removed from both functions and drivePlan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bdchatham merged commit 09cd670 into main on Apr 13, 2026
2 checks passed