Skip to content

fix: remove unused RolloutNodeStatus and prevent templateHash race#81

Merged
bdchatham merged 2 commits intomainfrom
fix/remove-rollout-nodes-and-templatehash-race
Apr 13, 2026
Merged

fix: remove unused RolloutNodeStatus and prevent templateHash race#81
bdchatham merged 2 commits intomainfrom
fix/remove-rollout-nodes-and-templatehash-race

Conversation

@bdchatham
Copy link
Copy Markdown
Collaborator

@bdchatham bdchatham commented Apr 13, 2026

Summary

Two critical fixes discovered during prod deployment of the InPlace strategy to pacific-1:

1. RolloutNodeStatus causes CRD validation failure

The Nodes field on RolloutStatus was required by CRD validation but detectDeploymentNeeded stopped populating it after the simplification pass. Every status write failed with status.rollout.nodes: Required value, preventing any InPlace plan from being persisted.

Fix: Remove Nodes and RolloutNodeStatus entirely. Rollout lifecycle is fully managed by completePlan/failPlan — no separate convergence tracking needed. Also removes reconcileRolloutStatus which operated on the deleted field.

2. templateHash race between detectDeploymentNeeded and updateStatus

updateStatus advances templateHash to match the current spec when PlanInProgress is false. But it didn't check RolloutInProgress. This caused a race: detectDeploymentNeeded sets RolloutInProgress, then updateStatus overwrites templateHash before reconcilePlan can build the plan. On the next reconcile, no divergence is detected.

Fix: Guard templateHash advancement on both PlanInProgress and RolloutInProgress.

Test plan

  • Removed tests for deleted reconcileRolloutStatus
  • Updated plan_test.go to remove Nodes references
  • Full test suite passes (make test)
  • Lint clean (make lint)
  • Adversarial review with Tide experts requested

🤖 Generated with Claude Code

bdchatham and others added 2 commits April 13, 2026 15:59
- Remove Nodes field and RolloutNodeStatus type from RolloutStatus.
  The field was required by CRD validation but never populated,
  causing status writes to fail with "nodes: Required value".
  Rollout lifecycle is fully managed by completePlan/failPlan.

- Remove reconcileRolloutStatus (was operating on the removed Nodes
  field; rollout clearing is handled by completePlan).

- Prevent templateHash from advancing when RolloutInProgress is true.
  Without this, updateStatus races with detectDeploymentNeeded:
  the hash converges before the plan can be created.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…uard

- failPlan now returns ResultRequeueImmediate (matching completePlan)
  so the controller short-circuits before the final updateStatus call
- Add defensive nil check on Rollout before accessing TargetHash in
  the supersession event message

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham merged commit b6ed0ce into main Apr 13, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant