feat: InPlace update strategy for SeiNodeDeployment #80

Merged
bdchatham merged 10 commits into main from
feat/inplace-update-strategy
Apr 13, 2026

@bdchatham
Collaborator

Summary

Implements the InPlace update strategy as designed in #79 (.tide/designs/inplace-update-strategy.md).

What's new

  • InPlace added to UpdateStrategyType enum — alongside BlueGreen and HardFork
  • updateStrategy is now required — removes the implicit nil/fire-and-forget path. Zero-value migration treats empty type as InPlace with a logged warning.
  • Unified RolloutStatus replaces DeploymentStatus — single type used by all strategies with optional incumbent/entrant fields for BlueGreen/HardFork
  • RolloutInProgress condition — conditions-driven coordination between reconciler (detects diffs) and planner (actions them). Guards against concurrent mutations.
  • InPlace deployment plan: UpdateNodeSpecs → AwaitSpecUpdate → MarkReady
    • UpdateNodeSpecs patches child SeiNode images via kube client
    • AwaitSpecUpdate polls node.status.currentImage == node.spec.image (reads SeiNode status only, never StatefulSet directly)
    • MarkReady builds sidecar clients, submits mark-ready, confirms Ready status
  • status.currentImage on SeiNode — set by the node controller when StatefulSet rollout completes (currentRevision == updateRevision). Parent controllers compare against spec.image for convergence.
  • Stalled rollout escalation: RolloutInProgress reason transitions to Stalled after 10 minutes, providing a durable alerting signal
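
As a rough illustration of the zero-value migration mentioned in the list above, the sketch below maps an empty strategy type to InPlace and logs a warning. The helper name `effectiveStrategy` and the string values of the enum constants are assumptions made for this example; the PR description does not show the actual implementation.

```go
package main

import (
	"fmt"
	"log"
)

// UpdateStrategyType mirrors the enum named in this PR. The string values
// below are assumptions made for this sketch.
type UpdateStrategyType string

const (
	UpdateStrategyInPlace   UpdateStrategyType = "InPlace"
	UpdateStrategyBlueGreen UpdateStrategyType = "BlueGreen"
	UpdateStrategyHardFork  UpdateStrategyType = "HardFork"
)

// effectiveStrategy is a hypothetical helper. An empty type, as found on
// objects stored before the field became required, is treated as InPlace
// and a warning is logged. Any other value is returned unchanged.
func effectiveStrategy(t UpdateStrategyType) UpdateStrategyType {
	if t == "" {
		log.Printf("warning: updateStrategy.type is empty; treating as %s", UpdateStrategyInPlace)
		return UpdateStrategyInPlace
	}
	return t
}

func main() {
	fmt.Println(effectiveStrategy(""))                      // InPlace
	fmt.Println(effectiveStrategy(UpdateStrategyBlueGreen)) // BlueGreen
}
```

Normalizing at read time, rather than rewriting stored objects, keeps existing resources valid without a separate migration step; whether the PR does it this way is not stated.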

Architecture decisions (validated with Tide experts)

  • Deployment controller owns the full InPlace lifecycle including sidecar mark-ready (existing precedent: awaitNodesCaughtUpExecution, genesis assembly)
  • SeiNode controller is unaware of deployment strategy — it just converges StatefulSet/Service and surfaces currentImage
  • Three single-purpose plan steps — mutate, verify rollout, signal readiness. Clear failure diagnostics at each stage.
  • No auto-rollback — rolling back a blockchain binary after a chain upgrade leaves the node unable to process new blocks
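
The "three single-purpose plan steps" decision above can be pictured with a minimal sketch of a sequential plan runner. The `Task` type, the `advance` function, the `inPlacePlan` constructor, and the status names are all hypothetical; only the step names UpdateNodeSpecs, AwaitSpecUpdate, and MarkReady come from this PR.

```go
package main

import "fmt"

// TaskStatus values are assumptions for this sketch.
type TaskStatus string

const (
	TaskRunning  TaskStatus = "Running"
	TaskComplete TaskStatus = "Complete"
	TaskFailed   TaskStatus = "Failed"
)

// Task is a hypothetical single-purpose plan step.
type Task struct {
	Name string
	Run  func() TaskStatus
}

// advance runs tasks in order starting at index next. It stops at the first
// task that is not complete and returns that task's index and status, so a
// failure or a stall is attributable to exactly one step.
func advance(plan []Task, next int) (int, TaskStatus) {
	for i := next; i < len(plan); i++ {
		if s := plan[i].Run(); s != TaskComplete {
			return i, s
		}
	}
	return len(plan), TaskComplete
}

// inPlacePlan builds the three-step plan. The rolled flag simulates whether
// the StatefulSet rollout has finished.
func inPlacePlan(rolled *bool) []Task {
	return []Task{
		{Name: "UpdateNodeSpecs", Run: func() TaskStatus { return TaskComplete }},
		{Name: "AwaitSpecUpdate", Run: func() TaskStatus {
			if *rolled {
				return TaskComplete
			}
			return TaskRunning
		}},
		{Name: "MarkReady", Run: func() TaskStatus { return TaskComplete }},
	}
}

func main() {
	rolled := false
	plan := inPlacePlan(&rolled)

	i, s := advance(plan, 0)
	fmt.Println(plan[i].Name, s) // AwaitSpecUpdate Running

	rolled = true
	i, s = advance(plan, i)
	fmt.Println(i == len(plan), s) // true Complete
}
```

Because each reconcile resumes from the index of the last incomplete step, the step name alone tells an operator whether the mutation, the rollout, or the readiness signal is the one holding things up.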

Test plan

  • TestDetectDeploymentNeeded_InPlace_SetsRolloutInProgress
  • TestDetectDeploymentNeeded_InPlace_AlreadyActive
  • TestDetectDeploymentNeeded_EmptyType_TreatedAsInPlace
  • TestReconcileRolloutStatus_InPlace_AllReady
  • TestReconcileRolloutStatus_InPlace_Partial
  • TestReconcileRolloutStatus_InPlace_Stalled
  • TestComputeGroupPhase_RolloutInProgress
  • TestObserveCurrentImage_UpdatesWhenConverged
  • TestObserveCurrentImage_SkipsWhenRolling
  • TestInPlacePlan_ThreeTasks
  • Full test suite passes (make test)
  • Lint clean (make lint)

🤖 Generated with Claude Code

bdchatham and others added 6 commits April 13, 2026 11:34
Replace unconditional ensureMarkReady-on-every-reconcile with a
plan-based approach that preserves controller responsibility boundaries:

- SeiNodeDeployment controller (orchestrator) creates an InPlace plan:
  UpdateNodeSpecs → AwaitRunning
- UpdateNodeSpecs patches child SeiNode images and sets a
  ReadinessApproved condition on each SeiNode
- SeiNode controller (executor) reacts to ReadinessApproved by
  submitting mark-ready to its own sidecar, then clears the condition
- Standalone SeiNodes (no parent deployment) self-approve via
  owner reference check

This preserves the single-writer invariant: only the SeiNode controller
talks to the sidecar. The Kubernetes resource is the communication
channel between controllers, matching Cluster API and Crossplane
patterns.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…Place enum

- Add UpdateStrategyInPlace to the UpdateStrategyType enum
- Make updateStrategy a required field (remove pointer/optional)
- Replace DeploymentStatus with RolloutStatus containing: Strategy,
  TargetHash, StartedAt, per-node convergence tracking, plus the
  existing incumbent/entrant fields for BlueGreen/HardFork
- Add ConditionRolloutInProgress on SeiNodeDeployment
- Add ConditionReadinessApproved on SeiNode
- Migrate all references from Deployment→Rollout across controllers,
  planner, tasks, and tests

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…→ MarkReady)

- Add status.currentImage to SeiNode, populated by the node controller
  when StatefulSet rollout completes (currentRevision == updateRevision)
- Add UpdateNodeSpecs task: patches child SeiNode image via kube client
- Add AwaitSpecUpdate task: polls node.status.currentImage == spec.image
  to confirm the pod is running the new image (reads SeiNode status
  only, never StatefulSet directly)
- Add MarkNodesReady task: builds sidecar clients via sidecarClientForNode,
  submits mark-ready once reachable, completes when all sidecars report Ready
- Wire InPlace planner to generate the 3-step plan
- Register new task types in the task registry
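
A minimal sketch of the convergence observation described above, assuming local stand-in types instead of the real Kubernetes and SeiNode API types. The field names `CurrentRevision`, `UpdateRevision`, and `ReadyReplicas` match the Kubernetes StatefulSet status; everything else, including the function bodies, is an assumption. The extra guards on `ReadyReplicas` and a non-empty `CurrentRevision` reflect the hardening added in a later commit of this PR.

```go
package main

import "fmt"

// statefulSetStatus is a local stand-in for the StatefulSet status fields
// this sketch needs.
type statefulSetStatus struct {
	CurrentRevision string
	UpdateRevision  string
	ReadyReplicas   int32
}

// seiNode is a hypothetical, flattened view of a SeiNode.
type seiNode struct {
	SpecImage    string // spec.image
	CurrentImage string // status.currentImage
}

// observeCurrentImage sets status.currentImage once the rollout has
// finished. It reports whether the status changed.
func observeCurrentImage(node *seiNode, sts *statefulSetStatus) bool {
	if sts == nil {
		return false // StatefulSet not found: nothing to observe yet
	}
	if sts.CurrentRevision == "" || sts.ReadyReplicas < 1 {
		return false // fresh StatefulSet or no ready pod: too early to report
	}
	if sts.CurrentRevision != sts.UpdateRevision {
		return false // rollout still in progress
	}
	if node.CurrentImage == node.SpecImage {
		return false // already current: no-op
	}
	node.CurrentImage = node.SpecImage
	return true
}

// converged is the check a parent controller can poll: it reads SeiNode
// status only and never touches the StatefulSet.
func converged(nodes []seiNode) bool {
	for _, n := range nodes {
		if n.CurrentImage != n.SpecImage {
			return false
		}
	}
	return true
}

func main() {
	n := seiNode{SpecImage: "sei:v2", CurrentImage: "sei:v1"}
	rolling := &statefulSetStatus{CurrentRevision: "a", UpdateRevision: "b", ReadyReplicas: 1}
	done := &statefulSetStatus{CurrentRevision: "b", UpdateRevision: "b", ReadyReplicas: 1}

	fmt.Println(observeCurrentImage(&n, rolling), converged([]seiNode{n})) // false false
	fmt.Println(observeCurrentImage(&n, done), converged([]seiNode{n}))    // true true
}
```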

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…cking

- detectDeploymentNeeded sets RolloutInProgress condition and populates
  per-node RolloutNodeStatus on the RolloutStatus
- Add zero-value migration: empty updateStrategy.type treated as InPlace
- Add reconcileRolloutStatus for InPlace convergence tracking:
  polls child SeiNode phases, clears rollout when all nodes Running
- Add stalled rollout escalation: RolloutInProgress reason transitions
  to Stalled after 10 minutes with non-ready nodes
- computeGroupPhase returns Upgrading when RolloutInProgress is True

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- TestDetectDeploymentNeeded_InPlace_SetsRolloutInProgress
- TestDetectDeploymentNeeded_InPlace_AlreadyActive
- TestDetectDeploymentNeeded_EmptyType_TreatedAsInPlace
- TestReconcileRolloutStatus_InPlace_AllReady
- TestReconcileRolloutStatus_InPlace_Partial
- TestReconcileRolloutStatus_InPlace_Stalled
- TestComputeGroupPhase_RolloutInProgress
- TestObserveCurrentImage_UpdatesWhenConverged
- TestObserveCurrentImage_SkipsWhenRolling
- TestInPlacePlan_ThreeTasks

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
UpdateNodeSpecs → AwaitSpecUpdate → MarkReady plan, status.currentImage
for convergence observation, and deployment-level sidecar interaction.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bdchatham and others added 4 commits April 13, 2026 13:52
Critical fixes:
- Clear RolloutInProgress condition in completePlan/failPlan so future
  deployments are not permanently blocked
- Implement InPlace plan supersession: if the spec changes during an
  active rollout (targetHash != currentHash), the stale plan is replaced
  with a new one targeting the latest spec
- Harden observeCurrentImage: require ReadyReplicas >= 1 and non-empty
  CurrentRevision to prevent premature currentImage reporting
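
To make the supersession rule above concrete, here is a minimal sketch assuming a simplified `rolloutStatus` with only the `Strategy` and `TargetHash` fields. The `supersede` and `hashTemplate` helpers are hypothetical, and the hash here covers only the image string, whereas the PR's templateHash tracks more fields.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// rolloutStatus is a reduced stand-in for the RolloutStatus type in this PR.
type rolloutStatus struct {
	Strategy   string
	TargetHash string
}

// hashTemplate is a hypothetical stand-in for the real template hash.
// For this sketch it hashes only the image string.
func hashTemplate(image string) string {
	sum := sha256.Sum256([]byte(image))
	return hex.EncodeToString(sum[:])
}

// supersede returns the rollout to act on and whether a stale one was
// replaced. A rollout whose TargetHash differs from the hash of the current
// spec is stale and is replaced by a new one targeting the latest spec.
func supersede(active *rolloutStatus, currentHash string) (*rolloutStatus, bool) {
	if active != nil && active.TargetHash == currentHash {
		return active, false // plan still targets the latest spec
	}
	next := &rolloutStatus{Strategy: "InPlace", TargetHash: currentHash}
	return next, active != nil
}

func main() {
	active := &rolloutStatus{Strategy: "InPlace", TargetHash: hashTemplate("sei:v2-bad")}

	// The spec changes mid-rollout, for example to fix a bad image push.
	next, replaced := supersede(active, hashTemplate("sei:v2-fixed"))
	fmt.Println(replaced, next.TargetHash != active.TargetHash) // true true

	// No further spec change: the active rollout is kept as-is.
	same, replaced := supersede(next, hashTemplate("sei:v2-fixed"))
	fmt.Println(replaced, same == next) // false true
}
```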

Cleanup:
- Remove dead ConditionReadinessApproved (replaced by plan-based approach)
- Move stallThreshold const to top of status.go
- Fix gofmt formatting in seinode_types.go
- Update tests: supersession test, converged-image test with ReadyReplicas

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Add rollout lifecycle events: RolloutStarted, RolloutSuperseded,
  RolloutComplete emitted at the appropriate state transitions
- Add TODO for AwaitSpecUpdate pod failure detection (kubelet waiting
  reasons not exported as constants; mitigated by plan supersession)
- Clarify templateHash godoc: lists tracked fields and explains which
  changes trigger deployment plans vs in-place propagation
- Clean up unused imports from earlier rejected edit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Supersession handles the primary recovery case (bad image push).
The Upgrading phase itself is a durable signal for infra-level issues.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
observeCurrentImage edge cases:
- SkipsWhenReadyReplicasZero (race condition guard)
- SkipsWhenEmptyRevision (fresh StatefulSet)
- NoopWhenAlreadyCurrent (idempotency)
- StatefulSetNotFound (graceful nil return)

Plan lifecycle:
- CompletePlan_ClearsRolloutInProgress
- FailPlan_ClearsRolloutInProgress
- DoesNotClearWhilePlanActive (PlanInProgress guard)

Task execution:
- UpdateNodeSpecs patches image, skips already-current
- AwaitSpecUpdate completes when converged, stays Running otherwise
- MarkNodesReady deserialization and initial state

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@bdchatham bdchatham merged commit f6429c3 into main Apr 13, 2026
2 checks passed