design: HardFork deployment simplification via bootstrap #67

Open
bdchatham wants to merge 1 commit into main from
design/hardfork-bootstrap-simplification

@bdchatham
Collaborator

Summary

Low-level design (LLD) for simplifying the HardFork deployment plan from 6 tasks to 4 by leveraging the existing bootstrap Job mechanism.

Key Change

Instead of the controller orchestrating halt-height coordination between incumbent and entrant nodes via sidecar tasks (SubmitHaltSignal + AwaitNodesAtHeight), the entrant node's bootstrap spec handles it:

  • Bootstrap image = incumbent binary (e.g., sei:v6.3.0)
  • Target height = haltHeight - 1
  • Production image = new binary (e.g., sei:v6.4.0)

The bootstrap Job syncs to one block before the upgrade height and exits cleanly. The StatefulSet then starts with the new binary, which processes the upgrade height.
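
As a rough illustration of the three settings above, here is a minimal Go sketch of how an entrant spec could be derived from the incumbent image, the new image, and the halt height. The `BootstrapSpec` and `EntrantSpec` types, their field names, the `buildEntrantSpec` helper, and the halt height of 1000 are all hypothetical and are not taken from the design doc or the CRD.

```go
package main

import (
	"errors"
	"fmt"
)

// BootstrapSpec is a hypothetical shape for the entrant's bootstrap settings;
// the real CRD field names are not shown in this PR description.
type BootstrapSpec struct {
	Image        string // binary the bootstrap Job runs (the incumbent version)
	TargetHeight int64  // height at which the bootstrap Job exits cleanly
}

// EntrantSpec pairs the bootstrap settings with the production image.
type EntrantSpec struct {
	Bootstrap BootstrapSpec
	Image     string // binary the StatefulSet runs after bootstrap (the new version)
}

// buildEntrantSpec derives the entrant spec for a HardFork rollout:
// bootstrap with the incumbent binary up to haltHeight-1, then run the new binary.
func buildEntrantSpec(incumbentImage, newImage string, haltHeight int64) (EntrantSpec, error) {
	if incumbentImage == "" || newImage == "" {
		return EntrantSpec{}, errors.New("incumbent and new images are required")
	}
	if haltHeight < 2 {
		// haltHeight-1 must still be a valid block height (assumed to start at 1).
		return EntrantSpec{}, fmt.Errorf("haltHeight must be at least 2, got %d", haltHeight)
	}
	return EntrantSpec{
		Bootstrap: BootstrapSpec{Image: incumbentImage, TargetHeight: haltHeight - 1},
		Image:     newImage,
	}, nil
}

func main() {
	// The halt height of 1000 is an illustrative value only.
	spec, err := buildEntrantSpec("sei:v6.3.0", "sei:v6.4.0", 1000)
	if err != nil {
		panic(err)
	}
	fmt.Printf("bootstrap %s to height %d, then run %s\n",
		spec.Bootstrap.Image, spec.Bootstrap.TargetHeight, spec.Image)
}
```

Keeping the `haltHeight - 1` arithmetic in one helper means the off-by-one boundary between the old and new binary is decided in a single place rather than spread across tasks.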

Plan Reduction

Before (6 tasks): Create → AwaitRunning → SubmitHaltSignal → AwaitHeight → Switch → Teardown
After  (4 tasks): Create → AwaitRunning → Switch → Teardown
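
The reduction above amounts to dropping the two halt-coordination tasks while keeping the order of the rest. A minimal Go sketch of that idea follows. It uses the shorthand task names from the summary as plain labels; the `Task` type, the `simplifyPlan` helper, and both variables are hypothetical, and the controller's real task identifiers may differ.

```go
package main

import "fmt"

// Task is a hypothetical label for one step of a deployment plan.
type Task string

// legacyHardForkPlan lists the six tasks using the shorthand names from the
// plan summary; the controller's real task identifiers may differ.
var legacyHardForkPlan = []Task{
	"Create", "AwaitRunning", "SubmitHaltSignal", "AwaitHeight", "Switch", "Teardown",
}

// haltCoordination marks the tasks that the bootstrap Job makes unnecessary.
var haltCoordination = map[Task]bool{
	"SubmitHaltSignal": true,
	"AwaitHeight":      true,
}

// simplifyPlan returns a copy of plan without the halt-coordination tasks,
// preserving the order of the remaining tasks.
func simplifyPlan(plan []Task) []Task {
	out := make([]Task, 0, len(plan))
	for _, t := range plan {
		if !haltCoordination[t] {
			out = append(out, t)
		}
	}
	return out
}

func main() {
	fmt.Println(simplifyPlan(legacyHardForkPlan))
}
```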

Design Covers

  • CRD changes (IncumbentImage on DeploymentStatus)
  • Planner simplification (remove 2 tasks)
  • CreateEntrantNodes bootstrap injection
  • Entrant lifecycle trace (bootstrap → upgrade handler → Running)
  • Failure modes and recovery
  • BlueGreen convergence analysis
  • Traffic switch timing
  • File-by-file changes and implementation order

📄 Design doc: .tide/designs/hardfork-bootstrap-simplification.md

🤖 Generated with Claude Code

LLD for reducing the HardFork deployment plan from 6 tasks to 4 by
leveraging the existing bootstrap Job system. The entrant node's
bootstrap image (incumbent binary) syncs to haltHeight-1, then the
production StatefulSet (new binary) handles the upgrade height.

Eliminates SubmitHaltSignal and AwaitNodesAtHeight tasks — the
bootstrap mechanism handles halt-height coordination internally.
Incumbents keep running until traffic switch, removing the risk of
premature SIGTERM.

Covers: CRD changes, planner simplification, CreateEntrantNodes
bootstrap injection, failure modes, BlueGreen convergence analysis,
traffic switch timing, and implementation order.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bdchatham added a commit that referenced this pull request Apr 13, 2026
LLD for formalizing the InPlace update strategy as an explicit
UpdateStrategyType alongside BlueGreen and HardFork. Covers CRD
changes, rollout status tracking, upgrade height gating, the sidecar
mark-ready restart fix, and failure modes.

Key decisions:
- No plan/task machinery — uses existing ensureSeiNode path
- ensureMarkReady in reconcileRunning unblocks all pod restarts
- Optional upgradeHeight gating prevents premature rollouts
- No auto-rollback (unsafe for chain upgrades)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
bdchatham added a commit that referenced this pull request Apr 13, 2026
* design: InPlace update strategy for SeiNodeDeployment (#67 sibling)

* design: revise InPlace strategy per PR review feedback

- Remove nil/no-strategy path; updateStrategy is now required
- Unify DeploymentStatus and RolloutStatus into single RolloutStatus
- Drop upgradeHeight gating (no reliable block height source)
- Introduce conditions-driven reconciliation pattern with
  RolloutInProgress as the coordination mechanism between
  reconciler (detects diffs) and planner (actions them)
- Document block height sourcing as open problem
- Migrate BlueGreen/HardFork to use unified RolloutStatus

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* design: incorporate Tide team review findings

- Add zero-value migration handling for existing CRDs without
  updateStrategy (treat empty Type as InPlace, log warning)
- Restore IncumbentRevision on RolloutStatus (needed by BlueGreen)
- Add stalled rollout escalation: RolloutInProgress reason transitions
  to Stalled after 10min, providing durable alerting signal
- Add design principles section: small condition vocabulary,
  RolloutStatus as real state machine, simultaneous rollout rationale

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>