[#9292] Initial Elasticsearch changes for new pause-orchestration annotation #9330
dtuck9 wants to merge 12 commits into elastic:main
✅ Snyk checks have passed. No issues have been found so far.
buildkite test this -f p=gke,t=TestPauseOrchestration*
buildkite test this -f p=gke
pebrc
left a comment
Tested manually on a kind cluster against both ES 8.17.0 and 9.3.3. The core design is sound — the pause check is well-placed in the reconciliation flow (after shared resources/certs but before node spec changes), housekeeping correctly continues while paused (certs, secrets, health monitoring all verified), and scale-out is properly held back. Resuming applies pending changes immediately.
Found three bugs during testing:
- Phase never transitions to `Paused`: the controller's post-reconcile logic always overwrites it to `Ready`.
- Pod deletion while paused leaves pods permanently not-ready: the pre-stop hook registers a RESTART shutdown, but `reconcileCriticalSteps` never clears it. ES keeps the readiness port closed until the shutdown entry is explicitly deleted (this is by design in ES: `ReadinessService` checks for shutdown entry existence, not status). Reproduced on both 8.17.0 and 9.3.3. This is the most critical issue since pod disruptions during maintenance windows are exactly the scenario this feature is meant to handle.
- `hasPendingSpecChanges` false positive: reports pending changes immediately after enabling the annotation even with no spec change, due to server-side defaulting drift in the actual-vs-computed StatefulSet comparison.
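To illustrate the third bug, here is a minimal sketch of why a direct actual-vs-computed comparison can report drift when nothing changed, and one possible way around it. Everything in it is hypothetical: `stsSpec` is a heavily reduced stand-in for a StatefulSet spec, and `specHash`, `naiveDiff`, and `hashDiff` are illustrative names, not functions from this PR. Comparing against a hash recorded at apply time is just one option and is not necessarily the fix adopted here.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// stsSpec is a hypothetical, reduced stand-in for a StatefulSet spec.
type stsSpec struct {
	Replicas int32  `json:"replicas"`
	Image    string `json:"image"`
	// DNSPolicy is left empty by the operator and filled in by the API server.
	DNSPolicy string `json:"dnsPolicy,omitempty"`
}

// specHash returns a stable hash of the spec as the operator computed it.
func specHash(s stsSpec) string {
	// Marshalling this plain struct cannot fail, so the error is ignored.
	b, _ := json.Marshal(s)
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// naiveDiff compares the live object with the computed one directly.
// Server-side defaults on the live object make it report drift.
func naiveDiff(live, computed stsSpec) bool {
	return live != computed
}

// hashDiff compares the hash recorded when the spec was last applied with
// the hash of the newly computed spec, so server-side defaults are ignored.
func hashDiff(recordedHash string, computed stsSpec) bool {
	return recordedHash != specHash(computed)
}

func main() {
	computed := stsSpec{Replicas: 3, Image: "es:8.17.0"}
	recorded := specHash(computed) // stored at apply time, e.g. in a label

	// The API server has filled in a default the operator never set.
	live := computed
	live.DNSPolicy = "ClusterFirst"

	fmt.Println("naive comparison reports pending changes:", naiveDiff(live, computed))
	fmt.Println("hash comparison reports pending changes:", hashDiff(recorded, computed))

	// A real spec change is still detected.
	scaled := computed
	scaled.Replicas = 5
	fmt.Println("hash comparison after scale-up:", hashDiff(recorded, scaled))
}
```

The point of the sketch is that only fields the operator itself computes take part in the comparison, so defaults added by the API server cannot trigger a pending-changes warning.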
…est, fix handling of deleted pod
…re are pending changes
…, make sure version supports node shutdown, check for satisfied expectations
pebrc
left a comment
Tested manually on a local k3d cluster against ES 8.17.0: pause with no changes, upscale while paused (blocked correctly), resume (upscale proceeds), pod deletion while paused (readiness recovers via shutdown cleanup), and managed=false precedence. Phases, conditions, and k8s events all behave as expected at every step.
A few minor suggestions inline — nothing blocking.
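The shutdown cleanup exercised in the pod-deletion test can be pictured roughly as follows. This is a sketch under assumptions, not the code in this PR: `shutdownAPI`, `fakeES`, and `clearRestartShutdowns` are hypothetical names, the in-memory fake stands in for the Elasticsearch node shutdown API, and keying entries by pod name is a simplification.

```go
package main

import (
	"fmt"
	"sort"
)

// shutdownAPI is a hypothetical, minimal view of the node shutdown API.
type shutdownAPI interface {
	List() map[string]string // node -> shutdown type
	Delete(node string) error
}

// fakeES is an in-memory stand-in used only for this illustration.
type fakeES struct{ entries map[string]string }

// List returns a copy so callers can delete entries while iterating.
func (f *fakeES) List() map[string]string {
	out := make(map[string]string, len(f.entries))
	for k, v := range f.entries {
		out[k] = v
	}
	return out
}

func (f *fakeES) Delete(node string) error {
	delete(f.entries, node)
	return nil
}

// clearRestartShutdowns deletes RESTART shutdown entries for nodes whose pod
// is running again, so the readiness port can reopen. Entries of other types,
// and entries for pods that are not back yet, are left untouched.
func clearRestartShutdowns(api shutdownAPI, runningPods map[string]bool) ([]string, error) {
	var cleared []string
	for node, typ := range api.List() {
		if typ != "RESTART" || !runningPods[node] {
			continue
		}
		if err := api.Delete(node); err != nil {
			return cleared, err
		}
		cleared = append(cleared, node)
	}
	sort.Strings(cleared) // map iteration order is random; sort for stable output
	return cleared, nil
}

func main() {
	es := &fakeES{entries: map[string]string{
		"es-0": "RESTART", // pod was deleted and has come back
		"es-1": "RESTART", // pod is not running yet
	}}
	cleared, err := clearRestartShutdowns(es, map[string]bool{"es-0": true})
	fmt.Println("cleared:", cleared, "err:", err)
	fmt.Println("remaining entries:", len(es.entries))
}
```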
…ate driver_test tests
What this PR does
Introduces a new `eck.k8s.elastic.co/pause-orchestration` annotation that pauses spec-driven orchestration (rolling upgrades, StatefulSet spec changes, scale up/down) while keeping housekeeping running (certificate rotation, unicast hosts, user/secret reconciliation, health monitoring).

This is distinct from the existing `eck.k8s.elastic.co/managed=false` annotation, which skips reconciliation entirely and is marked as `Deprecated` with this change. The new annotation allows operators to freeze cluster topology during maintenance windows without losing certificate rotation or health monitoring.

Changes
- `pkg/controller/common/unmanaged.go`: New `PauseOrchestrationAnnotation` constant and `IsOrchestrationPaused` helper. `ManagedAnnotation` and `IsUnmanaged` are marked deprecated in favour of the new annotation (tracked in "Deprecate managed=false and remove legacy pause annotation" #9295).
- `pkg/controller/elasticsearch/driver/stateful/driver.go`: When `pause-orchestration: "true"` is set, `Reconcile` skips `reconcileNodeSpecs` and calls `reconcileCriticalSteps` instead. `reconcileCriticalSteps` reconciles the PDB, detects pending spec changes via `hasPendingSpecChanges`, and sets the `OrchestrationPaused` condition + phase accordingly. A warning event is emitted if spec changes are pending while paused.
- `pkg/apis/elasticsearch/v1/status.go`: New `ElasticsearchOrchestrationPaused` phase (`"Paused"`) and `OrchestrationPaused` condition type.
- `pkg/controller/common/events/events.go`: New `EventReasonPaused` and `EventActionPendingOrchestrationChanges` constants.

Testing
- `IsOrchestrationPaused`, `hasPendingSpecChanges`, `hasSpecDiff`, and `reconcileCriticalSteps` (condition wiring, event emission, phase, error propagation).
- `TestPauseOrchestration` cycles through: create → enable annotation → update spec while paused (verifies scale-out is held back) → disable annotation (verifies scale-out applies).
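The annotation handling described under Changes can be sketched as follows. The annotation names come from the description above; everything else is an assumption made for illustration: the helpers take a plain annotation map rather than the operator's real signatures, only the literal values `"true"` and `"false"` are recognised, and `reconcileMode` is a hypothetical function that only names the path `Reconcile` would take. The precedence of `managed=false` over the pause annotation follows the behaviour reported in the review.

```go
package main

import "fmt"

const (
	// Annotation names as given in the PR description.
	PauseOrchestrationAnnotation = "eck.k8s.elastic.co/pause-orchestration"
	ManagedAnnotation            = "eck.k8s.elastic.co/managed"
)

// IsOrchestrationPaused reports whether the pause annotation is set to "true".
// Simplified signature: the real helper may take different arguments.
func IsOrchestrationPaused(annotations map[string]string) bool {
	return annotations[PauseOrchestrationAnnotation] == "true"
}

// IsUnmanaged reports whether the (now deprecated) managed annotation is "false".
func IsUnmanaged(annotations map[string]string) bool {
	return annotations[ManagedAnnotation] == "false"
}

// reconcileMode is a hypothetical helper that names the path taken:
//   - "skip":                reconciliation is skipped entirely (managed=false)
//   - "critical-steps-only": housekeeping plus reconcileCriticalSteps
//   - "full":                housekeeping plus reconcileNodeSpecs
func reconcileMode(annotations map[string]string) string {
	switch {
	case IsUnmanaged(annotations):
		return "skip"
	case IsOrchestrationPaused(annotations):
		return "critical-steps-only"
	default:
		return "full"
	}
}

func main() {
	fmt.Println(reconcileMode(map[string]string{}))
	fmt.Println(reconcileMode(map[string]string{PauseOrchestrationAnnotation: "true"}))
	fmt.Println(reconcileMode(map[string]string{
		PauseOrchestrationAnnotation: "true",
		ManagedAnnotation:            "false",
	}))
}
```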
Relates to
Closes #9292
AI Usage Disclaimer
Claude used for unit test case generation.