[#9292] Initial Elasticsearch changes for new pause-orchestration annotation #9330

Open
dtuck9 wants to merge 12 commits into elastic:main from dtuck9:add-pause-orchestration-annotation

@dtuck9
Contributor

dtuck9 commented Apr 8, 2026

What this PR does

Introduces a new eck.k8s.elastic.co/pause-orchestration annotation that pauses spec-driven orchestration (rolling upgrades, StatefulSet spec changes, scale up/down) while keeping housekeeping running (certificate rotation, unicast hosts, user/secret reconciliation, health monitoring).

This is distinct from the existing eck.k8s.elastic.co/managed=false annotation, which skips reconciliation entirely and is marked as deprecated by this change. The new annotation lets operators freeze cluster topology during maintenance windows without losing certificate rotation or health monitoring.
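The difference between the two annotations can be sketched roughly as follows. This is a simplified illustration, not the code from the PR: the annotation keys and the IsOrchestrationPaused and IsUnmanaged names come from the description above, while the map-based signatures, the boolean parsing, the Plan type, the PlanFor function, and the rule that managed=false wins when both annotations are set are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strconv"
)

const (
	// Annotation keys as named in the PR description.
	PauseOrchestrationAnnotation = "eck.k8s.elastic.co/pause-orchestration"
	ManagedAnnotation            = "eck.k8s.elastic.co/managed"
)

// IsOrchestrationPaused reports whether the pause annotation parses as true.
// Assumption: the real helper may take a different argument type and may
// parse the value differently.
func IsOrchestrationPaused(annotations map[string]string) bool {
	paused, err := strconv.ParseBool(annotations[PauseOrchestrationAnnotation])
	return err == nil && paused
}

// IsUnmanaged reports whether the (deprecated) managed annotation parses as false.
func IsUnmanaged(annotations map[string]string) bool {
	managed, err := strconv.ParseBool(annotations[ManagedAnnotation])
	return err == nil && !managed
}

// Plan is a hypothetical summary of what a reconciliation would do.
type Plan struct {
	Housekeeping  bool // certificate rotation, secrets, health monitoring
	Orchestration bool // rolling upgrades, StatefulSet changes, scaling
}

// PlanFor is a hypothetical helper. Assumption: managed=false takes
// precedence because it skips reconciliation entirely.
func PlanFor(annotations map[string]string) Plan {
	switch {
	case IsUnmanaged(annotations):
		return Plan{}
	case IsOrchestrationPaused(annotations):
		return Plan{Housekeeping: true}
	default:
		return Plan{Housekeeping: true, Orchestration: true}
	}
}

func main() {
	paused := map[string]string{PauseOrchestrationAnnotation: "true"}
	fmt.Printf("%+v\n", PlanFor(paused))
}
```

In this model a paused cluster still gets housekeeping, whereas an unmanaged cluster gets nothing at all.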

Changes

  • pkg/controller/common/unmanaged.go — New PauseOrchestrationAnnotation
    constant and IsOrchestrationPaused helper. ManagedAnnotation and IsUnmanaged
    are marked deprecated in favour of the new annotation (tracked in #9295, "Deprecate managed=false and remove legacy pause annotation").

  • pkg/controller/elasticsearch/driver/stateful/driver.go — When
    pause-orchestration: "true" is set, Reconcile skips reconcileNodeSpecs and
    calls reconcileCriticalSteps instead. reconcileCriticalSteps reconciles the PDB,
    detects pending spec changes via hasPendingSpecChanges, and sets the
    OrchestrationPaused condition + phase accordingly. A warning event is emitted if
    spec changes are pending while paused.

  • pkg/apis/elasticsearch/v1/status.go — New ElasticsearchOrchestrationPaused
    phase ("Paused") and OrchestrationPaused condition type.

  • pkg/controller/common/events/events.go — New EventReasonPaused and
    EventActionPendingOrchestrationChanges constants.
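The control flow described for driver.go might look roughly like the toy model below. It is a sketch under stated assumptions rather than the PR's implementation: the Driver and Status types, the step names, and the event text are invented for illustration, and "Ready" stands in for whatever phase a normal reconciliation would report. Only the "Paused" phase, the OrchestrationPaused condition, and the function names Reconcile and reconcileCriticalSteps come from the description.

```go
package main

import "fmt"

const (
	PhaseReady                   = "Ready"  // assumption: simplified non-paused phase
	PhasePaused                  = "Paused" // phase value from the PR description
	ConditionOrchestrationPaused = "OrchestrationPaused"
)

// Status is a hypothetical record of what one reconciliation did.
type Status struct {
	Phase      string
	Conditions map[string]bool
	Events     []string
	Steps      []string
}

// Driver is a hypothetical stand-in for the stateful driver's inputs.
type Driver struct {
	Paused             bool // pause-orchestration annotation is "true"
	PendingSpecChanges bool // result of a hasPendingSpecChanges-style check
}

// Reconcile always runs housekeeping, then either the critical steps
// (when paused) or the node spec reconciliation.
func (d Driver) Reconcile() Status {
	st := Status{Phase: PhaseReady, Conditions: map[string]bool{}}
	st.Steps = append(st.Steps, "certificates", "users and secrets")
	if d.Paused {
		d.reconcileCriticalSteps(&st)
		return st
	}
	st.Conditions[ConditionOrchestrationPaused] = false
	st.Steps = append(st.Steps, "node specs")
	return st
}

// reconcileCriticalSteps reconciles the PDB, sets the condition and phase,
// and records a warning if spec changes are waiting.
func (d Driver) reconcileCriticalSteps(st *Status) {
	st.Steps = append(st.Steps, "pod disruption budget")
	st.Conditions[ConditionOrchestrationPaused] = true
	st.Phase = PhasePaused
	if d.PendingSpecChanges {
		st.Events = append(st.Events,
			"Warning: spec changes are pending while orchestration is paused")
	}
}

func main() {
	st := Driver{Paused: true, PendingSpecChanges: true}.Reconcile()
	fmt.Println(st.Phase, st.Steps, st.Events)
}
```

The point of the shape is that the pause branch replaces only the node spec step; everything before it runs in both modes.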

Testing

  • Unit tests for IsOrchestrationPaused, hasPendingSpecChanges, hasSpecDiff, and
    reconcileCriticalSteps (condition wiring, event emission, phase, error propagation).
  • E2e test TestPauseOrchestration cycles through: create → enable annotation → update
    spec while paused (verifies scale-out is held back) → disable annotation (verifies
    scale-out applies).
  • Full run of e2e tests against GKE
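The sequence exercised by TestPauseOrchestration can be modelled in a few lines. The Cluster type, the RunCycle function, and the node counts 3 and 4 are hypothetical and chosen only for illustration; the real e2e test drives an actual cluster rather than an in-memory struct.

```go
package main

import "fmt"

// Cluster is a hypothetical in-memory model of a paused or unpaused cluster.
type Cluster struct {
	Paused       bool
	SpecNodes    int // node count requested in the spec
	RunningNodes int // node count actually rolled out
}

// Reconcile applies the spec only when orchestration is not paused.
func (c *Cluster) Reconcile() {
	if c.Paused {
		return
	}
	c.RunningNodes = c.SpecNodes
}

// RunCycle walks through create, pause, scale-out while paused, and resume,
// returning the running node count observed after each step.
func RunCycle() []int {
	c := &Cluster{SpecNodes: 3}
	var observed []int
	step := func() {
		c.Reconcile()
		observed = append(observed, c.RunningNodes)
	}

	step() // create

	c.Paused = true
	step() // enable annotation

	c.SpecNodes = 4
	step() // update spec while paused: scale-out is held back

	c.Paused = false
	step() // disable annotation: scale-out applies

	return observed
}

func main() {
	fmt.Println(RunCycle())
}
```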

Relates to

Closes #9292

AI Usage Disclaimer

Claude used for unit test case generation.

@prodsecmachine
Collaborator

prodsecmachine commented Apr 8, 2026

Snyk checks have passed. No issues have been found so far.

| Status | Scan Engine | Critical | High | Medium | Low | Total (0) |
| --- | --- | --- | --- | --- | --- | --- |
| | Open Source Security | 0 | 0 | 0 | 0 | 0 issues |
| | Licenses | 0 | 0 | 0 | 0 | 0 issues |

💻 Catch issues earlier using the plugins for VS Code, JetBrains IDEs, Visual Studio, and Eclipse.

botelastic bot added the triage label Apr 8, 2026
dtuck9 added the >feature and v3.5.0 (next) labels Apr 9, 2026
botelastic bot removed the triage label Apr 9, 2026
dtuck9 force-pushed the add-pause-orchestration-annotation branch from 915709b to ac32abe on April 10, 2026 22:59
@dtuck9
Contributor Author

dtuck9 commented Apr 10, 2026

buildkite test this -f p=gke,t=TestPauseOrchestration*

dtuck9 force-pushed the add-pause-orchestration-annotation branch 5 times, most recently from 9370285 to 9f96971 on April 13, 2026 22:35
@dtuck9
Contributor Author

dtuck9 commented Apr 13, 2026

buildkite test this -f p=gke,t=TestPauseOrchestration*

Comment thread test/e2e/es/pause_orchestration_test.go
dtuck9 force-pushed the add-pause-orchestration-annotation branch from 9f96971 to 9f29c96 on April 13, 2026 23:07
Comment thread test/e2e/test/utils.go
dtuck9 force-pushed the add-pause-orchestration-annotation branch 5 times, most recently from 66db300 to 052400f on April 14, 2026 20:48
@dtuck9
Contributor Author

dtuck9 commented Apr 14, 2026

buildkite test this -f p=gke,t=TestPauseOrchestration*

@dtuck9
Contributor Author

dtuck9 commented Apr 14, 2026

buildkite test this -f p=gke

Collaborator

pebrc left a comment

Tested manually on a kind cluster against both ES 8.17.0 and 9.3.3. The core design is sound — the pause check is well-placed in the reconciliation flow (after shared resources/certs but before node spec changes), housekeeping correctly continues while paused (certs, secrets, health monitoring all verified), and scale-out is properly held back. Resuming applies pending changes immediately.

Found three bugs during testing:

  1. Phase never transitions to Paused — the controller's post-reconcile logic always overwrites it to Ready.
  2. Pod deletion while paused leaves pods permanently not-ready — the pre-stop hook registers a RESTART shutdown, but reconcileCriticalSteps never clears it. ES keeps the readiness port closed until the shutdown entry is explicitly deleted (this is by design in ES — ReadinessService checks for shutdown entry existence, not status). Reproduced on both 8.17.0 and 9.3.3. This is the most critical issue since pod disruptions during maintenance windows are exactly the scenario this feature is meant to handle.
  3. hasPendingSpecChanges false positive — reports pending changes immediately after enabling the annotation even with no spec change, due to server-side defaulting drift in the actual-vs-computed StatefulSet comparison.
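The third bug is a general hazard when diffing a stored Kubernetes object against a freshly computed one: the API server fills in defaults, so the stored object no longer equals what was submitted. The sketch below reproduces that effect and shows one common way around it, which is to compare a hash of the computed spec recorded at apply time. This is an illustration of the failure mode only; it is not claimed to be the fix adopted in this PR. The types, the annotation key example.invalid/spec-hash, and the apply function are hypothetical; the only real-world detail relied on is that StatefulSets default podManagementPolicy to OrderedReady.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"reflect"
)

// StatefulSetSpec is a hypothetical, heavily reduced spec.
type StatefulSetSpec struct {
	Replicas            int32  `json:"replicas"`
	Image               string `json:"image"`
	PodManagementPolicy string `json:"podManagementPolicy,omitempty"`
}

// StatefulSet is a hypothetical stored object.
type StatefulSet struct {
	Annotations map[string]string
	Spec        StatefulSetSpec
}

// specHashAnnotation is an invented key used only in this example.
const specHashAnnotation = "example.invalid/spec-hash"

// hashSpec hashes the JSON form of the computed spec.
func hashSpec(s StatefulSetSpec) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err) // cannot happen for this plain struct
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:])
}

// apply simulates submitting the computed spec: the hash of what was
// submitted is recorded, then server-side defaulting mutates the stored copy.
func apply(computed StatefulSetSpec) StatefulSet {
	stored := computed
	if stored.PodManagementPolicy == "" {
		stored.PodManagementPolicy = "OrderedReady"
	}
	return StatefulSet{
		Annotations: map[string]string{specHashAnnotation: hashSpec(computed)},
		Spec:        stored,
	}
}

// naiveHasDiff compares actual against computed and trips over defaults.
func naiveHasDiff(actual StatefulSet, computed StatefulSetSpec) bool {
	return !reflect.DeepEqual(actual.Spec, computed)
}

// hashHasDiff compares computed-at-apply-time against computed-now.
func hashHasDiff(actual StatefulSet, computed StatefulSetSpec) bool {
	return actual.Annotations[specHashAnnotation] != hashSpec(computed)
}

func main() {
	computed := StatefulSetSpec{Replicas: 3, Image: "elasticsearch:8.17.0"}
	actual := apply(computed)
	fmt.Println("naive diff:", naiveHasDiff(actual, computed))
	fmt.Println("hash diff: ", hashHasDiff(actual, computed))
}
```

With no spec change, the naive comparison reports a difference while the hash comparison does not; a real change such as a different replica count is still detected by the hash.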

Comment thread pkg/controller/elasticsearch/driver/stateful/driver.go
Comment thread pkg/controller/elasticsearch/driver/stateful/driver.go Outdated
Comment thread pkg/controller/elasticsearch/driver/stateful/driver.go Outdated
Comment thread pkg/controller/elasticsearch/driver/stateful/driver.go
Comment thread test/e2e/test/utils.go Outdated
dtuck9 force-pushed the add-pause-orchestration-annotation branch from 5c0f8e4 to 68647bb on April 20, 2026 16:36
Comment thread pkg/controller/elasticsearch/driver/stateful/driver.go Outdated
Comment thread pkg/controller/elasticsearch/driver/stateful/driver.go Outdated
Comment thread pkg/controller/elasticsearch/driver/stateful/driver.go Outdated
dtuck9 added 2 commits April 20, 2026 11:17
…, make sure version supports node shutdown, check for satisfied expectations
@dtuck9
Contributor Author

dtuck9 commented Apr 20, 2026

buildkite test this -f p=gke,t=TestPauseOrchestration*

@dtuck9
Contributor Author

dtuck9 commented Apr 20, 2026

buildkite test this -f p=gke,t=TestPauseOrchestration*

Collaborator

pebrc left a comment

Tested manually on a local k3d cluster against ES 8.17.0: pause with no changes, upscale while paused (blocked correctly), resume (upscale proceeds), pod deletion while paused (readiness recovers via shutdown cleanup), and managed=false precedence. Phases, conditions, and k8s events all behave as expected at every step.

A few minor suggestions inline — nothing blocking.

Comment thread pkg/controller/elasticsearch/driver/stateful/driver_test.go
Comment thread pkg/controller/elasticsearch/driver/stateful/driver.go Outdated
Comment thread pkg/controller/elasticsearch/driver/stateful/driver.go Outdated
dtuck9 requested review from a team and pebrc on April 21, 2026 17:45
Labels

>feature Adds or discusses adding a feature to the product v3.5.0 (next)

Development

Successfully merging this pull request may close these issues.

Pause orchestration: Elasticsearch implementation

3 participants