From 83256e76a2aeee7ab3fc6ce715ae34ffc1638f13 Mon Sep 17 00:00:00 2001
From: David Yu
Date: Wed, 24 Jun 2026 14:36:09 -0700
Subject: [PATCH] manage/k8s: document decommission timing (--decommission-wait-interval) [v/25.2 backport]

Backport of #1761 to v/25.2.

Adds the "Tune automatic decommission timing" section: --decommission-wait-interval (Operator, via additionalCmdFlags), decommissionRequeueTimeout / decommissionAfter (Helm sidecar), a TIP cross-reference, and shell-correct --set "additionalCmdFlags={...}" quoting. Omits the main-only Operator-example rewrite, since v/25.2's example uses the version-appropriate brokerDecommissioner sidecar.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../kubernetes/k-decommission-brokers.adoc | 61 ++++++++++++++++++-
 1 file changed, 59 insertions(+), 2 deletions(-)

diff --git a/modules/manage/pages/kubernetes/k-decommission-brokers.adoc b/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
index d8251819f1..0b2db95b74 100644
--- a/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
+++ b/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
@@ -466,12 +466,14 @@ helm upgrade --install redpanda-controller redpanda/operator \
   --namespace \
   --set image.tag={latest-operator-version} \
   --create-namespace \
-  --set additionalCmdFlags={--additional-controllers="decommission"} \
+  --set "additionalCmdFlags={--additional-controllers=decommission}" \
   --set rbac.createAdditionalControllerCRs=true
 ----
 +
-- `--additional-controllers="decommission"`: Enables the Decommission controller.
+- `--additional-controllers=decommission`: Enables the Decommission controller.
 - `rbac.createAdditionalControllerCRs=true`: Creates the required RBAC rules for the Redpanda Operator to monitor the StatefulSet and update PVCs and PVs.
++
+TIP: To change how often the Decommission controller re-checks the cluster for brokers that need decommissioning, pass the `--decommission-wait-interval` flag through `additionalCmdFlags`. See <>.
 
 .. Configure a Redpanda resource with seven Redpanda brokers:
 +
@@ -644,6 +646,61 @@ kubectl logs --namespace -c sidecars
 
 You can repeat this procedure to continue to scale down.
 
+[[decommission-timing]]
+== Tune automatic decommission timing
+
+The <> re-checks the cluster on a regular interval for brokers that need to be decommissioned. The setting that controls this interval, and any debounce window before the decommissioner acts, depends on how the controller is deployed: as the Decommission controller inside the Redpanda Operator, or as the broker decommissioner sidecar in a Helm-only deployment.
+
+[cols="2,1,4"]
+|===
+| Setting | Default | Description
+
+| `--decommission-wait-interval` (Operator; set through `additionalCmdFlags`)
+| `8s`
+| Requeue interval (`RequeueAfter`) for the Operator's Decommission controller: how often the controller re-checks the cluster for brokers that need decommissioning when a reconcile did not already schedule a sooner re-check.
+
+| `decommissionRequeueTimeout` (Helm sidecar; under `statefulset.sideCars.brokerDecommissioner`)
+| `10s`
+| How often the sidecar re-checks a cluster that already has a broker flagged for decommissioning.
+
+| `decommissionAfter` (Helm sidecar; under `statefulset.sideCars.brokerDecommissioner`)
+| `60s`
+| How long a broker must continuously meet the decommission conditions before the sidecar acts. This debounce window prevents acting on transient conditions, such as a broker that is briefly unreachable during a restart.
+|===
+
+=== Set the interval for the Operator
+
+The Operator's Decommission controller does not expose its interval as a dedicated Helm value. Instead, pass the `--decommission-wait-interval` flag through `additionalCmdFlags` when you install or upgrade the Operator:
+
+[,bash,subs="attributes+"]
+----
+helm upgrade --install redpanda-controller redpanda/operator \
+  --namespace \
+  --create-namespace \
+  --set image.tag={latest-operator-version} \
+  --set "additionalCmdFlags={--additional-controllers=decommission,--decommission-wait-interval=30s}" \
+  --set rbac.createAdditionalControllerCRs=true
+----
+
+The flag accepts any Go duration string, such as `8s`, `30s`, or `2m`. The default is `8s`. After each reconcile, the controller logs the next scheduled run, and the `next run in` value reflects the configured interval:
+
+[.no-copy]
+----
+{"level":"info","logger":"DecommissionReconciler.Reconcile","msg":"successful reconciliation finished in 1m0s, next run in 30s","controller":"statefulset", ...}
+----
+
+=== Set the intervals for Helm
+
+For a Helm-only deployment, set the sidecar values directly under `statefulset.sideCars.brokerDecommissioner`. For a full example, see <>.
+
+=== Guidance for adjusting the intervals
+
+* These settings control only how often the decommissioner *re-checks* for work and how long it waits before acting. They do not change how fast partition data is reallocated once a decommission begins. Reallocation throughput is governed by xref:reference:cluster-properties.adoc#raft_learner_recovery_rate[`raft_learner_recovery_rate`] and xref:reference:tunable-properties.adoc#partition_autobalancing_concurrent_moves[`partition_autobalancing_concurrent_moves`].
+* This interval is the *periodic* re-check cadence. A scale-in that you initiate by reducing `statefulset.replicas` is detected from a StatefulSet watch event and acted on promptly, so raising the interval does not delay a routine scale-in. The interval primarily determines how quickly the controller notices conditions that arise without a triggering event, such as a broker that becomes unreachable.
+* Increase the re-check interval to reduce reconcile frequency, and the associated log and Admin API traffic, on large or stable clusters. Decrease it for faster detection of brokers that need decommissioning.
+* For Helm (sidecar) deployments, keep `decommissionRequeueTimeout` smaller than `decommissionAfter` -- ideally well below it -- so the sidecar re-evaluates the cluster at least once within the debounce window. If the re-check interval is close to or larger than `decommissionAfter`, the decommissioner may wait up to one additional interval before acting. The Kubernetes controller-runtime work queue also adds a small amount of jitter.
+* A single Operator reconcile can take up to about a minute because the Decommission controller verifies that cluster health is stable before it commits to a decommission. This is expected, and is independent of the `--decommission-wait-interval` value.
+
 == Troubleshooting
 
 If the decommissioning process is not making progress, investigate the following potential issues: