From de98fa6bc4d2ce0a03ae33967084c0e404122627 Mon Sep 17 00:00:00 2001
From: David Yu
Date: Tue, 23 Jun 2026 13:54:18 -0700
Subject: [PATCH 1/4] manage/k8s: document decommission timing settings (--decommission-wait-interval, RequeueAfter)

Add a "Tune automatic decommission timing" section to the Kubernetes decommission guide explaining the re-check/requeue interval settings for both deployment modes:

- Operator: --decommission-wait-interval (default 8s), passed via the operator chart's additionalCmdFlags, which sets the Decommission controller's RequeueAfter (surfaced as the "next run in" log line).
- Helm sidecar: decommissionRequeueTimeout (10s) and decommissionAfter (60s).

Includes defaults, a worked helm example, how to read the interval from operator logs, and guidance for adjusting the values (recheck vs debounce, reallocation throughput is separate).

Ref: DOC-2270
Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../kubernetes/k-decommission-brokers.adoc | 56 +++++++++++++++++++
 1 file changed, 56 insertions(+)

diff --git a/modules/manage/pages/kubernetes/k-decommission-brokers.adoc b/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
index ee3310794a..f0042b7ff1 100644
--- a/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
+++ b/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
@@ -482,6 +482,8 @@ helm upgrade --install redpanda-controller redpanda/operator \
 +
 - `--additional-controllers="decommission"`: Enables the Decommission controller.
 - `rbac.createAdditionalControllerCRs=true`: Creates the required RBAC rules for the Redpanda Operator to monitor the StatefulSet and update PVCs and PVs.
++
+TIP: To change how often the Decommission controller re-checks the cluster for brokers that need decommissioning, pass the `--decommission-wait-interval` flag through `additionalCmdFlags`. See <>.
 
 .. Configure a Redpanda resource with seven Redpanda brokers:
 +
@@ -660,6 +662,60 @@ kubectl logs deployment/redpanda-controller --namespace
 
 You can repeat this procedure to continue to scale down.
 
+[[decommission-timing]]
+== Tune automatic decommission timing
+
+The <> polls the cluster on a regular interval to detect brokers that need to be decommissioned. The setting that controls this interval, and any debounce window before the decommissioner acts, depends on how the controller is deployed: as the Decommission controller inside the Redpanda Operator, or as the broker decommissioner sidecar in a Helm-only deployment.
+
+[cols="2,1,4"]
+|===
+| Setting | Default | Description
+
+| `--decommission-wait-interval` (Operator; set through `additionalCmdFlags`)
+| `8s`
+| Requeue interval (`RequeueAfter`) for the operator's Decommission controller: how often the controller re-checks the cluster for brokers that need decommissioning when a reconcile did not already schedule a sooner re-check.
+
+| `decommissionRequeueTimeout` (Helm sidecar; under `statefulset.sideCars.brokerDecommissioner`)
+| `10s`
+| How often the sidecar re-checks a cluster that already has a broker flagged for decommissioning.
+
+| `decommissionAfter` (Helm sidecar; under `statefulset.sideCars.brokerDecommissioner`)
+| `60s`
+| How long a broker must continuously meet the decommission conditions before the sidecar acts. This debounce window prevents acting on transient conditions, such as a broker that is briefly unreachable during a restart.
+|===
+
+=== Set the interval for the Operator
+
+The operator's Decommission controller does not expose its interval as a dedicated Helm value. Instead, pass the `--decommission-wait-interval` flag through `additionalCmdFlags` when you install or upgrade the operator:
+
+[,bash,subs="attributes+"]
+----
+helm upgrade --install redpanda-controller redpanda/operator \
+  --namespace \
+  --create-namespace \
+  --set image.tag={latest-operator-version} \
+  --set "additionalCmdFlags={--additional-controllers=decommission,--decommission-wait-interval=30s}" \
+  --set rbac.createAdditionalControllerCRs=true
+----
+
+The flag accepts any Go duration string, such as `8s`, `30s`, or `2m`. The default is `8s`. After each reconcile, the controller logs the next scheduled run, and the `next run in` value reflects the configured interval:
+
+[.no-copy]
+----
+{"level":"info","logger":"DecommissionReconciler.Reconcile","msg":"successful reconciliation finished in 1m0s, next run in 30s","controller":"statefulset", ...}
+----
+
+=== Set the intervals for Helm
+
+For a Helm-only deployment, set the sidecar values directly under `statefulset.sideCars.brokerDecommissioner`. For a full example, see <>.
+
+=== Guidance for adjusting the intervals
+
+* These settings control only how often the decommissioner *re-checks* for work and how long it waits before acting. They do not change how fast partition data is reallocated once a decommission begins. Reallocation throughput is governed by xref:reference:cluster-properties.adoc#raft_learner_recovery_rate[`raft_learner_recovery_rate`] and xref:reference:tunable-properties.adoc#partition_autobalancing_concurrent_moves[`partition_autobalancing_concurrent_moves`].
+* Increase the re-check interval to reduce reconcile frequency, and the associated log and Admin API traffic, on large or stable clusters. Decrease it for faster detection of brokers that need decommissioning.
+* For Helm (sidecar) deployments, keep `decommissionRequeueTimeout` smaller than `decommissionAfter` -- ideally well below it -- so the sidecar re-evaluates the cluster at least once within the debounce window. If the re-check interval is close to or larger than `decommissionAfter`, the decommissioner may wait up to one additional interval before acting. The Kubernetes controller-runtime work queue also adds a small amount of jitter.
+* A single operator reconcile can take up to about a minute because the Decommission controller verifies that cluster health is stable before it commits to a decommission. This is expected, and is independent of the `--decommission-wait-interval` value.
+
 == Troubleshooting
 
 If the decommissioning process is not making progress, investigate the following potential issues:

From 9c813d26e3d29f7afda26a2c154eb22cd7182d49 Mon Sep 17 00:00:00 2001
From: David Yu
Date: Tue, 23 Jun 2026 14:21:13 -0700
Subject: [PATCH 2/4] manage/k8s: clarify decommission interval is periodic re-check, not scale-in gate

Per EKS end-to-end testing: a user-initiated scale-in (reducing statefulset.replicas) is detected from a StatefulSet watch event and acted on promptly (~seconds) regardless of --decommission-wait-interval. The interval governs the periodic re-check cadence for conditions that arise without a triggering event (for example, a broker that becomes unreachable), so raising it does not delay routine scale-ins.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 modules/manage/pages/kubernetes/k-decommission-brokers.adoc | 1 +
 1 file changed, 1 insertion(+)

diff --git a/modules/manage/pages/kubernetes/k-decommission-brokers.adoc b/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
index f0042b7ff1..a2a1615606 100644
--- a/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
+++ b/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
@@ -712,6 +712,7 @@ For a Helm-only deployment, set the sidecar values directly under `statefulset.s
 === Guidance for adjusting the intervals
 
 * These settings control only how often the decommissioner *re-checks* for work and how long it waits before acting. They do not change how fast partition data is reallocated once a decommission begins. Reallocation throughput is governed by xref:reference:cluster-properties.adoc#raft_learner_recovery_rate[`raft_learner_recovery_rate`] and xref:reference:tunable-properties.adoc#partition_autobalancing_concurrent_moves[`partition_autobalancing_concurrent_moves`].
+* This interval is the *periodic* re-check cadence. A scale-in that you initiate by reducing `statefulset.replicas` is detected from a StatefulSet watch event and acted on promptly, so raising the interval does not delay a routine scale-in. The interval primarily determines how quickly the controller notices conditions that arise without a triggering event, such as a broker that becomes unreachable.
 * Increase the re-check interval to reduce reconcile frequency, and the associated log and Admin API traffic, on large or stable clusters. Decrease it for faster detection of brokers that need decommissioning.
 * For Helm (sidecar) deployments, keep `decommissionRequeueTimeout` smaller than `decommissionAfter` -- ideally well below it -- so the sidecar re-evaluates the cluster at least once within the debounce window. If the re-check interval is close to or larger than `decommissionAfter`, the decommissioner may wait up to one additional interval before acting. The Kubernetes controller-runtime work queue also adds a small amount of jitter.
 * A single operator reconcile can take up to about a minute because the Decommission controller verifies that cluster health is stable before it commits to a decommission. This is expected, and is independent of the `--decommission-wait-interval` value.
From 54eae0d4d1c31c675805df24b26da4f706c6ba5c Mon Sep 17 00:00:00 2001
From: David Yu
Date: Tue, 23 Jun 2026 22:07:48 -0700
Subject: [PATCH 3/4] =?UTF-8?q?manage/k8s:=20address=20review=20=E2=80=94?=
 =?UTF-8?q?=20shell-correct=20helm=20--set,=20lowercase=20operator,=20re-c?=
 =?UTF-8?q?heck=20wording?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

- Align both additionalCmdFlags examples to the shell-correct form (--set "additionalCmdFlags={...}"): outer-quoted to protect {}/comma from brace expansion, no pointless inner quotes. Verified the rendered list with `helm template`: ["--additional-controllers=decommission","--decommission-wait-interval=30s"].
- Lowercase bare-noun "operator" (heading + table label) per docs convention.
- Intro: "polls ... to detect" -> "re-checks ... for" to match the event-driven note.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../pages/kubernetes/k-decommission-brokers.adoc | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/modules/manage/pages/kubernetes/k-decommission-brokers.adoc b/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
index a2a1615606..f39514a8c5 100644
--- a/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
+++ b/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
@@ -476,11 +476,11 @@ helm upgrade --install redpanda-controller redpanda/operator \
   --namespace \
   --set image.tag={latest-operator-version} \
   --create-namespace \
-  --set additionalCmdFlags={--additional-controllers="decommission"} \
+  --set "additionalCmdFlags={--additional-controllers=decommission}" \
   --set rbac.createAdditionalControllerCRs=true
 ----
 +
-- `--additional-controllers="decommission"`: Enables the Decommission controller.
+- `--additional-controllers=decommission`: Enables the Decommission controller.
 - `rbac.createAdditionalControllerCRs=true`: Creates the required RBAC rules for the Redpanda Operator to monitor the StatefulSet and update PVCs and PVs.
 +
 TIP: To change how often the Decommission controller re-checks the cluster for brokers that need decommissioning, pass the `--decommission-wait-interval` flag through `additionalCmdFlags`. See <>.
@@ -582,7 +582,7 @@ When scaling in (removing brokers), remove only one broker at a time. If you red
 Operator::
 +
 --
-The Decommission controller is already running in the Redpanda Operator (enabled in the earlier `additionalCmdFlags={--additional-controllers="decommission"}` step). To trigger a decommission, change only the StatefulSet replica count on the Redpanda resource. Do not add `sideCars.brokerDecommissioner` here, as that field is not part of the Redpanda CRD and is silently dropped when the resource is applied.
+The Decommission controller is already running in the Redpanda Operator (enabled in the earlier `additionalCmdFlags={--additional-controllers=decommission}` step). To trigger a decommission, change only the StatefulSet replica count on the Redpanda resource. Do not add `sideCars.brokerDecommissioner` here, as that field is not part of the Redpanda CRD and is silently dropped when the resource is applied.
 
 .`redpanda-cluster.yaml`
 [,yaml,lines=9]
@@ -665,13 +665,13 @@ You can repeat this procedure to continue to scale down.
 [[decommission-timing]]
 == Tune automatic decommission timing
 
-The <> polls the cluster on a regular interval to detect brokers that need to be decommissioned. The setting that controls this interval, and any debounce window before the decommissioner acts, depends on how the controller is deployed: as the Decommission controller inside the Redpanda Operator, or as the broker decommissioner sidecar in a Helm-only deployment.
+The <> re-checks the cluster on a regular interval for brokers that need to be decommissioned. The setting that controls this interval, and any debounce window before the decommissioner acts, depends on how the controller is deployed: as the Decommission controller inside the Redpanda Operator, or as the broker decommissioner sidecar in a Helm-only deployment.
 
 [cols="2,1,4"]
 |===
 | Setting | Default | Description
 
-| `--decommission-wait-interval` (Operator; set through `additionalCmdFlags`)
+| `--decommission-wait-interval` (operator; set through `additionalCmdFlags`)
 | `8s`
 | Requeue interval (`RequeueAfter`) for the operator's Decommission controller: how often the controller re-checks the cluster for brokers that need decommissioning when a reconcile did not already schedule a sooner re-check.
 
@@ -684,7 +684,7 @@ The <> polls the cluster on a regular interv
 | How long a broker must continuously meet the decommission conditions before the sidecar acts. This debounce window prevents acting on transient conditions, such as a broker that is briefly unreachable during a restart.
 |===
 
-=== Set the interval for the Operator
+=== Set the interval for the operator
 
 The operator's Decommission controller does not expose its interval as a dedicated Helm value. Instead, pass the `--decommission-wait-interval` flag through `additionalCmdFlags` when you install or upgrade the operator:
 

From 8052d9cecd0c1f1be9d43d047ae8c2c7b36208c7 Mon Sep 17 00:00:00 2001
From: David Yu
Date: Tue, 23 Jun 2026 22:10:46 -0700
Subject: [PATCH 4/4] manage/k8s: use capital "Operator" (Kubernetes convention) for the bare noun
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Per maintainer preference, capitalize bare-noun "Operator" page-wide (heading, table label, prose) -- reverts the earlier lowercasing. Chart path `redpanda/operator` and the `{latest-operator-version}` attribute stay lowercase.
Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../pages/kubernetes/k-decommission-brokers.adoc | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/modules/manage/pages/kubernetes/k-decommission-brokers.adoc b/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
index f39514a8c5..1a29891f2d 100644
--- a/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
+++ b/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
@@ -598,7 +598,7 @@ spec:
       replicas: 6 <1>
 ----
 +
-<1> `statefulset.replicas`: Reduce by one. The Decommission controller in the operator detects the change and decommissions the broker on the highest-ordinal Pod.
+<1> `statefulset.replicas`: Reduce by one. The Decommission controller in the Operator detects the change and decommissions the broker on the highest-ordinal Pod.
 
 ```bash
 kubectl apply -f redpanda-cluster.yaml --namespace
@@ -671,9 +671,9 @@ The <> re-checks the cluster on a regular in
 |===
 | Setting | Default | Description
 
-| `--decommission-wait-interval` (operator; set through `additionalCmdFlags`)
+| `--decommission-wait-interval` (Operator; set through `additionalCmdFlags`)
 | `8s`
-| Requeue interval (`RequeueAfter`) for the operator's Decommission controller: how often the controller re-checks the cluster for brokers that need decommissioning when a reconcile did not already schedule a sooner re-check.
+| Requeue interval (`RequeueAfter`) for the Operator's Decommission controller: how often the controller re-checks the cluster for brokers that need decommissioning when a reconcile did not already schedule a sooner re-check.
 
 | `decommissionRequeueTimeout` (Helm sidecar; under `statefulset.sideCars.brokerDecommissioner`)
 | `10s`
@@ -684,9 +684,9 @@ The <> re-checks the cluster on a regular in
 | How long a broker must continuously meet the decommission conditions before the sidecar acts. This debounce window prevents acting on transient conditions, such as a broker that is briefly unreachable during a restart.
 |===
 
-=== Set the interval for the operator
+=== Set the interval for the Operator
 
-The operator's Decommission controller does not expose its interval as a dedicated Helm value. Instead, pass the `--decommission-wait-interval` flag through `additionalCmdFlags` when you install or upgrade the operator:
+The Operator's Decommission controller does not expose its interval as a dedicated Helm value. Instead, pass the `--decommission-wait-interval` flag through `additionalCmdFlags` when you install or upgrade the Operator:
 
 [,bash,subs="attributes+"]
 ----
@@ -715,7 +715,7 @@ For a Helm-only deployment, set the sidecar values directly under `statefulset.s
 * This interval is the *periodic* re-check cadence. A scale-in that you initiate by reducing `statefulset.replicas` is detected from a StatefulSet watch event and acted on promptly, so raising the interval does not delay a routine scale-in. The interval primarily determines how quickly the controller notices conditions that arise without a triggering event, such as a broker that becomes unreachable.
 * Increase the re-check interval to reduce reconcile frequency, and the associated log and Admin API traffic, on large or stable clusters. Decrease it for faster detection of brokers that need decommissioning.
 * For Helm (sidecar) deployments, keep `decommissionRequeueTimeout` smaller than `decommissionAfter` -- ideally well below it -- so the sidecar re-evaluates the cluster at least once within the debounce window. If the re-check interval is close to or larger than `decommissionAfter`, the decommissioner may wait up to one additional interval before acting. The Kubernetes controller-runtime work queue also adds a small amount of jitter.
-* A single operator reconcile can take up to about a minute because the Decommission controller verifies that cluster health is stable before it commits to a decommission. This is expected, and is independent of the `--decommission-wait-interval` value.
+* A single Operator reconcile can take up to about a minute because the Decommission controller verifies that cluster health is stable before it commits to a decommission. This is expected, and is independent of the `--decommission-wait-interval` value.
 
 == Troubleshooting
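
The Operator flags that this series passes with `--set` could also be kept in a values file. The fragment below is a minimal sketch of that equivalent, not part of the patch. The file name `operator-values.yaml` is hypothetical; the keys and flag values are copied from the `--set` arguments in the series. `image.tag` is omitted because it depends on the `{latest-operator-version}` attribute.

```yaml
# operator-values.yaml (hypothetical file name, for illustration only)

# Mirrors: --set "additionalCmdFlags={--additional-controllers=decommission,--decommission-wait-interval=30s}"
additionalCmdFlags:
  - "--additional-controllers=decommission"   # enables the Decommission controller
  - "--decommission-wait-interval=30s"        # periodic re-check interval; the documented default is 8s

# Mirrors: --set rbac.createAdditionalControllerCRs=true
rbac:
  createAdditionalControllerCRs: true
```

Passing this file with Helm's `--values` flag in place of the two `--set` arguments should render the same two-element list that the third commit message reports from `helm template`. Because the braces and comma never reach the shell, the quoting concern that commit addresses does not arise.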