From d9a2120214ad46d2642cb5d302cf121ba599b4cd Mon Sep 17 00:00:00 2001
From: David Yu
Date: Wed, 24 Jun 2026 06:40:22 -0700
Subject: [PATCH 1/3] manage/k8s: document decommission timing settings
 (--decommission-wait-interval, RequeueAfter)

Backport of #1761 to v/25.2. Adds the "Tune automatic decommission
timing" section (--decommission-wait-interval for the Operator;
decommissionRequeueTimeout and decommissionAfter for the Helm sidecar),
a TIP cross-reference, and the shell-correct
--set "additionalCmdFlags={...}" quoting.

The Operator-example rewrite from #1761 is main-only and intentionally
omitted: v/25.2's Operator example uses the version-appropriate
brokerDecommissioner sidecar, so backporting that paragraph/callout
would contradict the example.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../kubernetes/k-decommission-brokers.adoc    | 61 ++++++++++++++++++-
 1 file changed, 59 insertions(+), 2 deletions(-)

diff --git a/modules/manage/pages/kubernetes/k-decommission-brokers.adoc b/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
index d8251819f1..0b2db95b74 100644
--- a/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
+++ b/modules/manage/pages/kubernetes/k-decommission-brokers.adoc
@@ -466,12 +466,14 @@ helm upgrade --install redpanda-controller redpanda/operator \
   --namespace \
   --set image.tag={latest-operator-version} \
   --create-namespace \
-  --set additionalCmdFlags={--additional-controllers="decommission"} \
+  --set "additionalCmdFlags={--additional-controllers=decommission}" \
   --set rbac.createAdditionalControllerCRs=true
 ----
 +
-- `--additional-controllers="decommission"`: Enables the Decommission controller.
+- `--additional-controllers=decommission`: Enables the Decommission controller.
 - `rbac.createAdditionalControllerCRs=true`: Creates the required RBAC rules for the Redpanda Operator to monitor the StatefulSet and update PVCs and PVs.
++
+TIP: To change how often the Decommission controller re-checks the cluster for brokers that need decommissioning, pass the `--decommission-wait-interval` flag through `additionalCmdFlags`. See <>.
 
 .. Configure a Redpanda resource with seven Redpanda brokers:
 +
@@ -644,6 +646,61 @@ kubectl logs --namespace -c sidecars
 
 You can repeat this procedure to continue to scale down.
 
+[[decommission-timing]]
+== Tune automatic decommission timing
+
+The <> re-checks the cluster on a regular interval for brokers that need to be decommissioned. The setting that controls this interval, and any debounce window before the decommissioner acts, depends on how the controller is deployed: as the Decommission controller inside the Redpanda Operator, or as the broker decommissioner sidecar in a Helm-only deployment.
+
+[cols="2,1,4"]
+|===
+| Setting | Default | Description
+
+| `--decommission-wait-interval` (Operator; set through `additionalCmdFlags`)
+| `8s`
+| Requeue interval (`RequeueAfter`) for the Operator's Decommission controller: how often the controller re-checks the cluster for brokers that need decommissioning when a reconcile did not already schedule a sooner re-check.
+
+| `decommissionRequeueTimeout` (Helm sidecar; under `statefulset.sideCars.brokerDecommissioner`)
+| `10s`
+| How often the sidecar re-checks a cluster that already has a broker flagged for decommissioning.
+
+| `decommissionAfter` (Helm sidecar; under `statefulset.sideCars.brokerDecommissioner`)
+| `60s`
+| How long a broker must continuously meet the decommission conditions before the sidecar acts. This debounce window prevents acting on transient conditions, such as a broker that is briefly unreachable during a restart.
+|===
+
+=== Set the interval for the Operator
+
+The Operator's Decommission controller does not expose its interval as a dedicated Helm value. Instead, pass the `--decommission-wait-interval` flag through `additionalCmdFlags` when you install or upgrade the Operator:
+
+[,bash,subs="attributes+"]
+----
+helm upgrade --install redpanda-controller redpanda/operator \
+  --namespace \
+  --create-namespace \
+  --set image.tag={latest-operator-version} \
+  --set "additionalCmdFlags={--additional-controllers=decommission,--decommission-wait-interval=30s}" \
+  --set rbac.createAdditionalControllerCRs=true
+----
+
+The flag accepts any Go duration string, such as `8s`, `30s`, or `2m`. The default is `8s`. After each reconcile, the controller logs the next scheduled run, and the `next run in` value reflects the configured interval:
+
+[.no-copy]
+----
+{"level":"info","logger":"DecommissionReconciler.Reconcile","msg":"successful reconciliation finished in 1m0s, next run in 30s","controller":"statefulset", ...}
+----
+
+=== Set the intervals for Helm
+
+For a Helm-only deployment, set the sidecar values directly under `statefulset.sideCars.brokerDecommissioner`. For a full example, see <>.
+
+=== Guidance for adjusting the intervals
+
+* These settings control only how often the decommissioner *re-checks* for work and how long it waits before acting. They do not change how fast partition data is reallocated once a decommission begins. Reallocation throughput is governed by xref:reference:cluster-properties.adoc#raft_learner_recovery_rate[`raft_learner_recovery_rate`] and xref:reference:tunable-properties.adoc#partition_autobalancing_concurrent_moves[`partition_autobalancing_concurrent_moves`].
+* This interval is the *periodic* re-check cadence. A scale-in that you initiate by reducing `statefulset.replicas` is detected from a StatefulSet watch event and acted on promptly, so raising the interval does not delay a routine scale-in. The interval primarily determines how quickly the controller notices conditions that arise without a triggering event, such as a broker that becomes unreachable.
+* Increase the re-check interval to reduce reconcile frequency, and the associated log and Admin API traffic, on large or stable clusters. Decrease it for faster detection of brokers that need decommissioning.
+* For Helm (sidecar) deployments, keep `decommissionRequeueTimeout` smaller than `decommissionAfter` -- ideally well below it -- so the sidecar re-evaluates the cluster at least once within the debounce window. If the re-check interval is close to or larger than `decommissionAfter`, the decommissioner may wait up to one additional interval before acting. The Kubernetes controller-runtime work queue also adds a small amount of jitter.
+* A single Operator reconcile can take up to about a minute because the Decommission controller verifies that cluster health is stable before it commits to a decommission. This is expected, and is independent of the `--decommission-wait-interval` value.
+
 == Troubleshooting
 
 If the decommissioning process is not making progress, investigate the following potential issues:

From 9e36b4eebbcc6329a5eb20e3f09b3885221de3ac Mon Sep 17 00:00:00 2001
From: David Yu
Date: Wed, 24 Jun 2026 11:00:19 -0700
Subject: [PATCH 2/3] ci(netlify): pin NODE_VERSION=20 so the preview build
 installs deps

The new Netlify build image (noble-new-builds) defaults to node 22 /
npm 10.9.3 when no version is pinned, which fails `npm install` on this
branch with `ERR_INVALID_ARG_TYPE: The "from" argument must be of type
string. Received undefined`. This branch's deps install cleanly on
node 20 (the version pinned in .github/workflows/test-docs.yml, where
the Antora build passes) but not on node 22.

This branch had no netlify.toml, so add a minimal one pinning node 20
and matching main's NODE_OPTIONS. Restores the broken Netlify preview
build; no content changes.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 netlify.toml | 3 +++
 1 file changed, 3 insertions(+)
 create mode 100644 netlify.toml

diff --git a/netlify.toml b/netlify.toml
new file mode 100644
index 0000000000..7adb8e7d8b
--- /dev/null
+++ b/netlify.toml
@@ -0,0 +1,3 @@
+[build.environment]
+NODE_VERSION = "20"
+NODE_OPTIONS = "--max-old-space-size=6144"

From 96b59d1b6a42a3b11c798ccc3d9cd5f7b28bdc98 Mon Sep 17 00:00:00 2001
From: David Yu
Date: Wed, 24 Jun 2026 14:03:53 -0700
Subject: [PATCH 3/3] chore(netlify): trigger clean rebuild for the v/25.2
 preview

Empty commit to re-run the Netlify deploy preview after the build cache
was cleared. The node-20 pin is already in place; this forces a fresh
dependency install (no poisoned node_modules cache).

Co-Authored-By: Claude Opus 4.8 (1M context)
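
Patch 1 describes the Helm-only (sidecar) settings in prose only. A minimal sketch of setting both sidecar values might look like the following. The release name `redpanda`, the chart reference `redpanda/redpanda`, the `<namespace>` placeholder, and the `enabled=true` key are assumptions for illustration and do not come from the patch. Only `decommissionAfter` and `decommissionRequeueTimeout` under `statefulset.sideCars.brokerDecommissioner` are taken from the documented table, and the values shown are the documented defaults.

```bash
# Sketch only, not taken from the patch.
# Assumptions: release name "redpanda", chart "redpanda/redpanda",
# the <namespace> placeholder, and the "enabled" key (verify it against
# the values of the chart version you deploy).
# decommissionAfter is the debounce window; decommissionRequeueTimeout is
# the re-check interval, kept well below the debounce window.
helm upgrade --install redpanda redpanda/redpanda \
  --namespace <namespace> \
  --set statefulset.sideCars.brokerDecommissioner.enabled=true \
  --set statefulset.sideCars.brokerDecommissioner.decommissionAfter=60s \
  --set statefulset.sideCars.brokerDecommissioner.decommissionRequeueTimeout=10s
```

With a 10s re-check interval and a 60s debounce window, the sidecar would re-evaluate the cluster several times within the window, which is the relationship the guidance in patch 1 recommends.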