From bca4e86d5fa02ceda64d56f494ff6fe09708d043 Mon Sep 17 00:00:00 2001
From: delthas
Date: Mon, 23 Feb 2026 18:12:33 +0100
Subject: [PATCH] Increase default probe initialDelaySeconds from 10 to 60
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The default liveness and readiness probe initialDelaySeconds (10s) is
incompatible with the startup script's DNS retry loop, which can take
up to 42 seconds in the worst case. This causes pods to enter
CrashLoopBackOff: the liveness probe kills the container at ~30s
(first probe at initialDelaySeconds 10, then failureThreshold 3
consecutive failures spaced periodSeconds 10 apart, i.e. at ~10s, ~20s
and ~30s), before ZooKeeper has a chance to start.

The DNS retry loop in zookeeperStart.sh retries `getent hosts $DOMAIN`
up to 21 times with a 2-second sleep between attempts. For a
single-node cluster, or any case where the headless service has no
ready endpoints, DNS will never resolve during the loop, and the
script must wait the full ~42 seconds before proceeding to start
ZooKeeper.

History of the DNS check in zookeeperStart.sh:

1. 97ddb6e - Original: simple `nslookup`, no retry. DNS failure meant
   no ensemble, script moved on immediately.

2. ed1f1d1 - "Added polling for checking headless service is active":
   introduced the retry loop (count=20, sleep 2) because nslookup of
   the headless service can fail transiently even when an active
   ensemble exists. The loop was guarded by `$MYID -ne 1` so the first
   node skipped it entirely.

3. 5c86f53 - "Observers fail to register when zk ensemble service
   domain is not yet available": added an
   `elif nslookup $DOMAIN | grep "server can't find"` fast-path to
   skip the retry loop when DNS definitively says "not found". This
   also removed the `$MYID -ne 1` guard.

4. c693909 - "Use getent instead of nslookup for starting scripts":
   replaced nslookup with getent. Dropped the elif because getent does
   not produce a parseable "server can't find" message.
   This restored the original retry-always behavior from step 2, but
   without the MYID guard, meaning all nodes now unconditionally wait
   up to 42 seconds when DNS does not resolve.

The probe defaults were never updated to account for step 4, so pods
that hit the full DNS retry path are killed before startup completes.
Increasing initialDelaySeconds to 60 gives the startup script time to
exhaust the DNS loop and start ZooKeeper before probes begin firing.
---
 api/v1beta1/zookeepercluster_types.go | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/api/v1beta1/zookeepercluster_types.go b/api/v1beta1/zookeepercluster_types.go
index dde0ccd9..03061763 100644
--- a/api/v1beta1/zookeepercluster_types.go
+++ b/api/v1beta1/zookeepercluster_types.go
@@ -42,7 +42,7 @@ const (
 
 	// DefaultReadinessProbeInitialDelaySeconds is the default initial delay (in seconds)
 	// for the readiness probe
-	DefaultReadinessProbeInitialDelaySeconds = 10
+	DefaultReadinessProbeInitialDelaySeconds = 60
 
 	// DefaultReadinessProbePeriodSeconds is the default probe period (in seconds)
 	// for the readiness probe
@@ -62,7 +62,7 @@ const (
 
 	// DefaultLivenessProbeInitialDelaySeconds is the default initial delay (in seconds)
 	// for the liveness probe
-	DefaultLivenessProbeInitialDelaySeconds = 10
+	DefaultLivenessProbeInitialDelaySeconds = 60
 
 	// DefaultLivenessProbePeriodSeconds is the default probe period (in seconds)
 	// for the liveness probe
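
Reviewer note: the timing argument in the commit message can be checked with a small standalone sketch. It is not part of the patch. The type and function names below are hypothetical, and the model assumes the first probe fires at initialDelaySeconds and the container is killed on the failureThreshold-th consecutive failure. It ignores kubelet scheduling jitter and the time ZooKeeper itself needs after the DNS loop ends.

```go
package main

import "fmt"

// probeConfig holds the three probe settings discussed in the commit message.
// Hypothetical type for illustration only, not the operator's API.
type probeConfig struct {
	initialDelaySeconds int
	periodSeconds       int
	failureThreshold    int
}

// earliestKillSeconds approximates when a liveness probe that never succeeds
// restarts the container: the first probe runs at initialDelaySeconds, one
// more probe runs every periodSeconds, and the kill happens on the
// failureThreshold-th consecutive failure.
func earliestKillSeconds(p probeConfig) int {
	return p.initialDelaySeconds + (p.failureThreshold-1)*p.periodSeconds
}

// worstCaseDNSWaitSeconds is the time spent in a retry loop where every
// attempt fails and each attempt is followed by a fixed sleep.
func worstCaseDNSWaitSeconds(attempts, sleepSeconds int) int {
	return attempts * sleepSeconds
}

// survivesDNSLoop reports whether the worst-case DNS loop finishes before
// the earliest possible liveness kill.
func survivesDNSLoop(p probeConfig, attempts, sleepSeconds int) bool {
	return worstCaseDNSWaitSeconds(attempts, sleepSeconds) < earliestKillSeconds(p)
}

func main() {
	// 21 attempts with a 2-second sleep, as described in the commit message.
	const attempts, sleepSeconds = 21, 2

	// Compare the old default (10) with the new default (60).
	for _, delay := range []int{10, 60} {
		p := probeConfig{initialDelaySeconds: delay, periodSeconds: 10, failureThreshold: 3}
		fmt.Printf("initialDelaySeconds=%d: DNS wait %ds, earliest kill %ds, survives=%t\n",
			delay,
			worstCaseDNSWaitSeconds(attempts, sleepSeconds),
			earliestKillSeconds(p),
			survivesDNSLoop(p, attempts, sleepSeconds))
	}
}
```

Under these assumptions the old default yields an earliest kill at 30s against a 42s DNS wait, while the new default moves the earliest kill to 80s.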