From bca4e86d5fa02ceda64d56f494ff6fe09708d043 Mon Sep 17 00:00:00 2001
From: delthas <delthas@dille.cc>
Date: Mon, 23 Feb 2026 18:12:33 +0100
Subject: [PATCH] Increase default probe initialDelaySeconds from 10 to 60
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The default liveness and readiness probe initialDelaySeconds (10s) is
incompatible with the startup script's DNS retry loop, which can take
up to 42 seconds in the worst case. This causes pods to enter
CrashLoopBackOff: the liveness probe kills the container at ~30s
(initialDelaySeconds 10 + (failureThreshold 3 - 1) × periodSeconds 10:
probes fail at 10s, 20s and 30s), before ZooKeeper has a chance to
start.

The DNS retry loop in zookeeperStart.sh retries `getent hosts $DOMAIN`
up to 21 times with a 2-second sleep between attempts. For a
single-node cluster or any case where the headless service has no
ready endpoints, DNS will never resolve during the loop, and the
script must wait the full ~42 seconds before proceeding to start
ZooKeeper.

History of the DNS check in zookeeperStart.sh:

1. 97ddb6e - Original: simple `nslookup`, no retry. DNS failure meant
   no ensemble, script moved on immediately.

2. ed1f1d1 - "Added polling for checking headless service is active":
   introduced the retry loop (count=20, sleep 2) because nslookup of
   the headless service can fail transiently even when an active
   ensemble exists. The loop was guarded by `$MYID -ne 1` so the
   first node skipped it entirely.

3. 5c86f53 - "Observers fail to register when zk ensemble service
   domain is not yet available": added an `elif nslookup $DOMAIN |
   grep "server can't find"` fast-path to skip the retry loop when
   DNS definitively says "not found". This also removed the
   `$MYID -ne 1` guard.

4. c693909 - "Use getent instead of nslookup for starting scripts":
   replaced nslookup with getent. Dropped the elif because getent
   does not produce a parseable "server can't find" message. This
   restored the original retry-always behavior from step 2, but
   without the MYID guard, meaning all nodes now unconditionally
   wait up to 42 seconds when DNS does not resolve.

The probe defaults were never updated to account for step 4, so pods
that hit the full DNS retry path are killed before startup completes.
Increasing initialDelaySeconds to 60 gives the startup script time to
exhaust the DNS loop and start ZooKeeper before probes begin firing.
---
 api/v1beta1/zookeepercluster_types.go | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/api/v1beta1/zookeepercluster_types.go b/api/v1beta1/zookeepercluster_types.go
index dde0ccd9..03061763 100644
--- a/api/v1beta1/zookeepercluster_types.go
+++ b/api/v1beta1/zookeepercluster_types.go
@@ -42,7 +42,7 @@ const (
 
 	// DefaultReadinessProbeInitialDelaySeconds is the default initial delay (in seconds)
 	// for the readiness probe
-	DefaultReadinessProbeInitialDelaySeconds = 10
+	DefaultReadinessProbeInitialDelaySeconds = 60
 
 	// DefaultReadinessProbePeriodSeconds is the default probe period (in seconds)
 	// for the readiness probe
@@ -62,7 +62,7 @@ const (
 
 	// DefaultLivenessProbeInitialDelaySeconds is the default initial delay (in seconds)
 	// for the liveness probe
-	DefaultLivenessProbeInitialDelaySeconds = 10
+	DefaultLivenessProbeInitialDelaySeconds = 60
 
 	// DefaultLivenessProbePeriodSeconds is the default probe period (in seconds)
 	// for the liveness probe