feat: Kafka bootstrap and runtime health probes #76

Open
bigbluechief wants to merge 13 commits into develop from feature/CT-2384_kafka_health_probes

@bigbluechief
Contributor

@bigbluechief bigbluechief commented Mar 23, 2026

Introduce Kafka-aware readiness and liveness health handling for the consumer.

Readiness is now based on initial Kafka bootstrap instead of a fixed startup delay. The application stays unready until the blocking listeners have consumed up to their startup end offsets, and then remains ready for the rest of the pod lifetime. This includes both the entity listener and the relation-update listener.
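
A minimal sketch of the readiness gate described above, not the code in this PR. The class name `BootstrapReadinessSketch`, the string partition keys and the method names are hypothetical; the real tracker presumably works with Kafka's own partition types and also accounts for the consumer's starting position.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch, not the tracker from this PR.
final class BootstrapReadinessSketch {
    // Partition key (for example "entity-0") -> end offset captured at startup.
    // A Kafka end offset is the offset the next produced record would get.
    private final Map<String, Long> startupEndOffsets = new ConcurrentHashMap<>();
    // Partition key -> highest record offset the listener has processed so far.
    private final Map<String, Long> processedOffsets = new ConcurrentHashMap<>();
    // Latches to true once bootstrap completes and is never reset.
    private final AtomicBoolean ready = new AtomicBoolean(false);

    void onAssigned(String partition, long endOffset) {
        startupEndOffsets.putIfAbsent(partition, endOffset);
    }

    void onProcessed(String partition, long offset) {
        processedOffsets.merge(partition, offset, Math::max);
    }

    boolean isReady() {
        if (ready.get()) {
            return true; // stays ready for the rest of the process lifetime
        }
        if (startupEndOffsets.isEmpty()) {
            return false; // nothing assigned yet, so bootstrap has not started
        }
        // Simplification: only processed offsets are compared. A consumer that
        // resumes exactly at the end offset would need its position considered too.
        boolean allCaughtUp = startupEndOffsets.entrySet().stream()
                .allMatch(e -> processedOffsets.getOrDefault(e.getKey(), -1L) + 1 >= e.getValue());
        if (allCaughtUp) {
            ready.set(true);
        }
        return allCaughtUp;
    }
}
```

Because the flag latches, partitions assigned after bootstrap (for example after a rebalance) cannot make the pod unready again, which matches the "remains ready for the rest of the pod lifetime" behavior.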

Liveness is now separated from bootstrap and tracks Kafka runtime health for registered listeners. It reacts to Spring Kafka runtime events such as non-responsive consumers, failed starts and stopped consumers, while using a grace period to avoid false positives from short interruptions. Normal lag and quiet topics do not make the pod unhealthy.
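
The grace period can be illustrated with a small sketch. This is an assumption about the shape of the logic, not the implementation in this PR: `RuntimeLivenessSketch`, `onProblem` and `onRecovered` are hypothetical names, and which Spring Kafka events feed each method is left open here.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch, not the health indicator from this PR.
final class RuntimeLivenessSketch {
    private final Duration gracePeriod;
    // Listener id -> instant at which the listener was first reported unhealthy.
    private final Map<String, Instant> unhealthySince = new ConcurrentHashMap<>();

    RuntimeLivenessSketch(Duration gracePeriod) {
        this.gracePeriod = gracePeriod;
    }

    // Called for a runtime problem such as a non-responsive or stopped consumer.
    // Repeated reports do not restart the grace period.
    void onProblem(String listenerId, Instant now) {
        unhealthySince.putIfAbsent(listenerId, now);
    }

    // Called when the listener is known to be running again.
    void onRecovered(String listenerId) {
        unhealthySince.remove(listenerId);
    }

    // Alive unless some listener has been unhealthy for longer than the grace period.
    // Lag and quiet topics never call onProblem, so they cannot affect the result.
    boolean isAlive(Instant now) {
        return unhealthySince.values().stream()
                .noneMatch(since -> Duration.between(since, now).compareTo(gracePeriod) > 0);
    }
}
```

Passing the current time in explicitly keeps the sketch deterministic to test; a real component would more likely read it from an injected `java.time.Clock`.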

Also add Micrometer metrics for bootstrap progress and runtime Kafka health, including bootstrap duration, pending partitions, runtime problem counters and unhealthy state gauges.

Update actuator health group configuration and add documentation for the new startup/readiness/liveness model, Kafka-specific health behavior, metrics and Kubernetes probe configuration.

@bigbluechief bigbluechief requested review from alstad and nozoz March 23, 2026 08:56
@bigbluechief bigbluechief self-assigned this Mar 23, 2026
@nozoz
Contributor

nozoz commented Apr 20, 2026

While resolving the merge conflict, I ended up removing some legacy code, specifically the "legacyTopics" logic, which is no longer needed.

nozoz and others added 2 commits April 21, 2026 08:43
…ined

Register REQUEST_EVENT and RESPONSE_EVENT consumers with
InitialKafkaBootstrapTracker so readiness stays OUT_OF_SERVICE
until both topics catch up to their assignment-time end offsets.
Previously only ENTITY and RELATION_UPDATE gated bootstrap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Spring's error handler skips failed records (noRetries +
skipRecordOnRecoveryFailure), so the consumer moves on, but the
bootstrap tracker's processedOffset did not. A poison record at
the tail of a partition would leave readiness OUT_OF_SERVICE
forever. Advance the tracker in the catch before rethrowing.
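
A minimal sketch of the "advance in the catch before rethrowing" pattern, under the assumption that the tracker keeps one processed offset per partition. `PoisonRecordSketch` and `handle` are hypothetical names, and the business logic is passed in as a plain callback rather than a real listener method.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongConsumer;

// Hypothetical sketch, not the listener from this PR.
final class PoisonRecordSketch {
    // Offset of the last record the bootstrap tracker has accounted for; -1 means none yet.
    final AtomicLong processedOffset = new AtomicLong(-1L);

    void handle(long offset, LongConsumer businessLogic) {
        try {
            businessLogic.accept(offset);
            advance(offset);
        } catch (RuntimeException e) {
            // The error handler will skip this record, so the tracker must move past it
            // as well; otherwise a failing record at the tail blocks readiness forever.
            advance(offset);
            throw e; // still surface the failure to the container's error handler
        }
    }

    private void advance(long offset) {
        processedOffset.accumulateAndGet(offset, Math::max);
    }
}
```

The failure is still rethrown, so error handling and logging behave as before; only the bootstrap bookkeeping changes.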
@nozoz
Contributor

nozoz commented Apr 21, 2026

Request and response messages are important for Sdworx, so we include them in the bootstrap. We also advance the tracker when a relation update fails; otherwise the last message in a partition could block service startup.

bigbluechief and others added 6 commits April 21, 2026 09:51
AdminClient.listOffsets(...) was called synchronously inside
onPartitionsAssigned with a 10s timeout. When the call timed out
during pod startup, TimeoutException (a checked exception, not a
RuntimeException) escaped the catch block, surfaced as
"User rebalance callback throws an error", and DefaultErrorHandler
killed the listener container.

The tracker now records assignments in-memory and a single-thread
ScheduledExecutorService refreshes pending end-offsets in the
background with retry on any Exception. Records processed before
the offset arrives are buffered so caughtUp evaluates correctly
once it lands. Refresh interval and shutdown timeout are exposed
via KafkaHealthProperties.
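
The background refresh can be sketched roughly as follows. This is an illustration under stated assumptions, not the tracker in this PR: `AsyncEndOffsetSketch` is a hypothetical name, partitions are plain strings, and the end-offset lookup is an injected function standing in for the `AdminClient.listOffsets(...)` call.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;

// Hypothetical sketch, not the tracker from this PR.
final class AsyncEndOffsetSketch implements AutoCloseable {
    private final Set<String> pending = ConcurrentHashMap.newKeySet();
    private final Map<String, Long> endOffsets = new ConcurrentHashMap<>();
    // Offsets processed so far, including those seen before the end offset is known.
    private final Map<String, Long> processed = new ConcurrentHashMap<>();
    // Stand-in for the admin lookup; may throw or return only some partitions.
    private final Function<Set<String>, Map<String, Long>> lookup;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    AsyncEndOffsetSketch(Function<Set<String>, Map<String, Long>> lookup) {
        this.lookup = lookup;
    }

    void start(long intervalMillis) {
        scheduler.scheduleWithFixedDelay(this::refreshOnce, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    // Safe to call from the rebalance callback: in-memory only, no broker round trip.
    void onAssigned(String partition) {
        pending.add(partition);
    }

    void onProcessed(String partition, long offset) {
        processed.merge(partition, offset, Math::max);
    }

    void refreshOnce() {
        if (pending.isEmpty()) {
            return;
        }
        try {
            Map<String, Long> resolved = lookup.apply(Set.copyOf(pending));
            endOffsets.putAll(resolved);
            pending.removeAll(resolved.keySet());
        } catch (Exception e) {
            // Keep the partitions pending and retry on the next tick. Letting the
            // exception escape would cancel the scheduled task.
        }
    }

    boolean caughtUp() {
        return pending.isEmpty() && !endOffsets.isEmpty() && endOffsets.entrySet().stream()
                .allMatch(e -> processed.getOrDefault(e.getKey(), -1L) + 1 >= e.getValue());
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Keeping `onAssigned` free of I/O means a slow or failing lookup can only delay readiness; it can no longer throw out of the rebalance callback.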
KafkaAdminEndOffsetProvider built its own AdminClient via
KafkaProperties.buildAdminProperties(null), which produced a
different effective config than the consumer pipeline (no
SslBundles resolution, missing the securityProps map that
no.novari.kafka.KafkaConfiguration populates). In prod this
manifested as every listOffsets call hanging the full 10s and
timing out, even though consumers on the same broker connected
fine.

Inject the AdminClient bean from the library so admin and consumer
share one configuration and one lifecycle.