feat: Kafka bootstrap and runtime health probes #76
Open
bigbluechief wants to merge 13 commits into
Introduce Kafka-aware readiness and liveness health handling for the consumer.

Readiness is now based on initial Kafka bootstrap instead of a fixed startup delay. The application stays unready until the blocking listeners have consumed up to their startup end offsets, and then remains ready for the rest of the pod lifetime. This includes both the entity listener and the relation-update listener.

Liveness is now separated from bootstrap and tracks Kafka runtime health for registered listeners. It reacts to Spring Kafka runtime events such as non-responsive consumers, failed starts and stopped consumers, while using a grace period to avoid false positives from short interruptions. Normal lag and quiet topics do not make the pod unhealthy.

Also add Micrometer metrics for bootstrap progress and runtime Kafka health, including bootstrap duration, pending partitions, runtime problem counters and unhealthy state gauges.

Update actuator health group configuration and add documentation for the new startup/readiness/liveness model, Kafka-specific health behavior, metrics and Kubernetes probe configuration.
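For readers who want the grace-period idea in concrete form, here is a minimal sketch. It is not the code in this PR: the class `GracePeriodLiveness` and its method names are hypothetical, and it assumes the caller translates Spring Kafka runtime events into `problemObserved` / `recovered` calls.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch: liveness that tolerates short interruptions. */
final class GracePeriodLiveness {
    private final Duration gracePeriod;
    // listener id -> time the currently unresolved problem was first seen
    private final Map<String, Instant> problemSince = new ConcurrentHashMap<>();

    GracePeriodLiveness(Duration gracePeriod) {
        this.gracePeriod = gracePeriod;
    }

    /** Record a runtime problem (e.g. a non-responsive consumer); keeps the first timestamp. */
    void problemObserved(String listenerId, Instant now) {
        problemSince.putIfAbsent(listenerId, now);
    }

    /** Clear the problem once the listener is healthy again. */
    void recovered(String listenerId) {
        problemSince.remove(listenerId);
    }

    /** Live unless some listener has had an unresolved problem for longer than the grace period. */
    boolean isLive(Instant now) {
        return problemSince.values().stream()
                .noneMatch(since -> Duration.between(since, now).compareTo(gracePeriod) > 0);
    }
}
```

Lag and quiet topics never call `problemObserved` in this sketch, so they cannot affect liveness.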
Contributor
During the merge conflict I ended up removing some legacy code, specifically the "legacyTopics" logic that is no longer needed.
…ined Register REQUEST_EVENT and RESPONSE_EVENT consumers with InitialKafkaBootstrapTracker so readiness stays OUT_OF_SERVICE until both topics catch up to their assignment-time end offsets. Previously only ENTITY and RELATION_UPDATE gated bootstrap. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Spring's error handler skips failed records (noRetries + skipRecordOnRecoveryFailure), so the consumer moves on, but the bootstrap tracker's processedOffset did not. A poison record at the tail of a partition would leave readiness OUT_OF_SERVICE forever. Advance the tracker in the catch before rethrowing.
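A minimal sketch of the "advance in the catch, then rethrow" pattern described above. `ProcessedOffsets` and `TrackingListener` are hypothetical stand-ins, not the classes in this PR, and the listener signature is simplified to plain values instead of a Kafka `ConsumerRecord`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical stand-in for the bootstrap tracker: highest processed offset per partition. */
final class ProcessedOffsets {
    private final Map<Integer, Long> processed = new ConcurrentHashMap<>();

    void advance(int partition, long offset) {
        processed.merge(partition, offset, Math::max);
    }

    /** Returns -1 when nothing has been processed on the partition yet. */
    long get(int partition) {
        return processed.getOrDefault(partition, -1L);
    }
}

/** Hypothetical listener wrapper that keeps the tracker in step with the consumer. */
final class TrackingListener {
    private final ProcessedOffsets tracker;
    private final Consumer<String> handler;

    TrackingListener(ProcessedOffsets tracker, Consumer<String> handler) {
        this.tracker = tracker;
        this.handler = handler;
    }

    void onRecord(int partition, long offset, String value) {
        try {
            handler.accept(value);
        } catch (RuntimeException e) {
            // The error handler skips this record, so the consumer moves past it.
            // Count it as processed too, then rethrow so error handling is unchanged.
            tracker.advance(partition, offset);
            throw e;
        }
        tracker.advance(partition, offset);
    }
}
```

With this shape a poison record at the tail of a partition still moves the tracked offset forward, so readiness is not held back by a record the consumer has already skipped.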
Contributor
Request and response messages are important for Sdworx, so we include them in the bootstrap. We also advance the tracker if a relation update fails; otherwise the last message in the partition can block startup of the service.
AdminClient.listOffsets(...) was called synchronously inside onPartitionsAssigned with a 10s timeout. When the call timed out during pod startup, TimeoutException (a checked exception, not a RuntimeException) escaped the catch block, surfaced as "User rebalance callback throws an error", and DefaultErrorHandler killed the listener container. The tracker now records assignments in-memory and a single-thread ScheduledExecutorService refreshes pending end-offsets in the background with retry on any Exception. Records processed before the offset arrives are buffered so caughtUp evaluates correctly once it lands. Refresh interval and shutdown timeout are exposed via KafkaHealthProperties.
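The buffering behaviour can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `BufferedBootstrapTracker` and its methods are hypothetical, partitions are plain integers, the broker lookup is passed in as a function, and the sketch ignores the consumer's starting position (it treats a partition as caught up once the highest processed offset plus one reaches the end offset).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntToLongFunction;

/** Hypothetical sketch of a bootstrap tracker whose end offsets arrive asynchronously. */
final class BufferedBootstrapTracker {
    private static final long UNKNOWN = -1L;

    // partition -> startup end offset, UNKNOWN until a refresh pass succeeds
    private final Map<Integer, Long> endOffsets = new ConcurrentHashMap<>();
    // partition -> highest processed offset, recorded even while the end offset is unknown
    private final Map<Integer, Long> processed = new ConcurrentHashMap<>();

    /** Called from the rebalance callback: in-memory only, so nothing can time out here. */
    void partitionAssigned(int partition) {
        endOffsets.putIfAbsent(partition, UNKNOWN);
    }

    void recordProcessed(int partition, long offset) {
        processed.merge(partition, offset, Math::max);
    }

    /** One refresh pass; a failed lookup leaves the partition pending for the next pass. */
    void refreshPending(IntToLongFunction endOffsetLookup) {
        for (Map.Entry<Integer, Long> entry : endOffsets.entrySet()) {
            if (entry.getValue() != UNKNOWN) {
                continue; // already resolved
            }
            try {
                long end = endOffsetLookup.applyAsLong(entry.getKey());
                endOffsets.replace(entry.getKey(), UNKNOWN, end);
            } catch (Exception e) {
                // Keep the partition pending; the next scheduled pass retries it.
            }
        }
    }

    /** True once every assigned partition has a known end offset that has been reached. */
    boolean caughtUp() {
        return !endOffsets.isEmpty() && endOffsets.entrySet().stream().allMatch(e -> {
            long end = e.getValue();
            long done = processed.getOrDefault(e.getKey(), -1L);
            return end != UNKNOWN && done + 1 >= end;
        });
    }
}
```

A single-thread `ScheduledExecutorService` could drive `refreshPending` via `scheduleWithFixedDelay`, which keeps the broker call off the rebalance callback thread.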
KafkaAdminEndOffsetProvider built its own AdminClient via KafkaProperties.buildAdminProperties(null), which produced a different effective config than the consumer pipeline (no SslBundles resolution, missing the securityProps map that no.novari.kafka.KafkaConfiguration populates). In prod this manifested as every listOffsets call hanging the full 10s and timing out, even though consumers on the same broker connected fine. Inject the AdminClient bean from the library so admin and consumer share one configuration and one lifecycle.