feat: Kafka bootstrap and runtime health probes by bigbluechief · Pull Request #76 · FINTLabs/fint-core-consumer

bigbluechief · 2026-03-23T08:56:14Z

Introduce Kafka-aware readiness and liveness health handling for the consumer.

Readiness is now based on initial Kafka bootstrap instead of a fixed startup delay. The application stays unready until the blocking listeners have consumed up to their startup end offsets, and then remains ready for the rest of the pod lifetime. This includes both the entity listener and the relation-update listener.
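The readiness rule above can be sketched in plain Java. This is only an illustration, not the code in this PR: the class name `BootstrapReadinessSketch`, its methods, and the string partition keys are all hypothetical, and it assumes each blocking listener reads its partitions from the beginning.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch: readiness latches once every required listener has caught up. */
final class BootstrapReadinessSketch {
    // Listeners that have not reported a partition assignment yet.
    private final Set<String> awaitingAssignment;
    // Partition -> startup end offset that still has to be reached.
    private final Map<String, Long> pending = new HashMap<>();
    private boolean ready;

    BootstrapReadinessSketch(Set<String> requiredListeners) {
        this.awaitingAssignment = new HashSet<>(requiredListeners);
    }

    /** endOffsets: partition -> end offset at assignment time (offset of the next record to be written). */
    synchronized void onAssigned(String listenerId, Map<String, Long> endOffsets) {
        if (ready) return; // latched: later rebalances do not reopen bootstrap
        awaitingAssignment.remove(listenerId);
        endOffsets.forEach((partition, end) -> {
            if (end > 0) pending.put(partition, end); // empty partitions have nothing to catch up on
        });
        evaluate();
    }

    /** Assumption: the listener started from offset 0 on this partition. */
    synchronized void onProcessed(String partition, long offset) {
        if (ready) return;
        Long end = pending.get(partition);
        if (end != null && offset + 1 >= end) pending.remove(partition);
        evaluate();
    }

    private void evaluate() {
        if (awaitingAssignment.isEmpty() && pending.isEmpty()) ready = true;
    }

    synchronized boolean isReady() { return ready; }

    synchronized int pendingPartitions() { return pending.size(); }
}
```

A readiness indicator built on this would report not-ready while `isReady()` is false. Because the flag is never reset, lag that builds up after startup does not flip the pod back to unready.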

Liveness is now separated from bootstrap and tracks Kafka runtime health for registered listeners. It reacts to Spring Kafka runtime events such as non-responsive consumers, failed starts and stopped consumers, while using a grace period to avoid false positives from short interruptions. Normal lag and quiet topics do not make the pod unhealthy.
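A minimal sketch of the grace-period idea, assuming runtime problems and recoveries are reported per listener. `RuntimeLivenessSketch`, `onProblem` and `onRecovered` are hypothetical names, and the mapping from Spring Kafka events to these calls is left out.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: a listener only counts as unhealthy after a problem outlasts the grace period. */
final class RuntimeLivenessSketch {
    private final Duration gracePeriod;
    // Listener id -> time the current problem was first observed.
    private final Map<String, Instant> problemSince = new HashMap<>();

    RuntimeLivenessSketch(Duration gracePeriod) {
        this.gracePeriod = gracePeriod;
    }

    /** Record a runtime problem; repeated reports keep the original timestamp. */
    synchronized void onProblem(String listenerId, Instant now) {
        problemSince.putIfAbsent(listenerId, now);
    }

    /** Clear the problem, for example when the consumer responds again or is restarted. */
    synchronized void onRecovered(String listenerId) {
        problemSince.remove(listenerId);
    }

    /** Live unless some listener has had an open problem for longer than the grace period. */
    synchronized boolean isLive(Instant now) {
        for (Instant since : problemSince.values()) {
            if (Duration.between(since, now).compareTo(gracePeriod) > 0) return false;
        }
        return true;
    }
}
```

Nothing in this sketch looks at consumer lag or at how recently a record arrived, which is why a quiet topic cannot make the pod unhealthy here.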

Also add Micrometer metrics for bootstrap progress and runtime Kafka health, including bootstrap duration, pending partitions, runtime problem counters and unhealthy state gauges.

Update actuator health group configuration and add documentation for the new startup/readiness/liveness model, Kafka-specific health behavior, metrics and Kubernetes probe configuration.


nozoz · 2026-04-20T14:03:33Z

While resolving the merge conflict, I ended up removing some legacy code, specifically the "legacyTopics" logic, which is no longer needed.

…ined

Register REQUEST_EVENT and RESPONSE_EVENT consumers with InitialKafkaBootstrapTracker so readiness stays OUT_OF_SERVICE until both topics catch up to their assignment-time end offsets. Previously only ENTITY and RELATION_UPDATE gated bootstrap.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Spring's error handler skips failed records (noRetries + skipRecordOnRecoveryFailure), so the consumer moves on, but the bootstrap tracker's processedOffset did not. A poison record at the tail of a partition would leave readiness OUT_OF_SERVICE forever. Advance the tracker in the catch before rethrowing.
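The fix described in this commit message can be illustrated roughly as follows. `PoisonRecordSketch`, `OffsetTracker` and `handle` are hypothetical stand-ins rather than the project's actual classes; the point is only that the tracker advances on the failure path as well.

```java
import java.util.function.Consumer;

/** Hypothetical sketch: keep the bootstrap tracker in step with a container that skips failed records. */
final class PoisonRecordSketch {

    /** Stand-in for the bootstrap tracker: remembers the highest offset handled so far. */
    static final class OffsetTracker {
        private long processedOffset = -1;

        void advance(long offset) {
            processedOffset = Math.max(processedOffset, offset);
        }

        long processedOffset() {
            return processedOffset;
        }
    }

    static void handle(long offset, String payload, Consumer<String> handler, OffsetTracker tracker) {
        try {
            handler.accept(payload);
            tracker.advance(offset);
        } catch (RuntimeException e) {
            // The error handler will skip this record, so the tracker has to move past it too.
            tracker.advance(offset);
            throw e; // rethrow so the error handler still sees the failure
        }
    }
}
```

Without the `advance` call in the catch block, a failing record at the last startup offset would leave `processedOffset` one short of the end offset indefinitely.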

nozoz · 2026-04-21T06:52:53Z

Request and response messages are important for Sdworx, so we include them in the bootstrap. We also advance the tracker when a relation update fails; otherwise the last message in the partition could block the service from starting.

AdminClient.listOffsets(...) was called synchronously inside onPartitionsAssigned with a 10s timeout. When the call timed out during pod startup, TimeoutException (a checked exception, not a RuntimeException) escaped the catch block, surfaced as "User rebalance callback throws an error", and DefaultErrorHandler killed the listener container. The tracker now records assignments in-memory and a single-thread ScheduledExecutorService refreshes pending end-offsets in the background with retry on any Exception. Records processed before the offset arrives are buffered so caughtUp evaluates correctly once it lands. Refresh interval and shutdown timeout are exposed via KafkaHealthProperties.
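A rough sketch of that design, under stated assumptions: `AsyncEndOffsetSketch` and the `EndOffsetLookup` interface are hypothetical, the lookup stands in for the admin call, partitions are plain strings, and listeners are assumed to read from offset 0. The real tracker and its `KafkaHealthProperties` settings may look quite different.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: resolve startup end offsets off the rebalance thread, with retry. */
final class AsyncEndOffsetSketch implements AutoCloseable {

    /** Stand-in for the admin lookup; may throw checked exceptions such as TimeoutException. */
    interface EndOffsetLookup {
        Map<String, Long> fetch(Set<String> partitions) throws Exception;
    }

    private final Map<String, Long> endOffsets = new ConcurrentHashMap<>(); // resolved end offsets
    private final Map<String, Long> processed = new ConcurrentHashMap<>();  // buffered highest processed offset
    private final Set<String> awaitingEndOffset = ConcurrentHashMap.newKeySet();
    private final EndOffsetLookup lookup;
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();

    AsyncEndOffsetSketch(EndOffsetLookup lookup) {
        this.lookup = lookup;
    }

    void start(long intervalMillis) {
        scheduler.scheduleWithFixedDelay(this::refreshOnce, 0, intervalMillis, TimeUnit.MILLISECONDS);
    }

    /** Called from the rebalance callback: in-memory only, so no timeout can escape from here. */
    void onAssigned(Set<String> partitions) {
        awaitingEndOffset.addAll(partitions);
    }

    /** Records may arrive before the end offset is known; they are buffered here. */
    void onProcessed(String partition, long offset) {
        processed.merge(partition, offset, Math::max);
    }

    void refreshOnce() {
        if (awaitingEndOffset.isEmpty()) return;
        try {
            Map<String, Long> resolved = lookup.fetch(Set.copyOf(awaitingEndOffset));
            endOffsets.putAll(resolved);
            awaitingEndOffset.removeAll(resolved.keySet());
        } catch (Exception e) {
            // Retry on the next tick; an uncaught exception would cancel the scheduled task.
        }
    }

    boolean caughtUp() {
        if (!awaitingEndOffset.isEmpty() || endOffsets.isEmpty()) return false;
        return endOffsets.entrySet().stream().allMatch(e ->
                e.getValue() == 0 || processed.getOrDefault(e.getKey(), -1L) + 1 >= e.getValue());
    }

    @Override
    public void close() {
        scheduler.shutdownNow();
    }
}
```

Catching `Exception` rather than `RuntimeException` matters in this sketch for the reason given in the commit message: `java.util.concurrent.TimeoutException` is checked and would slip past a narrower catch.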

KafkaAdminEndOffsetProvider built its own AdminClient via KafkaProperties.buildAdminProperties(null), which produced a different effective config than the consumer pipeline (no SslBundles resolution, missing the securityProps map that no.novari.kafka.KafkaConfiguration populates). In prod this manifested as every listOffsets call hanging the full 10s and timing out, even though consumers on the same broker connected fine. Inject the AdminClient bean from the library so admin and consumer share one configuration and one lifecycle.

bigbluechief requested review from alstad and nozoz March 23, 2026 08:56

bigbluechief self-assigned this Mar 23, 2026

nozoz added 4 commits April 20, 2026 15:27

merge develop

f064114

refactor: remove unused legacy resource topics configuration

789880c

test: use health metric arguments & remove legacy resource topic tests

392f8d7

test: remove unnecessary LegacyResourceTopic integration tests

397b875

nozoz and others added 2 commits April 21, 2026 08:43

bigbluechief and others added 6 commits April 21, 2026 09:51

Update health check documentation and add Grafana dashboard

1a40940

Fix path

8376115

merge develop

bc006f1

merge develop

ca3b0b0

bigbluechief wants to merge 13 commits into develop from feature/CT-2384_kafka_health_probes


2 participants
