fix: add Kafka static membership to stop rebalance storm by anaselmhamdi · Pull Request #75 · bizon-data/bizon-core

anaselmhamdi · 2026-04-17T15:18:40Z

Summary

Restores the 0.3.13 recreate-in-place behavior in commit() (0.3.14's "just log and return" was wrong — librdkafka didn't auto-rejoin and commits kept failing indefinitely in production)
Adds static membership (KIP-345): group.instance.id auto-derived from the K8s-injected HOSTNAME env var. With a stable instance ID, the close+recreate reconnects as the same member, so the broker skips the full-group rebalance
Users can override by setting group.instance.id explicitly in consumer_config
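
A minimal sketch of the derivation described above, assuming the consumer settings are a plain dict passed to the Kafka client. The helper name `with_static_membership` and the `env` parameter are hypothetical and only illustrate the precedence: an explicit `group.instance.id` wins, otherwise `HOSTNAME` is used, otherwise the consumer stays a dynamic member.

```python
import os
from typing import Mapping, Optional


def with_static_membership(
    consumer_config: Mapping[str, object],
    env: Optional[Mapping[str, str]] = None,
) -> dict:
    """Return a copy of consumer_config with group.instance.id filled in when possible."""
    env = os.environ if env is None else env
    config = dict(consumer_config)

    # An explicit user setting always wins over the derived value.
    if config.get("group.instance.id"):
        return config

    hostname = env.get("HOSTNAME", "").strip()
    if hostname:
        # On Kubernetes, HOSTNAME is the pod name, which stays the same
        # across an in-process close + recreate of the consumer.
        config["group.instance.id"] = hostname

    # Without HOSTNAME the config is left untouched (dynamic membership).
    return config
```

Passing `env` explicitly keeps the helper easy to test; in a pod it would simply read `os.environ`.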

Why this fixes the storm

Inspecting the consumer group via AdminClient.describe_consumer_groups while the storm was live:

state: PREPARING_REBALANCING
partition_assignor: cooperative-sticky
members: 34                    ← 16 pods but 34 members = 18 ghosts
group_instance_id: None        ← no static membership
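
A snapshot like the one above could be collected roughly as follows. This is a sketch, not the script used during the incident: `summarize_group` and `describe_group` are hypothetical names, the broker address and group id are caller-supplied placeholders, and the live query assumes a confluent-kafka version that provides `AdminClient.describe_consumer_groups`. The offline example at the bottom uses a stand-in object with the same numbers as the snapshot.

```python
from types import SimpleNamespace


def summarize_group(description, live_pod_count):
    """Reduce a consumer group description to the fields shown in the snapshot."""
    members = list(description.members)
    static = [m for m in members if getattr(m, "group_instance_id", None)]
    state = description.state
    return {
        # Enum states expose .name; plain strings are passed through.
        "state": getattr(state, "name", str(state)),
        "partition_assignor": description.partition_assignor,
        "members": len(members),
        "static_members": len(static),
        # Members beyond the number of live pods are counted as ghosts.
        "ghosts": max(len(members) - live_pod_count, 0),
    }


def describe_group(bootstrap_servers, group_id, live_pod_count):
    """Query the cluster; needs confluent-kafka installed and a reachable broker."""
    from confluent_kafka.admin import AdminClient  # lazy import on purpose

    admin = AdminClient({"bootstrap.servers": bootstrap_servers})
    futures = admin.describe_consumer_groups([group_id])
    return summarize_group(futures[group_id].result(), live_pod_count)


# Offline example: a stand-in shaped like the snapshot above.
snapshot = SimpleNamespace(
    state="PREPARING_REBALANCING",
    partition_assignor="cooperative-sticky",
    members=[
        SimpleNamespace(member_id=f"m-{i}", group_instance_id=None)
        for i in range(34)
    ],
)
summary = summarize_group(snapshot, live_pod_count=16)
```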

Each pod's consumer.close() sent LeaveGroup, and each new Consumer() joined as a brand-new dynamic member (different member.id). With 16 pods recreating every ~40s and session.timeout.ms=60s, the group accumulated ghosts and stayed permanently in PREPARING_REBALANCING. Cooperative-sticky made each rebalance incremental but couldn't eliminate them — adding/removing group members always triggers one.

With static membership, the recreate reconnects to the existing slot → no LeaveGroup-driven rebalance → no cascade.
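
The contrast between the two modes can be illustrated with a toy member table. This is not broker code: `ToyGroup` and `simulate` are made up for illustration, and the model deliberately ignores LeaveGroup handling and session.timeout.ms expiry, so the absolute counts mean nothing. Only the difference between "every join is a new member" and "a known instance takes over its slot" is the point.

```python
import itertools


class ToyGroup:
    """Toy stand-in for a group coordinator's member table."""

    def __init__(self):
        self.members = {}      # member.id -> group.instance.id (or None)
        self.rebalances = 0
        self._ids = itertools.count(1)

    def join(self, client_id, group_instance_id=None):
        new_id = f"{client_id}-{next(self._ids)}"
        if group_instance_id is not None:
            for member_id, instance in list(self.members.items()):
                if instance == group_instance_id:
                    # Known instance: take over the existing slot, no rebalance.
                    del self.members[member_id]
                    self.members[new_id] = group_instance_id
                    return new_id
        # Unknown member: the table grows and the group rebalances.
        self.members[new_id] = group_instance_id
        self.rebalances += 1
        return new_id


def simulate(pods, recreations, static):
    """Each pod joins once, then recreates its consumer `recreations` times."""
    group = ToyGroup()
    for _ in range(1 + recreations):
        for pod in pods:
            group.join(pod, group_instance_id=pod if static else None)
    return len(group.members), group.rebalances


pods = [f"pod-{i}" for i in range(16)]
dynamic_result = simulate(pods, recreations=2, static=False)
static_result = simulate(pods, recreations=2, static=True)
```

In this model the dynamic run keeps adding members and rebalances on every recreate, while the static run stays at one member per pod and only rebalances on the initial joins.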

Preemptible node interaction

In-process recreate (the actual cause of the storm): static membership eliminates rebalance ✓
Pod eviction on preemption: ~2 rebalances per pod replacement (same as before — new HOSTNAME = new instance), but these are rare and incremental under cooperative-sticky
Net: more reliable, same preemption behavior

Test plan

uv run pytest tests/connectors/sources/kafka/ — 27 tests pass
uv run ruff check and uv run ruff format
Deploy to one low-replica pipeline first (lower blast radius given prior incident). Verify:
- Startup log shows Kafka static membership enabled: group.instance.id=...
- AdminClient describe shows group_instance_id populated on the member
- Normal processing continues, commits succeed
Then deploy to conversations-events (16 replicas). Verify:
- members count in the group equals live pod count (no ghost accumulation)
- Recreating consumer log frequency drops dramatically (may still appear occasionally for legitimate rebalances, but not cascading)
- BigQuery lag continues to drop
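
The membership checks in this plan could be automated along these lines. `check_rollout` is a hypothetical helper; it takes plain `(member_id, group_instance_id)` pairs (for example extracted from an AdminClient describe call) and assumes the default derivation, where the instance ID equals the pod name.

```python
def check_rollout(members, live_pods):
    """Return human-readable problems; an empty list means the checks passed.

    members:   iterable of (member_id, group_instance_id) pairs
    live_pods: iterable of pod names currently running
    """
    members = list(members)
    live_pods = set(live_pods)
    problems = []

    # Ghost accumulation shows up as more members than live pods.
    if len(members) != len(live_pods):
        problems.append(f"{len(members)} members for {len(live_pods)} live pods")

    for member_id, instance_id in members:
        if not instance_id:
            problems.append(f"{member_id}: group_instance_id not populated")
        elif instance_id not in live_pods:
            # Only meaningful when group.instance.id is derived from HOSTNAME.
            problems.append(f"{member_id}: instance {instance_id} has no live pod")
    return problems
```

If `group.instance.id` is overridden in consumer_config, the last check would need to compare against the configured IDs instead of pod names.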

Rollback

Single revert. Recreate logic is already there (from 0.3.13), static membership is purely additive.

🤖 Generated with Claude Code

0.3.14 removed the close+recreate on ILLEGAL_GENERATION on the assumption librdkafka would auto-rejoin on the next consume(). It doesn't: commits kept failing indefinitely in production.

This restores the 0.3.13 recreate-in-place behavior AND adds static membership (KIP-345) via HOSTNAME (K8s pod name). With a stable group.instance.id the reconnecting consumer is recognized as the same instance, so the close+recreate no longer triggers a full-group rebalance that cascades to invalidate all other consumers' commits.

Users can override by setting group.instance.id in consumer_config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
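
For readers unfamiliar with the recreate-in-place pattern, here is a rough sketch of its shape. It is not the code in this PR: `RecreatingCommitter`, `consumer_factory` and `is_illegal_generation` are hypothetical names, and returning `False` after a recreate is an assumption about how a caller might want to learn that the commit was not applied. With confluent-kafka, the predicate would typically check that the exception is a `KafkaException` whose `args[0].code()` equals `KafkaError.ILLEGAL_GENERATION`.

```python
class RecreatingCommitter:
    """Wraps a consumer and recreates it when a commit hits a stale generation."""

    def __init__(self, consumer_factory, is_illegal_generation):
        # consumer_factory must return a ready consumer (configured and
        # subscribed), because it is called again after every recreate.
        self._factory = consumer_factory
        self._is_illegal_generation = is_illegal_generation
        self.consumer = consumer_factory()

    def commit(self, **kwargs):
        """Return True if the commit went through, False if the consumer was recreated."""
        try:
            self.consumer.commit(**kwargs)
            return True
        except Exception as exc:
            if not self._is_illegal_generation(exc):
                raise
            # Stale generation: drop the old consumer and build a fresh one.
            try:
                self.consumer.close()
            finally:
                self.consumer = self._factory()
            return False
```

With a stable `group.instance.id` in the factory's config, the fresh consumer rejoins as the same static member, which is what keeps this path from disturbing the rest of the group.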

anaselmhamdi merged commit b36ac25 into main Apr 17, 2026
2 checks passed

anaselmhamdi merged 1 commit into main from fix/kafka-static-membership
