Skip to content

fix: add Kafka static membership to stop rebalance storm#75

Merged
anaselmhamdi merged 1 commit into
mainfrom
fix/kafka-static-membership
Apr 17, 2026
Merged

fix: add Kafka static membership to stop rebalance storm#75
anaselmhamdi merged 1 commit into
mainfrom
fix/kafka-static-membership

Conversation

@anaselmhamdi

Copy link
Copy Markdown
Collaborator

Summary

  • Restores the 0.3.13 recreate-in-place behavior in commit() (0.3.14's "just log and return" was wrong — librdkafka didn't auto-rejoin and commits kept failing indefinitely in production)
  • Adds static membership (KIP-345): group.instance.id auto-derived from the K8s-injected HOSTNAME env var. With a stable instance ID, the close+recreate reconnects as the same member, so the broker skips the full-group rebalance
  • Users can override by setting group.instance.id explicitly in consumer_config

Why this fixes the storm

Inspecting the consumer group via AdminClient.describe_consumer_groups while the storm was live:

state: PREPARING_REBALANCING
partition_assignor: cooperative-sticky
members: 34                    ← 16 pods but 34 members = 18 ghosts
group_instance_id: None        ← no static membership

Each pod's consumer.close() sent LeaveGroup, and each new Consumer() joined as a brand-new dynamic member (different member.id). With 16 pods recreating every ~40s and session.timeout.ms=60s, the group accumulated ghosts and stayed permanently in PREPARING_REBALANCING. Cooperative-sticky made each rebalance incremental but couldn't eliminate them — adding/removing group members always triggers one.

With static membership, the recreate reconnects to the existing slot → no LeaveGroup-driven rebalance → no cascade.

Preemptible node interaction

  • In-process recreate (the actual cause of the storm): static membership eliminates rebalance ✓
  • Pod eviction on preemption: ~2 rebalances per pod replacement (same as before — new HOSTNAME = new instance), but these are rare and incremental under cooperative-sticky
  • Net: more reliable, same preemption behavior

Test plan

  • uv run pytest tests/connectors/sources/kafka/ — 27 tests pass
  • uv run ruff check && format
  • Deploy to one low-replica pipeline first (lower blast radius given prior incident). Verify:
    • Startup log shows Kafka static membership enabled: group.instance.id=...
    • AdminClient describe shows group_instance_id populated on the member
    • Normal processing continues, commits succeed
  • Then deploy to conversations-events (16 replicas). Verify:
    • members count in the group equals live pod count (no ghost accumulation)
    • Recreating consumer log frequency drops dramatically (may still appear occasionally for legitimate rebalances, but not cascading)
    • BigQuery lag continues to drop

Rollback

Single revert. Recreate logic is already there (from 0.3.13), static membership is purely additive.

🤖 Generated with Claude Code

0.3.14 removed the close+recreate on ILLEGAL_GENERATION on the
assumption librdkafka would auto-rejoin on the next consume(). It
doesn't — commits kept failing indefinitely in production.

This restores the 0.3.13 recreate-in-place behavior AND adds static
membership (KIP-345) via HOSTNAME (K8s pod name). With a stable
group.instance.id the reconnecting consumer is recognized as the
same instance, so the close+recreate no longer triggers a full-group
rebalance that cascades to invalidate all other consumers' commits.

Users can override by setting group.instance.id in consumer_config.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@anaselmhamdi anaselmhamdi merged commit b36ac25 into main Apr 17, 2026
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant