
Add load shedding for backend message processing#739

Draft
SimonHeybrock wants to merge 10 commits into main from load-shedding

Conversation


@SimonHeybrock SimonHeybrock commented Feb 27, 2026

Summary

Closes #378

When the backend can't keep up with the Kafka message stream, a new LoadShedder selectively drops bulk event data while preserving control messages and f144 logs.

Detection: Consecutive non-empty batcher results signal overload — zero-cost, no API changes, directly reflects falling behind; empty batches and no-batch cycles both count as idle. Hysteresis (5 consecutive batches to activate, 3 idle cycles to deactivate) prevents oscillation.
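Roughly, the detection/hysteresis scheme looks like the following minimal sketch. The class and method names (`OverloadHysteresis`, `report_batch_result`) and the streak-reset details are illustrative assumptions, not the actual `LoadShedder` internals:

```python
ESCALATE_AFTER = 5    # consecutive data-carrying batches per level step up
DEESCALATE_AFTER = 3  # consecutive idle cycles per level step down
MAX_LEVEL = 3         # cap discussed below


class OverloadHysteresis:
    """Steps the shedding level up/down based on busy/idle streaks."""

    def __init__(self) -> None:
        self.level = 0
        self._busy = 0
        self._idle = 0

    def report_batch_result(self, batch_message_count: int) -> int:
        if batch_message_count > 0:  # only genuine batches signal overload
            self._busy += 1
            self._idle = 0
            if self._busy >= ESCALATE_AFTER:
                self.level = min(self.level + 1, MAX_LEVEL)
                self._busy = 0  # each step requires a fresh streak
        else:  # no batch, or an empty batch: counts as idle
            self._idle += 1
            self._busy = 0
            if self._idle >= DEESCALATE_AFTER:
                self.level = max(self.level - 1, 0)
                self._idle = 0
        return self.level
```

With these thresholds, five consecutive data-carrying batches raise the level by one, and three consecutive idle cycles lower it by one, so the level ramps in both directions rather than toggling.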

Shedding strategy: Uses exponential levels where each level keeps 1/2^N of droppable messages (level 1 = 50% drop, level 2 = 75%, level 3 = 87.5%). Capped at level 3 — this handles up to 8x overload while still producing usable data (~1-2 messages per source per batch at typical rates). Each escalation/de-escalation step requires the same consecutive-cycle thresholds, so the system ramps up under sustained overload and ramps down gracefully when load subsides. Only bulk data (DETECTOR_EVENTS, MONITOR_EVENTS, MONITOR_COUNTS, AREA_DETECTOR) is eligible — all LIVEDATA_* control streams and LOG (f144) are preserved.
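The keep-1/2^N subsampling with preserved control streams could be sketched as below. `Subsampler`, the `(stream, payload)` tuples, and the deterministic modulo counter are assumptions for illustration, not the real shedder's data model:

```python
class Subsampler:
    """Keeps 1 in 2**level droppable messages; all others pass through."""

    # Only bulk data streams are eligible for shedding.
    DROPPABLE = {"DETECTOR_EVENTS", "MONITOR_EVENTS",
                 "MONITOR_COUNTS", "AREA_DETECTOR"}

    def __init__(self) -> None:
        self._seen = 0  # counts droppable messages across calls

    def shed(self, messages, level):
        keep_every = 2 ** level  # level 1 -> 1/2, level 2 -> 1/4, level 3 -> 1/8
        kept = []
        for stream, payload in messages:
            if stream not in self.DROPPABLE:
                kept.append((stream, payload))  # control/LOG streams always pass
                continue
            if self._seen % keep_every == 0:
                kept.append((stream, payload))
            self._seen += 1
        return kept
```

A modulo counter (rather than random sampling) keeps the retained stream evenly spaced in time, which matters when only 1-2 messages per source survive each batch.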

Reporting: Drop statistics use a rolling 60-second window (10 buckets x 6s) so the dashboard shows a meaningful current rate rather than an ever-growing cumulative counter. The dashboard displays an amber "SHEDDING" badge and the drop rate percentage on affected workers, with the rate naturally decaying to 0% after shedding stops. Level transitions are logged at warning severity.
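The 10×6 s bucketed window with natural decay to 0% could be implemented along these lines; the class name, tuple layout, and clock-injection signature are assumptions (the branch's fake-clock injection point may look different):

```python
import time


class RollingDropCounter:
    """60-second rolling window as 10 buckets of 6 s each."""

    N_BUCKETS = 10
    BUCKET_SECONDS = 6.0

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable for deterministic tests
        self._buckets = [(0, 0)] * self.N_BUCKETS  # (dropped, eligible) per bucket
        self._last_index = None

    def _advance(self) -> int:
        index = int(self._clock() // self.BUCKET_SECONDS)
        if self._last_index is None:
            self._last_index = index
        # Zero every bucket skipped since the last update so old counts decay.
        for i in range(min(index - self._last_index, self.N_BUCKETS)):
            self._buckets[(self._last_index + 1 + i) % self.N_BUCKETS] = (0, 0)
        self._last_index = index
        return index % self.N_BUCKETS

    def record(self, dropped: int, eligible: int) -> None:
        slot = self._advance()
        d, e = self._buckets[slot]
        self._buckets[slot] = (d + dropped, e + eligible)

    def drop_rate(self) -> float:
        self._advance()
        dropped = sum(d for d, _ in self._buckets)
        eligible = sum(e for _, e in self._buckets)
        return dropped / eligible if eligible else 0.0
```

Because stale buckets are zeroed on every access, the reported rate falls back to 0% within a minute of shedding stopping, without any background timer.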

Wiring: OrchestratingProcessor runs shed() before the batcher and report_batch_result() after. Load shedding is enabled by default but can be disabled via enable_load_shedding=False (used in integration tests).
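At stub level, the wiring order (shed before the batcher, report after) looks like this; the stub classes and `process_cycle` function are stand-ins, not the real `OrchestratingProcessor` interface:

```python
class StubShedder:
    """Stand-in for LoadShedder: drops every other message, records reports."""

    def __init__(self):
        self.reported = []

    def shed(self, messages):
        return messages[::2]

    def report_batch_result(self, batch_message_count):
        self.reported.append(batch_message_count)


class StubBatcher:
    """Stand-in batcher: returns a batch when given messages, else None."""

    def add(self, messages):
        return list(messages) if messages else None


def process_cycle(shedder, batcher, messages, enable_load_shedding=True):
    # shed() runs before the batcher ...
    if enable_load_shedding:
        messages = shedder.shed(messages)
    batch = batcher.add(messages)
    # ... and report_batch_result() runs after; None and empty batches
    # both report a count of 0 so they register as idle cycles.
    count = len(batch) if batch is not None else 0
    if enable_load_shedding:
        shedder.report_batch_result(batch_message_count=count)
    return batch
```

The `enable_load_shedding=False` path bypasses both calls, which is what keeps integration tests deterministic.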

[Screenshot: dashboard showing the amber SHEDDING badge and drop rate]

Test plan

  • tests/core/load_shedder_test.py — activation, escalation, de-escalation, max level cap, drop rates at each level, rolling window decay, state reporting
  • tests/kafka/status_message_test.py — roundtrip for is_shedding, shedding_level, messages_dropped, messages_eligible fields
  • Full test suite passes (2484 passed, 28 skipped)
  • Manual testing: shedding activates and escalates under sustained load, caps at level 3, de-escalates when load subsides

🤖 Generated with Claude Code

SimonHeybrock and others added 10 commits February 27, 2026 11:44
When the backend can't keep up with the Kafka message stream, the new
LoadShedder selectively drops bulk event data (detector events, monitor
events/counts, area detector) while preserving control messages and f144
logs. Detection uses consecutive non-None batcher results as the overload
signal, with hysteresis to prevent oscillation.

- New LoadShedder class with 50% subsampling when active
- ServiceStatus gains is_shedding and messages_dropped fields
- x5f2 serialization updated for new fields (backward-compatible defaults)
- OrchestratingProcessor wires in shedding before the batcher
- Dashboard shows SHEDDING badge (amber) and dropped message count
- Load shedding is enabled by default but can be disabled for testing

Prompt: Implement the following plan: Load Shedding for Backend Message
Processing (#378)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the cumulative dropped-messages counter with a rolling 60-second
window that tracks both dropped and eligible (total droppable) counts.
The dashboard now shows "Dropped: 38% (62/164)" — percentage plus raw
counts over the recent window — giving an actionable, stable signal
instead of an ever-growing number.

Implementation uses time-bucketed counters (10 buckets × 6s) inside
LoadShedder, with a fake-clock injection point for deterministic testing.
Eligible messages are counted even when shedding is inactive, so the rate
correctly reflects the fraction being shed.

Prompt: Show drop fraction/percentage and use a rolling time window
instead of cumulative counters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The raw counts (62/164) shown next to the cumulative Msgs count were confusing, since the two cover different time windows. Show just "Dropped: 50%", which is unambiguous alongside the total "Msgs" throughput counter.

Prompt: Remove raw counts from the drop rate display, just show percentage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"Dropping: 50%" reads as a current rate, which better matches the
rolling-window semantics than "Dropped" which implies a past event.

Prompt: Change Dropped->Dropping to indicate current rate. Also update PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The single-level shedder (fixed 50% drop) could not recover when the
backend was overloaded by more than 2x. Replace the boolean on/off with
escalating levels where each level keeps 1/2^N of droppable messages
(level 1 = 50%, level 2 = 75%, level 3 = 87.5%, …). Escalation and
de-escalation each require the same consecutive-cycle thresholds as
before, applied per level step.

Wire the new shedding_level field through ServiceStatus, x5f2
serialization, and structured metrics logging. The dashboard badge
remains "SHEDDING" since the existing drop-rate percentage already
conveys severity.

Prompt: Help me understand the shedding strategy added in this branch.
PR says it drops every other message, but what if that is not enough?
Follow-up: Multi-level shedding sounds good, is it simple to implement?
Follow-up: No need to cap, dropping 9/10 or more is still reasonable.
Would it make sense to use sth. like 1/2->1/4->1/8->...?
Follow-up: Note that I don't think there is a need to report the
shedding level like "SHEDDING L2" if we report the drop rate? The two
are redundant in a sense.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Escalation, de-escalation, and full stop are discrete events worth
surfacing above the periodic INFO metrics. Each log line includes the
new level and keep rate for quick diagnosis.

Prompt: I wonder if we should increase the logging severity when we shed?
Follow-up: Maybe you are right that the periodic metric logging should
stay at INFO level, but changes in shedding level might be worth a
warning?

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without a cap, high shedding levels make processing cycles near-instant,
causing the "consecutive non-None batches" overload signal to fire
rapidly in a positive feedback loop. Level 3 (keep 1/8, 87.5% drop)
handles up to 8x overload while still producing usable data (~1-2
messages per source per batch at typical rates of 14 msg/s/source with
5-10 sources per worker).
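The cap arithmetic can be checked directly from the numbers in this message (the ~1 s batch window is taken from the batcher discussion later in this PR):

```python
rate_per_source = 14        # expected messages/second/source
keep_fraction = 1 / 2 ** 3  # level 3 keeps 1/8 of droppable messages
kept_per_batch = rate_per_source * keep_fraction  # batcher uses ~1 s windows
assert kept_per_batch == 1.75  # i.e. ~1-2 kept messages per source per batch
```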

Also switch keep-rate log format from percentage (which underflows to
0.0% at high levels) to a fraction like "1/8".

Prompt: What is that rapid escalation then deescalation? and 0.0% is
not useful.
Follow-up: In practice we expect each source to produce 14
messages/second, and each worker would process 5-10 sources. Leaving
aside that the current indiscriminate shedding is an issue, what might a
good max level be?

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The SimpleMessageBatcher emits empty batches (non-None with zero messages)
when message timestamps jump forward after a pause between measurements.
These were incorrectly counted as overload signals, triggering shedding
during normal operation.

Changed report_batch_result to accept batch_message_count (int) instead of
batch_produced (bool).  Empty batches and no-batch cycles are both treated
as idle, so only genuine data-carrying batches contribute to the activation
threshold.

Added documentation explaining why consecutive non-empty batches are a
reliable overload signal (given the batcher's 1-second time windows and the
~10 ms processing loop) and why empty batches must be excluded.

Prompt: "Please use a new worktree to help me think through #739
(branch load-shedding). I am particularly worried about whether the
overload-detection strategy is safe in the sense that I want to be
really sure that shedding does not get activated unless we actually
have to. Are there scenarios where the system is operating normally
but we nevertheless enter shedding?"

Follow-up: "I was more worried about the batcher genuinely never
returning None because the upstream message rate is much higher than
our processing frequency. Is the shedding strategy only working as a
coincidence given how the current batcher works?"

Follow-up: "We should fix this (and add tests) and also document
clearly what you explained above in docstrings and code comments."

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The docstring claimed each cycle fetches "only ~10 ms worth of messages"
based on the poll interval.  Under real load, processing itself takes
hundreds of milliseconds, so the relevant question is whether the total
cycle (fetch + process + publish) fits within the 1-second batch window.
Reworded to describe the actual mechanism.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 100 ms idle sleep in process() means the total cycle time is
processing_time + N*100ms, so shedding activates at ~90% utilization
rather than exactly 100%.  This is a desirable safety margin.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
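A back-of-the-envelope check of that safety margin, using the 1 s batch window and 100 ms sleep from the messages above (the "one idle sleep per cycle" framing is a simplifying assumption):

```python
BATCH_WINDOW_S = 1.0  # batcher emits at most one batch per 1 s window
IDLE_SLEEP_S = 0.1    # process() sleeps 100 ms when a cycle finds no work


def has_idle_cycle(processing_time_s: float) -> bool:
    """A cycle only registers as idle if processing plus one idle sleep
    still fits inside the batch window."""
    return processing_time_s + IDLE_SLEEP_S <= BATCH_WINDOW_S


# Up to ~90% utilization the loop still fits idle sleeps into the window,
# which reset the overload streak; above that, every cycle looks busy, so
# the activation threshold is reached at ~90% rather than exactly 100%.
assert has_idle_cycle(0.90)
assert not has_idle_cycle(0.95)
```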


Development

Successfully merging this pull request may close these issues.

Gracefully deal with cases where Beamlime cannot keep up with Kafka stream
