
Add load shedding for backend message processing#739

Draft
SimonHeybrock wants to merge 10 commits into main from load-shedding

Conversation


@SimonHeybrock SimonHeybrock commented Feb 27, 2026

Summary

Closes #378

When the backend can't keep up with the Kafka message stream, a new LoadShedder selectively drops bulk event data while preserving control messages and f144 logs.

Detection: Consecutive non-empty batcher results signal overload — zero-cost, no API changes, directly reflects falling behind; empty batches and no-batch cycles both count as idle. Hysteresis (5 consecutive batches to activate, 3 idle cycles to deactivate) prevents oscillation.
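Roughly, the detection/hysteresis scheme looks like the following minimal sketch. The class and method names (`OverloadHysteresis`, `report_batch_result`) and the streak-reset details are illustrative assumptions, not the actual `LoadShedder` internals:

```python
ESCALATE_AFTER = 5    # consecutive data-carrying batches per level step up
DEESCALATE_AFTER = 3  # consecutive idle cycles per level step down
MAX_LEVEL = 3         # cap discussed below


class OverloadHysteresis:
    """Steps the shedding level up/down based on busy/idle streaks."""

    def __init__(self) -> None:
        self.level = 0
        self._busy = 0
        self._idle = 0

    def report_batch_result(self, batch_message_count: int) -> int:
        if batch_message_count > 0:  # only genuine batches signal overload
            self._busy += 1
            self._idle = 0
            if self._busy >= ESCALATE_AFTER:
                self.level = min(self.level + 1, MAX_LEVEL)
                self._busy = 0  # each step requires a fresh streak
        else:  # no batch, or an empty batch: counts as idle
            self._idle += 1
            self._busy = 0
            if self._idle >= DEESCALATE_AFTER:
                self.level = max(self.level - 1, 0)
                self._idle = 0
        return self.level
```

With these thresholds, five consecutive data-carrying batches raise the level by one, and three consecutive idle cycles lower it by one, so the level ramps in both directions rather than toggling.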

Shedding strategy: Uses exponential levels where each level keeps 1/2^N of droppable messages (level 1 = 50% drop, level 2 = 75%, level 3 = 87.5%). Capped at level 3 — this handles up to 8x overload while still producing usable data (~1-2 messages per source per batch at typical rates). Each escalation/de-escalation step requires the same consecutive-cycle thresholds, so the system ramps up under sustained overload and ramps down gracefully when load subsides. Only bulk data (DETECTOR_EVENTS, MONITOR_EVENTS, MONITOR_COUNTS, AREA_DETECTOR) is eligible — all LIVEDATA_* control streams and LOG (f144) are preserved.
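The keep-1/2^N subsampling with preserved control streams could be sketched as below. `Subsampler`, the `(stream, payload)` tuples, and the deterministic modulo counter are assumptions for illustration, not the real shedder's data model:

```python
class Subsampler:
    """Keeps 1 in 2**level droppable messages; all others pass through."""

    # Only bulk data streams are eligible for shedding.
    DROPPABLE = {"DETECTOR_EVENTS", "MONITOR_EVENTS",
                 "MONITOR_COUNTS", "AREA_DETECTOR"}

    def __init__(self) -> None:
        self._seen = 0  # counts droppable messages across calls

    def shed(self, messages, level):
        keep_every = 2 ** level  # level 1 -> 1/2, level 2 -> 1/4, level 3 -> 1/8
        kept = []
        for stream, payload in messages:
            if stream not in self.DROPPABLE:
                kept.append((stream, payload))  # control/LOG streams always pass
                continue
            if self._seen % keep_every == 0:
                kept.append((stream, payload))
            self._seen += 1
        return kept
```

A modulo counter (rather than random sampling) keeps the retained stream evenly spaced in time, which matters when only 1-2 messages per source survive each batch.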

Reporting: Drop statistics use a rolling 60-second window (10 buckets x 6s) so the dashboard shows a meaningful current rate rather than an ever-growing cumulative counter. The dashboard displays an amber "SHEDDING" badge and the drop rate percentage on affected workers, with the rate naturally decaying to 0% after shedding stops. Level transitions are logged at warning severity.
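The 10×6 s bucketed window with natural decay to 0% could be implemented along these lines; the class name, tuple layout, and clock-injection signature are assumptions (the branch's fake-clock injection point may look different):

```python
import time


class RollingDropCounter:
    """60-second rolling window as 10 buckets of 6 s each."""

    N_BUCKETS = 10
    BUCKET_SECONDS = 6.0

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable for deterministic tests
        self._buckets = [(0, 0)] * self.N_BUCKETS  # (dropped, eligible) per bucket
        self._last_index = None

    def _advance(self) -> int:
        index = int(self._clock() // self.BUCKET_SECONDS)
        if self._last_index is None:
            self._last_index = index
        # Zero every bucket skipped since the last update so old counts decay.
        for i in range(min(index - self._last_index, self.N_BUCKETS)):
            self._buckets[(self._last_index + 1 + i) % self.N_BUCKETS] = (0, 0)
        self._last_index = index
        return index % self.N_BUCKETS

    def record(self, dropped: int, eligible: int) -> None:
        slot = self._advance()
        d, e = self._buckets[slot]
        self._buckets[slot] = (d + dropped, e + eligible)

    def drop_rate(self) -> float:
        self._advance()
        dropped = sum(d for d, _ in self._buckets)
        eligible = sum(e for _, e in self._buckets)
        return dropped / eligible if eligible else 0.0
```

Because stale buckets are zeroed on every access, the reported rate falls back to 0% within a minute of shedding stopping, without any background timer.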

Wiring: OrchestratingProcessor runs shed() before the batcher and report_batch_result() after. Load shedding is enabled by default but can be disabled via enable_load_shedding=False (used in integration tests).
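At stub level, the wiring order (shed before the batcher, report after) looks like this; the stub classes and `process_cycle` function are stand-ins, not the real `OrchestratingProcessor` interface:

```python
class StubShedder:
    """Stand-in for LoadShedder: drops every other message, records reports."""

    def __init__(self):
        self.reported = []

    def shed(self, messages):
        return messages[::2]

    def report_batch_result(self, batch_message_count):
        self.reported.append(batch_message_count)


class StubBatcher:
    """Stand-in batcher: returns a batch when given messages, else None."""

    def add(self, messages):
        return list(messages) if messages else None


def process_cycle(shedder, batcher, messages, enable_load_shedding=True):
    # shed() runs before the batcher ...
    if enable_load_shedding:
        messages = shedder.shed(messages)
    batch = batcher.add(messages)
    # ... and report_batch_result() runs after; None and empty batches
    # both report a count of 0 so they register as idle cycles.
    count = len(batch) if batch is not None else 0
    if enable_load_shedding:
        shedder.report_batch_result(batch_message_count=count)
    return batch
```

The `enable_load_shedding=False` path bypasses both calls, which is what keeps integration tests deterministic.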

[Screenshot: dashboard showing the amber SHEDDING badge and drop rate]

Test plan

  • tests/core/load_shedder_test.py — activation, escalation, de-escalation, max level cap, drop rates at each level, rolling window decay, state reporting
  • tests/kafka/status_message_test.py — roundtrip for is_shedding, shedding_level, messages_dropped, messages_eligible fields
  • Full test suite passes (2484 passed, 28 skipped)
  • Manual testing: shedding activates and escalates under sustained load, caps at level 3, de-escalates when load subsides

🤖 Generated with Claude Code

SimonHeybrock and others added 10 commits February 27, 2026 11:44
When the backend can't keep up with the Kafka message stream, the new
LoadShedder selectively drops bulk event data (detector events, monitor
events/counts, area detector) while preserving control messages and f144
logs. Detection uses consecutive non-None batcher results as the overload
signal, with hysteresis to prevent oscillation.

- New LoadShedder class with 50% subsampling when active
- ServiceStatus gains is_shedding and messages_dropped fields
- x5f2 serialization updated for new fields (backward-compatible defaults)
- OrchestratingProcessor wires in shedding before the batcher
- Dashboard shows SHEDDING badge (amber) and dropped message count
- Load shedding is enabled by default but can be disabled for testing

Prompt: Implement the following plan: Load Shedding for Backend Message
Processing (#378)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the cumulative dropped-messages counter with a rolling 60-second
window that tracks both dropped and eligible (total droppable) counts.
The dashboard now shows "Dropped: 38% (62/164)" — percentage plus raw
counts over the recent window — giving an actionable, stable signal
instead of an ever-growing number.

Implementation uses time-bucketed counters (10 buckets × 6s) inside
LoadShedder, with a fake-clock injection point for deterministic testing.
Eligible messages are counted even when shedding is inactive, so the rate
correctly reflects the fraction being shed.

Prompt: Show drop fraction/percentage and use a rolling time window
instead of cumulative counters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The raw counts (62/164) shown next to the cumulative Msgs count were confusing, since the two cover different time windows. Show just "Dropped: 50%", which is unambiguous alongside the total "Msgs" throughput counter.

Prompt: Remove raw counts from the drop rate display, just show percentage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"Dropping: 50%" reads as a current rate, which better matches the
rolling-window semantics than "Dropped" which implies a past event.

Prompt: Change Dropped->Dropping to indicate current rate. Also update PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The single-level shedder (fixed 50% drop) could not recover when the
backend was overloaded by more than 2x. Replace the boolean on/off with
escalating levels where each level keeps 1/2^N of droppable messages
(level 1 = 50%, level 2 = 75%, level 3 = 87.5%, …). Escalation and
de-escalation each require the same consecutive-cycle thresholds as
before, applied per level step.

Wire the new shedding_level field through ServiceStatus, x5f2
serialization, and structured metrics logging. The dashboard badge
remains "SHEDDING" since the existing drop-rate percentage already
conveys severity.

Prompt: Help me understand the shedding strategy added in this branch.
PR says it drops every other message, but what if that is not enough?
Follow-up: Multi-level shedding sounds good, is it simple to implement?
Follow-up: No need to cap, dropping 9/10 or more is still reasonable.
Would it make sense to use sth. like 1/2->1/4->1/8->...?
Follow-up: Note that I don't think there is a need to report the
shedding level like "SHEDDING L2" if we report the drop rate? The two
are redundant in a sense.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Escalation, de-escalation, and full stop are discrete events worth
surfacing above the periodic INFO metrics. Each log line includes the
new level and keep rate for quick diagnosis.

Prompt: I wonder if we should increase the logging severity when we shed?
Follow-up: Maybe you are right that the periodic metric logging should
stay at INFO level, but changes in shedding level might be worth a
warning?

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without a cap, high shedding levels make processing cycles near-instant,
causing the "consecutive non-None batches" overload signal to fire
rapidly in a positive feedback loop. Level 3 (keep 1/8, 87.5% drop)
handles up to 8x overload while still producing usable data (~1-2
messages per source per batch at typical rates of 14 msg/s/source with
5-10 sources per worker).
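The cap arithmetic can be checked directly from the numbers in this message (the ~1 s batch window is taken from the batcher discussion later in this PR):

```python
rate_per_source = 14        # expected messages/second/source
keep_fraction = 1 / 2 ** 3  # level 3 keeps 1/8 of droppable messages
kept_per_batch = rate_per_source * keep_fraction  # batcher uses ~1 s windows
assert kept_per_batch == 1.75  # i.e. ~1-2 kept messages per source per batch
```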

Also switch keep-rate log format from percentage (which underflows to
0.0% at high levels) to a fraction like "1/8".

Prompt: What is that rapid escalation then deescalation? and 0.0% is
not useful.
Follow-up: In practice we expect each source to produce 14
messages/second, and each worker would process 5-10 sources. Leaving
aside that the current indiscriminate shedding is an issue, what might a
good max level be?

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The SimpleMessageBatcher emits empty batches (non-None with zero messages)
when message timestamps jump forward after a pause between measurements.
These were incorrectly counted as overload signals, triggering shedding
during normal operation.

Changed report_batch_result to accept batch_message_count (int) instead of
batch_produced (bool).  Empty batches and no-batch cycles are both treated
as idle, so only genuine data-carrying batches contribute to the activation
threshold.

Added documentation explaining why consecutive non-empty batches are a
reliable overload signal (given the batcher's 1-second time windows and the
~10 ms processing loop) and why empty batches must be excluded.

Prompt: "Please use a new worktree to help me think through #739
(branch load-shedding). I am particularly worried about whether the
overload-detection strategy is safe in the sense that I want to be
really sure that shedding does not get activated unless we actually
have to. Are there scenarios where the system is operating normally
but we nevertheless enter shedding?"

Follow-up: "I was more worried about the batcher genuinely never
returning None because the upstream message rate is much higher than
our processing frequency. Is the shedding strategy only working as a
coincidence given how the current batcher works?"

Follow-up: "We should fix this (and add tests) and also document
clearly what you explained above in docstrings and code comments."

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The docstring claimed each cycle fetches "only ~10 ms worth of messages"
based on the poll interval.  Under real load, processing itself takes
hundreds of milliseconds, so the relevant question is whether the total
cycle (fetch + process + publish) fits within the 1-second batch window.
Reworded to describe the actual mechanism.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 100 ms idle sleep in process() means the total cycle time is
processing_time + N*100ms, so shedding activates at ~90% utilization
rather than exactly 100%.  This is a desirable safety margin.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
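A back-of-the-envelope check of that safety margin, using the 1 s batch window and 100 ms sleep from the messages above (the "one idle sleep per cycle" framing is a simplifying assumption):

```python
BATCH_WINDOW_S = 1.0  # batcher emits at most one batch per 1 s window
IDLE_SLEEP_S = 0.1    # process() sleeps 100 ms when a cycle finds no work


def has_idle_cycle(processing_time_s: float) -> bool:
    """A cycle only registers as idle if processing plus one idle sleep
    still fits inside the batch window."""
    return processing_time_s + IDLE_SLEEP_S <= BATCH_WINDOW_S


# Up to ~90% utilization the loop still fits idle sleeps into the window,
# which reset the overload streak; above that, every cycle looks busy, so
# the activation threshold is reached at ~90% rather than exactly 100%.
assert has_idle_cycle(0.90)
assert not has_idle_cycle(0.95)
```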


Development

Successfully merging this pull request may close these issues.

Gracefully deal with cases where Beamlime cannot keep up with Kafka stream
