Add load shedding for backend message processing #739
Draft
SimonHeybrock wants to merge 10 commits into main from load-shedding
Conversation
When the backend can't keep up with the Kafka message stream, the new LoadShedder selectively drops bulk event data (detector events, monitor events/counts, area detector) while preserving control messages and f144 logs. Detection uses consecutive non-None batcher results as the overload signal, with hysteresis to prevent oscillation.

- New LoadShedder class with 50% subsampling when active
- ServiceStatus gains is_shedding and messages_dropped fields
- x5f2 serialization updated for new fields (backward-compatible defaults)
- OrchestratingProcessor wires in shedding before the batcher
- Dashboard shows SHEDDING badge (amber) and dropped message count
- Load shedding is enabled by default but can be disabled for testing

Prompt: Implement the following plan: Load Shedding for Backend Message Processing (#378)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
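The hysteresis and 50% subsampling described above can be sketched as follows. This is a minimal illustration, assuming hypothetical names (`SheddingState`, `report_batch_result`, `keep`) and the thresholds the PR description gives (5 consecutive batches to activate, 3 idle cycles to deactivate); it is not the actual LoadShedder from this branch.

```python
# Hypothetical sketch of hysteresis-based shedding; names and structure are
# illustrative, not the PR's actual LoadShedder API.
ACTIVATE_AFTER = 5    # consecutive data-carrying batches to activate
DEACTIVATE_AFTER = 3  # consecutive idle cycles to deactivate


class SheddingState:
    def __init__(self) -> None:
        self.active = False
        self._busy_streak = 0
        self._idle_streak = 0
        self._counter = 0  # drives the deterministic 50% subsampling

    def report_batch_result(self, batch_produced: bool) -> None:
        if batch_produced:
            self._busy_streak += 1
            self._idle_streak = 0
            if self._busy_streak >= ACTIVATE_AFTER:
                self.active = True
        else:
            self._idle_streak += 1
            self._busy_streak = 0
            if self._idle_streak >= DEACTIVATE_AFTER:
                self.active = False

    def keep(self, is_droppable: bool) -> bool:
        """Return True if the message should be processed."""
        if not self.active or not is_droppable:
            return True  # control messages and logs always pass through
        self._counter += 1
        return self._counter % 2 == 0  # drop every other droppable message
```

The asymmetric thresholds (activate slowly, deactivate only after sustained idleness) are what prevent the on/off oscillation mentioned above.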
Replace the cumulative dropped-messages counter with a rolling 60-second window that tracks both dropped and eligible (total droppable) counts. The dashboard now shows "Dropped: 38% (62/164)" — percentage plus raw counts over the recent window — giving an actionable, stable signal instead of an ever-growing number.

Implementation uses time-bucketed counters (10 buckets × 6 s) inside LoadShedder, with a fake-clock injection point for deterministic testing. Eligible messages are counted even when shedding is inactive, so the rate correctly reflects the fraction being shed.

Prompt: Show drop fraction/percentage and use a rolling time window instead of cumulative counters.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
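The time-bucketed rolling window can be sketched like this. Bucket count, bucket width, and all names (`RollingDropStats`, `record`, `drop_rate`) are taken from or invented around the description above; this is an assumption-laden illustration, not the PR's implementation.

```python
# Illustrative rolling counter: 10 buckets x 6 s = 60 s window, with an
# injectable clock for deterministic tests (as the commit describes).
import time

N_BUCKETS = 10
BUCKET_SECONDS = 6.0


class RollingDropStats:
    def __init__(self, clock=time.monotonic) -> None:
        self._clock = clock
        # Each slot: [bucket_index, dropped, eligible]
        self._buckets = [[-1, 0, 0] for _ in range(N_BUCKETS)]

    def _current_slot(self):
        idx = int(self._clock() // BUCKET_SECONDS)
        slot = self._buckets[idx % N_BUCKETS]
        if slot[0] != idx:  # slot belongs to an old interval: reset it
            slot[0], slot[1], slot[2] = idx, 0, 0
        return slot

    def record(self, eligible: int, dropped: int) -> None:
        slot = self._current_slot()
        slot[1] += dropped
        slot[2] += eligible

    def drop_rate(self) -> float:
        now_idx = int(self._clock() // BUCKET_SECONDS)
        dropped = eligible = 0
        for idx, d, e in self._buckets:
            if now_idx - idx < N_BUCKETS:  # still inside the 60 s window
                dropped += d
                eligible += e
        return dropped / eligible if eligible else 0.0
```

Because eligible messages are recorded even while shedding is off, the rate decays naturally to 0% once shedding stops, matching the dashboard behavior described later in the PR summary.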
The raw counts (62/164) next to the cumulative Msgs count were confusing since they cover different time windows. Show just "Dropped: 50%", which is unambiguous alongside the total "Msgs" throughput counter.

Prompt: Remove raw counts from the drop rate display, just show percentage.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
"Dropping: 50%" reads as a current rate, which better matches the rolling-window semantics than "Dropped", which implies a past event.

Prompt: Change Dropped->Dropping to indicate current rate. Also update PR.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The single-level shedder (fixed 50% drop) could not recover when the backend was overloaded by more than 2x. Replace the boolean on/off with escalating levels where each level keeps 1/2^N of droppable messages (drop rates: level 1 = 50%, level 2 = 75%, level 3 = 87.5%, …). Escalation and de-escalation each require the same consecutive-cycle thresholds as before, applied per level step. Wire the new shedding_level field through ServiceStatus, x5f2 serialization, and structured metrics logging. The dashboard badge remains "SHEDDING" since the existing drop-rate percentage already conveys severity.

Prompt: Help me understand the shedding strategy added in this branch. PR says it drops every other message, but what if that is not enough?
Follow-up: Multi-level shedding sounds good, is it simple to implement?
Follow-up: No need to cap, dropping 9/10 or more is still reasonable. Would it make sense to use sth. like 1/2->1/4->1/8->...?
Follow-up: Note that I don't think there is a need to report the shedding level like "SHEDDING L2" if we report the drop rate? The two are redundant in a sense.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
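The exponential scheme can be sketched as a deterministic subsampler. The helper names (`keep_fraction`, `should_keep`) are hypothetical illustrations of the 1/2^N rule, not functions from this branch.

```python
def keep_fraction(level: int) -> float:
    """Level N keeps 1/2**N of droppable messages; level 0 keeps all."""
    return 1.0 if level <= 0 else 1.0 / 2 ** level


def should_keep(counter: int, level: int) -> bool:
    """Deterministic subsampling: keep every (2**level)-th droppable message."""
    return level <= 0 or counter % (2 ** level) == 0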
Escalation, de-escalation, and full stop are discrete events worth surfacing above the periodic INFO metrics. Each log line includes the new level and keep rate for quick diagnosis.

Prompt: I wonder if we should increase the logging severity when we shed?
Follow-up: Maybe you are right that the periodic metric logging should stay at INFO level, but changes in shedding level might be worth a warning?

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Without a cap, high shedding levels make processing cycles near-instant, causing the "consecutive non-None batches" overload signal to fire rapidly in a positive feedback loop. Level 3 (keep 1/8, 87.5% drop) handles up to 8x overload while still producing usable data (~1-2 messages per source per batch at typical rates of 14 msg/s/source with 5-10 sources per worker). Also switch keep-rate log format from percentage (which underflows to 0.0% at high levels) to a fraction like "1/8".

Prompt: What is that rapid escalation then deescalation? and 0.0% is not useful.
Follow-up: In practice we expect each source to produce 14 messages/second, and each worker would process 5-10 sources. Leaving aside that the current indiscriminate shedding is an issue, what might a good max level be?

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
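The cap and fraction-style log label could look like the following sketch. `MAX_LEVEL` and `keep_rate_label` are assumed names for illustration, not the PR's code.

```python
MAX_LEVEL = 3  # keep 1/8 of droppable messages; handles up to 8x overload


def keep_rate_label(level: int) -> str:
    """Format the keep rate as a fraction like "1/8", which stays readable
    at high levels where a rounded percentage becomes uninformative."""
    return "1/1" if level <= 0 else f"1/{2 ** min(level, MAX_LEVEL)}"
```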
The SimpleMessageBatcher emits empty batches (non-None with zero messages) when message timestamps jump forward after a pause between measurements. These were incorrectly counted as overload signals, triggering shedding during normal operation.

Changed report_batch_result to accept batch_message_count (int) instead of batch_produced (bool). Empty batches and no-batch cycles are both treated as idle, so only genuine data-carrying batches contribute to the activation threshold. Added documentation explaining why consecutive non-empty batches are a reliable overload signal (given the batcher's 1-second time windows and the ~10 ms processing loop) and why empty batches must be excluded.

Prompt: "Please use a new worktree to help me think through #739 (branch load-shedding). I am particularly worried about whether the overload-detection strategy is safe in the sense that I want to be really sure that shedding does not get activated unless we actually have to. Are there scenarios where the system is operating normally but we nevertheless enter shedding?"
Follow-up: "I was more worried about the batcher genuinely never returning None because the upstream message rate is much higher than our processing frequency. Is the shedding strategy only working as a coincidence given how the current batcher works?"
Follow-up: "We should fix this (and add tests) and also document clearly what you explained above in docstrings and code comments."

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
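The fixed overload predicate reduces to a small function. The name `is_overload_signal` is an illustrative assumption; the PR folds this logic into `report_batch_result`.

```python
from typing import Optional


def is_overload_signal(batch_message_count: Optional[int]) -> bool:
    """Only a batch that actually carries messages indicates the backend is
    falling behind; None (no batch) and an empty batch are both idle."""
    return batch_message_count is not None and batch_message_count > 0
```

This is the distinction the commit draws: an empty batch caused by a timestamp jump must not count toward the activation threshold.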
The docstring claimed each cycle fetches "only ~10 ms worth of messages" based on the poll interval. Under real load, processing itself takes hundreds of milliseconds, so the relevant question is whether the total cycle (fetch + process + publish) fits within the 1-second batch window. Reworded to describe the actual mechanism.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The 100 ms idle sleep in process() means the total cycle time is processing_time + N*100ms, so shedding activates at ~90% utilization rather than exactly 100%. This is a desirable safety margin. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
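The ~90% figure follows from a back-of-envelope calculation: shedding stays inactive only while at least one 100 ms idle sleep still fits into each 1-second batch window. The constants and helper name below are illustrative, taken from the numbers in the text.

```python
BATCH_WINDOW_S = 1.0  # batcher's time window
IDLE_SLEEP_S = 0.1    # sleep in process() when no batch is ready


def max_utilization_without_shedding(idle_cycles: int = 1) -> float:
    """Fraction of the batch window spent processing when the remaining
    time is consumed by `idle_cycles` idle sleeps."""
    return (BATCH_WINDOW_S - idle_cycles * IDLE_SLEEP_S) / BATCH_WINDOW_S
```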
Summary
Closes #378
When the backend can't keep up with the Kafka message stream, a new LoadShedder selectively drops bulk event data while preserving control messages and f144 logs.

Detection: Consecutive non-None batcher results signal overload — zero-cost, no API changes, directly reflects falling behind. Hysteresis (5 consecutive batches to activate, 3 idle cycles to deactivate) prevents oscillation.
Shedding strategy: Uses exponential levels where each level keeps 1/2^N of droppable messages (level 1 = 50% drop, level 2 = 75%, level 3 = 87.5%). Capped at level 3 — this handles up to 8x overload while still producing usable data (~1-2 messages per source per batch at typical rates). Each escalation/de-escalation step requires the same consecutive-cycle thresholds, so the system ramps up under sustained overload and ramps down gracefully when load subsides. Only bulk data (DETECTOR_EVENTS, MONITOR_EVENTS, MONITOR_COUNTS, AREA_DETECTOR) is eligible — all LIVEDATA_* control streams and LOG (f144) are preserved.

Reporting: Drop statistics use a rolling 60-second window (10 buckets × 6 s) so the dashboard shows a meaningful current rate rather than an ever-growing cumulative counter. The dashboard displays an amber "SHEDDING" badge and the drop rate percentage on affected workers, with the rate naturally decaying to 0% after shedding stops. Level transitions are logged at warning severity.
Wiring: OrchestratingProcessor runs shed() before the batcher and report_batch_result() after. Load shedding is enabled by default but can be disabled via enable_load_shedding=False (used in integration tests).

Test plan

- tests/core/load_shedder_test.py — activation, escalation, de-escalation, max level cap, drop rates at each level, rolling window decay, state reporting
- tests/kafka/status_message_test.py — roundtrip for is_shedding, shedding_level, messages_dropped, messages_eligible fields

🤖 Generated with Claude Code