fix(chaos): rate-limit tight-loop antagonists to prevent long-run OOM #604
Merged
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Contributor
CI Test Results
Run: #27750860542 | Commit:
Status Overview
Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled
Summary: Total: 32 | Passed: 32 | Failed: 0
Updated: 2026-06-18 10:00:54 UTC
rkennke
approved these changes
Jun 18, 2026
rkennke
left a comment
Contributor
Yup makes sense, and looks good!
Collaborator
Author
@rkennke thanks for the review!
What does this PR do?:
Adds inter-iteration sleeps to three chaos antagonists that ran in tight loops without yielding, causing Java heap OOM after ~1 hour in the scheduled reliability/chaos CI cell.
Motivation:
The chaos harness was failing with java.lang.OutOfMemoryError: Java heap space after ~59 minutes. Root cause: WeakRefWaveAntagonist.waveLoop() calls System.gc() in a tight loop with no pause between waves. Each forced full GC is a stop-the-world pause during which ThreadChurnAntagonist (64 threads/5ms), DumpStormAntagonist (96 threads), and AllocStormAntagonist keep accumulating objects. Over time this degrades into GC thrashing and exhausts the 2GB heap.
Two secondary contributors also ran with no inter-iteration sleep: ClassLoaderChurnAntagonist and HiddenClassChurnAntagonist both define unique classes in tight loops; ClassLoaders require old-gen GC to collect and can accumulate faster than the GC can reclaim them.
Changes:
- WeakRefWaveAntagonist: 200ms sleep after each wave, which gives G1's concurrent markers time to work between forced full GCs
- ClassLoaderChurnAntagonist: 1ms sleep per iteration, enough for old-gen to keep up with ClassLoader churn
- HiddenClassChurnAntagonist: 1ms sleep per iteration, for the same reason
The System.gc() call is preserved; only the pathological back-to-back cadence is fixed.
Additional Notes:
No profiler code changed. This is purely a chaos harness fix.
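For readers without the diff at hand, the rate-limiting pattern described under Changes can be sketched roughly as below. This is a minimal illustration, not the harness code: the class name RateLimitedWaveSketch, the waveLoop signature, the AtomicBoolean stop flag, and the per-wave allocation size are all assumptions made for the example. Only the 200ms pause and the retained System.gc() call come from the description.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch: names and structure are illustrative, not taken from the harness.
final class RateLimitedWaveSketch {
    // Pause between waves; 200 ms mirrors the value stated in the PR description.
    static final long WAVE_PAUSE_MS = 200;

    // Runs waves until `running` is cleared or the thread is interrupted.
    // Returns the number of completed waves.
    static int waveLoop(AtomicBoolean running, int refsPerWave) {
        int waves = 0;
        while (running.get() && !Thread.currentThread().isInterrupted()) {
            // Allocate a wave of objects that are only weakly reachable.
            List<WeakReference<byte[]>> refs = new ArrayList<>(refsPerWave);
            for (int i = 0; i < refsPerWave; i++) {
                refs.add(new WeakReference<>(new byte[1024]));
            }
            System.gc(); // the forced GC request is kept; only the cadence changes
            waves++;
            try {
                // Inter-iteration sleep: yields between waves instead of
                // requesting GCs back to back.
                Thread.sleep(WAVE_PAUSE_MS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // restore the flag and stop
                break;
            }
        }
        return waves;
    }

    public static void main(String[] args) throws InterruptedException {
        AtomicBoolean running = new AtomicBoolean(true);
        Thread worker = new Thread(
            () -> System.out.println("waves completed: " + waveLoop(running, 1000)));
        worker.start();
        Thread.sleep(1000); // let a few waves run
        running.set(false);
        worker.join();
    }
}
```

The same shape would apply to the two class-churn antagonists with a 1ms pause in place of 200ms. Restoring the interrupt flag before leaving the loop is a design choice of this sketch so that a harness shutdown via interruption is not swallowed by the sleep.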
How to test the change?:
The scheduled reliability/chaos CI pipeline will exercise this. The harness should now complete the full duration (RUNTIME seconds) without OOM.