fix(chaos): rate-limit tight-loop antagonists to prevent long-run OOM #604

Merged
jbachorik merged 2 commits into main from jb/chaos-oom-fix
Jun 18, 2026

Conversation

@jbachorik

Collaborator

What does this PR do?:
Adds inter-iteration sleeps to three chaos antagonists that ran in tight loops without yielding, causing Java heap OOM after ~1 hour in the scheduled reliability/chaos CI cell.

Motivation:
The chaos harness was failing with java.lang.OutOfMemoryError: Java heap space after ~59 minutes. Root cause: WeakRefWaveAntagonist.waveLoop() calls System.gc() in a tight loop with no pause between waves. Each forced full GC is a stop-the-world pause during which ThreadChurnAntagonist (64 threads/5 ms), DumpStormAntagonist (96 threads), and AllocStormAntagonist keep accumulating objects. Over time this degrades into GC thrashing and exhausts the 2 GB heap.
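One way to see the GC thrashing described above is to compare accumulated collector time for a back-to-back System.gc() loop against a paced one. The following is a minimal sketch, not harness code: the class GcOverheadProbe and its methods are hypothetical names, and the wave sizes are arbitrary. It relies only on the standard GarbageCollectorMXBean API.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical probe (not part of the chaos harness): reports how much
// collector time a loop of forced GCs accumulates.
public class GcOverheadProbe {

    // Sums accumulated collection time (ms) over all collectors.
    // Collectors that report -1 (undefined) are skipped.
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime();
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    // Runs `waves` forced collections, sleeping `pauseMillis` after each one,
    // and returns the collector time (ms) accumulated while doing so.
    static long gcMillisDuring(int waves, long pauseMillis) {
        long before = totalGcMillis();
        for (int i = 0; i < waves; i++) {
            // Short-lived garbage so each wave has something to collect.
            byte[][] garbage = new byte[256][];
            for (int j = 0; j < garbage.length; j++) {
                garbage[j] = new byte[4096];
            }
            System.gc();
            if (pauseMillis > 0) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    break;
                }
            }
        }
        return totalGcMillis() - before;
    }

    public static void main(String[] args) {
        long t0 = System.nanoTime();
        long tightGc = gcMillisDuring(20, 0);
        long tightWall = (System.nanoTime() - t0) / 1_000_000;

        long t1 = System.nanoTime();
        long pacedGc = gcMillisDuring(20, 50);
        long pacedWall = (System.nanoTime() - t1) / 1_000_000;

        System.out.printf("tight: %d ms GC over %d ms wall%n", tightGc, tightWall);
        System.out.printf("paced: %d ms GC over %d ms wall%n", pacedGc, pacedWall);
    }
}
```

The absolute numbers depend on the collector and heap size, and System.gc() is only a request that the JVM may ignore (for example under -XX:+DisableExplicitGC), so the output should be read as a rough ratio rather than a measurement.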

Two secondary contributors also ran with no inter-iteration sleep: ClassLoaderChurnAntagonist and HiddenClassChurnAntagonist both define unique classes in tight loops. ClassLoaders are only reclaimed by old-gen collection, so they can accumulate faster than the GC reclaims them.
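The class-definition churn described above can be illustrated with a small sketch. It is not the code of ClassLoaderChurnAntagonist, which is not shown on this page: LoaderChurnSketch, OneShotLoader, and the generated class names are hypothetical, and the hand-written class file is the smallest valid one (no fields, no methods). Each iteration defines one unique class in a fresh loader and then sleeps, which is where a per-iteration pause would go.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Hypothetical sketch: one throwaway ClassLoader per iteration, each
// defining a unique empty class, with a pause between iterations.
public class LoaderChurnSketch {

    // Builds a minimal class file: public class <name> extends Object,
    // with no fields, methods, or attributes.
    static byte[] emptyClassBytes(String internalName) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(buf);
            out.writeInt(0xCAFEBABE);        // magic
            out.writeShort(0);               // minor version
            out.writeShort(52);              // major version (Java 8)
            out.writeShort(5);               // constant pool count = entries + 1
            out.writeByte(1);                // #1 Utf8: this class name
            out.writeUTF(internalName);
            out.writeByte(7);                // #2 Class -> #1
            out.writeShort(1);
            out.writeByte(1);                // #3 Utf8: superclass name
            out.writeUTF("java/lang/Object");
            out.writeByte(7);                // #4 Class -> #3
            out.writeShort(3);
            out.writeShort(0x0021);          // ACC_PUBLIC | ACC_SUPER
            out.writeShort(2);               // this_class
            out.writeShort(4);               // super_class
            out.writeShort(0);               // interfaces
            out.writeShort(0);               // fields
            out.writeShort(0);               // methods
            out.writeShort(0);               // attributes
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Loader that exposes defineClass; the null parent delegates to bootstrap.
    static final class OneShotLoader extends ClassLoader {
        OneShotLoader() {
            super(null);
        }

        Class<?> define(String name, byte[] bytes) {
            return defineClass(name, bytes, 0, bytes.length);
        }
    }

    // Defines `iterations` unique classes, sleeping `pauseMillis` after each.
    // Returns how many classes were defined.
    static int churn(int iterations, long pauseMillis) {
        int defined = 0;
        for (int i = 0; i < iterations; i++) {
            String name = "Churn" + i;
            Class<?> c = new OneShotLoader().define(name, emptyClassBytes(name));
            if (c.getName().equals(name)) {
                defined++;
            }
            // The loader is unreachable from here on, but its metadata is only
            // freed once a collection that performs class unloading runs.
            try {
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return defined;
    }

    public static void main(String[] args) {
        System.out.println("classes defined: " + churn(100, 1));
    }
}
```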

Changes:

  • WeakRefWaveAntagonist: 200ms sleep after each wave — gives G1's concurrent markers time to work between forced full GCs
  • ClassLoaderChurnAntagonist: 1ms sleep per iteration — enough for old-gen to keep up with ClassLoader churn
  • HiddenClassChurnAntagonist: 1ms sleep per iteration — same reason
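As an illustration of the first change listed above, a paced wave loop might look roughly like the sketch below. This is an assumption about the shape of the loop, not the actual WeakRefWaveAntagonist source: PacedWaveLoop, runWaves, and the wave size are hypothetical, and the pause is a parameter so the same loop can be run with 200 ms or with a shorter value in a quick check.

```java
import java.lang.ref.WeakReference;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch of a rate-limited weak-reference wave loop.
public class PacedWaveLoop {

    // Runs up to `maxWaves` waves while `running` is true.
    // Each wave allocates weakly referenced garbage, requests a GC,
    // then sleeps `pauseMillis` before the next wave.
    // Returns the number of completed waves.
    static int runWaves(int maxWaves, long pauseMillis, AtomicBoolean running) {
        int waves = 0;
        while (running.get() && waves < maxWaves) {
            List<WeakReference<byte[]>> refs = new ArrayList<>();
            for (int i = 0; i < 1_000; i++) {
                refs.add(new WeakReference<>(new byte[1024]));
            }
            System.gc(); // the forced collection is kept, as in the PR
            waves++;
            try {
                // The pause is the fix: without it the next System.gc()
                // follows immediately.
                Thread.sleep(pauseMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return waves;
    }

    public static void main(String[] args) {
        int waves = runWaves(5, 200, new AtomicBoolean(true));
        System.out.println("completed waves: " + waves);
    }
}
```

Restoring the interrupt flag and leaving the loop on InterruptedException keeps the sleep from delaying harness shutdown.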

The System.gc() call is preserved; only the pathological back-to-back cadence is fixed.

Additional Notes:
No profiler code changed. This is purely a chaos harness fix.

How to test the change?:
The scheduled reliability/chaos CI pipeline will exercise this. The harness should now complete the full duration (RUNTIME seconds) without OOM.
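For a quicker local signal than the full scheduled run, one option is to sample heap usage while a workload runs for a bounded duration. The sketch below is an assumption-laden stand-in: HeapPeakSampler and its parameters are hypothetical, durationMillis merely plays the role of the harness's RUNTIME setting (how the harness reads RUNTIME is not shown here), and the workload is any Runnable.

```java
// Hypothetical helper: tracks peak used heap while a workload runs
// repeatedly until a deadline.
public class HeapPeakSampler {

    // Currently used heap in bytes, as seen by Runtime.
    static long usedHeapBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    // Runs `workload` repeatedly for about `durationMillis`, sampling used
    // heap after each run and sleeping `sampleEveryMillis` between runs.
    // Returns the highest sample observed.
    static long peakUsedDuring(long durationMillis, long sampleEveryMillis, Runnable workload) {
        long deadline = System.nanoTime() + durationMillis * 1_000_000L;
        long peak = usedHeapBytes();
        // Difference-based comparison stays correct if nanoTime wraps.
        while (System.nanoTime() - deadline < 0) {
            workload.run();
            peak = Math.max(peak, usedHeapBytes());
            try {
                Thread.sleep(sampleEveryMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        long peak = peakUsedDuring(2_000, 50, () -> {
            byte[] scratch = new byte[1 << 20]; // 1 MiB of short-lived garbage
        });
        long max = Runtime.getRuntime().maxMemory();
        System.out.printf("peak used heap: %d MiB of %d MiB max%n", peak >> 20, max >> 20);
    }
}
```

A peak that keeps climbing toward the maximum across successive runs would suggest the workload is still outpacing the collector; a flat peak is consistent with the fix but does not replace the long-duration CI run.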

For Datadog employees:

  • This PR doesn't touch any of that.
  • JIRA: N/A — no ticket, direct chaos harness fix

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@jbachorik jbachorik added the AI label Jun 18, 2026
@dd-octo-sts

dd-octo-sts Bot commented Jun 18, 2026

Contributor

CI Test Results

Run: #27750860542 | Commit: e37755c | Duration: 12m 59s (longest job)

All 32 test jobs passed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Summary: Total: 32 | Passed: 32 | Failed: 0


Updated: 2026-06-18 10:00:54 UTC

@datadog-datadog-prod-us1-2

datadog-datadog-prod-us1-2 Bot commented Jun 18, 2026

Pipelines

⚠️ Warnings

🚦 14 Pipeline jobs failed

DataDog/java-profiler | benchmarks-candidate-aarch64: [alloc]

DataDog/java-profiler | benchmarks-candidate-aarch64: [cpu,wall,alloc,memleak]

DataDog/java-profiler | benchmarks-candidate-aarch64: [cpu,wall]

View all 14 failed jobs.

🔗 Commit SHA: a83ec01

@jbachorik jbachorik marked this pull request as ready for review June 18, 2026 16:25
@jbachorik jbachorik requested a review from a team as a code owner June 18, 2026 16:25

@rkennke rkennke left a comment

Contributor

Yup makes sense, and looks good!

@jbachorik

Collaborator Author

@rkennke thanks for the review!

@jbachorik jbachorik merged commit 503b37c into main Jun 18, 2026
105 checks passed
@jbachorik jbachorik deleted the jb/chaos-oom-fix branch June 18, 2026 17:05
@github-actions github-actions Bot added this to the 1.45.0 milestone Jun 18, 2026