Detailed, structured notes from the Elasticsearch configuration tuning discussion; the key technical and operational points from the transcript are captured below.
The team discussed tuning Elasticsearch configurations to improve cluster stability, heap usage, and garbage-collection efficiency while minimizing node unresponsiveness and search rejections. Key parameters considered: field data cache limit, the request/total/accounting circuit breaker limits, search max buckets, search timeout, and refresh interval.
- Elasticsearch nodes were becoming unresponsive due to high heap usage and inefficient garbage collection (GC).
- Circuit breakers triggered too late (at 95% of heap), leading to cascading node failures and cluster rebalancing overload.
- The goal was to reduce the breaker thresholds and cache limits to allow early recovery and prevent OOM (Out of Memory) errors.
- Issue: Nodes hit 95% heap usage → GC fails to reclaim memory → nodes are removed from the cluster → cascading instability.
- Proposal: Lower the total circuit breaker limit from 95% → 90% or 85%.
  - This gives GC more time to reclaim memory and avoids node eviction.
- Concern: Lowering the breaker may increase search rejections; the impact must be monitored.
- Sequencing: Begin with the field data limit reduction, then adjust the breaker limits.
- Monitor heap usage and search rejections as the breakers are tuned (a quick heap check is sketched below).
- Typical healthy heap utilization is 70–80%, though 90% is common in cache-heavy clusters.
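For a quick read on where each node sits relative to those thresholds, the standard cat nodes API reports per-node heap utilization:

```
GET _cat/nodes?v&h=name,heap.percent,heap.max,ram.percent
```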
Field data cache:
- Current limit: 40% of heap.
- Proposal: Lower to 30–35% (see the settings sketch after the table).
- Reducing the cache limit lowers heap usage but may increase aggregation query latency.
- The cache is LRU (Least Recently Used); entries persist until eviction or restart.
- Dependency: If the field data cache limit decreases, the total breaker limit should be reduced proportionally as well.
| Parameter | Old | New (Proposed) | Expected Effect |
|---|---|---|---|
| Field Data Cache | 40% | 30–35% | Lower heap usage; slightly higher aggregation latency |
| Breaker Total Limit | 95% | 90% → 85% | Trips earlier; less OOM risk |
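A minimal sketch of the corresponding dynamic update, assuming the limits discussed map to the `indices.breaker.fielddata.limit` and `indices.breaker.total.limit` cluster settings (their usual defaults, 40% and 95%, match the table):

```
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.fielddata.limit": "35%",
    "indices.breaker.total.limit": "90%"
  }
}
```

`persistent` survives restarts; any of these settings can later be reset to its default by sending `null` as the value.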
Request breaker:
- Current: 60% of heap.
- Proposal: Reduce stepwise → 50% → 40%.
- Rationale: Heavy aggregation queries trip the request breaker; tuning it prevents OOM.
- Should be tuned after the field data and total breaker limits.
| Parameter | Old | New | Approach |
|---|---|---|---|
| Request Breaker Limit | 60% | 50% → 40% | Reduce gradually, monitor search rejections |
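The same pattern applies if the request breaker here is `indices.breaker.request.limit` (default 60%, matching the table); the second step down to 40% would follow once rejections look stable:

```
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.request.limit": "50%"
  }
}
```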
Accounting breaker:
- Current: 100% (default).
- Proposal: Lower to 90% → 85%.
- The default usually works, but reducing it is safer in clusters under memory pressure.
- Needs a documentation check before any change is made.
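If the knob in question is `indices.breaker.accounting.limit` (dynamic in ES 7, default 100%; the documentation check above should confirm this is the right setting), the change would look like:

```
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.accounting.limit": "90%"
  }
}
```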
Search max buckets:
- Current: 100,000.
- Elasticsearch recommendation: 10,000.
- Proposal: Step down to 20,000 as a middle ground.
- Reduces the risk of OOM from oversized aggregation queries.
- Minimizes customer impact (some will complain, but fewer than at 10,000).
- Gradual rollback plan: 20k → 25k → 30k if customer feedback requires it.
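`search.max_buckets` is a dynamic cluster-wide setting, so both the step-down and any later rollback to 25k/30k can be applied live:

```
PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": 20000
  }
}
```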
Default search timeout:
- Current: Unlimited (-1).
- Proposal: Set to 1 minute (60 seconds).
- Queries exceeding 10 seconds are already heavy; a 30–60 second ceiling is acceptable.
- Prevents runaway queries from consuming excessive heap.
- Trade-off: Long-running searches may fail earlier, but cluster health improves.
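Assuming this refers to the cluster-wide `search.default_search_timeout` setting (whose default of -1 means no timeout, matching the current state):

```
PUT _cluster/settings
{
  "persistent": {
    "search.default_search_timeout": "60s"
  }
}
```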
Refresh interval:
- Current: 30–45 seconds (varies by realm).
- Proposal: Standardize to 60 seconds across all realms.
- The ingestion-first architecture (Cassandra as the source of truth) makes higher refresh intervals acceptable.
- Helps reduce CPU thrash from frequent segment merges.
- Risk: A higher interval keeps unflushed data in heap longer → potential increase in memory use.
- Decision: 60 seconds is safe for clusters with moderate ingestion (~2M docs/min, <1KB per doc).
| Realm | Current | Proposed | Status |
|---|---|---|---|
| US-0 | 60s | Maintain | Working fine |
| US-1 | 30s–45s | Increase to 60s | For stability |
| EU-0 | 30s | Increase to 60s | Align across realms |
- Customer Impact: Slight delay (<1 min) in search visibility. Acceptable per user feedback (e.g., Atlassian won’t notice; Splunk SEC expects ≤1 min).
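Refresh interval is a dynamic index-level setting, applied per index or via a wildcard; the `logs-*` pattern below is a hypothetical placeholder for the affected indices, and new indices would pick up the same value through their index templates:

```
PUT logs-*/_settings
{
  "index": {
    "refresh_interval": "60s"
  }
}
```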
Cluster status:
- US-1 (ES7): Currently rebalancing after a scale-out (~30% increase in node count).
- Heap usage dropped below 70% post-scale.
- Node pool migration: Proposal to migrate Elasticsearch pods to dedicated node pools.
  - Ensures no noisy neighbors and dedicated resources.
  - Helps with vertical scaling.
  - Decision pending, based on migration complexity and AWS node availability.
Action plan (in order):
- Adjust Field Data Limit → reduce to 30–35%.
- Lower Total Circuit Breaker Limit → 90%; monitor, then 85% if stable.
- Reduce Request Breaker Limit → 50% → 40%.
- Tweak Accounting Limit → 90%, possibly 85%.
- Lower Search Max Buckets → 20,000.
- Set Default Search Timeout → 60 seconds.
- Standardize Refresh Interval → 60 seconds.
- Monitor Metrics (sample queries after this list):
  - Heap usage
  - GC frequency
  - Search rejections
  - Breaker trips
  - Query latency
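A few built-in endpoints cover most of these; query latency would come from the team's existing dashboards:

```
GET _cat/nodes?v&h=name,heap.percent
GET _nodes/stats/jvm
GET _cat/thread_pool/search?v&h=node_name,rejected
GET _nodes/stats/breaker
```

These return per-node heap usage, GC counts and pause times, search thread-pool rejections, and breaker trip counts, respectively.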
- Execution will be incremental, one setting at a time, with close observation between steps.
- Execution Start: Tomorrow, 9 AM Pacific.
- Responsible: Jerry (implementation), Venkat & Lydia (standby + monitoring).
- Communication: Jerry to announce the start and each change in Slack.
- Rollback: Manual reversal of dynamic settings using captured before/after diffs, recorded in the spreadsheet (see the sketch after this list).
- Monitoring Tools: Heap charts, circuit breaker logs, GC metrics.
- Dynamic Changes: All parameters can be updated live (no restart needed).
- Documentation Needed: Update the spreadsheet with each change, its rationale, and the observed impact.
- Stability Goal: Reduce OOMs, improve node responsiveness, and maintain consistent refresh latency across clusters.
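Since every change here is a dynamic setting, the captured diff is all that rollback requires: record the "before" state with `GET _cluster/settings`, then re-apply the previous value, or reset a setting to its default by sending `null` (standard Elasticsearch behavior):

```
PUT _cluster/settings
{
  "persistent": {
    "indices.breaker.total.limit": null
  }
}
```

Summary of proposed changes: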
| Setting | Current | Target | Notes |
|---|---|---|---|
| Field Data Cache | 40% | 30–35% | Reduces heap pressure |
| Request Breaker | 60% | 40–50% | Prevents heavy query overload |
| Total Breaker | 95% | 90% → 85% | Allows GC to recover |
| Accounting Limit | 100% | 90–85% | Optional; safety margin |
| Search Max Buckets | 100k | 20k | Avoid oversized aggregation |
| Search Timeout | Unlimited | 60s | Prevent runaway queries |
| Refresh Interval | 30–45s | 60s | Balance ingestion vs. CPU |
| Cluster Health | Heap <70% | Maintain | Stable after scale-out |