
Detailed, structured notes from the Elasticsearch configuration tuning discussion. Every key technical and operational point from the transcript is captured below.


🧠 Overview

The team discussed tuning Elasticsearch configurations to improve cluster stability, heap usage, and garbage collection efficiency, while minimizing node unresponsiveness and search rejections. Key parameters considered include heap usage limits, circuit breakers, field data cache, request/total breaker limits, refresh interval, and max buckets.


⚙️ Problem Summary

  • Elasticsearch nodes were becoming unresponsive due to high heap usage and inefficient garbage collection (GC).

  • Circuit breakers triggered too late (at 95%), leading to cascading node failures and cluster rebalancing overload.

  • The goal was to reduce the breaker thresholds and cache limits to allow early recovery and prevent OOM (Out of Memory) errors.


🧩 Key Discussion Points

1. Heap Usage & Circuit Breakers

  • Issue: Nodes hit 95% heap usage → GC fails → nodes removed from cluster → cascading instability.

  • Proposal: Lower circuit breaker limit from 95% → 90% or 85%.

    • This allows GC more time and avoids node eviction.
  • Concern: Lowering breaker may increase search rejections; must monitor the impact.

✅ Action Plan

  • Begin with field data limit reduction, then adjust breaker limits.

  • Monitor heap usage and search rejections as breakers are tuned.

  • Typical healthy heap utilization: 70–80%, though 90% is common in high-cache clusters.
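The first breaker change can be sketched as a cluster settings update. This is a minimal illustration, assuming the parent breaker is `indices.breaker.total.limit` (the standard dynamic cluster setting, default 95%); the payload shown is what would be sent with `PUT _cluster/settings`:

```python
import json

# Lower the parent (total) circuit breaker from the 95% default to 90%
# as a first step. Because the setting is dynamic, no restart is needed.
# "persistent" survives full-cluster restarts; "transient" would not.
payload = {
    "persistent": {
        "indices.breaker.total.limit": "90%"
    }
}

body = json.dumps(payload, indent=2)
print(body)
```

If 90% holds stable under monitoring, the same payload with `"85%"` is the follow-up step.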


2. Field Data Cache (Memory-Intensive)

  • Current: 40% of heap.

  • Proposal: Lower to 30–35%.

    • Reducing cache will lower heap usage but may increase aggregation query latency.

    • Cache is LRU (Least Recently Used), persists until eviction or restart.

  • Dependency: If field data cache limit decreases, total breaker limit should also reduce proportionally.

✅ Plan

| Parameter | Old | New (Proposed) | Expected Effect |
| --- | --- | --- | --- |
| Field Data Cache | 40% | 30–35% | Lower heap usage, slightly higher latency |
| Breaker Total Limit | 95% | 90% → 85% | Fails earlier, less OOM risk |
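The dependency noted above (field data limit must shrink along with the total limit) can be enforced as an invariant before applying the change. A sketch, assuming the 40% figure refers to the field data circuit breaker (`indices.breaker.fielddata.limit`, dynamically updatable and 40% by default); note that the fielddata cache size itself (`indices.fielddata.cache.size`) is a static node setting and would need a restart:

```python
# Proposed breaker limits as percentages of heap. The field data breaker
# must stay strictly below the total (parent) breaker, so check that
# invariant before emitting the settings body.
proposed = {
    "indices.breaker.fielddata.limit": 35,  # down from 40
    "indices.breaker.total.limit": 90,      # down from 95
}

assert proposed["indices.breaker.fielddata.limit"] < proposed["indices.breaker.total.limit"], \
    "field data breaker must remain below the parent breaker"

settings_body = {
    "persistent": {k: f"{v}%" for k, v in proposed.items()}
}
```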

3. Request Breaker (Aggregation Queries)

  • Current: 60% of heap.

  • Proposal: Reduce stepwise → 50% → 40%.

  • Rationale: Heavy aggregation queries trigger request breaker; tuning prevents OOM.

  • Should follow after tuning field data and total breaker limits.

✅ Plan

| Parameter | Old | New | Approach |
| --- | --- | --- | --- |
| Request Breaker Limit | 60% | 50% → 40% | Reduce gradually, monitor search rejections |
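The stepwise reduction can be expressed as a small helper that yields one settings body per step, so each change is applied and observed before the next. The helper name and structure are illustrative; `indices.breaker.request.limit` is the standard dynamic setting:

```python
def stepdown(setting, steps):
    """Yield one _cluster/settings body per step, so each change can be
    applied and monitored before moving on (e.g. 60% -> 50% -> 40%)."""
    for pct in steps:
        yield {"persistent": {setting: f"{pct}%"}}

plan = list(stepdown("indices.breaker.request.limit", [50, 40]))
```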

4. Accounting Limit Breaker

  • Current: 100% (default).

  • Proposal: Lower to 90% → 85%.

  • Default usually works, but safer to reduce in clusters with memory pressure.

  • Needs documentation check before changes.


5. Search Max Buckets

  • Default: 100,000.

  • Elasticsearch Recommendation: 10,000.

  • Proposal: Step down to 20,000 as middle ground.

    • Reduces risk of OOM from oversized aggregation queries.

    • Minimizes customer impact (some will complain, but fewer than at 10,000).

  • Gradual rollback plan: 20k → 25k → 30k if customer feedback requires it.
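The middle-ground target and the rollback ladder above can be captured together, so the fallback values travel with the change record. A sketch; `search.max_buckets` is a dynamic cluster setting and takes an integer, not a percentage:

```python
# Start at the 20,000 middle ground. The ladder holds the agreed
# fallback values (25k, then 30k) if customer feedback forces a
# partial rollback.
target = 20_000
rollback_ladder = [25_000, 30_000]

body = {"persistent": {"search.max_buckets": target}}
```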


6. Default Search Timeout

  • Currently: Unlimited (-1).

  • Proposal: Set to 1 minute (60 seconds).

    • Queries exceeding 10 seconds are already heavy; 30–60 seconds is acceptable.

    • Prevents runaway queries from consuming excessive heap.

  • Trade-off: Long-running searches may fail earlier but cluster health improves.
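The timeout change is a one-line dynamic setting. A sketch of the payload; `search.default_search_timeout` accepts a time value, and `-1` (the current state) means unlimited:

```python
# Cap all searches at 60 seconds by default. Individual requests can
# still override this with their own ?timeout= parameter.
body = {"persistent": {"search.default_search_timeout": "60s"}}
```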


7. Refresh Interval

  • Current: 30–45 seconds (varies by realm).

  • Proposal: Standardize to 60 seconds across all realms.

    • Ingestion-first architecture (Cassandra as source of truth) makes higher refresh intervals acceptable.

    • Helps reduce CPU thrash from frequent segment merges.

  • Risk: Higher interval keeps unflushed data in heap longer → potential memory use increase.

  • Decision: 60 seconds safe for clusters with moderate ingestion (~2M docs/min, <1KB per doc).

✅ Realm-Specific Notes

| Realm | Current | Proposed | Status |
| --- | --- | --- | --- |
| US-0 | 60s | Maintain | Working fine |
| US-1 | 30s–45s | Increase to 60s | For stability |
| EU-0 | 30s | Increase to 60s | Align across realms |

  • Customer Impact: Slight delay (<1 min) in search visibility. Acceptable per user feedback (e.g., Atlassian won’t notice; Splunk SEC expects ≤1 min).
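Unlike the breaker settings, the refresh interval is an index-level dynamic setting, applied per index (or via an index pattern) with `PUT <index>/_settings` rather than the cluster settings API. A sketch; the `events-*` pattern is hypothetical:

```python
# Standardize refresh to 60 seconds. index.refresh_interval is dynamic,
# so it can be changed live per index or across a pattern, e.g.:
#   PUT /events-*/_settings  with refresh_body as the JSON payload
refresh_body = {"index": {"refresh_interval": "60s"}}
```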

8. Cluster Scaling and Node Pool

  • US-1 (ES7): Currently rebalancing after scale-out (~30% increase in nodes).

  • Heap usage dropped below 70% post-scale.

  • Node Pool Migration: Proposal to migrate Elasticsearch pods to dedicated node pools.

    • Ensures no noisy neighbors and dedicated resources.

    • Helps with vertical scaling.

    • Decision pending based on migration complexity and AWS node availability.


🧮 Operational Plan & Execution Order

  1. Adjust Field Data Limit → reduce to 30–35%.

  2. Lower Total Circuit Breaker Limit → 90%, monitor, then 85% if stable.

  3. Reduce Request Breaker Limit → 50% → 40%.

  4. Tweak Accounting Limit → 90%, possibly 85%.

  5. Lower Search Max Buckets → 20,000.

  6. Set Default Search Timeout → 60 seconds.

  7. Standardize Refresh Interval → 60 seconds.

  8. Monitor Metrics:

    • Heap usage

    • GC frequency

    • Search rejections

    • Breaker trips

    • Query latency

Execution will be incremental — one setting at a time — with close observation between steps.
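The monitoring step can be sketched as a check over a `GET _nodes/stats` response, flagging nodes with high heap or tripped breakers. The `sample` dict below is a hand-written fragment for illustration, not real cluster output; the field paths (`jvm.mem.heap_used_percent`, `breakers.parent.tripped`) are the standard ones in the node stats response:

```python
# Flag nodes whose heap exceeds a threshold or whose parent breaker
# has tripped since the last check.
sample = {
    "nodes": {
        "node-1": {
            "jvm": {"mem": {"heap_used_percent": 91}},
            "breakers": {"parent": {"tripped": 3}},
        },
        "node-2": {
            "jvm": {"mem": {"heap_used_percent": 68}},
            "breakers": {"parent": {"tripped": 0}},
        },
    }
}

def flag_nodes(stats, heap_threshold=85):
    flagged = []
    for name, node in stats["nodes"].items():
        heap = node["jvm"]["mem"]["heap_used_percent"]
        trips = node["breakers"]["parent"]["tripped"]
        if heap > heap_threshold or trips > 0:
            flagged.append((name, heap, trips))
    return flagged

print(flag_nodes(sample))
```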


🕒 Implementation Schedule

  • Execution Start: Tomorrow, 9 AM Pacific.

  • Responsible: Jerry (implementation), Venkat & Lydia (standby + monitoring).

  • Communication: Jerry to announce start and each change in Slack.

  • Rollback: Manual reversal of dynamic settings; before/after diffs recorded in spreadsheet.

  • Monitoring Tools: Heap charts, circuit breaker logs, GC metrics.
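The rollback bookkeeping (before/after diffs recorded in the spreadsheet) can be sketched as a snapshot comparison. The `before`/`after` dicts here are illustrative; in practice the values would come from `GET _cluster/settings?flat_settings=true` taken around each change:

```python
# Snapshot the flat settings before and after a change and record only
# the keys that differ -- these become the spreadsheet rows and the
# values needed for a manual reversion.
before = {"indices.breaker.total.limit": "95%"}
after = {"indices.breaker.total.limit": "90%"}

diff = {
    key: (before.get(key), after.get(key))
    for key in set(before) | set(after)
    if before.get(key) != after.get(key)
}
```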


🧾 Other Notes

  • Dynamic changes: All parameters can be updated live (no restart needed).

  • Rollback Strategy: Manual reversion using captured diffs.

  • Documentation Needed: Update spreadsheet with each change, rationale, and observed impact.

  • Stability Goal: Reduce OOM, improve node responsiveness, and maintain consistent refresh latency across clusters.


✅ Final Agreed Configuration Targets

| Setting | Current | Target | Notes |
| --- | --- | --- | --- |
| Field Data Cache | 40% | 30–35% | Reduces heap pressure |
| Request Breaker | 60% | 40–50% | Prevents heavy query overload |
| Total Breaker | 95% | 90% → 85% | Allows GC to recover |
| Accounting Limit | 100% | 90–85% | Optional; safety margin |
| Search Max Buckets | 100k | 20k | Avoids oversized aggregations |
| Search Timeout | Unlimited | 60s | Prevents runaway queries |
| Refresh Interval | 30–45s | 60s | Balances ingestion vs. CPU |
| Cluster Heap | <70% | Maintain | Stable after scale-out |
