Improve resource monitor errors#5129
Improve resource monitor errors#5129Swiddis wants to merge 6 commits intoopensearch-project:mainfrom
Conversation
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughSummary by CodeRabbit
WalkthroughReplace boolean health checks with a structured ResourceStatus model: add ResourceStatus/ResourceType, add getStatus()/isHealthyImpl() hooks to ResourceMonitor, update monitors and execution/enumerator code to return/consume ResourceStatus, add memory-specific status methods, and update tests to assert enriched status descriptions and formatted messages. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Executor as Query Executor
participant Monitor as ResourceMonitor
participant Memory as MemoryMonitor
participant Status as ResourceStatus
Executor->>Monitor: getStatus()
Monitor->>Memory: getMemoryStatus(limitBytes)
Memory-->>Monitor: ResourceStatus (MEMORY, current, limit, desc)
Monitor-->>Executor: ResourceStatus
Executor->>Status: status.isHealthy()
Status-->>Executor: boolean
alt Unhealthy
Executor->>Status: status.getFormattedDescription()
Status-->>Executor: formatted message
Executor->>Executor: throw exception with formatted message (includes memory-limit guidance / rows processed)
else Healthy
Executor->>Executor: continue execution
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing touches
🧪 Generate unit tests (beta)
⚔️ Resolve merge conflicts (beta)
No actionable comments were generated in the recent review. 🎉 Comment |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
core/src/main/java/org/opensearch/sql/monitor/ResourceMonitor.java (1)
13-38:⚠️ Potential issue | 🟠 MajorAdd required
@throwstags to public method JavaDoc.
isHealthy()andgetStatus()can throwUnsupportedOperationExceptionviaisHealthyImpl(), but their JavaDoc lacks@throws.📌 Proposed JavaDoc fix
/** * Is the resource healthy. * * `@return` true for healthy, otherwise false. + * `@throws` UnsupportedOperationException if the subclass doesn't override getStatus() or + * isHealthyImpl() */ public boolean isHealthy() { return getStatus().isHealthy(); } @@ /** * Get detailed resource status including context for error messages. Subclasses should override * this method to provide rich status information. * * `@return` ResourceStatus with health state and detailed context + * `@throws` UnsupportedOperationException if the subclass doesn't override getStatus() or + * isHealthyImpl() */ public ResourceStatus getStatus() {As per coding guidelines, Public methods MUST have JavaDoc with
@param,@return, and@throws.
🤖 Fix all issues with AI agents
In `@core/src/main/java/org/opensearch/sql/monitor/ResourceMonitor.java`:
- Around line 28-52: The inline comment for ResourceMonitor.isHealthyImpl() is
incorrect; update the comment to state that the default implementation throws
UnsupportedOperationException and that subclasses must override either
getStatus() or isHealthyImpl() (reference ResourceMonitor, isHealthyImpl(),
getStatus()); then add a unit test in the core module that creates a minimal
anonymous subclass of ResourceMonitor which does not override getStatus() or
isHealthyImpl() and asserts that calling isHealthyImpl() (or invoking the
default health path) results in UnsupportedOperationException to document the
exception path (ensure the test references OpenSearchResourceMonitor and
AlwaysHealthyMonitor only as existing overrides for context).
In `@core/src/main/java/org/opensearch/sql/monitor/ResourceStatus.java`:
- Around line 84-90: In getFormattedDescription, guard against maxLimit <= 0
before computing percentage: check maxLimit and only compute and include the
usage percentage when maxLimit > 0; otherwise return a string that includes
description, formatBytes(currentUsage) and formatBytes(maxLimit) but omits the
percentage. Update the conditional that currently checks currentUsage != null &&
maxLimit != null to handle maxLimit <= 0 and use
formatBytes(currentUsage)/formatBytes(maxLimit) accordingly so you don't produce
Infinity/NaN in the output.
- Around line 18-92: Add unit tests for the new ResourceStatus class: cover the
static factory methods ResourceStatus.healthy(ResourceType) and
ResourceStatus.unhealthy(ResourceType,long,long,String) for each ResourceType
enum value, assert the healthy flag, type, description, and metric fields; test
getFormattedDescription() both when currentUsage/maxLimit are set (verify
formatted string contains formatted bytes and percentage) and when they are null
(returns description); add boundary byte-formatting tests exercising formatBytes
via getFormattedDescription() for values <1KB (e.g., 0 and 512), exactly 1KB,
1MB, 1GB, and edge cases like currentUsage==0 and currentUsage==maxLimit to
ensure correct formatting and percentage calculations. Ensure tests live
alongside the class tests and use clear assertions for fields and formatted
output.
1. Fixed contradictory comment in ResourceMonitor.isHealthyImpl() that claimed to delegate to old behavior but actually throws exception 2. Added unit test for ResourceMonitor default exception path 3. Created comprehensive unit tests for ResourceStatus class covering: - Factory methods (healthy/unhealthy) for all ResourceType values - getFormattedDescription() with and without metrics - Byte formatting boundary conditions (<1KB, 1KB, 1MB, 1GB) - Edge cases (zero usage, usage equals limit, invalid limit) 4. Fixed getFormattedDescription() to handle maxLimit <= 0 gracefully by treating it as 0 and omitting percentage calculation Co-Authored-By: Claude <noreply@anthropic.com> Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
c416116 to
aa89c9d
Compare
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Fix all issues with AI agents
In `@core/src/main/java/org/opensearch/sql/monitor/ResourceMonitor.java`:
- Around line 22-39: Update the JavaDoc for getStatus() to document that calls
to isHealthyImpl() may propagate an UnsupportedOperationException: add an
`@throws` UnsupportedOperationException description to the existing comment for
getStatus(), referencing that the default implementation may throw this when
isHealthyImpl() is not supported; keep the existing `@return` text and general
description intact and ensure the JavaDoc follows the project guideline for
public methods.
- Around line 18-20: Update the JavaDoc for the public method isHealthy() to
include an `@throws` entry documenting that it can throw
UnsupportedOperationException (when neither getStatus() nor isHealthyImpl() is
overridden), e.g., add a brief sentence under `@throws` naming
UnsupportedOperationException and the condition; ensure the existing `@return`
remains and method summary stays intact for isHealthy().
In `@core/src/test/java/org/opensearch/sql/monitor/ResourceStatusTest.java`:
- Around line 18-70: The tests for ResourceStatus factory methods are missing
null-input coverage; add unit tests that call ResourceStatus.healthy(null) and
ResourceStatus.unhealthy(null, ...) and also pass null descriptions to unhealthy
(e.g., ResourceStatus.unhealthy(ResourceType.MEMORY, 1, 2, null)) and assert the
expected contract (either assertThrows for NPE/IllegalArgumentException or
assert fallback behavior) to match ResourceStatus' defined behavior; include
checks for both null ResourceType and null description for each factory method
(healthy and unhealthy) and use the existing test structure (ResourceStatusTest)
to locate ResourceStatus.healthy and ResourceStatus.unhealthy.
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
opensearch/src/main/java/org/opensearch/sql/opensearch/monitor/OpenSearchResourceMonitor.java
Outdated
Show resolved
Hide resolved
...arch/src/main/java/org/opensearch/sql/opensearch/executor/protector/ResourceMonitorPlan.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In
`@opensearch/src/test/java/org/opensearch/sql/opensearch/monitor/OpenSearchResourceMonitorTest.java`:
- Around line 82-144: Rename the JUnit test methods in
OpenSearchResourceMonitorTest to follow the test<Functionality><Condition>
pattern: change getStatusHealthy to testGetStatusHealthy,
getStatusUnhealthyWithRetry to testGetStatusUnhealthyWithRetry, and
getStatusBecomesHealthyAfterRetry to testGetStatusBecomesHealthyAfterRetry;
update the method names for the three tests (they are the `@Test` methods that
instantiate OpenSearchResourceMonitor and call getStatus()) and run tests to
ensure no other references break.
- Around line 82-144: Add unit tests for OpenSearchResourceMonitor.getStatus to
cover null/boundary/error conditions: (1) stub settings.getSettingValue(...) to
return null and assert getStatus returns a valid ResourceStatus (no uncaught
exception) and verify expected calls to memoryMonitor.getMemoryStatus; (2) add
boundary tests where the memory limit is 0 and where usage == limit to assert
healthy/unhealthy per existing logic and verify call counts; (3) stub
memoryMonitor.getMemoryStatus(...) to throw an exception and assert getStatus
handles it (doesn't propagate) and verify retry behavior (calls to
memoryMonitor.getMemoryStatus and how many retries are attempted). Use the same
test structure as existing tests, Mockito stubbing/thenThrow for exceptions, and
reference OpenSearchResourceMonitor, settings.getSettingValue, and
memoryMonitor.getMemoryStatus when adding test methods.
🧹 Nitpick comments (2)
opensearch/src/main/java/org/opensearch/sql/opensearch/monitor/OpenSearchMemoryHealthy.java (1)
73-98: Include human-readable units/percent in the status description.The description currently emits raw bytes, which may undercut the “richer error message” goal. Consider formatting with human-readable units and percentage (or delegating to a shared formatter) to keep downstream responses consistent.
♻️ Example formatting adjustment
+import org.opensearch.core.common.unit.ByteSizeValue; ... - String description = - String.format( - "Memory usage exceeds limit: %d bytes used, %d bytes limit", - currentMemoryUsage, limitBytes); + double usagePct = limitBytes > 0 ? (currentMemoryUsage * 100.0 / limitBytes) : 0.0; + String description = + String.format( + "Memory usage exceeds limit: %s used / %s limit (%.2f%%)", + ByteSizeValue.ofBytes(currentMemoryUsage), + ByteSizeValue.ofBytes(limitBytes), + usagePct);opensearch/src/main/java/org/opensearch/sql/opensearch/monitor/OpenSearchResourceMonitor.java (1)
33-45: Avoid retrying on definitively unhealthy status.
statusRetryretries whenever the status is unhealthy, which can add latency under sustained memory pressure and doesn’t seem to provide new signal. Consider returning the first unhealthy status (or retrying only on transient exceptions) to minimize extra work when memory is already over limit.
...earch/src/test/java/org/opensearch/sql/opensearch/monitor/OpenSearchResourceMonitorTest.java
Show resolved
Hide resolved
Signed-off-by: Simeon Widdis <sawiddis@amazon.com>
Description
Quick update for resource management errors.
Technically related to work on #4919 but these changes don't involve the framework, this just adds some resource monitoring infrastructure that didn't exist before.
Before:
{ "error": { "reason": "There was internal problem at backend", "details": "java.sql.SQLException: exception while executing query: insufficient resources to run the query, quit.", "type": "RuntimeException" }, "status": 500 }After:
{ "error": { "reason": "There was internal problem at backend", "details": "Insufficient resources to start query: Memory usage exceeds limit: 74510336 bytes used, 21474836 bytes limit (current: 71.1MB, limit: 20.5MB, usage: 347.0%). To increase the limit, adjust the 'plugins.query.memory_limit' setting (default: 85%).", "type": "RuntimeException" }, "status": 500 }Related Issues
Resolves #4869
Check List
--signoffor-s.By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.