feat(otlp-grpc): add retry logic and comprehensive error handling to …#2076
feat(otlp-grpc): add retry logic and comprehensive error handling to …#2076arjun-rajappa wants to merge 4 commits into
Conversation
…trace exporter - Implement exponential backoff retry mechanism for transient gRPC errors - Add retry support for Unavailable, DeadlineExceeded, Cancelled, ResourceExhausted, Aborted, Internal, and DataLoss errors - Improve certificate handling by reading file contents instead of passing file paths - Add timeout support with deadline tracking across retry attempts - Implement metrics reporting for export failures - Add comprehensive test coverage for error scenarios, retry logic, and edge cases Signed-off-by: Arjun Rajappa <arjun.rajappa@ibm.com>
|
👋 This pull request has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the |
Signed-off-by: Arjun Rajappa <arjun.rajappa@ibm.com>
…ILURE, and RETRY_COUNT Signed-off-by: Arjun Rajappa <arjun.rajappa@ibm.com>
kaylareopelle
left a comment
There was a problem hiding this comment.
Thank you for adding all the defensive error handling to the gRPC exporter! I have a few questions about the tests.
|
|
||
| it 'initializes with custom timeout' do | ||
| exporter = OpenTelemetry::Exporter::OTLP::GRPC::TraceExporter.new(timeout: 5) | ||
| _(exporter).wont_be_nil |
There was a problem hiding this comment.
I notice the assertions for a lot of these tests are wont_be_nil. I worry we're missing out on an opportunity to validate behavior more fully, like by making sure that the endpoint/timeout/etc. was correctly assigned on the exporter.
What are the benefits of wont_be_nil in these scenarios?
| span_data = OpenTelemetry::TestHelpers.create_span_data( | ||
| total_recorded_attributes: 10, | ||
| attributes: { 'a' => 1, 'b' => 2 }, | ||
| total_recorded_events: 5, | ||
| events: [OpenTelemetry::SDK::Trace::Event.new(name: 'event', timestamp: Time.now.to_i * 1_000_000_000)], | ||
| total_recorded_links: 3, | ||
| links: [] | ||
| ) |
There was a problem hiding this comment.
how does this indicate attributes/events/links were dropped?
|
|
||
| describe '#export' do | ||
| it 'exports span data successfully' do | ||
| skip unless ENV['TRACING_INTEGRATION_TEST'] |
There was a problem hiding this comment.
I see this environment variable is used in other exporter tests, but it looks like the other exporters only use it once. Why use it so frequently here?
Also, I don't see TRACING_INTEGRATION_TEST being explicitly set anywhere. Will these tests be skipped most of the time?
| _(result).must_equal(success) | ||
| end | ||
|
|
||
| it 'exports multiple spans' do |
There was a problem hiding this comment.
Are these tests intended to replace the it 'translates all the things' tests from the other exporters?
Ex:
Closes #1667
Changes