feat(otlp-grpc): add retry logic and comprehensive error handling to …#2076

Open
arjun-rajappa wants to merge 4 commits into
open-telemetry:mainfrom
arjun-rajappa:add-tests-grpc-exporter

Conversation

@arjun-rajappa

Contributor

Closes #1667

Changes

  • Implement exponential backoff retry mechanism for transient gRPC errors
  • Add retry support for Unavailable, DeadlineExceeded, Cancelled, ResourceExhausted, Aborted, Internal, and DataLoss errors
  • Improve certificate handling by reading file contents instead of passing file paths
  • Add timeout support with deadline tracking across retry attempts
  • Implement metrics reporting for export failures
  • Add comprehensive test coverage for error scenarios, retry logic, and edge cases
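A minimal sketch of the retry-with-deadline behaviour listed above, not the code in this PR. `TransientError`, `export_with_retry`, and all parameter names are hypothetical; `TransientError` stands in for whichever gRPC errors are treated as retryable, and the sleep is injectable so the logic can be exercised without waiting.

```ruby
# Hypothetical stand-in for a retryable gRPC failure (e.g. Unavailable).
class TransientError < StandardError; end

# Retries the block on TransientError with exponential backoff and stops
# once max_attempts is reached or the overall deadline has passed.
# Returns [:success or :failure, number_of_attempts].
def export_with_retry(timeout:, max_attempts: 5, base_delay: 0.01, sleeper: ->(s) { sleep(s) })
  deadline = Process.clock_gettime(Process::CLOCK_MONOTONIC) + timeout
  attempts = 0
  loop do
    attempts += 1
    begin
      yield
      return [:success, attempts]
    rescue TransientError
      remaining = deadline - Process.clock_gettime(Process::CLOCK_MONOTONIC)
      return [:failure, attempts] if attempts >= max_attempts || remaining <= 0

      # Delay doubles on each attempt and is capped by the time left.
      delay = [base_delay * (2**(attempts - 1)), remaining].min
      sleeper.call(delay)
    end
  end
end

# Usage: fail twice, then succeed on the third attempt.
calls = 0
result, attempts = export_with_retry(timeout: 1, sleeper: ->(_s) {}) do
  calls += 1
  raise TransientError, 'unavailable' if calls < 3
end
```

Tracking one deadline across all attempts, rather than a fresh timeout per attempt, keeps the total export time bounded by the configured timeout.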

…trace exporter

Signed-off-by: Arjun Rajappa <arjun.rajappa@ibm.com>
@github-actions

github-actions Bot commented May 7, 2026

Contributor

👋 This pull request has been marked as stale because it has been open with no activity. You can: comment on the issue or remove the stale label to hold stale off for a while, add the keep label to hold stale off permanently, or do nothing. If you do nothing this pull request will be closed eventually by the stale bot

Comment thread exporter/otlp-grpc/lib/opentelemetry/exporter/otlp/grpc/trace_exporter.rb Outdated
Signed-off-by: Arjun Rajappa <arjun.rajappa@ibm.com>
…ILURE, and RETRY_COUNT

Signed-off-by: Arjun Rajappa <arjun.rajappa@ibm.com>

@kaylareopelle kaylareopelle left a comment

Contributor

Thank you for adding all the defensive error handling to the gRPC exporter! I have a few questions about the tests.


it 'initializes with custom timeout' do
  exporter = OpenTelemetry::Exporter::OTLP::GRPC::TraceExporter.new(timeout: 5)
  _(exporter).wont_be_nil

Contributor

I notice the assertions for a lot of these tests are wont_be_nil. I worry we're missing out on an opportunity to validate behavior more fully, like by making sure that the endpoint/timeout/etc. was correctly assigned on the exporter.

What are the benefits of wont_be_nil in these scenarios?
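
To illustrate the difference, here is a rough sketch using a hypothetical `FakeExporter` stand-in. It assumes the timeout is kept in an instance variable named `@timeout`, which may not match how the gRPC exporter actually stores it.

```ruby
# Hypothetical stand-in for the exporter; the real class may store its
# timeout differently or not expose it at all.
class FakeExporter
  def initialize(timeout: 10)
    @timeout = timeout.to_f
  end
end

exporter = FakeExporter.new(timeout: 5)

# Weak check: true for any successfully constructed object.
weak = !exporter.nil?

# Stronger check: the option actually reached the exporter's state.
# In a Minitest spec this could read:
#   _(exporter.instance_variable_get(:@timeout)).must_equal 5.0
strong = exporter.instance_variable_get(:@timeout) == 5.0
```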

Comment on lines +247 to +254
span_data = OpenTelemetry::TestHelpers.create_span_data(
  total_recorded_attributes: 10,
  attributes: { 'a' => 1, 'b' => 2 },
  total_recorded_events: 5,
  events: [OpenTelemetry::SDK::Trace::Event.new(name: 'event', timestamp: Time.now.to_i * 1_000_000_000)],
  total_recorded_links: 3,
  links: []
)

Contributor

How does this indicate that attributes, events, or links were dropped?
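
For context, a sketch of the arithmetic the fixture seems to rely on, assuming the dropped count is derived as recorded minus kept. `SpanDataSketch` and `dropped_attributes_count` are hypothetical names used only for illustration, and the test itself would still need to assert on the resulting count.

```ruby
# Hypothetical, pared-down stand-in for span data; only the fields needed here.
SpanDataSketch = Struct.new(:attributes, :total_recorded_attributes, keyword_init: true)

# Dropped count is what was recorded minus what was kept.
# A nil attributes hash counts as zero kept attributes.
def dropped_attributes_count(span_data)
  span_data.total_recorded_attributes - span_data.attributes.to_h.size
end

span_data = SpanDataSketch.new(attributes: { 'a' => 1, 'b' => 2 }, total_recorded_attributes: 10)
dropped = dropped_attributes_count(span_data) # 10 recorded, 2 kept
```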


describe '#export' do
  it 'exports span data successfully' do
    skip unless ENV['TRACING_INTEGRATION_TEST']

Contributor

I see this environment variable is used in other exporter tests, but it looks like the other exporters only use it once. Why use it so frequently here?

Also, I don't see TRACING_INTEGRATION_TEST being explicitly set anywhere. Will these tests be skipped most of the time?

    _(result).must_equal(success)
  end

  it 'exports multiple spans' do

Contributor

Are these tests intended to replace the it 'translates all the things' tests from the other exporters?

Ex:

  it 'translates all the things' do
    stub_request(:post, 'http://localhost:4318/v1/traces').to_return(status: 200)
    processor = OpenTelemetry::SDK::Trace::Export::BatchSpanProcessor.new(exporter)
    tracer = OpenTelemetry.tracer_provider.tracer('tracer', 'v0.0.1')
    other_tracer = OpenTelemetry.tracer_provider.tracer('other_tracer')
    trace_id = OpenTelemetry::Trace.generate_trace_id
    root_span_id = OpenTelemetry::Trace.generate_span_id
    child_span_id = OpenTelemetry::Trace.generate_span_id
    client_span_id = OpenTelemetry::Trace.generate_span_id
    server_span_id = OpenTelemetry::Trace.generate_span_id
    consumer_span_id = OpenTelemetry::Trace.generate_span_id
    start_timestamp = Time.now
    end_timestamp = start_timestamp + 6
    OpenTelemetry.tracer_provider.add_span_processor(processor)
    root = OpenTelemetry::TestHelpers.with_ids(trace_id, root_span_id) { tracer.start_root_span('root', kind: :internal, start_timestamp: start_timestamp) }
    root.status = OpenTelemetry::Trace::Status.ok
    root.finish(end_timestamp: end_timestamp)
    root_ctx = OpenTelemetry::Trace.context_with_span(root)
    span = OpenTelemetry::TestHelpers.with_ids(trace_id, child_span_id) { tracer.start_span('child', with_parent: root_ctx, kind: :producer, start_timestamp: start_timestamp + 1, links: [OpenTelemetry::Trace::Link.new(root.context, 'attr' => 4)]) }
    span['b'] = true
    span['f'] = 1.1
    span['i'] = 2
    span['s'] = 'val'
    span['a'] = [3, 4]
    span.status = OpenTelemetry::Trace::Status.error
    child_ctx = OpenTelemetry::Trace.context_with_span(span)
    client = OpenTelemetry::TestHelpers.with_ids(trace_id, client_span_id) { tracer.start_span('client', with_parent: child_ctx, kind: :client, start_timestamp: start_timestamp + 2).finish(end_timestamp: end_timestamp) }
    client_ctx = OpenTelemetry::Trace.context_with_span(client)
    OpenTelemetry::TestHelpers.with_ids(trace_id, server_span_id) { other_tracer.start_span('server', with_parent: client_ctx, kind: :server, start_timestamp: start_timestamp + 3).finish(end_timestamp: end_timestamp) }
    span.add_event('event', attributes: { 'attr' => 42 }, timestamp: start_timestamp + 4)
    OpenTelemetry::TestHelpers.with_ids(trace_id, consumer_span_id) { tracer.start_span('consumer', with_parent: child_ctx, kind: :consumer, start_timestamp: start_timestamp + 5).finish(end_timestamp: end_timestamp) }
    span.finish(end_timestamp: end_timestamp)
    OpenTelemetry.tracer_provider.shutdown
    encoded_etsr = Opentelemetry::Proto::Collector::Trace::V1::ExportTraceServiceRequest.encode(
      Opentelemetry::Proto::Collector::Trace::V1::ExportTraceServiceRequest.new(
        resource_spans: [
          Opentelemetry::Proto::Trace::V1::ResourceSpans.new(
            resource: Opentelemetry::Proto::Resource::V1::Resource.new(
              attributes: [
                Opentelemetry::Proto::Common::V1::KeyValue.new(key: 'telemetry.sdk.name', value: Opentelemetry::Proto::Common::V1::AnyValue.new(string_value: 'opentelemetry')),
                Opentelemetry::Proto::Common::V1::KeyValue.new(key: 'telemetry.sdk.language', value: Opentelemetry::Proto::Common::V1::AnyValue.new(string_value: 'ruby')),
                Opentelemetry::Proto::Common::V1::KeyValue.new(key: 'telemetry.sdk.version', value: Opentelemetry::Proto::Common::V1::AnyValue.new(string_value: OpenTelemetry::SDK::VERSION))
              ]
            ),
            scope_spans: [
              Opentelemetry::Proto::Trace::V1::ScopeSpans.new(
                scope: Opentelemetry::Proto::Common::V1::InstrumentationScope.new(
                  name: 'tracer',
                  version: 'v0.0.1'
                ),
                spans: [
                  Opentelemetry::Proto::Trace::V1::Span.new(
                    trace_id: trace_id,
                    span_id: root_span_id,
                    parent_span_id: nil,
                    name: 'root',
                    kind: Opentelemetry::Proto::Trace::V1::Span::SpanKind::SPAN_KIND_INTERNAL,
                    start_time_unix_nano: (start_timestamp.to_r * 1_000_000_000).to_i,
                    end_time_unix_nano: (end_timestamp.to_r * 1_000_000_000).to_i,
                    flags: (
                      Opentelemetry::Proto::Trace::V1::SpanFlags::SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK |
                      1
                    ),
                    status: Opentelemetry::Proto::Trace::V1::Status.new(
                      code: Opentelemetry::Proto::Trace::V1::Status::StatusCode::STATUS_CODE_OK
                    )
                  ),
                  Opentelemetry::Proto::Trace::V1::Span.new(
                    trace_id: trace_id,
                    span_id: client_span_id,
                    parent_span_id: child_span_id,
                    name: 'client',
                    kind: Opentelemetry::Proto::Trace::V1::Span::SpanKind::SPAN_KIND_CLIENT,
                    start_time_unix_nano: ((start_timestamp + 2).to_r * 1_000_000_000).to_i,
                    end_time_unix_nano: (end_timestamp.to_r * 1_000_000_000).to_i,
                    flags: (
                      Opentelemetry::Proto::Trace::V1::SpanFlags::SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK |
                      1
                    ),
                    status: Opentelemetry::Proto::Trace::V1::Status.new(
                      code: Opentelemetry::Proto::Trace::V1::Status::StatusCode::STATUS_CODE_UNSET
                    )
                  ),
                  Opentelemetry::Proto::Trace::V1::Span.new(
                    trace_id: trace_id,
                    span_id: consumer_span_id,
                    parent_span_id: child_span_id,
                    name: 'consumer',
                    kind: Opentelemetry::Proto::Trace::V1::Span::SpanKind::SPAN_KIND_CONSUMER,
                    start_time_unix_nano: ((start_timestamp + 5).to_r * 1_000_000_000).to_i,
                    end_time_unix_nano: (end_timestamp.to_r * 1_000_000_000).to_i,
                    flags: (
                      Opentelemetry::Proto::Trace::V1::SpanFlags::SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK |
                      1
                    ),
                    status: Opentelemetry::Proto::Trace::V1::Status.new(
                      code: Opentelemetry::Proto::Trace::V1::Status::StatusCode::STATUS_CODE_UNSET
                    )
                  ),
                  Opentelemetry::Proto::Trace::V1::Span.new(
                    trace_id: trace_id,
                    span_id: child_span_id,
                    parent_span_id: root_span_id,
                    name: 'child',
                    kind: Opentelemetry::Proto::Trace::V1::Span::SpanKind::SPAN_KIND_PRODUCER,
                    start_time_unix_nano: ((start_timestamp + 1).to_r * 1_000_000_000).to_i,
                    end_time_unix_nano: (end_timestamp.to_r * 1_000_000_000).to_i,
                    attributes: [
                      Opentelemetry::Proto::Common::V1::KeyValue.new(key: 'b', value: Opentelemetry::Proto::Common::V1::AnyValue.new(bool_value: true)),
                      Opentelemetry::Proto::Common::V1::KeyValue.new(key: 'f', value: Opentelemetry::Proto::Common::V1::AnyValue.new(double_value: 1.1)),
                      Opentelemetry::Proto::Common::V1::KeyValue.new(key: 'i', value: Opentelemetry::Proto::Common::V1::AnyValue.new(int_value: 2)),
                      Opentelemetry::Proto::Common::V1::KeyValue.new(key: 's', value: Opentelemetry::Proto::Common::V1::AnyValue.new(string_value: 'val')),
                      Opentelemetry::Proto::Common::V1::KeyValue.new(
                        key: 'a',
                        value: Opentelemetry::Proto::Common::V1::AnyValue.new(
                          array_value: Opentelemetry::Proto::Common::V1::ArrayValue.new(
                            values: [
                              Opentelemetry::Proto::Common::V1::AnyValue.new(int_value: 3),
                              Opentelemetry::Proto::Common::V1::AnyValue.new(int_value: 4)
                            ]
                          )
                        )
                      )
                    ],
                    events: [
                      Opentelemetry::Proto::Trace::V1::Span::Event.new(
                        time_unix_nano: ((start_timestamp + 4).to_r * 1_000_000_000).to_i,
                        name: 'event',
                        attributes: [
                          Opentelemetry::Proto::Common::V1::KeyValue.new(key: 'attr', value: Opentelemetry::Proto::Common::V1::AnyValue.new(int_value: 42))
                        ]
                      )
                    ],
                    links: [
                      Opentelemetry::Proto::Trace::V1::Span::Link.new(
                        trace_id: trace_id,
                        span_id: root_span_id,
                        attributes: [
                          Opentelemetry::Proto::Common::V1::KeyValue.new(key: 'attr', value: Opentelemetry::Proto::Common::V1::AnyValue.new(int_value: 4))
                        ],
                        flags: (
                          Opentelemetry::Proto::Trace::V1::SpanFlags::SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK |
                          1
                        )
                      )
                    ],
                    status: Opentelemetry::Proto::Trace::V1::Status.new(
                      code: Opentelemetry::Proto::Trace::V1::Status::StatusCode::STATUS_CODE_ERROR
                    ),
                    flags: (
                      Opentelemetry::Proto::Trace::V1::SpanFlags::SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK |
                      1
                    )
                  )
                ]
              ),
              Opentelemetry::Proto::Trace::V1::ScopeSpans.new(
                scope: Opentelemetry::Proto::Common::V1::InstrumentationScope.new(
                  name: 'other_tracer'
                ),
                spans: [
                  Opentelemetry::Proto::Trace::V1::Span.new(
                    trace_id: trace_id,
                    span_id: server_span_id,
                    parent_span_id: client_span_id,
                    name: 'server',
                    kind: Opentelemetry::Proto::Trace::V1::Span::SpanKind::SPAN_KIND_SERVER,
                    start_time_unix_nano: ((start_timestamp + 3).to_r * 1_000_000_000).to_i,
                    end_time_unix_nano: (end_timestamp.to_r * 1_000_000_000).to_i,
                    flags: (
                      Opentelemetry::Proto::Trace::V1::SpanFlags::SPAN_FLAGS_CONTEXT_HAS_IS_REMOTE_MASK |
                      1
                    ),
                    status: Opentelemetry::Proto::Trace::V1::Status.new(
                      code: Opentelemetry::Proto::Trace::V1::Status::StatusCode::STATUS_CODE_UNSET
                    )
                  )
                ]
              )
            ]
          )
        ]
      )
    )
    assert_requested(:post, 'http://localhost:4318/v1/traces') do |req|
      Zlib.gunzip(req.body) == encoded_etsr
    end
  end
end

Development

Successfully merging this pull request may close these issues.

Make otlp-http, otlp-grpc, and otlp-common gems production-ready

3 participants