The Problem
The OpenTelemetry::Exporter::OTLP::Exporter#send_bytes method establishes a persistent HTTP connection. When the server returns certain error statuses, the method retries the export request over that same connection.
At GitHub, we observed that this can cause a pile-on effect on individual backend nodes (in our case, OTel collectors). A node that received a "bad" request, or that is returning errors to the client for some other reason, continues to receive every retry of that request, because the client re-uses the same persistent HTTP connection. This puts the collector node under increased pressure, and if the node was already under memory or CPU pressure, the retries make the situation worse.
So, we introduced a monkey patch to the OTLP::Exporter to force it to create a new HTTP connection in the event of an error response. As a result, we saw a marked decrease in client exporter failure rates and OTel collector span refusal and drop rates, and we saw improvements in the distribution of memory usage across our fleet of OTel collector pods.
The Proposal
The OTLP::Exporter should close the current HTTP connection and open a new one when #send_bytes gets an error response back from the backend.
Implementation Suggestion
Our monkey patch looks like this:
    def backoff?(retry_count:, reason:, retry_after: nil)
      # Close the persistent connection so the retry opens a fresh one.
      @http.finish if @http.started?
      super
    end
The #backoff? method is called before any call to #redo to retry the request in #send_bytes. So, the #backoff? method would be an appropriate place to close the HTTP connection. Then, the code already present in #send_bytes will start a fresh connection when #redo is called.
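To illustrate the mechanism without depending on the gem, here is a minimal, self-contained sketch. FakeHTTP, FakeExporter, and ResetConnectionOnRetry are hypothetical names invented for this example, and the stubbed #backoff? is a stand-in that only decides whether to retry; it is not the exporter's actual implementation. The sketch assumes the patch is applied with Module#prepend, which is what makes the call to super reach the original method.

```ruby
# Hypothetical stand-in for Net::HTTP: tracks only open/closed state.
class FakeHTTP
  attr_reader :finish_calls

  def initialize
    @started = false
    @finish_calls = 0
  end

  def start
    @started = true
    self
  end

  def started?
    @started
  end

  def finish
    @started = false
    @finish_calls += 1
  end
end

# Hypothetical stand-in for the exporter. Only the pieces needed here.
class FakeExporter
  def initialize(http)
    @http = http
  end

  # Stub: retry up to three times. The real method is not reproduced here.
  def backoff?(retry_count:, reason:, retry_after: nil)
    retry_count <= 3
  end
end

# The patch from this issue, packaged as a prependable module.
module ResetConnectionOnRetry
  def backoff?(retry_count:, reason:, retry_after: nil)
    # Drop the persistent connection before deciding whether to retry.
    @http.finish if @http.started?
    super
  end
end

FakeExporter.prepend(ResetConnectionOnRetry)

http = FakeHTTP.new.start
exporter = FakeExporter.new(http)

first_retry = exporter.backoff?(retry_count: 1, reason: '503')
closed_after_first = !http.started?

# A second call finds the connection already closed and must not fail.
second_retry = exporter.backoff?(retry_count: 2, reason: '503')
```

Because the started? guard skips #finish on an already closed connection, the patch is safe to run on consecutive retries. In the real exporter, the prepend target would be OpenTelemetry::Exporter::OTLP::Exporter.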