Prometheus: preserve job/instance when translating Prometheus -> OTLP#4956
Prometheus: preserve job/instance when translating Prometheus -> OTLP#4956dashpole wants to merge 10 commits into
Conversation
|
Hmmmm, to make quicker progress here, would it make sense to focus on the Prometheus -> OTLP transformation and leave OTLP -> Prometheus in a separate PR? I feel like OTLP -> Prometheus is a lot more complicated, and will require deeper discussions with other parts of OTel, like Infrastructure semantic conventions, OBI, etc 🤔 |
|
My main goal is to make sure that everything still round-trips properly. If I make changes to one direction, I usually have to make the inverse change to the other direction to ensure that. I can remove some of the editorial changes to the OTLP -> Prometheus section if that helps. But I would like aggregated exporters to accept |
|
This PR was marked stale. It will be closed in 14 days without additional activity. |
jmacd
left a comment
There was a problem hiding this comment.
I left a very minor question.
ArthurSens
left a comment
There was a problem hiding this comment.
I did a pass through the PR, and so far, I have just one comment.
I feel like I haven't fully grasped the potential consequences of this change, so I'm taking some extra time to reflect, I hope that's ok :)
7248acd to
4dda69a
Compare
ArthurSens
left a comment
There was a problem hiding this comment.
This behavior seems very very similar to the configuration option called keep_identifying_attributes from the Prometheus Server configuration.
I know we're only modifying the data model spec here, not the exporter configuration spec, but I wonder whether we'd like to make adjustments to the exporter spec or the Prometheus configuration so both OTel and Prometheus are aligned.
@aknuds1, I think this is somewhat close to work you've done with @cyrille-leclerc in the past. Do you have anything extra to add?
Yeah. If we went with this change, it would make sense to eventually switch that to be true by default (e.g. in the next major version, if we consider it breaking). It would be less likely to be duplicate information, so it would make more sense for users to keep the original ones. |
|
I'll have a look. |
@ArthurSens @dashpole AFAICT, the following implies
|
This is meant to say that |
aknuds1
left a comment
There was a problem hiding this comment.
LGTM sans remaining two suggestions.
Co-authored-by: Arve Knudsen <arve.knudsen@gmail.com>
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
c5d9afc to
3f74736
Compare
|
@bogdandrutu, @jmacd, @reyang can you take another look? You're previous approvals were reset. |
aknuds1
left a comment
There was a problem hiding this comment.
I have doubts regarding this proposed spec change. Please see comment.
| contains all metrics with that `job` and `instance`. `job` and `instance` labels | ||
| MUST be added as resource attributes, and not as metric attributes. |
There was a problem hiding this comment.
Are you sure it's a good idea to add job and instance labels as resource attributes? OTLP -> Prometheus translation considers (according to spec and implementation) instead service.instance.id and service.name to be the identfiying resource attributes. Also, I think that job and instance resource attributes would be dropped in favour of corresponding labels synthesized by Prometheus when generating target_info.
There was a problem hiding this comment.
See this discussion above: #4956 (comment). While our specification stated that service.* attributes were identifying, that was based on a misread of the service attributes specification.
This is now proposing that job and instance are respected if they already exist as resource attributes when ingesting OTLP (since it means that the metric was originally scraped from a prometheus endpoint.
There was a problem hiding this comment.
@dashpole What exactly is based on a misread? That Prometheus' OTLP endpoint treats service.instance.id and service.name resource attributes as identifying? If so, how precisely? Changing it would be backwards incompatible, and I would argue outside of this PR's scope.
What about the second problem I raised, i.e. that job and instance resource attributes would be dropped by the Prometheus OTLP endpoint in favour of corresponding labels synthesized from identifying resource attributes when generating the target_info metric?
This is now proposing that job and instance are respected if they already exist as resource attributes when ingesting OTLP
Where is that stated in the PR's proposed text?
There was a problem hiding this comment.
since it means that the metric was originally scraped from a prometheus endpoint
@dashpole We can't make that assumption. It's but one possible source of job and instance OTel resource attributes.
| If `job` is not present as a resource attribute and `service.name` is present, | ||
| the `service.name` and `service.namespace` resource attributes MUST be combined | ||
| as `<service.namespace>/<service.name>`, or `<service.name>` if namespace is | ||
| empty, to form the `job` metric label; otherwise, `job` SHOULD be added with an | ||
| empty value. If `instance` is not present as a resource attribute and | ||
| `service.instance.id` is present, `service.instance.id` MUST be converted to the | ||
| `instance` label; otherwise, `instance` SHOULD be added with an empty value. |
There was a problem hiding this comment.
On re-reading the PR, I think this must be where you intend to define that job and instance resource attributes respectively should be identifying iff present, with service.name and service.instance.id as respective fallbacks. That should be clearly spelled out, as the intent is effectively ambiguous as currently written.
There are some issues with such a radical change however:
- Users whose resource attributes include
joband/orinstance(doesn't sound entirely unlikely) will experience that their series identities change, when said attributes differ from (service.namespace/ +)service.name/service.instance.id- or when the latter are absent. - The effect of the
keep_identifying_resource_attributesconfig parameter becomes data-dependent: It can only meaningfully apply to theservice.nameandservice.instance.idresource attributes synthesized intojobandinstance. Whenjoband/orinstanceresource attributes take precedence, the former presumably have to be kept ontarget_inforegardless of the parameter.
There was a problem hiding this comment.
I thought about this some more in the meantime - I think that projecting the Prometheus meaning of job and instance onto OTel resource attributes is a flawed design. They don't have this intrinsic meaning within the OTel scope, so I do think that the previous proposal within this PR to prefix them with "prometheus." was an objectively better choice. See also my comment above regarding the assumption that job and instance OTel resource attributes would be proof of Prometheus provenance.
krajorama
left a comment
There was a problem hiding this comment.
I agree with @aknuds1 that this is a breaking change for some people.
On Prometheus->OTLP side new labels showing up can in general lead to different results if the query language allows without construct, that is generating series by ignoring some attributes, but allowing everything else.
On OTLP->Prometheus side: user might have knowingly or unknowingly masked job, instance resource attributes by overwriting from service.name, service.id. That would break now as we'd keep job and instance.
Let's discuss at the next WG meeting. See also #4753 (comment)
|
The (resolved) comment thread #4956 (comment) and the comments above lead me to agree that this will be a breaking change, likely in more than one way. See particularly the comment by @ArthurSens #4956 (comment): we're going to need Collector documentation updates for the users that have become used to the service.* resource attributes. |
Part of #4753
Fixes #4577
Related to open-telemetry/opentelemetry-collector-contrib#45982
Issues with the current state
Changes
There are a bunch of changes to the formatting and wording included in this PR. If needed, I can split those out.
service.nameMAY default tojob, andservice.instance.idMAY default toinstance.jobandinstanceare now preserved when translating Prometheus -> OTLP.Alternatives
prometheus.joborprometheus.instanceinstead ofjobandinstance.service.nameandservice.instance.idattributes tojoborinstance. This would be breaking for users.@open-telemetry/prometheus-interoperability @aknuds1 @kln21002 @cyrille-leclerc