Add DCGM exporter pod metadata enrichment API #2406
karthikvetrivel wants to merge 1 commit into NVIDIA:main
aa28ce9 to ab5a207
name: nvidia-dcgm-exporter-read-pods
labels:
  app: nvidia-dcgm-exporter
# Resourceslices are DRA-only and should be added when GPU Operator exposes DRA exporter support.
This was just to explain why there isn't parity with the upstream DCGM Exporter clusterrole, which contains resource slices. Can remove if you'd like?
Yes, either add a TODO comment describing the intended future change or just remove it.
Updated to be a TODO!
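For readers following the thread above, here is a minimal sketch of what a pods-only read rule looks like and why it does not cover resourceslices. `PolicyRule` is a local stand-in declared for this example, not the Kubernetes `rbac/v1` type, and `readPodsRule` and `allows` are hypothetical helpers. The verbs come from the testing notes in this PR. The actual ClusterRole manifest may differ.

```go
package main

import "fmt"

// PolicyRule is a local stand-in for an RBAC policy rule, declared here
// only so the sketch compiles without external modules.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// readPodsRule returns a rule granting get, list, and watch on pods in the
// core ("") API group. Resourceslices are deliberately absent.
func readPodsRule() PolicyRule {
	return PolicyRule{
		APIGroups: []string{""},
		Resources: []string{"pods"},
		Verbs:     []string{"get", "list", "watch"},
	}
}

// contains reports whether want is present in items.
func contains(items []string, want string) bool {
	for _, it := range items {
		if it == want {
			return true
		}
	}
	return false
}

// allows reports whether the rule permits verb on resource.
func (r PolicyRule) allows(verb, resource string) bool {
	return contains(r.Verbs, verb) && contains(r.Resources, resource)
}

func main() {
	rule := readPodsRule()
	fmt.Println(rule.allows("list", "pods"))          // true
	fmt.Println(rule.allows("delete", "pods"))        // false
	fmt.Println(rule.allows("get", "resourceslices")) // false
}
```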
ab5a207 to 39fb8d8
// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.x-descriptors="urn:alm:descriptor:com.tectonic.ui:advanced"
HPCJobMapping *DCGMExporterHPCJobMappingConfig `json:"hpcJobMapping,omitempty"`

// Enable Kubernetes pod labels in metrics. Requires cluster-level read access to pods.
This feels a bit too terse. Can we expand on this a bit more? Let's also clarify that this setting adds pod labels as label dimensions on the DCGM exporter Prometheus metrics.
Also, the "Requires cluster-level read access to pods" is a bit unnecessary IMO. If we do want to mention this, can we enclose it in parentheses?
Yep, expanded on the label and enclosed "Requires cluster-level read access to pods" in parentheses.
Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
39fb8d8 to 5e5530f
Description
Resolves #2009.
Previously, there was no way to enable pod metrics when using DCGM exporter through GPU Operator. This change introduces first-class Helm values that provision RBAC and allow pod metrics to include the Pod UID and Pod Labels.
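As a rough illustration of how such settings could translate into exporter configuration, the sketch below maps a config struct to the environment variables listed under Testing. The `PodMetadataConfig` type, its field names, and the `exporterEnv` helper are hypothetical and not taken from this PR. Only the environment variable names come from the testing notes. Emitting the allowlist only when pod labels are enabled is also an assumption.

```go
package main

import "fmt"

// PodMetadataConfig is a hypothetical shape for the new settings; the real
// field names in the operator API may differ.
type PodMetadataConfig struct {
	EnablePodLabels        bool
	EnablePodUID           bool
	PodLabelAllowlistRegex string
}

// EnvVar is a local stand-in for a container environment variable.
type EnvVar struct {
	Name  string
	Value string
}

// exporterEnv returns the environment variables implied by cfg.
// Disabled features contribute nothing.
func exporterEnv(cfg PodMetadataConfig) []EnvVar {
	var env []EnvVar
	if cfg.EnablePodLabels {
		env = append(env, EnvVar{Name: "DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS", Value: "true"})
	}
	if cfg.EnablePodUID {
		env = append(env, EnvVar{Name: "DCGM_EXPORTER_KUBERNETES_ENABLE_POD_UID", Value: "true"})
	}
	// Assumption: the allowlist is only meaningful when pod labels are enabled.
	if cfg.EnablePodLabels && cfg.PodLabelAllowlistRegex != "" {
		env = append(env, EnvVar{Name: "DCGM_EXPORTER_KUBERNETES_POD_LABEL_ALLOWLIST_REGEX", Value: cfg.PodLabelAllowlistRegex})
	}
	return env
}

func main() {
	cfg := PodMetadataConfig{
		EnablePodLabels:        true,
		EnablePodUID:           true,
		PodLabelAllowlistRegex: "^gpu_test_label$",
	}
	for _, e := range exporterEnv(cfg) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```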
Design Choices:
Checklist
- make lint
- make validate-generated-assets
- make validate-modules
Testing
Added unit tests.
Confirmed the operator reconciled the expected DCGM exporter resources:
- nvidia-dcgm-exporter DaemonSet rolled out successfully.
- DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS=true
- DCGM_EXPORTER_KUBERNETES_ENABLE_POD_UID=true
- DCGM_EXPORTER_KUBERNETES_POD_LABEL_ALLOWLIST_REGEX=^gpu_test_label$
- automountServiceAccountToken: true.
- nvidia-dcgm-exporter-read-pods ClusterRole and ClusterRoleBinding were created.
- get, list, and watch pods cluster-wide.
Scraped DCGM exporter metrics from a labeled GPU pod and confirmed metrics included gpu_test_label="live-test" and pod_uid="...".