Add DCGM exporter pod metadata enrichment API #2406

Open
karthikvetrivel wants to merge 1 commit into NVIDIA:main from karthikvetrivel:fix/dcgm-exporter-pod-enrichment
Conversation

@karthikvetrivel
Member

Description

Resolves #2009.

Previously, there was no way to enable pod metrics when using DCGM exporter through GPU Operator. This fix introduces first-class Helm values that provision RBAC and allow pod metrics to include the pod UID and pod labels.

Design Choices:

  • Exposed pod metrics as first-class Helm values, rather than having arbitrary env vars trigger RBAC provisioning, to match the upstream dcgm-exporter Helm chart.
  • I intentionally did not add resourceSlices RBAC. The standalone dcgm-exporter Helm chart includes resourceSlices because it also supports DRA-related Kubernetes enrichment. GPU Operator only exposes pod metadata enrichment here because it does not support DRA yet. I am open to changing this.
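
The RBAC scope described above can be sketched roughly as follows. This is only an illustration: `PolicyRule` is a local stand-in type (not the `k8s.io/api/rbac/v1` type), `readPodsRules` is a hypothetical helper that does not appear in this PR, and the verbs on the optional resourceslices rule are an assumption rather than a copy of the upstream chart.

```go
package main

import "fmt"

// PolicyRule is a local stand-in for the shape of a Kubernetes RBAC rule.
// It is NOT the k8s.io/api/rbac/v1 type; it keeps the sketch stdlib-only.
type PolicyRule struct {
	APIGroups []string
	Resources []string
	Verbs     []string
}

// readPodsRules is a hypothetical helper. It always grants read-only access
// to pods (core API group ""). Only when includeDRA is true does it also
// grant read-only access to resourceslices in the resource.k8s.io group.
// The verbs on the resourceslices rule are an assumption for illustration.
func readPodsRules(includeDRA bool) []PolicyRule {
	readVerbs := []string{"get", "list", "watch"}
	rules := []PolicyRule{
		{APIGroups: []string{""}, Resources: []string{"pods"}, Verbs: readVerbs},
	}
	if includeDRA {
		rules = append(rules, PolicyRule{
			APIGroups: []string{"resource.k8s.io"},
			Resources: []string{"resourceslices"},
			Verbs:     readVerbs,
		})
	}
	return rules
}

func main() {
	// GPU Operator case from this PR: pod metadata only, no DRA.
	for _, r := range readPodsRules(false) {
		fmt.Printf("groups=%q resources=%q verbs=%q\n", r.APIGroups, r.Resources, r.Verbs)
	}
}
```

Keeping the DRA rule behind a flag mirrors the design choice: the pod rule is always present, and the resourceslices rule can be added later without touching it.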

Checklist

  • No secrets, sensitive information, or unrelated changes
  • Lint checks passing (make lint)
  • Generated assets in-sync (make validate-generated-assets)
  • Go mod artifacts in-sync (make validate-modules)
  • Test cases are added for new code paths

Testing

Added unit tests.
Confirmed the operator reconciled the expected DCGM exporter resources:

  • nvidia-dcgm-exporter DaemonSet rolled out successfully.
  • DCGM exporter container received:
    - DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS=true
    - DCGM_EXPORTER_KUBERNETES_ENABLE_POD_UID=true
    - DCGM_EXPORTER_KUBERNETES_POD_LABEL_ALLOWLIST_REGEX=^gpu_test_label$
  • Pod template had automountServiceAccountToken: true.
  • nvidia-dcgm-exporter-read-pods ClusterRole and ClusterRoleBinding were created.
  • Exporter service account could get, list, and watch pods cluster-wide.
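
As a rough sketch of the mapping verified above, the following shows how enrichment settings could translate into the three environment variables listed. The `PodEnrichment` and `EnvVar` types and both function names are hypothetical and are not the ClusterPolicy API or the operator's code. Setting the allowlist only when pod labels are enabled is also an assumption.

```go
package main

import "fmt"

// PodEnrichment is a hypothetical settings struct for illustration only;
// the field names are not taken from the ClusterPolicy API.
type PodEnrichment struct {
	EnablePodLabels     bool
	EnablePodUID        bool
	LabelAllowlistRegex string
}

// EnvVar is a local stand-in for a container environment variable.
type EnvVar struct {
	Name  string
	Value string
}

// exporterEnv builds the exporter environment from the settings. The variable
// names are the ones observed on the DCGM exporter container in testing.
func exporterEnv(p PodEnrichment) []EnvVar {
	var env []EnvVar
	if p.EnablePodLabels {
		env = append(env, EnvVar{Name: "DCGM_EXPORTER_KUBERNETES_ENABLE_POD_LABELS", Value: "true"})
		// Assumption: the allowlist only matters when pod labels are enabled.
		if p.LabelAllowlistRegex != "" {
			env = append(env, EnvVar{Name: "DCGM_EXPORTER_KUBERNETES_POD_LABEL_ALLOWLIST_REGEX", Value: p.LabelAllowlistRegex})
		}
	}
	if p.EnablePodUID {
		env = append(env, EnvVar{Name: "DCGM_EXPORTER_KUBERNETES_ENABLE_POD_UID", Value: "true"})
	}
	return env
}

// needsPodReadAccess reports whether the exporter needs the service account
// token mounted and read access to pods, i.e. whether any enrichment is on.
func needsPodReadAccess(p PodEnrichment) bool {
	return p.EnablePodLabels || p.EnablePodUID
}

func main() {
	p := PodEnrichment{EnablePodLabels: true, EnablePodUID: true, LabelAllowlistRegex: "^gpu_test_label$"}
	for _, e := range exporterEnv(p) {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
	fmt.Println("needs pod read access:", needsPodReadAccess(p))
}
```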

Scraped DCGM exporter metrics from a labeled GPU pod and confirmed metrics included gpu_test_label="live-test" and pod_uid="..."
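
A check like the one above could be automated along these lines. This is a minimal sketch with a simplified label parser, not the exporter's or Prometheus' own parsing code. The sample line is made up for illustration (the `pod_uid` value is a placeholder, since the real one is elided above), and it assumes the allowlist regex is matched against the label name.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// labelPair matches name="value" pairs inside one exposition-format sample line.
var labelPair = regexp.MustCompile(`([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"`)

// sampleLabels extracts the labels from a line such as metric{a="1",b="2"} 42.
// It is a simplified parser meant only for this illustration.
func sampleLabels(line string) map[string]string {
	labels := map[string]string{}
	start := strings.Index(line, "{")
	end := strings.LastIndex(line, "}")
	if start < 0 || end < start {
		return labels
	}
	for _, m := range labelPair.FindAllStringSubmatch(line[start+1:end], -1) {
		labels[m[1]] = m[2]
	}
	return labels
}

func main() {
	// Made-up sample line; "example-uid" is a placeholder, not a real pod UID.
	line := `DCGM_FI_DEV_GPU_UTIL{gpu="0",gpu_test_label="live-test",pod_uid="example-uid"} 0`
	labels := sampleLabels(line)

	// Assumption: the allowlist regex is applied to the label name.
	allow := regexp.MustCompile(`^gpu_test_label$`)
	for name, value := range labels {
		if allow.MatchString(name) {
			fmt.Printf("allowlisted label present: %s=%q\n", name, value)
		}
	}
	_, hasUID := labels["pod_uid"]
	fmt.Println("pod_uid present:", hasUID)
}
```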

Comment thread controllers/object_controls.go Outdated
Comment thread controllers/object_controls.go Outdated
Contributor

@rahulait rahulait left a comment

LGTM

@karthikvetrivel karthikvetrivel force-pushed the fix/dcgm-exporter-pod-enrichment branch 4 times, most recently from aa28ce9 to ab5a207 Compare May 4, 2026 14:46
name: nvidia-dcgm-exporter-read-pods
labels:
  app: nvidia-dcgm-exporter
# Resourceslices are DRA-only and should be added when GPU Operator exposes DRA exporter support.
Contributor

Is this a TODO comment?

Member Author

This was just to explain why there isn't parity with the upstream DCGM Exporter clusterrole, which contains resource slices. Can remove if you'd like?

Contributor

Yes, either add a TODO comment with the intended future change or just remove it.

Member Author

Updated to be a TODO!

Comment thread controllers/object_controls_test.go Outdated
@karthikvetrivel karthikvetrivel force-pushed the fix/dcgm-exporter-pod-enrichment branch from ab5a207 to 39fb8d8 Compare May 4, 2026 16:06
Comment thread api/nvidia/v1/clusterpolicy_types.go Outdated
// +operator-sdk:gen-csv:customresourcedefinitions.specDescriptors.x-descriptors="urn:alm:descriptor:com.tectonic.ui:advanced"
HPCJobMapping *DCGMExporterHPCJobMappingConfig `json:"hpcJobMapping,omitempty"`

// Enable Kubernetes pod labels in metrics. Requires cluster-level read access to pods.
Contributor

This feels a bit too terse. Can we expand on this a bit more? Let's also clarify that this setting adds pod labels as label dimensions on the DCGM exporter Prometheus metrics.

Also, the "Requires cluster-level read access to pods" note is a bit unnecessary IMO. If we do want to mention it, can we enclose it in parentheses?

Member Author

Yep, expanded on the label and enclosed "Requires cluster-level read access to pods" in parentheses.

Signed-off-by: Karthik Vetrivel <kvetrivel@nvidia.com>
@karthikvetrivel karthikvetrivel force-pushed the fix/dcgm-exporter-pod-enrichment branch from 39fb8d8 to 5e5530f Compare May 5, 2026 14:58
Development

Successfully merging this pull request may close these issues.

Pod Label not visible in DCGM Exporter Metrics

3 participants