feat(observability): emit OTel traces and metrics from Temporal workflows (GNINWD-1588) by kiwueke2 · Pull Request #44 · NVIDIA/nv-config-manager

kiwueke2 · 2026-06-12T15:36:44Z

Instruments Temporal worker and API with OpenTelemetry: telemetry.py bootstrap, TracingInterceptor on worker/client/API, FastAPIInstrumentor, otel-collector sidecar (OTLP receiver -> k8sattributes + Panoptes attribute processors -> mTLS OTLP/HTTP export), Panoptes mTLS ExternalSecret, and image-mirror CI workflow. All gated on temporal.observability.enabled: false — zero behaviour change to existing deployments.

Summary by CodeRabbit

Release Notes

New Features
- Added OpenTelemetry observability integration for Temporal metrics and trace collection
- Added Grafana dashboard for monitoring Temporal workflow health with key metrics
Documentation
- Added documentation for Grafana dashboard management and provisioning workflow
Chores
- Added GitHub Actions workflow for mirroring OpenTelemetry Collector images
- Updated dependencies to support OpenTelemetry instrumentation

…lows (GNINWD-1588) Instruments Temporal worker and API with OpenTelemetry: telemetry.py bootstrap, TracingInterceptor on worker/client/API, FastAPIInstrumentor, otel-collector sidecar (OTLP receiver -> k8sattributes + Panoptes attribute processors -> mTLS OTLP/HTTP export), Panoptes mTLS ExternalSecret, and image-mirror CI workflow. All gated on temporal.observability.enabled: false. Signed-off-by: Kezie Iwueke <kiwueke@nvidia.com>

copy-pr-bot · 2026-06-12T15:36:48Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

coderabbitai · 2026-06-12T15:36:58Z

📝 Walkthrough

Walkthrough

This PR adds OpenTelemetry observability to Temporal components by introducing a telemetry bootstrap module, instrumenting the FastAPI app and Temporal client/worker with tracing interceptors, deploying an OTel Collector sidecar via Helm with Panoptes OTLP export over mTLS, provisioning certificates via Vault, adding a Grafana dashboard for workflow health, and supporting CI/CD for the OTel Collector image mirror.

Changes

Temporal OpenTelemetry Observability

Layer / File(s)	Summary
OTel Telemetry Module and Dependencies `src/nv_config_manager/temporal/telemetry.py`, `pyproject.toml`	New `telemetry.py` module provides `setup_telemetry(service_name)` and `get_runtime()` functions to configure OTLP-based tracing and Temporal runtime with cached state. Updated `pyproject.toml` to add `temporalio[opentelemetry]` extra and OpenTelemetry exporter/FastAPI instrumentation dependencies.
Application Instrumentation `src/nv_config_manager/temporal/api/main.py`, `src/nv_config_manager/temporal/api/workflow_v1.py`, `src/nv_config_manager/temporal/worker/main.py`	FastAPI entrypoint now calls `setup_telemetry()` and instruments app with `FastAPIInstrumentor`. Both API and worker client/worker configurations import and use `TracingInterceptor()` and wire the cached telemetry runtime into client and worker connections.
Helm Configuration Values `deploy/helm/values.yaml`	Added `global.images.otelCollector` image definition, `secrets.vault.paths.panoptesMtls` vault secret path with certificate key mappings, and `temporal.observability` feature config with OTLP endpoint, Panoptes metadata fields, and sidecar resource limits.
OTel Collector ConfigMap and Sidecar Injection `deploy/helm/templates/temporal-otel-collector-configmap.yaml`, `deploy/helm/templates/temporal.yaml`	New ConfigMap template defines OTLP receivers (gRPC/HTTP), Kubernetes/Panoptes attribute processors, resource enrichment, batching, and TLS-enabled OTLP/HTTP exporter. Updated Temporal template conditionally injects `otel-collector` sidecars with environment variables, config mounts, and certificate mounts into both worker and API deployments.
mTLS Certificate Provisioning `deploy/helm/templates/vault-secrets.yaml`	New ExternalSecret conditionally syncs Panoptes mTLS certificates (`ca.crt`, `tls.crt`, `tls.key`) from Vault into a Kubernetes Secret mounted by the OTel Collector sidecar with `1h` refresh interval.
Monitoring Dashboard and Documentation `deploy/dashboards/kiwi-temporal/temporal-workflow-health.json`, `deploy/dashboards/README.md`	New Grafana dashboard "Kiwi Temporal — Workflow Health" with 8 panels for workflow/activity health, latency quantiles, and slot saturation. Dashboard README documents provisioning workflow, datasource/folderUID/tagging conventions, direct JSON edit rules, and synchronization requirements between this repo and dashboards repo.
CI/CD Image Mirroring Workflow `.github/workflows/mirror-otel-collector.yml`	New GitHub Actions workflow fetches NVCR credentials from Vault via OIDC, installs `crane`, mirrors the OTel Collector image from Docker Hub to NVCR by user-supplied tag, and validates digest match between source and destination.

Sequence Diagram(s)

sequenceDiagram
  participant TemporalWorker as Temporal Worker
  participant TemporalAPI as Temporal API
  participant OTelCollector as OTel Collector Sidecar
  participant Panoptes as Panoptes OTLP Endpoint
  TemporalWorker->>OTelCollector: OTLP gRPC/HTTP (localhost:4317/4318)
  TemporalAPI->>OTelCollector: OTLP gRPC/HTTP (localhost:4317/4318)
  OTelCollector->>OTelCollector: Process & enrich (K8s attrs, Panoptes routing)
  OTelCollector->>Panoptes: Export metrics & traces (TLS)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

ryanheffernan
dmathur0
spidercensus

Poem

🐰 Hop along, config and code align,
From Temporal flows to metrics so fine,
Sidecars now whisper through OTLP's bright thread,
While dashboards illuminate where workflows have tread!
Observability blooms in the Kubernetes spring! 🌿✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: adding OpenTelemetry instrumentation to emit traces and metrics from Temporal workflows, with an associated ticket reference.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch feat/gninwd-1588-temporal-otel

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

deploy/helm/templates/temporal.yaml (1)

395-457: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a pod-template checksum for the collector ConfigMap.

Lines 395-457 and 519-591 mount {{ $temporalName }}-otel-collector-config, but neither pod template rolls when that ConfigMap changes. Updates to temporal.observability.panoptes.* in deploy/helm/values.yaml or the collector pipeline in deploy/helm/templates/temporal-otel-collector-configmap.yaml will leave the worker/API sidecars on stale config until someone restarts them manually.

Suggested fix

       annotations:
         {{- include "nv-config-manager.authIniChecksum" . | nindent 8 }}
+        checksum/otel-collector-config: {{ include (print $.Template.BasePath "/temporal-otel-collector-configmap.yaml") . | sha256sum }}
         {{- if eq .Values.secrets.method "vault-agent" }}
         {{- include "nv-config-manager.vaultAgent.injectorAnnotations" . | nindent 8 }}
         {{- end }}

Also applies to: 519-591

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/helm/templates/temporal.yaml` around lines 395 - 457, Pod templates
that mount the otel-collector config (volume name otel-collector-config /
ConfigMap name {{ $temporalName }}-otel-collector-config) don't roll when that
ConfigMap changes; add a pod-template annotation that is a checksum of the
rendered collector ConfigMap so updates trigger a restart. In the pod template
metadata for the Temporal worker/API (the sections that mount
otel-collector-config and include the otel sidecar) add an annotation like
checksum/otel-collector-config: {{ include
"nv-config-manager.temporal-otel-collector-configmap" . | sha256sum }} (or the
equivalent helper you have for rendering the collector config + sha256sum) and
apply the same change to the other pod template that mounts {{ $temporalName
}}-otel-collector-config so both roll on ConfigMap changes.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/mirror-otel-collector.yml:
- Line 46: Replace the mutable tag "hashicorp/vault-action@v3" in the workflow
step with a pinned full commit SHA for the hashicorp/vault-action GitHub Action
(e.g., "hashicorp/vault-action@<full-commit-sha>"); update the uses entry to
reference the specific commit SHA and verify the SHA corresponds to the intended
v3 release before committing so the Vault credential-fetching step is pinned
immutably.
- Around line 72-79: The workflow currently injects ${{ inputs.tag }} directly
into shell run steps (used by crane copy and crane digest) which can allow shell
injection; update the workflow to validate and sanitize the tag before any shell
use by either: 1) adding an inputs validation regex in the workflow metadata for
the workflow_dispatch input (e.g., restrict to [A-Za-z0-9_.-]+) so ${{
inputs.tag }} can only contain safe characters, or 2) perform an explicit
runtime whitelist check in a preliminary step (use a short shell snippet or
actions/github-script that tests the tag against a strict regex and fails if it
doesn’t match), then assign the validated value to a safe env var (e.g.,
VALID_TAG) and use that env var in the crane copy and digest commands (the steps
invoking crane copy and the SRC_DIGEST/DST_DIGEST assignments).

In `@deploy/dashboards/kiwi-temporal/temporal-workflow-health.json`:
- Line 184: The PromQL percentage expression can produce Inf/NaN when the
denominator is zero; modify the expression that computes 100 * failed /
(completed + failed) to guard the denominator using clamp_min or a tiny epsilon,
e.g. replace the denominator
(sum(rate(temporal_workflow_completed_total{...}[5m])) +
sum(rate(temporal_workflow_failed_total{...}[5m]))) with clamp_min(..., 1e-9) so
PromQL never divides by zero; apply the same change to the other similar
expressions referencing temporal_workflow_failed_total and
temporal_workflow_completed_total (also at the block around lines 418-429) to
harden all percentage formulas.
- Line 12: Remove the top-level ".version" field from the dashboard JSON (the
"version": 1 entry in temporal-workflow-health.json) so the file matches the
direct-Git dashboard contract; locate the "version" key in the JSON and delete
that property entirely, leaving all other fields intact to avoid
validation/promotion drift.

In `@deploy/dashboards/README.md`:
- Line 6: Remove the hardcoded developer-specific path "(local clone:
`/Users/kiwueke/Documents/dashboards`)" from the README and replace it with a
portable placeholder (e.g., "(local clone: /path/to/dashboards)" or instruct
users to clone into their preferred directory or use $HOME), so the
documentation is not tied to a single user's filesystem and remains reusable for
all contributors.

In `@deploy/helm/templates/vault-secrets.yaml`:
- Around line 1120-1154: The ExternalSecret resource for panoptes Mtls (metadata
name using {{ $temporalName }}-panoptes-mtls-eso and target {{ $temporalName
}}-panoptes-mtls) is rendered even when ESO mode is not enabled, causing
reconciliation failures because the SecretStore (created only under the $anyEso
block) may not exist; wrap the existing block that creates this ExternalSecret
with the same ESO-mode guard used earlier (the $anyEso / secrets.method /
.Values.secrets.vaultAgent.esoForOperators check) so the ExternalSecret is only
emitted when ESO is enabled, or alternatively add the same conditional used for
creating the SecretStore around this section to ensure parity between
SecretStore and ExternalSecret creation.

In `@src/nv_config_manager/temporal/telemetry.py`:
- Around line 49-63: The telemetry setup currently defaults OTLP_EXPORTER to
"http://localhost:4317" which forces export even when observability should be
disabled; modify setup_telemetry() to check for an explicit OTLP endpoint or an
enable flag (e.g., OTEL_EXPORTER_OTLP_ENDPOINT present and/or OTEL_ENABLED)
before creating the TracerProvider/OTLPSpanExporter and before constructing
Runtime with TelemetryConfig/OpenTelemetryConfig; if the endpoint is missing or
observability is disabled, skip provider/span processor setup and return a
Runtime without OpenTelemetry metrics configured, preserving use of
OTEL_SERVICE_NAME and ENVIRONMENT when telemetry is enabled.

---

Outside diff comments:
In `@deploy/helm/templates/temporal.yaml`:
- Around line 395-457: Pod templates that mount the otel-collector config
(volume name otel-collector-config / ConfigMap name {{ $temporalName
}}-otel-collector-config) don't roll when that ConfigMap changes; add a
pod-template annotation that is a checksum of the rendered collector ConfigMap
so updates trigger a restart. In the pod template metadata for the Temporal
worker/API (the sections that mount otel-collector-config and include the otel
sidecar) add an annotation like checksum/otel-collector-config: {{ include
"nv-config-manager.temporal-otel-collector-configmap" . | sha256sum }} (or the
equivalent helper you have for rendering the collector config + sha256sum) and
apply the same change to the other pod template that mounts {{ $temporalName
}}-otel-collector-config so both roll on ConfigMap changes.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7e4b64e8-03e6-4b7a-a1d9-0dca0c294958

📥 Commits

Reviewing files that changed from the base of the PR and between 7c25886 and ec18260.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (12)

.github/workflows/mirror-otel-collector.yml
deploy/dashboards/README.md
deploy/dashboards/kiwi-temporal/temporal-workflow-health.json
deploy/helm/templates/temporal-otel-collector-configmap.yaml
deploy/helm/templates/temporal.yaml
deploy/helm/templates/vault-secrets.yaml
deploy/helm/values.yaml
pyproject.toml
src/nv_config_manager/temporal/api/main.py
src/nv_config_manager/temporal/api/workflow_v1.py
src/nv_config_manager/temporal/telemetry.py
src/nv_config_manager/temporal/worker/main.py

coderabbitai · 2026-06-12T15:45:05Z

+        uses: nv-gha-runners/setup-proxy-cache@14229018fe157c83e03c008f27d183d8e99bc67c
+
+      - name: Fetch nvcr registry token from Vault
+        uses: hashicorp/vault-action@v3


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE=".github/workflows/mirror-otel-collector.yml" echo "== Checking $FILE ==" if [ ! -f "$FILE" ]; then echo "Missing file: $FILE" exit 1 fi nl -ba "$FILE" | sed -n '35,70p' echo echo "== Searching for hashicorp/vault-action@ across repo ==" rg -n --hidden --no-ignore-vcs "hashicorp/vault-action@" .github/workflows || true

Repository: NVIDIA/nv-config-manager

Length of output: 189

🏁 Script executed:

#!/bin/bash set -euo pipefail FILE=".github/workflows/mirror-otel-collector.yml" echo "== Checking $FILE ==" if [ ! -f "$FILE" ]; then echo "Missing file: $FILE" exit 1 fi awk 'NR>=35 && NR<=70 {printf "%d:%s\n", NR, $0}' "$FILE" echo echo "== Searching for hashicorp/vault-action@ across repo ==" rg -n --hidden --no-ignore-vcs "hashicorp/vault-action@" . || true

Repository: NVIDIA/nv-config-manager

Length of output: 1707

Pin hashicorp/vault-action to a full commit SHA (avoid @v3).

.github/workflows/mirror-otel-collector.yml:46 uses hashicorp/vault-action@v3; tag-based references are mutable and weaken supply-chain guarantees for a Vault credential-fetching step—replace @v3 with a specific commit SHA.

🧰 Tools

🪛 zizmor (1.25.2)

[error] 46-46: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/mirror-otel-collector.yml at line 46, Replace the mutable tag "hashicorp/vault-action@v3" in the workflow step with a pinned full commit SHA for the hashicorp/vault-action GitHub Action (e.g., "hashicorp/vault-action@<full-commit-sha>"); update the uses entry to reference the specific commit SHA and verify the SHA corresponds to the intended v3 release before committing so the Vault credential-fetching step is pinned immutably.

Source: Linters/SAST tools

coderabbitai · 2026-06-12T15:45:05Z

+      - name: Mirror image
+        run: |
+          crane copy "${SOURCE_IMAGE}:${{ inputs.tag }}" "${DEST_IMAGE}:${{ inputs.tag }}"
+
+      - name: Verify digest matches source
+        run: |
+          SRC_DIGEST=$(crane digest "${SOURCE_IMAGE}:${{ inputs.tag }}")
+          DST_DIGEST=$(crane digest "${DEST_IMAGE}:${{ inputs.tag }}")


⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Validate and sanitize workflow_dispatch tag before shell usage.

Line 74, Line 78, and Line 79 expand ${{ inputs.tag }} directly inside run scripts. A crafted tag (for example including command substitution) can execute arbitrary shell on the runner.

Suggested hardening diff

jobs: mirror: @@ steps: + - name: Validate mirror tag + id: validate_tag + run: | + set -euo pipefail + TAG='${{ inputs.tag }}' + if [[ ! "${TAG}" =~ ^[0-9]+\.[0-9]+\.[0-9]+([-.][0-9A-Za-z._-]+)?$ ]]; then + echo "Invalid tag format: ${TAG}" >&2 + exit 1 + fi + echo "tag=${TAG}" >> "${GITHUB_OUTPUT}" @@ - name: Mirror image run: | - crane copy "${SOURCE_IMAGE}:${{ inputs.tag }}" "${DEST_IMAGE}:${{ inputs.tag }}" + crane copy "${SOURCE_IMAGE}:${{ steps.validate_tag.outputs.tag }}" "${DEST_IMAGE}:${{ steps.validate_tag.outputs.tag }}" @@ - name: Verify digest matches source run: | - SRC_DIGEST=$(crane digest "${SOURCE_IMAGE}:${{ inputs.tag }}") - DST_DIGEST=$(crane digest "${DEST_IMAGE}:${{ inputs.tag }}") + SRC_DIGEST=$(crane digest "${SOURCE_IMAGE}:${{ steps.validate_tag.outputs.tag }}") + DST_DIGEST=$(crane digest "${DEST_IMAGE}:${{ steps.validate_tag.outputs.tag }}")

🧰 Tools

🪛 zizmor (1.25.2)

[error] 74-74: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

[error] 74-74: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

[error] 78-78: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

[error] 79-79: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In @.github/workflows/mirror-otel-collector.yml around lines 72 - 79, The workflow currently injects ${{ inputs.tag }} directly into shell run steps (used by crane copy and crane digest) which can allow shell injection; update the workflow to validate and sanitize the tag before any shell use by either: 1) adding an inputs validation regex in the workflow metadata for the workflow_dispatch input (e.g., restrict to [A-Za-z0-9_.-]+) so ${{ inputs.tag }} can only contain safe characters, or 2) perform an explicit runtime whitelist check in a preliminary step (use a short shell snippet or actions/github-script that tests the tag against a strict regex and fails if it doesn’t match), then assign the validated value to a safe env var (e.g., VALID_TAG) and use that env var in the crane copy and digest commands (the steps invoking crane copy and the SRC_DIGEST/DST_DIGEST assignments).

Source: Linters/SAST tools

coderabbitai · 2026-06-12T15:45:05Z

+    "gni"
+  ],
+  "schemaVersion": 39,
+  "version": 1,


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the dashboard .version field to match the documented direct-Git contract.

Line 12 sets "version": 1, but the same PR’s workflow rules require direct Git dashboard JSON to omit .version. This mismatch will create drift and can break validation/promotion workflows.

Suggested fix

- "version": 1,

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

"version": 1,

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/dashboards/kiwi-temporal/temporal-workflow-health.json` at line 12, Remove the top-level ".version" field from the dashboard JSON (the "version": 1 entry in temporal-workflow-health.json) so the file matches the direct-Git dashboard contract; locate the "version" key in the JSON and delete that property entirely, leaving all other fields intact to avoid validation/promotion drift.

coderabbitai · 2026-06-12T15:45:05Z

+            "uid": "[[.datasourceName]]"
+          },
+          "editorMode": "code",
+          "expr": "100 * sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) / (sum(rate(temporal_workflow_completed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) + sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])))",


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden percentage formulas against zero denominators.

These ratios can emit Inf/NaN when no matching series exist in the selected time window/environment, which causes misleading spikes/gaps in the dashboard.

Suggested fix

- "expr": "100 * sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) / (sum(rate(temporal_workflow_completed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) + sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])))", + "expr": "100 * sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) / clamp_min((sum(rate(temporal_workflow_completed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) + sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m]))), 1e-9)",

- "expr": "100 * sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}) / (sum(temporal_worker_task_slots_available{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}) + sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}))", + "expr": "100 * sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}) / clamp_min((sum(temporal_worker_task_slots_available{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}) + sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"})), 1e-9)",

- "expr": "100 * sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}) / (sum(temporal_worker_task_slots_available{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}) + sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}))", + "expr": "100 * sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}) / clamp_min((sum(temporal_worker_task_slots_available{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}) + sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"})), 1e-9)",

Also applies to: 418-429

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/dashboards/kiwi-temporal/temporal-workflow-health.json` at line 184, The PromQL percentage expression can produce Inf/NaN when the denominator is zero; modify the expression that computes 100 * failed / (completed + failed) to guard the denominator using clamp_min or a tiny epsilon, e.g. replace the denominator (sum(rate(temporal_workflow_completed_total{...}[5m])) + sum(rate(temporal_workflow_failed_total{...}[5m]))) with clamp_min(..., 1e-9) so PromQL never divides by zero; apply the same change to the other similar expressions referencing temporal_workflow_failed_total and temporal_workflow_completed_total (also at the block around lines 418-429) to harden all percentage formulas.

coderabbitai · 2026-06-12T15:45:06Z

+Dashboard JSON files here mirror what lives (or will live) in the dashboards repo:
+
+> `gitlab-master.nvidia.com/nsv-network/gni-dev/observability/deploy/dashboards`
+> (local clone: `/Users/kiwueke/Documents/dashboards`)


⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid committing a user-specific local filesystem path.

Line 6 hardcodes one developer’s local path, which is not portable for other contributors and tends to go stale in docs.

Suggested fix

-> (local clone: `/Users/kiwueke/Documents/dashboards`) +> (example local clone path; update to your environment)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

> (local clone: `/Users/kiwueke/Documents/dashboards`)

> (example local clone path; update to your environment)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/dashboards/README.md` at line 6, Remove the hardcoded developer-specific path "(local clone: `/Users/kiwueke/Documents/dashboards`)" from the README and replace it with a portable placeholder (e.g., "(local clone: /path/to/dashboards)" or instruct users to clone into their preferred directory or use $HOME), so the documentation is not tied to a single user's filesystem and remains reusable for all contributors.

coderabbitai · 2026-06-12T15:45:06Z

+{{- if and .Values.temporal.enabled .Values.temporal.observability.enabled }}
+{{- if .Values.secrets.vault.paths.panoptesMtls }}
+{{- if .Values.secrets.vault.paths.panoptesMtls.path }}
+{{- $temporalName := include "nv-config-manager.componentName" (dict "root" . "component" "temporal") }}
+{{- $vaultSecretStoreName := include "nv-config-manager.vaultSecretStoreName" . }}
+---
+apiVersion: external-secrets.io/v1
+kind: ExternalSecret
+metadata:
+  name: {{ $temporalName }}-panoptes-mtls-eso
+  namespace: {{ .Values.global.namespace }}
+spec:
+  secretStoreRef:
+    name: {{ $vaultSecretStoreName }}
+    kind: SecretStore
+  target:
+    name: {{ $temporalName }}-panoptes-mtls
+    creationPolicy: Owner
+  data:
+    - secretKey: ca.crt
+      remoteRef:
+        key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"
+        property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "caCert") }}
+    - secretKey: tls.crt
+      remoteRef:
+        key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"
+        property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsCert") }}
+    - secretKey: tls.key
+      remoteRef:
+        key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"
+        property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsKey") }}
+  refreshInterval: 1h
+{{- end }}
+{{- end }}
+{{- end }}


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate the Panoptes ExternalSecret behind the same ESO-mode check as the rest of this file.

Lines 1120-1154 render an ExternalSecret even when secrets.method is kubernetes or vault-agent with secrets.vaultAgent.esoForOperators: false, but the referenced SecretStore is only created inside the earlier $anyEso block. In those modes this resource cannot reconcile, and the Temporal worker/API mounts in deploy/helm/templates/temporal.yaml still expect {{ $temporalName }}-panoptes-mtls to exist.

Suggested fix

-{{- if and .Values.temporal.enabled .Values.temporal.observability.enabled }} +{{- if and $anyEso .Values.temporal.enabled .Values.temporal.observability.enabled }} {{- if .Values.secrets.vault.paths.panoptesMtls }} {{- if .Values.secrets.vault.paths.panoptesMtls.path }}

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

{{- if and .Values.temporal.enabled .Values.temporal.observability.enabled }}

{{- if .Values.secrets.vault.paths.panoptesMtls }}

{{- if .Values.secrets.vault.paths.panoptesMtls.path }}

{{- $temporalName := include "nv-config-manager.componentName" (dict "root" . "component" "temporal") }}

{{- $vaultSecretStoreName := include "nv-config-manager.vaultSecretStoreName" . }}

---

apiVersion: external-secrets.io/v1

kind: ExternalSecret

metadata:

name: {{ $temporalName }}-panoptes-mtls-eso

namespace: {{ .Values.global.namespace }}

spec:

secretStoreRef:

name: {{ $vaultSecretStoreName }}

kind: SecretStore

target:

name: {{ $temporalName }}-panoptes-mtls

creationPolicy: Owner

data:

- secretKey: ca.crt

remoteRef:

key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"

property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "caCert") }}

- secretKey: tls.crt

remoteRef:

key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"

property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsCert") }}

- secretKey: tls.key

remoteRef:

key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"

property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsKey") }}

refreshInterval: 1h

{{- end }}

{{- end }}

{{- end }}

{{- if and $anyEso .Values.temporal.enabled .Values.temporal.observability.enabled }}

{{- if .Values.secrets.vault.paths.panoptesMtls }}

{{- if .Values.secrets.vault.paths.panoptesMtls.path }}

{{- $temporalName := include "nv-config-manager.componentName" (dict "root" . "component" "temporal") }}

{{- $vaultSecretStoreName := include "nv-config-manager.vaultSecretStoreName" . }}

---

apiVersion: external-secrets.io/v1

kind: ExternalSecret

metadata:

name: {{ $temporalName }}-panoptes-mtls-eso

namespace: {{ .Values.global.namespace }}

spec:

secretStoreRef:

name: {{ $vaultSecretStoreName }}

kind: SecretStore

target:

name: {{ $temporalName }}-panoptes-mtls

creationPolicy: Owner

data:

- secretKey: ca.crt

remoteRef:

key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"

property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "caCert") }}

- secretKey: tls.crt

remoteRef:

key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"

property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsCert") }}

- secretKey: tls.key

remoteRef:

key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"

property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsKey") }}

refreshInterval: 1h

{{- end }}

{{- end }}

{{- end }}

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/helm/templates/vault-secrets.yaml` around lines 1120 - 1154, The ExternalSecret resource for panoptes Mtls (metadata name using {{ $temporalName }}-panoptes-mtls-eso and target {{ $temporalName }}-panoptes-mtls) is rendered even when ESO mode is not enabled, causing reconciliation failures because the SecretStore (created only under the $anyEso block) may not exist; wrap the existing block that creates this ExternalSecret with the same ESO-mode guard used earlier (the $anyEso / secrets.method / .Values.secrets.vaultAgent.esoForOperators check) so the ExternalSecret is only emitted when ESO is enabled, or alternatively add the same conditional used for creating the SecretStore around this section to ensure parity between SecretStore and ExternalSecret creation.

coderabbitai · 2026-06-12T15:45:06Z

+    otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
+    resolved_name = os.getenv("OTEL_SERVICE_NAME", service_name)
+
+    resource = Resource.create(
+        {
+            "service.name": resolved_name,
+            "deployment.environment": os.getenv("ENVIRONMENT", "unknown"),
+        }
+    )
+    provider = TracerProvider(resource=resource)
+    provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint)))
+    trace.set_global_tracer_provider(provider)
+
+    _runtime = Runtime(telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url=otlp_endpoint)))
+    return _runtime


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor the disabled observability path here instead of defaulting to localhost.

The PR contract says observability is gated off by default, but this code still enables OTLP export whenever setup_telemetry() is called because it falls back to http://localhost:4317. Since both src/nv_config_manager/temporal/api/main.py and src/nv_config_manager/temporal/worker/main.py invoke this unconditionally, deployments without the sidecar will keep trying to export to a nonexistent local collector.

Suggested fix

def setup_telemetry(service_name: str) -> Runtime: """Initialize OTel tracing and build a Temporal Runtime with OTel metrics. @@ """ global _runtime - otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317") + otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT") + if not otlp_endpoint: + _runtime = Runtime(telemetry=TelemetryConfig()) + return _runtime + resolved_name = os.getenv("OTEL_SERVICE_NAME", service_name) @@ provider = TracerProvider(resource=resource) provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))) trace.set_global_tracer_provider(provider) _runtime = Runtime(telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url=otlp_endpoint))) return _runtime

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")

resolved_name = os.getenv("OTEL_SERVICE_NAME", service_name)

resource = Resource.create(

{

"service.name": resolved_name,

"deployment.environment": os.getenv("ENVIRONMENT", "unknown"),

}

)

provider = TracerProvider(resource=resource)

provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint)))

trace.set_global_tracer_provider(provider)

_runtime = Runtime(telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url=otlp_endpoint)))

return _runtime

otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT")

if not otlp_endpoint:

_runtime = Runtime(telemetry=TelemetryConfig())

return _runtime

resolved_name = os.getenv("OTEL_SERVICE_NAME", service_name)

resource = Resource.create(

{

"service.name": resolved_name,

"deployment.environment": os.getenv("ENVIRONMENT", "unknown"),

}

)

provider = TracerProvider(resource=resource)

provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint)))

trace.set_global_tracer_provider(provider)

_runtime = Runtime(telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url=otlp_endpoint)))

return _runtime

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/nv_config_manager/temporal/telemetry.py` around lines 49 - 63, The telemetry setup currently defaults OTLP_EXPORTER to "http://localhost:4317" which forces export even when observability should be disabled; modify setup_telemetry() to check for an explicit OTLP endpoint or an enable flag (e.g., OTEL_EXPORTER_OTLP_ENDPOINT present and/or OTEL_ENABLED) before creating the TracerProvider/OTLPSpanExporter and before constructing Runtime with TelemetryConfig/OpenTelemetryConfig; if the endpoint is missing or observability is disabled, skip provider/span processor setup and return a Runtime without OpenTelemetry metrics configured, preserving use of OTEL_SERVICE_NAME and ENVIRONMENT when telemetry is enabled.

coderabbitai Bot reviewed Jun 12, 2026

View reviewed changes

	> (local clone: `/Users/kiwueke/Documents/dashboards`)
	> (example local clone path; update to your environment)

Conversation

kiwueke2 commented Jun 12, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

copy-pr-bot Bot commented Jun 12, 2026

Uh oh!

coderabbitai Bot commented Jun 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

kiwueke2 commented Jun 12, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Jun 12, 2026 •

edited

Loading