Skip to content

feat(observability): emit OTel traces and metrics from Temporal workflows (GNINWD-1588)#44

Draft
kiwueke2 wants to merge 1 commit into
mainfrom
feat/gninwd-1588-temporal-otel
Draft

feat(observability): emit OTel traces and metrics from Temporal workflows (GNINWD-1588)#44
kiwueke2 wants to merge 1 commit into
mainfrom
feat/gninwd-1588-temporal-otel

Conversation

@kiwueke2

@kiwueke2 kiwueke2 commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Instruments Temporal worker and API with OpenTelemetry: telemetry.py bootstrap, TracingInterceptor on worker/client/API, FastAPIInstrumentor, otel-collector sidecar (OTLP receiver -> k8sattributes + Panoptes attribute processors -> mTLS OTLP/HTTP export), Panoptes mTLS ExternalSecret, and image-mirror CI workflow. All gated on temporal.observability.enabled: false — zero behaviour change to existing deployments.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added OpenTelemetry observability integration for Temporal metrics and trace collection
    • Added Grafana dashboard for monitoring Temporal workflow health with key metrics
  • Documentation

    • Added documentation for Grafana dashboard management and provisioning workflow
  • Chores

    • Added GitHub Actions workflow for mirroring OpenTelemetry Collector images
    • Updated dependencies to support OpenTelemetry instrumentation

…lows (GNINWD-1588)

Instruments Temporal worker and API with OpenTelemetry: telemetry.py bootstrap, TracingInterceptor on worker/client/API, FastAPIInstrumentor, otel-collector sidecar (OTLP receiver -> k8sattributes + Panoptes attribute processors -> mTLS OTLP/HTTP export), Panoptes mTLS ExternalSecret, and image-mirror CI workflow. All gated on temporal.observability.enabled: false.

Signed-off-by: Kezie Iwueke <kiwueke@nvidia.com>
@copy-pr-bot

copy-pr-bot Bot commented Jun 12, 2026

Copy link
Copy Markdown

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai

coderabbitai Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

This PR adds OpenTelemetry observability to Temporal components by introducing a telemetry bootstrap module, instrumenting the FastAPI app and Temporal client/worker with tracing interceptors, deploying an OTel Collector sidecar via Helm with Panoptes OTLP export over mTLS, provisioning certificates via Vault, adding a Grafana dashboard for workflow health, and supporting CI/CD for the OTel Collector image mirror.

Changes

Temporal OpenTelemetry Observability

Layer / File(s) Summary
OTel Telemetry Module and Dependencies
src/nv_config_manager/temporal/telemetry.py, pyproject.toml
New telemetry.py module provides setup_telemetry(service_name) and get_runtime() functions to configure OTLP-based tracing and Temporal runtime with cached state. Updated pyproject.toml to add temporalio[opentelemetry] extra and OpenTelemetry exporter/FastAPI instrumentation dependencies.
Application Instrumentation
src/nv_config_manager/temporal/api/main.py, src/nv_config_manager/temporal/api/workflow_v1.py, src/nv_config_manager/temporal/worker/main.py
FastAPI entrypoint now calls setup_telemetry() and instruments app with FastAPIInstrumentor. Both API and worker client/worker configurations import and use TracingInterceptor() and wire the cached telemetry runtime into client and worker connections.
Helm Configuration Values
deploy/helm/values.yaml
Added global.images.otelCollector image definition, secrets.vault.paths.panoptesMtls vault secret path with certificate key mappings, and temporal.observability feature config with OTLP endpoint, Panoptes metadata fields, and sidecar resource limits.
OTel Collector ConfigMap and Sidecar Injection
deploy/helm/templates/temporal-otel-collector-configmap.yaml, deploy/helm/templates/temporal.yaml
New ConfigMap template defines OTLP receivers (gRPC/HTTP), Kubernetes/Panoptes attribute processors, resource enrichment, batching, and TLS-enabled OTLP/HTTP exporter. Updated Temporal template conditionally injects otel-collector sidecars with environment variables, config mounts, and certificate mounts into both worker and API deployments.
mTLS Certificate Provisioning
deploy/helm/templates/vault-secrets.yaml
New ExternalSecret conditionally syncs Panoptes mTLS certificates (ca.crt, tls.crt, tls.key) from Vault into a Kubernetes Secret mounted by the OTel Collector sidecar with 1h refresh interval.
Monitoring Dashboard and Documentation
deploy/dashboards/kiwi-temporal/temporal-workflow-health.json, deploy/dashboards/README.md
New Grafana dashboard "Kiwi Temporal — Workflow Health" with 8 panels for workflow/activity health, latency quantiles, and slot saturation. Dashboard README documents provisioning workflow, datasource/folderUID/tagging conventions, direct JSON edit rules, and synchronization requirements between this repo and dashboards repo.
CI/CD Image Mirroring Workflow
.github/workflows/mirror-otel-collector.yml
New GitHub Actions workflow fetches NVCR credentials from Vault via OIDC, installs crane, mirrors the OTel Collector image from Docker Hub to NVCR by user-supplied tag, and validates digest match between source and destination.

Sequence Diagram(s)

sequenceDiagram
  participant TemporalWorker as Temporal Worker
  participant TemporalAPI as Temporal API
  participant OTelCollector as OTel Collector Sidecar
  participant Panoptes as Panoptes OTLP Endpoint
  TemporalWorker->>OTelCollector: OTLP gRPC/HTTP (localhost:4317/4318)
  TemporalAPI->>OTelCollector: OTLP gRPC/HTTP (localhost:4317/4318)
  OTelCollector->>OTelCollector: Process & enrich (K8s attrs, Panoptes routing)
  OTelCollector->>Panoptes: Export metrics & traces (TLS)
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • ryanheffernan
  • dmathur0
  • spidercensus

Poem

🐰 Hop along, config and code align,
From Temporal flows to metrics so fine,
Sidecars now whisper through OTLP's bright thread,
While dashboards illuminate where workflows have tread!
Observability blooms in the Kubernetes spring! 🌿✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding OpenTelemetry instrumentation to emit traces and metrics from Temporal workflows, with an associated ticket reference.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/gninwd-1588-temporal-otel

Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
deploy/helm/templates/temporal.yaml (1)

395-457: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add a pod-template checksum for the collector ConfigMap.

Lines 395-457 and 519-591 mount {{ $temporalName }}-otel-collector-config, but neither pod template rolls when that ConfigMap changes. Updates to temporal.observability.panoptes.* in deploy/helm/values.yaml or the collector pipeline in deploy/helm/templates/temporal-otel-collector-configmap.yaml will leave the worker/API sidecars on stale config until someone restarts them manually.

Suggested fix
       annotations:
         {{- include "nv-config-manager.authIniChecksum" . | nindent 8 }}
+        checksum/otel-collector-config: {{ include (print $.Template.BasePath "/temporal-otel-collector-configmap.yaml") . | sha256sum }}
         {{- if eq .Values.secrets.method "vault-agent" }}
         {{- include "nv-config-manager.vaultAgent.injectorAnnotations" . | nindent 8 }}
         {{- end }}

Also applies to: 519-591

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/helm/templates/temporal.yaml` around lines 395 - 457, Pod templates
that mount the otel-collector config (volume name otel-collector-config /
ConfigMap name {{ $temporalName }}-otel-collector-config) don't roll when that
ConfigMap changes; add a pod-template annotation that is a checksum of the
rendered collector ConfigMap so updates trigger a restart. In the pod template
metadata for the Temporal worker/API (the sections that mount
otel-collector-config and include the otel sidecar) add an annotation like
checksum/otel-collector-config: {{ include
"nv-config-manager.temporal-otel-collector-configmap" . | sha256sum }} (or the
equivalent helper you have for rendering the collector config + sha256sum) and
apply the same change to the other pod template that mounts {{ $temporalName
}}-otel-collector-config so both roll on ConfigMap changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/mirror-otel-collector.yml:
- Line 46: Replace the mutable tag "hashicorp/vault-action@v3" in the workflow
step with a pinned full commit SHA for the hashicorp/vault-action GitHub Action
(e.g., "hashicorp/vault-action@<full-commit-sha>"); update the uses entry to
reference the specific commit SHA and verify the SHA corresponds to the intended
v3 release before committing so the Vault credential-fetching step is pinned
immutably.
- Around line 72-79: The workflow currently injects ${{ inputs.tag }} directly
into shell run steps (used by crane copy and crane digest) which can allow shell
injection; update the workflow to validate and sanitize the tag before any shell
use by either: 1) adding an inputs validation regex in the workflow metadata for
the workflow_dispatch input (e.g., restrict to [A-Za-z0-9_.-]+) so ${{
inputs.tag }} can only contain safe characters, or 2) perform an explicit
runtime whitelist check in a preliminary step (use a short shell snippet or
actions/github-script that tests the tag against a strict regex and fails if it
doesn’t match), then assign the validated value to a safe env var (e.g.,
VALID_TAG) and use that env var in the crane copy and digest commands (the steps
invoking crane copy and the SRC_DIGEST/DST_DIGEST assignments).

In `@deploy/dashboards/kiwi-temporal/temporal-workflow-health.json`:
- Line 184: The PromQL percentage expression can produce Inf/NaN when the
denominator is zero; modify the expression that computes 100 * failed /
(completed + failed) to guard the denominator using clamp_min or a tiny epsilon,
e.g. replace the denominator
(sum(rate(temporal_workflow_completed_total{...}[5m])) +
sum(rate(temporal_workflow_failed_total{...}[5m]))) with clamp_min(..., 1e-9) so
PromQL never divides by zero; apply the same change to the other similar
expressions referencing temporal_workflow_failed_total and
temporal_workflow_completed_total (also at the block around lines 418-429) to
harden all percentage formulas.
- Line 12: Remove the top-level ".version" field from the dashboard JSON (the
"version": 1 entry in temporal-workflow-health.json) so the file matches the
direct-Git dashboard contract; locate the "version" key in the JSON and delete
that property entirely, leaving all other fields intact to avoid
validation/promotion drift.

In `@deploy/dashboards/README.md`:
- Line 6: Remove the hardcoded developer-specific path "(local clone:
`/Users/kiwueke/Documents/dashboards`)" from the README and replace it with a
portable placeholder (e.g., "(local clone: /path/to/dashboards)" or instruct
users to clone into their preferred directory or use $HOME), so the
documentation is not tied to a single user's filesystem and remains reusable for
all contributors.

In `@deploy/helm/templates/vault-secrets.yaml`:
- Around line 1120-1154: The ExternalSecret resource for panoptes Mtls (metadata
name using {{ $temporalName }}-panoptes-mtls-eso and target {{ $temporalName
}}-panoptes-mtls) is rendered even when ESO mode is not enabled, causing
reconciliation failures because the SecretStore (created only under the $anyEso
block) may not exist; wrap the existing block that creates this ExternalSecret
with the same ESO-mode guard used earlier (the $anyEso / secrets.method /
.Values.secrets.vaultAgent.esoForOperators check) so the ExternalSecret is only
emitted when ESO is enabled, or alternatively add the same conditional used for
creating the SecretStore around this section to ensure parity between
SecretStore and ExternalSecret creation.

In `@src/nv_config_manager/temporal/telemetry.py`:
- Around line 49-63: The telemetry setup currently defaults OTLP_EXPORTER to
"http://localhost:4317" which forces export even when observability should be
disabled; modify setup_telemetry() to check for an explicit OTLP endpoint or an
enable flag (e.g., OTEL_EXPORTER_OTLP_ENDPOINT present and/or OTEL_ENABLED)
before creating the TracerProvider/OTLPSpanExporter and before constructing
Runtime with TelemetryConfig/OpenTelemetryConfig; if the endpoint is missing or
observability is disabled, skip provider/span processor setup and return a
Runtime without OpenTelemetry metrics configured, preserving use of
OTEL_SERVICE_NAME and ENVIRONMENT when telemetry is enabled.

---

Outside diff comments:
In `@deploy/helm/templates/temporal.yaml`:
- Around line 395-457: Pod templates that mount the otel-collector config
(volume name otel-collector-config / ConfigMap name {{ $temporalName
}}-otel-collector-config) don't roll when that ConfigMap changes; add a
pod-template annotation that is a checksum of the rendered collector ConfigMap
so updates trigger a restart. In the pod template metadata for the Temporal
worker/API (the sections that mount otel-collector-config and include the otel
sidecar) add an annotation like checksum/otel-collector-config: {{ include
"nv-config-manager.temporal-otel-collector-configmap" . | sha256sum }} (or the
equivalent helper you have for rendering the collector config + sha256sum) and
apply the same change to the other pod template that mounts {{ $temporalName
}}-otel-collector-config so both roll on ConfigMap changes.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7e4b64e8-03e6-4b7a-a1d9-0dca0c294958

📥 Commits

Reviewing files that changed from the base of the PR and between 7c25886 and ec18260.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (12)
  • .github/workflows/mirror-otel-collector.yml
  • deploy/dashboards/README.md
  • deploy/dashboards/kiwi-temporal/temporal-workflow-health.json
  • deploy/helm/templates/temporal-otel-collector-configmap.yaml
  • deploy/helm/templates/temporal.yaml
  • deploy/helm/templates/vault-secrets.yaml
  • deploy/helm/values.yaml
  • pyproject.toml
  • src/nv_config_manager/temporal/api/main.py
  • src/nv_config_manager/temporal/api/workflow_v1.py
  • src/nv_config_manager/temporal/telemetry.py
  • src/nv_config_manager/temporal/worker/main.py

uses: nv-gha-runners/setup-proxy-cache@14229018fe157c83e03c008f27d183d8e99bc67c

- name: Fetch nvcr registry token from Vault
uses: hashicorp/vault-action@v3

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE=".github/workflows/mirror-otel-collector.yml"
echo "== Checking $FILE =="
if [ ! -f "$FILE" ]; then
  echo "Missing file: $FILE"
  exit 1
fi

nl -ba "$FILE" | sed -n '35,70p'

echo
echo "== Searching for hashicorp/vault-action@ across repo =="
rg -n --hidden --no-ignore-vcs "hashicorp/vault-action@" .github/workflows || true

Repository: NVIDIA/nv-config-manager

Length of output: 189


🏁 Script executed:

#!/bin/bash
set -euo pipefail

FILE=".github/workflows/mirror-otel-collector.yml"
echo "== Checking $FILE =="
if [ ! -f "$FILE" ]; then
  echo "Missing file: $FILE"
  exit 1
fi

awk 'NR>=35 && NR<=70 {printf "%d:%s\n", NR, $0}' "$FILE"

echo
echo "== Searching for hashicorp/vault-action@ across repo =="
rg -n --hidden --no-ignore-vcs "hashicorp/vault-action@" . || true

Repository: NVIDIA/nv-config-manager

Length of output: 1707


Pin hashicorp/vault-action to a full commit SHA (avoid @v3).

.github/workflows/mirror-otel-collector.yml:46 uses hashicorp/vault-action@v3; tag-based references are mutable and weaken supply-chain guarantees for a Vault credential-fetching step—replace @v3 with a specific commit SHA.

🧰 Tools
🪛 zizmor (1.25.2)

[error] 46-46: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)

(unpinned-uses)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/mirror-otel-collector.yml at line 46, Replace the mutable
tag "hashicorp/vault-action@v3" in the workflow step with a pinned full commit
SHA for the hashicorp/vault-action GitHub Action (e.g.,
"hashicorp/vault-action@<full-commit-sha>"); update the uses entry to reference
the specific commit SHA and verify the SHA corresponds to the intended v3
release before committing so the Vault credential-fetching step is pinned
immutably.

Source: Linters/SAST tools

Comment on lines +72 to +79
- name: Mirror image
run: |
crane copy "${SOURCE_IMAGE}:${{ inputs.tag }}" "${DEST_IMAGE}:${{ inputs.tag }}"

- name: Verify digest matches source
run: |
SRC_DIGEST=$(crane digest "${SOURCE_IMAGE}:${{ inputs.tag }}")
DST_DIGEST=$(crane digest "${DEST_IMAGE}:${{ inputs.tag }}")

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Validate and sanitize workflow_dispatch tag before shell usage.

Line 74, Line 78, and Line 79 expand ${{ inputs.tag }} directly inside run scripts. A crafted tag (for example including command substitution) can execute arbitrary shell on the runner.

Suggested hardening diff
 jobs:
   mirror:
@@
     steps:
+      - name: Validate mirror tag
+        id: validate_tag
+        run: |
+          set -euo pipefail
+          TAG='${{ inputs.tag }}'
+          if [[ ! "${TAG}" =~ ^[0-9]+\.[0-9]+\.[0-9]+([-.][0-9A-Za-z._-]+)?$ ]]; then
+            echo "Invalid tag format: ${TAG}" >&2
+            exit 1
+          fi
+          echo "tag=${TAG}" >> "${GITHUB_OUTPUT}"
@@
       - name: Mirror image
         run: |
-          crane copy "${SOURCE_IMAGE}:${{ inputs.tag }}" "${DEST_IMAGE}:${{ inputs.tag }}"
+          crane copy "${SOURCE_IMAGE}:${{ steps.validate_tag.outputs.tag }}" "${DEST_IMAGE}:${{ steps.validate_tag.outputs.tag }}"
@@
       - name: Verify digest matches source
         run: |
-          SRC_DIGEST=$(crane digest "${SOURCE_IMAGE}:${{ inputs.tag }}")
-          DST_DIGEST=$(crane digest "${DEST_IMAGE}:${{ inputs.tag }}")
+          SRC_DIGEST=$(crane digest "${SOURCE_IMAGE}:${{ steps.validate_tag.outputs.tag }}")
+          DST_DIGEST=$(crane digest "${DEST_IMAGE}:${{ steps.validate_tag.outputs.tag }}")
🧰 Tools
🪛 zizmor (1.25.2)

[error] 74-74: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)


[error] 74-74: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)


[error] 78-78: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)


[error] 79-79: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/mirror-otel-collector.yml around lines 72 - 79, The
workflow currently injects ${{ inputs.tag }} directly into shell run steps (used
by crane copy and crane digest) which can allow shell injection; update the
workflow to validate and sanitize the tag before any shell use by either: 1)
adding an inputs validation regex in the workflow metadata for the
workflow_dispatch input (e.g., restrict to [A-Za-z0-9_.-]+) so ${{ inputs.tag }}
can only contain safe characters, or 2) perform an explicit runtime whitelist
check in a preliminary step (use a short shell snippet or actions/github-script
that tests the tag against a strict regex and fails if it doesn’t match), then
assign the validated value to a safe env var (e.g., VALID_TAG) and use that env
var in the crane copy and digest commands (the steps invoking crane copy and the
SRC_DIGEST/DST_DIGEST assignments).

Source: Linters/SAST tools

"gni"
],
"schemaVersion": 39,
"version": 1,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Remove the dashboard .version field to match the documented direct-Git contract.

Line 12 sets "version": 1, but the same PR’s workflow rules require direct Git dashboard JSON to omit .version. This mismatch will create drift and can break validation/promotion workflows.

Suggested fix
-  "version": 1,
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
"version": 1,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/dashboards/kiwi-temporal/temporal-workflow-health.json` at line 12,
Remove the top-level ".version" field from the dashboard JSON (the "version": 1
entry in temporal-workflow-health.json) so the file matches the direct-Git
dashboard contract; locate the "version" key in the JSON and delete that
property entirely, leaving all other fields intact to avoid validation/promotion
drift.

"uid": "[[.datasourceName]]"
},
"editorMode": "code",
"expr": "100 * sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) / (sum(rate(temporal_workflow_completed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) + sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])))",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Harden percentage formulas against zero denominators.

These ratios can emit Inf/NaN when no matching series exist in the selected time window/environment, which causes misleading spikes/gaps in the dashboard.

Suggested fix
-          "expr": "100 * sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) / (sum(rate(temporal_workflow_completed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) + sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])))",
+          "expr": "100 * sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) / clamp_min((sum(rate(temporal_workflow_completed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) + sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m]))), 1e-9)",
-          "expr": "100 * sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}) / (sum(temporal_worker_task_slots_available{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}) + sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}))",
+          "expr": "100 * sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}) / clamp_min((sum(temporal_worker_task_slots_available{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}) + sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"})), 1e-9)",
-          "expr": "100 * sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}) / (sum(temporal_worker_task_slots_available{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}) + sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}))",
+          "expr": "100 * sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}) / clamp_min((sum(temporal_worker_task_slots_available{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}) + sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"})), 1e-9)",

Also applies to: 418-429

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/dashboards/kiwi-temporal/temporal-workflow-health.json` at line 184,
The PromQL percentage expression can produce Inf/NaN when the denominator is
zero; modify the expression that computes 100 * failed / (completed + failed) to
guard the denominator using clamp_min or a tiny epsilon, e.g. replace the
denominator (sum(rate(temporal_workflow_completed_total{...}[5m])) +
sum(rate(temporal_workflow_failed_total{...}[5m]))) with clamp_min(..., 1e-9) so
PromQL never divides by zero; apply the same change to the other similar
expressions referencing temporal_workflow_failed_total and
temporal_workflow_completed_total (also at the block around lines 418-429) to
harden all percentage formulas.

Dashboard JSON files here mirror what lives (or will live) in the dashboards repo:

> `gitlab-master.nvidia.com/nsv-network/gni-dev/observability/deploy/dashboards`
> (local clone: `/Users/kiwueke/Documents/dashboards`)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Avoid committing a user-specific local filesystem path.

Line 6 hardcodes one developer’s local path, which is not portable for other contributors and tends to go stale in docs.

Suggested fix
-> (local clone: `/Users/kiwueke/Documents/dashboards`)
+> (example local clone path; update to your environment)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
> (local clone: `/Users/kiwueke/Documents/dashboards`)
> (example local clone path; update to your environment)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/dashboards/README.md` at line 6, Remove the hardcoded
developer-specific path "(local clone: `/Users/kiwueke/Documents/dashboards`)"
from the README and replace it with a portable placeholder (e.g., "(local clone:
/path/to/dashboards)" or instruct users to clone into their preferred directory
or use $HOME), so the documentation is not tied to a single user's filesystem
and remains reusable for all contributors.

Comment on lines +1120 to +1154
{{- if and .Values.temporal.enabled .Values.temporal.observability.enabled }}
{{- if .Values.secrets.vault.paths.panoptesMtls }}
{{- if .Values.secrets.vault.paths.panoptesMtls.path }}
{{- $temporalName := include "nv-config-manager.componentName" (dict "root" . "component" "temporal") }}
{{- $vaultSecretStoreName := include "nv-config-manager.vaultSecretStoreName" . }}
---
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
name: {{ $temporalName }}-panoptes-mtls-eso
namespace: {{ .Values.global.namespace }}
spec:
secretStoreRef:
name: {{ $vaultSecretStoreName }}
kind: SecretStore
target:
name: {{ $temporalName }}-panoptes-mtls
creationPolicy: Owner
data:
- secretKey: ca.crt
remoteRef:
key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"
property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "caCert") }}
- secretKey: tls.crt
remoteRef:
key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"
property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsCert") }}
- secretKey: tls.key
remoteRef:
key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"
property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsKey") }}
refreshInterval: 1h
{{- end }}
{{- end }}
{{- end }}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Gate the Panoptes ExternalSecret behind the same ESO-mode check as the rest of this file.

Lines 1120-1154 render an ExternalSecret even when secrets.method is kubernetes or vault-agent with secrets.vaultAgent.esoForOperators: false, but the referenced SecretStore is only created inside the earlier $anyEso block. In those modes this resource cannot reconcile, and the Temporal worker/API mounts in deploy/helm/templates/temporal.yaml still expect {{ $temporalName }}-panoptes-mtls to exist.

Suggested fix
-{{- if and .Values.temporal.enabled .Values.temporal.observability.enabled }}
+{{- if and $anyEso .Values.temporal.enabled .Values.temporal.observability.enabled }}
 {{- if .Values.secrets.vault.paths.panoptesMtls }}
 {{- if .Values.secrets.vault.paths.panoptesMtls.path }}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
{{- if and .Values.temporal.enabled .Values.temporal.observability.enabled }}
{{- if .Values.secrets.vault.paths.panoptesMtls }}
{{- if .Values.secrets.vault.paths.panoptesMtls.path }}
{{- $temporalName := include "nv-config-manager.componentName" (dict "root" . "component" "temporal") }}
{{- $vaultSecretStoreName := include "nv-config-manager.vaultSecretStoreName" . }}
---
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
name: {{ $temporalName }}-panoptes-mtls-eso
namespace: {{ .Values.global.namespace }}
spec:
secretStoreRef:
name: {{ $vaultSecretStoreName }}
kind: SecretStore
target:
name: {{ $temporalName }}-panoptes-mtls
creationPolicy: Owner
data:
- secretKey: ca.crt
remoteRef:
key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"
property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "caCert") }}
- secretKey: tls.crt
remoteRef:
key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"
property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsCert") }}
- secretKey: tls.key
remoteRef:
key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"
property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsKey") }}
refreshInterval: 1h
{{- end }}
{{- end }}
{{- end }}
{{- if and $anyEso .Values.temporal.enabled .Values.temporal.observability.enabled }}
{{- if .Values.secrets.vault.paths.panoptesMtls }}
{{- if .Values.secrets.vault.paths.panoptesMtls.path }}
{{- $temporalName := include "nv-config-manager.componentName" (dict "root" . "component" "temporal") }}
{{- $vaultSecretStoreName := include "nv-config-manager.vaultSecretStoreName" . }}
---
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
name: {{ $temporalName }}-panoptes-mtls-eso
namespace: {{ .Values.global.namespace }}
spec:
secretStoreRef:
name: {{ $vaultSecretStoreName }}
kind: SecretStore
target:
name: {{ $temporalName }}-panoptes-mtls
creationPolicy: Owner
data:
- secretKey: ca.crt
remoteRef:
key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"
property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "caCert") }}
- secretKey: tls.crt
remoteRef:
key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"
property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsCert") }}
- secretKey: tls.key
remoteRef:
key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}"
property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsKey") }}
refreshInterval: 1h
{{- end }}
{{- end }}
{{- end }}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@deploy/helm/templates/vault-secrets.yaml` around lines 1120 - 1154, The
ExternalSecret resource for panoptes Mtls (metadata name using {{ $temporalName
}}-panoptes-mtls-eso and target {{ $temporalName }}-panoptes-mtls) is rendered
even when ESO mode is not enabled, causing reconciliation failures because the
SecretStore (created only under the $anyEso block) may not exist; wrap the
existing block that creates this ExternalSecret with the same ESO-mode guard
used earlier (the $anyEso / secrets.method /
.Values.secrets.vaultAgent.esoForOperators check) so the ExternalSecret is only
emitted when ESO is enabled, or alternatively add the same conditional used for
creating the SecretStore around this section to ensure parity between
SecretStore and ExternalSecret creation.

Comment on lines +49 to +63
otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
resolved_name = os.getenv("OTEL_SERVICE_NAME", service_name)

resource = Resource.create(
{
"service.name": resolved_name,
"deployment.environment": os.getenv("ENVIRONMENT", "unknown"),
}
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint)))
trace.set_global_tracer_provider(provider)

_runtime = Runtime(telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url=otlp_endpoint)))
return _runtime

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Honor the disabled observability path here instead of defaulting to localhost.

The PR contract says observability is gated off by default, but this code still enables OTLP export whenever setup_telemetry() is called because it falls back to http://localhost:4317. Since both src/nv_config_manager/temporal/api/main.py and src/nv_config_manager/temporal/worker/main.py invoke this unconditionally, deployments without the sidecar will keep trying to export to a nonexistent local collector.

Suggested fix
 def setup_telemetry(service_name: str) -> Runtime:
     """Initialize OTel tracing and build a Temporal Runtime with OTel metrics.
@@
     """
     global _runtime
-    otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
+    otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
+    if not otlp_endpoint:
+        _runtime = Runtime(telemetry=TelemetryConfig())
+        return _runtime
+
     resolved_name = os.getenv("OTEL_SERVICE_NAME", service_name)
@@
     provider = TracerProvider(resource=resource)
     provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint)))
     trace.set_global_tracer_provider(provider)
 
     _runtime = Runtime(telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url=otlp_endpoint)))
     return _runtime
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
resolved_name = os.getenv("OTEL_SERVICE_NAME", service_name)
resource = Resource.create(
{
"service.name": resolved_name,
"deployment.environment": os.getenv("ENVIRONMENT", "unknown"),
}
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint)))
trace.set_global_tracer_provider(provider)
_runtime = Runtime(telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url=otlp_endpoint)))
return _runtime
otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
if not otlp_endpoint:
_runtime = Runtime(telemetry=TelemetryConfig())
return _runtime
resolved_name = os.getenv("OTEL_SERVICE_NAME", service_name)
resource = Resource.create(
{
"service.name": resolved_name,
"deployment.environment": os.getenv("ENVIRONMENT", "unknown"),
}
)
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint)))
trace.set_global_tracer_provider(provider)
_runtime = Runtime(telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url=otlp_endpoint)))
return _runtime
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/nv_config_manager/temporal/telemetry.py` around lines 49 - 63, The
telemetry setup currently defaults OTLP_EXPORTER to "http://localhost:4317"
which forces export even when observability should be disabled; modify
setup_telemetry() to check for an explicit OTLP endpoint or an enable flag
(e.g., OTEL_EXPORTER_OTLP_ENDPOINT present and/or OTEL_ENABLED) before creating
the TracerProvider/OTLPSpanExporter and before constructing Runtime with
TelemetryConfig/OpenTelemetryConfig; if the endpoint is missing or observability
is disabled, skip provider/span processor setup and return a Runtime without
OpenTelemetry metrics configured, preserving use of OTEL_SERVICE_NAME and
ENVIRONMENT when telemetry is enabled.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant