feat(observability): emit OTel traces and metrics from Temporal workflows (GNINWD-1588)#44
feat(observability): emit OTel traces and metrics from Temporal workflows (GNINWD-1588)#44kiwueke2 wants to merge 1 commit into
Conversation
…lows (GNINWD-1588) Instruments Temporal worker and API with OpenTelemetry: telemetry.py bootstrap, TracingInterceptor on worker/client/API, FastAPIInstrumentor, otel-collector sidecar (OTLP receiver -> k8sattributes + Panoptes attribute processors -> mTLS OTLP/HTTP export), Panoptes mTLS ExternalSecret, and image-mirror CI workflow. All gated on temporal.observability.enabled: false. Signed-off-by: Kezie Iwueke <kiwueke@nvidia.com>
📝 WalkthroughWalkthroughThis PR adds OpenTelemetry observability to Temporal components by introducing a telemetry bootstrap module, instrumenting the FastAPI app and Temporal client/worker with tracing interceptors, deploying an OTel Collector sidecar via Helm with Panoptes OTLP export over mTLS, provisioning certificates via Vault, adding a Grafana dashboard for workflow health, and supporting CI/CD for the OTel Collector image mirror. ChangesTemporal OpenTelemetry Observability
Sequence Diagram(s)sequenceDiagram
participant TemporalWorker as Temporal Worker
participant TemporalAPI as Temporal API
participant OTelCollector as OTel Collector Sidecar
participant Panoptes as Panoptes OTLP Endpoint
TemporalWorker->>OTelCollector: OTLP gRPC/HTTP (localhost:4317/4318)
TemporalAPI->>OTelCollector: OTLP gRPC/HTTP (localhost:4317/4318)
OTelCollector->>OTelCollector: Process & enrich (K8s attrs, Panoptes routing)
OTelCollector->>Panoptes: Export metrics & traces (TLS)
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 7
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
deploy/helm/templates/temporal.yaml (1)
395-457:⚠️ Potential issue | 🟠 Major | ⚡ Quick winAdd a pod-template checksum for the collector ConfigMap.
Lines 395-457 and 519-591 mount
{{ $temporalName }}-otel-collector-config, but neither pod template rolls when that ConfigMap changes. Updates totemporal.observability.panoptes.*indeploy/helm/values.yamlor the collector pipeline indeploy/helm/templates/temporal-otel-collector-configmap.yamlwill leave the worker/API sidecars on stale config until someone restarts them manually.Suggested fix
annotations: {{- include "nv-config-manager.authIniChecksum" . | nindent 8 }} + checksum/otel-collector-config: {{ include (print $.Template.BasePath "/temporal-otel-collector-configmap.yaml") . | sha256sum }} {{- if eq .Values.secrets.method "vault-agent" }} {{- include "nv-config-manager.vaultAgent.injectorAnnotations" . | nindent 8 }} {{- end }}Also applies to: 519-591
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@deploy/helm/templates/temporal.yaml` around lines 395 - 457, Pod templates that mount the otel-collector config (volume name otel-collector-config / ConfigMap name {{ $temporalName }}-otel-collector-config) don't roll when that ConfigMap changes; add a pod-template annotation that is a checksum of the rendered collector ConfigMap so updates trigger a restart. In the pod template metadata for the Temporal worker/API (the sections that mount otel-collector-config and include the otel sidecar) add an annotation like checksum/otel-collector-config: {{ include "nv-config-manager.temporal-otel-collector-configmap" . | sha256sum }} (or the equivalent helper you have for rendering the collector config + sha256sum) and apply the same change to the other pod template that mounts {{ $temporalName }}-otel-collector-config so both roll on ConfigMap changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.github/workflows/mirror-otel-collector.yml:
- Line 46: Replace the mutable tag "hashicorp/vault-action@v3" in the workflow
step with a pinned full commit SHA for the hashicorp/vault-action GitHub Action
(e.g., "hashicorp/vault-action@<full-commit-sha>"); update the uses entry to
reference the specific commit SHA and verify the SHA corresponds to the intended
v3 release before committing so the Vault credential-fetching step is pinned
immutably.
- Around line 72-79: The workflow currently injects ${{ inputs.tag }} directly
into shell run steps (used by crane copy and crane digest) which can allow shell
injection; update the workflow to validate and sanitize the tag before any shell
use by either: 1) adding an inputs validation regex in the workflow metadata for
the workflow_dispatch input (e.g., restrict to [A-Za-z0-9_.-]+) so ${{
inputs.tag }} can only contain safe characters, or 2) perform an explicit
runtime whitelist check in a preliminary step (use a short shell snippet or
actions/github-script that tests the tag against a strict regex and fails if it
doesn’t match), then assign the validated value to a safe env var (e.g.,
VALID_TAG) and use that env var in the crane copy and digest commands (the steps
invoking crane copy and the SRC_DIGEST/DST_DIGEST assignments).
In `@deploy/dashboards/kiwi-temporal/temporal-workflow-health.json`:
- Line 184: The PromQL percentage expression can produce Inf/NaN when the
denominator is zero; modify the expression that computes 100 * failed /
(completed + failed) to guard the denominator using clamp_min or a tiny epsilon,
e.g. replace the denominator
(sum(rate(temporal_workflow_completed_total{...}[5m])) +
sum(rate(temporal_workflow_failed_total{...}[5m]))) with clamp_min(..., 1e-9) so
PromQL never divides by zero; apply the same change to the other similar
expressions referencing temporal_workflow_failed_total and
temporal_workflow_completed_total (also at the block around lines 418-429) to
harden all percentage formulas.
- Line 12: Remove the top-level ".version" field from the dashboard JSON (the
"version": 1 entry in temporal-workflow-health.json) so the file matches the
direct-Git dashboard contract; locate the "version" key in the JSON and delete
that property entirely, leaving all other fields intact to avoid
validation/promotion drift.
In `@deploy/dashboards/README.md`:
- Line 6: Remove the hardcoded developer-specific path "(local clone:
`/Users/kiwueke/Documents/dashboards`)" from the README and replace it with a
portable placeholder (e.g., "(local clone: /path/to/dashboards)" or instruct
users to clone into their preferred directory or use $HOME), so the
documentation is not tied to a single user's filesystem and remains reusable for
all contributors.
In `@deploy/helm/templates/vault-secrets.yaml`:
- Around line 1120-1154: The ExternalSecret resource for panoptes Mtls (metadata
name using {{ $temporalName }}-panoptes-mtls-eso and target {{ $temporalName
}}-panoptes-mtls) is rendered even when ESO mode is not enabled, causing
reconciliation failures because the SecretStore (created only under the $anyEso
block) may not exist; wrap the existing block that creates this ExternalSecret
with the same ESO-mode guard used earlier (the $anyEso / secrets.method /
.Values.secrets.vaultAgent.esoForOperators check) so the ExternalSecret is only
emitted when ESO is enabled, or alternatively add the same conditional used for
creating the SecretStore around this section to ensure parity between
SecretStore and ExternalSecret creation.
In `@src/nv_config_manager/temporal/telemetry.py`:
- Around line 49-63: The telemetry setup currently defaults OTLP_EXPORTER to
"http://localhost:4317" which forces export even when observability should be
disabled; modify setup_telemetry() to check for an explicit OTLP endpoint or an
enable flag (e.g., OTEL_EXPORTER_OTLP_ENDPOINT present and/or OTEL_ENABLED)
before creating the TracerProvider/OTLPSpanExporter and before constructing
Runtime with TelemetryConfig/OpenTelemetryConfig; if the endpoint is missing or
observability is disabled, skip provider/span processor setup and return a
Runtime without OpenTelemetry metrics configured, preserving use of
OTEL_SERVICE_NAME and ENVIRONMENT when telemetry is enabled.
---
Outside diff comments:
In `@deploy/helm/templates/temporal.yaml`:
- Around line 395-457: Pod templates that mount the otel-collector config
(volume name otel-collector-config / ConfigMap name {{ $temporalName
}}-otel-collector-config) don't roll when that ConfigMap changes; add a
pod-template annotation that is a checksum of the rendered collector ConfigMap
so updates trigger a restart. In the pod template metadata for the Temporal
worker/API (the sections that mount otel-collector-config and include the otel
sidecar) add an annotation like checksum/otel-collector-config: {{ include
"nv-config-manager.temporal-otel-collector-configmap" . | sha256sum }} (or the
equivalent helper you have for rendering the collector config + sha256sum) and
apply the same change to the other pod template that mounts {{ $temporalName
}}-otel-collector-config so both roll on ConfigMap changes.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7e4b64e8-03e6-4b7a-a1d9-0dca0c294958
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (12)
.github/workflows/mirror-otel-collector.ymldeploy/dashboards/README.mddeploy/dashboards/kiwi-temporal/temporal-workflow-health.jsondeploy/helm/templates/temporal-otel-collector-configmap.yamldeploy/helm/templates/temporal.yamldeploy/helm/templates/vault-secrets.yamldeploy/helm/values.yamlpyproject.tomlsrc/nv_config_manager/temporal/api/main.pysrc/nv_config_manager/temporal/api/workflow_v1.pysrc/nv_config_manager/temporal/telemetry.pysrc/nv_config_manager/temporal/worker/main.py
| uses: nv-gha-runners/setup-proxy-cache@14229018fe157c83e03c008f27d183d8e99bc67c | ||
|
|
||
| - name: Fetch nvcr registry token from Vault | ||
| uses: hashicorp/vault-action@v3 |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE=".github/workflows/mirror-otel-collector.yml"
echo "== Checking $FILE =="
if [ ! -f "$FILE" ]; then
echo "Missing file: $FILE"
exit 1
fi
nl -ba "$FILE" | sed -n '35,70p'
echo
echo "== Searching for hashicorp/vault-action@ across repo =="
rg -n --hidden --no-ignore-vcs "hashicorp/vault-action@" .github/workflows || trueRepository: NVIDIA/nv-config-manager
Length of output: 189
🏁 Script executed:
#!/bin/bash
set -euo pipefail
FILE=".github/workflows/mirror-otel-collector.yml"
echo "== Checking $FILE =="
if [ ! -f "$FILE" ]; then
echo "Missing file: $FILE"
exit 1
fi
awk 'NR>=35 && NR<=70 {printf "%d:%s\n", NR, $0}' "$FILE"
echo
echo "== Searching for hashicorp/vault-action@ across repo =="
rg -n --hidden --no-ignore-vcs "hashicorp/vault-action@" . || trueRepository: NVIDIA/nv-config-manager
Length of output: 1707
Pin hashicorp/vault-action to a full commit SHA (avoid @v3).
.github/workflows/mirror-otel-collector.yml:46 uses hashicorp/vault-action@v3; tag-based references are mutable and weaken supply-chain guarantees for a Vault credential-fetching step—replace @v3 with a specific commit SHA.
🧰 Tools
🪛 zizmor (1.25.2)
[error] 46-46: unpinned action reference (unpinned-uses): action is not pinned to a hash (required by blanket policy)
(unpinned-uses)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/mirror-otel-collector.yml at line 46, Replace the mutable
tag "hashicorp/vault-action@v3" in the workflow step with a pinned full commit
SHA for the hashicorp/vault-action GitHub Action (e.g.,
"hashicorp/vault-action@<full-commit-sha>"); update the uses entry to reference
the specific commit SHA and verify the SHA corresponds to the intended v3
release before committing so the Vault credential-fetching step is pinned
immutably.
Source: Linters/SAST tools
| - name: Mirror image | ||
| run: | | ||
| crane copy "${SOURCE_IMAGE}:${{ inputs.tag }}" "${DEST_IMAGE}:${{ inputs.tag }}" | ||
|
|
||
| - name: Verify digest matches source | ||
| run: | | ||
| SRC_DIGEST=$(crane digest "${SOURCE_IMAGE}:${{ inputs.tag }}") | ||
| DST_DIGEST=$(crane digest "${DEST_IMAGE}:${{ inputs.tag }}") |
There was a problem hiding this comment.
Validate and sanitize workflow_dispatch tag before shell usage.
Line 74, Line 78, and Line 79 expand ${{ inputs.tag }} directly inside run scripts. A crafted tag (for example including command substitution) can execute arbitrary shell on the runner.
Suggested hardening diff
jobs:
mirror:
@@
steps:
+ - name: Validate mirror tag
+ id: validate_tag
+ run: |
+ set -euo pipefail
+ TAG='${{ inputs.tag }}'
+ if [[ ! "${TAG}" =~ ^[0-9]+\.[0-9]+\.[0-9]+([-.][0-9A-Za-z._-]+)?$ ]]; then
+ echo "Invalid tag format: ${TAG}" >&2
+ exit 1
+ fi
+ echo "tag=${TAG}" >> "${GITHUB_OUTPUT}"
@@
- name: Mirror image
run: |
- crane copy "${SOURCE_IMAGE}:${{ inputs.tag }}" "${DEST_IMAGE}:${{ inputs.tag }}"
+ crane copy "${SOURCE_IMAGE}:${{ steps.validate_tag.outputs.tag }}" "${DEST_IMAGE}:${{ steps.validate_tag.outputs.tag }}"
@@
- name: Verify digest matches source
run: |
- SRC_DIGEST=$(crane digest "${SOURCE_IMAGE}:${{ inputs.tag }}")
- DST_DIGEST=$(crane digest "${DEST_IMAGE}:${{ inputs.tag }}")
+ SRC_DIGEST=$(crane digest "${SOURCE_IMAGE}:${{ steps.validate_tag.outputs.tag }}")
+ DST_DIGEST=$(crane digest "${DEST_IMAGE}:${{ steps.validate_tag.outputs.tag }}")🧰 Tools
🪛 zizmor (1.25.2)
[error] 74-74: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
[error] 74-74: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
[error] 78-78: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
[error] 79-79: code injection via template expansion (template-injection): may expand into attacker-controllable code
(template-injection)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In @.github/workflows/mirror-otel-collector.yml around lines 72 - 79, The
workflow currently injects ${{ inputs.tag }} directly into shell run steps (used
by crane copy and crane digest) which can allow shell injection; update the
workflow to validate and sanitize the tag before any shell use by either: 1)
adding an inputs validation regex in the workflow metadata for the
workflow_dispatch input (e.g., restrict to [A-Za-z0-9_.-]+) so ${{ inputs.tag }}
can only contain safe characters, or 2) perform an explicit runtime whitelist
check in a preliminary step (use a short shell snippet or actions/github-script
that tests the tag against a strict regex and fails if it doesn’t match), then
assign the validated value to a safe env var (e.g., VALID_TAG) and use that env
var in the crane copy and digest commands (the steps invoking crane copy and the
SRC_DIGEST/DST_DIGEST assignments).
Source: Linters/SAST tools
| "gni" | ||
| ], | ||
| "schemaVersion": 39, | ||
| "version": 1, |
There was a problem hiding this comment.
Remove the dashboard .version field to match the documented direct-Git contract.
Line 12 sets "version": 1, but the same PR’s workflow rules require direct Git dashboard JSON to omit .version. This mismatch will create drift and can break validation/promotion workflows.
Suggested fix
- "version": 1,📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| "version": 1, |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/dashboards/kiwi-temporal/temporal-workflow-health.json` at line 12,
Remove the top-level ".version" field from the dashboard JSON (the "version": 1
entry in temporal-workflow-health.json) so the file matches the direct-Git
dashboard contract; locate the "version" key in the JSON and delete that
property entirely, leaving all other fields intact to avoid validation/promotion
drift.
| "uid": "[[.datasourceName]]" | ||
| }, | ||
| "editorMode": "code", | ||
| "expr": "100 * sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) / (sum(rate(temporal_workflow_completed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) + sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])))", |
There was a problem hiding this comment.
Harden percentage formulas against zero denominators.
These ratios can emit Inf/NaN when no matching series exist in the selected time window/environment, which causes misleading spikes/gaps in the dashboard.
Suggested fix
- "expr": "100 * sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) / (sum(rate(temporal_workflow_completed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) + sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])))",
+ "expr": "100 * sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) / clamp_min((sum(rate(temporal_workflow_completed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m])) + sum(rate(temporal_workflow_failed_total{deployment_environment=~\"$env\", workflow_type=~\"$workflow_type\"}[5m]))), 1e-9)",- "expr": "100 * sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}) / (sum(temporal_worker_task_slots_available{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}) + sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}))",
+ "expr": "100 * sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}) / clamp_min((sum(temporal_worker_task_slots_available{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"}) + sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"ActivitySlot\"})), 1e-9)",- "expr": "100 * sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}) / (sum(temporal_worker_task_slots_available{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}) + sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}))",
+ "expr": "100 * sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}) / clamp_min((sum(temporal_worker_task_slots_available{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"}) + sum(temporal_worker_task_slots_used{deployment_environment=~\"$env\", slot_type=\"WorkflowSlot\"})), 1e-9)",Also applies to: 418-429
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/dashboards/kiwi-temporal/temporal-workflow-health.json` at line 184,
The PromQL percentage expression can produce Inf/NaN when the denominator is
zero; modify the expression that computes 100 * failed / (completed + failed) to
guard the denominator using clamp_min or a tiny epsilon, e.g. replace the
denominator (sum(rate(temporal_workflow_completed_total{...}[5m])) +
sum(rate(temporal_workflow_failed_total{...}[5m]))) with clamp_min(..., 1e-9) so
PromQL never divides by zero; apply the same change to the other similar
expressions referencing temporal_workflow_failed_total and
temporal_workflow_completed_total (also at the block around lines 418-429) to
harden all percentage formulas.
| Dashboard JSON files here mirror what lives (or will live) in the dashboards repo: | ||
|
|
||
| > `gitlab-master.nvidia.com/nsv-network/gni-dev/observability/deploy/dashboards` | ||
| > (local clone: `/Users/kiwueke/Documents/dashboards`) |
There was a problem hiding this comment.
Avoid committing a user-specific local filesystem path.
Line 6 hardcodes one developer’s local path, which is not portable for other contributors and tends to go stale in docs.
Suggested fix
-> (local clone: `/Users/kiwueke/Documents/dashboards`)
+> (example local clone path; update to your environment)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| > (local clone: `/Users/kiwueke/Documents/dashboards`) | |
| > (example local clone path; update to your environment) |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/dashboards/README.md` at line 6, Remove the hardcoded
developer-specific path "(local clone: `/Users/kiwueke/Documents/dashboards`)"
from the README and replace it with a portable placeholder (e.g., "(local clone:
/path/to/dashboards)" or instruct users to clone into their preferred directory
or use $HOME), so the documentation is not tied to a single user's filesystem
and remains reusable for all contributors.
| {{- if and .Values.temporal.enabled .Values.temporal.observability.enabled }} | ||
| {{- if .Values.secrets.vault.paths.panoptesMtls }} | ||
| {{- if .Values.secrets.vault.paths.panoptesMtls.path }} | ||
| {{- $temporalName := include "nv-config-manager.componentName" (dict "root" . "component" "temporal") }} | ||
| {{- $vaultSecretStoreName := include "nv-config-manager.vaultSecretStoreName" . }} | ||
| --- | ||
| apiVersion: external-secrets.io/v1 | ||
| kind: ExternalSecret | ||
| metadata: | ||
| name: {{ $temporalName }}-panoptes-mtls-eso | ||
| namespace: {{ .Values.global.namespace }} | ||
| spec: | ||
| secretStoreRef: | ||
| name: {{ $vaultSecretStoreName }} | ||
| kind: SecretStore | ||
| target: | ||
| name: {{ $temporalName }}-panoptes-mtls | ||
| creationPolicy: Owner | ||
| data: | ||
| - secretKey: ca.crt | ||
| remoteRef: | ||
| key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}" | ||
| property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "caCert") }} | ||
| - secretKey: tls.crt | ||
| remoteRef: | ||
| key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}" | ||
| property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsCert") }} | ||
| - secretKey: tls.key | ||
| remoteRef: | ||
| key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}" | ||
| property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsKey") }} | ||
| refreshInterval: 1h | ||
| {{- end }} | ||
| {{- end }} | ||
| {{- end }} |
There was a problem hiding this comment.
Gate the Panoptes ExternalSecret behind the same ESO-mode check as the rest of this file.
Lines 1120-1154 render an ExternalSecret even when secrets.method is kubernetes or vault-agent with secrets.vaultAgent.esoForOperators: false, but the referenced SecretStore is only created inside the earlier $anyEso block. In those modes this resource cannot reconcile, and the Temporal worker/API mounts in deploy/helm/templates/temporal.yaml still expect {{ $temporalName }}-panoptes-mtls to exist.
Suggested fix
-{{- if and .Values.temporal.enabled .Values.temporal.observability.enabled }}
+{{- if and $anyEso .Values.temporal.enabled .Values.temporal.observability.enabled }}
{{- if .Values.secrets.vault.paths.panoptesMtls }}
{{- if .Values.secrets.vault.paths.panoptesMtls.path }}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| {{- if and .Values.temporal.enabled .Values.temporal.observability.enabled }} | |
| {{- if .Values.secrets.vault.paths.panoptesMtls }} | |
| {{- if .Values.secrets.vault.paths.panoptesMtls.path }} | |
| {{- $temporalName := include "nv-config-manager.componentName" (dict "root" . "component" "temporal") }} | |
| {{- $vaultSecretStoreName := include "nv-config-manager.vaultSecretStoreName" . }} | |
| --- | |
| apiVersion: external-secrets.io/v1 | |
| kind: ExternalSecret | |
| metadata: | |
| name: {{ $temporalName }}-panoptes-mtls-eso | |
| namespace: {{ .Values.global.namespace }} | |
| spec: | |
| secretStoreRef: | |
| name: {{ $vaultSecretStoreName }} | |
| kind: SecretStore | |
| target: | |
| name: {{ $temporalName }}-panoptes-mtls | |
| creationPolicy: Owner | |
| data: | |
| - secretKey: ca.crt | |
| remoteRef: | |
| key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}" | |
| property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "caCert") }} | |
| - secretKey: tls.crt | |
| remoteRef: | |
| key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}" | |
| property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsCert") }} | |
| - secretKey: tls.key | |
| remoteRef: | |
| key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}" | |
| property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsKey") }} | |
| refreshInterval: 1h | |
| {{- end }} | |
| {{- end }} | |
| {{- end }} | |
| {{- if and $anyEso .Values.temporal.enabled .Values.temporal.observability.enabled }} | |
| {{- if .Values.secrets.vault.paths.panoptesMtls }} | |
| {{- if .Values.secrets.vault.paths.panoptesMtls.path }} | |
| {{- $temporalName := include "nv-config-manager.componentName" (dict "root" . "component" "temporal") }} | |
| {{- $vaultSecretStoreName := include "nv-config-manager.vaultSecretStoreName" . }} | |
| --- | |
| apiVersion: external-secrets.io/v1 | |
| kind: ExternalSecret | |
| metadata: | |
| name: {{ $temporalName }}-panoptes-mtls-eso | |
| namespace: {{ .Values.global.namespace }} | |
| spec: | |
| secretStoreRef: | |
| name: {{ $vaultSecretStoreName }} | |
| kind: SecretStore | |
| target: | |
| name: {{ $temporalName }}-panoptes-mtls | |
| creationPolicy: Owner | |
| data: | |
| - secretKey: ca.crt | |
| remoteRef: | |
| key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}" | |
| property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "caCert") }} | |
| - secretKey: tls.crt | |
| remoteRef: | |
| key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}" | |
| property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsCert") }} | |
| - secretKey: tls.key | |
| remoteRef: | |
| key: "{{ include "nv-config-manager.vault.secretPath" (dict "root" . "secret" "panoptesMtls") }}" | |
| property: {{ include "nv-config-manager.vault.keyName" (dict "root" . "secret" "panoptesMtls" "key" "tlsKey") }} | |
| refreshInterval: 1h | |
| {{- end }} | |
| {{- end }} | |
| {{- end }} |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@deploy/helm/templates/vault-secrets.yaml` around lines 1120 - 1154, The
ExternalSecret resource for panoptes Mtls (metadata name using {{ $temporalName
}}-panoptes-mtls-eso and target {{ $temporalName }}-panoptes-mtls) is rendered
even when ESO mode is not enabled, causing reconciliation failures because the
SecretStore (created only under the $anyEso block) may not exist; wrap the
existing block that creates this ExternalSecret with the same ESO-mode guard
used earlier (the $anyEso / secrets.method /
.Values.secrets.vaultAgent.esoForOperators check) so the ExternalSecret is only
emitted when ESO is enabled, or alternatively add the same conditional used for
creating the SecretStore around this section to ensure parity between
SecretStore and ExternalSecret creation.
| otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317") | ||
| resolved_name = os.getenv("OTEL_SERVICE_NAME", service_name) | ||
|
|
||
| resource = Resource.create( | ||
| { | ||
| "service.name": resolved_name, | ||
| "deployment.environment": os.getenv("ENVIRONMENT", "unknown"), | ||
| } | ||
| ) | ||
| provider = TracerProvider(resource=resource) | ||
| provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))) | ||
| trace.set_global_tracer_provider(provider) | ||
|
|
||
| _runtime = Runtime(telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url=otlp_endpoint))) | ||
| return _runtime |
There was a problem hiding this comment.
Honor the disabled observability path here instead of defaulting to localhost.
The PR contract says observability is gated off by default, but this code still enables OTLP export whenever setup_telemetry() is called because it falls back to http://localhost:4317. Since both src/nv_config_manager/temporal/api/main.py and src/nv_config_manager/temporal/worker/main.py invoke this unconditionally, deployments without the sidecar will keep trying to export to a nonexistent local collector.
Suggested fix
def setup_telemetry(service_name: str) -> Runtime:
"""Initialize OTel tracing and build a Temporal Runtime with OTel metrics.
@@
"""
global _runtime
- otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317")
+ otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT")
+ if not otlp_endpoint:
+ _runtime = Runtime(telemetry=TelemetryConfig())
+ return _runtime
+
resolved_name = os.getenv("OTEL_SERVICE_NAME", service_name)
@@
provider = TracerProvider(resource=resource)
provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint)))
trace.set_global_tracer_provider(provider)
_runtime = Runtime(telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url=otlp_endpoint)))
return _runtime📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://localhost:4317") | |
| resolved_name = os.getenv("OTEL_SERVICE_NAME", service_name) | |
| resource = Resource.create( | |
| { | |
| "service.name": resolved_name, | |
| "deployment.environment": os.getenv("ENVIRONMENT", "unknown"), | |
| } | |
| ) | |
| provider = TracerProvider(resource=resource) | |
| provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))) | |
| trace.set_global_tracer_provider(provider) | |
| _runtime = Runtime(telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url=otlp_endpoint))) | |
| return _runtime | |
| otlp_endpoint = os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT") | |
| if not otlp_endpoint: | |
| _runtime = Runtime(telemetry=TelemetryConfig()) | |
| return _runtime | |
| resolved_name = os.getenv("OTEL_SERVICE_NAME", service_name) | |
| resource = Resource.create( | |
| { | |
| "service.name": resolved_name, | |
| "deployment.environment": os.getenv("ENVIRONMENT", "unknown"), | |
| } | |
| ) | |
| provider = TracerProvider(resource=resource) | |
| provider.add_span_processor(BatchSpanProcessor(OTLPSpanExporter(endpoint=otlp_endpoint))) | |
| trace.set_global_tracer_provider(provider) | |
| _runtime = Runtime(telemetry=TelemetryConfig(metrics=OpenTelemetryConfig(url=otlp_endpoint))) | |
| return _runtime |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/nv_config_manager/temporal/telemetry.py` around lines 49 - 63, The
telemetry setup currently defaults OTLP_EXPORTER to "http://localhost:4317"
which forces export even when observability should be disabled; modify
setup_telemetry() to check for an explicit OTLP endpoint or an enable flag
(e.g., OTEL_EXPORTER_OTLP_ENDPOINT present and/or OTEL_ENABLED) before creating
the TracerProvider/OTLPSpanExporter and before constructing Runtime with
TelemetryConfig/OpenTelemetryConfig; if the endpoint is missing or observability
is disabled, skip provider/span processor setup and return a Runtime without
OpenTelemetry metrics configured, preserving use of OTEL_SERVICE_NAME and
ENVIRONMENT when telemetry is enabled.
Instruments Temporal worker and API with OpenTelemetry: telemetry.py bootstrap, TracingInterceptor on worker/client/API, FastAPIInstrumentor, otel-collector sidecar (OTLP receiver -> k8sattributes + Panoptes attribute processors -> mTLS OTLP/HTTP export), Panoptes mTLS ExternalSecret, and image-mirror CI workflow. All gated on temporal.observability.enabled: false — zero behaviour change to existing deployments.
Summary by CodeRabbit
Release Notes
New Features
Documentation
Chores