Skip to content

feat(helm): observability support (ArcadeDB 26.7.1, #4463)#12

Merged
robfrank merged 12 commits into
mainfrom
feat/observability-support
Jun 18, 2026
Merged

feat(helm): observability support (ArcadeDB 26.7.1, #4463)#12
robfrank merged 12 commits into
mainfrom
feat/observability-support

Conversation

@robfrank

Copy link
Copy Markdown
Contributor

Summary

Adds opt-in, behavior-preserving observability support for ArcadeDB 26.7.1 (issue ArcadeData/arcadedb#4463), covering all four pillars: health probes (#4464), OTLP metrics export (#4465), structured JSON logging (#4466), and distributed tracing (#4467).

A new feature-grouped observability: values section is translated by helpers into -D JVM args, plus Kubernetes-native Prometheus discovery the raw args can't provide.

  • Metrics / OTLPobservability.metrics.otlp.{enabled,endpoint} pushes metrics to an OTel collector alongside the existing /prometheus endpoint.
  • Prometheus discovery — optional ServiceMonitor (Prometheus Operator) and prometheus.io/* scrape pod annotations. A new requireAuthentication knob on the prometheus plugin enables unauthenticated scraping. A template guard fails fast if scrape discovery is enabled without the prometheus plugin.
  • Tracingobservability.tracing.{enabled,endpoint,samplingRate}.
  • Structured loggingobservability.logging.{format,includeTrace} (text|json).
  • Health probes — default liveness probe switched to the dependency-free /api/v1/health (no DB I/O, never 503); readiness stays on /api/v1/ready. Optional observability.health.readinessRequiresHA.
  • Versionversion/appVersion bumped to 26.7.1 (the release that serves /api/v1/health), so the default image and the new liveness default move together.

The ServiceMonitor selects only the {fullname}-http Service via a new app.kubernetes.io/component: http label, avoiding double-scraping the headless Service.

Every knob defaults off / behavior-preserving except the liveness path, which is coupled to the 26.7.1 bump.

Test Plan

  • make test-unit — 159/159 pass across 13 suites (new observability_test.yaml, servicemonitor_test.yaml; updated statefulset_test.yaml, helpers_test.yaml, service_test.yaml)
  • make lint — clean
  • bash -n ci/integration-test.sh — clean; adds a /api/v1/health → 204 smoke phase (runs green under the latest-image guard)
  • helm template fully-enabled and at-defaults renders verified
  • kind HA integration job (pulls appVersion image — do not release until the arcadedb:26.7.1 image is published)

Note: before release, confirm the serverMetrics.otlp.* / serverMetrics.tracing.* property names against the running 26.7.1 server — unknown -D keys silently no-op.

🤖 Generated with Claude Code

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces comprehensive, opt-in observability support for the ArcadeDB Helm chart (bumping the version to 26.7.1), including structured JSON logging, OTLP metrics, distributed tracing, and a new Prometheus ServiceMonitor template. It also updates the default liveness probe to use the dependency-free '/api/v1/health' endpoint. The review feedback highlights critical opportunities to prevent nil pointer dereference errors during template rendering by safely defaulting nested maps under '.Values.observability' in the helper templates and the ServiceMonitor manifest.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +183 to +203
{{- define "arcadedb.observability.args" -}}
{{- $o := .Values.observability -}}
{{- if eq $o.logging.format "json" }}
- -Darcadedb.server.logFormat=json
{{- end }}
{{- if $o.logging.includeTrace }}
- -Darcadedb.server.logIncludeTrace=true
{{- end }}
{{- if $o.metrics.otlp.enabled }}
- -Darcadedb.serverMetrics.otlp.enabled=true
- -Darcadedb.serverMetrics.otlp.endpoint={{ $o.metrics.otlp.endpoint }}
{{- end }}
{{- if $o.tracing.enabled }}
- -Darcadedb.serverMetrics.tracing.enabled=true
- -Darcadedb.serverMetrics.tracing.endpoint={{ $o.tracing.endpoint }}
- -Darcadedb.serverMetrics.tracing.samplingRate={{ $o.tracing.samplingRate }}
{{- end }}
{{- if $o.health.readinessRequiresHA }}
- -Darcadedb.server.readinessRequiresHA=true
{{- end }}
{{- end -}}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The helper arcadedb.observability.args directly accesses nested keys under .Values.observability (such as .Values.observability.logging.format and .Values.observability.metrics.otlp.enabled) without checking if the parent keys exist. If a user provides a custom values file that partially overrides or completely omits the observability section, Helm will throw a nil pointer evaluation error (e.g., nil pointer evaluating interface {}). Using default dict to safely navigate these nested maps prevents rendering crashes.

{{- define "arcadedb.observability.args" -}}
{{- $o := .Values.observability | default dict -}}
{{- $logging := $o.logging | default dict -}}
{{- $metrics := $o.metrics | default dict -}}
{{- $otlp := $metrics.otlp | default dict -}}
{{- $tracing := $o.tracing | default dict -}}
{{- $health := $o.health | default dict -}}
{{- if eq $logging.format "json" }}
- -Darcadedb.server.logFormat=json
{{- end }}
{{- if $logging.includeTrace }}
- -Darcadedb.server.logIncludeTrace=true
{{- end }}
{{- if $otlp.enabled }}
- -Darcadedb.serverMetrics.otlp.enabled=true
- -Darcadedb.serverMetrics.otlp.endpoint={{ $otlp.endpoint }}
{{- end }}
{{- if $tracing.enabled }}
- -Darcadedb.serverMetrics.tracing.enabled=true
- -Darcadedb.serverMetrics.tracing.endpoint={{ $tracing.endpoint }}
- -Darcadedb.serverMetrics.tracing.samplingRate={{ $tracing.samplingRate }}
{{- end }}
{{- if $health.readinessRequiresHA }}
- -Darcadedb.server.readinessRequiresHA=true
{{- end }}
{{- end -}}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified this empirically before applying. The stated trigger — partially overriding or omitting the observability section — does not panic: Helm coalesces user values over the chart's values.yaml defaults, so omitted/partially-set keys retain their defaults (helm template -f with only observability.metrics.otlp.enabled: true renders fine, exit 0; the unit suite also sets partial observability keys and passes).

The real failure mode is explicitly nulling a map, e.g. --set observability=null or --set observability.logging=null, which does dereference nil. That's a genuine crash, so I've applied nil-safe navigation (| default dict) to this helper and the other three sites, plus regression tests covering the explicit-null case.

Fixed in 9e17bbb.

Comment on lines +209 to +220
{{- define "arcadedb.observability.validate" -}}
{{- $p := .Values.observability.metrics.prometheus -}}
{{- if or $p.serviceMonitor.enabled $p.podAnnotations.enabled -}}
{{- $promEnabled := false -}}
{{- with .Values.arcadedb.plugins.prometheus -}}
{{- if .enabled -}}{{- $promEnabled = true -}}{{- end -}}
{{- end -}}
{{- if not $promEnabled -}}
{{- fail "observability.metrics.prometheus serviceMonitor/podAnnotations require arcadedb.plugins.prometheus.enabled=true (the /prometheus endpoint must be served)" -}}
{{- end -}}
{{- end -}}
{{- end -}}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The validation helper arcadedb.observability.validate accesses .Values.observability.metrics.prometheus directly. If observability or metrics is omitted or overridden partially, this will cause a nil pointer dereference error during rendering. Safely defaulting the nested maps ensures robust template rendering.

{{- define "arcadedb.observability.validate" -}}
{{- $o := .Values.observability | default dict -}}
{{- $metrics := $o.metrics | default dict -}}
{{- $p := $metrics.prometheus | default dict -}}
{{- $serviceMonitor := $p.serviceMonitor | default dict -}}
{{- $podAnnotations := $p.podAnnotations | default dict -}}
{{- if or $serviceMonitor.enabled $podAnnotations.enabled -}}
  {{- $promEnabled := false -}}
  {{- with .Values.arcadedb.plugins.prometheus -}}
    {{- if .enabled -}}{{- $promEnabled = true -}}{{- end -}}
  {{- end -}}
  {{- if not $promEnabled -}}
    {{- fail "observability.metrics.prometheus serviceMonitor/podAnnotations require arcadedb.plugins.prometheus.enabled=true (the /prometheus endpoint must be served)" -}}
  {{- end -}}
{{- end -}}
{{- end -}}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9e17bbbarcadedb.observability.validate now safely defaults the nested maps. (See the thread above: normal partial overrides were already safe via Helm coalescing; the real crash was explicit nulling, now guarded with a regression test.)

Comment on lines +226 to +239
{{- define "arcadedb.podAnnotations" -}}
{{- $annotations := deepCopy (default dict .Values.podAnnotations) -}}
{{- $pa := .Values.observability.metrics.prometheus.podAnnotations -}}
{{- if $pa.enabled -}}
{{- $port := int .Values.service.http.port -}}
{{- if $pa.port -}}{{- $port = int $pa.port -}}{{- end -}}
{{- $_ := set $annotations "prometheus.io/scrape" "true" -}}
{{- $_ := set $annotations "prometheus.io/port" (printf "%d" $port) -}}
{{- $_ := set $annotations "prometheus.io/path" (default "/prometheus" $pa.path) -}}
{{- end -}}
{{- if $annotations -}}
{{- toYaml $annotations -}}
{{- end -}}
{{- end -}}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The helper arcadedb.podAnnotations accesses .Values.observability.metrics.prometheus.podAnnotations directly. If any of the parent keys are omitted in custom values, this will cause a nil pointer dereference error. Defaulting the nested maps safely avoids this issue.

{{- define "arcadedb.podAnnotations" -}}
{{- $annotations := deepCopy (default dict .Values.podAnnotations) -}}
{{- $o := .Values.observability | default dict -}}
{{- $metrics := $o.metrics | default dict -}}
{{- $prometheus := $metrics.prometheus | default dict -}}
{{- $pa := $prometheus.podAnnotations | default dict -}}
{{- if $pa.enabled -}}
  {{- $port := int .Values.service.http.port -}}
  {{- if $pa.port -}}{{- $port = int $pa.port -}}
  {{- end -}}
  {{- $_ := set $annotations "prometheus.io/scrape" "true" -}}
  {{- $_ := set $annotations "prometheus.io/port" (printf "%d" $port) -}}
  {{- $_ := set $annotations "prometheus.io/path" (default "/prometheus" $pa.path) -}}
{{- end -}}
{{- if $annotations -}}
{{- toYaml $annotations -}}
{{- end -}}
{{- end -}}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9e17bbbarcadedb.podAnnotations now safely navigates the nested maps with | default dict.

Comment on lines +1 to +6
{{- if .Values.observability.metrics.prometheus.serviceMonitor.enabled }}
{{- include "arcadedb.observability.validate" . -}}
{{- $sm := .Values.observability.metrics.prometheus.serviceMonitor -}}
{{- if and $sm.basicAuth.enabled (not $sm.basicAuth.secretName) -}}
{{- fail "serviceMonitor.basicAuth.enabled requires basicAuth.secretName (a secret with username + password keys)" -}}
{{- end -}}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

In servicemonitor.yaml, accessing .Values.observability.metrics.prometheus.serviceMonitor.enabled and $sm.basicAuth.enabled directly can cause nil pointer dereference errors if observability or basicAuth is omitted or partially overridden in custom values. Safely defaulting these nested maps prevents rendering failures.

{{- $o := .Values.observability | default dict -}}
{{- $metrics := $o.metrics | default dict -}}
{{- $prometheus := $metrics.prometheus | default dict -}}
{{- $sm := $prometheus.serviceMonitor | default dict -}}
{{- $basicAuth := $sm.basicAuth | default dict -}}
{{- if $sm.enabled }}
{{- include "arcadedb.observability.validate" . -}}
{{- if and $basicAuth.enabled (not $basicAuth.secretName) -}}
{{- fail "serviceMonitor.basicAuth.enabled requires basicAuth.secretName (a secret with username + password keys)" -}}
{{- end -}}

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in 9e17bbbservicemonitor.yaml now resolves $sm and $basicAuth via | default dict before use, so an explicitly nulled observability/serviceMonitor/basicAuth renders cleanly instead of panicking. Regression test added.

robfrank and others added 3 commits June 18, 2026 08:53
Guards the observability helpers and ServiceMonitor template against an
explicitly nulled observability map/sub-map (e.g. --set observability=null),
which panicked with a nil pointer dereference. Normal partial overrides and
omissions were already safe via Helm's value coalescing; this covers the
explicit-null edge. Adds regression tests.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The chart appVersion is bumped to 26.7.1 ahead of the image publish, so the
default image.tag=appVersion is not pullable yet and the integration job's
helm install --wait times out. Point the PR integration job at the rolling
`latest` tag (== 26.7.1 content) until arcadedata/arcadedb:26.7.1 is published.

REVERT this `with:` override at release so the job tests the pinned appVersion
again (see f3db108).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Document that appVersion can be bumped ahead of the image publish, that the
PR integration job is then temporarily pinned to `latest`, and that the
override must be removed once the pinned image ships. Adds a pending callout
for the current 26.7.1 state.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@robfrank robfrank merged commit 280dfb6 into main Jun 18, 2026
3 checks passed
@robfrank robfrank deleted the feat/observability-support branch June 18, 2026 07:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant