diff --git a/.agents/skills/debug-openshell-cluster/SKILL.md b/.agents/skills/debug-openshell-cluster/SKILL.md index fae8af123..68ecc7749 100644 --- a/.agents/skills/debug-openshell-cluster/SKILL.md +++ b/.agents/skills/debug-openshell-cluster/SKILL.md @@ -54,7 +54,7 @@ Use gateway metadata, deployment values, or the user's setup notes to identify t |---|---| | Docker | Gateway process logs, Docker daemon health, sandbox containers, image pulls. | | Podman | Podman socket, rootless networking, sandbox containers, image pulls. | -| Kubernetes | Helm release, StatefulSet, service, secrets, sandbox pods, events. | +| Kubernetes | Helm release, gateway workload, service, secrets, sandbox pods, events. | | VM | VM driver logs, rootfs availability, host virtualization support. | ### Step 3: Check Docker-Backed Gateways @@ -131,12 +131,17 @@ Common findings: ```bash helm -n openshell status openshell helm -n openshell get values openshell -kubectl -n openshell get statefulset,pod,svc,pvc +kubectl -n openshell get deployment,statefulset,pod,svc,pvc +kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=200 kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=200 +kubectl -n openshell rollout status deployment/openshell kubectl -n openshell rollout status statefulset/openshell ``` -Look for failed installs, unexpected values, missing namespace, wrong image tag, TLS settings that do not match the registered endpoint, and scheduling failures. +Use the log and rollout commands for the workload kind that exists in the +release. Look for failed installs, unexpected values, missing namespace, wrong +image tag, TLS settings that do not match the registered endpoint, and +scheduling failures. For HA or PostgreSQL-backed installs, also check the external database Secret referenced by `server.externalDbSecret` and the PostgreSQL workload if the test @@ -169,7 +174,7 @@ Secrets but does not create the sandbox JWT signing Secret. 
If the gateway exits with `failed to read sandbox JWT signing key from /etc/openshell-jwt/signing.pem`, verify that `openshell-jwt-keys` contains -`signing.pem`, `public.pem`, and `kid`, and that the StatefulSet mounts the +`signing.pem`, `public.pem`, and `kid`, and that the gateway workload mounts the `sandbox-jwt` secret at `/etc/openshell-jwt`. The sandbox JWT mount is required even when local Helm values disable TLS. @@ -194,8 +199,9 @@ label, supervisor env vars `OPENSHELL_K8S_SA_TOKEN_FILE` and Check the image references currently used by the gateway deployment: ```bash +kubectl -n openshell get deployment openshell -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}" kubectl -n openshell get statefulset openshell -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}" -helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage' +helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage|workload' ``` The gateway image built from `deploy/docker/Dockerfile.gateway` and the scratch supervisor image built from `deploy/docker/Dockerfile.supervisor` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes. 
@@ -238,6 +244,7 @@ If the gateway is healthy but sandbox creation fails: ```bash kubectl -n openshell get pods kubectl -n openshell get events --sort-by=.lastTimestamp | tail -n 50 +kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=200 kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=200 ``` @@ -286,7 +293,7 @@ openshell logs | Docker or Podman sandbox never registers | Wrong callback endpoint or supervisor startup failure | Gateway logs and sandbox container logs | | Docker GPU e2e fails before GPU sandbox comparison | NVIDIA CDI specs are missing or Docker has not discovered them | `docker info --format '{{json .DiscoveredDevices}}'`, `/etc/cdi`, `/var/run/cdi`, `nvidia-cdi-refresh.service` | | Kubernetes gateway pod pending | PVC unbound, taint, selector, or insufficient resources | `kubectl -n openshell describe pod ` | -| Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs statefulset/openshell -c openshell-gateway` | +| Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs deployment/openshell -c openshell-gateway` or `kubectl -n openshell logs statefulset/openshell -c openshell-gateway` | | CLI TLS error | Local mTLS bundle does not match server cert/CA | Check `~/.config/openshell/gateways//mtls/` | | Image pull failure | Gateway or sandbox image cannot be pulled | Runtime events and image pull credentials | | `K8s namespace not ready` with `envoy-gateway-openshell.yaml: the server could not find the requested resource` | Optional Gateway API manifest was applied without Envoy Gateway CRDs, or k3s Helm controller startup exceeded the namespace wait | Apply `deploy/kube/manifests/envoy-gateway-openshell.yaml` manually only after Envoy Gateway is installed and `grpcRoute` is enabled | diff --git a/.agents/skills/openshell-cli/SKILL.md b/.agents/skills/openshell-cli/SKILL.md index 
4b7501c1f..a55905891 100644 --- a/.agents/skills/openshell-cli/SKILL.md +++ b/.agents/skills/openshell-cli/SKILL.md @@ -495,8 +495,9 @@ openshell gateway remove local # Remove local registrati ```bash # Inspect a Kubernetes Helm release and gateway pod helm -n openshell status openshell -kubectl -n openshell get pods,svc -kubectl -n openshell logs statefulset/openshell --tail=100 +kubectl -n openshell get deployment,statefulset,pods,svc +kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=100 +kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=100 ``` For Docker, Podman, and VM-backed gateways, inspect the gateway process or container logs and the selected runtime directly. diff --git a/architecture/compute-runtimes.md b/architecture/compute-runtimes.md index 8b66efac6..ec0efded6 100644 --- a/architecture/compute-runtimes.md +++ b/architecture/compute-runtimes.md @@ -97,8 +97,11 @@ runtime still owns GPU device injection. ## Deployment Shape Kubernetes deployments use the Helm chart under `deploy/helm/openshell`. The -chart deploys the gateway and sandbox runtime integration, but HA deployments -must point `server.externalDbSecret` at an operator-managed PostgreSQL database. +chart deploys the gateway and sandbox runtime integration. The default gateway +workload is a StatefulSet for SQLite-backed single-replica installs. External +database-backed installs can render a Deployment with `workload.kind=deployment`; +HA deployments must point `server.externalDbSecret` at an operator-managed +PostgreSQL database. Standalone local deployments start the gateway with a selected runtime such as Docker, Podman, or VM. The CLI can register multiple gateways and switch between them without changing the sandbox architecture. 
diff --git a/architecture/gateway.md b/architecture/gateway.md index c8bb695ea..91963eb91 100644 --- a/architecture/gateway.md +++ b/architecture/gateway.md @@ -384,8 +384,8 @@ hook Job using the gateway image itself -- no separate cert-generation image, no extra mirror burden in air-gapped environments. In the default built-in PKI path the hook creates TLS and sandbox JWT Secrets. When cert-manager is enabled, cert-manager owns TLS Secrets and the hook runs with `--jwt-only` so the -required sandbox JWT Secret still exists before the gateway StatefulSet mounts -it, even if `pkiInitJob.enabled` remains true. On package-managed local +required sandbox JWT Secret still exists before the gateway workload mounts it, +even if `pkiInitJob.enabled` remains true. On package-managed local gateways, the same command runs from the systemd unit's `ExecStartPre` to bootstrap PKI into the configured local TLS directory on first start. The Linux package unit defaults that directory to diff --git a/deploy/helm/openshell/README.md b/deploy/helm/openshell/README.md index 62d7826f3..9cad26221 100644 --- a/deploy/helm/openshell/README.md +++ b/deploy/helm/openshell/README.md @@ -62,7 +62,8 @@ See [`values.yaml`](values.yaml) for source defaults. Selected overlays: ### Database backend -By default, OpenShell uses SQLite: +By default, OpenShell uses SQLite and runs the gateway as a StatefulSet so the +database is backed by a per-pod PVC: ```yaml server: @@ -89,9 +90,15 @@ Then install the chart pointing at that Secret: ```bash helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart --version \ -n openshell \ + --set workload.kind=deployment \ --set server.externalDbSecret=my-pg-credentials ``` +Use `workload.kind=deployment` for external database-backed multi-replica +gateways. `workload.kind=statefulset` is still available for single-replica +SQLite installs and for operators who explicitly need StatefulSet identity or +storage semantics. 
+ #### OpenShift Append these flags to any of the PostgreSQL commands above for OpenShift: @@ -229,6 +236,8 @@ add `ci/values-spire.yaml` to the OpenShell release values files. | supervisor.image.tag | string | `""` | Supervisor image tag. Defaults to the chart appVersion when empty. | | supervisor.sideloadMethod | string | `""` | How the supervisor binary is delivered into sandbox pods. Empty (default) = auto-detect from cluster version: K8s >= v1.35 -> "image-volume" (ImageVolume enabled by default; GA in v1.36) K8s < v1.35 -> "init-container" (copies via init container + emptyDir) On K8s v1.33-v1.34 with the ImageVolume feature gate manually enabled, set this to "image-volume" explicitly. | | tolerations | list | `[]` | Tolerations for the gateway pod. | +| workload.allowMultiReplicaStatefulSet | bool | `false` | Allow replicaCount > 1 while rendering a StatefulSet. Prefer workload.kind=deployment for external database-backed multi-replica gateways; this override exists for operators who explicitly require StatefulSet identity or storage semantics. | +| workload.kind | string | `"statefulset"` | Gateway workload controller kind. Use `statefulset` for the default SQLite database, or `deployment` when server.externalDbSecret points at an external database. | ---------------------------------------------- Autogenerated from chart metadata using [helm-docs v1.14.2](https://github.com/norwoodj/helm-docs/releases/v1.14.2) diff --git a/deploy/helm/openshell/README.md.gotmpl b/deploy/helm/openshell/README.md.gotmpl index 17b5f7821..e246ca67b 100644 --- a/deploy/helm/openshell/README.md.gotmpl +++ b/deploy/helm/openshell/README.md.gotmpl @@ -62,7 +62,8 @@ See [`values.yaml`](values.yaml) for source defaults. 
Selected overlays: ### Database backend -By default, OpenShell uses SQLite: +By default, OpenShell uses SQLite and runs the gateway as a StatefulSet so the +database is backed by a per-pod PVC: ```yaml server: @@ -89,9 +90,15 @@ Then install the chart pointing at that Secret: ```bash helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart --version \ -n openshell \ + --set workload.kind=deployment \ --set server.externalDbSecret=my-pg-credentials ``` +Use `workload.kind=deployment` for external database-backed multi-replica +gateways. `workload.kind=statefulset` is still available for single-replica +SQLite installs and for operators who explicitly need StatefulSet identity or +storage semantics. + #### OpenShift Append these flags to any of the PostgreSQL commands above for OpenShift: diff --git a/deploy/helm/openshell/ci/values-high-availability.yaml b/deploy/helm/openshell/ci/values-high-availability.yaml index ba439cfbf..407326d67 100644 --- a/deploy/helm/openshell/ci/values-high-availability.yaml +++ b/deploy/helm/openshell/ci/values-high-availability.yaml @@ -6,5 +6,8 @@ # overlay expects the caller to provide a PostgreSQL Secret named openshell-ha-pg. replicaCount: 2 +workload: + kind: deployment + server: externalDbSecret: openshell-ha-pg diff --git a/deploy/helm/openshell/templates/_gateway-workload.tpl b/deploy/helm/openshell/templates/_gateway-workload.tpl new file mode 100644 index 000000000..5931047e5 --- /dev/null +++ b/deploy/helm/openshell/templates/_gateway-workload.tpl @@ -0,0 +1,177 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 + +{{/* +Gateway pod template shared by the StatefulSet and Deployment workload shapes. 
+*/}} +{{- define "openshell.gatewayPodTemplate" -}} +metadata: + annotations: + # Roll the gateway workload when the rendered gateway TOML changes - the + # gateway only reads /etc/openshell/gateway.toml at startup, so without + # this annotation a `helm upgrade` that only mutates the ConfigMap would + # leave pods running with stale config. + checksum/gateway-config: {{ include (print $.Template.BasePath "/gateway-config.yaml") . | sha256sum }} + {{- with .Values.podAnnotations }} + {{- toYaml . | nindent 4 }} + {{- end }} + labels: + {{- include "openshell.labels" . | nindent 4 }} + {{- with .Values.podLabels }} + {{- toYaml . | nindent 4 }} + {{- end }} +spec: + terminationGracePeriodSeconds: {{ .Values.podLifecycle.terminationGracePeriodSeconds }} + {{- with .Values.imagePullSecrets }} + imagePullSecrets: + {{- toYaml . | nindent 4 }} + {{- end }} + serviceAccountName: {{ include "openshell.serviceAccountName" . }} + {{- if .Values.server.hostGatewayIP }} + hostAliases: + - ip: {{ .Values.server.hostGatewayIP | quote }} + hostnames: + - host.docker.internal + - host.openshell.internal + {{- end }} + securityContext: + {{- toYaml .Values.podSecurityContext | nindent 4 }} + containers: + - name: openshell-gateway + securityContext: + {{- toYaml .Values.securityContext | nindent 8 }} + image: {{ include "openshell.image" . | quote }} + imagePullPolicy: {{ .Values.image.pullPolicy }} + args: + - --config + - /etc/openshell/gateway.toml + {{- if not .Values.server.externalDbSecret }} + - --db-url + - {{ .Values.server.dbUrl | quote }} + {{- end }} + env: + {{- if .Values.server.externalDbSecret }} + - name: OPENSHELL_DB_URL + valueFrom: + secretKeyRef: + name: {{ .Values.server.externalDbSecret }} + key: uri + {{- end }} + # All gateway settings live in the ConfigMap-backed TOML file + # mounted at /etc/openshell/gateway.toml. 
The only env var below + # is a process-level setting consumed by libraries outside + # gateway code (currently just SSL_CERT_FILE for OIDC issuer TLS). + {{- if and .Values.server.oidc.issuer .Values.server.oidc.caConfigMapName }} + # OIDC issuer custom-CA: rustls/reqwest read SSL_CERT_FILE for + # outbound TLS verification. This is a process-level env var + # consumed by the TLS stack itself, not by gateway code, so it + # cannot be represented in the gateway TOML schema. + - name: SSL_CERT_FILE + value: /etc/openshell-tls/oidc-ca/ca.crt + {{- end }} + volumeMounts: + {{- if eq (include "openshell.workloadKind" .) "statefulset" }} + - name: openshell-data + mountPath: /var/openshell + {{- end }} + - name: gateway-config + mountPath: /etc/openshell + readOnly: true + - name: sandbox-jwt + mountPath: /etc/openshell-jwt + readOnly: true + {{- if not .Values.server.disableTls }} + - name: tls-cert + mountPath: /etc/openshell-tls/server + readOnly: true + {{- if or .Values.server.tls.clientCaSecretName (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }} + - name: tls-client-ca + mountPath: /etc/openshell-tls/client-ca + readOnly: true + {{- end }} + {{- end }} + {{- if and .Values.server.oidc.issuer .Values.server.oidc.caConfigMapName }} + - name: oidc-ca + mountPath: /etc/openshell-tls/oidc-ca + readOnly: true + {{- end }} + ports: + - name: grpc + containerPort: {{ .Values.service.port }} + protocol: TCP + - name: health + containerPort: {{ .Values.service.healthPort }} + protocol: TCP + {{- if .Values.service.metricsPort }} + - name: metrics + containerPort: {{ .Values.service.metricsPort }} + protocol: TCP + {{- end }} + startupProbe: + httpGet: + path: /healthz + port: health + periodSeconds: {{ .Values.probes.startup.periodSeconds }} + timeoutSeconds: {{ .Values.probes.startup.timeoutSeconds }} + failureThreshold: {{ .Values.probes.startup.failureThreshold }} + 
livenessProbe: + httpGet: + path: /healthz + port: health + initialDelaySeconds: {{ .Values.probes.liveness.initialDelaySeconds }} + periodSeconds: {{ .Values.probes.liveness.periodSeconds }} + timeoutSeconds: {{ .Values.probes.liveness.timeoutSeconds }} + failureThreshold: {{ .Values.probes.liveness.failureThreshold }} + readinessProbe: + httpGet: + path: /readyz + port: health + initialDelaySeconds: {{ .Values.probes.readiness.initialDelaySeconds }} + periodSeconds: {{ .Values.probes.readiness.periodSeconds }} + timeoutSeconds: {{ .Values.probes.readiness.timeoutSeconds }} + failureThreshold: {{ .Values.probes.readiness.failureThreshold }} + resources: + {{- toYaml .Values.resources | nindent 8 }} + volumes: + - name: gateway-config + configMap: + name: {{ include "openshell.fullname" . }}-config + - name: sandbox-jwt + secret: + secretName: {{ include "openshell.sandboxJwtSecretName" . }} + defaultMode: {{ .Values.server.sandboxJwt.secretDefaultMode | default 0400 }} + {{- if not .Values.server.disableTls }} + - name: tls-cert + secret: + secretName: {{ .Values.server.tls.certSecretName }} + {{- if or .Values.server.tls.clientCaSecretName (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }} + - name: tls-client-ca + secret: + {{- if or (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }} + secretName: {{ .Values.server.tls.certSecretName }} + items: + - key: ca.crt + path: ca.crt + {{- else }} + secretName: {{ .Values.server.tls.clientCaSecretName }} + {{- end }} + {{- end }} + {{- end }} + {{- if and .Values.server.oidc.issuer .Values.server.oidc.caConfigMapName }} + - name: oidc-ca + configMap: + name: {{ .Values.server.oidc.caConfigMapName }} + {{- end }} + {{- with .Values.nodeSelector }} + nodeSelector: + {{- toYaml . 
| nindent 4 }} + {{- end }} + {{- with .Values.affinity }} + affinity: + {{- toYaml . | nindent 4 }} + {{- end }} + {{- with .Values.tolerations }} + tolerations: + {{- toYaml . | nindent 4 }} + {{- end }} +{{- end }} diff --git a/deploy/helm/openshell/templates/_helpers.tpl b/deploy/helm/openshell/templates/_helpers.tpl index 5876be542..30c027576 100644 --- a/deploy/helm/openshell/templates/_helpers.tpl +++ b/deploy/helm/openshell/templates/_helpers.tpl @@ -144,14 +144,38 @@ init-container {{- end -}} {{- end }} +{{/* +Gateway workload kind. StatefulSet is the default because the default SQLite +database requires persistent per-pod storage. +*/}} +{{- define "openshell.workloadKind" -}} +{{- $workload := .Values.workload | default dict -}} +{{- if not (kindIs "map" $workload) -}} +{{- fail "workload must be a map with kind and allowMultiReplicaStatefulSet fields." -}} +{{- end -}} +{{- default "statefulset" (get $workload "kind") | lower -}} +{{- end }} + {{/* Validate chart values that Helm would otherwise accept silently. */}} {{- define "openshell.validateValues" -}} +{{- $workloadKind := include "openshell.workloadKind" . -}} +{{- $workload := .Values.workload | default dict -}} +{{- $replicaCount := int (default 1 .Values.replicaCount) -}} {{- if and (hasKey .Values "postgres") (kindIs "map" .Values.postgres) (hasKey .Values.postgres "enabled") -}} {{- fail "postgres.enabled was removed; the OpenShell chart no longer deploys PostgreSQL. Provision PostgreSQL separately and set server.externalDbSecret to a Secret containing a PostgreSQL URI." -}} {{- end -}} -{{- if and (gt (int (default 1 .Values.replicaCount)) 1) (not .Values.server.externalDbSecret) -}} +{{- if not (or (eq $workloadKind "statefulset") (eq $workloadKind "deployment")) -}} +{{- fail "workload.kind must be one of: statefulset, deployment." 
-}} +{{- end -}} +{{- if and (eq $workloadKind "deployment") (not .Values.server.externalDbSecret) -}} +{{- fail "workload.kind=deployment requires server.externalDbSecret; use workload.kind=statefulset for the default SQLite database." -}} +{{- end -}} +{{- if and (gt $replicaCount 1) (not .Values.server.externalDbSecret) -}} {{- fail "replicaCount > 1 requires server.externalDbSecret; multiple gateway replicas cannot share the default per-pod SQLite database." -}} {{- end -}} +{{- if and (eq $workloadKind "statefulset") (gt $replicaCount 1) (not (get $workload "allowMultiReplicaStatefulSet" | default false)) -}} +{{- fail "replicaCount > 1 with workload.kind=statefulset requires workload.allowMultiReplicaStatefulSet=true; use workload.kind=deployment for external database-backed multi-replica gateways." -}} +{{- end -}} {{- end }} diff --git a/deploy/helm/openshell/templates/deployment.yaml b/deploy/helm/openshell/templates/deployment.yaml new file mode 100644 index 000000000..e93797937 --- /dev/null +++ b/deploy/helm/openshell/templates/deployment.yaml @@ -0,0 +1,18 @@ +# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. +# SPDX-License-Identifier: Apache-2.0 +{{- include "openshell.validateValues" . }} +{{- if eq (include "openshell.workloadKind" .) "deployment" }} +apiVersion: apps/v1 +kind: Deployment +metadata: + name: {{ include "openshell.fullname" . }} + labels: + {{- include "openshell.labels" . | nindent 4 }} +spec: + replicas: {{ .Values.replicaCount }} + selector: + matchLabels: + {{- include "openshell.selectorLabels" . | nindent 6 }} + template: + {{- include "openshell.gatewayPodTemplate" . 
| nindent 4 }} +{{- end }} diff --git a/deploy/helm/openshell/templates/gateway-config.yaml b/deploy/helm/openshell/templates/gateway-config.yaml index d51b343e5..82c401cfa 100644 --- a/deploy/helm/openshell/templates/gateway-config.yaml +++ b/deploy/helm/openshell/templates/gateway-config.yaml @@ -4,7 +4,7 @@ ConfigMap holding the gateway TOML config file (RFC 0003). The gateway reads `/etc/openshell/gateway.toml` (mounted from this ConfigMap) -at startup. CLI flags and OPENSHELL_* env vars on the StatefulSet container +at startup. CLI flags and OPENSHELL_* env vars on the gateway workload container still override anything in this file. One value is intentionally NOT rendered here: diff --git a/deploy/helm/openshell/templates/statefulset.yaml b/deploy/helm/openshell/templates/statefulset.yaml index 3c0bd2cd3..30571f80b 100644 --- a/deploy/helm/openshell/templates/statefulset.yaml +++ b/deploy/helm/openshell/templates/statefulset.yaml @@ -1,6 +1,7 @@ # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-License-Identifier: Apache-2.0 {{- include "openshell.validateValues" . }} +{{- if eq (include "openshell.workloadKind" .) "statefulset" }} apiVersion: apps/v1 kind: StatefulSet metadata: @@ -14,173 +15,7 @@ spec: matchLabels: {{- include "openshell.selectorLabels" . | nindent 6 }} template: - metadata: - annotations: - # Roll the StatefulSet when the rendered gateway TOML changes — the - # gateway only reads /etc/openshell/gateway.toml at startup, so - # without this annotation a `helm upgrade` that only mutates the - # ConfigMap would leave pods running with stale config. - checksum/gateway-config: {{ include (print $.Template.BasePath "/gateway-config.yaml") . | sha256sum }} - {{- with .Values.podAnnotations }} - {{- toYaml . | nindent 8 }} - {{- end }} - labels: - {{- include "openshell.labels" . | nindent 8 }} - {{- with .Values.podLabels }} - {{- toYaml . 
| nindent 8 }} - {{- end }} - spec: - terminationGracePeriodSeconds: {{ .Values.podLifecycle.terminationGracePeriodSeconds }} - {{- with .Values.imagePullSecrets }} - imagePullSecrets: - {{- toYaml . | nindent 8 }} - {{- end }} - serviceAccountName: {{ include "openshell.serviceAccountName" . }} - {{- if .Values.server.hostGatewayIP }} - hostAliases: - - ip: {{ .Values.server.hostGatewayIP | quote }} - hostnames: - - host.docker.internal - - host.openshell.internal - {{- end }} - securityContext: - {{- toYaml .Values.podSecurityContext | nindent 8 }} - containers: - - name: openshell-gateway - securityContext: - {{- toYaml .Values.securityContext | nindent 12 }} - image: {{ include "openshell.image" . | quote }} - imagePullPolicy: {{ .Values.image.pullPolicy }} - args: - - --config - - /etc/openshell/gateway.toml - {{- if not .Values.server.externalDbSecret }} - - --db-url - - {{ .Values.server.dbUrl | quote }} - {{- end }} - env: - {{- if .Values.server.externalDbSecret }} - - name: OPENSHELL_DB_URL - valueFrom: - secretKeyRef: - name: {{ .Values.server.externalDbSecret }} - key: uri - {{- end }} - # All gateway settings live in the ConfigMap-backed TOML file - # mounted at /etc/openshell/gateway.toml. The only env var below - # is a process-level setting consumed by libraries outside - # gateway code (currently just SSL_CERT_FILE for OIDC issuer TLS). - {{- if and .Values.server.oidc.issuer .Values.server.oidc.caConfigMapName }} - # OIDC issuer custom-CA: rustls/reqwest read SSL_CERT_FILE for - # outbound TLS verification. This is a process-level env var - # consumed by the TLS stack itself, not by gateway code, so it - # cannot be represented in the gateway TOML schema. 
- - name: SSL_CERT_FILE - value: /etc/openshell-tls/oidc-ca/ca.crt - {{- end }} - volumeMounts: - - name: openshell-data - mountPath: /var/openshell - - name: gateway-config - mountPath: /etc/openshell - readOnly: true - - name: sandbox-jwt - mountPath: /etc/openshell-jwt - readOnly: true - {{- if not .Values.server.disableTls }} - - name: tls-cert - mountPath: /etc/openshell-tls/server - readOnly: true - {{- if or .Values.server.tls.clientCaSecretName (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }} - - name: tls-client-ca - mountPath: /etc/openshell-tls/client-ca - readOnly: true - {{- end }} - {{- end }} - {{- if and .Values.server.oidc.issuer .Values.server.oidc.caConfigMapName }} - - name: oidc-ca - mountPath: /etc/openshell-tls/oidc-ca - readOnly: true - {{- end }} - ports: - - name: grpc - containerPort: {{ .Values.service.port }} - protocol: TCP - - name: health - containerPort: {{ .Values.service.healthPort }} - protocol: TCP - {{- if .Values.service.metricsPort }} - - name: metrics - containerPort: {{ .Values.service.metricsPort }} - protocol: TCP - {{- end }} - startupProbe: - httpGet: - path: /healthz - port: health - periodSeconds: {{ .Values.probes.startup.periodSeconds }} - timeoutSeconds: {{ .Values.probes.startup.timeoutSeconds }} - failureThreshold: {{ .Values.probes.startup.failureThreshold }} - livenessProbe: - httpGet: - path: /healthz - port: health - initialDelaySeconds: {{ .Values.probes.liveness.initialDelaySeconds }} - periodSeconds: {{ .Values.probes.liveness.periodSeconds }} - timeoutSeconds: {{ .Values.probes.liveness.timeoutSeconds }} - failureThreshold: {{ .Values.probes.liveness.failureThreshold }} - readinessProbe: - httpGet: - path: /readyz - port: health - initialDelaySeconds: {{ .Values.probes.readiness.initialDelaySeconds }} - periodSeconds: {{ .Values.probes.readiness.periodSeconds }} - timeoutSeconds: {{ 
.Values.probes.readiness.timeoutSeconds }} - failureThreshold: {{ .Values.probes.readiness.failureThreshold }} - resources: - {{- toYaml .Values.resources | nindent 12 }} - volumes: - - name: gateway-config - configMap: - name: {{ include "openshell.fullname" . }}-config - - name: sandbox-jwt - secret: - secretName: {{ include "openshell.sandboxJwtSecretName" . }} - defaultMode: {{ .Values.server.sandboxJwt.secretDefaultMode | default 0400 }} - {{- if not .Values.server.disableTls }} - - name: tls-cert - secret: - secretName: {{ .Values.server.tls.certSecretName }} - {{- if or .Values.server.tls.clientCaSecretName (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }} - - name: tls-client-ca - secret: - {{- if or (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }} - secretName: {{ .Values.server.tls.certSecretName }} - items: - - key: ca.crt - path: ca.crt - {{- else }} - secretName: {{ .Values.server.tls.clientCaSecretName }} - {{- end }} - {{- end }} - {{- end }} - {{- if and .Values.server.oidc.issuer .Values.server.oidc.caConfigMapName }} - - name: oidc-ca - configMap: - name: {{ .Values.server.oidc.caConfigMapName }} - {{- end }} - {{- with .Values.nodeSelector }} - nodeSelector: - {{- toYaml . | nindent 8 }} - {{- end }} - {{- with .Values.affinity }} - affinity: - {{- toYaml . | nindent 8 }} - {{- end }} - {{- with .Values.tolerations }} - tolerations: - {{- toYaml . | nindent 8 }} - {{- end }} + {{- include "openshell.gatewayPodTemplate" . 
| nindent 4 }} volumeClaimTemplates: - metadata: name: openshell-data @@ -189,3 +24,4 @@ spec: resources: requests: storage: 1Gi +{{- end }} diff --git a/deploy/helm/openshell/tests/gateway_config_test.yaml b/deploy/helm/openshell/tests/gateway_config_test.yaml index e9ca8014f..9f86d845c 100644 --- a/deploy/helm/openshell/tests/gateway_config_test.yaml +++ b/deploy/helm/openshell/tests/gateway_config_test.yaml @@ -4,6 +4,7 @@ suite: gateway TOML config shape templates: - templates/gateway-config.yaml + - templates/deployment.yaml - templates/statefulset.yaml release: name: openshell @@ -18,6 +19,22 @@ tests: - exists: path: spec.template.metadata.annotations["checksum/gateway-config"] + - it: renders a StatefulSet by default + template: templates/statefulset.yaml + asserts: + - equal: + path: kind + value: StatefulSet + + - it: treats a null workload map as the default StatefulSet + template: templates/statefulset.yaml + set: + workload: null + asserts: + - equal: + path: kind + value: StatefulSet + - it: uses a stable gateway container name template: templates/statefulset.yaml asserts: @@ -193,6 +210,80 @@ tests: - failedTemplate: errorPattern: "replicaCount > 1 requires server.externalDbSecret" + - it: renders a Deployment for external database-backed gateway workloads + template: templates/deployment.yaml + set: + workload.kind: deployment + replicaCount: 2 + server.externalDbSecret: my-pg-secret + asserts: + - equal: + path: kind + value: Deployment + - equal: + path: spec.replicas + value: 2 + - equal: + path: spec.template.spec.containers[0].name + value: openshell-gateway + - notContains: + path: spec.template.spec.containers[0].volumeMounts + content: + name: openshell-data + mountPath: /var/openshell + - contains: + path: spec.template.spec.containers[0].env + content: + name: OPENSHELL_DB_URL + valueFrom: + secretKeyRef: + name: my-pg-secret + key: uri + + - it: fails when a Deployment uses the default SQLite database + template: 
templates/statefulset.yaml + set: + workload.kind: deployment + asserts: + - failedTemplate: + errorPattern: "workload.kind=deployment requires server.externalDbSecret" + + - it: fails when multiple replicas use a StatefulSet without an override + template: templates/statefulset.yaml + set: + replicaCount: 2 + server.externalDbSecret: my-pg-secret + asserts: + - failedTemplate: + errorPattern: "replicaCount > 1 with workload.kind=statefulset requires workload.allowMultiReplicaStatefulSet=true" + + - it: allows a multi-replica StatefulSet with an external database and explicit override + template: templates/statefulset.yaml + set: + replicaCount: 2 + server.externalDbSecret: my-pg-secret + workload.allowMultiReplicaStatefulSet: true + asserts: + - equal: + path: kind + value: StatefulSet + - equal: + path: spec.replicas + value: 2 + - contains: + path: spec.template.spec.containers[0].volumeMounts + content: + name: openshell-data + mountPath: /var/openshell + + - it: fails when workload.kind is invalid + template: templates/statefulset.yaml + set: + workload.kind: daemonset + asserts: + - failedTemplate: + errorPattern: "workload.kind must be one of: statefulset, deployment" + - it: does not pass --db-url in args when externalDbSecret is set template: templates/statefulset.yaml set: @@ -216,11 +307,14 @@ tests: name: my-pg-secret key: uri - - it: renders HA external database configuration from the CI overlay - template: templates/statefulset.yaml + - it: renders HA external database configuration from the CI overlay as a Deployment + template: templates/deployment.yaml values: - ../ci/values-high-availability.yaml asserts: + - equal: + path: kind + value: Deployment - equal: path: spec.replicas value: 2 diff --git a/deploy/helm/openshell/values.yaml b/deploy/helm/openshell/values.yaml index 4fd5ba4fc..2c255dca2 100644 --- a/deploy/helm/openshell/values.yaml +++ b/deploy/helm/openshell/values.yaml @@ -7,6 +7,17 @@ # server.externalDbSecret because the default SQLite 
backend is per pod. replicaCount: 1 +workload: + # -- Gateway workload controller kind. Use `statefulset` for the default + # SQLite database, or `deployment` when server.externalDbSecret points at an + # external database. + kind: statefulset + # -- Allow replicaCount > 1 while rendering a StatefulSet. Prefer + # workload.kind=deployment for external database-backed multi-replica + # gateways; this override exists for operators who explicitly require + # StatefulSet identity or storage semantics. + allowMultiReplicaStatefulSet: false + image: # -- Gateway image repository. repository: ghcr.io/nvidia/openshell/gateway diff --git a/docs/kubernetes/openshift.mdx b/docs/kubernetes/openshift.mdx index e56fc37db..b8313bdfe 100644 --- a/docs/kubernetes/openshift.mdx +++ b/docs/kubernetes/openshift.mdx @@ -68,6 +68,9 @@ installing the chart. oc -n openshell rollout status statefulset/openshell ``` +If you set `workload.kind=deployment`, use +`oc -n openshell rollout status deployment/openshell` instead. + ## Connect to the gateway diff --git a/docs/kubernetes/setup.mdx b/docs/kubernetes/setup.mdx index 78c685dd6..0a25eceeb 100644 --- a/docs/kubernetes/setup.mdx +++ b/docs/kubernetes/setup.mdx @@ -12,7 +12,7 @@ position: 1 The OpenShell Helm chart is experimental and under active development. Templates, values, and defaults can change between releases. Do not use it in production. -Use the Kubernetes deployment when the gateway should run on a shared cluster, in a cloud environment, or as part of team infrastructure. The Helm chart deploys the gateway as a StatefulSet and handles PKI bootstrap, RBAC, and sandbox namespace setup automatically. +Use the Kubernetes deployment when the gateway should run on a shared cluster, in a cloud environment, or as part of team infrastructure. The Helm chart handles PKI bootstrap, RBAC, sandbox namespace setup, and the gateway workload. 
It uses a StatefulSet by default for the SQLite database, and can render a Deployment when `server.externalDbSecret` points at an external database. ## Prerequisites @@ -88,6 +88,12 @@ The chart automatically generates PKI secrets on first install using pre-install kubectl -n openshell rollout status statefulset/openshell ``` +If you set `workload.kind=deployment`, wait on the Deployment instead: + +```shell +kubectl -n openshell rollout status deployment/openshell +``` + ## Connect to the gateway For local evaluation, use a port-forward: @@ -137,6 +143,8 @@ The most commonly changed values are: |---|---| | `image.repository` / `image.tag` | Gateway container image. Defaults to `ghcr.io/nvidia/openshell/gateway:latest`. | | `replicaCount` | Number of gateway replicas. Leave at `1` unless you are explicitly testing multi-replica behavior. | +| `workload.kind` | Gateway workload controller. Use `statefulset` for SQLite or `deployment` with `server.externalDbSecret`. | +| `workload.allowMultiReplicaStatefulSet` | Allow `replicaCount > 1` with `workload.kind=statefulset`. Prefer Deployment for external database-backed multi-replica gateways. | | `server.sandboxNamespace` | Namespace where sandbox pods are created. Defaults to the Helm release namespace when left empty. | | `server.externalDbSecret` | Secret containing a PostgreSQL connection URI in the `uri` key. Use when the database is managed outside the chart. | | `server.sandboxImage` | Default sandbox image used when a sandbox does not specify one. | diff --git a/docs/reference/support-matrix.mdx b/docs/reference/support-matrix.mdx index f278b15eb..6fe54921a 100644 --- a/docs/reference/support-matrix.mdx +++ b/docs/reference/support-matrix.mdx @@ -71,7 +71,7 @@ OpenShell publishes the gateway image for `linux/amd64` and `linux/arm64`. 
|---|---|---| | Gateway | `ghcr.io/nvidia/openshell/gateway:latest` | Helm chart install or upgrade, or standalone container deployment | -The Helm chart in `deploy/helm/openshell` deploys the gateway StatefulSet, service account, service, persistent storage, and network policy for Kubernetes. +The Helm chart in `deploy/helm/openshell` deploys the gateway workload, service account, service, optional persistent storage, and network policy for Kubernetes. It defaults to a StatefulSet for SQLite-backed installs and can render a Deployment for external database-backed installs. Sandbox images are maintained separately in the [openshell-community](https://github.com/nvidia/openshell-community) repository. diff --git a/docs/sandboxes/manage-gateways.mdx b/docs/sandboxes/manage-gateways.mdx index 6cfa39121..c92b86958 100644 --- a/docs/sandboxes/manage-gateways.mdx +++ b/docs/sandboxes/manage-gateways.mdx @@ -140,8 +140,9 @@ openshell gateway info For Kubernetes gateways, inspect the gateway workload and cluster events: ```shell -kubectl -n openshell get pods -kubectl -n openshell logs statefulset/openshell +kubectl -n openshell get deployment,statefulset,pods +kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=100 +kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=100 kubectl -n openshell get events --sort-by=.lastTimestamp ```
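Because the updated docs list both `deployment/openshell` and `statefulset/openshell` commands, one of each pair will return NotFound on any given install. A minimal sketch of resolving the workload kind first is shown below. The helper name `gateway_workload` is hypothetical and not part of the chart or CLI, and it assumes the release and namespace are both named `openshell`, as in the commands above.

```shell
# Hypothetical helper (not shipped with OpenShell): print the gateway workload
# as "<kind>/<name>" for whichever controller kind exists in the namespace.
# Assumes the namespace and release name default to "openshell".
gateway_workload() {
  local ns="${1:-openshell}"
  local name="${2:-openshell}"
  local kind
  for kind in deployment statefulset; do
    # `kubectl get` exits non-zero when the resource does not exist.
    if kubectl -n "$ns" get "$kind" "$name" >/dev/null 2>&1; then
      printf '%s/%s\n' "$kind" "$name"
      return 0
    fi
  done
  echo "no Deployment or StatefulSet named $name in namespace $ns" >&2
  return 1
}

# Example use (commented out so that sourcing this file has no side effects):
#   workload="$(gateway_workload)" || exit 1
#   kubectl -n openshell logs "$workload" -c openshell-gateway --tail=100
#   kubectl -n openshell rollout status "$workload"
```

The helper checks for a Deployment first only as an arbitrary ordering; a release renders one kind or the other depending on `workload.kind`, so the order should not matter in practice.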