Merged
19 changes: 13 additions & 6 deletions .agents/skills/debug-openshell-cluster/SKILL.md
@@ -54,7 +54,7 @@ Use gateway metadata, deployment values, or the user's setup notes to identify t
|---|---|
| Docker | Gateway process logs, Docker daemon health, sandbox containers, image pulls. |
| Podman | Podman socket, rootless networking, sandbox containers, image pulls. |
| Kubernetes | Helm release, StatefulSet, service, secrets, sandbox pods, events. |
| Kubernetes | Helm release, gateway workload, service, secrets, sandbox pods, events. |
| VM | VM driver logs, rootfs availability, host virtualization support. |

### Step 3: Check Docker-Backed Gateways
@@ -131,12 +131,17 @@ Common findings:
```bash
helm -n openshell status openshell
helm -n openshell get values openshell
kubectl -n openshell get statefulset,pod,svc,pvc
kubectl -n openshell get deployment,statefulset,pod,svc,pvc
kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=200
kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=200
kubectl -n openshell rollout status deployment/openshell
kubectl -n openshell rollout status statefulset/openshell
```

Look for failed installs, unexpected values, missing namespace, wrong image tag, TLS settings that do not match the registered endpoint, and scheduling failures.
Use the log and rollout commands for the workload kind that exists in the
release. Look for failed installs, unexpected values, missing namespace, wrong
image tag, TLS settings that do not match the registered endpoint, and
scheduling failures.
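
Picking the workload kind that exists can be scripted. The following is a minimal sketch, assuming the gateway workload is named `openshell` as in the commands above. `pick_workload` is a hypothetical helper, not part of OpenShell; it only parses the output of `kubectl get deployment,statefulset -o name`.

```bash
#!/bin/sh
# pick_workload NAME: read `kubectl get ... -o name` output on stdin and print
# the first Deployment or StatefulSet called NAME as "<kind>/<name>".
# Returns non-zero when neither exists.
# Hypothetical helper for illustration; not shipped with OpenShell.
pick_workload() {
  name="$1"
  while IFS= read -r line; do
    case "$line" in
      deployment.apps/"$name")  echo "deployment/$name";  return 0 ;;
      statefulset.apps/"$name") echo "statefulset/$name"; return 0 ;;
    esac
  done
  return 1
}

# Usage against a live cluster (requires kubectl access):
#   ref=$(kubectl -n openshell get deployment,statefulset -o name | pick_workload openshell) &&
#     kubectl -n openshell logs "$ref" -c openshell-gateway --tail=200 &&
#     kubectl -n openshell rollout status "$ref"
```

Resolving the reference once avoids running both the `deployment/` and `statefulset/` variants and reading a "not found" error from whichever kind the release did not render.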

For HA or PostgreSQL-backed installs, also check the external database Secret
referenced by `server.externalDbSecret` and the PostgreSQL workload if the test
@@ -169,7 +174,7 @@ Secrets but does not create the sandbox JWT signing Secret.

If the gateway exits with `failed to read sandbox JWT signing key from
/etc/openshell-jwt/signing.pem`, verify that `openshell-jwt-keys` contains
`signing.pem`, `public.pem`, and `kid`, and that the StatefulSet mounts the
`signing.pem`, `public.pem`, and `kid`, and that the gateway workload mounts the
`sandbox-jwt` secret at `/etc/openshell-jwt`. The sandbox JWT mount is required
even when local Helm values disable TLS.
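
The Secret contents can be checked without decoding any key material. The following is a rough sketch, assuming the Secret is named `openshell-jwt-keys` as stated above. `missing_jwt_keys` is a hypothetical helper, not part of OpenShell; it only compares key names.

```bash
#!/bin/sh
# missing_jwt_keys: read Secret data key names on stdin (one per line) and
# print each required key that is absent. Prints nothing when all are present.
# Hypothetical helper for illustration; not shipped with OpenShell.
missing_jwt_keys() {
  present=$(cat)
  for key in signing.pem public.pem kid; do
    printf '%s\n' "$present" | grep -qxF -- "$key" || echo "$key"
  done
}

# Usage against a live cluster (requires kubectl access):
#   kubectl -n openshell get secret openshell-jwt-keys \
#     -o go-template='{{range $k, $v := .data}}{{$k}}{{"\n"}}{{end}}' | missing_jwt_keys
```

An empty result only shows that the keys exist in the Secret. The mount at `/etc/openshell-jwt` still needs to be confirmed on the gateway pod, for example with `kubectl -n openshell describe pod <pod>`.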

@@ -194,8 +199,9 @@ label, supervisor env vars `OPENSHELL_K8S_SA_TOKEN_FILE` and
Check the image references currently used by the gateway deployment:

```bash
kubectl -n openshell get deployment openshell -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}"
kubectl -n openshell get statefulset openshell -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}"
helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage'
helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage|workload'
```

The gateway image built from `deploy/docker/Dockerfile.gateway` and the scratch supervisor image built from `deploy/docker/Dockerfile.supervisor` should use the same build tag in branch and E2E deploys. A stale supervisor image can make sandbox behavior lag behind gateway policy or proto changes.
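
The tag comparison can be automated once both image references have been read from the workload. The following is a minimal sketch; `image_tag` and `same_tag` are hypothetical helpers, and the image names in the usage note are placeholders rather than the real OpenShell repositories.

```bash
#!/bin/sh
# image_tag REF: print the tag of an image reference, or an empty line if the
# reference carries no tag. Hypothetical helper for illustration.
image_tag() {
  ref="${1%%@*}"      # drop any @sha256:... digest suffix
  last="${ref##*/}"   # last path component, so a registry port is not read as a tag
  case "$last" in
    *:*) printf '%s\n' "${last##*:}" ;;
    *)   printf '\n' ;;
  esac
}

# same_tag GATEWAY_REF SUPERVISOR_REF: succeed only when both tags are
# non-empty and identical.
same_tag() {
  a=$(image_tag "$1")
  b=$(image_tag "$2")
  [ -n "$a" ] && [ "$a" = "$b" ]
}

# Usage (placeholder image names):
#   same_tag registry.example/gateway:abc123 registry.example/supervisor:abc123 \
#     || echo "gateway and supervisor tags differ"
```

References pinned by digest only, or left untagged, are treated as a mismatch here because there is no tag to compare; that is a design choice of this sketch, not OpenShell behavior.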
@@ -238,6 +244,7 @@ If the gateway is healthy but sandbox creation fails:
```bash
kubectl -n openshell get pods
kubectl -n openshell get events --sort-by=.lastTimestamp | tail -n 50
kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=200
kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=200
```

@@ -286,7 +293,7 @@ openshell logs <sandbox-name>
| Docker or Podman sandbox never registers | Wrong callback endpoint or supervisor startup failure | Gateway logs and sandbox container logs |
| Docker GPU e2e fails before GPU sandbox comparison | NVIDIA CDI specs are missing or Docker has not discovered them | `docker info --format '{{json .DiscoveredDevices}}'`, `/etc/cdi`, `/var/run/cdi`, `nvidia-cdi-refresh.service` |
| Kubernetes gateway pod pending | PVC unbound, taint, selector, or insufficient resources | `kubectl -n openshell describe pod <pod>` |
| Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs statefulset/openshell -c openshell-gateway` |
| Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs deployment/openshell -c openshell-gateway` or `kubectl -n openshell logs statefulset/openshell -c openshell-gateway` |
| CLI TLS error | Local mTLS bundle does not match server cert/CA | Check `~/.config/openshell/gateways/<name>/mtls/` |
| Image pull failure | Gateway or sandbox image cannot be pulled | Runtime events and image pull credentials |
| `K8s namespace not ready` with `envoy-gateway-openshell.yaml: the server could not find the requested resource` | Optional Gateway API manifest was applied without Envoy Gateway CRDs, or k3s Helm controller startup exceeded the namespace wait | Apply `deploy/kube/manifests/envoy-gateway-openshell.yaml` manually only after Envoy Gateway is installed and `grpcRoute` is enabled |
5 changes: 3 additions & 2 deletions .agents/skills/openshell-cli/SKILL.md
@@ -495,8 +495,9 @@ openshell gateway remove local # Remove local registrati
```bash
# Inspect a Kubernetes Helm release and gateway pod
helm -n openshell status openshell
kubectl -n openshell get pods,svc
kubectl -n openshell logs statefulset/openshell --tail=100
kubectl -n openshell get deployment,statefulset,pods,svc
kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=100
kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=100
```

For Docker, Podman, and VM-backed gateways, inspect the gateway process or container logs and the selected runtime directly.
7 changes: 5 additions & 2 deletions architecture/compute-runtimes.md
@@ -97,8 +97,11 @@ runtime still owns GPU device injection.
## Deployment Shape

Kubernetes deployments use the Helm chart under `deploy/helm/openshell`. The
chart deploys the gateway and sandbox runtime integration, but HA deployments
must point `server.externalDbSecret` at an operator-managed PostgreSQL database.
chart deploys the gateway and sandbox runtime integration. The default gateway
workload is a StatefulSet for SQLite-backed single-replica installs. External
database-backed installs can render a Deployment with `workload.kind=deployment`;
HA deployments must point `server.externalDbSecret` at an operator-managed
PostgreSQL database.
Standalone local deployments start the gateway with a selected runtime such as
Docker, Podman, or VM. The CLI can register multiple gateways and switch between
them without changing the sandbox architecture.
4 changes: 2 additions & 2 deletions architecture/gateway.md
@@ -384,8 +384,8 @@ hook Job using the gateway image itself -- no separate cert-generation image,
no extra mirror burden in air-gapped environments. In the default built-in PKI
path the hook creates TLS and sandbox JWT Secrets. When cert-manager is enabled,
cert-manager owns TLS Secrets and the hook runs with `--jwt-only` so the
required sandbox JWT Secret still exists before the gateway StatefulSet mounts
it, even if `pkiInitJob.enabled` remains true. On package-managed local
required sandbox JWT Secret still exists before the gateway workload mounts it,
even if `pkiInitJob.enabled` remains true. On package-managed local
gateways, the same command runs from the systemd
unit's `ExecStartPre` to bootstrap PKI into the configured local TLS directory
on first start. The Linux package unit defaults that directory to
11 changes: 10 additions & 1 deletion deploy/helm/openshell/README.md
@@ -62,7 +62,8 @@ See [`values.yaml`](values.yaml) for source defaults. Selected overlays:

### Database backend

By default, OpenShell uses SQLite:
By default, OpenShell uses SQLite and runs the gateway as a StatefulSet so the
database is backed by a per-pod PVC:

```yaml
server:
@@ -89,9 +90,15 @@ Then install the chart pointing at that Secret:
```bash
helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart --version <version> \
-n openshell \
--set workload.kind=deployment \
--set server.externalDbSecret=my-pg-credentials
```

Use `workload.kind=deployment` for external database-backed multi-replica
gateways. `workload.kind=statefulset` is still available for single-replica
SQLite installs and for operators who explicitly need StatefulSet identity or
storage semantics.

#### OpenShift

Append these flags to any of the PostgreSQL commands above for OpenShift:
@@ -229,6 +236,8 @@ add `ci/values-spire.yaml` to the OpenShell release values files.
| supervisor.image.tag | string | `""` | Supervisor image tag. Defaults to the chart appVersion when empty. |
| supervisor.sideloadMethod | string | `""` | How the supervisor binary is delivered into sandbox pods. Empty (default) = auto-detect from cluster version: K8s >= v1.35 -> "image-volume" (ImageVolume enabled by default; GA in v1.36) K8s < v1.35 -> "init-container" (copies via init container + emptyDir) On K8s v1.33-v1.34 with the ImageVolume feature gate manually enabled, set this to "image-volume" explicitly. |
| tolerations | list | `[]` | Tolerations for the gateway pod. |
| workload.allowMultiReplicaStatefulSet | bool | `false` | Allow replicaCount > 1 while rendering a StatefulSet. Prefer workload.kind=deployment for external database-backed multi-replica gateways; this override exists for operators who explicitly require StatefulSet identity or storage semantics. |
| workload.kind | string | `"statefulset"` | Gateway workload controller kind. Use `statefulset` for the default SQLite database, or `deployment` when server.externalDbSecret points at an external database. |

----------------------------------------------
Autogenerated from chart metadata using [helm-docs v1.14.2](https://github.com/norwoodj/helm-docs/releases/v1.14.2)
9 changes: 8 additions & 1 deletion deploy/helm/openshell/README.md.gotmpl
@@ -62,7 +62,8 @@ See [`values.yaml`](values.yaml) for source defaults. Selected overlays:

### Database backend

By default, OpenShell uses SQLite:
By default, OpenShell uses SQLite and runs the gateway as a StatefulSet so the
database is backed by a per-pod PVC:

```yaml
server:
@@ -89,9 +90,15 @@ Then install the chart pointing at that Secret:
```bash
helm install openshell oci://ghcr.io/nvidia/openshell/helm-chart --version <version> \
-n openshell \
--set workload.kind=deployment \
--set server.externalDbSecret=my-pg-credentials
```

Use `workload.kind=deployment` for external database-backed multi-replica
gateways. `workload.kind=statefulset` is still available for single-replica
SQLite installs and for operators who explicitly need StatefulSet identity or
storage semantics.

#### OpenShift

Append these flags to any of the PostgreSQL commands above for OpenShift:
3 changes: 3 additions & 0 deletions deploy/helm/openshell/ci/values-high-availability.yaml
@@ -6,5 +6,8 @@
# overlay expects the caller to provide a PostgreSQL Secret named openshell-ha-pg.
replicaCount: 2

workload:
  kind: deployment

server:
  externalDbSecret: openshell-ha-pg
177 changes: 177 additions & 0 deletions deploy/helm/openshell/templates/_gateway-workload.tpl
@@ -0,0 +1,177 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

{{/*
Gateway pod template shared by the StatefulSet and Deployment workload shapes.
*/}}
{{- define "openshell.gatewayPodTemplate" -}}
metadata:
  annotations:
    # Roll the gateway workload when the rendered gateway TOML changes - the
    # gateway only reads /etc/openshell/gateway.toml at startup, so without
    # this annotation a `helm upgrade` that only mutates the ConfigMap would
    # leave pods running with stale config.
    checksum/gateway-config: {{ include (print $.Template.BasePath "/gateway-config.yaml") . | sha256sum }}
    {{- with .Values.podAnnotations }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
  labels:
    {{- include "openshell.labels" . | nindent 4 }}
    {{- with .Values.podLabels }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
spec:
  terminationGracePeriodSeconds: {{ .Values.podLifecycle.terminationGracePeriodSeconds }}
  {{- with .Values.imagePullSecrets }}
  imagePullSecrets:
    {{- toYaml . | nindent 4 }}
  {{- end }}
  serviceAccountName: {{ include "openshell.serviceAccountName" . }}
  {{- if .Values.server.hostGatewayIP }}
  hostAliases:
    - ip: {{ .Values.server.hostGatewayIP | quote }}
      hostnames:
        - host.docker.internal
        - host.openshell.internal
  {{- end }}
  securityContext:
    {{- toYaml .Values.podSecurityContext | nindent 4 }}
  containers:
    - name: openshell-gateway
      securityContext:
        {{- toYaml .Values.securityContext | nindent 8 }}
      image: {{ include "openshell.image" . | quote }}
      imagePullPolicy: {{ .Values.image.pullPolicy }}
      args:
        - --config
        - /etc/openshell/gateway.toml
        {{- if not .Values.server.externalDbSecret }}
        - --db-url
        - {{ .Values.server.dbUrl | quote }}
        {{- end }}
      env:
        {{- if .Values.server.externalDbSecret }}
        - name: OPENSHELL_DB_URL
          valueFrom:
            secretKeyRef:
              name: {{ .Values.server.externalDbSecret }}
              key: uri
        {{- end }}
        # All gateway settings live in the ConfigMap-backed TOML file
        # mounted at /etc/openshell/gateway.toml. The only env var below
        # is a process-level setting consumed by libraries outside
        # gateway code (currently just SSL_CERT_FILE for OIDC issuer TLS).
        {{- if and .Values.server.oidc.issuer .Values.server.oidc.caConfigMapName }}
        # OIDC issuer custom-CA: rustls/reqwest read SSL_CERT_FILE for
        # outbound TLS verification. This is a process-level env var
        # consumed by the TLS stack itself, not by gateway code, so it
        # cannot be represented in the gateway TOML schema.
        - name: SSL_CERT_FILE
          value: /etc/openshell-tls/oidc-ca/ca.crt
        {{- end }}
      volumeMounts:
        {{- if eq (include "openshell.workloadKind" .) "statefulset" }}
        - name: openshell-data
          mountPath: /var/openshell
        {{- end }}
        - name: gateway-config
          mountPath: /etc/openshell
          readOnly: true
        - name: sandbox-jwt
          mountPath: /etc/openshell-jwt
          readOnly: true
        {{- if not .Values.server.disableTls }}
        - name: tls-cert
          mountPath: /etc/openshell-tls/server
          readOnly: true
        {{- if or .Values.server.tls.clientCaSecretName (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }}
        - name: tls-client-ca
          mountPath: /etc/openshell-tls/client-ca
          readOnly: true
        {{- end }}
        {{- end }}
        {{- if and .Values.server.oidc.issuer .Values.server.oidc.caConfigMapName }}
        - name: oidc-ca
          mountPath: /etc/openshell-tls/oidc-ca
          readOnly: true
        {{- end }}
      ports:
        - name: grpc
          containerPort: {{ .Values.service.port }}
          protocol: TCP
        - name: health
          containerPort: {{ .Values.service.healthPort }}
          protocol: TCP
        {{- if .Values.service.metricsPort }}
        - name: metrics
          containerPort: {{ .Values.service.metricsPort }}
          protocol: TCP
        {{- end }}
      startupProbe:
        httpGet:
          path: /healthz
          port: health
        periodSeconds: {{ .Values.probes.startup.periodSeconds }}
        timeoutSeconds: {{ .Values.probes.startup.timeoutSeconds }}
        failureThreshold: {{ .Values.probes.startup.failureThreshold }}
      livenessProbe:
        httpGet:
          path: /healthz
          port: health
        initialDelaySeconds: {{ .Values.probes.liveness.initialDelaySeconds }}
        periodSeconds: {{ .Values.probes.liveness.periodSeconds }}
        timeoutSeconds: {{ .Values.probes.liveness.timeoutSeconds }}
        failureThreshold: {{ .Values.probes.liveness.failureThreshold }}
      readinessProbe:
        httpGet:
          path: /readyz
          port: health
        initialDelaySeconds: {{ .Values.probes.readiness.initialDelaySeconds }}
        periodSeconds: {{ .Values.probes.readiness.periodSeconds }}
        timeoutSeconds: {{ .Values.probes.readiness.timeoutSeconds }}
        failureThreshold: {{ .Values.probes.readiness.failureThreshold }}
      resources:
        {{- toYaml .Values.resources | nindent 8 }}
  volumes:
    - name: gateway-config
      configMap:
        name: {{ include "openshell.fullname" . }}-config
    - name: sandbox-jwt
      secret:
        secretName: {{ include "openshell.sandboxJwtSecretName" . }}
        defaultMode: {{ .Values.server.sandboxJwt.secretDefaultMode | default 0400 }}
    {{- if not .Values.server.disableTls }}
    - name: tls-cert
      secret:
        secretName: {{ .Values.server.tls.certSecretName }}
    {{- if or .Values.server.tls.clientCaSecretName (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }}
    - name: tls-client-ca
      secret:
        {{- if or (and .Values.pkiInitJob.enabled (not .Values.certManager.enabled)) (and .Values.certManager.enabled .Values.certManager.clientCaFromServerTlsSecret) }}
        secretName: {{ .Values.server.tls.certSecretName }}
        items:
          - key: ca.crt
            path: ca.crt
        {{- else }}
        secretName: {{ .Values.server.tls.clientCaSecretName }}
        {{- end }}
    {{- end }}
    {{- end }}
    {{- if and .Values.server.oidc.issuer .Values.server.oidc.caConfigMapName }}
    - name: oidc-ca
      configMap:
        name: {{ .Values.server.oidc.caConfigMapName }}
    {{- end }}
  {{- with .Values.nodeSelector }}
  nodeSelector:
    {{- toYaml . | nindent 4 }}
  {{- end }}
  {{- with .Values.affinity }}
  affinity:
    {{- toYaml . | nindent 4 }}
  {{- end }}
  {{- with .Values.tolerations }}
  tolerations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
{{- end }}
26 changes: 25 additions & 1 deletion deploy/helm/openshell/templates/_helpers.tpl
@@ -144,14 +144,38 @@ init-container
{{- end -}}
{{- end }}

{{/*
Gateway workload kind. StatefulSet is the default because the default SQLite
database requires persistent per-pod storage.
*/}}
{{- define "openshell.workloadKind" -}}
{{- $workload := .Values.workload | default dict -}}
{{- if not (kindIs "map" $workload) -}}
{{- fail "workload must be a map with kind and allowMultiReplicaStatefulSet fields." -}}
{{- end -}}
{{- default "statefulset" (get $workload "kind") | lower -}}
{{- end }}

{{/*
Validate chart values that Helm would otherwise accept silently.
*/}}
{{- define "openshell.validateValues" -}}
{{- $workloadKind := include "openshell.workloadKind" . -}}
{{- $workload := .Values.workload | default dict -}}
{{- $replicaCount := int (default 1 .Values.replicaCount) -}}
{{- if and (hasKey .Values "postgres") (kindIs "map" .Values.postgres) (hasKey .Values.postgres "enabled") -}}
{{- fail "postgres.enabled was removed; the OpenShell chart no longer deploys PostgreSQL. Provision PostgreSQL separately and set server.externalDbSecret to a Secret containing a PostgreSQL URI." -}}
{{- end -}}
{{- if and (gt (int (default 1 .Values.replicaCount)) 1) (not .Values.server.externalDbSecret) -}}
{{- if not (or (eq $workloadKind "statefulset") (eq $workloadKind "deployment")) -}}
{{- fail "workload.kind must be one of: statefulset, deployment." -}}
{{- end -}}
{{- if and (eq $workloadKind "deployment") (not .Values.server.externalDbSecret) -}}
{{- fail "workload.kind=deployment requires server.externalDbSecret; use workload.kind=statefulset for the default SQLite database." -}}
{{- end -}}
{{- if and (gt $replicaCount 1) (not .Values.server.externalDbSecret) -}}
{{- fail "replicaCount > 1 requires server.externalDbSecret; multiple gateway replicas cannot share the default per-pod SQLite database." -}}
{{- end -}}
{{- if and (eq $workloadKind "statefulset") (gt $replicaCount 1) (not (get $workload "allowMultiReplicaStatefulSet" | default false)) -}}
{{- fail "replicaCount > 1 with workload.kind=statefulset requires workload.allowMultiReplicaStatefulSet=true; use workload.kind=deployment for external database-backed multi-replica gateways." -}}
{{- end -}}
{{- end }}
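
The workload checks in `openshell.validateValues` run in a fixed order, so the first failing rule decides the error a user sees. The following shell sketch mirrors that order for quick reasoning about value combinations. `validate_workload` is a hypothetical helper with shortened messages; it omits the `postgres.enabled` check, and the chart itself remains the source of truth.

```bash
#!/bin/sh
# validate_workload KIND REPLICAS EXTERNAL_DB_SECRET ALLOW_MULTI
# Prints "ok" or a short label for the first failing rule.
# Illustrative only: the chart enforces these rules at render time with `fail`.
validate_workload() {
  # An empty kind falls back to statefulset, and the kind is lowercased,
  # matching the openshell.workloadKind helper.
  kind=$(printf '%s' "${1:-statefulset}" | tr '[:upper:]' '[:lower:]')
  replicas="${2:-1}"
  secret="${3:-}"
  allow="${4:-false}"
  case "$kind" in
    statefulset|deployment) ;;
    *) echo "invalid kind"; return 1 ;;
  esac
  if [ "$kind" = deployment ] && [ -z "$secret" ]; then
    echo "deployment requires externalDbSecret"; return 1
  fi
  if [ "$replicas" -gt 1 ] && [ -z "$secret" ]; then
    echo "replicas > 1 requires externalDbSecret"; return 1
  fi
  if [ "$kind" = statefulset ] && [ "$replicas" -gt 1 ] && [ "$allow" != true ]; then
    echo "multi-replica statefulset requires allowMultiReplicaStatefulSet"; return 1
  fi
  echo ok
}
```

For example, `validate_workload statefulset 2 openshell-ha-pg false` reports the multi-replica StatefulSet rule, while the same arguments with `deployment` pass, which matches the `ci/values-high-availability.yaml` overlay.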