41 changes: 33 additions & 8 deletions .agents/skills/debug-openshell-cluster/SKILL.md
Original file line number Diff line number Diff line change
@@ -132,10 +132,9 @@ Common findings:
helm -n openshell status openshell
helm -n openshell get values openshell
kubectl -n openshell get deployment,statefulset,pod,svc,pvc
kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=200
kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=200
kubectl -n openshell rollout status deployment/openshell
kubectl -n openshell rollout status statefulset/openshell
WORKLOAD="$(kubectl -n openshell get deployment openshell >/dev/null 2>&1 && echo deployment/openshell || echo statefulset/openshell)"
kubectl -n openshell logs "${WORKLOAD}" -c openshell-gateway --tail=200
kubectl -n openshell rollout status "${WORKLOAD}"
```

Use the log and rollout commands for the workload kind that exists in the
@@ -153,6 +152,32 @@ kubectl -n openshell get deployment,service,pod -l app.kubernetes.io/name=opensh
kubectl -n openshell logs deployment/openshell-e2e-postgres --tail=200
```

For multi-replica gateway installs, supervisor and client session traffic may
be served by a non-owner gateway replica and relayed to the current supervisor
owner over the internal `PeerRelay` RPC. Check the headless peer Service,
projected peer ServiceAccount token volume, and TokenReview RBAC:

```bash
kubectl -n openshell get svc openshell-peer -o wide
kubectl -n openshell get endpoints openshell-peer
kubectl -n openshell get pod -l app.kubernetes.io/instance=openshell \
-o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{.spec.volumes[?(@.name=="gateway-peer-token")]}{"\n"}{.spec.containers[0].env[?(@.name=="OPENSHELL_PEER_SERVICE_ACCOUNT_TOKEN_FILE")]}{"\n"}{.spec.containers[0].env[?(@.name=="OPENSHELL_PEER_ENDPOINT")]}{"\n"}{end}'
kubectl auth can-i create tokenreviews.authentication.k8s.io \
--as=system:serviceaccount:openshell:openshell
kubectl auth can-i get pods -n openshell \
--as=system:serviceaccount:openshell:openshell
kubectl -n openshell logs "${WORKLOAD}" --tail=200 | grep -E 'gateway peer|PeerRelay|supervisor owner|owner relay'
```

Expected gateway startup logs include
`gateway peer ServiceAccount TokenReview authentication enabled`. If peer relay
calls fail with `Unauthenticated`, verify the `gateway-peer-token` projected
volume has audience `openshell-gateway-peer` and that the receiving gateway can
create TokenReviews. If they fail with `PermissionDenied`, verify the gateway
ServiceAccount name, release namespace, pod UID, and Helm selector labels match
the live gateway pods. Deployment-backed gateway pods should also publish
`OPENSHELL_PEER_ENDPOINT` from their pod IP.
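
To check the audience directly, a minimal sketch follows. It assumes the `gateway-peer-token` volume is a projected volume with a `serviceAccountToken` source (the standard Kubernetes shape for pod-bound tokens). The function names `peer_token_audiences` and `check_peer_token_audience` are illustrative and are not part of the chart or this repository.

```bash
# List each gateway pod with the audience set on its gateway-peer-token volume.
# Output is one "<pod-name><TAB><audience>" line per pod.
peer_token_audiences() {
  local ns="${1:-openshell}"
  kubectl -n "${ns}" get pod -l app.kubernetes.io/instance=openshell \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.volumes[?(@.name=="gateway-peer-token")].projected.sources[*].serviceAccountToken.audience}{"\n"}{end}'
}

# Report pods whose audience differs from the expected value (or is missing).
# Returns non-zero when at least one pod does not match.
check_peer_token_audience() {
  local expected="${2:-openshell-gateway-peer}" bad
  bad="$(peer_token_audiences "${1:-openshell}" \
    | awk -F'\t' -v want="${expected}" '$2 != want')"
  if [ -n "${bad}" ]; then
    printf 'unexpected peer token audience:\n%s\n' "${bad}" >&2
    return 1
  fi
}
```

Run `check_peer_token_audience` with no arguments to test the `openshell` namespace against `openshell-gateway-peer`. A pod that lacks the volume entirely shows an empty audience and is reported as a mismatch.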

Check required Helm deployment secrets:

```bash
@@ -199,8 +224,8 @@ label, supervisor env vars `OPENSHELL_K8S_SA_TOKEN_FILE` and
Check the image references currently used by the gateway deployment:

```bash
kubectl -n openshell get deployment openshell -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}"
kubectl -n openshell get statefulset openshell -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}"
WORKLOAD="$(kubectl -n openshell get deployment openshell >/dev/null 2>&1 && echo deployment/openshell || echo statefulset/openshell)"
kubectl -n openshell get "${WORKLOAD}" -o jsonpath="{.spec.template.spec.containers[*].image}{\"\n\"}{.spec.template.spec.containers[*].env[?(@.name==\"OPENSHELL_SUPERVISOR_IMAGE\")].value}{\"\n\"}"
helm -n openshell get values openshell | grep -E 'repository|tag|supervisorImage|workload'
```

@@ -244,8 +269,8 @@ If the gateway is healthy but sandbox creation fails:
```bash
kubectl -n openshell get pods
kubectl -n openshell get events --sort-by=.lastTimestamp | tail -n 50
kubectl -n openshell logs deployment/openshell -c openshell-gateway --tail=200
kubectl -n openshell logs statefulset/openshell -c openshell-gateway --tail=200
WORKLOAD="$(kubectl -n openshell get deployment openshell >/dev/null 2>&1 && echo deployment/openshell || echo statefulset/openshell)"
kubectl -n openshell logs "${WORKLOAD}" -c openshell-gateway --tail=200
```
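
The workload lookup above falls back to `statefulset/openshell` even when neither workload exists, so later commands fail with a less obvious error. A hedged alternative, assuming only the release name and namespace shown above, is a small helper that reports the miss explicitly. `detect_workload` is a hypothetical name, not an existing script in this repository.

```bash
# Print "deployment/<name>" or "statefulset/<name>" for whichever workload exists.
# Returns non-zero when neither is found instead of assuming a StatefulSet.
detect_workload() {
  local ns="${1:-openshell}" name="${2:-openshell}" kind
  for kind in deployment statefulset; do
    if kubectl -n "${ns}" get "${kind}" "${name}" >/dev/null 2>&1; then
      echo "${kind}/${name}"
      return 0
    fi
  done
  echo "no deployment or statefulset named ${name} in namespace ${ns}" >&2
  return 1
}

# Example use (commented out so that sourcing this file has no side effects):
# WORKLOAD="$(detect_workload)" && \
#   kubectl -n openshell logs "${WORKLOAD}" -c openshell-gateway --tail=200
```

Checking the Deployment first keeps the same precedence as the one-liner, so the two agree whenever a workload is present.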

Check the configured sandbox namespace:
111 changes: 94 additions & 17 deletions .agents/skills/helm-dev-environment/SKILL.md
@@ -66,10 +66,34 @@ generates mTLS secrets on first install. Envoy Gateway opt-in; see the Optional

The gateway Service uses ClusterIP. Access is via Envoy Gateway (port `8080`) or `kubectl port-forward`.

**HA test deploy** (two gateway replicas + external PostgreSQL Secret): uncomment
`#- ci/values-high-availability.yaml` in `deploy/helm/openshell/skaffold.yaml`,
create the Secret named `openshell-ha-pg` with a `uri` key, then run
`mise run helm:skaffold:run` or `mise run helm:skaffold:dev`.
Skaffold profiles are available for HA and reverse-proxy development. Run these from
`deploy/helm/openshell/`:

```bash
# Two gateway replicas + external PostgreSQL Secret.
KUBECONFIG=../../../kubeconfig skaffold run -p high-availability

# Two gateway replicas + Envoy Gateway + Gateway API route.
KUBECONFIG=../../../kubeconfig skaffold run -p ha-envoy
```

The HA profiles expect a Secret named `openshell-ha-pg` in the `openshell`
namespace with a `uri` key. For local manual testing, either create your own
PostgreSQL Secret or use the e2e PostgreSQL fixture manifest in
`e2e/kubernetes/postgres-fixture.yaml`.
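
If you create your own Secret, one possible shape is sketched below. It assumes a generic (Opaque) Secret is sufficient and that the caller supplies a complete PostgreSQL connection URI; the helper name `create_ha_pg_secret` is hypothetical and the URI in the usage line is a placeholder, not a value from this repository.

```bash
# Create the openshell-ha-pg Secret with a single `uri` key.
# Usage: create_ha_pg_secret <postgres-uri> [namespace]
create_ha_pg_secret() {
  if [ -z "${1:-}" ]; then
    echo "usage: create_ha_pg_secret <postgres-uri> [namespace]" >&2
    return 2
  fi
  local uri="$1" ns="${2:-openshell}"
  kubectl -n "${ns}" create secret generic openshell-ha-pg \
    --from-literal=uri="${uri}"
}

# Example use (placeholder URI, commented out to avoid side effects):
# create_ha_pg_secret 'postgres://user:password@postgres.example:5432/openshell'
```

Passing the URI on the command line leaves it in shell history, so for anything beyond local testing a file-based approach such as `--from-file` may be preferable.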

For the `ha-envoy` profile, return to the repository root and apply the
GatewayClass and BackendTrafficPolicy manifest after Skaffold has installed
Envoy Gateway:

```bash
KUBECONFIG=kubeconfig mise run helm:gateway:apply
```

The BackendTrafficPolicy disables Envoy request and stream-duration timeouts for
OpenShell's `GRPCRoute`. Keep that policy in `deploy/kube/manifests/envoy-gateway-openshell.yaml`,
not in the Helm chart; it is required for long-lived gRPC create/watch/exec/relay
streams during gateway rollouts and scale events.

### TLS behaviour

@@ -139,23 +163,76 @@ but will point to a deleted cluster — safe to ignore or clean up manually.

## Optional Add-ons

Each add-on requires uncommenting the corresponding `valuesFiles` entry in
`deploy/helm/openshell/skaffold.yaml` before running `helm:skaffold:dev` or `helm:skaffold:run`.
Some add-ons can be enabled by uncommenting values in `skaffold.yaml`, but prefer
the dedicated Skaffold profiles when they exist. Profiles avoid leaving local
manual edits in the worktree.

### Envoy Gateway (Gateway API / GRPCRoute)

Envoy Gateway is already installed by Skaffold (the `envoy-gateway` Helm release in
`skaffold.yaml`). To activate routing:
Use the `ha-envoy` Skaffold profile for HA reverse-proxy testing:

1. Uncomment `#- values-gateway.yaml` in `skaffold.yaml`
2. Redeploy: `mise run helm:skaffold:run`
3. Apply the GatewayClass: `mise run helm:gateway:apply`
4. Access: `http://127.0.0.1:8080`
```bash
cd deploy/helm/openshell
KUBECONFIG=../../../kubeconfig skaffold run -p ha-envoy
cd ../../..
KUBECONFIG=kubeconfig mise run helm:gateway:apply
```

`values-gateway.yaml` creates a `Gateway` (listener on port 80, class `eg`) and
`GRPCRoute` in the `openshell` namespace. The `ha-envoy` profile installs the
Envoy Gateway Helm chart and layers both `values-high-availability.yaml` and
`values-gateway.yaml` onto the OpenShell release.

`values-gateway.yaml` creates a `Gateway` (listener on port 80, class `eg`) and a
`GRPCRoute` in the `openshell` namespace. Envoy Gateway provisions a LoadBalancer
service for the proxy; klipper-lb binds it to hostPort 80, reachable via the
`8080:80` load balancer port mapping.
`deploy/kube/manifests/envoy-gateway-openshell.yaml` creates:

- `GatewayClass/eg`
- `BackendTrafficPolicy/openshell-grpc-timeouts`

The Envoy Gateway proxy Service is usually exposed through the k3d load balancer
at `http://127.0.0.1:8080`. If the cluster was created with a different
`HELM_K3S_LB_HOST_PORT`, use that host port instead.
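
A small helper can keep the endpoint consistent with the port mapping. This is a sketch only: the `8080` fallback mirrors the address quoted above and is an assumption about the default mapping, and `envoy_endpoint` is an illustrative name rather than an existing task or script.

```bash
# Print the local Envoy endpoint, honoring HELM_K3S_LB_HOST_PORT when it is set.
# Falls back to 8080, the host port quoted in this section (assumed default).
envoy_endpoint() {
  echo "http://127.0.0.1:${HELM_K3S_LB_HOST_PORT:-8080}"
}
```

For example, `openshell gateway add "$(envoy_endpoint)" --name openshell --local` registers whichever port the cluster was created with.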

For manual tests against an existing cluster, prefer forwarding the Envoy proxy
Service rather than `svc/openshell`. That keeps client traffic on the same path
as a real reverse proxy while gateway pods rotate behind it:

```bash
KUBECONFIG=kubeconfig kubectl get svc -A \
-l gateway.envoyproxy.io/owning-gateway-name=openshell
KUBECONFIG=kubeconfig kubectl -n <envoy-service-namespace> port-forward \
svc/<envoy-service-name> 8080:80
openshell gateway add http://127.0.0.1:8080 --name openshell --local
```

When running e2e tests manually through Envoy, register gateway metadata (as
above) instead of relying only on `OPENSHELL_GATEWAY_ENDPOINT`; some tests call
`openshell gateway info` and expect metadata for the active gateway.

### Kubernetes E2E Notes

Use `mise run e2e:kubernetes` for the standard Helm-backed Kubernetes suite.
The kube e2e wrapper creates only one port-forward, to `svc/openshell`; it no
longer forwards the unauthenticated health listener or runs a `/readyz` e2e
target. `/readyz` remains covered by server unit/integration tests.

Use `mise run e2e:kubernetes:ha-rebalancing` for full-suite HA coverage. The
task creates an external PostgreSQL fixture, installs Envoy Gateway, applies
`deploy/kube/manifests/envoy-gateway-openshell.yaml`, enables the chart
`GRPCRoute`, and runs the full Kubernetes e2e suite, including
`kubernetes_ha_rebalancing`. That coverage validates sandbox create/watch and
exec through the Envoy proxy while gateway replicas scale up, scale down, and
rotate.

If you reuse an existing Skaffold cluster for the full kube suite, make sure the
cluster has the Docker Desktop host-gateway alias configured for host-gateway
tests. The e2e wrapper sets this on chart installs; manual reuse may require:

```bash
KUBECONFIG=kubeconfig helm upgrade openshell deploy/helm/openshell \
--namespace openshell --reuse-values \
--set server.hostGatewayIP=192.168.65.254 \
--wait --timeout 5m
```

### Keycloak OIDC

@@ -226,6 +303,6 @@ mise run helm:k3s:status
| `deploy/helm/openshell/ci/values-spire.yaml` | SPIFFE/SPIRE provider token grant overlay |
| `deploy/helm/openshell/ci/values-spire-stack.yaml` | SPIRE hardened chart values for local dev |
| `deploy/helm/openshell/ci/values-tls-disabled.yaml` | Lint-only: TLS + auth disabled (reverse-proxy edge termination) |
| `deploy/kube/manifests/envoy-gateway-openshell.yaml` | GatewayClass for Envoy Gateway (`mise run helm:gateway:apply`) |
| `deploy/kube/manifests/envoy-gateway-openshell.yaml` | GatewayClass and BackendTrafficPolicy for Envoy Gateway (`mise run helm:gateway:apply`) |
| `tasks/scripts/helm-k3s-local.sh` | k3d cluster create/delete/start/stop/status |
| `tasks/scripts/keycloak-k8s-setup.sh` | Keycloak deploy + realm import |
1 change: 1 addition & 0 deletions .github/workflows/branch-e2e.yml
@@ -123,6 +123,7 @@ jobs:
job-name: Kubernetes HA E2E (Rust smoke)
extra-helm-values: deploy/helm/openshell/ci/values-high-availability.yaml
external-postgres-secret: openshell-ha-pg
use-envoy-gateway: true

core-e2e-result:
name: Core E2E result
18 changes: 18 additions & 0 deletions .github/workflows/e2e-kubernetes-test.yml
@@ -32,6 +32,21 @@ on:
required: false
type: string
default: ""
test-name:
description: "Rust e2e test target to run (sets OPENSHELL_E2E_KUBE_TEST)"
required: false
type: string
default: ""
kubernetes-features:
description: "Cargo feature list for the Kubernetes e2e crate"
required: false
type: string
default: ""
use-envoy-gateway:
description: "Install Envoy Gateway and run the e2e command through the chart GRPCRoute"
required: false
type: boolean
default: false
mise-version:
description: "mise version to install on the bare Kubernetes e2e runner"
required: false
@@ -117,6 +132,9 @@ jobs:
OPENSHELL_E2E_KUBE_CONTEXT: kind-${{ env.KIND_CLUSTER_NAME }}
OPENSHELL_E2E_KUBE_EXTRA_VALUES: ${{ inputs.extra-helm-values }}
OPENSHELL_E2E_KUBE_EXTERNAL_POSTGRES_SECRET: ${{ inputs.external-postgres-secret }}
OPENSHELL_E2E_KUBE_TEST: ${{ inputs.test-name }}
OPENSHELL_E2E_KUBERNETES_FEATURES: ${{ inputs.kubernetes-features }}
OPENSHELL_E2E_KUBE_USE_ENVOY: ${{ inputs.use-envoy-gateway }}
IMAGE_TAG: ${{ inputs.image-tag }}
OPENSHELL_REGISTRY: ghcr.io/nvidia/openshell
run: mise run --no-deps --skip-deps e2e:kubernetes
30 changes: 30 additions & 0 deletions architecture/gateway.md
@@ -91,6 +91,36 @@ authenticated sandbox ID with any sandbox ID or name resolved from the request.
Supervisor control and relay streams require a matching sandbox principal before
the gateway registers the session or bridges relay bytes.

## HA Supervisor Ownership

In multi-replica Kubernetes deployments, every gateway pod can accept client
RPCs, but a sandbox supervisor maintains one active stream to one gateway
replica at a time. The connected replica publishes a short-lived supervisor
owner record in the shared Postgres object store with its replica id, peer DNS
endpoint, supervisor instance id, and connection epoch. Heartbeats renew the
record, and reconnects from the same supervisor instance with a newer epoch can
supersede the previous owner before the TTL expires.

Session-bound operations such as exec, TCP forwarding, file sync, and sandbox
service routing first check the local session registry. If the supervisor is
owned by another gateway replica, the serving gateway opens an internal
`PeerRelay` stream to that owner and asks it to open the supervisor relay. This
keeps client traffic working when a Kubernetes Service routes the client to a
non-owner gateway pod. If a peer owner is stale or unreachable during a rollout,
the serving gateway retries ownership lookup until the normal relay wait
deadline.

File upload and download use tar-over-SSH through the same relay path. A gateway
pod termination drops the active SSH proxy byte stream, so the CLI retries the
whole sync operation with a fresh SSH session instead of attempting mid-stream
resume.

Gateway peer RPCs authenticate with Kubernetes ServiceAccount identity rather
than a shared secret. Helm mounts a projected, pod-bound token with audience
`openshell-gateway-peer`; the receiving gateway validates it through
TokenReview, checks the live pod UID and chart selector labels, and authorizes
only the internal peer relay method.

## API Surface

The gateway API is organized around platform objects and operational streams: