feat(server): support TLS certificate hot-reload #1870

Open: lunarwhite wants to merge 3 commits into NVIDIA:main from lunarwhite:cm-renew

Summary

Add polling-based TLS certificate hot-reload to the gateway, allowing cert/key/CA rotation without restarting the server. Uses ArcSwap for atomic config swapping so in-flight TLS handshakes are never blocked.

Related Issue

Fixes #1836

Changes

  • Add reload_interval_secs config field to [openshell.gateway.tls] (default 0 = disabled)
  • Replace TlsAcceptor internals with ArcSwap<ServerConfig> for lock-free atomic swaps
  • Add reload() for on-demand cert refresh and spawn_reload_worker() for periodic polling
  • Add early private key type validation to surface bad key types at startup instead of handshake time
  • Expose reloadIntervalSecs in the Helm chart values and gateway-config template
  • Extract shared TLS test utilities (generate_test_certs_with_ca, install_rustls_provider, write_test_file) into tls_test_utils.rs

Testing

  • mise run pre-commit passes
  • Unit tests added/updated: 5 new tests covering reload, concurrent handshake+reload, cert rotation, worker shutdown, mTLS CA rotation
  • E2E tests added/updated (if applicable)
  • E2E tests executed manually in local k3s cluster

E2E test record

Step 1: Create cluster

mise run helm:k3s:create

Step 2: Enable TLS + reload

Create deploy/helm/openshell/ci/values-reload-test.yaml:

server:
  disableTls: false
  tls:
    reloadIntervalSecs: 10

Add to deploy/helm/openshell/skaffold.yaml after - ci/values-skaffold.yaml:

          - ci/values-reload-test.yaml

Step 3: Deploy

mise run helm:skaffold:run

Step 4: Verify reload worker started

KUBECONFIG=kubeconfig kubectl -n openshell logs openshell-0 | grep -i reload

Output:

INFO openshell_server::tls: TLS certificate reload worker started interval_seconds=10

Step 5: Capture initial cert fingerprint

KUBECONFIG=kubeconfig kubectl -n openshell port-forward pod/openshell-0 18443:8080 &>/dev/null &
sleep 3
echo "" | openssl s_client -connect 127.0.0.1:18443 -servername openshell 2>/dev/null \
  | openssl x509 -fingerprint -sha256 -noout
kill %1 2>/dev/null

Output:

sha256 Fingerprint=A0:21:<...>:65

Step 6: Overwrite TLS secret (simulate cert-manager renewal)

openssl req -x509 -newkey rsa:2048 -keyout /tmp/new-tls.key -out /tmp/new-tls.crt \
  -days 1 -nodes -subj "/CN=openshell"

KUBECONFIG=kubeconfig kubectl -n openshell create secret tls openshell-server-tls \
  --cert=/tmp/new-tls.crt --key=/tmp/new-tls.key \
  --dry-run=client -o yaml | KUBECONFIG=kubeconfig kubectl apply -f -

Output:

secret/openshell-server-tls configured

Step 7: Wait for kubelet sync + reload ticks

sleep 60
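
As an alternative to a fixed sleep, the wait could poll until the served fingerprint actually changes. The helper below is a minimal sketch, not part of this PR: `wait_for_change` is a hypothetical name, and the probe command passed to it is assumed to print the current fingerprint (for example, a wrapper around the Step 5 commands).

```shell
# Hypothetical helper (not part of the PR): re-run a probe command until its
# output differs from a known old value, or give up after a timeout.
# Usage: wait_for_change OLD_VALUE TIMEOUT_SECS INTERVAL_SECS COMMAND [ARGS...]
wait_for_change() {
  local old timeout interval elapsed current
  old="$1"; timeout="$2"; interval="$3"
  shift 3
  elapsed=0
  while :; do
    # A failed probe yields an empty string and is simply retried.
    current="$("$@" 2>/dev/null || true)"
    if [ -n "$current" ] && [ "$current" != "$old" ]; then
      printf '%s\n' "$current"
      return 0
    fi
    # Stop once the (approximate) time budget is used up.
    if [ "$elapsed" -ge "$timeout" ]; then
      return 1
    fi
    sleep "$interval"
    elapsed=$(( elapsed + interval ))
  done
}
```

A call such as `wait_for_change "$old_fp" 120 5 my_fingerprint_probe` would return as soon as the new cert is served and fail after roughly 120 seconds otherwise; `old_fp` and `my_fingerprint_probe` are placeholders for illustration.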

Step 8: Verify new cert is served — no pod restart

8a. Pod restarts

KUBECONFIG=kubeconfig kubectl -n openshell get pods

Output:

NAME          READY   STATUS    RESTARTS   AGE
openshell-0   1/1     Running   0          2m53s

8b. Current cert from gateway

KUBECONFIG=kubeconfig kubectl -n openshell port-forward pod/openshell-0 28443:8080 &>/dev/null &
sleep 3
echo "" | openssl s_client -connect 127.0.0.1:28443 -servername openshell 2>/dev/null \
  | openssl x509 -fingerprint -sha256 -noout
kill %1 2>/dev/null

Output:

sha256 Fingerprint=41:90:<...>:1B

8c. Cert stored in Secret (should match gateway)

KUBECONFIG=kubeconfig kubectl -n openshell get secret openshell-server-tls \
  -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -fingerprint -sha256 -noout

Output:

sha256 Fingerprint=41:90:<...>:1B
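
The 8b/8c comparison was done by eye here. If it were scripted, a small normalizing compare might look like the sketch below. Both function names are hypothetical, and the normalization (dropping the `... Fingerprint=` prefix and upper-casing the hex) is an assumption meant to tolerate formatting differences between openssl builds.

```shell
# Hypothetical helpers (not part of the PR).
# Drop everything up to and including "=" and upper-case the hex digits,
# so only the fingerprint bytes themselves are compared.
normalize_fingerprint() {
  printf '%s\n' "$1" | sed 's/^.*=//' | tr 'a-f' 'A-F'
}

# Succeeds (exit status 0) when both fingerprints refer to the same bytes.
fingerprints_match() {
  [ "$(normalize_fingerprint "$1")" = "$(normalize_fingerprint "$2")" ]
}
```

With the gateway and Secret outputs captured in two variables, `fingerprints_match "$gateway_fp" "$secret_fp"` would give the pass/fail result for the "Secret matches gateway?" check.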

Results

Checkpoint               Fingerprint            Source
Before renewal           A0:21:<...>:65         PKI init job
After renewal            41:90:<...>:1B         Our replacement
Secret matches gateway?  ✅ Yes
Pod restarts             0
Reload worker active?    interval_seconds=10

The gateway detected the updated certificate on disk and atomically swapped the active TLS config without a pod restart. The reload worker re-reads the cert/key/CA files every reloadIntervalSecs seconds (10 in this run), so the new cert was served within the kubelet sync window plus one reload tick.
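
The "kubelet sync window plus one reload tick" bound can be written out as simple arithmetic. This is a rough sketch under assumptions: the function name is hypothetical, the 60-second kubelet sync period is an assumed cluster setting, and the Secret cache propagation delay is taken as 0. None of these values were measured in this PR.

```shell
# Rough upper bound on rotation latency, in seconds (illustrative only).
# Arguments: kubelet sync period, Secret cache propagation delay,
# and the gateway's reload interval (reloadIntervalSecs).
worst_case_reload_delay() {
  local kubelet_sync cache_delay reload_interval
  kubelet_sync="$1"
  cache_delay="$2"
  reload_interval="$3"
  echo $(( kubelet_sync + cache_delay + reload_interval ))
}

# Assumed inputs: 60s kubelet sync, no cache delay, 10s reload interval.
worst_case_reload_delay 60 0 10   # prints 70
```

Under these assumptions the worst case is about 70 seconds, so a fixed 60-second wait could occasionally be slightly too short; polling for the change avoids depending on the exact bound.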

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)
  • docs/reference/gateway-config.mdx updated

Signed-off-by: Yuedong Wu <dwcn22@outlook.com>
lunarwhite requested review from a team, derekwaynecarr and mrunalp as code owners on June 11, 2026 12:14

copy-pr-bot Bot commented Jun 11, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

github-actions Bot commented Jun 11, 2026

All contributors have signed the DCO ✍️ ✅
Posted by the DCO Assistant Lite bot.

@lunarwhite (Author)

I have read the DCO document and I hereby sign the DCO.

@lunarwhite (Author)

recheck

Development

Successfully merging this pull request may close these issues.

bug: gateway does not reload TLS certificate after cert-manager renewal
