Skip to content

refactor(helm): require external postgres for ha#1844

Merged
TaylorMutch merged 3 commits into
mainfrom
stablize-helm-testing/tmutch
Jun 9, 2026
Merged

refactor(helm): require external postgres for ha#1844
TaylorMutch merged 3 commits into
mainfrom
stablize-helm-testing/tmutch

Conversation

@TaylorMutch

Copy link
Copy Markdown
Collaborator

Summary

Remove the bundled Bitnami PostgreSQL subchart and make external PostgreSQL the only Helm-backed HA database path. CI now creates a simple in-cluster PostgreSQL fixture for HA Kubernetes e2e and points OpenShell at the generated Secret.

Related Issue

N/A

Changes

  • Remove the Helm chart dependency, Chart.lock, postgres values, and Bitnami service-binding template logic.
  • Add a static PostgreSQL fixture manifest for Kubernetes/OpenShift e2e.
  • Wire the Kubernetes HA workflow to create an external DB Secret before installing OpenShell.
  • Update Helm docs, Kubernetes setup docs, architecture notes, and agent guidance.

Testing

  • mise run pre-commit passes
  • mise run helm:docs:check passes
  • mise run helm:lint passes
  • mise run helm:test passes
  • bash -n e2e/with-kube-gateway.sh passes
  • bash -n e2e/rust/e2e-openshift.sh passes
  • YAML parse check for e2e/kubernetes/postgres-fixture.yaml passes
  • Kubernetes e2e label run: expected to validate Helm install, even if Rust e2e later fails

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
@TaylorMutch TaylorMutch requested review from a team, derekwaynecarr and mrunalp as code owners June 9, 2026 19:36
@TaylorMutch TaylorMutch added the test:e2e-kubernetes Requires Kubernetes end-to-end coverage label Jun 9, 2026
@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

Label test:e2e-kubernetes applied for 922afad. Open Branch E2E Checks, find the run for commit 922afad, and click Re-run all jobs to execute with the label set. The run will execute Kubernetes HA E2E after building the required gateway and supervisor images once. This is an optional proof-of-life suite; failures are visible in the workflow run but do not publish a required CI gate status.

@github-actions

github-actions Bot commented Jun 9, 2026

Copy link
Copy Markdown

@johntmyers johntmyers added the gator:in-review Gator is reviewing or awaiting PR review feedback label Jun 9, 2026
@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

PR Review Status

Validation: project-valid by maintainer/admin author auto-validation. This PR is concentrated Helm/Kubernetes CI and docs work that removes the bundled Bitnami PostgreSQL chart and requires an external PostgreSQL Secret for Helm-backed HA database use.
Head SHA: 922afad47255eace18724f4398ddf4dd09d33d4b

Review findings:

  • Blocking: existing upgrade values with postgres.enabled: true can now be silently ignored by Helm, causing installs to fall back to SQLite unless server.externalDbSecret is set. Please fail fast when removed postgres.enabled is present, and add Helm unit coverage. Also fail replicaCount > 1 without server.externalDbSecret so HA does not run on per-pod SQLite.
  • Blocking: the documented/CI HA shape sets replicaCount: 2, but the current Kubernetes HA E2E run fails after install with supervisor session not connected. The external PostgreSQL fixture and Helm install succeeded; the failure is consistent with sandbox supervisor callbacks going through the shared Service and reaching a different gateway process than the one holding the in-memory session. Please either implement/validate HA-safe callback routing/session state, or scope this PR to external PostgreSQL support without advertising/running two gateway replicas.
  • Warning: e2e/kubernetes/postgres-fixture.yaml uses mutable image tag mirror.gcr.io/library/postgres:17-alpine; please pin by digest or make the checked CI default immutable/overridable.

Docs: Kubernetes Helm docs were updated in the existing Fern page; no navigation change appears necessary.

Checks: required Branch Checks and Helm Lint are green. test:e2e-kubernetes is applied, and Kubernetes HA E2E failed on head 922afad as noted above.

Next state: gator:in-review

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
@TaylorMutch

Copy link
Copy Markdown
Collaborator Author

Blocking: existing upgrade values with postgres.enabled: true can now be silently ignored by Helm, causing installs to fall back to SQLite unless server.externalDbSecret is set. Please fail fast when removed postgres.enabled is present, and add Helm unit coverage. Also fail replicaCount > 1 without server.externalDbSecret so HA does not run on per-pod SQLite.

Pushed a change to address that.

Blocking: the documented/CI HA shape sets replicaCount: 2, but the current Kubernetes HA E2E run fails after install with supervisor session not connected. The external PostgreSQL fixture and Helm install succeeded; the failure is consistent with sandbox supervisor callbacks going through the shared Service and reaching a different gateway process than the one holding the in-memory session. Please either implement/validate HA-safe callback routing/session state, or scope this PR to external PostgreSQL support without advertising/running two gateway replicas.

Its a known issue that HA is failing, so this PR will not resolve that issue. This is existing state, so not including that in scope to be fixed.

@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

Re-check After Author Update

I re-evaluated latest head e08405b24ee8285990ea0e8b51e46f6bcd414164 after TaylorMutch's 2026-06-09 21:13 UTC commit fix(helm): validate external database requirements.

Disposition: partially resolved.

Resolved items:

  • The stale postgres.enabled upgrade-values path now fails fast instead of silently falling back to SQLite, with Helm unit coverage.
  • replicaCount > 1 without server.externalDbSecret now fails validation, with Helm unit coverage.

Remaining items:

  • Blocking: the PR still sets/documents/runs the HA shape with replicaCount: 2, but the previously observed Kubernetes HA E2E failure (supervisor session not connected) is not addressed by the latest commit. Please either implement and validate HA-safe supervisor callback routing/session state, or scope this PR to external PostgreSQL support without advertising/running two gateway replicas.
  • Warning: e2e/kubernetes/postgres-fixture.yaml still uses mutable image tag mirror.gcr.io/library/postgres:17-alpine; please pin by digest or make the checked CI default immutable/overridable.

Checks: required Branch Checks and Helm Lint are green on this head. The Kubernetes E2E/image-build path has jobs still in progress, but this PR should remain in review until the blocking HA feedback is resolved or waived by a maintainer.

Next state: gator:in-review

@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

Re-check After Author Scope Comment

I saw TaylorMutch's 2026-06-09 21:27 UTC comment that the HA callback/session failure is a known existing issue and is not intended to be fixed in this PR.

Disposition: not resolved as currently scoped.

The concern is not that this PR must fix all existing HA routing behavior. The blocker is that this PR still sets, documents, and exercises the two-gateway-replica HA shape as part of the external PostgreSQL path, while the known supervisor session not connected failure means that shape is not validated as working. A maintainer can explicitly waive this and accept the existing-risk scope, or the PR can remove the replicaCount: 2 HA claim from CI/docs for now and keep the change focused on external PostgreSQL support.

Remaining items:

  • Blocking: resolve/validate the two-replica HA callback/session behavior, obtain an explicit maintainer waiver, or scope this PR away from advertising/running two gateway replicas.
  • Warning: pin or make immutable/overridable the PostgreSQL fixture image used by checked CI.

Next state: gator:in-review

@TaylorMutch

Copy link
Copy Markdown
Collaborator Author

Warning: pin or make immutable/overridable the PostgreSQL fixture image used by checked CI.

The current image seems acceptable to me.

@johntmyers

Copy link
Copy Markdown
Collaborator

gator-agent

Re-check After Author Fixture Image Comment

I re-evaluated latest head e08405b24ee8285990ea0e8b51e46f6bcd414164 after TaylorMutch's 2026-06-09 21:28 UTC comment that the current PostgreSQL fixture image seems acceptable.

Disposition: partially resolved.

The fixture image item remains a supply-chain/reproducibility warning rather than a blocking finding. Pinning mirror.gcr.io/library/postgres:17-alpine by digest or making the CI fixture image explicitly overridable would still be better, but this warning does not by itself block progress.

Remaining items:

  • Blocking: the PR still sets, documents, and runs the two-gateway-replica HA shape while the known supervisor session not connected failure is unresolved. Please resolve/validate HA-safe callback routing/session behavior, obtain an explicit maintainer waiver for shipping this known-risk scope, or scope the PR away from advertising/running two gateway replicas.
  • Warning: the PostgreSQL fixture image remains mutable; acceptable to defer only if maintainers are comfortable with that CI reproducibility risk.

Checks: required Branch Checks and Helm Lint are green on this head. test:e2e-kubernetes is applied and Kubernetes HA E2E is still pending.

Next state: gator:in-review

Signed-off-by: Taylor Mutch <taylormutch@gmail.com>
@TaylorMutch

Copy link
Copy Markdown
Collaborator Author

I think I've addressed both points now

@TaylorMutch TaylorMutch merged commit c4ca283 into main Jun 9, 2026
36 of 38 checks passed
@TaylorMutch TaylorMutch deleted the stablize-helm-testing/tmutch branch June 9, 2026 23:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gator:in-review Gator is reviewing or awaiting PR review feedback test:e2e-kubernetes Requires Kubernetes end-to-end coverage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants